Value-Directed Human Behavior Analysis from Video Using Partially Observable Markov Decision Processes

Jesse Hoey and James J. Little, Member, IEEE

Abstract—This paper presents a method for learning decision theoretic models of human behaviors from video data. Our system learns relationships between the movements of a person, the context in which they are acting, and a utility function. This learning makes explicit that the meaning of a behavior to an observer is contained in its relationship to actions and outcomes. An agent wishing to capitalize on these relationships must learn to distinguish the behaviors according to how they help the agent to maximize utility. The model we use is a partially observable Markov decision process, or POMDP. The video observations are integrated into the POMDP using a dynamic Bayesian network that creates spatial and temporal abstractions amenable to decision making at the high level. The parameters of the model are learned from training data using an a posteriori constrained optimization technique based on the expectation-maximization algorithm. The system automatically discovers classes of behaviors and determines which are important for choosing actions that optimize over the utility of possible outcomes. This type of learning obviates the need for labeled data from expert knowledge about which behaviors are significant and removes bias about what behaviors may be useful to recognize in a particular situation. We show results in three interactions: a single player imitation game, a gestural robotic control problem, and a card game played by two people.

Index Terms—Face and gesture recognition, video analysis, motion, statistical models, clustering algorithms, machine learning, parameter learning, control theory, dynamic programming.


1 INTRODUCTION

This paper describes a model of human behaviors that unifies computer vision and decision theory through a framework of probabilistic modeling. The motivation is that computational agents will need capabilities for learning, recognizing, and using the extensive range of human nonverbal communication skills. The perceiver of a nonverbal signal must not only recognize the signal, but must understand what it is useful for. The signal's usefulness will be defined by its relationship to both signaler and receiver, their actions, their possible futures together, and the individual ways they assign value to these futures. This paper describes a method for the automatic learning and analysis of purposeful, context-dependent, human nonverbal behavior. No prior knowledge about the structure of behaviors or the number of behaviors is necessary. The method learns which behaviors (and how many) are conducive to achieving value in the context. An important aspect of our work is the explicit modeling of uncertainty in both the observations and in the temporal dynamics of the system. Taking this uncertainty into account allows the system to better optimize over possible outcomes based on noisy visual data. This paper will focus on human nonverbal communicative behaviors, including facial expressions and hand gestures. We will use the term display to refer to face or hand behaviors.

There has been a growing body of work in the past decade on the communicative function of the face [24], [46] and of the hands [38]. This psychological research has drawn three major conclusions. First, displays are often purposeful communicative signals [24]. Second, the purpose is not defined by the display alone, but also depends on the context in which the display was emitted [46]. Third, the signals are not universal, but vary between individuals in their physical appearance, their contextual relationships, and their purpose [46]. We believe that these three considerations should be used as critical constraints in the design of computational communicative agents able to learn, recognize, and use human behavior. First, context dependence implies that the agent must model the relationships between the displays and the context. Second, the agent must be able to compute the utility of taking actions in situations involving purposeful displays. Third, the agent needs to adapt to new partners and new situations.

These constraints can be integrated into a decision-theoretic, vision-based model of human nonverbal signals. The model can be used to predict human behavior or to choose actions that maximize expected utility. The basis for this model is a partially observable Markov decision process, or POMDP [1]. A POMDP describes the effects of an agent's actions upon its environment, the utility of states in the environment, and the relationship between the observations, the actions, and the states. A POMDP model allows an agent to predict the long term effects of its actions upon its environment and to choose valuable actions based on these predictions. The model can be acquired from data and can be used for decision making based, in part, on the nonverbal behavior of a human through observation.


J. Hoey is with the School of Computing, University of Dundee, Queen Mother Building, Dundee, Scotland, DD1 4HN. E-mail: [email protected].

J.J. Little is with the Department of Computer Science, University of British Columbia, ICCS 117, 2366 Main Mall, Vancouver, B.C., Canada, V6T 1Z4. E-mail: [email protected].

Manuscript received 15 July 2004; revised 22 June 2005; accepted 29 Sept. 2006; published online 18 Jan. 2007. Recommended for acceptance by I.A. Essa. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0358-0704. Digital Object Identifier no. 10.1109/TPAMI.2007.1020.



Our work is distinguished from other work on recognizing human nonverbal behavior primarily because it automatically learns what behaviors are distinguishable in the data and useful in the task. We do not train classifiers for predefined behaviors and then base decisions upon the classifier outputs. Instead, the training process discovers categories of behaviors in the data. This method is general in that it removes the need to reengineer a model for each task. This is especially important in the design of human-interactive systems since the human behaviors that will be observed are typically poorly known at the time of model specification. Instead, our method does not require expert knowledge about which displays are important nor extensive human labeling of data. Expert data labeling is not only time consuming, but also unnecessarily constrains the resulting models to the types of behaviors believed to be important by the experts.

In contrast, the Facial Action Coding System (FACS) has become the standard for psychological inquiries into facial expression. Computer vision researchers have made significant progress toward automatic FACS coding [2], [21], [55]. However, although the recognition of facial action units may give the ability to discriminate between very subtle differences in facial motion, or "microactions," it requires extensive training and domain specific knowledge [2]. Further, the importance of such a detailed level of representation is not clear for computer vision systems that intend to take actions based on observations of humans [42] and much simpler representations will be sufficient in many tasks. Our method takes a more task-oriented approach, automatically finding and modeling behaviors that are sufficient for performance.

There will be little discussion of speech recognition and natural language understanding in this paper. Our work focuses solely on recognizing and using nonverbal communicative acts. However, it is well known that gesture and facial expression are intimately tied to speech [14], [38] and one might object to our omission. Nevertheless, our models are theoretically well grounded and models of the same type have been used extensively in speech recognition [45] and dialogue management [41], [57]. Therefore, we believe that our models are ideally suited for integration with dialogue modeling work at some time in the future.

This paper is organized as follows: We first review previous work in modeling human behaviors from both labeled and unlabeled data. Section 3 then describes a general POMDP model of human behavior modeling within a task and Section 4 goes into more detail of a specific POMDP model. Finally, Section 5 discusses the results of learning the model in three situations.

2 PREVIOUS WORK

Learning and solving POMDPs is a well-studied problem in the artificial intelligence literature. One of the main areas of research is into learning, solving, and using a POMDP simultaneously, online, using reinforcement learning [52]. Although these approaches carry optimality guarantees, they do not scale well to real-world domains. On the other hand, recent developments in efficient approximate solution techniques for specific POMDPs (where the model is known) have shown promise for solving large, realistic problems [9], [29], [50]. However, these methods do not approach the problem of learning the models and typically assume that sensing yields simple observations that are easily quantified.

This is a problem in domains that require interactions with humans as the inputs to the decision maker will be very complex. For example, if human facial expressions need to be distinguished for performance in some task, then the observations will be entire sequences of video. Therefore, the decision making process needs to be coupled with some hierarchical modeling that takes care of spatio-temporally abstracting these complex observations. This problem is well studied in computer vision [7], [11], [12], [40], [51], where the goal is to train a classifier to recognize some predefined set of behaviors using supervised learning. Our work instead addresses the coupling between the behavior models and the decision making process: We automatically learn which behaviors are important to recognize.

Representation of human behavior in video is usually done by first estimating some quantity of interest at the pixel level and then spatially abstracting this to a low-dimensional feature vector. Optical flow [6], [20], [21], color blobs [12], [17], deformable models [35], motion energy [7], and filtered images [2] are the more well-used pixel-level features. Spatial abstraction is usually approached using either a model-based or a view-based representation of a body part. Model-based approaches are often three-dimensional wire-frame models [2], sometimes including musculature [21]. View-based approaches spatially abstract video frames by projecting them to a low-dimensional subspace, using, for example, principal components [6], [23], [56], Fisher linear discriminants [3], or independent component analysis [2]. Other representations of faces and bodies use templates [7], feature points [35], or "blobs" [12], [51].

Our work uses Zernike basis functions [44] for holistic representation of the face and facial motion. The Zernike polynomial basis provides a rich and data independent description of optical flow fields and gray-scale images. When applied to optical flow, the Zernike basis can be seen as an extension of the standard affine basis [27]. The Zernike representation differs from approaches such as Eigen-analysis [56], or facial action unit recognition [55] in that it makes no commitment to a particular type of motion, leading to a transportable classification system (e.g., usable for clustering different types of behaviors), which is necessary for a general modeling technique such as we are pursuing. Zernike polynomials have been used for recognizing hand poses [32], handwriting and silhouettes [53], and optical flow fields [28]. It is important to note, however, that our learning method does not rely on this particular choice of basis function and others could be investigated within the same method.

Once features are computed for each frame, their temporal progression must be modeled. Spatio-temporal templates [21], dynamic time warping [18], and hidden Markov models [51] are popular approaches. HMMs have been applied to many recognition problems in computer vision, such as hand gestures [47], American Sign Language [51], and facial action units [2].

There are many DBN extensions of HMMs, including the coupled hidden Markov model [11]. Hierarchical models [22] are particularly interesting as they incorporate temporal abstractions and have been used for modeling full body motions [12]. However, learning a hierarchy is a notoriously difficult problem. Seer [40] implements a real-time hierarchical model using a layered HMM, where each layer represents events at a different temporal granularity. However, in Seer, each layer is trained independently from labeled data. The DBN underlying a POMDP is known as the input-output hidden Markov model [5]. POMDPs have been used in realistic domains, including robot control [54] and spoken dialogue management [57]. Darrell and Pentland used POMDPs for control of an active camera [17]. Their POMDP model was trained to foveate regions containing information of interest, such as hands during gesturing. However, their work is focused on computing policies in a reinforcement learning setting. They do not learn the number of behaviors and they separate visual recognition from decision making.

Most of the methods we have been describing are trained with labeled data, which requires human intervention and makes adaptivity more difficult. The alternative is to develop systems that can discover categories of motions. In particular, clustering sequences of data using mixtures of hidden Markov models was proposed by Smyth [49] and has been used in computer vision for unsupervised clustering of data [16], [36]. Darrell et al. [18] examine the same kind of models, but use dynamic time warping (DTW). Brand et al. have shown how to discover patterns of motion in an office environment using a hidden Markov model [10], but without hierarchical structure. These works do not explicitly model actions and utilities. Work in human-computer interaction (HCI) has made steps in the direction of learning user behaviors as they relate to system actions. Although most HCI systems only make use of human interface actions, such as mouse or keyboard actions, some have begun to integrate visual and auditory information [42].

Jebara and Pentland [33] presented action-reaction learning, in which a dynamic model was learned from observing video of two people interacting. The model was then used in a reactive way to simulate interactions. Our work generalizes action-reaction learning by adding high-level context, actions, and utilities. Action-reaction learning is designed for imitation-type tasks, while our model is applicable to interactions in which plans need to be developed autonomously.

3 LEARNING AND SOLVING POMDPS

A partially observable Markov decision process (POMDP) is a probabilistic temporal model of an agent interacting with the environment. POMDPs can be learned from data, and, once specified, can be used to compute policies of action that maximize some notion of utility. In this section, we first give an overview of POMDPs in Section 3.1, followed by a discussion in Section 3.2 of how to learn the parameters of a POMDP from data with the expectation-maximization algorithm. We show in Section 3.3 how to approximately solve the POMDP by solving the associated MDP model. The solution to the POMDP gives indications about the structure of the model and about how useful this structure is for achieving value.

In this paper, we use standard notation for variables, where capital letters denote random variables, while small letters denote an instantiation of that variable. Subscripts on small letters denote a particular value of a variable. Bold-faced letters represent sets of variables, usually referring to sets that extend over time (e.g., sequences of data). Thus, $\mathbf{X} = \{X_1, \ldots, X_N\}$ is a set of $N$ random variables, $\mathbf{x} = \{x_1, \ldots, x_N\}$ is an assignment of values to those variables, and $x_i$ is an assignment to $X_i$. We will also write $x_{i,j}$ as the particular assignment $X_i = j$.

3.1 Overview

A POMDP is a tuple $\langle S, A, \Theta_S, R, O, \Theta_O \rangle$, where $S$ is a finite set of (possibly unobservable) states of the environment, $A$ is a finite set of agent actions, $\Theta_S : S \times A \rightarrow S$ is a transition function that describes the effects of agent actions upon the world states, $R : S \times A \rightarrow \mathbb{R}$ is a reward function that gives the expected reward for taking action $A$ in state $S$, $O$ is a set of observations, and $\Theta_O : S \times A \rightarrow O$ is an observation function that gives the probability of observations in each state-action pair. Fig. 1a shows two time slices of a POMDP as a dynamic Bayesian network (DBN). Shaded nodes denote observables, unshaded nodes denote unobservable variables. Parameter random variables are denoted by $\Theta$, prior hyperparameters by $\alpha$, and the diamond is the reward.
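For a small, fully discrete problem, the tuple can be held concretely as a set of arrays. The following minimal sketch is illustrative only (the class name, array layout, and the two-state example are our own, not the paper's code); it also shows the standard belief update a POMDP decision maker would perform on receiving an observation.

```python
# Minimal sketch (not from the paper) of the tuple <S, A, Theta_S, R, O, Theta_O>
# stored as dense arrays, plus the usual Bayes-filter belief update.
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularPOMDP:
    n_states: int        # |S|
    n_actions: int       # |A|
    n_obs: int           # |O|
    trans: np.ndarray    # trans[a, s, s'] = P(s' | s, a)
    reward: np.ndarray   # reward[s, a]
    obs: np.ndarray      # obs[a, s', o] = P(o | s', a)

    def belief_update(self, belief, action, observation):
        """b'(s') proportional to P(o | s', a) * sum_s P(s' | s, a) b(s)."""
        predicted = self.trans[action].T @ belief
        updated = self.obs[action, :, observation] * predicted
        return updated / updated.sum()

# Tiny two-state example with made-up numbers.
pomdp = TabularPOMDP(
    n_states=2, n_actions=1, n_obs=2,
    trans=np.array([[[0.9, 0.1], [0.2, 0.8]]]),
    reward=np.array([[0.0], [1.0]]),
    obs=np.array([[[0.8, 0.2], [0.3, 0.7]]]),
)
print(pomdp.belief_update(np.array([0.5, 0.5]), action=0, observation=1))
```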

When POMDPs are used by a decision maker interacting with another agent (possibly a human), then the state, $S$, includes some descriptions of the behaviors of this other agent. That is, we factor the state, $S$, into two parts, $S = \{C, D\}$, as shown in Fig. 1b.1 Now, $D$ is a high-level description of the observed behaviors of other agent(s), while $C$ contains the remainder of the state, including observable action(s) of the other agent(s). In order to focus on learning models of behaviors, we assume that only $D$ in this model is unobservable directly, while $C$ and $A$ are always fully observed. The transition function is also factorized in two parts: $\Theta_C = P(C_t \,|\, C_{t-1}, D_{t-1}, A_t)$ gives the state dynamics given the observed behaviors and the action taken, while $\Theta_D = P(D_t \,|\, D_{t-1}, C_t, A_t)$ gives the expected behaviors given the state and action. We assume that $C$ and $D$ are discrete random variables, so the associated parameters $\Theta_C$ and $\Theta_D$ are multinomial distributions. We denote the complete set of parameters $\Theta = \{\Theta_C, \Theta_D, \Theta_O\}$. We also include fixed prior hyperparameters in Fig. 1: $\alpha_C, \alpha_D$ are Dirichlet prior parameters, while $\alpha_O$ is a more complex prior explored in detail in Section 4.

The observations at time step $t$, $O_t$, are a sequence of $T_t$ observations (e.g., video frames), $o_1 \ldots o_{T_t}$. In domains with human behaviors as observations, the rate at which decisions are made at the high level is slower than the rate of observations (the video frame rate) and, therefore, $T_t \gg 1$. This difference in temporal scales between decision making and observations means that the observation function must be capable of spatio-temporally abstracting a video stream into a set of high level descriptors of behaviors, $D$. That is, it must be capable of computing $P(O_t \,|\, D_t)$ and of computing the gradient of this function. We describe one particular such function in Section 4. We assume throughout this paper that the boundaries of the observation sequences will be given by the changes in the fully observable context state, $C$ and $A$. There are other approaches to the temporal segmentation problem, ranging from the complete Bayesian solution in which the segmentation is parameterized [22] to specification of a fixed segmentation time [40].

1. Factored representations write the state space implicitly as the cross product of a set of multinomial, discrete variables, and allow conditional independencies in the transition function, $T$, to be exploited by solution techniques [9].

Fig. 1. (a) Two time slices of general POMDP as a dynamic Bayesian network (DBN). (b) The same POMDP with a factored state $S = \{D, C\}$, for display understanding, where $D$ is the unobserved behavior descriptor and $C$ is the remainder of the state, assumed to be fully observable in this paper.

A complete set of POMDP parameters, $\Theta$, allows us to compute the likelihood of a sequence of data, $\mathbf{o} = \{\mathbf{o}, \mathbf{c}, \mathbf{a}\}$, $P(\mathbf{o}, \mathbf{c}, \mathbf{a} \,|\, \Theta)$, by summing over the unobserved values of $\mathbf{d}$ using the standard forward algorithm for input-output hidden Markov models [5], [39]. Denoting $T$ as the number of POMDP transitions in the sequence, then $\mathbf{a} = a_1, \ldots, a_T$ and $\mathbf{c} = c_1, \ldots, c_T$ are the inputs, and $\mathbf{o} = o_1, \ldots, o_T$ (where each $o_t$ is a sequence of video frames) are the outputs.

3.2 Learning POMDP Parameters

This section describes how to learn the parameters of a POMDP, which can be formulated as a constrained optimization of the likelihood of the observations, given the model, over the (constrained) model parameters. We will show how the expectation-maximization, or EM, algorithm can be used to find a locally optimal solution [19]. Learning takes place over the entire model simultaneously: Both the output distributions, $\Theta_O$, and the high-level POMDP transition functions, $\Theta_C, \Theta_D$, are learned from data during the process. Since the behaviors, $D$, are unobservable, this means that the learning will perform both clustering of the observation sequences into a set of behavior descriptors, $D$, and learning of the relationship between these behavior descriptors, the observable context, $C$, and the action, $A$. In this section, we discuss the general learning method. Section 4.4 shows how to learn the parameters of a hierarchical observation function.

Learning the POMDP parameters means finding the set of parameters, $\Theta = \Theta^*$, that maximizes the posterior density of all observations and the model, $P(\mathbf{o}, \mathbf{c}, \mathbf{a}, \Theta)$, subject to constraints on the parameters. The EM algorithm starts from an estimate of the parameter values, $\Theta'$, and computes

$$\Theta^* = \arg\max_{\Theta} \left[ \sum_{D=d} P(d \,|\, \mathbf{o}, \mathbf{c}, \mathbf{a}, \Theta') \log P(d, \mathbf{o}, \mathbf{c}, \mathbf{a} \,|\, \Theta) + \log P(\Theta) \right].$$

The "E" step of the EM algorithm is to compute the expectation over the hidden state, $P(d \,|\, \mathbf{o}, \mathbf{c}, \mathbf{a}, \Theta')$, given the observations $\mathbf{o}, \mathbf{c}, \mathbf{a}$ and a current guess of the parameter values, $\Theta'$. The "M" step is then to perform the maximization which, in this case, can be computed analytically by taking derivatives with respect to each parameter, setting to zero, and solving for the parameter. The resulting update equations for the parameters of the POMDP transition functions are the same as for an input-output hidden Markov model [5].

The update equation for the $D$ transition parameter, $\Theta^D_{ijkl} = P(d_{t,i} \,|\, d_{t-1,j}, c_{t,k}, a_{t,l})$, is then

$$\Theta^D_{ijkl} = \frac{\alpha^D_{ijkl} + \sum_{t \in \{1 \ldots N_t\} \,|\, C_t = k \wedge A_t = l} P(d_{t,i}, d_{t-1,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}, \Theta')}{\sum_i \left[ \alpha^D_{ijkl} + \sum_{t \in \{1 \ldots N_t\} \,|\, C_t = k \wedge A_t = l} P(d_{t,i}, d_{t-1,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}, \Theta') \right]},$$

where the sum over the temporal sequence is only over time steps in which the observed values of $c_t$ and $a_t$ in the data are $k$ and $l$, respectively, and $\alpha^D_{ijkl}$ is the parameter of the Dirichlet smoothing prior. The summand can be factored as

$$P(d_{t,i}, d_{t-1,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}, \Theta') = \beta_{t,i}\, P(c_t \,|\, d_{t-1,j}, c_{t-1}, a_t)\, P(o_t \,|\, d_{t,i})\, P(d_{t,i} \,|\, d_{t-1,j}, c_{t-1}, a_t)\, \alpha_{t-1,j},$$

where $P(o_t \,|\, d_{t,i})$ is the likelihood of the data given the behavior and $\alpha, \beta$ are the usual forward and backward variables,

$$\alpha_{t,j} = P(d_{t,j}, o_1, \ldots, o_t, a_1, \ldots, a_t, c_1, \ldots, c_t),$$
$$\beta_{t,i} = P(o_{t+1}, \ldots, o_T, a_{t+1}, \ldots, a_T, c_{t+1}, \ldots, c_T \,|\, d_{t,i}, c_t),$$

for which we can derive recursive updates

$$\alpha_{t,j} = \sum_k P(o_t \,|\, d_{t,j})\, P(c_t \,|\, d_{t-1,k}, c_{t-1}, a_t)\, P(d_{t,j} \,|\, d_{t-1,k}, c_{t-1}, a_t)\, \alpha_{t-1,k}, \qquad (1)$$

$$\beta_{t-1,i} = \sum_k \beta_{t,k}\, P(c_t \,|\, d_{t-1,i}, c_{t-1}, a_t)\, P(o_t \,|\, d_{t,k})\, P(d_{t,k} \,|\, d_{t-1,i}, c_{t-1}, a_t). \qquad (2)$$

The updates to $\Theta^C_{ijkl} = P(c_{t,i} \,|\, c_{t-1,j}, d_{t-1,k}, a_{t,l})$ are

$$\Theta^C_{ijkl} = \sum_{t \in \{1 \ldots N_t\} \,|\, C_t = i \wedge C_{t-1} = j \wedge A_t = l} \gamma_k,$$

where $\gamma_k = P(d_{t,k} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}) = \alpha_{t,k} \beta_{t,k}$.

The updates to the $j$th component of the mixture of the output distributions, $P(O \,|\, D_j)$, will depend on the particular form of that function, but will be weighted by $\gamma_j$. Here, we assume only that the gradient of the observation function can be computed with respect to its parameters. We show more details of these updates in Section 4 for the hierarchical function we have used.
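As a concrete illustration of the M-step update for the $D$ transition parameter above, the sketch below accumulates the expected pairwise marginals from a forward-backward pass into Dirichlet-smoothed counts and normalizes them. It is a sketch under our own assumptions: the array names, the precomputed `xi` marginals, and the single scalar smoothing parameter are illustrative choices, not the authors' implementation.

```python
# Illustrative M-step sketch for Theta^D, assuming the E step has already produced
# xi[t, i, j] = P(d_t = i, d_{t-1} = j | o, a, c, Theta') for every transition t.
import numpy as np

def update_theta_D(xi, c_seq, a_seq, n_c, n_a, alpha):
    """xi: (T, Nd, Nd) expected pairwise marginals; c_seq, a_seq: observed context
    and action indices per step; alpha: scalar Dirichlet smoothing parameter."""
    T, n_d, _ = xi.shape
    counts = np.full((n_d, n_d, n_c, n_a), alpha)      # alpha^D_{ijkl}
    for t in range(T):
        counts[:, :, c_seq[t], a_seq[t]] += xi[t]      # add expected counts for (k, l)
    # Normalize over i (the value of d_t) for each (j, k, l).
    return counts / counts.sum(axis=0, keepdims=True)
```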

3.3 Solving POMDPs

This section discusses how to solve the learned model to yield a policy of action. If observations are drawn from a finite set, then an optimal policy of action can be computed for a POMDP using dynamic programming over the space of the agent's belief about the state, $b(s)$. Dynamic programming "backs up" the reward function through the POMDP dynamics, thereby computing a value function over the belief space. This value function gives the expected value of the agent being in each belief state and can be used to map belief states into the actions that will achieve this value. Since the belief space is infinite, computing the optimal value function is possible only for very small problems. If the observation space is continuous or very large, as in our case, the difficulty is increased even further [29].

Although the ultimate goal of computing the value function is to yield a policy of action, we are concerned in this paper with learning the POMDP model. As we will describe in the next section, the value function has an important role to play in the learning because it gives information about which distinctions in the state space are useful to make. The ability of the value function to provide this information is intimately connected with its final ability to choose correct actions. However, we can use the simplest possible approximation for a particular domain that still yields sufficient information to guide our learning process. In the domains we have investigated, we can consider the POMDP as a fully observable MDP (the MDP approximation): The state, $S$, is assigned its most likely value in the belief state, $s^* = \arg\max_s b(s)$. Solving the resulting fully observable MDP using dynamic programming consists of computing value functions, $V^n$, where $V^n(s)$ gives the expected value of being in state $s$ with a horizon of $n$ stages to go, assuming optimal actions at each step. The actions that maximize $V^n$ are the policy with $n$ stages to go. The value functions are computed by setting $V^0 = R$ (the reward function), and iterating [4]

$$V^{n+1}(s) = R(s) + \max_{A=a} \left\{ \sum_{S=s'} \Pr(s' \,|\, a, s) \cdot V^n(s') \right\}, \qquad (3)$$

where $\Pr(s' \,|\, a, s)$ is the transition function from $s$ to $s'$ on action $a$. The actions that maximize (3) form the optimal (with regard to the MDP) $n$ stage-to-go policy, $\pi^n(s)$. We will only consider finite horizon policies in this paper. We use the SPUDD MDP solver, which exploits the structure inherent in a factored representation for efficient solutions [31].
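A flat, tabular version of the backup in (3) is sketched below. It is only meant to illustrate the iteration; it does not reflect the structured SPUDD representation actually used in the paper, and the toy numbers are made up.

```python
# Tabular finite-horizon value iteration sketch for the MDP approximation in (3).
import numpy as np

def value_iteration(R, P, horizon):
    """R[s]: reward for state s; P[a, s, s2] = Pr(s2 | a, s); returns (V, policy)."""
    V = R.copy()                                   # V^0 = R
    policy = np.zeros(R.shape[0], dtype=int)
    for _ in range(horizon):
        # Q[a, s] = R(s) + sum_{s'} Pr(s' | a, s) V(s')
        Q = R[None, :] + np.einsum('ast,t->as', P, V)
        policy = Q.argmax(axis=0)                  # n stage-to-go policy
        V = Q.max(axis=0)                          # V^{n+1}
    return V, policy

# Toy two-state, two-action example with made-up numbers.
R = np.array([0.0, 1.0])
P = np.array([[[0.9, 0.1], [0.9, 0.1]],            # action 0
              [[0.5, 0.5], [0.1, 0.9]]])           # action 1
print(value_iteration(R, P, horizon=5))
```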

We again stress the fact that, although we are solving the model using a strong approximation, the full POMDP model is still being learned and the solution is only being used in this work to guide our structure learning. For the domains we consider in this paper, this approximation still preserves the information about the state space distinctions necessary to learn the smallest model. It is also important to note that the solution technique we describe here could be replaced with other approximate model-based POMDP solution algorithms. The resulting changes to the value-directed structure learning technique would not alter the basic premise: that state space distinctions which are not useful will be apparent in any reasonable approximate solution.

3.4 Value-Directed Structure Learning

The value function, $V(s)$, gives the expected value for the decision maker in each state. However, there may be parts of the state space that are indistinguishable (or nearly so) insofar as decisions go. Eliminating the distinctions between them by merging states can lead to efficiency gains without compromising decision quality. Such state merging is a form of structure learning (for model order) based upon the utility of states. While this idea has been explored in the machine learning literature [15], we apply it here to the task of learning the minimum number of behaviors that need to be distinguished in our learned POMDP. This value-directed structure learning can be contrasted with data-dependent structure learning, in which the complexity of the model is traded off against the quality of fit to the data. For example, Bayesian inference gives rise to a manifestation of Occam's razor by assigning higher probability to simpler models in many cases. However, this only considers utility based on data prediction. In our case, we explicitly look for models that are the simplest for achieving value within a task and these may not be the same as those given by Bayesian model selection.

There are two problems that must be solved: first, finding the parts of the state space that can be merged and, second, actually performing the merging, which involves combining the (learned) observation functions for different behaviors. We address each of these in turn. To merge states, we use the fact that parts of the state space with similar values in the value function can be aggregated to form abstract states when computing a policy without sacrificing much in terms of value. We can also look at the policy for a given model and find states that map to the same action. This process is repeated until no more merges are possible or until the number of behaviors becomes 1 (in which case, recognition of the behaviors is deemed useless).

To see how this is implemented, recall that the state space is represented in a factored POMDP as a product over a set of variables. Therefore, the value function can be split into $N_d$ pieces, $V_i$, one for each value (behavior class) of the variable $D = d_i$. Each such $V_i$ gives the values of being in any state in which $D = d_i$. A similar split occurs for the policy, yielding subpolicies, $\pi_i$, giving the actions to take for each $D = d_i$. The $V_i$ can be compared by computing the difference between them, $d_{ij} = \|V_i - V_j\|$, where $\|X\| \equiv \max\{|x| : x \in X\}$ is the supremum norm. This value difference is an indication of how much value will potentially be lost if we merge the states $i$ and $j$. Two subpolicies, $\pi_i$ and $\pi_j$, are compared by dividing the number of states for which they agree, $n_{ij} = |\pi_i \wedge \pi_j|$, by the size of the state space spanned by all variables except $D$, written $|S_{\setminus D}|$. This fraction, $p_{ij} = n_{ij} / |S_{\setminus D}|$, is the amount of the state space over which the policies agree. The algorithm shown in Fig. 2 uses these measures, beginning with $N_d$ as large as the training data will support. The thresholds $\omega_p$ and $\omega_d$ govern how aggressively this method merges states. These parameters are not learned and must be tuned manually. We discuss particular values for these thresholds in Section 5.
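The two merge tests can be sketched as follows, under the assumption that the value function and policy have already been split per behavior class into arrays $V_i$ and $\pi_i$ over the remaining state space; the function name and the exact form of returning candidate pairs are our own illustrative choices rather than the algorithm of Fig. 2 verbatim.

```python
# Sketch of the value-difference and policy-agreement merge tests, assuming
# V[i][s] and pi[i][s] hold the value and action for state s when D = d_i.
import numpy as np

def merge_candidates(V, pi, omega_d, omega_p):
    """Return pairs (i, j) whose value difference d_ij = ||V_i - V_j|| is below
    omega_d or whose policies agree on a fraction p_ij >= omega_p of states."""
    n_d = V.shape[0]
    pairs = []
    for i in range(n_d):
        for j in range(i + 1, n_d):
            d_ij = np.max(np.abs(V[i] - V[j]))      # supremum norm
            p_ij = np.mean(pi[i] == pi[j])          # n_ij / |S without D|
            if d_ij < omega_d or p_ij >= omega_p:
                pairs.append((i, j, d_ij, p_ij))
    return pairs
```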

Once two states have been found, we must initialize a new POMDP model with one state representing the observation space that both original states did (Step 6). We have used two methods for doing this. In the first, we simply delete one of the two states. We use this method in Sections 5.2 and 5.3.2. This may be dangerous, however, as a large part of the observation space may not be modeled and the next iteration of learning (Step 1) may account for it in unpredictable ways. A more principled method is to reinitialize an observation function $P(O \,|\, d_i)$, where $d_i$ is the new merged state based on all the data that was classified into states with $D = d_i \vee d_j$ (the old states to be merged). We use this method in Section 5.3.1.


Fig. 2. Value-directed structure learning algorithm.


4 OBSERVATION FUNCTION

We now return to the observation function $P(O \,|\, D)$. Recall that this function is a mapping between spatio-temporally extended observations (sequences of video frames), $O$, and high-level behavior descriptors, $D$, and, therefore, must have a hierarchical structure. Further, since we will eventually wish to sample from this distribution to find POMDP solutions [29], we require a generative model. Finally, we require that this function be fairly generic such that it can be applied to different types of behaviors without modification. In what follows, we describe a model for $P(O \,|\, D)$ that satisfies all of these constraints, shown as a dynamic Bayesian network in Fig. 3a. It is a mixture of coupled hidden Markov models (CHMMs) [11], a hierarchical dynamic Bayesian network that performs temporal abstraction using hidden Markov chains over $X$ and $W$. These variables ($X, W$) describe instantaneous dynamics and configuration of the region being tracked. For example, the configuration classes may correspond to characteristic facial poses, such as the apex of a smile. The dynamics classes are motion classes and may correspond to, for example, motion during expansion of the face to a smile. Section 5.1 shows how configuration and dynamics together outperform either one separately in a recognition task.

We assume that a region of interest in each frame is selected by some independent tracking process (details in [26]). The observations are video image regions, $f$, and the spatio-temporal derivatives, $\nabla f$, between pairs of images over these regions. The observations are spatially abstracted using projections, $Z$, to an a priori basis of feature-weighted 2D polynomials, as shown in Fig. 3b. The basis functions are fixed in order to ensure the model is generalizable to different behaviors, but we show in Section 4.3 how feature weights can be learned such that only those basis functions that are necessary are used.

As a generative model, it can be described as follows: A high-level behavior at time $t$, $d_t$, generates a sequence of images $f_{t'=1} \ldots f_{t'=T}$ and spatio-temporal derivatives, $\nabla f_{t'=1} \ldots \nabla f_{t'=T}$, by first generating a sequence of values, $w_{t'=1} \ldots w_{t'=T}$ and $x_{t'=1} \ldots x_{t'=T}$, for the discrete configuration and dynamics variables, $W$ and $X$, respectively. The dynamics in the hidden chains are given by the transition functions $\Theta_X = P(X_{t'} \,|\, X_{t'-1}, W_{t'}, D_t)$ and $\Theta_W = P(W_{t'} \,|\, W_{t'-1}, X_{t'-1}, D_t)$ and the initialization functions $\Pi_X = P(X_{t'=1} \,|\, W_{t'=1}, D_t)$ and $\Pi_W = P(W_{t'=1} \,|\, D_t)$. The dynamics and configuration variables at time $t'$, $X_{t'}, W_{t'}$, generate an image, $f_{t'}$, and a spatio-temporal derivative, $\nabla f_{t'}$, through the conditional distributions, $\Theta_I = P(f_{t'} \,|\, W_{t'})$ and $\Theta_{\nabla f} = P(\nabla f_{t'} \,|\, X_{t'})$. We discuss these last two functions in more detail in Sections 4.1 and 4.2, respectively.

This mixture model can be used to compute the likelihood of a video sequence given the display descriptor, $P(o_1 \ldots o_T \,|\, d_t)$, where $o_t \equiv \{f_t, \nabla f_t\}$, using the recursive equations:

$$P(o_1 \ldots o_t \,|\, d_t) = \sum_{ij} P(\nabla f_t \,|\, x_{t,i})\, P(f_t \,|\, w_{t,j}) \sum_{kl} P(x_{t,i} \,|\, x_{t-1,k}, w_{t,j}, d_t)\, P(w_{t,j} \,|\, w_{t-1,l}, x_{t-1,k}, d_t)\, P(x_{t-1,k}, w_{t-1,l}, o_1 \ldots o_{t-1} \,|\, d_t),$$

$$P(o_1 \,|\, d_t) = \sum_{ij} P(\nabla f_1 \,|\, x_{1,i})\, P(f_1 \,|\, w_{1,j})\, P(x_{1,i} \,|\, w_{1,j}, d_t)\, P(w_{1,j} \,|\, d_t). \qquad (4)$$
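For readers who prefer code to the summation in (4), here is a deliberately unoptimized numerical sketch of that forward pass for a single behavior class. The array layout and names are our own assumptions, and the per-frame likelihoods are assumed to have been computed already by the output distributions of Sections 4.1 and 4.2.

```python
# Rough sketch of the coupled-chain forward pass in (4) for one behavior class d.
# Assumed inputs: lik_x[t, i] = P(grad_f_t | x_i), lik_w[t, j] = P(f_t | w_j),
# A_x[i, k, j] = P(x_i | x_prev=k, w_now=j, d), A_w[j, l, k] = P(w_j | w_prev=l, x_prev=k, d),
# pi_x[i, j] = P(x_1=i | w_1=j, d), pi_w[j] = P(w_1=j | d).
import numpy as np

def chmm_sequence_likelihood(lik_x, lik_w, A_x, A_w, pi_x, pi_w):
    T, n_x = lik_x.shape
    n_w = lik_w.shape[1]
    # alpha[k, l] = P(x_t=k, w_t=l, o_1..o_t | d)
    alpha = lik_x[0][:, None] * lik_w[0][None, :] * pi_x * pi_w[None, :]
    for t in range(1, T):
        new = np.zeros((n_x, n_w))
        for i in range(n_x):
            for j in range(n_w):
                # trans[k, l] = P(x_i | x_k, w_j, d) * P(w_j | w_l, x_k, d)
                trans = A_x[i, :, j][:, None] * A_w[j, :, :].T
                new[i, j] = lik_x[t, i] * lik_w[t, j] * np.sum(trans * alpha)
        alpha = new
    return alpha.sum()    # P(o_1..o_T | d)
```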

Table 1 shows the parameters in the model. On the left are parameters that have fixed values. These numbers are set manually based on prior expectations. The right table shows the parameters that are learned using EM, the initialization heuristics, or the value-directed learning methods.

The following sections give more details on the observation function. Sections 4.1 and 4.2 give overviews of the likelihood computations for dynamics, $P(\nabla f_{t'} \,|\, X_{t'})$, and configuration, $P(f_{t'} \,|\, W_{t'})$, respectively. Feature weighting is described in Section 4.3. Section 4.4 then discusses learning the observation function.

4.1 Dynamics

Fig. 3. (a) Two time slices of the hierarchical observation function used in this paper shown as a DBN that parameterizes $P(O \,|\, D) = P(f_1 \ldots f_{T_t}, \nabla f_1 \ldots \nabla f_{T_t} \,|\, D)$. Time indices are primed at the low level to distinguish them from the higher level. The low-level time index, $t'$, goes from $1 \ldots T$ between times $t-1$ and $t$ at the high level. (b) Bayesian network for the mixture of Gaussians over spatio-temporal image derivatives with feature weighting, parameterizing the distribution $P(\nabla f \,|\, X)$ in (a). The distribution $P(f \,|\, W)$ is parameterized with a similar network.

Fig. 3b shows an expanded version of the dynamics vertical chain from Fig. 3a. We wish to derive the likelihood of a derivative, $\nabla f$, given the high-level dynamics class, $X$. Since we wish to classify optical flow fields, we expand the likelihood as

$$P(\nabla f \,|\, X, \Theta) = \int_v P(\nabla f \,|\, v, \Theta)\, P(v \,|\, X, \Theta), \qquad (5)$$

where we have assumed the derivatives independent of the motion class given the flow, $v$.

There are two terms in the integration. The distribution over spatio-temporal derivatives conditioned on the flow, $P(\nabla f \,|\, v, \Theta)$, is estimated in a gradient-based formulation using the brightness constancy assumption and is given by:2 $P(\nabla f \,|\, v) \propto \mathcal{N}(f_\tau; -f_s v, A)$, where $A = f_s \Lambda_1 f_s' + \Lambda_2$, and $\Lambda_1, \Lambda_2$ are noise covariances in the optical flow estimation [48]. $P(v \,|\, X)$ is parameterized using a projection of $v$ to the basis of Zernike polynomials, $P(v \,|\, X) = \int_{z_X} P(v \,|\, z_X)\, P(z_X \,|\, X)$, where $Z_X$ is the feature vector in the polynomial basis space. We parameterize the distribution over $Z_X$ given $X$ with a normal, $P(Z_X \,|\, X) = \mathcal{N}(Z_X; \mu_{z,x}, \Lambda_{z,x})$. We are expecting flow fields to be normally distributed in the space of the basis function projections.

The distribution over $v$ given $z_X$ is given by projections to the basis of Zernike polynomials, which have useful properties for modeling flow fields [27] and images [53]. Our method is not restricted to this basis set, but gains independence from the data by using an a priori set of basis functions, leading to a more generally applicable observation function. Zernike polynomials are an orthogonal set of complex polynomials defined on a 2D elliptical region as follows [44]:

$$\begin{pmatrix} A^m_n(x, y) \\ B^m_n(x, y) \end{pmatrix} = \sum_{l=0}^{(n-|m|)/2} \frac{(-1)^l (n-l)!}{l!\,\left[\frac{1}{2}(n+|m|) - l\right]!\,\left[\frac{1}{2}(n-|m|) - l\right]!}\, \rho^{n-2l} \begin{pmatrix} \cos(m\theta) \\ \sin(m\theta) \end{pmatrix}, \qquad (6)$$

where $\theta = \arctan(y/x)$, $\rho = \sqrt{x^2 + y^2} \le 1$, and $x, y$ are measured from the region center. The lowest two orders of Zernike polynomials correspond to the standard affine basis, and higher orders (higher values of $n$ and $m$) represent higher spatial frequencies. The basis is orthogonal such that each order can be used as an independent characterization of a 2D function and each such function has a unique decomposition in the basis. Define an $N \times N_z$ matrix $B$ whose columns are the $N_z$ Zernike basis functions (with pixels arranged rowwise), alternating between $A^m_n$ and $B^m_n$ such that columns $0, 1, 2, 3, \ldots$ are Zernike polynomials $A^0_0, A^1_1, B^1_1, A^0_2, \ldots$. Then, a 2D function, $f$, with $N$ pixels is projected to the basis using $z = \kappa B' f$ and can be reconstructed from the $N_z \times 1$ column vector of coefficients, $z$, using $f = Bz$. The normalization constant $\kappa = \epsilon_m (n+1)/\pi$, where $\epsilon_m = 1$ if $m = 0$ and $\epsilon_m = 2$ otherwise. Each of the vertical and horizontal components of a flow field can be written as a linear combination of the basis functions and, so, we can write $v = M z_X$, where

$$v = \begin{pmatrix} v_x \\ v_y \end{pmatrix}, \quad M = \begin{pmatrix} B & 0 \\ 0 & B \end{pmatrix}, \quad z_X = \begin{pmatrix} z_x \\ z_y \end{pmatrix}, \qquad (7)$$

in which $z_x, z_y$ are the Zernike coefficients for horizontal and vertical flow, respectively. In practice, $M$ will be some subset of the Zernike basis vectors, the remaining variance in the flow fields being attributed to zero-mean Gaussian noise. Thus, we write $v = M z_X + n_p$, where $n_p \sim \mathcal{N}(0, \Lambda_p)$ and, so, $P(v \,|\, z) = \mathcal{N}(v; M z_X, \Lambda_p)$. The noise, $n_p$, is a combination of three noise sources: the reconstruction error (energy in the higher order moments not in $M$), the geometric error (due to discretization of a circular region), and the numerical error (from discrete integration) [37]. The choice of a subset of basis elements to use will depend on what the projections are being used for (see Section 4.3).
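The projection and reconstruction just described can be sketched numerically. The snippet below builds a few low-order Zernike columns directly from (6) on a discrete disc and applies $z = \kappa B' f$; the subset of $(n, m)$ orders, the pixel-area factor in the discrete projection, and all variable names are our own assumptions rather than anything specified in the paper.

```python
# Illustrative sketch: build a small Zernike basis from (6), project a region
# z = kappa * B' f (with a discrete area element), and reconstruct f ~= B z.
import numpy as np
from math import factorial

def zernike(n, m, rho, theta):
    """Return (A_n^m, B_n^m) evaluated at polar coordinates (rho, theta)."""
    radial = np.zeros_like(rho)
    for l in range((n - abs(m)) // 2 + 1):
        coeff = ((-1) ** l * factorial(n - l)
                 / (factorial(l)
                    * factorial((n + abs(m)) // 2 - l)
                    * factorial((n - abs(m)) // 2 - l)))
        radial += coeff * rho ** (n - 2 * l)
    return radial * np.cos(m * theta), radial * np.sin(m * theta)

size = 32
y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
inside = x ** 2 + y ** 2 <= 1.0                     # circular region of valid pixels
rho = np.sqrt(x ** 2 + y ** 2)[inside]
theta = np.arctan2(y, x)[inside]

orders = [(0, 0), (1, 1), (2, 0), (2, 2)]           # illustrative subset of (n, m) pairs
cols, kappas = [], []
for n, m in orders:
    A, Bnm = zernike(n, m, rho, theta)
    eps = 1.0 if m == 0 else 2.0
    cols.append(A)
    kappas.append(eps * (n + 1) / np.pi)
    if m != 0:
        cols.append(Bnm)
        kappas.append(eps * (n + 1) / np.pi)
B = np.stack(cols, axis=1)                          # N pixels x Nz basis functions
kappa = np.array(kappas)

f = np.random.rand(inside.sum())                    # a toy image region (pixels inside the disc)
z = kappa * (B.T @ f) * (4.0 / size ** 2)           # projection with a discrete area element
f_rec = B @ z                                       # low-order reconstruction of f
```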

Since all of the terms in the integration (5) are Gaussian distributions, we can integrate over $v$ and $z$ analytically by successively completing the squares in $v$ and $z_X$ to obtain

$$P(f_\tau \,|\, X, f_s) = \frac{\sqrt{|\tilde\Lambda_{z,x}|}}{\sqrt{|A|\,|\Lambda_{z,x}|}}\, e^{\frac{1}{2}\left(\tilde\mu'_{z,x} \tilde\Lambda^{-1}_{z,x} \tilde\mu_{z,x} - \mu'_{z,x} \Lambda^{-1}_{z,x} \mu_{z,x} - \zeta\right)}, \qquad (8)$$


2. The spatio-temporal derivative is $\nabla f = \{f_s, f_\tau\}$, where $f_s = \{f_x, f_y\}$ is the spatial derivative and $f_\tau$ is the temporal derivative. The expression $f_s v$ means $f_x v_x + f_y v_y$, where $v_x, v_y$ are the horizontal and vertical optical flow field, respectively.

TABLE 1. List of Fixed and Learned Parameters in the Model


where

$$\begin{aligned}
\tilde\Lambda_{z,x} &= \left[\Lambda^{-1}_{z,x} + M'\left(\Lambda_p + \left(f'_s A^{-1} f_s\right)^{-1}\right)^{-1} M\right]^{-1},\\
\tilde\mu_{z,x} &= \tilde\Lambda_{z,x}\left(\Lambda^{-1}_{z,x} \mu_{z,x} - M' \Lambda^{-1}_p \Lambda_w w\right),\\
\Lambda_w &= \left(f'_s A^{-1} f_s + \Lambda^{-1}_p\right)^{-1},\\
\zeta &= f'_\tau A^{-1} f_\tau + w' \Lambda_w w, \qquad w = f'_s A^{-1} f_\tau.
\end{aligned} \qquad (9)$$

The brightness constancy assumption fails if the velocity $v$ is large enough to produce aliasing. Therefore, a multiscale pyramid decomposition of the optical flow field must be used. This results in a distribution over the flow vectors, $P(v \,|\, \nabla f) \approx \mathcal{N}(v; \mu_v, \Lambda_v)$, where $\Lambda_v = (f'_s A^{-1} f_s)^{-1}$ and $\mu_v = -\Lambda_v f'_s A^{-1} f_\tau$ [48]. Using these coarse-to-fine estimates, (9) become

$$\begin{aligned}
\tilde\Lambda_{z,x} &= \left[\Lambda^{-1}_{z,x} + M'(\Lambda_p + \Lambda_v)^{-1} M\right]^{-1},\\
\tilde\mu_{z,x} &= \tilde\Lambda_{z,x}\left[\Lambda^{-1}_{z,x} \mu_{z,x} + M'(\Lambda_p + \Lambda_v)^{-1} \mu_v\right].
\end{aligned} \qquad (10)$$

The mean of this distribution, $\tilde\mu_{z,x}$, is a weighted combination of the mean Zernike projection from the data ($M' \mu_v$) and the model mean, $\mu_{z,x}$.
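A small numerical sketch of the fusion in (10) follows; all matrices are random stand-ins of compatible shape (our own assumption), intended only to show how the model's prior precision and the projected data precision combine into the posterior mean and covariance.

```python
# Numerical sketch of (10): fuse the model prior N(mu_zx, Lam_zx) with the
# data-driven flow estimate N(mu_v, Lam_v) projected through the basis subset M.
import numpy as np

Nz, Npix = 6, 50
rng = np.random.default_rng(0)
M = rng.standard_normal((2 * Npix, Nz))          # stand-in for the stacked Zernike basis subset
Lam_p = 0.01 * np.eye(2 * Npix)                   # basis truncation / geometric / numerical noise
Lam_v = 0.1 * np.eye(2 * Npix)                    # flow estimation covariance
mu_v = rng.standard_normal(2 * Npix)              # coarse-to-fine flow estimate
mu_zx = np.zeros(Nz)                              # model mean for this dynamics class
Lam_zx = np.eye(Nz)                               # model covariance

W = np.linalg.inv(Lam_p + Lam_v)
Lam_tilde = np.linalg.inv(np.linalg.inv(Lam_zx) + M.T @ W @ M)
mu_tilde = Lam_tilde @ (np.linalg.inv(Lam_zx) @ mu_zx + M.T @ W @ mu_v)
```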

4.2 Configuration

The classification of image configurations proceeds analogously to the classification of temporal derivatives. We use the same set of basis functions, but the measurements are now the images, $I$, the subspace projections over image regions are $H$, and the basis feature vectors are $Z_W$. The distribution over $Z_W$ given $W$ is parameterized with a normal $P(Z_W \,|\, W) = \mathcal{N}(Z_W; \mu_{z,w}, \Lambda_{z,w})$. The distribution over the image regions given the Zernike projection is normal, $P(H \,|\, Z_W, \Theta) = \mathcal{N}(H; B Z_W, \Lambda_q)$. The distribution over images given the subspace image region, $h$, $P(f \,|\, H, \Theta)$, can be approximated using a normal distribution at each pixel, $P(f \,|\, H, \Theta) \approx \mathcal{N}(f; H, \Lambda_h)$.

The integrations give results similar to that for the dynamics (8):

$$P(f \,|\, W, \Theta) \propto e^{\mu'_* \Lambda^{-1}_* \mu_* - \mu'_{z,w} \Lambda^{-1}_{z,w} \mu_{z,w}}, \qquad (11)$$

where

$$\begin{aligned}
\mu_* &= \Lambda^{-1}_q \left[\Lambda^{-1}_h + \Lambda^{-1}_q\right]^{-1} \Lambda^{-1}_h f' + \Lambda^{-1}_{z,w} \mu_{z,w},\\
\Lambda_* &= \left[B' \Lambda^{-1}_q B - B' \Lambda^{-1}_q \left[\Lambda^{-1}_h + \Lambda^{-1}_q\right]^{-1} \Lambda^{-1}_q B + \Lambda^{-1}_{z,w}\right]^{-1}.
\end{aligned}$$

In this case, however, there are no data-dependent variances and we approximate $P(f \,|\, W, \Theta) = P(z_W \,|\, W, \Theta)$, where $z_W = B' f$ is the projection of the image region to the basis.

4.3 Feature Weighting

In general, we will not know which basis coefficients are the most useful for our classification task: which basis vectors should be included in $M$ and which should be left out (as part of $n_p$). We use the feature weighting techniques of [13] that characterize the relevance of basis vectors by examining how the cluster means,3 $\mu_z$, are distributed along each basis dimension, $k = 1 \ldots N_z$. Relevant dimensions will have well-separated means (large interclass distance along that dimension), while irrelevant dimensions will have means that are all similar to the mean of the data, $\bar\mu$.

To implement these notions, we place a conjugate normal prior on the cluster means, $\mu_z \sim \mathcal{N}(\bar\mu, T)$, where $T$ is diagonal with elements $\psi^2_1 \ldots \psi^2_{N_z}$ and $\psi^2_k$ is the feature weight for dimension $k$. The prior biases the model means to be close to the data mean along dimensions with small feature weights (small variance of the means), but allows them to be far from the data mean along dimensions with large feature weights (large variance of the means). Thus, $\psi^2_k$ will be large if $k$ is a dimension relevant to clustering, while $\psi^2_k \rightarrow 0$ if the dimension is irrelevant. Feature selection occurs if we allow $\psi^2_k = 0$ for some $k$. We do not select features in this work.

Conjugate priors are placed on the feature weights, $\psi^2_k$, and on the model covariances, $\Lambda_z$. Each feature weight is univariate and, so, an inverse gamma distribution is the prior on each $\psi^2_k$:

$$P(\psi^2_k \,|\, a, b) \propto (\psi^2_k)^{-a-1} e^{-b/\psi^2_k}. \qquad (12)$$

This prior allows some control over the magnitude of the learned feature weights, $\psi^2_k$. The model covariances are multivariate, for which the conjugate prior is an inverse-Wishart prior:

$$P(\Lambda_z \,|\, \nu, \bar\Lambda) \propto |\Lambda_z|^{-(\nu + N_z + 1)/2}\, e^{-\frac{1}{2} \mathrm{tr}(\nu \bar\Lambda \Lambda^{-1}_z)}, \qquad (13)$$

where $\bar\Lambda$ is the covariance of all the data and $\nu$ is a parameter that dictates the expected size of the clusters (the intraclass distance). This prior stabilizes the cluster learning.

4.4 Learning the Observation Function

Recall from Section 3.2 that the updates (M step) to the $j$th component of the output distribution, $P(O \,|\, D_j)$, will be weighted by $P(d_{t,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}) = \alpha_{t,j} \beta_{t,j}$, where $\alpha$ and $\beta$ are the forward and backward variables ((1) and (2)), respectively, that require the computation of $P(o \,|\, d_{t,j})$ as given by (4). This means that we can use the standard equations for updating the transition functions in the $X$ and $W$ chains as would be used for a normal CHMM, except the evidence from the data is weighted by $P(d_{t,k} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c})$.

The updates to the output distributions in the configuration process, $P(f \,|\, W)$, and in the dynamics process, $P(\nabla f \,|\, X)$, are as they would be in a mixture model, except that the feature weights bias the updates toward the prior distributions, and the prior weights are given by the state likelihoods in the POMDP (high-level) chain. The update equation for the mean of the $i$th Gaussian output distribution in the $j$th model (a component of $P(o \,|\, x_{t,i}, d_{t,j})$), $\mu_{z,ij}$, is

$$\mu_{z,ij} = \left(\gamma_{\cdot,ij} \Lambda^{-1}_{z,ij} + T^{-1}_j\right)^{-1} \left[\Lambda^{-1}_{z,ij} \left(\sum_{t'=1}^{T_t} \tilde\mu_{z,x,ij}\, \gamma_{t',ij}\right) + T^{-1}_j \bar\mu\right],$$

where $\gamma_{\cdot,ij} = \sum_{t'=1}^{T_t} \gamma_{t',ij}$, $\tilde\mu_{z,x,ij}$ is given by (10) using the parameters of the $i$th Gaussian output in the $j$th model, $\mu_{z,x,ij}, \Lambda_{z,x,ij}$, and

$$\gamma_{t',ij} = P(x_{t',i}, d_{t,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}, \Theta') = P(x_{t',i} \,|\, o_{t'=1} \ldots o_{t'=T}, d_{t,j}, \Theta')\, P(d_{t,j} \,|\, \mathbf{o}, \mathbf{a}, \mathbf{c}, \Theta').$$

The first term is given by the usual forward-backward equations in a CHMM [11], while the second is given by the forward-backward equations in the high-level POMDP ((1) and (2)). Thus, the most likely mean for each state $x$ is the weighted sum of the most likely values of $z$ as given by (10).


3. We drop the subscript $x$ or $w$ here since these equations apply equally to the $X$ or $W$ output distributions.


Dimensions of the means, $\mu_{z,i}$, with small feature weights, $\psi^2_k$, will be biased toward the data mean, $\bar\mu$, in that dimension. This is reasonable because such dimensions are not relevant for clustering and, so, should be the same for any cluster, $X$.

The updates to the feature weights are

$$\psi^2_k = \frac{b}{a + N_x/2 + 1} + \frac{1}{2a + N_x + 2} \sum_{i=1}^{N_x} (\mu_{z,i,k} - \bar\mu_k)^2$$

and show that those dimensions, $k$, with $\mu_{z,i,k}$ very different from the data mean, $\bar\mu_k$, across all states, will receive large values of $\psi^2_k$, while those with $\mu_{z,i,k} \approx \bar\mu_k$ will receive small values of $\psi^2_k$. Intuitively, the dimensions along which the data is well separated (large interclass distance) will be weighted more. The complete derivation, along with the updates to the output distributions of the CHMMs, including to the feature weights, can be found in [26].
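In code, the feature-weight update above is a short operation over the cluster means. The sketch below assumes the means and data mean are stored as arrays and uses our own variable names; `a` and `b` are the inverse-gamma hyperparameters of (12).

```python
# Sketch of the feature-weight update: dimensions whose cluster means spread far
# from the data mean receive large psi_k^2, dimensions that stay close receive small ones.
import numpy as np

def update_feature_weights(mu_z, mu_bar, a, b):
    """mu_z: (Nx, Nz) cluster means; mu_bar: (Nz,) data mean; returns psi^2 per dimension."""
    Nx = mu_z.shape[0]
    spread = ((mu_z - mu_bar) ** 2).sum(axis=0)     # sum_i (mu_{z,i,k} - mu_bar_k)^2
    return b / (a + Nx / 2.0 + 1.0) + spread / (2.0 * a + Nx + 2.0)
```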

Model initialization proceeds in a bottom-up fashion. First, the dynamics mixture model with $N_x$ classes is initialized from a set of single (time-independent) spatio-temporal derivative fields, $\nabla f$, by first computing the expected most likely values of $z_x$ for each frame using a single zero-mean model with constant diagonal covariance 0.001 and then fitting a Gaussian mixture to the result of K-means clustering with $K = N_x$ [28]. While the K-means algorithm uses the Euclidean distance in the space of $Z$, the Gaussian fits use the Mahalanobis distance. All of the feature weights, $\psi^2_k$, are initialized to 1 and the state assignment probability $\Pi_X$ is initialized evenly. The mixture model over the configurations is initialized in a similar way, using the projections of image regions to the Zernike basis.

The entire model is initialized using the estimates of the dynamics and configuration mixtures by first classifying all of the data using the two mixture models and finding the largest $N_d$ sets of sequences whose sets of visited $X$ states match exactly. Second, we find the set of $X$ ($W$) states visited by all the sequences in set $i$: $N^i_X$ ($N^i_W$). Third, we initialize a CHMM for each set, $i$, by assigning the output distributions to be those in the simple mixture models visited by the sequences in the cluster. The transition and initial state probabilities are initialized randomly. Finally, we train each CHMM, keeping the output distributions fixed, and initialize the mixture probabilities for the mixture of CHMMs evenly for each context state.
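A rough sketch of the first initialization step (K-means on the per-frame basis coefficients followed by per-cluster Gaussian fits) is given below. It uses scikit-learn's KMeans as a stand-in for the clustering step described in the text, and the function name and small covariance ridge are our own choices, not the authors' implementation.

```python
# Bottom-up initialization sketch: cluster per-frame Zernike coefficients with
# K-means (K = Nx), then fit one Gaussian per cluster as the mixture initialization.
import numpy as np
from sklearn.cluster import KMeans

def init_dynamics_mixture(Z, n_x):
    """Z: (num_frames, Nz) expected Zernike coefficients per frame."""
    labels = KMeans(n_clusters=n_x, n_init=10, random_state=0).fit_predict(Z)
    means = np.stack([Z[labels == i].mean(axis=0) for i in range(n_x)])
    covs = np.stack([np.cov(Z[labels == i].T) + 1e-6 * np.eye(Z.shape[1])
                     for i in range(n_x)])
    weights = np.bincount(labels, minlength=n_x) / len(labels)
    return means, covs, weights
```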

4.5 Complexity

The complexity of parameter learning is dominated bythe computation of (10) in the “E” step, which isOðNd½NpNz þN3

z �T 00Þ, where Nd is the number of high-level display states, T 00 is the length of the entiresequence of data, Np is the maximum number of pixelsin the region being tracked, and Nz is the maximumdimensionality of the feature vectors Zx; Zw. The compu-tation is repeated until EM converges, usually some smallnumber of iterations n < 10.

The worst-case complexity of solving the POMDP usingvalue iteration over the associated MDP isOðN2

sNaHÞ, whereNs is the number of states in the POMDP,Na is the number ofactions, and H is the horizon. In typical problems, Ns is verylarge (exponential in the number of variables in the POMDP)and the complexity of the entire system, learning plussolution, will be dominated by the solution term. In theexperiment described in Section 5.3, the number of features isNz ¼ 36 (the major factor in the learning complexity), but thenumber of states isNs > 6� 105. Note, however, that we use a

structured approach to solving this MDP [31], which cansubstantially reduce the solution average case complexity.

Once a policy has been found, the complexity of using the model online is composed of updating the belief state (O(No Ns²)) and consulting the policy (O(Ns Na)). The belief state inference will dominate for any reasonably sized model, but techniques that leverage structure can substantially reduce the running time, making near real-time operation possible. The other major computation is optical flow, which can also be done in near real-time.
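For reference, a generic tabular sketch of these two online steps follows; it assumes a flat transition/observation model and an alpha-vector policy, which is much simpler than the structured representation used in the paper.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One step of POMDP belief monitoring (generic, unstructured sketch).

    belief: (Ns,) current belief over states
    T:      (Na, Ns, Ns) transition model, T[a, s, s'] = P(s' | s, a)
    O:      (Na, Ns, No) observation model, O[a, s', o] = P(o | s', a)
    """
    predicted = belief @ T[action]                     # O(Ns^2) prediction step
    updated = predicted * O[action][:, observation]    # weight by observation likelihood
    return updated / updated.sum()                     # renormalize

def choose_action(belief, alpha_vectors, alpha_actions):
    """Consult a policy stored as alpha vectors: O(Ns * Na) in the worst case."""
    values = alpha_vectors @ belief                    # value of each alpha vector at this belief
    return alpha_actions[int(np.argmax(values))]
```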

5 EXPERIMENTS

We present three sets of experiments in this section, each designed to address a particular issue in the learning method we have presented. In the first (imitation game), we explore the representational power of our computer vision modeling techniques by examining how they can be used to learn fairly complex facial expressions. The second experiment demonstrates the value-directed learning technique using a simple set of hand gesture sequences. The third experiment then shows how the model can be used to learn a more complex interaction during a card matching game. We demonstrate on both synthetic and real data in this third experiment.

5.1 Imitation Game

In this experiment, human subjects imitated the facial expressions of an animated character. We learn a mixture of CHMMs model of their facial expressions and use this model to predict the animation that caused it. To play the imitation game, a player watches a computer animated face on a screen and is told to imitate the actions of the face. The animations start from a neutral face (n) and warp to one of four poses {a1, a2, a3, a4}, as shown in Fig. 4. The pose is held for one second and the face then warps back to neutral, where it remains for another second.

The simplified model is shown in Fig. 4b. The cartoon facial expressions are A = {a1 ... a4}, the observations of the human's actions, O, are sequences of video images and the spatio-temporal derivatives between subsequent video frames, and the descriptor of the human's facial expression is D ∈ {d1 ... d_Nd}. The key here is the ability to learn both the output distribution P(O|D) and the relationship between the player's behavior, D, and the cartoon display, A, P(D|A). Thus, we train the model on a set of training data in which the cartoon display labels are observed, then we attempt to predict the cartoon facial expression on a set of test data in which the labels are hidden. That is, on the test data, we compute a* = arg max_a P(A = a|o) = arg max_a Σ_i P(o|d_i) P(d_i|A = a).
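A minimal sketch of this prediction rule, assuming the per-behavior log-likelihoods log P(o|d_i) have already been computed by the CHMMs (the function and argument names are illustrative):

```python
import numpy as np

def predict_cartoon_expression(log_likelihoods, P_D_given_A):
    """Predict a* = argmax_a sum_i P(o|d_i) P(d_i | A = a).

    log_likelihoods: (Nd,) log P(o | d_i) from each behavior model (CHMM)
    P_D_given_A:     (Na, Nd) learned table P(d_i | A = a)
    """
    # Rescale for numerical stability; the common factor does not change the argmax.
    P_o_given_d = np.exp(log_likelihoods - log_likelihoods.max())
    scores = P_D_given_A @ P_o_given_d       # proportional to P(A = a | o)
    return int(np.argmax(scores))
```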

Three subjects performed the task 40 times each. Video frames were recorded at 160 × 120 with a Sony EVI-D30 color camera (frame rate 28 fps) mounted above the screen. The subjects' faces were located in each frame using an optical flow and exemplar-based tracker [26]. The videos were temporally segmented using the onset times of the cartoon facial expressions and the resulting sequences were input to the mixture of coupled HMM clustering and training algorithm using four clusters (Nd = 4, the number of expressions the subjects were trying to imitate).

We did a leave-one-out cross-validation experiment for each of the subjects to verify the prediction accuracy. There were 40 sequences for each subject, one of which was



removed. The remaining 39 sequences were used to train the POMDP. The learned model was used to infer A from the remaining sequence (in which it is hidden) and the most likely value was compared to the actual display. This process was repeated for all 40 left-out sequences, giving the best unbiased estimate of performance on the whole set of data. This process was repeated three times for each subject with different random initializations, with average success rates of 78, 74, and 62 percent. However, these results ignore the model's explicit representation of uncertainty, only reporting success if the actual display is the peak of P(A|O). In some cases, there is a second display that is nearly as likely as the best one. The success rates rise to 95, 93, and 84 percent if we count those classifications as correct where the probability of the most likely display is less than 0.5 and the second most likely display is the correct one.
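The evaluation protocol itself is standard; a schematic sketch, with placeholder training and prediction functions standing in for the mixture-of-CHMMs training and the argmax rule above, is:

```python
def leave_one_out_accuracy(sequences, labels, train_fn, predict_fn):
    """Leave-one-out protocol: hold out one sequence, train on the rest, predict.

    train_fn and predict_fn are placeholders for the model training and
    prediction routines; sequences and labels are parallel lists.
    """
    correct = 0
    for i in range(len(sequences)):
        train_seqs = sequences[:i] + sequences[i + 1:]   # the remaining sequences
        train_labs = labels[:i] + labels[i + 1:]
        model = train_fn(train_seqs, train_labs)
        correct += int(predict_fn(model, sequences[i]) == labels[i])
    return correct / len(sequences)
```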

We also used this game to evaluate the modeling of both the dynamics and the configuration in the observation CHMMs. We performed the same leave-one-out experiment on the first subject's data with 10 random initializations and, for each seed, we trained one model with only dynamics (X) states, one with only configuration (W) states, and one with both. Results showed that modeling both outperformed either separately: 76 percent for both compared to 68 percent for dynamics alone and 68 percent for configuration alone.

We now show some details of the learned D = 1 ("smiling") model for one subject. Feature weights are shown in Fig. 5. The dynamics chain has five significant features: two in the horizontal flow, u_{A_1^1} and u_{B_2^2}, and three in the vertical flow, v_{A_0^0}, v_{B_1^1}, and v_{A_2^0}. The three most significant features in the configuration chain are B_1^1, B_3^3, and A_4^4. The output distributions of the four dynamics states (X) are shown in Fig. 6a, plotted along the two most significant feature dimensions, u_{A_1^1} and v_{B_1^1}. Two states (X = 2, 4) correspond to no motion, while the other two correspond to expansion upward and outward in the bottom of the face region (X = 1) and contraction downward and inward in the bottom of the face region (X = 3). These states correspond to the expansion and relaxation phases of smiling. The output distributions of the four configuration states are shown in Fig. 6b. There are two states (W = 2 and W = 4) which describe the face in a fairly relaxed pose, while W = 1 and W = 3 describe "smiling" configurations, as evidenced by the darker patches near the sides at the bottom.

Fig. 7 shows the model's explanation of a sequence in which D = 1. We see that the high-level distribution over D is peaked at D = 1. Distributions over the dynamics and configuration chains show which state is most likely at each frame. The expected pose, H, and flow field, V, are shown conditioning the image, f, and the temporal derivative, ft, respectively.

5.2 Robot Control Gestures

This "game" involves a human operator issuing navigation commands to a robot using hand gestures. The robot has four possible actions, A: go left, go right, stop, and go forward, and the operator uses four hand gestures corresponding to each command. The state, C, is the operator's action: a Boolean indicator of whether the robot took the right action or not. The reward function is 1 if the robot took the correct action (C = 1); otherwise, it is 0. It is important to state that these experiments do not demonstrate gesture recognition, but only the value-directed learning. Realistic robotic control requires more complex tracking to deal with moving platforms and the high variability in gestures.

We recorded 12 examples of each of four gestures performed by a single subject in front of a stationary camera. Video was grabbed from an IEEE 1394 (FireWire) camera at 150 × 150. The region of interest was the entire image, so no tracking was required. Sequences were taken at a fixed length of 90 frames. From the original 12 data sequences, we selected


Fig. 4. (a) Cartoon faces: neutral (n) and facial expressions A = {a1 ... a4}. (b) Graphical model of the imitation game.

Fig. 5. Feature weights σ²_k for the model 1 dynamics and configuration chains.


11 from each gesture and constructed a training set in which each A was tried for each possible gesture sequence (and the resulting C response was simulated), giving a total of 44 training sequences. We used an initial Nd = 6 states, which is as many as we can expect to learn given the amount of training data. The value-directed structure learning algorithm used w_p = 0.9 and w_d = 1.

Once the model is trained on the 44 training sequences, we evaluate how well it chooses an action on the four remaining sequences (one for each gesture). This leave-one-out cross-validation is repeated for 12 different sets of four test sequences and the total rewards gathered give an unbiased indication of how well the model performs on unseen data. The model chose the correct action 47 out of a total of 12 × 4 = 48 times, for a total success rate of 47/48 or 98 percent. The one failure was due to a misclassification of a "left" gesture as a "right" gesture caused by a large rightward motion of the hand at the beginning of the stroke. More importantly, the final POMDP models learned that there were Nd = 4 display states in all 12 cases.

5.3 Card Matching Game

The card matching game is a simple cooperative two-player game in which the players must send signals to each other through a video link in order to win. At the start of a round, each player is dealt three cards: a heart, a diamond, and a spade. Each player can only see his own set of cards. The players must each play a single card simultaneously and, if the suits match, the players win a function of the amount on the cards; otherwise, they incur a fixed penalty. Thus, the goal of the game is to agree on which suit to play to maximize the return. In order to facilitate this, one player (the bidder) can send a bid to their partner (the ally), indicating a card suit, and can see (but not hear) the ally


Fig. 7. A person smiling is analyzed by the mixture of CHMMs. Probability distributions over X and W are shown for each time step. All other nodes in the network show their expected value given all evidence.

Fig. 6. (a) Dynamics chain model 1 output states plotted along the two most significant dimensions according to the feature weights, u_{A_1^1} and v_{B_1^1}. Reconstructed flow fields for the X state means are also shown. (b) Configuration chain model 1 output distributions. The differences between the W state means and the overall data mean are shown, as well as the overall data mean in the center.


through a real-time video link. Thus, the expectation is that the ally will develop a gesturing strategy to indicate agreement with the bid. For example, a simple strategy involving two head gestures is for the ally to nod or shake her head in agreement or disagreement with the bid. A more complex strategy is to develop a set of three head or hand gestures, one for each card suit. This game is similar to games used in psychology to study the emergence of language [25], [34]. We use it because the number and form of the gestures are not specified as rules of the game, but instead are decided upon by the players at game time.

There are nine variables that describe the state of the game. The suit of each of the three cards can be one of ♥, ♦, ♠. The bidder's actions, A, can be null (no action), sending a confidential bid (bid♥, bid♦, bid♠), or committing a card (cmt♥, cmt♦, cmt♠). The ally's action, C, can be null or committing a card (cmt♥, cmt♦, cmt♠). The behavior variable, D = d1 ... d_Nd, describes the ally's communication through the video link. The other six observable variables in the game are more functional for the POMDP, including the values of the cards (v1, v2 for bidder and ally, respectively) and whether a match occurred or not. The reward is a function of fully observable variables only and is (v1 + v2)[1 + (v1 > 1 ∧ v2 > 1)] if the suits match and −10 otherwise. The number of display states, Nd, is learned using the value-directed structure learning technique in Fig. 2. The number of states in the MDP is 20,736 Nd.
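For concreteness, the reward function above can be transcribed directly as follows (the function name and argument encoding are illustrative):

```python
def card_game_reward(suit1, suit2, v1, v2):
    """Reward for one round: (v1 + v2) * (1 + [v1 > 1 and v2 > 1]) on a suit match,
    and -10 otherwise, as defined in the text."""
    if suit1 != suit2:
        return -10
    bonus = 1 if (v1 > 1 and v2 > 1) else 0
    return (v1 + v2) * (1 + bonus)
```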

5.3.1 Simulated Results

Since our model is generative, we can use it to simulate the game and to generate data. We first specified two prior models for the card matching game: the first, M1, has Nd = 3, while the second, M2, has Nd = 5. We randomly specify a complete set of output distribution parameters, P(O|D), as follows: Transition matrices were drawn randomly in [0, 1], but with an added weight of 30.0 on the diagonal (followed by a renormalization). Feature weights for all output distributions were drawn randomly in (0, Rf]. The means were drawn from the model priors given these feature weights. Covariance matrices were set to be diagonal with a variance in each dimension of (10.0 + c)², where c is normally distributed with variance 1.0. We used Nz,x = Nz,w = 4. The parameter Rf varies the "overlap" between output distributions. The larger Rf is, the more distinguishable the behavior models will be. We can estimate the degree of "overlap" of the output distributions by using them to simulate data and then measuring the maximum-likelihood error rates, r, on this data given the models. We used three values of Rf = {10, 20, 100}, which gave error rates on 50,000 simulated sequences of 1.6, 0.02, and 0.0 percent, respectively, for model M1, and 3.6, 0.04, and 0.0 percent, respectively, for model M2.
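A sketch of how such output-distribution parameters could be drawn, following the recipe above (the helper names, RNG argument, and defaults are illustrative, not the paper's code):

```python
import numpy as np

def sample_transition_matrix(n_states, diag_weight=30.0, rng=np.random):
    """Row-stochastic transition matrix: uniform entries in [0, 1] plus an
    added weight on the diagonal, followed by renormalization."""
    T = rng.uniform(0.0, 1.0, size=(n_states, n_states))
    T += diag_weight * np.eye(n_states)
    return T / T.sum(axis=1, keepdims=True)

def sample_feature_weights(n_dims, Rf, rng=np.random):
    """Feature weights drawn uniformly in (0, Rf]; larger Rf gives more
    distinguishable (less overlapping) behavior models."""
    return Rf * (1.0 - rng.uniform(0.0, 1.0, size=n_dims))  # uniform(0,1) is [0,1), so this is (0, Rf]

def sample_covariance_diagonal(n_dims, rng=np.random):
    """Diagonal covariance with per-dimension variance (10.0 + c)^2, c ~ N(0, 1)."""
    c = rng.normal(0.0, 1.0, size=n_dims)
    return (10.0 + c) ** 2
```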

Once the model was specified, we simulated a set of Nt training sequences from it, each of length T = 100, with output sequences of a fixed length of T' = 20, using a random selection of actions, A. We only simulate the values of ZW, ZX, not the full observation set, f, ∇f, since this is sufficient to demonstrate the method. We then applied the training process in Fig. 2 to these Nt training sequences, using w_p = 0.6 and w_d = 20.0, and starting with Nd = 8. The learned model and policy were tested for 20 trials of length 100 and we record the average reward gathered over the 20 trials. The whole process is repeated (including random specification of new output distributions) 20 times and the means and standard deviations of the averages are recorded. We also estimate the optimal values that can be achieved by performing the same simulation experiment, but using the original simulated model instead of a learned model.

Fig. 8 shows the results. Each plot shows the average reward and standard deviation for each value of Nt, as well as the optimal value (dashed line), the average number of behavior models (for Nt = 200), Nd, and the degree of overlap, r. The average number of states of D learned for Nt = 200, Rf = 100 was 4.0 and 6.1 for Nd = 3 and Nd = 5, respectively, showing that, although the state merging was not overly aggressive, it managed to reduce the model order close to that of the true models in both cases. The results show that, if the actual output models are distinguishable


Fig. 8. Average reward gathered over 20 trials in simulation, for simulated models with (top row) Nd = 3 and (bottom row) Nd = 5 (three and five behaviors to be recognized, respectively), shown as a function of the number of training sequences. The three plots in each row show the results for Rf = 100, 20, and 10, from left to right. The dashed line in each plot shows the mean reward achieved when the model is given as the simulated model and, so, is optimal for the given simulation. The values of Nd shown in the plot are the averages learned by the algorithm for Nt = 200.


(Rf = 100), then the resulting learned model is nearly optimal for Nt ≳ 50, even using the approximate solution technique for the POMDP policy. This is what we expect to happen if the learning technique is able to recover a set of behaviors that distinguishes at least those behaviors that are important to the task, since then the state is (nearly) fully observable and the MDP policy is (nearly) optimal for the POMDP. On the other hand, when the output models become less easily distinguishable (for Rf = 20 and 10), the learning is more difficult and, so, the gap between the optimal solution and the learned one widens. Nevertheless, the performance remains close to optimal in all but the most difficult case (with Nd = 5 and Rf = 10).

5.3.2 Real Data

The card matching game was played by two users through a computer interface in our laboratory. Each player viewed their partner through a link from their workstation to a Sony EVI S-video camera mounted atop their partner's screen. The average frame rate at 320 × 240 resolution was over 28 fps. The rules were explained to the subjects and they played four games of five rounds each. The players had no chance to discuss strategies before the game, but were given time to practice. The players' faces were tracked using the same tracker as in Section 5.1, described in [26]. In this experiment, the partner used the obvious communication strategy of "nodding" and "shaking" their head in response to good and bad bids, respectively.

The model was trained with four display states, which is as large as we think it is possible to learn reliable models given the training set size. The structure learning algorithm was applied with w_p = 0.9 and w_d = 1, which finds values of D for which the subpolicies match over 90 percent of the state space. Two values of D had policies that were in p_{01} = 0.96 agreement (96 percent of the state space) and were merged. The result was, as expected, one "null" state (d1), one "nodding" state (d2), and one "shaking" state (d3). No further merges were found. The associated policy correctly predicted 14/20 actions in the training games and 5/7 actions in the test game.

In order to attenuate the effects of the lack of training data, we use symmetry arguments to fill in the model without having to explicitly explore those situations. In particular, we may assume that players do not have any particular preference over card suits, such that the conditional probability tables should be symmetric under permutation of suits. Therefore, we can "symmetrize" the probability distributions by simply averaging over the six card suit permutations. The policy for the symmetrized POMDP correctly predicts all but one (19/20) action in the training game, for an error rate of 5 percent. The misclassification was due to the subject looking to one side of the screen, yielding significant horizontal head motion and a classification as d3.
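A possible implementation of this symmetrization, under the simplifying assumption that each suit-dependent variable occupies its own axis of size three in the probability table (the function and argument names are illustrative):

```python
from itertools import permutations
import numpy as np

def symmetrize_over_suits(cpt, suit_axes, suits=(0, 1, 2)):
    """Average a conditional probability table over the six card-suit permutations.

    cpt:       numpy array representing the table
    suit_axes: the axes of cpt that are indexed by suit
    Each permutation is applied to all suit-valued axes simultaneously, and the
    six relabeled tables are averaged; the result is invariant to suit relabeling.
    """
    total = np.zeros_like(cpt, dtype=float)
    for perm in permutations(suits):
        permuted = cpt
        for ax in suit_axes:
            permuted = np.take(permuted, perm, axis=ax)  # relabel suits along this axis
        total += permuted
    return total / 6.0
```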

The symmetrized policy correctly predicts all but one (6/7) action in the test game. The misclassified sequence is longer than usual (over 300 frames) and includes some horizontal head motion in the beginning that appears as shaking in the model. This misclassification may expose a weakness of the temporal segmentation method we use, which is based entirely on the observable actions and game states. Although this sequence is long, it is only the (fairly vigorous) head nod at the very end that is the important display. This situation could be dealt with by incorporating an explicit model of human attention, for example [41].

We performed a second experiment in which the role of bidder was exchanged between the same two players and found similar results. The symmetrized policy in this case predicted all 13 actions in the training game, but only 3/5 in the test game. The mispredictions in this case were not due to misclassifications of the behaviors, in the sense that the classifications were consistent with others in the training set. They arose instead due to the lack of training data for the POMDP and would be expected to disappear once more data was incorporated.

5.4 Discussion

The experiments we have presented have demonstrated three things. First, the imitation game showed how we can learn models of complex facial expressions with little prior knowledge and no labeled data. This was done by observing the causes of the facial expressions in training data only and then predicting the likely causes with over 80 percent accuracy in the test data. The second experiment then showed how, when behaviors are distinct and the task is simple, we can learn the number of behaviors that occur in the data and that are useful to the task being modeled. In 12 experiments, the correct number of gestures was learned in every case. Third, the card matching game demonstrated that we can apply the same learning and solution techniques to a task of realistic size and learn what behaviors (and how many) are being exhibited and what their relationship is to the task. Using synthetic data from simulations, we demonstrated that we can learn a complex transition function and a complex observation function simultaneously and that the resulting model performs close to optimally in simulation and has a structure (number of displays) close to that of the correct model. Finally, we applied the learning technique to real data from humans playing the card matching game and showed that the learned model can predict the actions of the humans and can learn the correct number of displays being used.

In all three of the experiments we have presented, there was never a need to specify what behaviors were to be recognized. The system learned the behaviors that were used within each task and how these behaviors were related to the utility. Thus, by specifying only some of the fully observable aspects of the domain being modeled (such as the rules of the card game), the system can learn the interactions automatically and a new set of behavior models does not need to be reengineered for each new task. For example, suppose the rules of the card game were changed such that the players had to communicate with their hands alone. While a traditional computer vision approach would have to start afresh by building a recognition system for all possible gestures thought (by some experts) to be those that could be used by the participants, our approach could be applied directly and would learn what gestures were being used and why they were being used. This is the primary benefit of this type of learning: no prior knowledge about human behaviors needs to be included, but, rather, can be learned directly from data.

While the results we have presented show the validity of our method from a technical standpoint, a thorough evaluation of the domains that are impacted by this work remains. A significant limitation is that only simulations were used to select actions and gather rewards online. In theory, correct action selection is the only accurate method for validating that the model was learned and solved correctly. Simply predicting actions taken by a human is not sufficient since the learned model could implement a different, yet still optimal, policy



than that used by the human. A second limitation of our current experiments is the use of a fully observed state space (apart from the displays). Additional unobserved variables increase learning complexity and make temporal segmentation more difficult. A third limitation of our current experiments is that we only use a small number of displays and only allow for merging states during learning. We would like the system to learn a larger number of displays by starting from a small number and splitting states based on, e.g., their predictive power.

6 CONCLUSION

This paper has shown how partially observable Markov decision processes, or POMDPs, can be used to allow an agent to incorporate actions and utilities into the sensing and representation of visual observations. We have shown how this model provides top-down, value-based evidence for the learned probabilistic models and allows us to learn the models most conducive to achieving value in a task. One of the key features of this technique is that it does not require labeled data sets. That is, the model makes no prior assumptions about the form or number of nonverbal behaviors used in an interaction, but, rather, discovers this from the data during training.

There are three major open questions that remain to be addressed for POMDP modeling of human interactions. First, most interaction tasks will involve state spaces that are significantly larger than we have experimented with in this paper and may be partially or fully unobservable. For example, an assistive system for handwashing used an MDP with over 20 million states [8] and the scalability of the learning method remains to be validated for such larger models. This involves ensuring that we can learn the model parameters for a larger state space and that we can compute good approximate POMDP policies efficiently [29], [50]. However, it is precisely the combination of value-directed learning methods with POMDP solution techniques that will enable the solution of very large POMDPs [43]. The second open question concerns the interplay between a solution for the POMDP and the value-directed structure learning as described in Section 3.3. The strong approximation we used was appropriate for the examples we have examined, but should be relaxed (using, e.g., [29]). More complex structure learning techniques would be required for this learning. The third open question involves the representational power of the observation function. This should be sufficient to distinguish what is necessary to recognize within a particular task. That is, the features used for modeling video sequences of human displays must be able to distinguish what is needed for performance in the task. It remains to be verified if our CHMM-based observation function is sufficient for a wide range of tasks.

Another interesting avenue for future research is the use of the POMDP models in active vision systems [17]. The trade-off between behavior recognition and resource use can be explicitly addressed by a POMDP model and, coupled with the behavior learning techniques we have presented in this paper, would form a possible solution to the active vision problem.

We are currently applying POMDP models to assisted living tasks in which a POMDP-based system helps a cognitively disabled person complete activities of daily living [30]. POMDP models are well suited to this environment since they model the stochastic nature of user behavior, the need to trade off various objective criteria (e.g., task completion, caregiver burden, user frustration, and independence), and the need to tailor guidance to specific individuals and circumstances [8].

ACKNOWLEDGMENTS

The authors would like to thank Don Murray, Pantelis Elinas, Pascal Poupart, David Lowe, and David Poole for invaluable help and suggestions. This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada and from the Institute for Robotics and Intelligent Systems, a Canadian Network of Centres of Excellence.

REFERENCES

[1] K.J. Åström, "Optimal Control of Markov Decision Processes with Incomplete State Estimation," J. Math. Analysis and Applications, vol. 10, pp. 174-205, 1965.
[2] M.S. Bartlett, G. Littlewort, B. Braathen, T.J. Sejnowski, and J.R. Movellan, "A Prototype for Automatic Recognition of Spontaneous Facial Actions," Advances in Neural Information Processing Systems, vol. 15, pp. 382-386, 2003.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, "Eigenfaces versus Fisherfaces: Recognition Using Class Specific Linear Projections," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711-720, July 1997.
[4] R.E. Bellman, Dynamic Programming. Princeton Univ. Press, 1957.
[5] Y. Bengio and P. Frasconi, "Input-Output HMMs for Sequence Processing," IEEE Trans. Neural Networks, vol. 7, no. 5, pp. 1231-1249, Sept. 1996.
[6] M. Black and Y. Yacoob, "Tracking and Recognizing Rigid and Nonrigid Facial Motions Using Local Parametric Models of Image Motions," Int'l J. Computer Vision, vol. 25, no. 1, pp. 23-48, 1997.
[7] A.F. Bobick and J.W. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.
[8] J. Boger, J. Hoey, P. Poupart, C. Boutilier, G. Fernie, and A. Mihailidis, "A Planning System Based on Markov Decision Processes to Guide People with Dementia through Activities of Daily Living," IEEE Trans. Information Technology in Biomedicine, vol. 10, no. 2, pp. 323-333, Apr. 2006.
[9] C. Boutilier, T. Dean, and S. Hanks, "Decision Theoretic Planning: Structural Assumptions and Computational Leverage," J. Artificial Intelligence Research, vol. 11, pp. 1-94, 1999.
[10] M. Brand, "Learning Concise Models of Human Activity from Ambient Video via a Structure-Inducing M-Step Estimator," Technical Report TR-97-25, Mitsubishi Electric Research Laboratory, Nov. 1997.
[11] M. Brand, N. Oliver, and A. Pentland, "Coupled Hidden Markov Models for Complex Action Recognition," Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 994-999, 1997.
[12] C. Bregler, "Learning and Recognising Human Dynamics in Video Sequences," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 568-574, 1997.
[13] P. Carbonetto, N. de Freitas, P. Gustafson, and N. Thompson, "Bayesian Feature Weighting for Unsupervised Learning with Application to Object Recognition," Proc. Ninth Int'l Workshop Artificial Intelligence and Statistics, pp. 122-128, Jan. 2003.
[14] Embodied Conversational Agents, J. Cassell et al., eds. MIT Press, 2000.
[15] L. Chrisman, "Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach," Proc. 10th Nat'l Conf. Artificial Intelligence, pp. 183-188, 1992.
[16] B. Clarkson and A. Pentland, "Unsupervised Clustering of Ambulatory Audio and Video," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, 1999.
[17] T. Darrell and A.P. Pentland, "Active Gesture Recognition Using Partially Observable Markov Decision Processes," Proc. 13th IEEE Int'l Conf. Pattern Recognition, 1996.
[18] T.J. Darrell, I.A. Essa, and A.P. Pentland, "Task-Specific Gesture Analysis in Real-Time Using Interpolated Views," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1236-1242, Dec. 1996.
[19] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data Using the EM Algorithm," J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
[20] G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, and T.J. Sejnowski, "Classifying Facial Actions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 974-989, Oct. 1999.
[21] I.A. Essa and A.P. Pentland, "Coding Analysis, Interpretation, and Recognition of Facial Expressions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 757-763, July 1997.
[22] S. Fine, Y. Singer, and N. Tishby, "The Hierarchical Hidden Markov Model: Analysis and Applications," Machine Learning, vol. 32, no. 1, pp. 41-62, 1998.
[23] D.J. Fleet, M.J. Black, Y. Yacoob, and A.D. Jepson, "Design and Use of Linear Models for Image Motion Analysis," Int'l J. Computer Vision, vol. 36, no. 3, pp. 171-193, 2000.
[24] A.J. Fridlund, Human Facial Expression: An Evolutionary View. Academic Press, 1994.
[25] B. Galantucci, "An Experimental Study of the Emergence of Human Communication Systems," Cognitive Science, vol. 29, pp. 737-767, 2005.
[26] J. Hoey, "Decision Theoretic Learning of Human Facial Displays and Gestures," PhD thesis, Univ. of British Columbia, 2004.
[27] J. Hoey and J.J. Little, "Representation and Recognition of Complex Human Motion," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 752-759, June 2000.
[28] J. Hoey and J.J. Little, "Bayesian Clustering of Optical Flow Fields," Proc. Int'l Conf. Computer Vision, pp. 1086-1093, Oct. 2003.
[29] J. Hoey and P. Poupart, "Solving POMDPs with Continuous or Large Discrete Observation Spaces," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 1332-1338, July 2005.
[30] J. Hoey, P. Poupart, C. Boutilier, and A. Mihailidis, "POMDP Models for Assistive Technology," Proc. AAAI Fall Symp. Caring Machines: AI in Eldercare, 2005.
[31] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier, "SPUDD: Stochastic Planning Using Decision Diagrams," Proc. Uncertainty in Artificial Intelligence, pp. 279-288, 1999.
[32] E. Hunter, J. Schlenzig, and R. Jain, "Posture Estimation in Reduced-Model Gesture Input Systems," Proc. Int'l Workshop Automatic Face- and Gesture-Recognition, pp. 290-295, 1995.
[33] T. Jebara and A.P. Pentland, "Action Reaction Learning: Automatic Visual Analysis and Synthesis of Interactive Behaviour," Proc. Int'l Conf. Vision Systems, pp. 273-292, 1999.
[34] R.M. Krauss and S. Glucksberg, "Social and Nonsocial Speech," Scientific Am., vol. 236, pp. 100-105, 1977.
[35] A. Lanitis, C.J. Taylor, and T.F. Cootes, "Automatic Interpretation and Coding of Face Images Using Flexible Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 743-756, July 1997.
[36] C. Li and G. Biswas, "Temporal Pattern Generation Using Hidden Markov Model Based Unsupervised Classification," Advances in Intelligent Data Analysis, 1999.
[37] S.X. Liao and M. Pawlak, "On the Accuracy of Zernike Moments for Image Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1358-1364, Dec. 1998.
[38] D. McNeill, Hand and Mind: What Gestures Reveal about Thought. Univ. of Chicago Press, 1992.
[39] K.P. Murphy, "Dynamic Bayesian Networks: Representation, Inference and Learning," PhD thesis, Computer Science Division, Univ. of California, Berkeley, July 2002.
[40] N. Oliver, A. Garg, and E. Horvitz, "Layered Representations for Learning and Inferring Office Activity from Multiple Sensory Channels," Int'l J. Computer Vision and Image Understanding, vol. 96, pp. 163-180, 2004.
[41] T. Paek and E. Horvitz, "Conversation as Action under Uncertainty," Proc. Conf. Uncertainty in Artificial Intelligence, June 2000.
[42] A. Pentland, "Looking at People: Sensing for Ubiquitous and Wearable Computing," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 107-119, Jan. 2000.
[43] P. Poupart and C. Boutilier, "Value-Directed Compression of POMDPs," Advances in Neural Information Processing Systems, vol. 15, pp. 1547-1554, 2003.
[44] A. Prata and W.V.T. Rusch, "Algorithm for Computation of Zernike Polynomials Expansion Coefficients," Applied Optics, vol. 28, no. 4, pp. 749-754, Feb. 1989.
[45] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[46] J.A. Russell and J.M. Fernandez-Dols, "What Does Facial Expression Mean?" The Psychology of Facial Expression, pp. 3-30, 1997.
[47] J. Schlenzig, E. Hunter, and R. Jain, "Vision Based Hand Gesture Interpretation Using Recursive Estimation," Proc. Asilomar Conf. Signals, Systems, and Computation, pp. 394-399, Oct. 1994.
[48] E.P. Simoncelli, E.H. Adelson, and D.J. Heeger, "Probability Distributions of Optical Flow," Proc. Int'l Conf. Computer Vision and Pattern Recognition, pp. 310-315, 1991.
[49] P. Smyth, "Clustering Sequences with Hidden Markov Models," Advances in Neural Information Processing Systems, vol. 10, 1997.
[50] M.T.J. Spaan and N. Vlassis, "Perseus: Randomized Point-Based Value Iteration for POMDPs," J. Artificial Intelligence Research, vol. 24, pp. 195-220, 2005.
[51] T. Starner and A.P. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models," Proc. Int'l Workshop Automatic Face and Gesture Recognition, pp. 189-194, 1995.
[52] R. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[53] C.-H. Teh and R.T. Chin, "On Image Analysis by the Methods of Moments," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, no. 4, pp. 496-513, July 1988.
[54] S. Thrun, "Probabilistic Algorithms in Robotics," AI Magazine, vol. 21, no. 4, pp. 93-109, 2000.
[55] Y. Tian, T. Kanade, and J.F. Cohn, "Recognizing Action Units for Facial Expression Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, Feb. 2001.
[56] M. Turk and A.P. Pentland, "Eigenfaces for Recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[57] J. Williams, P. Poupart, and S. Young, "Factored Partially Observable Markov Decision Processes for Dialogue Management," Proc. IJCAI Workshop Knowledge and Reasoning in Practical Dialogue Systems, pp. 76-82, Aug. 2005.

Jesse Hoey received the BSc degree in physics from McGill University in Montreal, Canada, and the MSc degree in physics and the PhD degree in computer science from the University of British Columbia, Vancouver, Canada. He is a lecturer in the School of Computing at the University of Dundee, Scotland, and an adjunct scientist at the Toronto Rehabilitation Institute in Toronto, Canada. His research focuses on planning and acting in large-scale, real-world uncertain domains using video observations. In particular, he works on applying decision theoretic planning, computer vision, and machine learning techniques to adaptive assistive technologies in health care.

James J. Little received the AB degree from Harvard College in 1972 and the MSc and PhD degrees in computer science from the University of British Columbia in 1980 and 1985. From 1985 to 1988, he was a research scientist at the MIT Artificial Intelligence Laboratory. Currently, he is a professor of computer science at the University of British Columbia. His research interests include computational vision, robotics, and spatio-temporal information systems. Particular interests are stereo, motion, tracking, mapping, and motion interpretation. He is a member of the IEEE and the IEEE Computer Society.

