
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 9, SEPTEMBER 2011 2401

Probabilistic Image Modeling With an Extended Chain Graph for Human Activity Recognition and Image Segmentation

Lei Zhang, Member, IEEE, Zhi Zeng, and Qiang Ji, Senior Member, IEEE

Abstract—Chain graph (CG) is a hybrid probabilistic graphical model (PGM) capable of modeling heterogeneous relationships among random variables. So far, however, its application in image and video analysis is very limited due to the lack of principled learning and inference methods for a CG of general topology. To overcome this limitation, we introduce methods to extend the conventional chain-like CG model to a CG model with more general topology, along with the associated methods for learning and inference in such a general CG model. Specifically, we propose techniques to systematically construct a generally structured CG, to parameterize this model, to derive its joint probability distribution, to perform joint parameter learning, and to perform probabilistic inference in this model. To demonstrate the utility of such an extended CG, we apply it to two challenging image and video analysis problems: human activity recognition and image segmentation. The experimental results show improved performance of the extended CG model over the conventional directed or undirected PGMs. This study demonstrates the promise of the extended CG for effective modeling and inference of complex real-world problems.

Index Terms—Activity recognition, Bayesian networks (BNs), chain graph (CG), factor graph (FG), graphical model learning and inference, image segmentation, Markov random fields (MRFs).

I. INTRODUCTION

PROBABILISTIC graphical models (PGMs) have been developed as a powerful modeling tool. They provide a systematic way to capture various probabilistic relationships among random variables and provide principled methods for learning and inference. PGMs can be divided into two classes: undirected PGMs and directed acyclic PGMs. Examples of undirected PGMs include Markov random fields (MRFs) [1], [2] and conditional random fields (CRFs) [3]; they mainly capture mutually dependent relationships, such as the spatial correlations among random variables. On the other hand, PGMs such as Bayesian networks (BNs) [4], [5] and hidden Markov models (HMMs) [6] are directed acyclic PGMs, and they typically model the causal relationships among random variables. Both types of PGMs have been exploited to solve

Manuscript received December 29, 2009; revised October 01, 2010; accepted February 26, 2011. Date of publication March 17, 2011; current version August 19, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ferran Marques.

L. Zhang is with the UtopiaCompression Corporation, Los Angeles, CA 90064 USA (e-mail: [email protected]).

Z. Zeng and Q. Ji are with the Rensselaer Polytechnic Institute, Troy, NY 12180 USA (e-mail: [email protected], [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2011.2128332

image and video analysis problems. In fact, MRFs have become a de facto modeling framework for image segmentation, while HMMs have become standard tools for motion analysis and activity modeling.

Despite their widespread use in image and video analysis, both undirected PGMs and directed PGMs have certain limitations regarding their modeling capability. Neither of them can effectively capture heterogeneous relationships. For example, undirected PGMs usually capture mutual interactions among random variables, while directed PGMs typically model cause-and-effect relationships (i.e., causality). However, for many image modeling problems, the relationships among random variables are often heterogeneous. For example, for multiscale image segmentation, the relationships among related image entities in different layers (corresponding to different scales) may be best modeled by directed links, while the relationships among the entities in the same layer may be best modeled by undirected links. Given these limitations, and the fact that there are typically complex and heterogeneous relationships among the many entities involved in image modeling, there is a need for a single unified framework that can simultaneously capture all of these relationships and exploit them to solve problems in a systematic and principled manner.

Chain graph (CG) [7] is a natural solution to the aforementioned problems. It is a hybrid PGM that consists of both directed and undirected links; the CG therefore subsumes both directed and undirected PGMs. Its representation is powerful enough to capture heterogeneous relationships [8]. While a CG can theoretically assume any graphical topology, the conventional CG typically assumes a chain-like structure, which tends to limit its application scope. Thus far, CG applications in real-world problems are still very limited. Among the CG models that have been used, most have simplified structures and cannot fully exploit the modeling potential of the CG. Moreover, the lack of principled methods for parameter learning and inference in a complex CG model further limits its practical utility. To overcome these limitations, this research extends the conventional chain-like CG to a CG of more general topology. In addition, we introduce principled methods for learning and inference in such an extended CG. We demonstrate the effectiveness of this extended CG model in two different image and video analysis applications.

II. RELATED WORKS

Despite the CG’s powerful representation capability and its generalization over directed and undirected PGMs, prior CG applications in real-world problems are very limited. Based on their

1057-7149/$26.00 © 2011 IEEE



topology, existing hybrid graphical models can be divided into two main categories. The first category consists of a directed graphical model and an undirected graphical model, typically connected in series or stacked on top of each other through some common nodes [9], [10]. The learning and inference for such models are typically done separately for the directed and undirected models. In [9], Liu et al. combine a BN with an MRF for image segmentation. A naïve BN is used to transform the image features into a probability map in the image domain. The MRF enforces the a priori spatial relationships and the local homogeneity among image labels. In our previous work [10], we introduced a hybrid model by stacking a CRF on top of a BN for image segmentation, where the BN is used to capture the causal relationships and the CRF is used to capture the correlations among image entities. In both [9] and [10], the model parameters are learned separately for the BN and MRF (or CRF) parts. Moreover, the inference in [9] is performed sequentially, with inference results from the BN part fed into the MRF part to perform further inference. This separate and sequential inference is not theoretically justified and represents only an approximation to simultaneous inference.

The second category of existing models comprises chain-like models, typically involving a chain of undirected models connected via directed links. Murino et al. [11] formulate a chain of MRF models connected by a BN for image processing. The BN is used to capture the a priori constraints between different abstraction levels. A coupled MRF is used to solve the combined restoration and segmentation problem at each level. They do not address the parameter learning issue and empirically specify the potential functions. For inference, they combine belief propagation with simulated annealing through sampling in a sequential manner from layer to layer. Such a sequential inference process represents an approximation to simultaneous inference in the whole model. Chardin and Pérez proposed a similar model in [12], using a directed quadtree to connect MRF models on a lattice to form a hybrid hierarchical model. They developed an EM-based algorithm to learn the model parameters, where they leverage the specific tree structure to develop the learning algorithm and use Gibbs sampling to approximately update some parameters for modeling the spatial relationships. For inference, they employ Gibbs sampling to maximize the posterior marginal probability.

Hinton et al. developed a deep belief net (DBFN1) in several works (e.g., [13]). Their DBFN uses the restricted Boltzmann machine (RBM) as the building block for each layer. Multiple layers are sequentially connected through directed links, where the lowest layer contains the observable variables and the other layers are hidden layers. Typically, there are no connections within a layer; however, some DBFN models allow the top two layers to be connected by undirected links [13]. In [14], they further combine the idea of the DBFN with MRFs to formulate a deep network with causally linked MRFs, which allows each layer to be an MRF. In these specific cases, the deep network becomes a hybrid model. In [13] and [15], they studied the learning issue for such deep networks.

1The abbreviation is traditionally DBN. We use DBFN instead in order to avoid confusion with another DBN (i.e., the dynamic Bayesian network).

Hinton’s model differs from the CG model introduced in this paper in several aspects. First, the DBFN model is constructed by connecting several RBMs or MRFs at different layers using directed links, whereas in our CG model both the directed and undirected links can be within the same layer or between different layers, and they do not have to be linked like a hierarchical chain. Second, the DBFN approximates the true posterior distribution of the hidden nodes by implicitly assuming that the posterior distribution of each node is independent of the others (due to the lack of lateral links) and exploits this independence to learn the weights in the RBMs. In contrast, we do not make this assumption; we derive the joint probability distribution based on the CG structure using the general global Markov property [7]. Third, the DBFN is usually learned by greedy layer-by-layer learning [13], [15]. The learning starts from the bottom layer, since the bottom layer comprises all observed variables. The learned hidden states of one RBM are then used as the observed variables for learning the next layer, and this process iterates until all layers are learned. In contrast, our approach learns all model parameters together. Fourth, a variational approach is usually used to perform approximate inference in the DBFN. In contrast, we convert our CG model into a factor graph (FG) [16] representation so that we can apply various principled methods, either exact or approximate, to perform inference.

In summary, the current CG models for real applications are limited. Their topology either consists of simply stacked directed and undirected models or of a chain structure that connects several layers of undirected graphs. More importantly, the parameter learning for these models is typically performed separately for the undirected part and the directed part of the model, which ignores the fact that the global partition function in their formulations couples all of the model parameters. For inference, the existing methods tend to perform either separate or sequential inference instead of simultaneous inference. Moreover, some works simply apply the belief propagation theories developed for either directed or undirected PGMs to CGs without theoretical justification. In this paper, we intend to overcome these limitations.

III. EXTENDED CG MODEL

Here, we formally introduce the extended CG model as well as the associated methods for learning and inference in such a hybrid graphical model.

A. Extended CG Modeling

We first illustrate the construction and parametrization of the extended CG model with a relatively simple example and then provide the general formulation.

1) Model Construction: There are two principal strategies for constructing a graphical model. One is to automatically learn the model structure (i.e., structure learning) based on certain criteria. The other is to manually construct the model structure based on human prior knowledge about the specific problem at hand. While automatic structure learning might find a better model structure and improve performance, it is generally a very difficult problem even for a simple type of graphical model (e.g., a Bayesian network). In this work, we focus on manual construction of the CG model.

Fig. 1. The multiscale segmentation model consists of intra-layer directed links to capture the hierarchical causalities and inter-layer directed links pointing from the fine layer to the next coarse layer. In addition, intra-layer undirected links capture the spatial correlations between region labels.

To manually construct the CG model, we choose either directed links or undirected links to capture the relationships between random variables based on the nature of these relationships and their semantic meanings. For example, if the relationships can be characterized as causal or one-way, directed links can be used to capture them. On the other hand, if the relationships are mutual or two-way, undirected links can be chosen. In cases where both directed and undirected links are applicable, we choose the one that simplifies the overall model structure. Other types of relationships can be selectively modeled according to the complexity of the constructed model and consideration of the parameter learning difficulty.
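As a concrete illustration of this construction step, the hybrid structure can be sketched in code as a pair of tagged edge lists, one for causal (directed) links and one for mutual (undirected) links. This is a minimal sketch; the node names and the `ChainGraph` container are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of a chain-graph structure as tagged edge lists.
# Node names ("r1", "r2", "c1") are illustrative only.
from dataclasses import dataclass, field


@dataclass
class ChainGraph:
    nodes: set = field(default_factory=set)
    directed: list = field(default_factory=list)    # (parent, child) causal links
    undirected: list = field(default_factory=list)  # (a, b) mutual links

    def add_directed(self, parent, child):
        self.nodes |= {parent, child}
        self.directed.append((parent, child))

    def add_undirected(self, a, b):
        self.nodes |= {a, b}
        self.undirected.append(tuple(sorted((a, b))))


# Two-layer example in the spirit of Fig. 1: fine-layer region labels
# r1, r2 induce a coarse-layer label c1 (directed links), while labels
# within the fine layer are spatially correlated (undirected link).
g = ChainGraph()
g.add_directed("r1", "c1")
g.add_directed("r2", "c1")
g.add_undirected("r1", "r2")
```

Keeping the two link types separate mirrors the modeling decision described above: each relationship is recorded with the semantics (causal versus mutual) chosen for it.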

We use a multiscale image segmentation problem to explain the construction of a CG model (see Fig. 1). Different image entities are involved in image segmentation, such as regions, edges, and vertices, and their relationships are heterogeneous. We capture these relationships in the multiscale CG model through different types of links. There are some natural causalities between the image entities. First, two adjacent regions intersect to form an edge; if these regions have different labels, they form (cause) a boundary between them. Second, multiple edges intersect to form a vertex. Third, the region labels at the fine layer induce the region labels at the coarse layer. We use directed links to capture these causalities. Besides the causalities, there are other useful contextual relationships between image entities, such as the spatial relationships, and we use undirected links to model them.

2) Model Parametrization: Model parametrization consists of parameterizing the links in the CG model and deriving the represented joint probability distribution (JPD). A CG model consists of both directed links and undirected links. We can parameterize the relationships represented by these links using either potential functions or conditional probabilities. In general, undirected links are parameterized by local potential functions, while directed links are parameterized by local conditional probabilities. However, there are more complex cases that need to be specifically considered during parametrization.

We use the example in Fig. 2 for illustration. In this model, we use local conditional probabilities to parameterize the directed links. For example, the relationship between a node and its parents is parameterized by the corresponding conditional probability.

Fig. 2. Example of a CG model and its parametrization.

Fig. 3. Directed master graph for the CG model in Fig. 2.

On the other hand, we use pairwise potentials to parameterize the undirected links. For example, the relationship between two undirectedly linked nodes is parameterized by a pairwise potential function. Other directed links and undirected links can be similarly parameterized. Some links are more complex to parameterize because a child node of a directed link may be associated with undirected links as well. In that case, we group the undirectedly connected children together and use a single conditional probability to parameterize the links between this group and its parents. Other such groups in Fig. 2 are parameterized in the same way.

Given the CG structure and its parametrization, we need to derive the JPD of all random variables. We use a method analogous to the one in [17] to derive the JPD. The main idea is to first create a directed master graph and then create undirected component subgraphs for some terms in the JPD of the master graph. The component subgraphs form a coarse partition of the variables in the CG, where the subgraphs induced by the partition are maximally locally connected undirected subgraphs. The master graph is a directed graph whose nodes are component subgraphs (or singleton nodes) and whose directed arcs connect one component subgraph to another if a variable in the first has a child in the second in the original graph. The JPD of the CG model can finally be derived from the JPD of the master graph and those of the component subgraphs.

Overall, there are three steps to derive the JPD. We still use the example in Fig. 2 to illustrate these steps.

3) Step 1: We create a directed master graph, whose nodes result from maximally grouping subsets of undirectedly connected nodes in the original graph. We call these grouped nodes composite nodes. In Fig. 2, each subset of nodes connected by undirected links is grouped into one composite node. The created master graph is illustrated in Fig. 3.
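Step 1 amounts to finding the connected components of the undirected part of the graph (the composite nodes) and then linking those components by the directed arcs they inherit from the original graph. A minimal sketch of this grouping, with hypothetical node names:

```python
# Sketch of Step 1: group undirectedly connected nodes into composite
# nodes (connected components over the undirected edges only), then
# connect the components with directed arcs inherited from the graph.
from collections import defaultdict


def chain_components(nodes, undirected):
    """Map each node to the frozenset of its undirected component."""
    adj = defaultdict(set)
    for a, b in undirected:
        adj[a].add(b)
        adj[b].add(a)
    comp, seen = {}, set()
    for start in nodes:
        if start in seen:
            continue
        stack, members = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            members.append(v)
            stack.extend(adj[v])
        cid = frozenset(members)
        for v in members:
            comp[v] = cid
    return comp


def master_graph(nodes, directed, undirected):
    comp = chain_components(nodes, undirected)
    arcs = {(comp[p], comp[c]) for p, c in directed if comp[p] != comp[c]}
    return set(comp.values()), arcs


# Hypothetical example: x1 -> x2, x2 - x3 (undirected), x3 -> x4.
nodes = {"x1", "x2", "x3", "x4"}
comps, arcs = master_graph(nodes, [("x1", "x2"), ("x3", "x4")], [("x2", "x3")])
# comps are {x1}, {x2, x3}, {x4}; arcs run {x1} -> {x2, x3} -> {x4}
```

The resulting component set and arc set play the role of the master graph's nodes and directed links.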



Fig. 4. Subgraphs for the composite-node terms in the JPD of the master graph for the example in Fig. 2.

We then derive the JPD of the master graph based on the Markov property of a directed graphical model (e.g., a BN). For the master graph in Fig. 3, its JPD is factored as

(1)

We can further simplify the third, fourth, and fifth terms based on the conditional independence in the original graph, as ascertained by the global Markov property (see [7]). The JPD of the master graph is further simplified as

(2)

4) Step 2: We create a subgraph for each term in the JPD of the master graph that corresponds to a composite node. Such a term corresponds to either the joint probability or the local conditional probability of the composite node. The subgraph is an undirected graphical model. For the joint probability term in (2), we construct an undirected subgraph, as shown in Fig. 4(a). For the other conditional probability terms of composite nodes in (2), we construct conditional networks, as shown in Fig. 4(b) and (c). Here, a conditional network is an undirected graph with shaded nodes representing the conditioning variables. Please note that the relationships between conditioning variables and non-conditioning variables are undirected even though the corresponding links in the original graph may be directed, and hence these links are generally parameterized by potential functions. The same representation and parametrization applies to the other such links in Fig. 4.

We then derive the JPD of each subgraph. For the undirected subgraphs in Fig. 4, their JPDs can be derived based on the Hammersley–Clifford theorem [18] as a product of potential functions normalized by a local normalization function

(3)

where the three local normalization functions can be calculated by marginalization. For example,

(4)

where the parameters of each pairwise potential function appear explicitly. By marginalization, each local normalization function becomes a function of its conditioning variables (if any) and the related parameters of the potential functions.
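The local normalization in (4) can be made concrete for discrete variables: sum the product of the subgraph's pairwise potentials over all joint states of its variables. A minimal sketch, assuming binary variables and a made-up smoothness-style potential table (not the paper's):

```python
# Sketch of a local normalization function for a small undirected
# subgraph: sum the product of pairwise potentials over all joint
# states of its (binary) variables. The potential table is illustrative.
from itertools import product


def local_normalization(variables, potentials, num_states=2):
    """potentials: list of ((var_a, var_b), table) where
    table[(s_a, s_b)] is the potential value for that state pair."""
    total = 0.0
    for states in product(range(num_states), repeat=len(variables)):
        assign = dict(zip(variables, states))
        weight = 1.0
        for (a, b), table in potentials:
            weight *= table[(assign[a], assign[b])]
        total += weight
    return total


# Agreeing label pairs get weight 2.0, disagreeing pairs weight 1.0.
table = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
z = local_normalization(["y1", "y2"], [(("y1", "y2"), table)])
# z = 2.0 + 1.0 + 1.0 + 2.0 = 6.0
```

Because the sum runs only over the subgraph's own variables, the result depends only on that subgraph's potential parameters (and any conditioning variables held fixed), which is exactly what keeps the normalization local.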

5) Step 3: Finally, we can derive the JPD of the original CG model by substituting the JPDs of the subgraphs into the JPD of the master graph. The JPD is factored as

(5)

where the additional factors represent the local normalizations.

The JPD of the CG model is thus factored as the product of local potential functions and local conditional probabilities, normalized by the global partition function. Please note that the global partition function here is only a function of the parameters of some of the potential functions. This is different from ignoring the local normalization functions and performing the global normalization only through the global partition function, which would be calculated by marginalizing the product over all random variables. In the latter case, the global partition function would couple all of the parameters of the potential functions and would be much more difficult to calculate.

6) General CG Model: Now, we can present a general formulation of the proposed CG model. We first introduce our notation. Let $\mathbf{X} = \{x_1, \ldots, x_n\}$ denote the set of random variables in the model. As illustrated in the above example, the JPD of all random variables in our CG model has a factored formulation. Specifically, it is the product of conditional probabilities and potential functions normalized by the (local and global) partition functions. We assume our CG model follows the fundamental property of a CG, i.e., it complies with the global Markov property (see [7]) that ascertains the conditional independence among random variables based on the graphical model structure. Since it is a CG model, it also follows the global acyclicity property, i.e., there are no directed cycles in the graph. Under these assumptions, without loss of generality and considering up to pairwise potential functions, the JPD of our CG model can be generally formulated as

$P(\mathbf{X}) = \frac{1}{Z}\prod_{(i,j)} \psi_{ij}(x_i, x_j; \theta_{ij}) \prod_{m} f_m(\mathbf{x}_{S_m}; \theta_{S_m}) \prod_{k} P\big(x_k \mid \mathrm{pa}(x_k)\big)$   (6)

Page 5: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 9 ...qji/Papers/extended_chain_graph.pdf · IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 9, SEPTEMBER 2011 2401 Probabilistic

ZHANG et al.: PROBABILISTIC IMAGE MODELING WITH AN EXTENDED CHAIN GRAPH FOR HUMAN ACTIVITY RECOGNITION AND IMAGE SEGMENTATION 2405

where $\psi_{ij}(x_i, x_j; \theta_{ij})$ is the pairwise potential function that models the interaction between adjacent nodes $x_i$ and $x_j$, and $\theta_{ij}$ is the parameter of that pairwise potential function. The factor function $f_m$ results from the local normalization in the undirected subgraphs, and it is a function of a subset of variables $\mathbf{x}_{S_m}$ (i.e., the conditioning variables) and a subset of potential function parameters $\theta_{S_m}$. The last term is the local conditional probability of the node $x_k$, where $\mathrm{pa}(x_k)$ denotes the parent nodes of $x_k$. It should be mentioned that the extended CG model is not limited to pairwise potential functions. In fact, in the human activity recognition application (cf. Section IV), we have used triplet potential functions (i.e., triplet cliques) as well.

The nodes in our CG model can be grouped into three types. The first type involves nodes associated only with potential functions. The second type involves nodes associated with a local normalization factor $f_m$, i.e., the nodes in the set $S_m$, which correspond to the conditioning variables. The third type involves nodes associated with a local conditional probability $P(x_k \mid \mathrm{pa}(x_k))$; this type of node is usually the child of directed links and is not connected with any undirected links. It should be noted that some nodes could be associated with multiple terms in the above general formulation. For a specific problem, it is also possible that not all of the aforementioned terms exist; the JPD might therefore have a simpler formulation in such cases. Finally, the following discussion of parameter learning and inference, as well as the current implementation of the general CG model in (6), assumes discrete random variables.
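For discrete variables, a JPD of this general factored form can be evaluated term by term: multiply the pairwise potentials and conditional probabilities for a joint assignment, then divide by the global partition function. A minimal sketch, where the three-node model, its potential table, and its CPT are illustrative assumptions:

```python
# Sketch: evaluate a chain-graph JPD of the general factored form
# P(X) = (1/Z) * prod(pairwise potentials) * prod(conditional probs)
# for binary variables, by enumerating all joint states.
from itertools import product

# Hypothetical 3-node model: a - b undirected, a -> c directed.
psi_ab = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
cpt_c_given_a = {0: [0.9, 0.1], 1: [0.2, 0.8]}  # rows of P(c | a)


def score(a, b, c):
    """Unnormalized product of the model's factors."""
    return psi_ab[(a, b)] * cpt_c_given_a[a][c]


# Global partition function. Because each CPT row sums to one, Z here
# depends only on the potential table, echoing the decoupling property
# discussed in the text.
Z = sum(score(a, b, c) for a, b, c in product((0, 1), repeat=3))


def joint(a, b, c):
    return score(a, b, c) / Z


total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))
# total is 1.0 up to floating point
```

Enumerating all states is only feasible for tiny models; it is shown here to make the factored form concrete, not as a practical inference method.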

B. Parameter Learning

So far, we have introduced principled methods to construct the proposed CG model and derive its JPD. The next important issue is learning the model parameters. Parameter learning in undirected PGMs and directed PGMs has been separately studied in many previous works [5], [19], [20]. Specifically, parameter learning in BNs can usually be reduced to learning in a local graphical structure. Parameter learning in MRFs is more difficult, since the partition function (or its derivative) is usually hard to calculate because it requires marginalization over all random variables. Many approaches have been proposed to address this difficulty [19]–[22]; they basically simplify the optimization objective function by approximation and avoid exact calculation of the partition function or its derivative. Frey et al. [23] have given a detailed comparison of several learning and inference methods for PGMs.

Despite much work on parameter learning for undirected PGMs and directed PGMs, there are very few studies addressing parameter learning in a hybrid graphical model. Lauritzen has shown how to derive the maximum likelihood estimate (MLE) in a CG [7]. Buntine discussed how to approximate the global partition function and showed an example of learning a CG [24]. The work in [15] introduces an approach for parameter learning based on minimizing the “variational free energy.” This variational approach requires a factorial approximation of the true posterior distribution of the hidden variables given the observed variables. Another work [25] presents a learning method for an FG that requires a special canonical parametrization of the JPD. In the following, we present a principled parameter learning method for the proposed CG model. We will show that, for a generally structured CG model whose JPD is factored as the product of local conditional probability distributions (CPDs) and potential functions, we can learn the model parameters by combining analytical learning with numerical learning based on contrastive divergence (CD) using Gibbs sampling.

In the CG model formulated by (6), the conditional probability distributions and the parameters of the potential functions are the model parameters that should be learned. An important property of this model is that the global partition function is not a function of the CPDs; it is only a function of some potential function parameters. We will leverage this property for parameter learning.

We perform parameter learning using MLE. Assume we have a set of i.i.d. training data, where each sample contains the values of all random variables. The MLE aims at maximizing the log-likelihood of the parameters

(7)

where the full set of parameters in the model includes all CPDs and all potential function parameters.

Below, we show the detailed parameter learning approach assuming discrete random variables. The CPDs associated with discrete random variables become conditional probability tables (CPTs). Let each CPT entry denote the probability that a random variable takes a particular state while its parents take a particular joint state. The log-likelihood in (7) can be rewritten as

(8)

where the count of each configuration is the number of times that the variable takes a given state while its parents take a given joint state, which can be directly counted from the training data. We shall note that the log partition function can be a function of some potential function parameters. The log-likelihood is thus split into two parts. The first part is a function of the parameters of the potential functions. The second part is a function of the CPTs. When we maximize the log-likelihood w.r.t. the CPTs, only the second part in (8) matters. On the other hand, when we maximize it w.r.t. the parameters of the potential functions, only the first part in (8) has influence, but we need to consider both the potential functions and the log partition function.
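Because the original symbols of (8) did not survive extraction, the decomposition can be illustrated with assumed notation: let \(\theta_{ijk}\) denote the CPT entry for variable \(i\) in state \(k\) with parents in configuration \(j\), \(N_{ijk}\) its count in \(M\) training samples, and \(\mathbf{w}\) the potential function parameters. The split then reads:

```latex
\ell(\Theta)
  \;=\;
  \underbrace{\sum_{m=1}^{M}\sum_{c}\log\phi_c\!\bigl(\mathbf{x}^{(m)}_c;\mathbf{w}\bigr)
              \;-\; M\log Z(\mathbf{w})}_{\text{depends only on the potential parameters }\mathbf{w}}
  \;+\;
  \underbrace{\sum_{i,j,k} N_{ijk}\log\theta_{ijk}}_{\text{depends only on the CPTs}}
```

Maximizing w.r.t. the CPTs touches only the second term, while maximizing w.r.t. \(\mathbf{w}\) touches only the first, exactly as described above.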

1) Learning Conditional Probability Distributions: Parameter learning in the proposed CG model consists of two parts. First, the log-likelihood can be maximized with respect to the CPT parameters, i.e.,

(9)


where the constraint comes from the definition of a conditional probability. Maximizing (9) leads to the following analytical solution of the optimal parameters:

(10)

where the denominator is the total count of training samples in which the parents take the given joint state.
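As a concrete sketch of the counting solution in (10) (the function name, variable names, and toy data below are hypothetical, since the paper's own symbols were lost in extraction):

```python
from collections import Counter

def learn_cpt(samples, child, parents):
    """MLE of a CPT by counting, as in (10): the estimate for a (parent
    configuration, child state) pair is its count divided by the total
    count of that parent configuration."""
    joint = Counter()    # counts of (parent configuration, child state)
    parent = Counter()   # counts of the parent configuration alone
    for s in samples:
        j = tuple(s[p] for p in parents)
        joint[(j, s[child])] += 1
        parent[j] += 1
    return {jk: n / parent[jk[0]] for jk, n in joint.items()}

# hypothetical toy data: binary child A with a single binary parent B
data = [{"B": 0, "A": 0}, {"B": 0, "A": 0}, {"B": 0, "A": 1}, {"B": 1, "A": 1}]
cpt = learn_cpt(data, child="A", parents=["B"])
```

Here the counting pass is a single scan over the training data, which is why its cost is negligible compared with the sampling-based learning of the potential parameters below.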

2) Learning Parameters of Potential Functions: The log-likelihood should also be maximized with respect to the parameters of the potential functions. There are two cases for learning the potential function parameters.

3) Case 1: In this case, the global partition function is a function of only a few potential function parameters and can be analytically calculated. In the previous example of (5), the partition function is only a function of a single potential parameter. In such special cases, since both the partition function and its derivative can be analytically calculated, we can simply use exact MLE to learn the parameters. The partial derivative of the log-likelihood w.r.t. the parameter is calculated as (11).

(11)

4) Case 2: In this case, the global partition function is too complex and is a function of many potential function parameters, making it difficult to calculate analytically, very much like the partition function difficulty in learning undirected PGMs. We resort to an approximate learning approach, i.e., contrastive divergence (CD) learning [19], [26], to alleviate this difficulty. Instead of maximizing the exact log-likelihood, CD learning minimizes an alternative objective function, i.e., the contrastive divergence, which is the difference between two Kullback–Leibler (KL) divergences

(12)

where the first distribution is the empirical distribution represented by the training data; the second is the distribution over the reconstructions of the sampled data after a few full-step Markov chain Monte Carlo (MCMC) steps via Gibbs sampling; and the third is the true distribution represented by the model in (6). Please note that (12) assumes we already know all CPT parameters during the sampling.
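In standard CD notation (assumed here, as the original symbols did not survive extraction), with \(p^0\) the empirical distribution, \(p^n\) the distribution of the \(n\)-step Gibbs reconstructions, and \(p^\infty\) the model distribution, (12) reads:

```latex
\mathrm{CD}_n \;=\; \mathrm{KL}\bigl(p^0 \,\|\, p^\infty\bigr) \;-\; \mathrm{KL}\bigl(p^n \,\|\, p^\infty\bigr)
```

Since Gibbs sampling moves \(p^n\) closer in KL divergence to \(p^\infty\) than \(p^0\) is, the objective is nonnegative and vanishes when the model matches the data.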

We can use the gradient descent approach to minimize the contrastive divergence and solve for the optimal potential function parameters. Ignoring the detailed derivation, we can calculate the derivative of the contrastive divergence as

(13)

where the angle-bracket operator means the expectation w.r.t. the distribution indicated by the subscript. In this equation, we only need to calculate the expectation of the derivative w.r.t. the distribution represented by samples (either the training data or the reconstructed samples obtained through efficient Gibbs sampling). We calculate this expectation by substituting the values of the random variables from the samples and then normalizing the derivatives w.r.t. the total number of samples. CD learning finally updates the potential function parameters as

(14)

where the learning rate controls the size of the parameter updates. We shall note that learning the potential function parameters requires knowledge of the CPTs. Hence, the model parameters are learned jointly. Specifically, when we perform Gibbs sampling for a node, we need its conditional probability given all other variables, whose computation requires the CPTs that have been learned by the analytical learning. In general, we can calculate this conditional probability as (15), shown at the bottom of the page, where the summation in the denominator runs over all possible states of the random variable and the factors involve the children and parents of the node. Parameter learning in the CG model is summarized in Algorithms 1 and 2.

Algorithm 1 Parameter Learning in the Extended CG Model

Input: a set of complete training data.
Step 1: Learn the CPDs by counting the joint configurations of parent nodes and a child node (10).
Step 2: Randomly initialize the potential function parameters to certain values.
for each iteration up to the maximum number of iterations do
  for each training sample do
    Initialize MCMC sampling from that training sample.

(15)


    for each full Gibbs sweep do
      Perform one full-step Gibbs sampling based on the conditional probability (15) to update the states of the random variables. This process is summarized in Algorithm 2.
    end for
    Extract the last joint states of the random variables as one sample to represent the sample distribution.
  end for
  Use the extracted samples and the training data to update the potential function parameters according to (14).
end for
Output: all CPDs and the finally estimated parameters of the potential functions.

Algorithm 2 One full-step Gibbs Sampling

Step 1: Initialize the random variables to certain values. For example, these values can come from a training sample or from previously sampled data.
Step 2: Do the Gibbs sampling process.
for each of the unobserved random variables do
  (1) Compute its conditional probability given all other variables (15).
  (2) Sample its new state according to this conditional probability.
end for
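The interplay of Algorithms 1 and 2 can be sketched on a deliberately tiny model. Everything below is hypothetical: a single coupling parameter on two binary variables stands in for the potential parameters, and the toy per-node conditional stands in for (15); this is not the paper's model, only an illustration of CD learning with Gibbs sampling.

```python
import math
import random

def gibbs_sweep(x, w, rng):
    """One full-step Gibbs sweep (Algorithm 2) for a toy 2-node binary model
    p(x1, x2) ∝ exp(w * [x1 == x2]); the per-node conditional below stands
    in for (15)."""
    for i in (0, 1):
        other = x[1 - i]
        p_match = math.exp(w) / (math.exp(w) + 1.0)  # P(x_i equals the other node)
        x[i] = other if rng.random() < p_match else 1 - other
    return x

def cd1_learn(data, iters=200, eta=0.5, seed=0):
    """CD-1 learning (Algorithm 1) of the single coupling parameter w."""
    rng = random.Random(seed)
    w = 0.0
    match = lambda x: 1.0 if x[0] == x[1] else 0.0   # derivative of log-potential w.r.t. w
    data_mean = sum(match(x) for x in data) / len(data)
    for _ in range(iters):
        # one-step reconstructions, each initialized from a training sample
        recon = [gibbs_sweep(list(x), w, rng) for x in data]
        recon_mean = sum(match(x) for x in recon) / len(recon)
        w += eta * (data_mean - recon_mean)          # update (14)
    return w

# hypothetical training data: 4 of 5 pairs agree, so the learned coupling
# should come out positive (the exact MLE for this toy model is w = ln 4)
train = [[0, 0]] * 4 + [[0, 1]]
w = cd1_learn(train)
```

Note that the reconstructions are initialized from the training data rather than from random states, which is exactly why CD avoids running the chain to equilibrium.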

We can roughly estimate the computation required for parameter learning. The CPDs are learned by counting, whose time consumption can often be ignored. For learning the parameters of the potential functions, the dominant cost is the Gibbs sampling: the total computation grows with the number of iterations, the number of training samples, the number of full Gibbs sweeps per sample (typically 1 or 2), the number of unobserved variables involved in the potential functions, and the arithmetic cost of evaluating (15) once. This sampling dominates the entire parameter learning process.

Finally, CD learning avoids running MCMC sampling to equilibrium, which is typically very time consuming. In addition, empirical studies [27] have shown that CD learning typically converges well and the estimated parameters are close to the exact MLE results. CD learning has therefore been applied in many other works [19], [26] as well due to its efficiency and good empirical performance.

C. Probabilistic Inference

Given the CG model and its learned parameters, we perform probabilistic inference to solve the problem. The CG model consists of both directed links and undirected links, and to the best of our knowledge, direct inference in such a model is very difficult since existing inference methods for either directed or undirected PGMs may not be applicable. To solve this problem, we

Fig. 5. FG representation of the CG model in Fig. 2.

propose to convert the CG model into an FG representation [16] so that principled inference methods for FGs can be employed.

1) FG Representation of the CG Model: An FG is a bipartite graph that expresses a global function factored as the product of local functions over a set of variables. The FG consists of two types of nodes: variable nodes and factor nodes. Following the convention, each variable node is represented as a circle and each factor node as a filled square. Assume a global function defined on a set of variables is factored as a product of local functions, each defined on a subset of the variables. We can construct an FG to represent this function by adding variable nodes that correspond to the variables and factor nodes that correspond to the local functions. Undirected links are then added to connect each factor node with its arguments.

Our CG model represents a JPD that is factored as the product of potential functions and conditional probabilities. Given this factored JPD, we can easily convert the CG model into an FG representation. For example, we convert the model in Fig. 2 into the FG shown in Fig. 5 based on the derived JPD in (5). Each square in the FG corresponds to a factor that is a local function of its associated variables. For the general model represented by (6), the factors include the pairwise potentials, the factor functions, and the conditional probabilities.

2) Probabilistic Inference in FG: Given the FG representation, we can leverage principled inference methods developed for FGs. There are two major approaches to performing probabilistic inference in FGs. First, the sum-product algorithm can be used to efficiently calculate various marginal probabilities for either a single variable or a subset of variables [8], [16]. For any single variable in the FG, we can use the sum-product algorithm to calculate its marginal probability by summing the joint probability distribution over all other variables. Given the marginal probability of each variable, the optimal state of this variable can be found by using the Maximum Posterior Marginal (MPM) criterion [28]. Second, the max-product algorithm [8] can be used to find a setting of all variables that corresponds to the largest joint probability. Given some observed variables, we can find the optimal states of all other unobserved variables by maximizing their joint posterior probability. We refer readers to [8] for


Fig. 6. CG model for multisubject activity recognition: (a) structure of the static CG model (or prior model) and (b) structure of the transition model.

more detailed discussions of the sum-product and max-product algorithms. Besides the max-product algorithm, there are other algorithms [29], [30] that can also find the Most Probable Explanation (MPE) solution given certain evidence.

In general, the computational complexity of inference depends on the specific method used and the type of inference to be performed. For example, Hutter et al. [30] have compared several methods for MPE inference and shown their computational differences. For the sum-product algorithm, we roughly estimate the required computation as follows. Message passing runs until the process converges, i.e., until the beliefs of all nodes stop changing. For each link between a variable and a factor in the FG, two directional messages must be calculated. The cost of computing the two messages for one link depends on the maximal number of factors linked to a variable, the maximal number of variables linked to a factor, and the maximal number of states of a variable; the total computation grows linearly with the number of links and the number of iterations until convergence.
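To make the message passing concrete, here is a minimal sum-product computation on a hypothetical two-variable FG (two unary factors and one pairwise factor; all names and numbers below are illustrative). On such a tree-structured FG the algorithm is exact, so it must agree with brute-force marginalization:

```python
import itertools

# A hypothetical factor graph over two binary variables x1, x2:
#   p(x1, x2) ∝ g1(x1) * g2(x2) * h(x1, x2)
g1 = [0.6, 0.4]
g2 = [0.3, 0.7]
h = [[0.9, 0.1],
     [0.2, 0.8]]  # h[x1][x2]

def brute_force_marginal_x1():
    """Marginal of x1 by explicit summation over all joint states."""
    scores = [0.0, 0.0]
    for x1, x2 in itertools.product((0, 1), repeat=2):
        scores[x1] += g1[x1] * g2[x2] * h[x1][x2]
    z = sum(scores)
    return [s / z for s in scores]

def sum_product_marginal_x1():
    """Sum-product on this tree-structured FG: the pairwise factor sends x1
    the message m(x1) = sum_x2 h(x1, x2) * g2(x2); the belief at x1 is the
    normalized product of its incoming messages g1(x1) * m(x1)."""
    m = [sum(h[x1][x2] * g2[x2] for x2 in (0, 1)) for x1 in (0, 1)]
    belief = [g1[x1] * m[x1] for x1 in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]
```

The MPM criterion then simply picks the arg max of each such marginal.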

IV. APPLICATION TO HUMAN ACTIVITY RECOGNITION

To demonstrate the relevance of the proposed CG model for different image and video analysis applications, we applied it to two real-world problems: human activity recognition and 2-D image segmentation. The objectives of these applications are mainly to demonstrate the ability of the extended CG model to take into account different types of objects and their heterogeneous relationships for solving these challenging problems. These experiments, however, are not aimed at demonstrating the advantages of the proposed models over state-of-the-art methods in the human activity recognition and image segmentation domains. Such a comparative study, though important, is beyond the scope of this paper.

We first applied the CG model to the human activity recognition problem. Recognizing complex activities involving interactions among multiple subjects is challenging due to both the large variations of visual observations and the complex semantics of human activity. To alleviate these difficulties, it is essential to have an activity model that can explicitly capture and model heterogeneous relationships among elements of an activity and between different subjects at different levels of abstraction in both the space and time domains. HMMs [31], [32] and dynamic Bayesian networks (DBNs) [33], [34], although widely used in human activity recognition, are not well suited to effectively capture these heterogeneous relationships. In this section, we apply the proposed CG model and its dynamic extension, i.e., the dynamic CG, for modeling and recognizing complex human activities.

A. Model Construction

For this study, we are interested in recognizing activities involving interactions between two human subjects. Specific activities include shaking hands, talking while standing, chasing, boxing, and wrestling. We first introduce a static CG for modeling spatial relationships among elements of such activities. The static model is subsequently extended to capture their dynamic dependencies. Fig. 6(a) shows a static CG for activity modeling, where the human activity is abstracted at four levels: activity, individual actions of the subjects, states of the subjects, and image observations. One node denotes the activity; two nodes represent the actions of the two subjects; further nodes denote the states of shape, appearance, and motion of the subjects. The shaded nodes are, respectively, the observations of shape, appearance, and motion.

Relationships among these nodes are heterogeneous. Some relationships are asymmetrically causal, while others are mutual. The causal relationships include:

1) A complex multisubject activity induces individual basic actions. Such relationships are captured by the directed links between the activity node and the action nodes.

2) An individual action of a subject leads to the specific shape, appearance, and motion of that subject. This type of relationship is naturally captured by the directed links between the action node and the state nodes of a subject.
3) The basic states of the subjects generate their observations. Such relationships are captured by the directed links between the states and their corresponding measurements.

In addition, relationships that usually represent mutual interactions can be captured by undirected links among the states of the subjects. They include: 1) the interactions between the actions of multiple subjects, which are captured by the links among the action nodes of different subjects, and 2) the relationships among the shape, appearance, and motion of one subject under a specific action.

Moreover, to capture the dynamic aspect of an activity, the static CG model is further extended to a dynamic CG (see Fig. 6), which can be represented by a two-slice CG to reflect the dynamic evolution of an activity. The evolution is captured by the directed temporal links between corresponding nodes at consecutive time slices.


B. Joint Probability Distribution

To facilitate the discussion on the joint probability factorization, learning, and inference, the dynamic CG model can be broken down into two parts: the prior model and the transition model, as shown in Fig. 6. We use the theory introduced in Section III-A2 to derive the JPDs for both the prior model and the transition model. Ignoring the detailed derivation, the JPD of the transition model can be factored as

(16)

where the two variable sets denote all random variables in the current and previous time slices, respectively. Please note that we use triplet potential functions for this activity modeling. The local normalization functions are calculated as

(17)

Similarly, the JPD of the prior model can be factored as

(18)

where the variable set denotes all random variables in the first time slice, and the remaining terms are the local normalization functions.

We learn the CPT parameters by counting the frequency of joint configurations of a child node and its parent nodes in the training samples. We also learn the potential function parameters as in the first case explained in Section III-B2. After parameter learning, we use the sum-product algorithm to perform inference and estimate the marginal probability distributions of the activity node and action nodes in every time slice. We finally find the optimal activity and action states by maximizing the marginal probabilities.

C. Experimental Results

We evaluate our activity model on the task of recognizing five complex activities in daily life: shaking hands (SH), talking (TK), chasing (CH), boxing (BX), and wrestling (WR). These activities are conducted by two interacting subjects who perform five basic individual actions: standing, running, making a fist, clinching, and reaching out. The activity dataset2 consists of 15 video sequences. The five complex activities are sequentially performed in each sequence, so there are 15 samples for each activity. To obtain the observations for the subject states, we first perform motion detection to obtain their silhouettes. The shape of a subject is then measured by the width and height of the bounding box, the filling ratio (the area of the silhouette w.r.t. the area of the bounding box), and the moments of the motion silhouette. To compute the shape feature, the current approach assumes the silhouette is available from motion detection, for which the video data we used have a relatively static background. The measurement of the motion state is the global velocity of the subject. The appearance is captured by histogram of oriented gradient (HOG) features and histogram of optical flow (HOF) features.

Before learning the activity models, we first cluster the observations to obtain the labels for all subject states. The numbers of shape, motion, and appearance states are set to 3, 2, and 5, respectively, through experiments. Fig. 7 shows the influence of the number of states on the activity recognition performance. We can see that the performance is not sensitive to the number of shape or motion states, but when the number of appearance states is small, the performance drops. When studying the influence of the number of shape states, we fix the numbers of states for motion and appearance at 2 and 5, respectively. A similar strategy is applied to the sensitivity analysis on the number of states for motion and appearance.

The proposed CG activity model is compared with two DBNs and one static CG model. The first DBN [Fig. 8(a)] is constructed by removing all undirected links in the dynamic CG model, so the interactions between the subjects and the dependencies among the shape, motion, and appearance states are totally ignored. The second DBN [Fig. 8(b)] keeps all directed links but replaces the undirected links with directed links learned by a constrained hill-climbing algorithm (other parts of the model are fixed in the structure learning). Thus, the relationships between the two subjects or among the shape, motion, and appearance states, although not causal, are represented by directed links. In addition, to demonstrate the importance of

2 [Online]. Available: http://www.ecse.rpi.edu/homepages/cvrl/database/database.html.


Fig. 7. Sensitivity analysis on the number of states for shape, motion, and appearance of the subject.

Fig. 8. Structures of the two DBN models for comparative experiments. (a) DBN1. (b) DBN2.

modeling the temporal relationships, we also perform experiments using a static CG model, which has exactly the same structure as the prior model [Fig. 6(a)].

In the experiments, we use fivefold cross validation to evaluate the proposed CG activity model. Through online inference, we obtain the activity label and the action label of each subject at each frame of the testing sequences. To evaluate the recognition accuracy for a video sequence, we let each frame's activity label cast one vote and assign to the sequence the activity class that receives the highest vote.
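The frame-level voting step can be sketched as follows (the per-frame label strings below are hypothetical):

```python
from collections import Counter

def vote_sequence_label(frame_labels):
    """Assign to a test sequence the activity class that receives the most
    per-frame votes, as in the voting scheme described above."""
    return Counter(frame_labels).most_common(1)[0][0]

# hypothetical per-frame activity labels for one testing sequence
frames = ["BX", "BX", "WR", "BX", "SH", "BX"]
sequence_label = vote_sequence_label(frames)  # "BX" wins with 4 of 6 votes
```

This makes the sequence-level decision robust to isolated per-frame misclassifications.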

Table I compares the recognition performance of the dynamic CG model with the two DBNs and the static CG model. We can find that modeling the heterogeneous relationships in human activity with the dynamic CG model significantly improves the recognition accuracy at both the activity level and the action level. Compared with DBN1, which completely ignores the interaction between the two subjects and the dependencies among the states, our dynamic CG model achieves 21.3% higher recognition accuracy at the activity level and 18.7% higher accuracy at the action level. On the other hand, if we capture these dependencies with directed links learned from data, the recognition rate is 12% better than ignoring these dependencies. However, approximately representing these mutual interactions with directed links makes the recognition rate 9.3% worse than the dynamic CG model at the activity level and 14% worse on average at the action level.

TABLE I: COMPARISON OF THE DYNAMIC CG MODEL WITH TWO DBNS AND A STATIC CG MODEL FOR HUMAN ACTIVITY RECOGNITION

TABLE II: CONFUSION TABLES OF FOUR ACTIVITY MODELS. (A) DYNAMIC CG MODEL. (B) STATIC CG MODEL. (C) DBN1. (D) DBN2

This result shows the importance of modeling the different relationships in human activity recognition with appropriate link types, as well as the capability and flexibility of our CG model for this task. The static CG model, which ignores the temporal relationships in human activity, performs 5.3% worse than the dynamic CG model at the activity level and 6% worse at the action level. However, it still has higher recognition rates than both DBNs.

The detailed activity recognition results of the four models are summarized in Table II. We can observe that the dynamic CG model achieves an almost perfect recognition result except for three misclassifications between boxing and wrestling, which have quite similar attributes in some examples. In comparison, DBN1 has more misclassifications between these two activities and, in addition, several misclassifications between shaking hands and boxing and between chasing and wrestling, despite their differences in appearance and motion. DBN2 also has a few misclassifications between shaking hands and boxing. These tables demonstrate that the CG model not only yields improved overall recognition accuracy but also leads to improved discrimination among individual activities.

V. APPLICATION TO IMAGE SEGMENTATION

Besides human activity recognition, we also apply the proposed CG model to 2-D image segmentation problems. Specifically, we deal with a bi-layer segmentation problem, which segments the image into foreground and background. We use the CG model to capture heterogeneous relationships among multiple image entities, including regions, edges, and junctions, for effective image segmentation.

Given an image, it is first oversegmented into a set of smaller regions (i.e., superpixels). A region of pixels with the same label in the oversegmentation forms a superpixel. Multiple (more than two) superpixels with different labels intersect at a junction. Adjacent junctions are connected by edge segments. Our CG model captures the heterogeneous relationships among these superpixels, edges, and junctions. One set of variables denotes the labels of the superpixels, with superpixel features extracted from the image as their measurements. Further variables denote the labels of


Fig. 9. CG model for image segmentation. (a) Initially oversegmented image. (b) Part of the initial segmentation of image regions. (c) Part of the CG model for image segmentation that corresponds to (b).

edges and their measurements, and the labels of junctions and their measurements.

The label variables of superpixels, edges, and junctions are all binary random variables. A superpixel label indicates foreground or background; an edge label indicates whether or not the edge is on the object boundary; and a junction label indicates whether or not the junction is a corner on the boundary. We extract average color features in each superpixel as its region measurement and calculate the average gradient magnitude of an edge as its measurement. We also calculate the measurement of a junction according to the response of the Harris corner detector. This junction measurement is discretized into binary values (0 or 1) by a fixed threshold (1000) that is empirically determined.

There are various kinds of relationships among these image entities. We use Fig. 9 to explain them. First, adjacent superpixels with different labels can result in a boundary edge between them. Besides, multiple edges with different states result in a specific type of junction; for example, if two edges incident on a junction are boundary edges while the others are not, the junction forms a corner on the object boundary. In addition, each image entity produces its own image measurements. These relationships represent the causalities between different entities and can be readily modeled by directed links. On the other hand, there are spatial correlations between adjacent superpixels, which can be modeled by undirected links. Our segmentation model captures all of these heterogeneous relationships and therefore forms a CG model, as illustrated by Fig. 9(c).

Given the constructed CG model, image segmentation is formulated as a problem of finding the most probable labels of the superpixels and edges. Collecting all random variables in the model, their JPD is factored as

(19)

where the spatial neighborhood of each superpixel defines the pairwise terms and a unary potential is attached to each superpixel. For simplicity, we use a three-layer perceptron classifier to define the unary potential. The pairwise potentials are defined through the component-wise absolute difference of the neighboring superpixels' features, weighted by a weight vector, and the global partition function normalizes the distribution.
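One plausible reading of the pairwise potential (the exact expression was lost in extraction, so the form exp(-w^T |f_i - f_j|) below, like the weight values, is an assumption) rewards adjacent superpixels with similar features:

```python
import math

def pairwise_potential(f_i, f_j, w):
    """Assumed pairwise potential exp(-w^T |f_i - f_j|): similar features of
    adjacent superpixels give a value near 1, dissimilar ones a value near 0."""
    diff = [abs(a - b) for a, b in zip(f_i, f_j)]       # component-wise |f_i - f_j|
    return math.exp(-sum(wk * dk for wk, dk in zip(w, diff)))

w = [1.0, 1.0, 1.0]  # hypothetical learned weights, one per color channel
similar = pairwise_potential([0.2, 0.5, 0.1], [0.2, 0.5, 0.1], w)
dissimilar = pairwise_potential([0.2, 0.5, 0.1], [0.9, 0.1, 0.8], w)
```

Under this reading, the weight vector learned via CD controls how strongly feature similarity encourages neighboring superpixels to share a label.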

The above image segmentation model is a special case of the general CG model in (6). Given the training data, we apply the approach described in Section III-B1 to learn the CPTs and the approach for Case 2 described in Section III-B2 to learn the weight vector associated with the pairwise potentials. After parameter learning, we further convert the model into an FG representation and perform probabilistic inference in the FG to find the MPE solution, i.e.,

(20)

In the MPE solution, the superpixels with foreground labels form the foreground segmentation. Fig. 10 shows several typical image segmentation results on the Weizmann horse dataset [35]. The model successfully segmented the foreground objects (i.e., horses) in these images.

We quantitatively evaluate our segmentation results and roughly compare them with results produced by other approaches [36]–[38], as well as with the results produced by using our directed or undirected models alone. These results are summarized in Table III(a). In terms of overall labeling accuracy, our results are comparable to (or better than) the results produced by other related works. In Table III(a), we also show the performance using a CRF model alone and using a BN model alone. The CRF model has a structure corresponding to the undirected part of our CG model; its unary potentials and pairwise potentials are defined similarly to those in (19). The BN model has a structure corresponding to the directed part of our CG model; in addition, the region nodes are individually linked with their measurements. Comparing the performance of the different PGMs, the CG model outperforms either our CRF model alone or our BN model alone.

Since we discretize the vertex measurement by thresholding, it is also interesting to see how the threshold value influences the overall performance of our model. We varied the threshold value within a large range and redid the experiments on the Weizmann horse images. Table III(b) summarizes the quantitative results. We found that, within a large range of threshold values (from 200 to 6000), the average performance only changed by about 0.5%. These results show that the CG model is not very sensitive to this discretization process.

VI. CONCLUSION

In this paper, we propose an extended CG model that allows very general topology and introduce principled methods for learning and inference in this model. We systematically study


Fig. 10. Some examples of typical segmentation results. The first row shows the original images. The second row shows the corresponding segmentations.

TABLE III: QUANTITATIVE EXPERIMENTAL RESULTS FOR 2-D IMAGE SEGMENTATION. (A) THE QUANTITATIVE RESULTS OF OUR CG MODEL AND SEVERAL RELATED WORKS FOR SEGMENTING THE WEIZMANN HORSE IMAGES. THE AVERAGE PERCENTAGE OF CORRECTLY LABELED PIXELS, I.E., THE OVERALL LABELING ACCURACY, IS REPORTED. (B) THE OVERALL LABELING ACCURACIES FOR THE WEIZMANN HORSE IMAGES WHILE WE VARIED THE THRESHOLD VALUE FOR DISCRETIZING THE VERTEX MEASUREMENT

several important issues on the proposed CG model, including its model construction, parametrization, derivation of the represented JPD, and, most importantly, joint parameter learning and inference for this model. To demonstrate the capability of this extended CG model, we apply it to two challenging image and video analysis tasks: human activity recognition and image segmentation. Extended CG models are constructed to capture useful heterogeneous relationships among multiple entities for solving these problems. Our experiments show that the CG models outperform conventional undirected PGMs or directed PGMs. This demonstrates the applicability of the proposed CG model to different image and video analysis problems as well as its potential benefits over standard directed or undirected PGMs in improving classification and recognition performance.

The benefits of CGs over directed or undirected PGMs can be studied in terms of modeling accuracy and computational efficiency. For modeling accuracy, the proposed CG model is, in principle, superior to both directed and undirected PGMs. Both applications in this work demonstrate that the proposed CG model outperforms models based on either pure undirected PGMs or pure directed PGMs. We speculate that the main reason is that the proposed CG model can more correctly capture the complex and heterogeneous relationships in these applications. In contrast, undirected or directed PGMs can only approximately model these heterogeneous relationships, resulting in their inferior performance compared to the CG model. This performance inferiority is especially apparent for the activity recognition problem (Table I) because DBN1 ignores some relationships among the activity entities and DBN2 uses inappropriate links to represent those relationships; therefore, both models can only approximately model the relationships in human activity modeling. However, the exact benefits of the CG model over a directed or undirected PGM for a particular application, and their extent, are still hard to ascertain. They usually depend on the specific relationship that each link captures as well as the interactions among the links. Further research is needed to systematically study the benefits of the CG model over directed or undirected PGMs.

In terms of computational efficiency, compared to undirected PGMs, CG models should be more efficient in both learning and inference because of the presence of directed parts in the general CG models. The directed links separate the entire graph into smaller subsets of undirected graphs. This factorizes the JPD as the product of simpler components that require only local normalization. This factorization and local normalization significantly simplify learning and inference in CGs. Finally, through this paper, we introduce a general CG model to the computer vision and image processing community and demonstrate its utility for some image and video analysis tasks. It is our hope that the research community will further investigate the potential of such a framework, improve it, and apply it to more image and video analysis applications.
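The local-normalization argument can be made concrete with a toy example. The following sketch is our own illustration, not code from the paper: the variables (a directed parent A and a single undirected chain component {B, C}), the potential function, and all numbers are invented for demonstration.

```python
import itertools
import math

# Toy chain graph: directed parent A, undirected chain component {B, C}.
# The JPD factorizes as P(A) * P(B, C | A), where P(B, C | A) is a Gibbs
# distribution over the chain component, normalized *locally* per value of A.

P_A = {0: 0.3, 1: 0.7}  # prior for the directed part (arbitrary numbers)

def phi(b, c, a):
    # Hypothetical pairwise potential over {B, C}, conditioned on parent A.
    return math.exp(0.5 * (1 if b == c else -1) + 0.2 * a * (b + c))

def P_BC_given_A(b, c, a):
    # Local normalization: the partition function Z depends only on the
    # parent configuration a, not on the whole graph. This per-component
    # normalization is what simplifies learning and inference in CGs.
    Z = sum(phi(b2, c2, a) for b2, c2 in itertools.product([0, 1], repeat=2))
    return phi(b, c, a) / Z

def joint(a, b, c):
    return P_A[a] * P_BC_given_A(b, c, a)

# The locally normalized factors compose into a proper JPD (sums to 1),
# with no global partition function ever computed.
total = sum(joint(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
```

The point of the sketch is that each factor carries its own small normalizer `Z`, so no normalization over the entire graph is needed, in contrast to a purely undirected model whose partition function couples all variables.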

Lei Zhang (M’09) received the Ph.D. degree from Rensselaer Polytechnic Institute (RPI), Troy, NY.

He is currently with UtopiaCompression Corporation, Los Angeles, CA. His research area includes machine learning, computer vision, pattern recognition, and image processing. He has designed different probabilistic graphical models for solving various problems, including image segmentation, human body tracking, facial expression recognition, human activity recognition, medical image processing, multi-modal sensor fusion, etc. He has authored or coauthored over 22 papers in several international journals and conferences and book chapters in different domains. He serves as a reviewer for several computer vision and image processing journals.

Dr. Zhang is a member of Sigma Xi.

Zhi Zeng received the B.S. degree in electronic engineering from Fudan University, Shanghai, China, in 2003, the M.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2006, and the M.Eng. degree in electrical engineering from the Rensselaer Polytechnic Institute, Troy, NY, in 2009, where he is currently working toward the Ph.D. degree.

His research interests include machine learning, pattern recognition, computer vision, and operations research.

Qiang Ji (SM’04) received the Ph.D. degree in electrical engineering from the University of Washington, Seattle.

He is currently a Professor with the Electrical, Computer, and Systems Engineering Department, Rensselaer Polytechnic Institute (RPI), Troy, NY. He recently served as a Program Director with the National Science Foundation, where he managed computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada, and the U.S. Air Force Research Laboratory. He currently serves as the Director of the Intelligent Systems Laboratory (ISL), RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has authored or coauthored over 150 papers in journals and conferences. His research has been supported by major governmental agencies, including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies, including Honda and Boeing. He is an associate editor for several related IEEE and international journals, and he has served as a chair, technical area chair, and program committee member in numerous international conferences and workshops.