

    Spanning Tree Approximations for Conditional Random Fields

Patrick Pletscher
Department of Computer Science, ETH Zurich, Switzerland

Cheng Soon Ong
Department of Computer Science, ETH Zurich, Switzerland

Joachim M. Buhmann
Department of Computer Science, ETH Zurich, Switzerland

    Abstract

In this work we show that one can train Conditional Random Fields of intractable graphs effectively and efficiently by considering a mixture of random spanning trees of the underlying graphical model. Furthermore, we show how a maximum-likelihood estimator of such a training objective can subsequently be used for prediction on the full graph. We present experimental results which improve on the state of the art. Additionally, the training objective is less sensitive to the regularization than pseudo-likelihood based training approaches. We perform the experimental validation on two classes of data sets where structure is important: image denoising and multilabel classification.

    1 Introduction

In many applications of machine learning one is interested in a segmentation of the data into its subparts. Two areas where this is of immense interest include computer vision and natural language processing (NLP). In computer vision one is interested in a full segmentation of the images into, for example, foreground and background, or a more sophisticated labeling including semantic labels such as car or person. In NLP one may be interested in part-of-speech tagging. Usually in these applications an annotated and fully segmented set of training data is provided on which a classifier is trained. While it is possible to train a classifier for each atomic part independently, one expects improved accuracy by assuming an underlying structure arising from the application. This structure might correspond to a chain or parse tree for NLP, a grid for computer vision, or a complete graph for multilabel classification.

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

While probabilistic inference for simple structures like a chain or a tree is computationally tractable, more complex graphs like grids or complete graphs, which include loops, are computationally intractable. A discriminative model that has gained large popularity recently, and which allows for a principled integration of the structure, the data and the labels, is the Conditional Random Field (CRF) [Lafferty et al., 2001]. In this work we consider the training of CRFs for intractable graphs. We follow the notation of [Sutton and McCallum, 2006] and assume a factor graph $G$ where $\mathcal{C} = \{C_1, \ldots, C_p, \ldots, C_P\}$ is the set of clique templates, i.e., groups of factors that are parametrized in the same way. Each clique template $C_p$ consists of a subset of the factors of $G$. Given the observation $x$, the posterior of a complete segmentation $y$ has the form:

\[
P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Bigg( \sum_{C_p \in \mathcal{C}} \sum_{c \in C_p} \langle \theta_p, s_c(x, y_c) \rangle \Bigg). \tag{1}
\]

$\theta$ summarizes all the parameters of the model. $y_c$ is the subset of the segmentation that only involves the variables of factor $c$. The individual variables $y_i$ take values from a finite set $\mathcal{Y}$. $s_c(x, y_c)$ denotes the sufficient statistics for factor $c$ and $Z(x; \theta)$ denotes the partition sum, which normalizes the distribution:

\[
Z(x; \theta) = \sum_{y} \exp\Bigg( \sum_{C_p \in \mathcal{C}} \sum_{c \in C_p} \langle \theta_p, s_c(x, y_c) \rangle \Bigg).
\]

Alternatively, the distribution can be written as an exponential family distribution:

\[
P(y \mid x; \theta) = \exp\big( \langle \theta, s(x, y) \rangle - A(x; \theta) \big).
\]

Here, the individual sufficient statistics of factors belonging to the same clique template are summed up


and subsequently concatenated into one big vector $s(x, y)$ (in the corresponding order as in $\theta$). $A(x; \theta)$ denotes the log partition sum.

In the present work we only consider CRFs with factors that involve at most two label variables, i.e., a setting where a factor can also be represented as an edge in an undirected graphical model.
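To make the notation concrete, here is a minimal Python sketch (our own illustrative data layout, not code from the paper) that assembles the joint sufficient statistics $s(x, y)$ of such a pairwise model by summing the factor statistics within each clique template and concatenating the two blocks; the unnormalized log-score of a labeling is then just the inner product with $\theta$.

```python
import numpy as np

def joint_sufficient_statistics(node_stats, edge_stats, y):
    """Assemble s(x, y) for a pairwise CRF with two clique templates.

    node_stats[i]      : dict mapping label -> feature vector s_i(x, y_i)
    edge_stats[(i, j)] : dict mapping (label, label) -> feature vector s_ij(x, y_i, y_j)
    y                  : dict mapping node id -> label
    """
    # Template 1: sum the node factor statistics.
    s_nodes = sum(node_stats[i][y[i]] for i in node_stats)
    # Template 2: sum the edge factor statistics.
    s_edges = sum(edge_stats[(i, j)][(y[i], y[j])] for (i, j) in edge_stats)
    # Concatenate in the same order as the parameter vector theta = (theta_v, theta_e).
    return np.concatenate([s_nodes, s_edges])
```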

Training the CRF consists of finding the maximum-a-posteriori estimate $\theta^*$ given examples $\{(x^{(n)}, y^{(n)})\}_{n=1}^N$ and a Gaussian prior:

\[
\theta^* = \arg\max_{\theta} \prod_{n=1}^{N} P(y^{(n)} \mid x^{(n)}; \theta) \, \exp\Big( -\frac{\|\theta\|^2}{2\sigma^2} \Big).
\]

Taking the negative logarithm we obtain a convex optimization problem, a property of the exponential family distribution. In theory we are thus guaranteed to find the global minimum by following the gradient. However, computing the data log-likelihood itself or the exact gradient of the objective w.r.t. the parameters $\theta$ is intractable for general graphs with loops, since both computations involve the partition sum $Z(x; \theta)$. This makes learning at least as hard as probabilistic inference, which is computationally intractable in general.
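To see where the intractability enters, the following brute-force sketch (illustrative only, not the paper's code) computes $Z(x; \theta)$ by enumerating all $|\mathcal{Y}|^m$ labelings of a tiny model; this exponential enumeration is exactly what becomes infeasible for grids or complete graphs.

```python
import itertools
import math

def brute_force_log_partition(num_nodes, num_labels, log_potential):
    """log Z by exhaustive enumeration; feasible only for tiny graphs.

    log_potential(y) returns <theta, s(x, y)> for a full labeling y.
    """
    scores = [log_potential(y)
              for y in itertools.product(range(num_labels), repeat=num_nodes)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))
```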

    2 Related Work

An established approach is to train the model parameters for the pseudo-likelihood [Besag, 1975, Kumar and Hebert, 2006]. The pseudo-likelihood is a tractable approximation of the true likelihood. It models the unobserved variables independently, conditioned on the true labels of the Markov blanket. One problem often encountered with pseudo-likelihood is the overestimation of the edge parameters [Kumar and Hebert, 2006]. A heuristic used to relax this problem is to regularize only the edge parameters, and not the node parameters; this is then called the penalized pseudo-likelihood. An alternative approximate objective is piecewise training [Sutton and McCallum, 2005]. In piecewise training the parameters are trained over each clique individually and in a second step these parameters are combined to give a global model. Piecewise training has been shown to outperform pseudo-likelihood. However, we are unaware of any comparison to penalized pseudo-likelihood, which in our tests has performed much better than the standard pseudo-likelihood.

Another approach is to use an approximate inference algorithm, such as Loopy Belief Propagation [Yedidia et al., 2005] in combination with the Bethe free energy, to compute an approximation of the data log-likelihood (and possibly the gradient), see e.g. [Vishwanathan et al., 2006]. One problem here is that the learning might get trapped in poor local minima caused by the approximate inference. This behaviour has been studied in [Kulesza and Pereira, 2008], where it was shown that even for relatively good approximation guarantees of the inference algorithm, one might end up with poor parameters. Recently, a training scheme was proposed in which high-probability labelings are generated by an approximate MAP labeling algorithm; the objective of the training is then to find parameters that separate these high-probability labelings well from the ground-truth labeling [Szummer et al., 2008]. However, this approach optimizes a maximum-margin objective rather than trying to find a maximum-likelihood estimator. There have also been results on approximate training of structural SVMs in [Finley and Joachims, 2008], where it is shown that one can give approximation guarantees for the learning, given an approximation guarantee for the inference. The results presented there concern the max-margin objective and do not directly apply to the maximum-likelihood setting.

In this paper, we propose a novel approach that formulates an alternative training objective that is tractable, similar to the pseudo-likelihood. We use spanning trees to approximate the intractable graph. This is more promising since the training step considers globally normalized probability distributions on tractable subgraphs, instead of restricting itself to locally (per unobserved variable) normalized distributions as in pseudo-likelihood.

The use of tractable subgraphs for probabilistic inference in loopy graphs has been pioneered by Wainwright et al. [2003]. While this work gave important insights into approximations for probabilistic inference, it assumed that the model is given and no parameter learning is needed. Our work deals with the question of how to use tractable subgraphs in training CRFs. The rationale behind our training objective is that while we might constrain the model by ignoring certain dependencies in the data, at least we are not trapped in poor local minima created by approximations of the data log-likelihood during learning. In our approximate objective the effect of ignoring dependencies in the data should be mitigated by randomization and averaging over multiple trees.

Alternating iterative approaches, where in one step the mixture weights over the different tractable subgraphs are learned (as in Wainwright's work) and in a second step the model parameters for the given mixture weights are learned, might be beneficial. However, as with other approximate probabilistic inference approaches for parameter estimation, one would have to come up with a strategy for identifying local minima created by the approximation.


Figure 1: Examples of intractable graphs. Left: Grid graph often encountered in computer vision. Right: Complete graph in multilabel classification. For simplicity we do not show a factor graph, but an undirected graphical model, where the edges represent factors. The underlying graph is shown by dashed edges. We superimpose two examples of maximal loop-free coverings of the factors by a spanning tree. Note that loops going through observed nodes are allowed, as this does not pose any computational problems.

Our work is meant to explore what can be achieved by a simple uniform mixture of tractable subgraphs.

3 Training a CRF by a mixture of spanning trees

Our approximate training objective considers random spanning trees of the intractable graph. For a spanning tree we can compute all the required quantities exactly. The underlying assumption is that a random spanning tree still captures some of the important dependencies in the data. We thus propose the alternative (regularized) training objective:

\[
\theta^*_{SP} = \arg\max_{\theta} \prod_{n=1}^{N} \sum_{t \in sp(G^{(n)})} P(t, y^{(n)} \mid x^{(n)}; \theta) \, \exp\Big( -\frac{\|\theta\|^2}{2\sigma^2} \Big), \tag{2}
\]

with

\[
P(t, y^{(n)} \mid x^{(n)}; \theta) = P(y^{(n)} \mid x^{(n)}, t; \theta) \, P(t).
\]

Here $sp(G)$ denotes the set of all maximal coverings of the factors of $G$ such that no loops exist between unobserved variables. For factors of at most two unobserved variables this corresponds to the set of all spanning trees (on the unobserved variables) in an undirected graphical model. $P(y \mid x, t; \theta)$ is the same distribution as given in (1), restricted to only the factors of spanning tree $t$:

\[
P(y \mid x, t; \theta) = \frac{1}{Z(x, t; \theta)} \exp\Bigg( \sum_{C_p \in \mathcal{C}} \sum_{c \in C_p^t} \langle \theta_p, s_c(x, y_c) \rangle \Bigg),
\]

where $c \in C_p^t$ is a shorthand for $c \in C_p \wedge c \in t$. The partition sum $Z(x, t; \theta)$ must also be adapted correspondingly. Again we can write this distribution in an exponential family form:

\[
P(y \mid x, t; \theta) = \exp\big( \langle \theta, s(x, y, t) \rangle - A(x, t; \theta) \big).
\]

Here $s(x, y, t)$ is the sufficient statistics that we get by only summing up factors that are included in the spanning tree $t$. The objective in (2) can be motivated as follows: we consider a generative model for the labels in which first a random spanning tree of the undirected model is drawn; in a second step the labels of the tree are sampled according to a Gibbs distribution. In this setting the chosen tree is treated as a hidden variable. There are two key points to stress. First, $P(y \mid x, t; \theta)$ for a given spanning tree $t$ can be trained efficiently and exactly, as probabilistic inference for a tree is tractable. Second, the estimate $\theta^*_{SP}$ is not consistent for intractable graphs and will not converge to $\theta^*$ in such cases. The two training objectives consider two different problems, and the sufficient statistics $s(x, y)$ and $s(x, y, t)$ differ to a large extent in densely connected graphs. Nevertheless, we can construct from $\theta^*_{SP}$ an estimate of $\theta^*$; this is described in subsection 3.1. This then allows us to perform prediction on the full graph with approximate probabilistic inference, such as Loopy Belief Propagation. Two examples of maximal loop-free coverings are given in Figure 1.
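The tractable building block is exact inference on a single tree: on a tree, $Z(x, t; \theta)$ (and with it the exact likelihood and its marginals) can be computed by sum-product message passing in time linear in the number of variables. The sketch below illustrates this under our own assumptions about the data layout (per-node and per-edge log-potentials precomputed); it is not the authors' implementation.

```python
import math
from collections import defaultdict

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def tree_log_partition(nodes, edges, unary, pairwise):
    """Compute log Z(x, t; theta) of a pairwise model whose edges form a tree.

    nodes    : list of node ids
    edges    : list of tree edges (u, v)
    unary    : unary[i][a] = log-potential of node i taking label a
    pairwise : pairwise[(u, v)][a][b] = log-potential of (y_u = a, y_v = b)
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def edge_pot(u, v, a, b):
        # Edge potentials may be stored under either orientation of the edge.
        return pairwise[(u, v)][a][b] if (u, v) in pairwise else pairwise[(v, u)][b][a]

    root = nodes[0]
    labels = range(len(unary[root]))

    # Root the tree and record a parent-before-child visiting order.
    parent, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                stack.append(w)

    # Sum-product messages from the leaves towards the root.
    msg = {}  # msg[u][b] = message from u to its parent, for parent's label b
    for u in reversed(order):
        p = parent[u]
        if p is None:
            continue
        children = [c for c in adj[u] if c != p]
        msg[u] = [logsumexp([unary[u][a] + edge_pot(u, p, a, b)
                             + sum(msg[c][a] for c in children)
                             for a in labels])
                  for b in labels]

    return logsumexp([unary[root][b] + sum(msg[c][b] for c in adj[root])
                      for b in labels])
```

Differentiating this log partition sum w.r.t. the potentials yields the tree marginals that appear in the gradient below.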

So far we have converted an intractable problem into another hard problem, since many interesting problems have exponentially many spanning trees over which we would need to marginalize. Nevertheless, in theory it is possible to sum over all spanning trees in running time cubic in the number of variables by the Matrix-Tree Theorem [Tutte, 1984]. For many practical applications a cubic running time is too slow, and we thus decided to sample a small number of spanning trees $T \subset sp(G)$ uniformly at random by the algorithm described in [Wilson, 1996]. In our implementation we only sample the spanning trees once at the beginning of the training and then keep them fixed. We sample the spanning trees for each training example independently. However, alternative approaches where, e.g., a spanning tree is sampled anew at each iteration of the optimization are also possible. Ignoring the sampling of the spanning trees (whose expected running time is proportional to the mean hitting time), the running time of one optimization step is $O(|T| \, m \, |\mathcal{Y}|^2)$, where $m$ denotes the number of variables.
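For reference, Wilson's loop-erased random-walk sampler can be sketched as follows (a sketch assuming a connected undirected graph given as adjacency lists; the paper simply cites [Wilson, 1996] for this step):

```python
import random

def sample_spanning_tree(adj, rng=random):
    """Sample a uniformly random spanning tree of a connected undirected graph
    via Wilson's loop-erased random walk [Wilson, 1996].

    adj : dict mapping node id -> list of neighbouring node ids
    Returns a set of undirected tree edges (u, v) with u < v.
    """
    nodes = list(adj)
    in_tree = {rng.choice(nodes)}        # grow the tree from an arbitrary root
    tree_edges = set()
    for start in nodes:
        if start in in_tree:
            continue
        # Random walk from `start` until the current tree is hit, erasing loops.
        path, pos = [start], {start: 0}
        v = start
        while v not in in_tree:
            v = rng.choice(adj[v])
            if v in pos:                 # loop detected: erase it
                path = path[:pos[v] + 1]
                pos = {u: k for k, u in enumerate(path)}
            else:
                pos[v] = len(path)
                path.append(v)
        # Attach the loop-erased branch to the tree.
        for u, w in zip(path, path[1:]):
            tree_edges.add((min(u, w), max(u, w)))
        in_tree.update(path)
    return tree_edges
```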

To learn the parameters we use BFGS, a quasi-Newton method. For this we also need the derivative of the log-likelihood w.r.t. the parameters. For a single sample $(x, y)$ and without the regularizer it is given by


\[
\frac{\partial \ell(y \mid x; \theta)}{\partial \theta}
= \frac{\partial}{\partial \theta} \log \frac{\sum_{t} P(t) \exp\big( \langle \theta, s(x, y, t) \rangle \big)}{\sum_{t, y'} P(t) \exp\big( \langle \theta, s(x, y', t) \rangle \big)}
= \sum_{t} P(t \mid y, x; \theta) \, s(x, y, t) \; - \; \sum_{y', t} P(t, y' \mid x; \theta) \, s(x, y', t).
\]

The gradient has a similar form as in hidden CRFs [Quattoni et al., 2007] and can be reformulated as a combination of the features, weighted by the marginals. All of the quantities involved can be computed efficiently. The hidden variable in our framework is the spanning tree along which we are working. We are unable to prove convexity of (2), but we observed empirically in the experiments that we always converged to the same solution for different initializations.

There are two advantages of the objective in (2) when compared to pseudo-likelihood. First, it can capture longer-range dependencies between variables, not only dependencies between neighboring variables. Second, it is less sensitive to overestimation of the edge parameters, as the objective does indeed model a joint distribution of all the labels, whereas pseudo-likelihood models them independently, conditioned on the true labels of the neighboring variables. One disadvantage might arise from dropping certain dependencies in training, which can be prevented to a certain extent by dropping the dependencies at random.
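The resulting training loop is then standard quasi-Newton optimization of the regularized negative log-objective. Below is a minimal sketch using SciPy's L-BFGS as a stand-in for the BFGS variant used by the authors; `neg_log_lik_and_grad` is a hypothetical callback that evaluates the objective in (2) and its gradient via tree inference over the sampled spanning trees.

```python
import numpy as np
from scipy.optimize import minimize

def train_spanning_tree_crf(neg_log_lik_and_grad, dim, sigma2=1.0):
    """Minimize the negative log of the objective in (2) with a Gaussian prior.

    neg_log_lik_and_grad(theta) -> (float, ndarray): negative log-likelihood of
    the spanning-tree mixture and its gradient, summed over all training examples.
    """
    def objective(theta):
        nll, grad = neg_log_lik_and_grad(theta)
        # Gaussian prior: + ||theta||^2 / (2 sigma^2) on the negative log scale.
        return nll + theta @ theta / (2 * sigma2), grad + theta / sigma2

    result = minimize(objective, np.zeros(dim), jac=True, method="L-BFGS-B")
    return result.x
```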

    3.1 MAP inference

Once we have trained the parameters, we also want to predict segmentations of unseen data. For this we usually look for the maximum-a-posteriori (MAP) labeling, which is also computationally intractable for general graphs. Here we answer the question of how to use the parameters learned with the objective in (2) to obtain a MAP labeling. Computing the MAP for one spanning tree is easy, but generally we want to consider the whole graph and not just one of its spanning trees. We experimented with two approaches for inference, each with different strengths and weaknesses.

Voting: Compute MAP labelings for several random spanning trees $t \in T$ and for each variable take a majority vote over the different MAP solutions. Formally, we can express the label $y^*_i$ for a node $i$ as follows:

\[
y^*_i = \arg\max_{y_i} \sum_{t \in T} \mathbb{1}\big( y^{MAP}_i(t) = y_i \big)
\quad\text{with}\quad
y^{MAP}(t) := \arg\max_{y} P(y \mid x, t; \theta^*_{SP}).
\]

Rescaling: Rescale the parameters $\theta^*_{SP}$ according to the connectivity in the spanning trees relative to the full graph. In other words, we adjust the parameters $\theta^*_{SP}$ such that the inner product with the full sufficient statistics leads to values similar to those observed in the training phase. We have

\[
y^* = \arg\max_{y} P(y \mid x; \tilde{\theta})
\quad\text{with}\quad
\tilde{\theta}_p = \sum_{t \in T} P(t) \frac{|C_p^t|}{|C_p|} \, \theta^*_{SP,p},
\]

and $\tilde{\theta}$ is obtained by concatenating the individual $\tilde{\theta}_p$. Here we assume that the underlying factor graphs for the individual samples are the same; otherwise we also need to average over the set sizes of the samples. To compute $\arg\max_y P(y \mid x; \tilde{\theta})$ we can use approximate inference algorithms like Loopy Belief Propagation or graph-cut. Instead of computing a MAP, we can also use maximum-marginal inference with parameters $\tilde{\theta}$.
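The rescaling step itself is a simple per-template reweighting; a sketch assuming the parameters are stored as one block per clique template (the naming is ours):

```python
def rescale_parameters(theta_sp, template_counts, tree_counts, tree_probs):
    """Rescale each clique template's parameter block by its average tree coverage.

    theta_sp[p]        : parameter block of clique template p (e.g. a numpy array)
    template_counts[p] : |C_p|, number of factors of template p in the full graph
    tree_counts[t][p]  : |C_p^t|, number of those factors covered by tree t
    tree_probs[t]      : P(t), e.g. 1/|T| for a uniform sample of trees
    """
    theta_tilde = {}
    for p, block in theta_sp.items():
        coverage = sum(tree_probs[t] * tree_counts[t][p] for t in tree_probs)
        theta_tilde[p] = (coverage / template_counts[p]) * block
    return theta_tilde
```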

We also experimented with a voting strategy based on the marginals instead of the MAP, which, however, led to similar results. We expect the voting strategy to work well in settings where max-marginal inference works better than MAP, which we confirmed experimentally. A more sophisticated algorithm for the MAP inference could also be developed by considering the MAP over a subsample of spanning trees, as discussed in [Meila and Jordan, 2000].
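For completeness, the voting scheme amounts to a per-node majority vote over tree-wise MAP decodings; a minimal sketch in which `map_on_tree` is a hypothetical helper performing exact max-product decoding on one spanning tree:

```python
from collections import Counter

def vote_map(trees, map_on_tree):
    """Per-node majority vote over the MAP labelings of several spanning trees.

    trees       : iterable of sampled spanning trees
    map_on_tree : tree -> dict mapping node id to its MAP label on that tree
    """
    votes = {}
    for t in trees:
        for node, label in map_on_tree(t).items():
            votes.setdefault(node, Counter())[label] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}
```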

    3.2 Grid Graphs and Complete Graphs

In our experiments we will consider two instances of the general model given in (1). The first model is often used in computer vision and considers a grid graph where all the vertex and edge potentials are parametrized in the same way, denoted by $\theta_v$ and $\theta_e$, respectively:

\[
P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Bigg( \sum_{i \in V} \langle \theta_v, s_i(x, y_i) \rangle + \sum_{(i,j) \in E} \langle \theta_e, s_{ij}(x, y_i, y_j) \rangle \Bigg).
\]
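Both experimental graph structures are easy to enumerate explicitly; the following helper sketch (our own naming, purely illustrative) builds the edge sets $E$: a 4-connected grid for the vision experiments and a complete graph over the labels for multilabel classification.

```python
def grid_edges(height, width):
    """4-connected grid edges over nodes indexed i = row * width + col."""
    edges = []
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if c + 1 < width:
                edges.append((i, i + 1))        # horizontal neighbour
            if r + 1 < height:
                edges.append((i, i + width))    # vertical neighbour
    return edges

def complete_graph_edges(num_labels):
    """All K(K-1)/2 label-label edges of the multilabel model."""
    return [(i, j) for i in range(num_labels) for j in range(i + 1, num_labels)]
```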


The second model, used for the multilabel experiments on a complete graph over the labels, parametrizes each vertex and edge individually as follows:

\[
P(y \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\Bigg( \sum_{i \in V} \langle \theta_i, s_i(x, y_i) \rangle + \sum_{(i,j) \in E} \langle \theta_{ij}, s_{ij}(x, y_i, y_j) \rangle \Bigg).
\]


Figure 3: Overview of the binary image denoising data set. First row: original images; second row: images corrupted by bimodal noise; third row: labeling as inferred by our algorithm.

Comparing the accuracy of exact training and the spanning tree approximation shows a difference of usually around 2-3%. However, for high values of noise, spanning tree learning combined with voting inference seems to perform favorably. One possible explanation for this is that maximum-marginal prediction often shows better results than MAP. Approximate training with LBP resulted in similar performance to the spanning tree approaches; in some runs, however, the estimated parameters were substantially inferior, leading to a large variance.

    4.2 Binary image denoising task

We consider the binary image denoising data set of Kumar and Hebert [2006] shown in Figure 3. It includes two observation sets that are perturbed by unimodal and bimodal Gaussian noise, respectively. We would like to point out that we use a slightly different model than Kumar and Hebert [2006], as they consider an objective where the node potentials are already the log class probabilities of a logistic regression classifier. In our model we do not normalize the node potentials.

Since the training set was fairly small, we could not sensibly define a validation set and hence used the training set to select $\sigma$. Our results are summarized in Table 1. Observe that we perform slightly worse on the unimodal data set but significantly better on the bimodal data set.

We investigated the effect of the number of spanning trees sampled for each example. For training, we did not see any improvement from using more than one spanning tree. This can be explained by observing that for an image grid most of the configurations of labels/features are already present in a single spanning tree. However, for inference by voting, increasing the number of spanning trees improves the performance (Figure 4). This is as expected, since the coverage of the edges increases with the number of spanning trees and hence we obtain a better estimate.

Table 1: Pixelwise classification error (%). Comparison to published work, where KH refers to the results published in [Kumar and Hebert, 2006, Kumar et al., 2005]; there, MAP inference was used for the unimodal data set and maximum-marginal (MM) inference for the bimodal data set. We average over 5 different runs, where the training/test set is fixed and the spanning trees are resampled anew for each run.

                     unimodal       bimodal
  spanning/voting    2.63 ± 0.01    5.23 ± 0.01
  rescale/MAP        2.33 ± 0.01    5.69 ± 0.03
  rescale/MM         2.53 ± 0.01    5.26 ± 0.01
  KH                 2.30           5.48

Figure 4: Test error (log scale) on the binary image denoising data sets (unimodal and bimodal) as a function of the number of spanning trees used for the inference by voting.


Similar to the toy problem, for varying noise levels the rescale and voting inference approaches show different merits. On the unimodal data set the rescaling approach performs better, whereas on the bimodal data set the trend is inverted and voting performs better (Table 1). If we use the rescale approach together with max-marginal inference, we get a similar test error to the voting approach.

Figure 5 shows the train and test error for different values of $\sigma$. We observe that the spanning tree objective is largely insensitive to the regularizer and performs well over a wide range of regularization parameters $\sigma$, which is not the case for penalized pseudo-likelihood training, where the sensitivity to $\sigma$ is high.

    4.3 Multiclass image denoising

While the experiment in the previous subsection is important as a comparison to previously published results, it does not show the full potential of our models.


Figure 5: Influence of the hyperparameter $\sigma$ (error rate on a log scale vs. $\log_{10}\sigma$) on the bimodal image denoising data set. Top: spanning tree and approximate training. Bottom: pseudo-likelihood training. The approximate training shows a similar behaviour w.r.t. the regularizer as the spanning tree objectives; the variance and error are, however, larger.

We are ultimately interested in more complex scenarios where the number of states per variable is bigger than two. It is also important to study a scenario where certain labels only interact with subsets of other labels. We created a novel data set, similar in spirit to the one in [Kumar and Hebert, 2006], with the extensions mentioned above. We created images of chess boards with two different player figures, each player occupying half of his fields. The positions of the figures are sampled randomly. The RGB colors of the different labels involved (two fields, two players) are selected such that they are equidistant in RGB space. We then add Gaussian noise to the colors for varying noise levels.

Our spanning tree based training approach successfully identifies the dependencies between the labels and does not assign a player's label to an opponent's field, even for high levels of noise.

Figure 6: Test error (log scale) on the chess board data set for a noise level of 1, as a function of $\log_{10}\sigma$, for spanning tree training with rescale/MAP and voting inference, penalized pseudo-likelihood, pseudo-likelihood, and approximate training.

In the experiments we observed that penalized pseudo-likelihood performs slightly better for small noise levels; the difference is, however, not statistically significant. For higher noise, the spanning tree approximation performs better (9.3% ± 0.1 vs. 10.1% ± 0.1 test error), see Figure 6. We also included results of approximate training, where the log-partition sum is approximated by the Bethe free energy computed by Loopy Belief Propagation, which showed inferior performance. Observe that the spanning tree training is again less sensitive to changes in the hyperparameter $\sigma$.

    4.4 Multilabel problem: Yeast dataset

Here we consider a different graph structure. We use the multilabel yeast data set containing 14 labels from Elisseeff and Weston [2001]. The underlying graphical model (a complete graph) as well as the prediction step ($\arg\max_y \langle \theta, s(x, y) \rangle$) of our model are the same as in [Finley and Joachims, 2008]. The key difference is that we train a CRF by maximum likelihood, whereas Finley and Joachims [2008] train a structural SVM with a maximum-margin objective.

We performed 5-fold cross-validation on the training data (where the model was subsequently trained on the full training data). We obtained a test error of 19.98% ± 0.06, which is slightly better than (though not statistically significantly different from) the best published result of 20.23% ± 0.53 [Finley and Joachims, 2008], where exact training was used. For the penalized pseudo-likelihood we got an unsatisfactory test error of 21.27% ± 0.03. We show in Figure 7 the performance for different settings of $\sigma$. We used 3 spanning trees per example in the training. However, we could again not find statistically significant evidence that training on more than one spanning tree per example helps.


For a complete graph with $K$ nodes we have $K(K-1)/2$ edges, of which $K-1$ are contained in one spanning tree. The probability of missing a particular edge in an independently sampled spanning tree $t$ is thus given by

\[
P\big( (i, j) \notin t \big) = 1 - \frac{2}{K}.
\]

The probability of missing an edge in $N$ independent spanning trees decreases exponentially in $N$. For the yeast data set we have $K = 14$ labels and 1500 training examples. If we independently sample one spanning tree for each example, we have $N = 1500$ and the probability of missing an edge is less than $10^{-100}$.
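The quoted bound follows from a one-line calculation; a quick check in Python with the values from the text:

```python
from math import log10

K, N = 14, 1500                      # labels in the yeast data set, training examples
p_miss_single = 1 - 2 / K            # an edge is absent from one uniform spanning tree
log10_p_miss_all = N * log10(p_miss_single)
print(log10_p_miss_all)              # about -100.4, i.e. below 10**-100
```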

Figure 7: Test error (log scale) on the yeast multilabel data set for different regularizers $\log_{10}\sigma$; train and test curves for voting inference, rescale/MAP inference, and penalized pseudo-likelihood.

    5 Conclusion

We presented an alternative approximate training objective for CRFs based on mixtures of spanning trees. Similar in spirit to pseudo-likelihood, we can train this objective exactly. The key difference to approaches such as pseudo-likelihood and piecewise training is that the distribution is globally normalized over all variables, with some dependencies removed, instead of being normalized over a small set of variables (possibly conditioned on the Markov blanket). We have demonstrated that the spanning tree approximation can be trained efficiently and that it shows state-of-the-art prediction performance. Furthermore, we have found it to be less sensitive to the hyperparameter, which is not the case for pseudo-likelihood and its penalized version.

    Acknowledgments

We would like to thank the anonymous reviewers for their valuable feedback, which helped improve the paper. We thank Yvonne Moh for proof-reading an early version of this work. This work was supported in part by the Swiss National Science Foundation (SNF) under grant number 200021-117946.

    References

J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179-195, 1975.

A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In NIPS, pages 681-687, 2001.

T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.

A. Kulesza and F. Pereira. Structured learning with approximate inference. In NIPS, 2008.

S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 68(2):179-201, 2006.

S. Kumar, J. August, and M. Hebert. Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In EMMCVPR, 2005.

J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

M. Meila and M. I. Jordan. Learning with mixtures of trees. JMLR, 1:1-48, 2000.

A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. J. Darrell. Hidden-state Conditional Random Fields. IEEE PAMI, 29(10):1848-1852, 2007.

C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning. MIT Press, 2006.

C. Sutton and A. McCallum. Piecewise training for undirected models. In UAI, 2005.

M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, 2008.

W. T. Tutte. Graph Theory. Addison-Wesley, 1984.

S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, pages 969-976, 2006.

M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE TIT, 49(5):1120-1146, 2003.

D. Wilson. Generating random spanning trees more quickly than the cover time. In STOC, 1996.

J. Yedidia, W. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE TIT, 51:2282-2312, 2005.