
Page 1:

Discriminative Methods for Structured Prediction
Eugene Weinstein, PhD Candidate
New York University, Courant Institute, Department of Computer Science
Depth Qualifying Exam
June 20th, 2007

Page 2:

Talk Outline
• Introduction, motivation
• Multiclass SVM’s [Crammer, Singer ’01]
• Max-margin Markov nets [Taskar, Guestrin, Koller ’03]
• SVM for SP [Tsochantaridis, Hofmann, Joachims, Altun ’04]
• Large-margin HMMs [Sha, Saul ’07]
• Regression for SP [Cortes, Mohri, Weston ’06]
• Search-based learning for SP [Daumé, Langford, Marcu ’06]
• Conclusion

Page 3:

Structured Prediction Intro
• Given: labeled training data (x_1, y_1), ..., (x_m, y_m) ∈ X × Y
• Task: learn mapping from inputs x ∈ X to outputs y ∈ Y
• Special cases
  • Binary classification: Y = {−1, 1}
  • Multiclass classification: Y = {1, ..., k}
• Natural language parsing example: x is a sentence, y is its parse tree

[Embedded excerpt: Tsochantaridis, Hofmann, Joachims, Altun, "Support Vector Machine Learning for Interdependent and Structured Output Spaces" (ICML 2004): abstract, introduction, and the beginning of Sections 2-3 on discriminants, loss functions, and margin maximization.]

Page 4:

Exploiting Structure
• Naive approach: treat each possible output in Y as a discrete label, apply multiclass classification. But:
  • Enumerating all members of Y is often intractable
  • Cannot model closeness of examples (changing one node of a tree vs. changing the entire tree)
• Approach: try to exploit structure and dependencies within the output space
• Represent closeness of outputs using a loss function

[Figure: two perturbations of a structured output, labeled "small loss" and "big loss".]

Page 5:

SVM Review
• Classifier: h_w(x) = sgn(w · x + b)
• Optimization problem: minimize (1/2)||w||^2 + C Σ_{i=1}^m ξ_i
• Subject to: y_i [w · x_i + b] ≥ 1 − ξ_i, ξ_i ≥ 0, i ∈ [1, m]
• Support vectors: points along the margin and outliers
• Soft margin: slack variables ξ_i, ξ_j allow margin violations; the hyperplanes w · x + b = −1, 0, +1 define a margin of 2/||w|| (ρ = 1/||w||)
• Properties: C is a non-negative real-valued constant; convex optimization; unique solution

[Embedded slides: "Soft-Margin Hyperplanes" and "Optimization Problem" from Mehryar Mohri, Foundations of Machine Learning, Courant Institute, NYU.]

[Cortes, Vapnik ’95]    Slide: [Mohri ’07]

Page 6:

Solving SP Problems
• (a.k.a. how to write a structured prediction paper)
1. Pick an input/output representation
2. Decide on the form of your learner
3. Formulate an objective function and margin constraint
4. Write and solve primal/dual optimization problem
5. Do some experiments

Page 7:

Multiclass SVM Framework
• Data: (x_1, y_1), ..., (x_m, y_m), with X ⊆ R^n and Y = {1, ..., k}
• If M is a k × n parameter matrix, M_y is the yth row
• Classifier has the linear form h_M(x) = argmax_{y=1,...,k} {M_y · x}
• Multiclass generalization of the margin constraint: ∀i, y ≠ y_i : M_{y_i} · x_i − M_y · x_i ≥ 1 − ξ_i

[Embedded excerpts: Crammer & Singer, "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines" (JMLR 2001), Sections 2-3, including Figure 1 illustrating the margin bound max_r {M_r · x + 1 − δ_{y,r}} − M_y · x for correct vs. incorrect classes; and the Tsochantaridis et al. (2004) abstract and introduction.]

[Crammer, Singer ’01]
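To make the framework concrete, here is a minimal NumPy sketch (my own illustration, not code from the slides or the paper) of the linear multiclass rule h_M(x) = argmax_y {M_y · x} and of the piecewise-linear bound max_r {M_r · x + 1 − δ_{y,r}} − M_y · x on the 0/1 error; the matrix M and the toy point are invented.

```python
import numpy as np

def predict(M, x):
    """Multiclass linear rule: return the row of M with the highest score M_r . x."""
    return int(np.argmax(M @ x))

def multiclass_hinge(M, x, y):
    """Piecewise-linear upper bound on the 0/1 error of (x, y):
    max_r { M_r . x + 1 - delta(y, r) } - M_y . x."""
    scores = M @ x
    margins = scores + 1.0
    margins[y] -= 1.0          # delta(y, r) removes the +1 for the correct class
    return float(np.max(margins) - scores[y])

# Toy example (invented numbers): k = 3 classes, n = 2 features.
M = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 0.5])
print(predict(M, x), multiclass_hinge(M, x, y=0))
```

The bound is zero exactly when the correct class outscores every other class by at least one, which is the margin condition above.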

Page 8:

Optimization Problem
• Frobenius norm of M: ||M||_F^2 = Σ_{r,j} M_{rj}^2
• Separable version:
  • minimize (1/2) β ||M||_F^2
  • subject to ∀i, y ≠ y_i : M_{y_i} · x_i − M_y · x_i ≥ 1
• Use slack variables to allow for non-separability:
  • minimize (1/2) β ||M||_F^2 + Σ_{i=1}^m ξ_i
  • subject to ∀i, y ≠ y_i : M_{y_i} · x_i − M_y · x_i ≥ 1 − ξ_i
• Next: write dual problem using SVM-like techniques

[Embedded excerpt: Cortes, Mohri, Weston ’06, Section 1.3.1, "Kernel Ridge Regression with Vector Space Images", with the primal and dual solutions of the regularized regression problem.]
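As a concrete reading of the non-separable problem, the following sketch (an invented toy; β and the data are arbitrary) evaluates the objective (1/2) β ||M||_F^2 + Σ_i ξ_i, taking each ξ_i at its minimal feasible value under the constraints ∀y ≠ y_i : M_{y_i} · x_i − M_y · x_i ≥ 1 − ξ_i.

```python
import numpy as np

def primal_objective(M, X, y, beta):
    """(1/2) * beta * ||M||_F^2 + sum_i xi_i, where xi_i is the smallest slack
    making every constraint M_{y_i}.x_i - M_y.x_i >= 1 - xi_i hold."""
    scores = X @ M.T                      # (m, k) matrix of M_y . x_i
    m = X.shape[0]
    correct = scores[np.arange(m), y]
    viol = scores + 1.0                   # require a margin of +1 over every wrong class
    viol[np.arange(m), y] -= 1.0          # the correct class needs no extra margin
    xi = np.maximum(0.0, viol.max(axis=1) - correct)
    return 0.5 * beta * np.sum(M ** 2) + xi.sum()

# Arbitrary toy values for illustration.
M = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
X = np.array([[2.0, 0.5], [0.2, 1.5], [-1.0, -1.0]])
y = np.array([0, 1, 2])
print(primal_objective(M, X, y, beta=0.5))
```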

Page 9:

Dual Formulation
• Take Lagrangian, find saddle point with KKT conditions, plug back into objective function
• maximize −(1/2) Σ_{i,j} (x_i · x_j)(τ_i · τ_j) + β Σ_i τ_i · 1_{y_i}
• subject to: ∀i, τ_i ≤ 1_{y_i} and τ_i · 1 = 0
• 1_{y_i} is a vector with 1 in the y_i-th position, 0’s otherwise
• Can replace dot product with kernel in the usual fashion:
• maximize −(1/2) Σ_{i,j} K(x_i, x_j)(τ_i · τ_j) + β Σ_i τ_i · 1_{y_i}
• Classification rule: h(x) = argmax_{y=1,...,k} Σ_i τ_{i,y} K(x, x_i)
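The sketch below (an illustration only; the Gaussian kernel, the toy data, and the particular τ matrix are my assumptions, not taken from the slides) evaluates the kernelized dual objective and applies the classification rule h(x) = argmax_y Σ_i τ_{i,y} K(x, x_i) for a fixed feasible-looking τ, rather than actually optimizing the dual.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_objective(tau, K, y, beta):
    """-(1/2) sum_{i,j} K(x_i, x_j) (tau_i . tau_j) + beta * sum_i tau_{i, y_i}."""
    quad = np.sum(K * (tau @ tau.T))
    lin = tau[np.arange(len(y)), y].sum()
    return -0.5 * quad + beta * lin

def classify(tau, X_train, x, gamma=1.0):
    """h(x) = argmax_y sum_i tau_{i, y} K(x, x_i)."""
    k = rbf_kernel(x[None, :], X_train, gamma)[0]   # similarities to training points
    return int(np.argmax(k @ tau))

# Toy data and an arbitrary tau satisfying tau_i <= 1_{y_i} and tau_i . 1 = 0.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([0, 1, 2])
tau = np.array([[ 0.6, -0.3, -0.3],
                [-0.2,  0.4, -0.2],
                [-0.1, -0.1,  0.2]])
K = rbf_kernel(X, X)
print(dual_objective(tau, K, y, beta=1.0), classify(tau, X, np.array([0.1, 0.1])))
```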

Page 10:

Experimental Results
1. Compare current algorithm with one-vs-all
2. Compare with other work [Schölkopf ’97, Dietterich ’00]
• MNIST handwritten digits, UCI ML corpus repository

            Test Error   Prev. Error   Prev. Technique
  MNIST     1.42%        1.4%          [S ’97] one-vs-all
  USPS      4.38%        4.2%          [S ’97] one-vs-all
  Shuttle   0.12%        0.01%         [D ’00] C4.5 boost
  Letter    1.95%        2.71%         [D ’00] C4.5 boost

[Embedded excerpt: Crammer & Singer (2001), experimental section, with Figure 8 comparing the multiclass SVM against a one-against-rest strawman on satimage, shuttle, mnist, isolet, letter, vowel, and glass; bars show the decrease in error rate achieved by the multiclass SVM.]

Page 11:

Sequence Classification
• Output is a sequence of labels: Y = Y_1 × ··· × Y_l, with each Y_i = {1, ..., k}
• Markov network is a graph (V, E) over the label variables; pairs (i, j) ∈ V × V are connected by edges
• Model pairs of labels together (for simplicity)
• Joint feature representation: f(x, y) = Σ_{(i,j)∈E} f(x, y_i, y_j)
• Linear classifier: h_w(x) = argmax_y {w · f(x, y)}
• Probabilistic interpretation: Markov network is a log-linear model (encodes log-likelihoods): P(y|x) ∝ Π_{(i,j)∈E} exp{w · f(x, y_i, y_j)}

[Taskar, Guestrin, Koller ’03]
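For a chain-structured network the argmax in h_w(x) = argmax_y {w · f(x, y)} decomposes over edges and can be computed with Viterbi-style dynamic programming. The sketch below is a generic illustration, not Taskar et al.'s code: it assumes the per-edge scores w · f(x, y_i, y_{i+1}) have already been tabulated into an array edge_scores[t, a, b] (an invented interface).

```python
import numpy as np

def viterbi(edge_scores):
    """argmax over label sequences of sum_t edge_scores[t, y_t, y_{t+1}].

    edge_scores: array of shape (l-1, k, k), one score per edge and label pair.
    Returns the best label sequence of length l."""
    T, k, _ = edge_scores.shape
    best = np.zeros((T + 1, k))            # best[t, b] = best score of a prefix ending in label b
    back = np.zeros((T, k), dtype=int)
    for t in range(T):
        cand = best[t][:, None] + edge_scores[t]   # (k, k): previous label a -> next label b
        back[t] = cand.argmax(axis=0)
        best[t + 1] = cand.max(axis=0)
    y = [int(best[T].argmax())]
    for t in range(T - 1, -1, -1):         # follow back-pointers
        y.append(int(back[t, y[-1]]))
    return y[::-1]

# Toy example: l = 4 labels, k = 3 classes, random scores standing in for w . f(x, y_i, y_j).
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(3, 3, 3))))
```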

Page 12:

Hamming Loss
• Multiclass margin concept: 0-1 loss (correct/incorrect)
• This paper: Hamming loss (counts # of incorrect labels), e.g.,
    correct:    9  5  7  3  12  1
    predicted:  9 12  7 11  12  1     Loss = 2
• If the correct output is y_i = [y_i^(1), ..., y_i^(l)] and the classifier output is y = [y^(1), ..., y^(l)], suffer a loss of Δt_i(y) = Σ_{k=1}^l I(y^(k) ≠ y_i^(k))
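A one-line version of Δt_i(y), mirroring the worked example above (my own sketch):

```python
import numpy as np

def hamming_loss(y_pred, y_true):
    """Delta t_i(y) = number of positions where the predicted label is wrong."""
    return int(np.sum(np.asarray(y_pred) != np.asarray(y_true)))

# The example from the slide: two of the six labels differ, so the loss is 2.
print(hamming_loss([9, 12, 7, 11, 12, 1], [9, 5, 7, 3, 12, 1]))   # -> 2
```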

Page 13:

Scaling the Margin
• Difference in feature space: Δf_i(y) = f(x_i, y_i) − f(x_i, y)
• Loss suffered if the ith point is labeled y instead of y_i: Δt_i(y) (here illustrated with Δt_i(y) = 1)
• minimize (1/2)||w||^2 + C Σ_{i=1}^m ξ_i
• subject to ∀i, y ≠ y_i : w · Δf_i(y) ≥ Δt_i(y) − ξ_i

Page 14:

Scaling the Margin (the same slide repeated as a build, with the example loss Δt_i(y) = 2)

Page 15:

Scaling the Margin (the same slide repeated as a build, with the example loss Δt_i(y) = 1/2)
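Under margin rescaling, the smallest slack satisfying ∀y ≠ y_i : w · Δf_i(y) ≥ Δt_i(y) − ξ_i is ξ_i = max_y [Δt_i(y) − w · Δf_i(y)], which is a loss-augmented inference problem. The sketch below illustrates this for one example over a small, explicitly enumerated candidate set; the scores and losses are invented, and in practice the max runs over an exponentially large Y.

```python
import numpy as np

def margin_rescaled_slack(score_true, scores, losses):
    """xi_i = max over candidate outputs y of [ Delta t_i(y) - w . Delta f_i(y) ],
    where w . Delta f_i(y) = score_true - scores[y]; including the correct output
    (loss 0, score difference 0) keeps xi_i >= 0."""
    return float(np.max(losses - (score_true - scores)))

# Invented example: 4 candidate outputs, the first being the correct one (loss 0).
scores = np.array([3.0, 2.5, 1.0, 2.8])   # w . f(x_i, y) for each candidate
losses = np.array([0.0, 1.0, 2.0, 3.0])   # loss Delta t_i(y) of each candidate vs. y_i
print(margin_rescaled_slack(score_true=scores[0], scores=scores, losses=losses))
```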

Page 16:

Generalization Bound
• Dual formulation, kernel methods: upon request
• Objective: maximize # of correct labels
• Measure average per-label loss: L(w, x) = (1/l) Δt_i(argmax_y w · f_i(y))
• Worst loss allowed by a classifier within margin γ: L^γ(w, x) = sup_{z : |z(y) − w · f_i(y)| ≤ γ Δt_i(y)} (1/l) Δt_i(argmax_y z(y))
• Bound [Taskar, Guestrin, Koller ’03]: if the edge features have bounded 2-norm (||f(x, y_i, y_j)||_2 ≤ R_edge), q is the maximum edge degree in the network, k is the number of classes in a label, and l is the number of labels, then there exists a constant K such that for any γ > 0 per-label margin, with probability at least 1 − δ, the per-label loss is bounded by
  E_x L(w, x) ≤ E_S L^γ(w, x) + √( (K/m) [ (R_edge^2 ||w||_2^2 q^2 / γ^2)(log m + log l + log q + log k) + log(1/δ) ] )

Page 17:

Experiments
• Subset of handwritten words corpus from [Kassel ‘95]
• 6100 words, average 8 characters each
• In sequence-based methods (CRF and M^3N), model pairs of adjacent labels (Markov assumption)

[Figure 1 from Taskar et al.: (a) 3 example words from the OCR data set; (b) OCR: average per-character test error for logistic regression, CRFs, multiclass SVMs, and M^3Ns, using linear, quadratic, and cubic kernels; (c) Hypertext: test error for multiclass SVMs, RMNs, and M^3Ns, by school (Cor, Tex, Was, Wis) and average.]

[Embedded excerpt: Taskar, Guestrin, Koller (2003), experimental section on handwriting recognition and collective hypertext classification.]

Page 18:

SVM Learning for SP
• Arbitrary input and output structure (not just sequences)
• Approach: again look for a linear classifier with a large margin: h_w(x) = argmax_y {w · f(x, y)}
• Instead of scaling the margin by the loss, scale the slack
• Primal: minimize (1/2)||w||^2 + C Σ_{i=1}^m ξ_i
• subject to ∀i, y : w · Δf_i(y) ≥ 1 − ξ_i / Δt_i(y)
• Dual: maximize Σ_{i,y} α_{iy} − (1/2) Σ_{i,y} Σ_{j,ȳ} α_{iy} α_{jȳ} Δf_i(y) · Δf_j(ȳ)
• subject to ∀i : Σ_{y≠y_i} α_{iy} / Δt_i(y) ≤ C; ∀i, y : α_{iy} ≥ 0

[Tsochantaridis, Hofmann, Joachims, Altun ’04]
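For contrast with the margin-rescaling slides above, under slack rescaling the smallest slack satisfying ∀y : w · Δf_i(y) ≥ 1 − ξ_i / Δt_i(y) is ξ_i = max_{y ≠ y_i} Δt_i(y)(1 − w · Δf_i(y)), clipped at zero. A minimal sketch over the same kind of invented, explicitly enumerated candidate set:

```python
import numpy as np

def slack_rescaled_slack(score_true, scores, losses):
    """xi_i = max(0, max_{y != y_i} Delta t_i(y) * (1 - w . Delta f_i(y))),
    with w . Delta f_i(y) = score_true - scores[y]."""
    wrong = losses > 0                       # candidates other than the correct output
    viol = losses[wrong] * (1.0 - (score_true - scores[wrong]))
    return float(max(0.0, viol.max()))

# Same invented candidates as in the margin-rescaling sketch.
scores = np.array([3.0, 2.5, 1.0, 2.8])
losses = np.array([0.0, 1.0, 2.0, 3.0])
print(slack_rescaled_slack(score_true=scores[0], scores=scores, losses=losses))
```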

Page 19:

Solving QP
• maximize Σ_{i,y} α_{iy} − (1/2) Σ_{i,y} Σ_{j,ȳ} α_{iy} α_{jȳ} Δf_i(y) · Δf_j(ȳ)
• subject to ∀i : Σ_{y≠y_i} α_{iy} / Δt_i(y) ≤ C; ∀i, y : α_{iy} ≥ 0
• Number of constraints still depends on output space size!
• Solution: propose an efficient algorithm that approximates the optimal solution
  • Maintain active subset of constraints
  • Constraints selected by error criterion
  • Remaining constraints fulfilled with precision at least ε

[Embedded excerpt: Tsochantaridis et al. (2004), Section 4: kernelization, box constraints for the slack and margin re-scaling formulations, and Algorithm 1, the cutting-plane / working-set procedure implemented as part of SVMlight (http://svmlight.joachims.org/).]

Page 20:

Algorithm
• Input: (x_1, y_1), ..., (x_m, y_m), C, ε
• Initialize: S_i ← ∅ for all i = 1, ..., m
• repeat
  • for i = 1, ..., m
    • H(y) ≡ (1 − w · Δf_i(y)) Δt_i(y), where w = Σ_j Σ_{y’∈S_j} α_{jy’} Δf_j(y’)
    • ŷ = argmax_y H(y); ξ_i = max{0, max_{y∈S_i} H(y)}
    • if H(ŷ) > ξ_i + ε then S_i ← S_i ∪ {ŷ}; α_S = solve QP over S = ∪_i S_i
• until no S_i changes during an iteration

Page 21:

Algorithm (build: annotates that the form of H(y) comes from the constraint ∀i, y : w · Δf_i(y) ≥ 1 − ξ_i / Δt_i(y))

Page 22:

Algorithm (build: annotates w as the optimal solution to the Lagrangian)

Page 23:

Algorithm (build: annotates ŷ = argmax_y H(y) as finding the most violated constraint)

Page 24:

Algorithm (build: annotates S_i ← S_i ∪ {ŷ} as adding the constraint to the active set)
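Below is a small working sketch of the working-set idea, not the SVMlight implementation: candidate outputs are enumerated explicitly (so the argmax of H is brute force), and the inner QP over the active set S is replaced by a few passes of plain subgradient descent on the restricted primal, a deliberate substitution to keep the sketch self-contained. The feature maps, losses, and constants are invented.

```python
import numpy as np

def fit_working_set(Phi, losses, C=1.0, eps=1e-3, inner_steps=200, lr=0.01):
    """Cutting-plane skeleton for the slack-rescaled structured SVM.

    Phi[i]    : (n_candidates, d) array of joint features f(x_i, y); row 0 is the correct y_i.
    losses[i] : (n_candidates,) array of Delta t_i(y), with losses[i][0] == 0.
    Returns the weight vector w and the working sets S_i."""
    m, d = len(Phi), Phi[0].shape[1]
    w = np.zeros(d)
    S = [set() for _ in range(m)]

    def refit(w):
        # Stand-in for the dual QP solve: subgradient descent on
        # (1/2)||w||^2 + C * sum_i max_{y in S_i} [ Delta t_i(y) * (1 - w . Delta f_i(y)) ]_+
        for _ in range(inner_steps):
            g = w.copy()
            for i in range(m):
                if not S[i]:
                    continue
                active = list(S[i])
                dphi = Phi[i][0] - Phi[i][active]        # Delta f_i(y) for active constraints
                viol = losses[i][active] * (1.0 - dphi @ w)
                j = int(np.argmax(viol))
                if viol[j] > 0:
                    g -= C * losses[i][active][j] * dphi[j]
            w = w - lr * g
        return w

    while True:
        changed = False
        for i in range(m):
            dphi = Phi[i][0] - Phi[i]                    # Delta f_i(y) for all candidates
            H = (1.0 - dphi @ w) * losses[i]             # slack-rescaled cost function H(y)
            y_hat = int(np.argmax(H))                    # most violated constraint
            xi = max([0.0] + [H[y] for y in S[i]])
            if H[y_hat] > xi + eps:
                S[i].add(y_hat)                          # add constraint to the active set
                w = refit(w)                             # re-solve (approximately) over S
                changed = True
        if not changed:
            return w, S

# Invented toy problem: 2 examples, 3 candidate outputs each, 2-dimensional features.
Phi = [np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
       np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])]
losses = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 2.0, 1.0])]
w, S = fit_working_set(Phi, losses)
print(np.round(w, 3), S)
```

Despite the simplifications, the skeleton follows the slide: per-example working sets S_i, the slack-rescaled cost H(y), an ε test against the current ξ_i, and a re-fit whenever a violated constraint is added.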

Page 25: Discriminative Methods for Structured Predictioneugenew/publications/dqe-syllabus.pdf · A b st ra ct Lea rning general functional dependencies is one of the main goals in machine

Experiments: NL Parsing

• Map sentence to parse tree

• Let be a histogram vector of rules occurring when is the parse tree corresponding to

• Find tree that maximizes using CKY algorithm

hw(x) = arg maxy

{w · f(x, y)}

x y

x

y

w · f(x, y)

Figure 1. Illustration of natural language parsing model.

y. This is depicted graphically in Figure 1. The ap-proach we pursue is to learn a discriminant functionF : X ! Y " # over input/output pairs from whichwe can derive a prediction by maximizing F over theresponse variable for a specific given input x. Hence,the general form of our hypotheses f is

f(x;w) = argmaxy!Y

F (x,y;w) , (1)

where w denotes a parameter vector. It might be use-ful to think of $F as a w-parameterized family of costfunctions, which we try to design in such a way thatthe minimum of F (x, ·;w) is at the desired output yfor inputs x of interest. Throughout this paper, weassume F to be linear in some combined feature repre-sentation of inputs and outputs !(x,y),

F (x,y;w) = %w,!(x,y)& . (2)

The specific form of ! depends on the nature of theproblem and special cases will be discussed subse-quently.

Using again natural language parsing as an illustrativeexample, we can chose F such that we get a model thatis isomorphic to a Probabilistic Context Free Grammar(PCFG). Each node in a parse tree y for a sentencex corresponds to grammar rule gj , which in turn hasa score wj . All valid parse trees y (i.e. trees with adesignated start symbol S as the root and the words inthe sentence x as the leaves) for a sentence x are scoredby the sum of the wj of their nodes. This score canthus be written as F (x,y;w) = %w,!(x,y)&, where!(x,y) is a histogram vector counting how often eachgrammar rule gj occurs in the tree y. f(x;w) canbe e"ciently computed by finding the structure y ' Ythat maximizes F (x,y;w) via the CKY algorithm (seeManning and Schuetze (1999)).

Learning over structured output spaces Y inevitablyinvolves loss functions other than the standard zero-one classification loss (cf. Weston et al. (2003)). Forexample, in natural language parsing, a parse treethat di#ers from the correct parse in a few nodes only

should be treated di#erently from a parse tree thatis radically di#erent. Typically, the correctness of apredicted parse tree is measured by its F1 score (seee.g. Johnson (1999)), the harmonic mean of precisionof recall as calculated based on the overlap of nodesbetween the trees. We thus assume the availability ofa bounded loss function ( : Y!Y " # where ((y, y)quantifies the loss associated with a prediction y, if thetrue output value is y. If P (x,y) denotes the data gen-erating distribution, then the goal is to find a functionf within a given hypothesis class such that the risk

R"P (f) =

!

X#Y((y, f(x)) dP (x,y) . (3)

is minimized. We assume that P is unknown, but thata finite training set of pairs S = {(xi,yi) ' X!Y : i =1, . . . , n} generated i.i.d. according to P is given. Theperformance of a function f on the training sampleS is described by the empirical risk R"

S (f). For w-parameterized hypothesis classes, we will also writeR"

P (w) ) R"P (f(·;w)) and similarly for the empirical

risk.

3. Margins and Margin Maximization

First, we consider the separable case in which thereexists a function f parameterized by w such that theempirical risk is zero. If we assume that ((y,y$) > 0for y *= y$ and ((y,y) = 0, then the condition of zerotraining error can then be compactly written as a setof non-linear constraints

+i : maxy!Y\yi

{%w,!(xi,y)&} < %w,!(xi,yi)& . (4)

Each nonlinear inequalities in (4) can be equivalentlyreplaced by |Y| $ 1 linear inequalities, resulting in atotal of n|Y|$ n linear constraints,

+i, +y ' Y \ yi : %w, !!i(y)& > 0 , (5)

where we have defined the shorthand !!i(y) )!(xi,yi) $ !(xi,y).

If the set of inequalities in (5) is feasible, there willtypically be more than one solution w%. To specifya unique solution, we propose to select the w with,w, - 1 for which the score of the correct label yi

is uniformly most di#erent from the closest runner-up yi(w) = argmaxy &=yi

%w,!(xi,y)&. This general-izes the maximum-margin principle employed in SVMs(Vapnik, 1998) to the more general case considered inthis paper. The resulting hard-margin optimization

Figure 1. Illustration of natural language parsing model.

… This is depicted graphically in Figure 1. The approach we pursue is to learn a discriminant function $F : X \times Y \to \mathbb{R}$ over input/output pairs from which we can derive a prediction by maximizing $F$ over the response variable for a specific given input $x$. Hence, the general form of our hypotheses $f$ is

$f(x; w) = \arg\max_{y \in Y} F(x, y; w)$,   (1)

where $w$ denotes a parameter vector. It might be useful to think of $-F$ as a $w$-parameterized family of cost functions, which we try to design in such a way that the minimum of $F(x, \cdot; w)$ is at the desired output $y$ for inputs $x$ of interest. Throughout this paper, we assume $F$ to be linear in some combined feature representation of inputs and outputs $\Psi(x, y)$,

$F(x, y; w) = \langle w, \Psi(x, y)\rangle$.   (2)

The specific form of $\Psi$ depends on the nature of the problem and special cases will be discussed subsequently.

Using again natural language parsing as an illustrative example, we can choose $F$ such that we get a model that is isomorphic to a Probabilistic Context Free Grammar (PCFG). Each node in a parse tree $y$ for a sentence $x$ corresponds to a grammar rule $g_j$, which in turn has a score $w_j$. All valid parse trees $y$ (i.e. trees with a designated start symbol S as the root and the words in the sentence $x$ as the leaves) for a sentence $x$ are scored by the sum of the $w_j$ of their nodes. This score can thus be written as $F(x, y; w) = \langle w, \Psi(x, y)\rangle$, where $\Psi(x, y)$ is a histogram vector counting how often each grammar rule $g_j$ occurs in the tree $y$. $f(x; w)$ can be efficiently computed by finding the structure $y \in Y$ that maximizes $F(x, y; w)$ via the CKY algorithm (see Manning and Schuetze (1999)).

Learning over structured output spaces $Y$ inevitably involves loss functions other than the standard zero-one classification loss (cf. Weston et al. (2003)). For example, in natural language parsing, a parse tree that differs from the correct parse in a few nodes only should be treated differently from a parse tree that is radically different. Typically, the correctness of a predicted parse tree is measured by its F1 score (see e.g. Johnson (1999)), the harmonic mean of precision and recall as calculated based on the overlap of nodes between the trees. We thus assume the availability of a bounded loss function $\triangle : Y \times Y \to \mathbb{R}$ where $\triangle(y, \hat{y})$ quantifies the loss associated with a prediction $\hat{y}$ if the true output value is $y$. If $P(x, y)$ denotes the data generating distribution, then the goal is to find a function $f$ within a given hypothesis class such that the risk

$R_P^{\triangle}(f) = \int_{X \times Y} \triangle(y, f(x)) \, dP(x, y)$   (3)

is minimized. We assume that $P$ is unknown, but that a finite training set of pairs $S = \{(x_i, y_i) \in X \times Y : i = 1, \ldots, n\}$ generated i.i.d. according to $P$ is given. The performance of a function $f$ on the training sample $S$ is described by the empirical risk $R_S^{\triangle}(f)$. For $w$-parameterized hypothesis classes, we will also write $R_P^{\triangle}(w) \equiv R_P^{\triangle}(f(\cdot; w))$ and similarly for the empirical risk.

3. Margins and Margin Maximization

First, we consider the separable case in which there exists a function $f$ parameterized by $w$ such that the empirical risk is zero. If we assume that $\triangle(y, y') > 0$ for $y \neq y'$ and $\triangle(y, y) = 0$, then the condition of zero training error can be compactly written as a set of non-linear constraints

$\forall i : \max_{y \in Y \setminus y_i} \langle w, \Psi(x_i, y)\rangle < \langle w, \Psi(x_i, y_i)\rangle$.   (4)

Each nonlinear inequality in (4) can be equivalently replaced by $|Y| - 1$ linear inequalities, resulting in a total of $n|Y| - n$ linear constraints,

$\forall i, \forall y \in Y \setminus y_i : \langle w, \delta\Psi_i(y)\rangle > 0$,   (5)

where we have defined the shorthand $\delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y)$.

If the set of inequalities in (5) is feasible, there will typically be more than one solution $w^*$. To specify a unique solution, we propose to select the $w$ with $\|w\| \leq 1$ for which the score of the correct label $y_i$ is uniformly most different from the closest runner-up $\hat{y}_i(w) = \arg\max_{y \neq y_i} \langle w, \Psi(x_i, y)\rangle$. This generalizes the maximum-margin principle employed in SVMs (Vapnik, 1998) to the more general case considered in this paper. The resulting hard-margin optimization …


19
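To make the joint feature map concrete, here is a minimal Python sketch (not from the paper; the toy grammar, rule names, and weights are invented for illustration) of the rule-count histogram $\Psi(x, y)$ and the linear score $F(x, y; w) = \langle w, \Psi(x, y)\rangle$ for the PCFG example above:

    # Sketch only: a joint feature map Psi(x, y) for the PCFG example, where the
    # parse tree y is given as the list of grammar rules it uses, and
    # F(x, y; w) = <w, Psi(x, y)> is the linear score.
    def joint_features(rules_in_tree, rule_index):
        """Histogram vector counting how often each grammar rule occurs in y."""
        psi = [0.0] * len(rule_index)
        for rule in rules_in_tree:
            psi[rule_index[rule]] += 1.0
        return psi

    def score(w, psi):
        """F(x, y; w) = <w, Psi(x, y)>."""
        return sum(wj * fj for wj, fj in zip(w, psi))

    # Toy grammar and a toy "parse" listing the rules used (hypothetical values).
    rule_index = {"S->NP VP": 0, "NP->DT NN": 1, "VP->VBD NP": 2}
    tree_rules = ["S->NP VP", "NP->DT NN", "VP->VBD NP", "NP->DT NN"]
    w = [0.5, 0.2, 0.1]
    print(score(w, joint_features(tree_rules, rule_index)))  # 0.5 + 2*0.2 + 0.1 = 1.0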

Page 26: Discriminative Methods for Structured Prediction

Experiments: NL Parsing

• 4098 train, 163 test sentences from Penn Treebank WSJ corpus

• Classifiers compared: generative PCFG model, SVM with 0/1 loss, SVM with F1-loss

• F1 is the harmonic mean of precision and recall: $F_1 = 2pr/(p + r)$ (see the sketch below)
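A hedged sketch of the F1-loss used in these experiments, computing F1 from the overlap of nodes between a predicted and a gold parse tree. The span-set representation and the toy trees below are assumptions for illustration, not the paper's evaluation code:

    # Illustrative only: F1 from node overlap between predicted and gold trees,
    # each represented here as a set of labeled spans (label, start, end),
    # and the corresponding loss 1 - F1.
    def f1_loss(pred_spans, gold_spans):
        overlap = len(pred_spans & gold_spans)
        if overlap == 0:
            return 1.0
        p = overlap / len(pred_spans)          # precision
        r = overlap / len(gold_spans)          # recall
        f1 = 2 * p * r / (p + r)               # harmonic mean
        return 1.0 - f1

    gold = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4)}
    pred = {("S", 0, 4), ("NP", 0, 1), ("VP", 2, 4)}
    print(f1_loss(pred, gold))  # 1 - 2/3 = 0.333...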

Table 2. Results of various algorithms on the Named Entity Recognition task (Altun et al., 2003).

Method   HMM    CRF    Perceptron   SVM
Error    9.36   5.17   5.94         5.08

Table 3. Results for various SVM formulations on the Named Entity Recognition task ($\epsilon = 0.01$, $C = 1$).

Method      Train Err   Test Err   Const       Avg Loss
SVM2        0.2±0.1     5.1±0.6    2824±106    1.02±0.01
SVM2^Δs     0.4±0.4     5.1±0.8    2626±225    1.10±0.08
SVM2^Δm     0.3±0.2     5.1±0.7    2628±119    1.17±0.12

… perceptron and the SVM algorithm. All discriminative learning methods substantially outperform the standard HMM. In addition, the SVM performs slightly better than the perceptron and CRFs, demonstrating the benefit of a large-margin approach. Table 3 shows that all SVM formulations perform comparably, probably due to the fact that the vast majority of the support label sequences end up having Hamming distance 1 to the correct label sequence (notice that for loss equal to 1 all SVM formulations are equivalent).

5.4. Sequence Alignment

Next we show how to apply the proposed algorithm to the problem of learning how to align sequences $x \in X = \Sigma^*$. For a given pair of sequences $x$ and $z$, alignment methods like the Smith-Waterman algorithm select the sequence of operations (e.g. insertion, substitution) $\hat{a}(x, z) = \arg\max_{a \in A} \langle w, \Psi(x, z, a)\rangle$ that transforms $x$ into $z$ and that maximizes a linear objective function derived from the (negative) operation costs $w$. $\Psi(x, z, a)$ is the histogram of alignment operations. We use the value of $\langle w, \Psi(x, z, \hat{a}(x, z))\rangle$ as a measure of similarity.

In order to learn the cost vector $w$ we use training data of the following type. For each native sequence $x_i$ there is a most similar homologue sequence $z_i$ along with what is believed to be the (close to) optimal alignment $a_i$. In addition we are given a set of decoy sequences $z_i^t$, $t = 1, \ldots, k$ with unknown alignments. The goal is to find a cost vector $w$ so that homologue sequences are close to the native sequence, and so that decoy sequences are further away. With $Y_i = \{z_i, z_i^1, \ldots, z_i^k\}$ as the output space for the $i$-th example, we seek a $w$ so that $\langle w, \Psi(x_i, z_i, a_i)\rangle$ exceeds $\langle w, \Psi(x_i, z_i^t, a)\rangle$ for all $t$ and $a$. This implies a zero-one loss and hypotheses of the form $f(x_i; w) = \arg\max_{y \in Y_i} \max_a \langle w, \Psi(x, z, a)\rangle$. We use the Smith-Waterman algorithm to implement the $\max_a$.
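The following is a small illustrative sketch (the operation names and costs are hypothetical) of scoring one fixed alignment as $\langle w, \Psi(x, z, a)\rangle$ with $\Psi$ the histogram of alignment operations; the search over alignments (e.g. Smith-Waterman) is not shown:

    # Sketch only: score one alignment a of (x, z) as <w, Psi(x, z, a)>, where
    # Psi is the histogram of alignment operations. Finding the best alignment
    # (the max over a) is left to a dynamic program such as Smith-Waterman.
    from collections import Counter

    def alignment_score(ops, w):
        """ops: list of operation names; w: dict of (signed) operation costs."""
        psi = Counter(ops)                       # histogram of alignment operations
        return sum(w[op] * count for op, count in psi.items())

    w = {"match": 2.0, "substitution": -1.0, "insertion": -2.0, "deletion": -2.0}
    ops = ["match", "match", "substitution", "insertion", "match"]
    print(alignment_score(ops, w))  # 2 + 2 - 1 - 2 + 2 = 3.0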

Table 4 shows the test error rates (i.e. the fraction of times the homologue is not selected) on the synthetic dataset described in Joachims (2003). The results are averaged over 10 train/test samples.

Table 4. Error rates and number of constraints |S| depending on the number of training examples ($\epsilon = 0.1$, $C = 0.01$).

       Train Error               Test Error
n      GenMod      SVM2         GenMod      SVM2        Const
1      20.0±13.3   0.0±0.0      74.3±2.7    47.0±4.6    7.8±0.3
2      20.0±8.2    0.0±0.0      54.5±3.3    34.3±4.3    13.9±0.8
4      10.0±5.5    2.0±2.0      28.0±2.3    14.4±1.4    31.9±0.9
10     2.0±1.3     0.0±0.0      10.2±0.7    7.1±1.6     58.9±1.2
20     2.5±0.8     1.0±0.7      3.4±0.7     5.2±0.5     95.2±2.3
40     2.0±1.0     1.0±0.4      2.3±0.5     3.0±0.3     157.2±2.4
80     2.8±0.5     2.0±0.5      1.9±0.4     2.8±0.6     252.7±2.1

The model contains 400 parameters in the substitution matrix and a cost for "insert/delete". We train this model using the SVM2 and compare against a generative sequence alignment model, where the substitution matrix is computed as $\log\frac{P(x_i, z_j)}{P(x_i)P(z_j)}$ using Laplace estimates. For the generative model, we report the results for an insert/delete cost of $-0.2$, which performs best on the test set. Despite this unfair advantage, the SVM performs better for low training set sizes. For larger training sets, both methods perform similarly, with a small preference for the generative model. However, an advantage of the SVM model is that it is straightforward to train gap penalties. As predicted by Theorem 1, the number of constraints |S| is low. It appears to grow sub-linearly with the number of examples.

5.5. Natural Language Parsing

We test the feasibility of our approach for learning a weighted context-free grammar (see Figure 1) on a subset of the Penn Treebank Wall Street Journal corpus. We consider the 4098 sentences of length at most 10 from sections 2-21 as the training set, and the 163 sentences of length at most 10 from section 22 as the test set. Following the setup in Johnson (1999), we start based on the part-of-speech tags and learn a weighted grammar consisting of all rules that occur in the training data. To solve the argmax in line 6 of the algorithm, we use a modified version of the CKY parser of Mark Johnson³ and incorporated it into SVMlight.

The results are given in Table 5. They show accuracy and micro-averaged F1 for the training and the test set. The first line shows the performance for the generative PCFG model using the maximum likelihood estimate (MLE) as computed by Johnson's implementation. The second line shows the SVM2 with zero-one loss, while the following lines give the results for the F1-loss $\triangle(y_i, y) = (1 - F_1(y_i, y))$ using SVM2^Δs and SVM2^Δm.

³At http://www.cog.brown.edu/~mj/Software.htm


Table 5. Results for learning a weighted context-free grammar on the Penn Treebank. CPU time measured in hours.

            Train            Test             Training Efficiency
Method      Acc     F1       Acc     F1       Const     CPU (%QP)
PCFG        61.4    90.4     55.2    86.0     N/A       0
SVM2        66.3    92.0     58.9    86.2     7494      1.2 (81.6%)
SVM2^Δs     62.2    92.1     58.9    88.5     8043      3.4 (10.5%)
SVM2^Δm     63.5    92.3     58.3    88.4     7117      3.5 (18.0%)

All results are for $C = 1$ and $\epsilon = 0.01$. All values of $C$ between $10^{-1}$ and $10^{2}$ gave comparable results. While the zero-one loss achieves better accuracy (i.e. predicting the complete tree correctly), its F1-score is only marginally better than the MLE's. Using the F1-loss gives substantially better F1-scores, outperforming the MLE substantially. The difference is significant according to a McNemar test on the F1-scores. We conjecture that we can achieve further gains by incorporating more complex features into the grammar, which would be impossible or at best awkward to use in a generative PCFG model. Note that our approach can handle arbitrary models (e.g. with kernels and overlapping features) for which the argmax in line 6 can be computed.

In terms of training time, Table 5 shows that the total number of constraints added to the working set is small. It is roughly twice the number of training examples in all cases. While the training is faster for the zero-one loss, the time for solving the QPs remains roughly comparable. The re-scaling formulations lose time mostly on the argmax in line 6. This might be sped up, since we were using a rather naive algorithm in the experiments.

6. Conclusions

We formulated a Support Vector Method for supervised learning with structured and interdependent outputs. It is based on a joint feature map over input/output pairs, which covers a large class of interesting models including weighted context-free grammars, hidden Markov models, and sequence alignment. Furthermore, the approach is very flexible in its ability to handle application-specific loss functions. To solve the resulting optimization problems, we proposed a simple and general algorithm for which we prove convergence bounds. Our empirical results verify that the algorithm is indeed tractable. Furthermore, we show that the generalization accuracy of our method is at least comparable to, and often exceeds, that of conventional approaches for a wide range of problems. A promising property of our method is that it can be used to train complex models, which would be difficult to handle in a generative setting.

Acknowledgments

The authors would like to thank Lijuan Cai for conducting the experiments on classification with taxonomies. This work was supported by the Paris Kanellakis Dissertation Fellowship, NSF-ITR Grant IIS-0312401, and NSF CAREER Award 0237381.

References

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. ICML.

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP.

Collins, M. (2004). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods.

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Hofmann, T., Tsochantaridis, I., & Altun, Y. (2002). Learning over structured output spaces via joint kernel functions. Sixth Kernel Workshop.

Joachims, T. (2003). Learning to align sequences: A maximum-margin approach (Technical Report). Cornell University.

Johnson, M. (1999). PCFG models of linguistic tree representations. Computational Linguistics.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. NIPS 16.

Vapnik, V. (1998). Statistical Learning Theory. Wiley and Sons Inc.

Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., & Vapnik, V. (2003). Kernel dependency estimation. NIPS 15.

Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Technical Report CSD-TR-98-04). Department of Computer Science, Royal Holloway, University of London.

$\Delta t_i(y) = (1 - F_1(y_i, y))$

20

Page 27: Discriminative Methods for Structured Prediction

Large Margin GMMs

• Multiclass setup, extension to sequence classification

• Simplest GMM: single Gaussian in input space

• Parametrized by centroid $\mu_y \in \mathbb{R}^n$ and orientation matrix $\Psi_y \in \mathbb{R}^{n \times n}$ (inverse of covariance matrix)

• $\Psi_y$ is positive semidefinite: write $\Psi_y \succeq 0$

• Also get a scalar threshold $\theta_y \geq 0$

• Decision rule uses Mahalanobis distance: $h(x) = \arg\min_y \{(x - \mu_y)^T \Psi_y (x - \mu_y) + \theta_y\}$ (sketch below)

[Sha, Saul '07]

21
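A minimal numpy sketch of the decision rule above; all parameter values are invented for illustration, and this is not the authors' code:

    # Sketch: classify x by the smallest score (x - mu_y)^T Psi_y (x - mu_y) + theta_y.
    import numpy as np

    def classify(x, mus, psis, thetas):
        scores = [(x - mu) @ psi @ (x - mu) + theta
                  for mu, psi, theta in zip(mus, psis, thetas)]
        return int(np.argmin(scores))

    mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
    psis   = [np.eye(2), 0.5 * np.eye(2)]      # positive semidefinite orientation matrices
    thetas = [0.0, 0.1]
    print(classify(np.array([1.8, 2.1]), mus, psis, thetas))  # -> 1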

Page 28: Discriminative Methods for Structured Prediction

Notation and Margin

• Define parameter matrix $\Phi_y$ for each class (also positive semidefinite)

• Then can write decision rule in terms of the augmented vector $z = \begin{pmatrix} x \\ 1 \end{pmatrix}$ (see the formulas after the excerpt below)

• Define margin as distance to nearest decision boundary as in [Crammer, Singer '01]

22

… expression. For each class $c$, the reparameterization collects the parameters $\{\mu_c, \Psi_c, \theta_c\}$ in a single enlarged matrix $\Phi_c \in \mathbb{R}^{(d+1)\times(d+1)}$:

$\Phi_c = \begin{pmatrix} \Psi_c & -\Psi_c \mu_c \\ -\mu_c^T \Psi_c & \mu_c^T \Psi_c \mu_c + \theta_c \end{pmatrix}$.   (2)

Note that $\Phi_c$ is positive semidefinite. Furthermore, if $\Phi_c$ is strictly positive definite, the parameters $\{\mu_c, \Psi_c, \theta_c\}$ can be uniquely recovered from $\Phi_c$. With this reparameterization, the decision rule in eq. (1) simplifies to:

$y = \arg\min_c \{ z^T \Phi_c z \}$ where $z = \begin{pmatrix} x \\ 1 \end{pmatrix}$.   (3)

The argument on the right hand side of the decision rule in eq. (3) is linear in the parameters $\Phi_c$. In what follows, we will adopt the representation in eq. (3), implicitly constructing the "augmented" vector $z$ for each input vector $x$. Note that eq. (3) still yields nonlinear (piecewise quadratic) decision boundaries in the vector $z$.

2.2 Margin maximization

Analogous to learning in SVMs, Sha and Saul [14] suggest to find the parameters $\{\Phi_c\}$ that minimize the empirical risk on the training data, i.e., parameters that not only classify the training data correctly, but also place the decision boundaries as far away as possible. The margin of a labeled example is defined as its distance to the nearest decision boundary. If possible, each labeled example is constrained to be at least one unit distance from the decision boundary to each competing class:

$\forall c \neq y_n, \quad z_n^T (\Phi_c - \Phi_{y_n}) z_n \geq 1$.   (4)

Figure 1: Decision boundary in a large margin GMM: labeled examples lie at least one unit of distance away.

Fig. 1 illustrates this idea. Note that in the "realizable" setting where these constraints can be simultaneously satisfied, they do not uniquely determine the parameters $\{\Phi_c\}$, which can be scaled to yield arbitrarily large margins. Therefore, as in SVMs, Sha and Saul [14] propose a convex optimization that selects the "smallest" parameters that satisfy the large margin constraints in eq. (4). In this case, the optimization is an instance of semidefinite programming [16]:

$\min \sum_c \mathrm{trace}(\Psi_c)$  s.t.  $1 + z_n^T (\Phi_{y_n} - \Phi_c) z_n \leq 0$, $\forall c \neq y_n$, $n = 1, 2, \ldots, N$;  $\Phi_c \succeq 0$, $c = 1, 2, \ldots, C$   (5)

Note that the trace of the matrix $\Psi_c$ appears in the above objective function, as opposed to the trace of the matrix $\Phi_c$, as defined in eq. (2); minimizing the former imposes the scale regularization only on the inverse covariance matrices of the GMM, while the latter would improperly regularize the mean vectors as well. The constraints $\Phi_c \succeq 0$ restrict the matrices to be positive semidefinite.

The objective function must be modified for training data that lead to infeasible constraints in eq. (5). As in SVMs, Sha and Saul [14] introduce nonnegative slack variables $\xi_{nc}$ to monitor the amount by which the margin constraints in eq. (4) are violated. The objective function in this setting balances the margin violations versus the scale regularization:

$\min \sum_{nc} \xi_{nc} + \gamma \sum_c \mathrm{trace}(\Psi_c)$  s.t.  $1 + z_n^T (\Phi_{y_n} - \Phi_c) z_n \leq \xi_{nc}$, $\xi_{nc} \geq 0$, $\forall c \neq y_n$, $n = 1, 2, \ldots, N$;  $\Phi_c \succeq 0$, $c = 1, 2, \ldots, C$   (6)

where the balancing hyperparameter $\gamma > 0$ is set by some form of cross-validation. This optimization is also an instance of semidefinite programming.

$h(x) = \arg\min_y \{(x - \mu_y)^T \Psi_y (x - \mu_y) + \theta_y\}$

$\Phi_y = \begin{pmatrix} \Psi_y & -\Psi_y \mu_y \\ -\mu_y^T \Psi_y & \mu_y^T \Psi_y \mu_y + \theta_y \end{pmatrix}$

$h(x) = \arg\min_y \{ z^T \Phi_y z \}$

$\forall y \neq y_i : z_i^T (\Phi_y - \Phi_{y_i}) z_i \geq 1$
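For concreteness, here is a hedged sketch of the slack formulation in eq. (6) using cvxpy (assuming cvxpy with an SDP-capable solver such as SCS is installed); the toy data and the value of the trade-off constant are made up, and this is not the special-purpose solver used by the authors:

    # Sketch of the slack SDP in eq. (6): minimize margin violations plus a trace
    # regularizer on the Psi_c block of each Phi_c, subject to unit-margin
    # constraints and Phi_c >= 0 (PSD). Toy data, invented gamma.
    import numpy as np
    import cvxpy as cp

    d, num_classes, gamma = 2, 2, 0.1
    X = np.array([[0.0, 0.0], [0.2, -0.1], [2.0, 2.0], [1.9, 2.2]])
    y = np.array([0, 0, 1, 1])
    Z = np.hstack([X, np.ones((len(X), 1))])          # augmented vectors z = [x; 1]

    Phi = [cp.Variable((d + 1, d + 1), PSD=True) for _ in range(num_classes)]
    xi = cp.Variable((len(X), num_classes), nonneg=True)

    constraints = []
    for n, zn in enumerate(Z):
        Zn = np.outer(zn, zn)                         # z z^T, so z^T P z = sum(P * Zn)
        for c in range(num_classes):
            if c == y[n]:
                continue
            margin_violation = 1 + cp.sum(cp.multiply(Phi[y[n]] - Phi[c], Zn))
            constraints.append(margin_violation <= xi[n, c])

    # Regularize only the Psi_c blocks (upper-left d x d) of each Phi_c.
    objective = cp.Minimize(cp.sum(xi) + gamma * sum(cp.trace(P[:d, :d]) for P in Phi))
    problem = cp.Problem(objective, constraints)
    problem.solve()
    print(problem.status, problem.value)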

Page 29: Discriminative Methods for Structured Prediction

Optimization Problem

• Use one slack variable per class per data point

• minimize $\sum_{i,y} \xi_{iy} + C \sum_y \mathrm{trace}(\Psi_y)$

• subject to $\forall i, y \neq y_i : 1 + z_i^T (\Phi_{y_i} - \Phi_y) z_i \leq \xi_{iy}$, $\xi_{iy} \geq 0$, and $\forall y : \Phi_y \succeq 0$

• Let mixture component $b_i$ account for $x_i$: per-component constraints $1 + z_i^T (\Phi_{y_i b_i} - \Phi_y) z_i \leq \xi_{iy}$

• Collapse constraints using Viterbi approximation/softmax (see the numerical illustration below)

• minimize $\sum_{i,y} \xi_{iy} + C \sum_{y,b} \mathrm{trace}(\Psi_{yb})$

• subject to $\forall i, y \neq y_i : 1 + z_i^T \Phi_{y_i b_i} z_i + \log \sum_b e^{-z_i^T \Phi_{yb} z_i} \leq \xi_{iy}$, $\xi_{iy} \geq 0$, and $\forall y, b : \Phi_{yb} \succeq 0$

23
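A short numerical illustration of the softmax collapse referenced above: the "soft-min" $-\log\sum_b e^{-d_b}$ never exceeds $\min_b d_b$, so a single constraint written against it implies the margin against every mixture component. The distances below are arbitrary example values:

    # Illustration only: the soft-min lower-bounds the true min of the component
    # distances d_b = z^T Phi_{yb} z, which is why one collapsed constraint suffices.
    import numpy as np
    from scipy.special import logsumexp

    d = np.array([4.0, 2.5, 7.1])          # example component distances
    soft_min = -logsumexp(-d)
    print(soft_min, d.min())                # soft_min <= min_b d_b
    assert soft_min <= d.min()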

Page 30: Discriminative Methods for Structured Prediction

Sequence Classification

• Use HMM to model sequences

• Let $a(i, j)$ be the transition probability from state $i$ to $j$, and $\Phi_s$ the parameters of state $s$

• Discriminant function for state sequence $y$: see the formulas after the excerpt below

• Use Hamming loss (see below)

• Scale margin as in M3N (see below)

24

Table 1: Test error rates on MNIST digit recognition: maximum likelihood versus large margin GMMs.

mixture   EM     margin
1         3.0%   1.4%
2         2.6%   1.4%
4         2.1%   1.2%
8         1.8%   1.5%

Figure 2: Digit "prototypes" from maximum likelihood GMMs (trained by EM) versus large margin GMMs.

… as discriminative models. In sequential classification by CD-HMMs, the goal is to infer the correct hidden state sequence $y = [y_1, y_2, \ldots, y_T]$ given the observation sequence $X = [x_1, x_2, \ldots, x_T]$. In the application to ASR, the hidden states correspond to phoneme labels, and the observations are acoustic feature vectors. Note that if an observation sequence has length $T$ and each label can belong to $C$ classes, then the number of incorrect state sequences grows as $O(C^T)$. This combinatorial explosion presents the main challenge for large margin methods in sequential classification: how to separate the correct hidden state sequence from the exponentially large number of incorrect ones.

The section is organized as follows. Section 3.1 explains the way that margins are computed for sequential classification. Section 3.2 describes how to learn large margin GMMs as components of CD-HMMs. Details are given only for the simple case where the observations in each state are modeled by a single ellipsoid. The extension to multiple mixture components closely follows the approach in section 2.3. Margin-based learning of transition probabilities is similarly straightforward, but likewise omitted for brevity. Both these extensions were implemented, however, for the experiments on phonetic recognition in section 3.3.

3.1 Margin constraints for sequential classification

We start by defining a discriminant function over state (label) sequences of the CD-HMM. Let $a(i, j)$ denote the transition probabilities of the CD-HMM, and let $\Phi_s$ denote the ellipsoid parameters of state $s$. The discriminant function $D(X, s)$ computes the score of the state sequence $s = [s_1, s_2, \ldots, s_T]$ on an observation sequence $X = [x_1, x_2, \ldots, x_T]$ as:

$D(X, s) = \sum_t \log a(s_{t-1}, s_t) - \sum_{t=1}^{T} z_t^T \Phi_{s_t} z_t$.   (10)

This score has the same form as the log-probability $\log P(X, s)$ in a CD-HMM with Gaussian emission densities. The first term accumulates the log-transition probabilities along the state sequence, while the second term accumulates the Mahalanobis distances to each state's centroid.

We introduce margin constraints in terms of the above discriminant function. Let $h(s, y)$ denote the Hamming distance (i.e., the number of mismatched labels) between an arbitrary state sequence $s$ and the target state sequence $y$. Earlier, in section 2 on multiway classification, we constrained each labeled example to lie at least one unit distance from the decision boundary to each competing class; see eq. (4). Here, by extension, we constrain the score of each target sequence to exceed that of each competing sequence by an amount equal to or greater than the Hamming distance:

$\forall s \neq y, \quad D(X, y) - D(X, s) \geq h(s, y)$   (11)

Intuitively, eq. (11) requires that the (log-likelihood) gap between the score of an incorrect sequence $s$ and the target sequence $y$ should grow in proportion to the number of individual label errors. The appropriateness of such proportional constraints for sequential classification was first noted by [15].

3.2 Softmax margin maximization for sequential classification

The challenge of large margin sequence classification lies in the exponentially large number of constraints, one for each incorrect sequence $s$, embodied by eq. (11). We will use the same softmax inequality, previously introduced in section 2.3, to fold these multiple constraints into one, thus …

$y = [y^{(1)}, \ldots, y^{(l)}]$

$\Delta t_i(y) = \sum_{k=1}^{l} I(y^{(k)} \neq y_i^{(k)})$

$D(x, y) = \sum_k \log a(y^{(k-1)}, y^{(k)}) - \sum_{k=1}^{l} z^{(k)T} \Phi_{y^{(k)}} z^{(k)}$

$\forall i, y \neq y_i : D(x_i, y_i) - D(x_i, y) \geq \Delta t_i(y)$
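A minimal numpy sketch (toy parameters, initial-state term omitted) of the sequence discriminant and of the Hamming distance that scales the margin:

    # Sketch: D(X, s) = sum_t log a(s_{t-1}, s_t) - sum_t z_t^T Phi_{s_t} z_t,
    # plus the Hamming distance used in the margin constraint (11).
    import numpy as np

    def discriminant(Z, s, log_a, Phi):
        """Z: (T, d+1) augmented frames; s: state sequence; log_a: (S, S) log-transitions."""
        trans = sum(log_a[s[t - 1], s[t]] for t in range(1, len(s)))
        emit = sum(Z[t] @ Phi[s[t]] @ Z[t] for t in range(len(s)))
        return trans - emit

    def hamming(s, y):
        return sum(int(a != b) for a, b in zip(s, y))

    S, d, T = 2, 2, 3
    rng = np.random.default_rng(0)
    Z = np.hstack([rng.normal(size=(T, d)), np.ones((T, 1))])
    log_a = np.log(np.full((S, S), 0.5))
    Phi = [np.eye(d + 1), 0.5 * np.eye(d + 1)]
    y, s = [0, 0, 1], [0, 1, 1]
    # Margin constraint (11): D(X, y) - D(X, s) >= hamming(s, y)
    print(discriminant(Z, y, log_a, Phi), discriminant(Z, s, log_a, Phi), hamming(s, y))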

Page 31: Discriminative Methods for Structured Prediction

Sequence Optimization

• For single Gaussian, collapsed constraints, optimization is:

• minimize $\sum_i \xi_i + C \sum_y \mathrm{trace}(\Psi_y)$

• s.t. $\forall i : -D(x_i, y_i) + \log \sum_{y \neq y_i} e^{\Delta t_i(y) + D(x_i, y)} \leq \xi_i$, $\xi_i \geq 0$, and $\forall y : \Phi_y \succeq 0$ (see the forward-recursion sketch below)

25
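Because both $D(x, s)$ and the Hamming term decompose over frames, the log-sum over exponentially many state sequences in the collapsed constraint can be computed with a forward (logsumexp) recursion in $O(TS^2)$. The sketch below uses invented per-frame scores and, for simplicity, includes the target sequence itself in the sum:

    # Sketch: forward recursion for log sum_s exp(hamming(s, y) + D(x, s)),
    # where frame_score[t, s] plays the role of -z_t^T Phi_s z_t.
    import numpy as np
    from scipy.special import logsumexp

    def log_sum_over_sequences(frame_score, log_a, y):
        T, S = frame_score.shape
        alpha = frame_score[0] + (np.arange(S) != y[0])        # +1 where s != y_0
        for t in range(1, T):
            ham = (np.arange(S) != y[t]).astype(float)          # +1 per mismatched frame
            alpha = np.array([logsumexp(alpha + log_a[:, s]) + frame_score[t, s] + ham[s]
                              for s in range(S)])
        return logsumexp(alpha)

    T, S = 4, 3
    rng = np.random.default_rng(1)
    frame_score = rng.normal(size=(T, S))
    log_a = np.log(np.full((S, S), 1.0 / S))
    print(log_sum_over_sequences(frame_score, log_a, y=[0, 1, 1, 2]))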

Page 32: Discriminative Methods for Structured Prediction

Experiments

• TIMIT speech corpus: read speech, MFCC features, 39 classes

• Apply Viterbi decoding, compute error rate

• Count total substitutions, insertions, and deletions

• Compare to baseline EM training

• Also compare to maximum mutual information (MMI) training: relative error rate reduction

26

Table 2: Error rates in phonetic recognition by differently trained GMMs. See text for details.

mixture (per state)   baseline (EM)   margin (frame)   margin (utterance)
1                     40.1%           36.3%            31.2%
2                     36.5%           33.5%            30.8%
4                     34.7%           32.6%            29.8%
8                     32.7%           31.0%            28.2%

Table 3: Relative reductions in error rates from large margin GMMs (section 3) versus MMI [5]. See text for details.

mixture   margin   MMI
1         22%      17%
2         16%      10%
4         14%      6%
8         14%      n/a

… frame-based classification, and a proper discriminative CD-HMM whose large margin GMMs were trained for sequential classification. The baseline CD-HMM is significantly outperformed by both of the discriminative CD-HMMs. The best system, however, is clearly obtained from the learning algorithm for sequential classification.

Discriminative learning of CD-HMMs is an active research area in ASR. Two types of algorithms have been widely used: maximum mutual information (MMI) [18] and minimum classification error [4]. Table 3 compares the relative reductions in error rates¹ obtained by the large margin GMMs in this paper and the MMI training described in [5]. The large margin GMMs lead to larger relative reductions in error rates, owing perhaps to the absence of local minima in the objective function or to the margin-based learning based on Hamming distances, as discussed in section 3.2.

4 Discussion

Discriminative learning of sequential models is an active area of research in both ASR [10, 13, 18] and machine learning [1, 6, 15]. This paper makes contributions to lines of work in both communities. Unlike previous work in ASR, we have proposed a convex, margin-based cost function that penalizes incorrect decodings in proportion to their Hamming distance from the desired transcription. The use of the Hamming distance in this context is a crucial insight from the work of [15] in the machine learning community, and it differs profoundly from merely penalizing the log-likelihood gap between incorrect and correct transcriptions, as commonly done in ASR. Unlike previous work in machine learning, we have proposed a framework for sequential classification that naturally integrates with the infrastructure of modern speech recognizers. Using the softmax function, we have also proposed a novel way to monitor the exponentially many margin constraints that arise in sequential classification. For real-valued observation sequences, we have shown how to train large margin GMMs via convex optimizations over their parameter space of positive definite matrices. Finally, we have demonstrated that these learning algorithms lead to improved sequential classification on data sets with over one million training examples (i.e., phonetically labeled frames of speech). In ongoing work, we are performing additional experiments to compare more directly with MMI training and to understand more fully the possible benefits of margin-based discriminative training.

A Solver

The optimizations in eqs. (5), (6), (9) and (14) are convex: specifically, in terms of the matrices that parameterize large margin GMMs, the objective functions are linear, while the constraints define convex sets. Despite being convex, however, these optimizations cannot be managed by off-the-shelf numerical optimization solvers or generic interior point methods for problems as large as the ones in this paper. We devised our own special-purpose solver for these purposes.

For simplicity, we describe our solver for the optimization of eq. (6), noting that it is easily extended to eqs. (9) and (14). To begin, we eliminate the slack variables and rewrite the objective function in terms of the hinge loss function: hinge(z) = max(0, z). This yields the objective function:

$L = \sum_{n, c \neq y_n} \mathrm{hinge}\left(1 + z_n^T (\Phi_{y_n} - \Phi_c) z_n\right) + \gamma \sum_c \mathrm{trace}(\Psi_c)$,   (15)

¹Our CD-HMMs also obtained lower absolute phone error rates than those in [5]. It seems more revealing, however, to compare relative reductions in error rates, which are less affected by implementation details.
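A toy numpy sketch that simply evaluates the hinge-loss objective of eq. (15) for given matrices (no optimization is performed); the data and parameter values are invented:

    # Sketch: evaluate L = sum hinge(1 + z^T (Phi_{y_n} - Phi_c) z) + gamma * sum trace(Psi_c).
    import numpy as np

    def hinge(v):
        return max(0.0, v)

    def objective(Z, y, Phi, gamma, d):
        loss = 0.0
        for zn, yn in zip(Z, y):
            for c in range(len(Phi)):
                if c != yn:
                    loss += hinge(1.0 + zn @ (Phi[yn] - Phi[c]) @ zn)
        reg = gamma * sum(np.trace(P[:d, :d]) for P in Phi)   # trace of the Psi_c blocks
        return loss + reg

    d = 2
    X = np.array([[0.0, 0.0], [2.0, 2.0]]); y = [0, 1]
    Z = np.hstack([X, np.ones((2, 1))])
    Phi = [np.diag([1.0, 1.0, 0.0]), np.diag([0.5, 0.5, 0.0])]
    print(objective(Z, y, Phi, gamma=0.1, d=d))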


Page 33: Discriminative Methods for Structured Prediction

Regression for Sequences

• Claim: regression incorporates structure information more naturally than classification

• 0/1 loss ignores similarities between labels, not good

• Other loss functions "hack" the objective with a closeness measure, but in regression this is free

• Sequence setup: input space $= X^*$, output space $= Y^*$ (strings over alphabets $X$ and $Y$)

• Separate feature functions for input/output: $\Phi_X : X^* \to F_X$ with $\dim(F_X) = N_1$, and $\Phi_Y : Y^* \to F_Y = \mathbb{R}^{N_2}$ (example below)

[Cortes, Mohri, Weston '06]

27
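One concrete and commonly used choice of output feature map $\Phi_Y$ is an n-gram count vector, as in string kernels; the sketch below is only an illustration of that choice, not something prescribed by the framework:

    # Illustration: map a string y to a sparse count vector of its n-grams.
    from collections import Counter

    def ngram_features(s, n=2):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    print(ngram_features("abab"))   # Counter({'ab': 2, 'ba': 1})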

Page 34: Discriminative Methods for Structured Prediction

Regression: Two Problems

• Regression problem: learn hypothesis $g : X^* \to F_Y$

• Gives feature vector matching input

• Corresponds to learning parameters $w$ in classification

• Preimage problem: given the vector in $F_Y$, figure out the $y \in Y^*$

• Formulate as $f(x) = \arg\min_{y \in Y^*} \|g(x) - \Phi_Y(y)\|^2$

• Can be viewed as classifying the input

28

Figure 1.1 Decomposition of the string-to-string mapping learning problem into a regression problem (learning $g$) and a pre-image problem (computing $\Phi_Y^{-1}$ and using $g$ to determine the string-to-string mapping $f$).

But, that extra step is not necessary and we will not require it, thereby simplifying that framework.

In the following two sections, we examine in more detail each of the two problems just mentioned (regression and pre-image problems) and present general algorithms for both.

1.3 Regression Problems and Algorithms

This section describes general methods for regression estimation when the dimension of the image vector space is greater than one. The objective functions and algorithms presented are not specific to the problem of learning string-to-string mapping and can be used in other contexts, but they constitute a key step in learning complex string-to-string mappings.

Different regression methods can be used to learn $g$, including kernel ridge regression (Saunders et al., 1998), Support Vector Regression (SVR) (Vapnik, 1995), or Kernel Matching Pursuit (KMP) (Vincent and Bengio, 2000). SVR and KMP offer the advantage of sparsity and fast training. But, a crucial advantage of kernel ridge regression in this context is, as we shall see, that it requires a single matrix inversion, independently of $N_2$, the number of features predicted. Thus, in the following we will consider a generalization of kernel ridge regression.

The hypothesis space we will assume is the set of all linear functions from $F_X$ to $F_Y$. Thus, $g$ is modeled as

$\forall x \in X^*, \quad g(x) = W(\Phi_X(x))$,   (1.3)

where $W : F_X \to F_Y$ is a linear function admitting an $N_2 \times N_1$ real-valued matrix representation $\mathbf{W}$.

We start with a regression method generalizing kernel ridge regression to the case of vector space images. We will then further generalize this method to allow for the encoding of constraints between the input and output vectors.

… speed up training. Section 1.6 compares our framework and algorithm with several other algorithms proposed for learning string-to-string mapping. Section 1.7 reports the results of our experiments in several tasks.

1.2 General Formulation

This section presents a general and simple regression formulation of the problem of learning string-to-string mappings.

Let $X$ and $Y$ be the alphabets of the input and output strings. Assume that a training sample of size $m$ drawn according to some distribution $D$ is given:

$(x_1, y_1), \ldots, (x_m, y_m) \in X^* \times Y^*$.   (1.1)

The learning problem that we consider consists of finding a hypothesis $f : X^* \to Y^*$ out of a hypothesis space $H$ that predicts accurately the label $y \in Y^*$ of a string $x \in X^*$ drawn randomly according to $D$. In standard regression estimation problems, labels are real-valued numbers, or more generally elements of $\mathbb{R}^N$ with $N \geq 1$. Our learning problem can be formulated in a similar way after the introduction of a feature mapping $\Phi_Y : Y^* \to F_Y = \mathbb{R}^{N_2}$. Each string $y \in Y^*$ is thus mapped to an $N_2$-dimensional feature vector $\Phi_Y(y) \in F_Y$.

As shown by the diagram of Figure 1.1, our original learning problem is now decomposed into the following two problems:

Regression problem: The introduction of $\Phi_Y$ leads us to the problem of learning a hypothesis $g : X^* \to F_Y$ predicting accurately the feature vector $\Phi_Y(y)$ for a string $x \in X^*$ with label $y \in Y^*$, drawn randomly according to $D$.

Pre-image problem: To predict the output string $f(x) \in Y^*$ associated to $x \in X^*$, we must determine the pre-image of $g(x)$ by $\Phi_Y$. We define $f(x)$ by:

$f(x) = \arg\min_{y \in Y^*} \|g(x) - \Phi_Y(y)\|^2$,   (1.2)

which provides an approximate pre-image when an exact pre-image does not exist ($\Phi_Y^{-1}(g(x)) = \emptyset$).

As with all regression problems, input strings in $X^*$ can also be mapped to a Hilbert space $F_X$ with $\dim(F_X) = N_1$, via a mapping $\Phi_X : X^* \to F_X$. Both mappings $\Phi_X$ and $\Phi_Y$ can be defined implicitly through the introduction of positive definite symmetric kernels $K_X$ and $K_Y$ such that for all $x, x' \in X^*$, $K_X(x, x') = \Phi_X(x) \cdot \Phi_X(x')$ and for all $y, y' \in Y^*$, $K_Y(y, y') = \Phi_Y(y) \cdot \Phi_Y(y')$.

This description of the problem can be viewed as a simpler formulation of the so-called Kernel Dependency Estimation (KDE) of (Weston et al., 2002). In the original presentation of KDE, the first step consisted of using $K_Y$ and Kernel Principal Components Analysis to reduce the dimension of the feature space $F_Y$.


Page 35: Discriminative Methods for Structured Prediction

Kernel Ridge Regression

• Many regression formulations, e.g., SV Regression [Vapnik '95], Kernel Matching Pursuit [Vincent, Bengio '00]

• Kernel Ridge Regression [Saunders et al. '98]: only one matrix inversion, independent of the output feature dimensionality $N_2$

• Linear regression function: $g(x) = W \Phi_X(x)$

• In matrix notation: $M_X = [\Phi_X(x_1) \ldots \Phi_X(x_m)] \in \mathbb{R}^{N_1 \times m}$, $M_Y = [\Phi_Y(y_1) \ldots \Phi_Y(y_m)] \in \mathbb{R}^{N_2 \times m}$

• Optimization (sketch below): $\arg\min_{W \in \mathbb{R}^{N_2 \times N_1}} \gamma \|W\|_F^2 + \|W M_X - M_Y\|_F^2$

29
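A minimal numpy sketch of the optimization above with explicit feature matrices, using the dual solution $W = M_Y (K_X + \gamma I)^{-1} M_X^T$; all data are random toy values:

    # Sketch: kernel ridge regression with vector-valued outputs (dual form).
    import numpy as np

    rng = np.random.default_rng(0)
    N1, N2, m, gamma = 5, 4, 10, 0.1
    MX = rng.normal(size=(N1, m))            # columns Phi_X(x_i)
    MY = rng.normal(size=(N2, m))            # columns Phi_Y(y_i)

    KX = MX.T @ MX                           # Gram matrix
    W = MY @ np.linalg.solve(KX + gamma * np.eye(m), MX.T)   # dual solution

    phi_x_new = rng.normal(size=N1)
    g_x = W @ phi_x_new                      # predicted output feature vector
    print(g_x.shape)                         # (4,)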

Page 36: Discriminative Methods for Structured Prediction

Solving Optimization

• Primal solution: $W = M_Y M_X^T (M_X M_X^T + \gamma I)^{-1}$

• Dual solution: $W = M_Y (K_X + \gamma I)^{-1} M_X^T$

• $K_X$: Gram kernel matrix, $K_{X_{ij}} = K_X(x_i, x_j)$

• To solve pre-image problem, no explicit computation of $W$ is needed, only kernel evaluations (sketch below)

30
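To illustrate the last bullet, here is a hedged sketch of kernel-only pre-image scoring over a finite candidate set, $f(x) = \arg\min_y K_Y(y, y) - 2 (K_Y^y)^T (K_X + \gamma I)^{-1} K_X^x$; the n-gram count kernels and the toy strings are assumptions made for illustration:

    # Sketch: pre-image selection using kernels only, so W is never formed.
    import numpy as np
    from collections import Counter

    def ngrams(s, n=2):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    def k(a, b, n=2):
        ca, cb = ngrams(a, n), ngrams(b, n)
        return float(sum(ca[g] * cb[g] for g in ca))

    X_train = ["abc", "abd", "bcd"]
    Y_train = ["ABC", "ABD", "BCD"]
    gamma = 0.1

    KX = np.array([[k(a, b) for b in X_train] for a in X_train])
    KX_inv = np.linalg.inv(KX + gamma * np.eye(len(X_train)))

    def predict(x, candidates):
        kx = np.array([k(x, xi) for xi in X_train])
        best, best_score = None, np.inf
        for y in candidates:
            ky = np.array([k(y, yi) for yi in Y_train])
            score = k(y, y) - 2.0 * ky @ KX_inv @ kx
            if score < best_score:
                best, best_score = y, score
        return best

    print(predict("abc", candidates=["ABC", "ABD", "BCD"]))   # likely "ABC"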

1.3.1 Kernel Ridge Regression with Vector Space Images

For $i = 1, \ldots, m$, let $M_{x_i} \in \mathbb{R}^{N_1 \times 1}$ denote the column matrix representing $\Phi_X(x_i)$ and $M_{y_i} \in \mathbb{R}^{N_2 \times 1}$ the column matrix representing $\Phi_Y(y_i)$. We will denote by $\|A\|_F^2 = \sum_{i=1}^{p}\sum_{j=1}^{q} A_{ij}^2$ the Frobenius norm of a matrix $A = (A_{ij}) \in \mathbb{R}^{p \times q}$ and by $\langle A, B\rangle_F = \sum_{i=1}^{p}\sum_{j=1}^{q} A_{ij} B_{ij}$ the Frobenius product of two matrices $A$ and $B$ in $\mathbb{R}^{p \times q}$. The following minimization problem:

$\arg\min_{W \in \mathbb{R}^{N_2 \times N_1}} F(W) = \sum_{i=1}^{m} \|W M_{x_i} - M_{y_i}\|^2 + \gamma \|W\|_F^2$,   (1.4)

where $\gamma \geq 0$ is a regularization scalar coefficient, generalizes ridge regression to vector space images. The solution $W$ defines the linear hypothesis $g$. Let $M_X \in \mathbb{R}^{N_1 \times m}$ and $M_Y \in \mathbb{R}^{N_2 \times m}$ be the matrices defined by:

$M_X = [M_{x_1} \ldots M_{x_m}] \qquad M_Y = [M_{y_1} \ldots M_{y_m}]$.   (1.5)

Then, the optimization problem 1.4 can be re-written as:

$\arg\min_{W \in \mathbb{R}^{N_2 \times N_1}} F(W) = \|W M_X - M_Y\|_F^2 + \gamma \|W\|_F^2$.   (1.6)

Proposition 1 The solution of the optimization problem 1.6 is unique and is given by either one of the following identities:

$W = M_Y M_X^T (M_X M_X^T + \gamma I)^{-1}$ (primal solution)
$W = M_Y (K_X + \gamma I)^{-1} M_X^T$ (dual solution).   (1.7)

where $K_X \in \mathbb{R}^{m \times m}$ is the Gram matrix associated to the kernel $K_X$: $K_{ij} = K_X(x_i, x_j)$.

Proof The function $F$ is convex and differentiable, thus its solution is unique and given by $\nabla_W F = 0$. Its gradient is given by:

$\nabla_W F = 2(W M_X - M_Y) M_X^T + 2\gamma W$.   (1.8)

Thus,

$\nabla_W F = 0 \Leftrightarrow 2(W M_X - M_Y) M_X^T + 2\gamma W = 0 \Leftrightarrow W(M_X M_X^T + \gamma I) = M_Y M_X^T \Leftrightarrow W = M_Y M_X^T (M_X M_X^T + \gamma I)^{-1}$,   (1.9)

which gives the primal solution of the optimization problem. To derive the dual solution, observe that

$M_X^T (M_X M_X^T + \gamma I)^{-1} = (M_X^T M_X + \gamma I)^{-1} M_X^T$.   (1.10)



This can be derived without difficulty from a series expansion of $(M_X M_X^T + \gamma I)^{-1}$. Since $K_X = M_X^T M_X$,

$W = M_Y (M_X^T M_X + \gamma I)^{-1} M_X^T = M_Y (K_X + \gamma I)^{-1} M_X^T$,   (1.11)

which is the second identity giving $W$.

For both solutions, a single matrix inversion is needed. In the primal case, the complexity of that matrix inversion is in $O(N_1^3)$, or $O(N_1^{2+\alpha})$ with $\alpha < .376$, using the best known matrix inversion algorithms. When $N_1$, the dimension of the feature space $F_X$, is not large, this leads to an efficient computation of the solution. For large $N_1$ and relatively small $m$, the dual solution is more efficient since the complexity of the matrix inversion is then in $O(m^3)$, or $O(m^{2+\alpha})$.

Note that in the dual case, predictions can be made using kernel functions alone, as $W$ does not have to be explicitly computed. For any $x \in X^*$, let $M_x \in \mathbb{R}^{N_1 \times 1}$ denote the column matrix representing $\Phi_X(x)$. Thus, $g(x) = W M_x$. For any $y \in Y^*$, let $M_y \in \mathbb{R}^{N_2 \times 1}$ denote the column matrix representing $\Phi_Y(y)$. Then, $f(x)$ is determined by solving the pre-image problem:

$f(x) = \arg\min_{y \in Y^*} \|W M_x - M_y\|^2$   (1.12)
$= \arg\min_{y \in Y^*} \left( M_y^T M_y - 2 M_y^T W M_x \right)$   (1.13)
$= \arg\min_{y \in Y^*} \left( M_y^T M_y - 2 M_y^T M_Y (K_X + \gamma I)^{-1} M_X^T M_x \right)$   (1.14)
$= \arg\min_{y \in Y^*} \left( K_Y(y, y) - 2 (K_Y^y)^T (K_X + \gamma I)^{-1} K_X^x \right)$,   (1.15)

where $K_Y^y \in \mathbb{R}^{m \times 1}$ and $K_X^x \in \mathbb{R}^{m \times 1}$ are the column matrices defined by:

$K_Y^y = \begin{pmatrix} K_Y(y, y_1) \\ \vdots \\ K_Y(y, y_m) \end{pmatrix}$ and $K_X^x = \begin{pmatrix} K_X(x, x_1) \\ \vdots \\ K_X(x, x_m) \end{pmatrix}$.   (1.16)

1.3.2 Generalization to Regression with Constraints

In many string-to-string mapping learning tasks such as those appearing in natural language processing, there are some specific constraints relating the input and output sequences. For example, in part-of-speech tagging, a tag in the output sequence must match the word in the same position in the input sequence. More generally, one may wish to exploit the constraints known about the string-to-string mapping to restrict the hypothesis space and achieve a better result.

This section shows that our regression framework can be generalized in a natural way to impose some constraints on the regression matrix $W$. Remarkably, this generalization also leads to a closed form solution and to an efficient iterative algorithm. Here again, the algorithm presented is not specific to string-to-string learning problems and can be used for other regression estimation problems.


$$\operatorname*{argmin}_{W \in \mathbb{R}^{N_2 \times N_1}} \; \gamma \|W\|_F^2 + \|W M_X - M_Y\|_F^2$$
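To make the dual solution (1.11) and the pre-image objective (1.15) concrete, here is a minimal NumPy sketch. It assumes a finite candidate set of outputs and uses toy bag-of-bigram kernels in place of the chapter's feature maps; every function and variable name below is illustrative, not taken from the chapter, and the exhaustive candidate search merely stands in for the Eulerian-circuit pre-image algorithm discussed next.

```python
import numpy as np

def fit_dual(KX, gamma):
    """Return the m x m matrix (K_X + gamma I)^{-1} used by the dual solution
    W = M_Y (K_X + gamma I)^{-1} M_X^T (Eq. 1.11); W itself is never formed."""
    m = KX.shape[0]
    return np.linalg.inv(KX + gamma * np.eye(m))

def preimage_score(y, x, A, xs, ys, k_x, k_y):
    """Score of Eq. (1.15): K_Y(y, y) - 2 (K_Y^y)^T (K_X + gamma I)^{-1} K_X^x.
    Smaller is better; f(x) is the candidate y minimizing this score."""
    Ky_y = np.array([k_y(y, yi) for yi in ys])   # K_Y^y, shape (m,)
    Kx_x = np.array([k_x(x, xi) for xi in xs])   # K_X^x, shape (m,)
    return k_y(y, y) - 2.0 * Ky_y @ A @ Kx_x

def predict(x, candidates, A, xs, ys, k_x, k_y):
    """Pre-image by exhaustive search over a finite candidate set."""
    return min(candidates, key=lambda y: preimage_score(y, x, A, xs, ys, k_x, k_y))

# Toy usage with bag-of-bigram kernels on strings.
def ngrams(s, n=2):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_kernel(s, t, n=2):
    cs, ct = {}, {}
    for g in ngrams(s, n): cs[g] = cs.get(g, 0) + 1
    for g in ngrams(t, n): ct[g] = ct.get(g, 0) + 1
    return sum(c * ct.get(g, 0) for g, c in cs.items())

xs = ["abcd", "bcda", "cdab"]; ys = ["xyz", "yzx", "zxy"]
KX = np.array([[ngram_kernel(a, b) for b in xs] for a in xs], dtype=float)
A = fit_dual(KX, gamma=0.1)
print(predict("abcd", candidates=ys, A=A, xs=xs, ys=ys,
              k_x=ngram_kernel, k_y=ngram_kernel))
```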

Page 37

Pre-Image/Efficiency

• Pre-image problem discussed for n-gram kernels

• Efficient solution via new Eulerian circuit algorithm

• Pre-image computation required only at test time

• Classification computes pre-image at train and test

• Usually requires simplification, e.g., Markov assumption

• Number of dual variables in regression: m

• In classification m|Y|, but solution more sparse

• Incremental training algorithm speeds up training

31
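The Eulerian-circuit idea for n-gram pre-images can be illustrated with a short sketch: treat each predicted bigram as a directed edge and read the output string off an Eulerian path of the resulting multigraph (Hierholzer's algorithm). This is only an illustrative sketch under the assumption that the thresholded counts actually admit such a path; it is not the exact algorithm of [Cortes, Mohri, Weston ’06].

```python
from collections import defaultdict

def string_from_bigram_counts(bigram_counts):
    """Rebuild a string whose bigram multiset matches `bigram_counts`
    (a dict mapping (a, b) -> integer count), if an Eulerian path exists."""
    adj = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for (a, b), c in bigram_counts.items():
        for _ in range(int(c)):
            adj[a].append(b)
            out_deg[a] += 1
            in_deg[b] += 1
    # Start at a node with out-degree exceeding in-degree (path case),
    # otherwise at any node with outgoing edges (circuit case).
    nodes = set(out_deg) | set(in_deg)
    start = next((v for v in nodes if out_deg[v] - in_deg[v] == 1), None)
    if start is None:
        start = next(v for v in nodes if out_deg[v] > 0)
    # Hierholzer's algorithm over the bigram multigraph.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return "".join(path)

# e.g. the bigrams of "banana": ba, an, na, an, na
print(string_from_bigram_counts({("b", "a"): 1, ("a", "n"): 2, ("n", "a"): 2}))
```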

Page 38

Experiments

• Same dataset as M3Ns: handwritten digits

1. Hand-segmented data

2. No segmentation, n-gram kernels:

• Best results with n = 1, d = 2

32

Table 1.1  Experimental results with the perfect segmentation setting. The M3N and SVM results are read from the graph in Taskar et al. (2003).

  Technique                   Accuracy
  REG-constraints (λ = 0)     84.1% ± .8%
  REG-constraints (λ = 1)     88.5% ± .9%
  REG                         79.5% ± .4%
  REG-Viterbi (n = 2)         86.1% ± .7%
  REG-Viterbi (n = 3)         98.2% ± .3%
  SVMs (cubic kernel)         80.9% ± .5%
  M3Ns (cubic kernel)         87.0% ± .4%

In our experiments, we used the efficient iterative method outlined in Section 1.3.2 to compute W. However, it is not hard to see that in this case, thanks to the simplicity of the constraint matrices $A_i$, $i = 1, \ldots, C$, the resulting matrix $(\langle A_i, A_j U^{-1} \rangle_F)_{ij} + I$ can be given a simple block structure that makes it easier to invert. Indeed, it can be decomposed into $l$ blocks in $\mathbb{R}^{N_1 \times N_1}$ that can be inverted independently. Thus, the overall complexity of matrix inversion, which dominates the cost of the algorithm, is only $O(l N_1^3)$ here.

Table 1.1 reports the results of our experiments using a polynomial kernel of third degree for $K_X$, and the best empirical value for the ridge regression coefficient $\gamma$, which was $\gamma = 0.01$. The accuracy is measured as the percentage of the total number of word characters correctly predicted. REG refers to our general regression technique, REG-constraints to our general regression with constraints. The results obtained with the regularization parameter $\lambda$ set to 1 are compared to those with no constraint, i.e., $\lambda = 0$. When $\lambda = 1$, the constraints are active and we observe a significant improvement of the generalization performance.

For comparison, we also trained a single predictor over all images of our training sample, regardless of their positions. This resulted in a regression problem with a 26-dimensional output space, and $m \approx 5{,}000$ examples in each training set. This effectively corresponds to a hard weight-sharing of the coefficients corresponding to different positions within matrix W, as described in Section 1.3.2. The first 26 lines of W are repeated $(l - 1)$ times. That predictor can be applied independently to each image segment $x_i$ of a sequence $x = x_1 \cdots x_q$. Here too, we used a polynomial kernel of third degree and $\gamma = 0.01$ for the ridge regression parameter. We also experimented with the use of an n-gram statistical model based on the words of the training data to help discriminate between different word sequence hypotheses, as mentioned in Section 1.4.8.

Table 1.1 reports the results of our experiments within this setting. REG refers to the hard weight-sharing regression without using a statistical language model and is directly comparable to the results obtained using Support Vector Machines (SVMs). REG-Viterbi with n = 2 or n = 3 corresponds to the results obtained within this setting using different n-gram order statistical language models. In this case, we used the Viterbi algorithm to compute the pre-image solution, as in M3Ns. The results show that coupling a simple predictor with a more sophisticated pre-image algorithm can significantly improve the performance. The high accuracy achieved in this setting can be viewed as reflecting the simplicity of this task. The dataset contains only 55 unique words, and the same words appear in both the training and the test set.

We also compared all these results with the best result reported by Taskar et al. (2003) for the same problem and dataset. The experiment allowed us to compare these results with those obtained using M3Ns. But we are interested in more complex string-to-string prediction problems where restrictive prior knowledge such as a one-to-one mapping is not available. Our second set of experiments corresponds to a more realistic and challenging setting.

1.7.3 String-to-String Prediction

Our method generalizes indeed to the much harder and more realistic problem where the input and output strings may be of different length and where no prior segmentation or one-to-one mapping is given. For this setting, we directly estimate the counts of all the n-grams of the output sequence from one set of input features and use our pre-image algorithm to predict the output sequence.

In our experiment, we chose the following polynomial kernel $K_X$ between two image sequences:

$$K_X(x_1, x_2) = \sum_{x_{1,i},\, x_{2,j}} (1 + x_{1,i} \cdot x_{2,j})^{d}, \qquad (1.72)$$

where the sum runs over all n-grams $x_{1,i}$ and $x_{2,j}$ of the input sequences $x_1$ and $x_2$. The n-gram order and the degree $d$ are both parameters of the kernel. For the kernel $K_Y$ we used n-gram kernels.

As a result of the regression, the output values, that is, the estimated counts of the individual n-grams in an output string, are non-integers and need to be discretized for the Euler circuit computation, see Section 1.4.8. The output words are in general short, so we did not anticipate counts higher than one. Thus, for each output feature, we determined just one threshold above which we set the count to one. Otherwise, we set the count to zero. These thresholds were determined by examining one feature at a time and imposing that, averaged over all the strings in the training set, the correct count of the n-gram be predicted.

Note that, as a consequence of this thresholding, the predicted strings do not always have the same length as the target strings. Extra and missing characters are counted as errors in our evaluation of the accuracy.

We obtained the best results using unigrams and second-degree polynomials in the input space and bigrams in the output space. For this setting, we obtained an accuracy of 65.3 ± 2.3.

A significantly higher accuracy can be obtained by combining the predicted integer counts from several different input and output kernel regressors, and computing an Euler circuit using only the n-grams predicted by the majority of the regressors. Combining 5 such regressors, we obtained a test accuracy of 75.6 ± 1.5.

A performance degradation for this setting was naturally expected, but we view it as relatively minor given the increased difficulty of the task. Furthermore, our results can be improved by combining a statistical model with our pre-image algorithm.

1.7.4 Faster Training

As pointed out in Section 1.5, faster training is needed when the size of the training data increases significantly. This section compares the greedy incremental technique described in that section with the partial Cholesky decomposition technique and the baseline of randomly sub-sampling n points from the data, which of course also results in reduced complexity, giving only an n × n matrix to invert. The different techniques are compared in the perfect segmentation setting on the first fold of the data. The results should be indicative of the performance gain in other folds.

Figure 1.7  Comparison of random sub-sampling of n points from the OCR dataset, incomplete Cholesky decomposition after n iterations, and greedy incremental learning with n basis functions (curves RANDOM, CHOLESKY, GREEDY; test error plotted against n). The main bottleneck for all of these algorithms is the matrix inversion, where the size of the matrix is n × n; we therefore plot test error against n. The furthest right point is the test error rate of training on the full training set of n = 4,617 examples.

In both partial Cholesky decomposition and greedy incremental learning, n iterations are run and then an n × n matrix is inverted, which may be viewed as the bottleneck. Thus, to determine the learning speed we plot the test error for the regression problem versus n. The results are shown in Figure 1.7. The greedy learning technique leads to a considerable reduction in the number of kernel computations required and the matrix inversion size for the same error rate as

Page 39

Search-based SP

• Searn: view structured prediction as search problem

• SP: distribution D over inputs, output costs (x, c), with |c| = |Y|

• e.g.: $x_i$ is the input, $c_y$ is the loss for assigning any $y$ to the true label $y_i$

• Define loss of cost-sensitive classifier $h : X \to Y$ as $L(D, h) = E_{(x,c)\sim D}[c_{h(x)}]$

• View outputs as vectors $y = [y^{(1)}, \ldots, y^{(l)}]$, but classification problems not limited to sequences

• A classifier defines a path through space of input/output pairs, and training process iteratively refines the classifier

33

[Daumé ’06] [Daumé, Langford, Marcu ’07]

As a simple example, consider a parsing problem under F1 loss. In this case, D is a distribution over (x, c) where x is an input sequence and, for all trees y with |x|-many leaves, $c_y$ is the F1 loss of y on the "true" output.

The goal of structured prediction is to find a function $h : X \to Y$ that minimizes the loss given in Eq (1).

$$L(D, h) = E_{(x,c)\sim D}\left[ c_{h(x)} \right] \qquad (1)$$

The algorithm we present is based on the view that a vector $y \in Y$ can be produced by predicting each component $(y_1, \ldots, y_T)$ in turn, allowing for dependent predictions. This is important for coping with general loss functions. For a data set $(x_1, c_1), \ldots, (x_N, c_N)$ of structured prediction examples, we write $T_n$ for the length of the longest search path on example n, and $T_{max} = \max_n T_n$.

3  The Searn Algorithm

There are several vital ingredients in any application of Searn: a search space for decomposing the prediction problem; a cost-sensitive learning algorithm; labeled structured prediction training data; a known loss function for the structured prediction problem; and a good initial policy. These aspects are described in more detail below.

A search space S.  The choice of search space plays a role similar to the choice of structured decomposition in other algorithms. Final elements of the search space can always be referenced by a sequence of choices y. In simple applications of Searn the search space is concrete. For example, it might consist of the parts of speech of each individual word in a sentence. In general, the search space can be abstract, and we show this can be beneficial experimentally. An abstract search space comes with an (unlearned) function f(y) which turns any sequence of predictions in the abstract search space into an output of the correct form. (For a concrete search space, f is just the identity function. To minimize confusion, we will leave off f in future notation unless its presence is specifically important.)

A cost-sensitive learning algorithm A.  The learning algorithm returns a multiclass classifier h(s) given cost-sensitive training data. Here s is a description of the location in the search space. A reduction of cost-sensitive classification to binary classification [4] reduces the requirement to a binary learning algorithm. Searn relies upon this learning algorithm to form good generalizations. Nothing else in the Searn algorithm attempts to achieve generalization or estimation. The performance of Searn is strongly dependent upon how capable the learned classifier is. We call the learned classifier a policy because it is used multiple times on inputs which it affects, just as in reinforcement learning.


Page 40

Searn Specifics

• Searn is a meta-algorithm. Need to provide:

• Cost-sensitive learning algorithm (i.e. with loss)

• Initial classifier

• Loss function

• Initial classifier should have low training error, but need not generalize well

• Could be best path from any standard search algorithm

• Each Searn iteration finds a classifier that is not as good on the training set, but generalizes a little better

34

Page 41

Searn Training

• Search state space: (input, partial output): $s = (x, y^{(1)}, \ldots, y^{(l)})$

• Initial classifier: pick next label that minimizes cost, assuming that all future decisions are also optimal:
  $h_0(s, c) = \operatorname*{argmin}_{y^{(l+1)}} \; \min_{y^{(l+2)}, \ldots, y^{(L)}} c_{(y^{(1)}, \ldots, y^{(L)})}$

• Iterative step: use current classifier $h$ to construct a set of examples to train the next classifier; then interpolate

• For each state, try every possible next output

• Cost assigned to each output tried is the loss difference:
  $\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

35
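The per-action cost above can be estimated by rolling the current policy out from each candidate next action. The sketch below is a minimal illustration of that idea; the `rollout`, `policy`, and `cost` interfaces are assumptions made for the example, not the interfaces used in the Searn papers.

```python
def rollout(x, prefix, action, policy, length):
    """Complete a structured output: commit to `action` at the current position,
    then let `policy` choose every remaining component (a single rollout)."""
    y = list(prefix) + [action]
    while len(y) < length:
        y.append(policy(x, y))
    return tuple(y)

def per_action_costs(x, prefix, actions, policy, length, cost):
    """Estimate l_h(c, s, a) = E_{y~(s,a,h)} c_y - min_{a'} E_{y~(s,a',h)} c_y
    with one rollout per action; `cost(y)` plays the role of c_y."""
    raw = {a: cost(rollout(x, prefix, a, policy, length)) for a in actions}
    best = min(raw.values())
    return {a: raw[a] - best for a in actions}

# Toy usage: 4 labels, Hamming cost against a fixed truth, and a policy
# that always predicts label 1 for the remaining positions.
truth = (1, 2, 3, 1, 2, 3)
hamming = lambda y: sum(a != b for a, b in zip(y, truth))
policy = lambda x, y: 1
print(per_action_costs(x=None, prefix=(1, 2), actions=[1, 2, 3, 4],
                       policy=policy, length=len(truth), cost=hamming))
```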

Page 42

Searn Training Illustration

36

[Figure: sequence-labeling grid with labels $y_i \in \{1, 2, 3, 4\}$ over positions $i = 1, \ldots, 6$, showing the prediction of the current classifier $h$, the current state $s$, a potential next state $a$, and another path being considered $(s, a, h)$.]

$\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

Page 43

Searn Training Illustration

36

[Figure, continued: same grid as before; the first candidate next state shown incurs cost $\ell_h = 2$.]

$\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

Page 44

Searn Training Illustration

36

[Figure, continued: candidates shown so far incur costs $\ell_h = 2$ and $\ell_h = 5$.]

$\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

Page 45

Searn Training Illustration

36

[Figure, continued: candidates shown so far incur costs $\ell_h = 2$, $5$, and $1$.]

$\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

Page 46

Searn Training Illustration

36

[Figure, continued: candidates shown incur costs $\ell_h = 2$, $5$, $1$, and $0$.]

$\ell_h(c, s, a) = E_{y \sim (s, a, h)}[c_y] - \min_{a'} E_{y \sim (s, a', h)}[c_y]$

Page 47

Searn Meta-Algorithm

• Input: $(x_1, y_1), \ldots, (x_m, y_m)$, $h_0$, $A$

• while $h$ has a significant dependence on $h_0$:

  • Initialize set of cost-sensitive examples: $S \leftarrow \emptyset$

  • for $i \leftarrow 1, \ldots, m$

    • Compute prediction: $(y^{(1)}, \ldots, y^{(L)}) \leftarrow h(x_i)$

    • for $l \leftarrow 1, \ldots, L$  (state $s_l \leftarrow (x_i, y^{(1)}, \ldots, y^{(l)})$ consists of input and partial output)

      • for each next output $a$ after $s_l$:

        • Compute features and add example: $c_{s_l, a} \leftarrow \ell_h(c, s_l, a)$; $S \leftarrow S \cup \{(f(s_l), c)\}$  (use losses to build up training examples for next iteration)

  • Learn and interpolate: $h' \leftarrow A(S)$; $h \leftarrow \beta h' + (1 - \beta) h$

37
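A minimal Python sketch of the loop above, under several simplifying assumptions: a single rollout per action to estimate costs, a user-supplied cost-sensitive learner `learn` standing in for A, and a stochastic mixture as one simple way to realize the interpolation h ← βh′ + (1 − β)h. None of these names come from the reference implementation.

```python
import random

def rollout(x, prefix, action, policy, length):
    """Commit to `action`, then let `policy` finish the remaining components."""
    y = list(prefix) + [action]
    while len(y) < length:
        y.append(policy(x, y))
    return tuple(y)

def searn(data, initial_policy, learn, actions, cost_fn, length_fn,
          beta=0.1, iterations=5):
    """Sketch of the Searn loop: run the current policy, build cost-sensitive
    examples from per-action rollout costs, learn h' = A(S), then interpolate."""
    policy = initial_policy
    for _ in range(iterations):
        S = []                                      # cost-sensitive examples
        for x, y_true in data:
            L = length_fn(x)
            y = []
            for _ in range(L):                      # prediction of current policy
                y.append(policy(x, y))
            for l in range(L):
                state = (x, tuple(y[:l]))           # input + partial output
                costs = {a: cost_fn(y_true, rollout(x, y[:l], a, policy, L))
                         for a in actions}          # cost of each next output
                best = min(costs.values())
                S.append((state, {a: c - best for a, c in costs.items()}))
        h_new = learn(S)                            # cost-sensitive learner A
        old = policy                                # interpolate the two policies
        policy = (lambda x, y, new=h_new, old=old:
                  new((x, tuple(y))) if random.random() < beta else old(x, y))
    return policy
```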

Page 48

Searn Analysis

• Let $h'_i$ be the classifier learned at the $i$-th iteration and $\ell_{h_i}(h'_i)$ the loss of the new classifier on the old training data

• If the average loss over $I$ iterations is $\ell_{avg} = \frac{1}{I}\sum_{i=1}^{I} \ell_{h_i}(h'_i)$ and $c_{max} = E_{(x,c)\sim D}[\max_y c_y]$

• then the loss with $\beta = 1/T^3$ and $2T^3 \ln T$ iterations is bounded:
  $L(D, h_{last}) \le L(D, h_0) + 2 T \ell_{avg} \ln T + (1 + \ln T)\, c_{max} / T$

• Proof analyses the mixture of old and new classifiers

• In practice, $\beta$ can be larger (more aggressive learning)

38

It is important in the analysis to refer explicitly to the error of the classifiers learned during the Searn process. Let Searn(D, h) denote the distribution over classification problems generated by running Searn with policy h on distribution D. Also let $\ell^{CS}_h(h')$ denote the loss of classifier $h'$ on the distribution Searn(D, h). Let the average cost-sensitive loss over I iterations be:

$$\ell_{avg} = \frac{1}{I} \sum_{i=1}^{I} \ell^{CS}_{h_i}(h'_i) \qquad (4)$$

where $h_i$ is the $i$th policy and $h'_i$ is the classifier learned on the $i$th iteration.

Theorem 2  For all D with $c_{max} = E_{(x,c)\sim D}[\max_y c_y]$ (with (x, c) as in Def 1), for all learned cost-sensitive classifiers $h'$, Searn with $\beta = 1/T^3$ and $2T^3 \ln T$ iterations, outputs a learned policy with loss bounded by:

$$L(D, h_{last}) \le L(D, \pi) + 2 T \ell_{avg} \ln T + (1 + \ln T)\, c_{max} / T$$

The dependence on T in the second term is due to the cost-sensitive loss being an average over T timesteps while the total loss is a sum. The ln T factor is not essential and can be removed using other approaches [3][30]. The advantage of the theorem here is that it applies to an algorithm that naturally copes with variable length T and yields a smaller amount of computation in practice.

The choices of $\beta$ and the number of iterations are pessimistic in practice. Empirically, we use a development set to perform a line search minimization to find per-iteration values for $\beta$ and to decide when to stop iterating. The analytical choice of $\beta$ is made to ensure that the probability that the newly created policy only makes one different choice from the previous policy for any given example is sufficiently low. The choice of $\beta$ assumes the worst: the newly learned classifier always disagrees with the previous policy. In practice, this rarely happens. After the first iteration, the learned policy is typically quite good and only rarely differs from the initial policy. So choosing such a small value for $\beta$ is unnecessary: even with a higher value, the current classifier often agrees with the previous policy.

The proof rests on the following lemmas.

Lemma 1 (Policy Degradation)  Given a policy h with loss L(D, h), apply a single iteration of Searn to learn a classifier $h'$ with cost-sensitive loss $\ell^{CS}_h(h')$. Create a new policy $h_{new}$ by interpolation with parameter $\beta \in (0, 1/T)$. Then, for all D, with $c_{max} = E_{(x,c)\sim D}[\max_i c_i]$ (with (x, c) as in Def 1):

$$L(D, h_{new}) \le L(D, h) + T \beta\, \ell^{CS}_h(h') + \frac{1}{2} \beta^2 T^2 c_{max} \qquad (5)$$



Page 49

Experiments

• Handwriting recognition [Kassel ’95]

• Named entity recognition

• Syntactic chunking and part-of-speech (POS) tagging

39

El presidente de la [Junta de Extremadura]ORG, [Juan Carlos Rodríguez Ibarra]PER, recibirá en la sede de la [Presidencia del Gobierno]ORG extremeño a familiares de varios de los condenados por el proceso "[Lasa-Zabala]MISC", entre ellos a [Lourdes Díez Urraca]PER, esposa del ex gobernador civil de [Guipúzcoa]LOC [Julen Elgorriaga]PER; y a [Antonio Rodríguez Galindo]PER, hermano del general [Enrique Rodríguez Galindo]PER.

Fig. 3  Example labeled sentence from the Spanish Named Entity Recognition task.

6.1.1 Handwriting Recognition The handwriting recognition task we con-sider was introduced by [25]. Later, [52] presented state-of-the-art results onthis task using max-margin Markov networks. The task is an image recogni-tion task: the input is a sequence of pre-segmented hand-drawn letters andthe output is the character sequence (“a”-“z”) in these images. The dataset we consider is identical to that considered by [52] and includes 6600sequences (words) collected from 150 subjects. The average word contains 8characters. The images are 8!16 pixels in size, and rasterized into a binaryrepresentation. Example image sequences are shown in Figure 2 (the firstcharacters are removed because they are capitalized).

For each possible output letter, there is a unique feature that countshow many times that letter appears in the output. Furthermore, for eachpair of letters, there is an “edge” feature counting how many times this pairappears adjacent in the output. These edge features are the only “structuralfeatures” used for this task (i.e., features that span multiple output labels).Finally, for every output letter and for every pixel position, there is a featurethat counts how many times that pixel position is “on” for the given outputletter.

In the experiments, we consider two variants of the data set. The first,“small,” is the problem considered by [52]. In the small problem, ten foldcross-validation is performed over the data set; in each fold, roughly 600words are used as training data and the remaining 6000 are used as test data.In addition to this setting, we also consider the “large” reverse experiment:in each fold, 6000 words are used as training data and 600 are used as testdata.

6.1.2 Spanish Named Entity Recognition The named entity recognition(NER) task is concerned with spotting names of persons, places and or-ganizations in text. Moreover, in NER we only aim to spot names andneither pronouns (“he”) nor nominal references (“the President”). We usethe CoNLL 2002 data set, which consists of 8324 training sentences and1517 test sentences; examples are shown in Figure 3. A 300-sentence sub-set of the training data set was previously used by [54] for evaluating theSVMstruct framework in the context of sequence labeling. The small train-ing set was likely used for computational considerations. The best reportedresults to date using the full data set are due to [2]. We report results onboth the “small” and “large” data sets.

Great/NNP/B-NP American/NNP/I-NP said/VBD/B-VP it/PRP/B-NP increased/VBD/B-VP its/PRP$/B-NP loan-loss/NN/I-NP reserves/NNS/I-NP by/IN/B-PP $/$/B-NP 93/CD/I-NP million/CD/I-NP after/IN/B-PP reviewing/VBG/B-VP its/PRP$/B-NP loan/NN/I-NP portfolio/NN/I-NP ././O

Fig. 5  Example sentence for the joint POS tagging and syntactic chunking task.

[49]. An example sentence jointly labeled for these two outputs is shown inFigure 5 (under the BIO encoding).

For Searn, there is little di!erence between standard sequence labelingand joint sequence labeling. We use the same data set as for the standardsyntactic chunking task (Section 6.1.3) and essentially the same features.In order to model the fact that the two streams of labels are not indepen-dent, we decompose the problem into two parallel tagging tasks. First, thefirst POS label is determined, then the first chunk label, then the secondPOS label, then the second chunk label, etc. The only di!erence betweenthe features we use in this task and the vanilla chunking task has to dothe structural features. The structural features we use include the obviousMarkov features on the individual sequences: counts of singleton, doubletonand tripleton POS and chunk tags. We also use “crossing sequence” fea-tures. In particular, we use counts of pairs of POS and chunk tags at thesame time period as well as pairs of POS tags at time t and chunk tags att ! 1 and vice versa.

6.1.5 Search and Initial Policies The choice of “search” algorithm in Searn

essentially boils down to the choice of output vector representation, since,as defined, Searn always operates in a left-to-right manner over the outputvector. In this section, we describe vector representations for the outputspace and corresponding optimal policies for Searn.

The most natural vector encoding of the sequence labeling problem issimply as itself. In this case, the search proceeds in a greedy left-to-rightmanner with one word being labeled per step. This search order admitssome linguistic plausibility for many natural language problems. It is alsoattractive because (assuming unit-time classification) it scales as O(NL),where N is the length of the input and L is the number of labels, inde-pendent of the number of features or the loss function. However, this vectorencoding is also highly biased, in the sense that it is perhaps not optimal forsome (perhaps unnatural) problems. Other orders are possible (such as al-lowing any arbitrary position to be labeled at any time, e!ectively mimicingbelief propagation); see [12] for more experimental results under alternativeorderings.

For joint segmentation and labeling tasks, such as named entity identi-fication and syntactic chunking, there are two natural encodings: word-at-a-time and chunk-at-a-time. In word-at-a-time, one essentially follows the“BIO encoding” and tags a single word in each search step. In chunk-at-a-time, one tags single chunks in each search step, which can consist ofmultiple words (after fixing a maximum phrase length). In our experiments,

[Great American]NP [said]VP [it]NP [increased]VP [its loan-loss reserves]NP [by]PP [$ 93 million]NP [after]PP [reviewing]VP [its loan portfolio]NP, [raising]VP [its total loan and real estate reserves]NP [to]PP [$ 217 million]NP.

Fig. 4  Example labeled sentence from the syntactic chunking task.

The structural features used for this task are roughly the same as in thehandwriting recognition case. For each label, each label pair and each labeltriple, a feature counts the number of times this element is observed in theoutput. Furthermore, the standard set of input features includes the wordsand simple functions of the words (case markings, prefix and su!x up tothree characters) within a window of ±2 around the current position. Theseinput features are paired with the current label. This feature set is fairlystandard in the literature, though [2] report significantly improved resultsusing a much larger set of features. In the results shown later in this section,all comparison algorithms use identical feature sets.

6.1.3 Syntactic Chunking The final sequence labeling task we consider issyntactic chunking (for English), based on the CoNLL 2000 data set. Thisdata set includes 8936 sentences of training data and 2012 sentences of testdata. An example is shown in Figure 4. (Several authors have consideredthe noun-phrase chunking task instead of the full syntactic chunking task.It is important to notice the di"erence, though results on these two tasksare typically very similar, indicating that the majority of the di!culty iswith noun phrases.)

We use the same set of features across all models, separated into “basefeatures” and “meta features.” The base features apply to words individu-ally, while meta features apply to entire chunks. The standard base featuresused are: the chunk length, the word (original, lower cased, stemmed, andoriginal-stem), the case pattern of the word, the first and last 1, 2 and 3characters, and the part of speech and its first character. We additionallyconsider membership features for lists of names, locations, abbreviations,stop words, etc. The meta features we use are, for any base feature b, bat position i (for any sub-position of the chunk), b before/after the chunk,the entire b-sequence in the chunk, and any 2- or 3-gram tuple of bs in thechunk. We use a first order Markov assumption (chunk label only dependson the most recent previous label) and all features are placed on labels,not on transitions. In the results shown later in this section, some of thealgorithms use a slightly di"erent feature set. In particular, the CRF-basedmodel uses similar, but not identical features; see [50] for details.

6.1.4 Joint Chunking and Tagging In the preceding sections, we consideredthe single sequence labeling task: to each element in a sequence, a singlelabel is assigned. In this section, we consider the joint sequence labelingtask. In this task, each element in a sequence is labeled with multiple tags.A canonical example of this task is joint POS tagging and syntactic chunking

Page 50

Experiments

40

ALGORITHM        Handwriting           NER                   Chunk    C+T
                 Small     Large       Small     Large
CLASSIFICATION
  Perceptron     65.56     70.05       91.11     94.37       83.12    87.88
  Log Reg        68.65     72.10       93.62     96.09       85.40    90.39
  SVM-Lin        75.75     82.42       93.74     97.31       86.09    93.94
  SVM-Quad       82.63     82.52       85.49     85.49       ×        ×
STRUCTURED
  Str. Perc.     69.74     74.12       93.18     95.32       92.44    93.12
  CRF            n/r       n/r         94.94     ×           94.77    96.48
  SVMstruct      n/r       n/r         94.90     ×           n/r      n/r
  M3N-Lin        81.00     ×           n/r       n/r         n/r      n/r
  M3N-Quad       87.00     ×           n/r       n/r         n/r      n/r
SEARN
  Perceptron     70.17     76.88       95.01     97.67       94.36    96.81
  Log Reg        73.81     79.28       95.90     98.17       94.47    96.95
  SVM-Lin        82.12     90.58       95.91     98.11       94.44    96.98
  SVM-Quad       87.55     90.91       89.31     90.01       ×        ×

Table 1  Empirical comparison of performance of alternative structured prediction algorithms against Searn on sequence labeling tasks. (Top) Comparison for whole-sequence 0/1 loss; (Bottom) Comparison for individual losses: Hamming for handwriting and Chunking+Tagging and F for NER and Chunking. Searn is always optimized for the appropriate loss.

For all Searn-based models, we use the following settings of the tunable parameters (see [12] for a comparison of different settings). We use the optimal approximation for the computation of the per-action costs. We use a left-to-right search order with a beam of size 10. For the chunking tasks, we use chunk-at-a-time search. We use weighted all pairs and costing to reduce from cost-sensitive classification to binary classification.

Note that some entries in Table 1 are missing. The vast majority of these entries are missing because the algorithm considered could not reasonably scale to the data set under consideration. These are indicated with a "×" symbol. Other entries are not available simply because the results we report are copied from other publications and these publications did not report all relevant scores. These are indicated with an "n/r" symbol.

We observe several patterns in the results from Table 1. The first is that structured techniques consistently outperform their classification counterparts (e.g., CRFs outperform logistic regression). The single exception is on the small handwriting task: the quadratic SVM outperforms the quadratic M3N.[5] For all classifiers, adding Searn consistently improves performance.

An obvious pattern worth noticing is that moving from the small data set to the large data set results in improved performance, regardless of

5  However, it should be noted that a different implementation technique was used in this comparison. The M3N is based on an SMO algorithm, while the quadratic SVM is libsvm [6].

Page 51

Comparison

41

Work                 Advantages                                    Limitations
Multiclass SVMs      Principled view of multiclass                 No concept of loss for SP
M3Ns                 Max-margin SP for sequences                   Need Markov assumption for tractability
SP with SVMs         SP for complex structures; analysis           Loss function "hacked" into optimization
Large-margin HMMs    Discriminative GMM training for sequences     Optimization problems may not scale
Regression for SP    Incorporates loss function naturally          Only sequence case is addressed
Searn                General SP framework for any learner          Incorporating existing algorithms may be hard

Page 52

Bibliography 1/2
• Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines.

Journal of Machine Learning Research, 2(5):265-292, 2001. • Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees.

Wadsworth & Brooks, 1984.• Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3): 273–297, September

1995.• Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision

trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–158, 2000.• Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output

codes. Journal of Artificial Intelligence Research, 2:263–286, January 1995. • Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to

boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997. • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.• B. Schölkopf. Support Vector Learning. PhD thesis, GMD First, 1997. • Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998. • J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the

Seventh European Symposium on Artificial Neural Networks, April 1999. • Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. Neural Information Processing

Systems (NIPS) 16, 2003. • A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and

segmentation. Proceedings of ICML, 2000.• R. Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. PhD thesis, MIT

Spoken Language Systems Group, 1995. • J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and

labeling sequence data. In Proc. ICML01, 2001.

42

Page 53

Bibliography 2/2
• Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for

interdependent and structured output spaces. Proceedings ICML, 2004. • Collins, M. Discriminative training methods for hidden markov models: Theory and experiments with perceptron

algorithms. EMNLP, 2002• Collins, M. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods., 2004

• Fei Sha and Lawrence K. Saul. Large margin hidden Markov models for automatic speech recognition, Neural Information Processing Systems (NIPS) 19, 2007.

• X. Li, H. Jiang, and C. Liu. Large margin HMMs for speech recognition. In Proceedings of ICASSP 2005, pages 513–516, Philadelphia, 2005.

• B.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE Trans. Sig. Proc., 40(12):3043–3054, 1992

• P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16:25–47, 2002.

• Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression framework for learning string-to-string mappings. Gökhan H. Bakir et al., eds, Predicting Structured Data. The MIT Press, 2006.

• Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm in Dual Variables. In Proceedings ICML, 1998.

• Pascal Vincent and Yoshua Bengio. Kernel Matching Pursuit. Technical Report 1179, Departement d’Informatique et Recherche Operationnelle, Universite de Montreal, 2000.

• Jason Weston, Olivier Chapelle, Andre Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik. Kernel Dependency Estimation. Neural Processing Information Systems 15, 2002.

• Harold C. Daumé III, Practical structured learning for natural language processing, Ph.D. Thesis, University of Southern California, 2006.

• William W. Cohen and Vitor Carvalho. Stacked sequential learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.

• Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. In Proceedings of the Conference of the Association for Computational Linguistics (ACL), 2004.

• Harold C. Daumé III, Search-Based Structured Prediction, Submitted to Machine Learning, 2007

43

Page 54

The End. Thank You!

44

Page 55

“Upon Request” Slides

45

Page 56

Multiclass Classification

• Many algorithms

• Decision trees [Breiman et al. ’84; Quinlan ’93]

• Output codes [Dietterich, Bakiri ’95]

• Multiclass boosting variants, e.g., [Freund, Schapire ’97]

• SVMs [Cortes and Vapnik ’95] have been extended to the multiclass setting

• Many ad-hoc methods (one-vs-all, all-vs-all, etc)

• Some adapt the optimization problem to the multiclass setting, e.g. [Vapnik ’98; Weston, Watkins ’99]

46

Page 57

Decomposing the QP

• QP has $mk$ constraints: for each example $i$, $\tau_i \le 1_{y_i}$ and $\tau_i \cdot 1 = 0$

• Intractable for large problems

• Iterative algorithm: decompose constraints into $m$ sets

1. Choose a constraint set (i.e., example) $p$

2. Solve a reduced optimization for $\tau_p$, repeat

• Convergence to global optimum guaranteed

• Rate depends on initialization

• Many implementation tweaks suggested

47

[Crammer, Singer ’01]

Page 58

QP Decomposition Details

Input $\{(x_1, y_1), \ldots, (x_m, y_m)\}$.
Initialize $\tau_1 = 0, \ldots, \tau_m = 0$.
Loop:
  1. Choose an example $p$.
  2. Calculate the constants for the reduced problem:
     • $A_p = K(x_p, x_p)$
     • $B_p = \sum_{i \ne p} K(x_i, x_p)\, \tau_i - \beta 1_{y_p}$
  3. Set $\tau_p$ to be the solution of the reduced problem:
     $\min_{\tau_p} \; Q(\tau_p) = \frac{1}{2} A_p (\tau_p \cdot \tau_p) + B_p \cdot \tau_p$
     subject to: $\tau_p \le 1_{y_p}$ and $\tau_p \cdot 1 = 0$
Output: $H(x) = \arg\max_{r=1,\ldots,k} \Big( \sum_i \tau_{i,r} K(x, x_i) \Big)$.

Figure 2: Skeleton of the algorithm for learning multiclass support vector machine.

Collecting the terms of $\tau_p$ in $Q$:

$$Q_p(\tau_p) \stackrel{\mathrm{def}}{=} -\frac{1}{2}\sum_{i,j} K_{i,j}(\tau_i \cdot \tau_j) + \beta \sum_i \tau_i \cdot 1_{y_i}$$
$$= -\frac{1}{2} K_{p,p}(\tau_p \cdot \tau_p) - \sum_{i \ne p} K_{i,p}(\tau_p \cdot \tau_i) - \frac{1}{2}\sum_{i \ne p,\, j \ne p} K_{i,j}(\tau_i \cdot \tau_j) + \beta\, \tau_p \cdot 1_{y_p} + \beta \sum_{i \ne p} \tau_i \cdot 1_{y_i}$$
$$= -\frac{1}{2} K_{p,p}(\tau_p \cdot \tau_p) - \tau_p \cdot \Big( -\beta 1_{y_p} + \sum_{i \ne p} K_{i,p}\tau_i \Big) + \Big( -\frac{1}{2}\sum_{i \ne p,\, j \ne p} K_{i,j}(\tau_i \cdot \tau_j) + \beta \sum_{i \ne p} \tau_i \cdot 1_{y_i} \Big). \qquad (17)$$

Let us now define the following variables,

$$A_p = K_{p,p} > 0 \qquad (18)$$
$$B_p = -\beta 1_{y_p} + \sum_{i \ne p} K_{i,p}\tau_i \qquad (19)$$
$$C_p = -\frac{1}{2}\sum_{i,j \ne p} K_{i,j}(\tau_i \cdot \tau_j) + \beta \sum_{i \ne p} \tau_i \cdot 1_{y_i}.$$

Using the variables defined above the objective function becomes,

$$Q_p(\tau_p) = -\frac{1}{2} A_p (\tau_p \cdot \tau_p) - B_p \cdot \tau_p + C_p.$$

48

[Crammer, Singer ’01]

Page 59

Final Algorithm

FixedPointAlgorithm(D, θ, ε)
Input D, $\theta^{1}$, ε.
Initialize l = 0.
Repeat
  • l ← l + 1.
  • $\theta^{l+1} \leftarrow \frac{1}{k}\Big[\sum_{r=1}^{k} \max\{\theta^{l}, D_r\}\Big] - \frac{1}{k}$.
Until $\left| \frac{\theta^{l} - \theta^{l+1}}{\theta^{l}} \right| \le \epsilon$.
Assign for r = 1, ..., k: $\nu_r = \min\{\theta^{l+1}, D_r\}$
Return: $\tau = \nu - \frac{B}{A}$.

Figure 3: The fixed-point algorithm for solving the reduced quadratic program.

Thus,

$$\theta^{l+1} = F(\theta^{l}) = \frac{1}{k}\Big[\sum_{r=1}^{k} \max\{\theta^{l}, D_r\}\Big] - \frac{1}{k} = \frac{1}{k}\Big(\sum_{r=u+1}^{k} \theta^{l}\Big) + \frac{1}{k}\Big(\sum_{r=1}^{u} D_r\Big) - \frac{1}{k} = \Big(1 - \frac{u}{k}\Big)\theta^{l} + \frac{1}{k}\Big(\sum_{r=1}^{u} D_r - 1\Big). \qquad (41)$$

Note that if $\theta^{l} \le \max_r D_r$ then $\theta^{l+1} \le \max_r D_r$. Similarly,

$$\theta^{*} = F(\theta^{*}) = \Big(1 - \frac{s}{k}\Big)\theta^{*} + \frac{1}{k}\Big(\sum_{r=1}^{s} D_r - 1\Big) \;\Rightarrow\; \theta^{*} = \frac{1}{s}\Big(\sum_{r=1}^{s} D_r - 1\Big). \qquad (42)$$

We now need to consider three cases depending on the relative order of s and u. The first case is when u = s. In this case we get that,

$$\frac{|\theta^{l+1} - \theta^{*}|}{|\theta^{l} - \theta^{*}|} = \frac{\big|\big(1 - \frac{s}{k}\big)\theta^{l} + \frac{1}{k}\big(\sum_{r=1}^{s} D_r - 1\big) - \theta^{*}\big|}{|\theta^{l} - \theta^{*}|} = \frac{\big|\big(1 - \frac{s}{k}\big)\theta^{l} + \frac{s}{k}\theta^{*} - \theta^{*}\big|}{|\theta^{l} - \theta^{*}|} = 1 - \frac{s}{k} \le 1 - \frac{1}{k},$$

where the second equality follows from Eq. (42). The second case is where u > s. In this case we get that for all r = s + 1, ..., u:

$$\theta^{l} \le D_r \le \theta^{*}. \qquad (43)$$

Input $\{(x_1, y_1), \ldots, (x_m, y_m)\}$.
Initialize for i = 1, ..., m:
  • $\tau_i = 0$
  • $F_{i,r} = -\beta\, \delta_{r, y_i}$  (r = 1, ..., k)
  • $A_i = K(x_i, x_i)$
Repeat:
  • Calculate for i = 1, ..., m: $\psi_i = \max_r F_{i,r} - \min_{r \,:\, \tau_{i,r} < \delta_{y_i, r}} F_{i,r}$
  • Set: $p = \arg\max_i \{\psi_i\}$
  • Set for r = 1, ..., k: $D_r = \frac{F_{p,r}}{A_p} - \tau_{p,r} + \delta_{r, y_p}$ and $\theta = \frac{1}{k}\Big(\sum_{r=1}^{k} D_r\Big) - \frac{1}{k}$
  • Call: $\tau'_p$ = FixedPointAlgorithm(D, θ, ε/2). (See Figure 3)
  • Set: $\Delta\tau_p = \tau'_p - \tau_p$
  • Update for i = 1, ..., m and r = 1, ..., k: $F_{i,r} \leftarrow F_{i,r} + \Delta\tau_{p,r}\, K(x_p, x_i)$
  • Update: $\tau_p \leftarrow \tau'_p$
Until $\psi_p < \epsilon\beta$
Output: $H(x) = \arg\max_r \Big( \sum_i \tau_{i,r} K(x, x_i) \Big)$.

Figure 4: Basic algorithm for learning a multiclass, kernel-based, support vector machine using KKT conditions for example selection.

7. Implementation details

We have discussed so far the underlying principles and algorithmic issues that arise in the design of multiclass kernel-based vector machines. However, to make the learning algorithm practical for large datasets we had to make several technical improvements to the baseline implementation. While these improvements do not change the underlying design principles they lead to a significant improvement in running time. We therefore devote this section to a description of the implementation details. To compare the performance of the different versions presented in this section we used the MNIST OCR dataset.[1] The MNIST dataset contains 60,000 training examples and 10,000 test examples and thus can underscore significant implementation improvements. Before diving into the technical details we would like to note that many of the techniques are by no means new and have been used in prior implementations of two-class support vector machines (see for instance Platt, 1998, Joachims, 1998, Collobert and Bengio, 2001). However, a few of our implementation improvements build on the specific algorithmic design of multiclass kernel machines.

1. Available at http://www.research.att.com/~yann/exdb/mnist/index.html

49

[Crammer, Singer ’01]
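As a sanity check on Figures 3 and 4, the reduced problem can be sketched in a few lines of NumPy: form $D_r = B_{p,r}/A_p + \delta_{r,y_p}$, run the fixed-point iteration, and recover $\tau_p = \nu - B_p/A_p$. This is only an illustrative sketch, not the authors' optimized implementation.

```python
import numpy as np

def fixed_point(D, theta, eps=1e-6, max_iter=1000):
    """Fixed-point iteration of Figure 3:
    theta <- (1/k) * sum_r max(theta, D_r) - 1/k, iterated to convergence."""
    k = len(D)
    for _ in range(max_iter):
        new = (np.maximum(theta, D).sum() - 1.0) / k
        if abs(theta - new) <= eps * max(abs(theta), 1e-12):
            theta = new
            break
        theta = new
    return np.minimum(theta, D)            # nu_r = min(theta, D_r)

def solve_reduced(A_p, B_p, y_p):
    """Solve min 0.5*A_p*||tau||^2 + B_p.tau  s.t.  tau <= 1_{y_p}, tau.1 = 0,
    using the change of variables D_r = B_{p,r}/A_p + delta_{r,y_p}."""
    k = len(B_p)
    D = B_p / A_p + np.eye(k)[y_p]
    theta0 = (D.sum() - 1.0) / k           # initialization used in Figure 4
    nu = fixed_point(D, theta0)
    return nu - B_p / A_p                  # tau_p = nu - B_p / A_p

# Toy usage: k = 3 classes, example p with true class y_p = 0.
A_p, B_p = 2.0, np.array([-1.0, 0.5, 0.5])
tau_p = solve_reduced(A_p, B_p, y_p=0)
print(tau_p, tau_p.sum())                  # components sum to (approximately) zero
```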

Page 60

SVM Dual Formulation

Slide: Mehryar Mohri, Foundations of Machine Learning, Courant Institute, NYU.

Dual Optimization Problem

• Constrained optimization:
  maximize $\sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
  subject to $\forall i \in [1, m],\; 0 \le \alpha_i \le C \;\wedge\; \sum_{i=1}^{m} \alpha_i y_i = 0$.

• Solution:
  $h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\Big)$, with $b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i)$ for any SV $x_i$ with $\alpha_i < C$.

50

[Cortes, Vapnik ’95]; Slide: [Mohri ’07]

Page 61

Dual, Kernel, etc.

• Write dual QP with one dual variable per constraint

• maximize $\sum_{i,\mathbf{y}} \alpha_i(\mathbf{y}) - \frac{1}{2} \Big\| \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta \mathbf{f}_i(\mathbf{y}) \Big\|^2$

• subject to $\forall i: \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C$; $\forall i, \mathbf{y}: \alpha_i(\mathbf{y}) \ge 0$

• Problem: exponential number of dual variables

• Solution: interpret dual variables as probability density

• If Markov net is a forest, can decompose dual QP into equivalent QP with only (marginal) dual variables

• Dual representation in terms of dot product: kernels

51

Page 62

Experiments: Sequences

• Task: given a sequence of inputs $x = (x^1, \ldots, x^m)$, predict a sequence of outputs $y = (y^1, \ldots, y^m)$, $y^k \in \Sigma$

• Label set: non-name, beginning and continuation of names of people, organizations, locations, and miscellaneous ($|\Sigma| = 9$)

• Joint feature map $\Psi(x, y)$: histogram of state transitions and label emission features

• Viterbi variant used to solve the argmax

• 0/1 loss, 300 Spanish sentences
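The argmax over label sequences is a standard first-order Viterbi recursion once per-position emission scores and a transition score matrix are available. The sketch below illustrates only that decoding step; the SVMstruct feature map and kernel are not reproduced, and the random score matrices are toy assumptions.

```python
import numpy as np

def viterbi(emission, transition):
    """First-order Viterbi decoding.
    emission:   T x K matrix of per-position label scores
    transition: K x K matrix of label-to-label scores
    Returns the highest-scoring label sequence among all K^T sequences."""
    T, K = emission.shape
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)       # best previous label for each label
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):           # backtrack through the pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy usage with 3 positions and the 9 NER labels indexed 0..8.
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(3, 9)), rng.normal(size=(9, 9))))
```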

cycling through all n instances and since step 10 is polynomial.

5. Applications and Experiments

To demonstrate the effectiveness and versatility of our approach, we report results on a number of different tasks. To adapt the algorithm to a new problem, it is sufficient to implement the feature mapping $\Psi(x, y)$, the loss function $\Delta(y_i, y)$, as well as the maximization in step 6.

5.1. Multiclass Classification

Our algorithm can implement the conventional winner-takes-all (WTA) multiclass classification (Crammer & Singer, 2001) as follows. Let $Y = \{y_1, \ldots, y_K\}$, where $w = (v_1^\top, \ldots, v_K^\top)^\top$ is a stack of vectors, $v_k$ being a weight vector associated with the k-th class $y_k$. Following Crammer and Singer (2001) one can then define $F(x, y_k; w) = \langle v_k, \Phi(x)\rangle$, where $\Phi(x) \in \mathbb{R}^D$ denotes an arbitrary input representation. These discriminant functions can be equivalently represented in the proposed framework by defining a joint feature map as follows: $\Psi(x, y) \equiv \Phi(x) \otimes \Lambda^c(y)$. Here $\Lambda^c$ refers to the orthogonal (binary) encoding of the label y and $\otimes$ is the tensor product which forms all products between coefficients of the two argument vectors.
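The tensor-product joint feature map $\Psi(x, y) = \Phi(x) \otimes \Lambda^c(y)$ can be written down directly; below is a minimal sketch with a toy $\Phi(x)$ and a stacked weight vector as illustrative assumptions.

```python
import numpy as np

def joint_feature_map(phi_x, y, num_classes):
    """Psi(x, y) = Phi(x) tensor Lambda^c(y): place the input features Phi(x)
    in the block of the stacked weight vector that belongs to class y."""
    label_code = np.zeros(num_classes)
    label_code[y] = 1.0                     # orthogonal (binary) encoding of y
    return np.kron(label_code, phi_x)       # all products of the two vectors

# <w, Psi(x, y_k)> then equals <v_k, Phi(x)> for w = (v_1, ..., v_K) stacked:
phi = np.array([1.0, 2.0, 3.0])
w = np.arange(12, dtype=float)              # K = 4 classes, D = 3 features
scores = [w @ joint_feature_map(phi, k, 4) for k in range(4)]
print(int(np.argmax(scores)))               # winner-takes-all prediction
```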

5.2. Classification with Taxonomies

The first generalization we propose is to make use of more interesting output features $\Lambda$ than the orthogonal representation $\Lambda^c$. As an exemplary application of this kind, we show how to take advantage of known class taxonomies. Here a taxonomy is treated as a lattice in which the classes $y \in Y$ are the minimal elements. For every node z in the lattice (corresponding to a super-class or class) we introduce a binary attribute $\lambda_z(y)$ indicating whether or not z is a predecessor of y. Notice that $\langle \Lambda(y), \Lambda(y')\rangle$ will count the number of common predecessors.

We have performed experiments using a document collection released by the World Intellectual Property Organization (WIPO), which uses the International Patent Classification (IPC) scheme. We have restricted ourselves to one of the 8 sections, namely section D, consisting of 1,710 documents in the WIPO-alpha collection. For our experiments, we have indexed the title and claim tags. We have furthermore sub-sampled the training data to investigate the effect of the training set size. Document parsing, tokenization and term normalization have been performed with the MindServer retrieval engine.[2] As a suitable loss function $\Delta$, we have used a tree loss function which defines the loss between two classes y and y' as the height of the first common ancestor of y and y' in the taxonomy. The results are summarized in Table 1 and show that the proposed hierarchical SVM learning architecture improves performance over the standard multiclass SVM in terms of classification accuracy as well as in terms of the tree loss.

Table 1. Results on the WIPO-alpha corpus, section D with 160 groups using 3-fold and 5-fold cross validation, respectively. 'flt' is a standard (flat) SVM multiclass model, 'tax' the hierarchical architecture. '0/1' denotes training based on the classification loss, 'Δ' refers to training based on the tree loss.

                                   flt 0/1   tax 0/1   flt Δ   tax Δ
  4 training instances per class
    acc                            28.32     28.32     27.47   29.74   (+5.01 %)
    Δ-loss                         1.36      1.32      1.30    1.21    (+12.40 %)
  2 training instances per class
    acc                            20.20     20.46     20.20   21.73   (+7.57 %)
    Δ-loss                         1.54      1.51      1.39    1.33    (+13.67 %)

5.3. Label Sequence Learning

Label sequence learning deals with the problem of predicting a sequence of labels $y = (y^1, \ldots, y^m)$, $y^k \in \Sigma$, from a given sequence of inputs $x = (x^1, \ldots, x^m)$. It subsumes problems like segmenting or annotating observation sequences and has widespread applications in optical character recognition, natural language processing, information extraction, and computational biology. In the following, we study our algorithm on a named entity recognition (NER) problem. More specifically, we consider a sub-corpus consisting of 300 sentences from the Spanish news wire article corpus which was provided for the special session of CoNLL2002 devoted to NER. The label set in this corpus consists of non-name and the beginning and continuation of person names, organizations, locations and miscellaneous names, resulting in a total of $|\Sigma| = 9$ different labels. In the setup followed in Altun et al. (2003), the joint feature map $\Psi(x, y)$ is the histogram of state transitions plus a set of features describing the emissions. An adapted version of the Viterbi algorithm is used to solve the argmax in line 6. For both perceptron and SVM a second degree polynomial kernel was used.

The results given in Table 2 for the zero-one loss compare the generative HMM with Conditional Random Fields (CRFs) (Lafferty et al., 2001), Collins' perceptron, and the SVM algorithm.

2. http://www.recommind.com

Table 2. Results of various algorithms on the Named Entity Recognition task (Altun et al., 2003).

Method   HMM    CRF    Perceptron   SVM
Error    9.36   5.17   5.94         5.08

Table 3. Results for various SVM formulations on the Named Entity Recognition task (ε = 0.01, C = 1).

Method      Train Err   Test Err   Const      Avg Loss
SVM2        0.2±0.1     5.1±0.6    2824±106   1.02±0.01
SVM^Δs_2    0.4±0.4     5.1±0.8    2626±225   1.10±0.08
SVM^Δm_2    0.3±0.2     5.1±0.7    2628±119   1.17±0.12

All discriminative learning methods substantially outperform the standard HMM. In addition, the SVM performs slightly better than the perceptron and CRFs, demonstrating the benefit of a large-margin approach. Table 3 shows that all SVM formulations perform comparably, probably because the vast majority of the support label sequences end up having Hamming distance 1 to the correct label sequence (notice that for a loss equal to 1 all SVM formulations are equivalent).

5.4. Sequence Alignment

Next we show how to apply the proposed algorithm to the problem of learning how to align sequences x ∈ X = Σ*. For a given pair of sequences x and z, alignment methods like the Smith-Waterman algorithm select the sequence of operations (e.g. insertion, substitution) â(x, z) = argmax_{a∈A} ⟨w, Ψ(x, z, a)⟩ that transforms x into z and that maximizes a linear objective function derived from the (negative) operation costs w. Ψ(x, z, a) is the histogram of alignment operations. We use the value of ⟨w, Ψ(x, z, â(x, z))⟩ as a measure of similarity.

In order to learn the cost vector w we use training data of the following type. For each native sequence x_i there is a most similar homologue sequence z_i along with what is believed to be the (close to) optimal alignment a_i. In addition we are given a set of decoy sequences z_i^t, t = 1, ..., k, with unknown alignments. The goal is to find a cost vector w so that homologue sequences are close to the native sequence, and so that decoy sequences are further away. With Y_i = {z_i, z_i^1, ..., z_i^k} as the output space for the i-th example, we seek a w so that ⟨w, Ψ(x_i, z_i, a_i)⟩ exceeds ⟨w, Ψ(x_i, z_i^t, a)⟩ for all t and a. This implies a zero-one loss and hypotheses of the form f(x_i, w) = argmax_{y∈Y_i} max_a ⟨w, Ψ(x, z, a)⟩. We use the Smith-Waterman algorithm to implement the max_a.
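To make the scoring concrete, here is a small sketch (a simplification, not from the paper; the operation inventory and its encoding are assumptions) of Ψ(x, z, a) as a histogram of alignment operations and of the linear similarity score ⟨w, Ψ(x, z, a)⟩:

    from collections import Counter
    import numpy as np

    # Assumed operation inventory: substitutions between symbols plus a gap operation.
    ALPHABET = "ACGT"
    OPS = [("sub", a, b) for a in ALPHABET for b in ALPHABET] + [("gap",)]
    OP_INDEX = {op: i for i, op in enumerate(OPS)}

    def alignment_histogram(alignment):
        """Psi(x, z, a): count how often each operation is used in alignment a,
        where a is a list of operations such as ("sub", "A", "C") or ("gap",)."""
        psi = np.zeros(len(OPS))
        for op, count in Counter(alignment).items():
            psi[OP_INDEX[op]] += count
        return psi

    def alignment_score(w, alignment):
        """<w, Psi(x, z, a)>: linear score of an alignment under operation costs w."""
        return float(np.dot(w, alignment_histogram(alignment)))

    # toy usage: score one candidate alignment of two short sequences
    w = np.random.randn(len(OPS))
    a = [("sub", "A", "A"), ("sub", "C", "G"), ("gap",), ("sub", "T", "T")]
    print(alignment_score(w, a))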

Table 4 shows the test error rates (i.e. the fraction of times the homolog is not selected) on the synthetic dataset

Table 4. Error rates and number of constraints |S| depending on the number of training examples (ε = 0.1, C = 0.01).

        Train Error               Test Error
n       GenMod      SVM2          GenMod      SVM2        Const
1       20.0±13.3   0.0±0.0       74.3±2.7    47.0±4.6    7.8±0.3
2       20.0±8.2    0.0±0.0       54.5±3.3    34.3±4.3    13.9±0.8
4       10.0±5.5    2.0±2.0       28.0±2.3    14.4±1.4    31.9±0.9
10      2.0±1.3     0.0±0.0       10.2±0.7    7.1±1.6     58.9±1.2
20      2.5±0.8     1.0±0.7       3.4±0.7     5.2±0.5     95.2±2.3
40      2.0±1.0     1.0±0.4       2.3±0.5     3.0±0.3     157.2±2.4
80      2.8±0.5     2.0±0.5       1.9±0.4     2.8±0.6     252.7±2.1

described in Joachims (2003). The results are averaged over 10 train/test samples. The model contains 400 parameters in the substitution matrix Θ and a cost δ for "insert/delete". We train this model using the SVM2 and compare against a generative sequence alignment model, where the substitution matrix is computed as Θ_ij = log [ P(x_i, z_j) / (P(x_i) P(z_j)) ] using Laplace estimates. For the generative model, we report the results for δ = −0.2, which performs best on the test set. Despite this unfair advantage, the SVM performs better for low training set sizes. For larger training sets, both methods perform similarly, with a small preference for the generative model. However, an advantage of the SVM model is that it is straightforward to train gap penalties. As predicted by Theorem 1, the number of constraints |S| is low. It appears to grow sub-linearly with the number of examples.

5.5. Natural Language Parsing

We test the feasibility of our approach for learning a weighted context-free grammar (see Figure 1) on a subset of the Penn Treebank Wall Street Journal corpus. We consider the 4098 sentences of length at most 10 from sections F2-21 as the training set, and the 163 sentences of length at most 10 from F22 as the test set. Following the setup in Johnson (1999), we start based on the part-of-speech tags and learn a weighted grammar consisting of all rules that occur in the training data. To solve the argmax in line 6 of the algorithm, we use a modified version of the CKY parser of Mark Johnson,3 incorporated into SVMlight.

The results are given in Table 5. They show accuracy and micro-averaged F1 for the training and the test set. The first line shows the performance for the generative PCFG model using the maximum likelihood estimate (MLE) as computed by Johnson's implementation. The second line shows the SVM2 with zero-one loss, while the following lines give the results for the F1-loss Δ(y_i, y) = (1 − F1(y_i, y)) using SVM^Δs_2 and SVM^Δm_2.

3. At http://www.cog.brown.edu/~mj/Software.htm
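For reference, this F1-based loss is straightforward to compute from the predicted and gold bracketings; the sketch below (an illustration only, representing parse trees simply as sets of labeled spans) returns Δ(y_i, y) = 1 − F1(y_i, y):

    def f1_loss(gold_spans, pred_spans):
        """Delta(y_i, y) = 1 - F1(y_i, y), with trees represented as sets of
        (label, start, end) constituent spans."""
        gold, pred = set(gold_spans), set(pred_spans)
        if not gold or not pred:
            return 1.0
        correct = len(gold & pred)
        if correct == 0:
            return 1.0
        precision = correct / len(pred)
        recall = correct / len(gold)
        return 1.0 - 2 * precision * recall / (precision + recall)

    # toy usage
    gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
    pred = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]
    print(f1_loss(gold, pred))  # 1 - F1 of the two bracketings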

y = [y^(1), ..., y^(n)],  x = [x^(1), ..., x^(n)],  y^(k) ∈ Σ

f(x, y),  ŷ = argmax_y H(y)

52

[Tsochantaridis, Hofmann, Joachims, Altun ’04]

Page 63: Discriminative Methods for Structured Prediction

Adding Linear Constraints

• It’s useful to enforce constraints on the input/output map

• e.g., in POS tagging, tag position matches word position

• Formalize as linear constraints on elements of W

• Matrix A_i, regularization parameter γ_i per constraint

53



Some natural constraints that one may wish to impose on W are linear constraints on its coefficients. To take these constraints into account, one can introduce additional terms in the objective function. For example, to impose that the coefficients with indices in some subset I_0 are null, or that two coefficients with indices in I_1 must be equal, the following terms can be added:

  η_0 Σ_{(i,j)∈I_0} W_ij^2 + η_1 Σ_{(i,j,k,l)∈I_1} |W_ij − W_kl|^2,   (1.17)

with large values assigned to the regularization factors η_0 and η_1. More generally, a finite set of linear constraints on the coefficients of W can be accounted for in the objective function by the introduction of a quadratic form defined over the W_ij, (i, j) ∈ N_2 × N_1.

Let N = N_2 N_1, and denote by W̄ the N × 1 column matrix whose components are the coefficients of the matrix W. The quadratic form representing the constraints can be written as ⟨W̄, R W̄⟩, where R is a positive semi-definite symmetric matrix. By Cholesky's decomposition theorem, there exists a triangular matrix A such that R = A^T A. Denote by Ā_i the transposition of the i-th row of A; Ā_i is an N × 1 column matrix, then

  ⟨W̄, R W̄⟩ = ⟨W̄, A^T A W̄⟩ = ‖A W̄‖^2 = Σ_{i=1}^N ⟨Ā_i, W̄⟩^2.   (1.18)

The matrix Ā_i can be associated to an N_2 × N_1 matrix A_i, just as W̄ is associated to W, and ⟨Ā_i, W̄⟩^2 = ⟨A_i, W⟩_F^2. Thus, the quadratic form representing the linear constraints can be re-written in terms of the Frobenius products of W with N matrices:

  ⟨W̄, R W̄⟩ = Σ_{i=1}^N ⟨A_i, W⟩_F^2,   (1.19)

with each A_i, i = 1, ..., N, being an N_2 × N_1 matrix. In practice, the number of matrices needed to represent the constraints may be far less than N; we will denote by C the number of constraint-matrices of the type A_i used.

Thus, the general form of the optimization problem including input-output constraints becomes:

  argmin_{W ∈ ℝ^{N_2×N_1}} F(W) = ‖W M_X − M_Y‖_F^2 + λ‖W‖_F^2 + Σ_{i=1}^C γ_i ⟨A_i, W⟩_F^2,   (1.20)

where γ_i ≥ 0, i = 1, ..., C, are regularization parameters. Since they can be factored into the matrices A_i by replacing A_i with √γ_i A_i, in what follows, we will assume without loss of generality that γ_1 = ... = γ_C = 1.
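As a sanity check of the objective in Equation 1.20, a direct numpy evaluation (a sketch with toy dimensions, under the assumption stated above that the γ_i have already been folded into the A_i) looks as follows:

    import numpy as np

    def constrained_objective(W, M_X, M_Y, lam, constraint_mats):
        """F(W) = ||W M_X - M_Y||_F^2 + lam ||W||_F^2 + sum_i <A_i, W>_F^2,
        with the regularization factors gamma_i absorbed into the A_i."""
        fit = np.linalg.norm(W @ M_X - M_Y, "fro") ** 2
        ridge = lam * np.linalg.norm(W, "fro") ** 2
        constraints = sum(np.sum(A * W) ** 2 for A in constraint_mats)
        return fit + ridge + constraints

    # toy usage with N1 = 4 input features, N2 = 3 output features, m = 10 examples
    rng = np.random.default_rng(0)
    M_X, M_Y = rng.normal(size=(4, 10)), rng.normal(size=(3, 10))
    W = rng.normal(size=(3, 4))
    A1 = np.zeros((3, 4)); A1[0, 0] = 1.0          # penalize the coefficient W[0, 0]
    print(constrained_objective(W, M_X, M_Y, 0.1, [A1]))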



1.3.1 Kernel Ridge Regression with Vector Space Images

For i = 1, ..., m, let M_{x_i} ∈ ℝ^{N_1×1} denote the column matrix representing Φ_X(x_i) and M_{y_i} ∈ ℝ^{N_2×1} the column matrix representing Φ_Y(y_i). We will denote by ‖A‖_F^2 = Σ_{i=1}^p Σ_{j=1}^q A_ij^2 the Frobenius norm of a matrix A = (A_ij) ∈ ℝ^{p×q} and by ⟨A, B⟩_F = Σ_{i=1}^p Σ_{j=1}^q A_ij B_ij the Frobenius product of two matrices A and B in ℝ^{p×q}. The following minimization problem:

  argmin_{W ∈ ℝ^{N_2×N_1}} F(W) = Σ_{i=1}^m ‖W M_{x_i} − M_{y_i}‖^2 + λ‖W‖_F^2,   (1.4)

where λ ≥ 0 is a regularization scalar coefficient, generalizes ridge regression to vector space images. The solution W defines the linear hypothesis g. Let M_X ∈ ℝ^{N_1×m} and M_Y ∈ ℝ^{N_2×m} be the matrices defined by:

  M_X = [M_{x_1} ... M_{x_m}]    M_Y = [M_{y_1} ... M_{y_m}].   (1.5)

Then, the optimization problem 1.4 can be re-written as:

  argmin_{W ∈ ℝ^{N_2×N_1}} F(W) = ‖W M_X − M_Y‖_F^2 + λ‖W‖_F^2.   (1.6)

Proposition 1 The solution of the optimization problem 1.6 is unique and is given by either one of the following identities:

  W = M_Y M_X^T (M_X M_X^T + λI)^{-1}   (primal solution)
  W = M_Y (K_X + λI)^{-1} M_X^T         (dual solution),   (1.7)

where K_X ∈ ℝ^{m×m} is the Gram matrix associated to the kernel K_X: K_ij = K_X(x_i, x_j).

Proof The function F is convex and differentiable, thus its solution is unique and given by ∇_W F = 0. Its gradient is given by:

  ∇_W F = 2(W M_X − M_Y) M_X^T + 2λW.   (1.8)

Thus,

  ∇_W F = 0  ⟺  2(W M_X − M_Y) M_X^T + 2λW = 0
             ⟺  W (M_X M_X^T + λI) = M_Y M_X^T
             ⟺  W = M_Y M_X^T (M_X M_X^T + λI)^{-1},   (1.9)

which gives the primal solution of the optimization problem. To derive the dual solution, observe that

  M_X^T (M_X M_X^T + λI)^{-1} = (M_X^T M_X + λI)^{-1} M_X^T.   (1.10)
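The two closed forms in Proposition 1 are easy to check numerically; the sketch below (a toy example with a linear kernel, so that K_X = M_X^T M_X; dimensions are assumptions) computes both and verifies that they agree:

    import numpy as np

    def krr_primal(M_X, M_Y, lam):
        """W = M_Y M_X^T (M_X M_X^T + lam I)^{-1}  (primal solution)."""
        N1 = M_X.shape[0]
        return M_Y @ M_X.T @ np.linalg.inv(M_X @ M_X.T + lam * np.eye(N1))

    def krr_dual(M_X, M_Y, lam):
        """W = M_Y (K_X + lam I)^{-1} M_X^T  (dual solution), K_X the Gram matrix."""
        m = M_X.shape[1]
        K_X = M_X.T @ M_X          # linear kernel; in general K_ij = K_X(x_i, x_j)
        return M_Y @ np.linalg.inv(K_X + lam * np.eye(m)) @ M_X.T

    rng = np.random.default_rng(0)
    M_X = rng.normal(size=(4, 10))   # N1 = 4 features, m = 10 examples
    M_Y = rng.normal(size=(3, 10))   # N2 = 3 output features
    W_primal, W_dual = krr_primal(M_X, M_Y, 0.1), krr_dual(M_X, M_Y, 0.1)
    assert np.allclose(W_primal, W_dual)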


[Cortes, Mohri, Weston ’06]

Page 64: Discriminative Methods for Structured Prediction

Solving With Constraints

• Solution of optimization problem with constraints: W = (M_Y M_X^T − Σ_{i=1}^C a_i A_i) U^{-1}

• Iterative solution not requiring large matrix inversion:

• W_{k+1} = (1 − β) W_k + β (M_Y M_X^T − Σ_{i=1}^C ⟨A_i, W_k⟩_F A_i) U^{-1}, W_0 = M_Y M_X^T U^{-1}

• λ_max ≥ 0, and for 0 < β < min(2/(λ_max + 1), 1), converges to closed form solution. λ_max is largest eigenvalue of P

54

Proposition 2 The solution of the optimization problem 1.20 is unique and is given by the following identity:

  W = (M_Y M_X^T − Σ_{i=1}^C a_i A_i) U^{-1},   (1.21)

with U = M_X M_X^T + λI and

  [a_1 ... a_C]^T = ((⟨A_i, A_j U^{-1}⟩)_{ij} + I)^{-1} [⟨M_Y M_X^T U^{-1}, A_1⟩ ... ⟨M_Y M_X^T U^{-1}, A_C⟩]^T.   (1.22)

Proof The new objective function F is convex and differentiable, thus its solution is unique and given by ∇_W F = 0. Its gradient is given by:

  ∇_W F = 2(W M_X − M_Y) M_X^T + 2λW + 2 Σ_{i=1}^C ⟨A_i, W⟩_F A_i.   (1.23)

Thus,

  ∇_W F = 0  ⟺  2(W M_X − M_Y) M_X^T + 2λW + 2 Σ_{i=1}^C ⟨A_i, W⟩_F A_i = 0   (1.24)
             ⟺  W (M_X M_X^T + λI) = M_Y M_X^T − Σ_{i=1}^C ⟨A_i, W⟩_F A_i   (1.25)
             ⟺  W = (M_Y M_X^T − Σ_{i=1}^C ⟨A_i, W⟩_F A_i)(M_X M_X^T + λI)^{-1}.   (1.26)

To determine the solution W, we need to compute the coefficients ⟨A_i, W⟩_F. Let M = M_Y M_X^T, a_i = ⟨A_i, W⟩_F and U = M_X M_X^T + λI, then the last equation can be re-written as:

  W = (M − Σ_{i=1}^C a_i A_i) U^{-1}.   (1.27)

Thus, for j = 1, ..., C,

  a_j = ⟨A_j, W⟩_F = ⟨A_j, M U^{-1}⟩_F − Σ_{i=1}^C a_i ⟨A_j, A_i U^{-1}⟩_F,   (1.28)

which defines the following system of linear equations with unknowns a_j:

  ∀j, 1 ≤ j ≤ C,  a_j + Σ_{i=1}^C a_i ⟨A_j, A_i U^{-1}⟩_F = ⟨A_j, M U^{-1}⟩_F.   (1.29)

Since U is symmetric, for all i, j, 1 ≤ i, j ≤ C,

  ⟨A_j, A_i U^{-1}⟩_F = tr(A_j^T A_i U^{-1}) = tr(U^{-1} A_j^T A_i) = ⟨A_j U^{-1}, A_i⟩_F.   (1.30)


Thus, the matrix (⟨A_i, A_j U^{-1}⟩_F)_ij is symmetric and (⟨A_i, A_j U^{-1}⟩_F)_ij + I is invertible. The statement of the proposition follows.

Proposition 2 shows that the matrix W solution of the constrained optimization problem is unique and admits a closed form solution, as in the unconstrained case. The computation of the solution requires inverting the matrix U, as in the unconstrained case, which can be done in time O(N_1^3). But it also requires, in the general case, the inversion of the matrix (⟨A_i, A_j U^{-1}⟩_F)_ij + I, which can be done in O(C^3). For large C, C close to N, the space and time complexity of this matrix inversion may become prohibitive.

Instead, one can use an iterative method for computing W, using Equation 1.31:

  W = (M_Y M_X^T − Σ_{i=1}^C ⟨A_i, W⟩_F A_i) U^{-1},   (1.31)

and starting with the solution of the unconstrained problem:

  W_0 = M_Y M_X^T U^{-1}.   (1.32)

At iteration k, W_{k+1} is determined by interpolating its value at the previous iteration with the one given by Equation 1.31:

  W_{k+1} = (1 − β) W_k + β (M_Y M_X^T − Σ_{i=1}^C ⟨A_i, W_k⟩_F A_i) U^{-1},   (1.33)

where 0 ≤ β ≤ 1. Let P ∈ ℝ^{(N_2×N_1)×(N_2×N_1)} be the matrix such that

  P W = Σ_{i=1}^C ⟨A_i, W⟩_F A_i U^{-1}.   (1.34)

The following theorem proves the convergence of this iterative method to the correct result when β is sufficiently small with respect to a quantity depending on the largest eigenvalue of P. When the matrices A_i are sparse, as in many cases encountered in practice, the convergence of this method can be very fast.

Theorem 3 Let λ_max be the largest eigenvalue of P. Then, λ_max ≥ 0 and for 0 < β < min(2/(λ_max + 1), 1), the iterative algorithm just presented converges to the unique solution of the optimization problem 1.20.

Proof We first show that the eigenvalues of P are all non-negative. Let X be an eigenvector associated to an eigenvalue λ of P. By definition,

  Σ_{i=1}^C ⟨A_i, X⟩_F A_i U^{-1} = λ X.   (1.35)
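To make Proposition 2 and the iteration above concrete, here is a small numpy sketch (toy dimensions; the γ_i are assumed folded into the A_i, and β is simply assumed small enough to satisfy the bound of Theorem 3) that computes the closed-form solution and checks that the interpolated iteration converges to it:

    import numpy as np

    def closed_form_solution(M_X, M_Y, lam, constraint_mats):
        """Proposition 2: W = (M_Y M_X^T - sum_i a_i A_i) U^{-1}, with
        U = M_X M_X^T + lam I and the a_i solving the CxC system (1.29)."""
        N1 = M_X.shape[0]
        U_inv = np.linalg.inv(M_X @ M_X.T + lam * np.eye(N1))
        M = M_Y @ M_X.T
        C = len(constraint_mats)
        S = np.array([[np.sum(Aj * (Ai @ U_inv)) for Ai in constraint_mats]
                      for Aj in constraint_mats])          # S_ji = <A_j, A_i U^{-1}>_F
        b = np.array([np.sum(Aj * (M @ U_inv)) for Aj in constraint_mats])
        a = np.linalg.solve(np.eye(C) + S, b)              # (I + S) a = b, Eq. 1.29
        return (M - sum(ai * Ai for ai, Ai in zip(a, constraint_mats))) @ U_inv

    def iterative_solution(M_X, M_Y, lam, constraint_mats, beta=0.1, num_iters=500):
        """Equations 1.31-1.33: start from the unconstrained W_0 = M_Y M_X^T U^{-1}
        and interpolate towards the fixed point; avoids the O(C^3) inversion.
        beta must satisfy the condition of Theorem 3 (assumed here)."""
        N1 = M_X.shape[0]
        U_inv = np.linalg.inv(M_X @ M_X.T + lam * np.eye(N1))
        M = M_Y @ M_X.T
        W = M @ U_inv                                      # W_0 (Eq. 1.32)
        for _ in range(num_iters):
            penalty = sum(np.sum(A * W) * A for A in constraint_mats)
            W = (1 - beta) * W + beta * (M - penalty) @ U_inv   # Eq. 1.33
        return W

    # toy usage: one constraint pushing W[0, 0] towards zero (sqrt(gamma) in A_1)
    rng = np.random.default_rng(0)
    M_X, M_Y = rng.normal(size=(4, 10)), rng.normal(size=(3, 10))
    A1 = np.zeros((3, 4)); A1[0, 0] = 2.0
    W_direct = closed_form_solution(M_X, M_Y, 0.1, [A1])
    W_iter = iterative_solution(M_X, M_Y, 0.1, [A1])
    assert np.allclose(W_direct, W_iter, atol=1e-6)

Since A_1 here is a single sparse entry, each iteration only touches one Frobenius product, which is the regime where the paper notes the iterative method is very fast.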


[Cortes, Mohri, Weston ’06]

Page 65: Discriminative Methods for Structured Prediction

Experiments

• New “vine-growth” model for sentence summarization

• DUC 2005 data set: 50 sets of 25 documents each

• Evaluation: Rouge (n-gram overlap) vs. human summaries

55


Fig. 6 An example of the creation of a summary under the vine-growth model.

document collection until a pre-defined word limit is reached. [53] and [33] describe representative examples. Recent work in sentence compression [26,38] and document compression [13] attempts to take small steps beyond sentence extraction. Compression models can be seen as techniques for extracting sentences then dropping extraneous information. They are more powerful than simple sentence extraction systems, while remaining trainable and tractable. Unfortunately, their training hinges on the existence of ⟨sentence, compression⟩ pairs, where the compression is obtained from the sentence by only dropping words and phrases (the work of [56] is an exception). Obtaining such data is quite challenging.

The exact model we use for the document summarization task is a novel "vine-growth" model, described in more detail in [12]. The vine-growth method uses syntactic parses of the sentence in the form of dependency structures. In the vine-growth model, if a word w is to be included in the summary, then all words closer to the tree root are included.

6.2.1 Search Space and Actions  The search algorithm we employ for implementing the vine-growth model is based on incrementally growing summaries. In essence, beginning with an empty summary, the algorithm incrementally adds words to the summary, either by beginning a new sentence or growing existing sentences. At any step in search, the root of a new sentence may be added, as may any direct child of a previously added node. To see more clearly how the vine-growth model functions, consider Figure 6. This figure shows a four-step process for creating the summary "the man ate a sandwich ." from the original document sentence "the man ate a big sandwich with pickles ."

When there is more than one sentence in the source documents, the search proceeds asynchronously across all sentences. When the sentences are laid out adjacently, the end summary is obtained by taking all the green summary nodes once a pre-defined word limit has been reached. This final summary is a collection of subtrees grown off a sequence of underlying trees: hence the name "vine-growth."
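To make the search space concrete, here is a small sketch (an illustration only, with a made-up dependency representation, not the authors' implementation) of the legal actions at one step of vine-growth: start a new sentence at its root, or add a direct child of an already-included word:

    def legal_actions(dependency_trees, included):
        """dependency_trees: list of {child_index: parent_index} maps, one per sentence
        (the root has parent None). included: set of (sentence_id, word_index) already
        in the summary. Returns the words that may be added next under vine-growth."""
        actions = []
        for sent_id, parents in enumerate(dependency_trees):
            for word, parent in parents.items():
                if (sent_id, word) in included:
                    continue
                if parent is None or (sent_id, parent) in included:
                    # either the root of a not-yet-started sentence,
                    # or a direct child of a word already in the summary
                    actions.append((sent_id, word))
        return actions

    # toy usage: "the man ate a big sandwich with pickles ." with assumed head indices
    tree = {2: None, 1: 2, 0: 1, 5: 2, 3: 5, 4: 5, 6: 5, 7: 6, 8: 2}
    print(legal_actions([tree], {(0, 2)}))   # after adding the root "ate"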



           ORACLE            SEARN             BAYESUM
           Vine     Extr     Vine     Extr     D05      D03      Base     Best
100 w      .0729    .0362    .0415    .0345    .0340    .0316    .0181    -
250 w      .1351    .0809    .0824    .0767    .0762    .0698    .0403    .0725

Table 2  Summarization results; values are Rouge 2 scores (higher is better).

system. (Note that it is impossible to compare against competing structured prediction techniques. This summarization problem, even in its simplified form, is far too complex to be amenable to other methods.) For comparison, we present results from the BayeSum system [14,16], which achieved the highest score according to human evaluations of responsiveness in DUC 05. This system, as submitted to DUC 05, was trained on DUC 2003 data; the results for this configuration are shown in the "D03" column. For the sake of fair comparison, we also present the results of this system, trained in the same cross-validation approach as the Searn-based systems (column "D05"). Finally, we present the results for the baseline system and for the best DUC 2005 system (according to the Rouge 2 metric).

As we can see from Table 2, at the 100 word level, sentence extraction is a nearly solved problem for this domain and this evaluation metric. That is, the oracle sentence extraction system yields a Rouge score of 0.0362, compared to the score achieved by the Searn system of 0.0345. This difference is on the border of statistical significance at the 95% level. The next noticeable item in the results is that, although the Searn-based extraction system comes quite close to the theoretical optimum, the oracle results for the vine-growth method are significantly higher. Not surprisingly, under Searn, the summaries produced by the vine-growth technique are uniformly better than those produced by raw extraction. The last aspect of the results to notice is how the Searn-based models compare to the best DUC 2005 system, which achieved a Rouge score of 0.0725. The Searn-based systems uniformly dominate this result, but this comparison is not fair due to the training data. We can approximate the expected improvement for having the new training data by comparing the BayeSum system when trained on the DUC 2005 and DUC 2003 data: the improvement is 0.0064 absolute. When this result is added to the best DUC 2005 system, its score rises to 0.0789, which is better than the Searn-based extraction system but not as good as the vine-growth system. It should be noted that the best DUC 2005 system was a purely extractive system [59].

7 Discussion and Conclusions

In this paper, we have:

– Presented an algorithm, Searn, for solving complex structured prediction problems with minimal assumptions on the structure of the output and loss function.

[Daumé ’06] [Daumé, Langford, Marcu ’07]