
Page 1: Hidden Markov Support Vector Machines

Hidden Markov Support Vector Machines

Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov Support Vector Machines,” ICML, 2003.

References:
• Y. Altun and T. Hofmann, “Large margin methods for label sequence learning,” EuroSpeech, 2003.
• V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, MA, USA, 2001, 608 pp.

Page 2: Hidden Markov Support Vector Machines

Abstract

• This paper presents a discriminative learning technique for label sequences based on a combination of the two learning algorithms, Support Vector Machines and Hidden Markov Models, which we call the Hidden Markov Support Vector Machine (HM-SVM).

• The learning procedure is discriminative and is based on a maximum/soft margin criterion.

• It is possible to learn non-linear discriminant functions via kernel functions.

• HM-SVMs have the capability to deal with overlapping features.
– Labels can depend directly on features of past or future observations.

Page 3: Hidden Markov Support Vector Machines

Outline

1. Introduction

2. Joint Feature Functions I/O Mappings

3. Hidden Markov Chain Discriminants

4. Hidden Markov SVM

5. HM-SVM Optimization Algorithm

6. Soft Margin HM-SVM

7. Experiments

Page 4: Hidden Markov Support Vector Machines

Introduction

• The predominant formalism for modeling and predicting label sequences has been based on Hidden Markov Models (HMMs) and variations thereof.

• But HMMs have at least three limitations:
– They are typically trained in a non-discriminative manner.
– The conditional independence assumptions are often too restrictive.
– They are based on explicit feature representations and lack the power of kernel-based methods.

• HM-SVMs address all of the above shortcomings, while retaining some of the key advantages of HMMs (a decoding sketch follows below):
– The Markov chain dependency structure between labels.
– An efficient dynamic programming formulation.
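To illustrate the dynamic programming formulation retained from HMMs, here is a minimal Viterbi-style decoding sketch for a chain-structured linear discriminant. The array layout and the names emission_scores, transition_scores, and viterbi_decode are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def viterbi_decode(emission_scores, transition_scores):
    """Decode the highest-scoring label sequence for a chain discriminant.

    emission_scores:   (T, K) array, score of label k at position t
                       (e.g. the input-label feature contribution).
    transition_scores: (K, K) array, score of moving from label j to label k
                       (the label-label feature contribution).
    Returns a list of T label indices.
    """
    T, K = emission_scores.shape
    delta = np.full((T, K), -np.inf)        # best score of any path ending in (t, k)
    backptr = np.zeros((T, K), dtype=int)   # argmax predecessors

    delta[0] = emission_scores[0]
    for t in range(1, T):
        for k in range(K):
            scores = delta[t - 1] + transition_scores[:, k]
            backptr[t, k] = int(np.argmax(scores))
            delta[t, k] = scores[backptr[t, k]] + emission_scores[t, k]

    # Backtrack from the best final label.
    y = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]
```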

Page 5: Hidden Markov Support Vector Machines

Introduction (cont.)

• Two crucial ingredients of HM-SVMs:
– The maximum margin principle.
– A kernel-centric approach to learning non-linear discriminant functions.

Page 6: Hidden Markov Support Vector Machines

Joint Feature Functions I/O Mappings

• The general approach: to learn a w-parametrized discriminant function F over input/output pairs and to maximize this function over the response variable to make a prediction.

• F is linear in some combined feature representation of inputs and outputs, as sketched below.
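A minimal sketch of this setup; the signature of F, its linear form, and the prediction rule are reconstructed here as assumptions rather than quoted from the slide's missing formulas:

    F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \qquad
    F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \langle \mathbf{w}, \Phi(\mathbf{x}, \mathbf{y}) \rangle, \qquad
    f(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} F(\mathbf{x}, \mathbf{y}; \mathbf{w})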

Page 7: Hidden Markov Support Vector Machines

Joint Feature Functions I/O Mappings (cont.)

• Moreover, we would like to apply kernel functions to avoid performing an explicit mapping when this may become intractable, thus leveraging the theory of kernel-based learning.
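In this setting the relevant kernel acts on input/output pairs; a standard way to write such a joint kernel (a reconstruction, since the slide does not show the formula) is

    K((\mathbf{x}, \mathbf{y}), (\bar{\mathbf{x}}, \bar{\mathbf{y}}))
    = \langle \Phi(\mathbf{x}, \mathbf{y}), \Phi(\bar{\mathbf{x}}, \bar{\mathbf{y}}) \rangle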

Page 8: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants

• The goal of learning label sequences is to learn a mapping f from observation sequences to label sequences, where each label takes values from some label set.

• The training set of labeled sequences:
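The forms of the sequences and of the training set are elided on this slide; the following notation is an assumed reconstruction consistent with the rest of the slides:

    f : \mathcal{X} \to \mathcal{Y}, \qquad
    \mathbf{x} = (x_1, \dots, x_T), \qquad
    \mathbf{y} = (y_1, \dots, y_T), \qquad y_t \in \Sigma

    Training set: \{(\mathbf{x}^i, \mathbf{y}^i)\}_{i=1}^{n}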

Page 9: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants (cont.)

• We also define the output space to consist of all possible label sequences.

• We denote a mapping which maps observation vectors to some representation:

• Inspired by HMMs, we define two types of features (a reconstruction follows below):
– Interactions between attributes of the observation vectors and a specific label.
– Interactions between neighboring labels along the chain.
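A sketch of the two feature types in HMM-style notation; the symbols ψ, σ, σ̄, Σ and the indicator brackets are assumptions filling in the slide's missing formulas:

    \phi^{\text{obs}}_{r,\sigma}(\mathbf{x}, \mathbf{y}; t) = \psi_r(x_t)\,[\![y_t = \sigma]\!],
    \qquad
    \phi^{\text{trans}}_{\sigma,\bar\sigma}(\mathbf{y}; t) = [\![y_{t-1} = \sigma]\!]\,[\![y_t = \bar\sigma]\!],

where ψ_r(x_t) denotes the r-th attribute of the representation of observation x_t and σ, σ̄ range over the label set Σ.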

Page 10: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants (cont.)

• In terms of these features, a feature map at position t can be defined by selecting appropriate subsets of the two feature types mentioned above.

• Actually, HMMs only use input-label features and label-label features:

• In the case of HM-SVMs we maintain the latter restriction, but we also include features (a reconstruction follows below):
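A hedged reconstruction of the feature maps referred to above, reusing the assumed symbols from the previous sketch: HMM-style features at position t pair only the current observation with the current label (plus the neighboring-label interaction), while HM-SVMs additionally allow the overlapping features mentioned in the abstract, which pair the label at position t with observations at other positions s.

    HMM-style features at position t:
        \psi_r(x_t)\,[\![y_t = \sigma]\!], \qquad
        [\![y_{t-1} = \sigma]\!]\,[\![y_t = \bar\sigma]\!]

    Additional HM-SVM (overlapping) features:
        \psi_r(x_s)\,[\![y_t = \sigma]\!], \quad s \neq t

    Joint feature map (summing over positions):
        \Phi(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{T} \phi(\mathbf{x}, \mathbf{y}; t)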

Page 11: Hidden Markov Support Vector Machines

Hidden Markov SVM

• Goal: To derive a maximum margin formulation for the joint kernel learning setting.

• Margin:

• The maximum margin problem: Finding a weight vector w that maximizes

• Here fixing the functional margin ( ) will result in the following optimization problem with a quadratic objective function:
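A reconstruction of the margin and of the resulting hard-margin problem, consistent with the discriminant F sketched earlier; fixing the functional margin to 1 is an assumption about the constant elided in the parentheses above:

    \gamma_i = F(\mathbf{x}^i, \mathbf{y}^i; \mathbf{w})
             - \max_{\mathbf{y} \neq \mathbf{y}^i} F(\mathbf{x}^i, \mathbf{y}; \mathbf{w})

    \min_{\mathbf{w}} \ \tfrac{1}{2}\|\mathbf{w}\|^2
    \quad \text{s.t.} \quad
    F(\mathbf{x}^i, \mathbf{y}^i; \mathbf{w}) - F(\mathbf{x}^i, \mathbf{y}; \mathbf{w}) \ge 1
    \quad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}^i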

Page 12: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Each non-linear constraint above can be replaced by an equivalent set of linear constraints:

• Let us further rewrite these constraints by introducing an additional threshold for every example:
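A reconstruction of the two constraint forms: the linearized constraints, and the thresholded form with a per-example threshold written here as θ_i. The symbol and the exact margin constants are assumptions; the two forms agree up to a rescaling of the margin.

    Linearized constraints:
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) - \Phi(\mathbf{x}^i, \mathbf{y}) \rangle \ge 1
        \quad \forall \mathbf{y} \neq \mathbf{y}^i

    With a threshold \theta_i for every example:
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) \rangle + \theta_i \ge 1,
        \qquad
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}) \rangle + \theta_i \le -1
        \quad \forall \mathbf{y} \neq \mathbf{y}^i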

Page 13: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Proposition 1: A discriminant function F fulfills the linearized margin constraints for a given example if and only if there exists a threshold for that example such that F fulfills the corresponding thresholded constraints.

Page 14: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Proof 1: We have introduced these functions to stress that we have basically obtained a binary classification problem, where the pairs (x^i, y^i) take the role of positive examples and the pairs (x^i, y) with y ≠ y^i take the role of negative pseudo-examples.

Page 15: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• The Lagrangian dual is given by

where
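The dual itself is not shown on the slide. A plausible reconstruction, treating (x^i, y^i) as a positive example and the pairs (x^i, y), y ≠ y^i, as negative pseudo-examples with pseudo-labels z_{iy} ∈ {+1, −1} and dual variables α_{iy} ≥ 0, is the standard SVM dual over these pairs; all symbols here are assumptions:

    \max_{\alpha \ge 0}\ \sum_{i,\mathbf{y}} \alpha_{i\mathbf{y}}
    - \frac{1}{2} \sum_{i,\mathbf{y}} \sum_{j,\bar{\mathbf{y}}}
      z_{i\mathbf{y}}\, z_{j\bar{\mathbf{y}}}\, \alpha_{i\mathbf{y}}\, \alpha_{j\bar{\mathbf{y}}}\,
      \langle \Phi(\mathbf{x}^i, \mathbf{y}), \Phi(\mathbf{x}^j, \bar{\mathbf{y}}) \rangle
    \quad \text{s.t.} \quad
    \sum_{\mathbf{y}} z_{i\mathbf{y}}\, \alpha_{i\mathbf{y}} = 0 \ \ \forall i,

where the per-example equality constraints arise from the thresholds θ_i and the inner product can be replaced by a joint kernel K((x^i, y), (x^j, ȳ)).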

Page 16: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm

• Although it is one of our fundamental assumptions that a complete enumeration of the set of all label sequences is intractable, the actual solution might be extremely sparse, since we expect that only very few negative pseudo-examples will become support vectors.

• The main challenge in terms of computational efficiency is to design a computational scheme that exploits the anticipated sparseness of the solution.

Page 17: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Since the constraints only couple Lagrange parameters for the same training example, we propose to optimize W iteratively, at each iteration optimizing over the subspace spanned by the Lagrange parameters of a fixed example i.

• Obviously, by repeatedly cycling through the data set and optimizing over these subspaces, one defines a coordinate ascent optimization procedure that converges towards the correct solution, provided the problem is feasible (i.e., the training data is linearly separable).

Page 18: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Proposition 2: Assume a working set is given, and that a solution for the working set has been obtained. If there exists a negative pseudo-example, not yet in the working set, whose constraint is violated by the current solution, then adding it to the working set and optimizing over the enlarged set, subject to the same constraints, yields a strict improvement of the objective function.

Page 19: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Algorithm: Working set optimization for HM-SVMs.
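The algorithm box is not reproduced in this transcript. The following Python sketch reconstructs the working-set procedure as described in the surrounding slides (cycle over examples, find a violating negative pseudo-example by decoding, add it to that example's working set, re-optimize the corresponding dual variables). Function names such as decode_best_wrong, solve_subproblem, and margin_violation are illustrative assumptions, not the paper's API.

```python
def working_set_optimization(examples, decode_best_wrong, solve_subproblem,
                             margin_violation, max_epochs=100):
    """Sketch of working-set optimization for HM-SVMs.

    examples:          list of (x, y) training pairs.
    decode_best_wrong: returns the highest-scoring label sequence != y
                       under the current model (e.g. via Viterbi decoding).
    solve_subproblem:  optimizes the dual variables of a single example
                       over its current working set (a small QP).
    margin_violation:  checks whether a candidate sequence violates its
                       margin constraint under the current solution.
    """
    working_sets = {i: set() for i in range(len(examples))}  # negative pseudo-examples per example
    for _ in range(max_epochs):
        changed = False
        for i, (x, y) in enumerate(examples):
            y_hat = decode_best_wrong(x, y)            # most violating competitor
            if margin_violation(x, y, y_hat) and tuple(y_hat) not in working_sets[i]:
                working_sets[i].add(tuple(y_hat))      # grow the working set
                solve_subproblem(i, working_sets[i])   # re-optimize this example's duals
                changed = True
        if not changed:                                # no violated constraints remain
            break
    return working_sets
```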

Page 20: Hidden Markov Support Vector Machines

Soft Margin HM-SVM

• In the non-separable case, one may also want to introduce slack variables to allow margin violations.

• Using the more common L1 penalty, one gets the following optimization problem:
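A reconstruction of the L1-penalized soft-margin problem, with one slack variable per training sequence; placing the slack per sequence rather than per constraint is an assumption consistent with the rest of the slides:

    \min_{\mathbf{w}, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad
    \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) - \Phi(\mathbf{x}^i, \mathbf{y}) \rangle \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}^i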

Page 21: Hidden Markov Support Vector Machines

Soft Margin HM-SVM (cont.)

• The Lagrangian function for this case is

• Differentiating w.r.t. the slack variables gives
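A reconstruction of this step under the soft-margin problem sketched above, with dual variables α_{iy} ≥ 0 for the margin constraints and r_i ≥ 0 for the constraints ξ_i ≥ 0 (both symbols are assumptions):

    \frac{\partial \mathcal{L}}{\partial \xi_i}
    = C - \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} - r_i = 0
    \quad \Longrightarrow \quad
    \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} = C - r_i \le C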

Page 22: Hidden Markov Support Vector Machines

Soft Margin HM-SVM (cont.)

• The box constraints on the dual variables thus take the following form (a reconstruction follows below):

• In addition, the KKT conditions imply that whenever a slack variable is strictly positive, its multiplier vanishes, which means that the corresponding upper bound is attained with equality.
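A reconstruction of the box constraints and of the KKT implication, reusing the assumed symbols from the previous sketch:

    0 \le \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} \le C \quad \text{for every example } i,
    \qquad
    \xi_i > 0 \ \Rightarrow \ r_i = 0 \ \Rightarrow \ \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} = C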

Page 23: Hidden Markov Support Vector Machines

Experiments

Page 24: Hidden Markov Support Vector Machines

Experiments (cont.)