
Page 1: Hidden Markov Support Vector Machines

Hidden Markov Support Vector Machines

Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov Support Vector Machines,” ICML, 2003.

References:
• Y. Altun and T. Hofmann, “Large margin methods for label sequence learning,” EuroSpeech, 2003.
• V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, MA, USA, 2001, 608 pp.

Page 2: Hidden Markov Support Vector Machines

Abstract

• This paper presents a discriminative learning technique for label sequences based on a combination of the two learning algorithms, Support Vector Machines and Hidden Markov Models, which we call the Hidden Markov Support Vector Machine (HM-SVM).

• The learning procedure is discriminative and is based on a maximum/soft margin criterion.

• It is possible to learn non-linear discriminant functions via kernel functions.

• HM-SVMs have the capability to deal with overlapping features.
– Labels can depend directly on features of past or future observations.

Page 3: Hidden Markov Support Vector Machines

Outline

1. Introduction

2. Joint Feature Functions I/O Mappings

3. Hidden Markov Chain Discriminants

4. Hidden Markov SVM

5. HM-SVM Optimization Algorithm

6. Soft Margin HM-SVM

7. Experiments

Page 4: Hidden Markov Support Vector Machines

Introduction

• The predominant formalism for modeling and predicting label sequences has been based on Hidden Markov Models (HMMs) and variations thereof.

• But HMMs have at least three limitations:
– They are typically trained in a non-discriminative manner.
– The conditional independence assumptions are often too restrictive.
– They are based on explicit feature representations and lack the power of kernel-based methods.

• HM-SVMs address all of the above shortcomings, while retaining some of the key advantages of HMMs (a decoding sketch follows below):
– The Markov chain dependency structure between labels.
– An efficient dynamic programming formulation.
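To illustrate the dynamic programming formulation retained from HMMs, here is a minimal Viterbi-style decoding sketch for a chain-structured linear discriminant. The array layout and the names emission_scores, transition_scores, and viterbi_decode are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def viterbi_decode(emission_scores, transition_scores):
    """Decode the highest-scoring label sequence for a chain discriminant.

    emission_scores:   (T, K) array, score of label k at position t
                       (e.g. the input-label feature contribution).
    transition_scores: (K, K) array, score of moving from label j to label k
                       (the label-label feature contribution).
    Returns a list of T label indices.
    """
    T, K = emission_scores.shape
    delta = np.full((T, K), -np.inf)        # best score of any path ending in (t, k)
    backptr = np.zeros((T, K), dtype=int)   # argmax predecessors

    delta[0] = emission_scores[0]
    for t in range(1, T):
        for k in range(K):
            scores = delta[t - 1] + transition_scores[:, k]
            backptr[t, k] = int(np.argmax(scores))
            delta[t, k] = scores[backptr[t, k]] + emission_scores[t, k]

    # Backtrack from the best final label.
    y = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        y.append(int(backptr[t, y[-1]]))
    return y[::-1]
```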

Page 5: Hidden Markov Support Vector Machines

Introduction (cont.)

• Two crucial ingredients of HM-SVMs:
– The maximum margin principle.
– A kernel-centric approach to learning non-linear discriminant functions.

Page 6: Hidden Markov Support Vector Machines

Joint Feature Functions I/O Mappings

• The general approach: to learn a w-parametrized discriminant function F over input/output pairs and to maximize this function over the response variable to make a prediction.

• F is linear in some combined feature representation of inputs and outputs, as sketched below.
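A minimal sketch of this setup; the signature of F, its linear form, and the prediction rule are reconstructed here as assumptions rather than quoted from the slide's missing formulas:

    F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}, \qquad
    F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \langle \mathbf{w}, \Phi(\mathbf{x}, \mathbf{y}) \rangle, \qquad
    f(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} F(\mathbf{x}, \mathbf{y}; \mathbf{w})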

Page 7: Hidden Markov Support Vector Machines

Joint Feature Functions I/O Mappings (cont.)

• Moreover, we would like to apply kernel functions to avoid performing an explicit mapping when this may become intractable, thus leveraging the theory of kernel-based learning.
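In this setting the relevant kernel acts on input/output pairs; a standard way to write such a joint kernel (a reconstruction, since the slide does not show the formula) is

    K((\mathbf{x}, \mathbf{y}), (\bar{\mathbf{x}}, \bar{\mathbf{y}}))
    = \langle \Phi(\mathbf{x}, \mathbf{y}), \Phi(\bar{\mathbf{x}}, \bar{\mathbf{y}}) \rangle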

Page 8: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants

• The goal of learning label sequences is to learn a mapping f from observation sequences to label sequences, where each label takes values from some label set.

• The training set of labeled sequences:
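The forms of the sequences and of the training set are elided on this slide; the following notation is an assumed reconstruction consistent with the rest of the slides:

    f : \mathcal{X} \to \mathcal{Y}, \qquad
    \mathbf{x} = (x_1, \dots, x_T), \qquad
    \mathbf{y} = (y_1, \dots, y_T), \qquad y_t \in \Sigma

    Training set: \{(\mathbf{x}^i, \mathbf{y}^i)\}_{i=1}^{n}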

Page 9: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants (cont.)

• We also define the output space to consist of all possible label sequences.

• We denote a mapping which maps observation vectors to some representation:

• Inspired by HMMs, we define two types of features (a reconstruction follows below):
– Interactions between attributes of the observation vectors and a specific label.
– Interactions between neighboring labels along the chain.
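A sketch of the two feature types in HMM-style notation; the symbols ψ, σ, σ̄, Σ and the indicator brackets are assumptions filling in the slide's missing formulas:

    \phi^{\text{obs}}_{r,\sigma}(\mathbf{x}, \mathbf{y}; t) = \psi_r(x_t)\,[\![y_t = \sigma]\!],
    \qquad
    \phi^{\text{trans}}_{\sigma,\bar\sigma}(\mathbf{y}; t) = [\![y_{t-1} = \sigma]\!]\,[\![y_t = \bar\sigma]\!],

where ψ_r(x_t) denotes the r-th attribute of the representation of observation x_t and σ, σ̄ range over the label set Σ.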

Page 10: Hidden Markov Support Vector Machines

Hidden Markov Chain Discriminants (cont.)

• In terms of these features, a feature map at position t can be defined by selecting appropriate subsets of the two feature types mentioned above.

• Actually, HMMs only use input-label features and label-label features:

• In the case of HM-SVMs we maintain the latter restriction, but we also include features (a reconstruction follows below):
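A hedged reconstruction of the feature maps referred to above, reusing the assumed symbols from the previous sketch: HMM-style features at position t pair only the current observation with the current label (plus the neighboring-label interaction), while HM-SVMs additionally allow the overlapping features mentioned in the abstract, which pair the label at position t with observations at other positions s.

    HMM-style features at position t:
        \psi_r(x_t)\,[\![y_t = \sigma]\!], \qquad
        [\![y_{t-1} = \sigma]\!]\,[\![y_t = \bar\sigma]\!]

    Additional HM-SVM (overlapping) features:
        \psi_r(x_s)\,[\![y_t = \sigma]\!], \quad s \neq t

    Joint feature map (summing over positions):
        \Phi(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{T} \phi(\mathbf{x}, \mathbf{y}; t)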

Page 11: Hidden Markov Support Vector Machines

Hidden Markov SVM

• Goal: To derive a maximum margin formulation for the joint kernel learning setting.

• Margin:

• The maximum margin problem: Finding a weight vector w that maximizes

• Here fixing the functional margin ( ) will result in the following optimization problem with a quadratic objective function:
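A reconstruction of the margin and of the resulting hard-margin problem, consistent with the discriminant F sketched earlier; fixing the functional margin to 1 is an assumption about the constant elided in the parentheses above:

    \gamma_i = F(\mathbf{x}^i, \mathbf{y}^i; \mathbf{w})
             - \max_{\mathbf{y} \neq \mathbf{y}^i} F(\mathbf{x}^i, \mathbf{y}; \mathbf{w})

    \min_{\mathbf{w}} \ \tfrac{1}{2}\|\mathbf{w}\|^2
    \quad \text{s.t.} \quad
    F(\mathbf{x}^i, \mathbf{y}^i; \mathbf{w}) - F(\mathbf{x}^i, \mathbf{y}; \mathbf{w}) \ge 1
    \quad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}^i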

Page 12: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Each non-linear constraint above can be replaced by an equivalent set of linear constraints:

• Let us further rewrite these constraints by introducing an additional threshold for every example:
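A reconstruction of the two constraint forms: the linearized constraints, and the thresholded form with a per-example threshold written here as θ_i. The symbol and the exact margin constants are assumptions; the two forms agree up to a rescaling of the margin.

    Linearized constraints:
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) - \Phi(\mathbf{x}^i, \mathbf{y}) \rangle \ge 1
        \quad \forall \mathbf{y} \neq \mathbf{y}^i

    With a threshold \theta_i for every example:
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) \rangle + \theta_i \ge 1,
        \qquad
        \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}) \rangle + \theta_i \le -1
        \quad \forall \mathbf{y} \neq \mathbf{y}^i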

Page 13: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Proposition 1: A discriminant function F fulfills the linearized margin constraints for a given example if and only if there exists a threshold for that example such that F fulfills the corresponding thresholded constraints.

Page 14: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• Proof 1: We have introduced these functions to stress that we have basically obtained a binary classification problem, where the pairs (x^i, y^i) take the role of positive examples and the pairs (x^i, y) with y ≠ y^i take the role of negative pseudo-examples.

Page 15: Hidden Markov Support Vector Machines

Hidden Markov SVM (cont.)

• The Lagrangian dual is given by

where
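The dual itself is not shown on the slide. A plausible reconstruction, treating (x^i, y^i) as a positive example and the pairs (x^i, y), y ≠ y^i, as negative pseudo-examples with pseudo-labels z_{iy} ∈ {+1, −1} and dual variables α_{iy} ≥ 0, is the standard SVM dual over these pairs; all symbols here are assumptions:

    \max_{\alpha \ge 0}\ \sum_{i,\mathbf{y}} \alpha_{i\mathbf{y}}
    - \frac{1}{2} \sum_{i,\mathbf{y}} \sum_{j,\bar{\mathbf{y}}}
      z_{i\mathbf{y}}\, z_{j\bar{\mathbf{y}}}\, \alpha_{i\mathbf{y}}\, \alpha_{j\bar{\mathbf{y}}}\,
      \langle \Phi(\mathbf{x}^i, \mathbf{y}), \Phi(\mathbf{x}^j, \bar{\mathbf{y}}) \rangle
    \quad \text{s.t.} \quad
    \sum_{\mathbf{y}} z_{i\mathbf{y}}\, \alpha_{i\mathbf{y}} = 0 \ \ \forall i,

where the per-example equality constraints arise from the thresholds θ_i and the inner product can be replaced by a joint kernel K((x^i, y), (x^j, ȳ)).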

Page 16: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm

• Although it is one of our fundamental assumptions that a complete enumeration of the set of all label sequences is intractable, the actual solution might be extremely sparse, since we expect that only very few negative pseudo-examples will become support vectors.

• The main challenge in terms of computational efficiency is to design a computational scheme that exploits the anticipated sparseness of the solution.

Page 17: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Since the constraints only couple Lagrange parameters for the same training example, we propose to optimize W iteratively, at each iteration optimizing over the subspace spanned by the Lagrange parameters of a fixed example i.

• Obviously, by repeatedly cycling through the data set and optimizing over these subspaces, one defines a coordinate ascent optimization procedure that converges towards the correct solution, provided the problem is feasible (i.e., the training data is linearly separable).

Page 18: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Proposition 2: Assume a working set is given, and that a solution for the working set has been obtained. If there exists a negative pseudo-example, not yet in the working set, whose constraint is violated by the current solution, then adding it to the working set and optimizing over the enlarged set, subject to the same constraints, yields a strict improvement of the objective function.

Page 19: Hidden Markov Support Vector Machines

HM-SVM Optimization Algorithm (cont.)

• Algorithm: Working set optimization for HM-SVMs.
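The algorithm box is not reproduced in this transcript. The following Python sketch reconstructs the working-set procedure as described in the surrounding slides (cycle over examples, find a violating negative pseudo-example by decoding, add it to that example's working set, re-optimize the corresponding dual variables). Function names such as decode_best_wrong, solve_subproblem, and margin_violation are illustrative assumptions, not the paper's API.

```python
def working_set_optimization(examples, decode_best_wrong, solve_subproblem,
                             margin_violation, max_epochs=100):
    """Sketch of working-set optimization for HM-SVMs.

    examples:          list of (x, y) training pairs.
    decode_best_wrong: returns the highest-scoring label sequence != y
                       under the current model (e.g. via Viterbi decoding).
    solve_subproblem:  optimizes the dual variables of a single example
                       over its current working set (a small QP).
    margin_violation:  checks whether a candidate sequence violates its
                       margin constraint under the current solution.
    """
    working_sets = {i: set() for i in range(len(examples))}  # negative pseudo-examples per example
    for _ in range(max_epochs):
        changed = False
        for i, (x, y) in enumerate(examples):
            y_hat = decode_best_wrong(x, y)            # most violating competitor
            if margin_violation(x, y, y_hat) and tuple(y_hat) not in working_sets[i]:
                working_sets[i].add(tuple(y_hat))      # grow the working set
                solve_subproblem(i, working_sets[i])   # re-optimize this example's duals
                changed = True
        if not changed:                                # no violated constraints remain
            break
    return working_sets
```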

Page 20: Hidden Markov Support Vector Machines

Soft Margin HM-SVM

• In the non-separable case, one may also want to introduce slack variables to allow margin violations.

• Using the more common L1 penalty, one gets the following optimization problem:
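A reconstruction of the L1-penalized soft-margin problem, with one slack variable per training sequence; placing the slack per sequence rather than per constraint is an assumption consistent with the rest of the slides:

    \min_{\mathbf{w}, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad
    \langle \mathbf{w}, \Phi(\mathbf{x}^i, \mathbf{y}^i) - \Phi(\mathbf{x}^i, \mathbf{y}) \rangle \ge 1 - \xi_i,
    \qquad \xi_i \ge 0, \qquad \forall i,\ \forall \mathbf{y} \neq \mathbf{y}^i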

Page 21: Hidden Markov Support Vector Machines

Soft Margin HM-SVM (cont.)

• The Lagrangian function for this case is

• Differentiating w.r.t. the slack variables gives
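A reconstruction of this step under the soft-margin problem sketched above, with dual variables α_{iy} ≥ 0 for the margin constraints and r_i ≥ 0 for the constraints ξ_i ≥ 0 (both symbols are assumptions):

    \frac{\partial \mathcal{L}}{\partial \xi_i}
    = C - \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} - r_i = 0
    \quad \Longrightarrow \quad
    \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} = C - r_i \le C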

Page 22: Hidden Markov Support Vector Machines

Soft Margin HM-SVM (cont.)

• The box constraints on the dual variables thus take the following form (a reconstruction follows below):

• In addition, the KKT conditions imply that whenever a slack variable is strictly positive, its multiplier vanishes, which means that the corresponding upper bound is attained with equality.
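A reconstruction of the box constraints and of the KKT implication, reusing the assumed symbols from the previous sketch:

    0 \le \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} \le C \quad \text{for every example } i,
    \qquad
    \xi_i > 0 \ \Rightarrow \ r_i = 0 \ \Rightarrow \ \sum_{\mathbf{y} \neq \mathbf{y}^i} \alpha_{i\mathbf{y}} = C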

Page 23: Hidden Markov Support Vector Machines

Experiments

Page 24: Hidden Markov Support Vector Machines

Experiments (cont.)