
© Padhraic Smyth, UC Irvine

A Review of Hidden Markov Models for Context-Based Classification

ICML’01 Workshop on Temporal and Spatial Learning

Williams College, June 28th, 2001

Padhraic Smyth
Information and Computer Science

University of California, Irvine

www.datalab.uci.edu


Outline

• Context in classification

• Brief review of hidden Markov models

• Hidden Markov models for classification

• Simulation results: how useful is context? (with Dasha Chudova, UCI)


Historical Note

• “Classification in Context” was well-studied in pattern recognition in the 60’s and 70’s
  – e.g., recursive Markov-based algorithms were proposed before hidden Markov algorithms and models were fully understood

• Applications in
  – OCR for word-level recognition
  – remote-sensing pixel classification


Papers of Note

Raviv, J., “Decision-making in Markov chains applied to the problem of pattern recognition,” IEEE Trans. Information Theory, 13(4), 1967.

Hanson, Riseman, and Fisher, “Context in word recognition,” Pattern Recognition, 1976.

Toussaint, G., “The use of context in pattern recognition,” Pattern Recognition, 10, 1978.

Mohn, Hjort, and Storvik, “A simulation study of some contextual classification methods for remotely sensed data,” IEEE Trans. Geoscience and Remote Sensing, 25(6), 1987.


Context-Based Classification Problems

• Medical Diagnosis
  – classification of a patient’s state over time

• Fraud Detection
  – detection of stolen credit cards

• Electronic Nose
  – detection of landmines

• Remote Sensing
  – classification of pixels into ground cover


Modeling Context

• Common Theme = Context
  – class labels (and features) are “persistent” in time/space

[Figure: chain-structured graphical model: hidden class variables C1, C2, C3, …, CT over time (Class, hidden), each emitting an observed feature vector X1, X2, X3, …, XT (Features, observed)]


Feature Windows

• Predict Ct using a window, e.g., f(Xt, Xt-1, Xt-2)

– e.g., NETtalk application

[Figure: same chain model as above; a sliding window of observed features around time t is used to predict Ct]


Alternative: Probabilistic Modeling

• E.g., assume p(Ct | history) = p(Ct | Ct-1)

– first order Markov assumption on the classes

[Figure: chain model as above, hidden classes C1, …, CT with observed features X1, …, XT]


Brief review of hidden Markov models (HMMs)


Graphical Models

• Basic Idea: p(U) <=> an annotated graph

– Let U be a set of random variables of interest

– 1-1 mapping from U to nodes in a graph

– graph encodes “independence structure” of model

– numerical specifications of p(U) are stored locally at the nodes


Acyclic Directed Graphical Models (aka belief/Bayesian networks)

[Figure: three-node DAG with edges A → C and B → C]

In general,

p(X1, X2, …, XN) = ∏i p(Xi | parents(Xi))

For this graph: p(A,B,C) = p(C|A,B) p(A) p(B)
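To make the factorization concrete, here is a minimal sketch (mine, not from the talk) for the three-node example, with made-up probability tables for binary A, B, C:

```python
import itertools

# Made-up probability tables for binary A, B, C (illustration only).
p_A = {0: 0.7, 1: 0.3}
p_B = {0: 0.6, 1: 0.4}
p_C1_given_AB = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(C=1 | A=a, B=b)

def joint(a, b, c):
    """p(A=a, B=b, C=c), read off the DAG: each node conditioned on its parents."""
    p_c = p_C1_given_AB[(a, b)] if c == 1 else 1.0 - p_C1_given_AB[(a, b)]
    return p_A[a] * p_B[b] * p_c

# The factorized joint is a proper distribution: it sums to 1.
total = sum(joint(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))
assert abs(total - 1.0) < 1e-12
```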


Undirected Graphical Models (UGs)

• Undirected edges reflect correlational dependencies
  – e.g., particles in physical systems, pixels in an image

• Also known as Markov random fields, Boltzmann machines, etc.

p(X1, X2, …, XN) ∝ ∏i potential(clique i)



Examples of 3-way Graphical Models

[Figure: a chain A → B → C, and a v-structure A → C ← B]

Markov chain: p(A,B,C) = p(C|B) p(B|A) p(A)

Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)


Hidden Markov Graphical Model

• Assumption 1:
  – p(Ct | history) = p(Ct | Ct-1)
  – first order Markov assumption on the classes

• Assumption 2:
  – p(Xt | history, Ct) = p(Xt | Ct)
  – Xt only depends on the current class Ct
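Together, the two assumptions give the familiar HMM joint factorization (written out here for reference; the slide does not spell it out):

```latex
p(C_{1:T}, X_{1:T}) \;=\; p(C_1)\, p(X_1 \mid C_1) \prod_{t=2}^{T} p(C_t \mid C_{t-1})\, p(X_t \mid C_t)
```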


Hidden Markov Graphical Model

[Figure: HMM graphical model: hidden classes C1, …, CT (one per time step) emitting observed features X1, …, XT]

Notes:
- all temporal dependence is modeled through the class variable C
- this is the simplest possible model
- avoids modeling p(X | other X’s)
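A minimal generative sketch of this model (my own illustration; the two-class parameter values are invented, not taken from the talk): the classes follow a first-order Markov chain and each Xt is drawn given only the current class.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],        # A[i, j] = p(C_t = j | C_{t-1} = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])        # initial class distribution p(C_1)
means = np.array([0.0, 2.0])     # class-conditional Gaussian means
sigma = 1.0                      # common standard deviation

def sample_hmm(T):
    """Draw a class sequence C_1..C_T and observations X_1..X_T from the HMM."""
    C = np.empty(T, dtype=int)
    C[0] = rng.choice(2, p=pi)
    for t in range(1, T):
        C[t] = rng.choice(2, p=A[C[t - 1]])     # Assumption 1: Markov classes
    X = rng.normal(means[C], sigma)             # Assumption 2: X_t depends only on C_t
    return C, X

C, X = sample_hmm(200)
```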


Generalizations of HMMs

[Figure: hidden “weather state” chain C1, …, CT with observed atmospheric measurements A1, …, AT and observed spatial rainfall R1, …, RT at each time step]

Hidden state model relating atmospheric measurements to local rainfall

“Weather state” couples multiple variables in time and space

(Hughes and Guttorp, 1996)

Graphical models = a language for spatio-temporal modeling


Exact Probability Propagation (PP) Algorithms

• Basic PP Algorithm
  – Pearl, 1988; Lauritzen and Spiegelhalter, 1988
  – Assume the graph has no loops
  – Declare 1 node (any node) to be the root
  – Schedule two phases of message-passing
    • nodes pass messages up to the root
    • messages are distributed back to the leaves
  – (if there are loops, convert the loopy graph to an equivalent tree)


Properties of the PP Algorithm

• Exact
  – p(node | all data) is recoverable at each node
    • i.e., we get exact posteriors from local message-passing
  – modification: MPE = most likely instantiation of all nodes jointly

• Efficient
  – Complexity: exponential in the size of the largest clique
  – Brute force: exponential in all variables


PP Algorithm for a HMM

[Figure: HMM chain, hidden classes C1, …, CT with observed features X1, …, XT]

Let CT be the root

Absorb evidence from the X’s (which are fixed)

Forward pass: pass evidence forward from C1

Backward pass: pass evidence backward from CT

(This is the celebrated “forward-backward” algorithm for HMMs)
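To make the message-passing concrete, here is a minimal sketch of the scaled forward-backward recursions for a discrete-class HMM (my own illustrative implementation, not code from the talk). It assumes lik[t, k] holds p(x_t | C_t = k), and reuses the A, pi, means, sigma, X conventions from the sampling sketch above.

```python
import numpy as np
from scipy.stats import norm

def forward_backward(lik, A, pi):
    """Posterior marginals p(C_t = k | x_1..x_T); lik has shape (T, K)."""
    T, K = lik.shape
    alpha = np.zeros((T, K))                     # scaled forward messages
    beta = np.zeros((T, K))                      # scaled backward messages
    scale = np.zeros(T)

    # Forward pass: absorb the evidence and pass it forward from C_1.
    alpha[0] = pi * lik[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass: pass the evidence back from C_T, reusing the same scales.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1]) / scale[t + 1]

    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Gaussian class-conditional likelihoods for the data sampled above.
lik = norm.pdf(X[:, None], loc=means[None, :], scale=sigma)
posteriors = forward_backward(lik, A, pi)        # shape (T, 2)
```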


Comments on F-B Algorithm

• Complexity = O(T m²), where m is the number of classes

• Has been reinvented several times
  – e.g., the BCJR algorithm for error-correcting codes

• Real-time recursive version
  – run the algorithm forward to the current time t
  – can propagate backwards to “revise” history


HMMs and Classification


Forward-Backward Algorithm

• Classification

– the algorithm produces p(Ct | all other data) at each node
  – to minimize 0-1 loss, choose the most likely class at each t

• Most likely class sequence?
  – not the same as the sequence of most likely classes
  – can be found instead with Viterbi / dynamic programming
    • replace the sums in F-B with “max”
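For the most likely class sequence, here is a matching Viterbi sketch in log space (the forward sums replaced by max), using the same lik / A / pi conventions; again an illustration, not the talk's code.

```python
import numpy as np

def viterbi(lik, A, pi):
    """Most likely joint sequence: argmax over c_1..c_T of p(c_1..c_T | x_1..x_T)."""
    T, K = lik.shape
    log_A = np.log(A)
    delta = np.log(pi) + np.log(lik[0])      # best log-score of a path ending in each class
    back = np.zeros((T, K), dtype=int)       # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(lik[t])
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```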


Supervised HMM learning

• Use your favorite classifier to learn p(C|X)
  – i.e., ignore the temporal aspect of the problem (temporarily)

• Now estimate p(Ct | Ct-1) from labeled training data

• We have a fully operational HMM
  – no need to use EM for learning if class labels are provided (i.e., this is “supervised HMM learning”)
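A minimal sketch of the transition-estimation step (counting labeled transitions; the add-one smoothing is my own choice, and p(C|X) comes from whatever classifier you already trained):

```python
import numpy as np

def estimate_transitions(label_sequences, n_classes, smoothing=1.0):
    """Estimate A[i, j] = p(C_t = j | C_{t-1} = i) from labeled class sequences."""
    counts = np.full((n_classes, n_classes), smoothing)   # pseudo-counts (assumption)
    for seq in label_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# e.g., two short labeled sequences over classes {0, 1}
A_hat = estimate_transitions([[0, 0, 0, 1, 1, 0], [1, 1, 1, 0]], n_classes=2)
```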


Fault Diagnosis Application (Smyth, Pattern Recognition, 1994)

[Figure: HMM chain, hidden fault classes C1, …, CT with observed features X1, …, XT]

Fault Detection in 34m Antenna Systems:

Classes: {normal, short-circuit, tacho problem, ..}

Features: AR coefficients measured every 2 seconds

Classes are persistent over time


Approach and Results

• Classifiers
  – a Gaussian model and a neural network
  – trained on labeled “instantaneous window” data

• Markov component
  – transition probabilities estimated from MTBF data

• Results
  – the discriminative neural net was much better than the Gaussian model
  – the Markov component reduced the error rate (all false alarms) from 2% to 0%


Classification with and without the Markov context

We will compare what happens when

(a) we just make decisions based on p(Ct | Xt) (“ignore context”)

(b) we use the full Markov context (i.e., use forward-backward to “integrate” temporal information)

[Figure: HMM chain, hidden classes C1, …, CT with observed features X1, …, XT]


[Figure: “Mixture Model”: density p(x) vs. x for a two-component Gaussian mixture (Component 1 and Component 2)]


Gaussian vs HMM Classification

[Figure: top panel shows the observations over t = 1..100 with the true states overlaid; middle panel shows posterior class probabilities from the HMM (forward-backward) vs. the context-free Gaussian classifier; bottom panels show the resulting HMM decoding and Gaussian decoding of the class sequence]


Simulation Experiments


Systematic Simulations

Simulation Setup

1. Two Gaussian classes, at mean 0 and mean 1
   => vary “separation” = sigma of the Gaussians

2. Markov dependence A = [p, 1-p; 1-p, p]
   Vary p (self-transition) = “strength of context”

Look at the Bayes error with and without context

[Figure: HMM chain, hidden classes C1, …, CT with observed features X1, …, XT]
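A Monte Carlo sketch of this setup (mine; the specific sep, p, and T values are illustrative). It generates a labeled sequence from the symmetric two-class HMM, then compares (a) classifying each Xt on its own against (b) using the forward-backward posteriors from the sketch given earlier. The separation is varied here by moving the class means with sigma fixed at 1, which is equivalent to the slide's "means 0 and 1, vary sigma" parameterization.

```python
import numpy as np
from scipy.stats import norm

def simulated_error_rates(sep, p, T=100_000, seed=0):
    """Return (error ignoring context, error using the full Markov context)."""
    rng = np.random.default_rng(seed)
    A = np.array([[p, 1 - p], [1 - p, p]])
    pi = np.array([0.5, 0.5])
    means = np.array([0.0, sep])

    # Generate one long labeled sequence from the HMM.
    C = np.empty(T, dtype=int)
    C[0] = rng.choice(2, p=pi)
    for t in range(1, T):
        C[t] = rng.choice(2, p=A[C[t - 1]])
    X = rng.normal(means[C], 1.0)

    lik = norm.pdf(X[:, None], loc=means[None, :], scale=1.0)

    err_no_context = np.mean(lik.argmax(axis=1) != C)   # (a) decide from p(C_t | X_t) alone
    post = forward_backward(lik, A, pi)                 # (b) forward-backward posteriors
    err_context = np.mean(post.argmax(axis=1) != C)
    return err_no_context, err_context

print(simulated_error_rates(sep=1.0, p=0.9))
```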


[Figure: class-conditional densities p(x) for Class 1 and Class 2 at separation = 3 sigma; Bayes error = 0.08]


[Figure: class-conditional densities p(x) for Class 1 and Class 2 at separation = 1 sigma; Bayes error = 0.31]


[Figure: “Bayes Error vs. Markov Probability”: Bayes error rate vs. self-transition probability (0.5 to 1), for separations 0.1, 1, 2, and 4]


[Figure: “Bayes Error vs. Gaussian Separation”: Bayes error rate vs. separation (0 to 4), for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]


[Figure: “% Reduction in Bayes Error vs. Gaussian Separation”: percent decrease in Bayes error vs. separation, for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]


In summary….

• Context reduces error
  – greater Markov dependence => greater reduction

• The reduction is dramatic for p > 0.9
  – e.g., even with minimal Gaussian separation, the Bayes error can be reduced to zero!


Approximate Methods

• Forward-only
  – necessary in many applications

• “Two nearest-neighbors”
  – only use information from C(t-1) and C(t+1)

• How suboptimal are these methods?
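Sketches of the two approximations (my own reading of them, in the same lik / A / pi conventions as before): forward-only keeps just the filtered posterior p(Ct | x1..xt), while the "two nearest-neighbors" rule here uses only x(t-1), x(t), x(t+1) and treats the class marginal at t-1 as the uniform stationary distribution.

```python
import numpy as np

def forward_only(lik, A, pi):
    """Filtered posteriors p(C_t | x_1..x_t): the forward pass alone."""
    T, K = lik.shape
    post = np.zeros((T, K))
    alpha = pi * lik[0]
    post[0] = alpha / alpha.sum()
    for t in range(1, T):
        alpha = (post[t - 1] @ A) * lik[t]
        post[t] = alpha / alpha.sum()
    return post

def two_neighbor(lik, A, pi):
    """Posterior for C_t using only x_{t-1}, x_t, x_{t+1}, ignoring the rest."""
    T, K = lik.shape
    post = np.zeros((T, K))
    for t in range(T):
        left = (pi * lik[t - 1]) @ A if t > 0 else pi        # sum out C_{t-1}
        right = A @ lik[t + 1] if t < T - 1 else np.ones(K)  # sum out C_{t+1}
        p = left * lik[t] * right
        post[t] = p / p.sum()
    return post
```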


[Figure: “Bayes Error vs. Markov Probability”: Bayes error rate vs. log-odds of the self-transition probability, comparing FwBw, Fw, and NN2 at separation = 1]

[Figure: the same comparison at separation = 0.25]

[Figure: “Bayes Error vs. Gaussian Separation”: Bayes error rate vs. separation, comparing FwBw, Fw, and NN2 at self-transition = 0.99, plus a Bayes error reference curve]

[Figure: the same comparison at self-transition = 0.9]


In summary (for approximations)….

• Forward only
  – “tracks” the forward-backward reductions
  – generally gets much more than 50% of the gap between F-B and the context-free Bayes error

• 2-neighbors
  – typically worse than forward only
  – much worse for small separation
  – much worse for very high transition probabilities
    • does not converge to zero Bayes error


Extensions to “Simple” HMMs

Semi-Markov models: the duration in each state need not be geometric

Segmental Markov models: outputs within each state have a non-constant mean (a regression function)

Dynamic belief networks: allow arbitrary dependencies among classes and features

Stochastic grammars, spatial landmark models, etc.

[See the afternoon talks at this workshop for other approaches]


Conclusions

• Context is increasingly important in many classification applications

• Graphical models
  – HMMs are a simple and practical approach
  – graphical models provide a general-purpose language for context

• Theory/Simulation
  – the effect of context on the error rate can be dramatic


[Figure: “Absolute Reduction in Bayes Error vs. Gaussian Separation”, for self-transition probabilities 0.5, 0.9, 0.94, and 0.99]

[Figure: “Bayes Error vs. Markov Probability”: Bayes error rate vs. log-odds of the self-transition probability, comparing FwBw, Fw, and NN2 at separation = 3]

[Figure: “Bayes Error vs. Gaussian Separation”: comparing FwBw, Fw, and NN2 at self-transition = 0.7, plus a Bayes error reference curve]

[Figure: “Absolute Reduction in Bayes Error vs. Gaussian Separation”: decrease in Bayes error rate, comparing FwBw, Fw, and NN2 at self-transition = 0.99]

[Figure: “Percent Decrease in Bayes Error vs. Gaussian Separation”: comparing FwBw, Fw, and NN2 at self-transition = 0.99]


Sketch of the PP algorithm in action

[Figure sequence: message-passing steps 1 through 4 on an example graph]
