Page 1

Graphical model software for machine learning

Kevin Murphy

University of British Columbia

December, 2005

Page 2

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 3

Supervised learning as Bayesian inference

[Figure: graphical model with training pairs (X_1, Y_1) … (X_N, Y_N) and a test pair (X_*, Y_*), shown for the training and testing phases.]

Page 4

Supervised learning as optimization

[Figure: the same graphical model as the previous slide, with training pairs (X_1, Y_1) … (X_N, Y_N) and a test pair (X_*, Y_*), for the training and testing phases.]

Page 5

Example: logistic regression

• Let y_n ∈ {1,…,C} be given by a softmax (see the form below)

• Maximize conditional log likelihood

• “Max margin” solution
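The equations on this slide are not in the transcript; a standard statement of the softmax model and the conditional log likelihood being maximized is:

```latex
% Softmax (multinomial logistic regression) with weight vectors w_1, ..., w_C
p(y_n = c \mid x_n, w) = \frac{\exp(w_c^\top x_n)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top x_n)}
% Conditional log likelihood maximized over the training set
\ell(w) = \sum_{n=1}^{N} \log p(y_n \mid x_n, w)
```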

Page 6

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 7

1D chain CRFs for sequence labeling

[Figure: chain-structured CRF with label nodes Y_n1, Y_n2, …, Y_nm and observed input X_n.]

A 1D conditional random field (CRF) is an extension of logistic regression to the case where the output labels are sequences, y_n ∈ {1,…,C}^m.

Local evidence φ_i; edge potential ψ_ij.
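The slide's formula is missing from the transcript; the standard form of a chain CRF, with local evidence φ_i and edge potentials ψ between neighbouring labels, is:

```latex
p(y_n \mid x_n) = \frac{1}{Z(x_n)}
  \prod_{i=1}^{m} \phi_i(y_{ni}, x_n)
  \prod_{i=1}^{m-1} \psi_{i,i+1}(y_{ni}, y_{n,i+1}, x_n)
```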

Page 8

2D Lattice CRFs for pixel labeling

A conditional random field (CRF) is a discriminative model of P(y|x). The edge potentials ψ_ij are image dependent.

Page 9

2D Lattice MRFs for pixel labeling

A Markov Random Field (MRF) is an undirected graphical model. Here we model correlation between pixel labels using ψ_ij(y_i, y_j). We also have a per-pixel generative model of observations P(x_i|y_i).

Local evidence P(x_i|y_i); potential function ψ_ij; partition function Z.
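The slide's equation is not in the transcript; the corresponding standard pairwise-MRF factorization, with local evidence P(x_i|y_i), pairwise potentials ψ_ij, and partition function Z, is:

```latex
p(y, x) = \frac{1}{Z} \prod_i P(x_i \mid y_i) \prod_{(i,j)} \psi_{ij}(y_i, y_j),
\qquad
Z = \sum_{y'} \prod_{(i,j)} \psi_{ij}(y'_i, y'_j)
```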

Page 10

Tree-structured CRFs

• Used in parts-based object detection

• Y_i is the location of part i in the image

[Figure: tree-structured pictorial-structure model with parts eyeL, eyeR, nose, mouth.]

Fischler & Elschlager, “The representation and matching of pictorial structures”, PAMI ’73
Felzenszwalb & Huttenlocher, “Pictorial Structures for Object Recognition”, IJCV ’05

Page 11

General CRFs

• In general, the graph may have arbitrary structure

• eg for collective web page classification: nodes = URLs, edges = hyperlinks

• The potentials are in general defined on cliques, not just edges

Page 12

Factor graphs
• Square nodes = factors (potentials)
• Round nodes = random variables
• Graph structure = bipartite

Page 13

Potential functions

• For the local evidence, we can use a discriminative classifier (trained iid)

• For the edge compatibilities, we can use a maxent/log-linear form, using pre-defined features
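The slide's formula is not in the transcript; a maxent/log-linear edge potential over pre-defined features f_k with weights w_k typically has the form:

```latex
\psi_{ij}(y_i, y_j, x) = \exp\Big( \sum_k w_k\, f_k(y_i, y_j, x) \Big)
```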

Page 14

Restricted potential functions

• For some applications (esp in vision), we often use a Potts model (a common form is given below)

• We can generalize this for ordered labels (eg a discretization of continuous states)
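The slide's formulas are missing; a standard Potts potential, and one common generalization to ordered labels (a truncated linear penalty), are:

```latex
% Potts model: penalize disagreement between neighbouring labels
\psi_{ij}(y_i, y_j) = \exp\big(-\beta\,[y_i \neq y_j]\big)
% One generalization for ordered labels (truncated linear penalty)
\psi_{ij}(y_i, y_j) = \exp\big(-\beta\,\min(|y_i - y_j|,\, d_{\max})\big)
```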

Page 15

Page 16

Learning CRFs

• If the log likelihood is a sum over cliques of log-linear clique potentials minus the log partition function, with parameters tied across cliques, then the gradient is:

Gradient = features – expected features
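The equations on this slide are not in the transcript; assuming log-linear clique potentials f_c with weights w tied across cliques, the standard conditional log likelihood and its gradient ("features minus expected features") are:

```latex
\ell(w) = \sum_n \Big[ \sum_c w^\top f_c(y_{n,c}, x_n) \;-\; \log Z(x_n, w) \Big]
\nabla \ell(w) = \sum_n \sum_c \Big[ f_c(y_{n,c}, x_n) \;-\;
  \mathbb{E}_{p(y_c \mid x_n, w)}\, f_c(y_c, x_n) \Big]
```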

Page 17

Learning CRFs

• Given the gradient ∇ℓ, one can find the global optimum using first- or second-order optimization methods, such as
  – Conjugate gradient
  – Limited-memory BFGS
  – Stochastic meta-descent (SMD)?

• The bottleneck is computing the expected features needed for the gradient
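A minimal sketch (not from the talk) of this training loop in Python, using scipy's L-BFGS. `empirical_feats(x, y)`, `expected_feats(x, w)` and `log_partition(x, w)` are hypothetical user-supplied callbacks; the last two require inference, which is the bottleneck mentioned above.

```python
# Minimal sketch of gradient-based CRF training with L-BFGS (scipy).
import numpy as np
from scipy.optimize import minimize

def make_crf_objective(data, empirical_feats, expected_feats, log_partition, l2=1.0):
    """Return w -> (negative log likelihood, gradient), as scipy expects."""
    def objective(w):
        nll, grad = 0.0, np.zeros_like(w)
        for x, y in data:
            # -log p(y|x,w) = log Z(x,w) - w . f(x,y); gradient is
            # expected features minus empirical features.
            nll  += log_partition(x, w) - w @ empirical_feats(x, y)
            grad += expected_feats(x, w) - empirical_feats(x, y)
        nll  += 0.5 * l2 * (w @ w)   # optional L2 regularizer
        grad += l2 * w
        return nll, grad
    return objective

# Usage (with user-supplied callbacks and an initial weight vector w0):
# w_hat = minimize(make_crf_objective(train_data, emp, exp, logZ), w0,
#                  jac=True, method="L-BFGS-B").x
```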

Page 18

Exact inference

• For 1D chains, one can compute P(y_i, y_{i+1} | x) exactly in O(N K^2) time using belief propagation (BP = forwards-backwards algorithm; sketched below)

• For restricted potentials (eg potentials that depend only on the label difference, ψ_ij = ψ(y_i − y_j)), one can do this in O(N K) time using FFT-like tricks

• This can be generalized to trees.
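A minimal sketch of the O(N K^2) forwards-backwards (sum-product) recursion for a chain. The array names `node_pot` and `edge_pot` are mine, and a single shared edge potential is assumed for brevity; in practice one works in log space (or normalizes each message) to avoid underflow on long chains.

```python
import numpy as np

def chain_sum_product(node_pot, edge_pot):
    """Forwards-backwards on a chain.
    node_pot: (T, K) local evidence phi_t(y_t); edge_pot: (K, K) psi(y_t, y_{t+1}).
    Returns (T, K) node marginals and log Z.  Runs in O(T K^2) time."""
    T, K = node_pot.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = node_pot[0]
    for t in range(1, T):                    # forwards pass
        alpha[t] = node_pot[t] * (alpha[t - 1] @ edge_pot)
    for t in range(T - 2, -1, -1):           # backwards pass
        beta[t] = edge_pot @ (node_pot[t + 1] * beta[t + 1])
    Z = alpha[-1].sum()
    return alpha * beta / Z, np.log(Z)
```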

Page 19

Sum-product vs max-product

• We use sum-product to compute marginal probabilities needed for learning

• We use max-product to find the most probable assignment (Viterbi decoding)

• We can also compute max-marginals
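A corresponding max-product (Viterbi) sketch for the same chain representation as above, done in log space:

```python
import numpy as np

def chain_viterbi(node_pot, edge_pot):
    """Most probable label sequence for a chain (max-product / Viterbi decoding)."""
    T, K = node_pot.shape
    log_node, log_edge = np.log(node_pot), np.log(edge_pot)
    delta = np.zeros((T, K))             # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)   # backpointers
    delta[0] = log_node[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_edge    # (prev state) x (next state)
        back[t] = scores.argmax(axis=0)
        delta[t] = log_node[t] + scores.max(axis=0)
    y = np.zeros(T, dtype=int)
    y[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):       # follow backpointers
        y[t] = back[t + 1, y[t + 1]]
    return y
```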

Page 20

Complexity of exact inference

In general, the running time is O(N K^w), where w is the treewidth of the graph; this is the size of the maximal clique of the triangulated graph (assuming an optimal elimination ordering). For chains and trees, w = 2. For n × n lattices, w = O(n).

Page 21

Approximate sum-product
(N = num nodes, K = num states, I = num iterations)

Algorithm               | Potential (pairwise) | Time
BP (exact iff tree)     | General              | O(N K^2 I)
BP+FFT (exact iff tree) | Restricted           | O(N K I)
Generalized BP          | General              | O(N K^{2c} I), c = cluster size
Gibbs                   | General              | O(N K I)
Swendsen-Wang           | General              | O(N K I)
Mean field              | General              | O(N K I)
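As one concrete example from the table, a single Gibbs-sampling sweep over a pairwise MRF. This is a sketch under the assumptions of a single shared edge potential and an explicit neighbour list per node; the names are mine, not from the talk.

```python
import numpy as np

def gibbs_sweep(labels, node_pot, edge_pot, neighbours, rng):
    """One Gibbs sweep: resample each node given its current neighbours.
    labels: (N,) int labels; node_pot: (N, K); edge_pot: (K, K);
    neighbours: list of neighbour-index lists; rng: np.random.Generator."""
    N, K = node_pot.shape
    for i in range(N):
        p = node_pot[i].copy()
        for j in neighbours[i]:
            p = p * edge_pot[:, labels[j]]   # condition on neighbour's label
        p /= p.sum()
        labels[i] = rng.choice(K, p=p)
    return labels
```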

Page 22

Approximate max-product
(N = num nodes, K = num states, I = num iterations)

Algorithm                        | Potential (pairwise) | Time
BP (exact iff tree)              | General              | O(N K^2 I)
BP+DT (exact iff tree)           | Restricted           | O(N K I)
Generalized BP                   | General              | O(N K^{2c} I), c = cluster size
Graph-cuts (exact iff K=2)       | Restricted           | O(N^2 K I) [?]
ICM (iterated conditional modes) | General              | O(N K I)
SLS (stochastic local search)    | General              | O(N K I)

Page 23

Learning intractable CRFs

• We can use approximate inference and hope the gradient is “good enough”.
  – If we use max-product, we are doing “Viterbi training” (cf perceptron rule)

• Or we can use other techniques, such as pseudo-likelihood, which do not need inference.

Page 24

Pseudo-likelihood
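The formula on this slide is not in the transcript; the standard pseudo-likelihood objective replaces the joint likelihood with a product of local conditionals, each of which is only locally normalized, so no global partition function is needed:

```latex
\mathrm{PL}(w) = \sum_n \sum_i \log p\big(y_{ni} \mid y_{n,\mathrm{nbr}(i)},\, x_n,\, w\big)
```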

Page 25

Software for inference and learning in 1D CRFs

• Various packages
  – Mallet (McCallum et al) – Java
  – Crf.sourceforge.net (Sarawagi, Cohen) – Java
  – My code – Matlab (just a toy, not integrated with BNT)
  – Ben Taskar says he will soon release his Max-Margin Markov net code (which uses LP for inference and QP for learning)

• Nothing standard, emphasis on NLP apps

Page 26

Software for inference in general CRFs/ MRFs

• Max-product: C++ code for GC, BP, TRP and ICM (for Lattice2) by Rick Szeliski et al
  – “A comparative study of energy minimization methods for MRFs”, Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother

• Sum-product for Gaussian MRFs: GMRFlib, C code by Havard Rue (exact inference)

• Sum-product: various other ad hoc pieces
  – My matlab BP code (MRF2)
  – Rivasseau’s C++ code for BP, Gibbs, tree-sampling (factor graphs)
  – Meltzer’s C++ code for BP, GBP, Gibbs, MF (Lattice2)

Page 27

Software for learning general MRFs/CRFs

• Hardly any!
  – Parise’s matlab code (approx gradient, pseudo likelihood, CD, etc)
  – My matlab code (IPF, approx gradient – just a toy – not integrated with BNT)

Page 28

Structure of ideal toolbox

[Block diagram: a data generator (GUI/file) produces trainData and testData; a learnEngine trains a model from trainData; an infEngine answers queries against the model, producing a probDist and an N-best list; a decisionEngine combines inference with utilities to produce decisions; results are visualized/summarized to assess performance.]

Page 29

Structure of BNT

[Block diagram: the same architecture annotated with BNT's concrete pieces — inference engines: Jtree, VarElim, BP, MCMC; learning engines: EM, Structural EM; model: graphs + CPDs; probDist: array, Gaussian, or samples (N-best list with N = 1, i.e. MAP); decision engine: LIMID (Jtree/VarElim) producing a policy; data: cell arrays of node IDs.]

Page 30

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 31

Unsupervised learning: why?

• Labeling data is time-consuming.

• Often not clear what label to use.

• Complex objects often not describable with a single discrete label.

• Humans learn without labels.

• Want to discover novel patterns/ structure.

Page 32

Unsupervised learning: what?

• Clusters (eg GMM)

• Low dim manifolds (eg PCA)

• Graph structure (eg biology, social networks)

• “Features” (eg maxent models of language and texture)

• “Objects” (eg sprite models in vision)

Page 33

Unsupervised learning of objects from video

Frey and Jojic; Williams and Titsias; et al.

Page 34

Unsupervised learning: issues

• Objective function not as obvious as in supervised learning. Usually try to maximize likelihood (measure of data compression).

• Local minima (non convex objective).

• Uses inference as a subroutine (can be slow, though no worse than discriminative learning)

Page 35

Unsupervised learning: how?

• Construct a generative model (eg a Bayes net).

• Perform inference.

• May have to use approximations such as maximum likelihood and BP.

• Cannot use max likelihood for model selection…

Page 36

A comparison of BN software

www.ai.mit.edu/~murphyk/Software/Bayes/bnsoft.html

Page 37

Popular BN software

• BNT (matlab)

• Intel’s PNL (C++)

• Hugin (commercial)

• Netica (commercial)

• GMTk (free .exe from Jeff Bilmes)

Page 38

Outline

• Discriminative models for iid data

• Beyond iid data: conditional random fields

• Beyond supervised learning: generative models

• Beyond optimization: Bayesian models

Page 39

Bayesian inference: why?

• It is optimal.

• It can easily incorporate prior knowledge (esp. useful for small n, large p problems).

• It properly reports confidence in output (useful for combining estimates, and for risk-averse applications).

• It separates models from algorithms.

Page 40

Bayesian inference: how?

• Since we want to integrate, we cannot use max-product.

• Since the unknown parameters are continuous, we cannot use sum-product.

• But we can use EP (expectation propagation), which is similar to BP.

• We can also use variational inference.

• Or MCMC (eg Gibbs sampling).

Page 41

General purpose Bayesian software

• BUGS (Gibbs sampling)

• VIBES (variational message passing)

• Minka and Winn’s toolbox (infer.net)

Page 42

Structure of ideal Bayesian toolbox

[Block diagram: the same architecture as the ideal toolbox on Page 28 — generator/GUI/file, trainData/testData, learnEngine, model, infEngine, probDist, N-best list, decisionEngine, utilities, decision, performance, visualize/summarize.]