Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin
We want to model conditional correlations

Reading people's minds (application from Palatucci et al., 2009):
X: fMRI voxels
Y: semantic features (Metal? Manmade? Found in house? ...)
Task: predict Y from X.

Why not predict each feature independently, Y_i ~ X for all i? Because the Y_i are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?

[Image from http://en.wikipedia.org/wiki/File:FMRI.jpg]
Conditional Random Fields (CRFs) (Lafferty et al., 2001)

Q(Y \mid X) = \frac{1}{Z(X)} \prod_j \Phi_j(Y_{C_j}, X_{C_j})

Pro: Avoid modeling P(X). (In fMRI, X is roughly 500 to 10,000 voxels.)
The factors \Phi_j, e.g. \Phi_{1,2}(Y_1, Y_2, X_{1,2}), encode conditional independence structure.
[Figure: a graph over Y_1, ..., Y_4 with edge factors \Phi_{1,2}, \Phi_{2,4}, \Phi_{2,3}.]
The normalization depends on X = x:
Con: Z(x) must be computed for each inference query. Exact inference is intractable in general, and approximate inference is expensive.

Solution: use tree CRFs!
Tree CRFs

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X)

Pro: Avoid modeling P(X)
Con: Compute Z(x) for each inference
Pro: Fast, exact inference
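To make the "fast, exact inference" claim concrete, here is a minimal sum-product pass that computes log Z(x) exactly for a binary chain CRF in linear time. This is an illustrative sketch, not the authors' code; the array layout and function name are my assumptions.

```python
import numpy as np

def chain_log_partition(node_pot, edge_pot):
    """Exact log Z(x) for a binary chain CRF via one forward sum-product pass.

    node_pot: (n, 2) array; node_pot[i, yi] = Phi_i(yi, x), already evaluated at x.
    edge_pot: (n-1, 2, 2) array; edge_pot[i, yi, yj] = Phi_{i,i+1}(yi, yj, x).
    Runs in O(n) time, which is what makes tree-structured inference cheap.
    """
    msg = node_pot[0].copy()                 # unnormalized belief over Y_1
    for i in range(len(node_pot) - 1):
        # sum out Y_i, then absorb the next node potential
        msg = (msg[:, None] * edge_pot[i]).sum(axis=0) * node_pot[i + 1]
    return float(np.log(msg.sum()))

# Toy example: a 5-node chain with random positive potentials.
rng = np.random.default_rng(0)
node_pot = rng.uniform(0.5, 2.0, size=(5, 2))
edge_pot = rng.uniform(0.5, 2.0, size=(4, 2, 2))
print("log Z(x) =", chain_log_partition(node_pot, edge_pot))
```

The same forward pass generalizes to any tree by passing messages from the leaves to a root.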
CRF Structure Learning

Tree CRFs give fast, exact inference and avoid modeling P(X), but learning them raises two problems: structure learning (which tree T?) and feature selection (which inputs enter each factor?).

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X)
For scalability, use local inputs \Phi_{ij}(Y_i, Y_j, X_{ij}) (scalable) instead of global inputs \Phi_{ij}(Y_i, Y_j, X) (not scalable).
This work

Goals: structured conditional models P(Y|X); scalable methods.
Approach: tree structures, local inputs X_{ij}, max spanning trees.

Outline:
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights
- Experiments: synthetic & fMRI
Related work

Method | Feature selection? | Tractable models?
Torralba et al. (2004): boosted random fields | Yes | No
Schmidt et al. (2008): block-L1 regularized pseudolikelihood | No | No
Shahaf et al. (2009): edge weight + low-treewidth model | No | Yes

Vs. our work: the choice of edge weights, and local inputs.
Chow-Liu

For generative models, Chow-Liu weights each edge by mutual information:

Q_T(Y) = \frac{1}{Z} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j), \qquad
Q_{disc}(Y) = \frac{1}{Z} \prod_i \Phi_i(Y_i)

E[\log Q_T(Y) - \log Q_{disc}(Y)] = \sum_{(i,j) \in T} I(Y_i; Y_j)
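For intuition, classic Chow-Liu is easy to sketch: estimate pairwise mutual information with a plug-in estimator and take a maximum spanning tree. A minimal sketch assuming discrete data and networkx; the helper names are mine, not from the paper.

```python
import numpy as np
import networkx as nx

def entropy(counts):
    """Entropy of a distribution given as (possibly unnormalized) counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_info(yi, yj):
    """Plug-in estimate of I(Yi; Yj) from two discrete sample columns."""
    joint = np.zeros((yi.max() + 1, yj.max() + 1))
    for a, b in zip(yi, yj):
        joint[a, b] += 1
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

def chow_liu(Y):
    """Y: (num_samples, num_vars) integer array. Returns the tree's edge list."""
    d = Y.shape[1]
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=mutual_info(Y[:, i], Y[:, j]))
    return sorted(nx.maximum_spanning_tree(G).edges)
```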
Chow-Liu for CRFs?

For CRFs with global inputs:

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X), \qquad
Q_{disc}(Y \mid X) = \frac{1}{Z(X)} \prod_i \Phi_i(Y_i, X)

E[\log Q_T(Y \mid X) - \log Q_{disc}(Y \mid X)] = \sum_{(i,j) \in T} I(Y_i; Y_j \mid X)

This is Global CMI (conditional mutual information).
Pro: the "gold standard."
Con: I(Y_i; Y_j \mid X) is intractable for big X.
Where now?

Global CMI is the gold standard, but I(Y_i; Y_j \mid X) is intractable for big X.

Algorithmic framework:
Given: data {(y^(i), x^(i))} and an input mapping Y_i -> X_i.
1. Weight each potential edge (Y_i, Y_j) with Score(i, j), computed from local inputs.
2. Choose a max spanning tree.
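The framework is agnostic to the particular edge score. A minimal skeleton (the function names and use of networkx are my assumptions):

```python
import networkx as nx

def learn_tree_structure(num_labels, score):
    """Skeleton of the framework: weight every candidate edge (Yi, Yj)
    with Score(i, j), then return a maximum spanning tree over the labels.
    `score` can be any of the edge scores discussed below (local CMI, DCI, ...)."""
    G = nx.Graph()
    for i in range(num_labels):
        for j in range(i + 1, num_labels):
            G.add_edge(i, j, weight=score(i, j))
    return sorted(nx.maximum_spanning_tree(G).edges)
```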
Generalized edge scores

Key step: weight edge (Y_i, Y_j) with Score(i, j).

Local Linear Entropy Scores: Score(i, j) is a linear combination of entropies over subsets of {Y_i, Y_j, X_i, X_j}. E.g., local conditional mutual information:

I(Y_i; Y_j \mid X_{ij}) = H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) - H(Y_i, Y_j \mid X_{ij})

Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).
Outline (recap): gold standard; max spanning trees; generalized edge weights; heuristic weights (next); experiments: synthetic & fMRI.
Heuristics: piecewise likelihood (PWL), local CMI, DCI.
Piecewise likelihood (PWL)

Sutton and McCallum (2005, 2007) used PWL for parameter learning. The main idea is to bound Z(X), which yields a bound on the log likelihood:

\log Q_T(Y \mid X) \ge \sum_{(i,j) \in T} \log P(Y_{ij} \mid X_{ij})

For tree CRFs, optimal parameters give

Score(i, j) = E[\log P(Y_{ij} \mid X_{ij})],

an edge score with local inputs X_{ij}.
Pros: bounds the log likelihood; helps explain the other edge scores.
Cons: fails on a simple counterexample; does badly in practice.
Piecewise likelihood (PWL)

Score(i, j) = E[\log P(Y_{ij} \mid X_{ij})] = -H(Y_{ij} \mid X_{ij})

Counterexample: let the true P(Y, X) be a chain Y_1 - Y_2 - ... - Y_n with each X_i attached to Y_i, and give Y_2 a strong potential. The strong potential makes H(Y_{2j} \mid X_{2j}) \approx H(Y_j \mid X_{2j}) small for all j, while H(Y_{jk} \mid X_{jk}) stays larger, so PWL prefers every edge (2, j) over (j, k): it selects a star centered at Y_2 instead of the true chain.
Local Conditional Mutual Information (Local CMI)

Repair PWL by adding back node entropies, conditioned on the same local input:

Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) = I(Y_i; Y_j \mid X_{ij})

A decomposable score with local inputs X_{ij}. It does pretty well in practice but can fail with strong potentials.

Theorem: Local CMI bounds the log likelihood gain:

E[\log Q_T(Y \mid X) - \log Q_{disc}(Y \mid X)] \le \sum_{(i,j) \in T} I(Y_i; Y_j \mid X_{ij})
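Because local CMI involves only low-dimensional entropies, a plug-in estimate from discrete samples fits in a few lines. A sketch under the assumption that Y_i, Y_j, X_i, X_j are discrete; the cond_entropy helper is hypothetical, not from the paper.

```python
import numpy as np
from collections import Counter

def cond_entropy(pairs):
    """Plug-in H(A | B) from a list of (a, b) samples (a and b hashable)."""
    joint, marg, n = Counter(pairs), Counter(b for _, b in pairs), len(pairs)
    return sum(-(c / n) * np.log(c / marg[b]) for (_, b), c in joint.items())

def local_cmi(yi, yj, xi, xj):
    """Plug-in I(Yi; Yj | Xij) with Xij = (Xi, Xj); all inputs are discrete samples."""
    xij = list(zip(xi, xj))
    h_i  = cond_entropy(list(zip(yi, xij)))
    h_j  = cond_entropy(list(zip(yj, xij)))
    h_ij = cond_entropy(list(zip(zip(yi, yj), xij)))   # H(Yi, Yj | Xij)
    return h_i + h_j - h_ij
```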
Local CMI counterexample

Score(i, j) = I(Y_i; Y_j \mid X_{ij})

With the same chain-structured true P(Y, X), a strong potential pins Y_2, so I(Y_2; Y_j \mid X_{2j}) \approx 0 for all j. Local CMI then avoids all edges touching Y_2 and links its neighbors directly, e.g. via \Phi_{1,3}(Y_1, Y_3, X_{1,3}).

Fix: condition each node entropy on that node's own input rather than the pair's:

Local CMI: Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij})
DCI:       Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)
Decomposable Conditional Influence (DCI)

Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)

(The pairwise term comes from PWL; the node terms come from Q_{disc}(Y \mid X).)

DCI is an exact measure of gain for some edges, uses only local inputs X_{ij}, succeeds on the counterexample, and does best in practice.
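DCI changes only what the node entropies condition on, so a plug-in sketch is a one-line variation; this reuses the hypothetical cond_entropy helper from the local-CMI snippet above.

```python
def dci(yi, yj, xi, xj):
    """Plug-in DCI(i, j) = -H(Yij | Xij) + H(Yi | Xi) + H(Yj | Xj).
    Each node entropy conditions on that node's *own* input, so a strong
    potential that zeroes local CMI no longer zeroes this score.
    Reuses cond_entropy from the local-CMI sketch above."""
    xij = list(zip(xi, xj))
    h_ij = cond_entropy(list(zip(zip(yi, yj), xij)))
    return -h_ij + cond_entropy(list(zip(yi, xi))) + cond_entropy(list(zip(yj, xj)))
```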
Experiments

Algorithmic details:
Given: data {(y^(i), x^(i))}; input mapping Y_i -> X_i.
Compute edge scores

DCI(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)

by regressing P(Y_{ij} \mid X_{ij}) (10-fold CV to choose the regularization), then choose a max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood (10-fold CV to choose the regularization).
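The slides estimate the pairwise term by regression. One plausible reading with scikit-learn, where the 4-valued pair encoding, the CV setup, and scoring on the training split are all my assumptions rather than the paper's exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def pairwise_log_score(y_pair, x_pair):
    """Estimate E[log P(Yij | Xij)] (i.e., -H(Yij | Xij)) by regression.

    y_pair: (n, 2) binary labels for (Yi, Yj); x_pair: (n, d) local inputs Xij.
    Fits a multinomial logistic model over the 4 pair configurations, choosing
    the L2 strength by 10-fold CV; assumes all 4 configurations occur in the data.
    """
    labels = y_pair[:, 0] * 2 + y_pair[:, 1]          # encode the pair as 0..3
    clf = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000).fit(x_pair, labels)
    probs = clf.predict_proba(x_pair)
    return float(np.mean(np.log(probs[np.arange(len(labels)), labels])))
```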
Synthetic experiments

Binary Y, X; tabular edge factors; the natural input mapping Y_i -> X_i.
[Figure: P(Y|X) is a chain over Y_1, ..., Y_n with inputs X_1, ..., X_n; adding structure over X makes P(Y, X) intractable.]
P(Y|X) and P(X): chains & trees. P(Y, X): tractable & intractable.
[Figure: example tree P(Y|X) over Y_1, ..., Y_5 with inputs X_1, ..., X_5; tractable and intractable structures over X.]
Edge factors \Phi(Y_{ij}, X_{ij}): with & without cross-factors.
[Figure: chain over Y_1, ..., Y_n and X_1, ..., X_n with cross-factors.]
Factor types: associative (all positive & alternating +/-) and random.
Synthetic results: varying the number of training examples
[Plots omitted. Setting: tree structure, intractable P(Y, X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.]
fMRI experiments

X (500 fMRI voxels) -> predict -> Y (218 semantic features: Metal? Manmade? Found in house? ...) -> decode (hand-built map) -> object (60 total: bear, screwdriver, ...).

Data and setup from Palatucci et al. (2009). Zero-shot learning: the system can predict objects not in the training data (given the decoding).

[Image from http://en.wikipedia.org/wiki/File:FMRI.jpg]
fMRI experiments

X (500 fMRI voxels) -> predict -> Y (218 semantic features). Y and X are real-valued, so we use Gaussian factors:

\Phi(y, x) = \exp\left( -\tfrac{1}{2} \left\| A y - (C x + b) \right\|^2 \right)
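Evaluating this factor is a one-liner; a sketch, with A, C, b standing for the learned parameters:

```python
import numpy as np

def log_gaussian_factor(y, x, A, C, b):
    """log Phi(y, x) = -0.5 * ||A y - (C x + b)||^2, up to normalization."""
    r = A @ y - (C @ x + b)
    return -0.5 * float(r @ r)
```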
Input mapping: regress Y_i ~ Y_{-i}, X and choose the top K inputs. Add a fixed node factor \Phi(Y_i, X) from the regressed P(Y_i \mid X). Regularize A and C, b separately; CV for parameter learning is very expensive, so do CV on subject 0 only.
Two methods: CRF1 (K = 10, edge factors \Phi(Y_{ij}, X_{ij})) and CRF2 (K = 20, edge factors \Phi(Y_{ij})).
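The slides do not specify which regression drives the input mapping; a sketch of one plausible choice (Lasso, with the alpha value purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def top_k_inputs(Y, X, i, K=10, alpha=0.1):
    """Regress Y_i on (Y_{-i}, X) and keep the K columns of X with the
    largest absolute coefficients."""
    Z = np.hstack([np.delete(Y, i, axis=1), X])
    coef = Lasso(alpha=alpha).fit(Z, Y[:, i]).coef_
    x_coef = coef[Y.shape[1] - 1:]            # coefficients on the X block
    return np.argsort(-np.abs(x_coef))[:K]    # indices into X's columns
```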
Accuracy (for zero-shot learning): hold out objects i and j; predict \hat{Y}^{(i)}, \hat{Y}^{(j)}. If \| Y^{(i)} - \hat{Y}^{(i)} \|_2 < \| Y^{(j)} - \hat{Y}^{(i)} \|_2, then we got i right.
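This pairwise criterion translates directly into code; a sketch (averaging over all ordered held-out pairs is my assumption):

```python
import numpy as np

def pairwise_accuracy(Y_true, Y_pred):
    """Leave-two-out accuracy: prediction Y_pred[i] counts as correct against
    held-out object j if it is closer (L2) to Y_true[i] than to Y_true[j]."""
    n, correct, total = len(Y_true), 0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += 1
                correct += (np.linalg.norm(Y_true[i] - Y_pred[i])
                            < np.linalg.norm(Y_true[j] - Y_pred[i]))
    return correct / total
```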
fMRI results

Accuracy: CRFs a bit worse.
Log likelihood: CRFs better.
Squared error: CRFs better.
Conclusion

Scalable learning of CRF structure: we analyzed edge scores for spanning-tree methods.
Local Linear Entropy Scores are provably imperfect; the heuristics have pleasing theoretical properties and empirical success. We recommend DCI.

Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability.
Thank you!

References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
Future work: Templated CRFs

Learn a template, e.g. Score(i, j) = DCI(i, j) with parametrization \Phi(Y_{ij}, X_{ij}) \propto P(Y_{ij} \mid X_{ij}).
Example: WebKB (Craven et al., 1998). Given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over the pages and to instantiate parameters, yielding P(Y | X = x) = P(pages' types | pages' content).
Requires local inputs; potentially very fast.