Learning Tree Conditional Random Fields
Joseph K. Bradley, Carlos Guestrin
We want to model conditional correlations

Reading people's minds (application from Palatucci et al., 2009):
X: fMRI voxels
Y: semantic features (Metal? Manmade? Found in house? ...)
Task: predict Y from X.

Why not predict each feature independently, Y_i ~ X for all i? Because the Y_i are correlated! E.g., Person? & Live in water?; Colorful? & Yellow?

[Image from http://en.wikipedia.org/wiki/File:FMRI.jpg]
Conditional Random Fields (CRFs) (Lafferty et al., 2001)

Q(Y \mid X) = \frac{1}{Z(X)} \prod_j \Phi_j(Y_{C_j}, X_{C_j})

Pro: Avoid modeling P(X). (In fMRI, X is roughly 500 to 10,000 voxels.)
The factors \Phi_j, e.g. \Phi_{1,2}(Y_1, Y_2, X_{1,2}), encode conditional independence structure.
[Figure: a graph over Y_1, ..., Y_4 with edge factors \Phi_{1,2}, \Phi_{2,4}, \Phi_{2,3}.]
The normalization depends on X = x:
Con: Z(x) must be computed for each inference query. Exact inference is intractable in general, and approximate inference is expensive.

Solution: use tree CRFs!
Tree CRFs

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X)

Pro: Avoid modeling P(X)
Con: Compute Z(x) for each inference
Pro: Fast, exact inference
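To make the "fast, exact inference" claim concrete, here is a minimal sum-product pass that computes log Z(x) exactly for a binary chain CRF in linear time. This is an illustrative sketch, not the authors' code; the array layout and function name are my assumptions.

```python
import numpy as np

def chain_log_partition(node_pot, edge_pot):
    """Exact log Z(x) for a binary chain CRF via one forward sum-product pass.

    node_pot: (n, 2) array; node_pot[i, yi] = Phi_i(yi, x), already evaluated at x.
    edge_pot: (n-1, 2, 2) array; edge_pot[i, yi, yj] = Phi_{i,i+1}(yi, yj, x).
    Runs in O(n) time, which is what makes tree-structured inference cheap.
    """
    msg = node_pot[0].copy()                 # unnormalized belief over Y_1
    for i in range(len(node_pot) - 1):
        # sum out Y_i, then absorb the next node potential
        msg = (msg[:, None] * edge_pot[i]).sum(axis=0) * node_pot[i + 1]
    return float(np.log(msg.sum()))

# Toy example: a 5-node chain with random positive potentials.
rng = np.random.default_rng(0)
node_pot = rng.uniform(0.5, 2.0, size=(5, 2))
edge_pot = rng.uniform(0.5, 2.0, size=(4, 2, 2))
print("log Z(x) =", chain_log_partition(node_pot, edge_pot))
```

The same forward pass generalizes to any tree by passing messages from the leaves to a root.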
CRF Structure Learning

Tree CRFs give fast, exact inference and avoid modeling P(X), but learning them raises two problems: structure learning (which tree T?) and feature selection (which inputs enter each factor?).

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X)
For scalability, use local inputs \Phi_{ij}(Y_i, Y_j, X_{ij}) (scalable) instead of global inputs \Phi_{ij}(Y_i, Y_j, X) (not scalable).
This work

Goals: structured conditional models P(Y|X); scalable methods.
Approach: tree structures, local inputs X_{ij}, max spanning trees.

Outline:
- Gold standard
- Max spanning trees
- Generalized edge weights
- Heuristic weights
- Experiments: synthetic & fMRI
Related work

Method | Feature selection? | Tractable models?
Torralba et al. (2004): boosted random fields | Yes | No
Schmidt et al. (2008): block-L1 regularized pseudolikelihood | No | No
Shahaf et al. (2009): edge weight + low-treewidth model | No | Yes

Vs. our work: the choice of edge weights, and local inputs.
Chow-Liu

For generative models, Chow-Liu weights each edge by mutual information:

Q_T(Y) = \frac{1}{Z} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j), \qquad
Q_{disc}(Y) = \frac{1}{Z} \prod_i \Phi_i(Y_i)

E[\log Q_T(Y) - \log Q_{disc}(Y)] = \sum_{(i,j) \in T} I(Y_i; Y_j)
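For intuition, classic Chow-Liu is easy to sketch: estimate pairwise mutual information with a plug-in estimator and take a maximum spanning tree. A minimal sketch assuming discrete data and networkx; the helper names are mine, not from the paper.

```python
import numpy as np
import networkx as nx

def entropy(counts):
    """Entropy of a distribution given as (possibly unnormalized) counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_info(yi, yj):
    """Plug-in estimate of I(Yi; Yj) from two discrete sample columns."""
    joint = np.zeros((yi.max() + 1, yj.max() + 1))
    for a, b in zip(yi, yj):
        joint[a, b] += 1
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

def chow_liu(Y):
    """Y: (num_samples, num_vars) integer array. Returns the tree's edge list."""
    d = Y.shape[1]
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=mutual_info(Y[:, i], Y[:, j]))
    return sorted(nx.maximum_spanning_tree(G).edges)
```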
Chow-Liu for CRFs?

For CRFs with global inputs:

Q_T(Y \mid X) = \frac{1}{Z(X)} \prod_{(i,j) \in T} \Phi_{ij}(Y_i, Y_j, X), \qquad
Q_{disc}(Y \mid X) = \frac{1}{Z(X)} \prod_i \Phi_i(Y_i, X)

E[\log Q_T(Y \mid X) - \log Q_{disc}(Y \mid X)] = \sum_{(i,j) \in T} I(Y_i; Y_j \mid X)

This is Global CMI (conditional mutual information).
Pro: the "gold standard."
Con: I(Y_i; Y_j \mid X) is intractable for big X.
Where now?

Global CMI is the gold standard, but I(Y_i; Y_j \mid X) is intractable for big X.

Algorithmic framework:
Given: data {(y^(i), x^(i))} and an input mapping Y_i -> X_i.
1. Weight each potential edge (Y_i, Y_j) with Score(i, j), computed from local inputs.
2. Choose a max spanning tree.
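The framework is agnostic to the particular edge score. A minimal skeleton (the function names and use of networkx are my assumptions):

```python
import networkx as nx

def learn_tree_structure(num_labels, score):
    """Skeleton of the framework: weight every candidate edge (Yi, Yj)
    with Score(i, j), then return a maximum spanning tree over the labels.
    `score` can be any of the edge scores discussed below (local CMI, DCI, ...)."""
    G = nx.Graph()
    for i in range(num_labels):
        for j in range(i + 1, num_labels):
            G.add_edge(i, j, weight=score(i, j))
    return sorted(nx.maximum_spanning_tree(G).edges)
```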
Generalized edge scores

Key step: weight edge (Y_i, Y_j) with Score(i, j).

Local Linear Entropy Scores: Score(i, j) is a linear combination of entropies over subsets of {Y_i, Y_j, X_i, X_j}. E.g., local conditional mutual information:

I(Y_i; Y_j \mid X_{ij}) = H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) - H(Y_i, Y_j \mid X_{ij})

Theorem: Assume the true P(Y|X) is a tree CRF (with non-trivial parameters). No Local Linear Entropy Score can recover all such tree CRFs (even with exact entropies).
Outline (recap): gold standard; max spanning trees; generalized edge weights; heuristic weights (next); experiments: synthetic & fMRI.
Heuristics: piecewise likelihood (PWL), local CMI, DCI.
Piecewise likelihood (PWL)

Sutton and McCallum (2005, 2007) used PWL for parameter learning. The main idea is to bound Z(X), which yields a bound on the log likelihood:

\log Q_T(Y \mid X) \ge \sum_{(i,j) \in T} \log P(Y_{ij} \mid X_{ij})

For tree CRFs, optimal parameters give

Score(i, j) = E[\log P(Y_{ij} \mid X_{ij})],

an edge score with local inputs X_{ij}.
Pros: bounds the log likelihood; helps explain the other edge scores.
Cons: fails on a simple counterexample; does badly in practice.
Piecewise likelihood (PWL)

Score(i, j) = E[\log P(Y_{ij} \mid X_{ij})] = -H(Y_{ij} \mid X_{ij})

Counterexample: let the true P(Y, X) be a chain Y_1 - Y_2 - ... - Y_n with each X_i attached to Y_i, and give Y_2 a strong potential. The strong potential makes H(Y_{2j} \mid X_{2j}) \approx H(Y_j \mid X_{2j}) small for all j, while H(Y_{jk} \mid X_{jk}) stays larger, so PWL prefers every edge (2, j) over (j, k): it selects a star centered at Y_2 instead of the true chain.
Local Conditional Mutual Information (Local CMI)

Repair PWL by adding back node entropies, conditioned on the same local input:

Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij}) = I(Y_i; Y_j \mid X_{ij})

A decomposable score with local inputs X_{ij}. It does pretty well in practice but can fail with strong potentials.

Theorem: Local CMI bounds the log likelihood gain:

E[\log Q_T(Y \mid X) - \log Q_{disc}(Y \mid X)] \le \sum_{(i,j) \in T} I(Y_i; Y_j \mid X_{ij})
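Because local CMI involves only low-dimensional entropies, a plug-in estimate from discrete samples fits in a few lines. A sketch under the assumption that Y_i, Y_j, X_i, X_j are discrete; the cond_entropy helper is hypothetical, not from the paper.

```python
import numpy as np
from collections import Counter

def cond_entropy(pairs):
    """Plug-in H(A | B) from a list of (a, b) samples (a and b hashable)."""
    joint, marg, n = Counter(pairs), Counter(b for _, b in pairs), len(pairs)
    return sum(-(c / n) * np.log(c / marg[b]) for (_, b), c in joint.items())

def local_cmi(yi, yj, xi, xj):
    """Plug-in I(Yi; Yj | Xij) with Xij = (Xi, Xj); all inputs are discrete samples."""
    xij = list(zip(xi, xj))
    h_i  = cond_entropy(list(zip(yi, xij)))
    h_j  = cond_entropy(list(zip(yj, xij)))
    h_ij = cond_entropy(list(zip(zip(yi, yj), xij)))   # H(Yi, Yj | Xij)
    return h_i + h_j - h_ij
```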
Local CMI counterexample

Score(i, j) = I(Y_i; Y_j \mid X_{ij})

With the same chain-structured true P(Y, X), a strong potential pins Y_2, so I(Y_2; Y_j \mid X_{2j}) \approx 0 for all j. Local CMI then avoids all edges touching Y_2 and links its neighbors directly, e.g. via \Phi_{1,3}(Y_1, Y_3, X_{1,3}).

Fix: condition each node entropy on that node's own input rather than the pair's:

Local CMI: Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_{ij}) + H(Y_j \mid X_{ij})
DCI:       Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)
Decomposable Conditional Influence (DCI)

Score(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)

(The pairwise term comes from PWL; the node terms come from Q_{disc}(Y \mid X).)

DCI is an exact measure of gain for some edges, uses only local inputs X_{ij}, succeeds on the counterexample, and does best in practice.
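DCI changes only what the node entropies condition on, so a plug-in sketch is a one-line variation; this reuses the hypothetical cond_entropy helper from the local-CMI snippet above.

```python
def dci(yi, yj, xi, xj):
    """Plug-in DCI(i, j) = -H(Yij | Xij) + H(Yi | Xi) + H(Yj | Xj).
    Each node entropy conditions on that node's *own* input, so a strong
    potential that zeroes local CMI no longer zeroes this score.
    Reuses cond_entropy from the local-CMI sketch above."""
    xij = list(zip(xi, xj))
    h_ij = cond_entropy(list(zip(zip(yi, yj), xij)))
    return -h_ij + cond_entropy(list(zip(yi, xi))) + cond_entropy(list(zip(yj, xj)))
```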
Experiments

Algorithmic details:
Given: data {(y^(i), x^(i))}; input mapping Y_i -> X_i.
Compute edge scores

DCI(i, j) = -H(Y_{ij} \mid X_{ij}) + H(Y_i \mid X_i) + H(Y_j \mid X_j)

by regressing P(Y_{ij} \mid X_{ij}) (10-fold CV to choose the regularization), then choose a max spanning tree.
Parameter learning: conjugate gradient on the L2-regularized log likelihood (10-fold CV to choose the regularization).
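The slides estimate the pairwise term by regression. One plausible reading with scikit-learn, where the 4-valued pair encoding, the CV setup, and scoring on the training split are all my assumptions rather than the paper's exact procedure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def pairwise_log_score(y_pair, x_pair):
    """Estimate E[log P(Yij | Xij)] (i.e., -H(Yij | Xij)) by regression.

    y_pair: (n, 2) binary labels for (Yi, Yj); x_pair: (n, d) local inputs Xij.
    Fits a multinomial logistic model over the 4 pair configurations, choosing
    the L2 strength by 10-fold CV; assumes all 4 configurations occur in the data.
    """
    labels = y_pair[:, 0] * 2 + y_pair[:, 1]          # encode the pair as 0..3
    clf = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000).fit(x_pair, labels)
    probs = clf.predict_proba(x_pair)
    return float(np.mean(np.log(probs[np.arange(len(labels)), labels])))
```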
Synthetic experiments

Binary Y, X; tabular edge factors; the natural input mapping Y_i -> X_i.
[Figure: P(Y|X) is a chain over Y_1, ..., Y_n with inputs X_1, ..., X_n; adding structure over X makes P(Y, X) intractable.]
P(Y|X) and P(X): chains & trees. P(Y, X): tractable & intractable.
[Figure: example tree P(Y|X) over Y_1, ..., Y_5 with inputs X_1, ..., X_5; tractable and intractable structures over X.]
Edge factors \Phi(Y_{ij}, X_{ij}): with & without cross-factors.
[Figure: chain over Y_1, ..., Y_n and X_1, ..., X_n with cross-factors.]
Factor types: associative (all positive & alternating +/-) and random.
Synthetic results: varying the number of training examples
[Plots omitted. Setting: tree structure, intractable P(Y, X), associative Φ (alternating +/-), |Y| = 40, 1000 test examples.]
fMRI experiments

X (500 fMRI voxels) -> predict -> Y (218 semantic features: Metal? Manmade? Found in house? ...) -> decode (hand-built map) -> object (60 total: bear, screwdriver, ...).

Data and setup from Palatucci et al. (2009). Zero-shot learning: the system can predict objects not in the training data (given the decoding).

[Image from http://en.wikipedia.org/wiki/File:FMRI.jpg]
fMRI experiments

X (500 fMRI voxels) -> predict -> Y (218 semantic features). Y and X are real-valued, so we use Gaussian factors:

\Phi(y, x) = \exp\left( -\tfrac{1}{2} \left\| A y - (C x + b) \right\|^2 \right)
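Evaluating this factor is a one-liner; a sketch, with A, C, b standing for the learned parameters:

```python
import numpy as np

def log_gaussian_factor(y, x, A, C, b):
    """log Phi(y, x) = -0.5 * ||A y - (C x + b)||^2, up to normalization."""
    r = A @ y - (C @ x + b)
    return -0.5 * float(r @ r)
```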
Input mapping: regress Y_i ~ Y_{-i}, X and choose the top K inputs. Add a fixed node factor \Phi(Y_i, X) from the regressed P(Y_i \mid X). Regularize A and C, b separately; CV for parameter learning is very expensive, so do CV on subject 0 only.
Two methods: CRF1 (K = 10, edge factors \Phi(Y_{ij}, X_{ij})) and CRF2 (K = 20, edge factors \Phi(Y_{ij})).
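The slides do not specify which regression drives the input mapping; a sketch of one plausible choice (Lasso, with the alpha value purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def top_k_inputs(Y, X, i, K=10, alpha=0.1):
    """Regress Y_i on (Y_{-i}, X) and keep the K columns of X with the
    largest absolute coefficients."""
    Z = np.hstack([np.delete(Y, i, axis=1), X])
    coef = Lasso(alpha=alpha).fit(Z, Y[:, i]).coef_
    x_coef = coef[Y.shape[1] - 1:]            # coefficients on the X block
    return np.argsort(-np.abs(x_coef))[:K]    # indices into X's columns
```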
Accuracy (for zero-shot learning): hold out objects i and j; predict \hat{Y}^{(i)}, \hat{Y}^{(j)}. If \| Y^{(i)} - \hat{Y}^{(i)} \|_2 < \| Y^{(j)} - \hat{Y}^{(i)} \|_2, then we got i right.
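This pairwise criterion translates directly into code; a sketch (averaging over all ordered held-out pairs is my assumption):

```python
import numpy as np

def pairwise_accuracy(Y_true, Y_pred):
    """Leave-two-out accuracy: prediction Y_pred[i] counts as correct against
    held-out object j if it is closer (L2) to Y_true[i] than to Y_true[j]."""
    n, correct, total = len(Y_true), 0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += 1
                correct += (np.linalg.norm(Y_true[i] - Y_pred[i])
                            < np.linalg.norm(Y_true[j] - Y_pred[i]))
    return correct / total
```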
fMRI results

Accuracy: CRFs a bit worse.
Log likelihood: CRFs better.
Squared error: CRFs better.
Conclusion

Scalable learning of CRF structure: we analyzed edge scores for spanning-tree methods.
Local Linear Entropy Scores are provably imperfect; the heuristics have pleasing theoretical properties and empirical success. We recommend DCI.

Future work: templated CRFs; learning the edge score; assumptions on the model/factors which give learnability.
Thank you!

References
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. Learning to Extract Symbolic Knowledge from the World Wide Web. AAAI 1998.
J. D. Lafferty, A. McCallum, F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML 2001.
M. Palatucci, D. Pomerleau, G. Hinton, T. Mitchell. Zero-Shot Learning with Semantic Output Codes. NIPS 2009.
M. Schmidt, K. Murphy, G. Fung, R. Rosales. Structure Learning in Random Fields for Heart Motion Abnormality Detection. CVPR 2008.
D. Shahaf, A. Chechetka, C. Guestrin. Learning Thin Junction Trees via Graph Cuts. AISTATS 2009.
C. Sutton, A. McCallum. Piecewise Training of Undirected Models. UAI 2005.
C. Sutton, A. McCallum. Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. ICML 2007.
A. Torralba, K. Murphy, W. Freeman. Contextual Models for Object Detection Using Boosted Random Fields. NIPS 2004.
Future work: Templated CRFs

Learn a template, e.g. Score(i, j) = DCI(i, j) with parametrization \Phi(Y_{ij}, X_{ij}) \propto P(Y_{ij} \mid X_{ij}).
Example: WebKB (Craven et al., 1998). Given webpages {(Y_i = page type, X_i = content)}, use the template to choose a tree over the pages and to instantiate parameters, yielding P(Y | X = x) = P(pages' types | pages' content).
Requires local inputs; potentially very fast.