Carnegie Mellon
Thesis Defense
Joseph K. Bradley
Learning Large-Scale Conditional Random
Fields
Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst)
1 / 18 / 2013
Modeling Distributions
2
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
X2: deadline?
X1: losing sleep?
X3: sick?
X4: losing hair?
X5: overeating?
X6: loud roommate?
X7: taking classes?
X8: cold weather? X9: exercising?
X10: gaining weight? X11: single?
Modeling Distributions
3
X2: deadline?
X1: losing sleep?
X5: overeating?
X7: taking classes?
= P( losing sleep, overeating | deadline, taking classes )
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
Markov Random Fields (MRFs)
4
X2: deadline?
X1: losing sleep?
X3: sick?
X4: losing hair?
X5: overeating?
X6: loud roommate?
X7: taking classes?
X8: cold weather? X9: exercising?
X10: gaining weight? X11: single?
Goal: Model distribution P(X) over random variables X. E.g.: model the life of a grad student.
Markov Random Fields (MRFs)
5
X2
X1
X3
X4
X5
X6
X7
X8 X9
X10 X11
graphical
structure
factor (parameters)
Goal: Model distribution P(X) over random variables X
Conditional Random Fields (CRFs)
6
X2
Y1
Y3
Y4
Y5
X1
X3
X4 X5
X6 Y2
MRFs: P(X) CRFs: P(Y|X) (Lafferty et al., 2001)
Do not model P(X). Simpler structure (over Y only).
MRFs & CRFs
7
Benefits
• Principled statistical and computational framework
• Large body of literature
Applications
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...
Challenges
8
Goal: Given data, learn CRF structure and parameters.
X2
Y1
Y3
Y4
Y5
X1
X5
X6Y2
Many learning methods require inference, i.e., answering queries P(A|B)
NP-hard in general (Srebro, 2003)
Big structured optimization problem
NP-hard to approximate (Roth, 1996)
Approximations often lack strong guarantees.
Thesis Statement
CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
9
Outline
10
Scaling core methods:
• Parameter Learning: learning without intractable inference
• Structure Learning: learning tractable structures
Solved via parallel scaling:
• Parallel Regression: multicore sparse regression
Outline
11
Scaling core methods: Parameter Learning (learning without intractable inference)
Log-linear MRFs
12
X2
X1
X3
X4
X5
X6
X7
X8 X9
X10 X11
Goal: Model distribution P(X) over random variables X
Parameters θ, features φ. All results generalize to CRFs.
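To make the log-linear form concrete, here is a minimal brute-force sketch (not from the thesis; the pairwise "agreement" features and weights are invented for illustration) of P(X) ∝ exp(θᵀφ(X)) over three binary variables:

```python
import itertools
import math

def score(theta, features, x):
    """Unnormalized log-linear score: exp(theta . phi(x))."""
    return math.exp(sum(t * f(x) for t, f in zip(theta, features)))

def partition(theta, features, n_vars):
    """Brute-force partition function Z over all binary assignments."""
    return sum(score(theta, features, x)
               for x in itertools.product((0, 1), repeat=n_vars))

# Toy pairwise MRF over 3 binary variables (features/weights invented):
# each feature fires when a pair of neighbors agrees.
features = [lambda x: float(x[0] == x[1]),
            lambda x: float(x[1] == x[2])]
theta = [1.0, 0.5]
Z = partition(theta, features, 3)
p_all_ones = score(theta, features, (1, 1, 1)) / Z  # P(X = (1,1,1))
```

Brute-force enumeration of Z is exponential in |X|, which is exactly why inference, and hence MLE, becomes intractable for large models.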
Parameter Learning: MLE
13
Traditional method: max-likelihood estimation (MLE). Minimize the objective (loss): the negative log-likelihood, -(1/n) Σi log Pθ(x(i)).
Gold standard: MLE is (optimally) statistically efficient.
Parameter learning: given structure Φ and samples from Pθ*(X), learn parameters θ.
Parameter Learning: MLE
14
Parameter Learning: MLE
15
MLE requires inference, which is provably hard for general MRFs (Roth, 1996).
Inference makes learning hard.
Can we learn without intractable inference?
Parameter Learning: MLE
16
Inference makes learning hard.
Can we learn without intractable inference?
Approximate inference & objectives
• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...
• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.
Our Solution
17
Max Likelihood Estimation (MLE): sample complexity Optimal, computational complexity High, parallel optimization Difficult.
Max Pseudolikelihood Estimation (MPLE): sample complexity High, computational complexity Low, parallel optimization Easy.
PAC learnability for many MRFs!
Bradley, Guestrin (2012)
Our Solution
18
Max Likelihood Estimation (MLE): sample complexity Optimal, computational complexity High, parallel optimization Difficult.
Max Pseudolikelihood Estimation (MPLE): sample complexity High, computational complexity Low, parallel optimization Easy.
PAC learnability for many MRFs!
Bradley, Guestrin (2012)
Our Solution
19
Max Likelihood Estimation (MLE): sample complexity Optimal, computational complexity High, parallel optimization Difficult.
Max Pseudolikelihood Estimation (MPLE): sample complexity High, computational complexity Low, parallel optimization Easy.
Max Composite Likelihood Estimation (MCLE): sample complexity Low, computational complexity Low, parallel optimization Easy.
Choose MCLE structure to optimize trade-offs.
Bradley, Guestrin (2012)
Deriving Pseudolikelihood (MPLE)
20
X2
X1
X3
X4
X5
MLE:
Hard to compute, so replace it!
Deriving Pseudolikelihood (MPLE)
21
X1
MLE: minimize the negative log-likelihood.
Estimate each conditional P(Xi | X\i) via regression:
MPLE: minimize -Σi log Pθ(Xi | X\i). (Besag, 1975)
Tractable inference!
Pseudolikelihood (MPLE)
22
Pros: no intractable inference; consistent estimator.
Cons: less statistically efficient than MLE (Liang & Jordan, 2008); no PAC bounds.
PAC = Probably Approximately Correct (Valiant, 1984)
MPLE:
(Besag, 1975)
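A minimal sketch of the pseudolikelihood objective for the same kind of toy binary log-linear model (illustrative only; the features are invented): each conditional P(Xi | X\i) needs a sum over just the two values of Xi, so no global partition function is required.

```python
import math

def log_pseudolikelihood(theta, features, x):
    """Sum_i log P(x_i | x_rest): each conditional needs only a sum over
    the two values of x_i, so no global partition function is computed."""
    total = 0.0
    for i in range(len(x)):
        scores = []
        for v in (0, 1):
            xv = x[:i] + (v,) + x[i + 1:]
            scores.append(math.exp(sum(t * f(xv) for t, f in zip(theta, features))))
        total += math.log(scores[x[i]] / sum(scores))
    return total

# Invented toy model: pairwise agreement features over 3 binary variables.
features = [lambda x: float(x[0] == x[1]),
            lambda x: float(x[1] == x[2])]
lpl = log_pseudolikelihood([1.0, 0.5], features, (1, 1, 1))
```

Maximizing this objective over θ (e.g., by gradient ascent) is MPLE; the cost per evaluation is linear in the number of variables rather than exponential.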
Sample Complexity: MLE
23
Our Theorem: bound on n (# training examples needed), in terms of:
• r: # parameters (length of θ)
• Λmin: min eigenvalue of the Hessian of the loss at θ*
• δ: probability of failure
• ε: parameter error (L1)
Recall: requires intractable inference.
Sample Complexity: MPLE
24
Our Theorem: bound on n (# training examples needed), in terms of:
• r: # parameters (length of θ)
• Λmin: mini [min eigenvalue of the Hessian of component i at θ*]
• δ: probability of failure
• ε: parameter error (L1)
Recall: tractable inference.
PAC learnabilityfor many MRFs!
Sample Complexity: MPLE
25
Our Theorem: Bound on n (# training examples needed)
PAC learnabilityfor many MRFs!
Related Work
Ravikumar et al. (2010): regression Yi ~ X with Ising models; basis of our theory.
Liang & Jordan (2008): asymptotic analysis of MLE, MPLE; our bounds match theirs.
Abbeel et al. (2006): only previous method with PAC bounds for high-treewidth MRFs. We extend their work: extension to CRFs, algorithmic improvements, analysis. Their method is very similar to MPLE.
Trade-offs: MLE & MPLE
26
Our Theorem: Bound on n (# training examples needed)
Trade-off: sample complexity vs. computational complexity.
MLE: larger Λmin => lower sample complexity, higher computational complexity.
MPLE: smaller Λmin => higher sample complexity, lower computational complexity.
Trade-offs: MPLE
27
Joint optimization for MPLE: optimize all conditionals together. Lower sample complexity.
Disjoint optimization for MPLE: optimize each conditional separately; each parameter gets 2 estimates, which are averaged. Data-parallel.
Trade-off: sample complexity vs. parallelism.
Synthetic CRFs
28
Factor types: random, associative. Structures: chains, stars, grids.
Factor strength = strength of variable interactions.
Predictive Power of Bounds
29
Errors should be ordered: MLE < MPLE < MPLE-disjoint
[Plot: L1 parameter error ε (lower is better) vs. # training examples, for MLE, MPLE, and MPLE-disjoint. Factors: random, fixed strength; length-4 chains.]
Predictive Power of Bounds
30
MLE & MPLE Sample Complexity:
[Plot: actual ε (lower is better) for MLE vs. predicted sample complexity (harder to the right). Factors: random; length-6 chains; 10,000 training examples.]
Failure Modes of MPLE
31
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity:
Factors examined: model diameter, factor strength, node degree.
Λmin: Model Diameter
32
Λmin ratio: MLE/MPLE (higher = MLE better).
[Plot: Λmin ratio vs. model diameter. Factors: associative, fixed strength; chains.]
Relative MPLE performance is independent of diameter in chains. (Same for random factors.)
Λmin: Factor Strength
33
Λmin ratio: MLE/MPLE (higher = MLE better).
[Plot: Λmin ratio vs. factor strength. Factors: associative; length-8 chains.]
MPLE performs poorly with strong factors. (Same for random factors, and star & grid models.)
Λmin: Node Degree
34
Λmin ratio: MLE/MPLE (higher = MLE better).
[Plot: Λmin ratio vs. node degree. Factors: associative, fixed strength; stars.]
MPLE performs poorly with high-degree nodes. (Same for random factors.)
Failure Modes of MPLE
35
How do Λmin(MLE) and Λmin(MPLE) vary for different models?
Sample complexity:
Factors examined: model diameter, factor strength, node degree.
We can often fix this!
Composite Likelihood (MCLE)
36
MLE: Estimate P(Y) all at once
Composite Likelihood (MCLE)
37
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
Yi
Composite Likelihood (MCLE)
38
MLE: Estimate P(Y) all at once
MPLE: Estimate P(Yi|Y\i) separately
YAi
Something in between?
Composite Likelihood (MCLE):
Estimate P(YAi | Y\Ai) separately. (Lindsay, 1988)
Generalizes MLE, MPLE; analogous objective, sample complexity, and joint & disjoint optimization.
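A sketch of the composite likelihood objective for a toy binary log-linear model (illustrative; not the thesis code). Singleton components recover MPLE; one component covering all variables recovers MLE; anything in between is a genuine MCLE.

```python
import itertools
import math

def log_composite_likelihood(theta, features, x, components):
    """Sum over components A of log P(x_A | x_rest). components is a list of
    index tuples: singletons give MPLE, one full block gives MLE."""
    total = 0.0
    for A in components:
        scores = {}
        for vals in itertools.product((0, 1), repeat=len(A)):
            xv = list(x)
            for idx, v in zip(A, vals):
                xv[idx] = v
            scores[vals] = math.exp(sum(t * f(tuple(xv))
                                        for t, f in zip(theta, features)))
        observed = tuple(x[i] for i in A)
        total += math.log(scores[observed] / sum(scores.values()))
    return total

# Invented toy model: pairwise agreement features over 3 binary variables.
features = [lambda x: float(x[0] == x[1]),
            lambda x: float(x[1] == x[2])]
theta = [1.0, 0.5]
mle_ll = log_composite_likelihood(theta, features, (1, 1, 1), [(0, 1, 2)])
mple_ll = log_composite_likelihood(theta, features, (1, 1, 1), [(0,), (1,), (2,)])
```

The cost of each component is exponential only in the component size, which is the computational side of the trade-off the slides describe.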
Composite Likelihood (MCLE)
39
MCLE Class: node-disjoint subgraphs which cover the graph.
Composite Likelihood (MCLE)
40
MCLE Class: node-disjoint subgraphs which cover the graph.
• Trees (tractable inference)
• Follow structure of P(X): cover star structures, cover strong factors
• Choose large components
Example: "combs"
Generalizes MLE, MPLE; analogous objective, sample complexity, and joint & disjoint optimization.
Structured MCLE on a Grid
41
[Plots: (1) log-loss ratio (other/MLE) vs. grid size |X| for MCLE (combs) and MPLE; (2) training time (sec) vs. grid size |X| for MCLE (combs), MPLE, and MLE. Grid, associative factors, 10,000 training examples, Gibbs sampling. Lower is better.]
MCLE (combs) lowers sample complexity without increasing computation!
MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.
Summary: Parameter Learning
42
Likelihood (MLE): sample complexity Optimal, computational complexity High, parallel optimization Difficult.
Pseudolikelihood (MPLE): sample complexity High, computational complexity Low, parallel optimization Easy.
Composite Likelihood (MCLE): sample complexity Low, computational complexity Low, parallel optimization Easy.
• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
Outline
43
Scaling core methods: Structure Learning (learning tractable structures)
CRF Structure Learning
44
X3: deadline?
Y1: losing sleep?
Y3: sick? Y2: losing hair?
X1: loud roommate?
X2: taking classes?
Structure learning: choose YC, i.e., learn conditional independence.
Evidence selection: choose XD, i.e., select X relevant to each YC.
Related Work
Method | Structure learning? | Tractable inference? | Evidence selection?
Torralba et al. (2004), Boosted Random Fields: Yes | No | Yes
Schmidt et al. (2008), block-L1 regularized pseudolikelihood: Yes | No | No
Shahaf et al. (2009), edge weights + low-treewidth model: Yes | Yes | No
Most similar to our work: they focus on selecting treewidth-k structures; we focus on the choice of edge weight.
45
Tree CRFs with Local Evidence
Goal: given data and local evidence (each Xi relevant to its Yi), learn a tree CRF structure via a scalable method with fast inference at test time.
Bradley, Guestrin (2010)
46
Chow-Liu for MRFs
47
Chow & Liu (1968)
Y1 Y2
Y3
Algorithm: weight edges with mutual information I(Yi; Yj).
Chow-Liu for MRFs
48
Chow & Liu (1968)
Algorithm: weight edges with mutual information w(i,j) = I(Yi; Yj); choose the max-weight spanning tree.
Y1 Y2
Y3
Chow-Liu finds a max-likelihood structure.
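A compact sketch of Chow-Liu for binary variables (illustrative; empirical mutual information plus Kruskal's algorithm stand in for whatever implementation the thesis used):

```python
import math
from collections import Counter

def mutual_information(samples, i, j):
    """Empirical mutual information between discrete variables i and j."""
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_tree(samples, d):
    """Max-weight spanning tree over MI edge weights (Kruskal + union-find)."""
    edges = sorted(((mutual_information(samples, i, j), i, j)
                    for i in range(d) for j in range(i + 1, d)), reverse=True)
    parent = list(range(d))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data (invented): X0 always equals X1; X2 varies independently.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tree = chow_liu_tree(samples, 3)
```

The question the next slides address is what to use in place of I(Yi; Yj) when the model is a CRF rather than an MRF.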
Chow-Liu for CRFs? What edge weight? It must be efficient to compute.
Global Conditional Mutual Information
(CMI)
Pro: Finds max-likelihood structure (with enough data)
Con: Intractable for large |X|
49
Algorithm: weight each possible edge with global CMI, w(i,j) = I(Yi; Yj | X); choose the max-weight spanning tree.
Generalized Edge Weights
50
Local Linear Entropy Scores (LLES): w(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj.
TheoremNo LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).
Heuristic Edge Weights: Local CMI and Decomposable Conditional Influence (DCI).
Method | Guarantees | Compute w(i,j) tractably? | Comments
Global CMI: recovers the true tree | No | Shahaf et al. (2009)
Local CMI: lower-bounds the likelihood gain | Yes | Fails with strong Yi-Xi potentials
DCI: exact likelihood gain for some edges | Yes | Best empirically
51
Synthetic tests: trees w/ associative factors; |Y| = 40; 1000 test samples; error bars: 2 std. errors.
[Plot: fraction of true-CRF edges recovered (higher is better) vs. # training examples, for DCI, Global CMI, Local CMI, and Schmidt et al.]
52
Synthetic tests: trees w/ associative factors; |Y| = 40; 1000 test samples; error bars: 2 std. errors.
[Plot: training time in seconds (lower is better) vs. # training examples, for Global CMI, DCI, Local CMI, and Schmidt et al.]
53
fMRI Tests
X: fMRI voxels (500)
Y: semantic features (218); predict Y from X.
(Application & data from Palatucci et al., 2009)
Image from http://en.wikipedia.org/wiki/File:FMRI.jpg
[Plot: test log-likelihood E[log P(Y|X)] (higher is better) for Disconnected (Palatucci et al., 2009), DCI 1, and DCI 2.]
54
Summary: Structure Learning
55
• Analyzed generalizing Chow-Liu to CRFs
• Proposed class of edge weights: Local Linear Entropy Scores
• Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data
Generalized Chow-Liu: compute edge weights; choose max-weight spanning tree.
Outline
56
Parallel scaling: Parallel Regression (multicore sparse regression)
Scaling core methods:
• Parameter Learning: pseudolikelihood, canonical parameterization
• Structure Learning: generalized Chow-Liu
Solve via sparse regression:
• Compute edge weights via P(Yi, Yj | Xij)
• Regress each variable on its neighbors: P(Xi | X\i)
Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)
Goal: predict y from x, given samples; bias towards sparse solutions.
Lasso (Tibshirani, 1996). Objective: min_w ||Xw - y||_2^2 + λ||w||_1.
Useful in the high-dimensional setting (# features >> # examples). Applies to the Lasso and sparse logistic regression.
57
Parallelizing the Lasso
Many Lasso optimization algorithms: gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding, and coordinate descent (a.k.a. Shooting; Fu, 1998), one of the fastest algorithms (Yuan et al., 2010).
Parallel optimization options:
• Matrix-vector ops (e.g., interior point): not great empirically.
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d.
• Shooting: inherently sequential?
Shotgun: parallel coordinate descent for L1 regression. Simple algorithm, elegant analysis.
58
Shooting: sequential SCD, minimizing F(w) = ||Xw - y||_2^2 + λ||w||_1.
Stochastic Coordinate Descent (SCD): while not converged, choose a random coordinate j and update wj (closed-form minimization).
59
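A minimal sketch of Shooting (sequential stochastic coordinate descent) for the Lasso objective ||Xw - y||_2^2 + λ||w||_1; the closed-form coordinate update is soft-thresholding. Pure Python, illustrative only:

```python
import random

def soft_threshold(a, lam):
    """Soft-thresholding operator S(a, lam)."""
    return (a - lam) if a > lam else (a + lam) if a < -lam else 0.0

def shooting(X, y, lam, iters=2000, seed=0):
    """Sequential stochastic coordinate descent (Shooting) for
    min_w ||Xw - y||_2^2 + lam * ||w||_1, with X given as a list of rows."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    r = list(y)  # residual y - Xw (w starts at 0)
    for _ in range(iters):
        j = rng.randrange(d)
        a_j = 2.0 * sum(X[i][j] ** 2 for i in range(n))
        c_j = 2.0 * sum(X[i][j] * (r[i] + X[i][j] * w[j]) for i in range(n))
        w_new = soft_threshold(c_j, lam) / a_j if a_j > 0 else 0.0
        delta = w_new - w[j]
        for i in range(n):
            r[i] -= X[i][j] * delta  # keep the residual in sync
        w[j] = w_new
    return w

# Tiny orthogonal-design example (invented) where the answer is easy to verify.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [2.0, 0.0, 2.0, 0.0]
w = shooting(X, y, lam=0.1)
```

On an orthogonal design each coordinate update is exact, so the solver converges almost immediately; correlated columns need more passes.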
Shotgun: parallel SCD, minimizing F(w) = ||Xw - y||_2^2 + λ||w||_1.
Shotgun Algorithm (parallel SCD): while not converged, on each of P processors choose a random coordinate j and update wj (same as for Shooting).
Nice case: uncorrelated features. Bad case: correlated features.
Is SCD inherently sequential?
60
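A toy simulation of Shotgun's parallel semantics (illustrative, not the multicore implementation): in each "round", P distinct coordinates are updated from the same stale residual, mimicking P processors writing concurrently.

```python
import random

def soft_threshold(a, lam):
    """Soft-thresholding operator S(a, lam)."""
    return (a - lam) if a > lam else (a + lam) if a < -lam else 0.0

def shotgun(X, y, lam, P=4, rounds=500, seed=0):
    """Shotgun sketch: each round, P distinct coordinates are updated from the
    SAME stale residual (simulating P processors), then written back."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(rounds):
        # Stale snapshot of the residual y - Xw for this round.
        r = [y[i] - sum(X[i][k] * w[k] for k in range(d)) for i in range(n)]
        for j in rng.sample(range(d), min(P, d)):
            a_j = 2.0 * sum(X[i][j] ** 2 for i in range(n))
            c_j = 2.0 * sum(X[i][j] * (r[i] + X[i][j] * w[j]) for i in range(n))
            w[j] = soft_threshold(c_j, lam) / a_j if a_j > 0 else 0.0
    return w

# Same invented orthogonal example as for Shooting; uncorrelated columns are
# the "nice case" where simultaneous updates cannot conflict.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [2.0, 0.0, 2.0, 0.0]
w = shotgun(X, y, lam=0.1, P=2)
```

With correlated columns, simultaneous updates computed from the same stale residual can overshoot, which is exactly the effect the convergence theory below bounds.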
Shotgun: Theory
Convergence Theorem: assume the number of parallel updates P < d/ρ + 1, where ρ is the spectral radius of XᵀX. Then the expected final objective converges to the optimal objective, with the number of iterations needed scaling as O(d / (εP)).
Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009).
61
Shotgun: Theory
Convergence Theorem: the (final - optimal) objective bound requires P parallel updates with P < d/ρ + 1, where ρ = spectral radius of XᵀX.
Nice case (uncorrelated features): ρ = 1, allowing up to P ≈ d parallel updates.
Bad case (correlated features): ρ = d, allowing only P = 1 (at worst).
62
Shotgun: Theory
Convergence Theorem: up to the threshold on P, linear speedups are predicted.
Experiments match our theory!
63
[Plots: T (iterations) vs. P (parallel updates). Mug32_singlepixcam: Pmax = 79. SparcoProblem7: Pmax = 284.]
Lasso Experiments
Compared many algorithms: interior point (L1_LS), shrinkage (FPC_AS, SpaRSA), projected gradient (GPSR_BB), iterative hard thresholding (Hard_IO); also ran GLMNET, LARS, SMIDAS.
35 datasets; λ = .5, 10; Shooting; Shotgun with P = 8 (multicore).
Datasets: single-pixel camera, Sparco (van den Berg et al., 2009), sparse compressed imaging, large sparse datasets.
64
Shotgun proves most scalable & robust.
Shotgun: Speedup
Aggregated results from all tests.
[Plots: speedup vs. # cores, against the optimal line: Lasso iteration speedup, Lasso time speedup, logistic regression time speedup.]
Lasso time speedup is not so great, but we are doing fewer iterations!
Explanation: the memory wall (Wulf & McKee, 1995); the memory bus gets flooded.
Logistic regression uses more FLOPS/datum; extra computation hides memory latency, giving better speedups on average!
65
Summary: Parallel Regression
66
• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.
Shotgun decomposes computation by coordinate updates, trading a little extra computation for a lot of parallelism.
Recall: Thesis Statement. We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
67
Parameter Learning: structured composite likelihood (MLE, MCLE, MPLE).
Structure Learning: generalized Chow-Liu.
Parallel Regression: Shotgun, parallel coordinate descent.
Decompositions use model structure & locality. Trade-offs use model- and data-specific methods.
Future Work: Unified System
68
Parameter Learning, Structure Learning, Parallel Regression:
Structured MCLE: automatically choose the MCLE structure & parallelization strategy to optimize trade-offs, tailored to model & data.
Shotgun (multicore) to distributed: limited communication in the distributed setting; handle complex objectives (e.g., MCLE).
L1 structure learning / learning trees: use structured MCLE? Learn trees for parameter estimators?
Summary
Parameter learning: structured composite likelihood; finite sample complexity bounds; empirical analysis; guidelines for choosing MCLE structures (tailor to model, data); analyzed the canonical parameterization of Abbeel et al. (2006).
69
We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.
Structure learning: generalized Chow-Liu to CRFs; proposed a class of edge weights (Local Linear Entropy Scores), shown insufficient for recovering trees; discovered useful heuristic edge weights (Local CMI, DCI); promising empirical results on synthetic & fMRI data.
Parallel regression: Shotgun, parallel coordinate descent on multicore; analysis of near-linear speedups, up to a problem-dependent limit; extensive experiments (37 datasets, 7 other methods); our theory predicts empirical behavior well; Shotgun is one of the most scalable methods.
Thank you!