
Page 1: Thesis Defense

Carnegie Mellon

Thesis Defense

Joseph K. Bradley

Learning Large-Scale Conditional Random Fields

Committee: Carlos Guestrin (U. of Washington, Chair), Tom Mitchell, John Lafferty (U. of Chicago), Andrew McCallum (U. of Massachusetts at Amherst)

1 / 18 / 2013

Page 2: Thesis Defense

Modeling Distributions

2

Goal: Model a distribution P(X) over random variables X. E.g.: model the life of a grad student.

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?   X9: exercising?

X10: gaining weight?   X11: single?

Page 3: Thesis Defense

Modeling Distributions

3

X2: deadline?

X1: losing sleep?

X5: overeating?

X7: taking classes?

P(X1, X5 | X2, X7) = P( losing sleep, overeating | deadline, taking classes )

Goal: Model a distribution P(X) over random variables X. E.g.: model the life of a grad student.

Page 4: Thesis Defense

Markov Random Fields (MRFs)

4

X2: deadline?

X1: losing sleep?

X3: sick?

X4: losing hair?

X5: overeating?

X6: loud roommate?

X7: taking classes?

X8: cold weather?   X9: exercising?

X10: gaining weight?   X11: single?

Goal: Model a distribution P(X) over random variables X. E.g.: model the life of a grad student.

Page 5: Thesis Defense

Markov Random Fields (MRFs)

5

X2

X1

X3

X4

X5

X6

X7

X8   X9

X10   X11

graphical structure

factor (parameters)

Goal: Model distribution P(X) over random variables X
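For reference, the factorization that the "graphical structure" and "factor (parameters)" labels describe is the standard MRF form, written here in generic notation with factors ψ_C over cliques C and partition function Z (symbols assumed, not taken from the slide):

    P(X) = (1/Z) \prod_C \psi_C(X_C),   with   Z = \sum_x \prod_C \psi_C(x_C).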

Page 6: Thesis Defense

Conditional Random Fields (CRFs)

6

X2

Y1

Y3

Y4

Y5

X1

X3

X4   X5

X6   Y2

MRFs: P(X).  CRFs: P(Y|X). (Lafferty et al., 2001)

Do not model P(X). Simpler structure (over Y only).
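A sketch of the CRF definition from Lafferty et al. (2001), in generic factor notation (symbols assumed):

    P(Y | X) = (1/Z(X)) \prod_C \psi_C(Y_C, X),   with   Z(X) = \sum_y \prod_C \psi_C(y_C, X).

Because the partition function Z(X) is computed per evidence X, the model never has to represent P(X) itself.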

Page 7: Thesis Defense

MRFs & CRFs

7

Benefits
• Principled statistical and computational framework
• Large body of literature

Applications
• Natural language processing (e.g., Lafferty et al., 2001)
• Vision (e.g., Tappen et al., 2007)
• Activity recognition (e.g., Vail et al., 2007)
• Medical applications (e.g., Schmidt et al., 2008)
• ...

Page 8: Thesis Defense

Challenges

8

Goal: Given data, learn CRF structure and parameters.

X2

Y1

Y3

Y4

Y5

X1

X5

X6   Y2

Many learning methods require inference, i.e., answering queries P(A|B)

NP-hard to approximate (Roth, 1996)

Big structured optimization problem: NP-hard in general (Srebro, 2003)

Approximations often lack strong guarantees.

Page 9: Thesis Defense

Thesis Statement

CRFs offer statistical and computational advantages, but traditional learning methods are often impractical for large problems.

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

9

Page 10: Thesis Defense

Outline

Scaling core methods:
Parameter Learning: learning without intractable inference

10

Structure Learning: learning tractable structures

Parallel scaling:
Parallel Regression: multicore sparse regression (the core methods above are solved via regression)

Page 11: Thesis Defense

Outline

Scaling core methods:
Parameter Learning: learning without intractable inference

11

Page 12: Thesis Defense

Log-linear MRFs

12

X2

X1

X3

X4

X5

X6

X7

X8   X9

X10   X11

Goal: Model distribution P(X) over random variables X

Parameters and features. All results generalize to CRFs.
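The "parameters" and "features" labels presumably refer to the standard log-linear (exponential-family) parameterization; a sketch, with θ for parameters and φ for features (symbols assumed):

    P_\theta(X) = (1/Z(\theta)) \exp( \theta^\top \phi(X) ),   with   Z(\theta) = \sum_x \exp( \theta^\top \phi(x) ).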

Page 13: Thesis Defense

Parameter Learning: MLE

13

Traditional method: maximum-likelihood estimation (MLE). Minimize the objective (the negative log-likelihood loss).

Gold standard: MLE is (optimally) statistically efficient.

Parameter Learning: given structure Φ and samples from Pθ*(X), learn parameters θ.
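As a concrete sketch of the MLE objective for the log-linear model above, given samples x^(1), ..., x^(n) (standard form):

    \hat\theta_{MLE} = \arg\min_\theta  \log Z(\theta) - (1/n) \sum_k \theta^\top \phi(x^{(k)}).

Computing log Z(θ) and its gradient is the inference step that makes this expensive.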

Page 14: Thesis Defense

Parameter Learning: MLE

14

Page 15: Thesis Defense

Parameter Learning: MLE

15

MLE requires inference, which is provably hard for general MRFs (Roth, 1996).

Inference makes learning hard.

Can we learn without intractable inference?

Page 16: Thesis Defense

Parameter Learning: MLE

16

Inference makes learning hard.

Can we learn without intractable inference?

Approximate inference & objectives
• Many works: Hinton (2002), Sutton & McCallum (2005), Wainwright (2006), ...
• Many lack strong theory.
• Almost no guarantees for general MRFs or CRFs.

Page 17: Thesis Defense

Our Solution

17

Method                                    Sample complexity   Computational complexity   Parallel optimization
Max Likelihood Estimation (MLE)           Optimal             High                       Difficult
Max Pseudolikelihood Estimation (MPLE)    High                Low                        Easy

PAC learnability for many MRFs!

Bradley, Guestrin (2012)


Page 19: Thesis Defense

Our Solution

19

Method                                       Sample complexity   Computational complexity   Parallel optimization
Max Likelihood Estimation (MLE)              Optimal             High                       Difficult
Max Pseudolikelihood Estimation (MPLE)       High                Low                        Easy
Max Composite Likelihood Estimation (MCLE)   Low                 Low                        Easy

Choose MCLE structure to optimize trade-offs.

Bradley, Guestrin (2012)

Page 20: Thesis Defense

Deriving Pseudolikelihood (MPLE)

20

X2

X1

X3

X4

X5

MLE:

Hard to compute, so replace it!

Page 21: Thesis Defense

Deriving Pseudolikelihood (MPLE)

21

X1

MLE:

Estimate via regression:

MPLE:

(Besag, 1975)

Tractable inference!
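A sketch of Besag's pseudolikelihood objective (standard form); each conditional is exactly one of the per-variable regressions mentioned above:

    \hat\theta_{MPLE} = \arg\min_\theta  -(1/n) \sum_k \sum_i \log P_\theta( x_i^{(k)} | x_{\setminus i}^{(k)} ).

Each conditional normalizes over a single variable X_i, so no intractable partition function appears.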

Page 22: Thesis Defense

Pseudolikelihood (MPLE)

22

Pros
• No intractable inference!
• Consistent estimator

Cons
• Less statistically efficient than MLE (Liang & Jordan, 2008)
• No PAC bounds

PAC = Probably Approximately Correct (Valiant, 1984)

MPLE:

(Besag, 1975)

Page 23: Thesis Defense

Sample Complexity: MLE

23

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: min eigenvalue of the Hessian of the loss at θ*
• probability of failure
• parameter error (L1)

Recall: MLE requires intractable inference.
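A schematic of how these quantities enter the bound; the exact constants and polynomial factors are in Bradley & Guestrin (2012) and are not reproduced here:

    n = O( poly(r) / (\Lambda_{min}^2 \epsilon^2) \cdot \log(r / \delta) ),

where r is the number of parameters, ε the L1 parameter error, and δ the failure probability. The qualitative point used on later slides is that larger Λmin means fewer training examples are needed.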

Page 24: Thesis Defense

Sample Complexity: MPLE

24

Our Theorem: Bound on n (# training examples needed), in terms of:
• # parameters (length of θ)
• Λmin: min_i [ min eigenvalue of the Hessian of component i at θ* ]
• probability of failure
• parameter error (L1)

Recall: MPLE requires only tractable inference.

PAC learnability for many MRFs!

Page 25: Thesis Defense

Sample Complexity: MPLE

25

Our Theorem: Bound on n (# training examples needed)

PAC learnability for many MRFs!

Related Work
Ravikumar et al. (2010)
• Regression Yi ~ X with Ising models
• Basis of our theory
Liang & Jordan (2008)
• Asymptotic analysis of MLE, MPLE
• Our bounds match theirs
Abbeel et al. (2006)
• Only previous method with PAC bounds for high-treewidth MRFs
• We extend their work: extension to CRFs, algorithmic improvements, analysis
• Their method is very similar to MPLE.

Page 26: Thesis Defense

Trade-offs: MLE & MPLE

26

Our Theorem: Bound on n (# training examples needed)

Sample complexity vs. computational complexity trade-off

MLE: larger Λmin, so lower sample complexity, but higher computational complexity.

MPLE: smaller Λmin, so higher sample complexity, but lower computational complexity.

Page 27: Thesis Defense

Trade-offs: MPLE

27

X1

Joint optimization for MPLE:

X2

Disjoint optimization for MPLE:

2 estimates of each shared parameter; average the estimates.

Lower sample complexity

Data-parallel

Sample complexity vs. parallelism trade-off

Page 28: Thesis Defense

Synthetic CRFs

28

Random / Associative

Chains Stars Grids

Factor strength = strength of variable interactions

Page 29: Thesis Defense

Predictive Power of Bounds

29

Errors should be ordered: MLE < MPLE < MPLE-disjoint

[Plot: L1 parameter error ε vs. # training examples for MLE, MPLE, and MPLE-disjoint (lower is better). Factors: random, fixed strength; length-4 chains.]

Page 30: Thesis Defense

Predictive Power of Bounds

30

MLE & MPLE Sample Complexity:

[Plot: actual ε for MLE (lower is better; problems get harder along the x-axis). Factors: random; length-6 chains; 10,000 training examples.]

Page 31: Thesis Defense

Failure Modes of MPLE

31

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity:

Model diameter
Factor strength
Node degree

Page 32: Thesis Defense

Λmin: Model Diameter

32

Λmin ratio: MLE/MPLE (higher = MLE better)

[Plot: Λmin ratio vs. model diameter. Factors: associative, fixed strength; chains.]

Relative MPLE performance is independent of diameter in chains. (Same for random factors.)

Page 33: Thesis Defense

Λmin: Factor Strength

33

Λmin ratio: MLE/MPLE (higher = MLE better)

[Plot: Λmin ratio vs. factor strength. Factors: associative; length-8 chains.]

MPLE performs poorly with strong factors. (Same for random factors, and for star & grid models.)

Page 34: Thesis Defense

Λmin: Node Degree

34

Λmin ratio: MLE/MPLE (higher = MLE better)

[Plot: Λmin ratio vs. node degree. Factors: associative, fixed strength; stars.]

MPLE performs poorly with high-degree nodes. (Same for random factors.)

Page 35: Thesis Defense

Failure Modes of MPLE

35

How do Λmin(MLE) and Λmin(MPLE) vary for different models?

Sample complexity:

Model diameter
Factor strength
Node degree

We can often fix this!

Page 36: Thesis Defense

Composite Likelihood (MCLE)

36

MLE: Estimate P(Y) all at once

Page 37: Thesis Defense

Composite Likelihood (MCLE)

37

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

Yi

Page 38: Thesis Defense

Composite Likelihood (MCLE)

38

MLE: Estimate P(Y) all at once

MPLE: Estimate P(Yi|Y\i) separately

YAi

Something in between?

Composite Likelihood (MCLE): estimate P(YAi | Y\Ai) separately. (Lindsay, 1988)
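A sketch of the composite likelihood objective for node-disjoint components A_1, A_2, ... (standard form):

    \hat\theta_{MCLE} = \arg\min_\theta  -(1/n) \sum_k \sum_i \log P_\theta( y_{A_i}^{(k)} | y_{\setminus A_i}^{(k)} ).

MPLE is the special case A_i = {i}; MLE is the special case of a single component containing all variables.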

Page 39: Thesis Defense

Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.

Composite Likelihood (MCLE)

39

MCLE class: node-disjoint subgraphs which cover the graph.

Page 40: Thesis Defense

Composite Likelihood (MCLE)

40

MCLE class: node-disjoint subgraphs which cover the graph.
• Trees (tractable inference)
• Follow structure of P(X)
• Cover star structures
• Cover strong factors
• Choose large components

Combs

Generalizes MLE, MPLE; analogous: objective, sample complexity, joint & disjoint optimization.

Page 41: Thesis Defense

Structured MCLE on a Grid

41

[Plots: log-loss ratio (other/MLE) and training time (sec) vs. grid size |X|, for MCLE (combs), MPLE, and MLE (lower is better). Grid; associative factors; 10,000 training examples; Gibbs sampling.]

MCLE (combs) lowers sample complexity... without increasing computation!

MCLE tailored to model structure. Also in thesis: tailoring to correlations in data.

Page 42: Thesis Defense

Summary: Parameter Learning

42

Method                         Sample complexity   Computational complexity   Parallel optimization
Likelihood (MLE)               Optimal             High                       Difficult
Pseudolikelihood (MPLE)        High                Low                        Easy
Composite Likelihood (MCLE)    Low                 Low                        Easy

• Finite sample complexity bounds for general MRFs, CRFs
• PAC learnability for certain classes
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data

Page 43: Thesis Defense

Outline

Scaling core methods:

43

Structure Learning: learning tractable structures

Page 44: Thesis Defense

CRF Structure Learning

44

X3: deadline?

Y1: losing sleep?

Y3: sick? Y2: losing hair?

X1: loud roommate?

X2: taking classes?

Structure learning: Choose YC

I.e., learn conditional independence

Evidence selection: Choose XD

I.e., select X relevant to each YC

Page 45: Thesis Defense

Related Work

Work                     Method                                   Structure learning?   Tractable inference?   Evidence selection?
Torralba et al. (2004)   Boosted Random Fields                    Yes                   No                     Yes
Schmidt et al. (2008)    Block-L1 regularized pseudolikelihood    Yes                   No                     No
Shahaf et al. (2009)     Edge weights + low-treewidth model       Yes                   Yes                    No

Most similar to our work: Shahaf et al. (2009). They focus on selecting treewidth-k structures; we focus on the choice of edge weight.

45

Page 46: Thesis Defense

Tree CRFs with Local Evidence

Goal
Given: data; local evidence (Xi relevant to each Yi).
Learn a tree CRF structure (with fast inference at test time), via a scalable method.

Bradley, Guestrin (2010)

46

Page 47: Thesis Defense

Chow-Liu for MRFs

47

Chow & Liu (1968)

Y1   Y2

Y3

Algorithm: weight edges with mutual information:

Page 48: Thesis Defense

Chow-Liu for MRFs

48

Chow & Liu (1968)
Algorithm: weight edges with mutual information; choose the max-weight spanning tree.

Y1   Y2

Y3

Chow-Liu finds a max-likelihood structure.
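A minimal sketch of the Chow-Liu procedure for discrete variables, assuming numpy and networkx are available; the function names (empirical_mi, chow_liu_tree) are illustrative, not from the thesis:

import itertools
import numpy as np
import networkx as nx

def empirical_mi(xi, xj):
    # Empirical mutual information I(Xi; Xj) between two discrete columns.
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            p_ab = np.mean((xi == a) & (xj == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def chow_liu_tree(data):
    # data: (n_samples, n_vars) integer array.
    # Weight every pair of variables by mutual information, then return
    # the maximum-weight spanning tree over the variables.
    d = data.shape[1]
    graph = nx.Graph()
    graph.add_nodes_from(range(d))
    for i, j in itertools.combinations(range(d), 2):
        graph.add_edge(i, j, weight=empirical_mi(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(graph)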

Page 49: Thesis Defense

Chow-Liu for CRFs? What edge weight? It must be efficient to compute.

Global Conditional Mutual Information (CMI)

Pro: Finds max-likelihood structure (with enough data)

Con: Intractable for large |X|

49

Algorithm: weight each possible edge; choose the max-weight spanning tree.
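A sketch of the global CMI edge weight, the natural conditional analogue of the Chow-Liu weight (treating this as the exact definition used in the talk is an assumption):

    w(i, j) = I( Y_i ; Y_j | X ).

Conditioning on all of X is what makes this weight intractable to compute for large |X|.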

Page 50: Thesis Defense

Generalized Edge Weights

Global CMI

50

Local Linear Entropy Scores (LLES): w(i,j) = a linear combination of entropies over Yi, Yj, Xi, Xj.

Theorem: No LLES can recover all tree CRFs (even with non-trivial parameters and exact entropies).

Page 51: Thesis Defense

Heuristic Edge Weights

Method       Guarantees                             Compute w(i,j) tractably?   Comments
Global CMI   Recovers true tree                     No                          Shahaf et al. (2009)
Local CMI    Lower-bounds likelihood gain           Yes                         Fails with strong Yi-Xi potentials
DCI          Exact likelihood gain for some edges   Yes                         Best empirically

DCI = Decomposable Conditional Influence.

51

Page 52: Thesis Defense

Synthetic Tests. Trees with associative factors; |Y| = 40; 1,000 test samples; error bars: 2 standard errors.

[Plot: fraction of edges recovered (0 to 1) vs. # training examples (0 to 500) for DCI, Global CMI, Local CMI, Schmidt et al., and the true CRF.]

52

Page 53: Thesis Defense

Synthetic Tests. Trees with associative factors; |Y| = 40; 1,000 test samples; error bars: 2 standard errors.

[Plot: runtime in seconds (0 to 20,000) vs. # training examples (0 to 500) for Global CMI, DCI, Local CMI, and Schmidt et al. (lower is better).]

53

Page 54: Thesis Defense

fMRI Tests

X: fMRI voxels (500)

Y: semantic features (218)

predict (application & data from Palatucci et al., 2009)

Image from http://en.wikipedia.org/wiki/File:FMRI.jpg

[Bar chart: test E[log P(Y|X)] (higher is better) for Disconnected (Palatucci et al., 2009), DCI 1, and DCI 2.]

54

Page 55: Thesis Defense

Summary: Structure Learning

55

• Analyzed generalizing Chow-Liu to CRFs
• Proposed a class of edge weights: Local Linear Entropy Scores
• Negative result: insufficient for recovering trees
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Generalized Chow-Liu: compute edge weights (w12, w23, w24, w25, w45), then take the max-weight spanning tree.

Page 56: Thesis Defense

Outline

56

Scaling core methods:
• Parameter Learning (pseudolikelihood, canonical parameterization): regress each variable on its neighbors, P(Xi | X\i)
• Structure Learning (generalized Chow-Liu): compute edge weights via P(Yi, Yj | Xij)

Parallel scaling:
• Parallel Regression (multicore sparse regression): the regressions above are solved via this

Page 57: Thesis Defense

Sparse (L1) Regression (Bradley, Kyrola, Bickson, Guestrin, 2011)

Bias towards sparse solutions.

Lasso (Tibshirani, 1996). Goal: predict a response from features, given training samples. Objective: squared error plus an L1 penalty.

Useful in the high-dimensional setting (# features >> # examples). Covers the Lasso and sparse logistic regression.
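The Lasso objective referred to above, in one standard scaling (the constant in front of the squared loss varies across papers):

    \min_w  (1/2) || X w - y ||_2^2  +  \lambda || w ||_1.

The L1 penalty is what biases solutions toward sparsity.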

57

Page 58: Thesis Defense

Parallelizing Lasso

Many Lasso optimization algorithms: gradient descent, interior point, stochastic gradient, shrinkage, hard/soft thresholding, coordinate descent (a.k.a. Shooting; Fu, 1998). Coordinate descent is one of the fastest (Yuan et al., 2010).

Parallel optimization:
• Matrix-vector ops (e.g., interior point): not great empirically
• Stochastic gradient (e.g., Zinkevich et al., 2010): best for many samples, not large d
• Shooting: inherently sequential?

Shotgun: parallel coordinate descent for L1 regression. Simple algorithm, elegant analysis.

58

Page 59: Thesis Defense

Shooting: Sequential SCD

Stochastic Coordinate Descent (SCD):
While not converged:
  Choose a random coordinate j.
  Update wj (closed-form minimization).

59
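A minimal numpy sketch of sequential stochastic coordinate descent (Shooting) for the Lasso objective above; the names shooting_lasso and soft_threshold are illustrative, not from the paper:

import numpy as np

def soft_threshold(z, gamma):
    # sign(z) * max(|z| - gamma, 0)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def shooting_lasso(X, y, lam, n_iters=10000, seed=0):
    # Minimizes 0.5 * ||Xw - y||^2 + lam * ||w||_1 by updating one
    # randomly chosen coordinate at a time in closed form.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                  # ||X_j||^2 per column
    resid = y - X @ w                              # residual y - Xw
    for _ in range(n_iters):
        j = rng.integers(d)
        if col_sq[j] == 0:
            continue
        rho = X[:, j] @ resid + col_sq[j] * w[j]   # X_j^T (y - Xw + X_j w_j)
        w_new = soft_threshold(rho, lam) / col_sq[j]
        resid += X[:, j] * (w[j] - w_new)          # keep the residual up to date
        w[j] = w_new
    return w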

Page 60: Thesis Defense

Shotgun: Parallel SCD

Shotgun Algorithm (Parallel SCD):
While not converged:
  On each of P processors, in parallel:
    Choose a random coordinate j.
    Update wj (same as for Shooting).

Nice case: uncorrelated features.

Bad case: correlated features.

Is SCD inherently sequential?

60
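A sketch of the Shotgun update pattern, simulated sequentially: each round picks P coordinates and computes their updates from the same snapshot of w, mimicking P processors that read shared state concurrently (a real multicore implementation runs these updates on separate threads). It reuses soft_threshold from the Shooting sketch above:

def shotgun_lasso(X, y, lam, P=4, n_rounds=2500, seed=0):
    # Simulated parallel SCD: the P updates in a round all see the same
    # residual snapshot, as if computed concurrently.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_rounds):
        resid = y - X @ w                  # snapshot shared by the P updates
        for j in rng.integers(d, size=P):  # conceptually parallel
            if col_sq[j] == 0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * w[j]
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w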

Page 61: Thesis Defense

Shotgun: Theory

Convergence Theorem

Quantities: final objective; optimal objective; T = iterations; P = # parallel updates (assumed bounded); ρ = spectral radius of XᵀX.

Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009)
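A schematic of the guarantee (constants and problem-dependent terms omitted; the precise statement is in Bradley et al., 2011): with d features, ρ the spectral radius of XᵀX, and P parallel updates per round chosen so that roughly

    P \lesssim d / \rho + 1,

the expected objective gap after T rounds behaves like

    E[ F(w^{(T)}) ] - F(w^*) = O( d / (T P) ),

so the number of rounds needed for a fixed accuracy shrinks nearly linearly in P, up to the problem-dependent threshold Pmax shown on the experiments slide.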

61

Page 62: Thesis Defense

Shotgun: Theory

Convergence Theorem

Quantities: final minus optimal objective; T = iterations; P = # parallel updates (assumed bounded); ρ = spectral radius of XᵀX.

Nice case: uncorrelated features (small ρ). Bad case: correlated features (large ρ).

62

Page 63: Thesis Defense

Shotgun: Theory

Convergence Theorem: linear speedups predicted, up to a threshold. Experiments match our theory!

63

[Plots: T (iterations) vs. P (parallel updates) for Mug32_singlepixcam (Pmax = 79) and SparcoProblem7 (Pmax = 284).]

Page 64: Thesis Defense

Lasso Experiments

Compared many algorithms:
• Interior point (L1_LS)
• Shrinkage (FPC_AS, SpaRSA)
• Projected gradient (GPSR_BB)
• Iterative hard thresholding (Hard_IO)
• Also ran: GLMNET, LARS, SMIDAS

35 datasets; λ = 0.5 and 10; Shooting; Shotgun with P = 8 (multicore).

Dataset groups: Single-Pixel Camera; Sparco (van den Berg et al., 2009); Sparse Compressed Imaging; Large, Sparse Datasets.

64

Shotgun proves most scalable & robust.

Page 65: Thesis Defense

Shotgun: Speedup. Aggregated results from all tests.

[Plot: speedup vs. # cores, against the optimal linear line: Lasso iteration speedup, Lasso time speedup, logistic regression time speedup.]

The Lasso time speedup is not so great, but we are doing fewer iterations! Explanation: the memory wall (Wulf & McKee, 1995); the memory bus gets flooded.

Logistic regression uses more FLOPS per datum. The extra computation hides memory latency, giving better speedups on average!

65

Page 66: Thesis Defense

Summary: Parallel Regression

66

• Shotgun: parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.

Shotgun: decompose computation by coordinate updates. Trade a little extra computation for a lot of parallelism.

Page 67: Thesis Defense

Recall: Thesis Statement

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

67

Parameter Learning: structured composite likelihood (MLE, MCLE, MPLE)
Structure Learning: generalized Chow-Liu (edge weights w12, w23, w24, w25, w45)
Parallel Regression: Shotgun, parallel coordinate descent

Decompositions use model structure & locality. Trade-offs use model- and data-specific methods.

Page 68: Thesis Defense

Future Work: Unified System

68

Parameter Learning Structure Learning

Parallel Regression

Structured MCLE: automatically choose the MCLE structure & parallelization strategy to optimize trade-offs, tailored to model & data.

From Shotgun (multicore) to distributed:

Limited communication in distributed setting.

Handle complex objectives (e.g., MCLE).

L1 Structure Learning

Learning trees: use structured MCLE?

Learn trees for parameter estimators?

Page 69: Thesis Defense

Summary

Parameter learning: structured composite likelihood
• Finite sample complexity bounds
• Empirical analysis
• Guidelines for choosing MCLE structures: tailor to model, data
• Analyzed the canonical parameterization of Abbeel et al. (2006)

69

We can scale learning by using decompositions of learning problems which trade off sample complexity, computation, and parallelization.

Structure learning: generalizing Chow-Liu to CRFs
• Proposed a class of edge weights: Local Linear Entropy Scores (insufficient for recovering trees)
• Discovered useful heuristic edge weights: Local CMI, DCI
• Promising empirical results on synthetic & fMRI data

Parallel regression: Shotgun, parallel coordinate descent on multicore
• Analysis: near-linear speedups, up to a problem-dependent limit
• Extensive experiments (37 datasets, 7 other methods)
• Our theory predicts empirical behavior well.
• Shotgun is one of the most scalable methods.

Thank you!