
Submodularity in Machine Learning

Andreas Krause, Stefanie Jegelka


Network Inference

How can we learn who influences whom?


Summarizing Documents

How do we select representative sentences?


MAP inference

How do we find the MAP labeling in discrete graphical models efficiently?

(Figure: image segmentation with labels sky, tree, house, grass.)


What’s common?

Formalization:

Optimize a set function F(S) under constraints

generally very hard

but: structure helps! … if F is submodular, we can …

solve optimization problems with strong guarantees
solve some learning problems

Outline

Part I:
What is submodularity?
Optimization: Minimization (new algorithms, constraints), Maximization (new algorithms, unconstrained)

Break

Part II:
Learning; Learning for Optimization: new settings

… many new results, and many new applications!

submodularity.org: slides, links, references, workshops, …


Example: placing sensors

Place sensors to monitor temperature


Set functions

finite ground set V = {1, 2, …, n}
set function F: 2^V → ℝ

will assume F(∅) = 0 (w.l.o.g.)

assume a black box that can evaluate F(A) for any A ⊆ V


Example: placing sensors

Utility F(A) of having sensors at subset A of all locations:
A = {1,2,3}: very informative ⇒ high value F(A)
A = {1,4,5}: redundant info ⇒ low value F(A)

Marginal gain

Given set function F: 2^V → ℝ

Marginal gain of a new sensor s:
F(s | A) = F(A ∪ {s}) − F(A)

Decreasing gains: submodularity

placement A = {1,2}: adding new sensor s helps a lot ⇒ big gain
placement B = {1,…,5}: adding s doesn't help much ⇒ small gain

F(A ∪ {s}) − F(A)  ≥  F(B ∪ {s}) − F(B)

Equivalent characterizations

Diminishing gains: for all A ⊆ B ⊆ V and s ∉ B:
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)

Union-Intersection: for all A, B ⊆ V:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
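For small ground sets, the diminishing-gains characterization can be checked by brute force. Below is a minimal sketch (our illustration, not from the tutorial); the toy function F(S) = sqrt(|S|) is a hypothetical example:

```python
import itertools

def is_submodular(F, V):
    """Brute-force check of diminishing gains:
    F(A + s) - F(A) >= F(B + s) - F(B) for all A <= B <= V, s not in B.
    Exponential in |V|, so only for sanity checks on tiny ground sets."""
    subsets = [frozenset(S) for r in range(len(V) + 1)
               for S in itertools.combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for s in V - B:
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - 1e-12:
                    return False
    return True

V = frozenset({1, 2, 3, 4})
print(is_submodular(lambda S: len(S) ** 0.5, V))  # True: sqrt is concave
```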

Questions


How do I prove my problem is submodular?

Why is submodularity useful?

Example: Set cover

goal: place sensors in a building to cover the floorplan with discs; each node predicts values of positions within some radius
(Figure: building floorplan with possible sensor locations.)

F(A) = "area covered by sensors placed at A"

Formally: finite set W, collection of n subsets S1, …, Sn ⊆ W.
For A ⊆ V = {1, …, n}, define F(A) = |⋃_{i∈A} S_i|

Set cover is submodular

(Figure: the floorplan twice, with covered discs S1, S2 resp. S1, …, S4 and a candidate disc S′.)

A = {S1, S2}: S′ adds a lot of new area
B = {S1, S2, S3, S4}: most of S′ is already covered

F(A ∪ {S′}) − F(A)  ≥  F(B ∪ {S′}) − F(B)
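A minimal code sketch of this coverage function, with hypothetical coverage sets standing in for the discs:

```python
# Set cover F(A) = |union of covered cells|; the sets below are made up.
cover = {"s1": {1, 2, 3}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {6, 7},
         "s_new": {2, 3, 6}}
F = lambda A: len(set().union(*(cover[s] for s in A))) if A else 0

A = {"s1", "s2"}
B = {"s1", "s2", "s3", "s4"}
print(F(A | {"s_new"}) - F(A))   # 1: the new disc still covers a fresh cell
print(F(B | {"s_new"}) - F(B))   # 0: at the larger placement it is redundant
```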

More complex model for sensing

Joint probability distribution:
P(X1,…,Xn, Y1,…,Yn) = P(Y1,…,Yn) · P(X1,…,Xn | Y1,…,Yn)   (prior × likelihood)

Ys: temperature at location s
Xs: sensor value at location s
Xs = Ys + noise

(Figure: graphical model connecting temperatures Y1,…,Y6 and sensor values X1,…,X6.)

Example: Sensor placement

Utility of having sensors at subset A of all locations:

F(A) = (uncertainty about temperature Y before sensing) − (uncertainty about temperature Y after sensing)

A = {1,2,3}: high value F(A);  A = {1,4,5}: low value F(A)

Submodularity of Information Gain

Y1,…,Ym, X1, …, Xn discrete RVs

F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

F(A) is NOT always submodular

If Xi are all conditionally independent given Y,then F(A) is submodular! [Krause & Guestrin `05]

(Figure: model where each X_i depends only on Y, e.g. X_i = Y_i + noise.)

Proof idea: "information never hurts". Under conditional independence, F(A ∪ {s}) − F(A) = H(X_s | X_A) − H(X_s | Y), and conditioning only reduces entropy: H(X_s | X_A) ≥ H(X_s | X_B) for A ⊆ B.
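For a concrete instance one can take jointly Gaussian variables with X_i = Y_i + independent noise, which satisfies the conditional-independence condition; entropies reduce to log-determinants. A hedged sketch (the covariance below is arbitrary, made up for illustration):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
G = rng.standard_normal((n, n))
K = G @ G.T + n * np.eye(n)        # prior covariance of Y (positive definite)
sigma2 = 0.5                        # sensor noise variance

def info_gain(A):
    """I(Y; X_A) = H(X_A) - H(X_A | Y)
    = 0.5 log det(K_AA + sigma2 I) - 0.5 |A| log(sigma2)."""
    A = sorted(A)
    if not A:
        return 0.0
    _, logdet = np.linalg.slogdet(K[np.ix_(A, A)] + sigma2 * np.eye(len(A)))
    return 0.5 * (logdet - len(A) * np.log(sigma2))

# Diminishing gains: sensor 3 helps a small placement more than a superset.
print(info_gain({0, 3}) - info_gain({0}))              # larger gain
print(info_gain({0, 1, 2, 3}) - info_gain({0, 1, 2}))  # smaller gain
```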

Example: costs

breakfast??

cost: time to reach shop + price of items
(Figure: Markets 1, 2, 3 at travel times t1, t2, t3; each item costs 1 $.)

ground set: the items on offer at the markets

F(basket) = cost(shops visited) + cost(items), e.g. one item from Market 1 and two from Market 2:
F = t1 + 1 + t2 + 2 = (travel times of shops visited) + #items

submodular?

Shared fixed costs

marginal cost of adding an item: #new shops + #new items

• shops: shared fixed cost
• economies of scale

decreasing marginal cost ⇒ the cost function is submodular!
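As a sketch, with made-up travel times and items (our illustration):

```python
travel = {"m1": 3.0, "m2": 5.0, "m3": 2.0}          # hypothetical t1, t2, t3
shop_of = {"eggs": "m1", "milk": "m1", "bread": "m2", "jam": "m2", "tea": "m3"}

def cost(S):
    """F(S) = travel time of every shop S touches + 1 $ per item."""
    return sum(travel[m] for m in {shop_of[i] for i in S}) + len(S)

# Diminishing marginal cost: milk is cheap to add once we already visit m1.
print(cost({"milk"}) - cost(set()))             # 4.0: new shop + one item
print(cost({"eggs", "milk"}) - cost({"eggs"}))  # 1.0: shop cost already paid
```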

Another example: Cut functions

(Figure: graph on V = {a,b,c,d,e,f,g,h} with edge weights 1, 2, and 3.)

F(S) = total weight of edges between S and V \ S

Cut function is submodular!

Why are cut functions submodular?

Consider a single edge {a,b} with weight w:

S       F_ab(S)
{}      0
{a}     w
{b}     w
{a,b}   0

Submodular iff F_ab({a}) + F_ab({b}) ≥ F_ab({a,b}) + F_ab({}), i.e. 2w ≥ 0: the cut function of each single-edge subgraph {i,j} is submodular whenever w ≥ 0!

The full cut function is the sum of these single-edge cut functions, hence submodular by the closedness property on the next slide.
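In code, the cut function is a one-liner over an edge list (toy weights, our illustration), and it passes the brute-force checker from earlier:

```python
edges = [("a", "b", 2.0), ("b", "c", 1.0), ("c", "d", 3.0), ("a", "d", 2.0)]

def cut(S):
    """F(S) = total weight of edges with exactly one endpoint in S."""
    return sum(w for (u, v, w) in edges if (u in S) != (v in S))

print(cut({"a"}))        # 4.0 = 2 + 2
print(cut({"a", "b"}))   # 3.0 = 1 + 2
print(is_submodular(cut, frozenset("abcd")))  # True (checker sketched above)
```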

Closedness properties

F1,…,Fm submodular functions on V and λ1,…,λm > 0

Then: F(A) = Σ_i λ_i F_i(A) is submodular

Submodularity is closed under nonnegative linear combinations!

Extremely useful fact: F_θ(A) submodular for each θ ⇒ Σ_θ P(θ) F_θ(A) submodular!

Multicriterion optimization; a basic proof technique!
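A quick numeric illustration of the closedness fact, reusing the brute-force checker sketched earlier (toy functions of our choosing):

```python
V = frozenset({0, 1, 2})
F1 = lambda S: len(S) ** 0.5              # concave of cardinality
F2 = lambda S: float(min(len(S), 2))      # budget-capped cardinality
F = lambda S: 1.5 * F1(S) + 0.7 * F2(S)   # nonnegative linear combination
print(is_submodular(F, V))                # True
```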


Other closedness properties

Restriction: F(S) submodular on V, W a subset of V. Then F′(S) = F(S ∩ W) is submodular.

Conditioning: F(S) submodular on V, W a subset of V. Then F′(S) = F(S ∪ W) is submodular.

Reflection: F(S) submodular on V. Then F′(S) = F(V \ S) is submodular.

Submodularity …

discrete convexity ….

… or concavity?


Convex aspects

convex extension, duality, efficient minimization

But this is only half of the story…


Concave aspects

submodularity: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B) for A ⊆ B

concavity: g(a + s) − g(a) ≥ g(b + s) − g(b) for a ≤ b

(Figure: "intuitively", F(A) grows like a concave function of |A|.)


Submodularity and concavity

suppose g: ℕ → ℝ and F(A) = g(|A|)

F is submodular if and only if g is concave
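This equivalence is easy to observe numerically with the checker from above:

```python
V = frozenset(range(4))
print(is_submodular(lambda S: len(S) ** 0.5, V))  # concave g: True
print(is_submodular(lambda S: len(S) ** 2, V))    # convex g:  False
```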


Maximum of submodular functions

F1, F2 submodular. What about F(A) = max(F1(A), F2(A))?

(Figure: F1(A) and F2(A) plotted over |A|, with their pointwise maximum F(A).)

max(F1,F2) not submodular in general!

Minimum of submodular functions

Well, maybe F(A) = min(F1(A),F2(A)) instead?


A       F1(A)   F2(A)   F(A)
{}      0       0       0
{a}     1       0       0
{b}     0       1       0
{a,b}   1       1       1

F({b}) − F({}) = 0  <  1 = F({a,b}) − F({a})

min(F1,F2) not submodular in general!

Two faces of submodular functions

Convex aspects: minimization!

Concave aspects: maximization!

What to do with submodular functions

Optimization: Minimization, Maximization
Learning
Online/adaptive optimization

Minimization and maximization: not the same??


Submodular minimization

structured sparsity / regularization
clustering
MAP inference (via minimum s-t cut)

Submodular minimization

submodularity and convexity

Set functions and energy functions: any set function F: 2^V → ℝ with V = {1,…,n} … is a function on binary vectors! Identify A ⊆ V with its indicator vector 1_A ∈ {0,1}^n (e.g. A = {a,b} ⊆ {a,b,c,d} ↦ (1,1,0,0)): F corresponds to a "pseudo-boolean function" f with f(1_A) = F(A).

Submodularity and convexity

Lovász extension [Lovász, 1982]: extend F from {0,1}^n to a function f on [0,1]^n; f is convex if and only if F is submodular.

A minimum of f yields a minimum of F, so submodular minimization becomes convex minimization: polynomial time! [Grötschel, Lovász, Schrijver 1981]

The submodular polyhedron P_F

P_F = { x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V },  where x(A) = Σ_{i∈A} x_i

Example: V = {a,b}

A       F(A)
{}      0
{a}     -1
{b}     2
{a,b}   0

constraints: x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})
(Figure: P_F drawn in the (x_a, x_b) plane.)

Evaluating the Lovász extension

Linear maximization over P_F:  f(x) = max_{y ∈ P_F} yᵀx

Exponentially many constraints!!! Yet computable in O(n log n) time [Edmonds '70]; the maximizer y* also serves as a subgradient / separation oracle.

greedy algorithm:
• sort x: x_{e1} ≥ x_{e2} ≥ … ≥ x_{en}
• the order defines sets S_k = {e1, …, ek}
• set y*_{ek} = F(S_k) − F(S_{k−1}); then f(x) = Σ_k x_{ek} y*_{ek}
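A sketch of this greedy evaluation (this is Edmonds' algorithm; the toy F is our example):

```python
def lovasz_extension(F, x, V):
    """f(x) = sum_k x[e_k] * (F(S_k) - F(S_{k-1})),
    with e_1, e_2, ... sorted by decreasing x and S_k = {e_1, ..., e_k}."""
    f, S, prev = 0.0, set(), 0.0
    for e in sorted(V, key=lambda e: -x[e]):
        S.add(e)
        gain = F(S) - prev        # y*_e, a coordinate of the greedy vertex
        prev += gain
        f += x[e] * gain
    return f

# On indicator vectors, the extension agrees with F itself:
F = lambda S: len(S) ** 0.5
x = {"a": 1.0, "b": 1.0, "c": 0.0}     # indicator vector of {a, b}
print(lovasz_extension(F, x, ["a", "b", "c"]), F({"a", "b"}))  # both sqrt(2)
```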

Lovász extension: example

A       F(A)
{}      0
{a}     1
{b}     .8
{a,b}   .2

(Figure: f interpolates F({}), F({a}), F({b}), F({a,b}) piecewise-linearly over [0,1]².)

Submodular minimization

combinatorial algorithms: Fulkerson prize for Iwata, Fleischer & Fujishige '01 and Schrijver '00
state of the art: O(n^4 T + n^5 log M) [Iwata '03], O(n^6 + n^5 T) [Orlin '09]   (T = time for evaluating F)

minimize convex extension: ellipsoid algorithm [Grötschel et al. '81]; subgradient method, smoothing [Stobbe & Krause '10]

duality: minimum norm point algorithm [Fujishige & Isotani '11]

Base polytope B_F

B_F = P_F ∩ { x : x(V) = F(V) }   (in the example: the face of P_F where x({a,b}) = F({a,b}))

Example: V = {a,b}

A       F(A)
{}      0
{a}     -1
{b}     2
{a,b}   0

The minimum-norm-point algorithm [Fujishige '91, Fujishige & Isotani '11]

regularized problem (Lovász extension): min_x f(x) + ½‖x‖²
dual: minimum norm problem min_{u ∈ B_F} ‖u‖²
the solution u* minimizes F: threshold at zero, A* = {e : u*_e < 0}

1. find u* = argmin_{u ∈ B_F} ‖u‖²
2. threshold: A* = {e : u*_e < 0}

can we solve this?? yes! recall: we can solve linear optimization over P_F, and similarly over B_F (Edmonds' greedy algorithm), so u* can be found with the Frank-Wolfe algorithm.
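A hedged sketch of this pipeline: Edmonds' greedy as the linear oracle over B_F, plain Frank-Wolfe for the minimum-norm point, then thresholding. (The actual Fujishige-Wolfe minimum-norm-point algorithm is a more refined active-set method; this is only the simplest variant, run on the two-element example above.)

```python
import numpy as np

def greedy_vertex(F, w, V):
    """Edmonds' greedy: a vertex of B_F minimizing <w, y>
    (visit elements in increasing order of w)."""
    y = np.zeros(len(V))
    S, prev = set(), 0.0
    for i in sorted(range(len(V)), key=lambda i: w[i]):
        S.add(V[i])
        y[i] = F(S) - prev
        prev += y[i]
    return y

def min_norm_point(F, V, iters=200):
    u = greedy_vertex(F, np.zeros(len(V)), V)
    for t in range(1, iters + 1):
        s = greedy_vertex(F, u, V)            # linear minimization oracle
        u += (2.0 / (t + 2.0)) * (s - u)      # standard Frank-Wolfe step
    return u, {V[i] for i in range(len(V)) if u[i] < -1e-9}  # threshold

Fvals = {frozenset(): 0.0, frozenset({"a"}): -1.0,
         frozenset({"b"}): 2.0, frozenset({"a", "b"}): 0.0}
u, A = min_norm_point(lambda S: Fvals[frozenset(S)], ["a", "b"])
print(u, A)   # u* = (-1, 1); A = {'a'} indeed minimizes F
```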


Empirical comparison

Minimum norm point algorithm: usually orders of magnitude faster

Cut functions from DIMACS Challenge

(Figure: running time in seconds vs. problem size 64-1024, both log-scale; lower is better. The minimum norm point algorithm [Fujishige & Isotani '11] runs orders of magnitude below the combinatorial algorithms.)


Applications?


Example I: Sparsity

(Figure: image pixels ↦ large wavelet coefficients; wideband signal samples ↦ large Gabor (time-frequency) coefficients.)

Many natural signals are sparse in a suitable basis. This can be exploited for learning / regularization / compressive sensing...

Sparse reconstruction

• explain y with few columns of M: few nonzero x_i
• discrete regularization on the support S = supp(x)
• relax to the convex envelope
• in nature: the sparsity pattern is often not random…

(Figure: subset selection, S = {1,3,4,7}.)

Structured sparsity

(Figure: coefficients m1,…,m7 arranged in a tree; supports that form rooted subtrees, such as {m1,m2,m3}, are preferred over scattered supports of the same size, such as {m4,m6,m7}.)

Incorporate the tree preference in the regularizer?

Set function: choose F with F(T) < F(S) whenever T is a tree and S is not, even when |S| = |T|.

Function F is … submodular!

Sparsity

57

• explain y with few columns of M: few nonzero x_i
• prior knowledge: patterns of nonzeros, encoded as a submodular function F
• discrete regularization on the support S of x; relax to the convex envelope: the Lovász extension of F [Bach '10]
• Optimization: submodular minimization

Further connections: Dictionary Selection


Where does the dictionary M come from?

Want to learn it from data:

Selecting a dictionary with near-maximal variance reduction = maximization of an approximately submodular function

[Krause & Cevher ‘10; Das & Kempe ’11]
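The workhorse for such maximization problems is greedy forward selection, which for monotone submodular F under a cardinality constraint is (1 − 1/e)-optimal [Nemhauser et al. '78]; for dictionary selection it is applied to the (approximately submodular) variance-reduction objective. A generic sketch, with a made-up coverage objective standing in for variance reduction:

```python
def greedy_max(F, V, k):
    """Pick k elements, each time adding the one with largest marginal gain."""
    S = set()
    for _ in range(k):
        S.add(max((e for e in V if e not in S),
                  key=lambda e: F(S | {e}) - F(S)))
    return S

atoms = {"m1": {1, 2}, "m2": {2, 3}, "m3": {3}}
F = lambda S: len(set().union(*(atoms[a] for a in S)) if S else set())
print(greedy_max(F, atoms, 2))   # picks m1 and m2
```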


Example: MAP inference

(Figure: image labeling; each pixel gets a label.)


Example: MAP inference

Recall the equivalence: a function on binary vectors ⇔ a set function (x ∈ {0,1}^n ⇔ A = {i : x_i = 1}).

If the energy E is submodular (attractive potentials), then MAP inference = submodular minimization: polynomial time!

(Figure: binary pixel labeling as an indicator vector.)
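A hedged sketch of that reduction for pairwise energies: the classic construction (in the spirit of Kolmogorov & Zabih) written with networkx's min-cut; the tiny energy at the bottom is a made-up example.

```python
import networkx as nx

def map_via_mincut(nodes, unary, pairwise):
    """Minimize E(x) = sum_i unary[i][x_i] + sum_ij pairwise[i,j][x_i,x_j]
    over x in {0,1}^n, assuming each pairwise table (A,B,C,D) for the costs
    of (0,0),(0,1),(1,0),(1,1) is submodular: B + C >= A + D.
    Convention: x_i = 1 iff node i lands on the sink side of the min cut."""
    G = nx.DiGraph()
    G.add_nodes_from(nodes)

    def add_cap(u, v, c):
        if c > 0:
            if G.has_edge(u, v):
                G[u][v]["capacity"] += c
            else:
                G.add_edge(u, v, capacity=c)

    def add_unary(i, c0, c1):
        # cost c1 if x_i = 1 (cuts edge s->i), cost c0 if x_i = 0 (cuts i->t);
        # only the difference matters for the argmin
        if c1 >= c0:
            add_cap("s", i, c1 - c0)
        else:
            add_cap(i, "t", c0 - c1)

    for i, (c0, c1) in unary.items():
        add_unary(i, c0, c1)
    for (i, j), (A, B, C, D) in pairwise.items():
        # E_ij = A + (C-A) x_i + (D-C) x_j + (B+C-A-D)(1-x_i) x_j
        add_unary(i, 0.0, C - A)
        add_unary(j, 0.0, D - C)
        add_cap(i, j, B + C - A - D)    # nonnegative by submodularity

    _, (src_side, sink_side) = nx.minimum_cut(G, "s", "t")
    return {i: int(i in sink_side) for i in nodes}

x = map_via_mincut(
    nodes=[0, 1],
    unary={0: (0.0, 2.0), 1: (1.0, 0.0)},     # pixel 0 likes 0, pixel 1 likes 1
    pairwise={(0, 1): (0.0, 1.0, 1.0, 0.0)},  # Potts-like coupling, submodular
)
print(x)  # {0: 0, 1: 1}
```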

Special cases

Minimizing general submodular functions: poly-time, but not very scalable

Special structure ⇒ faster algorithms: symmetric functions, graph cuts, concave functions, sums of functions with bounded support, ...


MAP inference

if each pairwise term is submodular, then the energy is a graph cut function:
MAP inference = minimum cut: fast!

(Figure: segmentation obtained as a minimum cut.)

Pairwise is not enough…

Enforcing label consistency [Kohli et al. '09]

color + pairwise: pixels in one tile should have the same label
color + pairwise + higher-order term: pixels in a superpixel should have the same label

(Figure: segmentations with and without the consistency potential.)

Higher-order functions as graph cuts?

E(x) uses a concave function of cardinality ⇒ submodular

> 2 arguments: graph cut?? works well for some particular functions [Billionnet & Minoux '85, Freedman & Drineas '05, Živný & Jeavons '10, ...], but the necessary conditions are complex, and not all submodular functions equal such graph cuts [Živný et al. '09]


General strategy: reduce to the pairwise case by adding auxiliary variables.

Fast approximate minimization

Not all submodular functions can be optimized as graph cuts, and even when they can, the graph may need many extra nodes.

Other options? minimum norm algorithm; other special cases, e.g. parametric maxflow [Fujishige & Iwata '99]

Approximate! Every submodular function can be approximated by a series of graph cut functions [Jegelka, Lin & Bilmes '11]:
• decompose: represent as much as possible exactly by a graph
• rest: approximate iteratively by changing edge weights
• solve a series of cut problems

(Example application: speech corpus selection [Lin & Bilmes '11].)

Other special cases

Symmetric: Queyranne's algorithm, O(n^3) [Queyranne, 1998]

Concave of modular: F(A) = Σ_j g_j(w_j(A)) with g_j concave [Stobbe & Krause '10, Kohli et al. '09]

Sum of submodular functions, each with bounded support [Kolmogorov '12]


Submodular minimization

Optimization: Minimization (unconstrained / constrained), Maximization
Learning
Online/adaptive optimization


Submodular minimization

unconstrained: nontrivial algorithms, polynomial time

constraints: limited cases doable, e.g. odd/even cardinality, inclusion/exclusion of a set; special case: balanced cut

General case: NP hard
• hard to approximate within polynomial factors!
• But: special cases often still work well
[Lower bounds: Goel et al. '09, Iwata & Nagano '09, Jegelka & Bilmes '11]

Constraints

minimum… cut, matching, path, spanning tree
ground set: edges in a graph

Recall: MAP and cuts

binary labeling x ∈ {0,1}^n
pairwise random field: E(x) = Σ_i θ_i(x_i) + Σ_{ij} θ_ij(x_i, x_j)

What's the problem? minimum cut prefers a short cut = a short object boundary
(Figure: aim vs. reality for thin, elongated objects.)

MAP and cuts

Minimum cut: minimize a sum of edge weights; implicit criterion: short cut = short boundary.

Minimum cooperative cut: minimize a submodular function of the cut edges (not a sum of edge weights!); new criterion: the boundary may be long if it is homogeneous. Reward co-occurrence of edges.

sum of weights: use few edges
submodular cost function: use few groups S_i of edges
(Figure: one cut with 7 edges of 4 types, another with 25 edges of 1 type.)

Results

(Figure: segmentations with graph cut vs. cooperative cut.)


Optimization?

not a standard graph cut; from the MAP viewpoint: a global, non-submodular energy function

Constrained optimization


cut, matching, path, spanning tree …

approximate optimization: convex relaxation, or minimize a surrogate function
[Goel et al. '09, Iwata & Nagano '09, Goemans et al. '09, Jegelka & Bilmes '11, Iyer et al. ICML '13, Kohli et al. '13, ...]

approximation bounds depend on F: polynomial, constant, FPTAS

Efficient constrained optimization

(cut, matching, path, spanning tree) [Jegelka & Bilmes '11, Iyer et al. ICML '13]

minimize a series of surrogate functions:
1. compute a linear (modular) upper bound of F
2. solve the easy sum-of-weights problem, and repeat

• efficient
• only need to solve sum-of-weights problems
• unifying viewpoint of submodular min and max

see Wednesday's best student paper talk
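A hedged sketch of this majorize-minimize loop. The modular upper bound below is one of the superdifferential bounds of Iyer & Bilmes; solve_modular is a placeholder for whatever sum-of-weights solver matches the constraint (shortest path, minimum spanning tree, min cut, ...):

```python
def modular_upper_bound(F, S, V):
    """Weights w such that sum_{e in X} w[e] + const >= F(X), tight at X = S."""
    return {e: (F(S) - F(S - {e})) if e in S else (F({e}) - F(set()))
            for e in V}

def majorize_minimize(F, V, solve_modular, iters=20):
    S = set()
    for _ in range(iters):
        w = modular_upper_bound(F, S, V)
        S_new = solve_modular(w)          # feasible set minimizing sum of w
        if F(S_new) >= F(S) and S:        # no improvement: stop
            break
        S = S_new
    return S
```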


Submodular min in practice

Does a special algorithm apply? symmetric function? graph cut? … approximately?

Continuous methods: convexity, minimum norm point algorithm

Combinatorial algorithms: relatively high complexity

Constraints: hard; majorize-minimize or relaxation

Other techniques [not addressed here]: LP, column generation, …


Outline

What is submodularity?

Optimization: Minimize costs

Maximize utility

Learning; Learning for Optimization: new settings

Part I

Part II

Break!

see you in half an hour