CS898
Topic 4
Robust multi-label regularization
Topic overview
Multi-label problems:
• Stereo, restoration, texture synthesis, multi-object segmentation
Types of pair-wise pixel interactions
• Convex interactions
• Discontinuity preserving interactions
Energy minimization algorithms:
• Ishikawa (convex)
• a-expansions (robust metric interactions)
• ICM, simulated annealing, message passing, etc. (general)
Extra Reading: Szeliski Ch 3.7
Example of a binary labeling problem
Object/background segmentation (topic 2) is an example of a binary labeling problem.
Feasible labels at any pixel p: $L_p \in \{0,1\}$
Labeling of image pixels: $L = (L_1, \ldots, L_{|\mathcal{P}|})$ or, equivalently, $L = \{L_p \mid p \in \mathcal{P}\}$
Segments: $S^1 = \{p \mid L_p = 1\}$, $S^0 = \{p \mid L_p = 0\}$
For convenience, this topic uses $L_p$ (label at p) instead of $S_p$ (segment at p).
Example of a multi-label problem
Stereo (topic 3) is an example of an image labeling problem with non-binary labels.
Feasible disparities at any pixel p: $L_p \in \{0,1,2,3,\ldots,n\}$
Labeling of image pixels: $L = (L_1, \ldots, L_{|\mathcal{P}|})$ or, equivalently, $L = \{L_p \mid p \in \mathcal{P}\}$ (a depth/disparity map).
In topic 3 we used the equivalent notation $d = \{d_p \mid p\}$.
Remember: stereo with s-t graph cuts [Roy&Cox'98]
[figure: image pixels (x, y) with a third axis of disparity labels; an s-t cut through this 3D graph assigns a label L(p) to each pixel p]
Multi-label energy minimization with s-t graph cuts [Ishikawa 1998, 2003]
Exact optimization for convex pair-wise potentials $V(dL)$, $dL = L_p - L_q$.
[figure: plots of convex potentials $V(dL)$ as functions of $dL = L_p - L_q$]
The graph construction for linear interactions extends to "convex" interactions:
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p - L_q)$
Works only for 1D labels $L_p \in \mathbb{R}^1$.
Q: Are convex regularization models good enough for general labeling problems in vision?
A: No
(see the following discussion)
Reconstruction in Vision: (a basic example)
Observed noisy image $I = \{I_1, I_2, \ldots, I_n\}$, shown along one scan line of the image.
Image labeling $L = \{L_1, L_2, \ldots, L_n\}$ (restored intensities).
How to compute L from I?
Energy minimization (discrete approach)
Markov Random Fields (MRF) framework
• weak membrane model (Geman&Geman'84, Blake&Zisserman'83,87)
Labeling $L : \mathbb{Z}^2 \to \mathbb{Z}$ (an integer label $L_p$ at each pixel of the grid)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
data fidelity + spatial regularization
[figure: discontinuity-preserving potentials $V(L_p, L_q)$ truncated at threshold T (Blake&Zisserman'83,87); piece-wise smooth labeling]
Basic pairwise potentials V(α,β):
Convex regularization
• gradient descent works
• exact polynomial algorithms (Ishikawa)
TV (total variation) regularization (extreme case of convex)
• a bit harder (non-differentiable)
• global minima algorithms (Ishikawa, Hochbaum, Nikolova et al.)
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. truncated convex)
• NP-hard, many local minima
• good approximations (message passing, a-expansion)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
Robust pairwise regularization
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. truncated convex)
• NP-hard, many local minima
• good approximations (message passing, a-expansion)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
[figure: neighboring labels $L_p$, $L_q$; piece-wise smooth labeling]
Robust pairwise regularization
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. Ising or Potts model)
• NP-hard, many local minima
• provably good approximations (a-expansion) via maxflow/mincut algorithms
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
"Perceptual grouping": a piece-wise constant labeling (weak membrane) partitions the pixels into segments
$S^0 = \{p : L_p = 0\}$, $S^1 = \{p : L_p = 1\}$, $S^2 = \{p : L_p = 2\}$, ...
Potts model (piece-wise constant labeling)
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. Ising or Potts model)
• NP-hard, many local minima
• provably good approximations (a-expansion) via maxflow/mincut algorithms
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$
Potts potential: $V(L_p, L_q) = C \cdot [L_p \neq L_q]$ (0 if the two labels agree, constant C otherwise)
[figure: left-eye and right-eye stereo images and the resulting piece-wise constant depth layers]
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: plots of V(dL) as a function of dL = L_p - L_q for each model below]
Robust "discontinuity preserving" interactions V:
• Potts model → piecewise constant labeling
• bounded models (truncated convex) → piecewise smooth labeling
"Convex" interactions V:
• "linear" model (TV) → smooth labeling (with some discontinuity robustness)
• "quadratic" model → smooth labeling
See comparison in the next slides.
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
NOTE: optimization of the restoration energy with a quadratic regularization term relates to noise-reduction via mean-filtering:
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} (L_p - L_q)^2$
Indeed, the optimum labeling L satisfies $\frac{\partial E}{\partial L_p} = 2(L_p - I_p) + 2\sum_{q \in N_p}(L_p - L_q) = 0$ for all p.
Can solve this linear system, of course. One approach is fixed point iterations:
$L_p^{t+1} = \frac{I_p + \sum_{q \in N_p} L_q^t}{1 + |N_p|}$
That is, start at $L^0 = I$ and iteratively update each pixel's label to a weighted average of the observed intensity $I_p$ and the mean current label in neighborhood $N_p$.
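A minimal sketch of this fixed-point update on a 1D scan line, assuming unit weights on the data and smoothness terms as in the formula above; the function and variable names are illustrative, not from any course code.

```python
import numpy as np

def restore_quadratic(I, num_iters=200):
    """Fixed-point iteration for E(L) = sum_p (I_p - L_p)^2 + sum_{pq} (L_p - L_q)^2
    on a 1D scan line (each interior pixel has two neighbors)."""
    I = np.asarray(I, dtype=float)
    L = I.copy()                                     # start at L^0 = I
    for _ in range(num_iters):
        nbr_sum = np.zeros_like(L)                   # sum of current neighbor labels
        nbr_cnt = np.zeros_like(L)                   # number of neighbors per pixel
        nbr_sum[:-1] += L[1:];  nbr_cnt[:-1] += 1    # right neighbor
        nbr_sum[1:]  += L[:-1]; nbr_cnt[1:]  += 1    # left neighbor
        # weighted average of observed intensity and mean neighbor label
        L = (I + nbr_sum) / (1.0 + nbr_cnt)
    return L

# usage: a noisy step edge gets smoothed (and over-smoothed at the jump)
noisy = np.concatenate([np.zeros(50), 10 * np.ones(50)]) + np.random.randn(100)
restored = restore_quadratic(noisy)
```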
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: a blurred edge in I along one scan line of the image, and the quadratic restoration L]
Quadratic (convex) regularization may over-smooth (similarly to mean filtering).
NOTE: minimizing the sum of quadratic differences $\sum_{pq \in N} (L_p - L_q)^2$ prefers to split one large jump into many small ones.
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: the same blurred edge restored with linear (TV) regularization; the result resembles histogram equalization]
Quadratic (convex) may over-smooth (similarly to mean filtering).
Linear (TV): minimizing the sum of absolute differences $\sum_{pq \in N} |L_p - L_q|$ does not care how a large jump is split (the sum does not change) => no over-smoothing.
May create a "stair-case"; better robustness (similar to median vs. mean).
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: the blurred edge in I along one scan line, and restorations L with quadratic, linear (TV), and Potts regularization]
Quadratic (convex) may over-smooth (similarly to mean filtering).
Linear (TV) may create a "stair-case".
Bounded (e.g. Potts) may create "false banding" (next slide).
NOTE: minimizing the sum of bounded differences $\sum_{pq \in N} [L_p \neq L_q]$ prefers one large jump to splitting it into smaller ones => restores sharp boundaries.
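A small worked example (not on the slides) makes these preferences explicit: consider a jump of height 6 across one scan line, either taken as a single step or split into three steps of height 2.

$$
\begin{aligned}
\text{quadratic:} \quad & 6^2 = 36 \;>\; 3\cdot 2^2 = 12 && \text{(splitting the jump is cheaper} \Rightarrow \text{over-smoothing)}\\
\text{linear (TV):} \quad & |6| = 6 \;=\; 3\cdot|2| = 6 && \text{(indifferent to splitting)}\\
\text{Potts:} \quad & [6\neq 0] = 1 \;<\; 3\cdot[2\neq 0] = 3 && \text{(one sharp jump is cheapest)}
\end{aligned}
$$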
Pairwise Regularization Models (comparison)
[figure: restoration of noisy images with each of the models below; annotations mark over-smoothing, stair-casing, and banding artifacts]
Convex models: linear (TV), quadratic → smooth labeling
Discontinuity-preserving models: truncated linear, truncated quadratic → piecewise smooth labeling; Potts → piecewise constant labeling
Common artifacts: over-smoothing (quadratic), stair-casing (linear and truncated models), banding (Potts).
Optimization for "discontinuity preserving" models (lots of code available)
NP-hard problem (3 or more labels)
• two labels can be solved via s-t cuts
a-expansion approximation algorithm [BVZ 1998]
• guaranteed approximation quality [Veksler, 2001]
– within a factor of 2 from the global minimum (Potts model)
Many other (small or large) move making algorithms
- a-b swap, jump moves, range moves, fusion moves, etc.
LP relaxations, mean-field approximation, message passing
- e.g. LBP [Weiss & Freeman], TRWS [Kolmogorov & Wainwright]
Other MRF techniques (simulated annealing, ICM)
Variational formulations (continuous)
- e.g. convex approaches [Chambolle, Pock, Cremers, Darbon]
a-expansion move
[figure: label "a" expands into the region currently occupied by other labels]
The basic idea is motivated by methods for the multi-way cut problem (similar to the Potts model):
break the computation into a sequence of binary s-t cuts.
a-expansion (binary move) optimizes a submodular set function
Let $L = \{L_p \mid p\}$ be the current labeling. Any "expansion" of label a corresponds to some subset S of pixels (the shaded area in the slide figure): a binary variable $S_p \in \{0,1\}$ at each pixel encodes the new label
$L_p(S_p) = a$ if $S_p = 1$, and $L_p(S_p) = L_p$ if $S_p = 0$.
The energy of the move is the set function
$\hat{E}(S) = \sum_p E_p(L_p(S_p)) + \sum_{pq \in N} E_{pq}(L_p(S_p), L_q(S_q))$
Unary terms $\hat{E}_p$ as a table over $S_p$:
  $S_p = 0$: $E_p(L_p)$    $S_p = 1$: $E_p(a)$
Pairwise terms $\hat{E}_{pq}$ as a table over $(S_p, S_q)$:
  $(0,0)$: $E_{pq}(L_p, L_q)$   $(0,1)$: $E_{pq}(L_p, a)$
  $(1,0)$: $E_{pq}(a, L_q)$     $(1,1)$: $E_{pq}(a, a)$
a-expansion (binary move) optimizes a submodular set function
The set function $\hat{E}(S)$ is submodular if for every pair $pq \in N$:
$\hat{E}_{pq}(0,0) + \hat{E}_{pq}(1,1) \leq \hat{E}_{pq}(0,1) + \hat{E}_{pq}(1,0)$
Substituting the table above, this becomes
$E_{pq}(L_p, L_q) + E_{pq}(a, a) \leq E_{pq}(L_p, a) + E_{pq}(a, L_q)$
Since $E_{pq}(a,a) = 0$ for a metric, this is just the triangle inequality for $E_{pq}(a,b) = \|a-b\|$.
a-expansion moves are submodular if $E_{pq}(a,b)$ is a metric on the space of labels [BVZ, PAMI 2001].
Examples of metric pairwise interactions:
• $|L_p - L_q|$, which we called the linear or TV potential. This is only a special case of the L2 metric for 1D labels; in general, for labels in $\mathbb{R}^n$ the L2 metric is $\|L_p - L_q\|$ (just check the triangle inequality).
• Truncated L2 is also a metric. FACT (easy to prove): any truncated metric is also a metric.
• Potts is another important example of a metric.
• Quadratic (squared L2) and truncated quadratic potentials are not metrics.
Other very good approximation algorithms are available (e.g. TRWS, Kolmogorov & Wainwright 2006).
Note: unlike Ishikawa, a-expansion and other methods (LBP, TRWS, etc.) apply to labels in arbitrary spaces, not only linearly ordered 1D labels.
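As a quick check (not spelled out on the slides), one can verify the triangle inequality for two of the examples above.

$$
\begin{aligned}
\text{Potts: } & V(a,b) = C\,[a \neq b]. \text{ If } a \neq c \text{ then } b \text{ differs from at least one of } a, c,\\
& \text{so } V(a,c) = C \le V(a,b) + V(b,c).\\[4pt]
\text{Truncated linear: } & V(a,b) = \min(T, |a-b|). \text{ Then}\\
& \min(T,|a-c|) \le \min(T, |a-b| + |b-c|) \le \min(T,|a-b|) + \min(T,|b-c|).
\end{aligned}
$$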
a-expansion algorithm
1. Start with any initial solution
2. For each label "a" in any (e.g. random) order:
   1. Compute the optimal a-expansion move (s-t graph cuts)
   2. Decline the move if there is no energy decrease
3. Stop when no expansion move would decrease the energy
a-expansion moves
[figure: starting from an initial solution, a sequence of expansion moves for different labels progressively relabels the image]
In each a-expansion a given label "a" grabs space from other labels.
For each move we choose the expansion that gives the largest decrease in the energy: a binary optimization problem (see the sketch below).
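A minimal sketch of the outer loop, assuming a helper best_expansion_move(labels, a) that solves the binary move via an s-t cut (e.g. with a maxflow library); that helper, the energy callback, and all names here are hypothetical, not from the BVZ code.

```python
import random

def alpha_expansion(labels, label_set, energy, best_expansion_move, max_sweeps=10):
    """Greedy a-expansion: cycle over labels, accept a move only if it lowers E.
    `energy(labels)` returns E(L); `best_expansion_move(labels, a)` returns the
    optimal binary expansion of label `a` (computed with an s-t graph cut)."""
    current = list(labels)
    best_E = energy(current)
    for _ in range(max_sweeps):
        improved = False
        for a in random.sample(label_set, len(label_set)):  # any (e.g. random) order
            proposal = best_expansion_move(current, a)       # binary s-t cut sub-problem
            E = energy(proposal)
            if E < best_E:                                   # decline if no decrease
                current, best_E, improved = proposal, E, True
        if not improved:                                     # no move decreases E: stop
            break
    return current
```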
Multi-way graph cuts: stereo vision
[figure: original pair of "stereo" images (left and right), computed depth maps (BVZ 1998, KZ 2002), and ground truth]
a-expansions vs. ICM
Iterated Conditional Modes (ICM) is the basic general alternative [Besag, JRSS'86].
Example: consider the pair-wise energy
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V_{pq}(L_p, L_q)$
Unlike a-expansion, ICM optimizes only over a single pixel at a time => local (pixel-wise) optimization.
ICM algorithm:
- Consider any fixed current labeling $L^t$ and any given pixel p.
- Treat label $L_p$ as the only optimization variable $x \in \{0,1,2,\ldots,n\}$, keeping the other labels $L_q$, $q \neq p$, fixed.
- This reduces the energy to $E_p(x) = D_p(x) + \sum_{q \in N_p} V_{pq}(x, L_q)$.
- Select the optimal x by enumeration and set the new label $L_p = x$.
- Iterate over all pixels (in any fixed or random order) until no pixel update reduces the energy.
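A minimal, runnable sketch of ICM for the pairwise energy above on a 4-connected image grid; the Potts smoothness term and the weight lam are illustrative choices, not fixed by the slides.

```python
import numpy as np

def icm(I, num_labels, lam=1.0, max_sweeps=20):
    """ICM for E(L) = sum_p (I_p - L_p)^2 + lam * sum_{pq} [L_p != L_q] (Potts example)."""
    H, W = I.shape
    L = np.rint(I).clip(0, num_labels - 1).astype(int)   # initial labeling
    for _ in range(max_sweeps):
        changed = False
        for y in range(H):
            for x in range(W):
                nbrs = [L[yy, xx] for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1))
                        if 0 <= yy < H and 0 <= xx < W]
                # E_p(k) = D_p(k) + sum over neighbors of V(k, L_q), minimized by enumeration
                costs = [(I[y, x] - k) ** 2 + lam * sum(k != q for q in nbrs)
                         for k in range(num_labels)]
                best = int(np.argmin(costs))
                if best != L[y, x]:
                    L[y, x] = best
                    changed = True
        if not changed:
            break
    return L
```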
a-expansions vs. simulated annealing
Simulated annealing (SA) [Geman&Geman 1984] incorporates the following randomization strategy into ICM (see the ICM steps above): at each pixel p the label is updated randomly according to the probabilities
$\Pr(L_p = x) \;\propto\; \exp\!\left(-\frac{E_p(x)}{T}\right), \qquad x \in \{0,1,2,\ldots,n\}$
(related to "soft-max"; see Szeliski, appendix B.5.1).
NOTE 1 - lower energy $E_p(x)$ gives x more chances to be selected.
NOTE 2 - a higher temperature parameter T means more randomness; as T approaches zero, SA reduces to ICM (the optimal x is always selected).
Like ICM, and unlike a-expansion, SA optimizes only over a single pixel at a time => local (pixel-wise) optimization.
Typical SA starts with a high T and gradually reduces T to zero.
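A minimal sketch of one SA sweep with the softmax-style pixel update above, reusing the Potts-style local energy from the ICM sketch; the cooling schedule and all names are illustrative assumptions.

```python
import numpy as np

def sa_sweep(I, L, num_labels, T, lam=1.0, rng=np.random.default_rng()):
    """One simulated-annealing sweep: each pixel's label is resampled with
    probability proportional to exp(-E_p(x) / T)."""
    H, W = I.shape
    for y in range(H):
        for x in range(W):
            nbrs = [L[yy, xx] for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1))
                    if 0 <= yy < H and 0 <= xx < W]
            Ep = np.array([(I[y, x] - k) ** 2 + lam * sum(k != q for q in nbrs)
                           for k in range(num_labels)])
            p = np.exp(-(Ep - Ep.min()) / T)          # softmax over labels (stabilized)
            L[y, x] = rng.choice(num_labels, p=p / p.sum())
    return L

# typical SA: start with a high temperature and gradually cool towards zero, e.g.
# for T in np.geomspace(10.0, 0.01, num=50): L = sa_sweep(I, L, num_labels, T)
```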
a-expansions vs. simulated annealing: small vs. large moves
[figure: starting from an initial labeling, a single-pixel local move (ICM or SA) vs. large moves (a-b swap, a-expansion)]
a-expansion and ICM/SA are greedy iterative methods converging to different kinds of "local" minima:
- an ICM/SA solution cannot be improved by changing the label of any one pixel to any given label a
- an a-expansion solution cannot be improved by changing any subset of pixels to any given label a
a-expansions vs. simulated annealing
[figure: stereo results]
- normalized correlation (the starting point for annealing): 24.7% error
- simulated annealing: 19 hours, 20.3% error
- a-expansions [BVZ 1998, 2001]: 90 seconds, 5.8% error
NOTE 1: ICM and SA are general methods applicable to arbitrary non-metric and high-order energies.
NOTE 2: nowadays there are other general methods based on graph cuts, message passing, relaxations, etc.
[plot: smoothness energy vs. time in seconds (log scale), comparing simulated annealing with a-expansion ("our method") [BVZ, 2001]]
Other applications
Graph-cut textures (Kwatra, Schodl, Essa, Bobick 2003)
similar to "image-quilting" (Efros & Freeman, 2001)
[figure: overlapping image patches labeled A-J stitched along graph-cut seams]
Other applications
Multi-object Extraction
Obvious generalization of the binary object extraction technique [BJ'01]
Some computational photography applications
Image compositing (Agarwala et al. 2004, see Szeliski Sec 9.3.2)
Color model fitting (multi-label version of Chan-Vese)
$E(L, I^0, I^1, \ldots) = \sum_{p:\,L_p=0} (I_p - I^0)^2 + \sum_{p:\,L_p=1} (I_p - I^1)^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting colors $I^i$.
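A minimal sketch of this block-coordinate descent, assuming the alpha_expansion routine sketched earlier (or any other Potts solver) is wrapped in a `segment` callback; the grayscale setup and all names are illustrative.

```python
import numpy as np

def fit_colors(I, L, num_models):
    """Refit each color model I^i to the mean intensity of its segment."""
    return np.array([I[L == i].mean() if np.any(L == i) else I.mean()
                     for i in range(num_models)])

def chan_vese_multilabel(I, num_models, segment, max_iters=10):
    """Block-coordinate descent: alternate segmentation (for fixed colors) and
    color refitting (for fixed segmentation). `segment(data_costs)` is assumed to
    minimize sum_p D_p(L_p) + Potts smoothness, e.g. via a-expansion or ICM."""
    L = np.random.randint(0, num_models, size=I.shape)   # arbitrary initial labeling
    for _ in range(max_iters):
        colors = fit_colors(I, L, num_models)            # fit I^i for fixed L
        data_costs = (I[..., None] - colors) ** 2        # D_p(i) = (I_p - I^i)^2
        L = segment(data_costs)                          # fix colors, update L
    return L, colors
```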
Stereo via piece-wise constant plane fitting [Birchfield & Tomasi 1999]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting affine transforms $T^i$.
Models T = parameters of affine transformations $T(p) = A\,p + b$ (A is 2x2, b is 2x1).
$E(L, T^0, T^1, \ldots) = \sum_{p:\,L_p=0} (I'_{T^0(p)} - I_p)^2 + \sum_{p:\,L_p=1} (I'_{T^1(p)} - I_p)^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Piece-wise smooth local plane fitting [Olsson et al. 2013]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting planes $T^i$.
$E(L, T^0, T^1, \ldots) = \sum_{p:\,L_p=0} (I'_{T^0(p)} - I_p)^2 + \sum_{p:\,L_p=1} (I'_{T^1(p)} - I_p)^2 + \ldots + \sum_{pq \in N} w_{pq}\,V(L_p, L_q)$
where V is based on truncated angle-differences between the local planes: such non-metric interactions need other optimization.
Signboard segmentation [Milevsky 2013]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting color-space planes $C^i$.
Labels = planes in RGBXY space: $C(p) = A\,p + b$ (A is 3x2, b is 3x1).
$E(L, C^0, C^1, \ldots) = \sum_{p:\,L_p=0} \|C^0(p) - I_p\|^2 + \sum_{p:\,L_p=1} \|C^1(p) - I_p\|^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Goal: detection of characters, then text line fitting and translation.
Vessel extraction: 3D reconstruction of heart vessel center lines [Marin et al. ICCV 2015]
$E(L) = \sum_p \|L_p - p\|^2 + \sum_{pq \in N} V(L_p, L_q)$, where V is based on truncated angle-differences.
Learning pair-wise potentials
Structure detection (Kumar and Hebert 2006, see Szeliski Sec 3.7)
[figure: results with standard (hand-tuned) pair-wise potentials vs. learned pair-wise potentials]
MRF/CRF models in CNN segmentation
- Postprocessing of the results
  - e.g. improves edge alignment lost due to reduced resolution in CNNs
- Integrated as trainable or pass-through layers - RNNs, GraphNNs
  - typically mimicking convolution-like local optimization operations (e.g. message passing)
  - currently limited to relatively weak regularization models/optimization (e.g. dense-CRF due to its better amenability to simpler optimizers)
- Weakly supervised, unsupervised training
  - proposal generation
  - loss functions
  - stronger algorithms may be used in loss optimization
Next
Dense CRF, mean-field approximation?
Relaxations?
Geometric model fitting? Multi-part object fitting?
Single-view reconstruction?
Intro to learning, detection, CNN segmentation?
Student presentations
Multi-label Problems: Linear/Unary, Quadratic, and other Approximations and Relaxations
Approximations and Relaxations: unary/linear, quadratic, etc…
Observation: as discussed earlier in Topic 2, the arity of an energy potential corresponds to the order of the polynomial expressing that potential via indicator variables. For example, unary potentials can be interpreted as linear functions. Indeed, with indicators $x_p^k \in \{0,1\}$ (indicating that p is assigned label k),
$\sum_p D_p(L_p) = \sum_p \sum_k D_p(k)\, x_p^k,$
and relaxing the indicators to the simplex yields a linear relaxation where $x_p^k$ are "probabilities" of labels at point p.
However, an objective/energy/loss may have equivalent representations or relaxations of different order (see the example below).
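To make the "different order" remark concrete, here is the full pairwise energy written via the same indicators (a standard identity, not copied from the slide): the pairwise term is quadratic in the x variables.

$$
E(L) \;=\; \sum_p \sum_k D_p(k)\,x_p^k \;+\; \sum_{pq\in N} \sum_{k,m} V(k,m)\, x_p^k\, x_q^m ,
\qquad x_p^k = [L_p = k].
$$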
Approximations and Relaxations: unary/linear, quadratic, etc…
Local "linearization":
- gradient descent
- parallel ICM
Global "linearization":
- Schlesinger LP
- QPBO
- TRWS
Probabilistic "linearization":
- unary mean field approximation
- deterministic annealing
Higher-order approaches:
- quadratic IP
- quadratic relaxations
Due to simplicity, linear approximations are particularly common.
Local linearization: first-order approximation (gradient descent)
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Assume real-valued labels and some continuous functions D and V. Steepest descent updates
$L_p^{t+1} = L_p^t - \Delta t \cdot \frac{\partial E}{\partial L_p}, \qquad \frac{\partial E}{\partial L_p} = D_p'(L_p) + \sum_{q \in N_p} \frac{\partial V(L_p, L_q)}{\partial L_p},$
where the neighbor terms act as "messages" from neighboring nodes.
Questions: step size Δt, labels may not be real-valued, relaxations of functions D and V may not be obvious.
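A minimal sketch of this steepest descent for the squared-deviation restoration energy on a 1D scan line (so both D and V are differentiable); the step size and names are illustrative assumptions.

```python
import numpy as np

def gradient_descent_restore(I, lam=1.0, step=0.1, num_iters=500):
    """Steepest descent on E(L) = sum_p (L_p - I_p)^2 + lam * sum_{pq} (L_p - L_q)^2."""
    I = np.asarray(I, dtype=float)
    L = I.copy()
    for _ in range(num_iters):
        grad = 2.0 * (L - I)                 # derivative of the data term
        diff = np.diff(L)                    # L_{p+1} - L_p for each neighboring pair
        grad[:-1] += -2.0 * lam * diff       # "messages" from the right neighbors
        grad[1:]  +=  2.0 * lam * diff       # "messages" from the left neighbors
        L -= step * grad
    return L
```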
Local linearization: Parallel ICM
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Note: for simplicity, here we assume the sum is over directed pairs.
Consider the decomposition into local energies $E_p(x) = D_p(x) + \sum_{q \in N_p} V(x, L_q)$, the same as in ICM (see earlier). At any given current labeling this gives a unary approximation $E(L) \approx \sum_p E_p(L_p)$ with the neighbors fixed at their current labels.
Question: Minimizing this unary approximation is easy, but in what sense is the approximation good?
In fact, in this case the original energy E may go up after parallel ICM (minimizing the right-hand side). Why?
Probabilistic "linearization": Mean-Field Approximation
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
GOAL: find a simpler (e.g. unary) energy $U(L) = \sum_p U_p(L_p)$ "approximating" E.
One general approximation technique: mean-field approximation [see more in Bishop, Chapter 10.1, on variational inference].
Find U minimizing the KL-divergence between the Gibbs distributions corresponding to energies E and U:
$G(L) = \frac{1}{Z}\exp(-E(L)), \qquad G_U(L) = \frac{1}{Z_U}\exp(-U(L))$
Comments:
- The Gibbs distribution G(L) defines the probability of state L with given energy E(L); Z and $Z_U$ are the corresponding normalization constants.
- KL-divergence is a distance measure between two distributions (already used in Topic 2). In this case it also implicitly defines the quality of the approximation of energy E by U.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation:
NOTE: a unary energy U corresponds to a "factorized" Gibbs distribution $G_U(L) = \prod_p b_p(L_p)$ (independent variables $L_p$), where each factor
$b_p(L_p) \propto \exp(-U_p(L_p))$
is a "soft-max" of the potentials $U_p$ (probabilities / beliefs).
The optimal distributions $b_p$ that give minimum KL divergence between $G_U$ and G are also called pseudo-marginals for the joint distribution G. They are also often called simply "marginals" or "marginal distributions", even though they are not.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation:
A common approximate algorithm optimizing the factorized distribution $b = \{b_p\}$ for any given energy E is greedy coordinate descent:
1. optimize over $b_{p_1}$ keeping the other factors fixed
2. optimize over $b_{p_2}$ keeping the other factors fixed
3. …
NOTE: greedy coordinate descent is also used in ICM, but there w.r.t. the labels $L_p$ rather than the distributions $b_p$.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation: consider one iteration optimizing over $b_p$ at any given point p.
$KL(G_U \,\|\, G) \;=\; \underbrace{\mathbb{E}_{G_U}[E(L)]}_{\text{mean energy } E \text{ w.r.t. } G_U} \;+\; \underbrace{\textstyle\sum_L G_U(L)\log G_U(L)}_{\text{(negative) entropy of } G_U} \;+\; \text{const}$
The first two terms are the so-called Mean Field free energy; the constant ($\log Z$) does not depend on U.
The entropy of the joint distribution for independent variables equals the sum of the per-node entropies (an easy-to-prove standard fact), so when optimizing over $b_p$ only the entropy of $b_p$ matters; the remaining terms reduce to a KL divergence between the probability distribution $b_p$ and a distribution defined by the mean energy (remember from probability: KL divergence is non-negative for probability distributions).
The smallest value of this KL divergence (zero) is achieved when
$b_p(L_p) \;\propto\; \exp\Big(-\,\mathbb{E}_{\,b_q,\,q \neq p}\big[\,E(L)\,\big|\,L_p\,\big]\Big),$
or equivalently [Bishop, eq.(10.9)]
$\ln b_p(L_p) \;=\; \mathbb{E}_{\,q \neq p}\big[-E(L)\big] + \text{const}.$
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation: general coordinate descent formulas for estimating the mean-field approximation [Bishop] (a.k.a. variational inference):
update the belief $b_p$ at p for each label $L_p$ according to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes,
$b_p(L_p) \;\propto\; \exp\Big(-\,\mathbb{E}_{\,b_q,\,q \neq p}\big[\,E(L)\,\big|\,L_p\,\big]\Big),$
or equivalently [Bishop, eq.(10.9)] in terms of distributions, which is useful if one is directly interested in approximating (factorizing) complex distributions.
Remember our original goal: a unary energy U approximating E. In terms of U, the update sets $U_p(L_p)$ to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes.
Probabilistic "linearization": Mean-Field Approximation
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Mean-field approximation updates for pairwise energies: sequential updates for the beliefs b according to "messages" from neighboring nodes. Update the belief $b_p$ at p for each label $L_p$ according to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes:
$b_p(L_p) \;\propto\; \exp\Big(-D_p(L_p) - \sum_{q \in N_p} \sum_{L_q} b_q(L_q)\, V(L_p, L_q)\Big)$
NOTE: this converges to a "fixed point", but… may not converge if the updates are made in parallel.
Question: once (approximately) optimal factorized distributions $\{b_p\}$ are known, how can one estimate the labels $\{L_p\}$?
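A minimal, runnable sketch of these sequential mean-field updates for a Potts pairwise term on a 4-connected grid; the unary cost array D and the weight lam are illustrative inputs, not from the slides.

```python
import numpy as np

def mean_field_potts(D, lam=1.0, num_sweeps=10):
    """Sequential mean-field updates for E(L) = sum_p D_p(L_p) + lam*sum_{pq}[L_p != L_q].
    D has shape (H, W, K); returns beliefs b of the same shape."""
    H, W, K = D.shape
    b = np.full((H, W, K), 1.0 / K)                  # start from uniform beliefs
    for _ in range(num_sweeps):
        for y in range(H):
            for x in range(W):
                # expected Potts penalty with neighbor q is lam * (1 - b_q(k)) for label k
                msg = np.zeros(K)
                for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1)):
                    if 0 <= yy < H and 0 <= xx < W:
                        msg += lam * (1.0 - b[yy, xx])
                logit = -(D[y, x] + msg)             # mean energy for each label L_p
                logit -= logit.max()                 # stabilize the soft-max
                e = np.exp(logit)
                b[y, x] = e / e.sum()                # updated belief b_p
    return b

# labels can then be estimated, e.g., as L_p = argmax_k b_p(k)
```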
Messages for beliefs and labels
Example: pairwise (or second-order) energy.
- Mean-field approximation locally averages the energy for each $L_p$ (w.r.t. the current neighbor beliefs $b_q$) and updates the current beliefs $b_p$, a table with one element for each $L_p$.
- ICM locally optimizes over the labels $L_p$ (w.r.t. the current neighbor labels $L_q$) and updates the current labels $L_p$.
Example: squared deviation potentials (e.g. in restoration). In this case the mean-field message effectively averages the neighbors' labels, e.g. it locally averages the labels $L_p$.
Dense CRF
Densely connected pairwise Potts model (see Topic 2).
Due to the graph density, graph cuts are not practical. Uses the mean-field approximation and bilateral filtering for a significant speed-up [Koltun et al, NIPS11].
NOTE: dense CRF is a weaker regularization model, but it is an easier objective, amenable to greedy approximate optimization methods such as the mean field technique.
Deterministic Annealing
Mean field approximation with an additional temperature parameter T: the free energy becomes the average energy E w.r.t. distribution $G_U$ plus T times the (negative) entropy of $G_U$.
Observation: finding optimal beliefs is easier for larger T (the convex entropy term dominates, easy) and harder for smaller T (the non-convex average-energy term dominates, hard). Why?
General Idea for Deterministic Annealing:
- start from large T and uniform beliefs b
- gradually decrease the temperature while updating the beliefs (e.g. using mean-field)
- eventually the Gibbs distribution $G_T$ should be concentrated around the globally optimal labeling L, and the (approximate) beliefs $b_T$ "may" reflect that (no guarantees).
Loopy belief propagation (loopy BP)
One class of iterative message-passing inference methods (BP) uses messages derived from dynamic programming on non-loopy graphs (chains, trees):
- exact optimization on non-loopy graphs
- sum-product and max-product variants
- no guarantees on loopy graphs (may oscillate), but works well in some applications.
Technical operations inside many optimization algorithms can be represented via "local messages", e.g. in gradient descent, ICM, graph cuts, …, including dynamic programming (e.g. Viterbi) on chains/trees.
Global linearization: Linear Programming (LP) Relaxations
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V_{pq}(L_p, L_q)$.
Introduce binary indicators:
$x_p^k \in \{0,1\}$ indicates if p is assigned label k (as in Topic 1), and
$x_{pq}^{km} \in \{0,1\}$ indicates if the pair p,q is assigned labels k,m.
First, formulate an equivalent Integer Linear Program (ILP) as follows (a standard formulation is sketched below)…
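The ILP itself did not survive the transcript; the following is the standard formulation the slide refers to, written in the notation above.

$$
\begin{aligned}
\min_{x}\quad & \sum_p \sum_k D_p(k)\, x_p^k \;+\; \sum_{pq \in N} \sum_{k,m} V_{pq}(k,m)\, x_{pq}^{km}\\
\text{s.t.}\quad & \sum_k x_p^k = 1 \;\;\forall p, \qquad
\sum_m x_{pq}^{km} = x_p^k, \quad \sum_k x_{pq}^{km} = x_q^m \;\;\forall pq \in N,\\
& x_p^k \in \{0,1\}, \qquad x_{pq}^{km} \in \{0,1\}.
\end{aligned}
$$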
Global linearization: Linear Programming (LP) Relaxations
Relaxing the integrality constraints (to $x \in [0,1]$) gives the Schlesinger LP relaxation (1972).
Global linearization: Linear Programming (LP) Relaxations
• binary labeling case - Quadratic Pseudo-Boolean Optimization (QPBO) [Hammer et al. 1984, Boros 2002]
  - the relaxation has half-integral solutions, such that $x_p^k \in \{0, \tfrac{1}{2}, 1\}$
  - submodular problems yield a fully integral solution (global optimum)
• arbitrary number of labels
  - typical algorithms focus on a dual LP
  - block-coordinate ascent, message passing
  - tree-structured block-coordinate ascent: TRWS [Kolmogorov & Wainwright 2006]
  - guaranteed global optima for a class of "multi-label submodular" problems
Global vs Local Linearization
Local Linearization: parallel ICM, gradient descent
Global Linearization: Schlesinger LP, QPBO, TRWS
Quadratic Relaxations and Approximations
Example: squared deviations image restoration energy
Can easily work with real valued labels, if desired
Quadratic Relaxations and Approximations
Example: pairwise Potts energy. With indicators $x_p^k$ (indicating if p is assigned label k, as before), the Potts term can be written via "saddle" functions $f(x,y) = -xy$; the relaxed expression equals the original Potts energy for integer (one-hot) indicators (see the worked form below).
Lots of other options for relaxing the Potts model.
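A worked form of one such quadratic relaxation (my reconstruction of the standard identity, not copied from the slide): for one-hot indicator vectors $x_p = (x_p^1, \ldots, x_p^K)$,

$$
\sum_{pq \in N} w_{pq}\,[L_p \neq L_q] \;=\; \sum_{pq \in N} w_{pq}\,\big(1 - \langle x_p, x_q\rangle\big)
\;=\; \sum_{pq \in N} w_{pq}\Big(1 + \sum_k f(x_p^k, x_q^k)\Big), \qquad f(x,y) = -xy,
$$

which becomes a (non-convex, saddle-type) quadratic function once the indicators are relaxed to the simplex.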
Quadratic Relaxations and Approximations
- Natural for labeling problems with labels in $\mathbb{R}^n$ (e.g. restoration, stereo, optical flow)
- Quadratic relaxations for the Potts model (segmentation)
- Submodular approximations