CS898
Topic 4
Robust multi-label regularization
Topic overview
Multi-label problems:
• Stereo, restoration, texture synthesis, multi-object segmentation
Types of pair-wise pixel interactions
• Convex interactions
• Discontinuity preserving interactions
Energy minimization algorithms:
• Ishikawa (convex)
• a-expansions (robust metric interactions)
• ICM, simulated annealing, message passing, etc. (general)
Extra Reading: Szeliski Ch 3.7
Example of a binary labeling problem
Object/background segmentation (topic 2) is an example of a binary labeling problem.
Feasible labels at any pixel p: $L_p \in \{0,1\}$
Labeling of image pixels: $L = (L_1, \ldots, L_{|\mathcal{P}|})$ or, equivalently, $L = \{L_p \mid p \in \mathcal{P}\}$
Segments: $S^1 = \{p \mid L_p = 1\}$, $S^0 = \{p \mid L_p = 0\}$
For convenience, this topic uses $L_p$ (label at p) instead of $S_p$ (segment at p).
Example of a multi-label problem
Stereo (topic 3) is an example of an image labeling problem with non-binary labels.
Feasible disparities at any pixel p: $L_p \in \{0,1,2,3,\ldots,n\}$
Labeling of image pixels: $L = (L_1, \ldots, L_{|\mathcal{P}|})$ or, equivalently, $L = \{L_p \mid p \in \mathcal{P}\}$ (a depth/disparity map).
In topic 3 we used the equivalent notation $d = \{d_p \mid p\}$.
Remember: stereo with s-t graph cuts [Roy&Cox'98]
[figure: image pixels (x, y) with a third axis of disparity labels; an s-t cut through this 3D graph assigns a label L(p) to each pixel p]
Multi-label energy minimization with s-t graph cuts [Ishikawa 1998, 2003]
Exact optimization for convex pair-wise potentials $V(dL)$, $dL = L_p - L_q$.
[figure: plots of convex potentials $V(dL)$ as functions of $dL = L_p - L_q$]
The graph construction for linear interactions extends to "convex" interactions:
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p - L_q)$
Works only for 1D labels $L_p \in \mathbb{R}^1$.
Q: Are convex regularization models good enough for general labeling problems in vision?
A: No
(see the following discussion)
Reconstruction in Vision: (a basic example)
Observed noisy image $I = \{I_1, I_2, \ldots, I_n\}$, shown along one scan line of the image.
Image labeling $L = \{L_1, L_2, \ldots, L_n\}$ (restored intensities).
How to compute L from I?
Energy minimization (discrete approach)
Markov Random Fields (MRF) framework
• weak membrane model (Geman&Geman'84, Blake&Zisserman'83,87)
Labeling $L : \mathbb{Z}^2 \to \mathbb{Z}$ (an integer label $L_p$ at each pixel of the grid)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
data fidelity + spatial regularization
[figure: discontinuity-preserving potentials $V(L_p, L_q)$ truncated at threshold T (Blake&Zisserman'83,87); piece-wise smooth labeling]
Basic pairwise potentials V(α,β):
Convex regularization
• gradient descent works
• exact polynomial algorithms (Ishikawa)
TV (total variation) regularization (extreme case of convex)
• a bit harder (non-differentiable)
• global minima algorithms (Ishikawa, Hochbaum, Nikolova et al.)
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. truncated convex)
• NP-hard, many local minima
• good approximations (message passing, a-expansion)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
Robust pairwise regularization
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. truncated convex)
• NP-hard, many local minima
• good approximations (message passing, a-expansion)
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
[figure: neighboring labels $L_p$, $L_q$; piece-wise smooth labeling]
Robust pairwise regularization
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. Ising or Potts model)
• NP-hard, many local minima
• provably good approximations (a-expansion) via maxflow/mincut algorithms
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} V(L_p, L_q)$
"Perceptual grouping": a piece-wise constant labeling (weak membrane) partitions the pixels into segments
$S^0 = \{p : L_p = 0\}$, $S^1 = \{p : L_p = 1\}$, $S^2 = \{p : L_p = 2\}$, ...
Potts model (piece-wise constant labeling)
Robust regularization ("discontinuity-preserving")
• bounded potentials (e.g. Ising or Potts model)
• NP-hard, many local minima
• provably good approximations (a-expansion) via maxflow/mincut algorithms
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$
Potts potential: $V(L_p, L_q) = C \cdot [L_p \neq L_q]$ (0 if the two labels agree, constant C otherwise)
[figure: left-eye and right-eye stereo images and the resulting piece-wise constant depth layers]
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: plots of V(dL) as a function of dL = L_p - L_q for each model below]
Robust "discontinuity preserving" interactions V:
• Potts model → piecewise constant labeling
• bounded models (truncated convex) → piecewise smooth labeling
"Convex" interactions V:
• "linear" model (TV) → smooth labeling (with some discontinuity robustness)
• "quadratic" model → smooth labeling
See comparison in the next slides.
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
NOTE: optimization of the restoration energy with a quadratic regularization term relates to noise-reduction via mean-filtering:
$E(L) = \sum_p (I_p - L_p)^2 + \sum_{pq \in N} (L_p - L_q)^2$
Indeed, the optimum labeling L satisfies $\frac{\partial E}{\partial L_p} = 2(L_p - I_p) + 2\sum_{q \in N_p}(L_p - L_q) = 0$ for all p.
Can solve this linear system, of course. One approach is fixed point iterations:
$L_p^{t+1} = \frac{I_p + \sum_{q \in N_p} L_q^t}{1 + |N_p|}$
That is, start at $L^0 = I$ and iteratively update each pixel's label to a weighted average of the observed intensity $I_p$ and the mean current label in neighborhood $N_p$.
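A minimal sketch of this fixed-point update on a 1D scan line, assuming unit weights on the data and smoothness terms as in the formula above; the function and variable names are illustrative, not from any course code.

```python
import numpy as np

def restore_quadratic(I, num_iters=200):
    """Fixed-point iteration for E(L) = sum_p (I_p - L_p)^2 + sum_{pq} (L_p - L_q)^2
    on a 1D scan line (each interior pixel has two neighbors)."""
    I = np.asarray(I, dtype=float)
    L = I.copy()                                     # start at L^0 = I
    for _ in range(num_iters):
        nbr_sum = np.zeros_like(L)                   # sum of current neighbor labels
        nbr_cnt = np.zeros_like(L)                   # number of neighbors per pixel
        nbr_sum[:-1] += L[1:];  nbr_cnt[:-1] += 1    # right neighbor
        nbr_sum[1:]  += L[:-1]; nbr_cnt[1:]  += 1    # left neighbor
        # weighted average of observed intensity and mean neighbor label
        L = (I + nbr_sum) / (1.0 + nbr_cnt)
    return L

# usage: a noisy step edge gets smoothed (and over-smoothed at the jump)
noisy = np.concatenate([np.zeros(50), 10 * np.ones(50)]) + np.random.randn(100)
restored = restore_quadratic(noisy)
```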
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: a blurred edge in I along one scan line of the image, and the quadratic restoration L]
Quadratic (convex) regularization may over-smooth (similarly to mean filtering).
NOTE: minimizing the sum of quadratic differences $\sum_{pq \in N} (L_p - L_q)^2$ prefers to split one large jump into many small ones.
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: the same blurred edge restored with linear (TV) regularization; the result resembles histogram equalization]
Quadratic (convex) may over-smooth (similarly to mean filtering).
Linear (TV): minimizing the sum of absolute differences $\sum_{pq \in N} |L_p - L_q|$ does not care how a large jump is split (the sum does not change) => no over-smoothing.
May create a "stair-case"; better robustness (similar to median vs. mean).
Pairwise interactions V: "convex" vs. "discontinuity-preserving"
[figure: the blurred edge in I along one scan line, and restorations L with quadratic, linear (TV), and Potts regularization]
Quadratic (convex) may over-smooth (similarly to mean filtering).
Linear (TV) may create a "stair-case".
Bounded (e.g. Potts) may create "false banding" (next slide).
NOTE: minimizing the sum of bounded differences $\sum_{pq \in N} [L_p \neq L_q]$ prefers one large jump to splitting it into smaller ones => restores sharp boundaries.
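A small worked example (not on the slides) makes these preferences explicit: consider a jump of height 6 across one scan line, either taken as a single step or split into three steps of height 2.

$$
\begin{aligned}
\text{quadratic:} \quad & 6^2 = 36 \;>\; 3\cdot 2^2 = 12 && \text{(splitting the jump is cheaper} \Rightarrow \text{over-smoothing)}\\
\text{linear (TV):} \quad & |6| = 6 \;=\; 3\cdot|2| = 6 && \text{(indifferent to splitting)}\\
\text{Potts:} \quad & [6\neq 0] = 1 \;<\; 3\cdot[2\neq 0] = 3 && \text{(one sharp jump is cheapest)}
\end{aligned}
$$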
Pairwise Regularization Models (comparison)
[figure: restoration of noisy images with each of the models below; annotations mark over-smoothing, stair-casing, and banding artifacts]
Convex models: linear (TV), quadratic → smooth labeling
Discontinuity-preserving models: truncated linear, truncated quadratic → piecewise smooth labeling; Potts → piecewise constant labeling
Common artifacts: over-smoothing (quadratic), stair-casing (linear and truncated models), banding (Potts).
Optimization for "discontinuity preserving" models (lots of code available)
NP-hard problem (3 or more labels)
• two labels can be solved via s-t cuts
a-expansion approximation algorithm [BVZ 1998]
• guaranteed approximation quality [Veksler, 2001]
– within a factor of 2 from the global minimum (Potts model)
Many other (small or large) move making algorithms
- a-b swap, jump moves, range moves, fusion moves, etc.
LP relaxations, mean-field approximation, message passing
- e.g. LBP [Weiss & Freeman], TRWS [Kolmogorov & Wainwright]
Other MRF techniques (simulated annealing, ICM)
Variational formulations (continuous)
- e.g. convex approaches [Chambolle, Pock, Cremers, Darbon]
a-expansion move
[figure: label "a" expands into the region currently occupied by other labels]
The basic idea is motivated by methods for the multi-way cut problem (similar to the Potts model):
break the computation into a sequence of binary s-t cuts.
a-expansion (binary move) optimizes a submodular set function
Let $L = \{L_p \mid p\}$ be the current labeling. Any "expansion" of label a corresponds to some subset S of pixels (the shaded area in the slide figure): a binary variable $S_p \in \{0,1\}$ at each pixel encodes the new label
$L_p(S_p) = a$ if $S_p = 1$, and $L_p(S_p) = L_p$ if $S_p = 0$.
The energy of the move is the set function
$\hat{E}(S) = \sum_p E_p(L_p(S_p)) + \sum_{pq \in N} E_{pq}(L_p(S_p), L_q(S_q))$
Unary terms $\hat{E}_p$ as a table over $S_p$:
  $S_p = 0$: $E_p(L_p)$    $S_p = 1$: $E_p(a)$
Pairwise terms $\hat{E}_{pq}$ as a table over $(S_p, S_q)$:
  $(0,0)$: $E_{pq}(L_p, L_q)$   $(0,1)$: $E_{pq}(L_p, a)$
  $(1,0)$: $E_{pq}(a, L_q)$     $(1,1)$: $E_{pq}(a, a)$
a-expansion (binary move) optimizes a submodular set function
The set function $\hat{E}(S)$ is submodular if for every pair $pq \in N$:
$\hat{E}_{pq}(0,0) + \hat{E}_{pq}(1,1) \leq \hat{E}_{pq}(0,1) + \hat{E}_{pq}(1,0)$
Substituting the table above, this becomes
$E_{pq}(L_p, L_q) + E_{pq}(a, a) \leq E_{pq}(L_p, a) + E_{pq}(a, L_q)$
Since $E_{pq}(a,a) = 0$ for a metric, this is just the triangle inequality for $E_{pq}(a,b) = \|a-b\|$.
a-expansion moves are submodular if $E_{pq}(a,b)$ is a metric on the space of labels [BVZ, PAMI 2001].
Examples of metric pairwise interactions:
• $|L_p - L_q|$, which we called the linear or TV potential. This is only a special case of the L2 metric for 1D labels; in general, for labels in $\mathbb{R}^n$ the L2 metric is $\|L_p - L_q\|$ (just check the triangle inequality).
• Truncated L2 is also a metric. FACT (easy to prove): any truncated metric is also a metric.
• Potts is another important example of a metric.
• Quadratic (squared L2) and truncated quadratic potentials are not metrics.
Other very good approximation algorithms are available (e.g. TRWS, Kolmogorov & Wainwright 2006).
Note: unlike Ishikawa, a-expansion and other methods (LBP, TRWS, etc.) apply to labels in arbitrary spaces, not only linearly ordered 1D labels.
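As a quick check (not spelled out on the slides), one can verify the triangle inequality for two of the examples above.

$$
\begin{aligned}
\text{Potts: } & V(a,b) = C\,[a \neq b]. \text{ If } a \neq c \text{ then } b \text{ differs from at least one of } a, c,\\
& \text{so } V(a,c) = C \le V(a,b) + V(b,c).\\[4pt]
\text{Truncated linear: } & V(a,b) = \min(T, |a-b|). \text{ Then}\\
& \min(T,|a-c|) \le \min(T, |a-b| + |b-c|) \le \min(T,|a-b|) + \min(T,|b-c|).
\end{aligned}
$$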
a-expansion algorithm
1. Start with any initial solution
2. For each label "a" in any (e.g. random) order:
   1. Compute the optimal a-expansion move (s-t graph cuts)
   2. Decline the move if there is no energy decrease
3. Stop when no expansion move would decrease the energy
a-expansion moves
[figure: starting from an initial solution, a sequence of expansion moves for different labels progressively relabels the image]
In each a-expansion a given label "a" grabs space from other labels.
For each move we choose the expansion that gives the largest decrease in the energy: a binary optimization problem (see the sketch below).
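A minimal sketch of the outer loop, assuming a helper best_expansion_move(labels, a) that solves the binary move via an s-t cut (e.g. with a maxflow library); that helper, the energy callback, and all names here are hypothetical, not from the BVZ code.

```python
import random

def alpha_expansion(labels, label_set, energy, best_expansion_move, max_sweeps=10):
    """Greedy a-expansion: cycle over labels, accept a move only if it lowers E.
    `energy(labels)` returns E(L); `best_expansion_move(labels, a)` returns the
    optimal binary expansion of label `a` (computed with an s-t graph cut)."""
    current = list(labels)
    best_E = energy(current)
    for _ in range(max_sweeps):
        improved = False
        for a in random.sample(label_set, len(label_set)):  # any (e.g. random) order
            proposal = best_expansion_move(current, a)       # binary s-t cut sub-problem
            E = energy(proposal)
            if E < best_E:                                   # decline if no decrease
                current, best_E, improved = proposal, E, True
        if not improved:                                     # no move decreases E: stop
            break
    return current
```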
Multi-way graph cuts: stereo vision
[figure: original pair of "stereo" images (left and right), computed depth maps (BVZ 1998, KZ 2002), and ground truth]
a-expansions vs. ICM
Iterated Conditional Modes (ICM) is the basic general alternative [Besag, JRSS'86].
Example: consider the pair-wise energy
$E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V_{pq}(L_p, L_q)$
Unlike a-expansion, ICM optimizes only over a single pixel at a time => local (pixel-wise) optimization.
ICM algorithm:
- Consider any fixed current labeling $L^t$ and any given pixel p.
- Treat label $L_p$ as the only optimization variable $x \in \{0,1,2,\ldots,n\}$, keeping the other labels $L_q$, $q \neq p$, fixed.
- This reduces the energy to $E_p(x) = D_p(x) + \sum_{q \in N_p} V_{pq}(x, L_q)$.
- Select the optimal x by enumeration and set the new label $L_p = x$.
- Iterate over all pixels (in any fixed or random order) until no pixel update reduces the energy.
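A minimal, runnable sketch of ICM for the pairwise energy above on a 4-connected image grid; the Potts smoothness term and the weight lam are illustrative choices, not fixed by the slides.

```python
import numpy as np

def icm(I, num_labels, lam=1.0, max_sweeps=20):
    """ICM for E(L) = sum_p (I_p - L_p)^2 + lam * sum_{pq} [L_p != L_q] (Potts example)."""
    H, W = I.shape
    L = np.rint(I).clip(0, num_labels - 1).astype(int)   # initial labeling
    for _ in range(max_sweeps):
        changed = False
        for y in range(H):
            for x in range(W):
                nbrs = [L[yy, xx] for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1))
                        if 0 <= yy < H and 0 <= xx < W]
                # E_p(k) = D_p(k) + sum over neighbors of V(k, L_q), minimized by enumeration
                costs = [(I[y, x] - k) ** 2 + lam * sum(k != q for q in nbrs)
                         for k in range(num_labels)]
                best = int(np.argmin(costs))
                if best != L[y, x]:
                    L[y, x] = best
                    changed = True
        if not changed:
            break
    return L
```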
a-expansions vs. simulated annealing
Simulated annealing (SA) [Geman&Geman 1984] incorporates the following randomization strategy into ICM (see the ICM steps above): at each pixel p the label is updated randomly according to the probabilities
$\Pr(L_p = x) \;\propto\; \exp\!\left(-\frac{E_p(x)}{T}\right), \qquad x \in \{0,1,2,\ldots,n\}$
(related to "soft-max"; see Szeliski, appendix B.5.1).
NOTE 1 - lower energy $E_p(x)$ gives x more chances to be selected.
NOTE 2 - a higher temperature parameter T means more randomness; as T approaches zero, SA reduces to ICM (the optimal x is always selected).
Like ICM, and unlike a-expansion, SA optimizes only over a single pixel at a time => local (pixel-wise) optimization.
Typical SA starts with a high T and gradually reduces T to zero.
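A minimal sketch of one SA sweep with the softmax-style pixel update above, reusing the Potts-style local energy from the ICM sketch; the cooling schedule and all names are illustrative assumptions.

```python
import numpy as np

def sa_sweep(I, L, num_labels, T, lam=1.0, rng=np.random.default_rng()):
    """One simulated-annealing sweep: each pixel's label is resampled with
    probability proportional to exp(-E_p(x) / T)."""
    H, W = I.shape
    for y in range(H):
        for x in range(W):
            nbrs = [L[yy, xx] for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1))
                    if 0 <= yy < H and 0 <= xx < W]
            Ep = np.array([(I[y, x] - k) ** 2 + lam * sum(k != q for q in nbrs)
                           for k in range(num_labels)])
            p = np.exp(-(Ep - Ep.min()) / T)          # softmax over labels (stabilized)
            L[y, x] = rng.choice(num_labels, p=p / p.sum())
    return L

# typical SA: start with a high temperature and gradually cool towards zero, e.g.
# for T in np.geomspace(10.0, 0.01, num=50): L = sa_sweep(I, L, num_labels, T)
```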
a-expansions vs. simulated annealing: small vs. large moves
[figure: starting from an initial labeling, a single-pixel local move (ICM or SA) vs. large moves (a-b swap, a-expansion)]
a-expansion and ICM/SA are greedy iterative methods converging to different kinds of "local" minima:
- an ICM/SA solution cannot be improved by changing the label of any one pixel to any given label a
- an a-expansion solution cannot be improved by changing any subset of pixels to any given label a
a-expansions vs. simulated annealing
[figure: stereo results]
- normalized correlation (the starting point for annealing): 24.7% error
- simulated annealing: 19 hours, 20.3% error
- a-expansions [BVZ 1998, 2001]: 90 seconds, 5.8% error
NOTE 1: ICM and SA are general methods applicable to arbitrary non-metric and high-order energies.
NOTE 2: nowadays there are other general methods based on graph cuts, message passing, relaxations, etc.
[plot: smoothness energy vs. time in seconds (log scale), comparing simulated annealing with a-expansion ("our method") [BVZ, 2001]]
Other applications
Graph-cut textures (Kwatra, Schodl, Essa, Bobick 2003)
similar to "image-quilting" (Efros & Freeman, 2001)
[figure: overlapping image patches labeled A-J stitched along graph-cut seams]
Other applications
Multi-object Extraction
Obvious generalization of the binary object extraction technique [BJ'01]
Some computational photography applications
Image compositing (Agarwala et al. 2004, see Szeliski Sec 9.3.2)
Color model fitting (multi-label version of Chan-Vese)
$E(L, I^0, I^1, \ldots) = \sum_{p:\,L_p=0} (I_p - I^0)^2 + \sum_{p:\,L_p=1} (I_p - I^1)^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting colors $I^i$.
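A minimal sketch of this block-coordinate descent, assuming the alpha_expansion routine sketched earlier (or any other Potts solver) is wrapped in a `segment` callback; the grayscale setup and all names are illustrative.

```python
import numpy as np

def fit_colors(I, L, num_models):
    """Refit each color model I^i to the mean intensity of its segment."""
    return np.array([I[L == i].mean() if np.any(L == i) else I.mean()
                     for i in range(num_models)])

def chan_vese_multilabel(I, num_models, segment, max_iters=10):
    """Block-coordinate descent: alternate segmentation (for fixed colors) and
    color refitting (for fixed segmentation). `segment(data_costs)` is assumed to
    minimize sum_p D_p(L_p) + Potts smoothness, e.g. via a-expansion or ICM."""
    L = np.random.randint(0, num_models, size=I.shape)   # arbitrary initial labeling
    for _ in range(max_iters):
        colors = fit_colors(I, L, num_models)            # fit I^i for fixed L
        data_costs = (I[..., None] - colors) ** 2        # D_p(i) = (I_p - I^i)^2
        L = segment(data_costs)                          # fix colors, update L
    return L, colors
```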
Stereo via piece-wise constant plane fitting [Birchfield & Tomasi 1999]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting affine transforms $T^i$.
Models T = parameters of affine transformations $T(p) = A\,p + b$ (A is 2x2, b is 2x1).
$E(L, T^0, T^1, \ldots) = \sum_{p:\,L_p=0} (I'_{T^0(p)} - I_p)^2 + \sum_{p:\,L_p=1} (I'_{T^1(p)} - I_p)^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Piece-wise smooth local plane fitting [Olsson et al. 2013]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting planes $T^i$.
$E(L, T^0, T^1, \ldots) = \sum_{p:\,L_p=0} (I'_{T^0(p)} - I_p)^2 + \sum_{p:\,L_p=1} (I'_{T^1(p)} - I_p)^2 + \ldots + \sum_{pq \in N} w_{pq}\,V(L_p, L_q)$
where V is based on truncated angle-differences between the local planes: such non-metric interactions need other optimization.
Signboard segmentation [Milevsky 2013]
Block-coordinate descent alternating a-expansion (for segmentation L) and fitting color-space planes $C^i$.
Labels = planes in RGBXY space: $C(p) = A\,p + b$ (A is 3x2, b is 3x1).
$E(L, C^0, C^1, \ldots) = \sum_{p:\,L_p=0} \|C^0(p) - I_p\|^2 + \sum_{p:\,L_p=1} \|C^1(p) - I_p\|^2 + \ldots + \sum_{pq \in N} w_{pq}\,[L_p \neq L_q]$
(the last term is the Potts model)
Goal: detection of characters, then text line fitting and translation.
Vessel extraction: 3D reconstruction of heart vessel center lines [Marin et al. ICCV 2015]
$E(L) = \sum_p \|L_p - p\|^2 + \sum_{pq \in N} V(L_p, L_q)$, where V is based on truncated angle-differences.
Learning pair-wise potentials
Structure detection (Kumar and Hebert 2006, see Szeliski Sec 3.7)
[figure: results with standard (hand-tuned) pair-wise potentials vs. learned pair-wise potentials]
MRF/CRF models in CNN segmentation
- Postprocessing of the results
  - e.g. improves edge alignment lost due to reduced resolution in CNNs
- Integrated as trainable or pass-through layers - RNNs, GraphNNs
  - typically mimicking convolution-like local optimization operations (e.g. message passing)
  - currently limited to relatively weak regularization models/optimization (e.g. dense-CRF due to its better amenability to simpler optimizers)
- Weakly supervised, unsupervised training
  - proposal generation
  - loss functions
  - stronger algorithms may be used in loss optimization
Next
Dense CRF, mean-field approximation?
Relaxations?
Geometric model fitting? Multi-part object fitting?
Single-view reconstruction?
Intro to learning, detection, CNN segmentation?
Student presentations
Multi-label Problems: Linear/Unary, Quadratic, and other Approximations and Relaxations
Approximations and Relaxations: unary/linear, quadratic, etc…
Observation: as discussed earlier in Topic 2, the arity of an energy potential corresponds to the order of the polynomial expressing that potential via indicator variables. For example, unary potentials can be interpreted as linear functions. Indeed, with indicators $x_p^k \in \{0,1\}$ (indicating that p is assigned label k),
$\sum_p D_p(L_p) = \sum_p \sum_k D_p(k)\, x_p^k,$
and relaxing the indicators to the simplex yields a linear relaxation where $x_p^k$ are "probabilities" of labels at point p.
However, an objective/energy/loss may have equivalent representations or relaxations of different order (see the example below).
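To make the "different order" remark concrete, here is the full pairwise energy written via the same indicators (a standard identity, not copied from the slide): the pairwise term is quadratic in the x variables.

$$
E(L) \;=\; \sum_p \sum_k D_p(k)\,x_p^k \;+\; \sum_{pq\in N} \sum_{k,m} V(k,m)\, x_p^k\, x_q^m ,
\qquad x_p^k = [L_p = k].
$$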
Approximations and Relaxations: unary/linear, quadratic, etc…
Local "linearization":
- gradient descent
- parallel ICM
Global "linearization":
- Schlesinger LP
- QPBO
- TRWS
Probabilistic "linearization":
- unary mean field approximation
- deterministic annealing
Higher-order approaches:
- quadratic IP
- quadratic relaxations
Due to simplicity, linear approximations are particularly common.
Local linearization: first-order approximation (gradient descent)
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Assume real-valued labels and some continuous functions D and V. Steepest descent updates
$L_p^{t+1} = L_p^t - \Delta t \cdot \frac{\partial E}{\partial L_p}, \qquad \frac{\partial E}{\partial L_p} = D_p'(L_p) + \sum_{q \in N_p} \frac{\partial V(L_p, L_q)}{\partial L_p},$
where the neighbor terms act as "messages" from neighboring nodes.
Questions: step size Δt, labels may not be real-valued, relaxations of functions D and V may not be obvious.
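A minimal sketch of this steepest descent for the squared-deviation restoration energy on a 1D scan line (so both D and V are differentiable); the step size and names are illustrative assumptions.

```python
import numpy as np

def gradient_descent_restore(I, lam=1.0, step=0.1, num_iters=500):
    """Steepest descent on E(L) = sum_p (L_p - I_p)^2 + lam * sum_{pq} (L_p - L_q)^2."""
    I = np.asarray(I, dtype=float)
    L = I.copy()
    for _ in range(num_iters):
        grad = 2.0 * (L - I)                 # derivative of the data term
        diff = np.diff(L)                    # L_{p+1} - L_p for each neighboring pair
        grad[:-1] += -2.0 * lam * diff       # "messages" from the right neighbors
        grad[1:]  +=  2.0 * lam * diff       # "messages" from the left neighbors
        L -= step * grad
    return L
```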
Local linearization: Parallel ICM
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Note: for simplicity, here we assume the sum is over directed pairs.
Consider the decomposition into local energies $E_p(x) = D_p(x) + \sum_{q \in N_p} V(x, L_q)$, the same as in ICM (see earlier). At any given current labeling this gives a unary approximation $E(L) \approx \sum_p E_p(L_p)$ with the neighbors fixed at their current labels.
Question: Minimizing this unary approximation is easy, but in what sense is the approximation good?
In fact, in this case the original energy E may go up after parallel ICM (minimizing the right-hand side). Why?
Probabilistic "linearization": Mean-Field Approximation
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
GOAL: find a simpler (e.g. unary) energy $U(L) = \sum_p U_p(L_p)$ "approximating" E.
One general approximation technique: mean-field approximation [see more in Bishop, Chapter 10.1, on variational inference].
Find U minimizing the KL-divergence between the Gibbs distributions corresponding to energies E and U:
$G(L) = \frac{1}{Z}\exp(-E(L)), \qquad G_U(L) = \frac{1}{Z_U}\exp(-U(L))$
Comments:
- The Gibbs distribution G(L) defines the probability of state L with given energy E(L); Z and $Z_U$ are the corresponding normalization constants.
- KL-divergence is a distance measure between two distributions (already used in Topic 2). In this case it also implicitly defines the quality of the approximation of energy E by U.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation:
NOTE: a unary energy U corresponds to a "factorized" Gibbs distribution $G_U(L) = \prod_p b_p(L_p)$ (independent variables $L_p$), where each factor
$b_p(L_p) \propto \exp(-U_p(L_p))$
is a "soft-max" of the potentials $U_p$ (probabilities / beliefs).
The optimal distributions $b_p$ that give minimum KL divergence between $G_U$ and G are also called pseudo-marginals for the joint distribution G. They are also often called simply "marginals" or "marginal distributions", even though they are not.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation:
A common approximate algorithm optimizing the factorized distribution $b = \{b_p\}$ for any given energy E is greedy coordinate descent:
1. optimize over $b_{p_1}$ keeping the other factors fixed
2. optimize over $b_{p_2}$ keeping the other factors fixed
3. …
NOTE: greedy coordinate descent is also used in ICM, but there w.r.t. the labels $L_p$ rather than the distributions $b_p$.
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation: consider one iteration optimizing over $b_p$ at any given point p.
$KL(G_U \,\|\, G) \;=\; \underbrace{\mathbb{E}_{G_U}[E(L)]}_{\text{mean energy } E \text{ w.r.t. } G_U} \;+\; \underbrace{\textstyle\sum_L G_U(L)\log G_U(L)}_{\text{(negative) entropy of } G_U} \;+\; \text{const}$
The first two terms are the so-called Mean Field free energy; the constant ($\log Z$) does not depend on U.
The entropy of the joint distribution for independent variables equals the sum of the per-node entropies (an easy-to-prove standard fact), so when optimizing over $b_p$ only the entropy of $b_p$ matters; the remaining terms reduce to a KL divergence between the probability distribution $b_p$ and a distribution defined by the mean energy (remember from probability: KL divergence is non-negative for probability distributions).
The smallest value of this KL divergence (zero) is achieved when
$b_p(L_p) \;\propto\; \exp\Big(-\,\mathbb{E}_{\,b_q,\,q \neq p}\big[\,E(L)\,\big|\,L_p\,\big]\Big),$
or equivalently [Bishop, eq.(10.9)]
$\ln b_p(L_p) \;=\; \mathbb{E}_{\,q \neq p}\big[-E(L)\big] + \text{const}.$
Probabilistic "linearization": Mean-Field Approximation
Mean field approximation: general coordinate descent formulas for estimating the mean-field approximation [Bishop] (a.k.a. variational inference):
update the belief $b_p$ at p for each label $L_p$ according to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes,
$b_p(L_p) \;\propto\; \exp\Big(-\,\mathbb{E}_{\,b_q,\,q \neq p}\big[\,E(L)\,\big|\,L_p\,\big]\Big),$
or equivalently [Bishop, eq.(10.9)] in terms of distributions, which is useful if one is directly interested in approximating (factorizing) complex distributions.
Remember our original goal: a unary energy U approximating E. In terms of U, the update sets $U_p(L_p)$ to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes.
Probabilistic "linearization": Mean-Field Approximation
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V(L_p, L_q)$.
Mean-field approximation updates for pairwise energies: sequential updates for the beliefs b according to "messages" from neighboring nodes. Update the belief $b_p$ at p for each label $L_p$ according to the mean energy for $L_p$ w.r.t. the current beliefs at the other nodes:
$b_p(L_p) \;\propto\; \exp\Big(-D_p(L_p) - \sum_{q \in N_p} \sum_{L_q} b_q(L_q)\, V(L_p, L_q)\Big)$
NOTE: this converges to a "fixed point", but… may not converge if the updates are made in parallel.
Question: once (approximately) optimal factorized distributions $\{b_p\}$ are known, how can one estimate the labels $\{L_p\}$?
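A minimal, runnable sketch of these sequential mean-field updates for a Potts pairwise term on a 4-connected grid; the unary cost array D and the weight lam are illustrative inputs, not from the slides.

```python
import numpy as np

def mean_field_potts(D, lam=1.0, num_sweeps=10):
    """Sequential mean-field updates for E(L) = sum_p D_p(L_p) + lam*sum_{pq}[L_p != L_q].
    D has shape (H, W, K); returns beliefs b of the same shape."""
    H, W, K = D.shape
    b = np.full((H, W, K), 1.0 / K)                  # start from uniform beliefs
    for _ in range(num_sweeps):
        for y in range(H):
            for x in range(W):
                # expected Potts penalty with neighbor q is lam * (1 - b_q(k)) for label k
                msg = np.zeros(K)
                for yy, xx in ((y-1,x),(y+1,x),(y,x-1),(y,x+1)):
                    if 0 <= yy < H and 0 <= xx < W:
                        msg += lam * (1.0 - b[yy, xx])
                logit = -(D[y, x] + msg)             # mean energy for each label L_p
                logit -= logit.max()                 # stabilize the soft-max
                e = np.exp(logit)
                b[y, x] = e / e.sum()                # updated belief b_p
    return b

# labels can then be estimated, e.g., as L_p = argmax_k b_p(k)
```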
Messages for beliefs and labels
Example: pairwise (or second-order) energy.
- Mean-field approximation locally averages the energy for each $L_p$ (w.r.t. the current neighbor beliefs $b_q$) and updates the current beliefs $b_p$, a table with one element for each $L_p$.
- ICM locally optimizes over the labels $L_p$ (w.r.t. the current neighbor labels $L_q$) and updates the current labels $L_p$.
Example: squared deviation potentials (e.g. in restoration). In this case the mean-field message effectively averages the neighbors' labels, e.g. it locally averages the labels $L_p$.
Dense CRF
Densely connected pairwise Potts model (see Topic 2).
Due to the graph density, graph cuts are not practical. Uses the mean-field approximation and bilateral filtering for a significant speed-up [Koltun et al, NIPS11].
NOTE: dense CRF is a weaker regularization model, but it is an easier objective, amenable to greedy approximate optimization methods such as the mean field technique.
Deterministic Annealing
Mean field approximation with an additional temperature parameter T: the free energy becomes the average energy E w.r.t. distribution $G_U$ plus T times the (negative) entropy of $G_U$.
Observation: finding optimal beliefs is easier for larger T (the convex entropy term dominates, easy) and harder for smaller T (the non-convex average-energy term dominates, hard). Why?
General Idea for Deterministic Annealing:
- start from large T and uniform beliefs b
- gradually decrease the temperature while updating the beliefs (e.g. using mean-field)
- eventually the Gibbs distribution $G_T$ should be concentrated around the globally optimal labeling L, and the (approximate) beliefs $b_T$ "may" reflect that (no guarantees).
Loopy belief propagation (loopy BP)
One class of iterative message-passing inference methods (BP) uses messages derived from dynamic programming on non-loopy graphs (chains, trees):
- exact optimization on non-loopy graphs
- sum-product and max-product variants
- no guarantees on loopy graphs (may oscillate), but works well in some applications.
Technical operations inside many optimization algorithms can be represented via "local messages", e.g. in gradient descent, ICM, graph cuts, …, including dynamic programming (e.g. Viterbi) on chains/trees.
Global linearization: Linear Programming (LP) Relaxations
Example: pairwise (or second-order) energy $E(L) = \sum_p D_p(L_p) + \sum_{pq \in N} V_{pq}(L_p, L_q)$.
Introduce binary indicators:
$x_p^k \in \{0,1\}$ indicates if p is assigned label k (as in Topic 1), and
$x_{pq}^{km} \in \{0,1\}$ indicates if the pair p,q is assigned labels k,m.
First, formulate an equivalent Integer Linear Program (ILP) as follows (a standard formulation is sketched below)…
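The ILP itself did not survive the transcript; the following is the standard formulation the slide refers to, written in the notation above.

$$
\begin{aligned}
\min_{x}\quad & \sum_p \sum_k D_p(k)\, x_p^k \;+\; \sum_{pq \in N} \sum_{k,m} V_{pq}(k,m)\, x_{pq}^{km}\\
\text{s.t.}\quad & \sum_k x_p^k = 1 \;\;\forall p, \qquad
\sum_m x_{pq}^{km} = x_p^k, \quad \sum_k x_{pq}^{km} = x_q^m \;\;\forall pq \in N,\\
& x_p^k \in \{0,1\}, \qquad x_{pq}^{km} \in \{0,1\}.
\end{aligned}
$$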
Global linearization: Linear Programming (LP) Relaxations
Relaxing the integrality constraints (to $x \in [0,1]$) gives the Schlesinger LP relaxation (1972).
Global linearization: Linear Programming (LP) Relaxations
• binary labeling case - Quadratic Pseudo-Boolean Optimization (QPBO) [Hammer et al. 1984, Boros 2002]
  - the relaxation has half-integral solutions, such that $x_p^k \in \{0, \tfrac{1}{2}, 1\}$
  - submodular problems yield a fully integral solution (global optimum)
• arbitrary number of labels
  - typical algorithms focus on a dual LP
  - block-coordinate ascent, message passing
  - tree-structured block-coordinate ascent: TRWS [Kolmogorov & Wainwright 2006]
  - guaranteed global optima for a class of "multi-label submodular" problems
Global vs Local Linearization
Local Linearization: parallel ICM, gradient descent
Global Linearization: Schlesinger LP, QPBO, TRWS
Quadratic Relaxations and Approximations
Example: squared deviations image restoration energy
Can easily work with real valued labels, if desired
Quadratic Relaxations and Approximations
Example: pairwise Potts energy. With indicators $x_p^k$ (indicating if p is assigned label k, as before), the Potts term can be written via "saddle" functions $f(x,y) = -xy$; the relaxed expression equals the original Potts energy for integer (one-hot) indicators (see the worked form below).
Lots of other options for relaxing the Potts model.
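A worked form of one such quadratic relaxation (my reconstruction of the standard identity, not copied from the slide): for one-hot indicator vectors $x_p = (x_p^1, \ldots, x_p^K)$,

$$
\sum_{pq \in N} w_{pq}\,[L_p \neq L_q] \;=\; \sum_{pq \in N} w_{pq}\,\big(1 - \langle x_p, x_q\rangle\big)
\;=\; \sum_{pq \in N} w_{pq}\Big(1 + \sum_k f(x_p^k, x_q^k)\Big), \qquad f(x,y) = -xy,
$$

which becomes a (non-convex, saddle-type) quadratic function once the indicators are relaxed to the simplex.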
Quadratic Relaxations and Approximations
- Natural for labeling problems with labels in $\mathbb{R}^n$ (e.g. restoration, stereo, optical flow)
- Quadratic relaxations for the Potts model (segmentation)
- Submodular approximations