TRANSCRIPT
• Learn priors on the spatiotemporal relationship between edges in the alpha matte and the composite image.
• Fit these priors to natural image statistics models to avoid over-fitting.
• These priors propagate alpha values along, but not across, edge surfaces in both space and time.
• We calculate the maximum a posteriori (MAP) estimate of the foreground image F and the alpha matte α given C and B.
• The MAP estimation is converted into an energy minimization by taking the negative log-likelihood of the posterior and noting that p(C) and p(B) do not depend on the unknowns F and α.
• The energy is minimized over 100,000's of variables using the well-honed Matlab optimizer fmincon. Derivatives of the energy function are computed using Maple and automatically written to C source code. An analytical sparse Hessian is assembled from these derivatives, allowing the use of efficient sparse solvers.
Nicholas Apostoloff and Andrew W. Fitzgibbon
University of Oxford
{nema,awf}@robots.ox.ac.uk
Bayesian video matting using learnt image priors
Objective
To regularize the inverse problem of video matting in a Bayesian framework using learnt priors on the distribution of alpha values and the spatiotemporal consistency of image sequences.
Video Matting
Video matting is a classic inverse problem in computer vision that involves the extraction of foreground objects, and the alpha mattes that describe their opacity, from a set of images. It is most prevalent in the film industry for special effects shots that require the superposition of an actor onto a new background.
The compositing equation defines how the composite image C is formed as a linear combination of the foreground image F and the background image B using the alpha matte α:
C = α × F + (1 − α) × B
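The compositing equation can be sketched directly in NumPy; the function name and array shapes here are illustrative assumptions, not from the paper:

```python
import numpy as np

def composite(F, B, alpha):
    """Forward compositing: C = alpha * F + (1 - alpha) * B.

    F, B are (H, W, 3) float images; alpha is an (H, W) matte in [0, 1].
    """
    a = alpha[..., None]            # broadcast the matte over colour channels
    return a * F + (1.0 - a) * B
```

Matting asks for the inverse: recover α and F from C.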
However, video matting is the inverse of the compositing process:
Given a sequence of images C, solve for α, F and B.
It is apparent from this statement that the problem is hugely under-constrained. Previous video matting techniques have relied on user interaction and accurate modelling of the foreground and background colour distributions to assist the process, while neglecting the inherent information that lies in the statistics of natural images and alpha mattes. The principal contribution of this work is the investigation of the statistical relationship between C and α and the modelling of a prior that imposes both spatial and temporal consistency in α. In contrast to previous efforts, the prior probability distributions are learnt from examples and fitted to models of natural image statistics, producing stronger, more appropriate regularization.
The Strategy: MAP Estimation
The Bayesian formulation of the video matting problem is one of finding the maximum a posteriori (MAP) estimate of the foreground image F and the alpha matte α given C and B:

{F, α} = argmax_{F,α} p(F, α | C, B).

Using Bayes' rule, the posterior can be expressed as a combination of priors on F, α, C and B, and a conditional probability in C and B:

{F, α} = argmax_{F,α} [p(C, B | F, α) p(F) p(α)] / [p(C) p(B)].
The MAP estimation is converted into an energy minimization by taking the negative log-likelihood of the posterior, and noting that p(C) and p(B) do not depend on the unknowns F and α:

{F, α} = argmin_{F,α} {L(C, B | F, α) + L(F) + L(α)}

where L(C, B | F, α) is the reconstruction error, L(F) is the foreground energy and L(α) is the negative log alpha prior. For this work, we used Matlab's constrained nonlinear optimizer fmincon. Like graduated non-convexity (Blake and Zisserman, 1987), the minimization is applied three times using smoothed versions of the spatiotemporal prior, with the result of one level of smoothing seeding the next level.
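The coarse-to-fine scheme can be sketched as below. This uses SciPy's minimize in place of Matlab's fmincon, and the toy one-variable energy and smoothing schedule are illustrative assumptions, not the paper's full energy:

```python
import numpy as np
from scipy.optimize import minimize

def solve_graduated(energy, x0, sigmas=(4.0, 2.0, 1.0), bounds=None):
    """Minimize energy(x, sigma) over a schedule of decreasing smoothing,
    seeding each level with the previous solution (graduated non-convexity)."""
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:
        res = minimize(lambda v: energy(v, sigma), x,
                       method="L-BFGS-B", bounds=bounds)
        x = res.x
    return x

def toy_energy(v, sigma):
    """Quadratic data term plus a non-convex prior term whose sharpness
    grows as the smoothing sigma shrinks (purely illustrative)."""
    a = v[0]
    return (a - 0.7) ** 2 + 0.1 * (1.0 - np.exp(-(a ** 2) / sigma ** 2))

alpha = solve_graduated(toy_energy, [0.0], bounds=[(0.0, 1.0)])
```

Each level's smoother energy guides the optimizer past local minima that the sharper energies would trap it in.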
The Problem
Video matting is inherently under-constrained – 7 unknowns in 3 equations per colour pixel. Background subtraction removes Br, Bg and Bb to reduce the problem to one of 4 unknowns (red) in 3 equations.
Cr(x) = αx Fr(x) + (1 − αx) Br(x)
Cg(x) = αx Fg(x) + (1 − αx) Bg(x)
Cb(x) = αx Fb(x) + (1 − αx) Bb(x)
[Figure: input (C), input (B), background subtraction, desired output (α), desired output (F).]
In this work we exploit the joint redundancy of natural images and alpha mattes by learning a prior on the spatiotemporal consistency of α and F and regularizing the problem in a Bayesian framework.
The Strategy: The Energy Equation

Reconstruction Error
Assuming Gaussian, white, additive image noise, the conditional density in C and B for pixel x is:

p(Cx, Bx | Fx, αx) = η exp( −||Cx − αx Fx − (1 − αx) Bx||² / 2σc )

where the constant η is dropped and the fraction, 1/2σc, is subsumed by the weights λi in Ex.
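A per-pixel version of the reconstruction term might look like the following NumPy sketch (function name and array shapes are assumptions):

```python
import numpy as np

def reconstruction_energy(C, B, F, alpha):
    """Squared residual ||C - alpha*F - (1 - alpha)*B||^2 per pixel,
    i.e. the negative log of the Gaussian conditional with the constant
    eta dropped and the 1/(2*sigma_c) factor folded into the weights."""
    a = alpha[..., None]
    r = C - a * F - (1.0 - a) * B
    return np.sum(r * r, axis=-1)    # (H, W) energy map
```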
Foreground energy
We follow previous authors (Chuang et al., 2001; Ruzon and Tomasi, 2000) and use a Gaussian mixture model (GMM) in the RGB colour space to model the distribution of foreground pixel colours. The distribution at each pixel is generated from all foreground pixels in a square neighborhood in the automatically generated trimap, as in (Ruzon and Tomasi, 2000).
Ex = ||Cx − αx Fx − (1 − αx) Bx||²
  − λ1 log( Σ_{k=1}^{Nk} (1/√((2π)³|Σk|)) exp(−(Fx − µk)ᵀ Σk⁻¹ (Fx − µk)) )
  − λ2 ((η − 1) log αx + (τ − 1) log(1 − αx))
  − λ3 Σ_{p∈N} log( γ1x (1 + γ2x (∇p αx)²)^(−γ3x) + γ4x exp(−γ5x (∇p αx)²) )
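The GMM term of Ex can be evaluated directly from per-pixel means and covariances. This sketch follows the mixture expression as printed in the energy above (whose exponent carries no 1/2 factor), with names of my own choosing:

```python
import numpy as np

def gmm_energy(Fx, means, covs, eps=1e-300):
    """-log of the Gaussian mixture term in Ex.  Fx is an RGB colour (3,),
    means is (K, 3) and covs is (K, 3, 3); the exponent follows the
    energy equation, which omits the usual 1/2 factor."""
    total = 0.0
    for mu, S in zip(means, covs):
        d = Fx - mu
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** 3 * np.linalg.det(S))
        total += norm * np.exp(-d @ np.linalg.solve(S, d))
    return -np.log(total + eps)
```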
Alpha Distribution Prior
A Beta function is used to model the distribution of alpha values as primarily 0's or 1's and has the density and energy functions:

p(α(x)) = α(x)^(η−1) (1 − α(x))^(τ−1) / β(η, τ)

where β(η, τ) is constant wrt α and F and is ignored in the energy function.
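The corresponding energy term, −((η − 1) log α + (τ − 1) log(1 − α)), is simple to evaluate; the clipping epsilon below is an implementation assumption to keep the logarithms finite:

```python
import numpy as np

def alpha_prior_energy(alpha, eta, tau, eps=1e-8):
    """Negative log Beta density, up to the constant beta(eta, tau).
    With eta, tau < 1 the density peaks at 0 and 1, so the energy
    penalizes intermediate (mixed) alpha values."""
    a = np.clip(alpha, eps, 1.0 - eps)
    return -((eta - 1.0) * np.log(a) + (tau - 1.0) * np.log(1.0 - a))
```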
Spatiotemporal Consistency Prior
The spatio-temporal consistency prior smooths alpha along, but not across, edges in space and time. Each conditional is modelled as a mixture of a t-distribution and a Gaussian, which is inspired by the statistics of derivatives of natural images:

p(dα | dC) = γ1 (1 + γ2 dα²)^(−γ3) + γ4 exp(−γ5 dα²)

where the coefficients γi are learnt from training sets.
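Evaluating this mixture for a learnt coefficient vector is straightforward; the sample coefficients used below are placeholders, not learnt values from the paper:

```python
import numpy as np

def conditional_prior(d_alpha, gammas):
    """p(d_alpha | dC) as a t-distribution plus a Gaussian:
    g1*(1 + g2*da^2)**(-g3) + g4*exp(-g5*da^2).
    In the paper one gamma vector is learnt per dC bin; here it is
    simply passed in."""
    g1, g2, g3, g4, g5 = gammas
    da2 = np.square(d_alpha)
    return g1 * (1.0 + g2 * da2) ** (-g3) + g4 * np.exp(-g5 * da2)
```

The heavy t-distribution tail keeps large alpha jumps affordable at strong image edges, while the Gaussian concentrates mass on small jumps elsewhere.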
Learnt Spatiotemporal Prior
[Figure: Wexler et al. spatial consistency energy; learnt marginal prior log p(dα, dC); modelled conditional prior log p(dα|dC). X axes are dC and Y axes are dα. Blue is low log-probability and red is high.]
[Figure: four plots of log p(dα|dC) against dα, comparing training data with the fitted model.]
Results
[Figure: for each sequence, the composite C, background B and background subtraction |C − B| (αbs) are shown with the recovered C′, αF and α, alongside the Ruzon (αrt) and Photoshop (αps) mattes, without and with temporal consistency.]
References
Blake, A. and Zisserman, A. (1987). Visual Reconstruction. MIT Press, Cambridge, USA.
Chuang, Y.-Y., Curless, B., Salesin, D., and Szeliski, R. (2001). A Bayesian approach to digital matting. In Proc. CVPR, volume 2, pages 264–271.
Ruzon, M. and Tomasi, C. (2000). Alpha estimation in natural images. In Proc. CVPR, volume 1, pages 18–25.