
Bayesian video matting using learnt image priors

Nicholas Apostoloff and Andrew W. Fitzgibbon
University of Oxford
{nema,awf}@robots.ox.ac.uk

What's new?

• Learn priors on the spatiotemporal relationship between edges in the alpha matte and the composite image
• Fit these priors to natural image statistics models to avoid over-fitting
• These priors propagate alpha values along, but not across, edge surfaces in both space and time
• We calculate the maximum a posteriori (MAP) estimate of the foreground image F and the alpha matte α given C and B
• The MAP estimation is converted into an energy minimization by taking the negative log-likelihood of the posterior and noting that p(C) and p(B) do not depend on the unknowns F and α
• The minimization runs over 100,000s of variables using the well-honed Matlab optimizer fmincon. Derivatives of the energy function are computed using Maple and automatically written to C source code; an analytical sparse Hessian is assembled from these derivatives, allowing the use of efficient sparse solvers.

Spatiotemporal consistency

The relationship between the spatiotemporal consistency of the alpha mattes α and the composite images C is learnt using models from natural image statistics.

The general problem of video matting is under-constrained, as is evidenced by background subtraction, where no threshold yields a correct solution. The composite image volume nonetheless exhibits both a spatial and a temporal consistency that can be used to help resolve the ambiguous regions.

(Figure: the composite image volume over time, the background subtraction solution with its ambiguous regions, and the priors p(dα, dC) (spatial) and p(dα | dC) (temporal) learnt from training data, shown as high- and low-energy surfaces. A second figure previews the experimental results, comparing the background subtraction and spatiotemporal consistency solutions against Photoshop and Ruzon mattes, with and without temporal consistency.)


Objective

To regularize the inverse problem of video matting in a Bayesian framework using learnt priors on the distribution of alpha values and the spatiotemporal consistency of image sequences.

Video Matting

Video matting is a classic inverse problem in computer vision that involves the extraction of foreground objects, and the alpha mattes that describe their opacity, from a set of images. It is most prevalent in the film industry for special effects shots that require the superposition of an actor onto a new background.

The compositing equation defines how the composite image C is formed as a linear combination of the foreground image F and the background image B using the alpha matte α:

C = α × F + (1 − α) × B
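The compositing equation is direct to evaluate when all quantities are known; a minimal numpy sketch (the function and array names are illustrative, not from the poster):

```python
import numpy as np

def composite(alpha, F, B):
    """Form C = alpha*F + (1 - alpha)*B per pixel.

    alpha: (H, W) matte in [0, 1]; F, B: (H, W, 3) colour images.
    """
    a = alpha[..., None]          # broadcast the matte over colour channels
    return a * F + (1.0 - a) * B

# Toy 1x1 image: half-transparent white foreground over a black background.
F = np.ones((1, 1, 3))
B = np.zeros((1, 1, 3))
alpha = np.full((1, 1), 0.5)
C = composite(alpha, F, B)        # each channel is 0.5
```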

However, video matting is the inverse of the compositing process:

Given a sequence of images C, solve for α, F and B.

It is apparent from this statement that the problem is hugely under-constrained. Previous video matting techniques have relied on user interaction and accurate modelling of the foreground and background colour distributions to assist the process, while neglecting the inherent information that lies in the statistics of natural images and alpha mattes.

The principal contribution of this work is the investigation of the statistical relationship between C and α and the modelling of a prior that imposes both spatial and temporal consistency in α. In contrast to previous efforts, the prior probability distributions are learnt from examples and fit to models of natural image statistics, producing stronger, more appropriate regularization.

The Strategy: MAP Estimation

The Bayesian formulation of the video matting problem is one of finding the maximum a posteriori (MAP) estimate of the foreground image F and the alpha matte α given C and B:

{F, α} = argmax_{F,α} p(F, α | C, B).

Using Bayes' rule, the posterior can be expressed as a combination of priors on F, α, C and B, and a conditional probability in C and B:

{F, α} = argmax_{F,α} p(C, B | F, α) p(F) p(α) / (p(C) p(B)).

The MAP estimation is converted into an energy minimization by taking the negative log-likelihood of the posterior, and noting that p(C) and p(B) do not depend on the unknowns F and α:

{F, α} = argmin_{F,α} { L(C, B | F, α) + L(F) + L(α) }

where L(C, B | F, α) is the reconstruction error, L(F) is the foreground energy and L(α) is the negative log alpha prior. For this work, we used Matlab's constrained nonlinear optimizer fmincon. Like graduated non-convexity (Blake and Zisserman, 1987), the minimization is applied three times using smoothed versions of the spatiotemporal prior, with the result of one level of smoothing seeding the next level.
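The graduated scheme above can be sketched as follows, with scipy.optimize.minimize standing in for Matlab's fmincon and a toy quadratic standing in for the matting energy; the function name, the smoothing schedule and the bounds are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def energy(x, smoothing):
    """Toy stand-in for the matting energy: the `smoothing` term plays the
    role of a smoothed spatiotemporal prior that convexifies the surface."""
    return np.sum((x - 1.0) ** 2) + smoothing * np.sum(x ** 2)

# Stand-in for the stacked unknowns {F, alpha}, with alpha-like bounds [0, 1].
x = np.zeros(4)
for smoothing in (1.0, 0.1, 0.0):          # three levels of smoothing
    res = minimize(energy, x, args=(smoothing,), method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * x.size)
    x = res.x                               # the result seeds the next level
```

Each pass starts from the previous solution, mirroring how the smoothed prior guides the optimizer toward the basin of the final, unsmoothed energy.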

The Problem

Video matting is inherently under-constrained – 7 unknowns in 3 equations per colour pixel. Background extraction removes Br, Bg and Bb to reduce the problem to one of 4 unknowns (αx, Fr, Fg and Fb) in 3 equations:

Cr(x) = αx Fr(x) + (1 − αx) Br(x)
Cg(x) = αx Fg(x) + (1 − αx) Bg(x)
Cb(x) = αx Fb(x) + (1 − αx) Bb(x)
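The underdetermination is easy to verify numerically: once B is known, any α ∈ (0, 1] admits a foreground F = (C − (1 − α)B) / α that reproduces C exactly, so the data alone cannot select the matte. A small numpy check (pixel values are illustrative):

```python
import numpy as np

C = np.array([0.6, 0.4, 0.3])   # observed composite pixel (RGB)
B = np.array([0.2, 0.2, 0.2])   # known background pixel

# Every candidate alpha yields a foreground that explains C perfectly,
# even though most of these (F, alpha) pairs are not the true decomposition.
for alpha in (0.3, 0.7, 1.0):
    F = (C - (1.0 - alpha) * B) / alpha
    assert np.allclose(alpha * F + (1.0 - alpha) * B, C)
```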

(Figure: input C, input B, background subtraction, desired output α, desired output F.)

In this work we exploit the joint redundancy of natural images and alpha mattes by learning a prior on the spatiotemporal consistency of α and F and regularizing the problem in a Bayesian framework.

The Strategy: The Energy Equation

Reconstruction Error
Assuming Gaussian, white, additive image noise, the conditional density in C and B for pixel x is:

p(Cx, Bx | Fx, αx) = η exp( −||Cx − αx Fx − (1 − αx) Bx||² / 2σc )

where the constant η is dropped and the fraction 1/2σc is subsumed by the weights λi in Ex.
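Dropping η and absorbing 1/2σc into the weights leaves the per-pixel term ||Cx − αx Fx − (1 − αx) Bx||²; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def reconstruction_error(C, B, F, alpha):
    """Per-pixel term ||C - alpha*F - (1 - alpha)*B||^2; the constant eta
    and the factor 1/(2*sigma_c) are dropped/absorbed into the lambdas."""
    a = alpha[..., None]
    residual = C - a * F - (1.0 - a) * B
    return np.sum(residual ** 2, axis=-1)

# A composite formed exactly from (F, alpha, B) has zero reconstruction error.
rng = np.random.default_rng(0)
F = rng.random((2, 2, 3))
B = rng.random((2, 2, 3))
alpha = rng.random((2, 2))
C = alpha[..., None] * F + (1.0 - alpha[..., None]) * B
err = reconstruction_error(C, B, F, alpha)
```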

Foreground Energy
We follow previous authors (Chuang et al., 2001; Ruzon and Tomasi, 2000) and use a Gaussian mixture model (GMM) in the RGB colour space to model the distribution of foreground pixel colours. The distribution at each pixel is generated from all foreground pixels in a square neighborhood in the automatically generated trimap, as in (Ruzon and Tomasi, 2000).

Ex = ||Cx − αx Fx − (1 − αx) Bx||²   (reconstruction error)

   − λ1 log( Σ_{k=1}^{Nk} (1/√((2π)³ |Σk|)) exp( −(Fx − µk)ᵀ Σk⁻¹ (Fx − µk) ) )   (foreground energy)

   − λ2 ( (η − 1) log αx + (τ − 1) log(1 − αx) )   (alpha distribution)

   − λ3 Σ_{p∈N} log( γ1x (1 + γ2x (∇p αx)²)^(−γ3x) + γ4x exp(−γ5x (∇p αx)²) )   (learnt spatiotemporal consistency energy)
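The four terms can be evaluated per pixel as follows; this is a hedged sketch in which the λ weights, GMM parameters, Beta shapes and γ coefficients are illustrative stand-ins for the learnt values, and the neighbourhood sum over p ∈ N is reduced to a single derivative:

```python
import numpy as np

def gmm_loglik(F, mus, covs):
    """Log of the summed GMM component densities in RGB space
    (component weights omitted, as in the poster's energy term)."""
    total = 0.0
    for mu, cov in zip(mus, covs):
        d = F - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(cov))
        total += norm * np.exp(-d @ np.linalg.inv(cov) @ d)
    return np.log(total)

def pixel_energy(C, B, F, alpha, d_alpha, mus, covs,
                 lam=(1.0, 1.0, 1.0), eta=0.5, tau=0.5,
                 gamma=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Four-term matting energy at one pixel; all parameter values here
    are illustrative, not the learnt ones from the poster."""
    l1, l2, l3 = lam
    g1, g2, g3, g4, g5 = gamma
    recon = np.sum((C - alpha * F - (1.0 - alpha) * B) ** 2)
    fg = -l1 * gmm_loglik(F, mus, covs)
    beta = -l2 * ((eta - 1) * np.log(alpha) + (tau - 1) * np.log(1 - alpha))
    st = -l3 * np.log(g1 * (1 + g2 * d_alpha ** 2) ** (-g3)
                      + g4 * np.exp(-g5 * d_alpha ** 2))
    return recon + fg + beta + st

# Toy evaluation: one-component GMM, perfectly composited pixel.
mus, covs = [np.zeros(3)], [np.eye(3)]
F, B, alpha = np.zeros(3), np.full(3, 0.2), 0.5
C = alpha * F + (1.0 - alpha) * B
E = pixel_energy(C, B, F, alpha, 0.0, mus, covs)
```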

Alpha Distribution Prior
A Beta function is used to model the distribution of alpha values as primarily 0s or 1s and has the density and energy functions:

p(α(x)) = α(x)^(η−1) (1 − α(x))^(τ−1) / β(η, τ)

where β(η, τ) is constant with respect to α and F and is ignored in the energy function.
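With η, τ < 1 the Beta density concentrates mass near 0 and 1, matching mattes that are mostly hard foreground or hard background; this can be checked with scipy (the shape values 0.5 are illustrative, not the fitted ones):

```python
from scipy.stats import beta

eta, tau = 0.5, 0.5           # illustrative shape parameters, both < 1
rv = beta(eta, tau)

# Density is higher near the hard values 0 and 1 than at mixed alphas.
assert rv.pdf(0.05) > rv.pdf(0.5)
assert rv.pdf(0.95) > rv.pdf(0.5)
```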

Spatiotemporal Consistency Prior
The spatiotemporal consistency prior smooths alpha along, but not across, edges in space and time. Each conditional is modelled as a mixture of a t-distribution and a Gaussian, which is inspired by the statistics of derivatives of natural images:

p(dα | dC) = γ1 (1 + γ2 dα²)^(−γ3) + γ4 exp(−γ5 dα²)

where the coefficients γi are learnt from training sets.
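The heavy tail of the t-distribution component is what lets α jump across genuine edges while small derivatives remain the most probable; a sketch of the conditional with illustrative (not learnt) γ coefficients:

```python
import numpy as np

def p_dalpha_given_dc(d_alpha, gamma):
    """Mixture of a t-distribution and a Gaussian in the matte derivative."""
    g1, g2, g3, g4, g5 = gamma
    return (g1 * (1.0 + g2 * d_alpha ** 2) ** (-g3)
            + g4 * np.exp(-g5 * d_alpha ** 2))

gamma = (0.5, 50.0, 1.5, 0.5, 200.0)   # illustrative coefficients
flat = p_dalpha_given_dc(0.0, gamma)    # no change in alpha: most probable
edge = p_dalpha_given_dc(0.8, gamma)    # large jump: rare but not impossible
```

The Gaussian component dominates near dα = 0, while the polynomial tail of the t-component keeps large jumps at non-negligible probability.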

Learnt Spatiotemporal Prior

(Figure: the spatial consistency energy of Wexler et al., the learnt marginal prior log p(dα, dC), and the modelled conditional prior log p(dα | dC). X axes are dC and Y axes are dα; blue is low log-probability and red is high.)

(Figure: four panels comparing log p(dα | dC) for the training data against the fitted model.)

Results

(Figure: for three sequences, the input C and background B, the background subtraction matte (αbs or |C − B|), the Ruzon–Tomasi matte αrt, the Photoshop matte αps, and the recovered C′, αF and α, shown without and with temporal consistency.)

References

Blake, A. and Zisserman, A. (1987). Visual Reconstruction. MIT Press, Cambridge, USA.

Chuang, Y.-Y., Curless, B., Salesin, D., and Szeliski, R. (2001). A Bayesian approach to digital matting. In Proc. CVPR, volume 2, pages 264–271.

Ruzon, M. and Tomasi, C. (2000). Alpha estimation in natural images. In Proc. CVPR, volume 1, pages 18–25.
