
  • An Introduction to Sparse Coding and Dictionary Learning

    Kai Cao January 14, 2014

    1

  • Outline

    • Introduction
    • Mathematical foundation
    • Sparse coding
    • Dictionary learning
    • Summary

    2

  • Introduction

    3

  • What is sparsity?

    • Sparsity implies many zeros in a vector or a matrix

    4

    [Figure: a fingerprint patch is transformed by the FFT; the FFT response is a sparse representation of the patch, and the inverse FFT reconstructs the patch from it.]

    Usage: compression, analysis, denoising, …

  • Sparse Representation

    [Figure: $x \approx D\alpha$. Finding the dictionary D from data is the dictionary learning problem; finding the sparse vector α for a given D is the sparse coding problem.]

  • Application---Denoising

    [M. Elad, Springer 2010]

    [Figure: source image, noisy image (PSNR = 22.1 dB), the learned dictionary, and the denoised result (PSNR = 30.829 dB).]

  • Application---Compression

    [O. Bryt, M. Elad, 2008]

    [Figure: three facial images compressed at 550 bytes per image with JPEG, JPEG 2000, PCA, and the dictionary-based coder. The RMSE of each reconstruction is shown below it (JPEG: 15.81, 14.67, 15.30; JPEG 2000: 13.89, 12.41, 12.57; PCA: 10.66, 10.27, 9.44; dictionary based: 6.60, 5.49, 6.36).]

  • Mathematical foundation

    8

  • Derivatives of vectors

    • First order: $\frac{\partial (a^T x)}{\partial x} = \frac{\partial (x^T a)}{\partial x} = a$

    • Second order: $\frac{\partial (x^T B x)}{\partial x} = (B + B^T)\,x$

    • Exercise: solve $\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \frac{\lambda}{2}\|\alpha\|_2^2$, where $x \in \mathbb{R}^n$ and $D \in \mathbb{R}^{n \times m}$.
      Solution: $\alpha = (D^T D + \lambda I)^{-1} D^T x$.

    9
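A quick numerical check of the closed-form solution in the exercise, as a minimal NumPy sketch (my own illustration; the dimensions and the value of λ are arbitrary choices):

```python
import numpy as np

# Check that alpha = (D^T D + lambda I)^(-1) D^T x minimizes
# 0.5*||x - D@alpha||_2^2 + 0.5*lambda*||alpha||_2^2.
rng = np.random.default_rng(0)
n, m, lam = 8, 5, 0.1
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

alpha = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ x)

grad = D.T @ (D @ alpha - x) + lam * alpha   # gradient of the objective
print(np.allclose(grad, 0.0))                # True: the gradient vanishes at the minimizer
```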

  • Trace of a Matrix

    • Definition: $\mathrm{Tr}(A) = \sum_{i=1}^{n} a_{ii}$, where $A = (a_{ij}) \in \mathbb{R}^{n \times n}$

    • Properties:
      – $\|A\|_F^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2 = \mathrm{Tr}(A^T A)$
      – $\mathrm{Tr}(ABC) = \mathrm{Tr}(BCA) = \mathrm{Tr}(CAB)$, for $B, C \in \mathbb{R}^{n \times n}$
      – $\mathrm{Tr}(A^T) = \mathrm{Tr}(A)$
      – $\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$, for $B \in \mathbb{R}^{n \times n}$
      – $\mathrm{Tr}(aA) = a\,\mathrm{Tr}(A)$, for $a \in \mathbb{R}$
      – $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$, for $B \in \mathbb{R}^{n \times n}$

    10

  • Derivatives of traces

    • First order: $\frac{\partial\,\mathrm{Tr}(XA)}{\partial X} = A^T$, $\qquad \frac{\partial\,\mathrm{Tr}(X^T A)}{\partial X} = A$

    • Second order: $\frac{\partial\,\mathrm{Tr}(X^T X A)}{\partial X} = XA^T + XA$, $\qquad \frac{\partial\,\mathrm{Tr}(X^T B X)}{\partial X} = B^T X + BX$

    • Exercise: solve $\min_{A \in \mathbb{R}^{k \times m}} \|X - DA\|_F^2 + \lambda\|A\|_F^2$, where $X \in \mathbb{R}^{n \times m}$ and $D \in \mathbb{R}^{n \times k}$.
      Solution: $A = (D^T D + \lambda I)^{-1} D^T X$.

    11

  • Sparse coding

    12

  • Sparse linear model

    • Let $x \in \mathbb{R}^n$ be a signal.

    • Let $D = [d_1, d_2, \dots, d_m] \in \mathbb{R}^{n \times m}$ be a set of normalized "basis vectors" (the dictionary).

    • Sparse representation is the problem of finding a sparse vector $\alpha \in \mathbb{R}^m$ such that $x \approx D\alpha$; α is called the sparse code.

    13

    [Figure: $x \approx D\alpha$, with each dictionary atom normalized so that $d_i^T d_i = 1$.]

  • The sparse coding model

    • Objective function:

      $\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\,\varphi(\alpha)$

      (the first term is the data-fitting term, the second the regularization term)

    • The regularization term $\varphi$ can be
      – the l2 norm: $\|\alpha\|_2^2 = \sum_{i=1}^{m} \alpha_i^2$
      – the l0 norm: $\|\alpha\|_0 = \#\{i \mid \alpha_i \neq 0\}$ (sparsity inducing)
      – the l1 norm: $\|\alpha\|_1 = \sum_{i=1}^{m} |\alpha_i|$ (sparsity inducing)
      – …

    14

  • Matching pursuit

    1. Initialization: α = 0, residual r = x
    2. While ||α||₀ < L (or until the residual is small enough):
    3.   Compute the correlations cᵢ = dᵢᵀ r and select the most correlated atom, i = argmaxⱼ |cⱼ|
    4.   Update that coefficient only: αᵢ ← αᵢ + cᵢ
    5.   Update the residual: r ← r − cᵢ dᵢ
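A minimal NumPy sketch of this loop (my own illustration, not code from the slides); it assumes the dictionary columns are unit-norm, as on the sparse linear model slide, and stops after L selected coefficients or once the residual is negligible:

```python
import numpy as np

def matching_pursuit(x, D, L, tol=1e-6):
    """Greedy matching pursuit: x ~ D @ alpha with at most L nonzero coefficients."""
    m = D.shape[1]
    alpha = np.zeros(m)
    r = x.copy()                          # residual
    while np.count_nonzero(alpha) < L and np.linalg.norm(r) > tol:
        c = D.T @ r                       # correlations with the current residual
        i = np.argmax(np.abs(c))          # most correlated atom
        if np.abs(c[i]) < tol:            # residual is (numerically) orthogonal to every atom
            break
        alpha[i] += c[i]                  # only this coefficient is updated
        r = r - c[i] * D[:, i]            # remove its contribution from the residual
    return alpha
```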

  • An example for matching pursuit

    [Figure: a patch x built from latent dictionary elements d1, d2, d3, d4, d5, …]

    First pass, correlations cᵢ = dᵢᵀx: c1 = −0.039, c2 = 0.577, c3 = 0.054, c4 = −0.031, c5 = −0.437. The most correlated atom d2 is selected, and the residual becomes r = x − 0.577·d2.

    Second pass, correlations cᵢ = dᵢᵀr: c1 = −0.035, c2 = 0, c3 = 0.037, c4 = −0.046, c5 = −0.289. Atom d5 is selected, and the residual becomes r = x − 0.577·d2 − (−0.289)·d5. The coefficient of d2 is not updated.

    Reconstructed patch: $\hat{x} = 0.577\,d_2 + (-0.289)\,d_5$, with $\|x - \hat{x}\|_2 = 0.763$.

  • Orthogonal matching pursuit

    1. Initialization: α = 0, residual r = x, active set Ω = ∅
    2. While ||α||₀ < L (or until the residual is small enough):
    3.   Compute the correlations cᵢ = dᵢᵀ r, select the most correlated atom i = argmaxⱼ |cⱼ|, and add it to the active set: Ω ← Ω ∪ {i}
    4.   Re-fit all active coefficients jointly by least squares: α_Ω = argmin_β ‖x − D_Ω β‖₂²
    5.   Update the residual: r ← x − D_Ω α_Ω
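A minimal NumPy sketch of OMP under the same assumptions (unit-norm atoms; my own illustration). The only change from matching pursuit is that every active coefficient is re-fit by least squares after each new atom is selected:

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, L, tol=1e-6):
    """OMP: greedy atom selection with a joint least-squares re-fit at every step."""
    m = D.shape[1]
    alpha = np.zeros(m)
    r = x.copy()
    active = []                                    # active set (Omega)
    while len(active) < L and np.linalg.norm(r) > tol:
        c = D.T @ r                                # correlations with the residual
        i = int(np.argmax(np.abs(c)))
        if i in active:                            # no new atom improves the fit
            break
        active.append(i)
        coef, *_ = np.linalg.lstsq(D[:, active], x, rcond=None)  # re-fit all active coefficients
        alpha[:] = 0.0
        alpha[active] = coef
        r = x - D[:, active] @ coef
    return alpha
```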

  • An example for orthogonal matching pursuit

    [Figure: the same patch x and dictionary elements d1, …, d5 as in the matching pursuit example.]

    First pass, correlations cᵢ = dᵢᵀx: c1 = −0.039, c2 = 0.577, c3 = 0.054, c4 = −0.031, c5 = −0.437. Atom d2 is selected and the residual becomes r = x − 0.577·d2.

    Second pass, correlations cᵢ = dᵢᵀr: c1 = −0.035, c2 = 0, c3 = 0.037, c4 = −0.046, c5 = −0.289. Atom d5 joins the active set, and both active coefficients are re-fit by least squares, giving r = x − 0.499·d2 − (−0.309)·d5.

    Reconstructed patch: $\hat{x} = 0.499\,d_2 + (-0.309)\,d_5$, with $\|x - \hat{x}\|_2 = 0.759$ (slightly better than the 0.763 of matching pursuit).

  • Why does l1-norm induce sparsity?

    • Analysis in 1D (comparison with l2):

      l1: $\min_{\alpha \in \mathbb{R}} \frac{1}{2}(x - \alpha)^2 + \lambda|\alpha|$  ⟹  $\alpha = x - \lambda$ if $x \ge \lambda$, $\alpha = x + \lambda$ if $x \le -\lambda$, and $\alpha = 0$ otherwise (soft thresholding).

      l2: $\min_{\alpha \in \mathbb{R}} \frac{1}{2}(x - \alpha)^2 + \lambda\alpha^2$  ⟹  $\alpha = x/(1 + 2\lambda)$.

    [Figure: α plotted against x. The l1 solution is exactly zero on the interval [−λ, λ] and shifted by ±λ outside it; the l2 solution is a straight line of slope 1/(1 + 2λ), which is zero only at x = 0.]

    19
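A small NumPy sketch of the two closed-form 1-D solutions (my own illustration; the grid of x values and the value of λ are arbitrary):

```python
import numpy as np

def soft_threshold(x, lam):
    """l1 solution: shrink toward zero by lam and clip the interval [-lam, lam] to exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ridge_1d(x, lam):
    """l2 solution: uniform shrinkage; zero only when x itself is zero."""
    return x / (1.0 + 2.0 * lam)

lam = 0.5
x = np.linspace(-2.0, 2.0, 9)
print(soft_threshold(x, lam))   # a whole interval of inputs maps exactly to 0
print(ridge_1d(x, lam))         # every nonzero input stays nonzero
```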

  • Why does l1-norm induce sparsity?

    • Analysis in 2D (comparison with l2):

      $\min_{\alpha \in \mathbb{R}^2} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1 \;\Leftrightarrow\; \min_{\alpha \in \mathbb{R}^2} \frac{1}{2}\|x - D\alpha\|_2^2 \ \text{ s.t. } \ \|\alpha\|_1 \le \mu$

      $\min_{\alpha \in \mathbb{R}^2} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_2^2 \;\Leftrightarrow\; \min_{\alpha \in \mathbb{R}^2} \frac{1}{2}\|x - D\alpha\|_2^2 \ \text{ s.t. } \ \|\alpha\|_2 \le \mu$

    [Figure: in the (α1, α2) plane, the l1 ball {‖α‖₁ ≤ μ} is a diamond whose corners lie on the axes, so the constrained solution tends to land on a corner where one coordinate is exactly zero; the l2 ball {‖α‖₂ ≤ μ} is a disc with no corners, so the solution is generally not on an axis.]

    20

  • Optimality condition for l1-norm regularization

    • Objective: $\min_{\alpha \in \mathbb{R}^m} J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$

    • Directional derivative of J in the direction u at α: $\nabla J(\alpha, u) = \lim_{t \to 0^+} \frac{J(\alpha + tu) - J(\alpha)}{t}$

    • g is a subgradient of J at α if and only if $\forall t \in \mathbb{R}^m,\; J(t) \ge J(\alpha) + g^T (t - \alpha)$

    • Proposition 1: g is a subgradient at α $\iff \forall u \in \mathbb{R}^m,\; g^T u \le \nabla J(\alpha, u)$

    • Proposition 2: if J is differentiable at α, then $\nabla J(\alpha, u) = \nabla J(\alpha)^T u$

    • Proposition 3: α is optimal if and only if $\nabla J(\alpha, u) \ge 0$ for all u

    21

  • Subgradient for l1-norm regularization

    • Example: f(x) = |x|

    [Figure: the graph of f(x) = |x| and of its subgradient, which equals −1 for x < 0, +1 for x > 0, and is the whole interval [−1, 1] at x = 0.]

    • Directional derivative: $\nabla f(x, u) = |u|$ if $x = 0$, and $\nabla f(x, u) = \mathrm{sign}(x)\,u$ if $x \neq 0$.

    22

  • Subgradient for l1-norm regularization

    • For $J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$, the directional derivative is

      $\nabla J(\alpha, u) = -u^T D^T (x - D\alpha) + \lambda \sum_{i:\,\alpha_i \neq 0} \mathrm{sign}(\alpha_i)\,u_i + \lambda \sum_{i:\,\alpha_i = 0} |u_i|$

    • g is a subgradient at α if and only if, for all i,

      $|g_i + d_i^T (x - D\alpha)| \le \lambda$  if $\alpha_i = 0$,
      $g_i = -d_i^T (x - D\alpha) + \lambda\,\mathrm{sign}(\alpha_i)$  if $\alpha_i \neq 0$.

    23
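Setting g = 0 in this characterization gives a concrete optimality test for a candidate α: we need $|d_i^T(x - D\alpha)| \le \lambda$ wherever $\alpha_i = 0$, and $d_i^T(x - D\alpha) = \lambda\,\mathrm{sign}(\alpha_i)$ wherever $\alpha_i \neq 0$. A minimal NumPy sketch of that test (my own illustration; the tolerance is an arbitrary choice):

```python
import numpy as np

def is_lasso_optimal(alpha, x, D, lam, tol=1e-8):
    """Zero-subgradient test for J(alpha) = 0.5*||x - D@alpha||^2 + lam*||alpha||_1."""
    corr = D.T @ (x - D @ alpha)                 # d_i^T (x - D alpha) for every atom
    zero = (alpha == 0)
    ok_zero = np.all(np.abs(corr[zero]) <= lam + tol)
    ok_nonzero = np.allclose(corr[~zero], lam * np.sign(alpha[~zero]), atol=tol)
    return bool(ok_zero and ok_nonzero)
```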

  • First order method for convex optimization

    • Differentiable objective
      – Gradient descent: $\alpha_{t+1} = \alpha_t - \eta_t \nabla J(\alpha_t)$
      – with line search for a descent step size $\eta_t$, or
      – a diminishing step size, e.g. $\eta_t = (t + t_0)^{-1}$

    • Non-differentiable objective
      – Subgradient descent: $\alpha_{t+1} = \alpha_t - \eta_t g_t$, where $g_t$ is a subgradient of J at $\alpha_t$
      – with line search, or a diminishing step size

    24
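For the non-differentiable l1-regularized objective, the subgradient step above can be written out directly. A minimal NumPy sketch (my own illustration; the diminishing step size and the iteration count are arbitrary choices, and plain subgradient descent converges slowly):

```python
import numpy as np

def lasso_subgradient_descent(x, D, lam, n_iter=1000, t0=10.0):
    """Minimize 0.5*||x - D@a||^2 + lam*||a||_1 by subgradient descent with a diminishing step."""
    m = D.shape[1]
    alpha = np.zeros(m)
    for t in range(n_iter):
        # One subgradient: gradient of the smooth part plus lam*sign(alpha)
        # (sign(0) = 0 is a valid element of the subdifferential [-1, 1]).
        g = D.T @ (D @ alpha - x) + lam * np.sign(alpha)
        eta = 1.0 / (t + t0)               # diminishing step size eta_t = (t + t0)^(-1)
        alpha = alpha - eta * g
    return alpha
```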

  • Reformulation as quadratic program

    $\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$

    is equivalent to the smooth problem with nonnegativity constraints

    $\min_{\alpha^+ \in \mathbb{R}^m_+,\ \alpha^- \in \mathbb{R}^m_+} \frac{1}{2}\|x - D\alpha^+ + D\alpha^-\|_2^2 + \lambda\,(\mathbf{1}^T\alpha^+ + \mathbf{1}^T\alpha^-)$, with $\alpha = \alpha^+ - \alpha^-$.

    25
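The split form can be attacked with any method for smooth objectives over the nonnegative orthant. A minimal projected-gradient sketch (my own illustration, not from the slides; the fixed step size is an arbitrary choice and should in practice be set below $1/\|D\|_2^2$):

```python
import numpy as np

def lasso_qp_projected_gradient(x, D, lam, step=0.01, n_iter=2000):
    """Minimize over (a_plus, a_minus) >= 0 by gradient steps followed by clipping at zero."""
    m = D.shape[1]
    a_plus = np.zeros(m)
    a_minus = np.zeros(m)
    for _ in range(n_iter):
        r = D @ (a_plus - a_minus) - x                            # residual D*alpha - x
        a_plus = np.maximum(a_plus - step * (D.T @ r + lam), 0.0)   # gradient step, then project
        a_minus = np.maximum(a_minus - step * (-D.T @ r + lam), 0.0)
    return a_plus - a_minus                                        # alpha = alpha^+ - alpha^-
```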

  • Dictionary Learning

    26

  • Dictionary selection

    • Which D to use?
    • A fixed set of basis functions:
      – steerable wavelets
      – contourlets
      – the DCT basis
      – …
    • A data-adaptive dictionary, learned from the data:
      – K-SVD (l0-norm)
      – online dictionary learning (l1-norm)

    27

  • The objective function for K-SVD

    [Figure: $X \approx DA$; each example (column of X) is a linear combination of atoms from D.]

    $\min_{D, A} \|X - DA\|_F^2$  s.t.  $\forall j,\ \|\alpha_j\|_0 \le L$

    Each example has a sparse representation with no more than L atoms.

    www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

    28

  • K–SVD – An Overview

    [Figure: the K-SVD loop on the training data X: initialize D, then alternate a sparse coding stage (using MP or OMP, with D fixed) and a dictionary update stage (column by column, via an SVD computation).]

    www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

    29

  • K–SVD: Sparse Coding Stage

    With D fixed, the problem $\min_{A} \|X - DA\|_F^2$ s.t. $\forall j,\ \|\alpha_j\|_0 \le L$ decouples over the columns of X. For the jth example we solve

    $\min_{\alpha} \|x_j - D\alpha\|_2^2$  s.t.  $\|\alpha\|_0 \le L$

    that is, ordinary sparse coding!

    www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

    30

  • K–SVD: Dictionary Update Stage

    The dictionary update stage solves $\min_{D} \|X - DA\|_F^2$ (with the same sparsity constraints on A), one atom at a time. For the kth atom we solve

    $\min_{d_k,\ \alpha_T^k} \|E_k - d_k\,\alpha_T^k\|_F^2$, where $E_k = X - \sum_{i \neq k} d_i\,\alpha_T^i$ (the residual without atom k).

    Solve with the SVD: write $E_k = U \Lambda V^T$ and set $d_k = u_1$ (the new coefficient row is taken from the corresponding singular value and right singular vector).

    www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

    31

  • K–SVD: Dictionary Update Stage (continued)

    Only some of the examples use the column $d_k$. When updating $d_k$ and its coefficient row $\alpha_T^k$, restrict the residual $E_k$ to those examples and recompute only their coefficients; this restricted problem is again solved with an SVD.

    www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

    32
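A minimal NumPy sketch of this per-atom update (my own illustration of the step described above, not code from the talk): restrict the residual to the examples whose codes currently use atom k, take the rank-1 SVD approximation, and overwrite the atom and its coefficients in place:

```python
import numpy as np

def ksvd_update_atom(X, D, A, k):
    """One K-SVD atom update: revise dictionary column k and coefficient row k in place."""
    omega = np.nonzero(A[k, :])[0]                 # examples that currently use atom k
    if omega.size == 0:
        return D, A                                # unused atom: leave it (or re-initialize it)
    # Residual without atom k, restricted to those examples.
    E = X[:, omega] - D @ A[:, omega] + np.outer(D[:, k], A[k, omega])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                              # new atom: first left singular vector
    A[k, omega] = s[0] * Vt[0, :]                  # new coefficients: sigma_1 * first right singular vector
    return D, A
```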

  • Compare K-SVD with K-means

    33

    K-SVD: initialize the dictionary → sparse coding (using MP or OMP) → dictionary update, column by column, by an SVD computation → repeat.

    K-means: initialize the cluster centers → assign each vector to a center → update the centers, cluster by cluster → repeat.

  • Dictionary learning with l1-norm regularization

    • Objective function:

      $\min_{D} \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \right)$, where $\alpha_i = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_i - D\alpha\|_2^2 + \lambda\|\alpha\|_1$

    • Advantages of online learning:
      – Handles large and dynamic datasets,
      – Can be much faster than batch algorithms.

    34

  • Dictionary learning with l1-norm regularization

    $F_t(D) = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1\right) = \frac{1}{t}\left(\frac{1}{2}\mathrm{Tr}(D^T D A_t) - \mathrm{Tr}(D^T B_t) + \lambda\sum_{i=1}^{t}\|\alpha_i\|_1\right) + \text{const}$

    where $A_t = \sum_{i=1}^{t} \alpha_i \alpha_i^T$ and $B_t = \sum_{i=1}^{t} x_i \alpha_i^T$ (the constant does not depend on D).

    $\frac{\partial F_t(D)}{\partial D} = \frac{1}{t}\,(D A_t - B_t)$

    For a new sample $x_{t+1}$: $A_{t+1} = A_t + \alpha_{t+1}\alpha_{t+1}^T$, $\quad B_{t+1} = B_t + x_{t+1}\alpha_{t+1}^T$.

    35

  • On-line dictionary learning

    1) Initialization: $D_0 \in \mathbb{R}^{n \times m}$; $A_0 = 0$; $B_0 = 0$
    2) For t = 1, …, T
    3)   Draw $x_t$ from the training data set
    4)   Get the sparse code: $\alpha_t = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1$
    5)   Aggregate sufficient statistics: $A_t = A_{t-1} + \alpha_t\alpha_t^T$, $\quad B_t = B_{t-1} + x_t\alpha_t^T$
    6)   Dictionary update: $D_t = D_{t-1} - \rho\,\frac{\partial F_t(D)}{\partial D}\Big|_{D = D_{t-1}} = D_{t-1} - \frac{\rho}{t}\,(D_{t-1}A_t - B_t)$
    7) End for

    36
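A minimal NumPy sketch of this loop (my own illustration, not code from the slides). Two details are my additions to keep the sketch self-contained: the sparse coding step uses a few soft-thresholded gradient iterations as a simple stand-in for any l1 solver, and the atoms are re-normalized after each update (matching the unit-norm convention of the sparse linear model slide); ρ, the inner iteration count, and the initialization are arbitrary choices:

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def online_dictionary_learning(X, m, lam, rho=0.1, n_epochs=1, seed=0):
    """Online dictionary learning: draw a sample, sparse-code it, accumulate A and B,
    then take a gradient step on the dictionary."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    D = rng.standard_normal((n, m))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    A = np.zeros((m, m))
    B = np.zeros((n, m))
    t = 0
    for _ in range(n_epochs):
        for j in rng.permutation(N):
            t += 1
            x = X[:, j]
            # Sparse coding: soft-thresholded gradient steps on 0.5*||x - D@a||^2 + lam*||a||_1.
            alpha = np.zeros(m)
            eta = 1.0 / np.linalg.norm(D, 2) ** 2
            for _ in range(50):
                alpha = soft_threshold(alpha - eta * (D.T @ (D @ alpha - x)), eta * lam)
            # Aggregate sufficient statistics.
            A += np.outer(alpha, alpha)
            B += np.outer(x, alpha)
            # Dictionary update: gradient step on F_t, then re-normalize the atoms.
            D -= (rho / t) * (D @ A - B)
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D
```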

  • Toolbox - SPAMS

    • SPArse Modeling Software:
      – sparse coding
        • l0-norm regularization
        • l1-norm regularization
        • …
      – dictionary learning
        • K-SVD
        • online dictionary learning
        • …
    • Implemented in C++ with a Matlab interface
    • http://spams-devel.gforge.inria.fr/

    37

  • Summary

    • Sparsity and sparse representation
    • Sparse coding with l0- and l1-norm regularization
      – matching pursuit and orthogonal matching pursuit
      – subgradients and the optimality condition
    • Dictionary learning with l0- and l1-norm regularization
      – K-SVD
      – online dictionary learning
    • Try it out!

    38

  • References

    • T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680-4688, 2011.
    • B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
    • M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.
    • J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

    39

  • Thank you for listening

    40
