
Some topics on Sparsity

Sangnam Nam

August 12, 2010

Why Sparsity?: randomly generated image

Why Sparsity?: DCT of randomly generated image

Compression: Image

Compression: Audio

Denoising

Denoising: Simple example

[figures: signal x and the DCT transform of x]
[figures: signal + noise and the DCT transform of (signal + noise)]
[figure: denoising with hard thresholding at 0.07]
[figure: denoising with hard thresholding at 0.18]
[figures, with an augmented dictionary: transform coefficients and the denoised result]
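The hard-thresholding experiment can be sketched in a few lines of Python. The snippet below is only an illustration under assumed data: a synthetic smooth signal, Gaussian noise, scipy's orthonormal DCT, and the threshold 0.18 from the slide; the actual signal used in the slides is not given.

```python
# Sketch of DCT-domain denoising by hard thresholding (illustrative; not the slide's exact data).
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
n = 256
t = np.linspace(0, 1, n)
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)   # smooth signal -> sparse in DCT
y = x + 0.05 * rng.standard_normal(n)                              # noisy observation

c = dct(y, norm='ortho')                     # transform coefficients of (signal + noise)
tau = 0.18                                   # threshold (the slide shows 0.07 and 0.18)
c_hat = np.where(np.abs(c) > tau, c, 0.0)    # hard thresholding: keep only large coefficients
x_hat = idct(c_hat, norm='ortho')            # denoised reconstruction

print("noisy error   :", np.linalg.norm(y - x))
print("denoised error:", np.linalg.norm(x_hat - x))
```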

Inpainting: Original Image

Inpainting: letter

Inpainting: lines

Inpainting: random

Compressed Sensing

Separation

Super-resolution

Notation/Terminology

- y ∈ R^d: the signal
- φ ∈ Φ: a column of Φ ∈ R^{d×K}
- {φ : φ ∈ Φ} spans R^d ⇒ Φ: a dictionary for R^d
- φ ∈ Φ: an atom
- y = Φx ⇒ x: a coefficient vector / representation of y

Sparse Model

ℓ0-minimization

To find the sparsest representation of y, we solve

    min_x ‖x‖_0  subject to  y = Φx

How do we know we have the solution?

Uniqueness of Sparse Solutions

- Let ‖x‖_0 := #{i : x_i ≠ 0}.
- Let spark(Φ) := min_{z ∈ ker(Φ), z ≠ 0} ‖z‖_0.
- If y = Φx and ‖x‖_0 < spark(Φ)/2,
  then x is the unique solution of min_z ‖z‖_0 subject to y = Φz.
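Computing the spark is combinatorial in general, but for a tiny dictionary it can be brute-forced, which makes the uniqueness bound concrete. A rough sketch (my own, illustrative only; the function name and sizes are arbitrary):

```python
# Brute-force spark computation for a tiny dictionary (cost is exponential in K).
import itertools
import numpy as np

def spark(Phi, tol=1e-10):
    """Smallest number of columns of Phi that are linearly dependent."""
    d, K = Phi.shape
    for s in range(1, K + 1):
        for cols in itertools.combinations(range(K), s):
            if np.linalg.matrix_rank(Phi[:, cols], tol=tol) < s:   # dependent subset found
                return s
    return K + 1   # all columns independent (only possible when K <= d)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((4, 6))
sp = spark(Phi)                     # for a generic random 4x6 matrix this is d + 1 = 5
print("spark =", sp)
print("unique l0 solution guaranteed for sparsity <", sp / 2)
```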

ℓ0-minimization

Good, we (kind of) know when we have the solution.
Unfortunately, solving the ℓ0-minimization problem is NP-hard.
What to do?

Alternatives

- Greedy algorithms / Matching Pursuit
- Convex optimization / Basis Pursuit

Matching Pursuit

Matching Pursuit [Mallat & Zhang]

1. Given a signal y and a dictionary Φ.
2. Set k = 0, y_0 = 0, r_0 = y.
3. Find φ := arg max_{φ ∈ Φ} |〈φ, r_k〉|.
4. Update y_{k+1} := y_k + 〈r_k, φ〉φ and r_{k+1} := r_k − 〈r_k, φ〉φ.
5. Increment k.
6. If y_k uses a specified number of atoms or ‖r_k‖ is smaller than a specified error, stop. Otherwise, go to step 3.
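A minimal numpy sketch of the loop above, assuming the atoms are the unit-norm columns of an array Phi (variable names are mine, not from the slides):

```python
# Minimal Matching Pursuit sketch (assumes the columns of Phi have unit l2-norm).
import numpy as np

def matching_pursuit(y, Phi, n_atoms=10, tol=1e-6):
    d, K = Phi.shape
    x = np.zeros(K)            # coefficient vector (an atom may be selected more than once)
    r = y.copy()               # residual r_0 = y
    for _ in range(n_atoms):
        corr = Phi.T @ r                       # inner products <phi_i, r_k>
        i = np.argmax(np.abs(corr))            # atom most correlated with the residual
        x[i] += corr[i]                        # update coefficient
        r = r - corr[i] * Phi[:, i]            # r_{k+1} = r_k - <r_k, phi> phi
        if np.linalg.norm(r) < tol:
            break
    return x, r

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 50))
Phi /= np.linalg.norm(Phi, axis=0)             # normalize the atoms
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x_mp, r = matching_pursuit(Phi @ x_true, Phi, n_atoms=30)
print("residual norm:", np.linalg.norm(r))
```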

Energy Preservation

Energy Preservation (Pythagoras' theorem)

    ‖r_k‖_2^2 = ‖r_{k+1}‖_2^2 + |〈r_k, φ〉|^2

Drawback of Matching Pursuit?

MP can select an atom more than once.

Orthogonal Matching Pursuit

Orthogonal Matching Pursuit (OMP)

1. Given a signal y and a dictionary Φ.
2. Set k = 0, Λ = ∅, y_0 = 0, r_0 = y.
3. Find i_k := arg max_i |〈φ_i, r_k〉| and set Λ := Λ ∪ {i_k}.
4. Compute Δy := (the best ℓ2-approximation to r_k from the span of Φ_Λ), update y_{k+1} := y_k + Δy, and r_{k+1} := r_k − Δy.
5. Increment k.
6. If y_k uses a specified number of atoms or ‖r_k‖ is smaller than a specified error, stop. Otherwise, go to step 3.
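A corresponding OMP sketch: the least-squares (projection) step replaces MP's rank-one update, so each atom in Λ is used at most once. Again an illustration with assumed names and sizes:

```python
# Minimal Orthogonal Matching Pursuit sketch (unit-norm atoms assumed).
import numpy as np

def omp(y, Phi, n_atoms=10, tol=1e-6):
    d, K = Phi.shape
    support = []               # Lambda
    r = y.copy()
    x = np.zeros(K)
    for _ in range(n_atoms):
        i = np.argmax(np.abs(Phi.T @ r))       # most correlated atom
        if i not in support:
            support.append(i)
        # best l2-approximation of y from span(Phi_Lambda)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        r = y - Phi[:, support] @ coef         # new residual is orthogonal to span(Phi_Lambda)
        if np.linalg.norm(r) < tol:
            break
    x[support] = coef
    return x, r

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 50))
Phi /= np.linalg.norm(Phi, axis=0)
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x_omp, r = omp(Phi @ x_true, Phi, n_atoms=5)
print("recovered support:", np.flatnonzero(x_omp))
```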

Variety of Matching Pursuits

- Morphological Component Analysis [MCA, Bobin et al.]
- Stagewise OMP [Donoho et al.]
- CoSaMP [Needell & Tropp]
- ROMP [Needell & Vershynin]
- Iterative Hard Thresholding [Blumensath & Davies]

Sampling Issue

To find a good approximation of f : [0, 5] → R, one may sample the function values

1. uniformly at N locations for some large N, or
2. adaptively (more locations near 5 in the figure) at some large number N of locations,

and then use linear interpolation, cubic spline approximation, etc.

What if it is a polynomial of degree 6?

If we know it is a polynomial of degree 6, one needs only 7 distinct samples, and the approximation is perfect!

What if it is a sum of 6 monomials?

What if it is a sum of 6 monomials of degree less than 200? How many samples do we need?
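The degree-6 remark is easy to check numerically: 7 samples determine the polynomial exactly. A small sketch (coefficients and sample locations are arbitrary):

```python
# Seven samples determine a degree-6 polynomial exactly (arbitrary coefficients for illustration).
import numpy as np

coeffs = [0.25, 0.0, 0.0, 0.5, 0.0, -1.0, 2.0]     # degree-6 polynomial, highest degree first
t = np.linspace(0, 5, 7)                            # any 7 distinct sample locations in [0, 5]
samples = np.polyval(coeffs, t)

recovered = np.polyfit(t, samples, deg=6)           # interpolation: 7 equations, 7 unknowns
print(np.allclose(recovered, coeffs))               # True: the fit is exact up to roundoff
```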

Think about Economy!

Digital Camera

1. A high-resolution photo (e.g. of size 1024×1024) requires a large number of samples (e.g. 1024×1024 samples).
2. The image is (immediately) compressed to a smaller, lossy JPEG file.
3. Not a big deal for digital cameras?

Shannon Sampling / Nyquist Rate

Nyquist rate ⇒ 44.1 kHz sampling rate ⇒ large audio file ⇒ compress to a lossy MP3 file.

Fewer Samples

- Can we sample less to begin with?
- How do we recover/approximate the original data?
- Key idea to exploit: sparsity

ℓ1-minimization

Relax

    min_x ‖x‖_0  subject to  y = Φx

Solve

    min_x ‖x‖_1  subject to  y = Φx

Convex problem. Can be recast as a linear program.
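One standard recast splits x into nonnegative parts, x = u − v with u, v ≥ 0, so that ‖x‖_1 = Σ_i (u_i + v_i) and the constraint stays linear. A small sketch using scipy's linprog on a random instance (illustrative only; the dictionary and signal are placeholders):

```python
# Basis pursuit (min ||x||_1 s.t. y = Phi x) recast as a linear program via x = u - v, u, v >= 0.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
d, K = 20, 60
Phi = rng.standard_normal((d, K))
x_true = np.zeros(K); x_true[[5, 22, 40]] = [1.5, -1.0, 2.0]
y = Phi @ x_true

c = np.ones(2 * K)                          # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([Phi, -Phi])               # constraint: Phi @ (u - v) = y
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_bp = res.x[:K] - res.x[K:]
print("recovery error:", np.linalg.norm(x_bp - x_true))
```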

Why does ℓ1-minimization work?

[figure: feasible solutions]
[figures: ℓ2-balls, ℓ1-balls, ℓp-balls for 0 < p < 1]
[figures: the ℓ2-minimizer, ℓ1-minimizer, and ℓp-minimizer for 0 < p < 1]

When can we guarantee successful recovery?

Null Space Property

Notation:
N(Φ) := kernel / null space of Φ
Λ := support of x*
Λ^c := complement of Λ

Null Space Property

    min_x ‖x‖_1  s.t.  Φx = Φx*        (A)

ℓ1-minimization (A) recovers x* if and only if

    |〈z, sign(x*)〉| < ‖z_{Λ^c}‖_1

ℓ1-minimization (A) recovers every x* with support Λ if

    ‖z_Λ‖_1 < ‖z_{Λ^c}‖_1

ℓ1-minimization (A) recovers every k-sparse x* if

    ‖z_Λ‖_1 < ‖z_{Λ^c}‖_1

for all nonzero z ∈ N(Φ) and every Λ with #Λ ≤ k.

Uniqueness for ℓ1 using NSP

If x* is k-sparse and 0 ≠ z ∈ N(Φ), then, using the null space property,

    ‖x* + z‖_1 ≥ ‖x*‖_1 − ‖z_Λ‖_1 + ‖z_{Λ^c}‖_1 > ‖x*‖_1.

I.e., x* is the unique ℓ1-minimizer.

Coherence

Coherence: M(Φ) := max_{i≠j} |〈φ_i, φ_j〉|.

[Gribonval, Nielsen]

Theorem. If x* is a solution of y = Φx with

    ‖x*‖_0 ≤ (1/2)(1 + 1/M(Φ)),

then x* is the unique minimizer of the ℓ1-problem. x* is also the unique minimizer of the ℓ0-problem!

Derivation of ℓ1-guaranteeing sparsity level

For z ∈ N(Φ),

    φ_j z_j = − Σ_{i≠j} φ_i z_i.

Taking the inner product with the unit-norm atom φ_j and using |〈φ_j, φ_i〉| ≤ M(Φ) gives

    |z_j| ≤ M(Φ) Σ_{i≠j} |z_i|
    (1 + M(Φ)) |z_j| ≤ M(Φ) ‖z‖_1
    (1 + M(Φ)) ‖z_Λ‖_1 ≤ (#Λ) M(Φ) ‖z‖_1

⇒ ‖z_Λ‖_1 ≤ ‖z‖_1 / 2  whenever  #Λ ≤ (1/2)(1 + 1/M(Φ)).
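A quick numerical check of the coherence bound, assuming a random dictionary with columns normalized to unit length (as the derivation above requires):

```python
# Coherence of a dictionary and the resulting l1/l0 recovery bound (1/2)(1 + 1/M(Phi)).
import numpy as np

rng = np.random.default_rng(4)
d, K = 64, 128
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)          # coherence arguments assume unit-norm atoms

G = np.abs(Phi.T @ Phi)                     # absolute inner products between atoms
np.fill_diagonal(G, 0.0)
M = G.max()                                 # coherence M(Phi) = max_{i != j} |<phi_i, phi_j>|

print("coherence M(Phi)       :", M)
print("guaranteed sparsity <= :", 0.5 * (1 + 1 / M))   # typically on the order of sqrt(d)
```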

Coherence gives suboptimal estimates

For Φ ∈ R^{d×K}, we may suppose M(Φ) = O(1/√d).

We can then recover x* whenever x* is (1/2)(1 + O(√d)) = O(√d)-sparse.

To put it differently, in order to recover a k-sparse x*, on the order of d = k^2 measurements are needed.

Recovery of the Logan-Shepp phantom from Fourier transform samples

[figure: the image to recover]

[figure: available samples in the Fourier domain (2.7% (4.3%))]

[figure: reconstructed image: perfect!]

How many samples do we need?

To recover [figure: a k-sparse signal of length N], we (obviously) need k ≤ m ≤ N samples.

We need to identify the k nonzero positions out of N locations, which suggests about

    log2 C(N, k) ≈ k log(N/k)

measurements.
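The counting estimate can be checked directly: log2 of the number of possible supports, log2 C(N, k), is of the same order as k log2(N/k). A short sketch (the values of N and k are arbitrary):

```python
# Information-theoretic count: identifying k nonzero positions out of N needs about log2(C(N, k)) bits.
import numpy as np
from scipy.special import gammaln

def log2_binom(N, k):
    """log2 of the binomial coefficient C(N, k), computed stably via log-gamma."""
    return (gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)) / np.log(2)

N, k = 10_000, 50
print("log2 C(N, k)   :", log2_binom(N, k))     # a few hundred bits
print("k * log2(N / k):", k * np.log2(N / k))   # same order of magnitude
```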

Random Sampling of Fourier Transform

[Candès, Romberg, Tao]

Theorem. Let N = n^2, let f be an n × n real-valued image supported on T, and let Ω ⊂ {1, . . . , n}^2 be a set of frequencies chosen uniformly at random. If

    #Ω ≥ C (#T) log N,

then with high probability (1 − O(N^{−M})), the minimizer of

    min_g ‖g‖_1  s.t.  ĝ|_Ω = f̂|_Ω

is unique and is equal to f.

Restricted Isometry Property

For s ∈ N, a matrix Φ satisfies the Restricted Isometry Property with isometry constant δ_s if

    (1 − δ_s) ‖x‖_2^2 ≤ ‖Φx‖_2^2 ≤ (1 + δ_s) ‖x‖_2^2

for all s-sparse signals x.
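Computing δ_s exactly is again combinatorial, but sampling random supports and inspecting the extreme singular values of the corresponding submatrices gives a Monte Carlo lower bound on it. A rough sketch for a Gaussian matrix (my own illustration, not a method from the slides):

```python
# Monte Carlo lower bound on the restricted isometry constant delta_s of a Gaussian matrix.
import numpy as np

rng = np.random.default_rng(5)
m, d, s = 80, 400, 8
Phi = rng.standard_normal((m, d)) / np.sqrt(m)     # i.i.d. N(0, 1/m) entries

delta_lb = 0.0
for _ in range(2000):                              # sample random supports of size s
    T = rng.choice(d, size=s, replace=False)
    sv = np.linalg.svd(Phi[:, T], compute_uv=False)
    # on this support, the best delta is max deviation of the squared singular values from 1
    delta_lb = max(delta_lb, abs(sv[0] ** 2 - 1), abs(sv[-1] ** 2 - 1))

print("estimated lower bound on delta_s:", delta_lb)
```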

Perfect Recovery condition in terms of RIP constant

[Candès]

Theorem. Assume that δ_2s < √2 − 1. Then the solution x* of

    min_z ‖z‖_1  s.t.  Mz = Mx

satisfies

    ‖x* − x‖_2 ≤ C_0 s^{−1/2} ‖x − x_s‖_1,

where C_0 is a constant and x_s is the best s-sparse approximation of x.

Remark: The sensing matrix M is non-adaptive, but ℓ1-minimization with M recovers every s-sparse signal exactly.

RIP and NSP

[Candès]

Theorem. If h is in the null space of Φ, then

    ‖h_T‖_1 ≤ ρ ‖h_{T^c}‖_1,   ρ := √2 δ_2s (1 − δ_2s)^{−1},

for every T with #T = s.

Examples of Measurement Matrices with good RIP

- Random matrices with i.i.d. entries. If the entries of Φ ∈ R^{m×d} are drawn i.i.d. from a Gaussian with mean 0 and variance 1/m, then with overwhelming probability the RIC δ_s is less than √2 − 1 when s ≤ C m / log(d/m).
- Fourier ensemble. Φ ∈ R^{m×d} is constructed by sampling m random rows of the discrete Fourier transform matrix.
- General orthogonal measurement ensembles.

Questions

1. ‘Random dictionaries’ are good for compressed sensing. Are there any deterministic dictionaries that are as good as random ones?

2. Any alternative to RIP?

3. What happens when the model is not exactly sparse?

4. What happens when there is noise?

Stable and Robust Recovery

We observe

    y = Φx + n.

We reconstruct x as the solution to

    min_x ‖x‖_1  s.t.  ‖y − Φx‖_2 ≤ ε.

Stable and Robust Recovery

[Candès]

Theorem. Assume that δ_2s < √2 − 1 and ‖n‖_2 ≤ ε. Then the solution x* satisfies

    ‖x* − x‖_2 ≤ C_0 s^{−1/2} ‖x − x_s‖_1 + C_1 ε,

where C_0 and C_1 are constants.

Global Optimization Approach for Noisy Problem

We solve

    min_x (1/2) ‖Mx − y‖_2^2 + λ ‖x‖_1

for some λ > 0.

What happens as λ → 0? As λ → ∞?
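One standard way to solve this penalized problem is iterative soft thresholding (ISTA), a proximal-gradient method; it is used here only as an illustration, not as the method of the slides. (Informally: as λ → ∞ the penalty dominates and the minimizer shrinks to 0, while as λ → 0 the data-fit term dominates.) The step size and λ below are arbitrary:

```python
# ISTA (iterative soft thresholding) for min_x 0.5 * ||M x - y||_2^2 + lam * ||x||_1.
import numpy as np

def ista(M, y, lam, n_iter=500):
    L = np.linalg.norm(M, 2) ** 2              # Lipschitz constant of the gradient of the quadratic term
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ x - y)               # gradient step on the smooth part
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft thresholding = prox of lam*||.||_1
    return x

rng = np.random.default_rng(6)
M = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[[7, 30, 71]] = [2.0, -1.5, 1.0]
y = M @ x_true + 0.05 * rng.standard_normal(40)

x_hat = ista(M, y, lam=0.1)
print("support found:", np.flatnonzero(np.abs(x_hat) > 1e-3))
```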

Compressed Sensing

Paper: "Compressed Sensing" by David Donoho

    ‖θ − θ_N‖_2 ≤ C (N + 1)^{1/2 − 1/p},   N = 0, 1, . . . ,

where θ_N keeps the N largest entries of θ.

We Need Good Dictionaries

Techniques/algorithms of sparse modeling begin with some dictionary Φ that provides sparse representations of the class of signals of interest. What are those Φ's?

Good Dictionaries

- Smooth functions ⇒ Fourier transform
- Smooth functions with point singularities ⇒ wavelets
- Singularities along smooth curves ⇒ curvelets, shearlets, etc.
- . . .

PCA / SVD?

We want to design/learn dictionaries from data.

Principal Component Analysis ⇒ not suitable for non-Gaussian mixtures.

[Olshausen & Field] used a sparsity-promoting regularization to learn a dictionary ⇒ localized, oriented, bandpass receptive fields emerge.

Dictionary Learning via ℓ1-criterion

If Φ were a good dictionary for a signal y, then we could recover its representation x by solving

    min_x ‖x‖_1  s.t.  y = Φx.

⇒ Given a data set Y := [y_1, . . . , y_N], we may try to find a good dictionary Φ by solving

    min_{Φ,X} ‖X‖_1  s.t.  Y = ΦX.

Some issues

- We can make ‖X‖_1 as small as we wish by rescaling, because

      (αΦ)(α^{−1}X) = ΦX

  for any α ≠ 0.

- The problem

      min_{Φ,X} ‖X‖_1  s.t.  Y = ΦX

  is not convex.

The first issue can be dealt with by requiring each column of Φ to be of unit length. We will write Φ ∈ U(d, K) to mean that Φ ∈ R^{d×K} is such a dictionary.
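A common generic strategy for this non-convex problem is to alternate between a sparse-coding step with Φ fixed and a dictionary update with X fixed, renormalizing the columns so that Φ stays in U(d, K). The sketch below shows only that generic idea; the Lasso-based coding step and the pseudoinverse update are my choices, not the algorithm analyzed in these slides:

```python
# Generic alternating-minimization sketch for dictionary learning with unit-norm columns.
import numpy as np
from sklearn.linear_model import Lasso

def learn_dictionary(Y, K, n_iter=20, alpha=0.1, seed=0):
    d, N = Y.shape
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((d, K))
    Phi /= np.linalg.norm(Phi, axis=0)                 # Phi in U(d, K): unit-norm columns
    for _ in range(n_iter):
        # sparse coding step: penalized l1 fit of each training signal with Phi fixed
        coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=2000)
        X = np.column_stack([coder.fit(Phi, Y[:, n]).coef_ for n in range(N)])
        # dictionary update step: least squares in Phi, then renormalize the columns
        Phi = Y @ np.linalg.pinv(X)
        Phi /= np.maximum(np.linalg.norm(Phi, axis=0), 1e-12)
    return Phi, X

rng = np.random.default_rng(7)
Y = rng.standard_normal((16, 200))        # placeholder data; real use would be image patches, etc.
Phi, X = learn_dictionary(Y, K=32)
print(Phi.shape, X.shape)
```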

Dictionary Identification

General question: Given a data set Y, we learned a dictionary Φ via some learning method. Is Φ the ‘right’ one?

Dictionary Identification

Suppose that the training data Y is generated by

    Y = Φ_0 X_0.

Under what conditions on X_0 (and Φ_0) can we be sure that the minimization problem

    min_{Φ,X} ‖X‖_1  s.t.  Y = ΦX

will recover the pair (Φ_0, X_0)?

Necessary and Sufficient Condition

[Gribonval & Schnass]

Theorem. Suppose that K ≥ d. (Φ_0, X_0) is a strict local minimum of the problem

    min_{Φ,X} ‖X‖_1  s.t.  ΦX = Φ_0 X_0,  Φ ∈ U(d, K)

if and only if

    |〈C X_0 + V, sign(X_0)〉| < ‖(C X_0 + V)_{Λ^c}‖_1

for every C X_0 + V ≠ 0 with diag(Φ_0^* Φ_0 C) = 0 and V ∈ N(Φ_0).

Sufficient Condition for Basis

Notation:

Sufficient Condition for Basis

Theorem. (Φ_0, X_0) is a strict local minimum of the problem

    min_{Φ,X} ‖X‖_1  s.t.  ΦX = Φ_0 X_0,  Φ ∈ U(d, d)

if for every k = 1, . . . , d there exists d_k ∈ R^d with ‖d_k‖_∞ < 1 such that

    X_k d_k = X_k s_k − diag((‖x^i‖)_i) m_k,

where m_k is the k-th column of Φ_0^* Φ_0.

Geometric Interpretation

[figures: a local minimum vs. not a local minimum]

    u_k := X_k s_k − diag((‖x^i‖)_i) m_k,   Q_k := [−1, 1]^{r_k}.

Bernoulli-Gaussian Model

X_0 follows the Bernoulli-Gaussian model with parameter 0 < p < 1 if each entry of X_0 is the product of an independent Bernoulli(p) random variable ξ_ij and a normal random variable g_ij, i.e.,

    x_ij = ξ_ij g_ij.
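Drawing a coefficient matrix from this model is straightforward; a brief sketch (sizes and p are arbitrary):

```python
# Draw a coefficient matrix X0 from the Bernoulli-Gaussian model: x_ij = xi_ij * g_ij.
import numpy as np

def bernoulli_gaussian(K, N, p, seed=0):
    rng = np.random.default_rng(seed)
    xi = rng.random((K, N)) < p                # Bernoulli(p) support indicators
    g = rng.standard_normal((K, N))            # Gaussian amplitudes
    return xi * g

X0 = bernoulli_gaussian(K=32, N=1000, p=0.1)
print("average nonzeros per column:", np.count_nonzero(X0) / 1000)   # about K * p = 3.2
```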

Concentration of Measure Phenomena

If X_0 follows the Bernoulli-Gaussian distribution with parameter p, then with high probability X_k Q_k contains the ℓ2-ball of radius

    α ≈ N p (1 − p) √(2/π),

X_k s_k is contained in the ℓ2-ball of radius

    β ≈ √(NK) p,

and diag((‖x^i‖)_i) m_k is contained in the ℓ2-ball of radius

    γ ≈ N p √(2/π).

Sufficient Condition under Bernoulli-Gaussian Model

If the coherence M(Φ_0) of Φ_0 is less than 1 − p, then for large enough N, Φ_0 is locally identifiable.

Surprise: one needs only

    N ≥ C K log K

training samples.

Some dictionary learning methods

- K-SVD [Elad et al.]
- ISI [Gowreesunker]
- . . .
