TRANSCRIPT (retrieved 7/29/2019)
The Power of Penalties
in Signal Processing
Paul Eilers and Johan de Rooi
Erasmus Medical Center, Rotterdam, The Netherlands
LASIR, April 2012
Signals in real life
You can't always get what you want (Mick Jagger)
We can measure a lot
But there always are problems, small or large
Noise and artifacts
Drifting baselines
Convoluted signals
Usually a combination
I'll show some examples, and previews of solutions
A noisy signal
[Figure: XRD peak, lambda = 10000; left panel counts (linear) vs. angle (130-134 degrees), right panel log10(counts) vs. angle.]
A drifting baseline
[Figure, two panels: data and fitted baseline; fitted baseline subtracted.]
A baseline in time-resolved spectroscopy
[Figure, four panels: data; estimated baseline; artefact; "background" weights.]
Segmented smoothing
[Figure, two panels: log2(CNV signal) vs. position on chromosome (Mbase), arrays GBM 139.CEL and GBM 2032.CEL.]
Deconvolution of spikes
[Figure, three panels: data and fit (penalty parameters 0.02 and 0.0001); estimated pulse coefficients; pulse shapes, initial and final estimate.]
Smoothing
Signals, raw and smooth
We have a signal, y, and compute another signal, z
We want z to be close to y
We have ways to measure that
Sum of squares: sum_i (y_i - z_i)^2 = ||y - z||^2
Sum of absolute values: sum_i |y_i - z_i| = |y - z|_1
Or more complicated objective functions to minimize
Like in regression or convolution: ||y - Cz||^2
Or using (adaptive) weights: sum_i w_i (y_i - z_i)^2 = (y - z)' W (y - z)
Desired properties and penalties
We want z to have desirable properties
Smooth everywhere
Or piece-wise constant
Or consisting of a few spikes
Invent a penalty, another objective function, working on z
It has to be small if z has the desired property
Combine the objective functions for fit and penalty
Minimize that combination
A simple smoother: Whittaker
Whittaker (1923) proposed graduation: minimize
S_2 = sum_i (y_i - z_i)^2 + lambda * sum_i (Delta^d z_i)^2
Given a noisy data series y, it finds a smoother series z
The operator Delta^d forms differences of order d: Delta z_i = z_i - z_{i-1}
Today we call this penalized least squares
Explicit solution, with matrix D such that Delta^d z = Dz:
(I + lambda D'D) z = y
The Whittaker smoother (simulated data)
[Figure: the Whittaker smoother on simulated data, with lambda = 10, 1000, and 1e4.]
Sparseness
Many equations (one per observation), but a banded system
Computation time is linear in data length
R package spam is great (sparse matrices, Matlab-style)
System with 4000 equations solved in 20 milliseconds
# Whittaker smoother
m = length(y)
E = diag.spam(m)
D = diff(E, diff = 2)
P = lambda * t(D) %*% D
z = solve(E + P, y)
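The same smoother translates directly to Python; a minimal sketch, assuming scipy's sparse module (the function name `whittaker` is ours):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker(y, lam=1e4, d=2):
    """Whittaker smoother: solve the banded system (I + lambda * D'D) z = y."""
    m = len(y)
    D = sparse.eye(m, format="csr")
    for _ in range(d):                # order-d difference matrix, shape (m - d, m)
        D = D[1:] - D[:-1]
    P = lam * (D.T @ D)
    return spsolve(sparse.eye(m, format="csc") + P, y)
```

The system stays banded, so the solve remains linear in the data length, just as with spam in R.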
Whittaker for Poisson counts
[Figure: XRD peak, lambda = 10000; counts (linear) and log10(counts) vs. angle.]
Segmented smoothing
Copy number variations in DNA
Normal DNA comes in pairs of chromosomes
But not so in tumors
Some parts may be missing, others doubled, tripled, or more
Different changes in each chromosome
These are called copy number variations (CNV)
We can use SNP microarrays to measure them in very many places
But we get a noisy signal
We expect constant segments
The Whittaker smoother for segments?
[Figure, two panels: log2(CNV signal) vs. position on chromosome (Mbase), arrays GBM 139.CEL and GBM 2032.CEL.]
The L1 penalty in action
[Figure, two panels: log2(CNV signal) vs. position on chromosome (Mbase), arrays GBM 139.CEL and GBM 2032.CEL.]
Computation for the L1 penalty
We could try quadratic programming techniques
But there is an easier solution
For any x and an approximation x~ we have |x| = x^2/|x| ~= x^2/|x~|
Use a weighted L2 penalty, with v_i = 1/|Delta z~_i|:
S_1 = sum_i (y_i - z_i)^2 + lambda * sum_i v_i (Delta z_i)^2
Iteratively update v and z
Solve (I + lambda D'VD) z = y repeatedly, with V = diag(v)
Some smoothing near 0: use v_i = 1/sqrt((Delta z~_i)^2 + beta^2), with beta small
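The iteration above can be sketched in Python (our own helper, assuming scipy; the lambda and beta values are illustrative):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def l1_smooth(y, lam=1.0, beta=1e-4, n_iter=30):
    """Piecewise-constant smoothing by iterative reweighting:
    solve (I + lambda * D'VD) z = y with v_i = 1/sqrt((dz_i)^2 + beta^2)."""
    m = len(y)
    E = sparse.eye(m, format="csr")
    D = (E[1:] - E[:-1]).tocsc()          # first-order differences
    z = y.copy()
    for _ in range(n_iter):
        v = 1.0 / np.sqrt(np.diff(z) ** 2 + beta ** 2)
        P = lam * (D.T @ sparse.diags(v) @ D)
        z = spsolve(sparse.eye(m, format="csc") + P, y)
    return z
```

On a noisy step signal the fit collapses to nearly flat segments with a sharp jump, which is what the L1 penalty buys over the L2 version.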
The L0 penalty in action
[Figure, two panels: log2(CNV signal) vs. position on chromosome (Mbase), arrays GBM 139.CEL and GBM 2032.CEL.]
Baseline estimation
A drifting chromatograph signal
[Figure: drifting chromatogram, signal vs. time (s).]
A picture of rice grains
[Figure: rice grains on background (image), with cross-sections along the red and blue lines.]
Modelling strategy
Smooth background curve (surface)
B-splines with penalty to tune smoothness (P-splines)
Special mixture of distributions
Normal distribution for noise
Unknown one-sided distribution for signal
Simulated data in 1D
[Figure, two panels: simulated constant background; simulated variable background.]
A statistical model for constant background
Observed: y
Mixture model for the distribution of y:
f(y) = p g(y | mu, sigma) + (1 - p) h(y - mu)
g normal, with mu (background level) and sigma unknown
h unspecified, supported only on the positive axis
Mixing ratio p unknown
Illustrating the mixture idea
[Figure: two-component mixture for baseline and peaks; normal density g(y | mu, sigma) for the background, one-sided density h(.) for the peaks.]
EM estimation
Suppose we knew the distribution parameters approximately
Take one y_i, compute (Bayes)
w_i1 = p g(y_i) / (p g(y_i) + (1 - p) h(y_i))
Then w_i1 is the probability of y_i coming from g (background)
And similarly w_i2 = 1 - w_i1 for y_i coming from h (signal)
Use y with weights w_1 to improve mu and sigma
Use w_2 to improve the nonparametric estimate of h
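The E-step can be sketched as follows (our own helper; we plug in an exponential density as a stand-in for h, since the real h is estimated non-parametrically in the slides):

```python
import numpy as np
from scipy import stats

def background_weights(y, mu, sigma, p, scale):
    """E-step of the mixture: w_i1 = p g(y_i) / (p g(y_i) + (1 - p) h(y_i)).
    g is normal(mu, sigma); h is exponential on y > mu (a stand-in only)."""
    g = stats.norm.pdf(y, loc=mu, scale=sigma)
    h = np.where(y > mu, stats.expon.pdf(y - mu, scale=scale), 0.0)
    return p * g / (p * g + (1 - p) * h)
```

Points near the background level get weight near 1, points far into the one-sided signal tail get weight near 0; those weights then feed the M-step updates of mu and sigma.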
Showing the weights
[Figure, three panels: simulated data with constant background and estimate; estimated weights; background and signal distributions (green: unsmoothed).]
Non-parametric density estimation
Variation on Whittaker smoother
Construct a histogram (100 bins) of y
Sum w_2 in the bins to get pseudo-counts t
Smooth t, with E(t_j) = exp(z_j) and a Poisson-type likelihood
Difference penalty on z
Deals well with the left discontinuity
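This variation can be sketched with Newton iterations in Python (our own code, assuming scipy; the penalized Poisson update solves (W + lambda D'D) z_new = W z + (t - mu) with W = diag(mu)):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def smooth_counts(t, lam=10.0, n_iter=50):
    """Penalized Poisson smoothing of histogram counts t:
    E(t_j) = exp(z_j), second-order difference penalty on z."""
    m = len(t)
    E = sparse.eye(m, format="csr")
    D = (E[2:] - 2 * E[1:-1] + E[:-2]).tocsc()   # second differences
    P = lam * (D.T @ D)
    z = np.log(t + 1.0)                          # starting values
    for _ in range(n_iter):
        mu = np.exp(z)
        W = sparse.diags(mu)
        z_new = spsolve(W + P, W @ z + (t - mu))
        if np.max(np.abs(z_new - z)) < 1e-8:
            return np.exp(z_new)
        z = z_new
    return np.exp(z)
```

Because the constant vector lies in the null space of the difference penalty, the fitted expected counts sum to the observed counts at convergence.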
Background with trend
Model: y(x_i) = v(x_i) + u_i
Smooth trend v
Mixture model for the distribution of u:
f(u) = p g(u | 0, sigma) + (1 - p) h(u)
g normal, with sigma unknown
h unspecified, supported only on the positive axis
Mixing ratio p unknown
Estimating a varying background
Model trend with B-splines:
v_i = v(x_i) = sum_j B_j(x_i) alpha_j
Add a difference penalty on the coefficients alpha_j
This is the P-spline approach
Use EM procedure as before (split and fit)
Weights follow from residuals
Model is re-estimated with these weights
P-splines illustrated
[Figure, two panels: light penalty; heavier penalty.]
Fitting a background trend
[Figure, two panels: simulated data with varying background and P-splines estimate; background and signal distributions (green: unsmoothed).]
Chromatogram background
[Figure, two panels: data and fitted baseline; fitted baseline subtracted.]
Chromatogram background (detail)
[Figure, two panels (detail, 300-1000): data and fitted baseline; fitted baseline subtracted.]
Two-dimensional smoothing with P-splines
Tensor products of B-splines:
B_jk(x, y) = B_j(x) B_k(y)
Equally spaced knots on a 2D grid
Matrix of coefficients A = [alpha_jk]:
z_i = sum_j sum_k B_j(x_i) B_k(y_i) alpha_jk
Penalties on rows and columns of A
Tensor product basis for 2-D baseline
[Figure: tensor product B-spline basis over wavelength (nm) and temperature (C).]
Peaks as a nuisance: femtosecond spectroscopy
[Figure, four panels: data; estimated baseline; artefact; "background" weights.]
Spike deconvolution
Pulse-like signals
Some signals are series of pulses: spike trains
We encounter them in many places
In chemical instruments
DNA sequencers
chromatographs
In nature
pulsatory hormone release
neuron signalling
In technical systems like radar or ultrasound
DNA sequencing, four traces
[Figure, four panels: ABI traces 1-4 from a DNA sequencer.]
The sum of pulses model
The model:
y(t) = sum_j a_j s(t - tau_j) + e(t)
Assumptions:
each pulse has identical shape s(.)
locations tau_j unknown
heights a_j unknown
linear superposition holds (sum of pulses)
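As a concrete illustration of the model (the Gaussian pulse shape is our choice; the model itself only says all pulses share one shape s):

```python
import numpy as np

def pulse_train(m, locations, heights, width):
    """y(t) = sum_j a_j s(t - tau_j): identical pulses with unknown
    locations tau_j and heights a_j, combined by linear superposition."""
    t = np.arange(m)
    y = np.zeros(m)
    for tau, a in zip(locations, heights):
        y += a * np.exp(-0.5 * ((t - tau) / width) ** 2)
    return y

y = pulse_train(100, locations=[20, 60], heights=[1.0, 0.5], width=3.0)
```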
The convolution model
Observations are always in discrete time
Assume a discrete pulse shape s_k
Discrete input series x
Non-zero elements of x give pulse heights and positions
y_i = sum_j s_{i-j} x_j + e_i
Or y = Cx + e, with c_ij = s_{i-j}
Columns of C identical, but shifted
This is called convolution
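Building C in Python with scipy's toeplitz helper (the pulse shape here is illustrative):

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(s, m):
    """Convolution matrix with c_ij = s_{i-j}: every column is the
    pulse shape s, shifted down one position per column."""
    col = np.zeros(m)
    col[:len(s)] = s
    return toeplitz(col, np.zeros(m))

s = np.array([0.25, 0.5, 1.0, 0.5, 0.25])   # illustrative pulse shape
C = conv_matrix(s, 8)
```

`C @ x` then reproduces `np.convolve(x, s)` truncated to the first m samples.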
Convolution matrix
[Figure: the convolution matrix, shown as a surface and as an image.]
Deconvolution of pulse trains
Output y given, estimate input x
Convolution matrix (pulse shape) assumed to be known
Model: y = Cx + e
This looks like a regression problem, and it is
Least squares solution: x^ = (C'C)^{-1} C'y
Results are disastrous; the problem is ill-conditioned
Results of (penalized) linear regression
[Figure, three panels: data, components, and fit; deconvolution without penalty (note the 1e6 vertical scale); deconvolution with L2 penalty, penalty parameter 0.01.]
Penalties come to the rescue
Least squares goal is S = ||y - Cx||^2
Extend it with a ridge (L2) penalty:
S = ||y - Cx||^2 + lambda ||x||^2
Or with a LASSO (L1) penalty:
S = ||y - Cx||^2 + lambda * sum_j |x_j|
Ridge penalty not useful: no sign of impulses
LASSO is not too bad
L0 penalty works best
LASSO (L1) and L0 results
[Figure, three panels: data, components, and fit; deconvolution with L1 penalty, penalty parameter 0.01; deconvolution with L0 penalty, penalty parameter 0.003.]
Implementation of LASSO and L0 penalty
Write it as a weighted square: |x_j| = x_j^2 / |x_j|
Avoid division by near-zero: |x_j| ~= x_j^2 / sqrt(x_j^2 + beta), with beta a small number
Do this iteratively, for LASSO: |x_j| ~= x_j^2 / sqrt(x~_j^2 + beta)
Or, for the L0 penalty: |x_j|^0 ~= x_j^2 / (x~_j^2 + beta)
The approximation x~_j comes from the previous iteration
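Putting the L0 variant into code (a sketch; lambda, beta, and the Gaussian pulse in the usage below are our choices, not values from the slides):

```python
import numpy as np

def deconvolve_l0(C, y, lam=1e-3, beta=1e-6, n_iter=100):
    """Penalized deconvolution with the reweighted L0 penalty:
    minimize ||y - Cx||^2 + lam * sum_j x_j^2 / (x~_j^2 + beta),
    where x~ is the previous iterate."""
    n = C.shape[1]
    x = np.full(n, 0.1)                      # neutral starting values
    CtC = C.T @ C
    Cty = C.T @ y
    for _ in range(n_iter):
        w = 1.0 / (x ** 2 + beta)            # weight grows as x_j shrinks
        x = np.linalg.solve(CtC + lam * np.diag(w), Cty)
    return x
```

Components pushed toward zero get an ever larger weight and are pinned there, while large components feel almost no penalty; that is exactly the "count the non-zeros" behaviour of L0.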
Interpretation of the L0 penalty
Consider the penalty after convergence:
sum_j x_j^2 / (beta + x_j^2)
where beta is a small number
When x_j = 0, there is no contribution to the penalty
When x_j != 0, the contribution is (close to) 1
Hence we penalize the number of non-zero elements
Deconvolution of hormone concentrations
[Figure, three panels: data and fit (penalty parameters 1.2 and 0.001); estimated pulse coefficients; individual spikes.]
Blind deconvolution
If we know the input, we can estimate the pulse shape
This suggests an iterative procedure
Make a good guess at the pulse shape
Do the penalized deconvolution
Estimate the pulse shape (a well-conditioned regression)
Repeat last two steps
This works, with some care
Blind deconvolution of DNA data
[Figure, three panels: data and fit (penalty parameters 0.02 and 0.0001); estimated pulse coefficients; pulse shapes, initial and final estimate.]
More deconvolution
Illustrating convolution with step input
[Figure, two panels: input; output.]
Deconvolution with step input
[Figure, two panels: input and output; output and estimated input.]
Deconvolution of spikes in two dimensions
The same principles: spikes are smeared out
Image in matrix Y, spike (input) matrix X
But now in two directions
Convolution kernel assumed to be known (Gaussian)
Model: y = Cx + e
with y = vec(Y) and x = vec(X)
Matrix C computed in a special way
L0 penalty on the elements of x
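The slides do not spell out the "special way"; one common construction for a separable Gaussian kernel uses a Kronecker product (a sketch under that assumption, with toy sizes):

```python
import numpy as np

def gauss_conv(m, width):
    """One-dimensional Gaussian convolution matrix (columns = shifted kernels)."""
    t = np.arange(m)
    return np.exp(-0.5 * ((t[:, None] - t[None, :]) / width) ** 2)

# Separable 2-D blur: Y = C1 X C2'. With row-major vec (numpy's ravel),
# vec(Y) = kron(C1, C2) vec(X), which gives the big matrix C in y = Cx + e.
m = 6
C1 = gauss_conv(m, 1.0)
C2 = gauss_conv(m, 1.5)
X = np.zeros((m, m)); X[2, 3] = 1.0      # a single spike
Y = C1 @ X @ C2.T
C = np.kron(C1, C2)
```

The Kronecker form is mainly conceptual: in practice one applies C1 and C2 to the rows and columns of X and never forms the m^2-by-m^2 matrix C explicitly.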
2-D spike deconvolution (simulated data)
[Figure, four panels: simulated image and 2-D spike deconvolution results.]
Convergence history
[Figure: convergence history, twelve 40-by-40 panels of successive iterations.]
Computational aspects
The system is too large for comfort
We now use 40 by 40 sub-pictures (1600 unknowns)
There are ways to improve things
We know that most elements of X are zero
We are working on an adaptive strategy
Super-resolution
We can use a finer grid for X
Say 2 by 2 sub-pixels for each Y pixel
This works in principle
But the computational aspects are harder
At the moment only an illustration is available
Working with a coarsened Y
[Figure: super-resolution illustration, four panels on 40-by-40 and coarsened 20-by-20 grids.]
Summary
Penalties are very useful
For smoothness (reduce noise, estimate baselines)
For sparseness (spike deconvolution, 1-D and 2-D)
There are more applications
Shape constraints, like monotone or unimodal
Fit can be likelihood-based (counts, binary data)
Penalties are connected to prior opinions (Bayes)