implementation of block algebraic iterative reconstruction ...pcha/hdtomo/sc/blockair.pdf8 p. c....

Implementation of Block Algebraic Iterative Reconstruction Methods

Per Christian Hansen joint work with

Hans Henrik B. Sørensen

May 2014

About Me …

• Interests: inverse problems, tomography, regularization algorithms, matrix computations, image deblurring, signal processing, Matlab software, …

• Head of the project High-Definition Tomography, funded by an ERC Advanced Research Grant.

• Author of several Matlab software packages.

• Author of four books.

Forward problem


Outline of Talk

We consider reconstruction problems in computed tomography: reconstruct a 2D or 3D object from its projections, i.e., (noisy) measurements of the damping of rays that go through the domain.

We obtain a very large system of equations A x = b with a very sparse matrix A, which must be solved by an iterative method.

1. Classical iterative reconstruction techniques

2. Performance considerations

3. An overview of block methods

4. How to compare the block methods

5. Numerical results


Analogy: the “Sudoku” Problem – 数独

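[Figure: 2×2 grid of unknowns with the known sums 3, 7, 4 and 6.]

$$
\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 7 \\ 4 \\ 6 \end{pmatrix}
$$

This matrix is rank deficient and there are infinitely many solutions.

[Figure: the same grid with one additional known sum, 5.]

$$
\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 7 \\ 4 \\ 6 \\ 5 \end{pmatrix}
$$

Unique solution!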
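The rank argument can be checked numerically. The following is a minimal sketch using NumPy; the variable names (`A4`, `A5`, `null_dir`, and so on) are chosen here for illustration and do not come from the talk.

```python
import numpy as np

# The four sum constraints of the small "Sudoku" analogy.
A4 = np.array([[1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 1, 0, 0],
               [0, 0, 1, 1]], dtype=float)
b4 = np.array([3, 7, 4, 6], dtype=float)

# One extra measurement: x1 + x4 = 5.
A5 = np.vstack([A4, [1, 0, 0, 1]])
b5 = np.append(b4, 5.0)

rank4 = np.linalg.matrix_rank(A4)   # rank of the 4x4 system
rank5 = np.linalg.matrix_rank(A5)   # rank after adding the fifth row

# A direction that can be added to any solution of the 4x4 system
# without changing any of the four sums.
null_dir = np.array([1.0, -1.0, -1.0, 1.0])

# Least-squares solve of the (consistent) 5x4 system.
x_unique, *_ = np.linalg.lstsq(A5, b5, rcond=None)

print("rank of 4x4 system:", rank4)
print("rank of 5x4 system:", rank5)
print("solution of 5x4 system:", x_unique)
```

The first system has rank 3, so its solutions form a line along `null_dir`; the fifth row removes that freedom.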


3D Tomography Test Problem

• Parallel X-rays are sent through the object.

• The object is discretized in an array of N×N×N voxels.

• Projections are recorded on detectors with p×p pixels.

• The directions of the rays are "evenly" distributed over the half-sphere using Lebedev quadrature points.


Setting Up the Algebraic Model

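Damping of the i-th X-ray through the domain (Beer's law):

$$
b_i = \int_{\mathrm{ray}_i} \chi(s)\, d\ell, \qquad \chi(s) = \text{attenuation coef.}
$$

Discretization leads to a large, sparse, ill-conditioned system:

$$
A\,x = b
$$

Here A represents the geometry, x the image, and b the projections.

Noise:

$$
\bar{b} = A\,\bar{x}, \qquad b = \bar{b} + e
$$

[Figure: 2D example.]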


A Note About the 3D Algebraic Model

Each ray corresponds to a particular row of the matrix A.

Each ray intersects only a very small number of voxels.

Hence, many rows of A are structurally orthogonal.


Noise Sensitivity

Assume that A has full rank, and consider the two problems:

$$
A\,\bar{x} = \bar{b} \quad \text{(no noise)}, \qquad A\,x \approx b = \bar{b} + e
$$

Let us define the "naive" solution:

$$
x_{\mathrm{naive}} = A^{-1} b = x_{\mathrm{exact}} + A^{-1} e, \qquad \|A^{-1} e\| \gg \|x_{\mathrm{exact}}\|
$$

The last term dominates because A is very ill-conditioned! We must use regularization to compute an approximate solution that is less sensitive to the noise.
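The effect of ill-conditioning on the naive solution can be illustrated with a small experiment. This is only a sketch: a 10×10 Hilbert matrix is used as a stand-in for an ill-conditioned A (it is not a tomography matrix), and the noise level and random seed are arbitrary assumptions.

```python
import numpy as np

n = 10
i = np.arange(1, n + 1)
# Hilbert matrix: a classical, severely ill-conditioned test matrix
# (stand-in for A; not taken from the talk).
A = 1.0 / (i[:, None] + i[None, :] - 1.0)

x_exact = np.ones(n)
b_bar = A @ x_exact                      # noise-free data

rng = np.random.default_rng(0)
e = 1e-6 * rng.standard_normal(n)        # small perturbation of the data
b = b_bar + e

x_naive = np.linalg.solve(A, b)          # "naive" solution A^{-1} b

cond = np.linalg.cond(A)
rel_err = np.linalg.norm(x_naive - x_exact) / np.linalg.norm(x_exact)

print(f"cond(A) = {cond:.2e}")
print(f"relative error of naive solution = {rel_err:.2e}")
```

Even though the data perturbation is tiny, the inverted noise term swamps the exact solution.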


Some Large-Scale Reconstruction Algorithms

Bayesian Methods
My knowledge here is very limited …

Transform-Based Methods
The forward problem is formulated as a certain transform → find a stable way to compute the inverse transform.
Examples: the inverse Radon transform for tomography → filtered back-projection, FDK.

Algebraic Iterative Methods
The forward problem is formulated as a discretized problem → solve A x = b using an iterative method.
Examples: Cimmino, Kaczmarz, CGLS.


ART (Algebraic Reconstruction Technique)

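Algorithm: ART (Classical Kaczmarz)

Let $x^0 = 0$. Repeat the sweep below for $k = 1, 2, 3, \ldots$

Algorithm: $x^k \leftarrow \text{ART-sweep}(\lambda, A, b, x^{k-1})$

$$
x^{k,0} = x^{k-1}
$$

$$
x^{k,i} = \mathcal{P}\left( x^{k,i-1} + \lambda\, \frac{b_i - a_i^T x^{k,i-1}}{\|a_i\|_2^2}\, a_i \right), \qquad i = 1, \ldots, m
$$

$$
x^k = x^{k,m}.
$$

Here $\lambda$ is the relaxation parameter.

Parallelism at the level of an inner product.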
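A minimal sketch of the ART sweep in NumPy follows. It assumes a dense array `A` for readability (a real tomography matrix would be stored sparse), and it treats the operator P as an optional user-supplied projection such as clipping to nonnegative values; that interpretation of P is an assumption, not something the slide specifies.

```python
import numpy as np

def art_sweep(lam, A, b, x, project=None):
    """One Kaczmarz sweep over all rows of A, starting from x."""
    x = np.array(x, dtype=float)          # work on a copy
    for a_i, b_i in zip(A, b):
        norm2 = a_i @ a_i
        if norm2 == 0.0:
            continue                       # skip empty rows
        # Relaxed projection onto the hyperplane a_i^T x = b_i.
        x += lam * (b_i - a_i @ x) / norm2 * a_i
        if project is not None:
            x = project(x)                 # optional projection P (assumed)
    return x

def art(lam, A, b, n_iter, project=None):
    """ART: start from x^0 = 0 and apply n_iter sweeps."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = art_sweep(lam, A, b, x, project)
    return x

# Tiny consistent example (hypothetical data).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x_true = np.array([1.0, 2.0])
b = A @ x_true
x = art(1.0, A, b, n_iter=100)
print(x)
```

Each row update needs the inner product `a_i @ x` before the next row can be processed, which is why the available parallelism is limited to that inner product.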


SIRT (Simultaneous Iter. Reconstr. Tech.)

Algorithm: SIRT

Let $x^0 = 0$. For $k = 1, 2, 3, \ldots$

$$
x^k = x^{k-1} + \lambda\, A^T M\, (b - A\, x^{k-1})
$$

Here $\lambda$ is the relaxation parameter and M is the matrix that defines the specific method.

Cimmino: $M = \frac{1}{m}\, \mathrm{diag}\left(1 / \|a_i\|_2^2\right)$.

Parallelism at the level of a matrix-vector product.

No evidence that $x^0 \neq 0$ gives better solutions or smaller computing time.
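The SIRT iteration with Cimmino's choice of M can be sketched in a few lines of NumPy. The function name `cimmino` and the small test system are illustrative assumptions; the sketch also assumes A is dense and has no zero rows.

```python
import numpy as np

def cimmino(lam, A, b, n_iter):
    """SIRT with M = (1/m) diag(1/||a_i||_2^2), starting from x^0 = 0."""
    m, n = A.shape
    # Diagonal of M; assumes every row of A is nonzero.
    M_diag = 1.0 / (m * np.sum(A * A, axis=1))
    x = np.zeros(n)
    for _ in range(n_iter):
        residual = b - A @ x                       # matrix-vector product
        x = x + lam * (A.T @ (M_diag * residual))  # back-projection
    return x

# Tiny consistent example (hypothetical data).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x_true = np.array([1.0, 2.0])
b = A @ x_true
x = cimmino(1.0, A, b, n_iter=500)
print(x)
```

All rows are used simultaneously in each iteration, so the work is dominated by two matrix-vector products that parallelize well.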


Performance

ART and SIRT (Cimmino) for very small λ = 0.01 and for "optimal" λ.

Slow convergence.

ART can converge a lot faster than SIRT.

[Figure: relative error $\|x^k - \bar{x}\|_2 / \|\bar{x}\|_2$.]


Performance

Test Problem:
• Parallel-beam tomography.
• 13 projections.
• 3D Shepp-Logan phantom, Schabel (2006).

[Figure: relative error $\|x^k - \bar{x}\|_2 / \|\bar{x}\|_2$ versus iterations k, for ART.]


Performance 1 core

Intel Xeon E5620 2.40 GHz (1 core)

Same number of flops! The difference is due to the cache: ART uses row $a_i$ twice once it is loaded (first for the inner product, then for the update).

ART SIRT


Performance 4 cores

Intel Xeon E5620 2.40 GHz (4 cores)

ART SIRT

Four cores are better suited for block matrix-vector operations.


Our Dilemma

ART has faster convergence than SIRT – i.e., more reduction of the error per iteration.

SIRT can better take advantage of multi-core architecture than ART.

How to achieve the "best of both worlds"? → Block methods!


Block Methods

In each iteration we can:
• Treat the blocks sequentially or simultaneously (i.e., in parallel).
• Treat each block by an iterative or by a direct computation.

We obtain several methods:
• Sequential processing + ART on each block → classical ART
• Sequential processing + SIRT on each block
• Sequential processing + pseudoinverse of Aℓ
• Parallel processing + ART on each block
• Parallel processing + SIRT on each block → classical SIRT
• Parallel processing + pseudoinverse of Aℓ


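Block-Sequential Methods

Eggermont, Herman, Lent (1981); Elfving (1980)

The convergence depends on the number of blocks p:
• If p = 1, we recover SIRT.
• If p = m, we recover ART.

Parallelism given by the tradeoff:

Algorithm: Block-Sequential

Initialization: choose an arbitrary $x^0 \in \mathbb{R}^n$

Iteration: for $k = 1, 2, 3, \ldots$

$$
x^{k,0} = x^{k-1}
$$

$$
x^{k,\ell} = \mathcal{P}\left( x^{k,\ell-1} + \lambda\, A_\ell^T M_\ell\, (b_\ell - A_\ell\, x^{k,\ell-1}) \right), \qquad \ell = 1, 2, \ldots, p
$$

$$
x^k = x^{k,p}
$$

Here $A_\ell$ and $b_\ell$ denote the ℓ-th block of rows of A and the corresponding elements of b.

Variant by Elfving (1980):

$$
M_\ell = (A_\ell A_\ell^T)^\dagger \quad \Rightarrow \quad A_\ell^T M_\ell = A_\ell^\dagger
$$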
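The block-sequential iteration can be sketched as follows, with a Cimmino-type SIRT step on each block. This is an illustrative sketch only: the function name, the choice $M_\ell = \frac{1}{m_\ell}\mathrm{diag}(1/\|a_i\|_2^2)$ within each block, the omission of the projection P, and the small test system are all assumptions made here.

```python
import numpy as np

def block_sequential(lam, A, b, blocks, n_iter):
    """Block-sequential iteration with one Cimmino step per block.

    blocks: list of row-index lists that partition the rows of A.
    """
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        for idx in blocks:                 # blocks are treated one after another
            A_l, b_l = A[idx], b[idx]
            # Assumed block weights: M_l = (1/m_l) diag(1/||a_i||^2).
            M_l = 1.0 / (len(idx) * np.sum(A_l * A_l, axis=1))
            # SIRT step on this block (matrix-vector products only).
            x = x + lam * (A_l.T @ (M_l * (b_l - A_l @ x)))
    return x

# Tiny consistent example (hypothetical data), m = 4 rows, p = 2 blocks.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
x_true = np.array([1.0, 2.0, 3.0])
b = A @ x_true
x = block_sequential(1.0, A, b, blocks=[[0, 1], [2, 3]], n_iter=200)
print(x)
```

With `blocks=[[0], [1], [2], [3]]` each block is a single row and the loop reduces to an ART sweep; with a single block containing all rows it reduces to Cimmino's method.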


Block-Parallel Methods

Censor, Elfving, Herman (2001)

The convergence depends on p:
• If p = 1, we recover ART.
• If p = m, we recover SIRT.

Parallelism is given by:

Algorithm: Block-Parallel

Initialization: choose an arbitrary $x^0 \in \mathbb{R}^n$

Iteration: for $k = 1, 2, 3, \ldots$

for $\ell = 1, \ldots, p$ execute in parallel

$$
x^{k,\ell} = \text{ART-sweep}(\lambda, A_\ell, b_\ell, x^{k-1})
$$

$$
x^k = \frac{1}{p} \sum_{\ell=1}^{p} x^{k,\ell}.
$$

Variants:

• Elfving (1980), inner step:

$$
x^{k,\ell} = \mathcal{P}\left( x^{k-1,\ell} + \lambda\, A_\ell^\dagger\, (b_\ell - A_\ell\, x^{k-1,\ell}) \right)
$$

• CARP algorithm, Gordon & Gordon (2005):

$$
x^k = \sum_{\ell=1}^{p} D_\ell\, x^{k,\ell}, \qquad D_\ell \text{ depends on sparsity structure}
$$
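The basic block-parallel iteration (an ART sweep on each block, followed by averaging) can be sketched as below. The thread pool only illustrates the structure of the computation; with pure-Python loops, threads will not give a real speedup because of the interpreter lock. Function names, the omission of the projection P, and the test system are assumptions of this sketch.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def art_sweep(lam, A, b, x):
    """One Kaczmarz sweep over the rows of A, starting from x."""
    x = np.array(x, dtype=float)          # work on a copy
    for a_i, b_i in zip(A, b):
        x += lam * (b_i - a_i @ x) / (a_i @ a_i) * a_i
    return x

def block_parallel(lam, A, b, blocks, n_iter, workers=None):
    """Block-parallel iteration: independent ART sweeps, then averaging."""
    x = np.zeros(A.shape[1])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(n_iter):
            # Every block starts from the same iterate x^{k-1}.
            futures = [pool.submit(art_sweep, lam, A[idx], b[idx], x)
                       for idx in blocks]
            # x^k = (1/p) * sum of the block results.
            x = sum(f.result() for f in futures) / len(blocks)
    return x

# Tiny consistent example (hypothetical data), m = 4 rows, p = 2 blocks.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
x_true = np.array([1.0, 2.0, 3.0])
b = A @ x_true
x = block_parallel(1.0, A, b, blocks=[[0, 1], [2, 3]], n_iter=200)
print(x)
```

A CARP-style variant would replace the plain average by a weighted combination of the block results; that weighting is not implemented here.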


Block Sequential

4 blocks

The "building blocks" are SIRT iterations, suited for multicore. The blocks are treated sequentially! Hence the error reduction per iteration is close to that of ART.

[Figure legend: ART, SIRT, Block-Seq.]

Intel Xeon E5620 2.40 GHz (4 cores)


Block Parallel

[Figure legend: ART, SIRT, Block-Seq., Block-Par.]

Intel Xeon E5620 2.40 GHz (4 cores)


Fair Comparison of the Methods …

The relaxation parameter λ makes comparisons difficult …

It is quite easy to make an unfair comparison between the methods: choose a bad λ for the method you don't like.

To make a fair comparison between the methods, we choose the value of λ that is (near) optimal for each method!

What do we mean by "(near) optimal"?
– Choose a test problem with a known solution.
– Find the parameter λ that gives fastest semi-convergence.

Recall that we do not want the "naive" solution:

$$
x_{\mathrm{naive}} = A^{-1} b = x_{\mathrm{exact}} + A^{-1} e, \qquad \|A^{-1} e\| \gg \|x_{\mathrm{exact}}\|
$$


Illustration of Semi-Convergence

[Figure: illustration of semi-convergence; labelled point $A^{-1} b$.]


Training for Optimal λ

Semi-convergence and relaxation parameter λ: the optimal λ reaches the minimum error in the fewest iterations.

[Figure: error versus iteration k, with the curve for the optimal λ marked.]
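One way to carry out such a training step is sketched below for Cimmino's method. The selection rule used here (pick the λ whose error history first comes within a small tolerance of the best error seen over the whole grid) is an assumption of this sketch, as are the function names, the random test problem, the grid of λ values, and the noise level.

```python
import numpy as np

def cimmino_errors(lam, A, b, x_exact, n_iter):
    """Relative error history of Cimmino's method for a given lambda."""
    m = A.shape[0]
    M = 1.0 / (m * np.sum(A * A, axis=1))
    x = np.zeros(A.shape[1])
    errs = np.empty(n_iter)
    for k in range(n_iter):
        x = x + lam * (A.T @ (M * (b - A @ x)))
        errs[k] = np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact)
    return errs

def train_lambda(lambdas, A, b, x_exact, n_iter, tol=0.01):
    """Pick the lambda that reaches (near) minimal error in fewest iterations."""
    histories = {lam: cimmino_errors(lam, A, b, x_exact, n_iter)
                 for lam in lambdas}
    # Target: within a factor (1 + tol) of the best error over the grid.
    target = (1.0 + tol) * min(h.min() for h in histories.values())
    best_lam, best_k = None, None
    for lam, h in histories.items():
        hits = np.nonzero(h <= target)[0]
        if hits.size and (best_k is None or hits[0] + 1 < best_k):
            best_lam, best_k = lam, int(hits[0]) + 1
    return best_lam, best_k, histories

# Hypothetical test problem with a known solution and noisy data.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_exact = rng.standard_normal(20)
b = A @ x_exact + 0.05 * rng.standard_normal(40)

lambdas = [0.25, 0.5, 1.0, 1.5, 1.9]
best_lam, best_k, histories = train_lambda(lambdas, A, b, x_exact, n_iter=200)
print("selected lambda:", best_lam, "reached at iteration", best_k)
```

The same procedure would be repeated for each method under comparison, so that every method is run with its own trained λ.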


Convergence Results I

Only convergence is considered here; the number of cores is irrelevant.


Convergence Results II

Only convergence is considered here; the number of cores is irrelevant.


Blocks of Structurally Orthogonal Rows

When a block has structurally orthogonal rows, then ART, SIRT and "pinv" are equivalent. It is worthwhile to utilize this!

PART algorithm, Gordon (2006)

In 3D tomography, it is easy to find sets of rows that are orthogonal due to the structure of zeros/nonzeros.

Thus, a re-ordering of the rows can produce blocks with mutually orthogonal rows (= the traces of rays are non-overlapping).
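A simple way to form such blocks is a greedy pass over the rows, placing each row into the first block whose rows share no nonzero column with it. This is a generic heuristic written for illustration; it is not claimed to be the ordering used in the PART algorithm, and the function name and example matrix are assumptions.

```python
import numpy as np

def structurally_orthogonal_blocks(A):
    """Greedily group rows so that rows in a block have disjoint supports."""
    supports = [set(np.nonzero(row)[0]) for row in A]
    blocks, used_cols = [], []
    for i, s in enumerate(supports):
        for blk, cols in zip(blocks, used_cols):
            if cols.isdisjoint(s):         # no shared nonzero column
                blk.append(i)
                cols |= s
                break
        else:
            blocks.append([i])             # open a new block for this row
            used_cols.append(set(s))
    return blocks

# Small example: four rays through a 2x2 grid (hypothetical data).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
blocks = structurally_orthogonal_blocks(A)

# Within each block the Gram matrix A_l A_l^T should be diagonal.
grams_diagonal = [
    bool(np.allclose(A[blk] @ A[blk].T, np.diag(np.sum(A[blk] ** 2, axis=1))))
    for blk in blocks
]
print(blocks, grams_diagonal)
```

Because the rows within a block do not interact, a sequential ART sweep over the block gives the same result as a single pseudoinverse step, and all row updates in the block can be carried out at once.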


Single-Core Results

Intel Core i7-3820 3.60 GHz (1 core)

Block-Seq: block-sequential-SIRT
Block-Par: block-parallel-ART (Censor, Elfving, Herman)
CARP: block-parallel-ART (Gordon, Gordon)
PART: utilizes structural orthogonality
ART (1 thread)


Multi-Core Performance – 4 cores

Block-seq-SIRT
Block-par-ART (Censor, Elfving, Herman)
Block-par-ART (Gordon, Gordon)
PART: utilizes structural orthogonality
ART (1 thread)


Multi-core Results – 4 Cores

Intel Core i7-3820 3.60 GHz (4 cores)

The advantage of PART over standard ART is due to the improved use of multicore architecture.

Block-Seq: block-sequential-SIRT
Block-Par: block-parallel-ART (Censor, Elfving, Herman)
CARP: block-parallel-ART (Gordon, Gordon)
PART: utilizes structural orthogonality
ART (1 thread)


Multi-core Results – 32 Cores

4-socket AMD Opteron 6282 SE 2.60 GHz (32 cores)

With many cores, PART is a clear winner.

Block-Seq: block-sequential-SIRT
Block-Par: block-parallel-ART (Censor, Elfving, Herman)
CARP: block-parallel-ART (Gordon, Gordon)
PART: utilizes structural orthogonality
ART (1 thread)


Conclusions

Block algebraic iterative reconstruction techniques are able to achieve an initial convergence rate similar to that of ART, with the smaller computing time of SIRT, because we can utilize the multicore architecture.

With a suitable row ordering and choice of blocks, we can produce blocks of structurally orthogonal rows.

PART has identical convergence to ART and very good scaling properties in practice.

Next step: target GPUs (up to 2688 cores).
