
The Journal of Imaging Science and Technology

Reprinted from Vol. 49, No. 2, March/April 2005

IS&T, The Society for Imaging Science and Technology

http://www.imaging.org


©2005 Society for Imaging Science and Technology (IS&T). All rights reserved. This paper, or parts thereof, may not be reproduced in any form without the written permission of IS&T, the sole copyright owner of The Journal of Imaging Science and Technology.

For information on reprints or reproduction contact:

Donna Smith
Managing Editor, The Journal of Imaging Science and Technology
Society for Imaging Science and Technology
7003 Kilworth Lane
Springfield, Virginia 22151 USA
703/642-9090 extension 17
703/642-9094 (fax)
[email protected]
www.imaging.org


JOURNAL OF IMAGING SCIENCE AND TECHNOLOGY® • Volume 49, Number 2, March/April 2005

Hardware Acceleration of Mutual Information-Based 3D Image Registration

Carlos R. Castro-Pareja and Raj Shekhar†

Department of Diagnostic Radiology, University of Maryland School of Medicine, Baltimore, Maryland, USA

Real-time image registration is potentially an enabling technology for the effective and efficient use of many image-guided diagnostic and treatment procedures relying on multimodality image fusion or serial image comparison. Mutual information is currently the best-known image similarity measure for intensity-based multimodality image registration. The calculation of mutual information is memory intensive and does not benefit from cache-based memory architectures in standard software implementations, i.e., the calculation incurs a large number of cache misses. Previous attempts to perform image registration in real time focused on parallel supercomputer implementations, which achieved significant speedups using large, expensive supercomputers that are impractical for clinical deployment. We present a hardware architecture that can be used to accelerate a number of linear and elastic image registration algorithms that use mutual information as an image similarity measure. A proof-of-concept implementation of the architecture achieved speedups of 30 for linear registration and 100 for elastic registration against a 3.2 GHz Pentium III Xeon workstation. Further speedup can be achieved by using several modules in parallel.

Journal of Imaging Science and Technology 49: 105−113 (2005)

Original manuscript received July 16, 2004

† Corresponding Author: Raj Shekhar, [email protected]

©2005, IS&T—The Society for Imaging Science and Technology

Introduction
Image registration is the process of aligning two images that represent the same anatomy from different points of view, at different times, or using different imaging modalities. Image registration is an important tool in medical imaging, where it is used to merge or compare images obtained from a variety of modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), single photon emission computed tomography (SPECT) and ultrasound. In medicine, real-time image registration can be an enabling technology for the effective and efficient use of many diagnostic and image-guided treatment procedures relying on either multimodality image fusion or serial image comparison.1 Image registration algorithms can be classified into linear and elastic registration algorithms. Linear registration algorithms use global transformations with a combination of rotation, translation, scaling and shear components. Elastic image registration refers to the case in which one of the images is transformed using a nonlinear, continuous transformation. The process of image registration yields the transformation field that best aligns one of the images (the floating image) with the other (the reference image), and can be used to track changes in the anatomy. Both linear and elastic registration are useful tools in many medical applications.2−7

While a linear transformation can be simply defined as a superposition of rotations, translations, scaling and shear, and is always applied globally, there is no agreement in the image processing community about the best way to define and estimate an elastic transformation. Many algorithms for elastic image registration that have been reported in the literature are based on a subdivision of the image using a mesh of control points, which is defined at progressively smaller subvolume sizes.3,8−11 Such algorithms optimize the positions of the control points to find the best local transformation. The images are compared using an image similarity measure. For multimodality image registration the measure of choice is typically mutual information.12,13

The use of mutual information for image registration was first introduced by Wells et al.14 Pluim et al. presented a comprehensive survey of the literature on mutual information-based registration.13 The goal of mutual information-based registration is to find the transformation that best aligns two images by maximizing the mutual information between them. Figure 1 describes the mutual information-based registration process. The optimization algorithm is used to find the set of transformation parameters that maximizes the mutual information between the reference image and the transformed floating image. Depending on the particular application, the transformation can be linear (rigid or affine) or elastic. For linear transformations the parameters being optimized are commonly the rotation, translation and/or scaling coefficients, while for elastic transformations defined by a mesh of control points the parameters being optimized are the control point locations in the floating image space.
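As a concrete illustration of the loop in Fig. 1, the sketch below registers two small volumes by maximizing mutual information over a translation-only transformation with a general-purpose optimizer. The volume size, the smoothing, the 32-bin histograms and the translation-only model are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the registration loop of Fig. 1 (assumptions: translation-only
# transformation, 32-bin histograms, SciPy's Powell optimizer standing in for the
# optimization block).
import numpy as np
from scipy.ndimage import gaussian_filter, shift
from scipy.optimize import minimize

def mutual_information(ref, flt, bins=32):
    # Joint histogram -> joint and marginal probabilities -> MI, Eqs. (1)-(4).
    joint, _, _ = np.histogram2d(ref.ravel(), flt.ravel(), bins=bins)
    p_joint = joint / joint.sum()
    p_ref = p_joint.sum(axis=1)
    p_flt = p_joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    return entropy(p_ref) + entropy(p_flt) - entropy(p_joint)

def negative_mi(params, ref, flt):
    # Apply the current transformation estimate to the floating image,
    # then evaluate the similarity measure (negated for a minimizer).
    moved = shift(flt, params, order=1)
    return -mutual_information(ref, moved)

rng = np.random.default_rng(0)
reference = gaussian_filter(rng.random((32, 32, 32)), sigma=2.0)
floating = shift(reference, (2.0, -1.5, 0.5), order=1)   # known misalignment

result = minimize(negative_mi, x0=np.zeros(3), args=(reference, floating),
                  method="Powell")
print("recovered translation:", result.x)   # roughly (-2.0, 1.5, -0.5)
```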


Previous attempts to perform image registration in real time focused on parallel supercomputer implementations, which achieved real-time performance using large, expensive supercomputers that are impractical to deploy in a hospital.15−18 Our Fast Automatic Image Registration (FAIR) architecture accelerated mutual information-based linear registration.19 In this article we present a second generation architecture, called FAIR-II, that has been designed to accelerate both linear and elastic registration and achieves even higher speeds than the original FAIR architecture. The FAIR-II architecture achieves registration speeds 30 to 100 times faster than a 3.2 GHz Pentium III-based PC workstation, depending on the underlying transformation. The single-module speedup results from using parallel memory access and parallel calculation pipelines. Using several modules in parallel can increase the overall speedup. An implementation of the architecture can be customized to meet the particular requirements of most control point mesh-based elastic registration algorithms.8,10,11,17,20

Mutual Information Calculation
Mutual information is defined as:

MI(RI, FI) = H(RI) + H(FI) − H(RI, FI),   (1)

where RI is the reference image, FI is the floating image (the one being transformed), H(RI) and H(FI) are the individual image entropies, and H(RI, FI) is their joint entropy. The entropies are computed as follows:

H(RI) = −Σ_a p_RI(a) ln p_RI(a)   (2)

H(FI) = −Σ_b p_FI(b) ln p_FI(b)   (3)

H(RI, FI) = −Σ_{a,b} p_RI,FI(a, b) ln p_RI,FI(a, b)   (4)

The joint voxel intensity probability p_RI,FI(a, b), i.e., the probability that a voxel in the reference image has intensity a and the corresponding voxel in the floating image has intensity b, can be obtained from the joint histogram of the two images. The joint histogram represents the joint intensity probability distribution. In the process of mutual information-based registration, the dispersion of values within the joint histogram is minimized, which in turn minimizes the joint entropy and maximizes the mutual information. Calculation of mutual information involves first obtaining the individual and joint histograms of the two images and then calculating their individual and joint entropies using the histograms' values. Joint histogram calculation is a memory intensive task that does not benefit from cache-based memory architectures, i.e., it incurs a large number of cache misses, in standard software implementations.19
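Written out in code, Eqs. (1) through (4) reduce to a few array operations once the joint histogram is available; the sketch below derives the individual histograms as row and column sums of the joint histogram, so only the joint histogram has to be accumulated. The 8-bit intensities and 256 × 256 histogram are illustrative assumptions.

```python
import numpy as np

def mi_from_joint_histogram(joint_hist):
    """Evaluate Eqs. (1)-(4) given an integer joint histogram.

    The individual histograms are the row and column sums of the joint
    histogram, so only the joint histogram has to be accumulated.
    """
    n = joint_hist.sum()                      # total number of voxel pairs
    p_joint = joint_hist / n                  # p_RI,FI(a, b)
    p_ref = p_joint.sum(axis=1)               # p_RI(a)
    p_flt = p_joint.sum(axis=0)               # p_FI(b)

    def entropy(p):
        p = p[p > 0]                          # 0 * ln(0) is taken as 0
        return -np.sum(p * np.log(p))

    return entropy(p_ref) + entropy(p_flt) - entropy(p_joint)

# Example: joint histogram of two 8-bit volumes (256 x 256 bins, assumed size).
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64, 64))
flt = np.clip(ref + rng.integers(-8, 9, size=ref.shape), 0, 255)
joint = np.zeros((256, 256), dtype=np.int64)
np.add.at(joint, (ref.ravel(), flt.ravel()), 1)   # histogram accumulation
print("MI =", mi_from_joint_histogram(joint))
```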

Architecture
As mentioned before, mutual information calculation can be divided into two steps. In the first step the joint and individual histograms are formed. In the second step, the entropies are calculated and added according to Eqs. (1) through (4). The individual and joint voxel intensity probabilities are calculated from the joint and individual histograms. The time required to perform the first step is proportional to the image size, while the time required to perform the second step is proportional to the joint histogram size. Since typical 3D image sizes are between 64^3 and 512^3 voxels, and typical joint histogram sizes range between 32^2 and 256^2 bins, joint histogram calculation speed will normally be the most important factor in mutual information calculation time. We showed earlier that, for large images, joint histogram calculation can take 99.9% of the total mutual information calculation time in software.19 On the other hand, off-chip entropy calculation requires the additional overhead of transmitting the joint histogram back to the host computer, which can take a significant amount of time. Because of this overhead we decided to implement entropy calculation on-chip. A block diagram of the FAIR-II architecture is shown in Fig. 2. The following subsections describe the operation of its different elements.

Joint Histogram Computation
The first step in mutual information calculation involves applying the current estimate of the inverse transformation to each voxel coordinate of the reference image to find the corresponding voxel coordinates in the floating image, and updating the joint histogram at the location dictated by the intensities of the voxel pair. The corresponding coordinates do not normally coincide with a grid point in the floating image, thus requiring interpolation. Partial volume (PV) interpolation, as introduced by Maes et al.,21 was chosen because it provides smooth changes in the histogram values as a result of small changes in the transformation.13 The algorithm for joint histogram computation is shown in Fig. 3. Step (a) of the algorithm is performed by the voxel coordinate transformation unit. Steps (b) and (c) are performed in parallel. The reference image coordinate values are passed to the reference image controller, the integer components of the floating image coordinates are passed to the floating image controller, and the fractional components to the PV interpolator. Finally, the interpolation weights calculated by the PV interpolator are accumulated into the joint histogram.
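The following sketch mirrors steps (a) through (c) of the algorithm referenced above for one reference image voxel and then performs the accumulation of step (d). The affine stand-in for the inverse transformation and the 64-bin histogram are illustrative assumptions.

```python
import itertools
import numpy as np

def accumulate_voxel(joint_hist, ref_img, flt_img, transform, vr):
    """Steps (a)-(d) of the joint histogram algorithm for one reference voxel vr."""
    a = ref_img[vr]                                     # reference intensity
    # (a) transform the reference coordinate into floating image space
    vf = transform(np.asarray(vr, dtype=float))
    base = np.floor(vf).astype(int)                     # integer part
    frac = vf - base                                    # fractional part
    if np.any(base < 0) or np.any(base + 1 >= flt_img.shape):
        return                                          # outside the floating image
    # (b), (c) visit the 2 x 2 x 2 neighborhood and form the PV weights
    for dx, dy, dz in itertools.product((0, 1), repeat=3):
        w = ((frac[0] if dx else 1 - frac[0]) *
             (frac[1] if dy else 1 - frac[1]) *
             (frac[2] if dz else 1 - frac[2]))          # trilinear-style weight
        b = flt_img[base[0] + dx, base[1] + dy, base[2] + dz]
        # (d) accumulate the weight instead of an interpolated intensity
        joint_hist[a, b] += w

# Tiny example with an assumed 64-bin histogram and a pure translation.
rng = np.random.default_rng(2)
ref_img = rng.integers(0, 64, size=(16, 16, 16))
flt_img = rng.integers(0, 64, size=(16, 16, 16))
joint_hist = np.zeros((64, 64))
offset = np.array([0.3, -1.2, 2.7])
for vr in np.ndindex(ref_img.shape):
    accumulate_voxel(joint_hist, ref_img, flt_img, lambda v: v + offset, vr)
print("accumulated weight:", joint_hist.sum())
```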

Figure 1. Mutual information-based registration flowchart.


Voxel Coordinate Transformation
The voxel coordinate transformation unit performs the transformation of reference image voxel coordinates into locations in the floating image space. An elastic transformation is separated into two components: a global linear transformation and a local, control point mesh-based transformation, each calculated by an independent submodule. Floating image voxel coordinates are calculated as shown in Eq. (5):

v_f = T_global ⋅ v_r + d_local(v_r),   (5)

where v_r is a vector containing the coordinates of a voxel in the reference image, v_f is a vector containing the corresponding floating image coordinates, T_global is the global linear transformation (defined by a 4 × 4 matrix), and d_local is the local deformation function.

Figure 2. Block diagram of the FAIR-II architecture.

Figure 3. Joint histogram computation algorithm.

For each voxel in the reference image:
a. Calculate the corresponding floating image coordinates (apply the coordinate transformation).
b. Load the corresponding 2 × 2 × 2 floating image voxel neighborhood (8 intensity values are loaded in parallel).
c. Calculate the 8 partial volume interpolation weights (similar to trilinear interpolation, but the weights are not added at the end).
d. Accumulate the interpolation weights into the corresponding joint histogram bins.


The local deformation function is defined by a uniform 3D mesh of control points, which is stored in the Grid Memory module. Each control point is mapped to a fixed location in the reference image space and is associated with the corresponding value of d_local. Local deformation values for reference image coordinates located between mesh control points are calculated using interpolation.

The effects of both transformations are superimposed by addition. The main reason to decompose the transformation into its linear and local components is to reduce the hardware requirements for the Grid Memory and the control point coordinate interpolator. This way the dynamic range of mesh control point coordinates is restricted to the range of allowable internal deformations, which is normally much smaller than the range of the global linear transformation, which covers the whole floating image space. It also allows purely linear transformations to be performed by setting the control point values to zero, i.e., d_local(v_r) = 0.

The particular implementation of the voxel coordinate transformation unit depends on the algorithm being accelerated. Some of the most commonly used transformation algorithms include local linear transformations11 and control-point-mesh-based transformations using trilinear interpolation8,22 or cubic (spline) interpolation.9,20,23 Since all the aforementioned transformation algorithms can be implemented using a hardware pipeline, the choice of a particular transformation algorithm does not have any impact on the final mutual information calculation speed. However, since more complex transformations require a larger number of arithmetic units and/or a deeper pipeline, the hardware requirements of a particular implementation will depend on the transformation algorithm being implemented.
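A software analogue of Eq. (5) for the trilinear control-point-mesh case is sketched below: a 4 × 4 homogeneous matrix supplies the global component, and a uniform mesh of deformation vectors, playing the role of the Grid Memory contents, supplies d_local by trilinear interpolation. The mesh size, spacing and array layout are illustrative assumptions.

```python
import itertools
import numpy as np

def transform_voxel(vr, t_global, mesh, spacing):
    """Evaluate Eq. (5): vf = T_global * vr + d_local(vr).

    mesh has shape (nx, ny, nz, 3) and holds one deformation vector per
    control point; control point (i, j, k) sits at reference coordinate
    (i, j, k) * spacing, i.e., a uniform mesh as stored in the Grid Memory.
    """
    vr = np.asarray(vr, dtype=float)
    # global linear part (4 x 4 homogeneous matrix applied to [x, y, z, 1])
    vf = (t_global @ np.append(vr, 1.0))[:3]
    # local part: trilinear interpolation of the control point deformations
    g = vr / spacing
    base = np.floor(g).astype(int)
    frac = g - base
    d_local = np.zeros(3)
    for dx, dy, dz in itertools.product((0, 1), repeat=3):
        w = ((frac[0] if dx else 1 - frac[0]) *
             (frac[1] if dy else 1 - frac[1]) *
             (frac[2] if dz else 1 - frac[2]))
        d_local += w * mesh[base[0] + dx, base[1] + dy, base[2] + dz]
    return vf + d_local

# Example: identity global transform, 5^3 control points spaced 8 voxels apart.
t_global = np.eye(4)
mesh = np.zeros((5, 5, 5, 3))
mesh[2, 2, 2] = (1.5, 0.0, -0.5)        # pull one control point
print(transform_voxel((16.0, 16.0, 16.0), t_global, mesh, spacing=8.0))
```

Setting every mesh entry to zero reproduces the purely linear case d_local(v_r) = 0 mentioned above.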

Interpolation Weight Calculation
The reference image coordinates coming from the voxel coordinate transformation unit (v_r in Eq. (5)) are always integer and correspond to exact reference image voxel locations. However, their corresponding floating image coordinates (v_f) have fractional components and can land outside of the actual image. As in the original FAIR architecture,19 the integer components of the floating image coordinates are used to locate the corresponding 2 × 2 × 2 floating image voxel neighborhood and are thus passed to the Floating Image Controller, while the fractional components are used for interpolation. As mentioned before, partial volume interpolation is used.21 As a result of PV interpolation, eight interpolation weights w_0:7 are generated. Each of these interpolation weights is associated with a voxel belonging to the 2 × 2 × 2 floating image neighborhood that contains v_f. These weights are passed along with the corresponding reference image intensity value and the eight floating image intensity values to the Scheduler module.

Image Memory Access
Each image memory controller has an independent memory bus, therefore allowing parallel access of voxel intensity pairs. The reference image controller loads the reference image voxel intensity values. At the same time, the floating image controller accesses the corresponding 2 × 2 × 2 voxel neighborhoods in parallel through a memory bus implemented using cubic addressing.24 The reference image memory is accessed sequentially, using burst accesses. While the image memories are accessed, the PV interpolator unit calculates the interpolation weights. Since floating image accesses are not sequential, it is recommended to use SRAMs to store the floating image, thus allowing one 2 × 2 × 2 neighborhood access per external clock cycle.

Joint Histogram Accumulation
For a given reference image voxel, there are eight voxel intensity pairs, each with its own interpolation weight. Since the joint histogram memory needs to be read and written once for each intensity pair, 16 accesses to the joint histogram memory are required per reference image voxel. To provide 16 accesses in the time of a single image memory access, three different techniques are used:
1. The joint histogram is implemented using internal on-chip memory, running at a higher frequency than the external image memory. This reduces the number of necessary accesses per clock cycle by a factor equal to the ratio between the internal and external clock rates.
2. Dual-ported memories are used to allow simultaneous read and write accesses. This halves the number of necessary accesses per memory port.
3. The joint histogram memory is partitioned to allow a higher number of accesses to occur at the same time.

The joint histogram memory is partitioned into 2^n blocks. Each partition has its own independent accumulator pipeline. The number of joint histogram partitions depends on the internal clock speed. The following inequality must hold to ensure that the accumulators can drain their FIFOs at least as fast as data flows into them:

JH_Partitions ≥ 8 ⋅ f_RI / f_acc,   (6)

where f_RI is the reference image clock frequency and f_acc is the accumulator clock frequency. The joint histogram memory is arranged such that the address of a joint histogram bin is obtained by concatenating the corresponding reference image and floating image voxel intensity values. Joint histogram partitioning is based on the recognition that neighboring voxels, even in relatively uniform regions of clinical images, vary slightly, in terms of a few LSBs. Partitioning is, therefore, performed according to the n LSBs of the memory addresses (n = log2(JH_Partitions)). The least significant bits (LSBs) of the joint histogram memory addresses correspond to the LSBs of the floating image voxel intensity values. This partitioning scheme maximizes the likelihood that different floating image voxel intensity values in the same neighborhood will be assigned to different joint histogram memory blocks, thus allowing a higher degree of parallelism in joint histogram RAM accesses. An example with a joint histogram of size 8 × 8 and four partitions (n = 2) is shown in Fig. 4. In this example the intensity values are three bits wide. Therefore, a voxel intensity pair will have six bits. The three most significant bits, labeled vr2:0, contain the reference image voxel intensity value, while the three least significant bits, labeled vf2:0, contain the corresponding floating image voxel intensity value. To map the voxel intensity pair to a given joint histogram partition, its two least significant bits, labeled JHP1 and JHP0, are used. These bits correspond to vf1:0. The remaining four upper bits of the voxel intensity pair, labeled JH3:0, are used to address a location within the joint histogram partition.

For example, if a voxel intensity pair consists of the reference image value vr2:0 = 010₂ (the subscript denotes the base) and the floating image value vf2:0 = 110₂, the corresponding voxel intensity pair will be 010110₂, which will be mapped to Partition 2 (since JHP1:0 = 10₂ = 2), at address 0101₂ (since JH3:0 = 0101₂).

Figure 4. Joint histogram partitioning example.
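The bit slicing of this example can be written out directly. The sketch below assumes 3-bit intensities and four partitions (n = 2), exactly as in Fig. 4; the function name and Python form are ours.

```python
def jh_partition_and_address(a_ref, b_flt, intensity_bits=3, n_partition_bits=2):
    """Map a voxel intensity pair to a joint histogram partition and address.

    The bin address is the concatenation of the reference and floating
    intensities; the n LSBs (the LSBs of the floating intensity) select the
    partition and the remaining upper bits address a bin inside it.
    """
    pair = (a_ref << intensity_bits) | b_flt            # concatenated address
    partition = pair & ((1 << n_partition_bits) - 1)    # JHP bits
    address = pair >> n_partition_bits                  # JH bits
    return partition, address

# The worked example from the text: vr = 010b, vf = 110b.
partition, address = jh_partition_and_address(0b010, 0b110)
print(partition, bin(address))    # -> 2 0b101, i.e., partition 2, address 0101b
```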

A limitation of this partitioning scheme is that, in the presence of large image areas of identical intensity, it would not provide the desired parallelism. To prevent this problem, the scheduler groups equivalent voxel intensity pairs and adds their corresponding interpolation weights together before sending them to the accumulators. This operation also prevents read-after-write hazards in the accumulation pipeline due to consecutive accesses to the same voxel intensity pair. FIFOs are used as buffers to handle traffic spikes to individual joint histogram blocks.
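A minimal software analogue of this grouping step is shown below: the eight (bin, weight) updates produced for one reference voxel are merged whenever they target the same bin before being handed to the accumulators. The dictionary-based merge is an illustrative stand-in for the hardware comparators.

```python
from collections import defaultdict

def coalesce_updates(updates):
    """Merge joint histogram updates that hit the same (a, b) bin.

    updates is a list of ((a, b), weight) pairs for one reference voxel;
    merging them avoids consecutive read-modify-write accesses to the same
    bin (read-after-write hazards) and restores parallelism when a whole
    neighborhood has the same intensity.
    """
    merged = defaultdict(float)
    for bin_ab, weight in updates:
        merged[bin_ab] += weight
    return list(merged.items())

# Uniform neighborhood: all eight weights collapse into a single update.
updates = [((10, 42), 0.125)] * 8
print(coalesce_updates(updates))   # -> [((10, 42), 1.0)]
```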

While the joint histogram is computed, the row and column accumulator unit computes the individual histograms. The architecture of the row and column accumulator is similar to the joint histogram accumulation pipeline. The global sum module keeps track of the number of voxels accumulated into the histograms and calculates its reciprocal value, i.e., it applies the 1/n function to it. The resulting value is used during entropy accumulation to calculate the voxel intensity probabilities from the individual and joint histogram values.

The size of the internal joint histogram memory must be equal to the largest joint histogram size being used in the algorithm of choice. However, variable joint histogram sizes can be implemented in hardware by performing a bit AND operation against the contents of a bit mask register. Having one bit mask register for the floating image and a different mask register for the reference image allows independent control over the two joint histogram dimensions. This is useful when dealing with multimodality images and also when performing nonrigid registration, since performing local refinements to the transformation at small scales affects smaller portions of the floating image, thus requiring the use of smaller joint histogram sizes to reduce the dispersion of values in the joint histogram.
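One plausible reading of this masking scheme, sketched below as an assumption on our part rather than a circuit description from the paper, is that AND-ing each intensity with a per-image mask zeroes its low bits, so only a coarser, power-of-two subset of bins is ever addressed, which behaves like a smaller joint histogram.

```python
def masked_bin(a_ref, b_flt, ref_mask=0xF0, flt_mask=0xFC):
    """Reduce the effective joint histogram size with per-image bit masks.

    With the assumed masks, 8-bit intensities collapse onto 16 reference
    bins and 64 floating bins, i.e., an effective 16 x 64 joint histogram
    inside the full-size memory.
    """
    return a_ref & ref_mask, b_flt & flt_mask

print(masked_bin(0x37, 0xA9))   # -> (48, 168): bins 0x30 and 0xA8
```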

Entropy Accumulation
In the second step of mutual information calculation, the accumulator modules send the partial joint histogram values to the entropy accumulator. In the entropy accumulator the individual and joint voxel intensity probabilities are calculated from the joint histogram values, and the p ⋅ ln(p) function is applied to them. The results are then accumulated, thus obtaining the mutual information value.

To calculate mutual information, it is necessary to evaluate the function f(p) = p ⋅ ln(p) for all the individual and joint intensity probabilities. Since the probabilities range from 0 to 1, the ln(p) function has a range of [−∞, 0], with the problem that it is undefined for p = 0. Fortunately, the p ⋅ ln(p) function has a much more limited range of [−e^−1, 0] and is defined in the full range of 0 ≤ p ≤ 1. Hardware-based logarithm calculation is usually performed using series approximations or a combination of series approximations and look-up tables (LUTs).25−29 Using series approximations requires implementing a series of arithmetic units that consume significant hardware resources and take several clock cycles to finish the calculation. Existing LUT-based approaches for logarithm calculation have a limited accuracy (up to 24 fixed-point bits) that makes them unsuitable for the current application.28,29 In this subsection we present an algorithm for the calculation of p ⋅ ln(p) that allows a simple, LUT-based implementation in FPGAs and has an error range small enough for its application in mutual information calculation.

LUT-Based Entropy Calculation
The function f(p) is approximated in the range [0, 1] by the piecewise-polynomial function f̂_N,m(p) with m segments, defined in Eq. (7) below:

f̂_N,m(p) = P_0(p)               for 0 ≤ p < ∆p
           P_1(p mod ∆p)        for ∆p ≤ p < 2⋅∆p
           ...
           P_i(p mod ∆p)        for i⋅∆p ≤ p < (i + 1)⋅∆p
           ...
           P_{m−1}(p mod ∆p)    for (m − 1)⋅∆p ≤ p < 1,   (7)

where ∆p = 1/m is the segment size of f̂_N,m(p) and P_i is a polynomial of order N − 1. To minimize the maximum approximation error, each P_i is obtained by calculating the Chebyshev approximation for f(p), defined in Eq. (8) for i⋅∆p ≤ p < (i + 1)⋅∆p.30 The Chebyshev approximation is simple to calculate for continuous functions and has the advantage that it is very close to the minimax approximation, the most accurate polynomial approximation.31 Equation (9) is used to calculate the coefficients.

P_i(p mod ∆p, ∆p) ≈ [ Σ_{k=0}^{N−1} c_{i,k} ⋅ T_k( 2⋅(p mod ∆p)/∆p − 1 ) ] − (1/2)⋅c_{i,0}   (8)

c_{i,k} = (2/N) ⋅ Σ_{l=1}^{N} f( (∆p/2)⋅cos(π⋅(l − 1/2)/N) + ∆p⋅(i + 1/2) ) ⋅ cos( π⋅k⋅(l − 1/2)/N )   (9)

The Chebyshev polynomials are defined by T_n(x) = cos(n ⋅ arccos(x)), for −1 ≤ x ≤ 1. The Chebyshev polynomials of order zero to three are shown in Eq. (10). Since each polynomial is used to approximate f(p) in a specific [i⋅∆p, (i + 1)⋅∆p] range, whereas the Chebyshev polynomials are defined in [−1, 1], the variable conversion shown in Eq. (11) was applied to the equations:

T_0(x) = 1,  T_1(x) = x,  T_2(x) = 2x^2 − 1,  T_3(x) = 4x^3 − 3x   (10)

x = (2⋅(p mod ∆p) − ∆p) / ∆p   (11)

To keep our arithmetic pipeline simple, we considered only first, second and third-order approximations in our architecture. Equation (12) defines the ith polynomial component of f̂_N,m(p):

P_i(p) = k_{i,3}⋅p_d^3 + k_{i,2}⋅p_d^2 + k_{i,1}⋅p_d + k_{i,0},   (12)

where p_d = p mod ∆p. The polynomial coefficients k_{i,j} are stored in the ith entry of the LUT. They are calculated from the Chebyshev coefficients as shown in Eqs. (13) through (16), which are derived from Eqs. (8), (10) and (11):

k_{i,0} = 0.5⋅c_{i,0} − c_{i,1} + c_{i,2} − c_{i,3}   (13)

k_{i,1} = (c_{i,1} − 4⋅c_{i,2} + 9⋅c_{i,3}) ⋅ (2/∆p)   (14)

k_{i,2} = (2⋅c_{i,2} − 12⋅c_{i,3}) ⋅ (2/∆p)^2   (15)

k_{i,3} = 4⋅c_{i,3} ⋅ (2/∆p)^3   (16)
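The coefficient construction of Eqs. (8), (9) and (13) through (16), as reconstructed above, can be checked numerically in a few lines; the segment index, ∆p and the choice N = 4 (third order) in the sketch below are illustrative values.

```python
import numpy as np

def f(p):
    # f(p) = p * ln(p), with the limit value 0 at p = 0
    return np.where(p > 0, p * np.log(np.maximum(p, 1e-300)), 0.0)

def chebyshev_coeffs(i, dp, N=4):
    """Chebyshev coefficients c_{i,k} of f on [i*dp, (i+1)*dp], Eq. (9)."""
    l = np.arange(1, N + 1)
    nodes = dp * (i + 0.5) + 0.5 * dp * np.cos(np.pi * (l - 0.5) / N)
    return np.array([(2.0 / N) * np.sum(f(nodes) * np.cos(np.pi * k * (l - 0.5) / N))
                     for k in range(N)])

def lut_coeffs(c, dp):
    """Monomial LUT coefficients k_{i,0..3} from Eqs. (13)-(16)."""
    u = 2.0 / dp
    k0 = 0.5 * c[0] - c[1] + c[2] - c[3]
    k1 = (c[1] - 4 * c[2] + 9 * c[3]) * u
    k2 = (2 * c[2] - 12 * c[3]) * u ** 2
    k3 = 4 * c[3] * u ** 3
    return k0, k1, k2, k3

# Check the approximation on one segment (i = 3, dp = 2**-10, illustrative values).
i, dp = 3, 2.0 ** -10
k0, k1, k2, k3 = lut_coeffs(chebyshev_coeffs(i, dp), dp)
p = np.linspace(i * dp, (i + 1) * dp, 1000, endpoint=False)
pd = p - i * dp                                  # p mod dp inside the segment
approx = ((k3 * pd + k2) * pd + k1) * pd + k0    # Eq. (12) in Horner form
print("max |error| on segment:", np.max(np.abs(approx - f(p))))
```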

Error Characteristics
The approximation error in the ith segment of f̂_N,m(p) is given by:

ε_i(p) = f̂_N,m(p) − f(p) = P_i(p mod ∆p) − f(p)   (17)

The position p_max_ei of the maximum absolute error e_i in this segment is obtained by numerically evaluating p when the derivative of the error is equal to zero:

dε_i/dp = P′_i(p_max_ei mod ∆p) − f′(p_max_ei) = 0   (18)

⇒ 3⋅k_{i,3}⋅(p_max_ei mod ∆p)^2 + 2⋅k_{i,2}⋅(p_max_ei mod ∆p) + k_{i,1} = 1 + ln(p_max_ei)   (19)

The maximum absolute error is then found by inserting p_max_ei into the error equation:

e_i = max_{i⋅∆p ≤ p < (i+1)⋅∆p} ε_i(p) = ε_i(p_max_ei)   (20)

Since using an Nth-order piecewise-polynomial approximation assumes that the Nth derivative of f(p) is constant in each segment of f̂_N,m(p), the segment size ∆p must be determined by the rate of change of the Nth derivative and the desired maximum error. The first and second derivatives of f(p) are f′(p) = 1 + ln(p) and f″(p) = 1/p. The rate of change of all derivatives approaches infinity as p nears zero and decreases steadily as p increases. Therefore, the approximation error is higher for smaller values of p, which further implies that this error is largest in the first segment of f̂_N,m(p), i.e., e_max = e_0 = ε_0(p_max_e0).

For a given ∆p, using a higher-order polynomial approximation yields a lower maximum error. Therefore, for a given number of LUT entries, e_max will decrease as the polynomial order increases. Figure 5 illustrates the dependence between ∆p, e_max and the polynomial order. To be able to store the LUT using internal memory, a reasonable maximum number of entries is 16K (2^14). For a LUT with 16K entries, the maximum error is on the order of 10^−8 to 10^−10 for first- to third-order approximations.

Figure 5. Plots of maximum approximation error e_max versus ∆p for different approximation polynomial orders.

Multi-LUT Approach
It was shown in the previous section that the maximum error will occur in the first segment of f̂_N,m(p). Figure 6(a) shows the error magnitude plot for an LUT with 2^14 bins, using third-order polynomials and double precision floating point numbers. As predicted before, the error decreases as p increases.

To further reduce LUT size and improve accuracy, we used a multi-LUT approach, where each LUT had a different ∆p value and was based on a different approximation function f̂_N,m_l(p).


The general n-LUT form of this approximation is shown in Eq. (21), where ∆p_1 < ∆p_2 < … < ∆p_n. Having larger values of ∆p for higher values of p is equivalent to approximating the function using variable segment sizes, which allows us to reduce the LUT size while keeping the approximation error below the allowable maximum:

F̂_N(p) = f̂_N,m_1(p)   for 0 ≤ p < p_max_1
         f̂_N,m_2(p)   for p_max_1 ≤ p < p_max_2
         ...
         f̂_N,m_n(p)   for p_max_{n−1} ≤ p < 1   (21)

Let p_min_i = p_max_{i−1} and p_max_0 = 0. Given a maximum allowable error and polynomial order, we can vary ∆p and use Eqs. (18) and (20) to determine the range [p_min, 1] over which the error criterion is satisfied. As explained above, the smaller the ∆p, the wider the [p_min, 1] interval, and hence the smaller p_min. In our analysis, we considered only those ∆p values that were powers of two to make the selection of the LUT and LUT segment simple. For a given probability range, a higher-order approximation leads to a larger ∆p value, thus resulting in smaller LUT sizes. The following algorithm was used to determine the number and the probability range of the LUTs, given a target size (number of samples) for each LUT:
1. Given the target LUT size m_1, obtain the desired ∆p_1 for the segment starting at p_min = 0 using Eqs. (18) and (20). The probability range of the first LUT thus equals [0, p_max_1], where p_max_1 = ∆p_1 ⋅ m_1.
2. Use p_max_1 from the first LUT to obtain ∆p for the second LUT. Determine the range of the second LUT.
3. Iterate step 2 to obtain the ∆p values and LUT ranges for the subsequent LUTs, until the entire [0, 1] range has been covered.

As stated before, the range of f(p) for p in [0, 1] is [−e^−1, 0] ⊂ [−0.5, 0]. Since the desired accuracy is 10^−9 (~2^−30), a 30-bit fixed-point representation will have sufficient dynamic range for the f(p) calculation. Using a higher number of bits will reduce the rounding error.

Using this approach we designed the first-order polynomial, 4-LUT configuration shown in Table I. The maximum error was on the order of 10^−8 for a total of 8K entries. Figure 6(b) shows a plot of the error magnitude versus p for this configuration. Using lower-order polynomials has the advantage of reducing the LUT data width and the number of arithmetic components in the pipeline.
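To show how a configuration like that of Table I would be exercised, the sketch below builds four first-order LUTs in floating point using the ranges and segment sizes of Table I, then evaluates p ⋅ ln(p) by selecting the LUT, indexing the segment and applying Eq. (12) with k2 = k3 = 0. The software coefficient generation and the selection logic are our own illustration; the paper does not spell out these details.

```python
import math

def f(p):
    return p * math.log(p) if p > 0 else 0.0

def first_order_segment(p_lo, dp):
    """First-order Chebyshev fit of f on [p_lo, p_lo + dp] -> (k0, k1)."""
    c0 = c1 = 0.0
    N = 2
    for l in range(1, N + 1):
        y = math.cos(math.pi * (l - 0.5) / N)
        fp = f(p_lo + 0.5 * dp * (y + 1.0))
        c0 += (2.0 / N) * fp
        c1 += (2.0 / N) * fp * y            # cos(pi * 1 * (l - 1/2) / N) equals y
    return 0.5 * c0 - c1, c1 * (2.0 / dp)   # Eqs. (13), (14) with c2 = c3 = 0

# Table I ranges and segment sizes: (p_min, p_max, delta_p) per LUT.
TABLE_I = [(0.0, 2**-13, 2**-24), (2**-13, 2**-6, 2**-17),
           (2**-6, 2**-2, 2**-13), (2**-2, 1.0, 2**-11)]
LUTS = [[first_order_segment(p_min + j * dp, dp)
         for j in range(int(round((p_max - p_min) / dp)))]
        for p_min, p_max, dp in TABLE_I]

def p_ln_p(p):
    """Evaluate p*ln(p) through the 4-LUT, first-order pipeline model."""
    for (p_min, p_max, dp), lut in zip(TABLE_I, LUTS):
        if p < p_max or p_max == 1.0:
            j = min(int((p - p_min) / dp), len(lut) - 1)   # segment index
            k0, k1 = lut[j]
            return k1 * (p - p_min - j * dp) + k0          # Eq. (12), k2 = k3 = 0
    raise ValueError("p outside [0, 1]")

for p in (1e-6, 0.01, 0.3, 0.9):
    print(p, p_ln_p(p), f(p))    # approximation vs. exact value
```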

Implementation and Results
The system was implemented as a proof of concept using an Altera Stratix EP1S40 FPGA on a PCI prototyping board manufactured by SBS, Inc.32 The FPGA ran at a maximum internal frequency of 200 MHz, with a 50 MHz reference image RAM bus and a 100 MHz floating image RAM bus. The joint histogram RAM was partitioned into eight blocks and implemented using internal 4 Kb memory blocks. The reference and floating image memories were implemented using standard PC100 SDRAMs and high-speed SRAMs, respectively. Entropy calculation was implemented using the 4-LUT, first-order polynomial configuration shown in Table I. As shown before, for a given LUT, the error magnitude decreases as p increases. All LUTs were implemented inside one internal 512 Kbit memory block. Mutual information was calculated using 32-bit fixed point numbers.

As implemented, the system was able to process 50 million reference image voxels per second. The timing characteristics of the system were compared against optimized software implementations and the hardware implementation of the previous FAIR architecture.19

Figure 6. (a) Error magnitude plot for a single LUT with 2^14 entries; (b) error magnitude plot for 4 LUTs with 2^11 entries each.

TABLE I. Characteristics of a 4-LUT Implementation of the Entropy Calculation Pipeline

LUT Number   Number of Entries (m)   p_min   p_max   ∆p
1            2048                    0       2^−13   2^−24
2            2048                    2^−13   2^−6    2^−17
3            2048                    2^−6    2^−2    2^−13
4            2048                    2^−2    2^0     2^−11


Table II shows the characteristics of the different systems used for benchmarking, and Table III shows the timing results. The number of subvolumes refers to the number of 2 × 2 × 2 control point neighborhoods in the control point mesh used to define the local deformations, and is an indicator of the resolution at which local deformations are defined. When local deformations are used, the software implementations become roughly three times slower due to the extra computational load from accessing the grid memory and performing the extra interpolations. The software execution times confirm that mutual information calculation time is more dependent on the external memory speed than on the processor speed, especially for large image sizes. FAIR-II delivered a speedup of 30 for linear registration and 100 for elastic registration against the corresponding software implementations on a 3.2 GHz PIII Xeon workstation with 1 GB of 266 MHz DDRAM.

Conclusions
Elastic image registration is currently an active field of research in the medical imaging community, far from being considered a solved problem. Most current algorithms for elastic image registration are tailored to specific applications, although many of them share significant characteristics. The FAIR-II architecture presented in this paper does not aim to accelerate a specific algorithm, but rather any algorithm based on a uniform control point mesh-based transformation that uses mutual information as an image similarity measure. Mutual information was chosen because it is widely accepted in the imaging community as the best image similarity measure currently available for single- or multimodality image registration. FAIR-II achieves speedup rates an order of magnitude higher than those achieved by the first-generation FAIR architecture for linear registration and also supports elastic registration.

Future work on the acceleration of mutual information calculation can be performed on the estimation of the joint voxel intensity probability distribution, which is calculated from the joint histogram. The current architecture uses partial volume interpolation to update the joint histogram. When performing elastic registration, a more accepted way to calculate the joint histogram of small image segments is using Parzen windowing. The use of Parzen windowing for calculating the joint intensity probability distribution has been well documented in the literature.14,33−36

TABLE III. Comparison of Mutual Information Calculation Times for Different Implementations

Mutual information calculation time (milliseconds)

Image size (voxels)       2^18                 2^21                   2^24
Number of subvolumes   1*     8      64      1*     8       64      1*      8        64
SW1                    562    1682   1685    4712   13434   13530   38500   106269   105972
SW2                    120    499    485     1025   3547    3945    10200   32351    33983
FAIR19                 62     **     **      430    **      **      3370    **       **
FAIR-II                5      5      5       42     42      42      335     335      335

*In the 1-subvolume case a linear transformation was used in the software version.
**FAIR supports only linear transformations.

TABLE II. Comparison System Characteristics

System Name   Processor Type               Processor Speed   Memory Type   Memory Speed
SW1           Intel PIII Xeon              500 MHz           SDRAM         100 MHz
SW2           Intel PIII Xeon              3.2 GHz           DDRAM         266 MHz
FAIR19        Altera Acex EP1K100 FPGA     80 MHz            SDRAM         80 MHz
FAIR-II       Altera Stratix EP1S40 FPGA   200 MHz           SRAM          50 MHz

It is based on accumulating the joint histogram using a Gaussian kernel centered at the location corresponding to the current voxel intensity pair. Using Parzen windowing implies accessing the joint histogram more often than when using partial volume interpolation. Since the joint histogram is, by definition, accessed at several locations for each voxel pair, implementing Parzen windowing would require deeper partitioning than presented here, but with simpler accumulation pipelines.

Acceleration of elastic image registration algorithms is widely recognized as one of the current requirements to bring these algorithms into use in clinical procedures, especially in the operating room.13,37 The FAIR-II architecture is an important step in this direction since it allows the implementation of image registration systems that are economical and compact and thus suitable for clinical deployment.

Acknowledgment. This work was supported by the National Medical Technology Testbed grant DAMD17-97-2-7016.

References
1. R. Shekhar, V. Zagrodsky, C. R. Castro-Pareja, V. Walimbe, and J. M. Jagadeesh, High-speed registration of three- and four-dimensional medical images by using voxel similarity, Radiographics 23, 1673 (2003).
2. J. V. Hajnal, D. J. Hawkes and D. L. G. Hill, Medical Image Registration, CRC Press, Boca Raton, FL, 2001.
3. D. Rueckert, Nonrigid registration: concepts, algorithms, and applications, Medical Image Registration, J. V. Hajnal, D. L. G. Hill, and D. J. Hawkes, Eds., CRC Press, Boca Raton, FL, 2001.
4. R. Myers, The application of PET-MR image registration in the brain, Brit. J. Radiol. 75, S31 (2002).
5. G. Xiao, J. M. Brady, J. A. Noble, M. Burcher, and R. English, Nonrigid registration of 3D free-hand ultrasound images of the breast, IEEE Trans. Med. Imaging 21, 405 (2002).
6. M. A. Wirth, J. Narhan and D. Gray, Nonrigid mammogram registration using mutual information, Proc. SPIE 4684, 562 (2002).
7. Z. Lee, K. K. Nagano, J. L. Duerk, D. B. Sodee, and D. L. Wilson, Automatic registration of MR and SPECT images for treatment planning in prostate cancer, Academic Radiology 10, 673 (2003).
8. U. Kjems, S. C. Strother, J. Anderson, I. Law, and L. K. Hansen, Enhancing the multivariate signal of [15O] water PET studies with a new nonlinear neuroanatomical registration algorithm, IEEE Trans. Med. Imaging 18, 306 (1999).
9. T. Rohlfing and J. Maurer, Intensity-based non-rigid registration using adaptive multilevel free-form deformation with an incompressibility constraint, Lecture Notes in Computer Science 2208, 111 (2001).
10. J. A. Schnabel, D. Rueckert, M. Quist, J. M. Blackall, A. D. Castellano-Smith, T. Hartkens, G. P. Penney, W. A. Hall, H. Liu, C. L. Truwit, F. A. Gerritsen, D. L. G. Hill, and D. J. Hawkes, A generic framework for nonrigid registration based on nonuniform multi-level free-form deformations, Lecture Notes in Computer Science 2208, 573 (2001).
11. V. Walimbe, V. Zagrodsky, S. Raja, B. Bybel, M. Kanvinde, and R. Shekhar, Elastic registration of three-dimensional whole body CT and PET images by quaternion-based interpolation of multiple piecewise linear rigid-body registrations, Proc. SPIE 5370, 119 (2004).
12. F. Maes, D. Vandermeulen and P. Suetens, Medical image registration using mutual information, Proc. IEEE 91, 1699 (2003).
13. J. P. W. Pluim, J. B. A. Maintz and M. A. Viergever, Mutual information-based registration of medical images: A survey, IEEE Trans. Med. Imaging 22, 986 (2003).
14. W. M. Wells, P. Viola, H. Atsumi, S. Nakajima, and R. Kikinis, Multi-modal volume registration by maximization of mutual information, Medical Image Analysis 1, 35 (1996).
15. G. E. Christensen, M. I. Miller, M. W. Vannier, and U. Grenander, Individualizing neuroanatomical atlases using a massively parallel computer, IEEE Computer 29, 32 (1996).
16. S. K. Warfield, F. A. Jolesz, and R. Kikinis, A high performance approach to the registration of medical imaging data, Parallel Computing 24, 1345 (1998).
17. T. Rohlfing and C. R. Maurer, Non-rigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees, IEEE Trans. Information Technol. Biomedicine 7, 16 (2003).
18. F. Ino, K. Ooyama, A. Takeuchi, and K. Hagihara, Design and implementation of parallel nonrigid image registration using off-the-shelf supercomputers, Lecture Notes in Computer Science 2878, 327 (2003).
19. C. R. Castro-Pareja, J. M. Jagadeesh and R. Shekhar, FAIR: A hardware architecture for real-time 3D image registration, IEEE Trans. Information Technol. Biomed. 7, 426 (2003).
20. D. Rueckert, L. I. Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes, Nonrigid registration using free-form deformations: Application to breast MR images, IEEE Trans. Med. Imaging 18, 712 (1999).
21. F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, Multimodality image registration by maximization of mutual information, IEEE Trans. Med. Imaging 16, 187 (1997).
22. M. Otte, Elastic registration of fMRI data using Bezier-spline transformations, IEEE Trans. Med. Imaging 20, 193 (2001).
23. C. Studholme, R. T. Constable and J. S. Duncan, Accurate alignment of functional EPI data to anatomical MRI using a physics-based distortion model, IEEE Trans. Med. Imaging 19, 1115 (2000).
24. M. Doggett and M. Meissner, A memory addressing and access design for real time volume rendering, Proc. IEEE International Symposium on Circuits and Systems, ISCAS '99, IEEE Press, Los Alamitos, CA, 1999, p. 344.
25. D. K. Kostopoulos, An algorithm for the computation of binary logarithms, IEEE Trans. Computers 40, 1267 (1991).
26. D. M. Mandelbaum and S. G. Mandelbaum, A fast, efficient parallel-acting method of generating functions defined by power series, including logarithm, exponential, and sine, cosine, IEEE Trans. Parallel and Distributed Systems 7, 33 (1996).
27. J. Hormigo, J. Villalba and M. J. Schulte, A hardware algorithm for variable-precision logarithm, Proc. of IEEE International Conference on Application-Specific Systems, Architectures, and Processors, IEEE Press, Los Alamitos, CA, 2000, p. 215.
28. H. Hassler and N. Takagi, Function evaluation by table look-up and addition, Proc. of 12th Symposium on Computer Arithmetic, 1995, p. 10.
29. S. L. SanGregory, C. Brothers, D. Gallagher, and R. Siferd, A fast, low-power logarithm approximation with CMOS VLSI implementation, Proc. 42nd Midwest Symposium on Circuits and Systems, IEEE Press, Piscataway, NJ, 388 (1999).
30. W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, Cambridge, England, 1992.
31. M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, Cambridge, 1981.
32. Tsunami PCI FPGA Processor Technical Data Sheet, Release version 1.2 ed., SBS Technologies, Waterloo, Ontario, Canada, 2003.
33. P. Thevenaz and M. Unser, Optimization of mutual information for multiresolution image registration, IEEE Trans. Image Processing 9, 2083 (2000).
34. G. Hermosillo and O. Faugeras, Dense image matching with global and local statistical criteria: A variational approach, Computer Vision and Pattern Recognition 1, 73 (2001).
35. J.-F. Mangin, C. Poupon, C. Clark, D. Le Bihan, and I. Bloch, Eddy-current distortion correction and robust tensor estimation for MR diffusion imaging, Lecture Notes in Computer Science 2208, 186 (2001).
36. D. Sarrut and S. Clippe, Geometrical transformation approximation for 2D/3D intensity-based registration of portal images and CT scan, Lecture Notes in Computer Science 2208, 532 (2001).
37. K. Cleary, H. Y. Chung, and S. K. Mun, OR 2020 workshop overview: operating room of the future, Proc. 18th International Congress on Computer Assisted Radiology and Surgery, vol. 1268, H. U. Lemke, M. W. Vannier, K. Inamura, Eds., Elsevier, Amsterdam, 2004, p. 847.
