An FPGA-based Computing Platform for Real-time 3D Medical Imaging and its Application to Cone-beam CT Reconstruction
Jianchun Li1, Christos Papachristou1 and Raj Shekhar2
1Department of Electrical Engineering and Computing Science, Case Western Reserve University, Cleveland, OH 44106, USA
2Department of Diagnostic Radiology, University of Maryland, Baltimore, MD 21201
Corresponding Author: Raj Shekhar, Ph.D.
Department of Diagnostic Radiology University of Maryland 22 S. Greene Street Baltimore, MD 21201 USA
Phone: 410-706-8714 Fax: 410-328-0641 Email: [email protected]
ABSTRACT
Real-time three-dimensional (3D) medical imaging requires both high memory and high
computational bandwidth due to the massive amount of data to be processed. Application-specific
systems have been designed to speed up a few selected 3D imaging algorithms. When applied to
other algorithms, such systems require complicated redesign procedures or deliver lower
performance than expected. In this article, we present an FPGA-based computing platform that can
be used to accelerate a broad range of 3D medical imaging algorithms dominated by local
operations. Its generality and reconfigurability make it easy to customize to differing algorithm
requirements. The architecture in the platform exploits intrinsic parallelisms in 3D imaging
algorithms in order to achieve high computational bandwidth. Along with the architecture, a new
caching scheme called brick caching was designed to dynamically partition the input data and
buffer them in distributed internal caches to provide multiple memory accesses. The caching
scheme exploits locality of reference in three dimensions, thus greatly reducing the total data-flow
from external system memory to internal caches. It is also a deterministic caching method, which
allows the input data to be pre-fetched into the processor before they are processed. We also present
application of the platform in FDK (Feldkamp-Davis-Kress) cone-beam computed tomography
(CT) image reconstruction so as to demonstrate its potential in real-time 3D medical imaging.
Keywords: 3D imaging, computing platform, brick caching scheme, cone-beam reconstruction
1. INTRODUCTION
Real-time three-dimensional (3D) imaging represents a developing trend in medical imaging.
Currently, 3D medical images are acquired from a variety of imaging modalities like computed
tomography (CT), magnetic resonance imaging (MRI), ultrasound, positron emission tomography
(PET), and single photon emission computed tomography (SPECT). Compared to two-dimensional
(2D) images, 3D images represent an anatomic structure in a more realistic manner, are intuitive to
work with and can provide more accurate information for image-based diagnoses and treatments.
However, most 3D medical imaging algorithms are computationally intensive, which makes real-
time 3D imaging expensive and impractical for most clinical applications.1,2,3
The primary reason these algorithms are time-consuming is the massive amount of data to
be processed. For example, a 512×512×512 8-bit image translates to 128 Mbytes of data, and such
images often need to be processed repeatedly because many algorithms are iterative in nature.
Another reason is the complicated data addressing and accessing patterns in these algorithms. 3D
image data are usually stored in the system memory sequentially; however, many algorithms
require nonsequential or even random access to the data. Therefore, traditional word-line and
interleave caching methods cannot efficiently fetch and buffer image data. Random access to the
system memory (usually DRAM), on the other hand, slows down the entire computation.
These two factors make 3D medical imaging algorithms time-consuming and expensive. Two
alternative solutions could be used to address this problem. The first is to design an application
specific integrated circuit (ASIC) for accelerating a particular algorithm. This is the most efficient
way to accelerate an algorithm; however, ASICs are not flexible enough to adapt to changes in the
algorithm and are difficult to reuse in other applications. Moreover, designing and developing an
ASIC incurs high development or non-recurring engineering (NRE) costs. The second solution is to
use a homogeneous multiple-processor system,4,5 such as a general supercomputer or a workstation
cluster. This is the most flexible solution. However, the traditional von Neumann architecture used
in general-purpose processors and the communication overhead between the processors limit the
potential speedup per processor and make it hard to achieve the desired performance without large-
scale multi-processor systems. Presently, this makes the solution unacceptably expensive for
clinical deployment.
Compared to the above two solutions, a field programmable gate array (FPGA)-based
computing system offers a good trade-off between performance and flexibility, and is more
attractive and promising for real-time 3D medical imaging. Modern high-end FPGAs integrate
hard-wired processor cores, SRAM blocks, multipliers, digital clock managers and massive
reconfigurable logic resources in a single chip.6 By combining these resources and customizing the
computing architecture inside the FPGA, dedicated systems can be designed to accelerate different
algorithms. The system architecture inside the FPGA can also be reconfigured to adapt to the
changing demands of different algorithms, thus avoiding the need to design totally new hardware.
Several FPGA-based systems have been designed to accelerate specific 3D medical imaging
algorithms.7-10 However, a disadvantage of FPGA-based systems is that a large amount of time and
effort is required to redesign the computing architecture inside the FPGA in order to maintain high
performance for different algorithms. Also, it is difficult to manage the data flow in algorithms in
which random data access occurs.
To make the high-performance FPGA-based solution easy to use in different 3D medical
imaging applications, we have designed a computing platform with a generalized architecture and a
new caching scheme called brick caching. These two concepts are aimed at accelerating a broad
range of 3D medical imaging algorithms dominated by local operations. A local operation can be
defined as an operation for which only a small subset of input data are required to compute a single
output value. Trilinear interpolation is an example of a local operation, which needs eight input
samples to calculate one output value. This property provides the intrinsic parallelism for
partitioning the overall computation and executing the partitions concurrently. It is also the basis for
our brick caching scheme described in Section 2.
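As a concrete illustration of a local operation, the short Python sketch below (our own illustration, with hypothetical array and coordinate names) computes one trilinearly interpolated output value; note that it reads only the eight voxels surrounding the sample point, which is exactly the property the platform exploits.

```python
import numpy as np

def trilinear_sample(volume, x, y, z):
    """Trilinearly interpolate `volume` (indexed [z, y, x]) at a non-integer
    position; only the 2x2x2 neighborhood around the point is ever read."""
    x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
    dx, dy, dz = x - x0, y - y0, z - z0
    c = volume[z0:z0 + 2, y0:y0 + 2, x0:x0 + 2].astype(float)  # the 8 local inputs
    cx = c[:, :, 0] * (1 - dx) + c[:, :, 1] * dx    # collapse along x
    cxy = cx[:, 0] * (1 - dy) + cx[:, 1] * dy       # collapse along y
    return cxy[0] * (1 - dz) + cxy[1] * dz          # collapse along z

vol = np.arange(64, dtype=float).reshape(4, 4, 4)   # toy 4x4x4 volume
print(trilinear_sample(vol, 1.5, 0.25, 2.0))
```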
The architecture is targeted to Xilinx Virtex II Pro FPGAs, which contain up to four PowerPC
embedded processors, a large number of SRAM blocks (several hundreds), multipliers and other
logic resources.6 The generality of this architecture allows it to be easily modified to map different local
operations-based imaging algorithms. To achieve high computational performance, intrinsic
algorithm parallelisms are exploited in the architecture such as multiple data brick (block) caching,
parallel data-stream processing and deep pipeline cascades.
The brick caching scheme we present takes full advantage of locality of reference in three
dimensions to reduce the total data-flow from the main memory to the caches. It is a deterministic
caching method, meaning that there are no cache misses in computation and data are always pre-
fetched into the caches before they are needed. High-speed multiple data accesses are supported by
configuring the SRAM blocks in the FPGA into multiple independent caches and copying the same
input data block into each of them.
In section 2 of this paper, we introduce our brick caching scheme. In section 3, we present the
architecture of the platform. The application of the platform to FDK (Feldkamp-Davis-Kress) cone-
beam reconstruction1 is demonstrated in section 4. We discuss our simulation results and future
objectives in section 5.
2. BRICK CACHING SCHEME
3D medical imaging involves processing massive amounts of image data; hence, DRAM is
typically used as the main system memory for data storage. Moreover, the hardware structure of
DRAM is such that peak performance is achieved only when accessing sequential data, thus causing
a dramatic drop in performance when accessing data randomly. Unfortunately, frequently occurring
operations like 3D interpolation and 3D finite impulse response (FIR) filtering in most 3D imaging
algorithms require multiple random data accesses. For example, trilinear interpolation requires us to
access eight neighboring input data samples that do not appear sequentially in main memory, thus
requiring random memory accesses.
To address the memory access problem, several strategies have been suggested. For example, de
Boer et al.11 proposed a scheme for eight simultaneous memory accesses by implementing eight
independent external memory banks, which is especially useful for trilinear interpolation. Doggett
and Meißner12 presented a cubic addressing scheme for real-time volume rendering, in which the
input image is partitioned into fixed-size cubes that are buffered inside the processor one at a time.
Whenever a traced ray passes through the boundary of the cube, a new one is fetched into the cache.
Unlike these caching methods, our proposed caching scheme, called brick caching, partitions
the image data in the output image space, rather than the input image space. It exploits the locality
of reference in three dimensions to significantly reduce the total data-flow from the system memory
to the internal caches. Also, our scheme is a deterministic caching method since the input data are
pre-fetched into the FPGA before they are actually needed, thus eliminating cache misses. The
FPGA SRAM blocks are configured into multiple independent caches for multiple simultaneous
data accesses. Our brick caching scheme is described next.
2.1. Brick partitioning
The first step in brick partitioning involves partitioning the output data space (i.e., output of the
algorithm) into small m × n × k cubes, called output bricks (Figure 1c). The coordinates of the eight
vertices of an output brick are denoted as (XOi, YOi, ZOi), i = 1…8.
Subsequently, the output brick is mapped back to locate the corresponding data subset in the
input space according to the data addressing procedure of the algorithm (Figure 1b). The located
subset in the input space contains all the data required to compute the output results in the output
brick. The vertices of the data subset are denoted as (XIi, YIi, ZIi), i = 1…8. The data subset in the input space usually has an irregular orientation and shape (Figure 1b).
The third step in brick partitioning is to find the boundary of the input brick, another dataset in the input space with regular orientation and shape, which completely encloses the required
input dataset. Suppose the vertex A (Figure 1a) of the input brick is (XBmin, YBmin, ZBmin) and the
vertex B of the brick is (XBmax, YBmax, ZBmax), then:
XBmin = ⌊min(XIi)⌋,  YBmin = ⌊min(YIi)⌋,  ZBmin = ⌊min(ZIi)⌋    (1)

XBmax = ⌊max(XIi)⌋ + 1,  YBmax = ⌊max(YIi)⌋ + 1,  ZBmax = ⌊max(ZIi)⌋ + 1    (2)
where ⌊x⌋ is the floor function, which returns the largest integer not greater than x, min(xi) returns the smallest value in the data set xi, and max(xi) returns the largest value. The sizes of the input brick along the three dimensions X, Y and Z are:
Sx = XBmax − XBmin + 1,  Sy = YBmax − YBmin + 1,  Sz = ZBmax − ZBmin + 1    (3)
The data in the input brick will be fetched from the system memory to the caches sequentially.
If the offset from the cache beginning is Coff, then the position, Cp, of the data with coordinates (X,
Y, Z) in the cache can be calculated as:
Cp = Coff + (Z − ZBmin)·Sx·Sy + (Y − YBmin)·Sx + (X − XBmin)    (4)
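To make the partitioning concrete, the following Python sketch (our own illustration, not part of the platform) implements equations (1) through (4): it derives the input brick bounds from the mapped vertices of an output brick and computes the linear cache position of a voxel inside that brick. The vertex values in the example are arbitrary.

```python
import math

def input_brick_bounds(mapped_vertices):
    """Equations (1)-(3): bounding box and size of the input brick that encloses
    the mapped vertices (XIi, YIi, ZIi) of one output brick."""
    xs, ys, zs = zip(*mapped_vertices)
    xb_min, yb_min, zb_min = (math.floor(min(c)) for c in (xs, ys, zs))
    xb_max, yb_max, zb_max = (math.floor(max(c)) + 1 for c in (xs, ys, zs))
    size = (xb_max - xb_min + 1, yb_max - yb_min + 1, zb_max - zb_min + 1)
    return (xb_min, yb_min, zb_min), size

def cache_position(x, y, z, origin, size, c_off=0):
    """Equation (4): offset of voxel (X, Y, Z) inside the linearized brick cache."""
    xb_min, yb_min, zb_min = origin
    sx, sy, _ = size
    return c_off + (z - zb_min) * sx * sy + (y - yb_min) * sx + (x - xb_min)

# Example with a few mapped vertices (values are arbitrary).
verts = [(10.2, 4.7, 8.1), (13.9, 4.9, 8.0), (10.4, 7.6, 9.3), (14.1, 7.8, 9.5)]
origin, size = input_brick_bounds(verts)
print(origin, size)                            # (10, 4, 8) (6, 5, 3)
print(cache_position(12, 6, 9, origin, size))  # 44
```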
The size of the output bricks (m × n × k) will determine the size of the input bricks. Suppose the
volume of input brick i is Vi and the total data-flow volume from the system memory to the caches
is Vtotal, then both Vi and Vtotal are functions of m, n and k. That is:

Vtotal = Σ Vi = f(m, n, k)    (5)
Assuming the bandwidth from system memory to the cache to be W and the total algorithm
processing time to be T, then:
T ≥ Vtotal / W    (6)
When T = Vtotal / W, the system memory bandwidth is at least one of the bottlenecks limiting the
processing speed. Thus, Vtotal should be minimized by optimizing the partition of the output space.
This involves selecting the optimized sizes of the output brick, m, n and k. One important constraint
in this optimization is that Vi, the volume of any input brick, should not exceed the available cache
size. Numerical simulation of the algorithm's data-addressing and computing procedure can be used for this optimization.
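A simple way to carry out this optimization numerically is an exhaustive search over candidate brick sizes, summing the simulated input-brick volumes for each candidate and rejecting any whose largest input brick would overflow the cache. The Python sketch below is illustrative only; `input_brick_volume` is a hypothetical stand-in for the algorithm-specific addressing simulation.

```python
def best_partition(out_shape, cache_size, input_brick_volume, candidates):
    """Pick the (m, n, k) that minimizes V_total (equation (5)) subject to the
    constraint that every input brick fits in the available cache."""
    best = None
    X, Y, Z = out_shape
    for m, n, k in candidates:
        vols = [input_brick_volume(m, n, k, (x0, y0, z0))
                for z0 in range(0, Z, k)
                for y0 in range(0, Y, n)
                for x0 in range(0, X, m)]
        if max(vols) > cache_size:        # constraint: every V_i must fit in the cache
            continue
        v_total = sum(vols)               # total data flow for this partition
        if best is None or v_total < best[0]:
            best = (v_total, (m, n, k))
    return best

# Toy model: pretend each input brick is the output brick dilated by one voxel.
toy = lambda m, n, k, origin: (m + 2) * (n + 2) * (k + 2)
print(best_partition((64, 64, 64), cache_size=8192, input_brick_volume=toy,
                     candidates=[(8, 8, 8), (16, 16, 16), (32, 32, 32)]))
```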
2.2. Multiple memory access
To support multiple parallel memory accesses, the SRAM blocks in the FPGA are configured as
multiple independent caches. These caches have two data ports, one port connecting to the
broadcast data bus from the system memory and the other connecting to the data processing
pipeline for feeding data. A mask register controls whether and when a certain cache will accept
data from the main memory (Figure 2). More details about cache configuration are discussed in
section 3.
For trilinear interpolation, de Boer et al.11 proposed to index the eight neighboring voxels
(numbered 0 to 7) and to store them in eight independent memory banks so that they can be
accessed simultaneously in one clock cycle. A similar approach can be used in the proposed brick
caching scheme. However, the disadvantage of this approach is that the addressing is complicated: not only the data position within a memory bank but also the bank index must be calculated for each access. To address this problem, we use data duplication, which involves simply
copying the same input data brick into multiple cache banks (Figure 2) so that the neighboring data
samples can be accessed simultaneously. Although there is a memory utilization overhead
associated with data duplication, it is acceptable here because there is a large number of independent SRAM blocks available in modern FPGA chips. For example, the Virtex-II Pro XC2VP125
from Xilinx has 556 18Kb dual-port SRAM blocks inside the chip.6 Data duplication simplifies the
cache structure design and saves computing resources.
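The behavioral sketch below (plain Python, with hypothetical class and method names of our own) models the duplication idea: one broadcast fills every bank with the same brick, after which the neighbors required in a single cycle are simply read from different banks, so no bank index has to be computed.

```python
class DuplicatedBrickCache:
    """Model of several dual-port SRAM banks that all hold the same input brick."""
    def __init__(self, n_banks):
        self.banks = [None] * n_banks

    def broadcast_fill(self, brick):
        # One broadcast write fills every bank with an identical copy (Figure 2).
        self.banks = [list(brick) for _ in self.banks]

    def read_neighbors(self, offsets):
        # Neighbor j is read from bank j; only the in-brick offset is computed,
        # never a bank index.
        assert len(offsets) <= len(self.banks)
        return [self.banks[j][p] for j, p in enumerate(offsets)]

cache = DuplicatedBrickCache(n_banks=4)
cache.broadcast_fill(range(100))                # a toy linearized input brick
print(cache.read_neighbors([10, 11, 26, 27]))   # e.g., 4 bilinear-interpolation neighbors
```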
2.3. Brick pre-fetching
Brick caching is a deterministic caching method in which data bricks can be pre-fetched into the
caches before the data are actually required by the processing units. After finishing the processing
of one input brick, the system is able to process the next one immediately and there is no pipeline
stall while activating data fetching. The whole system can work smoothly at its full speed. On the
other hand, if the next brick is not in the cache, the system pipeline will stall, waiting for the next
brick to be fetched. The brick caching scheme prevents such pipeline stalls by pre-fetching data bricks.
Specifically, every bank of the cache system is partitioned into two sections, as shown in Figure
3. Each section can hold one input brick. The caches act as a FIFO (first-in, first-out) buffer, except that
the moving unit is a single data brick. One data port of the dual-port cache banks is configured as
the input port connected to system memory, and the other is the output port connected to the
processing pipeline. Data fetching and data feeding are executed simultaneously.
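The two-section organization behaves like a software double buffer: while one section streams brick n into the pipelines, the other is filled with brick n+1, and the roles swap. The Python sketch below is a behavioral model only; in hardware the fetch and the processing proceed concurrently through the two ports of the dual-port SRAMs, and the fetch/process routines shown here are placeholders.

```python
def process_bricks(brick_stream, fetch, process):
    """Behavioral model of brick pre-fetching with a two-section (ping-pong) cache."""
    sections = [None, None]              # S1 and S2, each holds one input brick
    fill, drain = 0, 1                   # which section is being filled / processed
    it = iter(brick_stream)
    sections[drain] = fetch(next(it))    # prime the first brick
    for desc in it:
        # In hardware these two lines overlap: one port fills `fill` while the
        # other port streams `drain` into the processing pipelines.
        sections[fill] = fetch(desc)
        process(sections[drain])
        fill, drain = drain, fill        # swap roles
    process(sections[drain])             # last brick

# Toy usage: "fetching" just materializes a range, "processing" sums it.
process_bricks(range(5),
               fetch=lambda d: list(range(d * 10, d * 10 + 10)),
               process=lambda b: print(sum(b)))
```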
3. ARCHITECTURE
The architecture is designed to be a high-level abstract platform so that a broad range of local
operations-based 3D imaging algorithms can be mapped onto it. To accelerate the execution of the
algorithms, the architecture takes advantage of the brick caching scheme and exploits the intrinsic
parallelisms present in the algorithms. The major parallelism in local operations-based 3D imaging
algorithms is the independence of the calculation of each output value, i.e., the result of one output is
not determined or affected by other outputs. Therefore, a multiple deep pipeline execution unit in
the architecture is designed to perform the massive computational tasks. The architecture is
designed to be implemented in a single FPGA chip, except for the system memories.
The system (or architecture) (Figure 4) is composed of five sub-systems: (a) PowerPC-based
Controller (PPC); (b) Input Brick Fetching unit (IBF); (c) Multiple Pipeline Execution unit (MPE);
(d) Output Brick Storing unit (OBS) and (e) Central Pipeline Controller (CPC). Details about the
function, configuration and execution of the subsystems are described below.
3.1. The subsystems
The PPC (Figure 5A) is used to calculate the parameters of the data bricks, such as the origin
and size of the input and output data bricks. It also controls the execution of the whole system,
which involves operations such as initiating or stopping the computation. These parameters are sent to a register file (REG File) through the device control register (DCR) bus, which
can be accessed by the CPC.
The PPC is composed of a PowerPC processor, on-chip instruction memory and data memory, a
register file and a memory controller connecting to an external memory, MEM-1. The PowerPC
processor is one of the hardwired processor cores embedded in Xilinx Virtex-II Pro FPGA chips.
The on-chip instruction memory (I-MEM) and data memory (D-MEM) are configured with SRAM
blocks inside the FPGA. Their data bus is 32 bits wide and their capacity is 32 KB. Critical parts
of the program running on the PowerPC processor are compiled and loaded into the on-chip
instruction memory at the time of FPGA configuration. The on-chip data memory (D-MEM) holds
the computational results from the processor. The data memory is configured as a dual-port memory
– one port is connected to the processor and the other is used by other subsystems to access the data
in the memory. MEM-1 is an external DRAM that provides extra data and instruction space for the
processor; it is controlled by the MEM-1 controller.
The Input Brick Fetching unit (IBF) in figure 5B is the architectural implementation of the brick
caching scheme. It contains multiple dual-port input caches, a mask register and a memory
controller, the MEM-2 controller. The dual-port caches have one port connected to the data
broadcast bus from the MEM-2 controller and the other port connecting to the pipelines in the MPE
unit. The MEM-2 controller accepts parameters of input bricks from the CPC unit, then fetches
input bricks and broadcasts them into the input brick caches. The mask register, set by the CPC,
controls which caches will accept the incoming data. Each of the input brick caches is partitioned
into two sections: S1 and S2. While one section (say, S1) is accepting data from the MEM-2
controller, the other section (S2) can provide data to the MPE unit at the same time, and vice versa.
The Multiple Pipeline Execution unit (MPE) in figure 5C is the subsystem performing the major
computational tasks. Multiple independent deep pipelines, which are identical in function and
architecture, are configured in the unit. These pipelines accept multiple data streams from the input
brick caches. The outputs of the pipelines are combined together and sent to the output brick
caches. All the pipelines are data-driven, i.e., each function unit (FU) operates only when its input
data are ready. The configuration of the MPE is algorithm-dependent because different algorithms
need different functional architecture.
The Output Brick Storing unit (OBS, Figure 5D) is used to save the final results from the MPE
and send them out to the external memory MEM-3. It contains two output brick caches and a
memory controller, the MEM-3 controller. The output brick caches are dual-port caches with one
port connecting to the MPE unit and the other connecting to the MEM-3 controller. When one
cache is accepting results from the pipelines, the other passes computed data to the MEM-3
controller for data dumping. The directions of the data flows in the OBS are controlled with a
demultiplexer and a multiplexer.
The Central Pipeline Controller (CPC, Figure 5E) coordinates the operation of the whole
system. Initially, it communicates with the PPC, requesting execution parameters such as the origin
and size of the input and output bricks. After receiving these parameters, the CPC sends the
information about the input brick to the IBF, after which the IBF begins to fetch the input brick.
When the IBF finishes the fetching, the CPC initiates the execution of the MPE. After the MPE
completes computation of one output brick, the CPC directs the OBS to save results to the external
memory MEM-3. The communication between the CPC and the other four subsystems occurs
simultaneously and the executions of the four subsystems are overlapped.
3.2. Parallelism
To achieve high computational bandwidth, the system is designed to operate at three parallel
levels. The first level is what we call the brick operation level. At this level, the algorithm is divided
into four stages (Figure 6): parameter generation, input brick fetching, multi-pipeline processing
and output brick storing. The parameter generation stage, performed by the PPC, calculates the parameters
required for the execution of the other three subsystems. The IBF performs input brick fetching, the
MPE performs the multi-pipeline processing and the OBS takes care of the output brick storing. All
four stages run simultaneously as a system-level pipeline, thus forming the first level of parallelism.
The second level of parallelism is multiple data-stream processing. In the MPE unit, multiple
independent data-stream processing pipelines run simultaneously. The number of pipelines depends
on the algorithm and the available FPGA resources. Suppose the number of the pipelines is M, then
for each clock cycle, there are M results coming out of the MPE unit.
The third level of parallelism is the deep pipeline architecture in the MPE unit. The computation
is partitioned into large numbers of small stages and implemented on the deep pipeline architecture
with high running frequency.
The computational bandwidth is set by the slowest stage at the brick operation level. Thus, it is
important to balance the workload of the four stages in the first parallelism level to ensure the best
possible performance.
3.3. Generality to different algorithms
The whole computing platform is designed to be a high-level platform for easy adaptation to
different algorithms. The architectures of four subsystems, PPC, CPC, IBF and OBS, are designed
to be generic, so that implementation of different algorithms requires only the change of the
configuration parameters held in the configuration registers. The users are able to use these units as
primitives in their own system design. Only the architecture of MPE is algorithm-specific and
should be customized according to the data-flow of the algorithms. Many strategies have been
proposed to automatically map the data-flow of an algorithm into pipeline architectures.13,14,15 Our
future goal is to develop a high-level user interface to automatically configure the architecture of
MPE according to a user-supplied description of the algorithm data-flow. Since the user can then focus on the architecture design of the MPE unit instead of the whole system, this would greatly reduce the redesign time
and cost.
4. APPLICATION TO CONE-BEAM RECONSTRUCTION
Cone-beam reconstruction algorithms reconstruct the 3D CT image from a set of 2D
projections. These algorithms are categorized into two classes: analytic and algebraic. Algebraic
algorithms demand more computing resources than analytic algorithms. In practice, the FDK
algorithm, an analytic algorithm,1 is widely implemented because of its computational
simplicity and compatibility with 2D fan-beam reconstruction algorithms.
The FDK algorithm requires a long execution time on modern computers. One reason is the
large size of the datasets that need to be processed, including both the reconstructed image and the
input projection data. Another reason is the large number of backprojection iterations involved. In
this section, we provide a brief description of the FDK algorithm, analyze its computational
complexity and describe how to speed the time-consuming backprojection procedure with our
computing platform.
4.1. The algorithm
Unlike the 2D fan-beam tomography approach, which reconstructs one image slice at a time, the
cone-beam method uses a 2D array of detectors to collect projection data and reconstructs the entire volume at once. The geometry of the system configuration for cone-beam CT imaging is shown in
Figure 7.
In the system, an X-ray cone beam is emitted by the X-ray source. After its transmission through
the imaging specimen, the attenuated intensities of the X-rays are detected by the 2D detector array.
There are two coordinate systems, (x, y, z) and (u, v), shown in Figure 7. The origin of the (u, v) coordinate system is at the center of the detector array, and the origin O of the (x, y, z) coordinate system is at the center of the object. The (x, y, z) system is stationary with respect to the object, whereas the (u, v) system moves with the detector. The X-ray source rotates about the z axis, scanning at equal angular intervals. The detector rotates opposite the source in the same plane. The X-ray source and the two origins always lie along a straight line. The acquired projection data form a 3D dataset. We denote the
attenuated X-ray intensity at the position (u, v) when the source is at the projection angle β using
the notation I(u, v, β).
The FDK algorithm has two major computational steps. The first step is called weighted
filtering, in which the projection data I(u, v, β) are weighted with a factor of R/√(R² + u² + v²),
where R is the distance from the source to the center of the object. Next, the projection data are
filtered along the u direction with a 1D FIR filter. Many choices exist for this filter;16,17,18 we used the Ram-Lak filter, which is frequently used in filtered backprojection algorithms.17 The Ram-Lak
filter is given by:
V(sΩ) = (Ω²/8π²) · (sinc(sΩ) − (1/2)·sinc²(sΩ/2))    (7)
where Ω is the bandwidth of the image and s is the variable. This filtering is a one-dimensional
(1D) convolution and can be calculated with Fast Fourier Transform (FFT). Its complexity for the
entire volume is O(N³·log(N)), where N is the size of the reconstructed image along one
dimension.
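For reference, the weighted-filtering step is easy to express in software. The NumPy sketch below (our own illustration) applies the R/√(R² + u² + v²) weight to one projection and then filters each row along u with a ramp-type filter constructed directly in the frequency domain; the normalization and sampling details are assumptions and differ from the paper's fixed-point hardware version.

```python
import numpy as np

def weight_and_filter(proj, R, du=1.0, dv=1.0):
    """FDK step 1 for one projection: weight by R/sqrt(R^2+u^2+v^2), then
    convolve each row (along u) with a band-limited ramp filter via the FFT."""
    nv, nu = proj.shape
    u = (np.arange(nu) - nu / 2.0) * du
    v = (np.arange(nv) - nv / 2.0) * dv
    uu, vv = np.meshgrid(u, v)
    weighted = proj * R / np.sqrt(R**2 + uu**2 + vv**2)

    # Ramp (Ram-Lak-type) filter defined directly in the frequency domain.
    freqs = np.fft.fftfreq(nu, d=du)
    ramp = np.abs(freqs)
    return np.real(np.fft.ifft(np.fft.fft(weighted, axis=1) * ramp, axis=1))

proj = np.random.rand(256, 256).astype(np.float32)   # one simulated projection
filtered = weight_and_filter(proj, R=500.0)
print(filtered.shape)
```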
The second step in the algorithm is backprojection. For each individual image voxel with
coordinates (x, y, z), the intensity value M(x, y, z) is calculated by the accumulation of all the
filtered projections I(u, v, β) over each projection angle β times a weighting factor W(x, y, β). The
formula is:
M(x, y, z) = Σβ I(u, v, β) · W(x, y, β)    (8)
For a given projection angle β, the corresponding u, v and W(x, y, β) for M(x, y, z) are calculated as:
u = R/(R − x·cos(β) − y·sin(β)) · (y·cos(β) − x·sin(β))    (9)

v = R/(R − x·cos(β) − y·sin(β)) · z    (10)

W(x, y, β) = R²/(R − x·cos(β) − y·sin(β))²    (11)
Since the calculated u and v are usually not integers, bilinear interpolation is needed to compute
the desired I(u, v, β) with its four neighbors. The complexity of the backprojection step for the
entire volume is O(N⁴), where N is the size of the reconstructed image.
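A direct (voxel-driven) software version of this backprojection step, following equations (8)-(11) with bilinear interpolation on the detector, might look like the NumPy sketch below. It is a readable model of the computation that the MPE pipelines perform, not the hardware design itself; the detector-centering and clipping conventions are assumptions made for the example.

```python
import numpy as np

def backproject(filtered, betas, R, grid):
    """Voxel-driven FDK backprojection (equations (8)-(11)), vectorized per angle.
    `filtered[a]` is the filtered projection I(u, v, beta_a) indexed as [v, u]
    with the detector origin at its center; `grid` is (x, y, z) voxel coordinates."""
    x, y, z = grid
    vol = np.zeros_like(x, dtype=np.float64)
    nv, nu = filtered.shape[1:]
    for a, beta in enumerate(betas):
        denom = R - x * np.cos(beta) - y * np.sin(beta)           # shared by (9)-(11)
        u = R / denom * (y * np.cos(beta) - x * np.sin(beta)) + nu / 2.0
        v = R / denom * z + nv / 2.0
        w = (R / denom) ** 2                                      # weight W(x, y, beta)
        u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
        du, dv = u - u0, v - v0
        u0, v0 = np.clip(u0, 0, nu - 2), np.clip(v0, 0, nv - 2)   # stay on the detector
        I = filtered[a]
        sample = ((1 - dv) * ((1 - du) * I[v0, u0] + du * I[v0, u0 + 1]) +
                  dv * ((1 - du) * I[v0 + 1, u0] + du * I[v0 + 1, u0 + 1]))
        vol += w * sample                                         # accumulate (8)
    return vol

# Tiny smoke test: 8 angles, 32x32 detector, 16^3 volume.
betas = np.linspace(0, 2 * np.pi, 8, endpoint=False)
filtered = np.random.rand(8, 32, 32)
ax = np.arange(16) - 8
grid = np.meshgrid(ax, ax, ax, indexing="ij")
print(backproject(filtered, betas, R=100.0, grid=grid).shape)
```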
Comparing the complexities of these two steps tells us that the backprojection takes
approximately O(N/log(N)) times longer than the weighted filtering step. In practice, the
weighted filtering step can be finished on the order of 10 seconds on a single modern computer. In
contrast, the execution time for the backprojection step is on the order of half an hour. Therefore,
our objective was to significantly reduce the backprojection time with the computing platform
described above.
4.2. Data Partitioning
To use the brick caching scheme, we partition the reconstructed 3D image (the output space) into
small cubes (the output bricks). According to our numerical simulation, the optimal brick size is 16×16×16 voxels within the limits of the available FPGA resources. Next, the output bricks are
mapped back to the input space of the filtered projection data, to find the input bricks. The filtered
projection data comprise a set of 2D images. Thus, when mapped back to the input space, an output
brick is projected into a series of hexagons (Figure 7) on different projection planes with different
projection angles β. The input bricks are determined as the bounding rectangles of these hexagons. In the cone-beam reconstruction algorithm, therefore, the input bricks are rectangles, not cuboids.
Assuming the vertices of an output brick are (xi, yi, zi), i = 1…8, and the projection angle is β, the
corresponding projected coordinates (ui, vi) are:
ui = (yi·cos(β) − xi·sin(β)) · R / (R − xi·cos(β) − yi·sin(β))    (12)

vi = zi · R / (R − xi·cos(β) − yi·sin(β))    (13)
where R is the distance from the source to the center of the object. The upper-left corner (umin, vmin)
and the bottom-right corner (umax, vmax) of the resulting 2D input brick are:
umin = ⌊min(ui)⌋,  vmin = ⌊min(vi)⌋    (14)

umax = ⌊max(ui)⌋ + 1,  vmax = ⌊max(vi)⌋ + 1    (15)
The computation for mapping the output brick to the input brick is executed by the PPC, and the
calculated parameters are sent to the IBF for input brick fetching.
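The per-angle mapping that the PPC carries out can be summarized by the Python sketch below, which implements equations (12)-(15) for the eight vertices of one output brick; the brick position, size and source distance used in the example are placeholders, not values from the paper.

```python
import math

def input_brick_2d(vertices, beta, R):
    """Project the 8 output-brick vertices onto the detector at angle beta
    (equations (12)-(13)) and return the bounding rectangle (equations (14)-(15))."""
    us, vs = [], []
    for x, y, z in vertices:
        denom = R - x * math.cos(beta) - y * math.sin(beta)
        us.append((y * math.cos(beta) - x * math.sin(beta)) * R / denom)
        vs.append(z * R / denom)
    u_min, v_min = math.floor(min(us)), math.floor(min(vs))
    u_max, v_max = math.floor(max(us)) + 1, math.floor(max(vs)) + 1
    return (u_min, v_min), (u_max, v_max)

# Vertices of a 16x16x16 output brick whose corner sits at (32, 48, -8).
corner, size = (32, 48, -8), 16
verts = [(corner[0] + dx, corner[1] + dy, corner[2] + dz)
         for dx in (0, size) for dy in (0, size) for dz in (0, size)]
print(input_brick_2d(verts, beta=math.radians(30), R=500.0))
```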
4.3. Mapping Algorithm to Architecture
Mapping the algorithm to the architecture is straightforward. Specifically, the PPC executes the
calculation described in section 4.2 for locating the input brick, the IBF fetches the input bricks into
the caches, the MPE performs the backprojection computation, and the OBS saves the reconstructed
image data into MEM-3. The configurations of the subsystems are as follows.
The PPC (Figure 5A) runs at 300 MHz and its program is stored in the I-MEM. The program
running on the PPC is optimized and only has about 600 instructions. A look-up table is stored in
the D-MEM for providing the values of cos(β) and sin(β) in equations (12) and (13). The processed
results are passed to the register file, which is configured to have ten 32-bit registers. These results
provide information about the position and size of the input and output bricks as well as the
projection angle β and some control signals to the CPC.
The memory controller, the MEM-2 controller, in the IBF (Figure 5B) is a double data rate
(DDR) SDRAM controller running at 133 MHz and has a data bus of 64-bit width connecting to the
external memory MEM-2. It also has another 64-bit data bus for broadcasting the input data into the
caches. There are 32 small dual-port caches in the IBF. Each cache is configured with four 2-KB
dual-port SRAM blocks and partitioned into two sections with a capacity of 4 KB per section.
Every four caches form a group that provides four input samples simultaneously to one pipeline for bilinear
interpolation.
The MPE (Figure 5C) is the algorithm-dependent unit designed according to the backprojection
data-flow of FDK cone-beam reconstruction algorithm. For parallel computation, it has eight
independent pipelines running at 150 MHz. All the pipelines have the same architecture and contain
64 pipeline stages. Every stage is registered to achieve the highest running frequency. The data path
in the pipelines is 16-bit.
The OBS (Figure 5D) has two output brick caches, each of which is configured as a dual-port 8
KB cache for holding the reconstructed image data. One port of the caches connects to a 128-bit
data bus in order to accept the combined eight 16-bit results from the MPE and the other port
connects to the MEM-3 controller via a 128-bit data bus. The MEM-3 controller is a DDR SDRAM
controller connected to the external SDRAM MEM-3 through the 64-bit 133 MHz data bus.
The CPC controls the execution of the whole system. It communicates with the PPC through the
register file and it controls the operations of the other subsystems with FIFOs and mailbox registers.
The CPC runs at 150 MHz frequency.
5. RESULTS AND DISCUSSION
5.1. Performance
The performance of our computing platform in accelerating the backprojection procedure of
the FDK algorithm was simulated with a SystemC model of the platform that we constructed. The input data were
the filtered projection data of the Shepp-Logan phantom. The projections were generated from 300
projection angles and the size of the detector array was assumed to be 512×512. The projection data were
represented using 16-bit fixed-point format and had a total size of 150 MB. With the same input
data, we tested the time required for reconstructing three 3D images with sizes of 256³, 512³ and 1024³ voxels, respectively. The results are shown in Table 1. According to the simulation
results, the backprojection time is linearly proportional to the image size, implying that the system
is always running at its highest speed. The speed is limited by one of the four stages in the brick
operation cycle described in section 3.2.
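These times are consistent with a simple throughput estimate: assuming each of the eight pipelines completes one voxel update per 150 MHz clock cycle, the MPE sustains 1.2 × 10⁹ voxel updates per second, and backprojecting a 512³ volume over 300 projection angles requires 300 × 512³ ≈ 4.03 × 10¹⁰ updates, or roughly 33.5 s; the 256³ and 1024³ cases scale proportionally.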
Figure 8 shows the time required by the four stages to finish one brick operation. It is clear that
the multi-pipeline processing stage takes the longest time. Therefore, the MPE unit is the
computational bottleneck under the current configuration. To improve the system’s performance,
the number of the pipelines in MPE can be increased, which is limited only by the available FPGA
resources. When there are not sufficient FPGA resources in one chip, multiple FPGA chips can be
used to further parallelize the computation.
5.2. Accuracy
In the software implementation of the imaging algorithms, the data and the computation are
usually in the floating-point format, while in our proposed computing platform, the data and the
computation are all in 16-bit fixed-point format except for the accumulation operations and the
accumulated data, which use 32-bit fixed-point format. Fixed-point format is helpful to reduce the
memory storage requirement and the execution time, but it has limited data precision. We compared
the image reconstructed with floating-point software implementation (Figure 9B) and our fixed-
point hardware implementation (Figure 9C). The results indicate that the difference between the
two images is small and acceptable in the region of interest (ROI). When compared to the original
phantom (Figure 9A), the floating-point image has a noise level of 3.4% and the contrast of the
three small ellipsoids has a loss of 13.6%, while the fixed-point image has a noise level of 5.6%
and a contrast loss of 18.7% in the three ellipsoids. The noise level is measured by comparing the
voxel intensities of the reconstructed images with that of the phantom. If higher precision is
required, the bit width of the fixed-point data can be extended, though this will demand more
computing resources.
5.3. Discussion
The goal of our work was to design a general computing platform to accelerate a broad range of
local operations-based 3D medical imaging algorithms. Our strategy was to design a new caching
scheme as a solution to the memory access bottleneck in 3D imaging algorithms and to exploit the
intrinsic parallelisms in the algorithms for the hardware architecture design. Our simulations of the FDK cone-beam reconstruction algorithm and of other 3D imaging algorithms (in preliminary work we have accelerated mutual information-based 3D image registration) show that our platform can achieve high computing performance when applied to 3D imaging algorithms.
Our FPGA-based computing platform can be reconfigured to adapt to different algorithms;
however, our effort to limit the reconfigurable part of the system to only one execution unit
alleviates the burden of the design procedure. Instead of redesigning the whole system, users can
focus on the MPE pipeline architecture design according to the dataflow of the algorithms. Our
objective is to make the design procedure more automatic and further extend the application range
of our computing platform in the real-time 3D imaging field.
ACKNOWLEDGEMENTS
We would like to thank Carlos R. Castro-Pareja (University of Maryland) and Vivek Walimbe
(The Ohio State University) for their valuable comments and proofreading.
REFERENCES
1. L.A. Feldkamp, L.C. Davis and J.W. Kress, Practical Cone-beam Algorithm, Journal of Optical Society of America, A6: 612 (1984).
2. W. M. Wells, P. Viola, H. Atsumi, S. Nakajima and R. Kikinis, Multi-modal volume registration by maximization of mutual information, Medical Image Analysis, 1: 35 (1996).
3. X. Xu and J. L. Prince, Generalized gradient vector flow external forces for active contours, Signal Processing, 71: 131 (1998).
4. T. Rohlfing and C. R. Maurer, Non-rigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees, IEEE Transactions on Information Technology in Biomedicine, 7: 16 (2003).
5. D.A. Reimann, V. Chaudhary, M.J. Flynn and I.K. Sethi, Cone beam tomography using MPI on heterogeneous workstation clusters, Proceedings of MPI Developer's Conference, 2: 142 (1996).
6. Virtex-II Pro complete data sheet, Xilinx Inc., 2002.
7. I. Goddard and M. Trepanier, High-speed cone-beam reconstruction: an embedded systems approach, Proceedings of SPIE, 4681: 483 (2002).
8. G. Hampson and A. Paplinski, Hardware implementation of an ultrasonic beamformer, Proceedings of IEEE, 1: 227 (1997).
9. G. Knittel, A PCI-compatible FPGA-coprocessor for 2D/3D image processing, IEEE Symposium on FPGAs for Custom Computing Machines, 136 (1996).
10. C. R. Castro-Pareja, J.M. Jagadeesh and R. Shekhar, FAIR: A Hardware Architecture for Real-Time 3-D Image Registration, IEEE Transactions on Information Technology in Biomedicine, 7: 426 (2003).
11. M. de Boer, A. Gropl, J. Hesser and R. Manner, Latency and Hazard-free Volume Memory Architecture for Direct Volume Rendering, Eurographic Workshop on Graphics Hardware, 109 (1996).
12. M. Doggett and M. Meißner, A Memory Addressing and Access Design for Real Time Volume Rendering, IEEE International Symposium on Circuits and Systems, 344 (1999).
13. D. Cronquist, P. Franklin, C. Fisher, M. Figueroa and C. Ebeling, Architecture Design of Reconfigurable Pipelined Datapaths, Twentieth Anniversary Conference on Advanced Research in VLSI (1999).
14. B. Draper, W. Bohm, J. Hammes, W. Najjar, R. Beveridge, C. Ross, M. Chawathe, M. Desai and J. Bins, Compiling SA-C Programs to FPGAs: Performance Results, International Conference on Vision Systems, 220 (2001).
15. J. Frigo, M. Gokhale and D. Lavenier, Evaluation of the Streams-C C-to-FPGA compiler: an application perspective, Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, 134 (2001).
16. L.T. Chang and G.T. Herman, A scientific study of filter selection for a fan-beam convolution algorithm, SIAM J. Appl. Math., 39: 83 (1980).
17. G.N. Ramachandran and A.V. Lakshminarayanan, Three-dimensional reconstruction from radiographs and electron micrographs: Application of convolutions instead of Fourier transforms, Proc. Nat. Acad. Sci. USA, 68: 2236 (1971).
18. K.T. Smith and F. Keiner, Mathematical foundations of computed tomography, Appl. Optics, 24: 3950 (1985).
FIGURE AND TABLE CAPTIONS:
Figure 1: Mapping of an output brick to an input brick.
Figure 2: Data duplication diagram.
Figure 3: Partition of one cache bank into two sections.
Figure 4: Block diagram of the system architecture: (a) PPC: PowerPC based Controller; (b) IBF: Input Brick Fetching unit; (c) MPE: Multiple Pipeline Execution unit; (d) OBS: Output Brick Storing unit; (e) CPC: Central Pipeline Controller.
Figure 5: Detailed system architecture: (A) PowerPC based Controller (PPC); (B) Input Brick Fetching unit (IBF); (C) Multiple Pipeline Execution unit (MPE); (D) Output Brick Storing unit (OBS); (E) Central Pipeline Controller (CPC).
Figure 6: The four operation stages in the brick operation level.
Figure 7: Geometry of the cone-beam CT system and schematic of the brick caching scheme for cone-beam reconstruction.
Figure 8: Time needed for the four stages at the first parallelism level to finish one brick operation.
Figure 9: Comparison of the Shepp-Logan phantom with the images reconstructed from the floating-point implementation and the fixed-point implementation. The projection data are from 300 projections with a detector resolution of 256×256 and the reconstruction grid is 256³. All images are windowed to 0.95–1.05. (A) The original phantom and its intensity profile across the three small ellipsoids; (B) the image reconstructed from the floating-point implementation; (C) the image reconstructed from the 16-bit fixed-point implementation.
Table 1: Backprojection times for different image sizes
Image size (voxels)              256³     512³     1024³
Backprojection time (seconds)    4.2      33.5     267.7

Table 1: Backprojection times for different image sizes
[Figure 1: Mapping of an output brick to an input brick. Panels (a) and (b) show the input space, with the data subset and the enclosing input brick of size Sx × Sy × Sz (corner vertices A and B); panel (c) shows the output space with the output brick of size m × n × k.]
[Figure 2: Data duplication diagram. The system memory broadcasts the same input brick to caches 1…n; a mask register controls which caches accept the data.]
[Figure 3: Partition of one cache bank into two sections, S1 and S2, each holding one input brick (brick n and brick n+1), with an input port from the system memory and an output port to the pipelines.]
[Figure 4: Block diagram of the system architecture. The FPGA contains the PPC, IBF, MPE, OBS and CPC subsystems, connected by data and control paths to the external memories MEM-1, MEM-2 and MEM-3.]
[Figure 5: Detailed system architecture. (A) PPC: PowerPC processor with I-MEM, D-MEM, REG file, PLB and DCR buses, and the MEM-1 controller; (B) IBF: MEM-2 controller, mask register and input caches (sections S1/S2); (C) MPE: pipelines 1…m built from function units FU1…FUn; (D) OBS: DMUX/MUX, output caches I and II, and the MEM-3 controller; (E) CPC. Data paths and control signals connect the subsystems.]
[Figure 6: The four operation stages in the brick operation level: parameter generation, input brick fetching, multi-pipeline processing and output brick storing.]
[Figure 7: Geometry of the cone-beam CT system and schematic of the brick caching scheme for cone-beam reconstruction: the X-ray source at distance R, the (x, y, z) and (u, v) coordinate systems, the projection plane at angle β, an output brick in the output image and its corresponding input brick on the projection plane.]
[Figure 8: Time needed (in ns) for each of the four first-level stages (parameter generation, input brick fetching, multi-pipeline processing, output brick storing) to finish one brick operation.]
[Figure 9: Comparison of the Shepp-Logan phantom (A, the original phantom) with the images reconstructed from the floating-point implementation (B) and the 16-bit fixed-point implementation (C), with intensity profiles across the three small ellipsoids; images are windowed to 0.95–1.05.]