
Contents

List of Figures iii

List of Tables iv

Abbreviations v

1 Introduction . . . 1
    1.1 Background . . . 1
    1.2 Objective . . . 1
    1.3 Chapter synopsis . . . 2

2 FAIR . . . 3
    2.1 Introduction . . . 3
    2.2 Image registration: a general software framework . . . 3
    2.3 FAIR Theory . . . 4

        2.3.1 Images and Transformations . . . 4
        2.3.2 Distances and Regularization . . . 5

    2.4 FAIR Numerics . . . 5
        2.4.1 Discretize then Optimize . . . 5
        2.4.2 A Family of Nested Approximations . . . 6
        2.4.3 Numerical Optimization . . . 6

    2.5 FAIR MATLAB . . . 6
        2.5.1 Notation and Conventions . . . 6
        2.5.2 Coordinate System . . . 6
        2.5.3 FAIR Administration . . . 7
        2.5.4 Memory versus Clarity . . . 7
        2.5.5 Cells, Grids and Numbering . . . 8

2.6 Fixed level PIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 MATLAB on CUDA . . . 11
    3.1 Introduction . . . 11
    3.2 MATLAB on CUDA . . . 11

        3.2.1 MATLAB MEX environment . . . 11
        3.2.2 The MEX file . . . 12
        3.2.3 The MATLAB Array . . . 13
        3.2.4 Customised build for CUDA MEX-Files . . . 14
        3.2.5 MEX APIs . . . 15

    3.3 CUDA MEX Memory Management . . . 15
    3.4 CUDA MEX Retention of variables on the GPU . . . 18
    3.5 CUDA MEX programming tools . . . 19

        3.5.1 CUDA MEX Debugging . . . 19
        3.5.2 CUDA MEX Profiling . . . 20


4 CUDA enabled FAIR . . . 23
    4.1 Introduction . . . 23
    4.2 Image interpolation in CUDA enabled FAIR . . . 24

        4.2.1 Next Neighbor Interpolation . . . 24
        4.2.2 Linear Interpolation . . . 24
        4.2.3 Spline Interpolation . . . 25
        4.2.4 Derivatives of Interpolation Schemes . . . 28
        4.2.5 The Interpolation Toolbox . . . 29
        4.2.6 CUDA MEX Interpolation . . . 30
        4.2.7 CUDA MEX interpolation results . . . 33

    4.3 Parameterized transformation in CUDA enabled FAIR . . . 38
        4.3.1 Affine Linear Transformations . . . 38
        4.3.2 Derivatives . . . 39
        4.3.3 Summarizing the Parameterized Transformations . . . 40
        4.3.4 CUDA MEX parameterized transformation . . . 40

    4.4 Similarity Measure in CUDA enabled FAIR . . . 42
    4.5 Parametric Image Registration in CUDA enabled FAIR . . . 43
    4.6 Experiment and results: Fixed Level Parametric Image Registration on CUDA enabled FAIR . . . 45

5 Recommendations and Conclusion . . . 49
    5.1 Introduction . . . 49
    5.2 Goals achieved . . . 49
    5.3 Scope for further work . . . 50

        5.3.1 FAIR improvements for GPU . . . 50
        5.3.2 Usage of CUDA driver API . . . 50

A FAIR/CUDA files 53

Bibliography 58


List of Figures

2.1 General software framework of image registration . . . 4
2.2 Discretization of a 1D domain Ω = (ω1, ω2) ⊂ R . . . 8
2.3 Parametric image registration on HNSP data . . . 9
2.4 Parametric image registration example in FAIR . . . 9

3.1 C MEX cycle . . . 13
3.2 Example of nvopts file . . . 14
3.3 MEX persistent memory . . . 16
3.4 MEX hybrid array . . . 16
3.5 Example pseudocode for memory leak while calling MEX routines . . . 17
3.6 CUDA MEX persistent memory . . . 19
3.7 CUDA Visual Profiler . . . 21

4.1 FAIR CUDA registration . . . 23
4.2 2D linear interpolation in FAIR; images generated from FAIR . . . 25
4.3 "Mother" spline b = b0 . . . 26
4.4 2D spline interpolation in FAIR; images generated from FAIR . . . 28
4.5 Linear and spline interpolation . . . 30
4.6 linearInter2D (top) and splineInter2D (bottom) MATLAB kernels; images generated from FAIR software . . . 31
4.7 Mother spline and derivative; image courtesy GPU Gems 2 . . . 33
4.8 M-file for runtime testing . . . 33
4.9 Runtime test for CUDA MEX interpolation . . . 34
4.10 Bandwidth test for CUDA MEX interpolation . . . 35
4.11 Test for accuracy of CUDA MEX interpolation . . . 37
4.12 Translation, rigid, and affine linear transformations . . . 39
4.13 Usage of persistent memory in rigid2D . . . 41
4.14 PIR for SSD and rigid transformations, m = [128, 64] . . . 46
4.15 Iteration history for both CUDA MEX and MATLAB at coarse level . . . 46
4.16 The stopping criterion for coarse level in both methods . . . 46
4.17 PIR for SSD and rigid transformations, m = [512, 256] . . . 47
4.18 Iteration history for both CUDA MEX and MATLAB at fine level . . . 47
4.19 The stopping criterion for fine level in both methods . . . 47

5.1 Flow chart explaining use of driver API for CUDA MEX . . . . . . . . . . . . . . 51

A.1 Makefile . . . 53
A.2 nvopts.sh . . . 54
A.3 Makefile.dbg . . . 55
A.4 mexoptsdbg.sh . . . 56
A.5 nvoptsdbg.sh . . . 57


List of Tables

2.1 Profiler results for Parametric Image Registration in FAIR . . . . . . . . . . . . . 9

4.1 CUDA MEX interpolation kernel runtime . . . 34
4.2 CUDA MEX interpolation kernel bandwidth . . . 34
4.3 The transformation toolbox . . . 40
4.4 Averaged runtimes of rigid2D on GPU . . . 42
4.5 The FAIR distance toolbox . . . 42
4.6 PIR objective function . . . 44
4.7 Averaged runtimes of parametric image registration in FAIR on MATLAB and CUDA . . . 47

5.1 CUDA Driver API objects, NVIDIA programming guide . . . . . . . . . . . . . . 51


Abbreviations

2D Two Dimensions

API Application Program Interface

CUDA Compute Unified Device Architecture

CUDA MEX CUDA enabled MEX file

FAIR Flexible Algorithms for Image Registration

GPU Graphics Processing Unit

HNSP Human Neuro Scanning Project

OPTN Persistent Option

PIR Parametric Image Registration

PDE Partial Differential Equations

SSD Sum of Squared Differences


Chapter 1

Introduction

1.1 Background

Image registration is one of the challenging problems in image processing. Given two images taken, for example, at different times, from different devices, or from different perspectives, the goal is to determine a reasonable transformation such that a transformed version of one image is similar to the second one. A large number of applications demand registration, ranging from art, astronomy, astrophysics, biology, chemistry, criminology, genetics, and physics to basically any area involving imaging techniques. Each of these application areas has developed its own specialized methods and implementations for registration. Hence, it would be useful in practice to have access to a suite of different registration methods, allowing a user to choose the right tool for a particular problem or to compare and contrast techniques [1] [2].

1.2 Objective

The MATLAB based toolbox Flexible Algorithms for Image Registration (FAIR) is one such unified approach: it collects state-of-the-art implementations of different building blocks that can be combined to fit the specific demands of particular applications. One typical application field of interest is the real-time registration of images during surgery. As the name suggests, this application places a stringent requirement on the computation time of any algorithm designed for this purpose.

One immediate and interesting solution for implementing such methods is the use of Graphics Processing Unit (GPU) computing. The combination of high arithmetic throughput and memory bandwidth with the programmability provided by the Compute Unified Device Architecture (CUDA) makes the GPU very suitable for general-purpose supercomputing. It is also intuitive to use the GPU for medical imaging computations, since both applications exhibit similar data-level parallelism.

A possible combination of FAIR as a research-oriented tool with the computing horsepower of the GPU could lead to an attractive method for fast prototyping of registration methods for real-time medical imaging applications. Moreover, such a development cycle could drastically reduce the time involved in porting research methods to industry applications.

Hence, the objective of this thesis work is:

"To implement the integration of a GPU accelerated image registration cycle on the CUDA platform into the FAIR toolbox."


The framework should not violate the FAIR registration paradigm, so that the user can run existing FAIR scripts on the CUDA enabled FAIR as well, thereby combining the flexibility of FAIR with the computing power of the GPU. The FAIR on CUDA implementation should provide an interface similar to the mex function provided by MATLAB.

1.3 Chapter synopsis

In this section, the structure and overview of the following chapters are briefly outlined.

Chapter 2 - FAIR
This chapter introduces a generic software cycle for image registration, which is then defined within the FAIR toolbox. Conventions and numerics specific to FAIR are presented. An example of FAIR registration is profiled at the end of the chapter to draw insights into the computational aspects of a typical registration cycle [2].

Chapter 3 - MATLAB and CUDA
This chapter discusses the CUDA MEX environment, which enables one to call CUDA subroutines directly from MATLAB. It also provides detailed information on CUDA MEX programming in general [3].

Chapter 4 - CUDA enabled FAIR
This chapter discusses the actual CUDA MEX implementation of the various functional modules of FAIR's registration cycle. The problem setting is defined using mathematical definitions from FAIR. A detailed performance analysis of the different implementations is made, followed by a runtime analysis of the GPU accelerated registration cycle within FAIR [4].

Chapter 5 - Recommendations and Conclusion
In conclusion, the various milestones met during the course of this work are listed, along with recommended developments to be taken up in the future [5].


Chapter 2

FAIR

2.1 Introduction

This chapter introduces image registration from a software framework point of view. This view is formalized with sound mathematical concepts from FAIR theory, thereby formulating registration as an optimization process. In particular, the chapter introduces the paradigm followed by the FAIR toolbox to approach the image registration problem. Various numerical aspects of FAIR are also discussed. The chapter ends with the analysis of a fixed level example for parametric registration.

2.2 Image registration: a general software framework

A simple definition of registration is the process of finding the spatial transform that maps points from one image to the corresponding points in another image [2]. There are various types of image registration algorithms based on the modality of the source images, the choice of features and the corresponding mapping, etc. Each of these types is customized for a specific medical purpose. Given the complexity of the high medical and surgical demands, most research work on image registration emphasizes the differences between these individual methods.

Nevertheless, the different registration methods share a lot of common functionality. Like any good software framework, FAIR identifies these common functionalities of image registration and manages them as separate functional modules. Each of these modules has many variants, and a combination of judiciously selected options for each module results in an effective and meaningful image registration technique. This concept of a general software framework for image registration [2] is depicted in Figure 2.1. The image registration technique can be segregated into:

Metric or Similarity Measure The metric component measures how well a transformed template image matches the reference image.

Optimizer This component searches for the optimum transform parameters that minimize the similarity metric.

Transform This component is a spatial transformation used to map points from the space of one image to the space of the second image.

Interpolator The interpolator obtains the image intensity at a transformed position that is not necessarily on a grid point.
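To make the interplay of these four modules concrete, the following Python sketch wires them into a toy parametric registration loop. It is purely illustrative and not FAIR code: SciPy's map_coordinates stands in for the interpolator, a 2D affine model for the transform, SSD for the metric, and Nelder-Mead for the optimizer; all function names here are hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize

def register(template, reference):
    """Toy parametric registration loop (illustrative, not FAIR code):
    affine Transform, SSD Metric, linear Interpolator, generic Optimizer."""
    m1, m2 = reference.shape
    # cell-centered grid of the reference image
    x1, x2 = np.meshgrid(np.arange(m1) + 0.5, np.arange(m2) + 0.5, indexing="ij")

    def transform(w):                       # Transform: 2D affine, w in R^6
        A = w[:4].reshape(2, 2)
        b = w[4:6]
        return A[0, 0]*x1 + A[0, 1]*x2 + b[0], A[1, 0]*x1 + A[1, 1]*x2 + b[1]

    def metric(w):                          # Metric: SSD of T(y(x)) vs R(x)
        y1, y2 = transform(w)
        # Interpolator: evaluate T at the (generally off-grid) points y(x)
        Ty = map_coordinates(template, [y1 - 0.5, y2 - 0.5], order=1)
        return 0.5 * np.sum((Ty - reference) ** 2)

    w0 = np.array([1., 0., 0., 1., 0., 0.])            # identity transform
    res = minimize(metric, w0, method="Nelder-Mead")   # Optimizer
    return res.x
```

A real registration cycle (and FAIR in particular) would additionally use derivatives, multilevel grids, and regularization; this skeleton only shows how the four components hand data to each other.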


Figure 2.1: General software framework of image registration

2.3 FAIR Theory

FAIR formally defines image registration as the process to,

Find a reasonable transformation such that a transformed version of a template image is similar to a reference image.

This formulation suggests an optimization framework for the software cycle mentioned above.

J[y] = D[T[y], R] + α S[y − y_ref]  →  min over y    (2.1)

where T and R denote the template and reference images, T[y] is the transformed template, D measures image similarity, and S measures the reasonability of the transform.

FAIR considers all variables as functions. The reasons for this continuous model are:

• It is appropriate to model the object itself rather than the discrete image that measures it.

• When an image is transformed it is unlikely to remain aligned with the original pixel grid. In such a case an interpolant has to be used to return to the continuous setting. Hence it is prudent to have a continuous setting throughout rather than a mixed discrete-continuous setting.

• The numerical optimization schemes to be discussed exploit sequences of nested discretizations of the very same functional.

2.3.1 Images and Transformations

Images are considered as continuous mappings from a domain into the real numbers. The domain is denoted by Ω ⊂ R^d, where d denotes the spatial dimensionality of the given data,

T : Ω → R,   Ω ⊂ R^d,   d = spatial dimension

Since images are modeled continuously, image transformation can easily be phrased. The transformed image is denoted by T[y], where y : R^d → R^d denotes the transformation and

T [y](x) = T (y(x))


This Eulerian approach transforms the domain. The corresponding transformation of the images is counterintuitive: when the grid is rotated counterclockwise, the transformed image looks like a clockwise rotated copy of the template. The so-called Lagrangian framework, which follows the points, is not used in FAIR.

Though all techniques in FAIR also apply to non-parametric registration, in this thesis only the parametric setting is used. Parameterized transformations can be formulated as

y = Qw

where Q is a collection of basis functions and w is a collection of parameters or coefficients. Important classes like rigid and affine linear transformations are covered here.

2.3.2 Distances and Regularization

Having discussed the forward problem, i.e., given a transformation, how to compute the transformed image, the next step is to provide a quantitative measure for the quality of the transformation. This measure has two ingredients:

• The first ingredient is related to image similarity and,

• the second ingredient measures plausibility or regularity of the transformation.

Here, the sum of squared differences (measuring the energy of the difference image) is used as an intuitive example, where

D[T[y], R] = 0.5 ∫_Ω (T[y](x) − R(x))² dx
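On a cell-centered grid, the midpoint rule discussed later in Section 2.4 turns this integral into a sum over cells weighted by the cell volume. A minimal numpy sketch of the discretized measure (illustrative, not FAIR's MATLAB code):

```python
import numpy as np

def ssd(Ty, R, h):
    """Discretized SSD: D = 0.5 * prod(h) * sum((T[y](x_i) - R(x_i))^2),
    where Ty and R hold the transformed template and the reference sampled
    on the same cell-centered grid, and h is the vector of grid widths
    (so prod(h) is the cell volume)."""
    return 0.5 * np.prod(h) * np.sum((Ty - R) ** 2)

# identical images have zero distance
R = np.array([[0.0, 1.0], [2.0, 3.0]])
assert ssd(R, R, h=[0.5, 0.5]) == 0.0
```

Note the cell-volume factor: refining the grid does not inflate the distance value, which keeps the measure consistent across discretization levels.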

Since registration is an ill-posed problem, regularization becomes inevitable:

S[y] = Elastic Potential[y − y_ref],   where y_ref(x) = x    (2.2)

2.4 FAIR Numerics

FAIR is based on a discretize-then-optimize approach, where a nested sequence of discretizations is solved using plain Newton-type techniques.

2.4.1 Discretize then Optimize

Most registration problems do not allow for an analytic solution, and thus numerical solutions must be provided. A feasible approach is to solve the discretized optimality conditions for the continuous problem. However, FAIR focusses on numerical optimization and therefore on optimality conditions for the discretized problem. To be more precise, a sequence J^h of discretizations of the continuous functional J, running from coarse to fine, is considered. The idea is to capture the important features on a coarse representation and to solve this problem with relatively low computational costs. For finer representations, only corrections based on the added information are required. A key point is that all the discrete problems are linked by the underlying continuous model, and thus the solutions y^h approximate the solution of the continuous problem. Another important point is that on each discretization level an optimization problem is to be solved. Thus, consistent line search techniques such as the Armijo line search and automatic stopping criteria are used.
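To illustrate the line-search ingredient mentioned here, a generic backtracking Armijo search can be sketched as follows. This is a textbook version in Python, not FAIR's MATLAB implementation; the function names and default constants are illustrative.

```python
import numpy as np

def armijo(J, w, dw, gradJ, beta=0.5, c=1e-4, max_iter=30):
    """Backtracking Armijo line search (illustrative sketch): shrink the
    step t until the sufficient-decrease condition
        J(w + t*dw) <= J(w) + c * t * gradJ(w)' * dw
    holds, or give up after max_iter halvings."""
    Jw = J(w)
    slope = gradJ(w) @ dw              # directional derivative along dw
    t = 1.0
    for _ in range(max_iter):
        if J(w + t * dw) <= Jw + c * t * slope:
            return t, True             # step accepted
        t *= beta
    return 0.0, False                  # line search failed

# quadratic example: J(w) = 0.5*||w||^2, steepest-descent direction -grad
J = lambda w: 0.5 * (w @ w)
gradJ = lambda w: w
w = np.array([2.0, -1.0])
t, ok = armijo(J, w, -gradJ(w), gradJ)
assert ok and t > 0
```

In a Gauss-Newton setting, dw would be the Gauss-Newton search direction and the accepted t the step actually taken at each iteration.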


2.4.2 A Family of Nested Approximations

Being able to use consistent approximations of the registration problem is a major feature of FAIR. Multi-scale and multilevel strategies for a particular problem are used for many good reasons. Using a coarse representation yields a smoother objective function, which prevents the optimization from running into local minima, enables fast optimization techniques, and results in fewer unknowns, which is good for memory, computing time, and results.

Discretization is also used for the approximation of integrals and derivatives. Using structured grids of cells with centers x_i and volume h, roughly speaking,

∫ f(x) dx = h Σ_i f(x_i) + O(h²)   and   ∂f(x_{i+0.5}) = (f(x_{i+1}) − f(x_i))/h + O(h²),

where the Landau symbol O indicates errors of order h².

A visualization of f can be achieved by assigning the value f(x_i) to a cell of volume h and displaying the resulting piecewise constant function.
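The O(h²) behaviour of these approximations is easy to verify numerically. The following Python sketch (illustrative only, not part of FAIR) builds a cell-centered grid, applies the midpoint rule and the staggered finite difference, and checks that halving h cuts the quadrature error by roughly a factor of four:

```python
import numpy as np

def midpoint_and_diff(f, a, b, m):
    """Midpoint-rule quadrature and forward differences at staggered points
    on a cell-centered grid of m cells over (a, b); both are O(h^2)."""
    h = (b - a) / m
    xi = a + (np.arange(m) + 0.5) * h           # cell centers
    integral = h * np.sum(f(xi))                # h * sum f(x_i)
    deriv = (f(xi[1:]) - f(xi[:-1])) / h        # approximates f'(x_{i+1/2})
    return integral, deriv

# halving h should reduce the quadrature error by about a factor of 4
e1 = abs(midpoint_and_diff(np.sin, 0.0, np.pi, 16)[0] - 2.0)
e2 = abs(midpoint_and_diff(np.sin, 0.0, np.pi, 32)[0] - 2.0)
assert 3.0 < e1 / e2 < 5.0
```

The same convergence-order check carries over to the finite differences and, by extension, to the discretized objective functionals J^h.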

2.4.3 Numerical Optimization

An additional feature of FAIR is the focus on differentiable modules. Derivatives are the most important ingredient of most efficient numerical optimization techniques. Emphasis has also been given to the proper line search and stopping criteria used by the optimization schemes. The default scheme is a Gauss-Newton type scheme with an Armijo line search.

2.5 FAIR MATLAB

FAIR programs are developed in MATLAB. This is certainly a limitation in terms of speed and memory, but MATLAB provides fast, easy, and intuitive access to numerical computing, and in particular to sparse matrices. It has been chosen because it is easy to use for education and research.

2.5.1 Notation and Conventions

Similar quantities are collected in arrays, and linear-algebra-conforming data structures are used as often as possible. For example, the size of the data is denoted by m = [m1, ..., md]. Vectorized data structures are used to enable direct and simple access to linear algebra and numerical optimization.

2.5.2 Coordinate System

Throughout FAIR a geometric right-handed (x1, x2, x3) coordinate system is used. All coordinates are absolute and physical, and thus changing the discretization does not affect the positioning of data or the scaling of the registration problem.

Unfortunately, MATLAB stores 2D arrays in an (i, j) coordinate system, and therefore imaging functions like image should not be used directly within FAIR.


2.5.3 FAIR Administration

In FAIR, the main building blocks are administered by specific functions with standardized inputs and rules. An administrative function caller for a generic task is parameterized on the basis of a list of persistent parameters OPTN. Each administrative function handles a certain task and is configured by a list of options. The commonly used options are listed below:

• caller('reset','caller',method,name1,value1,...): clears all options, sets the method to be used to method, and adds variables with name name1 and value value1, etc., to the persistent list of parameters

• caller('set','name1',value1,'name2',value2,...): adds (or overwrites) 'name1',value1,'name2',value2,... in the persistent parameters

• caller(’clear’): clears persistent parameters

• caller(’disp’): displays persistent parameters

• [method,optn]=caller: returns the specific methods and the persistent options

• value=caller(’get’,’name’): returns the value of name or [ ] if not defined

• [y1,y2,...]=caller(x1,x2,...): executes the function based on the input variables and the persistent options
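FAIR implements this with MATLAB persistent variables. Purely to illustrate the pattern, a rough Python analogue of such an administrative function, holding the method name and option list in a closure, might look like this (all names hypothetical; only a subset of the options above is sketched):

```python
def make_caller(task):
    """Rough Python analogue of a FAIR administrative function: a closure
    holding a persistent method name and option dict (illustrative only;
    FAIR realizes this with MATLAB persistent variables)."""
    state = {"method": None, "optn": {}}

    def caller(*args):
        if args and args[0] == "reset":          # caller('reset','caller',method,...)
            state["method"] = args[2]
            state["optn"] = dict(zip(args[3::2], args[4::2]))
        elif args and args[0] == "set":          # caller('set',name1,value1,...)
            state["optn"].update(zip(args[1::2], args[2::2]))
        elif args and args[0] == "get":          # caller('get',name) -> value or None
            return state["optn"].get(args[1])
        elif args and args[0] == "clear":        # caller('clear')
            state["method"], state["optn"] = None, {}
        elif not args:                           # [method,optn] = caller
            return state["method"], state["optn"]
    return caller

inter = make_caller("inter")
inter("reset", "inter", "splineInter2D", "regularizer", "moments")
inter("set", "theta", 1e-2)
assert inter("get", "theta") == 1e-2
```

The key design point this mimics is that configuration survives across calls, so a registration script can configure each module once and then call it repeatedly with only the data arguments.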

2.5.4 Memory versus Clarity

FAIR chooses clarity over memory when working with higher dimensional formulations. A higher dimensional formula can often be derived using an appropriate combination of one-dimensional formulae. An important but often not very efficient tool is the so-called Kronecker product.

Given two matrices A ∈ R^(p1,p2) and B ∈ R^(q1,q2), the Kronecker product is defined by

    A ⊗ B = [ a_(1,1) B   ···   a_(1,p2) B
                  ⋮        ⋱        ⋮
              a_(p1,1) B  ···  a_(p1,p2) B ]  ∈ R^(p1·q1, p2·q2)

Often, these Kronecker products involve an identity matrix I_n of appropriate size, where

    I_n = speye(n, n) = diag(1, ..., 1) ∈ R^(n,n)    (2.3)

For example, a 2D affine linear transformation y = [y^1, y^2] can be phrased as

    y^i(x) = [1, x1, x2] · [w_(3i−2), w_(3i−1), w_(3i)]^T,   i = 1, 2,

where w ∈ R^6 denotes the parameters. A compact formulation y = Q(x)w is obtained by setting

    Q(x) = I_2 ⊗ [1, x1, x2].

Though there are more memory-efficient ways of computing the transformed grid, FAIR stores a discretized version Q of Q(x) and computes Qw using a simple matrix-vector multiplication.
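This compact formulation is easy to reproduce with a Kronecker product routine such as numpy.kron. The sketch below is illustrative (not FAIR code) and assumes a six-parameter 2D affine model y = Q(x)w with Q(x) = I_2 ⊗ [1, x1, x2]:

```python
import numpy as np

def affine_Q(x):
    """Q(x) = I_2 kron [1, x1, x2] for a single point x = (x1, x2);
    stacking such rows over all grid points gives the discretized Q."""
    return np.kron(np.eye(2), np.array([1.0, x[0], x[1]]))

# identity transform: y1 = x1, y2 = x2  ->  w = [0,1,0, 0,0,1]
w = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
x = np.array([3.0, -2.0])
y = affine_Q(x) @ w
assert np.allclose(y, x)
```

As the text notes, materializing Q for a whole grid is memory-hungry but keeps the transform a plain matrix-vector product, which is exactly the clarity-over-memory trade-off FAIR makes.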


2.5.5 Cells, Grids and Numbering

Let d denote the spatial dimension of the given data. It is assumed that the given data dataT(j) is related to points x_j = [x_j^1, ..., x_j^d] ∈ R^d, j = 1, ..., n, and that these points form the cell-centered grid of a regular grid on a d-dimensional interval Ω = (ω1, ω2) × ··· × (ω_(2d−1), ω_(2d)) ⊂ R^d. Superscript indices are used for components of vectors and subscript indices are used for numbering.

A grid is a partitioning of the interval into a number of congruent cells or boxes. Thus, the i-th component of the difference between two grid points is a multiple of a constant grid width h_i. For example, let the data size m = [m1, ..., md] be given and

    h_i = (ω_(2i) − ω_(2i−1)) / m_i,   h = [h_1, ..., h_d],    (2.4)

    ξ_j^i = ω_(2i−1) + (j − 0.5) h_i,   ξ^i = [ξ_1^i, ..., ξ_(mi)^i] ∈ R^(mi).    (2.5)

Index vectors j = [j1, ..., jd] are used for accessing elements of higher dimensional arrays. The collection of points x_j = [ξ_(j1)^1, ..., ξ_(jd)^d], j_i = 1, ..., m_i, i = 1, ..., d, is called a cell-centered grid, and the d-dimensional intervals

    cell_j = { x ∈ R^d | −h_i/2 < x^i − ξ_(ji)^i < h_i/2, i = 1, ..., d }

are called cells, where the cell centers are x_j.

Figure 2.2: Discretization of a 1D domain Ω = (ω1, ω2) ⊂ R.

Figure 2.2 displays a one-dimensional example. The interval Ω = (ω1, ω2) is divided into m = 4 cells of length h = (ω2 − ω1)/m, with centers x_j = ξ_j^1.

The corresponding MATLAB statements read:

h = (omega(2:2:end)-omega(1:2:end-1))./m,

x = omega(1)+h/2:h:omega(2)-h/2.
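For readers more familiar with numpy, the same cell-centered grid construction can be sketched as follows (illustrative; it mirrors the 1D MATLAB statement above):

```python
import numpy as np

def cell_centered_grid_1d(omega, m):
    """Numpy analogue of the MATLAB statement:
    h = (omega(2)-omega(1))/m,  x = omega(1)+h/2 : h : omega(2)-h/2."""
    h = (omega[1] - omega[0]) / m
    x = omega[0] + (np.arange(m) + 0.5) * h   # cell centers xi_j
    return h, x

h, x = cell_centered_grid_1d((0.0, 1.0), 4)
assert h == 0.25
assert np.allclose(x, [0.125, 0.375, 0.625, 0.875])
```

Applying this per dimension and combining the resulting ξ^i vectors reproduces the d-dimensional cell-centered grid described above.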

2.6 Fixed level PIR

In this section we take a quick look at a FAIR image registration implementation that will serve as a reference example throughout this thesis. Using this example, it is shown in chapter 4 how particular algorithms in the constituent functional modules of FAIR can be ported to the GPU. A detailed explanation of the FAIR functional modules is provided in subsequent sections of chapter 4. For the current discussion we refer to the general software framework described in section 2.2.

The example shown in Figure 2.3 is a Parametric Image Registration (PIR) process applied to data from a histological serial section generated in the Human NeuroScanning Project (HNSP). Though a prototype for a relatively easy registration, PIR serves as a very competent example to explain all features of FAIR, both general and CUDA enabled.


    HNSP:       (a) T(x_c)               (b) R(x_c)   (c) |T(x_c) − R(x_c)|
    rigid/fine: (d) T(x_c) and grid y_c  (e) T(y_c)   (f) |T(y_c) − R(x_c)|

Figure 2.3: Parametric image registration on HNSP data, image generated from FAIR

As a test case it also helps enlist the computational aspects of all the functional modules involved. This information is easily obtained by profiling the execution of the FAIR example E6_HNSP_PIR_SSD_rigid2D_level7. The configuration of this PIR example is shown in Figure 2.4. These specific functionalities are discussed in detail in chapter 4.

Figure 2.4: Parametric image registration example in FAIR

    Function Name                     Calls   Total Time (s)   Percentage
    E6_HNSP_PIR_SSD_rigid2D_level7      1        38.040          100
    inter = splineInter2D             180        22.559           59.3
    opt = Armijo                       85         5.328           14.0
    distance = SSD                    175         0.984            2.6
    trafo = rigid2D                   179         0.600            1.5
    FAIRplots and others               89        12.817           22.4

Table 2.1: Profiler results for Parametric Image Registration in FAIR

As evident from Table 2.1, amongst all the functional modules the interpolation module inter is the most frequently called and also accounts for more than half the time spent on the image registration. The share of the interpolation module, which in this example uses spline interpolation, is even higher once the total time is adjusted by deducting the time taken by the visualization methods, which account for approximately one fifth of the total time in this example. Therefore a fast implementation of the B-spline interpolation scheme should contribute significantly to the overall speed of image registration in general. This observation is not surprising, as the choice of interpolant determines the smoothness of the optimization search space and therefore also the number of iterations the optimizer needs. Moreover, since a transformed point rarely aligns with the original pixel grid, the interpolant is evaluated a substantially large number of times in every optimization cycle.
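A back-of-the-envelope check using the Table 2.1 figures (our own arithmetic, not stated in the source) makes the adjusted share explicit:

    22.559 / (38.040 − 12.817) = 22.559 / 25.223 ≈ 0.89,

i.e. after deducting the visualization time, the interpolation module accounts for roughly 89% of the remaining execution time.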


In order to have fast implementations of the spline interpolation and other related functional methods, the strategy used here is to run these algorithms on GPUs for acceleration using CUDA. As discussed in this chapter, the FAIR toolbox is implemented in MATLAB. Therefore the preliminary task of integrating these two software environments is necessary, which is the topic of the next chapter.


Chapter 3

MATLAB on CUDA

3.1 Introduction

As FAIR is predominantly implemented using optimized MATLAB libraries, making it faster requires a solution outside the scope of MATLAB. One such possibility is to pursue hardware acceleration by using the latest generation of GPUs as co-processors. Through their high number of streaming processors and fast on-chip communication they provide substantial computational power. Considering that MATLAB, and therefore FAIR, relies extensively on matrix-centric computations, FAIR applications seem well suited for GPU computation. This has been made feasible in recent years by the introduction of CUDA, "a general purpose parallel computing architecture - with a new parallel programming model and instruction set architecture - that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU" [3].

This chapter, which is broadly based on the online MATLAB Application Program Interface (API) documentation and on [3], [4] and [5], discusses the integration of the CUDA system into MATLAB. The scope of the chapter is purposefully limited to specifics pertaining to this thesis. Since the work was performed on Linux, the explanations follow the experience gained within this environment, although many of the aspects are likely to apply to other environments as well. In the following sections we elaborate on the various aspects of CUDA integration within MATLAB. An introduction to CUDA itself is omitted here; for detailed documentation on CUDA refer to [3].

3.2 MATLAB on CUDA

One promising application of this GPU computing capability is through MATLAB and dynamically linked subroutines called MATLAB MEX functions, which the MATLAB interpreter can automatically load and execute. With a properly developed CUDA-enabled MEX function, the user-friendly MATLAB interface can be used to perform behind-the-scenes parallel computations on the GPU. Getting CUDA MEX functions to work encompasses the preliminary stage of getting generic MEX functions to work. Hence, this section partly addresses generic MEX file properties and issues as well [5].

3.2.1 MATLAB MEX environment

Although MATLAB is a complete, self-contained environment for programming and manipulating data, it is often useful to interact with data and programs external to the MATLAB


environment. MATLAB provides an API to support these external interfaces, allowing user-defined C or Fortran subroutines to be called from MATLAB as if they were built-in functions. These MATLAB-callable C and Fortran programs are referred to as MEX-files. MEX-files have several applications: large pre-existing C and Fortran programs can be called from MATLAB without having to be rewritten as M-files, and bottleneck computations (usually for-loops) that do not run fast enough in MATLAB can be recoded in C or Fortran for efficiency. MATLAB provides a script, called mex, to compile a MEX file, with additional calls to API routines, into a shared object or dynamically linked library that can be loaded and executed inside a MATLAB session [4].

MEX files can be used with CUDA by means of tools provided by NVIDIA. Hence, to access and compute on the GPU using CUDA from within MATLAB, one must become familiar with the provided MATLAB API.

3.2.2 The MEX file

This section describes the MEX-file, how these C language files interact with MATLAB, and how to pass and manipulate arguments of different data types. The source code for a MEX-file consists of two distinct parts: the computational routine and the gateway routine.

The computational routine Contains the code for performing the computations that the MEX-file is required to implement. In the CUDA MEX scenario this would most likely be the computation kernel that executes on the device.

The gateway routine Interfaces the computational routine with MATLAB through the entry point mexFunction and its parameters prhs, nrhs, plhs and nlhs: prhs is an array of right-hand input arguments, nrhs is the number of right-hand input arguments, plhs is an array of left-hand output arguments, and nlhs is the number of left-hand output arguments. The gateway calls the computational routine as a subroutine.

The two components of the MEX-file may be separate or combined. In either case, the files must contain the #include "mex.h" header so that the entry point and interface routines are declared properly. The name of the gateway routine must always be mexFunction. In the gateway routine, data is accessed through the mxArray structure (discussed in 3.2.3) and then modified in the C computational subroutine.

For MATLAB to recognise output from the MEX-file, a pointer of type mxArray has to be set to the data returned by the computational routine, or the computational kernel in the case of CUDA. Unlike C, where function argument checking is done at compile time, any number or type of arguments can be passed to an M-function, which is itself responsible for argument checking. This is also true for MEX-files. Hence the user's MEX program must safely handle any number of input or output arguments of any supported type [4].

The following Figure 3.1 shows the C MEX cycle. It illustrates:

1. How inputs enter a MEX-file,

2. What functions the gateway routine performs and,

3. How outputs return to MATLAB.

The MATLAB interpreter looks through the list of directories on MATLAB's search path and scans each directory looking for the first occurrence of a file with the corresponding filename


Figure 3.1: C MEX cycle

extension from the table, or ".m". When it finds one, it loads the file and executes it. MEX-files take precedence over M-files when like-named files exist in the same directory. This is an important feature in the context of porting FAIR .m implementations onto CUDA-enabled MEX implementations, as it facilitates the use of existing FAIR scripts without any modification.

3.2.3 The MATLAB Array

The MATLAB language works with only a single object type: the MATLAB array. All MATLAB variables, including scalars, vectors, matrices, strings, cell arrays, structures, and objects, are stored as MATLAB arrays.

In C, the MATLAB array is declared to be of type mxArray. The mxArray structure contains,among other things:

1. Its type.

2. Its dimensions.

3. The data associated with this array.

4. If numeric, whether the variable is real or complex.

5. If sparse, its indices and nonzero maximum elements.

6. If a structure or object, the number of fields and field names.

In this regard it is important to note that, unlike M-files, MEX-file functions do not have their own variable workspace. MEX-file functions operate in the caller's workspace. mexEvalString


evaluates the string in the caller's workspace. In addition, one can use the mexGetArray and mexPutArray routines to get and put variables into the caller's workspace.

Given the context of the discussion in this section, it is also noteworthy to mention the array ordering convention in MATLAB. All MATLAB data is stored column-wise; MATLAB uses this convention because it was originally written in Fortran. The same column-major format applies to MEX functions. This might cause some ambiguity, as CUDA uses both conventions for the ordering of arrays. The C row-major convention is used by most of the CUDA routines, e.g. cudaMemcpy2D, which "copies a matrix (height rows of width bytes each) from the memory area pointed to by src to the memory area pointed to by dst". However, what cudaMemcpy2D means by rows is understood to be columns in MATLAB.

3.2.4 Customised build for CUDA MEX-Files

As mentioned earlier, MEX files can be compiled and built using the mex script. Quite often, however, it is of interest to have a customised build in order to choose from a multitude of compilers, debug options and compiler options. This is done using the -f option to specify an options file at the MATLAB prompt as below:

    mex filename -f <optionsfile>

In order to compile and build CUDA-enabled MEX files, the mex script and the default mexopts.sh options file have to be modified. The gcc compiler has to be replaced by the nvcc compiler, with a few of the options changed in the standard mexopts.sh file found in the MATLAB installation, as shown in Figure 3.2. The mex script is modified to recognize *.cu extensions.

    CC='nvcc'
    CFLAGS='-O3 -Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64 -march=native"'
    CLIBS="$RPATH $MLIBS -lm -lstdc++"
    COPTIMFLAGS='-Xcompiler "-O3 -DNDEBUG -march=native"'
    ...
    LD="gcc"

Figure 3.2: Example of nvopts file

These changed files (namely nvmex and nvopts.sh) for the MATLAB package are available for download from the NVIDIA CUDA site¹. The package contains the following files:

1. Makefile

2. nvopts.sh optionsfile.

3. MEX source file with the *.cu extension.

4. The nvMEX script.

Customised Makefiles and option files for FAIR can be found in Appendix A.

¹ http://developer.download.nvidia.com/compute/CUDA/1_1/MATLAB_CUDA_1.1.tgz


3.2.5 MEX APIs

The MATLAB API provides a full set of routines that handle the various data types supported by MATLAB. For each data type there is a specific set of functions that can be used for data manipulation. There are basically three different categories of routines in the API, based on the scope of their operation [4]:

mx Routines Routines prefixed with mx allow you to create, access, manipulate, and destroy mxArrays. The array access and creation library provides a set of routines for manipulating MATLAB arrays. These subroutines, which are fully documented in the online API reference pages, always start with the prefix mx. For example, mxGetPi retrieves the pointer to the imaginary data inside the array. Although most of the routines in the array access and creation library let you manipulate the MATLAB array, there are two exceptions: the IEEE routines and the memory management routines. For example, mxGetNaN returns a double, not an mxArray.

MEX Routines Routines that begin with the mex prefix perform operations back in the MATLAB environment. For example, the mexCallMATLAB routine calls any other MATLAB .m or MEX-function.

Engine Routines Routines that allow calling MATLAB from external programs, employing it as a computation engine. These routines are used by C or Fortran programs that communicate with a separate MATLAB process via pipes (on UNIX). A library of functions provided with MATLAB allows starting and ending the MATLAB process, sending data to and from MATLAB, and sending commands to be processed in MATLAB.

Further detailed information regarding the APIs can be obtained from the MATLAB documentation website².

3.3 CUDA MEX Memory Management

Memory management within MEX-files is like in regular C applications. However, there are special considerations in the case of CUDA MEX-files, as they must exist within the context of a larger application, i.e. MATLAB itself. A few of these considerations relevant to CUDA MEX are described below [4].

Automatic Cleanup of Temporary Arrays When a MEX-file returns to MATLAB, it hands MATLAB the results of its computations in the form of the left-hand side arguments, i.e. the mxArrays contained within the plhs[] list. Any mxArrays created by the MEX-file that are not in this list are automatically destroyed. In addition, any memory allocated with mxCalloc, mxMalloc, or mxRealloc during the MEX-file's execution is automatically freed. It is nevertheless recommended that MEX-files destroy their own temporary arrays and free their own dynamically allocated memory, as it is more efficient for the MEX-file to perform this cleanup than to rely on the automatic mechanism.

Persistent Arrays An array, or a piece of memory, can be exempted from MATLAB's automatic cleanup by calling mexMakeArrayPersistent or mexMakeMemoryPersistent. However, if a MEX-file creates such persistent objects, there is a danger that a memory leak could occur if the MEX-file is cleared before the persistent object is properly destroyed. In order to prevent this from happening, a MEX-file that creates persistent objects should register a cleanup function, using mexAtExit, which will dispose of the objects. This is shown in the example in Figure 3.3.

² http://www.mathworks.com/support/tech-notes/1600/1605.html


    #include "mex.h"

    static int initialized = 0;
    static mxArray *persistent_array_ptr = NULL;

    void cleanup(void)
    {
        mexPrintf("MEX-file is terminating, destroying array\n");
        mxDestroyArray(persistent_array_ptr);
    }

    void mexFunction(int nlhs, mxArray *plhs[],
                     int nrhs, const mxArray *prhs[])
    {
        if (!initialized) {
            mexPrintf("MEX-file initializing, creating array\n");
            /* Create persistent array and register its cleanup. */
            persistent_array_ptr = mxCreateDoubleMatrix(1, 1, mxREAL);
            mexMakeArrayPersistent(persistent_array_ptr);
            mexAtExit(cleanup);
            initialized = 1;
            /* Set the data of the array to some interesting value. */
            *mxGetPr(persistent_array_ptr) = 1.0;
        } else {
            mexPrintf("MEX-file executing; value of first array element is %g\n",
                      *mxGetPr(persistent_array_ptr));
        }
    }

Figure 3.3: Mex persistent memory

Hybrid Arrays Functions such as mxSetPr, mxSetData, and mxSetCell allow the direct placement of pieces of memory into an mxArray; mxDestroyArray will destroy these pieces along with the entire array. Because of this, it is possible to create an array that cannot be destroyed, i.e. an array on which it is not safe to call mxDestroyArray. Such an array is called a hybrid array, because it contains both destroyable and nondestroyable components. For example, it is not allowed to call mxFree (or the ANSI free() function, for that matter) on automatic variables. Therefore, in the code fragment in Figure 3.4, pArray is a hybrid array. Because hybrid arrays cannot be destroyed, they cannot be cleaned up by the automatic mechanism outlined under Automatic Cleanup of Temporary Arrays. Since that automatic mechanism is the only way to destroy temporary arrays in case of a user interrupt, temporary hybrid arrays are not valid and may cause the MEX-file to crash. Although persistent hybrid arrays are viable, it is recommended to avoid their use wherever possible.

    ...
    mxArray *pArray = mxCreateDoubleMatrix(0, 0, mxREAL);
    double data[10];

    mxSetPr(pArray, data);
    mxSetM(pArray, 1);
    mxSetN(pArray, 10);
    ...

Figure 3.4: Mex hybrid array

MEX routine calls When calling a MATLAB function from a MEX function using mex routines such as mexCallMATLAB (used to execute optimised MATLAB routines from within the MEX function)³, MATLAB does not overwrite the lhs pointer with each call, but simply allocates more space. Under circumstances where such a call is made repeatedly, either within the MEX function or in the calling M-file itself, considerable host memory can be consumed. Therefore, after making such a call from a CUDA MEX function, the array should be destroyed using mxDestroyArray(lhs[0]) once its data is no longer needed, often as soon as it has been transferred to the GPU. This destruction only clears the allocated host memory, while still maintaining the lhs array, thereby resolving a potential memory leak. Example pseudo-code is shown in Figure 3.5.

³ The combination of CUDA routines to do the bulk of the calculations with MATLAB routines to easily compute specialized elements is one of the features of CUDA-enabled FAIR, combining the flexibility of the framework and the computational intensity of the GPU.

    for (i = 0; i < LargeN; i++) {
        ...
        mexCallMATLAB(1, &lhs[0], 2, rhs, "mrdivide");
        ...
        mxDestroyArray(lhs[0]);
        ...
    }

Figure 3.5: Example pseudocode for memory leak while calling MEX routines

Device memory allocation When allocating space on the GPU from a MEX function, e.g. cudaMalloc((void **)&A, N * sizeof(A[0])), the allocated space is not cleared by itself when the MEX call terminates. If the MEX function is repeatedly called from a loop in MATLAB without clearing the CUDA allocations, the GPU memory fills up and erroneous results may occur once it runs out. The memory should therefore be cleared with cudaFree(A) at the end of the MEX function; clearing such allocations prevents any device memory leak. In all cases, memory is freed when either MATLAB is exited or the MEX function is cleared as a variable in MATLAB [5].

Host memory allocation Copying data between host and device memory is one of the main bottlenecks in accelerating frameworks such as FAIR. One possibility to speed up the host-device transfer is to use the routine cudaMallocHost in a MEX file. This enables the usage of pinned memory, which makes the host memory directly visible to CUDA instead of being accessed through the paged memory of the host kernel. However, a few memory issues were witnessed when this functionality was used alongside mex routines, possibly on account of the CUDA host memory allocation conflicting with that of the running MATLAB.


3.4 CUDA MEX Retention of variables on the GPU

A typical computation on a GPU consists of a three step process:

1. Transfer of data to the device memory.

2. Kernel execution.

3. Transfer of calculated results back to host memory.

Steps (1) and (3) contribute heavily to the overall execution time of the end-to-end application process. Moreover, as the GPU shows considerable performance gains only once the amount of processed data is high enough to spawn sufficient CUDA threads, these effects become more pronounced. Communication between host and device is thus frequently the bottleneck of a calculation, and optimising it is a high priority. The following three strategies are used to tackle this problem:

• Firstly, the transfers themselves can be made faster by using pinned host memory allocated with cudaMallocHost(). As mentioned in the previous section on memory management, however, this can be done only in restricted situations.

• Secondly, the concept of persistent memory shown in Figure 3.3 can be extended to CUDA MEX files. By doing so, a repeated call to the MEX file (which registers the memory-freeing routine via mexAtExit) retrieves a previously calculated result or stored value from the device memory pointed to by the static pointer variable. In the case of device persistent memory, the mexMakeArrayPersistent() routine is not required, as CUDA variables are not automatically cleared by the MEX engine. The allocated device memory is therefore retained until the registered cleanup function is called, i.e. either when the MEX file is cleared or when the MATLAB session exits. The device memory pointed to by parray_in in Figure 3.6 is an example of this.

• A third possibility is to use hybrid persistent memory, as discussed in the previous section, by setting the memory pointer of an mxArray to CUDA device memory allocated using cudaMalloc(). This device memory must be registered to be freed by the cleanup routine, as in the case of persistent memory. The device memory pointed to by parray_out in Figure 3.6 is an example of this.


#include "mex. h"#include "cuda. h"

static int initialized = 0;5 static float *parray_in = NULL;

static float *parray_out = NULL;

void cleanup(void)

10 mexPrintf("MEX -file is terminating , freeing cuda device memory\n");cudaFree(parray_in );cudaFree(parray_out );

15

void mexFunction(int nlhs , mxArray *plhs[],int nrhs , const mxArray *prhs [])

20 if (! initialized) mexPrintf("MEX -file initializing , creating array\n");/* Create persistent array and register its cleanup. */float *data;int dim [2]. ;

25 dim [0]= GetM(prhs [0]); dim [1]= GetN(prhs [0]);data = mxmalloc(dim[0] * dim[1] * sizeof(float ));data = (float*) mxGetData(prhs [0]);

.

.30 .

parray_in = cudaMalloc ((void **)& parray_in ,dim [0]* dim [1]* sizeof(parray_in [0]));parray_out = cudaMalloc ((void **)& parray_out ,dim [0]* dim [1]* sizeof(parray_out [0]));

.

.35 mexAtExit(cleanup );

initialized = 1;.

/* Set the data of the array to some interesting value. */cudaMemcpy(parray_in ,data ,N*sizeof(float),cudaMemcpyHostToDevice );

40

cudaMemcpy(parray_out ,parray_in ,N*sizeof(float),cudaMemcpyDeviceToDevice );

mxSetPr(pArray ,parray_out );mxSetM(pArray ,parray_out );

45 mxSetN(pArray ,parray_out );..

50 else

mexPrintf("MEX -file executing; value of first array element is %g\n",cudaMemcpy(parray_out ,parray_in ,N*sizeof(float),cudaMemcpyDeviceToDevice );

55

Figure 3.6: CUDA MEX persistent memory

3.5 CUDA MEX programming tools

Irrespective of the application domain, efficient programming on any architecture requires suitable tools for enhanced program design. Given the amount of work done on standard PCs in the last few decades, there is a plethora of applications for this purpose. In this section two such tools are discussed that make it feasible to analyse the CUDA-enabled FAIR toolbox.

3.5.1 CUDA MEX Debugging

Owing to the many-core architecture, the massive threading model, and the large amounts of data involved, finding errors on the GPU is tedious. Therefore the following methods are suggested for debugging programs written in the CUDA MEX scenario.


Host MEX mode This is the normal mex debugging mode as discussed in the MATLAB online reference on mex debugging⁴. The steps are briefly as follows:

1. The MEX file is compiled using the -g option to generate the debug symbols.
2. MATLAB is started in "debug mode" by specifying the debugger after the -D option. For the Linux debugger gdb, matlab -Dgdb is typed at the command prompt.⁵
3. Issuing run starts MATLAB in the debug mode.
4. Within MATLAB, mex debugging is switched on by executing dbmex on at the MATLAB prompt.
5. The MEX function can be called in the usual way, both directly and through a MATLAB script. Once the MEX function is loaded, control is passed to the specified debugger, here gdb. Thereafter all the general debugger commands are used as for normal C files; details on using gdb can be obtained from its help and man pages.
6. With dbmex stop, control can be passed to the debugger whenever required.

CUDA MEX emulation mode The easiest method of debugging a CUDA program is to add print statements to the CUDA MEX source code and compile it to run in the emulator (initiated by passing the -deviceemu flag to the nvcc compiler). Since the emulator runs on the host processor (and not on the GPU), the print statements can be compiled and linked so that the programmer can examine whatever program values might be important. The emulator does not precisely reproduce what happens on the GPU, which means that bugs and behavior that occur on the GPU (including race conditions) may not appear in the emulated environment⁶. The correspondingly modified makefile and nvopts.sh for FAIR emulation mode are available in Appendix A.

CUDA MEX emulation/debug mode The above two modes can be combined so that the host gdb debugger is used with an emulation-compiled CUDA MEX file. This is achieved by compiling the CUDA MEX source with both the -g and -deviceemu flags. Since in emulation mode the CUDA MEX file is for all practical purposes an ordinary host MEX file, the procedure described for Host MEX mode is valid here as well.

CUDA MEX device debug mode Since the CUDA 2.2 SDK release, most Unix-based operating system releases of the CUDA Toolkit include CUDA-GDB. Its usage with regard to MATLAB is similar to that of GDB, as CUDA-GDB is a port of the former. Nevertheless, a few considerations make the usage of CUDA-GDB slightly different from GNU GDB. Firstly, X11 should not be running on the GPU on which FAIR is running. Secondly, MATLAB under debug mode must be run with the -nojvm and -nodesktop options. A detailed description of the functioning of the CUDA-GDB debugger can be found in NVIDIA's CUDA-GDB user manual⁷. The corresponding makefile and nvopts.sh files required to run FAIR for both ordinary and CUDA-enabled MEX files are provided in Appendix A.

3.5.2 CUDA MEX Profiling

Profiling is an efficient way to measure where a program spends its time. It helps discover performance problems caused by unnecessary computation and also helps identify the most time-consuming parts of the algorithm by providing accurate timing data. MATLAB software provides a graphical user interface, called the Profiler, that assists in the profiling of ordinary .m and MEX files. In this section we mention only the CUDA MEX specific profiling considerations, using the CUDA Visual Profiler provided by NVIDIA. Detailed documentation on the usage of the MATLAB profiler can be found on the online MATLAB technical reference website⁸. On Linux with Intel multi-core chips, it is recommended to restrict the active number of CPUs to 1 for the most accurate and efficient profiling. Although not mentioned in the MATLAB documentation, this can be done by obtaining the process ID (PID) of MATLAB and then binding it to a single processor core using the taskset command from the system's scheduler utilities.

⁴ http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_external/f32489.html
⁵ On Linux with multicore CPU architectures it might be preferable to restrict the active number of CPUs to 1 by setting the processor affinity of the MATLAB process ID (PID). This can be done using the taskset command from the scheduler tools.
⁶ http://www.drdobbs.com/architect/220601124
⁷ http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/CUDA_GDB_User_Manual_2.3beta.pdf

The CUDA Visual Profiler allows one to run a MATLAB script while timing the various calls to CUDA routines. This gives a good picture of where the program spends its time and hence provides important clues for identifying how to speed up the calculations.

Figure 3.7: CUDA Visual Profiler

The figure above shows how to start a MATLAB script from the profiler. There are a few specifics to consider when profiling CUDA MEX programs as compared to normal CUDA programs:

1. The call to MATLAB requires the full path name within quotes.

2. MATLAB has to run with the -nodesktop and -nojvm flags to switch Java off during profiling.

3. The script or CUDA MEX file to be profiled is called using the -r option and without thefile extension.

4. The field "Max. Execution Time" has to be set to a value considerably larger than the expected execution time of the CUDA MEX files, because the profiler calls MATLAB as the main application with the provided script as an argument; hence the time for MATLAB to load has to be accommodated.

⁸ http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f9-17018.html


5. The script calling the CUDA MEX file has to explicitly call "quit": in contrast to a normal C program, there is no other provision to terminate the application, here MATLAB.

Once the above settings are taken care of, the profiler can be started, and at the end of the script execution, data corresponding to the profiled CUDA code is obtained. Details on using the CUDA profiler are available in the CUDA toolkit documentation [6].

In this chapter the various aspects of CUDA and MATLAB integration were presented in the context of their use in FAIR. Various tools, procedures and programming considerations have been discussed to efficiently integrate the CUDA framework within FAIR. In the following chapter the actual FAIR implementations that build on the details discussed here are presented.


Chapter 4

CUDA enabled FAIR

4.1 Introduction

Now that both the FAIR toolbox and the CUDA MEX environment have been introduced and thoroughly discussed, this chapter focuses on the actual CUDA MEX implementations of the functional modules in FAIR. In this context the software framework mentioned in section 2.2 is extended to incorporate the FAIR CUDA MEX kernels as shown in Figure 4.1.

Figure 4.1: FAIR CUDA Registration

It was seen in section 2.6 that the interpolation module was the most time consuming amongst all the functional modules. Therefore, in the following section, fast GPU implementations of the general interpolation schemes within FAIR are presented first. Other functional modules such as the transformation and the similarity measure are also ported to the GPU in order to have an end-to-end FAIR image registration cycle running on the GPU.

Before presenting the GPU implementations of each FAIR module, it is necessary to get familiarized with the corresponding mathematical foundation on which FAIR is designed. Hence in each section, the functionality of the module is mathematically explained using excerpts from the FAIR documentation [1].


4.2 Image interpolation in CUDA enabled FAIR

The objective of image interpolation is to find a function T interpolating data dataT ∈ R^n given on a grid. More precisely, for given points xj ∈ R^n the function T is required to satisfy the interpolation condition

T(xj) = dataT(j),   j = 1, . . . , n,          (4.1)
T(x) = 0   for x ∉ Ω.                          (4.2)

The emphasis in FAIR, and therefore in this thesis, is on linear and spline interpolation on account of the requirement for derivatives in the optimization process. The piecewise linear interpolants are differentiable almost everywhere. The necessary insight for all interpolation schemes is given with one-dimensional examples. These examples are easily extended to higher dimensions using Kronecker products.

4.2.1 Next Neighbor Interpolation

Next neighbor interpolation is not used in FAIR, as the interpolant is not continuous and therefore does not have valid derivatives. This lack of derivatives makes it unsuitable for the optimization scheme in the registration cycle. Nevertheless, we discuss next neighbor interpolation here, as we use it in the context of the discussion on texture memory for the CUDA implementation.

Next neighbor interpolation can be mathematically defined as follows:

T^nn(x) = 0   for x ∉ Ω,                       (4.3)
T^nn(x) := dataT(j),                           (4.4)

where j is such that x ∈ cell_j.
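To make the definition concrete, the following C++ sketch evaluates T^nn on a one-dimensional cell centered grid. The function nnInterp1D and its interface are hypothetical helpers written for this illustration, not FAIR code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 1D next neighbor interpolation on a cell centered grid (sketch).
// The domain Omega = (w1, w2) is split into m cells of width h;
// T^nn(x) = dataT(j) where x lies in cell j (equation 4.4), and
// T^nn(x) = 0 outside the domain (equation 4.3).
double nnInterp1D(const std::vector<double>& dataT,
                  double w1, double w2, double x) {
    int m = static_cast<int>(dataT.size());
    if (x < w1 || x >= w2) return 0.0;                  // x outside Omega
    double h = (w2 - w1) / m;                           // cell width
    int j = static_cast<int>(std::floor((x - w1) / h)); // 0-based cell index
    if (j == m) j = m - 1;                              // guard right boundary
    return dataT[j];
}
```

For dataT = [1, 2, 3] on Ω = (0, 3), every point inside (1, 2) returns 2: the interpolant is piecewise constant, which is exactly why its derivative vanishes almost everywhere and is useless for optimization.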

4.2.2 Linear Interpolation

In linear interpolation the value of the function at a certain position is obtained as a weighted sum of the function values at the neighboring points. By using the simple linear map

x ↦ x′ = (x − ω1)/h + 0.5,                     (4.5)

a domain Ω = (ω1, ω2) is mapped onto Ω′ = (0.5, m + 0.5) and, in particular, xj = ω1 + (j − 0.5)h is mapped onto j. Thus, the neighbors and the weights for an arbitrarily chosen point x can be easily obtained by splitting x′ into an integer part p and a remainder ξ, where

p = ⌊x′⌋ := max{ j ∈ Z | j ≤ x′ }   and   ξ = x′ − p,   0 ≤ ξ < 1.      (4.6)

The formula for linear interpolation can be given as

T^linear(x) := dataT(p) · (1 − ξ) + dataT(p + 1) · ξ.                    (4.7)

Equation 4.7 is only valid if the point x lies completely within the domain. For points that lie outside the domain it would produce spurious values that are not necessarily zero, as required by the definition in equation 4.2. Therefore FAIR works around this problem by padding artificial


data points on the domain ends. This way, for all x ∈ Ω and thus x′ ∈ [0.5, m + 0.5], it holds that 0 ≤ p ≤ m and 0 ≤ ξ < 1. This is also important as the assumption that the data is compactly supported is not always true in many real world applications.
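The map 4.5, the split 4.6 and formula 4.7, together with the zero padding just described, can be sketched in C++ as follows; linearInterp1D is a hypothetical helper written for this illustration, not the FAIR routine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 1D linear interpolation following equations 4.5-4.7 (sketch).
// The data is padded with one artificial zero at each end of the
// domain so that the indices p and p+1 are always valid.
double linearInterp1D(const std::vector<double>& dataT,
                      double w1, double w2, double x) {
    int m = static_cast<int>(dataT.size());
    double h  = (w2 - w1) / m;
    double xp = (x - w1) / h + 0.5;               // map (4.5): Omega -> (0.5, m+0.5)
    if (xp < 0.5 || xp > m + 0.5) return 0.0;     // outside the domain
    std::vector<double> c(m + 2, 0.0);            // padding: c[0] = c[m+1] = 0
    for (int j = 0; j < m; ++j) c[j + 1] = dataT[j];
    int p     = static_cast<int>(std::floor(xp)); // integer part (4.6)
    double xi = xp - p;                           // remainder, 0 <= xi < 1
    return c[p] * (1.0 - xi) + c[p + 1] * xi;     // formula (4.7)
}
```

At a cell center xj the interpolation condition holds exactly; between centers the values are blended linearly.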

The one-dimensional example mentioned above can be extended to higher dimensions using the Kronecker-product approach. Here, the above concepts are applied to each coordinate and T^linear is computed as a weighted sum of the data from the neighboring cells,

T^linear(x) = Σ_{k ∈ {0,1}^d} dataT(p + k) Π_{i=1,...,d} (ξi)^{ki} (1 − ξi)^{1−ki}.      (4.8)

For d = 2, equation 4.8 results in p = (p1, p2), ξ = (ξ1, ξ2), and

T^linear(x) = dataT(p1, p2)(1 − ξ1)(1 − ξ2) + dataT(p1 + 1, p2) ξ1 (1 − ξ2)
            + dataT(p1, p2 + 1)(1 − ξ1) ξ2 + dataT(p1 + 1, p2 + 1) ξ1 ξ2.
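The tensor-product sum of equation 4.8 and its expanded 2D form above can be checked against each other numerically. The helper bilinear below is written for this illustration; d00, d10, d01, d11 stand for the four neighboring values dataT(p1+k1, p2+k2).

```cpp
#include <cassert>
#include <cmath>

// Tensor-product form of 2D linear interpolation (equation 4.8, d = 2).
// d[k1][k2] = dataT(p1 + k1, p2 + k2); xi1, xi2 are the remainders (4.6).
double bilinear(double d00, double d10, double d01, double d11,
                double xi1, double xi2) {
    const double d[2][2] = {{d00, d01}, {d10, d11}};   // d[k1][k2]
    double T = 0.0;
    for (int k1 = 0; k1 <= 1; ++k1)
        for (int k2 = 0; k2 <= 1; ++k2) {
            // weight = prod_i (xi_i)^{k_i} * (1 - xi_i)^{1 - k_i}
            double w1 = k1 ? xi1 : 1.0 - xi1;
            double w2 = k2 ? xi2 : 1.0 - xi2;
            T += d[k1][k2] * w1 * w2;
        }
    return T;
}
```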

Two-dimensional linear interpolation in FAIR is implemented in the function linearInter2D. An example usage and the generated output of linearInter2D are shown in Figure 4.2 below.

dataT = flipud([1,2,3,4;1,2,3,4;4,4,4,4]);
m     = size(dataT);
omega = [0,m(1),0,m(2)];
M     = {m,10*m};                                   % two resolutions, coarse and fine
xc = reshape(getCenteredGrid(omega,M{1}),[],2);     % coarse resolution
xf = reshape(getCenteredGrid(omega,M{2}),[M{2},2]); % fine resolution
Tc = linearInter2D(dataT,omega,xc(:));
Tf = linearInter2D(dataT,omega,xf(:));
clf; ph = plot3(xc(:,1),xc(:,2),Tc(:),'ro'); hold on;
qh = surf(xf(:,:,1),xf(:,:,2),reshape(Tf,M{2}));

(a) data (b) fine grid (c) 3D view

Figure 4.2: 2D linear interpolation in FAIR. Images generated from FAIR

Linear interpolation has many desirable features such as low computational cost and no high-frequency perturbations on account of the data fitting. But it is visible from Figure 4.2 that the interpolant is not differentiable at the grid points. Hence its derivatives, which are obtained by taking finite differences, are not good enough to be used in efficient optimization schemes. For this purpose FAIR turns to spline interpolation as a trade-off between differentiability and compute efficiency.

4.2.3 Spline Interpolation

The objective is to find a function T^spline interpolating the data and minimizing its bending energy. Again, the one-dimensional situation provides a perfect starting point for higher dimensions, where the schemes are derived from a Kronecker-product approach.


Approximating the bending energy by an integral over the square of the second derivative,

S[T] = ∫_Ω (T″(x))² dx,                        (4.9)

the solution of the interpolation problem

S[T] → min   subject to   T(xj) = dataT(j),   j = 1, . . . , m,          (4.10)

is a cubic spline that can be expanded in terms of coefficients cj and basis functions bj. One of the many outstanding properties of a spline space is that it allows for an expansion in terms of a simple basis, where each basis function bj is a translated version of a so-called 'mother' spline b.

Figure 4.3: “Mother” spline b = b0 (solid) and basis functions b2 and b7.

In order to achieve convenient access to the indexing of the basis functions, the map introduced in equation 4.5 is used. The mapped cell centered grid points are xj = j. Figure 4.3 shows the basis function b = b0 and two arbitrarily chosen translates b2 and b7, where bj(x) = b(x − j) and

b(x) =
    (x + 2)^3,                        −2 ≤ x < −1,
    −x^3 − 2(x + 1)^3 + 6(x + 1),     −1 ≤ x < 0,
    x^3 + 2(x − 1)^3 − 6(x − 1),       0 ≤ x < 1,
    (2 − x)^3,                         1 ≤ x < 2,
    0,                                 else.          (4.11)

The goal is to expand the interpolant by

T(x) = T^spline(x) = Σ_{j=1}^{m} cj bj(x)          (4.12)

and to derive fast ways of evaluating equation 4.12 and of computing the coefficients c = [c1; . . . ; cm]. Expanding 4.12 at the cell centers xj = j gives the interpolation condition

dataT(j) = T(xj) = Σ_{k=1}^{m} ck bk(j) = [b1(j), . . . , bm(j)] c,   j = 1, . . . , m.      (4.13)

Gathering all function values in T (xc) = [T (x1); . . . ;T (xm)] yields the equivalent formula


dataT = T(xc) = [b1(xc), . . . , bm(xc)] c = Bm c,   with          (4.14)

Bm = [bk(xj)] =
    | 4  1        0 |
    | 1  4  1       |
    |    .  .  .    |
    | 0     1  4    |  ∈ R^{m,m}                                   (4.15)

i.e. the tridiagonal matrix with 4 on the diagonal and 1 on the two off-diagonals,

and presents a convenient formula for computing the coefficients.
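Because Bm is tridiagonal, the system Bm c = dataT can be solved in O(m) operations. The C++ sketch below uses the Thomas algorithm as one standard choice (an illustration only; FAIR itself solves the sparse system with MATLAB's backslash operator) and checks the interpolation condition 4.14 via the residual.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Solve B_m c = dataT, B_m = tridiag(1, 4, 1) from equation 4.15,
// with the Thomas algorithm (sketch, not FAIR code).
std::vector<double> splineCoefficients(const std::vector<double>& dataT) {
    int m = static_cast<int>(dataT.size());
    std::vector<double> cp(m), dp(m), c(m);
    cp[0] = 1.0 / 4.0;                       // forward elimination
    dp[0] = dataT[0] / 4.0;
    for (int i = 1; i < m; ++i) {
        double denom = 4.0 - cp[i - 1];
        cp[i] = 1.0 / denom;
        dp[i] = (dataT[i] - dp[i - 1]) / denom;
    }
    c[m - 1] = dp[m - 1];                    // back substitution
    for (int i = m - 2; i >= 0; --i) c[i] = dp[i] - cp[i] * c[i + 1];
    return c;
}

// Maximum residual of the interpolation condition dataT = B_m c (4.14).
double residual(const std::vector<double>& dataT, const std::vector<double>& c) {
    int m = static_cast<int>(dataT.size());
    double r = 0.0;
    for (int j = 0; j < m; ++j) {
        double Bc = 4.0 * c[j];
        if (j > 0)     Bc += c[j - 1];
        if (j < m - 1) Bc += c[j + 1];
        r = std::max(r, std::fabs(Bc - dataT[j]));
    }
    return r;
}
```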

Since b0(x) = 0 for x ∉ (−2, 2), for any point x = p + ξ with integer part p and remainder ξ (4.6) at most four basis functions are nonzero and thus

T^spline(x) = c_{p−1} b(ξ + 1) + c_p b(ξ) + c_{p+1} b(ξ − 1) + c_{p+2} b(ξ − 2),      (4.16)

which provides an efficient way of evaluating the spline.
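The equivalence of the full expansion 4.12 and the four-term formula 4.16 can be verified numerically. In the C++ sketch below (names chosen for this illustration) coefficients with indices outside 1..m are treated as zero:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mother spline b of equation 4.11 (repeated to keep the sketch
// self-contained).
double b(double x) {
    if (x >= -2 && x < -1) return (x + 2) * (x + 2) * (x + 2);
    if (x >= -1 && x <  0) return -x*x*x - 2*(x + 1)*(x + 1)*(x + 1) + 6*(x + 1);
    if (x >=  0 && x <  1) return  x*x*x + 2*(x - 1)*(x - 1)*(x - 1) - 6*(x - 1);
    if (x >=  1 && x <  2) return (2 - x) * (2 - x) * (2 - x);
    return 0.0;
}

// Full expansion (4.12): T(x) = sum_{j=1}^m c_j b(x - j).
double splineFull(const std::vector<double>& c, double x) {
    double T = 0.0;
    for (int j = 1; j <= static_cast<int>(c.size()); ++j) T += c[j - 1] * b(x - j);
    return T;
}

// Four-term evaluation (4.16): only c_{p-1}, ..., c_{p+2} contribute.
double spline4Tap(const std::vector<double>& c, double x) {
    int p = static_cast<int>(std::floor(x));
    double xi = x - p;
    auto C = [&](int j) {
        return (j >= 1 && j <= static_cast<int>(c.size())) ? c[j - 1] : 0.0;
    };
    return C(p - 1) * b(xi + 1) + C(p) * b(xi)
         + C(p + 1) * b(xi - 1) + C(p + 2) * b(xi - 2);
}
```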

A Kronecker-product approach is used for higher dimensions. Here, 4.12 is replaced by

T(x) = T^spline(x) = Σ_{jd=1}^{md} · · · Σ_{j1=1}^{m1} c_{j1,...,jd} b_{j1}(x^1) · · · b_{jd}(x^d).      (4.17)

For d = 2, the interpolation condition reads

dataT(j1, j2) = T(xc_{j1,j2}) = Σ_{k2=1}^{m2} Σ_{k1=1}^{m1} c_{k1,k2} b_{k1}(ξ^1_{j1}) b_{k2}(ξ^2_{j2}),

since xc_j = [ξ^1_{j1}, ξ^2_{j2}]. With the matrices B_{mi} = [bk(ξ_j)]_{k,j=1}^{mi} as introduced in 4.14, this can be rewritten as

dataT(j1, j2) = Σ_{k2=1}^{m2} Σ_{k1=1}^{m1} c_{k1,k2} B_{m1}(k1, j1) B_{m2}(k2, j2).

Using the lexicographical ordering j = j1 + (j2 − 1)m1, k = k1 + (k2 − 1)m1, j, k = 1, . . . , n = m1 m2, and the Kronecker product Bm = B_{m2} ⊗ B_{m1} ∈ R^{n,n}, i.e. Bm(j, k) = B_{m1}(j1, k1) B_{m2}(j2, k2), the above interpolation condition reads

dataT(j) = Σ_{k=1}^{n} Bm(j, k) ck,   j = 1, . . . , n,   or simply   dataT = Bm c.

This provides a way of computing the spline coefficients: c = B_{m1}^{−1} · dataT · B_{m2}^{−1}.

Two-dimensional spline interpolation in FAIR is implemented in the function splineInter2D. An example of spline interpolation is shown in Figure 4.4.

The general case is a straightforward extension of the example in Figure 4.4: using a lexicographical ordering and the matrix Bm = B_{md} ⊗ · · · ⊗ B_{m1}, the interpolation condition yields T(xc) = Bm c.


%$ examples for spline interpolation in 2D
%$ (c) Jan Modersitzki 2009/03/24, see FAIR.2 and FAIRcopyright.m.
dataT = flipud([1,2,3,4;1,2,3,4;4,4,4,4])';
m     = size(dataT);
omega = [0,m(1),0,m(2)];
M     = {m,10*m};                                   % two resolutions
xc = reshape(getCenteredGrid(omega,M{1}),[],2);
xf = reshape(getCenteredGrid(omega,M{2}),[M{2},2]);
B  = @(i) spdiags(ones(m(i),1)*[1,4,1],[-1:1],m(i),m(i));
T  = B(1)\dataT/B(2);
Tc = book_splineInter2D(T,omega,xc(:));
Tf = book_splineInter2D(T,omega,xf(:));
clf;
ph = plot3(xc(:,1),xc(:,2),Tc(:),'ro'); hold on;
qh = surf(xf(:,:,1),xf(:,:,2),reshape(Tf,M{2}));

(a) data (b) fine grid (c) 3D view

Figure 4.4: 2D spline interpolation in FAIR. Images generated from FAIR

4.2.4 Derivatives of Interpolation Schemes

The objective function to be minimised in the optimizer relies heavily on the transformed image. This makes accurate derivatives of the interpolants imperative. In this section the formulation of the spline interpolant derivatives is given.

Based on interpolation FAIR defines an image as follows,

T(x) = inter(T, omega, xc),

where T denotes the coefficients, omega the domain, and xc the discretized domain.

All the interpolants used in FAIR are Kronecker-products of 1D basis functions

T(x) = Σ_{jd=1}^{md} · · · Σ_{j1=1}^{m1} cj b_{j1}(x^1) · · · b_{jq}(x^q) · · · b_{jd}(x^d).      (4.18)

Therefore,

∂q T(x) = Σ_{jd=1}^{md} · · · Σ_{j1=1}^{m1} cj b_{j1}(x^1) · · · (b_{jq})′(x^q) · · · b_{jd}(x^d),      (4.19)

where (b_{jq})′ can be computed from 4.11. The spline interpolation function splineInter2D also provides these analytically calculated derivatives.
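Differentiating each polynomial piece of equation 4.11 yields the analytic derivative b′ used in 4.19. The C++ sketch below (names chosen for this illustration) checks b′ against a central finite difference:

```cpp
#include <cassert>
#include <cmath>

// Mother spline b (equation 4.11) and its analytic derivative b',
// obtained by differentiating each polynomial piece (sketch).
double b(double x) {
    if (x >= -2 && x < -1) return (x + 2) * (x + 2) * (x + 2);
    if (x >= -1 && x <  0) return -x*x*x - 2*(x + 1)*(x + 1)*(x + 1) + 6*(x + 1);
    if (x >=  0 && x <  1) return  x*x*x + 2*(x - 1)*(x - 1)*(x - 1) - 6*(x - 1);
    if (x >=  1 && x <  2) return (2 - x) * (2 - x) * (2 - x);
    return 0.0;
}

double db(double x) {
    if (x >= -2 && x < -1) return 3 * (x + 2) * (x + 2);
    if (x >= -1 && x <  0) return -3*x*x - 6*(x + 1)*(x + 1) + 6;
    if (x >=  0 && x <  1) return  3*x*x + 6*(x - 1)*(x - 1) - 6;
    if (x >=  1 && x <  2) return -3 * (2 - x) * (2 - x);
    return 0.0;
}
```

Note that the four derivative weights b′(ξ+1) + b′(ξ) + b′(ξ−1) + b′(ξ−2) sum to 0 rather than to the constant 6 that the weights b sum to; this property reappears later in the texture based derivative evaluation.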

Instead of performing a pointwise interpolation for multivariate interpolants, the FAIR interpolation implementation collects the individual coordinates of each point, i.e. xj = [x^1_j, . . . , x^d_j], j = 1, . . . , n, as

xc = [x^1_1; . . . ; x^1_n; . . . ; x^d_1; . . . ; x^d_n].

Therefore, the interpolation and the corresponding derivative of Tj(xc) = T(xj) = T(x^1_j, . . . , x^d_j) at the jth component are given as

dT(xc) = [ ∂Tj(xc)/∂xc(k) ]_{j=1,...,n, k=1,...,nd}.

The derivative results in an n-by-nd matrix, also known as the Jacobian of T. Since the jth component of T only depends on xj = [x^1_j, . . . , x^d_j], the Jacobian is a block matrix with diagonal blocks. This format is given explicitly in the equation below for the 2D case.

dT(xc) =
    | ∂1T(x^1_1, x^2_1)                      ∂2T(x^1_1, x^2_1)                      |
    |         .                                       .                            |
    |          .                                       .                           |
    |           ∂1T(x^1_n, x^2_n)                       ∂2T(x^1_n, x^2_n)          |

4.2.5 The Interpolation Toolbox

Having discussed both linear and spline interpolation, FAIR interpolation can be summarized as follows.

Given the data dataT on a cell centered grid xc on a domain Ω, the FAIR interpolation function computes the value of the interpolant at any wanted point y ∈ R^d. For a collection of n points yc = [y^1_1; . . . ; y^1_n; . . . ; y^d_1; . . . ; y^d_n] ∈ R^{nd}, the result is a collection of corresponding function values

Tc = T(yc) = [T(y^1_j, . . . , y^d_j)]_{j=1}^{n},

whose Jacobian is given by

dT = dT(yc) = [ ∂Ti(yc)/∂yc(j) ]_{i=1,...,n, j=1,...,nd} ∈ R^{n,nd}.

FAIR provides an administrative function inter to facilitate all the interpolation schemes in a single uniform framework. The definition and usage of the inter function is given below.

[Tc,dT] = inter(T,omega,yc)
    T ∈ R^{m1,...,md}   the coefficients for a representation of T
    omega               specifies the domain Ω = (ω1, ω2) × · · · × (ω_{2d−1}, ω_{2d})
    yc ∈ R^{nd}         interpolation points
    Tc ∈ R^n            value of the interpolant at the locations yc ∈ R^{nd}
    dT ∈ R^{n,nd}       derivative of the interpolant

The options used in this thesis from this toolbox are:

• Tc = inter(T,omega,yc): returns the function values

• [Tc,dT] = inter(T,omega,yc): returns the function values and the derivative

• [T,R] = inter('coefficients',dataT,dataR,omega): computes the coefficients for the interpolation scheme to be used; for linear interpolation T = dataT and R = dataR.


• [Tc,dT] = inter(T,omega,yc,'m',m): returns the function values and the derivative; additionally sends discretization information to the CUDA MEX implementation of the interpolation functions.

The example in Figure 4.5 illustrates the usage of inter(T,omega,xc). In the first loop a linear scheme is used. The second loop uses a regularized spline interpolation. The regularization (W = moment matrix and θ = 100) is obtained by changing the coefficients appropriately.

%$ example for interpolation in 2D
%$ (c) Jan Modersitzki 2009/03/24, see FAIR.2 and FAIRcopyright.m.
setupUSData; close all;
T  = dataT;
xc = @(m) getCenteredGrid(omega,m);
inter('set','inter','linearInter2D');
for p=5:7,
  m  = 2^p*[1,1];
  Tc = inter(T,omega,xc(m));
  figure(p-4); viewImage2D(Tc,omega,m); colormap(gray(256));
end;
inter('set','inter','splineInter2D');
T = getSplineCoefficients(dataT,'regularizer','moments','theta',100);
for p=5:7,
  m  = 2^p*[1,1];
  Tc = inter(T,omega,xc(m));
  figure(p-1); viewImage2D(Tc,omega,m); colormap(gray(256));
end;

(a) linear, m = (32, 32) (b) linear, m = (64, 64) (c) linear, m = (128, 128)

(d) spline, m = (32, 32) (e) spline, m = (64, 64) (f) spline, m = (128, 128)

Figure 4.5: linear and spline interpolation on a cell centered grid of dimension m. Images generated from FAIR

4.2.6 CUDA MEX Interpolation

Before discussing the CUDA MEX implementation of the interpolation module, it is worthwhile to verify where the bottlenecks lie in the current FAIR implementation with respect to execution speed. With this identification, the relevant acceleration strategies can be applied to the GPU implementation.

In order to do so, both the functions linearInter2D and splineInter2D were profiled within the registration example in Figure 2.3. The corresponding code snippets of both functions are shown in Figure 4.6 below. The compute intensive parts (hereafter referred to as MATLAB


interpolation kernels) identified from profiling have been highlighted. The darker highlighted code indicates a higher number of calls.

Figure 4.6: linearInter2D (top) and splineInter2D (bottom) MATLAB kernels. Images generated from the FAIR software

The observations from Figure 4.6 indicate that most time is spent in the code segments fetching data. This is not surprising considering the number of interpolation coefficients that have to be fetched, especially while calculating the derivatives in the linear interpolation case. The derivative calculation is less intensive in the case of spline interpolation on account of the analytic formulation of the derivatives. However, there is a high number of calls to the motherspline routine that evaluates the spline for the corresponding weights. Moreover, FAIR does not benefit from any apparent cache reuse because there is no optimized data structure. In the rest of this section we discuss how these issues can be solved efficiently on the GPU and also highlight the suitability of the hardware for the methods under discussion.

GPUs are highly parallel, multithreaded, manycore processors with tremendous computational horsepower and very high memory bandwidth [3]. The GPU is extremely efficient when the algorithms can be expressed as data-parallel computations, where the same program is executed on many data elements in parallel. Both linear and spline interpolation can be categorized as such programs. But it is not only this feature that enhances the computation of interpolants on the GPU.

The GPU provides texturing hardware that is especially suited for interpolation. This hardware performs low-precision interpolation effortlessly between neighboring texture memory pixels or, simply, texels. Prior to this operation the coefficient data has to be bound to the special memory, called textures, that enables the use of this hardware. This operation is much faster compared to loading coefficients and performing the weighted sum in software. Other texture-related features suitable for the interpolation implementation are listed below [3]:

• The texture memory space is cached. Therefore a slower read from the device memory takes place only when a cache miss occurs.


• Threads of the same warp accessing nearby texture addresses gain high performance because the cache is optimized for 2D spatial locality.

• The global device memory places constraints on the read access patterns of a thread block. As the nature of the transformations applied to the coordinates is not known, these patterns, called coalesced accesses, cannot be guaranteed for interpolation methods. In such situations textures provide a good alternative as long as there is locality in the fetches.

• Memory layouts called CUDA arrays that are optimized for texture fetching are availableto the programmer to offload the burden of ordering data suitably.

Using the CUDA runtime texture APIs, the readily available bilinear interpolation filter mode was used to implement the CUDA MEX linearInter2D, which was integrated into FAIR. This section focusses on the GPU spline interpolation implementation of splineInter2D. Details regarding texture bilinear interpolation can be obtained from the CUDA SDK and the NVIDIA documentation [3].

The CUDA MEX B-spline interpolation implementation, based on [7], uses a modified formulation of equation 4.12 to express the spline interpolation as a weighted sum of 2^d fast linear texture fetches instead of 4^d nearest neighbor fetches.

This insight can first be drawn from the 1D linear interpolation method by considering a reformulation of equation 4.7,

T^linear(x) := dataT(p) · (1 − ξ) + dataT(p + 1) · ξ.

This can be rewritten as a single linear interpolation at a shifted position,

(a + b) · T^linear(p + b/(a + b)) = dataT(p) · a + dataT(p + 1) · b,   if 0 ≤ b/(a + b) ≤ 1.      (4.20)

From equations 4.16 and 4.12,

T(x) = T^spline(x) = Σ_{j=1}^{m} cj bj(x),
T^spline(x) = c_{p−1} b(ξ + 1) + c_p b(ξ) + c_{p+1} b(ξ − 1) + c_{p+2} b(ξ − 2),

where c represents the interpolation coefficients to be loaded from memory, ξ the fractional part, and b the B-spline weight calculated from the mother spline.

Using equation 4.20, the above equation can then be written as

T^spline(x) = g0(ξ) · c^linear_{p+h0} + g1(ξ) · c^linear_{p+h1},      (4.21)

where c^linear_{p+h} denotes a linear texture fetch of the coefficients at position p + h, and

g0(ξ) = b(ξ + 1) + b(ξ),                       (4.22)
g1(ξ) = b(ξ − 1) + b(ξ − 2),                   (4.23)
h0 = b(ξ)/g0(ξ) − 1,                           (4.24)
h1 = b(ξ − 2)/g1(ξ) + 1.                       (4.25)
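The identity behind equations 4.21-4.25 can be verified on the CPU by emulating a linear texture fetch in software. The sketch below (all names chosen for this illustration) compares the two weighted linear fetches against the direct four-tap sum of equation 4.16:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Mother spline b of equation 4.11.
double b(double x) {
    if (x >= -2 && x < -1) return (x + 2) * (x + 2) * (x + 2);
    if (x >= -1 && x <  0) return -x*x*x - 2*(x + 1)*(x + 1)*(x + 1) + 6*(x + 1);
    if (x >=  0 && x <  1) return  x*x*x + 2*(x - 1)*(x - 1)*(x - 1) - 6*(x - 1);
    if (x >=  1 && x <  2) return (2 - x) * (2 - x) * (2 - x);
    return 0.0;
}

// Software emulation of a linear texture fetch of the coefficient
// array at fractional position x (equation 4.7 applied to c).
double linearFetch(const double* c, int n, double x) {
    int p = static_cast<int>(std::floor(x));
    double xi = x - p;
    double c0 = (p     >= 0 && p     < n) ? c[p]     : 0.0;
    double c1 = (p + 1 >= 0 && p + 1 < n) ? c[p + 1] : 0.0;
    return c0 * (1.0 - xi) + c1 * xi;
}

// Direct four-tap sum (4.16); assumes p-1 .. p+2 are valid indices.
double spline4Tap(const double* c, int p, double xi) {
    return c[p-1]*b(xi+1) + c[p]*b(xi) + c[p+1]*b(xi-1) + c[p+2]*b(xi-2);
}

// Two weighted linear fetches (equations 4.21-4.25).
double spline2Fetch(const double* c, int n, int p, double xi) {
    double g0 = b(xi + 1) + b(xi);           // (4.22)
    double g1 = b(xi - 1) + b(xi - 2);       // (4.23)
    double h0 = b(xi) / g0 - 1.0;            // (4.24)
    double h1 = b(xi - 2) / g1 + 1.0;        // (4.25)
    return g0 * linearFetch(c, n, p + h0) + g1 * linearFetch(c, n, p + h1);
}

// Maximum discrepancy between the two evaluations on sample data.
double maxDiff() {
    const double c[6] = {0.3, -1.2, 2.0, 0.7, -0.4, 1.1};
    double worst = 0.0;
    for (int p = 1; p <= 3; ++p)
        for (double xi = 0.0; xi < 1.0; xi += 0.1)
            worst = std::max(worst,
                             std::fabs(spline2Fetch(c, 6, p, xi) - spline4Tap(c, p, xi)));
    return worst;
}
```

In hardware the two fetches run on the texture unit's fixed-function bilinear filter; in 2D the same trick turns 16 nearest neighbor fetches into 4 bilinear fetches.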

Though this reformulation does not seem significant in the one-dimensional situation, just one dimension higher it translates to avoiding 16 explicit near neighbor fetches as compared to just


Figure 4.7: mother spline and derivative, Image courtesy GPU GEMS2

4 hard-wired linear interpolations. For calculating the derivatives, a straightforward replacement of the motherspline filter kernel with its derivative, as shown in Figure 4.7 above, suffices. The rest of the formulation remains the same, except that the weighted sum of the final interpolant is replaced by the difference of the returned texture values. This is because the weights b′(ξ) of the derivative sum up to 0 and not 1.

In the next section a detailed performance analysis of all the interpolation compute kernels, from both the CUDA enabled and the pure MATLAB FAIR methods, is presented.

4.2.7 CUDA MEX interpolation results

Using the tic/toc timing routines in MATLAB and the gettimeofday() function, the following runtime results for the various kernels were measured. The example used was a simple rotation transform on a discretized grid about the center of the domain. The test data used was the HNSP data. The test m-file is given in Figure 4.8 below.

setupHNSPData;
for level = 5:8
  omega = MLdata{level}.omega; m = MLdata{level}.m;
  xc = getCenteredGrid(omega,m);
  alpha = pi/6; R = [cos(alpha),-sin(alpha);sin(alpha),cos(alpha)];
  center = (omega(2:2:end)-omega(1:2:end))'/2;
  wc = [alpha;(eye(2)-R)*center];
  yc = rigid2D(wc,xc);
  T = getSplineCoefficients(dataT,'dim',2,'regularizer','gradient','theta',50);
  [Tc,dT] = splineInter2D(T,omega,yc,'m',m);
  figure(1); viewImage2D(Tc,omega,m,'colormap','gray(256)');
end

Figure 4.8: m file for run time testing


Grid Size   linearInter2D    splineInter2D    splineInter2D         splineInter2D
            (FAIR) (msecs)   (FAIR) (msecs)   (NN texture) (msecs)  (bilinear texture) (msecs)
64X32       23.717           28.856           0.065                 0.048
128X64      67.898           78.599           0.088                 0.049
256X128     216.525          229.961          0.134                 0.067
512X256     556.287          575.266          0.298                 0.088

Table 4.1: CUDA MEX interpolation kernel runtime

Figure 4.9: Runtime test for CUDA MEX interpolation

Grid        splineInter2D (NN texture)              splineInter2D (bilinear texture)
Size        Measured    Theoretical  Theoretical    Measured    Theoretical  Theoretical
            effective   worst        best           effective   worst        best
            bandwidth   case         case           bandwidth   case         case
64X32       1.44        2.39         0.5            1.44        3.24         0.68
128X64      2.45        7.07         1.49           4.15        12.71        2.67
256X128     4           18.58        3.91           10.66       37.17        7.83
512X256     9.14        33.43        7.04           26.76       113.2        23.83

Table 4.2: CUDA MEX interpolation kernel bandwidth

From Table 4.1 and Figure 4.9 it is evident that the GPU spline kernels are considerably faster than the MATLAB version. But a reliable performance assessment of the implementation cannot be made based solely on a comparison with an existing implementation on another architecture. For this purpose a more quantitative approach was to utilize the profiler tool provided by NVIDIA to obtain statistics such as the overall global memory throughput, the GPU occupancy etc. These global memory throughput measurements at full GPU occupancy have been collected in Table 4.2.

A good implementation on textures would be one where the texture cache is used to the maximum. An indication of this would be a very low number of global memory accesses, which should happen only in case of a cache miss. Since the profiler does not account for the number of texture cache misses, theoretical estimates for the bounds on the global memory accesses have to be made. In this context the best and worst case scenarios for the bspline method can be


calculated as follows: the splineInter2D implementation using the near neighbor texture mode performs 16 float texture fetches per point to carry out the weighted sum in equation 4.12.

Apart from that, there are two explicit reads from global memory to access the input (the transformed coordinate values from FAIR's trafo module) and one write to global memory for storing the interpolation result. Of these 19 float accesses, in the best case all 16 neighboring texels are already present in the cache. But allowing for one initial implicit load into texture memory from global memory per thread, the memory bandwidth is calculated for 4 global memory accesses per interpolant calculation. The worst case is calculated taking into account all 19 possible global memory accesses. Though the splineInter2D implementation using the texture linear filtering mode performs 4 explicit linear texture fetches, the bandwidth calculations remain the same as in the near neighbor case, as each linear texture fetch implicitly accesses the 4 nearest texels to perform the bilinear interpolation.

Figure 4.10: Bandwidth test for CUDA MEX interpolation

It is seen that both texture based implementations make good use of the texture cache. In particular, the bspline implementation using linear filtering is approximately 2-3 times faster than the near neighbor implementation. This is on account of the hard-wired bilinear implementation in the graphics hardware. In addition, the near neighbor implementation has a few divergent branches in the routine evaluating the weights based on the mother spline. Divergent branches within a single warp of a thread block penalize the execution time within the block. The rate at which the worst case bandwidth estimation approaches the maximum available bandwidth on the GPU 1 is an indication of the suitability of textures for applications like interpolation where specific access patterns cannot be guaranteed.

As mentioned earlier, this fast implementation relies heavily on the graphics hardware that performs the bilinear filtering on textures. The hardware is able to do so by performing the filtering in low precision. The lower precision is also on account of the fixed point storage of the fraction ξ in equation 4.5. Since image registration is primarily a constrained optimization problem, the accuracy of the derivatives plays a huge role in finding the global minimum of the objective function. Therefore, it is necessary to investigate the effect this low precision calculation might have on the FAIR interpolants and the corresponding derivatives. This is performed in two steps:

1. By simply checking the difference between the interpolants and derivatives calculated purely within MATLAB and the corresponding values from the CUDA MEX implementation, and thereafter measuring the rms values of this result.

2. If f is a multivariate function f : R^n → R and v ∈ R^n an arbitrary vector, the Taylor expansion gives f(x + h · v) = f(x) + h · df(x) · v + O(h²). A matrix A is the derivative of f if and only if the difference

||f(x + hv) − f(x) − hAv||                     (4.26)

1 The maximum possible memory bandwidth is approximately 110 GB/sec (ideal) and 93 GB/sec (based on benchmarks performed on device to device memory copies for the GTX 295)


is essentially quadratic in h. The MATLAB code in Figure 4.11 performs this computation and also visualises the results in corresponding graphs.
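The quadratic decay demanded by equation 4.26 can be illustrated with a small C++ sketch; the function f and all names here are stand-ins chosen for this illustration, not the FAIR image model:

```cpp
#include <cassert>
#include <cmath>

// Derivative test in the spirit of FAIR's checkDerivative: with the
// analytic gradient df, the error |f(x + h v) - f(x) - h df(x) v|
// must drop quadratically in h (equation 4.26). f is a toy function.
double f(double x1, double x2)   { return std::sin(x1) * x2 + x1 * x1; }
double df1(double x1, double x2) { return std::cos(x1) * x2 + 2.0 * x1; }
double df2(double x1, double)    { return std::sin(x1); }

// T1(h) = |f(x + h v) - f(x) - h * df(x) * v| at a fixed point x and
// direction v.
double taylorError(double h) {
    const double x1 = 0.4, x2 = -1.3, v1 = 0.8, v2 = 0.6;
    double lin = f(x1, x2) + h * (df1(x1, x2) * v1 + df2(x1, x2) * v2);
    return std::fabs(f(x1 + h * v1, x2 + h * v2) - lin);
}
```

Dividing h by 10 shrinks the error by roughly a factor of 100, mirroring the T1 columns of Figure 4.11 until round-off takes over.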


 2  setupMRIData;
 3  m  = 128*[3,2];
 4  xc = getCenteredGrid(omega,m);
 5  T  = getSplineCoefficients(dataT,'dim',2,'regularizer','gradient','theta',50);
 6  [Tc,dT] = splineInter2D(T,omega,xc,'m',m);
 7  figure(1); viewImage2D(Tc,omega,m,'colormap','gray(256)');
 8
 9  fctn = @(x) splineInter2D(single(T),omega,x,'m',m);
10  [fig,ph,th] = checkDerivative(fctn,single(xc));

h            splineInter2D (FAIR)                splineInter2D (CUDA MEX)
             T0 = |f0−ft|   T1 = |f0+h·f0′−ft|   T0 = |f0−ft|   T1 = |f0+h·f0′−ft|
1.0000e-01   2.8280e+02     1.5068e+01           2.8590e+02     1.5162e+01
1.0000e-02   2.8312e+01     1.5115e-01           2.8621e+01     1.5175e-01
1.0000e-03   2.8315e+00     5.4799e-03           2.8623e+00     5.4119e-03
1.0000e-04   2.8328e-01     5.2735e-03           2.8621e-01     5.1618e-03
1.0000e-05   2.8751e-02     5.2539e-03           2.9050e-02     5.1319e-03
1.0000e-06   1.9615e-03     2.7349e-03           1.9775e-03     2.7321e-03
1.0000e-07   1.9082e-09     2.8314e-04           1.9465e-08     2.8623e-04
1.0000e-08   8.5628e-18     2.8314e-05           7.5989e-18     2.8623e-05
1.0000e-09   0.0000e+00     2.8314e-06           0.0000e+00     2.8623e-06
1.0000e-10   0.0000e+00     2.8314e-07           0.0000e+00     2.8623e-07

Figure 4.11: Test for accuracy of CUDA MEX interpolation

Since texture filtering is currently supported only for floats, the corresponding splineInter2D FAIR implementation is made to perform the computation in single precision, as compared to the native double computation in MATLAB. This is shown in line 10 of the MATLAB code in Figure 4.11. The derivative result is cast back to double, as FAIR handles the derivatives by storing them in a sparse MATLAB array, which currently does not support operations on sparse single arrays. From the results in Figure 4.11 it can be seen that the derivative computation in the CUDA MEX splineInter2D is accurate enough, as it satisfies the condition in equation 4.26. Since float precision for values in the range [−1, 1] is approximately 10^−7, we can safely assume this to translate to between 10^−7 and 10^−6 here, as the range of the derivatives is at least one to two orders of magnitude larger. This justifies the round-off error witnessed in both graphs, for the single precision MATLAB computation (left) and the CUDA MEX splineInter2D implementation (right), at around h = 10^−6 on the logarithmic x-axis. With this test it is safe to assume that these derivatives can be used not only in parametric registration but also in non-rigid registration, where the requirement for accurate derivatives is more stringent. With the most time consuming part of the registration cycle implemented and discussed in this section, the following sections briefly discuss the other functional modules of the FAIR software within

Page 43: Contents · 1.3 Chapter synopsis In this section the structure and overview of the following chapters are brie y outlined Chapter 2 - FAIR This chapter introduces a generic software

Chapter 4. CUDA enabled FAIR 38

the registration cycle.

4.3 Parameterized transformation in CUDA enabled FAIR

A parametric transformation is a function y : Rd → Rd whose components are linear combinations of certain basis functions q_l, with the coefficients given by the parameters w_l. For example, the linear function y : R → R with y = w1 x + w2 is parameterized by the parameters w = [w1; w2] and the basis functions q1(x) = x and q2(x) = 1. Setting Q(x) = [q1(x), q2(x)] yields the compact description y = Q(x)w. Choosing a collection xc of points to be mapped, the transformed points are thus yc = Q(xc)w, or using FAIR notation: yc = Q*wc, where Q = Q(xc) and wc = w.
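The compact description yc = Q(xc)w can be sketched for the 1-D linear example above; the following NumPy snippet is illustrative only (the helper name Q_linear is ours, not part of FAIR):

```python
import numpy as np

def Q_linear(xc):
    """Basis matrix Q(xc) for the 1-D linear model y = w1*x + w2,
    built from the basis functions q1(x) = x and q2(x) = 1."""
    xc = np.asarray(xc, dtype=float)
    return np.column_stack([xc, np.ones_like(xc)])

xc = np.array([0.0, 0.5, 1.0])
wc = np.array([2.0, 1.0])        # the line y = 2x + 1
yc = Q_linear(xc) @ wc           # FAIR notation: yc = Q*wc
```

Each row of Q corresponds to one grid point, so one matrix-vector product maps the whole collection of points at once.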

4.3.1 Affine Linear Transformations

A simple example of the parametrized transformation mentioned above is the affine linear transformation. An affine linear transformation allows for translation, rotation, shearing, and individual scaling. The components of an affine linear transformation are

y1 = w1 x1 + w2 x2 + w3,

y2 = w4 x1 + w5 x2 + w6,

where, w = [w1; . . . ;w6] ∈ R6 parametrizes the transformation. With

    Q(x) = [ x1  x2   1   0   0   0
              0   0   0  x1  x2   1 ],    y = Q(x)w.    (4.27)
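The assembly of Q(x) for a single point can be illustrated with a small NumPy sketch (affine_Q is a hypothetical helper, not FAIR code); note that w = [1; 0; 0; 0; 1; 0] yields the identity map:

```python
import numpy as np

def affine_Q(x1, x2):
    """2x6 basis matrix Q(x) of equation 4.27 for a single point (x1, x2)."""
    return np.array([[x1,  x2,  1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, x1,  x2,  1.0]])

# Identity parameters map every point to itself.
w_id = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
y = affine_Q(2.0, 3.0) @ w_id
```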

A particular affine transformation is the so-called rigid transformation, which only allows for translations and rotations. The components of the rigid transformation are given by,

y1 = cos(w1)x1 − sin(w1)x2 + w2,

y2 = sin(w1)x1 + cos(w1)x2 + w3,

where w = [w1; w2; w3] ∈ R3 parametrizes the transformation. Although this function is non-linear in w, it still allows an expansion y(x) = Q(x)f(w), with Q from equation 4.27 and f(w) = [cos w1; −sin w1; w2; sin w1; cos w1; w3].
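The claimed expansion can be checked numerically. The NumPy sketch below (function names are illustrative, not FAIR's) compares Q(x)f(w) with the component-wise rigid formulas above:

```python
import numpy as np

def f_rigid(w):
    """Nonlinear coefficient map f(w) = [cos w1; -sin w1; w2; sin w1; cos w1; w3]."""
    w1, w2, w3 = w
    return np.array([np.cos(w1), -np.sin(w1), w2,
                     np.sin(w1),  np.cos(w1), w3])

def rigid_direct(w, x):
    """The rigid transformation written out component-wise."""
    w1, w2, w3 = w
    x1, x2 = x
    return np.array([np.cos(w1)*x1 - np.sin(w1)*x2 + w2,
                     np.sin(w1)*x1 + np.cos(w1)*x2 + w3])

def Q(x):
    """Affine basis matrix of equation 4.27 for one point."""
    x1, x2 = x
    return np.array([[x1, x2, 1, 0, 0, 0],
                     [0, 0, 0, x1, x2, 1]])

w = np.array([0.3, 1.0, -2.0])
x = np.array([0.7, 0.2])
# The expansion y = Q(x) f(w) agrees with the direct formula.
```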

As an example of a transformation with only one parameter w ∈ R, a rotation about the center of the domain c = (ω2 − ω1, ω4 − ω3)/2 is considered. A simple way to perform this transformation is to shift c to the origin, rotate about the origin, and shift back. With,

    R = [ cos w  −sin w
          sin w   cos w ]

it holds (y − c) = R(x− c), which results in y = Rx+ (I −R)c in the original domain.
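This construction can be verified numerically with a short NumPy sketch (rotate_about_center is a hypothetical helper): the center c is a fixed point of the transformation, and (y − c) = R(x − c) holds for any point:

```python
import numpy as np

def rotate_about_center(x, w, c):
    """y = R x + (I - R) c: shift c to the origin, rotate by angle w, shift back."""
    R = np.array([[np.cos(w), -np.sin(w)],
                  [np.sin(w),  np.cos(w)]])
    return R @ x + (np.eye(2) - R) @ c

c = np.array([1.0, 0.5])     # center of the domain
x = np.array([1.5, 0.9])     # an arbitrary point
w = 0.4                      # rotation angle
R = np.array([[np.cos(w), -np.sin(w)],
              [np.sin(w),  np.cos(w)]])
```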

These transformations discussed above are shown in Figure 4.12.


Figure 4.12: Translation, rigid, and affine linear transformations of an ultrasound image. For each transformation the panels show T(xc) with grid yc = y(xc) (left) and T(yc) with grid xc (right).

4.3.2 Derivatives

For the optimization schemes to be used later, derivatives of the parametric transformations are required. Note that y = y(w) is considered as a function of the parameters w. For cases where y = Q(x)w, this derivative is simply Q(x). However, in an efficient code, the matrix Q should not be assembled every time the transformation is called. This is resolved by using the MATLAB persistent variable discussed in the previous chapter. In FAIR, persistent variables are initialized by calling the function without an output request.

Rigid transformations depend non-linearly on w and the derivative is thus slightly more complex. Recall that a 2D rigid transformation is given by

y(w, x) = Q(x)f(w),

with Q from equation 4.27 and f(w) = [cosw1;− sinw1;w2; sinw1; cosw1;w3].

Therefore dwy = Q(x)df , with

    df = [ −sin w1   0   0
           −cos w1   0   0
              0      1   0
            cos w1   0   0
           −sin w1   0   0
              0      0   1 ]
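The entries of df can be validated against finite differences of f(w); the following NumPy sketch (illustrative code, not part of FAIR) performs such a check column by column:

```python
import numpy as np

def f_rigid(w):
    """Coefficient map f(w) = [cos w1; -sin w1; w2; sin w1; cos w1; w3]."""
    w1, w2, w3 = w
    return np.array([np.cos(w1), -np.sin(w1), w2,
                     np.sin(w1),  np.cos(w1), w3])

def df_rigid(w):
    """6x3 Jacobian of f with respect to w = (w1, w2, w3)."""
    w1 = w[0]
    return np.array([[-np.sin(w1), 0, 0],
                     [-np.cos(w1), 0, 0],
                     [0,           1, 0],
                     [ np.cos(w1), 0, 0],
                     [-np.sin(w1), 0, 0],
                     [0,           0, 1.0]])

# Finite-difference check: (f(w + h e_k) - f(w)) / h should match column k of df.
w = np.array([0.3, 1.0, -2.0])
h = 1e-6
for k in range(3):
    e = np.zeros(3); e[k] = 1.0
    fd = (f_rigid(w + h*e) - f_rigid(w)) / h
    assert np.allclose(fd, df_rigid(w)[:, k], atol=1e-5)
```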


[y, dy] = trafo(wc, xc)
  wc ∈ Rp       parameters of the transformation
  xc ∈ Rnd      grid points
  yc ∈ Rnd      transformed grid points
  dy ∈ Rnd,p    derivative of the transformation with respect to wc

Table 4.3: The transformation toolbox

4.3.3 Summarizing the Parameterized Transformations

This section introduces a unified way of coding parameterized transformations, yc = trafo(wc,xc), and computing their derivatives. Particular transformations discussed in this section are translations, rigid, and affine linear transformations. Using the interpolation schemes introduced in section 4.2, the transformed image can be computed conveniently. These techniques also enable the solution of the so-called forward problem, i.e. given a parameter vector w, compute the transformation y = Qw and the transformed image T[y].

Given points xc in a domain Ω and parameters wc, the trafo function computes the locations of the transformed points yc, i.e. yc = Q*f(wc), where Q and f depend on the specific transformation.

4.3.4 CUDA MEX parameterized transformation

As mentioned earlier in section 4.3.2, assembling the large matrix Q of the transformation equation 4.27 on every call is very inefficient. To resolve this problem FAIR uses persistent variables in MATLAB. Persistent variables in MATLAB are visible only within the function where they are created, and their values are retained across successive calls of the function. This avoids the need to construct the large matrix Q every time rigid2D is called.
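The effect of a persistent variable can be mimicked in Python with a module-level cache. The sketch below is an illustrative analogue only, not FAIR's MATLAB implementation: it assembles the (2n × 6) matrix Q of equation 4.27 once and reuses the stored object on repeated calls with the same grid:

```python
import numpy as np

_Q_cache = {}   # module-level store, playing the role of MATLAB's persistent variable

def get_Q(xc):
    """Assemble (or reuse) the basis matrix Q(xc) for an (n, 2) array of points.

    The grid is keyed by its raw bytes, so repeated calls with the same grid
    return the stored matrix instead of rebuilding it."""
    key = xc.tobytes()
    if key not in _Q_cache:
        n = xc.shape[0]
        Q = np.zeros((2*n, 6))
        Q[0::2, 0] = xc[:, 0]; Q[0::2, 1] = xc[:, 1]; Q[0::2, 2] = 1.0
        Q[1::2, 3] = xc[:, 0]; Q[1::2, 4] = xc[:, 1]; Q[1::2, 5] = 1.0
        _Q_cache[key] = Q
    return _Q_cache[key]

xc = np.array([[0.0, 0.0], [1.0, 2.0]])
Q1 = get_Q(xc)
Q2 = get_Q(xc)       # second call reuses the cached matrix (Q1 is Q2)
```

On the GPU the same idea amounts to leaving the device arrays allocated between MEX calls, which is exactly what the rigid2D implementation below does.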

One of the achievements of this work was to do the same within the CUDA enabled FAIR toolbox, i.e. to assemble results within the CUDA MEX function and retain the memory on the device. There is a huge advantage in doing this, as host ↔ device transfers are usually the biggest hindrance for applications running on the GPU. This is explained using relevant sections of the rigid2D CUDA MEX implementation in Figure 4.13 on the next page. The core idea of this method is to prevent MATLAB from clearing the CUDA MEX memory, as it would for ordinary automatic MEX variables on exiting the function.

Therefore, using the mexAtExit function, the MATLAB environment is informed that the device memory variables xf_gpu and yf_gpu will be cleared at the exit of the CUDA MEX function by the user-defined subroutine cleanup. On the first execution of the CUDA MEX function rigid2D, the input grids are stored in the arrays xf_gpu and yf_gpu and the flag initialised_rigid is set.

Thereafter, in every subsequent call, the stored device arrays can be accessed without having to copy them from host memory. The effect of storing the Q matrix on the device, as compared to a host-device transfer on every call, is shown in Table 4.4. From the table it is seen that with increasing grid size the percentage of time saved also increases significantly, as the computation time gets closer to the time required for transferring the data. Using this method it has been possible to store only standard data type arrays such as float. CUDA-specific types such as CUDA arrays and texture references resulted in kernel launch failures. This is most likely on account of using the runtime API for this thesis. More on this and a possible solution are discussed in the next chapter.


#include "cuda.h"
#include "mex.h"
........

// Static variables to retain device memory locations
static float *xf_gpu, *yf_gpu;
static float *yc_gpu;
static int initialised_rigid = 0;

// Routine to clear the CUDA MEX persistent variables
__host__ void cleanup(void)
{
    mexPrintf("MEX-file rigid2D is terminating, destroying arrays\n");
    cudaFree(xf_gpu); cudaFree(yf_gpu); cudaFree(yc_gpu);
}

////////////////////////////////////////////////////////////////////////////
//! Kernel to transform an image
//! @param y_xf_gpu, y_yf_gpu  output data in global memory
//! @param xf_gpu, yf_gpu      input data (Q) from global memory
////////////////////////////////////////////////////////////////////////////
__global__ void rigid2DKernel(float *y_xf_gpu, float *y_yf_gpu,
                              float *xf_gpu, float *yf_gpu,
                              rigid_data dd, int xwidth, int xheight)
{ .... }

////////////////////////////////////////////////////////////////////////////
// Gateway function:  [yc,dy] = rigid2D(w,x,varargin);
////////////////////////////////////////////////////////////////////////////
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    rigid_data rig_data;
    mxClassID  category;

    // Find the dimensions of the data
    xm = mxGetM(prhs[1]);  xn = mxGetN(prhs[1]);
    category = mxGetClassID(prhs[1]);

    // Allocate memory for output
    ....
    cudaMalloc((void **)&yc_gpu, sizeof(float)*xn*xm/2);

    // Allocate memory for the result of applying Q
    float *y_xf_gpu, *y_yf_gpu;
    cudaMalloc((void **)&y_xf_gpu, sizeof(float)*xm/2*xn);
    cudaMalloc((void **)&y_yf_gpu, sizeof(float)*xm/2*xn);

    xwidth = (int)(xm/2);  xheight = (int)(xn);
    // 2D execution configuration using 128 threads per block
    dim3 dimBlock(128, 1);
    dim3 dimGrid(xwidth/dimBlock.x, xheight/dimBlock.y);

    if (!initialised_rigid)
    {
        x = mxGetPr(prhs[1]);

        // Allocate memory for Q
        cudaMalloc((void **)&xf_gpu, sizeof(float)*xn*xm/2);
        cudaMalloc((void **)&yf_gpu, sizeof(float)*xn*xm/2);
        ....
        // Construct Q using the input data
        cudaMemcpy(xf_gpu, x, sizeof(float)*xn*xm/2, cudaMemcpyHostToDevice);
        cudaMemcpy(yf_gpu, (float *)(x)+(int)(xn*xm/2),
                   sizeof(float)*xn*xm/2, cudaMemcpyHostToDevice);
        ....
        // Register the cleanup function and set the flag for CUDA memory cleanup
        mexAtExit(cleanup);
        initialised_rigid = 1;
        ....
        // Call the kernel to perform rigid2D on the GPU
        rigid2DKernel<<<dimGrid, dimBlock>>>(y_xf_gpu, y_yf_gpu, xf_gpu, yf_gpu,
                                             rig_data, xwidth, xheight);
        cutilSafeCall(cudaThreadSynchronize());
        ....
    }
    else
    {
        // Call the kernel reusing the retained device arrays
        rigid2DKernel<<<dimGrid, dimBlock>>>(y_xf_gpu, y_yf_gpu, xf_gpu, yf_gpu,
                                             rig_data, xwidth, xheight);
        cutilSafeCall(cudaThreadSynchronize());
    }
    ....
    // Set the result to the device pointer
    mxArray *pArray = mxCreateDoubleMatrix(0, 0, mxREAL);
    mxSetPr(pArray, yc_gpu);
    mxSetM(pArray, xm);
    mxSetN(pArray, xn);
    ....
    // Clean up non-persistent memory on device and host
    cudaThreadExit();
}

Figure 4.13: Usage of persistent memory in rigid2D


Grid Size   Grid Size   rigid2D (CUDA MEX)   rigid2D (CUDA MEX)   Percentage of time saved
X           Y           (no persistent)      (persistent)         using persistent memory
64          32          0.2181               0.2139               2
128         64          0.2369               0.2243               5
256         128         0.2289               0.2233               2
512         256         0.2247               0.2142               5
512         512         0.2320               0.2200               5
1024        512         0.2427               0.2135               12
1024        1024        0.2683               0.2329               13
2048        1024        0.2874               0.2379               17

Table 4.4: Averaged runtimes of rigid2D on GPU

4.4 Similarity Measure in CUDA enabled FAIR

In this section, the L2-norm of the difference image, or sum of squared differences (SSD), is introduced as a prototype of a distance measure. The distance to be discussed basically measures the energy contained in the difference image T[y] − R. For this to be meaningful, it has to be assumed that the intensities of the two images are comparable, i.e. the gray value of a particle is more or less the same in the template and in the reference image. The measure is defined as follows.

Given T and R, the SSD is

    DSSD[T, R] = 0.5 ∫Ω (T(x) − R(x))² dx.

Though a continuous setting is used, the integral cannot be computed analytically. Therefore, numerical integration or quadrature is required. A discrete analogue of the SSD is given by a numerical integration of the function ψ(x) = 0.5(T(x) − R(x))², where T and R are the interpolants of the template and reference, respectively. For a particular h, let xc denote the corresponding cell-centered grid of width h, and let Th = T(xc) and Rh = R(xc). The discretized SSD is based on a midpoint quadrature rule with an a priori chosen cell-centered grid of width h and reads

[Dc, rc, dD, dr, d2psi] = distances(Tc, Rc, omega, m)
  Tc ∈ Rn        sampled transformed template, Tc = T(yc)
  Rc ∈ Rn        sampled reference, Rc = R(xc)
  omega, m       specify domain and discretization
  Dc ∈ R         distance between Tc and Rc
  rc ∈ Rp        residual r, e.g. r = Tc − Rc for SSD,
                 or r = ρ(T,R), the joint density estimator, for MI
  dD ∈ Rnd       derivative of D w.r.t. Tc
  dr ∈ Rp,nd     derivative of rc w.r.t. Tc
  d2psi ∈ Rq,q   second derivative of the outer function ψ w.r.t. r

Table 4.5: The FAIR distance toolbox

    DSSD,h(Th, Rh) = 0.5 · hd · ‖Th − Rh‖²,  where hd = h1 · · · hd.
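A minimal NumPy sketch of this midpoint-rule SSD follows; the function name ssd and its argument conventions are illustrative assumptions, not FAIR's distance module:

```python
import numpy as np

def ssd(Th, Rh, omega, m):
    """Midpoint-rule SSD: 0.5 * hd * ||Th - Rh||^2 on a cell-centered grid.

    omega = (o1, o2, o3, o4) describes the 2-D domain (o1,o2) x (o3,o4),
    m = (m1, m2) the number of cells per direction."""
    h = np.array([(omega[1] - omega[0]) / m[0],
                  (omega[3] - omega[2]) / m[1]])
    hd = np.prod(h)                  # cell volume h1*h2
    r = (Th - Rh).ravel()            # residual on the grid
    return 0.5 * hd * np.dot(r, r)

# Domain (0,2)x(0,1) with 4x2 cells: hd = 0.5*0.5 = 0.25,
# a constant difference of 1 on 8 cells gives D = 0.5*0.25*8 = 1.0.
D = ssd(np.ones((4, 2)), np.zeros((4, 2)), (0, 2, 0, 1), (4, 2))
```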

FAIR enables a convenient use of different measures including their analytical derivatives. As for the interpolation or transformation modules, a generic distance measure module distance is provided. On the basis of a persistent parameter OPTN, this function allows a convenient integration of the various measures.


This function computes an integral-based distance measure by approximating the integral using a midpoint quadrature rule based on a cell-centered grid xc of m points for the domain Ω. The distance measure D(Tc) = ψ(r(Tc)) is coded as a composition of an outer function ψ and the residual r to enable a Gauss-Newton type optimization scheme. The input and output of the function are summarized in Table 4.5. The function returns its derivative and an approximation to the Hessian, given by ∇²D ≈ H = dr' * d2psi * dr.

4.5 Parametric Image Registration in CUDA enabled FAIR

Using the functional modules discussed above, a joint objective function J is created from:

• A parametric transformation y(w) = y(w, x), here the CUDA enabled rigid transformation.

• The transformed template T[y] or T(yc), obtained from the CUDA spline interpolation function.

• The distance measure D[T[y], R], which in this case is the CUDA SSD implementation.

This is summarized mathematically as J[w] = D[T(y(w, x)), R] + S(w), where S is a regularizer on the optimization parameters, used to bias towards a particular solution or to penalize an unwanted one. In the current example this extra regularizer has not been used. In order to collect all the concepts under a single formulation and operate at a higher level of abstraction, FAIR provides an objective function J. Based on a certain discretization xc and the current parameters wc, this function computes:

1. The transformation yc = y(wc, xc).

2. The transformed image T (yc).

3. The distance D(T (yc),R).

The following MATLAB implementation outlines the objective function.

function [Jc,dJ,H] = PIRobjFctn(T,Rc,omega,m,xc,beta,wc)
[yc,dy] = trafo(wc,xc);                          % compute transformation
[Tc,dT] = inter(T,omega,yc);                     % compute transformed image
[Jc,rc,dD,dr,d2psi] = distance(Tc,Rc,omega,m);   % compute distance
dJ = dD*dT*dy;  dr = dr*dT*dy;                   % multiply outer and inner derivatives
H  = dr'*d2psi*dr + beta*speye(length(wc));      % compute approximation to Hessian

The discretized objective function for parametric image registration is

    Jh(wc) = Dh(T(yc), R(xc)) + S(wc),

where yc = y(wc, xc), Dh is a distance measure, and S is a regularizer of the coefficients.

The specific transformation, interpolation, and distance measure are supplied by trafo, inter, and distance. Note that H could be a matrix or a function handle if a matrix-free code is used.

For the optimization process a Gauss-Newton scheme is provided by FAIR. The central idea of this scheme is to iteratively solve a Quasi-Newton system H dw = −dJ, thereby obtaining a better update dw of the initial guess wc in every iteration. Though the function solveGN provides the flexibility to choose between various iterative solvers, for both the current FAIR and CUDA MEX examples MATLAB's backslash operator is used to solve the Quasi-Newton system in every iteration. The iteration is stopped when the following stopping criteria are met:


[Jc, para, dJ, H] = PIRobjFctn(T, Rc, omega, m, beta, M, wRef, xc, wc)
  T, Rc       template and sampled reference, Rc = R(xc)
  omega, m    specify discretization of Ω
  beta ≥ 0    parameter for regularizing the Hessian
  M, wRef     optional regularization (default: M=[]; wRef=[])
  xc ∈ Rdn    underlying grid points
  wc ∈ Rp     current parameters
  Jc ∈ R      current objective function value based on wc
  para        (structure) collects intermediates for visualization:
              para = {Tc, Rc, omega, m, yc, Jc}, where yc = y(wc, xc) and Tc = T(yc)
  dJ ∈ R1,p   derivative of J w.r.t. wc
  H ∈ Rp,p    approximation to the Hessian, H ≈ d2D + d2S + βI

Table 4.6: PIR objective function

STOP(1) = abs(Jold-Jc)  <= tolJ*(1+abs(JRef)); % relative variation in the objective function
STOP(2) = norm(yc-yold) <= tolW*(1+norm(yc));  % relative variation in the parameters
STOP(3) = hd*norm(dJ)   <= tolG*(1+abs(JRef)); % the norm of the gradient
STOP(4) = norm(dJ)      <= eps;                % comparison with machine precision
STOP(5) = (iter > maxIter);                    % comparison with predefined max iterations
STOP = all(STOP(1:3)) | any(STOP(4:5));

Given below is a typical call of the Gauss-Newton scheme and the FAIR implementation of the scheme itself.

fctn = @(wc) PIRobjFctn(T,Rc,omega,m,beta,M,wRef,xc,wc); % handle to objective function
w0 = trafo('w0');                                        % initial guess
[wc,His] = GaussNewtonArmijo(fctn,w0);                   % call the optimizer

function [wc,His] = PIRGaussNewtonArmijo(T,R,omega,m,yc)
% -- start initial phase -------------------------------------------------
[Jc,dJ,H] = fctn(yc);                        % compute current values
% -- start iteration phase -----------------------------------------------
while 1,
  iter = iter + 1;                           % update iteration count
  checkStoppingRules;                        % check the stopping rules
  dy = -H\dJ;                                % solve the Quasi-Newton system
  [t,yt,LSiter] = Armijo(fctn,yc,dy,Jc,dJ);  % perform Armijo line-search
  if LSiter<0, break; end;                   % break if line-search fails
  yc = yt; [Jc,dJ,H] = fctn(yc);             % update current values
end; %while
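The structure of the scheme can be illustrated with a compact Python analogue on a toy least-squares problem. The names gauss_newton_armijo and fctn are ours; this is a sketch of the idea, not FAIR's solver:

```python
import numpy as np

def gauss_newton_armijo(fctn, w0, tol_g=1e-8, max_iter=50):
    """Minimal Gauss-Newton loop in the spirit of PIRGaussNewtonArmijo.

    fctn(w) returns (J, dJ, H); each iteration solves the Quasi-Newton
    system H dw = -dJ and applies an Armijo backtracking line search."""
    wc = w0.astype(float)
    Jc, dJ, H = fctn(wc)
    for _ in range(max_iter):
        if np.linalg.norm(dJ) <= tol_g:           # gradient-based stopping rule
            break
        dw = np.linalg.solve(H, -dJ)              # dw = -H \ dJ
        t = 1.0                                   # Armijo backtracking
        while fctn(wc + t*dw)[0] > Jc + 1e-4 * t * (dJ @ dw):
            t *= 0.5
            if t < 1e-10:                         # line search failed
                return wc
        wc = wc + t*dw
        Jc, dJ, H = fctn(wc)
    return wc

# Toy least-squares problem: r(w) = A w - b, J = 0.5 ||r||^2,
# dJ = A^T r, and the Gauss-Newton Hessian is H = A^T A.
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, 1.0, 2.0])

def fctn(w):
    r = A @ w - b
    return 0.5 * (r @ r), A.T @ r, A.T @ A

w_star = gauss_newton_armijo(fctn, np.zeros(2))
```

For a linear least-squares problem the Gauss-Newton Hessian is exact, so the loop converges in a single step; for the nonlinear registration objective the same loop takes many iterations, as in the experiments below.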


4.6 Experiment and results: Fixed Level Parametric Image Registration on CUDA enabled FAIR

The data used in this section is provided by setupHNSPdata and the multilevel representation is generated using getMultilevel. Here, Ω = (0, 2) × (0, 1), a spline-based interpolation approach, the SSD distance measure, and no additional regularization of the parameters are used. The reduction of the distance is measured by

reduction = J(wc)/J(w0), (4.28)

where,

w0 is the starting guess and wc is the numerical minimizer.

The experiments are performed for rigid transformations and are based either on a coarse level or on a fine level representation of the data:

level     l    ml
coarse    6    [128, 64]
fine      8    [512, 256]

The results obtained for the coarse and fine level are very close, although the optimization on the fine level is much more expensive, i.e. more iterations are needed and the computation consumes much more time. Results for the coarse level and the fine level are shown in Figures 4.14 and 4.17, respectively. The convergence history is presented in Figures 4.15 and 4.18. On the coarse level, the numerical minimizer is obtained after 48 iterations, whereas for the fine level it is obtained after 115 iterations. Both these results conform to those of the pure MATLAB version, as can be seen from the stopping criteria and iteration histories shown in Figures 4.16, 4.19, 4.14 and 4.17. This is in line with the previous discussion on the accuracy of the results, especially the derivatives, of the GPU implementation.

The main goal of the thesis was to investigate the ability to accelerate the image registration pipeline using the GPU. Therefore the most interesting result is to measure and compare the speed of executing a complete FAIR image registration cycle in both MATLAB and CUDA. These timing measurements were made using the MATLAB timing functions tic/toc. For the CUDA MEX implementations the timing therefore includes the memory transfers between the host and device, the time involved in the kernel launches, and the execution time on the GPU itself. The test was performed on three grids of varying discretization within the PIR_SSD_RIGID example discussed so far in this section. The host system on which the results were generated was MATLAB running on an Intel quad-core CPU, and the CUDA device tested was an NVIDIA GTX 295 graphics card, using only one of its two available CUDA processors. Table 4.7 summarizes these results.

From the results it can be observed that for small grids the runtimes of the MATLAB and CUDA MEX versions are comparable, and there is no significant gain from using the GPU. As the discretization gets finer, the gain in performance becomes more significant. This behavior is expected for two reasons.

Firstly, GPUs consist of many stream processors, each with the capability of launching hundreds of threads. Unless all the multiprocessors are sufficiently utilized, no gain in performance


Figure 4.14: PIR results for distance=SSD, trafo=rigid2D, and m=[128,64]. Top row (HNSP data): (a) T(xc), (b) R(xc), (c) |T(xc) − R(xc)|; bottom row (rigid/coarse): (d) T(xc) and grid yc, (e) T(yc), (f) |T(yc) − R(xc)|.

(a) Iteration history for m=[128,64] in MATLAB; (b) iteration history for m=[128,64] in CUDA MEX.

Figure 4.15: Iteration history for both CUDA MEX and MATLAB at coarse level

MATLAB
1[ (Jold-Jc)  = 8.53210293e-04 <= tolJ*(1+|Jstop|)    = 4.77114650e+00]
1[ |yc-yOld|  = 1.72291201e-05 <= tolY*(1+norm(yc))   = 1.47847101e-02]
1[ |dJ|       = 2.13845429e+01 <= tolG*(1+abs(Jstop)) = 4.77114650e+01]
0[ norm(dJ)   = 2.13845429e+01 <= eps                 = 2.22044605e-13]
0[ iter       = 48 >= maxIter  = 500 ]
% ---------- [ GaussNewtonArmijo : done ! ] ----------------------------------

CUDA MEX
1[ (Jold-Jc)  = 9.01546190e-04 <= tolJ*(1+|Jstop|)    = 4.77114845e+00]
1[ |yc-yOld|  = 1.78247084e-05 <= tolY*(1+norm(yc))   = 1.47846932e-02]
1[ |dJ|       = 2.18113840e+01 <= tolG*(1+abs(Jstop)) = 4.77114845e+01]
0[ norm(dJ)   = 2.18113840e+01 <= eps                 = 2.22044605e-13]
0[ iter       = 48 >= maxIter  = 500 ]
% ---------- [ GaussNewtonArmijo : done ! ] ----------------------------------

Figure 4.16: The stopping criterion for coarse level in both methods

can be witnessed. Small problem sizes are not able to generate that many threads; hence for small problem sizes multi-core CPUs already have comparable efficiency.

Secondly, whenever a CUDA MEX function is called within FAIR, the input data is moved from the host to the device, and the output data is moved back after the kernel execution. This host-device transfer is a significant problem, as the bandwidth provided by the PCI interface between the host and device is far less than that available on the GPU device itself.


Figure 4.17: PIR results for distance=SSD, trafo=rigid2D, and m=[512,256]. Top row (HNSP data): (a) T(xc), (b) R(xc), (c) |T(xc) − R(xc)|; bottom row (rigid/fine): (d) T(xc) and grid yc, (e) T(yc), (f) |T(yc) − R(xc)|.

(a) iteration history for [512,256] in MATLAB (b) iteration history for [512,256] in CUDA MEX

Figure 4.18: Iteration history for both CUDA MEX and MATLAB at fine level

CUDA MEX LEVEL 8
----------------
1[ (Jold-Jc)  = 4.10610489e-05 <= tolJ*(1+|Jstop|)    = 5.20809124e+00]
1[ |yc-yOld|  = 1.96948288e-06 <= tolY*(1+norm(yc))   = 1.47908270e-02]
1[ |dJ|       = 2.85537218e+01 <= tolG*(1+abs(Jstop)) = 5.20809124e+01]
0[ norm(dJ)   = 2.85537218e+01 <= eps                 = 2.22044605e-13]
0[ iter       = 115 >= maxIter = 500 ]
% ---------- [ GaussNewtonArmijo : done ! ] ----------------------------------

MATLAB LEVEL 8
--------------
1[ (Jold-Jc)  = 7.20116946e-05 <= tolJ*(1+|Jstop|)    = 5.20809084e+00]
1[ |yc-yOld|  = 1.96082782e-06 <= tolY*(1+norm(yc))   = 1.47908288e-02]
1[ |dJ|       = 2.84202453e+01 <= tolG*(1+abs(Jstop)) = 5.20809084e+01]
0[ norm(dJ)   = 2.84202453e+01 <= eps                 = 2.22044605e-13]
0[ iter       = 115 >= maxIter = 500 ]
% ---------- [ GaussNewtonArmijo : done ! ] ----------------------------------

Figure 4.19: The stopping criterion for fine level in both methods

Grid Size   Grid Size   PIR_SSD_RIGID          PIR_SSD_RIGID
X           Y           (FAIR with MATLAB)     (FAIR with CUDA MEX)
128         64          14.96 secs             14.13 secs
256         128         45 secs                33 secs
512         256         201.85 secs            92 secs

Table 4.7: Averaged runtimes of Parametric image registration in FAIR on MATLAB andCUDA


Hence, it is important to minimize these transfers as far as possible. Apart from the SSD and rigid2D, persistent memory, i.e. the capability to retain calculated results or input arrays, was not implemented for the most computationally involved functional module, the spline interpolation. The reason for this was the inability to retain the specialized CUDA array attribute over successive CUDA MEX calls using the method discussed in program 4.13. This is likely down to the fact that, when using the CUDA runtime API as in the current implementation, calling cudaThreadExit() clears the CUDA context 2 that holds the CUDA array device pointer as well as the texture reference in a distinct 32-bit address space. Hence during the next MEX call the information is lost, with no means to retrieve the previous context.

A possible solution to this problem is suggested in the next chapter.

2 Analogous to a CPU process


Chapter 5

Recommendations and Conclusion

5.1 Introduction

Within the scope of this thesis an end-to-end image registration cycle was implemented on a CUDA enabled graphics card within the MATLAB based FAIR toolbox. During this work a software model for image registration was adapted to the FAIR toolbox. Detailed runtime analysis and profiling of a prototype parametric image registration was performed and the computationally expensive functional modules were identified. Thereafter, a detailed study into the efficient integration of CUDA and MATLAB using MEX APIs was taken up and documented. Using the tools and methods learnt from this study, the functional modules of the parametric image registration cycle were implemented on the GPU as proofs of the CUDA and FAIR integration concept. The various performance aspects were tested and inferences were drawn for the individual functional modules. Based on the timing results for parametric image registration on CUDA, performed on input grids of varying discretizations, it was concluded that using CUDA indeed yields a gain in performance; the significant growth of this gain with increasing grid size was also discussed. In this chapter the achieved goals are listed and topics and ideas left for future research are suggested.

5.2 Goals achieved

As part of the thesis work many milestones were set and achieved. The main achievements of this thesis work are listed below:

1. Successful integration of MATLAB and CUDA.

2. Porting of the FAIR toolbox onto the GPU.

3. Fast implementation of spline interpolation within the CUDA MEX framework.

4. Analysis of accuracy results for texture usage for interpolant derivatives.

5. GPU acceleration of a fixed level image registration scheme for large discretizations.

6. Implementation of persistent memory on GPUs.


5.3 Scope for further work

Given the optimized implementations of the functional modules of FAIR on the GPU, the speed-up achieved in the overall registration cycle could be further enhanced. This can be done by reducing the numerous transfers of data between the host and device every time a CUDA MEX routine is called, especially for the spline interpolation. This effect is not so prominent for the CUDA MEX implementations of the other functional modules, as persistent memory was implemented there. Section 4.6 of the previous chapter briefly described the reason for this problem. In general, for a fast CUDA MEX implementation these transfers have to be minimized. Therefore the suggestions for future work discussed here mainly focus on this theme.

5.3.1 FAIR improvements for GPU

In order to achieve the best performance using the GPU, some changes in the FAIR design are required. A few of these suggestions are listed below:

• Though the use of Kronecker products provides clarity in understanding the underlying concepts, as mentioned in section 2.5.4, their use in FAIR is a major source of large matrices that could generally be avoided by other suitable methods.

• The explicit storage of the large coordinate grids could be avoided by sharing the domain and discretization information with functional modules such as rigid2D, which could then generate the grid locally without having to store it explicitly.

• It might be desirable to combine the functionalities of two modules that almost always occur in succession. For example, a rigid or any other transformation is typically followed by an interpolation scheme. By merging these two modules, the need to transfer the output, i.e. the transformed grid y = trafo(x), from device to host and immediately back as input to the interpolation module can be avoided. Combined with the previous suggestion, an extremely low-cost interpolation/transformation can be created.

• The stringent requirement for lexicographical ordering in FAIR places a restriction on the use of more sophisticated data structures instead of the standard long one-dimensional vectors used in FAIR. By using special data arrangement techniques, better spatial locality and better cache reuse can be achieved, not only on the GPU but also on the host device.

5.3.2 Usage of CUDA driver API

The problem pertaining to the implementation of persistent CUDA arrays and texture references relates to context management: the usage of the CUDA runtime Application Program Interface (API) was found to be prohibitive, as mentioned in section 4.6. In order to keep the CUDA program code concise, GPU context and module management are implicitly handled by the runtime API. In a general C application environment this would cause no problem, as the context is retained throughout the lifespan of the application. In the CUDA MEX programming model, however, the CUDA MEX file with the kernel implementation is just a dynamically linked subroutine executed by the application engine MATLAB. Therefore the lifespan of the MEX file, and with it the CUDA context, ends with the termination of the CUDA MEX routine.

A solution for this problem could be sought in the form of the CUDA driver API, which is a low-level C API providing functions to load kernels as modules of CUDA binary or assembly code. As opposed to the runtime API, the driver API provides direct access to the CUDA context, which encapsulates all the actions and resources of the driver API. These also include those of the runtime API, as the latter is built on top of the lower-level driver API.


Chapter 5. Recommendations and Conclusion 51

Table 5.1: CUDA Driver API objects (from the NVIDIA programming guide)

The various objects available directly from the driver API are summarised in Table 5.1. Along with the device pointers stored in a distinct 32-bit address space, they are cleared by the system when the context is destroyed at the end of its execution. This is most likely what happens when attempting to store the CUDA array as persistent memory. Hence a method to retrieve the context has to be used.

Figure 5.1: Flow chart explaining use of driver API for CUDA MEX

Though each host thread can have only one active context at a time, it maintains a stack of current contexts. A CUDA MEX host function can therefore create and initialise a context with the texture reference code and then push this context onto the stack. The MEX file can be locked against being cleared by MATLAB using the function mexLock() in the first run of the CUDA MEX file. In this way, the original context holding the CUDA array and texture reference used for the B-spline interpolation is prevented from being destroyed when the CUDA MEX subroutine finishes execution. Subsequent calls to the same CUDA MEX file can retrieve from the stack the handle to the context initialised in the first run, and thereafter use that context and recover the texture reference for interpolation.

At the end of the registration cycle this locked CUDA MEX file can be cleared from MATLAB using the appropriate MEX APIs. The complete process discussed above is summarised in Figure 5.1. This method could be further extended to a CUDA MEX file that sets up all necessary variables for the registration cycle on the GPU and passes the device pointers as arguments to the CUDA MEX routines.
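The control flow described above can be modelled in plain C as follows. To keep the sketch self-contained and runnable, the CUDA driver API calls (cuCtxCreate, cuCtxPushCurrent, cuCtxPopCurrent) are replaced by a toy context stack; in a real CUDA MEX file the static handle would hold a CUcontext and the first-run branch would additionally call mexLock(). All names below are illustrative.

```c
#include <stddef.h>

/* Toy stand-ins for CUcontext and the per-thread context stack. */
typedef struct { int id; } Context;
static Context  ctxPool[4];
static Context *ctxStack[4];
static int ctxTop = 0, ctxCount = 0;

static Context *ctxCreate(void) { ctxPool[ctxCount].id = ctxCount; return &ctxPool[ctxCount++]; }
static void     ctxPush(Context *c) { ctxStack[ctxTop++] = c; }
static Context *ctxPop(void)        { return ctxStack[--ctxTop]; }

/* Persistent handle: survives between calls because the (locked) MEX
 * file is never unloaded, like a static variable in a mexFunction. */
static Context *persistent = NULL;

/* One "call" to the CUDA MEX routine; returns the context it used. */
static Context *mexCall(void)
{
    if (persistent == NULL) {
        /* first run: create the context, set up the texture reference,
         * and lock the MEX file (mexLock() in a real MEX file) */
        persistent = ctxCreate();
    }
    ctxPush(persistent);           /* make the saved context current    */
    Context *current = ctxStack[ctxTop - 1];
    /* ... launch interpolation kernel via the texture reference ...    */
    (void)ctxPop();                /* pop, but do NOT destroy, on exit  */
    return current;
}
```

The essential property is that every call after the first reuses the context created in the first run, so CUDA arrays and texture references bound to it remain valid across MEX invocations.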

To sum up the experiences gained: the prospects of integrating the performance of CUDA-enabled GPUs with the flexibility of MATLAB in general, and of the FAIR framework in particular, are very good.


Appendix A

FAIR/CUDA files

# Define installation location for CUDA and compilation flags compatible
# with the CUDA include files.
CUDAHOME    = /usr/local/cuda
INCLUDEDIR  = -I$(CUDAHOME)/include
INCLUDELIB  = -L$(CUDAHOME)/lib -lcufft -lcudart -lcublas -Wl,-rpath,$(CUDAHOME)/lib
CFLAGS      = -fPIC -D_GNU_SOURCE -pthread -fexceptions
COPTIMFLAGS = -O3 -funroll-loops

# Define installation location for MATLAB.
export MATLAB = /usr/local/matlab
MEX           = $(MATLAB)/bin/mex
MEXEXT        = .$(shell $(MATLAB)/bin/mexext)

# nvmex is a modified mex script that knows how to handle CUDA .cu files.
NVMEX = ./nvmex

# List the mex files to be built. The .mex extension will be replaced with the
# appropriate extension for this installation of MATLAB, e.g. .mexglx or
# .mexa64.
MEXFILES = interpolation/splineInter2D.mex

all: $(MEXFILES:.mex=$(MEXEXT))

clean:
	rm -f $(MEXFILES:.mex=$(MEXEXT))

.SUFFIXES: .cu .cu_o .mexglx .mexa64 .mexmaci

.c.mexglx:
	$(MEX) CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexglx:
	$(NVMEX) -f nvopts.sh $< $(INCLUDEDIR) $(INCLUDELIB)

.c.mexa64:
	$(MEX) CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexa64:
	$(NVMEX) -f nvopts.sh $< $(INCLUDEDIR) $(INCLUDELIB)

.c.mexmaci:
	$(MEX) CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexmaci:
	$(NVMEX) -f nvopts.sh $< $(INCLUDEDIR) $(INCLUDELIB)

Figure A.1: Makefile




#
# nvopts.sh     Shell script for configuring MEX-file creation script,
#               mex. These options were tested with gcc 3.2.3.
#
# usage:        Do not call this file directly; it is sourced by the
#               mex shell script. Modify only if you don't like the
#               defaults after running mex. No spaces are allowed
#               around the '=' in the variable assignment.
#
# Copyright 1984-2004 The MathWorks, Inc.
# $Revision: 1.43.4.7 $  $Date: 2006/03/10 00:42:26 $
#----------------------------------------------------------------------------
#
TMW_ROOT="$MATLAB"
MFLAGS=''

if [ "$ENTRYPOINT" = "mexLibrary" ]; then
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lmwservices -lut -lm"
else
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lm"
fi
case "$Arch" in
    Undetermined)
#----------------------------------------------------------------------------
# Change this line if you need to specify the location of the MATLAB
# root directory. The script needs to know where to find utility
# routines so that it can determine the architecture; therefore, this
# assignment needs to be done while the architecture is still
# undetermined.
#----------------------------------------------------------------------------
        MATLAB="$MATLAB"
#
# Determine the location of the GCC libraries
#
        GCC_LIBDIR=`gcc-4.3 -v 2>&1 | awk '/.*Reading specs.*/ {print substr($4,0,length($4)-6)}'`
#       GCC_LIBDIR=`/usr/lib64`
        ;;
    glnxa64)
#----------------------------------------------------------------------------
        RPATH="-Wl,-rpath-link,$TMW_ROOT/bin/$Arch"
        CC='nvcc -ccbin /usr/bin/gcc-4.3'
        CFLAGS='-O3 -Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64"'
        CLIBS="$RPATH $MLIBS -lm -lstdc++"
        COPTIMFLAGS='-Xcompiler "-O3 -funroll-loops -msse2 -DNDEBUG"'
        CDEBUGFLAGS='-g'
#
        CXX='g++-4.3'
        CXXFLAGS='-fPIC -fno-omit-frame-pointer -ansi -D_GNU_SOURCE -pthread'
        CXXLIBS="$RPATH $MLIBS -lm"
        CXXOPTIMFLAGS='-O -DNDEBUG'
        CXXDEBUGFLAGS='-g'
#
# NOTE: g77 is not thread safe
        FC='g77'
        FFLAGS='-fPIC -fno-omit-frame-pointer -fexceptions'
        FLIBS="$RPATH $MLIBS -lm -lstdc++"
        FOPTIMFLAGS='-O'
        FDEBUGFLAGS='-g'
#
        LD="gcc-4.3"
        LDEXTENSION='.mexa64'
        LDFLAGS="-pthread -shared -Wl,--version-script,$TMW_ROOT/extern/lib/$Arch/$MAPFILE"
        LDOPTIMFLAGS='-O'
        LDDEBUGFLAGS='-g'
#
        POSTLINK_CMDS=':'
#----------------------------------------------------------------------------
        ;;
esac
#############################################################################
#
# Architecture independent lines:
#
# Set and uncomment any lines which will apply to all architectures.
#
#----------------------------------------------------------------------------
#       CC="$CC"
#       CFLAGS="$CFLAGS"
#       COPTIMFLAGS="$COPTIMFLAGS"
#       CDEBUGFLAGS="$CDEBUGFLAGS"
#       CLIBS="$CLIBS"
#
#       LD="$LD"
#       LDFLAGS="$LDFLAGS"
#       LDOPTIMFLAGS="$LDOPTIMFLAGS"
#       LDDEBUGFLAGS="$LDDEBUGFLAGS"
#----------------------------------------------------------------------------
#############################################################################

Figure A.2: nvopts.sh



# Define installation location for CUDA and compilation flags compatible
# with the CUDA include files.
CUDAHOME    = /usr/local/cuda
INCLUDEDIR  = -I$(CUDAHOME)/include -I$(CUDAHOME)/C/common/inc/
INCLUDELIB  = -L$(CUDAHOME)/lib64 -Wl,-rpath,$(CUDAHOME)/lib -lcudart
CFLAGS      = -fPIC -D_GNU_SOURCE -pthread -fexceptions
COPTIMFLAGS = -O3 -funroll-loops -msse2

# Define installation location for MATLAB.
export MATLAB = /opt/matlab
MEX           = $(MATLAB)/bin/mex
MEXEXT        = .$(shell $(MATLAB)/bin/mexext)

# nvmex is a modified mex script that knows how to handle CUDA .cu files.
NVMEX = ./nvmex

# List the mex files to be built. The .mex extension will be replaced with the
# appropriate extension for this installation of MATLAB, e.g. .mexglx or
# .mexa64.
MEXFILES = interpolation/linearInter2D.mex

all: $(MEXFILES:.mex=$(MEXEXT))

clean:
	rm -f $(MEXFILES:.mex=$(MEXEXT))

.SUFFIXES: .cu .cu_o .mexglx .mexa64 .mexmaci

.c.mexglx:
	$(MEX) -f mexopts_dbg.sh CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexglx:
	$(NVMEX) -f nvopts_dbg.sh $< $(INCLUDEDIR) $(INCLUDELIB)

.c.mexa64:
	$(MEX) -f mexopts_dbg.sh CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' \
	CDEBUGFLAGS='$(CDEBUGFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexa64:
	$(NVMEX) -f nvopts_dbg.sh $< $(INCLUDEDIR) $(INCLUDELIB)

.c.mexmaci:
	$(MEX) -f mexopts_dbg.sh CFLAGS='$(CFLAGS)' COPTIMFLAGS='$(COPTIMFLAGS)' $< \
	$(INCLUDEDIR) $(INCLUDELIB)

.cu.mexmaci:
	$(NVMEX) -f nvopts_dbg.sh $< $(INCLUDEDIR) $(INCLUDELIB)

Figure A.3: Makefile.dbg



#
# mexopts_dbg.sh  Shell script for configuring MEX-file creation script,
#                 mex. These options were tested with the specified compiler.
#
# usage:        Do not call this file directly; it is sourced by the
#               mex shell script. Modify only if you don't like the
#               defaults after running mex. No spaces are allowed
#               around the '=' in the variable assignment.
#
# Copyright 1984-2008 The MathWorks, Inc.
# $Revision: 1.78.4.16 $  $Date: 2008/11/04 19:40:11 $
#----------------------------------------------------------------------------
#
TMW_ROOT="$MATLAB"
MFLAGS=''

if [ "$ENTRYPOINT" = "mexLibrary" ]; then
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lmwservices -lut"
else
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat"
fi
case "$Arch" in
    Undetermined)
#----------------------------------------------------------------------------
# Change this line if you need to specify the location of the MATLAB
# root directory. The script needs to know where to find utility
# routines so that it can determine the architecture; therefore, this
# assignment needs to be done while the architecture is still
# undetermined.
#----------------------------------------------------------------------------
        MATLAB="$MATLAB"
#----------------------------------------------------------------------------
        ;;
    glnxa64)
#----------------------------------------------------------------------------
        RPATH="-Wl,-rpath-link,$TMW_ROOT/bin/$Arch"
# StorageVersion: 1.0
# CkeyName: GNU C
# CkeyManufacturer: GNU
# CkeyLanguage: C
# CkeyVersion:
        CC='gcc-4.3'
        CFLAGS='-ansi -D_GNU_SOURCE'
        CFLAGS="$CFLAGS -fexceptions"
        CFLAGS="$CFLAGS -fPIC -fno-omit-frame-pointer -pthread"
        CLIBS="$RPATH $MLIBS -lm"
        COPTIMFLAGS='-O -DNDEBUG'
        CDEBUGFLAGS='-g'
        CLIBS="$CLIBS -lstdc++"
#
# C++keyName: GNU C++
# C++keyManufacturer: GNU
# C++keyLanguage: C++
# C++keyVersion:
        CXX='g++-4.3'
        CXXFLAGS='-ansi -D_GNU_SOURCE'
        CXXFLAGS="$CXXFLAGS -fPIC -fno-omit-frame-pointer -pthread"
        CXXLIBS="$RPATH $MLIBS -lm"
        CXXOPTIMFLAGS='-O -DNDEBUG'
        CXXDEBUGFLAGS='-g'
#
# FortrankeyName: g95
# FortrankeyManufacturer: GNU
# FortrankeyLanguage: Fortran
# FortrankeyVersion:
#
        FC='g95'
        FFLAGS='-fexceptions'
        FFLAGS="$FFLAGS -fPIC -fno-omit-frame-pointer"
        FLIBS="$RPATH $MLIBS -lm"
        FOPTIMFLAGS='-O'
        FDEBUGFLAGS='-g'
#
        LD="$COMPILER"
        LDEXTENSION='.mexa64'
        LDFLAGS="-pthread -shared -Wl,--version-script,$TMW_ROOT/extern/lib/$Arch/$MAPFILE -Wl,--no-undefined"
        LDOPTIMFLAGS='-O'
        LDDEBUGFLAGS='-g'
#
        POSTLINK_CMDS=':'
#----------------------------------------------------------------------------

#----------------------------------------------------------------------------
        ;;
esac
#############################################################################
#
# Architecture independent lines:
#
# Set and uncomment any lines which will apply to all architectures.
#
#----------------------------------------------------------------------------
#       CC="$CC"
#----------------------------------------------------------------------------
#############################################################################

Figure A.4: mexopts_dbg.sh



#
# nvopts_dbg.sh Shell script for configuring MEX-file creation script,
#               mex. These options were tested with gcc 3.2.3.
#
# usage:        Do not call this file directly; it is sourced by the
#               mex shell script. Modify only if you don't like the
#               defaults after running mex. No spaces are allowed
#               around the '=' in the variable assignment.
#
# Copyright 1984-2004 The MathWorks, Inc.
# $Revision: 1.43.4.7 $  $Date: 2006/03/10 00:42:26 $
#----------------------------------------------------------------------------
#
TMW_ROOT="$MATLAB"
MFLAGS=''

if [ "$ENTRYPOINT" = "mexLibrary" ]; then
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lmwservices -lut -lm"
else
    MLIBS="-L$TMW_ROOT/bin/$Arch -lmx -lmex -lmat -lm"
fi
case "$Arch" in
    Undetermined)
#----------------------------------------------------------------------------
# Change this line if you need to specify the location of the MATLAB
# root directory. The script needs to know where to find utility
# routines so that it can determine the architecture; therefore, this
# assignment needs to be done while the architecture is still
# undetermined.
#----------------------------------------------------------------------------
        MATLAB="$MATLAB"
#
# Determine the location of the GCC libraries
#
        GCC_LIBDIR=`gcc-4.3 -v 2>&1 | awk '/.*Reading specs.*/ {print substr($4,0,length($4)-6)}'`
#       GCC_LIBDIR=`/usr/lib64`
#----------------------------------------------------------------------------
        ;;
    glnxa64)
#----------------------------------------------------------------------------
        RPATH="-Wl,-rpath-link,$TMW_ROOT/bin/$Arch"
        CC='nvcc -ccbin /usr/bin/gcc-4.3 -G -g -O0'
        CFLAGS='-Xcompiler "-fPIC -D_GNU_SOURCE -pthread -fexceptions -m64"'
        CLIBS="$RPATH $MLIBS -lm -lstdc++"
        COPTIMFLAGS='-Xcompiler "-O3 -funroll-loops -msse2 -DNDEBUG"'
        CDEBUGFLAGS='-g'
#
        CXX='g++-4.3'
        CXXFLAGS='-fPIC -fno-omit-frame-pointer -ansi -D_GNU_SOURCE -pthread'
        CXXLIBS="$RPATH $MLIBS -lm"
        CXXOPTIMFLAGS='-O -DNDEBUG'
        CXXDEBUGFLAGS='-g'
#
# NOTE: g77 is not thread safe
        FC='g77'
        FFLAGS='-fPIC -fno-omit-frame-pointer -fexceptions'
        FLIBS="$RPATH $MLIBS -lm -lstdc++"
        FOPTIMFLAGS='-O'
        FDEBUGFLAGS='-g'
#
        LD="gcc-4.3"
        LDEXTENSION='.mexa64'
        LDFLAGS="-pthread -shared -Wl,--version-script,$TMW_ROOT/extern/lib/$Arch/$MAPFILE"
        LDOPTIMFLAGS='-O'
        LDDEBUGFLAGS='-g'
#
        POSTLINK_CMDS=':'
#----------------------------------------------------------------------------
        ;;
esac
#############################################################################
#       CC="$CC"
#       CFLAGS="$CFLAGS"
#       COPTIMFLAGS="$COPTIMFLAGS"
#       CDEBUGFLAGS="$CDEBUGFLAGS"
#       CLIBS="$CLIBS"
#
#       LD="$LD"
#       LDFLAGS="$LDFLAGS"
#       LDOPTIMFLAGS="$LDOPTIMFLAGS"
#       LDDEBUGFLAGS="$LDDEBUGFLAGS"
#----------------------------------------------------------------------------
#############################################################################

The -G option tells nvcc to generate debugging information for the CUDA kernels; it forces -O0 (mostly unoptimised) compilation and spills all variables to local memory, which will probably slow program execution. The -g option tells nvcc to generate debugging information for the host code and to include symbolic debugging information in the executable.

Figure A.5: nvoptsdbg.sh


Bibliography

[1] J. Modersitzki. FAIR: Flexible Algorithms for Image Registration. SIAM, Philadelphia, 2009.

[2] T.S. Yoo. Insight into Images: Principles and Practice for Segmentation, Registration, and Image Analysis. AK Peters, Ltd., 2004.

[3] NVIDIA. CUDA Programming Guide, Version 2.3. NVIDIA Corporation.

[4] The MathWorks, Inc. MATLAB: The Language of Technical Computing. The MathWorks, 1996.

[5] Brian Dushaw. Matlab and CUDA. URL http://staff.washington.edu/dushaw/epubs/Matlab_CUDA_Tutorial_8_08.pdf.

[6] NVIDIA. CUDA Profiler, Version 2.3. NVIDIA Corporation.

[7] C. Sigg and M. Hadwiger. Fast third-order texture filtering. GPU Gems, 2:313–329, 2005. URL http://developer.nvidia.com/GPUGems2/gpugems2_chapter20.html.
