TRANSCRIPT
CUDA L-BFGS and MAP Superresolution
Multi-core Architectures and Programming
Oliver Taubmann & Jens Wetzl
August 7, 2012
Outline
Two sub-projects:
• Library for unconstrained nonlinear optimization
  – uses the L-BFGS method
  – works with any differentiable cost function
• Super-resolution of a low quality image series
  – employs the nonlinear optimizer
  – maximum-a-posteriori (MAP) approach
Both were implemented on the GPU using CUDA.
Outline

CUDA L-BFGS
  Algorithm & Implementation
  Optimization
  Framework

MAP Superresolution
  Algorithm & Implementation
  Optimization
  Evaluation
CUDA L-BFGS: Algorithm & Implementation
Descent Methods
• Find local minima of a differentiable function
• Iterative scheme:

  Input: function f, start point ~x
  while ‖∇f(~x)‖₂² ≥ ε · ‖~x‖₂² do             ▷ convergence criterion
      ~z ← FINDSEARCHDIRECTION(f, ~x)          ▷ method-specific
      t ← argmin_{t≥0} f(~x + t·~z)            ▷ line search
      ~x ← ~x + t·~z                           ▷ update current solution
  end while
  Output: final solution ~x
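A self-contained toy example of this loop in plain C++ — steepest descent (~z = −∇f(~x)) with a crude halving line search on a small quadratic; the objective, starting point and ε are made up for illustration:

#include <cstdio>

// Example objective f(x) = (x0 - 1)^2 + 10 * (x1 - 2)^2 and its gradient
static float f(float x0, float x1)
{
    return (x0 - 1.0f) * (x0 - 1.0f) + 10.0f * (x1 - 2.0f) * (x1 - 2.0f);
}

static void gradf(const float x[2], float g[2])
{
    g[0] = 2.0f * (x[0] - 1.0f);
    g[1] = 20.0f * (x[1] - 2.0f);
}

int main()
{
    float x[2] = { 4.0f, -2.0f };
    const float eps = 1e-6f;

    for (int iter = 0; iter < 10000; ++iter) {
        float g[2];
        gradf(x, g);

        // convergence criterion: ||grad f(x)||^2 < eps * ||x||^2
        if (g[0] * g[0] + g[1] * g[1] < eps * (x[0] * x[0] + x[1] * x[1]))
            break;

        const float z[2] = { -g[0], -g[1] };   // steepest descent direction

        // crude line search: halve t until f decreases
        float t = 1.0f;
        const float fx = f(x[0], x[1]);
        while (t > 1e-20f && f(x[0] + t * z[0], x[1] + t * z[1]) >= fx)
            t *= 0.5f;

        x[0] += t * z[0];                      // update current solution
        x[1] += t * z[1];
    }
    std::printf("solution: (%g, %g)\n", x[0], x[1]);
    return 0;
}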
Gradient Descent
Quasi-Newton Methods
• Newton: local, quadratic approximation

  function FINDSEARCHDIRECTION(f, ~x)
      return −H_f⁻¹(~x) · ∇f(~x)
  end function

  where H_f⁻¹ is the inverse Hessian of f.

• Quasi-Newton:
  – don't compute H_f⁻¹ directly
  – estimate it using successive gradient vectors
Newton's Method

The idea:
• Select a point.
• Compute the minimum of the second-order Taylor approximation.

[Figure: Newton steps on f(x) = 0.001·x^4. Image source: lecture slides "Pattern Recognition" by Stefan Steidl (Hornegger, Hahn, Steidl, 2005–2012)]
L-BFGS

• "Limited-memory Broyden–Fletcher–Goldfarb–Shanno"
• Keeps a small history, typically 3 ≤ m ≤ 8, of the latest updates:

  ~s_k = ~x_{k+1} − ~x_k
  ~y_k = ∇f(~x_{k+1}) − ∇f(~x_k)

• Plus some derived scalar values:

  ρ_k = 1 / (~y_k^T · ~s_k)
  α_k (see next slide)

• Needs an initial (diagonal or scalar) estimate of H_f⁻¹
  – with no prior knowledge, a scalar H_0 is sufficient
  – initialized to H_0 = 1, updated each iteration to H_0 = (~y_k^T · ~s_k) / ‖~y_k‖₂²
L-BFGS Pseudocode
function FINDSEARCHDIRECTION(f, ~x)
    ~z ← ∇f(~x)
    for i ← k−1, k−2, ..., k−m do
        α_i ← ρ_i · ~s_i^T · ~z              ▷ store α_i
        ~z ← ~z − α_i · ~y_i
    end for
    ~z ← H_0 · ~z
    for i ← k−m, k−m+1, ..., k−1 do
        β_i ← ρ_i · ~y_i^T · ~z
        ~z ← ~z + ~s_i · (α_i − β_i)
    end for
    return −~z
end function
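A compact host-side C++ sketch of this two-loop recursion (plain std::vector with hand-rolled dot/axpy helpers — an illustration, not the library's actual internals):

#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

static float dot(const Vec &a, const Vec &b)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static void axpy(float a, const Vec &x, Vec &z)   // z <- z + a*x
{
    for (std::size_t i = 0; i < z.size(); ++i) z[i] += a * x[i];
}

// g = grad f(x); s, y, rho hold the m most recent updates, oldest first.
static Vec findSearchDirection(const Vec &g,
                               const std::vector<Vec> &s,
                               const std::vector<Vec> &y,
                               const std::vector<float> &rho, float H0)
{
    Vec z = g;
    const std::size_t m = s.size();
    std::vector<float> alpha(m);

    for (std::size_t j = m; j-- > 0; ) {          // i = k-1, ..., k-m
        alpha[j] = rho[j] * dot(s[j], z);
        axpy(-alpha[j], y[j], z);
    }
    for (std::size_t i = 0; i < z.size(); ++i)    // z <- H0 * z
        z[i] *= H0;
    for (std::size_t j = 0; j < m; ++j) {         // i = k-m, ..., k-1
        const float beta = rho[j] * dot(y[j], z);
        axpy(alpha[j] - beta, s[j], z);
    }
    for (std::size_t i = 0; i < z.size(); ++i)    // return -z
        z[i] = -z[i];
    return z;
}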
Interface
Minimizer:
class LBFGS_API lbfgs
{
public:
    lbfgs(cost_function& cf);
    ~lbfgs();

    status minimize(float *d_x);
    status minimize_with_host_x(float *h_x);

    static std::string statusToString(status stat);

    void setMaxIterations   (size_t maxIter);
    void setMaxEvaluations  (size_t maxEvals);
    void setGradientEpsilon (float gradientEps);

    [...]

private:
    [...]
};
Interface (cont.)
Cost function:

class LBFGS_API cost_function
{
public:
    cost_function(size_t numDimensions);
    virtual ~cost_function();

    virtual void f_gradf(const float *d_x, float *d_f, float *d_gradf) = 0;
    size_t getNumberOfUnknowns() const;
    [...]
};

CPU cost function:

class LBFGS_API cpu_cost_function : public cost_function
{
public:
    cpu_cost_function(size_t numDimensions);
    virtual ~cpu_cost_function();

    virtual void cpu_f_gradf(const real *h_x, real *h_f, real *h_gradf) = 0;
    void f_gradf(const float *d_x, float *d_f, float *d_gradf);
    [...]
};
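For illustration, a minimal client-side sketch using this interface for the 2-D Rosenbrock test function (assuming the declarations above; `real` is the library's CPU scalar typedef):

class rosenbrock : public cpu_cost_function
{
public:
    rosenbrock() : cpu_cost_function(2) {}

    // f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2 and its analytic gradient
    void cpu_f_gradf(const real *h_x, real *h_f, real *h_gradf)
    {
        const real x = h_x[0], y = h_x[1];
        *h_f = (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x);
        h_gradf[0] = -2 * (1 - x) - 400 * x * (y - x * x);
        h_gradf[1] = 200 * (y - x * x);
    }
};

// usage: the host-side vector is copied to the device internally
// rosenbrock cf;
// lbfgs minimizer(cf);
// float x[2] = { -1.0f, 2.0f };
// lbfgs::status stat = minimizer.minimize_with_host_x(x);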
CPU Implementation
• First: straight-forward implementation of the pseudocode (using Eigen)
• Repeatedly refined to mimic the reference Fortran code from Netlib
  – one big "work array" for most of the data
  – GOTOs everywhere for reverse communication
  – → lots and lots of time spent on comparing and debugging

      DO 170 I=1,N
  170  W(I)=G(I)
  172  CONTINUE
       CALL MCSRCH(N,X,F,G,W(ISPT+POINT*N+1),STP,FTOL,
      *            XTOL,MAXFEV,INFO,NFEV,DIAG)
       IF (INFO .EQ. -1) THEN
         IFLAG=1
         RETURN
       ENDIF
       IF (INFO .NE. 1) GO TO 190

• Finally ended up with an algorithmically "streamlined" implementation
  – store only as much as needed
  – minimize redundant computations
Analysis
function FINDSEARCHDIRECTION(f, ~x)
    ~z ← ∇f(~x)
    for i ← k−1, k−2, ..., k−m do
        α_i ← ρ_i · ~s_i^T · ~z              ▷ dot
        ~z ← ~z − α_i · ~y_i                 ▷ axpy
    end for
    ~z ← H_0 · ~z                            ▷ scale
    for i ← k−m, k−m+1, ..., k−1 do
        β_i ← ρ_i · ~y_i^T · ~z              ▷ dot
        ~z ← ~z + ~s_i · (α_i − β_i)         ▷ axpy
    end for
    return −~z
end function

• 3 vector operation types that can be parallelized
• Otherwise highly sequential (note the read/write pattern of ~z)
• Dot products require global communication!
Analysis (cont.)

What's left?

• Line search: for each step length tried, we have
  – 1 axpy for backtracking
  – 1 function/gradient evaluation for checking conditions
  – arbitrarily complex scalar computations

• History updates:
  – solution difference: 1 scale (~x_k − ~x_{k−1} = t · ~z)
  – gradient difference: 1 axpy (∇f(~x_k) − ∇f(~x_{k−1}))
  – 2 more dots, for ρ_k = 1 / (~y_k^T · ~s_k) and
    H_0 = (~y_k^T · ~s_k) / ‖~y_k‖₂² = ρ_k⁻¹ / (~y_k^T · ~y_k)
CUDA L-BFGS: Optimization
Naïve Approach
The obvious things to do:
• ... after replacing all convenient abstractions provided by Eigen ...
• Keep all vectors in device memory – no copying of large chunks
• Implement simple kernels for dot, scale and axpy
  – one element per thread for everything
  – two-stage atomic additions for dot (first shared, then global;
    sketched below)
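A sketch of what the two-stage dot kernel could look like (d_res must be zeroed before launch; atomicAdd on shared-memory floats needs compute capability 2.0, which the GTX 580 used later has):

__global__ void dot(const float *d_x, const float *d_y, size_t n,
                    float *d_res)
{
    __shared__ float blockSum;
    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    // stage 1: one element per thread, accumulated in shared memory
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&blockSum, d_x[i] * d_y[i]);
    __syncthreads();

    // stage 2: one global atomic per block
    if (threadIdx.x == 0)
        atomicAdd(d_res, blockSum);
}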
Using cuBLAS

• These functions are also offered by cuBLAS, so why not use them?
• cuBLAS is NVIDIA's CUDA implementation of BLAS (Basic Linear Algebra Subprograms)
• BLAS is highly optimized and widely used in HPC

• Example code for dispatching according to the chosen implementation:

void lbfgs::dispatch_dot([...]) const
{
#if defined(LBFGS_IMPLEMENTATION_NAIVE)
    [...]
    gpu_lbfgs::dot<<<gridDim, blockDim>>>(d_x, d_y, n, d_res);
    [...]
#elif defined(LBFGS_IMPLEMENTATION_CUBLAS)
    const cublasPointerMode_t mode = dstDevicePointer
                                   ? CUBLAS_POINTER_MODE_DEVICE
                                   : CUBLAS_POINTER_MODE_HOST;

    CublasSafeCall(cublasSetPointerMode(m_cublasHandle, mode));
    CublasSafeCall(cublasSdot(m_cublasHandle, n, d_x, 1, d_y, 1, dst));
#endif
}
Using cuBLAS (cont.)

[Figure: optimization time [ms] vs. problem size (10^1 to 10^7, log-log), cuBLAS vs. naïve kernels]

• n-dim. Rosenbrock function
• Times don't include function evals
• GPU: GeForce GTX 580

• As expected, cuBLAS is much faster for large vectors. Nice!
Scalar Operations on the GPU
• Up to now, all scalar calculations are still done on the CPU
• → scalar values (e.g. dot results) must still be copied frequently
• Is it faster to avoid copying and let a single GPU thread do the work?
  (a single-thread example follows the kernel list below)

• Trying to combine as many steps as possible,
  we still ended up with several "sequential" kernels:

__global__ void update1   ([...]);  // first update loop
__global__ void update2   ([...]);  // second update loop
__global__ void update3   ([...]);  // after line search
__global__ void initStep  ([...]);
__global__ void lineSearch([...]);
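As a hypothetical illustration of the idea (not one of the actual kernels above): launched as update_rho<<<1, 1>>>(...), a single thread computes ρ_k = 1 / (~y_k^T · ~s_k) from a dot-product result that already lives in device memory, so nothing has to round-trip through the host:

__global__ void update_rho(const float *d_dot_ys,  // y_k^T s_k, on the device
                           float *d_rho)
{
    // one thread, pure scalar work: rho_k = 1 / (y_k^T s_k)
    *d_rho = 1.0f / *d_dot_ys;
}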
Scalar Operations on the GPU (cont.)

[Figure: timing comparison, "vectors on device" vs. "all on device" — updates: 547 ms vs. 544 ms, line search: 120 ms in both variants]

• It's a tiny bit faster, but not worth the effort.
  Oh well, you don't know if you don't try.
• Host/device copies ↓, kernel launch overhead ↑
Line Search
• Code optimization is all good and well, but will only get us so far:
  – most of the time is usually spent in function/gradient evaluations
  – L-BFGS is robust even when step lengths are chosen naïvely, but:
  – a clever line search is crucial to minimize the number of evaluations
    (a minimal backtracking sketch follows below)

[Figure: sampling f(~x + t·~s) at trial step lengths t₀, t₁, t₂, t₃]
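A sketch of the simplest strategy compared on the next slide — backtracking with the Armijo (sufficient decrease) condition; here φ(t) = f(~x + t·~s) and slope0 = ∇f(~x)^T · ~s < 0 for a descent direction (the function name and constants are illustrative, not the library's API):

#include <functional>

// Shrink t until phi(t) <= phi(0) + c1 * t * phi'(0) (sufficient decrease).
static float backtrack(const std::function<float(float)> &phi,
                       float phi0, float slope0,
                       float t = 1.0f, float c1 = 1e-4f, float tau = 0.5f)
{
    for (int i = 0; i < 30; ++i) {
        if (phi(t) <= phi0 + c1 * t * slope0)   // Armijo condition holds
            return t;
        t *= tau;                               // try a shorter step
    }
    return t;   // give up after 30 shrinks; caller re-checks convergence
}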
Going up against Netlib
Line search comparison – # of objective function evaluations:

  Backtracking               133
  Strong Wolfe (midpoint)    100
  Strong Wolfe (quadratic)    84
  Netlib reference            73

• Measurements: Rosenbrock, avg. of integral starting points in [−4, 4]²
• Netlib uses lengthy (read: ugly) quadratic/cubic interpolation
Going up against Netlib (cont.)

[Figure: optimization time [ms] vs. problem size (10^1 to 10^7, log-log), GPU vs. Netlib]

• n-dim. Rosenbrock function
• Times don't include function evals
• GPU: GeForce GTX 580
• CPU: Xeon @ 2.67 GHz

• Keep in mind: for large problem sizes,
  – copying between host/device for each evaluation becomes expensive, so
  – GPU applications and CPU optimizers usually don't mix well, and vice versa
CUDA L-BFGS: Framework
Testing
Functions used for testing:

• Quadratic functions: f(~x) = ~x^T A ~x + ~b^T ~x + c,  ~x ∈ ℝ^n
  – convex if A is positive semidefinite ⇒ unique, easy-to-find solution
  – we randomly generate parameters up to matrix size 500×500
    (see the sketch below):
    · compute the QR decomposition of a random matrix
    · A = QΛQ^T with Λ = diag(λ_i), λ_i ≥ 0 randomly chosen
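A sketch of that generation step in Eigen (which the CPU implementation already uses); the function name is illustrative:

#include <Eigen/Dense>

// Random positive semidefinite A = Q * diag(lambda) * Q^T
Eigen::MatrixXf randomPSD(int n)
{
    // QR decomposition of a random matrix yields an orthogonal Q
    const Eigen::MatrixXf M = Eigen::MatrixXf::Random(n, n);
    const Eigen::MatrixXf Q =
        Eigen::HouseholderQR<Eigen::MatrixXf>(M).householderQ();

    // nonnegative eigenvalues => positive semidefinite by construction
    const Eigen::VectorXf lambda = Eigen::VectorXf::Random(n).cwiseAbs();
    return Q * lambda.asDiagonal() * Q.transpose();
}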
Testing (cont.)
Functions used for testing:

• Rosenbrock function: f(x, y) = (1 − x)² + 100 · (y − x²)²

  Source: en.wikipedia.org/wiki/Rosenbrock_function

  – lowest point in the "valley", f(1, 1) = 0, is difficult to find
  – → popular as a challenging test for numerical solvers
  – has multidimensional generalisations, one of which we used as well
    (a common form is given below)
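Which generalisation was used is not spelled out in the slides; a common choice (an assumption here) chains the 2-D term over consecutive coordinates:

  f(\vec{x}) = \sum_{i=1}^{n-1} \left[ 100 \, (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right]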
CMake Build System
• Builds (and installs) the library, optional sample projects and test cases
• Allows you to comfortably...
  – switch on error checking, timing, verbose output, CPU double precision
  – choose between all the different implementations
MAP Superresolution: Algorithm & Implementation
Problem
• Given:
  – a series of low quality images, e.g. from a bad camera
  – motion (and camera) parameters, e.g. obtained from registration
• Wanted: a single high quality image
  – reconstructed from the original images
  – reasonably "smooth" at the same time

Effectively higher resolutions can be obtained due to sub-pixel motion.
Model
[Figure: image formation model — the high resolution image ~x is warped with parameters θ, blurred by a point-spread function, decimated by the zoom factor, and corrupted with additive Gaussian noise to yield each ~y(k). Image adapted from Lindsay C. Pickup's dissertation "Machine Learning in Multi-frame Image Super-resolution"]

~x : high resolution image
~y(k) : k-th low resolution image

Each low resolution image is a linear function of the high resolution image, ~y(k) = W(k) ~x, and stacking all observations gives one linear system:

  ⎡ ~y(0) ⎤   ⎡ W(0) ⎤
  ⎢  ...  ⎥ = ⎢  ... ⎥ ~x    i.e.  ~y = W ~x
  ⎣ ~y(K) ⎦   ⎣ W(K) ⎦
Application: ToF-Endoscopy
[Figures: low-res. (70×70), bicubic (140×140), super-resolved (140×140), plus the whole input sequence]
Algorithm
• Compose the system matrix W from the motion parameters and the
  width of the point spread function (PSF)

• Compute the average image as an initial guess for the optimization:

  ~x^(0) = W̃^T ~y    where W̃ is W with normalized columns

• Minimize the objective function (a host-side sketch follows below)

  ~x* = argmin_~x ‖W~x − ~y‖₂² + λ · ‖c_Huber(LoG(~x))‖₁

  – W~x − ~y is the residual
  – LoG(~x) is the Laplacian-of-Gaussian-filtered image
  – c_Huber(·) is the pseudo-Huber loss function (applied element-wise)
  – λ controls the strength of the prior, i.e. the smoothness
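As an illustration, a host-side sketch of the objective in Eigen (the GPU version evaluates f and ∇f together on device data; the pseudo-Huber scale delta and the precomputed LoG image are assumptions of this sketch):

#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <cmath>

// pseudo-Huber loss: delta^2 * (sqrt(1 + (a/delta)^2) - 1);
// smooth near 0, asymptotically linear like |a|
static float pseudoHuber(float a, float delta)
{
    const float q = a / delta;
    return delta * delta * (std::sqrt(1.0f + q * q) - 1.0f);
}

static float objective(const Eigen::SparseMatrix<float> &W,
                       const Eigen::VectorXf &x, const Eigen::VectorXf &y,
                       const Eigen::VectorXf &logOfX,  // LoG(x), precomputed
                       float lambda, float delta)
{
    const Eigen::VectorXf r = W * x - y;       // residual
    float prior = 0.0f;                        // || c_Huber(LoG(x)) ||_1
    for (int i = 0; i < logOfX.size(); ++i)
        prior += pseudoHuber(logOfX[i], delta);
    return r.squaredNorm() + lambda * prior;   // data term + weighted prior
}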
Algorithm in Code

void MAPSuperResolution::superresolve(const LRImageStack &lrImages,
    const std::vector<MotionParams> &motionParams, SRImage &srImage, [...])
{
    // Compute system matrix
    SRSystemMatrix systemMatrix(motionParams, m_psfWidth, [...]);

    // Initialize SR image
    switch (init)
    {
    case SR_INITIALIZATION_AVERAGE:
        srImage.initToAverageImage([...]); break;
    case SR_INITIALIZATION_BLACK:
        srImage.setZero(); break;
    }

    // Optimize
    SRCostFunction cf(srImage.getNumPixels(), systemMatrix, lrImages,
                      m_gpuHandles, m_prior);

    lbfgs minimizer(cf);
    minimizer.setGradientEpsilon(m_gradientEps);

    lbfgs::status stat = minimizer.minimize(srImage.getPixels());
}
Implementation
First steps:

• Some framework functionality
  – reading motion parameters from file
  – reading and saving image files to/from device memory (using FreeImage)

• Parallelization of all steps mentioned above with CUDA
  – system matrix and all images stay on the device throughout
  – uses our optimizer (combined evaluation of f(~x) and ∇f(~x) pays off here)
  – tested correctness against a given Matlab implementation

• But: the matrix is still dense ⇒ prohibitive for realistic problem sizes
  – it has (#LR pixels · #LR images) × #HR pixels elements:

  #LR images    #LR pixels    #HR pixels     Required memory
  8             64×64         128×128        2 GB
  8             128×128       256×256        32 GB
  16            256×256       1024×1024      4 TB
MAP Superresolution: Optimization
Building the Sparse Matrix
• 1 LR pixel is affected by only a small region in the HR image
• The PSF width determines the size of that region
• → The max. number of elements per row can be precomputed

• Using the CRS (compressed row storage) format:
  – positions of non-zero entries are not restricted
  – we can build all matrix rows in parallel
  – efficient matrix-vector multiplication is possible (sketched below)

• If a row has fewer than the max. number of elements,
  we still need to generate a valid CRS structure
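For reference, a minimal sketch of a CRS matrix-vector product with one row per thread (names are illustrative; the project can also delegate this to cuSPARSE, see the next slide):

__global__ void csrSpMV(const float *d_val, const int *d_rowPtr,
                        const int *d_colInd, const float *d_x,
                        float *d_y, int numRows)
{
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    // each row holds the PSF footprint of one LR pixel
    float sum = 0.0f;
    for (int j = d_rowPtr[row]; j < d_rowPtr[row + 1]; ++j)
        sum += d_val[j] * d_x[d_colInd[j]];
    d_y[row] = sum;                       // (W x)_row
}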
Using the Sparse Matrix
• For computing the residual, ~r = W~x − ~y, we need multiplication
• Straight-forward to implement, but there's cuSPARSE, too:
  – similar to cuBLAS, but for sparse matrices
  – offers CRS matrix-vector multiplication, which we use

• Problem: we also need transposed multiplication
  – for the gradient: −2W^T ~r
  – for the average image: W̃^T ~y

• CRS is not well suited for that (see the sketch below), but at least
  cuSPARSE offers a better-than-naïve implementation
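To see why CRS and transposed multiplication don't mix well, here is what a naive kernel would have to do — every row scatters into many output entries, so the writes must be atomic (a sketch, not the cuSPARSE implementation):

__global__ void csrSpMVTransposed(const float *d_val, const int *d_rowPtr,
                                  const int *d_colInd, const float *d_r,
                                  float *d_out, int numRows)
{
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;

    // out += W^T r: the column indices of row i become output positions,
    // so concurrent rows collide (d_out must be zeroed before launch)
    const float ri = d_r[row];
    for (int j = d_rowPtr[row]; j < d_rowPtr[row + 1]; ++j)
        atomicAdd(&d_out[d_colInd[j]], d_val[j] * ri);
}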
Enter CCS (compressed column storage)

• CCS is perfect for transposed multiplication
• → Added an option to store both CRS and CCS representations
  – cuSPARSE even offers the conversion:

#ifdef SUPERRES_STORE_TRANSPOSE
    cusparseScsr2csc(
        m_gpuHandles.cusparseHandle, m_height, m_width,
        m_d_values, m_d_rowPointers, m_d_colIndices,             // CRS
        m_d_values_ccs, m_d_rowIndices_ccs, m_d_colPointers_ccs, // CCS
        1, CUSPARSE_INDEX_BASE_ZERO);
#endif

• Trade-off:
  (−) needs twice the memory for the matrix
  (0) precomputation time: building ↑, average image ↓
  (+) evaluations are much faster
Computational Bottlenecks

Share of total optimization time:

                    CRS only    CRS+CCS
  average image       16%          9%
  system matrix       14%         45%
  function eval       62%         35%
  solver + rest        7%         11%
  total              100%        100%

  #LR images: 16 · #LR pixels: 128×128 · #HR pixels: 512×512 · GPU: GeForce GTX 580
Computational Bottlenecks (cont.)

Function evaluation bottlenecks:

             CRS only    CRS+CCS
  function      7%         17%
  gradient     68%         17%
  prior        25%         66%
  total       100%        100%
Prior (value and gradient)
• As just seen, the prior now dominates the function evaluation time

• It requires two convolutions with a LoG function, ...
  – use a discrete 3×3 kernel with precomputed weights
  – hardcode these weights instead of reading them from memory
    (a kernel sketch follows below)

• ... some computations per element, and a global sum: ‖c_Huber(·)‖₁

The LoG (Laplacian of Gaussian) is discretized as the 3×3 stencil

        ⎡ 0   1   0 ⎤
  1/4 · ⎢ 1  −4   1 ⎥
        ⎣ 0   1   0 ⎦
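A sketch of the filtering step with the stencil weights hardcoded (kernel name and zero boundary handling are assumptions):

__global__ void logFilter(const float *d_in, float *d_out,
                          int width, int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // 3x3 neighbourhood, zero outside the image
    const float c = d_in[y * width + x];
    const float n = (y > 0)          ? d_in[(y - 1) * width + x] : 0.0f;
    const float s = (y < height - 1) ? d_in[(y + 1) * width + x] : 0.0f;
    const float w = (x > 0)          ? d_in[y * width + x - 1]   : 0.0f;
    const float e = (x < width - 1)  ? d_in[y * width + x + 1]   : 0.0f;

    // stencil 1/4 * (0 1 0; 1 -4 1; 0 1 0): the weights live in
    // registers, never in memory
    d_out[y * width + x] = 0.25f * (n + s + w + e - 4.0f * c);
}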
Prior Implementations Compared
• First, we used one thread per row (CPU habits...)
• Bad idea! Columnwise is much faster – cue coalesced reads
• Also: compute just a few elements per thread (avoid unused cores)
  – for the sum, this didn't work that well – why?
  – more global atomic adds ⇒ aggregate block-locally first
    (see the sketch after the timings)

Prior evaluation (without filtering):
  1 thread per row      12.7 ms
  1 thread per col       5.0 ms
  8 elem. per thread     4.7 ms
  shared atomicAdd       1.3 ms
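A sketch of the fastest variant (names and the pseudo-Huber scale delta are illustrative): a grid-stride loop gives each thread a few elements, the partial sums are combined with a shared-memory atomicAdd, and only one global atomicAdd is issued per block:

__global__ void priorSum(const float *d_logImage, int n, float delta,
                         float *d_result)   // d_result zeroed before launch
{
    __shared__ float blockSum;
    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    // a few elements per thread via a grid-stride loop
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        const float q = d_logImage[i] / delta;           // pseudo-Huber loss
        local += delta * delta * (sqrtf(1.0f + q * q) - 1.0f);
    }

    atomicAdd(&blockSum, local);       // aggregate block-locally first...
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(d_result, blockSum); // ...then one global atomic per block
}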
Optimization Overview

Function evaluation bottlenecks:

             CRS only    CRS+CCS    CRS+CCS+Optim.
  function      7%         17%          48%
  gradient     68%         17%          48%
  prior        25%         66%           4%
  total       100%        100%         100%
MAP Superresolution: Evaluation
Simulated Image Sequences
"Phantom" (16 images of size 64×128)

[Figures: low-res. (64×128), bicubic (256×512), super-resolved (256×512), ground truth (256×512); plus the input sequence]
Simulated Image Sequences (cont.)

"Aerial" (16 images of size 128×128)

[Figures: low-res. (128×128), bicubic (256×256), super-resolved (256×256), ground truth (256×256)]
Timing vs. CPU Implementations

[Figure: runtime [ms] (10^1 to 10^5, log scale) vs. LR image side length (8 to 256) — GPU vs. ITK vs. MATLAB]

• #LR images: 8
• Zoom factor: 2
• GPU: GeForce GTX 580
• CPU: i5-2400 @ 3.10 GHz
Timing vs. ITK Version

[Figure: runtime [ms] (0 to 4500) vs. LR image side length (8 to 256) — GPU vs. ITK; the GPU version is roughly 8× to 11× faster at the largest image sizes]