Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
CUDA Lecture 6: Embarrassingly Parallel Computations
A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or).
No communication or very little communication between processes.
Each process can do its tasks without any interaction with other processes.
Embarrassingly Parallel Computations – Slide 2
Embarrassingly Parallel Computations
Low-level image processing
Mandelbrot set
Monte Carlo calculations
Embarrassingly Parallel Computations – Slide 3
Embarrassingly Parallel Computation Examples
Many low-level image processing operations involve only local data, with very limited (if any) communication between areas of interest.
Embarrassingly Parallel Computations – Slide 4
Topic 1: Low level image processing
Square region for each process; can also use strips.
Embarrassingly Parallel Computations – Slide 5
Partitioning into regions for individual processes
Shifting: object shifted by Δx in the x-dimension and Δy in the y-dimension:
x′ = x + Δx
y′ = y + Δy
where x and y are the original and x′ and y′ are the new coordinates.
Scaling: object scaled by a factor Sx in the x-direction and Sy in the y-direction:
x′ = x Sx
y′ = y Sy
Embarrassingly Parallel Computations – Slide 6
Some geometrical operations
Rotation: object rotated through an angle θ about the origin of the coordinate system:
x′ = x cos θ + y sin θ
y′ = −x sin θ + y cos θ
Clipping: applies defined rectangular boundaries to a figure and deletes points outside the defined area. Display the point (x, y) only if
xlow ≤ x ≤ xhigh and ylow ≤ y ≤ yhigh
otherwise discard it.
Embarrassingly Parallel Computations – Slide 7
Some geometrical operations (cont.)
GPUs are sent starting row numbers. Since these are related to block and thread IDs, each GPU thread should be able to figure its row out for itself.
The returned results are single values. This is not good programming practice, but it is done to simplify a first exposure to CUDA programs. No terminating code is shown.
Embarrassingly Parallel Computations – Slide 8
Notes
Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function
z_k+1 = (z_k)² + c
where z_k+1 is the (k+1)st iteration of the complex number z = a + bi and c is a complex number giving the position of the point in the complex plane.
The initial value for z is zero.
Embarrassingly Parallel Computations – Slide 9
Topic 2: Mandelbrot Set
Iterations continued until magnitude of z is greater than 2 or number of iterations reaches arbitrary limit.
Magnitude of z is the length of the vector given by
Embarrassingly Parallel Computations – Slide 10
Mandelbrot Set (cont.)
length = √(a² + b²)
struct complex {
    float real;
    float imag;
};

int cal_pixel(struct complex c)
{
    int count = 0, max = 256;
    struct complex z;
    float temp, lengthsq;

    z.real = 0;
    z.imag = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
}
Embarrassingly Parallel Computations – Slide 11
Sequential routine computing the value of one point, returning the number of iterations.
The square of the length is compared to 4 (rather than comparing the length to 2) to avoid a square root computation.
Given the terminating conditions, all of the Mandelbrot points must lie within a circle of radius 2 centered at the origin.
The resolution can be expanded at will to obtain interesting images.
Embarrassingly Parallel Computations – Slide 12
Notes
Embarrassingly Parallel Computations – Slide 13
Mandelbrot set
Dynamic Task Assignment: have processors request new regions after computing previous regions. Doesn't lend itself to a CUDA solution.
Static Task Assignment: simply divide the region into a fixed number of parts, each computed by a separate processor. Not very successful because different regions require different numbers of iterations and times...
...but what we'll do in CUDA.
Embarrassingly Parallel Computations – Slide 14
Parallelizing Mandelbrot Set Computation
Embarrassingly Parallel Computations – Slide 15
Dynamic Task Assignment: Work Pool/Processor Farms
#include <stdio.h>
#include <conio.h>
#include <math.h>
#include "../common/cpu_bitmap.h"

#define DIM 1000

struct cuComplex {
    float r, i;
    cuComplex(float a, float b) : r(a), i(b) {}
    __device__ float magnitude2(void) { return r * r + i * i; }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r * a.r - i * a.i, i * a.r + r * a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r + a.r, i + a.i);
    }
};
Embarrassingly Parallel Computations – Slide 16
Simplified CUDA Solution
__device__ int mandelbrot(int x, int y)
{
    float jx = 2.0 * (x - DIM/2) / (DIM/2);
    float jy = 2.0 * (y - DIM/2) / (DIM/2);

    cuComplex c(jx, jy);
    cuComplex z(0.0, 0.0);
    int i = 0;
    do {
        z = z * z + c;
        i++;
    } while ((z.magnitude2() < 4.0) && (i < 256));
    return i % 8;
}
Embarrassingly Parallel Computations – Slide 17
Simplified CUDA Solution (cont.)
__global__ void kernel(unsigned char *ptr)
{
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int mValue = mandelbrot(x, y);

    ptr[offset*4 + 0] = 0;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
    switch (mValue) {
        case 0: break;                          // black
        case 1: ptr[offset*4 + 0] = 255; break; // red
        case 2: ptr[offset*4 + 1] = 255; break; // green
        case 3: ptr[offset*4 + 2] = 255; break; // blue
        case 4: ptr[offset*4 + 0] = 255;        // yellow
                ptr[offset*4 + 1] = 255; break;
        case 5: ptr[offset*4 + 1] = 255;        // cyan
                ptr[offset*4 + 2] = 255; break;
        case 6: ptr[offset*4 + 0] = 255;        // magenta
                ptr[offset*4 + 2] = 255; break;
        default: ptr[offset*4 + 0] = 255;       // white
                 ptr[offset*4 + 1] = 255;
                 ptr[offset*4 + 2] = 255; break;
    }
}
Embarrassingly Parallel Computations – Slide 18
Simplified CUDA Solution (cont.)
int main(void)
{
    CPUBitmap bitmap(DIM, DIM);
    unsigned char *dev_bitmap;

    cudaMalloc((void**)&dev_bitmap, bitmap.image_size());

    dim3 grid(DIM, DIM);
    kernel<<<grid,1>>>(dev_bitmap);

    cudaMemcpy(bitmap.get_ptr(), dev_bitmap, bitmap.image_size(),
               cudaMemcpyDeviceToHost);

    bitmap.display_and_exit();
    cudaFree(dev_bitmap);
}
Embarrassingly Parallel Computations – Slide 19
Simplified CUDA Solution (cont.)
An embarrassingly parallel computation named for Monaco's gambling resort city; the method's first important use was in the development of the atomic bomb during World War II.
Monte Carlo methods use random selections. Given a very large set and a probability distribution over it:
Draw a set of samples, identically and independently distributed.
Can then approximate the distribution using these samples.
Embarrassingly Parallel Computations – Slide 20
Topic 3: Monte Carlo methods
Evaluating integrals of arbitrary functions of 6+ dimensions
Predicting future values of stocks
Solving partial differential equations
Sharpening satellite images
Modeling cell populations
Finding approximate solutions to NP-hard problems
Embarrassingly Parallel Computations – Slide 21
Applications of Monte Carlo Method
Form a circle within a square, with unit radius so that the square has sides 2 × 2. The ratio of the area of the circle to the area of the square is given below.
Randomly choose points within the square. Keep score of how many points happen to lie within the circle. The fraction of points within the circle will be π/4, given a sufficient number of randomly selected samples.
Embarrassingly Parallel Computations – Slide 22
Monte Carlo Example: Calculating π
Area of circle / Area of square = π(1)² / 2² = π/4
Embarrassingly Parallel Computations – Slide 23
Calculating π (cont.)
Embarrassingly Parallel Computations – Slide 24
Calculating π (cont.)
[Figure: random points scattered over the 2 × 2 square; the fraction landing inside the circle estimates π/4.]
Relative error is a way to measure the quality of an estimate. The smaller the error, the better the estimate.
a: actual value
e: estimated value
Relative error = |e − a| / a
Embarrassingly Parallel Computations – Slide 25
Relative Error
Embarrassingly Parallel Computations – Slide 26
Increasing Sample Size Reduces Error
n              Estimate   Rel. Error   1/(2√n)
10             2.40000    0.23606      0.15811
100            3.36000    0.06952      0.05000
1,000          3.14400    0.00077      0.01581
10,000         3.13920    0.00076      0.00500
100,000        3.14132    0.00009      0.00158
1,000,000      3.14006    0.00049      0.00050
10,000,000     3.14136    0.00007      0.00016
100,000,000    3.14154    0.00002      0.00005
1,000,000,000  3.14155    0.00001      0.00002
One quadrant of the construction can be described by the integral below.
Random numbers (xr, yr) are generated, each between 0 and 1.
A point is counted as in the circle if xr² + yr² ≤ 1.
Embarrassingly Parallel Computations – Slide 27
Next Example: Computing an Integral
∫₀¹ √(1 − x²) dx = π/4
Computing the integral
Sequential code: the routine rand_v(x1, x2) returns a pseudorandom number between x1 and x2.
The Monte Carlo method is very useful if the function cannot be integrated numerically (perhaps because it has a large number of variables).
Embarrassingly Parallel Computations – Slide 28
Example
sum = 0;
for (i = 0; i < N; i++) {
    xr = rand_v(x1, x2);
    sum += xr * xr - 3 * xr;
}
area = (sum / N) * (x2 - x1);
I = ∫ from x1 to x2 of (x² − 3x) dx
The error in a Monte Carlo estimate decreases by the factor 1/√n.
The rate of convergence is independent of the integrand's dimension. Deterministic numerical integration methods do not share this property; hence Monte Carlo is superior when the integrand has 6 or more dimensions.
Furthermore, Monte Carlo methods are often amenable to parallelism: with p processors we can find an estimate about p times faster, or reduce the error of the estimate by √p.
Embarrassingly Parallel Computations – Slide 29
Why Monte Carlo Methods Interesting
Embarrassingly parallel computation examples:
Low-level image processing
Mandelbrot set
Monte Carlo calculations
Applications: numerical integration
Related topics: Metropolis algorithm, simulated annealing
For parallelizing such applications, we need the best way to generate random numbers in parallel.
Embarrassingly Parallel Computations – Slide 30
Summary
Based on original material from:
The University of Akron: Tim O'Neil, Saranya Vinjarapu
The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Revision history: last updated 7/28/2011.
Embarrassingly Parallel Computations – Slide 31
End Credits