On OpenCL and Chaotic Phenomena
Department of Computer Science & Engineering, University of Colorado Denver
By Ali Alkhathlan, Ali Alsaadi, and Mohamed Khalifa
The project aims to utilize GPGPU through OpenCL, an open standard for computing on heterogeneous platforms, including CPUs and GPUs,
◦ for the computation involved in chaotic phenomena: the Mandelbrot Set and the bifurcation diagram of the Logistic Map.
The performance provided by the GPU through OpenCL will be compared to CPU performance through OpenCL, and to plain C++.
Goal of the project:
Introduction. Background. Implementation. Methodology, Results, and Analysis. Conclusion.
Outline:
Introduction.
Introduction of the project
Problem. Objectives. Approach.
The investigation of chaotic phenomena requires heavy computation.
◦ However, many chaotic phenomena exhibit large amounts of data parallelism,
◦ in that the same computation is performed over and over on differing inputs.
The problem to be tackled by this project is
◦ the usage of GPGPU, through OpenCL, to perform the computations required to investigate chaotic phenomena,
◦ in particular, the Mandelbrot Set and the bifurcation diagram of the logistic map.
Problem:
Objectives: Successful implementation of algorithms for the computation of the Mandelbrot Set and the bifurcation diagram of the logistic map, both in OpenCL and in plain C++.
Comparison of performance between OpenCL running on a GPU, OpenCL running on a CPU, and, as a control, a plain serial C++ implementation of the above-mentioned chaotic phenomena.
Analysis of the benefits and complications derived from the usage of OpenCL for the computation of chaotic phenomena.
The investigation of the effectiveness of OpenCL for the computation of chaotic phenomena:
◦ It will be performed by comparing implementations of the algorithms for the computation of the Mandelbrot Set as well as the bifurcation diagram of the logistic map:
first in C++, then in OpenCL on the GPU and on the CPU.
Approach:
Background
It is the usage of graphics processing units for non-graphics-related computation.
Graphics processing units allow for massive parallelism due to their architecture.
GPGPU:
Key Concepts:
GPU Architecture:
- The architecture of graphics processing units is oriented towards performing tasks that involve data parallelism.
http://pds.ucdenver.edu/index.php?p=video&c=tech&a=t&i=2
A GPU has up to hundreds of cores (as compared to CPUs, which have 8-16 at the most).
Each of those cores is capable of executing dozens of instruction streams at the same time.
The general flow of computation on a GPU starts with the loading of data onto the GPU.
◦ This is often one of the most expensive operations, since the data has to travel through the bus.
• In graphics, this step is the transfer of graphics primitives, such as an image to be rendered.
• In GPGPU, this is the transfer of the data to be operated on.
• In GPGPU, a shader is called a kernel instead, to reflect the more general scope.
• Afterwards, data is transferred back to main memory, where the CPU can operate on it once again.
Important Information:
Mandelbrot Set. It is often visualized by coloring the complex plane according to the number of iterations it takes for a point to escape the circle of radius 2.
When |z| >= 2, the orbit is guaranteed to tend to infinity; i.e., when z exits a circle of radius 2 centered around the origin.
Defined by the mapping: zₙ₊₁ = zₙ² + c, with z₀ = c, where c = original point.
Bifurcation diagram of the logistic map. It is a plot of the long-term iterates of the logistic map, with r varying.
Defined by the mapping: xₙ₊₁ = r·xₙ(1 − xₙ), where r > 0 and 0 <= x <= 1.
These phenomena are ideal candidates for GPGPU, as their computation involves the repeated application of the same simple map over many independent inputs.
Chaotic Phenomena:
Are the computations of the Mandelbrot Set and the logistic map highly amenable to parallelization?
Mandelbrot Set
Bifurcation diagram of the logistic map
OpenCL:
Open Computing Language is a language and a framework:
- For developing and executing programs over heterogeneous devices and platforms, for example CPUs and GPUs.
It includes:
- A language (based on C99) for writing kernels.
- APIs to define and control the platforms.
All major hardware manufacturers support OpenCL, including Nvidia, Intel [6], and AMD/ATI [2].
What is it?
OpenCL:
Choosing Devices
Figure: simplified block diagram of a generalized GPU compute device.
Hardware Overview:
Figure: the relationship of the ATI Stream Computing components.
The ATI Stream Computing Implementation of OpenCL:
◦ GPU compute devices can execute non-graphics functions by using kernels.
◦ Each instance of a kernel running on a compute unit is called a work-item.
◦ All the work-items are scheduled onto a group of stream cores.
◦ OpenCL maps the total number of work-items to be launched onto an n-dimensional grid.
◦ The developer can specify how to divide these items into work-groups.
◦ There are an integer number of wavefronts in each work-group.
The ATI Stream Computing Implementation of OpenCL:
Figure Work-Item Grouping Into Work-Groups and Wavefronts
The ATI Stream Computing Implementation of OpenCL:
Global and local dimensions
Synchronization within work-items
OpenCL- Memory Model
◦ All stream cores within a compute unit execute the same instruction for each cycle.
◦ A work item can issue one VLIW instruction per clock cycle.
◦ To hide latencies due to memory accesses and processing element operations, up to four work-items from the same wavefront are pipelined on the same stream core.
◦ Compute units operate independently of each other, so it is possible for each array to execute different instructions.
Work-Item Processing:
For example, branching is done by combining all necessary paths as a wavefront. If work-items within a wavefront diverge, all paths are executed serially.
Masking of wavefronts is effected by constructs such as:

    if (x) {
        // items within these braces = A
    } else {
        // items within these braces = B
    }

- The wavefront mask is set true for lanes (elements/items) in which x is true, then A is executed.
- The mask is then inverted, and B is executed.
Flow Control:
A kernel is a small, user-developed program that is run repeatedly on a stream of data.
There are multiple kernel types: vertex, pixel, geometry, domain, hull, and now compute.
A compute kernel is a specific type of kernel that is not part of the traditional graphics pipeline.
Compute Kernel:
Before the development of compute kernels, pixel shaders were responsible for non-graphics computing.
However, new hardware supports compute kernels, which are better suited for non-graphics computations (applications).
The compute kernel type can also be used for graphics.
Compute Kernel:
Two concepts relating to compute kernels provide the data-parallel granularity: wavefronts and work-groups.
A single instruction is executed over all work-items in a wavefront in parallel. It is the lowest level that flow control can affect.
This means that if two work-items inside of a wavefront take divergent paths of flow control, all work-items in the wavefront execute both paths.
Work-groups are composed of wavefronts. Best performance is attained when the group size is an integer multiple of the wavefront size.
Wavefronts and Workgroups
OpenCL has four memory domains: private, local, global, and constant.
The AMD Accelerated Parallel Processing system also recognizes host (CPU) and PCI Express (PCIe) memory.
Memory Architecture and Access:
private memory- specific to a work-item; it is not visible to other work-items.
local memory - specific to a work-group; accessible only by work-items belonging to that work-group.
global memory - accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).
constant memory - read-only region for host-allocated and -initialized objects that are not changed during kernel execution.
Memory Architecture and Access:
host (CPU) memory - host-accessible region for an application’s data structures and program data.
PCIe memory - part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device.
◦ Modifying this memory requires synchronization between the GPU compute device and the CPU.
Memory Architecture and Access:
Interrelationship of the memory domains:
• Copy processes occur from host memory to PCIe memory, and from PCIe memory to the GPU compute device.
• Memory access speed: local memory is faster than global memory, and global memory is faster than PCIe memory.
• Global Buffer
  - Permits applications to read from and write to arbitrary locations in memory.
• Image Read/Write
  - Image reads are cached through the texture system.
  - Done by addressing the desired location in input memory using the fetch unit.
• Memory Load/Store
  - Only constants (read-only buffers) are cached.
  - Each work-item can write to an arbitrary location within the global buffer.
• Communication between Host and GPU
  - PCI Express bus
  - Command processor or processor API calls
  - DMA transfer
Figure: interrelationship of the memory domains (cont.)
Figure: standard dataflow between the host (CPU) and the GPU.
How to copy data? Two ways to copy data from the host to GPU compute device memory:
• Implicitly: by using clEnqueueMapBuffer and clEnqueueUnmapMemObject.
• Explicitly: through clEnqueueReadBuffer and clEnqueueWriteBuffer (or clEnqueueReadImage, clEnqueueWriteImage).
Figure: block diagram of the GPU memory system. Up arrows = read paths; down arrows = write paths; WC = write cache.
Global Memory Optimization
The GPU memory diagram consists of multiple compute units, each containing:
◦ 32 KB local memory
◦ L1 cache
◦ Registers
◦ 16 processing elements with a five-way VLIW processor
• L1 cache: 8 KB per compute unit, i.e., 160 KB across the 20 compute units of the ATI Radeon; one terabyte per second of aggregate bandwidth on the ATI Radeon.
• Multiple compute units share an L2 cache, 512 KB in size on the ATI Radeon.
• Bandwidth between the L1 caches and the shared L2 cache is 435 GB/s.
ATI Radeon HD 5870:
• The ATI Radeon HD 5870 GPU has eight memory controllers connected to multiple banks of GDDR5 memory.
• Memory clock speed is 1200 MHz, with a data rate of 4800 Mb/s per pin.
• Peak bandwidth = (8 memory controllers) × (4800 Mb/s per pin) × (32 bits) × (1 B / 8 b) = 153.6 GB/s.
Global Memory Optimization cont.
Comparing Local, Global and Single Cache Miss Rate
The miss rate decreases as the cache size increases: up to the L1 cache size it decreases by up to 10%, and at L2 the global miss rate decreases by a further 10%.
The global miss rate is similar to the single-cache miss rate at the L2 level.
L2 is not tied to the CPU clock cycle; it affects the miss penalty, which is tied to the miss rate of the first-level cache.
For L2, the global miss rate should be considered: the local miss rate is not a good measure of the second-level cache, since it is a function of the first-level cache and varies if the first-level cache varies.
Global Miss Rate and Local Miss Rate
The global and local miss rates show roughly the same variation at the level-1 and level-2 caches.
For the single-cache miss rate, the variation across levels remains the same in both L1 and L2.
As the cache size increases, the single-cache miss rate decreases.
Single Miss Rate
◦ GPU compute devices are very efficient at parallelizing large numbers of work-items in a manner transparent to the application.
◦ Each GPU compute device uses the large number of wavefronts to hide memory access latencies by having the resource scheduler switch the active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete.
◦ Hiding memory access latencies requires that each work-item contain a large number of ALU operations per memory load/store.
GPU Compute Device Scheduling:
Simplified Execution Of Work-Items On A Single Stream Core
GPU Compute Device Scheduling:
Implementation.
Implementation
Comparison will be done pairwise between:
• a single-threaded C++ implementation,
• OpenCL backed by a CPU driver (the Intel OpenCL driver), and
• OpenCL backed by a GPU driver (the AMD/ATI OpenCL driver, APP).
Mandelbrot Set
Implementation – Mandelbrot Set
The 2-dimensional region of the complex plane will be divided into a 1024x1024 grid.
Each cell of the grid corresponds to a pixel in the visualization.
The Mandelbrot map will be performed up to 1024 times, or until the pixel escapes.
Implementation – Mandelbrot Set
The iterations will be implemented as:
◦ a simple for-loop for the C++ implementation
◦ an OpenCL kernel for the OpenCL implementation
Each pixel will be iterated for up to 1024 iterations, or until it escapes the circle of radius 2.
Afterwards, the pixels will be colored according to the number of iterations.
Implementation – Mandelbrot, C++:

int main() {
    // left end of the x-axis
    const float xL[] = {-2, 0.3f, -0.333939f, -0.44545f, -0.4222f};
    // right end of the x-axis
    const float xR[] = {1, 0.4f, -0.22282f, -0.11212f, -0.31111f};
    // left (top) end of the y-axis
    const float yL[] = {-1, 0.3f, -0.67946f, -0.81202f, -0.7076431f};
    // right (bottom) end of the y-axis
    const float yR[] = {1, 0.4f, -0.54478f, -0.40979f, -0.5729629f};
    // number of sets
    const int nSets = 5;
    // maximum number of iterations
    const int maxIter = 1024;
    // PI
    const float PI = 2*acos(0.0f);
    // number of elements per side of the matrix
    const int nEl = 1024;
    // size of matrix
    size_t datasize = sizeof(int)*(nEl*nEl);
    // matrix containing number of iterations (allocated here; the
    // original listing left it NULL)
    int *mat = new int[nEl*nEl];
Implementation – Mandelbrot, C++:

    // timings file
    ofstream timef("timings.txt", fstream::app);
    for (int setN = 0; setN < nSets; ++setN) {
        timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
              << " ; y = " << yL[setN] << ":" << yR[setN] << endl;
        const int N_TIMES = 10;
        clock_t start = clock();
        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // perform calculation
            for (int i = 0; i < nEl; ++i) {
                for (int j = 0; j < nEl; ++j) {
                    int idx = nEl*i + j;
                    float x0 = xL[setN] + (xR[setN] - xL[setN])*j/nEl;
                    float y0 = yL[setN] + (yR[setN] - yL[setN])*i/nEl;
                    float x = x0;
                    float y = y0;
                    int nIter = 0;
                    for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
                        float x_ = x*x - y*y + x0;
                        y = 2*x*y + y0;
                        x = x_;
                    }
                    mat[idx] = nIter;
                }
            }
        }
Implementation – Mandelbrot.cl:

// calculates the Mandelbrot set
// (N_EL, the grid side length, is assumed defined in the kernel source)
__kernel void mandelbrot(__global int* mat, float xL, float xR,
                         float yL, float yR, int maxIter) {
    int idx = get_global_id(0);
    int i = idx / N_EL;
    int j = idx % N_EL;
    // initial x and y
    float x0 = xL + (xR - xL)*j/N_EL;
    float y0 = yL + (yR - yL)*i/N_EL;
    float x = x0;
    float y = y0;
    int nIter = 0;
    // iterate until escape or maximum iterations
    for (; nIter < maxIter && (x*x + y*y) < 4; ++nIter) {
        float x_ = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = x_;
    }
    mat[idx] = nIter;
}
Implementation – Mandelbrot, OpenCL:

    // Get devices
    cl_context_properties cprops[3] =
        {CL_CONTEXT_PLATFORM, (cl_context_properties)(plat)(), 0};
    cl::Context ctx(CL_DEVICE_TYPE_ALL, cprops, NULL, NULL, &status);
    cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    checkErr(status, "Buffer()");
    vector<cl::Device> devices;
    devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
    timef << "# of devices: " << devices.size() << endl;
Implementation – Mandelbrot, OpenCL:

    // select device
    cl::Device device = devices[0];
    string devName;
    device.getInfo(CL_DEVICE_NAME, &devName);
    timef << "Device Name: " << devName << endl;
    // load program
    ifstream f("mandelbrot.cl");
    std::string progStr(istreambuf_iterator<char>(f),
                        (istreambuf_iterator<char>()));
    cl::Program::Sources source(1,
        std::make_pair(progStr.c_str(), progStr.length()+1));
    cl::Program program(ctx, source);
    status = program.build(devices, "");
    checkErr(status, "Program::build()");
We invoked this code to perform the computation for the OpenCL implementations.
Implementation – Mandelbrot, OpenCL:

    // get kernel
    cl::Kernel kernel(program, "mandelbrot", &status);
    checkErr(status, "Kernel");
    status = kernel.setArg(0, buff);
    checkErr(status, "Kernel::setArg(0)");
    status = kernel.setArg(5, maxIter);
    checkErr(status, "Kernel::setArg(5)");
    // calculate over sets
    for (int setN = 0; setN < nSets; ++setN) {
        status = kernel.setArg(1, xL[setN]);
        checkErr(status, "Kernel::setArg(1)");
        status = kernel.setArg(2, xR[setN]);
        checkErr(status, "Kernel::setArg(2)");
        status = kernel.setArg(3, yL[setN]);
        checkErr(status, "Kernel::setArg(3)");
        status = kernel.setArg(4, yR[setN]);
        checkErr(status, "Kernel::setArg(4)");
        timef << "Set " << setN << ": x = " << xL[setN] << ":" << xR[setN]
              << " ; y = " << yL[setN] << ":" << yR[setN] << endl;
        cl::CommandQueue queue(ctx, device, 0, &status);
        checkErr(status, "CommandQueue()");
        const int N_TIMES = 10;
        clock_t start = clock();
Implementation – Mandelbrot, OpenCL:

        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // enqueue kernel
            cl::Event event;
            status = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                cl::NDRange(nEl*nEl), cl::NullRange, NULL, &event);
            checkErr(status, "enqueue()");
            // wait for kernel to finish
            event.wait();
            // read to matrix (blocking)
            status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize,
                mat, NULL, NULL);
            checkErr(status, "Read()");
        }
Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.
Implementation – Mandelbrot, OpenCL:

        clock_t end = clock();
        double sec = (end - start) / (double) CLOCKS_PER_SEC;
        timef << "Time: " << sec << endl;
        timef << "Per iter: " << sec / (double) N_TIMES << endl;
        // create the image
        CImage im;
        im.Create(nEl, nEl, 24);
        // paint the image
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < nEl; ++j) {
                float u = mat[nEl*i + j];
                if (u == maxIter) {
                    // if part of set, pixel is black
                    im.SetPixelRGB(j, i, 0, 0, 0);
                } else {
                    // otherwise, color it based on number of iterations
                    float x = xL[setN] + (xR[setN] - xL[setN])*j/nEl;
                    float y = yL[setN] + (yR[setN] - yL[setN])*i/nEl;
                    float v = u;
                    float c = v * 2.0f * PI / 256.0f;
                    im.SetPixelRGB(j, i,
                        ((1.0f + cos(c))*0.5f)*255,
                        ((1.0f + cos(2.0f*c + 2.0f*PI/3.0f))*0.5f)*255,
                        ((1.0f + cos(c - 2.0f*PI/3.0f))*0.5f)*255);
                }
            }
        }
Implementation – Mandelbrot, OpenCL:

        // save the image
        ostringstream strm;
        strm << "mandelbrot" << setN << ".bmp";
        im.Save(strm.str().c_str());
    }
    timef.close();
}
Logistic Map
Implementation – Logistic Map
The 1-dimensional interval of the real axis (the r-values) will be divided into 1024 subintervals.
Each subinterval corresponds to a column in the diagram.
2^20 iterations will be performed as warmup, starting with x = 0.4.
Implementation – Logistic Map
Following the warmup, 2^10 = 1024 iterations will be recorded.
Afterwards, these will be plotted along the column.
Again, the iterations will be implemented as:
◦ a simple for-loop for the C++ implementation
◦ an OpenCL kernel for the OpenCL implementation
Implementation – Logistic Map, C++:

    // timings file
    ofstream timef("timings.txt", fstream::app);
    // number of times to perform map (for benchmarking)
    const int N_TIMES = 10;
    for (int setN = 0; setN < nSets; ++setN) {
        timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;
        clock_t start = clock();
        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // Iterate the map
            for (int idx = 0; idx < nEl; ++idx) {
                float r = rL[setN] + (rR[setN] - rL[setN])*idx/nEl;
                float x = 0.4f;
                for (int i = 0; i < warmup; ++i) {
                    x = r*x*(1-x);
                }
                for (int i = 0; i < maxIter; ++i) {
                    mat[maxIter*idx + i] = x = r*x*(1-x);
                }
            }
        }
Implementation – Logistic.cl:

// calculates the logistic function
// (N_EL, the number of columns, is assumed defined in the kernel source)
__kernel void logistic(__global float* mat, float rL, float rR,
                       int warmup, int maxIter) {
    int idx = get_global_id(0);
    // r of the map to iterate on
    float r = rL + (rR - rL)*idx/N_EL;
    float x = 0.4f;
    // warmup
    for (int i = 0; i < warmup; ++i) {
        x = r*x*(1-x);
    }
    // plotted iterates
    for (int i = 0; i < maxIter; ++i) {
        mat[maxIter*idx + i] = x = r*x*(1-x);
    }
}
Implementation – Logistic Map, OpenCL
    // get platforms
    cl_uint nPlatforms = 0;
    cl_platform_id *platforms = NULL;
    vector<cl::Platform> platformList;
    cl::Platform::get(&platformList);
    ofstream timef("timings.txt", fstream::app);
    string vendor;
    platformList[0].getInfo((cl_platform_info) CL_PLATFORM_VENDOR, &vendor);
    timef << "Platform by: " << vendor << endl;
Implementation – Logistic Map, OpenCL:

    // get context
    cl_context_properties cprops[3] =
        {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[1])(), 0};
    cl::Context ctx(CL_DEVICE_TYPE_ALL, cprops, NULL, NULL, &status);
    cl::Buffer buff(ctx, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    checkErr(status, "Buffer()");
    // get devices
    vector<cl::Device> devices;
    devices = ctx.getInfo<CL_CONTEXT_DEVICES>();
    timef << "# of devices: " << devices.size() << endl;
    cl::Device device = devices[0];
    string devName;
    device.getInfo(CL_DEVICE_NAME, &devName);
    timef << "Device Name: " << devName << endl;
Implementation – Logistic Map, OpenCL
    // get program
    ifstream f("logistic.cl");
    std::string progStr(istreambuf_iterator<char>(f),
                        (istreambuf_iterator<char>()));
    cl::Program::Sources source(1,
        std::make_pair(progStr.c_str(), progStr.length()+1));
    cl::Program program(ctx, source);
    status = program.build(devices, "");
    checkErr(status, "Program::build()");
    // get kernel
    cl::Kernel kernel(program, "logistic", &status);
    checkErr(status, "Kernel");
We invoked this function to perform the computation for the OpenCL implementations.
Implementation – Logistic Map, OpenCL:

    // load arguments
    status = kernel.setArg(0, buff);
    checkErr(status, "Kernel::setArg(0)");
    status = kernel.setArg(3, warmup);
    checkErr(status, "Kernel::setArg(3)");
    status = kernel.setArg(4, maxIter);
    checkErr(status, "Kernel::setArg(4)");
    // calculate over sets
    for (int setN = 0; setN < nSets; ++setN) {
        status = kernel.setArg(1, rL[setN]);
        checkErr(status, "Kernel::setArg(1)");
        status = kernel.setArg(2, rR[setN]);
        checkErr(status, "Kernel::setArg(2)");
        timef << "Set " << setN << ": r = " << rL[setN] << ":" << rR[setN] << endl;
        cl::CommandQueue queue(ctx, device, 0, &status);
        checkErr(status, "CommandQueue()");
        const int N_TIMES = 10;
        clock_t start = clock();
Implementation – Logistic Map, OpenCL:

        for (int nTimes = 0; nTimes < N_TIMES; ++nTimes) {
            // enqueue kernel
            cl::Event event;
            status = queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                cl::NDRange(nEl), cl::NullRange, NULL, &event);
            checkErr(status, "enqueue()");
            // wait for kernel to finish
            event.wait();
            // read buffer to memory (blocking)
            status = queue.enqueueReadBuffer(buff, CL_TRUE, 0, datasize,
                mat, NULL, NULL);
            checkErr(status, "Read()");
        }

Event: A token sent through a pipeline that can be used to enforce synchronization, flush caches, and report status back to the host application.
Implementation – Logistic Map, OpenCL:

        // Output timings
        clock_t end = clock();
        double sec = (end - start) / (double) CLOCKS_PER_SEC;
        timef << "Time: " << sec << endl;
        timef << "Per iter: " << sec / (double) N_TIMES << endl;
        // Create image (negative height gives a top-down bitmap)
        CImage im;
        im.Create(nEl, -height, 24);
        // White out the image
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < height; ++j) {
                im.SetPixelRGB(i, j, 255, 255, 255);
            }
        }
Implementation – Logistic Map, OpenCL:

        // Plot the iterates
        for (int i = 0; i < nEl; ++i) {
            for (int j = 0; j < maxIter; ++j) {
                float u = mat[nEl*i + j];
                if (xL <= u && u < xR) {
                    // only plot u if in range
                    u -= xL;
                    im.SetPixelRGB(i, height - 1 - (u*height/xR), 0, 0, 0);
                }
            }
        }
        // Save the image
        ostringstream strm;
        strm << "logistic" << setN << ".bmp";
        im.Save(strm.str().c_str());
    }
    timef.close();
}
Methodology, Results, Analysis
Methodology, Results, Analysis
MethodologyResultsAnalysis
Methodology:
The performance of each implementation for each chaotic phenomenon will be measured by timing them.
◦ They will be timed for 10 runs for each set.
◦ Each set is a 2-D region (Mandelbrot Set) or a 1-D interval (logistic map).
Methodology:
The means will be compared and graphed. Afterwards, accuracy will be determined.
◦ Generated graphs will be compared visually and numerically.
Methodology:
Hardware used:
◦ Intel Core i5 Dual-Core M460 @ 2.53 GHz
◦ AMD Mobility Radeon HD 5145 (ATI RV710)
Results:
Mandelbrot Set.Logistic Map.
Results – Mandelbrot Set:
Figure: Mandelbrot Set timings in seconds for Sets 1-5 (vertical axis 0-8), comparing Plain C++, OpenCL on GPU, and OpenCL on CPU.
Results – Mandelbrot Set:
Performance
◦ The OpenCL implementations are faster than plain C++ by roughly 10 times.
◦ OpenCL running on CPU and GPU both take less than a second.
◦ An order of magnitude difference in runtime.
Results – Mandelbrot Set:
Performance
◦ OpenCL running on the GPU runs in 3/4 the time of OpenCL on the CPU.
◦ Less difference than expected (more on this later).
Results – Mandelbrot Set:
Accuracy
◦ To be determined by both visual comparison and numerical comparison of the generated visualizations.
◦ Visualizations follow on the next slides.
Mandelbrot Set – Set 1 - C++
Mandelbrot Set – Set 1 – OpenCL, GPU
Mandelbrot Set – Set 1 – OpenCL, CPU
Mandelbrot Set – Set 2 - C++
Mandelbrot Set – Set 2 – OpenCL, GPU
Mandelbrot Set – Set 2 – OpenCL, CPU
Mandelbrot Set – Set 3 – C++
Mandelbrot Set – Set 3 – OpenCL, GPU
Mandelbrot Set – Set 3 – OpenCL , CPU
Mandelbrot Set – Set 4 – C++
Mandelbrot Set – Set 4 – OpenCL, GPU
Mandelbrot Set – Set 4 – OpenCL, CPU
Mandelbrot Set – Set 5 – C++
Mandelbrot Set – Set 5 – OpenCL, GPU
Mandelbrot Set – Set 5 – OpenCL, CPU
Mandelbrot Set – Results:
Accuracy
◦ Visual comparison gives no apparent difference.
◦ Numerical comparison confirms this: no difference in number of iterations.
◦ Perfect accuracy.
Logistic Map – Results:
Figure: Logistic Map timings in seconds for Sets 1-4 (vertical axis 0-8), comparing Plain C++, OpenCL on GPU, and OpenCL on CPU.
Logistic Map – Results:
Performance
◦ Similar results to the Mandelbrot Set.
◦ Plain C++ takes roughly 7 seconds.
◦ OpenCL running on CPU and GPU both take less than a second.
◦ An order of magnitude difference in runtime.
Logistic Map – Results:
Performance
◦ OpenCL running on the GPU runs in 1/2 the time of OpenCL on the CPU.
◦ A greater difference, but still less than expected (more on this later).
Logistic Map – Results:
Accuracy
◦ Also to be determined by both visual and numerical comparison of the generated visualizations.
◦ Visualizations follow on the next slides.
Logistic Map – Set 1 – C++
Logistic Map – Set 1 – OpenCL, GPU
Logistic Map – Set 1 – OpenCL, CPU
Logistic Map – Set 2 – C++
Logistic Map – Set 2 – OpenCL, GPU
Logistic Map – Set 2 – OpenCL, CPU
Logistic Map – Set 3 – C++
Logistic Map – Set 3 – OpenCL, GPU
Logistic Map – Set 3 – OpenCL, CPU
Logistic Map – Set 4 – C++
Logistic Map – Set 4 – OpenCL, GPU
Logistic Map – Set 4 – OpenCL, CPU
Logistic Map - Results
Accuracy
◦ Once again, no noticeable difference can be observed visually.
◦ Numerical comparison also confirms this.
Analysis:
For both chaotic phenomena investigated, an order of magnitude difference in speed was observed between the OpenCL and plain C++ implementations.
Also, no visible difference in accuracy was found. Thus, OpenCL can be considered an excellent way to boost performance.
Analysis:
For both chaotic phenomena, GPU was faster than CPU.
However, this difference is smaller than expected, considering the parallelism of the problem and of GPUs.
Analysis:
Possible explanation:
◦ Bus transfer from CPU to GPU takes too much time.
Possible solution:
◦ Increase the workload of the kernel so as to minimize the required transfer.
Analysis:
CPU performance in and of itself is remarkable.
Improvement over plain C++ is an order of magnitude, but only a dual-core CPU was used.
◦ Expected improvement: factor of 2. Actual: factor of 10.
Possible explanation:
◦Excellent optimization by Intel OpenCL driver
Conclusions:
SummaryContributionsFuture Work
Summary:
GPGPU provides access to massive parallelism
◦ But only data parallelism
This is due to GPU architecture being specialized for massive data parallelism
OpenCL gives us easy access to GPGPU
◦ Along with parallelization for CPUs and embedded devices
Summary
Chaotic phenomena require large amounts of computation
However, this is usually very data-parallel. Prime examples:
◦ Mandelbrot Set
◦ Bifurcation diagram of the logistic map
Summary:
OpenCL was used to investigate how useful GPGPU can be for investigation of chaotic phenomena
Results are spectacular:
◦ 10x improvement over plain C++, even for CPU-driven OpenCL
◦ GPU-driven OpenCL still faster than CPU-driven OpenCL
Summary
Analysis of benefits and complications from using OpenCL:
◦ Speed
◦ Accuracy
◦ Code complexity
Contributions:
This project shows that OpenCL can be used to greatly speed up computations for investigation of chaotic phenomena
And in general, computation of highly data-parallel work
Contributions:
OpenCL can be used regardless of whether a GPU is available
◦ OpenCL can be used to parallelize serial implementations for the CPU
◦ Still yields massive improvements
Future Work:
Increase the workload of the kernel, thus reducing the data transfer required and latency incurred.
◦ Data transfer to the GPU is very slow
◦ May have to contend with other bus users
◦ Even without data, latency is high (off-chip)
Future Work:
Investigate using highly-optimized code for both OpenCL and C++
◦ A more realistic comparison between OpenCL and C++
◦ However, may accidentally lead to optimizing C++ more than OpenCL, or vice versa
Future Work:
Investigation of other chaotic phenomena
◦ Lorenz strange attractor
◦ Burning ship fractal
◦ Mandelbar fractal
In general, highly data-parallel work.
References:
These slides contain material developed and copyrighted by Gita Alaghband (UC Denver) and AMD (http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx).
[1] Alligood, K. T., Sauer, T., and Yorke, J. A. Chaos: An Introduction to Dynamical Systems. New York City, NY: Springer-Verlag, 1997. Print.
[2] “AMD Accelerated Parallel Processing SDK.” AMD Developer Central. AMD, n.d. Web. 6 Mar 2012.
[3] Devaney, Robert L. An Introduction to Chaotic Dynamical Systems, 2nd ed. Boulder, CO: Westview Press, 2003. Print.
[4] Garcia, V., E. Debreuve, and M. Barlaud. “Fast k Nearest Neighbor Search using GPU.” In Proceedings of the CVPR Workshop on Computer Vision on GPU, 2008. Print.
[5] Harrison, Owen, and John Waldron. “AES on SM3.0 Compliant GPUs.” In Proceedings of CHES 2007. Print.
[6] “Intel® OpenCL SDK.” Intel Visual Computing Source. Intel Corporation, n.d. Web. 6 Mar 2012.
[7] Mancheril, Naju. “GPU-based Sorting in PostgreSQL.” Thesis, School of Computer Science, Carnegie Mellon University. Print.
[8] Milnor, John W. Dynamics in One Complex Variable, 3rd ed. Annals of Mathematics Studies 160. Princeton, NJ: Princeton University Press, 2006.
[9] “OpenCL.” Nvidia Developer Zone. Nvidia, n.d. Web. 6 Mar 2012.
[10] “OpenCL.” Khronos Group, n.d. Web. 6 Mar 2012.
[11] Scarpino, Matthew. OpenCL in Action. Greenwich, CT: Manning Publications, 2011. Print.
[12] Strogatz, Steven. Nonlinear Dynamics and Chaos. Perseus Publishing, 2000.
[13] Vasiliadis, Giorgos, et al. “GrAVity: A Massively Parallel Antivirus Engine.” In Proceedings of RAID 2010. Print.
[14] Vasiliadis, Giorgos, et al. “Regular Expression Matching on Graphics Hardware for Intrusion Detection.” In Proceedings of RAID 2009. Print.
[16] “CUDA Zone.” Nvidia Developer Zone. Nvidia, n.d. Web. 27 Mar 2012.
[17] “Next Generation CUDA Architecture, Code Named Fermi.” Nvidia, n.d. Web. 27 Mar 2012.
[18] Friedrichs, M. S., et al. “Accelerating Molecular Dynamic Simulation on Graphics Processing Units.” Journal of Computational Chemistry 30 (6): 864-72, 2009. Web. 27 Mar 2012.
[19] Pande, Vijay, and Stanford University. “Folding@home.” Stanford, CA: Stanford University, 2012. Web. 27 Mar 2012.
[20] Pande, Vijay, and Stanford University. “Folding@home team stats pages.” Stanford, CA: Stanford University, 2012. Web. 27 Mar 2012.
[21] Fung, et al. “Mediated Reality Using Computer Graphics Hardware for Computer Vision.” In Proceedings of the International Symposium on Wearable Computing 2002 (ISWC 2002), Seattle, WA, 7-10 2002, 83-89. Web. 27 Mar 2012.
[22] Harris, Mark. “Mapping Computational Concepts to GPUs.” In ACM SIGGRAPH 2005 Courses (Los Angeles, California, 31 July - 4 August 2005). J. Fujii, Ed. SIGGRAPH '05. ACM Press, New York, NY, 50. Web. 27 Mar 2012.
[23] “About the Khronos Group.” Khronos Group, n.d. Web. 27 Mar 2012.
[24] Fang, Jianbin, et al. “A Comprehensive Performance Comparison of CUDA and OpenCL.” In Parallel Processing (ICPP), 2011 International Conference on, 13-16 Sept. 2011. Web. 27 Mar 2012.
[25] Jaaskelainen, P. O. “OpenCL-based Design Methodology for Application-Specific Processors.” In Embedded Computer Systems (SAMOS), 2010 International Conference on, pp. 223-230. Web. 27 Mar 2012.
[26] Li, T. Y., and Yorke, J. A. “Period Three Implies Chaos.” American Mathematical Monthly 82 (10): 985-92, 1975. Web. 27 Mar 2012.
[27] Farber, Rob. “CUDA, Supercomputing for the Masses: Part 17.” Dr. Dobb’s, 14 Apr 2010. Web. 27 Mar 2012.
[28] Weisstein, Eric W. “Logistic Map.” From MathWorld, A Wolfram Web Resource. http://mathworld.wolfram.com/LogisticMap.html
[29] “Mandel zoom 01 head and shoulder.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[30] “Mandel zoom 00 mandelbrot set.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[31] “TwoLorenzOrbits.jpg.” Wikimedia Commons. Web. 27 Mar 2012.
[32] “Logistic Bifurcation map High Resolution.png.” Wikimedia Commons. Web. 27 Mar 2012.
Questions?