TRANSCRIPT
Lecture 12:
Attributes, Performance and Scalability Many-Core vs. CUDA
Synchronization
Ways to Manage Parallelism
Theoretical Models
Flynn's Taxonomy

                          Single Data    Multiple Data
  Single Instruction          SISD            SIMD
  Multiple Instruction        MISD            MIMD

PRAM (Parallel Random Access Machine)

                          Exclusive Write    Concurrent Write
  Exclusive Read              EREW                ERCW
  Concurrent Read             CREW                CRCW
Parallelizable Problems
Some are easy
Some are not so easy
CGI Animation - Model Rendering (e.g. "Toy Story")
String Processing - Which of Shakespeare's works contain the word "forsooth"?
CGI Animation - Interactive Physics Simulation (e.g. colliding galaxies)
String Processing - Generate an alphabetical list of all words in the works of Shakespeare
Historical Performance
MapReduce - Intel ManyCore
Data Parallel Computing - NVidia CUDA (Compute Unified Device Architecture)
Real Parallel Implementations
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
The MapReduce protocol was developed at Google. (Hadoop is an open-source version)
MapReduce is loosely based on functional programming
Before MapReduce, every programmer writing a parallel/distributed application had to worry about the details of concurrency and robustness (fault tolerance)
MapReduce abstracts the hardware and concurrency control issues away from the applications programmer
MapReduce is more accurately Map-Sort-Group-Accumulate (not a marketable name:-)
To ensure fault-tolerance data is duplicated and distributed across multiple processors/computers so the same (overlapping) parts of one virtual file can live on many computers.
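As a concrete illustration of the Map-Sort-Group-Accumulate pipeline, here is a minimal single-process sketch in Python (word counting is an assumed example, not from the slides):

```python
# A minimal, single-process sketch of the MapReduce model
# (map -> sort -> group -> accumulate), using word counting.
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Emit an intermediate (key, value) pair for every word.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Merge all intermediate values that share a key.
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))            # Sort
    return [reduce_fn(key, [v for _, v in group])   # Group + Accumulate
            for key, group in groupby(intermediate, key=itemgetter(0))]

docs = [(1, "to be or not to be")]
print(map_reduce(docs, map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

A real implementation distributes the map and reduce calls across machines; the sorting/grouping in the middle is what the framework does between the two user-supplied functions.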
Parallelism in the form of MapReduce
Overview of MapReduce
[Figure: MapReduce data flow. Mapper functions (f1) turn key-value pairs into intermediate key-value pairs; these are sorted and grouped into key-value-list pairs; Reducer functions (f2) accumulate each list into key-result pairs.]
Berkeley CS 61A - Brian Harvey - http://www.youtube.com/watch?v=mVXpvsdeuKU
Example Problem:
Input: A list of every grade for every student in every class taken, currently keyed by course-studentID
Compute the current GPA for each student.
Input (keyed by course-studentID):
  csc101-M001, 78
  csc101-M033, 95
  csc101-M002, 83
  csc101-M101, 100
  csc101-M001, 85
  csc101-M033, 99
  ...
  csc145-M001, 56
  csc145-M099, 89
  csc145-M045, 94
  ...
  csc540-M001, 100
  csc540-M033, 45

Map (re-key by studentID-course):
  M001-csc101, 78
  M001-csc101, 85
  M033-csc101, 95
  M002-csc101, 83
  M101-csc101, 100
  ...
  M001-csc540, 100
  M033-csc540, 45

Group (collect values per key):
  M001-csc101, (78, 85, ...)
  M002-csc101, (83, 99, ...)
  M033-csc101, (95, 56, ...)
  M001-csc540, (100, 80, ...)
  M033-csc540, (45, 20, ...)

Reduce (GPA per student):
  M001, 3.45
  M002, 3.10
  M033, 1.99
  ...
  M101, 2.85
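The pipeline above can be sketched in a few lines of Python. The percentage-to-GPA conversion used here (grade / 25.0) is an illustrative assumption; the slide's actual grading formula is not given:

```python
# A sketch of the slide's GPA pipeline: records keyed by
# course-studentID are re-keyed by student in the Map step,
# grouped, and averaged in the Reduce step.
from collections import defaultdict

def map_fn(record):
    course_student, grade = record          # e.g. ("csc101-M001", 78)
    course, student = course_student.split("-")
    return (student, grade)

def reduce_fn(student, grades):
    # Assumed conversion: percentage average scaled to a 4.0 GPA.
    gpa = (sum(grades) / len(grades)) / 25.0
    return (student, round(gpa, 2))

def map_group_reduce(records):
    groups = defaultdict(list)              # Group step
    for rec in records:
        student, grade = map_fn(rec)        # Map step
        groups[student].append(grade)
    return sorted(reduce_fn(s, g) for s, g in groups.items())

records = [("csc101-M001", 78), ("csc101-M033", 95),
           ("csc540-M001", 100), ("csc540-M033", 45)]
print(map_group_reduce(records))
# [('M001', 3.56), ('M033', 2.8)]
```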
MapReduce
http://hadoop.apache.org/mapreduce/
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application data parallelism and thread parallelism
GeForce 8800
Tesla S870
Tesla D870
CUDA Stack
The CUDA software stack is composed of several layers: (1) a hardware driver, (2) an application programming interface (API) and its runtime, and (3) two higher-level mathematical libraries in common use, CUFFT and CUBLAS.
http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf
CUDA Thread Batching
The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks.
This method works best on those problems for which parallelization is "easy".
Implementing "hard" parallel solutions on such a system requires active participation by the applications programmer.
The pay-off is a significant speed-up compared to approaches in which the details of parallelization have been hidden from the applications programmer.
Clock Synchronization
Time Sources
De Facto Primary Standard – International Atomic Time (TAI)
Keeping of TAI started in 1955
1 atomic second = 9,192,631,770 periods of the radiation from the hyperfine transition of Cs-133 (caesium)
86,400 atomic seconds = 1 solar day − 3 ms
Coordinated Universal Time (UTC) – International Standard
Keeping of UTC started in 1961
Derived from TAI by adding leap seconds to keep it close to solar time
UTC source signals are synchronized
UTC time is re-transmitted by GPS satellites
Local clocks are based on oscillators
Clock synchronization is a problem from computer science and engineering which deals with the idea that internal clocks of several computers may differ. Even when initially set accurately, real clocks will differ after some amount of time due to clock drift, caused by clocks counting time at slightly different rates. There are several problems that occur as a repercussion of rate differences and several solutions, some being more appropriate than others in certain contexts.
http://en.wikipedia.org/wiki/Clock_synchronization
Computer Science 425 Distributed Systems - Klara Nahrstedt
Skew: s(t) = Ci(t) − Cj(t)

Maximum Drift Rate (MDR) ρ: |t − Ci(t)| ≤ ρt, equivalently (1 − ρ) ≤ dCi(t)/dt ≤ (1 + ρ)

Synchronization interval R and synchronization bound D:
|Ci(t) − Cj(t)| ≤ 2ρt, so requiring |Ci(R) − Cj(R)| ≤ 2ρR ≤ D gives R ≤ D/(2ρ)
This calculation ignored propagation delays
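The bound R ≤ D/(2ρ) is easy to check numerically; the drift rate and skew bound below are assumed example values:

```python
# Worked example of the resynchronization bound R <= D / (2*rho).
# With maximum drift rate rho, two clocks drifting in opposite
# directions can diverge by at most 2*rho*t after t seconds, so to
# keep mutual skew within D they must resynchronize at least every
# D / (2*rho) seconds.
def max_resync_interval(rho, D):
    return D / (2 * rho)

rho = 1e-6   # 1 microsecond of drift per second (typical quartz)
D = 1e-3     # keep clocks within 1 ms of each other
R = max_resync_interval(rho, D)
print(R)     # roughly 500 seconds, i.e. resync every ~8.3 minutes
```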
Clock Skew & Drift Rates
Computer Science 425 Distributed Systems - Klara Nahrstedt
Besides the incorrectness of the time itself, there are problems associated with clock skew that take on more complexity in a distributed system in which several computers will need to realize the same global time.
For instance, in Unix systems the make command is used to compile new or modified code without the need to recompile unchanged code. The make command uses the clock of the machine it runs on to determine which source files need to be recompiled. If the sources reside on a separate file server and the two machines have unsynchronized clocks, the make program might not produce the correct results.
Time Synchronization Problems in Distributed Systems
http://en.wikipedia.org/wiki/Clock_synchronization
In a centralized system the solution is trivial: the centralized server dictates the system time. Cristian's algorithm and the Berkeley algorithm are solutions to the clock synchronization problem in a centralized server environment. In a distributed system the problem takes on more complexity because a global time is not easily known. The most widely used clock synchronization solution on the Internet is the Network Time Protocol (NTP), a layered client-server architecture based on UDP message passing. Lamport timestamps and vector clocks are logical-clock concepts for distributed systems.
Solutions to Synchronization Problems
http://en.wikipedia.org/wiki/Clock_synchronization
Cristian's algorithm
Cristian's algorithm relies on the existence of a time server. The time server maintains its clock by using a radio clock or other accurate time source, then all other computers in the system stay synchronized with it. A time client will maintain its clock by making a procedure call to the time server. Variations of this algorithm make more precise time calculations by factoring in network propagation time.
http://en.wikipedia.org/wiki/Clock_synchronization
mr – message in which client pi asks time server S for the time
mt – message in which time server S responds with its current time T
Client pi uses T in mt to set its clock
Computer Science 425 Distributed Systems - Klara Nahrstedt
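Cristian's algorithm can be sketched as a single clock-setting rule (the timestamps below are made-up example values):

```python
# A sketch of Cristian's algorithm, simulated in one process.
# The client asks the server for the time and compensates for
# network delay by assuming the reply took half the round trip.
def cristian_set_clock(server_time, t_request_sent, t_reply_received):
    rtt = t_reply_received - t_request_sent
    # Assume symmetric delay: the server stamped T halfway through.
    return server_time + rtt / 2

# Client sends at local time 100.000, server replies T = 105.020,
# reply arrives at local time 100.010 (RTT = 10 ms).
new_clock = cristian_set_clock(105.020, 100.000, 100.010)
print(round(new_clock, 3))   # 105.025
```

The half-RTT correction is exactly the "factoring in network propagation time" refinement the slide mentions; it is only accurate when the request and reply delays are roughly symmetric.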
Berkeley algorithm
This algorithm is more suitable for systems where a radio clock is not present; such a system has no way of determining the actual time other than by maintaining a global average time as the global time. A time server periodically fetches the time from all the time clients, averages the results, and then reports back to the clients the adjustment that needs to be made to their local clocks to achieve the average. This algorithm highlights the fact that internal clocks may vary not only in the time they contain but also in the clock rate. Often, any client whose clock differs by a value outside of a given tolerance is disregarded when averaging the results. This prevents the overall system time from being drastically skewed by one erroneous clock.
http://en.wikipedia.org/wiki/Clock_synchronization
Use an elected leader process to keep the maximum skew among clients within a bound.
The elected leader broadcasts to all machines asking for their time, adjusts the times received for transmission delay and latency, averages the times after removing outliers, and tells each machine how to adjust its clock.
In some systems multiple time leaders are used.
Averaging clients' clocks may cause the entire system to drift away from UTC over time
Failure of the leader requires some time for re-election, so accuracy cannot be guaranteed
Computer Science 425 Distributed Systems - Klara Nahrstedt
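The leader's averaging step might be sketched as follows (the tolerance and clock values are assumed examples):

```python
# A sketch of the Berkeley algorithm: the leader polls clients,
# discards outliers beyond a tolerance, averages the remaining
# clocks (including its own), and sends each machine a delta.
def berkeley_adjustments(leader_time, client_times, tolerance):
    times = [leader_time] + list(client_times)
    # Drop clocks too far from the leader before averaging.
    kept = [t for t in times if abs(t - leader_time) <= tolerance]
    avg = sum(kept) / len(kept)
    # Every machine (kept or not) is told how far to move.
    return [avg - t for t in times]

# Leader at t=180 s, clients at 185, 175, and one wild clock at 540.
deltas = berkeley_adjustments(180.0, [185.0, 175.0, 540.0], 30.0)
print(deltas)   # [0.0, -5.0, 5.0, -360.0]
```

Note that the wild clock is excluded from the average but still receives a correction, which is the outlier-handling behavior the text describes.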
Network Time Protocol
A related approach, Clock Sampling Mutual Network Synchronization (CS-MNS), is a class of mutual network synchronization algorithms in which no master or reference clocks are needed: all clocks participate equally in synchronizing the network by exchanging their timestamps in regular beacon packets. CS-MNS is suitable for distributed and mobile applications; it has been shown to be scalable, accurate on the order of a few microseconds, and compatible with IEEE 802.11 and similar standards.
http://en.wikipedia.org/wiki/Clock_synchronization
• Provides a UTC synchronization service across the Internet
• Uses time servers to synchronize networked processes
• Time servers are connected in a synchronization subnet tree
• The root is adjusted directly
• Each node synchronizes its children
Computer Science 425 Distributed Systems - Klara Nahrstedt
Reference Broadcast Synchronization (RBS)
This algorithm is often used in wireless and sensor networks. In this scheme, an initiator broadcasts a reference message that prompts the receivers to adjust their clocks.
http://en.wikipedia.org/wiki/Clock_synchronization
A critical path analysis for traditional time synchronization protocols (top) and RBS (bottom).
For traditional protocols working on a LAN, the largest contributions to nondeterministic latency are the Send Time (from the sender’s clock read to delivery of the packet to its NIC, including protocol processing) and Access Time (the delay in the NIC until the channel becomes free).
The Receive Time tends to be much smaller than the Send Time because the clock can be read at interrupt time, before protocol processing.
In RBS, the critical path length is shortened to include only the time from the injection of the packet into the channel to the last clock read.
Fine-Grained Network Time Synchronization using Reference Broadcasts - Elson
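The key idea, that sender-side delays cancel because every receiver timestamps the same broadcast, can be sketched by estimating the relative offset between two receivers (the timestamps below are assumed example values):

```python
# A sketch of Reference Broadcast Synchronization: receivers
# timestamp the SAME broadcast on arrival, then exchange those
# timestamps.  The sender's nondeterministic send/access delays
# cancel out because they are common to all receivers; only the
# receivers' relative clock offset remains.
def rbs_relative_offset(recv_times_a, recv_times_b):
    # Average offset of B's clock relative to A's over many beacons.
    diffs = [b - a for a, b in zip(recv_times_a, recv_times_b)]
    return sum(diffs) / len(diffs)

# Nodes A and B timestamp three reference broadcasts.
a = [10.000, 20.000, 30.000]
b = [10.003, 20.004, 30.002]   # B's clock runs ~3 ms ahead of A's
print(round(rbs_relative_offset(a, b), 4))   # 0.003
```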
Happens-Before Relation on Events
http://en.wikipedia.org/wiki/Happened-before
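A minimal sketch of Lamport logical clocks, which realize the happens-before relation (this implementation is illustrative, not from the lecture): if event e happened-before event f, then clock(e) < clock(f), though the converse does not hold.

```python
# Lamport logical clocks: increment on every local event and send;
# on receive, jump past the message's timestamp.
class LamportClock:
    def __init__(self):
        self.time = 0
    def local_event(self):
        self.time += 1
        return self.time
    def send(self):
        self.time += 1
        return self.time          # timestamp carried by the message
    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
a = p.local_event()    # event a on process p
t = p.send()           # p sends a message to q
b = q.receive(t)       # event b on process q
print(a, t, b)         # 1 2 3 -- a -> send -> b, timestamps increase
```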
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.
OpenCL includes a language for writing kernels and APIs that are used to control the platforms.
OpenCL provides parallel computing using task-based and data-based parallelism.
AMD/ATI - Stream SDK
Nvidia - offers OpenCL as an alternative to its CUDA
OpenCL gives any application access to the Graphics Processing Unit for non-graphical computing. Thus, OpenCL extends the power of the Graphics Processing Unit beyond graphics (General-purpose computing on graphics processing units).
OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. OpenCL is managed by the non-profit technology consortium Khronos Group.
OpenCL
http://en.wikipedia.org/wiki/OpenCL
http://www.khronos.org/opencl/
Some Basic Concepts
OpenCL programs are divided into two parts:
(1) one that executes on the device (in our case, on the GPU)
(2) the other that executes on the host (in our case, the CPU).
In order to execute code on the device, programmers can write special functions (called kernels), which are coded with the OpenCL Programming Language - a sort of C with some restrictions and special keywords and data types.
On the other hand, the host program offers an API so that you can manage your device execution. The host can be programmed in C or C++ and it controls the OpenCL environment.
The kernel is a function that is executed on the device. There are other functions that can be executed on the device but the kernels are entry points to the device program. In other words, kernels are the only functions that can be called from the host.
How do I program a Kernel?
How do I express the parallelism with kernels?
What is the execution model?
Definition of the Kernel
http://opencl.codeplex.com/
SIMT: stands for Single Instruction, Multiple Thread, which is how instructions are executed on the device. That is, the same code is executed in parallel by different threads, and each thread executes the code on different data.
Work-item: the work-items are equivalent to CUDA threads, and are the smallest unit of execution. Each time a Kernel is launched, an appropriate number of work-items are launched, each one executing the same code. Each work-item has an ID, which is accessible from the kernel, and which is used to distinguish the data to be processed by each work-item.
Work-group: work-groups exist to enable cooperation among work-items. They reflect how work-items are organized (as an N-dimensional grid of work-items, with N = 1, 2 or 3). Work-groups correspond to CUDA thread blocks. Like work-items, work-groups have a unique ID that can be accessed from the kernel.
Units of Execution
http://opencl.codeplex.com/
ND-Range: the next organization level, specifying how work-groups are organized (as an N-dimensional grid of work-groups, N = 1, 2 or 3)
ND-Range
http://opencl.codeplex.com/
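How the work-item, work-group, and ND-range levels compose can be shown with the standard 1-D index arithmetic (mirroring OpenCL's get_group_id, get_local_id and get_global_id; the work-group size below is an assumed example):

```python
# 1-D OpenCL index arithmetic: a work-item's global ID is its
# work-group's ID times the work-group size, plus its local ID.
def global_id(group_id, local_size, local_id):
    return group_id * local_size + local_id

local_size = 64                       # work-items per work-group
print(global_id(0, local_size, 5))    # 5   (item 5 of group 0)
print(global_id(2, local_size, 5))    # 133 (item 5 of group 2)
```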
Sample kernel that adds two vectors. This kernel takes four parameters:
src_a – one of the vectors to be added
src_b – the other vector to be added
res – the vector that holds the result
num – the number of elements in each vector
Example - Sum of Two Vectors
CPU version looks like this
Example - Sum of Two Vectors
Rather than one thread iterating through all the vector elements, we allocate a separate thread for each pair of elements. The index of each thread corresponds to the index of the vector elements.
The reserved word __kernel specifies that the function is a kernel.
Kernel functions always return void.
The reserved word __global specifies that the pointer arguments live in global device memory.
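The slide's code listings did not survive transcription; as a stand-in, the CPU loop and the per-work-item kernel body can be sketched in Python (the OpenCL original would be written in C, with the kernel using get_global_id(0) in place of the explicit index):

```python
# The mapping the slides describe: the CPU version is one loop over
# all elements; the kernel version gives each "work-item" a single
# index to process.
def vector_add_cpu(src_a, src_b, num):
    res = [0] * num
    for i in range(num):              # one thread walks every element
        res[i] = src_a[i] + src_b[i]
    return res

def vector_add_kernel(src_a, src_b, res, idx):
    # Body executed by the work-item whose global ID is idx.
    res[idx] = src_a[idx] + src_b[idx]

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
res = [0] * 4
for idx in range(4):                  # the runtime launches one
    vector_add_kernel(a, b, res, idx) # work-item per element
print(res)                            # [11, 22, 33, 44]
```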
Creating the basic OpenCL run-time environment
Platform: "The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."
Platforms are represented by cl_platform_id handles, which are obtained with the clGetPlatformIDs function.
Device: represented by cl_device_id handles, obtained with the clGetDeviceIDs function.
Context: defines the OpenCL environment, including OpenCL kernels, devices, memory management, and command-queues. Contexts in OpenCL are referenced by cl_context objects, which are created with the clCreateContext function.
Command-Queue: an object in which OpenCL commands are enqueued to be executed by the device. "The command-queue is created on a specific device in a context [...] Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization."
Allocating Memory
To execute the vector sum kernel we need to allocate 3 vectors and initialize 2 of them.
To perform memory allocation on the CPU, we would do something like this:
To represent memory allocated on the device we use the cl_mem type. Memory is allocated with clCreateBuffer.
Allocating Memory
Valid values for flags are:
CL_MEM_READ_WRITE
CL_MEM_WRITE_ONLY
CL_MEM_READ_ONLY
CL_MEM_USE_HOST_PTR
CL_MEM_ALLOC_HOST_PTR
CL_MEM_COPY_HOST_PTR
This function is used to allocate memory for src_a_d, src_b_d, and res_d.
OpenCL Programs vs OpenCL Kernels
Kernel: kernels are essentially functions that we can call from the host and that will run on the device. Note: all code that runs on the device, including both kernels and non-kernel functions called by kernels, is compiled at run time.
Program: an OpenCL program is formed by a set of kernels, functions, and declarations, and is represented by a cl_program object. When creating a program, you must specify which files compose the program before compiling it.
Create a Program
Compile a Program
View Compile Log
Extract Entry Point
Sample Program
Launching the Kernel
Once our kernel object exists, we can now launch it.
First of all, we must set up the arguments. This is done using the clSetKernelArg function.
This function should be called for each argument.
After all arguments are set, we can launch the kernel using clEnqueueNDRangeKernel.
Reading back
Reading back is easy. Analogously to what we did to write to memory, we now enqueue a read-buffer operation, which is done using clEnqueueReadBuffer.
Cleaning up
All objects and memory allocated with clCreate* functions are destroyed with the corresponding clRelease* functions.