Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization


Page 1: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Lecture 12:

    Attributes, Performance and Scalability
    Many-Core vs. CUDA
    Synchronization

Page 2: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Ways to Manage Parallelism

Theoretical Models

Flynn's Taxonomy

                              Single Data    Multiple Data
    Single Instruction        SISD           SIMD
    Multiple Instruction      MISD           MIMD

PRAM (Parallel Random Access Model)

                              Exclusive Write    Concurrent Write
    Exclusive Read            EREW               ERCW
    Concurrent Read           CREW               CRCW

Page 3: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Parallelizable Problems

Some are easy:

    CGI Animation - Model Rendering (e.g. "Toy Story")

    String Processing - Which Shakespeare works contain the word "forsooth"?

Some are not so easy:

    CGI Animation - Interactive Physics Simulation (e.g. colliding galaxies)

    String Processing - Generate an alphabetical list of all words in the works of Shakespeare

Page 4: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Historical Performance

Page 5: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Real Parallel Implementations

MapReduce - Intel ManyCore

Data Parallel Computing - NVidia CUDA (Compute Unified Device Architecture)

Page 6: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The MapReduce protocol was developed at Google. (Hadoop is an open-source version)

MapReduce is loosely based on functional programming

Before MapReduce, every programmer writing a parallel/distributed application had to worry about the details of concurrency and robustness (fault-tolerance)

MapReduce abstracts the hardware and concurrency control issues away from the applications programmer

MapReduce is more accurately Map-Sort-Group-Accumulate (not a marketable name :-)

To ensure fault-tolerance, data is duplicated and distributed across multiple processors/computers, so the same (overlapping) parts of one virtual file can live on many computers.

Parallelism in the form of MapReduce

Page 7: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Overview of MapReduce

[Diagram: key-value pairs are fed to the Mapper (f1), the emitted pairs are sorted and grouped into key-value-list pairs, and the Reducer (f2) turns each list into key-result pairs.]

Berkeley CS 61A - Brian Harvey - http://www.youtube.com/watch?v=mVXpvsdeuKU

Page 8: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Example Problem:

Input: A list of every grade for every student in every class taken, currently keyed by course-studentID

Compute the current GPA for each student.

Input (keyed by course-studentID):

    csc101-M001, 78
    csc101-M033, 95
    csc101-M002, 83
    csc101-M101, 100
    csc101-M001, 85
    csc101-M033, 99
    :
    csc145-M001, 56
    csc145-M099, 89
    csc145-M045, 94
    :
    csc540-M001, 100
    csc540-M033, 45
    :

Map:

    M001-csc101, 78
    M001-csc101, 85
    M033-csc101, 95
    M002-csc101, 83
    M101-csc101, 100
    :
    M001-csc540, 100
    M033-csc540, 45

Group:

    M001-csc101, (78, 85, ...)
    M002-csc101, (83, 99, ...)
    M033-csc101, (95, 56, ...)
    M001-csc540, (100, 80, ...)
    M033-csc540, (45, 20, ...)

Reduce:

    M001, 3.45
    M002, 3.10
    M033, 1.99
    :
    M101, 2.85
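
As a rough illustration, here is a short C sketch of the map and reduce functions for this example, deliberately simplified to key directly on the student ID (the parsing, the emit callbacks, and the grade_to_points conversion are assumptions, not the lecture's code):

    #include <string.h>

    /* Map: input key "csc101-M001", value 78  ->  emit ("M001", 78).
     * The framework then sorts and groups all values that share a student ID. */
    void map(const char *course_student, int grade,
             void (*emit)(const char *key, int value))
    {
        const char *dash = strchr(course_student, '-');
        if (dash != NULL)
            emit(dash + 1, grade);        /* key = student ID, value = grade */
    }

    /* Reduce: input key "M001", value list (78, 85, ...)  ->  emit ("M001", GPA).
     * grade_to_points() is a hypothetical 0-100 to 0.0-4.0 conversion. */
    extern double grade_to_points(int grade);

    void reduce(const char *student, const int *grades, int n,
                void (*emit)(const char *key, double gpa))
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += grade_to_points(grades[i]);
        emit(student, n > 0 ? sum / n : 0.0);
    }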

Page 9: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

MapReduce

http://hadoop.apache.org/mapreduce/

Page 10: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers

• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications

• GPU parallelism is doubling every year

• Programming model scales transparently

• Programmable in C with CUDA tools

• Multithreaded SIMD model uses application data parallelism and thread parallelism

GeForce 8800

Tesla S870

Tesla D870

Page 11: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

CUDA Stack

The CUDA software stack is composed of several layers: (1) a hardware driver; (2) an application programming interface (API) and its runtime; and (3) two higher-level mathematical libraries of common usage, CUFFT and CUBLAS.

http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf

Page 12: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

CUDA Thread Batching

The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks.

This method works best on those problems for which parallelization is "easy".

Implementing "hard" parallel solutions on such a system requires active participation by the applications programmer.

The pay-off is a significant speed-up compared to approaches in which the details of parallelization have been hidden from the applications programmer.

Page 13: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Clock Synchronization

Time Sources

De Facto Primary Standard – International Atomic Time (TAI)
    Keeping of TAI started in 1955
    1 atomic second = 9,192,631,770 periods of the Cs-133 (caesium) transition radiation
    86,400 atomic seconds = 1 solar day – 3 ms

Coordinated Universal Time (UTC) – International Standard
    Keeping of UTC started in 1961
    Derived from TAI by adding leap seconds to keep it close to solar time
    UTC source signals are synchronized
    UTC time is re-transmitted by GPS satellites

Local clocks are based on oscillators

Clock synchronization is a problem from computer science and engineering which deals with the idea that internal clocks of several computers may differ. Even when initially set accurately, real clocks will differ after some amount of time due to clock drift, caused by clocks counting time at slightly different rates. There are several problems that occur as a repercussion of rate differences and several solutions, some being more appropriate than others in certain contexts.

http://en.wikipedia.org/wiki/Clock_synchronization

Computer Science 425 Distributed Systems - Klara Nahrstedt

Page 14: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Clock Skew & Drift Rates

Skew: s(t) = C_i(t) - C_j(t)

Maximum Drift Rate (MDR) ρ:

    1 - ρ ≤ dC_i(t)/dt ≤ 1 + ρ, which gives |C_i(t) - t| ≤ ρt

Synchronization interval R and synchronization bound D:

    |C_i(t) - C_j(t)| ≤ 2ρt, so requiring |C_i(R) - C_j(R)| ≤ 2ρR ≤ D gives R ≤ D/(2ρ)

This calculation ignores propagation delays.
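
As a quick worked example (the numbers are illustrative, not from the lecture): with a maximum drift rate ρ = 10^-6 and a required synchronization bound D = 1 ms,

    R ≤ D/(2ρ) = 0.001 / (2 × 0.000001) = 500 s

so the clocks must be resynchronized at least once every 500 seconds.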

Computer Science 425 Distributed Systems - Klara Nahrstedt

Page 15: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Besides the incorrectness of the time itself, there are problems associated with clock skew that take on more complexity in a distributed system in which several computers will need to realize the same global time.

For instance, in Unix systems the make command is used to compile new or modified code without the need to recompile unchanged code. The make command uses the clock of the machine it runs on to determine which source files need to be recompiled. If the sources reside on a separate file server and the two machines have unsynchronized clocks, the make program might not produce the correct results.

Time Synchronization Problems in Distributed Systems

http://en.wikipedia.org/wiki/Clock_synchronization

Page 16: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

In a centralized system the solution is trivial; the centralized server will dictate the system time. Cristian's algorithm and the Berkeley Algorithm are some solutions to the clock synchronization problem in a centralized server environment. In a distributed system the problem takes on more complexity because a global time is not easily known. The most used clock synchronization solution on the Internet is the Network Time Protocol (NTP) which is a layered client-server architecture based on UDP message passing. Lamport timestamps and Vector clocks are concepts of the logical clocks in distributed systems.

Solutions to Synchronization Problems

http://en.wikipedia.org/wiki/Clock_synchronization

Page 17: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Cristian's algorithm

Cristian's algorithm relies on the existence of a time server. The time server maintains its clock by using a radio clock or other accurate time source, then all other computers in the system stay synchronized with it. A time client will maintain its clock by making a procedure call to the time server. Variations of this algorithm make more precise time calculations by factoring in network propagation time.

http://en.wikipedia.org/wiki/Clock_synchronization

m_r - message in which client p_i asks time server S for the time

m_t - message in which the time server responds with its current time T

Client p_i uses the T in m_t to set its clock

Computer Science 425 Distributed Systems - Klara Nahrstedt
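
A short C sketch of this exchange (the helper functions, the microsecond clock representation, and the rtt/2 correction for propagation time are illustrative assumptions, not the lecture's code):

    #include <stdint.h>

    typedef int64_t usec_t;                  /* microseconds */

    extern usec_t local_clock(void);         /* assumed: read the local clock           */
    extern usec_t request_server_time(void); /* assumed: send m_r, receive m_t with T   */
    extern void   set_local_clock(usec_t t); /* assumed: adjust the local clock         */

    void cristian_sync(void)
    {
        usec_t t_send = local_clock();          /* when m_r leaves the client   */
        usec_t T      = request_server_time();  /* server's time carried in m_t */
        usec_t t_recv = local_clock();          /* when m_t arrives             */

        usec_t rtt = t_recv - t_send;           /* round-trip time              */
        set_local_clock(T + rtt / 2);           /* assume a symmetric delay     */
    }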

Page 18: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Berkeley algorithm

This algorithm is more suitable for systems in which a radio clock is not present; such a system has no way of determining the actual time other than by maintaining a global average time as the global time. A time server will periodically fetch the time from all the time clients, average the results, and then report back to the clients the adjustment that needs to be made to their local clocks to achieve the average. This algorithm highlights the fact that internal clocks may vary not only in the time they contain but also in the clock rate. Often, any client whose clock differs by a value outside of a given tolerance is disregarded when averaging the results. This prevents the overall system time from being drastically skewed due to one erroneous clock.

http://en.wikipedia.org/wiki/Clock_synchronization

Use an elected leader process to ensure the maximum skew among clients is ρ.

The elected leader polls all machines for their time, adjusts the times received for transmission delay and latency, averages the times after removing outliers, and tells each machine how to adjust.

In some systems multiple time leaders are used.

Averaging clients' clocks may cause the entire system to drift away from UTC over time

Failure of the leader requires some time for re-election, so accuracy cannot be guaranteed

Computer Science 425 Distributed Systems - Klara Nahrstedt
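
A short C sketch of the leader's averaging step (the client count, the outlier tolerance, and the assumption that the offsets are already corrected for transmission delay are illustrative choices, not from the lecture):

    #include <stdint.h>

    #define NUM_CLIENTS 8                 /* assumed number of polled clients       */
    #define TOLERANCE_US 500000LL         /* assumed outlier tolerance: 0.5 seconds */

    typedef int64_t usec_t;

    /* offsets[i] = client i's reported time minus the leader's time, already
     * corrected for transmission delay and latency. adjustments[i] is the
     * correction the leader sends back to client i. */
    void berkeley_adjustments(const usec_t offsets[NUM_CLIENTS],
                              usec_t adjustments[NUM_CLIENTS])
    {
        int64_t sum = 0;
        int count = 0;

        /* Average only the clocks within tolerance; outliers are disregarded. */
        for (int i = 0; i < NUM_CLIENTS; i++) {
            usec_t d = (offsets[i] < 0) ? -offsets[i] : offsets[i];
            if (d <= TOLERANCE_US) {
                sum += offsets[i];
                count++;
            }
        }
        usec_t avg = (count > 0) ? sum / count : 0;

        /* Tell every client (including outliers) how far to move toward the average. */
        for (int i = 0; i < NUM_CLIENTS; i++)
            adjustments[i] = avg - offsets[i];
    }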

Page 19: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Network Time Protocol

A related approach, Clock Sampling Mutual Network Synchronization (CS-MNS), is a class of mutual network synchronization algorithms in which no master or reference clocks are needed. All clocks participate equally in synchronizing the network by exchanging their timestamps using regular beacon packets. CS-MNS is suitable for distributed and mobile applications. It has been shown to be scalable, accurate on the order of a few microseconds, and compatible with IEEE 802.11 and similar standards.

http://en.wikipedia.org/wiki/Clock_synchronization

• Provides UTC synchronization service across the Internet
• Uses time servers to sync. networked processes
• Time servers are connected by a sync. subnet tree
• The root is adjusted directly
• Each node synchronizes its children nodes

Computer Science 425 Distributed Systems - Klara Nahrstedt
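
For concreteness, the standard NTP offset and round-trip delay calculation for a single request/response exchange, sketched in C (the names are illustrative; the formulas are standard NTP practice rather than something shown on the slide):

    #include <stdint.h>

    typedef int64_t usec_t;

    /* t1 = client send, t2 = server receive, t3 = server send, t4 = client receive. */
    void ntp_offset_delay(usec_t t1, usec_t t2, usec_t t3, usec_t t4,
                          usec_t *offset, usec_t *delay)
    {
        *offset = ((t2 - t1) + (t3 - t4)) / 2;   /* estimated clock offset   */
        *delay  = (t4 - t1) - (t3 - t2);         /* round-trip network delay */
    }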

Page 20: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Reference Broadcast Synchronization (RBS)

This algorithm is often used in wireless networks and sensor networks. In this scheme, an initiator broadcasts a reference message to urge the receivers to adjust their clocks.

http://en.wikipedia.org/wiki/Clock_synchronization

A critical path analysis for traditional time synchronization protocols (top) and RBS (bottom).

For traditional protocols working on a LAN, the largest contributions to nondeterministic latency are the Send Time (from the sender’s clock read to delivery of the packet to its NIC, including protocol processing) and Access Time (the delay in the NIC until the channel becomes free).

The Receive Time tends to be much smaller than the Send Time because the clock can be read at interrupt time, before protocol processing.

In RBS, the critical path length is shortened to include only the time from the injection of the packet into the channel to the last clock read.

Fine-Grained Network Time Synchronization using Reference Broadcasts - Elson

Page 21: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Happens-Before Relation on Events

http://en.wikipedia.org/wiki/Happened-before
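
As an illustrative sketch (not the slide's content), Lamport's logical-clock update rules give one concrete realization of the happens-before ordering: increment on every local event, stamp outgoing messages, and on receipt take the maximum of the local and message timestamps plus one.

    #include <stdint.h>

    typedef uint64_t lclock_t;

    /* Local event: tick the clock. */
    lclock_t on_local_event(lclock_t c) { return c + 1; }

    /* Send: tick the clock and stamp the outgoing message. */
    lclock_t on_send(lclock_t c, lclock_t *stamp_out)
    {
        *stamp_out = c + 1;
        return c + 1;
    }

    /* Receive: advance past both the local clock and the message's stamp. */
    lclock_t on_receive(lclock_t c, lclock_t msg_stamp)
    {
        lclock_t m = (c > msg_stamp) ? c : msg_stamp;
        return m + 1;
    }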

Page 22: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors.

OpenCL includes a language for writing kernels and APIs that are used to control the platforms.

OpenCL provides parallel computing using task-based and data-based parallelism.

AMD/ATI - Stream SDK

Nvidia - offers OpenCL as an alternative to its CUDA

OpenCL gives any application access to the Graphics Processing Unit for non-graphical computing. Thus, OpenCL extends the power of the Graphics Processing Unit beyond graphics (General-purpose computing on graphics processing units).

OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. OpenCL is managed by the non-profit technology consortium Khronos Group.

OpenCL

http://en.wikipedia.org/wiki/OpenCL

Page 23: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

http://www.khronos.org/opencl/

Some Basic Concepts

OpenCL programs are divided into two parts:

(1) one that executes on the device (in our case, on the GPU)

(2) the other that executes on the host (in our case, the CPU).

In order to execute code on the device, programmers can write special functions (called kernels), which are coded with the OpenCL Programming Language - a sort of C with some restrictions and special keywords and data types.

On the other hand, the host program offers an API so that you can manage your device execution. The host can be programmed in C or C++ and it controls the OpenCL environment.

Page 24: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

The kernel is a function that is executed on the device. There are other functions that can be executed on the device but the kernels are entry points to the device program. In other words, kernels are the only functions that can be called from the host.

How do I program a Kernel?

How do I express the parallelism with kernels?

What is the execution model?

Definition of the Kernel

http://opencl.codeplex.com/

Page 25: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

SIMT: stands for Single Instruction Multiple Thread, which is how instructions are executed on the device. That is, the same code is executed in parallel by different threads, and each thread executes the code with different data.

Work-item: the work-items are equivalent to CUDA threads, and are the smallest unit of execution. Each time a Kernel is launched, an appropriate number of work-items are launched, each one executing the same code. Each work-item has an ID, which is accessible from the kernel, and which is used to distinguish the data to be processed by each work-item.

Work-group: work-groups exist to enable cooperation between work-items. They reflect how work-items are organized (as an N-dimensional grid of work-items, with N = 1, 2 or 3). Work-groups correspond to CUDA thread blocks. Like work-items, work-groups have a unique ID that can be accessed from the kernel.

Units of Execution

http://opencl.codeplex.com/

Page 26: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

ND-Range: the next organization level, specifying how work-groups are organized (as an N-dimensional grid of work-groups, N = 1, 2 or 3)

ND-Range

http://opencl.codeplex.com/

Page 27: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Sample Kernel that adds two vectors. This kernel takes four parameters:

src_a = one of the vectors to be added
src_b = the other vector to be added
res = vector to hold the result
num = the number of elements in each vector

Example - Sum of Two Vectors

The CPU version looks like this:
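
One way this CPU version might be written in C (the function name is assumed):

    void vector_add_cpu(const float *src_a, const float *src_b,
                        float *res, const int num)
    {
        /* One thread walks every element pair and adds them. */
        for (int i = 0; i < num; i++)
            res[i] = src_a[i] + src_b[i];
    }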

Page 28: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Example - Sum of Two Vectors

Rather than one thread iterating through all the vector elements, we allocate a separate thread for each pair of elements. The index of each thread corresponds to the index of the vector elements.

The reserved word kernel specifies that the function is a kernel.

Kernel functions always return void.

The reserved word global specifies the location in memory of the arguments.
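
A sketch of the corresponding kernel in OpenCL C, using the four parameters listed on the previous page (the kernel name and the bounds check are assumptions; __kernel and __global are the usual spellings of the reserved words described above):

    __kernel void vector_add(__global const float *src_a,
                             __global const float *src_b,
                             __global float *res,
                             const int num)
    {
        /* Each work-item handles the element pair at its own global index. */
        const int idx = get_global_id(0);
        if (idx < num)                    /* guard in case extra work-items were launched */
            res[idx] = src_a[idx] + src_b[idx];
    }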

Page 29: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Creating the basic OpenCL run-time environment

Platform: "The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."

Platforms are represented by cl_platform_id objects, which can be obtained using the following function:

Device: represented by cl_device_id objects, obtained with the following function:
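
A minimal sketch of those two calls in the host code (error handling trimmed; choosing the first platform and the first GPU device is an illustrative choice, and these variable names are carried through the later sketches):

    #include <CL/cl.h>

    cl_int error;

    /* Platform: take the first platform the OpenCL framework reports. */
    cl_platform_id platform;
    cl_uint num_platforms;
    error = clGetPlatformIDs(1, &platform, &num_platforms);

    /* Device: take the first GPU device on that platform. */
    cl_device_id device;
    cl_uint num_devices;
    error = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &num_devices);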

Page 30: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Context: defines the OpenCL environment, including OpenCL kernels, devices, memory management, and command-queues. Contexts in OpenCL are referenced by cl_context objects, which are created using the following function:

Command-Queue: an object in which OpenCL commands are enqueued to be executed by the device. "The command-queue is created on a specific device in a context [...] Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization."
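
Continuing with the device obtained in the earlier sketch, the context and command-queue might be created like this (variable names are assumptions):

    /* Create a context covering our one device, then a command-queue on it. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &error);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &error);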

Page 31: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Allocating Memory

To execute the vector sum kernel we need to allocate 3 vectors and initialize 2 of them.

To perform memory allocation on the CPU, we would do something like this:
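
For example (the vector size and the *_h host-array names are assumptions for this walkthrough):

    #include <stdlib.h>

    /* Host-side allocation and initialization of the two input vectors. */
    const int size = 1 << 20;                       /* number of elements per vector */
    float *src_a_h = malloc(size * sizeof(float));
    float *src_b_h = malloc(size * sizeof(float));
    float *res_h   = malloc(size * sizeof(float));

    for (int i = 0; i < size; i++) {
        src_a_h[i] = (float)i;                      /* arbitrary test data */
        src_b_h[i] = (float)i;
    }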

Page 32: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

To represent memory allocated on the device we use the cl_mem type. To allocate memory we use:

Allocating Memory

Valid values for flags are:

    CL_MEM_READ_WRITE
    CL_MEM_WRITE_ONLY
    CL_MEM_READ_ONLY
    CL_MEM_USE_HOST_PTR
    CL_MEM_ALLOC_HOST_PTR
    CL_MEM_COPY_HOST_PTR

This function is used to allocate memory for src_a_d, src_b_d, and res_d:
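
The allocation call is clCreateBuffer. A sketch of how the three device buffers might be created from the host arrays above (the flag choices are one reasonable option, and error handling is trimmed):

    /* Device-side buffers, continuing with the context and host arrays above. */
    const size_t mem_size = sizeof(float) * size;

    cl_mem src_a_d = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    mem_size, src_a_h, &error);
    cl_mem src_b_d = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    mem_size, src_b_h, &error);
    cl_mem res_d   = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                    mem_size, NULL, &error);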

Page 33: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

OpenCL Programs vs OpenCL Kernels

Kernel: kernels are essentially functions that we can call from the host and that will run on the device. Note: all code that runs on the device, including both kernels and the non-kernel functions called by kernels, is compiled at run-time.

Program: an OpenCL program is formed by a set of kernels, functions and declarations, and is represented by a cl_program object. When creating a program, you must specify which sources compose the program before compiling it.

Page 34: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Create a Program

Compile a Program

View Compile Log

Extract Entry Point
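
A sketch of these four steps using the standard OpenCL 1.x calls (source_str/source_len holding the kernel text, the kernel name, and the variable names are assumptions):

    #include <stdio.h>

    /* Create a program from the kernel source. */
    cl_program program = clCreateProgramWithSource(context, 1, &source_str,
                                                   &source_len, &error);

    /* Compile it for the chosen device. */
    error = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    /* View the compile log if the build failed. */
    if (error != CL_SUCCESS) {
        char log[4096];
        size_t log_len;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, &log_len);
        printf("Build log:\n%s\n", log);
    }

    /* Extract the entry point: the kernel object for the vector_add function. */
    cl_kernel vector_add_k = clCreateKernel(program, "vector_add", &error);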

Page 35: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Sample Program

Page 36: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Launching the Kernel

Once our kernel object exists, we can now launch it.

First of all, we must set up the arguments. This is done using:

This function should be called for each argument.

After all the arguments are set, we can launch the kernel using:
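
A sketch of setting the four arguments and launching the kernel (the work-group size of 256 and the rounding up of the global size are illustrative choices):

    /* Set the four kernel arguments (one clSetKernelArg call per argument). */
    clSetKernelArg(vector_add_k, 0, sizeof(cl_mem), &src_a_d);
    clSetKernelArg(vector_add_k, 1, sizeof(cl_mem), &src_b_d);
    clSetKernelArg(vector_add_k, 2, sizeof(cl_mem), &res_d);
    clSetKernelArg(vector_add_k, 3, sizeof(int),    &size);

    /* Launch a 1-dimensional ND-range, rounded up to a whole number of
     * work-groups of local_ws work-items each. */
    const size_t local_ws  = 256;
    const size_t global_ws = ((size + local_ws - 1) / local_ws) * local_ws;
    error = clEnqueueNDRangeKernel(queue, vector_add_k, 1, NULL,
                                   &global_ws, &local_ws, 0, NULL, NULL);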

Page 37: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

Reading back

Reading back is easy. Analogously to what we did to write to memory, we now enqueue a read-buffer operation, which is done using:
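
For example, a blocking read of the result buffer back into the host array from the earlier sketches:

    /* CL_TRUE makes the read blocking: res_h is valid when the call returns. */
    clEnqueueReadBuffer(queue, res_d, CL_TRUE, 0, mem_size, res_h,
                        0, NULL, NULL);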

Page 38: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization

All objects and memory allocated with the clCreate* functions are released using the corresponding clRelease* functions.

Cleaning up
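
For the objects created in the earlier sketches, the cleanup might look like this:

    /* Release the OpenCL objects, then free the host arrays. */
    clReleaseKernel(vector_add_k);
    clReleaseProgram(program);
    clReleaseMemObject(src_a_d);
    clReleaseMemObject(src_b_d);
    clReleaseMemObject(res_d);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    free(src_a_h);
    free(src_b_h);
    free(res_h);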

Page 39: Lecture 12: Attributes, Performance and Scalability Many-Core vs. CUDA Synchronization