
BE (Computer Engineering)

High Performance Computing (OCT/NOV 2019) (2015 Pattern) (Semester-I) (410241) [Max Marks: 70]

Question Paper Solution

Q1) a) Explain the term all-to-all broadcast on linear array, mesh & hypercube topologies. [8]

All-to-all broadcast is a generalization of one-to-all broadcast in which all p nodes simultaneously initiate a broadcast. A process sends the same m-word message to every other process, but different processes may broadcast different messages. All-to-all broadcast is used in matrix operations, including matrix multiplication and matrix-vector multiplication. The dual of all-to-all broadcast is all-to-all reduction, in which every node is the destination of an all-to-one reduction. The figure illustrates all-to-all broadcast and all-to-all reduction.

One way to perform an all-to-all broadcast is to perform p one-to-all broadcasts, one starting at each node. If performed naively, on some architectures this approach may take up to p times as long as a one-to-all broadcast. It is possible to use the communication links in the interconnection network more efficiently by performing all p one-to-all broadcasts simultaneously so that all messages traversing the same path at the same time are concatenated into a single message whose size is the sum of the sizes of individual messages.

The following sections describe all-to-all broadcast on linear array, mesh, and hypercube topologies.

Linear Array and Ring

While performing all-to-all broadcast on a linear array or a ring, all communication links can be kept busy simultaneously until the operation is complete because each node always has some information that it can pass along to its neighbor. Each node first sends to one of its neighbors the data it needs to broadcast. In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.


The figure illustrates all-to-all broadcast for an eight-node ring. The same procedure would also work on a linear array with bidirectional links. As before, the integer label of an arrow indicates the time step during which the message is sent. In all-to-all broadcast, p different messages circulate in the p-node ensemble. In the figure, each message is identified by its initial source, whose label appears in parentheses along with the time step. For instance, the arc labeled 2 (7) between nodes 0 and 1 represents the data communicated in time step 2 that node 0 received from node 7 in the preceding step. As the figure shows, if communication is performed circularly in a single direction, then each node receives all (p - 1) pieces of information from all other nodes in (p - 1) steps.
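As a rough illustration (not part of the original solution), the sketch below implements this ring procedure with MPI. It assumes every node contributes a block of m integers, that result has room for all p blocks, and that the function and buffer names are ours:

#include <mpi.h>
#include <string.h>

/* All-to-all broadcast on a p-node ring: in each of the p-1 steps every node
 * forwards to its right neighbour the block it received from its left
 * neighbour in the previous step. 'result' must hold p blocks of m ints. */
void ring_all_to_all_bcast(const int *my_msg, int m, int *result, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    memcpy(&result[rank * m], my_msg, m * sizeof(int));   /* own block */

    int have = rank;                        /* origin of the block sent next      */
    for (int step = 1; step < p; step++) {
        int incoming = (have - 1 + p) % p;  /* origin of the block arriving now   */
        MPI_Sendrecv(&result[have * m],     m, MPI_INT, right, step,
                     &result[incoming * m], m, MPI_INT, left,  step,
                     comm, MPI_STATUS_IGNORE);
        have = incoming;
    }
}

After p - 1 such steps each node holds all p blocks, matching the (p - 1)-step bound stated above.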

b) Explain mapping techniques for load balancing. [6]


Static Mapping: Static mapping techniques distribute the tasks among processes prior to the execution of the algorithm. For statically generated tasks, either static or dynamic mapping can be used. The choice of a good mapping in this case depends on several factors, including the knowledge of task sizes, the size of data associated with tasks, the characteristics of inter-task interactions, and even the parallel programming paradigm. Even when task sizes are known, in general, the problem of obtaining an optimal mapping is an NP-complete problem for non-uniform tasks. However, for many practical cases, relatively inexpensive heuristics provide fairly acceptable approximate solutions to the optimal static mapping problem.

Algorithms that make use of static mapping are in general easier to design and program.

Dynamic Mapping: Dynamic mapping techniques distribute the work among processes during the execution of the algorithm. If tasks are generated dynamically, then they must be mapped dynamically too. If task sizes are unknown, then a static mapping can potentially lead to serious load imbalances, and dynamic mappings are usually more effective. If the amount of data associated with tasks is large relative to the computation, then a dynamic mapping may entail moving this data among processes. The cost of this data movement may outweigh some other advantages of dynamic mapping and may render a static mapping more suitable. However, in a shared-address-space paradigm, dynamic mapping may work well even with large data associated with tasks if the interaction is read-only. The reader should be aware that the shared-address-space programming paradigm does not automatically provide immunity against data-movement costs. If the underlying hardware is NUMA, then the data may physically move from a distant memory. Even in a cc-UMA architecture, the data may have to move from one cache to another.

Algorithms that require dynamic mapping are usually more complicated, particularly in the message-passing programming paradigm.
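As a small illustration (ours, not from the original solution), OpenMP loop scheduling exposes exactly this choice: schedule(static) fixes the iteration-to-thread mapping before the loop runs, while schedule(dynamic) hands out chunks at run time, which helps when the cost of the assumed task function work(i) is irregular.

#include <omp.h>

extern double work(int i);   /* assumed task with unpredictable cost */

void mapped_loop(double *out, int n)
{
    /* Static mapping: iterations are divided among the threads up front. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        out[i] = work(i);

    /* Dynamic mapping: an idle thread grabs the next chunk of 16 iterations,
     * balancing the load at the cost of some scheduling overhead. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        out[i] += work(i);
}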

c) Explain N-wide superscalar architecture [6]


• Ans: In an N-wide superscalar processor, up to N instructions can be fetched, decoded, and issued in each clock cycle, provided the instructions are independent. In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue.

• In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue. The performance of in-order issue is generally more limited than that of dynamic issue, since a single dependency can stall the issue of all subsequent instructions in that cycle.
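A small illustrative fragment (ours) makes the difference concrete for a two-wide processor:

void issue_example(int a, int b, int c, int d, int *r1, int *r2)
{
    int e1 = a + b;    /* instruction 1                                    */
    int e2 = e1 * c;   /* instruction 2: depends on instruction 1          */
    int e3 = a - d;    /* instruction 3: independent of instructions 1, 2  */
    *r1 = e2;
    *r2 = e3;
}
/* In-order issue: instruction 2 cannot be issued with instruction 1, so only
 * one instruction issues in that cycle. Dynamic (out-of-order) issue can
 * co-schedule instructions 1 and 3 and issue instruction 2 afterwards. */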

OR

Q2) a) Explain the methods for containing Interaction overheads. [8]

Maximizing Data Locality

Minimize Volume of Data Exchange

A fundamental technique for reducing the interaction overhead is to minimize the overall volume of shared data that needs to be accessed by concurrent processes. This is akin to maximizing the temporal data locality, i.e., making consecutive references to the same data wherever possible. Clearly, performing as much of the computation as possible using locally available data obviates the need to bring more data into local memory or cache for a process to perform its tasks. As discussed previously, one way of achieving this is by using appropriate decomposition and mapping schemes. For example, in the case of matrix multiplication, using a two-dimensional mapping of the computations to the processes reduces the amount of shared data that needs to be accessed.

Minimize Frequency of Interactions

Minimizing the interaction frequency is important in reducing the interaction overheads in parallel programs because there is a relatively high startup cost associated with each interaction on many architectures. Interaction frequency can be reduced by restructuring the algorithm such that shared data are accessed and used in large pieces. Thus, by amortizing the startup cost over large accesses, we can reduce the overall interaction overhead, even if such restructuring does not necessarily reduce the overall volume of shared data that needs to be accessed. This is akin to increasing the spatial locality of data access, i.e., ensuring the proximity of consecutively accessed data locations. On a shared-address-space architecture, each time a word is accessed, an entire cache line containing many words is fetched. If the program is structured to have spatial locality, then fewer cache lines are accessed. On a message-passing system, spatial locality leads to fewer message transfers over the network because each message can transfer larger amounts of useful data. The number of messages can sometimes be reduced further on a message-passing system by combining messages between the same source-destination pair into larger messages, if the interaction pattern permits and if the data for multiple messages are available at the same time, albeit in separate data structures.
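A hedged MPI sketch of this idea (function names are ours): one aggregated message of n words costs roughly ts + n*tw, whereas n one-word messages cost n*(ts + tw).

#include <mpi.h>

/* Aggregated: a single message amortizes the startup cost ts over n words. */
void send_aggregated(const int *values, int n, int dest, MPI_Comm comm)
{
    MPI_Send(values, n, MPI_INT, dest, 0, comm);
}

/* Unaggregated: pays the startup cost ts for every word sent. */
void send_one_by_one(const int *values, int n, int dest, MPI_Comm comm)
{
    for (int i = 0; i < n; i++)
        MPI_Send(&values[i], 1, MPI_INT, dest, 0, comm);
}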

Minimizing Contention and Hot Spots

Contention arises when multiple processes try to access the same shared resource (a memory block, a communication link, or a lock) at the same time; restructuring the accesses so that they are spread over different resources or different times reduces this overhead.

Replicating Data or Computations

Replicating a frequently read data structure on every process, or recomputing a cheap intermediate result locally, removes the need to interact in order to obtain it.

Using Optimized Collective Interaction Operations

Library-provided collective operations such as broadcast, reduction, and all-to-all are usually far more efficient than hand-coded point-to-point equivalents.

Overlapping Interactions with Other Interactions

If one interaction does not depend on another, the two can be initiated together so that their latencies overlap rather than add up.

b) Write short note on circular shift on a mesh. [6]

Circular shift is a member of a broader class of global communication operations known as permutation. A permutation is a simultaneous, one-to-one data redistribution operation in which each node sends a packet of m words to a unique node. We define a circular q-shift as the operation in which node i sends a data packet to node (i + q) mod p in a p-node ensemble (0 < q < p). The shift operation finds application in some matrix computations and in string and image pattern matching.

The implementation of a circular q-shift is fairly intuitive on a ring or a bidirectional linear array. It can be performed by min{q , p - q} neighbor-to-neighbor communications in one direction. Mesh algorithms for circular shift can be derived by using the ring algorithm.
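A hedged MPI sketch of the ring version described above (the function name and the use of MPI_Sendrecv_replace are ours): node i's packet of m words ends up at node (i + q) mod p after min{q, p - q} neighbour-to-neighbour steps.

#include <mpi.h>

void circular_q_shift(int *packet, int m, int q, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Shift in whichever direction needs fewer hops. */
    int steps = (q <= p - q) ? q : p - q;
    int dir   = (q <= p - q) ? 1 : -1;           /* +1: towards higher ranks */
    int dest  = (rank + dir + p) % p;
    int src   = (rank - dir + p) % p;

    for (int s = 0; s < steps; s++)
        /* Forward the packet one hop; after 'steps' hops every packet has
         * moved q positions around the ring. */
        MPI_Sendrecv_replace(packet, m, MPI_INT, dest, 0, src, 0,
                             comm, MPI_STATUS_IGNORE);
}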


c) List applications of parallel programming. [6]

Applications of Parallel Computing:

Databases and data mining.
Real-time simulation of systems.
Science and engineering.
Advanced graphics, augmented reality, and virtual reality.

Q3) a) Explain sources of overhead in parallel program. [8]

Using twice as many hardware resources, one can reasonably expect a program to run twice as fast. However, in typical parallel programs, this is rarely the case, due to a variety of overheads associated with parallelism. An accurate quantification of these overheads is critical to the understanding of parallel program performance.

Interprocess Interaction

Any nontrivial parallel system requires its processing elements to interact and communicate data (e.g., intermediate results). The time spent communicating data between processing elements is usually the most significant source of parallel processing overhead.

Idling

Processing elements in a parallel system may become idle due to many reasons such as load imbalance, synchronization, and presence of serial components in a program. In many parallel applications (for example, when task generation is dynamic), it is impossible (or at least difficult) to predict the size of the subtasks assigned to various processing elements. Hence, the problem cannot be subdivided statically among the processing elements while maintaining uniform workload. If different processing elements have different workloads, some processing elements may be idle during part of the time that others are working on the problem. In some parallel programs, processing elements must synchronize at certain points during parallel program execution. If all processing elements are not ready for synchronization at the same time, then the ones that are ready sooner will be idle until all the rest are ready. Parts of an algorithm may be unparallelizable, allowing only a single processing element to work on it. While one processing element works on the serial part, all the other processing elements must wait.

Excess Computation

The fastest known sequential algorithm for a problem may be difficult or impossible to parallelize, forcing us to use a parallel algorithm based on a poorer but easily parallelizable (that is, one with a higher degree of concurrency) sequential algorithm. The difference in computation performed by the parallel program and the best serial program is the excess computation overhead incurred by the parallel program.

b) Explain the performance metrics for parallel systems.

Ans: It is important to study the performance of parallel programs with a view to determining the best algorithm, evaluating hardware platforms, and examining the benefits from parallelism. A number of metrics have been used based on the desired outcome of performance analysis.

Execution Time

The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. We denote the serial runtime by TS and the parallel runtime by TP.

Total Parallel Overhead


The overheads incurred by a parallel program are encapsulated into a single expression referred to as the overhead function. We define overhead function or total overhead of a parallel system as the total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element. We denote the overhead function of a parallel system by the symbol To.

The total time spent in solving a problem summed over all processing elements is pTP. Of this time, TS units are spent performing useful work, and the remainder is overhead. Therefore, the overhead function (To) is given by

To = pTP - TS

Speedup

When evaluating a parallel system, we are often interested in knowing how much performance gain is achieved by parallelizing a given application over a sequential implementation. Speedup is a measure that captures the relative benefit of solving a problem in parallel. It is defined as the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements. We denote speedup by the symbol S.

Example: Adding n numbers using n processing elements

Consider the problem of adding n numbers by using n processing elements. Initially, each processing element is assigned one of the numbers to be added and, at the end of the computation, one of the processing elements stores the sum of all the numbers. Assuming that n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processing elements. The figure illustrates the procedure for n = 16. The processing elements are labeled from 0 to 15. Similarly, the 16 numbers to be added are labeled from 0 to 15. The sum of the numbers with consecutive labels from i to j is denoted by Σ(i..j).

Each step shown in the figure consists of one addition and the communication of a single word. The addition can be performed in some constant time, say tc, and the communication of a single word can be performed in time ts + tw. Therefore, each of the log n steps takes a constant amount of time, and the parallel runtime is TP = Θ(log n).
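The corresponding speedup follows directly (a sketch of the standard analysis, with tc the per-addition time and ts + tw the per-word communication time):

\[ T_P = (t_c + t_s + t_w)\log n = \Theta(\log n), \qquad T_S = \Theta(n), \qquad S = \frac{T_S}{T_P} = \Theta\!\left(\frac{n}{\log n}\right) \]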

Q4) a) Write a note on minimum & cost-optimal execution time. [8]

We are often interested in knowing how fast a problem can be solved, or what the minimum possible execution time of a parallel algorithm is, provided that the number of processing elements is not a constraint. As we increase the number of processing elements for a given problem size, either the parallel runtime continues to decrease and asymptotically approaches a minimum value, or it starts rising after attaining a minimum value. We can determine the minimum parallel runtime for a given problem size by differentiating the expression for TP with respect to p and equating the derivative to zero (treating p as a continuous variable).


Under the assumptions of the earlier example, the parallel runtime for the problem of adding n numbers on p processing elements can be approximated by TP = n/p + 2 log p.
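A sketch of how the minimum follows from this expression (treating p as continuous and, as in the standard textbook treatment, absorbing the constant from the logarithm base into the derivative):

\[ \frac{dT_P}{dp} = -\frac{n}{p^2} + \frac{2}{p} = 0 \;\Rightarrow\; p = \frac{n}{2}, \qquad T_P^{\min} = \frac{n}{n/2} + 2\log\frac{n}{2} = 2\log n \]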

b) Explain parallel Matrix-vector multiplication algorithm with example. [8]

Ans: A matrix is a set of numerical and non-numerical data arranged in a fixed number of rows and columns. Matrix multiplication is an important computation in parallel algorithm design. Here, we will discuss the implementation of matrix multiplication on communication networks such as the mesh and the hypercube. Mesh and hypercube networks have higher connectivity, so they allow faster algorithms than networks such as the ring.

We consider a 2D mesh-connected SIMD model with wraparound connections. We will design an algorithm to multiply two n × n arrays using n^2 processors in a particular amount of time.

Matrices A and B have elements aij and bij respectively. Processing element PEij holds aij and bij. Arrange the matrices A and B in such a way that every processor has a pair of elements to multiply. The elements of matrix A will move in the left direction and the elements of matrix B will move in the upward direction. These changes in the positions of the elements of matrices A and B present each processing element PEij with a new pair of values to multiply.

Steps in the Algorithm

1. Stagger the two matrices.
2. Calculate all products aik × bkj.
3. Calculate the sums when step 2 is complete.

Procedure MatrixMulti
Begin
   for k = 1 to n-1
      for all Pij, where i and j range from 1 to n
         if i is greater than k then rotate a in the left direction end if
         if j is greater than k then rotate b in the upward direction end if

   for all Pij, where i and j lie between 1 and n
      compute the product of a and b and store it in c

   for k = 1 to n-1 step 1
      for all Pij, where i and j range from 1 to n
         rotate a in the left direction
         rotate b in the upward direction
         c = c + a x b
End
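A minimal serial sketch (ours) that simulates the procedure above for n × n matrices stored in ordinary arrays; each cell (i, j) plays the role of PEij, and the left/upward rotations become cyclic index shifts:

#define N 4

/* Rotate row i of a one position to the left. */
static void rotate_row_left(double a[N][N], int i)
{
    double first = a[i][0];
    for (int j = 0; j < N - 1; j++) a[i][j] = a[i][j + 1];
    a[i][N - 1] = first;
}

/* Rotate column j of b one position upward. */
static void rotate_col_up(double b[N][N], int j)
{
    double top = b[0][j];
    for (int i = 0; i < N - 1; i++) b[i][j] = b[i + 1][j];
    b[N - 1][j] = top;
}

void mesh_matrix_multiply(double a[N][N], double b[N][N], double c[N][N])
{
    /* Step 1: stagger the matrices -- row i of A rotates left i times,
     * column j of B rotates up j times. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < i; k++) rotate_row_left(a, i);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < j; k++) rotate_col_up(b, j);

    /* Step 2: every PE multiplies its local pair. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) c[i][j] = a[i][j] * b[i][j];

    /* Step 3: N-1 rotate-and-accumulate steps. */
    for (int step = 1; step < N; step++) {
        for (int i = 0; i < N; i++) rotate_row_left(a, i);
        for (int j = 0; j < N; j++) rotate_col_up(b, j);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) c[i][j] += a[i][j] * b[i][j];
    }
}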

Q5) a) What are the issues in sorting on parallel computers with example? [8]

Where the Input and Output Sequences are Stored

In sequential sorting algorithms, the input and the sorted sequences are stored in the process's memory. However, in parallel sorting there are two places where these sequences can reside. They may be stored on only one of the processes, or they may be distributed among the processes. The latter approach is particularly useful if sorting is an intermediate step in another algorithm. In this chapter, we assume that the input and sorted sequences are distributed among the processes.

Consider the precise distribution of the sorted output sequence among the processes. A general method of distribution is to enumerate the processes and use this enumeration to specify a global ordering for the sorted sequence. In other words, the sequence will be sorted with respect to this process enumeration. For instance, if Pi comes before Pj in the enumeration, all the elements stored in Pi will be smaller than those stored in Pj . We can enumerate the processes in many ways. For certain parallel algorithms and interconnection networks, some enumerations lead to more efficient parallel formulations than others.

How Comparisons are Performed

A sequential sorting algorithm can easily perform a compare-exchange on two elements because they are stored locally in the process's memory. In parallel sorting algorithms, this step is not so easy. If the elements reside on the same process, the comparison can be done easily. But if the elements reside on different processes, the situation becomes more complicated.

One Element Per Process

Consider the case in which each process holds only one element of the sequence to be sorted. At some point in the execution of the algorithm, a pair of processes (Pi, Pj) may need to compare their elements, ai and aj. After the comparison, Pi will hold the smaller and Pj the larger of {ai, aj}. We can perform the comparison by having both processes send their elements to each other. Each process compares the received element with its own and retains the appropriate element. In our example, Pi will keep the smaller and Pj will keep the larger of {ai, aj}. As in the sequential case, we refer to this operation as compare-exchange. Each compare-exchange operation requires one comparison step and one communication step.
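A hedged MPI sketch of this compare-exchange step for a pair of processes holding one integer each (the pairing rule and variable names are ours):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_elem = 100 - rank;                               /* the single local element */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;    /* neighbour pairing        */

    if (partner >= 0 && partner < size) {
        int other;
        /* Both processes send their element and receive the partner's. */
        MPI_Sendrecv(&my_elem, 1, MPI_INT, partner, 0,
                     &other,   1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Pi (lower rank) keeps the smaller element, Pj (higher rank) the larger. */
        if (rank < partner)
            my_elem = (my_elem < other) ? my_elem : other;
        else
            my_elem = (my_elem > other) ? my_elem : other;
    }
    printf("process %d now holds %d\n", rank, my_elem);
    MPI_Finalize();
    return 0;
}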

b) Modify DFS for parallel execution & analyze its complexity. [8]

The main strategy of depth-first search is to explore deeper into the graph whenever possible. Depth-first search explores edges that come out of the most recently discovered vertex v. Only edges going to unexplored vertices are explored. When all of v's edges have been explored, the search backtracks until it reaches an unexplored neighbor. This process continues until all of the vertices that are reachable from the original source vertex are discovered. If there are any unvisited vertices, depth-first search selects one of them as a new source and repeats the search from that vertex. The algorithm repeats this entire process until it has discovered every vertex. This algorithm is careful not to repeat vertices, so each vertex is explored once. DFS uses a stack data structure to keep track of vertices.

Here are the basic steps for performing a depth-first search:

1. Visit a vertex v.
2. Mark v as visited.
3. Recursively visit each unvisited vertex attached to v.

The following example illustrates the depth-first search algorithm on a small graph:

A depth-first search starting at A, assuming that the left edges in the given graph are chosen before right edges, and assuming the search remembers previously visited nodes and will not repeat them (since this is a small graph), will visit the nodes in the following order: A, B, D, F, E, C, G. The edges traversed in this search form a Trémaux tree, a structure with important applications in graph theory. Performing the same search without remembering previously visited nodes results in visiting the nodes in the order A, B, D, F, E, A, B, D, F, E, and so on forever, caught in the A, B, D, F, E cycle and never reaching C or G.

Iterative deepening is one technique to avoid this infinite loop and would reach all nodes.

Depth-first search visits every vertex once and checks every edge in the graph once. Therefore, the complexity of DFS is O(V + E). This assumes that the graph is represented as an adjacency list.
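A compact iterative sketch of the sequential procedure described above, using an explicit stack (array sizes and names such as MAX_V and adj are illustrative, not from the solution):

#include <stdbool.h>
#include <stdio.h>

#define MAX_V 100

int  adj[MAX_V][MAX_V];   /* adj[u][k]  = k-th neighbour of u       */
int  adj_len[MAX_V];      /* adj_len[u] = number of neighbours of u */
bool visited[MAX_V];

void dfs(int source)
{
    int stack[MAX_V * MAX_V];   /* generous bound: at most one push per edge */
    int top = 0;
    stack[top++] = source;

    while (top > 0) {
        int u = stack[--top];
        if (visited[u]) continue;     /* never repeat a vertex */
        visited[u] = true;            /* visit and mark u      */
        printf("visited %d\n", u);

        for (int k = adj_len[u] - 1; k >= 0; k--) {
            int v = adj[u][k];
            if (!visited[v]) stack[top++] = v;   /* explore deeper later */
        }
    }
}

With an adjacency list each vertex is pushed at most once per incident edge, giving the O(V + E) bound quoted above.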

Q6) a) Discuss the issues in sorting for parallel computers.

Ans: Parallelizing a sequential sorting algorithm involves distributing the elements to be sorted onto the available processes. This process raises a number of issues that we must address in order to make the presentation of parallel sorting algorithms clearer.

b) Explain Dijkstra's algorithm in parallel formulations. [8]

Ans: Given a graph and a source vertex in the graph, find shortest paths from source to all vertices in the given graph.

Dijkstra's algorithm is very similar to Prim's algorithm for minimum spanning trees. Like Prim's MST, we generate an SPT (shortest path tree) with the given source as root. We maintain two sets: one set contains vertices included in the shortest path tree, the other set contains vertices not yet included. At every step of the algorithm, we find a vertex which is in the not-yet-included set and has a minimum distance from the source. The detailed steps used in Dijkstra's algorithm to find the shortest paths from a single source vertex to all other vertices in the given graph are:

1) Create a set sptSet (shortest path tree set) that keeps track of vertices included in the shortest path tree, i.e., whose minimum distance from the source is calculated and finalized. Initially, this set is empty.

2) Assign a distance value to all vertices in the input graph. Initialize all distance values as INFINITE. Assign distance value 0 to the source vertex so that it is picked first.

3) While sptSet does not include all vertices:
a) Pick a vertex u which is not in sptSet and has the minimum distance value.
b) Include u in sptSet.
c) Update the distance values of all adjacent vertices of u. To update the distance values, iterate through all adjacent vertices. For every adjacent vertex v, if the sum of the distance value of u (from the source) and the weight of edge u-v is less than the distance value of v, then update the distance value of v.


We repeat the above steps until sptSet includes all the vertices of the given graph. Finally, we obtain the shortest path tree (SPT).
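A compact sequential sketch of these steps (adjacency-matrix version, O(V^2); V, INF, and the array names are illustrative). In the usual parallel formulation, the distance array is partitioned among the processes: each process finds its local minimum in step 3a, the global minimum is obtained by an all-to-one reduction and then broadcast, and each process relaxes its own portion of the vertices in step 3c.

#include <stdbool.h>
#include <limits.h>

#define V   5
#define INF INT_MAX

void dijkstra(int graph[V][V], int src, int dist[V])
{
    bool sptSet[V] = { false };                 /* step 1: sptSet starts empty */

    for (int i = 0; i < V; i++) dist[i] = INF;  /* step 2: all distances INF   */
    dist[src] = 0;                              /*         source picked first */

    for (int count = 0; count < V - 1; count++) {                /* step 3 */
        int u = -1, min = INF;
        for (int v = 0; v < V; v++)                              /* 3a: pick min dist */
            if (!sptSet[v] && dist[v] < min) { min = dist[v]; u = v; }
        if (u == -1) break;                     /* remaining vertices unreachable */
        sptSet[u] = true;                                        /* 3b: add u to set  */
        for (int v = 0; v < V; v++)                              /* 3c: relax edges   */
            if (!sptSet[v] && graph[u][v] != 0 && dist[u] != INF &&
                dist[u] + graph[u][v] < dist[v])
                dist[v] = dist[u] + graph[u][v];
    }
}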

b) Explain communication strategies for parallel BFS. [8]

Ans: An important component of best-first search (BFS) algorithms is the open list. It maintains the unexpanded nodes in the search graph, ordered according to their l-values. In the sequential algorithm, the most promising node from the open list is removed and expanded, and newly generated nodes are added to the open list.

In most parallel formulations of BFS, different processors concurrently expand different nodes from the open list. These formulations differ according to the data structures they use to implement the open list. Given p processors, the simplest strategy assigns each processor to work on one of the current best nodes on the open list. This is called the centralized strategy because each processor gets work from a single global open list. Since this formulation of parallel BFS expands more than one node at a time, it may expand nodes that would not be expanded by a sequential algorithm. Consider the case in which the first node on the open list is a solution. The parallel formulation still expands the first p nodes on the open list. However, since it always picks the best p nodes, the amount of extra work is limited. Figure 11.14 illustrates this strategy. There are two problems with this approach:

1. The termination criterion of sequential BFS fails for parallel BFS. Since at any moment, p nodes from the open list are being expanded, it is possible that one of the nodes may be a solution that does not correspond to the best goal node (or the path found is not the shortest path). This is because the remaining p - 1 nodes may lead to search spaces containing better goal nodes. Therefore, if the cost of a solution found by a processor is c, then this solution is not guaranteed to correspond to the best goal node until the cost of nodes being searched at other processors is known to be at least c. The termination criterion must be modified to ensure that termination occurs only after the best solution has been found.


2. Since the open list is accessed for each node expansion, it must be easily accessible to all processors, which can severely limit performance. Even on shared-address-space architectures, contention for the open list limits speedup. Let texp be the average time to expand a single node, and taccess be the average time to access the open list for a single-node expansion. If there are n nodes to be expanded by both the sequential and parallel formulations (assuming that they do an equal amount of work), then the sequential run time is given by n(taccess + texp). Assume that it is impossible to parallelize the expansion of individual nodes. Then the parallel run time will be at least ntaccess, because the open list must be accessed at least once for each node expanded. Hence, an upper bound on the speedup is (taccess + texp)/taccess.

Q7) a) Draw & explain CUDA architecture in detail [8]

Ans: CPUs are designed to process sequential instructions as quickly as possible. While most CPUs support threading, creating a thread is usually an expensive operation, and high-end CPUs can usually make efficient use of no more than about 12 concurrent threads.

GPUs, on the other hand, are designed to process a small number of parallel instructions on large sets of data as quickly as possible, for instance, calculating 1 million polygons and determining which to draw on the screen and where. To do this they rely on many slower processors and inexpensive threads.

Physical Architecture

CUDA-capable GPU cards are composed of one or more Streaming Multiprocessors (SMs), which are an abstraction of the underlying hardware. Each SM has a set of Streaming Processors (SPs), also called CUDA cores, which share a cache of shared memory that is faster than the GPU's global memory but that can only be accessed by the threads running on the SPs of that SM. These streaming processors are the "cores" that execute instructions.

The number of SPs/cores in an SM and the number of SMs depend on your device; they can be queried at run time, as sketched below. It is important to realize, however, that regardless of GPU model, there are many more CUDA cores in a GPU than in a typical multicore CPU: hundreds or thousands more. For example, the Kepler Streaming Multiprocessor design, dubbed SMX, contains 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, and 32 load/store units. (See the Kepler Architecture Whitepaper for a description and diagram.)
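A minimal query sketch (added here for illustration; all of these are standard CUDA runtime calls and cudaDeviceProp fields):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          /* properties of device 0 */

    printf("Device name         : %s\n", prop.name);
    printf("Compute capability  : %d.%d\n", prop.major, prop.minor);
    printf("Number of SMs       : %d\n", prop.multiProcessorCount);
    printf("Warp size           : %d\n", prop.warpSize);
    printf("Max threads / block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}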

CUDA cores are grouped together to perform instructions in what NVIDIA has termed a warp of threads. A warp is simply a group of threads that are scheduled together to execute the same instructions in lockstep. All CUDA cards to date use a warp size of 32. Each SM has at least one warp scheduler, which is responsible for executing 32 threads. Depending on the model of GPU, the cores may be double or quadruple pumped so that they execute one instruction on two or four threads in as many clock cycles. For instance, Tesla devices use a group of 8 quad-pumped cores to execute a single warp. If fewer than 32 threads are scheduled in the warp, it will still take as long to execute the instructions.

The CUDA programmer is responsible for ensuring that the threads are being assigned efficiently for code that is designed to run on the GPU. The assignment of threads is done virtually in the code using what is sometimes referred to as a ‘tiling’ scheme of blocks of threads that form a grid. Programmers define a kernel function that will be executed on the CUDA card using a particular tiling scheme.

b) List APIs for dealing with CUDA device memory. [5]
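The core runtime calls for device memory are cudaMalloc, cudaMemcpy, cudaMemset, and cudaFree, with cudaMallocHost/cudaFreeHost for pinned host memory and cudaMallocManaged for unified memory. A minimal usage sketch (buffer names are illustrative, not from the original answer):

#include <cuda_runtime.h>

void device_memory_demo(void)
{
    const int N = 256;
    float host_buf[256];
    float *dev_buf = NULL;

    cudaMalloc((void **)&dev_buf, N * sizeof(float));    /* allocate device memory   */
    cudaMemset(dev_buf, 0, N * sizeof(float));           /* initialize device memory */

    cudaMemcpy(dev_buf, host_buf, N * sizeof(float),
               cudaMemcpyHostToDevice);                  /* host -> device copy      */
    cudaMemcpy(host_buf, dev_buf, N * sizeof(float),
               cudaMemcpyDeviceToHost);                  /* device -> host copy      */

    cudaFree(dev_buf);                                   /* release device memory    */

    /* Asynchronous variants (cudaMemcpyAsync, cudaMemsetAsync) take an extra
     * stream argument; their behaviour is summarized below. */
}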

API synchronization behavior

The API provides memcpy/memset functions in both synchronous and asynchronous forms, the latter having an "Async" suffix. This is a misnomer as each function may exhibit synchronous or asynchronous behavior depending on the arguments passed to the function.


Memcpy

In the reference documentation, each memcpy function is categorized as synchronous or asynchronous, corresponding to the definitions below.

Synchronous

1. All transfers involving Unified Memory regions are fully synchronous with respect to the host.

2. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

3. For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.

4. For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.

5. For transfers from device memory to device memory, no host-side synchronization is performed.

6. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

Asynchronous

1. For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

2. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

3. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.

c) Explain different kinds of CUDA memory. [5]

CUDA Memory Types

Every CUDA enabled GPU provides several different types of memory. These different types of memory each have different properties such as access latency, address space, scope, and lifetime.

The different types of memory are register, shared, local, global, and constant memory.

On devices with compute capability 1.x, there are two locations where memory can possibly reside: cache memory and device memory.


The cache memory is considered "on-chip" and accesses to the cache are very fast. Shared memory and cached constant memory are stored in cache memory on devices that support compute capability 1.x.

The device memory is considered "off-chip" and accesses to device memory are roughly 100x slower than accesses to cached memory. Global memory, local memory, and (uncached) constant memory are stored in device memory.

On devices that support compute capability 2.x, there is an additional memory bank stored with each streaming multiprocessor. This is considered L1 cache, and although its address space is relatively small, its access latency is very low.
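A small sketch (ours) showing where these memory types appear in CUDA code; it assumes blocks of at most 128 threads and uses purely illustrative names:

#include <cuda_runtime.h>

__constant__ float coeff[16];            /* constant memory: read-only, cached      */
__device__   float global_table[256];    /* global memory: visible to all threads   */

__global__ void memory_demo(const float *in, float *out, int n)
{
    __shared__ float tile[128];          /* shared memory: one copy per block       */

    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* index held in a register    */

    if (i < n)
        tile[threadIdx.x] = in[i];       /* global -> shared                        */
    __syncthreads();                     /* barrier for the whole block             */

    if (i < n) {
        float scratch = tile[threadIdx.x] * coeff[0];  /* register (or local memory
                                                          if the compiler spills)   */
        out[i] = scratch + global_table[i % 256];      /* write back to global      */
    }
}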

OR

Q8) a) Explain how a CUDA-C program executes at the kernel level, with an example. [8]

Ans: Sample code for adding two numbers on a GPU

This sample code adds 2 numbers together with a GPU:

1. Define a kernel (a function to run on a GPU).
2. Allocate & initialize the host data.
3. Allocate & initialize the device data.
4. Invoke the kernel on the GPU.
5. Copy the kernel output to the host.
6. Cleanup.

Define a kernel

Use the keyword __global__ to define a kernel. A kernel is a function to be run on a GPU instead of a CPU. This kernel adds two numbers a and b and stores the result in c.

// Kernel definition
// Runs on the GPU: adds two numbers and stores the result in c
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}


Allocate & initialize host data

On the host, allocate the input and output parameters for the kernel call, and initialize all input parameters.

int main(void) {
    // Allocate & initialize host data - runs on the host
    int a, b, c;      // host copies of a, b, c
    a = 2;
    b = 7;
    ...
}
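The "..." above stands for the remaining steps listed earlier. A hedged sketch of how they might look, using the same variable names (d_a, d_b, d_c are device copies we introduce for illustration):

    int *d_a, *d_b, *d_c;                      // device copies of a, b, c

    // Allocate & initialize device data
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Invoke the kernel on the GPU: 1 block of 1 thread
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy the kernel output back to the host
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;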

c) Give five applications of CUDA. [5]

Ans: Fast Video Transcoding

Transcoding is a very common and highly complex procedure which easily involves trillions of parallel computations, many of which are floating point operations.

Video Enhancement

Complicated video enhancement techniques often require an enormous amount of computations. For example, there are algorithms that can upscale a movie by using information from frames surrounding the current frame.

Oil and Natural Resource Exploration

The first two topics I talked about had to do with video, which is naturally suited for the video card. Now it’s time to talk about more serious technologies involving oil, gas, and other natural resource exploration.

Medical Imaging

CUDA is a significant advancement for the field of medical imaging. Using CUDA, MRI machines can now compute images faster than ever before, and for a lower price.

Computational Sciences

In the raw field of computational sciences, CUDA is very advantageous. For example, it is now possible to use CUDA with MATLAB, which can increase computations by a great amount.

Neural Networks

I personally worked on a program which required the training of several thousand neural networks to a large set of training data.


Gate-level VLSI Simulation

In college, my friend and I were able to create a simple gate-level VLSI simulation tool which used CUDA. Speedups were anywhere from 4x to 70x, depending on the circuit and stimulus to the circuit.

Fluid Dynamics

Fluid dynamics simulations have also been created. These simulations require a huge number of calculations, and are useful for wing design, and other engineering tasks.