
  • PARALLEL PROCESSING : FUNDAMENTALS

    Khushdeep Singh, Department of Computer Science and Engineering, IIT Kanpur

    TUTOR : Prof. Dr. U. Rude, Florian Schornbaum

  • OUTLINE

    Overview

    1. What is Parallel Processing ?

    2. Why use Parallel Processing ?

    Flynn's Classical Taxonomy

    Parallel Computer Memory Architectures

    1. Shared Memory

    2. Distributed Memory

    3. Hybrid Distributed-Shared Memory

    Parallel Programming Models

    Designing Parallel Programs

    Amdahl's Law

    Embarrassingly parallel

    Summary

  • What is Parallel Processing?

    Simultaneous use of multiple resources to solve a computational problem :

    The problem is broken into discrete parts that can be solved concurrently

    Instructions from each part execute simultaneously on different CPUs

  • Why use Parallel Processing?

    Save time

    Solve larger problems:

    Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer

    Use of non-local resources:

    Using compute resources on a wide area network, or even the Internet when local compute resources are scarce

    E.g. : SETI@home : over 1.3 million users, 3.2 million computers in nearly every country in the world.

  • Why use Parallel Processing?

    Limits to serial computing:

    Transmission speeds : limits on how fast data can move through hardware

    Limits to miniaturization

    Heating issues : power consumption grows with clock frequency

    Economic limitations : it is increasingly expensive to make a single processor faster

    Current computer architectures are increasingly relying upon hardware level parallelism to improve performance:

    Multiple execution units

    Pipelined instructions

    Multi-core

  • Why use Parallel Processing?

    Parallelism and Moore's law:

    Moore's law : performance of chips effectively doubles every 2 years due to the addition of more transistors to the chip

    Parallel computation necessary to take full advantage of the gains allowed by Moore's law

  • Flynn's Classical Taxonomy

    Classification of Parallel Computers : Flynn's Classical Taxonomy

    Single Instruction, Single Data (SISD):

    A serial (non-parallel) computer

    Single Instruction : only one instruction stream is being acted on by the CPU during any one clock cycle

    Single Data : only one data stream is being used as input during any one clock cycle

    Single Instruction, Multiple Data (SIMD):

    Single Instruction : all processing units execute the same instruction at any given clock cycle

    Multiple Data : each processing unit can operate on a different data element

    Best suited for problems characterized by a high degree of regularity, such as image processing. E.g. : GPU

  • Flynn's Classical Taxonomy

    Multiple Instruction, Single Data (MISD):

    Multiple Instruction : each processing unit operates on the data independently via separate instruction streams

    Single Data : a single data stream is fed into multiple processing units

    Few actual examples

    Multiple Instruction, Multiple Data (MIMD):

    Multiple Instruction : every processor may be executing a different instruction stream

    Multiple Data : every processor may be working with a different data stream

    E.g. : networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs

  • Parallel Architectures

    Shared Memory :

    Ability for all processors to access all memory as global address space

    Changes in a memory location made by one processor are visible to all other processors

    Shared memory machines can be divided into two main classes based upon memory access times:

    Uniform Memory Access (UMA)

    Non-Uniform Memory Access (NUMA)

  • Parallel Architectures

    Uniform Memory Access (UMA) :

    Commonly represented by Symmetric Multiprocessor (SMP) machines

    Identical processors

    Equal access times to memory

  • Parallel Architectures

    Non-Uniform Memory Access (NUMA)

    Made by physically linking two or more SMPs

    One SMP can directly access memory of another

    Not all processors have equal access time to all memories

    Memory access across link is slower

  • Parallel Architectures Distributed Memory :

    Processors have their own local memory

    Changes in a processor's local memory have no effect on the memory of other processors

    Needs message passing

    Explicit programming required

  • Parallel Architectures Shared vs Distributed memory :

    Shared Memory

    Advantages : data sharing between tasks is fast; user-friendly programming perspective to memory

    Disadvantages : expense increases with the number of processors; the programmer is responsible for synchronization; lack of scalability

    Distributed Memory

    Advantages : memory is scalable with the number of processors; no cache coherency overhead; cost effective due to use of commodity networking

    Disadvantages : explicit programming required; message passing involves overhead

  • Parallel Architectures

    Hybrid Distributed-Shared Memory

    Shared memory component : a cache coherent SMP machine

    Distributed memory component : networking of multiple SMP machines

  • Parallel Programming Models

    An abstraction above hardware and memory architectures

    Models NOT specific to a particular type of memory architecture

    Shared Memory Model:

    Tasks share a common address space

    Mechanisms such as locks / semaphores used for synchronization

    Advantage : program development is simplified

    Threads can be used :

    Each thread has its own local data but also shares all the resources of the main program

    Threads communicate with each other through global memory
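
    To make the model concrete, here is a minimal illustrative sketch in C (assuming POSIX threads; the thread count and loop bound are arbitrary example values): several threads update a shared global counter, and a mutex provides the lock-based synchronization mentioned above.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Shared (global) data, visible to every thread */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* the lock protects the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        /* All threads see (and have modified) the same global memory */
        printf("Final counter value: %ld\n", counter);
        return 0;
    }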

  • Parallel Programming Models Implementation of shared memory model :

    OpenMP :

    Directive based

    Master thread forks a specified number of slave threads and the task is divided among them

    After execution of parallel task, threads join back
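
    As an illustration of this fork-join pattern, here is a minimal sketch (the array size is an arbitrary example value): the parallel for directive forks a team of threads, the loop iterations are divided among them, and the threads join again at the end of the loop.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {   /* serial initialization by the master thread */
            a[i] = i;
            b[i] = 2.0 * i;
        }

        /* Fork: a team of threads is created and the iterations are divided among them.
           Join: all threads synchronize at the end of the loop. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }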

  • Parallel Programming Models OpenMP : Core elements

  • Parallel Programming Models

    OpenMP : Example Program

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
        int th_id, nthreads;

        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("Hello World from thread %d\n", th_id);
            #pragma omp barrier
            if (th_id == 0) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return EXIT_SUCCESS;
    }

  • Parallel Programming Models

    Message Passing Model :

    Tasks use their own local memory

    Tasks exchange data by sending and receiving messages

    User explicitly distributes data

  • Parallel Programming Models

    Implementation of message passing model :

    Message Passing Interface (MPI) :

    PORTABILITY : Architecture and hardware independent code

    Provides well-defined and safe data transfer

    Supports heterogeneous environments (e.g. clusters)

    Most MPI implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++, Fortran

  • Parallel Programming Models

    Message Passing Interface (MPI) : Concepts

    Communicator and rank : a communicator connects a group of processes in the MPI session; the rank identifies each process within it

    Point-to-point basics : communication between two specific processes. E.g. MPI_Send, MPI_Recv calls

    Collective basics : communication among all processes in a process group E.g. MPI_Bcast, MPI_Reduce calls

    Derived data types :

    specify the type of data which is sent between processors

    predefined MPI data types such as MPI_INT, MPI_CHAR, MPI_DOUBLE
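
    The example program on the following slides uses the point-to-point calls; as a small supplementary sketch of the collective calls named above (the broadcast value 42 is an arbitrary illustration), rank 0 broadcasts a value with MPI_Bcast, every rank computes a local contribution, and MPI_Reduce sums the contributions onto rank 0.

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int rank, size, root_value = 0, local, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            root_value = 42;                    /* value initially known only to rank 0 */

        /* Collective: every rank in the communicator receives root_value from rank 0 */
        MPI_Bcast(&root_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        local = root_value + rank;              /* each rank computes its own contribution */

        /* Collective: the local contributions are summed onto rank 0 */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum over %d ranks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }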

  • Parallel Programming Models

    Message Passing Interface (MPI) : Example Program

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define BUFSIZE 128
    #define TAG 0

    int main (int argc, char *argv[])
    {
        char idstr[32];
        char buff[BUFSIZE];
        int numprocs;
        int myid;
        int i;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if (myid == 0) {
            /* rank 0 sends a greeting to every other rank */
            for (i = 1; i < numprocs; i++) {
                sprintf(buff, "Hello %d! ", i);
                MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
            }

  • Parallel Programming Models

    Message Passing Interface (MPI) : Example Program (continued)

            /* rank 0 then collects and prints the replies */
            for (i = 1; i < numprocs; i++) {
                MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
                printf("%s\n", buff);
            }
        } else {
            /* every other rank receives the greeting, appends its id, and replies to rank 0 */
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
            sprintf(idstr, "Processor %d ", myid);
            strncat(buff, idstr, BUFSIZE - 1);
            strncat(buff, "reporting for duty", BUFSIZE - 1);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

  • Designing Parallel Programs

    Automatic and Manual Parallelization :

    Manual Parallelization : time consuming, complex and error-prone

    Automatic Parallelization : done by a parallelizing compiler or pre-processor. Two different ways:

    Fully Automatic :

    compiler analyzes the source code and identifies opportunities for parallelism

    Programmer Directed :

    using "compiler directives" or flags, the programmer explicitly tells the compiler how to parallelize the code

    E.g. : OpenMP

  • Designing Parallel Programs

    Partitioning :

    Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks

    Two basic ways to partition :

    Domain decomposition : the data associated with a problem is decomposed
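
    As a minimal sketch of domain decomposition (the helper block_range and the values n = 10, nprocs = 4 are hypothetical illustration choices), the data, here a one-dimensional index range of n elements, is split into contiguous blocks, one block per task:

    #include <stdio.h>

    /* Compute the half-open index range [lo, hi) owned by task `rank`
       when n elements are split among nprocs tasks as evenly as possible. */
    static void block_range(int n, int nprocs, int rank, int *lo, int *hi)
    {
        int base = n / nprocs;           /* minimum block size                          */
        int rem  = n % nprocs;           /* the first `rem` tasks get one extra element */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int n = 10, nprocs = 4;
        for (int rank = 0; rank < nprocs; rank++) {
            int lo, hi;
            block_range(n, nprocs, rank, &lo, &hi);
            printf("task %d owns elements [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }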

  • Designing Parallel Programs

    Partitioning :

    Two basic ways to partition :

    Functional decomposition : the focus is on the computation that is to be performed rather than on the data manipulated by the computation

  • Designing Parallel Programs

    Load Balancing :

    Practice of distributing work among tasks so that all tasks are kept busy all of the time

    Two types :

    Static load balancing : assigning a fixed amount of work to each processing site a priori

    Dynamic Load Balancing : Two types :

    Task-oriented : when one processing site finishes its task, it is assigned another task

    Data-oriented : when a processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process
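
    In OpenMP these two strategies correspond roughly to the loop scheduling clauses; the sketch below (with a made-up work() function whose cost varies per iteration) contrasts static, a-priori division of the iterations with dynamic, task-oriented assignment, where a thread that finishes its chunk is handed the next available one.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    /* stand-in for a work item whose cost varies from iteration to iteration */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < (i % 100) * 1000; k++)
            s += k * 1e-9;
        return s;
    }

    int main(void)
    {
        double total = 0.0;

        /* Static load balancing: iterations are divided a priori into fixed blocks */
        #pragma omp parallel for schedule(static) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += work(i);

        /* Dynamic (task-oriented) load balancing: a thread that finishes its chunk
           of iterations is assigned the next available chunk of 16 */
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += work(i);

        printf("total = %f\n", total);
        return 0;
    }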

  • Designing Parallel Programs

    Granularity :

    Qualitative measure of the ratio of computation to communication

    Fine-grain Parallelism : relatively small amounts of computation between communication events

    Facilitates load balancing

    High communication overhead

    Coarse-grain Parallelism : significant work done between communications

    Most efficient granularity depends on the algorithm and the hardware environment used

  • Amdahl's Law

    Expected speedup of parallelized implementations of an algorithm relative to the serial algorithm.

    Eq. :

    Speedup = 1 / ((1 - P) + P / N)

    P : Portion that can be made parallel

    N : No. of processors
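
    As a quick worked example, suppose P = 0.9 and N = 8:

    Speedup = 1 / ((1 - 0.9) + 0.9 / 8) = 1 / (0.1 + 0.1125) ≈ 4.7

    Even with an unlimited number of processors, the speedup is bounded by 1 / (1 - P) = 10.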

  • Embarrassingly parallel

    Embarrassingly parallel problem : little or no effort is required to separate the problem into a number of parallel tasks

    No dependency (or communication) between the parallel tasks

    Examples :

    Distributed relational database queries using distributed set processing

    Rendering of computer graphics

    Event simulation and reconstruction in particle physics

    Brute-force searches in cryptography

    Ensemble calculations of numerical weather prediction

    Tree growth step of the random forest machine learning technique
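
    As a minimal sketch of the pattern (using OpenMP and a made-up per-item function), every output element depends only on its own input, so the loop parallelizes with no communication between tasks:

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define N 1000000

    /* hypothetical per-item computation: each result depends only on its own input */
    static double process_item(double x)
    {
        return sin(x) * sin(x) + cos(x) * cos(x);
    }

    int main(void)
    {
        static double in[N], out[N];

        for (int i = 0; i < N; i++)
            in[i] = (double)i;

        /* No dependencies between iterations: each task processes its own items */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = process_item(in[i]);

        printf("out[0] = %f, out[%d] = %f\n", out[0], N - 1, out[N - 1]);
        return 0;
    }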

  • Applications of parallel processing

  • Summary

    Parallel Processing : Simultaneous use of multiple resources to solve a computational problem

    Need for parallel processing : Limits to serial computing and Moore's Law

    Flynn's Classical Taxonomy : SISD, SIMD, MIMD, MISD

    Parallel architectures : Shared memory, distributed memory and hybrid

    Parallel programming models : OpenMP, MPI

    Designing parallel programs : Automatic parallelization, partitioning, load balancing and granularity

    Embarrassingly parallel problems : very easy to solve by parallel processing

  • References

    Introduction to Parallel Computing : https://computing.llnl.gov/tutorials/parallel_comp/#Hybrid

    http://en.wikipedia.org

    Introduction to Scientific High Performance Computing : Reinhold Bader (LRZ), Georg Hager (RRZE), Heinz Bast (Intel)

    Elementary Parallel Programming With Examples : Reinhold Bader (LRZ), Georg Hager (RRZE)

    Programming Shared Memory Systems with OpenMP : Reinhold Bader (LRZ) , Georg Hager (RRZE)

    THANK YOU !
