
  • PARALLEL PROCESSING : FUNDAMENTALS

    Khushdeep Singh, Department of Computer Science and Engineering, IIT Kanpur

    TUTOR : Prof. Dr. U. Rude, Florian Schornbaum

  • OUTLINE

    Overview

    1. What is Parallel Processing ?

    2. Why use Parallel Processing ?

    Flynn's Classical Taxonomy

    Parallel Computer Memory Architectures

    1. Shared Memory

    2. Distributed Memory

    3. Hybrid Distributed-Shared Memory

    Parallel Programming Models

    Designing Parallel Programs

    Amdahl's Law

    Embarrassingly parallel

    Summary

  • What is Parallel Processing?

    Simultaneous use of multiple resources to solve a computational problem :

    The problem is broken into discrete parts that can be solved concurrently

    Instructions from each part execute simultaneously on different CPUs

  • Why use Parallel Processing?

    Save time

    Solve larger problems:

    Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer

    Use of non-local resources:

    Using compute resources on a wide area network, or even the Internet when local compute resources are scarce

    E.g. : SETI@home : over 1.3 million users, 3.2 million computers in nearly every country in the world.

  • Why use Parallel Processing?

    Limits to serial computing:

    Transmission speeds : limits on how fast data can move through hardware

    Limits to miniaturization

    Heating issues : power consumption grows with clock frequency

    Economic limitations : it is increasingly expensive to make a single processor faster

    Current computer architectures are increasingly relying upon hardware level parallelism to improve performance:

    Multiple execution units

    Pipelined instructions

    Multi-core

  • Why use Parallel Processing?

    Parallelism and Moore's law:

    Moore's law : performance of chips effectively doubles every 2 years due to the addition of more transistors to the chip

    Parallel computation necessary to take full advantage of the gains allowed by Moore's law

  • Flynn's Classical Taxonomy

    Classification of Parallel Computers : Flynn's Classical Taxonomy

    Single Instruction, Single Data (SISD):

    A serial (non-parallel) computer

    Single Instruction : only one instruction stream is being acted on by the CPU during any one clock cycle

    Single Data : only one data stream is being used as input during any one clock cycle

    Single Instruction, Multiple Data (SIMD):

    Single Instruction : all processing units execute the same instruction at any given clock cycle

    Multiple Data : each processing unit can operate on a different data element

    Best suited for problems characterized by a high degree of regularity, such as image processing. E.g. : GPU

  • Flynn's Classical Taxonomy

    Multiple Instruction, Single Data (MISD):

    Multiple Instruction : each processing unit operates on the data independently via separate instruction streams

    Single Data : a single data stream is fed into multiple processing units

    Few actual examples

    Multiple Instruction, Multiple Data (MIMD):

    Multiple Instruction : every processor may be executing a different instruction stream

    Multiple Data : every processor may be working with a different data stream

    E.g. : networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs

  • Parallel Architectures

    Shared Memory :

    Ability for all processors to access all memory as global address space

    Changes in a memory location made by one processor are visible to all other processors

    Shared memory machines can be divided into two main classes based upon memory access times:

    Uniform Memory Access (UMA)

    Non-Uniform Memory Access (NUMA)

  • Parallel Architectures

    Uniform Memory Access (UMA) :

    Commonly represented by Symmetric Multiprocessor (SMP) machines

    Identical processors

    Equal access times to memory

  • Parallel Architectures

    Non-Uniform Memory Access (NUMA)

    Made by physically linking two or more SMPs

    One SMP can directly access memory of another

    Not all processors have equal access time to all memories

    Memory access across link is slower

  • Parallel Architectures Distributed Memory :

    Processors have their own local memory

    Changes in a processor's local memory have no effect on the memory of other processors

    Needs message passing

    Explicit programming required

  • Parallel Architectures Shared vs Distributed memory :

    Shared Memory

    Advantages : data sharing between tasks is fast; user-friendly programming perspective to memory

    Disadvantages : expense increases with the number of processors; the programmer is responsible for synchronization; lack of scalability

    Distributed Memory

    Advantages : memory is scalable with the number of processors; no cache coherency overhead; cost effective due to use of commodity networking

    Disadvantages : explicit programming required; message passing involves overhead

  • Parallel Architectures

    Hybrid Distributed-Shared Memory

    Shared memory component : a cache coherent SMP machine

    Distributed memory component : networking of multiple SMP machines

  • Parallel Programming Models

    An abstraction above hardware and memory architectures

    Models NOT specific to a particular type of memory architecture

    Shared Memory Model:

    Tasks share a common address space

    Mechanisms such as locks / semaphores used for synchronization

    Advantage : program development is simplified

    Threads can be used :

    Each thread has its own local data but also shares all the resources of the main program

    Threads communicate with each other through global memory
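
    To make the model concrete, here is a minimal illustrative sketch in C (assuming POSIX threads; the thread count and loop bound are arbitrary example values): several threads update a shared global counter, and a mutex provides the lock-based synchronization mentioned above.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Shared (global) data, visible to every thread */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* the lock protects the shared counter */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        /* All threads see (and have modified) the same global memory */
        printf("Final counter value: %ld\n", counter);
        return 0;
    }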

  • Parallel Programming Models Implementation of shared memory model :

    OpenMP :

    Directive based

    Master thread forks a specified number of slave threads and the task is divided among them

    After execution of parallel task, threads join back
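
    As an illustration of this fork-join pattern, here is a minimal sketch (the array size is an arbitrary example value): the parallel for directive forks a team of threads, the loop iterations are divided among them, and the threads join again at the end of the loop.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {   /* serial initialization by the master thread */
            a[i] = i;
            b[i] = 2.0 * i;
        }

        /* Fork: a team of threads is created and the iterations are divided among them.
           Join: all threads synchronize at the end of the loop. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }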

  • Parallel Programming Models OpenMP : Core elements

  • Parallel Programming Models

    OpenMP : Example Program

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
        int th_id, nthreads;

        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("Hello World from thread %d\n", th_id);
            #pragma omp barrier
            if (th_id == 0) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return EXIT_SUCCESS;
    }

  • Parallel Programming Models

    Message Passing Model :

    Tasks use their own local memory

    Tasks exchange data by sending and receiving messages

    User explicitly distributes data

  • Parallel Programming Models

    Implementation of message passing model :

    Message Passing Interface (MPI) :

    PORTABILITY : Architecture and hardware independent code

    Provides well-defined and safe data transfer

    Supports heterogeneous environments (e.g. clusters)

    Most MPI implementations consist of a specific set of routines (i.e., an API) directly callable from C, C++, Fortran

  • Parallel Programming Models

    Message Passing Interface (MPI) : Concepts

    Communicator and rank : a communicator connects a group of processes in the MPI session; the rank identifies each process within it

    Point-to-point basics : communication between two specific processes. E.g. MPI_Send, MPI_Recv calls

    Collective basics : communication among all processes in a process group E.g. MPI_Bcast, MPI_Reduce calls

    Derived data types :

    specify the type of data which is sent between processors

    predefined MPI data types such as MPI_INT, MPI_CHAR, MPI_DOUBLE
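
    The example program on the following slides uses the point-to-point calls; as a small supplementary sketch of the collective calls named above (the broadcast value 42 is an arbitrary illustration), rank 0 broadcasts a value with MPI_Bcast, every rank computes a local contribution, and MPI_Reduce sums the contributions onto rank 0.

    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int rank, size, root_value = 0, local, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            root_value = 42;                    /* value initially known only to rank 0 */

        /* Collective: every rank in the communicator receives root_value from rank 0 */
        MPI_Bcast(&root_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        local = root_value + rank;              /* each rank computes its own contribution */

        /* Collective: the local contributions are summed onto rank 0 */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum over %d ranks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }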

  • Parallel Programming Models

    Message Passing Interface (MPI) : Example Program

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    #define BUFSIZE 128
    #define TAG 0

    int main (int argc, char *argv[])
    {
        char idstr[32];
        char buff[BUFSIZE];
        int numprocs;
        int myid;
        int i;
        MPI_Status stat;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if (myid == 0) {
            /* rank 0 sends a greeting to every other rank */
            for (i = 1; i < numprocs; i++) {
                sprintf(buff, "Hello %d! ", i);
                MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
            }

  • Parallel Programming Models

    Message Passing Interface (MPI) : Example Program (continued)

            /* rank 0 then collects and prints the replies */
            for (i = 1; i < numprocs; i++) {
                MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
                printf("%s\n", buff);
            }
        } else {
            /* every other rank receives the greeting, appends its id, and replies to rank 0 */
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
            sprintf(idstr, "Processor %d ", myid);
            strncat(buff, idstr, BUFSIZE - 1);
            strncat(buff, "reporting for duty", BUFSIZE - 1);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

  • Designing Parallel Programs

    Automatic and Manual Parallelization :

    Manual Parallelization : time consuming, complex and error-prone

    Automatic Parallelization : done by a parallelizing compiler or pre-processor. Two different ways:

    Fully Automatic :

    compiler analyzes the source code and identifies opportunities for parallelism

    Programmer Directed :

    using "compiler directives" or flags, the programmer explicitly tells the compiler how to parallelize the code

    E.g. : OpenMP

  • Designing Parallel Programs

    Partitioning :

    Breaking the problem into discrete "chunks" of work that can be distributed to multiple tasks

    Two basic ways to partition :

    Domain decomposition : the data associated with a problem is decomposed
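
    As a minimal sketch of domain decomposition (the helper block_range and the values n = 10, nprocs = 4 are hypothetical illustration choices), the data, here a one-dimensional index range of n elements, is split into contiguous blocks, one block per task:

    #include <stdio.h>

    /* Compute the half-open index range [lo, hi) owned by task `rank`
       when n elements are split among nprocs tasks as evenly as possible. */
    static void block_range(int n, int nprocs, int rank, int *lo, int *hi)
    {
        int base = n / nprocs;           /* minimum block size                          */
        int rem  = n % nprocs;           /* the first `rem` tasks get one extra element */
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int n = 10, nprocs = 4;
        for (int rank = 0; rank < nprocs; rank++) {
            int lo, hi;
            block_range(n, nprocs, rank, &lo, &hi);
            printf("task %d owns elements [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }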

  • Designing Parallel Programs

    Partitioning :

    Two basic ways to partition :

    Functional decomposition : the focus is on the computation that is to be performed rather than on the data manipulated by the computation

  • Designing Parallel Programs

    Load Balancing :

    Practice of distributing work among tasks so that all tasks are kept busy all of the time

    Two types :

    Static load balancing : assigning a fixed amount of work to each processing site a priori

    Dynamic Load Balancing : Two types :

    Task-oriented : when one processing site finishes its task, it is assigned another task

    Data-oriented : when a processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process
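
    In OpenMP these two strategies correspond roughly to the loop scheduling clauses; the sketch below (with a made-up work() function whose cost varies per iteration) contrasts static, a-priori division of the iterations with dynamic, task-oriented assignment, where a thread that finishes its chunk is handed the next available one.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    /* stand-in for a work item whose cost varies from iteration to iteration */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < (i % 100) * 1000; k++)
            s += k * 1e-9;
        return s;
    }

    int main(void)
    {
        double total = 0.0;

        /* Static load balancing: iterations are divided a priori into fixed blocks */
        #pragma omp parallel for schedule(static) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += work(i);

        /* Dynamic (task-oriented) load balancing: a thread that finishes its chunk
           of iterations is assigned the next available chunk of 16 */
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += work(i);

        printf("total = %f\n", total);
        return 0;
    }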

  • Designing Parallel Programs

    Granularity :

    Qualitative measure of the ratio of computation to communication

    Fine-grain Parallelism : relatively small amounts of computation between communication events

    Facilitates load balancing

    High communication overhead

    Coarse-grain Parallelism : significant work done between communications

    Most efficient granularity depends on the algorithm and the hardware environment used

  • Amdahl's Law

    Expected speedup of parallelized implementations of an algorithm relative to the serial algorithm.

    Eq. :

    Speedup = 1 / ((1 - P) + P / N)

    P : Portion that can be made parallel

    N : No. of processors
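
    As a quick worked example, suppose P = 0.9 and N = 8:

    Speedup = 1 / ((1 - 0.9) + 0.9 / 8) = 1 / (0.1 + 0.1125) ≈ 4.7

    Even with an unlimited number of processors, the speedup is bounded by 1 / (1 - P) = 10.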

  • Embarrassingly parallel

    Embarrassingly parallel problem : little or no effort is required to separate the problem into a number of parallel tasks

    No dependency (or communication) between the parallel tasks

    Examples :

    Distributed relational database queries using distributed set processing

    Rendering of computer graphics

    Event simulation and reconstruction in particle physics

    Brute-force searches in cryptography

    Ensemble calculations of numerical weather prediction

    Tree growth step of the random forest machine learning technique
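
    As a minimal sketch of the pattern (using OpenMP and a made-up per-item function), every output element depends only on its own input, so the loop parallelizes with no communication between tasks:

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define N 1000000

    /* hypothetical per-item computation: each result depends only on its own input */
    static double process_item(double x)
    {
        return sin(x) * sin(x) + cos(x) * cos(x);
    }

    int main(void)
    {
        static double in[N], out[N];

        for (int i = 0; i < N; i++)
            in[i] = (double)i;

        /* No dependencies between iterations: each task processes its own items */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = process_item(in[i]);

        printf("out[0] = %f, out[%d] = %f\n", out[0], N - 1, out[N - 1]);
        return 0;
    }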

  • Applications of parallel processing

  • Summary

    Parallel Processing : Simultaneous use of multiple resources to solve a computational problem

    Need for parallel processing : Limits to serial computing and Moore's Law

    Flynn's Classical Taxonomy : SISD, SIMD, MIMD, MISD

    Parallel architectures : Shared memory, distributed memory and hybrid

    Parallel programming models : OpenMP, MPI

    Designing parallel programs : Automatic parallelization, partitioning, load balancing and granularity

    Embarrassingly parallel problems : very easy to solve by parallel processing

  • References

    Introduction to Parallel Computing : https://computing.llnl.gov/tutorials/parallel_comp/#Hybrid

    http://en.wikipedia.org

    Introduction to Scientific High Performance Computing : Reinhold Bader (LRZ), Georg Hager (RRZE), Heinz Bast (Intel)

    Elementary Parallel Programming With Examples : Reinhold Bader (LRZ), Georg Hager (RRZE)

    Programming Shared Memory Systems with OpenMP : Reinhold Bader (LRZ) , Georg Hager (RRZE)

    THANK YOU !
