Intro Parallel Computation

Upload: cleopatra2121

Posted on 04-Apr-2018


TRANSCRIPT

  • Slide 1/35

An Overview of Parallel Computing

    Jayanti Prasad

Inter-University Centre for Astronomy & Astrophysics, Pune, India (411007)

    May 20, 2011

  • Slide 2/35

    Plan of the Talk

    Introduction

    Parallel platforms

    Parallel Problems & Paradigms

    Parallel Programming

    Summary and conclusions

  • Slide 3/35

    High-level view of a computer system

  • Slide 4/35

Why parallel computation?

    The speed with which a single processor can process data (number of instructions per second, FLOPS) cannot be increased indefinitely.

    It is difficult to cool faster CPUs. Faster CPUs demand smaller chip sizes, which in turn create more heat.

    Using a fast/parallel computer we can solve existing problems (or more problems) in less time, and we can solve completely new problems leading to new findings.

    In astronomy, High Performance Computing is used for two main purposes: processing large volumes of data, and dynamical simulations.

    References:
    Ananth Grama et al. 2003, Introduction to Parallel Computing, Addison-Wesley
    Kevin Dowd & Charles R. Severance, High Performance Computing, O'Reilly

  • Slide 5/35

    Performance

    Computation time for different problems may depend on the following:

    Input-Output: problems which require a lot of disk read/write.

    Communication: problems, like dynamical simulations, need a lot of data to be communicated among processors. Fast interconnects (Gigabit Ethernet, InfiniBand, Myrinet, etc.) are highly recommended.

    Memory: problems which need a large amount of data to be available in the main memory (DRAM) demand more memory. Latency and bandwidth are very important.

    Processor: problems in which a large number of computations have to be done.

    Cache: a better memory hierarchy (RAM, L1, L2, L3, ...) and efficient use of it benefit all problems.

  • Slide 6/35

    Performance measurement and benchmarking

    FLOPS: Computational power is measured in units of Floating Point Operations Per Second, or FLOPS (Mega-, Giga-, Tera-, Peta-FLOPS, ...).

    Speed Up:

    Speed Up = T_1 / T_N    (1)

    where T_1 is the time needed to solve a problem on one processor and T_N is the time needed to solve it on N processors.
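    For illustration (the timings here are hypothetical, not measured): if a job takes T_1 = 100 s on one processor and T_4 = 40 s on four processors, the speed-up is 100/40 = 2.5, against an ideal value of 4.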

  • Slide 7/35

    Pipeline

  • Slide 8/35

    Pipeline

    If we want to add two numbers we need five computational steps:

    If we have one functional unit for every step, then a single addition still requires five clock cycles; however, since all the units are busy at the same time, one result is produced every cycle.

    Reference: Tobias Wittwer 2005, An Introduction to Parallel Programming
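    As a rough illustrative count (assuming the five-step adder above and n independent additions): without pipelining the additions take 5n cycles, while with pipelining they take about 5 + (n - 1) cycles; for n = 100 that is 500 cycles versus 104.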

  • Slide 9/35

    Vector Processors

    A processor that performs one instruction on several data sets is called a vector processor.

    The most common form of parallel computation is Single Instruction Multiple Data (SIMD), i.e., the same computational steps are applied to different data sets.

    Problems which can be broken into small independent sub-problems for parallelization are called embarrassingly parallel problems, e.g., SIMD.
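    A minimal C sketch of such an embarrassingly parallel, SIMD-friendly loop (the array size N is arbitrary): the same operation is applied independently to every element, so an auto-vectorizing compiler can map the loop onto vector instructions.

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        /* same instruction applied to multiple data elements */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }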

  • Slide 10/35

    Multi-core Processors

    The part of a processor which executes instructions in an application is called a core.

    Note that in a multiprocessor system each processor can have one core or more than one core.

    In general the number of hardware threads is equal to the number of cores.

    In some processors one core can support more than one software thread.

    A multi-core processor presents multiple virtual CPUs to the user and the operating system. They are not physically countable, but the OS can give you a clue (check /proc).

    Reference : Darryl Gove, 2011, Multi-core Application Programming, Addison-Wesley
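    On a GNU/Linux system, a minimal C sketch like the following (using the sysconf call available in glibc) reports the number of logical CPUs the OS exposes, i.e., the same information /proc/cpuinfo lists:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* number of logical CPUs (hardware threads) currently online */
        long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
        printf("Online logical CPUs: %ld\n", ncpu);
        return 0;
    }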

  • Slide 11/35

    Shared Memory System

  • Slide 12/35

    Shared Memory System

  • Slide 13/35

    Distributed Memory system

Topology: Star, N-Cube, Torus, ...

  • Slide 14/35

    Parallel Problems

    Scalar product

    S = \sum_{i=1}^{N} A_i B_i    (2)

    Linear algebra: matrix multiplication

    C_{ij} = \sum_{k=1}^{M} A_{ik} B_{kj}    (3)

    Integration

    y = \int_0^1 \frac{4}{1 + x^2} \, dx    (4)

    Dynamical simulations

    f_i = \sum_{j=1}^{N} \frac{m_j (x_j - x_i)}{(|x_j - x_i|^2)^{3/2}}    (5)
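    For concreteness, a minimal serial C sketch of the first two problems (the vector length and the number of integration points are illustrative); in each case the work is a single loop over independent terms, which is what makes these problems easy to parallelize:

    #include <stdio.h>

    double scalar_product(const double *a, const double *b, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];                /* S = sum_i A_i * B_i */
        return s;
    }

    double integrate_pi(int n)               /* y = int_0^1 4/(1+x^2) dx */
    {
        double h = 1.0 / n, sum = 0.0;
        for (int i = 0; i < n; i++) {
            double x = (i + 0.5) * h;        /* midpoint rule */
            sum += 4.0 / (1.0 + x * x);
        }
        return h * sum;
    }

    int main(void)
    {
        double a[4] = {1, 2, 3, 4}, b[4] = {1, 1, 1, 1};
        printf("S  = %f\n", scalar_product(a, b, 4));
        printf("pi = %f\n", integrate_pi(1000000));
        return 0;
    }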

  • Slide 15/35

    Models of parallel programming: Fork-Join

  • Slide 16/35

    Models of parallel programming: Master-Slave

  • Slide 17/35

    Synchronization

    More than one processor working on a single problem must coordinate (if the problem is not an embarrassingly parallel problem).

    The OpenMP CRITICAL directive specifies a region of code that must be executed by only one thread at a time.

    The BARRIER directive (in OpenMP, MPI, etc.) synchronizes all threads in the team.

    In Pthreads, mutexes are used to avoid data inconsistencies due to simultaneous operations by multiple threads upon the same memory area at the same time.

  • Slide 18/35

    Partitioning or decomposing a problem: divide and conquer

    Partitioning or decomposing a problem is related to the opportunities for parallel execution.

    On the basis of whether we are dividing the execution or the data, we can have two types of decomposition:

    Functional decomposition
    Domain decomposition

    Load balance

  • Slide 19/35

    Domain decomposition

    Figure: The left figure shows the use of orthogonal recursive bisection and the right one shows a Peano-Hilbert space-filling curve for domain decomposition. Note that parts which have Peano-Hilbert keys close to each other are physically also close to each other.

    Examples: N-body, antenna-baselines, sky maps (l,m)

  • Slide 20/35

    Mapping and indexing

    Indexing in MPI

MPI_Comm_rank(MPI_COMM_WORLD, &id);

    Indexing in OpenMP

id = omp_get_thread_num();

    Indexing in CUDA

    tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    bid = blockIdx.x + gridDim.x * blockIdx.y;
    nthreads = blockDim.x * blockDim.y * blockDim.z;
    nblocks = gridDim.x * gridDim.y;
    id = tid + nthreads * bid;

    dim3 dimGrid(gr_sz_x, gr_sz_y), dimBlock(bl_sz, bl_sz, bl_sz);
    VecAdd<<<dimGrid, dimBlock>>>();

  • Slide 21/35

    Mapping

    A problem of size N can be divided among Np processors in the following ways:

    A processor with identification number id gets the data between the data indices istart and iend, where

    istart = id * N / Np    (6)

    iend = (id + 1) * N / Np    (7)

    Every processor can pick data skipping Np elements:

    for (i = id; i < N; i += Np)    (8)

    Note that N/Np may not always be an integer, so keep load balance in mind.
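    A minimal C sketch of the two mappings (N and Np are illustrative; id would come from MPI_Comm_rank or omp_get_thread_num in a real program):

    #include <stdio.h>

    #define N  10
    #define NP 3

    int main(void)
    {
        for (int id = 0; id < NP; id++) {
            /* block mapping, Eqs. (6)-(7) */
            int istart = id * N / NP;
            int iend   = (id + 1) * N / NP;
            printf("worker %d: block [%d, %d)\n", id, istart, iend);

            /* cyclic mapping, Eq. (8): every NP-th element starting at id */
            printf("worker %d: cyclic", id);
            for (int i = id; i < N; i += NP)
                printf(" %d", i);
            printf("\n");
        }
        return 0;
    }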

  • Slide 22/35

    Shared Memory programming: Threading Building Blocks

    Intel Threading Building Blocks (TBB) is provided in the form of a C++ runtime library and can run on any platform.

    The main advantage of TBB is that it works at a higher level than raw threads, yet does not require exotic languages or compilers.

    Most threading packages require you to create, join, and manage threads. Programming directly in terms of threads can be tedious and can lead to inefficient programs, because threads are low-level, heavy constructs that are close to the hardware.

    The TBB runtime library automatically schedules tasks onto threads in a way that makes efficient use of processor resources. The runtime is also very effective at load balancing.

  • Slide 23/35

    Shared Memory programming: Pthreads

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4    /* assumed value; not shown on the slide */

    void *print_hello_world(void *param) {
        long tid = (long) param;
        printf("Hello world from %ld!\n", tid);
        return NULL;
    }

    int main(int argc, char *argv[]) {
        pthread_t mythread[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&mythread[i], NULL, &print_hello_world, (void *) i);

        pthread_exit(NULL);    /* let the worker threads finish before the process exits */
        return 0;
    }

    Reference : David R. Butenhof 1997, Programming with Posix threads, Addison-Wesley

  • Slide 24/35

    Shared Memory programming: OpenMP

    No need to make any change in the structure of the program.

    Only three things need to be done:

    #include <omp.h>
    #pragma omp parallel for shared() private()
    -fopenmp when compiling

    In general available on all GNU/Linux systems by default.

    References:
    Rohit Chandra et al. 2001, Parallel Programming in OpenMP, Morgan Kaufmann
    Barbara Chapman et al. 2008, Using OpenMP, MIT Press

    http://www.iucaa.ernet.in/jayanti/openmp.html
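    A minimal sketch of those three steps applied to a loop (the array size and the saxpy-style update are illustrative); the iterations of the annotated loop are divided among the threads:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double x[N], y[N];
        double a = 2.0;

        #pragma omp parallel for shared(x, y, a)   /* iterations split among threads; i is private */
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("done, up to %d threads available\n", omp_get_max_threads());
        return 0;
    }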

  • Slide 25/35

    OpenMP: Example

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int nthreads, tid;

        /* set the number of threads from the command line */
        omp_set_num_threads(atoi(argv[1]));

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();          /* obtain the thread id */

            nthreads = omp_get_num_threads();    /* find the number of threads */

            printf("\tHello World from thread %d of %d\n", tid, nthreads);
        }
        return 0;
    }

  • Slide 26/35

    Distributed Memory Programming: MPI

    MPI is a solution for very large problems.

    Communication can be point-to-point or collective.

    Communication overhead can dominate computation, and it may be hard to get linear scaling.

    MPI is distributed in the form of libraries, e.g., libmpi, libmpich, or in the form of compiler wrappers, e.g., mpicc, mpif90.

    Programs have to be completely restructured.

    References:
    Gropp, William et al. 1999, Using MPI-2, MIT Press

    http://www.iucaa.ernet.in/jayanti/mpi.html
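    A minimal sketch of the point-to-point communication mentioned above: rank 0 sends one integer to rank 1 (the message value and tag are arbitrary). Compile with mpicc and run with mpirun -np 2:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }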

  • Slide 27/35

    MPI: Example

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        printf("Hello world! I am %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }

  • Slide 28/35

    GPU Programming: CUDA

    CUDA is a programming language similar to C with a few extra constructs.

    The parallel section is written in the form of a kernel, which is executed on the GPU.

    There are two different memory spaces: one for the CPU and another for the GPU. Data has to be explicitly copied back and forth.

    A very large number of threads can be used. Note that a GPU cannot match the complexity of a CPU; it is mostly used for SIMD programming.

    References:
    David B. Kirk and Wen-mei W. Hwu 2010, Programming Massively Parallel Processors, Morgan Kaufmann
    Jason Sanders & Edward Kandrot 2011, CUDA by Example, Addison-Wesley

    http://www.iucaa.ernet.in/jayanti/cuda.html

  • Slide 29/35

    CUDA: Example

    __global__ void force_pp(float *pos_d, float *acc_d, int n) {

        int tidx = threadIdx.x;
        int tidy = threadIdx.y;
        int tidz = threadIdx.z;

        int myid = (blockDim.z * (tidy + blockDim.y * tidx)) + tidz;
        int nthreads = blockDim.z * blockDim.y * blockDim.x;

        for (int i = myid; i < n; i += nthreads) {
            for (int l = 0; l < ndim; l++)
                acc_d[l + ndim * i] = ...;
        }
        __syncthreads();
    }
    /* this was the device part */

    dim3 dimGrid(1);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE, BLOCK_SIZE);
    cudaMemcpy(pos_d, pos, npart * ndim * sizeof(float), cudaMemcpyHostToDevice);
    force_pp<<<dimGrid, dimBlock>>>(pos_d, acc_d, npart);
    cudaMemcpy(acc, acc_d, npart * ndim * sizeof(float), cudaMemcpyDeviceToHost);

  • Slide 30/35

    NVIDIA Quadro FX 3700

    General information for device 0
      Name: Quadro FX 3700
      Compute capability: 1.1
      Clock rate: 1242000
      Device copy overlap: Enabled
      Kernel execution timeout: Disabled

    Memory information for device 0
      Total global mem: 536150016
      Total constant mem: 65536
      Max mem pitch: 262144
      Texture alignment: 256

    MP information for device 0
      Multiprocessor count: 14
      Shared mem per mp: 16384
      Registers per mp: 8192
      Threads in warp: 32
      Max threads per block: 512
      Max thread dimensions: (512, 512, 64)
      Max grid dimensions: (65535, 65535, 1)

  • Slide 31/35

    Dynamic & static libraries

    Most open source software is distributed in the form of source code, from which libraries are created for use. In general libraries are not portable.

    One of the most common problems a user faces is due to libraries not being linked.

    The most common way to create libraries is: source code → object code → library.

    gcc -c first.c
    gcc -c second.c
    ar rc libtest.a first.o second.o
    gcc -shared -Wl,-soname,libtest.so.0 -o libtest.so.0 first.o second.o -lc

    The above library can be used in the following way:
    gcc program.c -L/LIBPATH -ltest

    The dynamic library gets preference over the static one.
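    A minimal sketch of what the source files named in the commands above might contain (the function names here are hypothetical); program.c only needs the declarations, since the definitions come from libtest at link time:

    /* first.c */
    #include <stdio.h>
    void say_first(void)  { printf("hello from first.c\n"); }

    /* second.c */
    #include <stdio.h>
    void say_second(void) { printf("hello from second.c\n"); }

    /* program.c -- built with: gcc program.c -L/LIBPATH -ltest */
    void say_first(void);
    void say_second(void);

    int main(void)
    {
        say_first();
        say_second();
        return 0;
    }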

  • Slide 32/35

    Hyper-Threading

    Hyper-Threading Technology, used in Intel Xeon and Intel Pentium 4 processors, makes a single physical processor appear as two logical processors to the operating system.

    Hyper-Threading duplicates the architectural state on each processor, while sharing one set of execution resources.

    Sharing system resources, such as the cache or the memory bus, may degrade system performance; Hyper-Threading can improve the performance of some applications, but not all.

  • Slide 33/35

    Summary

    It is not easy to make a super-fast single processor, so multi-processor computing is the only way to get more computing power.

    When more than one processor (core) shares the same memory, shared memory programming is used, e.g., Pthreads, OpenMP, TBB, etc.

    Shared memory programming is fast and it is easy to get linear scaling, since communication is not an issue.

    When processors having their own memory are used for parallel computation, distributed memory programming is used, e.g., MPI, PVM.

    Distributed memory programming is the main way to solve large problems (when thousands of processors are needed).

    General Purpose Graphics Processing Units (GPGPUs) can provide very high performance at very low cost; however, programming is somewhat complicated and parallelism is limited to SIMD.

  • Slide 34/35

    Top 10

  • Slide 35/35

    Thank You !