
Slide 1/58

Basic Concepts in Parallelization

IWOMP 2010, CCS, University of Tsukuba
Tsukuba, Japan, June 14-16, 2010

Ruud van der Pas
Senior Staff Engineer, Oracle Solaris Studio, Oracle
Menlo Park, CA, USA

Slide 2/58

    Outline

    Introduction

    Parallel Architectures

    Parallel Programming Models

    Data Races

    Summary

Slide 3/58

    Introduction

Slide 4/58

Why Parallelization?

Parallelization is another optimization technique. The goal is to reduce the execution time. To this end, multiple processors, or cores, are used.

[Figure: execution time on 1 core vs. 4 cores. Using 4 cores, the execution time is 1/4 of the single-core time.]

Slide 5/58

What Is Parallelization?

"Something" is parallel if there is a certain level of independence in the order of operations. That "something" can be, in increasing order of granularity:

  A sequence of machine instructions
  A collection of program statements
  An algorithm
  The problem you're trying to solve

In other words, it doesn't matter in what order those operations are performed.

Slide 6/58

What is a Thread?

Loosely said, a thread consists of a series of instructions with its own program counter (PC) and state. A parallel program executes threads in parallel. These threads are then scheduled onto processors.

[Figure: four threads (Thread 0 through Thread 3), each an independent stream of instructions with its own program counter (PC), scheduled onto processors (P).]
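To make this concrete, here is a minimal POSIX Threads sketch in C (not from the slides; the thread count and names are illustrative). Each thread executes its own instruction stream, and the OS schedules the threads onto the available processors. Compile with, e.g., cc -pthread.

    /* Minimal sketch: create four threads, each running its own
       instruction stream; the OS schedules them onto processors. */
    #include <pthread.h>
    #include <stdio.h>

    void *work(void *arg)
    {
        long tid = (long) arg;    /* thread ID passed in by the creator */
        printf("Thread %ld is running\n", tid);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[4];
        for (long t = 0; t < 4; t++)    /* create 4 threads */
            pthread_create(&threads[t], NULL, work, (void *) t);
        for (long t = 0; t < 4; t++)    /* wait for all of them to finish */
            pthread_join(threads[t], NULL);
        return 0;
    }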

Slide 7/58

    Parallel overhead

The total CPU time often exceeds the serial CPU time:

  The newly introduced parallel portions in your program need to be executed
  Threads need time sending data to each other and synchronizing (communication)
    This is often the key contributor, spoiling all the fun
  Typically, things also get worse when increasing the number of threads

Efficient parallelization is about minimizing the communication overhead.

Slide 8/58

    Communication

[Figure: wallclock time of serial execution vs. parallel execution on 4 cores, with and without communication.]

Parallel, without communication: embarrassingly parallel, 4x faster; the wallclock time is 1/4 of the serial wallclock time.

Parallel, with communication: the additional communication makes it less than 4x faster and consumes additional resources; the wallclock time is more than 1/4 of the serial wallclock time; the total CPU time increases.

Slide 9/58

    Load balancing

[Figure: per-thread execution times under perfect load balancing vs. load imbalance; with imbalance, threads 1, 2 and 3 sit idle.]

Perfect load balancing: all threads finish in the same amount of time; no thread is idle.

Load imbalance: different threads need a different amount of time to finish their task; threads sit idle; the total wall clock time increases; the program does not scale well.

Slide 10/58

About scalability

Define the speed-up S(P) as S(P) := T(1)/T(P).

The efficiency E(P) is defined as E(P) := S(P)/P.

In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%.

Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve. Past this point, Popt, the application sees less and less benefit from adding processors.

Note that both metrics give no information on the actual run-time. As such, they can be dangerous to use.

In some cases, S(P) exceeds P. This is called "superlinear" behaviour. Don't count on this to happen, though.

[Figure: speed-up S(P) vs. number of processors P; the measured curve follows the ideal line up to Popt and then flattens.]
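As a small illustration (not from the slides; function names are illustrative), the two metrics translate directly into code:

    /* Minimal sketch: speed-up and efficiency from measured wallclock
       times, where t1 = T(1) and tp = T(P). */
    double speedup(double t1, double tp)           { return t1 / tp; }       /* S(P) = T(1)/T(P) */
    double efficiency(double t1, double tp, int p) { return (t1 / tp) / p; } /* E(P) = S(P)/P */

For example, speedup(100.0, 37.0) returns 2.70 and efficiency(100.0, 37.0, 4) about 0.68, the timings used in the worked example a few slides further on.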

Slide 11/58

    Amdahl's Law

    This "law' describes the effect the non-parallelizable part of a

    program has on scalability Note that the additional overhead caused by parallelization and

    speed-up because of cache effects are not taken into account

    Assume our program has a parallel fractionf

    This implies the execution timeT(1) := f*T(1) + (1-f)*T(1)

    On P processors:T(P) = (f/P)*T(1) + (1-f)*T(1)

    Amdahl's law:

    S(P) = T(1)/T(P) = 1 / (f/P + 1-f)

    Comments:

Slide 12/58

    Amdahl's Law

It is easy to scale on a small number of processors. Scalable performance, however, requires a high degree of parallelization, i.e. f very close to 1. This implies that you need to parallelize that part of the code where the majority of the time is spent.

[Figure: speed-up as a function of the number of processors (1-16), plotted for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00.]

Slide 13/58

    Amdahl's Law in practice

We can estimate the parallel fraction f. Recall:

    T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for f:

    f = (1 - T(P)/T(1)) / (1 - 1/P)

Example: T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
f = (1 - 37/100)/(1 - 1/4) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:

    T(8) = (0.84/8)*100 + (1 - 0.84)*100 = 26.5
    S(8) = T(1)/T(8) = 3.78
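The calculation on this slide is easy to script; here is a sketch in C with the values from the example hard-coded:

    /* Estimate the parallel fraction f from T(1) and T(4), then use
       Amdahl's law to predict the run time and speed-up on 8 processors. */
    #include <stdio.h>

    int main(void)
    {
        double t1 = 100.0, t4 = 37.0;                   /* measured times */
        double f  = (1.0 - t4/t1) / (1.0 - 1.0/4.0);    /* f = 0.84 */
        double t8 = (f/8.0)*t1 + (1.0 - f)*t1;          /* T(8) = 26.5 */
        printf("f = %.2f  T(8) = %.1f  S(8) = %.2f\n", f, t8, t1/t8);
        return 0;
    }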

Slide 14/58

    Threads Are Getting Cheap .....

[Figure: elapsed time and speed-up as a function of the number of cores (1-16) for f = 84%. Using 8 cores: 100/26.5 = 3.8x faster.]

Slide 15/58

    Numerical results

Consider: A = B + C + D + E

Serial processing:

    A = B + C
    A = A + D
    A = A + E

Parallel processing (two threads):

    Thread 0: T1 = B + C    Thread 1: T2 = D + E
    Thread 0: T1 = T1 + T2

The roundoff behaviour is different, and so the numerical results may be different too. This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....
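A small self-contained sketch (the input values are chosen for illustration; they are not from the slides) shows how the two evaluation orders can round differently in IEEE single precision:

    #include <stdio.h>

    int main(void)
    {
        float B = 1.0f, C = 1.0e8f, D = -1.0e8f, E = 1.0f;

        /* Serial order: A = ((B + C) + D) + E */
        float A = B + C;    /* 1 + 1e8 rounds to 1e8 in single precision */
        A = A + D;          /* 0 */
        A = A + E;          /* 1 */

        /* Parallel order: thread 0 computes T1 = B + C, thread 1
           computes T2 = D + E, then T1 = T1 + T2 */
        float T1 = B + C;   /* 1e8 */
        float T2 = D + E;   /* -1e8 + 1 rounds back to -1e8 */
        T1 = T1 + T2;       /* 0 */

        printf("serial: %g  parallel: %g\n", A, T1);
        return 0;
    }

The serial order prints 1, the parallel order prints 0; both are valid floating-point results, they merely round differently.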

Slide 16/58

    Cache Coherence

Slide 17/58

    Typical cache based system

[Figure: a CPU connected to memory through an L1 and an L2 cache; what happens when cached data is modified?]

Slide 18/58

Cache line modifications

[Figure: a page in main memory, with one of its cache lines held in several caches; modifying one copy creates a cache line inconsistency.]

Slide 19/58

    Caches in a parallel computer

A cache line starts in memory. Over time, multiple copies of this line may exist.

Cache Coherence ("cc"):

  Tracks changes in the copies
  Makes sure the correct cache line is used
  Different implementations are possible
  Hardware support is needed to make it efficient

[Figure: CPUs 0-3 with their caches and memory; when one CPU modifies a line, the stale copies in the other caches are invalidated.]

Slide 20/58

    Parallel Architectures

Slide 21/58

    Uniform Memory Access (UMA)

Also called "SMP" (Symmetric Multi-Processor)

Memory access time is uniform for all CPUs

A CPU can be multicore

The interconnect (bus or crossbar) is "cc" (cache coherent)

No fragmentation: memory and I/O are shared resources

Pro: easy to use and to administer; efficient use of resources
Con: said to be expensive; said to be non-scalable

[Figure: several CPUs, each with a cache, on a cache coherent interconnect that connects them to shared memory and I/O.]

Slide 22/58

    NUMA

Also called "Distributed Memory" or NORMA (No Remote Memory Access)

Memory access time is non-uniform; hence the name "NUMA"

The interconnect is not "cc": Ethernet, InfiniBand, etc.

Runs 'N' copies of the OS

Memory and I/O are distributed resources

Pro: said to be cheap; said to be scalable
Con: difficult to use and administer; inefficient use of resources

[Figure: several nodes, each with a CPU, cache, memory (M) and I/O, connected by a non-coherent interconnect.]

Slide 23/58

    The Hybrid Architecture

SMP/multicore nodes are connected by a second-level interconnect. This second-level interconnect is not cache coherent: Ethernet, InfiniBand, etc.

This is a hybrid architecture with all the corresponding pros and cons:

  UMA within one SMP/multicore node
  NUMA across nodes

[Figure: two SMP/MC nodes connected by a second-level interconnect.]

Slide 24/58

    cc-NUMA

Two-level interconnect:

  UMA/SMP within one system
  NUMA between the systems

Both interconnects support cache coherence, i.e. the system is fully cache coherent.

Has all the advantages ('look and feel') of an SMP. The downside is the non-uniform memory access time.

[Figure: SMP systems connected by a cache coherent second-level interconnect.]

Slide 25/58

    Parallel Programming Models

Slide 26/58

    How To Program A Parallel Computer?

There are numerous parallel programming models. The ones most well-known are:

Distributed Memory (runs on a cluster and/or a single SMP):

  Sockets (standardized, low level)
  PVM - Parallel Virtual Machine (obsolete)
  MPI - Message Passing Interface (de-facto standard)

Shared Memory (SMP only):

  Posix Threads (standardized, low level)
  OpenMP (de-facto standard)
  Automatic Parallelization (the compiler does it for you)

Slide 27/58

Parallel Programming Models: Distributed Memory - MPI

Slide 28/58

    What is MPI?

MPI stands for the Message Passing Interface.

MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster). An MPI program can however also be executed on a shared memory system.

First specification: 1994. The current MPI-2 specification was released in 1997. Major enhancements:

  Remote memory management
  Parallel I/O
  Dynamic process management

Slide 29/58

    More about MPI

MPI has its own data types (e.g. MPI_INT); user defined data types are supported as well.

MPI supports C, C++ and Fortran: include file mpi.h in C/C++ and mpif.h in Fortran.

An MPI environment typically consists of at least:

  A library implementing the API
  A compiler and linker that support the library
  A run time environment to launch an MPI program

Various implementations are available: HPC ClusterTools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....


Slide 31/58

    The MPI Execution Model

[Figure: all processes start; each calls MPI_Init to enter the MPI environment; the processes then run and communicate with each other; each calls MPI_Finalize; all processes finish.]


Slide 32/58

    The MPI Memory Model

[Figure: processes (P), each with its own private memory, exchanging data through an explicit transfer mechanism.]

All threads/processes have access to their own, private, memory only.

Data transfer and most synchronization has to be programmed explicitly.

All data is private; data is shared explicitly by exchanging buffers.


Slide 33/58

    Example - Send N Integers

    #include <mpi.h>    /* include file */

    #define N 1000      /* number of integers to transfer (value illustrative) */

    int main(int argc, char *argv[])
    {
        int me, you, him, error_code;
        int data_buffer[N];

        you = 0; him = 1;

        MPI_Init(&argc, &argv);                 /* initialize MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);     /* get process ID */

        if ( me == 0 ) {                        /* process 0 sends */
            error_code = MPI_Send(&data_buffer, N, MPI_INT,
                                  him, 1957, MPI_COMM_WORLD);
        } else if ( me == 1 ) {                 /* process 1 receives */
            error_code = MPI_Recv(&data_buffer, N, MPI_INT,
                                  you, 1957, MPI_COMM_WORLD,
                                  MPI_STATUS_IGNORE);
        }

        MPI_Finalize();                         /* leave the MPI environment */
        return 0;
    }
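A typical way to build and run this example with two processes (the exact commands depend on the MPI implementation; the file name is illustrative):

    mpicc send_n.c -o send_n
    mpirun -np 2 ./send_n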


Slide 34/58

Run time Behavior

Process 0 (me = 0) calls MPI_Send: N integers, destination = him = 1, label = 1957.

Process 1 (me = 1) calls MPI_Recv: N integers, sender = you = 0, label = 1957.

The envelopes match. Yes! Connection established.


Slide 36/58

Parallel Programming Models: Shared Memory - Automatic Parallelization


Slide 38/58

Parallel Programming Models: Shared Memory - OpenMP

Slide 39/58

    http://www.openmp.org

[Figure: the OpenMP shared memory model: threads 0 through P, each with its own memory (M), connected to a common shared memory.]

Slide 40/58

    Shared Memory Model

[Figure: threads (T), each with private data, all attached to a common shared memory.]

All threads have access to the same, globally shared, memory.

Data can be shared or private:

  Shared data is accessible by all threads
  Private data can only be accessed by the thread that owns it

Data transfer is transparent to the programmer. Synchronization takes place, but it is mostly implicit.


Slide 41/58

    About data

In a shared memory parallel program, variables have a "label" attached to them:

Labelled "Private": visible to one thread only. A change made to local data is not seen by the others. Example: local variables in a function that is executed in parallel.

Labelled "Shared": visible to all threads. A change made to global data is seen by all the others. Example: global data.


Slide 42/58

Example - Matrix times vector

    for (i=0; i<m; i++)
    {
        sum = 0.0;
        for (j=0; j<n; j++)
            sum += b[i][j]*c[j];
        a[i] = sum;
    }

With two threads, the iterations of the i-loop are split over the threads, e.g. TID = 0 executes i = 0,1,2,3,4 and TID = 1 executes i = 5,6,7,8,9:

    TID = 0                    TID = 1
    i = 0                      i = 5
    sum = b[i=0][j]*c[j]       sum = b[i=5][j]*c[j]
    a[0] = sum                 a[5] = sum
    i = 1                      i = 6
    sum = b[i=1][j]*c[j]       sum = b[i=6][j]*c[j]
    a[1] = sum                 a[6] = sum
    ... etc ...
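A complete OpenMP version of this example might look as follows (a sketch with illustrative sizes and test data; the slides show only the loop itself). Compile with OpenMP enabled, e.g. gcc -fopenmp.

    #include <stdio.h>
    #define M 10
    #define N 10

    int main(void)
    {
        double a[M], b[M][N], c[N];
        int i, j;

        for (i = 0; i < M; i++)             /* some test data */
            for (j = 0; j < N; j++)
                b[i][j] = i + j;
        for (j = 0; j < N; j++)
            c[j] = 1.0;

        /* The iterations of the i-loop are independent, so they can be
           divided over the threads; i and sum are private, j is made
           private explicitly, and a, b, c are shared. */
        #pragma omp parallel for private(j)
        for (i = 0; i < M; i++) {
            double sum = 0.0;
            for (j = 0; j < N; j++)
                sum += b[i][j] * c[j];
            a[i] = sum;
        }

        for (i = 0; i < M; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }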

Slide 43/58

    A Black and White comparison

    MPI                                   | OpenMP
    --------------------------------------|----------------------------------
    De-facto standard                     | De-facto standard
    Endorsed by all key players           | Endorsed by all key players
    Runs on any number of (cheap) systems | Limited to one (SMP) system
    Grid Ready                            | Not (yet?) Grid Ready
    High and steep learning curve         | Easier to get started (but, ...)
    You're on your own                    | Assistance from the compiler
    All or nothing model                  | Mix and match model
    No data scoping (shared, private, ..) | Requires data scoping
    More widely used (but ....)           | Increasingly popular (CMT !)
    Sequential version is not preserved   | Preserves the sequential code
    Requires a library only               | Needs a compiler
    Requires a run-time environment       | No special environment
    Easier to understand performance      | Performance issues implicit

Slide 44/58

The Hybrid Parallel Programming Model

Slide 45/58

    The Hybrid Programming Model

[Figure: two shared memory nodes, each with threads 0 through P and their memories (M) on a common shared memory, connected to each other through distributed memory.]

Slide 46/58

    Data Races


Slide 47/58

    About Parallelism

Parallelism = Independence = No Fixed Ordering

"Something" that does not obey this rule is not parallel (at that level ...)

    Shared Memory Programming

Slide 48/58

    Shared Memory Programming

    Threads communicate via shared memory

[Figure: threads (T), each with private data, communicating by reading and writing the variables X and Y in shared memory.]

Slide 49/58

What is a Data Race?

A data race occurs when two different threads in a multi-threaded shared memory program:

  Access the same (= shared) memory location
  Asynchronously, and
  Without holding any common exclusive locks, and
  At least one of the accesses is a write/store

Slide 50/58

Example of a data race

    #pragma omp parallel shared(n)
    {
        n = omp_get_thread_num();
    }

[Figure: two threads, each with private data, both write (W) the shared variable n in shared memory.]

Slide 51/58

Another example

    #pragma omp parallel shared(x)
    {
        x = x + 1;
    }

[Figure: both threads read and write (R/W) the shared variable x in shared memory.]
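The following runnable sketch (the surrounding harness is assumed, not from the slides) makes this race visible, and also shows one way to protect the update, using the OpenMP atomic construct. Compile with e.g. gcc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int nthreads = omp_get_max_threads();
        int x = 0, y = 0;

        #pragma omp parallel shared(x)
        {
            for (int i = 0; i < 100000; i++)
                x = x + 1;          /* unprotected read-modify-write: data race */
        }

        #pragma omp parallel shared(y)
        {
            for (int i = 0; i < 100000; i++) {
                #pragma omp atomic  /* protected update: no data race */
                y = y + 1;
            }
        }

        /* x is often less than nthreads * 100000; y always equals it */
        printf("threads %d: x = %d, y = %d, expected %d\n",
               nthreads, x, y, nthreads * 100000);
        return 0;
    }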

Slide 52/58

About Data Races

Loosely described, a data race means that the update of a shared variable is not well protected.

A data race tends to show up in a nasty way:

  Numerical results are (somewhat) different from run to run
  Especially with floating-point data, this is difficult to distinguish from a numerical side-effect
  Changing the number of threads can cause the problem to seemingly (dis)appear
  It may also depend on the load on the system
  It may only show up when using many threads

Slide 54/58

Not a parallel loop

    for (i=0; i<n-1; i++)
        a[i] = a[i+1] + b[i];

Each iteration reads a[i+1] before iteration i+1 overwrites it, so the iterations cannot be executed in an arbitrary order: this is a loop-carried data dependence.

Slide 55/58

We manually parallelized the previous loop. The compiler detects the data dependence and does not parallelize the loop.

Vectors a and b are of type integer.

We use the checksum of a as a measure for correctness:

    checksum += a[i]   for i = 0, 1, 2, ...., n-2

The correct, sequential, checksum result is computed as a reference.

We ran the program using 1, 2, 4, 32 and 48 threads. Each of these experiments was repeated 4 times.
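The slides do not include the full test program; a sketch of the experiment could look like this (array size and data are illustrative, and the loop is the reconstructed one from the previous slide):

    #include <omp.h>
    #include <stdio.h>
    #define N 64

    int main(void)
    {
        int a[N], b[N], i, checksum = 0, ref = 0;

        /* Sequential reference result */
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 1; }
        for (i = 0; i < N-1; i++) a[i] = a[i+1] + b[i];
        for (i = 0; i < N-1; i++) ref += a[i];

        /* Reset, then force the dependent loop to run in parallel; at a
           chunk boundary a thread may read an a[i+1] that has already
           been overwritten by the neighbouring thread */
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 1; }
        #pragma omp parallel for
        for (i = 0; i < N-1; i++) a[i] = a[i+1] + b[i];
        for (i = 0; i < N-1; i++) checksum += a[i];

        printf("threads: %d checksum %d correct value %d\n",
               omp_get_max_threads(), checksum, ref);
        return 0;
    }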

Slide 56/58

Numerical results

    Threads | Checksums from 4 runs  | Correct value
    --------|------------------------|--------------
       1    | 1953, 1953, 1953, 1953 | 1953
       2    | 1953, 1953, 1953, 1953 | 1953
       4    | 1905, 1905, 1953, 1937 | 1953
      32    | 1525, 1473, 1489, 1513 | 1953
      48    |  936, 1007,  887,  822 | 1953

Data Race In Action!

Slide 57/58

    Summary

Slide 58/58

Parallelism Is Everywhere

Multiple levels of parallelism:

    Granularity       | Technology  | Programming Model
    ------------------|-------------|----------------------
    Instruction Level | Superscalar | Compiler
    Chip Level        | Multicore   | Compiler, OpenMP, MPI
    System Level      | SMP/cc-NUMA | Compiler, OpenMP, MPI
    Grid Level        | Cluster     | MPI

    Threads Are Getting Cheap