Computational Methods in Physics (PHYS 3437)
Computational Methods in Physics
PHYS 3437
Dr Rob Thacker
Dept of Astronomy & Physics (MM-301C)
[email protected]
Today's Lecture
- Introduction to parallel programming
- Concepts: what are parallel computers, what is parallel programming?
- Why do you need to use parallel programming?
- When parallelism will be beneficial
- Amdahl's Law
- Very brief introduction to OpenMP
Why bother to teach this in an undergrad Physics course?
- Because parallel computing is now ubiquitous
- Most laptops are parallel computers, for example
  - Dual/quad core chips are already standard; in the future we can look forward to 8/16/32/64 cores per chip!
  - Actually, Sun Microsystems already sells a chip with 8 cores
  - I predict that by 2012 you will be buying chips with 16 cores
- If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time
  - Such codes are said to be parallel
- Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively
  - Because not many people have these skills!
- If you are interested, an excellent essay on how computing is changing can be found here: http://view.eecs.berkeley.edu/wiki/Main_Page
Some caveats
- In two lectures we cannot cover very much on parallel computing
- We will concentrate on the simplest kind of parallel programming
  - It exposes some of the inherent problems
  - It still gives you useful increased performance
- Remember, making a code run 10 times faster turns a week into a day!
- The type of programming we'll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common
Why can't the compiler just make my code parallel for me?
- In some situations it can, but most of the time it can't
  - You really are smarter than a compiler is!
  - There are many situations where a compiler will not be able to make something parallel but you can
  - Compilers that can attempt to parallelize code are called "auto-parallelizing"
- Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized
  - These have proven not to be very popular and are too restrictive
- At present, the most popular way of parallel programming is to add additional commands to your original code
  - These commands are sometimes called pragmas or directives
Recap: von Neumann architecture
- First practical stored-program architecture
- Still in use today
- Speed is limited by the bandwidth of data between memory and processing unit: the "von Neumann" bottleneck
[Diagram: a CPU, consisting of a control unit and a process unit, connected to memory (program memory and data memory) and to input/output]
- Developed while working on the EDVAC design
- Machine instructions are encoded in binary & stored: the key insight!
Shared memory computers

[Diagram: several CPUs connected via a shared bus to a single MEMORY]

- Traditional shared memory design: all processors share a memory bus
- All of the processors see the same memory locations, which means that programming these computers is reasonably straightforward
- Sometimes called "SMP"s, for symmetric multi-processor
- Program these computers using "OpenMP" extensions to C, FORTRAN
Distributed memory computers

[Diagram: several CPUs, each with its own MEMORY, connected by a NETWORK]

- Really a collection of computers linked together via a network
- Each processor has its own memory and must communicate with other processors over the network to get information from other memory locations; this is really quite difficult at times
- This is the architecture of "computer clusters" (you could actually have each "CPU" here be a shared memory computer)
- Program these computers using MPI or PVM extensions to C, FORTRAN
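To give a flavour of the explicit communication involved, here is a minimal MPI sketch (my illustration, not part of the original lecture; the variable names and values are arbitrary) in which process 0 sends one number to process 1:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?   */

        float x = 0.0f;
        if (rank == 0) {
            x = 3.14f;                           /* value lives in rank 0's memory only */
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 cannot see rank 0's memory, so the value must arrive over the network */
            MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received x = %f\n", x);
        }
        MPI_Finalize();
        return 0;
    }

Every such exchange must be written out by hand, which is why distributed-memory codes tend to be much longer than their shared-memory equivalents.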
Parallel execution
- What do we mean by being able to do things in parallel?
- Suppose the input data of an operation is divided into a series of independent parts
- Processing of the parts is carried out independently
- A simple example is operations on vectors/arrays where we loop over array indices, as in the split shown below
ARRAY A(i), with the loop split across three tasks:

    Task 1:              Task 2:              Task 3:
    do i=1,10            do i=11,20           do i=21,30
      a(i)=a(i)*2.         a(i)=a(i)*2.         a(i)=a(i)*2.
    end do               end do               end do
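The same splitting can be written down generically. Here is a minimal C sketch (my illustration; the helper name task_range is invented for this example) of how N iterations are divided as evenly as possible among a number of tasks:

    /* Compute the 1-based index range [start, end] handled by task t
       (t = 0 .. ntasks-1) when N iterations are split over ntasks tasks. */
    void task_range(int N, int ntasks, int t, int *start, int *end)
    {
        int chunk = N / ntasks;    /* base number of iterations per task */
        int extra = N % ntasks;    /* leftover iterations                */
        /* the first 'extra' tasks each take one additional iteration */
        *start = t * chunk + (t < extra ? t : extra) + 1;
        *end   = *start + chunk - 1 + (t < extra ? 1 : 0);
    }

With N=30 and ntasks=3 this reproduces the ranges 1-10, 11-20 and 21-30 above; because each task touches a disjoint part of the array, no coordination is needed while the loops run.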
Some subtleties
- However, you can't always do this. Consider:

    do i=2,n
      a(i)=a(i-1)
    end do

- This kind of loop has what we call a dependence
- If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor
- We'll talk a little more about this later, but it does mean that not every loop can be "parallelized" (a small demonstration follows)
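To see the dependence concretely, here is a small C sketch (my illustration, using the reversed-loop test that reappears later in the lecture): running the same loop in a different order changes the answer, which signals that the iterations are not independent.

    #include <stdio.h>
    #define N 5

    int main(void)
    {
        int a[N] = {1, 2, 3, 4, 5};
        int b[N] = {1, 2, 3, 4, 5};

        /* forward: each element copies the freshly UPDATED neighbour,
           so the first value propagates through the whole array */
        for (int i = 1; i < N; ++i) a[i] = a[i-1];

        /* reversed: each element copies its neighbour's ORIGINAL value */
        for (int i = N-1; i >= 1; --i) b[i] = b[i-1];

        for (int i = 0; i < N; ++i)
            printf("a[%d]=%d  b[%d]=%d\n", i, a[i], i, b[i]);
        /* prints a = 1 1 1 1 1 but b = 1 1 2 3 4:
           the result depends on iteration order, so there is a dependence */
        return 0;
    }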
Issues to be aware of
- Parallel computing is not about being "cool" and doing lots and lots of "flops"
  - Flops = floating point operations per second
- We want solutions to problems in a reasonable amount of time
- Sometimes that means doing a lot of calculations, e.g. consider what we found about the number of collisions for molecules in air
- Gains from algorithmic improvements will often swamp hardware improvements
- Don't be brain-limited: if there is a better algorithm, use it
Algorithmic improvements in n-body simulations

[Figure: improvements in the speed of algorithms are proportionally better than the speed increase of computers over the same time interval]
Identifying Performance Desires
(a checklist: positive preconditions suggest parallelization will pay off, negative ones that it won't)
- Frequency of use: daily (positive) ... monthly ... yearly (negative)
- Code evolution timescale: hundreds of executions between changes (positive) vs. changes each run (negative)
Performance Characteristics
- Execution time: days (positive) ... hours ... minutes (negative)
- Level of synchronization: none (positive) ... infrequent (every minute) ... frequent (many per second) (negative)
Data and Algorithm
- Data structures: regular, static (positive) vs. irregular, dynamic (negative)
- Algorithmic complexity*: simple (positive) vs. complex (negative)
  *approximately the number of stages
Requirements
- Must significantly increase resolution/length of integration (positive)
- Need a factor of 2 increase (intermediate)
- Current resolution meets needs (negative)
How much speed-up can we achieve?
- Some parts of a code cannot be run in parallel
  - For example the loop over a(i)=a(i-1) from earlier
- Any code that cannot be executed in parallel is said to be serial or sequential
- Let's suppose that, in terms of the total execution time of a program, a fraction f_s has to be run in serial, while f_p can be run in parallel on n CPUs
- Equivalently, the time spent in each fraction will be t_s and t_p, so the total time on 1 CPU is t_1cpu = t_s + t_p
- If we can run the parallel fraction on n CPUs then it will take a time t_p/n
- The total time will then be t_ncpu = t_s + t_p/n
Amdahl's Law
- How much speed-up (S_n = t_1cpu/t_ncpu) is feasible?
- Amdahl's Law is the most significant limit
- Given our previous results and n processors, the maximum speed-up is given by:
    S_n = t_1cpu/t_ncpu = (t_s + t_p)/(t_s + t_p/n) = n/(1 + (n-1)f_s)

- Only if the serial fraction f_s (= t_s/(t_s + t_p)) is zero is perfect speed-up possible (at least in theory)
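As a quick worked example (my numbers, chosen to match the curves on the next slide): with a serial fraction f_s = 0.1 on n = 100 CPUs,

    S_100 = 100/(1 + 99 x 0.1) = 100/10.9 ≈ 9.2

so even 100 processors give less than a 10x speed-up; indeed, as n grows without bound, S_n approaches 1/f_s, which is 10 here.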
Amdahl's Law

[Figure: speed-up S_n versus N_cpu (1 to 120) for f_s = 0.1, 0.01 and 0.001. At small N_cpu the scaling is similar for the different f_s; to keep scaling near N_cpu ~ 100 you have to achieve excellent parallelism, i.e. a very small f_s]
What is OpenMP?
- OpenMP is a "pragma"-based "application programmer interface" (API) that provides a simple extension to C/C++ and FORTRAN
  - A pragma is just a fancy word for an instruction to the compiler
- It is exclusively designed for shared memory programming
- Ultimately, OpenMP is a very simple interface to something called threads-based programming
- What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel
Threads-based execution

[Diagram: a master thread runs the serial sections; at each parallel section extra threads are forked, and they join the master thread again before the next serial section]

- Serial execution, interspersed with parallel sections
- In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the "fork-join" operation
Some background to threads programming
- There is actually an entire set of commands in C to allow you to create threads
- You could, if you wanted, program with these commands directly
  - The most common thread standard is called POSIX
- However, OpenMP provides a simple interface to a lot of the functionality provided by threads
- If it is simple, and does what you need, why bother going to the effort of using threads programming? (compare the sketch below)
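For comparison, here is roughly what the earlier array-doubling loop looks like written directly against the POSIX threads interface (a minimal sketch of mine, not from the lecture; it assumes N is a multiple of NTHREADS). A single OpenMP directive replaces all of this bookkeeping:

    #include <pthread.h>

    #define N        30
    #define NTHREADS 3

    static float a[N];

    /* each thread doubles its own contiguous chunk of the array */
    static void *double_chunk(void *arg)
    {
        int t     = *(int *)arg;      /* thread index 0 .. NTHREADS-1    */
        int chunk = N / NTHREADS;     /* assumes N divisible by NTHREADS */
        for (int i = t * chunk; i < (t + 1) * chunk; ++i)
            a[i] = a[i] * 2.0f;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int       arg[NTHREADS];

        for (int t = 0; t < NTHREADS; ++t) {      /* fork */
            arg[t] = t;
            pthread_create(&tid[t], NULL, double_chunk, &arg[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)        /* join */
            pthread_join(tid[t], NULL);
        return 0;
    }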
Components of OpenMP
- Directives (pragmas in your code)
- Runtime library routines (supplied with the compiler)
- Environment variables (set at the Unix prompt)
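All three components can appear in one tiny program. A minimal sketch (mine, not from the lecture; omp_get_thread_num and omp_get_num_threads are standard OpenMP runtime routines):

    /* environment variable, set at the Unix prompt before running:
     *     export OMP_NUM_THREADS=4
     */
    #include <stdio.h>
    #include <omp.h>                       /* runtime library routines */

    int main(void)
    {
    #pragma omp parallel                   /* directive (pragma in your code) */
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(),   /* runtime routine: my thread id */
                   omp_get_num_threads()); /* runtime routine: team size    */
        }
        return 0;
    }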
OpenMP: Where did it come from?
- Prior to 1997, vendors all had their own proprietary shared memory programming commands
  - Programs were not portable from one SMP to another
- Researchers were calling for some kind of portability
- The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard but ultimately failed
- OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory marketplace
- It is very hard for researchers to get new languages supported now; you must have backing from computer vendors!
Bottom line
- For OpenMP & shared memory programming in general, one only has to worry about parallelism of work
  - This is because all the processors in a shared-memory computer can see all the same memory locations
- On distributed-memory computers one has to worry both about parallelism of the work and also the placement of data
  - Is the value I need in the memory of another processor?
- Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer; it can be highly non-trivial
  - Although it can be easy; it depends on the algorithm
First Steps
- Loop level parallelism is the simplest and easiest way to use OpenMP
  - Take each do loop and make it parallel (if possible)
  - It allows you to slowly build up parallelism within your application
- However, not all loops are immediately parallelizable, due to dependencies
Loop Level Parallelism
- Consider the single precision vector add-multiply operation Y = aX + Y ("SAXPY")

FORTRAN:

    do i=1,n
      Y(i)=a*X(i)+Y(i)
    end do

    C$OMP PARALLEL DO
    C$OMP& DEFAULT(NONE)
    C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
          do i=1,n
            Y(i)=a*X(i)+Y(i)
          end do

C/C++:

    for (i=1;i<=n;++i) {
      Y[i]+=a*X[i];
    }

    #pragma omp parallel for \
      private(i) shared(X,Y,n,a)
    for (i=1;i<=n;++i) {
      Y[i]+=a*X[i];
    }
In more detail

    C$OMP PARALLEL DO
    C$OMP& DEFAULT(NONE)
    C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
          do i=1,n
            Y(i)=a*X(i)+Y(i)
          end do

- C$OMP PARALLEL DO denotes that this is a region of code for parallel execution
- DEFAULT(NONE) is good programming practice: you must declare the nature of all variables
- Thread PRIVATE variables: each thread must have its own copy of the variable (in this case i is the only private variable)
- Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously
- C$OMP is the comment pragma for FORTRAN; the ampersand is necessary for continuation
A quick note
- To be fully lexically correct you may want to include a closing C$OMP END PARALLEL DO
- In f90 programs use !$OMP as the sentinel
- Notice that the sentinels mean that the OpenMP commands look like comments
  - A compiler with OpenMP compatibility turned on will see the commands after the sentinel
- This means you can still compile the code on computers that don't have OpenMP
How the compiler handles OpenMP
- When you compile an OpenMP code you need to add "flags" to the compile line, e.g.

    f77 -openmp -o myprogram myprogram.f

- Unfortunately different compilers have different commands for turning on OpenMP support; the above will work on Sun machines
- When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads
  - This is all transparent to you, though
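For reference (my addition; check your own compiler's manual), the GNU compilers use the flag -fopenmp:

    gfortran -fopenmp -o myprogram myprogram.f
    gcc -fopenmp -o myprogram myprogram.c

Without the flag the same source simply compiles as a serial program, since the directives look like comments (FORTRAN) or are ignored pragmas (C).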
Requirements for parallel loops
- To divide up the work, the compiler needs to know the number of iterations to be executed: the trip count must be computable
- The loop must also not exhibit any of the dependencies we mentioned
  - We'll review this more in the next lecture
  - Actually, a good test for dependencies is running the loop from n to 1 rather than 1 to n: if you get a different answer, that suggests there are dependencies (this is the test demonstrated in the earlier sketch)
- DO WHILE is not parallelizable using these directives
  - There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don't have time to cover that
- The loop can only have one exit point, therefore BREAKs or GOTOs are not allowed
Performance limitations
- Each time you start and end a parallel loop there is an overhead associated with the threads
- These overheads must always be added to the time taken to calculate the loop itself
- Therefore there is a limit on the smallest loop size that will achieve speed-up
- In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing
  - A good rule of thumb is that any thread should have at least 1000 floating point operations
- Thus small loops are simply not worth the bother!
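To make the rule of thumb concrete (my illustration; t_o and t_w are machine-dependent quantities introduced here, not lecture values): let t_o be the fork-join overhead and t_w the time per loop iteration. Running N iterations on n threads only pays off when

    t_o + N*t_w/n < N*t_w,  i.e.  N > t_o/(t_w*(1 - 1/n))

If the fork-join overhead costs the equivalent of a few thousand floating point operations, this reproduces the ~5000-flop threshold quoted above.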
Summary
- Shared memory parallel computers can be programmed using the OpenMP extensions to C, FORTRAN
  - Distributed memory computers require a different parallel language
- The easiest way to use OpenMP is to make loops parallel by dividing work up among threads
  - The compiler handles most of the difficult parts of coding
- However, not all loops are immediately parallelizable
  - Dependencies may prevent parallelization
- Loops are made to run in parallel by adding directives ("pragmas") to your code
  - These directives appear to be comments to ordinary compilers
Next Lecture
- More details on dependencies and how we can deal with them