Computational Methods in Physics (PHYS 3437)
Computational Methods in Physics
PHYS 3437
Dr Rob Thacker
Dept of Astronomy & Physics (MM-301C)
[email protected]
Today's Lecture
- Introduction to parallel programming
- Concepts: what are parallel computers, what is parallel programming?
- Why do you need to use parallel programming?
- When parallelism will be beneficial
- Amdahl's Law
- Very brief introduction to OpenMP
Why bother to teach this in an undergrad Physics course?
- Because parallel computing is now ubiquitous
- Most laptops are parallel computers, for example
  - Dual/quad core chips are already standard; in the future we can look forward to 8/16/32/64 cores per chip!
  - Actually, Sun Microsystems already sells a chip with 8 cores
  - I predict that by 2012 you will be buying chips with 16 cores
- If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time
  - Such codes are said to be parallel
- Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively
  - Because not many people have these skills!
- If you are interested, an excellent essay on how computing is changing can be found here: http://view.eecs.berkeley.edu/wiki/Main_Page
Some caveats
- In two lectures we cannot cover very much on parallel computing
- We will concentrate on the simplest kind of parallel programming
  - It exposes some of the inherent problems
  - It still gives you useful increased performance
- Remember, making a code run 10 times faster turns a week into a day!
- The type of programming we'll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common
Why can't the compiler just make my code parallel for me?
- In some situations it can, but most of the time it can't
  - You really are smarter than a compiler is!
  - There are many situations where a compiler will not be able to make something parallel but you can
  - Compilers that can attempt to parallelize code are called "auto-parallelizing"
- Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized
  - These have proven not to be very popular and are too restrictive
- At present, the most popular way of parallel programming is to add additional commands to your original code
  - These commands are sometimes called pragmas or directives
Recap: von Neumann architecture
- First practical stored-program architecture
- Still in use today
- Speed is limited by the bandwidth of data between memory and processing unit: the "von Neumann" bottleneck
[Diagram: a CPU, consisting of a control unit and a process unit, connected to memory (program memory and data memory) and to input/output]
- Developed while working on the EDVAC design
- Machine instructions are encoded in binary & stored: the key insight!
Shared memory computers

[Diagram: several CPUs connected via a shared bus to a single MEMORY]

- Traditional shared memory design: all processors share a memory bus
- All of the processors see the same memory locations, which means that programming these computers is reasonably straightforward
- Sometimes called "SMP"s, for symmetric multi-processor
- Program these computers using "OpenMP" extensions to C, FORTRAN
Distributed memory computers

[Diagram: several CPUs, each with its own MEMORY, connected by a NETWORK]

- Really a collection of computers linked together via a network
- Each processor has its own memory and must communicate with other processors over the network to get information from other memory locations; this is really quite difficult at times
- This is the architecture of "computer clusters" (you could actually have each "CPU" here be a shared memory computer)
- Program these computers using MPI or PVM extensions to C, FORTRAN
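To give a flavour of the explicit communication involved, here is a minimal MPI sketch (my illustration, not part of the original lecture; the variable names and values are arbitrary) in which process 0 sends one number to process 1:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?   */

        float x = 0.0f;
        if (rank == 0) {
            x = 3.14f;                           /* value lives in rank 0's memory only */
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 cannot see rank 0's memory, so the value must arrive over the network */
            MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received x = %f\n", x);
        }
        MPI_Finalize();
        return 0;
    }

Every such exchange must be written out by hand, which is why distributed-memory codes tend to be much longer than their shared-memory equivalents.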
Parallel execution
- What do we mean by being able to do things in parallel?
- Suppose the input data of an operation is divided into a series of independent parts
- Processing of the parts is carried out independently
- A simple example is operations on vectors/arrays where we loop over array indices, as in the split shown below
ARRAY A(i), with the loop split across three tasks:

    Task 1:              Task 2:              Task 3:
    do i=1,10            do i=11,20           do i=21,30
      a(i)=a(i)*2.         a(i)=a(i)*2.         a(i)=a(i)*2.
    end do               end do               end do
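The same splitting can be written down generically. Here is a minimal C sketch (my illustration; the helper name task_range is invented for this example) of how N iterations are divided as evenly as possible among a number of tasks:

    /* Compute the 1-based index range [start, end] handled by task t
       (t = 0 .. ntasks-1) when N iterations are split over ntasks tasks. */
    void task_range(int N, int ntasks, int t, int *start, int *end)
    {
        int chunk = N / ntasks;    /* base number of iterations per task */
        int extra = N % ntasks;    /* leftover iterations                */
        /* the first 'extra' tasks each take one additional iteration */
        *start = t * chunk + (t < extra ? t : extra) + 1;
        *end   = *start + chunk - 1 + (t < extra ? 1 : 0);
    }

With N=30 and ntasks=3 this reproduces the ranges 1-10, 11-20 and 21-30 above; because each task touches a disjoint part of the array, no coordination is needed while the loops run.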
Some subtleties
- However, you can't always do this. Consider:

    do i=2,n
      a(i)=a(i-1)
    end do

- This kind of loop has what we call a dependence
- If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor
- We'll talk a little more about this later, but it does mean that not every loop can be "parallelized" (a small demonstration follows)
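To see the dependence concretely, here is a small C sketch (my illustration, using the reversed-loop test that reappears later in the lecture): running the same loop in a different order changes the answer, which signals that the iterations are not independent.

    #include <stdio.h>
    #define N 5

    int main(void)
    {
        int a[N] = {1, 2, 3, 4, 5};
        int b[N] = {1, 2, 3, 4, 5};

        /* forward: each element copies the freshly UPDATED neighbour,
           so the first value propagates through the whole array */
        for (int i = 1; i < N; ++i) a[i] = a[i-1];

        /* reversed: each element copies its neighbour's ORIGINAL value */
        for (int i = N-1; i >= 1; --i) b[i] = b[i-1];

        for (int i = 0; i < N; ++i)
            printf("a[%d]=%d  b[%d]=%d\n", i, a[i], i, b[i]);
        /* prints a = 1 1 1 1 1 but b = 1 1 2 3 4:
           the result depends on iteration order, so there is a dependence */
        return 0;
    }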
Issues to be aware of
- Parallel computing is not about being "cool" and doing lots and lots of "flops"
  - Flops = floating point operations per second
- We want solutions to problems in a reasonable amount of time
- Sometimes that means doing a lot of calculations, e.g. consider what we found about the number of collisions for molecules in air
- Gains from algorithmic improvements will often swamp hardware improvements
- Don't be brain-limited: if there is a better algorithm, use it
Algorithmic improvements in n-body simulations

[Figure: improvements in the speed of algorithms are proportionally better than the speed increase of computers over the same time interval]
Identifying Performance Desires
(a checklist: positive preconditions suggest parallelization will pay off, negative ones that it won't)
- Frequency of use: daily (positive) ... monthly ... yearly (negative)
- Code evolution timescale: hundreds of executions between changes (positive) vs. changes each run (negative)
Performance Characteristics
- Execution time: days (positive) ... hours ... minutes (negative)
- Level of synchronization: none (positive) ... infrequent (every minute) ... frequent (many per second) (negative)
Data and Algorithm
- Data structures: regular, static (positive) vs. irregular, dynamic (negative)
- Algorithmic complexity*: simple (positive) vs. complex (negative)
  *approximately the number of stages
Requirements
- Must significantly increase resolution/length of integration (positive)
- Need a factor of 2 increase (intermediate)
- Current resolution meets needs (negative)
How much speed-up can we achieve?
- Some parts of a code cannot be run in parallel
  - For example the loop over a(i)=a(i-1) from earlier
- Any code that cannot be executed in parallel is said to be serial or sequential
- Let's suppose that, in terms of the total execution time of a program, a fraction f_s has to be run in serial, while f_p can be run in parallel on n CPUs
- Equivalently, the time spent in each fraction will be t_s and t_p, so the total time on 1 CPU is t_1cpu = t_s + t_p
- If we can run the parallel fraction on n CPUs then it will take a time t_p/n
- The total time will then be t_ncpu = t_s + t_p/n
Amdahl's Law
- How much speed-up (S_n = t_1cpu/t_ncpu) is feasible?
- Amdahl's Law is the most significant limit
- Given our previous results and n processors, the maximum speed-up is given by:
    S_n = t_1cpu/t_ncpu = (t_s + t_p)/(t_s + t_p/n) = n/(1 + (n-1)f_s)

- Only if the serial fraction f_s (= t_s/(t_s + t_p)) is zero is perfect speed-up possible (at least in theory)
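As a quick worked example (my numbers, chosen to match the curves on the next slide): with a serial fraction f_s = 0.1 on n = 100 CPUs,

    S_100 = 100/(1 + 99 x 0.1) = 100/10.9 ≈ 9.2

so even 100 processors give less than a 10x speed-up; indeed, as n grows without bound, S_n approaches 1/f_s, which is 10 here.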
Amdahl's Law

[Figure: speed-up S_n versus N_cpu (1 to 120) for f_s = 0.1, 0.01 and 0.001. At small N_cpu the scaling is similar for the different f_s; to keep scaling near N_cpu ~ 100 you have to achieve excellent parallelism, i.e. a very small f_s]
What is OpenMP?
- OpenMP is a "pragma"-based "application programmer interface" (API) that provides a simple extension to C/C++ and FORTRAN
  - A pragma is just a fancy word for an instruction to the compiler
- It is exclusively designed for shared memory programming
- Ultimately, OpenMP is a very simple interface to something called threads-based programming
- What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel
Threads-based execution

[Diagram: a master thread runs the serial sections; at each parallel section extra threads are forked, and they join the master thread again before the next serial section]

- Serial execution, interspersed with parallel sections
- In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the "fork-join" operation
Some background to threads programming
- There is actually an entire set of commands in C to allow you to create threads
- You could, if you wanted, program with these commands directly
  - The most common thread standard is called POSIX
- However, OpenMP provides a simple interface to a lot of the functionality provided by threads
- If it is simple, and does what you need, why bother going to the effort of using threads programming? (compare the sketch below)
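For comparison, here is roughly what the earlier array-doubling loop looks like written directly against the POSIX threads interface (a minimal sketch of mine, not from the lecture; it assumes N is a multiple of NTHREADS). A single OpenMP directive replaces all of this bookkeeping:

    #include <pthread.h>

    #define N        30
    #define NTHREADS 3

    static float a[N];

    /* each thread doubles its own contiguous chunk of the array */
    static void *double_chunk(void *arg)
    {
        int t     = *(int *)arg;      /* thread index 0 .. NTHREADS-1    */
        int chunk = N / NTHREADS;     /* assumes N divisible by NTHREADS */
        for (int i = t * chunk; i < (t + 1) * chunk; ++i)
            a[i] = a[i] * 2.0f;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int       arg[NTHREADS];

        for (int t = 0; t < NTHREADS; ++t) {      /* fork */
            arg[t] = t;
            pthread_create(&tid[t], NULL, double_chunk, &arg[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)        /* join */
            pthread_join(tid[t], NULL);
        return 0;
    }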
Components of OpenMP
- Directives (pragmas in your code)
- Runtime library routines (supplied with the compiler)
- Environment variables (set at the Unix prompt)
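All three components can appear in one tiny program. A minimal sketch (mine, not from the lecture; omp_get_thread_num and omp_get_num_threads are standard OpenMP runtime routines):

    /* environment variable, set at the Unix prompt before running:
     *     export OMP_NUM_THREADS=4
     */
    #include <stdio.h>
    #include <omp.h>                       /* runtime library routines */

    int main(void)
    {
    #pragma omp parallel                   /* directive (pragma in your code) */
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(),   /* runtime routine: my thread id */
                   omp_get_num_threads()); /* runtime routine: team size    */
        }
        return 0;
    }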
OpenMP: Where did it come from?
- Prior to 1997, vendors all had their own proprietary shared memory programming commands
  - Programs were not portable from one SMP to another
- Researchers were calling for some kind of portability
- The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard but ultimately failed
- OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory marketplace
- It is very hard for researchers to get new languages supported now; you must have backing from computer vendors!
Bottom line
- For OpenMP & shared memory programming in general, one only has to worry about parallelism of work
  - This is because all the processors in a shared-memory computer can see all the same memory locations
- On distributed-memory computers one has to worry both about parallelism of the work and also the placement of data
  - Is the value I need in the memory of another processor?
- Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer; it can be highly non-trivial
  - Although it can be easy; it depends on the algorithm
First Steps
- Loop level parallelism is the simplest and easiest way to use OpenMP
  - Take each do loop and make it parallel (if possible)
  - It allows you to slowly build up parallelism within your application
- However, not all loops are immediately parallelizable, due to dependencies
Loop Level Parallelism
- Consider the single precision vector add-multiply operation Y = aX + Y ("SAXPY")

FORTRAN:

    do i=1,n
      Y(i)=a*X(i)+Y(i)
    end do

    C$OMP PARALLEL DO
    C$OMP& DEFAULT(NONE)
    C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
          do i=1,n
            Y(i)=a*X(i)+Y(i)
          end do

C/C++:

    for (i=1;i<=n;++i) {
      Y[i]+=a*X[i];
    }

    #pragma omp parallel for \
      private(i) shared(X,Y,n,a)
    for (i=1;i<=n;++i) {
      Y[i]+=a*X[i];
    }
In more detail

    C$OMP PARALLEL DO
    C$OMP& DEFAULT(NONE)
    C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
          do i=1,n
            Y(i)=a*X(i)+Y(i)
          end do

- C$OMP PARALLEL DO denotes that this is a region of code for parallel execution
- DEFAULT(NONE) is good programming practice: you must declare the nature of all variables
- Thread PRIVATE variables: each thread must have its own copy of the variable (in this case i is the only private variable)
- Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously
- C$OMP is the comment pragma for FORTRAN; the ampersand is necessary for continuation
A quick note
- To be fully lexically correct you may want to include a closing C$OMP END PARALLEL DO
- In f90 programs use !$OMP as the sentinel
- Notice that the sentinels mean that the OpenMP commands look like comments
  - A compiler with OpenMP compatibility turned on will see the commands after the sentinel
- This means you can still compile the code on computers that don't have OpenMP
How the compiler handles OpenMP
- When you compile an OpenMP code you need to add "flags" to the compile line, e.g.

    f77 -openmp -o myprogram myprogram.f

- Unfortunately different compilers have different commands for turning on OpenMP support; the above will work on Sun machines
- When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads
  - This is all transparent to you, though
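For reference (my addition; check your own compiler's manual), the GNU compilers use the flag -fopenmp:

    gfortran -fopenmp -o myprogram myprogram.f
    gcc -fopenmp -o myprogram myprogram.c

Without the flag the same source simply compiles as a serial program, since the directives look like comments (FORTRAN) or are ignored pragmas (C).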
Requirements for parallel loops
- To divide up the work, the compiler needs to know the number of iterations to be executed: the trip count must be computable
- The loop must also not exhibit any of the dependencies we mentioned
  - We'll review this more in the next lecture
  - Actually, a good test for dependencies is running the loop from n to 1 rather than 1 to n: if you get a different answer, that suggests there are dependencies (this is the test demonstrated in the earlier sketch)
- DO WHILE is not parallelizable using these directives
  - There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don't have time to cover that
- The loop can only have one exit point, therefore BREAKs or GOTOs are not allowed
Performance limitations
- Each time you start and end a parallel loop there is an overhead associated with the threads
- These overheads must always be added to the time taken to calculate the loop itself
- Therefore there is a limit on the smallest loop size that will achieve speed-up
- In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing
  - A good rule of thumb is that any thread should have at least 1000 floating point operations
- Thus small loops are simply not worth the bother!
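To make the rule of thumb concrete (my illustration; t_o and t_w are machine-dependent quantities introduced here, not lecture values): let t_o be the fork-join overhead and t_w the time per loop iteration. Running N iterations on n threads only pays off when

    t_o + N*t_w/n < N*t_w,  i.e.  N > t_o/(t_w*(1 - 1/n))

If the fork-join overhead costs the equivalent of a few thousand floating point operations, this reproduces the ~5000-flop threshold quoted above.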
Summary
- Shared memory parallel computers can be programmed using the OpenMP extensions to C, FORTRAN
  - Distributed memory computers require a different parallel language
- The easiest way to use OpenMP is to make loops parallel by dividing work up among threads
  - The compiler handles most of the difficult parts of coding
- However, not all loops are immediately parallelizable
  - Dependencies may prevent parallelization
- Loops are made to run in parallel by adding directives ("pragmas") to your code
  - These directives appear to be comments to ordinary compilers
Next Lecture
- More details on dependencies and how we can deal with them