Computational Science jsumethod05 [email protected] 30 January 2005
More on Parallel Computing
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224
Bloomington
[email protected]
What is Parallel Architecture?
• A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
  – from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
  – Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  – Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  – Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?
Parallel Computers -- Classic Overview
• Parallel computers allow several CPUs to contribute to a computation simultaneously.
• For our purposes, a parallel computer has three types of parts:
  – Processors
  – Memory modules
  – Communication / synchronization network
• Key points:
  – All processors must be busy for peak speed.
  – Local memory is directly connected to each processor.
  – Accessing local memory is much faster than other memory.
  – Synchronization is expensive, but necessary for correctness.
Colors used in the following pictures
Distributed Memory Machines
• Every processor has a memory others can't access.
• Advantages:
  – Relatively easy to design and build
  – Predictable behavior
  – Can be scalable
  – Can hide latency of communication
• Disadvantages:
  – Hard to program
  – Program and O/S (and sometimes data) must be replicated
Communication on Distributed Memory Architecture
• On distributed memory machines, each chunk of decomposed data resides in a separate memory space -- a processor is typically responsible for both storing and processing the data it owns (the owner-computes rule)
• Information needed on the edges for an update must be communicated via explicitly generated messages
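The halo exchange described above can be sketched in plain Python on a 1D toy grid (the function names and the simulated ghost copies are illustrative, not from any library; a real code would replace `exchange_halos` with MPI sends and receives):

```python
# Sketch of the owner-computes rule with explicit halo exchange.
# Each "processor" owns a contiguous chunk of a 1D array plus one
# ghost value per side that must be refreshed by communication
# before every update step.

def decompose(data, nproc):
    """Split data into nproc contiguous chunks (owner-computes)."""
    n = len(data) // nproc
    return [data[i * n:(i + 1) * n] for i in range(nproc)]

def exchange_halos(chunks):
    """Return (left_ghost, right_ghost) per chunk -- the 'messages'
    a distributed-memory machine would send explicitly."""
    ghosts = []
    for p in range(len(chunks)):
        left = chunks[p - 1][-1] if p > 0 else 0.0            # fixed boundary
        right = chunks[p + 1][0] if p < len(chunks) - 1 else 0.0
        ghosts.append((left, right))
    return ghosts

def jacobi_step(chunks):
    """One averaging update; the edge points need the ghosts."""
    ghosts = exchange_halos(chunks)
    new = []
    for (left, right), chunk in zip(ghosts, chunks):
        padded = [left] + chunk + [right]
        new.append([(padded[i - 1] + padded[i + 1]) / 2.0
                    for i in range(1, len(padded) - 1)])
    return new

chunks = decompose([0.0, 0.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0], nproc=4)
chunks = jacobi_step(chunks)
```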
Distributed Memory Machines -- Notes
• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC clusters and BlueGene are quite similar.
• The bandwidth and latency of their interconnects differ.
• The network topology is a two-dimensional torus for the Paragon, a three-dimensional torus for BlueGene, a fat tree for the CM-5, a hypercube for the nCUBE, and a switch for the SP-2.
• To program these machines:
  – Divide the problem to minimize the number of messages while retaining parallelism
  – Convert all references to global structures into references to local pieces (explicit messages convert distant variables to local ones)
  – Optimization: pack messages together to reduce fixed overhead (almost always needed)
  – Optimization: carefully schedule messages (usually done by a library)
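The second step -- converting global references into local ones -- amounts to an index translation. A minimal sketch under an assumed equal-sized block distribution (the function names are hypothetical, chosen for illustration):

```python
# Map a global array index to a (processor, local index) pair under
# a block distribution, and back. This is the bookkeeping a
# distributed-memory programmer (or an HPF compiler) must do for
# every global reference.

def owner_and_local(global_index, n_global, nproc):
    """For a block distribution with equal-sized blocks, return
    (owner processor, index within that processor's local piece)."""
    block = n_global // nproc
    return global_index // block, global_index % block

def to_global(proc, local_index, n_global, nproc):
    """Inverse map: a local reference back to its global index."""
    block = n_global // nproc
    return proc * block + local_index
```

References that land on another processor's block are exactly the ones that must become messages.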
BlueGene/L has Classic Architecture
• The 32768-node BlueGene/L took the #1 TOP500 position on 29 September 2004 with 70.7 Teraflops
BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• The 3D interconnect supports many scientific simulations, as nature as we see it is 3D
1987 MPP: 1024-node full system with hypercube interconnect
Shared-Memory Machines
• All processors access the same memory.
• Advantages:
  – Retain sequential programming languages such as Java or Fortran
  – Easy to program (correctly)
  – Can share code and data among processors
• Disadvantages:
  – Hard to program (optimally)
  – Not scalable, due to bandwidth limitations in the bus
Communication on Shared Memory Architecture
• On a shared memory machine, a CPU is responsible for processing a decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that for distributed memory machines, but communication is implicit: one "just" accesses memory
Shared-Memory Machines -- Notes
• The interconnection network varies from machine to machine
• These machines share data by direct access.
  – Potentially conflicting accesses must be protected by synchronization.
  – Simultaneous access to the same memory bank will cause contention, degrading performance.
  – Some access patterns will collide in the network (or bus), causing contention.
  – Many machines have caches at the processors.
  – All these features make it profitable to have each processor concentrate on one area of memory that others access infrequently.
Distributed Shared Memory Machines
• Combining the (dis)advantages of shared and distributed memory
• Lots of hierarchical designs:
  – Typically, "shared memory nodes" with 4 to 32 processors
  – Each processor has a local cache
  – Processors within a node access shared memory
  – Nodes can get data from or put data to other nodes' memories
Summary on Communication etc.
• Distributed shared memory machines have the communication features of both distributed (messages) and shared (memory access) architectures
• Note that for distributed memory, the programming model must express data location (the HPF Distribute command) and the invocation of messages (MPI syntax)
• For shared memory, one needs to express control (openMP) or processing parallelism and synchronization -- one must make certain that when a variable is updated, the "correct" version is used by other processors accessing it, and that values living in caches are updated
Seismic Simulation of Los Angeles Basin
• This is a (sophisticated) wave equation, similar to the Laplace example: you divide Los Angeles geometrically and assign a roughly equal number of grid points to each processor
Computer with 4 processors; problem represented by grid points and divided into 4 domains
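The geometric decomposition in the figure can be sketched as follows (a toy illustration; the quadrant rule and all names are assumptions made for this example, not code from the lecture):

```python
# Assign each point of an n x n grid to one of 4 processors by
# quadrant, giving each processor a contiguous domain with a
# roughly equal number of grid points.

def quadrant_decompose(n):
    """Return {processor: [(i, j), ...]} for a 4-way quadrant split
    of an n x n grid (n assumed even)."""
    half = n // 2
    domains = {0: [], 1: [], 2: [], 3: []}
    for i in range(n):
        for j in range(n):
            proc = (i >= half) * 2 + (j >= half)   # quadrant number 0..3
            domains[proc].append((i, j))
    return domains

domains = quadrant_decompose(8)
```

Each processor ends up with exactly one quarter of the points, and its points are geometrically contiguous, which is what keeps communication on the edges.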
Communication Must be Reduced
• 4 by 4 regions in each processor
  – 16 Green (Compute) and 16 Red (Communicate) points
• 8 by 8 regions in each processor
  – 64 Green and "just" 32 Red points
• Communication is an edge effect
• Give each processor plenty of memory and increase the region in each machine
• Large problems parallelize best
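The green/red counts above follow from simple arithmetic, assuming a 5-point stencil so that an m-by-m region computes m² points and communicates one halo point per edge cell:

```python
# Compute ("green") vs communicate ("red") point counts for an
# m x m region per processor, under a 5-point-stencil assumption.

def green_red(m):
    """Interior work grows as area m*m; communication grows only
    as the edge, 4*m."""
    return m * m, 4 * m

assert green_red(4) == (16, 16)   # equal compute and communicate
assert green_red(8) == (64, 32)   # doubling m halves the red/green ratio
```

Since work grows as the area but communication only as the edge, the compute-to-communicate ratio m/4 improves without bound as regions get bigger -- which is why large problems parallelize best.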
Irregular 2D Simulation -- Flow over an Airfoil
• The Laplace grid points become finite element mesh nodal points, arranged as triangles filling space
• All the action (triangles) is near the wing boundary
• Use domain decomposition, but no longer with equal areas; rather, equal triangle counts
Heterogeneous Problems
• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a smaller time step)
• Little work per star where the force changes slowly and can be well approximated by a low-order multipole expansion
Load Balancing Particle Dynamics
• Particle dynamics of this type (irregular, with sophisticated force calculations) always needs complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
Equal volume decomposition: universe simulation (galaxy or star or ...), 16 processors
• If one uses simpler algorithms (full O(N²) forces) or an FFT, then equal area is best
Reduce Communication
• Consider a geometric problem with 4 processors
• In the top decomposition, we divide the domain into 4 blocks, with all points in a given block contiguous
• In the bottom decomposition, we give each processor the same amount of work, but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given processor together
Block Decomposition (top); Cyclic Decomposition (bottom)
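The factor of 2 quoted above can be checked with an idealized model in which communication is proportional to the total perimeter of the square regions a processor owns (an assumption adopted purely for illustration):

```python
# Edge/area ratio when a processor's 'area' points are held as
# 'pieces' separate square regions: splitting the same area into
# 4 scattered squares doubles the total perimeter.
import math

def edge_over_area(area, pieces):
    """Total perimeter of 'pieces' equal squares / total area."""
    side = math.sqrt(area / pieces)
    return pieces * 4 * side / area

block = edge_over_area(256, 1)    # one contiguous block
cyclic = edge_over_area(256, 4)   # four scattered pieces
```

Since each of the 4 pieces has half the side length but there are 4 of them, the total edge doubles while the area is unchanged.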
Minimize Load Imbalance
• But this has a flip side. Suppose we are decomposing the seismic wave problem and all the action is near a particular earthquake fault.
• In the top decomposition, only the white processor does any work while the other 3 sit idle.
  – Efficiency is 25% due to load imbalance
• In the bottom decomposition, all the processors do roughly the same work, and so we get good load balance
Block Decomposition (top); Cyclic Decomposition (bottom)
Parallel Irregular Finite Elements
• Here is a cracked plate; calculating stresses with an equal-area decomposition leads to terrible results
  – All the work is near the crack
Irregular Decomposition for Crack
• Concentrating processors near the crack leads to good workload balance
• Use equal nodal points -- not equal areas -- but, to minimize communication, the nodal points assigned to a particular processor are contiguous
• This is an NP-complete (exponentially hard) optimization problem, but in practice there are many ways of getting good, though not exact, decompositions
Region assigned to 1 processor; workload not perfect!
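One of the practical heuristics alluded to above is greedy list scheduling: assign each weighted element (triangles are heavier near the crack) to the currently least-loaded processor. A minimal sketch with made-up weights (not data from the lecture):

```python
# Greedy load balancing: hand each element, biggest first, to the
# least-loaded processor so far. Not optimal in general -- the
# exact problem is NP-complete -- but good in practice.
import heapq

def greedy_partition(weights, nproc):
    """Return per-processor total loads after greedy assignment."""
    heap = [(0.0, p) for p in range(nproc)]   # (load, processor)
    heapq.heapify(heap)
    loads = [0.0] * nproc
    for w in sorted(weights, reverse=True):   # place big elements first
        load, p = heapq.heappop(heap)         # least-loaded processor
        loads[p] += w
        heapq.heappush(heap, (loads[p], p))
    return loads

loads = greedy_partition([5, 5, 4, 3, 2, 2, 1, 1, 1], nproc=3)
```

Note this balances work only; real mesh partitioners (as the slide says) must also keep each processor's elements contiguous to limit communication.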
Further Decomposition Strategies
• Not all decompositions are quite the same
• In defending against missile attacks, you track each missile on a separate node -- geometric again
• In playing chess, you decompose the chess tree -- an abstract, not geometric, space
Computer chess tree: current position (node in tree) → first set of moves → opponents' counter moves (e.g. "California gets its independence")
Summary of Parallel Algorithms
• A parallel algorithm is a collection of tasks and a partial ordering between them.
• Design goals:
  – Match tasks to the available processors (exploit parallelism).
  – Minimize ordering (avoid unnecessary synchronization points).
  – Recognize ways parallelism can be helped by changing the ordering.
• Sources of parallelism:
  – Data parallelism: updating array elements simultaneously.
  – Functional parallelism: conceptually different tasks which combine to solve the problem. This happens at both fine and coarse grain sizes:
    • fine is "internal", such as I/O and computation; coarse is "external", such as separate modules linked together
Data Parallelism in Algorithms
• Data-parallel algorithms exploit the parallelism inherent in many large data structures.
  – A problem is an (identical) algorithm applied to multiple points in a data "array"
  – Usually one iterates over such "updates"
• Features of data parallelism:
  – Scalable parallelism -- can often get million-way or greater parallelism
  – Hard to express when the "geometry" is irregular or dynamic
• Note that data-parallel algorithms can be expressed in ALL programming models (message passing, HPF-like, openMP-like)
Functional Parallelism in Algorithms
• Functional parallelism exploits the parallelism between the parts of many systems.
  – Many pieces to work on, many independent operations
  – Example: coarse-grain aeroelasticity (aircraft design)
    • CFD (fluids), CSM (structures) and others (acoustics, electromagnetics etc.) can be evaluated in parallel
• Analysis:
  – Parallelism limited in size -- tens, not millions
  – Synchronization is probably good, as the parallelism is natural from the problem and the usual way of writing software
  – The Web exploits functional parallelism, NOT data parallelism
Pleasingly Parallel Algorithms
• Many applications are what is called (essentially) embarrassingly or, more kindly, pleasingly parallel
• These are made up of independent concurrent components:
  – Each client independently accesses a Web server
  – Each roll of a Monte Carlo die (random number) is an independent sample
  – Each stock can be priced separately in a financial portfolio
  – Each transaction in a database is almost independent (a given account is locked, but usually different accounts are accessed at the same time)
  – Different parts of seismic data can be processed independently
• In contrast, points in a finite difference grid (from a differential equation) canNOT be updated independently
• Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism
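The Monte Carlo case above can be sketched as a map over independent seeds: each call shares no data with any other, so the list comprehension below could be farmed out to processors with zero communication (shown serially here for simplicity; all names are illustrative):

```python
# Pleasingly parallel Monte Carlo estimate of pi: each seed's batch
# of samples is completely independent of every other batch.
import random

def pi_sample(seed, n=10_000):
    """One independent estimate: 4 * (fraction of n random points in
    the unit square that fall inside the quarter circle)."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
               for _ in range(n))
    return 4.0 * hits / n

# This map is the "embarrassingly parallel" part: no ordering, no
# messages -- only a trivial reduction at the end.
estimates = [pi_sample(seed) for seed in range(8)]
pi_estimate = sum(estimates) / len(estimates)
```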
Parallel Languages
• A parallel language provides an executable notation for implementing a parallel algorithm.
• Design criteria:
  – How are parallel operations defined?
    • static tasks vs. dynamic tasks vs. implicit operations
  – How is data shared between tasks?
    • explicit communication/synchronization vs. shared memory
  – How is the language implemented?
    • low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
• Data parallel expresses the concept of an identical algorithm on different parts of an array
• Message parallel expresses the fact that, at a low level, parallelism implies information is passed between different concurrently executing program parts
Data-Parallel Languages
• Data-parallel languages provide an abstract, machine-independent model of parallelism.
  – Fine-grain parallel operations, such as element-wise operations on arrays
  – Shared data in large, global arrays with mapping "hints"
  – Implicit synchronization between operations
  – Partially explicit communication from operation definitions
• Advantages:
  – Global operations conceptually simple
  – Easy to program (particularly for certain scientific applications)
• Disadvantages:
  – Unproven compilers
  – As they express the "problem", they can be inflexible if a new algorithm arises which the language doesn't express well
• Examples: HPF
• Originated on SIMD machines, where parallel operations are in lock-step, but generalized (not so successfully, as the compilers are too hard) to MIMD
Approaches to Parallel Programming
• Data parallel is typified by CMFortran and its generalization, High Performance Fortran, which in previous years we discussed in detail but this year we will not; see the Source Book for more on HPF
• Typical data-parallel Fortran statements are full array statements:
  – B = A1 + A2
  – B = EOSHIFT(A, -1)
  – Function operations on arrays representing the full data domain
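For readers unfamiliar with Fortran array syntax, the statements above can be mimicked with whole-array operations over plain Python lists (a hedged illustration only; a real data-parallel compiler would map these operations onto per-processor loops plus messages):

```python
# Whole-array operations in the spirit of data-parallel Fortran:
# every element of the result is conceptually updated "at once".

def array_add(a1, a2):
    """Elementwise B = A1 + A2."""
    return [x + y for x, y in zip(a1, a2)]

def eoshift(a, shift, boundary=0):
    """End-off shift like Fortran EOSHIFT: result(i) = a(i + shift)
    where defined, with vacated positions filled by 'boundary'
    (so shift = -1 moves elements toward higher indices)."""
    if shift < 0:
        return [boundary] * (-shift) + a[:shift]
    return a[shift:] + [boundary] * shift
```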
• Message passing is typified by the later discussion of the Laplace example; it specifies specific machine actions, i.e. send a message between nodes, whereas the data-parallel model is at a higher level, as it (tries to) specify a problem feature
• Note: we are always using "data parallelism" at the problem level, whether the software is "message passing" or "data parallel"
• Data-parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
Shared Memory Programming Model
• Experts in Java are familiar with this, as it is built into the Java language through thread primitives
• We take "ordinary" languages such as Fortran, C++ or Java and add constructs to help compilers divide processing (automatically) into separate threads
  – indicate which DO/for loop instances can be executed in parallel, and where there are critical sections with global variables etc.
• openMP is a recent set of compiler directives supporting this model
• This model tends to be inefficient on distributed memory machines, as optimizations (data layout, communication blocking etc.) are not natural
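A rough Python analogue of this model: a loop whose iterations are divided among threads, with a lock guarding the critical section on a shared variable. This is what openMP directives express declaratively for Fortran/C; the names below are illustrative:

```python
# Shared-memory loop parallelism: 4 threads each handle a strided
# slice of the iterations, then update a shared total inside a
# critical section. (CPython's GIL means no real speedup here;
# this only illustrates the programming model.)
import threading

data = list(range(100))
total = 0
lock = threading.Lock()

def worker(chunk):
    global total
    partial = sum(x * x for x in chunk)   # independent loop iterations
    with lock:                            # critical section on shared data
        total += partial

threads = [threading.Thread(target=worker, args=(data[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the `total += partial` updates could interleave and lose work -- exactly the "correct version of an updated variable" hazard the summary slide warned about.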
Structure (Architecture) of Applications - I
• Applications are metaproblems with a mix of module (aka coarse-grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be:
  – the "10,000" separate programs (e.g. structures, CFD ...) used in the design of aircraft
  – the various filters used in the Adobe Photoshop or Matlab image processing systems
  – the ocean-atmosphere components in an integrated climate simulation
  – the database or file system access of a data-intensive application
  – the objects in a distributed Forces Modeling Event Driven Simulation
Structure (Architecture) of Applications - II
• Modules are "natural" message-parallel components of the problem, and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
  – modules are what HPF needs task parallelism for
  – often modules are naturally distributed, whereas the parts of a data-parallel decomposition may need to be kept on a tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
  – now if one takes a large CFD problem and divides it into a few components, those "coarse-grain data-parallel components" will be supported by computational grid technology
• Use Java/distributed object technology for modules -- note that Java is to a growing extent used to write servers for CORBA and COM object systems
Multi-Server Model for Metaproblems
• We have multiple supercomputers in the backend -- one doing a CFD simulation of airflow, another structural analysis -- while in more detail you have linear algebra servers (NetSolve), optimization servers (NEOS), image processing filters (Khoros), databases (NCSA Biology Workbench) and visualization systems (AVS, CAVEs)
  – One runs 10,000 separate programs to design a modern aircraft, which must be scheduled and linked ...
• All are linked to collaborative information systems in a sea of middle-tier servers (as on the previous page) to support design, crisis management and multidisciplinary research
Multi-Server Scenario (figure): Database; Matrix Solver; Optimization Service; MPPs; Parallel DB Proxy; NEOS Control Optimization; Origin 2000 Proxy; NetSolve Linear Algebra Server; IBM SP2 Proxy; Data Analysis Server; Gateway Control; Agent-based Choice of Compute Engine; Multidisciplinary Control (WebFlow)