
Page 1:

Why Parallel Computing?
• Annual performance improvement has dropped from 50% per year to 20% per year
• Manufacturers are focusing on multi-core systems, not single-core
• Why?
  – Smaller transistors = faster processors
  – Faster processors = increased power
  – Increased power = increased heat
  – Increased heat = unreliable processors

Page 2:

Parallel Programming Examples
• Embarrassingly parallel applications
  – Google searches employ > 1,000,000 processors
• Applications with unacceptable sequential run times
• Grand Challenge Problems – 1991 High Performance Computing Act (Public Law 102-194)
  "Fundamental Grand Challenge science and engineering problems with broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources."
• Promotes terascale-level computation over high-bandwidth, wide-area computational grids

Page 3:

Grand Challenge Problems
• Require more processing power than is available on single-processor systems
• Solutions would significantly benefit society
• No complete solutions are commercially available
• There is significant potential for progress with today's technology
• Examples: climate modeling, gene discovery, energy research, semantic web-based applications

Page 4:

Grand Challenge Problem Examples
• Associations
  – Computing Research Association (CRA)
  – National Science Foundation (NSF)

Partial Grand Challenge Problem List
1. Predict contaminant seepage
2. Predict airborne contaminant effects
3. Gene sequence discovery
4. Short-term weather forecasts
5. Predict long-term global warming
6. Predict earthquakes and volcanoes
7. Predict hurricanes and tornadoes
8. Automate natural language understanding
9. Computer vision
10. Nanotechnology
11. Computerized reasoning
12. Protein mechanisms
13. Predict asteroid collisions
14. Sub-atomic particle interactions
15. Model biomechanical processes
16. Manufacture new materials
17. Fundamental nature of matter
18. Transportation patterns
19. Computational fluid dynamics

Page 5:


Global Weather Forecasting Example

• Suppose the whole global atmosphere is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) – about 5 × 10⁸ cells.
• Suppose each calculation requires 200 floating point operations. In one time step, 10¹¹ floating point operations are necessary.
• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10⁹ floating point operations/s) takes 10⁶ seconds, or over 10 days.
• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10¹² floating point operations/sec). (The arithmetic is sketched below.)
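
To make the arithmetic concrete, here is a small C sketch that just redoes the slide's numbers (the cell count, operations per cell, and the 7-day/1-minute step count are taken from the slide; nothing else is assumed):

    /* Rough arithmetic behind the weather-forecasting estimate (values from the slide). */
    #include <stdio.h>

    int main(void) {
        double cells      = 5e8;     /* ~1-mile cells, 10 cells high over the globe */
        double flop_cell  = 200.0;   /* floating point operations per cell per step */
        double steps      = 7.0 * 24 * 60;          /* 7 days at 1-minute intervals */
        double total_flop = cells * flop_cell * steps;

        double t_1gflops  = total_flop / 1e9;        /* seconds at 1 Gflops          */
        double need_5min  = total_flop / (5 * 60);   /* flop/s to finish in 5 minutes */

        printf("total flop: %.2e\n", total_flop);
        printf("time at 1 Gflops: %.2e s (%.1f days)\n", t_1gflops, t_1gflops / 86400);
        printf("rate for 5-minute forecast: %.2e flop/s (~%.1f Tflops)\n",
               need_5min, need_5min / 1e12);
        return 0;
    }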

Page 6:

Modeling Motion of Astronomical Bodies

• Astronomical bodies are attracted to each other by gravity; the force on each body determines its movement
• Required calculations: O(n²), or at best O(n log n)
• At each time step, calculate each body's new position
• A galaxy might have 10¹¹ stars
• If each calculation requires 1 ms, one iteration requires 10⁹ years using the n² algorithm and almost a year using an efficient n log n algorithm (a sketch of the n² step appears below)
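
A minimal C sketch of the O(n²) pairwise-gravity step being counted above; the Body struct, the softening term, and the values in main are illustrative assumptions, not part of the slide:

    #include <math.h>

    typedef struct { double x, y, z, vx, vy, vz, mass; } Body;

    /* One O(n^2) time step: accumulate gravitational acceleration from every
       other body, then advance velocities and positions. */
    void nbody_step(Body *b, int n, double G, double dt) {
        for (int i = 0; i < n; i++) {
            double ax = 0, ay = 0, az = 0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y, dz = b[j].z - b[i].z;
                double r2 = dx*dx + dy*dy + dz*dz + 1e-9;   /* softening avoids divide by 0 */
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                ax += G * b[j].mass * dx * inv_r3;
                ay += G * b[j].mass * dy * inv_r3;
                az += G * b[j].mass * dz * inv_r3;
            }
            b[i].vx += ax * dt; b[i].vy += ay * dt; b[i].vz += az * dt;
        }
        for (int i = 0; i < n; i++) {
            b[i].x += b[i].vx * dt; b[i].y += b[i].vy * dt; b[i].z += b[i].vz * dt;
        }
    }

    int main(void) {
        /* Three made-up bodies, one 60-second step. */
        Body b[3] = { {0,0,0, 0,0,0, 1e30},
                      {1e11,0,0, 0,3e4,0, 6e24},
                      {0,1e11,0, -3e4,0,0, 6e24} };
        nbody_step(b, 3, 6.674e-11, 60.0);
        return 0;
    }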

Page 7:

Types of Parallelism
• Fine grain
  – Vector processors
    • Matrix operations in single instructions
  – High-performance optimizing compilers
    • Reorder instructions
    • Loop partitioning
    • Low-level synchronization
• Coarse grain (see the Pthreads sketch below)
  – Threads, critical sections, mutual exclusion
  – Message passing
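
As a coarse-grain illustration, a minimal Pthreads sketch of a critical section enforced with a mutex; the shared counter and the thread count of 4 are invented for the example:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);       /* enter critical section  */
            counter++;                       /* one thread at a time here */
            pthread_mutex_unlock(&lock);     /* leave critical section  */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* expect 400000 */
        return 0;
    }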

Page 8:

Von Neumann Bottleneck

Modifications (transparent to software)
• Device controllers
• Cache: fast memory that holds blocks of recently used consecutive memory locations
• Pipelines that break an instruction into pieces and execute the pieces in parallel
• Multiple issue: replicate functional units to execute instructions in parallel
• Fine-grained multithreading: switch threads after each instruction

The underlying problem: CPU speed far exceeds memory access speed.

Page 9:

Parallel Systems
• Shared Memory Systems
  – All cores see the same memory
  – Coordination: critical sections and mutual exclusion
• Distributed Memory Systems
  – Beowulf clusters; clusters of workstations
  – Each core has access only to its local memory
  – Coordination: message passing
• Hybrid (distributed memory presented as shared)
  – Uniform Memory Access (UMA)
  – Cache-Only Memory Architecture (COMA)
  – Non-Uniform Memory Access (NUMA)
• Computational Grids
  – Heterogeneous and geographically separated

Page 10:

Hardware Configurations
• Flynn categories
  – SISD (single core)
  – MIMD (*our focus*)
  – SIMD (vector processors)
  – MISD
• Within MIMD
  – SPMD (*our focus*)
  – MPMD
• Multiprocessor systems
  – Threads, critical sections
• Multi-computer systems
  – Message passing
  – Sockets, RMI

[Figure: Shared memory multiprocessor — processors P0–P3 all connected to a single Memory; Distributed memory multi-computer — processors P0–P3 each with a local memory M0–M3]

Page 11:

Hardware Configurations
• Shared memory systems
  – Do not scale to high numbers of processors
  – Considerations
    • Enforcing critical sections through mutual exclusion
    • Forks and joins of threads
• Distributed memory systems
  – Topology: the graph that defines the network
    • More connections mean higher cost
    • Latency: time to establish a connection
    • Bandwidth: width of the pipe
  – Considerations
    • An appropriate message-passing framework
    • Redesigning algorithms into forms that are less natural

Page 12:

Shared Memory Problems
• Memory contention
  – A single bus is inexpensive but serializes access
  – Crossbar switches allow parallel access but are expensive
• Cache coherence
  – A cache can disagree with main memory
    • Write-through writes all changes to memory immediately
    • Write-back writes a dirty line only when it is evicted
  – Per-processor caches require broadcasting changes and complex coherence algorithms

Page 13:

Distributed Memory
• Possible to use commodity systems
• Relatively inexpensive interconnects
• Requires message passing, which programmers tend to find difficult
• Must deal with network, topology, and security issues
• Hybrid systems are physically distributed but present a shared view to programmers, at some cost in performance

Page 14:

Network Terminology
• Latency – Time to send a "null", zero-length message
• Bandwidth – Maximum transmission rate (bits/sec)
• Total edges – Total number of network connections
• Degree – Maximum connections per network node
• Connectivity – Minimum number of connections whose removal disconnects the network
• Bisection width – Number of connections that must be cut to split the network into two equal parts
• Diameter – Maximum hops connecting any two nodes
• Dilation – Number of extra hops needed to map one topology onto another

Page 15:

Web-based Networks
• Generally use the TCP/IP protocol
• The number of hops between nodes is not constant
• Communication incurs high latencies
• Nodes are scattered over large geographical distances
• High bandwidths are possible once a connection is established
• The slowest link along the path limits speed
• Resources are highly heterogeneous
• Security becomes a major concern; proxies are often used
• Encryption algorithms can add significant overhead
• Subject to local policies at each node
• Example: www.globus.org

Page 16:

Routing Techniques
• Packet switching
  – Message packets are routed separately and assembled at the sink
• Deadlock-free routing
  – Guarantees sufficient resources to complete the transmission
• Store and forward
  – A message is stored at each node before transmission continues
• Cut-through
  – The entire transmission path is established before transmission begins
• Wormhole routing
  – Flits (a few bits each) are held at each node; the "worm" of flits moves when the next node becomes available

Page 17:

Myrinet
• Proprietary technology (http://www.myri.com)
• Point-to-point, full-duplex, switch-based technology
• Custom chip settings for parallel topologies
• Lightweight, transparent cut-through routing protocol
• Scales to thousands of processors without TCP/IP limitations
• Can embed TCP/IP messages to maximize flexibility
• Gateway for wide-area heterogeneous networks

Type         Bandwidth   Latency
Ethernet     10 MB/sec   10-15 ms
GB Ethernet  10 GB/sec   150 us
Myrinet      10 GB/sec   2 us

(In the network diagram: rectangles are processors, circles are switches.)

Page 18:

Parallel Techniques
• Peer-to-peer
  – Independent systems coordinate to run a single application
  – Mechanisms: threads, message passing
• Client-server
  – A server responds to many clients running many applications
  – Mechanisms: Remote Method Invocation, sockets

The focus of this class is peer-to-peer applications.

Page 19:

Popular Network Topologies
• Fully connected and star
• Line and ring
• Tree and fat tree
• Mesh and torus
• Hypercube
• Hybrids: pyramid
• Multi-stage: Myrinet

Page 20:

Fully Connected and Star

Degree? Connectivity? Total edges? Bisection width? Diameter?

Page 21:

Line and Ring

Degree? Connectivity? Total edges? Bisection width? Diameter?

Page 22:

Tree and Fat Tree

Degree? Connectivity? Total edges? Bisection width? Diameter?

Fat tree: the edges connecting a node at level k to level k−1 are twice the number of edges connecting a node at level k−1 to level k−2.

Page 23:

Mesh and Torus

Degree? Connectivity? Total edges? Bisection width? Diameter?

Page 24:

Hypercube

Degree? Connectivity? Total edges? Bisection width? Diameter?

• A hypercube of degree zero is a single node
• A hypercube of degree d is two hypercubes of degree d−1, with edges connecting the corresponding nodes

(A sketch tabulating these properties follows below.)
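
As a check on the questions above, a small C sketch that tabulates the standard properties of a d-dimensional hypercube (2^d nodes, node degree and connectivity d, d·2^(d−1) edges, diameter d, bisection width 2^(d−1)); treat it as one possible answer sheet, not part of the original slides:

    #include <stdio.h>

    int main(void) {
        printf("d   nodes   degree  edges   diameter  bisection\n");
        for (int d = 0; d <= 6; d++) {
            long nodes = 1L << d;                        /* 2^d                    */
            long edges = (long)d << (d ? d - 1 : 0);     /* d * 2^(d-1), 0 if d==0 */
            long bisection = d ? (1L << (d - 1)) : 0;    /* 2^(d-1)                */
            printf("%-3d %-7ld %-7d %-7ld %-9d %ld\n",
                   d, nodes, d, edges, d, bisection);
        }
        return 0;
    }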

Page 25:

Pyramid

Degree? Connectivity? Total edges? Bisection width? Diameter?

A hybrid network combining a mesh and a tree.

Page 26:


Multistage Interconnection Network — Example: Omega network

[Figure: 8×8 Omega network — inputs 000–111 on the left, outputs 000–111 on the right, connected through stages of 2×2 switch elements with straight-through or crossover connections]

Page 27:


Distributed Shared Memory
Makes the main memory of a group of interconnected computers look like a single memory with a single address space, so that shared-memory programming techniques can be used.

[Figure: computers, each with its own processor and memory, exchanging messages over an interconnection network to present one shared memory]

Page 28:

Parallel Performance Metrics
• Complexity (big-O notation)
  – f(n) = O(g(n)) if there exist constants z, c > 0 such that f(n) ≤ c·g(n) whenever n > z
• Speed-up: s(p) = t1 / tp
• Cost: C(p) = tp · p
• Efficiency: E(p) = t1 / C(p)
• Scalability
  – An imprecise term for the impact of adding processors
  – We might say an application scales to 256 processors
  – Can refer to hardware, software, or both

(A sketch of these metrics in C follows.)
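
The same metrics written out as a small C sketch; the timings in main are invented placeholders, not measurements:

    #include <stdio.h>

    double speedup(double t1, double tp)           { return t1 / tp; }       /* s(p)  */
    double cost(double tp, int p)                  { return tp * p; }        /* C(p)  */
    double efficiency(double t1, double tp, int p) { return t1 / cost(tp, p); } /* E(p) */

    int main(void) {
        double t1 = 100.0, tp = 30.0;   /* illustrative timings, in seconds */
        int p = 4;
        printf("s(%d) = %.2f, C(%d) = %.1f, E(%d) = %.2f\n",
               p, speedup(t1, tp), p, cost(tp, p), p, efficiency(t1, tp, p));
        return 0;
    }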

Page 29:

Parallel Run Time
• Sequential execution time: t1
  – t1 = computation time of the best sequential algorithm
• Communication overhead: t_comm = m(t_startup + n·t_data)
  – t_startup = latency (time to send a message with no data)
  – t_data = time to send one data element
  – n = number of data elements per message
  – m = number of messages
• Computation overhead: t_comp = f(n, p)
• Parallel execution time: tp = t_comp + t_comm
  – tp reflects the worst-case execution time over all processors

(A sketch of this model appears below.)
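
A sketch of this run-time model in C; the latency, per-element, and computation values are placeholders chosen only to exercise the formula:

    #include <stdio.h>

    /* t_comm = m * (t_startup + n * t_data) */
    double t_comm(int m, double t_startup, long n, double t_data) {
        return m * (t_startup + n * t_data);
    }

    int main(void) {
        double t_startup = 50e-6;   /* placeholder latency: 50 us per message */
        double t_data    = 8e-9;    /* placeholder: 8 ns per data element     */
        long   n = 1000000;         /* elements per message                   */
        int    m = 16;              /* number of messages                     */
        double t_compute = 0.5;     /* placeholder computation time, seconds  */

        double comm = t_comm(m, t_startup, n, t_data);
        printf("t_comm = %.4f s, t_p = %.4f s\n", comm, t_compute + comm);
        return 0;
    }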

Page 30:

Estimating Scalability

Page 31:

Parallel Visualization Tools

[Figure: space-time diagram for Processes 1–3; horizontal bars show time spent computing, waiting, and in message-passing system routines, with arrows for messages between processes]

Observe program behavior using a space-time diagram (also called a process-time diagram).

Page 32:

Superlinear Speed-up (s(p) > p)

Reasons for superlinear speed-up:
1. A non-optimal sequential algorithm
   a. Solution: compare against an optimal sequential algorithm
   b. Parallel versions are often different from sequential versions
2. Specialized hardware on certain processors
   a. A processor may have fast graphics but compute slowly
   b. Example: NASA's superlinear distributed grid application
3. The average case doesn't match a single run
   a. Consider a search application
   b. What are the speed-up possibilities?

Page 33:

Speed-up Potential
• Amdahl's "pessimistic" law
  – The fraction of sequential processing (f) is fixed
  – S(p) = t1 / (f·t1 + (1−f)·t1/p) → 1/f as p → ∞
• Gustafson's "optimistic" law
  – More data implies the parallel portion grows
  – Assumes more capability leads to more data
  – S(p) = f + (1−f)·p
• For each law (see the sketch below)
  – What is the best speed-up if f = 0.25?
  – What is the speed-up for 16 processors?
• Which assumption is valid?
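
A small C sketch that evaluates both laws for the questions above (f = 0.25, p = 16):

    #include <stdio.h>

    double amdahl(double f, int p)    { return 1.0 / (f + (1.0 - f) / p); }  /* S(p) = 1/(f + (1-f)/p) */
    double gustafson(double f, int p) { return f + (1.0 - f) * p; }          /* S(p) = f + (1-f)p      */

    int main(void) {
        double f = 0.25;
        int p = 16;
        printf("Amdahl:    s(%d) = %.2f, limit as p -> inf = %.2f\n",
               p, amdahl(f, p), 1.0 / f);
        printf("Gustafson: s(%d) = %.2f (grows with p)\n",
               p, gustafson(f, p));
        return 0;
    }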

Page 34:

Challenges

• Running multiple instances of a sequential program won’t make effective use of parallel resources

• Without programmer optimizations, additional processors will not improve overall system performance

• Connecting networks of systems together in a peer-to-peer manner

Page 35:

Solutions
• Rewrite and parallelize existing programs
  – Algorithms for parallel systems are drastically different from those that execute sequentially
• Translation programs that automatically parallelize serial programs
  – This is very difficult to do
  – Success has been limited
• Operating system thread/process allocation
  – Some benefit, but not a general solution

Page 36:

Compute and Merge Results
• Serial algorithm:

    result = 0;
    for (int i = 0; i < N; i++) { result += merge(compute(i)); }

• Parallel algorithm (first try), sketched in MPI below:
  – Each processor, P, performs N/P computations
  – IF P > 0 THEN send the partial result to the master (P = 0)
  – ELSE receive and merge the partial results

Is this the best we can do?
How many compute calls? How many merge calls?
How is the work distributed among the processors?
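
One way the first-try scheme might look in MPI; compute() and merge() are stand-ins, and the cyclic split of the N indices among p processes is one arbitrary choice:

    #include <mpi.h>
    #include <stdio.h>

    /* Stand-ins for the slide's compute() and merge(). */
    static double compute(int i)  { return (double)i; }
    static double merge(double x) { return x; }

    int main(int argc, char *argv[]) {
        const int N = 1000000;
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        /* Each process handles roughly N/p of the indices. */
        double partial = 0.0;
        for (int i = rank; i < N; i += p)
            partial += merge(compute(i));

        if (rank != 0) {
            /* Workers send their partial result to the master (rank 0). */
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {
            /* Master receives and combines p-1 partial results, one at a time. */
            double result = partial;
            for (int src = 1; src < p; src++) {
                double incoming;
                MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                result += incoming;
            }
            printf("result = %f (master combined %d partial results)\n", result, p - 1);
        }
        MPI_Finalize();
        return 0;
    }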

Page 37:

Multiple Cores Forming a Global Sum

[Figure: multiple cores forming a global sum through a tree of partial sums. Copyright © 2010, Elsevier Inc. All rights reserved.]

• How many merges must the master do?
• Suppose 1024 processors. Then how many merges would the master do?
• Note the difference from the first approach (an MPI_Reduce sketch follows)
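
For contrast with the first approach, a sketch of the same global sum done with MPI_Reduce; typical MPI implementations combine the values in a tree, so with 1024 processes the master takes part in on the order of log2(1024) = 10 combining steps rather than 1023 receives. The local value here is a stand-in for a computed partial result:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double my_value = (double)rank;   /* stand-in for a locally computed partial sum */
        double global_sum = 0.0;

        /* Combine all partial sums at rank 0; the library arranges the combining,
           typically as a tree of log2(p) stages. */
        MPI_Reduce(&my_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f over %d processes\n", global_sum, p);
        MPI_Finalize();
        return 0;
    }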

Page 38:

Parallel Algorithms
• Problem: three helpers must mow, weed-eat, and pull weeds on a large field
• Task-level parallelism
  – Each helper performs one of the tasks over the entire field
• Data-level parallelism
  – Each helper does all three tasks over one third of the field

Page 39:

Case Study
• Millions of doubles
• Thousands of bins
• We want to create a histogram of the number of values falling in each bin
• How would we program this sequentially?
• What parallel algorithm would we use? (One shared-memory sketch follows.)
  – Using a shared memory system
  – Using a distributed memory system
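
One possible shared-memory answer, sketched with OpenMP: each thread counts into a private array and the private arrays are merged under a critical section at the end. The data values, bin count, and the [0,1) range are invented for the example; a distributed-memory version could follow the same pattern with an MPI reduction over the local counts:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const long N = 1000000;          /* "millions of doubles" (scaled down) */
        const int  B = 1000;             /* "thousands of bins"   (scaled down) */
        const double lo = 0.0, hi = 1.0; /* assumed data range                  */

        double *data = malloc(N * sizeof *data);
        long *histo = calloc(B, sizeof *histo);
        for (long i = 0; i < N; i++) data[i] = rand() / (double)RAND_MAX;

        #pragma omp parallel
        {
            long *local = calloc(B, sizeof *local);     /* per-thread counts */
            #pragma omp for
            for (long i = 0; i < N; i++) {
                int b = (int)((data[i] - lo) / (hi - lo) * B);
                if (b == B) b = B - 1;                  /* clamp the upper edge */
                local[b]++;
            }
            #pragma omp critical                        /* merge into shared histogram */
            {
                for (int b = 0; b < B; b++) histo[b] += local[b];
            }
            free(local);
        }

        printf("bin 0 holds %ld values\n", histo[0]);
        free(data); free(histo);
        return 0;
    }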

Page 40:

In This Class We
• Investigate converting sequential programs to make use of parallel facilities
• Devise algorithms that are parallel in nature
• Use C with industry-standard extensions (a minimal MPI example follows)
  – Message-Passing Interface (MPI) via mpich
  – POSIX Threads (Pthreads)
  – OpenMP
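
For orientation, the usual minimal MPI program in C; the build and launch commands below assume a typical mpich installation and may differ locally:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes   */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();                          /* shut the runtime down       */
        return 0;
    }

Typically built and run with something like: mpicc hello.c -o hello, then mpiexec -n 4 ./hello.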

Page 41:

Parallel Program Development
• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms don't lend themselves to running in parallel
• Advised steps of development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel on one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – There are parallel debuggers that can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime() (see the sketch below)
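
A small sketch of two of the timing calls listed above (gettimeofday and MPI_Wtime); the loop being timed is only a placeholder for real work:

    #include <stdio.h>
    #include <sys/time.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        struct timeval tv0, tv1;
        gettimeofday(&tv0, NULL);          /* wall-clock time, microsecond resolution */
        double w0 = MPI_Wtime();           /* MPI's portable wall-clock timer         */

        volatile double x = 0.0;           /* placeholder work to time                */
        for (long i = 0; i < 10000000L; i++) x += 1e-7;

        double w1 = MPI_Wtime();
        gettimeofday(&tv1, NULL);

        double gtod = (tv1.tv_sec - tv0.tv_sec) + (tv1.tv_usec - tv0.tv_usec) / 1e6;
        printf("gettimeofday: %.6f s, MPI_Wtime: %.6f s\n", gtod, w1 - w0);

        MPI_Finalize();
        return 0;
    }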