
Page 1: Why Parallel/Distributed Computing

Why Parallel/Distributed Computing

Sushil K. Prasad, [email protected]

Page 2: Why Parallel/Distributed Computing

What is Parallel and Distributed Computing?

Solving a single problem faster using multiple CPUs
E.g., matrix multiplication C = A x B (see the sketch after this list)
Parallel = shared memory among all CPUs
Distributed = local memory per CPU
Common issues: partitioning, synchronization, dependencies, load balancing
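A minimal sketch of the shared-memory version, assuming C with the OpenMP API that appears later in these slides (the slides themselves name no language): each row of C is independent, so the outer loop can be divided among the CPUs.

/* Sketch: parallel matrix multiplication C = A x B on shared memory.
   Rows of C are split across threads by OpenMP; compile with -fopenmp
   (without it the pragma is ignored and the code runs sequentially). */
#include <stdio.h>

#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j) ? 1.0 : 0.0;   /* identity, so C equals A */
        }

    /* each row of C depends only on A and B, never on other rows of C,
       so different CPUs can compute different rows without synchronization */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

    printf("C[0][0] = %g\n", C[0][0]);        /* expect 0, since A[0][0] = 0 */
    return 0;
}

Partitioning by rows is only one choice; blocks or columns work too, which is exactly the partitioning and load-balancing issue the slide lists.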

Page 3: Why Parallel/Distributed Computing

ENIAC (350 op/s), 1946 (U.S. Army photo)

Page 4: Why Parallel/Distributed Computing

ASCI White (10 teraops/s, 2006)

Megaflops = 10^6 flops ≈ 2^20
Giga = 10^9 (billion) ≈ 2^30
Tera = 10^12 (trillion) ≈ 2^40
Peta = 10^15 (quadrillion) ≈ 2^50
Exa = 10^18 (quintillion) ≈ 2^60

Page 5: Why Parallel/Distributed Computing

65 Years of Speed Increases

ENIAC (1946): 350 flops
Today (2011): K computer, 8 petaflops (8 x 10^15 flops)

Page 6: Why Parallel/Distributed Computing

Why Parallel and Distributed Computing? Grand Challenge Problems

Weather forecasting; global warming
Materials design – superconducting material at room temperature; nano-devices; spaceships
Organ modeling; drug discovery

Page 7: Why Parallel/Distributed Computing

Why Parallel and Distributed Computing? Physical Limitations of Circuits

Heat and speed-of-light effects
Superconducting materials to counter the heat effect
Speed-of-light effect – no solution!

Page 8: Why Parallel/Distributed Computing

Microprocessor Revolution

[Chart: speed (log scale) versus time, with curves for supercomputers, mainframes, minis, and micros]

Page 9: Why Parallel/Distributed Computing

Why Parallel and Distributed Computing? VLSI – Effect of Integration

1 M transistors is enough for full functionality – DEC's Alpha (1990s)
The rest must go into multiple CPUs per chip
Cost – multitudes of average CPUs give better FLOPS/$ than traditional supercomputers

Page 10: Why Parallel/Distributed Computing

Modern Parallel Computers

Caltech's Cosmic Cube (Seitz and Fox)
Commercial copy-cats:
  nCUBE Corporation (512 CPUs)
  Intel's Supercomputer Systems: iPSC1, iPSC2, Intel Paragon (512 CPUs)
  Thinking Machines Corporation: CM2 (65K 4-bit CPUs, 12-dimensional hypercube, SIMD); CM5 (fat-tree interconnect, MIMD)
Tianhe-1A: 4.7 petaflops, 14K Xeon X5670 CPUs and 7,168 Nvidia Tesla M2050 GPUs
K computer (2011): 8 petaflops (8 x 10^15 FLOPS), 68K 2.0 GHz 8-core CPUs, 548,352 cores

Page 11: Why Parallel/Distributed Computing

Why Parallel and Distributed Computing? Everyday Reasons

Available local networked workstations and Grid resources should be utilized
Solve compute-intensive problems faster
  Make infeasible problems feasible
  Reduce design time
Leverage large combined memory
  Solve larger problems in the same amount of time
  Improve the answer's precision
  Reduce design time
Gain competitive advantage
Exploit commodity multi-core and GPU chips
Find jobs!

Page 12: Why Parallel/Distributed Computing

Why Shared-Memory Programming?

Easier conceptual environment
Programmers are typically familiar with concurrent threads and processes sharing an address space
CPUs within multi-core chips share memory
OpenMP – an application programming interface (API) for shared-memory systems; supports higher-performance parallel programming of symmetric multiprocessors (a minimal example follows this list)
Java threads
MPI for distributed-memory programming
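A minimal OpenMP sketch in C, assuming a compiler with OpenMP support (e.g., gcc -fopenmp); it only illustrates the slide's point that threads in one process share a single address space, so no data has to be copied or sent as messages.

/* Sketch: all threads read and write the same array directly. */
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int data[N];                        /* one array, shared by every thread */

    #pragma omp parallel                /* start a team of threads */
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        /* each thread writes the elements i with i % nthreads == tid */
        for (int i = tid; i < N; i += nthreads)
            data[i] = tid;
    }                                   /* implicit barrier here */

    for (int i = 0; i < N; i++)
        printf("data[%2d] was written by thread %d\n", i, data[i]);
    return 0;
}

An MPI version of the same program would instead give each process its own private copy of the data and exchange results through explicit messages, which is the distributed-memory model the last bullet refers to.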

Page 13: Why Parallel/Distributed Computing

Seeking Concurrency

Data dependence graphs
Data parallelism
Functional parallelism
Pipelining

Page 14: Why Parallel/Distributed Computing

Data Dependence Graph

Directed graph
Vertices = tasks
Edges = dependencies (a small sketch follows this list)
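A small illustrative sketch in C (the task numbers and dependences are made up, not from the slides): the graph is stored as a Boolean matrix, and tasks whose predecessors have all finished form a "wave" that could run concurrently.

/* Sketch: data dependence graph for 4 tasks.
   dep[v][u] == true means task v needs the result of task u. */
#include <stdio.h>
#include <stdbool.h>

#define T 4

int main(void) {
    bool dep[T][T] = {{false}};
    dep[2][0] = true;                  /* task 2 needs tasks 0 and 1 */
    dep[2][1] = true;
    dep[3][2] = true;                  /* task 3 needs task 2 */

    bool done[T] = {false};
    int remaining = T;

    for (int wave = 0; remaining > 0; wave++) {
        bool ready[T];
        printf("wave %d:", wave);
        for (int v = 0; v < T; v++) {
            ready[v] = !done[v];
            for (int u = 0; u < T; u++)
                if (dep[v][u] && !done[u]) ready[v] = false;
            if (ready[v]) printf(" task %d", v);     /* can run in parallel */
        }
        printf("\n");
        for (int v = 0; v < T; v++)
            if (ready[v]) { done[v] = true; remaining--; }
    }
    return 0;
}

Output: wave 0 holds tasks 0 and 1, wave 1 holds task 2, wave 2 holds task 3, mirroring how a scheduler would exploit the edges of the graph.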

Page 15: Why Parallel/Distributed Computing

Data Parallelism

Independent tasks apply the same operation to different elements of a data set
OK to perform the operations concurrently (see the sketch after this list)
Speedup: potentially p-fold, where p is the number of processors

for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
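The same loop written in C with OpenMP (my choice of notation; the slide gives only pseudocode): all 100 iterations are independent, so the runtime can hand blocks of them to different processors.

/* Sketch: data-parallel vector addition a[i] = b[i] + c[i], i = 0..99. */
#include <stdio.h>

int main(void) {
    double a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) { b[i] = i; c[i] = 2.0 * i; }

    #pragma omp parallel for           /* iterations distributed over CPUs */
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %g\n", a[99]);     /* 99 + 198 = 297 */
    return 0;
}

With p processors the loop can run up to p times faster, which is the p-fold speedup the slide mentions.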

Page 16: Why Parallel/Distributed Computing

Functional Parallelism

Independent tasks apply different operations to different data elements:

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2

The first and second statements can run concurrently
The third and fourth statements can run concurrently (an OpenMP version follows below)
Speedup: limited by the number of concurrent sub-tasks
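A minimal sketch of the five statements using OpenMP sections (my choice of mechanism; the slide shows only pseudocode): statements with no dependence between them are placed in separate sections so they may execute on different CPUs.

/* Sketch: functional parallelism.
   a and b are independent; m and s are independent of each other,
   but both need a and b; v needs both m and s. */
#include <stdio.h>

int main(void) {
    double a, b, m, s, v;

    #pragma omp parallel sections       /* statements 1 and 2 concurrently */
    {
        #pragma omp section
        a = 2;
        #pragma omp section
        b = 3;
    }

    #pragma omp parallel sections       /* statements 3 and 4 concurrently */
    {
        #pragma omp section
        m = (a + b) / 2;
        #pragma omp section
        s = (a * a + b * b) / 2;
    }

    v = s - m * m;                      /* statement 5 must wait for m and s */
    printf("mean = %g, variance = %g\n", m, v);   /* 2.5 and 0.25 */
    return 0;
}

Only two statements ever run at once here, which is why the slide says the speedup is limited by the number of concurrent sub-tasks.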

Page 17: Why Parallel/Distributed Computing

Pipelining

Divide a process into stages
Produce several items simultaneously (a sketch follows this list)
Speedup: limited by the number of concurrent sub-tasks = the number of stages in the pipeline
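A rough sketch of a three-stage pipeline using OpenMP task dependences (OpenMP 4.0 or later; the stage functions are made-up placeholders, not from the slides): each stage handles one item at a time, but while stage 2 works on item i, stage 1 can already start item i+1, so up to three items are in flight.

/* Sketch: 3-stage software pipeline over N items.
   t1, t2, t3 serialize each stage; the s1[i]/s2[i] dependences pass
   item i from one stage to the next. */
#include <stdio.h>

#define N 8

static int stage1(int x) { return x + 1; }     /* placeholder stages */
static int stage2(int x) { return x * 2; }
static int stage3(int x) { return x - 3; }

int main(void) {
    int in[N], s1[N], s2[N], out[N];
    int t1 = 0, t2 = 0, t3 = 0;                /* one token per stage */
    for (int i = 0; i < N; i++) in[i] = i;

    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++) {
        #pragma omp task depend(inout: t1) depend(out: s1[i])
        s1[i] = stage1(in[i]);

        #pragma omp task depend(inout: t2) depend(in: s1[i]) depend(out: s2[i])
        s2[i] = stage2(s1[i]);

        #pragma omp task depend(inout: t3) depend(in: s2[i])
        out[i] = stage3(s2[i]);
    }                                          /* barrier waits for all tasks */

    for (int i = 0; i < N; i++)
        printf("%d ", out[i]);                 /* 2*(i+1) - 3 = -1 1 3 5 ... */
    printf("\n");
    return 0;
}

With S stages, at most S items are processed simultaneously, matching the slide's point that the speedup is bounded by the number of pipeline stages.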