Multiprocessors - Parallel Computing

COMP381 by M. Hamdi


Page 1: Multiprocessors -  Parallel Computing

Multiprocessors - Parallel Computing

Page 2: Multiprocessors -  Parallel Computing


• We have looked at various ways of increasing a single processor performance (Excluding VLSI techniques):

Pipelining

ILP

Super-scalars

Out-of-order execution (Scoreboarding)

VLIW

Cache (L1, L2, L3)

Interleaved memories

Compilers (Loop unrolling, branch prediction, etc.)

RAID

Etc …

• However, quite often even the best microprocessors are not fast enough for certain applications!

Processor Performance

Page 3: Multiprocessors -  Parallel Computing

Example: How far will ILP go?

• Infinite resources and fetch bandwidth, perfect branch prediction and renaming

[Figure: two charts - fraction of total cycles (%) vs. number of instructions issued (0 to 6+), and speedup vs. instructions issued per cycle (0 to 15)]

Page 4: Multiprocessors -  Parallel Computing


When Do We Need High Performance Computing?

• Case1

– To do a time-consuming operation in less time

• I am an aircraft engineer

• I need to run a simulation to test the stability of the wings at high speed

• I’d rather have the result in 5 minutes than in 5 days so that I can complete the aircraft final design sooner.

Page 5: Multiprocessors -  Parallel Computing


When Do We Need High Performance Computing?

• Case 2

– To do an operation before a tighter deadline

• I am a weather prediction agency

• I am getting input from weather stations/sensors

• I’d like to make the forecast for tomorrow before tomorrow

Page 6: Multiprocessors -  Parallel Computing

When Do We Need High Performance Computing?

• Case 3

– To do a high number of operations per second

• I am an engineer at Amazon.com

• My Web server gets 10,000 hits per second

• I'd like my Web server and my databases to handle 10,000 transactions per second so that customers do not experience bad delays

– Amazon does "process" several GBytes of data per second

Page 7: Multiprocessors -  Parallel Computing

The Need for High-Performance Computers - Just Some Examples

• Automotive design:

– Major automotive companies use large systems (500+ CPUs) for:

• CAD-CAM, crash testing, structural integrity and aerodynamics.

– Savings: approx. $1 billion per company per year.

• Semiconductor industry:

– Semiconductor firms use large systems (500+ CPUs) for:

• device electronics simulation and logic validation.

– Savings: approx. $1 billion per company per year.

• Airlines:

– System-wide logistics optimization systems on parallel systems.

– Savings: approx. $100 million per airline per year.

Page 8: Multiprocessors -  Parallel Computing

Grand Challenges

[Figure: computational performance requirements (100 MFLOPS to 1 TFLOPS) vs. storage requirements (10 MB to 1 TB) for grand-challenge applications: 2D airfoil, 48-hour weather, oil reservoir modelling, 3D plasma modelling, 72-hour weather, vehicle dynamics, chemical dynamics, pharmaceutical design, structural biology]

Page 9: Multiprocessors -  Parallel Computing

Weather Forecasting

• Suppose the whole global atmosphere is divided into cells of size 1 km × 1 km × 1 km, to a height of 10 km (10 cells high) - about 5 × 10^8 cells.

• Suppose each cell calculation requires 200 floating point operations. In one time step, 10^11 floating point operations are necessary.

• To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 floating point operations/s) - similar to the Pentium 4 - takes 10^6 seconds, or over 10 days.

• To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/sec).
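The arithmetic above can be checked with a quick sketch; the figures are the slide's assumptions, and Python is used purely as a calculator:

```python
# Back-of-the-envelope check of the weather-forecast numbers above.

cells = 5e8                 # 1 km^3 cells up to 10 km height
flops_per_cell = 200        # floating point ops per cell per time step
ops_per_step = cells * flops_per_cell          # 1e11 ops per time step

steps = 7 * 24 * 60         # 7 days at 1-minute intervals = 10,080 steps
total_ops = ops_per_step * steps               # ~1.008e15 ops

seconds_at_1gflops = total_ops / 1e9           # ~1e6 s, i.e., over 11 days
required_rate = total_ops / (5 * 60)           # rate needed to finish in 5 min

print(f"{seconds_at_1gflops:.2e} s at 1 Gflops")
print(f"{required_rate / 1e12:.2f} Tflops needed")   # ~3.4 Tflops
```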

Page 10: Multiprocessors -  Parallel Computing


Google

1. The user enters a query on a web form sent to the Google web server.

2. The web server sends the query to the Index Server cluster, which matches the query to documents.

3. The match is sent to the Doc Server cluster, which retrieves the documents to generate abstracts and cached copies.

4. The list, with abstracts, is displayed by the web server to the user, sorted (using a secret formula involving PageRank).

Page 11: Multiprocessors -  Parallel Computing


Google Requirements

• Google: search engine that scales at Internet growth rates

• Search engines: 24x7 availability

• Google: 600M queries/day, or an AVERAGE of 7,500 queries/s all day

• Response time goal: < 0.5 s per search

• Google crawls WWW and puts up new index every 2 weeks

• Storage: 5.3 billion web pages, 950 million newsgroup messages, and 925 million images indexed; millions of videos

Page 12: Multiprocessors -  Parallel Computing

Google

• Requires high amounts of computation per request

• A single query on Google (on average):

– reads hundreds of megabytes of data

– consumes tens of billions of CPU cycles

• A peak request stream on Google:

– requires an infrastructure comparable in size to the largest supercomputer installations

• Typical Google data center: 15,000 PCs (Linux), 30,000 disks: almost 3 petabytes!

• The Google application affords easy parallelization:

– Different queries can run on different processors

– A single query can use multiple processors

• because the overall index is partitioned

Page 13: Multiprocessors -  Parallel Computing


Multiprocessing

• Multiprocessing (Parallel Processing): Concurrent execution of tasks (programs) using multiple computing, memory and interconnection resources.

• Use multiple resources to solve problems faster.

• Provides an alternative to a faster clock for performance.

– Assuming a doubling of effective processor performance every 2 years, a 1024-processor system can give you the performance that it would take 20 years for a single-processor system to deliver (2^10 = 1024, i.e., 10 doublings at 2 years each).

• Using multiple processors to solve a single problem

– Divide problem into many small pieces

– Distribute these small problems to be solved by multiple processors simultaneously

Page 14: Multiprocessors -  Parallel Computing

Multiprocessing

• For the last 30+ years, multiprocessing has been seen as the best way to produce orders-of-magnitude performance gains.

– Double the number of processors, get double the performance (at less than 2 times the cost).

• It turns out that the ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption.

Page 15: Multiprocessors -  Parallel Computing

Amdahl's Law

• A parallel program has a sequential part (e.g., I/O) and a parallel part

– T1 = αT1 + (1-α)T1

– Tp = αT1 + (1-α)T1 / p

• Therefore:

Speedup(p) = 1 / (α + (1-α)/p)

= p / (αp + 1 - α)

≤ 1 / α

• Example: if a code is 10% sequential (i.e., α = 0.10), the speedup will always be lower than 1 + 90/10 = 10, no matter how many processors are used!
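The formula above is easy to tabulate; a minimal sketch of the slide's α = 0.10 example:

```python
# Amdahl's law as given above: Speedup(p) = 1 / (alpha + (1 - alpha)/p),
# where alpha is the sequential fraction of the program.

def amdahl_speedup(alpha: float, p: int) -> float:
    """Speedup on p processors for a program with sequential fraction alpha."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

# The slide's example: a 10% sequential code never exceeds a speedup of 10.
for p in (4, 16, 1000, 10**6):
    print(p, round(amdahl_speedup(0.10, p), 2))
# the speedup approaches, but never reaches, 1/alpha = 10
```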

Page 16: Multiprocessors -  Parallel Computing


Page 17: Multiprocessors -  Parallel Computing

Performance Potential Using Multiple Processors

• Amdahl's Law is pessimistic (in this case)

– Let s be the serial part

– Let p be the part that can be parallelized n ways

– Serial: SSPPPPPP

– 6 processors: SSP, P, P, P, P, P (processor 1 runs SSP; the other five each run one P concurrently)

– Speedup = 8/3 = 2.67

– Normalizing T(1) = s + p = 1: T(n) = s + p/n, so Speedup(n) = 1 / (s + p/n)

– As n → ∞, T(n) → s, and the speedup is bounded by 1/s

• Pessimistic

Page 18: Multiprocessors -  Parallel Computing

Amdahl's Law

[Figure: speedup (0 to 25) vs. % serial (10% to 99%) for 1000 CPUs, 16 CPUs, and 4 CPUs - even a modest serial fraction caps the achievable speedup]

Page 19: Multiprocessors -  Parallel Computing


Example

Page 20: Multiprocessors -  Parallel Computing

Performance Potential: Another View

• Gustafson's view (more widely adopted for multiprocessors)

– The parallel portion increases as the problem size increases

• Serial time fixed (at s)

• Parallel time proportional to problem size (true most of the time)

• Old serial: SSPPPPPP

• 6 processors: SSPPPPPP, PPPPPP, PPPPPP, PPPPPP, PPPPPP, PPPPPP (processor 1 runs SSPPPPPP; processors 2-6 each run PPPPPP)

• Hypothetical serial: SSPPPPPP PPPPPP PPPPPP PPPPPP PPPPPP PPPPPP

– Speedup = (8 + 5×6)/8 = 4.75

– T'(n) = s + n×p; as n → ∞, T'(n) → ∞, so the scaled speedup keeps growing!
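The scaled-speedup calculation above can be sketched directly, with s and p in the slide's normalized work units:

```python
# Gustafson-Barsis scaled speedup, matching the slide's example:
# serial time s is fixed, parallel work grows with the number of processors n.

def gustafson_speedup(s: int, p: int, n: int) -> float:
    """Hypothetical serial time (s + n*p) over the parallel time (s + p)."""
    return (s + n * p) / (s + p)

# Slide example: s = 2 units (SS), p = 6 units (PPPPPP), n = 6 processors.
print(gustafson_speedup(2, 6, 6))   # (2 + 36) / 8 = 4.75
```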

Page 21: Multiprocessors -  Parallel Computing

Amdahl vs. Gustafson-Barsis

[Figure: speedup (0 to 100) vs. % serial (10% to 99%) - the Gustafson-Barsis scaled speedup stays far above Amdahl's bound as the serial fraction grows]

Page 22: Multiprocessors -  Parallel Computing

TOP 5 Most Powerful Computers in the World - must be multiprocessors

http://www.top500.org/

Rank | Site / Country | Computer / Processors | Manufacturer | Rmax (GFlops)

1 | DOE/NNSA/LLNL, United States | 212,992 (PowerPC) | IBM | 478,200

2 | Forschungszentrum Juelich (FZJ), Germany | 65,536 (PowerPC) | IBM | 167,300

3 | SGI/New Mexico Computing Applications Center (NMCAC), United States | 14,336 (Intel EM64T Xeon) | SGI | 126,900

4 | Computational Research Laboratories, TATA SONS, India | 14,240 (Intel EM64T Xeon) | HP | 117,900

5 | Government Agency, Sweden | 13,728 (AMD x86_64 Opteron Dual Core) | Cray | 42,900

Page 23: Multiprocessors -  Parallel Computing


Supercomputer Style Migration (Top500)

• In the last 8 years, uniprocessors and SIMDs disappeared, while clusters and constellations grew from 3% to 80%

Cluster – whole computers interconnected using their I/O bus

Constellation – a cluster that uses an SMP multiprocessor as the building block

Page 24: Multiprocessors -  Parallel Computing

Multiprocessing (usage)

• Multiprocessor systems are being used for a wide variety of uses.

• Redundant processing (safeguard) - fault tolerance.

• Multiprocessor systems - increase throughput:

Many tasks (no communication between them)

Multi-user departmental, enterprise and web servers.

• Parallel computing systems - decrease execution time:

Execute large-scale applications in parallel.

Page 25: Multiprocessors -  Parallel Computing

Multiprocessing

• Multiple resources:

– Computers (e.g., clusters of PCs)

– CPUs (e.g., shared memory computers)

– ALUs (e.g., multiprocessors within a single chip)

– Memory

– Interconnect

• Tasks:

– Programs (coarse-grain)

– Procedures

– Instructions (fine-grain)

Different combinations result in different systems.

Page 26: Multiprocessors -  Parallel Computing

Why did the popularity of multiprocessors slow down compared to the 90s?

1. The ability to develop and deliver software for multiprocessing systems has been the impediment to wide adoption - the goal was to make programming transparent to the user (as with pipelining), which never happened. However, there have been a lot of advances here.

2. The tremendous advance of microprocessors (doubling in performance every 2 years) was able to satisfy the needs of 99% of the applications.

3. It did not make a business case: vendors were only able to sell a few parallel computers (< 200). As a result, they were not able to invest in designing cheap and powerful multiprocessors.

4. Most parallel computer vendors went bankrupt by the mid-90s - there was no business.

Page 27: Multiprocessors -  Parallel Computing

Flynn's Taxonomy of Computing

• SISD (Single Instruction, Single Data):

– Typical uniprocessor systems that we've studied throughout this course.

– Uniprocessor systems can time share and still be SISD.

• SIMD (Single Instruction, Multiple Data):

– Multiple processors simultaneously executing the same instruction on different data.

– Specialized applications (e.g., image processing).

• MIMD (Multiple Instruction, Multiple Data):

– Multiple processors autonomously executing different instructions on different data.

– Keep in mind that the processors are working together to solve a single problem.

Page 28: Multiprocessors -  Parallel Computing

SIMD Systems

[Figure: a von Neumann computer (one processor, one memory) vs. an SIMD array of processor/memory (PM) pairs connected by an interconnection network]

• One control unit

• Lockstep

• All Ps do the same or nothing

Page 29: Multiprocessors -  Parallel Computing

MIMD Shared Memory Systems

[Figure: processors with caches connected through interconnection networks to memory modules or to one global memory]

• One global memory

• Cache coherence

• All Ps have equal access to memory

Page 30: Multiprocessors -  Parallel Computing

Cache Coherent NUMA

[Figure: nodes of processor + cache + local memory connected by an interconnection network]

• Each P has part of the shared memory

• Non-uniform memory access

Page 31: Multiprocessors -  Parallel Computing

MIMD Distributed Memory Systems

[Figure: processor/memory nodes connected by interconnection networks, e.g., a 4-dimensional hypercube with binary node addresses, or a LAN/WAN]

• No shared memory

• Message passing

• Topology

Page 32: Multiprocessors -  Parallel Computing

Cluster Architecture

[Figure: nodes, each with its own processor, cache, memory, I/O, and OS, connected by an interconnection network; middleware and a programming environment span the home cluster]

Page 33: Multiprocessors -  Parallel Computing

Grids

• Dependable, consistent, pervasive, and inexpensive access to high-end computing.

• Geographically distributed platforms, connected through the Internet.

Page 34: Multiprocessors -  Parallel Computing

Multiprocessing within a chip: Many-Core

[Figure: hardware threads per chip growing from 1 to 100+ between 2003 and 2013 - a multi-core era of scalar and parallel applications (HT, multi-core), followed by a many-core era of massively parallel applications]

• Intel predicts 100's of cores on a chip in 2015

Page 35: Multiprocessors -  Parallel Computing

SIMD Parallel Computing

[Figure: a controller broadcasts a single instruction stream (e.g., Add r1, b) to multiple processors, each executing it on its own data stream]

• It can be a stand-alone multiprocessor

• Or embedded in a single processor for specific applications (MMX)

Page 36: Multiprocessors -  Parallel Computing


SIMD Applications

• Applications:

• Database, image processing, and signal processing.

• Image processing maps very naturally onto SIMD systems.

» Each processor (Execution unit) performs operations on a single pixel or neighborhood of pixels.

» The operations performed are fairly straightforward and simple.

» Data could be streamed into the system and operated on in real-time or close to real-time.

Page 37: Multiprocessors -  Parallel Computing

SIMD Operations

• Image processing on SIMD systems.

– Sequential pixel operations take a very long time to perform.

• A 512x512 image would require 262,144 iterations through a sequential loop, with each loop executing 10 instructions. That translates to 2,621,440 clock cycles (if each instruction is a single cycle), plus loop overhead.

[Figure: 512x512 image - each pixel is operated on sequentially, one after another]

Page 38: Multiprocessors -  Parallel Computing

SIMD Operations

• Image processing on SIMD systems.

– On a SIMD system with 64x64 processors (e.g., very simple ALUs), the same operations would take 640 cycles, where each processor operates on an 8x8 set of pixels, plus loop overhead.

[Figure: 512x512 image - each processor operates on an 8x8 set of pixels in parallel]

Speedup due to parallelism: 2,621,440/640 = 4096 = 64x64 (the number of processors), loop overhead ignored.

Page 39: Multiprocessors -  Parallel Computing

SIMD Operations

• Image processing on SIMD systems.

– On a SIMD system with 512x512 processors (which is not unreasonable on SIMD machines), the same operation would take 10 cycles.

[Figure: 512x512 image - each processor operates on a single pixel in parallel]

Speedup due to parallelism: 2,621,440/10 = 262,144 = 512x512 (the number of processors)!

Notice: no loop overhead!
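The three cycle counts above follow from simple arithmetic; a sketch, ignoring loop overhead as the slides do:

```python
# Cycle counts for the three SIMD image-processing scenarios above
# (10 single-cycle instructions per pixel, loop overhead ignored).

PIXELS = 512 * 512                  # 262,144 pixels
CYCLES_PER_PIXEL = 10

sequential = PIXELS * CYCLES_PER_PIXEL            # 2,621,440 cycles

# 64x64 processors: each handles an 8x8 tile, i.e., 64 pixels.
pixels_per_proc = PIXELS // (64 * 64)             # 64 pixels per processor
simd_4k = pixels_per_proc * CYCLES_PER_PIXEL      # 640 cycles

# 512x512 processors: one pixel each.
simd_full = CYCLES_PER_PIXEL                      # 10 cycles

print(sequential // simd_4k)      # speedup with 64x64 processors
print(sequential // simd_full)    # speedup with 512x512 processors
```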

Page 40: Multiprocessors -  Parallel Computing

Pentium MMX: MultiMedia eXtensions

• 57 new instructions

• Eight 64-bit wide MMX registers

• First available in 1997

• Supported on:

– Intel Pentium-MMX, Pentium II, Pentium III, Pentium IV

– AMD K6, K6-2, K6-3, K7 (and later)

– Cyrix M2, MMX-enhanced MediaGX, Jalapeno (and later)

• Gives a large speedup in many multimedia applications

Page 41: Multiprocessors -  Parallel Computing

MMX SIMD Operations

• Example: consider image pixel data represented as bytes.

– With MMX, eight of these pixels can be packed together in a 64-bit quantity and moved into an MMX register.

– An MMX instruction performs the arithmetic or logical operation on all eight elements in parallel.

• PADD(B/W/D): Addition

PADDB MM1, MM2 adds the 64-bit contents of MM2 to MM1, byte-by-byte; any carries generated are dropped, e.g., byte A0h + 70h = 10h.

• PSUB(B/W/D): Subtraction
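The wrap-around behavior of PADDB described above can be modeled in a few lines; this is a sketch of the instruction's effect on packed bytes, not actual MMX code:

```python
# A model of PADDB's effect: add two 64-bit packed registers
# byte-by-byte, dropping carries (modulo-256 wrap-around).

def paddb(mm1: bytes, mm2: bytes) -> bytes:
    """Byte-wise addition of two 8-byte packed values, carries dropped."""
    assert len(mm1) == len(mm2) == 8
    return bytes((a + b) & 0xFF for a, b in zip(mm1, mm2))

# The slide's example: byte A0h + 70h wraps around to 10h.
r = paddb(bytes([0xA0] * 8), bytes([0x70] * 8))
print(r.hex())   # '1010101010101010'
```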

Page 42: Multiprocessors -  Parallel Computing

MMX: Image Dissolve Using Alpha Blending

• Example: MMX instructions speed up image composition.

• A flower will dissolve into a swan.

• Alpha (a standard scheme) determines the intensity of the flower.

• At full intensity, the flower's 8-bit alpha value is FFh, or 255.

• The equation below calculates each pixel:

Result_pixel = Flower_pixel * (alpha/255) + Swan_pixel * [1 - (alpha/255)]

For alpha = 230, the resulting pixel is 90% flower and 10% swan.
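The blending equation above, applied to one 8-bit channel value per image; a scalar sketch with made-up sample pixel values, not the MMX-vectorized version:

```python
# The per-pixel alpha-blend equation from the slide:
# Result = Flower*(alpha/255) + Swan*(1 - alpha/255), per 8-bit channel.

def blend(flower: int, swan: int, alpha: int) -> int:
    """Blend two 8-bit channel values with an 8-bit alpha, rounded."""
    a = alpha / 255
    return round(flower * a + swan * (1 - a))

# alpha = 230 gives roughly 90% flower and 10% swan, as the slide says.
print(blend(200, 100, 230))   # 190
```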

Page 43: Multiprocessors -  Parallel Computing

SIMD Multiprocessing

• It is easy to write applications for SIMD processors.

• The applications are limited (image processing, computer vision, etc.).

• It is frequently used to speed up specific applications (e.g., the graphics co-processor in SGI computers).

• In the late 80s and early 90s, many SIMD machines were commercially available (e.g., the Connection Machine had 64K ALUs, and the MasPar had 16K ALUs).

Page 44: Multiprocessors -  Parallel Computing

Flynn's Taxonomy of Computing

• MIMD (Multiple Instruction, Multiple Data):

– Multiple processors autonomously executing different instructions on different data.

– Keep in mind that the processors are working together to solve a single problem.

• This is a more general form of multiprocessing and can be used in numerous applications.

Page 45: Multiprocessors -  Parallel Computing

MIMD Architecture

• Unlike SIMD, a MIMD computer works asynchronously.

• Shared memory (tightly coupled) MIMD

• Distributed memory (loosely coupled) MIMD

[Figure: processors A, B, and C, each with its own instruction stream and its own data input and output streams]

Page 46: Multiprocessors -  Parallel Computing

Shared Memory Multiprocessor

[Figure: four processors, each with registers and caches, connected through a chipset to shared memory and to disk & other I/O]

• Memory: centralized, with Uniform Memory Access time ("UMA"), bus interconnect, and I/O

• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

Page 47: Multiprocessors -  Parallel Computing

Shared Memory Programming Model

[Figure: processes on different processors communicate through a shared variable X in memory, via load(X) and store(X)]

Page 48: Multiprocessors -  Parallel Computing

Shared Memory Model

[Figure: virtual address spaces for a collection of processes communicating via shared addresses - each process P0..Pn has a private portion of its address space, plus a shared portion mapped to common physical addresses in the machine's physical address space, accessed with loads and stores]

Page 49: Multiprocessors -  Parallel Computing

Cache Coherence Problem

• Processor 3 does not see the value written by processor 0.

[Figure: four processors with caches over a shared memory holding X = 42; one processor writes X = 17 into its cache, but a later read of X by another processor still returns the stale 42]

Page 50: Multiprocessors -  Parallel Computing

Write Through Does Not Help

• Processor 3 sees 42 in its cache (it does not get the correct value (17) from memory).

[Figure: the same four-processor system; even though write-through updates memory to X = 17, the other processor's cache still holds the stale copy X = 42]

Page 51: Multiprocessors -  Parallel Computing

One Solution: Shared Cache

[Figure: processors P1..Pn connected through a switch to a shared first-level cache and interleaved main memory]

Advantages:

• Cache placement identical to a single cache

– only one copy of any cached block

Disadvantages:

• Bandwidth limitation

Page 52: Multiprocessors -  Parallel Computing

Limits of the Shared Cache Approach

[Figure: processors with caches sharing a bus to memory modules and I/O; 5.2 GB/s is needed per processor, vs. roughly 140 MB/s of typical bus bandwidth]

Assume:

• 1 GHz processor w/o cache

=> 4 GB/s instruction BW per processor (32-bit)

=> 1.2 GB/s data BW at 30% load-store

• Need 5.2 GB/s of bus bandwidth per processor!

• Typical bus bandwidth can hardly support one processor.

Page 53: Multiprocessors -  Parallel Computing

Distributed Cache: Snoopy Cache-Coherence Protocols

• The bus is a broadcast medium & caches know what they have

– bus protocol: arbitration, command/addr, data

=> Every device observes every transaction

[Figure: processors P1..Pn with caches on a shared bus to memory and I/O devices; each cache line carries state, address, and data, and each cache snoops every cache-memory transaction on the bus]

Page 54: Multiprocessors -  Parallel Computing


Snooping Cache Coherency

• The cache controller "snoops" all transactions on the shared bus.

• A transaction is relevant if it involves a cache block currently contained in this cache.

• It then takes action to ensure coherence (invalidate, update, or supply the value).

Page 55: Multiprocessors -  Parallel Computing


Hardware Cache Coherence

• Write-invalidate: when a processor writes X → X', all other cached copies of X are marked invalid; a later read by another processor must fetch the new value.

• Write-update (also called distributed write): when a processor writes X → X', the new value X' is broadcast over the interconnect and all other cached copies are updated in place.

[Figure: two diagrams of caches and memory over an interconnection network (ICN) - one showing remote copies of X invalidated, one showing them updated to X']
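The write-invalidate policy above can be sketched as a toy model; this is a deliberately simplified illustration (write-through, no block states such as MESI), not a real protocol:

```python
# Toy model of write-invalidate coherence: on a write, every other
# cache's copy of the block is invalidated; stale reads miss and refetch.

class Cache:
    def __init__(self):
        self.data = {}                  # addr -> value (valid copies only)

class Bus:
    def __init__(self, n_caches):
        self.memory = {}
        self.caches = [Cache() for _ in range(n_caches)]

    def read(self, cpu, addr):
        c = self.caches[cpu]
        if addr not in c.data:          # miss: fetch from memory
            c.data[addr] = self.memory.get(addr, 0)
        return c.data[addr]

    def write(self, cpu, addr, value):
        for i, c in enumerate(self.caches):
            if i != cpu:                # broadcast an invalidate on the bus
                c.data.pop(addr, None)
        self.caches[cpu].data[addr] = value
        self.memory[addr] = value       # write-through, for simplicity

# Replays the slides' X = 42 / X = 17 scenario, now coherently.
bus = Bus(4)
bus.memory["X"] = 42
print(bus.read(3, "X"))    # 42
bus.write(0, "X", 17)      # invalidates processor 3's cached copy
print(bus.read(3, "X"))    # 17 - processor 3 refetches the new value
```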

Page 56: Multiprocessors -  Parallel Computing

Limits of Bus-Based Shared Memory

[Figure: processors with caches sharing a bus to memory modules and I/O]

Assume:

• 1 GHz processor w/o cache

=> 4 GB/s instruction BW per processor (32-bit)

=> 1.2 GB/s data BW at 30% load-store

Suppose a 98% instruction hit rate and a 95% data hit rate:

=> 80 MB/s instruction BW per processor

=> 60 MB/s data BW per processor

=> 140 MB/s combined BW per processor

Assuming 1 GB/s of bus bandwidth, 8 processors will saturate the memory bus.
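The bandwidth arithmetic above, restated as a sketch using the slide's assumed figures:

```python
# Bus-bandwidth arithmetic from the slide: 1 GHz processor, 32-bit
# instructions, 30% load-store, 98% instruction / 95% data hit rates.

inst_bw = 4_000           # MB/s of instruction fetch per processor
data_bw = 1_200           # MB/s of data traffic at 30% load-store

inst_miss = inst_bw * (1 - 0.98)      # 80 MB/s reaches the bus
data_miss = data_bw * (1 - 0.95)      # 60 MB/s reaches the bus
per_proc = inst_miss + data_miss      # 140 MB/s combined per processor

bus_bw = 1_000                        # MB/s, i.e., a 1 GB/s bus
print(round(per_proc))                # combined bus traffic per processor
print(f"{bus_bw / per_proc:.1f}")     # processors a 1 GB/s bus can feed
```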

Page 57: Multiprocessors -  Parallel Computing

Scalable Shared Memory Architectures: Crossbar Switch

[Figure: a crossbar switch connecting memory modules to processor/cache nodes with I/O]

• Used in the SUN Enterprise 10000

Page 58: Multiprocessors -  Parallel Computing

Scalable Shared Memory Architectures

• Used in the IBM SP multiprocessor

[Figure: a multistage interconnection network connecting processors P0..P7 to memory modules M0..M7, with 3-bit addresses (000 to 111) routed through the switch stages]

Page 59: Multiprocessors -  Parallel Computing

Approaches to Building Parallel Machines

[Figure: three organizations, in order of increasing scale - (1) shared cache: processors P1..Pn through a switch to a shared first-level cache and interleaved main memory; (2) centralized shared memory: processors with private caches over an interconnection network to shared memory modules; (3) distributed memory: processor + cache + local memory nodes over an interconnection network]