Chapter 7 Multicores, Multiprocessors, and Clusters




Chapter 7 — Multicores, Multiprocessors, and Clusters — 2

Introduction

Goal: connecting multiple computers to get higher performance

Multiprocessors

Scalability, availability, power efficiency

Job-level (process-level) parallelism: high throughput for independent jobs

Parallel processing program: single program run on multiple processors

Multicore microprocessors: chips with multiple processors (cores)

§7.1 Introduction


Chapter 7 — Multicores, Multiprocessors, and Clusters — 3

Hardware and Software

Hardware:

Serial: e.g., Pentium 4

Parallel: e.g., quad-core Xeon e5345

Software:

Sequential: e.g., matrix multiplication

Concurrent: e.g., operating system

Sequential/concurrent software can run on serial/parallel hardware

Challenge: making effective use of parallel hardware


Chapter 7 — Multicores, Multiprocessors, and Clusters — 4

What We’ve Already Covered

§2.11: Parallelism and Instructions

Synchronization

§3.6: Parallelism and Computer Arithmetic

Associativity

§4.10: Parallelism and Advanced Instruction-Level Parallelism

§5.8: Parallelism and Memory Hierarchies

Cache Coherence

§6.9: Parallelism and I/O

Redundant Arrays of Inexpensive Disks


Chapter 7 — Multicores, Multiprocessors, and Clusters — 5

Parallel Programming

Parallel software is the problem

Need to get significant performance improvement

Otherwise, just use a faster uniprocessor, since it’s easier!

Difficulties:

Partitioning

Coordination

Communications overhead

§7.2 The Difficulty of Creating Parallel Processing Programs


Chapter 7 — Multicores, Multiprocessors, and Clusters — 6

Amdahl’s Law

Sequential part can limit speedup

Example: 100 processors, 90× speedup?

Tnew = Tparallelizable/100 + Tsequential

Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90

Solving: Fparallelizable = 0.999

Need sequential part to be 0.1% of original time

Chapter 7 — Multicores, Multiprocessors, and Clusters — 7

Scaling Example

Workload: sum of 10 scalars, and 10 × 10 matrix sum

Speed up from 10 to 100 processors

Single processor: Time = (10 + 100) × tadd

10 processors: Time = 10 × tadd + 100/10 × tadd = 20 × tadd

Speedup = 110/20 = 5.5 (55% of potential)

100 processors: Time = 10 × tadd + 100/100 × tadd = 11 × tadd

Speedup = 110/11 = 10 (10% of potential)

Assumes load can be balanced across processors


Chapter 7 — Multicores, Multiprocessors, and Clusters — 8

Scaling Example (cont)

What if matrix size is 100 × 100?

Single processor: Time = (10 + 10000) × tadd

10 processors: Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd

Speedup = 10010/1010 = 9.9 (99% of potential)

100 processors: Time = 10 × tadd + 10000/100 × tadd = 110 × tadd

Speedup = 10010/110 = 91 (91% of potential)

Assuming load balanced
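The same arithmetic written out as code (a small C sketch, not in the original deck; time is measured in units of tadd):

#include <stdio.h>

/* Time to sum s scalars serially plus an n-element matrix sum
   divided across p processors, in units of t_add. */
double time_units(int s, int n, int p) {
    return s + (double)n / p;
}

int main(void) {
    int s = 10, n = 10000;                /* 10 scalars + 100 x 100 matrix */
    double t1 = time_units(s, n, 1);
    for (int p = 10; p <= 100; p *= 10)   /* prints 9.9 and 91.0 */
        printf("p = %3d: speedup = %.1f\n", p, t1 / time_units(s, n, p));
    return 0;
}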


Chapter 7 — Multicores, Multiprocessors, and Clusters — 9

Strong vs Weak Scaling

Strong scaling: problem size fixed

As in example

Weak scaling: problem size proportional to number of processors

10 processors, 10 × 10 matrix: Time = 20 × tadd

100 processors, 32 × 32 matrix (≈1000 elements): Time = 10 × tadd + 1000/100 × tadd = 20 × tadd

Constant performance in this example


Chapter 7 — Multicores, Multiprocessors, and Clusters — 10

Shared Memory

SMP: shared memory multiprocessor

Hardware provides a single physical address space for all processors

Synchronize shared variables using locks

Memory access time: UMA (uniform) vs. NUMA (nonuniform)

§7.3 Shared Memory Multiprocessors


Chapter 7 — Multicores, Multiprocessors, and Clusters — 11

Example: Sum Reduction

Sum 100,000 numbers on a 100-processor UMA machine

Each processor has ID: 0 ≤ Pn ≤ 99

Partition 1000 numbers per processor

Initial summation on each processor:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];

Now need to add these partial sums

Reduction: divide and conquer

Half the processors add pairs, then a quarter, …

Need to synchronize between reduction steps


Chapter 7 — Multicores, Multiprocessors, and Clusters — 12

Example: Sum Reduction

half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
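The pseudocode above relies on a synch() barrier between steps. As a runnable illustration (a sketch assuming a POSIX system with pthread barriers, compiled with -lpthread; not part of the original slides), the same reduction can be simulated with one thread per "processor":

#include <pthread.h>
#include <stdio.h>

#define NPROC 100
#define N     100000

static double A[N];
static double sum[NPROC];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long Pn = (long)arg;
    int half = NPROC;

    /* Phase 1: each "processor" sums its own 1000-element slice. */
    sum[Pn] = 0;
    for (long i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
        sum[Pn] += A[i];

    /* Phase 2: tree reduction; the barrier plays the role of synch(). */
    do {
        pthread_barrier_wait(&barrier);
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];   /* odd case: P0 picks up the stray */
        half = half / 2;               /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half != 1);
    return NULL;
}

int main(void) {
    pthread_t t[NPROC];
    for (int i = 0; i < N; i++) A[i] = 1.0;   /* expected total: 100000 */
    pthread_barrier_init(&barrier, NULL, NPROC);
    for (long p = 0; p < NPROC; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    printf("total = %.0f\n", sum[0]);
    pthread_barrier_destroy(&barrier);
    return 0;
}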


Chapter 7 — Multicores, Multiprocessors, and Clusters — 13

Message Passing

Each processor has a private physical address space

Hardware sends/receives messages between processors

§7.4 Clusters and Other Message-Passing Multiprocessors


Chapter 7 — Multicores, Multiprocessors, and Clusters — 14

Loosely Coupled Clusters

Network of independent computers

Each has private memory and OS

Connected using the I/O system, e.g., Ethernet/switch, Internet

Suitable for applications with independent tasks: web servers, databases, simulations, …

High availability, scalable, affordable

Problems:

Administration cost (prefer virtual machines)

Low interconnect bandwidth, cf. processor/memory bandwidth on an SMP


Chapter 7 — Multicores, Multiprocessors, and Clusters — 15

Sum Reduction (Again)

Sum 100,000 numbers on 100 processors

First distribute 1000 numbers to each

Then do partial sums:

sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum = sum + AN[i];

Reduction:

Half the processors send, the other half receive and add

Then a quarter send, a quarter receive and add, …


Chapter 7 — Multicores, Multiprocessors, and Clusters — 16

Sum Reduction (Again)

Given send() and receive() operations:

limit = 100; half = 100;    /* 100 processors */
repeat
    half = (half+1)/2;      /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;           /* upper limit of senders */
until (half == 1);          /* exit with final sum */

Send/receive also provide synchronization

Assumes send/receive take similar time to addition
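For comparison, a hedged sketch of the same tree reduction written against the standard MPI API (not from the original slides; in practice a single MPI_Reduce call would do the whole job). Run with mpirun -np 100:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Phase 1: each rank sums its private 1000-element slice
       (all ones here, so the expected total is 1000 * nprocs). */
    double sum = 0.0;
    for (int i = 0; i < 1000; i++)
        sum += 1.0;

    /* Phase 2: the slide's send/receive tree, with explicit sources. */
    int limit = nprocs, half = nprocs;
    do {
        half = (half + 1) / 2;              /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
        limit = half;                       /* upper limit of senders */
    } while (half != 1);

    if (Pn == 0) printf("total = %.0f\n", sum);
    MPI_Finalize();
    return 0;
}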


Chapter 7 — Multicores, Multiprocessors, and Clusters — 17

Grid Computing

Separate computers interconnected by long-haul networks

E.g., Internet connections

Work units farmed out, results sent back

Can make use of idle time on PCs, e.g., SETI@home, World Community Grid


Chapter 7 — Multicores, Multiprocessors, and Clusters — 18

Multithreading

Performing multiple threads of execution in parallel

Replicate registers, PC, etc.

Fast switching between threads

Fine-grain multithreading:

Switch threads after each cycle

Interleave instruction execution

If one thread stalls, others are executed

Coarse-grain multithreading:

Only switch on long stall (e.g., L2-cache miss)

Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)

§7.5 Hardware Multithreading


Chapter 7 — Multicores, Multiprocessors, and Clusters — 19

Simultaneous Multithreading (SMT)

In a multiple-issue, dynamically scheduled processor

Schedule instructions from multiple threads

Instructions from independent threads execute when function units are available

Within threads, dependencies handled by scheduling and register renaming

Example: Intel Pentium 4 HT

Two threads: duplicated registers, shared function units and caches


Chapter 7 — Multicores, Multiprocessors, and Clusters — 20

Multithreading Example


Chapter 7 — Multicores, Multiprocessors, and Clusters — 21

Future of Multithreading

Will it survive? In what form?

Power considerations ⇒ simplified microarchitectures

Simpler forms of multithreading

Tolerating cache-miss latency: thread switch may be most effective

Multiple simple cores might share resources more effectively


Chapter 7 — Multicores, Multiprocessors, and Clusters — 22

Instruction and Data Streams

An alternate classification:

§7.6 SISD, MIMD, SIMD, SPMD, and Vector

                            Data Streams
                            Single                     Multiple
Instruction    Single       SISD: Intel Pentium 4      SIMD: SSE instructions of x86
Streams        Multiple     MISD: no examples today    MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data

A parallel program on a MIMD computer

Conditional code for different processors


Chapter 7 — Multicores, Multiprocessors, and Clusters — 23

SIMD

Operate elementwise on vectors of data

E.g., MMX and SSE instructions in x86: multiple data elements in 128-bit wide registers

All processors execute the same instruction at the same time

Each with different data address, etc.

Simplifies synchronization

Reduced instruction control hardware

Works best for highly data-parallel applications
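As a concrete illustration (a hedged sketch, not from the slides), a four-wide elementwise add using the x86 SSE intrinsics:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, z[4];

    /* One instruction operates on four 32-bit floats packed
       into a single 128-bit XMM register. */
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(z, _mm_add_ps(vx, vy));

    for (int i = 0; i < 4; i++)
        printf("%g ", z[i]);   /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}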


Chapter 7 — Multicores, Multiprocessors, and Clusters — 24

Vector Processors

Highly pipelined function units

Stream data between vector registers and the function units

Data collected from memory into registers

Results stored from registers to memory

Example: Vector extension to MIPS

32 × 64-element registers (64-bit elements)

Vector instructions:

lv, sv: load/store vector

addv.d: add vectors of double

addvs.d: add scalar to each element of vector of double

Significantly reduces instruction-fetch bandwidth


Chapter 7 — Multicores, Multiprocessors, and Clusters — 25

Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:

        l.d    $f0,a($sp)       ;load scalar a
        addiu  r4,$s0,#512      ;upper bound of what to load
loop:   l.d    $f2,0($s0)       ;load x(i)
        mul.d  $f2,$f2,$f0      ;a × x(i)
        l.d    $f4,0($s1)       ;load y(i)
        add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
        s.d    $f4,0($s1)       ;store into y(i)
        addiu  $s0,$s0,#8       ;increment index to x
        addiu  $s1,$s1,#8       ;increment index to y
        subu   $t0,r4,$s0       ;compute bound
        bne    $t0,$zero,loop   ;check if done

Vector MIPS code:

        l.d     $f0,a($sp)      ;load scalar a
        lv      $v1,0($s0)      ;load vector x
        mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
        lv      $v3,0($s1)      ;load vector y
        addv.d  $v4,$v2,$v3     ;add y to product
        sv      $v4,0($s1)      ;store the result
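Both listings implement the same loop; for reference, the C equivalent (a sketch assuming 64 double-precision elements, which matches the 512-byte bound in the scalar listing):

/* DAXPY: y = a*x + y over 64 doubles (64 x 8 bytes = 512 bytes,
   the upper bound computed in the scalar MIPS code). */
void daxpy(double a, const double *x, double *y) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}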


Chapter 7 — Multicores, Multiprocessors, and Clusters — 26

Vector vs. Scalar

Vector architectures and compilers:

Simplify data-parallel programming

Explicit statement of absence of loop-carried dependences: reduced checking in hardware

Regular access patterns benefit from interleaved and burst memory

Avoid control hazards by avoiding loops

More general than ad-hoc media extensions (such as MMX, SSE)

Better match with compiler technology


Chapter 7 — Multicores, Multiprocessors, and Clusters — 27

History of GPUs

Early video cards:

Frame buffer memory with address generation for video output

3D graphics processing

Originally high-end computers (e.g., SGI)

Moore’s Law ⇒ lower cost, higher density

3D graphics cards for PCs and game consoles

Graphics Processing Units: processors oriented to 3D graphics tasks

Vertex/pixel processing, shading, texture mapping, rasterization

§7.7 Introduction to Graphics Processing Units


Chapter 7 — Multicores, Multiprocessors, and Clusters — 28

Graphics in the System


Chapter 7 — Multicores, Multiprocessors, and Clusters — 29

GPU Architectures

Processing is highly data-parallel

GPUs are highly multithreaded

Use thread switching to hide memory latency: less reliance on multi-level caches

Graphics memory is wide and high-bandwidth

Trend toward general-purpose GPUs

Heterogeneous CPU/GPU systems

CPU for sequential code, GPU for parallel code

Programming languages/APIs:

DirectX, OpenGL

C for Graphics (Cg), High Level Shader Language (HLSL)

Compute Unified Device Architecture (CUDA)


Chapter 7 — Multicores, Multiprocessors, and Clusters — 30

Example: NVIDIA Tesla

Streaming multiprocessor: 8 × streaming processors


Chapter 7 — Multicores, Multiprocessors, and Clusters — 31

Example: NVIDIA Tesla

Streaming Processors:

Single-precision FP and integer units

Each SP is fine-grained multithreaded

Warp: group of 32 threads

Executed in parallel, SIMD style: 8 SPs × 4 clock cycles

Hardware contexts for 24 warps

Registers, PCs, …


Chapter 7 — Multicores, Multiprocessors, and Clusters — 32

Classifying GPUs

Don’t fit nicely into the SIMD/MIMD model

Conditional execution in a thread allows an illusion of MIMD

But with performance degradation

Need to write general purpose code with care

                                Static: Discovered         Dynamic: Discovered
                                at Compile Time            at Runtime
Instruction-Level Parallelism   VLIW                       Superscalar
Data-Level Parallelism          SIMD or Vector             Tesla Multiprocessor


Chapter 7 — Multicores, Multiprocessors, and Clusters — 33

Interconnection Networks

Network topologies:

Arrangements of processors, switches, and links

§7.8 Introduction to Multiprocessor Network Topologies

Figure: Bus, Ring, 2D Mesh, N-cube (N = 3), Fully connected


Chapter 7 — Multicores, Multiprocessors, and Clusters — 34

Multistage Networks


Chapter 7 — Multicores, Multiprocessors, and Clusters — 35

Network Characteristics

Performance:

Latency per message (unloaded network)

Throughput:

Link bandwidth

Total network bandwidth

Bisection bandwidth

Congestion delays (depending on traffic)

Cost

Power

Routability in silicon


Chapter 7 — Multicores, Multiprocessors, and Clusters — 36

Parallel Benchmarks

Linpack: matrix linear algebra

SPECrate: parallel run of SPEC CPU programs (job-level parallelism)

SPLASH: Stanford Parallel Applications for Shared Memory; mix of kernels and applications, strong scaling

NAS (NASA Advanced Supercomputing) suite: computational fluid dynamics kernels

PARSEC (Princeton Application Repository for Shared Memory Computers) suite

Multithreaded applications using Pthreads and OpenMP

§7.9 Multiprocessor Benchmarks


Chapter 7 — Multicores, Multiprocessors, and Clusters — 37

Code or Applications?

Traditional benchmarks:

Fixed code and data sets

Parallel programming is evolving

Should algorithms, programming languages, and tools be part of the system?

Compare systems, provided they implement a given application

E.g., Linpack, Berkeley Design Patterns

Would foster innovation in approaches to parallelism


Chapter 7 — Multicores, Multiprocessors, and Clusters — 38

Modeling Performance

Assume performance metric of interest is achievable GFLOPs/sec

Measured using computational kernels from Berkeley Design Patterns

Arithmetic intensity of a kernel: FLOPs per byte of memory accessed

For a given computer, determine:

Peak GFLOPS (from data sheet)

Peak memory bytes/sec (using Stream benchmark)

§7.10 Roofline: A Simple Performance Model


Chapter 7 — Multicores, Multiprocessors, and Clusters — 39

Roofline Diagram

Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
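The formula as a small C sketch (the machine numbers below are hypothetical, chosen only to show the two regimes of the roofline):

#include <stdio.h>

/* Roofline: attainable GFLOP/s is whichever is lower, the memory
   bound (bandwidth x arithmetic intensity) or peak FP throughput. */
double roofline(double peak_gflops, double peak_bw_gbs, double intensity) {
    double mem_bound = peak_bw_gbs * intensity;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    /* Hypothetical machine: 75 GFLOP/s peak, 20 GB/s memory bandwidth.
       Low intensity is memory-bound; high intensity hits the flat roof. */
    for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
        printf("intensity %6.3f -> %5.1f GFLOP/s\n",
               ai, roofline(75.0, 20.0, ai));
    return 0;
}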


Chapter 7 — Multicores, Multiprocessors, and Clusters — 40

Comparing Systems

Example: Opteron X2 vs. Opteron X4

2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz

Same memory system

To get higher performance on X4 than X2

Need high arithmetic intensity

Or working set must fit in X4’s 2 MB L3 cache


Chapter 7 — Multicores, Multiprocessors, and Clusters — 41

Optimizing Performance

Optimize FP performance:

Balance adds & multiplies

Improve superscalar ILP and use of SIMD instructions

Optimize memory usage:

Software prefetch

Avoid load stalls

Memory affinity: avoid non-local data accesses


Chapter 7 — Multicores, Multiprocessors, and Clusters — 42

Optimizing Performance

Choice of optimization depends on arithmetic intensity of code

Arithmetic intensity is not always fixed

May scale with problem size

Caching reduces memory accesses

Increases arithmetic intensity


Chapter 7 — Multicores, Multiprocessors, and Clusters — 43

Four Example Systems

§7.11 Real Stuff: Benchmarking Four Multicores

2 × quad-core Intel Xeon e5345 (Clovertown)

2 × quad-core AMD Opteron X4 2356 (Barcelona)


Chapter 7 — Multicores, Multiprocessors, and Clusters — 44

Four Example Systems

2 × oct-core IBM Cell QS20

2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)


Chapter 7 — Multicores, Multiprocessors, and Clusters — 45

And Their Rooflines

Kernels: SpMV (left), LBMHD (right)

Some optimizations change arithmetic intensity

x86 systems have higher peak GFLOPs

But harder to achieve, given memory bandwidth


Chapter 7 — Multicores, Multiprocessors, and Clusters — 46

Performance on SpMV

Sparse matrix/vector multiply

Irregular memory accesses, memory bound

Arithmetic intensity: 0.166 before memory optimization, 0.25 after

Xeon vs. Opteron: similar peak FLOPS; Xeon limited by shared FSBs and chipset

UltraSPARC/Cell vs. x86: 20–30 vs. 75 peak GFLOPs; more cores and memory bandwidth


Chapter 7 — Multicores, Multiprocessors, and Clusters — 47

Performance on LBMHD

Fluid dynamics: structured grid over time steps

Each point: 75 FP read/write, 1300 FP ops

Arithmetic intensity: 0.70 before optimization, 1.07 after

Opteron vs. UltraSPARC: more powerful cores, not limited by memory bandwidth

Xeon vs. others: still suffers from memory bottlenecks


Chapter 7 — Multicores, Multiprocessors, and Clusters — 48

Achieving Performance

Compare naïve vs. optimized code

If naïve code performs well, it’s easier to write high performance code for the system

System             Kernel   Naïve GFLOPs/sec   Optimized GFLOPs/sec   Naïve as % of optimized
Intel Xeon         SpMV     1.0                1.5                    64%
                   LBMHD    4.6                5.6                    82%
AMD Opteron X4     SpMV     1.4                3.6                    38%
                   LBMHD    7.1                14.1                   50%
Sun UltraSPARC T2  SpMV     3.5                4.1                    86%
                   LBMHD    9.7                10.5                   93%
IBM Cell QS20      SpMV     n/a                6.4                    0%
                   LBMHD    n/a                16.7                   0%

(Naïve code not feasible on the Cell.)


Chapter 7 — Multicores, Multiprocessors, and Clusters — 49

Fallacies

Amdahl’s Law doesn’t apply to parallel computers

Since we can achieve linear speedup

But only on applications with weak scaling

Peak performance tracks observed performance

Marketers like this approach!

But compare Xeon with others in example

Need to be aware of bottlenecks

§7.12 Fallacies and Pitfalls


Chapter 7 — Multicores, Multiprocessors, and Clusters — 50

Pitfalls

Not developing the software to take account of a multiprocessor architecture

Example: using a single lock for a shared composite resource

Serializes accesses, even if they could be done in parallel

Use finer-granularity locking
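As a sketch of the fix (hypothetical C/pthreads code, not from the slides): give each independent part of the composite resource its own lock, so updates to different parts can proceed in parallel.

#include <pthread.h>

#define NBUCKETS 64

/* A shared hash table. With one global lock every update would
   serialize; a lock per bucket lets updates to different buckets
   run in parallel. */
struct table {
    pthread_mutex_t lock[NBUCKETS];
    long            count[NBUCKETS];
};

void table_init(struct table *t) {
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->lock[i], NULL);
        t->count[i] = 0;
    }
}

void table_add(struct table *t, unsigned key) {
    unsigned b = key % NBUCKETS;
    pthread_mutex_lock(&t->lock[b]);    /* lock only the affected bucket */
    t->count[b]++;
    pthread_mutex_unlock(&t->lock[b]);
}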


Chapter 7 — Multicores, Multiprocessors, and Clusters — 51

Concluding Remarks

Goal: higher performance by using multiple processors

Difficulties:

Developing parallel software

Devising appropriate architectures

Many reasons for optimism:

Changing software and application environment

Chip-level multiprocessors with lower latency, higher bandwidth interconnect

An ongoing challenge for computer architects!

§7.13 Concluding Remarks