Multicore Meets Petascale: Catalyst for a Programming Model Revolution
Kathy Yelick, U.C. Berkeley and Lawrence Berkeley National Laboratory


Page 1:

Multicore Meets Petascale: Catalyst for a Programming Model Revolution

Kathy Yelick

U.C. Berkeley and Lawrence Berkeley National Laboratory

Page 2:

Moore’s Law is Alive and Well

2x transistors per chip every 1.5 years, called "Moore's Law."

Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra

Page 3:

Clock Scaling Hits Power Density Wall

[Chart: power density (W/cm², log scale from 1 to 10,000) vs. year (1970-2010) for Intel processors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference lines for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface.]

Source: Patrick Gelsinger, Intel

Scaling clock speed (business as usual) will not work

Page 4:

Multicore Revolution

• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond)

Page 5:

Petaflop with ~1M Cores By 2008

[Chart: Top500 performance trend (log scale, 10 Mflop/s to 100 Pflop/s), data from top500.org. Annotations: a 1 PFlop system in 2008? Common by 2015? (a 6-8 year lag).]

Page 6:

Petaflop with ~1M Cores By 2008

[Chart: same Top500 performance trend. Annotations: a 1 PFlop system in 2009 (6-8 years); on your desk in 2025? (8-10 years).]

Page 7:

Need a Fundamentally New Approach

• Rethink hardware

– What limits performance

– How to build efficient hardware

• Rethink software

– Massive parallelism

– Eliminate scaling bottlenecks: replication, synchronization

• Rethink algorithms

– Massive parallelism and locality

– Counting Flops is the wrong measure

Page 8:

Rethink Hardware

(Ways to Waste $50M)

Page 9:

Waste #1: Ignore Power Budget

Power is top concern in hardware design

• Power density within a chip

– Led to multicore revolution

• Energy consumption

– Always important in handheld devices

– Increasingly so in desktops

– Soon to be a significant fraction of the budget in large systems

• One knob: increase concurrency

Page 10:

Optimizing for Serial Performance Consumes Power

• Power5 (server): 389 mm², 120 W @ 1900 MHz
• Intel Core2 sc (laptop): 130 mm², 15 W @ 1000 MHz
• PowerPC 450 (BlueGene/P): 8 mm², 3 W @ 850 MHz
• Tensilica DP (cell phones): 0.8 mm², 0.09 W @ 650 MHz

[Figure: die photos of the Intel Core2, Power5, PPC450, and Tensilica DP.]

Each core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20 the power!

Page 11:

Power Demands Threaten to Limit the Future Growth of Computational Science

Looking forward to Exascale (1000x Petascale):

• DOE E3 Report
  – Extrapolation of existing design trends
  – Estimate: 130 MW
• DARPA Exascale Study
  – More detailed assessment of component technologies
    • Power-constrained design for 2014 technology
    • 3 TF/chip, new memory technology, optical interconnect
  – Estimate:
    • 40 MW is a plausible target, but not business as usual
    • Billion-way concurrency with cores and latency hiding
• NRC Study
  – Power and multicore challenges are not just an HPC problem

Page 12:

Waste #2: Buy More Cores than Memory can Feed

• Required bandwidth depends on the algorithm

• Need hardware designed to algorithmic needs

[Chart: projected time (µs, log scale from 1e2 to 1e10) vs. year (2003-2019) to fill half of memory, compared with the time to run matmul, an FFT, a stencil, and an N² particle method on half of memory.]

Page 13:

Waste #3: Ignore Little’s Law--Latency also Matters

Name               Clovertown              Opteron                 Cell
Chips*Cores        2*4 = 8                 2*2 = 4                 1*8 = 8
Architecture       4-issue, 2-SSE3, OOO,   3-issue, 1-SSE3, OOO,   2-VLIW, SIMD,
                   caches, prefetch        caches, prefetch        local RAM, DMA
Clock Rate         2.3 GHz                 2.2 GHz                 3.2 GHz
Peak GFLOPS        74.6 GF                 17.6 GF                 14.6 GF (DP fl. pt.)
Peak MemBW         21.3 GB/s               21.3 GB/s               25.6 GB/s
Naïve SpMV         1.0 GF                  0.6 GF                  --
(median of many matrices)
Efficiency %       1%                      3%                      --
Autotuned SpMV     1.5 GF                  1.9 GF                  3.4 GF
Auto Speedup       1.5X                    3.2X                    --

Sparse matrix-vector multiply (2 flops / 8 bytes) should be bandwidth limited.
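To make the "bandwidth limited" expectation concrete, here is a rough bound derived from the table's own numbers (my arithmetic, not a figure from the talk): SpMV performance is at most the peak memory bandwidth times 2 flops per 8 bytes,

    21.3~\mathrm{GB/s} \times \tfrac{2~\mathrm{flops}}{8~\mathrm{bytes}} \approx 5.3~\mathrm{GFlop/s}\ \text{(Clovertown, Opteron)}, \qquad 25.6 \times 0.25 = 6.4~\mathrm{GFlop/s}\ \text{(Cell)}.

The cache-based chips land far below even this bandwidth bound, which is why latency (Little's Law), and not just bandwidth, matters.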

Page 14:

Why is the STI Cell So Efficient? (Latency Hiding with Software-Controlled Memory)

• Performance of a standard cache hierarchy
  – Cache hierarchies are insufficient to tolerate latencies
  – Hardware prefetch prefers long unit-stride access patterns (optimized for STREAM)
• Cell "explicit DMA"
  – Cell's software-controlled DMA engines can provide nearly flat bandwidth
  – The performance model is simple and deterministic (much simpler than modeling a complex cache hierarchy):

    min{time_for_memory_ops, time_for_core_exec}
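A minimal sketch of how such a model can be evaluated, under the common reading that DMA and computation are fully overlapped, so the kernel is limited by whichever phase takes longer (equivalently, the delivered rate is the minimum of the bandwidth bound and the compute bound). All parameters below are hypothetical, not measurements from the talk:

    /* Flat, deterministic Cell-style performance model (illustrative only). */
    #include <stdio.h>

    int main(void) {
        double bytes_moved = 64.0e6;   /* bytes DMAed to/from local store (assumed) */
        double flops       = 16.0e6;   /* floating-point operations (assumed)       */
        double dma_bw      = 25.6e9;   /* sustained DMA bandwidth, bytes/s (assumed)*/
        double core_rate   = 14.6e9;   /* peak DP flop rate, flops/s (assumed)      */

        double time_for_memory_ops = bytes_moved / dma_bw;
        double time_for_core_exec  = flops / core_rate;

        /* With double buffering, total time is bounded by the slower phase. */
        double t = time_for_memory_ops > time_for_core_exec
                       ? time_for_memory_ops : time_for_core_exec;

        printf("memory %.3g s, compute %.3g s, model %.3g s (%.2f GFlop/s)\n",
               time_for_memory_ops, time_for_core_exec, t, flops / t / 1e9);
        return 0;
    }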

[Chart: Cell STRIAD bandwidth (GB/s, 0-30) vs. stanza size (16 to 2048) at 64 KB concurrency, for 1 through 8 SPEs.]

Page 15:

Waste #4: Unnecessarily Synchronize Communication

[Chart: 8-byte roundtrip latency (µs) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Federation.]

• MPI message passing is two-sided: each transfer is tied to a synchronization event (the message being received)

• One-sided communication avoids this
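As an illustrative contrast (my sketch, not code from the talk), a two-sided transfer requires a matched send and receive, while a UPC one-sided put only names the destination. The buffer size N and layout below are hypothetical, and at least two threads are assumed:

    /* One-sided put in UPC: thread 0 writes directly into thread 1's block of a
     * shared array; no matching receive is needed.  The two-sided equivalent is
     * a matched MPI_Send on one rank and MPI_Recv on the other, which couples
     * the data transfer to a synchronization event. */
    #include <upc.h>
    #include <string.h>

    #define N 4096                          /* doubles per thread (assumed) */

    shared [N] double dst[N * THREADS];     /* block of N elements per thread */
    double src[N];                          /* private source buffer */

    int main(void) {
        memset(src, 0, sizeof src);
        if (MYTHREAD == 0 && THREADS > 1)
            upc_memput(&dst[N * 1], src, N * sizeof(double));  /* one-sided bulk put to thread 1 */
        upc_barrier;                        /* make the data visible before it is read */
        return 0;
    }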

[Chart: flood bandwidth for 4 KB messages, as a percentage of hardware peak, MPI vs. GASNet, on the same six platforms.]

Page 16:

One-Sided vs Two-Sided Communication

• A one-sided put/get message can be handled directly by a network interface with RDMA support
  – Avoids interrupting the CPU or storing data from the CPU (pre-posts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
  – Matching can be offloaded to the network interface in networks like Quadrics
  – But match tables must be downloaded to the interface (from the host)

[Diagram: a one-sided put message carries the destination address with its data payload, while a two-sided message carries only a message id; the network interface, host memory, and host CPU are shown for each case.]

Joint work with Dan Bonachea

Page 17:

Rethinking Programming Models for:
• Massive concurrency
• Latency hiding
• Locality control

Page 18:

Parallel Programming Models

• Most parallel programs are written using either:
  – Message passing with a SPMD model
    • Scientific applications; portable and scalable
    • Success driven by cluster computing
  – Shared memory with threads, in OpenMP or a threads library
    • IT applications; easier to program
    • Used on shared-memory architectures
• Future petascale machines
  – Massively distributed: O(100K) nodes/sockets
  – Massive parallelism within a chip (2x every year, starting at 4-100 now)
• Main question: 1 programming model or 2?
  – MPI + OpenMP, or something else?

Page 19:

Partitioned Global Address Space

• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global

[Diagram: the global address space spans threads p0, p1, ..., pn; each thread holds private data plus local (l:) and global (g:) pointers to shared variables such as x and y.]

By default:
• Object heaps are shared
• Program stacks are private

• 3 current languages: UPC, CAF, and Titanium
  – All three use an SPMD execution model
  – Emphasis in this talk on UPC and Titanium (based on Java)
• 3 emerging languages: X10, Fortress, and Chapel

Page 20:

PGAS Language Overview

• Many common concepts, although specifics differ
  – Consistent with the base language, e.g., Titanium is strongly typed
• Both private and shared data
  – int x[10]; and shared int y[10];
• Support for distributed data structures
  – Distributed arrays; local and global pointers/references
• One-sided shared-memory communication (see the sketch after this list)
  – Simple assignment statements: x[i] = y[i]; or t = *p;
  – Bulk operations: memcpy in UPC, array ops in Titanium and CAF
• Synchronization
  – Global barriers, locks, memory fences
• Collective communication, I/O libraries, etc.
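A small UPC sketch touching several of the features above: private and shared data, a pointer into the global address space, one-sided reads and writes via plain assignment, an affinity-driven loop, and a global barrier. This is my illustration, not code from the talk; the sizes are arbitrary:

    #include <upc.h>
    #include <stdio.h>

    shared int y[10 * THREADS];   /* shared array, cyclic layout by default */
    int x[10];                    /* private to each thread */

    int main(void) {
        int i;
        shared int *p = &y[MYTHREAD];        /* pointer-to-shared with affinity to me */

        *p = MYTHREAD;                       /* write through a global pointer */
        upc_barrier;

        x[0] = y[(MYTHREAD + 1) % THREADS];  /* one-sided read of a neighbor's element */

        upc_forall (i = 0; i < 10 * THREADS; i++; &y[i])  /* each thread touches its own elements */
            y[i] += 1;

        upc_barrier;
        if (MYTHREAD == 0) printf("y[0] = %d\n", y[0]);
        return 0;
    }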

Page 21:

PGAS Languages for Distributed Memory

• PGAS languages are a good fit to distributed-memory machines and clusters of multicore
  – The global address space is built on fast one-sided communication
  – UPC and Titanium use GASNet communication
• The alternative on distributed memory is MPI
  – The PGAS partitioned model has scaled to 100s of nodes
  – In MPI, shared data structures exist only in the programmer's mind; they cannot be programmed as such

Page 22:

One-Sided vs. Two-Sided: Practice

[Charts: flood bandwidth (MB/s, up is good) vs. message size (10 bytes to 1 MB) for GASNet put (non-blocking) and MPI, and the relative bandwidth (GASNet/MPI, roughly 1.0-2.4) vs. message size, measured on the NERSC Jacquard machine with Opteron processors.]

• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5

• Half power point (N½) differs by one order of magnitude

• This is not a criticism of the implementation!

Joint work with Paul Hargrove and Dan Bonachea


Page 23:

Communication Strategies for 3D FFT

Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

• Three approaches (sketched in code after this list):
  – Chunk (all rows with the same destination):
    • Wait for the 2nd-dimension FFTs to finish
    • Minimizes the number of messages
  – Slab (all rows in a single plane with the same destination):
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  – Pencil (one row):
    • Send each row as it completes
    • Maximizes overlap and matches the natural layout
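A skeleton (my illustration, not the NAS FT code) contrasting the chunk and pencil exchange strategies in standard UPC: fft_row() is a stub, the layout and destination are hypothetical, and a real implementation would use non-blocking puts so the pencil sends actually overlap with computation:

    #include <upc.h>

    #define ROWS 64
    #define NX   64

    shared [ROWS * NX] double recv_buf[ROWS * NX * THREADS]; /* per-thread landing zone */
    double plane[ROWS][NX];                                  /* private local plane     */

    static void fft_row(double *row, int n) { (void)row; (void)n; /* stub 1D FFT */ }

    int main(void) {
        int dest = (MYTHREAD + 1) % THREADS;   /* assumed destination thread */
        int r;

        /* Chunk: finish every row, then send them all in one large put. */
        for (r = 0; r < ROWS; r++) fft_row(plane[r], NX);
        upc_memput(&recv_buf[(size_t)dest * ROWS * NX], &plane[0][0], sizeof plane);
        upc_barrier;

        /* Pencil: send each row as soon as its FFT completes; many more,
         * smaller messages, but much more opportunity for overlap. */
        for (r = 0; r < ROWS; r++) {
            fft_row(plane[r], NX);
            upc_memput(&recv_buf[(size_t)dest * ROWS * NX + (size_t)r * NX],
                       plane[r], NX * sizeof(double));
        }
        upc_barrier;
        return 0;
    }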

Page 24:

NAS FT Variants Performance Summary

• Slab is always best for MPI; the small-message cost is too high

• Pencil is always best for UPC; more overlap

[Chart: best MFlop rates per thread for all NAS FT benchmark versions (Chunk = NAS FT with FFTW, Best NAS Fortran/MPI, Best MPI (always slabs), Best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512 processors; one configuration is annotated ".5 Tflops".]

Page 25:

PGAS Languages and Productivity

Page 26:

Arrays in a Global Address Space

• Key features of Titanium arrays
  – Generality: indices may start/end at any point
  – Domain calculus allows slicing, subarrays, transposes, and other operations without data copies
• Use the domain calculus to identify ghost cells and iterate:
    foreach (p in gridA.shrink(1).domain()) ...
• Array copies automatically work on the intersection (a UPC-style sketch of what this does follows):
    gridB.copy(gridA.shrink(1));
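For comparison, a rough UPC sketch (my illustration, not from the talk) of the boundary exchange that Titanium's one-line copy performs automatically, for a 1D block-distributed grid with one ghost cell on each side; the size and names are hypothetical:

    #include <upc.h>

    #define NLOC 128                                      /* interior cells per thread (assumed) */

    shared [NLOC + 2] double grid[(NLOC + 2) * THREADS];  /* each block: ghost, NLOC interior, ghost */

    int main(void) {
        size_t base  = (size_t)MYTHREAD * (NLOC + 2);
        int    left  = (MYTHREAD + THREADS - 1) % THREADS;
        int    right = (MYTHREAD + 1) % THREADS;

        /* ... local update of grid[base+1 .. base+NLOC] would go here ... */

        upc_barrier;
        /* One-sided puts of my boundary values into my neighbors' ghost cells;
         * Titanium's gridB.copy(gridA.shrink(1)) hides this index bookkeeping. */
        grid[(size_t)right * (NLOC + 2)]            = grid[base + NLOC];  /* my last interior -> right's left ghost  */
        grid[(size_t)left  * (NLOC + 2) + NLOC + 1] = grid[base + 1];     /* my first interior -> left's right ghost */
        upc_barrier;
        return 0;
    }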

[Diagram: gridA and gridB with their "restricted" (non-ghost) cells, ghost cells, and the intersection (the copied area).]

Useful in grid computations, including AMR.

Joint work with the Titanium group

Page 27:

Language Support Helps Productivity

C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous communication:
  – Pack boundary data between procs
  – All optimizations done by the programmer

Titanium AMR
• Entirely in Titanium
• Finer-grained communication
  – No explicit pack/unpack code
  – Automated in the runtime system
• General approach
  – The language allows programmer optimizations
  – The compiler/runtime does some automatically

Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

[Charts: lines of code by component (AMRElliptic, AMRTools, Util, Grid, AMR, Array) for Titanium vs. C++/Fortran/MPI (Chombo), on a scale of 0-30,000 lines; and speedup vs. number of processors (16, 28, 36, 56, 112) for the Titanium and Chombo codes.]

Page 28:

Particle/Mesh Method: Heart Simulation

• Elastic structures in an incompressible fluid
  – Blood flow, clotting, inner ear, embryo growth, ...
• Complicated parallelization
  – Particle/mesh method, but the "particles" are connected into materials (1D or 2D structures)
  – Communication patterns are irregular between the particles (structures) and the mesh (fluid)

Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen

[Figure: 2D Dirac delta function.]

Code size in lines: Fortran 8000, Titanium 4000. (Note: the Fortran code is not parallel.)

Page 29: Multicore Meets Petascale: Catalyst for a Programming ... · – 389 mm2 – 120 W @ 1900 MHz • Intel Core2 sc (Laptop) – 130 mm2 – 15 W @ 1000 MHz • PowerPC450 (BlueGene/P)
Page 30: Multicore Meets Petascale: Catalyst for a Programming ... · – 389 mm2 – 120 W @ 1900 MHz • Intel Core2 sc (Laptop) – 130 mm2 – 15 W @ 1000 MHz • PowerPC450 (BlueGene/P)

Dense and Sparse Matrix Factorization

[Diagram: block LU factorization. Blocks are 2D block-cyclic distributed; the panel being factored involves communication for pivoting; the completed parts of L and U border the trailing matrix blocks A(i,j), A(i,k), A(j,i), A(j,k) to be updated; matrix-matrix multiplication is used for the update and can be coalesced.]

Joint work with Parry Husbands and Esmond Ng

Page 31:

Matrix Factorization in UPC

• The UPC factorization uses a highly multithreaded style
  – Used to mask latency and to mask dependence delays
  – Three levels of threads (see the sketch after this list):
    • UPC threads (data layout; each runs an event-scheduling loop)
    • Multithreaded BLAS (boost efficiency)
    • User-level (non-preemptive) threads with explicit yield
  – No dynamic load balancing, but lots of remote invocation
  – The layout is fixed (blocked/cyclic) and tuned for block size
• The same framework is being used for sparse Cholesky
• Hard problems
  – Block-size tuning (tedious) for both locality and granularity
  – Task prioritization (to ensure critical-path performance)
  – Resource management can deadlock the memory allocator if one is not careful
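A highly simplified sketch (my illustration, not the actual UPC LU code) of the kind of event-scheduling loop each UPC thread runs: tasks whose dependences are satisfied are executed, and completing a task may enable others. The task structure and dependence tracking are hypothetical stand-ins:

    #include <stdio.h>

    #define MAX_TASKS 16

    typedef struct {
        int  deps_left;              /* unsatisfied dependences */
        int  done;
        void (*run)(int id);         /* task body */
    } task_t;

    static task_t tasks[MAX_TASKS];
    static int ntasks = 0;

    static void do_block(int id) { printf("ran task %d\n", id); }

    static void add_task(int deps) {
        tasks[ntasks].deps_left = deps;
        tasks[ntasks].done = 0;
        tasks[ntasks].run = do_block;
        ntasks++;
    }

    int main(void) {
        int remaining, i;
        add_task(0);   /* e.g. a panel factorization, ready immediately     */
        add_task(1);   /* e.g. a trailing update waiting on the panel above */

        remaining = ntasks;
        while (remaining > 0) {                       /* the event-scheduling loop */
            for (i = 0; i < ntasks; i++) {
                if (!tasks[i].done && tasks[i].deps_left == 0) {
                    tasks[i].run(i);
                    tasks[i].done = 1;
                    remaining--;
                    /* completing task i satisfies a dependence of task i+1
                     * (a stand-in for real dependence tracking) */
                    if (i + 1 < ntasks) tasks[i + 1].deps_left--;
                }
            }
        }
        return 0;
    }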

Joint work with Parry Husbands

Page 32:

UPC HP Linpack Performance

[Charts: GFlop/s for MPI/HPL vs. UPC on a Cray X1 (64 and 128 processors), an Opteron cluster (64 processors), and an SGI Altix (32 processors).]

• Comparable to MPI HPL (numbers from the HPCC database)
• Faster than ScaLAPACK due to less synchronization
• Large-scale runs of the UPC code on Itanium/Quadrics (Thunder):
  – 2.2 TFlops on 512p and 4.4 TFlops on 1024p

Joint work with Parry Husbands

[Chart: GFlop/s for ScaLAPACK vs. UPC on 2x4 and 4x4 processor grids.]

Page 33:

PGAS Languages + Autotuning for Multicore

• PGAS languages are a good fit to shared-memory machines, including multicore
  – The global address space is implemented as reads/writes
  – The current UPC and Titanium implementations use threads
• The alternative on shared memory is OpenMP or raw threads
  – PGAS has locality information that is important on multi-socket SMPs and may grow in importance as the number of cores grows
  – It may also be exploited for processors with an explicit local store and DMA rather than a cache, e.g., the Cell processor
• Open question in architecture:
  – Cache-coherent shared memory, or
  – Software-controlled local memory (or a hybrid)?
• How do we get performance, not just parallelism, on multicore?

Page 34:

Tools for Efficiency: Autotuning

• Automatic performance tuning (see the sketch after this list)
  – Use machine time in place of human time for tuning
  – Search over possible implementations
  – Use performance models to restrict the search space
  – Autotuned libraries for the dwarfs (up to 10x speedups)
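A tiny autotuner skeleton (illustrative only, far simpler than PHiPAC, ATLAS, or OSKI): time a blocked matrix multiply for a few candidate block sizes and keep the fastest. The matrix size and candidate list are arbitrary:

    #include <stdio.h>
    #include <time.h>

    #define N 256
    static double A[N][N], B[N][N], C[N][N];

    static void matmul_blocked(int bs) {
        int ii, jj, kk, i, j, k;
        for (ii = 0; ii < N; ii += bs)
            for (jj = 0; jj < N; jj += bs)
                for (kk = 0; kk < N; kk += bs)
                    for (i = ii; i < ii + bs && i < N; i++)
                        for (j = jj; j < jj + bs && j < N; j++) {
                            double s = C[i][j];
                            for (k = kk; k < kk + bs && k < N; k++)
                                s += A[i][k] * B[k][j];
                            C[i][j] = s;
                        }
    }

    int main(void) {
        int candidates[] = {8, 16, 32, 64};
        int c, best = candidates[0];
        double best_t = 1e30;

        for (c = 0; c < 4; c++) {
            clock_t t0 = clock();
            matmul_blocked(candidates[c]);        /* machine time replaces human time */
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("block %2d: %.3f s\n", candidates[c], t);
            if (t < best_t) { best_t = t; best = candidates[c]; }
        }
        printf("selected block size: %d\n", best);
        return 0;
    }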

[Figure: performance as a function of register block size (n0 x m0) for dense matrix-matrix multiply.]

• Spectral (FFTW, Spiral)
• Dense (PHiPAC, Atlas)
• Sparse (Sparsity, OSKI)
• Stencils/structured grids
• Are these compilers?
  – They don't transform source
  – There are compilers that use this kind of search
  – But not for the sparse case (which transforms the matrix itself)

Sparse-matrix optimization: store 1.5x more entries (explicit zeros), gaining a 1.5x speedup. Compilers won't do this!

Page 35:

Sparse Matrices on Multicore

[Charts: SpMV performance (GFlop/s) on an AMD X2 (0-3.0) and an Intel Clovertown (0-2.5) across a suite of matrices (Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, and the median), with series 1Core Naïve; 1Core PF; 1Core PF,RB; 1Core PF,RB,CB; 2Core (and 4Core on Clovertown); 2Socket x 2Core / 2Socket x 4Core; and Naïve MPI + OSKI.]

Autotuning is more important than parallelism! And don't think that running an MPI process per core is good enough for performance.

Page 36:

Rethink Compilers and Tools

Page 37:

Compilers

• The challenges of writing optimized and portable code suggest that compilers should:
  – Analyze and optimize parallel code
    • Eliminate races
    • Raise the level of programming to data parallelism and other higher-level abstractions
    • Enforce easy-to-learn memory models
  – Encourage writing code generators rather than programs
    • Compilers become more dynamic than traditional ones
    • Use partial evaluation and dynamic optimizations (used in both examples)

Page 38:

Parallel Program Analysis

• To perform optimizations, new analyses are needed for parallel languages
• In a data-parallel or serial (auto-parallelized) language, the semantics are serial
  – Analysis is "easier" but more critical to performance
• Parallel semantics requires
  – Concurrency analysis: which code sequences may run concurrently
  – Parallel alias analysis: which accesses could conflict between threads
• Analysis is used to detect races, identify localizable pointers, and ensure memory consistency semantics (if desired)

Page 39:

Concurrency Analysis

• A graph is generated from the program as follows:
  – A node for each code segment between barriers and single conditionals
  – Edges added to represent control flow between segments
  – Barrier edges removed
• Two accesses can run concurrently if (see the sketch after the example):
  – They are in the same node, or
  – One access's node is reachable from the other access's node

// segment 1

if ([single])

// segment 2

else

// segment 3

// segment 4

Ti.barrier()

// segment 5

[Graph: nodes 1-5 for the five segments, with edges 1→2, 1→3, 2→4, and 3→4; the barrier edge from 4 to 5 is removed.]
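A small sketch (my illustration) of the resulting concurrency check: hard-code the example's segment graph with the barrier edge removed, then declare two accesses possibly concurrent if their segments coincide or one segment's node reaches the other's:

    #include <stdio.h>
    #include <string.h>

    #define NNODES 5
    static int edge[NNODES][NNODES];            /* edge[u][v]: control flow u -> v */

    static int reaches(int u, int v, int *seen) {
        int w;
        if (u == v) return 1;
        seen[u] = 1;
        for (w = 0; w < NNODES; w++)
            if (edge[u][w] && !seen[w] && reaches(w, v, seen)) return 1;
        return 0;
    }

    static int may_run_concurrently(int a, int b) {
        int seen[NNODES];
        if (a == b) return 1;                                  /* same node */
        memset(seen, 0, sizeof seen);
        if (reaches(a, b, seen)) return 1;
        memset(seen, 0, sizeof seen);
        return reaches(b, a, seen);
    }

    int main(void) {
        /* segments 1..5 map to nodes 0..4; the barrier edge 4->5 is NOT added */
        edge[0][1] = edge[0][2] = 1;       /* if/else branches                  */
        edge[1][3] = edge[2][3] = 1;       /* both branches flow into segment 4 */

        printf("segments 2,3: %d\n", may_run_concurrently(1, 2));  /* 0: single conditional */
        printf("segments 1,4: %d\n", may_run_concurrently(0, 3));  /* 1: 1 reaches 4        */
        printf("segments 4,5: %d\n", may_run_concurrently(3, 4));  /* 0: barrier separates  */
        return 0;
    }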

Joint work with Amir Kamil and Jimmy Su

Page 40:

Alias Analysis

• Allocation sites correspond to abstract locations (a-locs)
  – Abstract locations (a-locs) are typed
• All explicit and implicit program variables have points-to sets
  – Each field of an object has a separate set
  – Arrays have a single points-to set for all elements
• Thread-aware: two kinds of abstract locations, local and remote
  – Local locations reside in the local thread's memory
  – Remote locations reside on another thread
  – Generalizes to multiple levels (thread, node, cluster)

Joint work with Amir Kamil

Page 41:

Implementing Sequential Consistency with Fences and Analysis

[Chart: percentage of dynamic fences removed (0-100%) under successively stronger analyses (sharing; concur; concur + pointer; concur + multi-level pointer) for the benchmarks fft, amr-gas, gsrb, lu-fact, pi, pps, sort, demv, spmv, and amr-poisson.]

Joint work with Amir Kamil

Page 42:

Hierarchical Machines

• Parallel machines often have a hierarchical structure

[Diagram: threads A, B, C, D arranged in a hierarchy with level 1 (thread local), level 2 (node local), level 3 (cluster local), and level 4 (grid world).]

Page 43:

Race Detection Results

[Chart: race detection results ("Good" direction annotated).]

Page 44:

Rethink Algorithms

Page 45:

Latency and Bandwidth-Avoiding

• Many iterative algorithms are limited by
  – Communication latency (frequent messages)
  – Memory bandwidth
• New optimal ways to implement Krylov subspace methods on parallel and sequential computers
  – Replace x ← Ax by x ← [Ax, A²x, ..., Aᵏx]
  – Change GMRES, CG, Lanczos, ... accordingly
• Theory
  – Minimizes network latency costs on a parallel machine
  – Minimizes memory bandwidth and latency costs on a sequential machine
• Performance models for a 2D problem
  – Up to 7x (with overlap) or 15x (no overlap) speedups on BG/P
• Measured speedup: 3.2x for out-of-core

Page 46:

Locally Dependent Entries for [x, Ax, ..., A⁸x], A tridiagonal

[Figure: local dependencies for k=8; the entries of x, Ax, A²x, ..., A⁸x that one processor can compute from its own block of roughly 30 entries.]

These can be computed without communication: k=8-fold reuse of A.

Page 47:

Remotely Dependent Entries for [x, Ax, ..., A⁸x], A tridiagonal

[Figure: type (1) remote dependencies for k=8; the entries near the block boundary that depend on a neighbor's data.]

One message gets all the data needed to compute the remotely dependent entries, not k=8 messages. Price: redundant work.

Page 48:

Fewer Remotely Dependent Entries for [x, Ax, ..., A⁸x], A tridiagonal

[Figure: type (2) remote dependencies for k=8; a smaller boundary region than in the previous figure.]

This reduces the redundant work by half.

Page 49:

Latency-Avoiding Parallel Kernel for [x, Ax, A²x, ..., Aᵏx]

• Compute the locally dependent entries needed by neighbors
• Send data to neighbors, receive from neighbors
• Compute the remaining locally dependent entries
• Wait for the receive
• Compute the remotely dependent entries
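A sequential sketch (my illustration, not the Demmel/Hoemmen code) of the matrix powers kernel for a tridiagonal A: given the local block of x plus k ghost entries on each side (obtained with the single exchange above), every level up to Aᵏx restricted to the owned block can be computed with no further communication. The stencil and sizes are made up:

    #include <stdio.h>

    #define K    8                   /* number of powers                     */
    #define NLOC 32                  /* locally owned entries                */
    #define NEXT (NLOC + 2 * K)      /* owned entries plus K ghosts per side */

    int main(void) {
        static double v[K + 1][NEXT];    /* v[j] holds A^j x on the extended block */
        int i, j;

        for (i = 0; i < NEXT; i++) v[0][i] = 1.0;   /* x, with ghosts already received */

        /* Tridiagonal A with stencil (-1, 2, -1): at level j the computable region
         * shrinks by one entry at each end, so K ghosts per side are exactly enough
         * to cover the owned range [K, K+NLOC) at level K. */
        for (j = 1; j <= K; j++)
            for (i = j; i < NEXT - j; i++)
                v[j][i] = -v[j-1][i-1] + 2.0 * v[j-1][i] - v[j-1][i+1];

        printf("A^%d x over the owned block: first %g, last %g\n",
               K, v[K][K], v[K][K + NLOC - 1]);
        return 0;
    }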

Page 50:

Work by Demmel and Hoemmen

The matrix powers kernel can be used, but the algorithms must be changed accordingly.

Page 51:

Predictions and Conclusions

• Parallelism will explode
  – The number of cores will double every ~2 years
  – Petaflop (million-processor) machines will be common in HPC by 2015 (all Top500 machines will be at this scale)
• Performance will become a software problem
  – Parallelism and locality are fundamental; power can be saved by pushing these to software
• Locality will continue to be important
  – On-chip to off-chip, as well as node to node
  – Algorithms need to be designed for what counts (communication, not computation)
• Massive parallelism is required (including pipelining and overlap)

Page 52:

Conclusions

• Parallel computing is the future

• Re-think Hardware

– Hardware to make programming easier

– Hardware to support good performance tuning

• Re-think Software

– Software to make the most of hardware

– Software to ease programming

• Berkeley UPC compiler: http://upc.lbl.gov

• Titanium compiler: http://titanium.cs.berkeley.edu

• Re-think Algorithms

– Design for bottlenecks: latency and bandwidth

Page 53:

Concurrency for Low Power

• Highly concurrent systems are more power efficient
  – Dynamic power is proportional to V²fC
  – Increasing frequency (f) also increases the supply voltage (V): a more-than-linear effect
  – Increasing the number of cores increases capacitance (C) but has only a linear effect
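A rough worked example under idealized scaling assumptions (mine, not the slide's): suppose halving the clock frequency lets the supply voltage drop proportionally, and two cores at the lower clock match the original throughput. Then

    P_{\text{1 core}} \propto V^2 f C, \qquad P_{\text{2 cores}} \propto \left(\tfrac{V}{2}\right)^2 \cdot \tfrac{f}{2} \cdot 2C = \tfrac{1}{4} V^2 f C,

so roughly the same work is done at about a quarter of the dynamic power; this is the sense in which concurrency substitutes for clock rate.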

• Hidden concurrency burns power
  – Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software to save power
  – Compilers, library writers, and application programmers
• Challenge: can you double the concurrency in your algorithms every 2 years?

Page 54:

LBMHD: Structured Grid Application

• Plasma turbulence simulation

• Two distributions:

– momentum distribution (27 components)

– magnetic distribution (15 vector components)

• Three macroscopic quantities:

– Density

– Momentum (vector)

– Magnetic Field (vector)

• Must read 73 doubles and update (write) 79 doubles per point in space
• Requires about 1300 floating-point operations per point in space
• Just over 1.0 flops/byte (ideal): roughly 1300 flops / (152 doubles × 8 bytes) ≈ 1.07
• No temporal locality between points in space within one time step

Page 55:

Autotuned Performance (Cell/SPE version)

• First attempt at a Cell implementation
• VL, unrolling, and reordering fixed
• Exploits DMA and double buffering to load vectors
• Goes straight to SIMD intrinsics
• Despite the relative performance, Cell's double-precision implementation severely impairs performance

[Charts: LBMHD performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (*collision() only), stacked by optimization: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization.]

Page 56:

Autotuned Performance (Cell/SPE version)

[Charts: the same LBMHD autotuning results, annotated with the achieved fractions of peak: Intel Clovertown 7.5% of peak flops / 17% of bandwidth; AMD Opteron 42% / 35%; Sun Niagara2 (Huron) 59% / 15%; IBM Cell Blade (*collision() only) 57% / 33%. Optimization stack: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching.]