Multicore Meets Petascale: Catalyst for a Programming Model Revolution
Kathy Yelick
U.C. Berkeley and
Lawrence Berkeley National Laboratory
Moore’s Law is Alive and Well
2X transistors/chip every 1.5 years
Called "Moore's Law"
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Clock Scaling Hits Power Density Wall
[Figure: power density (W/cm2) vs. year (1970-2010) for Intel processors from the 4004 through the 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
Source: Patrick Gelsinger, Intel
Scaling clock speed (business as usual) will not work
Multicore Revolution
• Chip density is continuing to increase ~2x every 2 years
– Clock speed is not
– Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and
Stanford (Olukotun, Hammond)
Petaflop with ~1M Cores By 2008
[Figure: Top500 performance over time, axis from 10 MFlop/s to 100 PFlop/s, projected forward]
1 PFlop system in 2008?
Data from top500.org
6-8 years
Common by 2015?
Petaflop with ~1M Cores By 2008
[Figure: the same Top500 projection, axis from 10 MFlop/s to 100 PFlop/s]
1 PFlop system in 2009
6-8 years
On your desk in 2025?
8-10 years
Need a Fundamentally New Approach
• Rethink hardware
– What limits performance
– How to build efficient hardware
• Rethink software
– Massive parallelism
– Eliminate scaling bottlenecks: replication, synchronization
• Rethink algorithms
– Massive parallelism and locality
– Counting Flops is the wrong measure
Rethink Hardware
(Ways to Waste $50M)
Waste #1: Ignore Power Budget
Power is top concern in hardware design
• Power density within a chip
– Led to multicore revolution
• Energy consumption
– Always important in handheld devices
– Increasingly so in desktops
– Soon to be a significant fraction of budget in large systems
• One knob: increase concurrency
Optimizing for Serial Performance Consumes Power
• Power5 (Server): 389 mm2, 120 W @ 1900 MHz
• Intel Core2 sc (Laptop): 130 mm2, 15 W @ 1000 MHz
• PowerPC 450 (BlueGene/P): 8 mm2, 3 W @ 850 MHz
• Tensilica DP (cell phones): 0.8 mm2, 0.09 W @ 650 MHz
[Figure: relative die sizes of the Intel Core2, Power5, PPC450, and Tensilica DP]
Each core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20th the power!
Power Demands Threaten to Limit the Future Growth of Computational Science
Looking forward to Exascale (1000x Petascale)
• DOE E3 Report
– Extrapolation of existing design trends
– Estimate: 130 MW
• DARPA Exascale Study
– More detailed assessment of component technologies
• Power-constrained design for 2014 technology
• 3 TF/chip, new memory technology, optical interconnect
– Estimates:
• 40 MW plausible target, but not business as usual
• Billion-way concurrency with cores and latency hiding
• NRC Study
– Power and multicore challenges are not just an HPC problem
Waste #2: Buy More Cores than Memory Can Feed
• Required bandwidth depends on the algorithm
• Need hardware designed to match algorithmic needs
[Figure: projected time (usec) vs. year (2003-2019) to fill 1/2 of memory, to matmul on 1/2 of memory, to FFT on 1/2 of memory, to apply a stencil, and to run an N^2 particle method]
Waste #3: Ignore Little's Law -- Latency Also Matters
Name                 | Clovertown                             | Opteron                                | Cell
Chips*Cores          | 2*4 = 8                                | 2*2 = 4                                | 1*8 = 8
Architecture         | 4-issue, 2-SSE3, OOO, caches, prefetch | 3-issue, 1-SSE3, OOO, caches, prefetch | 2-VLIW, SIMD, local RAM, DMA
Clock Rate           | 2.3 GHz                                | 2.2 GHz                                | 3.2 GHz
Peak GFLOPS          | 74.6 GF                                | 17.6 GF                                | 14.6 GF (DP Fl. Pt.)
Peak MemBW           | 21.3 GB/s                              | 21.3 GB/s                              | 25.6 GB/s
Naïve SpMV (median of many matrices) | 1.0 GF                 | 0.6 GF                                 | --
Efficiency %         | 1%                                     | 3%                                     | --
Autotuned SpMV       | 1.5 GF                                 | 1.9 GF                                 | 3.4 GF
Auto Speedup         | 1.5X                                   | 3.2X                                   | --
Sparse matrix-vector multiply (2 flops / 8 bytes) should be bandwidth limited; the sketch below checks the table against that bound.
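A back-of-the-envelope sketch (not from the slides) of the bandwidth-only bound the table implies; written as plain C, which a UPC compiler also accepts, with bw_bound_gflops as an illustrative helper:

    #include <stdio.h>

    /* Upper bound on SpMV GFlop/s if memory bandwidth were the only limit:
       GB/s times flops-per-byte gives GFlop/s. */
    static double bw_bound_gflops(double peak_bw_gbs, double flops_per_byte) {
        return peak_bw_gbs * flops_per_byte;
    }

    int main(void) {
        double intensity = 2.0 / 8.0;   /* SpMV: 2 flops per 8 bytes streamed */
        printf("Clovertown bound: %.1f GF\n", bw_bound_gflops(21.3, intensity));
        printf("Opteron bound:    %.1f GF\n", bw_bound_gflops(21.3, intensity));
        printf("Cell bound:       %.1f GF\n", bw_bound_gflops(25.6, intensity));
        /* ~5.3 GF on the cache-based chips, yet naive SpMV reaches only
           0.6-1.0 GF: too few outstanding memory requests (Little's Law),
           not raw bandwidth, is the real limiter. */
        return 0;
    }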
Why is the STI Cell So Efficient?
(Latency Hiding with Software-Controlled Memory)
• Performance of a standard cache hierarchy
– Cache hierarchies are insufficient to tolerate latencies
– Hardware prefetch prefers long unit-stride access patterns (optimized for STREAM)
• Cell "explicit DMA" (a double-buffering sketch follows the figure)
– Cell's software-controlled DMA engines can provide nearly flat bandwidth
– The performance model is simple and deterministic (much simpler than modeling a complex cache hierarchy):
max{time_for_memory_ops, time_for_core_exec}
[Figure: Cell STRIAD (64KB concurrency): bandwidth (GB/s, 0-30) vs. stanza size (16-2048) for 1 to 8 SPEs]
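A minimal sketch of the latency-hiding idiom the slide describes: double-buffered, software-controlled transfers. dma_get_async, dma_wait, and compute are hypothetical stand-ins (on Cell these would be the MFC DMA calls), and the chunk size is illustrative:

    #include <stddef.h>

    #define CHUNK 2048   /* doubles per transfer; illustrative */

    void dma_get_async(double *dst, const double *src, size_t n, int tag);
    void dma_wait(int tag);            /* block until transfer 'tag' completes */
    void compute(const double *buf, size_t n);

    void process_stream(const double *src, size_t nchunks) {
        static double buf[2][CHUNK];   /* two local-store buffers */
        dma_get_async(buf[0], src, CHUNK, 0);
        for (size_t i = 0; i < nchunks; i++) {
            int cur = (int)(i & 1);
            if (i + 1 < nchunks)       /* start fetching the next chunk... */
                dma_get_async(buf[1 - cur], src + (i + 1) * CHUNK, CHUNK, 1 - cur);
            dma_wait(cur);             /* ...while waiting only for the current one */
            compute(buf[cur], CHUNK);  /* total time ~ max(memory, compute) */
        }
    }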
Waste #4: Unnecessarily Synchronize Communication
[Figure: 8-byte roundtrip latency (usec, 0-25) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]
• MPI message passing is two-sided: each transfer is tied to a synchronization event (message received)
• One-sided communication avoids this
Flood Bandwidth for 4KB Messages
[Figure: flood bandwidth for 4KB messages, as a percent of hardware peak (0-100%), for MPI vs. GASNet on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]
One-Sided vs. Two-Sided Communication
• A one-sided put/get message can be handled directly by a network interface with RDMA support (see the UPC sketch below)
– Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put
– Matching can be offloaded to the network interface in networks like Quadrics
– Match tables must be downloaded to the interface (from the host)
[Diagram: a one-sided put message carries the destination address with its data payload; a two-sided message carries only a message id, which the network interface or host CPU must match to find the destination in memory]
Joint work with Dan Bonachea
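A minimal UPC sketch of the one-sided style (upc_memput, shared arrays, and the affinity layout are standard UPC; sizes are illustrative):

    #include <upc.h>
    #include <stddef.h>

    #define N 1024
    /* N contiguous doubles with affinity to each thread */
    shared [N] double buf[N * THREADS];

    void put_to_right_neighbor(const double *local) {
        int right = (MYTHREAD + 1) % THREADS;
        /* One-sided put: the initiator supplies the remote address; no matching
           receive or message-id lookup is needed on the target side. */
        upc_memput(&buf[(size_t)right * N], local, N * sizeof(double));
    }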
Rethinking Programming Models for
Massive concurrency
Latency hiding
Locality control
Parallel Programming Models
• Most parallel programs are written using either:
– Message passing with a SPMD model
• Scientific applications; portable and scalable
• Success driven by cluster computing
– Shared memory with threads in OpenMP or a threads library
• IT applications; easier to program
• Used on shared memory architectures
• Future petascale machines
– Massively distributed: O(100K) nodes/sockets
– Massive parallelism within a chip (2x every year, starting at 4-100 now)
• Main question: 1 programming model or 2?
– MPI + OpenMP or something else
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global
[Diagram: a global address space spanning threads p0, p1, ..., pn; each thread has private program-stack variables (l:) and local and global pointers into the shared heap (x: 1, x: 5, x: 7, y:, g:)]
By default:
• Object heaps are shared
• Program stacks are private
• 3 current languages: UPC, CAF, and Titanium
– All three use an SPMD execution model
– Emphasis in this talk on UPC and Titanium (based on Java)
• 3 Emerging languages: X10, Fortress, and Chapel
PGAS Language Overview
• Many common concepts, although specifics differ
– Consistent with the base language, e.g., Titanium is strongly typed
• Both private and shared data
– int x[10]; and shared int y[10];
• Support for distributed data structures
– Distributed arrays; local and global pointers/references
• One-sided shared-memory communication
– Simple assignment statements: x[i] = y[i]; or t = *p;
– Bulk operations: memcpy in UPC, array ops in Titanium and CAF
• Synchronization
– Global barriers, locks, memory fences
• Collective communication, I/O libraries, etc.
(A short UPC sketch of these features follows.)
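A short UPC sketch of the features above (standard UPC constructs; the sizes and the update are illustrative):

    #include <upc.h>
    #include <stdio.h>

    int x[10];                      /* private: one copy per thread */
    shared int y[10 * THREADS];     /* shared: distributed across threads */

    int main(void) {
        for (int i = 0; i < 10; i++)
            x[i] = y[i];            /* one-sided reads by plain assignment */

        upc_barrier;                /* global synchronization */

        upc_forall (int i = 0; i < 10 * THREADS; i++; &y[i])
            y[i] += MYTHREAD;       /* each thread updates elements it owns */

        upc_barrier;
        if (MYTHREAD == 0) printf("ran on %d threads\n", THREADS);
        return 0;
    }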
PGAS Languages for Distributed Memory
• PGAS languages are a good fit to distributed memory machines and clusters of multicore
– The global address space uses fast one-sided communication
– UPC and Titanium use GASNet communication
• The alternative on distributed memory is MPI
– The PGAS partitioned model has scaled to 100s of nodes
– Shared data structures exist only in the programmer's mind in MPI; they cannot be programmed as such
One-Sided vs. Two-Sided: Practice
[Figure: bandwidth (MB/s, 0-900) vs. message size (10 B to 1 MB) for GASNet put (nonblocking) vs. MPI flood; inset: relative bandwidth (GASNet/MPI), 1.0-2.4]
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!
Joint work with Paul Hargrove and Dan Bonachea
(up is good) NERSC Jacquard machine with Opteron processors
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
chunk = all rows with the same destination
slab = all rows in a single plane with the same destination
pencil = 1 row
• Three approaches (a sketch of the pencil version follows):
– Chunk:
• Wait for 2nd-dimension FFTs to finish
• Minimize # messages
– Slab:
• Wait for the chunk of rows destined for 1 proc to finish
• Overlap with computation
– Pencil:
• Send each row as it completes
• Maximize overlap
• Match natural layout
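A sketch of the pencil strategy in the one-sided style (fft_1d_row, row_dest, put_nb, and wait_all_puts are hypothetical helpers; Berkeley UPC's non-blocking put extensions would play the put_nb role):

    #include <complex.h>
    #include <stddef.h>

    void fft_1d_row(double complex *row, size_t len);   /* 1D FFT on one pencil */
    void put_nb(int dest, const double complex *row, size_t bytes); /* non-blocking put */
    int  row_dest(size_t r);                            /* owner of transposed row r */
    void wait_all_puts(void);

    void fft_and_scatter(double complex *plane, size_t nrows, size_t rowlen) {
        for (size_t r = 0; r < nrows; r++) {
            fft_1d_row(&plane[r * rowlen], rowlen);     /* finish one row... */
            put_nb(row_dest(r), &plane[r * rowlen],     /* ...and ship it while */
                   rowlen * sizeof(double complex));    /* later rows compute */
        }
        wait_all_puts();  /* drain outstanding puts before the next FFT dimension */
    }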
NAS FT Variants Performance Summary
• Slab is always best for MPI; small-message cost is too high
• Pencil is always best for UPC; more overlap
[Figure: best MFlop/s per thread for all NAS FT benchmark versions on Myrinet (64 procs), InfiniBand (256), Elan3 (256 and 512), and Elan4 (256 and 512): Chunk (NAS FT with FFTW), best MPI (always slabs), and best UPC (always pencils), with a .5 TFlop/s annotation]
PGAS Languages and Productivity
Arrays in a Global Address Space
• Key features of Titanium arrays
– Generality: indices may start/end at any point
– A domain calculus allows slicing, subarrays, transpose, and other operations without data copies
• Use the domain calculus to identify ghosts and iterate:
foreach (p in gridA.shrink(1).domain()) ...
• Array copies automatically work on the intersection:
gridB.copy(gridA.shrink(1));
[Diagram: gridA and gridB, showing "restricted" (non-ghost) cells, ghost cells, and the intersection (copied area)]
Joint work with Titanium group
Useful in grid computations including AMR
Language Support Helps Productivity
C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous communication:
– Pack boundary data between procs
– All optimizations done by the programmer
Titanium AMR
• Entirely in Titanium
• Finer-grained communication
– No explicit pack/unpack code
– Automated in the runtime system
• General approach
– The language allows programmer optimizations
– The compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; Communication optimizations joint with Jimmy Su
[Figure: lines of code (0-30000) for Titanium vs. C++/F/MPI (Chombo), broken down by layer: AMRElliptic, AMRTools, Util, Grid, AMR, Array]
Speedup
[Figure: speedup (0-80) vs. #procs (16, 28, 36, 56, 112) for Ti and Chombo]
Particle/Mesh Method: Heart Simulation
• Elastic structures in an incompressible fluid
– Blood flow, clotting, inner ear, embryo growth, ...
• Complicated parallelization
– Particle/Mesh method, but "particles" are connected into materials (1D or 2D structures)
– Communication patterns are irregular between particles (structures) and mesh (fluid)
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen
[Figure: 2D Dirac delta function]
Code size in lines: Fortran 8000, Titanium 4000
Note: the Fortran code is not parallel
Dense and Sparse Matrix Factorization
[Diagram: blocked dense factorization. Blocks are 2D block-cyclic distributed. The completed parts of L and U border the panel being factored and the trailing matrix (blocks A(i,j), A(i,k), A(j,i), A(j,k)) still to be updated. Panel factorizations involve communication for pivoting; matrix-matrix multiplication is used for the update and can be coalesced]
Joint work with Parry Husbands and Esmond Ng
Matrix Factorization in UPC
• The UPC factorization uses a highly multithreaded style
– Used to mask latency and to mask dependence delays
– Three levels of threads:
• UPC threads (data layout; each runs an event scheduling loop)
• Multithreaded BLAS (boost efficiency)
• User-level (non-preemptive) threads with explicit yield
– No dynamic load balancing, but lots of remote invocation
– Layout is fixed (blocked/cyclic) and tuned for block size
• The same framework is being used for sparse Cholesky
• Hard problems
– Block size tuning (tedious) for both locality and granularity
– Task prioritization (ensure critical-path performance)
– Resource management can deadlock the memory allocator if not careful
Joint work with Parry Husbands
UPC HP Linpack Performance
[Figure: UPC vs. MPI/HPL GFlop/s on three platforms: Cray X1 (X1/64, X1/128; 0-1400), an Opteron cluster (Opt/64; 0-200), and an Altix (Alt/32; 0-160)]
• Comparable to MPI HPL (numbers from the HPCC database)
• Faster than ScaLAPACK due to less synchronization
• Large-scale runs of the UPC code on Itanium/Quadrics (Thunder):
– 2.2 TFlops on 512p and 4.4 TFlops on 1024p
Joint work with Parry Husbands
[Figure: UPC vs. ScaLAPACK GFlops (0-80) on 2x4 and 4x4 proc grids]
PGAS Languages + Autotuning for Multicore
• PGAS languages are a good fit to shared memory machines, including multicore
– Global address space implemented as reads/writes
– Current UPC and Titanium implementations use threads
• The alternative on shared memory is OpenMP or threads
– PGAS has locality information that is important on multi-socket SMPs and may be important as #cores grows
– It may also be exploited for processors with an explicit local store rather than a cache, e.g., the Cell processor
• Open question in architecture
– Cache-coherent shared memory
– Software-controlled local memory (or hybrid)
• How do we get performance, not just parallelism, on multicore?
Tools for Efficiency: Autotuning
• Automatic performance tuning
– Use machine time in place of human time for tuning
– Search over possible implementations (see the sketch below)
– Use performance models to restrict the search space
– Autotuned libraries for dwarfs (up to 10x speedup)
• Spectral (FFTW, Spiral)
• Dense (PHiPAC, Atlas)
• Sparse (Sparsity, OSKI)
• Stencils/structured grids
– Are these compilers?
• They don't transform source
• There are compilers that use this kind of search
• But not for the sparse case (transform the matrix)
[Figure: block size (n0 x m0) search space for dense matrix-matrix multiply]
[Figure: sparse register-blocking optimization: 1.5x more entries (explicit zeros), 1.5x speedup. Compilers won't do this!]
Sparse Matrices on Multicore
[Figure: SpMV GFlop/s on AMD X2 (0-3.0) and Intel Clovertown (0-2.5) for matrices Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, and the median; bars: 1-core naïve, 1-core +prefetch (PF), +register blocking (RB), +cache blocking (CB), multicore and multisocket versions, and naïve MPI + OSKI]
Autotuning is more important than parallelism! And don't assume that running one MPI process per core is good enough for performance.
Rethink Compilers and Tools
Compilers
• The challenges of writing optimized and portable code suggest compilers should:
– Analyze and optimize parallel code
• Eliminate races
• Raise the level of programming to data parallelism and other higher-level abstractions
• Enforce easy-to-learn memory models
– Write code generators rather than programs
• Compilers become more dynamic than traditional ones
• Use partial evaluation and dynamic optimizations (used in both examples)
Parallel Program Analysis
• To perform optimizations, new analyses are needed for parallel languages
• In a data parallel or serial (auto-parallelized) language, the semantics are serial
– Analysis is "easier" but more critical to performance
• Parallel semantics requires
– Concurrency analysis: which code sequences may run concurrently
– Parallel alias analysis: which accesses could conflict between threads
• Analysis is used to detect races, identify localizable pointers, and ensure memory consistency semantics (if desired)
Concurrency Analysis
• Graph generated from the program as follows:
– Node for each code segment between barriers and single conditionals
– Edges added to represent control flow between segments
– Barrier edges removed
• Two accesses can run concurrently if:
– They are in the same node, or
– One access's node is reachable from the other access's node
// segment 1
if ([single])
// segment 2
else
// segment 3
// segment 4
Ti.barrier()
// segment 5
[Graph: segment nodes 1-5; edges 1→2, 1→3, 2→4, 3→4; the barrier edge 4→5 is removed]
Joint work with Amir Kamil and Jimmy Su
Alias Analysis
• Allocation sites correspond to abstract locations (a-locs)
– Abstract locations (a-locs) are typed
• All explicit and implicit program variables have points-to sets
– Each field of an object has a separate set
– Arrays have a single points-to set for all elements
• Thread aware: two kinds of abstract locations, local and remote
– Local locations reside in the local thread's memory
– Remote locations reside on another thread
– Generalizes to multiple levels (thread, node, cluster)
Joint work with Amir Kamil
Implementing Sequential Consistency with Fences and Analysis
[Figure: % of dynamic fences removed (0-100%) under successively stronger analyses (sharing, concur, concur+pointer, concur+multi-level pointer) for fft, amr-gas, gsrb, lu-fact, pi, pps, sort, demv, spmv, and amr-poisson]
Joint work with Amir Kamil
Hierarchical Machines
• Parallel machines often have hierarchical structure
[Diagram: threads A, B, C, D in a hierarchy: level 1 (thread local), level 2 (node local), level 3 (cluster local), level 4 (grid world)]
Race Detection Results
[Figure: race detection results; the "good" direction is marked]
Rethink Algorithms
Latency and Bandwidth-Avoiding
• Many iterative algorithms are limited by
– Communication latency (frequent messages)
– Memory bandwidth
• New optimal ways to implement Krylov subspace methods on parallel and sequential computers
– Replace x -> Ax with x -> [Ax, A^2x, ..., A^kx]
– Change GMRES, CG, Lanczos, ... accordingly
• Theory
– Minimizes network latency costs on a parallel machine
– Minimizes memory bandwidth and latency costs on a sequential machine
• Performance models for a 2D problem
– Up to 7x (overlap) or 15x (no overlap) speedups on BG/P
• Measured speedup: 3.2x for out-of-core
[Figure: Local Dependencies for k=8. Locally dependent entries of [x, Ax, ..., A^8x] for tridiagonal A can be computed without communication: k=8-fold reuse of A]
[Figure: Type (1) Remote Dependencies for k=8. Remotely dependent entries of [x, Ax, ..., A^8x] for tridiagonal A: one message fetches all the data needed to compute them, not k=8 messages. Price: redundant work]
[Figure: Type (2) Remote Dependencies for k=8. Fewer remotely dependent entries of [x, Ax, ..., A^8x] for tridiagonal A: reduces the redundant work by half]
Latency-Avoiding Parallel Kernel for [x, Ax, A^2x, ..., A^kx]
• Compute locally dependent entries needed by neighbors
• Send data to neighbors, receive from neighbors
• Compute remaining locally dependent entries
• Wait for receive
• Compute remotely dependent entries
(A sketch of the communication-free local phase follows.)
Work by Demmel and Hoemmen
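A sketch of the local phase for tridiagonal A (names and layout are illustrative): v[0] holds x plus k ghost values on each side, already received; each step of the recurrence can fill one fewer entry at each boundary, so all of [Ax, ..., A^kx] on the local block needs no further messages:

    /* v[j] has length n + 2*k and will hold A^j x on the widened local block. */
    void matrix_powers_local(int n, int k,
                             const double *sub, const double *diag, const double *sup,
                             double **v) {
        for (int j = 1; j <= k; j++)              /* compute A^j x from A^(j-1) x */
            for (int i = j; i < n + 2 * k - j; i++)
                v[j][i] = sub[i]  * v[j-1][i-1]   /* tridiagonal A: three bands */
                        + diag[i] * v[j-1][i]
                        + sup[i]  * v[j-1][i+1];
    }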
Can use the matrix powers kernel, but must change the algorithms accordingly
Predictions and Conclusions
• Parallelism will explode
– The number of cores will double every ~2 years
– Petaflop (million processor) machines will be common in HPC by 2015 (all top 500 machines will have this)
• Performance will become a software problem
– Parallelism and locality are fundamental; we can save power by pushing these to software
• Locality will continue to be important
– On-chip to off-chip as well as node to node
– Need to design algorithms for what counts (communication, not computation)
• Massive parallelism required (including pipelining and overlap)
Conclusions
• Parallel computing is the future
• Re-think Hardware
– Hardware to make programming easier
– Hardware to support good performance tuning
• Re-think Software
– Software to make the most of hardware
– Software to ease programming
• Berkeley UPC compiler: http://upc.lbl.gov
• Titanium compiler: http://titanium.cs.berkeley.edu
• Re-think Algorithms
– Design for bottlenecks: latency and bandwidth
Concurrency for Low Power
• Highly concurrent systems are more power efficient (a numeric sketch follows)
– Dynamic power is proportional to V^2fC
– Increasing frequency (f) also increases supply voltage (V): a more-than-linear effect
– Increasing cores increases capacitance (C) but has only a linear effect
• Hidden concurrency burns power
– Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software to save power
– Compilers, library writers, and applications programmers
• Challenge: can you double the concurrency in your algorithms every 2 years?
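A numeric sketch of the V^2fC argument (all values made up for illustration): two cores at half the frequency, with the voltage scaled down to match, deliver the same nominal throughput for roughly half the power.

    #include <stdio.h>

    int main(void) {
        /* Baseline: one fast core, normalized V = f = C = 1 */
        double p_one = 1.0 * 1.0 * 1.0;            /* V^2 * f * C */

        /* Two cores at f/2; assume V can drop ~30% at the lower frequency,
           while capacitance doubles with the second core. */
        double V = 0.7, f = 0.5, C = 2.0;
        double p_two = V * V * f * C;              /* = 0.49 */

        printf("one fast core: %.2f, two slow cores: %.2f\n", p_one, p_two);
        /* Same nominal throughput (2 cores x f/2), about half the power --
           if software can supply the concurrency. */
        return 0;
    }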
LBMHD: Structured Grid Application
• Plasma turbulence simulation
• Two distributions:
– momentum distribution (27 components)
– magnetic distribution (15 vector components)
• Three macroscopic quantities:
– Density
– Momentum (vector)
– Magnetic field (vector)
• Must read 73 doubles and update (write) 79 doubles per point in space
• Requires about 1300 floating point operations per point in space
• Just over 1.0 flops/byte (ideal); checked in the sketch below
• No temporal locality between points in space within one time step
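A quick check of the flops/byte claim using the slide's own numbers:

    #include <stdio.h>

    int main(void) {
        double bytes = (73 + 79) * 8.0;   /* 73 doubles read + 79 written, 8 B each */
        double flops = 1300.0;            /* floating point ops per grid point */
        printf("ideal arithmetic intensity: %.2f flops/byte\n", flops / bytes);
        /* prints ~1.07: "just over 1.0 flops/byte", as the slide says */
        return 0;
    }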
Autotuned Performance (Cell/SPE version)
• First attempt at a Cell implementation
• VL, unrolling, and reordering fixed
• Exploits DMA and double buffering to load vectors
• Goes straight to SIMD intrinsics
• Despite the relative performance, Cell's DP implementation severely impairs performance
[Figure: autotuned LBMHD performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (*collision() only); stacked optimizations: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization]
Autotuned Performance (Cell/SPE version)
[Figure: the same four panels, annotated with achieved efficiency]
• Intel Clovertown: 7.5% of peak flops, 17% of bandwidth
• AMD Opteron: 42% of peak flops, 35% of bandwidth
• Sun Niagara2 (Huron): 59% of peak flops, 15% of bandwidth
• IBM Cell Blade (*collision() only): 57% of peak flops, 33% of bandwidth