bringing users along the road to billion way concurrency · 3/14/2011 2 nersc is the primary...
Post on 13-Jun-2020
0 Views
Preview:
TRANSCRIPT
3/14/2011
1
Bringing Users Along the Road to Billion Way Concurrency
Kathy YelickNERSC Di t L B k lNERSC Director, Lawrence Berkeley
National Laboratory
EECS Department, UC Berkeley
Cover Stories from NERSC Research
T Head-Gordon2010
Geddes 2009
V.Daggett2010
E. Bylaska2010
Dorland 2010
Sugiyama 2010
A. Aspden2009
NERSC is enabling new high quality science across disciplines, with over 1,600 refereed publications last year
2
Balbuena 2009
Wang 2009
Bonoli 2009
2009
Xantheas2009
Mavrikakis 2009
3/14/2011
2
NERSC is the Primary Computing Center for DOE Office of Science
•NERSC serves a large populationOver 3000 users, 400 projects, 500 codes
•NERSC Serves DOE SC Mission–Allocated by DOE program managers
–Not limited to largest scale jobs
–Not open to non-DOE applications
•Strategy: Science First–Requirements workshops by officey–Procurements based on science codes–Partnerships with vendors to meet
science requirements
3
Physics Math + CS AstrophysicsChemistry Climate CombustionFusion Lattice Gauge Life SciencesMaterials Other
NERSC Systems for Science
Large-Scale Computing Systems
Franklin (NERSC-5): Cray XT4• 38,128 cores (quad core), ~25 Tflop/s on applications; 356 Tflop/s peak
Hopper (NERSC-6): Cray XE6 • Phase 1: Cray XT5, 668 nodes, 5344 cores• Phase 2: > 1 Pflop/s peak (late 2010), 24-core nodes
NERSC Global Filesystem (NGF)
• Uses IBM’s GPFS• 1 5 PB; 5 5 GB/s
Clusters105 Tflops combined
Carver
Analytics
4
HPSS Archival Storage• 40 PB capacity• 4 Tape libraries• 150 TB disk cache
1.5 PB; 5.5 GB/sCarver• IBM iDataplex cluster
PDSF (HEP/NP)• Linux cluster (~1K cores)
Magellan Cloud testbed
• IBM iDataplex cluster
• Euclid (512 GB shared memory)
• Dirac GPU testbed (48 nodes. Fermi)
3/14/2011
3
NERSC Roadmap
107
106 NERSC-8
NERSC-91 EF Peak
105
104
103
102 Franklin (N5)Franklin (N5) +QC36 TF Sustained
Hopper (N6)>1 PF Peak
NERSC-710 PF Peak
100 PF Peak
Pea
k Te
raflo
p/s
10
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
( )19 TF Sustained101 TF Peak
36 TF Sustained352 TF Peak
5
Users expect 10x improvement in capability every 3-4 years
Challenges to Exascale
• System power is the primary constraint for exascale (MW ~= $M)
• Concurrency (1000x today) driven by system power and density
• Memory bandwidth and capacity are under pressure
• Processor architecture is an open question but heterogeneity is likely• Processor architecture is an open question, but heterogeneity is likely
• Algorithms need to be designed to minimize data movement, not flops
• Programming model memory & concurrency new chip-level model
• Reliability and resiliency will be critical at this scale
• I/O bandwidth unlikely to keep pace with machine speed
Why are these NERSC problems?Why are these NERSC problems?• All are challenges for 100 “capacity” 100PF machines, except:
– System wide outages and bisection bandwidth• NERSC needs to guide the transition of its user community• NERSC represents broad HPC market better than any other DOE center• Hardware revolutions affect procurement strategies (Giga->Tera)
3/14/2011
4
NERSC Response to Exascale
• Vendor partnerships
• NERSC workload and procurements• NERSC workload and procurements
• Testbeds & training
• Anticipating and changing the future– Programming models
– AutotuningAutotuning
– Algorithms
– Co-Design
7
Energy Efficiency Partnerships with Synapsense and IBM
• Monitoring for energy efficiency (and reliability!)
600 Sensors for temperature, etc. Rear door heat exchangers
• Liquid cooling on IBM system uses return water from another system, with modified CDU design– Reduces cooling costs to as much as ½
– Reduces floor space requirements by 30%
Air is colder coming out than going in!8
3/14/2011
5
Center of Excellence with Cray
What should we tell NERSC users to do ?
Multicore Era: Massive on-chip concurrency necessary for reasonable power use
NERSC/Cray “Programming Models Center of Excellence” combines:
• Berkeley Lab strength in advanced programming models, multicore tuning, and application benchmarking
• Cray strength in advanced programming models, optimizing compilers, and benchmarking
reasonable power use
Immediate question: What is the best way to use Hopper node?• Flat MPI - Today’s preferred mode of operation• MPI + OpenMP, MPI + pthreads, MPI + PGAS, MPI+OpenCL, PGAS…
Paratec MPI+OpenMP Performance
18002000
Wallclock FFT "DGEMM"
200400600800
10001200140016001800
Tim
e / s G
OOD
0200
1 2 3 6 12
768 384 256 128 64OpenMP threads / MPI tasks
3/14/2011
6
fvCAM MPI+OpenMP Performance
600700800
Wallclock Dynamics Physics
0100200300400500600
Tim
e / s
GOOD
1 2 3 6 12
240 120 80 40 20
OpenMP threads / MPI tasks
GTC MPI+OpenMP Performance
30003500
pusher shift charge poisson total
50010001500200025003000
Tim
e / S
ecs
GOOD
0
1 2 4 6 12
1536 768 384 256 128OpenMP threads / MPI tasks
3/14/2011
7
12
14
B)
Less Memory Usage with OpenMPCompared to Flat MPI
12
14
GTC f CAM
4
6
8
10
Mem
ory
per
No
de
(GB
4
6
8
10
Mem
ory
per
no
de
/ GB
GTC fvCAM
0
2
1 2 3 6 12
240 120 80 40 20
M
OpenMP threads / MPI tasks
0
2
1 2 3 4 6 12
96 48 32 24 16 8
OPENMP threads / MPI tasks
Current Procurement Strategy
Use Science Applications to Measure System Performance!
NERSC-6 “Sustained System Performance (SSP)” BenchmarksPerformance (SSP) Benchmarks
CAM Climate
GAMESSQuantum Chemistry
GTCFusion
IMPACT-TAccelerator
Physics
MAESTROAstro-
physics
MILCNuclearPhysics
PARATECMaterialScience
Peak FLOPs offer little insight into delivered application performance
14
Peak FLOPs offer little insight into delivered application performanceSSP applications model delivered performance of REAL workload
“Benchmarks are only useful insofar as they model the intended workload”
Ingrid Bucher (LANL 1978)
3/14/2011
8
Algorithm Diversity
Science areasDense linear
algebra
Sparse linear
algebra
Spectral Methods (FFT)s
Particle Methods
Structured Grids
Unstructured or AMR Grids
AcceleratorScience XX XX XX XX XXScience
Astrophysics XX XX XX XX XX XX
Chemistry XX XX XX XX
Climate XX XX XX
Combustion XX XX
F i XX XX XX XX XXFusion XX XX XX XX XX
Lattice Gauge XX XX XX XX
Material Science XX XX XX XX
NERSC users require a system which performs adequately in all areas
Numerical Methods at NERSC(Caveat: survey data from ERCAP requests)
35%
Methods at NERSCPercentage of 400 Total Projects
5%
10%
15%
20%
25%
30%
16
0%
5%
3/14/2011
9
Revolutions Require a New Strategy
• Preserve Application Performance as goal– But, allow significant optimizations
– Produced optimized versions of some codes– Produced optimized versions of some codes
– Understand performance early
• E.g., for Fermi-based GPU system– Fermi is nearly as expensive as host node
– 48 nodes w/Fermi or 2x more nodes without
– Minimum: Fermi must be 2x faster than hostMinimum: Fermi must be 2x faster than host
• Estimating value of acceleration for SSP– Estimate “best possible performance”
– Successive refinement of estimates
– Stop if estimate < min threshold (2x host)17
Dirac Testbed at NERSC
• Dirac is a 48 nodes GPU cluster
• 44 Fermi nodes– 3GB memory per card
– @~144GB/s bandwidth + ECC
– ~1TF peak single precision and 500 GF/s DP
– 24 GB of memory per node
4 T l d
Paul Dirac, Nobel prize-winning Theoretical Physicist
• 4 Tesla nodes
• QDR Infiniband network
3/14/2011
10
Ri-MP2: Molecular Equilibrium Structure
• Goal: Optimize Q-Chem RI-MP2 Routine– Start with “step 4” kernel, which
dominates the running timeObservations– Observations– Impressive single node speedups– Performance results highly dependent
on molecular structureFermi GPU Racks - NERSC
x 18.8x 7.4
x 1.7
x 12.3
Jihan Kim (SciDAC-E NERSC Postdoc)
GTC (Fusion) Charge Deposition Performance
FermiTesla
Optimizing GTC (Fusion code) Charge Deposition for Multicore and GPUs
K. Ibrahim, K. Madduri, L. Oliker, S. Williams, E.-J. Im, E. Strohmaier, S. Ethier, J. Shalf
A2 A5 A10 A20 B2 B5configuration
3/14/2011
11
Fermi Memory Bandwidth
Anticipating and Influencing the Future
22
3/14/2011
12
Autotuning: Write Code Generators for Nodes
+Zy-1
z+1
1
Nearest-neighbor 7point stencil on a 3D array
Use Autotuning!
3D Grid
+Y
+X
7-point nearest neightbors
y+1
y
x-1
z-1
x+1x,y,z
Write code generators and let computers do
tuning
23
Example pattern-specific compiler: Structured grid in Ruby
• Ruby class encapsulates SG pattern
•class LaplacianKernel < Kernel• def kernel(in_grid, out_grid)• in_grid.each_interior do |point|• in_grid.neighbors(point,1).each
do |x|p– body of anonymous
lambda specifies filter function
• Code generator produces OpenMP for m lticore 86
do |x|• out_grid[point] += 0.2*x.val• end• end•end
•VALUE kern_par(int argc, VALUE* argv, VALUE self) {•unpack_arrays into in_grid and out_grid;
•#pragma omp parallel for default(shared) multicore x86– ~1000-2000x faster than
Ruby
– Minimal per-call runtime overhead
#pragma omp parallel for default(shared) private (t_6,t_7,t_8)•for (t_8=1; t_8<256-1; t_8++) {• for (t_7=1; t_7<256-1; t_7++) {• for (t_6=1; t_6<256-1; t_6++) {• int center = INDEX(t_6,t_7,t_8);• out_grid[center] = (out_grid[center]• +(0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));• ...• out_grid[center] = (out_grid[center]•+(0 2*in grid[INDEX(t 6 t 7 t 8+1)]));
3/14/2011
13
PGAS Languages: Why use 2 Languages (MPI+X) when 1 will do?
• Global address space: thread may directly read/write remote data • Partitioned: data is designated as local or global
Glo
bal
ad
dre
ss s
pac
e
x: 1y:
l: l: l:
g: g: g:
x: 5y:
x: 7y: 0
p0 p1 pnp p p
• Remote put and get: never have to say “receive” • No less scalable than MPI! • Permits sharing, whereas MPI rules it out!• Gives affinity control, useful on shared and distributed memory
Hybrid Partitioned Global Address Space
•Shared Segment on Host
•Shared Segment on GPU M
•Shared Segment on Host M
•Shared Segment on GPU M
•Shared Segment on Host M
•Shared Segment on GPU M
•Shared Segment on Host M
•Shared Segment on GPU M
•Local Segment on Host •Memory
•Processor 1
Memory
•Local Segment on GPU •Memory
•Local Segment on Host •Memory
•Processor 2
•Local Segment on GPU •Memory
•Local Segment on Host •Memory
•Processor 3
•Local Segment on GPU •Memory
•Local Segment on Host •Memory
•Processor 4
•Local Segment on GPU •Memory
Each thread has only two shared segments
Memory Memory Memory Memory Memory Memory Memory
Each thread has only two shared segments Decouple the memory model from execution models; one
thread per CPU, vs. one thread for all CPU and GPU “cores” Caveat: type system and therefore interfaces blow up with
different parts of address space
3/14/2011
14
GASNet GPU Extension Performance
Latency Bandwidth
•Good
•Good
•Goo
•Goo
dd dd
Communication-Avoiding Algorithms
• Consider Sparse Iterative Methods• Nearest neighbor communication on a mesh• Dominated by time to read matrix (edges) from DRAM• Dominated by time to read matrix (edges) from DRAM• And (small) communication and global
synchronization events at each step
Can we lower data movement costs?• Take k steps “at once” with one matrix read
from DRAM and one communication phaseParallel implementation– Parallel implementation
O(log p) messages vs. O(k log p)
– Serial implementationO(1) moves of data moves vs. O(k)
Joint work with Jim Demmel, Mark Hoemman, Marghoob Mohiyuddin
3/14/2011
15
“Monomial” basis [Ax,…,Akx]
fails to converge
• A different polynomial basis does
Know your mathematics!
A different polynomial basis does converge
Communication-Avoiding GMRES on 8-core Clovertown
3/14/2011
16
Co-Design Before its Time
• Demonstrated during SC ‘09• CSU atmospheric model ported to
low-power core designD l C T ili i– Dual Core Tensilica processors running atmospheric model at 25MHz
– MPI Routines ported to custom Tensilica Interconnect
• Memory and processor Stats available for performance analysis
• Emulation performance advantage250x Speedup over merely function
Icosahedral mesh for algorithm scaling
– 250x Speedup over merely function software simulator
• Actual code running - not representative benchmark
General Lessons
• Early intervention with hardware designs
• Optimize for what is important: Opt e o at s po ta t
energy data movement
• Anticipating and changing the future– Programming models
– Autotuning
– AlgorithmsAlgorithms
– Co-Design
32
3/14/2011
17
NERSC Aggressive Roadmap
107
106
•Exascale + ???
•NERSC-8
•NERSC-9•1 EF Peak
105
104
103
102 •COTS/MPP + MPI (+ OpenMP)
•GPU CUDA/OpenCL•Or Manycore BG/Q, R
•Franklin (N5)•Franklin (N5) +QC•36 TF Sustained
•Hopper (N6)•>1 PF Peak
•NERSC-7•10 PF Peak
•100 PF Peak
•Pea
k Te
raflo
p/s
10
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
•COTS/MPP + MPI
( )•19 TF Sustained•101 TF Peak
•36 TF Sustained•352 TF Peak
33
•Users expect 10x improvement in capability every 3-4 years
•Three continents, three institutions
•UC Berkeley/LBNL/NERSC
•University of Heidelberg and
•National Astronomical Observatories (CAS)
•Horst Simon•Hemant Shukla
•John Shalf•ICCS Projects John Shalf•Rainer Spurzem
•ICCS Activities•Summer School Aug 2-6, 2010
•Proven Algorithmic Techniques for Many-core Processors
•Wen-Mei Hwu (UIUC) and David Kirk (NVIDIA)
•ISAAC is a three-year (2010-2013) NSF funded project to focus on research and development of infrastructure for accelerating physics and astronomy applications using and multicore architectures.
•Goal is to successfully harness the power of the parallel architectures for compute-intensive scientific problems and open
•ISAAC•Infrastructure for Astrophysics Applications Computing
j
•GRACE II
•SILK ROAD
•Workshop Nov 30 – Dec 2, 2009
•Many-core and Accelerator-based Computing for Physics and Astronomy Applications
•
p p pdoors for new discovery and revolutionize the growth of science via, Simulations, Instrumentations and Data processing /analysis
•Visit us – http://iccs.lbl.gov
3/14/2011
18
Performance Has Also Slowed, Along with Power
1.E+06
1.E+07
Transistors (in Thousands)
Moore’s Law Continues with core doubling
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05Frequency (MHz)
Power (W)
Perf
Cores
35
1.E-01
1.E+00
1970 1975 1980 1985 1990 1995 2000 2005 2010
•Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanoviç
Memory is Not Keeping Pace
•Technology trends against a constant or increasing memory per core• Memory density is doubling every three years; processor logic is every two
• Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs
•Source: David Turek, IBM
•Cost of Computation vs. Memory
•Source: IBM
36
• Question: Can you double concurrency without doubling memory?
3/14/2011
19
How to make use of 100,000 (or more!) cores?
37
7 Point Stencil Revisited
38
• Cell and GTX280 are notable for both performance and energy efficiency
•Joint work with Kaushik Datta, Jonathan Carter, Shoaib Kamil, Lenny Oliker, John Shalf, and Sam Williams
3/14/2011
20
#2: Understand your machine limits
The “roofline” model
S. Williams, D. Patterson, L. Oliker, J. Shalf, K. Yelick
39
The Roofline Performance Model
• The top of the roof is determined by peak computation rate
peak DP64.0
128.0
256.0 Generic Machine
computation rate (Double Precision floating point, DP for these algorithms)
• The instruction mix, lack of SIMD operations, ILP or failure to use other
mul / add imbalance
w/out SIMD
w/out ILP
atta
inab
le G
flop
/s
4.0
8.0
16.0
32.0
64.0
features of peak will lower attainable
0.5
1.0
1/8
actual flop:byte ratio
2.0
1/41/2 1 2 4 8 16
3/14/2011
21
The Roofline Performance Model
peak DP64.0
128.0
256.0 Generic Machine The sloped part of the
roof is determined by peak DRAM bandwidth
mul / add imbalance
w/out SIMD
w/out ILP
atta
inab
le G
flop
/s
4.0
8.0
16.0
32.0
64.0 p(STREAM)
Lack of proper prefetch, ignoring NUMA, or other things will reduce attainable bandwidth
0.5
1.0
1/8
actual flop:byte ratio
2.0
1/41/2 1 2 4 8 16
The Roofline Performance Model
peak DP64.0
128.0
256.0 Generic Machine Locations of posts in the
building are determined by algorithmic intensity
mul / add imbalance
w/out SIMD
w/out ILP
atta
inab
le G
flop
/s
4.0
8.0
16.0
32.0
64.0 y g y Will vary across
algorithms and with bandwidth-reducing optimizations, such as better cache re-use (tiling), compression techniques
0.5
1.0
1/8
actual flop:byte ratio
2.0
1/41/2 1 2 4 8 16
3/14/2011
22
Roofline model for Stencil(out-of-the-box code)
Large datasets 2 unit stride streams No NUMA Littl ILPbl
e G
flop
/s
16
32
64
128
ble
Gfl
op/s
16
32
64
128peak DP
mul/add imbalance
peak DP
w/out SIMD
mul/add imbalance
Opteron 2356(Barcelona)
Intel Xeon E5345(Clovertown)
w/out SIMD
Little ILP No DLP Far more adds than
multiplies (imbalance) Ideal flop:byte ratio 1/3
High locality requirements
Capacity and conflict
1
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratioat
tain
ab
4
8
1/81/4
1/2 1 2 4 8
64
128
64
128
w/out ILP
Sun T2+ T5140(Victoria Falls)
w/out ILP
IBM QS20Cell Blade
Capacity and conflict misses will severely impair flop:byte ratio
1
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 8
25% FP
peak DP
12% FP
w/out FMA
peak DP
w/out ILP
w/out SIMD
No naïve SPEimplementation
Roofline model for Stencil(out-of-the-box code)
Large datasets 2 unit stride streams No NUMA Littl ILPbl
e G
flop
/s
16
32
64
128
ble
Gfl
op/s
16
32
64
128peak DP
mul/add imbalance
peak DP
w/out SIMD
mul/add imbalance
Opteron 2356(Barcelona)
Intel Xeon E5345(Clovertown)
w/out SIMD
Little ILP No DLP Far more adds than
multiplies (imbalance) Ideal flop:byte ratio 1/3
High locality requirements
Capacity and conflict
1
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 8
64
128
64
128
w/out ILP
Sun T2+ T5140(Victoria Falls)
w/out ILP
IBM QS20Cell Blade
Capacity and conflict misses will severely impair flop:byte ratio
1
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 8
25% FP
peak DP
12% FP
w/out FMA
peak DP
w/out ILP
w/out SIMD
No naïve SPEimplementation
3/14/2011
23
Roofline model for Stencil(NUMA, cache blocking, unrolling, prefetch, …)
Cache blocking helps ensure flop:byte ratio is as close as possible to 1/3
Clovertown has huge ble
Gfl
op/s
16
32
64
128
ble
Gfl
op/s
16
32
64
128peak DP
mul/add imbalance
peak DP
w/out SIMD
mul/add imbalance
Opteron 2356(Barcelona)
Intel Xeon E5345(Clovertown)
w/out SIMD
C ove tow as ugecaches but is pinned to lower BW ceiling
Cache management is essential when capacity/thread is low
1
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratioat
tain
ab
4
8
1/81/4
1/2 1 2 4 8
64
128
64
128
w/out ILP
Sun T2+ T5140(Victoria Falls)
w/out ILP
IBM QS20Cell Blade
1
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 8
25% FP
peak DP
12% FP
w/out FMA
peak DP
w/out ILP
w/out SIMD
No naïve SPEimplementation
Roofline model for Stencil(SIMDization + cache bypass)
Make SIMDization explicit
Use cache bypass instruction: movntpdbl
e G
flop
/s
16
32
64
128
ble
Gfl
op/s
16
32
64
128peak DP
mul/add imbalance
peak DP
w/out SIMD
mul/add imbalance
Opteron 2356(Barcelona)
Intel Xeon E5345(Clovertown)
w/out SIMD
st uct o : ov tpd Increases flop:byte
ratio to ~0.5 on x86/Cell
1
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
4
8
1/81/4
1/2 1 2 4 8
64
128
64
128
w/out ILP
Sun T2+ T5140(Victoria Falls)
w/out ILP
IBM QS20Cell Blade
1
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 81
2
1/16
flop:DRAM byte ratio
atta
inab
le G
flop
/s
4
8
16
32
1/81/4
1/2 1 2 4 8
25% FP
peak DP
12% FP
w/out FMA
peak DP
w/out ILP
w/out SIMD
3/14/2011
24
#3) Write Code Generators Rather Than Code (LBMHD)
Intel Clovertown AMD Opteron LBMHD is not always bandwidth limited: used SIMD, etc.
Sun Niagara2 (Huron) IBM Cell Blade* +SIMDization
+SW Prefetching
+Unrolling
+Vectorization
+Padding
Naïve+NUMA
•Joint work with Sam Williams, Lenny Oliker, John Shalf, and Jonathan Carter
#4) Used Optimized Librarites: ( Sparse Matrix Vector Multiplication)
• Sparse Matrix– Most entries are 0.0– Performance advantage in only
i / i hstoring/operating on the nonzeros– Requires significant meta data
• Evaluate y=Ax– A is a sparse matrix– x & y are dense vectors
• Challenges– Difficult to exploit ILP(bad for superscalar)
A x y
•Protein•FEM /
•Spheres•FEM /
•Cantilever
48
Difficult to exploit ILP(bad for superscalar), – Difficult to exploit DLP(bad for SIMD)– Irregular memory access to source vector– Difficult to load balance– Very low computational intensity (often >6 bytes/flop)
= likely memory bound
•FEM /•Accelerator
•Circuit •webbase
3/14/2011
25
Extra Work Can Improve Efficiency!
• Example: 3x3 blockingp g– Logical grid of 3x3 cells
– Fill-in explicit zeros
– Use only 1 index per block, rather than per nonzero
– Unroll 3x3 block multiplies
– “Fill ratio” = 1.5
• On Pentium III: 1.5x speedup!– (Actual mflop rate = 2.25
higher)
Naïve Parallel Implementation
• SPMD style
• Partition by rows
• Load balance by nonzeros
AMD OpteronIntel Clovertown
• N2 ~ 2.5x x86 machine
IBM Cell Blade (PPEs)Sun Niagara2 (Huron)
50
Naïve Pthreads
Naïve
3/14/2011
26
• SPMD style
• Partition by rows
• Load balance by nonzeros
Naïve Parallel Implementation
AMD OpteronIntel Clovertown
8x cores = 1.9x performance8x cores = 1.9x performance• N2 ~ 2.5x x86 machine
IBM Cell Blade (PPEs)Sun Niagara2 (Huron)
pp
4x cores = 1.5x performance4x cores = 1.5x performance
64x threads = 41x performance64x threads = 41x performance
51
Naïve Pthreads
Naïve
pp
4x threads = 3.4x performance4x threads = 3.4x performance
• SPMD style
• Partition by rows
• Load balance by nonzeros
Naïve Parallel Implementation
AMD OpteronIntel Clovertown
1.4% of peak flops
29% of bandwidth
1.4% of peak flops
29% of bandwidth 4% of peak flops
20% f b d idth
4% of peak flops
20% f b d idth • N2 ~ 2.5x x86 machine
IBM Cell Blade (PPEs)Sun Niagara2 (Huron)
20% of bandwidth20% of bandwidth
25% of peak flops
39% f b d idth
25% of peak flops
39% f b d idth
52
Naïve Pthreads
Naïve
39% of bandwidth39% of bandwidth
2.7% of peak flops
4% of bandwidth
2.7% of peak flops
4% of bandwidth
3/14/2011
27
Autotuned Performance(+DIMMs, Firmware, Padding)
• Clovertown was already fully populated with DIMMs
• Gave Opteron as many DIMMs as Clovertown
• Firmware update for
AMD OpteronIntel Clovertown
Firmware update for Niagara2
• Array padding to avoid inter-thread conflict misses
• PPE’s use ~1/3 of Cell chip area
IBM Cell Blade (PPEs)Sun Niagara2 (Huron)+More DIMMs(opteron), +FW fix, array padding(N2), etc…
53
+Cache/TLB Blocking
+Compression
+SW Prefetching
+NUMA/Affinity
Naïve Pthreads
Naïve
Autotuned Performance(+Cell/SPE version)
• Wrote a double precision Cell/SPE version
• DMA, local store blocked, NUMA aware, etc…
•AMD Opteron•Intel Clovertown
etc…• Only 2x1 and larger
BCOO• Only the SpMV-proper
routine changed
• About 12x faster (median) than using the PPEs alone. •IBM Cell Blade (SPEs)•Sun Niagara2 (Huron)
•+More DIMMs(opteron), •+FW fix, array padding(N2), etc…
54
•+Cache/TLB Blocking
•+Compression
•+SW Prefetching
•+NUMA/Affinity
•Naïve Pthreads
•Naïve
3/14/2011
28
• Wrote a double precision Cell/SPE version
• DMA, local store blocked, NUMA aware, etc…
Autotuned Performance(+Cell/SPE version)
AMD OpteronIntel Clovertown
4% of peak flops
52% of bandwidth
4% of peak flops
52% of bandwidth20% of peak flops
65% of bandwidth
20% of peak flops
65% of bandwidth etc…• Only 2x1 and larger
BCOO• Only the SpMV-proper
routine changed
• About 12x faster than using the PPEs alone.
IBM Cell Blade (SPEs)Sun Niagara2 (Huron)+More DIMMs(opteron), +FW fix, array padding(N2), etc…
54% of peak flops
57% f b d idth
54% of peak flops
57% f b d idth 40% of peak flops40% of peak flops
55
+Cache/TLB Blocking
+Compression
+SW Prefetching
+NUMA/Affinity
Naïve Pthreads
Naïve
57% of bandwidth57% of bandwidth 40% of peak flops
92% of bandwidth
40% of peak flops
92% of bandwidth
MPI vs. Threads
• On x86 machines, autotuned(OSKI) shared memory MPICH implementation rarely scales beyond 2 threads
•AMD Opteron•Intel Clovertown
• Still debugging MPI issues on Niagara2, but so far, it rarely scales beyond 8 threads.
•Sun Niagara2 (Huron)
56
•Autotuned pthreads
•Autotuned MPI
•Naïve Serial
3/14/2011
29
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user’s matrix & machine– BLAS-style functionality: SpMV, Ax & ATy, TrSV
– Hides complexity of run-time tuning
– Includes new, faster locality-aware kernels: ATAx, Akx
• Faster than standard implementations– Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x
• For “advanced” users & solver library writersy– Available as stand-alone library (OSKI 1.0.1h, 6/07)
– Available as PETSc extension (OSKI-PETSc .1d, 3/06)
– Bebop.cs.berkeley.edu/oski
Programming Model for Multicore
• These autotuned implementations, use:– Fixed set of threads (pthreads)
• “Parallel all the time”
– Shared memory• Avoid unnecessarily replication
– Logically partitioned memory with affinity• Avoid unnecessary cache coherence traffic
• What programming model offers these?
58
3/14/2011
30
PGAS Languages: Why use 2 Languages (MPI+X) when 1 will do?
• Global address space: thread may directly read/write remote data • Partitioned: data is designated as local or global
Glo
bal
ad
dre
ss s
pac
e
x: 1y:
l: l: l:
g: g: g:
x: 5y:
x: 7y: 0
p0 p1 pnp p p
• Remote put and get: never have to say “receive” • Remote function invocation? See HPCS languages
• No less scalable than MPI! • Permits sharing, whereas MPI rules it out!• One model rather than two, but if you insist on two:
• Can call UPC from MPI and vice verse (tested and used)
#5) Used Lightweight Communication
8-byte Roundtrip Latency
Use a programming model in which you can utilize bandwidth and “low” latency
Flood Bandwidth for 4KB messages
14.6
22.1
9.6 9.5
18.5
24.2
13.5
17.8
8 310
15
20
25
trip
La
ten
cy
(us
ec
)
MPI ping-pong
GASNet put+sync
Flood Bandwidth for 4KB messages
547
420
190
702
152
750
714231
763223
679
40%
50%
60%
70%
80%
90%
100%
rce
nt
HW
pe
ak
MPIGASNet
6.66.6
4.5
8.3
0
5
Elan3/Alpha Elan4/IA64 Myrinet/x86 IB/G5 IB/Opteron SP/Fed
Ro
un
dt
252
0%
10%
20%
30%
Elan3/Alpha Elan4/IA64 Myrinet/x86 IB/G5 IB/Opteron SP/Fed
Pe
r
Joint work with Berkeley UPC Group
3/14/2011
31
Two-sided vs One-sided Communication
•message id •data payload
one sided put message
•two-sided message
•network
•host•CPU
• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify memory address to put data
•address •data payload
•one-sided put message• interface
•memory
p– Wildly popular in HPC, but cumbersome in some applications– Couples data transfer with synchronization
• Using global address space decouples synchronization– Pay for what you need! – Note: Global Addressing ≠ Cache Coherent Shared memory
Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group
Case Study Update: NAS FT
• Perform a large 3D FFT
– Represents bisection-bandwidth limited algorithms
• Builds on our previous work, but with a 2D partition
– Requires two rounds of communication rather than oneq
– Each processor communicates with O(√T) threads
• Leverage nonblocking communication– Packed: minimize messages, no overlap
– Slab: no packing, one message per plane per “other” thread
62
3/14/2011
32
FFT Performance on BlueGene/PHPC Challenge Peak as of July 09 is ~4.5 Tflops on 128k Cores
PGAS implementations consistently outperform MPI
Leveraging communication/computation 3500communication/computation overlap yields best performance More collectives in flight
and more communication leads to better performance
At 32k cores, overlap algorithms yield 17% improvement in overall
1500
2000
2500
3000
GF
lop
s
SlabsSlabs (Collective)Packed Slabs (Collective)MPI Packed Slabs
improvement in overall application time
Numbers are getting close to HPC record Future work to try to beat
the record0
500
1000
256 512 1024 2048 4096 8192 16384 32768
Num. of Cores
63
GOOD
GOOD
GOOD
FFT Performance on Cray XT4• 1024 Cores of the Cray XT4
– Uses FFTW for local FFTs
– Larger the problem size the more effective the overlap
GOOD
GOOD
GOOD
64
3/14/2011
33
#6) Avoid Synchronization
Computations as DAGsView parallel executions as the directed acyclic graph of the computation
65
CholeskyCholesky
4 x 44 x 4
QRQR
4 x 44 x 4
Slide source: Jack Dongarra
Parallel LU Factorization
Blocks 2Dblock-cyclicdistributedCompleted part of U
Panel factorizationsinvolve communicationfor pivoting Matrix-
matrixmultiplicationused here.Can be coalesced
Com
pleted part o
A(i,j) A(i,k)
A(j,i) A(j,k)
Trailing matrixof L
Trailing matrixto be updated
Panel being factored
3/14/2011
34
Event Driven Execution of Dense LU
• Ordering needs to be imposed on the schedule• Critical operation: Panel Factorization
– need to satisfy its dependencies firsty p– perform trailing matrix updates with low block numbers first– “memory constrained” lookahead
• General issue: dynamic scheduling in partitioned memory– Can deadlock memory allocator!
some edges omitted
DAG Scheduling Outperforms Bulk-Synchronous Style
UPC vs.
PLASMA on shared memory UPC on partitioned memory
ScaLAPACK
0
20
40
60
80
2x4 pr oc gr i d 4x4 pr oc gr i d
GFlo
ps
ScaLAPACK
UPC
68
• UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding– New problem in partitioned memory: allocator deadlock– Can run on of memory locally due tounlucky execution order
PLASMA by Dongarra et al; UPC LU joint with Parray Husbands
3/14/2011
35
#7) Use Scalable Algorithms
• Algorithmic gains in last decade have far outstripped Moore’s Law
–Adaptive meshesrather than uniform
–Sparse matrices rather than dense
–Reformulation of problem back to basics
• Algorithmic gains have outstripped Moore’s Lawoutst pped oo e s a
• Example of canonical “Poisson” problem on n points:–Dense LU: most general, but O(n3) flops on O(n2) data–Multigrid: fastest/smallest, O(n) flops on O(n) data
Performance results: John Bell et al
Conclusions
• Use your communication systems effectivelyy– On-chip communication between cores
– Communication between processor and memory
– Communication between sockets
Communication between nodes– Communication between nodes
• Work with experts on hardware, software, algorithms, applications
70
3/14/2011
36
More Info
• The Berkeley View/Parlab– http://view.eecs.berkeley.edu
http://parlab eecs berkeley edu/– http://parlab.eecs.berkeley.edu/
• Berkeley Autotuning and PGAS projects– http://bebop.cs.berkeley.edu– http://upc.lbl.gov– http://titanium.cs.berkeley.edu
• NERSC System Architecture Groupy p– http://www.nersc.gov/projects/SDSA
• LBNL Future Technologies Grouphttp://crd.lbl.gov/ftg
top related