The Potential of the Cell Processor
for Scientific Computing
Leonid Oliker
Samuel Williams, John Shalf, Shoaib Kamil, Parry Husbands, Katherine Yelick
Computational Research Division
Lawrence Berkeley National Laboratory
Motivation
Stagnating application performance is a well-known problem in scientific computing
By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels
Many HEC platforms are poorly balanced for the demands of leading applications
• Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology
Traditional superscalar trends are slowing down: most benefits of ILP and pipelining have been mined, and clock frequency is limited by the power wall
The specialized HPC market cannot support huge technology investments
The HPC community is looking for alternative high-performance architectures with a healthy market outside scientific computing (a multibillion-dollar market)
The sophistication of gaming technology is demanding more floating-point-intensive computation
However, it is ultimately limited by the level of human perception, not scientific fidelity
The recently released Cell processor has tremendous computational capability
This work examines four key scientific algorithms on Cell
Cell will be used in the PS3: compelling because it will be produced in high volume
A radical departure from conventional designs, including the XBOX 360's Xenon
            Cell/PS3                         XBOX 360
Design      Heterogeneous                    Homogeneous
Cores       PPC + 8 SIMD cores               3 x PPC
Memory      Software-controlled              Conventional cache-based
            memory architecture              memory hierarchy (1MB)
Die area    221mm² (+30%)                    168mm²
Key limiting factors on current platforms: off-chip memory bandwidth and power usage
Memory hurdles: latency and bandwidth utilization
Homogeneous cache-based CMPs do not address these deficiencies
Software-controlled memory improves memory bandwidth and power usage:
• Allows finely tuned deep prefetching
• Double buffering of data hides memory latency, potentially fully utilizing bandwidth
• More efficient cache-utilization policies (smaller "caches" required)
• Memory fetching takes advantage of application-level information
• More predictable performance for performance modeling, real-time systems, etc.
• Less architectural complexity vs. automatic caching memory = less power
However, software-controlled memory increases programming complexity
Introduction to Cell
Cell Processor Architecture: all units connected via the EIB (4 x 128b rings @ 1.6GHz)
PPC core @ 3.2GHz
8 x SPEs (128b SIMD cores): off-load engines, somewhere between a processor and a coprocessor
• 128 x 128b single-cycle register file
• 256KB local store (16K x 128b), 6-cycle latency
• Private address space; access to global store via DMA
• No unaligned access (only via permutes)
• Dual SIMD issue (private PC): one arithmetic/float/etc., one load/store/permute/branch/channel
• Statically scheduled, in-order execution; 7-cycle pipelines
• 3W @ 3.2 GHz
Memory controller (25.6GB/s dual XDR)
~40W total if the PPC is idle
Cell architecture benefits include:
Reduced pipeline length
• Lower branch mispredict penalty
• Lower memory latency
Numerous in-flight DMAs
• Hide memory latency
• Utilize available memory BW
• Overlap computation with communication
Impressive performance/power ratio
Microarchitectural Issues
PPE (PowerPC Processing Element)
512KB cache: coherent with DMAs, not Local Store (LS)
Dual thread, dual issue, in order
VMX unit + scalar DP FMA
SPE (Synergistic Processing Element): 7-cycle, in-order, dual SIMD pipelines
Single Precision
• 4 FMA datapaths, 6-cycle latency
• 25.6 Gflop/s per SPE, 204.8 Gflop/s overall peak
Double Precision
• 1 FMA datapath, 13-cycle latency
• The 13-cycle pipeline doesn't fit in a 7-cycle forwarding network, so a 6-cycle stall is inserted after each issue for correctness: one DP issue every 7 cycles (1/14th of SP peak) = 1.83 Gflop/s per SPE, 14.6 Gflop/s overall peak
• Dual issuing of DP instructions is prohibited; for streaming apps (loads, permutes, etc.) one issue every 8 cycles = 1.6 Gflop/s per SPE (12.8 Gflop/s overall)
• DP is obviously not a first-class citizen in the Cell microarchitecture (see the worked arithmetic after this list)
Software managed branch hints
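To make the peak rates above concrete, here is the arithmetic behind them as a small, self-contained C program. The only inputs are the figures quoted on this slide (3.2 GHz clock, a 4-wide SP FMA every cycle, a 2-wide DP FMA once every 7 cycles); this is a worked example, not code from the study:

```c
/* Worked arithmetic for the SPE peak rates quoted above. Inputs are
 * the slide's figures: 3.2 GHz clock, a 4-wide SP FMA every cycle,
 * and a 2-wide DP FMA once every 7 cycles. */
#include <stdio.h>

int main(void) {
    double clock = 3.2e9;
    /* SP: 4 lanes x 2 flops (FMA = multiply + add) per cycle */
    double sp_spe = clock * 4 * 2;              /* 25.6 Gflop/s  */
    /* DP: 2 lanes x 2 flops, one issue per 7 cycles */
    double dp_spe = (clock / 7.0) * 2 * 2;      /* ~1.83 Gflop/s */
    printf("SP: %.1f Gflop/s per SPE, %.1f Gflop/s for 8 SPEs\n",
           sp_spe / 1e9, 8 * sp_spe / 1e9);
    printf("DP: %.2f Gflop/s per SPE, %.1f Gflop/s for 8 SPEs\n",
           dp_spe / 1e9, 8 * dp_spe / 1e9);
    return 0;
}
```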
Programming Cell
The local store appears to be the SPU's entire memory space; however, with DMAs it can be programmed as a software (user) controlled memory. A series of DMAs can be issued to transfer data from global store (DRAM) to local store (SRAM)
A remote get allows expressing a list of addresses & sizes for amenable algorithms
• Analogous to vector loads from DRAM to a large SRAM (vector register file)
The local store has a constant 6-cycle latency (no cache misses), which greatly simplifies performance estimation vs. out-of-order, cache-based designs
Double buffering allows explicit overlapping of computation and communication (a minimal sketch follows this list)
Our work does not benchmark the compiler's ability to SIMDize: for all critical sections we wrote code with SIMD/quadword intrinsics
• Ideal performance; somewhat more work than C, far less work than assembly
Programming overhead:
• SpMV: 1 month, 600 lines (learning the programming model, architecture, compiler, tools, and algorithm choices)
• Stencil versions: about 1 week, 450 lines (2 versions); the original code was 15 lines. Required significant loop unrolling and intrinsics use
• SP time-skewed stencil version: one day, a total rewrite of 450 lines; attained 65 Gflop/s!
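As a concrete illustration of the double-buffering idiom described above, here is a minimal sketch for one SPE, assuming the Cell SDK's spu_mfcio.h DMA interface. The chunk size, the process() compute routine, and the address layout are illustrative assumptions, not the code used in this study:

```c
/* Minimal double-buffering sketch for one SPE (a sketch, not the
 * study's code). Assumes the Cell SDK's spu_mfcio.h DMA interface;
 * process() is a hypothetical compute routine. */
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA; buffers must be 128B-aligned */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, int nbytes);  /* hypothetical kernel */

void stream(unsigned long long ea, int nchunks) {
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prefetch chunk 0 */
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                      /* start the next DMA */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);             /* wait only for the */
        mfc_read_tag_status_all();                /* current buffer    */
        process((char *)buf[cur], CHUNK);         /* overlaps next DMA */
        cur = nxt;
    }
}
```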
Parallel Programming Model
Possible programming paradigms include:
• Task parallelism, with independent tasks on each SPE
• Pipelined parallelism, where large data blocks are passed from one SPE to the next (similar to the streaming model)
• Data parallelism: identical operations on distinct data (SPMD)
We examine a hierarchical SPMD approach: the simplest and most direct way to decompose the problem
The data-parallel paradigm is a good match for many scientific algorithms; similar to OpenMP or multistreaming parallelism on the Cray X1(E)
The PPE is used to partition and load balance, but not for any computation
• Allows us to treat the system as a homogeneous parallel machine (a partitioning sketch follows this list)
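A minimal sketch of the decomposition step, assuming a simple balanced row partition computed on the PPE; the function and array names are illustrative, not from the paper:

```c
/* Hierarchical-SPMD decomposition sketch: the PPE computes a balanced
 * row range for each SPE and then performs no computation itself.
 * NSPE, spe_start, and spe_end are illustrative names. */
enum { NSPE = 8 };

void partition_rows(int nrows, int spe_start[NSPE], int spe_end[NSPE]) {
    int base = nrows / NSPE, extra = nrows % NSPE, row = 0;
    for (int s = 0; s < NSPE; s++) {
        int len = base + (s < extra);  /* spread the remainder evenly */
        spe_start[s] = row;
        spe_end[s]   = row + len;
        row += len;
    }
}
```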
Estimation, Simulation and Exploration
Performance Modeling (PM): double buffered + long DMAs + in-order machine
Use MAX(static timing analysis, memory traffic modeling); see the sketch below
• Static timing: latency of operations, issue-width limits, operand alignment of SIMD/quadwords
• Memory traffic: DMA load/store overheads (including constraints such as the single DRAM controller)
For regular data structures, "spreadsheet" modeling works; some kernels (SpMV, FFT) require more advanced modeling to capture the input data pattern
• Per-iteration performance varies based on the matrix nonzero format
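The core of such a model is small enough to show directly. A minimal sketch under the assumptions above (double buffering fully overlaps compute and DMA, so the predicted time is the maximum of the two sides; the peak figures in the comments come from earlier slides):

```c
/* "Spreadsheet" performance-model sketch: with double buffering, the
 * predicted kernel time is the max of the static compute estimate and
 * the DMA traffic estimate. Inputs are per-kernel counts. */
double predict_seconds(double flops, double bytes,
                       double peak_flops,  /* e.g. 14.6e9 for DP Cell */
                       double dram_bw)     /* e.g. 25.6e9 bytes/s     */
{
    double t_compute = flops / peak_flops; /* static timing side  */
    double t_memory  = bytes / dram_bw;    /* memory traffic side */
    return t_compute > t_memory ? t_compute : t_memory;
}
```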
Full System Simulator (FSS): based on IBM's Mambo; cycle accurate; includes a static timing analyzer, compilers, etc.
Cell+
The DP pipeline is not very important for video games; a redesigned pipeline helps HPC but increases design complexity and power consumption
We propose a modest modification: an alternate forwarding-network design
How severely does a DP throughput of 1 SIMD instruction every 7/8 cycles impair execution?
The Cell+ model fully utilizes the DP datapath: 1 SIMD instruction every 2 cycles
• Allows dual issuing of DP instructions with loads/stores/permutes/branches
• Same SP throughput, frequency, bandwidth, and power as Cell
Comparing Processors
The Cray X1E is the world's most powerful vector processor
Cell performance does not include the Power core
Cell+ has a 51.2 Gflop/s peak (DP): 3.5x Cell performance
Impressive potential in performance and power efficiency
Dense Matrix-Matrix Multiplication
GEMM is characterized by high computational intensity and regular data access; we expect to reach close to peak on most platforms
Explored two blocking formats: column-major and block data layout (BDL)
• Column-major: implicit blocking via gather stanzas. Issues: tile size within the SPE local store, multiple short DMAs
• Blocking maximizes flops/byte and reduces TLB misses, depending on the block size
• Block data layout: explicit blocking via a two-stage addressing scheme; requires only a single long DMA (a sketch of the addressing follows this list)
Choose a block size large enough that the kernel is computationally bound: 64x64 in single precision. Much easier in DP (14x the computational time, only 2x the transfer time)
Future work: Cannon's algorithm, reducing DRAM BW by using the EIB; could significantly increase the number of pages touched
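A sketch of the two-stage BDL addressing described above, assuming square B x B tiles stored contiguously with B dividing N; the function name is illustrative:

```c
/* Block-data-layout (BDL) addressing sketch: the N x N matrix is
 * stored as contiguous B x B tiles, so one tile = one long DMA rather
 * than B short column stanzas. Assumes B divides N. */
#include <stddef.h>

size_t bdl_offset(size_t i, size_t j, size_t N, size_t B) {
    size_t tiles_per_row = N / B;
    size_t tile   = (i / B) * tiles_per_row + (j / B); /* stage 1: which tile */
    size_t within = (i % B) * B + (j % B);             /* stage 2: inside it  */
    return tile * B * B + within;  /* element index in linear storage */
}
```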
GEMM - Results
CellPM represents results from our analytical performance model; IBM's published hardware numbers come very close to these (within 3%)
Cell results are compared with highly optimized vendor GEMM libraries
Impressive performance results and power efficiency (>200x the power efficiency of the IA64!)
SP: 7x (X1E), 69x (IA64), 26x (AMD64) faster; DP: 0.9x (X1E), 2.7x (IA64), 3.7x (AMD64)
The Cell+ approach improves DP performance 3.5x with modest architectural modifications: 50 Gflop/s!
Gflop/s   Cell+PM   CellPM   X1E    AMD64   IA64
DP        51.1      14.6     16.9   4.0     5.4
SP        -         204.7    29.5   7.8     3.0

Mflop/W   Cell+PM   CellPM   X1E    AMD64   IA64
DP        1277      365      141    45      42
SP        -         5117     245    88      23
Sparse Matrix-Vector Multiplication
SpMV is the most expensive step in iteratively solving PDE-derived sparse linear and eigen systems
Poses a performance challenge for cache-based systems due to:
• Low computational intensity and irregular (indexed) data accesses
Potentially a challenge on Cell: no caches or word-granularity gather/scatter support
Potential advantages:
• Low functional-unit and local-store latency
• Task parallelism of 8 SPEs, with 8 independent load/store units
• Ability to stream nonzeros via DMA
• The local store is not a write-back cache: temporaries can be overwritten without using DRAM BW
Our Cell implementation work examines CSR and blocked CSR (BCSR)
SIMDization requires all row lengths to be a multiple of 4 (simplifies quadword alignment)
Explicitly cache-block columns to exploit spatial locality within the local store; implicitly cache-block the rows
Cache-block parallelization strategies:
• Partition by rows: potential load imbalance (depends on matrix structure)
• Partition by nonzeros: each SPE holds a copy of the source vector + a reduction across SPEs
Double buffer the nonzeros: overlaps computation and communication, but requires restarting in the middle of a row
Partially double buffer the row pointers to find structure
Completely eliminate empty blocks; prune empty rows
SpMV - example figure
Explicitly choose the column blocking via a cost function: the cache-block perimeter is fixed (by the local store), so what is the optimal r x c?
[Figure: cache blocks of the matrix partitioned across SPU0-SPU7]
Parallelize across SPUs with a cost function of execution time ~ rows + nonzeros
(A scalar sketch of the kernel follows.)
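For reference, the kernel being partitioned is plain CSR SpMV. A minimal scalar C sketch; the SPE version SIMDizes this loop and streams val/col by DMA with row lengths padded to a multiple of 4, as described above:

```c
/* Scalar CSR SpMV sketch: y = A*x with A in compressed sparse row
 * form. The Cell version streams val/col through the local store by
 * DMA and processes rows with quadword SIMD. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y) {
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += val[k] * x[col[k]];   /* indexed gather from x */
        y[r] = sum;
    }
}
```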
SpMV - FSS Implementation
We use the performance-model estimates to guide the actual implementation
In DP, row lengths must be even (quadword aligned)
No BCSR software implementation yet
Parallelization: dynamically analyze costFunction(rows, NZs) ~ execution time
Runtime blocking: the cost function is based on LS = 256KB = 32K doubles, so the maximum column block is 32K
• Only a 15b relative index is needed to store the column index (not a 32b absolute one); see the sketch below
Runtime search for structure: empty cache blocks, search for the first non-empty row
Itanium/AMD versions: the highly optimized OSKI auto-tuner was used; the Cray version is the best known to date: optimized CSRP and jagged diagonal
We examined a suite of (un)symmetric matrices from real numerical calculations
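A sketch of the 15b relative column index mentioned above, assuming a column cache block of at most 32K = 2^15 columns; the helper names are illustrative:

```c
/* Relative column indices: within a cache block of at most 32K
 * (2^15) columns, a column can be stored relative to the block's
 * first column in a halfword rather than a 32b absolute index. */
#include <stdint.h>

static inline uint16_t to_rel(uint32_t abs_col, uint32_t block_col0) {
    return (uint16_t)(abs_col - block_col0);  /* fits in 15 bits */
}

static inline uint32_t to_abs(uint16_t rel_col, uint32_t block_col0) {
    return block_col0 + rel_col;
}
```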
SpMV - Results
Gflop/s          CellFSS   Cell+PM   CellPM   X1E    AMD64   IA64
unsymmetric DP   3.04      2.46      2.34     1.14   0.36    0.36
unsymmetric SP   -         -         4.08     -      0.53    0.41
symmetric DP     3.38*     4.35      4.00     2.64   0.60    0.67
symmetric SP     -         -         7.68     -      0.80    0.83

Mflop/W          CellFSS   Cell+PM   CellPM   X1E    AMD64   IA64
unsymmetric DP   76.0      61.5      58.5     9.50   4.04    2.77
unsymmetric SP   -         -         102      -      5.96    3.15
symmetric DP     84.5*     109       100      22.0   6.74    5.15
symmetric SP     -         -         192      -      8.99    6.38
Cell DP achieves an impressive 6-8x speedup vs. AMD/Itanium and 20x the power efficiency
• Even though its memory BW advantage is only 4x, Cell achieves the higher performance via double buffering
• Multicore systems will not see a performance increase without improved memory bandwidth
Cell outperforms the X1E by around 2x and is 5x more power efficient
• X1 performance is much more sensitive to #NNZ (which affects vector length)
PM comes very close to FSS performance using a static implementation, confirming PM accuracy
• However, FSS is 30% faster due to dynamic partitioning
*Unsymmetric kernel used on the symmetric matrix. FSS = Full System Simulator, PM = Performance Model
SpMV - Future Optimizations
Auto-tuning
Other parallelization strategies
BCSR (better for SIMD, worse for memory traffic)
Other storage formats (DIA/JAG/etc.)
Symmetry (currently present only in the performance model)
• Easier to exploit in single precision & with BCSR
• Cache blocking limits the benefit (~50%)
Segmented scan
• Reduces loop overhead at the expense of nonzero processing time
• Good if NZ/row (within a cache block) is small
• A single segment (vector length = 1) would be beneficial
• Make a runtime decision for a given cache block
• Complicated by the presence of empty rows within a cache block
Stencils on Structured Grids
Stencil computations represent a wide array of scientific applications
Each point in a multidimensional grid is updated from a subset of its neighbors
Finite-difference operations are used to solve complex numerical systems
We examine the simple heat equation and a 3D hyperbolic PDE
Relatively low computational intensity results in a low % of peak on superscalars: memory bandwidth bound
The algorithm requires keeping 4 planes in the local store for best performance: (Z-1,t), (Z,t), (Z+1,t) -> (Z,t+1)
The Cell approach uses double buffering of the previous output with the next input: (Z-1,t+1) & (Z+2,t)
The Cell algorithm is virtually identical to that for traditional architectures; the ISA simply forces explicit memory loads/stores rather than cache misses and evictions
Parallelization: process one plane at a time
• Break the middle loop up among the SPEs (divide each plane 8 ways)
• Maintains long DMAs (unit-stride direction) and double buffering in the Z direction
• Computational intensity drops to decrease complexity
SIMDization: permutes are required to pack left & right neighbors into a SIMD register; neighbor communication is a poor fit for aligned-quadword load requirements
Potential Cell bottleneck: unaligned loads emulated with permute instructions; the problem can be partially avoided with data padding
• In SP the permute unit is the second bottleneck after BW, not the FPU
(A scalar sketch of the stencil follows.)
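For concreteness, a plain scalar C sketch of a 7-point heat-equation sweep of the kind described above. The coefficient c and grid layout are illustrative assumptions; the Cell version keeps four Z-planes in the local store, SIMDizes the inner loop, and double-buffers planes by DMA:

```c
/* Scalar 7-point heat-equation stencil sketch: each interior point is
 * updated from its six neighbors. The Cell version sweeps plane by
 * plane, holding four Z-planes in the local store. */
#include <stddef.h>

void heat_step(int nx, int ny, int nz, double c,
               const double *u, double *unew) {
#define IDX(x, y, z) (((size_t)(z) * ny + (y)) * nx + (x))
    for (int z = 1; z < nz - 1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++)
                unew[IDX(x, y, z)] = u[IDX(x, y, z)] + c *
                    (u[IDX(x - 1, y, z)] + u[IDX(x + 1, y, z)] +
                     u[IDX(x, y - 1, z)] + u[IDX(x, y + 1, z)] +
                     u[IDX(x, y, z - 1)] + u[IDX(x, y, z + 1)] -
                     6.0 * u[IDX(x, y, z)]);
#undef IDX
}
```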
Stencils - Time Skewing
Low computational intensity limits performance: memory bandwidth bound
Time skewing: multiple time steps are combined to increase performance
• Increased computational intensity with almost no additional memory traffic
• Note that some numerical methods are not amenable to merging multiple timesteps
Stencils - Results

Gflop/s   CellFSS         Cell+PM         CellPM   X1E    AMD64   IA64
DP        7.25            21.1 (2-step)   8.2      3.91   0.57    1.19
SP        65.8 (4-step)   -               21.2     3.26   1.07    1.97

Mflop/W   CellFSS         Cell+PM         CellPM   X1E    AMD64   IA64
DP        181             528 (2-step)    205      32.6   6.4     9.15
SP        1645 (4-step)   -               530      27.2   12      15.2
In DP, Cell is computationally bound within a single time step
In SP, 2 timesteps are required before becoming computationally bound: the permute unit quickly becomes fully utilized (quadword alignment)
In SP, Cell achieves 6.5x, 11x, 20x compared with the X1E, Itanium2, and Opteron
• Using time skewing achieves 66 Gflop/s: 60x faster and 130x more power efficient than the Opteron
• Note the stencil code is highly optimized on the cache-based platforms
In DP, Cell achieves 2x, 7x, 14x compared with the X1E, Itanium2, and Opteron
• Even though DP peak throughput is only 1/14th (!) of SP
• Note Cell prohibits dual issue of DP with loads or permutes; for codes with streaming behavior, one DP SIMD instruction issues every 8 cycles
Unlike on scalar systems, time skewing can at least double performance on Cell
Cell's performance shows the value of software-controlled memory for codes with predictable memory accesses
1D Fast Fourier Transforms
The fast Fourier transform (FFT) is of great importance to a wide variety of applications; it is one of the main techniques for solving PDEs
Relatively low computational intensity with a non-trivial volume of data movement
1D FFT: a naive algorithm, cooperatively executed across the SPEs
• Load the roots of unity, load the data (cyclic)
• 3 stages: local work, on-chip transpose, local work
• No double buffering (i.e., no overlap of communication and computation)
2D FFT: the 1D FFTs are each run on a single SPE
• Each SPE performs 2 * (N/8) FFTs
• Double buffered (2 incoming and 2 outgoing)
• Straightforward algorithm (N² 2D FFT): N simultaneous FFTs interleaved with transposes (row-column algorithm)
• Transposes represent about 50% of SP execution time, but only 20% of DP
Cell performance is compared with the highly optimized FFTW and vendor libraries
FFT - Results

Averaged Gflop/s   Cell+PM   CellPM   X1E    AMD64   IA64
1D DP              13.4      5.85     4.53   1.61    2.70
1D SP              -         33.7     5.30   3.24    1.72
2D DP              16.2      6.65     7.05   0.69    0.31
2D SP              -         38.2     7.93   1.32    0.42

Averaged Mflop/W   Cell+PM   CellPM   X1E    AMD64   IA64
1D DP              335       146      37.8   18.1    20.8
1D SP              -         843      44.2   36.4    13.2
2D DP              405       166      58.8   7.75    2.38
2D SP              -         955      66.1   14.8    3.23

FSS implementation in progress
In SP Cell is unparalleled: 91x faster and 300x more power efficient than the Itanium2!
Cell+ offers a significant performance advantage (2.5x versus Cell)
Cell+ 2D FFT: 30x faster than the Itanium2, 23x the AMD64, >2x the X1E
Cell DP performance is approximately equal to the X1E (with a simple Cell implementation)
Cell performance underscores the advantage of software-controlled memory:
• Does not suffer from the associativity issues of cache architectures (powers of 2); effectively a fully associative cache
• Opportunity to explicitly overlap communication with computation
DP Performance Advantage
[Chart: DP speedup of Cell+ and Cell vs. X1E, AMD64, and IA64 for DGEMM, SpMV(u), SpMV(s), Stencil, 1D FFT, and 2D FFT; y-axis 0-25x, with one bar reaching 52.3x]
DP Power Advantage
[Chart: DP power-efficiency advantage of Cell+ and Cell vs. X1E, AMD64, and IA64 for the same six kernels; y-axis 0-50x, with bars reaching 52x, 70x, and 170x]
Summary I
Our work presents the broadest quantitative study of scientific kernels on Cell to date
We developed an analytic framework to predict performance and validated its accuracy
• Kernel times are predictable, as the load time from the LS is constant
The results show the tremendous potential of Cell for scientific computing
We proposed Cell+: a modest microarchitectural variant designed to improve DP performance
• A fully utilizable DP pipeline greatly improves performance and power efficiency
Cell's heterogeneous multicore seems more promising than emerging CMP designs
• A CMP with 2 cores gives < 2x performance, compared with 10-20x on Cell
• CMPs would need a serious increase in power and memory bandwidth to scale up
Of course, we ultimately need to evaluate full application performance
SP Performance Advantage
[Chart: SP speedup of Cell vs. X1E, AMD64, and IA64 for SGEMM, SpMV(u), SpMV(s), Stencil, 1D FFT, and 2D FFT; y-axis 0-100x]
SP Power Advantage
[Chart: SP power-efficiency advantage of Cell vs. X1E, AMD64, and IA64 for the same six kernels; y-axis 0-300x]
Summary II
Cell's 3-level software-controlled memory decouples loads/stores from computation:
• Extremely predictable kernel performance
• Long DMA transfers achieve a high % of memory BW (like fully engaged prefetch)
• Ability to decouple gathers (e.g., streaming nonzeros): a large number of concurrent memory fetches
• For predictable memory accesses, computation can be overlapped with memory transfers (double buffering)
• Future designs would benefit from a larger local store and lower DMA startup overhead
Disadvantages:
• Increased programming complexity to specify local memory movement
• Lack of unaligned load support: additional instructions are necessary to permute
• The permute pipeline can become a bottleneck (the stencil SP example)
Future work: Real Cell hardware, more algorithms, comparisons with modern CMPs