programming the cell processor: achieving high performance and efficiency
Post on 01-Jan-2016
37 Views
Preview:
DESCRIPTION
TRANSCRIPT
Presented by
Jeremy S. MeredithSadaf R. Alam
Jeffrey S. VetterFuture Technologies Group
Computer Science and Mathematics Division
Research supported by the Department of Energy’s Office of ScienceOffice of Advanced Scientific Computing Research
Programming the Cell Processor:Achieving High Performance and Efficiency
2 Meredith_Cell_SC07
One POWER architecture processing element (PPE)
Eight synergistic processing elements (SPEs)
All connected through a high-bandwidth element interconnect bus (EIB)
Over 200 gigaflops (single precision) on one chip
Cell broadband engine processor:An overview
3 Meredith_Cell_SC07
Cell broadband engine processor:Details
EightSPEs
Dual-threaded
Vector instructions
Dual-issue pipeline
Simple instruction set heavily focused on single instruction multiple data (SIMD)
Capability for double precision, but optimization for single
Uniform 128-bit 128-register file
256-K fixed-latency local store
Memory flow controller with direct memory access (DMA) engine to access main memory or other SPE local stores
One64-bit PPE
4 Meredith_Cell_SC07
Genetic algorithm, traveling salesman:Single- vs. double-precision performance The cell processor has a much higher latency for
double-precision results.
The “ if ” test in the sorting predicate is highly penalized for double precision.
0.0
0.5
1.0
1.5
2.0
2.5
Pentium 42.8 GHz
PPE only 1 SPE 1 SPE,avoiding “ if ”
test
Ru
nti
me
(sec
)
Single precision
Double precision
Replacing this testwith extra arithmetic
results in a large speedup
Replacing this testwith extra arithmetic
results in a large speedup
5 Meredith_Cell_SC07
Genetic algorithm, Ackley’s function:Using SPE-optimized math libraries
Ackley’s function involvescos, exp, sqrt
0.047
0.681
0.248
0.064
0.01
0.1
1
Original Fastcosine
Fastexp/sqrt
SIMD
Ru
nti
me
(se
c)
[lo
g s
ca
le]
Switching to a math library optimized for
SPEs results in a more than 10x
improvement for single precision
6 Meredith_Cell_SC07
The cell processor can overlap communication with computation.
Covariance matrix creation has a low ratio of communication to computation.
However, even with an SIMD-optimized implementation, the high bandwidth of the cell’s EIB makes this overhead negligible.
0
5
10
15
20
25
30
35
Scalar SIMD
Optimizations
Ru
nti
me
usi
ng
1 S
PE
(s
ec)
Synchronous DMASynchronous DMA
Overlapped DMAOverlapped DMA
6 Meredith_Cell_CS07
Covariance matrix creation:DMA communication overhead
7 Meredith_Cell_SC07
0.0
1.0
2.0
3.0
4.0
5.0
Original-O3 Simplify arrayindexing (1)
Simplify arrayindexing (2)
Loop unrolling Instructionreordering
Runtime (sec)
Stochastic Boolean SAT solver:Hiding latency in logic-intensive apps
PPE: attempting to hide latency manually works against the compilerPPE: attempting to hide latency manually works against the compiler
SPE: manual loop unrolling and instruction reordering achieved speedups although no computational SIMD optimizations were possible
SPE: manual loop unrolling and instruction reordering achieved speedups although no computational SIMD optimizations were possible
8 Meredith_Cell_SC07
Support vector machine:Parallelism and concurrent bandwidth As more simultaneous
SPE threads are added, the total runtime decreases.
Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE.
Thread launch time increases, but the latest Linux kernel for the cell system does reduce this overhead.
0
1
2
3
4
5
6
7
1 2 4 8 8, newkernel
Number of SPE threads
Runtime (sec)
Full execution
Launch + DMA
Thread launch
8 Meredith_Cell_SC07
9 Meredith_Cell_SC07
Threads launched only on first call
Molecular dynamics:SIMD intrinsics and PPE-to-SPE signaling Using SIMD intrinsics easily achieved 2x speedups in total runtime.
Using PPE-SPE mailboxes, SPE threads can be reused across iterations.
Thus, thread launch overheads will be completely amortized on longer runs.
0.0
0.5
1.0
1.5
2.0
2.5
1 SPE,original
1 SPE,SIMDized
1 SPE 2 SPEs 4 SPEs 8 SPEs
Ru
nti
me
(sec
) Full runtimeFull runtime
Thread launch overheadThread launch overhead
10 Meredith_Cell_SC0710 Meredith_Cell_SC07
Conclusion
Be aware of arithmetic costs. Use the optimized math libraries from the SDK if it helps. Double precision requires different kinds of optimizations.
The cell has a very high bandwidth to the SPEs. Use asynchronous DMA to overlap communication and
computation for applications that are still bandwidth bound.
Amortize expensive SPE thread launch overheads. Launch once, and signal SPEs to start the next iteration.
Use of SIMD intrinsics can result in large speedups. Manual loop unrolling and instruction reordering can help
even if no other SIMDization is possible.
top related