programming the cell processor: achieving high performance and efficiency

Presented by

Jeremy S. MeredithSadaf R. Alam

Jeffrey S. VetterFuture Technologies Group

Computer Science and Mathematics Division

Research supported by the Department of Energy’s Office of ScienceOffice of Advanced Scientific Computing Research

Programming the Cell Processor:Achieving High Performance and Efficiency

2 Meredith_Cell_SC07

One POWER architecture processing element (PPE)

Eight synergistic processing elements (SPEs)

All connected through a high-bandwidth element interconnect bus (EIB)

Over 200 gigaflops (single precision) on one chip

Cell broadband engine processor:An overview

Cell broadband engine processor:Details

EightSPEs

Dual-threaded

Vector instructions

Dual-issue pipeline

Simple instruction set heavily focused on single instruction multiple data (SIMD)

Capability for double precision, but optimization for single

Uniform 128-bit 128-register file

256-K fixed-latency local store

Memory flow controller with direct memory access (DMA) engine to access main memory or other SPE local stores

One64-bit PPE

Genetic algorithm, traveling salesman:Single- vs. double-precision performance The cell processor has a much higher latency for

double-precision results.

The “ if ” test in the sorting predicate is highly penalized for double precision.

Pentium 42.8 GHz

PPE only 1 SPE 1 SPE,avoiding “ if ”

Single precision

Double precision

Replacing this testwith extra arithmetic

results in a large speedup

Replacing this testwith extra arithmetic

results in a large speedup

Genetic algorithm, Ackley’s function:Using SPE-optimized math libraries

Ackley’s function involvescos, exp, sqrt

Original Fastcosine

Fastexp/sqrt

Switching to a math library optimized for

SPEs results in a more than 10x

improvement for single precision

The cell processor can overlap communication with computation.

Covariance matrix creation has a low ratio of communication to computation.

However, even with an SIMD-optimized implementation, the high bandwidth of the cell’s EIB makes this overhead negligible.

Scalar SIMD

Optimizations

Synchronous DMASynchronous DMA

Overlapped DMAOverlapped DMA

6 Meredith_Cell_CS07

Covariance matrix creation:DMA communication overhead

Original-O3 Simplify arrayindexing (1)

Simplify arrayindexing (2)

Loop unrolling Instructionreordering

Runtime (sec)

Stochastic Boolean SAT solver:Hiding latency in logic-intensive apps

PPE: attempting to hide latency manually works against the compilerPPE: attempting to hide latency manually works against the compiler

SPE: manual loop unrolling and instruction reordering achieved speedups although no computational SIMD optimizations were possible

Support vector machine:Parallelism and concurrent bandwidth As more simultaneous

SPE threads are added, the total runtime decreases.

Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE.

Thread launch time increases, but the latest Linux kernel for the cell system does reduce this overhead.

1 2 4 8 8, newkernel

Number of SPE threads

Runtime (sec)

Full execution

Launch + DMA

Thread launch

Threads launched only on first call

Molecular dynamics:SIMD intrinsics and PPE-to-SPE signaling Using SIMD intrinsics easily achieved 2x speedups in total runtime.

Using PPE-SPE mailboxes, SPE threads can be reused across iterations.

Thus, thread launch overheads will be completely amortized on longer runs.

1 SPE,original

1 SPE,SIMDized

1 SPE 2 SPEs 4 SPEs 8 SPEs

) Full runtimeFull runtime

Thread launch overheadThread launch overhead

10 Meredith_Cell_SC0710 Meredith_Cell_SC07

Conclusion

Be aware of arithmetic costs. Use the optimized math libraries from the SDK if it helps. Double precision requires different kinds of optimizations.

The cell has a very high bandwidth to the SPEs. Use asynchronous DMA to overlap communication and

computation for applications that are still bandwidth bound.

Amortize expensive SPE thread launch overheads. Launch once, and signal SPEs to start the next iteration.

Use of SIMD intrinsics can result in large speedups. Manual loop unrolling and instruction reordering can help

even if no other SIMDization is possible.

Contact

Jeremy MeredithFuture Technologies Group Computer Science and Mathematics Division (865) 241-5842jsmeredith@ornl.gov

programming the cell processor: achieving high performance and efficiency

double precision

original1 spe

expensive spe thread

cell system

doubleprecision results

spe local storesone

ppespe mailboxes

simultaneous spe threads

Documents

improving efficiency, achieving sustainability

beyond ehr - achieving operational efficiency

achieving business efficiency through people

achieving cost efficiency and high performance processing

kurzke achieving maximum thermal efficiency en

achieving administrative efficiency

achieving scale and efficiency in microinsurance through ......

achieving operational efficiency by leveraging pi system

fluids gement achieving operational efficiency gains using

achieving cost efficiency and high performance processing -...

achieving energy efficiency through behaviour change: what

achieving administrative efficiency & technology...

managing performance and efficiency of a processor

achieving the efficiency for information systems

achieving energy efficiency through behaviour change: what...

achieving cost efficiency and high performance computing

achieving cpu (& mlc) savings through optimizing processor

achieving superior energy efficiency · 2019-03-04 ·...

imagine achieving maximum efficiency · imagine achieving...

programming the cell processor: achieving high performance...