samuel williams, john shalf, leonid oliker, shoaib kamil, parry husbands, katherine yelick lawrence...

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick

Lawrence Berkeley National Laboratory

ACM International Conference on Computing Frontiers

May 2-6, 2006, Italy

Presentation by Aarul Jain

Introduce a performance model of Cell.

Implement key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs.

Verify the performance model's results against published results and against implementations run on the Cell full-system simulator.

Compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures.

Propose micro-architectural modifications that could significantly improve the efficiency of double-precision calculations.

Details and results from the paper.
◦ Programming Model used.
◦ Performance Model used for simulation.
◦ “Cell+” architecture for DP performance improvement.
◦ Dense Matrix-Matrix multiply.
◦ Sparse Matrix Vector multiply.
◦ Stencil Computations.
◦ Fast Fourier Transforms.

Comments/Critiques

Project

Q/A

Three programming models
◦ Task parallelism.
◦ Pipelined parallelism.
◦ Data parallelism.

The data-parallel programming model is used.

The code relies heavily on SIMD intrinsics rather than plain C.

Double buffering is used to overlap data movement with computation on the SPEs.

The first kernel took one month to implement and is 600 lines of code.
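The double-buffering idea can be sketched in a few lines. This is a sequential illustration only, not the authors' SPE code: the function name `process_double_buffered` and the `compute` callback are hypothetical, and the "load" is a plain copy standing in for what would be an asynchronous DMA transfer on real hardware.

```python
def process_double_buffered(chunks, compute):
    """Process chunks using two alternating buffers.

    While the current buffer is being computed on, the other buffer is
    filled with the next chunk. Here the fill is a synchronous copy; on
    an SPE it would be an asynchronous DMA that overlaps the compute.
    """
    results = []
    if not chunks:
        return results

    buffers = [None, None]
    buffers[0] = list(chunks[0])              # prefetch the first chunk

    for i in range(len(chunks)):
        cur = i % 2                           # buffer holding chunk i
        nxt = 1 - cur                         # buffer to fill with chunk i + 1
        if i + 1 < len(chunks):
            buffers[nxt] = list(chunks[i + 1])    # start loading the next chunk
        results.append(compute(buffers[cur]))     # work on the current chunk
    return results
```

For example, `process_double_buffered([[1, 2], [3, 4], [5]], sum)` processes three chunks while only ever holding two of them in buffers.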

Deterministic behavior of software controlled memory.

In-order execution and fixed load-store memory latency of SPEs.

Step 1: Segment the code into snippets that operate only on data already present in the SPE local store, and perform static timing analysis on their assembly.

Step 2: Build a model that tabulates the time required for the DMA loads and stores of the operands those snippets need.

Compute the total time by summing the times of all outer loops, where the time of each loop is the maximum of its snippet time and its DMA transfer time.
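The summation rule above can be written as a tiny calculation. This is a minimal sketch of the rule as stated on the slide, not the authors' modeling tool; the function names and the cycle counts in the usage note are made up for illustration.

```python
def estimate_total_time(loops):
    """Estimate total time when DMA overlaps with compute.

    `loops` is a sequence of (snippet_time, dma_time) pairs, one per
    outer loop. Each loop costs the larger of the two times.
    """
    return sum(max(snippet, dma) for snippet, dma in loops)


def estimate_without_overlap(loops):
    """Same input, but assuming DMA and compute run back to back."""
    return sum(snippet + dma for snippet, dma in loops)
```

With hypothetical timings `[(100, 60), (80, 120), (50, 50)]`, the overlapped estimate is 100 + 120 + 50 = 270 time units, against 460 if transfers and computation were serialized.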

Double-precision operations are implemented using a 9-cycle pipelined FMA datapath, with 4 cycles of overhead for data movement.

The SPE stalls for 6 cycles after issuing a DP instruction.

The Cell+ architecture is not discussed in much detail in the paper. (Proprietary?)

The authors propose a design with a longer forwarding network to eliminate all but one of the stall cycles.

More details on the SPE pipeline may be found at:

◦ B. Flachs et al., A streaming processing unit for a CELL processor. ISSCC Dig. Tech. Papers, Feb. 2005
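The effect of the stall cycles on peak double-precision throughput can be worked out with simple arithmetic. The sketch below rests on assumptions that are not stated on these slides: a 3.2 GHz clock, 2-wide DP SIMD, an FMA counted as 2 flops, 8 SPEs, and one DP instruction issued every (1 + stall) cycles. The function name `dp_peak_gflops` is hypothetical.

```python
def dp_peak_gflops(stall_cycles, spes=8, clock_ghz=3.2,
                   simd_width=2, flops_per_fma=2):
    """Peak DP GFlop/s if one SIMD FMA issues every (1 + stall) cycles.

    All default values are assumptions for illustration.
    """
    cycles_per_instruction = 1 + stall_cycles
    flops_per_instruction = simd_width * flops_per_fma
    return spes * clock_ghz * flops_per_instruction / cycles_per_instruction
```

Under these assumptions, a 6-cycle stall gives roughly 14.6 GFlop/s, and reducing the stall to a single cycle gives 51.2 GFlop/s, about a 3.5x improvement.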

General Matrix Multiply and Add (GEMM)
◦ Column major
◦ Block data layout

Each matrix is broken into 8n x n element tiles sized to fit into the memory available on the Cell chip.

Each of these is further divided into eight n x n element tiles that fit into the 8 SPE local stores.
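The tiling idea can be illustrated with a plain blocked matrix multiply. This is a minimal single-threaded sketch, not the paper's SPE implementation: it uses only one level of n x n tiles, stores matrices as row-major lists of lists rather than in block data layout, and the name `blocked_gemm` is hypothetical.

```python
def blocked_gemm(A, B, C, n):
    """Compute C += A * B for square matrices, one n x n tile at a time.

    A, B, C are lists of lists of equal size N x N, and n must divide N.
    Each (i0, j0, k0) triple touches only three n x n tiles, which is
    the working set that would need to fit in a local store.
    """
    N = len(A)
    if N % n != 0:
        raise ValueError("tile size must divide the matrix size")

    for i0 in range(0, N, n):
        for j0 in range(0, N, n):
            for k0 in range(0, N, n):
                # Multiply tile A[i0.., k0..] by tile B[k0.., j0..]
                # and accumulate into tile C[i0.., j0..].
                for i in range(i0, i0 + n):
                    for j in range(j0, j0 + n):
                        acc = C[i][j]
                        for k in range(k0, k0 + n):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The tile size n controls how much data each inner block needs at once, which is the quantity the slide ties to local store capacity.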

Storage formats
◦ Compressed Sparse Row (CSR)
◦ Blocked Compressed Sparse Row (BCSR)
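For reference, a sparse matrix-vector multiply over the CSR format looks roughly like this. It is a minimal scalar sketch of the standard CSR layout, not the SIMD kernel from the paper; the function and argument names are chosen for illustration.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[r] .. row_ptr[r + 1] delimits row r in `values`
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect access into x
        y[row] = acc
    return y
```

The matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] is stored as `values = [4, 1, 3, 2, 5]`, `col_idx = [0, 2, 1, 0, 2]`, `row_ptr = [0, 2, 3, 5]`. BCSR applies the same scheme to small dense blocks instead of single entries.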

Two types of kernels are used, derived from the Chombo and Cactus toolkits.

Both apply a 7-point stencil in 3D at each grid point.

Computational intensity is lower than that of matrix multiplication.
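A generic 7-point stencil sweep in 3D can be sketched as follows. This is not the Chombo or Cactus kernel: the coefficients `alpha` and `beta` and the choice to leave boundary points unchanged are assumptions made for illustration.

```python
def stencil_7pt(grid, alpha, beta):
    """Apply one sweep of a 7-point stencil to a 3D grid.

    grid is indexed as grid[z][y][x]. Each interior point becomes
    alpha * centre + beta * (sum of its six face neighbours).
    Boundary points are copied through unchanged.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[grid[z][y][x] for x in range(nx)]
            for y in range(ny)] for z in range(nz)]

    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbours = (grid[z - 1][y][x] + grid[z + 1][y][x] +
                              grid[z][y - 1][x] + grid[z][y + 1][x] +
                              grid[z][y][x - 1] + grid[z][y][x + 1])
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbours
    return out
```

Each output point needs seven loads and only a handful of arithmetic operations, which is why this kind of kernel does less computation per byte moved than matrix multiplication.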

Both 1D and 2D versions are analyzed.

Look-up tables are used.

No double buffering is used.
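A radix-2 FFT with a precomputed table can be sketched as below. The slide does not say what the look-up tables hold; this sketch assumes they store twiddle factors. It is a textbook recursive formulation, not the paper's SPE code, and the names `make_twiddles`, `fft`, and `fft2` are chosen for illustration.

```python
import cmath


def make_twiddles(n):
    """Look-up table of exp(-2*pi*i*k/n) for k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def fft(x, twiddles=None):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if twiddles is None:
        twiddles = make_twiddles(n)

    # A half-size transform needs every second entry of the table.
    half_table = twiddles[0::2]
    even = fft(x[0::2], half_table)
    odd = fft(x[1::2], half_table)

    out = [0j] * n
    for k in range(n // 2):
        t = twiddles[k] * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(rows):
    """2D FFT: transform every row, then every column."""
    row_pass = [fft(r) for r in rows]
    col_pass = [fft(list(c)) for c in zip(*row_pass)]
    return [list(r) for r in zip(*col_pass)]
```

The 2D version is just the 1D transform applied along each dimension in turn, so the same table can serve both passes when the dimensions match.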

Broadest quantitative study of Cell’s performance.

Cell’s three-level software-controlled memory architecture provides several advantages over mainstream cache-based architectures.

Disadvantage: no support for unaligned loads.

Propose Cell+ architecture for improving DP performance.

Cell is unique in its architecture -> future architectures based on Cell??

Authors have done considerable work in analyzing Cell performance.

Critique1 http://arstechnica.com/news.ars/post/20060615-7071.html

Critique2 http://www.hpcwire.com/hpc/679134.html

Title: FAST FOURIER TRANSFORM IMPLEMENTATION ON CELL BROADBAND ENGINE ARCHITECTURE

Main Objectives:
◦ Explore Cell Architecture and find out limitations/advantages of Cell Architecture.
◦ Get familiar with Cell programming environment.

http://www-128.ibm.com/developerworks/power/library/pa-cellperf

http://en.wikipedia.org/wiki/Single_precision

http://arstechnica.com/news.ars/post/20060615-7071.html

http://www.hpcwire.com/hpc/671376.html

http://www.hpcwire.com/hpc/679134.html

http://arstechnica.com/cpu/2q99/benchmarking-1.html

http://crd.lbl.gov/html/news/CRDreport0506.pdf

http://arstechnica.com/news.ars/post/20060225-6265.html

http://www.research.ibm.com/journal/sj/451/eichenberger.html

http://ieeexplore.ieee.org/iel5/71/27301/01214317.pdf?tp=&isnumber=&arnumber=1214317
