Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick Lawrence Berkeley National Laboratory ACM International Conference on Computing Frontiers May 2-6, 2006, Italy Presentation by Aarul Jain


Page 1

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick

Lawrence Berkeley National Laboratory

ACM International Conference on Computing Frontiers

May 2-6, 2006, Italy

Presentation by Aarul Jain

Page 2

Introduce a performance model of Cell.

Implement key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs.

Verify results from the performance model against published results and against implementations run on the Cell full-system simulator.

Compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures.

Propose micro-architectural modifications that could significantly improve the efficiency of double-precision calculations.

Page 3

Details and results from the paper.
◦ Programming model used.
◦ Performance model used for simulation.
◦ “Cell+” architecture for DP performance improvement.
◦ Dense matrix-matrix multiply.
◦ Sparse matrix-vector multiply.
◦ Stencil computations.
◦ Fast Fourier transforms.

Comments/Critiques

Project

Q/A

Page 4

Three programming models:
◦ Task parallelism.
◦ Pipelined parallelism.
◦ Data parallelism.

Data-parallel programming model used.

Rely heavily on SIMD intrinsics -> NO C.

Double buffering used to overlap data movement with computation on SPEs.

One month to implement first kernel, 600 lines of code.
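
The double-buffering idea can be sketched as a ping-pong between two buffers: while one buffer is being computed on, the next chunk is fetched into the other. This is only an illustration of the control flow, not the authors' SPE code; `process_chunks` and `compute` are hypothetical names, and the list copies stand in for DMA transfers.

```python
def process_chunks(chunks, compute):
    """Ping-pong between two buffers: fetch chunk i+1 while chunk i is computed."""
    if not chunks:
        return []
    buffers = [None, None]
    buffers[0] = list(chunks[0])  # stands in for the first DMA get (assumption)
    results = []
    for i in range(len(chunks)):
        cur = i % 2          # buffer holding the chunk to compute on
        nxt = 1 - cur        # buffer that receives the next chunk
        if i + 1 < len(chunks):
            # On an SPE this fetch would be an asynchronous DMA transfer
            # that proceeds while compute() runs; here it is a plain copy.
            buffers[nxt] = list(chunks[i + 1])
        results.append(compute(buffers[cur]))
    return results


print(process_chunks([[1, 2], [3, 4], [5]], sum))
```

In Python the fetch and the compute run one after the other; the sketch only shows how the two buffers alternate roles.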

Page 5

Deterministic behavior of software-controlled memory.

In-order execution and fixed load-store memory latency of SPEs.

Step 1: Segment the code into snippets that operate on data present in the SPE local store, and perform static timing analysis on their assembly.

Step 2: Build a model that tabulates the time required for DMA loads and stores of the operands required by the code snippets.

Compute the total time by summing over all outer-loop iterations, where the time of each iteration is the maximum of the snippet time and the DMA transfer time.
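
The timing rule above (sum over outer-loop iterations of the larger of compute time and DMA time) can be written down directly. A minimal sketch, assuming each iteration is described by a hypothetical `(compute_cycles, dma_cycles)` pair; the numbers below are made up for illustration and are not results from the paper.

```python
def predicted_cycles(iterations):
    """Sum, over outer-loop iterations, of max(compute time, DMA time)."""
    return sum(max(compute, dma) for compute, dma in iterations)


def bound_by(iterations):
    """Label each iteration as compute-bound or DMA-bound."""
    return ["compute" if compute >= dma else "dma" for compute, dma in iterations]


# Illustrative (made-up) cycle counts per outer-loop iteration.
loop = [(100, 80), (100, 150), (50, 50)]
print(predicted_cycles(loop))  # 100 + 150 + 50 = 300
print(bound_by(loop))
```

Taking the maximum rather than the sum reflects the assumption that double buffering fully overlaps data movement with computation.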

Page 6

Double-precision operations are implemented using a 9-cycle pipelined FMA datapath with 4 cycles of overhead for data movement.

6-cycle stall after issuing a DP instruction.

Much detail about the Cell+ architecture is not discussed in the paper. (Proprietary?)

Proposes a design with a longer forwarding network to eliminate all but one of the stall cycles.

More details on the SPE pipeline may be found at:
◦ B. Flachs et al., A streaming processing unit for a CELL processor. ISSCC Dig. Tech. Papers, Feb. 2005
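
The effect of the stall cycles on peak double-precision throughput can be estimated with simple arithmetic. A rough sketch, assuming values that are not stated on this slide: a 3.2 GHz clock, 8 SPEs, 2-wide DP SIMD, and 2 flops per lane for a fused multiply-add. The function name `dp_peak_gflops` is hypothetical.

```python
def dp_peak_gflops(stall_cycles, clock_ghz=3.2, n_spes=8,
                   simd_width=2, flops_per_lane=2):
    """Peak DP GFlop/s if one DP SIMD instruction issues every (1 + stall) cycles.

    The default clock, SPE count, SIMD width, and flops per lane are
    assumptions for illustration, not values taken from the slide.
    """
    issue_interval = 1 + stall_cycles
    flops_per_instruction = simd_width * flops_per_lane
    return n_spes * clock_ghz * flops_per_instruction / issue_interval


print(round(dp_peak_gflops(6), 1))  # 6 stall cycles (Cell)
print(round(dp_peak_gflops(1), 1))  # all but one stall removed (Cell+)
```

Under these assumptions, removing all but one stall cycle raises the issue rate from one DP instruction every 7 cycles to one every 2 cycles.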

Page 8

General Matrix Multiply and Add (GEMM)
◦ Column major
◦ Block data layout

Each matrix is broken into 8n x n element tiles designed to fit into the memory available on the Cell chip.

These are further divided into n x n element tiles that fit into the 8 SPE local stores.
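
The tiling idea can be illustrated with a plain blocked matrix multiply that works on n x n tiles. This is a sketch only: it uses row-major Python lists, a single tiling level, and no SIMD or DMA, so it does not reproduce the 8n x n outer tiles or the block data layout described above. `blocked_gemm` is a hypothetical name.

```python
def blocked_gemm(A, B, C, n):
    """Compute C += A * B for square matrices (lists of lists) in n x n tiles."""
    size = len(A)
    for i0 in range(0, size, n):            # tile row of C
        for j0 in range(0, size, n):        # tile column of C
            for k0 in range(0, size, n):    # tile along the inner dimension
                # Multiply one tile of A by one tile of B, accumulate into C.
                for i in range(i0, min(i0 + n, size)):
                    for j in range(j0, min(j0 + n, size)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + n, size)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
print(blocked_gemm(A, B, C, 1))
```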

Page 9

Storage formats
◦ Compressed Sparse Row (CSR)
◦ Blocked Compressed Sparse Row (BCSR)
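
For readers unfamiliar with CSR, a minimal sparse matrix-vector multiply over the three CSR arrays looks like the sketch below. The array names `values`, `col_idx`, and `row_ptr` are conventional but are assumptions here, and this scalar loop is not the SIMD implementation studied in the paper.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A * x, where A is stored in Compressed Sparse Row format.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row r owns nonzeros row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[10, 0, 0], [0, 20, 30], [0, 0, 40]]
values = [10.0, 20.0, 30.0, 40.0]
col_idx = [0, 1, 2, 2]
row_ptr = [0, 1, 3, 4]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

BCSR follows the same scheme but stores small dense blocks instead of single nonzeros, which trades some explicit zeros for fewer index lookups.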

Page 10

Two kernels are used, derived from the Chombo and Cactus toolkits.

Both apply a 7-point stencil in 3D at each grid point.
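
A 7-point stencil in 3D combines each grid point with its six face neighbors. The sketch below is a generic example, not the Chombo or Cactus kernel: the weights `alpha` and `beta` and the choice to leave boundary points unchanged are assumptions made for illustration.

```python
def stencil7(grid, alpha, beta):
    """One sweep of a 7-point stencil over the interior of a 3D grid.

    grid is indexed as grid[z][y][x]; boundary points are copied unchanged.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (grid[z - 1][y][x] + grid[z + 1][y][x] +
                             grid[z][y - 1][x] + grid[z][y + 1][x] +
                             grid[z][y][x - 1] + grid[z][y][x + 1])
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbors
    return out


ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = stencil7(ones, alpha=1.0, beta=1.0)
print(result[1][1][1])  # center point: 1 + 6 * 1
```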

Page 11

Compute intensity of FFTs is lower than that of matrix multiplication.

Both 1D and 2D versions analyzed.

Look-up tables used.

No double buffering.
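
One common use of a look-up table in an FFT is to precompute the twiddle factors once and index into them at every stage. The slide does not say what the tables hold, so the sketch below is an assumption: a recursive radix-2 1D FFT for power-of-two lengths, with hypothetical names `make_twiddles` and `fft`.

```python
import cmath


def make_twiddles(n):
    """Look-up table of W_n^k = exp(-2*pi*i*k/n) for k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def fft(x, table=None, stride=1):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    if table is None:
        table = make_twiddles(n)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], table, stride * 2)
    odd = fft(x[1::2], table, stride * 2)
    out = [0j] * n
    for k in range(n // 2):
        # The twiddle for a sub-transform of size n is table[k * stride].
        t = table[k * stride] * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


print([round(abs(v), 6) for v in fft([1, 1, 1, 1])])
```

A 2D FFT can be built from this by transforming every row and then every column.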

Page 13

Broadest quantitative study of Cell’s performance.

Cell’s three-level software-controlled memory architecture provides several advantages over mainstream cache-based architectures.

Disadvantage: lack of unaligned load support.

Propose Cell+ architecture for improving DP performance.

Page 14

Cell is unique in its architecture -> future architectures based on Cell??

Authors have done considerable work in analyzing Cell performance.

Critique 1: http://arstechnica.com/news.ars/post/20060615-7071.html

Critique 2: http://www.hpcwire.com/hpc/679134.html

Page 15

Title: FAST FOURIER TRANSFORM IMPLEMENTATION ON CELL BROADBAND ENGINE ARCHITECTURE

Main Objectives:
◦ Explore Cell Architecture and find out limitations/advantages of Cell Architecture.
◦ Get familiar with Cell programming environment.

Page 16

http://www-128.ibm.com/developerworks/power/library/pa-cellperf
http://en.wikipedia.org/wiki/Single_precision
http://arstechnica.com/news.ars/post/20060615-7071.html
http://www.hpcwire.com/hpc/671376.html
http://www.hpcwire.com/hpc/679134.html
http://arstechnica.com/cpu/2q99/benchmarking-1.html
http://crd.lbl.gov/html/news/CRDreport0506.pdf
http://arstechnica.com/news.ars/post/20060225-6265.html
http://www.research.ibm.com/journal/sj/451/eichenberger.html
http://ieeexplore.ieee.org/iel5/71/27301/01214317.pdf?tp=&isnumber=&arnumber=1214317
