OVERVIEW OF THE ARCHITECTURE, CIRCUIT DESIGN, AND PHYSICAL IMPLEMENTATION OF A FIRST-GENERATION CELL PROCESSOR

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

TRANSCRIPT

Page 1: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

OVERVIEW OF THE ARCHITECTURE, CIRCUIT DESIGN, AND PHYSICAL IMPLEMENTATION OF A FIRST-GENERATION CELL PROCESSOR

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Page 2: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

First Consumer Product

PlayStation 3!

Page 3: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Introduction

Developed through a partnership of Sony Computer Entertainment, Toshiba, and IBM.

Aim: Highly tuned for media processing. Anticipates demand for larger and more complex data handling.

Page 4: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

What is Cell?

Cell is an architecture for high performance distributed computing.  

It comprises hardware and software cells.

Supports a wide range of single- or multiple-processor and memory configurations.

Page 5: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

“Supercomputer” in daily life

Parallelism with high frequency. Real-time response. Supports multiple operating systems. 10 simultaneous threads. 128 memory requests. Optimally addresses many different system and application requirements.

Page 6: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Architecture Overview

Eight SPEs with Local Storage (LS). PPE with its L2 cache. Internal Element Interconnect Bus (EIB). Memory Interface Controller (MIC). Bus Interface Controller (BIC). Power Management Unit (PMU). Thermal Management Unit (TMU). Pervasive unit.

Page 7: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

High Level Diagram

Page 8: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Die Photograph

Page 9: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Synergistic Processing Elements (SPE) (1/2)

Share system memory with PPE through DMA.

Data and instructions reside in a private real address space backed by a 256 KB LS.

According to IBM, a single SPE can perform as well as a top-end (single-core) desktop CPU, given the right task.

Page 10: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Synergistic Processing Elements (SPE) (2/2)

Access main storage by issuing DMA commands to the associated MFC block (asynchronous transfer).

Fully pipelined, 128-bit-wide, dual-issue SIMD.

SPEs in a Cell can be chained together to act as a stream processor.

Page 11: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Power Processor Element (PPE) (1/2)

32 KB instruction cache and 32 KB data cache. 64-bit “Power Architecture” core with a 512 KB L2 cache.

Page 12: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Power Processor Element (PPE) (2/2)

Can initiate DMA for an SPE through MMIO control registers.

Hypervisor extension. Moderate pipeline length.

Page 13: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Element Interconnect Bus (EIB)

Can transfer up to 96 bytes per cycle.

Four 16-byte-wide rings: two running clockwise, two running counterclockwise.

Separate address and command network.

12 on/off ramps.
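
The ring arrangement above can be illustrated with a small sketch. It assumes the 12 ramps are numbered 0 to 11 around the ring and that a transfer simply takes the direction with fewer hops; the slides do not describe the real arbitration policy, so both the numbering and the tie-break rule are hypothetical.

```python
# Hypothetical model: 12 on/off ramps numbered 0..11 around the ring.
NUM_RAMPS = 12


def ring_hops(src: int, dst: int) -> dict:
    """Hop counts from src to dst in each direction around the ring."""
    return {
        "clockwise": (dst - src) % NUM_RAMPS,
        "counterclockwise": (src - dst) % NUM_RAMPS,
    }


def shortest_direction(src: int, dst: int) -> str:
    """Pick the direction with fewer hops (ties go clockwise: an assumption)."""
    hops = ring_hops(src, dst)
    if hops["clockwise"] <= hops["counterclockwise"]:
        return "clockwise"
    return "counterclockwise"


if __name__ == "__main__":
    for src, dst in [(0, 3), (0, 10), (0, 6)]:
        print(src, dst, ring_hops(src, dst), shortest_direction(src, dst))
```

Having rings in both directions means no transfer in this model needs more than 6 of the 12 hops.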

Page 14: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006
Page 15: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Memory Interface Controller (MIC)

Two 36-bit-wide XDR memory banks.

Can also support just a single bank.

Speed-matching SRAM and two clocks.

Page 16: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Power Reduction

Power Management Unit.

The PMU gives software controls for reducing chip power.

It can cause the OS to throttle, pause, or stop single or multiple units.

Page 17: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Thermal Monitoring

Thermal Sensors and Thermal Monitoring Unit.

One sensor is located where the temperature is relatively constant and is used for external cooling.

10 digital thermal sensors (DTS) at various critical locations.

Page 18: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Optimum Point (1/3)

Triple constraint: power, performance, area.

Gate oxide thickness

Thinner oxide: higher performance, but also higher gate tunneling current and reliability concerns.

Page 19: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Optimum Point (2/3)

Channel length

Shorter channel: improved performance, but also increased leakage current.

Supply voltage

Higher voltage: improved performance, but higher AC/DC power.
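
The supply-voltage tradeoff can be made concrete with the standard CMOS switching-power relation, P = activity x C x Vdd^2 x f. This relation is textbook background rather than something stated on the slides, and the voltages in the example are illustrative values only.

```python
def dynamic_power(activity: float, cap_farads: float, vdd: float, freq_hz: float) -> float:
    """Switching (AC) power of CMOS logic: P = activity * C * Vdd^2 * f."""
    return activity * cap_farads * vdd ** 2 * freq_hz


def relative_power(vdd_new: float, vdd_old: float, f_new: float, f_old: float) -> float:
    """Ratio of switching power after a voltage/frequency change (same activity and C)."""
    return (vdd_new / vdd_old) ** 2 * (f_new / f_old)


if __name__ == "__main__":
    # Illustrative: raising the supply from 1.0 V to 1.2 V at the same frequency.
    print(round(relative_power(1.2, 1.0, 1.0, 1.0), 2))
```

Because voltage enters squared, a 20% higher supply costs about 44% more switching power even before the clock is raised.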

Page 20: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Optimum Point (3/3)

Wire levels

Few levels: increased chip area.

Many levels: more cost.

Page 21: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Final Technology Parameters

Page 22: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Chip Integration

241M transistors.

8912 discrete floor-planned blocks.

Custom-tailored nets.

20 separate power domains.

Page 23: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

POWER-CONSCIOUS DESIGN OF THE CELL PROCESSOR’S SPE

Osamu Takahashi

IBM Systems and Technology Group

Scott Cottier

Sang H. Dhong

Brian Flachs

Joel Silberman

IBM T.J. Watson Research Center

Page 24: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

The CELL Processor - Properties

Mostly static CMOS gates. Dynamic gates are used for timing-critical paths.

Tight coupling of the ISA, microarchitecture, and physical implementation achieves a compact and power-efficient design.

Page 25: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

APPLICATIONS

To name a few (the list is endless): image processing for high-definition TV, image processing for medical use, high-performance computing, gaming.

Flexible enough to be a general-purpose microprocessor that supports high-level-language programming.

Page 26: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Cell processor - Architecture

64-bit Power core

Eight Synergistic Processor Elements (SPEs)

L2 Cache

Interconnection bus

I/O Controller

Rambus Flex I/O

Page 27: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Architecture contd.

The SPE has two clock domains: one with an 11 FO4 cycle time, the other with a 22 FO4 cycle time.

The high-frequency domain is implemented using custom design.

The SPE contains 256 Kbytes of dedicated local store memory.

The 128-bit, 128-entry general-purpose register file has six read ports and two write ports.

Page 28: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

SPE

The SMF operates at half the SPE’s frequency.

The SPE operates at up to 5.6 GHz with a 1.4 V supply at 56 °C.

The SPE’s measured power consumption is in the range of 1 W to 11 W, depending on operating clock frequency, temperature, and workload.

Page 29: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Triple design constraints

Cell contains eight copies of the SPE, so optimizing the SPE’s power and area is critical to the overall chip design.

A conscious effort was made to reduce SPE area and power while meeting the 11 FO4 cycle time performance objective.

The design was optimized to balance three constraints: power, area, and performance.

Tradeoffs were made to achieve the best overall results. Some techniques used: latch selection, a fine-grained clock-gating scheme, multi-clock-domain design, dual-threshold-voltage devices, and selective use of dynamic circuits.

Page 30: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Latch selection

Logic gets 8-9 FO4 of the cycle time.

The rest of the cycle is used by latches.

Several latch types with different insertion delays are used.

Page 31: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Transmission Gate Latch

SPE’s main workhorse latch.

Comes in two varieties: scannable and non-scannable.

Each has several power levels.

Used almost throughout the SPE.

Page 32: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Pulsed Clock Latch

Non-scannable. Small insertion delay. Small area. Relatively low power consumption. Used in the most timing-critical and power-critical areas.

Page 33: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Dynamic multiplexer latch

Scannable. Multiplexing widths from 4 to 10. Small insertion delay. Used in timing-critical areas that require multiplexing. Typically used for dataflow operand latches.

Page 34: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Dynamic PLA Latch

Scannable latch. Used to generate control signals (clock-gating signals).

The last two latches (dynamic multiplexer and dynamic PLA) use slightly more power, but they complete complex tasks within the critical timing. This is an example of a tradeoff among the triple constraints.

Page 35: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Fine-grained clock gating

An effective method of reducing power, used extensively in the CELL.

Uses a local clock buffer (LCB) that supplies the clock to a bank of latches. When the enable signal fires, the LCB buffers the global clock and sends it to the bank of latches.

The SPE activates only the necessary pipeline stages.

Registers are normally turned off.

Functional blocks were simulated and verified.

50% active power reduction using this design process.
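
A toy model may help show what enabling only the necessary pipeline stages means. The sketch below assumes a simple in-order pipeline in which an instruction issued in cycle c occupies stage s in cycle c + s, and the local clock buffer of a stage forwards the clock only when that stage holds a valid instruction. The pipeline depth and issue pattern are made up, and the model is not meant to reproduce the 50% figure.

```python
from typing import List, Tuple


def stage_enables(valid_in: List[bool], num_stages: int) -> List[List[bool]]:
    """enables[c][s] is True when stage s holds a valid instruction in cycle c."""
    cycles = len(valid_in) + num_stages - 1  # enough cycles to drain the pipeline
    table = []
    for c in range(cycles):
        row = []
        for s in range(num_stages):
            issue_cycle = c - s  # when the instruction now in stage s was issued
            row.append(0 <= issue_cycle < len(valid_in) and valid_in[issue_cycle])
        table.append(row)
    return table


def clock_pulses(table: List[List[bool]]) -> Tuple[int, int]:
    """Return (gated, ungated) counts of clock pulses delivered to latch banks."""
    gated = sum(1 for row in table for enabled in row if enabled)
    ungated = sum(len(row) for row in table)
    return gated, ungated


if __name__ == "__main__":
    # Hypothetical issue pattern: instructions in cycles 0 and 3, bubbles between.
    gated, ungated = clock_pulses(stage_enables([True, False, False, True], 3))
    print(f"gated={gated} ungated={ungated} saved={1 - gated / ungated:.0%}")
```

In this model the saving comes entirely from pipeline bubbles: a stage with nothing valid in it receives no clock pulse.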

Page 36: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Multiple clock frequency domains

High frequency increases performance.

It has some penalties: higher clock power, a higher percentage of the cycle consumed by insertion delays, and a shorter distance that a signal can travel in one cycle.

The SPE has some units whose performance does not depend solely on frequency.

The SMF operates at half the frequency.

Page 37: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Multiple clock frequency domains

11 FO4 blocks: register file, fixed-point unit, floating-point unit, data forwarding, load/store.

22 FO4 blocks: direct memory access unit, bus control.

One clock is distributed to both domains.

The SMF is activated every second clock cycle.
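
Distributing one clock and activating the slower domain on every second cycle can be sketched as a simple enable pattern. The choice of even cycles as the active ones is an assumption made for illustration; the slides only say that the SMF is activated every second cycle.

```python
def smf_enabled(cycle: int) -> bool:
    """The half-rate domain is clocked on every second cycle (even cycles here)."""
    return cycle % 2 == 0


def count_active_cycles(n_cycles: int) -> tuple:
    """Return (fast_domain_cycles, smf_cycles) over n_cycles of the shared clock."""
    fast = n_cycles
    slow = sum(1 for c in range(n_cycles) if smf_enabled(c))
    return fast, slow


if __name__ == "__main__":
    fast, slow = count_active_cycles(10)
    print(f"11 FO4 domain cycles: {fast}, 22 FO4 domain cycles: {slow}")
```

Two 11 FO4 periods make up one 22 FO4 period, so both domains stay aligned to the same clock edges.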

Page 38: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Multiple clock frequency domains

Avoids physical implementation difficulties. Helps escape the latch insertion delay and travel distance penalties.

Advantages: a larger percentage of the clock cycle is dedicated to logic, most SMF paths become non-critical, and smaller transistors can be used.

The SMF is optimized for both area and power without sacrificing performance.
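
The claim that the slower domain leaves a larger share of the cycle for logic follows from simple arithmetic. The latch-selection slide gives logic 8-9 FO4 of the 11 FO4 cycle, which leaves 2-3 FO4 for latches. The sketch below assumes the same 2-3 FO4 latch cost also applies in the 22 FO4 domain, which the slides do not state.

```python
def logic_fraction(cycle_fo4: float, latch_fo4: float) -> float:
    """Share of the clock cycle left for logic after latch insertion delay."""
    return (cycle_fo4 - latch_fo4) / cycle_fo4


if __name__ == "__main__":
    for latch_fo4 in (2, 3):  # 11 FO4 cycle minus 9 or 8 FO4 of logic
        for cycle_fo4 in (11, 22):
            share = logic_fraction(cycle_fo4, latch_fo4)
            print(f"cycle={cycle_fo4} FO4, latch={latch_fo4} FO4 -> logic {share:.1%}")
```

Under that assumption the logic share rises from roughly 73-82% of the cycle to roughly 86-91%.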

Page 39: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Dual-threshold-voltage devices

Leakage is a significant portion of power consumption in deep-submicron technology. It cannot be solved by clock gating or by two clock domains.

Solution: use high-threshold-voltage transistors. Penalty: slower switching time, so they are used only in paths with enough timing slack.

SMF paths that became non-critical because of the two clock domains were replaced with these devices.
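
The slack-based selection described above can be sketched as follows. The 20% slowdown factor, the path names, and the path delays are all hypothetical, and real flows assign thresholds per device rather than per path, so this is only a simplified illustration of the idea.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Path:
    name: str          # hypothetical path name
    delay_fo4: float   # path delay with regular-threshold devices


def assign_threshold(paths: List[Path], cycle_fo4: float, slowdown: float = 1.2) -> Dict[str, str]:
    """Use high-Vt devices wherever the slower path still fits in the cycle.

    `slowdown` is an assumed delay penalty for high-Vt devices.
    """
    choice = {}
    for path in paths:
        fits = path.delay_fo4 * slowdown <= cycle_fo4
        choice[path.name] = "high-Vt" if fits else "regular-Vt"
    return choice


if __name__ == "__main__":
    smf_paths = [Path("dma_queue", 15.0), Path("bus_arbiter", 20.0)]
    print(assign_threshold(smf_paths, cycle_fo4=22.0))
```

With a 22 FO4 budget, the 15 FO4 path has enough slack to absorb the penalty, while the 20 FO4 path does not and keeps the faster, leakier devices.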

Page 40: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Selective use of dynamic circuits

Advantages of static circuits over dynamic: design ease, low switching factor, tool compatibility, technology independence.

Advantages of dynamic circuits over their static counterparts: faster speed due to low capacitance at dynamic nodes, larger gain because of the inverters after the logic, microarchitectural efficiency (fewer stages), smaller area.

Page 41: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Selective use of dynamic circuits

Dynamic logic requires a clock, which means higher power consumption.

It requires both true and complementary signals.

Static implementations tend to hit the speed wall earlier.

Design approach: implement logic circuits in static CMOS as much as possible, and use alternatives only when static circuits did not meet the speed requirements.

Page 43: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Selective use of dynamic circuits

Dynamic circuits have static interfaces. They occupy 19 percent of the non-SRAM area and include the following macros: dataflow forwarding, multiport register file, floating-point unit, dynamic PLA, multiplexer latch, instruction line buffer.

Page 44: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

SPE hardware measurements

Tested on complicated 3D picture rendering.

The fastest operation ran at 5.6 GHz with a 1.4 V supply at 56 °C.

The global clock mesh’s measured power is 1.3 W per SPE at a 1.2 V supply and 2.0 GHz clock frequency.

The Cell architecture is compatible with the 64-bit Power architecture, so applications can build on existing Power investments.

It can be considered a non-homogeneous coherent chip multiprocessor.

High design frequency has been achieved through a highly optimized implementation.

Its streaming DMA architecture helps to enhance the memory effectiveness of the processor.

Refer to the shmoo plot for power analysis.

Page 45: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

SPE shmoo plot

Page 46: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Applications of the CELL Processor and Its Potential for Scientific Computing

Page 47: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Page 48: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

THE POWER!

FOLDING@HOME broke the Guinness world record for the “world’s most powerful distributed network” with computing power of more than 1 PF (a thousand trillion floating-point operations per second). Blue Gene is 500 TF.

Page 49: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

WHY THE POWER?

Cell combines the considerable floating-point resources required for demanding numerical algorithms with a power-efficient software-controlled memory hierarchy.

Contains a powerful 64-bit dual-threaded IBM PowerPC core and eight proprietary 'Synergistic Processing Elements' (SPEs): eight more highly specialized mini-computers on the same die.

Cell’s peak double-precision performance is very impressive relative to its commodity peers (14.6 Gflop/s at 3.2 GHz).

Page 50: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

OVERVIEW

Quantitative performance comparison of the Cell to the AMD Opteron (superscalar), Intel Itanium 2 (VLIW), and Cray X1E (vector).

Minor architectural changes (CELL+) to improve DP performance.

Complexity of mapping scientific algorithms onto the CELL.

A few interesting applications.

Page 51: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

ARCHITECTURE

Each SPE contains four SP 6-cycle pipelined FMA (fused multiply-add) datapaths and one DP 9-cycle pipelined FMA datapath, plus 4 cycles for data movement.

7-cycle in-order execution pipeline and forwarding network.

Inserts a 6-cycle stall after a DP instruction.

One DP instruction is issued every 7 cycles.

DP performance is 1/14 of peak SP performance.
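
The 1/14 ratio can be checked with a short calculation. The sketch assumes a 128-bit SIMD instruction carries four SP or two DP values, counts each FMA as two flops, and uses a 3.2 GHz clock with all eight SPEs busy. The clock rate is an assumption here, chosen because it reproduces the 14.6 Gflop/s peak DP figure quoted in the slides.

```python
SPES = 8            # SPEs per Cell, from the architecture overview
CLOCK_GHZ = 3.2     # assumed clock rate


def peak_gflops(lanes: int, issue_interval_cycles: int,
                spes: int = SPES, clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak rate when one SIMD FMA of `lanes` values issues every N cycles."""
    flops_per_instruction = lanes * 2  # an FMA is one multiply plus one add
    return spes * clock_ghz * flops_per_instruction / issue_interval_cycles


if __name__ == "__main__":
    sp = peak_gflops(lanes=4, issue_interval_cycles=1)  # 4 SP lanes, every cycle
    dp = peak_gflops(lanes=2, issue_interval_cycles=7)  # 2 DP lanes, every 7 cycles
    print(f"SP peak {sp:.1f} Gflop/s, DP peak {dp:.1f} Gflop/s, ratio {sp / dp:.0f}")
```

The factor of 14 is the product of two effects: half as many lanes per instruction, and one issue slot in seven.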

Page 52: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Programming

Modified SPMD (Single Program Multiple Data)

Dual Program Multiple Data.

Each SPE has its own local memory from which it fetches code and reads/writes data.

All loads and stores are local.

Explicit DMA operations move data from main memory to local memory.

Software-controlled memory.
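
The software-controlled memory model can be sketched as a loop that explicitly copies one block at a time into a local buffer and computes only on that buffer. This is an illustrative model in plain Python, not the Cell SDK: `dma_get` is a hypothetical stand-in for a DMA transfer, the block size is arbitrary, and the code runs sequentially, whereas on real hardware the next transfer would proceed while the current block is being processed.

```python
from typing import List

LOCAL_STORE_BYTES = 256 * 1024  # local store size given in the slides


def dma_get(main_memory: List[float], start: int, count: int) -> List[float]:
    """Hypothetical stand-in for a DMA transfer: copy a block into local store."""
    return main_memory[start:start + count]


def spe_sum_of_squares(main_memory: List[float], block_elems: int = 1024) -> float:
    """Stream main memory through two local buffers (double buffering)."""
    # Two buffers of 8-byte values must fit in the local store.
    assert 2 * block_elems * 8 <= LOCAL_STORE_BYTES
    total = 0.0
    offset = 0
    current = dma_get(main_memory, offset, block_elems)   # fetch the first block
    while current:
        offset += block_elems
        upcoming = dma_get(main_memory, offset, block_elems)  # start the next fetch
        total += sum(x * x for x in current)                  # compute on local data
        current = upcoming
    return total


if __name__ == "__main__":
    print(spe_sum_of_squares([1.0, 2.0, 3.0], block_elems=2))
```

The compute step never touches `main_memory` directly, which mirrors the rule that all SPE loads and stores are local.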

Page 53: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Programming Models

Very challenging to program.

Explicit parallelism between the SPEs and the PPC.

Quadword ISA.

Unlike MPI, the communication intrinsics are low level, hence faster.

Three basic models:

Task parallel: separate tasks assigned to each SPE.

Pipeline parallel: large blocks of data transferred between SPEs.

Data parallel: same code, distinct data (the paper uses this).

Page 54: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Benchmark Kernels

Stencil computations on structured grids.

Sparse matrix-vector multiplication.

Matrix-matrix multiplication.

1D FFTs.

2D FFTs.

Page 55: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

CELL +

The authors of this paper proposed minor architectural changes to the CELL processor.

DP wasn’t a major focus for the gaming world.

A redesign would increase complexity and power consumption.

DP instructions are fetched every 2 cycles, keeping everything else the same.

Page 56: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

The Processors Used

Page 57: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Benchmark 1 - GEMM

Dense matrix-matrix multiplication: high computational intensity and regular memory access.

Expected to reach close to peak on most platforms.

Explored two blocking formats: column major and block data layout.
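
The difference between the two formats can be sketched as an index remapping. In column-major storage a b x b tile is scattered across b separate columns, while in a block data layout each tile is contiguous and could be fetched with one transfer. The ordering chosen below (blocks in row-major order, elements row-major within a block) is an assumption; the slides do not specify the exact layout.

```python
from typing import List


def column_major_to_block_layout(a: List[float], n: int, b: int) -> List[float]:
    """Reorder an n x n column-major matrix so each b x b block is contiguous."""
    if n % b != 0:
        raise ValueError("block size must divide the matrix dimension")
    blocks_per_row = n // b
    out = [0.0] * (n * n)
    for j in range(n):
        for i in range(n):
            block_i, local_i = divmod(i, b)
            block_j, local_j = divmod(j, b)
            block_start = (block_i * blocks_per_row + block_j) * b * b
            out[block_start + local_i * b + local_j] = a[i + j * n]
    return out


if __name__ == "__main__":
    n = 4
    # Matrix with entry (i, j) = 4*i + j, stored column-major.
    col_major = [float(4 * i + j) for j in range(n) for i in range(n)]
    print(column_major_to_block_layout(col_major, n, 2))
```

After the remapping, the first four values are the top-left 2 x 2 tile, the next four the top-right tile, and so on.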

Page 58: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Benchmark 1 - GEMM

Gflop/s   Cell+PM   CellPM   X1E    AMD64   IA64
DP        51.1      14.6     16.9   4.0     5.4
SP        -         204.7    29.5   7.8     3.0

Mflop/W   Cell+PM   CellPM   X1E    AMD64   IA64
DP        1277      365      141    45      42
SP        -         5117     245    88      23

Page 59: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

BENCHMARK 2 – Sparse Matrix Vector Multiply

Seems like a poor choice at first glance due to low computational intensity and irregular data accesses.

But low local store latency, task parallelism, 8 SPE load/store units, and DMA prove otherwise.

Most of the matrix entries are zero, so the nonzeros are sparsely distributed and can be streamed via DMA.

Like DGEMM, it can exploit an FMA well.

Very low computational intensity (1 FMA for every 12+ bytes).

Non-FP instructions can dominate.

Row lengths can be unique and in multiples of 4.
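
A minimal sparse matrix-vector multiply shows where the low computational intensity comes from. The sketch uses the common compressed sparse row (CSR) format, which is an assumption since the slides do not name the storage format. Each nonzero costs one multiply-add but requires an 8-byte value plus a 4-byte column index, which is consistent with the figure of 1 FMA per 12+ bytes.

```python
from typing import List


def spmv_csr(values: List[float], col_idx: List[int],
             row_ptr: List[int], x: List[float]) -> List[float]:
    """y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # one multiply-add per nonzero
        y.append(acc)
    return y


def bytes_per_fma(value_bytes: int = 8, index_bytes: int = 4) -> int:
    """Matrix data streamed per multiply-add (excludes vector and row pointers)."""
    return value_bytes + index_bytes


if __name__ == "__main__":
    # A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
    values = [4.0, 1.0, 2.0, 3.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]), bytes_per_fma())
```

The indirect access `x[col_idx[k]]` is the irregular part; the `values` and `col_idx` arrays themselves are read sequentially, which is what makes them streamable.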

Page 60: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

SpMV - Results

Gflop/s           (unlabeled)  Cell+PM  CellPM  CellFSS  X1E    AMD64  IA64
symmetric    DP   3.38*        4.35     4.00    3.04     2.64   0.60   0.67
symmetric    SP   -            -        7.68    -        -      0.80   0.83
unsymmetric  DP                2.46     2.34             1.14   0.36   0.36
unsymmetric  SP                -        4.08             -      0.53   0.41

Mflop/W           (unlabeled)  Cell+PM  CellPM  CellFSS  X1E    AMD64  IA64
symmetric    DP   84.5*        109      100     76.0     22.0   6.74   5.15
symmetric    SP   -            -        192     -        -      8.99   6.38
unsymmetric  DP                61.5     58.5             9.50   4.04   2.77
unsymmetric  SP                -        102              -      5.96   3.15

Page 61: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Stencil Based Computations

Stencil computation codes represent a wide array of scientific applications.

Each point in a multidimensional grid is updated from a subset of its neighbours.

Finite difference operations are used to solve complex numerical systems.

Here, a simple heat equation and a 3D hyperbolic PDE are examined.

Relatively low computational intensity results in a low percentage of peak on superscalars.

Memory bandwidth bound, because of the low computational intensity.
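
A single explicit time step of a heat equation on a 3D grid illustrates the neighbour-update pattern. The 7-point stencil, the fixed boundary values, and the coefficient `alpha` are assumptions chosen for a minimal example; they are not taken from the benchmarked code.

```python
from typing import List

Grid = List[List[List[float]]]


def heat_step_3d(u: Grid, alpha: float = 0.1) -> Grid:
    """One explicit heat-equation step using a 7-point stencil.

    Boundary points are left unchanged (an assumed boundary condition).
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbours = (u[z - 1][y][x] + u[z + 1][y][x] +
                              u[z][y - 1][x] + u[z][y + 1][x] +
                              u[z][y][x - 1] + u[z][y][x + 1])
                new[z][y][x] = u[z][y][x] + alpha * (neighbours - 6.0 * u[z][y][x])
    return new


if __name__ == "__main__":
    grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    grid[1][1][1] = 1.0  # a single hot point in the middle
    print(heat_step_3d(grid)[1][1][1])
```

Each updated point reads seven values and performs only a handful of flops, which is why such codes tend to be limited by memory bandwidth.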

Page 62: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Stencils - Results

Gflop/s   (unlabeled)  Cell+PM  CellPM  CellFSS  X1E    AMD64  IA64
DP                     21.1     8.2     7.25     3.91   0.57   1.19
SP        65.8         -        21.2             3.26   1.07   1.97

Mflop/W   (unlabeled)  Cell+PM  CellPM  CellFSS  X1E    AMD64  IA64
DP                     528      205     181      32.6   6.4    9.15
SP        1645         -        530              27.2   12     15.2

Page 63: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

1D Fast Fourier Transforms

The fast Fourier transform (FFT) is of great importance to a wide variety of applications.

One of the main techniques for solving PDEs.

Relatively low computational intensity with a non-trivial volume of data movement.

1D FFT: naïve algorithm, cooperatively executed across the SPEs.

Load roots of unity, load data (cyclic).

3 stages: local work, on-chip transpose, local work.

No double buffering (i.e., no overlap of communication and computation).

2D FFT: 1D FFTs are each run on a single SPE.

Each SPE performs 2 * (N/8) FFTs.

Double buffer (2 incoming and 2 outgoing).

Straightforward algorithm (N^2 2D FFT): N simultaneous FFTs, transpose.

Transposes represent about 50% of SP execution time, but only 20% of DP.

Cell performance is compared with highly optimized FFTW and vendor libraries.
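
The three-stage structure (local work, transpose, local work) resembles the well-known decomposition of a length n1 x n2 transform into small independent transforms separated by a transpose. The sketch below is that generic decomposition, not necessarily the algorithm used in the paper. It uses a naive DFT for the local work and runs sequentially, whereas on Cell each group of small transforms would be spread across SPEs.

```python
import cmath
from typing import List


def dft(x: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, used here as the 'local work'."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def three_stage_fft(x: List[complex], n1: int, n2: int) -> List[complex]:
    """DFT of length n1*n2 built from short DFTs and one transpose."""
    n = n1 * n2
    assert len(x) == n
    # Stage 1 (local work): n1 strided DFTs of length n2, then twiddle factors.
    rows = []
    for j1 in range(n1):
        y = dft([x[j1 + n1 * j2] for j2 in range(n2)])
        rows.append([y[k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)
                     for k2 in range(n2)])
    # Stage 2 (transpose): regroup so each output index k2 owns its n1 values.
    cols = [[rows[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # Stage 3 (local work): n2 DFTs of length n1, written to their final places.
    out = [0j] * n
    for k2 in range(n2):
        z = dft(cols[k2])
        for k1 in range(n1):
            out[k2 + n2 * k1] = z[k1]
    return out


if __name__ == "__main__":
    data = [complex(v) for v in range(8)]
    error = max(abs(a - b) for a, b in zip(three_stage_fft(data, 2, 4), dft(data)))
    print(f"max difference from direct DFT: {error:.2e}")
```

The transpose is the only step in which data must move between the workers that own different rows, which is why it shows up as a separate cost in the measurements.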

Page 64: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

FFT Results (1D and 2D)

Averaged Gflop/s   Cell+PM  CellPM  X1E    AMD64  IA64
2D  DP             16.2     6.65    7.05   0.69   0.31
2D  SP             -        38.2    7.93   1.32   0.42
1D  DP             13.4     5.85    4.53   1.61   2.70
1D  SP             -        33.7    5.30   3.24   1.72

Averaged Mflop/W   Cell+PM  CellPM  X1E    AMD64  IA64
2D  DP             405      166     58.8   7.75   2.38
2D  SP             -        955     66.1   14.8   3.23
1D  DP             335      146     37.8   18.1   20.8
1D  SP             -        843     44.2   36.4   13.2

Page 65: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

A Few Conclusions

Far more predictable than conventional machines.

Even in double precision, it obtains much better performance on a surprising variety of codes.

Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of memory bandwidth.

The instruction set can be very inefficient for poorly SIMDized or misaligned codes.

Loop overheads can heavily dominate performance.

The programming model is clunky.

Page 66: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Real World Applications

Page 67: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

FOLDING@HOME

Folding@home (TM) is a distributed computing project at Stanford University.

Connects more than 1 million CPUs.

Mainly used to study protein folding and misfolding.

The PS3 Cell Broadband Engine greatly increased the total computation power, up to 1 PF.

1 work unit takes 8 hours: run the PS3 overnight, then it sends the results back.

250 K CPUs active in 2008.

Page 68: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006
Page 69: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Other Real-Life Scientific Applications

Ray Tracing

Modeling of the human brain

Solving complex equations to predict the gravitational waves generated by super-sized black holes.

To assist an autonomous vehicle.

Page 70: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Axion Racing Entry into DARPA Urban Challenge

A series of events designed to test autonomous vehicles, with the goal of developing technology that keeps people off the battlefield.

Axion Racing used a PS3 running Yellow Dog Linux as part of its on-board image recognition system.

“Spirit”, Axion Racing’s vehicle, was the first of its kind to drive itself to the 14,110-foot summit of Colorado’s Pikes Peak.

It uses the stereo vision concept (2 cameras) to determine object distance: running the two images through the software produces a disparity map. The farther away the object is, the smaller the disparity; the nearer the object, the larger the disparity.

Spirit uses Cell to park and reverse.
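
The inverse relation between distance and disparity follows from the standard pinhole stereo formula, depth = focal length x baseline / disparity. This formula is textbook background rather than a description of Axion Racing's software, and the focal length and camera spacing below are made-up values.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Distance to a point seen by two parallel cameras (pinhole stereo model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite distance")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    focal_px, baseline_m = 700.0, 0.5  # hypothetical camera parameters
    for disparity in (35.0, 7.0):
        depth = depth_from_disparity(focal_px, baseline_m, disparity)
        print(f"disparity {disparity} px -> {depth} m")
```

With these example parameters, a disparity of 35 pixels corresponds to 10 m and a disparity of 7 pixels to 50 m, so a smaller disparity means a more distant object.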

Page 71: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

SPIRIT

Along with the stereo cameras, Spirit uses a laser range finder, an infrared camera, two NAVCOM Starfire GPS units, and an inertial navigation system (to correct for GPS errors and signal losses).

Page 72: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Ray Tracing

A very computationally intensive algorithm that models the paths taken by light rays as they interact with optical surfaces.

Also used in modeling radio waves, radiation effects, and in other engineering areas.

The algorithm needs to be heavily modified to run on the Cell.
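
The core operation of a ray tracer is an intersection test between a ray and a surface. The sketch below is the standard ray-sphere test and is meant only to show the kind of arithmetic involved; it is not the Cell-specific implementation, and it assumes the ray direction is a unit vector.

```python
import math
from typing import Optional, Sequence


def ray_sphere_hit(origin: Sequence[float], direction: Sequence[float],
                   center: Sequence[float], radius: float) -> Optional[float]:
    """Distance along the ray to the nearest hit, or None if the ray misses.

    Solves |origin + t*direction - center|^2 = radius^2 for t >= 0,
    assuming `direction` has unit length.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = 2.0 * sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    discriminant = b * b - 4.0 * c
    if discriminant < 0:
        return None  # the ray's line never touches the sphere
    root = math.sqrt(discriminant)
    t = (-b - root) / 2.0       # nearer intersection
    if t < 0:
        t = (-b + root) / 2.0   # origin is inside the sphere
    return t if t >= 0 else None


if __name__ == "__main__":
    print(ray_sphere_hit((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0))
    print(ray_sphere_hit((0, 0, 0), (0, 0, 1), (0, 3, 5), 1.0))
```

A renderer repeats tests like this for every ray against many objects, and the tests are independent of one another, which is what makes the workload a natural fit for data-parallel SIMD hardware.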

Page 73: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Ray Tracing

This video shows a progression of ray-traced shaders executing on a cluster of IBM QS20 Cell blades. Scenes of over 300,000 triangles render at over 60 frames per second (depending on the shader) at 1080p resolution using 14 Cell processors. Because of the scalable nature of the ray tracer, it can also render interactive frames on a single Linux PlayStation 3 using only 6 SPEs.

Page 74: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Page 75: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Conclusion

Overall, a single PS3 performs better than the highest-end desktops available and compares to as many as 25 nodes of an IBM Blue Gene supercomputer. And there is still tremendous scope left for extracting more performance through further optimization.

It’s a commodity processor, hence cheap, and can be used in large quantities.

The most difficult part is writing and compiling code!

Page 76: IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

QUESTIONS????