COMP 635: Seminar on Heterogeneous Processors

Lecture 2: Introduction to the Cell Processor

Vivek Sarkar

Department of Computer Science
Rice University

[email protected]
September 10, 2007

www.cs.rice.edu/~vsarkar/comp635

COMP 635, Fall 2007 (V.Sarkar)

Announcements

• Class dates (REMINDER)
  — 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
  — No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007)
  — No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
  — Time & Place
    – Default: Mondays, 3:30pm - 4:30pm, DH 1075
    – Exception: 9/20 (Thurs) lecture, 3:30pm - 4:30pm, DH 3076
    – 30 minutes reserved after each lecture for discussion (optional)
  — Office Hours (DH 3131)
    – 11am - 12noon, Fridays from 8/31/07 to 12/7/07

• Volunteers needed to lead discussion of papers in next lecture (9/20)
  1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al, IBM Systems Journal, Vol 45, No 1, 2006
  2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al, PPoPP 2007 Best Paper, March 2007.

• CELL HACK-A-THON II, Austin, September 22 - 25
  — See http://www.hpc-consortium.net/events/200709/ for details
  — Contact me if you’re interested in attending so as to work on a class project

Acknowledgments

• MIT 6.189 IAP 2007, Jan 2007, Lecture 2, “Introduction to the Cell Processor”, Michael Perrone, http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• Georgia Tech, Sony/Toshiba/IBM Workshop on Software and Applications for the Cell/B.E. processor, June 18-19, 2007, http://sti.cc.gatech.edu/program.html
  — Code and Data Partitioning for the Local Stores on the Cell/B.E. processor, Kevin O'Brien, Kathryn O'Brien, Zehra Sura, Tao Zhang and Tong Chen, http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf

• U. Penn. Systems Seminar on “Cell Processor,” Diana Palsetia, 11/21/2006, www.cis.upenn.edu/~palsetia/cellproc.ppt

Outline

• Cell Processor and Software Environment
  — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• Code and Data Partitioning for the Local Stores on the Cell/BE processor
  — http://sti.cc.gatech.edu/Slides/OBrien-070618.pdf

• Yuan Zhao -- Experiences with compiling for Cell

Cell History

• IBM, SCEI/Sony, Toshiba Alliance formed in 2000
• Design Center opened in March 2001
  — Based in Austin, Texas
• Single Cell BE operational Spring 2004
• 2-way SMP operational Summer 2004
• February 7, 2005: First technical disclosures
• October 6, 2005: Mercury Announces Cell Blade
• November 9, 2005: Open Source SDK & Simulator Published
• November 14, 2005: Mercury Announces Turismo Cell Offering
• February 8, 2006: IBM Announced Cell Blade

Cell Chip

[Figure: Cell chip]

Cell Features

• Heterogeneous multicore system architecture
  — Power Processor Element for control tasks
  — Synergistic Processor Elements for data-intensive processing

• Synergistic Processor Element (SPE) consists of
  — Synergistic Processor Unit (SPU)
  — Synergistic Memory Flow Control (MFC)
    – Data movement and synchronization
    – Interface to high-performance Element Interconnect Bus

[Figure: Cell block diagram. The PPE (PPU with PXU, L1, and L2; 64-bit Power Architecture with VMX) and eight SPEs (each an SPU with SXU, LS, and MFC) attach to the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and the BIC (FlexIO™). Link bandwidths in the diagram are labeled 16B/cycle and 32B/cycle.]

SPE Block Diagram

[Figure: SPE block diagram. Labeled blocks: Permute Unit, Load-Store Unit, Floating-Point Unit, Fixed-Point Unit, Branch Unit, Channel Unit, Result Forwarding and Staging, Register File, Local Store (256kB, Single Port SRAM, 128B Read / 128B Write), DMA Unit, Instruction Issue Unit / Instruction Line Buffer, On-Chip Coherent Bus. Datapath legend: 8 Byte/Cycle, 16 Byte/Cycle, 64 Byte/Cycle, 128 Byte/Cycle.]

SPU Details

• Synergistic Processor Element (SPE)
  — ISA influenced by VMX and PS2’s Emotion Engine
• User-mode architecture
  — No translation/protection within SPE
  — DMA is full PowerPC protect/xlate
• Direct programmer control
  — DMA/DMA-list
  — Branch hint
  — No dynamic prediction
  — In-order execution
• VMX-like SIMD dataflow
  — Graphics SP-Float
  — No saturate arith, some byte
  — IEEE DP-Float (BlueGene-like)
• Unified register file
  — 128 entry x 128 bit
• 256KB Local Store
  — Combined I & D
  — 16B/cycle L/S bandwidth
  — 128B/cycle DMA bandwidth
• DMA unit w/ Memory Flow Control (MFC) commands
  — MFC’s MMU allows consistent interface to system storage map for all processors despite heterogeneous structure
• SPU Units (pipelined)
  — Simple (FXU even)
    – Add/Compare
    – Rotate
    – Logical, Count Leading Zero
  — Permute (FXU odd)
    – Permute
    – Table-lookup
  — FPU (Single / Double Precision)
  — Control (SCN)
    – Dual Issue, Load/Store, ECC Handling
  — Channel (SSC)
    – Interface to MFC
  — Register File (GPR/FWD)
• SPU Latencies
  — Simple fixed point - 2 cycles*
  — Complex fixed point - 4 cycles*
  — Load - 6 cycles*
  — Single-precision (ER) float - 6 cycles*
  — Integer multiply - 7 cycles*
  — Branch miss (no penalty for correct hint) - 20 cycles
  — DP (IEEE) float (partially pipelined) - 13 cycles*
  — Enqueue DMA Command - 20 cycles*

Memory Flow Controller Commands

DMA Commands
  Put - Transfer from Local Store to EA space
  Puts - Transfer and Start SPU execution
  Putr - Put Result - (Arch. Scarf into L2)
  Putl - Put using DMA List in Local Store
  Putrl - Put Result using DMA List in LS (Arch)
  Get - Transfer from EA Space to Local Store
  Gets - Transfer and Start SPU execution
  Getl - Get using DMA List in Local Store
  Sndsig - Send Signal to SPU

Command Modifiers: <f,b>
  f: Embedded Tag Specific Fence
     Command will not start until all previous commands in same tag group have completed
  b: Embedded Tag Specific Barrier
     Command and all subsequent commands in same tag group will not start until previous commands in same tag group have completed

Command Parameters
  LSA - Local Store Address (32 bit)
  EA - Effective Address (32 or 64 bit)
  TS - Transfer Size (16 bytes to 16K bytes)
  LS - DMA List Size (8 bytes to 16K bytes)
  TG - Tag Group (5 bit)
  CL - Cache Management / Bandwidth Class

SL1 Cache Management Commands
  sdcrt - Data cache region touch (DMA Get hint)
  sdcrtst - Data cache region touch for store (DMA Put hint)
  sdcrz - Data cache region zero
  sdcrs - Data cache region store
  sdcrf - Data cache region flush

Synchronization Commands
  Lockline (Atomic Update) Commands:
    getllar - DMA 128 bytes from EA to LS and set Reservation
    putllc - Conditionally DMA 128 bytes from LS to EA
    putlluc - Unconditionally DMA 128 bytes from LS to EA
  barrier - all previous commands complete before subsequent commands are started
  mfcsync - Results of all previous commands in Tag group are remotely visible
  mfceieio - Results of all preceding Puts commands in same group visible with respect to succeeding Get commands
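The fence (f) and barrier (b) modifiers can be read as a rule for when a queued command may start. The following is a minimal sketch of that rule in portable C, not MFC code: the types `mfc_cmd_t` and `modifier_t` and the function `may_start` are hypothetical names introduced here for illustration. The sketch follows the wording of the slide literally: a later command is held back by an earlier barrier only until the commands queued before that barrier have completed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one queued MFC command; not a real SDK type. */
typedef enum { MOD_NONE, MOD_FENCE, MOD_BARRIER } modifier_t;

typedef struct {
    int tag;         /* tag group, 0..31 (TG is 5 bits) */
    modifier_t mod;  /* none, f (fence), or b (barrier) */
    bool done;       /* true once the command has completed */
} mfc_cmd_t;

/* True if every command queued before position 'limit' in tag group 'tag' is done. */
static bool earlier_same_tag_done(const mfc_cmd_t *q, int limit, int tag) {
    for (int k = 0; k < limit; k++)
        if (q[k].tag == tag && !q[k].done)
            return false;
    return true;
}

/* May command q[idx] start under the fence/barrier rules? */
bool may_start(const mfc_cmd_t *q, int idx) {
    assert(idx >= 0);
    int tag = q[idx].tag;

    /* f or b on the command itself: wait for all earlier same-tag commands. */
    if (q[idx].mod != MOD_NONE && !earlier_same_tag_done(q, idx, tag))
        return false;

    /* An earlier barrier in the same tag group also holds back later
       commands until everything queued before that barrier is done. */
    for (int j = 0; j < idx; j++)
        if (q[j].tag == tag && q[j].mod == MOD_BARRIER &&
            !earlier_same_tag_done(q, j, tag))
            return false;

    return true;
}
```

For a queue holding two unmodified commands in tag group 3 followed by a fenced command in the same group, `may_start` returns true for the first two and false for the third until both earlier commands are marked done. Commands in other tag groups are unaffected.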

PPE Structure

Power Processor Element (PPE):
  — General purpose, 64-bit RISC processor (Power/PowerPC binary compatible)
  — In-order, dual issue, dual threaded
  — L1 : 32KB I ; 32KB D
  — L2 : 512KB
  — Coherent load / store
  — VMX-32
  — Realtime Controls
    – Locking L2 Cache & TLB
    – Software / hardware managed TLB
    – Bandwidth / Resource Reservation
    – Mediated Interrupts

Element Interconnect Bus

• EIB data ring for internal communication
  — Four unidirectional 16 byte data rings, supporting multiple transfers
    – 2 clockwise, 2 anti-clockwise; worst-case latency is half ring length
  — 96B/cycle peak bandwidth
  — Over 100 outstanding requests

Example of Eight Concurrent Transactions

[Figure: EIB with a central data arbiter and four rings (Ring0 to Ring3). Twelve ramps (0 to 11), each with a controller, attach the MIC, PPE, SPE0 to SPE7, BIF/IOIF0, and IOIF1 to the rings.]

Theoretical Peak Operations

[Figure: chart comparing theoretical peak operations for the Freescale MPC8641D (1.5 GHz), AMD Athlon™ 64 X2 (2.4 GHz), PowerPC® 970MP (2.5 GHz), Cell Broadband Engine™ (3.2 GHz), and Intel Pentium D® (3.2 GHz).]

Cell BE Performance

• BE can outperform a P4/SSE2 at same clock rate by 3 to 18x (assuming linear scaling) in various types of application workloads

Type              Algorithm                     3 GHz GPP                       3 GHz BE                BE Perf Advantage
HPC               Matrix Multiplication (S.P.)  25 GFlops                       190 GFlops (8 SPEs)     8x
HPC               Linpack (S.P.)                18 GFlops (IA32)                150 GFlops (BE)         8x
HPC               Linpack (D.P.)                6 GFlops (IA32)                 12 GFlops (BE)          2x
bioinformatic     smith-waterman                570 Mcups (IA32)                420 Mcups (per SPE)     6x
graphics          transform-light               160 MVPS (G5/VMX)               240 MVPS (per SPE)      12x
graphics          TRE                           1.6 fps (G5/VMX)                24 fps (BE)             15x
security          AES                           1.1 Gbps (IA32)                 2 Gbps (per SPE)        14x
security          TDES                          0.12 Gbps (IA32)                0.16 Gbps (per SPE)     10x
security          MD-5                          2.68 Gbps (IA32)                2.3 Gbps (per SPE)      6x
security          SHA-1                         0.85 Gbps (IA32)                1.98 Gbps (per SPE)     18x
communication     EEMBC                         501 Telemark (1.4GHz mpc7447)   770 Telemark (per SPE)  12x
video processing  mpeg2 decoder (sdtv)          200 fps (IA32)                  290 fps (per SPE)       12x
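The per-SPE rows of the table are consistent with reading "assuming linear scaling" as multiplying the single-SPE rate by the eight SPEs on the chip and dividing by the general-purpose processor rate. That reading is an inference from the numbers, not something the slide states, and the helper name `be_advantage` is hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>

#define NUM_SPES 8  /* SPEs per Cell BE chip */

/* Estimated advantage of a full chip over the general-purpose processor
   (GPP), assuming throughput scales linearly across all eight SPEs. */
double be_advantage(double per_spe_rate, double gpp_rate) {
    assert(gpp_rate > 0.0);
    return per_spe_rate * NUM_SPES / gpp_rate;
}
```

For example, `be_advantage(2.0, 1.1)` for AES gives about 14.5 against the 14x in the table, and `be_advantage(1.98, 0.85)` for SHA-1 gives about 18.6 against the listed 18x. The table's figures appear to be rounded or truncated, so small differences are expected.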

CELL Software Design Considerations

• Four Levels of Parallelism
  — Blade Level: Two Cell processors per blade
  — Chip Level: 9 cores run independent tasks
  — Instruction level: Dual issue pipelines on each SPE
  — Register level: Native SIMD on SPE and PPE VMX
• 256KB local store per SPE: data + code + stack
• Communication
  — DMA and Bus bandwidth
    – DMA granularity: 128 bytes
    – DMA bandwidth among LS and System memory
  — Traffic control
    – Exploit computational complexity and data locality to lower data traffic requirement
  — Shared memory / Message passing abstraction overhead
  — Synchronization
  — DMA latency handling
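The 128-byte DMA granularity, together with the 16K-byte upper bound on a single transfer listed among the MFC command parameters, affects how a buffer is sized and how many DMA commands a large array needs. The sketch below works that out in portable C. Padding every transfer to a whole number of 128-byte lines is an assumption made for illustration, and the names `round_up_to_line` and `dma_commands_needed` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_LINE 128u           /* DMA granularity in bytes */
#define DMA_MAX  (16u * 1024u)  /* largest single transfer in bytes */

/* Round a byte count up to a whole number of 128-byte lines. */
size_t round_up_to_line(size_t bytes) {
    return (bytes + DMA_LINE - 1) / DMA_LINE * DMA_LINE;
}

/* Number of DMA commands needed to move 'bytes' when each command
   carries at most DMA_MAX bytes. */
size_t dma_commands_needed(size_t bytes) {
    assert(DMA_MAX % DMA_LINE == 0);
    size_t padded = round_up_to_line(bytes);
    return (padded + DMA_MAX - 1) / DMA_MAX;
}
```

Under these assumptions a 400-byte block (100 four-byte elements) occupies 512 bytes of padded buffer and needs one command, while a 400,000-byte array needs 25 commands.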

Typical CELL Software Development Flow

• Algorithm complexity study
• Data layout/locality and Data flow analysis
• Experimental partitioning and mapping of the algorithm and program structure to the architecture
• Develop PPE Control, PPE Scalar code
• Develop PPE Control, partitioned SPE scalar code
  — Communication, synchronization, latency handling
• Transform SPE scalar code to SPE SIMD code
• Re-balance the computation / data movement
• Other optimization considerations
  — PPE SIMD, system bottleneck, load balance

Programming the cell is challenging

Issues
• Dividing program among different cores
• Creating instructions in a different language for the 8 SPEs than for the PowerPC core.
• Need to think in terms of SIMD nature of dataflow to get maximum performance from SPUs
• SPU local store needs to perform coherent DMA access for accessing system memory

Cell Software Environment

[Figure: layered diagram of the Cell software environment. Labels: Programmer Experience; End-User Experience; Development Environment; Execution Environment; Development Tools Stack; Code Dev Tools; Debug Tools; Performance Tools; Miscellaneous Tools; Samples, Workloads, Demos; SPE Management Lib, Application Libs; Linux PPC64 with Cell Extensions; Verification; Hypervisor; Hardware or System Level Simulator; Standards: Language extensions, ABI.]

Manually compiling and binding a Cell BE program

[Figure: build flow for manually compiling and binding a Cell BE program. Copyright: IBM]


Shared Memory Processor

• CBE can be explicitly programmed as a shared-memory multiprocessor using two different instruction sets
• The SPEs and the PPE can be programmed to fully inter-operate in a cache-coherent Shared-Memory Multiprocessor Model
  — Cache-coherent DMA operations for SPEs
  — DMA operations use effective address common to all PPE and SPEs
  — SPE shared-memory store instructions are replaced
    – A store from the register file to the LS
    – DMA operation from LS to shared memory
  — SPE shared-memory load instructions are replaced
    – DMA operation from shared memory to LS
    – A load from LS to register file
• Of course … a compiler could provide much of this functionality.
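The two-step replacement of shared-memory loads and stores can be sketched in portable C. This is a simulation under stated assumptions, not SPE code: `system_memory` and `local_store` are plain arrays standing in for shared memory and one SPE's local store, `dma_get` and `dma_put` are `memcpy` stand-ins for MFC transfers, and `shared_load` and `shared_store` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define MEM_WORDS 1024
#define LS_WORDS  64

static float system_memory[MEM_WORDS];  /* stand-in for shared system memory */
static float local_store[LS_WORDS];     /* stand-in for one SPE's local store */

/* memcpy stand-ins for DMA; a real SPE would issue MFC get/put commands. */
static void dma_get(float *ls_dst, const float *ea_src, size_t n) {
    memcpy(ls_dst, ea_src, n * sizeof(float));
}
static void dma_put(float *ea_dst, const float *ls_src, size_t n) {
    memcpy(ea_dst, ls_src, n * sizeof(float));
}

/* Shared-memory load: DMA from shared memory to LS, then load from LS. */
float shared_load(size_t ea_index) {
    assert(ea_index < MEM_WORDS);
    dma_get(&local_store[0], &system_memory[ea_index], 1);
    return local_store[0];
}

/* Shared-memory store: store to LS, then DMA from LS to shared memory. */
void shared_store(size_t ea_index, float value) {
    assert(ea_index < MEM_WORDS);
    local_store[0] = value;
    dma_put(&system_memory[ea_index], &local_store[0], 1);
}
```

Each access here costs one full transfer, so this sketch is the per-reference form that buffering, prefetching, and software caching aim to avoid.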

Compiling a single source file for the Cell (w/o buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

Main code:
    foo1 ();
    Runtime distribution of work: invoke foo3, for i=[0,N)
    Runtime barrier
    foo2 ();

outline: foo3(LB,UB)
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

clone: foo3_SPU (LB,UB)
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

In SPE code: A, B, and x are shared

Compiling a single source file for the Cell (w/ buffers)

Single source:
    foo1 ();
    #pragma omp parallel for
    for (i=0; i < N; i++) A[i] = x * B[i];
    foo2 ();

Main code:
    foo1 ();
    Runtime distribution of work: invoke foo3 and foo3_SPU, for i=[0,N)
    Runtime barrier
    foo2 ();

outline: foo3(LB,UB)
    for (i=LB; i < UB; i++) A[i] = x * B[i];
    Runtime barrier

clone: foo3_SPU (LB,UB)
    /** buffers A´[M], B´[M] **/
    for ( k=LB; k < UB; k+=M) {
        DMA M elements of B into B´
        for (j=0; j<M; j++) {
            A´[j] = cache_lookup(x) * B´[j];
        }
        DMA M elements of A out of A´
    }
    Runtime barrier
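The buffered SPU clone can be written out as ordinary C to make the blocking explicit. This is a sketch under assumptions rather than compiler output: `dma_in` and `dma_out` are `memcpy` stand-ins for DMA, the buffer length `M` is an arbitrary choice, `x` is passed by value instead of going through `cache_lookup`, and a short final block is handled explicitly, which the slide omits.

```c
#include <assert.h>
#include <string.h>

#define M 16  /* buffer length in elements; a hypothetical choice */

/* memcpy stand-ins for DMA between system memory and the local store. */
static void dma_in(float *ls, const float *mem, int n) {
    memcpy(ls, mem, (size_t)n * sizeof(float));
}
static void dma_out(float *mem, const float *ls, int n) {
    memcpy(mem, ls, (size_t)n * sizeof(float));
}

/* Buffered clone of the outlined loop: A[i] = x * B[i] for i in [LB, UB). */
void foo3_SPU(float *A, const float *B, float x, int LB, int UB) {
    float A_buf[M], B_buf[M];
    assert(LB <= UB);
    for (int k = LB; k < UB; k += M) {
        int n = (UB - k < M) ? (UB - k) : M;  /* the last block may be short */
        dma_in(B_buf, &B[k], n);              /* DMA n elements of B into the buffer */
        for (int j = 0; j < n; j++)
            A_buf[j] = x * B_buf[j];
        dma_out(&A[k], A_buf, n);             /* DMA n elements of A out of the buffer */
    }
}
```

Calling `foo3_SPU(A, B, 2.0f, 3, 37)` writes `A[3]` through `A[36]` in three blocks of 16, 16, and 2 elements and leaves the rest of `A` untouched.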

Data Partitioning

• Single Source assumption: all data lives in System Memory
• Naïve implementation, every load and store requires a DMA operation
  — Too costly (~700 cycles per load or store)
  — MP will require locking on every reference
• What can be done to make this acceptable?

Prefetching

• Example:

Original Code

    for(i=0;i<100000;i++)
        a[i]=b[i]+c[i];

Blocked, with prefetch

    for(i=0;i<100000;i+=100) {
        dma_get(b’,b[i],400);
        dma_get(c’,c[i],400);
        for(ii=0;ii<100;ii++)
            a’[ii]=b’[ii]+c’[ii];
        dma_put(a[i],a’,400);
    }

Software Pipelined Prefetch

    dma_get(b’,b[0],400);
    dma_get(c’,c[0],400);
    for(i=0;i<99900;i+=100) {
        dma_get(b”,b[i+100],400);
        dma_get(c”,c[i+100],400);
        for(ii=0;ii<100;ii++)
            a’[ii]=b’[ii]+c’[ii];
        dma_put(a[i],a’,400);
        swap(a’,a”);
        swap(b’,b”);
        swap(c’,c”);
    }
    /* after the final swap, the last prefetched block is in b’ and c’ */
    for(ii=0;ii<100;ii++)
        a’[ii]=b’[ii]+c’[ii];
    dma_put(a[i],a’,400);
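The software-pipelined scheme can be checked with a runnable simulation. The sketch below is portable C under stated assumptions: `dma_get` and `dma_put` are synchronous `memcpy` stand-ins, so no real overlap of transfer and computation occurs, and an index `cur` that flips between 0 and 1 plays the role of the `swap` calls. Real SPE code would issue asynchronous transfers and wait on their tag groups before touching a buffer.

```c
#include <assert.h>
#include <string.h>

#define BLK 100  /* elements per block, as in the example */

/* Synchronous memcpy stand-ins for DMA transfers. */
static void dma_get(int *ls, const int *mem, int n) {
    memcpy(ls, mem, (size_t)n * sizeof(int));
}
static void dma_put(int *mem, const int *ls, int n) {
    memcpy(mem, ls, (size_t)n * sizeof(int));
}

/* a[i] = b[i] + c[i] for i in [0, n); n must be a positive multiple of BLK. */
void add_double_buffered(int *a, const int *b, const int *c, int n) {
    int a_buf[2][BLK], b_buf[2][BLK], c_buf[2][BLK];
    int cur = 0;
    assert(n > 0 && n % BLK == 0);

    dma_get(b_buf[cur], &b[0], BLK);          /* prologue: fetch block 0 */
    dma_get(c_buf[cur], &c[0], BLK);

    for (int i = 0; i < n; i += BLK) {
        int nxt = 1 - cur;
        if (i + BLK < n) {                    /* prefetch the next block */
            dma_get(b_buf[nxt], &b[i + BLK], BLK);
            dma_get(c_buf[nxt], &c[i + BLK], BLK);
        }
        for (int ii = 0; ii < BLK; ii++)      /* compute on the current block */
            a_buf[cur][ii] = b_buf[cur][ii] + c_buf[cur][ii];
        dma_put(&a[i], a_buf[cur], BLK);
        cur = nxt;                            /* swap buffer roles */
    }
}
```

Folding the epilogue into the loop with the `i + BLK < n` test keeps the final block on the same code path as the others, which avoids picking the wrong buffer after the last swap.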

Irregular Accesses

    for(i=0;i<100000;i++)
        a[i]=b[i]+c[i]*d[f(i)];

• b and c can be prefetched, but d has an irregular access pattern, thus we cannot predict what elements of d are required
• We seem to be thrown back on the naïve implementation: d[f(i)] must be fetched on each iteration with a consequent large slowdown of the loop
• Observation: it’s as if every access to d incurred a cache miss

What do we do about this?

Software Caching

Original Code

    for(i=0;i<100000;i++)
        = … d[f(i)];

Code with explicit Cache Lookup

    for(i=0;i<100000;i++) {
        t=cache_lookup(d[f(i)]);
        = … t;
    }

    inline vector cache_lookup(addr) {
        if (cache_directory[addr&key_mask] != (addr&tag_mask))
            miss_handler(addr);
        return cache_data[addr&key_mask][addr&offset_mask];
    }

The miss handler will DMA the required data, and some suitable quantity of surrounding data.

Higher degrees of associativity can be supported, possibly for little extra cost on a SIMD processor.
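A direct-mapped software cache along these lines can be sketched in portable C. Several things here are assumptions made for illustration: the cache fronts a single array and is indexed by element index rather than by address, the directory stores the line number as the tag instead of applying masks, `memcpy` stands in for the DMA issued by the miss handler, and `LINE_WORDS`, `NUM_LINES`, `cache_init`, and `miss_count` are hypothetical names. The backing array must be a whole number of lines long because the miss handler always fetches a full line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 32          /* words per cache line (32 x 4 bytes = 128 bytes) */
#define NUM_LINES  64          /* lines in the direct-mapped cache */
#define NO_TAG     UINT32_MAX  /* marks an empty directory entry */

static uint32_t cache_directory[NUM_LINES];
static int32_t  cache_data[NUM_LINES][LINE_WORDS];
static unsigned miss_count;

/* Empty the cache and reset the miss counter. */
void cache_init(void) {
    for (int i = 0; i < NUM_LINES; i++)
        cache_directory[i] = NO_TAG;
    miss_count = 0;
}

/* Miss handler: fetch the whole line that contains the requested word. */
static void miss_handler(const int32_t *mem, uint32_t line_no) {
    uint32_t set = line_no % NUM_LINES;
    memcpy(cache_data[set], &mem[(size_t)line_no * LINE_WORDS],
           sizeof cache_data[set]);
    cache_directory[set] = line_no;
    miss_count++;
}

/* Read word 'index' of 'mem' through the software cache. */
int32_t cache_lookup(const int32_t *mem, uint32_t index) {
    uint32_t line_no = index / LINE_WORDS;
    uint32_t set = line_no % NUM_LINES;
    assert(line_no != NO_TAG);
    if (cache_directory[set] != line_no)
        miss_handler(mem, line_no);
    return cache_data[set][index % LINE_WORDS];
}
```

Two lookups that fall in the same line cost one miss between them. Two indices that are `LINE_WORDS * NUM_LINES` apart map to the same set and evict each other, which is the conflict behavior that higher associativity would reduce.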

Combining Prefetch with Software Cache

Original Code

    for(i=0;i<100000;i++)
        a[i]=b[i]+c[i]*b[f(i)];

Prefetching and Caching

    for(i=0;i<100000;i+=100) {
        dma_get(b’,b[i],400);
        dma_get(c’,c[i],400);
        for(ii=0;ii<100;ii++) {
            t=cache_lookup(b[f(i+ii)]);
            a’[ii]=b’[ii]+c’[ii]*t;
        }
        dma_put(a[i],a’,400);
    }

• We may already have b[f(i)] in local store as a result of prefetching
• In this example, the only effect is to cause unnecessary misses
• But if we substitute a[f(i)] for the last term …

Prefetching must also update the cache directory, and miss handling must not evict prefetched data.

Coherence Problem

• SPE accesses data in global memory through two mechanisms:
  — Software controlled cache
  — Static buffers
• Incorrect value may be used or generated if coherence is not maintained. Examples:
  — Two copies of data in software controlled cache and static buffer. One changes the value and the other one may read a stale value
  — Multiple copies of data in different static buffers
• Approaches:
  — Compiler: no runtime overhead
  — Runtime: more powerful but complicated

Solution Overview

• Combine two approaches for optimal solution
  — Try to apply compiler solution as much as possible
  — Resort to runtime solution if necessary
• Components
  — Local coherence simplification
  — Global coherence avoidance analysis
  — Dynamic coherence maintenance

Local Coherence Simplification

• References are put into static buffer in a loop only when there is no data dependence between the reference and any other reference accessed by software controlled cache or another static buffer in the loop.
• There is no coherence problem for this static buffer in the loop
• Runtime coherence maintenance is needed only
  — At the entry of loop: DMA read and check whether the software controlled cache has updated data
  — At the exit of loop:
    – Write-through: update the hit cache line and DMA write
    – Write-back: put the static buffer content into cache
• Pros/Cons
  — Requires local data dependence info, which may be more likely to be available
  — The structure of software controlled cache remains unchanged
  — The coherence maintenance can be overlapped with DMA operations
  — Candidates for static buffer may be lost if the data dependence information is too conservative

Global Coherence Avoidance Analysis

• Runtime coherence maintenance can be avoided by compiler analysis
  — At entry: if there is no updated cache line for this static buffer
  — At exit: if there is no cache line for this static buffer already in cache that will be referenced later
• How the compiler predicts cache contents
  — No lines in cache after flush
  — If data is carefully aligned or padded, compiler can assume different variables will never be in the same cache line
  — Cannot predict the replacement. A line will be assumed to stay in cache until flush

Optimization with Flushes

• When runtime coherence maintenance is needed by the previous analysis, it may be profitable to insert extra cache flushes to avoid the coherence maintenance
• A flush can cover one variable, or several flushes can be combined into a flush of all variables
• The previous analysis can provide information about the possible insertion points for flush
  — Move in the control flow graph to reduce the overhead
  — Similar to the algorithm of partial redundancy elimination
  — Branch profiling may help

SPU Code Partition Manager Overview

[Figure, shown over two slides: the SPU processor holds a long-resident Partition Manager and the active partition m, whose code contains a call to Partition n; main memory holds Partition 1, Partition 2, ..., Partition n.]

OVERLAY command effect: Binary View

[Figure: program binary image made up of a header followed by Partition 1, Partition 2, ..., Partition n, with labeled offsets 0x1000, 0x2000, and 0x3000.]

OVERLAY command effect: Execution View

[Figure: program memory image made up of a header and Partition 1, Partition 2, ..., Partition n, labeled with the single virtual address 0x1000.]

Call Graph Partitioning Algorithm

• Build an affinity graph based on the global call graph.
  — Each global call graph node becomes a node in the affinity graph and costs some memory
  — Each call graph edge becomes an edge in the affinity graph
• Each call graph edge is weighted.
  — Estimated through static program analysis
  — Profiling
• Apply maximum spanning tree algorithm to the affinity graph.
  — Process edges by the order of the weight
  — If merging the two nodes of the edge does not exceed the memory limitation, then merge, and so on.
• Each (merged) node left is a program partition.
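The merge step can be sketched as a greedy pass over the edges in decreasing weight order, with a union-find structure tracking which nodes have been merged. This is one plausible reading of the slide, not the authors' implementation: the names `edge_t` and `partition_call_graph` and the `MAX_NODES` bound are hypothetical, and any graph passed to the function is illustrative data.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 64

typedef struct { int u, v, weight; } edge_t;

static int parent[MAX_NODES];     /* union-find forest over call graph nodes */
static int part_size[MAX_NODES];  /* code size of the partition rooted here */

static int find_root(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* qsort comparator: heaviest edge first. */
static int by_weight_desc(const void *a, const void *b) {
    const edge_t *ea = a, *eb = b;
    return (eb->weight > ea->weight) - (eb->weight < ea->weight);
}

/* Greedily merge nodes along the heaviest edges while the merged code size
   stays within 'limit'. part_of[i] receives the representative node of the
   partition containing node i. Returns the number of partitions. */
int partition_call_graph(int n, const int *code_size, edge_t *edges, int m,
                         int limit, int *part_of) {
    assert(n > 0 && n <= MAX_NODES);
    for (int i = 0; i < n; i++) {
        parent[i] = i;
        part_size[i] = code_size[i];
    }
    qsort(edges, (size_t)m, sizeof *edges, by_weight_desc);
    for (int e = 0; e < m; e++) {
        int ru = find_root(edges[e].u);
        int rv = find_root(edges[e].v);
        if (ru != rv && part_size[ru] + part_size[rv] <= limit) {
            parent[rv] = ru;                /* merge the two partitions */
            part_size[ru] += part_size[rv];
        }
    }
    int count = 0;
    for (int i = 0; i < n; i++) {
        part_of[i] = find_root(i);
        if (part_of[i] == i)
            count++;
    }
    return count;
}
```

With six nodes of size 300, a limit of 1000, and a made-up ring of edges weighted 1000, 200, 100, 50, 50, and 10, the pass merges the first three nodes into one partition of size 900 and the last three into another. The edge of weight 100 is skipped because merging across it would exceed the limit.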

An Example of Call Graph Partitioning

Assume Memory Limitation is 1000

[Figure sequence over five slides: the initial graph has six nodes of size 300 and edges weighted 100, 1000, 200, 10, 50, and 50. Successive slides show nodes merged into sizes 600 and 900. The final slide shows two merged nodes of size 900 and the label 150.]

Optimizations

• Profiling to get accurate call edge frequencies
  — Especially with the presence of a lot of indirect calls through function pointers
• Get the accurate function code size
  — Currently estimated
  — Conservative, very rough
• Leaf function duplication
  — Some leaf functions are referenced by two non-coalescable partitions
  — May be beneficial to duplicate the function
• Double Buffering
  — Rely on good prefetching to be beneficial
  — Prefetching is a difficult problem

Performance Normalized to one SPU

• NAS and SPEC OMP benchmarks, speedups against 1 SPE
• Scalability generally very good
  — IS and equake not good due to non-parallelized loops

[Figure: speedup (y-axis, 1 to 8) on 1, 2, 4, and 8 SPUs for CG, EP, FT, IS, MG, equake, and swim.]


© Copyright International Business Machines Corporation 2006. All Rights Reserved.

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available inother countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBMofferings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained inthis document.Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions onthe capabilities of non-IBM products should be addressed to the suppliers of those products.IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give youany license to these patents. Send license inquires, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY10504-1785 USA.All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guaranteeseither expressed or implied.All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and theresults that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurationsand conditions.IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisionsworldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipmenttype and options, and may vary by country. Other restrictions may apply. 
Rates and offerings are subject to change, extension or withdrawalwithout notice.IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, pleasecheck: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.htmlAny performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and aredependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in thisdocument may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document shouldverify the applicable data for their specific environment.

Special Notices


The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both.

Rambus is a registered trademark of Rambus, Inc.

XDR and FlexIO are trademarks of Rambus, Inc.

UNIX is a registered trademark in the United States, other countries or both.

Linux is a trademark of Linus Torvalds in the United States, other countries or both.

Fedora is a trademark of Red Hat, Inc.

Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.

Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.

AMD Opteron is a trademark of Advanced Micro Devices, Inc.

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.

TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).

SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).

AltiVec is a trademark of Freescale Semiconductor, Inc.

PCI-X and PCI Express are registered trademarks of PCI SIG.

InfiniBand™ is a trademark of the InfiniBand® Trade Association.

Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006

Special Notices (Cont.) -- Trademarks


(c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States April 2005.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both: IBM, IBM Logo, Power Architecture.

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Microelectronics Division
1580 Route 52, Bldg. 504
Hopewell Junction, NY 12533-6351

The IBM home page is http://www.ibm.com
The IBM Microelectronics Division home page is http://www.chips.ibm.com