pod: a parallel-on-die architecture dong hyuk woogeorgia tech joshua b. frymanintel corp. allan d....

73
POD: A Parallel-On-Die POD: A Parallel-On-Die Architecture Architecture Dong Hyuk Woo Dong Hyuk Woo Georgia Tech Georgia Tech Joshua B. Fryman Joshua B. Fryman Intel Corp. Intel Corp. Allan D. Knies Allan D. Knies Intel Intel Berkeley Research Berkeley Research Marsha Eng Marsha Eng Intel Corp. Intel Corp. Hsien-Hsin S. Lee Hsien-Hsin S. Lee Georgia Tech Georgia Tech

Upload: britton-horton

Post on 16-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

POD: A Parallel-On-Die ArchitecturePOD: A Parallel-On-Die Architecture

Dong Hyuk WooDong Hyuk Woo Georgia TechGeorgia Tech

Joshua B. FrymanJoshua B. Fryman Intel Corp.Intel Corp.

Allan D. KniesAllan D. Knies Intel Berkeley ResearchIntel Berkeley Research

Marsha EngMarsha Eng Intel Corp.Intel Corp.

Hsien-Hsin S. LeeHsien-Hsin S. Lee Georgia TechGeorgia Tech

Page 2: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

2Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

What So New with ‘Many-Core’?

• On Chip!– New definition of scalability!

Performance Scalability

- Minimize communication

- Hide communication latency

Area ScalabilityArea Scalability

- How many cores on a die?How many cores on a die?

Power ScalabilityPower Scalability

- How much power per core?How much power per core?

Performance Scalability

- Minimize communication

- Hide communication latency

Page 3: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

3Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Technology (nm) 65 45 32 22 16

# of cores 4 8 16 32 64

Frequency (GHz) 3.0 3.0 3.0 3.0 3.0

Vdd (V) 1.3 1.1 0.9 0.8 0.7

Power / core (W) 37.5 26.8 18.0 14.2 10.9

Total power (W) 150 215 288 454 696

Scalable Power Consumption?

* The above pictures do not represent any current or future Intel products.

Page 4: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

4Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Thousand Cores, Power Consumption?

??

??

?

Page 5: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

5Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Some Known Facts on Interconnection

• One main energy hog!– Alpha 21364:

• 20% (25W / 125W)

– MIT RAW: • 40% of individual tile power

– UT TRIPS: • 25%

• Mesh-based MIMD many-core?– Unsustainable energy in its router logic– Unpredictable communication pattern

• Unfriendly to low-power techniques

Interconnection Network related

Page 6: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

POD: A Parallel-On-Die ArchitecturePOD: A Parallel-On-Die Architecture

Page 7: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

7Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Design Goals

• Broad-purpose acceleration– Scalable vector performance

• Energy and area efficiency

• Low latency inter-PE communication

• Best-in-class scalar performance

• Backward compatibility

Page 8: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

8Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Host Processor

Host Processor

Page 9: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

9Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Processing Elements (PEs)

Host Processor

Page 10: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

10Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Instruction Bus (IBus)

Host Processor

Page 11: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

11Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

ORTree: Status Monitoring

Host ProcessorFlag status

Page 12: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

12Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Data Return Buffer (DRB)

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 13: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

13Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

2D Torus P2P Links

0 7 1 6 2 5 3 4

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 14: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

14Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

2D Torus P2P Links

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Page 15: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

15Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

RRQ & MBus: DMA

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

RowResponse

Queue

MemoryBus

Page 16: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

16Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

On-chip MC / Ring

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

Sys

tem

mem

ory

xTLB

LLC

ARB

MC

MC

Arbitrator

Memory Controller

Page 17: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

17Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Wire Delay Aware Design

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

D

EFLAGS

! MemFence

IBUS

Page 18: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

18Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Simplified PE Pipelining

Local

SRAM

IBUS

MBUS

DEC / RF

80

78

NSEW

G

X

M

NSEW

96-b

it V

LIW

inst

ruct

ion

Page 19: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

19Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Physical Design Evaluation

iBTB BTB

Lhist

G

Fetch BT

LBT

ags IL1

IFQ

UC

odeR

OM

BAC

uopQDecode

Decode

Decode

Decode(Conplex)

Rename

Commit

ARF

ROBAlloc

RSRAT

FP

SIMD

INT

AGU

LQ SQ

Bypass

s/f

DL1

tagsDTLB

prefetchs/f

PF

Intel Conroe Core 2 Duo processor die photo

Page 20: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

20Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Physical Design Evaluation

Intel Conroe Core 2 Duo processor die photo

PODPE

Page 21: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

21Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Physical Design Evaluation @ 45nm

• PE– 128KB dual-ported SRAM, Basic GP ALU, simplified SSE ALU– ~3.2 mm2

• RRQ– ~3.2 mm2 (conservative estimation)

• The host processor– Core-based microarchitecture– ~25.9 mm2

• 3MB LLC– ~20 mm2

• Total: ~276.5 mm2

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

xTLB

LLC

ARB

MC

MC

Page 22: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

22Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Interconnection Fabric Among PEs

• Different links for different communication– IBus / Mbus / 2D torus P2P links / ORTree– No contention among different communication

• Fully synchronized and orchestrated by the host processor (and RRQ for MBus)– No crossbar– No contention / arbitration– No buffer space for the router– Nothing unpredictable!

Page 23: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

23Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Design Space Exploration

• Baseline Design

Page 24: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

24Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Design Space Exploration

• 3D POD with many-coreDie 0

Die 1

You live and work here

You get your beer here

Page 25: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

25Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Design Space Exploration

• 3D many-POD processorDie 0

Die 1

Page 26: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

Programming ModelProgramming Model

Page 27: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

27Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model

• Example code: Reduction

1

0

N

iiaS

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 28: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

Simulation ResultSimulation Result

Page 29: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

29Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Performance Result

4

16

64869.3GFLOPS

113.6GFLOPS

134.8GFLOPS

860.3GFLOPS

91.4GFLOPS

504.6GFLOPS

DenseMMM(SP)

FFT(SP)

IDCT(DP)

OptionPricing

(SP)

DownSampling

(SP)

K-means(SP)

Sp

ee

du

p

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 30: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

30Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Performance per mm2

Homo-geneous

many-core (ideal)

1 c

ore

4 c

ore

s1

6 c

ore

s6

4 c

ore

s

0

2

4

6

8

10

12

14

DenseMMM FFT IDCT OptionPricing

DownSampling

K-means

Pe

rfo

rma

nc

e p

er

mm

^2

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 31: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

31Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Performance per Watt

DenseMMM FFT IDCT OptionPricing

DownSampling

K-means Homo-geneous

many-core(ideal)

1 c

ore

4 c

ore

s1

6 c

ore

s6

4 c

ore

s

0

2

4

6

8

10

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

Pe

rfo

rma

nc

e p

er

Wa

tt

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 32: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

32Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Performance per Joule

DenseMMM FFT IDCT OptionPricing

DownSampling

K-means Homo-geneous

many-core(ideal)

1 c

ore

4 c

ore

s1

6 c

ore

s6

4 c

ore

s

0

100

200

300

400

500

600

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

p=0.1

25

p=0.2

50

p=0.5

00

Pe

rfo

rma

nc

e p

er

Jo

ule

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 33: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

33Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Interconnection Power

Input buffer X-bar arbiter Link

RAW 31% 30% ~0% 39%

TRIPS 35% 33% 1% 31%

Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003

Page 34: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

34Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

2D Torus P2P Link Active Time

DenseMMM FFT IDCT OptionPricing DownSampling K-means

0%

1%

2%

3%

4%

5%

6%

N E W S N E W S N E W S N E W S N E W S N E W S

Act

ive

Tim

e

1x1 POD 2x2 POD 4x4 POD 8x8 POD

Page 35: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

35Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Conclusion

• We propose POD – Parallel-On-Die many-core architecture – Broad-purpose acceleration – High energy and area efficiency– Backward compatibility

• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.

• Interconnection architecture of POD is extremely power- and area-efficient.

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

RRQ

Host Processor

DRB

DRB

DRB

DRB

DRB

DRB

DRB

DRB

xTLB

LLC

ARB

MC

MC

Page 36: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

Let’s Use Power to Let’s Use Power to ComputeCompute, , Not Commute!Not Commute!

Georgia TechGeorgia Tech

ECE MARS LabsECE MARS Labs

http://arch.ece.gatech.eduhttp://arch.ece.gatech.edu

Page 37: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

Back-up SlidesBack-up Slides

Page 38: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

38Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

PODSIM

Native

x86

machinePODSIM

PE pipelining

Memory subsystem

Accurate modeling of on-chip, off-chip bandwidth

Limitation:

Host processor overhead not modeled (I$ miss, branch misprediction)

LLC takeover of the MC ring by the host

Page 39: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

39Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Why multi-processing again?

• No more faster processor– Power wall– Increasing wire delay– Diminishing return of deeply pipelined architecture

• No more illusion of a sequential processor– Design complexity– Diminishing return of wide-issue superscalar

processors

Page 40: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

40Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Out-of-order Host Issues

• Speculative execution– No recovery mechanism in each PE– Broadcast POD instructions in a non-speculative

manner• As long as host-side code does not depend on the results

from POD, this approach does not affect the performance.

• Instruction Reordering– POD instructions are strongly ordered.

• IBits queue – Similar to the store queue

Page 41: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

41Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

x86 Modification

• 5 new instructions:– sendibits, getort, drainort, movl, getdrb,

• 3 modified instructions:– lfence, sfence, mfence

• 4 possible new co-processor registers or else mem-mapped addrs:– ibits register, ort status, drb value, xTLB support

• IBits queue

• All other modifications external to the x86 core– Arbiter for LLC priority can be external to LLC

• Deferred Design Issue: LLC-POD Coherence– v1 Study: uncachable– v2 Prototype: LLC Snoops on A/D rings

Page 42: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

42Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

POD ISA

• “Instructions” are 3-wide VLIW bundles, – One of each G,X,M (32 bits each)

• Integer Pipe (G):– and,or,not,xor,shl,shr,add,sub,mul,cmp,cmov,xfer,xferdrb : !

div

• SSE Pipe (X):– pfp{fma,add,sub,mul,div,max,min},pi{add,sub,mul,and,or,not,c

mp},shuffle,cvt,xfer : !idiv

• Memory Pipe (M):– ld,st,blkrd/wr,mask{set,pop,push},mov,xfer

• Hybrid (G+X+M):– movl

Page 43: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

43Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Power Scalability!

0

2

4

6

8

10

12

14

0 10 20 30 40 50 60

Relative Chip Power Budget

Rel

ativ

e P

erfo

rman

ce P

er J

ou

le

Page 44: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

44Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Some Known Facts on Interconnection

• MIT RAW– ~ 40% of die area

Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002

Page 45: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

45Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Some Known Facts on Interconnection

• One main energy hog!– Alpha 21364:

• 20% (25W / 125W)

– MIT RAW: • 40% of individual tile power

– UT TRIPS: • 25%

Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002Wang et al., “Orion: A Power-Performance Simulator for Interconnection Networks”, MICRO 2002Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003Peh, “Chip-Scale Power Scaling”, http://www.princeton.edu/~peh/talks/stanford_nws.pdf

Page 46: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

46Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Some Known Facts on Interconnection

• Mesh-based MIMD many-core?

– Unsustainable energy in its router logic

– Unpredictable communication pattern • Difficult to apply common low-power techniques

Bokar, “Networks for Multi-core Chip—A Controversial View”, 2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems

Page 47: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

Programming ModelProgramming Model

Page 48: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

48Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model

• Example code: Reduction

1

0

N

iiaS

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 49: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

49Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Malloc memory at the host memory spacechar* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 50: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

50Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Initialize ai

base_addr: 16

16: 017: 018: 019: 0

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 51: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

51Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Load the address of my PE ID

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 52: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

52Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Load my PE ID

r15 2 r15 2

r15 2 r15 2

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 53: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

53Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Broadcast base_addr

r15 2 r15 3

r15 0 r15 1

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 54: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

54Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Calculate the pointer to my ai

r14r15

162

r14r15

163

r14r15

160

r14r15

161

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 55: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

55Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Load my ai from the host memory space

r14r15

1618

r14r15

1619

r14r15

1616

r14r15

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 56: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

56Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Wait until it is loaded

r14r15

1618

r14r15

1619

r14r15

1616

r14r15

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 57: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

57Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Update S

r1

r14r15

3

1618

r1

r14r15

4

1619

r1

r14r15

1

1616

r1

r14r15

2

1617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 58: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

58Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• 1st iteration

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 59: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

59Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Programming Model (2x2 POD)

• Transfer my ai to my east-side neighbor

r1r2r14r15

331618

r1r2r14r15

441619

r1r2r14r15

111616

r1r2r14r15

221617

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 60: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

60Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

31618

r1r2r14r15

41619

r1r2r14r15

11616

r1r2r14r15

21617

Programming Model (2x2 POD)

• Update S

3

1

4

2

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 61: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

61Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Copy rowsum to r1

4

2

3

1

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 62: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

62Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• 1st iteration

7

3

7

3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 63: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

63Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Transfer my rowsum to my north-side neighbor

7

3

7

3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 64: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

64Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

71618

r1r2r14r15

71619

r1r2r14r15

31616

r1r2r14r15

31617

Programming Model (2x2 POD)

• Update S

7 7

3 3

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 65: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

65Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

Programming Model (2x2 POD)

• Done!

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {

*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );

$ASM add r15 = r15, r14

$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();

$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {

$ASM xfer.east r1 = r1$ASM add r2 = r2, r1

}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {

$ASM xfer.north r1 = r1$ASM add r2 = r2, r1

}

Page 66: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

66Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101618

r1r2r14r15

101619

r1r2r14r15

101616

r1r2r14r15

101617

Programming Model (2x2 POD)

• Broadcast the pointer to ‘result’

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 67: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

67Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

1012818

r1r2r14r15

1012819

r1r2r14r15

1012816

r1r2r14r15

1012817

Programming Model (2x2 POD)

• Only PE0 will write-back!

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 68: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

68Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

r1r2r14r15

101282

Programming Model (2x2 POD)

• Load PE_ID

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 69: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

69Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Compare PE_ID

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 70: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

70Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Enable PE only when equal to zero

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 71: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

71Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Write-back

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 72: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

72Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Enable ALL PEs

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’

Page 73: POD: A Parallel-On-Die Architecture Dong Hyuk WooGeorgia Tech Joshua B. FrymanIntel Corp. Allan D. KniesIntel Berkeley Research Marsha EngIntel Corp. Hsien-Hsin

73Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

r1r2r14r15

101282

r1r2r14r15

101283

r1r2r14r15

101280

r1r2r14r15

101281

Programming Model (2x2 POD)

• Wait until the result is written-back

3 3

7 7

base_addr: 16

16: 117: 218: 319: 4

int result;

POD_movl( r14, &result );

$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]

$ASM sub r15 = r15, 0$ASM pushmask.e

$ASM st sys[ r14 + 0 ] = r15$ASM popmask

POD_mfence();

128: <- addr of ‘result’