pod: a parallel-on-die architecture dong hyuk woogeorgia tech joshua b. frymanintel corp. allan d....
TRANSCRIPT
POD: A Parallel-On-Die ArchitecturePOD: A Parallel-On-Die Architecture
Dong Hyuk WooDong Hyuk Woo Georgia TechGeorgia Tech
Joshua B. FrymanJoshua B. Fryman Intel Corp.Intel Corp.
Allan D. KniesAllan D. Knies Intel Berkeley ResearchIntel Berkeley Research
Marsha EngMarsha Eng Intel Corp.Intel Corp.
Hsien-Hsin S. LeeHsien-Hsin S. Lee Georgia TechGeorgia Tech
2Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
What So New with ‘Many-Core’?
• On Chip!– New definition of scalability!
Performance Scalability
- Minimize communication
- Hide communication latency
Area ScalabilityArea Scalability
- How many cores on a die?How many cores on a die?
Power ScalabilityPower Scalability
- How much power per core?How much power per core?
Performance Scalability
- Minimize communication
- Hide communication latency
3Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Technology (nm) 65 45 32 22 16
# of cores 4 8 16 32 64
Frequency (GHz) 3.0 3.0 3.0 3.0 3.0
Vdd (V) 1.3 1.1 0.9 0.8 0.7
Power / core (W) 37.5 26.8 18.0 14.2 10.9
Total power (W) 150 215 288 454 696
Scalable Power Consumption?
* The above pictures do not represent any current or future Intel products.
4Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Thousand Cores, Power Consumption?
??
??
?
5Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Some Known Facts on Interconnection
• One main energy hog!– Alpha 21364:
• 20% (25W / 125W)
– MIT RAW: • 40% of individual tile power
– UT TRIPS: • 25%
• Mesh-based MIMD many-core?– Unsustainable energy in its router logic– Unpredictable communication pattern
• Unfriendly to low-power techniques
Interconnection Network related
POD: A Parallel-On-Die ArchitecturePOD: A Parallel-On-Die Architecture
7Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Design Goals
• Broad-purpose acceleration– Scalable vector performance
• Energy and area efficiency
• Low latency inter-PE communication
• Best-in-class scalar performance
• Backward compatibility
8Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Host Processor
Host Processor
9Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Processing Elements (PEs)
Host Processor
10Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Instruction Bus (IBus)
Host Processor
11Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
ORTree: Status Monitoring
Host ProcessorFlag status
12Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Data Return Buffer (DRB)
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
13Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2D Torus P2P Links
0 7 1 6 2 5 3 4
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
14Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2D Torus P2P Links
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
15Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
RRQ & MBus: DMA
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
RowResponse
Queue
MemoryBus
16Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
On-chip MC / Ring
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
Sys
tem
mem
ory
xTLB
LLC
ARB
MC
MC
Arbitrator
Memory Controller
17Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Wire Delay Aware Design
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
EFLAGS
! MemFence
IBUS
18Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Simplified PE Pipelining
Local
SRAM
IBUS
MBUS
DEC / RF
80
78
NSEW
G
X
M
NSEW
96-b
it V
LIW
inst
ruct
ion
19Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Physical Design Evaluation
iBTB BTB
Lhist
G
Fetch BT
LBT
ags IL1
IFQ
UC
odeR
OM
BAC
uopQDecode
Decode
Decode
Decode(Conplex)
Rename
Commit
ARF
ROBAlloc
RSRAT
FP
SIMD
INT
AGU
LQ SQ
Bypass
s/f
DL1
tagsDTLB
prefetchs/f
PF
Intel Conroe Core 2 Duo processor die photo
20Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Physical Design Evaluation
Intel Conroe Core 2 Duo processor die photo
PODPE
21Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Physical Design Evaluation @ 45nm
• PE– 128KB dual-ported SRAM, Basic GP ALU, simplified SSE ALU– ~3.2 mm2
• RRQ– ~3.2 mm2 (conservative estimation)
• The host processor– Core-based microarchitecture– ~25.9 mm2
• 3MB LLC– ~20 mm2
• Total: ~276.5 mm2
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
xTLB
LLC
ARB
MC
MC
22Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Interconnection Fabric Among PEs
• Different links for different communication– IBus / Mbus / 2D torus P2P links / ORTree– No contention among different communication
• Fully synchronized and orchestrated by the host processor (and RRQ for MBus)– No crossbar– No contention / arbitration– No buffer space for the router– Nothing unpredictable!
23Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Design Space Exploration
• Baseline Design
24Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Design Space Exploration
• 3D POD with many-coreDie 0
Die 1
You live and work here
You get your beer here
25Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Design Space Exploration
• 3D many-POD processorDie 0
Die 1
Programming ModelProgramming Model
27Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model
• Example code: Reduction
1
0
N
iiaS
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
Simulation ResultSimulation Result
29Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Performance Result
4
16
64869.3GFLOPS
113.6GFLOPS
134.8GFLOPS
860.3GFLOPS
91.4GFLOPS
504.6GFLOPS
DenseMMM(SP)
FFT(SP)
IDCT(DP)
OptionPricing
(SP)
DownSampling
(SP)
K-means(SP)
Sp
ee
du
p
1x1 POD 2x2 POD 4x4 POD 8x8 POD
30Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Performance per mm2
Homo-geneous
many-core (ideal)
1 c
ore
4 c
ore
s1
6 c
ore
s6
4 c
ore
s
0
2
4
6
8
10
12
14
DenseMMM FFT IDCT OptionPricing
DownSampling
K-means
Pe
rfo
rma
nc
e p
er
mm
^2
1x1 POD 2x2 POD 4x4 POD 8x8 POD
31Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Performance per Watt
DenseMMM FFT IDCT OptionPricing
DownSampling
K-means Homo-geneous
many-core(ideal)
1 c
ore
4 c
ore
s1
6 c
ore
s6
4 c
ore
s
0
2
4
6
8
10
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
Pe
rfo
rma
nc
e p
er
Wa
tt
1x1 POD 2x2 POD 4x4 POD 8x8 POD
32Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Performance per Joule
DenseMMM FFT IDCT OptionPricing
DownSampling
K-means Homo-geneous
many-core(ideal)
1 c
ore
4 c
ore
s1
6 c
ore
s6
4 c
ore
s
0
100
200
300
400
500
600
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
p=0.1
25
p=0.2
50
p=0.5
00
Pe
rfo
rma
nc
e p
er
Jo
ule
1x1 POD 2x2 POD 4x4 POD 8x8 POD
33Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Interconnection Power
Input buffer X-bar arbiter Link
RAW 31% 30% ~0% 39%
TRIPS 35% 33% 1% 31%
Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003
34Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
2D Torus P2P Link Active Time
DenseMMM FFT IDCT OptionPricing DownSampling K-means
0%
1%
2%
3%
4%
5%
6%
N E W S N E W S N E W S N E W S N E W S N E W S
Act
ive
Tim
e
1x1 POD 2x2 POD 4x4 POD 8x8 POD
35Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Conclusion
• We propose POD – Parallel-On-Die many-core architecture – Broad-purpose acceleration – High energy and area efficiency– Backward compatibility
• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.
• Interconnection architecture of POD is extremely power- and area-efficient.
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
RRQ
Host Processor
DRB
DRB
DRB
DRB
DRB
DRB
DRB
DRB
xTLB
LLC
ARB
MC
MC
Let’s Use Power to Let’s Use Power to ComputeCompute, , Not Commute!Not Commute!
Georgia TechGeorgia Tech
ECE MARS LabsECE MARS Labs
http://arch.ece.gatech.eduhttp://arch.ece.gatech.edu
Back-up SlidesBack-up Slides
38Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
PODSIM
Native
x86
machinePODSIM
PE pipelining
Memory subsystem
Accurate modeling of on-chip, off-chip bandwidth
Limitation:
Host processor overhead not modeled (I$ miss, branch misprediction)
LLC takeover of the MC ring by the host
39Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Why multi-processing again?
• No more faster processor– Power wall– Increasing wire delay– Diminishing return of deeply pipelined architecture
• No more illusion of a sequential processor– Design complexity– Diminishing return of wide-issue superscalar
processors
40Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Out-of-order Host Issues
• Speculative execution– No recovery mechanism in each PE– Broadcast POD instructions in a non-speculative
manner• As long as host-side code does not depend on the results
from POD, this approach does not affect the performance.
• Instruction Reordering– POD instructions are strongly ordered.
• IBits queue – Similar to the store queue
41Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
x86 Modification
• 5 new instructions:– sendibits, getort, drainort, movl, getdrb,
• 3 modified instructions:– lfence, sfence, mfence
• 4 possible new co-processor registers or else mem-mapped addrs:– ibits register, ort status, drb value, xTLB support
• IBits queue
• All other modifications external to the x86 core– Arbiter for LLC priority can be external to LLC
• Deferred Design Issue: LLC-POD Coherence– v1 Study: uncachable– v2 Prototype: LLC Snoops on A/D rings
42Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
POD ISA
• “Instructions” are 3-wide VLIW bundles, – One of each G,X,M (32 bits each)
• Integer Pipe (G):– and,or,not,xor,shl,shr,add,sub,mul,cmp,cmov,xfer,xferdrb : !
div
• SSE Pipe (X):– pfp{fma,add,sub,mul,div,max,min},pi{add,sub,mul,and,or,not,c
mp},shuffle,cvt,xfer : !idiv
• Memory Pipe (M):– ld,st,blkrd/wr,mask{set,pop,push},mov,xfer
• Hybrid (G+X+M):– movl
43Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Power Scalability!
0
2
4
6
8
10
12
14
0 10 20 30 40 50 60
Relative Chip Power Budget
Rel
ativ
e P
erfo
rman
ce P
er J
ou
le
44Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Some Known Facts on Interconnection
• MIT RAW– ~ 40% of die area
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002
45Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Some Known Facts on Interconnection
• One main energy hog!– Alpha 21364:
• 20% (25W / 125W)
– MIT RAW: • 40% of individual tile power
– UT TRIPS: • 25%
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002Wang et al., “Orion: A Power-Performance Simulator for Interconnection Networks”, MICRO 2002Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003Peh, “Chip-Scale Power Scaling”, http://www.princeton.edu/~peh/talks/stanford_nws.pdf
46Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Some Known Facts on Interconnection
• Mesh-based MIMD many-core?
– Unsustainable energy in its router logic
– Unpredictable communication pattern • Difficult to apply common low-power techniques
Bokar, “Networks for Multi-core Chip—A Controversial View”, 2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems
Programming ModelProgramming Model
48Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model
• Example code: Reduction
1
0
N
iiaS
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
49Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Malloc memory at the host memory spacechar* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
50Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Initialize ai
base_addr: 16
16: 017: 018: 019: 0
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
51Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Load the address of my PE ID
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
52Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Load my PE ID
r15 2 r15 2
r15 2 r15 2
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
53Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Broadcast base_addr
r15 2 r15 3
r15 0 r15 1
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
54Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Calculate the pointer to my ai
r14r15
162
r14r15
163
r14r15
160
r14r15
161
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
55Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Load my ai from the host memory space
r14r15
1618
r14r15
1619
r14r15
1616
r14r15
1617
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
56Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Wait until it is loaded
r14r15
1618
r14r15
1619
r14r15
1616
r14r15
1617
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
57Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Update S
r1
r14r15
3
1618
r1
r14r15
4
1619
r1
r14r15
1
1616
r1
r14r15
2
1617
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
58Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• 1st iteration
r1r2r14r15
331618
r1r2r14r15
441619
r1r2r14r15
111616
r1r2r14r15
221617
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
59Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
Programming Model (2x2 POD)
• Transfer my ai to my east-side neighbor
r1r2r14r15
331618
r1r2r14r15
441619
r1r2r14r15
111616
r1r2r14r15
221617
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
60Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
31618
r1r2r14r15
41619
r1r2r14r15
11616
r1r2r14r15
21617
Programming Model (2x2 POD)
• Update S
3
1
4
2
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
61Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
71618
r1r2r14r15
71619
r1r2r14r15
31616
r1r2r14r15
31617
Programming Model (2x2 POD)
• Copy rowsum to r1
4
2
3
1
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
62Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
71618
r1r2r14r15
71619
r1r2r14r15
31616
r1r2r14r15
31617
Programming Model (2x2 POD)
• 1st iteration
7
3
7
3
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
63Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
71618
r1r2r14r15
71619
r1r2r14r15
31616
r1r2r14r15
31617
Programming Model (2x2 POD)
• Transfer my rowsum to my north-side neighbor
7
3
7
3
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
64Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
71618
r1r2r14r15
71619
r1r2r14r15
31616
r1r2r14r15
31617
Programming Model (2x2 POD)
• Update S
7 7
3 3
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
65Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101618
r1r2r14r15
101619
r1r2r14r15
101616
r1r2r14r15
101617
Programming Model (2x2 POD)
• Done!
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
char* base_addr = (char*) malloc(npe);for ( i=0; I < npe; i++ ) {
*(base_addr+i) = i+1;}$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
$ASM ld r1 = sys [ r15 + 0 ]POD_mfence();
$ASM add r2 = r2, r1for ( i=0; i< (npeX-1); i++ ) {
$ASM xfer.east r1 = r1$ASM add r2 = r2, r1
}$ASM add r1 = r2, 0for ( i=0; i< (npeY-1); i++ ) {
$ASM xfer.north r1 = r1$ASM add r2 = r2, r1
}
66Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101618
r1r2r14r15
101619
r1r2r14r15
101616
r1r2r14r15
101617
Programming Model (2x2 POD)
• Broadcast the pointer to ‘result’
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
67Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
1012818
r1r2r14r15
1012819
r1r2r14r15
1012816
r1r2r14r15
1012817
Programming Model (2x2 POD)
• Only PE0 will write-back!
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
68Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101282
r1r2r14r15
101282
r1r2r14r15
101282
Programming Model (2x2 POD)
• Load PE_ID
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
69Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101283
r1r2r14r15
101280
r1r2r14r15
101281
Programming Model (2x2 POD)
• Compare PE_ID
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
70Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101283
r1r2r14r15
101280
r1r2r14r15
101281
Programming Model (2x2 POD)
• Enable PE only when equal to zero
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
71Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101283
r1r2r14r15
101280
r1r2r14r15
101281
Programming Model (2x2 POD)
• Write-back
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
72Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101283
r1r2r14r15
101280
r1r2r14r15
101281
Programming Model (2x2 POD)
• Enable ALL PEs
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’
73Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture
r1r2r14r15
101282
r1r2r14r15
101283
r1r2r14r15
101280
r1r2r14r15
101281
Programming Model (2x2 POD)
• Wait until the result is written-back
3 3
7 7
base_addr: 16
16: 117: 218: 319: 4
int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r15$ASM popmask
POD_mfence();
128: <- addr of ‘result’