christopher batten - csl.cornell.educbatten/pdfs/batten-xloops-pymtl-ibm2015.pdf · christopher...

72
Architectural Specialization for Inter-Iteration Loop Dependence Patterns Christopher Batten Computer Systems Laboratory School of Electrical and Computer Engineering Cornell University Spring 2015

Upload: dodien

Post on 30-Jul-2018

225 views

Category:

Documents


0 download

TRANSCRIPT

Architectural Specialization forInter-Iteration Loop Dependence Patterns

Christopher Batten

Computer Systems LaboratorySchool of Electrical and Computer Engineering

Cornell University

Spring 2015

• Research Overview • XLOOPS: Explicit Loop Specialization PyMTL: Productive Hardware Modeling

Motivating Trends in Computer Architecture

Transistors(Thousands)

Frequency(MHz)

TypicalPower (W)

MIPSR2K

IntelP4

DECAlpha 21264

Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten

1975 1980 1985 1990 1995 2000 2005 2010 2015

100

101

102

103

104

105

106

SPECintPerformance

107

Numberof Cores

Intel 48-CorePrototype

AMD 4-CoreOpteron

• Data-Parallelism via

GPGPUs and Vector

• Fine-Grain Task-Level

Parallelism

• Instruction Set

Specialization

• Subgraph

Specialization

• Application-Specific

Accelerators

• Domain-Specific

Accelerators

• Coarse-Grain

Reconfig Arrays

• Field-Programmable

Gate Arrays

Specialization

Cornell University Christopher Batten 2 / 66

• Research Overview • XLOOPS: Explicit Loop Specialization PyMTL: Productive Hardware Modeling

Performance (Tasks per Second)

En

erg

y E

ffic

ien

cy (

Ta

sks p

er

Joule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibility vs

. Spe

cializat

ion

CustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator

Cornell University Christopher Batten 3 / 66

• Research Overview • XLOOPS: Explicit Loop Specialization PyMTL: Productive Hardware Modeling

Vertically Integrated Research Methodology

Our research involves reconsidering all aspects of the computing stackincluding applications, programming frameworks, compiler optimizations,runtime systems, instruction set design, microarchitecture design, VLSI

implementation, and hardware design methodologies

CrossCompiler

FunctionalSimulator

Binary

Applications

Functional-LevelModel

Cycle-LevelSimulator

Cycle-LevelModel

Layout

Register-Transfer-Level Model

RTLSimulator

Gate-Level Model

Gate-LevelSimulator

Switching Activity

PowerAnalysis

SynthesisPlace&Route

Key Metrics: Cycle Count, Cycle Time, Area, Energy

Experimenting with full-chiplayout, FPGA prototypes, andtest chips is a key part of our

research methodology

Cornell University Christopher Batten 4 / 66

• Research Overview • XLOOPS: Explicit Loop Specialization PyMTL: Productive Hardware Modeling

Projects Within the Batten Research Group

Apps

Algos

PL

ISA

uArch

RTL

VLSI

Circuits

Tech

Compiler

GPGPU

Architecture[ISCA'13]

[MICRO'14a]

Integrated

Voltage

Regulation[MICRO'14b]

XLOOPS

Explicit Loop

Specialization[MICRO'14c]

Polymorphic

Hardware

Specialization

Accelerating

Dynamic

Prog Langs

PyMTL/Pydgin

Frameworks[MICRO'14d]

[ISPASS'15]

PyMTL/Pydgin

Frameworks

XLOOPS

Explicit Loop

Specialization

PyMTL/Pydgin

Frameworks[MICRO'14d]

[ISPASS'15][MICRO'14c]

Cornell University Christopher Batten 5 / 66

Research Overview • XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling

XLOOPS: Architectural Specialization forInter-Iteration Loop Dependence Patterns

Shreesha Srinath, Berkin Ilbeyi, Mingxing Tan, Gai Liu,Zhiru Zhang, and Christopher Batten

47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)Cambridge, UK, Dec. 2014

Cornell University Christopher Batten 6 / 66

Research Overview • XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling

Loop Dependence Pattern Specialization

Iteration0

inst0inst1inst2inst3...branch

Iteration1

inst0inst1inst2inst3...branch

inst0inst1inst2inst3...branch

Iteration2

inst0inst1inst2inst3...branch

Iteration3

inst0inst1inst2inst3...branch

Iterationn-1

Intra-IterationMicro-op Fusion,

ASIPs, CCA, BERET

Inter-IterationVector, GPU,HELIX-RC

BothDySER, C-Cores,

Qs-Cores

Key Challenge: Creating HW/SW abstractions that are flexibleand enable performance-portable execution

Cornell University Christopher Batten 7 / 66

Research Overview • XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling

Explicit Loop Specialization (XLOOPS)

Key Idea 1: Expose fine-grained parallelism by elegantly encodinginter-iteration loop dependence patterns in the ISA

Key Idea 2: Single-ISA hetereogenous architecture with a new executionparadigm supporting traditional, specialized, and adaptive execution

GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

I Traditional Execution

I Specialized Execution

I Adaptive Execution

Cornell University Christopher Batten 8 / 66

Research Overview • XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling

3. XLOOPS Microarchitecture 4. Evaluation

1. XLOOPS Instruction Setloop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

Cornell University Christopher Batten 9 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

3. XLOOPS Microarchitecture 4. Evaluation

1. XLOOPS Instruction Setloop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

Cornell University Christopher Batten 10 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set Extensions

xloop.{d}.{c} rI, rN, L

Data

Dependence

Control

Dependence

Induction

Variable

Loop

Bound

Loop

Label

XLOOP Instruction

Unordered Concurrent Fixed Bound

xloop.uc.fb r2, r3, 0x8000

Cross-Iteration Instructions

addiu.xi rX, imm

addu.xi rX, rT

Variables that can be computed as linear functions of the induction variable

Cornell University Christopher Batten 11 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set: Unordered Concurrent

Iteration 0

inst0inst1inst2inst3...xloop.uc

Iteration 1

inst0inst1inst2inst3...xloop.uc

Iteration 2

inst0inst1inst2inst3...xloop.uc

Iteration 3

inst0inst1inst2inst3...xloop.uc

loop:

lw r2, 0(rA)

lw r3, 0(rB)

mul r4, r2, r3

sw r4, 0(rC)

addiu.xi rA, 4

addiu.xi rB, 4

addiu.xi rC, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

for ( i=0; i<N; i++ )

C[i] = A[i] * B[i]

Element-wise Vector

Multiplication

Instructions in loop cannot

write live-in registers

Live-out values must be stored

to memory

Data-races are possible

Cornell University Christopher Batten 12 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set: Unordered Atomic

loop:

lw r6, 0(rA)

lw r7, 0(r6)

addiu r7, r7, 1

sw r7, 0(r6)

addiu.xi rA, 4

...

addiu r1, r1, 1

xloop.ua r1, rN, loop

for ( i=0; i<N; i++ )

B[A[i]]++; D[C[i]]++;

Histogram

Updates

Iterations execute atomically

No race conditions

Iteration 0

inst0inst1inst2inst3...xloop.ua

Iteration 1

inst0inst1inst2inst3...xloop.ua

Iteration 2

inst0inst1inst2inst3...xloop.ua

Iteration 3

inst0inst1inst2inst3...xloop.ua

Results can be non-deterministic

Inspired by Transactional Memory

Cornell University Christopher Batten 13 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set: Ordered-Through-Registers

loop:

lw r2, 0(rA)

addu rX, r2, rX

sw rX, 0(rB)

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.or r1, rN, loop

for ( i=0; i<N; i++ )

X += A[i]; B[i] = X

Parallel-Prefix

Summation

rX - Cross Iteration Register

CIRs are guranteed to have

the same value as a serial

execution

Inspired by Multiscalar

Iteration 0

inst0inst1inst2inst3...xloop.or

Iteration 1

inst0inst1inst2inst3...xloop.or

Iteration 2

inst0inst1inst2inst3...xloop.or

Iteration 3

inst0inst1inst2inst3...xloop.or

Cornell University Christopher Batten 14 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set: Ordered-Through-Memory

# r1 = rK

# r3 = rA + 4*rK

loop:

lw r4, 0(r3)

lw r5, 0(rA)

mul r6, r4, r5

sw r6, 0(r3)

addiu.xi r3, 4

addiu.xi rA, 4

addiu r1, r1, 1

xloop.om r1, rN, loop

for ( i=0; i<N; i++ )

A[i] = A[i] * A[i-k];

Updates to memory defined by

serial iteration order

No race conditions

Iteration 0

inst0inst1inst2inst3...xloop.om

Iteration 1

inst0inst1inst2inst3...xloop.om

Iteration 2

inst0inst1inst2inst3...xloop.om

Iteration 3

inst0inst1inst2inst3...xloop.om

Inspired by Multiscalar, TLS

Cornell University Christopher Batten 15 / 66

Research Overview XLOOPS : { • Instruction Set • Compiler Microarchitecture Evaluation } PyMTL

XLOOPS Instruction Set: Dynamic Bound

Iteration 0

inst0inst1inst2inst3...xloop.uc.db

Iteration 6 Iteration 7

Iteration 1

inst0inst1inst2inst3...xloop.uc.db

Iteration 2

inst0inst1inst2inst3...xloop.uc.dbIteration 3

inst0inst1inst2inst3...xloop.uc.db

Iteration 4

inst0inst1inst2inst3...xloop.uc.db

Iteration 5inst0inst1inst2inst3...xloop.uc.db

Parallelize using xloop.uc.db 0

1 2

3 4 5

6 7

for ( i=0; i<N; i++ )

...

if ( cond ) N++;

Cornell University Christopher Batten 16 / 66

Research Overview XLOOPS : { Instruction Set • Compiler • Microarchitecture Evaluation } PyMTL

3. XLOOPS Microarchitecture 4. Evaluation

1. XLOOPS Instruction Setloop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

Cornell University Christopher Batten 17 / 66

Research Overview XLOOPS : { Instruction Set • Compiler • Microarchitecture Evaluation } PyMTL

XLOOPS Compiler

Kernel implementing Floyd-Warshall shortest path algorithm

for ( int k = 0; k < n; k++ )

#pragma xloops ordered

for ( int i = 0; i < n; i++ )

#pragma xloops unordered

for ( int j = 0; j < n; j++ )

path[i][j] = min( path[i][j], path[i][k] + path[k][j] );

Cornell University Christopher Batten 18 / 66

Research Overview XLOOPS : { Instruction Set • Compiler • Microarchitecture Evaluation } PyMTL

Mid-Level Optimization Passes

XLOOPSData-Dependence

Analysis Pass

XLOOPSControl-Dependence

Analysis Pass

Code Generation

C++ Appw/ Pragmas

ModifiedLSR Pass

XLOOPSBinary

I Programmer annotations. unordered: no data-dependences. ordered: preserve data-dependences. atomic: atomic memory updates

I Loop strength reduction pass encodesMIVs as xi instructions

I XLOOPS data-dependence analysis pass. Register-dependence: analysing use-definition

chains through PHI nodes. Memory-dependence: well known

dependence analysis techniques

I Detect updates to the loop bound to encodedynamic-bound control-dependence pattern

Cornell University Christopher Batten 19 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

3. XLOOPS Microarchitecture 4. Evaluation

1. XLOOPS Instruction Setloop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

Cornell University Christopher Batten 20 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Traditional Execution

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Minimal changes to ageneral-purpose processor (GPP)

I xloop → bne

I addiu.xi→ addiu

I addu.xi → addu

Efficient traditional execution

I Enables gradual adoptionI Enables adaptive execution to

migrate an xloop instruction

Cornell University Christopher Batten 21 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Specialized Execution – xloop.uc

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

Loop Pattern Specialization Unit

I Lane Management Unit (LMU)I Four decoupled in-order lanesI Lanes contain instruction buffers

and index queuesI Lanes and the GPP arbitrate for

data-memory port andlong-latency functional unit

Specialized execution

I Scan phaseI Specialized execution phase

Cornell University Christopher Batten 22 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

lwIteration 2

Iteration 3lw

dispatchdispatch

swaddiu.xiaddiu.xi

opaddiu.xiaddiuxloop

swaddiu.xiaddiu.xi

opaddiu.xiaddiuxloop

GPP LMU Lane0 Lane1 LLFUloop: lw r2, 0(rA) lw r3, 0(rB) mul r4, r2, r3 sw r4, 0(rC) addiu.xi rA, 4 addiu.xi rB, 4 addiu.xi rC, 4 addiu r1, r1, 1 xloop.uc r1, rN, loop

opop

Tim

e

xloop

Sc

an

Ph

as

e

rename

op

lwlw

mulsw

addiu.xiaddiu.xi

opaddiu.xiaddiuxloop

op

renamerenamerenamerenamerenamerenamerenamerename

writewritewritewritewritewritewritewritewrite

op

writewritewritewritewritewritewritewritewrite

Sp

ec

ialize

d E

xe

cu

tio

n P

ha

se

lwIteration 0

dispatch

lw lwIteration 1

dispatch

mul

X

lwmul

X

Cornell University Christopher Batten 23 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Specialized Execution – xloop.or

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

CIB 8×CIB 8×CIB 8×

I Cross-iteration buffers (CIBs)forward register-dependences

I LMU control logic. Cross-iteration registers (CIRs). Last update to a CIR

I Lane control logic. Stall if CIR is not available. If last update to CIR then write to

the next CIB

Cornell University Christopher Batten 24 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

addiu.xiopaddiu.xi

addiuxloop

GPP LMU Lane0 Lane1 LLFUloop: lw r2, 0(rA) addu rX, r2, rX sw rX, 0(rB) addiu.xi rA, 4 addiu.xi rB, 4 addiu r1, r1, 1 xloop.or r1, rN, loop

opop

Tim

e

xloop

Sc

an

Ph

as

e rename

op

lwaddusw

addiu.xiopaddiu.xi

addiuxloop

op

renamerenamerenamerenamerenamerename

writewritewritewritewritewritewrite

op

writewritewritewritewritewritewrite

Sp

ec

ialize

d E

xe

cu

tio

n P

ha

se

lwIteration 0

dispatch

addusw

addiu.xiopaddiu.xi

addiuxloop

lwIteration 1

dispatch

addusw

addiu.xiopaddiu.xi

addiuxloop

addiu.xiopaddiu.xi

addiuxloop

lwIteration 2

dispatch

addu lwIteration 3

dispatch

sw addusw

Iteration 4

dispatchdispatch

Cornell University Christopher Batten 25 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Specialized Execution – xloop.om

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

CIB 8×CIB 8×CIB 8×

LSQ16×

LSQ16×

LSQ16×

I LSQ to support hardwarememory disambiguation

I LMU control logic. Track non-speculative vs.

speculative lanes. Promote lanes to be

non-speculative

I Lane control logic. Handle structural hazards. Handle dependence violations

Cornell University Christopher Batten 26 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

GPP LMU Lane0 Lane1 LLFU

opop

Tim

e

xloop

lwlw

xloopsw

. . .

rename

. . .

write

. . .

renamerename

writerename

writewritewrite

write

. . .

write

writewritewrite

loop: lw r4, 0(r3) lw r5, 0(rA) ... ... sw r6, 0(r7) addiu r1, r1, 1 xloop.om r1, rN, loop

Sc

an

Ph

as

e

Iteration 0

dispatch

Sp

ec

ialize

d E

xe

cu

tio

n P

ha

se

dispatch

Iteration 1lwlw

xloopsw

. . .

lwlw

check

Iteration 1

lwlw

xloopsw

. . .

Iteration 2

lwlw

sw

. . .

xloopdispatch

Iteration 3

X

check

dispatch

Cornell University Christopher Batten 27 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Supporting other patterns

GPR RF32 × 32b

2r2w

GPP

LLFU

D$ Request/Response Crossbar

L1 I$ 16 KB

L2 Request and Response Crossbars

L1 D$ 16 KB

SLFU

Lane3

Lane1

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane RF24 × 32b

2r2w

Inst Buf128×

Lane0

SLFU SLFU SLFU

IDQ

Lane Management Unit

IDQ IDQ

CIB 8×CIB 8×CIB 8×

LSQ16×

LSQ16×

LSQ16×

DBN

Lane Management Unit I xloop.ua – Using xloop.om

mechanisms

I xloop.orm – Combine xloop.or

and xloop.om mechanisms

I xloop.*.db

. Lanes communicate updates toloop bound

. LMU tracks maximum bound andgenerates additional work

Cornell University Christopher Batten 28 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

Adaptive Execution

GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

I Some kernels have higherperformance on LPSU (e.g.,significant inter-iteration parallelism)

I Some kernels have higherperformance on GPP (e.g., limitedinter-iteration parallelism, significantintra-iteration parallelism)

I Approach #1: Move to more complicated superscalar or out-of-orderlanes to better exploit both inter- and intra-iteration parallelism

I Approach #2: Adaptively migrate between traditional and specializedexecution to achieve best performance

Cornell University Christopher Batten 29 / 66

Research Overview XLOOPS : { Instruction Set Compiler • Microarchitecture • Evaluation } PyMTL

GPP LMU Lane0 Lane1 LLFU

Tim

e

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

GP

P P

rofi

lin

gT

rad

itio

na

l E

xe

cu

tio

nL

PS

U P

rofi

lin

gI Migrating loop oniteration boundariesis very cheap andusually only requiressending the nextiteration index

I An adaptive profilingtable in GPP recordsprofiling progress forsmall number ofrecently seen xloop

instructions

Cornell University Christopher Batten 30 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

3. XLOOPS Microarchitecture 4. Evaluation

1. XLOOPS Instruction Setloop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

Cornell University Christopher Batten 31 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

Application Kernelsxloop.uc

Color space conversionDense matrix-multiply

String search algorithmSymmetric matrix-multiplyViterbi decoding algorithm

Floyd-Warshall shortest path

xloop.or

ADPCM decoderCovriance computation

Floyd-Steinberg ditheringK-Means clustering

SHA-1 encryption kernelSymmetric matrix-multiply

xloop.om

Dynamic-programmingK-Nearest neighbors

Knapsack kernelFloyd-Warshall shortest path

xloop.orm, xloop.ua

Greedy maximal-matching2D Stencil computationBinary tree constructionHeap-sort computationHuffman entropy coding

Radix sort algorithmxloop.uc.db

Breadth-first searchQuick-sort algorithm

25 Kernels: MiBench,PolyBench, PBBS, custom

Cornell University Christopher Batten 32 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

Cycle-Level Evaluation Methodology

PyMTL

I LLVM-3.1 based compiler framework

I gem5 – in-order and out-of-order processors

I PyMTL – LPSU models

I McPAT-1.0 – 45nm energy models

Cornell University Christopher Batten 33 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

XLOOPS Cycle-Level Performance Results

rgb2

cmyk

-uc

sgem

m-u

c

ssea

rch-

uc

sym

m-u

c

vite

rbi-u

c

war

-uc

adpc

m-o

r

cova

r-or

dith

er-o

r

kmea

ns-o

r

sha-

or

sym

m-o

r

dynp

rog-

om

knn-

om

ksac

k-sm

-om

ksac

k-lg-o

m

war

-om

mm

-orm

sten

cil-o

rm

btre

e-ua

hsor

t-ua

huffm

an-u

a

rsor

t-ua

bfs-

uc-d

b

qsor

t-uc-

db

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Sp

ee

du

p N

orm

aliz

ed

to

S

imp

le R

ISC

Co

re

OOO 2-Way OOO 4-Way OOO 2-Way with LPSU

xloop.uc xloop.or xloop.{om,orm} xloop.ua xloop.*.db

I XLOOPS vs. Simple Core : Always higher performanceI XLOOPS vs. OOO 2-way : Generally competitive or higher performanceI XLOOPS vs. OOO 4-way : Mixed results

Cornell University Christopher Batten 34 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

Energy-Efficiency vs. Performance ResultsIn-order+LPSU

vs. In-order CoreOOO 2-way+LPSU

vs. OOO 2-WayOOO 4-way+LPSU

vs. OOO 4-Way

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Normalized Performance

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Norm

aliz

ed E

nerg

y E

ffic

iency

0.5 1.0 1.5 2.0 2.5 3.0

Normalized Performance0.5 1.0 1.5 2.0 2.5

Normalized Performance

I XLOOPS vs. Simple Core : Similar energy efficiency, higher powerI XLOOPS vs. OOO 2-way : Higher energy efficiency, mixed powerI XLOOPS vs. OOO 4-way : Higher energy efficiency, lower powerI Adaptive execution trades energy efficiency for performanceI Profiling and migration cause minimal performance degredation

Cornell University Christopher Batten 35 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

DCache16KB SRAM for Cache Lines

DCacheTags

ICacheTags

ICache16KB SRAM for Cache Lines

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

L0Instr

Buffer

Loop PatternSpecialization Unit

ScalarProcessor

32b IEEEFloating Point Unit

32b IntegerMul/Div Unit

VLSIImplementation

I TSMC 40 nmstandard-cell-basedimplementation

I RISC scalarprocessor with4-lane LPSU

I Supports xloop.uc

I ≈40% extra areacompared to simpleRISC processor

Cornell University Christopher Batten 36 / 66

Research Overview XLOOPS : { Instruction Set Compiler Microarchitecture • Evaluation • } PyMTL

loop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

XLOOPS Take-Away Points

I XLOOPS is an elegant new abstraction thatenables performance-portable execution of loops

I XLOOPS enables a single-ISA heterogeneousarchitecture with a new execution paradigm. Traditional Execution. Specialized Execution. Adaptive Execution

This work was supported in part by the National ScienceFoundation (NSF), the Defense Advanced Research ProjectsAgency (DARPA), and donations from Intel Corporation,Synopsys, Inc., and Xilinx, Inc.

Cornell University Christopher Batten 37 / 66

Research Overview XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling •

PyMTL: A Unified Framework for VerticallyIntegrated Computer Architecture Research

Derek Lockhart, Gary Zibrat, and Christopher Batten

47th ACM/IEEE Int’l Symp. on Microarchitecture (MICRO)Cambridge, UK, Dec. 2014

Cornell University Christopher Batten 38 / 66

Research Overview XLOOPS: Explicit Loop Specialization • PyMTL: Productive Hardware Modeling •

PyMTL OutlineOutline(

1'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

The'Computer'Architecture'Research'Methodology'Gap'

The'PerformanceFProducGvity'Gap'

PyMTL'

SimJIT'

Cornell University Christopher Batten 39 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Managing Increasing Design ComplexityManaging(Increasing(Design(Complexity(

3'/'39'

• AbstracGons'

''' • Methodologies'

''' • Pa>erns,'Languages,'Tools'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 40 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research AbstractionsComputer(Architecture(Research(Abstractions(

4'/'39''

Algorithms'

InstrucGon'Set'Architecture'

Microarchitecture'

VLSI'

Compilers'Academic'Research'

'A'Few''

Researchers'

Industry'Development'

'Hundreds'of'Engineers'

Sea'of'Transistors'

ApplicaGons'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 41 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research Methodologies

InstrucGon'Set'Architecture'

Algorithms'

Compilers'

Computer(Architecture(Research(Methodologies(

6'/'39'

Cycle'Level'

•  Behavior'•  Timing'

Microarchitecture'

FuncGonal'Level'

•  Behavior'

VLSI'

Register'Transfer'Level'

•  Behavior'•  Timing'•  Physical'Resources'Sea'of'Transistors'

ApplicaGons'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 42 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research MethodologiesComputer(Architecture(Research(Methodologies(

7'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

FuncGonal'Level'

Cycle'Level'

Register'Transfer'Level'

Algorithm'and'ISA'Development'

Design'Space'Explora8on'

Area/Energy/Timing'Valida8on'and'

Prototype'Development'

Modeling'Towards'Layout

Greater''Simula8on'Speed'

Greater''Model'Detail'

Cornell University Christopher Batten 42 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research FrameworksComputer(Architecture(Research(Frameworks(

8'/'100''

• AbstracGons'

''' • Methodologies'

''' • Pa>erns,'Languages,'Tools'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 43 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research FrameworksComputer(Architecture(Research(Frameworks(

9'/'100''

FuncGonal'Level'

Cycle'Level'

Register'Transfer'Level'

Algorithm'and'ISA'Development'

Design'Space'Explora8on'

Area/Energy/Timing'Valida8on'and'

Prototype'Development'

MATLAB/Python'Algorithm'or'C++'InstrucGon'Set'Simulator'

C++'Computer'Architecture''SimulaGon'Framework'(ObjectFOriented)'

Verilog'or'VHDL'Design'with'EDA'Toolflow'(ConcurrentFStructural)'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 43 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Computer Architecture Research FrameworksComputer(Architecture(Research(Frameworks(

9'/'100''

FuncGonal'Level'

Cycle'Level'

Register'Transfer'Level'

Algorithm'and'ISA'Development'

Design'Space'Explora8on'

Area/Energy/Timing'Valida8on'and'

Prototype'Development'

Different'languages,''pa>erns,'and'tools!

The'Computer'Architecture'Research'Methodology'Gap

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 43 / 66

Research Overview XLOOPS PyMTL : { • Methodology Gap • Framework Perf/Productivity Gap SimJIT }

Great Ideas From Prior WorkGreat(Ideas(From(Prior(Work(

10'/'100''PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

•  ConcurrentVStructural'Modeling'(Liberty,'Cascade,'SystemC)!!

•  Unified'Modeling'Languages'(SystemC)''

•  Hardware'Genera8on'Languages'(Chisel,'Genesis2,'BlueSpec,'MyHDL)''

•  HDLVIntegrated'Simula8on'Frameworks'(Cascade)!!

•  LatencyVInsensi8ve'Interfaces'(Liberty,'BlueSpec)'

Consistent'interfaces'across'abstracGons'''Unified'design'environment'for'FL,'CL,'RTL'''ProducGve'RTL'design'space'exploraGon'''ProducGve'RTL'validaGon'and'cosimulaGon'''Component'and'test'bench'reuse'

Cornell University Christopher Batten 44 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

PyMTL OutlineOutline(

11'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

The'Computer'Architecture'Research'Methodology'Gap'

The'PerformanceFProducGvity'Gap'

PyMTL'

SimJIT'

Cornell University Christopher Batten 45 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

What is PyMTL?What(is(PyMTL?(

12'/'39'

'• A'Python'DSEL'for'concurrentFstructural'hardware'modeling'• A'Python'API'for'analyzing'models'described'in'the'PyMTL'DSEL'• A'Python'tool'for'simulaGng'PyMTL'FL,'CL,'and'RTL'models'• A'Python'tool'for'translaGng'PyMTL'RTL'models'into'Verilog'• A'Python'tesGng'framework'for'model'validaGon'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

API'

SimulaGon'Tool'

TranslaGon'Tool'

Model'DSEL'

TesGng'Framework'

Cornell University Christopher Batten 46 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

What Does PyMTL Enable?What(Does(PyMTL(Enable?(

14'/'39'

•  Incremental'refinement'from'algorithm'to'accelerator'implementaGon'•  Automated'tesGng'and'integraGon'of'PyMTLFgenerated'Verilog'

FL'Model'

Test'Harness'

CL'Model'

Test'Harness'

RTL'Model'

Test'Harness'

Verilog'RTL'

Model'

Verilog'RTL'

Model'

Test'Harness'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 47 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

What Does PyMTL Enable?What(Does(PyMTL(Enable?(

15'/'39'

'•  Incremental'refinement'from'algorithm'to'accelerator'implementaGon'•  Automated'tesGng'and'integraGon'of'PyMTLFgenerated'Verilog'• MulGFlevel'coFsimulaGon'of'FL,'CL,'and'RTL'models'

FL'Model'

CL'Model'

RTL'Model'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 47 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

What Does PyMTL Enable?What(Does(PyMTL(Enable?(

16'/'39'

'•  Incremental'refinement'from'algorithm'to'accelerator'implementaGon'•  Automated'tesGng'and'integraGon'of'PyMTLFgenerated'Verilog'• MulGFlevel'coFsimulaGon'of'FL,'CL,'and'RTL'models'•  ConstrucGon'of'highlyFparameterized'RTL'chip'generators'

Verilog'RTL'

Model'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 47 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

What Does PyMTL Enable?What(Does(PyMTL(Enable?(

18'/'39'

'•  Incremental'refinement'from'algorithm'to'accelerator'implementaGon'•  Automated'tesGng'and'integraGon'of'PyMTLFgenerated'Verilog'• MulGFlevel'coFsimulaGon'of'FL,'CL,'and'RTL'models'•  ConstrucGon'of'highlyFparameterized'RTL'chip'generators'•  Embedding'within'C++'frameworks'&'integraGon'of'C++/Verilog'models'(see!Srinath!et.!al.!in!MICRO247,!Session!6B!)!

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

'gem5'

PyMTL'

C++'Model' PyMTL' Verilog'

Model'

(Used to implement CL model for XLOOPS LPSU)

Cornell University Christopher Batten 48 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

The PyMTL FrameworkThe(PyMTL(Framework(

19'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Model'

Config'

Test'&'Sim'Harness'

Verilog'

Traces'&'VCD'

User'Tool'Output'

Elaborator'

SimulaGon'Tool'

TranslaGon'Tool'

User'Tool'

Model'Instance'

EDA'Toolflow'

Specifica8on' Tools' Output'

VisualizaGon' StaGc'Analysis'

Dynamic'Checking'

FPGA'SimulaGon'

High'Level'Synthesis'

Cornell University Christopher Batten 49 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

The PyMTL DSEL: FL ModelsThe(PyMTL(DSEL(

21'/'39'

def sorter_network( input ): ! return sorted( input ) !!class SorterNetworkFL( Model ) ! def __init__( s, nbits, nports ): !! s.in_ = InPort [nports](nbits) ! s.out = OutPort[nports](nbits) !!! @s.tick_fl! def logic(): ! for i, v in enumerate( sorted( s.in_ ) ): ! s.out[i].next = v'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

f(x)'

['3,'1,'2,'0']' ['0,'1,'2,'3']'f(x)'

Cornell University Christopher Batten 50 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

The PyMTL DSEL: CL ModelsThe(PyMTL(DSEL(

22'/'39'

def sorter_network( input ): ! return sorted( input ) !!class SorterNetworkCL( Model ) ! def __init__( s, nbits, nports, delay=3 ): !! s.in_ = InPort [nports](nbits) ! s.out = OutPort[nports](nbits) ! s.pipe = Pipeline( delay ) !! @s.tick_cl! def logic(): ! s.pipe.xtick() ! s.pipe.push( sorted( s.in_ ) ) !! if s.pipe.ready(): ! for i, v in enumerate( s.pipe.pop() ): ! s.out[i].next = v !! '

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

f(x)'

['3,'1,'2,'0']' ['0,'1,'2,'3']'f(x)'

Cornell University Christopher Batten 51 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

The PyMTL DSEL: RTL ModelsThe(PyMTL(DSEL(

23'/'39'

def sorter_network( input ): ! return sorted( input ) !!class SorterNetworkRTL( Model ) ! def __init__( s, nbits ): !! s.in_ = InPort [4](nbits) ! s.out = OutPort[4](nbits) !! s.m = m = MinMaxRTL[5](nbits) !! s.connect( s.in_[0], m[0].in_[0] ) ! s.connect( s.in_[1], m[0].in_[1] ) ! s.connect( s.in_[2], m[1].in_[0] ) ! s.connect( s.in_[3], m[2].in_[1] ) !! . . . ! '

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

['3,'1,'2,'0']' ['0,'1,'2,'3']'f(x)'

Cornell University Christopher Batten 52 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

The PyMTL DSEL: RTL ModelsThe(PyMTL(DSEL(

24'/'39'

class MinMaxRTL( Model ) ! def __init__( s, nbits ): ! s.in_ = InPort [2](nbits) ! s.out = OutPort[2](nbits) ! @s.combinational! def logic(): ! swap = s.in_[0] > s.in_[1] ! s.out[0].value = s.in[1] if swap else s.in[0] ! s.out[1].value = s.in[0] if swap else s.in[1] !!class RegRTL( Model ) ! def __init__( s, nbits ): ! s.in_ = InPort (nbits) ! s.out = OutPort(nbits) ! @s.tick_rtl! def logic(): ! s.out.next = s.in_ ! '

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 52 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

PyMTL Accelerator Case Study

L1 DCache

L1 ICache

Arbitration

DotProduct

AcceleratorProcessor

Experimented with FL, CL, and RTLmodels of a pipelined processor, blockingcache, and dot-product accelerator

Rela

tive S

imula

tor

Perf

orm

an

ce

Level of Detail (FL = 1, CL = 2, RTL = 3)

1 2 3 4 5 6 7 8 9

Pure FL Model SimJIT+PyPyCPython

27 different compositions that trade-offsimulator performance vs. accuracy

Cornell University Christopher Batten 53 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap • Framework • Perf/Productivity Gap SimJIT }

Why Python?Why(Python?(

26'/'39'

Benefits:''• Modern'language'features'enable'rapid'prototyping'(dynamicFtyping,'reflecGon,'metaprogramming)'•  Lightweight,'pseudocodeFlike'syntax'• BuiltFin'support'for'integraGng'C/C++'code'•  Large,'acGve'developer'and'support'community''Drawbacks:''•  Performance'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 54 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework • Perf/Productivity Gap • SimJIT }

PyMTL OutlineOutline(

27'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

The'Computer'Architecture'Research'Methodology'Gap'

The'PerformanceFProducGvity'Gap'

PyMTL'

SimJIT'

Cornell University Christopher Batten 55 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework • Perf/Productivity Gap • SimJIT }

Performance-Productivity GapPerformanceHProductivity(Gap(

28'/'39'

Experiment:''•  Simple'8x8'Mesh'Network'Model''• CycleFPrecise'CL'Model:'•  PyMTL'Model'Simulated'with'the'CPython'Interpreter'•  HandFWri>en'C++'Model'and'Simulator'

• BitFAccurate'RTL'Model:'•  PyMTL'Model'Simulated'with'CPython'Interpreter'•  HandFWri>en'Verilog'RTL'Simulated'with'Verilator'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

Cornell University Christopher Batten 56 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework • Perf/Productivity Gap • SimJIT }

Performance-Productivity Gap

RTL�NetworkCL�Network

Simulated�CyclesSimulated�Cycles1K 10K 100K 1M10M

1x

5x10x

30x

75x150x300x

60x

200x

1000x

CPython VerilatorC++

1K10K 100K 1M

10x5x

1x

CPython

PerformanceHProductivity(Gap(

29'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

300x' 1200x'

Cornell University Christopher Batten 57 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework • Perf/Productivity Gap • SimJIT }

Performance-Productivity GapPerformanceHProductivity(Gap(

30'/'39'

Python'is'growing'in'popularity'in'many'domains'of'scienGfic'and'highFperformance'compuGng.''How'do'they'close'this'gap?'

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

•  PythonVWrapped'C/C++'Libraries'(NumPy,!CVXOPT,!NLPy,!pythonOCC,!GEM5)''

•  Numerical'JustVInVTime'Compilers'(Numba,!Parakeet)!!

•  JustVInVTime'Compiled'Interpreters'(PyPy,!Pyston)!

•  Selec8ve'Embedded'JustVInVTime'Specializa8on'(SEJITS)!

Cornell University Christopher Batten 58 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework • Perf/Productivity Gap • SimJIT }

Performance-Productivity Gap

RTL�NetworkCL�Network

Simulated�CyclesSimulated�Cycles1K 10K 100K 1M10M

1x

5x10x

30x

75x150x300x

60x

200x

1000x

CPython PyPy VerilatorC++

1K10K 100K 1M

10x5x

1x

CPython PyPy

PerformanceHProductivity(Gap(

31'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

30x'240x'

Cornell University Christopher Batten 59 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

PyMTL OutlineOutline(

32'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

The'Computer'Architecture'Research'Methodology'Gap'

The'PerformanceFProducGvity'Gap'

PyMTL'

SimJIT'

Cornell University Christopher Batten 60 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

PyMTL SimJIT-RTL ArchitecturePyMTL(SimJIT(Architecture(

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

PyMTL'RTL'Model'Instance'

TranslaGon'

Verilator'

LLVM/GCC' Wrapper'Gen'

Verilog'Source'

PyMTL'CFFI'Model'Instance'

RTL'C++'Source'

C'Interface'Source'

C'Shared'Library'

TranslaGon'Cache'

SimJITFRTL'Tool'

33'/'39'

Fairly'robust,'ready'for'use'in'research!'

Cornell University Christopher Batten 61 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

PyMTL SimJIT-CL ArchitecturePyMTL(SimJIT(Architecture(

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research' 34'/'39'

PyMTL'CL'Model'Instance'

TranslaGon' LLVM/GCC' Wrapper'Gen'

PyMTL'CFFI'Model'Instance'

CL'C++'Source'

C'Interface'Source'

C'Shared'Library'

SimJITFCL'Tool'

Just'a'prototype!'

Cornell University Christopher Batten 62 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

Performance-Productivity Gap

RTL�NetworkCL�Network

Simulated�CyclesSimulated�Cycles1K 10K 100K 1M10M

1x

5x10x

30x

75x150x300x

60x

200x

1000x

CPython PyPy VerilatorC++

1K10K 100K 1M

10x5x

1x

CPython PyPy

PerformanceHProductivity(Gap(

PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

30x'240x'

35'/'39'

Cornell University Christopher Batten 63 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

PyMTL SimJIT Performance

RTL�NetworkCL�Network

Simulated�CyclesSimulated�Cycles1K 10K 100K 1M10M

1x

5x10x

30x

75x150x300x

60x

200x

1000x

CPython PyPySimJIT-CL SimJIT-RTLSimJIT-CL+PyPy SimJIT-RTL+PyPy

VerilatorC++

1K10K 100K 1M

10x5x

1x

CPython PyPy

PyMTL(SimJIT(Performance(

36'/'39'PyMTL:'A'Unified'Framework'for'Ver8cally'Integrated'Computer'Architecture'Research'

4.5x' 6x'

Cornell University Christopher Batten 64 / 66

Research Overview XLOOPS PyMTL : { Methodology Gap Framework Perf/Productivity Gap • SimJIT • }

f(x)

f(x)

PyMTL Take-Away Points

I PyMTL is a productive Python framework for FL,CL, and RTL modeling and hardware design

I SimJIT is a strong first step towards closing theperformance-productivity gap between Pythonand C++ simulation

I An alpha version of PyMTL is open source andavailable for researchers to experiment with

https://github.com/cornell-brg/pymtl

This work was supported in part by the National ScienceFoundation (NSF), the Defense Advanced Research ProjectsAgency (DARPA), and donations from Intel Corporation andSynopsys, Inc.

Cornell University Christopher Batten 65 / 66

Research Overview XLOOPS: Explicit Loop Specialization PyMTL: Productive Hardware Modeling

f(x)

f(x)

Batten Research Group

Exploring cross-layer hardwarespecialization using a vertically

integrated researchmethodology

Performance (Tasks per Second)

En

erg

y E

ffic

ien

cy (

Ta

sks p

er

Joule

)

SimpleProcessor

Design PowerConstraint

High-PerformanceArchitectures

EmbeddedArchitectures

DesignPerformanceConstraint

Flexibility vs

. Spe

cializat

ion

CustomASIC

Less FlexibleAccelerator

More FlexibleAccelerator

loop:

lw r2, 0(rA)

lw r3, 0(rB)

...

addiu.xi rA, 4

addiu.xi rB, 4

addiu r1, r1, 1

xloop.uc r1, rN, loop

OoO GPP

L1 Data Cache

Lanes

Lane Manager

Mem XBar

#pragma xloops ordered

for(i = 0; i < N i++)

A[i] = A[i] * A[i-K];

#pragma xloops atomic

for(i = 0; i < N; i++)

B[ A[i] ]++;

D[ C[i] ]++;

Cornell University Christopher Batten 66 / 66