Mihai Budiu Microsoft Research – Silicon Valley joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University Spatial Computation Computing without General-Purpose Processors May 10, 2005


Page 1: Mihai Budiu Microsoft Research – Silicon Valley joint work with Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein Carnegie Mellon University

Mihai Budiu

Microsoft Research – Silicon Valley

joint work with

Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein

Carnegie Mellon University

Spatial Computation

Computing without General-Purpose Processors

May 10, 2005

Page 2

Outline

• Intro: Problems of current architectures

• Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

[Figure: performance on a log scale (1 to 1000) versus year, 1980 to 2000]

Page 3

Resources

• We do not worry about not having hardware resources
• We worry about being able to use hardware resources

[Intel]

Page 4

Complexity

[Figure: chip size versus designer productivity, log scale from 10^4 to 10^10, over the years 1981 to 2009; annotation: ALUs]

Cannot rely on global signals (clock is a global signal)

[Figure: gate delay 5ps, wire delay 20ps]

Page 5

Complexity

[Figure: same chart and gate/wire delay diagram as on Page 4, with the annotations below]

Cannot rely on global signals (clock is a global signal)

• Automatic translation: C → HW
• Simple, short, unidirectional interconnect
• No interpretation
• Distributed control, asynchronous
• Simple hw, mostly idle

Page 6

Our Proposal: Application-Specific Hardware

• ASH addresses these problems
• ASH is not a panacea
• ASH is “complementary” to the CPU

[Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), with a cache ($) and Memory]

Page 7

Outline

• Problems of current architectures

• CASH: Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

Page 8

Application-Specific Hardware

[Figure: toolflow] C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw

Page 9

Computation Dataflow

Program:

x = a & 7;
...
y = x >> 2;

IR: [Figure: dataflow graph; a and 7 feed the & node, whose result x feeds the >> node together with the constant 2]

Circuits: [Figure: a feeds an "& 7" stage, which feeds a ">> 2" stage]

No interpretation

Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)
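As a rough illustration of the mapping above (not the CASH compiler's actual IR), the two statements can be modeled in C as a graph with one node per operation and one pointer per def-use edge. The names `Node`, `eval`, and `slide_example` are invented for this sketch. The recursive `eval` is only a software stand-in: in the circuit each node is its own piece of hardware and nothing is interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One node per operation; each def-use edge becomes a pointer (a "wire"). */
typedef enum { OP_INPUT, OP_CONST, OP_AND, OP_SHR } Op;

typedef struct Node {
    Op op;
    const struct Node *lhs, *rhs;  /* def-use edges feeding this node */
    uint32_t value;                /* payload for OP_INPUT and OP_CONST */
} Node;

/* Evaluate a node by pulling values along its incoming edges. */
uint32_t eval(const Node *n) {
    assert(n != NULL);
    switch (n->op) {
    case OP_INPUT:
    case OP_CONST: return n->value;
    case OP_AND:   return eval(n->lhs) & eval(n->rhs);
    case OP_SHR:   return eval(n->lhs) >> eval(n->rhs);
    }
    return 0;
}

/* Build the graph for  x = a & 7;  y = x >> 2;  and return y. */
uint32_t slide_example(uint32_t a) {
    Node in_a = { OP_INPUT, NULL, NULL, a };
    Node c7   = { OP_CONST, NULL, NULL, 7 };
    Node c2   = { OP_CONST, NULL, NULL, 2 };
    Node x    = { OP_AND, &in_a, &c7, 0 };  /* edge a -> &, edge 7 -> & */
    Node y    = { OP_SHR, &x, &c2, 0 };     /* edge x -> >>, edge 2 -> >> */
    return eval(&y);
}
```

For example, `slide_example(13)` computes 13 & 7 = 5 and then 5 >> 2 = 1.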

Page 10

Basic Computation = Pipeline Stage

[Figure: an adder (+) followed by a latch, with data, valid, and ack signals]
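The data/valid/ack handshake of such a stage can be sketched in C as follows. This is a simplified software model under my own assumptions (a single-entry latch, one producer, one consumer); `AdderStage`, `stage_offer`, and `stage_take` are hypothetical names and do not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Software stand-in for one adder stage: the latch holds at most one result. */
typedef struct {
    int  data;   /* latched sum (the "data" wires) */
    bool valid;  /* true while the latch holds an unconsumed result */
} AdderStage;

/* Producer side. Returns the "ack": true if the inputs were accepted. */
bool stage_offer(AdderStage *s, int a, int b) {
    assert(s != NULL);
    if (s->valid)
        return false;      /* latch still full: no ack, producer keeps its data */
    s->data = a + b;
    s->valid = true;
    return true;
}

/* Consumer side. Returns true and writes *out if a result was available. */
bool stage_take(AdderStage *s, int *out) {
    assert(s != NULL && out != NULL);
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;      /* frees the latch, so the next offer is acked */
    return true;
}
```

A second `stage_offer` before the result is taken returns false, which is the model's version of a withheld ack.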

Page 11

Asynchronous Computation

[Figure: animation in 8 numbered steps of the data, valid, and ack handshake between an adder (+) stage and its latch]

Page 12

Distributed Control Logic

[Figure: + and - stages connected by rdy and ack signals; labels: global FSM; short, local wires]

Page 13

MUX: Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: dataflow graph with inputs x, b, and 0; the nodes *, -, and > feed a multiplexer that produces y, selected by the predicate and its negation (!); the critical path is marked]

Conditionals ⇒ Speculation

SSA = no arbitration
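A minimal C sketch of what speculation means for this conditional: both sides are computed unconditionally and the predicate only selects between the results. The function names are hypothetical. This is safe here only because both sides are free of side effects, and the speculative version can overflow for extreme inputs that the branching version would never evaluate.

```c
/* Branching version, as written on the slide. */
int y_branch(int x, int b) {
    int y;
    if (x > 0) y = -x;
    else       y = b * x;
    return y;
}

/* Speculative version: both sides are always computed, the way the
   dataflow circuit evaluates them in parallel; the predicate only selects. */
int y_mux(int x, int b) {
    int p        = x > 0;   /* predicate */
    int then_val = -x;      /* computed even when p is false */
    int else_val = b * x;   /* computed even when p is true */
    return p ? then_val : else_val;
}
```

The two functions agree on every input for which neither side overflows.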

Page 14

Control Flow ⇒ Data Flow

[Figure: node types: Merge (label), with data inputs; Gateway, with data and predicate inputs; Split (branch), driven by predicate p and its negation (!)]
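One way to read these node types is as operations on optional tokens. The sketch below is an interpretation, not the deck's definition: `Token`, `gateway`, `merge`, and `y_steered` are invented names, and a split is modeled as two gateways driven by p and !p.

```c
#include <stdbool.h>

/* A value that may or may not be present on a wire. */
typedef struct { bool present; int value; } Token;

static Token no_token(void) { Token t = { false, 0 }; return t; }

/* Gateway: forwards the data token only when the predicate is true. */
Token gateway(Token data, bool predicate) {
    return (data.present && predicate) ? data : no_token();
}

/* Merge: forwards whichever input carries a token (at most one should). */
Token merge(Token a, Token b) {
    return a.present ? a : b;
}

/* if (x > 0) y = -x; else y = b*x;  built from two gateways and a merge. */
int y_steered(int x, int b) {
    Token in = { true, x };
    bool p = x > 0;
    Token t = gateway(in, p);    /* this side sees x only if p holds */
    Token e = gateway(in, !p);   /* this side sees x only if p does not hold */
    if (t.present) t.value = -t.value;
    if (e.present) e.value = b * e.value;
    return merge(t, e).value;
}
```

Unlike a multiplexer that selects between two computed results, only the side that receives a token does any work here.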

Page 15

Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: dataflow graph of the loop; i starts at 0 and passes through the +1 and < 100 nodes; i*i is added to sum, which starts at 0; the negated predicate (!) leads to ret]
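To make the two loop-carried values explicit, the loop can be rewritten in C so that each statement corresponds to one node of the graph. This is a hand-written sketch of that correspondence, not compiler output, and the function names are made up.

```c
/* The loop exactly as on the slide. */
int sum_of_squares_c(void) {
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;
}

/* The same computation with one statement per dataflow node. */
int sum_of_squares_dataflow(void) {
    int i = 0;     /* initial token entering i's loop */
    int sum = 0;   /* initial token entering sum's loop */
    for (;;) {
        int pred = i < 100;   /* comparison node */
        if (!pred)            /* negated predicate releases sum to ret */
            return sum;
        int sq = i * i;       /* multiplier node */
        sum = sum + sq;       /* adder; result flows back along sum's loop */
        i = i + 1;            /* +1 node; result flows back along i's loop */
    }
}
```

Both functions return 328350, the sum of the squares of 0 through 99.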

Page 16

Pipelining

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure, step 1: dataflow graph of the loop with nodes i, + (constant 1), <= (constant 100), *, +, and sum; the * node is a pipelined multiplier (8 stages)]

Page 17

Pipelining

[Figure, step 2: next animation frame of the loop graph from Page 16]

Page 18

Pipelining

[Figure, step 3: next animation frame of the same graph]

Page 19

Pipelining

[Figure, step 4: next animation frame of the same graph]

Page 20

Pipelining

[Figure, step 5: same graph; the labels i=1 and i=0 appear at the multiplier]

Page 21

Pipelining

[Figure, step 6: same graph; the labels i=1 and i=0 appear at the multiplier]

Page 22

Pipelining

[Figure, step 7: same graph; labels: i’s loop, sum’s loop, long-latency pipe, predicate]

Page 23

Pipelining

Predicate ack edge is on the critical path.

[Figure: same graph; labels: critical path, i’s loop, sum’s loop]

Page 24

Pipeline balancing

[Figure, step 7: same loop graph with a decoupling FIFO added; labels: i’s loop, sum’s loop]

Page 25

Pipeline balancing

[Figure: same graph with the decoupling FIFO; labels: i’s loop, sum’s loop, critical path]
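The effect of the decoupling FIFO can be illustrated with a small timing model in C. Everything in it is an assumption made for this sketch: i's loop issues at most one iteration per time step, the predicate channel holds `fifo_depth` tokens, and sum's loop consumes a predicate only when the matching product leaves the multiplier, `mul_latency` steps after issue. The function name and the model itself are not from the slides.

```c
#include <assert.h>

#define MAX_ITERS 1024

/* Returns the time step at which the last iteration is consumed by sum's loop. */
long finish_time(int iters, int mul_latency, int fifo_depth) {
    static long issue[MAX_ITERS], consume[MAX_ITERS];
    assert(iters > 0 && iters <= MAX_ITERS && fifo_depth >= 1);
    for (int k = 0; k < iters; k++) {
        /* i's loop can issue one step after the previous iteration ... */
        long t = (k == 0) ? 0 : issue[k - 1] + 1;
        /* ... but only once the predicate channel has a free slot. */
        if (k >= fifo_depth && consume[k - fifo_depth] > t)
            t = consume[k - fifo_depth];
        issue[k] = t;
        /* sum's loop consumes when the product arrives, one per step at most. */
        long c = issue[k] + mul_latency;
        if (k > 0 && consume[k - 1] + 1 > c)
            c = consume[k - 1] + 1;
        consume[k] = c;
    }
    return consume[iters - 1];
}
```

In this model, 100 iterations with an 8-step multiplier finish at step 800 when the predicate channel holds one token, and at step 107 when it holds eight, because i's loop no longer waits for each predicate to be acknowledged.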

Page 26

Procedures

[Figure: Caller and Callee; labels: Call, Argument, Return, Continuation]

Page 27

Memory Access

[Figure: LD, ST, and LD operations reach a monolithic memory through a pipelined, arbitrated network; labels: local communication, global structures]

Future work: fragment this!

Page 28

Outline

• Problems of current architectures

• Compiling ASH

• ASH Evaluation

• Conclusions

Page 29

Evaluating ASH

[Figure: toolflow] C (Mediabench kernels, 1 hot function/benchmark) → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools) → ASIC (180nm std. cell library, 2V; ~1999 technology)

[Figure labels: ModelSim (Verilog simulation) → performance numbers; Mem]

Page 30

Compile Time

[Figure: toolflow] C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC; Mem

[Figure labels along the flow, in extraction order: 20 seconds, 10 seconds, 20 minutes, 1 hour; input size: 200 lines]

Page 31

ASH Area (mm²)

[Figure: bar chart of area in sq mm (axis 0 to 4.5) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; legend: Memory access, Circuit; reference labels: P4: 217, minimal RISC core]

Page 32

ASH vs 600MHz CPU [4-wide OOO, 0.18 µm]

[Figure: bar chart, "Times faster" (axis 0.00 to 2.50), for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and average. Data labels in extraction order: 2.40, 1.37, 1.79, 1.98, 0.74, 1.65, 0.56, 1.34, 0.80, 1.06, 0.43, 0.44, 1.05]

Page 33

Bottleneck: Memory Protocol

[Figure: LD and ST operations, the LSQ, and Memory]

• Enabling dependent operations requires a round-trip to memory.
• Exploring novel memory access protocols.

Page 34

Power (mW)

[Figure: bar chart of power in mW (axis 0 to 70) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and average. Numeric labels in extraction order: 70, 29, 10, 10, 19, 38, 46, 26, 23, 30, 22, 22, 25. Reference values: DSP 110, µP 4000, Xeon [+cache] 67000]

Page 35

Energy-delay

[Figure: bar chart, "Times better than superscalar" (log axis, 1 to 10000), for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and average. Data labels in extraction order: 363, 285, 1524, 1788, 147, 389, 36, 437, 171, 174, 48, 50, 227]
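For reference, the energy-delay product is energy multiplied by run time, which equals average power multiplied by run time squared. The C helpers below sketch that arithmetic and the "times better" ratio. The function names are hypothetical, and any inputs passed to them are the caller's own numbers, not values read from the chart.

```c
/* Energy in joules from average power (watts) and run time (seconds). */
double energy_j(double power_w, double time_s) {
    return power_w * time_s;
}

/* Energy-delay product, E x t = P x t^2. Lower is better. */
double energy_delay(double power_w, double time_s) {
    return energy_j(power_w, time_s) * time_s;
}

/* How many times better design A is than design B (values above 1 favor A). */
double times_better(double p_a, double t_a, double p_b, double t_b) {
    return energy_delay(p_b, t_b) / energy_delay(p_a, t_a);
}
```

Because run time enters squared, a design that draws the same power and runs twice as fast comes out four times better on this metric.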

Page 36

Energy Efficiency (op/nJ)

[Figure: bar chart of Operations/nJ, non-speculative arithmetic (axis 0 to 160), for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e. Data labels in extraction order: 57, 66, 143, 143, 52, 51, 39, 62, 55, 40, 28, 28]

Page 37

Energy Efficiency

[Figure: energy efficiency in Operations/nJ on a log scale from 0.01 to 1000, with ranges for Microprocessors, Asynchronous µP, General-purpose DSP, FPGA, Dedicated hardware, and ASH media kernels; annotation: 1000x]

Page 38

Outline

Problems of current architectures

+ Compiling ASH

+ Evaluation

= Related work, Conclusions

Page 39

Bibliography

• Dataflow: A Complement to Superscalar. Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein. ISPASS 2005

• Spatial Computation. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. ASPLOS 2004

• C to Asynchronous Dataflow Circuits: An End-to-End Toolflow. Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein. IWLS 2004

• Optimizing Memory Accesses For Spatial Computation. Mihai Budiu and Seth Copen Goldstein. CGO 2003

• Compiling Application-Specific Hardware. Mihai Budiu and Seth Copen Goldstein. FPL 2002

Page 40

Related Work

• Optimizing compilers

• High-level synthesis

• Reconfigurable computing

• Dataflow machines

• Asynchronous circuits

• Spatial computation

We target an extreme point in the design space: no interpretation, fully distributed computation and control

Page 41

ASH Design Point

• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation (spatial computation)
– Replicate computation to simplify wires
• Energy/op rivals custom ASIC
• Performance rivals superscalar
• E×t 100 times better than any processor

Page 42

Conclusions

Spatial computation strengths

Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity

Page 43

Backup Slides

• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs

Page 44

Absolute Performance

[Figure: bar chart of Millions of Operations per Second (axis 0 to 6000) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; legend: MOPSall, MOPSspec, MOPS; one label reads 12300; a marker shows the CPU range]

Page 45

Pipeline Stage

[Figure: circuit of one stage; signals: rdyin, ackout, rdyout, ackin, datain, dataout; components labeled Reg, C, and =]

Page 46

Exceptions

• Strictly speaking, C has no exceptions

• In practice, it is hard to accommodate exceptions in hardware implementations

• An advantage of software flexibility: the PC is a single point of execution control

[Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation), with Memory; label: $$$]

Page 47

Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: dataflow graph of the conditional, with inputs x, b, and 0, nodes *, -, >, and !, and output y]

Page 48

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: dataflow graph of the conditional, with inputs x, b, and 0, nodes *, -, >, and !, and output y]

Solves the problem of unbalanced paths

Page 49

Normalized Area

[Figure: chart with two vertical axes, Source Lines/sq mm (0 to 250) and Object code KBytes/sq mm (0 to 6), for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, and average; legend: Lines/sq mm, KBytes/sq mm]

Page 50

ASH Weaknesses

• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient

Page 51

Branch Prediction

for (i=0; i < N; i++) {
    ...
    if (exception) break;
}

[Figure: dataflow graph with nodes i, + (constant 1), <, &, !, and exception; the ASH crit path and the CPU crit path are marked]

Annotations:

• Predicted not taken: effectively a noop for CPU!
• Predicted taken.
• Result available before inputs

Page 52

Memory Partitioning

• MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00

• Stanford SpC: Semeria DAC ’01, TVLSI ’02

• Illinois FlexRAM: Fraguela PPoPP ’03

• Hand-annotations #pragma

Page 53

Recursion

[Figure: a recursive call, with "save live values" and "restore live values" steps connected to a stack]

Page 54

Leakage Power

$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$

• Employ circuit-level techniques

• Cut power supply of idle circuit portions
– most of the circuit is idle most of the time
– strong locality of activity

Page 55

Why Not Compare To…

• In-order processor
– Worse in all metrics than superscalar, except power
– We beat it in all metrics, including performance

• DSP
– We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)

• ASIC
– No available tool-flow supports C to the same degree

• Asynchronous ASIC
– We compared with a Balsa synthesis system
– We are 15 times better in E×t compared to the resulting ASIC

• Async processor
– We are 350 times better in E×t than Amulet (scaled to .18)

Page 56

Why not target FPGA

• Do not support asynchronous circuits

• Very inefficient in area, power, delay

• Too fine-grained for datapath circuits

• We are designing an async FPGA