

Computing Without Processors

Thesis Proposal

Mihai Budiu

July 30, 2001

This presentation uses TeXPoint by George Necula

Thesis Committee:

Seth Goldstein, chair

Todd Mowry

Peter Lee

Babak Falsafi, ECE

Nevin Heintze, Agere Systems

2

Four Types of Research

• Solve nonexistent problems

• Solve past problems

• Solve current problems

• Solve future problems

3

The Law

(source: Intel)

4

The Crossover Phenomenon

time

technology

5

Example Crossover

[Plot: access speed (ns) versus time for DRAM and CPU; labels: 1980, "no caches", "caches"; axis tick at 200.]

Trouble Ahead for Microarchitecture

7

Signal Propagation

[Plot: die size and distance in 1 clock, both in mm, versus time; axis tick at 20; "now" marked on the time axis.]

8

Reliability & Yield

[Plot: defects/chip versus time; curves "tolerable" and "occurring"; labels "new process" and "now".]

9

Energy

[Plot: power versus time; curves "CPU consumption" and "thermal dissipation"; axis tick at 100 W; "now" marked on the time axis.]

10

Instruction-Level Parallelism (ILP)

[Plot: instructions versus time; curves "fetch" and "commit"; "now" marked on the time axis.]

11

Premises of this Research

• We will have lots of gates
  – Moore’s law continues
  – Nanotechnology

• Contemporary architectures do not scale

12

Outline

• Motivation

• ASH: Application-Specific Hardware

• The spatial model of computation

• CASH: Compiling for ASH

• Evolutionary path

• Conclusions

• Future work

13

ASH Application-Specific Hardware

Reconfigurable hardware

HLL program

Compiler

Circuit

14

ASH: A Scalable Architecture

-- Thesis Statement --

Application-specific hardware on a reconfigurable-hardware substrate is a solution for the smooth evolution of computer architecture.

We can provide scalable compilers for translating high-level languages into hardware.

15

Example

int f(void)
{
    int i = 0, j = 0;

    for (; i < 10; i++)
        j += i;

    return j;
}

16

Outline

• Motivation

• ASH: Application-Specific Hardware

• The spatial model of computation

• CASH: Compiling for ASH

• Evolutionary path

• Conclusions

• Future work

17

ASH and Nanotechnology

• Build reconfigurable hardware using nanotechnology

• Low power: 10^10 gates use less than 2 W

• Low cost: nanocents/gate

• High density: 10^5 x over CMOS

Huge structures

Nano-RAM cell

In yellow: a CMOS RAM cell.

18

A Limit Study of Performance

A graph of the whole program execution:

Memory word

Basic block

Memory write

Memory read

Control-flow transfer

19

Typical Program Graph (g721_e)

Control flow transfer

100% memory cluster

Memory reads

100% code cluster

memcpy

20

Program Graph After Inlining memcpy

memcpy

21

Application Slowdown

[Bar chart: application slowdown, in times slower than native (axis from -1 to 11), for two configurations: 1 clock/square and 5 clocks/square.]

22

How Time Is Spent

[Stacked bar chart: percent of time (0% to 100%) spent idle, in execution, in control flow, and in register traffic, for the benchmarks 099.go, 129.compress, 130.li, 132.ijpeg, adpcm_d, adpcm_e, epic_e, g721_Q_d, g721_Q_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d.]

No caches: reads expensive

No speculation

23

Lesson

The spatial model of computation has different properties.

24

Outline

• Motivation

• ASH: Application-Specific Hardware

• The spatial model of computation

• CASH: Compiling for ASH

• Evolutionary path

• Future work

25

CASH: Compiling for ASH

Memory partitioning

Interconnection net

Program to circuits

26

Compilation

1. Program

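unsigned reverse(unsigned x)   /* unsigned avoids shifting into the sign bit */
{
    unsigned k, r = 0;

    for (k = 0; k < 32; k++) {
        r = r << 1;    /* make room for the next bit */
        r |= x & 1;    /* append the low bit of x */
        x = x >> 1;
    }
    return r;
}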

2. Split-phase Abstract Machines

Computations & local storage

Unknown latency ops.

3. Configurations placed independently

4. Placement on chip

Reliability

27

Split-phase Abstract Machines

SAM 1

SAM 2

SAM 3

CFG

Power

28

Hyperblock => SAM

• Single-entry, multiple exit

• May contain loops

29

SAM => FSM

Start Loop

Exit

Exit

Remote memory

Local memory

30

Implementing SAMs

- interesting details -

31

The SAM FSM

Computation

Predicates (control)

Combinational logic

start exit

Register

args results

32

Computation = Dataflow

• Variables => wires + tokens

• No token store; no token matching

• Local communication only

Signals

x = a & 7;...

y = x >> 2;

Programs

&

a 7

>>

2

x

Circuits
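A minimal C sketch of the "variables => wires + tokens" idea, applied to the two statements on this slide. The `wire` type and the operator function names are illustrative assumptions and are not taken from CASH. Each operator fires only when its input wire carries a token, and the result goes straight to the next operator.

```c
#include <stdbool.h>

/* A wire carries a value plus a token bit saying the value is present.
   This type is hypothetical, introduced only for illustration. */
typedef struct {
    int  value;
    bool valid;
} wire;

/* AND with a constant: fires only when the input token has arrived. */
wire op_and_const(wire a, int k)
{
    wire out = { 0, false };
    if (a.valid) {
        out.value = a.value & k;
        out.valid = true;
    }
    return out;
}

/* Right shift by a constant: same firing rule. */
wire op_shr_const(wire a, int k)
{
    wire out = { 0, false };
    if (a.valid) {
        out.value = a.value >> k;
        out.valid = true;
    }
    return out;
}

/* x = a & 7; y = x >> 2;
   The wire x connects the two operators directly, so there is
   no token store and no token matching. */
wire circuit_y(wire a)
{
    wire x = op_and_const(a, 7);
    return op_shr_const(x, 2);
}
```

With a = 29 the sketch yields x = 5 and y = 1. If a carries no token, neither operator fires and y stays invalid.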

33

Tokens & Synchronization

• Tokens signal operation completion

• Possible implementations:
  – Local: data, valid, ack
  – Global: data, valid, reset
  – Static: data, valid
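The "local" variant (data, valid, ack) can be sketched in C as a one-place channel between a producer and a consumer operator. This is a software approximation under stated assumptions: the `channel` type and both function names are hypothetical, and real hardware would use signal transitions rather than function calls.

```c
#include <stdbool.h>

/* One point-to-point connection between two operators (hypothetical type). */
typedef struct {
    int  data;
    bool valid;  /* set by the producer: data holds a fresh token */
    bool ack;    /* set by the consumer: the token has been taken */
} channel;

/* Producer side: offer a token only if the previous one was consumed.
   Returns true when the token was placed on the channel. */
bool channel_send(channel *c, int value)
{
    if (c->valid)
        return false;      /* previous token still pending */
    c->data  = value;
    c->valid = true;
    c->ack   = false;
    return true;
}

/* Consumer side: take the token if one is present and acknowledge it.
   Returns true when *out was written. */
bool channel_receive(channel *c, int *out)
{
    if (!c->valid)
        return false;      /* nothing to consume yet */
    *out     = c->data;
    c->valid = false;
    c->ack   = true;
    return true;
}
```

The producer cannot overwrite an unconsumed token, which is how the sketch models operations with unknown latency: each side waits only on its neighbor.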

34

Speculation

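if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: dataflow circuit computing y from x, b and 0. Both arms (-x and b*x) are computed, and the comparison x > 0 and its negation (!) select between them; the multiplier is marked "slow". The two halves are labeled "Computation" and "Predicates".]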

and Eager Muxes

Static-Single Assignment implemented in hardware

ILP
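A hedged C sketch of the example on this slide. It assumes that "eager" means the mux may produce its output as soon as the selected input has arrived, without waiting for the other input; that reading, the `wire` type, and the `mul_done` flag (which stands in for the slow multiplier having finished) are assumptions made for illustration.

```c
#include <stdbool.h>

/* Value plus a token bit (hypothetical type). */
typedef struct {
    int  value;
    bool valid;
} wire;

/* Two-input eager mux: fires as soon as one predicate is known to be true
   and the matching data input is present, ignoring the other data input. */
wire eager_mux2(wire p_then, wire d_then, wire p_else, wire d_else)
{
    wire none = { 0, false };
    if (p_then.valid && p_then.value && d_then.valid)
        return d_then;
    if (p_else.valid && p_else.value && d_else.valid)
        return d_else;
    return none;
}

/* if (x > 0) y = -x; else y = b*x;
   Both arms are computed speculatively; only the predicates decide
   which result is used. */
wire speculative_y(wire x, wire b, bool mul_done)
{
    wire none = { 0, false };
    if (!x.valid)
        return none;

    wire neg  = { -x.value, true };      /* fast arm, always ready */
    wire prod = { 0, false };            /* slow arm */
    if (mul_done && b.valid) {
        prod.value = b.value * x.value;
        prod.valid = true;
    }

    wire p_then = { x.value > 0, true };
    wire p_else = { !(x.value > 0), true };
    return eager_mux2(p_then, neg, p_else, prod);
}
```

For x = 3 the result -3 is available even while the multiplier is still busy. For x = -2 the output stays invalid until the product arrives.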

35

Predicates

*q = 2;

• Guard side-effects
  – Memory access
  – Procedure calls

• Control looping

• Decide exit branch

• Select variable definition x=... x=...

...=x
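A small C sketch of the first role listed above, guarding a side effect. Speculatively computed values can be discarded, but a store such as `*q = 2` cannot be undone, so it runs only when its predicate is true. The function names and the condition `x > 0` are assumptions chosen for illustration.

```c
#include <stdbool.h>

/* The store happens only when the guarding predicate is true. */
void guarded_store(bool pred, int *q, int value)
{
    if (pred)
        *q = value;
}

/* Hypothetical source fragment: if (x > 0) *q = 2;
   The predicate is computed as ordinary data and then gates the store. */
void store_if_positive(int x, int *q)
{
    bool pred = x > 0;
    guarded_store(pred, q, 2);
}
```

Calling `store_if_positive(-1, &m)` leaves `m` unchanged, while `store_if_positive(1, &m)` writes 2.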

36

Computing Predicates

• Correct for irreducible graphs

• Correct even when speculatively computed

• Can be eagerly computed

s t

b

37

Loops + Dataflow

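for (i = 0; i < 10; i++)
    a[i] += i;

[Diagram: dataflow circuit for the loop. An adder increments i by 1 starting from 0, a second adder forms the address from &a[0], and a load, +, store chain updates the element; tokens for a[0], a[1], a[2] and a[3] are shown along the circuit.]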

= Pipelining

38

Outline

• Motivation

• ASH: Application-Specific Hardware

• The spatial model of computation

• CASH: Compiling for ASH

• Evolutionary path

• Conclusions

• Future work

39

Evolutionary Path

Microprocessors ASH

The problem with ASH: Resources

40

Virtualization

41

CPU+ASH

core computation

support computation + OS + VM

CPU ASH

Memory

42

Outline

• Motivation

• ASH: Application-Specific Hardware

• The spatial model of computation

• CASH: Compiling for ASH

• Evolutionary path

• Conclusions

• Future work

43

ASH Benefits

Problem        Solution
Reliability    Configuration around defects
Power          Only “useful” gates switching
Signals        Localized computation
ILP            Statically extracted

44

Scalable Performance

performance

CPU

ASH

time

now

45

Summary

• Contemporary CPU architecture faces lots of problems

• Application-Specific Hardware (ASH) provides a scalable technology

• Compiling HLL into hardware dataflow machines is an effective solution

46

Timeline

[Gantt chart spanning 06/01 to 12/02, with marks at 09/01, 12/01, 04/02, 06/02 and 09/02, and "now" marked. Tasks shown:]

• CASH core

• Write thesis

• Hw/sw partitioning (ASH + CPU)

• Cost models

• ASH simulation

• Loop parallelization

• Explore architectural/compiler trade-offs

• Memory partitioning

47

Extras

• Related work

• Reconfigurable hardware

• Other cross-over phenomena

• A CPU + ASH study

• More about predicates

48

Related Work

• Hardware synthesis from HLL

• Reconfigurable hardware

• Predicated execution

• Dataflow machines

• Speculative execution

• Predicated SSA


49

Reconfigurable Hardware

Universal gates

and/or

storage elements

Interconnection network

Programmable Switches


50

Main RH Ingredient: RAM Cell

Switch controlled by a 1-bit RAM cell

[Diagram labels: data in, control, 0]

Universal gate = RAM

[Diagram labels: address inputs a0, a1; contents 0001; data output a1 & a2]
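The "universal gate = RAM" idea can be sketched in C as a two-input lookup table: the inputs form an address into four stored bits, and the stored bit is the gate's output. Changing the contents changes the gate. The `lut2` type, the bit ordering and the constant names are assumptions made for this sketch.

```c
#include <stdint.h>

/* A 2-input lookup table: four RAM bits, where bit i holds the output
   for address i = (a1 << 1) | a0. Hypothetical type for illustration. */
typedef struct {
    uint8_t bits;
} lut2;

/* Evaluate the gate by reading the RAM bit selected by the inputs. */
int lut2_eval(lut2 g, int a0, int a1)
{
    unsigned addr = ((unsigned)(a1 & 1) << 1) | (unsigned)(a0 & 1);
    return (g.bits >> addr) & 1;
}

/* Contents listed for addresses 00, 01, 10, 11. */
const lut2 LUT_AND = { 0x8 };  /* 0, 0, 0, 1 */
const lut2 LUT_OR  = { 0xE };  /* 0, 1, 1, 1 */
const lut2 LUT_XOR = { 0x6 };  /* 0, 1, 1, 0 */
```

The AND table holds the same 0001 pattern that appears on the slide: only the address with both inputs set returns 1.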


51

Reconfigurable Computing

• Back to ENIAC-style computation

• Synthesize one machine to solve one problem


52

Efficiency

[Plot: hardware resources versus time; labels "idle" and "used"; "now" marked on the time axis.]

53

Manufacturing Cost

[Plot: cost versus time; curves "cost" and "affordable"; axis tick at $3x10^9; "now" marked on the time axis.]

54

Complexity

[Plot: transistors versus time; curves "manageable" and "available"; axis ticks at 10^8, 10^9 and 10^10; "now" marked on the time axis.]

55

CAD Tools

[Plot: manual interventions versus time; curves "feasible" and "necessary"; "now" marked on the time axis.]


56

ASH Benefits

Problem        Solution
Reliability    Configuration around defects
Power          Only “useful” gates switching
Signals        Localized computation
ILP            Statically extracted
Complexity     Hierarchy of abstractions
CAD            Compiler + local place & route
Efficiency     Circuit customized to application
Cost           No masks, no physics, same substrate
Performance    Scalable

57

CPU+ASH Study

• Reconfigurable functional unit on processor pipeline

• Adapted SimpleScalar 3.0

• ASH & CPU use the same memory hierarchy (incl. L1)

• ASH can access CPU registers

• CPU pipeline interlocked with ASH

• Results pending


58

Simplifying Predicates

• Shared implementations

• Control equivalence

a

b

c

59

Deep Speculation

if (p)
    if (q)
        x = a;
    else
        x = b;
else
    x = c;

[Diagram: a mux produces x from the inputs a, b and c, selected by the predicates p&q, p&!q and !p respectively.]
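A C sketch of the nested conditional flattened into a single three-way selection. Each definition of x gets its own predicate (p&q, p&!q, !p); exactly one of them is true for any p and q, so an AND-OR network can stand in for the mux. The function names are illustrative assumptions.

```c
#include <stdbool.h>

/* Flattened form: all three candidate values are already available,
   and one-hot predicates pick the one that reaches x. */
int select_x(bool p, bool q, int a, int b, int c)
{
    bool sel_a = p && q;
    bool sel_b = p && !q;
    bool sel_c = !p;

    /* Exactly one selector is true, so OR-ing the gated values is safe. */
    return (sel_a ? a : 0) | (sel_b ? b : 0) | (sel_c ? c : 0);
}

/* Reference form: the nested conditional as written in the source. */
int nested_x(bool p, bool q, int a, int b, int c)
{
    int x;
    if (p) {
        if (q)
            x = a;
        else
            x = b;
    } else {
        x = c;
    }
    return x;
}
```

Both functions agree for all four combinations of p and q, which is the property the flattened circuit relies on.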

60

Predicates & Tokens

[Diagram: variants of the store *q = 2 with inputs q, x and ~x. Labels: "ready", "safe", "ready & safe", "1", "P", "P_ready", "P & P_ready".]

Predicated tokens

Eliminate speculation

Eliminate wires