WaveScalar and the WaveCache


Page 1: WaveScalar and the WaveCache

Spring 2003 CSE P548 1

WaveScalar and the WaveCache

Steven Swanson, Ken Michelson, Mark Oskin, Tom Anderson, Susan Eggers

University of Washington

Page 2: WaveScalar and the WaveCache

Worries to Keep You up at Night

- In 2016, 200,000 RISC-1 processors will fit on a die.
- It will take 36 cycles to cross the die.
- Still a lack of ILP.
- Memory latency is still a problem.
- For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).

Page 3: WaveScalar and the WaveCache

WaveScalar’s Solution: Utilize Die Capability

- A sea of simple, RISC-like processors
  - in-order, single-issue
  - takes advantage of billions of transistors without exacerbating the other problems
    - short design & implementation time
    - operates at a short cycle
    - does not need lots of ILP
    - fewer defects

Page 4: WaveScalar and the WaveCache

WaveScalar Processing Element

[Figure: block diagram of a processing element. Labels: INPUTS, FLOW CONTROL, DECODE, CONFIG. LOGIC, FU, FLOW CONTROL, OUTPUTS; L2 Cache shown alongside.]

Page 5: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model
  - each processor executes when its operands have arrived
  - same principle as out-of-order execution, but applies to the processor & includes fetching
  - no single program counter
- Short wires:
  - no long control lines
  - no centralized hardware data structures
  - no need for sequential & individual instruction fetches
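
The firing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the WaveScalar hardware: the `Node` class, the `send` function, and the small graph computing (a + b) * (a - b) are all assumptions made for the example. Each node fires as soon as both of its operands have arrived, and no program counter decides what runs next.

```python
import operator

class Node:
    """Hypothetical two-input dataflow node (illustrative only)."""
    def __init__(self, name, op, consumers=()):
        self.name = name
        self.op = op
        self.consumers = list(consumers)  # (node, input_slot) pairs
        self.operands = {}                # input_slot -> value received so far

def send(node, slot, value, log):
    """Deliver one operand; fire the node once both operands are present."""
    node.operands[slot] = value
    if len(node.operands) == 2:
        result = node.op(node.operands[0], node.operands[1])
        node.operands.clear()
        log.append((node.name, result))
        # Results go straight to the consumers, not through a register file.
        for consumer, consumer_slot in node.consumers:
            send(consumer, consumer_slot, result, log)

# Graph for (a + b) * (a - b)
mul = Node("mul", operator.mul)
add = Node("add", operator.add, [(mul, 0)])
sub = Node("sub", operator.sub, [(mul, 1)])

log = []
# Operands arrive in an arbitrary order; a = 5, b = 3.
send(sub, 1, 3, log)
send(add, 0, 5, log)
send(add, 1, 3, log)   # add now has both operands and fires
send(sub, 0, 5, log)   # sub fires, which in turn lets mul fire
print(log)
```

The order in which nodes fire is determined only by operand arrival, which is why no centralized fetch or control structure is needed in this model.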

Page 6: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model, cont’d.
  - differs from original dataflow computers
    - distributed tag management (matching between renamed producer-consumer registers)
      - special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
      - all instructions in a “wave” execute on data with the same wave number
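
Tag matching by wave number might look roughly like the sketch below. It assumes a two-input add instruction; the names `MatchingAdd`, `receive`, and `wave_advance` are hypothetical and chosen only for this example. Operands are paired only when they carry the same wave number, so a value from a later wave (say, the next loop iteration) that arrives early simply waits.

```python
from collections import defaultdict

class MatchingAdd:
    """Hypothetical two-input ADD that pairs operands by wave number."""
    def __init__(self):
        self.waiting = defaultdict(dict)  # wave number -> {slot: value}
        self.results = []                 # (wave number, sum) in firing order

    def receive(self, wave, slot, value):
        slots = self.waiting[wave]
        slots[slot] = value
        if len(slots) == 2:               # both operands of this wave are here
            del self.waiting[wave]
            self.results.append((wave, slots[0] + slots[1]))

def wave_advance(token):
    """Hypothetical stand-in for the instruction that renumbers an operand
    as it enters the next wave."""
    wave, value = token
    return (wave + 1, value)

add = MatchingAdd()
add.receive(0, 0, 10)   # wave 0, left operand
add.receive(1, 0, 20)   # wave 1 operand arrives early and waits
add.receive(1, 1, 2)    # wave 1 is complete and fires first
add.receive(0, 1, 1)    # wave 0 completes later
print(add.results)
print(wave_advance((0, 7)))
```

Because each instruction keeps its own small matching table, the tag management is distributed rather than held in one central structure.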

Page 7: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model
  - differs from original dataflow computers
    - explicit wave-ordered memory
      - compiler assigns a sequence number to each memory operation in a breadth-first manner
      - sequence numbers for an operation, its predecessor & its successor are all sent with the produced data
      - wave & sequence numbers provide a total order on memory operations through any traversal of a wave
    + normal memory semantics
    + no need for special dataflow languages; C & C++ programs execute just fine
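
One way to picture wave-ordered memory is the sketch below, which handles a single wave. It is an illustration under stated assumptions, not the actual store buffer design: the `MemOp` and `WaveOrderedBuffer` names, the `START` marker for the first operation, and `UNKNOWN` for a neighbor that depends on a branch are all invented for the example. Requests arrive out of order, and the buffer issues an operation only when the predecessor/successor links show it is next.

```python
from dataclasses import dataclass

START = "."      # assumed marker: first memory operation of the wave
UNKNOWN = None   # assumed marker: neighbor depends on a branch outcome

@dataclass(frozen=True)
class MemOp:
    pred: object   # sequence number of the predecessor, START, or UNKNOWN
    seq: int       # this operation's compiler-assigned sequence number
    succ: object   # sequence number of the successor, or UNKNOWN
    name: str

class WaveOrderedBuffer:
    def __init__(self):
        self.pending = {}   # seq -> MemOp waiting for its turn
        self.last = None    # most recently issued operation
        self.issued = []    # names, in the order sent to memory

    def _is_next(self, op):
        if self.last is None:
            return op.pred == START
        # Either side of the link is enough to prove adjacency.
        return op.pred == self.last.seq or self.last.succ == op.seq

    def arrive(self, op):
        self.pending[op.seq] = op
        progressed = True
        while progressed:                      # drain everything now in order
            progressed = False
            for cand in list(self.pending.values()):
                if self._is_next(cand):
                    self.issued.append(cand.name)
                    self.last = cand
                    del self.pending[cand.seq]
                    progressed = True
                    break

# A wave with a branch: op 1, then op 2 (taken path) or op 3, then op 4.
op1 = MemOp(START, 1, UNKNOWN, "load A")
op2 = MemOp(1, 2, 4, "store B")
op4 = MemOp(UNKNOWN, 4, UNKNOWN, "store C")

buf = WaveOrderedBuffer()
for op in (op4, op2, op1):     # arrival order differs from program order
    buf.arrive(op)
print(buf.issued)
```

Operation 3 never arrives because its path was not taken, yet the links carried by operations 1, 2, and 4 are enough to recover the program order. Extending the sketch to several waves would mean keying the buffer on the wave number as well.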

Page 8: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Nearest-neighbor communication
  - code placement to locate consumers near their producers
  - short, fast node-to-node links rather than slow broadcast networks
  - exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
  - instructions can dynamically migrate toward their neighbors during execution
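
To make the value of placement concrete, the sketch below scores two layouts of the same four-instruction chain on a grid. Manhattan distance as the cost of a producer-to-consumer hop is an assumption for illustration, as are the function names and coordinates; the slides do not specify a cost model.

```python
def manhattan(a, b):
    """Grid hops between two (row, column) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def communication_cost(placement, edges):
    """Total hops over all producer -> consumer edges."""
    return sum(manhattan(placement[p], placement[c]) for p, c in edges)

# Producer -> consumer edges of a small dataflow chain.
edges = [("load", "add"), ("add", "mul"), ("mul", "store")]

scattered = {"load": (0, 0), "add": (3, 3), "mul": (0, 3), "store": (3, 0)}
clustered = {"load": (0, 0), "add": (0, 1), "mul": (1, 1), "store": (1, 0)}

print(communication_cost(scattered, edges))  # 15
print(communication_cost(clustered, edges))  # 3
```

With consumers placed next to their producers, every value travels a single hop, which is the case the short node-to-node links are built for.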

Page 9: WaveScalar and the WaveCache

Dynamic Optimization

The common case has higher costs, and the branch can detect this…

[Figure: a Branch feeding a Common Case path and a Rare Case path, which meet at a Join.]

Page 10: WaveScalar and the WaveCache

Dynamic Optimization

…and fix it, by moving. The join can do the same.

[Figure: the same Branch, Common Case, Rare Case, and Join, with the Branch moved closer to the Common Case.]
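
The idea on these two slides can be approximated by a greedy step: the branch weighs each consumer by how often it sends to it and moves one grid position at a time while that lowers the weighted distance. This is only a sketch under assumptions; the traffic counts, coordinates, and the `migrate_step` policy are invented for illustration and are not the mechanism described by the authors.

```python
def weighted_cost(pos, traffic, where):
    """Sum of (messages sent) x (grid hops) over all consumers."""
    return sum(
        count * (abs(pos[0] - where[c][0]) + abs(pos[1] - where[c][1]))
        for c, count in traffic.items()
    )

def migrate_step(pos, traffic, where):
    """Try the four neighboring positions that are free; move only if cheaper."""
    x, y = pos
    occupied = set(where.values())
    candidates = [p for p in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
                  if p not in occupied]
    if not candidates:
        return pos
    best = min(candidates, key=lambda p: weighted_cost(p, traffic, where))
    if weighted_cost(best, traffic, where) < weighted_cost(pos, traffic, where):
        return best
    return pos

where = {"common": (4, 0), "rare": (0, 1)}   # consumer positions (assumed)
traffic = {"common": 90, "rare": 10}         # observed message counts (assumed)

branch = (0, 0)
while True:
    nxt = migrate_step(branch, traffic, where)
    if nxt == branch:
        break
    branch = nxt
print(branch)
```

Starting next to the rare case, the branch ends up adjacent to the common case, trading a longer rare path for a shorter common one.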

Page 11: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

[Figure: WaveCache layout. Labels: PE, Domain, L2 Cache.]

Page 12: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

[Figure: WaveCache layout. Labels: D$ + Store Buffer, Cluster, L2 Cache.]

Page 13: WaveScalar and the WaveCache

WaveScalar’s Solution: Creative Use of Untapped Parallelism

- Expand the window for exploiting ILP
  - no in-order fetch using only one PC (sucking through a straw)
  - place instructions with the processing elements
  - out-of-order execution on a grand scale
- Allow multiple threads to execute concurrently
  - OS & applications
  - multiple applications, parallel threads

Page 14: WaveScalar and the WaveCache

WaveScalar’s Solution: The I-Cache is the Processor

- Model is processor-in-memory (PIM)
  - processing element associated with each instruction
- WaveScalar version
  - processing elements placed in the I-cache to reduce latency

Page 15: WaveScalar and the WaveCache

WaveScalar’s Solution: Design to Compensate for Circuit Unreliability

- Fewer design & implementation errors, thanks to a grid of simple, uniform nodes
- Route around processors with flaws
  - decentralized control
  - dynamic instruction migration
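
Routing around flawed processors can be pictured as a placement step that skips them. The sketch below assumes a known set of defective grid positions and fills working processing elements in row-major order; both the `place` function and that ordering are illustrative assumptions, not the placement policy of the actual design.

```python
def place(instructions, rows, cols, defective):
    """Assign instructions to working PEs in row-major order,
    skipping any position listed as defective."""
    working = [(r, c)
               for r in range(rows)
               for c in range(cols)
               if (r, c) not in defective]
    if len(instructions) > len(working):
        raise ValueError("not enough working processing elements")
    return dict(zip(instructions, working))

# A 2 x 3 grid in which the PE at row 0, column 1 is flawed.
mapping = place(["i0", "i1", "i2", "i3"], rows=2, cols=3, defective={(0, 1)})
print(mapping)
```

Because every node is identical and control is decentralized, any working node can stand in for a broken one; a flaw costs one processing element rather than the whole chip.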

Page 16: WaveScalar and the WaveCache

Research Agenda: Architecture

- WaveScalar ISA
- Microarchitecture design
  - node design
  - domain size
  - cache coherence across clusters
  - cluster arrangement
- Control & memory speculation
- WaveScalar instruction management
  - hardware for instruction placement & replacement
  - hardware for dynamic, self-optimizing placement

Page 17: WaveScalar and the WaveCache

Research Agenda: Architecture

- Multithreaded WaveScalar
- Design of the network & routing issues
- Power management
- Static & dynamic fault detection & recovery (rerouting instructions)
- System-level design
- Application to non-silicon designs

Page 18: WaveScalar and the WaveCache

Research Agenda: Compilers

- Instruction placement
- Revisit classic optimizations
  - code savings vs. communication costs
  - cache pollution vs. loop parallelism
- New opportunities for optimization
  - a match between compiler & execution models
  - WaveScalar-specific instructions

Page 19: WaveScalar and the WaveCache

Research Agenda: OS & Networking

- Tension between facilitating short routines & poor instruction locality
- The software side of thread management
- A bunch of stuff I don’t know about
  - optimizing the OS interface
  - new thread protection policies
  - memory management issues
  - security
  - lazy context switching
  - utilizing virtual machines

Page 20: WaveScalar and the WaveCache

Putting It All Together

- Grid of hundreds (maybe thousands) of simple, dataflow processing nodes
  - no centralized control; scalable
  - few design errors; increase in yield
- Processing nodes embedded in the I-cache
- Instructions execute in place
- Send results directly to the consumers
  - short, point-to-point links
- Instructions can dynamically migrate
  - reduce latency to hot consumers
  - map around defects
- 3X performance without any prediction mechanisms
  - more with them
