wavescalar and the wavecache

Spring 2003 CSE P548

WaveScalar and the WaveCache

Steven Swanson, Ken Michelson

Mark Oskin, Tom Anderson, Susan Eggers

University of Washington


Worries to Keep You up at Night

In 2016, 200,000 RISC-1 processors will fit on a die.
It will take 36 cycles to cross the die.
There is still a lack of ILP.
Memory latency is still a problem.
For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).
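The yield worry above can be made concrete with a rough sketch. It assumes defects are independent and that any single flaw kills the chip; the transistor counts are hypothetical values chosen for illustration, not figures from the slides.

```python
import math

def chip_yield(num_transistors: int, defect_prob: float) -> float:
    """Probability that no transistor is defective, assuming independent defects."""
    # log1p keeps precision when defect_prob is tiny.
    return math.exp(num_transistors * math.log1p(-defect_prob))

# Per-transistor defect rate from the slide: 1 in 24 billion.
p = 1 / 24e9

# Hypothetical die sizes (assumption, for illustration only).
for n in (1_000_000_000, 10_000_000_000, 24_000_000_000):
    print(f"{n:,} transistors -> yield {chip_yield(n, p):.3f}")
```

Under this model, yield falls off exponentially with transistor count, which is why a design that tolerates a few broken elements is attractive.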


WaveScalar’s Solution: Utilize Die Capability

A sea of simple, RISC-like processors
  in-order, single-issue
  takes advantage of billions of transistors without exacerbating the other problems
    short design & implementation time
    operates at a short cycle
    does not need lots of ILP
    fewer defects


[Figure: WaveScalar Processing Element. Labels: L2 Cache, INPUTS, FLOW CONTROL, DECODE, CONFIG. LOGIC, FU, FLOW CONTROL, OUTPUTS]


WaveScalar’s Solution: Short Wires

Dataflow execution model
  each processor executes when its operands have arrived
  same principle as out-of-order execution, but applies to the processor & includes fetching
  no single program counter
  short wires:
    no long control lines
    no centralized hardware data structures
    no need for sequential & individual instruction fetches
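The firing rule above (an instruction executes once its operands have arrived, with no program counter) can be sketched in a few lines. This is a toy software model, not the WaveScalar hardware; the `Instruction` class, the `run` function, and the token format are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Instruction:
    """One instruction resident at a processing element (hypothetical model)."""
    op: Callable[..., int]
    num_inputs: int
    consumers: List[Tuple[str, int]]  # (consumer name, input slot)
    operands: Dict[int, int] = field(default_factory=dict)

def run(program: Dict[str, Instruction],
        tokens: List[Tuple[str, int, int]]) -> Dict[str, int]:
    """Deliver (dest, slot, value) tokens; fire an instruction once all operands are present."""
    results: Dict[str, int] = {}
    pending = list(tokens)
    while pending:
        dest, slot, value = pending.pop(0)
        inst = program[dest]
        inst.operands[slot] = value
        if len(inst.operands) == inst.num_inputs:
            out = inst.op(*(inst.operands[i] for i in range(inst.num_inputs)))
            inst.operands.clear()
            results[dest] = out
            # Results go straight to the consumers, not to a shared register file.
            for consumer, consumer_slot in inst.consumers:
                pending.append((consumer, consumer_slot, out))
    return results

# (a + b) * (a - b): the arrival of data, not a program counter, drives execution.
program = {
    "add": Instruction(lambda x, y: x + y, 2, [("mul", 0)]),
    "sub": Instruction(lambda x, y: x - y, 2, [("mul", 1)]),
    "mul": Instruction(lambda x, y: x * y, 2, []),
}
tokens = [("add", 0, 7), ("sub", 1, 3), ("sub", 0, 7), ("add", 1, 3)]
results = run(program, tokens)
print(results["mul"])
```

Note that `sub` fires before `add` here simply because its operands happen to arrive first.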



Dataflow execution model, cont’d.
  differs from original dataflow computers
    distributed tag management (matching between renamed producer-consumer registers)
    special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
    all instructions in a “wave” execute on data with the same wave number
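Wave-number matching can be sketched as a small table kept at each instruction: operands are held per wave number, and the instruction fires only when both operands of the same wave are present. The `MatchingTable` class and the `wave_advance` helper are hypothetical names for this sketch, and a two-input instruction is assumed.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

class MatchingTable:
    """Holds waiting operands, keyed by wave number, for one two-input instruction."""

    def __init__(self) -> None:
        self.waiting: Dict[int, Dict[int, int]] = defaultdict(dict)

    def arrive(self, wave: int, slot: int, value: int) -> Optional[Tuple[int, int, int]]:
        """Record an operand; return (wave, left, right) once a wave has both operands."""
        slots = self.waiting[wave]
        slots[slot] = value
        if len(slots) == 2:
            del self.waiting[wave]
            return wave, slots[0], slots[1]
        return None

def wave_advance(token: Tuple[int, int]) -> Tuple[int, int]:
    """Hypothetical helper: move a (wave, value) token into the next wave."""
    wave, value = token
    return wave + 1, value

table = MatchingTable()
fired: List[Tuple[int, int, int]] = []
# Operands from two loop iterations (waves 0 and 1) arrive interleaved.
for wave, slot, value in [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]:
    match = table.arrive(wave, slot, value)
    if match is not None:
        fired.append(match)
print(fired)
```

Operands from different waves never pair up, so iterations can overlap without mixing their data.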



Dataflow execution model, cont’d.
  differs from original dataflow computers
    explicit wave-ordered memory
      compiler assigns a sequence number to each memory operation in a breadth-first manner
      sequence number for an operation, its predecessor & successor all sent with produced data
      wave & sequence numbers provide a total order on memory operations through any traversal of a wave
      + normal memory semantics
      + no need for special dataflow languages; C & C++ programs execute just fine
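The ordering idea can be sketched as a buffer that receives memory operations in any order and issues them in program order by following the predecessor/successor links. This is a simplified model for one wave: the `MemOp` and `WaveOrderedBuffer` names, the use of `None` for a link unknown at compile time (for example across a branch), and the convention that the wave starts at sequence number 0 are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class MemOp:
    name: str
    pred: Optional[int]  # predecessor's sequence number, None if unknown at compile time
    seq: int
    succ: Optional[int]  # successor's sequence number, None if unknown at compile time

class WaveOrderedBuffer:
    """Issues the memory operations of one wave in program order, whatever the arrival order."""

    def __init__(self, first_seq: int = 0) -> None:
        self.first_seq = first_seq
        self.last: Optional[MemOp] = None
        self.waiting: Dict[int, MemOp] = {}
        self.issued: List[str] = []

    def _ready(self, op: MemOp) -> bool:
        if self.last is None:
            return op.seq == self.first_seq
        # Either link of the chain is enough to prove that op comes next.
        return op.pred == self.last.seq or self.last.succ == op.seq

    def arrive(self, op: MemOp) -> None:
        self.waiting[op.seq] = op
        progress = True
        while progress:
            progress = False
            for cand in list(self.waiting.values()):
                if self._ready(cand):
                    del self.waiting[cand.seq]
                    self.issued.append(cand.name)
                    self.last = cand
                    progress = True
                    break

# A load, then a branch with one store on each side, then a load after the join.
# Only the taken side (seq 1) executes; seq 2 never arrives.
load_a = MemOp("load A", pred=None, seq=0, succ=None)
store_b = MemOp("store B", pred=0, seq=1, succ=3)
load_c = MemOp("load C", pred=None, seq=3, succ=None)

buf = WaveOrderedBuffer()
for op in (load_c, store_b, load_a):  # arrival order is scrambled
    buf.arrive(op)
print(buf.issued)
```

The untaken store never shows up, yet the chain from `load A` through `store B` to `load C` is still complete, which is the point of sending predecessor and successor numbers along with each operation.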



Nearest-neighbor communication
  code placement to locate consumers near their producers
  short, fast node-to-node links rather than slow broadcast networks
  exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
  instructions can dynamically migrate toward their neighbors during execution
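Placement for nearest-neighbor communication can be sketched as a cost-minimization problem on a grid. The greedy heuristic below is an illustration only, not the placement algorithm used by WaveScalar; the Manhattan-distance hop cost, the function names, and the sample dataflow graph are assumptions.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

def manhattan(a: Coord, b: Coord) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def communication_cost(placement: Dict[str, Coord], edges: List[Tuple[str, str]]) -> int:
    """Total hops travelled by values: one hop per grid step from producer to consumer."""
    return sum(manhattan(placement[p], placement[c]) for p, c in edges)

def place_greedy(order: List[str], edges: List[Tuple[str, str]],
                 width: int, height: int) -> Dict[str, Coord]:
    """Put each instruction on the free slot closest to its already-placed neighbors."""
    free = [(x, y) for y in range(height) for x in range(width)]
    placement: Dict[str, Coord] = {}
    for inst in order:
        neighbors = [placement[p] for p, c in edges if c == inst and p in placement]
        neighbors += [placement[c] for p, c in edges if p == inst and c in placement]
        best = min(free, key=lambda slot: sum(manhattan(slot, n) for n in neighbors))
        free.remove(best)
        placement[inst] = best
    return placement

# Hypothetical dataflow graph: (a + b) * c, stored to memory.
edges = [("ld_a", "add"), ("ld_b", "add"), ("add", "mul"), ("ld_c", "mul"), ("mul", "st")]
order = ["ld_a", "ld_b", "add", "ld_c", "mul", "st"]
placement = place_greedy(order, edges, width=3, height=3)
print(placement, communication_cost(placement, edges))
```

A real placer would also weigh how often each edge is used, which is what dynamic migration can exploit at run time.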


Dynamic Optimization

The common case has higher costs, and the branch can detect this…

[Figure labels: Branch, Common Case, Rare Case, Join]


Dynamic Optimization

…and fix it, by moving. The join can do the same.

[Figure labels: Branch, Common Case, Rare Case, Join]
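The branch's move toward its common-case consumer can be sketched as a local descent on traffic-weighted distance. This is a toy model under stated assumptions: the message counts, the grid size, and the `migrate` policy (step to an adjacent free slot only if it lowers cost) are hypothetical, not the mechanism described by the authors.

```python
from typing import Dict, Set, Tuple

Coord = Tuple[int, int]

def manhattan(a: Coord, b: Coord) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def weighted_cost(pos: Coord, traffic: Dict[Coord, int]) -> int:
    """Hops paid by one instruction: distance to each consumer times messages sent to it."""
    return sum(count * manhattan(pos, consumer) for consumer, count in traffic.items())

def migrate_step(pos: Coord, traffic: Dict[Coord, int], occupied: Set[Coord],
                 width: int, height: int) -> Coord:
    """Move to the adjacent free slot that lowers the weighted cost most; else stay put."""
    x, y = pos
    best = pos
    for cand in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        in_grid = 0 <= cand[0] < width and 0 <= cand[1] < height
        if in_grid and cand not in occupied and \
                weighted_cost(cand, traffic) < weighted_cost(best, traffic):
            best = cand
    return best

def migrate(pos: Coord, traffic: Dict[Coord, int], occupied: Set[Coord],
            width: int, height: int) -> Coord:
    """Repeat single steps until no neighbor is cheaper (cost strictly falls, so this ends)."""
    while True:
        nxt = migrate_step(pos, traffic, occupied, width, height)
        if nxt == pos:
            return pos
        pos = nxt

# Branch starts next to the rare case, far from the common case (hypothetical counts).
traffic = {(4, 0): 95, (0, 1): 5}   # consumer position -> messages observed
occupied = set(traffic)             # slots already holding the two consumers
start = (0, 0)
end = migrate(start, traffic, occupied, width=5, height=2)
print(start, weighted_cost(start, traffic), "->", end, weighted_cost(end, traffic))
```

The same routine could be applied to the join, using the counts of messages it receives from each side.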


[Figure labels: L2 Cache, PE, Domain, D$ + Store Buffer, Cluster]


WaveScalar’s Solution: Creative Use of Untapped Parallelism

Expand the window for exploiting ILP
  no in-order fetch using only one PC (sucking through a straw)
  place instructions with the processing elements
  out-of-order execution on a grand scale
Allow multiple threads to execute concurrently
  OS & applications
  multiple applications, parallel threads


WaveScalar’s Solution: The I-Cache is the Processor

Model is processor-in-memory (PIM)
  processing element associated with each instruction
WaveScalar version
  processing elements placed in the I-cache to reduce latency


[Figure label: L2 Cache]

WaveScalar’s Solution: Design to Compensate for Circuit Unreliability

Fewer design & implementation errors thanks to the grid's simple, uniform design
Route around processors with flaws
  decentralized control
  dynamic instruction migration
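Routing around flawed processors can be sketched as a shortest-path search that skips defective grid nodes. Breadth-first search is used here purely as an illustration; the slides do not say how the hardware routes, and the grid size and defect map below are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Coord = Tuple[int, int]

def route(src: Coord, dst: Coord, width: int, height: int,
          defective: Set[Coord]) -> Optional[List[Coord]]:
    """Breadth-first search over working PEs; returns a shortest hop-by-hop path or None."""
    if src in defective or dst in defective:
        return None
    parent: Dict[Coord, Optional[Coord]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent links back to the source, then reverse.
            path: List[Coord] = []
            cur: Optional[Coord] = node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_grid = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_grid and nxt not in defective and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# 3x3 grid with two broken PEs between the producer and the consumer (hypothetical).
defective = {(1, 0), (1, 1)}
path = route((0, 0), (2, 0), width=3, height=3, defective=defective)
print(path)
```

With no defects the same call would return the direct two-hop path, so the detour is paid only when a flaw is actually in the way.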


Research Agenda: Architecture

WaveScalar ISA
Microarchitecture design
  node design
  domain size
  cache coherence across clusters
  cluster arrangement
Control & memory speculation
WaveScalar instruction management
  hardware for instruction placement & replacement
  hardware for dynamic, self-optimizing placement


Research Agenda: Architecture

Multithreaded WaveScalar
Design of the network & routing issues
Power management
Static & dynamic fault detection & recovery (rerouting instructions)
System-level design
Application to non-silicon designs


Research Agenda: Compilers

Instruction placement
Revisit classic optimizations
  code savings vs. communication costs
  cache pollution vs. loop parallelism
New opportunities for optimization
  a match between compiler & execution models
  WaveScalar-specific instructions


Research Agenda: OS & Networking

Tension between facilitating short routines & poor instruction locality
The software side of thread management
A bunch of stuff I don’t know about
  optimizing the OS interface
  new thread protection policies
  memory management issues
  security
  lazy context switching
  utilizing virtual machines


Putting It All Together

Grid of hundreds (maybe thousands) of simple, dataflow processing nodes
  no centralized control; scalable
  few design errors; increase in yield
Processing nodes embedded in the I-cache
Instructions execute in place
Send results directly to the consumers
  short, point-to-point links
Instructions can dynamically migrate
  reduce latency to hot consumers
  map around defects
3X performance without any prediction mechanisms
  more with them
