
Page 1: WaveScalar: the Executive Summary

Autumn 2006 CSE P548

WaveScalar: the Executive Summary

Dataflow machine
• good at exploiting ILP
• dataflow parallelism + traditional coarser-grain parallelism
• cheap thread management
• low operand latency because of a hierarchical organization
• memory ordering enforced through wave-ordered memory
• no special languages

Page 2: WaveScalar: the Executive Summary


WaveScalar

Additional motivation:

• increasing disparity between computation (fast transistors) & communication (long wires)

• increasing circuit complexity

• decreasing fabrication reliability

Page 3: WaveScalar: the Executive Summary


Monolithic von Neumann Processors

A phenomenal success today.

But in 2016?

Performance
• centralized processing & control
• long wires, e.g., operand broadcast networks

Complexity
• 40-75% of “design” time is design verification

Defect tolerance
• 1 flaw -> paperweight

Page 4: WaveScalar: the Executive Summary


WaveScalar’s Microarchitecture

Good performance via distributed microarchitecture
• hundreds of PEs
• dataflow execution – no centralized control
• short point-to-point (producer to consumer) operand communication
• organized hierarchically for fast communication between neighboring PEs
• scalable

Low design complexity through simple, identical PEs
• design one & stamp out thousands

Defect tolerance
• route around a bad PE

Page 5: WaveScalar: the Executive Summary


Processing Element

• Simple, small (0.5M transistors)
• 5-stage pipeline (receive input operands, match tags, schedule instruction, execute, send output)

• Holds 64 (decoded) instructions

• 128-entry token store

• 4-entry output buffer

Page 6: WaveScalar: the Executive Summary


PEs in a Pod

• Share operand bypass network

• Back-to-back producer-consumer execution across PEs

• Relieve congestion on intra-domain bus

Page 7: WaveScalar: the Executive Summary


Domain

Page 8: WaveScalar: the Executive Summary


Cluster

Page 9: WaveScalar: the Executive Summary


WaveScalar Processor

Long-distance communication
• dynamic routing
• grid-based network
• 2-cycle hop per cluster

Page 10: WaveScalar: the Executive Summary


Whole Chip

• Can hold 32K instructions
• Normal memory hierarchy
• Traditional directory-based cache coherence
• ~400 mm2 in 90 nm technology
• 1 GHz
• ~85 watts

Page 11: WaveScalar: the Executive Summary


Page 12: WaveScalar: the Executive Summary


WaveScalar Instruction Placement

Place instructions in PEs to maximize data locality & instruction-level parallelism.
• Instruction placement algorithm based on a performance model that captures the important performance factors
• Carve the dataflow graph into segments
  • a particular depth, to make chains of dependent instructions that will be placed in the same pod
  • a particular width, to make multiple independent chains that will be placed in different, but nearby, pods
• Snake segments across PEs in the chip on demand (see the sketch after this list)
• K-loop bounding to prevent instruction “explosion”
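A toy sketch of the carving step, in C. Everything here is invented for illustration (the consumer array standing in for the dataflow graph, the DEPTH bound, the pod numbering); the real algorithm is driven by the performance model above:

    /* Toy sketch: carve a dataflow graph into depth-bounded chains of
       dependent instructions (same pod); each independent chain starts
       in a fresh, nearby pod. */
    #include <stdio.h>

    #define N     8    /* instructions in the made-up dataflow graph */
    #define DEPTH 3    /* assumed maximum chain length per pod */

    /* consumer[i] = instruction that uses i's result; -1 = none */
    static const int consumer[N] = {1, 2, -1, 4, 5, -1, 7, -1};

    int main(void) {
        int placed[N] = {0};
        int pod = 0;
        for (int i = 0; i < N; i++) {
            if (placed[i]) continue;
            /* follow producer->consumer edges: dependent instructions
               share a pod so results can bypass back-to-back */
            int v = i, len = 0;
            while (v != -1 && !placed[v] && len < DEPTH) {
                placed[v] = 1;
                printf("instruction %d -> pod %d\n", v, pod);
                v = consumer[v];
                len++;
            }
            pod++;   /* next independent chain goes to a nearby pod */
        }
        return 0;
    }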

Page 13: WaveScalar: the Executive Summary


Example to Illustrate the Memory Ordering Problem

A[j + i*i] = i;

b = A[i*j];

[Figure: dataflow graph of the two statements. Multiplies and adds compute the addresses j + i*i and i*j from inputs i, j, and A; one address feeds a Store of i, the other a Load into b. No dataflow edge orders the Load with respect to the Store.]
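Why this needs ordering at all: whether the Load depends on the Store is decided only at run time, when j + i*i happens to equal i*j. A self-contained C check, with i and j chosen here so the two addresses collide:

    /* The two addresses alias exactly when j + i*i == i*j. */
    #include <stdio.h>

    int main(void) {
        int A[16] = {0};
        int i = 2, j = 4;      /* j + i*i == 8 and i*j == 8: they alias */

        A[j + i*i] = i;        /* Store */
        int b = A[i*j];        /* Load: must see the Store's value */

        printf("b = %d\n", b); /* prints 2; reordering would print 0 */
        return 0;
    }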


Page 16: WaveScalar: the Executive Summary


Wave-ordered Memory

• Compiler annotates memory operations

• Send memory requests in any order
• Hardware reconstructs the correct order

[Figure: memory operations around a branch, each annotated with <predecessor, sequence #, successor>: Load <2,3,4> and Store <3,4,?> before the branch; Load <4,5,6> and Store <5,6,8> on one path; Store <4,7,8> on the other; Load <?,8,9> after the merge. A ‘?’ marks a link that depends on which way the branch goes.]

Page 17: WaveScalar: the Executive Summary

Wave-ordering Example

[Figure, animated across Pages 17-20: the annotated requests reach the store buffer out of order: first Store <3,4,?>, then Load <?,8,9>, then Store <4,7,8>. Store <3,4,?> issues, but its unknown successor leaves a gap, so Load <?,8,9> must wait; when Store <4,7,8> arrives naming 4 as its predecessor, the gap is bridged and the buffer issues 7, then 8.]
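The gap detection in this animation can be mimicked in software. The following C sketch is not the real store-buffer hardware; it merely replays the slides' arrival order <3,4,?>, <?,8,9>, <4,7,8> (assuming sequence 3 has already issued) and uses the links to issue 4, then 7, then 8:

    /* Requests carry <pred, seq, succ> links; UNK stands for '?', a
       link unresolved until the branch direction is known. */
    #include <stdio.h>

    #define UNK   -1
    #define MAXSEQ 16

    typedef struct { int pred, seq, succ; const char *op; } Req;

    int main(void) {
        Req arrive[] = {           /* out-of-order arrival, per the slides */
            {3, 4, UNK, "Store"},  /* <3,4,?>: successor depends on branch */
            {UNK, 8, 9, "Load"},   /* <?,8,9>: predecessor depends on merge */
            {4, 7, 8, "Store"},    /* <4,7,8>: taken path, bridges the gap */
        };
        Req *pending[MAXSEQ] = {0};
        int next = 4;              /* 3 already issued; 4 is up next */
        int last = 3;              /* last sequence number issued */

        for (int i = 0; i < 3; i++) {
            Req *r = &arrive[i];
            pending[r->seq] = r;
            /* an arrival naming the last-issued op as its predecessor
               resolves an unknown successor link */
            if (next == UNK && r->pred == last)
                next = r->seq;
            while (next != UNK && pending[next]) {
                Req *cur = pending[next];
                printf("issue %s, sequence %d\n", cur->op, cur->seq);
                pending[cur->seq] = NULL;
                last = cur->seq;
                next = cur->succ;          /* follow the successor link... */
                if (next == UNK)           /* ...or search arrived requests */
                    for (int s = 0; s < MAXSEQ; s++)
                        if (pending[s] && pending[s]->pred == last)
                            next = s;
            }
        }
        return 0;                  /* prints: Store 4, Store 7, Load 8 */
    }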

Page 21: WaveScalar: the Executive Summary


Wave-ordered Memory

Waves are loop-free sections of the dataflow graph.

Each dynamic wave has a wave number; the wave number is incremented between waves.

Ordering memory:
• wave number
• sequence number within a wave
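In other words, the global memory order is lexicographic on (wave number, sequence number). A minimal C comparator, with invented values, makes that concrete:

    /* Wave number dominates; the sequence number orders ops within a wave. */
    #include <stdio.h>

    typedef struct { int wave, seq; } MemOrder;

    static int mem_before(MemOrder a, MemOrder b) {
        return (a.wave != b.wave) ? (a.wave < b.wave) : (a.seq < b.seq);
    }

    int main(void) {
        MemOrder a = {5, 8}, b = {6, 2};   /* a is in an earlier wave */
        printf("%d\n", mem_before(a, b));  /* prints 1: a goes first */
        return 0;
    }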

Page 22: WaveScalar: the Executive Summary


WaveScalar Tag-matching

WaveScalar tag
• thread identifier
• wave number

Token: tag & value

<ThreadID:Wave#:InstructionID>.value

[Figure: an add consumes tokens <2:5>.3 and <2:5>.6, whose tags <2:5> match, and produces <2:5>.9.]
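A small C sketch of the matching rule in the figure. The instruction id inside the tag is made up here, since the figure abbreviates tags to <thread:wave>:

    /* A token pairs a tag <ThreadID:Wave#:InstructionID> with a value;
       a two-input instruction fires when its operands' tags match. */
    #include <stdio.h>

    typedef struct { int thread, wave, inst; } Tag;
    typedef struct { Tag tag; int value; } Token;

    static int same_tag(Tag a, Tag b) {
        return a.thread == b.thread && a.wave == b.wave && a.inst == b.inst;
    }

    int main(void) {
        Token left  = {{2, 5, 7}, 3};   /* <2:5>.3; inst id 7 is invented */
        Token right = {{2, 5, 7}, 6};   /* <2:5>.6 */

        if (same_tag(left.tag, right.tag)) {
            Token out = {left.tag, left.value + right.value};
            printf("<%d:%d>.%d\n", out.tag.thread, out.tag.wave, out.value);
        }                               /* prints <2:5>.9 */
        return 0;
    }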

Page 23: WaveScalar: the Executive Summary


Single-thread Performance

[Bar chart: performance in AIPC (y-axis 0 to 1.6) for WS vs. OOO on ammp, art, equake, gzip, mcf, twolf, djpeg, mpeg2encode, rawdaudio, and the average.]

Page 24: WaveScalar: the Executive Summary


Single-thread Performance per Area

[Bar chart: performance per area in AIPC/mm2 (y-axis 0 to 0.05) for WS vs. OOO on the same benchmarks.]

Page 25: WaveScalar: the Executive Summary


Multithreading the WaveCache

Architectural support for WaveScalar threads

• instructions to start & stop memory orderings, i.e., threads

• memory-free synchronization to allow exclusive access to data (thread communicate instruction)

• fence instruction to force all previous memory operations to fully execute (to allow other threads to see the results of this one’s memory ops)

Combine to build threads with multiple granularities

• coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT

• fine-grain, dataflow-style threads: 18-242X over single thread

• combine the two in the same application: 1.6X or 7.9X -> 9X

Page 26: WaveScalar: the Executive Summary


Creating & Terminating a Thread

Page 27: WaveScalar: the Executive Summary


Thread Creation Overhead

Page 28: WaveScalar: the Executive Summary


Performance of Coarse-grain Parallelism

Page 29: WaveScalar: the Executive Summary


CMP Comparison

[Bar chart: speedup vs. a 1-thread scalar CMP (y-axis 0 to 20) for ws, scmp, smt8, cmp4, and cmp2 on lu, fft, and radix, with 1, 2, 4, 8, 16, and 32 threads.]

Page 30: WaveScalar: the Executive Summary


Performance of Fine-grain Parallelism

Relies on:
• cheap synchronization
• load once, pass data (not load/compute/store)

Page 31: WaveScalar: the Executive Summary


Building the WaveCache

RTL-level implementation
• some didn't believe it could be built in a normal-sized chip
• some didn't believe it could achieve a decent cycle time and load-use latencies
• Verilog & Synopsys CAD tools

Different WaveCaches for different applications
• 1 cluster: low-cost, low-power, single-thread or embedded
  • 42 mm2 in 90 nm process technology, 2.2 AIPC on Splash2
• 16 clusters: multiple threads, higher performance
  • 378 mm2, 15.8 AIPC

Board-level FPGA implementation
• OS & real application simulations