ASCI Winterschool on Embedded Systems March 2004 Renesse Processor Components the cornerstones of future platforms with emphasis on ILP exploitation Henk Corporaal Peter Knijnenburg


ASCI Winterschool on Embedded Systems

March 2004, Renesse

Processor Components: the cornerstones of future platforms

with emphasis on ILP exploitation

Henk Corporaal, Peter Knijnenburg

Future

We foresee that many characteristics of current high-performance architectures will find their way into the embedded domain.

What are we talking about?

ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

Processor Components: Overview

• Motivation and Goals
• Trends in Computer Architecture
• RISC processors
• ILP Processors
• Transport Triggered Architectures
• Configurable components
• Summary and Conclusions

Motivation for ILP (and other types of parallelism)

• Increasing VLSI densities; decreasing feature size
• Increasing performance requirements
• New application areas, like
  – Multi-media (image, audio, video, 3-D)
  – Intelligent search and filtering engines
  – Neural, fuzzy, genetic computing
• More functionality
• Use of existing code (compatibility)
• Low power: $P = f C V^2$

Low power through parallelism

• Sequential processor
  – Switching capacitance $C$
  – Frequency $f$
  – Voltage $V$
  – $P = f C V^2$
• Parallel processor (two times the number of units)
  – Switching capacitance $2C$
  – Frequency $f/2$
  – Voltage $V' < V$
  – $P = (f/2) \cdot 2C \cdot V'^2 = f C V'^2$
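
The power argument above can be checked numerically. The following is a minimal sketch; the frequency, capacitance and voltage values are hypothetical and chosen only for illustration, they do not come from the slides.

```python
def dynamic_power(f_hz, c_farad, v_volt):
    """Dynamic switching power P = f * C * V^2."""
    return f_hz * c_farad * v_volt ** 2

# Hypothetical operating point (illustrative values only).
f, c, v = 200e6, 1e-9, 1.8
v_reduced = 1.2  # assumed lower supply voltage V' that suffices at f/2

p_sequential = dynamic_power(f, c, v)
# Parallel version: twice the switching capacitance, half the frequency.
p_parallel = dynamic_power(f / 2, 2 * c, v_reduced)

# The f and C factors cancel, so the saving comes only from (V'/V)^2.
ratio = p_parallel / p_sequential
print(f"parallel / sequential power = {ratio:.3f}")
```

With these assumed values the ratio equals (1.2 / 1.8)^2: the saving comes entirely from the lower voltage.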

ILP Goals

• Making the most powerful single chip processor
• Exploiting parallelism between independent instructions (or operations) in programs
• Exploit hardware concurrency
  – multiple FUs, buses, reg files, bypass paths, etc.
• Code compatibility
  – binary: superscalar and super-pipelined
  – HLL: VLIW
• Incorporate enhanced functionality (ASIP)


Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Summary and Conclusions

Trends in Computer Architecture

• Bridging the semantic gap
• Performance increase
• VLSI developments
• Architecture developments: design space
• The role of the compiler
• Right match

Very simple processor

[Figure: processor datapath, showing function unit(s), register file (r0, r1, r2), data memory with MAR and MDR, instruction register, and decode logic.]

Bridging the Semantic Gap

Programming domains
• Application domain
• Architecture domain
• Data path domain

Example:

L_application

A := B + C

(SW compilation or interpretation)

L_architecture

LD r1, M(&B)
LD r2, M(&C)
ADD r1, r1, r2
ST r1, M(&A)

(HW interpretation)

L_datapath

&B → MAR
MDR → r1
&C → MAR
MDR → r2
r1 → ALUinput-1
r2 → ALUinput-2
ALUoutput := ALUinput-1 + ALUinput-2
ALUoutput → r1
r1 → MDR
&A → MAR

Bridging the Semantic Gap: Different Methods

[Figure: four columns, labelled Direct Execution Architectures, CISC Architectures, RISC Architectures and Microcoded Architectures. Each column runs from Application down to Operations & Data Transports, passing through some combination of: compilation and/or software interpretation, an Architecture level, micro-code interpretation, and direct hardware interpretation.]

Bridging the Semantic Gap: What happens to the semantic level?

[Figure: semantic level versus year (1950 to 2010). Levels: Application Domain, Architecture Domain, Datapath Domain. Annotations: "Compiler and/or interpretation", "Interpretation", CISC, RISC, and a "?" for the future trend.]

Performance Increase

[Figure: Microprocessor SPEC Ratings. SPECint and SPECfp ratings (logarithmic scale, 0.1 to 1000) versus year (1978 to 2002), with SPECint92 and SPECfp92 data points and growth curves.]

• 50% SPECint improvement / year
• 60% SPECfp improvement / year

VLSI Developments

[Figure: density in transistors/chip (10e3 to 10e7) and minimum feature size in µm (0.1 to 10) versus year (1970 to 2000).]

# Transistors (DRAM) ~ $2^{(\text{year} - 1956) \cdot 2/3}$

Cycle time:

$t_{cycle} \sim t_{gate} \cdot \#gate\_levels + wiring\_delay + pad\_delay$

What happens to these contributions?
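
The two rules of thumb above can be written down directly. This is a minimal sketch; the function and parameter names are illustrative assumptions, and the exponent is read as 2 raised to (year - 1956) * 2/3, i.e. a doubling every 1.5 years.

```python
def dram_transistors(year):
    """Rule-of-thumb DRAM transistor count: 2 ** ((year - 1956) * 2/3)."""
    return 2 ** ((year - 1956) * 2 / 3)

def cycle_time(t_gate, gate_levels, wiring_delay, pad_delay):
    """t_cycle ~ t_gate * #gate_levels + wiring_delay + pad_delay."""
    return t_gate * gate_levels + wiring_delay + pad_delay

for year in (1970, 1980, 1990, 2000):
    print(year, f"{dram_transistors(year):.2e}")
```

The cycle-time expression makes the slide's question concrete: shrinking the gate delay only helps the first term, so the wiring and pad terms take a growing share of the cycle unless they shrink as well.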

Architecture Developments

How to improve performance?

• (Super)-pipelining
• Powerful instructions
  – MD-technique: multiple data operands per operation
  – MO-technique: multiple operations per instruction
• Multiple instruction issue

Architecture Developments

Pipelined Execution of Instructions

IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

Simple 5-stage pipeline:

CYCLE          1   2   3   4   5   6   7   8
INSTRUCTION 1  IF  DC  RF  EX  WB
INSTRUCTION 2      IF  DC  RF  EX  WB
INSTRUCTION 3          IF  DC  RF  EX  WB
INSTRUCTION 4              IF  DC  RF  EX  WB

Purpose:
• Reduce #gate_levels in critical path
• Reduce CPI close to one
• More efficient hardware

Problems
• Hazards: pipeline stalls
  – Structural hazards: add more hardware
  – Control hazards, branch penalties: use branch prediction
  – Data hazards: bypassing required

Superpipelining: split one or more of the critical pipeline stages
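
The timing of the simple pipeline can be sketched as follows, assuming one instruction enters per cycle and stalls are given as a lump sum. The helper names are illustrative, not from the slides.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles needed when one instruction enters the pipeline per cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each further one adds a cycle.
    return n_stages + (n_instructions - 1) + stall_cycles

def cpi(n_instructions, n_stages, stall_cycles=0):
    """Average cycles per instruction for the whole run."""
    return pipeline_cycles(n_instructions, n_stages, stall_cycles) / n_instructions

print(pipeline_cycles(4, 5))   # four instructions in a 5-stage pipeline
print(cpi(1000, 5))            # long runs approach CPI = 1
print(cpi(1000, 5, 200))       # hazards (stalls) push CPI above 1
```

Four instructions in five stages finish in eight cycles, and for long instruction streams the CPI approaches one unless hazards add stall cycles.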

Architecture Developments

Powerful Instructions (1)

MD-technique
• Multiple data operands per operation

Two styles
• Vector
• SIMD

[Figure: SIMD execution method (Instruction 1, 2, 3, ..., n over time, across node1, node2, ..., node-K) and vector execution method (Instruction 1, 2, 3, ..., K, K+1 over time, across FU1, FU2, FU3, ..., FU-K). Example: a = B * c + d]

Architecture Developments

Powerful Instructions (1)

Vector computing
• FU mix may match the application domain
• Use of interleaved memory
• FUs need to be tightly connected

SIMD computing
• Nodes used for independent operations
• Mesh or hypercube connectivity
• Exploit data locality of e.g. image processing applications
• SIMD on restricted scale: multi-media instructions
  – MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia, ...
  – Example: $\sum_{i=1..4} |a_i - b_i|$
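
The multi-media example, a sum of four absolute differences, can be sketched in plain Python. The packing of four 8-bit values into one 32-bit word is an assumption made here to mimic a sub-word SIMD instruction; real multi-media instruction sets define their own operand formats.

```python
def sad4(a, b):
    """Sum of absolute differences over four element pairs."""
    assert len(a) == len(b) == 4
    return sum(abs(x - y) for x, y in zip(a, b))

def unpack_bytes(word):
    """Split a 32-bit word into four 8-bit lanes (least significant first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def sad4_packed(word_a, word_b):
    """Same operation on operands packed four-to-a-word (assumed layout)."""
    return sad4(unpack_bytes(word_a), unpack_bytes(word_b))

print(sad4([10, 20, 30, 40], [12, 18, 33, 40]))
print(sad4_packed(0x0A141E28, 0x0C122128))
```

A single multi-media instruction would produce this result in one operation, where a scalar processor needs several subtract, absolute-value and add operations.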

Architecture Developments

Powerful Instructions (2)

MO-technique: multiple operations per instruction

• CISC (Complex Instruction Set Computer)
• VLIW (Very Long Instruction Word)

VLIW instruction example:

field        FU 1           FU 2            FU 3            FU 4          FU 5
instruction  sub r8, r5, 3  and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)  bnez r5, 13

Architecture Developments: Powerful Instructions (2)

VLIW Characteristics

• Only RISC-like operation support → short cycle times
• Flexible: can implement any FU mixture
• Extensible
• Tight inter-FU connectivity required
• Large instructions
• Not binary compatible

Architecture Developments

Multiple instruction issue (per cycle)

Who guarantees semantic correctness?

• User specifies multiple instruction streams
  – MIMD (Multiple Instruction Multiple Data)
• Run-time detection of ready instructions
  – Superscalar
• Compile into dataflow representation
  – Dataflow processors

Multiple instruction issue

Three Approaches

Example code:

a := b + 15;
c := 3.14 * d;
e := c / f;

Translation to DDG (Data Dependence Graph)

[Figure: DDG of the example code, with nodes ld &b, + 15, st &a; ld &d, * 3.14, st &c; ld &f, /, st &e.]

Generated Code

Instr.  Sequential Code   Dataflow Code
I1      ld r1,M(&b)       ld M(&b) -> I2
I2      addi r1,r1,15     addi 15 -> I3
I3      st r1,M(&a)       st M(&a)
I4      ld r1,M(&d)       ld M(&d) -> I5
I5      muli r1,r1,3.14   muli 3.14 -> I6, I8
I6      st r1,M(&c)       st M(&c)
I7      ld r2,M(&f)       ld M(&f) -> I8
I8      div r1,r1,r2      div -> I9
I9      st r1,M(&e)       st M(&e)

Notes:
• An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
  – No dependencies between streams; in practice communication and synchronization required between streams
• A superscalar issues multiple instructions from sequential stream
  – Obey dependencies (true and name dependencies)
  – Reverse engineering of DDG needed at run-time
• Dataflow code is direct representation of DDG
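
The "reverse engineering" of the DDG from the sequential code can be sketched as below. This is an illustrative model, not the mechanism of any particular processor: the tuple encoding of instructions is an assumption, and registers and memory locations are both treated simply as named storage.

```python
# (name, opcode, destination, sources); immediates are omitted because
# they carry no dependence.
PROGRAM = [
    ("I1", "ld",   "r1",    ["M(&b)"]),
    ("I2", "addi", "r1",    ["r1"]),
    ("I3", "st",   "M(&a)", ["r1"]),
    ("I4", "ld",   "r1",    ["M(&d)"]),
    ("I5", "muli", "r1",    ["r1"]),
    ("I6", "st",   "M(&c)", ["r1"]),
    ("I7", "ld",   "r2",    ["M(&f)"]),
    ("I8", "div",  "r1",    ["r1", "r2"]),
    ("I9", "st",   "M(&e)", ["r1"]),
]

def dependences(program):
    """Return (true_deps, name_deps) as sets of (producer, consumer) pairs."""
    true_deps, name_deps = set(), set()
    last_writer = {}   # storage name -> instruction that last wrote it
    readers = {}       # storage name -> readers since that last write
    for name, _opcode, dest, sources in program:
        for src in sources:
            if src in last_writer:
                true_deps.add((last_writer[src], name))   # read after write
        if dest in last_writer:
            name_deps.add((last_writer[dest], name))      # write after write
        for reader in readers.get(dest, []):
            name_deps.add((reader, name))                 # write after read
        for src in sources:
            readers.setdefault(src, []).append(name)
        last_writer[dest] = name
        readers[dest] = []
    # A name dependence that coincides with a true dependence adds nothing.
    return true_deps, name_deps - true_deps

true_deps, name_deps = dependences(PROGRAM)
print(sorted(true_deps))
print(sorted(name_deps))
```

The true dependences match the arrows of the dataflow code. The remaining name dependences exist only because r1 is reused, which is what register renaming removes.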

Instruction Pipeline Overview

[Figure: pipeline organizations compared.]

CISC:            IF DC RF EX WB
RISC:            IF DC/RF EX WB
Superscalar:     k parallel pipelines IF1 DC1 RF1 EX1 WB1 ... IFk DCk RFk EXk WBk, with ISSUE and ROB stages
Superpipelined:  IF1 IF2 ... IFs DC RF ... EX1 EX2 ... EX5 WB
VLIW:            IF DC, followed by k parallel RF1 EX1 WB1 ... RFk EXk WBk
Dataflow:        k parallel RF1 EX1 WB1 ... RFk EXk WBk

Four dimensional representation of the architecture design space <I, O, D, S>

[Figure: design space with axes Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining Degree 'S' (logarithmic scales, 0.1 to 100). Plotted: CISC, RISC, Superpipelined, VLIW, Vector, SIMD, and Superscalar / MIMD / Dataflow.]

Architecture design space

Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures:

Architecture    K    I    O    D    S    Mpar
CISC            1    0.2  1.2  1.1  1    0.26
RISC            1    1    1    1    1.2  1.2
VLIW            10   1    10   1    1.2  12
Superscalar     3    3    1    1    1.2  3.6
Superpipelined  1    1    1    1    3    3
Vector          7    0.1  1    64   5    32
SIMD            128  1    1    128  1.2  154
MIMD            32   32   1    1    1.2  38
Dataflow        10   10   1    1    1.2  12

$M_{par} = I \cdot O \cdot D \cdot S$

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$
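
The Mpar column can be recomputed from the <I, O, D, S> values. The sketch below assumes only the definition Mpar = I*O*D*S and the table values; the dictionary layout is illustrative.

```python
# (I, O, D, S) per architecture, copied from the table.
DESIGN_SPACE = {
    "CISC":           (0.2, 1.2, 1.1, 1),
    "RISC":           (1,   1,   1,   1.2),
    "VLIW":           (1,   10,  1,   1.2),
    "Superscalar":    (3,   1,   1,   1.2),
    "Superpipelined": (1,   1,   1,   3),
    "Vector":         (0.1, 1,   64,  5),
    "SIMD":           (1,   1,   128, 1.2),
    "MIMD":           (32,  1,   1,   1.2),
    "Dataflow":       (10,  1,   1,   1.2),
}

def mpar(i, o, d, s):
    """Parallelism needed to keep the machine busy: Mpar = I * O * D * S."""
    return i * o * d * s

for name, point in DESIGN_SPACE.items():
    print(f"{name:15s} Mpar = {mpar(*point):.2f}")
```

The computed values agree with the table after rounding, for example 153.6 for SIMD against the tabulated 154.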

The Role of the Compiler

9 steps required to translate an HLL program

• Front-end compilation
• Determine dependencies
• Graph partitioning: make multiple threads (or tasks)
• Bind partitions to compute nodes
• Bind operands to locations
• Bind operations to time slots: scheduling
• Bind operations to functional units
• Bind transports to buses
• Execute operations and perform transports

Division of responsibilities between hardware and compiler

[Figure: the translation steps (Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute) starting from the Application, split into "Responsibility of compiler" and "Responsibility of hardware". The position of the split is marked for Superscalar, Dataflow, Multi-threaded, Indep. Arch, VLIW and TTA.]

The Right Match

[Figure: transistors per CPU chip ($10^3$ to $10^8$) versus year (1972 to 2000). Marked along the curve: 8-bit microprocessor, RISC 32-bit core, CISC 32-bit core, RISC + MMU + FP 64-bit, and VLIW, Superscalar, Dataflow, MIMD. Annotations: INTRODUCTION, DISADVANTAGE.]


Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Summary and Conclusions

RISC basics

CYCLE          1   2   3   4   5   6   7   8
INSTRUCTION 1  IF  DC  EX  WB
INSTRUCTION 2      IF  DC  EX  WB
INSTRUCTION 3          IF  DC  EX  WB
INSTRUCTION 4              IF  DC  EX  WB

[Figure: RISC datapath with forwarding, showing the register file, function unit operand registers (Op-1), memory unit, ALU, immediate input, multiplexers and bypass buses (BP-1). Note: Ifetch path not shown.]

Why RISC? Make the common case fast

• Reduced number of instructions
• Limited addressing modes
  – load-store architecture
• Large uniform register set
• Limited number of instruction sizes (preferably one)
  – know directly where the following instruction starts
• Limited number of instruction formats

Enables pipelining


Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Summary and Conclusions

ILP Processors

• Overview
• General ILP organization
• VLIW concept
  – examples like: TriMedia, Mpact, TMS320C6x, IA-64
• Superscalar concept
  – examples like: HP-PA8000, Alpha 21264, MIPS R10k/R12k, Pentium I-IV, AMD5-7, UltraSparc
  – (Ref: IEEE Micro April 1996 (HotChips issue))
• Comparing Superscalar and VLIW

General ILP processor organization

[Figure: central processing unit containing an instruction fetch unit, an instruction decode unit, function units FU-1, FU-2, ..., FU-K and a register file, together with an instruction memory and a data memory.]

ILP processor characteristics

• Issue multiple operations/instructions per cycle
• Multiple concurrent Function Units
• Pipelined execution
• Shared register file
• Four Superscalar variants
  – In-order/Out-of-order execution
  – In-order/Out-of-order completion

VLIW concept

[Figure: a VLIW architecture with 7 FUs: three Int FUs, two LD/ST units and two FP FUs, with an instruction memory, an int register file, a floating point register file and a data memory.]

VLIW example: Trimedia

Trimedia Overview

[Figure: block diagram of the Trimedia chip. Blocks: SDRAM, memory interface, timers, PCI interface, video in, video out, audio in, audio out, I2C interface, serial interface, VLD coprocessor, and the VLIW processor with 32K I$ and 16K D$. Annotations: 32 bit, 33 MHz; 40 Mpix/s; 19 Mpix/s; 208 channel digital audio; stereo digital audio; Huffman decoder MPEG1,2.]

VLIW processor:
* 5-issue
* 128 registers
* 27 FUs
* 32-bit
* 8-way set associative caches
* dual ported data cache
* guarded operations

VLIW example: TMS320C62

TMS320C62 VelociTI Processor

• 8 operations (of 32-bit) per instruction (256 bit)
• Two clusters
  – 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)
  – 2 x 16 registers
  – One port available to read from register file of other cluster
• Flexible addressing modes (like circular addressing)
• Flexible instruction packing
• All operations conditional
• 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
• 128 KB on-chip RAM

VelociTI C64 datapath

[Figure: C64 datapath, organized in clusters.]

VLIW example: IA-64

Intel/HP 64-bit VLIW-like architecture
• 128-bit instruction bundle containing 3 instructions
• 128 integer + 128 floating point registers: 7-bit reg id.
• Guarded instructions
  – 64-entry boolean register file; relies heavily on if-conversion to remove branches
• Specify instruction independence
  – some extra bits per bundle
• Fully interlocked
  – i.e. no delay slots: operations are latency compatible within family of architectures
• Split loads
  – non-trapping load + exception check

Intel Itanium 2

• EPIC
• 0.18 um, 6ML
• 8 issue slots
• 1 GHz (8000 MIPS)
• 130 W (max)
• 61 MOPS/W
• 128b bundle (3x41b + 5b)

Superscalar: Concept

[Figure: superscalar organization. Blocks: instruction memory, instruction cache, decoder, reservation stations, branch unit, ALU-1, ALU-2, logic & shift unit, load unit, store unit, reorder buffer, register file, data cache and data memory. Paths labelled Instruction, Address and Data.]

Intel Pentium 4

• Superscalar
• 0.12 um, 6ML
• 1.0 V
• 3 issue
• >3 GHz
• 58 W
• 20 stage pipeline
• ALUs clocked at 2X
• Trace cache

Pentium 4

• Trace cache
• Hyper threading
• Add with ½ cycle throughput (1 ½ cycle latency)

[Figure: staggered add in three steps: add least signif. 16 bits, add most signif. 16 bits (with carry forwarding), calculate flags.]

P4 vs PII, PIII pipeline

Basic P6 Pipeline (intro at 733 MHz, .18µ):

1      2      3       4       5       6       7       8        9         10
Fetch  Fetch  Decode  Decode  Decode  Rename  ROB Rd  Rdy/Sch  Dispatch  Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.4 GHz, .18µ):

1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive

Example with Higher IPC and Faster Clock!

Code sequence: Ld, Add, Add, Ld, Add, Add

• P6 @ 1 GHz: 10 clocks, 10 ns, IPC = 0.6
• Pentium® 4 @ 1.4 GHz: 6 clocks, 4.3 ns, IPC = 1.0

Superscalar Issues

• How to fetch multiple instructions in time (across basic block boundaries)? Trace cache
• Handling control hazards: branch prediction
• Non-blocking memory system: hit over miss
• Handling dependencies: renaming
• How to support precise interrupts? ROB
• How to recover from mis-predicted branch path? ROB

Renaming

All four instructions may issue simultaneously
  – (if resources are available)

Renaming is implemented using
  – Reorder buffer: Pentium II/III, HP PA-8000, PowerPC 604, SPARC64
  – Direct register remapping: MIPS 10k/12k, DEC 21264

Example:

#  Original code   Dependence  Latency  Renamed version
1  mul r1,r2,r3                4        mul p1,p2,p3
2  st r1,3(r2)     RaW         1        st p1,3(p2)
3  add r1,r5,#4    WaW, WaR    1        add p4,p5,#4
4  shl r2,r1,r3    RaW, WaR    1        shl p6,p4,p3
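
Direct register remapping can be sketched for the four-instruction example as below. The initial mapping (r2 to p2, r3 to p3, r5 to p5) and the free list (p1, p4, p6) are assumptions chosen so that the physical register numbers follow the example; the tuple encoding of instructions is likewise illustrative. Source r3 of the last instruction is mapped to p3 here.

```python
def rename(program, initial_map, free_regs):
    """Rename destinations to fresh physical registers, sources via the map."""
    mapping = dict(initial_map)
    free = list(free_regs)
    renamed = []
    for opcode, dest, sources in program:
        # Sources use the current mapping; immediates such as "#4" pass through.
        new_sources = [mapping.get(src, src) for src in sources]
        new_dest = None
        if dest is not None:
            new_dest = free.pop(0)     # every write gets a fresh register
            mapping[dest] = new_dest
        renamed.append((opcode, new_dest, new_sources))
    return renamed, mapping

# (opcode, destination register or None, source operands)
PROGRAM = [
    ("mul", "r1", ["r2", "r3"]),
    ("st",  None, ["r1", "r2"]),   # st r1,3(r2): reads r1 and r2, writes memory
    ("add", "r1", ["r5", "#4"]),
    ("shl", "r2", ["r1", "r3"]),
]

renamed, mapping = rename(PROGRAM,
                          {"r2": "p2", "r3": "p3", "r5": "p5"},
                          ["p1", "p4", "p6"])
for instruction in renamed:
    print(instruction)
print(mapping)
```

Because every write receives a fresh physical register, the WaW and WaR dependences disappear and only the RaW dependences remain.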

Renaming

Mapping (after I4):

Logical register  Physical register
r1                p4
r2                p6
r3                p3
r4                -
r5                p5

Note: old mapping r1-p1 is not needed anymore; however, p1 is still active

When may we reuse physical register p1?
  – Old mapping has changed (r1-p4)
  – p1 has been committed

Branch Prediction

• Branch prediction techniques, why?
  – Speculatively execute beyond branches
  – Reduce branch penalties
• Classification
  – Static techniques; prediction based on:
    • Profiling information
    • Static analysis of code: use of heuristics
  – Dynamic techniques
    • 1-level: Branch prediction buffer with n-bit prediction counters
    • 2-level: Branch correlation using branch history
    • Hybrid methods (e.g. Alpha 21264)
  – Combinations of static and dynamic

Static Techniques: Heuristic Based (Ball and Larus '93)

• Loop Branch Heuristic
  – Back-edge will be taken 88% of the time
• Pointer Heuristic
  – A comparison of two pointers will fail 60% of the time
• Call Heuristic
  – A successor block containing a call and which does not post-dominate the block containing the branch will not be taken 78% of the time
• Opcode Heuristic
  – A test of an integer for '< 0', '≤ 0' or '= some constant' will fail 84% of the time
• Loop Exit Heuristic
  – A branch in a loop in which no successor block is a loop head will not exit the loop 80% of the time

Static Heuristic Based (Ball and Larus '93)

• Return Heuristic
  – A successor block containing a return will not be taken 72% of the time
• Store Heuristic
  – A successor block containing a store instruction and which does not post-dominate will not be taken 55% of the time
• Loop Header Heuristic
  – A successor block which is a loop header or a loop pre-header (i.e. passes control unconditionally to a loop head which it dominates) and which does not post-dominate will be taken 75% of the time
• Guard Heuristic
  – A successor block in which a register is used before being defined and which does not post-dominate will be taken 62% of the time if that register is an operand of the branch

Static Heuristic Based Prediction

When multiple predictors apply we use 'Dempster-Shafer' evidence combination:

$$P_{new} = \frac{P_{old} \cdot P_{heuristic}}{P_{old} \cdot P_{heuristic} + (1 - P_{old}) \cdot (1 - P_{heuristic})}$$

For example, if both the 'Loop Exit' and 'Store' heuristics are applied:

$$P_{new} = \frac{0.8 \cdot 0.45}{0.8 \cdot 0.45 + (1 - 0.8) \cdot (1 - 0.45)} = 0.766$$
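
The combination rule is easy to check numerically. The sketch below only assumes the formula as given; the function name is illustrative. The value 0.45 is presumably the complement of the Store heuristic's 55%, so that both probabilities refer to the same branch direction.

```python
def combine(p_old, p_heuristic):
    """Combine two probability estimates for the same branch outcome."""
    agree = p_old * p_heuristic
    disagree = (1 - p_old) * (1 - p_heuristic)
    return agree / (agree + disagree)

p_new = combine(0.8, 0.45)
print(round(p_new, 3))

# Further heuristics are folded in one at a time; an uninformative
# estimate of 0.5 leaves the running value unchanged.
print(round(combine(p_new, 0.5), 3))
```

The rule is symmetric in its two arguments, so the order in which heuristics are applied does not matter.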

Dynamic Techniques: Branch Prediction Buffer: 1 bit prediction

[Figure: the lower K bits of the branch address index a buffer of $2^K$ entries, each holding a 1-bit prediction.]

Problems
• Aliasing: lower K bits of different branch instructions could be the same
  – Soln: Use tags (the buffer becomes a tag); however very expensive
• Loops are predicted wrong twice
  – Soln: Use n-bit saturation counter prediction
    * taken if counter ≥ $2^{n-1}$
    * not-taken if counter < $2^{n-1}$
  – A 2 bit saturating counter predicts a loop wrong only once

Using n-bit Saturating Counters

[Figure: the branch address (a bits) selects an n-bit saturating up/down counter, whose value gives the prediction.]

2-bit saturating counter scheme

[Figure: state diagram with four states, 11/T, 10/T, 01/N and 00/N (counter value / prediction), with transitions on taken (T) and not-taken (N) outcomes.]
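
The difference between 1-bit and 2-bit prediction on a loop branch can be simulated with a short sketch. The class name, the initial counter state (saturated at "taken") and the loop pattern of nine taken outcomes followed by one not-taken are assumptions made for illustration.

```python
class SaturatingCounter:
    """n-bit saturating up/down counter; predicts taken if value >= 2^(n-1)."""

    def __init__(self, n_bits=2):
        self.maximum = (1 << n_bits) - 1
        self.threshold = 1 << (n_bits - 1)
        self.value = self.maximum          # assumed start: strongly taken

    def predict(self):
        return self.value >= self.threshold

    def update(self, taken):
        if taken:
            self.value = min(self.maximum, self.value + 1)
        else:
            self.value = max(0, self.value - 1)

def mispredictions(n_bits, outcomes):
    counter = SaturatingCounter(n_bits)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses

# A loop branch: taken nine times, then not taken; the loop is run three times.
loop_outcomes = ([True] * 9 + [False]) * 3

print(mispredictions(1, loop_outcomes))   # 1-bit: exit and re-entry both miss
print(mispredictions(2, loop_outcomes))   # 2-bit: only the exit misses
```

After the first pass the 1-bit scheme mispredicts twice per loop execution, at the exit and at re-entry, while the 2-bit counter mispredicts only the exit.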

Branch Correlation Using Branch History

Two schemes (a, k, m, n)
• PA: Per address history, a > 0
• GA: Global history, a = 0

[Figure: a bits of the branch address select an entry of the Branch History Table; the k history bits (0 ... $2^k - 1$) and m bits of the branch address (0 ... $2^m - 1$) select an n-bit saturating up/down counter in the Pattern History Table, which gives the prediction.]

Table size (usually n = 2): $\#bits = k \cdot 2^a + 2^k \cdot 2^m \cdot n$

Variant: Gshare (Scott McFarling '93): GA which takes the logical XOR of PC address bits and branch history bits
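
The table-size formula and the Gshare index can be sketched as follows. The function names are illustrative; the parameters are read off the formula as: 2^a history entries of k bits each, and 2^k * 2^m counters of n bits each.

```python
def table_bits(a, k, m, n=2):
    """Storage of a two-level predictor: k * 2^a + 2^k * 2^m * n bits."""
    return k * 2 ** a + 2 ** k * 2 ** m * n

def gshare_index(pc, history, index_bits):
    """Gshare: XOR the branch address bits with the global history bits."""
    mask = (1 << index_bits) - 1
    return (pc ^ history) & mask

print(table_bits(10, 6, 4, 2))    # PA(10, 6, 4, 2)
print(table_bits(0, 11, 5, 2))    # GA(0, 11, 5, 2)
print(bin(gshare_index(0b1011, 0b0110, 4)))
```

PA(10, 6, 4, 2) needs 8192 bits (1 KB) and GA(0, 11, 5, 2) needs 131083 bits (about 16 KB).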


Predicting the Target Address

1. Branch Target Buffer (BTB)

2. Branch Folding (Store instruction in BTB)

3. Return Stack

Accuracy (taking the best combination of parameters):

[Figure: branch prediction accuracy (%), 89 to 98, versus predictor size (bytes), 64 to 64K, for the Bimodal, GAs and PAs predictors. Marked points: PA(10, 6, 4, 2) and GA(0, 11, 5, 2).]

Comparing Superscalar and VLIW

Characteristic          Superscalar      VLIW
Architecture type       Multiple issue   Multiple operations
Complexity              High             Low
Binary code compat.     Yes              No
Source code compat.     Yes              Yes, if good compiler
Scheduling              Dynamic          Static
Scheduling window       10 instructions  100 - 1000 instructions
Speculation             Dynamic          Static
Branch prediction       Dynamic          Static
Mem ref disambiguation  Dynamic          Static
Scalability             Medium           High
Functional flexibility  High             Very high
Application             General purpose  Special purpose

Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Summary and Conclusions

Reducing Datapath Complexity: TTA

TTA: Transport Triggered Architecture

Overview

Philosophy: MIRROR THE PROGRAMMING PARADIGM

• Program transports, operations are side effects of transports

• Compiler is in control of hardware transport capacity

Transport Triggered Architecture

General Structure of TTA

[Figure: functional units (FUs), an integer register file, an FP register file and a Boolean register file, attached through sockets to the data-transport buses (move buses)]

Program TTAs

How to do data operations?
1. Transport of operands to FU
   • Operand move(s)
   • Trigger move
2. Transport of results from FU
   • Result move(s)

How to do control flow?
1. Jump: #jump-address → pc
2. Branch: #displacement → pcd
3. Call: pc → r; #call-address → pcd

Example: Add r3,r1,r2 becomes
r1 → Oint     // operand move to integer unit
r2 → Tadd     // trigger move to integer unit
...           // addition operation in progress
Rint → r3     // result move from integer unit

[Figure: FU pipeline with Operand and Trigger registers, an internal stage, and a Result register]
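
The Add example can be turned into a small executable sketch. It assumes a single integer FU with ports named Oint, T<op> and Rint as on the slide, represents each move as a (source, destination) pair, and ignores FU latency; the function names and the move encoding are hypothetical.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub}


def to_moves(op, dst, src1, src2):
    """Translate a three-address operation into operand, trigger and result moves."""
    return [
        (src1, "Oint"),        # operand move
        (src2, "T" + op),      # trigger move: starts the operation
        ("Rint", dst),         # result move
    ]


def run(moves, regs):
    """Execute moves in order on one integer FU (latency ignored)."""
    regs = dict(regs)
    operand = None             # FU operand register
    result = None              # FU result register
    for src, dst in moves:
        if src.startswith("#"):
            value = int(src[1:])           # immediate
        elif src == "Rint":
            value = result
        else:
            value = regs[src]
        if dst == "Oint":
            operand = value
        elif dst.startswith("T"):
            # the operation is a side effect of the transport
            result = OPS[dst[1:]](operand, value)
        else:
            regs[dst] = value
    return regs
```

For example, `run(to_moves("add", "r3", "r1", "r2"), {"r1": 2, "r2": 5})` leaves 7 in r3.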

Program TTAs

Scheduling advantages of Transport Triggered Architectures

1. Software bypassing
   Before: Rint → r1
           r1 → Tadd
   After:  Rint → r1; Rint → Tadd

2. Dead writeback removal
   Before: Rint → r1; Rint → Tadd
   After:  Rint → Tadd

3. Common operand elimination
   Before: #4 → Oint; r1 → Tadd
           #4 → Oint; r2 → Tadd
   After:  #4 → Oint; r1 → Tadd
           r2 → Tadd

4. Decouple operand, trigger and result moves completely
   Before: r1 → Oint; r2 → Tadd
           Rint → r3
   After:  r1 → Oint
           ---
           r2 → Tadd
           ---
           Rint → r3
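
The first two optimizations can be sketched on a straight-line list of (source, destination) moves. This is a simplified illustration under stated assumptions: registers are named r..., result ports R..., trigger ports T..., there are no guards or branches, and every trigger is conservatively assumed to overwrite all result ports. The function names are hypothetical.

```python
def software_bypass(moves):
    """Read a value from the FU result port instead of the register it was copied to."""
    same_as = {}                               # register -> result port with the same value
    out = []
    for src, dst in moves:
        src = same_as.get(src, src)            # bypass the register if possible
        out.append((src, dst))
        if dst.startswith("T"):
            same_as.clear()                    # a trigger may overwrite a result port
        elif dst.startswith("r"):
            if src.startswith("R"):
                same_as[dst] = src
            else:
                same_as.pop(dst, None)
    return out


def remove_dead_writebacks(moves, live_out=()):
    """Drop register writes that are never read afterwards (backward liveness scan)."""
    live = set(live_out)
    kept = []
    for src, dst in reversed(moves):
        if dst.startswith("r"):
            if dst not in live:
                continue                       # dead writeback: nobody reads it
            live.discard(dst)
        if src.startswith("r"):
            live.add(src)
        kept.append((src, dst))
    kept.reverse()
    return kept
```

Applying `software_bypass` and then `remove_dead_writebacks` to `[("Rint", "r1"), ("r1", "Tadd")]` reproduces steps 1 and 2 of the slide; passing `live_out={"r1"}` keeps the writeback when r1 is still needed later.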

TTA Advantages

Summary of advantages of TTAs

• Better usage of transport capacity
  – Instead of 3 transports per dyadic operation, about 2 are needed
  – # register ports reduced by at least 50%
  – Inter-FU connectivity reduced by 50-70%

• No full connectivity required

• Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs

• Flexible: FUs can incorporate arbitrary functionality

• Scalable: #FUs, #reg.files, etc. can be changed

• TTAs are easy to design and can have short cycle times

TTA automatic DSE

[Figure: Move framework. Architecture parameters and user interaction feed an Optimizer, which drives a Parametric compiler (producing parallel object code) and a Hardware generator (producing the chip); both return feedback to the Optimizer. The outcome is a Pareto curve (solution space) of cost versus exec. time.]
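
The Pareto curve that the exploration produces can be illustrated with a short filter over (cost, execution time) pairs. This is only a sketch of the final selection step, not of the Move framework itself; the function name and the sample points in the usage note are made up for illustration.

```python
def pareto_front(points):
    """Keep the (cost, exec_time) points that no other point dominates.

    Lower is better on both axes. Points are visited by increasing cost,
    so a point survives only if it is strictly faster than every cheaper one.
    """
    front = []
    best_time = float("inf")
    for cost, exec_time in sorted(set(points)):
        if exec_time < best_time:
            front.append((cost, exec_time))
            best_time = exec_time
    return front
```

For the hypothetical design points `[(1, 10), (2, 8), (2, 9), (3, 8), (4, 3), (5, 4)]`, the front is `[(1, 10), (2, 8), (4, 3)]`.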

Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Summary and Conclusions

Tensilica Xtensa

• Configurable RISC
• 0.13 um
• 0.9 V
• 1 issue slot / 5 stage pipeline
• 490 MHz typical
• 39.2 mW (no mem.)
• 12500 MOPS / W
• Tool support
• Optional vector unit
• Special Function Units
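
The efficiency figure is consistent with the other numbers on the slide if one assumes one operation per cycle for the single issue slot. The slide does not say how the figure was derived, so the calculation below is one plausible reading; the function name is hypothetical.

```python
def mops_per_watt(clock_mhz, ops_per_cycle, power_mw):
    """Millions of operations per second, divided by power in watts."""
    mops = clock_mhz * ops_per_cycle           # assumption: peak rate, no stalls
    return mops * 1000.0 / power_mw            # mW -> W


# 490 MHz, 1 issue slot, 39.2 mW gives roughly 12500 MOPS/W
xtensa_efficiency = mops_per_watt(490, 1, 39.2)
```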

Fine-Grained reconfigurable: Xilinx XC4000 FPGA

[Figure: array of Configurable Logic Blocks (CLBs) joined by switch matrices and programmable interconnect, surrounded by I/O Blocks (IOBs). IOB detail: pad, input and output buffers, flip-flops, slew rate control, passive pull-up/pull-down, delay. CLB detail: function generators F (inputs F1-F4), G (inputs G1-G4) and H, inputs C1-C4 (H1, DIN, S/R, EC), two flip-flops with S/R control, and signals K, X and Y.]

Coarse-Grained reconfigurable: Chameleon CS2000

Highlights:
• 32-bit datapath (ALU/Shift)
• 16x24 Multiplier
• distributed local memory
• fixed timing

Hybrid FPGAs: Virtex II-Pro

[Figure, courtesy of Xilinx (Virtex II Pro): PowerPC cores, reconfigurable logic blocks, memory blocks, and GHz IO with up to 16 serial transceivers]

HW or SW reconfigurable?

[Figure: design space with data path granularity (fine to coarse) on one axis and reconfiguration time (1 cycle, loop buffer, context, reset) on the other; labels in the plot: subword parallelism, spatial mapping, temporal mapping, FPGA, VLIW, configuration bandwidth]

Granularity Makes Differences

| | Fine-Grained Architecture | Coarse-Grained Architecture |
|---|---|---|
| Clock Speed | Low | High |
| Configuration Time | Long | Short |
| Unit Amount | Large | Small |
| Flexibility | High | Low |
| Power | High | Low |
| Area | Large | Small |

Overview

• Motivation and Goals

• Trends in Computer Architecture

• RISC processors

• ILP Processors

• Transport Triggered Architectures

• Configurable components

• Multi-threading

• Summary and Conclusions

Simultaneous Multithreading Characteristics

• An SMT has separate front-ends for the different threads but shares the back-end between all threads.

• Each thread has its own
  – Re-order buffer
  – Branch History Register

• Registers, caches, branch prediction tables, instruction queues, FUs etc. are shared.

Multi-threading in Uniprocessor Architectures

[Figure: occupancy of issue slots over clock cycles for three organizations: Superscalar, Concurrent Multithreading, and Simultaneous Multithreading; legend: Empty Slot, Thread 1, Thread 2, Thread 3, Thread 4]

Instruction Fetch Policies

• The Instruction Fetch policy decides from which threads to fetch each cycle.

• Performance and throughput are highly sensitive to the Instr. Fetch policy.

• “Standard” icount fetches from the thread with the fewest instructions in the front-end.

• Performance of a thread depends on policy as well as workload and becomes highly unpredictable.
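
The icount rule described above amounts to a selection by smallest front-end occupancy. A minimal sketch, assuming the per-thread instruction counts are available as a dictionary and that ties are broken by the lowest thread id (both are assumptions for this example):

```python
def icount_select(front_end_counts, n_fetch=1):
    """Return the n_fetch threads with the fewest instructions in the front-end.

    front_end_counts maps thread id -> number of instructions currently
    in the front-end; ties go to the lower thread id.
    """
    order = sorted(front_end_counts, key=lambda t: (front_end_counts[t], t))
    return order[:n_fetch]
```

For instance, with counts `{0: 12, 1: 3, 2: 7}` the policy fetches from thread 1 first, then thread 2.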

Resource Allocation in SMT

• Better to perform dynamic resource allocation to drive instruction fetch.

• DCRA outperforms icount in many cases.

• Possible to use resource allocation to guarantee certain percentage of single-thread performance.

• Improves predictability and hence suitability of SMT for real-time embedded systems.
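
One simple way to let resource allocation drive fetch is to stop fetching for any thread that already holds its share of a shared structure. The sketch below is a generic illustration of that idea and is not the DCRA algorithm; the function name, the notion of a fixed per-thread share, and the tie-breaking rule are all assumptions.

```python
def select_fetch_thread(occupancy, capacity, share):
    """Pick the thread to fetch from, gating threads that reached their share.

    occupancy: thread id -> entries the thread holds in a shared queue
    capacity:  total number of entries in that queue
    share:     thread id -> fraction of the queue the thread may occupy
    Returns None when every thread is at or above its share.
    """
    eligible = [t for t in occupancy if occupancy[t] < share[t] * capacity]
    if not eligible:
        return None
    # among eligible threads, fall back to an icount-like choice
    return min(eligible, key=lambda t: (occupancy[t], t))
```

Giving one thread a large fixed share bounds how much the other threads can slow it down, which is the kind of guarantee the slide refers to.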

Future Processors Components

• New TriMedia has deep pipeline, L1 and L2 cache, and branch prediction.

• META is a (simple) simultaneous multithreaded architecture.

• Calistro is an embedded multi-processor platform for mobile applications.

• Imagine (Stanford): combines operation (VLIW) and data level parallelism (SIMD).

• TRIPS (Texas Austin / IBM) and SCALE (MIT) processors combine task, operation and data level parallelism.

Summary and Conclusions

ILP architectures have great potential

• Superscalars
  – Binary compatible upgrade path

• VLIWs
  – Very flexible ASIPs

• TTAs
  – Avoid control and datapath bottlenecks
  – Completely compiler controlled
  – Very good cost-performance ratio
  – Low power

• Multi-threading
  – Surpass exploitable ILP in applications
  – How to choose threads?