kilo-instruction processors mateo valero, upc hpca-10, madrid february 14-17 th 2004

59
Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

Upload: edward-joseph

Post on 12-Jan-2016

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

Kilo-instruction Processors

Mateo Valero, UPC

HPCA-10, MadridFebruary 14-17th 2004

Page 2: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

2

Motivation

Justin Rattner, Intel-MRL, Keynote lecture, Micro-32

Technology works against ILP: Faster clock rates => Lower ILP

Page 3: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

3

1990’s architecture Short pipelines Low memory latencies

2010+ architectures Long pipelines

• 30-50 stages Power-Thermal-Wire

delay aware architecture Long memory latencies

• 500 to 1000 cycles• ISCA-2003: 50 to 160

The trends are changing

QueueScheduleScheduleScheduleDispatchDispatch

Reg. ReadReg. ReadExecute

FlagsBr. chkDrive

DriveAlloc.

RenameRename

Next IPNext IPFetchFetch L1

Instr.

L1Data

L2

Mem

ory

Bra

nc h

mis

pr e

dic

tion

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 4: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

4

128 in-flight inst.

0.0

0.5

1.0

1.5

2.0

2.5

3.0

int 4way fp 4way int 8way fp 8way

IPC 100

500

1000

latency

Memory Wall Problem

Memory latency has enormous impact on IPC

0.6X

0.45X

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 5: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

5

Reducing Memory Latency

Technology Caches Prefetching

Hardware, Software and combined Assisted/SSMT Threads

“Kilo-instruction” Processor

Page 6: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

6

“Kilo-instruction” Processors Our goals

Better tolerate increasing memory latency Further improve ILP, even for such longer memory

latency Allow additional optimizations enabled by the new

architecture (See below)

Our proposal: “Kilo-instruction” Processors Out-Of-Order processors with thousands of

instructions in-flight (Very Large Instruction Windows) Intelligent use of resources (Resource requirements

growing much slower than window size)

Page 7: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

7

“Kilo-instruction Processsor”

It is not….. A “heavy” processor Cyber-205 like processor Vector Processor Blue-Gene like Multiscalar,Trace Processor Raw, Imagine, Levo,TRIPS

It is ……. An Affordable O-O-O Superscalar Processor having

“Thousands of In-flight Instructions”

Page 8: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

8

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 9: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

9

branch 3

ROB ActivityROB Register File

IQ

x

xx

branch 3xbx

load 2x

branchx

branch 1xax

load 1

load 1a

branch 1load 2

b128-entry

1024-entry

ROB full:

Lat=100 Lat=1000

int 62% 72%

fp 76% 90%

ROB full:

Lat=100 Lat=1000

int 30% 48%

fp 50% 75%

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 10: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

10

0,0

0,5

1,0

1,5

2,0

2,5

3,0

3,5

4,0

4,5

5,0

128 512 1024 4096 128 512 1024 4096

perceptron perfect

ROB Size / Branch Predictor

IPC

1005001000

MemoryLatency

Perfect Mem. & Perfect BP

Perfect Mem. & Perceptron BP

Integer, 8-way, L2 1MB

0.6X

1.22X

1.41X1.86X

1.1X

Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 11: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

11

0,0

1,0

2,0

3,0

4,0

5,0

6,0

128 512 1024 4096 8192 128 512 1024 4096 8192

perceptron perfect

ROB Size / Branch Predictor

IPC

1005001000

MemoryLatency

Perfect Mem. & Perfect BP

Perfect Mem. & Perceptron BP

Floating-point, 8-way, L2 1MB

0.45X

2.34X

3.91X 4.58X2X

Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 12: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

12

Scalability Thousands of In-flight Instructions and In-

Order Commit make designs impractical: ROB : Needs to maintain a copy of every in-

flight instruction IQs : Instructions depending on long

latency instructions remain in these queues for a long time

LSQs : Instructions remain in the queue until commit

Registers : A new physical register for each instruction producing a new value

We would like to get the IPC of thousands of instructions in-flight without drastically increasing resource requirements

Re

ord

er B

uffe

r

IQF

PQ

LSQ

Inte

ger

Reg

. Fil e

FP

Re

g. F

il e

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 13: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

13

Late Allocation/Early Release of RegistersRegister File

IQload 1

abranch 1

load 2b

a, b, c

R1, R2

Virtual Registers

Early Release

ROB

cx

branch 3xbx

load 2branch 2

xR1

branch 1xax

load 1

R2

a

b

c

R1

R2

c

Monreal et al.: “Delaying physical register allocation through virtual-physical registers”, MICRO’99T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)

Page 14: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

14

load

branch

Nearby & Distant Parallelism

ROB Register Fileload

Speculative

Replayable

f(X)

X

load

branch

load

branch

Balasubramonian et al.: “Dynamically Allocating Processor Resources…”, ISCA’01

Nearby

Distant

Page 15: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

15

Dynamic Vectorization

load

br

C.I. 1

C.I. 2

ROB_head

ROB_tail

…C.I.1

…C.I.2

registerfile

control-independent instructionspeculative pre-execution based

on the instructions in the ROB

when the L2 miss load commitspreexecuted control-independent

instructions can reuse data

ROB_head

load

A. Pajuelo et al. “Control-Flow Independence Reuse via Dynamic Vectorization”, UPC-DAC

Page 16: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

16

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 17: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

17

Checkpointing the ROB

Checkpointing to support precise exceptions: Quite well established and used technique

• W.M.Hwu and Y.N.Patt, ISCA 1987

Checkpointing to early release resources: Quite recent concept

• Cherry: J. Martínez et al., MICRO, Nov. 2002• Large VROB: A. Cristal et al. TR-UPC-DAC, July 2002

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 18: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

18

Cherry ROB

branch

load

Early Release- registers- loads- stores

Point of no return(PNR)

Martínez et al.: “Cherry: Checkpointed Early Resource Recycling…”, MICRO’02

irreversible

reversible

CherryCherry

Page 19: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

19

Multi-Checkpoint

IQ

ROB

x

x

x

load 3

x

b

x

branch 2

x

branch

x

branch 1

x

a

x

load 1 Checkpointing TableCheckpoint 1Checkpoint 1

OOO commit

load 1: PC, status, counter, …

load 1a

branch 1

load 1Checkpoint 2Checkpoint 2

branch 2: PC, status, counter, …

x

x

x

x

branch 4

x

x

x

load 4

x

x

load 3

x

b

x

branch 2

branch 2b

load 3

Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002

branch 2 load 1load 1 arrivesarrives

Gang commitGang commitCheckpoint 1Checkpoint 1

Page 20: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

20

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 21: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

21

Early Release of Resources

CommitCommit

Resource

Assignment

---

Release

ROB, IQ, registers

Dispatch, RenamingDispatch, Renaming

Resource

Assignment

ROB, IQ, registers

ReleaseIQ(issued)

FetchFetch

CommitCommitMemory LatencyMemory Latencyi.e, 1000 cyclesi.e, 1000 cycles

T. Karkhanis and J.Smith, “A day in the life of a data cache miss” Workshop Memory Performance Issues. ISCA-2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 22: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

22

Registers

Register File is a critical component of a modern superscalar processor Large number of entries to support out-of-order

execution and memory latency Large number of ports to increase issue width

Power and access time are key issues for register file design

It is always beneficial, to reduce the number of physical registers

Page 23: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

23

Conventional renaming scheme

Virtual-Physical Registers

Early Release

Ephemeral Registers: checkpoint + virtual-physical

Physical Registers

Icache Decode&Rename Commit

Register Unused Register Used Register Unused

Register Used Register Unused

Register Unused Register Used

Register Used

T. Monreal et al.: “Delaying physical register allocation through virtual-physical registers”, MICRO’99M. Moudgill et al, “Register renaming and dynamic speculation: an alternative approach”, MICRO93T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)J. Martínez et al, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003

Page 24: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

24

State of Registers (FP, ROB=2048)

1 10 25 50 75 90 1000

200

400

600

800

1000

1200

1400

FP

Re

gis

ters

Distribution of in-flight Instructions

Dead

Blocked-Long

Blocked-Short

Live

1168 1382 1607 1868 1955

Early Release

Late Allocation

Number of Instructions

A. Cristal, et al, “ A case for resource-concious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Page 25: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

25

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction´s Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 26: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

26

Increasing the number of IQ entries increase the power, area and access time

Wake-up and selection logic need to be done efficiently

“Kilo-instruction” processors may have many “in-flight” instructions

We need new organization for the IQs in order to have affordable “kilo-instruction processors”

IQs and “Kilo” processors

Page 27: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

27

1

2

3

slowfast medium

x

x

branch 4

x

x

x

load 4

x

x

load 3

x

b

x

branch 2

Secondary Buffer

ROB

13

31

2

IQ

Execution Time of Instructions

Lebeck et al., “A large, fast instruction window for tolerating cache misses”, ISCA-29, 2002.

Brekelbaum et al., “Hierarchical scheduling windows”, ISCA-35, 2002.

Cristal et al., “Out-of-Order Commit Processors”, TR UPC-DAC-2003-44, July 2003 & HPCA-10, Feb. 2004

Page 28: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

28

Load/Store Queues

Efficient and affordable memory disambiguation is mandatory for kilo-instruction processors We need to guarantee that loads and stores arrive to

the memory in the correct order

Increasing the number of in-flight instructions, can make the load/store queues a true bottleneck both in latency and power

Page 29: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

29

1 10 25 50 75 90 1000

100

200

300

400

500

600

LD

Qu

eu

e

Distribution of in-flight Instructions

DeadBlocked-LongBlocked-ShortReplayableLive

1168 1382 1607 1868 1955

Number of Instructions

FP

CheckpointingEarly Release

State of LD Queues (specFP, ROB=2048)

A. Cristal, et al, “ A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, October 2003

Page 30: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

30

State of ST Queues (specFP, ROB=2048)

1 10 25 50 75 90 1000

50

100

150

200

250

300

ST

Qu

eu

e

Distribution of in-flight Instructions

ReadyAddress ReadyBlocked-LongBlocked-Short

1168 1382 1607 1868 1955

Number of Instructions

FP

Locality

A. Cristal, et al, “ A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, October 2003

Page 31: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

31

Search Filtering

Determine independence without associative search on addresses

Use Bloom Filter to control associative search Approximate tracking (false positives

are possible) No false negatives => no

mispredictions

Address: 0xabcd

HashFunction

LSQ

1

Filter

Associatively searchIf hashed bit is set to 1

S. Sethumadhavan et al. “Scalable Hardware Memory Disambiguation for High ILP Processors” Micro-36, 2003

Page 32: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

32

0

0.5

1

1.5

2

2.5

3

3.5

512 1024 2048 512 1024 2048 512 1024 2048

100 500 1000

Virtual TagsMemory Latency

IPC

256

512

Limit 4096

Limit 4096

Limit 4096

Baseline 128

Baseline 128

Baseline 128

Putting It All TogetherPhysicalRegisters

Virtual Registers

IQs of 128 entries

Memory Latency

A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003

Page 33: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

33

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction´s Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 34: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

34

0

0.5

1

1.5

2

2.5

3

3.5

FFT RADIX LU MP3D WATER

16 processors

IPC

ROB 64

ROB 128

ROB 512

ROB 1024

ROB 2048

First results: Ideal Network

“Kilo-processor” and multiprocessor systems

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

Page 35: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

35

0

0.5

1

1.5

2

2.5

3

3.5

FFT RADIX LU

16 processors

IPC

IDEAL NET&MEM

IDEAL NET

BADA

Impact of the network-ROB 64

“Kilo-processor” and multiprocessor systems

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

Page 36: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

36

0

0,5

1

1,5

2

2,5

3

3,5

64 128 512 1024 2048 64 128 512 1024 2048 64 128 512 1024 2048

FFT RADIX LU

ROB Size / Benchmark

IPC

IDEAL NET

BADA

IDEAL NET & MEM

First Results

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

“Kilo-processor” and multiprocessor systems

Page 37: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

37

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

3.0E+05 8.0E+05 1.3E+06 1.8E+06 2.3E+06 2.8E+06 3.3E+06

Execution cycles

Pro

cess

or

cycl

es

2048

1024

512

q

Network latency, Radix, 250 cyc. latency

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

“Kilo-processor” and multiprocessor systems

Page 38: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

38

“Kilo-vector” processor

20 80Program:

20 8Program:

5Program:

8

Kilo

Vector

Speedup: 3.5

Speedup: 7.7

F. Quintana et al, “Kilo-vector” processors, UPC-DAC

Page 39: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

39

Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:

Multi-Checkpointing the ROB• Out-of-Order Commit

Early Release of Resources• Ephemeral Registers• Load Queues

Locality Exploitation• Instruction´s Queues• LSQ

Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:

• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution

Conclusion

Page 40: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

40

“Kilo-valpred” processor

0

0.5

1

1.5

2

2.5

3

normal Checkpoint ValuePrediction

Hybrid

SpecFP

SpecINT

T. Ramírez et al. “Kilo-value prediction” processor… UPC-DAC

Page 41: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

41

Kilo and Control Independence

More opportunities to find control independent instructions Squash reuse Control-independent

instruction reexecution removal Savings:

• Power/energy• Execution bandwidth• Resources

Helps to go far ahead in the instruction window faster

% C.I. instructions

0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0,45

INT

Page 42: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

42

UPC contribution to “kilo” processors We started our work in June 2001 Grant proposal to Intel-MRL (Konrad Lai and Ronny Ronen) in January 28th. 2002 Presentation to Intel-MRL in February 2002 A. Cristal, et al. “Large virtual ROBs by processor checkpointing” Technical

Report UPC-DAC-2002-39, July 2002. (Rejected for Micro-2002)• Multiple Checkpointers• Out-of-order Commit, No need for ROB• Early release of registers and loads

A. Cristal and M. Valero, ”ROBs virtuales utilizando checkpointing”. Spanish Workshop on Parallelism. Lleida, Sept., 2002

• Same as the previous report, but in Spanish A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical

Report CSL-TR-2003-1035 , 2003. Rejected for ISCA 2003 and Micro 2003• Ckeckpoint + Early Release + Late allocation of registers

Presentation to Intel-MRL in March 2003 A. Cristal, J. Martínez, J. LLosa and M. Valero, “ A case for resource-conscious

out-of-order processors”, IEEE TCCA Computer Architecture Letters, Vol. 2, October 2003

• Underutilization of resources

Page 43: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

43

UPC contribution to “kilo” processors

A. Cristal, et al. “ A case for resource-conscious out-of-order processors: Towards Kilo-instruction in-flight processors”. MEDEA Workshop, Sept 2003 and ACM-CAN, March 2004

A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003

A. Cristal et al. “Future ILP Processors”. Invited paper. IJHPCN, to be published

A. Cristal, et al. “Out-of-Order Commit Processors” Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004

• Remove-Reinsert Mechanism• Simple reinsert mechanism

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

Much new work done at this moment

Page 44: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

44

Talks about “Kilo” processors, from UPC Presentation in Barcelona, to Intel-MRL in February 2002 Spanish Workshop on Parallelism. Lleida, Sept., 2002 Presentation to Intel-MRL in March 2003 Invited presentation. NSF Panel “On the Future of Computer

Architecture Research: Wise Views and Fresh Perspectives”. San Diego, June 2003

Invited Lecture. PA3CT Conference. Edegem, Belgium, September 22-23, 2003

MEDEA Workshop. New Orleans, September 2003 Invited Lecture. ISHPC-V. The 5th International Symposium on High

Performance Computing. Tokyo, Japan, October 20-22, 2003 Keynote lecture. Seminar on Compilers and Architecture. IBM Haifa.

November 11th., 2003. Invited lecture. Intel MRL. Haifa., Israel. Nov. 12th., 2003 HPCA-10, Madrid, February 14-18, 2003 Keynote lecture. HPCA-10. Madrid, February 14-18, 2003 Invited lecture. ACM Computing Frontiers. Ischia, April, 2004 ACM Invited lecture. ENCAR México, May 2004 More future presentations scheduled

Page 45: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

45

Memory Latency

Jouppi and P. Ranganathan. “ The relative importance of memory latency, bandwidth and branch prediction” Whorkshop on Mixing Logic and DRAM: Chips that compute and remember”, during ISCA-24, 1997

S. Srinivasan and A. Lebeck, “ Load latency tolerance in dynamically scheduled processors”, Micro-31, 1998

K. Skadron, P. Ahuja, M. Martonosi and D. Clark “Branch prediction, instruction window size and cache size: Performance tradeoffs and simulation techniques” IEEE-TC, pp. 1260-1281, 1999.

Page 46: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

46

Large Reorder Buffers G. Sohi, S. Breach, and T. N. Vijaykumar “Multiscalar processors”

ISCA-22, 1995. E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith “Trace

processors” ISCA-24, 1997 H. Akkari and M. Driscoll “A dynamic multithreaded processor”

Micro-31, 1998 R. Balasubramonian, S. Dwarkadas, and D.

Albonesi.“Dynamically allocating processor resources between nearby and distant ilp” ISCA, June 2001.

• Save some resources allocated for eager execution P. Ranganathan, V. Pai, and S. Adve “Using speculative

retirement and large instruction windows to narrow the performance gap between memory consistency models” SPAA, 1997

J. M. Tendler, S. Dodson, S. Fields, H. Lee, and B. Sinharoy “Power4 System Microarchitecture” IBM Journal of Research and Development, pp. 5-25, January 2002.

Page 47: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

47

Checkpointing

W.M. Hwu and Y. N. Patt, “Checkpoint repair for out-of-order execution machines” ISCA-14, 1987.

• Checkpointing as a recovery mechanism

Early Release of Resources A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by

processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002.

• Multiple Checkpointers• Out-of-order Commit, No need for ROB• Early release of registers and loads

J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order microprocessors. MICRO-35, Nov. 2002.

• One checkpoint• Early release of resources

Page 48: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

48

Register File M. Moudgill and K. Pingali and S. Vassiliadis, “Register renaming and

dynamic speculation: an alternative approach”, In Proceedings of the 26th annual international symposium on Microarchitecture, 1993.

• Early Release of Registers

T. Monreal, A. González, M. Valero, J. González, V. Viñals, “Delaying Physical Register Allocation through Virtual-Physical Registers”, In Proceedings of the 33th annual international symposium on Microarchitecture, 1999.

• Virtual Registers, Late allocation of registers

A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003.

• Ckeckpoint + Early Release + Late allocation of registers

T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)

Page 49: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

49

Instruction Queues S. Palacharla, N.P. Jouppi, and J.E. Smith “Complexity-effective

superscalar processors” ISCA-24, 1997.• Divide the Instruction queues in a set of FIFO queues

A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg “A large, fast instruction window for tolerating cache misses” ISCA-29, 2002.

• Remove-Reinsert Mechanism• Keep the load dependence of all instructions

E. Brekelbaum, J. Rupley, C.Wilkerson, and B. Black “Hierarchical scheduling windows” ISCA-35, 2002.

• Two clusters, a slow/big one, and a faster/small one for critical instructions

A. Cristal, D. Ortega, J. Llosa and M. Valero “Out-of-Order Commit Processors” Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004

• Remove-Reinsert Mechanism• Simple reinsert mechanism

Page 50: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

50

References for LSQ for Large ROB A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by

processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002

J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. “Cherry: checkpointed early resource recycling in out-of-order microprocessors”. MICRO-35, 2002

H. Akkari, R. Rajwar and S. T. Srinivasan “Checkpointing Processing and Recovery: Towards Scalable Large Instruction Window Processors” Micro-36, 2003

S. Sethumadhavan, R. Desikan, D. Burger, C.R. Moore and S. W. Keckler “Scalable Hardware Memory Disambiguation for High ILP Processors” Micro-36, 2003

Page 51: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

51

Conclusion Affordable “Kilo-instruction Processors” Checkpointing and resource-conscious architectures

Out-of- order commit Ephemeral registers Two-level instruction queues Early release of loads Load/store queue management

New ideas to watch for Better branch predictors Predication and Multi-path execution Control and data independent instructions Reuse of large blocks of instructions

New processor paradigms: “Kilo-based” multiprocessor systems “Kilo-vector” processors “Kilo-SMT” processors “Kilo-valpred” processors

Page 52: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

52

Acknowledgments Adrián Cristal José Martínez

Josep Llosa Daniel Ortega Fran Cazorla Enrique Fernández Ayose Falcón Alex Pajuelo Marco Galluzzi Tanausu Ramírez Jim Smith

Yale Patt Alex Veidenbaum Guri Sohi Mark Hill Wen-mei Hwu “Mon” Beivide Valentín Puente José Angel Gregorio Teresa Monreal Victor Viñals

Intel, Konrad Lai and Ronny Ronen

Page 53: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

53

Thank you very much

Page 54: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

54

Processor-DRAM Gap (latency)

µProc60%/yr.

DRAM7%/yr.

1

10

100

10001980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU

1982

Processor-MemoryPerformance Gap:(grows 50% / year)

Per

form

ance

Time

“Moore’s Law”

D.A. Patterson “ New directions in Computer Architecture” Berkeley, June 1998

Page 55: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

55

Runahead Execution ROB

x

x

x

branch 3

x

b

x

load 2

x

branch

x

branch 1

x

a

x

load 1

L2 cache miss

CheckpointCheckpoint

INV

INV

INV

INV

INV

Runahead Mode- generate bogus value- invalidate dep. registers- continue execution

- Virtually increments ROB size- Prefetch data of future loads

Mutlu et al.: “Runahead Execution: An Alternative…”, HPCA’03

Page 56: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

56

Kilo and Control Independence

Larger windows improve: The probability of finding the

reconvergence point The correct detection of

controlindependent instructions because the wrong path is completely executed

The execution of more controlindependent instructions for later reuse

Wrongpath

Correctpath

RP

currentinstructionwindows

kilo-instructionwindows

CI

Page 57: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

57

0.60

0.65

0.70

0.75

0.80

0.85

0.90

0.95

1.00

INT

Kilo and Control Independence

The larger the window the more opportunities to find the reconvergence point.

Current instruction windows

Page 58: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

58

Grant Proposal to Intel January 28th, 2002 In the first semester we worked on the smart register file and the associated ISA, and

evaluated the proposed architecture with a few kernels. We showed speedups around 20% in the tested kernels. At the end of the first semester, we began the work on wide registers.From April to August 2001, we have been investigating three different approaches to use register files with wide ports (i.e. ports that allow to read various consecutive registers in a single access). The first one was trying to find subgraphs in the data dependence graph that have the same shape. The drawback of this approach is that it requires to move loads above stores in order to have a significant coverage. Some type of dependence speculation that adds a non-negligible complexity is required. We also did a study of the potential to exploit wide registers by looking at instructions in a window of 32 instructions. For Spec95 programs, we obtained that 48.9 % and 52.3 % of the operands were not wide for integer and FP codes respectively. We continue working on an approach that tries to group the two values of all two-operand instructions in a single wide register.

Since August 2001, we have been working on committing instructions out of order that allows to free in advance processor resources and to continue the execution of new instructions. The main idea is as follows: when the processor finds an ‘old’ instruction in the ROB with a large latency and the ROB is full, the processor removes this instruction by checkpointing the state of the processor at the last committed instruction. The processor continues its work “normally” and it moves all instructions that depend on the checkpointed instruction, to the checkpointing table. In case of misspeculation or an exception of either the checkpointed instruction or any instruction dependent on it, the checkpointed state is restored. The design of the mechanism is still in progress. We are building a simulation environment that will permit us to evaluate the proposal.

The work we plan to do during this year concerning to the out-of-order commit mechanism is the following:

To finish the simulator to start with the evaluation of different alternatives for the implementation of the out-of-order commit mechanism

To optimize the mechanism for those branches where the branch predictor fails frequently. To study new organizations for the load-store queues. To use the concept of virtual registers to optimize the register file organization. Concerning to the work dealing with wide registers, we are going to finish the design of the mechanism

and to evaluate it.

Page 59: Kilo-instruction Processors Mateo Valero, UPC HPCA-10, Madrid February 14-17 th 2004

59

Grant Proposal to Intel January 28th, 2002 Since August 2001, we have been working on committing instructions

out of order that allows to free in advance processor resources and to continue the execution of new instructions. The main idea is as follows: when the processor finds an ‘old’ instruction in the ROB with a large latency and the ROB is full, the processor removes this instruction by checkpointing the state of the processor at the last committed instruction. The processor continues its work “normally” and it moves all instructions that depend on the checkpointed instruction, to the checkpointing table. In case of misspeculation or an exception of either the checkpointed instruction or any instruction dependent on it, the checkpointed state is restored. The design of the mechanism is still in progress. We are building a simulation environment that will permit us to evaluate the proposal.

The work we plan to do during this year concerning to the out-of-order commit mechanism is the following:

To finish the simulator to start with the evaluation of different alternatives for the implementation of the out-of-order commit mechanism

To optimize the mechanism for those branches where the branch predictor fails frequently.

To study new organizations for the load-store queues. To use the concept of virtual registers to optimize the register file

organization. Concerning to the work dealing with wide registers, we are going to finish

the design of the mechanism and to evaluate it.