kilo-instruction processors mateo valero, upc hpca-10, madrid february 14-17 th 2004
TRANSCRIPT
Kilo-instruction Processors
Mateo Valero, UPC
HPCA-10, MadridFebruary 14-17th 2004
2
Motivation
Justin Rattner, Intel-MRL, Keynote lecture, Micro-32
Technology works against ILP: Faster clock rates => Lower ILP
3
1990’s architecture Short pipelines Low memory latencies
2010+ architectures Long pipelines
• 30-50 stages Power-Thermal-Wire
delay aware architecture Long memory latencies
• 500 to 1000 cycles• ISCA-2003: 50 to 160
The trends are changing
QueueScheduleScheduleScheduleDispatchDispatch
Reg. ReadReg. ReadExecute
FlagsBr. chkDrive
DriveAlloc.
RenameRename
Next IPNext IPFetchFetch L1
Instr.
L1Data
L2
Mem
ory
Bra
nc h
mis
pr e
dic
tion
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
4
128 in-flight inst.
0.0
0.5
1.0
1.5
2.0
2.5
3.0
int 4way fp 4way int 8way fp 8way
IPC 100
500
1000
latency
Memory Wall Problem
Memory latency has enormous impact on IPC
0.6X
0.45X
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
5
Reducing Memory Latency
Technology Caches Prefetching
Hardware, Software and combined Assisted/SSMT Threads
“Kilo-instruction” Processor
6
“Kilo-instruction” Processors Our goals
Better tolerate increasing memory latency Further improve ILP, even for such longer memory
latency Allow additional optimizations enabled by the new
architecture (See below)
Our proposal: “Kilo-instruction” Processors Out-Of-Order processors with thousands of
instructions in-flight (Very Large Instruction Windows) Intelligent use of resources (Resource requirements
growing much slower than window size)
7
“Kilo-instruction Processsor”
It is not….. A “heavy” processor Cyber-205 like processor Vector Processor Blue-Gene like Multiscalar,Trace Processor Raw, Imagine, Levo,TRIPS
It is ……. An Affordable O-O-O Superscalar Processor having
“Thousands of In-flight Instructions”
8
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
9
branch 3
ROB ActivityROB Register File
IQ
x
xx
branch 3xbx
load 2x
branchx
branch 1xax
load 1
load 1a
branch 1load 2
b128-entry
1024-entry
ROB full:
Lat=100 Lat=1000
int 62% 72%
fp 76% 90%
ROB full:
Lat=100 Lat=1000
int 30% 48%
fp 50% 75%
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
10
0,0
0,5
1,0
1,5
2,0
2,5
3,0
3,5
4,0
4,5
5,0
128 512 1024 4096 128 512 1024 4096
perceptron perfect
ROB Size / Branch Predictor
IPC
1005001000
MemoryLatency
Perfect Mem. & Perfect BP
Perfect Mem. & Perceptron BP
Integer, 8-way, L2 1MB
0.6X
1.22X
1.41X1.86X
1.1X
Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
11
0,0
1,0
2,0
3,0
4,0
5,0
6,0
128 512 1024 4096 8192 128 512 1024 4096 8192
perceptron perfect
ROB Size / Branch Predictor
IPC
1005001000
MemoryLatency
Perfect Mem. & Perfect BP
Perfect Mem. & Perceptron BP
Floating-point, 8-way, L2 1MB
0.45X
2.34X
3.91X 4.58X2X
Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
12
Scalability Thousands of In-flight Instructions and In-
Order Commit make designs impractical: ROB : Needs to maintain a copy of every in-
flight instruction IQs : Instructions depending on long
latency instructions remain in these queues for a long time
LSQs : Instructions remain in the queue until commit
Registers : A new physical register for each instruction producing a new value
We would like to get the IPC of thousands of instructions in-flight without drastically increasing resource requirements
Re
ord
er B
uffe
r
IQF
PQ
LSQ
Inte
ger
Reg
. Fil e
FP
Re
g. F
il e
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
13
Late Allocation/Early Release of RegistersRegister File
IQload 1
abranch 1
load 2b
a, b, c
R1, R2
Virtual Registers
Early Release
ROB
cx
branch 3xbx
load 2branch 2
xR1
branch 1xax
load 1
R2
a
b
c
R1
R2
c
Monreal et al.: “Delaying physical register allocation through virtual-physical registers”, MICRO’99T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)
14
load
branch
Nearby & Distant Parallelism
ROB Register Fileload
Speculative
Replayable
f(X)
X
load
branch
load
branch
Balasubramonian et al.: “Dynamically Allocating Processor Resources…”, ISCA’01
Nearby
Distant
15
Dynamic Vectorization
load
br
C.I. 1
C.I. 2
ROB_head
ROB_tail
…C.I.1
…C.I.2
registerfile
control-independent instructionspeculative pre-execution based
on the instructions in the ROB
when the L2 miss load commitspreexecuted control-independent
instructions can reuse data
ROB_head
load
A. Pajuelo et al. “Control-Flow Independence Reuse via Dynamic Vectorization”, UPC-DAC
16
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
17
Checkpointing the ROB
Checkpointing to support precise exceptions: Quite well established and used technique
• W.M.Hwu and Y.N.Patt, ISCA 1987
Checkpointing to early release resources: Quite recent concept
• Cherry: J. Martínez et al., MICRO, Nov. 2002• Large VROB: A. Cristal et al. TR-UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
18
Cherry ROB
branch
load
Early Release- registers- loads- stores
Point of no return(PNR)
Martínez et al.: “Cherry: Checkpointed Early Resource Recycling…”, MICRO’02
irreversible
reversible
CherryCherry
19
Multi-Checkpoint
IQ
ROB
x
x
x
load 3
x
b
x
branch 2
x
branch
x
branch 1
x
a
x
load 1 Checkpointing TableCheckpoint 1Checkpoint 1
OOO commit
load 1: PC, status, counter, …
load 1a
branch 1
load 1Checkpoint 2Checkpoint 2
branch 2: PC, status, counter, …
x
x
x
x
branch 4
x
x
x
load 4
x
x
load 3
x
b
x
branch 2
branch 2b
load 3
Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002Research Proposal to Intel (July 2001) and presentation to Intel-MRL Feb. 2002
branch 2 load 1load 1 arrivesarrives
Gang commitGang commitCheckpoint 1Checkpoint 1
20
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
21
Early Release of Resources
CommitCommit
Resource
Assignment
---
Release
ROB, IQ, registers
Dispatch, RenamingDispatch, Renaming
Resource
Assignment
ROB, IQ, registers
ReleaseIQ(issued)
FetchFetch
CommitCommitMemory LatencyMemory Latencyi.e, 1000 cyclesi.e, 1000 cycles
T. Karkhanis and J.Smith, “A day in the life of a data cache miss” Workshop Memory Performance Issues. ISCA-2002M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
22
Registers
Register File is a critical component of a modern superscalar processor Large number of entries to support out-of-order
execution and memory latency Large number of ports to increase issue width
Power and access time are key issues for register file design
It is always beneficial, to reduce the number of physical registers
23
Conventional renaming scheme
Virtual-Physical Registers
Early Release
Ephemeral Registers: checkpoint + virtual-physical
Physical Registers
Icache Decode&Rename Commit
Register Unused Register Used Register Unused
Register Used Register Unused
Register Unused Register Used
Register Used
T. Monreal et al.: “Delaying physical register allocation through virtual-physical registers”, MICRO’99M. Moudgill et al, “Register renaming and dynamic speculation: an alternative approach”, MICRO93T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)J. Martínez et al, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003
24
State of Registers (FP, ROB=2048)
1 10 25 50 75 90 1000
200
400
600
800
1000
1200
1400
FP
Re
gis
ters
Distribution of in-flight Instructions
Dead
Blocked-Long
Blocked-Short
Live
1168 1382 1607 1868 1955
Early Release
Late Allocation
Number of Instructions
A. Cristal, et al, “ A case for resource-concious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
25
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction´s Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
26
Increasing the number of IQ entries increase the power, area and access time
Wake-up and selection logic need to be done efficiently
“Kilo-instruction” processors may have many “in-flight” instructions
We need new organization for the IQs in order to have affordable “kilo-instruction processors”
IQs and “Kilo” processors
27
1
2
3
slowfast medium
x
x
branch 4
x
x
x
load 4
x
x
load 3
x
b
x
branch 2
Secondary Buffer
ROB
13
31
2
IQ
Execution Time of Instructions
Lebeck et al., “A large, fast instruction window for tolerating cache misses”, ISCA-29, 2002.
Brekelbaum et al., “Hierarchical scheduling windows”, ISCA-35, 2002.
Cristal et al., “Out-of-Order Commit Processors”, TR UPC-DAC-2003-44, July 2003 & HPCA-10, Feb. 2004
28
Load/Store Queues
Efficient and affordable memory disambiguation is mandatory for kilo-instruction processors We need to guarantee that loads and stores arrive to
the memory in the correct order
Increasing the number of in-flight instructions, can make the load/store queues a true bottleneck both in latency and power
29
1 10 25 50 75 90 1000
100
200
300
400
500
600
LD
Qu
eu
e
Distribution of in-flight Instructions
DeadBlocked-LongBlocked-ShortReplayableLive
1168 1382 1607 1868 1955
Number of Instructions
FP
CheckpointingEarly Release
State of LD Queues (specFP, ROB=2048)
A. Cristal, et al, “ A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, October 2003
30
State of ST Queues (specFP, ROB=2048)
1 10 25 50 75 90 1000
50
100
150
200
250
300
ST
Qu
eu
e
Distribution of in-flight Instructions
ReadyAddress ReadyBlocked-LongBlocked-Short
1168 1382 1607 1868 1955
Number of Instructions
FP
Locality
A. Cristal, et al, “ A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, October 2003
31
Search Filtering
Determine independence without associative search on addresses
Use Bloom Filter to control associative search Approximate tracking (false positives
are possible) No false negatives => no
mispredictions
Address: 0xabcd
HashFunction
LSQ
1
Filter
Associatively searchIf hashed bit is set to 1
S. Sethumadhavan et al. “Scalable Hardware Memory Disambiguation for High ILP Processors” Micro-36, 2003
32
0
0.5
1
1.5
2
2.5
3
3.5
512 1024 2048 512 1024 2048 512 1024 2048
100 500 1000
Virtual TagsMemory Latency
IPC
256
512
Limit 4096
Limit 4096
Limit 4096
Baseline 128
Baseline 128
Baseline 128
Putting It All TogetherPhysicalRegisters
Virtual Registers
IQs of 128 entries
Memory Latency
A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003
33
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction´s Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
34
0
0.5
1
1.5
2
2.5
3
3.5
FFT RADIX LU MP3D WATER
16 processors
IPC
ROB 64
ROB 128
ROB 512
ROB 1024
ROB 2048
First results: Ideal Network
“Kilo-processor” and multiprocessor systems
M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
35
0
0.5
1
1.5
2
2.5
3
3.5
FFT RADIX LU
16 processors
IPC
IDEAL NET&MEM
IDEAL NET
BADA
Impact of the network-ROB 64
“Kilo-processor” and multiprocessor systems
M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
36
0
0,5
1
1,5
2
2,5
3
3,5
64 128 512 1024 2048 64 128 512 1024 2048 64 128 512 1024 2048
FFT RADIX LU
ROB Size / Benchmark
IPC
IDEAL NET
BADA
IDEAL NET & MEM
First Results
M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
“Kilo-processor” and multiprocessor systems
37
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
3.0E+05 8.0E+05 1.3E+06 1.8E+06 2.3E+06 2.8E+06 3.3E+06
Execution cycles
Pro
cess
or
cycl
es
2048
1024
512
q
Network latency, Radix, 250 cyc. latency
M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
“Kilo-processor” and multiprocessor systems
38
“Kilo-vector” processor
20 80Program:
20 8Program:
5Program:
8
Kilo
Vector
Speedup: 3.5
Speedup: 7.7
F. Quintana et al, “Kilo-vector” processors, UPC-DAC
39
Outline Motivation Increasing the number of in-flight instructions Kilo-instruction Processor Ingredients:
Multi-Checkpointing the ROB• Out-of-Order Commit
Early Release of Resources• Ephemeral Registers• Load Queues
Locality Exploitation• Instruction´s Queues• LSQ
Cross-pollination with other techniques: “Kilo-processor” and multiprocessor systems “kilo-vector” processor “Kilo-SMT” processor Further Improvements:
• Branch prediction• “kilo-valpred” processor• Control Independence• Reuse• Predicated and multipath execution
Conclusion
40
“Kilo-valpred” processor
0
0.5
1
1.5
2
2.5
3
normal Checkpoint ValuePrediction
Hybrid
SpecFP
SpecINT
T. Ramírez et al. “Kilo-value prediction” processor… UPC-DAC
41
Kilo and Control Independence
More opportunities to find control independent instructions Squash reuse Control-independent
instruction reexecution removal Savings:
• Power/energy• Execution bandwidth• Resources
Helps to go far ahead in the instruction window faster
% C.I. instructions
0
0,05
0,1
0,15
0,2
0,25
0,3
0,35
0,4
0,45
INT
42
UPC contribution to “kilo” processors We started our work in June 2001 Grant proposal to Intel-MRL (Konrad Lai and Ronny Ronen) in January 28th. 2002 Presentation to Intel-MRL in February 2002 A. Cristal, et al. “Large virtual ROBs by processor checkpointing” Technical
Report UPC-DAC-2002-39, July 2002. (Rejected for Micro-2002)• Multiple Checkpointers• Out-of-order Commit, No need for ROB• Early release of registers and loads
A. Cristal and M. Valero, ”ROBs virtuales utilizando checkpointing”. Spanish Workshop on Parallelism. Lleida, Sept., 2002
• Same as the previous report, but in Spanish A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical
Report CSL-TR-2003-1035 , 2003. Rejected for ISCA 2003 and Micro 2003• Ckeckpoint + Early Release + Late allocation of registers
Presentation to Intel-MRL in March 2003 A. Cristal, J. Martínez, J. LLosa and M. Valero, “ A case for resource-conscious
out-of-order processors”, IEEE TCCA Computer Architecture Letters, Vol. 2, October 2003
• Underutilization of resources
43
UPC contribution to “kilo” processors
A. Cristal, et al. “ A case for resource-conscious out-of-order processors: Towards Kilo-instruction in-flight processors”. MEDEA Workshop, Sept 2003 and ACM-CAN, March 2004
A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003
A. Cristal et al. “Future ILP Processors”. Invited paper. IJHPCN, to be published
A. Cristal, et al. “Out-of-Order Commit Processors” Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004
• Remove-Reinsert Mechanism• Simple reinsert mechanism
M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
Much new work done at this moment
44
Talks about “Kilo” processors, from UPC Presentation in Barcelona, to Intel-MRL in February 2002 Spanish Workshop on Parallelism. Lleida, Sept., 2002 Presentation to Intel-MRL in March 2003 Invited presentation. NSF Panel “On the Future of Computer
Architecture Research: Wise Views and Fresh Perspectives”. San Diego, June 2003
Invited Lecture. PA3CT Conference. Edegem, Belgium, September 22-23, 2003
MEDEA Workshop. New Orleans, September 2003 Invited Lecture. ISHPC-V. The 5th International Symposium on High
Performance Computing. Tokyo, Japan, October 20-22, 2003 Keynote lecture. Seminar on Compilers and Architecture. IBM Haifa.
November 11th., 2003. Invited lecture. Intel MRL. Haifa., Israel. Nov. 12th., 2003 HPCA-10, Madrid, February 14-18, 2003 Keynote lecture. HPCA-10. Madrid, February 14-18, 2003 Invited lecture. ACM Computing Frontiers. Ischia, April, 2004 ACM Invited lecture. ENCAR México, May 2004 More future presentations scheduled
45
Memory Latency
Jouppi and P. Ranganathan. “ The relative importance of memory latency, bandwidth and branch prediction” Whorkshop on Mixing Logic and DRAM: Chips that compute and remember”, during ISCA-24, 1997
S. Srinivasan and A. Lebeck, “ Load latency tolerance in dynamically scheduled processors”, Micro-31, 1998
K. Skadron, P. Ahuja, M. Martonosi and D. Clark “Branch prediction, instruction window size and cache size: Performance tradeoffs and simulation techniques” IEEE-TC, pp. 1260-1281, 1999.
46
Large Reorder Buffers G. Sohi, S. Breach, and T. N. Vijaykumar “Multiscalar processors”
ISCA-22, 1995. E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith “Trace
processors” ISCA-24, 1997 H. Akkari and M. Driscoll “A dynamic multithreaded processor”
Micro-31, 1998 R. Balasubramonian, S. Dwarkadas, and D.
Albonesi.“Dynamically allocating processor resources between nearby and distant ilp” ISCA, June 2001.
• Save some resources allocated for eager execution P. Ranganathan, V. Pai, and S. Adve “Using speculative
retirement and large instruction windows to narrow the performance gap between memory consistency models” SPAA, 1997
J. M. Tendler, S. Dodson, S. Fields, H. Lee, and B. Sinharoy “Power4 System Microarchitecture” IBM Journal of Research and Development, pp. 5-25, January 2002.
47
Checkpointing
W.M. Hwu and Y. N. Patt, “Checkpoint repair for out-of-order execution machines” ISCA-14, 1987.
• Checkpointing as a recovery mechanism
Early Release of Resources A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by
processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002.
• Multiple Checkpointers• Out-of-order Commit, No need for ROB• Early release of registers and loads
J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order microprocessors. MICRO-35, Nov. 2002.
• One checkpoint• Early release of resources
48
Register File M. Moudgill and K. Pingali and S. Vassiliadis, “Register renaming and
dynamic speculation: an alternative approach”, In Proceedings of the 26th annual international symposium on Microarchitecture, 1993.
• Early Release of Registers
T. Monreal, A. González, M. Valero, J. González, V. Viñals, “Delaying Physical Register Allocation through Virtual-Physical Registers”, In Proceedings of the 33th annual international symposium on Microarchitecture, 1999.
• Virtual Registers, Late allocation of registers
A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003.
• Ckeckpoint + Early Release + Late allocation of registers
T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)
49
Instruction Queues S. Palacharla, N.P. Jouppi, and J.E. Smith “Complexity-effective
superscalar processors” ISCA-24, 1997.• Divide the Instruction queues in a set of FIFO queues
A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg “A large, fast instruction window for tolerating cache misses” ISCA-29, 2002.
• Remove-Reinsert Mechanism• Keep the load dependence of all instructions
E. Brekelbaum, J. Rupley, C.Wilkerson, and B. Black “Hierarchical scheduling windows” ISCA-35, 2002.
• Two clusters, a slow/big one, and a faster/small one for critical instructions
A. Cristal, D. Ortega, J. Llosa and M. Valero “Out-of-Order Commit Processors” Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004
• Remove-Reinsert Mechanism• Simple reinsert mechanism
50
References for LSQ for Large ROB A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by
processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002
J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. “Cherry: checkpointed early resource recycling in out-of-order microprocessors”. MICRO-35, 2002
H. Akkari, R. Rajwar and S. T. Srinivasan “Checkpointing Processing and Recovery: Towards Scalable Large Instruction Window Processors” Micro-36, 2003
S. Sethumadhavan, R. Desikan, D. Burger, C.R. Moore and S. W. Keckler “Scalable Hardware Memory Disambiguation for High ILP Processors” Micro-36, 2003
51
Conclusion Affordable “Kilo-instruction Processors” Checkpointing and resource-conscious architectures
Out-of- order commit Ephemeral registers Two-level instruction queues Early release of loads Load/store queue management
New ideas to watch for Better branch predictors Predication and Multi-path execution Control and data independent instructions Reuse of large blocks of instructions
New processor paradigms: “Kilo-based” multiprocessor systems “Kilo-vector” processors “Kilo-SMT” processors “Kilo-valpred” processors
52
Acknowledgments Adrián Cristal José Martínez
Josep Llosa Daniel Ortega Fran Cazorla Enrique Fernández Ayose Falcón Alex Pajuelo Marco Galluzzi Tanausu Ramírez Jim Smith
Yale Patt Alex Veidenbaum Guri Sohi Mark Hill Wen-mei Hwu “Mon” Beivide Valentín Puente José Angel Gregorio Teresa Monreal Victor Viñals
Intel, Konrad Lai and Ronny Ronen
53
Thank you very much
54
Processor-DRAM Gap (latency)
µProc60%/yr.
DRAM7%/yr.
1
10
100
10001980
1981
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
DRAM
CPU
1982
Processor-MemoryPerformance Gap:(grows 50% / year)
Per
form
ance
Time
“Moore’s Law”
D.A. Patterson “ New directions in Computer Architecture” Berkeley, June 1998
55
Runahead Execution ROB
x
x
x
branch 3
x
b
x
load 2
x
branch
x
branch 1
x
a
x
load 1
L2 cache miss
CheckpointCheckpoint
INV
INV
INV
INV
INV
Runahead Mode- generate bogus value- invalidate dep. registers- continue execution
- Virtually increments ROB size- Prefetch data of future loads
Mutlu et al.: “Runahead Execution: An Alternative…”, HPCA’03
56
Kilo and Control Independence
Larger windows improve: The probability of finding the
reconvergence point The correct detection of
controlindependent instructions because the wrong path is completely executed
The execution of more controlindependent instructions for later reuse
Wrongpath
Correctpath
RP
currentinstructionwindows
kilo-instructionwindows
CI
57
0.60
0.65
0.70
0.75
0.80
0.85
0.90
0.95
1.00
INT
Kilo and Control Independence
The larger the window the more opportunities to find the reconvergence point.
Current instruction windows
58
Grant Proposal to Intel January 28th, 2002 In the first semester we worked on the smart register file and the associated ISA, and
evaluated the proposed architecture with a few kernels. We showed speedups around 20% in the tested kernels. At the end of the first semester, we began the work on wide registers.From April to August 2001, we have been investigating three different approaches to use register files with wide ports (i.e. ports that allow to read various consecutive registers in a single access). The first one was trying to find subgraphs in the data dependence graph that have the same shape. The drawback of this approach is that it requires to move loads above stores in order to have a significant coverage. Some type of dependence speculation that adds a non-negligible complexity is required. We also did a study of the potential to exploit wide registers by looking at instructions in a window of 32 instructions. For Spec95 programs, we obtained that 48.9 % and 52.3 % of the operands were not wide for integer and FP codes respectively. We continue working on an approach that tries to group the two values of all two-operand instructions in a single wide register.
Since August 2001, we have been working on committing instructions out of order that allows to free in advance processor resources and to continue the execution of new instructions. The main idea is as follows: when the processor finds an ‘old’ instruction in the ROB with a large latency and the ROB is full, the processor removes this instruction by checkpointing the state of the processor at the last committed instruction. The processor continues its work “normally” and it moves all instructions that depend on the checkpointed instruction, to the checkpointing table. In case of misspeculation or an exception of either the checkpointed instruction or any instruction dependent on it, the checkpointed state is restored. The design of the mechanism is still in progress. We are building a simulation environment that will permit us to evaluate the proposal.
The work we plan to do during this year concerning to the out-of-order commit mechanism is the following:
To finish the simulator to start with the evaluation of different alternatives for the implementation of the out-of-order commit mechanism
To optimize the mechanism for those branches where the branch predictor fails frequently. To study new organizations for the load-store queues. To use the concept of virtual registers to optimize the register file organization. Concerning to the work dealing with wide registers, we are going to finish the design of the mechanism
and to evaluate it.
59
Grant Proposal to Intel January 28th, 2002 Since August 2001, we have been working on committing instructions
out of order that allows to free in advance processor resources and to continue the execution of new instructions. The main idea is as follows: when the processor finds an ‘old’ instruction in the ROB with a large latency and the ROB is full, the processor removes this instruction by checkpointing the state of the processor at the last committed instruction. The processor continues its work “normally” and it moves all instructions that depend on the checkpointed instruction, to the checkpointing table. In case of misspeculation or an exception of either the checkpointed instruction or any instruction dependent on it, the checkpointed state is restored. The design of the mechanism is still in progress. We are building a simulation environment that will permit us to evaluate the proposal.
The work we plan to do during this year concerning to the out-of-order commit mechanism is the following:
To finish the simulator to start with the evaluation of different alternatives for the implementation of the out-of-order commit mechanism
To optimize the mechanism for those branches where the branch predictor fails frequently.
To study new organizations for the load-store queues. To use the concept of virtual registers to optimize the register file
organization. Concerning to the work dealing with wide registers, we are going to finish
the design of the mechanism and to evaluate it.