1/25HIPEAC 2008 TurboROB
TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
{pakl, moshovos}@eecg.toronto.edu
Recovering From Control Flow Mispredictions
• Accelerate Recovery – Improve Performance
[Figure: Execution Timeline; events: Predict a Branch Outcome, Misprediction Discovered, Recover Processor State, Redirect Fetch, Resume Execution; paths: Predicted Path, Correct Path]
State-of-the-Art Recovery
[Figure: labels: Predict a Branch Outcome, Misprediction Discovered, Log of Changes (ROB; fields "what" and "old value"), State Snapshot]
• Scalability and/or Performance Issues
Turbo-ROB
• Make common case fast:
  – Recover only at branches
• Store only as much as needed:
  – Partial Log
[Figure: labels: Predict a Branch Outcome, Misprediction Discovered, Log of Changes (ROB), Partial Log of Changes]
Outline
• Control Flow Misspeculation Recovery
• TurboROB
• Methodology and Results
• Summary
State Recovery Example: Register Alias Table
[Figure: RAT, a table mapping each Architectural Register to a Physical Register; dimension labels: # arch. regs, Lg(# arch. regs)]
Original Code
A: add r1, r2, 100
B: breq r1, E
C: sub r1, r2, r2
Renamed Code
A: add p4, p2, 100
B: breq p4, E
C: sub p5, p2, p2
[Figure: RAT contents shown as p1, p2, p3, with one entry stepping through p4, p5, p5, p4]
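The renaming example above can be reproduced with a small sketch. Everything here is illustrative: the function name `rename`, the tuple encoding of instructions, the initial mapping (r1 to p1, r2 to p2) and the free list (p4, p5) are assumptions chosen to match the slide, and the destination of instruction C is taken to be p5.

```python
def rename(program, rat, free_list):
    """Rename (op, dest, srcs) instructions in program order using a RAT.

    rat maps architectural register names to physical register names.
    All names here are hypothetical; this is a sketch, not the authors' design.
    """
    renamed = []
    for op, dest, srcs in program:
        # Sources read the current mapping; immediates and labels pass through.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = dest
        if dest is not None:
            # Destinations get a fresh physical register and update the RAT.
            new_dest = free_list.pop(0)
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


rat = {"r1": "p1", "r2": "p2"}        # assumed initial mapping
free_list = ["p4", "p5"]              # assumed free physical registers
program = [
    ("add", "r1", ["r2", 100]),       # A
    ("breq", None, ["r1", "E"]),      # B
    ("sub", "r1", ["r2", "r2"]),      # C
]
renamed = rename(program, rat, free_list)
```

After renaming, the RAT entry for r1 points at the register written by C. If branch B turns out to be mispredicted, that entry has to be restored to the register written by A, which is the state recovery problem the following slides address.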
ROB: Slow, Fine-Grain Recovery
• Too slow: recovery latency proportional to number of instructions to squash
[Figure: Reorder Buffer in Program Order containing several branches (B); labels: RAT, INVALID]
1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order
Each entry contains
1. Architectural destination register
2. Its previous RAT map
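The recovery steps and the two-field entry format above can be sketched as follows. This is a minimal model under stated assumptions: the ROB is a Python list in program order, each entry is an `(arch_dest, prev_map)` tuple, branches carry `None` for both fields, and the name `rob_recover` is hypothetical.

```python
def rob_recover(rat, rob, branch_index):
    """Undo RAT updates of every entry younger than the mispredicted branch.

    rob is a list in program order; rob[branch_index] is the branch.
    Entries are walked newest first, one per step, so the returned step
    count grows with the number of squashed instructions.
    """
    steps = 0
    while len(rob) > branch_index + 1:
        arch_dest, prev_map = rob.pop()      # newest remaining entry
        if arch_dest is not None:
            rat[arch_dest] = prev_map        # restore the previous mapping
        steps += 1
    return steps


# Assumed state: r1 and r2 were renamed several times after branch B.
rob_rat = {"r1": "p7", "r2": "p6"}
rob = [
    ("r1", "p1"),    # A: r1 p1 -> p4
    (None, None),    # B: branch, no destination
    ("r1", "p4"),    # C: r1 p4 -> p5
    ("r2", "p2"),    # D: r2 p2 -> p6
    ("r1", "p5"),    # E: r1 p5 -> p7
]
rob_steps = rob_recover(rob_rat, rob, branch_index=1)
```

In this model the three squashed instructions cost three undo steps, which illustrates why the latency is proportional to the number of instructions to squash.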
Global Checkpoints: Fast, Coarse-Grain Recovery
• Branch w/ GC: Recovery is “Instantaneous”
[Figure: Reorder Buffer in Program Order containing several branches (B), with four checkpoints; labels: RAT, INVALID]
1. Misprediction discovered
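A global checkpoint can be modeled as a full copy of the RAT taken at a branch. The sketch below is only an illustration: the class name `CheckpointedRAT`, its methods, and the policy of freeing checkpoints of younger branches on recovery are assumptions, not details taken from the talk.

```python
class CheckpointedRAT:
    """RAT with a small, fixed number of global checkpoints (hypothetical model)."""

    def __init__(self, mapping, max_checkpoints):
        self.mapping = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}  # branch id -> full RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails when no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.mapping)
        return True

    def recover(self, branch_id):
        """Restore the snapshot in one step; False if the branch has none."""
        if branch_id not in self.checkpoints:
            return False
        self.mapping = dict(self.checkpoints[branch_id])
        # Free this checkpoint and those of younger (squashed) branches.
        ids = list(self.checkpoints)
        for younger in ids[ids.index(branch_id):]:
            del self.checkpoints[younger]
        return True


gc_rat = CheckpointedRAT({"r1": "p1", "r2": "p2"}, max_checkpoints=1)
took_b1 = gc_rat.take_checkpoint("B1")   # a checkpoint is available
gc_rat.mapping["r1"] = "p4"
took_b2 = gc_rat.take_checkpoint("B2")   # none left: B2 is not covered
recovered_b2 = gc_rat.recover("B2")      # needs some other recovery mechanism
recovered_b1 = gc_rat.recover("B1")      # single-step restore
```

Recovery at a checkpointed branch is a single copy regardless of how many instructions are squashed, but a branch without a checkpoint cannot be recovered this way.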
Impact of More Checkpoints
• More checkpoints?
  – Power hungry structure
  – Increased delay
• Only a few checkpoints can practically be implemented
  – Cannot always cover all branches
[Figure: RAT with checkpoints, Concept vs. Actual Implementation; labels: architectural register, physical register, Working Copy, checkpoints]
Intelligent Checkpointing & BranchTap
• Use Few Checkpoints Effectively
• BranchTap:
  – Throttle Speculation
[Figure: branches (B) in program order with four checkpoints]
Conventional Mechanisms: Recovery Scenarios
[Figure: three rows of branches (B) with checkpoints; label: Re-Execution]
Outline
• Background
• Turbo-ROB
• Methodology and Results
• Summary
Turbo-ROB
We only need to reverse the first subsequent change for every RAT entry
[Figure: ROB Recovery; entries B, R1, R1, R2, R2, R1 marked useful or redundant; label: ~ Recovery Cost]
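The idea of reversing only the first subsequent change per RAT entry can be sketched as a partial log. This is an illustrative model, not the authors' hardware design: the class `TurboROB`, its method names, and the use of an "already logged" set that is cleared at every branch are assumptions. The write sequence R1, R1, R2, R2, R1 is one reading of the figure on this slide.

```python
class TurboROB:
    """Partial log of RAT changes: only the first change per register
    after each branch is recorded (hypothetical sketch)."""

    def __init__(self, rat):
        self.rat = rat
        self.log = []         # ("branch", id, None) or ("map", arch, prev)
        self.logged = set()   # registers already logged since the last branch

    def mark_branch(self, branch_id):
        self.log.append(("branch", branch_id, None))
        self.logged.clear()

    def rename_dest(self, arch, phys):
        if arch not in self.logged:
            # First change since the last branch: needed for recovery.
            self.log.append(("map", arch, self.rat[arch]))
            self.logged.add(arch)
        # Later changes to the same register are redundant and not logged.
        self.rat[arch] = phys

    def recover(self, branch_id):
        """Undo logged changes, newest first, back to the given branch."""
        steps = 0
        while self.log:
            kind, a, b = self.log.pop()
            if kind == "branch":
                if a == branch_id:
                    break
                continue
            self.rat[a] = b
            steps += 1
        self.logged.clear()
        return steps


trob = TurboROB({"r1": "p1", "r2": "p2"})
trob.mark_branch("B")
for arch, phys in [("r1", "p3"), ("r1", "p4"), ("r2", "p5"),
                   ("r2", "p6"), ("r1", "p7")]:
    trob.rename_dest(arch, phys)
log_entries = len(trob.log) - 1      # exclude the branch marker
trob_steps = trob.recover("B")
```

In this model the five writes after the branch leave only two log entries, so recovery takes two steps where a full ROB walk would take five.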
Turbo-ROB Replacing the ROB
[Figure: two rows of branches (B), each with a TROB; label: Re-Execution]
Selective Turbo-ROB w/ ROB
[Figure: branches (B) with a TROB]
Selective Turbo-ROB w/ GCs
[Figure: branches (B) with a TROB and a checkpoint]
Outline
• Background
• TurboROB
• Methodology and Results
• Summary
Results Overview
• TROB as an ROB replacement
  – BranchTap offers better performance than ROB
  – Fewer resources
  – Even for smaller windows
• Selective TROB as a GC reduction mechanism
  – TROB reduces pressure for GCs
  – Offload a critical structure: RAT
• In the paper:
  – Selective TROB as an ROB accelerator
  – Even the smallest TROB accelerates recovery
Methodology
• Simulator based on SimpleScalar
  – Alpha/OSF
• 24 SPEC CPU 2000 benchmarks
• Reference Inputs
• Processor configurations
  – 4-way OoO core
  – 128/256/512 in-flight instructions
  – 1K-entry confidence table for low confidence branch identification / similar results with Anyweak
• 1B committed instructions after skipping 2B
“Perfect Checkpointing” Configuration
• A checkpoint is auto-magically taken at all mispredicted branches
  – All recoveries are fast
• We report the “deterioration relative to perfect checkpointing”
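The slide does not give a formula for this metric. One plausible reading, shown here purely as an assumption, is the fractional performance loss of a configuration compared with the perfect checkpointing baseline. The function name and the IPC values below are hypothetical and are not results from the talk.

```python
def deterioration(perf_perfect, perf_config):
    """Fractional performance loss vs. perfect checkpointing.

    Assumed definition: (perfect - config) / perfect, where both arguments
    are a higher-is-better performance measure such as IPC.
    """
    if perf_perfect <= 0:
        raise ValueError("perf_perfect must be positive")
    return (perf_perfect - perf_config) / perf_perfect


# Hypothetical IPC values, for illustration only.
example = deterioration(perf_perfect=2.0, perf_config=1.8)
print(f"{example:.0%}")
```

Under this reading a value of 0% means the configuration matches perfect checkpointing, and lower values are better.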
TROB Replacing the ROB/512-Entry Window
• 64-entry TROB == ROB on the Average
• Pathological cases exist: 256-entry needed
• 512-Entry TROB better than ROB
[Chart: benchmarks 164.gzip, 176.gcc, 179.art, 197.parser, 301.apsi, AVG; series ROB, TROB_32, TROB_64, TROB_128, TROB_256, TROB_512; y-axis 0% to 50%; arrow labeled “better”]
TROB Replacing the ROB/128-Entry Window
• 64-Entry TROB: 50% better than ROB
• Fewer pathological cases
• 128-Entry TROB better than ROB
[Chart: benchmarks 164.gzip, 176.gcc, 179.art, 197.parser, 301.apsi, AVG; series ROB, TROB_32, TROB_64, TROB_128; y-axis 0% to 50%; arrow labeled “better”]
sTROB and Global Checkpoints/128-Entry Window
• TROB + 1 GC better than 4 GCs
[Chart: not recoverable from the transcript; arrow labeled “better”]
Summary
• TROB vs. ROB
  – Replacement
    • Same resources: better performance
    • Fewer resources: often better performance, except when accuracy is high
  – Acceleration
    • ¼ resources: 35% improvement
• TROB vs. GCs
  – Reduce pressure from the critical path
  – With just 1 GC match the performance of four GCs
• One more alternative for designers
  – Allows different area/performance/power tradeoffs