TurboROB: A Low Cost Checkpoint/Restore Accelerator
1/25 HIPEAC 2008 TurboROB
TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl (1) and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
(1) Now with AMD/ATI
What Happens on a Branch Misprediction?
• We wish to make the recovery fast
[Figure: Execution Timeline. Predict a Branch Outcome → Misprediction Discovered → Recover Processor State → Redirect Fetch → Resume Execution. Path labels: Predicted Path, Correct Path.]
Recovery Mechanisms Overview
• ROB:
  – Buffer all changes
  – Slow
• Instantaneous checkpoints:
  – Snapshot before speculating
  – Fast
  – Problem: can’t have enough checkpoints
• Checkpoint prediction
  – Allocate the few checkpoints judiciously
• Speculation control
  – Sometimes deeper speculation = higher recovery cost
    • Can hurt performance
  – Throttle speculation
TurboROB Overview
• Complements or Replaces Existing Mechanisms
• ROB: recover at any point
• TurboROB: recover only at frequent points
• Improves performance for most programs
  – Misprediction performance penalty reduced by 28% on average
• BranchTap comes “for free”
  – Very simple to implement
  – Better than more accurate checkpoint predictors
Outline
• Background
• BranchTap
• Methodology and Results
• Summary
State Recovery Example: Register Alias Table
• RAT: maps each Architectural Register to a Physical Register
• Figure dimensions as labeled: # arch. regs, Lg(# arch. regs)
Original Code
  A  add r1, r2, 100
  B  breq r1, E
  C  sub r1, r2, r2
Renamed Code
  A  add p4, p2, 100
  B  breq p4, E
  C  sub p5, p2, p2
[Figure: RAT entries p1, p2, p3, plus one entry that steps through p4, p5, p5, p4 across the animation.]
ROB: Slow, Fine-Grain Recovery
• Too slow: recovery latency proportional to number of instructions to squash
• Recovery steps:
  1. Misprediction discovered
  2. Locate newest instruction
  3. Undo RAT updates in reverse order
• Each entry contains
  1. Architectural destination register
  2. Its previous RAT map
[Figure: Reorder Buffer in Program Order holding several branches (B). Labels: RAT, INVALID.]
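The ROB recovery steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `RobEntry`, `rename`, and `rob_recover` are hypothetical, and the initial mapping r1 → p1 is an assumption made so the RAT example from the previous slide can be replayed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    """One ROB entry with the two fields the slide lists."""
    arch_dest: Optional[str]  # architectural destination register (None for a branch)
    prev_map: Optional[str]   # RAT mapping that this instruction overwrote


def rename(rat, rob, arch_dest=None, new_phys=None):
    """Append an instruction to the ROB and update the RAT if it writes a register.

    Returns the instruction's ROB index.
    """
    prev = rat[arch_dest] if arch_dest is not None else None
    rob.append(RobEntry(arch_dest, prev))
    if arch_dest is not None:
        rat[arch_dest] = new_phys
    return len(rob) - 1


def rob_recover(rat, rob, branch_index):
    """Squash everything younger than the branch, newest first.

    Returns the number of entries walked; recovery latency grows with it.
    """
    walked = 0
    while len(rob) > branch_index + 1:
        entry = rob.pop()                          # newest instruction first
        if entry.arch_dest is not None:
            rat[entry.arch_dest] = entry.prev_map  # undo its RAT update
        walked += 1
    return walked


# Replay of the slide's example; the initial mapping r1 -> p1 is an assumption.
rat = {"r1": "p1", "r2": "p2"}
rob = []
rename(rat, rob, "r1", "p4")   # A: add r1, r2, 100
branch = rename(rat, rob)      # B: breq r1, E
rename(rat, rob, "r1", "p5")   # C: sub r1, r2, r2 (wrong path)
walked = rob_recover(rat, rob, branch)
```

After recovery r1 maps to p4 again. The loop runs once per squashed instruction, which is the proportional latency the slide calls too slow.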
Global Checkpoints: Fast, Coarse-Grain Recovery
• Branch w/ GC: Recovery is “Instantaneous”
  1. Misprediction discovered
[Figure: Reorder Buffer in Program Order holding several branches (B), with four checkpoints. Labels: RAT, INVALID.]
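For contrast with the ROB walk, a global checkpoint (GC) can be modelled as a full snapshot of the RAT that is copied back in one step. The sketch below is only an illustration under assumptions: the class `CheckpointedRat`, its method names, and the use of string branch ids are all hypothetical. The default pool size of 4 mirrors the 4-checkpoint configuration evaluated later in the talk.

```python
class CheckpointedRat:
    """RAT with a small, fixed pool of global checkpoints (GCs)."""

    def __init__(self, initial_map, max_checkpoints=4):
        self.rat = dict(initial_map)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}  # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT before speculating past a branch, if a GC is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.rat)
        return True

    def release(self, branch_id):
        """Free the GC of a branch that resolved correctly (no-op if it had none)."""
        self.checkpoints.pop(branch_id, None)

    def restore(self, branch_id):
        """Recover in one step, independent of how many instructions are squashed.

        Raises ValueError if the branch has no GC; a real design would then
        fall back to a slower recovery path.
        """
        ids = list(self.checkpoints)
        position = ids.index(branch_id)
        self.rat = dict(self.checkpoints[branch_id])
        for stale in ids[position:]:  # this GC and all younger ones are now dead
            del self.checkpoints[stale]


core = CheckpointedRat({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
took_b = core.take_checkpoint("B")  # a GC is free, so branch B is covered
core.rat["r1"] = "p5"               # wrong-path rename by instruction C
took_d = core.take_checkpoint("D")  # pool exhausted: this branch is uncovered
core.restore("B")
```

With a pool of one, the second branch cannot be covered, which is the "cannot always cover all branches" problem in miniature.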
Impact of More Checkpoints
• More checkpoints?
  – Power hungry structure
  – Increased delay
• Only a few checkpoints can practically be implemented
  – Cannot always cover all branches
[Figure: RAT with checkpoints, shown as Concept and as Actual Implementation. Labels: architectural register, physical register, Working Copy, checkpoints, RAT.]
Intelligent Checkpointing
• State of the art solution
  – Checkpoint allocation: Allocate checkpoints at hard-to-predict branches
  – Checkpoint management: Release checkpoints as soon as they are no longer needed
• Use few checkpoints efficiently
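One common way to identify hard-to-predict branches is a table of resetting counters indexed by branch PC. The slides do not say which confidence scheme is used, so the sketch below is an assumption throughout: the counter policy, the `confident_at` value, the PC hashing, and the names `ConfidenceTable` and `should_allocate_checkpoint` are all hypothetical. Only the 1K-entry size is taken from the methodology slide.

```python
class ConfidenceTable:
    """PC-indexed table of resetting counters (an assumed confidence scheme)."""

    def __init__(self, entries=1024, confident_at=3):
        self.entries = entries
        self.confident_at = confident_at
        self.counters = [0] * entries

    def _index(self, pc):
        # Drop the byte offset; assumes 4-byte instructions.
        return (pc >> 2) % self.entries

    def is_low_confidence(self, pc):
        return self.counters[self._index(pc)] < self.confident_at

    def update(self, pc, predicted_correctly):
        i = self._index(pc)
        if predicted_correctly:
            self.counters[i] = min(self.counters[i] + 1, self.confident_at)
        else:
            self.counters[i] = 0  # one misprediction resets confidence


def should_allocate_checkpoint(table, pc, free_checkpoints):
    """Spend a checkpoint only on a hard-to-predict branch, and only if one is free."""
    return free_checkpoints > 0 and table.is_low_confidence(pc)
```

A branch starts out low confidence and becomes high confidence only after `confident_at` consecutive correct predictions, so scarce checkpoints go to branches with a recent misprediction.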
Conventional Mechanisms: Recovery Scenarios
• Misspeculation on a branch w/ a GC: Direct recovery
• Misspeculation on a branch w/o a GC: Indirect recovery
• With intelligent checkpointing:
  – 30% of recoveries are indirect, but they account for 75% of the performance loss
[Figure: two ROBs with branches (B) and a checkpoint, one labeled Fast Recovery and one labeled Slow Recovery.]
BranchTap Motivation
• Sometimes, it is better to wait if no checkpoint is available
[Figure: two ROB diagrams with branches (B), checkpoints, and a low confidence branch, comparing the approximate recovery cost of a No Wait Scenario and a Wait Scenario once the misprediction is discovered.]
BranchTap Concept
• Key idea: stall when speculation is likely to deteriorate performance– Count the number of low confidence branches w/o a checkpoint– If it exceeds a threshold, stall
• Threshold selection– Fixed
• Varies greatly across programs• Can deteriorate performance significantly
– Adaptive• Robust performance
• Minimize recovery cost while conserving good speculation opportunities
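The key idea amounts to one counter and one comparison. The following is a minimal sketch under assumptions, not the authors' hardware: the class and method names are hypothetical, and it assumes the counter is decremented whenever an uncovered low confidence branch leaves the window, whether it resolves or is squashed.

```python
class BranchTap:
    """Counts in-flight low confidence branches that have no checkpoint."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0

    def branch_entered(self, low_confidence, has_checkpoint):
        """Call when a branch enters the window."""
        if low_confidence and not has_checkpoint:
            self.unprotected += 1

    def branch_left(self, low_confidence, has_checkpoint):
        """Call when a branch resolves or is squashed (assumed policy)."""
        if low_confidence and not has_checkpoint:
            self.unprotected -= 1

    def should_stall(self):
        """Stall further speculation once the count exceeds the threshold."""
        return self.unprotected > self.threshold


tap = BranchTap(threshold=1)
tap.branch_entered(low_confidence=True, has_checkpoint=False)
stall_after_one = tap.should_stall()      # one uncovered branch: keep going
tap.branch_entered(low_confidence=True, has_checkpoint=False)
stall_after_two = tap.should_stall()      # two uncovered branches: stall
tap.branch_left(low_confidence=True, has_checkpoint=False)
stall_after_resolve = tap.should_stall()  # back under the threshold
```

Branches that are high confidence or that hold a checkpoint never touch the counter, so well-predicted code speculates as deeply as before.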
Threshold Adaptation Policy
• BranchTap adapts across and within applications
[Figure: Execution Timeline (Cycles) alternating "No adaptation" and "Sample & adapt" intervals. Interval labels: WT, Next WT.]
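The slide gives only the shape of the policy: periods without adaptation alternate with sample-and-adapt periods. One way such a loop could work is sketched below. Everything in it is an assumption rather than the authors' algorithm: sampling the neighbouring thresholds, the `measure_ipc` callback, the bounds, and reading WT as an interval length in cycles (`wt_cycles`) are all hypothetical.

```python
def adapt_threshold(current, measure_ipc, min_t=0, max_t=8):
    """Sample the current threshold and its neighbours; keep the best one.

    measure_ipc(threshold) is a hypothetical callback that runs a short
    sample interval with that threshold and returns the observed IPC.
    Listing `current` first makes ties favour the current threshold.
    """
    candidates = [t for t in (current, current - 1, current + 1)
                  if min_t <= t <= max_t]
    return max(candidates, key=measure_ipc)


def run(total_cycles, wt_cycles, threshold, measure_ipc):
    """Alternate a sample-and-adapt step with wt_cycles of unadapted execution."""
    history = []
    cycle = 0
    while cycle < total_cycles:
        threshold = adapt_threshold(threshold, measure_ipc)
        history.append((cycle, threshold))
        cycle += wt_cycles  # no adaptation until the next interval
    return history


# Toy stand-in for a simulator: IPC peaks when the threshold is 3.
history = run(4000, 1000, 0, lambda t: -abs(t - 3))
# history == [(0, 1), (1000, 2), (2000, 3), (3000, 3)]
```

In the toy run the threshold climbs by one per interval and then settles, which shows how a threshold can track program behaviour over time without a per-program constant.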
Results Overview
• Performance w/o Checkpoints
  – BranchTap improves even with just an ROB
• Performance w/ 4 Checkpoints
  – BranchTap improves over conventional recovery methods
• Performance w/ Larger Checkpoint Predictors
  – BranchTap offers better performance than a 64x larger predictor
Methodology
• Simulator based on SimpleScalar
• 24 SPEC CPU 2000 benchmarks
• Reference Inputs
• Processor configurations
  – 8-way OoO core
  – Up to 1K in-flight instructions
  – 1K-entry confidence table for low confidence branch identification
• 1B committed instructions simulated after skipping 100B
“Perfect Checkpointing” Configuration
• A checkpoint is auto-magically taken at all mispredicted branches
  – All recoveries are fast
• We report the “deterioration relative to perfect checkpointing”
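The slides do not spell out the formula behind "deterioration relative to perfect checkpointing". A plausible reading, stated here as an assumption, is the fractional IPC loss against the perfect configuration. The share of the misprediction penalty removed would then be the relative drop in that deterioration. The function names and the input numbers below are made up for illustration and are not results from the talk.

```python
def deterioration(ipc_perfect, ipc_config):
    """Fractional slowdown of a configuration relative to perfect checkpointing.

    Assumed definition; the slides name the metric but give no formula.
    """
    return (ipc_perfect - ipc_config) / ipc_perfect


def penalty_removed(det_baseline, det_improved):
    """Fraction of the baseline deterioration eliminated by the improved scheme."""
    return (det_baseline - det_improved) / det_baseline


# Made-up IPC values, for illustration only.
det_conventional = deterioration(ipc_perfect=2.0, ipc_config=1.80)  # about 0.10
det_with_throttle = deterioration(ipc_perfect=2.0, ipc_config=1.84)  # about 0.08
removed = penalty_removed(det_conventional, det_with_throttle)       # about 0.20
```

Under this reading, a configuration that matches perfect checkpointing has a deterioration of zero, and lower bars in the result charts are better.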
Performance with No Checkpoints
• Deterioration relative to “perfect checkpointing”
• BranchTap improves over conventional mechanisms
• Adaptation leads to robust performance improvements
[Figure: bar chart of deterioration (y-axis 0% to 25%, lower is better) for gzip, vpr, lucas, art, and AVG. Series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -39%.]
Performance Evaluation with 4 Checkpoints
• Deterioration relative to “perfect checkpointing”
• BranchTap with 4 checkpoints is better than 6 checkpoints alone
[Figure: bar chart of deterioration (y-axis 0% to 10%, lower is better) for twolf, parser, lucas, mcf, bzip2, and AVG. Series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -28%.]
BranchTap vs. Larger Checkpoint Predictors
• BranchTap with a 1K-entry confidence table and 4 GCs:
  – Higher performance than a 64K-entry confidence table with 4 GCs
  – Lower complexity, virtually comes “for free”
[Figure: deterioration (y-axis 0.0% to 3.0%, lower is better) versus confidence table size (64, 256, 1K, 4K, 16K, 64K), with the BranchTap result marked.]
Summary
• Performance with 4 (no) checkpoints
  – ~28 (39) % of misprediction penalty removed
  – BranchTap is robust:
    • Up to 6 (13) % better and max 1.2 (0.1) % worse than conventional mechanisms
• BranchTap is very simple to implement
  – Few counters and comparators
• BranchTap is better than other alternatives
  – BT + 1K predictor better than a 64K predictor alone
  – BT + 4 GCs better than 6 GCs alone
BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
Patrick Akl and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
{pakl, moshovos}@eecg.toronto.edu