School of Computing
Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
Niti Madan
Rajeev Balasubramonian
University of Utah
Introduction
Rising soft error rates due to shrinking transistor sizes and lower supply voltages
Existing Solutions:
• Process level – SOI
• Circuit level – Rad-hard cells, ECC, BISER
• Architecture level:
– Redundant Multithreading
– Reducing the time useful state spends in unprotected structures
– Software assisted fault tolerance
Introduction
• CMPs/SMTs enable redundant multi-threading (RMT)
• Detailed Design and Evaluation of Redundant Multithreading Alternatives, ISCA 2002
– 2 processors/threads execute the same program
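Chip-level Redundant Multi-threading (CRTR)
[Figure: two out-of-order (OoO) processors, Processor 1 and Processor 2, running Leading thread 1 / Trailing thread 2 and Trailing thread 1 / Leading thread 2. Labeled traffic between the processors: branch outcomes, loads, register values, stores. Each trailing thread lags behind its leading thread by some slack.]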
Motivation
• Register file is already a critical resource:
– impacts ILP
– impacts cycle time
– impacts peak temperature
• Multiple threads increase pressure on register file
Motivation
• Out-of-order processors are "conservative" since they must preserve correctness
– Example: registers are de-allocated conservatively
• Having a trailing thread allows the leading thread to be aggressive
– improves the performance of the leading thread
– trailer state can be used for ensuring correctness
– some errors may go undetected
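[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. The leading thread writes lr5 twice ("lr5 = …"), once mapped to physical register R1 and once mapped to R2, with a branch between the two writes. The value of R1 also appears in the RVQ and in the trailer. The figure marks a branch mispredict.]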
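[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ, showing register R1 and a second copy marked R1’. Labeled sequence: soft error, mispredict recovery, fault propagates.]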
Very few errors slip through: the slack is less than the RVQ size most of the time
Our Approach
• RMT processor has duplicate register value state in RVQ/trailer’s state
• Improve register file efficiency using Eager Register Release
• A smaller register file can deliver the same performance using the above technique
– Reduced power
– Increased reliability – ECC less expensive
– Potentially faster clock speed
Outline
• Background on RMT design space
• Proposed technique
• Evaluation
• Conclusions & Future Work
Redundant Multi-threading
• Fault model
– Trailer’s state used for recovery
• Does not provide complete recovery
– Caches and Load Value Queue (LVQ) ECC protected
– Can detect all single event upset faults
• Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR
Baseline RMT Model
• SRTR – SMT level RMT
[Figure: a single out-of-order processor running Leading Thread 1 and Trailing Thread 1.]
• CRTR – Chip level RMT
[Figure: two out-of-order processors, Processor 1 and Processor 2, running Leading 1 / Trailing 2 and Trailing 1 / Leading 2, connected through the LVQ, BOQ, and RVQ.]
• Proposed by Mukherjee et al. ISCA 2002, Gomaa et al. ISCA 2002, ISCA 2003
Power-efficient RMT model
Our earlier work explores a power-efficient RMT model, P-CRTR (SELSE-2, Tech Report 2005)
• Observations
– Trailing thread doesn’t suffer from D-cache misses and branch mispredictions
– Trailing thread is bound to have higher IPC
• High trailer IPC enables power reduction
• Techniques proposed for power-efficiency:
– Dynamic Frequency Scaling
– In-order execution of trailer
Dynamic Frequency Scaling
• High trailer IPC enables frequency reduction
• Reduce trailer’s frequency to match the leader’s throughput
• Reduction in trailer’s dynamic power
• Does not impact trailer’s leakage power
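As a rough illustration of the frequency-matching idea on this slide, the sketch below scales the trailer's clock so that its throughput (IPC × frequency) equals the leader's. The function names, the example IPC values, and the linear dynamic-power model (voltage held constant) are assumptions made for illustration; none of them come from the slides.

```python
def matched_trailer_frequency(leader_ipc, trailer_ipc, leader_freq_ghz):
    """Return the trailer frequency at which trailer throughput equals leader throughput.

    Throughput is modeled as IPC * frequency, so the trailer needs only
    leader_ipc / trailer_ipc of the leader's clock rate.
    """
    if leader_ipc <= 0 or trailer_ipc <= 0:
        raise ValueError("IPC values must be positive")
    # Cap at the nominal clock: this sketch only models slowing the trailer down.
    ratio = min(1.0, leader_ipc / trailer_ipc)
    return leader_freq_ghz * ratio


def relative_dynamic_power(freq, nominal_freq):
    """Dynamic power relative to nominal, assuming power scales linearly with frequency."""
    return freq / nominal_freq


# Hypothetical numbers: leader IPC 1.5, trailer IPC 3.0, both cores nominally at 4 GHz.
trailer_freq = matched_trailer_frequency(1.5, 3.0, 4.0)
print(trailer_freq)                               # trailer clock in GHz
print(relative_dynamic_power(trailer_freq, 4.0))  # fraction of nominal dynamic power
```

Leakage power is deliberately absent from this model, which matches the slide's point that frequency scaling leaves the trailer's leakage unchanged.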
In-order Execution of Checker
• Our approach
– Send all register values computed by leading core to the trailer (register value prediction, 100% accuracy if no fault)
– Trailer reads source operands from RVQ
– Trailer verifies source operands at commit
• RVP enables perfect IPC – no stalls
• Cost: extra communication overhead
• Benefit: overall reduced dynamic and leakage power
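The checker flow described on this slide can be sketched as follows. This is a minimal model under stated assumptions: the `Instr` and `RvqEntry` record formats, the two-operation instruction set, and the `fault_at` knob that corrupts one leader result are all hypothetical and exist only to make the example runnable.

```python
from collections import namedtuple

# Hypothetical record formats for this sketch.
Instr = namedtuple("Instr", "op dest src1 src2")
RvqEntry = namedtuple("RvqEntry", "src1_val src2_val")

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def run_leader(program, regs, fault_at=None):
    """Execute the program on the leading core and fill the RVQ.

    fault_at optionally names an instruction index whose result is corrupted,
    standing in for a soft error in the leader.
    """
    regs = dict(regs)
    rvq = []
    for i, ins in enumerate(program):
        a, b = regs[ins.src1], regs[ins.src2]
        rvq.append(RvqEntry(a, b))
        result = OPS[ins.op](a, b)
        if i == fault_at:
            result ^= 1  # flip the low bit of the result
        regs[ins.dest] = result
    return rvq


def run_trailer(program, regs, rvq):
    """In-order checker: compute with RVQ operands, verify them at commit.

    Returns the index of the first instruction whose RVQ operands disagree
    with the trailer's own register file, or None if everything matches.
    """
    regs = dict(regs)
    for i, (ins, entry) in enumerate(zip(program, rvq)):
        # Execute with the leader-supplied operands, so the trailer never
        # stalls waiting for its own earlier results.
        result = OPS[ins.op](entry.src1_val, entry.src2_val)
        # Commit-time check against the trailer's architected state.
        if (regs[ins.src1], regs[ins.src2]) != (entry.src1_val, entry.src2_val):
            return i
        regs[ins.dest] = result
    return None


program = [
    Instr("add", "r3", "r1", "r2"),
    Instr("mul", "r4", "r3", "r1"),
    Instr("add", "r5", "r4", "r3"),
]
initial = {"r1": 2, "r2": 3, "r3": 0, "r4": 0, "r5": 0}
clean = run_trailer(program, initial, run_leader(program, initial))
faulty = run_trailer(program, initial, run_leader(program, initial, fault_at=0))
print(clean, faulty)
```

In this sketch a corrupted result is caught only when a later instruction consumes it; a fuller model would also compare stores and final register values.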
ST-P-CRTR
• Single thread workloads
[Figure: Processor 1 (out-of-order) runs Leading 1 and Processor 2 (in-order) runs Trailing 1, connected through the LVQ, BOQ, and RVQ.]
MT-P-CRTR
• Multi-threaded Workloads
[Figure: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1; Processor 3 (in-order) runs Trailing 2. Each trailer is connected to Processor 1 through its own LVQ, BOQ, and RVQ.]
Eager Register Release
• Eager Register Release
– Involves releasing the older physical register once its logical register has been rewritten and the old value has been used by all consumers
– Requires a mechanism to store the released state elsewhere
Original Code        Renamed Code
lr3 = lr1, lr2       pr21 = pr8, pr11
lr5 = lr3, lr4       pr15 = pr21, pr12
Branch to x          Branch to x
lr3 = …              pr29 = …
lr3 has 2 mappings – new pr29 and old pr21
pr21 cannot be released until branch resolves
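To make the two mappings of lr3 concrete, here is a minimal renaming sketch that reproduces the slide's renamed code. The `rename` helper is hypothetical, the initial map table and free-list order are chosen only so the output matches the slide's register numbers, the branch is left out, and the elided operands of the last instruction ("lr3 = …") are modeled as an empty source list.

```python
def rename(program, map_table, free_list):
    """Rename logical destinations to fresh physical registers.

    Returns the renamed instructions and, for each one, the physical register
    previously mapped to its destination (the release candidate), or None.
    """
    map_table = dict(map_table)
    free_list = list(free_list)
    renamed, previous = [], []
    for dest, srcs in program:
        phys_srcs = [map_table[s] for s in srcs]  # sources use the current mappings
        old = map_table.get(dest)                 # mapping about to be superseded
        new = free_list.pop(0)
        map_table[dest] = new
        renamed.append((new, phys_srcs))
        previous.append(old)
    return renamed, previous


# The slide's example without the branch; sources of the last write are elided.
program = [("lr3", ["lr1", "lr2"]), ("lr5", ["lr3", "lr4"]), ("lr3", [])]
initial_map = {"lr1": "pr8", "lr2": "pr11", "lr4": "pr12"}  # assumed starting state
renamed, previous = rename(program, initial_map, ["pr21", "pr15", "pr29"])
for (dest, srcs), old in zip(renamed, previous):
    print(dest, srcs, "supersedes", old)
```

The last instruction supersedes pr21, whose only consumer is the instruction writing pr15. Once that consumer has read pr21, the register holds no value the leading thread still needs on the predicted path, which is the condition eager release exploits.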
Implementation Details
• Need to keep track of various states for each physical register in Usage Table
– Bit that tracks if logical register value is overwritten
– RVQ address/register id in trailing thread
• Counters for each physical register
– To track pending consumers
• Modification in ROB to initiate recovery upon mispredict
• Non-trivial complexity and overheads
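One way to picture the bookkeeping listed on this slide is the sketch below. The class and method names are hypothetical, the RVQ is modeled as a plain dictionary, and re-allocating a physical register for a restored value is not modeled; the slides do not describe the hardware at this level of detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    """Per-physical-register state, following the fields listed on the slide."""
    pending_consumers: int = 0      # renamed consumers that have not read the value yet
    overwritten: bool = False       # logical register has been renamed to a newer register
    rvq_addr: Optional[int] = None  # where the value can be found in the RVQ/trailer
    released: bool = False


class UsageTable:
    def __init__(self):
        self.entries = {}
        self.free_list = []

    def allocate(self, preg, rvq_addr):
        self.entries[preg] = UsageEntry(rvq_addr=rvq_addr)

    def add_consumer(self, preg):
        self.entries[preg].pending_consumers += 1

    def consumer_read(self, preg):
        self.entries[preg].pending_consumers -= 1
        self._try_release(preg)

    def mark_overwritten(self, preg):
        self.entries[preg].overwritten = True
        self._try_release(preg)

    def _try_release(self, preg):
        e = self.entries[preg]
        # Eager release: overwritten and no consumer still waiting.
        if e.overwritten and e.pending_consumers == 0 and not e.released:
            e.released = True
            self.free_list.append(preg)

    def recover(self, preg, rvq):
        """Return the value to copy back if preg was released eagerly, else None.

        Called when a mispredict squashes the instruction that overwrote the
        logical register.
        """
        e = self.entries[preg]
        if e.released:
            return rvq[e.rvq_addr]
        e.overwritten = False
        return None


rvq = {7: 1234}                      # hypothetical RVQ contents
table = UsageTable()
table.allocate("pr21", rvq_addr=7)   # pr21 = pr8, pr11
table.add_consumer("pr21")           # pr15 = pr21, pr12 is renamed
table.mark_overwritten("pr21")       # pr29 = ... renames lr3 again
released_before_read = table.entries["pr21"].released
table.consumer_read("pr21")          # last consumer reads pr21
print(released_before_read, table.free_list, table.recover("pr21", rvq))
```

The example walks pr21 from the eager release slide through allocation, release, and a copy-back on mispredict recovery.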
Evaluation Methodology
• SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
• eCacti-3.0 to model register file power and area overheads
• SPEC2k Int, FP benchmark suite
– 16 benchmarks for single thread experiments
– 10 pairs of High/Low IPC, Int/FP combinations for multi-thread experiments
• Evaluated all RMT models for comprehensive analysis of all combinations of leading/trailing threads
• RVQ size = 600 entries
Performance Evaluation
Effect of Register File Size - SRTR
[Figure: SRTR throughput (IPC, axis from 1 to 2) versus register file size (80, 100, 120, 160) for Base and ER; ROB size 160.]
Effect of Register File Size - ST-P-CRTR
[Figure: single-thread P-CRTR throughput (IPC, axis from 1.8 to 2.5) versus register file size (50, 60, 70, 80) for Base and ER.]
Effect of Register File Size - CRTR
[Figure: CRTR throughput (IPC, axis from 2 to 5.5) versus register file size (100, 120, 140, 160, 200) for Base and ER.]
Effect of Register File Size - MT-P-CRTR
[Figure: multi-thread P-CRTR throughput (IPC, axis from 2 to 5) versus register file size (100, 120, 140, 160) for Base and ER.]
Effect of Register File Size
• For SRTR, CRTR, MT-P-CRTR:
– A 100-entry RF with ER matches the performance of a 160-entry baseline RF (37.5% size reduction)
– A 100-entry RF with ER improves performance by 34% over a 100-entry baseline RF
• For ST-P-CRTR:
– A 50-entry RF with ER matches the performance of an 80-entry baseline RF (37.5% size reduction)
– A 100-entry RF with ER improves performance by 12% over a 100-entry baseline RF
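The 37.5% figure quoted for both groups of models follows directly from the register file sizes. The helper name below is an assumption; the sizes are the ones on the slide.

```python
def size_reduction_percent(baseline_entries, er_entries):
    """Percentage by which the ER register file is smaller than the baseline."""
    return 100.0 * (baseline_entries - er_entries) / baseline_entries


srtr_like = size_reduction_percent(160, 100)  # SRTR, CRTR, MT-P-CRTR
st_p_crtr = size_reduction_percent(80, 50)    # ST-P-CRTR
print(srtr_like, st_p_crtr)
```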
Observations
• ER is more favorable to models where the leading thread co-executes with another leading/trailing thread
• Most FP benchmarks perform better with ER (greater than 20% improvement)
• Int benchmarks with poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)
Performance Overheads
• For 100 million single thread execution
– 70 million registers are released eagerly
– 6% copied back upon mispredict recovery
– Cost of copying back depends on the program's mispredict rate
– Each mispredict requires 6.6 copy-back values on average
– Cost of copying can possibly be hidden within branch recovery time
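The figures on this slide can be combined into a back-of-the-envelope estimate. The derived numbers below are not stated on the slides; they assume that the 6% refers to the 70 million eagerly released registers and that 6.6 is an average per mispredict.

```python
released_eagerly = 70_000_000    # from the slide
copy_back_percent = 6            # from the slide, assumed to be a share of released registers
copy_backs_per_mispredict = 6.6  # from the slide

# Integer arithmetic keeps the first result exact.
copy_backs = released_eagerly * copy_back_percent // 100
implied_mispredicts = copy_backs / copy_backs_per_mispredict

print(copy_backs)                  # registers copied back in total
print(round(implied_mispredicts))  # mispredicts implied by the two figures
```

Under these assumptions roughly 4.2 million values are copied back, spread over a little more than 0.6 million mispredict recoveries.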
Performance Overheads
[Figure: % IPC loss (axis from 0 to 8) for the RMT models SRTR, ST-P-CRTR, CRTR, and MT-P-CRTR, with 5-cycle and 10-cycle copy-back overheads.]
Max IPC loss for 5-cycle overhead is 4%
Power/Area Analysis
• 8 Rd/4 Wr ports assumed for ST RF
• 16 Rd/8 Wr ports assumed for MT RF
Power/Area Analysis
• Single thread RF size 50 with ER compared to baseline RF size 80 can
– Improve clock speed by 19%
– Consume 11% less energy and 25% less area
• If SEC-DED ECC is implemented on baseline register file
– 6% energy increase and 16% area increase
• Smaller RF can help afford ECC for even multiple bit soft error resilience
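For a sense of where ECC storage cost comes from, the sketch below counts the check bits of a standard Hamming-based SEC-DED code. The register widths used are assumptions (the slides do not state one), and the result is a raw storage ratio, not an attempt to reproduce the slide's energy and area figures.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC-DED code protecting data_bits bits.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


for width in (32, 64):  # assumed register widths
    bits = secded_check_bits(width)
    print(width, bits, 100.0 * bits / width)  # width, check bits, storage overhead in %
```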
Fault-Injection Analysis
• Modified SimpleScalar for fault analysis
• Conservative analysis as masking effects cannot be modeled
• Every 1000 cycles, a register bit is flipped in the trailing register file
– Only 0.0004% of faults go undetected
• On average, 99% of the time a logical register is rewritten within an interval of less than 100 instructions
– Ensures that slack is less than RVQ size
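The injection schedule described on this slide can be sketched as below. This is not the authors' SimpleScalar modification: the function names, the 64-bit register width, the 80-entry register file, and the uniform random choice of register and bit are all assumptions, and the simulated core itself is left as a placeholder comment.

```python
import random


def flip_random_bit(regfile, rng, width=64):
    """Flip one randomly chosen bit in one randomly chosen register."""
    reg = rng.randrange(len(regfile))
    bit = rng.randrange(width)
    regfile[reg] ^= 1 << bit
    return reg, bit


def run_with_injection(total_cycles, regfile, interval=1000, seed=0):
    """Step through cycles and inject one bit flip every `interval` cycles."""
    rng = random.Random(seed)
    injected = []
    for cycle in range(1, total_cycles + 1):
        # One cycle of the simulated core would execute here.
        if cycle % interval == 0:
            reg, bit = flip_random_bit(regfile, rng)
            injected.append((cycle, reg, bit))
    return injected


regfile = [0] * 80  # assumed register file size
faults = run_with_injection(5000, regfile)
print(faults)
```

A full experiment would then compare each injected fault against the checker's outcome to count how many go undetected.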
Conclusions and Future Work
• RMT model is very suitable for Eager Register Release
• A 100-entry RF with ER can match the throughput of a 160-entry baseline RF and shows a 34% improvement over a 100-entry baseline RF
• Fault-coverage reduction marginal ~0.0004%
• Enables smaller RF for lower power, higher clock speed, lower area overheads
• Enables reliability by making ECC affordable
• Nontrivial implementation overheads
• Need to explore complexity-effective solution