School of Computing

Exploiting Eager Register Release in a Redundantly Multi-threaded Processor

Niti Madan

Rajeev Balasubramonian

University of Utah

School of Computing

Introduction

Rising soft error rates due to shrinking transistor sizes and lower supply voltages

Existing Solutions:

• Process level: SOI
• Circuit level: Rad-hard cells, ECC, BISER
• Architecture level:
– Redundant Multithreading
– Reducing the time useful state spends in unprotected structures
– Software assisted fault tolerance

Introduction

CMPs/SMTs enable redundant multi-threading (RMT)

• Detailed Design and Evaluation of Redundant Multithreading Alternatives, ISCA 2002
– 2 processors/threads execute the same program

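Chip-level Redundant Multi-threading (CRTR)

[Figure: two out-of-order processors (Processor 1, Processor 2) hosting leading thread 1, trailing thread 2, trailing thread 1, and leading thread 2. Branch outcomes, loads, register values, and stores are passed between leading and trailing threads. The trailing thread lags behind the leading thread by some slack.]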

Motivation

• Register file is already a critical resource:
– impacts ILP
– impacts cycle time
– impacts peak temperature

• Multiple threads increase pressure on register file

Motivation

• Out-of-order processors are "conservative" since they must preserve correctness
– Example: registers are de-allocated conservatively

• Having a trailing thread allows the leading thread to be aggressive
– improves the performance of the leading thread
– trailer state can be used for ensuring correctness
– some errors may go undetected

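[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) linked by the RVQ. Labels on the leading thread: "lr5 = …" with lr5 mapped to R1, "Branch", "lr5 = …" with lr5 mapped to R2, and "Mispredict". Copies of R1 are shown alongside the RVQ and the trailer.]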

[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) linked by the RVQ. A soft error turns a copy of R1 into R1’; on mispredict recovery the fault propagates.]

Very few errors slip through: the slack is less than the RVQ size most of the time

Our Approach

• RMT processor has duplicate register value state in RVQ/trailer’s state
• Improve register file efficiency using Eager Register Release
• Smaller register file size can deliver same performance using above technique
– Reduced power
– Increased reliability: ECC less expensive
– Potentially faster clock speed

Outline

• Background on RMT design space
• Proposed technique
• Evaluation
• Conclusions & Future Work

Redundant Multi-threading

• Fault model
– Trailer’s state used for recovery (does not provide complete recovery)
– Caches and Load Value Queue (LVQ) ECC protected
– Can detect all single event upset faults

• Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR

Baseline RMT Model

• SRTR – SMT level RMT
[Figure: Leading Thread 1 and Trailing Thread 1 on one out-of-order processor.]

• CRTR – Chip level RMT
[Figure: two out-of-order processors hosting Leading 1, Trailing 2, Trailing 1, and Leading 2, linked by the LVQ, BOQ, and RVQ.]

• Proposed by Mukherjee et al. ISCA 2002, Gomaa et al. ISCA 2002, ISCA 2003

Power-efficient RMT model

Our earlier work explores the power-efficient RMT model P-CRTR (SELSE-2, Tech Report 2005)

• Observations
– Trailing thread doesn’t suffer from D-cache misses and branch mispredictions
– Trailing thread bound to have higher IPC

• High trailer IPC enables power reduction
• Techniques proposed for power-efficiency:
– Dynamic Frequency Scaling
– In-order execution of trailer

Dynamic Frequency Scaling

• High trailer IPC enables frequency reduction
• Reduce trailer’s frequency to match the leader’s throughput
• Reduction in trailer’s dynamic power
• Does not impact trailer’s leakage power
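
The throughput-matching idea on this slide can be sketched in a few lines. This is only an illustration: the function names, the cap at the leader's frequency, and the assumption that dynamic power scales linearly with frequency at a fixed voltage are not from the slides.

```python
def trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency whose throughput (IPC x frequency)
    still matches the leader's throughput.

    Assumption: the trailer is never clocked above the leader.
    """
    matched = leader_freq_ghz * leader_ipc / trailer_ipc
    return min(leader_freq_ghz, matched)


def relative_dynamic_power(freq_ghz, nominal_freq_ghz):
    """Dynamic power relative to running at the nominal frequency.

    Assumption: voltage is fixed, so dynamic power scales linearly
    with frequency. Leakage power is unaffected and is not modeled.
    """
    return freq_ghz / nominal_freq_ghz
```

For example, a trailer that sustains twice the leader's IPC could, under these assumptions, run at half the leader's frequency and spend about half the dynamic power.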

In-order Execution of Checker

• Our approach
– Send all register values computed by leading core to the trailer (register value prediction, RVP: 100% accuracy if no fault)
– Trailer reads source operands from RVQ
– Trailer verifies source operands at commit

• RVP enables perfect IPC (no stalls)
• Cost: Extra communication overhead
• Benefit: Overall reduced dynamic and leakage power
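
A minimal sketch of the checker loop described above, assuming a toy three-operation instruction set and an RVQ modeled as a list of the source values the leader used. The names `run_trailer` and `OPS` and the data layout are hypothetical, not the authors' implementation.

```python
import operator

# Hypothetical three-operation ISA, just enough for the sketch.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_trailer(instructions, rvq, regfile):
    """Run the trailer in order; return the index of the first mismatch, or None.

    instructions: list of (op, dest, src1, src2) in program order.
    rvq: for each instruction, the (src1, src2) values the leader used.
    regfile: the trailer's own register state, updated in place.
    """
    for index, ((op, dest, src1, src2), (val1, val2)) in enumerate(
        zip(instructions, rvq)
    ):
        # Execute straight away with the predicted operands from the RVQ.
        result = OPS[op](val1, val2)
        # At commit, check the predictions against the trailer's own state.
        if regfile[src1] != val1 or regfile[src2] != val2:
            return index
        regfile[dest] = result
    return None
```

Because every operand arrives as a prediction, the trailer never waits on a producer; a corrupted value shows up as a mismatch at the commit check.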

ST-P-CRTR

• Single thread workloads

[Figure: Leading 1 on an out-of-order core and Trailing 1 on an in-order core, linked by the LVQ, BOQ, and RVQ.]

MT-P-CRTR

• Multi-threaded workloads

[Figure: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1; Processor 3 (in-order) runs Trailing 2. Each in-order processor is linked to Processor 1 by an LVQ, BOQ, and RVQ.]

Eager Register Release

• Eager Register Release
– Involves releasing the older physical register once its logical register has been rewritten and the old value has been used by all consumers
– Requires a mechanism to store the released state elsewhere

Original Code          Renamed Code
lr3 = lr1, lr2         pr21 = pr8, pr11
lr5 = lr3, lr4         pr15 = pr21, pr12
Branch to x            Branch to x
lr3 = …                pr29 = …

lr3 has 2 mappings – new pr29 and old pr21

pr21 cannot be released until branch resolves
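
The renaming in this example can be reproduced with a small sketch. The initial map and the order of the free list are assumptions chosen so that the output matches the slide's register numbers; the `rename` helper is hypothetical.

```python
# Assumed initial state, chosen so the result matches the slide's numbering.
rename_map = {"lr1": "pr8", "lr2": "pr11", "lr4": "pr12"}
free_list = ["pr21", "pr15", "pr29"]

program = [
    ("lr3", ["lr1", "lr2"]),
    ("lr5", ["lr3", "lr4"]),
    (None, []),   # branch to x: no destination register
    ("lr3", []),  # lr3 = ... (sources are elided on the slide)
]


def rename(program, rename_map, free_list):
    """Rename in program order; also report each superseded mapping."""
    mapping = dict(rename_map)
    free = list(free_list)
    renamed, superseded = [], []
    for dest, srcs in program:
        phys_srcs = [mapping[s] for s in srcs]
        phys_dest = None
        if dest is not None:
            old = mapping.get(dest)
            phys_dest = free.pop(0)
            mapping[dest] = phys_dest
            if old is not None:
                # The old register is live until released (or recovered).
                superseded.append((dest, old, phys_dest))
        renamed.append((phys_dest, phys_srcs))
    return renamed, superseded


renamed, superseded = rename(program, rename_map, free_list)
```

Here `superseded` holds the single entry `("lr3", "pr21", "pr29")`, the old and new mappings of lr3 that coexist while the branch is unresolved.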

Implementation Details

• Need to keep track of various states for each physical register in Usage Table
– Bit that tracks if logical register value is overwritten
– RVQ address/register id in trailing thread

• Counters for each physical register
– To track pending consumers

• Modification in ROB to initiate recovery upon mispredict

• Non-trivial complexity and overheads
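
The bookkeeping listed above might look roughly like the following. This is a sketch under assumptions: the class and method names are invented for illustration, the RVQ is modeled as a plain dictionary, and ROB interaction is reduced to a single copy-back call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    overwritten: bool = False       # logical register has a newer mapping
    pending_consumers: int = 0      # renamed readers that have not read yet
    rvq_addr: Optional[int] = None  # where the duplicate value can be found


class UsageTable:
    def __init__(self):
        self.entries = {}
        self.released = set()

    def allocate(self, preg, rvq_addr):
        self.entries[preg] = UsageEntry(rvq_addr=rvq_addr)
        self.released.discard(preg)

    def add_consumer(self, preg):
        self.entries[preg].pending_consumers += 1

    def consumer_read(self, preg):
        self.entries[preg].pending_consumers -= 1
        self._try_release(preg)

    def mark_overwritten(self, preg):
        self.entries[preg].overwritten = True
        self._try_release(preg)

    def _try_release(self, preg):
        entry = self.entries[preg]
        # Eager release: overwritten and no consumer still needs the value.
        if entry.overwritten and entry.pending_consumers == 0:
            self.released.add(preg)

    def copy_back(self, preg, rvq):
        """On mispredict recovery, restore an eagerly released value."""
        entry = self.entries[preg]
        entry.overwritten = False
        self.released.discard(preg)
        return rvq[entry.rvq_addr]
```

In the earlier example, pr21 would be released as soon as `pr15 = pr21, pr12` has read it and pr29 has been allocated, without waiting for the branch; if the branch mispredicts, `copy_back` fetches the old value again.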

Evaluation Methodology

• Simplescalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
• eCacti-3.0 to model register file power and area overheads
• Spec2k Int, FP benchmark suite
– 16 benchmarks for single thread experiments
– 10 pairs of High/Low IPC, Int/FP combinations for multi-thread experiments

• Evaluated all RMT models for comprehensive analysis of all combinations of leading/trailing threads

• RVQ size = 600 entries

Performance Evaluation

Effect of Register File Size - SRTR

[Figure: SRTR throughput (IPC) versus register file size (80, 100, 120, 160) for Base and ER. ROB size 160.]

Effect of Register File Size - ST-P-CRTR

[Figure: Single Thread P-CRTR throughput (IPC) versus register file size (50, 60, 70, 80) for Base and ER.]

Effect of Register File Size - CRTR

[Figure: CRTR throughput (IPC) versus register file size (100, 120, 140, 160, 200) for Base and ER.]

Effect of Register File Size - MT-P-CRTR

[Figure: Multi-Thread P-CRTR throughput (IPC) versus register file size (100, 120, 140, 160) for Base and ER.]

Effect of Register File Size

• For SRTR, CRTR, MT-P-CRTR:
– Performance of 100 size RF with ER same as baseline with 160 size (37.5% size reduction)
– Performance improvement of 34% in 100 size RF with ER compared to baseline with 100 size

• For ST-P-CRTR:
– Performance of 50 size register file with ER same as baseline with 80 size (37.5% size reduction)
– Performance improvement of 12% in 100 size RF with ER compared to baseline with 100 size
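
The 37.5% figure quoted in both cases follows directly from the register file sizes. A short check, with `size_reduction` as a hypothetical helper name:

```python
def size_reduction(baseline_entries, er_entries):
    """Fraction of register file entries saved relative to the baseline."""
    return (baseline_entries - er_entries) / baseline_entries


multi_thread = size_reduction(160, 100)   # SRTR, CRTR, MT-P-CRTR case
single_thread = size_reduction(80, 50)    # ST-P-CRTR case
```

Both calls evaluate to 0.375, that is, 60 of 160 entries and 30 of 80 entries respectively.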

Observations

• ER is more favorable to models where the leading thread co-executes with another leading/trailing thread

• Most FP benchmarks perform better with ER (greater than 20% improvement)

• Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)

Performance Overheads

• For 100 million single thread execution
– 70 million registers are released eagerly
– 6% copied back upon mispredict recovery
– Cost of copying back dependent upon program mispredict rate
– Each mispredict requires 6.6 copy back values
– Cost of copying can be possibly hidden with branch recovery time
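
As a rough cross-check of these figures, the arithmetic below derives the number of copy-backs and the number of recoveries they would imply. It assumes the 6% is a fraction of the eagerly released registers and that 6.6 is an average over all recoveries; neither reading is confirmed by the slides, and the variable names are illustrative.

```python
released_eagerly = 70_000_000    # from the slide
copied_back_fraction = 0.06      # from the slide, read as a share of releases
copy_backs_per_mispredict = 6.6  # from the slide, read as an average

# Derived quantities (assumptions as stated above).
copied_back = released_eagerly * copied_back_fraction
implied_recoveries = copied_back / copy_backs_per_mispredict
```

Under that reading, about 4.2 million values are copied back, which corresponds to a little over 0.6 million mispredict recoveries.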

Performance Overheads

[Figure: % IPC loss for each RMT model (SRTR, ST-P-CRTR, CRTR, MT-P-CRTR) with 5-cycle and 10-cycle overheads.]

Max IPC loss for 5-cycle overhead is 4%

Power/Area Analysis

• 8 Rd/4 Wr ports assumed for ST RF
• 16 Rd/8 Wr ports assumed for MT RF

Power/Area Analysis

• Single thread RF size 50 with ER compared to baseline RF size 80 can
– Improve clock speed by 19%
– Consume 11% less energy and 25% less area

• If SEC-DED ECC is implemented on baseline register file
– 6% energy increase and 16% area increase

• Smaller RF can help afford ECC for even multiple bit soft error resilience
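
For a sense of what SEC-DED costs in storage alone, the sketch below counts check bits for an extended Hamming code. The register widths are assumptions (the slides do not state one), and the result is a bit-count overhead only; it is not the measured energy and area figures quoted above.

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming SEC-DED code.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def storage_overhead(data_bits):
    """Extra storage bits per data bit."""
    return secded_check_bits(data_bits) / data_bits
```

A 64-bit register needs 8 check bits (the familiar 72-bit codeword), and a 32-bit register needs 7, so fewer registers translate directly into fewer protected bits.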

Fault-Injection Analysis

• Modified Simplescalar for fault analysis
• Conservative analysis as masking effects cannot be modeled
• Every 1000 cycles, a register bit is flipped in the trailing register file
– Only 0.0004% of faults go undetected

• On average, 99% of the time a logical register is rewritten within a 100 instruction interval
– Ensures that slack is less than RVQ size
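
The injection schedule described above can be sketched as follows. This is not the modified Simplescalar code: the register file is modeled as a list of integers, the 64-bit width and the seeded random choice of register and bit are assumptions, and the function names are hypothetical.

```python
import random


def flip_bit(value, bit, width=64):
    """Return value with one bit inverted, kept within the register width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_faults(regfile, total_cycles, interval=1000, width=64, rng=None):
    """Flip one random bit of one random register every `interval` cycles.

    regfile is modified in place; the returned log lists
    (cycle, register index, bit position) for each injected fault.
    """
    rng = rng or random.Random(0)  # fixed seed for repeatable experiments
    log = []
    for cycle in range(interval, total_cycles + 1, interval):
        reg = rng.randrange(len(regfile))
        bit = rng.randrange(width)
        regfile[reg] = flip_bit(regfile[reg], bit, width)
        log.append((cycle, reg, bit))
    return log
```

A detection study would then compare each injected fault against the checker's outcome to count how many go unnoticed.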

Conclusions and Future Work

• RMT model very suitable for Eager Register Release

• A 100-entry RF can match the throughput of a 160-entry file and shows 34% improvement over the 100-entry baseline

• Fault-coverage reduction marginal ~0.0004%

• Enables smaller RF for lower power, higher clock speed, lower area overheads

• Enables reliability by making ECC affordable

• Nontrivial implementation overheads

• Need to explore complexity-effective solution
