turborob a low cost checkpoint/restore accelerator

HiPEAC 2008: TurboROB

TurboROB A Low Cost Checkpoint/Restore Accelerator

Patrick Akl¹ and Andreas Moshovos

AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto

¹ Now with AMD/ATI


What Happens on a Branch Misprediction?

[Figure: execution timeline. Predict a branch outcome → execute along the predicted path → misprediction discovered → recover processor state → redirect fetch → resume execution on the correct path.]

• We wish to make the recovery fast


Recovery Mechanisms Overview

• ROB:
– Buffer all changes
– Slow

• Instantaneous checkpoints:
– Snapshot before speculating
– Fast
– Problem: can’t have enough checkpoints

• Checkpoint prediction
– Allocate the few checkpoints judiciously

• Speculation control
– Sometimes deeper speculation = higher recovery cost
  • Can hurt performance
– Throttle speculation


TurboROB Overview

• Complements or replaces existing mechanisms

• ROB: recover at any point

• TurboROB: recover only at frequent points

• Improves performance for most programs
– Misprediction performance penalty reduced by 28% on AVG

• BranchTap comes “for free”
– Very simple to implement
– Better than more accurate checkpoint predictors


Outline

• Background

• BranchTap

• Methodology and Results

• Summary


State Recovery Example: Register Alias Table

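[Figure: the RAT maps each architectural register to a physical register. It has one entry per architectural register (# arch. regs); the width label on the slide reads Lg(# arch. regs).]

Original Code
A: add r1, r2, 100
B: breq r1, E
C: sub r1, r2, r2

Renamed Code
A: add p4, p2, 100
B: breq p4, E
C: sub p5, p2, p2

RAT contents shown: p1, p2, p3. The entry for r1 reads p4 after A, p5 after C, and p4 again once the update made by C is undone.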
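The renaming step in this example can be sketched in a few lines of Python. This is only an illustrative model, not the hardware described in the talk: the class name `RegisterAliasTable`, the free list, and the starting mappings (r1→p1, r2→p2, r3→p3) are assumptions chosen so that the output matches the renamed code on the slide.

```python
class RegisterAliasTable:
    """Toy RAT: maps architectural register names to physical register names."""

    def __init__(self, initial_map, free_list):
        self.map = dict(initial_map)   # architectural -> physical
        self.free = list(free_list)    # physical registers available for allocation

    def rename(self, dest, sources):
        """Rename one instruction; returns (new_dest, renamed_sources)."""
        # Sources are looked up before the destination is remapped, so an
        # instruction that reads and writes the same register reads the old mapping.
        renamed_sources = [self.map[s] for s in sources]
        new_dest = None
        if dest is not None:
            new_dest = self.free.pop(0)
            self.map[dest] = new_dest
        return new_dest, renamed_sources


# Assumed starting state: r1->p1, r2->p2, r3->p3, with p4 and p5 free.
rat = RegisterAliasTable({"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5"])
a = rat.rename("r1", ["r2"])         # A: add  r1, r2, 100
b = rat.rename(None, ["r1"])         # B: breq r1, E
c = rat.rename("r1", ["r2", "r2"])   # C: sub  r1, r2, r2
print(a, b, c)
```

Immediates and branch targets are left out of `sources` because they are not renamed.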


ROB: Slow, Fine-Grain Recovery

• Too slow: recovery latency proportional to number of instructions to squash

[Figure: Reorder Buffer holding in-flight branches (B B B B B) in program order; RAT marked INVALID.]

1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order

Each entry contains

1. Architectural destination register

2. Its previous RAT map
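The three recovery steps and the two fields per entry can be modelled as a walk from the newest ROB entry back to the mispredicted branch. This is a minimal sketch under assumptions: the function name `rob_recover`, the list-of-tuples ROB, and the initial mapping of r1 to p1 are illustrative, not taken from the talk.

```python
def rob_recover(rat, rob, branch_pos):
    """Squash every ROB entry younger than rob[branch_pos], undoing RAT updates.

    rat: dict mapping architectural -> physical register.
    rob: list in program order (oldest first) of (dest, previous_mapping);
         dest is None for instructions that write no register.
    Returns the number of entries walked, a stand-in for recovery latency.
    """
    walked = 0
    while len(rob) > branch_pos + 1:
        dest, previous = rob.pop()      # newest instruction first
        if dest is not None:
            rat[dest] = previous        # undo this instruction's RAT update
        walked += 1
    return walked


# State after renaming A, B, C (assuming r1 was initially mapped to p1).
rat = {"r1": "p5", "r2": "p2"}
rob = [("r1", "p1"),    # A: add  p4, p2, 100  (r1 was p1)
       (None, None),    # B: breq p4, E
       ("r1", "p4")]    # C: sub  p5, p2, p2   (r1 was p4)
steps = rob_recover(rat, rob, branch_pos=1)   # B was mispredicted
```

The loop runs once per squashed instruction, which is why the latency grows with the number of instructions to squash.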


Global Checkpoints: Fast, Coarse-Grain Recovery

• Branch w/ GC: Recovery is “Instantaneous”

[Figure: Reorder Buffer holding in-flight branches (B B B B B) in program order, four of them with a checkpoint; RAT marked INVALID.]

1. Misprediction discovered
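For contrast with the ROB walk, checkpoint-based recovery can be sketched as a single copy of a saved RAT snapshot. The class `CheckpointedRAT`, its method names, and the branch identifiers are hypothetical; the sketch only illustrates why recovery cost does not depend on how many instructions are squashed, and why a fixed checkpoint budget cannot cover every branch.

```python
class CheckpointedRAT:
    """RAT with a small, fixed number of global checkpoints (snapshots)."""

    def __init__(self, initial_map, max_checkpoints):
        self.map = dict(initial_map)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}            # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; returns False if no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.map)
        return True

    def release(self, branch_id):
        """Free the checkpoint of a branch that resolved as correctly predicted."""
        self.checkpoints.pop(branch_id, None)

    def restore(self, branch_id):
        """Recover at a mispredicted branch by copying its snapshot back."""
        self.map = dict(self.checkpoints[branch_id])
        # The branch's own checkpoint and all younger ones are no longer needed.
        ids = list(self.checkpoints)
        for stale in ids[ids.index(branch_id):]:
            del self.checkpoints[stale]


rat = CheckpointedRAT({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
took_b = rat.take_checkpoint("B")      # checkpoint at branch B
rat.map["r1"] = "p5"                   # C renames r1 on the predicted path
took_d = rat.take_checkpoint("D")      # a later branch finds no free checkpoint
rat.restore("B")                       # B mispredicted: one copy, no ROB walk
```

Branch D in the usage lines is a made-up later branch that is left uncovered; a misprediction on it would have to fall back to the slower ROB walk.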


Impact of More Checkpoints

• More checkpoints?
– Power hungry structure
– Increased delay

• Only a few checkpoints can practically be implemented
– Cannot always cover all branches

[Figure: RAT mapping architectural registers to physical registers, shown as a concept and as an actual implementation with a working copy and checkpoints.]


Intelligent Checkpointing

• State of the art solution
– Checkpoint allocation: Allocate checkpoints at hard-to-predict branches
– Checkpoint management: Release checkpoints as soon as they are no longer needed

• Use few checkpoints efficiently
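One common way to identify hard-to-predict branches is a table of resetting counters indexed by branch address. The talk does not say which confidence estimator is used, so the sketch below is an assumption: the class `ConfidenceTable`, the helper `wants_checkpoint`, the modulo indexing, and the counter limits are all illustrative choices.

```python
class ConfidenceTable:
    """Resetting-counter confidence estimator (an assumed design)."""

    def __init__(self, entries=1024, max_count=3, threshold=3):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, pc):
        return pc % self.entries

    def is_low_confidence(self, pc):
        # A branch is trusted only after enough consecutive correct predictions.
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = 0         # a misprediction resets confidence


def wants_checkpoint(table, pc, free_checkpoints):
    """Allocate a checkpoint only at low confidence branches, if one is free."""
    return free_checkpoints > 0 and table.is_low_confidence(pc)


table = ConfidenceTable()
pc = 0x400
first = wants_checkpoint(table, pc, free_checkpoints=4)     # untrained: low confidence
for _ in range(3):
    table.update(pc, prediction_correct=True)
trained = wants_checkpoint(table, pc, free_checkpoints=4)   # now high confidence
table.update(pc, prediction_correct=False)
after_miss = wants_checkpoint(table, pc, free_checkpoints=4)
```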


Conventional Mechanisms: Recovery Scenarios

• Misspeculation on a branch w/ a GC: Direct recovery

• Misspeculation on a branch w/o a GC: Indirect recovery

• With intelligent checkpointing:
– Indirect recoveries are 30% of recoveries but account for 75% of the performance loss

[Figure: two ROB diagrams with in-flight branches (B B B) and a checkpoint, labelled Fast Recovery and Slow Recovery.]




BranchTap Motivation

[Figure: two ROB diagrams with in-flight branches (B B B), checkpoints, and a low confidence branch, labelled No Wait Scenario and Wait Scenario. Each is annotated with its approximate recovery cost when the misprediction is discovered.]

Sometimes, it is better to wait if no checkpoint is available


BranchTap Concept

• Key idea: stall when speculation is likely to deteriorate performance
– Count the number of low confidence branches w/o a checkpoint
– If it exceeds a threshold, stall

• Threshold selection
– Fixed
  • Varies greatly across programs
  • Can deteriorate performance significantly
– Adaptive
  • Robust performance

• Minimize recovery cost while conserving good speculation opportunities
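The key idea amounts to one counter and one comparison, which the following sketch models with a fixed threshold. The class `BranchTap` and its method names are hypothetical, and the sketch does not model which pipeline stage is stalled or how the threshold adapts.

```python
class BranchTap:
    """Counts in-flight low confidence branches that have no checkpoint."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0

    def branch_enters(self, low_confidence, has_checkpoint):
        if low_confidence and not has_checkpoint:
            self.unprotected += 1

    def branch_leaves(self, low_confidence, has_checkpoint):
        """Call when the branch resolves or is squashed."""
        if low_confidence and not has_checkpoint:
            self.unprotected -= 1

    def should_stall(self):
        # Stall only when the count exceeds the threshold.
        return self.unprotected > self.threshold


tap = BranchTap(threshold=1)
tap.branch_enters(low_confidence=True, has_checkpoint=True)    # covered: not counted
tap.branch_enters(low_confidence=True, has_checkpoint=False)   # count = 1
stall_before = tap.should_stall()                              # 1 does not exceed 1
tap.branch_enters(low_confidence=True, has_checkpoint=False)   # count = 2
stall_after = tap.should_stall()                               # 2 exceeds 1
tap.branch_leaves(low_confidence=True, has_checkpoint=False)   # one branch resolves
stall_resolved = tap.should_stall()
```

A small threshold stalls sooner and limits recovery cost. A large one keeps more speculation in flight.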


Threshold Adaptation Policy

[Figure: execution timeline (cycles) divided into "No adaptation" and "Sample & adapt" periods; labels: WT, Next WT.]

• BranchTap adapts across and within applications




Results Overview

• Performance w/o Checkpoints
– BranchTap improves even with just an ROB

• Performance w/ 4 Checkpoints
– BranchTap improves over conventional recovery methods

• Performance w/ Larger Checkpoint Predictors
– BranchTap offers better performance than a 64x larger predictor


Methodology

• Simulator based on SimpleScalar

• 24 SPEC CPU 2000 benchmarks

• Reference inputs

• Processor configurations
– 8-way OoO core
– Up to 1K in-flight instructions
– 1K-entry confidence table for low confidence branch identification

• 1B committed instructions after skipping 100B


“Perfect Checkpointing” Configuration

• A checkpoint is auto-magically taken at all mispredicted branches
– All recoveries are fast

• We report the “deterioration relative to perfect checkpointing”
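The slides do not spell out the formula behind this metric. One plausible reading, stated here as an assumption, is the fractional IPC loss relative to the perfect-checkpointing configuration, with the reported reductions computed as the fraction of that loss a scheme removes. The function names and the IPC values below are made up for illustration and are not results from the talk.

```python
def deterioration(ipc, ipc_perfect):
    """Fractional performance loss relative to perfect checkpointing (assumed definition)."""
    return 1.0 - ipc / ipc_perfect


def penalty_removed(det_conventional, det_new):
    """Fraction of the conventional deterioration that a new scheme removes."""
    return (det_conventional - det_new) / det_conventional


# Made-up IPC values for illustration only; they are not results from the talk.
d_conv = deterioration(ipc=1.80, ipc_perfect=2.00)
d_new = deterioration(ipc=1.86, ipc_perfect=2.00)
print(f"{d_conv:.2f} {d_new:.2f} {penalty_removed(d_conv, d_new):.2f}")
```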


Performance with No Checkpoints

• Deterioration relative to “perfect checkpointing”

[Figure: bar chart of deterioration (y-axis, 0% to 25%, lower is better) for gzip, vpr, lucas, art, and AVG; series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -39%.]

• BranchTap improves over conventional mechanisms

• Adaptation leads to robust performance improvements


Performance Evaluation with 4 Checkpoints

• Deterioration relative to “perfect checkpointing”

[Figure: bar chart of deterioration (y-axis, 0% to 10%, lower is better) for twolf, parser, lucas, mcf, bzip2, and AVG; series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -28%.]

• BranchTap with 4 checkpoints is better than 6 checkpoints alone


BranchTap vs. Larger Checkpoint Predictors

• BranchTap with a 1K-entry confidence table and 4 GCs:
– Higher performance than a 64K-entry confidence table with 4 GCs
– Lower complexity, virtually comes “for free”

[Figure: deterioration (y-axis, 0.0% to 3.0%, lower is better) versus confidence table size (x-axis: 64, 256, 1K, 4K, 16K, 64K), with BranchTap marked for comparison.]




Summary

• Performance with 4 (no) checkpoints
– ~28% (39%) of misprediction penalty removed
– BranchTap is robust:
  • Up to 6% (13%) better and at most 1.2% (0.1%) worse than conventional mechanisms

• BranchTap is very simple to implement
– Few counters and comparators

• BranchTap is better than other alternatives
– BT + 1K predictor better than a 64K predictor alone
– BT + 4 GCs better than 6 GCs alone


BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control

Patrick Akl and Andreas Moshovos

AENAO Research Group
Department of Electrical and Computer Engineering

University of Toronto

{pakl, moshovos}@eecg.toronto.edu
