TurboROB: A Low Cost Checkpoint/Restore Accelerator

Page 1: TurboROB  A Low Cost Checkpoint/Restore Accelerator

HiPEAC 2008

TurboROB: A Low Cost Checkpoint/Restore Accelerator

Patrick Akl¹ and Andreas Moshovos

AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto

¹ Now with AMD/ATI

Page 2: What Happens on a Branch Misprediction?

[Figure: execution timeline. Predict a branch outcome, execute along the predicted path, misprediction discovered, recover processor state, redirect fetch, resume execution on the correct path.]

• We wish to make the recovery fast

Page 3: Recovery Mechanisms Overview

• ROB:
  – Buffer all changes
  – Slow
• Instantaneous checkpoints:
  – Snapshot before speculating
  – Fast
  – Problem: can’t have enough checkpoints
• Checkpoint prediction
  – Allocate the few checkpoints judiciously
• Speculation control
  – Sometimes deeper speculation = higher recovery cost
    • Can hurt performance
  – Throttle speculation

Page 4: TurboROB Overview

• Complements or Replaces Existing Mechanisms
• ROB: recover at any point
• TurboROB: recover only at frequent points
• Improves performance for most programs
  – Misprediction performance penalty reduced by 28% on average
• BranchTap comes “for free”
  – Very simple to implement
  – Better than more accurate checkpoint predictors

Page 5: Outline

• Background

• BranchTap

• Methodology and Results

• Summary

Page 6: State Recovery Example: Register Alias Table

[Figure: the RAT is indexed by architectural register and holds one physical register per entry. Dimension labels: "# arch. regs" and "Lg(# arch. regs)". Physical registers shown: p1, p2, p3, p4, p5.]

Original Code

A  add r1, r2, 100
B  breq r1, E
C  sub r1, r2, r2

Renamed Code

A  add p4, p2, 100
B  breq p4, E
C  sub p5, p2, p2
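The renaming example can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in the talk: the `Renamer` class, the initial mapping (r1 to p1, r2 to p2, r3 to p3), and the policy of handing out physical registers in increasing order starting at p4 are all assumptions, chosen so the output matches the renamed code in the example (reading the destination of C as p5).

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy register alias table (RAT): architectural name -> physical name."""
    # Assumed initial mapping; the slides do not state it explicitly.
    rat: dict = field(default_factory=lambda: {"r1": "p1", "r2": "p2", "r3": "p3"})
    next_free: int = 4  # assumed: physical registers are allocated in order

    def rename(self, op, dest, srcs):
        # Sources read the current mapping; immediates and labels pass through.
        new_srcs = [self.rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            # A destination gets a fresh physical register and updates the RAT.
            new_dest = f"p{self.next_free}"
            self.next_free += 1
            self.rat[dest] = new_dest
        return (op, new_dest, new_srcs)


renamer = Renamer()
a = renamer.rename("add", "r1", ["r2", "100"])
b = renamer.rename("breq", None, ["r1", "E"])
c = renamer.rename("sub", "r1", ["r2", "r2"])
```

If branch B turns out to be mispredicted, instruction C must be squashed and the RAT entry for r1 has to go back from p5 to p4. Restoring that mapping is the state recovery problem the rest of the talk addresses.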

Page 7: ROB: Slow, Fine-Grain Recovery

[Figure: reorder buffer in program order holding several branches (B); the RAT is marked INVALID.]

1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order

Each entry contains

1. Architectural destination register
2. Its previous RAT map

• Too slow: recovery latency proportional to number of instructions to squash
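The ROB-based recovery steps can be sketched as an undo log. This is an illustrative model under stated assumptions: the class and method names are hypothetical, each ROB entry stores only the two fields listed on the slide, and one undo step per squashed instruction stands in for the recovery latency.

```python
class RobRecovery:
    """Toy model of RAT recovery by walking the ROB backwards."""

    def __init__(self, rat):
        self.rat = dict(rat)
        self.rob = []  # program order: (arch_dest, previous RAT mapping)

    def dispatch(self, arch_dest, new_phys):
        """Log the old mapping, then update the RAT. Returns the ROB index."""
        self.rob.append((arch_dest, self.rat.get(arch_dest)))
        if arch_dest is not None:
            self.rat[arch_dest] = new_phys
        return len(self.rob) - 1

    def recover(self, branch_index):
        """Squash everything younger than the branch, newest first.

        Returns the number of undo steps, which grows with the number
        of instructions squashed.
        """
        steps = 0
        while len(self.rob) - 1 > branch_index:
            arch_dest, previous = self.rob.pop()
            if arch_dest is not None:
                self.rat[arch_dest] = previous
            steps += 1
        return steps


model = RobRecovery({"r1": "p1", "r2": "p2"})
model.dispatch("r1", "p4")             # A: add
branch = model.dispatch(None, None)    # B: breq, no destination register
model.dispatch("r1", "p5")             # C: sub, on the wrong path
model.dispatch("r2", "p6")             # another wrong-path instruction
undo_steps = model.recover(branch)
```

In this run two wrong-path instructions are undone one at a time, so the cost scales with how far speculation ran past the branch.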

Page 8: Global Checkpoints: Fast, Coarse-Grain Recovery

[Figure: reorder buffer in program order holding several branches (B), some with a checkpoint; the RAT is marked INVALID.]

1. Misprediction discovered

• Branch w/ GC: Recovery is “Instantaneous”
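By contrast, checkpoint-based recovery restores the whole RAT in one step. The sketch below is a simplified model, not the hardware design: `CheckpointedRat` and its methods are hypothetical names, a checkpoint is modeled as a full copy of the mapping, and the default of four checkpoints is an assumption borrowed from the 4-checkpoint configuration evaluated later in the talk.

```python
class CheckpointedRat:
    """Toy RAT with a small, fixed pool of global checkpoints (GCs)."""

    def __init__(self, rat, max_checkpoints=4):
        self.rat = dict(rat)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}  # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT before speculating; fails if the pool is full."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.rat)
        return True

    def update(self, arch_reg, phys_reg):
        self.rat[arch_reg] = phys_reg

    def release(self, branch_id):
        """Free the checkpoint once the branch resolves correctly."""
        self.checkpoints.pop(branch_id, None)

    def recover(self, branch_id):
        """Restore the snapshot in one step; raises KeyError without a GC."""
        self.rat = dict(self.checkpoints[branch_id])
        # This checkpoint and all younger ones belong to resolved or
        # squashed branches, so they are freed.
        ids = list(self.checkpoints)
        for cid in ids[ids.index(branch_id):]:
            del self.checkpoints[cid]


crat = CheckpointedRat({"r1": "p4", "r2": "p2"}, max_checkpoints=2)
took_b = crat.take_checkpoint("B")
crat.update("r1", "p5")
took_b2 = crat.take_checkpoint("B2")
crat.update("r2", "p6")
took_b3 = crat.take_checkpoint("B3")  # pool exhausted
crat.recover("B")
```

The failed third allocation shows the limitation the next slide raises: with only a few checkpoints, some branches are left uncovered and must fall back to the slower ROB walk.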

Page 9: Impact of More Checkpoints

• More checkpoints?
  – Power hungry structure
  – Increased delay
• Only a few checkpoints can practically be implemented
  – Cannot always cover all branches

[Figure: RAT (architectural register to physical register) with a working copy and checkpoints. Panels: Concept, Actual Implementation.]

Page 10: Intelligent Checkpointing

• State of the art solution
  – Checkpoint allocation: Allocate checkpoints at hard-to-predict branches
  – Checkpoint management: Release checkpoints as soon as they are no longer needed
• Use few checkpoints efficiently
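One way to identify hard-to-predict branches is a table of resetting confidence counters. The sketch below is an assumption for illustration only: the slides do not describe the estimator's internals, so the indexing function, the counter width, and the threshold are hypothetical. The 1,024-entry default mirrors the 1K-entry confidence table listed in the methodology.

```python
class ConfidenceTable:
    """Resetting-counter confidence estimator (assumed design)."""

    def __init__(self, entries=1024, threshold=4, max_count=15):
        self.counters = [0] * entries
        self.threshold = threshold
        self.max_count = max_count

    def _index(self, pc):
        # Assumed indexing: drop the byte offset, then wrap to table size.
        return (pc >> 2) % len(self.counters)

    def is_low_confidence(self, pc):
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = 0  # a misprediction resets confidence


def should_checkpoint(table, pc, free_checkpoints):
    """Spend a checkpoint only on low confidence branches, if one is free."""
    return free_checkpoints > 0 and table.is_low_confidence(pc)


table = ConfidenceTable(threshold=2)
pc = 0x400
low_at_start = table.is_low_confidence(pc)
table.update(pc, True)
table.update(pc, True)
low_after_two_correct = table.is_low_confidence(pc)
table.update(pc, False)
low_after_mispredict = table.is_low_confidence(pc)
```

A branch that has been predicted correctly several times in a row is treated as confident and does not consume one of the scarce checkpoints.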

Page 11: Conventional Mechanisms: Recovery Scenarios

• Mispeculation on a branch w/ a GC: Direct recovery
• Mispeculation on a branch w/o a GC: Indirect recovery
• With intelligent checkpointing:
  – 30% indirect recoveries account for 75% of performance loss

[Figure: two ROB diagrams with branches (B) and checkpoints, labeled "Fast Recovery" and "Slow Recovery".]

Page 12: Outline

• Background

• BranchTap

• Methodology and Results

• Summary

Page 13: BranchTap Motivation

[Figure: two ROB diagrams, "No Wait Scenario" and "Wait Scenario", each showing branches (B), checkpoints, a low confidence branch, the point where the misprediction is discovered, and the resulting recovery cost.]

• Sometimes, it is better to wait if no checkpoint is available

Page 14: BranchTap Concept

• Key idea: stall when speculation is likely to deteriorate performance
  – Count the number of low confidence branches w/o a checkpoint
  – If it exceeds a threshold, stall
• Threshold selection
  – Fixed
    • Varies greatly across programs
    • Can deteriorate performance significantly
  – Adaptive
    • Robust performance
• Minimize recovery cost while conserving good speculation opportunities
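The counting rule can be sketched as a single counter and a comparator. This is a minimal model with a fixed threshold, not the evaluated design: the `BranchTap` class, its method names, and the choice to decrement when a counted branch resolves are assumptions, and the adaptive threshold policy is not modeled here.

```python
class BranchTap:
    """Toy speculation throttle: stall when too many unprotected
    low confidence branches are in flight."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0  # low confidence branches without a checkpoint

    def on_branch_dispatch(self, low_confidence, has_checkpoint):
        """Returns True if this branch was counted, so the caller can
        pass that flag back when the branch resolves."""
        if low_confidence and not has_checkpoint:
            self.unprotected += 1
            return True
        return False

    def on_branch_resolve(self, was_counted):
        if was_counted:
            self.unprotected -= 1

    def should_stall(self):
        # "If it exceeds a threshold, stall"
        return self.unprotected > self.threshold


tap = BranchTap(threshold=1)
first = tap.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
stall_after_one = tap.should_stall()
covered = tap.on_branch_dispatch(low_confidence=True, has_checkpoint=True)
second = tap.on_branch_dispatch(low_confidence=True, has_checkpoint=False)
stall_after_two = tap.should_stall()
tap.on_branch_resolve(first)
stall_after_resolve = tap.should_stall()
```

A branch that already holds a checkpoint is never counted, since its recovery is fast regardless of how deep speculation runs past it.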

Page 15: Threshold Adaptation Policy

[Figure: execution timeline (cycles) showing "No adaptation" and "Sample & adapt" intervals; labels: WT, Next WT.]

• BranchTap adapts across and within applications

Page 16: Outline

• Background

• BranchTap

• Methodology and Results

• Summary

Page 17: Results Overview

• Performance w/o Checkpoints
  – BranchTap improves even with just an ROB
• Performance w/ 4 Checkpoints
  – BranchTap improves over conventional recovery methods
• Performance w/ Larger Checkpoint Predictors
  – BranchTap offers better performance than a 64x larger predictor

Page 18: Methodology

• Simulator based on SimpleScalar
• 24 SPEC CPU 2000 benchmarks
• Reference inputs
• Processor configurations
  – 8-way OoO core
  – Up to 1K in-flight instructions
  – 1K-entry confidence table for low confidence branch identification
• 1B committed instructions after skipping 100B

Page 19: “Perfect Checkpointing” Configuration

• A checkpoint is auto-magically taken at all mispredicted branches
  – All recoveries are fast
• We report the “deterioration relative to perfect checkpointing”

Page 20: Performance with No Checkpoints

• Deterioration relative to “perfect checkpointing”

[Figure: bar chart. Y axis: deterioration, 0% to 25%, with a "better" direction marker. X axis: gzip, vpr, lucas, art, AVG. Series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -39%.]

• BranchTap improves over conventional mechanisms
• Adaptation leads to robust performance improvements

Page 21: Performance Evaluation with 4 Checkpoints

• Deterioration relative to “perfect checkpointing”
• BranchTap with 4 checkpoints is better than 6 checkpoints alone

[Figure: bar chart. Y axis: deterioration, 0% to 10%, with a "better" direction marker. X axis: twolf, parser, lucas, mcf, bzip2, AVG. Series: Conventional, BranchTap Adaptive, BranchTap Non-Adaptive. Annotation: -28%.]

Page 22: BranchTap vs. Larger Checkpoint Predictors

• BranchTap with a 1K-entry confidence table and 4 GCs:
  – Higher performance than a 64K-entry confidence table with 4 GCs
  – Lower complexity, virtually comes “for free”

[Figure: chart. Y axis: deterioration, 0.0% to 3.0%, with a "better" direction marker. X axis: confidence table size (64, 256, 1K, 4K, 16K, 64K). The BranchTap result is marked on the chart.]

Page 23: Outline

• Background

• BranchTap

• Methodology and Results

• Summary

Page 24: Summary

• Performance with 4 (no) checkpoints
  – ~28 (39)% of misprediction penalty removed
  – BranchTap is robust:
    • Up to 6 (13)% better and max 1.2 (0.1)% worse than conventional mechanisms
• BranchTap is very simple to implement
  – Few counters and comparators
• BranchTap is better than other alternatives
  – BT + 1K predictor better than a 64K predictor alone
  – BT + 4 GCs better than 6 GCs alone

Page 25: BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control

Patrick Akl and Andreas Moshovos

AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto

{pakl, moshovos}@eecg.toronto.edu