boosting single-thread performance in multi-core systems …isca09.cs.columbia.edu/pres/40.pdf · 1...

27
1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading Carlos Madriles, Pedro López, Josep M. Codina, Enric Gibert, Fernando Latorre, Alejandro Martínez, Raúl Martínez and Antonio González ISCA’09 June 24 th Intel Labs Barcelona

Upload: others

Post on 14-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

11

Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain

Multi-Threading

Carlos Madriles, Pedro López, Josep M. Codina, Enric Gibert,

Fernando Latorre, Alejandro Martínez, Raúl Martínez and Antonio González

ISCA’09

June 24th

Intel Labs Barcelona

Page 2: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

22

Why Is Single Thread Performance Important?

• Exploiting TLP is more power-efficient than exploiting ILP

– The trend is to increase the number of cores integrated into the same die

– Power and thermal constraints advocate for simpler cores

• However, single thread performance matters

– Sequential and lightly threaded apps

– Sequential regions limit scalability of parallelism (Amdahl’s law)

1 2 4 8 16 32 64

s=0s=0,1s=0.5

2

10

100s : serial fraction

overhead=0

overhead>0s=0s=0.1

Speed-up (c) = 1 / (s + ((1-s) / c))

Schemes to boost single-thread performance on CMPs are crucial

Page 3: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

33

Main Research Efforts

• Main research efforts to exploit both ILP and TLP for boosting single-thread performance on CMPs

– Adaptative and cluster mircroarchitectures

– Combine or fuse cores

– Parallelism constrained by the instruction window

– Speculative multithreading (SpMT) techniques

– Traditional schemes constrain the exploitable parallelism

New techniques to exploit parallelism are needed

Page 4: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

44

Our Contribution: Anaphase

• Novel speculative multithreading technique to boost single-thread applications on CMPs

• Effectively exploits ILP, TLP, and MLP

– Thread decomposition performed at instruction granularity

– Adapts to available parallelism in the application

– Minimizes inter-thread dependences

– Improves workload balance

– Increases the amount of exploitable MLP

– Hw / Sw hybrid scheme

– Low hardware complexity and powerful thread decomposition

• Cost-effective hardware support on top of a conventional CMP

– A novel uncore module supports anaphase threads execution

– Very few changes on the cores

Page 5: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

55

Outline

• The Anaphase Scheme

– Threading Model

– Decomposition Algorithm

• The Anaphase Hardware Support

– Architecture Overview

– Architectural State Management

• Evaluation

• Conclusions

Page 6: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

66

Anaphase Scheme Overview

C0

L2

L3

C1 C2

L2

C3

CMP with Anaphase Hw

Profile

Select hot regions

Fine-grain Thread

Decomposition

Serial Program

Paralellized Program

Compiler

HwSw

TILE

Anaphase Parallelizer

It behaves like a fused core

Page 7: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

77

Anaphase Scheme Overview

C0

L2

L3

C1 C2

L2

C3

CMP with Anaphase Hw

Profile

Select hot regions

Fine-grain Thread

Decomposition

Serial Program

Paralellized Program

Compiler

HwSw

TILE

Anaphase Parallelizer

Main distinguishing features:

1. Threading model2. Fine-grain thread decomposition3. Hardware support

It behaves like a fused core

Page 8: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

88

Threads may be composed of non-consecutive instructions

Threading Model: Key Features

A1

A2

A3 (Br)

B1

B2

B3

B4

D1

D2

D3 (Br)

C1

C2 (Br)

Original Region CFG

A2

B2

B3

D1

D2

C1

C2 (Br)

A1

A3 (Br)

B1

B4

D3 (Br)

Thread 1 Thread 2

Page 9: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

99

Uses a heuristic to decide how to handle inter-thread dependences:

• Ignore Hw recovers in case of error

• Explicit communication• Pre-compute (p-slice)

Threading Model: Key Features

A1

A2

A3 (Br)

B1

B2

B3

B4

D1

D2

D3 (Br)

C1

C2 (Br)

Original Region CFG

A2

B2

B3

D1

D2

C1

C2 (Br)

A1

A3 (Br)

B1

B4

D3 (Br)

Thread 1 Thread 2

Page 10: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1010

Replicate those instructions needed to generate the consumed value

Threading Model: Key Features

A1

A2

A3 (Br)

B1

B2

B3

B4

D1

D2

D3 (Br)

C1

C2 (Br)

Original Region CFG

A2

B2

B3

D1

D2

C1

C2 (Br)

A1

A3 (Br)

B1

B4

D3 (Br)

Thread 1 Thread 2

A1’

Page 11: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1111

A2

B2

B3

D1

D2

C1

C2 (Br)

A1’

A3 (Br)

D3 (Br)

Threading Model: Key Features

A1

A2

A3 (Br)

B1

B2

B3

B4

D1

D2

D3 (Br)

C1

C2 (Br)

Original Region CFG

A1

A3 (Br)

B1

B4

D3 (Br)

Thread 1 Thread 2

Threads include all required control instructions• Some branches may need to be replicated• Keep Hw simple (no changes in front-end)

Page 12: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1212

Profile

Select hotregions

Fine-grainThread

Decomposition

Decomposition Algorithm [1]

Compiler

• Multi-level graph partitioning

– Coarsening

– First partition of the region DDG

– Based on: critical path, independence, workload balance, and delinquent loads

– Refinement

– Evolution of the traditional K-L algorithm

– Refines first partition based on a model that estimates execution time

[1] C. Madriles, P. López, J.M. Codina, E. Gibert, F. Latorre, A. Martínez, R. Martínez, A. González, “Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading”, to be published in PACT’09.

Page 13: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1313

Outline

• The Anaphase Scheme

– Threading Model

– Decomposition Algorithm

• The Anaphase Hardware Support

– Architecture Overview

– Architectural State Management

• Evaluation

• Conclusions

Page 14: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1414

Architecture Overview

MEM

Core 0

IC DC

Core 1

IC DC

L2

TILE

Core 0

IC DC

Core 1

IC DC

L2

ICMC ICMC

About 7% increase in area on an Intel® CoreTM 2 Duo

Page 15: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1515

ICMC

Tile in cooperativemode

Tile in single-coremode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Sequential

0

1

2

3

4

5

6

7

Page 16: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1616

0

ICMC

Tile in cooperativemode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Thread 1 Thread 2

0

2

4

6

1

3

5

7

Local retire in threadorder

1

Page 17: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1717

ICMC

Tile in cooperativemode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Thread 1 Thread 2

0

2

4

6

1

3

5

7

2 1

Global commit in program order

Page 18: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1818

ICMC

Tile in cooperativemode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Thread 1 Thread 2

0

2

4

6

1

3

5

7

5

6

4

7

Page 19: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

1919

ICMC

Tile in cooperativemode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Thread 1 Thread 2

0

2

4

6

1

3

5

7

17

Inter-core memoryviolation!

Page 20: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2020

ICMC

Tile in cooperativemode

Tile in single-coremode

Anaphase Execution

Core 0

IC DC

Core 1

IC DC

L2

TILE

Sequential

0

1

2

3

4

5

6

7

Page 21: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2121

Architectural State Management

• Memory– Speculative state kept private in L1 and L2 caches

– Merge performed in program order in the L2 cache

– Detects inter-core memory violations at chunk granularity

– Supports frequent memory checkpoints and very fast squash and commit operations

• Registers– Speculative state kept in core PRF

– Register architectural state merged in the ICMC in program order

– Supports frequent register checkpoints

– Reduces pressure on the PRF

– It allows a core to make forward progress

Page 22: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2222

Outline

• The Anaphase Scheme

– Threading Model

– Decomposition Algorithm

• The Anaphase Hardware Support

– Architecture Overview

– Architectural State Management

• Evaluation

• Conclusions

Page 23: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2323

Experimental Framework

• Compiler

– Anaphase partitioning implemented on top of the Intel®

Compiler (icc)

• Simulator

– Cycle accurate x86 CMP (1 tile with 2 OoO cores modeled)

– Two different core configurations evaluated

– Intel® CoreTM 2 like (Medium core)

– Half-sized main structures (Tiny core)

– Core Fusion[1] proposal implemented for comparison

• Benchmarks: Spec2006

– Profiling: PinPoint 20M traces using train input set

– Evaluation: 100M traces using ref input set

[1] E. Ipek, M. Kirman, N. Kirman, and J.F. Martinez, “Core fusion: Accommodating Software Diversity in Chip Multiprocessors”, in Proc. of the Int. Symp. On Computer Architecture, 2007

Page 24: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2424

0.9

1

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1.9

2

bw

aves

gam

ess

Gem

sFD

TD lbm

lesl

ie3

d

milc

nam

d

po

vray

sop

lex

sph

inx3

ton

to wrf

AV

G S

pec

FP

asta

r

bzi

p2

gcc

gob

mk

h2

64

ref

hm

mer

libq

uan

tum

mcf

om

net

pp

per

lben

ch

sjen

g

zeu

smp

AV

G S

pec

INT

AV

G

Spee

du

p

CoreFusion Anaphase

Performance Evaluation

Results shown for Tiny core configuration

2.67 2.27

MLP exploited by Anaphase

AVG SpeedUp: Anaphase=41% CoreFusion=28%

Page 25: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2525

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

bw

aves

gam

ess

Gem

sFD

TD lbm

lesl

ie3

d

milc

nam

d

po

vray

sop

lex

sph

inx3

ton

to wrf

asta

r

bzi

p2

gcc

gob

mk

h2

64

ref

hm

mer

libq

uan

tum

mcf

om

net

pp

per

lben

ch

sjen

g

zeu

smp

AV

G

Dyn

amic

Inst

ruct

ion

s (%

)

Serial Parallel Replicated Communications Squashed

Dynamic Instruction Breakdown

High coverage less 10% serial codeLow overhead 22% replicated instrs and explicit commsFew explicit comms only 3%High accuracy less 1% squashed instrs

Page 26: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2626

Conclusions

• Anaphase is a Sw / Hw technique that leverages multiple cores to boost single-thread code

• It exploits ILP, TLP, and MLP with high accuracy and low overheads

• Low hardware overhead (7%)

• Important benefits: 41% average speedup for Spec2006

Page 27: Boosting Single-thread Performance in Multi-Core Systems …isca09.cs.columbia.edu/pres/40.pdf · 1 Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain Multi-Threading

2727

Thank you!