Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism Nikolas Ioannou , Marcelo Cintra School of Informatics University of Edinburgh


Complementing User-Level Coarse-Grain Parallelism

with Implicit Speculative Parallelism

Nikolas Ioannou, Marcelo Cintra

School of Informatics, University of Edinburgh


Intl. Symp. on Microarchitecture - December 2011 2

Introduction

Source: Intel

Multi-cores and many-cores here to stay


Introduction

Multi-cores and many-cores are here to stay
Parallel programming is essential to realize potential
Focus on coarse-grain parallelism
Weak or no scaling of some parallel applications
Can we exploit under-utilized cores to complement coarse-grain parallelism?
– Nested parallelism in multi-threaded applications
– Exploit it using implicit speculative parallelism


Contributions

Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
– Improve scalability by 40% on avg.
– Same energy consumption
Detailed analysis of multithreaded scalability:
– Performance bottlenecks
– Behavior on different input datasets
Auto-tuning to dynamically select the number of explicit and implicit threads


Outline

Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions


Bottlenecks: Large Critical Sections

[Figure: Integer Sort (IS), NASPB. Thread timeline for T0–T3 over time; speedup (0–3) vs. cores (0–60); normalized execution time (0–1.2) vs. cores (2–64), broken down into Busy, Lock, and Barrier.]


Bottlenecks: Load Imbalance

[Figure: RADIOSITY, SPLASH 2. Thread timeline for T0–T3 over time; speedup (0–20) vs. cores (0–120); normalized execution time (0–0.6) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]

Can we use these cores to accelerate this app.?


Outline

Introduction
Motivation
Proposal
Evaluation Methodology
Results
Low power nested parallelism
Conclusions


Proposal

Programming:
– Users explicitly parallelize code
– Tradeoff development time for performance gains
Architecture and Compiler:
– Exploit fine-grain parallelism on top of user threads
– Thread-Level Speculation (TLS) within each user thread
Hardware:
– Support both explicit and implicit threads simultaneously in a nested fashion


Proposal

    #pragma omp parallel for
    for(j = 0; j < M; ++j) {
      …
      for(i = 0; i < N; ++i) {
        … = A[L[i]] + …
        …
        A[K[i]] = …
      }
      …
    }

[Figure: explicit threads T0 … TK … TL … TM; within TK and TL, speculative threads TK,i, TK,i+1, TK,i+2, TK,i+3 and TL,i, TL,i+1, TL,i+2, TL,i+3.]
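The inner loop on this slide reads A[L[i]] and writes A[K[i]], so whether two iterations touch the same element depends on the run-time contents of L and K. As a minimal sketch (the helper name `first_raw_conflict` is hypothetical and not from the slides), the C function below reports the first inner-loop iteration that reads an element written by an earlier iteration. That is the kind of cross-iteration dependence a compiler cannot rule out statically, and which speculative execution has to detect at run time.

```c
#include <stddef.h>

/* Hypothetical helper (not from the slides): returns the index of the first
 * iteration i whose read of A[L[i]] targets an element written by an earlier
 * iteration j < i (that is, K[j] == L[i]), or -1 if no such cross-iteration
 * read-after-write dependence exists. */
long first_raw_conflict(const int *L, const int *K, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        for (size_t j = 0; j < i; ++j) {
            if (K[j] == L[i]) {
                return (long)i;
            }
        }
    }
    return -1;
}
```

If the function returns -1 for a given pair of index arrays, the inner iterations are independent for that input; any other result marks an iteration that could not safely run ahead of its predecessors without speculation support.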


Proposal: Many-core Architecture

Many-core partitioned in clusters (tiles)
Coherence (MESI)
– Snooping coherence within cluster
– Directory coherence across clusters
Support for TLS only within cluster
– Snooping TLS protocol
– Speculative buffering in L1 data caches
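The two-level organisation above can be sketched as a small C helper. This is an illustration only: it assumes cores are numbered consecutively and grouped into tiles of `tile_size` cores, and the names `tile_of`, `path_between`, and `tls_allowed` are hypothetical rather than taken from the slides.

```c
#include <stdbool.h>

/* Assumption: cores are numbered 0..N-1 and consecutive groups of
 * tile_size cores form one cluster (tile). */
int tile_of(int core, int tile_size)
{
    return core / tile_size;
}

typedef enum { SNOOP_WITHIN_TILE, DIRECTORY_ACROSS_TILES } coherence_path;

/* Which coherence mechanism covers traffic between two cores:
 * snooping inside a tile, the directory between tiles. */
coherence_path path_between(int core_a, int core_b, int tile_size)
{
    if (tile_of(core_a, tile_size) == tile_of(core_b, tile_size)) {
        return SNOOP_WITHIN_TILE;
    }
    return DIRECTORY_ACROSS_TILES;
}

/* TLS is supported only within a cluster, so two cores can host
 * speculative threads of the same user thread only if they share a tile. */
bool tls_allowed(int core_a, int core_b, int tile_size)
{
    return tile_of(core_a, tile_size) == tile_of(core_b, tile_size);
}
```

With 4-core tiles, for example, cores 0 and 3 share a tile and communicate by snooping, while cores 3 and 4 sit in different tiles and go through the directory.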


Proposal: Many-core Architecture

[Figure: 32 tiles (T0–T31) arranged in a 4 x 8 grid with four memory controllers (Mem. Contr.). Tile detail: four cores (C0–C3), each with an instruction cache (IC) and a data cache (DC), plus an L2 cache and a directory/router.]


Complementing Coarse-Grain Parallelism

[Figure: thread timelines over time for 4 threads (T0–T3) and for 8 threads (T0–T7).]

2x Explicit Threads


Complementing Coarse-Grain Parallelism

[Figure: thread timelines over time for 4 threads (T0–T3) and for 8 threads (T0–T7).]

4ETs + 4ISTs
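Reading "4ETs + 4ISTs" as four explicit threads plus four implicit speculative threads on eight cores (an interpretation of the slide's abbreviations), the split of a fixed core budget can be sketched as below. The struct and function names are hypothetical.

```c
/* Hypothetical illustration: how a fixed number of cores divides into
 * explicit threads (ETs) and implicit speculative threads (ISTs) when every
 * explicit thread runs with tls_ways-way speculation (1 means no TLS). */
typedef struct {
    int explicit_threads;
    int implicit_threads;
} core_split;

core_split split_cores(int cores, int tls_ways)
{
    core_split s = {0, 0};
    if (cores <= 0 || tls_ways <= 0 || cores % tls_ways != 0) {
        return s; /* invalid configuration */
    }
    s.explicit_threads = cores / tls_ways;
    s.implicit_threads = cores - s.explicit_threads;
    return s;
}
```

Under this reading, `split_cores(8, 1)` corresponds to the "2x Explicit Threads" option (8 ETs, no ISTs) and `split_cores(8, 2)` to "4ETs + 4ISTs".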


Expected Speedup Behavior

[Figure: speedup vs. cores (1–64) for Baseline, 2-way TLS, and 4-way TLS. Points A, B, and C mark the curves; the core axis is divided into a baseline speedup region, a 2-way TLS speedup region, and a 4-way TLS speedup region.]
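The three regions can be illustrated with a deliberately simple model, which is an assumption of this sketch and not the authors' analysis: with a fixed core budget, w-way TLS leaves cores/w explicit threads, and the resulting speedup is taken to be the baseline speedup at that thread count multiplied by a constant per-thread TLS gain. The function name and all inputs are hypothetical.

```c
#include <stddef.h>

/* Illustrative model only (not from the slides).
 * base[k]     : baseline speedup with 2^k explicit threads, k < levels
 * tls_gain[w] : assumed per-thread gain of (2^w)-way TLS; tls_gain[0] = 1.0
 * core_level  : log2 of the number of cores available
 * Returns the TLS width (1, 2, 4, ...) with the highest modelled speedup,
 * or 0 if the inputs are invalid. */
int best_tls_ways(const double *base, size_t levels, size_t core_level,
                  const double *tls_gain, size_t gains)
{
    if (core_level >= levels || gains == 0) {
        return 0;
    }
    size_t best = 0;
    double best_speedup = base[core_level] * tls_gain[0];
    for (size_t w = 1; w < gains && w <= core_level; ++w) {
        /* w-way TLS leaves 2^(core_level - w) explicit threads. */
        double s = base[core_level - w] * tls_gain[w];
        if (s > best_speedup) {
            best_speedup = s;
            best = w;
        }
    }
    return 1 << best;
}
```

For a baseline curve that peaks and then degrades, this model picks no TLS at low core counts, 2-way TLS once the baseline flattens, and 4-way TLS once adding explicit threads hurts, which matches the ordering of the regions in the figure.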


Proposal: Auto-Tuning the Thread Count

Find the scalability tipping point dynamically
Choose whether to employ implicit threads
Simple hill climbing approach
Applicable to OpenMP applications that are amenable to Dynamic Concurrency Throttling (DCT [Curtis-Maury PACT’08])

Developed a prototype in the Omni OpenMP System


Auto-tuning example

    …
    #pragma omp parallel for
    for(j = 0; j < M; ++j) {
      …
      for(i = 0; i < N; ++i) {
        … = A[L[i]] + …
        …
        A[K[i]] = …
      }
      …
    }
    …

Learning (region i, step 1; M = 32)

omp parallel region i detected:
First time: can we compute the iteration count statically, and is it less than the max core count?
Yes -> set initial Tcount to 32
Measure execution time t_i^1

Learning (region i, step 2)

omp parallel region i detected:
Set Tcount to next value (16)
Measure execution time t_i^2
t_i^2 < t_i^1 → continue exploration

Learning (region i, step 3)

omp parallel region i detected:
Set Tcount to next value (8)
Measure execution time t_i^3
t_i^3 > t_i^2 → stop exploration

Learning (region i, step 4)

omp parallel region i detected:
Use Tcount = 16, no further exploration
Set TLS to 4-way
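The exploration in this example (32, then 16, then 8, stopping once the time gets worse) can be sketched as a simple hill climb. This is a sketch under assumptions and not the Omni OpenMP prototype: halving as the "next value" is inferred from the example, and `measure_fn`, `tune_thread_count`, and `lookup_time` are hypothetical names. In a real runtime each measurement would be one execution of the parallel region; here a lookup table stands in for the timer.

```c
/* Callback that runs (or models) one invocation of the parallel region with
 * tcount explicit threads and returns its execution time. */
typedef double (*measure_fn)(int tcount, void *ctx);

/* Hill climbing as in the example: start at `initial`, halve the thread
 * count while the measured time keeps improving, and return the last
 * thread count that improved it. Halving is an assumption of this sketch. */
int tune_thread_count(int initial, measure_fn measure, void *ctx)
{
    int best = initial;
    double best_time = measure(initial, ctx);
    for (int t = initial / 2; t >= 1; t /= 2) {
        double time = measure(t, ctx);
        if (time >= best_time) {
            break; /* got worse: stop exploration */
        }
        best = t;
        best_time = time;
    }
    return best;
}

/* Hypothetical stand-in for a timer: ctx points to a table of times indexed
 * by log2(tcount). */
double lookup_time(int tcount, void *ctx)
{
    const double *times = (const double *)ctx;
    int level = 0;
    while ((1 << level) < tcount) {
        ++level;
    }
    return times[level];
}
```

With a table in which 16 threads are fastest, `tune_thread_count(32, lookup_time, table)` visits 32, 16, and 8 and settles on 16, mirroring the steps above. The slide's final step, "Set TLS to 4-way", is not modelled because the deck does not state the rule used to pick the TLS width.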


Evaluation Methodology

SESC simulator - extended to model our scheme
Architecture:
– Core:
    4-issue OoO superscalar, 96-entry ROB, 3GHz
    32KB, 4-way, DL1 $ - 32KB, 2-way, IL1 $
    16Kbit Hybrid Branch Predictor
– Tile/System:
    128 cores partitioned in 2-way or 4-way tiles (evaluate both)
    Shared L2 cache, 8MB, 8-way, 64 MSHRs
    Directory: Full-bit vector sharer list
    Interconnect: Grid, 64B links - 48GB/s to main memory


Evaluation Methodology

Benchmarks:
– 12 workloads from PARSEC 2.1, SPLASH2, NASPB
– Simulate parallel region to completion
Compilation:
– MIPS binaries generated using GCC 3.4.4
– Speculation added automatically through source-to-source compiler
– Selection of speculation regions through manual profiling
Power:
– CACTI 4.2 and Wattch


Evaluation Methodology

Alternative schemes compared against:
– Core Fusion [Ipek ISCA’07]:
    Dynamic combination of cores to deal with lowly-threaded apps
    Approximated through wide 8-issue cores with all the core resources doubled without latency increase => upper bound
– Frequency Boost:
    Inspired by Turbo Boost [Intel’08]
    For each idle core one other core gains a frequency boost of 800MHz with a 200mV increase in voltage (same power cap)

All these schemes shift resources to a subset of cores in order to improve performance
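To see why trading one idle core for an 800MHz boost can stay under the same power cap, the C sketch below applies the common approximation that dynamic power scales with V^2 * f. The function name is hypothetical, and the baseline voltage is an assumption: the slides give the 3GHz core frequency and the 200mV increase, but not the starting voltage.

```c
/* Relative dynamic power of a boosted core versus a baseline core, using the
 * approximation P_dyn ~ V^2 * f. All arguments are in consistent units
 * (GHz and volts). The baseline voltage is NOT given in the slides and must
 * be supplied as an assumption. */
double boosted_power_ratio(double base_ghz, double boost_ghz,
                           double base_volt, double boost_volt)
{
    double f_ratio = (base_ghz + boost_ghz) / base_ghz;
    double v_ratio = (base_volt + boost_volt) / base_volt;
    return f_ratio * v_ratio * v_ratio;
}
```

For example, `boosted_power_ratio(3.0, 0.8, 1.0, 0.2)` with an assumed 1.0V baseline gives roughly 1.82, so under that assumption the boosted core draws less dynamic power than the two baseline cores it replaces. A lower assumed baseline voltage raises the ratio, so the conclusion depends on that parameter.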


Bottom Line

Speedup over best scalability point

[Figure: speedup (0.8–2.0) per benchmark for TLS-2, TLS-4, CFusion, and FBoost.]

TLS-4: 41% avg
TLS-2: 27% avg

Energy

[Figure: normalized energy (0.6–2.0) per benchmark for 2TLS, 4TLS, CFusion, and FBoost.]

Showing best performing point for each scheme
Energy consumption slightly lower on avg
Spending less time in busy synchronization
High misspeculation: higher energy
Little synchronization: higher energy


Serial/Critical Sections

[Figure: is (NASPB). Speedup (0–6) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–3.0) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Load Imbalance

[Figure: radiosity (SPLASH2). Speedup (0–25) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–2.5) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Synchronization Heavy

[Figure: ocean (SPLASH2). Speedup (0–14) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–1.8) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Coarse-Grain Partitioning

[Figure: swaptions (PARSEC). Speedup (0–30) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–2.0) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Poor Static Partitioning

[Figure: sp (NASPB). Speedup (0–12) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–1.8) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Effect of Dataset size

Unchanged behavior: cholesky
Also: canneal, ocean, ft, is, sp

[Figure: speedup (0–14) vs. cores (up to 120) for base, baseL, TLS-2, TLS-2L, TLS-4, and TLS-4L.]


Effect of Dataset size

Improved scalability, but TLS boost remains: swaptions
Also: bodytrack, radiosity, ep

[Figure: speedup (0–50) vs. cores (0–120) for base, baseL, TLS-2, TLS-2L, TLS-4, and TLS-4L.]

Effect of Dataset size

Improved scalability, lessened TLS boost: streamcluster

[Figure: speedup (0–60) vs. cores (up to 120) for base, baseL, TLS-2, TLS-2L, TLS-4, and TLS-4L.]


Effect of Dataset size

Worse scalability, even better TLS boost: water

[Figure: speedup (0–60) vs. cores (up to 120) for base, baseL, TLS-2, TLS-2L, TLS-4, and TLS-4L.]


Conclusions

Multicores and many-cores are here to stay
– Parallel programming essential to exploit new hardware
– Some coarse-grain parallel programs do not scale
– Enough nested parallelism to improve scalability
Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
– Significant scalability improvement of 40% on avg
– No increase in total energy consumption
– Presented an auto-tuning mechanism to dynamically choose the number of threads that performs within 6% of the oracle


Related Work

[von Praun PPoPP’07] Implicit ordered transactions
[Kim Micro’10] Speculative Parallel-stage Decoupled Software Pipelining
[Ooi ICS’01] Multiplex
[Madriles ISCA’09] Anaphase
[Rajwar MICRO’01], [Martinez ASPLOS’02] Speculative Lock Elision
[Moravan ASPLOS’06], etc., Nested transactional memory


Bibliography

[Intel’08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors, White Paper, 2008

[Ipek ISCA’07] Ipek et al. Core fusion: Accommodating software diversity in chip multiprocessors, ISCA 2007

[von Praun PPoPP’07] C. von Praun et al. Implicit parallelism with ordered transactions, PPoPP 2007

[Kim Micro’10] Kim et al. Scalable speculative parallelization in commodity clusters, MICRO 2010

[Ooi ICS’01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor, ICS 2001

[Madriles ISCA’09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading, ISCA 2009


[Rajwar MICRO’01] R. Rajwar and J.R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution, MICRO 2001

[Martinez ASPLOS’02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications, ASPLOS 2002

[Moravan ASPLOS’06] Moravan et al. Supporting nested transactional memory in LogTM, ASPLOS 2006

[Curtis-Maury PACT’08] Curtis-Maury et al. Prediction models for multi-dimensional power-performance optimization on many-cores, PACT 2008


Benchmark details


Fetched Instructions

[Figure: normalized total fetched instructions (0.5–1.2) per benchmark for TLS-2, TLS-4, FBoost, and CFusion.]


Failed Speculation

[Figure: normalized execution time (0%–100%) per benchmark for TLS-2 and TLS-4, broken down into Restart and Busy.]


Serial/Critical Sections

[Figure: bodytrack (PARSEC). Speedup (0–7) vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time (0–1.6) vs. cores (2–128), broken down into Busy, Lock, and Barrier.]


Auto-tuning

OpenMP apps
Performs within 6% of static oracle

[Figure: speedup (0–25) for ep, ft, is, and sp under Static Oracle and Auto-tuning.]