
Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism

Nikolas Ioannou, Marcelo Cintra

School of Informatics, University of Edinburgh

Intl. Symp. on Microarchitecture - December 2011

Introduction

Multi-cores and many-cores are here to stay (figure source: Intel)

Parallel programming is essential to realize potential

Focus on coarse-grain parallelism

Weak or no scaling of some parallel applications

Can we exploit under-utilized cores to complement coarse-grain parallelism?
– Nested parallelism in multi-threaded applications
– Exploit it using implicit speculative parallelism


Contributions

Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
– Improve scalability by 40% on avg.
– Same energy consumption

Detailed analysis of multithreaded scalability:
– Performance bottlenecks
– Behavior on different input datasets

Auto-tuning to dynamically select the number of explicit and implicit threads


Outline

Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions


Bottlenecks: Large Critical Sections

[Figure: execution timeline of threads T0 to T3; speedup versus core count (up to 64 cores); normalized execution time at 2 to 64 cores, broken down into Busy, Lock and Barrier. Benchmark: Integer Sort (IS), NASPB.]


Bottlenecks: Load Imbalance

[Figure: execution timeline of threads T0 to T3; speedup versus core count (up to 128 cores); normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: RADIOSITY, SPLASH 2.]

Can we use these cores to accelerate this app.?



Proposal

Programming:
– Users explicitly parallelize code
– Tradeoff development time for performance gains

Architecture and Compiler:
– Exploit fine-grain parallelism on top of user threads
– Thread-Level Speculation (TLS) within each user thread

Hardware:
– Support both explicit and implicit threads simultaneously in a nested fashion


Proposal

#pragma omp parallel for
for(j = 0; j < M; ++j) {
  …
  for(i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}

[Figure: explicit threads T0 … TK … TL … TM; within explicit threads TK and TL, speculative threads TK,i to TK,i+3 and TL,i to TL,i+3.]
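The inner loop in the snippet reads A[L[i]] and writes A[K[i]], so whether two iterations conflict depends on index values known only at run time. The following is a minimal sketch, in C, of the kind of cross-iteration check that speculation has to perform in effect. The function name `count_raw_conflicts` and the simple "window of consecutive iterations" model are assumptions made for illustration; this is not the hardware TLS protocol from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the talk.
 * Counts the inner-loop iterations that read a location (A[L[i]]) written
 * by one of the (ways - 1) iterations immediately before them (A[K[j]]).
 * With `ways` speculative threads running consecutive iterations at the
 * same time, such a read-after-write dependence can be violated, and the
 * reading iteration would then have to restart. */
size_t count_raw_conflicts(const int *K, const int *L, size_t n, size_t ways)
{
    assert(ways >= 1);
    size_t conflicts = 0;
    for (size_t i = 0; i < n; ++i) {
        /* First earlier iteration that may still be running alongside i. */
        size_t first = (i >= ways - 1) ? i - (ways - 1) : 0;
        for (size_t j = first; j < i; ++j) {
            if (K[j] == L[i]) {   /* iteration j writes what iteration i reads */
                ++conflicts;
                break;            /* count each reading iteration once */
            }
        }
    }
    return conflicts;
}
```

With `ways` set to 1 the window is empty and the count is always zero, which matches sequential execution of the inner loop. A low count for 2 or 4 ways suggests the loop is a reasonable candidate for speculation under this simplified model.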


Proposal: Many-core Architecture

Many-core partitioned in clusters (tiles)

Coherence (MESI)
– Snooping coherence within cluster
– Directory coherence across clusters

Support for TLS only within cluster
– Snooping TLS protocol
– Speculative buffering in L1 data caches
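Since TLS is supported only within a cluster, the implicit threads of one explicit thread have to be placed on cores of the same tile. A minimal sketch of one such placement follows. The consecutive core numbering, the one-explicit-thread-per-tile policy, and all names (`Chip`, `tile_of`, `core_for`, `same_tile`) are assumptions for illustration, not details given in the talk.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed numbering, for illustration only: cores are numbered
 * consecutively, tile by tile, with `tile_size` cores in each tile. */
typedef struct {
    int tile_size;   /* cores per tile (2-way or 4-way tiles) */
    int num_cores;   /* total number of cores on the chip */
} Chip;

/* Tile (cluster) that a core belongs to. */
int tile_of(const Chip *c, int core)
{
    assert(core >= 0 && core < c->num_cores);
    return core / c->tile_size;
}

/* Core that runs implicit thread `ist` of explicit thread `et`,
 * assuming one explicit thread per tile. */
int core_for(const Chip *c, int et, int ist)
{
    assert(et >= 0 && ist >= 0 && ist < c->tile_size);
    int core = et * c->tile_size + ist;
    assert(core < c->num_cores);
    return core;
}

/* True when two cores share a tile, so communication between them can use
 * the snooping protocol; false means it goes through the directory. */
bool same_tile(const Chip *c, int a, int b)
{
    return tile_of(c, a) == tile_of(c, b);
}
```

Under this placement every implicit thread of an explicit thread lands in the same tile, so speculative state never needs to cross the directory.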


Proposal: Many-core Architecture

[Figure: chip with 32 tiles (T0 to T31) and memory controllers; each tile contains four cores (C0 to C3), each with an instruction cache (IC) and a data cache (DC), plus an L2 cache and a directory/router.]



Complementing Coarse-Grain Parallelism

[Figure: execution timelines over time. A run with threads T0 to T3 is compared with two ways of using eight cores (T0 to T7): 2x Explicit Threads, and 4ETs + 4ISTs (four explicit threads plus four implicit speculative threads).]


Expected Speedup Behavior

[Figure: speedup versus cores (1 to 64) for Baseline, 2-way TLS and 4-way TLS, with points A, B and C and three marked regions: baseline speedup region, 2-way TLS speedup region and 4-way TLS speedup region.]

Proposal: Auto-Tuning the Thread Count

Find the scalability tipping point dynamically

Choose whether to employ implicit threads

Simple hill climbing approach

Applicable to OpenMP applications that are amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT’08]

Developed a prototype in the Omni OpenMP System



Auto-tuning example

…
#pragma omp parallel for
for(j = 0; j < M; ++j) {
  …
  for(i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}
…

Learning, first execution of region i:
– omp parallel region i detected
– First time: can we compute the iteration count statically, and is it less than the max core count?
– Yes (M=32) -> set initial Tcount to 32
– Measure execution time t_i^1

Learning, second execution of region i:
– Set Tcount to next value (16)
– Measure execution time t_i^2
– t_i^2 < t_i^1 → continue exploration

Learning, third execution of region i:
– Set Tcount to next value (8)
– Measure execution time t_i^3
– t_i^3 > t_i^2 → stop exploration

Fourth execution of region i:
– omp parallel region i detected
– Use Tcount = 16, no further exploration
– Set TLS to 4-way
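The exploration walked through in this example can be sketched as a small per-region state machine. This is a minimal sketch under stated assumptions, not the Omni OpenMP prototype: the `Tuner` type and its functions are hypothetical, the "next value" is assumed to be half the current thread count (matching the 32, 16, 8 sequence of the example), and the final choice of TLS width is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-region tuning state (names are not from the talk). */
typedef struct {
    int    tcount;       /* thread count to use for the next execution */
    int    best_tcount;  /* best thread count measured so far */
    double best_time;    /* execution time at best_tcount; < 0 means none yet */
    bool   learning;     /* false once exploration has stopped */
} Tuner;

/* Start exploration at the initial thread count. */
void tuner_init(Tuner *t, int initial_tcount)
{
    assert(initial_tcount >= 1);
    t->tcount = initial_tcount;
    t->best_tcount = initial_tcount;
    t->best_time = -1.0;
    t->learning = true;
}

/* Report the measured execution time of the region when it ran with
 * t->tcount threads, and pick the thread count for the next execution. */
void tuner_report(Tuner *t, double time)
{
    if (!t->learning)
        return;                       /* exploration finished: keep choice */
    if (t->best_time < 0.0 || time < t->best_time) {
        /* First sample or an improvement: remember it and keep climbing. */
        t->best_time = time;
        t->best_tcount = t->tcount;
        if (t->tcount > 1)
            t->tcount /= 2;           /* assumed "next value": halve */
        else
            t->learning = false;      /* nothing lower left to try */
    } else {
        /* Slower than the best point: stop and fall back to the best. */
        t->tcount = t->best_tcount;
        t->learning = false;
    }
}
```

Feeding it three measurements where the second is faster than the first and the third is slower than the second leaves `tcount` at 16 with `learning` cleared, which is the outcome shown in the example.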



Evaluation Methodology

SESC simulator - extended to model our scheme

Architecture:
– Core:
  4-issue OoO superscalar, 96-entry ROB, 3GHz
  32KB, 4-way DL1 cache; 32KB, 2-way IL1 cache
  16Kbit hybrid branch predictor
– Tile/System:
  128 cores partitioned in 2-way or 4-way tiles (evaluate both)
  Shared L2 cache: 8MB, 8-way, 64 MSHRs
  Directory: full bit-vector sharer list
  Interconnect: grid, 64B links; 48GB/s to main memory


Benchmarks:
– 12 workloads from PARSEC 2.1, SPLASH2, NASPB
– Simulate parallel region to completion

Compilation:
– MIPS binaries generated using GCC 3.4.4
– Speculation added automatically through source-to-source compiler
– Selection of speculation regions through manual profiling

Power:
– CACTI 4.2 and Wattch


Alternative schemes compared against:
– Core Fusion [Ipek ISCA’07]:
  Dynamic combination of cores to deal with apps that have few threads
  Approximated through wide 8-issue cores with all the core resources doubled without latency increase => upper bound
– Frequency Boost:
  Inspired by Turbo Boost [Intel’08]
  For each idle core, one other core gains a frequency boost of 800MHz with a 200mV increase in voltage (same power cap)

All these schemes shift resources to a subset of cores in order to improve performance
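The Frequency Boost rule can be made concrete with a few lines of C. This is a sketch under assumptions: the 3GHz base clock is taken from the core configuration in the methodology, the cap of at most one boost per active core is an assumption, and the function names are hypothetical.

```c
#include <assert.h>

#define BASE_MHZ  3000  /* 3GHz core clock from the evaluated configuration */
#define BOOST_MHZ  800  /* boost granted for each idle core */

/* Number of active cores that receive a boost: one per idle core,
 * capped at the number of active cores (assumption: a core is boosted
 * at most once). */
int boosted_cores(int total_cores, int active_cores)
{
    assert(total_cores >= 0 && active_cores >= 0 && active_cores <= total_cores);
    int idle = total_cores - active_cores;
    return idle < active_cores ? idle : active_cores;
}

/* Clock of a core in MHz, depending on whether it is boosted. */
int core_mhz(int boosted)
{
    return boosted ? BASE_MHZ + BOOST_MHZ : BASE_MHZ;
}
```

For example, with 128 cores and 32 running threads, all 32 active cores can be boosted under this rule; with 100 running threads only 28 of them can.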






Bottom Line

Speedup over best scalability point

[Figure: speedup per benchmark for TLS-2, TLS-4, CFusion and FBoost.]

TLS-4: 41% avg
TLS-2: 27% avg
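One plausible reading of "speedup over best scalability point" is the ratio between the best baseline execution time across all core counts and the best execution time of a scheme across all core counts. The sketch below computes that ratio. The function names are hypothetical, and the exact definition used for the figure is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest execution time in times[0..n-1], i.e. the best-performing
 * point among the core counts that were measured. */
double best_time(const double *times, size_t n)
{
    assert(n > 0);
    double best = times[0];
    for (size_t i = 1; i < n; ++i) {
        if (times[i] < best)
            best = times[i];
    }
    return best;
}

/* Speedup of a scheme's best point over the baseline's best point.
 * A value above 1.0 means the scheme beats the best the baseline can do. */
double speedup_over_best(const double *base, size_t nb,
                         const double *scheme, size_t ns)
{
    return best_time(base, nb) / best_time(scheme, ns);
}
```

Comparing best point against best point means a scheme gets no credit for merely matching the baseline at a core count where the baseline has already stopped scaling.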

Energy

Showing best performing point for each scheme

[Figure: normalized energy per benchmark for TLS-2, TLS-4, CFusion and FBoost.]

Energy consumption slightly lower on avg
Spending less time in busy synchronization
High misspeculation: higher energy
Little synchronization: higher energy


Serial/Critical Sections

[Figure: speedup versus core count (up to 128 cores) for base, TLS-2, TLS-4, FBoost and CFusion; normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: is, NASPB.]


Load Imbalance

[Figure: speedup versus core count (up to 128 cores) for base, TLS-2, TLS-4, FBoost and CFusion; normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: radiosity, SPLASH2.]


Synchronization Heavy

[Figure: speedup versus core count (up to 128 cores); normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: ocean, SPLASH2.]


Coarse-Grain Partitioning

[Figure: speedup versus core count (up to 128 cores) for base, TLS-2, TLS-4, FBoost and CFusion; normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: swaptions, PARSEC.]


Poor Static Partitioning

[Figure: speedup versus core count (up to 128 cores); normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: sp, NASPB.]


Effect of Dataset size

Unchanged behavior: cholesky
Also: canneal, ocean, ft, is, sp

Improved scalability, but TLS boost remains: swaptions
Also: bodytrack, radiosity, ep

Improved scalability, lessened TLS boost: streamcluster

Worse scalability, even better TLS boost: water

[Figures: speedup versus core count (up to 128 cores) for each case, with series base, baseL, TLS-2, TLS-2L, TLS-4 and TLS-4L.]






Conclusions

Multicores and many-cores are here to stay
– Parallel programming essential to exploit new hardware
– Some coarse-grain parallel programs do not scale
– Enough nested parallelism to improve scalability

Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
– Significant scalability improvement of 40% on avg
– No increase in total energy consumption
– Presented an auto-tuning mechanism to dynamically choose the number of threads that performs within 6% of the oracle



Related Work

[von Praun PPoPP’07] Implicit ordered transactions
[Kim Micro’10] Speculative Parallel-stage Decoupled Software Pipelining
[Ooi ICS’01] Multiplex
[Madriles ISCA’09] Anaphase
[Rajwar MICRO’01], [Martinez ASPLOS’02] Speculative Lock Elision
[Moravan ASPLOS’06], etc., Nested transactional memory


Bibliography

[Intel’08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White Paper, 2008

[Ipek ISCA’07] Ipek et al. Core fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007

[von Praun PPoPP’07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007

[Kim Micro’10] Scalable speculative parallelization in commodity clusters. MICRO 2010

[Ooi ICS’01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001

[Madriles ISCA’09] C. Madriles et al. Boosting single-thread performance in multi-core system through fine-grain multi-threading. ISCA 2009

[Rajwar MICRO’01] R. Rajwar and J.R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001

[Martinez ASPLOS’02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002

[Moravan ASPLOS’06] Supporting nested transactional memory in LogTM. ASPLOS 2006

[Curtis-Maury PACT’08] Prediction models for multi-dimensional power-performance optimization on many-cores. PACT 2008


Benchmark details


Fetched Instructions

[Figure: normalized total fetched instructions per benchmark for TLS-2, TLS-4, FBoost and CFusion.]


Failed Speculation

[Figure: normalized execution time (0% to 100%) per benchmark for TLS-2 and TLS-4, broken down into Restart and Busy.]


Serial/Critical Sections

[Figure: speedup versus core count (up to 128 cores); normalized execution time at 2 to 128 cores, broken down into Busy, Lock and Barrier. Benchmark: bodytrack, PARSEC.]


Auto-tuning

OpenMP apps
Performs within 6% of static oracle

[Figure: speedup of Static Oracle and Auto-tuning for ep, ft, is and sp.]
