pl-4090, sequential consistency for heterogeneous-race-free: programmer-centric memory models for...

SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE

BRADFORD M. BECKMANN

11/13/2013

2 | SC FOR HRF | NOVEMBER 19, 2013 | PUBLIC

EXECUTIVE SUMMARY

•  Existing APU memory models are ambiguous and opaque to programmers

•  From the CPU world: sequential consistency for data-race-free (SC for DRF)
   ‒ Relaxed HW, precise semantics, programmer-friendly…
   ‒ …but it took 30 years for CPUs; our goal is to improve that for GPGPUs

•  Primary difference between CPUs and GPGPUs → scoped synchronization
   ‒ Our work adds scoped (partial) synchronization to SC for DRF

•  Two specific models:

                          HRF-direct    HRF-indirect
   Model complexity       Simple        Advanced
   HW flexibility         Significant   Limited
   SW optimizations       Limited       Significant
   Leverage current HW?   No            Yes

•  Case study: GPGPU task-sharing runtime
   ‒ HRF-indirect provides up to 10% performance improvement


OUTLINE

BACKGROUND

SCOPES AND GPU HARDWARE

CASE STUDY: TASK-SHARING TASK RUNTIME


BACKGROUND

•  Programmers use a memory model to understand memory behavior
   ‒ Sequential consistency (SC) [1979]: threads interleave like a multi-tasking uni-processor
   ‒ HW/compiler actually implements TSO [1991] or a more relaxed model
   ‒ Java™ [2005] and C++ [2008] ensure SC for data-race-free (DRF) programs

•  Programmers need a GPU memory model for abstraction and portability
   ‒ Current GPU models expose ad hoc HW mechanisms
   ‒ SC for DRF is a start, but…
   ‒ …many GPU operations are not global, but have limited scopes (e.g., workgroup)

[Figure: OpenCL™ execution model]


SEQUENTIAL CONSISTENCY FOR DATA-‐RACE-‐FREE

•  Two memory accesses participate in a data race if they
   ‒ access the same location
   ‒ at least one access is a store
   ‒ can occur simultaneously
      ‒ i.e., appear as adjacent operations in an interleaving

•  A program is data-race-free if no possible execution results in a data race

•  Sequential consistency for data-race-free programs
   ‒ Avoid everything else
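The data-race definition above can be checked mechanically for tiny programs. The following is a minimal sketch, not part of the original slides: it enumerates every interleaving that keeps program order, discards those that violate mutual exclusion, and reports a race if two conflicting accesses from different threads ever appear adjacent. The operation encoding `(thread, kind, location)` and the `LOCK`/`UNLOCK` kinds are assumptions made for illustration.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps program order."""
    if not any(threads):
        yield []
        return
    for i, ops in enumerate(threads):
        if ops:
            rest = threads[:i] + [ops[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [ops[0]] + tail


def lock_order_ok(trace):
    """A trace is legal only if no lock is taken while another thread holds it."""
    holder = {}
    for tid, kind, loc in trace:
        if kind == "LOCK":
            if holder.get(loc) is not None:
                return False
            holder[loc] = tid
        elif kind == "UNLOCK":
            holder[loc] = None
    return True


def conflict(a, b):
    """Same location, different threads, at least one store."""
    data = {"LD", "ST"}
    return (a[0] != b[0] and a[1] in data and b[1] in data
            and a[2] == b[2] and "ST" in (a[1], b[1]))


def has_data_race(threads):
    """True if some legal interleaving places two conflicting accesses side by side."""
    for trace in interleavings(threads):
        if lock_order_ok(trace) and any(conflict(a, b) for a, b in zip(trace, trace[1:])):
            return True
    return False


# Hypothetical two-thread programs: one unsynchronized, one lock-protected.
racy = [[(0, "ST", "X")], [(1, "LD", "X")]]
locked = [
    [(0, "LOCK", "m"), (0, "ST", "X"), (0, "UNLOCK", "m")],
    [(1, "LOCK", "m"), (1, "LD", "X"), (1, "UNLOCK", "m")],
]
print(has_data_race(racy), has_data_race(locked))
```

Exhaustive enumeration only scales to a handful of operations; it is meant to make the "adjacent in an interleaving" wording concrete, not to serve as a practical race detector.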

GPUs: Not good enough!

How do different types of scoped-synchronization operations interact?


SCOPES

•  Scope: a subset of threads

•  Scoped synchronization: synchronization w.r.t. a scope
   ‒ CUDA: __threadfence_{block, system}, __syncthreads, etc.
   ‒ PTX: membar.{cta, gl, sys}, bar, etc.
   ‒ HSA: partial acquire/release
   ‒ OpenCL 2.0: atomic_work_item_fence {work_group, device, all_svm_devices}

•  Scopes introduce a new class of races:
   ‒ What happens when threads use different scopes?

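[Figure: running example. Workgroup 1 holds wf1 and wf2, Workgroup 2 holds wf3 and wf4; each workgroup has its own L1 and both share the L2. Scope S12 covers t1 and t2, scope S34 covers t3 and t4, and scope SGlobal covers t1 through t4.]

   wf1: ST X = 1; Release_S12
   wf2: Acquire_S12; LD X (1); Release_SGlobal
   wf3: Acquire_SGlobal; LD X (??)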


RUNNING EXAMPLE: CURRENT GPU HARDWARE

•  Current GPU: write-combining cache hierarchy
   ‒ WG release → flush stores from coalescer
   ‒ WG acquire → stall until coalescer is empty
   ‒ Global release → flush all dirty locations in L1 cache
   ‒ Global acquire → invalidate all valid locations in L1 cache

[Figure: the running example on current GPU hardware, showing wf1 to wf4, the coalescer, the two L1 caches, and the shared L2 as ST X = 1 propagates. Outcome: t3 sees (1).]


RUNNING EXAMPLE: OPTIMIZED GPU

•  Optimized GPU: per-wavefront L1 cache management
   ‒ WG release → flush stores from coalescer
   ‒ WG acquire → stall until coalescer is empty
   ‒ Global release → flush locations written by the releasing wavefront in L1
   ‒ Global acquire → invalidate locations read by the acquiring wavefront in L1
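The two sets of release/acquire rules can be compared with a toy cache model. This is a rough sketch under stated assumptions, not the simulator used in the talk: the class `ToyGPU`, its fields, and the choice to model one coalescer per workgroup are all hypothetical, a global release is assumed to include a workgroup release, and "stall until coalescer is empty" is modelled by draining the coalescer. It replays the running example (wf1 stores X and releases to its workgroup, wf2 acquires, loads, and releases globally, wf3 acquires globally and loads).

```python
class ToyGPU:
    """Two workgroups, one L1 each, one shared L2. Unwritten locations read as 0."""

    def __init__(self, optimized):
        self.optimized = optimized
        self.l2 = {}
        self.l1 = {1: {}, 2: {}}          # workgroup -> {location: value}
        self.dirty = {1: {}, 2: {}}       # workgroup -> {location: writing wavefront}
        self.read_by = {1: {}, 2: {}}     # workgroup -> {location: set of readers}
        self.coalescer = {1: [], 2: []}   # workgroup -> pending (wf, loc, value)

    def store(self, wg, wf, loc, value):
        # Stores wait in the coalescer until a release drains them.
        self.coalescer[wg].append((wf, loc, value))

    def load(self, wg, wf, loc):
        # Miss in L1: fill from L2. Pending coalescer stores are not consulted.
        if loc not in self.l1[wg]:
            self.l1[wg][loc] = self.l2.get(loc, 0)
        self.read_by[wg].setdefault(loc, set()).add(wf)
        return self.l1[wg][loc]

    def wg_release(self, wg, wf):
        # Flush stores from the coalescer into this workgroup's L1.
        for writer, loc, value in self.coalescer[wg]:
            self.l1[wg][loc] = value
            self.dirty[wg][loc] = writer
        self.coalescer[wg].clear()

    def wg_acquire(self, wg, wf):
        # Assumption: the stall is modelled by draining the coalescer.
        self.wg_release(wg, wf)

    def global_release(self, wg, wf):
        self.wg_release(wg, wf)
        for loc, writer in list(self.dirty[wg].items()):
            # Current HW flushes every dirty line; optimized HW only the releaser's.
            if not self.optimized or writer == wf:
                self.l2[loc] = self.l1[wg][loc]
                del self.dirty[wg][loc]

    def global_acquire(self, wg, wf):
        self.wg_acquire(wg, wf)
        for loc in list(self.l1[wg]):
            if loc in self.dirty[wg]:
                continue  # assumption: dirty lines are kept, not invalidated
            if not self.optimized or wf in self.read_by[wg].get(loc, ()):
                del self.l1[wg][loc]
                self.read_by[wg].pop(loc, None)


def run_example(optimized):
    gpu = ToyGPU(optimized)
    gpu.store(1, "wf1", "X", 1)
    gpu.wg_release(1, "wf1")           # Release_S12
    gpu.wg_acquire(1, "wf2")           # Acquire_S12
    seen_by_wf2 = gpu.load(1, "wf2", "X")
    gpu.global_release(1, "wf2")       # Release_SGlobal
    gpu.global_acquire(2, "wf3")       # Acquire_SGlobal
    seen_by_wf3 = gpu.load(2, "wf3", "X")
    return seen_by_wf2, seen_by_wf3


print("current:", run_example(False), "optimized:", run_example(True))
```

In this model the optimized design leaves X dirty in Workgroup 1's L1 because wf2, the releasing wavefront, never wrote it, so wf3 reads the stale value from the L2.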

[Figure: the running example on the optimized GPU, showing wf1 to wf4, the coalescer, the two L1 caches, and the shared L2 as ST X = 1 propagates. Outcome: t3 sees (0).]


DEFINING PERMITTED BEHAVIOR

•  Which scenario should be allowed?
   ‒ Can programmers assume transitivity?
      ‒ Permitting the “Current GPU Hardware” scenario
   ‒ Or must producers and consumers use the same scope?
      ‒ Permitting the “Optimized GPU” scenario

•  Our notation:
   ‒ HRF-direct
      ‒ Requires communicating threads to synchronize using the same scope
      ‒ Communication using different scopes is explicitly undefined
   ‒ HRF-indirect
      ‒ Extends HRF-direct to support transitive communication using different scopes
      ‒ Allows indirect communication using a third party

•  Both models require direct synchronization using the same matching scope
   ‒ In other words, an acq/rel pair using scopes that are subset/superset is undefined
   ‒ We call this form of synchronization scope inclusion
   ‒ While possible with current GPUs, it is extremely difficult to reason about
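The difference between the two models can be expressed as a small ordering check. The sketch below is one illustrative reading of the definitions on this slide, not a formal statement of the models: each synchronization edge is a matched release/acquire pair `(releaser, acquirer, scope)` listed in execution order, HRF-direct is taken to order two accesses only through a chain whose edges all use one identical scope, and HRF-indirect through a chain of any same-scope edges. The scope table and function names are hypothetical, and scope inclusion (mismatched release and acquire scopes) is not modelled.

```python
# Hypothetical scope membership, following the running example.
SCOPES = {
    "S12": {"t1", "t2"},
    "S34": {"t3", "t4"},
    "SGlobal": {"t1", "t2", "t3", "t4"},
}


def valid_edges(edges):
    """Keep only edges whose scope contains both the releaser and the acquirer."""
    return [e for e in edges if {e[0], e[1]} <= SCOPES[e[2]]]


def reaches(src, dst, edges):
    """Follow release -> acquire edges in execution order, starting from src."""
    reached = {src}
    for releaser, acquirer, _scope in edges:
        if releaser in reached:
            reached.add(acquirer)
    return dst in reached


def ordered_hrf_direct(src, dst, edges):
    """Ordered only if one single scope carries the whole chain."""
    edges = valid_edges(edges)
    scopes = {scope for _, _, scope in edges}
    return any(reaches(src, dst, [e for e in edges if e[2] == s]) for s in scopes)


def ordered_hrf_indirect(src, dst, edges):
    """Ordered if any chain of same-scope edges connects src to dst."""
    return reaches(src, dst, valid_edges(edges))


# Running example: t1 -> t2 through S12, then t2 -> t3 through SGlobal.
mixed = [("t1", "t2", "S12"), ("t2", "t3", "SGlobal")]
all_global = [("t1", "t2", "SGlobal"), ("t2", "t3", "SGlobal")]

print(ordered_hrf_direct("t1", "t3", mixed))      # store and load race
print(ordered_hrf_indirect("t1", "t3", mixed))    # store ordered before load
print(ordered_hrf_direct("t1", "t3", all_global))
```

When the check returns False for a conflicting store/load pair, the program contains a heterogeneous race under that model and its behavior is undefined.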


HRF MODEL IMPLICATIONS FOR PROGRAMMERS

•  What value will t3 see?
   ‒ HRF-direct: final LD X forms a race (inexact scopes between wf1 and wf3)
      ‒ Undefined behavior → don’t try!!
   ‒ HRF-indirect: no race (scope transitivity)
      ‒ SC behavior → t3 sees (1)

•  Consequences:
   ‒ HRF-direct:
      ‒ Must use global scope w/o future sync. knowledge → slower on existing HW
   ‒ HRF-indirect:
      ‒ Can use local scope w/o future sync. knowledge → faster on existing HW
      ‒ Will NOT work with potentially optimized future GPU

[Figure: running example timeline for wf1 to wf4, starting with ST X = 1.]




CASE STUDY – TASK-SHARING RUNTIME

OVERVIEW

Hierarchical task queue:
‒ Wavefronts produce/consume tasks independently
‒ Use local queue until:
   ‒ Local queue empty → pull from global
   ‒ Local queue full → push to global

Challenge:
‒ Unpredictable synchronization

HRF-direct:
‒ Either all sync has to be global or wavefronts coordinate to push/pull

HRF-indirect:
‒ Single wavefront can push/pull independently

[Figure: queue hierarchy with one NDRange (kernel) queue, two workgroup queues, and four wavefront queues, each above a row of work-items (WI).]


CASE STUDY – TASK-SHARING RUNTIME

DETAILS

•  Runtime hides the OpenCL™ execution model
   ‒ Application defines only independent tasks
   ‒ Runtime uses persistent threads

•  The same function is assigned to a wavefront
   ‒ Grouped together using taskfronts

•  Synchronization occurs when tasks are enqueued/dequeued
   ‒ Enqueuer/producer does not know the eventual consumer
   ‒ HRF-direct: must always use device/kernel scope synchronization
   ‒ HRF-indirect: only use kernel scope synchronization for global donations/consumption

•  Evaluation: unbalanced tree search (UTS) synthetic workload
   ‒ Traversal of an unbalanced graph whose topology is determined dynamically
   ‒ 4 different input sets

[Figure: a queue of taskfronts, each taskfront grouping several tasks.]
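The synchronization cost difference described above can be illustrated by counting which scope each queue operation would need. This is a simplified sketch, not the runtime evaluated in the talk: it models a single wavefront with one local queue and one global queue, and the class name, the capacity of 4, and the task counts are hypothetical. Under HRF-direct every enqueue and dequeue is counted as a global-scope synchronization because the eventual consumer is unknown; under HRF-indirect only operations that touch the global queue are.

```python
from collections import Counter, deque


class TaskQueues:
    """One wavefront-local queue backed by a shared global queue."""

    def __init__(self, model, local_capacity=4):
        self.model = model              # "direct" or "indirect"
        self.local = deque()
        self.global_q = deque()
        self.capacity = local_capacity
        self.syncs = Counter()          # scope name -> number of sync operations

    def _sync(self, touches_global):
        # HRF-direct cannot know the consumer's scope, so it always goes global.
        if self.model == "direct" or touches_global:
            self.syncs["global"] += 1
        else:
            self.syncs["local"] += 1

    def enqueue(self, task):
        if len(self.local) < self.capacity:
            self.local.append(task)     # local queue has room
            self._sync(touches_global=False)
        else:
            self.global_q.append(task)  # local queue full: push to global
            self._sync(touches_global=True)

    def dequeue(self):
        if self.local:
            self._sync(touches_global=False)
            return self.local.popleft()
        if self.global_q:               # local queue empty: pull from global
            self._sync(touches_global=True)
            return self.global_q.popleft()
        return None


def run(model, tasks=6):
    queues = TaskQueues(model)
    for task in range(tasks):
        queues.enqueue(task)
    while queues.dequeue() is not None:
        pass
    return dict(queues.syncs)


print("HRF-direct:  ", run("direct"))
print("HRF-indirect:", run("indirect"))
```

The counts say nothing about actual speedup; they only show how many operations could stay at the cheaper local scope when transitivity is guaranteed.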


CASE STUDY – TASK-SHARING RUNTIME RESULTS

[Figure: bar chart of performance normalized to HRF-direct (y-axis from 0.95 to 1.15) for HRF-direct and HRF-indirect on input sets uts_t1, uts_t2, uts_t4, and uts_t5.]


SUMMARY

•  Our general approach (SC for HRF):
   ‒ Define a heterogeneous race:
      ‒ Two conflicting accesses not separated by synchronization, or
      ‒ Synchronization does not use “enough” scope
   ‒ Execution is SC if no heterogeneous races
   ‒ Undefined otherwise

•  Proposing two memory models:
   ‒ HRF-direct
      ‒ Conflicts separated by synchronization of identical scope
      + Easier to define/understand
      + Permits more future HW optimizations
      − Prohibits some SW opts in current hardware
   ‒ HRF-indirect
      ‒ Relaxes the identical-scope requirement of HRF-direct:
         ‒ Scope transitivity: A sync B, B sync C → A sync C
      + More accurate description of current hardware capabilities
      + Has some SW benefits (e.g., is more composable)
      − May limit future HW opts


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a registered trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

Backup


SIMULATION CONFIGURATION

•  APU simulator
   ‒ Based on the gem5 open-source simulator
   ‒ Extended with a GPU execution model that directly executes HSAIL

•  Configuration parameters:

   Parameter                     Value
   # Compute units               8
   # SIMD units / compute unit   4
   L1 cache                      32 KB, 16-way assoc.
   L2 cache                      2 MB, 16-way assoc.


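HRF DESIGN SPACE

[Figure: HRF design space. The only surviving label is "Others".]


PROGRAMMABLE PIPELINE EXAMPLE

[Figure: three pipeline stages (Stage 1, Stage 2, Stage 3). Surviving labels: L1-2, L2-3, L1/L2/DRAM, Scope 1-2, Scope 2-3, Scope Global.]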
