pl-4090, sequential consistency for heterogeneous-race-free: programmer-centric memory models for...

SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE, BRADFORD M. BECKMANN, 11/13/2013

Post on 21-Oct-2014

DESCRIPTION

PL-4090, Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms, by Brad Beckmann at the AMD Developer Summit (APU13) November 11-13, 2013.

TRANSCRIPT

Page 1: PL-4090, Sequential Consistency for Heterogeneous-Race-Free: Programmer-centric Memory Models for Heterogeneous Platforms, by Brad Beckmann

SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE

BRADFORD M. BECKMANN

11/13/2013  

Page 2

2   |      SC  FOR  HRF      |      NOVEMBER  19,  2013      |      PUBLIC  

EXECUTIVE  SUMMARY  

• Existing APU memory models are ambiguous and opaque to programmers

• From the CPU world: sequential consistency for data-race-free (SC for DRF)
  ‒ Relaxed HW, precise semantics, programmer-friendly…
  ‒ …but it took 30 years for CPUs; our goal is to improve on that for GPGPUs

• Primary difference between CPUs and GPGPUs → scoped synchronization
  ‒ Our work adds scoped (partial) synchronization to SC for DRF

• Two specific models:

                          HRF-direct    HRF-indirect
  Model complexity        Simple        Advanced
  HW flexibility          Significant   Limited
  SW optimizations        Limited       Significant
  Leverage current HW?    No            Yes

• Case study: GPGPU task-sharing runtime
  ‒ HRF-indirect provides up to 10% performance improvement

Page 3

OUTLINE  

BACKGROUND  

SCOPES  AND  GPU  HARDWARE  

CASE STUDY: TASK-SHARING RUNTIME

Page 4

BACKGROUND  

• Programmers use a memory model to understand memory behavior
  ‒ Sequential consistency (SC) [1979]: threads interleave like a multi-tasking uniprocessor
  ‒ HW/compiler actually implements TSO [1991] or a more relaxed model
  ‒ Java™ [2005] and C++ [2008] ensure SC for data-race-free (DRF) programs

• Programmers need a GPU memory model for abstraction and portability
  ‒ Current GPU models expose ad hoc HW mechanisms
  ‒ SC for DRF is a start, but…
  ‒ …many GPU operations are not global, but have limited scopes (e.g., workgroup)

[Figure: OpenCL™ execution model]

Page 5

SEQUENTIAL CONSISTENCY FOR DATA-RACE-FREE

• Two memory accesses participate in a data race if they
  ‒ access the same location
  ‒ at least one access is a store
  ‒ can occur simultaneously, i.e., appear as adjacent operations in an interleaving

• A program is data-race-free if no possible execution results in a data race

• Sequential consistency for data-race-free programs
  ‒ Avoid everything else

GPUs: Not good enough!

How do different types of scoped-synchronization operations interact?
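The data-race definition above can be checked mechanically for tiny programs. The following is a minimal sketch, assuming two threads that issue only ordinary loads and stores (no synchronization is modeled); the function names and the (kind, location) encoding are illustrative and not taken from the talk.

```python
from itertools import combinations

# An access is a (kind, location) pair; kind is "LD" or "ST".

def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each thread's own order."""
    n, m = len(a), len(b)
    for slots in combinations(range(n + m), n):
        chosen = set(slots)
        ia, ib = iter(a), iter(b)
        # Positions in `chosen` take the next access of thread 0, the rest thread 1.
        yield [(0, next(ia)) if i in chosen else (1, next(ib)) for i in range(n + m)]

def conflicts(x, y):
    """Two accesses conflict if they touch the same location and one is a store."""
    return x[1] == y[1] and "ST" in (x[0], y[0])

def has_data_race(t0, t1):
    """True if some interleaving puts conflicting accesses of different threads side by side."""
    for order in interleavings(t0, t1):
        for (tid_a, acc_a), (tid_b, acc_b) in zip(order, order[1:]):
            if tid_a != tid_b and conflicts(acc_a, acc_b):
                return True
    return False

print(has_data_race([("ST", "X")], [("LD", "X")]))  # store and load of X can be adjacent
print(has_data_race([("ST", "X")], [("LD", "Y")]))  # different locations never conflict
```

Exhaustive enumeration only works for toy programs; it is here to make the "adjacent in an interleaving" wording concrete, not to suggest a practical race detector.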

Page 6

SCOPES  

• Scope: a subset of threads
• Scoped synchronization: synchronization w.r.t. a scope
  ‒ CUDA: __threadfence_{block, system}, __syncthreads, etc.
  ‒ PTX: membar.{cta, gl, sys}, bar, etc.
  ‒ HSA: partial acquire/release
  ‒ OpenCL 2.0: atomic_work_item_fence {work_group, device, all_svm_devices}

• Scopes introduce a new class of races:
  ‒ What happens when threads use different scopes?

Running example:

  wf1            wf2                wf3                wf4
  ST X = 1
  Release_S12    Acquire_S12
                 LD X (1)
                 Release_SGlobal    Acquire_SGlobal
                                    LD X (??)

[Figure: Workgroup 1 (t1, t2) and Workgroup 2 (t3, t4), each with its own L1 above a shared L2. Scope S12 covers t1 and t2, scope S34 covers t3 and t4, and scope SGlobal covers all four threads.]
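Since a scope is just a subset of threads, the scopes in the running example can be written down as sets. The sketch below is an illustration only: the `SCOPES` table and the helper name are hypothetical, and the membership of S12 and S34 is read off the figure labels.

```python
# Hypothetical scope table for the four-wavefront running example.
SCOPES = {
    "S12": frozenset({"wf1", "wf2"}),
    "S34": frozenset({"wf3", "wf4"}),
    "SGlobal": frozenset({"wf1", "wf2", "wf3", "wf4"}),
}

def narrowest_common_scope(producer, consumer):
    """Return the smallest scope that contains both communicating threads."""
    candidates = [
        (len(members), name)
        for name, members in SCOPES.items()
        if producer in members and consumer in members
    ]
    return min(candidates)[1]

print(narrowest_common_scope("wf1", "wf2"))  # both sit in workgroup 1
print(narrowest_common_scope("wf1", "wf3"))  # only the global scope covers both
```

Picking the narrowest scope requires knowing the consumer in advance, which is exactly what the producer lacks when threads end up using different scopes.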

Page 7

RUNNING EXAMPLE: CURRENT GPU HARDWARE

• Current GPU: write-combining cache hierarchy
  ‒ WG release → flush stores from coalescer
  ‒ WG acquire → stall until coalescer is empty
  ‒ Global release → flush all dirty locations in L1 cache
  ‒ Global acquire → invalidate all valid locations in L1 cache

[Figure: wf1 and wf2 share one L1, wf3 and wf4 share the other; each L1 sits behind a coalescer and above a shared L2. The running example (wf1: ST X = 1, Release_S12; wf2: Acquire_S12, LD X, Release_SGlobal; wf3: Acquire_SGlobal, LD X) is stepped through this hierarchy.]

Result: t3 sees (1). The global release by wf2 flushes all dirty locations in its L1, including X = 1 written by wf1, so the value reaches L2 before wf3 reads it.

Page 8

RUNNING EXAMPLE: OPTIMIZED GPU

• Optimized GPU: per-wavefront L1 cache management
  ‒ WG release → flush stores from coalescer
  ‒ WG acquire → stall until coalescer is empty
  ‒ Global release → flush locations written by releasing wavefront in L1
  ‒ Global acquire → invalidate locations read by acquiring wavefront in L1

[Figure: same hierarchy as the previous slide (two L1 caches, each behind a coalescer, above a shared L2), stepping through the same running example.]

Result: t3 sees (0). The global release by wf2 flushes only locations that wf2 itself wrote, so X = 1 written by wf1 stays in the L1 and L2 still holds X = 0 when wf3 reads it.
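The two hardware behaviors can be compared with a small simulation of the running example. This is a sketch under stated assumptions, not a description of any real GPU: the class and method names are invented, every release or acquire is assumed to drain the coalescer first, dirty lines are assumed to survive an invalidation, and loads are assumed to bypass the coalescer.

```python
class L1Cache:
    """Toy write-combining L1 with a coalescer in front and a shared L2 behind."""

    def __init__(self, l2):
        self.l2 = l2
        self.coalescer = {}  # addr -> (value, writer): stores not yet in L1
        self.lines = {}      # addr -> value held in L1
        self.dirty = {}      # addr -> wavefront that wrote the dirty line
        self.readers = {}    # addr -> wavefronts that have read the line

    def store(self, wf, addr, value):
        self.coalescer[addr] = (value, wf)

    def load(self, wf, addr):
        if addr not in self.lines:  # miss: fetch from L2
            self.lines[addr] = self.l2.get(addr, 0)
        self.readers.setdefault(addr, set()).add(wf)
        return self.lines[addr]

    def _drain(self):
        for addr, (value, wf) in self.coalescer.items():
            self.lines[addr] = value
            self.dirty[addr] = wf
        self.coalescer.clear()

    def wg_release(self):
        self._drain()  # flush stores from the coalescer into L1

    def wg_acquire(self):
        self._drain()  # modeled as "wait until the coalescer is empty"

    def global_release(self, wf, per_wavefront):
        self._drain()
        for addr in list(self.dirty):
            # Current HW flushes every dirty line; optimized HW only the releaser's.
            if not per_wavefront or self.dirty[addr] == wf:
                self.l2[addr] = self.lines[addr]
                del self.dirty[addr]

    def global_acquire(self, wf, per_wavefront):
        self._drain()
        for addr in list(self.lines):
            if addr in self.dirty:
                continue  # assumption: unflushed data is never dropped
            if not per_wavefront or wf in self.readers.get(addr, ()):
                del self.lines[addr]
                self.readers.pop(addr, None)

def run_example(per_wavefront):
    """Return (value seen by wf2, value seen by wf3) for the running example."""
    l2 = {"X": 0}
    l1_wg1, l1_wg2 = L1Cache(l2), L1Cache(l2)
    l1_wg1.store("wf1", "X", 1)
    l1_wg1.wg_release()                          # wf1: Release_S12
    l1_wg1.wg_acquire()                          # wf2: Acquire_S12
    seen_wf2 = l1_wg1.load("wf2", "X")
    l1_wg1.global_release("wf2", per_wavefront)  # wf2: Release_SGlobal
    l1_wg2.global_acquire("wf3", per_wavefront)  # wf3: Acquire_SGlobal
    seen_wf3 = l1_wg2.load("wf3", "X")
    return seen_wf2, seen_wf3

print(run_example(per_wavefront=False))  # current GPU hardware
print(run_example(per_wavefront=True))   # optimized GPU
```

Only the `per_wavefront` flag differs between the two runs, which isolates the design question: whether a global release is allowed to skip data written by another wavefront.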

Page 9

DEFINING  PERMITTED  BEHAVIOR  

• Which scenario should be allowed?
  ‒ Can programmers assume transitivity?
    ‒ Permitting the “Current GPU Hardware” scenario
  ‒ Or must producers and consumers use the same scope?
    ‒ Permitting the “Optimized GPU” scenario

• Our notation:
  ‒ HRF-direct
    ‒ Requires communicating threads to synchronize using the same scope
    ‒ Communication using different scopes is explicitly undefined
  ‒ HRF-indirect
    ‒ Extends HRF-direct to support transitive communication using different scopes
    ‒ Allows indirect communication using a third party

• Both models require direct synchronization using the same matching scope
  ‒ In other words, an acq/rel pair using scopes that are subset/superset is undefined
  ‒ We call this form of synchronization scope inclusion
  ‒ While possible with current GPUs, extremely difficult to reason about

Page 10

HRF MODELS: IMPLICATIONS FOR PROGRAMMERS

• What value will t3 see at the final LD X of the running example?
  ‒ HRF-direct: final LD X forms a race (inexact scopes between wf1 and wf3)
    ‒ Undefined behavior → don’t try!!
  ‒ HRF-indirect: no race (scope transitivity)
    ‒ SC behavior → t3 sees (1)

• Consequences:
  ‒ HRF-direct:
    ‒ Must use global scope w/o future sync. knowledge → slower on existing HW
  ‒ HRF-indirect:
    ‒ Can use local scope w/o future sync. knowledge → faster on existing HW
    ‒ Will NOT work with potentially optimized future GPU
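The difference between the two models on the running example can be sketched as a reachability question. The code below is one reading of the slides, not a formal definition from the talk: it assumes ordering is built from program order plus release-to-acquire edges, that HRF-direct only chains edges of a single scope, and that HRF-indirect chains edges of all scopes together. The event names (`st`, `rel12`, `ld3`, and so on) are hypothetical labels.

```python
from itertools import product

NODES = ["st", "rel12", "acq12", "ld2", "relG", "acqG", "ld3"]

# Program order within each wavefront.
PROGRAM_ORDER = [
    ("st", "rel12"),                    # wf1: ST X = 1; Release_S12
    ("acq12", "ld2"), ("ld2", "relG"),  # wf2: Acquire_S12; LD X; Release_SGlobal
    ("acqG", "ld3"),                    # wf3: Acquire_SGlobal; LD X
]

# Release -> acquire edges, grouped by the scope both sides used.
SYNC = {
    "S12": [("rel12", "acq12")],
    "SGlobal": [("relG", "acqG")],
}

def closure(edges):
    """Transitive closure of the edge list as a set of (src, dst) pairs."""
    reach = set(edges)
    for k, i, j in product(NODES, repeat=3):  # k is the outermost loop
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach

def ordered_direct(a, b):
    """Ordered only if a single scope's edges (plus program order) connect a to b."""
    return any((a, b) in closure(PROGRAM_ORDER + SYNC[s]) for s in SYNC)

def ordered_indirect(a, b):
    """Ordered if edges from any mix of scopes connect a to b (scope transitivity)."""
    all_sync = [edge for edges in SYNC.values() for edge in edges]
    return (a, b) in closure(PROGRAM_ORDER + all_sync)

print(ordered_direct("st", "ld2"))    # wf1 -> wf2 through S12 alone
print(ordered_direct("st", "ld3"))    # no single scope links wf1 to wf3
print(ordered_indirect("st", "ld3"))  # S12 then SGlobal links them
```

When the store and the final load are unordered they form a race, which is why the HRF-direct answer is "undefined" while HRF-indirect guarantees that t3 sees 1.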

Page 11

CASE STUDY – TASK-SHARING RUNTIME: OVERVIEW

Hierarchical task queue:
  ‒ Wavefronts produce/consume tasks independently
  ‒ Use local queue until:
    ‒ Local queue empty → pull from global
    ‒ Local queue full → push to global

Challenge:
  ‒ Unpredictable synchronization

HRF-direct:
  ‒ Either all sync has to be global or wavefronts coordinate to push/pull

HRF-indirect:
  ‒ Single wavefront can push/pull independently

[Figure: queue hierarchy. One NDRange (kernel) queue sits above two workgroup queues, which sit above four wavefront queues, each serving a group of work-items (WI).]
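The push/pull policy above can be sketched with two queue levels. This is an illustration under assumptions, not the runtime from the talk: it collapses the hierarchy to one local and one global queue, the class and function names are invented, and for HRF-direct it models only the "all sync is global" option (the coordination alternative is not modeled).

```python
from collections import deque

class WavefrontQueue:
    """Local task queue that spills to, and refills from, a shared global queue."""

    def __init__(self, global_queue, capacity):
        self.local = deque()
        self.global_queue = global_queue
        self.capacity = capacity
        self.sync_log = []  # scope each queue operation touches

    def push(self, task):
        if len(self.local) < self.capacity:
            self.local.append(task)
            self.sync_log.append("local")
        else:  # local queue full -> push to global
            self.global_queue.append(task)
            self.sync_log.append("global")

    def pop(self):
        if self.local:
            self.sync_log.append("local")
            return self.local.popleft()
        if self.global_queue:  # local queue empty -> pull from global
            self.sync_log.append("global")
            return self.global_queue.popleft()
        return None  # nothing to do; no synchronization logged

def global_syncs(sync_log, model):
    """Count the operations that must synchronize at global scope under a model."""
    if model == "HRF-direct":
        return len(sync_log)         # the eventual consumer is unknown: always global
    return sync_log.count("global")  # HRF-indirect: global only for donations/pulls

shared = deque()
queue = WavefrontQueue(shared, capacity=2)
for task in ("a", "b", "c"):
    queue.push(task)
drained = [queue.pop() for _ in range(3)]

print(drained)
print(global_syncs(queue.sync_log, "HRF-direct"), global_syncs(queue.sync_log, "HRF-indirect"))
```

The count is only a proxy for cost; the measured benefit depends on how expensive a global release or acquire is on the target hardware.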

Page 12

CASE STUDY – TASK-SHARING RUNTIME: DETAILS

• Runtime hides the OpenCL™ execution model
  ‒ Application defines only independent tasks
  ‒ Runtime uses persistent threads

• The same function assigned to a wavefront
  ‒ Grouped together using taskfronts

• Synchronization occurs when tasks are enqueued/dequeued
  ‒ Enqueuer/producer does not know eventual consumer
  ‒ HRF-direct: must always use device/kernel scope synchronization
  ‒ HRF-indirect: only use kernel scope synchronization for global donations/consumption

• Evaluation: Unbalanced Tree Search (UTS) synthetic workload
  ‒ Traversal of unbalanced graph whose topology is determined dynamically
  ‒ 4 different input sets

[Figure: a queue holding taskfronts, each taskfront grouping several tasks.]

Page 13

CASE STUDY – TASK-SHARING RUNTIME RESULTS

[Figure: bar chart of performance normalized to HRF-direct (y-axis from 0.95 to 1.15), comparing HRF-direct and HRF-indirect on input sets uts_t1, uts_t2, uts_t4, and uts_t5.]

Page 14

SUMMARY

• Our general approach (SC for HRF):
  ‒ Define a heterogeneous race:
    ‒ Two conflicting accesses not separated by synchronization, or
    ‒ Synchronization does not use “enough” scope
  ‒ Execution is SC if no heterogeneous races
  ‒ Undefined otherwise

• Proposing two memory models:
  ‒ HRF-direct
    ‒ Conflicts separated by synchronization of identical scope
    + Easier to define/understand
    + Permits more future HW optimizations
    − Prohibits some SW opts in current hardware
  ‒ HRF-indirect
    ‒ Relaxes identical requirement of HRF-direct:
      ‒ Scope transitivity: A sync B, B sync C → A sync C
    + More accurate description of current hardware capabilities
    + Has some SW benefits (e.g., is more composable)
    − May limit future HW opts

Page 15

DISCLAIMER  &  ATTRIBUTION  

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a registered trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

Page 16

Backup  

Page 17

SIMULATION  CONFIGURATION  

• APU simulator
  ‒ Based on the gem5 open-source simulator
  ‒ Extended with a GPU execution model that directly executes HSAIL

• Configuration parameters:

  Parameter                       Value
  # Compute units                 8
  # SIMD units / compute unit     4
  L1 cache                        32 KB, 16-way assoc.
  L2 cache                        2 MB, 16-way assoc.

Page 18

HRF  DESIGN  SPACE  

[Figure: HRF design space diagram; the only surviving label is "Others".]

Page 19

PROGRAMMABLE PIPELINE EXAMPLE

[Figure: three pipeline stages (Stage 1, Stage 2, Stage 3) linked by L1-2 and L2-3, above L1/L2/DRAM. Scopes shown: Scope 1-2, Scope 2-3, and Scope Global.]