Towards Automatic HBM Allocation using LLVM: A Case Study with Knights Landing. Dounia Khaldi and Barbara Chapman, Institute for Advanced Computational Science, Stony Brook University, Stony Brook, NY. The Third Workshop on the LLVM Compiler Infrastructure in HPC, Salt Lake City, Utah, November 14, 2016. LLVM logo is copyrighted by Apple Inc.


Page 1: LLVM logo is copyrighted by Apple Inc. Towards …– Case study: HBM (MCDRAM) of Knights Landing (KNL) – 2.29x performance improvement using LLVM compiler and 2.33x using Intel

Towards Automatic HBM Allocation using LLVM:

A Case Study with Knights Landing

Dounia Khaldi and Barbara Chapman
Institute for Advanced Computational Science

Stony Brook University
Stony Brook, NY

The Third Workshop on the LLVM Compiler Infrastructure in HPC
Salt Lake City, Utah, November 14, 2016

LLVM logo is copyrighted by Apple Inc.

Page 2

Outline

•  Introduction and Motivation
•  Methodology:
  –  Bandwidth-Critical Data Analysis (BCDA)
  –  HBM Allocation Transformation
•  Experimental Results using CG benchmark
•  Conclusion and Future Work

Page 3

Introduction: Exploring Memory Hierarchy

[Figure: memory hierarchy, with data movement between levels]
  Registers
  Caches: Level 1, Level 2
  Main Memory: DDR (NUMA)
  Scratchpad Memory (MCDRAM from Intel, HBM2 from NVIDIA, SPM of DSPs from TI)
  NVM (3D XPoint from Micron and Intel)

•  New kinds of memory in new architectures
•  Which data elements have to reside on these memories?
•  High performance using HBM, with lower power requirements compared to DDR
•  3D XPoint offers 1,000 times the performance of today's SSDs

Page 4

KNL Architecture as a Case Study

[Figure: KNL architecture diagram; labeled bandwidths: 490 GB/s (MCDRAM) and 90 GB/s (DDR)]

Page 5

MCDRAM Configuration Modes

[Figure: MCDRAM configuration modes. Extracted from http://colfaxresearch.com/knl-mcdram/]

Page 6

Programming KNL MCDRAM: Flat Mode

•  hbwmalloc library
•  Intel memkind library
  –  C, C++: memkind_malloc()
  –  Fortran:
    •  !DIR$ ATTRIBUTES FASTMEM :: object
    •  Since Intel Fortran 16.0 compiler
•  AutoHBW library
  –  Threshold size: AUTO_HBW_SIZE
•  numactl command

HBM allocation, before (DDR):

    float *fv;
    fv = (float *) malloc(sizeof(float)*n);

HBM allocation, after (HBM):

    float *fv;
    fv = (float *) hbw_malloc(sizeof(float)*n);
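The size threshold mentioned for AutoHBW can be pictured as a simple window test on the requested size. The sketch below is only an illustration under that assumption: the helper name `wants_hbm`, its parameters and the exact boundary semantics are hypothetical and are not the library's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of AutoHBW): decide whether a request of
 * `size` bytes falls in the window [low, high). A `high` of 0 is taken
 * to mean "no upper bound". */
bool wants_hbm(size_t size, size_t low, size_t high)
{
    if (size < low)
        return false;           /* too small: leave it in DDR */
    return high == 0 || size < high;
}
```

With a 1 MiB lower bound and no upper bound, a 2 MiB request would be routed to HBM and a 512-byte request would stay in DDR.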

Page 7

Related Work

Level            Work                                      Example
API level        Legion, Sequoia, RDDs, Adios              persistent
                 OpenMP 5.0?                               Current proposal: #pragma omp allocate, with memory spaces, allocators and traits
Vendors          Low-level libraries from Intel and Cray   Cray: #pragma memory(bandwidth)
Compiler level   Compiler transformations                  Loop nests
Tools            VTune                                     Collect bandwidth profiles

•  Dynamic
•  Bandwidth information, and then what?

Page 8

Related Work

•  We use LLVM, a widespread SSA-based compilation infrastructure for sequential and parallel languages
•  Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
•  Case study: the HBM, called MCDRAM, of Knights Landing (KNL)

Page 9

Motivation: Impact of MCDRAM on OpenMP 3D 7-point Stencil

[Figure: execution time (sec), axis 0 to 14, by grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10); series: ICC:OMP:DDR, ICC:OMP:HBM, LLVM:OMP:DDR, LLVM:OMP:HBM]

•  Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
•  ICC 16.0.3 and LLVM 3.8.1, with -O3
•  DDR vs. HBM execution time of the OpenMP version of the 3D 7-point Stencil
•  hbw_set_policy(HBW_POLICY_BIND);

Page 10

What to allocate into HBM?

    for (cgit = 1; cgit <= cgitmax; cgit++) {
      ...
      #pragma omp for
      for (j = 0; j < lastrow - firstrow + 1; j++) {
        suml = 0.0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++) {
          suml = suml + a[k]*p[colidx[k]];
        }
        q[j] = suml;
      }
      #pragma omp for reduction (+:d)
      for (j = 0; j < lastcol - firstcol + 1; j++) {
        d = d + p[j] * q[j];
      }
      ...
    }

•  Code snippet (NAS-NPB CG benchmark)
•  Different types of memory accesses
•  Several matrix and vector multiplications and additions
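The first loop nest of the snippet is a sparse matrix-vector product over a compressed-row layout. The following minimal sketch isolates it so the access types become visible: `a`, `colidx` and `rowstr` are read with consecutive indices, while `p` is read indirectly through `colidx`. The function names and the 3x3 example matrix are illustrative assumptions, not taken from the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* q = A * p for a matrix in compressed-row form: the nonzeros of row j
 * are a[rowstr[j]] .. a[rowstr[j+1]-1], with columns in colidx[]. */
void spmv_csr(size_t nrows, const int *rowstr, const int *colidx,
              const double *a, const double *p, double *q)
{
    for (size_t j = 0; j < nrows; j++) {
        double suml = 0.0;
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++) {
            /* a[k] is a consecutive (regular) access;
             * p[colidx[k]] is an indirect access. */
            suml += a[k] * p[colidx[k]];
        }
        q[j] = suml;
    }
}

/* Hypothetical example: [[2,0,1],[0,3,0],[4,0,5]] times [1,2,3]. */
void spmv_example(double q[3])
{
    const int rowstr[] = {0, 2, 3, 5};
    const int colidx[] = {0, 2, 1, 0, 2};
    const double a[] = {2.0, 1.0, 3.0, 4.0, 5.0};
    const double p[] = {1.0, 2.0, 3.0};
    spmv_csr(3, rowstr, colidx, a, p, q);
}
```

Calling `spmv_example` fills `q` with the three row sums of the example product.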

Page 11

Bandwidth-Critical Data (1)

[Figure: execution time (sec), axis 0 to 180, by grid size : timesteps (512^3:1, 512^3:5, 512^3:10, 1024^3:1, 1024^3:5, 1024^3:10); series: ICC:Seq:DDR, ICC:Seq:HBM, LLVM:Seq:DDR, LLVM:Seq:HBM]

[Figure: Mop/s, axis 0 to 180, for different versions of CG (CLASS C): DDR(all), HBM(all); series: ICC:Seq, LLVM:Seq]

•  Many wires into MCDRAM → simultaneous access is needed

Page 12

Bandwidth-Critical Data (2)

•  Predictable memory access patterns → application is bandwidth-bound
•  Many wires into MCDRAM → simultaneous access is needed
•  Library solutions → not portable
•  API level: might be a burden → Compiler + runtime solution

Page 13

Methodology: Bandwidth-Critical Data Analysis (BCDA)

\[ R = R(v) \]

\[ P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r) \cdot \mathrm{workshare}(r) \]

\[ \mathrm{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases} \]

\[ \mathrm{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases} \]

\[ \mathrm{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases} \]
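The priority function can be evaluated directly from these definitions. The sketch below is a minimal illustration, not the authors' implementation: the `mem_ref` record and its field names are assumptions, standing in for whatever the analysis extracts from each reference r to a variable v.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One memory reference r to a variable v (field names are illustrative). */
struct mem_ref {
    bool is_store;      /* store operation or not */
    bool simultaneous;  /* performed by several threads at once */
    bool regular;       /* false for indirect (irregular) accesses */
};

/* cost(r) = 2 for a store, 1 otherwise. */
int cost(const struct mem_ref *r) { return r->is_store ? 2 : 1; }

/* workshare(r) = 1 if r is simultaneous, 0 if it is individual. */
int workshare(const struct mem_ref *r) { return r->simultaneous ? 1 : 0; }

/* bandwidth(R) = 1 only if every reference in R is regular. */
int bandwidth(const struct mem_ref *refs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!refs[i].regular)
            return 0;
    return 1;
}

/* P(R) = bandwidth(R) * sum over r in R of cost(r) * workshare(r). */
int priority(const struct mem_ref *refs, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += cost(&refs[i]) * workshare(&refs[i]);
    return bandwidth(refs, n) * sum;
}
```

A single irregular reference zeroes the whole priority, which is why the bandwidth factor sits outside the sum.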

Page 14

BCDA: Interprocedural Memory Operations Count

•  LLVM IR is in SSA form
  –  One definition → multiple uses
  –  Allows for Def-Use and Use-Def chain analysis
•  Interprocedural Memory Operations Count()
  –  __kmpc_fork_call
  –  Number of memory operations in the generated LLVM IR (load, store and getelementptr)

\[ R = R(v) \]

\[ P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r) \cdot \mathrm{workshare}(r) \]

Page 15

BCDA: Data Reuse Cost

•  Function cost assigns a weight to reference operations

\[ P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r) \cdot \mathrm{workshare}(r) \]

\[ \mathrm{cost}(r) = \begin{cases} 2 & \text{if } r \text{ is a store operation} \\ 1 & \text{otherwise} \end{cases} \]

Page 16

BCDA: Individual vs. Simultaneous Access

•  OpenMP as a case study
•  Function workshare detects whether an access r has been performed in an OpenMP work-sharing region or not

\[ P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r) \cdot \mathrm{workshare}(r) \]

\[ \mathrm{workshare}(r) = \begin{cases} 0 & \text{if } r \text{ is individual} \\ 1 & \text{if } r \text{ is simultaneous} \end{cases} \]

Page 17

BCDA: Regular vs. Irregular Access Pattern

•  Function bandwidth: latency-bound vs. bandwidth-bound
•  Indirect accesses: index arguments of the getelementptr instruction

\[ P(R) = \mathrm{bandwidth}(R) \sum_{r \in R} \mathrm{cost}(r) \cdot \mathrm{workshare}(r) \]

\[ \mathrm{bandwidth}(R) = \begin{cases} 1 & \text{if } \forall r \in R,\ r \text{ is regular} \\ 0 & \text{otherwise} \end{cases} \]

Page 18

Methodology: Allocation Transformation

Allocation wrapper (compiler-rt run-time library):

    #include <assert.h>
    #include <stdlib.h>

    #if defined (HAVE_HBWMALLOC_H)
    # include <hbwmalloc.h>
    /* Allocate in HBM when it is available, otherwise fall back to DDR. */
    void *memkind_alloc(size_t size) {
      int avail = hbw_check_available();  /* 0 means HBM is available */
      void *a;
      hbw_set_policy(HBW_POLICY_PREFERRED);
      if (avail == 0) {
        a = hbw_malloc(size);
        assert(a != NULL);
      } else {
        a = malloc(size);
      }
      return a;
    }
    #else
    /* Built without hbwmalloc support: plain malloc. */
    void *memkind_alloc(size_t size) {
      void *a = malloc(size);
      return a;
    }
    #endif

Source:

    int *a = malloc(sizeof(int) * n);

LLVM IR before the transformation:

    %call3 = call i8* @malloc(i64 %mul)
    %6 = bitcast i8* %call3 to i32*
    store i32* %6, i32** @a, align 8

LLVM IR after the transformation:

    %call31 = call i8* @memkind_alloc(i64 %mul)
    %6 = bitcast i8* %call31 to i32*
    store i32* %6, i32** @a, align 8

Page 19

Experimental Results: Critical Data Analysis Results for the CG Benchmark

FP Array   cost   workshare      bandwidth   P(FP Array)
r          46     All parallel   regular     46
q          21     All parallel   regular     21
a          17     All parallel   regular     17
x          16     All parallel   regular     16
p          29     All parallel   irregular   0
Z          21     All parallel   irregular   0
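As a worked reading of two rows, assume the cost column already holds the sum of cost(r) · workshare(r) over all references to the array. The slide does not state this explicitly, but it is consistent with every reference being parallel, so that workshare(r) = 1 throughout. The bandwidth factor then either keeps or cancels that sum:

```latex
\[
P(\mathtt{r}) = \underbrace{1}_{\text{regular}} \cdot 46 = 46,
\qquad
P(\mathtt{p}) = \underbrace{0}_{\text{irregular}} \cdot 29 = 0
\]
```

Under this reading, r, q, a and x rank in that order for HBM placement, while p and Z get priority 0 despite their nonzero cost.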

Page 20

Performance Results

[Figure: DDR vs. HBM-array-allocation performance of the OpenMP version of CG. Mop/s, axis 0 to 10000, for different versions of CG (CLASS C): DDR(All), HBM(All), HBM(z), HBM(p), HBM(x), HBM(A), HBM(A,q,r), HBM(A,q,r,x); series: ICC:OMP, LLVM:OMP]

•  Setup: 1-node machine with one Intel(R) Xeon Phi(TM) [email protected]
•  LLVM 3.9, sporting Clang 3.9
•  Results using:
  –  Conjugate Gradient (CG) benchmark (NAS Parallel suite)
•  2.29x performance improvement using LLVM and 2.33x using ICC

Page 21

Conclusion and Future Work

•  HBM management from a compiler point of view
  –  Decide when it is beneficial to allocate data in the HBM for sequential and OpenMP code
  –  Case study: HBM (MCDRAM) of Knights Landing (KNL)
  –  2.29x performance improvement using the LLVM compiler and 2.33x using the Intel compiler, compared to the DDR version of CG
•  Future Work:
  –  Improve the accuracy of our priority function
  –  Implement more precise analyses regarding irregular accesses and instruction counts for recursive functions and nested loops
  –  Use of AutoHBW to add size as an additional metric
