loop scheduling in openmp · 2017-11-24 · •loop scheduling in openmp. •a primer of a loop...

Loop Scheduling in OpenMP

Vivek Kale

University of Southern California / Information Sciences Institute

Overview
• Loop scheduling in OpenMP.
  • A primer of a loop construct.
  • Definitions for schedules for OpenMP loops.

• A proposal for a user-defined loop schedule for OpenMP.
  • Need to allow for rapid development of novel loop scheduling strategies.
  • We suggest giving users of OpenMP applications control of the loop scheduling strategy to make this possible.
  • We call the scheme user-defined loop scheduling and propose the scheme as an addition to OpenMP.

OpenMP loops: A primer
• OpenMP provides a loop construct that specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team in the context of their implicit tasks.1

    #pragma omp for [clause[ [,] clause] ... ]
    for (int i = 0; i < 100; i++) {}

• The loop needs to be in canonical form.
• The clause can be one or more of the following: private(…), firstprivate(…), lastprivate(…), linear(…), reduction(…), schedule(…), collapse(...), ordered[…], nowait, allocate(…)

• We focus on the schedule(…) clause in this talk.

1: OpenMP Technical Report 6. November 2017. http://www.openmp.org/press-release/openmp-tr6/
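As a minimal sketch of the loop construct described on this slide, the function below applies `#pragma omp for` with a `schedule` and a `reduction` clause to a loop in canonical form. The function name `scaled_sum` and its parameters are illustrative assumptions, not part of the talk. When the code is compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Sum of factor * x[i] over an array, using the OpenMP loop construct.
   scaled_sum is a hypothetical helper written for this illustration. */
double scaled_sum(const double *x, int n, double factor)
{
    assert(n >= 0);
    double total = 0.0;
    #pragma omp parallel
    {
        /* Canonical loop form: integer index, loop-invariant bounds, unit stride. */
        #pragma omp for schedule(static) reduction(+:total)
        for (int i = 0; i < n; i++)
            total += factor * x[i];
    }
    return total;
}
```

Calling `scaled_sum(x, 4, 2.0)` on the array {1, 2, 3, 4} sums the scaled elements across the team; the `reduction` clause gives each thread a private partial sum that is combined at the end of the loop.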

A Schedule for an OpenMP loop

    #pragma omp parallel for schedule([modifier [, modifier]:]kind[, chunk_size])

• A schedule in OpenMP is:
  • a specification of how iterations of associated loops are divided into contiguous non-empty subsets
    • We call each of the contiguous non-empty subsets a chunk
  • and how these chunks are distributed to threads of the team.1

• The size of a chunk, denoted as chunk_size, must be a positive integer.

1: OpenMP Technical Report 6. November 2017. http://www.openmp.org/press-release/openmp-tr6/
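To make the notion of chunks concrete, here is a small sketch of how `schedule(static, chunk_size)` maps iterations to threads: chunks are dealt out round-robin in thread-number order. The helper names `static_owner` and `num_chunks` are hypothetical, and the sketch assumes the loop has been normalized to iterations 0 to n-1.

```c
#include <assert.h>

/* Thread that executes iteration i under schedule(static, chunk_size),
   assuming iterations are numbered from 0. Hypothetical helper. */
int static_owner(int i, int chunk_size, int num_threads)
{
    assert(i >= 0 && chunk_size > 0 && num_threads > 0);
    /* Chunk number is i / chunk_size; chunks go to threads round-robin. */
    return (i / chunk_size) % num_threads;
}

/* Number of chunks that a loop of n iterations is divided into;
   the last chunk may be smaller than chunk_size. */
int num_chunks(int n, int chunk_size)
{
    assert(n >= 0 && chunk_size > 0);
    return (n + chunk_size - 1) / chunk_size;
}
```

With a chunk size of 4 and 3 threads, iterations 0 to 3 go to thread 0, 4 to 7 to thread 1, 8 to 11 to thread 2, and iteration 12 wraps around to thread 0 again.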

The Kind of a Schedule
• A schedule kind is passed to an OpenMP loop schedule clause:
  • It provides a hint for how iterations of the corresponding OpenMP loop should be assigned to threads in the team of the OpenMP region surrounding the loop.

• Five kinds of schedules for an OpenMP loop1:
  • static
  • dynamic
  • guided
  • auto
  • runtime

• The OpenMP implementation and/or runtime defines how to assign chunks to threads of a team, given the kind of schedule specified as a hint.
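The exact chunk sizes produced by a kind such as guided are left to the implementation. As a rough sketch under that caveat, the function below simulates one common formulation: each chunk is the number of unassigned iterations divided by the number of threads (rounded up), never smaller than a minimum chunk size except for the final chunk. The name `guided_chunks` and the proportionality constant of 1 are assumptions made for illustration.

```c
#include <assert.h>

/* Fill sizes[] with a guided-style sequence of chunk sizes for a loop of
   n iterations and return the number of chunks written (at most max_chunks).
   Hypothetical helper; real runtimes may use a different formula. */
int guided_chunks(int n, int num_threads, int min_chunk,
                  int *sizes, int max_chunks)
{
    assert(n >= 0 && num_threads > 0 && min_chunk > 0);
    int remaining = n;
    int count = 0;
    while (remaining > 0 && count < max_chunks) {
        /* Proportional to the unassigned iterations per thread, rounded up. */
        int size = (remaining + num_threads - 1) / num_threads;
        if (size < min_chunk)
            size = min_chunk;      /* respect the minimum chunk size */
        if (size > remaining)
            size = remaining;      /* the last chunk may be smaller */
        sizes[count++] = size;
        remaining -= size;
    }
    return count;
}
```

The sequence starts with large chunks, which keeps scheduling overhead low, and ends with small ones, which helps even out load at the end of the loop.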


Modifiers of the Clause Schedule
• simd: the chunk_size must be a multiple of the simd width.1

• monotonic: If a thread executed iteration i, then the thread must execute iterations larger than i subsequently.1

• nonmonotonic: Execution order is not subject to the monotonic restriction.1


Need Novel Loop Scheduling Schemes in OpenMP
• Supercomputer architectures and applications are changing.
  • Large number of cores per node.
  • Speed variability across cores.
  • Complex dynamic behavior in applications themselves.

• So, we need new methods of distributing an application’s computational work to a node’s cores1, specifically to schedule an application’s parallelized loop’s iterations to cores.
• Such methods need to
  • Ensure data locality and reduce synchronization overhead while maintaining load balance2.
  • Be aware of inter-node parallelism handled by libraries such as MPICH3.
  • Adapt during an application’s execution.

1: R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of ACM 46(5):720–748, 1999.
2: S. Donfack, L. Grigori, W. D. Gropp, and V. Kale. Hybrid Static/Dynamic Scheduling for Already Optimized Dense Matrix Factorizations. In IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, 2012.
3: E. Lusk, N. Doss, and A. Skjellum. A High-performance, Portable Implementation of the Message Passing Interface Standard. Parallel Computing, 22:789–828, 1996.

Utility of Novel Strategies Shown
• The utility of novel strategies is demonstrated in published work by V. Kale et al.1,2 and others.
• For example, a mixed static/dynamic scheduling strategy with an adjustable static fraction.
  • Motivation: to limit the overhead of dynamic scheduling, while handling imbalances, such as those due to noise.

1: S. Donfack, L. Grigori, W. D. Gropp, and V. Kale. Hybrid Static/Dynamic Scheduling to Improve Performance of Already Optimized Dense Matrix Factorizations, IPDPS 2012.
2: V. Kale, S. Donfack, L. Grigori, and W. D. Gropp. Lightweight Scheduling for Balancing the Tradeoff Between Load Balance and Locality. 2014.

CALU using static scheduling (top) and f_d = 0.1 (bottom) with 2-level block layout, run on an AMD Opteron 16-core node.

Diagram of static (top) and mixed static/dynamic scheduling (bottom), where f_d is the dynamic fraction.

[Figure: Scheduling CALU’s task dependency graph. Static scheduling: good locality of data (+), ignores OS jitter (-). Timelines for threads 1 to 4 over time, with regions labeled MPI, slack, noise, and amplification; annotations include T_p and (1 - f_d) · T_p. White: idle time.]

Need to Support a User-defined Schedule in OpenMP

• The practice of burying all of the scheduling strategy inside an OpenMP implementation, with little visibility in application code beyond the use of keywords such as ’dynamic’ or ’guided’, isn’t adequate for this purpose.
• OpenMP implementations, e.g., GCC’s libgomp1 and LLVM’s OpenMP library2, are difficult to change, which can hinder development of loop scheduling strategies at a rapid pace.

1. Documentation for GCC’s OpenMP library. https://gcc.gnu.org/onlinedocs/libgomp/ .
2. Documentation for LLVM’s OpenMP library. http://openmp.llvm.org/Reference.pdf/ .

Reasons for User-defined Schedules
• Flexibility.
  • Given the variety of OpenMP implementations, having a standardized way of defining a user-level strategy provides flexibility to implement scheduling strategies for OpenMP programs easily and effectively.

• Emergence of Threaded Runtime Systems.
  • The emergence of threaded libraries such as Argobots1 and QuickThreads2 also argues in favor of a flexible specification of scheduling strategies.

• Note that the keywords auto and runtime aren’t adequate.
  • Specifying auto or runtime schedules isn’t sufficient because they don’t allow for user-level scheduling.

1. S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, A. Castello, D. Genet, T. Herault, P. Jindal, L. Kale, S. Krishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, and P. H. Beckman. Argobots: A lightweight threading tasking framework. 2016.

2. D. Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington Department of Computer Science and Engineering, May 1993.

Scheduling Code of libgomp
• The scheduling code in libgomp, for example, supports the static, dynamic, and guided schedules naturally.
• However, its code structure can’t accommodate the number and sophistication of the strategies that we would like to explore.

→ Adding a user-defined schedule into OpenMP libraries:
  • is possible with effort for a given library.
  • must be done differently for each library.

Specification of User-defined Scheduling Scheme

• We aim to specify a user-defined scheduling scheme within the OpenMP standard1.
• The scheme should accommodate an arbitrary user-defined scheduler.
• These are the elements required to define a scheduler:
  • Scheduler-specific data structures.
  • History record.
  • Specification of scheduling behavior of threads.

1. OpenMP Application Programming Interface. November 2015.

Potential Scheduling Data Structures
• The strategies should be allowed to use a subset or combination of:
  • Shared data structures.
  • Low-overhead steal-queues.
  • Exclusive queues meant for each thread.
  • Shared queues from which multiple threads can dequeue work, each representing a chunk of loop iterations.

History Tracking from Prior Invocations
1. To facilitate the ability to learn from recent history, e.g., values of slack in MPI communication1,2 from previous outer iterations, the scheduling scheme should allow for passing a call-site specific history-tracking object3 to the scheduler.

2. Examples of history information, i.e., attributes to track via history objects:
  • Previous values of dynamic fraction.
  • Iteration-to-core affinities.
  • Runtime performance profiles.

1. A. Faraj, P. Patarasuk, and X. Yuan. A Study of Process Arrival Patterns for MPI Collective Operations. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, Tampa, FL, USA, 2006. ACM.

2. B. Rountree, D. K. Lowenthal, B. R. de Supinski, M. Schulz, V. W. Freeh, and T. Bletsch. Adagio: Making DVS Practical for Complex HPC Applications. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, pages 460–469, Yorktown Heights, NY, USA. 2009. ACM.

3. V. Kale, T. Gamblin, T. Hoefler, B. R. de Supinski, and W. D. Gropp. Slack-conscious Lightweight Loop Scheduling for Scaling Past the Noise Amplification Problem. Poster. SC 2012.
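To illustrate what a call-site specific history object might hold, the sketch below records per-thread loop times from the previous invocation and derives a dynamic fraction for the next one. Everything here is an assumption for illustration: the type `LoopHistory`, its fields, and the update rule (set the dynamic fraction to the relative gap between the slowest thread and the mean) are not taken from the talk or from the cited work.

```c
#include <assert.h>

#define LH_MAX_THREADS 64   /* hypothetical capacity */

/* Hypothetical history record kept per call site. */
typedef struct {
    int    num_threads;
    double thread_time[LH_MAX_THREADS]; /* time each thread spent last time */
    double dynamic_fraction;            /* fraction used in the last invocation */
} LoopHistory;

/* Illustrative heuristic: the more imbalance was observed last time,
   the larger the share of iterations scheduled dynamically next time. */
double next_dynamic_fraction(LoopHistory *h)
{
    assert(h->num_threads > 0 && h->num_threads <= LH_MAX_THREADS);
    double max = 0.0, sum = 0.0;
    for (int t = 0; t < h->num_threads; t++) {
        sum += h->thread_time[t];
        if (h->thread_time[t] > max)
            max = h->thread_time[t];
    }
    if (max <= 0.0)
        return h->dynamic_fraction;     /* no timing data: keep the old value */
    double mean = sum / h->num_threads;
    h->dynamic_fraction = (max - mean) / max;   /* lies in [0, 1) */
    return h->dynamic_fraction;
}
```

A scheduler could call such a function at the start of each loop invocation, so that a perfectly balanced previous run yields a fraction of 0 (purely static) and a skewed run shifts more work into the dynamic part.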

Example: Library to Support Staggered Scheduling
• In earlier work1, a loop scheduling library that supports a “staggered” scheduling strategy was implemented.

• It was implemented within an OpenMP parallel region by enclosing the loop body with macros FORALL_BEGIN() and FORALL_END() with appropriate parameters.

1: V. Kale, A. P. Randles, V. Kale, and W. D. Gropp. Locality-Optimized Scheduling for Improved Load Balancing on SMPs. In Proceedings of the 21st European MPI Users’ Group Meeting Conference on Recent Advances in the Message Passing Interface, volume 0, pages 1063–1074. Association for Computing Machinery, 2014.

1. The scheduling strategy’s name and associated parameters are specified in the parameters of the macro.
2. Both macros invoke library functions corresponding to the strategy’s name as specified in the macro call. The macro’s parameters are passed to the user-defined scheduler functions.

    static LoopTimeRecord* ltr;
    double fd = 0.3;
    #pragma omp parallel
    {
        int start = 0, end = 0;  // declared inside the region so each thread has its own copy
        int tid = omp_get_thread_num();
        int numThrds = omp_get_num_threads();
        FORALL_BEGIN(sds, tid, numThrds, 0, n, start, end, fd)
            for (int i = start; i < end; i++)
                c[i] += a[i]*b[i];
        FORALL_END(sds, tid, start, end)
    }


FORALL_BEGIN(strat, …) macro expansion:

    strat##_LoopStart(.. &start, &end, ..);
    do (loop not done) {

FORALL_END(strat, …) macro expansion:

    }
    while (strat##_LoopNext(… &start, &end, ..));
    barrier;
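The macro scheme can be sketched as compilable C. The version below is an illustration under stated assumptions, not the library from the earlier work: the strategy name `blk`, the functions `blk_LoopStart` and `blk_LoopNext`, and the macro parameter lists are hypothetical. The strategy hands each thread one contiguous block and then reports that no work remains. Token pasting (`strat##_LoopStart`) selects the scheduler functions from the strategy name, and `_Pragma("omp barrier")` places the barrier inside the macro.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical strategy "blk": one contiguous block of [lo, hi) per thread. */
static void blk_LoopStart(int tid, int nthreads, int lo, int hi,
                          int *start, int *end)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    int n = hi - lo, base = n / nthreads, rem = n % nthreads;
    *start = lo + tid * base + (tid < rem ? tid : rem);
    *end   = *start + base + (tid < rem ? 1 : 0);
}

/* Returns nonzero while more chunks remain; "blk" never has a second chunk. */
static int blk_LoopNext(int tid, int *start, int *end)
{
    (void)tid;
    *start = *end;
    return 0;
}

#define FORALL_BEGIN(strat, tid, nthreads, lo, hi, start, end) \
    strat##_LoopStart(tid, nthreads, lo, hi, &(start), &(end)); \
    do {
#define FORALL_END(strat, tid, start, end) \
    } while (strat##_LoopNext(tid, &(start), &(end))); \
    _Pragma("omp barrier")

/* c[i] += a[i] * b[i], scheduled through the macros above. */
void vec_mul_add(double *c, const double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        int tid = 0, nthreads = 1;   /* serial fallback without OpenMP */
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        int start = 0, end = 0;      /* private to each thread */
        FORALL_BEGIN(blk, tid, nthreads, 0, n, start, end)
            for (int i = start; i < end; i++)
                c[i] += a[i] * b[i];
        FORALL_END(blk, tid, start, end)
    }
}
```

Swapping in another strategy only requires defining a new pair of functions with the matching prefix and changing the first macro argument.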

Proposal for User-defined Schedule in OpenMP
• We propose a user-defined scheduling scheme that is an adaptation of the above macro-based scheme.
  • It overcomes limitations of the simple macro-based scheme.

• The proposed API, as used for a simple code, is illustrated below.

    double dynamicFraction = 0.3;
    static LoopTimeRecord* ltr; // for history.
    int chunkSize = 4;
    #pragma omp parallel for schedule(user, staggered:chunkSize:dynamicFraction:ltr)
    for (int i = 0; i < n; i++)
        c[i] = a[i]*b[i];

• The first parameter of the clause ’schedule’ specifies a new schedule kind, user.
• The second parameter specifies the scheduling strategy’s name, staggered, optionally with the strategy-specific parameters.

Implementation of User-defined Schedule
• When a user specifies a schedule kind user and a strategy named X:
  • They need to link a library that defines functions:
    • X_loopStart(), X_loopNext() and X_init().

• X_init() allows a user-level scheduler to allocate and initialize its data structures that are to be used commonly across parallel loops that use X.
• The functions X_loopStart() and X_loopNext() determine a loop’s indices that a thread should work on, based on the parameter values for the scheduling strategy and of the loop.

Every thread executes when starting a new loop:

    …
    X_loopStart();
    do X_loopNext(); until done;   // done flag is set by the user-defined scheduler functions

• As long as one is allowed to define these functions, one can implement a user-defined scheduler.

• Every thread should call X_loopNext() repeatedly.
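A minimal sketch of the three-function protocol, for a hypothetical strategy named `dyn` that hands out fixed-size chunks from a shared counter. The names `dyn_init`, `dyn_loopStart`, `dyn_loopNext`, and `dyn_state_t` are assumptions, as are two simplifications: the loop is set up by one thread before the team starts rather than by every thread, and `dyn_loopNext` returns 0 in place of setting a done flag.

```c
#include <assert.h>

/* Hypothetical shared scheduler state for strategy "dyn". */
typedef struct { int next; int hi; int chunk; } dyn_state_t;
static dyn_state_t dyn_state;

/* Called once: set up data shared by all loops that use "dyn". */
void dyn_init(int chunk)
{
    assert(chunk > 0);
    dyn_state.chunk = chunk;
}

/* Called before each loop: record the iteration range [lo, hi). */
void dyn_loopStart(int lo, int hi)
{
    dyn_state.next = lo;
    dyn_state.hi = hi;
}

/* Called repeatedly by every thread: claim the next chunk.
   Returns 1 with [*start, *end) filled in, or 0 when the loop is done. */
int dyn_loopNext(int *start, int *end)
{
    int s;
    #pragma omp atomic capture
    { s = dyn_state.next; dyn_state.next += dyn_state.chunk; }
    if (s >= dyn_state.hi)
        return 0;
    *start = s;
    *end = (s + dyn_state.chunk < dyn_state.hi) ? s + dyn_state.chunk
                                                : dyn_state.hi;
    return 1;
}

/* Example loop driven by the scheduler: sum of i*i for i in [0, n). */
long sum_of_squares(int n)
{
    long total = 0;
    dyn_loopStart(0, n);
    #pragma omp parallel reduction(+:total)
    {
        int start, end;
        while (dyn_loopNext(&start, &end))
            for (int i = start; i < end; i++)
                total += (long)i * i;
    }
    return total;
}
```

The driver loop inside the parallel region is the part that a compiler would generate for `schedule(user, X)` under the proposal; here it is written by hand.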

Software Architecture for User-defined Schedule

[Diagram of the source files myApplication.C and omp-mod.h; the code fragments shown in the diagram follow.]

    #include <mpi.h>
    #include <omp-mod.h>

    int main()
    {
        double dynamicFraction = 0.3;
        static LoopTimeRecord* ltr; // for history.
        int chunkSize = 4;
        while (timestep < 1000) {
            #pragma omp parallel for schedule(user, staggered:chunkSize:dynamicFraction:ltr)
            for (int i = 0; i < n; i++)
                c[i] = a[i]*b[i];
        }
    }

    #include <userDefSched.h>

    #define FORALL_BEGIN(strat, …) strat##_LoopStart(.. &start, &end, ..); do (loop not done) {
    #define FORALL_END(strat, …) } while (strat##_LoopNext(… &start, &end, ..)); barrier();

    switch (clause) { ‘static’: ‘dynamic’: ‘guided’: ‘auto’: ‘runtime’: ‘user’: }

    strat##_LoopStart(.. &start, &end, ..); {}

Static/dynamic Scheduling Using a Shared Data Structure

sd_Init():
    - allocate shared data structures for loops that use strategy sd;

sd_LoopStart():
    if I am the master thread,
        - set up a data structure loopParams, along with a lock;
        - signal other threads to start;
    else
        - wait for the signal from master thread;
    - use the shared loopParams data structure to calculate my static iterations and execute loop_body for that range;

sd_LoopNext():
    lock loopParams;
    extract a chunk to work on;
    unlock loopParams;
    if (no chunk available)
        wait for barrier;
        done = 1;
    else
        execute loop_body for the extracted chunk;
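The pseudocode for the sd strategy can be approximated in plain OpenMP C. The sketch below is a simplification under stated assumptions: the shared `LoopParams` structure is filled in before the parallel region (replacing the master thread's setup and signal), a named `critical` section stands in for the lock, the implicit barrier at the end of the parallel region replaces the explicit one, and the dynamic iterations are taken from the tail of the iteration space. The names `LoopParams` and `sd_vec_mul_add` are hypothetical.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical shared loop description. */
typedef struct {
    int n;             /* total iterations */
    int n_static;      /* iterations [0, n_static) are scheduled statically */
    int next_dynamic;  /* first unclaimed dynamic iteration */
    int chunk;         /* dynamic chunk size */
} LoopParams;

/* c[i] += a[i] * b[i] with a mixed static/dynamic schedule. */
void sd_vec_mul_add(double *c, const double *a, const double *b,
                    int n, double dynamic_fraction, int chunk)
{
    assert(n >= 0 && chunk > 0);
    assert(dynamic_fraction >= 0.0 && dynamic_fraction <= 1.0);
    LoopParams lp;
    lp.n = n;
    lp.n_static = (int)((1.0 - dynamic_fraction) * n + 0.5);
    lp.next_dynamic = lp.n_static;
    lp.chunk = chunk;

    #pragma omp parallel shared(lp)
    {
        int tid = 0, nthreads = 1;   /* serial fallback without OpenMP */
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        /* Static phase: an even block of the static iterations. */
        int base = lp.n_static / nthreads, rem = lp.n_static % nthreads;
        int lo = tid * base + (tid < rem ? tid : rem);
        int hi = lo + base + (tid < rem ? 1 : 0);
        for (int i = lo; i < hi; i++)
            c[i] += a[i] * b[i];

        /* Dynamic phase: claim chunks until none remain. */
        for (;;) {
            int start, end;
            #pragma omp critical(sd_loop_params)
            {
                start = lp.next_dynamic;
                end = (start + lp.chunk < lp.n) ? start + lp.chunk : lp.n;
                lp.next_dynamic = end;
            }
            if (start >= end)
                break;               /* no chunk available */
            for (int i = start; i < end; i++)
                c[i] += a[i] * b[i];
        }
    }
}
```

Threads that finish their static block early pick up more of the dynamic chunks, which is how the scheme absorbs imbalance while keeping most iterations on a fixed thread.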

An Alternative Impl. of Static/Dynamic Strategy Using Steal Queues for Dynamic Iterations

sd2_LoopStart():
    if I am the master thread
        - set up a data structure loopParams with loop parameters;
        - enqueue a single entry corresponding to the dynamic range of iterations into thread 0’s steal queue;
        - signal other threads to start;
    else
        - wait for the signal from master thread;
    - use the shared data structure to calculate my static iterations and execute loopBody for that range;

sd2_LoopNext():
    - nextRange = myQueue.dequeue();
    if (nextRange == NULL)
        nextRange = steal(random_neighbor);
    if (nextRange != NULL)
        L = nextRange->low; U = nextRange->high;
        if ((U - L) > Threshold)
            - split the range in 2, enqueue them in my steal queue;
        else
            - execute loop body for iterations L:U;
            - update count of iterations completed to set the “done” flag when done;
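The range-splitting step of the steal-queue variant can be sketched for a single thread's queue. This is a serial illustration only: stealing from a neighbor and all synchronization are left out, and the names `Range`, `RangeQueue`, `loop_next`, and `run_all` are hypothetical. A range larger than the threshold is split in two and both halves are enqueued; a small enough range is executed.

```c
#include <assert.h>

#define QUEUE_CAP 64   /* hypothetical fixed capacity */

typedef struct { int low, high; } Range;   /* iterations [low, high) */
typedef struct { Range items[QUEUE_CAP]; int size; } RangeQueue;

static void enqueue(RangeQueue *q, Range r)
{
    assert(q->size < QUEUE_CAP);
    q->items[q->size++] = r;
}

static int dequeue(RangeQueue *q, Range *out)
{
    if (q->size == 0)
        return 0;
    *out = q->items[--q->size];   /* the owner takes the newest entry */
    return 1;
}

/* One scheduling step. Returns the number of iterations executed
   (0 if the range was split or the queue was empty). */
int loop_next(RangeQueue *q, int threshold, int *visited)
{
    Range r;
    if (!dequeue(q, &r))
        return 0;                 /* a full scheduler would try to steal here */
    if (r.high - r.low > threshold) {
        int mid = r.low + (r.high - r.low) / 2;
        enqueue(q, (Range){mid, r.high});
        enqueue(q, (Range){r.low, mid});
        return 0;
    }
    for (int i = r.low; i < r.high; i++)
        visited[i]++;             /* stands in for the loop body */
    return r.high - r.low;
}

/* Run a whole loop of n iterations through one queue. */
int run_all(int n, int threshold, int *visited)
{
    assert(n >= 0 && threshold > 0);
    RangeQueue q = { .size = 0 };
    enqueue(&q, (Range){0, n});
    int done = 0;
    while (q.size > 0)
        done += loop_next(&q, threshold, visited);
    return done;
}
```

Because large ranges stay in the queue until they are split, a thief that takes an entry from the other end of a queue would tend to obtain a large piece of work, which is the usual motivation for this structure.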

An Implementation of Staggered Static/Dynamic Strategy

staggered_LoopStart():
    if I am the master thread
        - set up a data structure loopParams;
        - signal other threads to start;
    - enqueue entries corresponding to chunks of my dynamic range of iterations into my thread’s steal queue;
    - calculate my static iterations and execute loopBody for that range;

staggered_LoopNext():
    nextRange = myQueue.dequeue();
    if (nextRange == NULL)
        nextRange = steal(random_neighbor);
    if (nextRange != NULL)
        - L = nextRange->low; U = nextRange->high;
        - execute loop body for my iterations L:U;
        - update count of iterations completed to set the “done” flag when done;

tion. The name of the scheduling strategy and its associated parameters are specified in the parameter of the macro function. The parameters passed to the macro function are passed to the scheduler functions defined by the user. This macro-based approach is of limited utility because of its inability to use the compiler support for features such as reductions. The approach we propose eliminates these limitations and leads to concise code.

[Figure 1 graphic: the loop iteration space, laid out in increasing loop iteration number and divided among threads T0 to T3.]

Figure 1: Diagram of the iteration space of a threaded computation region showing loop iterations distributed to threads when using the technique of staggered mixed static/dynamic scheduling. In the figure, the smaller, red colored, rectangles correspond to dynamic iterations tentatively assigned to a specific thread.
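One reading of Figure 1 is that each thread owns a contiguous block of the iteration space, executes the front part of its block statically, and places the rest in its own queue as dynamic iterations. The sketch below computes that layout. The struct `StaggeredBlock`, the function `staggered_block`, and the assumption that the static part comes first within each block are illustrative guesses rather than details confirmed by the source.

```c
#include <assert.h>

/* Hypothetical description of one thread's share of the iteration space:
   [static_lo, static_hi) is executed statically and
   [dynamic_lo, dynamic_hi) is tentatively assigned as dynamic work. */
typedef struct {
    int static_lo, static_hi;
    int dynamic_lo, dynamic_hi;
} StaggeredBlock;

StaggeredBlock staggered_block(int n, int num_threads, int tid,
                               double static_fraction)
{
    assert(n >= 0 && num_threads > 0 && tid >= 0 && tid < num_threads);
    assert(static_fraction >= 0.0 && static_fraction <= 1.0);
    /* Contiguous block of the whole loop owned by this thread. */
    int base = n / num_threads, rem = n % num_threads;
    int lo = tid * base + (tid < rem ? tid : rem);
    int hi = lo + base + (tid < rem ? 1 : 0);
    /* Front part static, remainder dynamic (rounded to nearest iteration). */
    int n_static = (int)(static_fraction * (hi - lo) + 0.5);
    StaggeredBlock b = { lo, lo + n_static, lo + n_static, hi };
    return b;
}
```

For 32 iterations, 4 threads, and a static fraction of 0.5, thread 1 would run iterations 8 to 11 statically and hold 12 to 15 as dynamic work that it or another thread can later claim.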

    double static_fraction = 0.5; double constraint = 0.2; int chunk_size = 4;
    #pragma omp parallel for schedule(user, statdynstaggered:chunk_size:static_fraction:constraint) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i]*b[i];

Figure 2: An illustration of the use of the user-defined scheduler in a dot product code parallelized with OpenMP.

We propose the user-defined scheduling scheme to be an adaptation of the scheme of using macro-based functions. The schedule kind ’user’ and the name of the scheduling strategy with the strategy’s parameter value(s), e.g., the name ’staggered’ and the parameter values for the static fraction, chunk size and constraint, are specified as parameters of the OpenMP ’schedule’ clause. Figure 2 illustrates the proposed API for using a user-defined schedule through the API’s use in the dot product code. The user-defined scheduler named ’staggered’, specified in the second parameter of the schedule clause, defines how the work of a loop in the code is enqueued into queues or data structures and the strategy of dequeuing (or obtaining) a chunk of loop iterations to execute next using the scheduling strategy illustrated in Figure 1.
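Since `schedule(user, …)` is a proposal and not accepted by current compilers, a runnable point of comparison is the same dot product with an existing schedule kind. The sketch below uses `schedule(dynamic, 4)` together with a `reduction` clause, the compiler feature that the macro-based approach cannot use. The function name `dot` is an assumption.

```c
#include <assert.h>

/* Dot product with a standard OpenMP schedule and a compiler-supported
   reduction. Under the proposal, only the schedule clause would change. */
double dot(const double *a, const double *b, int n)
{
    assert(n >= 0);
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The loop body and the reduction stay untouched when the scheduling strategy changes, which is the conciseness argument made for the clause-based API.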

When the user specifies a schedule clause with kind ’user’ and a


Discussion of API’s details
• We have sketched above a syntax for the API for user-level schedulers.
  • However, we expect that the API’s details will be worked out with the community’s consensus.

• Note that there is a precedent for adding user-defined functions in the OpenMP standard.
  • The combiner function in user-defined reductions.
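The precedent mentioned here, user-defined reductions, looks as follows in OpenMP 4.0 and later. The sketch declares a reduction named `minloc` whose combiner keeps the smaller of two (value, index) pairs. The type `MinLoc`, the reduction name, and the function `find_min` are illustrative choices; `omp_in`, `omp_out`, `omp_priv`, and `omp_orig` are the identifiers that OpenMP defines for combiner and initializer expressions.

```c
#include <assert.h>
#include <float.h>

typedef struct { double value; int index; } MinLoc;

/* User-defined combiner: keep whichever partial result has the smaller value.
   Each private copy starts as a copy of the original variable. */
#pragma omp declare reduction(minloc : MinLoc : \
        omp_out = (omp_in.value < omp_out.value ? omp_in : omp_out)) \
        initializer(omp_priv = omp_orig)

/* Location and value of the minimum of x[0..n-1]; index is -1 if n == 0. */
MinLoc find_min(const double *x, int n)
{
    MinLoc best = { DBL_MAX, -1 };
    #pragma omp parallel for reduction(minloc : best)
    for (int i = 0; i < n; i++) {
        if (x[i] < best.value) {
            best.value = x[i];
            best.index = i;
        }
    }
    return best;
}
```

The user supplies only the combiner and initializer, and the compiler and runtime handle privatization and the final merge. The scheduling proposal follows the same division of labor, with user-supplied scheduler functions in place of the combiner.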

Summary
• Need for experimentation with sophisticated loop scheduling strategies.
• The OpenMP community should discuss how to allow flexible specification of such strategies in a user’s code and how to design a user-level scheduler library so that it can be portably used with any conforming OpenMP implementation.
• Supporting user-defined schedulers in this way will facilitate rapid development of scheduling strategies.
• I hope the experts will discuss these ideas at this conference.

Acknowledgements: University of Southern California / ISI’s Technical Computing New Directions program.
