loop scheduling in openmp · 2017-11-24 · •loop scheduling in openmp. •a primer of a loop...
TRANSCRIPT
LoopSchedulinginOpenMPVivekKale
UniversityofSouthernCalifornia/InformationSciencesInstitute
Overview• LoopSchedulinginOpenMP.• Aprimerofaloopconstruct• DefinitionsforschedulesforOpenMP loops.
• Aproposalforuser-definedloopscheduleforOpenMP• Needtoallowforrapiddevelopmentofnovelloopschedulingstrategies.• WesuggestgivingusersofOpenMP applicationscontroloftheloopschedulingstrategytodoso.• Wecalltheschemeuser-definedloopschedulingandproposetheschemeasanadditiontoOpenMP.
OpenMP loops:Aprimer• OpenMP providesaloopconstructthatspecifiesthattheiterationsofoneormoreassociatedloopswillbeexecutedin parallelbythreadsintheteaminthecontextoftheirimplicittasks.1
#pragma omp for [clause[ [,] clause] ... ]for (int i=0; i<100; i++){}
• Loopneedstobeincanonicalform.• Theclause canbeoneormoreofthefollowing:private(…), firstprivate(…), lastprivate(…), linear(…), reduction(…), schedule(…), collapse(...), ordered[…], nowait, allocate(…)
• Wefocusontheclauseschedule(…) inthistalk.1:OpenMP TechnicalReport6.November2017.http://www.openmp.org/press-release/openmp-tr6/
AScheduleforanOpenMP loop#pragma omp parallel for schedule([modifier [modifier]:]kind[,chunk_size])
• Aschedule inOpenMP is:• aspecificationofhowiterationsofassociatedloopsaredividedintocontiguousnon-emptysubsets• Wecalleachofthecontiguousnon-emptysubsetsachunk
• and howthesechunksaredistributedtothreadsoftheteam.1
• Thesizeofachunk,denotedaschunk_sizemustbeapositiveinteger.
1:OpenMP TechnicalReport6.November2017.http://www.openmp.org/press-release/openmp-tr6/
TheKindofaSchedule• Aschedulekind ispassedtoanOpenMP loopscheduleclause:• providesahintforhowiterationsofthecorrespondingOpenMP loopshouldbeassignedtothreadsintheteamoftheOpenMP regionsurroundingtheloop.
• FivekindsofschedulesforOpenMP loop1:• static• dynamic• guided• auto• runtime
• TheOpenMP implementationand/orruntimedefineshowtoassignchunkstothreadsofateamgiventhekindofschedulespecifiedbyasahint.
1:OpenMP TechnicalReport6.November2017.http://www.openmp.org/press-release/openmp-tr6/
ModifiersoftheClauseSchedule• simd:thechunk_size mustbeamultipleofthesimd width.1
• monotonic:Ifathreadexecutediterationi,thenthethreadmustexecuteiterationslargerthani subsequently.1
• non-monotonic:Executionordernotsubjecttothemonotonicrestriction.1
1:OpenMP TechnicalReport6.November2017.http://www.openmp.org/press-release/openmp-tr6/
NeedNovelLoopSchedulingSchemesinOpenMP• Supercomputerarchitecturesandapplicationsarechanging.• Largenumberofcorespernode.• Speedvariabilityacrosscores.• Complexdynamicbehaviorinapplicationsthemselves.
• So,weneednewmethodsofdistributinganapplication’scomputationalworktoanode’scores1,specificallytoscheduleanapplication’sparallelizedloop’siterationstocores.• Suchmethodsneedto• Ensuredatalocalityandreducesynchronizationoverheadwhilemaintainingloadbalance2.• Beawareofinter-nodeparallelismhandledbylibrariessuchasMPICH3.• Adaptduringanapplication’sexecution.
1:R.D.Blumofe andC.E.Leiserson.SchedulingMultithreadedComputationsbyWorkStealing.JournalofACM46(5):720–748,1999.2:S.Donfack,L.Grigori,W.D.Gropp,andV.Kale.HybridStatic/DynamicSchedulingforAlreadyOptimizedDenseMatrixFactorizations.InIEEEInternationalParallelandDistributedProcessingSymposium,IPDPS2012,Shanghai,China,2012.3:E.Lusk,N.Doss,andA.Skjellum.AHigh-performance,PortableImplementationoftheMessagePassingInterfaceStandard.ParallelComputing,22:789–828,1996.
UtilityofNovelStrategiesShown• TheutilityofnovelstrategiesisdemonstratedinpublishedworkbyV.Kaleetal 1,2 andothers.• Forexample,static-dynamicschedulingmixedstrategywithanadjustablestaticfraction.• Motivation:tolimittheoverheadofdynamicscheduling,whilehandlingimbalances,suchasthoseduetonoise.
1:S.Donfack,L.Grigori,W.D.Gropp,andV.Kale.HybridStatic/DynamicSchedulingtoImprovePerformanceofAlreadyOptimizedDenseMatrixFactorizations,IPDPS2012.2:V.Kale,S.Donfack,L.Grigori,andW.D.Gropp.LightweightSchedulingforBalancingtheTradeoffBetweenLoadBalanceandLocality2014.
CALUusingstaticscheduling(top)andfd =0.1(bottom)with2-levelblocklayoutrunonAMDOpteron16corenode.
Diagramofstatic(top)andmixedstatic/dynamicscheduling(bottom)wherefdisthedynamicfraction.
13
Scheduling CALU’s Task Dependency Graph• Static scheduling
+ Good locality of data - Ignores OS jitter
Slack MPI
1234
Threads
Tp
1234
1234
Time
Slack
Noise
ThreadsThreads
MPI
MPI
Amplification
t1q(1� fd) · Tp S
White:idletime
NeedtoSupportAUser-definedScheduleinOpenMP
• PracticeofburyingalloftheschedulingstrategyinsideanOpenMPimplementation,withlittlevisibilityinapplicationcodebeyondtheuseofkeywordssuchas’dynamic’or’guided’,isn’tadequateforthispurpose.• OpenMP’s implementations,e.g.,GCC’slibgomp 1andLLVM’sOpenMPlibrary 2 aredifficulttochange,whichcanhinderdevelopmentofloopschedulingstrategiesatarapidpace.
1. DocumentationforGCC’sOpenMP library.https://gcc.gnu.org/onlinedocs/libgomp/ .2. DocumentationforLLVM’sOpenMP library.http://openmp.llvm.org/Reference.pdf/ .
ReasonsforUser-definedSchedules• Flexibility.
• GiventhevarietyofOpenMP implementations,havingastandardizedwayofdefiningauser-levelstrategyprovidesflexibilitytoimplementschedulingstrategiesforOpenMP programseasilyandeffectively.
• EmergenceofThreadedRuntimeSystems.• EmergenceofthreadedlibrariessuchasArgobots1 andQuickThreads2 arguesinfavorofaflexiblespecificationofschedulingstrategiesalso.
• Notethatkeywordsauto andruntime aren’tadequate.• Specifyingautoorruntimeschedulesisn’tsufficientbecausetheydon’tallowforuser-levelscheduling.
1. S.Seo,A.Amer,P.Balaji,C.Bordage,G.Bosilca,A.Brooks,A.Castello,D.Genet,T.Herault,P.Jindal,L.Kale,S.Krishnamoorthy,J.Lifflander,H.Lu,E.Meneses,M.Snir,Y.Sun,andP.H.Beckman.Argobots:Alightweightthreadingtaskingframework.2016.
2. D.Keppel.Toolsandtechniquesforbuildingfastportablethreadspackages.TechnicalReportUWCSE93-05-06,UniversityofWashingtonDepartmentofComputerScienceandEngineering,May1993.
SchedulingCodeoflibgomp• Theschedulingcodeinlibgomp,forexample,supportsthestatic,dynamic,andguidedschedulesnaturally.• However,itscodestructurecan’taccommodatethenumberandsophisticationofthestrategiesthatwewouldliketoexplore.
àAddingauser-definedscheduleintoOpenMP libraries:• ispossiblewitheffortforagivenlibrary.• mustbedonedifferentlyforeachlibrary.
SpecificationofUser-definedSchedulingScheme
• Weaimtospecifyauser-definedschedulingschemewithintheOpenMPstandard1 .• Theschemeshouldaccommodateanarbitraryuser-definedscheduler.• Thesearetheelementsrequiredtodefineascheduler.• Scheduler-specificdatastructures.• Historyrecord.• Specificationofschedulingbehaviorofthreads.
1. OpenMP ApplicationProgrammingInterface.November2015.
PotentialSchedulingDataStructures• Thestrategiesshouldbeallowedtouseasubsetorcombinationof:• Shareddatastructures.• Low-overheadsteal-queues.• Exclusivequeuesmeantforeachthread.• Sharedqueuesfromwhichmultiplethreadscandequeue work,eachrepresentingachunkofloopiterations.
HistoryTrackingfromPriorInvocations1. Tofacilitatetheabilitytolearnfromrecenthistory,e.g.,valuesof
slackinMPIcommunication1,2 frompreviousouteriterations,theschedulingschemeshouldallowforpassingacall-sitespecifichistory-trackingobject3 tothescheduler.
2. Examplesofhistoryinformation,i.e.,attributestotrackviahistoryobjects:• Previousvaluesofdynamicfraction.• Iteration-to-coreaffinities.• Runtimeperformanceprofiles.
1. A.Faraj,P.Patarasuk,andX.Yuan.AStudyofProcessArrivalPatternsforMPICollectiveOperations.InProceedingsofthe2006ACM/IEEEConferenceonSupercomputing,SC’06,Tampa,FL,USA,2006.ACM.
2. B.Rountree,D.K.Lowenthal,B.R.deSupinski,M.Schulz,V.W.Freeh,andT.Bletsch.Adagio:MakingDVSPracticalforComplexHPCApplications.InProceedingsofthe23rdInternationalConferenceonSupercomputing,ICS’09,pages460–469,YorktownHeights,NY,USA.2009.ACM.
3. V.Kale,T.Gamblin,T.Hoefler,B.R.deSupinski andW.D.Gropp.Slack-consciousLightweightLoopSchedulingforScalingPasttheNoiseAmplificationProblem.Poster.SC2012
Example:LibrarytoSupportStaggeredScheduling• Inearlierwork1,aloopschedulinglibrarythatsupportsa“staggered”schedulingstrategywasimplemented.
• ItwasimplementedwithinanOpenMP parallelregionbyenclosingtheloopbodywithmacrosFORALL_BEGIN()andFORALL_END()withappropriateparameters.
1:V.Kale,A.P.Randles,V.Kale,andW.D.Gropp.Locality-OptimizedSchedulingforImprovedLoadBalancingonSMPs.InProceedingsofthe21stEuropeanMPIUsers’GroupMeetingConferenceonRecentAdvancesintheMessagePassingInterface,volume0,pages1063–1074.AssociationforComputingMachinery,2014.
1.Theschedulingstrategy’snameandassociatedparametersarespecifiedintheparametersofthemacro.2.Bothmacrosinvokelibraryfunctionscorrespondingtothestrategy’snameasspecifiedinthemacrocall.Themacro’sparametersarepassedtotheuser-definedschedulerfunctions.
int start, end = 0;static LoopTimeRecord* ltr;double fd = 0.3; #pragma omp parallel
{ int tid = omp_get_thread_num();int numThrds = omp_get_num_threads();FORALL_BEGIN(sds,tid,numThrds,0,n,start,end,fd)for(int i=start;i<end;i++)c[i] += a[i]*b[i];
FORALL_END(sds,tid,start,end)}
Example:LibrarytoSupportStaggeredScheduling• Inearlierwork1,aloopschedulinglibrarythatsupportsthis“staggered”schedulingstrategywasimplemented.
• ItwasimplementedwithinanOpenMP parallelregionbyenclosingtheloopbodywithmacrosFORALL_BEGIN()andFORALL_END()withappropriateparameters.
1:V.Kale,A.P.Randles,V.Kale,andW.D.Gropp.Locality-OptimizedSchedulingforImprovedLoadBalancingonSMPs.InProceedingsofthe21stEuropeanMPIUsers’GroupMeetingConferenceonRecentAdvancesintheMessagePassingInterface,volume0,pages1063–1074.AssociationforComputingMachinery,2014.
1.Theschedulingstrategy’snameandassociatedparametersarespecifiedintheparametersofthemacro.2.Bothmacrosinvokelibraryfunctionscorrespondingtothestrategy’snameasspecifiedinthemacrocall.Themacro’sparametersarepassedtotheuser-definedschedulerfunctions.
int start, end = 0;static LoopTimeRecord* ltr;double fd = 0.3; #pragma omp parallel
{ int tid = omp_get_thread_num();int numThrds = omp_get_num_threads();FORALL_BEGIN(sds,tid,numThrds,0,n,start,end,fd)for(int i=start;i<end;i++)c[i] += a[i]*b[i];
FORALL_END(sds,tid,start,end)}
FORALL_BEGIN(strat,…)macroexpansionstrat_##LoopStart(..&start,&end,..);do(loopnotdone){
FORALL_END(strat,…)macroexpansion:}
while(strat_##LoopNext(… &start,&end,..));barrier;
ProposalforUser-definedScheduleinOpenMP• Weproposeauser-definedschedulingschemethatisanadaptationoftheabovemacro-basedscheme.• Overcomeslimitationsofsimplemacro-basedscheme.
• TheproposedAPIusedforasimplecodeisillustratedbelow.
double dynamicFraction = 0.3;static LoopTimeRecord* ltr; // for history.int chunkSize = 4;#pragma omp parallel for schedule(user, staggered:chunkSize:dynamicFraction:ltr)for(int i = 0; i < n; i++)
c[i] = a[i]*b[i];
• Thefirstparameteroftheclause’schedule’specifiesanewschedulekinduser.• Thesecondparameterspecifiestheschedulingstrategy’snamestaggered,optionallywiththestrategy-specificparameters.
ImplementationofUser-definedSchedule• Whenauserspecifiesaschedulekinduser andastrategynamedX
• Theyneedtolinkalibrarythatdefinesfunctions:• X_loopStart(), X_loopNext() andX_init().
• X_init()allowsauser-levelschedulertoallocateandinitializeitsdatastructuresthataretobeusedcommonlyacrossparallelloops thatuseX.• ThefunctionsX_loopStart()andX_loopNext()determinealoop’sindicesthatathreadshouldworkonbasedontheparametervaluesfortheschedulingstrategyandoftheloop.
Everythreadexecuteswhenstartinganewloop:…X_loopStart();do X_loopNext();untildone;//doneflagissetbytheuser-defined//schedulerfunctions
• Aslongasoneisallowedtodefinethesefunctions,onecanimplementauser-definedscheduler.
• EverythreadshouldcallX_loopNext() repeatedly.
SoftwareArchitectureforUser-definedSchedule
#include <mpi.h>
#include <omp-mod.h>
int main(){double dynamicFraction = 0.3;static LoopTimeRecord* ltr; // for history.int chunkSize = 4;while(timestep <1000) {#pragma omp parallel for schedule(user,staggered:chunkSize:dynamicFraction:ltr)for(int i = 0; i < n; i++)
c[i] = a[i]*b[i];}
}
#include <userDefSched.h>
#defineFORALL_BEGIN(strat,…)strat_##LoopStart(..&start,&end,..);do(loopnotdone){
#defineFORALL_END(strat,…)}
while(strat_##LoopNext(… &start,&end,..));barrier();
switch(clause){ ‘static’:‘dynamic’:‘guided’: ‘auto’:‘runtime’:‘user’:
}
strat_##LoopStart(..&start,&end,..);{}
myApplication.Comp-mod.h
sd_Init():- allocateshareddatastructuresforloopsthatusestrategysd;
sd_LoopStart():ifIamthemasterthread,- setupadatastructureloopParams,alongwithalock;- signalotherthreadstostart;
else- waitforthesignalfrommasterthread;
usethesharedloopParams datastructuretocalculatemystaticiterationsandexecuteloop_body forthatrange;
sd_LoopNext():lockloopParams;extractachunktoworkon;unlockloopParams;if(nochunkavailable)waitforbarrier;done=1;
elseexecuteloop_body fortheextractedchunk;
sd_LoopStart():ifIamthemasterthread,- setupadatastructureloopParams,alongwithalock;- signalotherthreadstostart;
else- waitforthesignalfrommasterthread;
- usethesharedloopParams datastructuretocalculatemystaticiterationsandexecuteloop_body forthatrange;
sd_LoopNext():lockloopParams;extractachunktoworkon;unlockloopParams;if(nochunkavailable)waitforbarrier;done=1;
elseexecuteloop_body fortheextractedchunk;
Static/dynamicSchedulingUsingaSharedDataStructure
sd2_LoopStart():ifIamthemasterthread
— setupadatastructureloopParams withloopparameters;— enqueue asingleentrycorrespondingtodynamicrangeofiterations intothread0’sstealqueue;— signalotherthreadstostart;
else— waitforthesignalfrommasterthread;
- UsetheshareddatastructuretocalculatemystaticiterationsandexecuteloopBody forthatrange;
sd2_LoopNext():- nextRange =myQueue.dequeue();if(nextRange ==NULL)nextRange =steal(random_neighbor);if(nextRange !=NULL)L=nextRange->low;U=nextRange->high;if((U-L)>Threshold)- splittherangein2, enqueue theminmystealqueue;
else- executeloopbodyforiterationsL:U;- updatecountofiterationscompletedtosetthe“done”flagwhen done;
AnAlternativeImpl.ofStatic/DynamicStrategyUsingStealQueuesforDynamicIterations
staggered_LoopStart():ifIamthemasterthread
- setupadatastructureloopParams;- signalotherthreadstostart;
else- waitforthesignalfrommasterthread;
- enqueue entriescorrespondingtochunksofmydynamicrangeofiterationsintomythread’sstealqueue;- calculatemystaticiterationsandexecuteloopBody forthatrange;
staggered_LoopNext():nextRange =myQueue.dequeue();if(nextRange ==NULL)nextRange =steal(random_neighbor);if(nextRange !=NULL)- L=nextRange->low;U=nextRange->high;- executeloopbodyformyiterationsL:U;- updatecountofiterationscompletedtosetthe“done”flagwhendone;
AnImplementationofStaggeredStatic/DynamicStrategy
tion. The name of the scheduling strategy and its associated parameters isspecified in the parameter of the macro function. The parameters passedto the macro function are passed to the scheduler functions defined by theuser. This macro-based approach is of limited utility because of its inabilityto use the compiler support for features such as reductions. The approachwe propose eliminates these limitations and leads to concise code.
T0 T1 T2 T3T0 T0 T0 T0T0 T0T0 T1 T1 T1 T1 T1T1T1 T2 T2 T2 T2T2 T2T2
Increasing loop iteration number
T3 T3T3T3T3T3T3
Figure 1: Diagram of the iteration space of a threaded computation regionshowing loop iterations distributed to threads when using the technique ofstaggered mixed static/dynamic scheduling. In the figure, the smaller, redcolored, rectangles correspond to dynamic iterations tentatively assignedto a specific thread.
double static_fraction = 0.5; double constraint =0.2; int
chunk_size = 4;#pragma omp parallel for schedule(user, statdynstaggered:
chunk_size:static_fraction:constraint)for (int i=0; i<n; i++)
sum+= a[i]*b[i];
Figure 2: An illustration of the use of the user-defined scheduler in a dotproduct code parallelized with OpenMP.
We propose the user-defined scheduling scheme to be an adaptation ofthe scheme of using macro-based functions. The schedule kind ’user’ andthe name of the scheduling strategy with the strategy’s parameter value(s),e.g., the name ’staggered’ and the parameter values for the static frac-tion, chunk size and constraint, are specified as parameters of the OpenMP’schedule’ clause. Figure 2 illustrates the proposed API for using a user-defined schedule through the API’s use in the dot product code. The user-defined scheduler named ’staggered’, specified in the second parameter ofthe schedule clause, defines how the work of a loop in the code is enqueuedinto queues or data structures and the strategy of dequeuing (or obtaining)a chunk of loop iterations to execute next using the scheduling strategyillustrated in Figure 1.
When the user specifies a schedule clause with kind ’user’ and a
3
DiscussionofAPI’sdetails• WehavesketchedaboveasyntaxfortheAPIforuser-levelschedulers.• However,weexpectthattheAPI’sdetailswillbeworkedoutwiththecommunity’sconsensus.
• Notethatthereisaprecedentforaddinguser-definedfunctionsinOpenMP standard.• The combinerfunctioninuser-definedreductions.
Summary• Needforexperimentationwithsophisticatedloopschedulingstrategies.• OpenMP communityshoulddiscusshowtoallowflexiblespecificationofsuchstrategiesinauser’scodeandhowtodesignauser-levelschedulerlibrarysothatitcanbeportablyusedwithanyconformingOpenMP implementation.• Supportinguser-definedschedulersinthiswaywillfacilitaterapiddevelopmentofschedulingstrategies.• Ihopetheexpertswilldiscusstheseideasatthisconference.
Acknowledgements:UniversityofSouthernCalifornia/ISI’sTechnicalComputingNewDirectionsprogram.