Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP
DARMA
Janine C. Bennett, Jonathan Lifflander, David S. Hollman, Jeremiah Wilke, Hemanth Kolla, Aram Markosyan, Nicole Slattengren, Robert L. Clay (PM)
PSAAP-WEST, February 22, 2017
Unclassified SAND2017-1430C
What is DARMA?
DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.
It provides a set of abstractions to facilitate the expression of tasking that map to a variety of underlying AMT runtime system technologies.
Sandia’s ATDM program is using DARMA to inform its technical roadmap for next generation codes.
2015 study to assess leading AMT runtimes led to DARMA
- Broad survey of many AMT runtime systems
- Deep dive on Charm++, Legion, Uintah
- Programmability: Does this runtime enable efficient expression of ATDM workloads?
- Performance: How performant is this runtime for our workloads on current platforms, and how well suited is this runtime to address future architecture challenges?
- Mutability: What is the ease of adopting this runtime and modifying it to suit our code needs?
Aim: inform Sandia’s technical roadmap for next generation codes
2015 study to assess leading AMT runtimes led to DARMA
- Conclusions
  - AMT systems show great promise
  - Gaps in requirements for Sandia applications
  - No common user-level APIs
  - Need for best practices and standards
- Survey recommendations led to DARMA
  - C++ abstraction layer for AMT runtimes
  - Requirements driven by Sandia ATDM applications
  - A single user-level API
  - Support multiple AMT runtimes to begin identification of best practices
Mapping to a variety of AMT runtime system technologies
DARMA provides a unified API to application developers for expressing tasks
[Figure: layered software stack. Layers: Application, DARMA, Runtime, OS/Hardware. Components: Front End API (Application User), Translation Layer, Back End API (Specification for Runtime), Glue Code (Specific to each runtime). Annotation, shown twice: Common API across runtimes.]
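The layering on this slide can be sketched in plain C++. This is a minimal illustration under stated assumptions, not DARMA's actual back end specification: `Backend`, `DeferredBackend`, `EagerBackend`, the two-argument `create_work`, and `run_app` are hypothetical names invented here. The point is only that application code written against one front end call can run unchanged on two different pieces of glue code.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical back end specification: the only interface the front end uses.
class Backend {
public:
  virtual ~Backend() = default;
  virtual void register_task(std::function<void()> task) = 0;
  virtual void run_all() = 0;
};

// Hypothetical glue code for a runtime that defers tasks until run_all().
class DeferredBackend : public Backend {
  std::vector<std::function<void()>> pending_;
public:
  void register_task(std::function<void()> task) override {
    assert(task);  // an empty task would be a front end bug
    pending_.push_back(std::move(task));
  }
  void run_all() override {
    for (auto& task : pending_) task();
    pending_.clear();
  }
};

// Hypothetical glue code for a runtime that runs each task as it is registered.
class EagerBackend : public Backend {
public:
  void register_task(std::function<void()> task) override {
    assert(task);
    task();
  }
  void run_all() override {}
};

// Hypothetical front end call: the translation step from user code to the
// back end interface.
template <typename F>
void create_work(Backend& backend, F&& work) {
  backend.register_task(std::function<void()>(std::forward<F>(work)));
}

// Application code, written once against the front end only.
std::vector<int> run_app(Backend& backend) {
  auto log = std::make_shared<std::vector<int>>();
  create_work(backend, [log] { log->push_back(1); });
  create_work(backend, [log] { log->push_back(2); });
  backend.run_all();
  return *log;
}
```

Calling `run_app` with a `DeferredBackend` or an `EagerBackend` yields the same log, because both stand-ins preserve program order for these two tasks.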
Application code is translated into a series of back end API calls to an AMT runtime
Not all runtimes provide the same functionality
Challenge: design a back end API that maps to a variety of runtimes
Considerations when developing a back end API that maps to a variety of runtimes
- AMT runtimes often operate with a directed acyclic graph (DAG)
  - Captures relationships between application data and inter-dependent tasks
- DAGs can be annotated to capture additional information
  - Tasks’ read/write usage of data
  - Task needs a subset of data
- Additional information enables runtime to reason more completely about
  - When and where to execute a task
  - Whether to load balance
- Existing runtimes leverage DAGs with varying degrees of annotation
[Figure: example DAG with data nodes d1 to d7 and task nodes t1 to t5; one annotation reads "subset reads".]
DARMA captures data-task dependency information and the runtime builds and executes the DAG
Captures data-task dependency information
Runtime controls construction and execution of the DAG
Abstractions that facilitate the expression of tasking
DARMA front end abstractions for data and tasks are co-designed with Sandia ATDM application scientists
Provide abstractions to simplify capturing of data-task dependencies
DARMA Data Model
How are data collections/data structures described?
- Asynchronous smart pointers wrap application data
- Track meta-data used to build and annotate the DAG
  - Currently permissions information
  - Subsetting information under development
- Enable extraction of parallelism in a data-race-free manner
How are data partitioning and distribution expressed?
- There is an explicit, hierarchical, logical decomposition of data
- AccessHandle<T>
  - Does not span multiple memory spaces
  - Must be serialized to be transferred between memory spaces
- AccessHandleCollection<T, R>
  - Expresses a collection of data
  - Can be mapped across memory spaces in a scalable manner
- Distribution of data is up to individual back end runtime
DARMA Control Model
How is parallelism achieved?
- create_work
  - A task that doesn’t span multiple execution spaces
  - Sequential semantics: the order and manner (e.g., read, write) in which data (AccessHandle) is used determines what tasks may be run in parallel
- create_concurrent_work
  - Scalable abstraction to launch across distributed systems
  - A collection of tasks that must make simultaneous forward progress
  - Sequential semantics supported across different task collections based on order and manner of AccessHandleCollection usage
How is synchronization expressed?
- DARMA does not provide explicit temporal synchronization abstractions
- DARMA does provide data coordination abstractions
  - publish/fetch semantics between participants in a task collection
  - Asynchronous collectives between participants in a task collection
Using DARMA to inform Sandia’s technical roadmap
Currently there are three back ends in various stages of development
[Figure: the DARMA software stack with three back ends: Charm++, Threads, HPX (in progress). Labels on the back ends, as extracted: fully distributed; development tool; prototype.]
2017 study: Explore programmability and performance of the DARMA approach in the context of ATDM codes
Electromagnetic Plasma Particle-in-cell Kernels
Multiscale Proxy
Multilevel Monte Carlo Uncertainty Quantification Proxy
- Kernels and proxies
  - Form basis for programmability assessments
  - Will be used to explore performance characteristics of the DARMA-Charm++ back end
- Simple benchmarks enable studies on
  - Task granularity
  - Overlap of communication and computation
  - Runtime-managed load balancing
- These early results are being used to identify and address bottlenecks in the DARMA-Charm++ back end in preparation for studies with kernels/proxies
DARMA’s programming model enables runtime-managed, measurement-based load balancing
Initial strong scaling results with DARMA-Charm++ back end are promising
2D Newtonian particle simulation that starts highly imbalanced.
[Figure: percent utilization (0% to 100%) versus time (0 s to 165 s), starting from an initial load imbalance, with four load-balancing (LB) steps marked. Annotated iteration times, in extraction order: 1.74 s/iter before LB, 1.66 s/iter after LB, 1.70 s/iter before LB, 1.47 s/iter before LB, 1.54 s/iter after LB, 1.41 s/iter after LB.]
The Charm++ load balancer incrementally runs as particles migrate and the work distribution changes.
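To make "measurement-based" concrete, here is a minimal sketch of one way a runtime could turn measured per-task costs into a new task-to-processor assignment. It uses a greedy longest-processing-time heuristic chosen for illustration; the slides do not say which Charm++ strategy was used, and `greedy_balance` and `max_load` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Assign each task to a processor from its measured cost, heaviest task first,
// always onto the currently least-loaded processor.
// Returns owner[i] = processor that runs task i.
std::vector<std::size_t> greedy_balance(const std::vector<double>& measured_cost,
                                        std::size_t num_procs) {
  assert(num_procs > 0);
  std::vector<std::size_t> order(measured_cost.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return measured_cost[a] > measured_cost[b];
                   });

  // Min-heap of (current load, processor id).
  using Entry = std::pair<double, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
  for (std::size_t p = 0; p < num_procs; ++p) procs.push(Entry(0.0, p));

  std::vector<std::size_t> owner(measured_cost.size(), 0);
  for (std::size_t task : order) {
    Entry least = procs.top();
    procs.pop();
    owner[task] = least.second;
    least.first += measured_cost[task];
    procs.push(least);
  }
  return owner;
}

// Largest per-processor load: a lower bound on the time of one iteration.
double max_load(const std::vector<double>& measured_cost,
                const std::vector<std::size_t>& owner, std::size_t num_procs) {
  assert(num_procs > 0);
  std::vector<double> load(num_procs, 0.0);
  for (std::size_t i = 0; i < owner.size(); ++i) load[owner[i]] += measured_cost[i];
  return *std::max_element(load.begin(), load.end());
}
```

With costs {8, 4, 1, 1, 1, 1} on two processors, placing every task on processor 0 gives a maximum load of 16, while the greedy assignment gives 8 on each. An incremental balancer would rerun a step like this as the measured costs drift.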
Summary: DARMA seeks to accelerate discovery of best practices
- Application developers
  - Use a unified interface to explore a variety of different runtime system technologies
  - Directly inform DARMA’s user-level API via co-design requirements/feedback
- Systems software developers
  - Acquire a synthesized set of requirements via the back end specification
  - Directly inform back end specification via co-design feedback
  - Can experiment with proxy applications written in DARMA
- Sandia ATDM is using DARMA to inform its technology roadmap in the context of AMT runtime systems
Backup Slides
Asynchronous smart pointers enable extraction of parallelism in a data-race-free manner
darma::AccessHandle<T> enforces sequential semantics: it uses the order in which data is accessed in your program and how it is accessed (read/write/etc.) to automatically extract parallelism
Permission Level: None, Read, Write, Reduce
Permission Type:
- Scheduling: A task with scheduling permission can create deferred tasks that can access the data at the specified permission level.
- Immediate: A task with immediate permission can dereference the AccessHandle<T> and use it according to the permission level.
Tasks are annotated in the code via a lambda or functor interface
Tasks can be recursively nested within each other to generate more subtasks
C++ Lambdas
darma::create_work([=]{ /* do some work */ });
This is the C++11 syntax for writing an anonymous function that captures variables by value.
C++ Functors
struct MyFun {
  void operator()(...) { /* do some work */ }
};
darma::create_work<MyFun>(...)
Functors are for larger blocks of code that may be reused and migrated by the backend to another memory space.
Example: Putting tasks and data together
Example Program
AccessHandle<int> my_data;
darma::create_work([=]{ my_data.set_value(29); });
darma::create_work(reads(my_data), [=]{ cout << my_data.get_value(); });
darma::create_work(reads(my_data), [=]{ cout << my_data.get_value(); });
darma::create_work([=]{ my_data.set_value(31); });
[Figure: DAG (Directed Acyclic Graph) derived from the program by sequential semantics: Modify my_data, then two Read my_data tasks, then Modify my_data.]
The two read tasks are concurrent and can be run in parallel by a DARMA back end!
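The way sequential semantics turns program order into a DAG can be sketched without DARMA itself. The code below is an illustration under assumptions, not DARMA's implementation: `Access`, `extract_dependencies`, and `waves` are hypothetical names, and only a single datum is tracked. A read waits for the most recent modifier; a modify waits for every reader since the last modifier, or for that modifier if there were no readers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Access { Read, Modify };

// For tasks listed in program order that all touch one datum, return
// deps[i] = indices of earlier tasks that task i must wait for.
std::vector<std::vector<std::size_t>> extract_dependencies(
    const std::vector<Access>& program) {
  std::vector<std::vector<std::size_t>> deps(program.size());
  bool has_writer = false;
  std::size_t last_writer = 0;
  std::vector<std::size_t> readers_since_write;
  for (std::size_t i = 0; i < program.size(); ++i) {
    if (program[i] == Access::Read) {
      // Read-after-write: wait only for the most recent modifier.
      if (has_writer) deps[i].push_back(last_writer);
      readers_since_write.push_back(i);
    } else {
      // Write-after-read: wait for every reader of the previous value.
      // Write-after-write: with no readers, wait for the previous modifier.
      if (!readers_since_write.empty()) {
        deps[i] = readers_since_write;
      } else if (has_writer) {
        deps[i].push_back(last_writer);
      }
      readers_since_write.clear();
      has_writer = true;
      last_writer = i;
    }
  }
  return deps;
}

// Earliest "wave" in which each task can run; tasks that share a wave have no
// path between them and may run in parallel.
std::vector<std::size_t> waves(
    const std::vector<std::vector<std::size_t>>& deps) {
  std::vector<std::size_t> wave(deps.size(), 0);
  for (std::size_t i = 0; i < deps.size(); ++i) {
    for (std::size_t d : deps[i]) {
      assert(d < i);  // dependencies always point to earlier tasks
      wave[i] = std::max(wave[i], wave[d] + 1);
    }
  }
  return wave;
}
```

For the four-task program on this slide (Modify, Read, Read, Modify) the two reads land in the same wave, which matches the claim that they can run concurrently.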
Smart pointer collections can be mapped across memory spaces in a scalable manner
AccessHandleCollection<T, R> is an extension to AccessHandle<T> that expresses a collection of data
AccessHandleCollection<vector<double>, Range1D> mycol =
  darma::initial_access_collection(index_range = Range1D(10));
Range1D is a potentially user-defined (or domain-specific) index range, a C++ object that describes the extents of the collection along with providing a corresponding index class for accessing an element.
Every element in the collection contains a vector<double>
Each indexed element is an AccessHandle<vector<double>>
[Figure: mycol drawn as a row of elements: Index 0, Index 1, ..., Index 9.]
Tasks can be grouped into collections that make concurrent forward progress together
Task collections are a scalable abstraction to efficiently launch communicating tasks across large-scale distributed systems
create_concurrent_work<MyFun>(index_range = Range1D(5));
This call to create_concurrent_work launches a set of tasks, the size of which is specified by an index range, Range1D, that is passed as an argument. Each element in the task collection is passed an Index1D within the range, used by the programmer to express communication patterns across elements in the collection.
struct MyFun {
  void operator()(Index1D i) {
    int me = i.value;
    /* do some work */
  }
};
Putting task collections and data collections together
Example Program
auto mycol = initial_access_collection(index_range = Range1D(10));
create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));
create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));
[Figure: generated DAG under sequential semantics: two successive Modify mycol task collections, each expanded by scalable graph refinement into per-index Modify mycol tasks for Index 0, Index 1, ..., Index 9.]
A mapping must exist between the data index ranges and task index range. In this case, since the three ranges are identical in size and type, a one-to-one identity map is automatically applied.
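The role of the index mapping can be illustrated with a small stand-alone sketch. Everything here is hypothetical and simplified: this `Range1D` is a one-field stand-in for the DARMA type of the same name, `IndexMap`, `default_map`, and `concurrent_modify` are invented for the example, and the "tasks" run in a plain sequential loop rather than concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an index range: only its extent is modeled.
struct Range1D {
  std::size_t size;
};

// Maps a task index to the data index that task may access locally.
using IndexMap = std::function<std::size_t(std::size_t)>;

// The identity map is only meaningful when both ranges have the same extent.
IndexMap default_map(Range1D tasks, Range1D data) {
  assert(tasks.size == data.size && "identity map needs equal extents");
  return [](std::size_t task_index) { return task_index; };
}

// Launch one "task" per index; each task modifies the element it maps to.
// A real runtime could execute these concurrently; this loop is sequential.
void concurrent_modify(std::vector<int>& data, Range1D tasks,
                       const IndexMap& map,
                       const std::function<void(int&, std::size_t)>& body) {
  for (std::size_t i = 0; i < tasks.size; ++i) {
    body(data[map(i)], i);
  }
}
```

Running `concurrent_modify` twice over a ten-element vector mirrors the two task collections in the example program: the second launch observes what the first wrote to the same index.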
Tasks in different execution streams can communicate via publish/fetch semantics
Execution Stream A
AccessHandle<int> my_data = initial_access<int>("my_key");
darma::create_work([=]{ my_data.set_value(29); });
my_data.publish(version="a");
darma::create_work([=]{ my_data.set_value(31); });
Execution Stream B
AccessHandle<int> other_data = read_access("my_key", version="a");
darma::create_work([=]{ cout << other_data.get_value(); });
other_data = nullptr;
[Figure: Potential DAG 1, with nodes Modify my_data, Modify my_data, Read my_data, Copy my_data, Read my_data.]
If the read_access is on another node, the data might be sent across the network.
[Figure: Potential DAG 2 for the same two execution streams, with nodes Modify my_data, Modify my_data, Read my_data, Read my_data and no copy node.]
If the read_access is on the same node, a back end runtime can generate an alternative DAG without the transfer.
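A toy model of publish/fetch, written as a hedged sketch: it assumes, as both potential DAGs suggest, that a reader of version "a" observes the value as it was at publish time, even if the publisher modifies the data afterwards. `PublishStore` and its methods are hypothetical; in DARMA the fetch is deferred by the runtime, whereas this stand-in simply returns `nullptr` when nothing has been published yet.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a back end's store of published data.
class PublishStore {
  std::map<std::pair<std::string, std::string>, int> snapshots_;

public:
  // Record the value visible under (key, version) at publish time.
  void publish(const std::string& key, const std::string& version, int value) {
    assert(!key.empty());
    snapshots_[std::make_pair(key, version)] = value;
  }

  // Pointer to the published snapshot, or nullptr if none exists yet.
  const int* fetch(const std::string& key, const std::string& version) const {
    auto it = snapshots_.find(std::make_pair(key, version));
    return it == snapshots_.end() ? nullptr : &it->second;
  }
};
```

Publishing 29 under ("my_key", "a") and then changing the local value to 31 leaves the fetched snapshot at 29. Whether the snapshot is a network transfer, a local copy, or no copy at all is the back end's decision, which is the difference between the two potential DAGs.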
A mapping between data and task collections determines access permissions between tasks and data
auto mycol = initial_access_collection<int>(index_range = Range1D(10));
create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));
struct MyFun {
  void operator()(Index1D i, AccessHandleCollection<int> col) {
    int me = i.value, mx = i.max_value;
    auto my_elm = col[i].local_access();
    my_elm.publish(version="x");
    auto neighbor = me-1 < 0 ? mx : me-1;
    auto other_elm = col[neighbor].read_access(version="x");
    create_work([=]{
      cout << "neighbor = " << other_elm.get_value() << endl;
    });
  }
};
Identity map between these data and tasks. Thus, index i has local access to data index i.
Any other index must be read using read_access, which actually may be a remote or local operation depending on the backend mapping, but is always a deferred operation.
Sandia ATDM applications drive requirements, and developers play an active role in informing the front end API
- Application feature requests
  - Sequential semantics
  - MPI interoperability
  - Node-level performance portability layer interoperability (Kokkos)
  - Collectives
  - Runtime-enabled load-balancing schemes
A latency-intolerant benchmark highlights overheads as grain size decreases
At this scale, each iteration is less than 5 ms long.
This benchmark has tight synchronization every iteration.
Increased asynchrony in the application enables the runtime to overlap communication and computation
Scalability improves with asynchronous iterations. Requires only minor changes to application code.
DARMA’s programming model enables runtime-managed, measurement-based load balancing
Load balancing does not require changes to the application code.
Stencil benchmark is not latency tolerant and highlights runtime overheads when task granularity is small
At this scale, each iteration is less than 5 ms long.
Increased asynchrony in application enables runtime to overlap communication and computation
Scalability improves with asynchronous iterations. Requires only minor changes to DARMA code.