performance advantages of a data centric computer architecture · 15/12/2016 · memory web...
TRANSCRIPT
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 1
PerformanceAdvantagesofaDataCentricComputer
Architecture
AMSSeminarSeriesNASAAmesResearchCenter
December15,2016
READYFORSOMETHINGREALLYNEW?
CurrentMemory
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 2
Cacheline–Linearaccess
READYFORSOMETHINGREALLYNEW?
CurrentMemory
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 3
Stridedaccess–noadvantagetocaches
READYFORSOMETHINGREALLYNEW?
CurrentMemory
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 4
Unstructuredaccess–noadvantagetocaches
READYFORSOMETHINGREALLYNEW?
CurrentMemory
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 5
Forwardlookinglinkedlistaccess–noadvantagetocaches
READYFORSOMETHINGREALLYNEW?
CurrentMemory
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 6
Memory1 Memory4Memory2 Memory3
Program(very,verysmallcomparedtodata)
Massiveamountsofdataspreadovermanymemories
Largeamountsofdataaremovedtoandfromeachmemoryareatotheprogramthroughaverylimitednetwork
READYFORSOMETHINGREALLYNEW?
IntroducOon
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 7
• IssueswithToday’sArchitectures• EmuMigratoryNearMemoryProcessingArchitecture• ProgrammingEmuusingCilk• IniOalEmuproducts
READYFORSOMETHINGREALLYNEW?
MoOvaOon• Today’sarchitecturesopOmizedforhighlylocalmemoryreferencepaTernswithsignificantreuse– Large,deepcachehierarchies;widememorypaths
• NewapplicaOonslikeBigDataaredifferent– Verylargeunstructureddatastructures– O[enwithnon-tradiOonalsearches:relaOonships
• Result:Bigmismatch!– Scalableperformancerequiresverycomplexprograms– AndsOllresultinlowefficiencies
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 8
READYFORSOMETHINGREALLYNEW?
IntroducingtheEmuArchitecture
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 9
DesignedforIrregularData!!(Thebiggertheproblem,thebeYer)
• DesignedfromgroundupforliTleornolocality• Fullyrandomaccesstoindividual64-bitwordsatfullefficiency
• MulOthreadedcoreshidenearlyalllatency• Lowerenergy–lessdatamovedshorterdistances• Highradixnetworksupportsfastsystem-widePGAS• Compute,memorycapacity,memorybandwidthandso[wareallscalesimultaneously
READYFORSOMETHINGREALLYNEW?
GoingtotheMountain
Insteadofmovingblocksofdataacrossthenetwork,Emumoves(“Migrates”)theprogramcontexttothelocaleofthedataaccessed• Moveprogramcounter,workingregisters,threadstatusword
– ApplicaOoncodereplicatedoneachnodelet,nevermoves• One-waytrip–noreturnneeded• Invisibletoprogrammer–triggeredbyreferencetonon-local
address• LatencycompletelyhiddenifsufficientacOvethreads• WritestransmiTedonnetworkwithoutmigraOng,unless
programmerexplicitlyrequestsmigraOon
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 10
Ifthemountainwon’tcometoyou,youmustgotothemountain-ancientproverb
READYFORSOMETHINGREALLYNEW?
Emu:The“Put-Only”ArchitectureReadingamemorylocaOononadifferentEmunodecausesthecontexttomovetothatnode,insteadofsendingareadrequest.• ThiswinsbigespeciallywhenprogrampaTernisseriesofo[en
brief“visits”towidelydisperseddata• ProcessoruOlizaOonimproved
– ProcessorsneverstallforlongperiodswaiOngforremotereads
• Networkissimplified– Doesn’tneedtosupportroundtrip(read/response)messages
• RemoteWritescanbeperformeddirectlyorviamigraOons,underprogrammer(compiler)control.
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 11
READYFORSOMETHINGREALLYNEW?
HowtheEmuMemoryWorks
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 12
Memory1 Memory4Memory2 Memory3
Program(very,verysmallcomparedtodata)
Modestamountsofdataspreadovermany,manymemories
TheprogramcontextmovesfrommemorytomemoryresulOngin10x+beTerusageofthenetwork
READYFORSOMETHINGREALLYNEW?
• SingleSystem-wideAddressSpace• Sta^onaryCores(SCs):
– ExecuteOperaOngSystem
– ManageIO/FileSystem
– CallorSpawnGossamerThreads
• GossamerCores(GCs)executeGossamerThreadsatNodelets
– PerformlocalcomputaOons&memoryreferences(inc.atomics)
– MigratetootherNodeletsw’oso`wareinvolvement
– SpawnnewThreads– CallSystemServicesonSCs
• ProgrammedinCilk
MemoryWebMigraOngThreadArchitecture
READYFORSOMETHINGREALLYNEW?
GossamerCoreArchitecture• Deeplypipelined,64-waymulOthreadedcore• Accumulator+16generalregisters• RichsuiteofAtomicMemoryOperaOons• SingleSPAWNinstrucOontocreateanewthread• ThreadschedulingisautomaOcandperformedbyhardware
• RELEASEinstrucOonplacescontextinaServiceQueueforprocessingbyStaOonaryCore
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 14
READYFORSOMETHINGREALLYNEW?
Emu2MemoryServer:Densermemoryforhigherbandwidth,ASICforComputenodecanuOlizeallavailablebandwidth Emu3MemoryServer:
Dozensofnodeletsincustommemorystack
ProofofConcept
Emu1MemoryServer:• PracOcalsystemforlargedataproblems• 8NarrowChannelsuOlizeallavailablebandwidth• SolidStateDiskforbulkstorage• Highlyefficient,withmoderatepowerdensity• ScalestomulOpleracks• HighbandwidthGossamercoresusableinfutureversions
NodeImplementaOons
READYFORSOMETHINGREALLYNEW?
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 17
EmuProductIni^alDeliverables
EmuChick• 2.2GUPsperformance• Copyroomenvironment• FCSQ3’16
Emu1MemoryServer• 70GUPsperformance• Serverroomenvironment• FCSQ3’17
Thenodeboardsarethesame!
READYFORSOMETHINGREALLYNEW?
Emu1RackConfiguraOon• 8Chassis/Motherboards
– 256Nodes(32/Tray)– 88-lanePCIexpressGen3slots(2/Tray)
• 24ProDriveRapidIOSwitches96SRIO3.1(6.25GB/s)linksbetweenSupergroupsandCentralSwitch
• 48VDCPowerDistribu^on• N+1RedundantFront-endPowerConverter• AirorBuildingWatercoolingop^ons• Approximately40KW@440VACTriPhase
1/25/17 EMUENHANCEDMEMORYUTILIZATION|EMUPROPRIETARY|COPYRIGHT2015 18
READYFORSOMETHINGREALLYNEW?
EmuSystemCharacterisOcs
Emu1Rack• 16TBDDR4DRAM• 256TBSolidStateDisk• [email protected]/
s• >2TB/sBisecOonBW/rack
Emu2Rack• ~16TBHMCMemory• ~256TBSolidStateDisk• 2048Mem.Channels@20GB/s• >2TB/sBisecOonBW/rack
1/25/17 EMUENHANCEDMEMORYUTILIZATION|EMUPROPRIETARY|COPYRIGHT2015 19
Upto64KNodes/System(256Racks,256Nodes/rack,8+nodelets/node)Airorbuildingwatercooled,~40KW/rackHighradixRapidIONetwork
READYFORSOMETHINGREALLYNEW?
ProgrammingEmu
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 20
• ProgrammedinCilk,atrulyparallelextensionofC– EmucodegeneratorcoupledtoLLVMfrontend– InliningofmanylibraryfuncOonstoleverageEmuinstrucOons
• AddedcapabilitytodefinememoryViewsandplacedatainthoseViews– ShareddataitemsareDynamicSharedObjects(DSOs)thatmaybedefinedoutsidetheprogramsthatmanipulatethemandpersistbeyondprogramcompleOon
– PrivateautomaOcvariablesaredeclarednormallyinCilk.
READYFORSOMETHINGREALLYNEW?
TheEmuCilkLanguage
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 21
EmuCilkextendsCwithafewnewkeywords• Cilk_spawn[<var>=]<func^on>createsnew“child”thread
running<funcOon>whileparentthreadconOnuesasynchronously• Cilk_synccausescurrentfuncOontowaitunOlallitschildrenhave
completed• Cilk_for(<iterator>){<itera^on_code>}[grainsize=<grainsize>]
createsagroupofnewthreads(fromCilkPlus)• TerminaOonofafuncOonalwaysperformsanimplicitsync• Nosupport(atpresent)forINLET,CilkPlusvectoroperaOons,C++
– EventuallyplantoaddC++funcOonality
READYFORSOMETHINGREALLYNEW?
SystemSo[ware
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 22
• LINUXrunsontheStaOonaryCores(SCs).• OSlaunchesmain() userprogramonaGossamerCore(GC)• main() spawnsdescendantsthatexecuteinparallelandmigrate
throughoutsystemasneeded.• RunOmeexecutesprimarilyontheSCsandhandlesservice
requestsfromthethreadsrunningontheGCs.– MemoryallocaOonandrelease,I/O,excepOonhandling,and
performancemonitoring.– AfewspecialsystemthreadsrunonGossamerCorestoprovidereal-Ome
systemmanagement,suchasdistribuOonofcredits
• Threadsreturntomain() uponcompleOon,whichthenreturnstotheOS.
READYFORSOMETHINGREALLYNEW?
WheredoesEmushine?• BigData:thebiggerthebeTer.Datathatlacksregularstructure
worksespeciallywell• GraphAlgorithms:examplesincludeNon-ObviousRelaOonships,
ParOcleSwarmandNetworkOpOmizaOon• Searching,Sor^ng,andPaYernMatchingProblems:suchasGene
Sequencematching,Protein-LigandDockingandMachineLearning• SparseMatrixProblems:examplesincludetheHPCGbenchmark
andLinearProgrammingThesewillallperformfarbeTeronEmuthanonconvenOonalcomputers
1/25/17 EMUENHANCEDMEMORYUTILIZATION|EMUPROPRIETARY|COPYRIGHT2015 23
READYFORSOMETHINGREALLYNEW?
MarketApplicaOonAreas:TheseareareaswheretheEmuArchitecturewillshine
1/25/17 EMUENHANCEDMEMORYUTILIZATION|EMUPROPRIETARY|COPYRIGHT2016 24
Numerical Search
• ComputaOonalFluidDynamics • GraphAnalysis
• ReservoirModeling • NetworkOpOmizaOon
• Weather&ClimateModeling • RegressionAnalysis
• Linear&DynamicProgramming • Risk&FraudAnalysis
• QuantumPhysics • Map/ReduceAcceleraOon
• QuantumChemistry • RecordManagement&StaOsOcs
• MolecularDynamics • MachineLearning
• Many-bodyProblems • ImageUnderstanding
• Signal&ImageProcessing • GeneSequenceAnalysis
READYFORSOMETHINGREALLYNEW?
• WilldeliverfirstEmu“Chicks”inFall2016• Designportsdirectlytoracklevel&above• SignificantlybeTerscalabilitythancommercialsystemsonGUPS,BFS,andsimilarapplicaOons
• Ongoingdevelopmentoflibrariesforcomprehensiveprogrammingenvironment
• ExpandingapplicaOonbase• InterestedinresearchpartnershipstoexplorenewlibraryandapplicaOonareas
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 25
CurrentStatus
READYFORSOMETHINGREALLYNEW?
SomeEarlyComparisons
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 26
READYFORSOMETHINGREALLYNEW?
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
0.01 0.1 1 10 100
TotalD
ataMem
oryCh
anne
lsperSq.Ft.
PeakAggregateMemoryAccessesperSq.Ft.(G/s)Emu SM TCDM LCDM
MemoryChannelComparison
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 27
Emu1 Emu2
Emu3
Emu1Chick
AllpointsnormalizedtoEmu1chickat(1,1)
MoreMemoryChannels=>moreopportuniOesforparallelismMoreaccesses/sec=>beTerabilitytoaccessrandomdata
SM:sharedmemoryTCDM:TightlyCoupledDistributedMemoryLCDM:LooselyCoupledDistributedMemory
READYFORSOMETHINGREALLYNEW?
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
0.01 0.1 1 10 100
PeakAggregateInjectionRa
teperSq.Ft.
(GB/s)
PeakAggregateMemoryAccessesperSq.Ft.(G/s)Emu SM TCDM LCDM
InjecOonBandwidth&AccessRate
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 28
Emu1Emu2
Emu3
Emu1Chick
AllpointsnormalizedtoEmu1chickat(1,1)
MoreInjecOonrate=>moreabilitytocommunicateremotelyMoreaccesses/sec=>beTerabilitytoaccessrandomdata
SM:sharedmemoryTCDM:TightlyCoupledDistributedMemoryLCDM:LooselyCoupledDistributedMemory
READYFORSOMETHINGREALLYNEW?
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
0.01 0.1 1 10 100
InjectionB/WperW
att(MB/s/W)
Accesses/SecperWatt(M/s/W)eqvttoAccesses/nJEmu SM TCDM LCDM
SystemPerWaTComparison
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 29
Emu1Emu2
Emu3
Emu1Chick
AllpointsnormalizedtoEmu1chickat(1,1)
Moreperformance/waT=>loweroperaOngcosts
SM:sharedmemoryTCDM:TightlyCoupledDistributedMemoryLCDM:LooselyCoupledDistributedMemory
READYFORSOMETHINGREALLYNEW?
GUPS:GlobalUpdatesperSecond
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 30
Emu1
Emu2
Emu3
AllpointsnormalizedtoEmu1at(1,1)
SinglerackofEmu1wouldrank8thtoday.P775(#1)is21racks.BG/Q(#3)is48racks.KComputer(#2,4,7)is219-864racks.
READYFORSOMETHINGREALLYNEW?
BreadthFirstSearch• CoreofGRAPH500rankings• Startwitharoot,findreachableverOces• Metric:TEPS:TraversedEdges/sec• Performanceissues
• Massivedatasets• Veryirregularaccess• Verychallengingloadbalancing
Scale=log2(#verOces)
Level Scale SizeVertices(Billion) TB
Bytes/Vertex
10 26 Toy 0.1 0.02 281.804811 29 Mini 0.5 0.14 281.395212 32 Small 4.3 1.1 281.47213 36 Medium 68.7 17.6 281.475214 39 Large 549.8 141 281.47515 42 Huge 4398.0 1,126 281.475
Average 281.5162
GRAPH500GraphSizes
1/25/17 31EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016
20
9
13
5
7
8e0 e1
e2e3e4
e5
e6
e7e8
Startingat1:1,0,3,2,9,5
READYFORSOMETHINGREALLYNEW?
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
1 10 100 1,000 10,000GTEPS
CoresinNode
MostRecentGRAPH500Data
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 32
• GoodscalabilityformulO-nodeTCDMsystems
• ButsinglenodeSMsystemsfarhigherrela^veperformance
– Need~100nodesofbestTCDMtomatch
• “Singlenode”systemsactually“SharedMemory”mulO-coresystems
• Butperformanceflata[er~10cores• Today’sSharedMemorysystems
inherentlymoreefficientbutdon’tscale• Extra“cores”addmemory,not
performance
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1 10 100 1,000 10,000 100,000
GTEPS
Nodes
ExpansionofSingleNodeRegionSingleRack
100’sofracks
READYFORSOMETHINGREALLYNEW?
0.1
1
10
100
1 10 100 1000 10000GTEPS
NodeletsScale22GTEPS Scale23GTEPS ProjectedTEPS
EarlyProjecOonsforEmu• Emuisasharedmemorysystem• ButwithFARbeYerscalingthan
today’sSMsystems• Andthisiswithverysmall(scale
22)problems• MorerealisOcsizes(scale30for
chick)couldincreaseGTEPSbyalmost10X– IniOalscale23resultspointinthisdirecOon
• MorecompletesimulaOonsinprogress
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 33
Emu1simresults
StayTuned!
READYFORSOMETHINGREALLYNEW?
RealWorldCommercialChallengeProblem(FromLexisNexis)
• 40+TBofRawData• Periodicallycleanup&combineto
4-7TB• Weekly“BoiltheOcean”to
precomputeanswerstoallstandardqueries
– DoesXhavefinancialdifficul^es?– DoesXhavelegalproblems?– HasXhadsignificantdriving
problems?– WhohassharedaddresseswithZ?– Whohassharedpropertyownership
withZ?
AutoInsuranceCo:“TellmeaboutgivingautopolicytoJaneDoe”in<0.1sec
“JaneDoehasnoindicatorsButshehassharedmulOpleaddresseswithJoeScofflawWhohasthefollowingnegaOveindicators….”
Lookupanswerstoprecomputedqueriesfor“JaneDoe”andcombine
RelaOonships
READYFORSOMETHINGREALLYNEW?
TradiOonalApproach:RunawayIntermediateData14.2B recs325 B/rec4.6TB
Project14.2B recs100+ B/rec1.5TB
Join onAddress
1.6T recs200+ B/rec300+TB
Sort & RemoveDuplicates
1.5T recs30B/rec45TB
• Compute adr hash• Compare lnames• Init score to 3• Project
1.6T recs30 B/rec48+TB
Group byID pairs &Sum scores,Lname_match
{(ID1, ID2, adrhash, score,lname_match)}
{(ID, lname, adr)}
Hash ID1,2& Distribute
12B recs16B/rec200GB
Select onScore &Lname_match
1.2T recs16B/rec20TB
{(ID1, ID2, score, lname_match)}
800M distinct IDs
400M distinct IDs
Send between nodes via TCP/IP datagrams
“h”“t”
“J”
“D”
READYFORSOMETHINGREALLYNEW?
HeavyweightUpgrades
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
1 2 3 4 5 6 7 8 9
ResourcesU
sed/no
de(sec)
Step#Disk CPU Memory Network
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
1 2 3 4 5 6 7 8 9
ResourcesU
sed/no
de(sec)
Step#Disk CPU Memory Network
1.E-02
1.E-01
1.E+00
1.E+01
1.E+02
1.E+03
1 2 3 4 5 6 7 8 9
ResourcesU
sed/no
de(sec)
Step#Disk CPU Memory Network
2xIBFDR
SATA
SSD
RAMDISK
22Cores,
4xDD
R4
1 2 3 4 5 6 7 8 975.2 75.2 75.2 75.2 75.2 580.5 580.5 360.0 362.9 1025.8 1.00
X 75.2 75.2 75.2 75.2 75.2 250.2 360.0 360.0 362.9 805.3 1.27X 28.8 28.8 28.8 28.8 28.8 580.5 580.5 144.0 145.2 757.4 1.35
X 75.2 75.2 75.2 75.2 75.2 580.5 580.5 25.0 50.0 707.7 1.45X 75.2 75.2 75.2 75.2 75.2 580.5 580.5 360.0 362.9 1025.8 1.00
X X X 28.8 28.8 28.8 28.8 28.8 250.2 275.2 25.0 50.0 355.1 2.89X X X X 28.8 28.8 28.8 28.8 28.8 74.4 81.9 7.4 14.9 126.3 8.12
Enhancements
2012Baseline
TimeperStepTotalTime
Speedu
p
2012Baseline UpgradeDisk,NetworkUseRAMDISK2.9XSpeedup
AlsoUpgradeProcessor8.1XSpeedup
AllaresOll10RackSystems
READYFORSOMETHINGREALLYNEW?
ComparisonBasedonAnalyOcModel
0.1
1
10
100
1000
0 1 2 3 4 5 6 7 8 9 10
Speedu
pover2012Ba
selin
e
RacksHeavyWeight Lightweight Next-GenCompute MemoryWed
MW1
MW2
MW3
Baseline
Upgrades
READYFORSOMETHINGREALLYNEW?
Summary
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 38
• Supportsenormousproblemsin-memory
• Supportshigh-ratereal-Omeupdatestothedatabase
• Executecache-breakerappsfarmoreeffecOvelythantoday
• MakesbeTeruseofsystembandwidththanconvenOonalcomputers,resulOnginhigherperformanceandlowerenergyuOlizaOon
• Offersunprecedentedlevelsofscalableparallelisminastructurethatcanbeunderstoodandexploited
• ProgrammingenvironmentiseffecOveandreasonablyfamiliartothoseusedtostandardsystems
READYFORSOMETHINGREALLYNEW?
RelevancetoSpace• Growthinsparse&irregularinonboardapps• NaturalintegraOoninto3D• Lowerenergyperop• Spawnscanbeextendedtointerfacewithspecializedacceleratorcores
• Cilkprovidessimplerparallelprogramming• MemoryopscanbearchitectedtoincludeRAID-likeredundancy
• Address-basedmigraOonsenablesimplefailurerecovery
1/25/17 EMUENHANCEDMEMORYUTILIZATION|COPYRIGHT2016 39