Performance Advantages of a Data Centric Computer Architecture
AMS Seminar Series, NASA Ames Research Center
December 15, 2016
Emu: Enhanced Memory Utilization | Copyright 2016 | Slides dated 1/25/17



Ready for something really new?

Current Memory

• Cache line: linear access
• Strided access: no advantage to caches
• Unstructured access: no advantage to caches
• Forward-looking linked-list access: no advantage to caches
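The four access patterns above can be compared with a small counting model. This is only a sketch: the 64-byte cache line is an assumption, the 8-byte word follows the deck's 64-bit words, and the function names are made up for illustration. It measures what fraction of the bytes fetched into cache the program actually uses.

```python
import random

LINE_BYTES = 64   # assumed cache-line size
WORD_BYTES = 8    # 64-bit words

def lines_touched(addresses):
    """Count the distinct cache lines referenced by a list of byte addresses."""
    return len({addr // LINE_BYTES for addr in addresses})

def useful_fraction(addresses):
    """Fraction of the bytes fetched (whole lines) that the program used."""
    fetched = lines_touched(addresses) * LINE_BYTES
    used = len(set(addresses)) * WORD_BYTES
    return used / fetched

n = 1024
rng = random.Random(0)

# Linear: consecutive words, so every byte of every fetched line is used.
linear = [i * WORD_BYTES for i in range(n)]

# Strided: one word per line, so 7/8 of each fetched line is wasted.
strided = [i * LINE_BYTES for i in range(n)]

# Unstructured: word-aligned addresses drawn at random from a 1 GiB range.
unstructured = [rng.randrange(1 << 30) // WORD_BYTES * WORD_BYTES for _ in range(n)]

# Linked list: nodes scattered one per line; each address is only known after
# the previous node has been loaded, so the accesses are also serialized.
slots = list(range(n))
rng.shuffle(slots)
linked = [slot * LINE_BYTES for slot in slots]
```

In this model only the linear pattern uses the whole line; the other three use roughly one word in eight of what the cache fetched.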

Current Memory

Memory 1, Memory 2, Memory 3, Memory 4

• Program (very, very small compared to data)
• Massive amounts of data spread over many memories
• Large amounts of data are moved to and from each memory area to the program through a very limited network

Introduction

• Issues with Today's Architectures
• Emu Migratory Near Memory Processing Architecture
• Programming Emu using Cilk
• Initial Emu products

Motivation

• Today's architectures optimized for highly local memory reference patterns with significant reuse
  – Large, deep cache hierarchies; wide memory paths
• New applications like Big Data are different
  – Very large unstructured data structures
  – Often with non-traditional searches: relationships
• Result: Big mismatch!
  – Scalable performance requires very complex programs
  – And still result in low efficiencies

Introducing the Emu Architecture

Designed for Irregular Data!! (The bigger the problem, the better)

• Designed from ground up for little or no locality
• Fully random access to individual 64-bit words at full efficiency
• Multithreaded cores hide nearly all latency
• Lower energy: less data moved shorter distances
• High radix network supports fast system-wide PGAS
• Compute, memory capacity, memory bandwidth and software all scale simultaneously

Going to the Mountain

"If the mountain won't come to you, you must go to the mountain" - ancient proverb

Instead of moving blocks of data across the network, Emu moves ("migrates") the program context to the locale of the data accessed
• Move program counter, working registers, thread status word
  – Application code replicated on each nodelet, never moves
• One-way trip: no return needed
• Invisible to programmer: triggered by reference to non-local address
• Latency completely hidden if sufficient active threads
• Writes transmitted on network without migrating, unless programmer explicitly requests migration
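The migration rule above can be sketched as a toy model. Everything named here is hypothetical: the block-wise address map, the tiny memory sizes and the `ThreadContext` fields are illustrative stand-ins, not Emu's actual layout. The point is only that a read of a non-local address moves the context rather than the data.

```python
from dataclasses import dataclass, field

NODELETS = 4            # assumed system size for the sketch
WORDS_PER_NODELET = 8   # assumed (tiny) memory per nodelet

def home_nodelet(addr):
    """Hypothetical address map: contiguous blocks of words per nodelet."""
    return addr // WORDS_PER_NODELET

@dataclass
class ThreadContext:
    """The state that travels: program counter, registers, status word."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 16)
    status: int = 0
    nodelet: int = 0      # where the thread is currently executing
    migrations: int = 0   # one-way hops taken so far

def load(ctx, memory, addr):
    """Read one word; a non-local address first moves the context to the data."""
    target = home_nodelet(addr)
    if target != ctx.nodelet:
        ctx.nodelet = target   # one-way trip, no return message
        ctx.migrations += 1
    return memory[target][addr % WORDS_PER_NODELET]

# Each nodelet holds its own words; values encode (nodelet, offset).
memory = [[n * 100 + i for i in range(WORDS_PER_NODELET)] for n in range(NODELETS)]
ctx = ThreadContext()
a = load(ctx, memory, 3)    # local to nodelet 0, no migration
b = load(ctx, memory, 17)   # lives on nodelet 2, context migrates
c = load(ctx, memory, 18)   # already on nodelet 2, no further hop
```

After the second load the thread stays on nodelet 2, so nearby follow-up accesses are local.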

Emu: The "Put-Only" Architecture

Reading a memory location on a different Emu node causes the context to move to that node, instead of sending a read request.
• This wins big especially when program pattern is series of often brief "visits" to widely dispersed data
• Processor utilization improved
  – Processors never stall for long periods waiting for remote reads
• Network is simplified
  – Doesn't need to support round trip (read/response) messages
• Remote Writes can be performed directly or via migrations, under programmer (compiler) control.
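A rough way to see the "put-only" point is to count network messages for a series of visits. The sketch below assumes a conventional remote read costs two messages (request and response) and a migration costs one. It ignores message size: a migrating context is larger than a read request, so this compares message counts only. The visit sequence is invented for illustration.

```python
def round_trip_messages(home, access_nodes):
    """Conventional remote reads: request + response for every non-local access."""
    return sum(2 for node in access_nodes if node != home)

def migration_messages(home, access_nodes):
    """Migrating thread: one one-way message whenever the locale changes."""
    messages, here = 0, home
    for node in access_nodes:
        if node != here:
            messages += 1
            here = node
    return messages

# Node that owns each successive word touched by a thread that starts on node 0.
visits = [1, 1, 1, 2, 2, 3, 0, 3, 3]
conventional = round_trip_messages(0, visits)   # 8 remote reads, 2 messages each
migrating = migration_messages(0, visits)       # 5 changes of locale
```

Repeated accesses to the same remote node are what make the difference: once the context has moved there, they cost no network traffic.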

How the Emu Memory Works

Memory 1, Memory 2, Memory 3, Memory 4

• Program (very, very small compared to data)
• Modest amounts of data spread over many, many memories
• The program context moves from memory to memory, resulting in 10x+ better usage of the network

Memory Web Migrating Thread Architecture

• Single System-wide Address Space
• Stationary Cores (SCs):
  – Execute Operating System
  – Manage IO/File System
  – Call or Spawn Gossamer Threads
• Gossamer Cores (GCs) execute Gossamer Threads at Nodelets
  – Perform local computations & memory references (inc. atomics)
  – Migrate to other Nodelets w/o software involvement
  – Spawn new Threads
  – Call System Services on SCs
• Programmed in Cilk

Gossamer Core Architecture

• Deeply pipelined, 64-way multithreaded core
• Accumulator + 16 general registers
• Rich suite of Atomic Memory Operations
• Single SPAWN instruction to create a new thread
• Thread scheduling is automatic and performed by hardware
• RELEASE instruction places context in a Service Queue for processing by Stationary Core
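How many threads it takes for a multithreaded core to hide latency can be estimated with the textbook saturation model: if each thread computes for `work_cycles` and then waits `latency_cycles`, the core stays busy once enough threads overlap. The cycle counts below are hypothetical, chosen only to illustrate the model against a 64-way core; they are not Emu measurements.

```python
import math

def utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the core is busy under the simple saturation model."""
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))

def threads_to_hide(work_cycles, latency_cycles):
    """Smallest thread count at which the modeled core never idles."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)

# Hypothetical numbers: 4 cycles of work between 200-cycle waits.
needed = threads_to_hide(4, 200)      # 51 threads
with_64 = utilization(64, 4, 200)     # saturated
with_16 = utilization(16, 4, 200)     # about 31% busy
```

Under these assumed numbers, 64 hardware threads are enough to cover the wait, while 16 would leave the core idle most of the time.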

Node Architecture

Node Implementations

Emu 1 Memory Server:
• Practical system for large data problems
• 8 narrow channels utilize all available bandwidth
• Solid State Disk for bulk storage
• Highly efficient, with moderate power density
• Scales to multiple racks
• High bandwidth Gossamer cores usable in future versions

Emu 2 Memory Server: Denser memory for higher bandwidth; ASIC for compute node can utilize all available bandwidth

Emu 3 Memory Server: Dozens of nodelets in custom memory stack

Proof of Concept

Emu Product Initial Deliverables

Emu Chick
• 2.2 GUPS performance
• Copy room environment
• FCS Q3'16

Emu 1 Memory Server
• 70 GUPS performance
• Server room environment
• FCS Q3'17

The node boards are the same!

Emu 1 Rack Configuration

• 8 Chassis/Motherboards
  – 256 nodes (32/tray)
  – 8 8-lane PCI Express Gen3 slots (2/tray)
• 24 ProDrive RapidIO switches
  – 96 SRIO 3.1 (6.25 GB/s) links between Supergroups and Central Switch
• 48 VDC power distribution
• N+1 redundant front-end power converter
• Air or building water cooling options
• Approximately 40 kW @ 440 VAC tri-phase

Emu System Characteristics

Emu 1 Rack
• 16 TB DDR4 DRAM
• 256 TB Solid State Disk
• [email protected]/s
• >2 TB/s bisection BW/rack

Emu 2 Rack
• ~16 TB HMC memory
• ~256 TB Solid State Disk
• 2048 mem. channels @ 20 GB/s
• >2 TB/s bisection BW/rack

• Up to 64K nodes/system (256 racks, 256 nodes/rack, 8+ nodelets/node)
• Air or building water cooled, ~40 kW/rack
• High radix RapidIO network
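The headline numbers above are consistent with each other, as a little arithmetic shows. The sketch uses only figures quoted on this slide. It assumes 1 TB = 1024 GB for the per-node DRAM figure and treats "8+" nodelets per node as exactly 8, so the nodelet count is a lower bound.

```python
RACKS_PER_SYSTEM = 256
NODES_PER_RACK = 256
MIN_NODELETS_PER_NODE = 8   # the slide says "8+"

nodes = RACKS_PER_SYSTEM * NODES_PER_RACK     # 65,536 = 64K nodes
nodelets = nodes * MIN_NODELETS_PER_NODE      # at least 524,288 nodelets

# Emu 1 rack: 16 TB of DDR4 shared by 256 nodes (assuming 1 TB = 1024 GB).
dram_per_node_gb = 16 * 1024 / NODES_PER_RACK

# Emu 2 rack: 2048 memory channels at 20 GB/s each.
emu2_channels_per_node = 2048 / NODES_PER_RACK
emu2_aggregate_tb_s = 2048 * 20 / 1000        # aggregate channel bandwidth, TB/s
```

The 8 channels per node that fall out of the Emu 2 figures match the "8+ nodelets/node" on the same slide, and the aggregate channel bandwidth of about 41 TB/s can be set against the quoted bisection bandwidth of more than 2 TB/s per rack.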

Programming Emu

• Programmed in Cilk, a truly parallel extension of C
  – Emu code generator coupled to LLVM front end
  – Inlining of many library functions to leverage Emu instructions
• Added capability to define memory Views and place data in those Views
  – Shared data items are Dynamic Shared Objects (DSOs) that may be defined outside the programs that manipulate them and persist beyond program completion
  – Private automatic variables are declared normally in Cilk.

The Emu Cilk Language

Emu Cilk extends C with a few new keywords
• Cilk_spawn [<var> =] <function> creates new "child" thread running <function> while parent thread continues asynchronously
• Cilk_sync causes current function to wait until all its children have completed
• Cilk_for (<iterator>) {<iteration_code>} [grainsize = <grainsize>] creates a group of new threads (from Cilk Plus)
• Termination of a function always performs an implicit sync
• No support (at present) for INLET, Cilk Plus vector operations, C++
  – Eventually plan to add C++ functionality
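The spawn/sync pattern can be illustrated outside Cilk. The sketch below is a Python analog, not Emu Cilk: the `Spawn` class is a made-up helper that runs a function in an ordinary OS thread, and the comments show the Cilk statement each line stands in for. It only mirrors the control structure (child runs while the parent continues, then the parent waits), not Emu's hardware thread creation or migration.

```python
import threading

class Spawn:
    """Rough analog of `x = cilk_spawn f(args)`: run f in a child thread."""

    def __init__(self, func, *args):
        self.result = None
        self._thread = threading.Thread(target=self._run, args=(func, args))
        self._thread.start()

    def _run(self, func, args):
        self.result = func(*args)

    def sync(self):
        """Rough analog of `cilk_sync`: wait for the child, return its value."""
        self._thread.join()
        return self.result

def fib(n):
    if n < 2:
        return n
    child = Spawn(fib, n - 1)   # x = cilk_spawn fib(n - 1);
    y = fib(n - 2)              # parent continues asynchronously
    x = child.sync()            # cilk_sync;
    return x + y

answer = fib(10)
```

In Cilk the function would also sync implicitly on return, as the slide notes; the Python analog has to call `sync()` explicitly to collect the child's result.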

System Software

• Linux runs on the Stationary Cores (SCs).
• OS launches main() user program on a Gossamer Core (GC)
• main() spawns descendants that execute in parallel and migrate throughout system as needed.
• Runtime executes primarily on the SCs and handles service requests from the threads running on the GCs.
  – Memory allocation and release, I/O, exception handling, and performance monitoring.
  – A few special system threads run on Gossamer Cores to provide real-time system management, such as distribution of credits
• Threads return to main() upon completion, which then returns to the OS.

Where does Emu shine?

• Big Data: the bigger the better. Data that lacks regular structure works especially well
• Graph Algorithms: examples include Non-Obvious Relationships, Particle Swarm and Network Optimization
• Searching, Sorting, and Pattern Matching Problems: such as Gene Sequence matching, Protein-Ligand Docking and Machine Learning
• Sparse Matrix Problems: examples include the HPCG benchmark and Linear Programming

These will all perform far better on Emu than on conventional computers

Market Application Areas: These are areas where the Emu Architecture will shine

Numerical                         Search
• Computational Fluid Dynamics    • Graph Analysis
• Reservoir Modeling              • Network Optimization
• Weather & Climate Modeling      • Regression Analysis
• Linear & Dynamic Programming    • Risk & Fraud Analysis
• Quantum Physics                 • Map/Reduce Acceleration
• Quantum Chemistry               • Record Management & Statistics
• Molecular Dynamics              • Machine Learning
• Many-body Problems              • Image Understanding
• Signal & Image Processing       • Gene Sequence Analysis

Current Status

• Will deliver first Emu "Chicks" in Fall 2016
• Design ports directly to rack level & above
• Significantly better scalability than commercial systems on GUPS, BFS, and similar applications
• Ongoing development of libraries for comprehensive programming environment
• Expanding application base
• Interested in research partnerships to explore new library and application areas

Some Early Comparisons

Memory Channel Comparison

[Figure: log-log scatter plot. X axis: Peak Aggregate Memory Accesses per Sq. Ft. (G/s), 0.01 to 100. Y axis: Total Data Memory Channels per Sq. Ft., 1.E-02 to 1.E+02. Series: Emu, SM, TCDM, LCDM. Labeled points: Emu1 Chick, Emu1, Emu2, Emu3.]

• All points normalized to Emu1 Chick at (1,1)
• More memory channels => more opportunities for parallelism
• More accesses/sec => better ability to access random data

SM: Shared Memory. TCDM: Tightly Coupled Distributed Memory. LCDM: Loosely Coupled Distributed Memory.

Injection Bandwidth & Access Rate

[Figure: log-log scatter plot. X axis: Peak Aggregate Memory Accesses per Sq. Ft. (G/s), 0.01 to 100. Y axis: Peak Aggregate Injection Rate per Sq. Ft. (GB/s), 1.E-02 to 1.E+03. Series: Emu, SM, TCDM, LCDM. Labeled points: Emu1 Chick, Emu1, Emu2, Emu3.]

• All points normalized to Emu1 Chick at (1,1)
• More injection rate => more ability to communicate remotely
• More accesses/sec => better ability to access random data

SM: Shared Memory. TCDM: Tightly Coupled Distributed Memory. LCDM: Loosely Coupled Distributed Memory.

System Per Watt Comparison

[Figure: log-log scatter plot. X axis: Accesses/Sec per Watt (M/s/W), eqvt to Accesses/nJ, 0.01 to 100. Y axis: Injection B/W per Watt (MB/s/W), 1.E-02 to 1.E+02. Series: Emu, SM, TCDM, LCDM. Labeled points: Emu1 Chick, Emu1, Emu2, Emu3.]

• All points normalized to Emu1 Chick at (1,1)
• More performance/watt => lower operating costs

SM: Shared Memory. TCDM: Tightly Coupled Distributed Memory. LCDM: Loosely Coupled Distributed Memory.

GUPS: Global Updates per Second

[Figure: scatter plot with labeled points Emu1, Emu2, Emu3.]

• All points normalized to Emu1 at (1,1)
• Single rack of Emu1 would rank 8th today.
• P775 (#1) is 21 racks.
• BG/Q (#3) is 48 racks.
• K Computer (#2, 4, 7) is 219-864 racks.
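For readers unfamiliar with the benchmark, the kind of kernel that GUPS measures can be sketched as below. It follows the general shape of the HPC Challenge RandomAccess loop (XOR a pseudo-random value into a pseudo-random table slot), but it is a simplified illustration: the official benchmark prescribes its own random sequence and table sizing, and the function names here are made up. The `gups` helper assumes the usual scale of billions of updates per second.

```python
import random

def apply_updates(table, num_updates, seed=1):
    """XOR pseudo-random values into pseudo-random slots of a power-of-two table."""
    mask = len(table) - 1          # valid because len(table) is a power of two
    rng = random.Random(seed)
    for _ in range(num_updates):
        value = rng.getrandbits(64)
        table[value & mask] ^= value   # read-modify-write at an unpredictable address
    return table

def gups(num_updates, seconds):
    """Updates per second in billions (assumed scale)."""
    return num_updates / seconds / 1e9

table = list(range(1 << 10))
apply_updates(table, 5000)
changed = table != list(range(1 << 10))
apply_updates(table, 5000)         # same seed: every XOR is applied a second time
restored = table == list(range(1 << 10))
```

Each update touches one word at an address with no locality, which is why the metric rewards memory systems that handle fine-grained random access well. Because XOR is its own inverse, replaying the same update stream restores the table, which gives a convenient correctness check.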

Breadth First Search

• Core of GRAPH500 rankings
• Start with a root, find reachable vertices
• Metric: TEPS (Traversed Edges/sec)
• Performance issues
  – Massive data sets
  – Very irregular access
  – Very challenging load balancing

Scale = log2(# vertices)

GRAPH500 Graph Sizes

Level  Scale  Size    Vertices (Billion)  TB     Bytes/Vertex
10     26     Toy     0.1                 0.02   281.8048
11     29     Mini    0.5                 0.14   281.3952
12     32     Small   4.3                 1.1    281.472
13     36     Medium  68.7                17.6   281.4752
14     39     Large   549.8               141    281.475
15     42     Huge    4398.0              1,126  281.475
Average                                          281.5162
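The size table follows from the scale definition: the vertex count is 2^scale, and the storage is that count times the bytes per vertex. A short check, using only numbers from the table, reproduces the TB column when a TB is taken as 2^40 bytes. That unit is an inference from the arithmetic, not something the slide states.

```python
TIB = 1 << 40   # the table's TB column matches 2**40 bytes (inferred)

def vertices(scale):
    """Vertex count for a GRAPH500 problem scale: scale = log2(# vertices)."""
    return 1 << scale

def size_tib(scale, bytes_per_vertex):
    """Storage for the graph, in units of 2**40 bytes."""
    return vertices(scale) * bytes_per_vertex / TIB

small_vertices_billion = round(vertices(32) / 1e9, 1)   # "Small": 4.3
medium_tb = round(size_tib(36, 281.4752), 1)            # "Medium": 17.6
huge_tb = round(size_tib(42, 281.475))                  # "Huge": 1126
```

Each three-step increase in scale multiplies both columns by eight, which is why the "Huge" problem needs over a thousand of these terabytes.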

[Figure: example graph with edges labeled e0 to e8.]

Starting at 1: 1, 0, 3, 2, 9, 5
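A minimal breadth-first search makes the traversal order concrete. The edge list of the slide's graph is not recoverable from the transcript, so the adjacency lists below are an assumption, chosen so that a search from vertex 1 visits vertices in the order the slide gives. The edge count returned is the number of adjacency entries scanned; the GRAPH500 rules define the edge count used for TEPS more carefully.

```python
from collections import deque

def bfs(adjacency, root):
    """Return (visit order, parent map, adjacency entries scanned) from root."""
    parent = {root: root}
    order = [root]
    scanned = 0
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adjacency[v]:
            scanned += 1
            if w not in parent:      # first time w is reached
                parent[w] = v
                order.append(w)
                frontier.append(w)
    return order, parent, scanned

# Hypothetical undirected graph (each edge listed in both directions).
adjacency = {
    1: [0, 3], 0: [1, 2], 3: [1, 9], 2: [0, 5],
    9: [3], 5: [2], 7: [8], 8: [7],
}
order, parent, scanned = bfs(adjacency, 1)
```

Vertices 7 and 8 are never reached from root 1, which is the "find reachable vertices" part of the benchmark. Dividing the traversed edge count by the elapsed time gives a TEPS figure.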

Most Recent GRAPH500 Data

[Figure: log-log plot of GTEPS (1.E-02 to 1.E+05) versus Nodes (1 to 100,000), with "Single Rack" and "100's of racks" regions marked.]

• Good scalability for multi-node TCDM systems
• But single node SM systems far higher relative performance
  – Need ~100 nodes of best TCDM to match

[Figure: Expansion of Single Node Region. Log-log plot of GTEPS (1.E-02 to 1.E+03) versus Cores in Node (1 to 10,000).]

• "Single node" systems actually "Shared Memory" multi-core systems
• But performance flat after ~10 cores
• Today's Shared Memory systems inherently more efficient but don't scale
• Extra "cores" add memory, not performance

Early Projections for Emu

[Figure: Emu1 sim results. Log-log plot of GTEPS (0.1 to 100) versus Nodelets (1 to 10000). Series: Scale 22 GTEPS, Scale 23 GTEPS, Projected TEPS.]

• Emu is a shared memory system
• But with FAR better scaling than today's SM systems
• And this is with very small (scale 22) problems
• More realistic sizes (scale 30 for chick) could increase GTEPS by almost 10X
  – Initial scale 23 results point in this direction
• More complete simulations in progress

Stay Tuned!

Real World Commercial Challenge Problem (From LexisNexis)

• 40+ TB of raw data
• Periodically clean up & combine to 4-7 TB
• Weekly "Boil the Ocean" to precompute answers to all standard queries
  – Does X have financial difficulties?
  – Does X have legal problems?
  – Has X had significant driving problems?
  – Who has shared addresses with Z?
  – Who has shared property ownership with Z?

Auto Insurance Co: "Tell me about giving auto policy to Jane Doe" in <0.1 sec
• Look up answers to precomputed queries for "Jane Doe" and combine
• "Jane Doe has no indicators. But she has shared multiple addresses with Joe Scofflaw, who has the following negative indicators…."

Relationships
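The "who has shared addresses with Z?" relationship query can be sketched on toy data. The records, names and function below are invented for illustration and say nothing about how LexisNexis implements it. The sketch builds an index from address to people and counts, for one person, how many addresses each other person shares with them.

```python
from collections import defaultdict

def shared_address_counts(records, person):
    """Map each other person to the number of addresses shared with `person`."""
    by_address = defaultdict(set)   # address -> people seen there
    by_person = defaultdict(set)    # person  -> addresses they have used
    for pid, address in records:
        by_address[address].add(pid)
        by_person[pid].add(address)

    counts = defaultdict(int)
    for address in by_person[person]:        # hop: person -> address
        for other in by_address[address]:    # hop: address -> co-residents
            if other != person:
                counts[other] += 1
    return dict(counts)

# Hypothetical (person, address) records.
records = [
    ("jane", "12 Elm"), ("joe", "12 Elm"),
    ("jane", "9 Oak"), ("joe", "9 Oak"), ("sam", "9 Oak"),
    ("ann", "3 Pine"),
]
jane_links = shared_address_counts(records, "jane")
```

Each query is a short chain of lookups into widely scattered records: the series of brief visits to dispersed data that the earlier slides describe.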

Traditional Approach: Runaway Intermediate Data

Input {(ID, lname, adr)}
  14.2B recs, 325 B/rec, 4.6 TB
→ Project
  14.2B recs, 100+ B/rec, 1.5 TB
→ Join on Address
  1.6T recs, 200+ B/rec, 300+ TB
→ Compute adr hash; Compare lnames; Init score to 3; Project
  1.6T recs, 30 B/rec, 48+ TB, {(ID1, ID2, adrhash, score, lname_match)}
→ Sort & Remove Duplicates
  1.5T recs, 30 B/rec, 45 TB
→ Hash ID1,2 & Distribute
→ Group by ID pairs & Sum scores, Lname_match
  1.2T recs, 16 B/rec, 20 TB, {(ID1, ID2, score, lname_match)}
→ Select on Score & Lname_match
  12B recs, 16 B/rec, 200 GB

Other labels on the diagram: 800M distinct IDs; 400M distinct IDs; Send between nodes via TCP/IP datagrams; "h", "t", "J", "D"
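The "runaway" in the title is visible from the record counts and record sizes alone. The sketch below multiplies the two for each stage, using the lower bound wherever the slide gives a "+" value and decimal terabytes (10^12 bytes). The stage names and their ordering are a reading of the flattened diagram, not a confirmed pipeline.

```python
# (stage, records, bytes per record) as read from the slide.
STAGES = [
    ("input", 14.2e9, 325),
    ("after project", 14.2e9, 100),
    ("after join on address", 1.6e12, 200),
    ("after scoring and project", 1.6e12, 30),
    ("after sort and dedup", 1.5e12, 30),
    ("after group by ID pairs", 1.2e12, 16),
    ("after select", 12e9, 16),
]

def terabytes(records, bytes_per_record):
    """Data volume in decimal terabytes."""
    return records * bytes_per_record / 1e12

volumes = {name: terabytes(recs, size) for name, recs, size in STAGES}
peak_stage = max(volumes, key=volumes.get)
blow_up = volumes["after join on address"] / volumes["input"]
```

The join on address turns 4.6 TB of input into at least 320 TB of intermediate data, roughly a 70-fold expansion, before later stages shrink it back to about 0.2 TB of answers.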

Heavyweight Upgrades

[Figures: three log-scale charts of Resources Used/node (sec), 1.E-02 to 1.E+03, versus Step # (1 to 9), with series Disk, CPU, Memory, Network. Panels: 2012 Baseline; Upgrade Disk, Network, Use RAMDISK (2.9X Speedup); Also Upgrade Processor (8.1X Speedup).]

Time per Step (steps 1 to 9), Total Time and Speedup by enhancement:

Enhancements                    1      2      3      4      5      6      7      8      9      Total   Speedup
2012 Baseline                   75.2   75.2   75.2   75.2   75.2   580.5  580.5  360.0  362.9  1025.8  1.00
2x IB FDR                       75.2   75.2   75.2   75.2   75.2   250.2  360.0  360.0  362.9  805.3   1.27
SATA SSD                        28.8   28.8   28.8   28.8   28.8   580.5  580.5  144.0  145.2  757.4   1.35
RAMDISK                         75.2   75.2   75.2   75.2   75.2   580.5  580.5  25.0   50.0   707.7   1.45
22 cores, 4x DDR4               75.2   75.2   75.2   75.2   75.2   580.5  580.5  360.0  362.9  1025.8  1.00
2x IB FDR + SATA SSD + RAMDISK  28.8   28.8   28.8   28.8   28.8   250.2  275.2  25.0   50.0   355.1   2.89
All four                        28.8   28.8   28.8   28.8   28.8   74.4   81.9   7.4    14.9   126.3   8.12

All are still 10 rack systems
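The speedup column is the baseline total time divided by each configuration's total time. The check below uses only the totals from the table. How the total is derived from the per-step times is not stated on the slide, so the sketch does not attempt it.

```python
BASELINE_TOTAL = 1025.8   # total time of the 2012 baseline

# (total time, speedup as printed on the slide)
REPORTED = [
    (1025.8, 1.00),
    (805.3, 1.27),
    (757.4, 1.35),
    (707.7, 1.45),
    (355.1, 2.89),   # disk, network and RAMDISK upgrades together
    (126.3, 8.12),   # processor upgraded as well
]

def speedup(total_time):
    """Speedup over the 2012 baseline, rounded as on the slide."""
    return round(BASELINE_TOTAL / total_time, 2)

computed = [speedup(total) for total, _ in REPORTED]
```

The combined upgrades (2.89X) outperform the product of the three individual speedups (about 2.5X), and the processor upgrade only pays off once the other bottlenecks are removed.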

Comparison Based on Analytic Model

[Figure: log-scale plot of Speedup over 2012 Baseline (0.1 to 1000) versus Racks (0 to 10). Series: Heavyweight, Lightweight, Next-Gen Compute, Memory Web. Labeled points: Baseline, Upgrades, MW1, MW2, MW3.]

Summary

• Supports enormous problems in-memory
• Supports high-rate real-time updates to the database
• Executes cache-breaker apps far more effectively than today
• Makes better use of system bandwidth than conventional computers, resulting in higher performance and lower energy utilization
• Offers unprecedented levels of scalable parallelism in a structure that can be understood and exploited
• Programming environment is effective and reasonably familiar to those used to standard systems

Relevance to Space

• Growth in sparse & irregular in onboard apps
• Natural integration into 3D
• Lower energy per op
• Spawns can be extended to interface with specialized accelerator cores
• Cilk provides simpler parallel programming
• Memory ops can be architected to include RAID-like redundancy
• Address-based migrations enable simple failure recovery

Thank You