On-Chip Static vs. Dynamic Routing for Feed Forward Neural Networks on Multicore Neuromorphic Architectures


Raqibul Hasan and Tarek M. Taha
Department of Electrical and Computer Engineering, University of Dayton, Dayton, OH 45469, USA
{hasanm1, tarek.taha}@udayton.edu

Abstract: With processor reliability and power limiting the performance of future computing systems, interest in multicore neuromorphic architectures is increasing. These architectures require on-chip routing networks to enable cores to communicate neural outputs with each other. In this study we examine two routing approaches for large multicore feed forward neural network accelerators: static and dynamic. Models are developed to determine routing resources for 2D mesh interconnection topologies. Detailed analyses of power, area, and link utilization are carried out for several architecture options. In almost all cases, static routing is significantly more efficient than dynamic routing, requiring both lower area and power.

Keywords: On-chip routing; neuromorphic computing; computer architecture.

I. INTRODUCTION

Reliability and power consumption are the main obstacles to continued performance improvements in future multicore computing systems [1]. Interest in specialized architectures for accelerating neural networks has increased significantly because of their ability to reduce power, increase performance, and allow fault tolerant computing. Chen et al. [2] have shown that Recognition, Mining, and Synthesis (RMS) applications (described by Intel as the key application drivers of the future [3]) can be represented as neural networks. They make the case that neural network accelerators can have broad applications. Esmaeilzadeh et al. [4] show that several key application kernels (such as FFT and JPEG) can be approximated using neural networks, and make the case for specialized neural network accelerators on general purpose CPUs.

One of the most common forms of neural networks is the feed forward network, as illustrated in Fig. 1. If the axonal inputs to a neuron are given by $x_i$, then the corresponding neuron output is evaluated as:

$v_j = \sum_i W_{i,j} x_i$   (1)
$y_j = f(v_j)$   (2)

Here, $W$ is a weight matrix in which $W_{i,j}$ is the synaptic weight of axon $i$ for neuron $j$, and $f$ is a nonlinear function (usually a sigmoid function).

Fig. 1. Example neural network and its multicore implementation. Each core needs 100 inputs as it has 100 neurons and 10,000 weights. The data transfer between cores takes place through the on-chip routing network.
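As a concrete illustration of equations (1) and (2), the following minimal sketch (not part of the original paper; NumPy and a sigmoid $f$ are assumed) evaluates one layer the way a single core in Fig. 1 would: 100 axonal inputs, 100 neurons, and a locally stored 100×100 weight matrix.

```python
import numpy as np

def layer_output(W, x):
    """Evaluate eqs. (1)-(2): v_j = sum_i W[i, j] * x[i], y_j = f(v_j),
    with f taken as a sigmoid (the paper's usual choice of nonlinearity)."""
    v = W.T @ x                      # eq. (1): weighted sum per neuron j
    return 1.0 / (1.0 + np.exp(-v))  # eq. (2): nonlinear activation

# One core from Fig. 1: 100 axonal inputs, 100 neurons, hence a 100x100
# (10,000-entry) weight matrix stored locally on the core.
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))  # W[i, j]: weight of axon i for neuron j
x = rng.standard_normal(100)         # inputs arriving from the previous layer
y = layer_output(W, x)
print(y.shape)  # (100,) -- the 100 outputs this core sends onward
```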
Recent studies have shown that neural networks with multiple layers (such as deep belief networks) have strong inference capabilities. Ciresan et al. [5] have demonstrated that a six layer neural network with up to 2,500 neurons in a layer can achieve a very low error rate of 0.35% in the recognition of 28×28 pixel images of handwritten digits from the MNIST database [6]. To process these large networks efficiently, the network was implemented on GPGPUs. Larger images and videos will likely require neural networks with more layers and more neurons per layer. High performance computing hardware will likely be required to process these larger networks.

Taha et al. [7] presented several specialized multicore neuromorphic architectures that allow high throughput pipelined processing of large feed forward neural networks. They showed that these architectures can reduce the power consumption of neural network processing by about 100 to 10^5 times compared to conventional architectures (depending on the architecture options selected). The architectures consist of a collection of specialized processors connected with an on-chip mesh routing network (see Fig. 2). In large neural networks, the volume of synaptic weight data far outnumbers the volume of neural output data. Since data communication is one of the key sources of energy consumption and performance delays, the architectures in [7] store the synaptic data within each core (instead of in off-chip memory or a shared cache) to reduce memory access times and energies. This, however, means that the architecture has to be preprogrammed to implement a specific neural network. In this architecture, the on-chip routing network has to exchange only neural outputs between cores (as the synaptic data is already on chip), thus enabling low power, high speed processing.

Fig. 2. 2D mesh interconnection network connecting a set of neural cores.

To reduce the overall area and power of multicore neural processing chips, it is essential to optimize the on-chip routing network. Two possible approaches to implementing the routing network are static and dynamic routing. In dynamic routing, each core sends out a packet with a destination header. This packet header is examined by each router it passes through to direct the packet towards its destination. Dynamic routing is resource and power intensive, requiring buffers, a crossbar switch, and a switch allocator per router.

In static routing, a dedicated connection is set up between a source core and its destination cores. When a particular neural network is mapped onto the multicore system, the communication pattern between the cores becomes deterministic. Thus the connectivity needed between the cores is pre-determined, and therefore static routing between the cores can be utilized (similar to routing between configurable logic blocks in an FPGA). This approach requires only a routing switch. Each connection within the routing switch requires an SRAM cell to enable reconfiguration of the path for a particular network (Fig. 3) [8]. The key benefit of static routing is that it does not require dynamic routing logic, which can significantly reduce power consumption. If the channel utilizations are low, however, the area of static routing could be larger than that of dynamic routing.

Fig. 3. Static routing switch.
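To make the contrast concrete, here is a toy functional model of a Fig. 3-style static switch: one SRAM configuration bit per crosspoint turns a pass transistor on or off, after which routing is just wiring, with no per-packet header inspection, buffering, or allocation. The class name, channel counts, and payload are illustrative assumptions, not details from the paper.

```python
class StaticSwitch:
    """Toy functional model of a static routing switch (cf. Fig. 3):
    one SRAM bit per crosspoint controls a pass transistor."""

    def __init__(self, n_in, n_out):
        # config[i][j] == True models a programmed SRAM cell connecting
        # input channel i to output channel j
        self.config = [[False] * n_out for _ in range(n_in)]

    def connect(self, i, j):
        self.config[i][j] = True  # programmed once per mapped neural network

    def route(self, inputs):
        # Pure wiring: values flow through enabled crosspoints; unlike a
        # dynamic router, no packet headers are examined at runtime.
        outputs = [None] * len(self.config[0])
        for i, value in enumerate(inputs):
            for j, enabled in enumerate(self.config[i]):
                if enabled:
                    outputs[j] = value
        return outputs

# 5 channels in/out mirrors the 5-port switches assumed later in the paper.
sw = StaticSwitch(n_in=5, n_out=5)
sw.connect(0, 3)  # e.g., one neighbor input wired straight to the core port
print(sw.route(["neuron outputs", None, None, None, None]))
```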
In this study we compare the area and power consumption of both static and dynamic routing for the multicore neural network architectures presented in [7]. We examine the impact of core performance and of the number of cores needed to implement a layer. Our results show that in almost all cases, static routing enables lower area and power consumption.

The rest of the paper is organized as follows: Section II describes related work in the area. Section III examines an approach to map neural networks onto the cores of a multicore system. Section IV presents an on-chip data transfer model for 2D mesh interconnections. Experimental results are presented in Section V, followed by conclusions in Section VI.

II. RELATED WORK

A range of neural network emulation projects are underway around the world, including GPU, FPGA, CMP, and ASIC based systems [9-13]. Nageswaran et al. [9] examined GPU acceleration of spiking neural models. Acceleration on high performance clusters [10] and FPGAs [11] has also been studied. Since GPU and FPGA systems consume relatively high power and area, they are not favorable for embedded applications.

The SpiNNaker project [12] at Manchester University is a multi-chip system in which each chip includes twenty simplified ARM 968 processors. Among these twenty cores, only eighteen are used for neural simulation, with each core simulating about 1,000 neurons. The cores communicate spike events to other cores through packets on an on-chip network. 128 MB of RAM is shared by all 18 cores through a DMA controller. Each chip is connected to six adjacent SpiNNaker chips through off-chip interconnections.

A crossbar memory based, fully digital neuromorphic core was proposed in [14, 15]. This system consisted of 256 integrate-and-fire neurons and a 1024×256 SRAM crossbar memory holding the connectivity information. In contrast to traditional von Neumann architectures, this system integrated computation alongside memory, enabling it to achieve low power execution. The study did not examine how multiple cores would be connected on-chip to simulate large neural networks.

MOS transistor based analog neuron circuits have been studied extensively [16-18]. The Neurogrid project [17, 19] uses local analog wiring to minimize the need for digitization of on-chip communications. Spikes, rather than voltage levels, are propagated to destination synapses. The focus is to mimic biological neurons on a silicon chip.

In the era of multicore architectures, on-chip communication networks are of significant interest to the research community. Several studies have examined on-chip routing in general purpose CMP interconnection architectures [20-22]. In the MIT RAW processor, processing tiles are connected to neighboring tiles in a 2D mesh using dynamic and static networks for communication [23].

There are several routing alternatives, such as unicast, multicast, multiple-unicast, and broadcast. In neural networks, one neuron is connected to several other neurons. Therefore a multicast routing scheme could reduce network traffic significantly. Multicast routing schemes are demonstrated in [24, 25].

III. MULTI-CORE ARCHITECTURE FOR NEUROMORPHIC APPLICATIONS

Chip multi-processors (CMPs) are one of the most common architectures for exploiting task level parallelism. Since neural network applications exhibit aggressive task level parallelism, we can achieve very high throughputs by executing them on CMPs.

A. Mapping of Neural Networks

The communication pattern in the on-chip interconnection network depends on how a neural network is mapped onto the processing cores. Consider an N core computing platform onto which we want to map an L layered neural network, where layer k has $n_k$ neurons (k = 1, 2, ..., L). We map the neurons onto the processing cores uniformly, so each core simulates approximately $(\sum_k n_k)/N$ neurons (a short sketch of this mapping appears below). To optimize communication delays, the cores corresponding to a layer of neurons should be as close to each other as possible (preferably a square sub-grid of the grid of cores). Neurons are simulated in a pipelined fashion to hide communication delays (see Fig. 1): when layer l neurons are being simulated for pattern i, layer l-1 neurons are simulated for pattern i+1.

B. On-chip Interconnection Topology

Processing cores simulating neurons share only the generated neuron outputs. These data are exchanged among the processing cores over an on-chip interconnection network. In this paper, we evaluate the bandwidth requirement for exchanging neuron output data.

Cores are connected by an interconnection network for sharing neuron outputs. We need reconfiguration capability in the interconnection network to be able to simulate different neural network topologies. In a shared bus interconnection, we would need a bus connecting each core to all the other cores. This design is not scalable, because increasing the number of cores requires a huge overhead of resources for wiring and dealing with a large number of routing channels. In this paper, we will examine 2D mesh networks (Fig. 2) for connecting the cores in a multicore neuromorphic architecture. Mesh interconnects are the most widely studied on-chip network topology due to their scalability and regularity.
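The following sketch (an illustrative assumption, not the authors' tool) implements the uniform mapping of Section III.A: neurons are assigned to cores so that each core holds roughly $(\sum_k n_k)/N$ of them. The square sub-grid placement step is omitted for brevity.

```python
import math

def map_layers_to_cores(layer_sizes, n_cores):
    """Assign neurons to cores uniformly: each core simulates about
    (sum_k n_k) / N neurons. Returns, per layer, a list of
    (core_index, neuron_count) pairs."""
    per_core = math.ceil(sum(layer_sizes) / n_cores)
    assignment, core, used = [], 0, 0
    for n in layer_sizes:
        cores, remaining = [], n
        while remaining > 0:
            take = min(remaining, per_core - used)
            cores.append((core, take))
            used += take
            remaining -= take
            if used == per_core:      # this core is full; move to the next
                core, used = core + 1, 0
        assignment.append(cores)
    return assignment

# The three-layer, 100-neurons-per-layer network of Fig. 1 on 3 cores:
print(map_layers_to_cores([100, 100, 100], 3))
# [[(0, 100)], [(1, 100)], [(2, 100)]]
```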
C. Communication and Routing Technique

We assume that neurons are mapped onto the multicore system such that each core simulates approximately the same number of neurons. For feed-forward neural networks, when a core finishes simulating all the layer l-1 neurons assigned to it, the core sends the corresponding neuron outputs to all processing units that are simulating layer l neurons. Multicast routing [24, 25] is efficient for sending neuron outputs to multiple destinations (Fig. 4). Multicast routing reduces network traffic and hence requires less bandwidth. For a particular neural network, routing is static, since a particular source and destination pair always uses the same path to send data packets.

Fig. 4. Multicast routing (S is the source core, and D are the destination cores).

The number of bits used to represent each neuron output has an impact on the data transfer bandwidth. Some neural models require a single bit per neuron output (such as restricted Boltzmann machines), while other models require multiple bits per neuron output. In this study, we assumed 8 bits per neuron output; other bit widths would simply scale our results.

IV. ON-CHIP BANDWIDTH REQUIREMENT FOR COMMUNICATION

The on-chip interconnection network enables the exchange of generated neuron outputs among cores. The on-chip interconnection link bandwidth depends on several issues, such as the shared data generation rate, how many other cores are sharing the data, and where the destination cores are located with respect to the source cores. In this paper we take a link to consist of two sets of wires, enabling bidirectional data transfers.

To understand the core-to-core link bandwidth requirements, consider the examples shown in Figs. 5 and 6. Fig. 5a represents the relative bandwidth needed for sending the outputs to the next layer of cores. In this example, we examine a 4×4 collection of cores simulating one layer, and assume each core generates 1 output per time step. Thus each column of cores will generate 4 packets per time step. Fig. 5b shows the relative bandwidth on the internal links for distributing the incoming packets from the previous layer (l-1). Since we assume a pipelined evaluation of the inputs, the relative bandwidths of Figs. 5a and 5b will be overlapped. Fig. 5c shows the overall relative bandwidth for the layer l core links by combining the bandwidths from Figs. 5a and 5b. Fig. 6 shows the internal link relative bandwidths for another positioning of cores relative to their adjacent layers. It is seen that the internal link bandwidth in Fig. 6c is higher than in Fig. 5c.

The maximum link bandwidth in this system was derived in [26], and is summarized here for convenience. Assume that the number of processing cores simulating a single layer of neurons is $n_1 \times n_2$, and that $b$ is the number of bits used to encode a single neuron output. Additionally, assume that each core has a performance of $P$ neurons per second. The required link bandwidth for this system is given by:

$B_{link} = (\max\{n_1, n_2\} - 1 + n_1 n_2) \cdot P \cdot b$ bps   (3)

The maximum link bandwidth in a particular direction (per port) is given by:

$(n_1 n_2 + |n_2 - n_1| - 1) \cdot P \cdot b$ bps   (4)

The bandwidth requirement for a link determines the minimum number of wires needed for that link. In this study, we assume 5 input ports and 5 output ports per router or switch (Fig. 3). Four of these are for communicating with neighboring routers, and one is for communicating with the attached core. Each port can have multiple wires, and each port can handle a finite number of bits per second. Equation (4) indicates that the link bandwidth increases with the maximum number of cores per layer ($n_1 n_2$). This also increases the number of wires needed per port.
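Equations (3) and (4) translate directly into a short calculator. The sketch below simply restates the two formulas from [26] in Python; the example layer size and performance figures are illustrative.

```python
def link_bandwidth(n1, n2, perf, bits=8):
    """Eq. (3): required link bandwidth in bps for an n1 x n2 layer of
    cores, each producing perf neuron outputs/s encoded with `bits` bits."""
    return (max(n1, n2) - 1 + n1 * n2) * perf * bits

def port_bandwidth(n1, n2, perf, bits=8):
    """Eq. (4): maximum bandwidth in one direction (per port), in bps."""
    return (n1 * n2 + abs(n2 - n1) - 1) * perf * bits

# A 2x2 layer (4 cores) at 10^6 neurons/s with 8-bit outputs:
print(link_bandwidth(2, 2, 1e6))  # (2 - 1 + 4) * 1e6 * 8 = 4.0e7 bps
print(port_bandwidth(2, 2, 1e6))  # (4 + 0 - 1) * 1e6 * 8 = 2.4e7 bps
```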
In the rest of the paper, we denote the maximum number of cores per layer ($n_1 n_2$) as $n_{max}$. This is the maximum number of cores per layer we can have without making routing a performance bottleneck.

V. EXPERIMENTAL EVALUATION

Several system configurations were examined to evaluate the area and power of static versus dynamic routing. The maximum number of cores per layer ($n_{max}$) was set to 4, 9, and 25. The maximum number of channels needed per port for static routing in each case is shown in Table 1. For each case, the performance of the cores was varied from 10^5 to 2×10^8 neurons per second, with each neuron output assumed to be 8 bits wide. We utilized the link bandwidth model from [26] to evaluate data transfer power. Increasing the core performance increases the amount of data put onto the routing network, resulting in more wires being needed per port and a higher link utilization. Hence both the routing area and power increase with the core performance.

Fig. 5. Traffic pattern when layer l-1 and layer l+1 cores are on two opposite sides of the layer l cores.

Fig. 6. Traffic pattern when layer l-1 and layer l+1 cores are on sides perpendicular to each other.

The power and area were calculated assuming a 32 nm process technology, a 200 MHz clock, and a 1 mm² neural core area. In this study, the core area was not changed with variation in core performance. The area and power of the dynamic router were calculated using the Orion [27] interconnection network simulator. The static routing switch consists of an array of SRAM cells, each controlling a pass transistor. The area and power of the SRAM cells were calculated using the Cacti [28] cache model.

Table 1: Channels per port

Maximum cores per layer    Maximum channels per port
4                          3
9                          8
25                         24

Table 1 shows the maximum number of channels needed per port. In the dynamic routing case, all of these channels could potentially be handled by one physical wire if the link utilization is low. In the static routing case, each channel requires its own physical wire.

Fig. 7 shows the variation of power (mW), area (mm²), wires per port, and wire utilization with core performance. In almost all cases, the area and power of the static routing network are lower than those of the equivalent dynamic routing network. The dynamic routing network has a lower area only when the core performance is low. In these cases the link utilization of the network is low, allowing the dynamic network to have fewer wires per port, and thus a lower overall area. As shown in Fig. 7, the lowest number of wires per port for the dynamic network is 1, while the minimum for the static network is given by the number of channels per port shown in Table 1. From Fig. 7, it is seen that the link utilization reaches about 100% at a core performance of about 50 million neurons per second. These results indicate that static routing is likely to be the optimal approach for multicore neuromorphic processors.

Fig. 7. Power (mW), area (mm²), wires per port, and wire utilization variation with core performance (in 10^6 neurons/s). Both dynamic routing (dx) and static routing (sx) data are shown. Three cases are presented, where the maximum number of cores per layer (x) is restricted to: (a) 4, (b) 9, and (c) 25. This data is presented for the router and links with the highest utilization within the system.
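As a rough way to reproduce the wires-per-port and utilization trends of Fig. 7, one can combine equation (4) with the 200 MHz clock and 8-bit outputs of the experimental setup. The wire model below (one bit per wire per clock cycle) is our assumption, not the paper's; note also that for square layers the coefficient in equation (4) matches the Table 1 channel counts (3, 8, and 24).

```python
import math

F_CLK = 200e6  # router clock from the experimental setup (200 MHz)
BITS = 8       # bits per neuron output, as assumed in the paper

def port_bandwidth(n1, n2, perf):
    """Eq. (4): per-port bandwidth in bps."""
    return (n1 * n2 + abs(n2 - n1) - 1) * perf * BITS

def wires_per_port(bw, min_wires):
    """Assumed wire model: one bit per wire per clock cycle, so
    wires = ceil(bandwidth / clock), floored at a minimum wire count."""
    return max(min_wires, math.ceil(bw / F_CLK))

# nmax = 9 (a 3x3 layer): static needs at least 8 wires/port (Table 1),
# while dynamic can fall back to a single shared wire at low utilization.
for perf in (1e5, 1e6, 1e7, 1e8):
    bw = port_bandwidth(3, 3, perf)
    w_dyn = wires_per_port(bw, min_wires=1)
    w_sta = wires_per_port(bw, min_wires=8)
    print(f"P={perf:.0e} n/s: dynamic {w_dyn} wires, static {w_sta} wires, "
          f"static utilization {bw / (w_sta * F_CLK):.1%}")
```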
VI. CONCLUSION

In this paper we have studied two on-chip routing techniques, static and dynamic, for multicore neuromorphic systems. For each case we have examined the routing resource requirements. Our experimental results show that in multicore neuromorphic architectures, static routing requires less routing area and power compared to dynamic routing. A similar analysis could be done for other neural networks (such as spiking and convolutional), likely generating similar results.

REFERENCES

[1] H. Esmaeilzadeh, E. Blem, R. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 365-376.
[2] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "BenchNN: On the broad potential application scope of hardware neural network accelerators," IEEE International Symposium on Workload Characterization (IISWC), November 2012.
[3] P. Dubey, "Recognition, mining and synthesis moves computers to the era of tera," Technology@Intel Magazine, Feb. 2005.
[4] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," International Symposium on Microarchitecture (MICRO), 2012.
[5] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207-3220, 2010.
[6] http://yann.lecun.com/exdb/mnist/
[7] T. Taha, R. Hasan, C. Yakopcic, and M. McLean, "Exploring the design space of specialized multicore neural processors," International Joint Conference on Neural Networks (IJCNN), August 2013.
[8] W. Wang, T. T. Jing, and B. Butcher, "FPGA based on integration of memristors and CMOS devices," Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1963-1966, May 30-June 2, 2010.
[9] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 3201-3208, NJ, USA, 2009.
[10] T. M. Taha, P. Yalamanchili, M. Bhuiyan, R. Jalasutram, C. Chen, and R. Linderman, "Neuromorphic algorithms on clusters of PlayStation 3s," International Joint Conference on Neural Networks (IJCNN), pp. 1-10, 18-23 July 2010.
[11] H. Hellmich and H. Klar, "An FPGA based simulation acceleration platform for spiking neural networks," The 47th Midwest Symposium on Circuits and Systems (MWSCAS), vol. 2, pp. 389-392, July 2004.
[12] S. B. Furber, S. Temple, and A. D. Brown, "High-performance computing for systems of spiking neurons," Proceedings of the AISB'06 Workshop on GC5: Architecture of Brain and Mind, vol. 2, pp. 29-36, Bristol, April 2006.
[13] J. Schemmel, J. Fieres, and K. Meier, "Wafer-scale integration of analog neural networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[14] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, 19-21 Sept. 2011.
[15] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," The International Joint Conference on Neural Networks (IJCNN), pp. 1-8, June 2012.
[16] G. Indiveri, B. L. Barranco, T. J. Hamilton, A. van Schaik, R. E. Cummings, T. Delbruck, S. C. Liu, P. Dudek, P. Häfliger, S. Renaud, J. Schemmel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. S. Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen, "Neuromorphic silicon neuron circuits," Frontiers in Neuroscience, 2011.
[17] P. A. Merolla and K. Boahen, "Dynamic computation in a recurrent network of heterogeneous silicon neurons," Proceedings of the IEEE International Symposium on Circuits and Systems, May 2006.
[18] J. H. B. Wijekoon and P. Dudek, "Compact silicon neuron circuit with spiking and bursting behaviour," Neural Networks, vol. 21, issues 2-3, pp. 524-534, 2008.
[19] J. Lin, P. Merolla, J. Arthur, and K. Boahen, "Programmable connections in neuromorphic grids," IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 80-84, Aug. 2006.
[20] T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), NY, USA, 2009.
[21] A. Bakhoda, J. Kim, and T. M. Aamodt, "Throughput-effective on-chip networks for manycore accelerators," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 421-432, USA, 2010.
[22] J. Kim, "Low-cost router microarchitecture for on-chip networks," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), NY, USA, pp. 255-266, 2009.
[23] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general purpose programs," IEEE Micro, Mar/Apr 2002.
[24] N. E. Jerger, L. S. Peh, and M. Lipasti, "Virtual circuit tree multicasting: A case for on-chip hardware multicast support," Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA), June 2008.
[25] J. Wu and S. Furber, "A multicast routing scheme for a universal spiking neural network architecture," The Computer Journal, vol. 53, no. 3, pp. 280-288, March 2010.
[26] R. Hasan and T. Taha, "Routing bandwidth model for feed forward neural networks on multicore neuromorphic architectures," International Joint Conference on Neural Networks (IJCNN), August 2013.
[27] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 423-428, April 2009.
[28] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), Washington, DC, USA, pp. 3-14, 2007.
