On-Chip Static vs. Dynamic Routing for Feed Forward Neural Networks on Multicore Neuromorphic Architectures

Raqibul Hasan and Tarek M. Taha
Department of Electrical and Computer Engineering, University of Dayton, Dayton, OH 45469, USA
{hasanm1, tarek.taha}@udayton.edu

Proceedings of the 2013 2nd International Conference on Advances in Electrical Engineering (ICAEE 2013), 19-21 December 2013, Dhaka, Bangladesh

Abstract—With processor reliability and power limiting the performance of future computing systems, interest in multicore neuromorphic architectures is increasing. These architectures require on-chip routing networks to enable cores to communicate neural outputs with each other. In this study we examine two routing approaches for large multicore feed forward neural network accelerators: static and dynamic. Models are developed to determine routing resources for 2D mesh interconnection topologies. Detailed analysis of power, area, and link utilization is carried out for several architecture options. In almost all cases, static routing is significantly more efficient than dynamic routing, requiring both lower area and power.

Keywords: On-chip routing; neuromorphic computing; computer architecture.

I. INTRODUCTION

Reliability and power consumption are the main obstacles to continued performance improvements in future multi-core computing systems [1]. Interest in specialized architectures for accelerating neural networks has increased significantly because of their ability to reduce power, increase performance, and allow fault tolerant computing. Chen et al. [2] have shown that Recognition, Mining, and Synthesis (RMS) applications (described by Intel as the key application drivers of the future [3]) can be represented as neural networks. They make the case that neural network accelerators can have broad applications. Esmaeilzadeh et al. [4] show that several key application kernels (such as FFT and JPEG) can be approximated using neural networks, and make the case for specialized neural network accelerators on general purpose CPUs.

One of the most common forms of neural networks is the feed forward network, illustrated in Fig. 1. If the axonal inputs to a neuron are given by x_i, then the corresponding neuron output is evaluated as:

  v_j = \sum_i W_{i,j} x_i    (1)
  y_j = f(v_j)                (2)

Here, W is a weight matrix in which W_{i,j} is the synaptic weight of axon i for neuron j, and f is a nonlinear function (usually a sigmoid function).

Recent studies have shown that neural networks with multiple layers (such as deep belief networks) have strong inference capabilities. Ciresan et al. [5] demonstrated that a six layer neural network with up to 2500 neurons in a layer can achieve a very low error rate of 0.35% in the recognition of 28x28 pixel images of handwritten digits from the MNIST database [6]. To process these large networks efficiently, the network was implemented on GPGPUs. Larger images and videos will likely require neural networks with more layers and more neurons per layer, and high performance computing hardware will likely be required to process them.

Taha et al. [7] presented several specialized multicore neuromorphic architectures for high throughput pipelined processing of large feed forward neural networks. They showed that these architectures can reduce the power consumption of neural network processing by about 100 to 10^5 times compared to conventional architectures (depending on the architecture options selected). The architectures consist of a collection of specialized processors connected by an on-chip mesh routing network (see Fig. 2). In large neural networks, the volume of synaptic weight data far outnumbers the volume of neural output data.
Since data communication is one of the key sources of energy consumption and performance delay, the architectures in [7] store the synaptic data within each core (instead of in off-chip memory or a shared cache) to reduce memory access times and energies. This, however, means that the architecture must be preprogrammed to implement a specific neural network. In this architecture, the on-chip routing network has to exchange only neural outputs between cores (as the synaptic data is already on chip), thus enabling low power, high speed processing.

Fig. 1. Example neural network and its multicore implementation. Each core has 100 neurons and 10,000 weights, and therefore needs 100 inputs. Data transfer between cores takes place through the on-chip routing network.
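As a concrete illustration of Eqs. (1)-(2), the following NumPy sketch evaluates one feed-forward layer for the 100-neuron core of Fig. 1. The function and variable names are ours, for illustration only, and do not describe the hardware implementation in [7].

```python
import numpy as np

def sigmoid(v):
    """Nonlinear activation f from Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-v))

def evaluate_layer(W, x):
    """Evaluate one feed-forward layer per Eqs. (1)-(2).

    W[i, j] is the synaptic weight of axon i for neuron j, and
    x[i] is the i-th axonal input. Returns the neuron outputs
    y[j] = f(sum_i W[i, j] * x[i]).
    """
    v = W.T @ x          # v_j = sum_i W[i, j] * x[i]   (Eq. 1)
    return sigmoid(v)    # y_j = f(v_j)                 (Eq. 2)

# Example: one core of Fig. 1 with 100 axons x 100 neurons = 10k weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))
x = rng.standard_normal(100)        # 100 inputs arriving over the mesh
y = evaluate_layer(W, x)            # 100 outputs sent to the next layer
```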


Fig. 2. 2D mesh interconnection network connecting a set of neural cores.

To reduce the overall area and power of multicore neural processing chips, it is essential to optimize the on-chip routing network. Two possible approaches to implement the routing network are static and dynamic routing. In dynamic routing, each core sends out a packet with a destination header. This packet header is examined by each router it passes through to direct the packet towards its destination. Dynamic routing is resource and power intensive, requiring buffers, a crossbar switch, and a switch allocator per router.

In static routing, a dedicated connection is set up between a source core and its destination cores. When a particular neural network is mapped onto the multi-core system, the communication pattern between the cores becomes deterministic. Thus the connectivity needed between the cores is pre-determined, and therefore static routing between the cores can be utilized (similar to routing between configurable logic blocks in an FPGA). This approach requires a routing switch. Each connection within the routing switch requires an SRAM cell to enable reconfiguration of the path for a particular network (Fig. 3) [8]. The key benefit of static routing is that it does not require dynamic routing logic, which can significantly reduce power consumption. If the channel utilizations are low, however, the area of static routing could be larger than that of dynamic routing.

Fig. 3. Static routing switch.
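To make the static switch concrete, here is a minimal Python sketch of an SRAM-configured switch in the spirit of Fig. 3. The class and method names are our own illustration, not an RTL description of the actual hardware.

```python
class StaticRoutingSwitch:
    """Illustrative model of the static routing switch of Fig. 3.

    Each potential input-to-output connection is controlled by one
    SRAM configuration bit driving a pass transistor. The bits are
    written once, when a particular neural network is mapped onto
    the chip; signals then flow with no per-packet routing logic.
    """

    def __init__(self, num_inputs, num_outputs):
        # One SRAM cell per crosspoint, all initially open.
        self.config = [[False] * num_outputs for _ in range(num_inputs)]

    def connect(self, inp, out):
        """Program the SRAM bit that closes crosspoint (inp, out)."""
        self.config[inp][out] = True

    def propagate(self, inputs):
        """Drive each output from the input(s) connected to it."""
        outputs = [None] * len(self.config[0])
        for i, row in enumerate(self.config):
            for j, closed in enumerate(row):
                if closed:
                    outputs[j] = inputs[i]
        return outputs

# A 5-input, 5-output switch (4 neighbor ports + 1 core port).
sw = StaticRoutingSwitch(5, 5)
sw.connect(0, 2)   # e.g., route the north input to the south output
```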
In this study we compare the area and power consumption of both static and dynamic routing for the multicore neural network architectures presented in [7]. We examine the impact of core performance and of the number of cores needed to implement a layer. Our results show that in almost all cases, static routing enables lower area and power consumption.

The rest of the paper is organized as follows: Section II describes related work in the area. Section III examines an approach to map neural networks to the cores in a multi-core system. Section IV presents an on-chip data transfer model for 2D mesh interconnections. Experimental results are presented in Section V, followed by conclusions in Section VI.

II. RELATED WORK

A range of neural network emulation projects are underway around the world, including GPU, FPGA, CMP, and ASIC based systems [9-13]. Nageswaran et al. [9] examined GPU acceleration of spiking neural models. Acceleration on high performance clusters [10] and FPGAs [11] has also been studied. Since GPU and FPGA systems consume relatively high power and area, they are not favorable for embedded applications.

The SpiNNaker project [12] at Manchester University is a multi-chip system in which each chip includes twenty simplified ARM 968 processors. Among these twenty cores, only eighteen are used for neural simulation, with each core simulating about 1,000 neurons. The cores communicate spike events to other cores through packets on an on-chip network. 128 MB of RAM is shared by all 18 cores through a DMA controller. Each chip is connected to six adjacent SpiNNaker chips through off-chip interconnections.

A crossbar memory based fully digital neuromorphic core was proposed in [14, 15]. This system consisted of 256 integrate-and-fire neurons and a 1024x256 SRAM crossbar memory for connectivity information. In contrast to traditional von Neumann architectures, this system integrated computation alongside memory, enabling it to achieve low power execution. The study did not examine how multiple cores would be connected on-chip to simulate large neural networks.

MOS transistor based analog neuron circuits have been studied extensively [16-18]. The Neurogrid project [17, 19] uses local analog wiring to minimize the need for digitization of on-chip communications. Spikes, rather than voltage levels, are propagated to destination synapses. The focus is to mimic biological neurons on a silicon chip.

In the era of multi-core architectures, on-chip communication networks are of significant interest to the research community. Several studies have examined on-chip routing in general purpose CMP interconnection architectures [20-22]. In the MIT RAW processor, processing tiles are connected to neighboring tiles in a 2D mesh using dynamic and static networks for communication [23].

There are several routing alternatives, such as unicast, multicast, multiple-unicast, and broadcast. In neural networks, one neuron is connected to several other neurons. Therefore a multicast routing scheme could reduce network traffic significantly. Multicast routing schemes are demonstrated in [24, 25].

III. MULTI-CORE ARCHITECTURE FOR NEUROMORPHIC APPLICATIONS

Chip multi-processors (CMPs) are one of the most common architectures for exploiting task level parallelism. Since neural network applications exhibit aggressive task level parallelism, we can achieve very high throughputs by executing them on CMPs.

A. Mapping of Neural Networks

The communication pattern in the on-chip interconnection network depends on how a neural network is mapped onto the processing cores. Consider an N core computing platform onto which we want to map an L layered neural network, where layer k has n_k neurons (k = 1, 2, ..., L). We map the neurons onto the processing cores uniformly, so each core simulates approximately (\sum_k n_k)/N neurons. To optimize communication delays, the cores corresponding to a layer of neurons should be as close to each other as possible (preferably a square sub-grid of the grid of cores). Neurons are simulated in a pipelined fashion to hide communication delays (see Fig. 1): when layer l neurons are being simulated for pattern i, layer l-1 neurons are simulated for pattern i+1.
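The following is a small Python sketch of the uniform mapping just described. The function name and the rounding policy (ceiling division) are our assumptions for illustration; the paper specifies only that each core simulates approximately (\sum_k n_k)/N neurons.

```python
import math

def map_layers_to_cores(layer_sizes, num_cores):
    """Uniformly assign neurons to cores, as in Section III.A.

    layer_sizes: [n_1, ..., n_L], neurons per layer.
    Returns the number of cores used per layer, sized so that each
    core simulates roughly sum(n_k)/N neurons. Cores for one layer
    would then be placed as a near-square sub-grid of the mesh.
    """
    total = sum(layer_sizes)
    per_core = math.ceil(total / num_cores)   # ~ (sum_k n_k) / N
    return [math.ceil(n / per_core) for n in layer_sizes]

# Example: the Fig. 1 network, three 100-neuron layers on 3 cores,
# gives one core per layer.
print(map_layers_to_cores([100, 100, 100], 3))   # -> [1, 1, 1]
```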
B. On-chip Interconnection Topology

Processing cores simulating neurons share only generated neuron outputs. These data are exchanged over an on-chip interconnection network among the processing cores. In this paper, we evaluate the bandwidth requirement for exchanging neuron output data.

Cores are connected by an interconnection network for sharing neuron outputs. We need reconfiguration capability in the interconnection network to be able to simulate different neural network topologies. In a shared bus interconnection, we would need a bus connecting each core to all the other cores. This design is not scalable, because increasing the number of cores requires a huge overhead of resources for wiring and for dealing with a large number of routing channels. In this paper, we examine 2D mesh networks (Fig. 2) for connecting the cores in a multicore neuromorphic architecture. Mesh interconnects are the most widely studied on-chip network topology due to their scalability and regularity.

Fig. 4. Multicast routing (S is the source core, and D are the destination cores).

C. Communication and Routing Technique

We assume that neurons are mapped onto a multi-core system such that each core simulates approximately the same number of neurons. For feed-forward neural networks, when a core finishes simulating all the layer l-1 neurons assigned to it, the core sends the corresponding neuron outputs to all processing units simulating layer l neurons. Multicast routing [24, 25] is efficient for sending neuron outputs to multiple destinations (Fig. 4). Multicast routing reduces network traffic and hence requires less bandwidth. For a particular neural network, routing is static, since a particular source and destination always use the same path to send data packets.

The number of bits used to represent each neuron output has an impact on the data transfer bandwidth. Some neural models require a single bit per neuron output (such as restricted Boltzmann machines). On the other hand, some models require multiple bits per neuron output. In this study, we assumed 8 bits per neuron output, although other bit widths would simply scale our results.

IV. ON-CHIP BANDWIDTH REQUIREMENT FOR COMMUNICATION

The on-chip interconnection network enables the exchange of generated neuron outputs among cores. The on-chip interconnection-link bandwidth depends on several issues, such as the shared data generation rate, how many other cores are sharing the data, and where the destination cores are located with respect to the source cores. In this paper we assume a link stands for two sets of wires, enabling bi-directional data transfers.

To understand the core-to-core link bandwidth requirements, consider the examples shown in Figs. 5 and 6. Fig. 5a represents the relative bandwidth needed for sending the outputs to the next layer of cores. In this example, we examine a 4x4 collection of cores simulating one layer, and assume each core generates 1 output per time step. Thus each column of cores will generate 4 packets per time step. Fig. 5b shows the relative bandwidth on the internal links for distributing the incoming packets from the previous layer (l-1). Since we assume a pipelined evaluation of the inputs, the relative bandwidths of Figs. 5a and 5b will be overlapped. Fig. 5c shows the overall relative bandwidth for the layer l core links by combining the bandwidths from Figs. 5a and 5b. Fig. 6 shows the internal link relative bandwidths for another positioning of cores relative to their adjacent layers. It is seen that the internal link bandwidth in Fig. 6c is higher than in Fig. 5c.

Fig. 5. Traffic pattern when the layer l-1 and layer l+1 cores are at two opposite sides of the layer l cores.

Fig. 6. Traffic pattern when the layer l-1 and layer l+1 cores are at sides perpendicular to each other.

The maximum link bandwidth in this system was derived in [26], and is summarized here for convenience. Assume that the number of processing cores simulating a single layer of neurons is n_1 x n_2 and b is the number of bits used to encode a single neuron output. Additionally, assume that each core has a performance of P neurons per second. The required link bandwidth for this system is given by:

  B_{link} = (\max\{n_1, n_2\} - 1 + n_1 n_2) P b bps    (3)

The maximum link bandwidth in a particular direction (per port) is given by:

  (n_1 n_2 + |n_2 - n_1| - 1) P b bps    (4)

The bandwidth requirement of a link determines the minimum number of wires needed for that link. In this study, we assume 5 input ports and 5 output ports per router or switch (Fig. 3). Four of these are for communicating with neighboring routers, and one is for communicating with the attached core. Each port can have multiple wires, and each port can handle a finite number of bits per second. Equation (4) indicates that the link bandwidth increases with the maximum number of cores per layer (n_1 n_2). This also increases the number of wires needed per port.
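The bandwidth model of Eqs. (3) and (4) is easy to evaluate directly. The Python sketch below does so (the function names are ours), using a 2x2 layer with 10^6 neurons/s per core and 8-bit outputs as an example.

```python
def link_bandwidth(n1, n2, P, b):
    """Required link bandwidth from Eq. (3), in bits per second.

    n1 x n2 cores simulate one layer, each core computes P neuron
    outputs per second, and each output is encoded with b bits.
    """
    return (max(n1, n2) - 1 + n1 * n2) * P * b

def max_directional_bandwidth(n1, n2, P, b):
    """Maximum per-port bandwidth in one direction, from Eq. (4)."""
    return (n1 * n2 + abs(n2 - n1) - 1) * P * b

# Example: a 2x2 layer (nmax = 4), 10^6 neurons/s per core, 8-bit outputs.
print(link_bandwidth(2, 2, 1e6, 8))             # 4.0e7 bps
print(max_directional_bandwidth(2, 2, 1e6, 8))  # 2.4e7 bps
```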
In the rest of the paper we denote the maximum number of cores per layer (n_1 n_2) as nmax. This is the maximum number of cores per layer we can have without making routing a performance bottleneck.

V. EXPERIMENTAL EVALUATION

Several system configurations were examined to evaluate the area and power of static versus dynamic routing. The maximum number of cores per layer (nmax) was set to 4, 9, and 25. The maximum number of channels needed per port for static routing in each case is shown in Table 1. For each case, the performance of the cores was varied from 10^5 to 2x10^8 neurons per second, with each neuron output assumed to be 8 bits wide. We utilized the link bandwidth model from [26] to evaluate data transfer power. Increasing the core performance increases the amount of data put onto the routing network, which results in more wires being needed per port and a higher link utilization. Hence both the routing area and power increase with core performance.

The power and area were calculated assuming a 32 nm process technology, a 200 MHz clock, and a 1 mm2 neural core area. In this study, the core area was not changed with variation in core performance. The area and power of the dynamic router were calculated using the Orion [27] interconnection network simulator. The static routing switch consists of an array of SRAM cells, each controlling a pass transistor. The area and power of the SRAM cells were calculated using the Cacti [28] cache model.

Table 1: Channels per port

  Maximum cores per layer | Maximum channels per port
  ------------------------|--------------------------
             4            |             3
             9            |             8
            25            |            24

Table 1 shows the maximum number of channels needed per port. In the dynamic routing case, all these channels could potentially be handled by one physical wire if the link utilization is low. In the static routing case, each channel requires its own physical wire.

Fig. 7 shows the power (mW), area (mm2), wires per port, and wire utilization variation with core performance. In almost all cases, it is seen that the area and power of the static routing network are lower than those of the equivalent dynamic routing network. The dynamic routing network has a lower area only when the core performance is low. In these cases the link utilization of the network is low, allowing the dynamic network to have fewer wires per port, and thus a lower overall area. As shown in Fig. 7, the lowest number of wires per port for the dynamic network is 1, while the minimum for the static network is given by the number of channels per port shown in Table 1. From Fig. 7, it is seen that the link utilization reaches about 100% at a core performance of about 50 million neurons per second. These results indicate that static routing is likely to be the optimal approach for multicore neuromorphic processors.

Fig. 7. Power (mW), area (mm2), wires per port, and wire utilization variation with core performance (in 10^6 neurons/s). Both dynamic routing (dx) and static routing (sx) data are shown. Three cases are presented, where the maximum number of cores per layer (x) is restricted to: (a) 4, (b) 9, and (c) 25. The data is presented for the router and links with the highest utilization within the system.
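As a rough way to connect Eq. (4) to the wires-per-port curves of Fig. 7, the sketch below estimates the number of physical wires by dividing the peak one-direction bandwidth by a per-wire rate. The assumption of one bit per wire per cycle at the 200 MHz clock is ours; the paper does not state the signaling scheme.

```python
import math

def wires_per_port(n1, n2, P, b, f_clk=200e6):
    """Rough estimate of physical wires needed per port for dynamic
    routing: the peak one-direction bandwidth of Eq. (4) divided by
    the per-wire rate. Assumes one bit per wire per cycle at the
    paper's 200 MHz clock (our assumption).
    """
    bw = (n1 * n2 + abs(n2 - n1) - 1) * P * b
    return max(1, math.ceil(bw / f_clk))

# Sweep core performance for the nmax = 4 (2x2) configuration.
for P in (1e5, 1e6, 1e7, 1e8, 2e8):
    print(f"P = {P:.0e} neurons/s -> {wires_per_port(2, 2, P, 8)} wire(s)")
```

Consistent with Fig. 7, a low core performance yields a single wire per port for the dynamic network under these assumptions, while the wire count grows with core performance.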
VI. CONCLUSION

In this paper we have studied two on-chip routing techniques, static and dynamic, for multi-core neuromorphic systems. For each case we have examined the routing resource requirements. Our experimental results show that in multi-core neuromorphic architectures, static routing requires less routing area and power compared to dynamic routing. A similar analysis could be done for other neural networks (such as spiking and convolutional), likely generating similar results.

REFERENCES

[1] H. Esmaeilzadeh, E. Blem, R. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, 2011, pp. 365-376.
[2] T. Chen, Y. Chen, M. Duranton, Q. Guo, A. Hashmi, M. Lipasti, A. Nere, S. Qiu, M. Sebag, and O. Temam, "BenchNN: On the Broad Potential Application Scope of Hardware Neural Network Accelerators," IEEE International Symposium on Workload Characterization (IISWC), November 2012.
[3] P. Dubey, "Recognition, mining and synthesis moves computers to the era of tera," Technology@Intel Magazine, Feb. 2005.
[4] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural Acceleration for General-Purpose Approximate Programs," International Symposium on Microarchitecture (MICRO), 2012.
[5] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207-3220, 2010.
[6] http://yann.lecun.com/exdb/mnist/
[7] T. Taha, R. Hasan, C. Yakopcic, and M. McLean, "Exploring the Design Space of Specialized Multicore Neural Processors," International Joint Conference on Neural Networks (IJCNN), August 2013.
[8] W. Wang, T. T. Jing, and B. Butcher, "FPGA based on integration of memristors and CMOS devices," Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1963-1966, May 30-June 2, 2010.
[9] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 3201-3208, NJ, USA, 2009.
[10] T. M. Taha, P. Yalamanchili, M. Bhuiyan, R. Jalasutram, C. Chen, and R. Linderman, "Neuromorphic algorithms on clusters of PlayStation 3s," International Joint Conference on Neural Networks (IJCNN), pp. 1-10, 18-23 July 2010.
[11] H. Hellmich and H. Klar, "An FPGA based simulation acceleration platform for spiking neural networks," The 47th Midwest Symposium on Circuits and Systems (MWSCAS), vol. 2, pp. 389-392, July 2004.
[12] S. B. Furber, S. Temple, and A. D. Brown, "High-Performance Computing for Systems of Spiking Neurons," Proceedings of AISB'06 workshop on GC5: Architecture of Brain and Mind, vol. 2, pp. 29-36, Bristol, April 2006.
[13] J. Schemmel, J. Fieres, and K. Meier, "Wafer-Scale Integration of Analog Neural Networks," IEEE International Joint Conference on Neural Networks (IJCNN), 2008.
[14] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," IEEE Custom Integrated Circuits Conference (CICC), pp. 1-4, 19-21 Sept. 2011.
[15] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," International Joint Conference on Neural Networks (IJCNN), pp. 1-8, June 2012.
[16] G. Indiveri, B. Linares-Barranco, T. J. Hamilton, A. van Schaik, R. Etienne-Cummings, T. Delbruck, S. C. Liu, P. Dudek, P. Häfliger, S. Renaud, J. Schemmel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen, "Neuromorphic silicon neuron circuits," Frontiers in Neuroscience, 2011.
[17] P. A. Merolla and K. Boahen, "Dynamic computation in a recurrent network of heterogeneous silicon neurons," Proceedings of IEEE International Symposium on Circuits and Systems, May 2006.
[18] J. H. B. Wijekoon and P. Dudek, "Compact silicon neuron circuit with spiking and bursting behaviour," Neural Networks, vol. 21, no. 2-3, pp. 524-534, 2008.
[19] J. Lin, P. Merolla, J. Arthur, and K. Boahen, "Programmable Connections in Neuromorphic Grids," IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 80-84, Aug. 2006.
[20] T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), NY, USA, 2009.
[21] A. Bakhoda, J. Kim, and T. M. Aamodt, "Throughput-Effective On-Chip Networks for Manycore Accelerators," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 421-432, USA, 2010.
[22] J. Kim, "Low-cost router microarchitecture for on-chip networks," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), NY, USA, pp. 255-266, 2009.
[23] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffmann, P. Johnson, J. W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs," IEEE Micro, Mar/Apr 2002.
[24] N. E. Jerger, L. S. Peh, and M. Lipasti, "Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support," Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA), June 2008.
[25] J. Wu and S. Furber, "A Multicast Routing Scheme for a Universal Spiking Neural Network Architecture," The Computer Journal, vol. 53, no. 3, pp. 280-288, March 2010.
[26] R. Hasan and T. Taha, "Routing Bandwidth Model for Feed Forward Neural Networks on Multicore Neuromorphic Architectures," International Joint Conference on Neural Networks (IJCNN), August 2013.
[27] A. B. Kahng, B. Li, L. S. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 423-428, April 2009.
[28] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), Washington, DC, USA, pp. 3-14, 2007.