

Compass: A scalable simulator for an architecture for Cognitive Computing

Robert Preissl, Theodore M. Wong, Pallab Datta, Myron Flickner, Raghavendra Singh, Steven K. Esser, William P. Risk, Horst D. Simon†, and Dharmendra S. Modha

IBM Research - Almaden, San Jose, CA 95120; † Lawrence Berkeley National Lab, Berkeley, CA 94720

Contact e-mail: dmodha@us.ibm.com

Abstract: Inspired by the function, power, and volume of the organic brain, we are developing TrueNorth, a novel modular, non-von Neumann, ultra-low power, compact architecture. TrueNorth consists of a scalable network of neurosynaptic cores, with each core containing neurons, dendrites, synapses, and axons. To set sail for TrueNorth, we developed Compass, a multi-threaded, massively parallel functional simulator and a parallel compiler that maps a network of long-distance pathways in the macaque monkey brain to TrueNorth. We demonstrate near-perfect weak scaling on a 16-rack IBM® Blue Gene®/Q (262144 CPUs, 256 TB memory), achieving an unprecedented scale of 256 million neurosynaptic cores containing 65 billion neurons and 16 trillion synapses, running only 388× slower than real time with an average spiking rate of 8.1 Hz. By using emerging PGAS communication primitives, we also demonstrate 2× better real-time performance over MPI primitives on a 4-rack Blue Gene/P (16384 CPUs, 16 TB memory).

I. INTRODUCTION

The brain and modern computers have radically different architectures [1], suited for complementary applications. Modern computing posits a stored program model, traditionally implemented in digital, synchronous, serial, centralized, fast, hardwired, general-purpose, brittle circuits, with explicit memory addressing imposing a dichotomy between computation and data. In stark contrast, the brain uses replicated computational units of neurons and synapses implemented in mixed-mode analog-digital, asynchronous, parallel, distributed, slow, reconfigurable, specialized, and fault-tolerant biological substrates, with implicit memory addressing blurring the boundary between computation and data [2]. It is therefore no surprise that one cannot emulate the function, power, volume, and real-time performance of the brain within the modern computer architecture. This task requires a radically novel architecture.

Today, one must still build novel architectures in CMOS technology, which has evolved over the past half-century to serve modern computers and which is not optimized for delivering brain-like functionality in a compact, ultra-low-power package. For example, biophysical richness of neurons and 3D physical wiring are out of the question at the very outset. We need to shift attention from neuroscientific richness that is sufficient to mathematical primitives that are necessary. A question of profound relevance to science, technology, business, government, and society is how closely one can approximate the function, power, volume, and real-time performance of the brain within the limits of modern technology.

To this end, under the auspices of DARPA SyNAPSE, we are developing a novel, ultra-low power, compact, modular architecture called TrueNorth that consists of an interconnected and communicating network of extremely large numbers of neurosynaptic cores [3], [4], [5], [6]. Each core integrates computation (neurons), memory (synapses), and intra-core communication (axons), breaking the von Neumann bottleneck [7]. Each core is event-driven (as opposed to clock-driven), reconfigurable, and consumes ultra-low power. Cores operate in parallel and send unidirectional messages to one another; that is, neurons on a source core send spikes to axons on a target core. One can think of cores as gray matter canonical cortical microcircuits [8], and of inter-core connectivity as long-distance white matter [9]. Like the cerebral cortex, TrueNorth is highly scalable in terms of number of cores. TrueNorth is therefore a novel non-von Neumann architecture that captures the essence of neuroscience within the limits of current CMOS technology.

As our main contribution, we describe an architectural simulator called Compass for TrueNorth, and demonstrate its near-perfect weak and strong scaling performance when taking advantage of native multi-threading on IBM® Blue Gene®/Q compute nodes. Compass has one-to-one equivalence to the functionality of TrueNorth.

Compass is multi-threaded, massively parallel, and highly scalable, and incorporates several innovations in communication, computation, and memory. On a 16-rack IBM Blue Gene/Q supercomputer with 262144 processors and 256 TB of main memory, Compass simulated an unprecedented 256M (10⁶) TrueNorth cores containing 65B (10⁹) neurons and 16T (10¹²) synapses. These results are 3× the number of estimated neurons [10] in the human cortex, comparable to the number of synapses in the monkey cortex, and 0.08× the number of synapses in the human cortex¹. At an average neuron spiking rate of 8.1 Hz, the simulation is only 388× slower than real time.

Additionally, we determine how many TrueNorth cores Compass can simulate under a soft real-time constraint. We compare approaches using both PGAS and MPI communication primitives to find the most efficient in terms of latency and bandwidth. We note that one-sided inter-core communication from a neuron on one core to an axon on another core residing on a different compute node maps naturally to the PGAS model.

¹The disparity between neuron and synapse scales arises because the ratio of neurons to synapses is 1:256 in TrueNorth, but is 1:10000 in the brain.

SC12, November 10-16, 2012, Salt Lake City, Utah, USA. 978-1-4673-0806-9/12/$31.00 ©2012 IEEE

DARPA: Approved for public release; distribution is unlimited.

Compass enables large-scale simulations that provide brain-like function as a precursor to demonstrating the same functionality on TrueNorth hardware. Our goal is not to use Compass to model the brain, but to approximate brain-like function that integrates multiple sensory-motor modalities. We believe that function follows form. The form that we have chosen to concentrate on in this paper leverages the largest known long-distance wiring diagram in the macaque brain, which spans cortex, thalamus, and basal ganglia [9]. Concretely, we use the macaque wiring diagram to instantiate an inter-core wiring pattern. The richness of the wiring diagram challenges the communication and computational capabilities of Compass in a manner consistent with supporting brain-like networks. To this end, we have built a novel parallel compiler that takes a high-level network description of the macaque wiring diagram and converts it to the parameters needed to configure a network of TrueNorth cores. Our compiler is sufficiently general to support other application-driven networks.

Compass is indispensable for (a) verifying TrueNorth correctness via regression testing; (b) studying TrueNorth dynamics; (c) benchmarking inter-core communication topologies; (d) demonstrating applications in vision, audition, real-time motor control, and sensor integration; (e) estimating power consumption; and (f) hypothesis testing, verification, and iteration regarding neural codes and function. We have used Compass to demonstrate numerous applications of the TrueNorth architecture, such as optic flow, attention mechanisms, image and audio classification, multi-modal image-audio classification, character recognition, robotic navigation, and spatio-temporal feature extraction.

Compass differs completely from our previous simulator C2 [11], [12]. First, the fundamental data structure is a neurosynaptic core instead of a synapse; the synapse is simplified to a bit, resulting in 32× less storage required for the synapse data structure as compared to C2. Second, the local and long-distance anatomical connectivity in Compass, which emulate respectively the intra-core and inter-core constraints in TrueNorth, have no counterpart in C2. Third, the neuron dynamics equations in Compass are amenable to efficient hardware implementation, whereas C2 focused on single-compartment phenomenological dynamic neuron models [13]. Fourth, Compass uses a fully multi-threaded programming model, whereas C2 used a flat MPI programming model, rendering it incapable of exploiting the full potential of Blue Gene/Q. Last, Compass incorporates an in situ compiler for generating a complete TrueNorth model from a compact CoreObject description file, which reduces simulation set-up times by three orders of magnitude. As a result of these innovations, Compass demonstrates unprecedented weak scaling beyond the cat brain scale achieved by C2. The architecture of Compass stands in contrast to other computational neuroscientific simulators [14], [15], [16], [17], [18], [19], [20]. For reference, we point readers at reviews of large-scale brain simulations [21], [22].

Compass is a harbinger of an emerging use of today's modern supercomputers for midwifing the next generation of application-specific processors, which are increasingly proliferating to satisfy a world that is hungering for increased performance and lower power while facing the projected end of CMOS scaling and increasing obstacles in pushing clock rates ever higher.
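The synapse-as-a-bit simplification mentioned above admits a very compact crossbar representation. As a rough illustration of the storage saving (our own sketch, not the actual C2 or Compass data layout; the class and method names are invented), a 256-wide binary crossbar row fits in four 64-bit words instead of 256 multi-byte synapse records:

```python
# Sketch: one 256-wide binary crossbar row packed into 64-bit words.
# Names (CrossbarRow, set_synapse, is_set) are illustrative, not from Compass.

WORDS_PER_ROW = 256 // 64  # 4 words hold the 256 binary synapses of one axon row

class CrossbarRow:
    def __init__(self):
        self.words = [0] * WORDS_PER_ROW

    def set_synapse(self, j):
        """Mark the connection from this axon to dendrite/neuron j."""
        self.words[j // 64] |= 1 << (j % 64)

    def is_set(self, j):
        return (self.words[j // 64] >> (j % 64)) & 1 == 1

    def targets(self):
        """Indices of neurons this axon row connects to."""
        return [j for j in range(256) if self.is_set(j)]

row = CrossbarRow()
row.set_synapse(0)
row.set_synapse(255)
```

At 32 bytes per row, a full 256×256 crossbar occupies 8 KB per core, which is what makes hundreds of millions of simulated cores feasible in memory.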

II. THE TRUENORTH ARCHITECTURE

TrueNorth is a scalable neurosynaptic computer architecture developed at IBM under the DARPA SyNAPSE program. Inspired by the organization and function of the brain, TrueNorth employs a similar vocabulary, describing CMOS circuit elements as neurons, axons, dendrites, and synapses.

Fig. 1. Conceptual architecture of a neurosynaptic core, incorporating a synaptic crossbar that contains axons (horizontal lines), dendrites (vertical lines), synapses (intersections of horizontal axons and vertical dendrites), and neurons attached to dendrites. A buffer for incoming spikes precedes each axon to account for axonal delays.

The key building block of TrueNorth is a neurosynaptic core [3], [4], as shown in Figure 1, connected through a communication network to other TrueNorth cores in the system. The specific instance of a neurosynaptic core that we simulate has 256 axons, 256 dendrites feeding to 256 neurons, and a 256×256 binary crossbar synaptic array. Neurons are digital integrate-leak-and-fire circuits characterized by configurable parameters sufficient to produce a rich repertoire of dynamic and functional behavior. A neuron on any TrueNorth core can connect to an axon on any TrueNorth core in the network; when the neuron fires, it sends a spike message to the axon.

Each TrueNorth core operates in a parallel, distributed, and semi-synchronous fashion. Each TrueNorth core receives a slow clock tick at 1000 Hz to discretize neuron dynamics in 1-millisecond time steps; otherwise, neuronal spike events drive all operations in the system. When a TrueNorth core receives a tick from the slow clock, it cycles through each of its axons. For each axon, if the axon has a spike ready for delivery at the current time step in its buffer, each synaptic value on the horizontal synaptic line connected to the axon is delivered to its corresponding post-synaptic neuron in the TrueNorth core. If the synapse value for a particular axon-neuron pair is non-zero, then the neuron increments its membrane potential by a (possibly stochastic) weight corresponding to the axon type. After all axons are processed, each neuron applies a configurable, possibly stochastic leak, and a neuron whose membrane potential exceeds its threshold fires a spike. Each spike is then delivered via the communication network to its corresponding target axon. An axon that receives a spike schedules the spike for delivery at a future time step in its buffer. The entire cycle repeats when the TrueNorth core receives the next tick from the slow clock.
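The per-tick dynamics just described can be sketched in software as follows. This is a minimal toy model under simplifying assumptions of our own (deterministic weights and leak, no axonal delay buffers, a 4-wide core instead of 256, and reset-to-zero on firing); it is not the TrueNorth specification:

```python
# Illustrative integrate-leak-and-fire tick for one core; names and the
# reset-to-zero rule are our assumptions, not the TrueNorth parameter set.

N = 4  # toy core width instead of 256

def tick(axon_ready, crossbar, weights, potential, leak, threshold):
    """axon_ready[i]: axon i has a spike this tick; crossbar[i][j]: synapse bit;
    weights[i]: weight contributed by axon i's type. Returns firing neuron indices."""
    # Synapse phase: deliver spikes along each ready axon's crossbar row.
    for i in range(N):
        if axon_ready[i]:
            for j in range(N):
                if crossbar[i][j]:
                    potential[j] += weights[i]
    # Neuron phase: apply leak, then fire (and reset) neurons above threshold.
    fired = []
    for j in range(N):
        potential[j] -= leak
        if potential[j] > threshold:
            fired.append(j)
            potential[j] = 0
    return fired

crossbar = [[1, 0, 1, 0]] * N  # every axon connects to neurons 0 and 2
potential = [0] * N
fired = tick([True, True, False, False], crossbar, [2, 2, 2, 2], potential, 1, 2)
```

Here neurons 0 and 2 each integrate two spikes of weight 2, leak by 1 to reach 3, exceed the threshold of 2, and fire.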

Neurons represent computation, synapses represent memory, and neuron-axon connections represent communication. Each TrueNorth core brings computation and memory into extreme proximity, breaking the von Neumann bottleneck. Note that synaptic or neuronal state never leaves a TrueNorth core; only spikes ever leave or enter any TrueNorth core. The communication network is driven solely by spike events and requires no clocks. The neuron parameters, synaptic crossbar, and target axon for each neuron are reconfigurable throughout the system.

To ensure one-to-one equivalence between TrueNorth and Compass, we have taken judicious care in the design of both systems. For example, we have bypassed the complexity of the analog neurons traditionally associated with neuromorphic systems. Also, we have adopted pseudo-random number generators with configurable seeds. As a result, Compass has become the key contract between our hardware architects and software algorithm/application designers.

III. THE COMPASS NEUROSYNAPTIC CHIP SIMULATOR

Compass is an architectural simulator for large-scale network models of TrueNorth cores, and is implemented using a combination of MPI library calls and OpenMP threading primitives. Compass partitions the TrueNorth cores in a model across several processes, and distributes the TrueNorth cores residing in the same shared memory space within a process among multiple threads. Threads within a process independently simulate the synaptic crossbar and neuron behavior of one or more TrueNorth cores in the model. A master thread then sends any spikes destined for TrueNorth cores on remote processes in an aggregated fashion; upon reception, all threads deliver spikes to TrueNorth cores on the local process.

Compass executes its main simulation loop in a semi-synchronous manner. Listing 1 shows pseudo-code for the main simulation phases that each process in Compass executes; each pass through the phases simulates a single tick of the global clock in a system of TrueNorth cores. In the Synapse phase, each process propagates input spikes from axons to neurons through the crossbar. Next, in the Neuron phase, each process executes the integrate, leak, and fire model for each neuron. Last, in the Network phase, processes communicate to send spikes from firing neurons and deliver spikes to destination axons.

    #pragma omp shared(localBuf, remoteBuf)
    #pragma omp shared(TrueNorthCores, remoteBufAgg)
    #pragma omp parallel
    {
      for core in TrueNorthCores[threadId] {
        // Synapse phase
        for axon in core.axons
          axon.propagateSpike()
        // Neuron phase
        for neuron in core.neurons {
          spike S = neuron.integrateLeakFire()
          if (S.dest == local)
            localBuf[threadId].push(S)
          else
            remoteBuf[threadId][S.dest].push(S)
        }
      }
      #pragma omp barrier
      threadAggregate(remoteBuf, remoteBufAgg)
      if (threadId == 0) {
        for dest in remoteDests {
          if (remoteBufAgg[dest].size() != 0) {
            MPI_Isend(remoteBufAgg[dest])
            sendCounts[dest]++
          }
        }
      }
      // Network phase
      if (threadId == 0) {
        MPI_Reduce_scatter(sendCounts, recvCount)
      } else {
        threadBuf = partition(localBuf, thread)
        deliver(spikes in threadBuf)
      }
      #pragma omp barrier
      #pragma omp for
      for message in recvCount messages {
        #pragma omp critical
        {
          MPI_Iprobe(status)
          MPI_Get_count(status, recvLength)
          MPI_Recv(recvBuf, recvLength)
        } // end critical
        deliver(spikes in recvBuf)
      } // end omp for
    } // end omp parallel

Listing 1. Pseudo-code of the main simulation loop in Compass.

To minimize communication overhead, Compass aggregates spikes between pairs of processes into a single MPI message. At startup, each process in Compass gets all destination (TrueNorth core, axon) pairs from the neurons within its hosted TrueNorth cores, uses an implicit TrueNorth-core-to-process map to identify all destinations on remote processes, and preallocates per-process send buffers. During a tick, the process aggregates all spikes for TrueNorth cores at remote processes into the corresponding buffer, and uses a single MPI send call to transmit each buffer once all neurons have integrated, leaked, and fired.
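The aggregation strategy above can be sketched as a toy model (the map and buffer names are our own, and real Compass uses preallocated buffers and MPI calls rather than Python dicts): spikes are binned by destination process so that each remote process receives one message per tick, regardless of how many spikes it carries.

```python
# Toy sketch of binning spikes into one send buffer per destination process.
# core_to_process stands in for Compass's implicit core-to-process map.

core_to_process = {0: 0, 1: 0, 2: 1, 3: 2}
LOCAL_RANK = 0

def aggregate(spikes):
    """spikes: list of (dest_core, dest_axon) pairs.
    Returns (local spikes, {remote rank: spike buffer})."""
    local, remote = [], {}
    for dest_core, dest_axon in spikes:
        rank = core_to_process[dest_core]
        if rank == LOCAL_RANK:
            local.append((dest_core, dest_axon))
        else:
            remote.setdefault(rank, []).append((dest_core, dest_axon))
    return local, remote  # each remote[rank] would go out as one MPI send

local, remote = aggregate([(0, 5), (2, 9), (3, 1), (2, 7)])
```

In this example, four spikes collapse into one local delivery plus exactly two MPI messages (one to process 1 carrying two spikes, one to process 2).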

In detail, each Compass process forks multiple OpenMP threads, which execute the main Compass phases in parallel as follows:

• Synapse phase: For each axon within a set of TrueNorth cores dedicated to a specific thread, the thread checks if a spike is ready for delivery. If a spike is ready, Compass propagates the spike along the row in the synaptic crossbar connected to the axon. For each set connection in the row, Compass buffers the spike for integration at the neuron connected to the set connection. For example, if the ith axon in a TrueNorth core has a spike ready for delivery and the (i, j)th connection is set in the crossbar, then Compass propagates the spike to the jth neuron.

• Neuron phase: For each neuron within a TrueNorth core, threads integrate all propagated spikes, leak potential, and generate spikes as appropriate. Threads then aggregate spikes destined for TrueNorth cores at remote processes into per-process remote destination buffers (remoteBufAgg), so that spikes are consecutively laid out in memory for MPI message transfers. The master thread within each process lastly uses a single MPI message to send each remote destination buffer to the corresponding remote process.

• Network phase: The master thread uses an MPI Reduce-Scatter operation to determine how many incoming messages to expect. In parallel, Compass divides the local destination buffer uniformly among all non-master threads, each of which then immediately delivers spikes from its portion to the destination TrueNorth cores (in detail, to the corresponding axon buffer at a specific destination axon). Performance is improved since the processing of local spikes by non-master threads overlaps with the Reduce-Scatter operation performed by the master thread.
After the master thread completes the Reduce-Scatter operation, and after all non-master threads deliver spikes from the local buffer, all threads take turns to receive an MPI message and deliver each spike contained in the message to its destination TrueNorth core residing in the shared memory space of the Compass process. Each thread receives MPI messages in a critical section, due to thread-safety issues in the MPI library [23], but delivers the spikes within the messages outside of the critical section.
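The Reduce-Scatter step exists so that each process learns how many point-to-point messages to expect without a global handshake: conceptually, the per-destination send counts from all processes are summed column-wise, and column q of the result is delivered to process q. A serial sketch of that arithmetic (not actual MPI code) is:

```python
# Serial sketch of what MPI_Reduce_scatter computes for message counting.
# send_counts[p][q] = 1 if process p sent an aggregated spike message to q.

def reduce_scatter_counts(send_counts):
    """Return recv_count, where recv_count[q] = number of messages bound for q."""
    nprocs = len(send_counts)
    return [sum(send_counts[p][q] for p in range(nprocs)) for q in range(nprocs)]

# Three processes: 0 sends to 1 and 2; 1 sends to 2; 2 sends nothing.
send_counts = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
recv_count = reduce_scatter_counts(send_counts)
```

Each process q can then post exactly recv_count[q] probe/receive iterations, which is what bounds the message loop in Listing 1.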

IV. PARALLEL COMPASS COMPILER

Compass is capable of simulating networks of tens of millions of TrueNorth cores. To build applications for such large-scale TrueNorth networks, we envisage first implementing libraries of functional primitives that run on one or more interconnected TrueNorth cores. We can then build richer applications by instantiating and connecting regions of functional primitives. Configuring such rich applications in Compass (or, for that matter, on TrueNorth hardware) can potentially require setting trillions of parameters. To make the configuration more tractable, we require tools to create lower-level core parameter specifications from higher-level, more compact descriptions of the functional regions.

We have designed and implemented a parallel processing tool called the Parallel Compass Compiler (PCC) that translates a compact definition of functional regions of TrueNorth cores into the explicit neuron parameter, synaptic connection parameter, and neuron-to-axon connectivity declarations required by Compass. PCC works to minimize MPI message counts within the Compass main simulation loop by assigning TrueNorth cores in the same functional region to as few Compass processes as necessary. This minimization enables Compass to use faster shared-memory communication to handle most intra-region spiking, reserving more expensive MPI communication mostly for inter-region spiking. Overall, the number of Compass processes required to simulate a particular region corresponds to the computation (neuron count) and memory (synaptic connection count) required for the region, which in turn is related to the functional complexity of the region.

We show an illustrative example of three inter- and intra-connected functional regions in Figure 2. In our example, three processes manage each of the functional regions A and C, and two processes manage functional region B. For illustration clarity, we present only connections of process 0 (in region A) and of process 3 (in region B) to region C. The thickness of the edges between processes reflects the strength of the connection; thicker edges represent more neuron connections.

PCC constructs the connectivity graph between functional regions using a distributed algorithm in which each parallel PCC process compiles TrueNorth core parameters for at most one functional region. To create neuron-to-axon connections between regions, the PCC process managing the target region uses MPI message operations to send the global core ID and axon ID of an available axon to the PCC process managing the source region. To create neuron-to-axon connections within a region (for example, as with process P0 in Figure 2), a PCC process connects cores locally; we use OpenMP threading primitives to exploit thread-level parallelism.

This exchange of information happens in an aggregated, per-process-pair fashion. As illustrated in Figure 2, if K neurons on P0 connect to P5, then P5 sends the TrueNorth core IDs and the K axon IDs to P0 using MPI_Isend; the axon types and the synaptic crossbars on the corresponding TrueNorth cores on P5 are set up simultaneously. On P0, the received TrueNorth core IDs and axon IDs are used to connect the source TrueNorth cores on P0 with the target TrueNorth cores on P5. We require a realizability mechanism for connections, to guarantee that each target process has enough TrueNorth cores to satisfy incoming connection requests. One way to accomplish this is to normalize the network to have consistent neuron and axon requirements. This is equivalent to normalizing the connection matrix to have pre-specified row and column sums [24], [25], [26], a generalization of doubly stochastic matrices. This procedure is known as the iterative proportional fitting procedure (IPFP) in statistics, and as matrix balancing in linear algebra.

Fig. 2. Illustration of a neurosynaptic core network built using the Parallel Compass Compiler across multiple processes, showing connectivity within and across functional regions.
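The balancing step can be sketched with a textbook IPFP (Sinkhorn-style) iteration, which is our own minimal illustration rather than the PCC implementation: alternately rescaling rows and columns of a positive matrix drives its row and column sums toward the prescribed targets.

```python
# Textbook iterative proportional fitting (matrix balancing); not PCC code.

def ipfp(matrix, row_targets, col_targets, iters=200):
    """Alternately rescale rows and columns until the row/column sums
    approach the given targets (targets must have equal totals)."""
    m = [row[:] for row in matrix]
    for _ in range(iters):
        for i, target in enumerate(row_targets):        # row scaling
            s = sum(m[i])
            if s > 0:
                m[i] = [v * target / s for v in m[i]]
        for j, target in enumerate(col_targets):        # column scaling
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
    return m

m = ipfp([[1, 1], [1, 2]], row_targets=[3, 2], col_targets=[2, 3])
```

For a strictly positive matrix with matching target totals, the iteration converges geometrically, so a few hundred sweeps suffice for this toy example.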

The high-level network description describing the network connectivity is expressed in a relatively small and compact CoreObject file. For large-scale simulations of millions of TrueNorth cores, the network model specification for Compass can be on the order of several terabytes; offline generation and copying of such large files is impractical. Parallel model generation using the compiler requires only a few minutes, as compared to several hours to read or write such a model to disk. Once the compiler completes the wiring of the neurosynaptic cores across all regions, the TrueNorth cores from each process are instantiated within Compass, and the TrueNorth cores in the compiler are deallocated to free up space. Finally, Compass simulates the network model that was created.

V. COCOMAC MACAQUE BRAIN MODEL NETWORK

Higher cognition in the brain is thought to emerge from the thalamocortical system, which can be divided into functionally specialized thalamic or cortical regions. Each region is composed of millions or billions of neurons, cells specialized to receive, integrate, and send messages. Long-range white matter connections provide communication between neurons in different regions, while short-range gray matter connections provide for communication between neurons within the same region. Tremendous efforts towards mapping the brain and its connectivity have been made over decades of research, a tiny fraction of which will be drawn from here to construct a simple network that nevertheless captures several key characteristics desirable for TrueNorth simulations. In building our test network, we establish natural parallels between brain regions, white matter connectivity, gray matter connectivity, and the underlying simulator hardware. We simulate each brain region using non-overlapping sets of one or more processes, such that white matter communication becomes equivalent to inter-process communication, and we then establish gray matter connectivity as within-process communication.

A. Structure

It is hypothesized that distinct functions are supported by signature subnetworks throughout the brain that facilitate information flow, integration, and cooperation across functionally differentiated, distributed centers. We sought to capture the richness found in these networks by specifying our test network's modules, and the connectivity between them, using a very large scale database of brain regions and connectivity in the macaque monkey called CoCoMac [27], [28]. Decades of white matter anatomical tracing studies in the macaque brain have been painstakingly assembled in the CoCoMac database. In previous efforts, we have integrated the contributions made to CoCoMac by many disparate labs to produce a cohesive, conflict-free network of brain regions in the macaque monkey that is three times larger than the largest previous such network [9]. Here, we reduced this network to 77 brain regions, based on the white matter connectivity metrics described below, to make it more amenable to simulation. We derived volumetric information for each region from the Paxinos brain atlas [29] and accompanying software [30], which in turn was used to set relative neuron counts for each region. Volume information was not available for 5 cortical and 8 thalamic regions, and so was approximated using the median size of the other cortical or thalamic regions, respectively.

B. Long-range connectivity

In the CoCoMac network, each brain region is represented as a vertex, and the presence of a white matter connection between two brain regions is represented as an edge between the corresponding vertices. The derived network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia, and has 6602 directed edges that capture well-known cortico-cortical, cortico-subcortical, and intra-subcortical white matter pathways [9]. In the many studies reported to CoCoMac, there are often cases where one lab reports connections for a brain region while another lab subdivides that brain region and reports connections for one of the child subregions. For simplicity, we reduced the network by merging a child subregion into a parent region where both child and parent regions report connections. We do this by ORing the connections of the child region with those of the parent region. The smaller, lower resolution network consists of 102 regions, 77 of which report connections. In the brain, the topography of connections between regions has in some cases been shown to be highly structured and focused [31], and in other cases very diffuse [32]. For a given vertex in our reduced CoCoMac graph, such focused connections correspond to a source process targeting a single process in the target region, while diffuse connections correspond to a source process targeting multiple processes in the target region (if more than one such process exists). We choose to use long-range connections that are as diffuse as possible in our test network because this places the largest burden on the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.

Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth cores model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.
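The reduction step described in section V-A (merging a child subregion into its parent by ORing their connections) can be sketched on a small directed graph. Region names and edges here are illustrative, not CoCoMac data:

```python
# Sketch of the CoCoMac reduction step: merge a child subregion into its
# parent by OR-ing the child's incoming and outgoing edges into the parent.

def merge_child_into_parent(conn, parent, child):
    """conn: dict mapping region -> set of target regions (directed edges)."""
    # OR the child's outgoing edges into the parent (drop self-loops).
    conn[parent] |= conn.pop(child) - {parent}
    conn[parent].discard(child)
    # OR incoming edges: redirect any edge targeting the child to the parent.
    for src, targets in conn.items():
        if child in targets:
            targets.discard(child)
            if src != parent:
                targets.add(parent)
    return conn

# Toy graph: parent "V2" with child subregion "V2a" that both report edges.
g = {"V2": {"V2a"}, "V2a": {"MT"}, "MT": {"V2a"}}
merged = merge_child_into_parent(g, "V2", "V2a")
# "V2a" disappears; "V2" inherits its connectivity to and from "MT".
```

Applying this repeatedly wherever both parent and child report connections yields the smaller 102-region network described above.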

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, the targets are invariably within a 2-3 mm radius of the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on possible neuron targets from gray matter connectivity of roughly the nearest 3 million neurons; this fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.
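The "nearest 3 million or so neurons" figure follows from a back-of-envelope estimate. Assuming a density of about 10^5 neurons per square millimeter (the upper end of the "tens of thousands" cited above, an assumption for this sketch) over a disc of radius 3 mm:

```python
import math

density_per_mm2 = 1e5   # assumed upper end of "tens of thousands" per mm^2
radius_mm = 3.0         # maximum gray matter connection radius
area_mm2 = math.pi * radius_mm ** 2
reachable = density_per_mm2 * area_mm2
# ~2.8 million neurons, consistent with "the nearest 3 million or so"
```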

We establish our ratio of long-range to local connectivity at approximately 60:40 for cortical regions and 80:20 for non-cortical regions. From CoCoMac we obtained a binary connection matrix. This matrix is converted to a stochastic matrix, with the above gray matter percentages on the diagonals and with white matter connections set to be proportional to the volume percentage of the outgoing region. The connection matrix was then balanced, as described in section IV, to make the row and column sums equal to the volume of each region. The end result guarantees a realizable model, in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in figure 3.
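The balancing referred to in section IV is Sinkhorn-Knopp-style alternating normalization (refs [24], [26]). A minimal sketch, using a toy 2x2 matrix and toy region volumes rather than the CoCoMac data, rescales rows and columns until both sums approach the per-region targets:

```python
def balance(matrix, targets, iters=500):
    """Alternately rescale rows and columns of a nonnegative matrix so
    that both row sums and column sums approach the target values."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(iters):
        for i in range(n):                          # row normalization step
            s = sum(m[i])
            if s > 0:
                m[i] = [x * targets[i] / s for x in m[i]]
        for j in range(n):                          # column normalization step
            s = sum(m[i][j] for i in range(n))
            if s > 0:
                for i in range(n):
                    m[i][j] *= targets[j] / s
    return m

conn = [[0.6, 0.4],   # toy stochastic matrix: local (gray matter) weight on
        [0.3, 0.7]]   # the diagonal, long-range weight off-diagonal
vol = [2.0, 1.0]      # toy relative region volumes (targets must share a sum)
b = balance(conn, vol)
row_sums = [sum(row) for row in b]
col_sums = [sum(b[i][j] for i in range(len(vol))) for j in range(len(vol))]
```

For a positive matrix whose row and column targets have equal totals, this iteration converges, which is what guarantees that all axon and neuron requests can be fulfilled.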

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass, even for the largest simulated system of TrueNorth cores, is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process2 to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512.

2We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass") and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1), as functions of the available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds,3 or 388× slower than real time.
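The headline figures above follow from simple arithmetic, reproduced here as a sanity check (each tick corresponds to 1 ms of simulated time):

```python
# 500 ticks at 1 ms of simulated time per tick = 0.5 s simulated;
# 194 wall-clock seconds is therefore 388x slower than real time.
ticks = 500
simulated_s = ticks * 1e-3
wall_clock_s = 194.0
slowdown = wall_clock_s / simulated_s

# Per-thread load in the weak scaling runs: 16384 cores per node split
# across 1 MPI process x 32 OpenMP threads = 512 simulated cores/thread.
cores_per_thread = 16384 // 32
```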

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick, as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner, and therefore less frequented, with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.
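The per-tick data volume quoted above is a direct product, checked here:

```python
# ~22M spikes per simulated tick at 20 bytes each = ~0.44 GB per tick,
# well under the 2 GB/s bandwidth of each 5D torus link.
spikes_per_tick = 22e6
bytes_per_spike = 20
gb_per_tick = spikes_per_tick * bytes_per_spike / 1e9
```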

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time

3We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

(a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline); 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity); and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).
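The strong scaling numbers can be restated as parallel efficiency (speed-up divided by the increase in resources), which makes the drop-off between 8 and 16 racks explicit:

```python
baseline_s = 324.0              # 1 rack (16384 CPUs)
times = {8: 47.0, 16: 37.0}     # racks -> wall-clock seconds
speedups = {r: baseline_s / t for r, t in times.items()}
efficiency = {r: speedups[r] / r for r in times}
# ~86% efficiency at 8 racks, dropping to ~55% at 16 racks, consistent
# with the communication-bound phases limiting perfect scaling.
```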

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

creates a serial bottleneck at all thread counts.

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly similar performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches, due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model, to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison, we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation, at 1 millisecond of wall-clock time per 1 millisecond of simulated time, is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
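The delivery pattern that makes this work, with writers putting spikes directly into a target-owned buffer and a single barrier per tick separating the write and read phases, can be illustrated with a toy sketch. Python threads stand in for UPC instances here; this is an illustration of the pattern, not the Compass implementation:

```python
import threading

N = 4
buffers = [[] for _ in range(N)]              # per-process "globally addressable" spike buffers
locks = [threading.Lock() for _ in range(N)]  # stand-in for atomic remote puts
barrier = threading.Barrier(N)                # the single per-tick global barrier
delivered = {}

def process(rank, outgoing):
    # Write phase: one-sided "puts" of (source rank, target axon) pairs
    # directly into the buffer owned by each target process.
    for target, axon in outgoing:
        with locks[target]:
            buffers[target].append((rank, axon))
    barrier.wait()
    # Read phase: arrival order within a tick does not affect the next
    # tick's computation, so each process may consume its buffer freely.
    delivered[rank] = sorted(buffers[rank])

# Toy spike schedule: rank -> list of (target rank, target axon) spikes.
spikes = {0: [(1, 7)], 1: [(2, 3), (0, 5)], 2: [(3, 1)], 3: [(0, 2)]}
threads = [threading.Thread(target=process, args=(r, spikes[r])) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no process needs to know which peers will send to it, the pattern replaces the Reduce-Scatter (used in the MPI version to count incoming messages) with the one barrier.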

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with the other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks, with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system,4 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and from the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4We do not use the CoCoMac model for real-time simulations because the size of the simulated system has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology; but now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the US Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to the IBM Blue Gene/P and Blue Gene/Q at the IBM T.J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to the IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613–641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13485–13490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. New York: Oxford University Press, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing '07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063–1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153–160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50–55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Comput., vol. 23, no. 6, pp. 1503–1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31–41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN '09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201–3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456–458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomput., vol. 74, no. 1-3, pp. 3–29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, pp. 349–398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83–90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261–275, 2008.

[27] R. Kötter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127–144, 2004.

[28] "CoCoMac (Collations of Connectivity data on the Macaque brain)," www.cocomac.org.

[29] G. Paxinos, X. Huang, and M. Petrides, The Rhesus Monkey Brain in Stereotaxic Coordinates. Academic Press, 2008. [Online]. Available: http://books.google.co.in/books?id=7HW6HgAACAAJ

[30] G. Bezgin, R. Kötter, and A. T. Reid, "An introduction to the CoCoMac-Paxinos-3D viewer," in The Rhesus Monkey Brain in Stereotaxic Coordinates, G. Paxinos, X.-F. Huang, M. Petrides, and A. Toga, Eds. Elsevier, San Diego, 2008, ch. 4.

[31] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, pp. 106–154, 1962.

[32] A. Peters and E. G. Jones, Cerebral Cortex, Vol. 4: Association and Auditory Cortices. Springer, 1985.

[33] V. B. Mountcastle, "The columnar organization of the neocortex," Brain, vol. 120, no. 4, pp. 701–722, 1997.

[34] C. E. Collins, D. C. Airey, N. A. Young, D. B. Leitch, and J. H. Kaas, "Neuron densities vary across and within cortical areas in primates," Proceedings of the National Academy of Sciences, vol. 107, no. 36, pp. 15927–15932, 2010.

[35] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48–60, 2012.

[36] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q interconnection network and message unit," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063419

[37] "The Berkeley UPC Compiler," 2002, http://upc.lbl.gov.

[38] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2009.5161076

[39] D. Bonachea, "GASNet Specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.

[40] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer, "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS '08. New York, NY, USA: ACM, 2008, pp. 94–103. [Online]. Available: http://doi.acm.org/10.1145/1375527.1375544

[41] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable Communication Protocols for Dynamic Sparse Data Exchange," in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). ACM, Jan. 2010, pp. 159–168.

[42] D. S. Modha, "A conceptual cortical surface atlas," PLoS ONE, vol. 4, no. 6, p. e5693, 2009.

[43] A. J. Sherbondy, R. Ananthanarayanan, R. F. Dougherty, D. S. Modha, and B. A. Wandell, "Think global, act local: projectome estimation with BlueMatter," in Proceedings of MICCAI 2009, Lecture Notes in Computer Science, 2009.

[44] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nano-scale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, forthcoming, 2012.


bandwidth. We note that one-sided inter-core communication from a neuron on one core to an axon on another core residing on a different compute node maps naturally to the PGAS model.

Compass enables large-scale simulations that provide brain-like function as a precursor to demonstrating the same functionality on TrueNorth hardware. Our goal is not to use Compass to model the brain, but to approximate brain-like function that integrates multiple sensory-motor modalities. We believe that function follows form. The form that we have chosen to concentrate on in this paper leverages the largest known long-distance wiring diagram in the macaque brain, which spans cortex, thalamus, and basal ganglia [9]. Concretely, we use the macaque wiring diagram to instantiate an inter-core wiring pattern. The richness of the wiring diagram challenges the communication and computational capabilities of Compass in a manner consistent with supporting brain-like networks. To this end, we have built a novel parallel compiler that takes a high-level network description of the macaque wiring diagram and converts it to the parameters needed to configure a network of TrueNorth cores. Our compiler is sufficiently general to support other application-driven networks.

Compass is indispensable for (a) verifying TrueNorth correctness via regression testing; (b) studying TrueNorth dynamics; (c) benchmarking inter-core communication topologies; (d) demonstrating applications in vision, audition, real-time motor control, and sensor integration; (e) estimating power consumption; and (f) testing, verifying, and iterating on hypotheses regarding neural codes and function. We have used Compass to demonstrate numerous applications of the TrueNorth architecture, such as optic flow, attention mechanisms, image and audio classification, multi-modal image-audio classification, character recognition, robotic navigation, and spatio-temporal feature extraction.

Compass differs completely from our previous simulator C2 [11], [12]. First, the fundamental data structure is a neurosynaptic core instead of a synapse; the synapse is simplified to a bit, resulting in 32× less storage required for the synapse data structure as compared to C2. Second, the local and long-distance anatomical connectivity in Compass, which emulate respectively the intra-core and inter-core constraints in TrueNorth, have no counterpart in C2. Third, the neuron dynamics equations in Compass are amenable to efficient hardware implementation, whereas C2 focused on single-compartment phenomenological dynamic neuron models [13]. Fourth, Compass uses a fully multi-threaded programming model, whereas C2 used a flat MPI programming model, rendering it incapable of exploiting the full potential of Blue Gene/Q. Last, Compass incorporates an in situ compiler for generating a complete TrueNorth model from a compact CoreObject description file, which reduces simulation set-up times by three orders of magnitude. As a result of these innovations, Compass demonstrates unprecedented weak scaling beyond the cat brain scale achieved by C2. The architecture of Compass stands in contrast to other computational neuroscientific simulators [14], [15], [16], [17], [18], [19], [20]. For reference, we point

readers to reviews of large-scale brain simulations [21], [22]. Compass is a harbinger of an emerging use of today's modern supercomputers for midwifing the next generation of application-specific processors, which are increasingly proliferating to satisfy a world that hungers for increased performance and lower power while facing the projected end of CMOS scaling and increasing obstacles to pushing clock rates ever higher.

II. THE TRUENORTH ARCHITECTURE

TrueNorth is a scalable neurosynaptic computer architecture developed at IBM under the DARPA SyNAPSE program. Inspired by the organization and function of the brain, TrueNorth employs a similar vocabulary, describing CMOS circuit elements as neurons, axons, dendrites, and synapses.

Fig. 1. Conceptual architecture of a neurosynaptic core, incorporating a synaptic crossbar that contains axons (horizontal lines), dendrites (vertical lines), synapses (intersections of horizontal axons and vertical dendrites), and neurons attached to dendrites. A buffer for incoming spikes precedes each axon to account for axonal delays.

The key building block of TrueNorth is a neurosynaptic core [3], [4], as shown in figure 1, connected through a communication network to other TrueNorth cores in the system. The specific instance of a neurosynaptic core that we simulate has 256 axons, 256 dendrites feeding 256 neurons, and a 256×256 binary crossbar synaptic array. Neurons are digital integrate-leak-and-fire circuits characterized by configurable parameters sufficient to produce a rich repertoire of dynamic and functional behavior. A neuron on any TrueNorth core can connect to an axon on any TrueNorth core in the network; when the neuron fires, it sends a spike message to the axon.

Each TrueNorth core operates in a parallel, distributed, and semi-synchronous fashion. Each TrueNorth core receives a slow clock tick at 1000 Hz to discretize neuron dynamics into 1-millisecond time steps. Otherwise, neuronal spike events drive all operations in the system. When a TrueNorth core receives a tick from the slow clock, it cycles through each of its axons. For each axon, if the axon has a spike ready for delivery at the current time step in its buffer, each synaptic value on the horizontal synaptic line connected to the axon is delivered to its corresponding post-synaptic neuron in the TrueNorth core. If the synapse value for a particular axon-neuron pair is non-zero, then the neuron increments its membrane potential by a (possibly stochastic) weight corresponding to the axon type. After all axons are processed, each neuron applies a configurable, possibly stochastic leak, and a neuron whose membrane potential exceeds its threshold fires a spike. Each spike is then delivered via the communication network to its corresponding target axon. An axon that receives a spike schedules the spike for delivery at a future time step in its buffer. The entire cycle repeats when the TrueNorth core receives the next tick from the slow clock.
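The per-tick behavior just described can be sketched in a few lines. The following is a minimal, illustrative Python model of a single core; the class name and parameter choices are our own simplifications (a single axon type, deterministic weight and leak), not the TrueNorth specification.

```python
class NeurosynapticCore:
    """Illustrative sketch of one core's tick: per-axon spike buffers, a
    binary synaptic crossbar, and integrate-leak-and-fire neurons."""

    def __init__(self, n=256, threshold=10, leak=0, weight=1, delay=1):
        self.n = n
        self.crossbar = [[0] * n for _ in range(n)]  # synapse bit (i, j)
        self.buffers = [{} for _ in range(n)]        # axon -> {tick: spike}
        self.potential = [0] * n                     # membrane potentials
        self.threshold = threshold
        self.leak = leak
        self.weight = weight
        self.delay = delay                           # axonal delay in ticks

    def receive_spike(self, axon, current_tick):
        # An arriving spike is scheduled in the axon's buffer for delivery
        # at a future time step.
        self.buffers[axon][current_tick + self.delay] = True

    def tick(self, t):
        # Synapse phase: deliver buffered spikes along crossbar rows.
        for i in range(self.n):
            if self.buffers[i].pop(t, False):
                for j in range(self.n):
                    if self.crossbar[i][j]:
                        self.potential[j] += self.weight
        # Neuron phase: apply leak, then fire and reset above-threshold
        # neurons; fired spikes would be routed to target axons elsewhere.
        fired = []
        for j in range(self.n):
            self.potential[j] += self.leak
            if self.potential[j] > self.threshold:
                fired.append(j)
                self.potential[j] = 0
        return fired
```

For example, with one crossbar bit set at (0, 2) and a supra-threshold weight, a spike injected on axon 0 at tick 0 causes neuron 2 to fire at tick 1.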

Neurons represent computation, synapses represent memory, and neuron-axon connections represent communication. Each TrueNorth core brings computation and memory into extreme proximity, breaking the von Neumann bottleneck. Note that synaptic or neuronal state never leaves a TrueNorth core; only spikes ever leave or enter any TrueNorth core. The communication network is driven solely by spike events and requires no clocks. The neuron parameters, synaptic crossbar, and target axon for each neuron are reconfigurable throughout the system.

To ensure one-to-one equivalence between TrueNorth and Compass, we have taken judicious care in the design of both systems. For example, we have bypassed the complexity of the analog neurons traditionally associated with neuromorphic systems. Also, we have adopted pseudo-random number generators with configurable seeds. As a result, Compass has become the key contract between our hardware architects and software algorithm/application designers.

III. THE COMPASS NEUROSYNAPTIC CHIP SIMULATOR

Compass is an architectural simulator for large-scale network models of TrueNorth cores and is implemented using a combination of MPI library calls and OpenMP threading primitives. Compass partitions the TrueNorth cores in a model across several processes, and distributes the TrueNorth cores residing in the same shared memory space within a process among multiple threads. Threads within a process independently simulate the synaptic crossbar and neuron behavior of one or more TrueNorth cores in the model. A master thread then sends any spikes destined for TrueNorth cores on remote processes in an aggregated fashion; upon reception, all threads deliver spikes to TrueNorth cores on the local process.

Compass executes its main simulation loop in a semi-synchronous manner. Listing 1 shows pseudo-code for the main simulation phases that each process in Compass executes. Each pass through the phases simulates a single tick of the global clock in a system of TrueNorth cores. In the Synapse phase, each process propagates input spikes from axons to neurons through the crossbar. Next, in the Neuron phase, each process executes the integrate, leak, and fire model for each

    #pragma omp shared(localBuf, remoteBuf)
    #pragma omp shared(TrueNorthCores, remoteBufAgg)
    #pragma omp parallel
    {
      for core in TrueNorthCores[threadID] {
        // Synapse phase
        for axon in core.axons
          axon.propagateSpike()
        // Neuron phase
        for neuron in core.neurons {
          spike S = neuron.integrateLeakFire()
          if (S.dest == local)
            localBuf[threadID].push(S)
          else
            remoteBuf[threadID][S.dest].push(S)
        }
      }
      #pragma omp barrier
      threadAggregate(remoteBuf, remoteBufAgg)
      if (threadID == 0) {
        for dest in remoteDests {
          if (remoteBufAgg[dest].size() != 0) {
            MPI_Isend(remoteBufAgg[dest])
            sendCounts[dest]++
          }
        }
      }
      // Network phase
      if (threadID == 0) {
        MPI_Reduce_scatter(sendCounts, recvCount)
      } else {
        threadBuf = partition(localBuf, thread)
        deliver(spikes in threadBuf)
      }
      #pragma omp barrier
      #pragma omp for
      for message in recvCount messages {
        #pragma omp critical
        {
          MPI_Iprobe(status)
          MPI_Get_count(status, recvLength)
          MPI_Recv(recvBuf, recvLength)
        } // end critical
        deliver(spikes in recvBuf)
      } // end omp for
    } // end omp parallel

Listing 1. Pseudo-code of the main simulation loop in Compass.

neuron. Last, in the Network phase, processes communicate to send spikes from firing neurons and to deliver spikes to destination axons.

To minimize communication overhead, Compass aggregates spikes between pairs of processes into a single MPI message. At startup, each process in Compass gets all destination (TrueNorth core, axon) pairs from the neurons within its hosted TrueNorth cores, uses an implicit TrueNorth-core-to-process map to identify all destinations on remote processes, and preallocates per-process send buffers. During a tick, the process aggregates all spikes for TrueNorth cores at remote processes into the corresponding buffer, and uses a single MPI send call to transmit each buffer once all neurons have integrated, leaked, and fired.
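As a sketch of this aggregation step (the function and map names here are hypothetical, not Compass's actual identifiers), spikes tagged with a destination (core, axon) pair are grouped into one buffer per destination process, so that each process pair exchanges at most one message per tick:

```python
from collections import defaultdict

def aggregate_spikes(spikes, core_to_process):
    """Group (target_core, target_axon) spikes into one send buffer per
    destination process. Compass preallocates such buffers at startup and
    transmits each with a single MPI send call."""
    send_buffers = defaultdict(list)
    for target_core, target_axon in spikes:
        dest = core_to_process[target_core]  # implicit core-to-process map
        send_buffers[dest].append((target_core, target_axon))
    return dict(send_buffers)
```

With this grouping, the number of messages per tick is bounded by the number of communicating process pairs rather than by the number of spikes.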

In detail, each Compass process forks multiple OpenMP threads, which execute the main Compass phases in parallel as follows:

• Synapse phase: For each axon within a set of TrueNorth cores dedicated to a specific thread, the thread checks if a spike is ready for delivery. If a spike is ready, Compass propagates the spike along the row in the synaptic crossbar connected to the axon. For each set connection in the row, Compass buffers the spike for integration at the neuron connected to the set connection. For example, if the ith axon in a TrueNorth core has a spike ready for delivery and the (i, j)th connection is set in the crossbar, then Compass propagates the spike to the jth neuron.

• Neuron phase: For each neuron within a TrueNorth core, threads integrate all propagated spikes, leak potential, and generate spikes as appropriate. Threads then aggregate spikes destined for TrueNorth cores at remote processes into per-process remote destination buffers (remoteBufAgg), so that spikes are consecutively laid out in memory for MPI message transfers. The master thread within each process lastly uses a single MPI message to send each remote destination buffer to the corresponding remote process.

• Network phase: The master thread uses an MPI Reduce-Scatter operation to determine how many incoming messages to expect. In parallel, Compass divides the local destination buffer uniformly among all non-master threads, each of which then immediately delivers spikes from its portion to the destination TrueNorth cores (in detail, to the corresponding axon buffer at a specific destination axon). Performance is improved because the processing of local spikes by non-master threads overlaps with the Reduce-Scatter operation performed by the master thread.
After the master thread completes the Reduce-Scatter operation and after all non-master threads deliver spikes from the local buffer, all threads take turns to receive an MPI message and deliver each spike contained in the message to its destination TrueNorth core residing in the shared memory space of the Compass process. Each thread receives MPI messages in a critical section due to thread-safety issues in the MPI library [23], but delivers the spikes within the messages outside of the critical section.

IV. PARALLEL COMPASS COMPILER

Compass is capable of simulating networks of tens of millions of TrueNorth cores. To build applications for such large-scale TrueNorth networks, we envisage first implementing libraries of functional primitives that run on one or more interconnected TrueNorth cores. We can then build richer applications by instantiating and connecting regions of functional primitives. Configuring such rich applications in Compass (or, for that matter, on TrueNorth hardware) can potentially require setting trillions of parameters. To make the configuration more

tractable, we require tools to create lower-level core parameter specifications from higher-level, more compact descriptions of the functional regions.

We have designed and implemented a parallel processing tool called the Parallel Compass Compiler (PCC) that translates a compact definition of functional regions of TrueNorth cores into the explicit neuron parameter, synaptic connection parameter, and neuron-to-axon connectivity declarations required by Compass. PCC works to minimize MPI message counts within the Compass main simulation loop by assigning TrueNorth cores in the same functional region to as few Compass processes as necessary. This minimization enables Compass to use faster shared memory communication to handle most intra-region spiking, reserving more expensive MPI communication mostly for inter-region spiking. Overall, the number of Compass processes required to simulate a particular region corresponds to the computation (neuron count) and memory (synaptic connection count) required for the region, which in turn is related to the functional complexity of the region.

We show an illustrative example of three inter- and intra-connected functional regions in Figure 2. In our example, three processes manage each of the functional regions A and C, and two processes manage functional region B. For illustration clarity, we present only the connections of process 0 (in region A) and of process 3 (in region B) to region C. The thickness of the edges between processes reflects the strength of the connection; thicker edges represent more neuron connections.

PCC constructs the connectivity graph between functional regions using a distributed algorithm in which each parallel PCC process compiles TrueNorth core parameters for at most one functional region. To create neuron-to-axon connections between regions, the PCC process managing the target region uses MPI message operations to send the global core ID and axon ID of an available axon to the PCC process managing the source region. To create neuron-to-axon connections within a region (for example, as with process P0 in Figure 2), a PCC process connects cores locally; we use OpenMP threading primitives to exploit thread-level parallelism.

This exchange of information happens in an aggregated, per-process-pair fashion. As illustrated in Figure 2, if K neurons on P0 connect to P5, then P5 needs to send the TrueNorth core IDs and the K axon IDs to P0 using MPI_Isend. The axon types and the synaptic crossbars on the corresponding TrueNorth cores on P5 are set up simultaneously. On P0, the received TrueNorth core IDs and axon IDs are used to connect the source TrueNorth cores on P0 with the target TrueNorth cores on P5. We require a realizability mechanism for connections to guarantee that each target process has enough TrueNorth cores to satisfy incoming connection requests. One way to accomplish this is to normalize the network to have consistent neuron and axon requirements. This is equivalent to normalizing the connection matrix to have identical pre-specified column and row sums [24], [25], [26], a generalization of doubly stochastic matrices. This procedure is known as the iterative proportional

Fig. 2. Illustration of a neurosynaptic core network built using the Parallel Compass Compiler across multiple processes, showing connectivity within and across functional regions.

fitting procedure (IPFP) in statistics and as matrix balancing in linear algebra.
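A minimal sketch of IPFP (Sinkhorn-style alternating row and column rescaling; an illustrative implementation, not the one used in the paper) for a nonnegative matrix and prescribed marginals:

```python
def ipfp(matrix, row_sums, col_sums, iters=200):
    """Alternately rescale rows and columns of a nonnegative matrix so its
    marginals approach the prescribed row and column sums (when such a
    balancing is realizable). With all-ones targets this reduces to
    balancing toward a doubly stochastic matrix."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iters):
        for i in range(n_rows):                      # row scaling step
            s = sum(m[i])
            if s > 0:
                m[i] = [v * row_sums[i] / s for v in m[i]]
        for j in range(n_cols):                      # column scaling step
            s = sum(m[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    m[i][j] *= col_sums[j] / s
    return m
```

For strictly positive matrices the iteration converges geometrically, so a few hundred sweeps suffice for small examples.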

The high-level network description describing the network connectivity is expressed in a relatively small and compact CoreObject file. For large-scale simulations of millions of TrueNorth cores, the network model specification for Compass can be on the order of several terabytes; offline generation and copying of such large files is impractical. Parallel model generation using the compiler requires only a few minutes, as compared to several hours to read or write such a model to disk. Once the compiler completes the wiring of the neurosynaptic cores across all regions, the TrueNorth cores from each process are instantiated within Compass, and the TrueNorth cores in the compiler are deallocated to free up space. Finally, Compass simulates the network model that was created.

V. COCOMAC MACAQUE BRAIN MODEL NETWORK

Higher cognition in the brain is thought to emerge from the thalamocortical system, which can be divided into functionally specialized thalamic or cortical regions. Each region is composed of millions or billions of neurons, cells specialized to receive, integrate, and send messages. Long-range white matter connections provide communication between neurons in different regions, while short-range gray matter connections provide for communication between neurons within the same region. Tremendous efforts towards mapping the brain and its connectivity have been made over decades of research, a tiny fraction of which is drawn on here to construct a simple network that nevertheless captures several key characteristics desirable for TrueNorth simulations. In building our test network, we establish natural parallels between brain regions, white matter connectivity, gray matter connectivity, and the underlying simulator hardware. We simulate each brain region using non-overlapping sets of one or more processes, such that white matter communication becomes equivalent to inter-process communication, and we then establish gray matter connectivity as within-process communication.

A. Structure

It is hypothesized that distinct functions are supported by signature subnetworks throughout the brain that facilitate information flow, integration, and cooperation across functionally differentiated, distributed centers. We sought to capture the richness found in these networks by specifying our test network's modules, and the connectivity between them, using a very large scale database of brain regions and connectivity in the macaque monkey called CoCoMac [27], [28]. Decades of white matter anatomical tracing studies in the macaque brain have been painstakingly assembled in the CoCoMac database. In previous efforts, we have integrated the contributions made to CoCoMac by many disparate labs to produce a cohesive, conflict-free network of brain regions in the macaque monkey that is three times larger than the largest previous such network [9]. Here, we reduced this network to 77 brain regions, based on the white matter connectivity metrics described below, to make it more amenable to simulation. We derived volumetric information for each region from the Paxinos brain atlas [29] and accompanying software [30], which in turn was used to set relative neuron counts for each region. Volume information was not available for 5 cortical and 8 thalamic regions, and so was approximated using the median size of the other cortical or thalamic regions, respectively.

B. Long-range connectivity

In the CoCoMac network, each brain region is represented as a vertex, and the presence of a white matter connection between two brain regions is represented as an edge between the corresponding vertices. The derived network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia, and has 6602 directed edges that capture well-known cortico-cortical, cortico-subcortical, and intra-subcortical white matter pathways [9]. In the many studies reported to CoCoMac, there are often cases where one lab reports connections for a brain region, while another lab subdivides that brain region and reports connections for one of the child subregions. For simplicity, we reduced the network by merging a child subregion into a parent region where both child and parent regions report connections. We do this by ORing the connections of the child region with those of the parent region. The smaller, lower resolution network consists of 102 regions, 77 of which report connections. In the brain, the topography of connections between regions has in some cases been shown to be highly structured and focused [31], and in other cases very diffuse [32]. For a given vertex in our reduced CoCoMac graph, such focused connections correspond to a source process targeting a single process in the target region, while diffuse connections correspond to a source process targeting multiple processes in the target region (if more than one such process exists). We choose to use long-range connections that are as diffuse as possible in our test network because this places the largest burden on

Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth core model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.

the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, the targets are invariably within a 2-3 mm radius of the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on the possible neuron targets reachable through gray matter connectivity of roughly the nearest 3 million neurons; this fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron

often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.

We establish our ratio of long-range to local connectivity at approximately 60/40 for cortical regions and 80/20 for non-cortical regions. From CoCoMac, we obtained a binary connection matrix. This matrix is converted to a stochastic matrix with the above gray matter percentages on the diagonal, and with white matter connections set to be proportional to the volume percentage of the outgoing region. Then the connection matrix was balanced as described in section IV to make the row and column sums equal to the volume of each region. The end result guarantees a realizable model, in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in figure 3.
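A sketch of this construction step (illustrative only; in particular, the choice to split each row's white matter mass in proportion to target region volume is our assumption, and the result would still be balanced with IPFP afterwards):

```python
def connection_matrix(binary, volumes, is_cortical):
    """Build a row-stochastic region-to-region matrix: the gray matter
    fraction (40% cortical, 20% non-cortical) sits on the diagonal, and
    the remaining white matter mass is divided among the row's CoCoMac
    targets in proportion to (assumed) target region volume."""
    n = len(binary)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        gray = 0.4 if is_cortical[i] else 0.2
        out[i][i] = gray                       # local (gray matter) mass
        targets = [j for j in range(n) if j != i and binary[i][j]]
        total = sum(volumes[j] for j in targets)
        if total > 0:
            for j in targets:                  # long-range (white matter)
                out[i][j] = (1.0 - gray) * volumes[j] / total
    return out
```

Each row then sums to 1, with the prescribed local/long-range split per region type.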

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass, for even the largest simulated system of TrueNorth cores, is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process (see footnote 2) to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes

Footnote 2: We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass") and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1), as functions of the available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds (see footnote 3), or 388× slower than real time.
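The headline figures are consistent with simple arithmetic, assuming 256 neurons per core and a 1 ms simulated tick:

```python
cores = 256 * 10**6                 # simulated TrueNorth cores
neurons = cores * 256               # 256 neurons per core
print(neurons)                      # 65,536,000,000: about 65B neurons

seconds_per_tick = 194.0 / 500      # 500 simulated ticks took 194 s
slowdown = seconds_per_tick / 1e-3  # each tick represents 1 ms
print(round(slowdown))              # 388, i.e. 388x slower than real time
```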

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick, as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner, and therefore less frequented, with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time

Footnote 3: We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

(a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q, with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads, due to a critical section in the Network phase that

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q, with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

creates a serial bottleneck at all thread counts.
In other multi-threaded experiments, we found that trading

off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly identical performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches, due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communication. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison,

we ran both PGAS and MPI versions of Compass on a systemof four 1024-node IBM Blue GeneP racks installed at IBMWatson as no CC++ PGAS compiler yet exists for the BlueGeneQ Each compute node in the Blue GeneP comprises 4CPUs with 4 GB of dedicated physical memory We compiledwith the IBM XL CC++ compiler version 90 (MPICH2 MPIlibrary version 107) For PGAS support we reimplementedthe Compass messaging subroutines using Unified Parallel C(UPC) and compiled with the Berkeley UPC compiler version2140 [37] [38] the Berkeley UPC runtime uses the GASNetlibrary [39] for one-sided communication Both GASNet andthe MPI library implementation on Blue GeneP build uponthe Deep Computing Messaging Framework (DCMF) [40] toachieve the best available communication performance

A. Tuning for real-time simulation

Real-time simulation, in which 1 millisecond of simulated time takes no more than 1 millisecond of wall-clock time, is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
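The protocol just described can be caricatured in a few lines. In the sketch below (all names hypothetical), Python threads stand in for UPC instances, and a lock-protected counter stands in for a true one-sided fetch-and-add; the point is that writers deposit spikes directly into the target's globally addressable buffer, and a single barrier separates writing from reading, with no tag matching and no Reduce-Scatter:

```python
import threading

NRANKS = 4
inbox = [[None] * 64 for _ in range(NRANKS)]   # globally addressable spike buffers
fill = [0] * NRANKS                            # next free slot in each inbox
lock = threading.Lock()
barrier = threading.Barrier(NRANKS)

def fetch_add(rank, n):
    # Stand-in for a one-sided atomic reservation (in real PGAS code,
    # a remote fetch-and-add); reserves n slots in the target's inbox.
    with lock:
        base = fill[rank]
        fill[rank] += n
        return base

def network_phase(my_rank, outgoing):
    # outgoing maps target rank -> list of spikes destined for it.
    for target, spikes in outgoing.items():
        base = fetch_add(target, len(spikes))
        for k, s in enumerate(spikes):
            inbox[target][base + k] = s        # one-sided "put"
    barrier.wait()  # one global barrier separates writing from reading
```

Because the next tick is insensitive to spike arrival order, each rank may simply read its inbox after the barrier.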

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with the processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to that of the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system,4 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1 times as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4We do not use the CoCoMac model for real-time simulations because a simulated system of this size has too few TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with a number of neurons comparable to the human cortex and a number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology. Now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to the IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to the IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62-71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1-4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1-4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613-641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13485-13490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. Oxford University Press, New York, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing '07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063-1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153-160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50-55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Computation, vol. 23, no. 6, pp. 1503-1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31-41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593-3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN'09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201-3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456-458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomputing, vol. 74, no. 1-3, pp. 3-29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, pp. 349-398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343-348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83-90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261-275, 2008.

[27] R. Kötter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127-144, 2004.

[28] "CoCoMac (Collations of Connectivity data on the Macaque brain)," www.cocomac.org.

[29] G. Paxinos, X. Huang, and M. Petrides, The Rhesus Monkey Brain in Stereotaxic Coordinates. Academic Press, 2008. [Online]. Available: http://books.google.co.in/books?id=7HW6HgAACAAJ

[30] G. Bezgin, R. Kötter, and A. T. Reid, "An introduction to the CoCoMac-Paxinos-3D viewer," in The Rhesus Monkey Brain in Stereotaxic Coordinates, G. Paxinos, X.-F. Huang, M. Petrides, and A. Toga, Eds. Elsevier, San Diego, 2008, ch. 4.

[31] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, pp. 106-154, 1962.

[32] A. Peters and E. G. Jones, Cerebral Cortex, Vol. 4: Association and Auditory Cortices. Springer, 1985.

[33] V. B. Mountcastle, "The columnar organization of the neocortex," Brain, vol. 120, no. 4, pp. 701-722, 1997.

[34] C. E. Collins, D. C. Airey, N. A. Young, D. B. Leitch, and J. H. Kaas, "Neuron densities vary across and within cortical areas in primates," Proceedings of the National Academy of Sciences, vol. 107, no. 36, pp. 15927-15932, 2010.

[35] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48-60, 2012.

[36] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q interconnection network and message unit," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063419

[37] "The Berkeley UPC Compiler," 2002, http://upc.lbl.gov.

[38] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1-12. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2009.5161076

[39] D. Bonachea, "GASNet Specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.

[40] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer, "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS '08. New York, NY, USA: ACM, 2008, pp. 94-103. [Online]. Available: http://doi.acm.org/10.1145/1375527.1375544

[41] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable Communication Protocols for Dynamic Sparse Data Exchange," in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). ACM, Jan. 2010, pp. 159-168.

[42] D. S. Modha, "A conceptual cortical surface atlas," PLoS ONE, vol. 4, no. 6, p. e5693, 2009.

[43] A. J. Sherbondy, R. Ananthanarayanan, R. F. Dougherty, D. S. Modha, and B. A. Wandell, "Think global, act local: projectome estimation with BlueMatter," in Proceedings of MICCAI 2009, Lecture Notes in Computer Science, 2009.

[44] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nanoscale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, forthcoming 2012.


a tick from the slow clock, it cycles through each of its axons. For each axon, if the axon has a spike ready for delivery at the current time step in its buffer, each synaptic value on the horizontal synaptic line connected to the axon is delivered to its corresponding post-synaptic neuron in the TrueNorth core. If the synapse value for a particular axon-neuron pair is non-zero, then the neuron increments its membrane potential by a (possibly stochastic) weight corresponding to the axon type. After all axons are processed, each neuron applies a configurable, possibly stochastic leak, and a neuron whose membrane potential exceeds its threshold fires a spike. Each spike is then delivered via the communication network to its corresponding target axon. An axon that receives a spike schedules the spike for delivery at a future time step in its buffer. The entire cycle repeats when the TrueNorth core receives the next tick from the slow clock.
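The tick cycle described above can be summarized in a short sketch. The code below is an illustrative, much-simplified single-core model (all names hypothetical, with scalar leak and threshold, and none of the stochastic weight and leak options of the real hardware):

```python
def tick(axon_buffer, crossbar, weights, potentials, leak, threshold, t):
    """One simulated tick of a single (hypothetical) neurosynaptic core.

    axon_buffer[i] holds the future ticks at which axon i has a spike
    scheduled; crossbar[i][j] is 1 if axon i connects to neuron j.
    Returns the indices of neurons that fired this tick.
    """
    n_axons, n_neurons = len(crossbar), len(potentials)
    # Synapse step: deliver scheduled spikes along crossbar rows.
    for i in range(n_axons):
        if t in axon_buffer[i]:
            axon_buffer[i].discard(t)
            for j in range(n_neurons):
                if crossbar[i][j]:
                    potentials[j] += weights[i][j]
    # Neuron step: leak, then fire-and-reset on threshold crossing.
    fired = []
    for j in range(n_neurons):
        potentials[j] -= leak
        if potentials[j] > threshold:
            fired.append(j)
            potentials[j] = 0
    return fired
```

In the real system, each fired index would become a spike routed to a target axon's buffer on some core, to be consumed at a future tick.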

Neurons represent computation, synapses represent memory, and neuron-axon connections represent communication. Each TrueNorth core brings computation and memory into extreme proximity, breaking the von Neumann bottleneck. Note that synaptic or neuronal state never leaves a TrueNorth core; only spikes ever leave or enter any TrueNorth core. The communication network is driven solely by spike events and requires no clocks. The neuron parameters, synaptic crossbar, and target axon for each neuron are reconfigurable throughout the system.

To ensure one-to-one equivalence between TrueNorth and Compass, we have taken judicious care in the design of both systems. For example, we have bypassed the complexity of analog neurons traditionally associated with neuromorphic systems. Also, we have adopted pseudo-random number generators with configurable seeds. As a result, Compass has become the key contract between our hardware architects and software algorithm/application designers.

III. THE COMPASS NEUROSYNAPTIC CHIP SIMULATOR

Compass is an architectural simulator for large-scale network models of TrueNorth cores, and is implemented using a combination of MPI library calls and OpenMP threading primitives. Compass partitions the TrueNorth cores in a model across several processes, and distributes the TrueNorth cores residing in the same shared memory space within a process among multiple threads. Threads within a process independently simulate the synaptic crossbar and neuron behavior of one or more TrueNorth cores in the model. A master thread then sends any spikes destined for TrueNorth cores on remote processes in an aggregated fashion; upon reception, all threads deliver spikes to TrueNorth cores on the local process.

Compass executes its main simulation loop in a semi-synchronous manner. Listing 1 shows pseudo-code for the main simulation phases that each process in Compass executes. Each pass through the phases simulates a single tick of the global clock in a system of TrueNorth cores. In the Synapse phase, each process propagates input spikes from axons to neurons through the crossbar. Next, in the Neuron phase, each process executes the integrate, leak, and fire model for each

#pragma omp shared(localBuf, remoteBuf)
#pragma omp shared(TrueNorthCores, remoteBufAgg)
#pragma omp parallel
{
    for core in TrueNorthCores[threadID] {
        // Synapse phase
        for axon in core.axons
            axon.propagateSpike()
        // Neuron phase
        for neuron in core.neurons {
            spike S = neuron.integrateLeakFire()
            if (S.dest == local)
                localBuf[threadID].push(S)
            else
                remoteBuf[threadID][S.dest].push(S)
        }
    }
    #pragma omp barrier
    threadAggregate(remoteBuf, remoteBufAgg)
    if (threadID == 0) {
        for dest in remoteDests {
            if (remoteBufAgg[dest].size() != 0) {
                MPI_Isend(remoteBufAgg[dest])
                sendCounts[dest]++
            }
        }
    }
    // Network phase
    if (threadID == 0) {
        MPI_Reduce_scatter(sendCounts, recvCount)
    } else {
        threadBuf = partition(localBuf, thread)
        deliver(spikes in threadBuf)
    }
    #pragma omp barrier
    #pragma omp for
    for message in recvCount messages {
        #pragma omp critical
        {
            MPI_Iprobe(status)
            MPI_Get_count(status, recvLength)
            MPI_Recv(recvBuf, recvLength)
        }   // end critical
        deliver(spikes in recvBuf)
    }   // end omp for
}   // end omp parallel

Listing 1. Pseudo-code of the main simulation loop in Compass.

neuron. Last, in the Network phase, processes communicate to send spikes from firing neurons and deliver spikes to destination axons.

To minimize communication overhead, Compass aggregates spikes between pairs of processes into a single MPI message. At startup, each process in Compass gets all destination (TrueNorth core, axon) pairs from the neurons within its hosted TrueNorth cores, uses an implicit TrueNorth-core-to-process map to identify all destinations on remote processes, and preallocates per-process send buffers. During a tick, the process aggregates all spikes for TrueNorth cores at remote processes into the corresponding buffer, and uses a single MPI send call to transmit each buffer once all neurons have integrated, leaked, and fired.
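A minimal sketch of this aggregation step (names hypothetical; the real implementation preallocates per-process buffers at startup rather than building a dictionary per tick):

```python
def aggregate_spikes(spikes, core_to_process):
    """Group outgoing spikes by destination process so that each pair
    of processes exchanges at most one MPI message per tick."""
    send_buf = {}
    for target_core, target_axon in spikes:
        proc = core_to_process[target_core]
        send_buf.setdefault(proc, []).append((target_core, target_axon))
    return send_buf
```

Each value in the returned map would then be shipped with a single send call to its destination process.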

In detail, each Compass process forks multiple OpenMP threads, which execute the main Compass phases in parallel as follows.

• Synapse phase: For each axon within a set of TrueNorth cores dedicated to a specific thread, the thread checks if a spike is ready for delivery. If a spike is ready, Compass propagates the spike along the row in the synaptic crossbar connected to the axon. For each set connection in the row, Compass buffers the spike for integration at the neuron connected to the set connection. For example, if the ith axon in a TrueNorth core has a spike ready for delivery and the (i, j)th connection is set in the crossbar, then Compass propagates the spike to the jth neuron.

• Neuron phase: For each neuron within a TrueNorth core, threads integrate all propagated spikes, leak potential, and generate spikes as appropriate. Threads then aggregate spikes destined for TrueNorth cores at remote processes into per-process remote destination buffers (remoteBufAgg), so that spikes are consecutively laid out in memory for MPI message transfers. The master thread within each process lastly uses a single MPI message to send each remote destination buffer to the corresponding remote process.

• Network phase: The master thread uses an MPI Reduce-Scatter operation to determine how many incoming messages to expect. In parallel, Compass divides the local destination buffer uniformly among all non-master threads, each of which then immediately delivers spikes from its portion to the destination TrueNorth cores (in detail, to the corresponding axon buffer at a specific destination axon). Performance is improved since the processing of local spikes by non-master threads overlaps with the Reduce-Scatter operation performed by the master thread.
After the master thread completes the Reduce-Scatter operation, and after all non-master threads deliver spikes from the local buffer, all threads take turns to receive an MPI message and deliver each spike contained in the message to its destination TrueNorth core residing in the shared memory space of the Compass process. Each thread receives MPI messages in a critical section due to thread-safety issues in the MPI library [23], but delivers the spikes within the messages outside of the critical section.

IV. PARALLEL COMPASS COMPILER

Compass is capable of simulating networks of tens of millions of TrueNorth cores. To build applications for such large-scale TrueNorth networks, we envisage first implementing libraries of functional primitives that run on one or more interconnected TrueNorth cores. We can then build richer applications by instantiating and connecting regions of functional primitives. Configuring such rich applications in Compass (or, for that matter, on TrueNorth hardware) can potentially require setting trillions of parameters. To make the configuration more tractable, we require tools to create lower-level core parameter specifications from higher-level, more compact descriptions of the functional regions.

We have designed and implemented a parallel processing tool called the Parallel Compass Compiler (PCC) that translates a compact definition of functional regions of TrueNorth cores into the explicit neuron parameter, synaptic connection parameter, and neuron-to-axon connectivity declarations required by Compass. PCC works to minimize MPI message counts within the Compass main simulation loop by assigning TrueNorth cores in the same functional region to as few Compass processes as necessary. This minimization enables Compass to use faster shared-memory communication to handle most intra-region spiking, reserving more expensive MPI communication for mostly inter-region spiking. Overall, the number of Compass processes required to simulate a particular region corresponds to the computation (neuron count) and memory (synaptic connection count) required for the region, which in turn is related to the functional complexity of the region.
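One way to picture this placement policy is the following sketch (hypothetical; actual PCC placement also accounts for synaptic memory, not just core counts): each region is given just enough contiguous process ranks to hold its cores, so cores from the same region share processes wherever possible.

```python
def assign_processes(region_cores, cores_per_process):
    """Give each functional region a contiguous block of process
    ranks, sized by ceiling division of its TrueNorth core count."""
    assignment, next_rank = {}, 0
    for region, ncores in region_cores.items():
        nprocs = -(-ncores // cores_per_process)  # ceil division
        assignment[region] = list(range(next_rank, next_rank + nprocs))
        next_rank += nprocs
    return assignment
```

With such a packing, most intra-region spikes stay within a process (shared memory), and MPI traffic is dominated by inter-region edges.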

We show an illustrative example of three inter- and intra-connected functional regions in Figure 2. In our example, three processes manage each of the functional regions A and C, and two processes manage functional region B. For illustration clarity, we present only connections of process 0 (in region A) and of process 3 (in region B) to region C. The thickness of the edges between processes reflects the strength of the connection; thicker edges represent more neuron connections.

PCC constructs the connectivity graph between functional regions using a distributed algorithm in which each parallel PCC process compiles TrueNorth core parameters for at most one functional region. To create neuron-to-axon connections between regions, the PCC process managing the target region uses MPI message operations to send the global core ID and axon ID of an available axon to the PCC process managing the source region. To create neuron-to-axon connections within a region (for example, as with process P0 in Figure 2), a PCC process connects cores locally; we use OpenMP threading primitives to exploit thread-level parallelism.

This exchange of information happens in an aggregated, per-process-pair fashion. As illustrated in Figure 2, if K neurons on P0 connect to P5, P5 needs to send the TrueNorth core IDs and the K axon IDs to P0 using MPI_Isend. The axon types and the synaptic crossbars on the corresponding TrueNorth cores on P5 are set up simultaneously. On P0, the received TrueNorth core IDs and axon IDs are used to connect the source TrueNorth cores on P0 with the target TrueNorth cores on P5. We require a realizability mechanism for connections to guarantee that each target process has enough TrueNorth cores to satisfy incoming connection requests. One way to accomplish this is to normalize the network to have consistent neuron and axon requirements. This is equivalent to normalizing the connection matrix to have identical pre-specified column sums and row sums [24], [25], [26], a generalization of doubly stochastic matrices. This procedure is known as the iterative proportional

Fig. 2. Illustration of a neurosynaptic core network built using the Parallel Compass Compiler across multiple processes, showing connectivity within and across functional regions.

fitting procedure (IPFP) in statistics, and as matrix balancing in linear algebra.
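As a concrete illustration, the sketch below (assuming the target sums are themselves consistent, i.e. the row targets and column targets have equal totals) alternately rescales rows and columns toward the prescribed sums; with all targets equal to one, it reduces to the Sinkhorn-Knopp iteration for doubly stochastic matrices [24], [26]:

```python
def ipfp_balance(matrix, row_sums, col_sums, iters=200):
    """Iterative proportional fitting: rescale a nonnegative matrix
    until its row and column sums match the prescribed targets."""
    m = [row[:] for row in matrix]
    for _ in range(iters):
        for i, row in enumerate(m):                 # match row sums
            s = sum(row)
            if s:
                m[i] = [v * row_sums[i] / s for v in row]
        for j in range(len(m[0])):                  # match column sums
            s = sum(row[j] for row in m)
            if s:
                for row in m:
                    row[j] *= col_sums[j] / s
    return m
```

In the PCC setting, the matrix entries would be region-to-region connection counts, and the targets would encode each region's neuron and axon budgets.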

The high-level network description describing the network connectivity is expressed in a relatively small and compact CoreObject file. For large-scale simulations of millions of TrueNorth cores, the network model specification for Compass can be on the order of several terabytes. Offline generation and copying of such large files is impractical. Parallel model generation using the compiler requires only a few minutes, as compared to several hours to read or write such a model to disk. Once the compiler completes the wiring of the neurosynaptic cores across all regions, the TrueNorth cores from each process are instantiated within Compass, and the TrueNorth cores in the compiler are deallocated to free up space. Finally, Compass simulates the network model that was created.

V. COCOMAC MACAQUE BRAIN MODEL NETWORK

Higher cognition in the brain is thought to emerge from the thalamocortical system, which can be divided into functionally specialized thalamic or cortical regions. Each region is composed of millions or billions of neurons, cells specialized to receive, integrate, and send messages. Long-range white matter connections provide communication between neurons in different regions, while short-range gray matter connections provide for communication between neurons within the same region. Tremendous efforts towards mapping the brain and its connectivity have been made over decades of research, a tiny fraction of which is drawn upon here to construct a simple network that nevertheless captures several key characteristics desirable for TrueNorth simulations. In building our test network, we establish natural parallels between brain regions, white matter connectivity, gray matter connectivity, and the underlying simulator hardware. We simulate each brain region using non-overlapping sets of one or more processes, such that white matter communication becomes equivalent to inter-process communication, and we then establish gray matter connectivity as within-process communication.

A. Structure

It is hypothesized that distinct functions are supported by signature subnetworks throughout the brain that facilitate information flow, integration, and cooperation across functionally differentiated, distributed centers. We sought to capture the richness found in these networks by specifying our test network's modules, and the connectivity between them, using a very large scale database of brain regions and connectivity in the macaque monkey called CoCoMac [27], [28]. Decades of white matter anatomical tracing studies in the macaque brain have been painstakingly assembled in the CoCoMac database. In previous efforts, we integrated the contributions made to CoCoMac by many disparate labs to produce a cohesive, conflict-free network of brain regions in the macaque monkey that is three times larger than the largest previous such network [9]. Here, we reduced this network to 77 brain regions, based on the white matter connectivity metrics described below, to make it more amenable to simulation. We derived volumetric information for each region from the Paxinos brain atlas [29] and accompanying software [30], which in turn was used to set relative neuron counts for each region. Volume information was not available for 5 cortical and 8 thalamic regions, and so was approximated using the median size of the other cortical or thalamic regions, respectively.

B. Long range connectivity

In the CoCoMac network, each brain region is represented as a vertex, and the presence of a white matter connection between two brain regions is represented as an edge between corresponding vertices. The derived network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia, and has 6602 directed edges that capture well-known cortico-cortical, cortico-subcortical, and intra-subcortical white matter pathways [9]. In the many studies reported to CoCoMac, there are often cases where one lab reports connections for a brain region while another lab subdivides that brain region and reports connections for one of the child subregions. For simplicity, we reduced the network by merging a child subregion into a parent region where both child and parent regions report connections. We do this by ORing the connections of the child region with those of the parent region. The smaller, lower resolution network consists of 102 regions, 77 of which report connections. In the brain, the topography of connections between regions has in some cases been shown to be highly structured and focused [31], and in other cases very diffuse [32]. For a given vertex in our reduced CoCoMac graph, such focused connections correspond to a source process targeting a single process in the target region, while diffuse connections correspond to a source process targeting multiple processes in the target region (if more than one such process exists). We choose to use long range connections that are as diffuse as possible in our test network because this places the largest burden on the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.

Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth core model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.
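The parent/child merge described above amounts to a logical OR of the child's adjacency row and column into the parent's, after which the child vertex is dropped. A minimal sketch on a hypothetical 3-region graph (the adjacency values are illustrative, not CoCoMac data):

```python
import numpy as np

def merge_child_into_parent(adj, parent, child):
    """OR the child's connections into the parent and drop the child.

    `adj` is a boolean adjacency matrix where adj[i, j] = True means a
    reported white matter connection from region i to region j.
    """
    adj = adj.copy()
    adj[parent, :] |= adj[child, :]   # outgoing connections of the child
    adj[:, parent] |= adj[:, child]   # incoming connections to the child
    keep = [i for i in range(adj.shape[0]) if i != child]
    return adj[np.ix_(keep, keep)]

# Regions: 0 = parent, 1 = child (subregion of 0), 2 = some other region.
adj = np.array([[False, False, False],
                [False, False, True],    # child -> other
                [True,  False, False]])  # other -> parent
merged = merge_child_into_parent(adj, parent=0, child=1)
```

After the merge, the parent inherits the child's outgoing edge to region 2, and the graph shrinks from 3 vertices to 2.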

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, targets are invariably within a 2-3 mm radius from the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on possible neuron targets from gray matter connectivity as the nearest 3 million or so neurons, which fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.

We establish our ratio of long range to local connectivity in approximately a 60/40 ratio for cortical regions and in an 80/20 ratio for non-cortical regions. From CoCoMac, we obtained a binary connection matrix. This matrix is converted to a stochastic matrix with the above gray matter percentages on the diagonals, and with white matter connections set to be proportional to the volume percentage of the outgoing region. Then the connection matrix was balanced, as described in section IV, to make the row and column sums equal to the volume of that region. The end result guarantees a realizable model in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in figure 3.
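The balancing step can be sketched as alternating row/column rescaling in the spirit of the Sinkhorn-Knopp iteration cited in the references [24], [26]. This is an illustrative sketch, not the Compass compiler's code; the 3-region matrix and volumes below are hypothetical.

```python
import numpy as np

def balance(M, target, iters=1000):
    """Alternately rescale rows and columns of the nonnegative matrix M
    so that both row sums and column sums approach `target` (one entry
    per region), Sinkhorn-Knopp style. Assumes M is strictly positive
    and the targets are consistent (same total for rows and columns).
    """
    M = M.astype(float).copy()
    for _ in range(iters):
        M *= (target / M.sum(axis=1))[:, None]  # fix row sums
        M *= (target / M.sum(axis=0))[None, :]  # fix column sums
    return M

# Hypothetical 3-region stochastic connection matrix (gray matter
# fractions on the diagonal) and relative region volumes.
M = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.6, 0.1],
              [0.1, 0.1, 0.8]])
vol = np.array([2.0, 1.0, 1.0])
B = balance(M, vol)
```

Once balanced, every region's outgoing and incoming connection mass equals its volume, which is what guarantees that all axon and neuron requests can be fulfilled.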

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass for even the largest simulated system of TrueNorth cores is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process² to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes

²We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512 cores.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass") and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1) as functions of available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds³, or 388× slower than real time.
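The headline figures above can be cross-checked with back-of-the-envelope arithmetic; a tick corresponds to 1 ms of simulated time, so 194 seconds for 500 ticks gives the reported 388× slowdown:

```python
# Cross-check of the reported weak-scaling numbers.
wall_clock_s = 194.0
ticks = 500
tick_real_time_s = 1e-3                          # each tick models 1 ms

seconds_per_tick = wall_clock_s / ticks          # 0.388 s per simulated tick
slowdown = seconds_per_tick / tick_real_time_s   # 388x slower than real time

# Cores simulated: 16384 cores per node on 16384 nodes.
cores = 16384 * 16384                            # 268,435,456 = 256 * 2**20,
                                                 # the "256M" figure
```

The "256M" and "65B" labels use binary-style rounding (256 × 2²⁰ cores; with 256 neurons per core this is 64 × 2³⁰ neurons).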

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner, and therefore less frequented, with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.
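The bandwidth headroom claim follows directly from the spike count and size:

```python
# Spike traffic per simulated tick for the largest system:
# ~22 million white matter spikes at 20 bytes per spike.
spikes_per_tick = 22e6
bytes_per_spike = 20

gb_per_tick = spikes_per_tick * bytes_per_spike / 1e9  # 0.44 GB per tick
link_bandwidth_gb_s = 2.0  # per 5D-torus link on Blue Gene/Q
# 0.44 GB per tick is well under a single link's per-second capacity.
```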

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time

³We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

(a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

creates a serial bottleneck at all thread counts.

In other multi-threaded experiments, we found that trading

off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly similar performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison,

we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation (1 millisecond of wall-clock time per 1 millisecond of simulated time) is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
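The one-sided scheme can be illustrated with a toy single-process model (this is not the UPC implementation; plain Python lists stand in for the globally-addressable buffers, and the function and buffer names are hypothetical). The key point is that senders deposit spikes directly into the owner's buffer, so no receiver-side message-count bookkeeping is needed:

```python
# Toy model of one-sided spike delivery: each "process" owns a
# globally addressable spike buffer; senders put spikes directly
# into the target's buffer, and a single barrier would separate the
# write phase from the read phase. No Reduce-Scatter is required to
# tell receivers how many messages to expect.
NUM_PROCS = 4
global_buffers = [[] for _ in range(NUM_PROCS)]  # one buffer per process

def put_spike(target_proc, axon_id):
    """One-sided write into the target process's buffer."""
    global_buffers[target_proc].append(axon_id)

# Write phase: processes deposit spikes without coordinating counts.
put_spike(2, axon_id=7)
put_spike(2, axon_id=9)
put_spike(0, axon_id=1)
# ... a single global barrier would go here in a UPC/GASNet run ...
# Read phase: each process simply drains its own buffer.
```

Because spike ordering within a tick is irrelevant to the next tick's Synapse and Neuron phases, this unordered deposit-then-drain pattern is sufficient.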

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system⁴, 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

⁴We do not use the CoCoMac model for real-time simulations because the size of simulated system has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely, TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology, but, now informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the US Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613–641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13485–13490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. Oxford University Press, New York, New York, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing '07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10⁹ neurons, 10¹³ synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 63:1–63:12. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063–1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153–160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50–55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Comput., vol. 23, no. 6, pp. 1503–1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31–41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN'09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201–3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456–458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomput., vol. 74, no. 1-3, pp. 3–29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 2007, pp. 349–398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83–90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261–275, 2008.

[27] R. Kotter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127–144, 2004.

[28] "CoCoMac (Collations of Connectivity data on the Macaque brain)," www.cocomac.org

[29] G. Paxinos, X. Huang, and M. Petrides, The Rhesus Monkey Brain in Stereotaxic Coordinates. Academic Press, 2008. [Online]. Available: http://books.google.co.in/books?id=7HW6HgAACAAJ

[30] B. G. Kotter, R. Reid, A. T., "An introduction to the CoCoMac-Paxinos-3D viewer," in The Rhesus Monkey Brain in Stereotaxic Coordinates, G. Paxinos, X.-F. Huang, M. Petrides, and A. Toga, Eds. Elsevier, San Diego, 2008, ch. 4.

[31] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, pp. 106–154, 1962.

[32] A. Peters and E. G. Jones, Cerebral Cortex: Vol. 4, Association and Auditory Cortices. Springer, 1985.

[33] V. B. Mountcastle, "The columnar organization of the neocortex," Brain, vol. 120, no. 4, pp. 701–722, 1997.

[34] C. E. Collins, D. C. Airey, N. A. Young, D. B. Leitch, and J. H. Kaas, "Neuron densities vary across and within cortical areas in primates," Proceedings of the National Academy of Sciences, vol. 107, no. 36, pp. 15927–15932, 2010.

[35] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48–60, 2012.

[36] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q interconnection network and message unit," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 26:1–26:10. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063419

[37] "The Berkeley UPC Compiler," 2002, http://upc.lbl.gov

[38] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2009.5161076

[39] D. Bonachea, "GASNet Specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.

[40] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer, "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS '08. New York, NY, USA: ACM, 2008, pp. 94–103. [Online]. Available: http://doi.acm.org/10.1145/1375527.1375544

[41] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable Communication Protocols for Dynamic Sparse Data Exchange," in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'10). ACM, Jan. 2010, pp. 159–168.

[42] D. S. Modha, "A conceptual cortical surface atlas," PLoS ONE, vol. 4, no. 6, p. e5693, 2009.

[43] A. J. Sherbondy, R. Ananthanrayanan, R. F. Dougherty, D. S. Modha, and B. A. Wandell, "Think global, act local; projectome estimation with BlueMatter," in Proceedings of MICCAI 2009, Lecture Notes in Computer Science, 2009.

[44] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nano-scale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, forthcoming 2012.


In detail, each Compass process forks multiple OpenMP threads, which execute the main Compass phases in parallel as follows:

• Synapse phase: For each axon within a set of TrueNorth cores dedicated to a specific thread, the thread checks if a spike is ready for delivery. If a spike is ready, Compass propagates the spike along the row in the synaptic crossbar connected to the axon. For each set connection in the row, Compass buffers the spike for integration at the neuron connected to the set connection. For example, if the ith axon in a TrueNorth core has a spike ready for delivery and the (i, j)th connection is set in the crossbar, then Compass propagates the spike to the jth neuron.

• Neuron phase: For each neuron within a TrueNorth core, threads integrate all propagated spikes, leak potential, and generate spikes as appropriate. Threads then aggregate spikes destined for TrueNorth cores at remote processes into per-process remote destination buffers (remoteBufAgg), so that spikes are consecutively laid out in memory for MPI message transfers. The master thread within each process lastly uses a single MPI message to send each remote destination buffer to the corresponding remote process.

• Network phase: The master thread uses an MPI Reduce-Scatter operation to determine how many incoming messages to expect. In parallel, Compass divides the local destination buffer uniformly among all non-master threads, each of which then immediately delivers spikes from its portion to the destination TrueNorth cores (in detail, to the corresponding axon buffer at a specific destination axon). Performance is improved since the processing of local spikes by non-master threads overlaps with the Reduce-Scatter operation performed by the master thread.
After the master thread completes the Reduce-Scatter operation, and after all non-master threads deliver spikes from the local buffer, all threads take turns to receive an MPI message and deliver each spike contained in the message to its destination TrueNorth core residing in the shared memory space of the Compass process. Each thread receives MPI messages in a critical section due to thread-safety issues in the MPI library [23], but delivers the spikes within the messages outside of the critical section.

IV. PARALLEL COMPASS COMPILER

Compass is capable of simulating networks of tens of millions of TrueNorth cores. To build applications for such large-scale TrueNorth networks, we envisage first implementing libraries of functional primitives that run on one or more interconnected TrueNorth cores. We can then build richer applications by instantiating and connecting regions of functional primitives. Configuring such rich applications in Compass (or, for that matter, on TrueNorth hardware) can potentially require setting trillions of parameters. To make the configuration more tractable, we require tools to create lower-level core parameter specifications from higher-level, more compact descriptions of the functional regions.

We have designed and implemented a parallel processing tool called the Parallel Compass Compiler (PCC) that translates a compact definition of functional regions of TrueNorth cores into the explicit neuron parameter, synaptic connection parameter, and neuron-to-axon connectivity declarations required by Compass. PCC works to minimize MPI message counts within the Compass main simulation loop by assigning TrueNorth cores in the same functional region to as few Compass processes as necessary. This minimization enables Compass to use faster shared memory communication to handle most intra-region spiking, reserving more expensive MPI communication mostly for inter-region spiking. Overall, the number of Compass processes required to simulate a particular region corresponds to the computation (neuron count) and memory (synaptic connection count) required for the region, which in turn is related to the functional complexity of the region.

We show an illustrative example of three inter- and intra-connected functional regions in Figure 2. In our example, three processes manage each of the functional regions A and C, and two processes manage functional region B. For illustration clarity, we present only connections of process 0 (in region A) and of process 3 (in region B) to region C. The thickness of the edges between processes reflects the strength of the connection; thicker edges represent more neuron connections.

PCC constructs the connectivity graph between functional regions using a distributed algorithm in which each parallel PCC process compiles TrueNorth core parameters for at most one functional region. To create neuron-to-axon connections between regions, the PCC process managing the target region uses MPI message operations to send the global core ID and axon ID of an available axon to the PCC process managing the source region. To create neuron-to-axon connections within a region (for example, as with process P0 in Figure 2), a PCC process connects cores locally; here, we use OpenMP threading primitives to exploit thread-level parallelism.

This exchange of information happens in an aggregated, per-process-pair fashion. As illustrated in Figure 2, if K neurons on P0 connect to P5, P5 needs to send the TrueNorth core IDs and the K axon IDs to P0 using MPI_Isend. The axon types and the synaptic crossbar on the corresponding TrueNorth cores on P5 are set up simultaneously. On P0, the received TrueNorth core IDs and axon IDs are used to connect the source TrueNorth cores on P0 with the target TrueNorth cores on P5. We require a realizability mechanism for connections to guarantee that each target process has enough TrueNorth cores to satisfy incoming connection requests. One way to accomplish this is to normalize the network to have consistent neuron and axon requirements. This is equivalent to normalizing the connection matrix to have identical pre-specified column and row sums [24], [25], [26], a generalization of doubly stochastic matrices. This procedure is known as the iterative proportional fitting procedure (IPFP) in statistics and as matrix balancing in linear algebra.

Fig. 2. Illustration of a neurosynaptic core network built using the Parallel Compass Compiler across multiple processes, showing connectivity within and across functional regions.

The high-level network description describing the network connectivity is expressed in a relatively small and compact CoreObject file. For large scale simulation of millions of TrueNorth cores, the network model specification for Compass can be on the order of several terabytes; offline generation and copying of such large files is impractical. Parallel model generation using the compiler requires only a few minutes, as compared to several hours to read or write such a specification to disk. Once the compiler completes the wiring of the neurosynaptic cores across all regions, the TrueNorth cores from each process are instantiated within Compass, and the TrueNorth cores in the compiler are deallocated to free up space. Finally, Compass simulates the network model that was created.

V. COCOMAC MACAQUE BRAIN MODEL NETWORK

Higher cognition in the brain is thought to emerge from the thalamocortical system, which can be divided into functionally specialized thalamic or cortical regions. Each region is composed of millions or billions of neurons, cells specialized to receive, integrate, and send messages. Long range white matter connections provide communication between neurons in different regions, while short range gray matter connections provide for communication between neurons within the same region. Tremendous efforts towards mapping the brain and its connectivity have been made over decades of research, a tiny fraction of which will be drawn from here to construct a simple network that nevertheless captures several key characteristics desirable for TrueNorth simulations. In building our test network, we establish natural parallels between brain regions, white matter connectivity, gray matter connectivity, and the underlying simulator hardware. We simulate each brain region using non-overlapping sets of 1 or more processes, such that white matter communication becomes equivalent to inter-process communication, and we then establish gray matter connectivity as within-process communication.

A. Structure

It is hypothesized that distinct functions are supported by signature subnetworks throughout the brain that facilitate information flow, integration, and cooperation across functionally differentiated, distributed centers. We sought to capture the richness found in these networks by specifying our test network's modules, and the connectivity between them, using a very large scale database of brain regions and connectivity in the macaque monkey called CoCoMac [27], [28]. Decades of white matter anatomical tracing studies in the macaque brain have been painstakingly assembled in the CoCoMac database. In previous efforts, we have integrated the contributions made to CoCoMac by many disparate labs to produce a cohesive, conflict-free network of brain regions in the macaque monkey that is three times larger than the largest previous such network [9]. Here, we reduced this network to 77 brain regions, based on the white matter connectivity metrics described below, to make it more amenable to simulation. We derived volumetric information for each region from the Paxinos brain atlas [29] and accompanying software [30], which in turn was used to set relative neuron counts for each region. Volume information was not available for 5 cortical and 8 thalamic regions, and so was approximated using the median size of the other cortical or thalamic regions, respectively.

B. Long range connectivity

In the CoCoMac network, each brain region is represented as a vertex, and the presence of a white matter connection between two brain regions is represented as an edge between the corresponding vertices. The derived network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia, and has 6602 directed edges that capture well-known cortico-cortical, cortico-subcortical, and intra-subcortical white matter pathways [9]. In the many studies reported to CoCoMac, there are often cases where one lab reports connections for a brain region while another lab subdivides that brain region and reports connections for one of the child subregions. For simplicity, we reduced the network by merging a child subregion into a parent region where both child and parent regions report connections. We do this by ORing the connections of the child region with those of the parent region. The smaller, lower resolution network consists of 102 regions, 77 of which report connections. In the brain, the topography of connections between regions has in some cases been shown to be highly structured and focused [31], and in other cases very diffuse [32]. For a given vertex in our reduced CoCoMac graph, such focused connections correspond to a source process targeting a single process in the target region, while diffuse connections correspond to a source process targeting multiple processes in the target region (if more than one such process exists). We choose to use long range connections that are as diffuse as possible in our test network, because this places the largest burden on the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.

Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth cores model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, the targets are invariably within a 2-3 mm radius of the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on the possible neuron targets reachable through gray matter connectivity of roughly the nearest 3 million neurons; this fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.

We establish our ratio of long range to local connectivity in approximately a 60/40 ratio for cortical regions and in an 80/20 ratio for non-cortical regions. From CoCoMac, we obtained a binary connection matrix. This matrix is converted to a stochastic matrix with the above gray matter percentages on the diagonals, and with white matter connections set to be proportional to the volume percentage of the outgoing region. Then the connection matrix was balanced, as described in section IV, to make the row and column sums equal to the volume of that region. The end result guarantees a realizable model, in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in Figure 3.

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass, even for the largest simulated system of TrueNorth cores, is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process² to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512 cores.

²We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass") and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1), as functions of the available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds³, or 388× slower than real time.

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick, as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner, and therefore less frequented, with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in Figure 4(a), we show the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).

³We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process. (a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that creates a serial bottleneck at all thread counts.

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly the same performance as runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison, we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation, that is, 1 millisecond of wall-clock time per 1 millisecond of simulated time, is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system⁴, 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we report only the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

⁴We do not use the CoCoMac model for real-time simulations because a simulated system of this size has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann, cognitive computing architecture, namely, TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture to algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology. Now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to the IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to the IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P Merolla J Arthur F Akopyan N Imam R Manohar and D ModhaldquoA digital neurosynaptic core using embedded crossbar memory with45pj per spike in 45nmrdquo in Custom Integrated Circuits Conference(CICC) 2011 IEEE sept 2011 pp 1 ndash4

[4] J Seo B Brezzo Y Liu B Parker S Esser R Montoye B RajendranJ Tierno L Chang D Modha and D Friedman ldquoA 45nm cmosneuromorphic chip with a scalable architecture for learning in networksof spiking neuronsrdquo in Custom Integrated Circuits Conference (CICC)2011 IEEE sept 2011 pp 1 ndash4

[5] N Imam F Akopyan J Arthur P Merolla R Manohar and D SModha ldquoA digital neurosynaptic core using event-driven qdi circuitsrdquo inASYNC 2012 IEEE International Symposium on Asynchronous Circuitsand Systems 2012

[6] J V Arthur P A Merolla F Akopyan R Alvarez-Icaza A CassidyS Chandra S K Esser N Imam W Risk D Rubin R Manohar andD S Modha ldquoBuilding block of a programmable neuromorphic sub-strate A digital neurosynaptic corerdquo in International Joint Conferenceon Neural Networks 2012

[7] J W Backus ldquoCan programming be liberated from the von neumannstyle a functional style and its algebra of programsrdquo Communicationsof the ACM vol 21 no 8 pp 613ndash641 1878

[8] V B Mountcastle Perceptual Neuroscience The Cerebral CortexHarvard University Press 1998

[9] D S Modha and R Singh ldquoNetwork architecture of the long distancepathways in the macaque brainrdquo Proceedings of the National Academyof the Sciences USA vol 107 no 30 pp 13 485ndash13 490 2010[Online] Available httpwwwpnasorgcontent1073013485abstract

[10] C Koch Biophysics of Computation Information Processing in SingleNeurons Oxford University Press New York New York 1999

[11] R Ananthanarayanan and D S Modha ldquoAnatomy of a cortical simu-latorrdquo in Supercomputing 07 2007

[12] R Ananthanarayanan S K Esser H D Simon and D S ModhaldquoThe cat is out of the bag cortical simulations with 109 neurons 1013

synapsesrdquo in Proceedings of the Conference on High PerformanceComputing Networking Storage and Analysis ser SC rsquo09 NewYork NY USA ACM 2009 pp 631ndash6312 [Online] Availablehttpdoiacmorg10114516540591654124

[13] E M Izhikevich ldquoWhich model to use for cortical spiking neuronsrdquoIEEE Transactions on Neural Networks vol 15 pp 1063ndash1070 2004[Online] Available httpwwwnsieduusersizhikevichpublicationswhichmodpdf

[14] H Markram ldquoThe Blue Brain Projectrdquo Nature Reviews Neurosciencevol 7 no 2 pp 153ndash160 Feb 2006 [Online] Availablehttpdxdoiorg101038nrn1848

[15] H Markram ldquoThe Human Brain Projectrdquo Scientific American vol 306no 6 pp 50ndash55 Jun 2012

[16] R Brette and D F M Goodman ldquoVectorized algorithms forspiking neural network simulationrdquo Neural Comput vol 23 no 6pp 1503ndash1535 2011 [Online] Available httpdxdoiorg101162NECO a 00123

[17] M Djurfeldt M Lundqvist C Johansson M Rehn O Ekebergand A Lansner ldquoBrain-scale simulation of the neocortex on the ibmblue genel supercomputerrdquo IBM J Res Dev vol 52 no 12 pp31ndash41 Jan 2008 [Online] Available httpdlacmorgcitationcfmid=13759901375994

[18] E M Izhikevich and G M Edelman ldquoLarge-scale model ofmammalian thalamocortical systemsrdquo Proceedings of the NationalAcademy of Sciences vol 105 no 9 pp 3593ndash3598 Mar 2008[Online] Available httpdxdoiorg101073pnas0712231105

[19] J M Nageswaran N Dutt J L Krichmar A Nicolau andA Veidenbaum ldquoEfficient simulation of large-scale spiking neuralnetworks using cuda graphics processorsrdquo in Proceedings of the 2009international joint conference on Neural Networks ser IJCNNrsquo09Piscataway NJ USA IEEE Press 2009 pp 3201ndash3208 [Online]Available httpdlacmorgcitationcfmid=17045551704736

[20] M M Waldrop ldquoComputer modelling Brain in a boxrdquo Nature vol482 pp 456ndash458 2012 [Online] Available httpdxdoiorg101038482456a

[21] H de Garis C Shuo B Goertzel and L Ruiting ldquoA world surveyof artificial brain projects part i Large-scale brain simulationsrdquoNeurocomput vol 74 no 1-3 pp 3ndash29 Dec 2010 [Online]Available httpdxdoiorg101016jneucom201008004

[22] R Brette M Rudolph T Carnevale M Hines D Beeman J MBower M Diesmann A Morrison P H Goodman A P DavisonS E Boustani and A Destexhe ldquoSimulation of networks of spikingneurons A review of tools and strategiesrdquo Journal of ComputationalNeuroscience vol 2007 pp 349ndash398 2007

[23] D Gregor T Hoefler B Barrett and A Lumsdaine ldquoFixing Probe forMulti-Threaded MPI Applicationsrdquo Indiana University Tech Rep 674Jan 2009

[24] R Sinkhorn and P Knopp ldquoConcerning nonnegative matrices anddoubly stochastic matricesrdquo Pacific Journal of Mathematics vol 21no 2 pp 343ndash348 1967

[25] A Marshall and I Olkin ldquoScaling of matrices to achieve specifiedrow and column sumsrdquo Numerische Mathematik vol 12 pp 83ndash901968 101007BF02170999 [Online] Available httpdxdoiorg101007BF02170999

[26] P Knight ldquoThe sinkhorn-knopp algorithm convergence and applica-tionsrdquo SIAM Journal on Matrix Analysis and Applications vol 30 no 1pp 261ndash275 2008

[27] R Kotter ldquoOnline retrieval processing and visualization of primate con-nectivity data from the CoCoMac databaserdquo Neuroinformatics vol 2pp 127ndash144 2004

[28] ldquoCocomac (collations of connectivity data on the macaque brain)rdquo wwwcocomacorg

[29] G Paxinos X Huang and M Petrides The Rhesus Monkey Brain inStereotaxic Coordinates Academic Press 2008 [Online] Availablehttpbooksgooglecoinbooksid=7HW6HgAACAAJ

[30] B G Kotter R Reid AT ldquoAn introduction to the cocomac-paxinos-3d viewerrdquo in The Rhesus Monkey Brain in Stereotaxic CoordinatesG Paxinos X-F Huang M Petrides and A Toga Eds Elsevier SanDiego 2008 ch 4

[31] D H Hubel and T N Wiesel ldquoReceptive fields binocular interactionand functional architecture in the catrsquos visual cortexrdquo J Physiol vol160 no 1 pp 106ndash154 1962

[32] A Peters and E G Jones Cerebral Cortex Vol 4 Association andauditory cortices Springer 1985

[33] V B Mountcastle ldquoThe columnar organization of the neocortexrdquo Brainvol 120 no 4 pp 701ndash22 1997

[34] C E Collins D C Airey N A Young D B Leitch and J H KaasldquoNeuron densities vary across and within cortical areas in primatesrdquo

Proceedings of the National Academy of Sciences vol 107 no 36 pp15 927ndash15 932 2010

[35] R A Haring M Ohmacht T W Fox M K Gschwind D L Sat-terfield K Sugavanam P W Coteus P Heidelberger M A BlumrichR W Wisniewski A Gara G L-T Chiu P A Boyle N H Chist andC Kim ldquoThe IBM Blue GeneQ Compute Chiprdquo IEEE Micro vol 32pp 48ndash60 2012

[36] D Chen N A Eisley P Heidelberger R M Senger Y SugawaraS Kumar V Salapura D L Satterfield B Steinmacher-Burow andJ J Parker ldquoThe IBM Blue GeneQ interconnection network andmessage unitrdquo in Proceedings of 2011 International Conference forHigh Performance Computing Networking Storage and Analysis serSC rsquo11 New York NY USA ACM 2011 pp 261ndash2610 [Online]Available httpdoiacmorg10114520633842063419

[37] ldquoThe Berkeley UPC Compilerrdquo 2002 httpupclblgov[38] R Nishtala P H Hargrove D O Bonachea and K A Yelick

ldquoScaling communication-intensive applications on BlueGeneP usingone-sided communication and overlaprdquo in Proceedings of the 2009IEEE International Symposium on ParallelampDistributed Processingser IPDPS rsquo09 Washington DC USA IEEE Computer Society2009 pp 1ndash12 [Online] Available httpdxdoiorg101109IPDPS20095161076

[39] D Bonachea ldquoGASNet Specification v11rdquo University of California atBerkeley Berkeley CA USA Tech Rep 2002

[40] S Kumar G Dozsa G Almasi P Heidelberger D Chen M EGiampapa M Blocksome A Faraj J Parker J RattermanB Smith and C J Archer ldquoThe Deep Computing MessagingFramework Generalized Scalable Message Passing on the Blue GeneP

Supercomputerrdquo in Proceedings of the 22nd annual internationalconference on Supercomputing ser ICS rsquo08 New York NY USAACM 2008 pp 94ndash103 [Online] Available httpdoiacmorg10114513755271375544

[41] T Hoefler C Siebert and A Lumsdaine ldquoScalable CommunicationProtocols for Dynamic Sparse Data Exchangerdquo in Proceedings of the2010 ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPPrsquo10) ACM Jan 2010 pp 159ndash168

[42] D S Modha ldquoA conceptual cortical surface atlasrdquo PLoS ONE vol 4no 6 p e5693 2009

[43] A J Sherbondy R Ananthanrayanan R F Dougherty D S Modhaand B A Wandell ldquoThink global act local projectome estimationwith bluematterrdquo in Proceedings of MICCAI 2009 Lecture Notes inComputer Science 2009

[44] B L Jackson B Rajendran G S Corrado M Breitwisch G W BurrR Cheek K Gopalakrishnan S Raoux C T Rettner A G SchrottR S Shenoy B N Kurdi C H Lam and D S Modha ldquoNano-scale electronic synapses using phase change devicesrdquo ACM Journalon Emerging Technologies in Computing Systems forthcoming 2012


Fig. 2. Illustration of a neurosynaptic core network built using the Parallel Compass Compiler across multiple processes, showing connectivity within and across functional regions.

fitting procedure (IPFP) in statistics and as matrix balancing in linear algebra.

The high-level network description describing the network connectivity is expressed in a relatively small and compact CoreObject file. For large-scale simulations of millions of TrueNorth cores, the network model specification for Compass can be on the order of several terabytes, so offline generation and copying of such large files is impractical. Parallel model generation using the compiler requires only a few minutes, compared to the several hours needed to read or write such a model to disk. Once the compiler completes the wiring of the neurosynaptic cores across all regions, the TrueNorth cores from each processor are instantiated within Compass, and the TrueNorth cores in the compiler are deallocated to free up space. Finally, Compass simulates the network model that was created.

V. COCOMAC MACAQUE BRAIN MODEL NETWORK

Higher cognition in the brain is thought to emerge from the thalamocortical system, which can be divided into functionally specialized thalamic or cortical regions. Each region is composed of millions or billions of neurons, cells specialized to receive, integrate, and send messages. Long range white matter connections provide communication between neurons in different regions, while short range gray matter connections provide for communication between neurons within the same region. Tremendous efforts towards mapping the brain and its connectivity have been made over decades of research, a tiny fraction of which will be drawn from here to construct a simple network that nevertheless captures several key characteristics desirable for TrueNorth simulations. In building our test network, we establish natural parallels between brain regions, white matter connectivity, gray matter connectivity, and the underlying simulator hardware. We simulate each brain region using non-overlapping sets of 1 or more processes, such that white matter communication becomes equivalent to inter-process communication, and we then establish gray matter connectivity as within-process communication.
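The region-to-process mapping above can be sketched in a few lines. The following is an illustrative sketch only, not the Compass implementation; the proportional, largest-remainder allocation policy and the example region volumes are assumptions for illustration (it also assumes at least one process per region is available).

```python
# Illustrative sketch (not the Compass implementation): allocate a fixed pool
# of processes to brain regions in proportion to relative region volume, so
# that gray matter traffic stays within a region's process group and white
# matter traffic becomes inter-process communication.
def allocate_processes(region_volumes, num_processes):
    total = sum(region_volumes.values())
    # Ideal fractional share of the pool for each region.
    shares = {r: v / total * num_processes for r, v in region_volumes.items()}
    # Every region gets at least one process (assumes num_processes >= #regions).
    alloc = {r: max(1, int(s)) for r, s in shares.items()}
    leftover = num_processes - sum(alloc.values())
    # Distribute any remainder by largest fractional share.
    for r in sorted(shares, key=lambda r: shares[r] - int(shares[r]), reverse=True):
        if leftover <= 0:
            break
        alloc[r] += 1
        leftover -= 1
    return alloc

# Hypothetical relative volumes for four regions, split over 16 processes.
regions = {"V1": 10.0, "V2": 6.0, "LGN": 1.0, "MT": 3.0}
alloc = allocate_processes(regions, 16)
print(alloc)
```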

A. Structure

It is hypothesized that distinct functions are supported by signature subnetworks throughout the brain that facilitate information flow, integration, and cooperation across functionally differentiated, distributed centers. We sought to capture the richness found in these networks by specifying our test network's modules, and the connectivity between them, using a very large scale database of brain regions and connectivity in the macaque monkey called CoCoMac [27], [28]. Decades of white matter anatomical tracing studies in the macaque brain have been painstakingly assembled in the CoCoMac database. In previous efforts, we have integrated the contributions made to CoCoMac by many disparate labs to produce a cohesive, conflict-free network of brain regions in the macaque monkey that is three times larger than the largest previous such network [9]. Here, we reduced this network to 77 brain regions, based on the white matter connectivity metrics described below, to make it more amenable to simulation. We derived volumetric information for each region from the Paxinos brain atlas [29] and accompanying software [30], which in turn was used to set relative neuron counts for each region. Volume information was not available for 5 cortical and 8 thalamic regions, and so was approximated using the median size of the other cortical or thalamic regions, respectively.

B. Long range connectivity

In the CoCoMac network, each brain region is represented as a vertex, and the presence of a white matter connection between two brain regions is represented as an edge between the corresponding vertices. The derived network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia, and has 6602 directed edges that capture well-known cortico-cortical, cortico-subcortical, and intra-subcortical white matter pathways [9]. In the many studies reported to CoCoMac, there are often cases where one lab reports connections for a brain region while another lab subdivides that brain region and reports connections for one of the child subregions. For simplicity, we reduced the network by merging a child subregion into a parent region where both child and parent regions report connections. We do this by ORing the connections of the child region with those of the parent region. The smaller, lower resolution network consists of 102 regions, 77 of which report connections. In the brain, the topography of connections between regions has in some cases been shown to be highly structured and focused [31], and in other cases very diffuse [32]. For a given vertex in our reduced CoCoMac graph, such focused connections correspond to a source process targeting a single process in the target region, while diffuse connections correspond to a source process targeting multiple processes in the target region (if more than one such process exists). We choose to use long range connections that are as diffuse as possible in our test network, because this places the largest burden on

Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth core model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.

the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.
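The parent/child merge described above (OR-ing a reporting child subregion into its reporting parent) can be sketched as follows. This is a hedged illustration on a toy directed graph; the region names and the set-of-targets representation are assumptions, not the CoCoMac data model.

```python
# Illustrative sketch of the graph reduction described in the text: fold a
# child subregion into its parent by OR-ing both outgoing and incoming edges.
def merge_child_into_parent(conn, parent, child):
    """conn maps region -> set of target regions (directed edges)."""
    # OR the child's outgoing edges into the parent's.
    conn[parent] |= conn.pop(child)
    conn[parent].discard(child)
    conn[parent].discard(parent)  # avoid a self-loop created by the merge
    # OR incoming edges: any region that targeted the child now targets the parent.
    for src, targets in conn.items():
        if child in targets:
            targets.discard(child)
            if src != parent:
                targets.add(parent)
    return conn

# Toy example: "V1a" is a hypothetical subregion merged into parent "V1".
conn = {"PFC": {"V1"}, "V1": {"PFC"}, "V1a": {"MT"}, "MT": {"V1a"}}
merged = merge_child_into_parent(conn, "V1", "V1a")
print(merged)
```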

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, targets are invariably within a 2-3 mm radius from the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on possible neuron targets from gray matter connectivity as the nearest 3 million or so neurons; this fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.
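One simple way to realize "as broadly as possible" is a strided assignment of target cores. The sketch below is an assumption for illustration (the paper does not specify the exact assignment rule): choosing a stride co-prime with the target pool size makes consecutive neurons land on different target cores, which is the cache-hostile pattern the text aims for. It assumes at least two target cores.

```python
# Illustrative sketch: spread the local (gray matter) targets of neurons on a
# source core broadly over the candidate target cores on the same process,
# using a stride co-prime with the pool size so consecutive neurons rarely
# reuse a target core (deliberately bad for cache locality).
from math import gcd

def broad_targets(num_neurons, num_target_cores, targets_per_neuron):
    # Largest stride below the pool size that is co-prime with it.
    stride = max(s for s in range(1, num_target_cores) if gcd(s, num_target_cores) == 1)
    assignment = []
    for n in range(num_neurons):
        base = (n * stride) % num_target_cores
        assignment.append([(base + k * stride) % num_target_cores
                           for k in range(targets_per_neuron)])
    return assignment

a = broad_targets(num_neurons=4, num_target_cores=7, targets_per_neuron=3)
print(a)
```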

We establish our ratio of long range to local connectivity at approximately 60/40 for cortical regions and 80/20 for non-cortical regions. From CoCoMac, we obtained a binary connection matrix. This matrix is converted to a stochastic matrix with the above gray matter percentages on the diagonals, and with white matter connections set to be proportional to the volume percentage of the outgoing region. Then the connection matrix was balanced, as described in section IV, to make the row and column sums equal to the volume of each region. The end result guarantees a realizable model, in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in figure 3.
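The balancing step can be illustrated with a minimal Sinkhorn-Knopp-style iteration [24], [26]: alternately rescale rows and columns until they hit prescribed sums. This is a sketch under the assumption of a small dense matrix and consistent marginals (total row sums equal total column sums), not the parallel compiler's implementation.

```python
# Minimal Sinkhorn-Knopp-style sketch of matrix balancing to prescribed row
# and column sums (in the text, both targets are the region volumes).
def balance(matrix, row_sums, col_sums, iters=200):
    m = [row[:] for row in matrix]
    for _ in range(iters):
        for i, row in enumerate(m):          # rescale each row to its target sum
            s = sum(row)
            if s > 0:
                m[i] = [x * row_sums[i] / s for x in row]
        for j in range(len(m[0])):           # rescale each column to its target sum
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= col_sums[j] / s
    return m

# Toy 2x2 example with consistent marginals (1.0 + 2.0 == 1.5 + 1.5).
m = balance([[0.4, 0.6], [0.7, 0.3]], row_sums=[1.0, 2.0], col_sums=[1.5, 1.5])
print([round(sum(row), 3) for row in m])
```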

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass, for even the largest simulated system of TrueNorth cores, is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process² to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes

² We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass"), and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1), as functions of the available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores equals 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds³, or 388× slower than real time.
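The slowdown figure follows directly from the numbers above: 500 simulated ticks of 1 ms each represent 0.5 s of model time, completed in 194 s of wall-clock time.

```python
# Back-of-envelope check of the slowdown quoted above.
ticks = 500
tick_duration_s = 1e-3          # one simulated tick corresponds to 1 ms
wall_clock_s = 194.0
slowdown = wall_clock_s / (ticks * tick_duration_s)
print(round(slowdown))  # 388, i.e. 388x slower than real time
```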

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick, as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner, and therefore less frequented, with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.
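The message-volume claim above is simple arithmetic on the stated spike count and spike size:

```python
# Check on the per-tick message volume: ~22M white matter spikes per tick at
# 20 bytes per spike, compared against the 2 GB/s torus link bandwidth.
spikes_per_tick = 22e6
bytes_per_spike = 20
gb_per_tick = spikes_per_tick * bytes_per_spike / 1e9
print(gb_per_tick)  # 0.44 GB per simulated tick, well under 2 GB/s per link
```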

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time

³ We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

(a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads, due to a critical section in the Network phase that

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

creates a serial bottleneck at all thread counts.

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly identical performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches, due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison,

we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation, in which 1 millisecond of wall-clock time corresponds to 1 millisecond of simulated time, is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching, as when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
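The one-sided delivery pattern described above can be modeled conceptually as follows. This Python sketch only stands in for the UPC implementation; the class, method names, and spike tuples are illustrative assumptions. The essential point it demonstrates is that a writer deposits spikes directly into a target's landing buffer with no action by the target, a single barrier separates the write phase from the read phase, and arrival order is irrelevant, so no per-message receive or tag matching is needed.

```python
# Conceptual model (not the UPC code) of one-sided spike delivery in Compass:
# each process "puts" spikes into a globally addressable per-target buffer.
class GlobalSpikeBuffers:
    def __init__(self, num_processes):
        # One shared landing buffer per process, writable by any process.
        self.buffers = {p: [] for p in range(num_processes)}

    def put(self, target, spike):
        # Analog of a one-sided remote put: the target takes no action.
        self.buffers[target].append(spike)

    def drain(self, process):
        # After a global barrier, each process reads its own buffer locally.
        spikes, self.buffers[process] = self.buffers[process], []
        return spikes

g = GlobalSpikeBuffers(num_processes=2)
g.put(target=1, spike=("core7", "axon3"))   # hypothetical (source, axon) spikes
g.put(target=1, spike=("core2", "axon9"))
# ... a single global barrier would separate the phases here ...
drained = sorted(g.drain(1))
print(drained)
```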

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with the other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks, with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system⁴, 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4We do not use the CoCoMac model for real-time simulations because thesize of simulated system has insufficient TrueNorth cores to populate eachCoCoMac region

VIII CONCLUSIONS

Cognitive Computing [2] is the quest for approximatingthe mind-like function low power small volume and real-time performance of the human brain To this end we havepursued neuroscience [42] [43] [9] nanotechnology [3] [4][44] [5] [6] and supercomputing [11] [12] Building on theseinsights we are marching ahead to develop and demonstrate anovel ultra-low power compact modular non-von Neumanncognitive computing architecture namely TrueNorth To setsail for TrueNorth in this paper we reported a multi-threadedmassively parallel simulator Compass that is functionallyequivalent to TrueNorth As a result of a number of innova-tions in communication computation and memory Compassdemonstrates unprecedented scale with its number of neuronscomparable to the human cortex and number of synapsescomparable to the monkey cortex while achieving unprece-dented time-to-solution Compass is a Swiss-army knife forour ambitious project supporting all aspects from architecturealgorithms to applications

TrueNorth and Compass represent a massively paralleland distributed architecture for computation that complementthe modern von Neumann architecture To fully exploit thisarchitecture we need to go beyond the ldquointellectual bottle-neckrdquo [7] of long sequential programming to short parallelprogramming To this end our next step is to develop a newparallel programming language with compositional semanticsthat can provide application and algorithm designers with thetools to effectively and efficiently unleash the full power ofthe emerging new era of computing Finally TrueNorth is de-signed to best approximate the brain within the constraints ofmodern CMOS technology but now informed by TrueNorthand Compass we ask how might we explore an entirely newtechnology that takes us beyond Moorersquos law and beyondthe ever-increasing memory-processor latency beyond needfor ever-increasing clock rates to a new era of computingExciting

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contractNo HR0011-09-C-0002 The views and conclusions containedherein are those of the authors and should not be interpreted asrepresenting the official policies either expressly or impliedof DARPA or the US Government

We thank Filipp Akopyan John Arthur Andrew CassidyBryan Jackson Rajit Manohar Paul Merolla Rodrigo Alvarez-Icaza and Jun Sawada for their collaboration on the TrueNortharchitecture and our university partners Stefano Fusi RajitManohar Ashutosh Saxena and Giulio Tononi as well astheir research teams for their feedback on the Compass sim-ulator We are indebted to Fred Mintzer for access to IBMBlue GeneP and Blue GeneQ at the IBM TJ Watson Re-search Center and to George Fax Kerry Kaliszewski AndrewSchram Faith W Sell Steven M Westerbeck for access toIBM Rochester Blue GeneQ without which this paper wouldhave been impossible Finally we would like to thank DavidPeyton for his expert assistance revising this manuscript



Fig. 3. Macaque brain map consisting of the 77 brain regions used for the test network. The relative number of TrueNorth cores for each area indicated by the Paxinos atlas is depicted in green, and the actual number of TrueNorth cores allocated to each region following our normalization step is depicted in red, both plotted in log space. Outgoing connections and neurons allocated in a 4096 TrueNorth core model are shown for a typical region, LGN, which is the first stage in the thalamocortical visual processing stream.

the communication infrastructure of our simulator. As region sizes are increased through scaling, requiring more processes per region, the number of inter-process connections required for a given process will increase. At the same time, the number of neuron-axon connections per such link will decrease.

C. Local connectivity

Extensive studies have revealed that while each neuron can form gray matter connections to a large array of possible targets, targets are invariably within a 2-3 mm radius from the source [33]. The number of cortical neurons under a square millimeter of cortex in the macaque brain is estimated in the tens of thousands [34], which sets an upper limit on possible neuron targets from gray matter connectivity as the nearest 3 million or so neurons; this fits within the number of neurons we are able to use per process for the simulations performed here. Therefore, we choose to restrict our modeling of gray matter connectivity such that the source neuron and target neuron of gray matter connections are located on the same process. It has also been found that while each neuron often connects to the same target gray matter patches as its neighbors, this is not always the case [33]. Therefore, to provide the highest possible challenge to cache performance, we chose to ensure that all locally connecting neurons on the same TrueNorth core distribute their connections as broadly as possible across the set of possible target TrueNorth cores.
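The reach argument above can be checked with rough numbers. In this sketch the 3 mm radius and the density of 10^5 neurons per square millimeter are illustrative values picked within the ranges cited from [33], [34]; they are not figures taken from the paper.

```python
import math

def gray_matter_targets(radius_mm, neurons_per_mm2):
    """Estimate the pool of candidate gray-matter targets as the
    neurons under the disc of cortex reachable from a source neuron."""
    area_mm2 = math.pi * radius_mm ** 2
    return area_mm2 * neurons_per_mm2

# Illustrative values only: a 3 mm reach and ~1e5 neurons under each
# square millimeter of cortex (densities vary by area [34]).
pool = gray_matter_targets(3.0, 1.0e5)
print(round(pool / 1e6, 1))  # roughly 2.8 million candidate targets
```

With these assumed values, the candidate pool lands in the low millions, consistent with the "nearest 3 million or so neurons" bound used to keep gray matter connections process-local.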

We establish our ratio of long range to local connectivity at approximately 60:40 for cortical regions and at 80:20 for non-cortical regions. From CoCoMac we obtained a binary connection matrix. This matrix is converted to a stochastic matrix with the above gray matter percentages on the diagonals and white matter connections set to be proportional to the volume percentage of the outgoing region. Then the connection matrix was balanced as described in section IV to make the row and column sums equal to the volume of that region. The end result guarantees a realizable model, in that all axon and neuron requests can be fulfilled in all regions. The brain regions used, a sample of their interconnectivity, and the results of this normalization process can be seen in figure 3.
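The balancing step referenced here (section IV, drawing on the Sinkhorn-Knopp family of algorithms [24]-[26]) can be sketched as alternating row and column rescaling. The 3-region matrix and the volume targets below are toy values for illustration, not the CoCoMac data.

```python
def balance(matrix, row_sums, col_sums, iters=1000):
    """Alternately rescale rows and columns of a nonnegative matrix
    until its row and column sums approach the prescribed targets
    (a Sinkhorn-Knopp-style iteration; cf. refs [24]-[26])."""
    m = [row[:] for row in matrix]
    for _ in range(iters):
        for i, target in enumerate(row_sums):      # scale each row
            s = sum(m[i])
            if s:
                m[i] = [v * target / s for v in m[i]]
        for j, target in enumerate(col_sums):      # scale each column
            s = sum(row[j] for row in m)
            if s:
                for row in m:
                    row[j] *= target / s
    return m

# Toy 3-region binary connection matrix, balanced so that each
# region's row and column sums equal its (made-up) relative volume.
vol = [3.0, 2.0, 1.0]
b = balance([[1.0, 1.0, 0.0],
             [1.0, 1.0, 1.0],
             [0.0, 1.0, 1.0]], vol, vol)
print([round(sum(row), 2) for row in b])  # -> [3.0, 2.0, 1.0]
```

After convergence, every region's outgoing and incoming totals match its volume, which is the realizability property the paragraph describes: all axon and neuron requests can be fulfilled.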

VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI Reduce-Scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with the increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass for even the largest simulated system of TrueNorth cores is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/s links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process2 to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512.

2 We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks (“Total Compass”) and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1) as functions of available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds,3 or 388× slower than real time.
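The headline numbers of the largest weak-scaling run can be reproduced from the configuration given above; this is only back-of-envelope arithmetic, using the paper's counts (16 CPUs per node, 16384 simulated cores per node, one MPI process with 32 threads per node, and 500 ticks of 1 ms simulated time each).

```python
# Back-of-envelope checks for the largest weak-scaling run (16 racks).
nodes = 16384                         # 16 racks x 1024 compute nodes
cpus = nodes * 16                     # application CPUs
cores = nodes * 16384                 # 16384 simulated cores per node
cores_per_thread = 16384 // (1 * 32)  # 1 MPI process x 32 threads/node
slowdown = 194.0 / (500 * 0.001)      # 194 s wall clock for 500 1 ms ticks
print(cpus, cores // 2**20, cores_per_thread, slowdown)
# -> 262144 256 512 388.0
```

The outputs match the text: 262144 CPUs, 256M simulated cores, 512 cores per OpenMP thread, and a 388× slowdown relative to real time.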

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI Reduce-Scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner and therefore less frequented with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.
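The bandwidth headroom claimed here is simple to verify from the stated figures (22M spikes per tick at 20 bytes each, against a 2 GB/s torus link):

```python
# Spike traffic per simulated tick for the largest (256M-core) run,
# compared against the 2 GB/s bandwidth of one 5D-torus link.
spikes_per_tick = 22e6   # approximate count reported above
bytes_per_spike = 20
volume_gb = spikes_per_tick * bytes_per_spike / 1e9
print(volume_gb)         # -> 0.44
assert volume_gb < 2.0   # well under a single link's bandwidth
```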

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time

3 We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

(a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).
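The speed-ups quoted above follow directly from the run times; the sketch below also derives the implied parallel efficiencies (speed-up divided by the increase in resources), which the paper does not state explicitly.

```python
# Speed-ups and parallel efficiencies implied by the strong-scaling
# run times quoted above for the fixed 32M-core model.
base_racks, base_secs = 1, 324.0      # baseline: 1 rack, 16384 CPUs
for racks, secs in [(8, 47.0), (16, 37.0)]:
    speedup = base_secs / secs
    efficiency = speedup / (racks / base_racks)
    print(racks, round(speedup, 1), round(100 * efficiency))
# -> 8 6.9 86
# -> 16 8.8 55
```

The drop from 86% to 55% efficiency between 8 and 16 racks quantifies the communication-bound limit on perfect scaling noted in the text.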

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536 CPU (four rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

creates a serial bottleneck at all thread counts.

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly similar performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false sharing penalties in the CPU caches due to the increased size of the shared memory region for the threads in each MPI process.

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison, we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation (1 millisecond of wall-clock time per 1 millisecond of simulated time) is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
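The write-barrier-read pattern described above can be illustrated with a thread-based analogy; this is not UPC or GASNet code, and the per-buffer locks merely stand in for the runtime's remote-write mechanism. Each "process" deposits spikes directly into its targets' globally addressable buffers, and a single barrier separates the write phase of one tick from the read phase of the next.

```python
import threading

# Thread-based analogy (not actual UPC) of the one-sided pattern:
# every rank writes spikes straight into the target's inbox, then one
# cheap barrier replaces the collective Reduce-Scatter.
N = 4
inboxes = [[] for _ in range(N)]            # per-rank "remote" buffers
locks = [threading.Lock() for _ in range(N)]
barrier = threading.Barrier(N)
received = [0] * N

def process(rank):
    # "One-sided put": deposit a spike in every other rank's buffer.
    for target in range(N):
        if target != rank:
            with locks[target]:
                inboxes[target].append((rank, "spike"))
    barrier.wait()                          # single global barrier
    received[rank] = len(inboxes[rank])     # read phase of next tick

threads = [threading.Thread(target=process, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # -> [3, 3, 3, 3]
```

Because spike order within a tick is irrelevant, the concurrent unordered writes are harmless, which is exactly the property that lets Compass drop two-sided tag matching.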

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system,4 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 21× as long (1000 ticks in 21 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4 We do not use the CoCoMac model for real-time simulations because the size of the simulated system has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the “intellectual bottleneck” [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology, but now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contractNo HR0011-09-C-0002 The views and conclusions containedherein are those of the authors and should not be interpreted asrepresenting the official policies either expressly or impliedof DARPA or the US Government

We thank Filipp Akopyan John Arthur Andrew CassidyBryan Jackson Rajit Manohar Paul Merolla Rodrigo Alvarez-Icaza and Jun Sawada for their collaboration on the TrueNortharchitecture and our university partners Stefano Fusi RajitManohar Ashutosh Saxena and Giulio Tononi as well astheir research teams for their feedback on the Compass sim-ulator We are indebted to Fred Mintzer for access to IBMBlue GeneP and Blue GeneQ at the IBM TJ Watson Re-search Center and to George Fax Kerry Kaliszewski AndrewSchram Faith W Sell Steven M Westerbeck for access toIBM Rochester Blue GeneQ without which this paper wouldhave been impossible Finally we would like to thank DavidPeyton for his expert assistance revising this manuscript

REFERENCES

[1] J von Neumann The Computer and The Brain Yale University Press1958

[2] D S Modha R Ananthanarayanan S K Esser A Ndirango A JSherbondy and R Singh ldquoCognitive computingrdquo Communications ofthe ACM vol 54 no 8 pp 62ndash71 2011

[3] P Merolla J Arthur F Akopyan N Imam R Manohar and D ModhaldquoA digital neurosynaptic core using embedded crossbar memory with45pj per spike in 45nmrdquo in Custom Integrated Circuits Conference(CICC) 2011 IEEE sept 2011 pp 1 ndash4

[4] J Seo B Brezzo Y Liu B Parker S Esser R Montoye B RajendranJ Tierno L Chang D Modha and D Friedman ldquoA 45nm cmosneuromorphic chip with a scalable architecture for learning in networksof spiking neuronsrdquo in Custom Integrated Circuits Conference (CICC)2011 IEEE sept 2011 pp 1 ndash4

[5] N Imam F Akopyan J Arthur P Merolla R Manohar and D SModha ldquoA digital neurosynaptic core using event-driven qdi circuitsrdquo inASYNC 2012 IEEE International Symposium on Asynchronous Circuitsand Systems 2012

[6] J V Arthur P A Merolla F Akopyan R Alvarez-Icaza A CassidyS Chandra S K Esser N Imam W Risk D Rubin R Manohar andD S Modha ldquoBuilding block of a programmable neuromorphic sub-strate A digital neurosynaptic corerdquo in International Joint Conferenceon Neural Networks 2012

[7] J W Backus ldquoCan programming be liberated from the von neumannstyle a functional style and its algebra of programsrdquo Communicationsof the ACM vol 21 no 8 pp 613ndash641 1878

[8] V B Mountcastle Perceptual Neuroscience The Cerebral CortexHarvard University Press 1998

[9] D S Modha and R Singh ldquoNetwork architecture of the long distancepathways in the macaque brainrdquo Proceedings of the National Academyof the Sciences USA vol 107 no 30 pp 13 485ndash13 490 2010[Online] Available httpwwwpnasorgcontent1073013485abstract

[10] C Koch Biophysics of Computation Information Processing in SingleNeurons Oxford University Press New York New York 1999

[11] R Ananthanarayanan and D S Modha ldquoAnatomy of a cortical simu-latorrdquo in Supercomputing 07 2007

[12] R Ananthanarayanan S K Esser H D Simon and D S ModhaldquoThe cat is out of the bag cortical simulations with 109 neurons 1013

synapsesrdquo in Proceedings of the Conference on High PerformanceComputing Networking Storage and Analysis ser SC rsquo09 NewYork NY USA ACM 2009 pp 631ndash6312 [Online] Availablehttpdoiacmorg10114516540591654124

[13] E M Izhikevich ldquoWhich model to use for cortical spiking neuronsrdquoIEEE Transactions on Neural Networks vol 15 pp 1063ndash1070 2004[Online] Available httpwwwnsieduusersizhikevichpublicationswhichmodpdf

[14] H Markram ldquoThe Blue Brain Projectrdquo Nature Reviews Neurosciencevol 7 no 2 pp 153ndash160 Feb 2006 [Online] Availablehttpdxdoiorg101038nrn1848

[15] H Markram ldquoThe Human Brain Projectrdquo Scientific American vol 306no 6 pp 50ndash55 Jun 2012

[16] R Brette and D F M Goodman ldquoVectorized algorithms forspiking neural network simulationrdquo Neural Comput vol 23 no 6pp 1503ndash1535 2011 [Online] Available httpdxdoiorg101162NECO a 00123

[17] M Djurfeldt M Lundqvist C Johansson M Rehn O Ekebergand A Lansner ldquoBrain-scale simulation of the neocortex on the ibmblue genel supercomputerrdquo IBM J Res Dev vol 52 no 12 pp31ndash41 Jan 2008 [Online] Available httpdlacmorgcitationcfmid=13759901375994

[18] E M Izhikevich and G M Edelman ldquoLarge-scale model ofmammalian thalamocortical systemsrdquo Proceedings of the NationalAcademy of Sciences vol 105 no 9 pp 3593ndash3598 Mar 2008[Online] Available httpdxdoiorg101073pnas0712231105

[19] J M Nageswaran N Dutt J L Krichmar A Nicolau andA Veidenbaum ldquoEfficient simulation of large-scale spiking neuralnetworks using cuda graphics processorsrdquo in Proceedings of the 2009international joint conference on Neural Networks ser IJCNNrsquo09Piscataway NJ USA IEEE Press 2009 pp 3201ndash3208 [Online]Available httpdlacmorgcitationcfmid=17045551704736

[20] M M Waldrop ldquoComputer modelling Brain in a boxrdquo Nature vol482 pp 456ndash458 2012 [Online] Available httpdxdoiorg101038482456a

[21] H de Garis C Shuo B Goertzel and L Ruiting ldquoA world surveyof artificial brain projects part i Large-scale brain simulationsrdquoNeurocomput vol 74 no 1-3 pp 3ndash29 Dec 2010 [Online]Available httpdxdoiorg101016jneucom201008004

[22] R Brette M Rudolph T Carnevale M Hines D Beeman J MBower M Diesmann A Morrison P H Goodman A P DavisonS E Boustani and A Destexhe ldquoSimulation of networks of spikingneurons A review of tools and strategiesrdquo Journal of ComputationalNeuroscience vol 2007 pp 349ndash398 2007

[23] D Gregor T Hoefler B Barrett and A Lumsdaine ldquoFixing Probe forMulti-Threaded MPI Applicationsrdquo Indiana University Tech Rep 674Jan 2009

[24] R Sinkhorn and P Knopp ldquoConcerning nonnegative matrices anddoubly stochastic matricesrdquo Pacific Journal of Mathematics vol 21no 2 pp 343ndash348 1967

[25] A Marshall and I Olkin ldquoScaling of matrices to achieve specifiedrow and column sumsrdquo Numerische Mathematik vol 12 pp 83ndash901968 101007BF02170999 [Online] Available httpdxdoiorg101007BF02170999

[26] P Knight ldquoThe sinkhorn-knopp algorithm convergence and applica-tionsrdquo SIAM Journal on Matrix Analysis and Applications vol 30 no 1pp 261ndash275 2008

[27] R Kotter ldquoOnline retrieval processing and visualization of primate con-nectivity data from the CoCoMac databaserdquo Neuroinformatics vol 2pp 127ndash144 2004

[28] ldquoCocomac (collations of connectivity data on the macaque brain)rdquo wwwcocomacorg

[29] G Paxinos X Huang and M Petrides The Rhesus Monkey Brain inStereotaxic Coordinates Academic Press 2008 [Online] Availablehttpbooksgooglecoinbooksid=7HW6HgAACAAJ

[30] B G Kotter R Reid AT ldquoAn introduction to the cocomac-paxinos-3d viewerrdquo in The Rhesus Monkey Brain in Stereotaxic CoordinatesG Paxinos X-F Huang M Petrides and A Toga Eds Elsevier SanDiego 2008 ch 4

[31] D H Hubel and T N Wiesel ldquoReceptive fields binocular interactionand functional architecture in the catrsquos visual cortexrdquo J Physiol vol160 no 1 pp 106ndash154 1962

[32] A Peters and E G Jones Cerebral Cortex Vol 4 Association andauditory cortices Springer 1985

[33] V B Mountcastle ldquoThe columnar organization of the neocortexrdquo Brainvol 120 no 4 pp 701ndash22 1997

[34] C E Collins D C Airey N A Young D B Leitch and J H KaasldquoNeuron densities vary across and within cortical areas in primatesrdquo

Proceedings of the National Academy of Sciences vol 107 no 36 pp15 927ndash15 932 2010

[35] R A Haring M Ohmacht T W Fox M K Gschwind D L Sat-terfield K Sugavanam P W Coteus P Heidelberger M A BlumrichR W Wisniewski A Gara G L-T Chiu P A Boyle N H Chist andC Kim ldquoThe IBM Blue GeneQ Compute Chiprdquo IEEE Micro vol 32pp 48ndash60 2012

[36] D Chen N A Eisley P Heidelberger R M Senger Y SugawaraS Kumar V Salapura D L Satterfield B Steinmacher-Burow andJ J Parker ldquoThe IBM Blue GeneQ interconnection network andmessage unitrdquo in Proceedings of 2011 International Conference forHigh Performance Computing Networking Storage and Analysis serSC rsquo11 New York NY USA ACM 2011 pp 261ndash2610 [Online]Available httpdoiacmorg10114520633842063419

[37] ldquoThe Berkeley UPC Compilerrdquo 2002 httpupclblgov[38] R Nishtala P H Hargrove D O Bonachea and K A Yelick

ldquoScaling communication-intensive applications on BlueGeneP usingone-sided communication and overlaprdquo in Proceedings of the 2009IEEE International Symposium on ParallelampDistributed Processingser IPDPS rsquo09 Washington DC USA IEEE Computer Society2009 pp 1ndash12 [Online] Available httpdxdoiorg101109IPDPS20095161076

[39] D Bonachea ldquoGASNet Specification v11rdquo University of California atBerkeley Berkeley CA USA Tech Rep 2002

[40] S Kumar G Dozsa G Almasi P Heidelberger D Chen M EGiampapa M Blocksome A Faraj J Parker J RattermanB Smith and C J Archer ldquoThe Deep Computing MessagingFramework Generalized Scalable Message Passing on the Blue GeneP

Supercomputerrdquo in Proceedings of the 22nd annual internationalconference on Supercomputing ser ICS rsquo08 New York NY USAACM 2008 pp 94ndash103 [Online] Available httpdoiacmorg10114513755271375544

[41] T Hoefler C Siebert and A Lumsdaine ldquoScalable CommunicationProtocols for Dynamic Sparse Data Exchangerdquo in Proceedings of the2010 ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPPrsquo10) ACM Jan 2010 pp 159ndash168

[42] D S Modha ldquoA conceptual cortical surface atlasrdquo PLoS ONE vol 4no 6 p e5693 2009

[43] A J Sherbondy R Ananthanrayanan R F Dougherty D S Modhaand B A Wandell ldquoThink global act local projectome estimationwith bluematterrdquo in Proceedings of MICCAI 2009 Lecture Notes inComputer Science 2009

[44] B L Jackson B Rajendran G S Corrado M Breitwisch G W BurrR Cheek K Gopalakrishnan S Raoux C T Rettner A G SchrottR S Shenoy B N Kurdi C H Lam and D S Modha ldquoNano-scale electronic synapses using phase change devicesrdquo ACM Journalon Emerging Technologies in Computing Systems forthcoming 2012


VI. EXPERIMENTAL EVALUATION

We completed an experimental evaluation of Compass that shows near-perfect weak and strong scaling behavior when simulating a CoCoMac macaque network model on an IBM Blue Gene/Q system of 16384 to 262144 compute CPUs. We attribute this behavior to several important design and operational features of Compass that make efficient use of the communication subsystem. The Compass implementation overlaps the single collective MPI reduce-scatter operation in a process with the delivery of spikes from neurons destined for local cores, and minimizes the MPI communicator size by spawning multiple threads within an MPI process on multi-core CPUs in place of spawning multiple MPI processes. In operation, we observe that the MPI point-to-point messaging rate scales sub-linearly with increasing size of the simulated system, due to weaker links (in terms of neuron-to-axon connections) between processes of connected brain regions. Moreover, the overall message data volume per simulated tick sent by Compass for even the largest simulated system of TrueNorth cores is well below the interconnect bandwidth of the communication subsystem.

A. Hardware platforms

We ran Compass simulations of our CoCoMac macaque network model on the IBM Blue Gene/Q, which is the third generation in the Blue Gene line of massively parallel supercomputers. Our Blue Gene/Q at IBM Rochester comprised 16 racks, with each rack in turn comprising 1024 compute nodes. Each compute node is composed of 17 processor cores on a single multi-core CPU paired with 16 GB of dedicated physical memory [35], and is connected to other nodes in a five-dimensional torus through 10 bidirectional 2 GB/second links [36]. HPC applications run on 16 of the processor cores and can spawn up to 4 hardware threads on each core; per-node system software runs on the remaining core. The maximum available Blue Gene/Q size was therefore 16384 nodes, equivalent to 262144 application processor cores.

For all experiments on the Blue Gene/Q, we compiled with the IBM XL C/C++ compiler version 12.0 (MPICH2 MPI library version 1.4.1).

B. Weak scaling behavior

Compass exhibits nearly perfect weak scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 4 shows the results of experiments in which we increased the CoCoMac model size when increasing the available Blue Gene/Q CPU count, while at the same time fixing the count of simulated TrueNorth cores per node at 16384. We ran with 1 MPI process per node and 32 OpenMP threads per MPI process2 to minimize the MPI communication overhead and maximize the available memory per MPI process. Because each Compass process distributes simulated cores uniformly across the available threads (section III), the number of simulated cores per OpenMP thread was 512.

2We are investigating unexpected system errors that occurred when running with the theoretical maximum of 64 OpenMP threads per MPI process.

With a fixed count of 16384 TrueNorth cores per compute node, Compass runs with near-constant total wall-clock time over an increasing number of compute nodes. Figure 4(a) shows the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks ("Total Compass"), and the wall-clock times taken per Compass execution phase in the main simulation loop (Synapse, Neuron, and Network in listing 1), as functions of available Blue Gene/Q CPU count. The total wall-clock time remains near-constant for CoCoMac models ranging from 16M TrueNorth cores on 16384 CPUs (1024 compute nodes) through to 256M cores on 262144 CPUs (16384 nodes); note that 256M cores is equal to 65B neurons, which is of human scale in the number of neurons. Simulating 256M cores on 262144 CPUs for 500 simulated ticks required 194 seconds3, or 388× slower than real time.
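The slowdown figure follows directly from the tick semantics: each simulated tick corresponds to 1 millisecond of simulated time, so 500 ticks cover 0.5 seconds. A minimal sketch of the arithmetic (the function name is ours, for illustration only):

```c
#include <assert.h>

/* Each simulated tick corresponds to 1 ms of simulated time, so the
 * slowdown factor is wall-clock time divided by simulated time.
 * The function name is ours, for illustration only. */
static double slowdown_factor(double wall_seconds, int ticks) {
    double simulated_seconds = ticks / 1000.0; /* 1 ms per tick */
    return wall_seconds / simulated_seconds;
}
```

For 194 wall-clock seconds over 500 ticks, this gives 194 / 0.5 = 388, matching the reported 388× figure.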

We observe that the increase in total run time with increasing Blue Gene/Q CPU count is related primarily to an increase in communication costs during the Network phase in the main simulation loop. We attribute most of the increase to the time taken by the MPI reduce-scatter operation, which increases with increasing MPI communicator size. We also attribute some of the increase to computation and communication imbalances in the functional regions of the CoCoMac model.

We see that some of the increase in total run time arises from an increase in the MPI message traffic during the Neuron phase. Figure 4(b) shows the MPI message count and total spike count (equivalent to the sum of white matter spikes from all MPI processes) per simulated tick as functions of the available Blue Gene/Q CPU count. As we increase the CoCoMac model size in the number of TrueNorth cores, more MPI processes represent the 77 macaque brain regions to which spikes can be sent, which leads to an increase in the message count and volume. We note, however, that the increase in message count is sub-linear, given that the white matter connections become thinner and therefore less frequented with increasing model size, as discussed in section V-B. For a system of 256M cores on 262144 CPUs, Compass generates and sends approximately 22M spikes per simulated tick, which at 20 bytes per spike equals 0.44 GB per tick and is well below the 5D torus link bandwidth of 2 GB/s.
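As a sanity check on the bandwidth headroom claim, the per-tick data volume is simply the spike count multiplied by the per-spike payload (function and parameter names are ours, for illustration only):

```c
#include <assert.h>

/* Per-tick white-matter traffic in GB: spike count times bytes per
 * spike. The function and parameter names are ours, illustrative only. */
static double traffic_gb_per_tick(double spikes, double bytes_per_spike) {
    return (spikes * bytes_per_spike) / 1e9;
}
```

For 22M spikes at 20 bytes each, this yields 0.44 GB per tick, well under the 2 GB/s capacity of a single torus link.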

C. Strong scaling behavior

As with weak scaling, Compass exhibits excellent strong scaling behavior when simulating the CoCoMac macaque network model on the IBM Blue Gene/Q. Figure 5 shows the results of experiments in which we fixed the CoCoMac model size at 32M TrueNorth cores (8.2B neurons) while increasing the available Blue Gene/Q CPU count. As with the weak scaling results in figure 4(a), we show the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intense main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).

3We exclude the CoCoMac compilation times from the total time. For reference, compilation of the 256M core model using the PCC (section IV) required 107 wall-clock seconds, mostly due to the communication costs in the white matter wiring phase.

Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node. (a) Compass total runtime and breakdown into major components. (b) Compass messaging and data transfer analysis per simulation step. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.
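The strong-scaling numbers above can be restated as parallel efficiency, i.e. speed-up divided by the factor of added resources (helper names are ours, for illustration only):

```c
#include <assert.h>

/* Speed-up over the 1-rack baseline, and parallel efficiency, defined as
 * speed-up divided by the factor of added resources. Helper names are
 * ours, illustrative only. */
static double speedup(double t_baseline, double t_scaled) {
    return t_baseline / t_scaled;
}

static double efficiency(double t_baseline, double t_scaled,
                         double resource_factor) {
    return speedup(t_baseline, t_scaled) / resource_factor;
}
```

With the reported times, 8 racks give 324/47 ≈ 6.9× (about 86% efficiency), while 16 racks give 324/37 ≈ 8.8× (about 55% efficiency), quantifying where the communication-intense phases begin to dominate.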

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process, while at the same time fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that creates a serial bottleneck at all thread counts.

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536-CPU (four-rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly similar performance to runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI reduce-scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false-sharing penalties in the CPU caches due to the increased size of the shared memory region for the threads in each MPI process.
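The effect of the serial critical section on thread scaling is the one Amdahl's law describes. A sketch, using an illustrative serial fraction that is not a measured Compass quantity:

```c
#include <assert.h>

/* Amdahl's law: with serial fraction s of the per-tick work (such as a
 * critical section in the Network phase), the best achievable speedup
 * on t threads is 1 / (s + (1 - s) / t). The value of s used in the
 * test below is illustrative only, not a measured Compass quantity. */
static double amdahl_speedup(double s, double t) {
    return 1.0 / (s + (1.0 - s) / t);
}
```

For example, even a hypothetical 5% serial fraction caps 32-thread speed-up near 12.5×, illustrating why a small critical section prevents perfect scaling at all thread counts.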

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message-passing library (section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communication. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison, we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation, meaning 1 millisecond of wall-clock time per 1 millisecond of simulated time, is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes into a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective reduce-scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
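The one-sided delivery pattern described above can be sketched as follows. This is a single-process mock (all names and sizes are ours, illustrative only): each destination owns a landing buffer that senders write into directly, and a global barrier separates writing from draining. In the actual UPC implementation, the buffers would live in shared space and the synchronization point would be a upc_barrier; no receive posting or tag matching is needed because spike arrival order within a tick is irrelevant.

```c
#include <assert.h>

#define NPROC      4    /* mock process count (ours, illustrative) */
#define MAX_SPIKES 64   /* landing-buffer capacity (ours)          */

typedef struct { int target_axon; int delivery_tick; } spike_t;

/* landing[d] plays the role of the globally-addressable buffer owned by
 * process d; fill[d] is its write cursor. */
static spike_t landing[NPROC][MAX_SPIKES];
static int     fill[NPROC];

/* One-sided "put": the sender deposits a spike directly into the
 * destination's buffer, with no matching receive on the other side. */
static void put_spike(int dest, int axon, int tick) {
    assert(fill[dest] < MAX_SPIKES);
    landing[dest][fill[dest]].target_axon   = axon;
    landing[dest][fill[dest]].delivery_tick = tick;
    fill[dest]++;
}

/* After a (mock) global barrier, each process drains its own buffer and
 * returns the spike count; spikes can be applied to axons in any order. */
static int drain_spikes(int self) {
    int n = fill[self];
    fill[self] = 0;
    return n;
}
```

A single low-latency barrier between all puts and all drains then replaces the reduce-scatter that the MPI version needs in order to tell each process how many messages to expect.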

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with the other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to that of the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system4, 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.

Our results show that the PGAS implementation of Compass simulates the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation simulates 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and from the elimination of the MPI reduce-scatter operation.

For each experiment, we report only the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4We do not use the CoCoMac model for real-time simulations because the size of the simulated system has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and its number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long sequential programming to short parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology, but now, informed by TrueNorth and Compass, we ask: how might we explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing? Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to the IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613–641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13485–13490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. New York: Oxford University Press, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing '07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 63:1–63:12. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063–1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153–160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50–55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Computation, vol. 23, no. 6, pp. 1503–1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31–41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN '09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201–3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456–458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomputing, vol. 74, no. 1–3, pp. 3–29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, pp. 349–398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83–90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261–275, 2008.

[27] R. Kötter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127–144, 2004.

[28] ldquoCocomac (collations of connectivity data on the macaque brain)rdquo wwwcocomacorg

[29] G Paxinos X Huang and M Petrides The Rhesus Monkey Brain inStereotaxic Coordinates Academic Press 2008 [Online] Availablehttpbooksgooglecoinbooksid=7HW6HgAACAAJ

[30] B G Kotter R Reid AT ldquoAn introduction to the cocomac-paxinos-3d viewerrdquo in The Rhesus Monkey Brain in Stereotaxic CoordinatesG Paxinos X-F Huang M Petrides and A Toga Eds Elsevier SanDiego 2008 ch 4

[31] D H Hubel and T N Wiesel ldquoReceptive fields binocular interactionand functional architecture in the catrsquos visual cortexrdquo J Physiol vol160 no 1 pp 106ndash154 1962

[32] A Peters and E G Jones Cerebral Cortex Vol 4 Association andauditory cortices Springer 1985

[33] V B Mountcastle ldquoThe columnar organization of the neocortexrdquo Brainvol 120 no 4 pp 701ndash22 1997

[34] C E Collins D C Airey N A Young D B Leitch and J H KaasldquoNeuron densities vary across and within cortical areas in primatesrdquo

Proceedings of the National Academy of Sciences vol 107 no 36 pp15 927ndash15 932 2010

[35] R A Haring M Ohmacht T W Fox M K Gschwind D L Sat-terfield K Sugavanam P W Coteus P Heidelberger M A BlumrichR W Wisniewski A Gara G L-T Chiu P A Boyle N H Chist andC Kim ldquoThe IBM Blue GeneQ Compute Chiprdquo IEEE Micro vol 32pp 48ndash60 2012

[36] D Chen N A Eisley P Heidelberger R M Senger Y SugawaraS Kumar V Salapura D L Satterfield B Steinmacher-Burow andJ J Parker ldquoThe IBM Blue GeneQ interconnection network andmessage unitrdquo in Proceedings of 2011 International Conference forHigh Performance Computing Networking Storage and Analysis serSC rsquo11 New York NY USA ACM 2011 pp 261ndash2610 [Online]Available httpdoiacmorg10114520633842063419

[37] ldquoThe Berkeley UPC Compilerrdquo 2002 httpupclblgov[38] R Nishtala P H Hargrove D O Bonachea and K A Yelick

ldquoScaling communication-intensive applications on BlueGeneP usingone-sided communication and overlaprdquo in Proceedings of the 2009IEEE International Symposium on ParallelampDistributed Processingser IPDPS rsquo09 Washington DC USA IEEE Computer Society2009 pp 1ndash12 [Online] Available httpdxdoiorg101109IPDPS20095161076

[39] D Bonachea ldquoGASNet Specification v11rdquo University of California atBerkeley Berkeley CA USA Tech Rep 2002

[40] S Kumar G Dozsa G Almasi P Heidelberger D Chen M EGiampapa M Blocksome A Faraj J Parker J RattermanB Smith and C J Archer ldquoThe Deep Computing MessagingFramework Generalized Scalable Message Passing on the Blue GeneP

Supercomputerrdquo in Proceedings of the 22nd annual internationalconference on Supercomputing ser ICS rsquo08 New York NY USAACM 2008 pp 94ndash103 [Online] Available httpdoiacmorg10114513755271375544

[41] T Hoefler C Siebert and A Lumsdaine ldquoScalable CommunicationProtocols for Dynamic Sparse Data Exchangerdquo in Proceedings of the2010 ACM SIGPLAN Symposium on Principles and Practice of ParallelProgramming (PPoPPrsquo10) ACM Jan 2010 pp 159ndash168

[42] D S Modha ldquoA conceptual cortical surface atlasrdquo PLoS ONE vol 4no 6 p e5693 2009

[43] A J Sherbondy R Ananthanrayanan R F Dougherty D S Modhaand B A Wandell ldquoThink global act local projectome estimationwith bluematterrdquo in Proceedings of MICCAI 2009 Lecture Notes inComputer Science 2009

[44] B L Jackson B Rajendran G S Corrado M Breitwisch G W BurrR Cheek K Gopalakrishnan S Raoux C T Rettner A G SchrottR S Shenoy B N Kurdi C H Lam and D S Modha ldquoNano-scale electronic synapses using phase change devicesrdquo ACM Journalon Emerging Technologies in Computing Systems forthcoming 2012


Fig. 4. Compass weak scaling performance when simulating the CoCoMac macaque network model on an IBM Blue Gene/Q with a fixed count of 16384 TrueNorth cores per Blue Gene/Q compute node: (a) Compass total runtime and breakdown into major components; (b) Compass messaging and data transfer analysis per simulation step. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

Fig. 5. Compass strong scaling performance when simulating a fixed CoCoMac macaque network model of 32M TrueNorth cores on an IBM Blue Gene/Q. Each Blue Gene/Q rack had 16384 compute CPUs (1024 compute nodes). We ran with one MPI process per node and 32 OpenMP threads per MPI process.

taken to simulate the CoCoMac model for 500 simulated ticks, and the wall-clock times taken per Compass execution phase in the main simulation loop. Simulating 32M cores takes 324 seconds on 16384 Blue Gene/Q CPUs (1 rack, the baseline), 47 seconds on 131072 CPUs (8 racks, a speed-up of 6.9× over the baseline with 8× more computation and memory capacity), and 37 seconds on 262144 CPUs (16 racks, a speed-up of 8.8× over the baseline with 16× more computation and memory capacity). We observe that perfect scaling is inhibited by the communication-intensive main loop phases when scaling from 131072 CPUs (8 racks) up to 262144 CPUs (16 racks).
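The strong-scaling numbers above imply a falling parallel efficiency as racks are added. A minimal sketch of the arithmetic, using only the wall-clock times quoted in the text (the helper names are ours, not Compass code):

```python
# Speed-up and parallel efficiency implied by the reported strong-scaling
# runs (seconds to simulate 32M TrueNorth cores for 500 ticks).
def speedup(t_base: float, t: float) -> float:
    return t_base / t

def efficiency(t_base: float, t: float, scale: float) -> float:
    """Speed-up divided by the increase in CPU count."""
    return speedup(t_base, t) / scale

t_1rack, t_8racks, t_16racks = 324.0, 47.0, 37.0

print(round(speedup(t_1rack, t_8racks), 1))      # ~6.9x on 8 racks
print(round(efficiency(t_1rack, t_8racks, 8), 2))    # ~0.86 efficiency
print(round(efficiency(t_1rack, t_16racks, 16), 2))  # ~0.55 efficiency
```

The drop from roughly 86% to 55% efficiency between 8 and 16 racks is consistent with the communication-intensive phases dominating at the largest scale.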

D. Thread scaling behavior

Compass obtains excellent multi-threaded scaling behavior when executing on the IBM Blue Gene/Q. Figure 6 shows the results of experiments in which we increase the number of OpenMP threads per MPI process while fixing the CoCoMac macaque network model size at 64M TrueNorth cores. We show the speed-up over the baseline in the total wall-clock time taken to simulate the CoCoMac model for 500 simulated ticks, and the speed-up in the wall-clock times taken per Compass execution phase in the main simulation loop, where the baseline is the time taken when each MPI process spawns only one OpenMP thread. We do not quite achieve perfect scaling in the number of OpenMP threads due to a critical section in the Network phase that creates a serial bottleneck at all thread counts.

Fig. 6. OpenMP multi-threaded scaling tests when simulating a fixed CoCoMac macaque network model of 64M TrueNorth cores on a 65536-CPU (four-rack) IBM Blue Gene/Q with one MPI process per compute node. We show the speed-up obtained with increasing numbers of threads over a baseline of one MPI process per node with one OpenMP thread (and therefore with 15 of 16 CPU cores idled per node).

In other multi-threaded experiments, we found that trading off between the number of MPI processes per compute node and the number of OpenMP threads per MPI process yielded little change in performance. For example, simulation runs of Compass with one MPI process per compute node and 32 OpenMP threads per process achieved nearly the same performance as runs with 16 MPI processes per compute node and 2 OpenMP threads per process. Using fewer MPI processes and more OpenMP threads per process reduces the size of the MPI communicator for the MPI Reduce-Scatter operation in the Network phase, which reduces the time taken by the phase. We observe, however, that this performance improvement is offset by false-sharing penalties in the CPU caches due to the increased size of the shared memory region for the threads in each MPI process.
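The serialized critical section described above caps thread scaling in the familiar Amdahl's-law way. A hedged sketch of that effect; the serial fraction `s` is illustrative, not a measured Compass value:

```python
# Amdahl's-law model of a serial critical section limiting OpenMP scaling.
def amdahl_speedup(s: float, threads: int) -> float:
    """Speed-up over 1 thread when a fraction s of the work is serialized."""
    return 1.0 / (s + (1.0 - s) / threads)

s = 0.05  # assumed: 5% of per-tick work inside the critical section
for t in (1, 2, 8, 32):
    print(t, round(amdahl_speedup(s, t), 2))
# Speed-up approaches 1/s = 20x no matter how many threads are added.
```

Even a small serialized fraction explains why thread scaling is good but not perfect at 32 threads.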

VII. COMPARISON OF THE PGAS AND MPI COMMUNICATION MODELS FOR REAL-TIME SIMULATION

In addition to evaluating an implementation of Compass using the MPI message passing library (Section III), we evaluated a second implementation based on the Partitioned Global Address Space (PGAS) model to explore the benefits of one-sided communications. We achieved clear real-time simulation performance improvements with the PGAS-based implementation over the MPI-based implementation. For this comparison, we ran both PGAS and MPI versions of Compass on a system of four 1024-node IBM Blue Gene/P racks installed at IBM Watson, as no C/C++ PGAS compiler yet exists for the Blue Gene/Q. Each compute node in the Blue Gene/P comprises 4 CPUs with 4 GB of dedicated physical memory. We compiled with the IBM XL C/C++ compiler version 9.0 (MPICH2 MPI library version 1.0.7). For PGAS support, we reimplemented the Compass messaging subroutines using Unified Parallel C (UPC) and compiled with the Berkeley UPC compiler version 2.14.0 [37], [38]; the Berkeley UPC runtime uses the GASNet library [39] for one-sided communication. Both GASNet and the MPI library implementation on Blue Gene/P build upon the Deep Computing Messaging Framework (DCMF) [40] to achieve the best available communication performance.

A. Tuning for real-time simulation

Real-time simulation (1 millisecond of wall-clock time per 1 millisecond of simulated time) is important for designing applications on the TrueNorth architecture. Real-time simulations enable us to implement and test applications targeted for TrueNorth cores in advance of obtaining the actual hardware, and to debug and cross-check the TrueNorth hardware designs. Such simulations, however, already require that we reduce the size of the simulated system of TrueNorth cores in order to reduce the total per-tick latency of the Synapse and Neuron phases. We looked for further performance improvements in the Network phase to maximize the size of the simulated system, given that the Network phase incurs the bulk of the latency in the main simulation loop.

The PGAS model of one-sided messaging is better suited to the behavior of simulated TrueNorth cores in Compass than the MPI model is. The source and ordering of spikes arriving at an axon during the Network phase in a simulated tick do not affect the computations during the Synapse and Neuron phases of the next tick. Therefore, each Compass process can use one-sided message primitives to insert spikes in a globally-addressable buffer residing at remote processes, without incurring either the overhead of buffering those spikes for sending or the overhead of tag matching when using two-sided MPI messaging primitives. Further, using one-sided message primitives enables the use of a single global barrier with very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers, instead of needing a collective Reduce-Scatter operation that scales linearly with communicator size. We did not attempt to use MPI-2 one-sided messaging primitives, as Hoefler et al. have demonstrated that such an approach is not promising from a performance standpoint [41].
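The order-independence argument above is what makes the one-sided pattern legal. An illustrative single-process Python emulation of the pattern (not Compass or UPC code; the class and method names are hypothetical): each "process" exposes a double-buffered, globally addressable spike buffer, a writer appends straight into the target's buffer for the next tick, and one barrier per tick would separate writing from reading.

```python
# Emulating one-sided spike delivery: direct writes into a remote buffer,
# with double buffering so writes for tick t+1 never race reads for tick t.
class Process:
    def __init__(self):
        # one buffer per tick parity (even/odd)
        self.axon_buf = {0: [], 1: []}

    def put_spike(self, target: "Process", tick: int, axon: int) -> None:
        # one-sided "put": append straight into the remote buffer; no
        # matching receive and no ordering requirement, since axon updates
        # within a tick are order-independent
        target.axon_buf[tick % 2].append(axon)

    def drain(self, tick: int) -> list:
        # reader side: take all spikes for this tick and reset the buffer
        spikes, self.axon_buf[tick % 2] = self.axon_buf[tick % 2], []
        return spikes

p0, p1 = Process(), Process()
p0.put_spike(p1, tick=1, axon=7)
p0.put_spike(p1, tick=1, axon=3)
# ...a single low-latency global barrier would go here in the real runtime...
print(sorted(p1.drain(1)))  # [3, 7] -- arrival order never mattered
```

In the real implementation the append is a UPC one-sided put and the comment line is `upc_barrier`; the sketch only shows why no per-message handshake or Reduce-Scatter count exchange is needed.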

We experimented with writing our own custom synchronization primitives to provide a local barrier for subgroups of processes. In this way, each process only synchronizes with the other processes with which it sends or receives spikes, as opposed to each process synchronizing with all other processes. In practice, though, we found that the Berkeley UPC synchronization primitives, which are built upon the fast DCMF native barriers, outperformed our custom primitives.

Fig. 7. Results comparing the PGAS and MPI communication models for real-time simulations in Compass over a Blue Gene/P system of four racks (16384 CPUs). We simulated 81K TrueNorth cores for 1000 ticks with neurons firing on average at 10 Hz. For each size of Blue Gene/P, we report the result for each implementation using the best-performing thread configuration.

B. Real-time simulation results

To compare the real-time simulation performance of the PGAS implementation of Compass to the MPI implementation, we ran a set of strong scaling experiments using a synthetic system of TrueNorth cores simulated over four 1024-node Blue Gene/P racks comprising 16384 CPUs. We began by finding the largest size of system we could simulate in real time on all four racks, and then measured the time taken to simulate the same system for 1000 ticks on progressively fewer racks. For the synthetic system,4 75% of the neurons in each TrueNorth core connect to TrueNorth cores on the same Blue Gene/P node, while the remaining 25% connect to TrueNorth cores on other nodes. All neurons fire on average at 10 Hz.
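A hedged reconstruction of that synthetic workload in Python (toy sizes; the constants and function names are ours, not taken from Compass): 75% of each core's neurons target same-node cores, 25% target remote cores, and each neuron fires with probability 0.01 per 1 ms tick, i.e. at an average rate of 10 Hz.

```python
# Sketch of the synthetic workload: 75%/25% local/remote wiring and
# Bernoulli firing at 10 Hz with 1 ms ticks. Toy parameters.
import random

random.seed(42)
NEURONS_PER_CORE = 256
LOCAL_FRACTION = 0.75
FIRE_PROB = 10.0 / 1000.0  # 10 Hz at 1 ms per simulated tick

def wire_core(n_local_cores: int, n_remote_cores: int) -> list:
    """Assign each neuron a target core: 75% local node, 25% remote."""
    targets = []
    for n in range(NEURONS_PER_CORE):
        if n < LOCAL_FRACTION * NEURONS_PER_CORE:
            targets.append(("local", random.randrange(n_local_cores)))
        else:
            targets.append(("remote", random.randrange(n_remote_cores)))
    return targets

def tick_spikes(targets: list) -> list:
    """One 1 ms tick: each neuron fires independently at 10 Hz."""
    return [t for t in targets if random.random() < FIRE_PROB]

targets = wire_core(n_local_cores=16, n_remote_cores=48)
local = sum(1 for kind, _ in targets if kind == "local")
print(local / NEURONS_PER_CORE)  # 0.75
```

Keeping three quarters of the spike traffic on-node is what bounds the volume of inter-node one-sided puts per tick.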

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

4We do not use the CoCoMac model for real-time simulations because a system small enough to simulate in real time has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology; but now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressly or implied, of DARPA or the U.S. Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to the IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to the IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613–641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13485–13490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. Oxford University Press, New York, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing 07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 63:1–63:12. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063–1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153–160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50–55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Comput., vol. 23, no. 6, pp. 1503–1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31–41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN'09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201–3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456–458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomput., vol. 74, no. 1-3, pp. 3–29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 2007, pp. 349–398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83–90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261–275, 2008.

[27] R. Kötter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127–144, 2004.

[28] "CoCoMac (Collations of Connectivity data on the Macaque brain)," www.cocomac.org

[29] G. Paxinos, X. Huang, and M. Petrides, The Rhesus Monkey Brain in Stereotaxic Coordinates. Academic Press, 2008. [Online]. Available: http://books.google.co.in/books?id=7HW6HgAACAAJ

[30] G. Bezgin, R. Kötter, and A. T. Reid, "An introduction to the CoCoMac-Paxinos-3D viewer," in The Rhesus Monkey Brain in Stereotaxic Coordinates, G. Paxinos, X.-F. Huang, M. Petrides, and A. Toga, Eds. Elsevier, San Diego, 2008, ch. 4.

[31] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, pp. 106–154, 1962.

[32] A. Peters and E. G. Jones, Cerebral Cortex: Vol. 4, Association and Auditory Cortices. Springer, 1985.

[33] V. B. Mountcastle, "The columnar organization of the neocortex," Brain, vol. 120, no. 4, pp. 701–722, 1997.

[34] C. E. Collins, D. C. Airey, N. A. Young, D. B. Leitch, and J. H. Kaas, "Neuron densities vary across and within cortical areas in primates," Proceedings of the National Academy of Sciences, vol. 107, no. 36, pp. 15927–15932, 2010.

[35] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48–60, 2012.

[36] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q interconnection network and message unit," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 26:1–26:10. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063419

[37] "The Berkeley UPC Compiler," 2002, http://upc.lbl.gov

[38] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2009.5161076

[39] D. Bonachea, "GASNet Specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.

[40] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer, "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS '08. New York, NY, USA: ACM, 2008, pp. 94–103. [Online]. Available: http://doi.acm.org/10.1145/1375527.1375544

[41] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable Communication Protocols for Dynamic Sparse Data Exchange," in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'10). ACM, Jan. 2010, pp. 159–168.

[42] D. S. Modha, "A conceptual cortical surface atlas," PLoS ONE, vol. 4, no. 6, p. e5693, 2009.

[43] A. J. Sherbondy, R. Ananthanrayanan, R. F. Dougherty, D. S. Modha, and B. A. Wandell, "Think global, act local: projectome estimation with BlueMatter," in Proceedings of MICCAI 2009, Lecture Notes in Computer Science, 2009.

[44] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nano-scale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, forthcoming 2012.

  • Introduction
  • The TrueNorth architecture
  • The Compass neurosynaptic chip simulator
  • Parallel Compass Compiler
  • CoCoMac macaque brain model network
    • Structure
    • Long range connectivity
    • Local connectivity
      • Experimental evaluation
        • Hardware platforms
        • Weak scaling behavior
        • Strong scaling behavior
        • Thread scaling behavior
          • Comparison of the PGAS and MPI communication models for real-time simulation
            • Tuning for real-time simulation
            • Real-time simulation results
              • Conclusions
              • References
Page 9: SC2012 Compass Cr

we ran both PGAS and MPI versions of Compass on a systemof four 1024-node IBM Blue GeneP racks installed at IBMWatson as no CC++ PGAS compiler yet exists for the BlueGeneQ Each compute node in the Blue GeneP comprises 4CPUs with 4 GB of dedicated physical memory We compiledwith the IBM XL CC++ compiler version 90 (MPICH2 MPIlibrary version 107) For PGAS support we reimplementedthe Compass messaging subroutines using Unified Parallel C(UPC) and compiled with the Berkeley UPC compiler version2140 [37] [38] the Berkeley UPC runtime uses the GASNetlibrary [39] for one-sided communication Both GASNet andthe MPI library implementation on Blue GeneP build uponthe Deep Computing Messaging Framework (DCMF) [40] toachieve the best available communication performance

A Tuning for real-time simulation

Real-time simulationmdash1 millisecond of wall-clock time per1 millisecond of simulated timemdashis important for designingapplications on the TrueNorth architecture Real-time simula-tions enable us to implement and test applications targeted forTrueNorth cores in advance of obtaining the actual hardwareand to debug and cross-check the TrueNorth hardware designsSuch simulations however already require that we reduce thesize of the simulated system of TrueNorth cores in order toreduce the total per-tick latency of the Synapse and Neuronphases We looked for further performance improvements inthe Network phase to maximize the size of the simulatedsystem given that the Network phase incurs the bulk of thelatency in the main simulation loop

The PGAS model of one-sided messaging is better suitedto the behavior of simulated TrueNorth cores in Compassthan the MPI model is The source and ordering of spikesarriving at an axon during the Network phase in a simulatedtick do not affect the computations during the Synapse andNeuron phases of the next tick Therefore each Compassprocess can use one-sided message primitives to insert spikesin a globally-addressable buffer residing at remote processeswithout incurring either the overhead of buffering those spikesfor sending or the overhead of tag matching when usingtwo-sided MPI messaging primitives Further using one-sidedmessage primitives enables the use of a single global barrierwith very low latency to synchronize writing into the globally-addressable buffers with reading from the buffers insteadof needing a collective Reduce-Scatter operation that scaleslinearly with communicator size We did not attempt to useMPI-2 one-sided messaging primitives as Hoefler et al havedemonstrated that such an approach is not promising from aperformance standpoint [41]

We experimented with writing our own custom synchro-nization primitives to provide a local barrier for subgroupsof processes In this way each process only synchronizeswith other processes with which it sends or receives spikesas opposed to each process synchronizing with all otherprocesses In practice though we found that the BerkeleyUPC synchronization primitives which are built upon the fastDCMF native barriers outperformed our custom primitives

Fig 7 Results comparing the PGAS and MPI communication modelsfor real-time simulations in Compass over a Blue GeneP system of fourracks (16384 CPUs) We simulated 81K TrueNorth cores for 1000 ticks withneurons firing on average at 10 Hz For each size of Blue GeneP we report theresult for each implementation using the best-performing thread configuration

B Real-time simulation results

To compare the real-time simulation performance of thePGAS implementation of Compass to the MPI implemen-tation we ran a set of strong scaling experiments using asynthetic system of TrueNorth cores simulated over four 1024-node Blue GeneP racks comprising 16384 CPUs We beganby finding the largest size of system we could simulate inreal time on all four racks and then measured the time takento simulate the same system for 1000 ticks on progressivelyfewer racks For the synthetic system4 75 of the neuronsin each TrueNorth core connect to TrueNorth cores on thesame Blue GeneP node while the remaining 25 connect toTrueNorth cores on other nodes All neurons fire on averageat 10 Hz

Our results show that the PGAS implementation of Compass is able to simulate the synthetic system with faster run times than the MPI implementation. Figure 7 shows the strong scaling experimental results. The PGAS implementation is able to simulate 81K TrueNorth cores in real time (1000 ticks in one second) on four racks, while the MPI implementation takes 2.1× as long (1000 ticks in 2.1 seconds). The benefits of PGAS arise from the latency advantages of PGAS over MPI on the Blue Gene/P [38] and the elimination of the MPI Reduce-Scatter operation.

For each experiment, we only report the result for each implementation using the best-performing thread configuration. Thus, over four racks (16384 CPUs) of Blue Gene/P, we show the result for the MPI implementation with one MPI process (each having four threads) per node, while on one rack (4096 CPUs) of Blue Gene/P, we show the result for the MPI implementation with two MPI processes (each having two threads) per node. For all configurations, we show the result for the PGAS implementation with four UPC instances (each having one thread) per node.

⁴We do not use the CoCoMac model for real-time simulations because a simulated system of this size has insufficient TrueNorth cores to populate each CoCoMac region.

VIII. CONCLUSIONS

Cognitive Computing [2] is the quest for approximating the mind-like function, low power, small volume, and real-time performance of the human brain. To this end, we have pursued neuroscience [42], [43], [9], nanotechnology [3], [4], [44], [5], [6], and supercomputing [11], [12]. Building on these insights, we are marching ahead to develop and demonstrate a novel, ultra-low power, compact, modular, non-von Neumann cognitive computing architecture, namely TrueNorth. To set sail for TrueNorth, in this paper we reported a multi-threaded, massively parallel simulator, Compass, that is functionally equivalent to TrueNorth. As a result of a number of innovations in communication, computation, and memory, Compass demonstrates unprecedented scale, with its number of neurons comparable to the human cortex and number of synapses comparable to the monkey cortex, while achieving unprecedented time-to-solution. Compass is a Swiss-army knife for our ambitious project, supporting all aspects from architecture and algorithms to applications.

TrueNorth and Compass represent a massively parallel and distributed architecture for computation that complements the modern von Neumann architecture. To fully exploit this architecture, we need to go beyond the "intellectual bottleneck" [7] of long, sequential programming to short, parallel programming. To this end, our next step is to develop a new parallel programming language with compositional semantics that can provide application and algorithm designers with the tools to effectively and efficiently unleash the full power of the emerging new era of computing. Finally, TrueNorth is designed to best approximate the brain within the constraints of modern CMOS technology, but now, informed by TrueNorth and Compass, we ask how we might explore an entirely new technology that takes us beyond Moore's law, beyond the ever-increasing memory-processor latency, and beyond the need for ever-increasing clock rates, to a new era of computing. Exciting!

ACKNOWLEDGMENTS

This research was sponsored by DARPA under contract No. HR0011-09-C-0002. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

We thank Filipp Akopyan, John Arthur, Andrew Cassidy, Bryan Jackson, Rajit Manohar, Paul Merolla, Rodrigo Alvarez-Icaza, and Jun Sawada for their collaboration on the TrueNorth architecture, and our university partners Stefano Fusi, Rajit Manohar, Ashutosh Saxena, and Giulio Tononi, as well as their research teams, for their feedback on the Compass simulator. We are indebted to Fred Mintzer for access to IBM Blue Gene/P and Blue Gene/Q at the IBM T. J. Watson Research Center, and to George Fax, Kerry Kaliszewski, Andrew Schram, Faith W. Sell, and Steven M. Westerbeck for access to IBM Rochester Blue Gene/Q, without which this paper would have been impossible. Finally, we would like to thank David Peyton for his expert assistance revising this manuscript.

REFERENCES

[1] J. von Neumann, The Computer and The Brain. Yale University Press, 1958.

[2] D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh, "Cognitive computing," Communications of the ACM, vol. 54, no. 8, pp. 62–71, 2011.

[3] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[4] J. Seo, B. Brezzo, Y. Liu, B. Parker, S. Esser, R. Montoye, B. Rajendran, J. Tierno, L. Chang, D. Modha, and D. Friedman, "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," in Custom Integrated Circuits Conference (CICC), 2011 IEEE, Sept. 2011, pp. 1–4.

[5] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using event-driven QDI circuits," in ASYNC 2012: IEEE International Symposium on Asynchronous Circuits and Systems, 2012.

[6] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez-Icaza, A. Cassidy, S. Chandra, S. K. Esser, N. Imam, W. Risk, D. Rubin, R. Manohar, and D. S. Modha, "Building block of a programmable neuromorphic substrate: A digital neurosynaptic core," in International Joint Conference on Neural Networks, 2012.

[7] J. W. Backus, "Can programming be liberated from the von Neumann style? A functional style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp. 613–641, 1978.

[8] V. B. Mountcastle, Perceptual Neuroscience: The Cerebral Cortex. Harvard University Press, 1998.

[9] D. S. Modha and R. Singh, "Network architecture of the long distance pathways in the macaque brain," Proceedings of the National Academy of Sciences USA, vol. 107, no. 30, pp. 13 485–13 490, 2010. [Online]. Available: http://www.pnas.org/content/107/30/13485.abstract

[10] C. Koch, Biophysics of Computation: Information Processing in Single Neurons. Oxford University Press, New York, 1999.

[11] R. Ananthanarayanan and D. S. Modha, "Anatomy of a cortical simulator," in Supercomputing 07, 2007.

[12] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10⁹ neurons, 10¹³ synapses," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 63:1–63:12. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654124

[13] E. M. Izhikevich, "Which model to use for cortical spiking neurons?" IEEE Transactions on Neural Networks, vol. 15, pp. 1063–1070, 2004. [Online]. Available: http://www.nsi.edu/users/izhikevich/publications/whichmod.pdf

[14] H. Markram, "The Blue Brain Project," Nature Reviews Neuroscience, vol. 7, no. 2, pp. 153–160, Feb. 2006. [Online]. Available: http://dx.doi.org/10.1038/nrn1848

[15] H. Markram, "The Human Brain Project," Scientific American, vol. 306, no. 6, pp. 50–55, Jun. 2012.

[16] R. Brette and D. F. M. Goodman, "Vectorized algorithms for spiking neural network simulation," Neural Comput., vol. 23, no. 6, pp. 1503–1535, 2011. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00123

[17] M. Djurfeldt, M. Lundqvist, C. Johansson, M. Rehn, O. Ekeberg, and A. Lansner, "Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer," IBM J. Res. Dev., vol. 52, no. 1/2, pp. 31–41, Jan. 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1375990.1375994

[18] E. M. Izhikevich and G. M. Edelman, "Large-scale model of mammalian thalamocortical systems," Proceedings of the National Academy of Sciences, vol. 105, no. 9, pp. 3593–3598, Mar. 2008. [Online]. Available: http://dx.doi.org/10.1073/pnas.0712231105

[19] J. M. Nageswaran, N. Dutt, J. L. Krichmar, A. Nicolau, and A. Veidenbaum, "Efficient simulation of large-scale spiking neural networks using CUDA graphics processors," in Proceedings of the 2009 International Joint Conference on Neural Networks, ser. IJCNN'09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 3201–3208. [Online]. Available: http://dl.acm.org/citation.cfm?id=1704555.1704736

[20] M. M. Waldrop, "Computer modelling: Brain in a box," Nature, vol. 482, pp. 456–458, 2012. [Online]. Available: http://dx.doi.org/10.1038/482456a

[21] H. de Garis, C. Shuo, B. Goertzel, and L. Ruiting, "A world survey of artificial brain projects, part I: Large-scale brain simulations," Neurocomput., vol. 74, no. 1-3, pp. 3–29, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.neucom.2010.08.004

[22] R. Brette, M. Rudolph, T. Carnevale, M. Hines, D. Beeman, J. M. Bower, M. Diesmann, A. Morrison, P. H. Goodman, A. P. Davison, S. E. Boustani, and A. Destexhe, "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, pp. 349–398, 2007.

[23] D. Gregor, T. Hoefler, B. Barrett, and A. Lumsdaine, "Fixing Probe for Multi-Threaded MPI Applications," Indiana University, Tech. Rep. 674, Jan. 2009.

[24] R. Sinkhorn and P. Knopp, "Concerning nonnegative matrices and doubly stochastic matrices," Pacific Journal of Mathematics, vol. 21, no. 2, pp. 343–348, 1967.

[25] A. Marshall and I. Olkin, "Scaling of matrices to achieve specified row and column sums," Numerische Mathematik, vol. 12, pp. 83–90, 1968. [Online]. Available: http://dx.doi.org/10.1007/BF02170999

[26] P. Knight, "The Sinkhorn-Knopp algorithm: convergence and applications," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 261–275, 2008.

[27] R. Kötter, "Online retrieval, processing, and visualization of primate connectivity data from the CoCoMac database," Neuroinformatics, vol. 2, pp. 127–144, 2004.

[28] "CoCoMac (Collations of Connectivity data on the Macaque brain)," www.cocomac.org.

[29] G. Paxinos, X. Huang, and M. Petrides, The Rhesus Monkey Brain in Stereotaxic Coordinates. Academic Press, 2008. [Online]. Available: http://books.google.co.in/books?id=7HW6HgAACAAJ

[30] G. Bezgin, R. Kötter, and A. T. Reid, "An introduction to the CoCoMac-Paxinos-3D viewer," in The Rhesus Monkey Brain in Stereotaxic Coordinates, G. Paxinos, X.-F. Huang, M. Petrides, and A. Toga, Eds. Elsevier, San Diego, 2008, ch. 4.

[31] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," J. Physiol., vol. 160, no. 1, pp. 106–154, 1962.

[32] A. Peters and E. G. Jones, Cerebral Cortex: Vol. 4, Association and Auditory Cortices. Springer, 1985.

[33] V. B. Mountcastle, "The columnar organization of the neocortex," Brain, vol. 120, no. 4, pp. 701–722, 1997.

[34] C. E. Collins, D. C. Airey, N. A. Young, D. B. Leitch, and J. H. Kaas, "Neuron densities vary across and within cortical areas in primates," Proceedings of the National Academy of Sciences, vol. 107, no. 36, pp. 15 927–15 932, 2010.

[35] R. A. Haring, M. Ohmacht, T. W. Fox, M. K. Gschwind, D. L. Satterfield, K. Sugavanam, P. W. Coteus, P. Heidelberger, M. A. Blumrich, R. W. Wisniewski, A. Gara, G. L.-T. Chiu, P. A. Boyle, N. H. Christ, and C. Kim, "The IBM Blue Gene/Q Compute Chip," IEEE Micro, vol. 32, pp. 48–60, 2012.

[36] D. Chen, N. A. Eisley, P. Heidelberger, R. M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. L. Satterfield, B. Steinmacher-Burow, and J. J. Parker, "The IBM Blue Gene/Q interconnection network and message unit," in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011, pp. 26:1–26:10. [Online]. Available: http://doi.acm.org/10.1145/2063384.2063419

[37] "The Berkeley UPC Compiler," 2002, http://upc.lbl.gov.

[38] R. Nishtala, P. H. Hargrove, D. O. Bonachea, and K. A. Yelick, "Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, ser. IPDPS '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1109/IPDPS.2009.5161076

[39] D. Bonachea, "GASNet Specification, v1.1," University of California at Berkeley, Berkeley, CA, USA, Tech. Rep., 2002.

[40] S. Kumar, G. Dozsa, G. Almasi, P. Heidelberger, D. Chen, M. E. Giampapa, M. Blocksome, A. Faraj, J. Parker, J. Ratterman, B. Smith, and C. J. Archer, "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in Proceedings of the 22nd Annual International Conference on Supercomputing, ser. ICS '08. New York, NY, USA: ACM, 2008, pp. 94–103. [Online]. Available: http://doi.acm.org/10.1145/1375527.1375544

[41] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable Communication Protocols for Dynamic Sparse Data Exchange," in Proceedings of the 2010 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'10). ACM, Jan. 2010, pp. 159–168.

[42] D. S. Modha, "A conceptual cortical surface atlas," PLoS ONE, vol. 4, no. 6, p. e5693, 2009.

[43] A. J. Sherbondy, R. Ananthanrayanan, R. F. Dougherty, D. S. Modha, and B. A. Wandell, "Think global, act local: projectome estimation with BlueMatter," in Proceedings of MICCAI 2009, Lecture Notes in Computer Science, 2009.

[44] B. L. Jackson, B. Rajendran, G. S. Corrado, M. Breitwisch, G. W. Burr, R. Cheek, K. Gopalakrishnan, S. Raoux, C. T. Rettner, A. G. Schrott, R. S. Shenoy, B. N. Kurdi, C. H. Lam, and D. S. Modha, "Nano-scale electronic synapses using phase change devices," ACM Journal on Emerging Technologies in Computing Systems, forthcoming, 2012.
