



Enabling Scientific Computing on Memristive Accelerators

Ben Feinberg*, Uday Kumar Reddy Vengalam*, Nathan Whitehair*, Shibo Wang†, and Engin Ipek*†

*Department of Electrical and Computer Engineering
†Department of Computer Science
University of Rochester, Rochester, NY 14627 USA
*{bfeinber, uvengala, nwhiteha}@ece.rochester.edu  †{swang, ipek}@cs.rochester.edu

Abstract: Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the energy efficiency of a wide range of scientific applications. Recent work on memristive hardware accelerators shows significant potential to speed up matrix-vector multiplication (MVM), a critical linear algebra kernel at the heart of neural network inference tasks. Regrettably, the proposed hardware is constrained to a narrow range of workloads: although the eight- to 16-bit computations afforded by memristive MVM accelerators are acceptable for machine learning, they are insufficient for scientific computing, where high-precision floating point is the norm.

This paper presents the first proposal to enable scientific computing on memristive crossbars. Three techniques are explored (reducing overheads by exploiting exponent range locality, early termination of fixed-point computation, and static operation scheduling) that together enable a fixed-point memristive accelerator to perform high-precision floating point without the exorbitant cost of naïve floating-point emulation on fixed-point hardware. A heterogeneous collection of crossbars with varying sizes is proposed to efficiently handle sparse matrices, and an algorithm for mapping the dense subblocks of a sparse matrix to an appropriate set of crossbars is investigated. The accelerator can be combined with existing GPU-based systems to handle datasets that cannot be efficiently handled by the memristive accelerator alone. The proposed optimizations permit the memristive MVM concept to be applied to a wide range of problem domains, respectively improving the execution time and energy dissipation of sparse linear solvers by 10.3x and 10.9x over a purely GPU-based system.

Keywords: Accelerator Architectures; Resistive RAM.

I. INTRODUCTION

Computational models of dynamic systems are essential to our understanding of the world across a diverse set of problem domains, including high-energy physics [1], weather and climate modeling [2], biology [3], and even macroeconomics [4]. Achieving high fidelity using these models often demands immense computing power, potentially requiring thousands of nodes running over periods of weeks or months [5]. Enabling future advances in these problem domains requires significant improvements to computing efficiency on frequently used kernels. One important source of such kernels is linear algebra, which applies to nearly every domain of science and engineering [6].

Recently, a new class of accelerators based on memristive in-situ computation has been proposed [7]–[11]. These accelerators show significant potential for speeding up matrix-vector multiplication (MVM) in the context of connectionist machine learning models, such as deep neural networks and the Boltzmann Machine. Not only does in-situ MVM reduce the rapidly increasing cost of data movement [12], but it also allows for substantial parallelism by exploiting analog computation. Although promising, existing memristive accelerator proposals rely on the lax precision requirements of machine learning workloads to side-step the issue of precise computation, which makes these accelerators inapplicable to scientific computing.

This paper shows that a memristive accelerator can be architected to arbitrary precision requirements, preserving significant performance and efficiency gains over a GPU baseline while also meeting the particular needs of scientific workloads. Three techniques are proposed to perform floating-point computation with the same precision as IEEE-754 on the fundamentally fixed-point hardware of a memristive crossbar: 1) the overhead of floating- to fixed-point conversion is reduced by exploiting dynamic range locality; 2) each fixed-point computation is terminated as soon as the precision requirements of IEEE-754 have been satisfied; and 3) operations are scheduled statically to compute only the necessary bits. Although fixed-point emulation of floating point is not a new concept, and is often used on systems lacking floating-point hardware support [13], without the proposed optimizations it imposes a prohibitive throughput penalty on in-situ MVM.

This paper also addresses the problem of efficiently computing sparse MVM on memristive crossbars. The matrices generated by scientific applications are typically sparse [14]. Software optimizations that target sparse matrices have been widely explored, and the literature contains numerous solutions depending on the behavior of the underlying hardware [15]. These techniques, however, are not applicable to MVM performed using a memristive crossbar, because they rely on multiple levels of indirection that have no direct analog in in-situ linear algebra. To meet this challenge, a novel heterogeneous hardware substrate comprising crossbars with different sizes is proposed. A preprocessing step maps the dense subblocks within a sparse matrix onto the proposed substrate by exploiting the interlocking trade-offs among crossbar size, throughput, matrix density, and blocking efficiency.

The proposed techniques combine with existing GPU-based approaches to accelerating high-performance linear algebra. Some matrices, due to their sparsity patterns, cannot be mapped efficiently to even a heterogeneous substrate during the preprocessing step; in those rare cases, the computation is performed on a GPU instead. Notably, the choice between the accelerator and the GPU can be made quickly, based on the output of the preprocessing step.

The proposed system is evaluated on two high-performance iterative solvers with a set of 20 input matrices from the SuiteSparse collection [14], representing problem domains within computational fluid dynamics, structural analysis, circuit analysis, and seven others. Taken together, the proposed techniques improve the execution time by 10.3x and reduce the energy consumption by 10.9x over a conventional GPU platform.

II. BACKGROUND

The proposed accelerator builds upon prior work on memristive accelerators for machine learning, and on linear algebra for scientific computing.

A. Memristive Accelerators for Machine Learning and Graph Processing

The increasing prominence of machine learning workloads in recent years has led to many proposals for dedicated machine learning accelerators, both in academia [16] and industry [17]. These accelerators focus on the MVM operation as the primary kernel in connectionist machine learning models. More recently, a new class of MVM accelerators has been proposed that exploits the inherent parallelism of analog computation to perform MVM with memristive networks [7]–[11], [18]. Conceptually, a matrix is mapped onto a memristive crossbar such that the conductance of each crossbar cell is proportional to the corresponding matrix coefficient. For instance, in a binary matrix, 0s and 1s are respectively encoded as the low and high conductance states of the memristive devices. With these values mapped, a dot product operation is performed by applying voltages to each row and quantizing the output current at the columns,¹ as shown in Figure 1. Given a memristor at the intersection of row i and column j with resistance R_ij and row voltage V_i, the current through the memristor is V_i/R_ij; the total current flowing through the column is Σ_i V_i/R_ij, which is proportional to the dot product of the voltages and the conductances.

¹When describing hardware, we use the memory system's convention for the terms column and row; i.e., the rows of a matrix are mapped to the columns of the crossbars. To minimize confusion, we explicitly refer to rows within a matrix as "matrix rows".

Figure 1. A conceptual example of memristive crossbars for computation: line drivers apply voltages V1...VN to the rows, an ADC array quantizes the column currents, and a shift-and-add reduction network combines the results.


This conceptual example, however, assumes high-precision analog circuits and many-bit memristive devices, both of which incur substantial overheads. To reduce these overheads, prior work has employed bit slicing, a technique that maps each element of a matrix to multiple crossbars, reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient is mapped onto three binary crossbars. Note that while this example uses single-bit cells, the technique can be generalized to multi-bit cells.

$$
\begin{bmatrix} 3 & 1 \\ 6 & 5 \end{bmatrix}
= 2^2 \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix}
+ 2^1 \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}
+ 2^0 \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\qquad (1)
$$

To perform computation on the full operand, the partial dot products from all of the bit slices are combined through a shift-and-add reduction network [7], [8]; an example is shown on the right-hand side of Figure 1. The same bit-slicing technique is applied to the vector as well as the matrix.
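As a concrete illustration of the scheme, the following NumPy sketch (hypothetical code, not from the paper) emulates bit-sliced MVM: each binary slice of the matrix stands in for one crossbar, each vector bit slice is applied in turn, and a shift-and-add recombination weights the partial dot products so that the result matches an ordinary integer MVM.

```python
import numpy as np

def bit_slices(v, nbits):
    # Most significant slice first: v = sum_k 2^(nbits-1-k) * slices[k]
    return [(v >> k) & 1 for k in range(nbits - 1, -1, -1)]

def bit_sliced_mvm(A, x, a_bits=3, x_bits=3):
    """Emulate in-situ MVM with bit slicing (Equation 1): every binary matrix
    slice is 'programmed' into a crossbar, every vector bit slice is applied
    as row voltages, and the partial dot products are recombined by weight."""
    y = np.zeros(A.shape[0], dtype=np.int64)
    for i, A_i in enumerate(bit_slices(A, a_bits)):
        for j, x_j in enumerate(bit_slices(x, x_bits)):
            weight = 1 << ((a_bits - 1 - i) + (x_bits - 1 - j))
            y += weight * (A_i @ x_j)   # shift-and-add reduction
    return y

A = np.array([[3, 1], [6, 5]])          # the matrix from Equation 1
x = np.array([2, 7])
assert np.array_equal(bit_sliced_mvm(A, x), A @ x)
```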

B. High-Performance Linear Algebra for Scientific Computing

Complex physical models are often described by systems of partial differential equations (PDEs). There are no known methods for finding analytical solutions to general systems of PDEs. Thus, these continuous PDEs are typically discretized into a system of linear equations Ax = b, representing a mesh of points that approximate the original model [19]. The resulting system of linear equations is often sparse; that is, most of the variables have zero contribution to most of the linear equations. Conceptually, this sparsity is a result of the locality inherent in the physical system: variables in each linear equation represent the relationship between given points in the discretized mesh, and nearby points tend to have stronger relationships.

A sparse linear system can be solved using either direct or iterative methods. Direct methods include factorizations such as LU or Cholesky, as implemented in the High Performance LINPACK benchmark [20]. These direct methods can result in significant fill-in, where zero entries become non-zeroes; this increases the memory footprint and reduces the benefits of storing the matrix in a sparse format. Iterative methods, by contrast, do not involve modifying the underlying matrix. This results in significantly lower memory requirements, enabling the full sparse matrix to fit in memory, whereas a version with significant fill-in might not.

Iterative methods are subdivided into stationary and Krylov subspace methods. We focus on the latter, since they are the primary methods used today [19]. Krylov subspace methods iteratively improve an estimated solution x to the linear system Ax = b. These methods are typically implemented with a stopping tolerance ε, and terminate when |b − Ax| < ε. There are many Krylov subspace solvers with different behaviors depending on the matrix structure, including conjugate gradient (CG) for symmetric positive definite (SPD) matrices [21], as well as BiConjugate Gradient (BiCG), Stabilized BiCG (BiCG-STAB) [22], and Generalized Minimal Residual [23] for non-SPD matrices.

Due to the importance of PDEs and the wide applicability of sparse solvers, a number of accelerator systems leveraging FPGAs and ASICs have been proposed. Kung et al. [24] propose an SRAM-based processing-in-memory accelerator for simulating systems of differential equations. The accelerator uses 32-bit fixed point and does not meet the high-precision floating-point requirements of mainstream scientific applications (a key focus of our work). Additionally, Kung et al. accelerate the cellular nonlinear network model, which differs significantly from mainstream methods for solving PDEs; by contrast, we target widely accepted iterative solvers such as CG. Zhu et al. [25] discuss a graph processing accelerator that leverages content addressable memories (CAMs). This accelerator may be applicable to sparse MVM; however, the proposed hardware heavily depends on the sparsity of the input vector (less than 1% non-zeros) to limit CAM size. This vector sparsity is uncommon in iterative solvers: an analysis of the linear systems that we evaluate shows vector densities ranging from 30-100%, which would result in prohibitive CAM overheads. Kestur et al. [26] and Fowers et al. [27] propose FPGA-based accelerators that target sparse MVM for iterative solvers; however, the proposed approaches do not outperform a GPU. Dorrance et al. [28] propose an FPGA accelerator that achieves an average speedup of 1.28x over a GTX TITAN GPU on single-precision sparse MVM; however, it is not clear how the FPGA would scale to double precision. By contrast, the proposed memristive accelerator achieves an average speedup of 10.3x over a Tesla P100 GPU on double-precision sparse MVM.

Figure 2. System overview: banks of heterogeneous clusters share a global memory; within a bank, crossbar clusters feed de-bias, error-correction (ECU), and floating-point conversion (FPCU) stages that take each result from aligned fixed point through signed and corrected fixed point to an intermediate and finally IEEE-754 floating-point format, with input vector and result buffers interfacing to the local processor, which also handles the unblocked matrix elements.

III. SYSTEM ORGANIZATION

The accelerator is organized in a hierarchy of banks and clusters, as shown in Figure 2. Each bank comprises a heterogeneous set of clusters with different crossbar sizes, as well as a local processor that orchestrates computation and performs the operations not handled by the crossbars. Within a cluster, a group of crossbars performs sparse MVM on a fixed-size matrix block.

A. Banks

Unlike prior work on memristive accelerators [7], [8], the clusters within each bank are of different sizes. This allows the system to capture dense patterns within sparse matrices, as discussed in Section V. With sparse matrices, some elements are ill-suited to the memristive crossbars, either because the elements do not block efficiently or because the computation requires more bits than are provided by a cluster. In either case, the value is not mapped to a cluster and is instead handled by the local processor. Finally, cross-bank communication (for operations that require resources from multiple banks) is implemented through reads and writes to global memory.

To interface with the local processor, every cluster contains two memory-mapped SRAM buffers: an incoming vector buffer and a result buffer. The incoming vector buffer holds the vector to be multiplied by the matrix in floating-point format, and extracts the bit slices from the vector as needed. The partial result buffer holds the running sum of the partial dot products in an intermediate floating-point format. Once the computation on a block completes, each scalar in the partial result buffer is converted to a final IEEE-754 representation, which is read by the local processor.

B. Clusters

A cluster comprises a set of crossbars (similar to the mats in the Memristive Boltzmann Machine [7] or the in-situ multiply accumulate (IMA) units in ISAAC [8]) that perform MVM and reduction across all of the bit slices within a single matrix block. Unlike prior work, clusters are designed for IEEE-754 compatible double-precision floating-point computation. Each cluster contains 127 crossbars with single-bit cells that are organized in a shift-and-add reduction tree. The double-precision floating-point coefficients of a matrix block are converted to 118-bit fixed-point operands comprising a 53-bit mantissa, one sign bit, and up to 64 bits of padding for mantissa alignment. This 118-bit value is then encoded with a nine-bit error-correcting AN code (as proposed in [29]) for a full operand width of up to 127 bits.

Figure 3 shows the structure of the crossbars in a cluster and how computation proceeds. Initially, a vector bit slice x[i] is applied to all of the crossbars within a cluster, and the resulting dot product for each crossbar column is latched by a sample-and-hold circuit (1).² A set of ADCs (one per crossbar) then start scanning the values latched by the sample-and-hold units (2). The outputs of the ADCs are propagated through a shift-and-add reduction network (3), producing a single fixed-point value that corresponds to the dot product between a row of the matrix A and a single bit slice of the vector x. The system is pipelined such that every cycle, a single fixed-point value is produced by the reduction network, and after every column has been quantized, a new vector bit slice is applied to all of the crossbars.

Once the fixed-point result has been computed, three additional operations are performed to convert it to an intermediate floating-point representation (Figure 2). First, the bias is removed, converting the value from unsigned to signed fixed point. Second, error correction is applied. Third, the running sum for the relevant vector element is loaded from the partial result buffer, the incoming partial dot product is added to the running sum, and the result is written back. After the running sum is read from the partial result buffer, it is inspected to determine whether the precision requirements of IEEE-754 have been met (in which case no further vector bit slice computations are needed). If so, all of the crossbars in the cluster are signaled to skip quantizing the corresponding column in the remaining vector bit slice calculations.³ The criteria for establishing whether the precision requirements of IEEE-754 have been met are discussed in the next section.

IV. ACHIEVING HIGH PRECISION

Unlike machine learning applications, scientific workloads require high precision [30]. To enable scientific computing with memristive accelerators, the system must meet these precision requirements through floating-point format support and error correction.

²Throughout the paper, we use x_j[i] to refer to the j-th coefficient of the i-th bit slice of vector x.

³The total number of vector bit slices is still bounded by the number of vector bit slices required by the worst-case column.

Figure 3. Cluster organization: (1) a vector bit slice is applied to all crossbars and sample-and-hold (S/H) arrays latch the column outputs; (2) per-crossbar ADCs scan the latched values; (3) the ADC outputs feed the shift-and-add reduction network.

A. Floating-Point Computation on Fixed-Point Hardware

In double-precision floating point, every number is represented by a 53-bit mantissa (including the implied leading 1), an 11-bit exponent, and a sign bit [31]. When performing addition in a floating-point representation, the mantissas of the values must be aligned so that bits of the same magnitude can be added. Consider as an example two base-ten floating-point numbers with two-digit mantissas, on which the computation 1.2 + 0.13 is to be performed. To properly compute the sum, the values must be aligned based on the decimal point: 1.20 + 0.13. In a conventional floating-point unit, this alignment is performed dynamically, based on the exponent field of each value. On a memristive crossbar, however, since the addition occurs within the analog domain, the values must be converted to an aligned fixed-point format before they are programmed into the crossbar. This requires padding the floating-point mantissas with zeros: in the example above, the addition would be represented as 120 + 013 in fixed point, with an exponent of 10^-2.

Although this simple padding approach allows fixed-point hardware to perform floating-point operations, it comes at a significant cost. In double precision, for instance, naïve padding requires 2,100 bits in the worst case to fully account for the exponent range: 2,046 bits for padding, 53 bits for the mantissa itself, and one sign bit. Not only is this requirement prohibitive from a storage perspective, but full fixed-point computation would require multiplying each matrix bit slice by each vector bit slice, for a worst case of 4.4 million crossbar operations. Fortunately, two important properties of floating-point numbers can be exploited to reduce the cost of the fixed-point computation. First, while the IEEE-754 standard provides 11 bits for the exponent in double precision, few applications ever approach the full exponent range. Second, the naïve fixed-point computation described above provides more precision than IEEE-754, as it computes bits well beyond the 53-bit mantissa. These extraneous values are subsequently truncated when the result is converted back to floating point.

Figure 4. Illustrative example of partial product summation: (a) addition of the first four partial products; (b) overlap between the running sum and the mantissa; (c) carries can still modify the running sum; (d) the mantissa bits are settled.

B. Avoiding Extraneous Calculations

Three techniques are proposed to reduce the cost of performing floating-point computation on fixed-point hardware.

Exploiting exponent range locality. The 11-bit exponent field specified by IEEE-754 means that the dynamic range of floating-point values is 2^(2^11 − 2) (two exponent values are reserved for special values), or approximately 10^616. (For comparison, the size of the state space for the game of Go is only 10^172 [32].) IEEE-754 provides this range so that applications from a wide variety of domains can use floating-point values without additional scaling; however, it is highly unlikely that any model of a physical system will have a dynamic range requirement close to 10^616. In practice, the number of required pad bits is equal to the difference between the minimum and maximum exponents in the matrix, and can be covered by a few hundred bits rather than 2,046. Furthermore, since the fixed-point overhead is required only for numbers that are to be summed in the analog domain, the alignment overhead can be further reduced by exploiting locality. Since the matrix is much larger than a single crossbar, it is necessarily split into blocks (discussed in Section V-B1); only those values within the same block are added in the analog domain and require alignment. The matrices can be expected to exhibit locality because they often come from physical systems, where a large dynamic range between neighboring points is unlikely.

The aforementioned alignment procedure is also applied to the vector to be multiplied by the matrix. The alignment of mantissas encodes the exponent of each vector element relative to a single exponent assigned to the vector as a whole. Notably, multiplication does not require alignment, since it is sufficient to add the relevant exponents of the matrix block, the vector, and the exponent offset implied by the location of the leading 1 in each dot product.
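The following sketch (the helper name align_block is ours; positive values only, with signs handled by the biasing scheme of Section IV-C) shows how a block of doubles can be converted to aligned fixed point, with a shared exponent drawn from the block's own exponent range rather than the full IEEE-754 range:

```python
import numpy as np

def align_block(block, mant_bits=53):
    """Minimal sketch: convert a block of positive doubles to aligned fixed
    point. The shared exponent comes from the block's own exponent range
    (exponent range locality), so padding covers e_max - e_min bits instead
    of the worst-case 2,046 bits of IEEE-754 double precision."""
    _, e = np.frexp(block)                    # x = m * 2**e with 0.5 <= m < 1
    e_min, e_max = int(e.min()), int(e.max())
    shared_exp = e_min - mant_bits            # weight of the least significant bit
    ints = [round(float(x) * 2.0 ** -shared_exp) for x in block]
    width = mant_bits + (e_max - e_min)       # operand width after padding
    return ints, shared_exp, width

vals = np.array([1.5, 0.375, 20.0])
ints, shared_exp, width = align_block(vals)
# reconstruct: x ~= I * 2**shared_exp
assert all(abs(i * 2.0 ** shared_exp - v) < 1e-12 for i, v in zip(ints, vals))
```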

Early termination. In a conventional double-precision floating-point unit (FPU), each arithmetic operation is performed on two operands, after which the resulting mantissa is rounded and truncated to 53 bits. To compute a dot product between two vectors a · x, the FPU first calculates the product of two scalars a_1x_1, rounds and truncates the result to 53 bits, and adds the truncated product to subsequent products: a · x = a_1x_1 + a_2x_2 + ···. A memristive MVM accelerator, on the other hand, computes a dot product as an aggregation of partial dot products calculated over multiple bit slices. This necessitates a truncation strategy different from that of a digital FPU.

A naïve approach to truncation on a memristive MVM accelerator would be to perform all of the bit-sliced matrix-vector multiplications, aggregate the partial dot products, and truncate the result. This naïve approach would require a total of 127 × 127 crossbar operations, since every bit slice of a must be multiplied by every bit slice of x.⁴ However, many of these operations would contribute only to the portion of the mantissa beyond the point of truncation, wasting time and energy. By detecting the position of the leading 1 (i.e., the 1 at the most significant bit position in the final result), the computation can be safely terminated as soon as the 52 following bits of the mantissa have settled.

Consider the simplified example depicted in Figure 4, in which a series of partial products with different bit-slice weights are to be added (a). To clarify the exposition, the example assumes that the partial products are six bits wide and that the final sum is to be represented by a four-bit mantissa and an exponent. (The actual crossbars sum 127-bit partial products, encoding the result with a 53-bit mantissa and an exponent.) In the example, the addition of partial products proceeds iteratively from the most significant partial product toward the least significant, accumulating the result in a running sum. The goal is to determine whether this iterative process can be terminated early while still producing the same mantissa as the naïve approach.

Figure 4b shows the running sum after the summation of the four most significant partial products. Since the next

⁴Recall from Section III that scalars are encoded with up to 127-bit fixed point. All examples in this section assume a 127-bit representation for simplicity unless otherwise noted.

Figure 5. Regions within a running sum: the aligned region overlaps the remaining partial dot products, a carry region of consecutive 1s can propagate the single potential carry out, and a barrier-bit 0 absorbs that carry, protecting the stable region.

partial product still overlaps with the four-bit mantissa, it is still possible for the mantissa to be affected by subsequent additions. At a minimum, the computation must proceed until the mantissa in the running sum has cleared the overlap with the next partial product. This occurs in Figure 4c; however, at that point it is still possible for a carry from the subsequent partial products to propagate into and modify the mantissa. Notably, if the accumulation is halted at any point, the sum of the remaining partial products would produce at most one carry out into the running sum. Hence, in addition to clearing the overlap, safe termination requires that a 0 less significant than the mantissa be generated, such that it will absorb the single potential carry out, preventing any bit flips within the mantissa (Figure 4d).

At any given time, the running sum can be split into four non-overlapping regions, as shown in Figure 5. The aligned region is the portion of the running sum that overlaps with the subsequent partial products. A chain of 1s more significant than the aligned region is called the carry region, as it can propagate the single generated carry out forward. This carry, if present, is absorbed by a barrier bit, protecting the more significant bits in the stable region from bit flips. Thus, the running sum calculation can be terminated without loss of mantissa precision as soon as the full mantissa is contained in the stable region.
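A minimal sketch of this termination test, under the simplifying assumptions that the running sum is a non-negative Python integer and that overlap_msb marks the most significant bit position the remaining partial products can still affect (the helper name is ours, not the paper's):

```python
def can_terminate(running_sum: int, overlap_msb: int, mant_bits: int = 53) -> bool:
    """Early-termination test. Bits at positions <= overlap_msb (the aligned
    region) can still change, and the remaining partial products can inject
    at most one carry, which ripples through any run of 1s (the carry region)
    until a 0 (the barrier bit) absorbs it. The mantissa is settled once all
    mant_bits below the leading 1 lie strictly above that barrier bit."""
    if running_sum <= 0:
        return False
    pos = overlap_msb + 1
    while (running_sum >> pos) & 1:      # skip over the carry region of 1s
        pos += 1
    barrier = pos                        # lowest 0 above the aligned region
    leading_one = running_sum.bit_length() - 1
    # the stable region holds bits strictly above the barrier bit
    return leading_one - (mant_bits - 1) > barrier
```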

Scheduling array activations. Even with the early termination technique discussed above, there is still a possibility for extraneous crossbar operations to be performed. Although each partial dot product takes up to 127 bits, the final mantissa is truncated to 52 bits following the leading 1, requiring fewer than 127 bits to be computed in most cases. Therefore, some subset of the 127-bit partial dot product may be wasted, depending on the schedule according to which vector bit slices are applied to the crossbars.

Figure 6 shows three approaches to scheduling the crossbar activations. In the figure, rows and columns respectively represent the matrix and vector bit slices, sorted by significance. The numbers at the intersection of each row and column represent the significance of the resulting partial product. Different groupings indicate which bit-sliced matrix-vector multiplications are scheduled to occur simultaneously; the black-colored groupings must be performed, whereas the gray-colored groupings may be skipped due to early termination. Hence, the number of black groupings

Figure 6. Three policies for scheduling crossbar activations (vertical, diagonal, and hybrid grouping). Rows and columns are matrix and vector bit slices; each entry gives the significance of the corresponding partial product, and the groupings mark which activations are performed together and which are skipped.

correlates with execution time, and the number of crossbars enclosed by the black groupings correlates with crossbar activation energy. The example shown in the figure assumes that the calculation can be terminated early after the fifth most significant bit of the mantissa is computed. (That is, all groupings computing a partial product with significance of at least 2 must be performed.)

In the leftmost case, all of the crossbars are activated with the same vector bit slice, after which the next significant vector bit slice is applied to the crossbars. This naïve vertical grouping requires 16 crossbar activations, performed over four time steps. In the next example, arrays are activated with different vector bit slices, such that each active array is contributing to the same slice of the output vector simultaneously. This diagonal grouping results in 13 crossbar activations (the minimum required in the example) over five time steps. In general, this policy takes the minimum number of crossbar activations at the cost of additional latency.

The hybrid grouping shown on the right, which we adopt in the evaluation, strikes a balance between the respective energy and latency advantages of the diagonal and vertical groupings. The result is 14 crossbar activations performed over four time steps. The more closely the hybrid grouping approximates a diagonal grouping, the greater the energy savings, at the cost of latency.
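The activation counts quoted above can be reproduced with a small script mirroring the 4x4 example of Figure 6, where the significance of a partial product is the sum of its matrix- and vector-slice weights and only significances of at least 2 are needed (a toy model, not the accelerator's scheduler):

```python
from itertools import product

SLICES = range(4)   # bit-slice weights 0..3, as in Figure 6
THRESH = 2          # early termination: only significance >= 2 is needed

def cost(groups):
    """A group (bit-sliced MVMs fired together) is performed iff it contains
    any partial product at or above the threshold; activations correlate with
    energy, and the number of performed groups with execution time."""
    needed = [g for g in groups if any(r + c >= THRESH for r, c in g)]
    return sum(len(g) for g in needed), len(needed)

pairs = list(product(SLICES, repeat=2))                   # (matrix, vector) slices
vertical = [[(r, c) for r in SLICES] for c in SLICES]     # one vector slice per step
diagonal = [[p for p in pairs if sum(p) == s] for s in range(7)]

print(cost(vertical))   # (16, 4): every column group must fire
print(cost(diagonal))   # (13, 5): fewest activations, one extra time step
```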

Taken together, the above techniques significantly reduce the overhead of performing floating-point computation on fixed-point hardware by restricting computations to only those that are likely to produce a bit needed for the final mantissa. The techniques render the latency and energy consumption strongly data dependent, effectively forming a data-specific subset of the floating-point format without any loss in precision.

C. Handling Negative Numbers

The previously proposed ISAAC memristive accelerator [8] uses a biasing scheme to handle negative numbers. The approach adds a bias of 2^15 to each 16-bit operand, such that the values range from 0 to 2^16 rather than ±2^15. Results are de-biased by subtracting 2^15 × p_v, where p_v is the population count (i.e., the number of bits set to 1) of the relevant bit slice of the vector. We adopt this technique with one modification: in place of the constant 2^15, we use a per-block biasing constant based on the actual floating-point exponent range contained within the block.
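A small sketch of the biasing arithmetic (with the fixed 2^15 constant for clarity; the accelerator would substitute its per-block constant):

```python
import numpy as np

BIAS = 1 << 15   # 2^15 bias for signed 16-bit operands

def biased_dot(a_signed, x_bits):
    """ISAAC-style de-biasing sketch: map signed operands to unsigned,
    accumulate in the 'crossbar', then subtract BIAS * popcount(vector slice)."""
    a_unsigned = a_signed + BIAS                # values now in [0, 2^16)
    raw = int(np.dot(a_unsigned, x_bits))       # what the crossbar computes
    return raw - BIAS * int(x_bits.sum())       # de-bias: subtract 2^15 * p_v

a = np.array([-5, 3, -1, 7])
x = np.array([1, 0, 1, 1])                      # one vector bit slice
assert biased_dot(a, x) == int(np.dot(a, x))
```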

D. Floating-Point Anomalies and Rounding Modes

The IEEE-754 floating-point standard specifies a number of floating-point exceptions and rounding modes that must be handled correctly for proper operation [31].

Rounding modes. When the proposed accelerator is operated in its typical configuration, mantissa alignment and leading-1 detection result in a truncation of the result to the required precision. The potential contributions of any uncalculated bits beyond the least significant bit would only result in a mantissa of greater magnitude than what is ultimately obtained, and since a biasing scheme is used to enable the representation of negative values (Section IV-C), the truncation is equivalent to rounding the result of a dot product toward negative infinity. For applications requiring other rounding modes (e.g., to nearest, toward positive infinity, toward zero), the accelerator can be configured to compute three additional settled bits before truncation and perform the rounding according to those bits.

Infinities and NaNs. Signed infinities and not-a-numbers (NaNs) are valid floating-point values; however, they cannot be mapped to the crossbars in a meaningful way in the proposed accelerator, nor can they be included in any dot product computations without producing a NaN, infinity, or invalid result [31]. Therefore, the accelerator requires all input matrices and vectors to contain no infinities or NaNs, and any infinities or NaNs arising from an intermediate computation in the local processor are handled there, to prevent them from propagating to the resulting dot product.

Floating-point exceptions. Operations with floating-point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values. (Only the range of exponents is bounded.) This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating-point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be respectively set to all 1s or all 0s for overflow and underflow conditions. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., 0/0, √−|c|) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating-point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates using the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those used in prior work introduces new sources of error: if the array size is comparable to the dynamic range of the memristors, the non-zero current in the R_HI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512×512, with a memristor dynamic range of 1.5×10^3. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating to fixed point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit for detection. The correction is applied after the reduction, but before the leading-one detection.
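The detection side of an AN code is simple to sketch: codewords are multiples of A, and any single bit flip adds ±2^k, which is never a multiple of an odd A. (Correction, which maps syndromes back to bit positions, is omitted here.)

```python
A = 251  # AN code: valid codewords are exactly the multiples of A

def an_encode(n: int) -> int:
    return A * n

def an_detect(word: int) -> bool:
    """True if the word passes the AN-code check (word divisible by A)."""
    return word % A == 0

w = an_encode(0b1011)
assert an_detect(w)
assert not an_detect(w ^ (1 << 7))   # a single bit flip breaks divisibility
```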

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix M×N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation, divided by the latency of that computation, τ_MVM. As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar (and consequently the number of sums-of-products resulting from a cluster operation) also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both block size M×N and block density d_block. Actual throughput for a given block is therefore d_block × M × N / τ_MVM.

Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.

The number of non-zero elements actually mapped to a crossbar is in practice far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix: 0.37% for the densest matrix in the evaluation.

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of Σ_{i=1}^{N} 1×1 = N. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of ⌈log2(N+1)⌉ bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], conversion time per column is strictly proportional to ADC resolution, which increases with log2 N, where N is the number of rows, and the number of ADC conversions per binary MVM operation is equal to the number of columns M. Total conversion time is thus proportional to M⌈log2(N+1)⌉.

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy. The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to N log2 N. The column count M determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to M × N log2 N.

Crossbar energy. Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within less than LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M×N crossbar are both proportional to M+N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or log2 N. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells M×N by the average path conductance, which is proportional to 1/(M+N). Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus found to be proportional to (M×N)(M+N) log2 N.

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by drivers; consequently, the total crossbar area grows as M(M+N).
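Collecting the proportionalities derived in this subsection for an M × N crossbar (a summary of the statements above, not additional results):

$$
\begin{aligned}
T_{\text{conv}} &\propto M\,\lceil \log_2(N+1) \rceil, \\
E_{\text{ADC}} &\propto M \times N \log_2 N, \\
E_{\text{xbar}} &\propto (M \times N)(M+N)\log_2 N, \\
A_{\text{ADC}} &\propto N, \qquad A_{\text{xbar}} \propto M(M+N).
\end{aligned}
$$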

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. By contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12×12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates, and for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold: if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered, and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are completely removed from the mapping and are instead processed separately by the local processor. Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Figure 7. Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity; (b) Pres_Poisson blocking; (c) Xenon1 sparsity; (d) Xenon1 blocking.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4×NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8×NNZ.
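A simplified sketch of one candidate-block test from this preprocessing step (the dictionary representation and helper are hypothetical, and the exponent filter is anchored at the block's minimum exponent rather than performing true selective removal):

```python
import numpy as np

BLOCK_SIZES = (512, 256, 128, 64)
MAX_EXP_RANGE = 117

def try_block(coo, r0, c0, size, nnz_threshold):
    """Test one candidate block: gather the non-zeros it captures, drop
    elements whose exponents fall outside the allowable range, and accept
    the block only if enough non-zeros remain. `coo` maps (row, col) -> value."""
    elems = {(r, c): v for (r, c), v in coo.items()
             if r0 <= r < r0 + size and c0 <= c < c0 + size}
    if not elems:
        return None
    exps = {rc: int(np.frexp(v)[1]) for rc, v in elems.items()}
    e_min = min(exps.values())
    kept = {rc: v for rc, v in elems.items()
            if exps[rc] - e_min <= MAX_EXP_RANGE}
    return kept if len(kept) >= nnz_threshold else None
```

Rejected candidates would then be retried at the next smaller size in BLOCK_SIZES, with leftover elements falling through to the local processor.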

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding. Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows, N. To handle this extreme case, the ADC would require ⌈log2(N+1)⌉ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log2 N is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to log2[N] − 1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the R_LO state), which in turn reduces crossbar energy.
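A sketch of CIC on a single column (hypothetical helpers; in hardware, the inversion flag would be stored per column and consumed by the reduction logic):

```python
import numpy as np

def invert_code_column(col_bits):
    """Store the complement of any column with more than N/2 ones, so every
    stored column has at most N/2 ones and the ADC needs one less bit."""
    invert = int(col_bits.sum()) > len(col_bits) // 2
    stored = 1 - col_bits if invert else col_bits
    return stored, invert

def column_dot(stored, invert, x_slice):
    raw = int(stored @ x_slice)
    # recover the true partial dot product: subtract from the popcount
    # of the applied vector bit slice if the column was inverted
    return int(x_slice.sum()) - raw if invert else raw

col = np.array([1, 1, 1, 0])      # 3 of 4 ones: stored inverted
x = np.array([1, 0, 1, 1])        # one vector bit slice
stored, inv = invert_code_column(col)
assert column_dot(stored, inv, x) == int(col @ x)
```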

ADC headstart. The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most ⌈log2(d_row × 0.5 × N)⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position, and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware, and must be followed by the programmer for correctness.

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

We implement the three primary kernels needed by Krylov subspace solvers.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register, and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.
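For reference, a CG skeleton in the style the local processors implement, showing where the three kernels appear (a textbook formulation, not the accelerator firmware):

```python
import numpy as np

def cg(matvec, b, tol=1e-8, max_iter=1000):
    """Conjugate gradient built from the three kernels above: sparse MVM
    (matvec, offloaded to the clusters), AXPY, and dense-vector dot product."""
    x = np.zeros_like(b)
    r = b - matvec(x)                  # kernel 1: sparse MVM
    p = r.copy()
    rs = r @ r                         # kernel 2: dot product
    for _ in range(max_iter):
        Ap = matvec(p)                 # kernel 1: sparse MVM
        alpha = rs / (p @ Ap)
        x += alpha * p                 # kernel 3: AXPY
        r -= alpha * Ap                # kernel 3: AXPY
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stopping tolerance: |b - Ax| < eps
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```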

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node, using process parameters and design rules from FreePDK15 [38], coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density, or by actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach, using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I. ACCELERATOR CONFIGURATION

System    (128) banks; double-precision floating point; fclk = 1.2 GHz; 15nm process; VDD = 0.80 V
Bank      (2) 512×512 clusters; (4) 256×256 clusters; (6) 128×128 clusters; (8) 64×64 clusters; 1 LEON core
Cluster   127 bit-slice crossbars
Crossbar  N×N cells; (log2[N] − 1)-bit pipelined SAR ADC; 2N drivers
Cell      [18], [40] TaOx; Ron = 2 kΩ; Roff = 3 MΩ; Vread = 0.2 V; Vset = −2.6 V; Vreset = 2.6 V; Ewrite = 3.91 nJ; Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits, based on [43] and sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits, based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point; 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. Area, power, and the internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU for synthesis, we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGEN [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix             NNZs      Rows     NNZ/Row   Blocked [%]
2cubes_sphere      1647264   101492   16.2      49.7
crystm03           583770    24696    23.6      94.7
finan512           596992    74752    7.9       46.7
G2_circuit         726674    150102   4.8       60.9
nasasrb            2677324   54870    48.8      99.1
Pres_Poisson       715804    14822    48.3      96.4
qa8fm              1660579   66127    25.1      92.8
ship_001           3896496   34920    111.6     66.4
thermomech_TC      711558    102158   6.8       0.8
Trefethen_20000    554466    20000    27.7      63.3
ASIC_100k          940621    99340    9.5       60.9
bcircuit           375558    68902    5.4       64.9
epb3               463625    84617    5.5       72.2
GaAsH6             3381809   61349    55.1      69.2
ns3Da              1679599   20414    82.3      3.2
Si34H36            5156379   97569    52.8      53.7
torso2             1033473   115697   8.9       98.1
venkat25           1717792   62424    27.5      79.8
wang3              177168    26064    6.8       64.6
xenon1             1181120   48600    24.3      81.0

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], with 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline (log scale; per-matrix bars and G-MEAN: 10.3x).

VIII. EVALUATION

We evaluate the system's performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3x over the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, and the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row, showing above-average but not exceptional speedup due to their lower blocking efficiencies (i.e., the percentage of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, Table II contains two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node, and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case cost of four MVM operations (Section V-B1), and since the actual cost reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

Figure 9. Accelerator energy consumption normalized to the GPU baseline (log scale; per-matrix bars and G-MEAN: 0.092).

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2x on the 18 matrices executed on the accelerator, and by 10.9x over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: it never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm2, which is lower than the 610 mm2 die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, and unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry, rather than the ADCs, are the dominant area consumers, accounting for 54.1% of the overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2),

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size   Area [mm2]   Energy [pJ]   Latency [ns]
64     0.00078      2.80          5.33
128    0.00103      6.52          10.7
256    0.00162      15.0          21.3
512    0.00352      34.2          42.7

Figure 10. Overhead of preprocessing and write time as a percentage of total solve time on the accelerator (per-matrix bars; write time and preprocessing time shown separately).

the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network. The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core could be integrated, based on application demands, without substantially increasing the system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in the different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set; the overhead generally falls as the size of the linear system increases, because the number of iterations required to converge grows faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even lower in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix (ignoring both the sparsity of the blocks and the time-step behavior mentioned above), the system lifetime is sufficient: given a memristive device with a conservative endurance of 10^9 writes [55]-[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

Figure 11. Sparsity and blocking patterns of ns3Da: (a) ns3Da sparsity; (b) ns3Da blocking.

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with particularly poor blocking efficiency. The figure shows that despite the relatively high density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64x64 blocks and 4 per matrix block-row in 128x128 blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation. This in turn introduces error into the computation, which tends to hinder convergence rates.

Figure 12. Iteration count as a function of bits per cell (B) and dynamic range (D), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors to determine how the results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the dependence becomes stronger as the number of bits per cell is increased: with the same tolerance, higher programming error introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3x throughput improvement and a 10.9x reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Advances in High Energy Physics, vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177-182, May 2014.

[6] P. Messina, "The US DOE exascale computing project: goals and challenges," February 2017. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser, "SoftFloat," 2002. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1-1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803-820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631-644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856-869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural network accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54-61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1-70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474-475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049-3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736-1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49-58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] "Predictive technology model (PTM)." [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108-111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25-27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985-1996, 2004.

[45] "LEON3/GRLIB." [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symp. on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide." [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1-14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230-239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Intl. Symp. on Performance Analysis of Systems and Software (ISPASS), April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs: an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE Intl. Electron Devices Meeting (IEDM), 2008, pp. 1-4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.


To address this challenge, a novel heterogeneous hardware substrate comprising crossbars with different sizes is proposed. A preprocessing step maps the dense subblocks within a sparse matrix onto the proposed substrate by exploiting the interlocking trade-offs among crossbar size, throughput, matrix density, and blocking efficiency.

The proposed techniques combine with existing GPU-based approaches to accelerating high-performance linear algebra. Some matrices, due to their sparsity patterns, cannot be mapped efficiently to even a heterogeneous substrate during the preprocessing step; in those rare cases, the computation is performed on a GPU instead. Notably, the choice between the accelerator and the GPU can be made quickly, based on the output of the preprocessing step.

The proposed system is evaluated on two high-performance iterative solvers with a set of 20 input matrices from the SuiteSparse collection [14], representing problem domains within computational fluid dynamics, structural analysis, circuit analysis, and seven others. Taken together, the proposed techniques improve the execution time by 10.3x and reduce the energy consumption by 10.9x over a conventional GPU platform.

II. BACKGROUND

The proposed accelerator builds upon prior work on memristive accelerators for machine learning, and on linear algebra for scientific computing.

A. Memristive Accelerators for Machine Learning and Graph Processing

The increasing prominence of machine learning workloads in recent years has led to many proposals for dedicated machine learning accelerators, both in academia [16] and industry [17]. These accelerators focus on the MVM operation as the primary kernel in connectionist machine learning models. More recently, a new class of MVM accelerators has been proposed that exploits the inherent parallelism of analog computation to perform MVM with memristive networks [7]-[11], [18]. Conceptually, a matrix is mapped onto a memristive crossbar such that the conductance of each crossbar cell is proportional to the corresponding matrix coefficient. For instance, in a binary matrix, 0s and 1s are respectively encoded as the low and high conductance states of the memristive devices. With these values mapped, a dot product operation is performed by applying voltages to each row and quantizing the output current at the columns,¹ as shown in Figure 1. Given a memristor at the intersection of row i and column j with resistance Rij and row voltage Vi, the current through the memristor is Vi/Rij; the total current flowing through the column, Σi Vi/Rij, is proportional to the dot product of the voltages and the conductances.

¹When describing hardware, we use the memory systems convention for the terms column and row; i.e., the rows of a matrix are mapped to the columns of the crossbars. To minimize confusion, we explicitly refer to rows within a matrix as "matrix rows".

Figure 1. A conceptual example of memristive crossbars for computation: line drivers apply voltages V1...VN to the rows, column currents are quantized by an ADC array, and a shift-and-add reduction network combines the results.

This conceptual example, however, assumes high-precision analog circuits and many-bit memristive devices, both of which incur substantial overheads. To reduce these overheads, prior work has employed bit slicing, a technique to map each element of a matrix to multiple crossbars, reducing the required memristor resolution [7]-[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient is mapped onto three binary crossbars. Note that while this example uses single-bit cells, the technique can be generalized to multi-bit cells.

    [ 3 1 ]        [ 0 0 ]        [ 1 0 ]        [ 1 1 ]
    [ 6 5 ]  = 2^2 [ 1 1 ]  + 2^1 [ 1 0 ]  + 2^0 [ 0 1 ]        (1)

To perform computation on the full operand, the partial dot products from all of the bit slices are combined through a shift-and-add reduction network [7], [8]; an example is shown on the right-hand side of Figure 1. The same bit slicing technique is applied to the vector as well as the matrix.
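For concreteness, the following C sketch reproduces this reduction in software for the 2x2, 3-bit matrix of Equation 1; it is a functional model of the shift-and-add combining, not the hardware datapath.

    #include <stdint.h>
    #include <stddef.h>

    #define MBITS 3   /* matrix bit slices, as in Equation 1 */
    #define VBITS 3   /* vector bit slices */
    #define DIM   2   /* 2x2 example matrix */

    /* A[s][i][j] holds bit s of matrix coefficient a_ij; x[s][j] holds
     * bit s of vector element x_j. */
    uint64_t dot_row(const uint8_t A[MBITS][DIM][DIM],
                     const uint8_t x[VBITS][DIM], size_t row)
    {
        uint64_t sum = 0;
        for (int ms = 0; ms < MBITS; ms++)          /* matrix slices */
            for (int vs = 0; vs < VBITS; vs++) {    /* vector slices */
                uint64_t partial = 0;
                for (size_t j = 0; j < DIM; j++)    /* models the column current */
                    partial += A[ms][row][j] & x[vs][j];
                sum += partial << (ms + vs);        /* shift-and-add */
            }
        return sum;
    }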

B. High-Performance Linear Algebra for Scientific Computing

Complex physical models are often described by systems of partial differential equations (PDEs). There are no known methods for finding analytical solutions to general systems of PDEs; thus, these continuous PDEs are typically discretized into a system of linear equations Ax = b, representing a mesh of points that approximates the original model [19]. The resulting system of linear equations is often sparse; that is, most of the variables have zero contribution to most of the linear equations. Conceptually, this sparsity is a result of the locality inherent in the physical system: variables in each linear equation represent the relationship between given points in the discretized mesh, and nearby points tend to have stronger relationships.

A sparse linear system can be solved using either direct or iterative methods. Direct methods include factorizations such as LU or Cholesky, as implemented in the High Performance LINPACK benchmark [20]. These direct methods can result in significant fill-in, where zero entries become non-zeroes; this increases the memory footprint and reduces the benefits of storing the matrix in a sparse format. Iterative methods, by contrast, do not modify the underlying matrix. This results in significantly lower memory requirements, enabling the full sparse matrix to fit in memory where the same matrix with significant fill-in might not.

Iterative methods are subdivided into stationary and Krylov subspace methods. We focus on the latter, since they are the primary methods used today [19]. Krylov subspace methods iteratively improve an estimated solution x to the linear system Ax = b. These methods are typically implemented using a stopping tolerance ε, and terminate when |b - Ax| < ε. There are many Krylov subspace solvers with different behaviors depending on the matrix structure, including conjugate gradient (CG) for symmetric positive definite (SPD) matrices [21], as well as BiConjugate Gradient (BiCG), Stabilized BiCG (BiCG-STAB) [22], and Generalized Minimal Residual [23] for non-SPD matrices.
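As a reference for the kernels involved, here is a compact CG loop in C with the |b - Ax| < ε stopping rule; spmv, dot, and axpy are assumed handles to the three kernels discussed earlier (MVM, dot product, and AXPY), and the fixed 1024-element buffers are an illustrative bound, not a system limit.

    #include <math.h>
    #include <stddef.h>

    #define MAX_N 1024   /* illustrative bound on the system size */

    extern void   spmv(const void *A, const double *x, double *y, size_t n);
    extern double dot(const double *x, const double *y, size_t n);
    extern void   axpy(double a, const double *x, double *y, size_t n); /* y += a*x */

    void cg(const void *A, const double *b, double *x, size_t n, double eps)
    {
        double r[MAX_N], p[MAX_N], Ap[MAX_N];
        spmv(A, x, Ap, n);                       /* Ap = A * x0 */
        for (size_t i = 0; i < n; i++) {
            r[i] = b[i] - Ap[i];                 /* r = b - A*x  */
            p[i] = r[i];
        }
        double rs = dot(r, r, n);
        while (sqrt(rs) >= eps) {                /* stop when |b - Ax| < eps */
            spmv(A, p, Ap, n);
            double alpha = rs / dot(p, Ap, n);
            axpy( alpha, p,  x, n);              /* x += alpha * p    */
            axpy(-alpha, Ap, r, n);              /* r -= alpha * A*p  */
            double rs_new = dot(r, r, n);
            for (size_t i = 0; i < n; i++)
                p[i] = r[i] + (rs_new / rs) * p[i];
            rs = rs_new;
        }
    }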

Due to the importance of PDEs and the wide applicability of sparse solvers, a number of accelerator systems leveraging FPGAs and ASICs have been proposed. Kung et al. [24] propose an SRAM-based processing-in-memory accelerator for simulating systems of differential equations. The accelerator uses 32-bit fixed point and does not meet the high-precision floating-point requirements of mainstream scientific applications (a key focus of our work). Additionally, Kung et al. accelerate the cellular nonlinear network model, which differs significantly from mainstream methods for solving PDEs; by contrast, we target widely accepted iterative solvers such as CG. Zhu et al. [25] discuss a graph processing accelerator that leverages content addressable memories (CAMs). This accelerator may be applicable to sparse MVM; however, the proposed hardware heavily depends on the sparsity of the input vector (less than 1% non-zeros) to limit CAM size. This vector sparsity is uncommon in iterative solvers: an analysis of the linear systems that we evaluate shows vector densities ranging from 30-100%, which would result in prohibitive CAM overheads. Kestur et al. [26] and Fowers et al. [27] propose FPGA-based accelerators that target sparse MVM for iterative solvers; however, the proposed approaches do not outperform a GPU. Dorrance et al. [28] propose an FPGA accelerator that achieves an average speedup of 1.28x over a GTX TITAN GPU on single-precision sparse MVM; however, it is not clear how the FPGA would scale to double precision. By contrast, the proposed memristive accelerator achieves an average speedup of 10.3x over a Tesla P100 GPU on double-precision sparse MVM.

Figure 2. System overview. Banks (each with an input vector buffer, a local processor, and clusters of crossbars) share a global memory; within a cluster, column results pass through de-bias, error correction (ECU), and floating-point conversion (FPCU) stages into a result buffer, moving from aligned fixed point to signed fixed point, corrected fixed point, intermediate floating point, and finally IEEE-754 floating point, while unblocked matrix elements are handled by the local processor.

III. SYSTEM ORGANIZATION

The accelerator is organized in a hierarchy of banks and clusters, as shown in Figure 2. Each bank comprises a heterogeneous set of clusters with different crossbar sizes, as well as a local processor that orchestrates computation and performs the operations not handled by the crossbars. Within a cluster, a group of crossbars performs sparse MVM on a fixed-size matrix block.

A. Banks

Unlike prior work on memristive accelerators [7], [8], the clusters within each bank are of different sizes. This allows the system to capture dense patterns within sparse matrices, as discussed in Section V. With sparse matrices, some elements are ill-suited to the memristive crossbars, either because the elements do not block efficiently or because the computation requires more bits than are provided by a cluster. In either case, the value is not mapped to a cluster, and is instead handled by the local processor. Finally, cross-bank communication (for operations that require resources from multiple banks) is implemented through reads and writes to global memory.

To interface with the local processor, every cluster contains two memory-mapped SRAM buffers: an incoming vector buffer and a partial result buffer. The incoming vector buffer holds the vector to be multiplied by the matrix in floating-point format and extracts the bit slices from the vector as needed. The partial result buffer holds the running sum of the partial dot products in an intermediate floating-point format. Once the computation on a block completes, each scalar in the partial result buffer is converted to a final IEEE-754 representation, which is read by the local processor.

B. Clusters

A cluster comprises a set of crossbars, similar to the mats in the Memristive Boltzmann Machine [7] or the in-situ multiply accumulate (IMA) units in ISAAC [8], that perform MVM and reduction across all of the bit slices within a single matrix block. Unlike prior work, clusters are designed for IEEE-754 compatible double-precision floating-point computation. Each cluster contains 127 crossbars with single-bit cells that are organized in a shift-and-add reduction tree. The double-precision floating-point coefficients of a matrix block are converted to 118-bit fixed-point operands comprising a 53-bit mantissa, one sign bit, and up to 64 bits of padding for mantissa alignment. This 118-bit value is then encoded with a nine-bit error-correcting AN code (as proposed in [29]), for a full operand width of up to 127 bits.

Figure 3 shows the structure of the crossbars in a cluster and how computation proceeds. Initially, a vector bit slice x[i] is applied to all of the crossbars within a cluster, and the resulting dot product for each crossbar column is latched by a sample-and-hold circuit (1).² A set of ADCs (one per crossbar) then starts scanning the values latched by the sample-and-hold units (2). The outputs of the ADCs are propagated through a shift-and-add reduction network (3), producing a single fixed-point value that corresponds to the dot product between a row of the matrix A and a single bit slice of the vector x. The system is pipelined such that every cycle, a single fixed-point value is produced by the reduction network; after every column has been quantized, a new vector bit slice is applied to all of the crossbars.

Once the fixed-point result has been computed, three additional operations are performed to convert it to an intermediate floating-point representation (Figure 2). First, the bias is removed, converting the value from unsigned to signed fixed point. Second, error correction is applied. Third, the running sum for the relevant vector element is loaded from the partial result buffer, the incoming partial dot product is added to the running sum, and the result is written back. After the running sum is read from the partial result buffer, it is inspected to determine whether the precision requirements of IEEE-754 have been met (in which case no further vector bit slice computations are needed). If so, all of the crossbars in the cluster are signaled to skip quantizing the corresponding column in the remaining vector bit slice calculations.³ The criteria for establishing whether the precision requirements of IEEE-754 have been met are discussed in the next section.

IV. ACHIEVING HIGH PRECISION

Unlike machine learning applications, scientific workloads require high precision [30]. To enable scientific computing with memristive accelerators, the system must meet these precision requirements through floating-point format support and error correction.

²Throughout the paper, we use xj[i] to refer to the jth coefficient of the ith bit slice of vector x.

³The total number of vector bit slices is still bounded by the number of vector bit slices required by the worst-case column.

Figure 3. Cluster organization: a vector bit slice is applied to all crossbars (1), sample-and-hold (S/H) arrays latch the column outputs, per-crossbar ADCs scan the latched values (2), and a shift-and-add network reduces the ADC outputs (3).

A. Floating-Point Computation on Fixed-Point Hardware

In double-precision floating point, every number is represented by a 53-bit mantissa (including the implied leading 1), an 11-bit exponent, and a sign bit [31]. When performing addition in a floating-point representation, the mantissas of the values must be aligned so that bits of the same magnitude can be added. Consider as an example two base-ten floating-point numbers with two-digit mantissas, on which the computation 1.2 + 0.13 is to be performed. To properly compute the sum, the values must be aligned based on the decimal point: 1.20 + 0.13. In a conventional floating-point unit, this alignment is performed dynamically, based on the exponent field of each value. On a memristive crossbar, however, since the addition occurs within the analog domain, the values must be converted to an aligned fixed-point format before they are programmed into the crossbar. This requires padding the floating-point mantissas with zeros; in the example above, the addition would be represented as 120 + 013 in fixed point, with an exponent of 10^-2.

Although this simple padding approach allows fixed-point hardware to perform floating-point operations, it comes at a significant cost. In double precision, for instance, naïve padding requires 2,100 bits in the worst case to fully account for the exponent range: 2,046 bits for padding, 53 bits for the mantissa itself, and one sign bit. Not only is this requirement prohibitive from a storage perspective, but full fixed-point computation would require multiplying each matrix bit slice by each vector bit slice, for a worst case of 4.4 million crossbar operations. Fortunately, two important properties of floating-point numbers can be exploited to reduce the cost of the fixed-point computation. First, while the IEEE-754 standard provides 11 bits for the exponent in double precision, few applications ever approach the full exponent range. Second, the naïve fixed-point computation described above provides more precision than IEEE-754, as it computes bits well beyond the 53-bit mantissa; these extraneous values are subsequently truncated when the result is converted back to floating point.

Figure 4. Illustrative example of partial product summation: (a) addition of the first four partial products; (b) overlap between the running sum and the mantissa; (c) carries can still modify the running sum; (d) the mantissa bits are settled.

B. Avoiding Extraneous Calculations

Three techniques are proposed to reduce the cost of performing floating-point computation on fixed-point hardware.

Exploiting exponent range locality. The 11-bit exponent field specified by IEEE-754 means that the dynamic range of floating-point values is 2^(2^11 - 2) (two exponent fields are reserved for special values), or approximately 10^616. (For comparison, the size of the state space for the game of Go is only 10^172 [32].) IEEE-754 provides this range so that applications from a wide variety of domains can use floating-point values without additional scaling; however, it is highly unlikely that any model of a physical system will have a dynamic range requirement close to 10^616. In practice, the number of required pad bits is equal to the difference between the minimum and maximum exponents in the matrix, and can be covered by a few hundred bits rather than 2,046. Furthermore, since the fixed-point overhead is required only for numbers that are to be summed in the analog domain, the alignment overhead can be further reduced by exploiting locality. Since the matrix is much larger than a single crossbar, it is necessarily split into blocks (discussed in Section V-B1); only those values within the same block are added in the analog domain and require alignment. The matrices can be expected to exhibit locality because they often come from physical systems, where a large dynamic range between neighboring points is unlikely.

The aforementioned alignment procedure is also applied to the vector to be multiplied by the matrix. The alignment of mantissas encodes the exponent of each vector element relative to a single exponent assigned to the vector as a whole. Notably, multiplication does not require alignment, since it is sufficient to add the relevant exponents of the matrix block, the vector, and the exponent offset implied by the location of the leading 1 in each dot product.
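A sketch of the alignment in C, under the assumptions that all values are non-zero, e_min is the block's minimum exponent, and the block's exponent range fits the padding budget; the 128-bit integer type (a GCC extension) stands in for the up-to-118-bit operand.

    #include <math.h>
    #include <stddef.h>

    typedef __int128 fixed_t;   /* stand-in for the up-to-118-bit operand */

    /* Convert a block's doubles to fixed point at a shared scale of
     * 2^(e_min - 53); pad bits grow with (e - e_min), so the block's
     * exponent range must fit the padding budget. */
    void block_to_fixed(const double *vals, fixed_t *out, size_t n, int e_min)
    {
        for (size_t i = 0; i < n; i++) {
            int e;
            double m = frexp(vals[i], &e);                 /* vals[i] = m * 2^e */
            fixed_t mant = (fixed_t)llround(ldexp(m, 53)); /* signed 53-bit mantissa */
            out[i] = mant * ((fixed_t)1 << (e - e_min));   /* align to shared scale */
        }
    }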

Early termination. In a conventional double-precision floating-point unit (FPU), each arithmetic operation is performed on two operands, after which the resulting mantissa is rounded and truncated to 53 bits. To compute a dot product between two vectors a · x, the FPU first calculates the product of two scalars a1x1, rounds and truncates the result to 53 bits, and adds the truncated product to subsequent products: a · x = a1x1 + a2x2 + ... A memristive MVM accelerator, on the other hand, computes a dot product as an aggregation of partial dot products calculated over multiple bit slices. This necessitates a truncation strategy different from that of a digital FPU.

A naïve approach to truncation on a memristive MVM accelerator would be to perform all of the bit-sliced matrix-vector multiplications, aggregate the partial dot products, and truncate the result. This naïve approach would require a total of 127 x 127 crossbar operations, since every bit slice of a must be multiplied by every bit slice of x.⁴

However, many of these operations would contribute only to the portion of the mantissa beyond the point of truncation, wasting time and energy. By detecting the position of the leading 1 (i.e., the 1 at the most significant bit position in the final result), the computation can be safely terminated as soon as the 52 following bits of the mantissa have settled.

Consider the simplified example depicted in Figure 4, in which a series of partial products with different bit-slice weights are to be added (a). To clarify the exposition, the example assumes that the partial products are six bits wide and that the final sum is to be represented by a four-bit mantissa and an exponent. (The actual crossbars sum 127-bit partial products, encoding the result with a 53-bit mantissa and an exponent.) In the example, the addition of partial products proceeds iteratively from the most significant partial product toward the least significant, accumulating the result in a running sum. The goal is to determine whether this iterative process can be terminated early while still producing the same mantissa as the naïve approach.

Figure 4b shows the running sum after the summation of the four most significant partial products. Since the next partial product still overlaps with the four-bit mantissa, it is still possible for the mantissa to be affected by subsequent additions. At a minimum, the computation must proceed until the mantissa in the running sum has cleared the overlap with the next partial product. This occurs in Figure 4c; however, at that point it is still possible for a carry from the subsequent partial products to propagate into and modify the mantissa. Notably, if the accumulation is halted at any point, the sum of the remaining partial products can produce at most one carry out into the running sum. Hence, in addition to clearing the overlap, safe termination requires that a 0 less significant than the mantissa be generated, such that it will absorb the single potential carry out, preventing any bit flips within the mantissa (Figure 4d).

⁴Recall from Section III that scalars are encoded with up to 127-bit fixed point. All examples in this section assume a 127-bit representation for simplicity, unless otherwise noted.

Figure 5. Regions within a running sum: the aligned region overlaps the next partial dot product; above it, a carry region of consecutive 1s can propagate the single potential carry out, which is absorbed by a barrier bit (0) protecting the stable region.

At any given time, the running sum can be split into four non-overlapping regions, as shown in Figure 5. The aligned region is the portion of the running sum that overlaps with the subsequent partial products. A chain of 1s more significant than the aligned region is called the carry region, as it can propagate the single generated carry out forward. This carry, if present, is absorbed by a barrier bit, protecting the more significant bits in the stable region from bit flips. Thus, the running sum calculation can be terminated without loss of mantissa precision as soon as the full mantissa is contained in the stable region.
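The following C sketch tests this termination condition on a 64-bit toy running sum (the hardware uses up to 127 bits); it assumes GCC's __builtin_clzll, and overlap_msb names the most significant bit position still covered by the remaining partial products.

    #include <stdint.h>
    #include <stdbool.h>

    /* Returns true when the mantissa (the leading 1 plus the following
     * mantissa_bits - 1 bits) lies entirely in the stable region, i.e.,
     * above a barrier 0 that sits past the carry region of 1s. */
    bool can_terminate(uint64_t running_sum, int overlap_msb, int mantissa_bits)
    {
        if (running_sum == 0) return false;
        int lead = 63 - __builtin_clzll(running_sum);   /* leading-1 position */
        int pos = overlap_msb + 1;                      /* just above aligned region */
        while (pos <= lead && ((running_sum >> pos) & 1) == 1)
            pos++;                                      /* skip the carry region */
        if (pos > lead) return false;                   /* no barrier 0 found */
        int barrier = pos;                              /* the 0 absorbing one carry */
        return lead - barrier >= mantissa_bits;         /* mantissa fully stable? */
    }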

Scheduling array activations. Even with the early termination technique discussed above, there is still a possibility for extraneous crossbar operations to be performed. Although each partial dot product takes up to 127 bits, the final mantissa is truncated to 52 bits following the leading 1, requiring fewer than 127 bits to be computed in most cases. Therefore, some subset of the 127-bit partial dot product may be wasted, depending on the schedule according to which vector bit slices are applied to the crossbars.

Figure 6 shows three approaches to scheduling the crossbar activations. In the figure, rows and columns respectively represent the matrix and vector bit slices, sorted by significance. The numbers at the intersection of each row and column represent the significance of the resulting partial product. Different groupings indicate which bit-sliced matrix-vector multiplications are scheduled to occur simultaneously; the black-colored groupings must be performed, whereas the gray-colored groupings may be skipped due to early termination. Hence, the number of black groupings correlates with execution time, and the number of crossbars enclosed by the black groupings correlates with crossbar activation energy. The example shown in the figure assumes that the calculation can be terminated early after the fifth most significant bit of the mantissa is computed. (That is, all groupings computing a partial product with significance of at least 2 must be performed.)

Figure 6. Three policies for scheduling crossbar activations: vertical grouping, diagonal grouping, and hybrid grouping (activations performed shown in black, skipped in gray).

In the leftmost case, all of the crossbars are activated with the same vector bit slice, after which the next significant vector bit slice is applied to the crossbars. This naïve vertical grouping requires 16 crossbar activations performed over four time steps. In the next example, arrays are activated with different vector bit slices, such that each active array contributes to the same slice of the output vector simultaneously. This diagonal grouping results in 13 crossbar activations (the minimum required in the example) over five time steps. In general, this policy takes the minimum number of crossbar activations at the cost of additional latency.

The hybrid grouping shown on the right, which we adopt in the evaluation, strikes a balance between the respective energy and latency advantages of the diagonal and vertical groupings. The result is 14 crossbar activations performed over four time steps. The more closely the hybrid grouping approximates a diagonal grouping, the greater the energy savings, at the cost of latency.

Taken together, the above techniques significantly reduce the overhead of performing floating-point computation on fixed-point hardware by restricting computations to only those that are likely to produce a bit needed for the final mantissa. The techniques render the latency and energy consumption strongly data dependent, effectively forming a data-specific subset of the floating-point format without any loss in precision.

C. Handling Negative Numbers

The previously proposed ISAAC memristive accelerator [8] uses a biasing scheme to handle negative numbers. The approach adds a bias of 2^15 to each 16-bit operand, such that the values range from 0 to 2^16 rather than ±2^15. Results are de-biased by subtracting 2^15 · pv, where pv is the population count (i.e., the number of bits set to 1) of the relevant bit slice of the vector. We adopt this technique with one modification: in place of the constant 2^15, we use a per-block biasing constant based on the actual floating-point exponent range contained within the block.

D. Floating Point Anomalies and Rounding Modes

The IEEE-754 floating-point standard specifies a number of floating-point exceptions and rounding modes that must be handled correctly for proper operation [31].

Rounding modes. When the proposed accelerator is operated in its typical configuration, mantissa alignment and leading-1 detection result in a truncation of the result to the required precision. The potential contributions of any uncalculated bits beyond the least significant bit would only result in a mantissa of greater magnitude than what is ultimately obtained, and since a biasing scheme is used to enable the representation of negative values (Section IV-C), the truncation is equivalent to rounding the result of a dot product toward negative infinity. For applications requiring other rounding modes (e.g., to nearest, toward positive infinity, toward zero), the accelerator can be configured to compute three additional settled bits before truncation and perform the rounding according to those bits.

Infinities and NaNs. Signed infinities and not-a-numbers (NaNs) are valid floating-point values; however, they cannot be mapped to the crossbars in a meaningful way in the proposed accelerator, nor can they be included in any dot product computations without producing a NaN, infinity, or invalid result [31]. Therefore, the accelerator requires all input matrices and vectors to contain no infinities or NaNs, and any infinities or NaNs arising from an intermediate computation in the local processor are handled there, to prevent them from propagating to the resulting dot product.

Floating point exceptions. Operations with floating-point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values. (Only the range of exponents is bounded.) This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating-point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be respectively set to all 1s or all 0s for overflow and underflow conditions. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., 0/0, sqrt(-|c|)) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating-point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates under the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those in prior work introduces new sources of error: if the array size is comparable to the dynamic range of the memristors, the non-zero current in the RHI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512x512, with a memristor dynamic range of 1.5 x 10^3. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating to fixed point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand, with eight bits for correction and one bit for detection. The correction is applied after the reduction but before the leading-one detection.
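The arithmetic of an AN code reduces to multiply, modulo, and divide: a codeword is valid if and only if it is divisible by A, and a non-zero residue flags an error. The sketch below uses GCC's unsigned __int128 as a stand-in for the wide operands; single-error correction via a residue lookup, as in [29], is omitted.

    typedef unsigned __int128 u128;   /* stand-in for up-to-127-bit operands */

    #define A_CODE ((u128)251)

    u128 an_encode(u128 value)    { return value * A_CODE; }          /* codeword = A*v  */
    int  an_check(u128 codeword)  { return codeword % A_CODE == 0; }  /* 0 residue = clean */
    u128 an_decode(u128 codeword) { return codeword / A_CODE; }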

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix M x N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation divided by the latency of that computation, τMVM.

As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar (and consequently the number of sums-of-products resulting from a cluster operation) also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both the block size M x N and the block density dblock. Actual throughput for a given block is therefore (dblock x M x N) / τMVM. Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.

The number of non-zero elements actually mapped to a crossbar is, in practice, far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix: 0.37% for the densest matrix in the evaluation.

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of Σ(i=1..N) 1 x 1 = N. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of ceil(log2[N + 1]) bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], conversion time per column is strictly proportional to ADC resolution, which increases with log2 N, where N is the number of rows, and the number of ADC conversions per binary MVM operation is equal to the number of columns M. Total conversion time is thus proportional to M ceil(log2[N + 1]).

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy: The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to N log₂ N. The column count M determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to M × N log₂ N.

Crossbar energy: Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M×N crossbar are both proportional to M + N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or log₂ N. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells M × N by the average path conductance, which is proportional to 1/(M + N). Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus proportional to (M × N)(M + N) log₂ N.

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by drivers; consequently, the total crossbar area grows as M(M + N).
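The scaling relationships above can be condensed into a simple analytical model. The sketch below is not the paper's calibrated simulator; it merely evaluates the stated proportionalities for the four square crossbar sizes used in the system, so only the ratios between rows of the output are meaningful.

    /* Relative (unitless) scaling model for the crossbar-size tradeoffs
     * described above; constants are arbitrary, only ratios matter. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int sizes[] = { 64, 128, 256, 512 };
        for (int i = 0; i < 4; i++) {
            double M = sizes[i], N = sizes[i];
            double adc_energy  = M * N * log2(N);           /* ~ M*N*log2(N)       */
            double xbar_energy = M * N * (M + N) * log2(N); /* ~ MN(M+N)log2(N)    */
            double adc_area    = N;                          /* ~ N                 */
            double xbar_area   = M * (M + N);                /* driver-dominated    */
            double conv_time   = M * ceil(log2(N + 1));      /* ~ M*ceil(log2(N+1)) */
            printf("%4.0f: ADC E=%10.3g XBAR E=%10.3g conv=%8.3g area=%10.3g\n",
                   M, adc_energy, xbar_energy, conv_time, adc_area + xbar_area);
        }
        return 0;
    }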

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. By contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12×12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates; for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold: if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are completely removed from the mapping and are instead processed separately by the local processor.

Figure 7. Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity; (b) Pres_Poisson blocking; (c) Xenon1 sparsity; (d) Xenon1 blocking.

Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4×NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8×NNZ.
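A sketch of this greedy pass is shown below. The acceptance thresholds in kMinNnz and the four helper routines are placeholders for system-specific logic (the paper does not publish its thresholds); only the control structure, one sweep per block size from largest to smallest, follows the description above.

    /* Sketch of the greedy block-mapping pass (Section V-B1). A
     * candidate is accepted when its non-zero count clears a
     * size-dependent threshold and its exponents span no more than the
     * maximum range. Threshold values here are assumed, not published. */
    #define N_SIZES 4
    static const int kSizes[N_SIZES]  = { 512, 256, 128, 64 };
    static const int kMinNnz[N_SIZES] = { 8192, 3072, 1024, 320 }; /* assumed */

    /* Provided by the surrounding system (hypothetical helpers). */
    int  count_unmapped_nnz(int row, int col, int dim);
    int  exponent_range(int row, int col, int dim);
    void trim_out_of_range(int row, int col, int dim);  /* drop outliers    */
    void accept_block(int row, int col, int dim);       /* map to a cluster */

    void map_matrix(int n_rows, int n_cols) {
        for (int s = 0; s < N_SIZES; s++) {             /* largest size first */
            int dim = kSizes[s];
            for (int r = 0; r < n_rows; r += dim)
                for (int c = 0; c < n_cols; c += dim) {
                    if (exponent_range(r, c, dim) > 117)
                        trim_out_of_range(r, c, dim);
                    if (count_unmapped_nnz(r, c, dim) >= kMinNnz[s])
                        accept_block(r, c, dim);
                    /* rejected candidates are revisited at smaller sizes;
                     * leftovers go to the local processor (CSR format) */
                }
        }
    }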

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding: Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows N; to handle this extreme case, the ADC would require ⌈log₂(N+1)⌉ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log₂ N is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to log₂(N)−1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the RLO state), which in turn reduces crossbar energy.

ADC headstart: The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column: matrix rows tend to produce at most ⌈log₂(d_row × 0.5 × N)⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position and reduces the time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

To support Krylov subspace solvers, we implement the three primary kernels needed by the solver.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.
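A sketch of the driver side of this sequence is shown below. The tuple layout follows the description above, but the buffer pointers, the start-register encoding, and the function name are illustrative assumptions; the real memory map is not specified at this level of detail in the paper.

    /* Sketch of the per-cluster vector map: one tuple per cluster,
     * ordered by cluster size so the slowest clusters start first.
     * Addresses and the start-register write are illustrative. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        volatile double  *vec_buffer;   /* memory-mapped input vector buffer */
        size_t            vec_index;    /* index of first vector element     */
        size_t            size;         /* cluster dimension (e.g., 512)     */
        volatile uint32_t *status_reg;  /* cluster status/start register     */
    } cluster_map_entry;

    /* Load each cluster's contiguous vector slice and start the MVM. */
    void start_sparse_mvm(const cluster_map_entry *map, size_t n_clusters,
                          const double *x) {
        for (size_t c = 0; c < n_clusters; c++) {   /* sorted large-to-small */
            for (size_t i = 0; i < map[c].size; i++)
                map[c].vec_buffer[i] = x[map[c].vec_index + i];
            *map[c].status_reg = 1;                 /* start signal */
        }
    }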

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.
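The three kernels compose directly into a standard (unpreconditioned) CG iteration. The host-side sketch below shows that composition; accelerator_spmv is a hypothetical stand-in for the cluster start/interrupt sequence described above, and the banking, synchronization, and non-blocked-element handling are omitted.

    /* Host-side sketch of CG built from the three kernels (Section VI-A).
     * accelerator_spmv() stands in for the cluster sequence; dot() and
     * axpy() run on the local processors as described above. */
    #include <math.h>
    #include <stddef.h>

    void accelerator_spmv(const double *x, double *y);  /* y = A*x (hypothetical) */

    static double dot(const double *a, const double *b, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    static void axpy(double a, const double *x, double *y, size_t n) {
        for (size_t i = 0; i < n; i++) y[i] = a * x[i] + y[i];
    }

    /* Solve Ax = b to tolerance eps; r, p, Ap are caller-provided scratch. */
    void cg(double *x, const double *b, double *r, double *p, double *Ap,
            size_t n, double eps, int max_iters) {
        accelerator_spmv(x, Ap);                      /* r = b - A*x */
        for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
        double rr = dot(r, r, n);
        for (int k = 0; k < max_iters && sqrt(rr) >= eps; k++) {
            accelerator_spmv(p, Ap);
            double alpha = rr / dot(p, Ap, n);
            axpy(alpha, p, x, n);                     /* x += alpha * p   */
            axpy(-alpha, Ap, r, n);                   /* r -= alpha * A*p */
            double rr_new = dot(r, r, n);
            double beta = rr_new / rr;                /* new search direction */
            for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
    }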

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node, using process parameters and design rules from FreePDK15 [38] coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle, rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I
ACCELERATOR CONFIGURATION

System:   (128) banks; double-precision floating point; fclk = 1.2 GHz; 15nm process; VDD = 0.80 V
Bank:     (2) 512×512 clusters; (4) 256×256 clusters; (6) 128×128 clusters; (8) 64×64 clusters; 1 LEON3 core
Cluster:  127 bit-slice crossbars
Crossbar: N×N cells; (log₂(N)−1)-bit pipelined SAR ADC; 2N drivers
Cell:     TaOx [18], [40]; Ron = 2 kΩ; Roff = 3 MΩ; Vread = 0.2 V; Vset = −2.6 V; Vreset = 2.6 V; Ewrite = 3.91 nJ; Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits based on [43], sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point. 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. Area, power, and the internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder either scaling linearly or remaining constant; 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU, for synthesis we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGEN [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix            NNZs      Rows     NNZ/Row  Blocked (%)
2cubes_sphere     1647264   101492   16.2     49.7
crystm03          583770    24696    23.6     94.7
finan512          596992    74752    7.9      46.7
G2_circuit        726674    150102   4.5      60.9
nasasrb           2677324   54870    49.8     99.1
Pres_Poisson      715804    14822    48.3     96.4
qa8fm             1660579   66127    25.1     92.8
ship_001          3896496   34920    111.6    66.4
thermomech_TC     711558    102158   6.8      0.8
Trefethen_20000   554466    20000    27.7     63.3
ASIC_100K         940621    99340    9.5      60.9
bcircuit          375558    68902    5.4      64.9
epb3              463625    84617    5.5      72.2
GaAsH6            3381809   61349    55.12    69.2
ns3Da             1679599   20414    82.3     3.2
Si34H36           5156379   97569    52.8     53.7
torso2            1033473   115697   8.9      98.1
venkat25          1717792   62424    27.5     79.8
wang3             177168    26064    6.8      64.6
xenon1            1181120   48600    24.3     81.0

The LEON3 core, FMA unit, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49] with 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator, modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline.

VIII. EVALUATION

We evaluate the system's performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× over the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, while the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row yet show above-average but not exceptional speedup, due to their lower blocking efficiencies (i.e., the percentage of the non-zeros that are blocked) of 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, Table II lists two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node, and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case cost of four MVM operations (Section V-B1), and since the actual cost reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

Figure 9. Accelerator energy consumption, normalized to the GPU baseline.

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson, yet Pres_Poisson shows nearly twice the improvement in energy dissipation; this is due to the much narrower exponent range of Pres_Poisson. Pres_Poisson never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and, in general, its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry, rather than the ADCs, are the dominant area consumers, accounting for 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2), the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) wider operands propagate through the reduction network.

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size  Area [mm²]  Energy [pJ]  Latency [ns]
64    0.00078     2.80         5.33
128   0.00103     6.52         10.7
256   0.00162     15.0         21.3
512   0.00352     34.2         42.7

Figure 10. Overhead of preprocessing and write time as a percentage of total solve time on the accelerator.

The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set, and generally falling as the size of the linear system increases; the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even lower in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative 10⁹ writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

Figure 11. Sparsity and blocking patterns of ns3Da: (a) sparsity; (b) blocking.

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with particularly poor blocking efficiency. The figure shows that despite the high relative density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range: We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation.

Figure 12. Iteration count as a function of bits per cell (B) and dynamic range (D), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

This, in turn, introduces error into the computation, which tends to hinder convergence rates.

Programming precision: We re-evaluate the convergence behavior with and without cell programming errors to determine how results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the dependence becomes stronger as the number of bits per cell is increased: higher programming error with the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work on memristive systems has been restricted to low-precision, error-prone, fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here exploit insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Advances in High Energy Physics, vol. 2014, 2014.
[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.
[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.
[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.
[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177-182, May 2014.
[6] P. Messina, "The U.S. DOE exascale computing project: goals and challenges," February 2017. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf
[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.
[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.
[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.
[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.
[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.
[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.
[13] J. Hauser, "SoftFloat," 2002. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html
[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1-1:25, Dec. 2011.
[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.
[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.
[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.
[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.
[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.
[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803-820, 2003.
[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.
[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631-644, 1992.
[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856-869, July 1986.
[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.
[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.
[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.
[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.
[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.
[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural network accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.
[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54-61, 2005.
[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1-70, 2008.
[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.
[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.
[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474-475.
[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049-3058, 2013.
[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736-1748, 2011.
[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49-58, 1995.
[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.
[39] "Predictive technology model (PTM)." [Online]. Available: http://ptm.asu.edu
[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.
[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108-111, 2000.
[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.
[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25-27, 2015.
[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985-1996, 2004.
[45] "LEON3/GRLIB." [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib
[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.
[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.
[48] Synopsys, "Synopsys Design Compiler User Guide." [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages
[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1-14:25, June 2017.
[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230-239, Jan. 2016.
[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.
[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.
[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Kohler, "Efficiency of general Krylov methods on GPUs: an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.
[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.
[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1-4.
[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.
[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."
[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.


…such as LU or Cholesky, as implemented in the High Performance LINPACK benchmark [20]. These direct methods can result in significant fill-in, where zero entries become non-zeroes; this increases the memory footprint and reduces the benefits of storing the matrix in a sparse format. Iterative methods, by contrast, do not involve modifying the underlying matrix. This results in significantly lower memory requirements, enabling the full sparse matrix to fit in memory, whereas a version with significant fill-in might not.

Iterative methods are subdivided into stationary and Krylov subspace methods; we focus on the latter, since they are the primary methods used today [19]. Krylov subspace methods iteratively improve an estimated solution x to the linear system Ax = b. These methods are typically implemented with a stopping tolerance ε, and terminate when ‖b − Ax‖ < ε. There are many Krylov subspace solvers with different behaviors depending on the matrix structure, including conjugate gradient (CG) for symmetric positive definite (SPD) matrices [21], as well as BiConjugate Gradient (BiCG), Stabilized BiCG (BiCG-STAB) [22], and Generalized Minimal Residual (GMRES) [23] for non-SPD matrices.

Due to the importance of PDEs and the wide applicability of sparse solvers, a number of accelerator systems leveraging FPGAs and ASICs have been proposed. Kung et al. [24] propose an SRAM-based processing-in-memory accelerator for simulating systems of differential equations. The accelerator uses 32-bit fixed point and does not meet the high-precision floating-point requirements of mainstream scientific applications (a key focus of our work); additionally, it accelerates the cellular nonlinear network model, which differs significantly from mainstream methods for solving PDEs. By contrast, we target widely accepted iterative solvers such as CG. Zhu et al. [25] discuss a graph processing accelerator that leverages content addressable memories (CAMs). This accelerator may be applicable to sparse MVM; however, the proposed hardware heavily depends on the sparsity of the input vector (less than 1% non-zeros) to limit CAM size. Such vector sparsity is uncommon in iterative solvers: an analysis of the linear systems that we evaluate shows vector densities ranging from 30% to 100%, which would result in prohibitive CAM overheads. Kestur et al. [26] and Fowers et al. [27] propose FPGA-based accelerators that target sparse MVM for iterative solvers; however, the proposed approaches do not outperform a GPU. Dorrance et al. [28] propose an FPGA accelerator that achieves an average speedup of 1.28× over a GTX TITAN GPU on single-precision sparse MVM; however, it is not clear how the FPGA would scale to double precision. By contrast, the proposed memristive accelerator achieves an average speedup of 10.3× over a Tesla P100 GPU on double-precision sparse MVM.

Figure 2. System overview. (Crossbar results in each cluster pass through de-bias, ECU, and FPCU stages, progressing from aligned fixed point through signed and corrected fixed point to an intermediate floating-point format and, finally, IEEE-754; unblocked matrix elements are handled by the per-bank processor.)

III. SYSTEM ORGANIZATION

The accelerator is organized in a hierarchy of banks and clusters, as shown in Figure 2. Each bank comprises a heterogeneous set of clusters with different crossbar sizes, as well as a local processor that orchestrates computation and performs the operations not handled by the crossbars. Within a cluster, a group of crossbars performs sparse MVM on a fixed-size matrix block.

A. Banks

Unlike prior work on memristive accelerators [7], [8], the clusters within each bank are of different sizes. This allows the system to capture dense patterns within sparse matrices, as discussed in Section V. With sparse matrices, some elements are ill-suited to the memristive crossbars, either because the elements do not block efficiently or because the computation requires more bits than are provided by a cluster. In either case, the value is not mapped to a cluster and is instead handled by the local processor. Finally, cross-bank communication (for operations that require resources from multiple banks) is implemented through reads and writes to global memory.

To interface with the local processor, every cluster contains two memory-mapped SRAM buffers: an incoming vector buffer and a result buffer. The incoming vector buffer holds the vector to be multiplied by the matrix in floating-point format and extracts the bit slices from the vector as needed. The partial result buffer holds the running sum of the partial dot products in an intermediate floating-point format. Once the computation on a block completes, each scalar in the partial result buffer is converted to a final IEEE-754 representation, which is read by the local processor.

B. Clusters

A cluster comprises a set of crossbars, similar to the mats in the Memristive Boltzmann Machine [7] or the in-situ multiply accumulate (IMA) units in ISAAC [8], that perform MVM and reduction across all of the bit slices within a single matrix block. Unlike prior work, clusters are designed for IEEE-754 compatible double-precision floating-point computation. Each cluster contains 127 crossbars with single-bit cells that are organized in a shift-and-add reduction tree. The double-precision floating-point coefficients of a matrix block are converted to 118-bit fixed-point operands, comprising a 53-bit mantissa, one sign bit, and up to 64 bits of padding for mantissa alignment. This 118-bit value is then encoded with a nine-bit error-correcting AN code (as proposed in [29]) for a full operand width of up to 127 bits.

Figure 3 shows the structure of the crossbars in a cluster and how computation proceeds. Initially, a vector bit slice x[i] is applied to all of the crossbars within a cluster, and the resulting dot product for each crossbar column is latched by a sample-and-hold circuit (1).² A set of ADCs (one per crossbar) then starts scanning the values latched by the sample-and-hold units (2). The outputs of the ADCs are propagated through a shift-and-add reduction network (3), producing a single fixed-point value that corresponds to the dot product between a row of the matrix A and a single bit slice of the vector x. The system is pipelined such that every cycle, a single fixed-point value is produced by the reduction network; after every column has been quantized, a new vector bit slice is applied to all of the crossbars.

Once the fixed-point result has been computed, three additional operations are performed to convert it to an intermediate floating-point representation (Figure 2). First, the bias is removed, converting the value from unsigned to signed fixed point. Second, error correction is applied. Third, the running sum for the relevant vector element is loaded from the partial result buffer, the incoming partial dot product is added to the running sum, and the result is written back. After the running sum is read from the partial result buffer, it is inspected to determine whether the precision requirements of IEEE-754 have been met (in which case no further vector bit slice computations are needed). If so, all of the crossbars in the cluster are signaled to skip quantizing the corresponding column in the remaining vector bit slice calculations.³ The criteria for establishing whether the precision requirements of IEEE-754 have been met are discussed in the next section.

IV. ACHIEVING HIGH PRECISION

Unlike machine learning applications, scientific workloads require high precision [30]. To enable scientific computing with memristive accelerators, the system must meet these precision requirements through floating-point format support and error correction.

²Throughout the paper, we use x_j[i] to refer to the jth coefficient of the ith bit slice of vector x.

³The total number of vector bit slices is still bounded by the number of vector bit slices required by the worst-case column.

Figure 3. Cluster organization (sample-and-hold arrays and per-crossbar ADCs; steps 1-3 are referenced in the text).

A. Floating-Point Computation on Fixed-Point Hardware

In double-precision floating point, every number is represented by a 53-bit mantissa (including the implied leading 1), an 11-bit exponent, and a sign bit [31]. When performing addition in a floating-point representation, the mantissas of the values must be aligned so that bits of the same magnitude can be added. Consider as an example two base-ten floating-point numbers with two-digit mantissas, on which the computation 1.2 + 0.13 is to be performed. To properly compute the sum, the values must be aligned based on the decimal point: 1.20 + 0.13. In a conventional floating-point unit, this alignment is performed dynamically, based on the exponent field of each value. On a memristive crossbar, however, since the addition occurs within the analog domain, the values must be converted to an aligned fixed-point format before they are programmed into the crossbar. This requires padding the floating-point mantissas with zeros; in the example above, the addition would be represented as 120 + 013 in fixed point, with a shared scale of 10⁻².

Although this simple padding approach allows fixed-point hardware to perform floating-point operations, it comes at a significant cost. In double precision, for instance, naive padding requires 2100 bits in the worst case to fully account for the exponent range: 2046 bits for padding, 53 bits for the mantissa itself, and one sign bit. Not only is this requirement prohibitive from a storage perspective, but full fixed-point computation would require multiplying each matrix bit slice by each vector bit slice, for a worst case of 4.4 million crossbar operations. Fortunately, two important properties of floating-point numbers can be exploited to reduce the cost of the fixed-point computation. First, while the IEEE-754 standard provides 11 bits for the exponent in double precision, few applications ever approach the full exponent range. Second, the naive fixed-point computation described above provides more precision than IEEE-754, as it computes bits well beyond the 53-bit mantissa; these extraneous values are subsequently truncated when the result is converted back to floating point.
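The alignment step itself is straightforward. The sketch below converts a small block of doubles into fixed-point words aligned to the block's maximum exponent; a 64-bit word stands in for the (up to) 118-bit operands, and the sign handling, AN encoding, and bit slicing are omitted.

    /* Sketch of the mantissa-alignment step: each double in a block is
     * converted to a fixed-point word aligned to the block's maximum
     * exponent, so that analog summation adds bits of equal weight.
     * A 64-bit word stands in for the (up to) 118-bit operands. */
    #include <limits.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK 3
    #define MANT_BITS 53

    int main(void) {
        double block[BLOCK] = { 1.2, 0.13, -0.5 };

        /* Exponent range of the block determines the padding (Sec. IV-B). */
        int emax = INT_MIN, emin = INT_MAX, e;
        for (int i = 0; i < BLOCK; i++) {
            frexp(block[i], &e);
            if (e > emax) emax = e;
            if (e < emin) emin = e;
        }
        printf("pad bits needed: %d\n", emax - emin);

        /* Align each mantissa to emax: value ~= m * 2^(emax - MANT_BITS). */
        for (int i = 0; i < BLOCK; i++) {
            int64_t m = (int64_t)llround(ldexp(block[i], MANT_BITS - emax));
            printf("%g -> %lld (x 2^%d)\n", block[i], (long long)m,
                   emax - MANT_BITS);
        }
        return 0;
    }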

Figure 4. Illustrative example of partial product summation: (a) addition of the first four partial products; (b) overlap between the running sum and the mantissa; (c) carries can still modify the running sum; (d) the mantissa bits are settled.

B. Avoiding Extraneous Calculations

Three techniques are proposed to reduce the cost of performing floating-point computation on fixed-point hardware.

Exploiting exponent range locality: The 11-bit exponent field specified by IEEE-754 means that the dynamic range of floating-point values is 2^(2^11 − 2) (two exponent fields are reserved for special values), or approximately 10^616. (For comparison, the size of the state space for the game of Go is only 10^172 [32].) IEEE-754 provides this range so that applications from a wide variety of domains can use floating-point values without additional scaling; however, it is highly unlikely that any model of a physical system will have a dynamic range requirement close to 10^616. In practice, the number of required pad bits is equal to the difference between the minimum and maximum exponents in the matrix and can be covered by a few hundred bits, rather than 2046. Furthermore, since the fixed-point overhead is required only for numbers that are to be summed in the analog domain, the alignment overhead can be further reduced by exploiting locality. Since the matrix is much larger than a single crossbar, it is necessarily split into blocks (discussed in Section V-B1); only those values within the same block are added in the analog domain and require alignment. The matrices can be expected to exhibit locality because they often come from physical systems, where a large dynamic range between neighboring points is unlikely.

The aforementioned alignment procedure is also applied to the vector to be multiplied by the matrix. The alignment of mantissas encodes the exponent of each vector element relative to a single exponent assigned to the vector as a whole. Notably, multiplication does not require alignment, since it is sufficient to add the relevant exponents of the matrix block, the vector, and the exponent offset implied by the location of the leading 1 in each dot product.

Early termination: In a conventional double-precision floating-point unit (FPU), each arithmetic operation is performed on two operands, after which the resulting mantissa is rounded and truncated to 53 bits. To compute a dot product between two vectors, a · x, the FPU first calculates the product of two scalars, a₁x₁, rounds and truncates the result to 53 bits, and adds the truncated product to subsequent products: a · x = a₁x₁ + a₂x₂ + ···. A memristive MVM accelerator, on the other hand, computes a dot product as an aggregation of partial dot products calculated over multiple bit slices. This necessitates a truncation strategy different from that of a digital FPU.

A naive approach to truncation on a memristive MVM accelerator would be to perform all of the bit-sliced matrix-vector multiplications, aggregate the partial dot products, and truncate the result. This naive approach would require a total of 127 × 127 crossbar operations, since every bit slice of a must be multiplied by every bit slice of x.⁴ However, many of these operations would contribute only to the portion of the mantissa beyond the point of truncation, wasting time and energy. By detecting the position of the leading 1 (i.e., the 1 at the most significant bit position in the final result), the computation can be safely terminated as soon as the 52 following bits of the mantissa have settled.

Consider the simplified example depicted in Figure 4, in which a series of partial products with different bit-slice weights are to be added (a). To clarify the exposition, the example assumes that the partial products are six bits wide and that the final sum is to be represented by a four-bit mantissa and an exponent. (The actual crossbars sum 127-bit partial products, encoding the result with a 53-bit mantissa and an exponent.) In the example, the addition of partial products proceeds iteratively from the most significant partial product toward the least significant, accumulating the result in a running sum. The goal is to determine whether this iterative process can be terminated early while still producing the same mantissa as the naive approach.

Figure 4b shows the running sum after the summation of the four most significant partial products.

⁴Recall from Section III that scalars are encoded with up to 127-bit fixed point. All examples in this section assume a 127-bit representation for simplicity, unless otherwise noted.

Figure 5. Regions within a running sum: the aligned region (overlapping the next partial dot product), the carry region (a run of 1s), the barrier bit (a 0 that absorbs the potential carry out), and the stable region.

Since the next partial product still overlaps with the four-bit mantissa, it is still possible for the mantissa to be affected by subsequent additions. At a minimum, the computation must proceed until the mantissa in the running sum has cleared the overlap with the next partial product. This occurs in Figure 4c; however, at that point it is still possible for a carry from the subsequent partial products to propagate into and modify the mantissa. Notably, if the accumulation is halted at any point, the sum of the remaining partial products would produce at most one carry out into the running sum. Hence, in addition to clearing the overlap, safe termination requires that a 0 less significant than the mantissa be generated, such that it will absorb the single potential carry out, preventing any bit flips within the mantissa (Figure 4d).

At any given time, the running sum can be split into four non-overlapping regions, as shown in Figure 5. The aligned region is the portion of the running sum that overlaps with the subsequent partial products. A chain of 1s more significant than the aligned region is called the carry region, as it can propagate the single generated carry out forward. This carry, if present, is absorbed by a barrier bit, protecting the more significant bits in the stable region from bit flips. Thus, the running sum calculation can be terminated without loss of mantissa precision as soon as the full mantissa is contained in the stable region.

Scheduling array activations: Even with the early termination technique discussed above, there is still a possibility that extraneous crossbar operations will be performed. Although each partial dot product takes up to 127 bits, the final mantissa is truncated to the 52 bits following the leading 1, requiring fewer than 127 bits to be computed in most cases. Therefore, some subset of the 127-bit partial dot product may be wasted, depending on the schedule according to which vector bit slices are applied to the crossbars.

Figure 6 shows three approaches to scheduling the crossbar activations. In the figure, rows and columns respectively represent the matrix and vector bit slices, sorted by significance. The numbers at the intersection of each row and column represent the significance of the resulting partial product. Different groupings indicate which bit-sliced matrix-vector multiplications are scheduled to occur simultaneously; the black colored groupings must be performed, whereas the gray colored groupings may be skipped due to early termination.

Figure 6. Three policies for scheduling crossbar activations: vertical grouping, diagonal grouping, and hybrid grouping.

Hence, the number of black groupings correlates with execution time, and the number of crossbars enclosed by the black groupings correlates with crossbar activation energy. The example shown in the figure assumes that the calculation can be terminated early after the fifth most significant bit of the mantissa is computed. (That is, all groupings computing a partial product with significance of at least 2 must be performed.)

In the leftmost case, all of the crossbars are activated with the same vector bit slice, after which the next significant vector bit slice is applied to the crossbars. This naive vertical grouping requires 16 crossbar activations performed over four time steps. In the next example, arrays are activated with different vector bit slices, such that each active array contributes to the same slice of the output vector simultaneously. This diagonal grouping results in 13 crossbar activations (the minimum required in the example) over five time steps. In general, this policy takes the minimum number of crossbar activations at the cost of additional latency.
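The diagonal schedule can be enumerated directly: each time step activates all (matrix slice, vector slice) pairs whose significances sum to the same value, highest first, stopping at the early-termination cutoff. The sketch below reproduces the figure's 4×4 example with a cutoff of 2 and prints 13 activations over five steps; the slice counts and cutoff are illustrative, not the 127-bit hardware parameters.

    /* Sketch of diagonal grouping: bit-sliced multiplies are scheduled
     * so that all crossbars active in a step contribute to the same
     * output significance (i + j), highest first; steps below the
     * early-termination cutoff are skipped entirely. */
    #include <stdio.h>

    #define M_SLICES 4   /* matrix bit slices */
    #define V_SLICES 4   /* vector bit slices */

    int main(void) {
        int cutoff = 2;  /* partial products below this are skipped */
        int activations = 0, steps = 0;
        /* significance runs from (M-1)+(V-1) down to the cutoff */
        for (int s = M_SLICES + V_SLICES - 2; s >= cutoff; s--, steps++) {
            printf("step %d (significance %d):", steps, s);
            for (int i = 0; i < M_SLICES; i++) {
                int j = s - i;               /* vector slice on the diagonal */
                if (j >= 0 && j < V_SLICES) {
                    printf(" (m%d,v%d)", i, j);
                    activations++;
                }
            }
            printf("\n");
        }
        printf("%d activations over %d steps\n", activations, steps);
        return 0;
    }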

The hybrid grouping shown on the right, which we adopt in the evaluation, strikes a balance between the respective energy and latency advantages of the diagonal and vertical groupings. The result is 14 crossbar activations performed over four time steps. The more closely the hybrid grouping approximates a diagonal grouping, the greater the energy savings, at the cost of latency.

Taken together, the above techniques significantly reduce the overhead of performing floating-point computation on fixed-point hardware by restricting computations to only those that are likely to produce a bit needed for the final mantissa. The techniques render the latency and energy consumption strongly data dependent, effectively forming a data-specific subset of the floating-point format without any loss in precision.

C. Handling Negative Numbers

The previously proposed ISAAC memristive accelerator [8] uses a biasing scheme to handle negative numbers. The approach adds a bias of 2^15 to each 16-bit operand, such that the values range from 0 to 2^16 rather than ±2^15. Results are de-biased by subtracting 2^15 × pv, where pv is the population count (i.e., the number of bits set to 1) of the relevant bit slice of the vector. We adopt this technique with one modification: in place of the constant 2^15, we use a per-block biasing constant based on the actual floating-point exponent range contained within the block.
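
As a sketch of the de-biasing arithmetic (names are ours; the vector bit slice is modeled as an integer bitmask), the bias is subtracted once for every set bit in the applied slice:

def debias(partial_dot, vector_bit_slice, bias):
    # Each biased operand contributes one extra 'bias' per set vector bit,
    # so the correction is bias times the slice's population count.
    pv = bin(vector_bit_slice).count("1")
    return partial_dot - bias * pv

With ISAAC's scheme, bias = 2**15; the proposed design instead passes a per-block constant derived from the block's exponent range.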

D. Floating-Point Anomalies and Rounding Modes

The IEEE-754 floating-point standard specifies a number of floating-point exceptions and rounding modes that must be handled correctly for proper operation [31].

Rounding modes. When the proposed accelerator is operated in its typical configuration, mantissa alignment and leading-1 detection result in a truncation of the result to the required precision. The potential contributions of any uncalculated bits beyond the least significant bit would only result in a mantissa of greater magnitude than what is ultimately obtained, and since a biasing scheme is used to enable the representation of negative values (Section IV-C), the truncation is equivalent to rounding the result of a dot product toward negative infinity. For applications requiring other rounding modes (e.g., to nearest, toward positive infinity, toward zero), the accelerator can be configured to compute three additional settled bits before truncation and to perform the rounding according to those bits.

Infinities and NaNs. Signed infinities and not-a-numbers (NaNs) are valid floating-point values; however, they cannot be mapped to the crossbars in a meaningful way in the proposed accelerator, nor can they be included in any dot product computations without producing a NaN, infinity, or invalid result [31]. Therefore, the accelerator requires all input matrices and vectors to contain no infinities or NaNs, and any infinities or NaNs arising from an intermediate computation in the local processor are handled there, to prevent them from propagating to the resulting dot product.

Floating-point exceptions. Operations with floating-point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values (only the range of exponents is bounded). This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating-point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be set to all 1s or all 0s for overflow and underflow conditions, respectively. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., 0/0 or √−|c|) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating-point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates under the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those in prior work introduces new sources of error: if the array size is comparable to the dynamic range of the memristors, the non-zero current in the RHI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512×512, with a memristor dynamic range of 1.5×10^3. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating to fixed point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit for detection. The correction is applied after the reduction but before the leading-one detection.
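
AN codes multiply each operand by a constant A, so that every valid value is an exact multiple of A; a nonzero residue signals an error and serves as the correction syndrome. A minimal sketch with the A = 251 code (the syndrome-to-bit-flip correction table is omitted):

A = 251                       # AN-code constant used in the paper

def an_encode(n):
    # Codewords are exact multiples of A.
    return n * A

def an_decode(c):
    if c % A == 0:
        return c // A         # error-free: recover the operand
    # A nonzero residue identifies the erroneous bit via a syndrome table.
    raise ValueError(f"correctable error, syndrome {c % A}")

Because the code is preserved under addition (A·a + A·b = A·(a + b)), checking can be deferred until after the reduction, as described above.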

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix M × N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation divided by the latency of that computation, τ_MVM.

As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar (and consequently the number of sums-of-products resulting from a cluster operation) also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both the block size M × N and the block density d_block. The actual throughput for a given block is therefore (d_block × M × N) / τ_MVM. Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.

The number of non-zero elements actually mapped to a crossbar is, in practice, far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix (0.37% for the densest matrix in the evaluation).

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of Σ_{i=1}^{N} 1×1 = N. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of ⌈log2(N + 1)⌉ bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], conversion time per column is strictly proportional to ADC resolution, which increases with log2 N, where N is the number of rows; the number of ADC conversions per binary MVM operation is equal to the number of columns M. Total conversion time is thus proportional to M⌈log2(N + 1)⌉.

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy. The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to N log2 N. The column count M determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to M × N log2 N.

Crossbar energy. Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within less than LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M×N crossbar are both proportional to M + N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or log2 N. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells M × N by the average path conductance, which is proportional to 1/(M + N). Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus proportional to (M × N)(M + N) log2 N.

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by the drivers; consequently, the total crossbar area grows as M(M + N).
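
In summary, the first-order scaling relations derived above for an M × N crossbar can be restated as follows (a LaTeX restatement; constants and technology factors omitted):

\begin{align*}
T_{\mathrm{conv}} &\propto M\,\lceil \log_2(N+1) \rceil, &
E_{\mathrm{ADC}} &\propto M N \log_2 N,\\
E_{\mathrm{xbar}} &\propto (M N)(M+N)\log_2 N, &
A_{\mathrm{ADC}} &\propto N, \qquad A_{\mathrm{xbar}} \propto M(M+N).
\end{align*}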

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. In contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12 × 12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates, and for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold; if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are removed from the mapping entirely and are instead processed separately by the local processor, as sketched below.
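
A minimal sketch of this pass, assuming aligned square tiles as block candidates (the paper does not specify the candidate enumeration) and an illustrative trimming policy:

import math

BLOCK_SIZES = [512, 256, 128, 64]
MAX_EXP_RANGE = 117               # maximum exponent spread per block

def exponent(v):
    return math.frexp(v)[1]       # power-of-two exponent of a float

def map_matrix(entries, nnz_threshold):
    """entries: set of (row, col, value); nnz_threshold: dict keyed by size.

    Returns (blocks, leftovers); leftovers are handled by the local
    processor in compressed sparse row format.
    """
    unmapped = set(entries)
    blocks = []
    for size in BLOCK_SIZES:
        tiles = {}
        for (r, c, v) in unmapped:            # group non-zeros into tiles
            tiles.setdefault((r // size, c // size), []).append((r, c, v))
        for tile in tiles.values():
            tile.sort(key=lambda e: exponent(e[2]))
            # Trim elements until the exponent range fits the limit.
            while tile and exponent(tile[-1][2]) - exponent(tile[0][2]) > MAX_EXP_RANGE:
                tile.pop()
            if len(tile) >= nnz_threshold[size]:
                blocks.append((size, tile))   # accept the candidate
                unmapped -= set(tile)
    return blocks, unmapped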

Figure 7. Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity; (b) Pres_Poisson blocking; (c) Xenon1 sparsity; (d) Xenon1 blocking.

Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4 × NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8 × NNZ.

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding. Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows N. To handle this extreme case, the ADC would require ⌈log2(N + 1)⌉ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log2 N is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to log2(N) − 1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the RLO state), which in turn reduces crossbar energy.
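
A sketch of the invert-and-compensate step for a single crossbar column (names are ours; the analog summation is modeled by a digital sum over bit lists):

def program_column(bits):
    # Invert the column if storing it directly would exceed 50% density,
    # guaranteeing at most N/2 ones per stored column.
    invert = sum(bits) > len(bits) // 2
    stored = [1 - b for b in bits] if invert else bits
    return stored, invert

def column_dot(stored, invert, x_slice):
    measured = sum(s & x for s, x in zip(stored, x_slice))
    if invert:
        # dot(1 - b, x) = popcount(x) - dot(b, x), so the non-inverted
        # result is recovered from the slice's population count.
        return sum(x_slice) - measured
    return measured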

ADC headstart. The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most ⌈log2(d_row × 0.5 × N)⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position, and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

To implement Krylov subspace solvers, we implement the three primary kernels needed by the solver.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.
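
The start-up sequence can be sketched as follows (the entry layout and parameters are illustrative; load_vector and start_cluster stand in for the memory-mapped buffer fill and status-register write):

from dataclasses import dataclass

@dataclass
class VectorMapEntry:          # one three-element vector-map tuple
    buffer_addr: int           # base address of the cluster's input buffer
    vec_index: int             # first solution-vector element it needs
    size: int                  # cluster size (contiguous element count)

def start_sparse_mvm(vector_map, x, load_vector, start_cluster):
    # Larger clusters tend to have higher latency, so start them first.
    for e in sorted(vector_map, key=lambda e: e.size, reverse=True):
        load_vector(e.buffer_addr, x[e.vec_index : e.vec_index + e.size])
        start_cluster(e)       # write the start signal to the status register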

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.
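
A sketch of the per-bank protocol (shared is the shared-memory array of per-bank partials; barrier is the inter-bank synchronization primitive; all names are ours):

def bank_dot(x, y, lo, hi, bank_id, shared, barrier):
    # Each bank multiplies only the vector elements it owns ...
    shared[bank_id] = sum(x[i] * y[i] for i in range(lo, hi))
    barrier.wait()
    # ... then every bank reduces the published partials on its own,
    # avoiding a second round of synchronization.
    return sum(shared)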

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.
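
A corresponding sketch, again with an explicit bank-owned index range:

def axpy(a, x, y, lo, hi, barrier):
    for i in range(lo, hi):    # purely local: load, modify, write back
        y[i] = a * x[i] + y[i]
    barrier.wait()             # detect completion across all banks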

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node, using process parameters and design rules from FreePDK15 [38] coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I
ACCELERATOR CONFIGURATION

System    (128) banks; double-precision floating point; fclk = 1.2 GHz; 15nm process; VDD = 0.80 V
Bank      (2) 512×512 clusters; (4) 256×256 clusters; (6) 128×128 clusters; (8) 64×64 clusters; 1 LEON core
Cluster   127 bit-slice crossbars
Crossbar  N×N cells; (log2(N) − 1)-bit pipelined SAR ADC; 2N drivers
Cell      TaOx [18], [40]; Ron = 2 kΩ; Roff = 3 MΩ; Vread = 0.2 V; Vset = −2.6 V; Vreset = 2.6 V; Ewrite = 3.91 nJ; Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits, based on [43] and sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits, based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point; 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. The area, power, and internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.
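
One plausible reading of this power-scaling model, parameterized by the 10-bit reference power p_ref (the decomposition percentages are from the text; the functional form is our assumption):

def adc_power(resolution, p_ref, r_ref=10):
    """Scale ADC power from the 10-bit reference design point.

    Assumes 20% of p_ref is static, 7% scales exponentially with
    resolution, and the remaining 73% scales linearly.
    """
    static = 0.20 * p_ref
    exp    = 0.07 * p_ref * 2.0 ** (resolution - r_ref)
    linear = 0.73 * p_ref * (resolution / r_ref)
    return static + exp + linear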

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU for synthesis, we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGEN [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix           NNZs     Rows    NNZ/Row  Blocked (%)
2cubes_sphere    1647264  101492  16.2     49.7
crystm03         583770   24696   23.6     94.7
finan512         596992   74752   8.0      46.7
G2_circuit       726674   150102  4.8      60.9
nasasrb          2677324  54870   48.8     99.1
Pres_Poisson     715804   14822   48.3     96.4
qa8fm            1660579  66127   25.1     92.8
ship_001         3896496  34920   111.6    66.4
thermomech_TC    711558   102158  7.0      0.8
Trefethen_20000  554466   20000   27.7     63.3
ASIC_100K        940621   99340   9.5      60.9
bcircuit         375558   68902   5.5      64.9
epb3             463625   84617   5.5      72.2
GaAsH6           3381809  61349   55.1     69.2
ns3Da            1679599  20414   82.3     3.2
Si34H36          5156379  97569   52.8     53.7
torso2           1033473  115697  8.9      98.1
venkat25         1717792  62424   27.5     79.8
wang3            177168   26064   6.8      64.6
xenon1           1181120  48600   24.3     81.0

The LEON3 core, FMA unit, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. The SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], with 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator, modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline.


VIII. EVALUATION

We evaluate the system's performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× over the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, and the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row, showing above-average but not exceptional speedup due to their lower blocking efficiencies (i.e., the percentage of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, in Table II there are two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case performance of four MVM operations (Section V-B1), and since the actual complexity reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed.

Figure 9. Accelerator energy consumption normalized to the GPU baseline.

This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impacts of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: it never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits, without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry, rather than the ADCs, are the dominant area consumer, with a total of 54.1% of the overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2), the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network.

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size  Area [mm^2]  Energy [pJ]  Latency [ns]
64    0.00078      2.80         5.33
128   0.00103      6.52         10.7
256   0.00162      15.0         21.3
512   0.00352      34.2         42.7

Figure 10. Overhead of preprocessing and write time as a percentage of total solve time on the accelerator.

The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set, and one that generally falls as the size of the linear system increases. By extension, the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even less in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of non-zeros change each step, and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation.

Figure 11. Sparsity and blocking patterns of ns3Da: (a) sparsity; (b) blocking.

Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative endurance of 10^9 writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that, despite the high relative density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZs per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were to be blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation.

Figure 12. Iteration count as a function of bits per cell (B) and dynamic range (D), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

This in turn introduces error into the computation, which tends to hinder convergence rates.

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors to determine how results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the general dependence becomes stronger as the number of bits per cell is increased: higher programming error with the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work on memristive systems has been restricted to low-precision, error-prone, fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Advances in High Energy Physics, vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5 GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] LEON3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs: an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.

Page 4: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

within a single matrix block Unlike prior work clustersare designed for IEEE-754 compatible double-precisionfloating-point computation Each cluster contains 127 cross-bars with single-bit cells that are organized in a shift-and-add reduction tree The double-precision floating-pointcoefficients of a matrix block are converted to 118-bit fixed-point operands comprising a 53-bit mantissa one sign bitand up to 64 bits of padding for mantissa alignment This118-bit value is then encoded with a nine-bit error-correctingAN code (as proposed in [29]) for a full operand width ofup to 127 bits

Figure 3 shows the structure of the crossbars in a clusterand how computation proceeds Initially a vector bit slicex[i] is applied to all of the crossbars within a cluster andthe resulting dot product for each crossbar column is latchedby a sample-and-hold circuit (1)2 A set of ADCs (oneper crossbar) then start scanning the values latched by thesample-and-hold units (2) The outputs of the ADCs arepropagated through a shift-and-add reduction network (3)producing a single fixed-point value that corresponds to thedot product between a row of the matrix A and a single bitslice of the vector x The system is pipelined such that everycycle a single fixed-point value is produced by the reductionnetwork and after every column has been quantized a newvector bit slice is applied to all of the crossbars

Once the fixed-point result has been computed threeadditional operations are performed to convert it to anintermediate floating-point representation (Figure 2) Firstthe bias is removed converting the value from unsignedto signed fixed point Second error correction is appliedThird the running sum for the relevant vector element isloaded from the partial result buffer the incoming partialdot product is added to the running sum and the resultis written back After the running sum is read from thepartial result buffer it is inspected to determine whether theprecision requirements of IEEE-754 have been met (in whichcase no further vector bit slice computations are needed)If so all of the crossbars in the cluster are signaled to skipquantizing the corresponding column in the remaining vectorbit slice calculations3 The criteria for establishing whetherthe precision requirements of IEEE-754 have been met arediscussed in the next section

IV ACHIEVING HIGH PRECISION

Unlike machine learning applications scientific workloadsrequire high precision [30] To enable scientific computingwith memristive accelerators the system must meet theseprecision requirements through floating-point format supportand error correction

2Throughout the paper we use xj [i] to refer to the jth coefficient ofthe ith bit slice of vector x

3The total number vector bit slices is still bounded by the number ofvector bit slices required by the worst case column

1

2

3

SH Array SH Array SH Array

ADC ADC ADC

Figure 3 Cluster organization

A Floating-Point Computation on Fixed-Point Hardware

In double precision floating point every number is repre-sented by a 53-bit mantissa (including the implied leading1) an 11-bit exponent and a sign bit [31] When performingaddition in a floating-point representation the mantissas ofthe values must be aligned so that the bits of the samemagnitude can be added Consider as an example two base-ten floating-point numbers with two-digit mantissas onwhich the computation 12 + 013 is to be performed Toproperly compute the sum the values must be aligned basedon the decimal point 120+013 In a conventional floating-point unit this alignment is performed dynamically based onthe exponent field of each value On a memristive crossbarhowever since the addition occurs within the analog domainthe values must be converted to an aligned fixed-pointformat before they are programmed into the crossbar Thisrequires padding the floating point mantissas with zeros inthe example above the addition would be represented as120 + 013 in fixed point with an exponent of 10minus2

Although this simple padding approach allows fixed-pointhardware to perform floating-point operations it comes ata significant cost In double-precision for instance naıvepadding requires 2100 bits in the worst case to fully accountfor the exponent range 2046 bits for padding 53 bits for themantissa itself and one sign bit Not only is this requirementprohibitive from a storage perspective but full fixed-pointcomputation would require multiplying each matrix bit sliceby each vector bit slice for a worst case of 44 millioncrossbar operations Fortunately two important properties offloating-point numbers can be exploited to reduce the costof the fixed-point computation First while the IEEE-754standard provides 11 bits for the exponent in double preci-sion few applications ever approach the full exponent rangeSecond the naıve fixed-point computation described aboveprovides more precision than IEEE-754 as it computes bitswell beyond the 53-bit mantissa These extraneous values aresubsequently truncated when the result is converted back tofloating point

0110011001

010101

+

010111010

100101

+

000101

010110

011001

001000

+

01101000111

111010

+

0 0 0

(a) Initial partial product summation

Running Sum

(b) Sum of four partial products

(c) No overlap with partial product

(d) Addition can terminate

Figure 4 Illustrative example of partial product summation a) addition of the first four partial products b) overlap between the running sum and themantissa c) carries can still modify the running sum d) the mantissa bits are settled

B Avoiding Extraneous Calculations

Three techniques are proposed to reduce the cost ofperforming floating-point computation on fixed-pointhardware

Exploiting exponent range locality The 11-bit exponentfield specified by IEEE-754 means that the dynamic rangeof floating-point values is 22

11minus2 (two exponent fields arereserved for special values) or approximately 10616 (Forcomparison the size of the state space for the game ofGo is only 10172 [32]) IEEE-754 provides this range sothat applications from a wide variety of domains can usefloating-point values without additional scaling howeverit is highly unlikely that any model of a physical systemwill have a dynamic range requirement close to 10616 Inpractice the number of required pad bits is equal to thedifference between the minimum and maximum exponentsin the matrix and can be covered by a few hundred bitsrather than 2046 Furthermore since the fixed-point over-head is required only for numbers that are to be summed inthe analog domain the alignment overhead can be furtherreduced by exploiting locality Since the matrix is muchlarger than a single crossbar it is necessarily split into blocks(discussed in Section V-B1) only those values within thesame block are added in the analog domain and requirealignment The matrices can be expected to exhibit localitybecause they often come from physical systems where alarge dynamic range between neighboring points is unlikely

The aforementioned alignment procedure is also appliedto the vector to be multiplied by the matrix The alignmentof mantissas encodes the exponent of each vector elementrelative to a single exponent assigned to the vector as awhole Notably multiplication does not require alignmentsince it is sufficient to add the relevant exponents of thematrix block the vector and the exponent offset impliedby the location of the leading 1 in each dot product

Early termination In a conventional double-precision float-ing point unit (FPU) each arithmetic operation is performed

on two operands after which the resulting mantissa isrounded and truncated to 53 bits To compute a dot productbetween two vectors a middot x the FPU first calculates theproduct of two scalars a1x1 rounds and truncates the resultto 53 bits and adds the truncated product to subsequentproducts a middot x = a1x1 + a2x2 + middot middot middot A memristive MVMaccelerator on the other hand computes a dot product as anaggregation of partial dot products calculated over multiplebit slices This necessitates a truncation strategy differentfrom that of a digital FPU

A naıve approach to truncation on a memristive MVMaccelerator would be to perform all of the bit sliced matrix-vector multiplications aggregate the partial dot productsand truncate the result This naıve approach would requirea total of 127 times 127 crossbar operations since every bitslice of a must be multiplied by every bit slice of x 4

However many of these operations would contribute only tothe portion of the mantissa beyond the point of truncationwasting time and energy By detecting the position of theleading 1mdashie the 1 at the most significant bit position inthe final resultmdashthe computation can be safely terminatedas soon as the 52 following bits of the mantissa have settled

Consider the simplified example depicted in Figure 4 inwhich a series of partial products with different bit-sliceweights are to be added (a) To clarify the exposition the ex-ample assumes that the partial products are six bits wide andthat the final sum is to be represented by a four-bit mantissaand an exponent (The actual crossbars sum 127-bit partialproducts encoding the result with a 53-bit mantissa and anexponent) In the example the addition of partial productsproceeds iteratively from the most significant partial producttoward the least significant accumulating the result in arunning sum The goal is to determine whether this iterativeprocess can be terminated early while still producing thesame mantissa as the naıve approach

Figure 4b shows the running sum after the summation of the four most significant partial products. Since the next partial product still overlaps with the four-bit mantissa, it is still possible for the mantissa to be affected by subsequent additions. At a minimum, the computation must proceed until the mantissa in the running sum has cleared the overlap with the next partial product. This occurs in Figure 4c; however, at that point it is still possible for a carry from the subsequent partial products to propagate into and modify the mantissa. Notably, if the accumulation is halted at any point, the sum of the remaining partial products would produce at most one carry out into the running sum. Hence, in addition to clearing the overlap, safe termination requires that a 0 less significant than the mantissa be generated, such that it will absorb the single potential carry out, preventing any bit flips within the mantissa (Figure 4d).

[Figure 5: Regions within a running sum. From most to least significant: the stable region, a 0 barrier bit, the carry region (a chain of 1s), and the aligned region that overlaps the next partial dot product.]

At any given time, the running sum can be split into four non-overlapping regions, as shown in Figure 5. The aligned region is the portion of the running sum that overlaps with the subsequent partial products. A chain of 1s more significant than the aligned region is called the carry region, as it can propagate the single generated carry out forward. This carry, if present, is absorbed by a barrier bit, protecting the more significant bits in the stable region from bit flips. Thus, the running sum calculation can be terminated without loss of mantissa precision as soon as the full mantissa is contained in the stable region.
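The termination test can be stated compactly. The following behavioral sketch (our illustration, not the hardware pipeline; it assumes positive, nonzero partial products) accumulates partial products from most to least significant and stops once the mantissa has cleared the aligned region and a 0 barrier bit below it can absorb the one possible carry:

```python
def can_terminate(running_sum, next_pp_top, mant_bits):
    """Stable-region test from Figure 5 for a positive running sum."""
    lead = running_sum.bit_length() - 1          # position of the leading 1
    mant_lsb = lead - (mant_bits - 1)            # lowest mantissa bit position
    if mant_lsb <= next_pp_top:                  # still overlaps aligned region
        return False
    # A 0 between the aligned region and the mantissa absorbs the single
    # potential carry out of the remaining partial products (barrier bit).
    return any(((running_sum >> p) & 1) == 0
               for p in range(next_pp_top + 1, mant_lsb))

def early_terminated_sum(partials, mant_bits):
    """Sum (weight, value) partial products MSB-first with early termination."""
    partials = sorted(partials, key=lambda wv: -wv[0])
    total = 0
    for i, (w, v) in enumerate(partials):
        total += v << w
        rest = partials[i + 1:]
        if not rest or total == 0:
            continue
        # Highest bit position any remaining partial product can touch.
        next_pp_top = max(w2 + v2.bit_length() - 1 for w2, v2 in rest)
        if can_terminate(total, next_pp_top, mant_bits):
            break                                # remaining bits cannot flip
    return total
```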

Scheduling array activations. Even with the early termination technique discussed above, there is still a possibility for extraneous crossbar operations to be performed. Although each partial dot product takes up to 127 bits, the final mantissa is truncated to 52 bits following the leading 1, requiring fewer than 127 bits to be computed in most cases. Therefore, some subset of the 127-bit partial dot product may be wasted, depending on the schedule according to which vector bit slices are applied to the crossbars.

Figure 6 shows three approaches to scheduling the crossbar activations. In the figure, rows and columns respectively represent the matrix and vector bit slices, sorted by significance. The numbers at the intersection of each row and column represent the significance of the resulting partial product. Different groupings indicate which bit-sliced matrix-vector multiplications are scheduled to occur simultaneously; the black colored groupings must be performed, whereas the gray colored groupings may be skipped due to early termination. Hence, the number of black groupings correlates with execution time, and the number of crossbars enclosed by the black groupings correlates with crossbar activation energy. The example shown in the figure assumes that the calculation can be terminated early after the fifth most significant bit of the mantissa is computed. (That is, all groupings computing a partial product with significance of at least 2 must be performed.)

[Figure 6: Three policies for scheduling crossbar activations (vertical grouping, diagonal grouping, and hybrid grouping) over a 4×4 grid of matrix and vector bit slices, with performed and skipped activations marked.]

In the leftmost case, all of the crossbars are activated with the same vector bit slice, after which the next significant vector bit slice is applied to the crossbars. This naïve vertical grouping requires 16 crossbar activations performed over four time steps. In the next example, arrays are activated with different vector bit slices, such that each active array is contributing to the same slice of the output vector simultaneously. This diagonal grouping results in 13 crossbar activations (the minimum required in the example) over five time steps. In general, this policy takes the minimum number of crossbar activations at the cost of additional latency.

The hybrid grouping shown on the right, which we adopt in the evaluation, strikes a balance between the respective energy and latency advantages of the diagonal and vertical groupings. The result is 14 crossbar activations performed over four time steps. The more closely the hybrid grouping approximates a diagonal grouping, the greater the energy savings, at the cost of latency.
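The activation and time-step counts quoted above can be reproduced for the toy example of Figure 6. The snippet below (illustrative only; 4 bit slices per operand and a cutoff significance of 2, as in the figure) counts both quantities for the vertical and diagonal policies; the hybrid policy falls between the two:

```python
SLICES = 4          # matrix and vector bit slices in the toy example
CUTOFF = 2          # groupings with output significance >= CUTOFF must run

def vertical():
    # One time step per vector slice; a whole column runs if any cell matters.
    steps = [[(i, j) for i in range(SLICES)] for j in range(SLICES)]
    steps = [s for s in steps if any(i + j >= CUTOFF for i, j in s)]
    return sum(len(s) for s in steps), len(steps)

def diagonal():
    # One time step per output significance; only the needed cells activate.
    steps = [[(i, j) for i in range(SLICES) for j in range(SLICES)
              if i + j == s] for s in range(2 * SLICES - 1)]
    steps = [s for s in steps if s and s[0][0] + s[0][1] >= CUTOFF]
    return sum(len(s) for s in steps), len(steps)

print(vertical())   # (16, 4): more activations, fewer time steps
print(diagonal())   # (13, 5): minimum activations, one extra time step
```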

Taken together, the above techniques significantly reduce the overhead of performing floating-point computation on fixed-point hardware by restricting computations to only those that are likely to produce a bit needed for the final mantissa. The techniques render the latency and energy consumption strongly data dependent, effectively forming a data-specific subset of the floating-point format without any loss in precision.

C. Handling Negative Numbers

The previously proposed ISAAC memristive accelerator [8] uses a biasing scheme to handle negative numbers. The approach adds a bias of 2^15 to each 16-bit operand, such that the values range from 0 to 2^16 rather than ±2^15. Results are de-biased by subtracting 2^15 · pv, where pv is the population count (i.e., the number of bits set to 1) of the relevant bit slice of the vector. We adopt this technique with one modification: in place of the constant 2^15, we use a per-block biasing constant based on the actual floating-point exponent range contained within the block.
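A minimal sketch of the bias/de-bias arithmetic, assuming the 2^15 bias for 16-bit operands described above (the per-block constant of our scheme would simply replace BIAS):

```python
import numpy as np

BIAS = 1 << 15   # 2**15 bias so 16-bit signed operands become non-negative

def biased_dot(a, x_bitslice):
    """Dot product of a signed matrix row `a` with a binary vector slice,
    computed with biased (non-negative) operands and then de-biased."""
    biased = a + BIAS                       # values now lie in [0, 2**16)
    raw = int(np.dot(biased, x_bitslice))   # what the crossbar would produce
    pv = int(x_bitslice.sum())              # population count of the slice
    return raw - BIAS * pv                  # subtract bias * popcount

a = np.array([-5, 3, -1, 7], dtype=np.int64)
x = np.array([1, 0, 1, 1], dtype=np.int64)
assert biased_dot(a, x) == int(np.dot(a, x))   # == 1
```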

D. Floating-Point Anomalies and Rounding Modes

The IEEE-754 floating-point standard specifies a number of floating-point exceptions and rounding modes that must be handled correctly for proper operation [31].

Rounding modes. When the proposed accelerator is operated in its typical configuration, mantissa alignment and leading-1 detection result in a truncation of the result to the required precision. The potential contributions of any uncalculated bits beyond the least significant bit would only result in a mantissa of greater magnitude than what is ultimately obtained, and since a biasing scheme is used to enable the representation of negative values (Section IV-C), the truncation is equivalent to rounding the result of a dot product toward negative infinity. For applications requiring other rounding modes (e.g., to nearest, toward positive infinity, toward zero), the accelerator can be configured to compute three additional settled bits before truncation and perform the rounding according to those bits.

Infinities and NaNs. Signed infinities and not-a-numbers (NaNs) are valid floating-point values; however, they cannot be mapped to the crossbars in a meaningful way in the proposed accelerator, nor can they be included in any dot product computations without producing a NaN, infinity, or invalid result [31]. Therefore, the accelerator requires all input matrices and vectors to contain no infinities or NaNs, and any infinities or NaNs arising from an intermediate computation in the local processor are handled there, to prevent them from propagating to the resulting dot product.

Floating-point exceptions. Operations with floating-point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values. (Only the range of exponents is bounded.) This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating-point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be respectively set to all 1s or all 0s for overflow and underflow conditions. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., 0/0, √−|c|) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating-point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates under the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those in prior work introduces new sources of error: if the array size is comparable to the dynamic range of the memristors, the non-zero current in the R_HI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512×512, with a memristor dynamic range of 1.5×10^3. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating to fixed point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit for detection. The correction is applied after the reduction but before the leading-one detection.
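For illustration, the detection side of an AN code fits in a few lines (our example; the residue-based logic that locates and repairs a flipped bit is omitted):

```python
A = 251  # the AN-code constant used for the 118-bit operands

def an_encode(n):
    return A * n                      # codewords are exact multiples of A

def an_check(codeword):
    """Return (ok, value). Any error that is not itself a multiple of A
    leaves a nonzero residue, which the correction bits use to locate it."""
    return codeword % A == 0, codeword // A

w = an_encode(0x5F3759DF)
ok, v = an_check(w)
assert ok and v == 0x5F3759DF
ok, _ = an_check(w ^ (1 << 7))        # flip one bit: residue becomes nonzero
assert not ok
```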

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix, M × N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation divided by the latency of that computation, τ_MVM.

As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar (and consequently the number of sums-of-products resulting from a cluster operation) also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both the block size, M × N, and the block density, d_block. Actual throughput for a given block is therefore (d_block × M × N) / τ_MVM. Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.
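As a worked example (treating the Table III latencies as τ_MVM for one binary crossbar operation; the block densities are illustrative assumptions), a small, dense block can out-perform a much larger but sparse one:

```python
def throughput(m, n, density, t_mvm_ns):
    """Effective element-wise operations per ns for an m x n block."""
    return density * m * n / t_mvm_ns

# Latencies taken from Table III (ns); densities are assumed for illustration.
print(throughput(512, 512, 0.02, 42.7))   # large, sparse block:  ~123 ops/ns
print(throughput(64, 64, 0.60, 5.33))     # small, dense block:   ~461 ops/ns
```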

The number of non-zero elements actually mapped to a crossbar is, in practice, far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix: 0.37% for the densest matrix in the evaluation.

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of Σ_{i=1}^{N} 1×1 = N. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of ⌈log2(N+1)⌉ bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], conversion time per column is strictly proportional to ADC resolution, which increases with log2 N, where N is the number of rows, and the number of ADC conversions per binary MVM operation is equal to the number of columns, M. Total conversion time is thus proportional to M⌈log2(N+1)⌉.

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy. The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows, N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to N log2 N. The column count, M, determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to M × N log2 N.

Crossbar energy. Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within less than LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M×N crossbar are both proportional to M + N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or log2 N. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells, M × N, by the average path conductance, which is proportional to 1/(M + N). Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus proportional to (M × N)(M + N) log2 N.
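These proportionalities are easy to tabulate. The sketch below reports unitless scaling trends only; the constants of proportionality come from the circuit models of Section VII and are not shown:

```python
from math import ceil, log2

def adc_bits(n_rows):
    # ceil(log2(N+1)) bits resolve the worst-case column sum of N
    return ceil(log2(n_rows + 1))

def scaling(m, n):
    """Proportional scaling trends from Section V-A (unitless)."""
    return {
        "conversion_time": m * adc_bits(n),       # ~ M * log2(N)
        "adc_energy":      m * n * log2(n),       # ~ M * N * log2(N)
        "xbar_energy":     (m * n) * (m + n) * log2(n),
    }

for size in (64, 128, 256, 512):
    print(size, scaling(size, size))
```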

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by drivers; consequently, the total crossbar area grows as M(M + N).

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. By contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12×12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates, and for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold; if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are completely removed from the mapping and are instead processed separately by the local processor.
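A sketch of this pass (our paraphrase in Python over a {(row, col): value} map; the grid-aligned candidate blocks, the acceptance threshold, and the element-removal step are all simplified stand-ins for the real heuristics):

```python
import math
from collections import defaultdict

BLOCK_SIZES = (512, 256, 128, 64)
MAX_EXP_RANGE = 117   # largest exponent spread allowed within one block

def nnz_threshold(size):
    return size        # hypothetical threshold: at least one NNZ per row

def map_matrix(entries):
    """Greedy blocking over nonzero entries; leftovers go to the local
    processor in CSR form. Assumes all values are nonzero finite floats."""
    remaining, accepted = dict(entries), []
    for size in BLOCK_SIZES:
        buckets = defaultdict(dict)
        for (r, c), v in remaining.items():
            buckets[(r // size, c // size)][(r, c)] = v
        for origin, block in buckets.items():
            exps = [math.frexp(v)[1] for v in block.values()]
            if max(exps) - min(exps) > MAX_EXP_RANGE:
                continue                     # (selective removal elided)
            if len(block) >= nnz_threshold(size):
                accepted.append((size, origin, block))
                for key in block:
                    del remaining[key]
    return accepted, remaining
```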

[Figure 7: Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity, (b) Pres_Poisson blocking, (c) Xenon1 sparsity, (d) Xenon1 blocking.]

Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4 × NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8 × NNZ.

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding. Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows, N. To handle this extreme case, the ADC would require ⌈log2(N+1)⌉ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log2 N is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to log2(N) − 1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the RLO state), which in turn reduces crossbar energy.
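The encode/decode pair is simple enough to sketch (illustrative; the exactly-N/2 corner case is assumed to have been eliminated by moving elements to the local processor, as described above):

```python
import numpy as np

def cic_encode(column):
    """Invert a binary crossbar column if it is more than half ones, so the
    stored column never exceeds N/2 ones (computational invert coding)."""
    inverted = column.sum() > len(column) // 2
    return (1 - column if inverted else column), inverted

def cic_dot(stored, inverted, x_slice):
    raw = int(stored @ x_slice)
    # Un-invert: sum(col * x) = popcount(x) - sum((1 - col) * x)
    return int(x_slice.sum()) - raw if inverted else raw

col = np.array([1, 1, 1, 0, 1, 1], dtype=np.int64)   # dense column
x   = np.array([1, 0, 1, 1, 0, 1], dtype=np.int64)
stored, inv = cic_encode(col)
assert cic_dot(stored, inv, x) == int(col @ x)       # == 3
```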

ADC headstart. The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most ⌈log2(d_row × 0.5 × N)⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position, and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.
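A back-of-the-envelope model of the savings (our illustration; the row density and CIC-reduced ADC resolution are assumptions):

```python
from math import ceil, log2

def saved_conversion_cycles(n_rows, row_density, adc_bits):
    """SAR cycles skipped by starting the binary search at the highest bit
    the column can actually produce (illustrative model)."""
    max_out_bits = ceil(log2(max(2, row_density * 0.5 * n_rows)))
    return max(0, adc_bits - max_out_bits)

# A 512-row crossbar with a (log2[512] - 1) = 8-bit ADC and a 5%-dense row:
print(saved_conversion_cycles(512, 0.05, 8))   # skips ~4 of 8 cycles
```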

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.
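For concreteness, CG phrased against these three kernels looks as follows (a textbook sketch, not the accelerator's firmware; `spmv` stands in for the offloaded sparse MVM, and b is assumed to be a float vector):

```python
import numpy as np

def cg(spmv, b, iters=1000, tol=1e-10):
    """Conjugate gradient built from the accelerator's three kernels:
    spmv (crossbars), AXPY, and dot products (local processors)."""
    x = np.zeros_like(b)
    r = b - spmv(x)
    p = r.copy()
    rr = float(r @ r)                      # dot-product kernel
    for _ in range(iters):
        Ap = spmv(p)                       # sparse MVM kernel
        alpha = rr / float(p @ Ap)         # dot-product kernel
        x += alpha * p                     # AXPY kernel
        r -= alpha * Ap                    # AXPY kernel
        rr_new = float(r @ r)
        if rr_new < tol * tol:
            break
        p = r + (rr_new / rr) * p          # AXPY kernel
        rr = rr_new
    return x

# Hypothetical usage, with a SciPy CSR matrix standing in for the crossbars:
# x = cg(lambda v: A_csr @ v, b)
```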

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

To implement Krylov subspace solvers, we implement the three primary kernels needed for the solver.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.
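In pseudocode form (hypothetical names; the real routine stages data into hardware vector buffers and writes memory-mapped status registers):

```python
def start_sparse_mvm(vector_map, x, clusters):
    """Sketch of the dispatch loop: stage contiguous slices of x and kick
    off each cluster, largest (highest-latency) clusters first."""
    for base, idx, size in sorted(vector_map, key=lambda t: -t[2]):
        clusters[base].vector_buffer = x[idx : idx + size]  # contiguous load
        clusters[base].status = "START"                     # start signal
```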

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node, using process parameters and design rules from FreePDK15 [38] coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach, using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I. ACCELERATOR CONFIGURATION

System:   (128) banks, double-precision floating point, f_clk = 1.2 GHz, 15nm process, V_DD = 0.80 V
Bank:     (2) 512×512 clusters, (4) 256×256 clusters, (6) 128×128 clusters, (8) 64×64 clusters, 1 LEON core
Cluster:  127 bit-slice crossbars
Crossbar: N×N cells, (log2[N] − 1)-bit pipelined SAR ADC, 2N drivers
Cell [18], [40]: TaOx, R_on = 2 kΩ, R_off = 3 MΩ, V_read = 0.2 V, V_set = −2.6 V, V_reset = 2.6 V, E_write = 3.91 nJ, T_write = 50.88 ns

The peripheral circuitry includes 2N driver circuits based on [43], sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point. 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. Area, power, and the internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU for synthesis, we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGEN [46].

Table II. EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix           NNZs     Rows    NNZ/Row  Blocked [%]
2cubes_sphere    1647264  101492  16.2     49.7
crystm03         583770   24696   23.6     94.7
finan512         596992   74752   7.9      46.7
G2_circuit       726674   150102  4.8      60.9
nasasrb          2677324  54870   48.8     99.1
Pres_Poisson     715804   14822   48.3     96.4
qa8fm            1660579  66127   25.1     92.8
ship_001         3896496  34920   111.6    66.4
thermomech_TC    711558   102158  6.8      0.8
Trefethen_20000  554466   20000   27.7     63.3
ASIC_100K        940621   99340   9.5      60.9
bcircuit         375558   68902   5.4      64.9
epb3             463625   84617   5.5      72.2
GaAsH6           3381809  61349   55.1     69.2
ns3Da            1679599  20414   82.3     3.2
Si34H36          5156379  97569   52.8     53.7
torso2           1033473  115697  8.9      98.1
venkat25         1717792  62424   27.5     79.8
wang3            177168   26064   6.8      64.6
xenon1           1181120  48600   24.3     81.0

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], with 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator, modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

[Figure 8: Speedup over the GPU baseline for each evaluated matrix, with the geometric mean (G-MEAN).]

VIII. EVALUATION

We evaluate the system performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× as compared to the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, and the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row, showing above-average but not exceptional speedup due to their lower blocking efficiencies (i.e., the percent of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, in Table II there are two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case performance of four MVM operations (Section V-B1), and since the actual complexity reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

[Figure 9: Accelerator energy consumption normalized to the GPU baseline, per matrix and geometric mean (G-MEAN).]

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: Pres_Poisson never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits, without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry (rather than the ADCs) are the dominant area consumer, with a total of 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2),

Table III. AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size  Area [mm²]  Energy [pJ]  Latency [ns]
64    0.00078     2.80         5.33
128   0.00103     6.52         10.7
256   0.00162     15.0         21.3
512   0.00352     34.2         42.7

[Figure 10: Overhead of preprocessing and write time as a percent of total solve time on the accelerator, per matrix.]

the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network. The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing system area. Table III summarizes the area, as well as the latency and energy characteristics, of the different crossbar sizes used in different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set, generally falling as the size of the linear system increases. By extension, the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even less in practice, as multiple time steps are generally simulated to examine the time-evolution of the linear system. In these time-stepped computations, only a subset of the non-zeros change each step, and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative 10^9 writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

[Figure 11: Sparsity and blocking patterns of ns3Da: (a) ns3Da sparsity, (b) ns3Da blocking.]

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that, despite the high relative density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZs per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks, and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were to be blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation. This, in turn, introduces error into the computation, which tends to hinder convergence rates.

[Figure 12: Iteration count as a function of bits per cell (B = 1, 2) and dynamic range (D = 0.75K, 1.5K, 3K), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.]

[Figure 13: Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.]

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors, to determine how results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the general dependence becomes stronger as the number of bits per cell is increased. High programming error with the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone, fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Adv. High Energy Phys., vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Schölkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–82, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive technology model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] Leon3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs – an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in Electron Devices Meeting, 2008 (IEDM 2008), IEEE International. IEEE, 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan 2012.

Page 5: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

0110011001

010101

+

010111010

100101

+

000101

010110

011001

001000

+

01101000111

111010

+

0 0 0

(a) Initial partial product summation

Running Sum

(b) Sum of four partial products

(c) No overlap with partial product

(d) Addition can terminate

Figure 4 Illustrative example of partial product summation a) addition of the first four partial products b) overlap between the running sum and themantissa c) carries can still modify the running sum d) the mantissa bits are settled

B Avoiding Extraneous Calculations

Three techniques are proposed to reduce the cost ofperforming floating-point computation on fixed-pointhardware

Exploiting exponent range locality The 11-bit exponentfield specified by IEEE-754 means that the dynamic rangeof floating-point values is 22

11minus2 (two exponent fields arereserved for special values) or approximately 10616 (Forcomparison the size of the state space for the game ofGo is only 10172 [32]) IEEE-754 provides this range sothat applications from a wide variety of domains can usefloating-point values without additional scaling howeverit is highly unlikely that any model of a physical systemwill have a dynamic range requirement close to 10616 Inpractice the number of required pad bits is equal to thedifference between the minimum and maximum exponentsin the matrix and can be covered by a few hundred bitsrather than 2046 Furthermore since the fixed-point over-head is required only for numbers that are to be summed inthe analog domain the alignment overhead can be furtherreduced by exploiting locality Since the matrix is muchlarger than a single crossbar it is necessarily split into blocks(discussed in Section V-B1) only those values within thesame block are added in the analog domain and requirealignment The matrices can be expected to exhibit localitybecause they often come from physical systems where alarge dynamic range between neighboring points is unlikely

The aforementioned alignment procedure is also appliedto the vector to be multiplied by the matrix The alignmentof mantissas encodes the exponent of each vector elementrelative to a single exponent assigned to the vector as awhole Notably multiplication does not require alignmentsince it is sufficient to add the relevant exponents of thematrix block the vector and the exponent offset impliedby the location of the leading 1 in each dot product

Early termination In a conventional double-precision float-ing point unit (FPU) each arithmetic operation is performed

on two operands after which the resulting mantissa isrounded and truncated to 53 bits To compute a dot productbetween two vectors a middot x the FPU first calculates theproduct of two scalars a1x1 rounds and truncates the resultto 53 bits and adds the truncated product to subsequentproducts a middot x = a1x1 + a2x2 + middot middot middot A memristive MVMaccelerator on the other hand computes a dot product as anaggregation of partial dot products calculated over multiplebit slices This necessitates a truncation strategy differentfrom that of a digital FPU

A naıve approach to truncation on a memristive MVMaccelerator would be to perform all of the bit sliced matrix-vector multiplications aggregate the partial dot productsand truncate the result This naıve approach would requirea total of 127 times 127 crossbar operations since every bitslice of a must be multiplied by every bit slice of x 4

However many of these operations would contribute only tothe portion of the mantissa beyond the point of truncationwasting time and energy By detecting the position of theleading 1mdashie the 1 at the most significant bit position inthe final resultmdashthe computation can be safely terminatedas soon as the 52 following bits of the mantissa have settled

Consider the simplified example depicted in Figure 4 inwhich a series of partial products with different bit-sliceweights are to be added (a) To clarify the exposition the ex-ample assumes that the partial products are six bits wide andthat the final sum is to be represented by a four-bit mantissaand an exponent (The actual crossbars sum 127-bit partialproducts encoding the result with a 53-bit mantissa and anexponent) In the example the addition of partial productsproceeds iteratively from the most significant partial producttoward the least significant accumulating the result in arunning sum The goal is to determine whether this iterativeprocess can be terminated early while still producing thesame mantissa as the naıve approach

Figure 4b shows the running sum after the summationof the four most significant partial products Since the next

4Recall from Section III that scalars are encoded with up to 127-bitfixed point All examples in this section assume a 127-bit representationfor simplicity unless otherwise noted

Stable Region

X hellip X

Potential carry out

+

Barrier Bit

0

Carry Region

Aligned Region

X X 0X X X

1 hellip 1 X Running sum

Next partial dot product

Figure 5 Regions within a running sum

partial product still overlaps with the four-bit mantissa it isstill possible for the mantissa to be affected by subsequentadditions At a minimum the computation must proceeduntil the mantissa in the running sum has cleared the overlapwith the next partial product This occurs in Figure 4chowever at that point it is still possible for a carry from thesubsequent partial products to propagate into and modify themantissa Notably if the accumulation is halted at any pointthe sum of the remaining partial products would produce atmost one carry out into the running sum Hence in additionto clearing the overlap safe termination requires that a 0less significant than the mantissa be generated such that itwill absorb the single potential carry out preventing any bitflips within the mantissa (Figure 4d)

At any given time, the running sum can be split into four non-overlapping regions, as shown in Figure 5. The aligned region is the portion of the running sum that overlaps with the subsequent partial products. A chain of 1s more significant than the aligned region is called the carry region, as it can propagate the single generated carry out forward. This carry, if present, is absorbed by a barrier bit, protecting the more significant bits in the stable region from bit flips. Thus, the running sum calculation can be terminated without loss of mantissa precision as soon as the full mantissa is contained in the stable region.
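As an illustration, a minimal sketch of the termination check follows. The running sum is modeled as a Python integer, and `align_top`, `can_terminate`, and the mantissa width are illustrative names and parameters, not the hardware interface:

```python
# A minimal sketch of the early-termination check; `align_top` is the bit
# position just above the aligned region (the bits that still overlap the
# remaining partial products).
def can_terminate(running_sum, align_top, mantissa_bits=4):
    """Return True if the top `mantissa_bits` of the running sum can no
    longer change: they must sit in the stable region, above the barrier
    bit that absorbs the single potential carry-out."""
    msb = running_sum.bit_length() - 1          # position of the leading 1
    # Walk up from align_top over the carry region (a chain of 1s that
    # could propagate the single carry-out) to find the barrier 0.
    pos = align_top
    while (running_sum >> pos) & 1:
        pos += 1
    barrier = pos                                # first 0 above the chain
    # Termination is safe when the full mantissa lies above the barrier.
    lowest_mantissa_bit = msb - mantissa_bits + 1
    return lowest_mantissa_bit > barrier

# Running sum 110100: a 2-bit mantissa (bits 5-4) is already stable, but a
# 3-bit mantissa still touches the carry chain and is not.
assert can_terminate(0b110100, align_top=2, mantissa_bits=2)
assert not can_terminate(0b110100, align_top=2, mantissa_bits=3)
```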

Scheduling array activations. Even with the early termination technique discussed above, there is still a possibility for extraneous crossbar operations to be performed. Although each partial dot product takes up to 127 bits, the final mantissa is truncated to 52 bits following the leading 1, requiring fewer than 127 bits to be computed in most cases. Therefore, some subset of the 127-bit partial dot product may be wasted, depending on the schedule according to which vector bit slices are applied to the crossbars.

Figure 6 shows three approaches to scheduling the crossbar activations. In the figure, rows and columns respectively represent the matrix and vector bit slices, sorted by significance. The numbers at the intersection of each row and column represent the significance of the resulting partial product. Different groupings indicate which bit-sliced matrix-vector multiplications are scheduled to occur simultaneously: the black colored groupings must be performed, whereas the gray colored groupings may be skipped due to early termination. Hence, the number of black groupings

Figure 6. Three policies for scheduling crossbar activations (vertical, diagonal, and hybrid grouping). Rows are matrix slices and columns are vector slices (3 down to 0); each cell holds the significance of its partial product, and groupings are marked as activations performed or skipped.

correlates with execution time, and the number of crossbars enclosed by the black groupings correlates with crossbar activation energy. The example shown in the figure assumes that the calculation can be terminated early after the fifth most significant bit of the mantissa is computed. (That is, all groupings computing a partial product with significance of at least 2 must be performed.)

In the leftmost case, all of the crossbars are activated with the same vector bit slice, after which the next significant vector bit slice is applied to the crossbars. This naïve vertical grouping requires 16 crossbar activations performed over four time steps. In the next example, arrays are activated with different vector bit slices, such that each active array is contributing to the same slice of the output vector simultaneously. This diagonal grouping results in 13 crossbar activations (the minimum required in the example) over five time steps. In general, this policy takes the minimum number of crossbar activations at the cost of additional latency.

The hybrid grouping shown on the right, which we adopt in the evaluation, strikes a balance between the respective energy and latency advantages of the diagonal and vertical groupings. The result is 14 crossbar activations performed over four time steps. The more closely the hybrid grouping approximates a diagonal grouping, the greater the energy savings, at the cost of latency.
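The activation and time-step counts in the example can be reproduced with a small model. The sketch below is illustrative; `vertical`, `diagonal`, and `prune` are hypothetical helpers, and a grouping is treated as the unit of activation, so a group fires in full if any of its partial products is still needed:

```python
# A minimal sketch of the vertical and diagonal policies from Figure 6 on a
# 4x4 grid of (matrix slice i, vector slice j) pairs, where pair (i, j)
# produces a partial product of significance i + j. The hybrid policy
# (14 activations over 4 steps in the example) falls in between.
def vertical(n, threshold):
    # One time step per vector slice; every matrix slice fires with it.
    steps = [[(i, j) for i in range(n)] for j in range(n)]
    return prune(steps, threshold)

def diagonal(n, threshold):
    # One time step per output significance s; all pairs with i + j == s
    # fire together, each contributing to the same output slice.
    steps = [[(i, j) for i in range(n) for j in range(n) if i + j == s]
             for s in range(2 * n - 1)]
    return prune(steps, threshold)

def prune(steps, threshold):
    # A grouping is the unit of activation: if any pair in the group is
    # needed (significance >= threshold), the whole group must fire.
    kept = [g for g in steps if any(i + j >= threshold for i, j in g)]
    return sum(len(g) for g in kept), len(kept)   # (activations, time steps)

assert vertical(4, threshold=2) == (16, 4)   # naive: all 16 activations
assert diagonal(4, threshold=2) == (13, 5)   # minimum activations, more steps
```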

Taken together, the above techniques significantly reduce the overhead of performing floating-point computation on fixed-point hardware by restricting computations to only those that are likely to produce a bit needed for the final mantissa. The techniques render the latency and energy consumption strongly data dependent, effectively forming a data-specific subset of the floating-point format without any loss in precision.

C. Handling Negative Numbers

The previously proposed ISAAC memristive accelerator [8] uses a biasing scheme to handle negative numbers. The approach adds a bias of 2^15 to each 16-bit operand, such that the values range from 0 to 2^16 rather than ±2^15. Results are de-biased by subtracting 2^15 × pv, where pv is the population count (i.e., the number of bits set to 1) of the relevant bit slice of the vector. We adopt this technique with one modification: in place of the constant 2^15, we use a per-block biasing constant based on the actual floating-point exponent range contained within the block.
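A minimal sketch of the biasing arithmetic follows, shown with the 16-bit constant for concreteness (the proposed design substitutes a per-block constant derived from the block's exponent range):

```python
# A minimal sketch of the biasing scheme for one vector bit slice.
def biased_dot(a, x_bit_slice, bias=1 << 15):
    # Map each signed operand a[k] into [0, 2^16) by adding the bias.
    biased = [ak + bias for ak in a]
    raw = sum(bk * xk for bk, xk in zip(biased, x_bit_slice))
    # De-bias: subtract bias * pv, where pv is the population count of the
    # applied vector bit slice.
    pv = sum(x_bit_slice)
    return raw - bias * pv

a = [-7, 3, -1]
x_slice = [1, 0, 1]                            # one bit slice of the vector
assert biased_dot(a, x_slice) == -7 + 0 + -1   # == -8
```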

D. Floating Point Anomalies and Rounding Modes

The IEEE-754 floating point standard specifies a number of floating point exceptions and rounding modes that must be handled correctly for proper operation [31].

Rounding modes. When the proposed accelerator is operated in its typical configuration, mantissa alignment and leading 1 detection result in a truncation of the result to the required precision. The potential contributions of any uncalculated bits beyond the least-significant bit would only result in a mantissa of greater magnitude than what is ultimately obtained, and since a biasing scheme is used to enable the representation of negative values (Section IV-C), the truncation is equivalent to rounding the result of a dot product toward negative infinity. For applications requiring other rounding modes (e.g., to nearest, toward positive infinity, toward zero), the accelerator can be configured to compute three additional settled bits before truncation and perform the rounding according to those bits.
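A sketch of this rounding configuration is shown below, assuming the three extra settled bits are exposed alongside the mantissa; the function name and the ties-to-even choice for the nearest mode are illustrative assumptions:

```python
# A minimal sketch of rounding from three additional settled bits.
def round_mantissa(bits, mantissa_width, mode="toward_neg_inf"):
    """bits: an integer holding mantissa_width + 3 settled bits."""
    assert bits < 1 << (mantissa_width + 3)
    mantissa = bits >> 3            # default truncation: toward -inf
    extra = bits & 0b111            # the three additional settled bits
    if mode == "to_nearest" and extra > 0b100:
        mantissa += 1               # round up on more than 1/2 ulp
    elif mode == "to_nearest" and extra == 0b100:
        mantissa += mantissa & 1    # break ties to even
    return mantissa

assert round_mantissa(0b1011_101, 4, "toward_neg_inf") == 0b1011
assert round_mantissa(0b1011_101, 4, "to_nearest") == 0b1100
```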

Infinities and NaNs. Signed infinities and not-a-numbers (NaNs) are valid floating point values; however, they cannot be mapped to the crossbars in a meaningful way in the proposed accelerator, nor can they be included in any dot product computations without producing a NaN, infinity, or invalid result [31]. Therefore, the accelerator requires all input matrices and vectors to contain no infinities or NaNs, and any infinities or NaNs arising from an intermediate computation in the local processor are handled there, to prevent them from propagating to the resulting dot product.

Floating point exceptions. Operations with floating point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values. (Only the range of exponents is bounded.) This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be respectively set to all 1s or all 0s for overflow and underflow conditions. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., 0/0, √−|c|) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates using the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those used in prior work introduces new sources of error. If the array size is comparable to the dynamic range of the memristors, the non-zero current in the RHI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512×512, with a memristor dynamic range of 1.5 × 10^3. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating to fixed point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit for detection. The correction is applied after the reduction but before the leading-one detection.
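The arithmetic behind the AN code can be sketched as follows; the encode/check helpers are illustrative, and the mapping from residues to correctable error patterns is omitted:

```python
# A minimal sketch of AN-code checking with A = 251; codewords are exact
# multiples of A, and addition preserves the property (A*x + A*y = A*(x+y)).
A = 251

def an_encode(x):
    return A * x

def an_check(codeword):
    # A fault-free result is divisible by A; a nonzero residue flags an
    # error, and the residue value indexes the correctable error patterns.
    residue = codeword % A
    return residue == 0, residue

c = an_encode(41)
ok, _ = an_check(c)
assert ok
ok, residue = an_check(c + (1 << 7))   # inject a single-bit flip at bit 7
assert not ok and residue == (1 << 7) % A
```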

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix M × N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation divided by the latency of that computation, τ_MVM.

As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar, and consequently the number of sums-of-products resulting from a cluster operation, also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both the block size M × N and the block density d_block. Actual throughput for a given block is therefore (d_block × M × N) / τ_MVM. Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.
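The tradeoff can be summarized with a small model. In the sketch below, the per-operation latency follows the ADC-dominated conversion time derived later in this section (proportional to M⌈log2(N + 1)⌉), and the cycle time is a placeholder:

```python
# A minimal sketch of the effective-throughput model.
from math import ceil, log2

def effective_throughput(M, N, d_block, t_cycle=1.0):
    tau_mvm = M * ceil(log2(N + 1)) * t_cycle   # assumed latency model
    nnz = d_block * M * N                        # elements doing real work
    return nnz / tau_mvm

# Larger blocks win only if they stay dense enough:
print(effective_throughput(512, 512, d_block=0.02))  # ~1.0 ops/cycle
print(effective_throughput(64, 64, d_block=0.50))    # ~4.6 ops/cycle
```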

The number of non-zero elements actually mapped to a crossbar is, in practice, far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix (0.37% for the densest matrix in the evaluation).

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of Σ_{i=1}^{N} 1×1 = N. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of ⌈log2[N + 1]⌉ bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], conversion time per column is strictly proportional to ADC resolution, which increases with log2 N, where N is the number of rows, and the number of ADC conversions per binary MVM operation is equal to the number of columns M. Total conversion time is thus proportional to M⌈log2[N + 1]⌉.

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy. The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to N log2 N. The column count M determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to M × N log2 N.

Crossbar energy. Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within less than LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M×N crossbar are both proportional to M + N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or log2 N. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells M × N by the average path conductance, which is proportional to 1/(M + N). Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus found to be proportional to (M × N)(M + N) log2 N.

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by drivers; consequently, the total crossbar area grows as M(M + N).
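These proportionalities can be collected into a small sizing model; the sketch below captures only the scaling trends derived above, with arbitrary placeholder constants k:

```python
# A minimal sketch of the scaling relations from Sections V-A1 and V-A2
# (proportionalities only; every k is an arbitrary placeholder constant).
from math import log2

def adc_energy(M, N, k=1.0):
    return k * M * N * log2(N)               # per-operation ADC energy

def crossbar_energy(M, N, k=1.0):
    return k * (M * N) * (M + N) * log2(N)   # latency x power

def adc_area(N, k=1.0):
    return k * N                              # exponential in resolution ~ N

def crossbar_area(M, N, k=1.0):
    return k * M * (M + N)                    # dominated by the drivers
```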

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. By contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12 × 12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates, and for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), the elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold; if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are completely removed from the mapping and are instead processed separately by the local processor.

Figure 7. Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity, (b) Pres_Poisson blocking, (c) Xenon1 sparsity, (d) Xenon1 blocking.

Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4 × NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8 × NNZ.
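A minimal sketch of this greedy pass follows, assuming a coordinate-format matrix keyed by (row, column) and illustrative per-size NNZ thresholds; the full scheme trims out-of-range elements rather than rejecting the block, which is simplified away here:

```python
# A minimal sketch of the greedy block-mapping pass; thresholds are
# placeholders, not the dimension-dependent values used in the evaluation.
BLOCK_SIZES = (512, 256, 128, 64)
NNZ_THRESHOLD = {512: 4000, 256: 1500, 128: 500, 64: 150}  # placeholders
MAX_EXP_RANGE = 117

def candidates(coo, size):
    # Group the remaining non-zeros by aligned size x size tiles.
    tiles = {}
    for (r, c), e in coo.items():
        tiles.setdefault((r // size * size, c // size * size), {})[(r, c)] = e
    return tiles

def map_blocks(coo):   # coo: dict {(row, col): exponent}
    unmapped, accepted = dict(coo), []
    for size in BLOCK_SIZES:                     # largest blocks first
        for (r0, c0), block in candidates(unmapped, size).items():
            exps = block.values()
            if max(exps) - min(exps) > MAX_EXP_RANGE:
                continue   # (the full scheme trims outlier elements instead)
            if len(block) >= NNZ_THRESHOLD[size]:
                accepted.append((r0, c0, size, block))
                for key in block:
                    del unmapped[key]
    return accepted, unmapped   # leftovers go to the local processor
```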

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding. Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows N. To handle this extreme case, the ADC would require ⌈log2[N + 1]⌉ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log2 N is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to log2[N] − 1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the RLO state), which in turn reduces crossbar energy.
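A sketch of CIC at the granularity of one crossbar column (which stores a bit slice of a matrix row) is shown below; the helper names are illustrative:

```python
# A minimal sketch of computational invert coding for one crossbar column,
# guaranteeing at most N/2 stored ones per column.
def cic_encode(col_bits):
    n = len(col_bits)
    invert = sum(col_bits) > n // 2
    stored = [b ^ invert for b in col_bits]   # invert dense columns
    return stored, invert

def cic_dot(stored, invert, x_slice):
    partial = sum(s & x for s, x in zip(stored, x_slice))
    if invert:
        # Recover the non-inverted result as pv - partial, where pv is the
        # population count of the applied vector bit slice.
        partial = sum(x_slice) - partial
    return partial

col = [1, 1, 1, 0]                 # dense column: gets inverted
x = [1, 0, 1, 1]
stored, inv = cic_encode(col)
assert cic_dot(stored, inv, x) == sum(c & xv for c, xv in zip(col, x))
```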

ADC headstart. The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most ⌈log2[d_row × 0.5 × N]⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.
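The effect on conversion time can be sketched as follows, where `sar_conversion_steps` is an illustrative model of how many comparison cycles the binary search needs when it starts at the highest producible bit:

```python
# A minimal sketch of the ADC headstart: the binary search starts from the
# highest bit the column can actually produce, not the full-scale MSb.
from math import ceil, log2

def sar_conversion_steps(max_column_output, resolution):
    # Bits above the largest representable output are skipped; each skipped
    # bit saves one comparison cycle (and thus conversion energy).
    start_bit = ceil(log2(max_column_output + 1))
    return min(start_bit, resolution)

# A column with only 6 ones needs at most ceil(log2(7)) = 3 comparison
# steps on a 9-bit ADC, rather than 9.
assert sar_conversion_steps(6, 9) == 3
```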

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

To support Krylov subspace solvers, we implement the three primary kernels needed by the solver.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.
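For reference, a standard CG iteration can be phrased entirely in terms of these three kernels. The sketch below abstracts the bank-level distribution and barrier synchronization behind `spmv`, `dot`, and `axpy` (illustrative serial stand-ins, with axpy(a, x, y) returning ax + y):

```python
# A minimal CG sketch over the three kernels described above.
def cg(spmv, dot, axpy, b, x, max_iter=1000, tol=1e-10):
    r = axpy(-1.0, spmv(x), b)                 # r = b - A x
    p, rs = list(r), dot(r, r)
    for _ in range(max_iter):
        Ap = spmv(p)
        alpha = rs / dot(p, Ap)
        x = axpy(alpha, p, x)                  # x += alpha p
        r = axpy(-alpha, Ap, r)                # r -= alpha Ap
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            break
        p = axpy(rs_new / rs, p, r)            # p = r + beta p
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
spmv = lambda v: [sum(aij * vj for aij, vj in zip(row, v)) for row in A]
dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
axpy = lambda a, xs, ys: [a * xi + yi for xi, yi in zip(xs, ys)]
print(cg(spmv, dot, axpy, b=[1.0, 2.0], x=[0.0, 0.0]))  # ~[0.0909, 0.6364]
```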

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node using process parameters and design rules from FreePDK15 [38], coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I
ACCELERATOR CONFIGURATION

System:   (128) banks, double-precision floating point, fclk = 1.2 GHz, 15nm process, VDD = 0.80 V
Bank:     (2) 512×512 clusters, (4) 256×256 clusters, (6) 128×128 clusters, (8) 64×64 clusters, 1 LEON core
Cluster:  127 bit-slice crossbars
Crossbar: N×N cells, (log2[N] − 1)-bit pipelined SAR ADC, 2N drivers
Cell:     TaOx [18], [40], Ron = 2 kΩ, Roff = 3 MΩ, Vread = 0.2 V, Vset = −2.6 V, Vreset = 2.6 V, Ewrite = 3.91 nJ, Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits based on [43], sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point; 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. Area, power, and the internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU for synthesis, we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGen [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix            NNZs     Rows    NNZ/Row  Blocked (%)
2cubes_sphere     1647264  101492  16.2     49.7
crystm03          583770   24696   23.6     94.7
finan512          596992   74752   7.9      46.7
G2_circuit        726674   150102  4.8      60.9
nasasrb           2677324  54870   48.8     99.1
Pres_Poisson      715804   14822   48.3     96.4
qa8fm             1660579  66127   25.1     92.8
ship_001          3896496  34920   111.6    66.4
thermomech_TC     711558   102158  7.0      0.8
Trefethen_20000   554466   20000   27.7     63.3
ASIC_100K         940621   99340   9.5      60.9
bcircuit          375558   68902   5.4      64.9
epb3              463625   84617   5.5      72.2
GaAsH6            3381809  61349   55.1     69.2
ns3Da             1679599  20414   82.3     3.2
Si34H36           5156379  97569   52.8     53.7
torso2            1033473  115697  8.9      98.1
venkat25          1717792  62424   27.5     79.8
wang3             177168   26064   6.8      64.6
xenon1            1181120  48600   24.3     81.0

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], using 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running

Figure 8. Speedup over the GPU baseline (log scale, per matrix, with the geometric mean, G-MEAN).

on the GPU, since both systems perform computation at the same level of precision.

VIII. EVALUATION

We evaluate the system performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× as compared to the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, and the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row, showing above-average but not exceptional speedup due to their lower blocking efficiencies (i.e., the percent of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, in Table II there are two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node and that the GPU may be used for the rare matrices which do not block effectively. Since the blocking algorithm has a worst-case performance of four MVM operations (Section V-B1), and since the actual complexity reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than

Figure 9. Accelerator energy consumption normalized to the GPU baseline (log scale, per matrix, with the geometric mean, G-MEAN).

3% for both matrices, far better than if the accelerator were used directly.

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impacts of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: Pres_Poisson never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices that are needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general the average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry are the dominant area consumer (rather than the ADCs), with a total of 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2),

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size  Area [mm²]  Energy [pJ]  Latency [ns]
64    0.00078     2.80         5.33
128   0.00103     6.52         10.7
256   0.00162     15.0         21.3
512   0.00352     34.2         42.7

Figure 10. Overhead of preprocessing and write time as a percent of total solve time on the accelerator.

the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network. The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing system area. Table III summarizes the area, as well as the latency and energy characteristics, for the different crossbar sizes used in different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set, and generally falling as the size of the linear system increases. By extension, the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even less in practice, as multiple time steps are generally simulated to examine the time-evolution of the linear system. In these time-stepped computations, only a subset of non-zeros change each step, and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions,

Figure 11. Sparsity and blocking patterns of ns3Da: (a) ns3Da sparsity, (b) ns3Da blocking.

where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative 10^9 writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that despite the high relative density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, with the third-highest NNZs per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks, and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were to be blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in

Figure 12. Iteration count as a function of bits per cell (B = 1, 2) and dynamic range (D = 0.75K, 1.5K, 3K), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

current summation. This, in turn, introduces error into the computation, which tends to hinder convergence rates.

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors to determine how results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the general dependence becomes stronger as the number of bits per cell is increased. High programming error with the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone, fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Adv. High Energy Phys., vol. 2014, 2014.
[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.
[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.
[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.
[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.
[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf
[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.
[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.
[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.
[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.
[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.
[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.
[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html
[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.
[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.
[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.
[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.
[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.
[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.
[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.
[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.
[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.
[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.
[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.
[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.
[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.
[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.
[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.
[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.
[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.
[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.
[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.
[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.
[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.
[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.
[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.
[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.
[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.
[39] Predictive technology model (PTM). [Online]. Available: http://ptm.asu.edu
[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.
[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.
[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.
[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.
[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.
[45] Leon3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib
[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.
[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.
[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages.
[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.
[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.
[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.
[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.
[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs: an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.
[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.
[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.
[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.
[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."
[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.

Page 6: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

Stable Region

X hellip X

Potential carry out

+

Barrier Bit

0

Carry Region

Aligned Region

X X 0X X X

1 hellip 1 X Running sum

Next partial dot product

Figure 5 Regions within a running sum

partial product still overlaps with the four-bit mantissa it isstill possible for the mantissa to be affected by subsequentadditions At a minimum the computation must proceeduntil the mantissa in the running sum has cleared the overlapwith the next partial product This occurs in Figure 4chowever at that point it is still possible for a carry from thesubsequent partial products to propagate into and modify themantissa Notably if the accumulation is halted at any pointthe sum of the remaining partial products would produce atmost one carry out into the running sum Hence in additionto clearing the overlap safe termination requires that a 0less significant than the mantissa be generated such that itwill absorb the single potential carry out preventing any bitflips within the mantissa (Figure 4d)

At any given time the running sum can be split intofour non-overlapping regions as shown in Figure 5 Thealigned region is the portion of the running sum thatoverlaps with the subsequent partial products A chain of1s more significant than the aligned region is called thecarry region as it can propagate the single generated carryout forward This carry if present is absorbed by a barrierbit protecting the more significant bits in the stable regionfrom bit flips Thus the running sum calculation canbe terminated without loss of mantissa precision assoon as the full mantissa is contained in the stable region

Scheduling array activations Even with the early termina-tion technique discussed above there is still a possibility forextraneous crossbar operations to be performed Althougheach partial dot product takes up to 127 bits the finalmantissa is truncated to 52 bits following the leading 1requiring fewer than 127 bits to be computed in most casesTherefore some subset of the 127-bit partial dot productmay be wasted depending on the schedule according towhich vector bit slices are applied to the crossbars

Figure 6 shows three approaches to scheduling the cross-bar activations In the figure rows and columns respec-tively represent the matrix and vector bit slices sortedby significance The numbers at the intersection of eachrow and column represent the significance of the resultingpartial product Different groupings indicate which bit-slicedmatrix-vector multiplications are scheduled to occur simul-taneously the black colored groupings must be performedwhereas the gray colored groupings may be skipped dueto early termination Hence the number of black groupings

1 0123

1

023

4

5

3 24 1

6 5

4

3

4

3

2

3

2

6

5

4

5

4

3

4

3

2

3

2

1

3 2 1 0

6

5

3

5

4

2

4

3

1

3

2

0

3 2 1 0

3

2

1

0

Mat

rix S

lice

3 2 1 0Vector Slice

Activations Skipped Activations PerformedVertical Grouping Diagonal Grouping Hybrid Grouping

Figure 6 Three policies for scheduling crossbar activations

correlates with execution time and the number of crossbarsenclosed by the black groupings correlates with crossbaractivation energy The example shown in the figure assumesthat the calculation can be terminated early after the fifthmost significant bit of the mantissa is computed (That isall groupings computing a partial product with significanceof at least 2 must be performed)

In the leftmost case all of the crossbars are activated withthe same vector bit slice after which the next significantvector bit slice is applied to the crossbars This naıve verticalgrouping requires 16 crossbar activations performed overfour time steps In the next example arrays are activatedwith different vector bit slices such that each active arrayis contributing to the same slice of the output vector si-multaneously This diagonal grouping results in 13 crossbaractivationsmdashthe minimum required in the examplemdashoverfive time steps In general this policy takes the minimumnumber of crossbar activations at the cost of additionallatency

The hybrid grouping shown on the right which we adoptin the evaluation strikes a balance between the respectiveenergy and latency advantages of the diagonal and verticalgroupings The result is 14 crossbar activations performedover four time steps The more closely the hybrid groupingapproximates a diagonal grouping the greater the energysavings at the cost of latency

Taken together the above techniques significantly reducethe overhead of performing floating-point computation onfixed-point hardware by restricting computations to onlythose that are likely to produce a bit needed for the finalmantissa The techniques render the latency and energyconsumption strongly data dependent effectively forming adata-specific subset of the floating point format without anyloss in precision

C Handling Negative Numbers

The previously proposed ISAAC memristive accelera-tor [8] uses a biasing scheme to handle negative numbersThe approach adds a bias of 216 to each 16-bit operandsuch that the values range from 0 to 216 rather than plusmn215Results are de-biased by subtracting 216pv where pv is thepopulation count (ie the number of bits set to 1) of therelevant bit slice of the vector We adopt this technique with

one modification in place of the constant 216 we use aper-block biasing constant based on the actual floating-pointexponent range contained within the block

D Floating Point Anomalies and Rounding Modes

The IEEE-754 floating point standard specifies a numberof floating point exceptions and rounding modes that mustbe handled correctly for proper operation [31]

Rounding modes When the proposed accelerator isoperated in its typical configuration mantissa alignmentand leading 1 detection result in a truncation of the resultto the required precision The potential contributions ofany uncalculated bits beyond the least-significant bit wouldonly result in a mantissa of greater magnitude than whatis ultimately obtained and since a biasing scheme is usedto enable the representation of negative values (SectionIV-C) the truncation is equivalent to rounding the resultof a dot product toward negative infinity For applicationsrequiring other rounding modes (eg to nearest towardpositive infinity toward zero) the accelerator can beconfigured to compute three additional settled bits beforetruncation and perform the rounding according to those bits

Infinities and NaNs Signed infinities and not-a-numbers(NaNs) are valid floating point values however they cannotbe mapped to the crossbars in a meaningful way in theproposed accelerator nor can they be included in any dotproduct computations without producing a NaN infinity orinvalid result [31] Therefore the accelerator requires allinput matrices and vectors to contain no infinities or NaNsand any infinities or NaNs arising from an intermediatecomputation in the local processor are handled there toprevent them from propagating to the resulting dot product

Floating point exceptions. Operations with floating point values may generally lead to underflow, overflow, invalid operation, and inexact exceptions [31]. Since in-situ dot products are performed on mantissas aligned with respect to relative exponents, there are no upper or lower bounds on resulting values. (Only the range of exponents is bounded.) This precludes the possibility of an overflow or underflow occurring while values are in an intermediate representation. However, when the result of a dot product is converted from the intermediate floating point representation into IEEE-754, it is possible that the required exponent will be outside of the available range. In this case, the exponent field of the resulting value will be respectively set to all 1s or all 0s for overflow and underflow conditions. Similarly to CUDA, computations resulting in an overflow, underflow, or inexact arithmetic do not cause a trap [33]. Since the accelerator can only accept matrices and input vectors with finite numerical values, and since the memristive substrate is used only for computing dot products, invalid operations (e.g., $0 \div 0$, $\sqrt{-|c|}$) and operations resulting in NaNs or infinities can only occur within the local processor, where they are handled through an IEEE-754 compliant floating point unit.

E. Detecting and Correcting Errors

Prior work by Feinberg et al. proposes an error-correction scheme for memristive neural network accelerators based on AN codes [29]. We adopt this scheme with three modifications, taking into consideration the differences between neural networks and scientific computing. First, as noted in Section I, scientific computing has higher precision requirements than neural networks; thus, the proposed accelerator uses only 1-bit cells, making it more robust. Second, the matrices for scientific computing are typically sparse, whereas the matrices in neural networks tend to be dense; this leads to lower error rates under the RTN-based error model of Feinberg et al., allowing errors to be corrected with greater than 99.99% accuracy. However, the use of arrays larger than those in prior work introduces new sources of error: if the array size is comparable to the dynamic range of the memristors, the non-zero current in the R_HI state may cause errors when multiplying a low-density matrix row with a high-density vector. To avert this problem, we limit the maximum crossbar block size in a bank (discussed in the next section) to 512×512, with a memristor dynamic range of $1.5 \times 10^3$. Third, since the floating-point operands expand to as many as 118 bits during the conversion from floating- to fixed-point, we do not use multi-operand coding, but rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit for detection. The correction is applied after the reduction but before the leading-one detection.
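As an illustration of why AN codes suit this datapath, the sketch below (Python; the operand values are arbitrary) shows that multiplying each operand by A = 251 is preserved under addition, so an entire reduction can be checked with a single modulo test:

    A = 251  # the AN-code constant used above

    def encode(n):
        return A * n

    def is_valid(c):
        return c % A == 0

    operands = [3, -17, 42]
    total = sum(encode(n) for n in operands)
    assert is_valid(total) and total // A == sum(operands)
    # a single bit flip in the accumulated sum is detected:
    assert not is_valid(total ^ (1 << 7))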

V. HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage when performing computation with large and dense matrices, as the degree of parallelism obtained by computing in the analog domain is proportional to the number of elements in the matrix, up to the size of the matrix M × N. As noted in Section II-B, however, the majority of the matrices of interest are sparse. This reduces the effective parallelism, making it a challenge to achieve high throughput.

A. Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM system is constrained by the sizes of its crossbars: as crossbar size increases, so does the peak throughput of the system. This peak is rarely achieved, however, since the sparsity pattern of each matrix also limits the number of elements that may be mapped to a given block.

We define the throughput of the system as the number of effective element-wise operations in a single-cluster MVM computation divided by the latency of that computation, $\tau_{MVM}$.

As the size of a crossbar increases, greater throughput is naïvely implied, since the size of the matrix block representable within the crossbar (and consequently the number of sums-of-products resulting from a cluster operation) also increases. However, since only non-zero matrix elements contribute to an MVM operation, effective throughput increases only when the number of non-zero (NNZ) elements increases, where NNZ depends on both the block size M × N and the block density $d_{block}$. Actual throughput for a given block is therefore $d_{block} \times M \times N / \tau_{MVM}$. Thus, to achieve high throughput and efficiency, large and dense blocks are generally preferable to small or sparse blocks.
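For intuition, the following sketch (Python; the block sizes and densities are made-up values) compares the effective throughput of a dense small block against a sparse large one, modeling latency as the total conversion time derived later in this section:

    import math

    def mvm_latency(M, N):
        # total conversion time is proportional to M * ceil(log2(N + 1))
        return M * math.ceil(math.log2(N + 1))

    def effective_throughput(M, N, d_block):
        # only the NNZ = d_block * M * N elements do useful work
        return d_block * M * N / mvm_latency(M, N)

    print(effective_throughput(64, 64, 0.30))    # dense small block: ~2.7
    print(effective_throughput(512, 512, 0.02))  # sparse large block: ~1.0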

The number of non-zero elements actually mapped to a crossbar is, in practice, far less than the number of cells in the crossbar, and the trend is for large blocks to be sparser than small blocks. This tendency follows intuitively from the notion that as the block size approaches the size of the matrix, the density of the block approaches the density of the matrix: 0.37% for the densest matrix in the evaluation.

The latency of a memristive MVM operation is also directly related to the size of a crossbar. The maximum possible crossbar column output occurs when all N bits of a column and all N bits of the vector slice are set to 1, resulting in a sum-of-products of $\sum_{i=1}^{N} 1 \times 1 = N$. Therefore, a sufficient analog-to-digital converter (ADC) for a crossbar with N rows must generally have a minimum resolution of $\lceil \log_2(N+1) \rceil$ bits (though this can be reduced, as shown in Section V-B2). For the type of ADC used in the proposed system and in prior work [8], the conversion time per column is strictly proportional to the ADC resolution, which increases with $\log_2 N$, where N is the number of rows, and the number of ADC conversions per binary MVM operation is equal to the number of columns M. Total conversion time is thus proportional to $M \lceil \log_2(N+1) \rceil$.

1) Energy: Crossbar sizing strongly affects the amount of energy used by both the array and the ADC during an MVM operation.

ADC energy. The average power dissipation of the ADC grows exponentially with resolution [34]–[36]. Since resolution is a logarithmic function of the number of rows N, but conversion time is proportional to resolution, ADC energy per column conversion is approximately proportional to $N \log_2 N$. The column count M determines the number of ADC conversions per MVM operation, so the total ADC energy per operation is proportional to $M \times N \log_2 N$.

Crossbar energy. Crossbar latency is defined as the minimum amount of time during which the crossbars must be actively pulling current in order to ensure that all output nodes are within less than LSB/2 of their final voltages, thereby constraining all column outputs to be within the error tolerance range of the ADC. The worst-case path resistance and capacitance of an M × N crossbar are both proportional to M + N. The number of RC time constants necessary to ensure error-free output is proportional to resolution, or $\log_2 N$. While the crossbar is active, power is proportional to the total conductance of the crossbar, which can be found by multiplying the number of cells, M × N, by the average path conductance, which is proportional to $1/(M+N)$. Crossbar energy for a single MVM operation grows with the product of crossbar latency and power, and is thus found to be proportional to $(M \times N)(M+N)\log_2 N$.

2) Area: As with ADC power, ADC area also scales exponentially with resolution [34]–[36], making average ADC area approximately proportional to N. Crossbar area is dominated by drivers; consequently, the total crossbar area grows as M(M+N).
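Collecting the proportionalities derived in this section for an M × N crossbar (a summary of the above, not additional modeling):

$$T_{conv} \propto M\lceil \log_2(N+1)\rceil, \qquad E_{ADC} \propto M \, N \log_2 N,$$
$$E_{xbar} \propto (M N)(M+N)\log_2 N, \qquad A_{ADC} \propto N, \qquad A_{xbar} \propto M(M+N).$$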

B. A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance between parallelism and energy efficiency: a crossbar must be large enough to capture sufficient non-zero elements to exploit parallelism, yet small enough to retain sufficient density for area and energy efficiency. By contrast to the single crossbar size used in prior work [7], [8], [18], the proposed system utilizes crossbars of different sizes, allowing blocks to be mapped to the crossbars best suited to the sparsity pattern of the matrix. This heterogeneous composition makes the system performance robust, yielding high performance and efficiency on most matrices without over-designing for a particular matrix type or sparsity pattern.

The problem of performance-optimal matrix blocking is neither new nor solved in a computationally feasible manner. Prior work seeks to block sparse matrices while maximizing throughput and minimizing fill-in, but focuses on block sizes that are too small to be efficient with memristive crossbars (12 × 12 at most [15]).

1) Mapping a Matrix to the Heterogeneous Substrate: A fast preprocessing step is proposed that efficiently maps an arbitrary matrix to the heterogeneous collection of clusters. The matrix is partitioned into block candidates, and for each candidate block, the number of non-zero elements captured by the block and the range of the exponents within the block are calculated. If the exponent range of the block falls outside of the maximum allowable range (i.e., 117), elements are selectively removed until an acceptable range is attained. The number of non-zero elements remaining is then compared to a dimension-dependent threshold; if the number of elements exceeds the threshold, the candidate block is accepted. If the candidate block is rejected, a smaller block size is considered and the process repeats.

The preprocessing step considers every block size in the system (512×512, 256×256, 128×128, and 64×64) until each element has either been mapped to a block or found to be unblockable. Elements that cannot be blocked efficiently, or that fall outside of the exponent range of an accepted block, are completely removed from the mapping and are instead processed separately by the local processor. Figure 7 shows the non-zero elements and the resulting blocking patterns for two sparse matrices.

Figure 7. Sparsity and blocking patterns of two of the evaluated matrices: (a) Pres_Poisson sparsity, (b) Pres_Poisson blocking, (c) Xenon1 sparsity, (d) Xenon1 blocking.

Since the non-zero elements in the matrix are traversed at most one time for each block size, the worst-case complexity of the preprocessing step is 4 × NNZ. In practice, the early discovery of good blocks brings the average-case complexity to 1.8 × NNZ.
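A simplified sketch of this greedy pass is shown below (Python; the acceptance thresholds and the element-removal policy are illustrative placeholders, since the text only specifies that thresholds are dimension-dependent and the exponent range is capped at 117):

    BLOCK_SIZES = [512, 256, 128, 64]                    # sizes in the system
    MIN_NNZ = {512: 2000, 256: 700, 128: 250, 64: 80}    # hypothetical thresholds
    MAX_EXP_RANGE = 117                                  # maximum allowable range

    def try_block(coords, exponents, origin, size):
        """Return the non-zeros accepted into the block at origin, or None."""
        r0, c0 = origin
        inside = [(r, c) for (r, c) in coords
                  if r0 <= r < r0 + size and c0 <= c < c0 + size]
        # selectively remove elements until the exponent range is acceptable
        inside.sort(key=lambda rc: exponents[rc])
        while inside and exponents[inside[-1]] - exponents[inside[0]] > MAX_EXP_RANGE:
            inside.pop()  # drop an extreme element (removal policy simplified)
        # accept only if enough non-zeros remain; otherwise the caller
        # retries with the next smaller block size
        return inside if len(inside) >= MIN_NNZ[size] else None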

2) Exploiting Sparsity: Though sparsity introduces challenges in terms of throughput, it also presents some opportunities for optimization.

Computational invert coding. Prior work [8] introduces a technique for statically reducing ADC resolution requirements. If a crossbar column contains all 1s, the maximum output for the column is equal to the number of rows, N. To handle this extreme case, the ADC would require $\lceil \log_2(N+1) \rceil$ bits of resolution. Eliminating this case reduces the ADC resolution requirement by one bit. This is achieved by inverting the contents of a column if it contains all 1s. To obtain a result equivalent to a computation with a non-inverted column, the partial dot product is subtracted from the population count of the applied vector bit slice.

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix row is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of $\log_2 N$ is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to $\lceil \log_2 N \rceil - 1$ bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the $R_{LO}$ state), which in turn reduces crossbar energy.
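A bit-level illustration of CIC (Python; the column and vector values are arbitrary) shows that a column stored inverted still yields the exact dot product after subtracting from the population count:

    import numpy as np

    def cic_encode(col):
        """Invert a binary column if more than half of its bits are 1 (CIC)."""
        invert = col.sum() > len(col) // 2
        return (1 - col if invert else col), invert

    def cic_column_dot(col, vec_bits):
        stored, inverted = cic_encode(col)
        partial = int(stored @ vec_bits)              # what the crossbar computes
        if inverted:                                  # recover the true result:
            partial = int(vec_bits.sum()) - partial   # popcount minus partial
        return partial

    col = np.array([1, 1, 1, 0, 1, 1, 1, 1])
    vec = np.array([1, 0, 1, 1, 0, 1, 1, 1])
    assert cic_column_dot(col, vec) == int(col @ vec)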

ADC headstart. The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most $\lceil \log_2(d_{row} \times 0.5 \times N) \rceil$ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.
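The starting bit position can be estimated as below (Python; `d_row` is the row density, and the helper name is ours):

    import math

    def headstart_msb(d_row, N):
        # the expected maximum column output with density d_row and random
        # vector bits is about d_row * 0.5 * N, so the SAR search can begin
        # at ceil(log2(.)) bits instead of the full resolution
        expected_max = max(2, round(d_row * 0.5 * N))
        return math.ceil(math.log2(expected_max))

    # e.g., a 512-row crossbar at 10% density starts the search at
    # 5 bits rather than 9; the skipped bits save conversion energy
    print(headstart_msb(0.10, 512))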

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.
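For reference, a host-side sketch of conjugate gradients expressed in these three kernels (Python/NumPy; bank partitioning and synchronization are omitted, and the kernel-mapping comments are ours):

    import numpy as np

    def cg(apply_mvm, b, x0, tol=1e-10, max_iter=1000):
        """Conjugate gradients built from the accelerator's three kernels:
        sparse MVM (apply_mvm), AXPY, and dot products."""
        x = x0.copy()
        r = b - apply_mvm(x)           # sparse MVM kernel
        p = r.copy()
        rs = float(r @ r)              # dot-product kernel
        for _ in range(max_iter):
            Ap = apply_mvm(p)          # sparse MVM kernel
            alpha = rs / float(p @ Ap) # dot-product kernel
            x += alpha * p             # AXPY kernel
            r -= alpha * Ap            # AXPY kernel
            rs_new = float(r @ r)      # dot-product kernel
            if rs_new ** 0.5 < tol:
                break
            p = r + (rs_new / rs) * p  # AXPY kernel
            rs = rs_new
        return x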

A. Kernel Implementation

To implement Krylov subspace solvers, we implement the three primary kernels needed by the solver.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.
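The vector map entry can be pictured as the following record (an illustrative Python rendering of the three-element tuple; the field names are ours):

    from typing import NamedTuple

    class VectorMapEntry(NamedTuple):
        base_addr: int     # base address of the cluster's input-vector section
        elem_index: int    # vector element index corresponding to base_addr
        cluster_size: int  # cluster dimension; entries are ordered by this field

    # entries for a bank are sorted so the highest-latency clusters start first
    entries = sorted([VectorMapEntry(0x1000, 0, 128),
                      VectorMapEntry(0x2000, 128, 512)],
                     key=lambda e: -e.cluster_size)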

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row format [15].

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node using process parameters and design rules from FreePDK15 [38], coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I
ACCELERATOR CONFIGURATION

System: (128) banks, double-precision floating point, fclk = 1.2 GHz, 15nm process, VDD = 0.80 V
Bank: (2) 512×512 clusters, (4) 256×256 clusters, (6) 128×128 clusters, (8) 64×64 clusters, 1 LEON3 core
Cluster: 127 bit-slice crossbars
Crossbar: N×N cells, (log2[N] − 1)-bit pipelined SAR ADC, 2N drivers
Cell [18], [40]: TaOx, Ron = 2 kΩ, Roff = 3 MΩ, Vread = 0.2 V, Vset = −2.6 V, Vreset = 2.6 V, Ewrite = 3.91 nJ, Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits, based on [43] and sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz, 10-bit pipelined SAR ADC [34] is used as a reference design point; 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. Area, power, and the internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.
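The resolution-scaling model just described can be sketched as follows (Python; `P_REF` is a placeholder for the reported 10-bit reference power, and the linear-versus-constant split of the residual power is collapsed into a single linear term for brevity):

    B_REF = 10        # resolution of the reference ADC design point
    P_REF = 1.0       # reported power of the reference ADC (placeholder units)

    def adc_power(bits):
        static = 0.20 * P_REF                          # ~20% static power
        expo = 0.07 * P_REF * 2.0 ** (bits - B_REF)    # ~7% scales exponentially
        linear = 0.73 * P_REF * bits / B_REF           # remainder: linear scaling
        return static + expo + linear

    # one fewer bit of resolution (e.g., from CIC) reduces dynamic power:
    print(adc_power(9) / adc_power(10))  # ~0.89 of the reference power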

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU, for synthesis we modify the FPU wrapper to use the fused multiply-add (FMA) timings of an FMA unit generated by FPGEN [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix           NNZs     Rows    NNZ/Row  % Blocked
2cubes_sphere    1647264  101492  16.2     49.7
crystm03         583770   24696   23.6     94.7
finan512         596992   74752   7.9      46.7
G2_circuit       726674   150102  4.8      60.9
nasasrb          2677324  54870   48.8     99.1
Pres_Poisson     715804   14822   48.3     96.4
qa8fm            1660579  66127   25.1     92.8
ship_001         3896496  34920   111.6    66.4
thermomech_TC    711558   102158  6.8      0.8
Trefethen_20000  554466   20000   27.7     63.3
ASIC_100K        940621   99340   9.5      60.9
bcircuit         375558   68902   5.4      64.9
epb3             463625   84617   5.5      72.2
GaAsH6           3381809  61349   55.1     69.2
ns3Da            1679599  20414   82.3     3.2
Si34H36          5156379  97569   52.8     53.7
torso2           1033473  115697  8.9      98.1
venkat25         1717792  62424   27.5     79.8
wang3            177168   26064   6.8      64.6
xenon1           1181120  48600   24.3     81.0

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], with 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator, modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline.

VIII. EVALUATION

We evaluate the system performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× as compared to the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, and the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row yet show above-average but not exceptional speedup, due to their lower blocking efficiencies (i.e., the percent of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, in Table II there are two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node and that the GPU may be used for the rare matrices which do not block effectively. Since the blocking algorithm has a worst-case performance of four MVM operations (Section V-B1), and since the actual complexity reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

Figure 9. Accelerator energy consumption normalized to the GPU baseline.

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: Pres_Poisson never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general, the average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry, rather than the ADCs, are the dominant area consumer, with a total of 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2), the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network. The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing the system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in different clusters.

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size  Area [mm²]  Energy [pJ]  Latency [ns]
64    0.00078     2.80         5.33
128   0.00103     6.52         10.7
256   0.00162     15.0         21.3
512   0.00352     34.2         42.7

Figure 10. Overhead of preprocessing and write time as a percent of total solve time on the accelerator.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set, and generally falling as the size of the linear system increases, because the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even less in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of the non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative endurance of $10^9$ writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.

Figure 11. Sparsity and blocking patterns of ns3Da: (a) ns3Da sparsity, (b) ns3Da blocking.

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with particularly poor blocking efficiency. The figure shows that despite the high relative density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, a relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation. This, in turn, introduces error into the computation, which tends to hinder convergence rates.

Figure 12. Iteration count as a function of bits per cell and dynamic range, normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors to determine how results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the general dependence becomes stronger as the number of bits per cell is increased: higher programming error with the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone, fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Advances in High Energy Physics, vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Schölkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brändli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] LEON3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs: an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.

Page 7: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

one modification in place of the constant 216 we use aper-block biasing constant based on the actual floating-pointexponent range contained within the block

D Floating Point Anomalies and Rounding Modes

The IEEE-754 floating point standard specifies a numberof floating point exceptions and rounding modes that mustbe handled correctly for proper operation [31]

Rounding modes When the proposed accelerator isoperated in its typical configuration mantissa alignmentand leading 1 detection result in a truncation of the resultto the required precision The potential contributions ofany uncalculated bits beyond the least-significant bit wouldonly result in a mantissa of greater magnitude than whatis ultimately obtained and since a biasing scheme is usedto enable the representation of negative values (SectionIV-C) the truncation is equivalent to rounding the resultof a dot product toward negative infinity For applicationsrequiring other rounding modes (eg to nearest towardpositive infinity toward zero) the accelerator can beconfigured to compute three additional settled bits beforetruncation and perform the rounding according to those bits

Infinities and NaNs Signed infinities and not-a-numbers(NaNs) are valid floating point values however they cannotbe mapped to the crossbars in a meaningful way in theproposed accelerator nor can they be included in any dotproduct computations without producing a NaN infinity orinvalid result [31] Therefore the accelerator requires allinput matrices and vectors to contain no infinities or NaNsand any infinities or NaNs arising from an intermediatecomputation in the local processor are handled there toprevent them from propagating to the resulting dot product

Floating point exceptions Operations with floating pointvalues may generally lead to underflow overflow invalidoperation and inexact exceptions [31] Since in-situ dotproducts are performed on mantissas aligned with respect torelative exponents there are no upper or lower bounds onresulting values (Only the range of exponents is bounded)This precludes the possibility of an overflow or underflowoccurring while values are in an intermediate representationHowever when the result of a dot product is converted fromthe intermediate floating point representation into IEEE-754it is possible that the required exponent will be outsideof the available range In this case the exponent field ofthe resulting value will be respectively set to all 1s orall 0s for overflow and underflow conditions Similarly toCUDA computations resulting in an overflow underflowor inexact arithmetic do not cause a trap [33] Since theaccelerator can only accept matrices and input vectors withfinite numerical values and since the memristive substrateis used only for computing dot products invalid operations

(eg 0 0radicminus|c|) and operations resulting in NaNs or

infinities can only occur within the local processor wherethey are handled through an IEEE-754 compliant floatingpoint unit

E Detecting and Correcting Errors

Prior work by Feinberg et al proposes an error-correctionscheme for memristive neural network accelerators based onAN codes [29] We adopt this scheme with three modifi-cations taking into consideration the differences betweenneural networks and scientific computing First as noted inSection I scientific computing has higher precision require-ments than neural networks thus the proposed accelera-tor uses only 1-bit cells making it more robust Secondthe matrices for scientific computing are typically sparsewhereas the matrices in neural networks tend to be densethis leads to lower error rates using the RTN-based errormodel of Feinberg et al allowing errors to be corrected withgreater than 9999 accuracy However the use of arrayslarger than used in prior work introduces new sources oferror If the array size is comparable to the dynamic rangeof the memristors the non-zero current in the RHI statemay cause errors when multiplying a low density matrixrow with a high density vector To avert this problem welimit the maximum crossbar block size in a bank (discussedin the next section) to 512times512 with a memristor dynamicrange of 15times 103 Third since the floating-point operandsexpand to as many as 118 bits during the conversion fromfloating- to fixed point we do not use multi-operand codingbut rather apply an A = 251 code that protects a 118-bit operand with eight bits for correction and one bit fordetection The correction is applied after the reduction butbefore the leading-one detection

V HANDLING AND EXPLOITING SPARSITY

An in-situ MVM system has an inherent advantage whenperforming computation with large and dense matrices asthe degree of parallelism obtained by computing in theanalog domain is proportional to the number of elementsin the matrix up to the size of the matrix M times N Asnoted in Section II-B however the majority of the matricesof interest are sparse This reduces the effective parallelismmaking it a challenge to achieve high throughput

A Tradeoffs in Crossbar Sizing

The potential performance of a memristive MVM systemis constrained by the sizes of its crossbars as crossbar sizeincreases so does the peak throughput of the system Thispeak is rarely achieved however since the sparsity patternof each matrix also limits the number of elements that maybe mapped to a given block

We define the throughput of the system as the number ofeffective element-wise operations in a single-cluster MVMcomputation divided by the latency of that computation

τMV M

As the size of a crossbar increases greater throughputis naıvely implied since the size of the matrix block repre-sentable within the crossbarmdashand consequently the numberof sums-of-products resulting from a cluster operationmdashalsoincreases However since only non-zero matrix elementscontribute to an MVM operation effective throughput in-creases only when the number of non-zero (NNZ) elementsincreases where NNZ depends on both block size M timesN and block density dblock Actual throughput for a givenblock is therefore dblock timesM timesNτMV M

Thus to achievehigh throughput and efficiency large and dense blocks aregenerally preferable to small or sparse blocks

The number of non-zero elements actually mapped to acrossbar is in practice far less than the number of cells inthe crossbar and the trend is for large blocks to be sparserthan small blocks This tendency follows intuitively fromthe notion that as the block size approaches the size of thematrix the density of the block approaches the density ofthe matrixmdash037 for the densest matrix in the evaluation

The latency of a memristive MVM operation is alsodirectly related to the size of a crossbar The maximumpossible crossbar column output occurs when all N bits ofa column and all N bits of the vector slice are set to 1resulting in a sum-of-products of

sumN1 1times1 = N Therefore

a sufficient analog to digital converter (ADC) for a crossbarwith N rows must generally have a minimum resolution ofdlog2[N +1]e bits (though this can be reduced as shown inSection V-B2) For the type of ADC used in the proposedsystem and in prior work [8] conversion time per columnis strictly proportional to ADC resolution which increaseswith log2N where N is the number of rows and the numberof ADC conversions per binary MVM operation is equal tothe number of columns M Total conversion time is thusproportional to Mdlog2[N + 1]e

1) Energy Crossbar sizing strongly affects the amountof energy used by both the array and the ADC during anMVM operation

ADC energy The average power dissipation of the ADCgrows exponentially with resolution [34]ndash[36] Since res-olution is a logarithmic function of the number of rowsN but conversion time is proportional to resolution ADCenergy per column conversion is approximately proportionalto N log2N The column count M determines the numberof ADC conversions per MVM operation so the total ADCenergy per operation is proportional to M timesN log2N

Crossbar energy Crossbar latency is defined as theminimum amount of time during which the crossbars mustbe actively pulling current in order to ensure that all outputnodes are within less than LSB

2 of their final voltages therebyconstraining all column outputs to be within the errortolerance range of the ADC The worst-case path resistanceand capacitance of an MtimesN crossbar are both proportionalto M + N The number of RC time constants necessaryto ensure error-free output is proportional to resolution or

log2N While the crossbar is active power is proportionalto the total conductance of the crossbar which can be foundby multiplying the number of cells M timesN by the averagepath conductance which is proportional to 1

M+N Crossbarenergy for a single MVM operation grows with the productof crossbar latency and power and is thus found to beproportional to (M timesN)(M +N) log2N

2) Area As with ADC power ADC area also scalesexponentially with resolution [34]ndash[36] making averageADC area approximately proportional to N Crossbar area isdominated by drivers consequently the total crossbar areagrows as M(M +N)

B A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance be-tween parallelism and energy efficiency a crossbar mustbe large enough to capture sufficient non-zero elements toexploit parallelism yet small enough to retain sufficient den-sity for area and energy efficiency By contrast to the singlecrossbar size used in prior work [7] [8] [18] the proposedsystem utilizes crossbars of different sizes allowing blocksto be mapped to the crossbars best suited to the sparsitypattern of the matrix This heterogeneous composition makesthe system performance robust yielding high performanceand efficiency on most matrices without over-designing fora particular matrix type or sparsity pattern

The problem of performance-optimal matrix blocking isneither new nor solved in a computationally feasible mannerPrior work seeks to block sparse matrices while maximizingthroughput and minimizing fill-in but focuses on block sizesthat are too small to be efficient with memristive crossbars(12times 12 at most [15])

1) Mapping a Matrix to the HeterogeneousSubstrate A fast preprocessing step is proposed that ef-ficiently maps an arbitrary matrix to the heterogeneouscollection of clusters The matrix is partitioned into blockcandidates and for each candidate block the number of non-zero elements captured by the block and the range of theexponents within the block are calculated If the exponentrange of the block falls outside of the maximum allowablerange (ie 117) the elements are selectively removed untilan acceptable range is attained The number of non-zero ele-ments remaining is then compared to a dimension-dependentthreshold if the number of elements exceeds the thresholdthe candidate block is accepted If the candidate block isrejected a smaller block size is considered and the processrepeats

The preprocessing step considers every block size in thesystem (512times512 256times256 128times128 and 64times64) untileach element has either been mapped to a block or found tobe unblockable Elements that cannot be blocked efficientlyor that fall outside of the exponent range of an acceptedblock are completely removed from the mapping and areinstead processed separately by the local processor Figure

(a) Pres Poisson sparsity (b) Pres Poisson blocking (c) Xenon1 sparsity (d) Xenon1 blockingFigure 7 Sparsity and blocking patterns of two of the evaluated matrices

7 shows the non-zero elements and the resulting blockingpatterns for two sparse matrices

Since the non-zero elements in the matrix are traversed atmost one time for each block size the worst-case complexityof the preprocessing step is 4timesNNZ In practice the earlydiscovery of good blocks brings the average-case complexityto 18timesNNZ

2) Exploiting Sparsity Though sparsity introduces chal-lenges in terms of throughput it also presents some oppor-tunities for optimization

Computational invert coding Prior work [8] introducesa technique for statically reducing ADC resolution require-ments If a crossbar column contains all 1s the maximumoutput for the column is equal to the number of rowsN To handle this extreme case the ADC would requiredlog2[N + 1]e bits of resolution Eliminating this casereduces the ADC resolution requirement by one bit This isachieved by inverting the contents of a column if it containsall 1s To obtain a result equivalent to a computation witha non-inverted column the partial dot product is subtractedfrom the population count of the applied vector bit slice

We extend this technique to computational invert coding (CIC), a computational analog to the widely used bus-invert coding [37]. If the number of ones in a binary matrix column is less than N/2, the required resolution of the ADC is again decreased by one bit. By inverting any column mapping that would result in greater than 50% bit density, a static maximum of N/2 1s per crossbar column is guaranteed. Notably, if the column has exactly N/2 1s, a resolution of log2(N) is still required. Since the proposed system allows for removing arbitrary elements from a block for processing on the local processor, this corner case can be eliminated. Thus, for any block, even in matrices approaching 100% density, the ADC resolution can be statically reduced to ⌈log2(N)⌉ − 1 bits. CIC also lowers the maximum conductance of the crossbars (since at most half of the cells can be in the RLO state), which in turn reduces crossbar energy.
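
A small software model of the CIC bookkeeping may make the correction step concrete. This is an illustrative sketch (the real system does the dot product in the analog array and the correction in the digital reduction logic); the helper names are assumptions.

#include <stdint.h>
#include <stdbool.h>

/* Decide at mapping time whether to store the column inverted, so that
 * at most N/2 cells end up in the low-resistance (logic 1) state. */
bool should_invert(const uint8_t *col, int n) {
    int ones = 0;
    for (int i = 0; i < n; i++) ones += col[i];
    return ones > n / 2;
}

/* Dot product of a stored (possibly inverted) column with one vector bit
 * slice; 'inverted' is the per-column flag kept with the mapping. */
int column_dot(const uint8_t *stored, const uint8_t *vslice, int n,
               bool inverted) {
    int acc = 0, pop = 0;
    for (int i = 0; i < n; i++) {
        acc += stored[i] & vslice[i];  /* what the ADC digitizes           */
        pop += vslice[i];              /* population count of the bit slice */
    }
    /* If inverted, sum((1-c)*v) = popcount(v) - sum(c*v), so recover the
     * true partial dot product by subtracting from the population count. */
    return inverted ? pop - acc : acc;
}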

ADC headstart: The ADCs that are commonly used in memristive accelerators [7], [8] perform a binary search over the space of possible results, with the MSb initially set to 1. However, the maximum output of each column is constrained by the number of ones mapped to that column. Matrix rows tend to produce at most ⌈log2(d_row × 0.5 × N)⌉ bits of output, assuming that 1s and 0s are equally likely. Since the actual number of output bits is data dependent, this tendency cannot be leveraged to statically reduce ADC precision beyond what is attainable with CIC. Instead, the ADC is pre-set to the maximum number of bits that can be produced by the subsequent conversion step. This approach allows the ADC to start from the most significant possible output bit position, and reduces the amount of time required to find the correct output value by an amount proportional to the number of bits skipped. Notably, this does not affect operation latency, since the reduction logic is fully synchronous and assumes a full clock period for conversion; however, by decreasing the time spent on conversion, the energy of the MVM operation is reduced.
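
A toy model of the headstart, under the assumption that the comparator can be abstracted as a callback returning nonzero when the sampled column output is at least the trial level; with a statically known output bound max_out, every bit above the highest possibly-set position must be zero, so the search loop simply starts lower and spends fewer conversion cycles.

int sar_convert(int (*compare)(int), int max_out) {
    int top = 0;                          /* highest bit that can be set */
    while ((2 << top) <= max_out) top++;  /* bits above 'top' must be 0  */
    int code = 0;
    for (int b = top; b >= 0; b--) {      /* classic SAR loop, started low */
        int trial = code | (1 << b);
        if (compare(trial)) code = trial;
    }
    return code;                          /* cycles saved ~ bits skipped */
}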

VI. PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity, as they are a widely used application involving sparse MVM; however, the proposed accelerator is not limited to Krylov subspace solvers. These solvers are built from three computational kernels: a sparse-matrix dense-vector multiply, a dense-vector sum or AXPY (y ← ax + y), and a dense-vector dot product (z = x · y). To split the computation between the banks, each bank is given a 1200-element section of the solution vector x; only that bank performs updates on the corresponding section of the solution vector and the portions of the derived vectors. For simplicity, this restriction is not enforced by hardware and must be followed by the programmer for correctness.
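
To show how a full solver decomposes into exactly these three kernels, the host-side sketch below expresses the standard CG recurrence [21] in terms of spmv(), axpy(), and dot(), which stand in for the bank-distributed operations described in the next subsection; the p-vector update (p = r + βp) is a second AXPY-like local operation. All names are illustrative.

#include <stdlib.h>

void spmv(int n, const void *A, const double *x, double *y); /* y = A*x  */
void axpy(int n, double a, const double *x, double *y);      /* y += a*x */
double dot(int n, const double *x, const double *y);         /* x . y    */

void cg_solve(int n, const void *A, const double *b, double *x,
              double tol, int max_iter) {
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    spmv(n, A, x, Ap);                      /* initial residual r = b - A*x */
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    for (int k = 0; k < max_iter && rr > tol * tol; k++) {
        spmv(n, A, p, Ap);                  /* in-situ sparse MVM           */
        double alpha = rr / dot(n, p, Ap);
        axpy(n,  alpha, p,  x);             /* x = x + alpha*p              */
        axpy(n, -alpha, Ap, r);             /* r = r - alpha*Ap             */
        double rr_next = dot(n, r, r);
        double beta = rr_next / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_next;
    }
    free(r); free(p); free(Ap);
}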

On problems that are too large for a single accelerator, the MVM can be split in a manner analogous to the partitioning on GPUs: each accelerator handles a portion of the MVM, and the accelerators synchronize between iterations.

A. Kernel Implementation

We implement the three primary kernels needed for the Krylov subspace solvers.

1) Sparse MVM: The sparse MVM operation begins by reading the vector map for each cluster. The vector map contains a set of three-element tuples, each of which comprises the base address of a cluster input vector, the vector element index corresponding to that base address, and the size of the cluster. Since the vector applied to each cluster is contiguous, the tuple is used to compute and load all of the vector elements into the appropriate vector buffer. Thereafter, a start signal is written to a cluster status register and the computation begins. Since larger clusters tend to have a higher latency, the vector map entries are ordered by cluster size.
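
One plausible encoding of the vector-map entry and its dispatch loop is sketched below; the paper specifies only the three tuple fields, so the field names, types, and status-register interface are assumptions. Entries are assumed pre-sorted largest-cluster-first so the longest-latency clusters start earliest.

#include <stdint.h>
#include <string.h>

typedef struct {
    double   *cluster_base;  /* base of the cluster's input vector buffer      */
    uint32_t  vec_index;     /* index into x corresponding to that base        */
    uint32_t  cluster_size;  /* number of vector elements the cluster consumes */
} VectorMapEntry;

void start_sparse_mvm(const VectorMapEntry *map, int n_entries,
                      const double *x, volatile uint32_t *status) {
    for (int e = 0; e < n_entries; e++) {
        /* Contiguous slice of x, so one copy loads the whole buffer. */
        memcpy(map[e].cluster_base, &x[map[e].vec_index],
               map[e].cluster_size * sizeof(double));
        status[e] = 1;       /* start signal in the cluster status register */
    }
}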

Once all cluster operations have started, the processor begins operating on the non-blocked entries. As noted in the previous section, some values within the sparse matrix are not suitable for mapping onto memristive hardware, either because they cannot be blocked at a sufficient density or because they would exceed the exponent range. These remaining values are stored in the compressed sparse row (CSR) format [15].
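
For concreteness, a minimal CSR sparse MVM of the kind the local processor would run over these leftover elements is shown below; this is the standard textbook kernel, not code from the paper, and it accumulates into y so its partial sums combine with the crossbar results.

typedef struct {
    int           nrows;
    const int    *row_ptr;  /* nrows+1 offsets into col_idx/vals */
    const int    *col_idx;  /* column index of each stored value */
    const double *vals;     /* the unblocked non-zero values     */
} Csr;

void csr_spmv_accumulate(const Csr *m, const double *x, double *y) {
    for (int r = 0; r < m->nrows; r++)
        for (int j = m->row_ptr[r]; j < m->row_ptr[r + 1]; j++)
            y[r] += m->vals[j] * x[m->col_idx[j]];  /* add to crossbar partials */
}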

When a cluster completes, an interrupt is raised and serviced by an interrupt service routine running on the local processor. Once all clusters signal completion and the local processor finishes processing the non-blocked elements, the bank has completed its portion of the sparse MVM. The local processors for different banks rely on a barrier synchronization scheme to determine when the entire sparse MVM is complete.

2) Vector Dot Product: To perform a vector dot product, each processor computes the relevant dot product from its own vector elements. Notably, due to vector element ownership of the solution vector x and all of the derived vectors, the dot product is performed using only those values that belong to a local processor. Once the processor computes its local dot product, the value is written back to shared memory for all other banks to see. Each bank computes the final product from the set of bank dot products individually, to simplify synchronization.
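
A sketch of this bank-local reduction follows, assuming shared[] is the per-bank scratch area in the global memory buffer and barrier() is the synchronization primitive the text describes; lo..hi delimits the bank's owned section. Letting every bank sum the published partials itself trades a little redundant arithmetic for one fewer synchronization point.

void barrier(void);  /* assumed bank-level synchronization primitive */

double bank_dot(const double *x, const double *y, int lo, int hi,
                double *shared, int bank_id, int n_banks) {
    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += x[i] * y[i];          /* owned elements only              */
    shared[bank_id] = local;           /* publish to shared memory         */
    barrier();                         /* wait for every bank's partial    */
    double total = 0.0;                /* each bank reduces independently, */
    for (int b = 0; b < n_banks; b++)  /* avoiding a second barrier        */
        total += shared[b];
    return total;
}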

3) AXPY: The AXPY operation is the simplest kernel, since it can be performed purely locally. Each vector element is loaded, modified, and written back. Barrier synchronization is used to determine completion across all banks.

VII. EXPERIMENTAL SETUP

We develop a parameterized and modular system model to evaluate throughput, energy efficiency, and area. Key parameters of the simulated system are listed in Table I.

A. Circuits and Devices

We model all of the CMOS-based periphery at the 15nm technology node, using process parameters and design rules from FreePDK15 [38] coupled with the ASU Predictive Technology Model (PTM) [39]. TaOx memristor cells based on [40] are used, with linearity and dynamic range as reported by [18]. During computation, cells are modeled as resistors, with resistance determined either by a statistical approach considering block density or by the actual mapped bit values, depending on whether the model is being evaluated as part of a design-space exploration or to characterize performance with a particular matrix. The interconnect is modeled with a conservative lumped RC approach, using the wire capacitance parameters provided by the ASU PTM, derived from [41]. To reduce path distance and latency, we split the crossbars and place the drivers in the middle rather than at the periphery, similarly to the double-sided ground biasing technique from [42].

Table I
ACCELERATOR CONFIGURATION

System:   (128) banks; double-precision floating point; fclk = 1.2 GHz; 15nm process; VDD = 0.80 V
Bank:     (2) 512×512 clusters; (4) 256×256 clusters; (6) 128×128 clusters; (8) 64×64 clusters; 1 LEON3 core
Cluster:  127 bit-slice crossbars
Crossbar: N×N cells; (log2[N] − 1)-bit pipelined SAR ADC; 2N drivers
Cell:     TaOx [18], [40]; Ron = 2 kΩ; Roff = 3 MΩ; Vread = 0.2 V; Vset = −2.6 V; Vreset = 2.6 V; Ewrite = 3.91 nJ; Twrite = 50.88 ns

The peripheral circuitry includes 2N driver circuits, based on [43] and sized such that they are sufficient to source/sink the maximum current required to program and read the memristors. Also included are N sample-and-hold circuits based on [44], with power and area scaled for the required sampling frequency and precision.

A 1.2 GHz 10-bit pipelined SAR ADC [34] is used as a reference design point. 1.2 GHz is chosen as the operating frequency of the ADC in order to maximize throughput and efficiency while maintaining an acceptable ADC signal-to-noise and distortion ratio (SNDR) at our supply voltage. The area, power, and internal conversion time of each individual crossbar's ADC are scaled based on resolution, as discussed in Section V-A. Approximately 7% of the reported ADC power scales exponentially with resolution, with the remainder of the power either scaling linearly or remaining constant, and 20% of the total power is assumed to be static, based on the analysis reported in [34]. Similarly, approximately 23% of the reported area scales exponentially (due to the capacitive DAC and the track-and-hold circuit), and the remainder either scales linearly or remains constant. We hold the conversion time of the ADC constant with respect to resolution, rounding up to the nearest integral clock period. During the slack portion of the clock period (i.e., the time between conversions), we reduce the ADC power dissipation to its static power.
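
The scaling model reduces to two small functions, sketched below relative to the 10-bit reference design [34]. The exponential/static fractions follow the text; folding the entire "linear or constant" remainder into a single linear term is our simplifying assumption.

#include <math.h>

#define REF_BITS 10.0

/* Power relative to the reference ADC at a given resolution. */
double adc_power_scale(double bits) {
    double exponential = 0.07 * pow(2.0, bits - REF_BITS); /* ~7% doubles per bit */
    double statics     = 0.20;                             /* ~20% static power   */
    double linear      = 0.73 * (bits / REF_BITS);         /* remainder (assumed) */
    return exponential + statics + linear;                 /* = 1.0 at 10 bits    */
}

/* Area relative to the reference ADC at a given resolution. */
double adc_area_scale(double bits) {
    double exponential = 0.23 * pow(2.0, bits - REF_BITS); /* cap DAC + T/H       */
    double linear      = 0.77 * (bits / REF_BITS);         /* remainder (assumed) */
    return exponential + linear;                           /* = 1.0 at 10 bits    */
}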

The local processors are implemented with the open-source LEON3 core [45]. Since the open-source version of LEON3 does not include the netlist for the FPU for synthesis, we modify the FPU wrapper to use the fused multiply-add (FMA) timings for an FMA unit generated by FPGEN [46].

Table II
EVALUATED MATRICES (SPD MATRICES ON TOP)

Matrix            NNZ        Rows     NNZ/Row  Blocked [%]
2cubes_sphere     1,647,264  101,492    16.2      49.7
crystm03            583,770   24,696    23.6      94.7
finan512            596,992   74,752     7.9      46.7
G2_circuit          726,674  150,102     4.8      60.9
nasasrb           2,677,324   54,870    48.8      99.1
Pres_Poisson        715,804   14,822    48.3      96.4
qa8fm             1,660,579   66,127    25.1      92.8
ship_001          3,896,496   34,920   111.6      66.4
thermomech_TC       711,558  102,158     7.0       0.8
Trefethen_20000     554,466   20,000    27.7      63.3
ASIC_100K           940,621   99,340     9.5      60.9
bcircuit            375,558   68,902     5.4      64.9
epb3                463,625   84,617     5.5      72.2
GaAsH6            3,381,809   61,349    55.1      69.2
ns3Da             1,679,599   20,414    82.3       3.2
Si34H36           5,156,379   97,569    52.8      53.7
torso2            1,033,473  115,697     8.9      98.1
venkat25          1,717,792   62,424    27.5      79.8
wang3               177,168   26,064     6.8      64.6
xenon1            1,181,120   48,600    24.3      81.0

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49], using 14nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator, modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline (per-matrix bars, log scale; G-MEAN: 10.3×).


VIII. EVALUATION

We evaluate the system performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× over the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, while the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row yet show above-average but not exceptional speedup, due to their lower blocking efficiencies (i.e., the percentage of the non-zeros that are blocked): 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, in Table II there are two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node, and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case cost of four MVM operations (Section V-B1), and since the actual cost reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

Figure 9. Accelerator energy consumption normalized to the GPU baseline (per-matrix bars, log scale; G-MEAN: 0.092).


B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: it never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry (rather than the ADCs) are the dominant area consumer, at 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2),

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size   Area [mm²]   Energy [pJ]   Latency [ns]
64     0.00078       2.80          5.33
128    0.00103       6.52         10.7
256    0.00162      15.0          21.3
512    0.00352      34.2          42.7

Figure 10. Overhead of preprocessing and write time as a percent of total solve time on the accelerator (per-matrix bars; y-axis 0-20%).

the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) the wider operands that propagate through the reduction network. The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core may be integrated, based on application demands, without substantially increasing system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set; the overhead generally falls as the size of the linear system increases, because the number of iterations required to converge increases faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even lower in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing.

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions,

Figure 11. Sparsity and blocking patterns of ns3Da: (a) sparsity, (b) blocking.

where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative 10^9 writes [55]-[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.
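
A quick sanity check of this lifetime claim, using only the endurance figure stated above (the implied rewrite rate is our own derivation):

\[
\frac{10^{9}\ \text{writes}}{100\ \text{yr} \times 3.15\times 10^{7}\ \text{s/yr}} \approx 0.32\ \text{writes/s},
\]

that is, a 10^9-write budget sustains one full matrix rewrite roughly every 3 seconds, continuously, for a century; solvers that run for thousands of iterations per programmed matrix rewrite far less often than that.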

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that despite the relatively high density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks, and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range: We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation; this in turn introduces error into the computation, which tends to hinder convergence rates.


Figure 12. Iteration count as a function of bits per cell (B) and dynamic range (D), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.


Programming precision: We re-evaluate the convergence behavior with and without cell programming errors, to determine how the results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the dependence becomes stronger as the number of bits per cell is increased: higher programming error at the same tolerance introduces errors into the computation, hindering convergence.

IX. CONCLUSIONS

Previous work on memristive systems has been restricted to low-precision, error-prone fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Adv. High Energy Phys., vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177-182, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1-1:25, Dec 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803-820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631-644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856-869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54-61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1-70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," Nvidia Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb 2017, pp. 474-475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049-3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736-1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49-58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive technology model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108-111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25-27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985-1996, 2004.

[45] LEON3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1-14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230-239, Jan 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs - an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "Nvidia Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1-4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan 2012.


B A Heterogeneous Crossbar Substrate

Crossbar sizing requires striking a careful balance be-tween parallelism and energy efficiency a crossbar mustbe large enough to capture sufficient non-zero elements toexploit parallelism yet small enough to retain sufficient den-sity for area and energy efficiency By contrast to the singlecrossbar size used in prior work [7] [8] [18] the proposedsystem utilizes crossbars of different sizes allowing blocksto be mapped to the crossbars best suited to the sparsitypattern of the matrix This heterogeneous composition makesthe system performance robust yielding high performanceand efficiency on most matrices without over-designing fora particular matrix type or sparsity pattern

The problem of performance-optimal matrix blocking isneither new nor solved in a computationally feasible mannerPrior work seeks to block sparse matrices while maximizingthroughput and minimizing fill-in but focuses on block sizesthat are too small to be efficient with memristive crossbars(12times 12 at most [15])

1) Mapping a Matrix to the HeterogeneousSubstrate A fast preprocessing step is proposed that ef-ficiently maps an arbitrary matrix to the heterogeneouscollection of clusters The matrix is partitioned into blockcandidates and for each candidate block the number of non-zero elements captured by the block and the range of theexponents within the block are calculated If the exponentrange of the block falls outside of the maximum allowablerange (ie 117) the elements are selectively removed untilan acceptable range is attained The number of non-zero ele-ments remaining is then compared to a dimension-dependentthreshold if the number of elements exceeds the thresholdthe candidate block is accepted If the candidate block isrejected a smaller block size is considered and the processrepeats

The preprocessing step considers every block size in thesystem (512times512 256times256 128times128 and 64times64) untileach element has either been mapped to a block or found tobe unblockable Elements that cannot be blocked efficientlyor that fall outside of the exponent range of an acceptedblock are completely removed from the mapping and areinstead processed separately by the local processor Figure

(a) Pres Poisson sparsity (b) Pres Poisson blocking (c) Xenon1 sparsity (d) Xenon1 blockingFigure 7 Sparsity and blocking patterns of two of the evaluated matrices

7 shows the non-zero elements and the resulting blockingpatterns for two sparse matrices

Since the non-zero elements in the matrix are traversed atmost one time for each block size the worst-case complexityof the preprocessing step is 4timesNNZ In practice the earlydiscovery of good blocks brings the average-case complexityto 18timesNNZ

2) Exploiting Sparsity Though sparsity introduces chal-lenges in terms of throughput it also presents some oppor-tunities for optimization

Computational invert coding Prior work [8] introducesa technique for statically reducing ADC resolution require-ments If a crossbar column contains all 1s the maximumoutput for the column is equal to the number of rowsN To handle this extreme case the ADC would requiredlog2[N + 1]e bits of resolution Eliminating this casereduces the ADC resolution requirement by one bit This isachieved by inverting the contents of a column if it containsall 1s To obtain a result equivalent to a computation witha non-inverted column the partial dot product is subtractedfrom the population count of the applied vector bit slice

We extend this technique to computational invert coding(CIC) a computational analog to the widely used bus-invertcoding [37] If the number of ones in a binary matrixrow is less than N

2 the required resolution of the ADCis again decreased by one bit By inverting any columnmapping that would result in greater than 50 bit density astatic maximum of N

2 1s per crossbar column is guaranteedNotably if the column has exactly N

2 1s a resolution oflog2N is still required Since the proposed system allowsfor removing arbitrary elements from a block for processingon the local processor this corner case can be eliminatedThus for any block even in matrices approaching 100density the ADC resolution can be statically reduced tolog2[N ]minus1 bits CIC also lowers the maximum conductanceof the crossbars (since at most half of the cells can be inthe RLO state) which in turn reduces crossbar energy

ADC headstart The ADCs that are commonly used inmemristive accelerators [7] [8] perform a binary searchover the space of possible results with the MSb initially

set to 1 However the maximum output of each column isconstrained by the number of ones mapped to that columnMatrix rows tend to produce at most dlog2[drowtimes05timesN ]ebits of output assuming that 1s and 0s are equally likelySince the actual number of output bits is data dependentthis tendency cannot be leveraged to statically reduce ADCprecision beyond what is attainable with CIC Instead theADC is pre-set to the maximum number of bits that can beproduced by the subsequent conversion step This approachallows the ADC to start from the most significant possibleoutput bit position and reduces the amount of time requiredto find the correct output value by an amount proportionalto the number of bits skipped Notably this does not affectoperation latency since reduction logic is fully synchronousand assumes a full clock period for conversion however bydecreasing the time spent on conversion the energy of theMVM operation is reduced

VI PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity asthey are a widely used application involving sparse MVMhowever the proposed accelerator is not limited to Krylovsubspace solvers These solvers are built from three com-putational kernels a sparse-matrix dense-vector multiply adense-vector sum or AXPY (ylarr ax+y) and a dense-vectordot product (z = x middot y) To split the computation betweenthe banks each bank is given a 1200-element section ofthe solution vector x only that bank performs updates onthe corresponding section of the solution vector and theportions of the derived vectors For simplicity this restrictionis not enforced by hardware and must be followed by theprogrammer for correctness

On problems that are too large for a single accelerator theMVM can be split in a manner analogous to the partitioningon GPUs each accelerator handles a portion of the MVMand the accelerators synchronize between iterations

A Kernel Implementation

To implement Krylov subspace solvers we implement thethree primary kernels needed for the solver

1) Sparse MVM The sparse MVM operation beginsby reading the vector map for each cluster The vectormap contains a set of three-element tuples each of whichcomprises the base address of a cluster input vector thevector element index corresponding to that base addressand the size of the cluster Since the vector applied to eachcluster is contiguous the tuple is used to compute and loadall of the vector elements into the appropriate vector bufferThereafter a start signal is written to a cluster status registerand the computation begins Since larger clusters tend tohave a higher latency the vector map entries are ordered bycluster size

Once all cluster operations have started the processorbegins operating on the non-blocked entries As noted inthe previous section some values within the sparse matrixare not suitable for mapping onto memristive hardwareeither because they cannot be blocked at a sufficient densityor because they would exceed the exponent range Theseremaining values are stored in the compressed sparse rowformat [15]

When a cluster completes an interrupt is raised andserviced by an interrupt service routine running on the localprocessor Once all clusters signal completion and the localprocessor finishes processing the non-blocked elements thebank has completed its portion of the sparse MVM The localprocessors for different banks rely on a barrier synchroniza-tion scheme to determine when the entire sparse MVM iscomplete

2) Vector Dot Product To perform a vector dot producteach processor computes the relevant dot product fromits own vector elements Notably due to vector elementownership of the solution vector x and all of the derivedvectors the dot product is performed using only thosevalues that belong to a local processor Once the processorcomputes its local dot product the value is written backto shared memory for all other banks to see Each bankcomputes the final product from the set of bank dot productsindividually to simplify synchronization

3) AXPY The AXPY operation is the simplest kernelsince it can be performed purely locally Each vector elementis loaded modified and written back Barrier synchroniza-tion is used to determine completion across all banks

VII EXPERIMENTAL SETUP

We develop a parameterized and modular system modelto evaluate throughput energy efficiency and area Keyparameters of the simulated system are listed in Table I

A Circuits and Devices

We model all of the CMOS based periphery at the 15nmtechnology node using process parameters and design rulesfrom FreePDK15 [38] coupled with the ASU PredictiveTechnology Model (PTM) [39] TaOx memristor cells basedon [40] are used with linearity and dynamic range as

reported by [18] During computation cells are modeled asresistors with resistance determined either by a statisticalapproach considering block density or actual mapped bitvalues depending on whether the model is being evaluatedas part of a design-space exploration or to characterizeperformance with a particular matrix The interconnect ismodeled with a conservative lumped RC approach usingthe wire capacitance parameters provided by the ASU PTMderived from [41] To reduce path distance and latency wesplit the crossbars and place the drivers in the middle ratherthan at the periphery similarly to the double-sided groundbiasing technique from [42]

Table IACCELERATOR CONFIGURATION

System (128) banks double-precision floating point fclk =12GHz 15nm process VDD = 080V

Bank (2) times 512 times 512 clusters (4) times 256 times 256 clusters(6)times128times128 (8)times64times64 clusters 1 LEON core

Cluster 127 bit slice crossbarsCrossbar N timesN cells (log2[N ]minus 1)-bit pipelined SAR ADC

2N driversCell [18] [40] TaOx Ron = 2kΩRoff = 3MΩ Vread = 02V

Vset = minus26V Vreset = 26V Ewrite = 391nJTwrite = 5088ns

The peripheral circuitry includes 2N driver circuits basedon [43] and sized such that they are sufficient to sourcesinkthe maximum current required to program and read thememristors Also included are N sample-and-hold circuitsbased on [44] with power and area scaled for the requiredsampling frequency and precision

A 12 GHz 10-bit pipelined SAR ADC [34] is used as areference design point 12 GHz is chosen as the operatingfrequency of the ADC in order to maximize throughput andefficiency while maintaining acceptable ADC signal to noiseand distortion ratio (SNDR) at our supply voltage Areapower and the internal conversion time of each individualcrossbarrsquos ADC are scaled based on resolution as discussedin Section V-A Approximately 7 of the reported ADCpower scales exponentially with resolution with the remain-der of power either scaling linearly or remaining constantand 20 of the total power is assumed to be static basedon the analysis reported in [34] Similarly approximately23 of the reported area scales exponentially (due to thecapacitive DAC and the track-and-hold circuit) and theremainder either scales linearly or remains constant We holdthe conversion time of the ADC constant with respect toresolution rounding up to the nearest integral clock periodDuring the slack portion of the clock period (ie the timebetween conversions) we reduce the ADC power dissipationto its static power

The local processors are implemented with the open-source LEON3 core [45] Since the open-source versionof LEON3 does not include the netlist for the FPU forsynthesis we modify the FPU wrapper to the fused multiplyadd (FMA) timings for an FMA unit generated by FP-

Table IIEVALUATED MATRICES SPD MATRICES ON TOP

Matrix NNZs Rows NNZRow Blocked2cubes sphere 1647264 101492 162 497crystm03 583770 24696 236 947finan512 596992 74752 79 467G2 circuit 726674 150102 45 609nasasrb 2677324 54870 498 991Pres Poisson 715804 14822 483 964qa8fm 1660579 66127 251 928ship 001 3896496 34920 1116 664thermomech TC 711558 102158 68 08Trefethen 20000 554466 20000 277 633ASIC 100K 940621 99340 95 609bcircuit 375558 68902 54 649epb3 463625 84617 55 722GaAsH6 3381809 61349 5512 692ns3Da 1679599 20414 82 32Si34H36 5156379 97569 528 537torso2 1033473 115697 89 981venkat25 1717792 62424 275 798wang3 177168 26064 68 646xenon1 1181120 48600 243 810

GEN [46] The LEON3 core FMA and in-cluster reductionnetwork are synthesized using the NanGate 15nm OpenCell Library [47] with the Synopsys Design Compiler [48]SRAM buffers within each cluster and the eDRAM mem-ory are modeled using CACTI7 [49] using 14nm eDRAMparameters from [50]

B Architecture

We compare against an nVidia Tesla P100 GPU accelera-tor modeled using GPGPUSim [51] and GPGPUWattch [52]To evaluate the performance of the LEON3 based localprocessors we implement CG and BiCG-STAB in C andcompile the implementations using the LEON3 Bare CCompiler with timers around each kernel call Since all bankmicroprocessors must synchronize during each iteration weevaluate the latency of the bank microprocessor with thelargest number of unblocked elements

In the worst case the preprocessing step touches everynon-zero in the matrix four times which is comparable toperforming four MVM operations Thus we conservativelyassume that preprocessing takes time equivalent to fourMVM operations on the baseline system

C Algorithms

We evaluate the proposed accelerator on 20 matrices fromthe SuiteSparse matrix collection [14] using CG for sym-metric positive definite matrices (SPD) and BiCG-STAB fornon-SPD matrices The matrices were selected to showcasea range of matrix structures sizes and densities across avariety of applications When available we use the b vectorprovided by the collection when b is unavailable we use ab vector of all 1s as in prior work [53]

The solvers running on the proposed accelerator convergein the same number of iterations as they do when running

103

01

1

10

100

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

NS

peed

upoverG

PU

Figure 8 Speedup over the GPU baseline

on the GPU since both systems perform computation at thesame level of precision

VIII EVALUATION

We evaluate the system performance energy area androbustness to different sources of error

A Execution Time

Figure 8 shows that the accelerator achieves an averagespeedup of 103times as compared to the GPU baseline acrossthe set of 20 matrices One important insight from theseresults is that the average number of non-zeros per matrixrow is a poor proxy for potential speedup The matrix withthe greatest speedup in the dataset (torso2) has just 89 NNZper matrix row and the two quantum chemistry matrices(GaSsH6 and Si34H36) have more than 50 NNZ per rowshowing above-average but not exceptional speedup due totheir lower blocking efficiencies (ie the percent of the non-zeros that are blocked) 69 and 53 respectively Sinceeach bank microprocessor is designed to work in conjunctionwith its clusters matrices with fewer rows perform worsethan larger matrices at the same blocking efficiency Asexpected blocking efficiency is the most reliable predictorof speedup

Notably in Table II there are two matrices that are effec-tively unblocked by the proposed preprocessing step ther-motech TC and ns3Da with respective blocking efficienciesof 08 and 32 Since the accelerator is optimized for in-situ rather than digital MVM and since the majority of theelements in these two matrices cannot be mapped to theproposed heterogeneous substrate the accelerator is morethan an order of magnitude less efficient than the GPUbaseline on these matrices Rather than attempt to performMVM on matrices so clearly ill-suited to the acceleratorwe assume that the proposed accelerator co-exists with aGPU in the same node and that the GPU may be usedfor the rare matrices which do not block effectively Sincethe blocking algorithm has a worst-case performance offour MVM operations (Section V-B1) and since the actualcomplexity reaches the worst case if and when the matrixcannot be blocked we can choose whether to perform thecomputation on the accelerator or the GPU after the blockinghas completed This results in a performance loss of less than

0092

1E-02

1E-01

1E+00

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

N

EnergyNormalized

toGPU

Figure 9 Accelerator energy consumption normalized to the GPU baseline

3 for both matrices far better than if the accelerator wereused directly

B Energy Consumption

Energy consumption results are shown in Figure 9 To-tal energy consumption is improved by 142times on the 18matrices executed on the accelerator and by 109times overthe entire 20-matrix dataset The impacts of differing ex-ponent ranges can be seen in the energy results Notablythe nasasrb matrix has a 3 higher blocking efficiencythan Pres Poisson however Pres Poisson shows nearlytwice the improvement in energy dissipation this is dueto the much narrower exponent range of Pres PoissonPres Poisson never requires more than 14 bits of paddingfor storage which also indicates a much narrower dynamicrange for the computation and by extension fewer vectorbit slices that are needed per cluster By contrast nasasrbhas multiple blocks with excluded elements (due to thoseelements requiring greater than 117 bits) and in generalthe average number of stored bits per cluster is 107 (30more bits than the worst block of Pres Poisson) This strongdependence on the exponent range indicates a potentialbenefit of the fixed-point computation over floating pointAs discussed in Section IV-B floating point is designed forbroad compatibility and exceeds the requirements of manyapplications By operating on fixed-point representations ofdata for much of the computation the accelerator implicitlycreates a problem-specific subset of the floating-point formatwith significant performance benefits without discarding anyinformation that would cause the final result to differ fromthat of an end-to-end floating-point calculation

C Area Footprint

The area footprint of the accelerator is computed usingthe area model described in Section VII-A The overallsystem area for the 128-bank system described above is539mm2 which is lower than the 610mm2 die size of thebaseline Nvidia Tesla P100 GPU [54] Notably unlike inprior work on memristive accelerators the crossbars andperipheral circuitry are the dominant area consumermdashratherthan the ADCsmdashwith a total of 541 of total overall clusterarea This is due to two important differences from priorwork 1) due to computational invert coding (Section V-B2)

Table IIIAREA ENERGY AND LATENCY OF DIFFERENT CROSSBAR SIZES

(INCLUDES THE ADC)

Size Area [mm2] Energy [pJ] Latency [nsec]64 000078 280 533128 000103 652 107256 000162 150 213512 000352 342 427

0

5

10

15

20

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

AcceleratorC

ompu

tatio

n

Overhead

WriteTime PreprocessingTime

Figure 10 Overhead of preprocessing and write time as a percent of totalsolve time on the accelerator

the ADC precision can be reduced by one bit which has anexponential effect on ADC area and 2) the wider operandsthat propagate through the reduction network The per-bankprocessors and global memory buffer consume 136 of thetotal system area This result suggests that a more powerfulprocessor than the LEON3 core may be integrated based onapplication demands without substantially increasing systemarea Table III summarizes the area as well as the latencyand energy characteristics for different crossbar sizes usedin different clusters

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set; the overhead generally falls as the size of the linear system increases, because the number of iterations required to converge grows faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even lower in practice, as multiple time steps are generally simulated to examine the time evolution of the linear system. In these time-stepped computations, only a subset of the non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing.
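The amortization argument reduces to simple arithmetic, sketched below with hypothetical setup and per-iteration times (not measurements from the evaluation).

    def setup_overhead(write_s, preprocess_s, iterations, iter_s):
        # Fraction of total solve time spent on one-time setup.
        setup = write_s + preprocess_s
        return setup / (setup + iterations * iter_s)

    # Hypothetical: 20 ms of setup amortized over 5,000 iterations
    # of 100 us each is a ~3.8% overhead, in line with Figure 10.
    print(setup_overhead(10e-3, 10e-3, 5000, 100e-6))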

E. System Endurance

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix (ignoring both the sparsity of the blocks and the time-step behavior mentioned above), the system lifetime is sufficient: given memristive devices with a conservative endurance of 10^9 writes [55]–[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.
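A back-of-the-envelope check of this claim, assuming a hypothetical average solve time (the paper does not state one), is sketched below.

    SECONDS_PER_YEAR = 3.154e7

    def lifetime_years(endurance_writes=1e9, solve_s=3.2):
        # Worst case from this section: one full rewrite per solve,
        # solves running back to back. solve_s is a hypothetical
        # average solve time, not a figure from the paper.
        return endurance_writes * solve_s / SECONDS_PER_YEAR

    print(lifetime_years())   # ~100 years under these assumptions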

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that, despite the relatively high density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and Xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64-wide blocks and 4 per matrix block-row in 128-wide blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

Figure 11. Sparsity and blocking patterns of ns3Da: (a) sparsity; (b) blocking.
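For intuition, the sketch below computes a crude blocking-efficiency proxy: it tiles a sparse matrix and counts the fraction of non-zeros that land in tiles whose density clears a threshold. The fixed tile size and the 5% threshold are illustrative choices, not the variable-size blocking algorithm of Section V-B.

    from scipy.sparse import random as sparse_random

    def blocking_efficiency(csr, block=64, min_density=0.05):
        # Fraction of non-zeros that land in block x block tiles dense
        # enough to be worth mapping onto a crossbar. Fixed tiles and
        # a 5% threshold are illustrative simplifications.
        n_rows, n_cols = csr.shape
        blocked = 0
        for i in range(0, n_rows, block):
            for j in range(0, n_cols, block):
                tile = csr[i:i + block, j:j + block]
                if tile.nnz >= min_density * tile.shape[0] * tile.shape[1]:
                    blocked += tile.nnz
        return blocked / csr.nnz

    # Uniformly scattered non-zeros (ns3Da-like) block far worse than
    # the same number of non-zeros packed near the diagonal.
    m = sparse_random(2048, 2048, density=0.004, format='csr')
    print(blocking_efficiency(m))   # close to 0 for a uniform matrix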

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range. We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation. This in turn introduces error into the computation, which tends to hinder convergence rates.
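The sensitivity to bits per cell can be seen from how the cell states pack into the available conductance range. The sketch below assumes evenly spaced conductance levels between G_off and G_on, with Ron and Roff taken from Table I; the even spacing is an assumption made for illustration, not necessarily the paper's mapping.

    def cell_conductance_levels(bits, r_on=2e3, r_off=3e6):
        # Evenly spaced conductance levels between G_off and G_on
        # (an assumed mapping; Ron and Roff follow Table I).
        g_on, g_off = 1.0 / r_on, 1.0 / r_off
        n_levels = 2 ** bits
        step = (g_on - g_off) / (n_levels - 1)
        return [g_off + k * step for k in range(n_levels)]

    # Two-bit cells place adjacent states 3x closer together than
    # one-bit cells, leaving less margin for error in the current sum.
    print(cell_conductance_levels(1))
    print(cell_conductance_levels(2))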

Figure 12. Iteration count as a function of bits per cell and dynamic range, normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Programming precision. We re-evaluate the convergence behavior with and without cell programming errors to determine how the results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the dependence becomes stronger as the number of bits per cell is increased: with the same tolerance, higher programming error introduces errors into the computation, hindering convergence.
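The style of experiment behind Figures 12 and 13 can be approximated with a short Monte Carlo loop. The sketch below perturbs an ideal conductance matrix with relative Gaussian programming error (an assumed error model) and reports the worst-case deviation of the analog dot product over 100 trials.

    import numpy as np

    rng = np.random.default_rng(0)

    def worst_case_mvm_error(G, x, rel_sigma=0.05, trials=100):
        # Perturb each programmed conductance by Gaussian error with
        # relative standard deviation rel_sigma (an assumed error
        # model) and report the largest deviation of the dot product
        # from the ideal result across the trials.
        exact = G @ x
        worst = 0.0
        for _ in range(trials):
            G_err = G * (1.0 + rel_sigma * rng.standard_normal(G.shape))
            worst = max(worst, np.max(np.abs(G_err @ x - exact)))
        return worst

    G = rng.uniform(0.0, 1.0, (64, 64))   # ideal cell conductances
    x = rng.uniform(0.0, 1.0, 64)
    print(worst_case_mvm_error(G, x))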

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Advances in High Energy Physics, vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project: goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on High Performance Computer Architecture (HPCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] Leon3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. B. II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symp. on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide." [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs – an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "Nvidia Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.

Page 9: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

(a) Pres Poisson sparsity (b) Pres Poisson blocking (c) Xenon1 sparsity (d) Xenon1 blockingFigure 7 Sparsity and blocking patterns of two of the evaluated matrices

7 shows the non-zero elements and the resulting blockingpatterns for two sparse matrices

Since the non-zero elements in the matrix are traversed atmost one time for each block size the worst-case complexityof the preprocessing step is 4timesNNZ In practice the earlydiscovery of good blocks brings the average-case complexityto 18timesNNZ

2) Exploiting Sparsity Though sparsity introduces chal-lenges in terms of throughput it also presents some oppor-tunities for optimization

Computational invert coding Prior work [8] introducesa technique for statically reducing ADC resolution require-ments If a crossbar column contains all 1s the maximumoutput for the column is equal to the number of rowsN To handle this extreme case the ADC would requiredlog2[N + 1]e bits of resolution Eliminating this casereduces the ADC resolution requirement by one bit This isachieved by inverting the contents of a column if it containsall 1s To obtain a result equivalent to a computation witha non-inverted column the partial dot product is subtractedfrom the population count of the applied vector bit slice

We extend this technique to computational invert coding(CIC) a computational analog to the widely used bus-invertcoding [37] If the number of ones in a binary matrixrow is less than N

2 the required resolution of the ADCis again decreased by one bit By inverting any columnmapping that would result in greater than 50 bit density astatic maximum of N

2 1s per crossbar column is guaranteedNotably if the column has exactly N

2 1s a resolution oflog2N is still required Since the proposed system allowsfor removing arbitrary elements from a block for processingon the local processor this corner case can be eliminatedThus for any block even in matrices approaching 100density the ADC resolution can be statically reduced tolog2[N ]minus1 bits CIC also lowers the maximum conductanceof the crossbars (since at most half of the cells can be inthe RLO state) which in turn reduces crossbar energy

ADC headstart The ADCs that are commonly used inmemristive accelerators [7] [8] perform a binary searchover the space of possible results with the MSb initially

set to 1 However the maximum output of each column isconstrained by the number of ones mapped to that columnMatrix rows tend to produce at most dlog2[drowtimes05timesN ]ebits of output assuming that 1s and 0s are equally likelySince the actual number of output bits is data dependentthis tendency cannot be leveraged to statically reduce ADCprecision beyond what is attainable with CIC Instead theADC is pre-set to the maximum number of bits that can beproduced by the subsequent conversion step This approachallows the ADC to start from the most significant possibleoutput bit position and reduces the amount of time requiredto find the correct output value by an amount proportionalto the number of bits skipped Notably this does not affectoperation latency since reduction logic is fully synchronousand assumes a full clock period for conversion however bydecreasing the time spent on conversion the energy of theMVM operation is reduced

VI PROGRAMMING THE ACCELERATOR

We focus on Krylov subspace solvers for specificity asthey are a widely used application involving sparse MVMhowever the proposed accelerator is not limited to Krylovsubspace solvers These solvers are built from three com-putational kernels a sparse-matrix dense-vector multiply adense-vector sum or AXPY (ylarr ax+y) and a dense-vectordot product (z = x middot y) To split the computation betweenthe banks each bank is given a 1200-element section ofthe solution vector x only that bank performs updates onthe corresponding section of the solution vector and theportions of the derived vectors For simplicity this restrictionis not enforced by hardware and must be followed by theprogrammer for correctness

On problems that are too large for a single accelerator theMVM can be split in a manner analogous to the partitioningon GPUs each accelerator handles a portion of the MVMand the accelerators synchronize between iterations

A Kernel Implementation

To implement Krylov subspace solvers we implement thethree primary kernels needed for the solver

1) Sparse MVM The sparse MVM operation beginsby reading the vector map for each cluster The vectormap contains a set of three-element tuples each of whichcomprises the base address of a cluster input vector thevector element index corresponding to that base addressand the size of the cluster Since the vector applied to eachcluster is contiguous the tuple is used to compute and loadall of the vector elements into the appropriate vector bufferThereafter a start signal is written to a cluster status registerand the computation begins Since larger clusters tend tohave a higher latency the vector map entries are ordered bycluster size

Once all cluster operations have started the processorbegins operating on the non-blocked entries As noted inthe previous section some values within the sparse matrixare not suitable for mapping onto memristive hardwareeither because they cannot be blocked at a sufficient densityor because they would exceed the exponent range Theseremaining values are stored in the compressed sparse rowformat [15]

When a cluster completes an interrupt is raised andserviced by an interrupt service routine running on the localprocessor Once all clusters signal completion and the localprocessor finishes processing the non-blocked elements thebank has completed its portion of the sparse MVM The localprocessors for different banks rely on a barrier synchroniza-tion scheme to determine when the entire sparse MVM iscomplete

2) Vector Dot Product To perform a vector dot producteach processor computes the relevant dot product fromits own vector elements Notably due to vector elementownership of the solution vector x and all of the derivedvectors the dot product is performed using only thosevalues that belong to a local processor Once the processorcomputes its local dot product the value is written backto shared memory for all other banks to see Each bankcomputes the final product from the set of bank dot productsindividually to simplify synchronization

3) AXPY The AXPY operation is the simplest kernelsince it can be performed purely locally Each vector elementis loaded modified and written back Barrier synchroniza-tion is used to determine completion across all banks

VII EXPERIMENTAL SETUP

We develop a parameterized and modular system modelto evaluate throughput energy efficiency and area Keyparameters of the simulated system are listed in Table I

A Circuits and Devices

We model all of the CMOS based periphery at the 15nmtechnology node using process parameters and design rulesfrom FreePDK15 [38] coupled with the ASU PredictiveTechnology Model (PTM) [39] TaOx memristor cells basedon [40] are used with linearity and dynamic range as

reported by [18] During computation cells are modeled asresistors with resistance determined either by a statisticalapproach considering block density or actual mapped bitvalues depending on whether the model is being evaluatedas part of a design-space exploration or to characterizeperformance with a particular matrix The interconnect ismodeled with a conservative lumped RC approach usingthe wire capacitance parameters provided by the ASU PTMderived from [41] To reduce path distance and latency wesplit the crossbars and place the drivers in the middle ratherthan at the periphery similarly to the double-sided groundbiasing technique from [42]

Table IACCELERATOR CONFIGURATION

System (128) banks double-precision floating point fclk =12GHz 15nm process VDD = 080V

Bank (2) times 512 times 512 clusters (4) times 256 times 256 clusters(6)times128times128 (8)times64times64 clusters 1 LEON core

Cluster 127 bit slice crossbarsCrossbar N timesN cells (log2[N ]minus 1)-bit pipelined SAR ADC

2N driversCell [18] [40] TaOx Ron = 2kΩRoff = 3MΩ Vread = 02V

Vset = minus26V Vreset = 26V Ewrite = 391nJTwrite = 5088ns

The peripheral circuitry includes 2N driver circuits basedon [43] and sized such that they are sufficient to sourcesinkthe maximum current required to program and read thememristors Also included are N sample-and-hold circuitsbased on [44] with power and area scaled for the requiredsampling frequency and precision

A 12 GHz 10-bit pipelined SAR ADC [34] is used as areference design point 12 GHz is chosen as the operatingfrequency of the ADC in order to maximize throughput andefficiency while maintaining acceptable ADC signal to noiseand distortion ratio (SNDR) at our supply voltage Areapower and the internal conversion time of each individualcrossbarrsquos ADC are scaled based on resolution as discussedin Section V-A Approximately 7 of the reported ADCpower scales exponentially with resolution with the remain-der of power either scaling linearly or remaining constantand 20 of the total power is assumed to be static basedon the analysis reported in [34] Similarly approximately23 of the reported area scales exponentially (due to thecapacitive DAC and the track-and-hold circuit) and theremainder either scales linearly or remains constant We holdthe conversion time of the ADC constant with respect toresolution rounding up to the nearest integral clock periodDuring the slack portion of the clock period (ie the timebetween conversions) we reduce the ADC power dissipationto its static power

The local processors are implemented with the open-source LEON3 core [45] Since the open-source versionof LEON3 does not include the netlist for the FPU forsynthesis we modify the FPU wrapper to the fused multiplyadd (FMA) timings for an FMA unit generated by FP-

Table IIEVALUATED MATRICES SPD MATRICES ON TOP

Matrix NNZs Rows NNZRow Blocked2cubes sphere 1647264 101492 162 497crystm03 583770 24696 236 947finan512 596992 74752 79 467G2 circuit 726674 150102 45 609nasasrb 2677324 54870 498 991Pres Poisson 715804 14822 483 964qa8fm 1660579 66127 251 928ship 001 3896496 34920 1116 664thermomech TC 711558 102158 68 08Trefethen 20000 554466 20000 277 633ASIC 100K 940621 99340 95 609bcircuit 375558 68902 54 649epb3 463625 84617 55 722GaAsH6 3381809 61349 5512 692ns3Da 1679599 20414 82 32Si34H36 5156379 97569 528 537torso2 1033473 115697 89 981venkat25 1717792 62424 275 798wang3 177168 26064 68 646xenon1 1181120 48600 243 810

GEN [46] The LEON3 core FMA and in-cluster reductionnetwork are synthesized using the NanGate 15nm OpenCell Library [47] with the Synopsys Design Compiler [48]SRAM buffers within each cluster and the eDRAM mem-ory are modeled using CACTI7 [49] using 14nm eDRAMparameters from [50]

B Architecture

We compare against an nVidia Tesla P100 GPU accelera-tor modeled using GPGPUSim [51] and GPGPUWattch [52]To evaluate the performance of the LEON3 based localprocessors we implement CG and BiCG-STAB in C andcompile the implementations using the LEON3 Bare CCompiler with timers around each kernel call Since all bankmicroprocessors must synchronize during each iteration weevaluate the latency of the bank microprocessor with thelargest number of unblocked elements

In the worst case the preprocessing step touches everynon-zero in the matrix four times which is comparable toperforming four MVM operations Thus we conservativelyassume that preprocessing takes time equivalent to fourMVM operations on the baseline system

C Algorithms

We evaluate the proposed accelerator on 20 matrices fromthe SuiteSparse matrix collection [14] using CG for sym-metric positive definite matrices (SPD) and BiCG-STAB fornon-SPD matrices The matrices were selected to showcasea range of matrix structures sizes and densities across avariety of applications When available we use the b vectorprovided by the collection when b is unavailable we use ab vector of all 1s as in prior work [53]

The solvers running on the proposed accelerator convergein the same number of iterations as they do when running

103

01

1

10

100

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

NS

peed

upoverG

PU

Figure 8 Speedup over the GPU baseline

on the GPU since both systems perform computation at thesame level of precision

VIII EVALUATION

We evaluate the system performance energy area androbustness to different sources of error

A Execution Time

Figure 8 shows that the accelerator achieves an averagespeedup of 103times as compared to the GPU baseline acrossthe set of 20 matrices One important insight from theseresults is that the average number of non-zeros per matrixrow is a poor proxy for potential speedup The matrix withthe greatest speedup in the dataset (torso2) has just 89 NNZper matrix row and the two quantum chemistry matrices(GaSsH6 and Si34H36) have more than 50 NNZ per rowshowing above-average but not exceptional speedup due totheir lower blocking efficiencies (ie the percent of the non-zeros that are blocked) 69 and 53 respectively Sinceeach bank microprocessor is designed to work in conjunctionwith its clusters matrices with fewer rows perform worsethan larger matrices at the same blocking efficiency Asexpected blocking efficiency is the most reliable predictorof speedup

Notably in Table II there are two matrices that are effec-tively unblocked by the proposed preprocessing step ther-motech TC and ns3Da with respective blocking efficienciesof 08 and 32 Since the accelerator is optimized for in-situ rather than digital MVM and since the majority of theelements in these two matrices cannot be mapped to theproposed heterogeneous substrate the accelerator is morethan an order of magnitude less efficient than the GPUbaseline on these matrices Rather than attempt to performMVM on matrices so clearly ill-suited to the acceleratorwe assume that the proposed accelerator co-exists with aGPU in the same node and that the GPU may be usedfor the rare matrices which do not block effectively Sincethe blocking algorithm has a worst-case performance offour MVM operations (Section V-B1) and since the actualcomplexity reaches the worst case if and when the matrixcannot be blocked we can choose whether to perform thecomputation on the accelerator or the GPU after the blockinghas completed This results in a performance loss of less than

0092

1E-02

1E-01

1E+00

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

N

EnergyNormalized

toGPU

Figure 9 Accelerator energy consumption normalized to the GPU baseline

3 for both matrices far better than if the accelerator wereused directly

B Energy Consumption

Energy consumption results are shown in Figure 9 To-tal energy consumption is improved by 142times on the 18matrices executed on the accelerator and by 109times overthe entire 20-matrix dataset The impacts of differing ex-ponent ranges can be seen in the energy results Notablythe nasasrb matrix has a 3 higher blocking efficiencythan Pres Poisson however Pres Poisson shows nearlytwice the improvement in energy dissipation this is dueto the much narrower exponent range of Pres PoissonPres Poisson never requires more than 14 bits of paddingfor storage which also indicates a much narrower dynamicrange for the computation and by extension fewer vectorbit slices that are needed per cluster By contrast nasasrbhas multiple blocks with excluded elements (due to thoseelements requiring greater than 117 bits) and in generalthe average number of stored bits per cluster is 107 (30more bits than the worst block of Pres Poisson) This strongdependence on the exponent range indicates a potentialbenefit of the fixed-point computation over floating pointAs discussed in Section IV-B floating point is designed forbroad compatibility and exceeds the requirements of manyapplications By operating on fixed-point representations ofdata for much of the computation the accelerator implicitlycreates a problem-specific subset of the floating-point formatwith significant performance benefits without discarding anyinformation that would cause the final result to differ fromthat of an end-to-end floating-point calculation

C Area Footprint

The area footprint of the accelerator is computed usingthe area model described in Section VII-A The overallsystem area for the 128-bank system described above is539mm2 which is lower than the 610mm2 die size of thebaseline Nvidia Tesla P100 GPU [54] Notably unlike inprior work on memristive accelerators the crossbars andperipheral circuitry are the dominant area consumermdashratherthan the ADCsmdashwith a total of 541 of total overall clusterarea This is due to two important differences from priorwork 1) due to computational invert coding (Section V-B2)

Table IIIAREA ENERGY AND LATENCY OF DIFFERENT CROSSBAR SIZES

(INCLUDES THE ADC)

Size Area [mm2] Energy [pJ] Latency [nsec]64 000078 280 533128 000103 652 107256 000162 150 213512 000352 342 427

0

5

10

15

20

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

AcceleratorC

ompu

tatio

n

Overhead

WriteTime PreprocessingTime

Figure 10 Overhead of preprocessing and write time as a percent of totalsolve time on the accelerator

the ADC precision can be reduced by one bit which has anexponential effect on ADC area and 2) the wider operandsthat propagate through the reduction network The per-bankprocessors and global memory buffer consume 136 of thetotal system area This result suggests that a more powerfulprocessor than the LEON3 core may be integrated based onapplication demands without substantially increasing systemarea Table III summarizes the area as well as the latencyand energy characteristics for different crossbar sizes usedin different clusters

D Initialization Overhead

Figure 10 shows the overhead of setting up the itera-tive computation including matrix preprocessing time andprogramming time normalized to the entire solve timeof each linear system on the accelerator Since iterativealgorithms take thousands of iterations to converge theoverhead of writing is effectively amortized leading toan overhead of less than 20 across the entire matrixset and generally falling as the size of the linear systemincreases By extension the number of iterations requiredto converge increases faster than the write requirements ofthe array For large linear systems (the most likely use caseof the proposed accelerator) the overhead is typically lessthan 4 of the total solver runtime The overhead may beeven less in practice as multiple time steps are generallysimulated to examine the time-evolution of the linear systemIn these time-stepped computations only a subset of non-zeros change each step and the matrix structure is typicallypreserved requiring minimal re-processing

E System Endurance

Although memristive devices have finite switching en-durance the iterative nature of the algorithms allows a singlematrix to be programmed once and reused throughout thecomputation Even under a conservative set of assumptions

(a) ns3Da sparsity (b) ns3Da blockingFigure 11 Sparsity and blocking patterns of ns3Da

where each array is fully rewritten for each new matrixignoring both the sparsity of the blocks and the time stepbehavior mentioned above the system lifetime is suffi-cient given a memristive device with a conservative 109

writes [55]ndash[57] the total system lifetime is over 100 yearsassuming the system runs constantly and a new matrix isfully rewritten between solves

F Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and theblocking pattern for ns3Da a matrix with a particularly poorblocking efficiency The figure shows that despite the highrelative density of the matrix the values do not form densesub-blocks that can be blocked efficiently Instead the valuesare distributed relatively uniformly over the matrix with thethird-highest NNZs per matrix row in the evaluated datasetFigure 7 shows the non-zero distribution of two matrices thatcan be blocked effectively Pres Poisson and Xenon1 thenon-zeros of both matrices are clustered primarily aroundthe diagonal An analysis of the blocks formed on ns3Dashows that most of the formed blocks capture patterns withnon-zeros separated by approximately 30 elements 2 permatrix block-row in 64 blocks and 4 per matrix block-rowin 128 blocks This suggests that even if ns3Da were tobe blocked more efficiently the effective speedup would beconstrained by the limited current summation from havingfew blocks per matrix row

G Sensitivity to Device Characteristics

Since scientific computing emphasizes high precisionwe analyze the effects of device characteristics on systemaccuracy and convergence behavior

Dynamic range We re-evaluate the convergence behaviorwith various cell ranges (ie onoff state ratios) to determinehow the system behaves when the states are closer togetheror farther apart Figure 12 shows the relative iterationcount at various configuration points The results showeffectively no sensitivity to dynamic range for single-bitcells With two-bit cells the relatively low dynamic rangeresults in some computational error as the cell states arenot sufficiently spread to tolerate the sources of error in

013 213 413 613 813

B=113 D=075K13

B=113 D=15K13

B=113 D=3K13

B=213 D=075K13

B=213 D=15K13

B=213 D=3K13

Normalized

13 Ite

rao

n13 Co

unt13

Min13 Mean13 Max13

Figure 12 Iteration count as a function of bits per cell and dynamicrange normalized to 1-bit cells with RoffRon = 1500 The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

Figure 13 Iteration count as a function of bits per cell and programmingerror normalized to 1-bit cells with no programming error The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

current summation This in turn introduces error into thecomputation which tends to hinder convergence rates

Programming precision We re-evaluate the convergencebehavior with and without cell programming errors to de-termine how results are affected by non-ideal programmingThe results show virtually no sensitivity to programmingerror for single-bit cells until the programming error reaches5 well within the achievable programming precision re-ported in the literature [58] Again the general dependencebecomes stronger as the number of bits per cell is increasedHigh programming error with the same tolerance introduceserrors into the computation hindering convergence

IX CONCLUSIONS

Previous work with memristive systems has been re-stricted to low-precision error-prone fixed-point computa-tion taking advantage of machine learning workloads thatdemand fidelity but not precision We have presented thefirst correct implementation of a double-precision floating-point compute unit based on a memristive substrate Theinnovations presented here take advantage of insights thatallow the system to achieve a 103times throughput improve-ment and a 109times reduction in energy consumption over aGPU baseline when executing iterative solvers The potentialfor such a system in scientific computing where iterativesolvers are ubiquitous is especially promising

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762

REFERENCES

[1] F Pop ldquoHigh performance numerical computing for highenergy physics a new challenge for big data sciencerdquo AAdvHigh Energy Phys vol 2014 2014

[2] R A Pielke Sr Mesoscale meteorological modeling 2nd edAcademic press 2013

[3] B Scholkopf K Tsuda and J-P Vert Kernel methods incomputational biology 1st ed MIT press 2004

[4] L Maliar and S Maliar ldquoNumerical methods for large-scaledynamic economic modelsrdquo 2014

[5] M Vogelsberger S Genel V Springel P Torrey D SijackiD Xu G Snyder S Bird D Nelson and L HernquistldquoProperties of galaxies reproduced by a hydrodynamic simu-lationrdquo Nature vol 509 pp 177ndash82 May 2014

[6] P Messina (2017 February) The US DOEexascale computing projectmdashgoals and challengeshttpswwwnistgovsitesdefaultfilesdocuments20170221messina nist 20170214final pdf [Online] Availablerdquohttpswwwnistgovsitesdefaultfilesdocuments20170221messina nist 20170214final pdfrdquo

[7] M N Bojnordi and E Ipek ldquoMemristive boltzmann machineA hardware accelerator for combinatorial optimization anddeep learningrdquo in Intl Symp on High Performance ComputerArchitecture (HPCA) March 2016

[8] A Shafiee A Nag N Muralimanohar R BalasubramonianJ P Strachan M Hu R S Williams and V SrikumarldquoISAAC A convolutional neural network accelerator within-situ analog arithmetic in crossbarsrdquo in Intl Symp onComputer Architecture (ISCA) June 2016

[9] P Chi S Li S Li T Zhang J Zhao Y Liu Y Wangand Y Xie ldquoPRIME A novel processing-in-memory archi-tecture for neural network computation in ReRAM-basedmain memoryrdquo in Intl Symp on High Performance ComputerArchitecture (HPCA) June 2016

[10] L Song X Qian H Li and Y Chen ldquoPipeLayer Apipelined ReRAM-based accelerator for deep learningrdquo inIntl Symp on High Performance Computer Architecture(HPCA) Feb 2017

[11] L Song Y Zhuo X Qian H Li and Y Chen ldquoGraphRAccelerating graph processing using ReRAMrdquo Feb 2017

[12] G Kestor R Gioiosa D J Kerbyson and A HoisieldquoQuantifying the energy cost of data movement in scientificapplicationsrdquo in Intl Symp on Workload Characterization(IISWC) Sept 2013

[13] J Hauser (2002) SoftFloat httpwwwjhauserusarithmeticSoftFloathtml [Online] Available rdquohttpwwwjhauserusarithmeticSoftFloathtmlrdquo

[14] T A Davis and Y Hu ldquoThe university of florida sparse matrixcollectionrdquo ACM Transactions on Math Softwware (TOMS)vol 38 no 1 pp 11ndash125 Dec 2011

[15] R Vuduc ldquoAutomatic performance tuning of sparse matrixkernelsrdquo PhD dissertation 2003

[16] Y Chen T Luo S Liu S Zhang L He J Wang L LiT Chen Z Xu N Sun and O Temam ldquoDaDianNaoA machine-learning supercomputerrdquo in Intl Symp on onMicroarchitecture (MICRO) Dec 2014

[17] N P Jouppi C Young N Patil D Patterson G Agrawalet al ldquoIn-datacenter performance analysis of a tensor pro-cessing unitrdquo in Intl Symp on on Computer Architecture(ISCA) June 2017

[18] M Hu J P Strachan Z Li E M Grafals N DavilaC Graves S Lam N Ge J J Yang and R S WilliamsldquoDot-product engine for neuromorphic computing Program-ming 1T1M crossbar to accelerate matrix-vector multiplica-tionrdquo in Design Automation Conference (DAC) June 2016

[19] Y Saad Iterative Methods for Sparse Linear Systems 3rd edSIAM 2003

[20] J J Dongarra P Luszczek and A Petitet ldquoThe LINPACKbenchmark Past present and futurerdquo Concurrency and Com-putation Practice and Experience vol 15 pp 803ndash8202003

[21] M R Hestenes and E Stiefel Methods of conjugate gradientsfor solving linear systems 1952 vol 49 no 6

[22] H A Van der Vorst ldquoBi-CGSTAB A fast and smoothlyconverging variant of Bi-CG for the solution of nonsymmetriclinear systemsrdquo SIAM Journal on scientific and StatisticalComputing (SISC) vol 13 no 2 pp 631ndash644 1992

[23] Y Saad and M H Schultz ldquoGMRES A generalized minimalresidual algorithm for solving nonsymmetric linear systemsrdquoSIAM Journal on scientific and Statistical Computing (SISC)vol 7 no 3 pp 856ndash869 July 1986

[24] J Kung Y Long D Kim and S Mukhopadhyay ldquoAprogrammable hardware accelerator for simulating dynamicalsystemsrdquo in Intl Symp on Computer Architecture (ISCA)June 2017

[25] Q Zhu T Graf H E Sumbul L Pileggi and F FranchettildquoAccelerating sparse matrix-matrix multiplication with 3d-stacked logic-in-memory hardwarerdquo in 2013 IEEE HighPerformance Extreme Computing Conference (HPEC) Sept2013

[26] S Kestur J D Davis and E S Chung ldquoTowards a univer-sal FPGA matrix-vector multiplication architecturerdquo in IntlSymp on Field-Programmable Custom Computing Machines(FCCM) May 2012

[27] J Fowers K Ovtcharov K Strauss E S Chung andG Stitt ldquoA high memory bandwidth FPGA accelerator forsparse matrix-vector multiplicationrdquo in Intl Symp on Field-Programmable Custom Computing Machines (FCCM) May2014

[28] R Dorrance F Ren and D Markovic ldquoA scalablesparse matrix-vector multiplication kernel for energy-efficientsparse-blas on FPGAsrdquo in ACMSIGDA International Sym-posium on Field-programmable Gate Arrays (FPGA) Feb2014

[29] B Feinberg S Wang and E Ipek ldquoMaking memristiveneural net accelerators reliablerdquo in Intl Symp on HighPerformance Computer Architecture (HPCA) Feb 2018

[30] D H Bailey ldquoHigh-precision floating-point arithmetic inscientific computationrdquo Computing in science amp engineeringvol 7 no 3 pp 54ndash61 2005

[31] D Zuras M Cowlishaw A Aiken M Applegate D BaileyS Bass D Bhandarkar M Bhat D Bindel S Boldo et alldquoIeee standard for floating-point arithmeticrdquo IEEE Std 754-2008 pp 1ndash70 2008

[32] L V Allis ldquoSearching for solutions in games and artificialintelligencerdquo 1994

[33] N Whitehead and A Fit-Florea ldquoPrecision amp performanceFloating point and IEEE 754 compliance for nvidia gpusrdquoNvidia Tech Rep 2011

[34] L Kull D Luu C Menolfi M Braendli P A FranceseT Morf M Kossel H Yueksel A Cevrero I Ozkayaand T Toifl ldquo285 a 10b 15gss pipelined-SAR ADC withbackground second-stage common-mode regulation and offsetcalibration in 14nm CMOS FinFETrdquo in Intl Solid-StateCircuits Conference (ISSCC) Feb 2017 pp 474ndash475

[35] L Kull T Toifl M Schmatz P A Francese C MenolfiM Brandli M Kossel T Morf T M Andersen andY Leblebici ldquoA 31 mw 8b 12 GSs single-channel asyn-chronous SAR ADC with alternate comparators for enhancedspeed in 32 nm digital SOI CMOSrdquo IEEE Journal of Solid-State Circuits vol 48 no 12 pp 3049ndash3058 2013

[36] M Saberi R Lotfi K Mafinezhad and W A SerdijnldquoAnalysis of power consumption and linearity in capacitivedigital-to-analog converters used in successive approximationadcsrdquo IEEE Transactions on Circuits and Systems I RegularPapers vol 58 no 8 pp 1736ndash1748 2011

[37] M R Stan and W P Burleson ldquoBus-invert coding forlow-power IOrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 3 no 1 pp 49ndash58 1995

[38] K Bhanushali and W R Davis ldquoFreePDK15 An open-source predictive process design kit for 15nm FinFET tech-nologyrdquo in Intl Symp on on Physical Design (ISPD) March2015

[39] Predictive technology model (PTM) httpptmasuedu[Online] Available httpptmasuedu

[40] D Niu C Xu N Muralimanohar N P Jouppi and Y XieldquoDesign of cross-point metal-oxide ReRAM emphasizingreliability and costrdquo in Intl Conference on Computer-AidedDesign (ICCAD) Nov 2013

[41] S-C Wong G-Y Lee and D-J Ma ldquoModeling of inter-connect capacitance delay and crosstalk in VLSIrdquo IEEETransactions on semiconductor manufacturing vol 13 no 1pp 108ndash111 2000

[42] C Xu D Niu N Muralimanohar R BalasubramonianT Zhang S Yu and Y Xie ldquoOvercoming the challengesof crossbar resistive memory architecturesrdquo in rdquoIntl Sympon High Performance Computer Architecture (HPCA)rdquo Feb2015

[43] C Yakopcic and T Taha ldquoModel for maximum crossbar sizebased on input driver impedancerdquo Electronics Letters vol 52no 1 pp 25ndash27 2015

[44] M OrsquoHalloran and R Sarpeshkar ldquoA 10-nw 12-bit accurateanalog storage cell with 10-aa leakagerdquo IEEE Journal ofSolid-State Circuits vol 39 no 11 pp 1985ndash1996 2004

[45] Leon3GRLIB httpwwwgaislercomindexphpdownloadsleongrlib [Online] Available httpwwwgaislercomindexphpdownloadsleongrlib

[46] S Galal O Shacham J S B II J Pu A Vassiliev andM Horowitz ldquoFPU generator for design space explorationrdquoin IEEE Symposium on Computer Arithmetic (ARITH) April2013

[47] M Martins J M Matos R P Ribas A Reis G SchlinkerL Rech and J Michelsen ldquoOpen cell library in 15nmFreePDK technologyrdquo in Intl Symp on Physical Design(ISPD) March 2015

[48] Synopsys ldquoSynopsys Design Compiler User GuiderdquohttpwwwsynopsyscomToolsImplementationRTLSynthesisDCUltraPages

[49] R Balasubramonian A B Kahng N MuralimanoharA Shafiee and V Srinivas ldquoCACTI 7 New tools for inter-connect exploration in innovative off-chip memoriesrdquo ACMTransactions on Architecture and Code Optimization (TACO)vol 14 no 2 pp 141ndash1425 June 2017

[50] G Fredeman D W Plass A Mathews J ViraraghavanK Reyer T J Knips T Miller E L Gerhard D Kannam-badi C Paone D Lee D J Rainey M Sperling M WhalenS Burns R R Tummuru H Ho A Cestero N ArnoldB A Khan T Kirihata and S S Iyer ldquoA 14 nm 11 mbembedded DRAM macro with 1 ns accessrdquo IEEE Journal ofSolid-State Circuits vol 51 no 1 pp 230ndash239 Jan 2016

[51] A Bakhoda G L Yuan W W L Fung H Wong and T MAamodt ldquoAnalyzing CUDA workloads using a detailed GPUsimulatorrdquo April 2009

[52] J Leng T Hetherington A ElTantawy S Gilani N S KimT M Aamodt and V J Reddi ldquoGPUWattch Enabling en-ergy optimizations in GPGPUsrdquo in Intl Symp on ComputerArchitecture (ISCA) June 2013

[53] H Anzt J Dongarra M Kreutzer G Wellein and M KhlerldquoEfficiency of general krylov methods on GPUs ndash an exper-imental studyrdquo in Intl Parallel and Distributed ProcessingSymposium Workshops (IPDPSW) May 2016

[54] ldquoNvidia tesla P100rdquo NVIDIA Corporation Tech Rep WP-08019-001 v011 2016

[55] Z Wei Y Kanzawa K Arita Y Katoh K Kawai S Mu-raoka S Mitani S Fujii K Katayama M Iijima et alldquoHighly reliable TaOx ReRAM and direct evidence of redoxreaction mechanismrdquo in Electron Devices Meeting 2008IEDM 2008 IEEE International IEEE 2008 pp 1ndash4

[56] J J Yang M-X Zhang J P Strachan F Miao M D Pick-ett R D Kelley G Medeiros-Ribeiro and R S WilliamsldquoHigh switching endurance in TaOx memristive devicesrdquoApplied Physics Letters vol 97 no 23 p 232102 2010

[57] C-W Hsu I-T Wang C-L Lo M-C Chiang W-Y JangC-H Lin and T-H Hou ldquoSelf-rectifying bipolar TaOxTiO2RRAM with superior endurance over 1012 cycles for 3d high-density storage-class memoryrdquo

[58] F Alibart L Gao B D Hoskins and D B Strukov ldquoHighprecision tuning of state for memristive devices by adaptablevariation-tolerant algorithmrdquo Nanotechnology vol 23 no 7Jan 2012

Page 10: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

1) Sparse MVM The sparse MVM operation beginsby reading the vector map for each cluster The vectormap contains a set of three-element tuples each of whichcomprises the base address of a cluster input vector thevector element index corresponding to that base addressand the size of the cluster Since the vector applied to eachcluster is contiguous the tuple is used to compute and loadall of the vector elements into the appropriate vector bufferThereafter a start signal is written to a cluster status registerand the computation begins Since larger clusters tend tohave a higher latency the vector map entries are ordered bycluster size

Once all cluster operations have started the processorbegins operating on the non-blocked entries As noted inthe previous section some values within the sparse matrixare not suitable for mapping onto memristive hardwareeither because they cannot be blocked at a sufficient densityor because they would exceed the exponent range Theseremaining values are stored in the compressed sparse rowformat [15]

When a cluster completes an interrupt is raised andserviced by an interrupt service routine running on the localprocessor Once all clusters signal completion and the localprocessor finishes processing the non-blocked elements thebank has completed its portion of the sparse MVM The localprocessors for different banks rely on a barrier synchroniza-tion scheme to determine when the entire sparse MVM iscomplete

2) Vector Dot Product To perform a vector dot producteach processor computes the relevant dot product fromits own vector elements Notably due to vector elementownership of the solution vector x and all of the derivedvectors the dot product is performed using only thosevalues that belong to a local processor Once the processorcomputes its local dot product the value is written backto shared memory for all other banks to see Each bankcomputes the final product from the set of bank dot productsindividually to simplify synchronization

3) AXPY The AXPY operation is the simplest kernelsince it can be performed purely locally Each vector elementis loaded modified and written back Barrier synchroniza-tion is used to determine completion across all banks

VII EXPERIMENTAL SETUP

We develop a parameterized and modular system modelto evaluate throughput energy efficiency and area Keyparameters of the simulated system are listed in Table I

A Circuits and Devices

We model all of the CMOS based periphery at the 15nmtechnology node using process parameters and design rulesfrom FreePDK15 [38] coupled with the ASU PredictiveTechnology Model (PTM) [39] TaOx memristor cells basedon [40] are used with linearity and dynamic range as

reported by [18] During computation cells are modeled asresistors with resistance determined either by a statisticalapproach considering block density or actual mapped bitvalues depending on whether the model is being evaluatedas part of a design-space exploration or to characterizeperformance with a particular matrix The interconnect ismodeled with a conservative lumped RC approach usingthe wire capacitance parameters provided by the ASU PTMderived from [41] To reduce path distance and latency wesplit the crossbars and place the drivers in the middle ratherthan at the periphery similarly to the double-sided groundbiasing technique from [42]

Table IACCELERATOR CONFIGURATION

System (128) banks double-precision floating point fclk =12GHz 15nm process VDD = 080V

Bank (2) times 512 times 512 clusters (4) times 256 times 256 clusters(6)times128times128 (8)times64times64 clusters 1 LEON core

Cluster 127 bit slice crossbarsCrossbar N timesN cells (log2[N ]minus 1)-bit pipelined SAR ADC

2N driversCell [18] [40] TaOx Ron = 2kΩRoff = 3MΩ Vread = 02V

Vset = minus26V Vreset = 26V Ewrite = 391nJTwrite = 5088ns

The peripheral circuitry includes 2N driver circuits basedon [43] and sized such that they are sufficient to sourcesinkthe maximum current required to program and read thememristors Also included are N sample-and-hold circuitsbased on [44] with power and area scaled for the requiredsampling frequency and precision

A 12 GHz 10-bit pipelined SAR ADC [34] is used as areference design point 12 GHz is chosen as the operatingfrequency of the ADC in order to maximize throughput andefficiency while maintaining acceptable ADC signal to noiseand distortion ratio (SNDR) at our supply voltage Areapower and the internal conversion time of each individualcrossbarrsquos ADC are scaled based on resolution as discussedin Section V-A Approximately 7 of the reported ADCpower scales exponentially with resolution with the remain-der of power either scaling linearly or remaining constantand 20 of the total power is assumed to be static basedon the analysis reported in [34] Similarly approximately23 of the reported area scales exponentially (due to thecapacitive DAC and the track-and-hold circuit) and theremainder either scales linearly or remains constant We holdthe conversion time of the ADC constant with respect toresolution rounding up to the nearest integral clock periodDuring the slack portion of the clock period (ie the timebetween conversions) we reduce the ADC power dissipationto its static power

The local processors are implemented with the open-source LEON3 core [45] Since the open-source versionof LEON3 does not include the netlist for the FPU forsynthesis we modify the FPU wrapper to the fused multiplyadd (FMA) timings for an FMA unit generated by FP-

Table IIEVALUATED MATRICES SPD MATRICES ON TOP

Matrix NNZs Rows NNZRow Blocked2cubes sphere 1647264 101492 162 497crystm03 583770 24696 236 947finan512 596992 74752 79 467G2 circuit 726674 150102 45 609nasasrb 2677324 54870 498 991Pres Poisson 715804 14822 483 964qa8fm 1660579 66127 251 928ship 001 3896496 34920 1116 664thermomech TC 711558 102158 68 08Trefethen 20000 554466 20000 277 633ASIC 100K 940621 99340 95 609bcircuit 375558 68902 54 649epb3 463625 84617 55 722GaAsH6 3381809 61349 5512 692ns3Da 1679599 20414 82 32Si34H36 5156379 97569 528 537torso2 1033473 115697 89 981venkat25 1717792 62424 275 798wang3 177168 26064 68 646xenon1 1181120 48600 243 810

The LEON3 core, FMA, and in-cluster reduction network are synthesized using the NanGate 15 nm Open Cell Library [47] with the Synopsys Design Compiler [48]. SRAM buffers within each cluster and the eDRAM memory are modeled using CACTI 7 [49] with 14 nm eDRAM parameters from [50].

B. Architecture

We compare against an NVIDIA Tesla P100 GPU accelerator modeled using GPGPU-Sim [51] and GPUWattch [52]. To evaluate the performance of the LEON3-based local processors, we implement CG and BiCG-STAB in C and compile the implementations using the LEON3 Bare-C Compiler, with timers around each kernel call. Since all bank microprocessors must synchronize during each iteration, we evaluate the latency of the bank microprocessor with the largest number of unblocked elements.

In the worst case, the preprocessing step touches every non-zero in the matrix four times, which is comparable to performing four MVM operations. Thus, we conservatively assume that preprocessing takes time equivalent to four MVM operations on the baseline system.

C. Algorithms

We evaluate the proposed accelerator on 20 matrices from the SuiteSparse matrix collection [14], using CG for symmetric positive definite (SPD) matrices and BiCG-STAB for non-SPD matrices. The matrices were selected to showcase a range of matrix structures, sizes, and densities across a variety of applications. When available, we use the b vector provided by the collection; when b is unavailable, we use a b vector of all 1s, as in prior work [53].

The solvers running on the proposed accelerator converge in the same number of iterations as they do when running on the GPU, since both systems perform computation at the same level of precision.

Figure 8. Speedup over the GPU baseline (per-matrix bars with the geometric mean, log scale).

VIII. EVALUATION

We evaluate the system performance, energy, area, and robustness to different sources of error.

A. Execution Time

Figure 8 shows that the accelerator achieves an average speedup of 10.3× over the GPU baseline across the set of 20 matrices. One important insight from these results is that the average number of non-zeros per matrix row is a poor proxy for potential speedup. The matrix with the greatest speedup in the dataset (torso2) has just 8.9 NNZ per matrix row, while the two quantum chemistry matrices (GaAsH6 and Si34H36) have more than 50 NNZ per row yet show above-average but not exceptional speedup, due to their lower blocking efficiencies (i.e., the percentage of the non-zeros that are blocked) of 69% and 53%, respectively. Since each bank microprocessor is designed to work in conjunction with its clusters, matrices with fewer rows perform worse than larger matrices at the same blocking efficiency. As expected, blocking efficiency is the most reliable predictor of speedup.

Notably, Table II contains two matrices that are effectively unblocked by the proposed preprocessing step: thermomech_TC and ns3Da, with respective blocking efficiencies of 0.8% and 3.2%. Since the accelerator is optimized for in-situ rather than digital MVM, and since the majority of the elements in these two matrices cannot be mapped to the proposed heterogeneous substrate, the accelerator is more than an order of magnitude less efficient than the GPU baseline on these matrices. Rather than attempt to perform MVM on matrices so clearly ill-suited to the accelerator, we assume that the proposed accelerator co-exists with a GPU in the same node and that the GPU may be used for the rare matrices that do not block effectively. Since the blocking algorithm has a worst-case cost of four MVM operations (Section V-B1), and since the actual cost reaches the worst case if and when the matrix cannot be blocked, we can choose whether to perform the computation on the accelerator or the GPU after the blocking has completed. This results in a performance loss of less than 3% for both matrices, far better than if the accelerator were used directly.

Figure 9. Accelerator energy consumption normalized to the GPU baseline (per-matrix bars with the geometric mean, log scale).

B. Energy Consumption

Energy consumption results are shown in Figure 9. Total energy consumption is improved by 14.2× on the 18 matrices executed on the accelerator, and by 10.9× over the entire 20-matrix dataset. The impact of differing exponent ranges can be seen in the energy results. Notably, the nasasrb matrix has a 3% higher blocking efficiency than Pres_Poisson; however, Pres_Poisson shows nearly twice the improvement in energy dissipation. This is due to the much narrower exponent range of Pres_Poisson: it never requires more than 14 bits of padding for storage, which also indicates a much narrower dynamic range for the computation and, by extension, fewer vector bit slices needed per cluster. By contrast, nasasrb has multiple blocks with excluded elements (due to those elements requiring greater than 117 bits), and in general its average number of stored bits per cluster is 107 (30 more bits than the worst block of Pres_Poisson). This strong dependence on the exponent range indicates a potential benefit of fixed-point computation over floating point. As discussed in Section IV-B, floating point is designed for broad compatibility and exceeds the requirements of many applications. By operating on fixed-point representations of the data for much of the computation, the accelerator implicitly creates a problem-specific subset of the floating-point format, with significant performance benefits and without discarding any information that would cause the final result to differ from that of an end-to-end floating-point calculation.
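As an illustration of this padding arithmetic (a sketch of our reading of the exponent-range discussion; the helper below is hypothetical and not from the paper), aligning a block's values to a shared binary point costs one padding bit per binade of exponent range on top of the 53-bit double-precision significand:

    import math

    # Hypothetical helper: a narrow exponent range means fewer stored bits,
    # since padding grows with the spread between the largest and smallest
    # binary exponents in the block.
    def stored_bits(values, significand_bits=53):
        exps = [math.frexp(abs(v))[1] for v in values if v != 0.0]
        return significand_bits + (max(exps) - min(exps))

    print(stored_bits([0.125, 3.0, 1000.0]))  # exponent range of 12 -> 65 bits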

C. Area Footprint

The area footprint of the accelerator is computed using the area model described in Section VII-A. The overall system area for the 128-bank system described above is 539 mm², which is lower than the 610 mm² die size of the baseline NVIDIA Tesla P100 GPU [54]. Notably, unlike in prior work on memristive accelerators, the crossbars and peripheral circuitry, rather than the ADCs, are the dominant area consumer, at 54.1% of overall cluster area. This is due to two important differences from prior work: 1) due to computational invert coding (Section V-B2), the ADC precision can be reduced by one bit, which has an exponential effect on ADC area; and 2) wider operands propagate through the reduction network.

Table III
AREA, ENERGY, AND LATENCY OF DIFFERENT CROSSBAR SIZES (INCLUDES THE ADC)

Size   Area [mm²]   Energy [pJ]   Latency [ns]
64     0.00078      2.80          5.33
128    0.00103      6.52          10.7
256    0.00162      15.0          21.3
512    0.00352      34.2          42.7

Figure 10. Overhead of preprocessing and write time as a percent of total solve time on the accelerator (per-matrix stacked bars of write time and preprocessing time, 0-20%).

The per-bank processors and global memory buffer consume 13.6% of the total system area. This result suggests that a more powerful processor than the LEON3 core could be integrated, based on application demands, without substantially increasing the system area. Table III summarizes the area, latency, and energy characteristics of the different crossbar sizes used in the different clusters.

D. Initialization Overhead

Figure 10 shows the overhead of setting up the iterative computation, including matrix preprocessing time and programming time, normalized to the entire solve time of each linear system on the accelerator. Since iterative algorithms take thousands of iterations to converge, the overhead of writing is effectively amortized, leading to an overhead of less than 20% across the entire matrix set that generally falls as the size of the linear system increases, because the number of iterations required to converge grows faster than the write requirements of the array. For large linear systems (the most likely use case of the proposed accelerator), the overhead is typically less than 4% of the total solver runtime. The overhead may be even lower in practice, as multiple time steps are generally simulated to examine the time evolution of the system. In these time-stepped computations, only a subset of non-zeros change each step and the matrix structure is typically preserved, requiring minimal re-processing. A simple amortization model is sketched below.
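As a rough model of this amortization (our own formulation, not an equation from the paper), the setup fraction of the total solve time is

    \text{overhead} = \frac{T_{\text{pre}} + T_{\text{write}}}{T_{\text{pre}} + T_{\text{write}} + N_{\text{iter}}\, T_{\text{iter}}}

which falls toward zero as the iteration count N_iter grows, consistent with the sub-4% overheads observed for large systems.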

E. System Endurance

Figure 11. Sparsity and blocking patterns of ns3Da: (a) ns3Da sparsity; (b) ns3Da blocking.

Although memristive devices have finite switching endurance, the iterative nature of the algorithms allows a single matrix to be programmed once and reused throughout the computation. Even under a conservative set of assumptions, where each array is fully rewritten for each new matrix, ignoring both the sparsity of the blocks and the time-step behavior mentioned above, the system lifetime is sufficient: given a memristive device with a conservative 10^9 writes [55]-[57], the total system lifetime is over 100 years, assuming the system runs constantly and a new matrix is fully rewritten between solves.
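A back-of-the-envelope check, under an assumed lower bound of roughly 3 seconds per solve (an illustrative assumption on our part; the paper does not report per-solve times):

    10^{9}\ \text{writes} \times 3\ \text{s/solve} \approx 3 \times 10^{9}\ \text{s} \approx 95\ \text{years}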

F. Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and the blocking pattern for ns3Da, a matrix with a particularly poor blocking efficiency. The figure shows that, despite the relatively high density of the matrix, the values do not form dense sub-blocks that can be blocked efficiently. Instead, the values are distributed relatively uniformly over the matrix, which has the third-highest NNZ per matrix row in the evaluated dataset. Figure 7 shows the non-zero distribution of two matrices that can be blocked effectively, Pres_Poisson and xenon1; the non-zeros of both matrices are clustered primarily around the diagonal. An analysis of the blocks formed on ns3Da shows that most of the formed blocks capture patterns with non-zeros separated by approximately 30 elements: 2 per matrix block-row in 64×64 blocks, and 4 per matrix block-row in 128×128 blocks. This suggests that even if ns3Da were blocked more efficiently, the effective speedup would be constrained by the limited current summation from having few blocks per matrix row.

G. Sensitivity to Device Characteristics

Since scientific computing emphasizes high precision, we analyze the effects of device characteristics on system accuracy and convergence behavior.

Dynamic range: We re-evaluate the convergence behavior with various cell ranges (i.e., on/off state ratios) to determine how the system behaves when the states are closer together or farther apart. Figure 12 shows the relative iteration count at various configuration points. The results show effectively no sensitivity to dynamic range for single-bit cells. With two-bit cells, the relatively low dynamic range results in some computational error, as the cell states are not sufficiently spread to tolerate the sources of error in current summation. This in turn introduces error into the computation, which tends to hinder convergence.

Figure 12. Iteration count as a function of bits per cell (B) and dynamic range (D), normalized to 1-bit cells with Roff/Ron = 1500. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.

Figure 13. Iteration count as a function of bits per cell and programming error, normalized to 1-bit cells with no programming error. The minimum, maximum, and mean iteration counts are reported over 100 Monte Carlo simulation experiments.
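The intuition behind this sensitivity can be shown with a small sketch (an assumption-level model of our own, not the authors' error analysis): 2^B conductance levels must share the interval between 1/Roff and 1/Ron, so the spacing between adjacent levels shrinks both with more bits per cell and with a smaller on/off ratio:

    # Level spacing for the B and D configuration points of Figure 12.
    def level_spacing(bits_per_cell, ron, roff):
        g_min, g_max = 1.0 / roff, 1.0 / ron
        return (g_max - g_min) / (2 ** bits_per_cell - 1)

    for ratio in (750, 1500, 3000):   # the D = 0.75K / 1.5K / 3K points
        for bits in (1, 2):           # B = 1, 2
            print(bits, ratio, level_spacing(bits, 2e3, 2e3 * ratio))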

Programming precision: We re-evaluate the convergence behavior with and without cell programming errors to determine how the results are affected by non-ideal programming. The results show virtually no sensitivity to programming error for single-bit cells until the programming error reaches 5%, well within the achievable programming precision reported in the literature [58]. Again, the dependence becomes stronger as the number of bits per cell is increased: higher programming error at the same tolerance introduces errors into the computation, hindering convergence.
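A Monte Carlo experiment in the spirit of Figure 13 can be sketched as follows (the Gaussian error model and magnitudes are our assumptions, used only to illustrate how programming error propagates into a column's current sum):

    import numpy as np

    # Perturb each programmed conductance by a relative Gaussian error and
    # report the worst relative error in one column's summed conductance
    # over 100 trials.
    def worst_column_error(n_cells, sigma_rel, trials=100, seed=1):
        rng = np.random.default_rng(seed)
        g = np.full(n_cells, 1.0 / 2e3)  # all cells programmed to Ron
        ideal = g.sum()
        errs = [abs((g * (1.0 + sigma_rel * rng.standard_normal(n_cells))).sum()
                    - ideal) / ideal for _ in range(trials)]
        return max(errs)

    for sigma in (0.01, 0.05):  # 1% vs. 5% programming error
        print(sigma, worst_column_error(128, sigma))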

IX. CONCLUSIONS

Previous work on memristive systems has been restricted to low-precision, error-prone fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here exploit insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: a new challenge for big data science," Adv. High Energy Phys., vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.

[6] P. Messina. (2017, February) The US DOE exascale computing project – goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser. (2002) SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 3rd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5 GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] LEON3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide." [Online]. Available: http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Intl. Symp. on Performance Analysis of Systems and Software (ISPASS), April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs – an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001_v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10^12 cycles for 3D high-density storage-class memory."

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.

Page 11: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

Table IIEVALUATED MATRICES SPD MATRICES ON TOP

Matrix NNZs Rows NNZRow Blocked2cubes sphere 1647264 101492 162 497crystm03 583770 24696 236 947finan512 596992 74752 79 467G2 circuit 726674 150102 45 609nasasrb 2677324 54870 498 991Pres Poisson 715804 14822 483 964qa8fm 1660579 66127 251 928ship 001 3896496 34920 1116 664thermomech TC 711558 102158 68 08Trefethen 20000 554466 20000 277 633ASIC 100K 940621 99340 95 609bcircuit 375558 68902 54 649epb3 463625 84617 55 722GaAsH6 3381809 61349 5512 692ns3Da 1679599 20414 82 32Si34H36 5156379 97569 528 537torso2 1033473 115697 89 981venkat25 1717792 62424 275 798wang3 177168 26064 68 646xenon1 1181120 48600 243 810

GEN [46] The LEON3 core FMA and in-cluster reductionnetwork are synthesized using the NanGate 15nm OpenCell Library [47] with the Synopsys Design Compiler [48]SRAM buffers within each cluster and the eDRAM mem-ory are modeled using CACTI7 [49] using 14nm eDRAMparameters from [50]

B Architecture

We compare against an nVidia Tesla P100 GPU accelera-tor modeled using GPGPUSim [51] and GPGPUWattch [52]To evaluate the performance of the LEON3 based localprocessors we implement CG and BiCG-STAB in C andcompile the implementations using the LEON3 Bare CCompiler with timers around each kernel call Since all bankmicroprocessors must synchronize during each iteration weevaluate the latency of the bank microprocessor with thelargest number of unblocked elements

In the worst case the preprocessing step touches everynon-zero in the matrix four times which is comparable toperforming four MVM operations Thus we conservativelyassume that preprocessing takes time equivalent to fourMVM operations on the baseline system

C Algorithms

We evaluate the proposed accelerator on 20 matrices fromthe SuiteSparse matrix collection [14] using CG for sym-metric positive definite matrices (SPD) and BiCG-STAB fornon-SPD matrices The matrices were selected to showcasea range of matrix structures sizes and densities across avariety of applications When available we use the b vectorprovided by the collection when b is unavailable we use ab vector of all 1s as in prior work [53]

The solvers running on the proposed accelerator convergein the same number of iterations as they do when running

103

01

1

10

100

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

NS

peed

upoverG

PU

Figure 8 Speedup over the GPU baseline

on the GPU since both systems perform computation at thesame level of precision

VIII EVALUATION

We evaluate the system performance energy area androbustness to different sources of error

A Execution Time

Figure 8 shows that the accelerator achieves an averagespeedup of 103times as compared to the GPU baseline acrossthe set of 20 matrices One important insight from theseresults is that the average number of non-zeros per matrixrow is a poor proxy for potential speedup The matrix withthe greatest speedup in the dataset (torso2) has just 89 NNZper matrix row and the two quantum chemistry matrices(GaSsH6 and Si34H36) have more than 50 NNZ per rowshowing above-average but not exceptional speedup due totheir lower blocking efficiencies (ie the percent of the non-zeros that are blocked) 69 and 53 respectively Sinceeach bank microprocessor is designed to work in conjunctionwith its clusters matrices with fewer rows perform worsethan larger matrices at the same blocking efficiency Asexpected blocking efficiency is the most reliable predictorof speedup

Notably in Table II there are two matrices that are effec-tively unblocked by the proposed preprocessing step ther-motech TC and ns3Da with respective blocking efficienciesof 08 and 32 Since the accelerator is optimized for in-situ rather than digital MVM and since the majority of theelements in these two matrices cannot be mapped to theproposed heterogeneous substrate the accelerator is morethan an order of magnitude less efficient than the GPUbaseline on these matrices Rather than attempt to performMVM on matrices so clearly ill-suited to the acceleratorwe assume that the proposed accelerator co-exists with aGPU in the same node and that the GPU may be usedfor the rare matrices which do not block effectively Sincethe blocking algorithm has a worst-case performance offour MVM operations (Section V-B1) and since the actualcomplexity reaches the worst case if and when the matrixcannot be blocked we can choose whether to perform thecomputation on the accelerator or the GPU after the blockinghas completed This results in a performance loss of less than

0092

1E-02

1E-01

1E+00

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

N

EnergyNormalized

toGPU

Figure 9 Accelerator energy consumption normalized to the GPU baseline

3 for both matrices far better than if the accelerator wereused directly

B Energy Consumption

Energy consumption results are shown in Figure 9 To-tal energy consumption is improved by 142times on the 18matrices executed on the accelerator and by 109times overthe entire 20-matrix dataset The impacts of differing ex-ponent ranges can be seen in the energy results Notablythe nasasrb matrix has a 3 higher blocking efficiencythan Pres Poisson however Pres Poisson shows nearlytwice the improvement in energy dissipation this is dueto the much narrower exponent range of Pres PoissonPres Poisson never requires more than 14 bits of paddingfor storage which also indicates a much narrower dynamicrange for the computation and by extension fewer vectorbit slices that are needed per cluster By contrast nasasrbhas multiple blocks with excluded elements (due to thoseelements requiring greater than 117 bits) and in generalthe average number of stored bits per cluster is 107 (30more bits than the worst block of Pres Poisson) This strongdependence on the exponent range indicates a potentialbenefit of the fixed-point computation over floating pointAs discussed in Section IV-B floating point is designed forbroad compatibility and exceeds the requirements of manyapplications By operating on fixed-point representations ofdata for much of the computation the accelerator implicitlycreates a problem-specific subset of the floating-point formatwith significant performance benefits without discarding anyinformation that would cause the final result to differ fromthat of an end-to-end floating-point calculation

C Area Footprint

The area footprint of the accelerator is computed usingthe area model described in Section VII-A The overallsystem area for the 128-bank system described above is539mm2 which is lower than the 610mm2 die size of thebaseline Nvidia Tesla P100 GPU [54] Notably unlike inprior work on memristive accelerators the crossbars andperipheral circuitry are the dominant area consumermdashratherthan the ADCsmdashwith a total of 541 of total overall clusterarea This is due to two important differences from priorwork 1) due to computational invert coding (Section V-B2)

Table IIIAREA ENERGY AND LATENCY OF DIFFERENT CROSSBAR SIZES

(INCLUDES THE ADC)

Size Area [mm2] Energy [pJ] Latency [nsec]64 000078 280 533128 000103 652 107256 000162 150 213512 000352 342 427

0

5

10

15

20

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

AcceleratorC

ompu

tatio

n

Overhead

WriteTime PreprocessingTime

Figure 10 Overhead of preprocessing and write time as a percent of totalsolve time on the accelerator

the ADC precision can be reduced by one bit which has anexponential effect on ADC area and 2) the wider operandsthat propagate through the reduction network The per-bankprocessors and global memory buffer consume 136 of thetotal system area This result suggests that a more powerfulprocessor than the LEON3 core may be integrated based onapplication demands without substantially increasing systemarea Table III summarizes the area as well as the latencyand energy characteristics for different crossbar sizes usedin different clusters

D Initialization Overhead

Figure 10 shows the overhead of setting up the itera-tive computation including matrix preprocessing time andprogramming time normalized to the entire solve timeof each linear system on the accelerator Since iterativealgorithms take thousands of iterations to converge theoverhead of writing is effectively amortized leading toan overhead of less than 20 across the entire matrixset and generally falling as the size of the linear systemincreases By extension the number of iterations requiredto converge increases faster than the write requirements ofthe array For large linear systems (the most likely use caseof the proposed accelerator) the overhead is typically lessthan 4 of the total solver runtime The overhead may beeven less in practice as multiple time steps are generallysimulated to examine the time-evolution of the linear systemIn these time-stepped computations only a subset of non-zeros change each step and the matrix structure is typicallypreserved requiring minimal re-processing

E System Endurance

Although memristive devices have finite switching en-durance the iterative nature of the algorithms allows a singlematrix to be programmed once and reused throughout thecomputation Even under a conservative set of assumptions

(a) ns3Da sparsity (b) ns3Da blockingFigure 11 Sparsity and blocking patterns of ns3Da

where each array is fully rewritten for each new matrixignoring both the sparsity of the blocks and the time stepbehavior mentioned above the system lifetime is suffi-cient given a memristive device with a conservative 109

writes [55]ndash[57] the total system lifetime is over 100 yearsassuming the system runs constantly and a new matrix isfully rewritten between solves

F Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and theblocking pattern for ns3Da a matrix with a particularly poorblocking efficiency The figure shows that despite the highrelative density of the matrix the values do not form densesub-blocks that can be blocked efficiently Instead the valuesare distributed relatively uniformly over the matrix with thethird-highest NNZs per matrix row in the evaluated datasetFigure 7 shows the non-zero distribution of two matrices thatcan be blocked effectively Pres Poisson and Xenon1 thenon-zeros of both matrices are clustered primarily aroundthe diagonal An analysis of the blocks formed on ns3Dashows that most of the formed blocks capture patterns withnon-zeros separated by approximately 30 elements 2 permatrix block-row in 64 blocks and 4 per matrix block-rowin 128 blocks This suggests that even if ns3Da were tobe blocked more efficiently the effective speedup would beconstrained by the limited current summation from havingfew blocks per matrix row

G Sensitivity to Device Characteristics

Since scientific computing emphasizes high precisionwe analyze the effects of device characteristics on systemaccuracy and convergence behavior

Dynamic range We re-evaluate the convergence behaviorwith various cell ranges (ie onoff state ratios) to determinehow the system behaves when the states are closer togetheror farther apart Figure 12 shows the relative iterationcount at various configuration points The results showeffectively no sensitivity to dynamic range for single-bitcells With two-bit cells the relatively low dynamic rangeresults in some computational error as the cell states arenot sufficiently spread to tolerate the sources of error in

013 213 413 613 813

B=113 D=075K13

B=113 D=15K13

B=113 D=3K13

B=213 D=075K13

B=213 D=15K13

B=213 D=3K13

Normalized

13 Ite

rao

n13 Co

unt13

Min13 Mean13 Max13

Figure 12 Iteration count as a function of bits per cell and dynamicrange normalized to 1-bit cells with RoffRon = 1500 The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

Figure 13 Iteration count as a function of bits per cell and programmingerror normalized to 1-bit cells with no programming error The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

current summation This in turn introduces error into thecomputation which tends to hinder convergence rates

Programming precision We re-evaluate the convergencebehavior with and without cell programming errors to de-termine how results are affected by non-ideal programmingThe results show virtually no sensitivity to programmingerror for single-bit cells until the programming error reaches5 well within the achievable programming precision re-ported in the literature [58] Again the general dependencebecomes stronger as the number of bits per cell is increasedHigh programming error with the same tolerance introduceserrors into the computation hindering convergence

IX CONCLUSIONS

Previous work with memristive systems has been re-stricted to low-precision error-prone fixed-point computa-tion taking advantage of machine learning workloads thatdemand fidelity but not precision We have presented thefirst correct implementation of a double-precision floating-point compute unit based on a memristive substrate Theinnovations presented here take advantage of insights thatallow the system to achieve a 103times throughput improve-ment and a 109times reduction in energy consumption over aGPU baseline when executing iterative solvers The potentialfor such a system in scientific computing where iterativesolvers are ubiquitous is especially promising

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762

REFERENCES

[1] F Pop ldquoHigh performance numerical computing for highenergy physics a new challenge for big data sciencerdquo AAdvHigh Energy Phys vol 2014 2014

[2] R A Pielke Sr Mesoscale meteorological modeling 2nd edAcademic press 2013

[3] B Scholkopf K Tsuda and J-P Vert Kernel methods incomputational biology 1st ed MIT press 2004

[4] L Maliar and S Maliar ldquoNumerical methods for large-scaledynamic economic modelsrdquo 2014

[5] M Vogelsberger S Genel V Springel P Torrey D SijackiD Xu G Snyder S Bird D Nelson and L HernquistldquoProperties of galaxies reproduced by a hydrodynamic simu-lationrdquo Nature vol 509 pp 177ndash82 May 2014

[6] P Messina (2017 February) The US DOEexascale computing projectmdashgoals and challengeshttpswwwnistgovsitesdefaultfilesdocuments20170221messina nist 20170214final pdf [Online] Availablerdquohttpswwwnistgovsitesdefaultfilesdocuments20170221messina nist 20170214final pdfrdquo

[7] M N Bojnordi and E Ipek ldquoMemristive boltzmann machineA hardware accelerator for combinatorial optimization anddeep learningrdquo in Intl Symp on High Performance ComputerArchitecture (HPCA) March 2016

[8] A Shafiee A Nag N Muralimanohar R BalasubramonianJ P Strachan M Hu R S Williams and V SrikumarldquoISAAC A convolutional neural network accelerator within-situ analog arithmetic in crossbarsrdquo in Intl Symp onComputer Architecture (ISCA) June 2016

[9] P Chi S Li S Li T Zhang J Zhao Y Liu Y Wangand Y Xie ldquoPRIME A novel processing-in-memory archi-tecture for neural network computation in ReRAM-basedmain memoryrdquo in Intl Symp on High Performance ComputerArchitecture (HPCA) June 2016

[10] L Song X Qian H Li and Y Chen ldquoPipeLayer Apipelined ReRAM-based accelerator for deep learningrdquo inIntl Symp on High Performance Computer Architecture(HPCA) Feb 2017

[11] L Song Y Zhuo X Qian H Li and Y Chen ldquoGraphRAccelerating graph processing using ReRAMrdquo Feb 2017

[12] G Kestor R Gioiosa D J Kerbyson and A HoisieldquoQuantifying the energy cost of data movement in scientificapplicationsrdquo in Intl Symp on Workload Characterization(IISWC) Sept 2013

[13] J Hauser (2002) SoftFloat httpwwwjhauserusarithmeticSoftFloathtml [Online] Available rdquohttpwwwjhauserusarithmeticSoftFloathtmlrdquo

[14] T A Davis and Y Hu ldquoThe university of florida sparse matrixcollectionrdquo ACM Transactions on Math Softwware (TOMS)vol 38 no 1 pp 11ndash125 Dec 2011

[15] R Vuduc ldquoAutomatic performance tuning of sparse matrixkernelsrdquo PhD dissertation 2003

[16] Y Chen T Luo S Liu S Zhang L He J Wang L LiT Chen Z Xu N Sun and O Temam ldquoDaDianNaoA machine-learning supercomputerrdquo in Intl Symp on onMicroarchitecture (MICRO) Dec 2014

[17] N P Jouppi C Young N Patil D Patterson G Agrawalet al ldquoIn-datacenter performance analysis of a tensor pro-cessing unitrdquo in Intl Symp on on Computer Architecture(ISCA) June 2017

[18] M Hu J P Strachan Z Li E M Grafals N DavilaC Graves S Lam N Ge J J Yang and R S WilliamsldquoDot-product engine for neuromorphic computing Program-ming 1T1M crossbar to accelerate matrix-vector multiplica-tionrdquo in Design Automation Conference (DAC) June 2016

[19] Y Saad Iterative Methods for Sparse Linear Systems 3rd edSIAM 2003

[20] J J Dongarra P Luszczek and A Petitet ldquoThe LINPACKbenchmark Past present and futurerdquo Concurrency and Com-putation Practice and Experience vol 15 pp 803ndash8202003

[21] M R Hestenes and E Stiefel Methods of conjugate gradientsfor solving linear systems 1952 vol 49 no 6

[22] H A Van der Vorst ldquoBi-CGSTAB A fast and smoothlyconverging variant of Bi-CG for the solution of nonsymmetriclinear systemsrdquo SIAM Journal on scientific and StatisticalComputing (SISC) vol 13 no 2 pp 631ndash644 1992

[23] Y Saad and M H Schultz ldquoGMRES A generalized minimalresidual algorithm for solving nonsymmetric linear systemsrdquoSIAM Journal on scientific and Statistical Computing (SISC)vol 7 no 3 pp 856ndash869 July 1986

[24] J Kung Y Long D Kim and S Mukhopadhyay ldquoAprogrammable hardware accelerator for simulating dynamicalsystemsrdquo in Intl Symp on Computer Architecture (ISCA)June 2017

[25] Q Zhu T Graf H E Sumbul L Pileggi and F FranchettildquoAccelerating sparse matrix-matrix multiplication with 3d-stacked logic-in-memory hardwarerdquo in 2013 IEEE HighPerformance Extreme Computing Conference (HPEC) Sept2013

[26] S Kestur J D Davis and E S Chung ldquoTowards a univer-sal FPGA matrix-vector multiplication architecturerdquo in IntlSymp on Field-Programmable Custom Computing Machines(FCCM) May 2012

[27] J Fowers K Ovtcharov K Strauss E S Chung andG Stitt ldquoA high memory bandwidth FPGA accelerator forsparse matrix-vector multiplicationrdquo in Intl Symp on Field-Programmable Custom Computing Machines (FCCM) May2014

[28] R Dorrance F Ren and D Markovic ldquoA scalablesparse matrix-vector multiplication kernel for energy-efficientsparse-blas on FPGAsrdquo in ACMSIGDA International Sym-posium on Field-programmable Gate Arrays (FPGA) Feb2014

[29] B Feinberg S Wang and E Ipek ldquoMaking memristiveneural net accelerators reliablerdquo in Intl Symp on HighPerformance Computer Architecture (HPCA) Feb 2018

[30] D H Bailey ldquoHigh-precision floating-point arithmetic inscientific computationrdquo Computing in science amp engineeringvol 7 no 3 pp 54ndash61 2005

[31] D Zuras M Cowlishaw A Aiken M Applegate D BaileyS Bass D Bhandarkar M Bhat D Bindel S Boldo et alldquoIeee standard for floating-point arithmeticrdquo IEEE Std 754-2008 pp 1ndash70 2008

[32] L V Allis ldquoSearching for solutions in games and artificialintelligencerdquo 1994

[33] N Whitehead and A Fit-Florea ldquoPrecision amp performanceFloating point and IEEE 754 compliance for nvidia gpusrdquoNvidia Tech Rep 2011

[34] L Kull D Luu C Menolfi M Braendli P A FranceseT Morf M Kossel H Yueksel A Cevrero I Ozkayaand T Toifl ldquo285 a 10b 15gss pipelined-SAR ADC withbackground second-stage common-mode regulation and offsetcalibration in 14nm CMOS FinFETrdquo in Intl Solid-StateCircuits Conference (ISSCC) Feb 2017 pp 474ndash475

[35] L Kull T Toifl M Schmatz P A Francese C MenolfiM Brandli M Kossel T Morf T M Andersen andY Leblebici ldquoA 31 mw 8b 12 GSs single-channel asyn-chronous SAR ADC with alternate comparators for enhancedspeed in 32 nm digital SOI CMOSrdquo IEEE Journal of Solid-State Circuits vol 48 no 12 pp 3049ndash3058 2013

[36] M Saberi R Lotfi K Mafinezhad and W A SerdijnldquoAnalysis of power consumption and linearity in capacitivedigital-to-analog converters used in successive approximationadcsrdquo IEEE Transactions on Circuits and Systems I RegularPapers vol 58 no 8 pp 1736ndash1748 2011

[37] M R Stan and W P Burleson ldquoBus-invert coding forlow-power IOrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 3 no 1 pp 49ndash58 1995

[38] K Bhanushali and W R Davis ldquoFreePDK15 An open-source predictive process design kit for 15nm FinFET tech-nologyrdquo in Intl Symp on on Physical Design (ISPD) March2015

[39] Predictive technology model (PTM) httpptmasuedu[Online] Available httpptmasuedu

[40] D Niu C Xu N Muralimanohar N P Jouppi and Y XieldquoDesign of cross-point metal-oxide ReRAM emphasizingreliability and costrdquo in Intl Conference on Computer-AidedDesign (ICCAD) Nov 2013

[41] S-C Wong G-Y Lee and D-J Ma ldquoModeling of inter-connect capacitance delay and crosstalk in VLSIrdquo IEEETransactions on semiconductor manufacturing vol 13 no 1pp 108ndash111 2000

[42] C Xu D Niu N Muralimanohar R BalasubramonianT Zhang S Yu and Y Xie ldquoOvercoming the challengesof crossbar resistive memory architecturesrdquo in rdquoIntl Sympon High Performance Computer Architecture (HPCA)rdquo Feb2015

[43] C Yakopcic and T Taha ldquoModel for maximum crossbar sizebased on input driver impedancerdquo Electronics Letters vol 52no 1 pp 25ndash27 2015

[44] M OrsquoHalloran and R Sarpeshkar ldquoA 10-nw 12-bit accurateanalog storage cell with 10-aa leakagerdquo IEEE Journal ofSolid-State Circuits vol 39 no 11 pp 1985ndash1996 2004

[45] Leon3GRLIB httpwwwgaislercomindexphpdownloadsleongrlib [Online] Available httpwwwgaislercomindexphpdownloadsleongrlib

[46] S Galal O Shacham J S B II J Pu A Vassiliev andM Horowitz ldquoFPU generator for design space explorationrdquoin IEEE Symposium on Computer Arithmetic (ARITH) April2013

[47] M Martins J M Matos R P Ribas A Reis G SchlinkerL Rech and J Michelsen ldquoOpen cell library in 15nmFreePDK technologyrdquo in Intl Symp on Physical Design(ISPD) March 2015

[48] Synopsys ldquoSynopsys Design Compiler User GuiderdquohttpwwwsynopsyscomToolsImplementationRTLSynthesisDCUltraPages

[49] R Balasubramonian A B Kahng N MuralimanoharA Shafiee and V Srinivas ldquoCACTI 7 New tools for inter-connect exploration in innovative off-chip memoriesrdquo ACMTransactions on Architecture and Code Optimization (TACO)vol 14 no 2 pp 141ndash1425 June 2017

[50] G Fredeman D W Plass A Mathews J ViraraghavanK Reyer T J Knips T Miller E L Gerhard D Kannam-badi C Paone D Lee D J Rainey M Sperling M WhalenS Burns R R Tummuru H Ho A Cestero N ArnoldB A Khan T Kirihata and S S Iyer ldquoA 14 nm 11 mbembedded DRAM macro with 1 ns accessrdquo IEEE Journal ofSolid-State Circuits vol 51 no 1 pp 230ndash239 Jan 2016

[51] A Bakhoda G L Yuan W W L Fung H Wong and T MAamodt ldquoAnalyzing CUDA workloads using a detailed GPUsimulatorrdquo April 2009

[52] J Leng T Hetherington A ElTantawy S Gilani N S KimT M Aamodt and V J Reddi ldquoGPUWattch Enabling en-ergy optimizations in GPGPUsrdquo in Intl Symp on ComputerArchitecture (ISCA) June 2013

[53] H Anzt J Dongarra M Kreutzer G Wellein and M KhlerldquoEfficiency of general krylov methods on GPUs ndash an exper-imental studyrdquo in Intl Parallel and Distributed ProcessingSymposium Workshops (IPDPSW) May 2016

[54] ldquoNvidia tesla P100rdquo NVIDIA Corporation Tech Rep WP-08019-001 v011 2016

[55] Z Wei Y Kanzawa K Arita Y Katoh K Kawai S Mu-raoka S Mitani S Fujii K Katayama M Iijima et alldquoHighly reliable TaOx ReRAM and direct evidence of redoxreaction mechanismrdquo in Electron Devices Meeting 2008IEDM 2008 IEEE International IEEE 2008 pp 1ndash4

[56] J J Yang M-X Zhang J P Strachan F Miao M D Pick-ett R D Kelley G Medeiros-Ribeiro and R S WilliamsldquoHigh switching endurance in TaOx memristive devicesrdquoApplied Physics Letters vol 97 no 23 p 232102 2010

[57] C-W Hsu I-T Wang C-L Lo M-C Chiang W-Y JangC-H Lin and T-H Hou ldquoSelf-rectifying bipolar TaOxTiO2RRAM with superior endurance over 1012 cycles for 3d high-density storage-class memoryrdquo

[58] F Alibart L Gao B D Hoskins and D B Strukov ldquoHighprecision tuning of state for memristive devices by adaptablevariation-tolerant algorithmrdquo Nanotechnology vol 23 no 7Jan 2012

Page 12: Enabling Scientific Computing on Memristive …reducing the required memristor resolution [7]–[10]. An example of bit slicing is shown in Equation 1, where each 3-bit matrix coefficient

0092

1E-02

1E-01

1E+00

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

G-MEA

N

EnergyNormalized

toGPU

Figure 9 Accelerator energy consumption normalized to the GPU baseline

3 for both matrices far better than if the accelerator wereused directly

B Energy Consumption

Energy consumption results are shown in Figure 9 To-tal energy consumption is improved by 142times on the 18matrices executed on the accelerator and by 109times overthe entire 20-matrix dataset The impacts of differing ex-ponent ranges can be seen in the energy results Notablythe nasasrb matrix has a 3 higher blocking efficiencythan Pres Poisson however Pres Poisson shows nearlytwice the improvement in energy dissipation this is dueto the much narrower exponent range of Pres PoissonPres Poisson never requires more than 14 bits of paddingfor storage which also indicates a much narrower dynamicrange for the computation and by extension fewer vectorbit slices that are needed per cluster By contrast nasasrbhas multiple blocks with excluded elements (due to thoseelements requiring greater than 117 bits) and in generalthe average number of stored bits per cluster is 107 (30more bits than the worst block of Pres Poisson) This strongdependence on the exponent range indicates a potentialbenefit of the fixed-point computation over floating pointAs discussed in Section IV-B floating point is designed forbroad compatibility and exceeds the requirements of manyapplications By operating on fixed-point representations ofdata for much of the computation the accelerator implicitlycreates a problem-specific subset of the floating-point formatwith significant performance benefits without discarding anyinformation that would cause the final result to differ fromthat of an end-to-end floating-point calculation

C Area Footprint

The area footprint of the accelerator is computed usingthe area model described in Section VII-A The overallsystem area for the 128-bank system described above is539mm2 which is lower than the 610mm2 die size of thebaseline Nvidia Tesla P100 GPU [54] Notably unlike inprior work on memristive accelerators the crossbars andperipheral circuitry are the dominant area consumermdashratherthan the ADCsmdashwith a total of 541 of total overall clusterarea This is due to two important differences from priorwork 1) due to computational invert coding (Section V-B2)

Table IIIAREA ENERGY AND LATENCY OF DIFFERENT CROSSBAR SIZES

(INCLUDES THE ADC)

Size Area [mm2] Energy [pJ] Latency [nsec]64 000078 280 533128 000103 652 107256 000162 150 213512 000352 342 427

0

5

10

15

20

nasasrb

Pres_P

2cub

es

qa8fm

finan

Trefe

ship

crystm

therm

G2

ASIC

bcircuit

epb3

Ga

AsH6

ns3D

aSi34H3

6torso2

venkat

wang3

xeno

n1

AcceleratorC

ompu

tatio

n

Overhead

WriteTime PreprocessingTime

Figure 10 Overhead of preprocessing and write time as a percent of totalsolve time on the accelerator

the ADC precision can be reduced by one bit which has anexponential effect on ADC area and 2) the wider operandsthat propagate through the reduction network The per-bankprocessors and global memory buffer consume 136 of thetotal system area This result suggests that a more powerfulprocessor than the LEON3 core may be integrated based onapplication demands without substantially increasing systemarea Table III summarizes the area as well as the latencyand energy characteristics for different crossbar sizes usedin different clusters

D Initialization Overhead

Figure 10 shows the overhead of setting up the itera-tive computation including matrix preprocessing time andprogramming time normalized to the entire solve timeof each linear system on the accelerator Since iterativealgorithms take thousands of iterations to converge theoverhead of writing is effectively amortized leading toan overhead of less than 20 across the entire matrixset and generally falling as the size of the linear systemincreases By extension the number of iterations requiredto converge increases faster than the write requirements ofthe array For large linear systems (the most likely use caseof the proposed accelerator) the overhead is typically lessthan 4 of the total solver runtime The overhead may beeven less in practice as multiple time steps are generallysimulated to examine the time-evolution of the linear systemIn these time-stepped computations only a subset of non-zeros change each step and the matrix structure is typicallypreserved requiring minimal re-processing

E System Endurance

Although memristive devices have finite switching en-durance the iterative nature of the algorithms allows a singlematrix to be programmed once and reused throughout thecomputation Even under a conservative set of assumptions

(a) ns3Da sparsity (b) ns3Da blockingFigure 11 Sparsity and blocking patterns of ns3Da

where each array is fully rewritten for each new matrixignoring both the sparsity of the blocks and the time stepbehavior mentioned above the system lifetime is suffi-cient given a memristive device with a conservative 109

writes [55]ndash[57] the total system lifetime is over 100 yearsassuming the system runs constantly and a new matrix isfully rewritten between solves

F Understanding Difficult-to-Block Matrices

Figure 11 shows the non-zero element distribution and theblocking pattern for ns3Da a matrix with a particularly poorblocking efficiency The figure shows that despite the highrelative density of the matrix the values do not form densesub-blocks that can be blocked efficiently Instead the valuesare distributed relatively uniformly over the matrix with thethird-highest NNZs per matrix row in the evaluated datasetFigure 7 shows the non-zero distribution of two matrices thatcan be blocked effectively Pres Poisson and Xenon1 thenon-zeros of both matrices are clustered primarily aroundthe diagonal An analysis of the blocks formed on ns3Dashows that most of the formed blocks capture patterns withnon-zeros separated by approximately 30 elements 2 permatrix block-row in 64 blocks and 4 per matrix block-rowin 128 blocks This suggests that even if ns3Da were tobe blocked more efficiently the effective speedup would beconstrained by the limited current summation from havingfew blocks per matrix row

G Sensitivity to Device Characteristics

Since scientific computing emphasizes high precisionwe analyze the effects of device characteristics on systemaccuracy and convergence behavior

Dynamic range We re-evaluate the convergence behaviorwith various cell ranges (ie onoff state ratios) to determinehow the system behaves when the states are closer togetheror farther apart Figure 12 shows the relative iterationcount at various configuration points The results showeffectively no sensitivity to dynamic range for single-bitcells With two-bit cells the relatively low dynamic rangeresults in some computational error as the cell states arenot sufficiently spread to tolerate the sources of error in

013 213 413 613 813

B=113 D=075K13

B=113 D=15K13

B=113 D=3K13

B=213 D=075K13

B=213 D=15K13

B=213 D=3K13

Normalized

13 Ite

rao

n13 Co

unt13

Min13 Mean13 Max13

Figure 12 Iteration count as a function of bits per cell and dynamicrange normalized to 1-bit cells with RoffRon = 1500 The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

Figure 13 Iteration count as a function of bits per cell and programmingerror normalized to 1-bit cells with no programming error The minimummaximum and the mean iteration counts are reported over 100 Monte-Carlosimulation experiments

current summation This in turn introduces error into thecomputation which tends to hinder convergence rates

Programming precision We re-evaluate the convergencebehavior with and without cell programming errors to de-termine how results are affected by non-ideal programmingThe results show virtually no sensitivity to programmingerror for single-bit cells until the programming error reaches5 well within the achievable programming precision re-ported in the literature [58] Again the general dependencebecomes stronger as the number of bits per cell is increasedHigh programming error with the same tolerance introduceserrors into the computation hindering convergence

IX. CONCLUSIONS

Previous work with memristive systems has been restricted to low-precision, error-prone fixed-point computation, taking advantage of machine learning workloads that demand fidelity but not precision. We have presented the first correct implementation of a double-precision floating-point compute unit based on a memristive substrate. The innovations presented here take advantage of insights that allow the system to achieve a 10.3× throughput improvement and a 10.9× reduction in energy consumption over a GPU baseline when executing iterative solvers. The potential for such a system in scientific computing, where iterative solvers are ubiquitous, is especially promising.

ACKNOWLEDGMENT

This work was supported in part by NSF grant CCF-1533762.

REFERENCES

[1] F. Pop, "High performance numerical computing for high energy physics: A new challenge for big data science," Adv. High Energy Phys., vol. 2014, 2014.

[2] R. A. Pielke Sr., Mesoscale Meteorological Modeling, 2nd ed. Academic Press, 2013.

[3] B. Scholkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, 1st ed. MIT Press, 2004.

[4] L. Maliar and S. Maliar, "Numerical methods for large-scale dynamic economic models," 2014.

[5] M. Vogelsberger, S. Genel, V. Springel, P. Torrey, D. Sijacki, D. Xu, G. Snyder, S. Bird, D. Nelson, and L. Hernquist, "Properties of galaxies reproduced by a hydrodynamic simulation," Nature, vol. 509, pp. 177–182, May 2014.

[6] P. Messina. (2017, February). The US DOE exascale computing project—goals and challenges. [Online]. Available: https://www.nist.gov/sites/default/files/documents/2017/02/21/messina_nist_20170214final.pdf

[7] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), March 2016.

[8] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[9] P. Chi, S. Li, S. Li, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Intl. Symp. on Computer Architecture (ISCA), June 2016.

[10] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2017.

[11] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," Feb. 2017.

[12] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Intl. Symp. on Workload Characterization (IISWC), Sept. 2013.

[13] J. Hauser. (2002). SoftFloat. [Online]. Available: http://www.jhauser.us/arithmetic/SoftFloat.html

[14] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[15] R. Vuduc, "Automatic performance tuning of sparse matrix kernels," Ph.D. dissertation, 2003.

[16] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Intl. Symp. on Microarchitecture (MICRO), Dec. 2014.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-datacenter performance analysis of a tensor processing unit," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[18] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in Design Automation Conference (DAC), June 2016.

[19] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. SIAM, 2003.

[20] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.

[21] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, 1952, vol. 49, no. 6.

[22] H. A. Van der Vorst, "Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 13, no. 2, pp. 631–644, 1992.

[23] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM Journal on Scientific and Statistical Computing (SISC), vol. 7, no. 3, pp. 856–869, July 1986.

[24] J. Kung, Y. Long, D. Kim, and S. Mukhopadhyay, "A programmable hardware accelerator for simulating dynamical systems," in Intl. Symp. on Computer Architecture (ISCA), June 2017.

[25] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti, "Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware," in IEEE High Performance Extreme Computing Conference (HPEC), Sept. 2013.

[26] S. Kestur, J. D. Davis, and E. S. Chung, "Towards a universal FPGA matrix-vector multiplication architecture," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2012.

[27] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, "A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication," in Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2014.

[28] R. Dorrance, F. Ren, and D. Markovic, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-BLAS on FPGAs," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Feb. 2014.

[29] B. Feinberg, S. Wang, and E. Ipek, "Making memristive neural net accelerators reliable," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2018.

[30] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science & Engineering, vol. 7, no. 3, pp. 54–61, 2005.

[31] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., "IEEE standard for floating-point arithmetic," IEEE Std 754-2008, pp. 1–70, 2008.

[32] L. V. Allis, "Searching for solutions in games and artificial intelligence," 1994.

[33] N. Whitehead and A. Fit-Florea, "Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs," NVIDIA Tech. Rep., 2011.

[34] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, "28.5: A 10b 1.5GS/s pipelined-SAR ADC with background second-stage common-mode regulation and offset calibration in 14nm CMOS FinFET," in Intl. Solid-State Circuits Conference (ISSCC), Feb. 2017, pp. 474–475.

[35] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, "A 3.1 mW 8b 1.2 GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32 nm digital SOI CMOS," IEEE Journal of Solid-State Circuits, vol. 48, no. 12, pp. 3049–3058, 2013.

[36] M. Saberi, R. Lotfi, K. Mafinezhad, and W. A. Serdijn, "Analysis of power consumption and linearity in capacitive digital-to-analog converters used in successive approximation ADCs," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58, no. 8, pp. 1736–1748, 2011.

[37] M. R. Stan and W. P. Burleson, "Bus-invert coding for low-power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 1, pp. 49–58, 1995.

[38] K. Bhanushali and W. R. Davis, "FreePDK15: An open-source predictive process design kit for 15nm FinFET technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[39] Predictive Technology Model (PTM). [Online]. Available: http://ptm.asu.edu

[40] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, "Design of cross-point metal-oxide ReRAM emphasizing reliability and cost," in Intl. Conference on Computer-Aided Design (ICCAD), Nov. 2013.

[41] S.-C. Wong, G.-Y. Lee, and D.-J. Ma, "Modeling of interconnect capacitance, delay, and crosstalk in VLSI," IEEE Transactions on Semiconductor Manufacturing, vol. 13, no. 1, pp. 108–111, 2000.

[42] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in Intl. Symp. on High Performance Computer Architecture (HPCA), Feb. 2015.

[43] C. Yakopcic and T. Taha, "Model for maximum crossbar size based on input driver impedance," Electronics Letters, vol. 52, no. 1, pp. 25–27, 2015.

[44] M. O'Halloran and R. Sarpeshkar, "A 10-nW 12-bit accurate analog storage cell with 10-aA leakage," IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 1985–1996, 2004.

[45] Leon3/GRLIB. [Online]. Available: http://www.gaisler.com/index.php/downloads/leongrlib

[46] S. Galal, O. Shacham, J. S. Brunhaver II, J. Pu, A. Vassiliev, and M. Horowitz, "FPU generator for design space exploration," in IEEE Symposium on Computer Arithmetic (ARITH), April 2013.

[47] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, "Open cell library in 15nm FreePDK technology," in Intl. Symp. on Physical Design (ISPD), March 2015.

[48] Synopsys, "Synopsys Design Compiler User Guide," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DCUltra/Pages

[49] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, "CACTI 7: New tools for interconnect exploration in innovative off-chip memories," ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, pp. 14:1–14:25, June 2017.

[50] G. Fredeman, D. W. Plass, A. Mathews, J. Viraraghavan, K. Reyer, T. J. Knips, T. Miller, E. L. Gerhard, D. Kannambadi, C. Paone, D. Lee, D. J. Rainey, M. Sperling, M. Whalen, S. Burns, R. R. Tummuru, H. Ho, A. Cestero, N. Arnold, B. A. Khan, T. Kirihata, and S. S. Iyer, "A 14 nm 1.1 Mb embedded DRAM macro with 1 ns access," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 230–239, Jan. 2016.

[51] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Intl. Symp. on Performance Analysis of Systems and Software (ISPASS), April 2009.

[52] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Intl. Symp. on Computer Architecture (ISCA), June 2013.

[53] H. Anzt, J. Dongarra, M. Kreutzer, G. Wellein, and M. Köhler, "Efficiency of general Krylov methods on GPUs – an experimental study," in Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2016.

[54] "NVIDIA Tesla P100," NVIDIA Corporation, Tech. Rep. WP-08019-001 v01.1, 2016.

[55] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in IEEE International Electron Devices Meeting (IEDM), 2008, pp. 1–4.

[56] J. J. Yang, M.-X. Zhang, J. P. Strachan, F. Miao, M. D. Pickett, R. D. Kelley, G. Medeiros-Ribeiro, and R. S. Williams, "High switching endurance in TaOx memristive devices," Applied Physics Letters, vol. 97, no. 23, p. 232102, 2010.

[57] C.-W. Hsu, I.-T. Wang, C.-L. Lo, M.-C. Chiang, W.-Y. Jang, C.-H. Lin, and T.-H. Hou, "Self-rectifying bipolar TaOx/TiO2 RRAM with superior endurance over 10¹² cycles for 3D high-density storage-class memory," in Symposium on VLSI Technology, June 2013.

[58] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, Jan. 2012.
