

Fundamental Limits on the Precision of In-memory Architectures

(Invited Talk)

Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag
(gonugon2,sakr2,hdbouk2,shanbhag)@illinois.edu
Department of Electrical and Computer Engineering

University of Illinois at Urbana-Champaign

ABSTRACT

This paper obtains the fundamental limits on the computational precision of in-memory computing architectures (IMCs). Various compute SNR metrics for IMCs are defined and their interrelationships analyzed to show that the accuracy of IMCs is fundamentally limited by the compute SNR (SNR_a) of its analog core, and that activation, weight, and output precisions need to be assigned appropriately for the final output SNR SNR_T → SNR_a. The minimum precision criterion (MPC) is proposed to minimize the output and hence the column analog-to-digital converter (ADC) precision. The charge summing (QS) compute model and its associated IMC QS-Arch are studied to obtain analytical models for its compute SNR, minimum ADC precision, energy, and latency. Compute SNR models of QS-Arch are validated via Monte Carlo simulations in a 65 nm CMOS process. Employing these models, upper bounds on SNR_a of a QS-Arch-based IMC employing a 512-row SRAM array are obtained, and it is shown that QS-Arch's energy cost reduces by 3.3× for every 6 dB drop in SNR_a, and that the maximum achievable SNR_a reduces with technology scaling while the energy cost at the same SNR_a increases. These models also indicate the existence of an upper bound on the dot-product dimension N due to voltage headroom clipping, a bound that doubles for every 3 dB drop in SNR_a.

KEYWORDS

in-memory computing, taxonomy of in-memory, in-memory noise, machine learning, accelerator, in-memory precision, in-memory accuracy, compute in-memory

ACM Reference Format:
Sujan K. Gonugondla, Charbel Sakr, Hassan Dbouk, and Naresh R. Shanbhag. 2020. Fundamental Limits on the Precision of In-memory Architectures: (Invited Talk). In IEEE/ACM International Conference on Computer-Aided Design (ICCAD '20), November 2–5, 2020, Virtual Event, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3400302.3416344


1 INTRODUCTION

In-memory computing (IMC) [13, 19, 28, 34] has emerged as an attractive alternative to conventional von Neumann (digital) architectures for addressing the energy and latency cost of memory accesses in data-centric machine learning workloads. IMCs embed analog mixed-signal computations in close proximity to the bit-cell array (BCA) in order to execute machine learning computations such as matrix-vector multiplies (MVMs) and dot products (DPs) as an intrinsic part of the read cycle, thereby avoiding the need to access raw data.

IMCs exhibit a fundamental trade-off between their energy-delay product (EDP) and the accuracy or signal-to-noise ratio (SNR) of their analog computations. This trade-off arises due to constraints on the maximum bit-line (BL) voltage discharge and due to process variations, specifically spatial variations in the threshold voltage V_t, which limit the dynamic range and the SNR. Additionally, IMCs also exhibit noise due to the quantization of their input activations and weight parameters and due to the column analog-to-digital converters (ADCs). Henceforth, we use "compute SNR" to refer to the computational precision/accuracy of an IMC, and "precision" to refer to the number of bits assigned to various signals.

Today, a large number of IMC prototype ICs have been demonstrated [1, 3, 4, 7, 12, 15–17, 31–33, 36, 38, 40]. While these IMCs have shown impressive reductions in the EDP over a von Neumann equivalent with minimal loss in inference accuracy, it is not clear that these gains are sustainable for larger problem sizes across data sets and inference tasks. Unlike digital architectures, whose compute SNR can be made arbitrarily high by assigning sufficiently high precision to various signals, IMCs need to contend with both quantization noise and analog non-idealities. Therefore, IMCs will have intrinsic limits on their compute SNR. Since the compute SNR trades off with energy and delay, this raises the following question: What are the fundamental limits on the achievable computational precision of IMCs?

Answering this question is made challenging by the rich design space occupied by IMCs, encompassing a huge diversity of available memory devices, bitcell circuit topologies, and circuit and architectural design methods. Today's IMCs tend to employ ad-hoc approaches to assign input and ADC precisions, or tend to over-provision their analog SNR in order to emulate the determinism of digital computations. An analytical understanding of the relationship between precision, compute SNR, energy, and delay in IMCs is presently missing.

This paper attempts to fill this gap by: 1) defining compute SNR metrics for IMCs, 2) developing a systematic methodology to obtain a minimum precision assignment for the activations, weights, and outputs of fixed-point DPs realized on IMCs to meet network accuracy requirements, and 3) employing this methodology to obtain the limits on the achievable compute SNR of a commonly employed IMC topology and quantify its energy vs. accuracy trade-off.

2 NOTATION AND PRELIMINARIES

2.1 General Notation
We employ the term signal-to-quantization-noise ratio (SQNR) when only quantization noise (denoted as q) is involved. The term SNR is employed when analog noise sources are included; we use η to denote such sources. SNR is also employed when both quantization and analog noise sources are present.

2.2 The Additive Quantization Noise Model
Under the additive quantization noise model, a floating-point (FL) signal x quantized to B_x bits is represented as x_q = x + q_x, where q_x is the quantization noise, assumed to be independent of the signal x.

If x ∈ [−x_m, x_m] and q_x ∼ U[−0.5Δ_x, 0.5Δ_x], where Δ_x = x_m·2^(−(B_x−1)) is the quantization step size and U[a, b] denotes the uniform distribution over the interval [a, b], then the signal-to-quantization-noise ratio (SQNR_x) is given by:

SQNR_x(dB) = 10·log₁₀(SQNR_x) = 6B_x + 4.78 − ζ_x(dB)   (1)

where SQNR_x = σ²_x/σ²_qx, σ²_qx = Δ²_x/12, and ζ_x(dB) = 10·log₁₀(x_m²/σ²_x) is the peak-to-average (power) ratio (PAR) of x. Equation (1) quantifies the familiar 6 dB SQNR gain per bit of precision.
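As a quick numerical check of (1), the following sketch (assuming a zero-mean input uniformly distributed over [−x_m, x_m], for which ζ_x ≈ 4.77 dB) compares the empirical SQNR of a uniform quantizer against the prediction of (1):

```python
import numpy as np

def sqnr_db(x, xq):
    """Empirical SQNR (dB) between a signal x and its quantized version xq."""
    return 10 * np.log10(np.var(x) / np.var(xq - x))

rng = np.random.default_rng(0)
Bx, xm = 8, 1.0
x = rng.uniform(-xm, xm, size=1_000_000)    # zero-mean test signal
dx = xm * 2.0 ** (-(Bx - 1))                # quantization step size
xq = np.clip(np.round(x / dx) * dx, -xm, xm)

par_db = 10 * np.log10(xm**2 / np.var(x))   # PAR, ~4.77 dB for a uniform signal
print(sqnr_db(x, xq))                       # empirical, ~48 dB
print(6 * Bx + 4.78 - par_db)               # predicted by (1)
```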

2.3 The Dot-Product (DP) Computation
Consider the FL dot-product (DP) computation defined as:

y_o = wᵀx = Σ_{j=1..N} w_j·x_j   (2)

where y_o is the DP of two N-dimensional real-valued vectors w = [w_1, …, w_N]ᵀ (weight vector) and x = [x_1, …, x_N]ᵀ (activation vector) of precision B_w and B_x, respectively.

In DNNs, the dot product in (2) is computed with w ∈ [−w_m, w_m] (signed weights), input x ∈ [0, x_m] (unsigned activations), and output y ∈ [−y_m, y_m] (signed outputs). Assuming the additive quantization noise model from Section 2.2, the fixed-point (FX) computation of the DP (2) is described by:

y = w_qᵀx_q + q_y = (w + q_w)ᵀ(x + q_x) + q_y   (3)
  ≈ wᵀx + wᵀq_x + q_wᵀx + q_y = y_o + q_iy + q_y   (4)

where w_q = w + q_w and x_q = x + q_x are the quantized weight and activation vectors, respectively, q_iy is the total input (weight and activation) quantization noise seen at the output y, and q_y is the additional output quantization noise due to round-off/truncation in digital architectures or the finite resolution of the column ADCs in IMC architectures.

Assuming that the weights (signed) and inputs (unsigned) are i.i.d. random variables (RVs), the variances of the signals in (4) are given by:

σ²_yo = N·σ²_w·E[x²];  σ²_qy = Δ²_y/12;  σ²_qiy = (N/12)·(Δ²_w·E[x²] + Δ²_x·σ²_w)   (5)

where σ²_w is the variance of the weights, and Δ_w = w_m·2^(−B_w+1), Δ_x = x_m·2^(−B_x), and Δ_y = y_m·2^(−B_y+1) are the weight, activation, and output quantization step sizes, respectively.

Figure 1: System noise model of IMC: (a) a generic IMC block diagram, and (b) dominant noise sources in fixed-point DP computation on IMCs.
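The variance expressions in (5) can be sanity-checked numerically; the sketch below assumes uniformly distributed activations and weights and draws the quantization noise terms directly from their uniform distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Bx, Bw = 256, 6, 6
xm, wm = 1.0, 1.0
dx, dw = xm * 2.0 ** (-Bx), wm * 2.0 ** (-Bw + 1)

trials = 20_000
x = rng.uniform(0, xm, (trials, N))             # unsigned activations
w = rng.uniform(-wm, wm, (trials, N))           # signed, zero-mean weights
qx = rng.uniform(-dx / 2, dx / 2, (trials, N))  # activation quantization noise
qw = rng.uniform(-dw / 2, dw / 2, (trials, N))  # weight quantization noise

yo = np.sum(w * x, axis=1)                      # ideal DP, eq. (2)
qiy = np.sum(w * qx + qw * x, axis=1)           # input noise at the output, eq. (4)

print(np.var(yo), N * np.var(w) * np.mean(x**2))                          # sigma^2_yo
print(np.var(qiy), N / 12 * (dw**2 * np.mean(x**2) + dx**2 * np.var(w)))  # sigma^2_qiy
```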

3 COMPUTE SNR LIMITS OF IMCS

We propose the system noise model in Fig. 1 for obtaining precision limits on IMC architectures. Such architectures (Fig. 1(a)) accept a quantized input vector (x_q) and a quantized weight vector (w_q) to implement multiple FX DP computations of (4) in parallel in their analog core. Hence, unlike digital architectures, IMC architectures suffer from both quantization noise and analog noise sources such as SRAM cell current variations, thermal noise, and charge injection, as well as from limited headroom, all of which limit their compute SNR.

3.1 Compute SNR Metrics for IMCs
The following equations describe the IMC noise model in Fig. 1:

y = y_o + q_iy + η_a + q_y;  η_a = η_e + η_h   (6)

where y_o is the ideal DP value defined in (2), q_iy is the input quantization noise reflected at the output, η_a is the analog noise term comprising both the clipping noise η_h due to limited headroom and all other noise sources η_e, and q_y is the quantization noise introduced by the ADC.

We define the following fundamental compute SNR metrics:

SQNR_qiy = σ²_yo/σ²_qiy;  SNR_a = σ²_yo/σ²_ηa;  SQNR_qy = σ²_yo/σ²_qy   (7)

where SNR_a is the analog SNR, and SQNR_qiy is the propagated SQNR at the output due to input (weight and activation) quantization noise, given by:

SQNR_qiy(dB) = 6(B_x + B_w) + 4.8 − [ζ_x(dB) + ζ_w(dB)] − 10·log₁₀(2^(2B_x)/ζ_x + 2^(2B_w)/ζ_w)   (8)

where ζ_x(dB) = 10·log₁₀(x_m²/(4·E[x²])) and ζ_w(dB) = 10·log₁₀(w_m²/σ²_w) are the PARs of the (unsigned) activations and (signed) weights, respectively, and SQNR_qy is the digitization SQNR solely due to ADC

quantization noise, given by:

SQNR_qy(dB) = 6B_y + 4.8 − [ζ_x(dB) + ζ_w(dB)] − 10·log₁₀(N)   (9)

which is obtained by the substitutions B_x ← B_y and ζ_x(dB) ← ζ_y(dB) = ζ_x(dB) + ζ_w(dB) + 10·log₁₀(N) in (1).

Figure 2: Per-layer SNR_T(dB) requirements of DP computations in VGG-16 deployed on ImageNet.

From (6) and (7), it is straightforward to show:

SNR_A = σ²_yo/(σ²_qiy + σ²_ηa) = [1/SNR_a + 1/SQNR_qiy]⁻¹   (10)

SNR_T = σ²_yo/(σ²_qiy + σ²_ηa + σ²_qy) = [1/SNR_A + 1/SQNR_qy]⁻¹   (11)

where SNR_A is the pre-ADC SNR and SNR_T is the total output SNR including all noise sources. Note: (10)-(11) can be repurposed for digital architectures by setting SNR_a → ∞, since quantization is the only noise source, implying SNR_A = SQNR_qiy. Equations (8)-(9) indicate that SQNR_qiy and SQNR_qy can be made arbitrarily large by assigning sufficiently high precision to the DP inputs (B_x and B_w) and the output (B_y). Thus, from (10)-(11), SNR_T in IMCs is fundamentally limited by SNR_a, which depends on the analog noise sources, as one expects.
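Since (10)-(11) simply combine the component SNRs harmonically, they are easy to evaluate in the dB domain; a minimal helper sketch follows (the example numbers are illustrative, not from the paper's measurements):

```python
import math

def combine_snrs_db(*snrs_db):
    """Harmonic combination [sum_i 1/SNR_i]^(-1) of (10)-(11), in dB."""
    inv = sum(10.0 ** (-s / 10.0) for s in snrs_db)
    return -10.0 * math.log10(inv)

snr_A = combine_snrs_db(25.0, 41.0)   # eq. (10): SNRa = 25 dB, SQNR_qiy = 41 dB
snr_T = combine_snrs_db(snr_A, 40.0)  # eq. (11): SQNR_qy = 40 dB
print(snr_A, snr_T)                   # ~24.9 dB and ~24.8 dB: SNR_T -> SNRa
```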

3.2 Precision Assignment Methodology for IMCs

Prior work [25, 26] indicates the requirement SNR_T(dB) > SNR*_T(dB) = 10 dB–40 dB (see Fig. 2) for the inference accuracy of an FX network to be within 1% of the corresponding FL network for popular DNNs (AlexNet, VGG-9, VGG-16, ResNet-18) deployed on the ImageNet and CIFAR-10 datasets. To meet this SNR_T(dB) requirement, digital architectures choose B_x and B_w such that SQNR_qiy > SNR*_T, and then choose B_y sufficiently high to guarantee SQNR_qy ≫ SQNR_qiy so that SNR_T → SQNR_qiy.

In contrast, for IMCs, we first need to ensure that SNR_a > SNR*_T so that SNR_T can be made to approach SNR_a with appropriate precision assignment via the following methodology:

(1) Assign sufficiently high values for B_x and B_w per (8) such that SQNR_qiy ≫ SNR_a, so that SNR_A → SNR_a per (10).
(2) Assign a sufficiently high value for B_y per (9) such that SQNR_qy ≫ SNR_A, so that SNR_T → SNR_A per (11).

For example, if SQNR_qiy(dB), SQNR_qy(dB) ≥ SNR_a(dB) + 9 dB, then SNR_a(dB) − SNR_T(dB) ≤ 0.5 dB, i.e., SNR_T(dB) lies within 0.5 dB of SNR_a(dB). In this manner, by appropriate choices of B_x, B_w, and B_y, IMCs can be designed such that SNR_T → SNR_a, which is the fundamental limit on SNR_T.

From the above discussion it is clear that the input precisions B_x and B_w are dictated by network accuracy requirements, while the output precision B_y needs to be set sufficiently high to avoid becoming a significant noise contributor. To ensure a sufficiently high value for B_y, digital architectures employ the bit growth criterion (BGC), described next.

3.3 Bit Growth Criterion (BGC)
The BGC is commonly employed to assign the output precision B_y in digital architectures [9, 25]. BGC sets B_y as:

B_y^BGC = B_x + B_w + log₂(N)   (12)

Substituting B_y = B_y^BGC from (12) into (9) and employing the relationship ζ_y(dB) = 10·log₁₀(N) + ζ_x(dB) + ζ_w(dB), the resulting SQNR due to output quantization using the BGC is given by:

SQNR_qy^BGC(dB) = 10·log₁₀(σ²_yo/σ²_qy) = 6(B_x + B_w) + 4.8 − [ζ_x(dB) + ζ_w(dB)] + 10·log₁₀(N).   (13)

Recall that SQNR_qy^BGC ≫ SNR_A is needed in order to ensure that SNR_T is close to its upper bound. Comparing (9) and (13), we see that, for high values of the DP dimensionality N, BGC is overly conservative since it assigns large values to B_y per (12). Some digital architectures truncate the LSBs to control bit growth. The SQNR of such truncated BGC (tBGC) can be obtained from (9) by setting B_y < B_y^BGC.

BGC's high precision requirement is accommodated by digital architectures by increasing the precision of arithmetic units at a commensurate increase in computational energy, latency, and activation storage costs. However, IMCs cannot afford to use this criterion since B_y is the precision of the BL ADCs, which impacts their energy, latency, and area. Indeed, recent works [24] have claimed that BL ADCs dominate the energy and latency costs of IMCs when BGC is assumed to assign B_y.

In the next section, we propose an alternative to BGC, referred to as the minimum precision criterion (MPC), that can be employed by both digital and IMC architectures and achieves a desired SQNR_qy with far fewer bits than BGC.

3.4 The Minimum Precision Criterion (MPC)
We propose MPC to reduce B_y without incurring any loss in SQNR_qy compared to BGC. Unlike BGC, MPC accounts for the statistics of y_o to permit controlled amounts of clipping to occur. In MPC (see Fig. 3(a)), the output y_o is clipped to lie in the range [−y_c, y_c] instead of [−y_m, y_m] as in BGC (see Fig. 3(b)), where y_c < y_m (y_c: clipping level), and the B_y bits are employed to quantize this reduced range. The clipping probability p_c = Pr{|y_o| > y_c} is kept to a small user-defined value, e.g., y_c = 4σ_yo ensures that p_c < 0.001 if y_o ∼ N(0, σ²_yo). The resulting SQNR is given by:

SQNR_qy^MPC(dB) = 6B_y + 4.8 − ζ_y^MPC(dB) − 10·log₁₀(1 + p_c·σ²_cc/σ²_qy)   (14)

where ζ_y^MPC(dB) = 10·log₁₀(y_c²/σ²_yo), and σ²_cc = E[(y_o − y_c)² | |y_o| > y_c] is the conditional clipping noise variance. Setting y_c = ζ_y^MPC·σ_yo yields ζ_y^MPC(dB) = 10·log₁₀((ζ_y^MPC)²), indicating that p_c is a decreasing function of ζ_y^MPC. Thus, (14) has the same form as (1) with an additional (last-term) clipping noise factor.

MPC exploits a key insight (see Fig. 3(c)), which follows from the Central Limit Theorem (CLT): in an N-dimensional DP computation (2), σ_yo grows sub-linearly (as √N) as compared to the maximum y_m, which grows linearly with N. Furthermore, (14) shows a quantization vs. clipping noise trade-off controlled by the clipping level y_c. This trade-off, illustrated in Fig. 3(c), is absent in BGC and tBGC, and is critical to MPC's ability to realize desired values of SQNR_qy with smaller values of B_y.

Figure 3: Comparison of BGC and MPC: (a) MPC quantization levels, (b) BGC quantization levels, and (c) distribution f_Y(y_o) of the ideal DP output y_o vs. DP dimensionality N.

Assuming y_o ∼ N(0, σ²_yo), and substituting y_c = 4σ_yo and p_c = 0.001 into (14), we obtain the following lower bound:

B_y^MPC ≥ (1/6)·[SNR_A(dB) + 7.2 − γ − 10·log₁₀(1 − 10^(−γ/10))]   (15)

in order for SNR_A(dB) − SNR_T(dB) ≤ γ. For instance, the choice γ = 0.5 dB yields B_y^MPC ≥ (1/6)·[SNR_A(dB) + 16.3], which corresponds to SQNR_qy^MPC(dB) ≥ SNR_A(dB) + 9 dB as discussed in Section 3.2.
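The contrast between (12) and (15) is easily quantified; a hedged sketch follows (rounding up to whole bits is our own convention):

```python
import math

def by_mpc(snr_A_db, gamma_db=0.5):
    """Lower bound on the output precision B_y per MPC, eq. (15)."""
    loss = 10 * math.log10(1 - 10 ** (-gamma_db / 10))
    return math.ceil((snr_A_db + 7.2 - gamma_db - loss) / 6)

def by_bgc(Bx, Bw, N):
    """Output precision B_y per the bit growth criterion, eq. (12)."""
    return Bx + Bw + math.ceil(math.log2(N))

print(by_mpc(31.0))       # 8 bits, independent of N (cf. Section 3.5)
print(by_bgc(7, 7, 64))   # 20 bits: BGC grows with N, MPC does not
```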

Figure 4: Trends in SQNR_qy(dB) for DP computation with B_x = B_w = 7: (a) SQNR_qy(dB) vs. N for MPC (ζ_y = 4), BGC, and tBGC, and (b) SQNR_qy^MPC(dB) vs. ζ_y^MPC when B_y = 8.

3.5 Simulation Results
To illustrate the difference between MPC, BGC, and tBGC, we assume that SNR_a(dB) ≥ 31 dB, so that SNR_T(dB) ≥ 30 dB provided SQNR_qiy(dB), SQNR_qy(dB) ≥ 40 dB per (10)-(11). We further assume DPs of varying dimension N with 7-b quantized unsigned inputs and signed weights randomly sampled from uniform distributions. Substituting B_x = B_w = 7, ζ_x(dB) = −1.3 dB, and ζ_w(dB) = 4.8 dB into (8), we obtain SQNR_qiy(dB) = 41 dB. Thus, all that remains is to assign B_y such that SQNR_qy(dB) ≥ 40 dB, for which there are three choices: MPC, BGC, and tBGC.

Figure 4(a) compares the SQNR_qy achieved by the three methods. Per (15), MPC meets the SQNR_qy(dB) ≥ 40 dB requirement by setting B_y = 8 and ζ_y^MPC = 4, independent of N. In contrast, per (12), BGC assigns 16 ≤ B_y ≤ 20 as a function of N to achieve the same SNR_T as MPC. Furthermore, tBGC meets the SQNR_qy requirement with 11 ≤ B_y ≤ 13 but fails to do so with B_y = 8. Figure 4(b) shows that SQNR_qy^MPC(dB) is maximized when ζ_y^MPC = 4, i.e., when the clipping level y_c = 4σ_yo, thereby illustrating MPC's quantization vs. clipping noise trade-off described by (14). Figure 4 also validates the analytical expressions (8), (9), (13), and (14) (bold) by indicating a close match to ensemble-averaged values of SQNR_qy obtained from Monte Carlo simulations (dotted).


Table 1: A Taxonomy of IMCs using In-memory Compute Models
(Each work employs one or more of the QS, IS, and QR in-memory compute models; the columns give the analog core precision B_x, B_w and the ADC precision B_ADC.)

CMOS:
Kang et al. [15] | 8 | 8 | 8
Biswas et al. [1] | 8 | 1 | 7
Zhang et al. [40] | 5 | 1 | 1
Valavi et al. [33] | 1 | 1 | 1
Khwa et al. [16] | 1 | 1 | 1
Jiang et al. [12] | 1 | 1 | 3.46
Si et al. [30] | 2 | 5 | 5
Jia et al. [11] | 1 | 1 | 8
Okumura et al. [23] | 1 | T | 8
Kim et al. [17] | 1 | 1 | 1
Guo et al. [8] | 1 | 1 | 3
Yue et al. [38] | 2 | 5 | 5
Su et al. [32] | 2 | 1 | 5
Dong et al. [4] | 4 | 4 | 4
Si et al. [31] | 2 | 2 | 5

Beyond CMOS:
Chen et al. [2] | 1 | T | 3
Fick et al. [5] | A | A | A
Xue et al. [35] | 1 | 3 | 4
Yan et al. [37] | 1 | 1 | 1
Zha et al. [39] | 1 | 1 | 1
Xue et al. [36] | 2 | 4 | 6

T: Ternary; A: Analog/Continuous-valued


Note: the theoretically optimal quantizer for an arbitrary signal distribution is obtained from the Lloyd-Max (LM) algorithm [18]. Unfortunately, the LM quantization levels are non-uniformly spaced, which makes it hard to design efficient arithmetic units to process such signals. MPC offers a practical alternative to LM.

4 ANALYTICAL MODELS FOR COMPUTE SNR

This section derives analytical expressions for SNR_a of a typical IMC. First, we show that most IMCs can be 'explained' via a few in-memory compute models.

4.1 In-memory Compute Models
All IMCs are viewed as employing one or more in-memory compute models, defined as a mapping of the algorithmic variables y_o, x_j, and w_j in (2) to physical quantities such as time, charge, current, or voltage, in order to (usually partially) realize an analog BL computation of the multi-bit DP in (2).

Furthermore, we suggest that most IMCs today employ one or more of the following three in-memory compute models (see Fig. 5): (a) charge summing (QS) [7, 14, 15, 40]; (b) current summing (IS) [12, 16, 17, 30]; and (c) charge redistribution (QR) [1, 7, 15, 33], and conjecture that these compute models are in some sense universal in that they represent an approximation to a 'complete set' of practical, i.e., realizable, mappings of variables from the algorithmic to the circuit domain, as shown in Table 1.

Henceforth, we discuss the QS model and the corresponding QS-based IMC, referred to as QS-Arch, in detail since it is very commonly used. Analytical expressions for the circuit-domain equivalents of η_e and η_h in (6) for the QS model are presented. These will be combined with the algorithm- and precision-dependent noise sources q_iy and q_y to obtain SNR_T.


Figure 5: In-memory compute models: (a) charge summing (QS), (b) current summing (IS), and (c) charge redistribution (QR) models.

4.2 The Charge Summing (QS) Model
The QS model (see Fig. 5(a)) realizes the DP in (2) via the variable mapping (y_o → V_o, w_j → I_j, x_j → T_j), where the cell current I_j is integrated over the WL pulse duration T_j (j = 1, …, N) on a BL (or cell) capacitor C, resulting in an output voltage as shown below:

(y_o → V_o) = (1/C)·Σ_{j=1..N} (w_j → I_j)(x_j → T_j)   (16)

where V_o is the DP output assuming infinite voltage headroom, i.e., no clipping. The cell current I_j depends upon transistor sizes and the WL voltage V_WL; typical values are: C (a few hundred fF), I_j (tens of µA), and T_j (hundreds of ps).
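A behavioral sketch of (16) and its headroom limit follows; the parameter values are illustrative, drawn from the typical ranges quoted above:

```python
import numpy as np

def qs_output_voltage(I, T, C, v_max):
    """Behavioral QS model (16): cell currents I_j integrated over pulse
    widths T_j onto capacitor C, clipped at the headroom limit v_max."""
    v_ideal = np.sum(I * T) / C   # ideal DP output V_o
    return min(v_ideal, v_max)    # headroom clipping, cf. (17)

rng = np.random.default_rng(2)
N = 256
I = rng.uniform(10e-6, 30e-6, N)      # tens of uA
T = rng.integers(0, 2, N) * 100e-12   # binary inputs as 0/100 ps pulses
print(qs_output_voltage(I, T, C=270e-15, v_max=0.9))  # clips for large N
```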

Noise Models: The noise contributions in QS arise from the following sources: (1) variations in the pulse widths T_j of the current switch pulses φ_j (Fig. 5(a)); (2) their finite rise and fall times (see Fig. 6(b)); (3) spatial variations in the currents I_j; (4) thermal noise in the discharge RC network; and (5) clipping due to limited voltage headroom. Thus, the analog DP output V_a corresponding to y_a = y_o + η_a is given by:

(y_a → V_a) = (y_o → V_o) + (η_e → v_e) + (η_h → v_c),
v_e = v_θ + (1/C)·Σ_{j=1..N} [i_j·T_j + I_j·(t_j − t_rf)],
v_c = min(V_o, V_o,max) − V_o,   (17)


Figure 6: Modeling the discharge process in the QS compute model: (a) cell current I_j, and (b) the word-line voltage pulse V_WL.

Table 2: QS Model Parameters in a 65 nm CMOS Process

Parameter | Value
k' (µA/V²) | 220
α | 1.8
σ_T0 (ps) | 2.3
σ_Vt (mV) | 23.8
ΔV_BL,max (V) | 0.8 to 0.9
V_WL (V) | 0.4 to 0.8
V_t (V) | 0.4
T_0 (ps) | 100
T (K) | 270
k (J/K) | 1.38e-23

where V_o,max is the maximum allowable output voltage, v_e and v_c are the voltage-domain noise due to circuit non-idealities and clipping, respectively, i_j ∼ N(0, σ²_Ij) is the noise due to (spatial) current mismatch and t_j ∼ N(0, σ²_Tj) is the noise due to (temporal) pulse-width mismatch, both modeled as zero-mean Gaussian random variables, t_rf models the impact of the finite rise and fall times of the current switching pulses, and v_θ ∼ N(0, σ²_θ) is the thermal noise. Note: V_o,max can be as high as 0.9 V when V_dd = 1 V.

Analytical expressions to estimate the noise standard deviations σ_Ij, σ_Tj, σ_θ, and t_rf (see appendix) are provided below:

σ_Ij = I_j·(α·σ_Vt/(V_WL − V_t)) = I_j·σ_D   (18)

t_rf = T_r − ((V_WL − V_t)/V_WL)·(T_r + T_f)/(α + 1)   (19)

σ_Tj = √h_j·σ_T0,  σ_θ = √(kT/C)   (20)

where σ²_D is the normalized current mismatch variance, T_j = h_j·T_0 is the delay of an h_j-stage WL driver composed of unit elements with delay T_0 each, σ_T0 is the standard deviation of T_0, T_r and T_f are the WL pulse rise and fall times (see Fig. 6(b)), α is a fitting parameter in the α-law transistor equation, σ_Vt is the standard deviation of V_t variations, k is the Boltzmann constant, and T is the absolute temperature.

Note that the WL voltage V_WL is typically identical for all rows in the memory array, with a few exceptions such as [40], which modulates V_WL to tune the cell current I_j. The effects of rise/fall times and delay variations can be mitigated by carefully designing the WL pulse generators. Therefore, noise in QS is dominated by spatial threshold voltage variations. Indeed, using the typical values from Table 2, we find that σ_Ij/I_j ranges from 8% to 25%, while σ_Tj/T_j ranges from 0.5% to 3%.
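For instance, (18) can be evaluated with the Table 2 values; a sketch, with V_WL points chosen within the 0.4 V to 0.8 V range listed there:

```python
# Relative current mismatch sigma_Ij/I_j per (18) using Table 2 parameters
alpha, sigma_vt, vt = 1.8, 23.8e-3, 0.4   # alpha, sigma_Vt (V), V_t (V)

for vwl in (0.6, 0.7, 0.8):               # selected V_WL values (V)
    sigma_d = alpha * sigma_vt / (vwl - vt)
    print(f"V_WL = {vwl} V: sigma_I/I = {100 * sigma_d:.1f}%")
```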

Figure 7: The charge summing IMC (QS-Arch).

Energy and Delay Models: The average energy consumption in the QS model is given by:

E_QS = E[V_a]·V_dd·C + E_su   (21)

where the spatio-temporal expectation E[V_a] is taken over inputs (temporal) and over columns (spatial), and E_su is the energy cost of toggling the switches φ_j. Equation (21) shows that the energy consumption in the QS model increases with C (∝ array size), the supply voltage V_dd, and the mean value of the DP output E[V_a].

The delay of the QS model is given by T_QS = T_max + T_su, where T_su is the time required to precharge the capacitors and set up the currents, and T_max = max{T_j} is the longest allowable pulse width.

Table 2 tabulates the parameters of the QS model in a representative 65 nm CMOS process.

4.3 QS-Arch
The charge summing architecture (QS-Arch) in Fig. 7 employs a 6T [8] or 8T [30] SRAM bitcell within the QS model (see Section 4.2). This architecture implements fully-binarized DPs on the BLs by mapping the input bits x_{i,j} to the WL access pulses V_WL,j, while the weight bits ŵ_{i,j} are stored across B_w columns of the BCA so that the BC currents I_{i,j} ∝ ŵ_{i,j}. The output V_o = ΔV_BL is the voltage discharge on the BL, and the capacitance C = C_BL is the BL capacitance in (16). QS-Arch sequentially (bit-serially) processes one multi-bit input vector x in B_x in-memory compute cycles, followed by a digital summing of the binarized DPs to obtain the final multi-bit DP (2). Table 3 summarizes the noise and energy models for QS-Arch.
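The bit-serial recombination of binarized DPs described above can be sketched as follows, assuming, for simplicity, unsigned inputs and weights decomposed into 0/1 bit-planes; the hypothetical helper below stands in for the analog BL computation plus the digital summing:

```python
import numpy as np

def binarized_dp_recombine(x_bits, w_bits):
    """Recombine Bx*Bw binary DPs into the multi-bit DP of (2).
    x_bits: (Bx, N) input bit-planes, LSB first; w_bits: (Bw, N) weight
    bit-planes, LSB first (unsigned in this sketch)."""
    y = 0
    for i in range(x_bits.shape[0]):      # one compute cycle per input bit
        for j in range(w_bits.shape[0]):  # one column group per weight bit
            y += 2 ** (i + j) * np.dot(w_bits[j], x_bits[i])  # binary BL DP
    return y

rng = np.random.default_rng(3)
Bx = Bw = 4; N = 64
x = rng.integers(0, 2**Bx, N)
w = rng.integers(0, 2**Bw, N)
x_bits = np.array([(x >> i) & 1 for i in range(Bx)])
w_bits = np.array([(w >> j) & 1 for j in range(Bw)])
assert binarized_dp_recombine(x_bits, w_bits) == np.dot(w, x)
```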

We derive the architecture-level noise models for QS-Arch using those of the QS model described in Section 4.2. In QS-Arch, clipping occurs in each of the B_x×B_w binarized DPs and contributes to the overall clipping noise variance σ²_ηh at the multi-bit DP output. Circuit noise from each binarized DP is aggregated to obtain the final circuit noise variance σ²_ηe. In addition, employing the MPC-imposed requirement (15) on the final DP output precision B_y, we obtain the lower bound on the ADC precision B_ADC.

Since the multi-bit DP computation in (2) is high-dimensional (N can be in the hundreds), it is clear that the limited BL dynamic range, e.g., V_o,max in (17), will begin to dominate SNR_a in (7). It is for this reason that most, if not all, IMCs resort to some form of binarization of the multi-bit DP in (2) prior to employing one of the in-memory compute models (see Table 1). Ultimately, SNR_a limits the number and accuracy of BL computations per read cycle and hence the overall energy efficiency of IMCs.


Table 3: Model Parameters for QS-Arch

Bitcell type: 6T or 8T
Analog core precision: B_x = 1, B_w = 1
Energy cost per DP: E_QS-Arch = B_w·B_x·(E_QS + E_ADC) + E_misc
Compute model mapping: C → C_BL; V_o → ΔV_BL; T_j → T_WL,j
σ²_qiy = (1/12)·N·Δ²_x·σ²_w + (1/12)·N·Δ²_w·E[x²]
σ²_ηh = (4/9)·(1 − 4^(−B_w))·(1 − 4^(−B_x))·Σ_{k=k_h..N} (k − k_h)²·C(N,k)·(1/4)^k·(3/4)^(N−k)
σ²_ηe = (1/9)·N·σ²_D·(1 − 4^(−B_w))·(1 − 4^(−B_x))
B_ADC ≥ min((SNR_A(dB) + 16.2)/6, log₂(k_h), log₂(N))

where k_h = ΔV_BL,max/ΔV_BL,unit, σ_D = σ_I/I is the normalized standard deviation of the bit-cell current (18), C(N,k) is the binomial coefficient, and (x)⁺ = max(x, 0).

Figure 8: SNR validation methodology.


5 SIMULATION RESULTS

This section describes the methodology used to validate the noise expressions in Table 3 and presents simulation results for QS-Arch.

5.1 Noise Model Validation Methodology
As shown in Fig. 8, we obtain the QS model parameters (Section 4) using Monte Carlo circuit simulations in a representative 65 nm CMOS process, with experimental validation of some of these, e.g., σ_ηe, from our IMC prototype ICs [6, 15] when possible.

Incorporating the non-linear circuit behavior along with the noise models, sample-accurate Monte Carlo Python simulations are employed to numerically calculate SNR values using ensemble-averaged (over 1000 instances) statistics. We compare the SNR values obtained through these sample-accurate simulations with those obtained by evaluating the analytical expressions in Table 3.

The quantitative results in the subsequent sections employ the QS model parameter values in Table 2 along with the QS-Arch energy and noise models from Table 3. An SRAM BCA with 512 rows and C_BL = 270 fF is assumed throughout. The energy and accuracy of QS-Arch are traded off by tuning V_WL. We assume zero-mean signed weights w_j and unsigned inputs x_j drawn independently from two different distributions. We set B_x = B_w = 6 everywhere, unless otherwise stated, so that SQNR_qiy(dB) = 38.9 dB ≫ SNR_a(dB) and therefore SNR_A ≈ SNR_a from (10). Next, we show how SNR_A and SNR_T trade off with N and B_ADC.

5.2 SNR Trade-offs in QS-Arch
Figure 9(a) shows that the maximum achievable SNR_A increases with V_WL. Further, for a fixed V_WL, QS-Arch also exhibits a sharp drop in SNR_A at high values of N > N_max, e.g., SNR_A ≈ 19.6 dB for N ≤ 125, dropping as N increases further. A key reason for this trade-off is that σ²_ηh decreases while σ²_ηe increases as V_WL is reduced (see Table 3); σ²_ηh limits N while σ²_ηe limits SNR_a. Thus, by controlling V_WL, we can trade off N_max with SNR_A. Specifically, N_max increases by 2× for every 3 dB drop in SNR_A.

In QS-Arch, the minimum value of B_ADC (see Table 3) depends upon the minimum of: 1) the MPC term (15); 2) the headroom clipping term; and 3) the small-N case, where the BL discharge ΔV_BL has a finite number of discrete levels. Figure 9(b) shows that SNR_T → SNR_A of Fig. 9(a) when B_ADC is greater than the lower bound (circled) in Table 3 for different values of V_WL and N.

5.3 Impact of ADC Precision
Minimizing the column ADC energy is critical to maintaining IMC's energy efficiency since each DP in QS-Arch requires B_x×B_w conversions. Furthermore, ADCs need to operate in a noise-limited regime due to the high PAR of high-dimensional DP outputs combined with the severe area constraints imposed by column pitch-matching requirements.

The ADC energy cost when operating in the noise-limited regime is modeled as [20, 21]:

E_ADC = β·4^(B_ADC)   (22)

where β is estimated from the Schreier figure of merit [21, 27], which is approximately 180 dB based on recent (2019) ADCs [22], leading to β = 7.5 × 10⁻⁴ fJ at V_dd = 1 V.
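The exponential dependence in (22) is what makes minimizing B_ADC so valuable; a minimal sketch using the β value quoted above:

```python
BETA_FJ = 7.5e-4   # beta in fJ, from the Schreier FoM estimate above

def adc_energy_fj(b_adc):
    """ADC energy per conversion in fJ, per the model of (22)."""
    return BETA_FJ * 4 ** b_adc

for b in (4, 8, 12):
    print(b, adc_energy_fj(b))   # ~0.19 fJ, ~49 fJ, ~12.6 pJ: 4x per extra bit
```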

Figure 10 shows that the ADC energy increases with the DP dimension N. However, the gap between the ADC energy consumption with BGC and MPC begins to increase for N > 60. This is because BGC assigns higher values of B_ADC compared to MPC (see Table 3) to achieve the same SNR_T.

5.4 Impact of Technology Scaling
One expects IMCs to exhibit improved energy efficiency and throughput in advanced process nodes due to lower capacitance and lower supply voltage. However, the impact of technology scaling on the analog noise sources also needs to be considered. To study this trade-off, we employ the SNR and energy models from Section 5 (see Table 3) with parameters scaled per the ITRS roadmap [10]. FDSOI technology is assumed for the 22 nm, 11 nm, and 7 nm nodes.


Figure 9: Compute SNR trade-offs in QS-Arch with B_x = B_w = 6: (a) SNR_A(dB) vs. N for different values of V_WL, and (b) SNR_T(dB) vs. B_ADC showing that the expression in Table 3 correctly predicts the minimum ADC precision B_ADC (circled). A close match is achieved between the expressions in Table 3 (E) and simulations (S) of (17).

Figure 10: ADC energy in QS-Arch with B_x = B_w = 6 at SNR_A(dB) = 16.2 dB as a function of N when B_ADC is chosen according to BGC (12), tBGC, and the MPC criterion (Table 3) such that SNR_T(dB) is within 0.5 dB of SNR_A(dB).

For a specific node, Fig. 11 shows that QS-Arch's energy cost reduces by 3.3× for every 6 dB drop in SNR_A, but that it suffers a catastrophic drop in SNR_A before reaching the input quantization noise limit set by (8). This drop occurs due to an increase in the clipping noise variance σ²_ηh.

Across technology nodes, the maximum achievable SNR_A in QS-Arch reduces as technology scales from 65 nm down to 7 nm due to: 1) increased clipping probability caused by lower supply voltages, and 2) increased variations in the BL discharge voltage ΔV_BL due to the smaller V_dd/V_t ratio. As a result, Fig. 11 also shows that the energy consumption at the same SNR_A is in fact higher in the 11 nm and 7 nm nodes than in the 22 nm node, due to the need to employ higher values of V_WL to control variations in ΔV_BL, implying that technology scaling may not be very friendly to IMCs based on QS-Arch.

Figure 11: Impact of CMOS technology scaling on the compute SNR vs. energy trade-off in QS-Arch with B_x = 3, B_w = 5, and N = 300.

6 CONCLUSIONS AND SUMMARY

Based on the results presented in the earlier sections, we provide the following IMC design guidelines:

• For IMCs to be useful in realizing DNNs, the compute SNR of their analog core (SNR_a) needs to be in the range of 10 dB to 40 dB or greater, depending on the layer.
• The total SNR (SNR_T) of DP computations implemented on IMCs is limited by SNR_a. Weight, activation, and column ADC precisions need to be assigned in accordance with the minimum precision criterion (MPC) in order to minimize energy and latency overheads, especially those of the column ADCs.
• For the commonly used IMC QS-Arch, given an array size, there exists a trade-off between the maximum achievable SNR_a and the maximum realizable DP dimension N. Multi-bank IMCs will be required for high-dimensional DPs in order to boost the overall compute SNR.
• Technology scaling will have an adverse impact on QS-Arch's maximum achievable SNR_a and on the energy cost incurred at a fixed SNR_a.

An overarching conclusion of this paper is that the drive towards minimizing energy and latency using IMCs runs counter to meeting the compute SNR requirements imposed by applications. This paper quantifies this trade-off through analytical expressions for compute SNR and energy-delay models. It is hoped that IMC designers will employ these models as they seek to optimize the design of future IMCs, including the use of algorithmic methods for SNR boosting such as statistical error compensation (SEC) [29].

ACKNOWLEDGMENTS

This work was sponsored by C-BRIC, the Semiconductor Research Corporation (SRC), and the Defense Advanced Research Projects Agency (DARPA).


REFERENCES

[1] Avishek Biswas and Anantha P. Chandrakasan. 2018. Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In IEEE International Solid-State Circuits Conference (ISSCC). 488–490.
[2] Wei-Hao Chen et al. 2018. A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In IEEE International Solid-State Circuits Conference (ISSCC). 494–496.
[3] Hassan Dbouk, Sujan K. Gonugondla, Charbel Sakr, and Naresh R. Shanbhag. 2020. KeyRAM: A 0.34 uJ/decision 18k decisions/s recurrent attention in-memory processor for keyword spotting. In 2020 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 1–4.
[4] Qing Dong et al. 2020. A 351 TOPS/W and 372.4 GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine learning applications. In IEEE International Solid-State Circuits Conference (ISSCC). 242–243.
[5] Laura Fick, David Blaauw, Dennis Sylvester, Skylar Skrzyniarz, M. Parikh, and David Fick. 2017. Analog in-memory subthreshold deep neural network accelerator. In 2017 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 1–4.
[6] Sujan Kumar Gonugondla, Mingu Kang, and Naresh Shanbhag. 2018. A 42pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training. In IEEE International Solid-State Circuits Conference (ISSCC). 490–492.
[7] Sujan K. Gonugondla, Mingu Kang, and Naresh R. Shanbhag. 2018. A variation-tolerant in-memory machine learning classifier via on-chip training. IEEE Journal of Solid-State Circuits 53, 11 (2018), 3163–3173.
[8] Ruiqi Guo, Yonggang Liu, Shixuan Zheng, Ssu-Yen Wu, Peng Ouyang, Win-San Khwa, Xi Chen, Jia-Jing Chen, Xiudong Li, Leibo Liu, Meng-Fan Chang, Shaojun Wei, and Shouyi Yin. 2019. A 5.1pJ/neuron 127.3us/inference RNN-based speech recognition processor using 16 computing-in-memory SRAM macros in 65nm CMOS. In 2019 IEEE Symposium on VLSI Circuits. IEEE, 120–121.
[9] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning. 1737–1746.
[10] ITRS. 2015. ITRS Roadmap tables. http://www.itrs2.net/itrs-reports.html
[11] Hongyang Jia, Yinqi Tang, Hossein Valavi, Jintao Zhang, and Naveen Verma. 2018. A microprocessor implemented in 65nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing. arXiv preprint arXiv:1811.04047 (2018).
[12] Zhewei Jiang, Shihui Yin, Mingoo Seok, and Jae-sun Seo. 2018. XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks. In 2018 IEEE Symposium on VLSI Technology. IEEE, 173–174.
[13] Mingu Kang, Sujan Gonugondla, and Naresh R. Shanbhag. 2020. Deep In-memory Architectures for Machine Learning. Springer.
[14] Mingu Kang, Sujan K. Gonugondla, Min-Sun Keel, and Naresh R. Shanbhag. 2015. An energy-efficient memory-based high-throughput VLSI architecture for convolutional networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[15] Mingu Kang, Sujan K. Gonugondla, Ameya Patil, and Naresh R. Shanbhag. 2018. A multi-functional in-memory inference processor using a standard 6T SRAM array. IEEE Journal of Solid-State Circuits 53, 2 (2018), 642–655.
[16] Win-San Khwa, Jia-Jing Chen, Jia-Fang Li, Xin Si, En-Yu Yang, Xiaoyu Sun, Rui Liu, Pai-Yu Chen, Qiang Li, Shimeng Yu, et al. 2018. A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3 ns and 55.8 TOPS/W fully parallel product-sum operation for binary DNN edge processors. In IEEE International Solid-State Circuits Conference (ISSCC). 496–498.
[17] Jinseok Kim, Jongeun Koo, Taesu Kim, Yulhwa Kim, Hyungjun Kim, Seunghyun Yoo, and Jae-Joon Kim. 2019. Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array. In 2019 IEEE Symposium on VLSI Circuits. IEEE, 118–119.
[18] S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137.
[19] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz. 2014. An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8326–8330.
[20] Boris Murmann. 2008. A/D converter trends: Power dissipation, scaling and digitally assisted architectures. In 2008 IEEE Custom Integrated Circuits Conference. IEEE, 105–112.
[21] Boris Murmann. 2015. The race for the extra decibel: A brief review of current ADC performance trajectories. IEEE Solid-State Circuits Magazine 7, 3 (2015), 58–66.
[22] Boris Murmann. 2019. ADC performance survey 1997-2019. https://web.stanford.edu/~murmann/adcsurvey.html
[23] Shunsuke Okumura, Makoto Yabuuchi, Kenichiro Hijioka, and Koichi Nose. 2019. A ternary based bit scalable, 8.80 TOPS/W CNN accelerator with many-core processing-in-memory architecture with 896K synapses/mm2. In 2019 IEEE Symposium on VLSI Circuits. IEEE, 248–249.
[24] Angad S. Rekhi, Brian Zimmer, Nikola Nedovic, Ningxi Liu, Rangharajan Venkatesan, Miaorong Wang, Brucek Khailany, William J. Dally, and C. Thomas Gray. 2019. Analog/mixed-signal hardware error modeling for deep learning inference. In Proceedings of the 56th Annual Design Automation Conference (DAC '19). ACM, New York, NY, USA, Article 81, 6 pages. https://doi.org/10.1145/3316781.3317770
[25] Charbel Sakr, Yongjune Kim, and Naresh Shanbhag. 2017. Analytical guarantees on numerical precision of deep neural networks. In International Conference on Machine Learning. 3007–3016.
[26] Charbel Sakr and Naresh Shanbhag. 2018. An analytical method to determine minimum per-layer precision of deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1090–1094.
[27] Richard Schreier and Gabor C. Temes. 2005. Understanding Delta-Sigma Data Converters. IEEE Press, Piscataway, NJ.
[28] Naresh Shanbhag, Mingu Kang, and Min-Sun Keel. 2017. Compute memory. US Patent 9,697,877, issued July 4, 2017.
[29] Naresh R. Shanbhag, Naveen Verma, Yongjune Kim, Ameya D. Patil, and Lav R. Varshney. 2018. Shannon-inspired statistical computing for the nanoscale era. Proc. IEEE 107, 1 (2018), 90–107.
[30] Xin Si, Jia-Jing Chen, Yung-Ning Tu, Wei-Hsing Huang, Jing-Hong Wang, Yen-Cheng Chiu, Wei-Chen Wei, Ssu-Yen Wu, Xiaoyu Sun, Rui Liu, et al. 2019. A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning. In IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 396–398.
[31] Xin Si, Yung-Ning Tu, Wei-Hsing Huang, Jian-Wei Su, Pei-Jung Lu, Jing-Hong Wang, Ta-Wei Liu, Ssu-Yen Wu, Ruhui Liu, Yen-Chi Chou, Zhixiao Zhang, Syuan-Hao Sie, Wei-Chen Wei, Yun-Chen Lo, Tai-Hsing Wen, Tzu-Hsiang Hsu, Yen-Kai Chen, William Shih, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Nan-Chun Lien, Wei-Chiang Shih, Yajuan He, Qiang Li, and Meng-Fan Chang. 2020. A 28nm 64Kb 6T SRAM computing-in-memory macro with 8b MAC operation for AI edge chips. In IEEE International Solid-State Circuits Conference (ISSCC). 246–247.
[32] Jian-Wei Su, Xin Si, Yen-Chi Chou, Ting-Wei Chang, Wei-Hsing Huang, Yung-Ning Tu, Ruhui Liu, Pei-Jung Lu, Ta-Wei Liu, Jing-Hong Wang, Zhixiao Zhang, Hongwu Jiang, Shanshi Huang, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Shyh-Shyuan Sheu, Sih-Han Li, Heng-Yuan Lee, Shih-Chieh Chang, Shimeng Yu, and Meng-Fan Chang. 2020. A 28nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips. In IEEE International Solid-State Circuits Conference (ISSCC). 240–241.
[33] Hossein Valavi, Peter J. Ramadge, Eric Nestler, and Naveen Verma. 2018. A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement. In 2018 IEEE Symposium on VLSI Circuits. IEEE, 141–142.
[34] Naveen Verma, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay, Lung-Yen Chen, Bonan Zhang, and Peter Deaville. 2019. In-memory computing: Advances and prospects. IEEE Solid-State Circuits Magazine 11, 3 (2019), 43–55.
[35] Cheng-Xin Xue, Wei-Hao Chen, Je-Syu Liu, Jia-Fang Li, Wei-Yu Lin, Wei-En Lin, Jing-Hong Wang, Wei-Chen Wei, Ting-Wei Chang, Tung-Cheng Chang, et al. 2019. A 1Mb multibit ReRAM computing-in-memory macro with 14.6 ns parallel MAC computing time for CNN-based AI edge processors. In IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 388–390.
[36] Cheng-Xin Xue, Tsung-Yuan Huang, Je-Syu Liu, Ting-Wei Chang, Hui-Yao Kao, Jing-Hong Wang, Ta-Wei Liu, Shih-Ying Wei, Sheng-Po Huang, Wei-Chen Wei, Yi-Ren Chen, Tzu-Hsiang Hsu, Yen-Kai Chen, Yun-Chen Lo, Tai-Hsing Wen, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, and Meng-Fan Chang. 2020. A 22nm 2Mb ReRAM compute-in-memory macro with 121-28 TOPS/W for multibit MAC computing for tiny AI edge devices. In IEEE International Solid-State Circuits Conference (ISSCC). 244–245.
[37] Bonan Yan, Qing Yang, Wei-Hao Chen, Kung-Tang Chang, Jian-Wei Su, Chien-Hua Hsu, Sih-Han Li, Heng-Yuan Lee, Shyh-Shyuan Sheu, Mon-Shu Ho, et al. 2019. RRAM-based spiking nonvolatile computing-in-memory processing engine with precision-configurable in situ nonlinear activation. In 2019 Symposium on VLSI Technology. IEEE, T86–T87.
[38] Jinshan Yue, Zhe Yuan, Xiaoyu Feng, Yifan He, Zhixiao Zhang, Xin Si, Ruhui Liu, Meng-Fan Chang, Xueqing Li, Huazhong Yang, and Yongpan Liu. 2020. A 65nm computing-in-memory-based CNN processor with 2.9-to-35.8 TOPS/W system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse. In IEEE International Solid-State Circuits Conference (ISSCC). 234–235.
[39] Yue Zha, Etienne Nowak, and Jing Li. 2019. Liquid silicon: A nonvolatile fully programmable processing-in-memory processor with monolithically integrated ReRAM for big data/machine learning applications. In 2019 IEEE Symposium on VLSI Circuits. IEEE, 206–207.
[40] Jintao Zhang, Zhuo Wang, and Naveen Verma. 2017. In-memory computation of a machine-learning classifier in a standard 6T SRAM array. IEEE Journal of Solid-State Circuits 52, 4 (April 2017), 915–924.