504 • 2014 IEEE International Solid-State Circuits Conference
ISSCC 2014 / SESSION 30 / TECHNOLOGIES FOR NEXT-GENERATION SYSTEMS / 30.10
30.10 A 1TOPS/W Analog Deep Machine-Learning Engine with Floating-Gate Storage in 0.13µm CMOS
Junjie Lu, Steven Young, Itamar Arel, Jeremy Holleman
University of Tennessee, Knoxville, TN
Direct processing of raw high-dimensional data such as images and video by machine learning systems is impractical, both due to prohibitive power consumption and due to the "curse of dimensionality," which makes learning tasks exponentially more difficult as dimension increases. Deep machine learning (DML) mimics the hierarchical representation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations on standard digital computers. Custom analog or mixed-mode signal processors have been reported to yield much higher energy efficiency than DSPs [1-4], presenting a means of overcoming these limitations. However, the use of volatile digital memory in [1-3] precludes their use in intermittently powered devices, and the required interfacing and internal A/D/A conversions add power and area overhead. Nonvolatile storage is employed in [4], but the lack of learning capability requires task-specific programming before operation and precludes online adaptation.
The feasibility of analog clustering, a key component of DML, has been demonstrated in [5]. In this paper, we present an analog DML engine (ADE) implementing DeSTIN [6], a state-of-the-art DML framework, and featuring online unsupervised trainability. Floating-gate nonvolatile memory facilitates operation with intermittently harvested energy. An energy efficiency of 1TOPS/W is achieved through massive parallelism, deep weak-inversion biasing, current-mode analog arithmetic, distributed memory, and power gating applied to per-operation partitions. Additionally, algorithm-level feedback desensitizes the system to errors such as offset and noise, allowing reduced device sizes and bias currents.
Figure 30.10.1 shows the architecture of the ADE, in which seven identical cortical circuits (nodes) form a 4-2-1 hierarchy. Each node captures regularities in its inputs through an unsupervised learning process. The lowest layer receives raw data (e.g., the pixels of an image) and continuously constructs belief states that characterize the sequences observed. The inputs of nodes on the 2nd and 3rd layers are the belief states of the nodes in their respective lower layers. The beliefs of the top layer are then used as rich features for a classifier.
The node (Fig. 30.10.2) incorporates an 8×4 array of reconfigurable analog computation cells (RAC), grouped into 4 centroids, each with an 8-dimensional input. The centroids are characterized by their mean μ and variance σ² in each dimension, stored in their respective floating-gate memories (FGM). In a training cycle, the analog arithmetic elements (AAE) calculate a simplified Mahalanobis distance (assuming a diagonal covariance matrix) DMAH between the input observation o and each centroid. The 8-D distances are built by joining the output currents. A distance processing unit (DPU) performs an inverse-normalization (IN) operation on the 4 distances to construct the belief states, which are the likelihoods that the input belongs to each centroid. Then the centroid parameters μ and σ² are adapted using the online clustering algorithm. The centroid with the smallest Euclidean distance DEUC to the input is selected (classification). The errors between the selected centroids and the input are loaded into the training control (TC), and their memories are then updated proportionally. In recognition mode, only the belief states are constructed and the memories are not adapted. Intra-cycle power gating reduces the power consumption by up to 37%.
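The training cycle described above can be sketched in software. The following is a minimal NumPy model of one node; the learning rates, initial parameters, and the omission of the hardware scaling constant C (taken as 1) are assumptions for illustration, not the chip's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 4                          # 8-D input, 4 centroids per node
mu = rng.uniform(2.0, 8.0, (K, D))   # centroid means (FGM mu), arbitrary start
var = np.ones((K, D))                # centroid variances (FGM sigma^2)
alpha, beta = 0.05, 0.05             # learning rates (hypothetical values)

def train_cycle(o):
    """One training cycle: distances -> beliefs -> WTA -> proportional update."""
    # Simplified Mahalanobis distance per centroid (diagonal covariance)
    d_mah = ((o - mu) ** 2 / var).sum(axis=1)
    # Inverse-normalization (IN): beliefs form a valid probability distribution
    inv = 1.0 / d_mah
    belief = inv / inv.sum()
    # Classification: select nearest centroid by Euclidean distance
    w = int(np.argmin(((o - mu) ** 2).sum(axis=1)))
    # Proportional update of the winning centroid's memories (C assumed 1)
    err_mu = o - mu[w]
    mu[w] += alpha * err_mu
    var[w] += beta * (err_mu ** 2 - var[w])
    return belief, w
```

In recognition mode only the first two steps would run, leaving the memories untouched.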
Figure 30.10.3 shows the schematic of the RAC, which performs three different operations through reconfigurable current routing. Two embedded FGMs provide nonvolatile storage for the centroid parameters. Capacitive feedback stabilizes the floating-gate voltage (VFG) to yield pulse-width-controlled updates. Tunneling is enabled by lowering the FGM's supply to bring down VFG, increasing the voltage across the tunneling junction. Injection is enabled by raising the source of the injection transistor. This allows random-accessible bidirectional updates without the need for on-chip high-voltage switches or charge pumps. A 2-T V-I converter then provides a current output and a sigmoid update rule. The FGM consumes 0.5nA of bias current and shows 8b programming accuracy and a 46dB SNR at full scale. The absolute-value circuit (ABS) in the AAE rectifies the bidirectional difference current between o and μ. Class-B operation and the virtual ground provided by amplifier A allow high-speed resolution of small current differences. The rectified currents are then fed into a translinear X²/Y circuit, which simulations indicate operates with more than an order of magnitude higher energy efficiency than its digital equivalent.
In the belief-construction phase, the DPU (Fig. 30.10.4) inverts the distance outputs from the 4 centroids to calculate similarities, and normalizes them to yield a valid probability distribution. The output belief states are sampled and then held for the rest of the cycle to allow parallel operation of all layers. The sampling switch reduces the current sampling error due to charge injection: a diode-connected PMOS provides a reduced VGS to the switch NMOS to turn it on with minimal channel charge. The S/H achieves less than 0.7mV of charge-injection error (2% current error) and less than 14μV of droop with parasitic capacitance as the holding capacitor. In the classification phase, the IN circuits are reused together with the winner-take-all network (WTA) to classify the observation to the nearest centroid. A starvation trace (ST) circuit is implemented to address unfavorable initial conditions wherein some centroids are starved of nearby inputs and never updated. The ST provides starved centroids with a small but increasing additional current to force their occasional selection and pull them into more populated areas of the input space. The lower right of Fig. 30.10.4 shows the TC circuit, which performs current-to-pulse-width conversion using a VDD-referenced comparison. Proportional updates cause the mean and variance memories to converge to the sample mean and variance, respectively.
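The starvation-trace mechanism can be illustrated with a toy software model. The growth rate `gamma` and the fixed distances below are made-up numbers; on chip the trace is an analog current added to the IN/WTA path:

```python
import numpy as np

def select_with_starvation(d_euc, starvation, gamma=0.5):
    """Winner-take-all with a starvation trace (software sketch of the ST).
    Unselected centroids accumulate a growing bonus, so a centroid stranded
    far from the data is still forced to win occasionally."""
    w = int(np.argmin(d_euc - starvation))  # starved centroids look closer
    starvation += gamma                     # every trace grows each cycle...
    starvation[w] = 0.0                     # ...and resets for the winner
    return w

# Centroid 0 sits far from the data (distance 10 vs. 1 for the others),
# yet the growing trace guarantees it is eventually selected.
d = np.array([10.0, 1.0, 1.0, 1.0])
trace = np.zeros(4)
winners = [select_with_starvation(d, trace) for _ in range(50)]
```

Once selected, the training update would pull the starved centroid toward the populated region, after which its trace resets and normal competition resumes.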
The ADE is evaluated on a custom test board with data-acquisition hardware connected to a host PC. The waveforms in Fig. 30.10.5 show the measured beliefs, one from each layer. The sampling of beliefs proceeds from the top layer to the bottom to avoid delays due to output settling. The performance of the node is demonstrated by solving a clustering problem. The input data consist of 4 underlying clusters, each drawn from a Gaussian distribution with a different mean and variance. The node achieves accurate extraction of the cluster parameters (μ and σ²), and the ST ensures robust operation under unfavorable initial conditions.
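That the proportional updates converge to the cluster statistics can be checked numerically. A self-contained 1-D sketch, with arbitrary learning rates and the paper's scaling constant C taken as 1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 0.0, 1.0        # one mean/variance memory pair, arbitrary start
alpha = beta = 0.01       # small learning rates for a low-noise estimate

# Stream samples from a cluster with mean 5 and variance 2
for o in rng.normal(5.0, np.sqrt(2.0), 20000):
    err_mu = o - mu                     # Err_mu = o - mu
    mu += alpha * err_mu                # mu(n+1) = mu(n) + alpha*Err_mu
    var += beta * (err_mu ** 2 - var)   # sigma2(n+1) = sigma2(n) + beta*Err_sigma2

# mu and var now track the sample mean and variance of the cluster
```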
We demonstrate feature extraction for pattern recognition with the setup shown in Fig. 30.10.6. The input patterns are 16×16 symbol bitmaps corrupted by random pixel errors. An 8×4 moving window defines the pixels applied to the ADE's 32-D input. First, the ADE is trained unsupervised with examples of the patterns. After the training converges, the 4 belief states from the top layer are used as rich features and classified with a neural network implemented in software, achieving a dimension reduction from 32 to 4. Recognition accuracies of 100% with corruption below 10%, and 95.4% with 20% corruption, are obtained, comparable to a software baseline, demonstrating robustness to the non-idealities of analog computation.
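One plausible software model of the windowing front end is sketched below; the paper does not state the window stride, so unit stride is an assumption here:

```python
import numpy as np

def windows_32d(bitmap, wh=8, ww=4):
    """Flatten every 8x4 window of a 16x16 bitmap into a 32-D vector,
    matching the dimensionality of the ADE's first-layer input."""
    H, W = bitmap.shape
    return np.array([bitmap[r:r + wh, c:c + ww].ravel()
                     for r in range(H - wh + 1)
                     for c in range(W - ww + 1)])

# A random binary "symbol" bitmap stands in for the test patterns
bitmap = (np.random.default_rng(1).random((16, 16)) > 0.5).astype(float)
feats = windows_32d(bitmap)   # one 32-D observation per window position
```

Each 32-D window vector would then be presented to the ADE, whose 4-D top-layer beliefs feed the software classifier.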
The ADE was fabricated in a 0.13μm CMOS process with thick-oxide I/O FETs. The die micrograph is shown in Fig. 30.10.7, together with a performance summary and a comparison with state-of-the-art bio-inspired parallel processors utilizing analog computation. We achieve 1TOPS/W peak energy efficiency in recognition mode. Compared to the state of the art, this work achieves very high energy efficiency in both modes. This, combined with the advantages of nonvolatile memory and unsupervised online trainability, makes it a general-purpose feature-extraction engine ideal for autonomous sensory applications or as a building block for large-scale learning systems.
References:
[1] J. Park, et al., "A 646GOPS/W Multi-Classifier Many-Core Processor with Cortex-Like Architecture for Super-Resolution Recognition," ISSCC Dig. Tech. Papers, pp. 168-169, Feb. 2013.
[2] J. Oh, et al., "A 57mW Embedded Mixed-Mode Neuro-Fuzzy Accelerator for Intelligent Multi-Core Processor," ISSCC Dig. Tech. Papers, pp. 130-132, Feb. 2011.
[3] J.-Y. Kim, et al., "A 201.4GOPS 496mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine," ISSCC Dig. Tech. Papers, pp. 150-151, Feb. 2009.
[4] S. Chakrabartty and G. Cauwenberghs, "Sub-Microwatt Analog VLSI Trainable Pattern Classifier," IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169-1179, May 2007.
[5] J. Lu, et al., "An Analog Online Clustering Circuit in 130nm CMOS," IEEE Asian Solid-State Circuits Conference, Nov. 2013.
[6] S. Young, et al., "Hierarchical Spatiotemporal Feature Extraction using Recurrent Online Clustering," Pattern Recognition Letters, Oct. 2013.
978-1-4799-0920-9/14/$31.00 ©2014 IEEE
ISSCC 2014 / February 12, 2014 / 4:45 PM
Figure 30.10.1: The analog deep learning engine (ADE) architecture.
Figure 30.10.2: The node architecture and its timing diagram showing power gating.
Figure 30.10.3: The reconfigurable current routing of the RAC, the schematics of the FGM and AAE, and measurement results.
Figure 30.10.4: The schematic of the DPU and its sub-blocks. The training control is shown on the lower right.
Figure 30.10.5: Measured output waveforms, clustering, and ST test results. For clustering, 2-D results are shown for better visualization.
Figure 30.10.6: Pattern recognition test setup and results, demonstrating accuracy comparable to the baseline software simulation.
[Figure 30.10.1 content: ADE block diagram — raw data (video/image, pattern, voice) enter the 1st layer; belief states propagate up the 4-2-1 node hierarchy to yield rich features for classification, storage, transmission, or post-processing.]

[Figure 30.10.2 content: node architecture — an 8×4 RAC array organized as 4 centroids, each dimension with FGM μ, FGM σ², and an AAE; training control (TC) ×8 with memory-writing logic; and a DPU comprising inverse normalization (IN), the WTA network, and S/H. The timing diagram shows the training cycle (belief construction → classification → training load → memory write) with unused blocks gated OFF in each phase; intra-cycle power gating saves 22% in training mode and 37% in recognition mode.]

[Figure 30.10.3 content: reconfigurable current routing of the RAC for its three modes —
Belief construction: Mahalanobis distance DMAH = Σ[(o−μ)²/σ²];
Classification: Euclidean distance DEUC = Σ(o−μ)²/C, where C is a constant;
Training load: memory errors Err_μ = o−μ and Err_σ² = (o−μ)²/C − σ².
Also shown: the FGM schematic (tunneling and injection junctions with VTUNNEL/VINJECT control, capacitive feedback, 2-T V-I converter) with its measured update rule and programming accuracy (memory values within ±0.2nA of 2–8nA targets), and the AAE (absolute-value circuit with translinear X²/Y operator, X = |o−μ|, Z = X²/Y) with measured DMAH versus input o while sweeping σ² (μ = 5nA) and sweeping μ (σ² = 2nA).]

[Figure 30.10.4 content: the DPU — inverse normalization (IOUT = INORM or IOUT = λ/IIN), winner-take-all network, starvation trace (ST), and an S/H that utilizes wiring capacitance as CHOLD with shielding from noise; |Vt,P| > Vt,N due to the body effect in the sampling switch. Post-layout simulations show the charge-injection and droop results. The training control (TC) converts errors to pulse widths, PW_μ = K·Err_μ and PW_σ² = K·Err_σ², giving the updates μ(n+1) = μ(n) + α·Err_μ and σ²(n+1) = σ²(n) + β·Err_σ², so that μ → MEAN(o) and σ² → VAR(o)/C.]

[Figure 30.10.5 content: measured belief output waveforms for the three layers over a 220μs training cycle; clustering test with data-cluster parameters μ = (3,3), σ² = (1,1); μ = (7,7), σ² = (1.5,1.5); μ = (7,3), σ² = (1,1); μ = (3,7), σ² = (0.6,0.6), showing the evolution of the centroid means and the extracted variances; starvation-trace test — without ST a centroid is starved and clustering is incorrect, with ST the centroid is moved to the data and clustering is correct.]

[Figure 30.10.6 content: pattern recognition test — 32-D raw data feed the ADE, whose 4-D rich features are classified by a software neural network; training convergence and classification results show 100% accuracy at 10% input corruption and 95.4% at 20%, with measured accuracy versus percentage corruption tracking the software baseline.]
ISSCC 2014 PAPER CONTINUATIONS
Figure 30.10.7: Die micrograph, performance summary and comparison table.
[Figure 30.10.7 content: die micrograph (0.9mm × 0.4mm) showing the 7 nodes (L1 ×4, L2 ×2, L3 ×1) with the RAC, DPU, and TC blocks.]

Comparison with state-of-the-art processors:
                         This work               ISSCC'13 [1]        ISSCC'11 [2]             ISSCC'09 [3]
Process                  0.13μm                  0.13μm              0.13μm                   0.13μm
Purpose                  DML feature extraction  Object recognition  Neuro-fuzzy accelerator  Object recognition
Non-volatile memory      Floating gate           N/A                 N/A                      N/A
Power                    11.4μW                  260mW               57mW                     496mW
Peak energy efficiency   1.04TOPS/W              646GOPS/W           655GOPS/W                290GOPS/W

Performance summary:
Technology: 1P8M 0.13μm CMOS
Power supply: 3V
Active area: 0.9mm × 0.4mm
Memory: non-volatile floating gate
Memory SNR: 46dB
Training algorithm: unsupervised online clustering
Output feature: inverse-normalized Mahalanobis distances
Input-referred noise: 56.23pArms
System SNR: 45dB
I/O type: analog current
Operating frequency: 4.5kHz (training mode), 8.3kHz (recognition mode)
Power consumption: 27μW (training mode), 11.4μW (recognition mode)
Energy efficiency: 480GOPS/W (training mode), 1.04TOPS/W (recognition mode)