[acm press the 23rd acm international conference - paris, france (2013.05.02-2013.05.03)]...



A 250mV Sub-threshold Asynchronous 8051 Microcontroller with a Novel 16T SRAM Cell for Improved Reliability in 40nm CMOS

Jaeyoung Kim, University of Michigan, Ann Arbor, MI

Kwen-Siong Chong, Nanyang Technological University, Nanyang Avenue, Singapore

Joseph Sylvester Chang, Nanyang Technological University, Nanyang Avenue, Singapore

Pinaki Mazumder, University of Michigan, Ann Arbor, MI

ABSTRACT

An asynchronous approach to digital systems resolves the timing uncertainty that grows with technology scaling, since timing issues are eliminated in asynchronous systems. This paper presents a sub-threshold asynchronous 8051 microcontroller (A8051) with a novel 16T SRAM cell for improved reliability in asynchronous systems. The A8051, which adopts a 4-phase dual-rail protocol, can operate down to 250 mV. At 250 mV, the A8051 has a critical path delay of 67.53 μs, equivalent to 12.88 kHz in a synchronous system, with 91.6 nW power consumption. At 1.0 V, the critical path delay is 5.74 ns, equivalent to 151.55 MHz, with 8.98 mW power consumption. The proposed 16T SRAM cell is used in the memory blocks. The 16T SRAM structure eliminates charge contention between devices during read and write operations so that the SRAM can be operated fully in static mode, which improves the write margin (WM). The WM of the 16T SRAM cell is 1.81 times that of the conventional 6T SRAM cell and 1.58 times that of the 8T SRAM cell. At 250 mV, the SNM of the SRAM cell is 12.5 mV under process and mismatch variations. The write delay of the asynchronous SRAM block is 4.02 μs (equivalent to 248.5 kHz) with 5.44 pJ energy dissipation, while the read delay is 12.61 μs (equivalent to 79.3 kHz) with 9.08 pJ energy dissipation.

Categories and Subject Descriptors

B.7.1 [Hardware]: Types and Design Styles - VLSI

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’13, May 2–3, 2013, Paris, France. Copyright 2013 ACM 978-1-4503-1902-7/13/05 ...$15.00.

Figure 1: Wireless Sensor Network. (Blocks: Sensor, Front-End, Microprocessor, Digital Signal Processor, Wireless Transceiver, Power Management including a power source.)

General Terms

Performance, Design, Reliability, Experimentation

Keywords

Asynchronous system, Microcontroller, Microprocessor, Sub-threshold operation, SRAM, Reliable memory structure

1. INTRODUCTION

Wireless sensor networks (WSNs) are increasingly ubiquitous, in part due to their ultra-low power, high-reliability operation, and small form factor. Figure 1 depicts a typical WSN, comprising six main modules: a sensor, a front-end (e.g. an analog-to-digital converter), a microprocessor, a digital signal processor, a wireless transceiver, and a power management unit including a power source [12]. As a WSN is typically designed for a long operational lifespan, power is carefully budgeted where pertinent, and the WSN is energized only when required, so that the overall average power is typically 10 μW to 100 μW. In a typical case, the power source in the power management unit allocates 20% of the power to each of the other five modules; the actual power breakdown varies with the specific application. A microprocessor module with ultra-low power dissipation is highly desirable, as it often remains active for parameter monitoring and, in part, enables various power-efficient techniques (e.g. a very low duty cycle for wireless data transmission) [2]. In this paper, an ultra-low power microprocessor for WSNs is of specific interest.

For the realization of an ultra-low power microprocessor, many design approaches have been proposed, including the adoption of limited instruction sets at the cost of reduced programmability and versatility [13], the use of smaller feature size devices and their correspondingly lower supply voltages, design techniques that reduce switched capacitances and switching activities [3], and adaptive circuits and systems for lowering power budgets, such as Dynamic Voltage and Frequency Scaling (DVFS). Many approaches to such adaptive circuits have been suggested: so-called “always correct” approaches such as look-up tables [14] and canary circuits [10, 11, 7, 16], and “fail and correct” approaches such as RazorII [6]. These approaches, however, have some issues. For example, look-up tables hold fixed data in a static memory block, so these data must be determined from intensive simulations under Process-Voltage-Temperature (PVT) variations and are therefore not fully optimized. Canary circuits must be designed for the worst-case performance for monitoring, so there is a timing margin between the canary’s delay and the critical path of a real fabricated chip. Razor assumes that the secondary latch is always correct, meaning that this latch should be designed to be slower than its primary latch under PVT variations. Furthermore, Razor requires additional cycles to correct erroneous data, which degrades performance. The issues of the above-mentioned approaches to adaptive circuits and systems mainly relate to timing uncertainty, because an inappropriate or insufficient budget for the clock period or cycle time in digital circuits and systems could lead to incorrect functioning. This timing uncertainty has many components, such as phase-locked loop jitter, clock skew and jitter, power supply noise, and PVT variations. As a result, either the supply voltage is increased or the cycle time is lengthened as a guard band for ensuring correct operation, resulting in more energy consumption [8]. Among these components, PVT variations have become a significant issue as technology advances, switching the design paradigm from deterministic to statistical [4]. Due to the increased timing uncertainty, the required guard band has also increased to ensure correct operation. PVT variations are compounded further for operation in the sub-threshold voltage region, which increases this guard band even more. Several earlier studies even reported that the timing uncertainty could be more than 200 times larger in the sub-threshold region due to PVT variations [9].

In this paper, we present an adaptive sub-threshold 8051 microcontroller by means of asynchronous logic; we denote it as ‘A8051’ for brevity. The main motivation for adopting asynchronous logic is to remove the clock, hence resolving the timing uncertainty issue. Our proposed A8051 consists of an A8051 core, a 1,024×8 bit ROM, a 128×8 bit RAM, and a 1,024×8 bit XRAM. The overall architecture of the A8051 in this work is close to that of the design in [5], but the latter employs a memory intellectual property provided by the manufacturer. Hence, the latter is not able to operate robustly in the sub-threshold regime, largely due to memory failures and in part due to PVT variations in the memory interfaces, and it only features limited-range DVS from nominal to near-threshold voltage. In this work, we address these problems, and our proposed A8051 features several interesting attributes and novelties. First, our proposed A8051 is able to operate robustly in the sub-threshold region (e.g. at 250 mV). Second, our proposed A8051 can innately accommodate PVT variations at any prevailing operating condition, hence enabling full-range DVS from nominal voltage through near-threshold to sub-threshold voltage without interruption. Third, we propose a novel 16-Transistor (16T) SRAM cell for the ROM, RAM and XRAM memories. This 16T SRAM cell eliminates charge contention between the cross-coupled devices therein, hence operating fully in static mode and improving the write margin. The 16T SRAM cell also features low bit line leakage, enabling the ROM, RAM and XRAM memories to operate at low voltage. Fourth, the A8051 is designed using combinational-logic gates (i.e. no sequential-logic gates such as clocked latches or flip-flops), hence potentially having lower error rates, as delineated in the ITRS-2011 report, where sequential logic has significantly higher error rates than combinational logic for contemporary and future CMOS fabrication processes [1].

Figure 2: Block diagram of the A8051 μC core.

The remaining parts of this paper are organized as follows. Section 2 presents the overall architecture of the A8051. Section 3 describes the memory design for the A8051. Simulation results for the A8051 and the memory blocks are presented in Section 4, and finally Section 5 provides conclusions.

2. A8051 ARCHITECTURE

The A8051 consists of a 1,024×8 bit read-only memory (ROM) for program instructions, a 128×8 bit random-access memory (RAM) for storing data, a 1,024×8 bit external RAM (XRAM), the A8051 core, and an interface block for controlling program ports and general-purpose I/O ports. The A8051 core consists mainly of two pipeline stages: Instruction Fetch (IF) and Decode and Execute (D&X) (see Figure 2). IF, the Flow Controller (FCont), the Instruction Pointer Arithmetic Unit (IPAU), the Instruction Pointer (IP), and the Memory Controller (MemCont) participate in the first pipeline stage, in which instruction fetch and exceptions, including interrupts and branching, are processed. Instructions are fetched one byte per cycle in this IF stage, since the majority of instructions are 1 byte long. In the second pipeline stage, operand fetch, operation execution, and write back are carried out by D&X, the Arithmetic and Logic Unit (ALU), the Register File (ReF), and MemCont. The two pipeline stages are almost independent of each other, so pipeline stalls can be minimized. Although a deeper pipeline can increase the overall throughput of the A8051, it requires significant area and energy overhead for a pipeline controller to deal with data and control dependencies. In addition, pipeline registers might be needed for stalling in the additional pipeline stages, which, in turn, dilutes a strength of asynchronous circuits: no registers at all. Notice that each bus between blocks has handshake channels along with data channels, so that each block knows when the resulting data should be delivered to the next block.

Figure 3: The proposed 16T SRAM cell structure. (Signal labels: WLwrite, WLread, D.T, D.F, Q.T, Q.F, GNDvirtual.)

3. MEMORY DESIGN

In this section, a novel memory structure is presented. A 16T SRAM cell structure is presented in Section 3.1, and its operation principle is described in Section 3.2. Sizing constraints are provided in Section 3.3, and Section 3.4 describes a memory block.

3.1 SRAM Cell Structure

Reliability of the memory system is one of the most challenging issues, since the A8051 is targeted at sub-threshold operation. One of the advantages of asynchronous systems is the absence of timing issues, meaning that no flip-flop is required. Memory for asynchronous systems should share this property so that the entire system is more robust in terms of signal timing; otherwise, the memory blocks would become a reliability bottleneck. To have this property, a memory cell should be free of latch-like structures (i.e. back-to-back inverters). A latch structure, however, is inevitable when a memory cell is composed only of transistors, without a capacitor or any other element; otherwise, data cannot be retained. Accordingly, the latch structure is necessary for holding a state, but for the read operation (READ) and the write operation (WRITE) it should be circumvented in some way. One possible solution is a static operation mode.

Figure 3 shows our proposed 16T SRAM cell. Devices M1 to M4 form back-to-back inverters, and four additional PMOS devices (M5, M6, M11, and M12) are attached, two of which are connected in parallel to the source node of each inverter's PMOS. At the output node of each inverter, two additional NMOS devices (M7 to M10) are connected in series to the ground node. These eight attached devices (M5 to M12) eliminate contention between devices during WRITE so that the SRAM can be operated in static mode during WRITE. In addition, two NMOS devices are connected to each storage node to decouple the output node from the storage node, as in conventional 8T SRAM cells [17]. In particular, one of these transistors is connected to a virtual power rail to reduce leakage current to the output bit lines. Since the system needs timing-balanced dual-rail outputs, these two additional devices are attached to both storage nodes (M13 to M16). Accordingly, either M13 and M14 or M15 and M16 can be removed for other systems, such as synchronous systems.

Figure 4: WRITE/READ operation principle of the proposed 16T SRAM. (a) WRITE operation: only one path exists at each storage node, so there is neither short-circuit current nor charge contention. (b) READ operation: access transistors M13 and M14 evaluate the stored state as in a conventional 8T SRAM.

3.2 Operation Principle

The 16T SRAM is operated in fully static mode during READ and WRITE (see Figure 4).

3.2.1 Write Operation

Devices M5 to M12 are involved in WRITE. When the word line for WRITE (WLwrite) is enabled, M5 and M6 are turned off, so no current flows to the inverters from the supply, while M7 and M8 are turned on, so stored data can be discharged through these transistors if M9 or M10 is enabled. When a desired state is asserted on the bit lines, D.T and D.F, one path, either from the supply node or to the ground node, is formed at each storage node; when different states are asserted on these bit lines, each storage node has a different path from the other. At this moment, charging and discharging take place sequentially without any charge contention, meaning static logical operation. After one of the internal nodes is discharged, the new data is applied to the other node. For this sequential operation, WLwrite should be asserted after the new data is applied to the bit lines, D.T and D.F. Once WRITE is completed, WLwrite is disabled so that no charge can be discharged through M7 to M10, while the power supply remains connected to the inverters through M5 and M6.
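The ordering described for WRITE (bit lines driven first, then the word line asserted, with discharging preceding charging) can be sketched as a purely behavioral model. This is a minimal illustration and not the authors' circuit: the function name `write_16t`, the node names `node_t` and `node_f`, and the assumption that a high D.T selects the charging path for `node_t` are all hypothetical.

```python
def write_16t(stored_bit, new_bit):
    """Behavioral sketch of one WRITE; returns (phase, node_t, node_f) tuples.

    node_t / node_f are hypothetical names for the two storage nodes, and the
    mapping of the bit lines onto them is an assumption made for illustration.
    """
    node_t, node_f = (1, 0) if stored_bit else (0, 1)
    d_t = 1 if new_bit else 0  # value driven on D.T; D.F carries the complement

    # New data is applied to the bit lines before the write word line rises.
    trace = [("bitlines_driven", node_t, node_f)]

    # Word line asserted: the node selected by the low bit line only has a
    # path to ground, so it is discharged first.
    if d_t == 0:
        node_t = 0
    else:
        node_f = 0
    trace.append(("discharge", node_t, node_f))

    # The other node only has a path from the supply, so it is charged next.
    if d_t == 1:
        node_t = 1
    else:
        node_f = 1
    trace.append(("charge", node_t, node_f))

    # Word line released: the cross-coupled inverters hold the new state.
    trace.append(("wl_released", node_t, node_f))
    return trace


for phase, node_t, node_f in write_16t(stored_bit=0, new_bit=1):
    print(phase, node_t, node_f)
```

The model only captures the sequence of logical states; it says nothing about device sizing, delays, or voltages.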

3.2.2 Read Operation

Devices M13 to M16 are involved in READ. M13 and M15 decouple the output nodes from the storage nodes, as in a conventional 8T SRAM. M14 and M16 are access transistors, and an inverter is connected to the virtual ground node (GNDvirtual). In idle mode, the access transistors are inactive, and GNDvirtual is driven to VDD. Since READ evaluation occurs by discharging, driving GNDvirtual to VDD prevents the read bit lines, Q.T and Q.F, from being misread due to leakage through M14 and M16. Notice that Q.T and Q.F are precharged to VDD even when the memory blocks are not in use. During READ, M14 and M16 are turned on, and GNDvirtual is forced to GND, so evaluation occurs through M13 to M16.

3.3 Memory Cell Sizing

The proposed SRAM structure has an advantage in sizing. Every transistor can be minimum-sized since there is no charge contention between devices, which reduces the engineering effort needed to design an SRAM cell.

3.4 Memory Block

The A8051 includes three memory blocks: two 1,024×8 bit blocks for the ROM and XRAM and a 128×8 bit block for the RAM. The ROM is built in the same way as the RAM, since the A8051 is designed to be reprogrammable for flexible applications. Each 1,024×8 bit block consists of eight banks, each of which is exactly the same as the 128×8 bit block. 128 rows of SRAM cells share a bit line. In idle mode, the bit lines are precharged to VDD, and a balancer, which is a p-type pass transistor, evenly distributes the charge stored on both bit lines. During READ, the prechargers and the balancer are disabled, and the charge on one side of the bit lines begins to discharge through the enabled read access transistors. Notice that little leakage flows to these bit lines, since the virtual ground nodes of the other cells on the bit lines are driven to VDD, meaning that the voltage between each such node and the bit line is zero. In addition, since fully static logic is more robust in the sub-threshold regime, an inverter is used as a sense amplifier even though a commonly used cross-coupled sense amplifier is faster. Our approach is to eliminate any feedback loop during WRITE, although a feedback loop is inevitable in an SRAM cell. In addition, dual-rail outputs are needed since the 4-phase dual-rail protocol is adopted in the A8051, so an inverter is a good candidate for a sense amplifier.
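As an illustration of the bank organization, the sketch below splits a 10-bit address for a 1,024×8 bit block into a bank index and a row index. The text states only that the block has eight banks of 128 rows; using the upper three address bits as the bank select is an assumption, and `split_address` is a hypothetical helper name.

```python
ROWS_PER_BANK = 128   # each bank matches the 128x8 bit block
NUM_BANKS = 8         # eight banks form one 1,024x8 bit block


def split_address(addr):
    """Return (bank, row) for a byte address in a 1,024x8 bit block.

    Assumption: the upper address bits select the bank and the lower seven
    bits select the row; the paper does not specify the mapping.
    """
    if not 0 <= addr < ROWS_PER_BANK * NUM_BANKS:
        raise ValueError("address out of range for a 1,024x8 bit block")
    return divmod(addr, ROWS_PER_BANK)


print(split_address(0))      # (0, 0)
print(split_address(128))    # (1, 0)
print(split_address(1023))   # (7, 127)
```

With this mapping, consecutive addresses stay within one bank until a 128-row boundary is crossed.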

4. SIMULATION RESULTS

Since the A8051 has no flip-flops, the critical path must be defined carefully in order to evaluate its performance. Figure 5 shows how blocks communicate using a 4-phase dual-rail handshake protocol. When the channel is available (i.e., the acknowledge signal is logically low), logic block A transmits data to logic block B. When block B receives the data from block A, it asserts the acknowledge signal. Then, block A transmits empty data, and block B acknowledges by de-asserting the acknowledge signal. This completes the handshake between two asynchronous logic blocks. The performance of the A8051 core can be evaluated in the same manner. The completion time of the handshake between the two pipeline stages can be taken as the critical delay of the A8051 core, since the acknowledge signal of each pipeline stage is asserted only when the stage is completely done. The performance of the asynchronous memory blocks can be evaluated in the same way as that of the A8051 core.

Figure 5: 4-phase dual-rail protocol.

Figure 6: SNM of SRAM cell. Under process and mismatch variations, SNM=12.5 mV.
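The four-phase handshake between block A and block B can be sketched as a small event trace. This is a minimal sketch under stated assumptions: a one-bit channel, a conventional dual-rail code in which (1, 0) carries logic 1, (0, 1) carries logic 0 and (0, 0) is the empty spacer, and hypothetical helper names (`encode_bit`, `decode_bit`, `transfer`). It does not model delays or the A8051's actual channel widths.

```python
EMPTY = (0, 0)  # dual-rail spacer: neither the true rail nor the false rail is high


def encode_bit(bit):
    """Encode one bit as a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)


def decode_bit(rails):
    """Return the bit carried by a valid codeword, or None for the spacer."""
    if rails == (1, 0):
        return 1
    if rails == (0, 1):
        return 0
    if rails == EMPTY:
        return None
    raise ValueError("both rails high is not a legal dual-rail codeword")


def transfer(bits):
    """Send bits from block A to block B, logging the four phases per bit."""
    ack = 0
    received = []
    trace = []
    for bit in bits:
        assert ack == 0                      # channel is free only while ack is low
        data = encode_bit(bit)               # phase 1: A drives valid data
        trace.append(("data", data, ack))
        received.append(decode_bit(data))
        ack = 1                              # phase 2: B asserts acknowledge
        trace.append(("ack_high", data, ack))
        data = EMPTY                         # phase 3: A returns to empty data
        trace.append(("empty", data, ack))
        ack = 0                              # phase 4: B de-asserts acknowledge
        trace.append(("ack_low", data, ack))
    return received, trace


received, trace = transfer([1, 0, 1])
print(received)
for event in trace[:4]:
    print(event)
```

In this picture, the time from the "data" event to the "ack_low" event of one transfer corresponds to the handshake completion time that the text uses as the critical delay.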

4.1 SRAM

Figure 6 shows the static noise margin (SNM) simulation results of an SRAM cell under process and mismatch variations. The SNM of the proposed design is 12.5 mV at a 250 mV power supply, which can tolerate a peak-to-peak ripple of 5% of VDD (12.5 mV at 250 mV) under random-dopant fluctuations.

The delay and energy simulation results of the SRAM block are shown in Table 1. All results are from simulations at 250 mV. Delay and energy were evaluated from a request signal to the related acknowledge signal. The layout cell size is 1750 λ², which is almost twice that of a conventional 8T SRAM. The cell can operate at 250 mV under process and mismatch variations, and in the typical case it can even operate at 200 mV. WRITE performance is higher than READ performance because there is no charge contention during WRITE and because inverters are used as sense amplifiers for READ.

Table 1: SRAM Simulation Results (@250 mV), 1,024×8 bit 16T SRAM block

Type           16T SRAM
Cell Size      1.54×1.82 μm² (1750 λ²)
SNM            12.5 mV
Delay, Read    12.61 μs (∗79.3 kHz)
Delay, Write   4.02 μs (∗248.5 kHz)
Energy, Read   9.08 pJ
Energy, Write  5.44 pJ

∗ Equivalent value in a synchronous system.

Figure 7: SRAM cell write margin simulation results at VDD=250 mV. The WM of the 6T, 8T, and 16T SRAM cells is 67.3 mV, 77.2 mV, and 122.1 mV, respectively.

Figure 7 shows write margin (WM) simulation results for each SRAM cell (6T, 8T, and 16T) at VDD=250 mV. The definition in [15] is used for the WM simulations. WM can be acquired from DC analysis when ‘1’ is written to storage node 1 storing ‘0’. The left curve is the voltage transfer characteristic (VTC) of the inverter of each SRAM cell, showing where the switching threshold is. Since WRITE occurs at the switching threshold, WM is measured from that point to VDD. The WM of the 6T, 8T, and 16T SRAM cells is 67.3 mV, 77.2 mV, and 122.1 mV, respectively. Accordingly, the proposed SRAM cell has 1.81 times the WM of the conventional 6T SRAM cell and 1.58 times the WM of the 8T SRAM cell.
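The quoted ratios follow directly from the three write margins. The short check below reproduces them; the dictionary `wm_mv` and the helper `wm_ratio` are illustrative names, and the only data used are the values reported for Figure 7.

```python
# Write margins at VDD = 250 mV, in millivolts, as reported for Figure 7.
wm_mv = {"6T": 67.3, "8T": 77.2, "16T": 122.1}


def wm_ratio(cell, baseline, margins=wm_mv):
    """Return the write margin of `cell` relative to `baseline`."""
    return margins[cell] / margins[baseline]


print(round(wm_ratio("16T", "6T"), 2))  # 1.81
print(round(wm_ratio("16T", "8T"), 2))  # 1.58
```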

4.2 Asynchronous 8051 microcontroller

To evaluate the performance and power consumption of the A8051, transistor-level simulations were performed, and the power and delay of three blocks, namely the A8051 core, ROM, and RAM, were acquired from these simulations. A benchmark program, Dhrystone [18], was converted to HEX code for programming the ROM. The delay of the A8051 core was measured from a request signal to its corresponding acknowledge signal at a pipeline stage. For the ROM and RAM, request signals from the A8051 core to the ROM and RAM were used as the reference, indicating how fast the A8051 core passes these request signals to the ROM and RAM, so that the processing performance of the ROM and RAM relative to the A8051 core can be evaluated. Figure 8 shows these delay values at different VDD from 250 mV to 1 V. The power consumption of each block was acquired during the execution of the benchmark program and is also shown in Figure 8.

Figure 8: Delay and power simulation results for Core, ROM, and RAM.

Table 2: Asynchronous 8051 μC features

Technology                40nm CMOS technology
Die Size                  1×1 mm² (254×545 μm² without IO PADs)
Operating supply voltage  250 mV ∼ 1.0 V
Number of Devices         795,249 (Total)
                          455,926 (Core)
                          158,904 (ROM)
                          158,904 (XRAM)
                          20,174 (RAM)
                          1,342 (Buffers, MUXes, etc.)
Critical path delay       67.53 μs @250 mV (∗12.88 kHz)
                          5.74 ns @1 V (∗151.55 MHz)
MIPS                      0.0074 MIPS (@250 mV)
                          87.11 MIPS (@1 V)
Total Average Power       91.6 nW (@250 mV)
                          8.98 mW (@1 V)

∗ Equivalent value in a synchronous system. The critical path delay is multiplied by 1.15 as a guard band before its inverse is taken.

Table 2 shows the features of the A8051. At 250 mV, the delay of the critical path in the A8051 core is 67.53 μs, which is equivalent to 12.88 kHz when a 1.15× guard band is applied. It performs 0.0074 MIPS when the CPI is taken as 2, since the majority of instructions in the 8051 microcontroller are 1 byte and the A8051 has a 2-stage pipeline. Accordingly, the MIPS of the A8051 would be lower in practice. The total average power consumption is 91.6 nW. At nominal voltage, the critical path delay is 5.74 ns, which is equivalent to 151.55 MHz. It performs 87.11 MIPS, and the total average power is 8.98 mW. The die size of the A8051 is 1×1 mm², and the chip layout is shown in Figure 9.
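The conversion from critical path delay to the equivalent synchronous frequency and to MIPS can be checked with a few lines of arithmetic. The sketch below assumes that the 1.15× guard band lengthens the delay before inversion, and that the MIPS figures are taken as 1/(CPI × delay) without the guard band; the second point is an inference from the reported numbers, not something the text states. The function names are illustrative.

```python
def equivalent_frequency_hz(delay_s, guard_band=1.15):
    """Equivalent synchronous clock frequency for a given critical path delay."""
    return 1.0 / (delay_s * guard_band)


def mips(delay_s, cpi=2.0):
    """Million instructions per second, assuming one cycle per critical delay."""
    return 1.0 / (delay_s * cpi) / 1e6


f_250mv = equivalent_frequency_hz(67.53e-6)
f_1v = equivalent_frequency_hz(5.74e-9)

print(f"{f_250mv / 1e3:.2f} kHz at 250 mV")   # about 12.88 kHz
print(f"{f_1v / 1e6:.1f} MHz at 1 V")         # about 151.5 MHz
print(f"{mips(67.53e-6):.4f} MIPS at 250 mV") # about 0.0074
print(f"{mips(5.74e-9):.2f} MIPS at 1 V")     # about 87.11
```

The 1 V frequency computed this way is about 151.5 MHz, marginally below the 151.55 MHz in Table 2; the small gap is consistent with the delay having been rounded to 5.74 ns.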

Figure 9: A8051 chip layout. Die size is 1×1 mm².

5. CONCLUSIONS

In this paper, a sub-threshold asynchronous 8051 microcontroller is demonstrated along with a novel sub-threshold 16T SRAM cell for asynchronous systems. The asynchronous approach eliminates the timing uncertainty issue for robust sub-threshold operation. The proposed 16T SRAM cell eliminates charge contention between devices, which improves the write margin. The A8051 chip is implemented in 40nm CMOS technology. The simulation results demonstrate that the A8051 operates down to 250 mV. At 250 mV, the A8051 has a critical delay of 67.53 μs with 91.6 nW power consumption, while at nominal voltage (i.e. 1 V) it has a critical delay of 5.74 ns with 8.98 mW power consumption. At 250 mV, the READ operation of the 1,024×8 bit SRAM block requires 12.61 μs and 9.08 pJ, while WRITE needs 4.02 μs and 5.44 pJ.

6. ACKNOWLEDGMENTS

The authors thank Dr. Kok-Leong Chang for his technical support. This research is sponsored in part by DARPA/AFOSR (FA9550-12-1-0033).

7. REFERENCES

[1] International technology roadmap for semiconductors. Technical report.

[2] G. Asada, M. Dong, T. Lin, F. Newberg, G. Pottie, W. Kaiser, and H. Marcy. Wireless integrated network sensors: Low power systems on a chip. In Solid-State Circuits Conference, 1998. ESSCIRC ’98. Proceedings of the 24th European, pages 9–16, Sept. 1998.

[3] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Asymptotic zero-transition activity encoding for address busses in low-power microprocessor-based systems. In VLSI, 1997. Proceedings. Seventh Great Lakes Symposium on, pages 77–82, Mar. 1997.

[4] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th annual Design Automation Conference, DAC ’03, pages 338–342, New York, NY, USA, 2003. ACM.

[5] K.-L. Chang and B.-H. Gwee. A low-energy low-voltage asynchronous 8051 microcontroller core. In Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, pages 4 pp. –3184, May 2006.

[6] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw. RazorII: In situ error detection and correction for PVT and SER tolerance. Solid-State Circuits, IEEE Journal of, 44(1):32–48, Jan. 2009.

[7] A. Drake, R. Senger, H. Deogun, G. Carpenter, S. Ghiasi, T. Nguyen, N. James, M. Floyd, and V. Pokala. A distributed critical-path timing monitor for a 65nm high-performance microprocessor. In Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International, pages 398–399, Feb. 2007.

[8] R. Franch, P. Restle, N. James, W. Huott, J. Friedrich, R. Dixon, S. Weitzel, K. Van Goor, and G. Salem. On-chip timing uncertainty measurements on IBM microprocessors. In Test Conference, 2007. ITC 2007. IEEE International, pages 1–7, Oct. 2007.

[9] R. Jorgenson, L. Sorensen, D. Leet, M. Hagedorn, D. Lamb, T. Friddell, and W. Snapp. Ultralow-power operation in subthreshold regimes applying clockless logic. Proceedings of the IEEE, 98(2):299–314, Feb. 2010.

[10] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama. Variable supply-voltage scheme for low-power high-speed CMOS digital design. Solid-State Circuits, IEEE Journal of, 33(3):454–462, Mar. 1998.

[11] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi, H. Kawahara, K. Kumano, and M. Shimura. Dynamic voltage and frequency management for a low-power embedded microprocessor. Solid-State Circuits, IEEE Journal of, 40(1):28–35, Jan. 2005.

[12] G. J. Pottie and W. J. Kaiser. Wireless integrated network sensors. Commun. ACM, 43(5):51–58, May 2000.

[13] M. Seok, S. Hanson, Y.-S. Lin, Z. Foo, D. Kim, Y. Lee, N. Liu, D. Sylvester, and D. Blaauw. The Phoenix processor: A 30pW platform for sensor applications. In VLSI Circuits, 2008 IEEE Symposium on, pages 188–189, June 2008.

[14] B. Stackhouse, S. Bhimji, C. Bostak, D. Bradley, B. Cherkauer, J. Desai, E. Francom, M. Gowan, P. Gronowski, D. Krueger, C. Morganti, and S. Troyer. A 65 nm 2-billion transistor quad-core Itanium processor. Solid-State Circuits, IEEE Journal of, 44(1):18–31, Jan. 2009.

[15] K. Takeda, H. Ikeda, Y. Hagihara, M. Nomura, and H. Kobatake. Redefinition of write margin for next-generation SRAM and write-margin monitoring circuit. In Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, pages 2602–2611, Feb. 2006.

[16] J. Tschanz, K. Bowman, S.-L. Lu, P. Aseron, M. Khellah, A. Raychowdhury, B. Geuskens, C. Tokunaga, C. Wilkerson, T. Karnik, and V. De. A 45nm resilient and adaptive microprocessor core for dynamic variation tolerance. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, pages 282–283, Feb. 2010.

[17] N. Verma and A. Chandrakasan. A 256 kb 65 nm 8T subthreshold SRAM employing sense-amplifier redundancy. Solid-State Circuits, IEEE Journal of, 43(1):141–149, Jan. 2008.

[18] R. P. Weicker. Dhrystone: a synthetic systems programming benchmark. Commun. ACM, 27(10):1013–1030, Oct. 1984.
