modeling power consumption of nand flash memories …gurumurthi/papers/tcad13.pdf · 1 modeling...

1

Modeling Power Consumption of NAND FlashMemories using FlashPower

Vidyabhushan Mohan, Trevor Bunker, Laura Grupp, Sudhanva Gurumurthi Senior Member, IEEE,Mircea R. Stan Senior Member, IEEE, and Steven Swanson

Abstract— Flash is the most popular solid-state memorytechnology used today. A range of consumer electronics products,such as cell-phones and music players, use flash memory forstorage and flash memory is increasingly displacing hard diskdrives as the primary storage device in laptops, desktops, andservers. There is a rich microarchitectural design space for flashmemory and there are several architectural options for incor-porating flash into the memory hierarchy. Exploring this designspace requires detailed insights into the power characteristics offlash memory. In this paper, we present FlashPower, a detailedpower model for the two most popular variants of NAND flash,namely, the Single-Level Cell (SLC) and 2-bit Multi-Level Cell(MLC) based flash memory chips. FlashPower is built on topof CACTI, a widely used tool in the architecture community forstudying various memory organizations. FlashPower takes severalparameters like the device technology, microarchitectural layout,bias voltages and workload parameters as input to estimate thepower consumption of a flash chip during its various operatingmodes. We validate FlashPower against chip power measurementsfrom several different manufacturers and show that our resultsare comparable to the actual chip measurements. We illustratethe versatility of the tool in a design space exploration of poweroptimal flash memory array configurations.

Index Terms—NAND Flash Memory, Power and AnalyticalModels.

I. INTRODUCTION

Flash is the most popular solid-state memory technologyused today. Consumer electronics products, such as cell-phones and portable music players, widely use flash memory,and flash-based solid-state disks (SSDs) are increasingly dis-placing hard disk drives as the storage of choice in laptops,desktops, and even servers. Computer architects have beenexploring a variety of topics related to flash including thedesign of SSDs [1], [2], disk-caches [3], new flash-basedserver architectures [4], even hybrid memory systems [5] andFlash-Translation Layer (FTL) design [6], [7]. Such studiesshow that flash-based disks offer several advantages, includinglower power and higher performance, compared to traditionalHard Disk Drives (HDDs). While most studies evaluate theperformance impact due to the use flash-based media, veryfew studies examine in detail the power benefits of using flash.

Quantifying the energy consumption of flash accuratelyis necessary for many reasons. Power is an important con-sideration because the design of flash memories is closelytied to the power budget within which they are allowedto operate. For example, flash used in consumer electronic

Copyright c© 2013 IEEE. Personal use of this material is permitted.However, permission to use this material for any other purposes must beobtained from the IEEE by sending an email to [email protected]

devices has a signicantly lower power budget compared tothat of a SSD used in data centers, while the power budgetfor flash used in disk based caches is between the two. In suchscenarios, an accurate knowledge of flash power consumptionis beneficial. Insights into flash power consumption are alsonecessary because hybrid memories use flash in conjunctionwith other new non-volatile memories like Phase ChangeRAM and Spin-Transfer Torque RAM [8], [5]. Power awaredesign of such hybrid memory systems is possible only whenan accurate estimate of flash energy consumption is available.So far, most studies that quantify the energy consumptionof flash were limited to using datasheets. But research hasshown that such datasheet-derived estimates are too generaland do not take the intricacies of flash memory operationsinto account. Datasheets only provide average energy estimates(which in certain cases can deviate from the actual behaviorby nearly 3X) and cannot be used when accurate estimates arerequired. These inaccuracies chiefly arise because they cannotaccount for properties like workload behavior that have ansignificant impact on flash energy consumption. In particular,we lack simulation tools that can accurately estimate the powerconsumption of various flash memory configurations.

This paper addresses this void by presenting FlashPower,a detailed power model for the two most popular variants ofNAND flash namely, the Single-Level Cell (SLC) and 2-bitMulti-Level Cell (MLC) based NAND flash memory chips.FlashPower models the key components of a flash chip duringthe read, program, and erase operations, and when idle, andis highly parameterized to facilitate the exploration of a widespectrum of flash memory organizations. FlashPower is builton top of CACTI [9], a widely used tool in the architecturecommunity for studying memory organizations, and is suitablefor use in conjunction with an architecture simulator. Wevalidate FlashPower against power measurements from chipsfrom various manufactures and we show that FlashPowerestimates are comparable to the measured values.

The organization of the rest of this paper is as follows. Thenext section provides an overview of the microarchitecture andoperation of NAND flash memory and Section III presentsthe details of the power model. Section IV explains theexperimental setup while Section V compares the results fromFlashPower with actual chip power measurements from avariety of manufacturers. Section VI presents a design spaceexploration using FlashPower. Section VII presents the relatedwork and Section VIII concludes this paper.

2

II. THE MICROARCHITECTURE OF NAND FLASHMEMORY

In this section, we describe the components in a NANDflash memory array, how they are architecturally laid out andtheir operation.

A. Components of NAND Flash Memory

Flash is a type of EEPROM (Electrically Erasable Pro-grammable Read-Only Memory) that supports read, program,and erase as its basic operations. A NAND flash memory chipincludes command status registers, a control unit, a set ofdecoders, some analog circuitry for generating high voltages,buffers to store and transmit data, and the memory array.An external memory controller sends read, program or erasecommands to the chip along with the relevant physical address.The main component of a NAND flash memory chip is theflash memory array. A flash memory array is organized intobanks (referred to as planes). Figure 1 describes a flash plane.Each plane has a page buffer (composed of sense-amplifiersand latches) which senses and stores the data to be read from orprogrammed into a plane. Each plane is physically organizedas blocks and the blocks are composed of pages. A controlleraddresses a flash chip in blocks and pages. Thus a planeis a two dimensional grid composed of rows (bit-lines) andcolumns (word-lines). At the intersection of each row andcolumn is a Floating Gate Transistor (FGT) which stores alogical bit of data. In this paper, the terms “cell” and “FGT”refer to the same physical entity and are used interchangeably.In Single-Level Cell (SLC) flash, the FGT stores a singlelogical bit of information, while in Multi-Level Cell (MLC)flash, the FGT stores more than one bit of information. Eventhough the term MLC flash means the number of bits storedin a FGT can be more than 1, we use the term MLC flashmostly in the context of a 2-bit MLC flash.

A group of bits in one row of a plane constitute a page. EachNAND flash block consists of a string of FGTs connected inseries with access transistors to the String Select Line (SSL),the Source line (SL), and the Ground Select Line (GSL), asshown in Figure 1. The string length is equal to the totalnumber of FGTs connected in series. The number of pagesin a block and the size of each page are integer multiples ofthe string length, and the number of bit-lines running throughthe block, respectively. The multiplication factor depends onthe architectural layout of the block. Section II-C explainshow the blocks are laid out architecturally. A pass transistoris connected to each word-line to select/unselect the page.

B. Basic Flash Operations

NAND flash uses Fowler-Nordheim (FN) tunneling to movecharges to/from the floating gate. A program operation in-volves tunneling charges to the floating gate while an eraseoperation involves tunneling charges off the floating gate.Tunneling of charges to and from the FGT varies its thresholdvoltage and bits are encoded as varying threshold voltagelevels. A read operation involves sensing the threshold voltageof the FGT. An erase is performed at the granularity of a block

Fig. 1: Circuit Layout of a NAND Flash Memory Plane.Adapted from [10]

Row Npagesize cells Npagesize cells1 Page 0 Page 12 Page 3.. .. .... .. ..

Npage/2 Npage − 2 Npage − 1

TABLE I: Architectural layout of a SLC Block. Adaptedfrom [13].

while the read and program operations are performed at a pagegranularity. Since each FGT in a MLC flash has more thantwo threshold voltage levels, a read operation for MLC flashinvolves multiple stages to sense the correct threshold voltagelevel. An SLC flash has only one stage in a read operation.Most NAND flash memories employ an iterative programmingmethod like Incremental Step Pulse Programming (ISPP) toprogram an FGT. In iterative programming, a FGT attains itstarget threshold voltage through a series of small incrementalsteps. By controlling the duration and the magnitude ofprogram voltage in each step, iterative programming ensuresthat precise threshold voltage levels are reached. After eachiteration, a verify operation is performed to ensure the requiredthreshold voltage level is reached. If not, the operation isrepeated until an upper bound on the number of steps isreached. This upper bound is fixed during the design of thechip. If the upper bound is reached and the target thresholdvoltage level is still not reached, then the program operationfails. As the iteration count increases, the energy required fora program operation also keeps increasing. Section III-E5 pro-vides more details about how FlashPower estimates programenergy consumption using iterative programming. Brewer etal. provide more details on circuit and micro-architecture offlash memory [10], [11], [12].

C. Architectural Layout of Flash Memory

While Figure 1 shows the circuit level layout of a flashmemory plane, this section explains the page layout in a flashmemory block. We illustrate the architectural layout using aSLC and a 2-bit MLC flash as examples, but the layout issimilar for all N-bit flash. We note that the architectural layoutof SLC and MLC flash can vary even though the underlying

3

Row MSB of first LSB of first MSB of last LSB of lastNpagesize Npagesize Npagesize Npagesize

cells cells cells cells1 Page 0 Page 4 Page 1 Page 52 Page 2 Page 8 Page 9.. .. .. .. .... .. .. .. ..

Npage/4 Npage − 6 Npage − 2 Npage − 5 Npage − 1

TABLE II: Architectural layout of a 2-bit MLC Block.Adapted from [13]

circuit layout and implementation are identical for the two.Assuming that there are a total of Npages in a block and thesize of each page is in Npagesize, Tables I and II depict thearchitectural layout of a flash memory block. In case of SLCflash, each cell stores 1-bit of information and is mapped toa logical page. The time taken to read and program this bitis nearly the same for all pages. In case of 2-bit MLC flashsome pages are mapped to the MSB while some other pagesare mapped to the LSB. The pages corresponding to LSB areread and programmed faster compared to pages mapped to theMSB. We will refer to pages mapped to the LSB (pages 0, 1,2, 3, etc. in Table II) as fast pages and pages mapped to theMSB (pages 4, 5, 8, 9, etc. in Table II) as slow pages.

III. FlashPower: A DETAILED ANALYTICAL POWERMODEL

This section presents the details of the analytical powermodel that we have developed for NAND flash memory chips.FlashPower uses analytical models to estimate NAND flashmemory chip energy dissipation during basic flash operationslike read, program and erase and when the chip is idle. Beforedelving into the details of the model, we list the componentsthat are modeled and the power state machine. We then explainhow FlashPower was integrated with CACTI [9], followed bythe details of the model itself.

A. Circuit Components

With respect to Figure 1, the components that dissipateenergy are,

• The bit-line (BL) and word-line (WL) wires.• The SSL, GSL and SL.• The drain, source and the gate of the SST, GST and PTs.• The drain, source and control gate of the FGTs.• The floating gate of the FGTs - Energy dissipated during

program and erase operation.• The Sense amplifiers (SAs) - Energy dissipated during

read, program verify and erase verify operation.

In addition to the above components, FlashPower models theenergy dissipated by the block and page decoders, and thecharge pumps (that provide voltages for read, program, anderase operation). The energy per read, program and eraseoperation are the sum of the energy dissipated by all theaforementioned components. Since the energy dissipation ofI/O pins varies significantly with the design of the circuit boardusing the flash chip, we do not model their energy.

Powered Off

Powered On

Precharge

State

Read

Command

Program

Command

Erase State

Era

se

Com

man

d

Read State

Unprogrammed

Page

Programmed

Page

Program State

Page ‘1’

Page ‘0’

Fig. 2: Power State Machine for an SLC NAND Flash Chip

Powered Off

Powered On

Precharge

State

Read

CommandProgram

Command

Erase State

Era

se

Com

man

d

Slow Page

Programmed

Fast Page

Unprogrammed

Slow Page

Unprogrammed

Read State

Fast Page

Programmed

Program State

Slow

Page

‘10’

Slow

Page

‘00’

Fast Page

‘1’

Slow

Page

‘01’

Fast Page

‘0’

Slow

Page

‘11’

Fig. 3: Power State Machine for a 2-bit MLC NAND FlashChip

B. Power State Machine

Figures 2 and 3 describes the power state machine for anSLC and an MLC NAND flash chip respectively. The circlesrepresent the individual states, while the solid lines denotestate transitions. We model the energy dissipated when the chipis powered. When the chip is on, but is not performing anyoperations, it is in the “precharge state”. In this state, the bit-lines are precharged while the word-lines, and the select lines(SSL, GSL and SL) are grounded. The select lines isolate thearray electrically but the chip is ready to respond to commandsfrom the controller.

Upon receiving a read, program, or erase command fromthe controller, the state machine switches to the correspondingstate. When the command is complete, the state machineswitches back to the precharge state. We model the energydissipation of the actual operation and each state transition.For an SLC read operation, depending on whether a page isprogrammed or not, the chip is in either the “ProgrammedPage” or the “Unprogrammed Page” state. This can be sensedin only one read cycle. For an SLC program operation,depending on whether the bit to be programmed is a “0” ora “1”, the state of the chip is either in “Page 0” or “Page 1”state.

4

Fast Page

‘1’

Fast page

‘0’

Slow Page

‘11’

Slow Page

‘01’

Slow Page

‘00’

Slow Page

‘10’

Fig. 4: Transition between MLC Program States. Thetransitions are based on the multi-page programming

algorithm proposed in [14].

For an MLC read, the chip transitions to one of the fourstates, depending on whether it is programmed or not andwhether it is a slow or a fast page. FlashPower requires atleast2 read cycles to sense one of the four states. For an MLCprogram operation, the state transition of the chip is a functionof the bit pattern to be programmed and whether the page is aslow page or a fast page. FlashPower models an conventionalmulti-page architecture programming method as described in[14]. According to this method, each bit cell stores two pages- a fast page and a slow page. Figure 3 lists all the states for anMLC program while Figure 4 explains how the chip transitionsbetween these states. When a fast page is programmed, thechip transitions to a “Fast Page 1” state or a “Fast Page0” state depending on the bit pattern. When a Slow Page isprogrammed, the chip transitions to one of the four slow pagestates depending on the bit pattern to be programmed and thecurrent state of the fast page. The main difference between atransition to a “Slow Page 11” state or to a “Slow Page 01”state from a “Fast Page 1” state is the number of steps inthe iterative programming method and the threshold voltagedifference between the two states. A transition to a slow pagestate can occur only if the chip is already in one of the twofast page states.

While FlashPower models a canonical MLC program op-eration, it can also model other programming algorithms aslong as they employ iterative programming techniques similarto the multi-page architecture model. Modifications to theMLC power state machine can support such algorithms. Forexample, full sequence programming is one programming al-gorithm in which both the slow and fast pages are programmedtogether. To support such an algorithm, the state machinemust transition directly from the idle state to one of the fourslow page states. Since the main difference between transitionsis the number of steps in the iterative programming methodand the threshold voltage difference between the states, theprogramming model does not change, but the values forindividual parameters that the model takes will change.

For an erase operation, we consider only a single state.Note that while we consider the bit pattern for programoperation, we do not consider the bit pattern for read or eraseoperation. This is because the read operation only involvessensing the threshold voltage of the cell, which is deterministicif all cells in the page are already programmed and is non-deterministic otherwise. For an erase operation, all cells inthe block are erased irrespective of their threshold voltagestate. In case of the program operation, which involves settingthe threshold voltage of the cells to one of the 2n states, theinput bit pattern makes a significant difference in the powerconsumption. As we show in Section V, the energy dissipatedto reach a “Fast Page 0” state is significantly different fromthe energy dissipated to reach a “Slow Page 00” state, whichin turn is different from the energy dissipated to reach a“Slow Page 01” state. The important parameters that affect theenergy dissipation for the program operation are the thresholdvoltage difference and the number of verify cycles requiredto transition to the final threshold voltage state from the startstate, both of which are governed by the input bit pattern.

C. Extending FlashPower for n-bit MLC flash

While the state machines in Figures 2 and 3 are for SLCand 2-bit MLC flash, FlashPower can be extended for othern-bit MLC flash variants like Three Level Cell (TLC) flash.The state machine can be generalized for a n-bit MLC flashas follows: FlashPower has a total of 2n states for the readoperation and we can sense one of these states within n cycles.Assuming that we employ iterative programming and multi-page architecture similar to 2-bit MLC, the model requires atotal of 2n+1 − 2 states for the program operation. Since anerase operation is for an entire block irrespective of the thresh-old voltage of the individual cells in the block, FlashPowerhas only one state for the erase operation. For TLC flash, thiscorresponds to 8, 14 and 1 states for read, program and eraseoperation respectively. The number of steps in programmingand erase method can vary from one implementation to anotherand so would the difference in threshold voltage of statesbetween transitions. Since the program and erase model arebased on these parameters, they can be extended to supportn-bit MLC flash.

D. Integration with CACTI

FlashPower is designed as a plug-in to CACTI [9], awidely used memory modeling tool. We estimate the powerconsumption of array peripherals such as decoders and senseamplifiers assuming that they are high performance devicesand for the bit-lines and word-lines assuming aggressiveinterconnect projections. We also model an FGT as a CMOStransistor and then use the CMOS transistor models of CACTIto calculate the parasitic capacitance of an FGT. However,unlike CACTI, which models individual components of SRAMand DRAM based memory systems, we have chosen to modelthe basic operations on the flash memory. This is because,in the memories that CACTI models, individual operationsoperate at the same supply voltage to drive the bit-lines andthe word-lines. For example in a SRAM-based memory, both

5

TABLE III: Inputs to FlashPower.FlashPower has 12 Required(R) and 23 Optional(O) Parameters.

Microarchitectural ParametersNpagesize (R) Size (in bytes) for the data area in

each page.Nsparebytes (R) Size (in bytes) for the spare area in

each page.Npages (R) Number of pages per block.Nbrows (R) Number of rows of blocks in a plane.Nbcols (O) Number of columns of blocks in a plane.

Defaults to 1.Nplanes (O) Number of planes per die. Defaults to 1.Ndies (O) Number of dies per chip. Defaults to 1.tech (R) Feature size of FGTs.Bits per cell (R) Number of bits per FGT.

Device-level ParametersNA, ND (O) Doping level of P-well and N-well.

Defaults to 1015cm−3 and 1019cm−3.β (O) Capacitive coupling between control gate.

and P-well. Defaults to 0.8.GCR (O) Ratio of control gate to total floating gate

capacitance. Obtained from ITRS.tox (R) Thickness of tunnel oxide in FGTs.WL

(O) Aspect ratio of FGTs. Defaults to 1.Workload Parameters

Nbits 1 (O) Number of 1’s to be read, or programmed.

Timing Parameterstprogram (R) Latency to program a page for SLC flash

(fast page in MLC flash).tprogram,slow (O) Latency to program a slow page in MLC flash.

Defaults to tprogram * 2.tread (R) Latency to read a page in SLC flash

(fast page in MLC flash).tread,slow (O) Latency to read a slow page in MLC flash.

Defaults to tread * 2.terase (R) Latency to erase a flash block.

Bias ParametersVdd (R) Maximum operating voltage of the chip.Vread (O) Read voltage for SLC flash (fast page

in MLC flash). Defaults to 4.5V.Vread,slow (O) Read voltage for slow page in MLC flash.

Defaults to 2.4VVbl,drop [0/1] (O) Bit-line swing for read operation. Defaults to 0.7V.Vpgm (O) Program voltage for selected page.

Obtained from ITRS.Vpass (O) Pass voltage during program for un-selected pages.

Defaults to 10V.Vera (O) Erase voltage to bias the substrate. Same as Vpgm.Vbl,pre (O) Bit-line precharge voltage. Defaults to 3/5th of Vdd.Vstep (O) Step voltage used for program and erase.

Defaults to 0.3VPolicy Parameters

Nread verify cycles (R) Number of read verify cycles for a program operationNerase cycles (R) Number of erase pulses required to erase a block.∆Vth,slc (O) Threshold voltage difference between programmed and erased state for SLC flash. Defaults to 3V.∆Vth,mlc (O) Threshold voltage difference between adjacent programmed states for MLC flash. Defaults to 0.9V.Optimize Write (O) Boolean flag enabling optimization to certain threshold levels transitions in MLC flash. Defaults to false.Optimize Erase (O) Boolean flag enabling optimization if the threshold level of a FGT is unchanged after programming. Defaults to false.

read and write operation use the same voltage for word-linesand bit-lines. The granularity (total bytes) of a read/write isalso the same. However for NAND flash, read, program anderase operate at different bias conditions and the granularity ofan erase differs from read/program. Since the circuitry behavesdifferently for these different operations, we believe that it ismore appropriate to model energy for the basic operationson the flash memory. Table III summarizes the inputs toFlashPower.

E. Power Modeling Methodology

1) Modeling the Parasitics of an FGT: To calculate theenergy dissipation of an FGT, it is necessary to estimatethe FGT’s parasitic capacitance - i.e. an FGT’s source, drainand gate capacitance. While the source and the drain of anFGT and a CMOS transistor are similar, an FGT has a twogate structure (floating and control) while a CMOS transistorhas a single gate. We model a dual gate FGT structure asan equivalent single gate CMOS structure by calculating theequivalent capacitance of the two capacitors (one across theinter-poly dielectric and other across the tunnel oxide). Wethen use CACTI to calculate the parasitics of this transis-tor. Figure 5 illustrates this. We calculate the control gatecapacitance Ccg,mc using the information on gate couplingratio (GCR) available in [15], while we calculate the floatinggate capacitance Cfg,mc using the overlap, fringe and areacapacitance of a CMOS transistor of the same feature size.The source and drain capacitance of the transistor depend onwhether the transistor is folded or not. FlashPower assumes

that the transistor is not folded as the feature size is in theorder of tens of nanometers. We use CACTI to model thedrain capacitance of the FGT and the parasitic capacitance ofother CMOS transistors like the GST, PT and SST.

Fig. 5: Modeling a Floating Gate Transistor

2) Derived Parameters: The length of each bit-line Lbitline

and each word-line Lwordline are derived as:

Lwordline = Nbitlines−perblock ∗Nbcols ∗ pitchbit−line (1)Lbitline = (Npages + 3) ∗Nbrows ∗ pitchword−line (2)

where Nbitlines−perblock = (Npagesize+Nsparebytes)∗8. “3” isadded to Npages because for each block, the bit-line crosses3 select lines (GSL, SSL and SL). The terms pitchbit−line

and pitchword−line refer to the bit-line and word-line pitchrespectively and are equal to 2 ∗ featuresize.

The total capacitance to be driven along a word-line is givenby:

Cwl = Cd,pt + Cg,mc ∗Nbitlines−perblock

+Cwl,wire ∗ Lwordline (3)

6

where, Cd,pt is the drain capacitance of the pass transistor,Cg,mc the equivalent gate capacitance of FGT and Cwl,wire isthe word-line wire capacitance. Similarly, the total capacitanceto be driven along the bit-line is equal to:

Cbl = 2 ∗ Cd,st + Cd,mc ∗Npages + Cbl,wire ∗ Lbitline (4)

where Cd,st is the drain capacitance of the select transistors(SST), Cd,mc is the drain and source capacitance of the FGTand Cbl,wire is the bit-line wire capacitance. Adjacent FGTsconnected in series share the source and the drain. Becauseof this assumption, the source capacitance of one FGT isconsidered equal to the drain capacitance of the neighboringFGT. The source of SST is shared with the drain of the firstFGT in the string while the source of the last FGT in the stringis shared with the drain of GST. The total capacitance of theSSL is:

Cssl = Cgsl = Cd,pt + Cg,st ∗Nbitlines−perblock

+Cwl,wire ∗ Lwordline (5)

Since the GSL needs to drive the same components as theSSL, we have Cssl = Cgsl. The capacitive component of theSL is equal to:

Csrcline = Cwl,wire ∗ Lwordline + Cd,st (6)

We assume that the source capacitance of GST is equal to itsdrain capacitance (Cd,st).

3) Transition from the Powered Off to the Precharged State:When this transition occurs, the bit-lines are precharged andword-lines to Vbl,pre and Vwl,pre respectively. We bias theSST, GST and the PTs so that the current flowing through theFGTs is disabled. We bias the source line to ground. Hencethe energy to transition to the precharge state (Epre) equals:

Epre = Ebl,pre + Ewl,pre (7)Ebl,pre = 0.5 ∗ Cbl,wire ∗ (0 − Vbl,pre)

2

∗Nbitlines−perblock ∗ Lbitline (8)Ewl,pre = 0.5 ∗ Cwl,wire ∗ (0 − Vwl,pre)

2

∗Npages ∗ Lwordline (9)

While we precharge the bit-lines to reduce the latency of readaccesses [16], the word-lines are not. During our validationwe set Vwl,pre to zero, but the model includes the word-lineprecharge Vwl,pre as a parameter so that devices that performword-line precharging can use this parameter.

4) The Read Operation: The read operation for SLC andMLC flash are quite similar. In case of SLC flash and fastpages in 2-bit MLC flash, the sequence of operations requiredto sense the threshold voltage of the cells is the same since weonly need to distinguish between two states. In case of slowMLC pages, we need 2 stages (and hence two read operations)to distinguish between four states.

To perform a read operation for both SLC and MLC flash,the block decoder selects one of the blocks to be read whilethe page decoder selects one of the Npages inside a block. ForSLC flash and fast pages in MLC flash, we bias the word-lineof the selected page to ground while we change the bias ofun-selected word-lines to Vread from Vwl,pre. This causes the

un-selected pages to serve as transfer gates. Hence the energyto bias the word-line of the selected page to ground and theun-selected pages to Vread equals:

Eselected−page,r = 0.5 ∗ Cwl ∗ (0 − Vwl,pre)2 (10)

Eunselected−pages,r = 0.5 ∗ Cwl ∗ (Vread − Vwl,pre)2

∗(Npages − 1) (11)

For the read operation, the bit-lines remain at Vbl,pre. Depend-ing upon the threshold voltage of the FGT, the FGT is on oroff which impacts the voltage drop in the bit-line. AssumingNbits 1 to be the number of bits corresponding to logical “1”and Nbits 0 to be the number of bits corresponding to logical“0”, Vbl,drop 1 and Vbl,drop 0 to be the drop in bitline voltagesfor a 1-read and 0-read, the resulting energy dissipation equals:

Ebl−1,r = 0.5 ∗ Cbl ∗ (Vbl,pre − Vbl,drop 1)2 ∗Nbits 1 (12)

Ebl−0,r = 0.5 ∗ Cbl ∗ (Vbl,pre − Vbl,drop 0)2 ∗Nbits 0 (13)Esl,r = Essl,r + Egsl,r + Esrcline,r (14)

where Vbl,drop 1 is the drop in bitline voltage when FGT ison and Vbl,drop 0 is the drop in bitline voltage when FGT isoff and Essl,r, Egsl,r and Esrcline,r are given by:

Essl,r = Egsl,r = 0.5 ∗ Cssl ∗ (Vread − 0)2 (15)

Esrcline,r = 0.5 ∗ Csrcline ∗ (Vread − 0)2 (16)

We detect the state of the cell using the sense amplifierconnected to each bit-line. The sense amplifier is a part ofthe page buffer and contains a latch unit [16]. It detectsthe voltage changes in the bit-line and compares it with areference voltage. Since this is very similar to DRAM senseamplifier [17], we use CACTI’s DRAM sense amplifier modelto determine the energy dissipated for sensing and CACTI’sSRAM model to model the energy dissipated to store thesensed data in the page buffer.

In order to read slow pages, we repeat the above steps, butinstead of biasing the word-line of the selected page to ground,we bias the word-line of the selected page to Vread,slow.If the threshold voltage of the FGT is less than Vread,slow,then the FGT is off and the current through the bitline issmall. Otherwise, the FGT is on and the current is high.Combining the first and the second read operation helps todistinguish between four threshold stages, thus helping to readthe contents of the slow page.

Once the read operation is complete, the system transitionsfrom the read state to the precharged state. This means thatwe bias bit-lines back to the precharge voltage Vbl,pre, whilewe bias the select lines back to ground from Vread and theword-lines back to Vwl,pre. But the biasing for the transitionfrom read operation to precharge state is equal but opposite tothe biasing for the transition from the precharge state to readoperation. Hence the energy to transition from the read to theprecharge state (Er−pre) equals:

Er−pre = Ebl−1,r + Ebl−0,r + Eselected−page,r

+Eunselected−pages,r + Esl,r (17)

We calculate the energy dissipated by the decoders for theread operation, Edec,r, using CACTI. We modify the CACTI

7

decoder model to perform two levels of decoding (block andpage decode) for each read and program operation. Hence thetotal energy dissipated per read operation for SLC flash andfast page of MLC flash equals:

Er = Eselected−page,r + Eunselected−pages,r + Ebl−1,r

+Ebl−0,r + Esl,r + Er−pre + Esenseamp,r + Edec,r(18)

The energy for a slow page read operation in MLC flash isequal to:

Er,slow = Er + Eselected−page,r + Eunselected−pages,r

+Ebl−1,r + Ebl−0,r + Esl,r + Er−pre

+Esenseamp,r + Edec,r (19)

where Eselected−page,r is calculated by biasing the word-lineto Vread,slow instead of ground.

5) The Program Operation: To perform a read operationfor both SLC and MLC flash, the block decoder selects oneof the blocks to be programmed while the page decoder selectsone of the Npages inside a block. We transfer the data fromthe controller to the page buffers. Since the decoding for theprogram operation is the same as that of the read operation, weestimate the energy for decoding during the program operationto be equal to Edec,p = Edec,r, where Edec,r is the energy forthe decode operation estimated using CACTI.

FlashPower adopts the incremental step pulse programmingmodel (ISPP) to estimate energy dissipation during programoperation [11]. In this paper, we refer to each pulse as asub-program operation. For each sub-program operation, weincrease the program voltage of the selected page by a stepfactor Vstep. After each sub-program operation, we perform averify program operation to check if the correct data is writtento the page. If not, we repeat the sub-program operation, butwith a higher program voltage. Nread verify cycles indicatesthe total number of sub-program operations performed dur-ing the write operation. The duration of each sub-programpulse is a function of the program latency tprogram andNread verify cycles.

We now calculate the energy for each sub-program op-eration. Let Vpgm,i be the program voltage of the selectedword-line in the ith iteration. We bias the selected word-lineto Vpgm,i and the un-selected word-lines to Vpass to inhibitthem from programming. Since the program voltage for theselected page varies for every iteration, we present the energydissipated by this step as a function of voltage. Thus the energyto bias the selected page and the un-selected page equals:

Eselected−page,p(V ) = 0.5 ∗ Cwl ∗ (Vpgm,i − Vwl,pre)2 (20)

Eunselected−page,p = 0.5 ∗ Cwl ∗ (Vpass − Vwl,pre)2

∗ (Npages − 1) (21)

FlashPower adopts the the self-boosted program inhibit model[11] to prevent cells corresponding to logical “1” from beingprogrammed. According to the self-boosted program inhibitmodel, the channel voltage is boosted to about 80% of theapplied control gate voltage by biasing the bit-lines corre-sponding to logical “1” at Vbl,ip and setting the SSL toVdd. The resulting electric field across the oxide is not high

enough for tunneling and the cells are inhibited from beingprogrammed. Assuming Nbits 1 to be the number of bitscorresponding to logical “1”, the energy dissipated by bit-linesto inhibit programming is equal to:

Ebl,ip = 0.5 ∗ (Cbl − Cd,mc ∗Npages)

∗ V 2ch boost ∗Nbits 1 (22)

where Vch boost is the boosted channel voltage and is a fractionof applied control gate voltage.

We bias the bit-lines corresponding to logical “0” to ground.The resulting high electric field across the tunnel oxide fa-cilitates FN tunneling resulting in charges tunneling from thechannel onto the floating gate. The energy dissipation of bitlinecorresponding to logical “0” is equal to:

Ebl,p = (0.5 ∗ (Cbl − Cd,mc ∗Npages) ∗ (0 − Vbl,pre)2

+ Etunnel,mc) ∗Nbits 0 (23)

where the term Etunnel,mc is the tunneling energy per cell.Etunnel,mc is calculated as

Etunnel,mc = ∆Vth ∗ IFN ∗ tsub−program (24)

where IFN is the tunnel current and tsub−program is theduration of sub-program operation. IFN is calculated asIFN = JFN ∗Afgt, where JFN is the tunnel current densitycalculated using [18] and Afgt is the area of the floating gate.

Then the energy dissipated in charging the select lines isequal to:

Esl,p = Essl,p + Egsl,p + Esrcline,p (25)

where

Essl,p = Egsl,p = 0.5 ∗ Cssl ∗ (Vdd − 0)2

Esrcline,p = 0.5 ∗ Csrcline ∗ (Vdd − 0)2

Once the sub-program operation completes, NAND flash chipsperform a program-verify operation to check if write operationis successful. FlashPower models the verify-program operationas a read operation. Hence, the total energy per sub-programoperation is:

Esubp(V ) = Eselected−page,p(V ) + Eunselected−page,p

+ Ebl,ip + Ebl,p + Esl,p + Evp (26)

where Evp, the energy spent in program verification and isgiven by Evp = Er −Edec,r. Er is defined by equation (18).Since the entire process of sub-program and verify-program isrepeated a maximum of Nloop number of times, the maximumenergy for programming is given by:

Epgm =

Nread verify cycles∑i=0

Esubp(Vpgm,i) (27)

Nread verify cycle is obtained from datasheets like [19] or canbe fed as a input to the model.

Since a program operation concludes with a read operation,the transition from the Program to Precharge is same as thetransition from a read to precharge.

Ep−pre = Er−pre (28)

8

where Er−pre is given by equation (17).Hence the total energy dissipated in the program operation

is equal to:

Ep = Edec,p + Epgm + Ep−pre (29)

6) The Erase Operation: Since erasure happens at the blockgranularity, the controller sends the address of the block tobe erased. The controller uses only the block decoder and theenergy for block decoding (Edec,e) is calculated using CACTI.To aid block-level erasing, the blocks are physically laid outsuch that all pages in a single block share a common P-well.Moreover, multiple blocks share the same P-well [11] andtherefore it is necessary to prevent other blocks sharing thesame P-well from being erased. FlashPower assumes the self-boosted erase inhibit model [11] to inhibit other blocks sharingthe same P-well from getting erased.

Once we select an erase block, NAND flash memory usesmultiple erase pulses to erase a block. Each such operationis referred to as a sub-erase operation. The bias voltage forthe P-well of the selected block, Vera, is set to the initialprogram voltage, Vpgm. After each sub-erase operation, a read-verify operation determines whether all FGTs in the block areerased or not. If all the FGTs in the block are not erased,we repeat the erase operation (maximum of Nerase cycles

times) after incrementing Vera by the step voltage Vstep. Wedenote the erase voltage used for each sub-erase operation tobe Vsub−erase,i, where i indicates the iteration count of thesub-erase operation. The duration of each sub-erase operation,tsub−erase, is a function of the maximum erase latency andNerase cycles.

For each sub-erase operation, we bias the control gates ofall the word-lines in the selected block to ground. We biasthe P-well for the selected block to erase voltage Vsub−erase,i

and the SSL and the GSL to Vsub−erase,i ∗ β, where β is thecapacitive coupling ratio of cells between control gate and theP-well. A typical value of β is 0.8 [10]. We bias the SL toVbl,e. Here Vbl,e is equal to Vsub−erase,i − Vbi where Vbi isthe built-in potential between the bitline and the P-well of thecell array [20]. Adding up the charging of the SSL, GSL andSL, we get the energy dissipated in the select lines to be:

Esl,e = Essl,e + Egsl,e + Esrcline,e (30)

where Essl,e, Egsl,e and Esrcline,e are calculated using theSSL, GSL and SL capacitance and the bias condition explainedabove. We bias the bit-lines to Vbl,e which dissipates energythat is modeled as:

Ebl,e(V ) = 0.5 ∗ Cbl ∗ (Vbl,e − Vbl,pre)2 ∗Nbitlines−perblock

(31)We parameterize Ebl,e(V ) as a function of applied erasevoltage, since the erase voltage changes for each sub-eraseoperation.

With the P-well biased to Vsub−erase,i, cells that have highthreshold voltage undergo FN tunneling. Electrons tunnel offthe floating gate onto the substrate and the threshold voltageof the cell is reduced. To effect FN tunneling, the depletionlayer capacitance between the P-well and the N-well should

be charged. This capacitance of the junction between the P-well and the N-well, which form a P-N junction, is a functionof the applied voltage Vera, the area of the p-well Awell. Thedynamic power to charge this capacitance as the voltage acrossthe junction raises from 0V to Vsub−erase,i is determined as:

Ejunction,e(V ) = (Cj0

(1 − Vsub−erase,i/φ0)m∗Afgt)∗V 2

sub−erase,i

(32)where Cj0 is the capacitance at zero-bias condition and φ0is the built-in potential of the P-N junction. m, the gradingcoefficient is assumed to be 0.5. We parameterize Ejunction,e

as a function of applied erase voltage since the erase voltagechanges for each sub-erase operation.

Once this P-N junction capacitance is fully charged, all cellsin the block are erased. The tunneling energy for the block,Etunnelerase,mc, is calculated using equation (24) and a sub-erase pulse whose duration is tsub−erase. After each sub-erasepulse, a verify erase operation is performed to ensure that allFGTs in the block are erased [11]. The verify erase operationis a single read operation which consumes energy equivalent toa read operation. This is modeled as, Eblock,ve = Er −Edec,r

where Er is given by equation (18).Since the erase voltage changes for each sub-erase opera-

tion, the total energy consumed in each sub-erase operation,Esub−erase is parameterized as a function of the voltage andis equal to:

Esub−erase(V ) = Esl,e + Ebl,e(V ) + Etunnelerase,mc(V )

+ Ejunction,e(V ) + Eblock,ve (33)

Since an erase operation ends with a read operation, thetransition from the erase to precharge is same as the transitionfrom read to precharge. The energy dissipated during thistransition is equal to:

Ee−pre = Er−pre (34)

where Er−pre is given by equation (17).Hence the total energy dissipated in erase operation Eerase

is equal to:

Eerase = Edec,e +

Nerase cycles∑i=0

Esub−erase(Vsub−erase,i)

+ Ee−pre (35)

We can observe that if all the cells in the block areprogrammed and are at a low threshold voltage stage, thenan erase operation is not necessary. Some controllers canidentify this scenario and prevent an erase operation even whenthey receive such a command, thereby reducing the powerconsumption and the latency of the erase operation.

7) Charge pumps: Each read, sub-program, or sub-eraseoperation requires that the charge pump supply a high voltage(5V-20V) to the flash memory array. Ishida et. al specify thatthe energy consumed for each high voltage pulse from thecharge-pumps is 0.25µJ for chips operating at 1.8V and 0.15µJfor chips operating at 3.3V for the read and program operation[21]. In FlashPower, we use these constant values but sinceFlashPower is parameterized, a detailed charge-pump modelcan be incorporated in lieu of the current one.

9

Chip Name Feature Size Page Size Spare Area Pages/Block Blocks/Plane Planes/Die Dies/Chip Total Capacity(in nm) (in KB) (in bytes) (in Gb)

B-MLC8 72 2KB 64 128 2048 2 1 8.25D-MLC32 80 4KB 128 128 2096 2 2 33.77E-MLC64 51 4KB 128 128 2048 4 2 66.00A-SLC4 73 2KB 64 64 2048 2 1 4.13B-SLC4 72 2KB 64 64 2048 2 1 4.13

TABLE IV: NAND Flash Chips Used for Evaluation

8) Multi-plane operation: The power model described thusfar corresponds to single-plane operations. However modernNAND flash chips allow multiple planes to operate in parallel.These are referred to as multi-plane operations. Since Flash-Power models single-plane operations, energy consumption formulti-plane operations can be determined by multiplying theresults of single-plane operations with the number of planesoperating in parallel.

So far, we have provided an detailed analytical model thatestimates the energy dissipation of NAND flash for individualoperations. We now validate our model with measurementsfrom real flash chips from various manufacturers.

IV. EXPERIMENTAL SETUP

To validate the model’s usability and accuracy, we measuredthe latency and energy consumption of read, program, anderase operations from real flash chips. Using a custom-builtboard, we acquired fine-grain measurements of a diverse setof flash chips. In this section, we describe the hardware setupand the test procedure.

A. Hardware Setup

Figure 6 depicts our flash characterization board that con-sists of two sockets for testing flash chips. The board connectsto a Xilinx XUP development kit with a Virtex-II FPGA andan embedded PowerPC processor. The processor runs Linuxand connects to a custom flash controller that gives the usercomplete access to the flash chips. The flash controller canissue operations to the chips and record latency measurementswith a 10 ns resolution. For more details about the character-ization board, please refer [22].

To measure the energy consumption of individual flashoperations, we use a high-bandwidth current probe attached toa mixed-signal oscilloscope. The Agilent 1147A current probeattaches to a jumper that provides power to a single flash chip.We trigger the oscilloscope to capture current measurementswhen the flash controller sends a command to the flash chip.

B. Test Procedure

Table IV summarizes the system-level properties of the flashchips we used for validation. We selected a diverse set ofchips - in terms of feature size, capacity, array structure, andmanufacturer - to depict the compatibility and adaptability ofFlashPower.

We measured the latency and power consumption for read,program, and erase operations on each chip. The types ofoperations we performed varied for SLC- and MLC-basedflash chips.

Fig. 6: Flash test board The custom-built flash test boardconnects to an FPGA-based development kit with an

embedded processor running Linux and a flash controllerthat issues operations to up to two flash chips at a time.

SLC chips We tested four states for a single-page read op-eration on SLC-based chips: unprogrammed (freshly-erased),all 0’s, 50% 0’s, and all 1’s. For the program operation, weacquired data while programming a single page with all 0’s,50% 0’s, and all 1’s. Finally, for erase operations, we measuredthe energy while erasing an entire block of pages programmedwith all 0’s and pages programmed with all 1’s.

MLC chips To quantify energy consumption for read opera-tions of MLC-based chips, we measured the energy consump-tion of reading a page in four states:

1) Fast Page Programmed - Reading a programmed fastpage.

2) Slow Page Programmed - Reading a programmed slowpage.

3) Fast Page Unprogrammed - Reading a freshly-erasedfast page.

4) Slow Page Unprogrammed - Reading a freshly-erasedslow page.

For program operations, we measured the energy consump-tion of the following six transitions caused by programming apage:

1) Fast Page 0 - Programming a freshly-erased fast pagewith all 0’s.

2) Fast Page 1 - Programming a freshly-erased fast pagewith all 1’s.

3) Slow Page 01 - Programming a slow page with all 1’sand the fast page counterpart is of type Fast Page 0.

10



6) Slow - Programming a slow page with all 0’sand the fast page counterpart is of type Fast Page 1.

For erase operations, we measured the energy consumptionof erasing an entire block of fully-programmed pages of all0’s and all 1’s.

For each chip and each type of operation, we performed tenmeasurements and reported the average. We summarize theresults of these experiments in conjunction with the model’sapproximations in Section V.

V. VALIDATION OF FlashPower

We now compare the outputs of the analytical model withpower measurements from real SLC and MLC flash chips.The sources used to obtain values for parameters used in ourmodel (listed in Table III) are as follows: Except for featuresize of FGTs, values for all microarchitectural parameters areobtained from manufacturer datasheets. The feature size ofFGTs used in our chips are obtained from [23], [24]. Valuesfor device-level parameters are obtained from [20], [10], [15].The operating voltage of the chip is obtained from datasheets,while for other bias parameters we use information availablein Chapter 6 of [10]. The value for timing parameters andpolicy parameters are deduced from datasheets, by measuringthe latency of flash operations, and from [10]. We used themeasured latencies of each flash operation (see Section IV)because we found that the values for timing parameters andprogram verify cycles in datasheets were significantly different(as high as 50% in some cases) from actual behavior. Whiledatasheets contained average latencies of flash operation, theactual latencies depended on factors like the type of page andin some cases on the data pattern being used. Finally, sincewe measure the energy consumption of the chips by varyingthe input bit pattern, we control the values for the workloadparameters.

We now present the results for read, program and eraseoperation for both MLC flash followed by SLC flash.

A. Validation for SLC flash

Figures 7(a), 7(b) and 7(c) show the results for the read,program and erase operations for SLC flash. The y-axisrepresents the energy consumption of the particular operation,while the x-axis represents the state of a page in SLC flash asexplained in Section III-B. From Figure 7(a), we can observethat the variation between the estimated and the measured readenergy is about 40% on average (at most 62% or 8µJ) for A-SLC4, while it is less than 10% for B-SLC4. For the programoperation, the difference between estimates and measurementsis 22% on average (at most 26% or 2µJ) for B-SLC4, whilethe difference is 40% on average (at most 62% or 8µJ) forA-SLC4. For the erase operation, it is 24% on average ( atmost 27% or 8µJ) for B-SLC4, while it is 36% on average(at most 37% or 24µJ) for A-SLC4.

(a) Read Energy Comparison for SLC flash

(b) Program Energy Comparison for SLC flash

(c) Erase Energy Comparison for SLC flash

Fig. 7: Energy Comparison for SLC flash

These results show that the FlashPower can estimate theenergy consumption of B-SLC4 with high accuracy, but forA-SLC4, the accuracy is low. It should be noted that thedifferences among the the measured values of the two chips issimilar. In other words, the estimates of A-SLC4 and B-SLC4are comparable among themselves, while the differences inmeasurements between A-SLC4 and B-SLC4 are high.

To understand why such variations occur, we compare theparameters of the two chips. We find that the microarchitec-tural parameters for the two chips are almost the same (pleaserefer to Table IV). Since the feature size and the operatingvoltage of the two chips are almost the same, our modelassumes that the device level parameters are same. As far asthe timing parameters are concerned, our experiments showthat the read latency of A-SLC4 is about 20% higher thanB-SLC4. The program latency for the two chips is similar,while the erase latency is about 30% higher for A-SLC4compared to B-SLC4. With respect to the policy parameters

11

mentioned in Table IV, values for ∆Vth,slc is set based on[15] and since the chips have the almost same feature size,their values are same. The values of program-verify anderase cycles are set using latency measurements from thechips and they are similar. Since the bit patterns are constantacross both the chips, we hypothesize that the differences inenergy are due to changes to specific policy parameters like∆Vth,mlc and Nread verify cycles, device parameters like βand tox or due to manufacturer-specific optimizations of theindividual operations. Better knowledge about the values forthese parameters can significantly increase the accuracy of themodel. FlashPower is developed based on [10], [16], [12],[11] and describes the canonical operation of NAND flashmemory and does not capture manufacturer-specific designs.This results in a significant variation between estimates andmeasurements for A-SLC4 while the variation is significantlylower for B-SLC4. However, since FlashPower is extensivelyparameterized, any manufacturer-specific implementations canbe used for estimating flash power consumption.

B. Validation for 2-bit MLC flash

Figure 8 shows the results for the read operation for MLCflash. The x-axis represents the page being read during the op-eration while the y-axis represents the total energy dissipatedduring the read. Each column represents the average energyconsumed by the flash chip when it is at one of the variousstates described in Section III-B. From the figure, we canobserve that the difference between estimated and measuredenergy is less than 22.3% (or 0.64µJ) for all types of pages.We found that the difference in estimated and measured energyvalues were significantly higher for unprogrammed pages thanfor programmed pages. Since the threshold voltage states ofcells in unprogrammed pages is non-deterministic, modelingthe energy consumption for an unprogrammed page is chal-lenging. It should be noted that reading an unprogrammedpage does not happen often in practice. If we just considerprogrammed pages, the difference drops to less than 12.3% (or0.35µJ) and in most cases, the difference between measuredand estimated values are negligible.

Figure 9 shows the result for the program operation forMLC flash. The y-axis represents the energy consumption ofa program operation, while the x-axis represents the variousstates of a flash page. Each column in the figure representsthe energy consumed by the flash chip when it is at one ofstates described in Section III-B. We find that the accuracy ofthe model is high while programming fast pages for all threeflash chips. For slow pages, the accuracy depends on the chipbeing measured and the bit pattern being used.

Many factors like the number of program verify cycles, theduration of each sub-program operation, the thickness of thetunnel oxide (which affects the tunneling energy), the programand step voltages used, the difference between the starting andthe final threshold voltage states affect the energy dissipationduring a program operation. Moreover, the values for thesefactors change with the type of bit pattern being written.Since programming a slow page involves carefully controllingthe threshold voltage distribution of the FGT, an accurate

Fig. 10: Erase Energy Comparison for MLC flash

value for each of these parameters is even more important.For example, a linear difference in program and step voltagecan cause quadratic difference in energy consumption. WhileFlashPower identifies values for some factors like number ofprogram-verify cycles used, and the duration of each sub-program operation by reverse engineering the chips, values forVpgm, Vstep, ∆Vth,mlc for each chip that is being evaluatedare not publicly available. Since values for such parametersare not publicly available, the variation between estimates andmeasurements are high for some bit patterns. For example, inthe case of B-MLC8, increasing Vpgm and ∆Vth,mlc reducedthe difference between the measurement and the estimates forthe Slow Page 00 and Slow states from 49% to nearly8%. In the case of D-MLC32, even a small decrease in Vpgmand ∆Vth,mlc reduces the variation between estimates andmeasurement for the Slow Page 11 state 49% to nearly 10%.Since FlashPower has been extensively parameterized, anyuser who has detailed information on individual parameterscan feed those parameters into FlashPower to get an accurateestimate of flash energy consumption.

Figure 10 shows the results for the erase operation forthe three chips for 2 different bit patterns. For B-MLC8 andD-MLC32, changing the bit pattern did not affect the eraseenergy significantly. For E-MLC64, when all pages in theblock were in the lowest threshold state (“11”), the eraseenergy was about 5X lower than the energy consumptionfor all other bit patterns. We call this the optimize eraseoperation, where the controller does not erase a block whereall cells already have the lowest threshold voltage. Whiletwo chips, B-MLC8 and D-MLC32, did not perform thisoptimization, E-MLC64 had this optimization enabled. Fromthis figure, we can also see that the estimates correspond wellto the measurements. The variation between estimates andmeasurement in less than 10% for B-MLC8 and E-MLC64,while the difference was less than 20% for D-MLC32.

So far, we have examined the accuracy of FlashPower bycomparing the results with measurements from a range ofchips across various manufacturers. We now use the tool toperform some design space exploration studies.

VI. DESIGN SPACE EXPLORATION USING FlashPower

FlashPower enables us to study the effect of various flashparameters (listed in Table III) on the power consumption ofthe chip. In this section, we show the utility of FlashPower

12

Fig. 8: Read Energy Comparison for MLC flash

Fig. 9: Program Energy Comparison for MLC flash

Chip Page Spare Pages/ Blocks/ Planes/ Dies/Name Size(KB) Bytes Block Plane Die Chip

X1-M16 2 64 128 2048 2 2X2-M16 2 64 128 4096 2 1X3-M16 4 128 128 2048 2 1X4-M16 2 64 256 2048 2 1X5-M16 2 64 128 2048 4 1

TABLE V: Different 16Gb Chip Configurations

through two case studies to identify power optimal designconfigurations.

Case Study 1: Let us say a chip architect decides to designa 16Gb NAND flash chip and wants to estimate the energyconsumption for various design scenarios. Table III providesa list of all parameters that can be adjusted by the architect toidentify the relative benefits of each design scenario. Assumingthat the designer has the flexibility to change the memoryarray configuration of flash, we show how to use FlashPowerto compare some configurations. For this study, we vary themicroarchitecture parameters listed in Table III, except thefeature size, while the values for all other parameters areassumed to be the same as the B-MLC8 chip. Table V providesa list of configurations for designing a 16Gb memory chip.

Table VI shows the read, write and erase energy for differentmemory configurations while reading and programming a fast

ConfigurationRead Energy Program Erase

Fast Page Energy-Fast Energy (µJ)Programmed(µJ) Page 0 (µJ)

X1-MLC16 1.085 15.418 117.602X2-MLC16 1.629 18.297 232.577X3-MLC16 1.632 29.747 232.908X4-MLC16 1.59 18.205 227.719X5-MLC16 1.277 15.562 117.595

TABLE VI: Energy dissipation for various 16Gb ChipConfigurations.

page and erasing a block. Note that, in the case of X3-M16,read and program operations occur at 4KB granularity, whilefor other configurations, they occur at a 2KB granularity. Theblock size is 512KB for X3-M16 and X4-M16, while it is256KB for the other configurations. Table VI can be usedby the architect to estimate the energy dissipation for variousmemory configurations. From Table VI we can see that X1-M16 is the chip that dissipates the least energy for read andprogram operations (for erase operation, the energy dissipationis almost the same as X5-M16). However, the page size ofX1-M16 is just 2KB. The page size for X3-M16 is 4KB,while the energy consumption of reads is only 60% more thanX1-M16. Depending upon whether the architect is optimizingfor total energy consumption or energy dissipated per byte

13

ConfigurationRead Energy Program Erase

Fast Page Energy-Fast Energy (µJ)Programmed(µJ) Page (µJ)

All 1s 1.182 8.385 2.566/115.21875% 1s 1.023 10.144 115.22150% 1s 0.863 11.902 115.22425% 1s 0.703 13.660 115.226All 0s 0.543 15.418 115.229

TABLE VII: Energy dissipation for various bit patterns forX1-M16. If X1-M16 optimizes erase, then the energy

dissipated is 2.566µJ . Otherwise, it is 115.218µJ .

processed, they can choose one design over the other. Insystems that run on battery power and have small data accessgranularity like consumer electronics devices and wirelesssensors , where the architect is limited by maximum energydissipation of the chip, X1-M16 is better option that X3-M16.In systems where energy efficiency is of higher importancethan maximum energy dissipation, X3-M16 is a better option.Some designers may choose X5-M16 over the rest because thenumber of planes in X5-M16 is higher than the rest, whichallows for more operations to happen in parallel. Note that forX3-M16, the write and erase energy, is still ∼2X higher thanX1-M16, because their pages are larger.

Case Study 2: Let us say a chip architect decides to designa NAND flash chip that can consume less energy while storingspecific data patterns. This scenario could be motivated by anapplication behavior or due to flash aware encoding schemesthat seek to increase flash lifetime by favoring one bit patternover an another. For example, a flash aware encoding schemecan choose to write more logical 1s than logical 0s in order toreduce the probability of tunneling of cells, thereby increasingflash lifetime. FlashPower allows the architect to estimate theimpact of energy dissipation due to different bit patterns. TableVII shows the energy dissipation of X1-M16 for read, writeand erase operations for various bit patterns. These results arebased on the assumption that there is 100% bitline dischargefor logical 1s and no discharge for logical 0s during theread operation. From Table VII, we can see that for a readoperation, a page filled with 0s consume less energy thana page with 1s because bit-lines corresponding to logical 0do not discharge while bit-lines corresponding to logical 1discharges fully and hence dissipates more energy. However,in case of write operations, writing logical 1s consumes lessenergy because programming is inhibited for logical 1s, whiletunneling happens when logical 0s are written, consumingmore energy. For erase operations, if the chip supports op-timize erase feature, then the erase energy is just 2.5µJ .Otherwise, the erase energy does not vary significantly withthe data pattern since the entire P-well of the erase block isbiased to the erase voltage Vera and hence dissipate almostthe same energy.

These two case studies show the utility of FlashPowerin evaluating the design tradeoffs for NAND flash memory.Because FlashPower is parameterized, it is possible to analyzesuch what-if scenarios and identify power optimal flash chipconfigurations.

VII. RELATED WORK

CACTI is an integrated timing, area, leakage and powermodeling tool from HP Labs [9]. CACTI supports evaluatingSRAM and DRAM memory technologies. Our paper uses theunderlying infrastructure available in CACTI and extends itto model NAND flash memories. Research interest in non-volatile memories has spawned development of analyticalmodels for many memory technologies, most of which useCACTI as their base. Smullen et al. provide an analyticalmodel to characterize the behavior of Spin-Transfer TorqueRAM (STT-RAM) [25]. Dong et al. provide a similar an-alytical model for phase change memories [26]. Xu et al.extend CACTI to model Resistive RAM (RRAM) memorytechnology [27]. While analytical models form one end ofthe spectrum, characterizing memory technologies with actualmeasurements form the other end. Grupp et al. provide a char-acterization of physical properties using actual measurementsfrom flash chips [23]. Yaakobi et al. use results from actualmeasurements to characterize the reliability of flash chips [13].Boboila et al. perform a similar study, but focus mainly onwrite endurance [28]. NVSim presents analytical models forflash with validations at a datasheet level [29]. FlashPoweruses measurements from actual chips and combines it witha detailed parameterized analytical model for NAND flashmemories. NANDFlashSim is a microarchitecture level toolthat simulates only the performance characteristics of NANDflash chips [30]. NANDFlashSim can be used on top of circuit-level analytical modeling tools like FlashPower to provide aflash simulation infrastructure that models both energy andperformance.

VIII. CONCLUSIONS

In this paper, we have presented FlashPower, a detailedpower model for the two most popular variants of NANDflash namely, the Single-Level Cell (SLC) and 2-bit Multi-Level Cell (MLC) based NAND flash memory chips. Flash-Power models the energy dissipation of flash chips fromfirst principles and is highly parameterized to study a largedesign space of memory organizations. We have validatedFlashPower against measurements from 5 different chips fromdifferent manufacturers and show that our model can accu-rately estimate the energy dissipation of several real chips.

IX. ACKNOWLEDGMENTS

We thank the reviewers for their valuable comments. Thisresearch has been supported in part by NSF CAREER AwardCCF-0643925 and gifts from Google.

REFERENCES

[1] C. Dirik and B. Jacob, “The Performance of PC Solid-State Disks(SSDs) as a Function of Bandwidth, Concurrency, Device Architecture,and System Organization,” in Proceedings of the International Sympo-sium on Computer Architecture (ISCA), June 2009, pp. 279–289.

[2] N. Agrawal and et al., “Design Tradeoffs for SSD Performance,” inProceedings the USENIX Technical Conference (USENIX), June 2008.

[3] T. Kgil, D. Roberts, and T. Mudge, “Improving NAND Flash Based DiskCaches,” in Proceedings of the International Symposium on ComputerArchitecture (ISCA), June 2008, pp. 327–338.

14

[4] A.M. Caulfield and L.M. Grupp and S. Swanson, “Gordon: using flashmemory to build fast, power-efficient clusters for data-intensive applica-tions,” in Proceedings of the International Conference on ArchitecturalSupport for Programming Languages and Operating Systems (ASPLOS),March 2009, pp. 217–228.

[5] J. Kim and et al, “A PRAM and NAND flash hybrid architecturefor high-performance embedded storage subsystems,” in EMSOFT ’08:Proceedings of the 8th ACM international conference on Embeddedsoftware. New York, NY, USA: ACM, 2008, pp. 31–40.

[6] E. Gal and S. Toledo, “Algorithms and Data Structures for FlashMemories,” ACM Computing Surveys, vol. 37, no. 2, pp. 138–163, June2005.

[7] A. Gupta, Y. Kim, and B. Urgaonkar, “DFTL: a flash translation layeremploying demand-based selective caching of page-level address map-pings,” in ASPLOS ’09: Proceeding of the 14th international conferenceon Architectural support for programming languages and operatingsystems, 2009, pp. 229–240.

[8] C. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. Stan,“Relaxing non-volatility for fast and energy-efficient stt-ram caches,”in High Performance Computer Architecture (HPCA), 2011 IEEE 17thInternational Symposium on. IEEE, 2011, pp. 50–61.

[9] S. Wilton and N. Jouppi, “Cacti: an enhanced cache access and cycletime model,” Solid-State Circuits, IEEE Journal of, vol. 31, no. 5, pp.677–688, May 1996.

[10] J. Brewer and M. Gill, Eds., Nonvolatile Memory Technologies withEmphasis on Flash. IEEE Press, 2008.

[11] K. Suh and et al, “A 3.3V 32 Mb nand flash memory with incrementalstep pulse programming scheme,” Solid-State Circuits, IEEE Journal of,vol. 30, no. 11, pp. 1149–1156, Nov 1995.

[12] T. Tanaka and et al, “A quick intelligent page-programming architectureand a shielded bitline sensing method for 3V-only nand flash memory,”Solid-State Circuits, IEEE Journal of, vol. 29, no. 11, pp. 1366–1373,Nov 1994.

[13] E. Yaakobi, J. Ma, L. Grupp, P. Siegel, S. Swanson, and J. Wolf,“Error characterization and coding schemes for flash memories,” inGLOBECOM Workshops (GC Wkshps), 2010 IEEE, dec. 2010, pp. 1856–1860.

[14] K. Takeuchi, T. Tanaka, and T. Tanzawa, “A multipage cell architecturefor high-speed programming multilevel NAND flash memories,” Solid-State Circuits, IEEE Journal of, vol. 33, no. 8, pp. 1228 –1238, aug1998.

[15] “Process Integration and Device Structures, ITRS 2011 Edition,” http://www.itrs.net/Links/2011ITRS/2011Tables/PIDS 2011Tables.xlsx.

[16] J. Lee and et al., “A 90-nm CMOS 1.8-V 2-Gb NAND flash memoryfor mass storage applications,” Solid-State Circuits, IEEE Journal of,vol. 38, no. 11, pp. 1934–1942, Nov 2003.

[17] D. T. Wang, “Modern dram memory systems: Performance analysisand a high performance, power-constrained dram scheduling algorithm,”Ph.D. dissertation, University of Maryland, College Park, 2005.

[18] M. Lenzlinger and E. Snow, “Fowler-Nordheim tunneling into thermallygrown SiO2,” Electron Devices, IEEE Transactions on, vol. 15, no. 9,pp. 686–686, Sep 1968.

[19] “Samsung corporation. K9XXG08XXM flash memory specification,”K9XXG08XXM Datasheet.

[20] K. Sakui, “Intel Corporation/Tohoku University,” May 2009, PrivateCorrespondence.

[21] K. Ishida and et al, “A 1.8V 30nJ adaptive program-voltage (20V)generator for 3D-integrated NAND flash SSD,” in Solid-State CircuitsConference - Digest of Technical Papers, 2009. ISSCC 2009. IEEEInternational, Feb. 2009, pp. 238–239,239a.

[22] T. Bunker, M. Wei, and S. Swanson, “Ming II: A Flexible Platformfor NAND Flash-based Research,” Department of Computer Science &Engineering, University of California, San Diego, Tech. Rep. CS2012-0978, May 2012.

[23] L. Grupp and et al, “Characterizing flash memory: Anomalies, observa-tions, and applications,” in Microarchitecture, 2009. MICRO-42. 200942nd IEEE/ACM International Symposium on, Nov. 2009.

[24] H. Schmidt and et al., “Tid and see tests of an advanced 8 gbit nand-flash memory,” in Radiation Effects Data Workshop, 2008 IEEE, july2008, pp. 38 –41.

[25] C. W. Smullen, IV, A. Nigam, S. Gurumurthi, and M. R. Stan, “Thestetsims stt-ram simulation and modeling system,” in Proceedings ofthe International Conference on Computer-Aided Design, ser. ICCAD’11. Piscataway, NJ, USA: IEEE Press, 2011, pp. 318–325. [Online].Available: http://dl.acm.org/citation.cfm?id=2132325.2132408

[26] X. Dong, N. P. Jouppi, and Y. Xie, “PCRAMsim: system-levelperformance, energy, and area modeling for Phase-Change RAM,” in

Proceedings of the 2009 International Conference on Computer-AidedDesign, ser. ICCAD ’09. New York, NY, USA: ACM, 2009, pp. 269–275. [Online]. Available: http://doi.acm.org/10.1145/1687399.1687449

[27] C. Xu, X. Dong, N. Jouppi, and Y. Xie, “Design implications ofmemristor-based rram cross-point structures,” in Design, AutomationTest in Europe Conference Exhibition (DATE), 2011, march 2011, pp. 1–6.

[28] S. Boboila and P. Desnoyers, “Write Endurance in Flash Drives: Mea-surements and Analysis,” in Proceedings of the USENIX Conference onFile and Storage Technologies (FAST), February 2010.

[29] “NVSim,” http://www.rioshering.com/nvsimwiki/index.php?title=MainPage.

[30] M. Jung, E. Wilson, D. Donofrio, J. Shalf, and M. Kandemir, “NAND-FlashSim: Intrinsic latency variation aware NAND flash memory systemmodeling and simulation at microarchitecture level,” in Mass StorageSystems and Technologies (MSST), 2012 IEEE 28th Symposium on, April2012.

PLACEPHOTOHERE

Vidyabhushan Mohan received his B.E degreefrom the College of Engineering, Guindy, AnnaUniversity in 2006 and his M.S degree from Uni-versity of Virginia, Charlottesville in 2010, both inComputer Science and Engineering. He is currentlypursuing his Ph.D degree in Computer Science inUniversity of Virginia, Charlottesville. His researchinterests include solid state storage, non-volatilememories and computer architecture.

Trevor Bunker received the B.S. degree in com-puter engineering from Brigham Young Universityin 2009. He is currently pursuing the Ph.D. withthe University of California, San Diego. His currentresearch interests include the hardware/software co-design of scalable systems for solving large-scale,data-intensive applications.

Laura Grupp’s research interests bridge the dividebetween hardware and software, with a focus onflash-based storage systems. She characterizes flashchips to find unpublished behaviors and trends, andthen designs ways to exploit these characteristics toaddress application-specific requirements for perfor-mance, energy efficiency and reliability.

Sudhanva Gurumurthi is an Associate Professor inthe Computer Science Department at the Universityof Virginia. His research interests include architec-ture reliability, storage systems, non-volatile mem-ory, and data center architectures. He received hisBE degree from the College of Engineering Guindy,Chennai, India in 2000 and his PhD from PennState in 2005, both in the field of Computer Scienceand Engineering. He was the Associate Editor-in-Chief of Computer Architecture Letters and cur-rently serves as an Associate Editor. Sudhanva is a

recipient of the NSF CAREER Award and several research awards from NSF,Google, Intel, and HP. He is a Senior Member of the IEEE and the ACM.

15

Mircea R. Stan is teaching and doing researchin the areas of high-performance low-power VLSI,temperature-aware circuits and architecture, embed-ded systems, and nanoelectronics in the ECE Depart-ment at the University of Virginia where he is now aprofessor. He has a Diploma from the ”Politehnica”University in Bucharest, Romania, and an MS andPhD from UMass Amherst. He has received the NSFCAREER award in 1997 and was a co-author onbest paper awards at ISQED 2008, GLSVLSI 2006,ISCA 2003 and SHAMAN 2002. He is an Associate

Editor for the IEEE Transactions on Nano and has been an AE for theIEEE Transactions on Circuits and Systems Systems I in 2004- 2008 andfor the IEEE Transactions on VLSI Systems in 2001-2003. He is currently aDistinguished Lecturer for the IEEE Circuits and Systems (CAS) Society, wasalso for 2004-2005, and for the IEEE Solid-State Circuits Society (SSCS) in2007-2008. Prof. Stan is a senior member of the IEEE, a member of ACM,and also of Eta Kappa Nu, Phi Kappa Phi and Sigma Xi.

Steven Swanson is an associate professor in theDepartment of Computer Science and Engineeringat the University of California, San Diego and thedirector of the Non-volatile Systems Laboratory.His research interests include the systems, architec-ture, security, and reliability issues surrounding non-volatile, solid-state memories. In previous lives hehas worked on scalable dataflow architectures, ubiq-uitous computing, and simultaneous multithreading.He received his PhD from the University of Wash-ington in 2006.

modeling power consumption of nand flash memories …gurumurthi/papers/tcad13.pdf · 1 modeling...

Documents