
[IEEE 2012 12th Annual Non-Volatile Memory Technology Symposium (NVMTS), Singapore, Singapore, 31 October - 2 November 2012] 2012 12th Annual Non-Volatile Memory Technology Symposium Proceedings

978-1-4673-2848-7/12/$31.00 ©2012 IEEE

WBR - Word- and Block-level Hard Error Repair for Memories

Patryk Skoncej
IHP, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany
[email protected]

Brandenburg University of Technology, Konrad-Wachsmann-Allee 1, 03046 Cottbus, Germany
[email protected]

Abstract— Many existing semiconductor memories face major problems concerning yield, reliability, testability, and manufacturability as the feature size decreases. In order to overcome these issues, new memory technologies are being developed. The greatest attention is paid to solid-state, non-volatile memories, which are expected to meet the challenging demands of upcoming low-cost, low-power, and high-performance systems. However, despite all the advantages they offer, emerging non-volatile memories introduce challenges which have to be addressed and solved. One of the biggest obstacles preventing their wider adoption is permanent faults, which can occur right after production or during memory operation. To mitigate this problem, a novel memory repair approach called WBR has been developed. It is based on replacing defective data blocks with spare ones, for every memory block separately. The repair procedures implemented in the WBR can be applied at the memory word and block levels, offering different repair speeds and correction capabilities. In comparison to recent solutions such as SAFER and ECP, WBR provides better memory lifetime improvement for small memory word sizes and achieves comparable results for wider memory words. In contrast to the state-of-the-art techniques, WBR can be applied to word sizes as small as 16 bits while imposing less than 12.5% additional bit overhead. The WBR can be implemented purely in hardware, e.g., in the form of a memory controller. It is complementary to existing wear-leveling techniques, and it can be used in the recently proposed PAYG framework.

Keywords—non-volatile memory; emerging memories; redundancy repair; hard error repair; memory repair; built-in self-repair (BISR)

I. INTRODUCTION

Demands for low-power, high-performance mobile devices set the direction of progress for embedded systems and Systems-on-Chips (SoCs). Due to the rising requirements of today's memory-hungry applications, the area occupied by embedded memories in SoCs is constantly growing. Consequently, the yield and reliability of SoCs depend heavily on the yield and reliability of embedded memory cores [1][2][3]. Although commonly used types of embedded memories have been exploited for years, their future existence is questionable. Aggressive design rules and smaller feature sizes introduce new challenges that are becoming harder to meet [4][5][6]. As a result, many research groups are turning their focus toward new memory technologies. Emerging solid-state non-volatile memories (NVMs) such as phase-change memories (PCMs), magnetoresistive RAMs, resistive RAMs, and ferroelectric RAMs are expected to meet the high demands of upcoming systems. However, before they are adopted, all concerns about their reliability must be dispelled. Problems related to write endurance¹, the early-maturity state of the technology, and the lack of sufficient reliability data [8] must be solved.

II. BACKGROUND AND MOTIVATION

While some difficulties associated with emerging NVMs can be resolved only on the technology and circuit levels, problems concerning reliability can also be managed on the system level. There are two common system-level solutions which are frequently utilized for managing faults in semiconductor memories. The first exploits error correcting codes (ECCs) and the second is based on 1-D or 2-D redundancy repair (usually performed after manufacturing). While these techniques in their present form are sufficient for existing technologies, for emerging memory systems their complexity, area, and time overheads may be unacceptable. Moreover, they do not take into account special properties of emerging NVMs such as immunity to radiation-induced errors [9] and vulnerability to endurance- and retention-related faults. Additionally, emerging NVMs typically have asymmetric access, where write operations usually consume more energy and are slower than read operations. There is a need for universal and scalable system-level solutions which target reliability issues in emerging and existing NVMs. Moreover, these solutions have to be applicable to small- and large-capacity memory systems without requiring changes in the memory design. They have to be OS-independent and must have a minimal impact on the system. Only by making them as transparent to the user as possible can they have a good chance of being applied in real embedded systems and SoCs.

III. RECENT WORK

Recently, many repair mechanisms have been proposed for fault management in emerging NVMs. Ipek et al. presented a hard error repair technique designed for PCMs called Dynamically Replicated Memory (DRM) [10]. In the DRM, for every byte in the memory one additional bit is used to indicate whether the corresponding byte is corrupted. Next, with the help of the OS, two corrupted pages are paired into a single uncorrupted physical page, provided that the corrupted bytes in the two faulty pages do not occupy the same positions.

¹ The number of write operations which can be performed on a single cell before it becomes damaged.

The work proposed in [12], called FREE-P, provides a repair mechanism for soft and hard errors in emerging NVMs. FREE-P uses a coding technique based on BCH codes and a block-based redundancy repair. Different implementations of the BCH codes are used to provide repair mechanisms which vary in the number of corrected errors and the correction speed. If the number of errors in a memory block exceeds the number of correctable errors provided by the strongest BCH implementation, block remapping is performed. FREE-P uses the corrupted memory block itself to store multiple instances of a pointer to the block used as a replacement. In FREE-P, the role of the OS is to allocate pages for remapping failed blocks and to keep track of free spots in the allocated pages.

In [7] an efficient hard-error resilient architecture is presented. The work is based on the observation that only a few memory rows require stronger hard-error correction than the rest. Based on this observation, Qureshi proposes a structure, called PAYG, for allocating Error Correction Pointers (ECPs) [11] in proportion to the number of hard faults in a memory line. PAYG is a generalized framework that can be implemented with any error correction scheme.

All presented solutions aim at enabling emerging NVMs to be used as main memory. They favor large-capacity memory systems over memory systems in general. In addition, most of the presented techniques require support from the OS, which diminishes their applicability. For small-capacity or embedded memories, the overhead imposed by the proposed techniques could be unacceptable.

Next, hard error repair mechanisms comparable to the solutions proposed in this paper are presented.

A. ECP

The technique based on Error Correcting Pointers (ECPs) [11] employs bit redundancy repair. In the proposed scheme, every row in the memory is equipped with ECPs. Each ECP consists of the bits required to store the address and the correct value of a corrupted bit in a memory row. Worn-out bits are detected by the write-verify² scheme and replaced with error-free bits in the assigned ECPs. Furthermore, ECPs with higher indexes have precedence over ECPs with lower indexes. By doing so, the repair technique based on ECPs is able to correct errors in its own structures. Although the proposed repair mechanism can be implemented purely in hardware, it requires changes in the memory structure, which limits its applicability.

² An additional memory read operation performed to the same address as the preceding memory write operation.

B. SAFER

The Stuck-At-Fault Error Recovery for Memories (SAFER) [13] technique reuses memory cells with stuck-at faults for storing data. When new errors occur in a memory word, SAFER dynamically partitions the input data into separate groups to ensure that each group contains at most one stuck-at fault. Next, based on the input data and the stuck-at values of the corrupted bits, the data in each group is stored in original or inverted form. Repartitioning is performed every time a new error occurs. The SAFER technique does not save information about error positions or their stuck-at values. Therefore, when data is stored to a corrupted memory word, an additional write operation is needed every time the input data does not match the stuck-at value. Similarly to the ECP-based technique, SAFER requires changes in the memory design. Moreover, it relies heavily on the stuck-at model, which may be too optimistic when it comes to a real implementation.

IV. MEMORY STRUCTURE

In order to explain the WBR in detail, first a typical RAM structure will be described and the definition of the I/O block will be provided. A common RAM consists of (Fig. 1):

• a memory matrix built from memory cells connected by word- and bit-lines,
• an address row decoder which uses n address signals for selecting a single word-line,
• an address column decoder which uses m address signals for selecting multiple (i) bit-lines,
• column input/output circuitry consisting of i read and write sense amplifiers, and
• control logic for managing all mentioned components during write/read operations (not depicted in Fig. 1).

The presented memory structure consists of column blocks (each consisting of 2^m bit-lines) and row blocks (each consisting of 2^k word-lines). Throughout the rest of this paper, I/O blocks are defined as blocks of data located at the intersections of column and row blocks. A corrupted I/O block is defined as an I/O block in which at least one memory cell is defective. Apart from the area used for storing user data, an additional area can often be found in embedded memories. This redundant area is typically employed for storing parity bits or used by redundancy repair techniques. In the WBR, this spare area (i.e., spare I/O blocks) is used for providing repair mechanisms. To simplify the description of the WBR mechanism, a simple memory model is used throughout the rest of this paper (Fig. 2). In the simple memory model, each I/O block consists of 2^k x 1 memory cells and each memory block consists of 2^k (u+s)-bit words, where u is the number of user (data) bits and s is the number of spare bits. As a result, in each memory block there are (u+s) I/O blocks.

Figure 1 Basic RAM Structure


V. WORD- AND BLOCK-LEVEL REPAIR

The WBR targets repairing manufacturing- and endurance-related faults in embedded non-volatile RAMs. The developed Word-level Repair (WR) and Block-level Repair (BR) mechanisms are based on exchanging corrupted I/O blocks with spare ones, for every memory block separately.

For each memory block the WBR utilizes t ⌈log₂ u⌉-bit Error Position Pointers (EPPs) for storing the positions of corrupted I/O blocks, where t is the number of I/O blocks which can be exchanged and u is the number of user (data) bits in a single word. In the WBR, corrupted I/O blocks are replaced with I/O blocks located in Spare Positions (SPs). Depending on the repair mechanism, BR or WR, each SP consists of one or two I/O blocks, respectively. The index of the EPP defines which SP will be used for replacement; i.e., the first EPP specifies that the first SP will be used for replacement, the second EPP specifies the second SP, etc. Moreover, for each EPP an extra bit is needed to indicate whether the corresponding EPP is active. Additionally, the WBR utilizes a t-bit Corrupted Spare Positions (CSP) vector for marking SPs with corrupted I/O blocks. An SP marked as corrupted indicates that the related EPP is considered used.

In order to detect stuck-at faults, the WBR technique uses the write-verify scheme. When, during normal memory operation, the verify procedure reports a new error in a word, the position of the corrupted I/O block is stored in an unused EPP. Next, the assigned SP in the corrupted word (in the case of WR) or in all words in the memory block (in the case of BR) is updated with the value from the corrupted I/O block (the I/O block pointed to by the EPP). After the repair procedure, write accesses to memory words located in a corrupted block result in updating the corresponding SPs. During a read procedure, values from corrupted I/O blocks (pointed to by EPPs) are automatically replaced with values from the related SPs.
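The metadata bookkeeping described above can be sketched in software. The class below uses hypothetical names (`BlockMetadata`, `record_error`) and is only an illustration of the EPP/CSP structures under the stated assumptions, not the paper's hardware implementation.

```python
from math import ceil, log2

class BlockMetadata:
    """Per-memory-block WBR metadata: t EPPs (each with an activity bit)
    and a t-bit CSP vector. Names and layout are illustrative only."""
    def __init__(self, t, u):
        self.t = t                      # number of exchangeable I/O blocks
        self.epp_bits = ceil(log2(u))   # width of each Error Position Pointer
        self.epp = [None] * t           # EPP value = position of a corrupted I/O block
        self.epp_active = [False] * t   # one activity bit per EPP
        self.csp = [False] * t          # Corrupted Spare Positions vector

    def record_error(self, position):
        """Store a newly detected corrupted I/O block position in the first
        unused EPP (an EPP whose SP is marked corrupted counts as used).
        Returns the EPP index, or None when no free EPP is left."""
        for i in range(self.t):
            if not self.epp_active[i] and not self.csp[i]:
                self.epp[i] = position
                self.epp_active[i] = True  # storing a position activates the EPP
                return i
        return None

meta = BlockMetadata(t=2, u=8)     # 2 EPPs for 8-bit user words -> 3-bit pointers
assert meta.epp_bits == 3
assert meta.record_error(3) == 0   # first new error goes to the first EPP
assert meta.record_error(5) == 1
assert meta.record_error(6) is None  # no free EPPs: error cannot be handled
```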

A. Word-level Repair

Fig. 3 depicts a simple example of the WR procedure. Fig. 3a shows the initial state of a memory block consisting of four 8-bit memory words. Two SPs are assigned to each word. Each SP consists of two bits: one bit stores the replaced value, while the second bit indicates whether the SP is active. For simplicity, in Fig. 3a all memory words are filled with ones and all meta-bits are set to 0.

After a write operation to the 1st word in the memory block (Fig. 3b), the verify procedure is automatically started to check the validity of the stored data. If errors are detected, their positions are compared with the positions stored in the EPPs to determine whether the detected errors are new or have already been managed. In the presented example there are no previously stored positions, thus the error which occurred at the 3rd position in the first word (Fig. 3b) is assumed to be a new error and the 3rd I/O block is assumed corrupted. Next, the position of the corrupted I/O block is stored in the first unused EPP. Storing a position in the EPP automatically activates that EPP. Afterwards, a second write operation (from now on called the update operation) is performed to the corrupted word to store the value of the input data bit pointed to by the 1st EPP in the 1st SP. Storing a value in the SP automatically activates that SP. If no new errors occur during the update operation, the repair procedure ends.

In the WR, only one of several possible new errors occurring in a word is managed at a time. After the first new error is repaired, the remaining new errors will be detected during the verify procedure performed after the update operation. Next, a single new error will be selected, an unused EPP will be assigned, the update operation will be performed, and so on. The repair procedure ends when all positions of new errors are stored in EPPs or when no free EPPs are left.

After the 1st word is repaired, whenever a write operation is performed to any word in the memory block, the 1st SP in that word will automatically be activated and filled with the value of the input data bit pointed to by the 1st EPP. Note that although there are no errors in the 2nd word in Fig. 3c, the 1st SP in that word is activated and filled with the appropriate value.

During the read operation, values from active SPs are automatically placed at the positions pointed to by active EPPs. If one reads the 1st word from the memory block (Fig. 3b), the value from the 1st SP ("1") is used instead of the value of the corrupted bit located in the 3rd I/O block. Furthermore, if the 3rd word is read from the memory block (Fig. 3b), the original value of the bit located in the 3rd I/O block is driven to the memory output, because the 1st SP is not active (although the 1st EPP is active). This simple example illustrates the necessity of using two bits for every SP in the WR. If SPs consisted of replacement bits only, there would be no way to determine which SPs are active and which are not. Since not all words have the replacement bit filled, incorrect data would be read from the memory block.³

Once the SPs are activated, replacement cells start to wear out in the same way as the memory cells storing user data. Therefore, there is a possibility that errors will occur in the SPs. An example concerning an error in an SP is depicted in Fig. 3d. After writing to the 3rd word in the memory block, the verify procedure detects a new error in the 1st SP. Next, during the repair phase, the 2nd EPP is filled with the content of the 1st EPP and the 1st SP is marked as corrupted in the CSP. Afterwards, a write operation is performed to the corrupted word to store the value of the input data bit pointed to by the 1st and 2nd EPPs in the 1st and 2nd SPs. After the repair procedure, reading the 3rd word from the memory block (Fig. 3d) results in replacing the value of the bit at the 3rd position with the value from the 2nd SP. In the WR, similarly to the ECP technique, SPs with higher indexes have precedence over those with lower indexes. Hence, the value from the 2nd SP will override the value from the 1st SP. When the 2nd SP is not active, the value from the 1st SP will be used for replacement during the read operation. When all EPPs are used, new errors occurring in I/O blocks different from those already managed cannot be handled.

³ There is a possibility of faults located in the enable part of SPs. As a result, corrupted SPs would be mistakenly assumed to be activated. Therefore, it is advisable to perform initial testing/test writes to manage such faults.

Figure 2 Simple Memory Model
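The read-path replacement with SP precedence can be sketched as follows. The function `read_word` and its argument layout are hypothetical; the loop simply lets higher-index SPs overwrite lower-index ones, mirroring the precedence rule described above.

```python
def read_word(raw_bits, epps, sps):
    """Reconstruct a word on read: for each active EPP whose SP is also
    active, the SP value replaces the bit at the pointed position. Higher-index
    SPs take precedence, so iterating from lowest to highest index and
    overwriting gives the correct result. Illustrative sketch only.
    epps: list of (active, position); sps: list of (active, value)."""
    bits = list(raw_bits)
    for (epp_active, pos), (sp_active, val) in zip(epps, sps):
        if epp_active and sp_active:
            bits[pos] = val  # later (higher-index) SPs override earlier ones
    return bits

# 8-bit word whose bit at position 3 is corrupted; the 1st EPP points at it
# and the 1st SP holds the correct value 1.
word = [1, 1, 1, 0, 1, 1, 1, 1]
assert read_word(word, [(True, 3), (False, 0)], [(True, 1), (False, 0)])[3] == 1

# If the 1st SP later fails, the 2nd EPP/SP pair takes over and overrides it.
assert read_word(word, [(True, 3), (True, 3)], [(True, 0), (True, 1)])[3] == 1
```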

B. Block-level Repair

The WR uses two bits for a single SP because not all words in the memory block have appropriate values stored in their SPs. In order to reduce the number of meta-bits, the appropriate SP in all words in the memory block would have to be filled with correct values. Based on this observation, the BR mechanism was developed. The BR utilizes the spare area more effectively at the price of a longer repair procedure. In the BR, each SP consists of a single bit only. Moreover, after the occurrence of a new error in a word, the appropriate SP in all words in the memory block is updated.

During the BR procedure no new errors can be repaired.⁴ Therefore, when a new error is detected during the repair process, its position and correct bit value (read before the error occurrence) are stored in a register. Next, this information can be used to perform a write operation to the corrupted word and thus initiate another BR.

⁴ This disadvantage can be solved by using more advanced repair architectures called WBR+ and WBR++, which are currently being developed and evaluated.
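The BR update pass can be sketched as a loop over all words in the block. `block_repair_update` is a hypothetical helper; it illustrates why the BR needs no per-word activity bit (every word's SP is filled) but pays with a longer repair procedure (one update per word).

```python
def block_repair_update(block, sps, epp_position):
    """Block-level Repair update: after a new error at I/O block
    `epp_position`, copy that bit into the single-bit BR spare position of
    every word in the memory block. Because all SPs are filled, no activity
    bit is needed; the cost is one update per word. Illustrative sketch."""
    for word_idx, word in enumerate(block):
        sps[word_idx] = word[epp_position]  # one write per word -> longer repair
    return sps

block = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]  # 3 words of 4 bits
sps = [None] * len(block)
block_repair_update(block, sps, 2)   # I/O block 2 reported corrupted
assert sps == [1, 1, 0]              # every word's SP now mirrors its bit 2
```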

C. Combined Approach

The WR provides an instant repair mechanism but requires two bits for every SP. The BR requires only one bit per SP but incurs a longer repair time. Since the WR and BR are based on similar principles, they can be combined into a consistent system, called WBR. In the WBR, the repair time, effectiveness, imposed meta-bit overhead, and impact on the memory behavior can be tuned to the actual needs of a target application. The combination of the WR and BR is able to exchange wr + br = t corrupted I/O blocks. For the whole memory block the WBR requires wr WR EPPs, br BR EPPs, and a (wr+br)-bit CSP vector. For each word in the memory block, wr WR SPs and br BR SPs are required (Fig. 4).

In the WBR, new errors in the memory block are handled by the WR. When all EPPs belonging to the WR are used, subsequent errors are managed by the BR. This implies that SPs belonging to the BR have precedence over SPs belonging to the WR. Moreover, errors occurring in the SPs belonging to the BR can be managed only by the BR procedure.
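Under the structures described above, the meta-bit cost of a WBR configuration can be estimated. The formula below is an interpretation of the text (each EPP needs ⌈log₂ u⌉ position bits plus one activity bit, the CSP needs wr+br bits, a WR SP costs two bits per word and a BR SP one), not a formula given in the paper.

```python
from math import ceil, log2

def wbr_meta_bits(words, u, wr, br):
    """Estimated meta-bit count for one WBR sub-block holding `words` u-bit
    words, with wr word-level and br block-level exchangeable I/O blocks.
    Interpretation: EPPs cost ceil(log2(u)) position bits plus one activity
    bit each, the CSP vector costs wr+br bits, and every word carries
    2 bits per WR SP and 1 bit per BR SP."""
    t = wr + br
    block_level = t * (ceil(log2(u)) + 1) + t  # EPPs + activity bits + CSP
    per_word = words * (2 * wr + 1 * br)       # spare positions in every word
    return block_level + per_word

# WBR_512_0_4 applied to 64-bit words: a 512-bit sub-block holds 8 words.
meta = wbr_meta_bits(words=8, u=64, wr=0, br=4)
assert meta == 64
assert meta / 512 == 0.125   # 12.5% overhead
```

For WBR_512_0_4 on 64-bit words this estimate reproduces the 12.5% overhead bound used later in the evaluation, which suggests the interpretation is at least consistent with the reported configurations.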

VI. EVALUATION

The WBR approach was compared with SAFER, ECP, and an ECC-based technique with respect to meta-bit overhead, average number of corrected faults, and average memory lifetime improvement. For each technique, an appropriate model together with different test configurations was developed. Each configuration consisted of one repair mechanism implemented for a 2048-bit memory block and a selected word size (16 to 512 bits). In order to provide the most accurate comparison, a methodology similar to those presented in [11] and [13] was used. Therefore, in the evaluation process the following assumptions were made:

• The lifetime of each memory cell follows the normal distribution with a mean lifetime of 10^8 write operations and a standard deviation of 10^7. Furthermore, no correlation between neighboring cells is assumed.⁵
• A perfect wear-leveling mechanism is implemented for the memory. That is, write operations are evenly distributed among all memory blocks. As a result, for evaluation purposes it is sufficient to implement all techniques for a single memory block only.
• Similarly as in [13], a 2048-bit memory block is considered to be a last-level cache line. A single write operation concerns the whole 2048-bit block.
• Each configuration is implemented in the memory controller. That is, the repair techniques have access only to the memory input/output signals. They cannot access or modify the internal structure of the memory.
• There is a 50% probability (0.5 toggle rate) that a user (data) bit changes its state during a write operation.
• Configurations able to correct a maximum number of faults while imposing less than 12.5% meta-bit overhead are selected for evaluation.

⁵ Note that the WBR technique can correct multiple errors in a single I/O block. Thus, it seems that the WBR can provide better results when errors are correlated. This issue will be addressed in future work.

Figure 3 Example of Word-level Repair: (a) initial state, (b)-(d) subsequent repair steps

Figure 4 Spare Area Utilization and Meta-bits Required by WBR

In each simulation, an array required for storing data bits together with the necessary meta-bits was allocated. For each array element, a random write endurance value drawn from the normal distribution was set and an appropriate toggle rate value was assigned. A toggle rate of 0.5 was assigned to meta-bits in the ECC-, ECP-, and WBR-based configurations.⁶ For the SAFER-based configuration, a toggle rate of 1.0 was assigned to each corrupted group due to the required additional write operation. In each simulation loop, first the array element with the minimum write endurance value (with respect to the element's toggle rate) was selected. Next, the selected element's write endurance value was used to decrease the write endurance values of all array elements. Then, array elements whose write endurance value reached 0 were repaired, and the appropriate toggle rate values were updated. A simulation ended when the repair procedure could not manage new faults. For each configuration, 50000 simulations were performed and the average result was reported.
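A heavily simplified version of this Monte Carlo loop can be sketched as follows. The per-cell toggle-rate bookkeeping and the repair-specific rules are abstracted away (a block simply survives until more than `repairs` cells have failed), so this is an illustration of the methodology rather than the authors' simulator.

```python
import random

def simulate_block_lifetime(cells, mean, sd, repairs, seed=None):
    """Draw per-cell write endurances from N(mean, sd) and return the number
    of writes the block survives when up to `repairs` failed cells can be
    replaced: with t repairs the block fails at the (t+1)-th cell failure.
    Simplified model; toggle rates and repair structure wear are ignored."""
    rng = random.Random(seed)
    endurances = sorted(max(0.0, rng.gauss(mean, sd)) for _ in range(cells))
    return endurances[min(repairs, cells - 1)]

no_repair = simulate_block_lifetime(2048, 1e8, 1e7, repairs=0, seed=42)
with_repair = simulate_block_lifetime(2048, 1e8, 1e7, repairs=4, seed=42)
assert with_repair >= no_repair  # spare capacity never shortens the lifetime
```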

For evaluating the ECC-based repair, the Hamming bound was used to determine the minimum number of parity bits required to correct t errors.⁷ Throughout the rest of this paper, configurations based on ideal coding are referred to as IdealECCt. Configurations employing ECPs are denoted by ECPt, where t defines the number of correctable errors per memory word. Moreover, SAFERg denotes the SAFER configuration capable of repartitioning a memory word into g groups. SAFERg can correct from ⌈log₂ u⌉ + 1 to g errors per memory word.
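The Hamming-bound computation can be sketched in a few lines. `min_parity_bits` is a hypothetical helper that finds the smallest r with 2^r ≥ Σ_{i=0}^{t} C(u+r, i), i.e. the sphere-packing bound for a code with u data bits and r parity bits.

```python
from math import comb

def min_parity_bits(u, t):
    """Smallest number of parity bits r such that a (u+r)-bit codeword can,
    by the Hamming (sphere-packing) bound, distinguish all error patterns of
    weight <= t: 2**r >= sum_{i=0}^{t} C(u+r, i). As noted in the text, the
    bound may correspond to a code that does not actually exist."""
    r = 1
    while 2 ** r < sum(comb(u + r, i) for i in range(t + 1)):
        r += 1
    return r

assert min_parity_bits(64, 1) == 7  # 7 check bits for SEC over 64 data bits
assert min_parity_bits(16, 1) == 5
```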

The meta-bit overhead imposed by the WBR and its repair capabilities strongly depend on the size of the memory block. Thus, in the evaluation process, the 2048-bit block was divided into sub-blocks for which the WBR was implemented. The best results from configurations concerning different sub-block sizes and different repair capabilities are reported in this paper.

⁶ For ECC-based configurations, the toggle rate was assigned to meta-bits from the beginning; for ECP- and WBR-based configurations, once the replacement bits were assigned.

⁷ The minimum number of parity bits calculated with the Hamming bound may correspond to ECCs which do not exist or which require more parity bits.

Further, WBR_b_wr_br denotes a WBR implemented for a b-bit sub-block capable of exchanging wr I/O blocks with the WR and br I/O blocks with the BR. Because the BR approach provides a larger number of exchangeable I/O blocks than the WR for comparable meta-bit requirements, configurations with the second-best results are also reported. If multiple configurations shared similar results, the configuration with the higher number of I/O blocks exchangeable with the WR was reported.

VII. RESULTS

A. Meta-bit Overhead

In Fig. 5 the meta-bit overhead is presented for each configuration. The WBR incurs a meta-bit overhead close to the imposed restriction (12.5%). This is caused by the flexibility of the WBR in terms of the required number of meta-bits with respect to the repair capabilities. For example, the WBR implemented for 512-bit words in a 2048-bit memory block offers 130 different configurations, while the ECC- and ECP-based repair and SAFER offer only 9, 6, and 5 configurations, respectively. Moreover, the WBR can provide a repair mechanism even for 16-bit and 32-bit words.

B. Average Relative Lifetime Improvement

The average relative lifetime improvement determines the effectiveness of a repair technique. In this paper, similarly to [13], the relative lifetime improvement is presented as a function of the standard deviation. It is calculated as (L − F) · T / σ, where L is the number of writes to the memory block achieved with the help of the repair mechanism, F is the number of writes until the first fail in the memory block occurs, T is the toggle rate of the data bits, and σ is the standard deviation used in the evaluation. The relative lifetime improvement was calculated after each simulation. Then, after the evaluation of a single configuration (50000 simulations), the average relative lifetime improvement was calculated and reported. Although the F value changed in each simulation, on average 131114590.5 write operations were performed until the first fail in the memory block occurred.

In Fig. 6, average relative lifetime improvement values are presented for the different configurations. For word sizes up to 128 bits, the WBR outperforms the state-of-the-art solutions and the IdealECC technique. For memory word sizes greater than 128 bits, the effectiveness of the WBR lies between the ECP-based repair and SAFER. Although the ECP-based repair and the WBR rest on similar principles, the WBR achieves better results. This can to some extent be explained by the so-called "birthday problem". For example, a 2048-bit block contains 32 64-bit words. The probability that the 2nd error does not occur in the same word as the 1st error is (31/32) * 100% = 96.875%. The probability that the 7th error does not occur in an already corrupted word is just 49.4%.8 Thus, a small number of words will suffer from a large number of errors while the rest of the words will remain uncorrupted or will have only a small number of errors. While ECP1 and IdealECC1 can correct a single error in each individual 64-bit word, the WBR_512_0_4 can correct at least four errors anywhere in the 512-bit sub-block, no matter where they occur. Therefore, with the same meta-bit overhead (12.5%) as ECP1 for 64-bit words, the WBR_512_0_4 applied to 64-bit words achieves ~26% better average relative lifetime improvement.

8 Note the results for ECP1 and IdealECC1 implemented for 64-bit words in Fig. 7.

Figure 5 Meta-bit Overhead
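The birthday-problem figures quoted above can be reproduced with a short script (a sketch; the 32-word count comes from the paper's 2048-bit block / 64-bit word example, and errors are assumed to land uniformly and independently):

```python
WORDS = 32  # 64-bit words in a 2048-bit block

def p_all_distinct(n_errors, words=WORDS):
    """Probability that n_errors bit errors all land in distinct words,
    i.e. that the n-th error does not hit an already corrupted word,
    assuming uniform, independent error locations."""
    p = 1.0
    for i in range(1, n_errors):
        p *= (words - i) / words
    return p

print(f"2nd error avoids the corrupted word: {p_all_distinct(2):.3%}")  # 96.875%
print(f"7th error avoids corrupted words:    {p_all_distinct(7):.3%}")  # ~49.4%
```

As with the classic birthday paradox, the probability of collision grows quickly, which is why a per-word repair budget (as in ECP1) is exhausted in a few unlucky words while a per-sub-block budget (as in WBR) is not.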

C. Average Corrected Faults

Fig. 7 shows the average numbers of corrected faults. The results for the WBR and the ECC- and ECP-based techniques translate directly into the average relative lifetime improvement. The reason the results for SAFER do not correspond to the average relative lifetime improvement is that SAFER requires an additional write operation for each corrupted group. These additional write operations decrease the lifetime of the undamaged memory cells in the corrupted group and thus the effectiveness of the whole repair mechanism.

VIII. PAYG WITH WBR

The PAYG framework is based on a Hash-Table-with-Chaining structure used to allocate error-correction entries in proportion to the number of hard faults in a memory line. Although Qureshi employed the ECP-based repair in [7], PAYG can be used with any error-correction scheme. The repair techniques proposed in this paper (WR, BR, or WBR) can be implemented in the PAYG framework, although they require changes to the PAYG structure. In contrast to the ECP-based scheme, the repair procedure in the WBR considers corrupted I/O blocks instead of single bits. Moreover, the addresses of corrupted I/O blocks are stored separately from the bits used for replacement. As a result, a PAYG framework based on the WBR should operate on memory blocks/sub-blocks and should allocate spare I/O blocks in proportion to the number of corrupted I/O blocks in the memory block/sub-block. Furthermore, additional entries should be added to the PAYG structure to account for the addresses of corrupted I/O blocks. A more detailed consideration and evaluation of the WBR within the PAYG framework will be provided in future work.
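To make the required structural change concrete, a minimal sketch of such a per-block allocation table follows. All names are hypothetical, and the PAYG Hash-Table-with-Chaining is simplified to a plain dictionary with per-block entry lists; the point is only that entries record corrupted I/O block addresses and their spare replacements, allocated on demand:

```python
from collections import defaultdict

class PaygWbrTable:
    """Sketch of a PAYG-style structure adapted to WBR: instead of per-bit
    ECP entries, each memory block/sub-block receives entries recording
    (a) the address of each corrupted I/O block and (b) the spare I/O block
    replacing it, allocated pay-as-you-go from a shared pool."""

    def __init__(self, spare_pool_size):
        self.free_spares = list(range(spare_pool_size))  # shared spare pool
        # block address -> list of (corrupted I/O block addr, spare id)
        self.entries = defaultdict(list)

    def repair(self, block_addr, io_block_addr):
        """Allocate one spare I/O block for a newly corrupted I/O block."""
        for corrupted, spare in self.entries[block_addr]:
            if corrupted == io_block_addr:
                return spare  # already repaired; reuse the mapping
        if not self.free_spares:
            raise RuntimeError("spare pool exhausted: block is unrepairable")
        spare = self.free_spares.pop()
        self.entries[block_addr].append((io_block_addr, spare))
        return spare

    def lookup(self, block_addr, io_block_addr):
        """Return the spare id replacing this I/O block, or None."""
        for corrupted, spare in self.entries[block_addr]:
            if corrupted == io_block_addr:
                return spare
        return None
```

On each access, `lookup` would redirect reads/writes of a corrupted I/O block to its spare; blocks with no faults consume no entries, which is the pay-as-you-go property.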

IX. CONCLUSION

Emerging NVMs offer great opportunities for the whole IT industry. However, before they are adopted, all concerns about their reliability must be dispelled. The techniques proposed in this paper for embedded NVMs provide a significant memory lifetime improvement. Moreover, they offer great design flexibility. With the WBR approach it is possible to tailor the speed and effectiveness of the repair mechanism to the requirements of the target application. The WBR can also be implemented for existing technologies such as Flash, which suffer from the erase-before-write problem. Furthermore, the WBR is complementary to existing wear-leveling techniques and can be implemented in hardware.

REFERENCES

[1] K. Pekmestzi, N. Axelos, I. Sideris, and N. Moshopoulos, “A BISR Architecture for Embedded Memories,” 2008 14th IEEE International On-Line Testing Symposium, pp. 149–154, Jul. 2008.

[2] B. Godard, J.-M. Daga, L. Torres, and G. Sassatelli, “Architecture for Highly Reliable Embedded Flash Memories,” 2007 IEEE Design and Diagnostics of Electronic Circuits and Systems, pp. 1–6, 2007.

[3] R.-F. Huang, C.-H. Chen, and C.-W. Wu, “Economic Aspects of Memory Built-in Self-Repair,” IEEE Design & Test of Computers, vol. 24, no. 2, pp. 164–172, Feb. 2007.

[4] G. Burr et al., “Phase change memory technology,” Journal of Vacuum Science & Technology B: Microelectronics and Nanometer Structures, vol. 28, no. 2, p. 223, 2010.

[5] B. Lee et al., “Phase-change technology and the future of main memory,” IEEE Micro, vol. 30, no. 1, p. 143, Jan.–Feb. 2010.

[6] A. Ferreira et al., “Using PCM in Next-generation Embedded Space Applications,” 2010 16th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 153–162, Apr. 2010.

[7] M. K. Qureshi, “Pay-as-you-go: low-overhead hard-error correction for phase change memories,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44 ’11. New York, NY, USA: ACM, 2011, pp. 318–328.

[8] L. Grupp et al., “Beyond the datasheet: Using test beds to probe non-volatile memories’ dark secrets,” in GLOBECOM Workshops (GC Wkshps), 2010 IEEE, 2010, pp. 1930–1935.

[9] J. Rodgers et al., “A 4-Mb Non-volatile Chalcogenide Random Access Memory designed for space applications: Project status update,” 2008 9th Annual Non-Volatile Memory Technology Symposium (NVMTS), pp. 1–6, 2008.

[10] E. Ipek et al., “Dynamically replicated memory: building reliable systems from nanoscale resistive memories,” ACM SIGARCH Computer Architecture News, vol. 38, no. 1, pp. 3–14, 2010.

[11] S. Schechter et al., “Use ECP, not ECC, for hard failures in resistive memories,” Proceedings of the 37th annual international symposium on Computer architecture - ISCA ’10, p. 141, 2010.

[12] D. Yoon et al., “FREE-p: Protecting non-volatile memory against both hard and soft errors,” in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, 2011, no. 1, pp. 466–477.

[13] N. Seong et al., “SAFER: Stuck-At-Fault Error Recovery for Memories,” 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 115–124, Dec. 2010.

Figure 6 Average Relative Lifetime Improvement

Figure 7 Average Corrected Faults