
A VOLTAGE SCALING POWER REDUCTION STRATEGY

FOR MEMORY-BASED LDPC DECODERS

by

Dawei Song

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

© Copyright 2016 by Dawei Song

Abstract

A Voltage Scaling Power Reduction Strategy for Memory-Based LDPC Decoders

Dawei Song

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2016

In baseband digital signal processing, dynamic voltage scaling is an effective method to reduce the power consumption of an SOC. Voltage scaling directly applied to embedded memories can lead to stability issues such as memory read and write errors. This thesis explores the possibility of applying voltage scaling techniques to embedded memory that is used in forward error correction circuits (an LDPC decoder), so that the decoder can compensate for the errors induced by voltage scaling while the overall energy efficiency of the LDPC decoder is improved.

This thesis uses an LDPC decoder designed for the IEEE 802.11ad wireless (WiFi) standard as a target to quantitatively characterize its inherent tolerance to embedded memory errors. An adaptive voltage scaling control algorithm is then proposed based on the error tolerance model. A hardware-aware implementation strategy is also presented, with a simulated energy saving potential of up to 20%.


To my mother, who believes in me.


Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor, Professor Patrick Glen Gulak, for providing me with the opportunity and guidance during my studies here at the University of Toronto. His advice, in both technical and managerial aspects, has been invaluable to me and was absolutely indispensable for the completion of this work.

I would like to acknowledge that the idea proposed in this thesis was first conceived during my internship at MaxLinear Inc. in California, under the close supervision and guidance of Dr. Curtis Ling, Mr. Tim Gallagher, and Dr. Anand Anandakumar. When I later returned to school, the idea was further developed to fruition with the help of my supervisor, Prof. P. G. Gulak. I wish to highlight the fact that the LDPC decoder simulation model used in this work was provided by a senior member of Prof. Gulak’s research team, Mr. Mario Milicevic. His technical resourcefulness and his willingness to assist have provided me, on countless occasions, with much needed support. I would also like to thank the other members of Prof. Gulak’s research team, Mr. Alhassan Khedr, Mr. Michal Fulmyk, and Ms. Rosanah Murugesu, for providing the feedback necessary to improve the quality of this work.

I would like to, in general, thank everyone from BA5000 for the wonderful time and the sleepless nights we all shared and struggled through in one way or another. A series of special thanks goes to the following recipients: To Luke, with whom I had many technical, political, socioeconomic, and philosophical discussions. To Jeff, whose mentorship has been my true source of inspiration for perseverance. To Kitty and Masumi, who cared, care, and will continue to care about my general well-being. To Nadeesha, who always challenges me with new ideas and possibilities. To Jason, with whom I share similar views on many things. To Jingxuan, who helped me with logistics on many occasions. Finally, I would also like to extend my sincere appreciation to Weijia, who helped me, with special equipment, to pinpoint a fatal error during my post tape-out verification process.

Last but not least, I would like to thank all my defence committee members, Prof. P. G. Gulak, Prof. Frank Kschischang, and Prof. Antonio Liscidini, and the chair of my defence committee, Prof. Jason Anderson, for all the valuable feedback and suggestions.


Contents

Table of Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
    1.1 Motivation
    1.2 Outline

2 Low-Density Parity-Check Codes
    2.1 LDPC Codes
    2.2 LDPC Decoding Algorithm
    2.3 LDPC Decoder Hardware Implementation
    2.4 Thesis Objectives

3 Embedded Memory Error and Characterization
    3.1 Embedded Memory Basics
    3.2 Embedded SRAM Bit-Cell Design
    3.3 SRAM Storage Failure Mechanisms
    3.4 Parametric Failure Induced Memory Error
        3.4.1 Hold Failures
        3.4.2 Read Failures
        3.4.3 Write Failures
        3.4.4 Access Time Failures
    3.5 Effect of Voltage Scaling on Parametric Failures
    3.6 Discussion

4 System Level LDPC Simulation with Injected Memory Errors
    4.1 Simulation Setup
    4.2 Injected Memory Errors
    4.3 Performance Benchmarking of the Native IEEE802.11ad LDPC Decoder
    4.4 Benchmarking LDPC Decoder with Additive Memory Errors
    4.5 Trade-off Between Iteration Time and Supply Voltage
    4.6 Establishing a Trade-off Relationship between SNR and Number of Iterations for a Particular LDPC Decoding Performance
    4.7 Discussion

5 LDPC Dynamic Voltage Scaling Control System
    5.1 Testbench and System Overview
    5.2 Vdd Operating Table
    5.3 Description of the Optimization Function
        5.3.1 Optimization Algorithm
        5.3.2 Implementation of the Optimization Algorithm
    5.4 Description of the Decision Block
    5.5 LDPC Behavioural Model and Energy Saving Evaluation Module
    5.6 Discussion

6 Dynamic Voltage Scaling Control Module Simulation
    6.1 Simulation Parameters
    6.2 Simulation Environment
    6.3 Simulation Results
    6.4 Projected Energy Saving Potential with Technology Scaling

7 Conclusion
    7.1 Future Work

A LDPC BER Simulations

B LDPC Dynamic Voltage Scaling Simulation - MATLAB Code
    B.1 Main Simulation Script
    B.2 Vdd Operating Table
    B.3 Optimization Function
    B.4 LDPC System Behavioral Model
    B.5 Energy Saving Evaluation Module

Bibliography


List of Figures

1.1 ITRS current and projected percentage of area occupied by embedded memory in an SOC. [LOP: Low Operating Power. LSTP: Low Standby Power.] [1]

2.1 Illustration of an FEC-enabled communication link

2.2 Tanner Graph Representation of the H matrix in (2.2)

2.3 System Architecture of a Partially Parallel LDPC Decoder [24]

2.4 System Architecture of a Fully Parallel LDPC Decoder [24]

3.1 The 6-Transistor Embedded SRAM Bit-Cell Circuit [33]

3.2 [LEFT] Static Noise Margin (SNM) Extracted from Butterfly Diagram - [RIGHT] Comparison of SNM during Read and Hold [34]

3.3 Probability of Bit-cell Failure vs. Vdd for Various Cache Structures [41]

4.1 Reference LDPC Decoder Performance Simulation (no memory errors) [ BER Plot with Multiple Iteration Numbers | Code Rate = 1/2 | Zero Injected Memory Error ]

4.2 LDPC Decoder Performance Comparisons with Added Memory Bit Flipping Errors (Iteration Number = 5). [ Black: Reference simulation with no memory error. | Red: Memory errors present at all bit locations of LLR registers with $P_{err} = 10^{-2}$ (1 in 100). | Blue: Memory errors present at all but the MSB bit locations of LLR registers with $P_{err} = 10^{-2}$ (1 in 100). ]

4.3 LDPC Decoder Performance Comparisons with Added Memory Bit Flipping Errors (Iteration Number = 15). [ Black: Reference simulations with no memory error. | Red: Memory errors present at all bit locations of LLR registers with $P_{err} = 10^{-2}$ (1 in 100). | Blue: Memory errors present at all but the MSB bit locations of LLR registers with $P_{err} = 10^{-2}$ (1 in 100). ]

4.4 Demonstration of the Corrective Strategy of Using Higher Iteration Numbers to Counteract the Performance Degradation Caused by Injected Memory Errors. [ Blue: Reference simulations with no memory error. | Other Colours: Simulations with injected memory errors to all LLR bit locations with $P_{err} = 10^{-2}$. ]

4.5 Benchmarking the trade-off relationship between iteration number and input channel SNR. (Reference simulation runs with no memory error.) [ Simulation curves: A set of reference LDPC decoding simulations with constant code rate of 1/2 and various iteration numbers. (Simulations conducted with zero simulated memory errors.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]

4.6 Benchmarking the trade-off relationship between iteration number and input channel SNR. (Simulation runs with injected memory errors $P_{err} = 10^{-2}$ (1 in 100)) [ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of $10^{-2}$ (1 in 100). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]

4.7 LDPC decoder trade-off relationship between channel SNR and iteration number for BER target = $10^{-5}$ and Code Rate = 1/2. [ Each individual curve: Plot of channel SNR vs. iteration number, channel SNR data obtained from Table 4.1. Each curve corresponds to a particular simulated memory error condition. ]

5.1 Overview of the LDPC Dynamic Voltage Scaling Control System and Testbench

5.2 Comparator Based Implementation of Vdd Operating Table Module

5.3 System Diagram of the Optimization Block [ N.B. The constants 10 and 20 held in the input registers of the muxes define the number of iterations to complete decoding as described in section 5.3.2. ]

5.4 System Diagram of the Decision Block

5.5 Simplified Operating Curves for Three Discrete Operating Vdd Supply Voltages of an LDPC Decoder. [ Simplified from Fig. 4.7, decoding performance target BER = $10^{-5}$. | Low Vdd Level: Smoothed and interpolated from curve $P_{err} = 10^{-2}$ (or 1 in 100). | Medium Vdd Level: Smoothed and interpolated from curve $P_{err} = 2 \times 10^{-3}$ (or 1 in 500). | Nominal Vdd Level: Smoothed and interpolated from reference curve $P_{err} = 0$ (no err.). ]

5.6 (Repeated Fig. 3.3) Probability of Bit-cell Failure vs. Vdd for Various Cache Structures [41]

6.1 ITRS 2010 - Static Power Contribution to the Overall Power Consumption of Cache Logic [45]

6.2 Dynamic Voltage Scaling Control System Simulation [With 3 Power Supply Voltage Levels]

6.3 A zoomed-in view of Fig. 6.2 [Demonstrating the hysteresis nature of the Decision Block]

6.4 Dynamic Voltage Scaling Control System Simulation [With 2 Power Supply Voltage Levels]

6.5 Dynamic Voltage Scaling Control System Simulation [With Rayleigh fading channel (with maximum Doppler shift of 100 Hz)]

6.6 Estimated Energy Saving Trend with Technology Node (using the static power consumption estimates from Fig. 6.1)

A.1 BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors $P_{err} = 2 \times 10^{-3}$ (1 in 500)) [ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of $2 \times 10^{-3}$ (1 in 500). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]

A.2 BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors $P_{err} = 1.2 \times 10^{-3}$ (1 in 800)) [ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of $1.2 \times 10^{-3}$ (1 in 800). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]

A.3 BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors $P_{err} = 10^{-3}$ (1 in 1000)) [ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of $10^{-3}$ (1 in 1000). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]

A.4 BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors $P_{err} = 2 \times 10^{-4}$ (1 in 5000)) [ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of $2 \times 10^{-4}$ (1 in 5000). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = $10^{-5}$) ]


List of Tables

4.1 LDPC Operating Table [Constructed from a generic LDPC decoder designed for the IEEE802.11ad standard and simulated under the BPSK modulation scheme and code rate = 1/2.]

5.1 All Possible Output States of the Six Comparators are Thermal Coded in the Vdd Operating Table Module

6.1 LDPC Dynamic Voltage Scaling Module Simulation Parameters for 65nm CMOS Technology [41], [45].


List of Acronyms

ASIC Application-Specific Integrated Circuit

AWGN Additive White Gaussian Noise

BER Bit Error Rate

BL0 Bitline 0

BL1 Bitline 1

CAD Computer-Aided Design

CMOS Complementary Metal Oxide Semiconductor

CNU Check Node Update Unit

CR Code Rate

DC Direct Current

DRAM Dynamic Random-Access Memory

DVB-S Digital Video Broadcasting Satellite

DVB-S2 Digital Video Broadcasting Satellite Generation 2

EDA Electronic Design Automation

EEPROM Electrically Erasable Programmable Read-Only Memory

eSRAM Embedded Static Random-Access Memory

FEC Forward Error Correction

ITRS International Technology Roadmap for Semiconductors


LDPC Low-Density Parity-Check

LLR Log-Likelihood Ratio

LSB Least Significant Bit

MOSFET Metal Oxide Semiconductor Field-Effect Transistor

MSA Min-Sum Algorithm

MSB Most Significant Bit

PAF Probability of Access Failure

RDF Random Dopant Fluctuation

ROM Read-Only Memory

SSI Signal Strength Indicator

SIA Semiconductor Industry Association

SNR Signal-to-Noise Ratio

SOC System on a Chip

SPA Sum-Product Algorithm

SRAM Static Random-Access Memory

VNU Variable Node Update Unit

WL Word Line


Chapter 1

Introduction

1.1 Motivation

Modern electronic devices are often designed with strict power efficiency requirements. The efficiency of a design can be evaluated by its power dissipation, which in turn determines package cooling and power supply requirements. With the progress of the integrated circuit (IC) industry closely following Moore’s law, an increasing number of transistors are being integrated into a single silicon system on chip (SOC). Consequently, the SOC is one of the dominant power-consuming components in most modern electronic devices [1]. The ever-growing complexity and size of an SOC require more and more engineering effort to be directed towards improving its energy efficiency. Energy-efficient SOC design has become an essential requirement for modern electronics and is key to sustaining the long-term growth of the IC industry.

In recent years, an interesting observation can be made with regard to the area occupied by embedded memories on an SOC. According to data provided by the International Technology Roadmap for Semiconductors (ITRS), shown in Fig. 1.1, more than 90% of the die area on a current-generation silicon chip is occupied by embedded memory [1] [2]. Consequently, embedded memories contribute a significant portion of the overall chip power consumption, and reducing the energy consumption of embedded memories is a convenient and practical strategy to reduce the total energy consumption of a chip without major alteration to its design.

[Figure 1.1 (reproduced from [1]): area (percent, 0 to 100) versus year (2001 to 2016) for a die size of 1 cm^2; curves show the logic area contribution and the total memory area for LOP and LSTP devices.]

Figure 1.1: ITRS current and projected percentage of area occupied by embedded memory in an SOC. [LOP: Low Operating Power. LSTP: Low Standby Power.] [1]

A common and straightforward method of reducing memory power consumption is voltage scaling. In fact, the studies in [3], [4], and [5] applied voltage scaling to embedded memories in order to reduce leakage power during the memory standby state. Voltage scaling is rarely used during the active read and write cycles of a memory because it can induce read and write errors while the memory is in use [6]. For digital circuits containing embedded memories, memory errors caused by read and write stability issues can critically impact the function and performance of a circuit and thus cannot be tolerated. However, in SOCs with built-in error correction subsystems, such as wireless and wireline SOCs that have forward error correction (FEC) subsystems with embedded memory implementations, voltage scaling applied to embedded memories provides an interesting opportunity to exploit the power reduction benefit without compromising the overall system output BER specifications. The principal idea is that an error correction circuit is designed to reliably provide a required bit error rate under the worst-case channel condition (low SNR), where the number of channel errors presented to the system is high. Under normal operating conditions, at moderate to high channel SNR, the performance limit of the error correction circuit is not constantly stressed, so voltage scaling can be applied and the error correction circuit can be used to correct and compensate for the memory errors induced by voltage scaling. As a particular example, the condition of a wireless channel is time variant; when channel conditions are good, voltage scaling can be adaptively applied to the embedded memory of the FEC decoder with the expectation that the memory errors induced by voltage scaling will be fully corrected, such that the normal decoding performance of the circuit is maintained.

With the main strategy for reducing the energy consumption of embedded memory presented, this thesis is divided into two sections. We begin by investigating the failure mechanisms and behaviours of embedded memories under voltage scaling. More specifically, the correlation between dynamic voltage scaling and the induced memory errors must be precisely modeled and quantified. Today’s SOCs rarely incorporate hand-customized embedded memories [7]. Instead, embedded memories are often designed with CAD tools such as a memory compiler, which generates the memory block according to the specific requirements of a design. The design target of the EDA tool is to accommodate the wide range of size and speed requirements of different applications [8]. The extra design margin built into the compiled memory therefore raises an interesting research question: how far must the operating condition of an embedded memory be stressed before a significant level of memory failures can be observed?

With these insights, we then focus on a memory-based forward error correction (FEC) decoder, which is an integral part of a high-speed wireless digital communication system. Specifically, a low-density parity-check (LDPC) decoder is chosen as the target of study, and voltage scaling is applied to its embedded memory. LDPC codes are used in various communication standards, such as DVB-S2 (Digital Video Broadcasting Satellite Generation 2) and IEEE 802.11ad, and are known to be among the best performing FEC codes, approaching the Shannon channel capacity limit [9]. The LDPC decoding algorithm is iterative in nature and computationally intensive. An LDPC decoder is normally implemented with large amounts of memory to store intermediate results between iterations. The embedded memory inside an LDPC decoder thus offers a potential opportunity for voltage scaling. By default, the assumed source of errors for an LDPC decoder is the communication channel, whose condition can vary significantly over time. An LDPC decoder is designed to accommodate the worst channel condition (the worst anticipated channel SNR), but the worst channel condition often occurs infrequently. Under good channel conditions with high signal-to-noise ratio (SNR), the LDPC decoder needs fewer iterations for error correction because the number of errors in the received codeword is relatively low. The method of using only as many iterations as necessary is referred to as early termination and is a well-known strategy to minimize decoding energy consumption. This work, however, focuses on an alternative approach in which the iteration cycles available to the LDPC decoder under high SNR are used to compensate for the additional memory errors induced by dynamic voltage scaling. To accomplish this, an in-depth understanding of the tolerance of the LDPC decoder to the added memory errors is needed.

The quantitative analyses of embedded memory errors and of the LDPC decoder’s robustness to errors induced in the memories required for its realization are combined to formulate an adaptive control scheme, which predicts the optimum operating supply voltage for the embedded memory of the LDPC decoder based on channel SNR. The overall objective is to optimize the energy consumption of an LDPC decoder by minimizing the energy consumption of its embedded memory while maintaining a required decoding performance threshold.

1.2 Outline

The thesis is organized into seven chapters. Chapter 2 provides background information on forward error correction (FEC), with a primary emphasis on low-density parity-check (LDPC) codes and decoder hardware implementation. Chapter 3 presents the results of a literature survey on embedded SRAM memories and their reliability errors under dynamic voltage scaling. Chapter 4 presents the results and analysis of system level LDPC decoder simulations under the injection of memory errors. Chapter 5 combines the observations obtained from Chapter 3 and Chapter 4 to formulate the design of a dynamic voltage scaling control system that improves the decoding energy efficiency of an LDPC decoder. Chapter 6 presents simulation results for this control system and provides an estimate of the potential energy saving for future technology nodes. Finally, Chapter 7 draws conclusions from this work and suggests possibilities for future research.

Chapter 2

Low-Density Parity-Check Codes

When information is transmitted through a noisy communication channel, corruption of the transmitted data at the receiver end is often unavoidable. To rectify this, forward error correction (FEC) codes are used to improve data transfer reliability. As a digital coding scheme, the main idea behind an FEC code is data redundancy. By encoding the transmitted data with extra bits, called parity bits, the FEC mechanism increases the robustness of data transmission at the expense of higher transmission bandwidth. An illustration of an FEC-enabled data communication chain is shown in Fig. 2.1.

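[Figure 2.1: block diagram. Digital Source -> FEC Encoder -> Transmitter -> Channel (+ Channel Noise) -> Receiver -> FEC Decoder -> Digital Sink. Signal labels: Codeword (transmit side), Received Codeword (receive side).]

Figure 2.1: Illustration of an FEC-enabled communication link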

The original data message is first encoded by the FEC encoder, which adds parity bits before the transmitter sends it through the communication channel. The communication channel shown in Fig. 2.1 is subject to additive white Gaussian noise (AWGN). After the receiver has obtained the received sequence from the channel, the FEC decoder performs a decoding operation on the received codeword, which either corrects or partially corrects any corrupted data bits with the help of the additional parity bits.

The discussion of forward error correction codes in this thesis focuses primarily on low-density parity-check (LDPC) codes. This is because LDPC codes are known to be among the most powerful and effective error correction codes and are used in numerous wireless communication standards such as IEEE 802.11ad WiFi and Digital Video Broadcasting - Satellite - Second Generation (DVB-S2) [10]. In addition to their popularity, LDPC decoding algorithms are also known to have inherent fault-tolerant properties. In other words, errors that occur during the computation and/or transmission of data have limited impact on the functionality and performance of LDPC decoding algorithms [11]. The error-tolerant properties of LDPC decoders have been studied in [11] and [12], where tolerance to intermittent transient errors at the physical and circuit levels, e.g. soft errors and timing errors, is exploited to improve overall circuit reliability. This work attempts to further exploit the error tolerance of LDPC decoders with the objective of improving the overall energy efficiency of the system.

2.1 LDPC Codes

A low-density parity-check code is an algebraic code. More specifically, it belongs to the class of linear block codes [13]. This means an LDPC code can be represented in its matrix form. For instance, if a codeword is represented by a vector, the encoding and decoding operations can be seen as matrix operations applied to such a vector.

Concretely, suppose an original (unencoded) message is defined by a vector $\vec{u}$ of length $N$, $\vec{u} = \{u_1, u_2, \ldots, u_N\}$. To encode it, the vector is multiplied by a generator matrix $G$ of dimension $N \times M$ to produce the encoded vector $\vec{c}$ of length $M$. Here it is assumed that $M \geq N$, since the encoding operation lengthens the word by adding parity bits.

$$\vec{c} = \vec{u}G \qquad (2.1)$$

The decoding operation of an LDPC code involves the parity-check matrix $H$, which has a dimension of $(M-N) \times M$. An example $4 \times 8$ $H$ matrix is shown in (2.2). The term low density in the name LDPC refers to the fact that the parity-check matrix is sparse, i.e. it contains relatively few non-zero elements.

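$$
H = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 1 & 0 & 1 & 0
\end{bmatrix} \qquad (2.2)
$$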


An LDPC code is defined by its parity-check matrix. The parity-check matrix defines the bit location dependency for each parity bit, and hence defines how parity bits are constructed. Consequently, information contained in the parity-check matrix is also used for LDPC decoding. For instance, the success condition for hard-decision decoding dictates that multiplying the coded vector by the transpose of the parity-check matrix must yield the zero vector (2.3),

$$\vec{c}H^{T} = \vec{0}. \qquad (2.3)$$
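The condition in (2.3) can be checked numerically. The following is a minimal sketch, assuming NumPy and modulo-2 arithmetic; the helper name `syndrome` and the choice of test words are illustrative and not taken from the thesis. Because every row of the example matrix in (2.2) contains an even number of ones, the all-ones word satisfies all four checks, while flipping a single bit produces a non-zero result.

```python
import numpy as np

# Parity-check matrix H from (2.2): 4 check equations over 8 code bits.
H = np.array([
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
], dtype=np.uint8)


def syndrome(word, H):
    """Return word * H^T over GF(2); all zeros means every parity check passes."""
    return (np.asarray(word, dtype=np.uint8) @ H.T) % 2


# Every row of this H has even weight, so the all-ones word is a codeword.
codeword = np.ones(8, dtype=np.uint8)
corrupted = codeword.copy()
corrupted[0] ^= 1  # flip the first bit

s_ok = syndrome(codeword, H)    # all parity checks satisfied
s_bad = syndrome(corrupted, H)  # the checks that involve bit 0 fail
```

The non-zero entries of `s_bad` mark exactly the parity checks in which the flipped bit participates, which is the information an iterative decoder works from.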

Once the parity-check matrix is defined, the generator matrix can be obtained by first performing a Gauss-Jordan elimination on the parity-check matrix $H$, forming

$$H = [A \mid I_{M-N}] \qquad (2.4)$$

where $M-N$ and $M$ are the numbers of rows and columns of $H$ respectively, and $I_{M-N}$ is an identity matrix of size $M-N$. The process of Gauss-Jordan elimination produces $A$, which has the dimension of $(M-N) \times N$, and is used to construct the generator matrix $G$ such that:

$$G = [I_N \mid A^{T}], \qquad (2.5)$$

where $I_N$ is an identity matrix of dimension $N$. The row space of $G$ is orthogonal to the row space of $H$, and one can easily verify this (with all arithmetic carried out modulo 2) by showing [14]:

$$GH^{T} = 0. \qquad (2.6)$$
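The construction in (2.4) to (2.6) can be sketched in a few lines, assuming NumPy and modulo-2 row operations without column swaps. Note that the example matrix in (2.2) has exactly two non-zero entries in every column, so its four rows sum to zero modulo 2 and it cannot be brought to the form of (2.4) as it stands. The sketch therefore uses a small hypothetical full-rank matrix `H_hyp`; the function name `systematic_generator` is likewise an assumption made for illustration.

```python
import numpy as np


def systematic_generator(H):
    """Reduce H to [A | I] over GF(2) by row operations and return (G, A), G = [I | A^T].

    Assumes the right-most (M - N) columns of H are linearly independent, so no
    column swaps are needed; raises ValueError otherwise.
    """
    H = np.array(H, dtype=np.uint8) % 2
    r, M = H.shape            # r = M - N parity checks, M code bits
    N = M - r                 # number of message bits
    for i in range(r):
        col = N + i
        pivots = [j for j in range(i, r) if H[j, col]]
        if not pivots:
            raise ValueError("right-hand block of H is singular over GF(2)")
        H[[i, pivots[0]]] = H[[pivots[0], i]]   # move the pivot row into place
        for j in range(r):
            if j != i and H[j, col]:
                H[j] ^= H[i]                    # clear the rest of the column
    A = H[:, :N]
    G = np.hstack([np.eye(N, dtype=np.uint8), A.T])
    return G, A


# Hypothetical 3 x 7 full-rank parity-check matrix (not taken from the thesis).
H_hyp = np.array([
    [0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
], dtype=np.uint8)

G, A = systematic_generator(H_hyp)
u = np.array([1, 0, 1, 1], dtype=np.uint8)   # message of length N = 4
c = (u @ G) % 2                              # encoded word, as in (2.1)
check = (G @ H_hyp.T) % 2                    # all zeros, as in (2.6)
```

Because only row operations are applied, the reduced matrix has the same row space as `H_hyp`, so the resulting `G` is orthogonal to the original matrix as well as to its reduced form.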

The parity-check matrix $H$ can be randomly generated as a sparse matrix. However, parity-check matrices used in modern wireless communication standards such as DVB-S2 and IEEE 802.11n are quasi-cyclic, in the sense that the whole parity-check matrix is subdivided into smaller square matrices where each of these sub-matrices is circulant, i.e., each row of a circulant matrix is a cyclic shift of the previous row [15]. A well-designed quasi-cyclic LDPC (QC-LDPC) code can have performance equivalent to that of a randomly generated sparse parity-check matrix, with the added advantage that a QC-LDPC decoder can potentially have higher throughput [16].
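The quasi-cyclic structure can be illustrated with a short sketch, assuming NumPy. It builds each sub-matrix as a cyclically shifted identity matrix, which is one common special case of a circulant matrix. The base matrix of shift values, the use of -1 to mark an all-zero block, and the function names are assumptions made for illustration and do not correspond to any particular standard.

```python
import numpy as np


def circulant_permutation(size, shift):
    """Identity matrix whose rows are cyclically shifted right by `shift` positions."""
    return np.roll(np.eye(size, dtype=np.uint8), shift, axis=1)


def expand_base_matrix(base, size):
    """Expand a base matrix of shift values into a quasi-cyclic H.

    A negative entry marks an all-zero block (a convention assumed here).
    """
    zero = np.zeros((size, size), dtype=np.uint8)
    blocks = [[zero if s < 0 else circulant_permutation(size, s) for s in row]
              for row in base]
    return np.block(blocks)


# Hypothetical 2 x 3 base matrix expanded with 4 x 4 circulant blocks.
base = [[0, 1, -1],
        [2, -1, 0]]
H_qc = expand_base_matrix(base, size=4)   # 8 x 12 parity-check matrix
P = circulant_permutation(4, 1)           # each row is the previous row shifted by one
```

Only the small base matrix needs to be stored to describe the full parity-check matrix, which is part of what makes the quasi-cyclic structure attractive for hardware.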

Apart from its matrix form, the $H$ matrix of an LDPC code can alternatively be visualized by a Tanner graph. An example of a Tanner graph representing the parity-check matrix $H$ in (2.2) is shown in Fig. 2.2.

[Figure 2.2 labels: Check Nodes C0 to C3; Variable Nodes V0 to V7]

Figure 2.2: Tanner Graph Representation of H matrix in (2.2)

The main advantage of a Tanner graph is its ability to link the parity-check matrix of an LDPC code with its decoding algorithm through visual representation. A Tanner graph is a bipartite graph, which means the nodes in the graph are separated into two groups, seen as the check node group and the variable node group in Fig. 2.2. The number of check nodes is determined by the number of rows in $H$, and the number of variable nodes is determined by the number of columns in $H$. The meshed interconnects between check nodes and variable nodes are described by the non-zero elements of the parity-check matrix. The interconnects can be viewed from both the row vector and the column vector perspective. Each row vector represents a check node, where the position of each non-zero element marks a connection to the respective variable node. Similarly, each column vector represents a variable node, where the position of each non-zero element marks a connection to the respective check node.
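The row and column views described above can be read directly off the matrix. The following minimal sketch, assuming NumPy, derives both adjacency lists for the $H$ of (2.2); the variable names are illustrative.

```python
import numpy as np

# Parity-check matrix H from (2.2).
H = np.array([
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
], dtype=np.uint8)

# Row view: the variable nodes attached to each check node C0..C3.
check_to_vars = [np.flatnonzero(row).tolist() for row in H]

# Column view: the check nodes attached to each variable node V0..V7.
var_to_checks = [np.flatnonzero(col).tolist() for col in H.T]
```

For this matrix every check node connects to four variable nodes and every variable node connects to two check nodes, matching the row and column weights of $H$.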

During LDPC decoding, messages are passed repeatedly between check nodes and variable nodes, and the decoding process is carried out iteratively. Depending on the type of messages passed between variable nodes and check nodes, LDPC decoding algorithms can be divided into two main categories: hard-decision decoding and soft-decision decoding [17]. For hard-decision decoding, messages propagated between variable nodes and check nodes are binary values from received codewords. During decoding iterations, hard decisions are made to invert binary values in order to satisfy the parity equations. In contrast, messages propagated during soft-decision decoding are calculated based on the conditional probability of received message bits and are, in practice, often expressed as log-likelihood ratios [17]. Soft-decision decoding generally offers enhanced decoding performance and is therefore commonly used in today's hardware implementations [18]. Since this work focuses on practical hardware implementations of LDPC decoders, the discussions of LDPC decoding algorithms throughout the rest of this work assume soft-decision decoding.

2.2 LDPC Decoding Algorithm

There are two commonly known LDPC iterative soft-decision decoding algorithms, the sum-product algorithm (SPA) and the min-sum algorithm (MSA). The Tanner graph, which describes the message passing actions during iterative decoding, is exactly the same for both of these algorithms.

Messages that are initially loaded to variable nodes are called intrinsic messages, which correspond to the currently received codeword frame. Intrinsic messages are encoded in the form of a log-likelihood ratio (LLR), which is a way to represent the belief about the value of each received bit. The initial a posteriori log-likelihood ratio for each bit of the received codeword can be calculated by (2.7),

$$\lambda_n = \log\left(\frac{P[y_n \mid c_n = 0]}{P[y_n \mid c_n = 1]}\right) \qquad (2.7)$$

where $c_n$ represents each individual bit in the transmitted codeword $\vec{c}$ and $y_n$ represents each individual bit in the received codeword $\vec{y}$. Both $\vec{c}$ and $\vec{y}$ have length $M$, that is, $1 \leq n \leq M$.
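As an illustration of (2.7), the sketch below evaluates the LLR for one common special case. It assumes BPSK signalling with the mapping 0 to +1 and 1 to -1 over an AWGN channel with noise standard deviation `sigma`; this modulation and mapping are assumptions made here, not details stated in the text. Under these assumptions the ratio of the two Gaussian likelihoods simplifies to $2y_n/\sigma^2$, which the second function returns directly.

```python
import math


def channel_llr(y, sigma):
    """Evaluate (2.7) from the two Gaussian likelihoods.

    Assumes BPSK with bit 0 sent as +1 and bit 1 sent as -1 (an assumption of this sketch).
    """
    def gaussian_pdf(x, mean):
        return math.exp(-(x - mean) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    return math.log(gaussian_pdf(y, +1.0) / gaussian_pdf(y, -1.0))


def channel_llr_closed_form(y, sigma):
    """Closed form of the same ratio under the same assumptions: 2 * y / sigma^2."""
    return 2.0 * y / sigma ** 2


llr_direct = channel_llr(0.3, 0.8)
llr_closed = channel_llr_closed_form(0.3, 0.8)
```

With this sign convention a positive LLR favours bit 0 and a negative LLR favours bit 1, and the magnitude expresses how confident the belief is.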

During LDPC decoding, each intrinsic LLR is passed to its connected check nodes, which compute, based on the LLRs received from several other variable nodes, the extrinsic messages that get passed back to the variable node. Upon receiving an extrinsic message, the variable node updates its intrinsic LLR and checks for convergence (i.e., all parity-check equations are satisfied). If convergence is not reached, the updated intrinsic message is sent again to the connected check nodes, and the decoding iteration repeats for another cycle. Upon successful decoding, the decoded codeword from the variable nodes eventually converges, and the convergence is confirmed by the fact that all parity-check equations are satisfied. The routes through which intrinsic and extrinsic messages are passed back and forth are thus defined by the Tanner graph and hence the parity-check matrix $H$. As soon as LDPC decoding convergence is reached, the decoding process can be terminated. This strategy is referred to as early termination and is an effective way to improve energy efficiency. On the other hand, a maximum iteration number is often set to prevent the decoding operation from going on forever. The maximum iteration number provides a tuning option for bounded decoding delay in order to meet application requirements. When the maximum iteration number is reached, the LDPC decoder rejects the current decoding frame and reports a decode failure.


While the general message passing network is the same for both SPA and MSA, the difference between the two lies in the way in which extrinsic messages are computed. Historically, SPA was proposed first, and its extrinsic message calculations involve complex math operators such as $\tanh$ and $\tanh^{-1}$. Although the method works well on paper, it is poorly suited to real-life hardware implementations. MSA can be seen as a hardware-friendly approximation of SPA in which the extrinsic message calculations replace the $\tanh$ and $\tanh^{-1}$ operators with min and sign operators, which are much easier to implement in hardware [19]. Being an approximation of the SPA, the MSA carries a performance penalty [20]. However, numerous works such as [21], [22], and [23] demonstrate modified versions of the MSA, and it has been shown that the modified MSA offers approximately the same performance as the SPA. Therefore, most LDPC decoder designs nowadays are based on modified min-sum algorithms. A detailed description of the mathematical definitions for each algorithm can be found in [24].
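The min and sign operations, together with early termination and a maximum iteration count, can be sketched as follows. This is a plain, unscaled min-sum decoder with a flooding schedule, written with NumPy for the small $H$ of (2.2). It is not the modified min-sum of [21] to [23], nor the decoder model used in this thesis, and the function name and return convention are assumptions made for illustration.

```python
import numpy as np

# Parity-check matrix H from (2.2).
H = np.array([
    [0, 1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 0, 1, 0],
], dtype=np.uint8)


def min_sum_decode(llr, H, max_iters=10):
    """Plain min-sum decoding; returns (hard_bits, iterations_used, converged).

    Positive LLR means bit 0. Assumes every check node has degree >= 2.
    """
    llr = np.asarray(llr, dtype=float)
    rows, cols = np.nonzero(H)             # one entry per Tanner-graph edge
    c2v = np.zeros(len(rows))              # check-to-variable (extrinsic) messages
    posterior = llr.copy()
    for it in range(max_iters + 1):
        bits = (posterior < 0).astype(np.uint8)
        if not ((H @ bits) % 2).any():     # early termination: all checks satisfied
            return bits, it, True
        if it == max_iters:
            break
        v2c = posterior[cols] - c2v        # exclude each edge's own incoming message
        for m in range(H.shape[0]):
            edges = np.flatnonzero(rows == m)
            for e in edges:
                others = v2c[edges[edges != e]]
                # sign product and minimum magnitude replace tanh / tanh^-1
                c2v[e] = np.prod(np.sign(others)) * np.min(np.abs(others))
        posterior = llr + np.bincount(cols, weights=c2v, minlength=H.shape[1])
    return bits, max_iters, False          # maximum iteration count reached


# All-ones codeword with one unreliable (wrong-sign) LLR on bit 0.
bits, iters, ok = min_sum_decode([0.5] + [-2.0] * 7, H)
```

In this example the two check nodes connected to the unreliable bit both push its belief back toward 1, so the decoder terminates as soon as the parity checks pass rather than running all `max_iters` iterations.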

Apart from how decoding messages are calculated, the sequence and order that govern message passing, i.e. the scheduling of decoding, are also important to an LDPC decoding algorithm. Flooding scheduling updates all variable nodes simultaneously and then uses the results to update all check nodes simultaneously. It is the most straightforward message passing schedule and does not pose requirements on the parity-check matrix [15]. Layered decoding scheduling does not update all check nodes and variable nodes at once. Instead, a sequential order is followed in which only part of the variable node group and check node group is updated at a time. Studies have shown that sequential scheduling offers improved convergence speed compared to flooding scheduling [25].

2.3 LDPC Decoder Hardware Implementation

Modern LDPC decoder hardware implementations can generally be divided into two categories: partially-parallel decoders, also called memory-based decoders, and fully-parallel decoders [24]. A partially-parallel decoder uses shared variable node update units (VNUs), shared check node update units (CNUs), and a shared memory to handle message passing between the two. A generic partially-parallel decoder design is shown in Fig. 2.3.

The main challenge of designing a partially-parallel LDPC decoder is to properly handle the address generation of the shared memory block and coordinate the access of its stored contents in order to faithfully reproduce the message passing actions described by the Tanner graph. A partially-parallel decoder employs the concept of a shared computational unit to perform partial decoding multiple times and thus allows hardware to be reused. In practice, partially-parallel LDPC decoder designs can be implemented very efficiently in hardware.


Figure 2.3: System Architecture of a Partially Parallel LDPC Decoder [24]

For instance, [26] and [23] use partially-parallel designs to implement the LDPC decoder for DVB-S2 applications. In order to give a sense of scale for the size of memories used in LDPC decoders, the partially-parallel decoder proposed in [23] is highlighted as an example. This particular LDPC decoder is designed to be used in a DVB-S2 wireless receiver. The codeword length is 64800 bits with a quasi-cyclic parity-check matrix of parallel degree S = 360 (the number of submatrices into which the main parity-check matrix is divided). The total amount of embedded memory used in this LDPC decoder is on the order of 1 Mbit, of which approximately 60% is dedicated to extrinsic memory (shared memory) and 40% to intrinsic memory [23]. Another example of an LDPC decoder, on which all investigative simulations of this thesis are based, is designed for the IEEE 802.11ad WiFi standard. The IEEE 802.11ad standard has a codeword length of 672 bits. The quasi-cyclic parity-check matrix has a parallel degree of 42. The total memory embedded in this LDPC decoder is on the order of a few hundred Kbits, of which approximately 65% is dedicated to extrinsic memory (shared memory) and 35% to intrinsic memory.

A fully-parallel LDPC decoder, on the other hand, attempts to gain decoding throughput by trading off area and implementation complexity. In a fully-parallel LDPC design, each node of the Tanner graph, be it a variable node or a check node, has its own assigned processing hardware unit. The extrinsic messages are passed by a mesh interconnection network. A general design of a fully-parallel decoder is shown in Fig. 2.4.

Figure 2.4: System Architecture of a Fully Parallel LDPC Decoder [24]

A fully-parallel decoder design has the potential to achieve much higher throughput while also avoiding the problem of designing the memory access controller. In reality, however, a complicated interconnect mesh increases area and routing complexity, and the mismatches in delay can degrade performance considerably. Furthermore, a fully-parallel design usually takes more area to implement, and its power dissipation is also a concern [27]. Since the primary focus of this work is on energy efficiency, the discussion concerning LDPC decoder design will mainly focus on partially-parallel designs.

2.4 Thesis Objectives

This chapter briefly discussed the fundamentals of LDPC codes and their applications in modern communication systems. Practical LDPC decoding algorithms and their modern hardware implementations were also presented, with a particular emphasis on the partially-parallel LDPC decoder architecture and the important role embedded memories play in this type of design. To give reference to the size of the embedded memories, LDPC decoders used in DVB-S2 and IEEE 802.11ad were highlighted as examples.

The objective of this thesis is to improve the overall energy efficiency of LDPC decoders by reducing the power consumption of their embedded memories. This objective is motivated by two key observations. The first observation is that the embedded SRAM frequently employed in digital IC design is compiled by CAD tools and occupies a significant portion of the chip area of the overall LDPC decoder. These compiled embedded memories are often designed with extra operating margins to accommodate a wide variety of applications [8]. The second observation is based on the time-varying nature of wireless communication channels and the fact that FEC subsystems in wireless receivers are designed for worst-case channel SNRs, whereas under average channel SNRs the BER performance specification of the FEC subsystem is easily achieved.

In light of these observations, the principal methodology explored in this thesis is an adaptive voltage scaling strategy applied to the embedded memories of an LDPC decoder when the channel SNR is good. The downside of the power reduction delivered through voltage scaling is the addition of memory errors, which can be compensated by the error correction capability of the LDPC decoder. The net result is a more energy-efficient LDPC decoder with no performance degradation. This thesis presents the design and implementation details of the adaptive voltage scaling algorithm as an add-on module to an existing LDPC decoder. This way, the benefit of voltage scaling can be realized without major alterations to existing LDPC decoder designs.

The research of this thesis is divided into two parts. Firstly, the error behaviours of embedded memories under voltage scaling are carefully studied and quantitatively characterised. Secondly, under voltage scaling, an LDPC decoder must compensate for additional memory errors. This implies a higher number of decoding iterations and consequently a longer decoding time. Although voltage scaling provides an instantaneous power reduction, a prolonged decoding time consumes more energy. This trade-off must be carefully analysed so that the ultimate objective of the adaptive voltage scaling algorithm, to improve the overall energy efficiency of the LDPC decoder, is achieved.
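The trade-off between instantaneous power and decoding time can be made concrete with a first-order sketch. The model below is an illustrative assumption, not the energy model developed in this thesis: only the memory supply is scaled, memory power is taken to be proportional to the square of the supply voltage at a fixed clock rate, the remaining logic draws constant power, and decoding time is proportional to the number of iterations. The parameter `memory_power_fraction` and all numerical values are hypothetical.

```python
def relative_decode_energy(v_scaled, v_nominal, iters_scaled, iters_nominal,
                           memory_power_fraction):
    """Energy per decoded frame relative to nominal-voltage operation.

    Assumptions of this sketch (not from the thesis): memory power scales with
    Vdd**2, non-memory power is unchanged, and time scales with iteration count.
    """
    power_ratio = ((1.0 - memory_power_fraction)
                   + memory_power_fraction * (v_scaled / v_nominal) ** 2)
    time_ratio = iters_scaled / iters_nominal
    return power_ratio * time_ratio


# Hypothetical numbers: memory draws half the power, supply scaled from 1.0 V to 0.8 V.
one_extra_iteration = relative_decode_energy(0.8, 1.0, 6, 5, 0.5)    # below 1: net saving
two_extra_iterations = relative_decode_energy(0.8, 1.0, 7, 5, 0.5)   # above 1: net loss
```

Under these illustrative numbers one additional iteration still leaves a net energy saving, whereas two additional iterations cost more energy than running at the nominal voltage, which is the kind of break-even point the analysis must locate.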

Chapter 3

Embedded Memory Error and Characterization

In Chapter 2, embedded memories were highlighted as important building blocks in common LDPC decoder implementations. In order to fully exploit the energy saving potential of embedded memories, their functional characteristics under voltage scaling must be carefully studied. The overall objective of this chapter is to observe and categorize different types of memory failures under dynamic voltage scaling. These failure events can be further modeled as additional codeword errors coming from an independent source. The ultimate goal is to quantify these memory failures as probabilistic events so that a system-level simulation of the LDPC decoder can be performed. The quantitative study of memory errors in the context of voltage scaling reveals the trade-off between storage reliability and power saving.

The chapter is organized as follows. Basic embedded memory concepts are presented first, followed by a discussion of SRAM bit-cell designs. SRAM failure mechanisms are highlighted before finally presenting experimental data that correlates bit-cell failure probability with supply voltage scaling. The data presented in this chapter serves as an important foundation for the energy optimization framework presented in Chapter 5.

3.1 Embedded Memory Basics

There are various types of embedded memory technologies suitable for different applications. A few of the most commonly known types of embedded memories are embedded ROM, embedded EEPROM, embedded SRAM, and embedded DRAM. This work focuses on the most commonly used embedded memory, eSRAM (Embedded Static Random Access Memory). Typical applications for embedded SRAMs are high-speed on-chip FIFO buffers, register files, and caches [28].

Aside from its high read and write speeds, one significant merit that makes embedded SRAM favourable over all others is its compatibility with the standard CMOS logic process. In addition, eSRAM is a static memory, meaning the stored information is always preserved while the device is powered. This makes integrating eSRAM into existing digital designs practical without requiring any additional refresh circuitry or complex read/write procedures. Further design convenience can be achieved through the use of EDA tools such as a memory compiler, which generates SRAM designs automatically to meet specified performance requirements. Using memory compilers, rather than fully-custom design, allows designers to focus on algorithm-level implementations and improves design portability across CMOS technology scaling.

In this work, an embedded memory employed in an LDPC decoder is a particular case of low-power high-performance embedded SRAM for baseband signal processing applications. A typical capacity for this kind of embedded memory falls in the Mbit range for many common wireless standards [29]. The memory used in an LDPC decoder interacts directly with computational kernels and must support several iterations (read and write cycles) at the frame rate or packet rate of the system. This places a strict demand on the energy efficiency of the memory, which in turn has a significant impact on the overall energy efficiency of the system.

Previous works such as [30], [31], and [32] have focused on reducing eSRAM's static leakage power consumption during the standby state. However, the subjects of study of these works are embedded caches in general purpose processors. The typical access patterns of these cache memories are often defined by spatial and temporal locality. The LDPC embedded memory, on the other hand, is accessed frequently and with a highly deterministic address pattern. Such access behaviour requires the energy optimization scheme to target both dynamic and static power consumption in order to be effective.

3.2 Embedded SRAM Bit-Cell Design

The design and topology of an SRAM bit-cell have a large influence on its functional behaviour under voltage scaling. In general, discussions presented in this thesis are based on the typical 6 transistor (6T) implementation of the SRAM shown in Fig. 3.1. The design consists of two equally sized cross-coupled inverters and two access transistors. The positive feedback of the cross-coupled inverters reinforces stored values, while the gate access transistors allow the states of the inverter pair to be altered during read and write operations.

[Figure 3.1 labels: transistors N0, P0, N1, P1; word line WL; bit lines BL0, BL1; storage nodes L = ‘0’ (voltage VL) and R = ‘1’ (voltage VR)]

Figure 3.1: The 6-Transistor Embedded SRAM Bit-Cell Circuit [33]

During the hold state, the word line WL is held low and the cross-coupled inverter pair creates a bi-stable circuit to maintain the stored value in its complementary form. During read operations, bit lines BL0 and BL1, with their lumped parasitic capacitances, are first charged to a predefined pre-charging voltage (1/2 Vdd), then word line WL is pulled high to allow a differential voltage to be developed between the two complementary bit lines. Finally, this voltage is amplified by a sense amplifier to develop the read out value to its full range. During write operations, bit lines BL0 and BL1 are first driven to their desired complementary write-in values before the word line WL is driven high to allow the values to be stored at the inverter pair.

One of the most critical design criteria for an SRAM is the storage stability of its bit-cells. The stability of a memory bit-cell is evaluated by assessing its immunity against noise. The classical bit-cell stability criterion is defined by two static noise margins, the static hold margin and the static read margin. Static noise margins (SNM) are defined and measured in volts, indicating the amplitude of DC noise a pair of cross-coupled inverters can resist before their stored states are inverted. Static noise margins for a bit-cell can be easily extracted from a butterfly diagram. Fig. 3.2 [LEFT] shows an example of a butterfly diagram for the 6T bit-cell shown in Fig. 3.1. Voltage transfer characteristics (VTC) of the left (L) and the right (R) storage node, VR over VL (VTC1) and VL over VR (VTC2), are superimposed in the same plot. The eye opening of the butterfly diagram (marked SNM) in Fig. 3.2 [LEFT] defines a static noise margin. For a bit-cell inverter pair, static noise margins are typically only defined for read and hold operations. During write operations, the stored logical states of the cross-coupled inverter pair are actively driven by the bit lines, so it is unnecessary to define a static noise margin for write operations. As shown in Fig. 3.2 [RIGHT], it is usually the case that the static noise margin during a read operation is much smaller than the static noise margin during a hold operation [34].

Figure 3.2: [LEFT] Static Noise Margin (SNM) Extracted from Butterfly Diagram - [RIGHT] Comparison of SNM during Read and Hold [34]

The read noise margin is smaller than the hold noise margin because the pull down transistor (N0 in Fig. 3.1) is required to discharge the current coming from the bit line (BL0 in Fig. 3.1). This temporary influx of current raises the nodal voltage on the logical low side (VL in Fig. 3.1) and thus brings it closer to the tipping point of the inverter. This increases the probability of a bit-cell flipping event and makes the read operation more prone to noise interference. Therefore, following conservative practice, the read noise margin is often assessed as the primary design criterion for bit-cell stability.

The advantage of static noise margin analysis is that it offers a convenient design guideline that can be derived directly from the DC voltage transfer characteristic of an inverter pair. The downside, however, arises from the assumption that the noise under consideration is static, which is rarely the case in reality. Therefore, it has been suggested that the static noise margin design criterion can be overly conservative since it ignores the temporal behaviour of the noise signal [34]. The method proposed in [35] suggests a more modern dynamic criterion to assess bit-cell stability. The method employs non-linear circuit theory to quantify the pulse shape (specific amplitude and duration) of a noise signal required to flip the stored state of a bit-cell and cause a bit-flipping failure. The dynamic stability criterion aims to provide a more accurate estimate of the bit-flipping condition, especially in a temporally non-stationary noise environment.


3.3 SRAM Storage Failure Mechanisms

Data stored in an SRAM can be corrupted by many failure mechanisms. In general, errors induced by these failure mechanisms are categorized as hard errors, soft errors, supply noise induced errors, and parametric failure induced errors [36]. Hard errors are caused by manufacturing defects. These errors are permanent, affect the yield of the SRAM chip, and can be partially compensated by employing redundancy. Soft errors are caused by background radiation, typically cosmic radiation or alpha particles from die packaging. Therefore, extra consideration must be taken for specific SRAMs designed for demanding applications. As mentioned previously, supply noise can significantly impact bit-cell stability and cause bit flipping events; therefore, either strict power supply ripple control or extra design margins are usually used to guard the operation of the SRAM against supply noise. This work focuses on parametric failure induced memory errors. These errors are caused by manufacturing variations within the same silicon die and are sensitive to supply voltage scaling. These variations, though not considered defects, may cause the actual operating behaviour of the SRAM to deviate from its original design.

3.4 Parametric Failure Induced Memory Error

For a typical 6T SRAM topology, mismatches between adjacent transistors can have a significant impact on the performance of bit-cells, especially when operated under non-ideal conditions such as voltage scaling [37]. These process variation induced bit-cell failures are categorized as parametric failures and have become a major challenge for sub-micrometer SRAM design [37]. This section presents the different types of parametric failure induced memory errors and quantitatively correlates these errors with voltage scaling. Five different kinds of parametric failures are discussed in the following subsections.

3.4.1 Hold Failures

Hold failures refer to the data retention failures that occur during standby storage mode [37]. This is especially relevant when supply voltage scaling is employed during hold in order to minimize static power consumption. The lowering of the supply voltage weakens the feedback between the cross-coupled inverter pair and hence makes bit-cells more susceptible to external disturbances such as noise. When a bit-cell's stored content is inverted during a hold state, the read out value is interpreted as a bit flipping event. Read out errors caused by this type of event are defined as hold failure induced memory errors.


3.4.2 Read Failures

Read failures, also referred to as destructive reads, happen during read operations. When a read failure occurs, the original stored content of a bit-cell is altered while a read operation is in progress. Consequently, a bit flipping error appears at the output. During a read operation, the stored nodal states of a bit-cell are required to alter its bit lines (BL0 and BL1 in Fig. 3.1), which are pre-charged to 1/2 Vdd. On the side of the bit-cell inverter pair that stores a logical ‘0’ (P0 and N0 in Fig. 3.1), the pull down transistor (N0 in Fig. 3.1) “sinks” current coming from the bit line (BL0 in Fig. 3.1), and thus lowers the bit line voltage. On the opposite side, the pull up transistor (P1 in Fig. 3.1) “charges” the bit line (BL1 in Fig. 3.1), and thus raises the bit line voltage. A destructive read often happens when the pull down transistor (N0 in Fig. 3.1) is weak. The temporary influx of current raises the voltage of the logical ‘0’ node (VL in Fig. 3.1). If the nodal voltage surpasses the tipping point of the inverter, the originally stored data is flipped, resulting in the corruption of the stored bit.

3.4.3 Write Failures

Write failures happen when the desired write-in data is unable to override the existing stored data of bit-cells during write operations. When a write failure happens, a bit-cell retains its previous content. During a write operation, assuming the write-in value is different from the existing stored value, the bit line that is driven to logical ‘0’ attempts to pull down the storage node (R in Fig. 3.1) from logical ‘1’ to logical ‘0’. Depending on the size of the pull up transistor (P1 in Fig. 3.1) and the lumped parasitic capacitance at the storage node (R in Fig. 3.1), if the nodal voltage fails to drop to the tipping point of the inverter, the bit-cell fails to register its new state, which constitutes a write failure. Parametric write failures are usually caused by strong pull up transistors.

3.4.4 Access Time Failures

Besides the above-mentioned failure mechanisms, which result in bit-cell storage corruption, there are two additional timing induced failures: access time failures, which occur during reads, and write time failures, which occur during writes. During a bit-cell read access, a maximum tolerable access period $T_{\text{access max}}$ is specified, where the access time is the time required for a pair of complementary bit lines to develop the minimum voltage difference required by the output sense amplifier. A weakened access transistor (with high on resistance) can cause an increase in bit-cell access time. An access time failure occurs when the access time exceeds $T_{\text{access max}}$. Although an access time failure also occurs during a read, its error mechanism differs from that of a destructive read in that an access time failure does not alter the stored content of the bit-cell. In essence, a read upset is often caused by the weakening of a pull down transistor, while an access time failure is caused by the weakening of an access transistor. By the same logic, weakened access transistors can also cause write time failures: when the voltages present on the bit lines do not have enough time to invert the existing state of a bit-cell, a write time failure occurs.

3.5 Effect of Voltage Scaling on Parametric Failures

This section presents a survey of the effect of voltage scaling on parametric failure induced memory errors. Numerous publications have shown relationships between memory failures and supply voltage drop. However, because many of these results are based on transistor-level simulations, and because they rely on a considerable number of device physics assumptions specific to different technology nodes, it is very difficult to establish a universal model. Nevertheless, one published result based on actual measured memory error behaviour stands out and offers the most realistic view of the relationship between memory errors and supply voltages.

Wilkerson, Gao et al. [41], in their research collaboration with the Microprocessor Technology Lab at Intel, offer silicon-based measurements of embedded memory cell failure probability. The published bit failure probability is based on actual measurements of the embedded cache memory of an Intel Core 2 Duo microprocessor implemented in 65nm technology [41]. The paper assumes the random nature (spatial independence) of bit-cell failures and correlates the magnitude of failure probability with different sizes of memories. Fig. 3.3 shows the probability of bit-cell failure against the drop of supply voltage. The curve that is particularly interesting for this thesis is the 2Mbit cache curve, which is the closest match to the eSRAM required to realize the LDPC decoders that are the subject of this thesis.
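One way to see how spatial independence ties the per-bit failure probability to the size of a memory is the following minimal sketch. It computes the probability that at least one of `n_bits` independent cells fails, $1 - (1 - p)^{n}$. This relation follows from the independence assumption alone; whether [41] uses exactly this expression is not stated here, and the function name is an assumption made for illustration.

```python
import math


def array_failure_probability(p_bit, n_bits):
    """Probability that at least one of n_bits independent cells fails: 1 - (1 - p)^n."""
    if p_bit >= 1.0:
        return 1.0
    # log1p / expm1 keep precision when p_bit is tiny and n_bits is large.
    return -math.expm1(n_bits * math.log1p(-p_bit))


# The same per-bit probability is far more visible in a larger array.
small_array = array_failure_probability(1e-6, 1_000)
large_array = array_failure_probability(1e-6, 2_000_000)
```

For a fixed per-bit failure probability, a larger array is much more likely to contain at least one failing cell, which is why the curves for larger structures sit above the single-bit curve at the same supply voltage.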

3.6 Discussion

[Figure 3.3 axes: Probability of Bit Failure vs. Vdd [V]]

Figure 3.3: Probability of Bit-cell Failure vs. Vdd for Various Cache Structures [41]

This chapter gives a brief overview of the classical 6 transistor embedded SRAM design and its principles of operation. Storage failure mechanisms of a typical 6 transistor embedded SRAM are categorized and each discussed in detail. The parametric failure is highlighted as the primary failure mechanism of interest due to its sensitivity to supply voltage scaling. Both literature surveys from academia and experimental results from industry are presented to show the random nature of parametric memory failures and to characterize the probability of these memory errors under voltage scaling as BER (bit error rate) curves. The memory BER with respect to voltage scaling is especially useful for this thesis because it shows the number of memory errors the system should anticipate under reduced operating voltage. The memory BER behaviour must be accurately modeled and well understood by the adaptive voltage scaling algorithm because it quantifies the physical limitation of the hardware and serves as a guideline for the aggressiveness of the voltage scaling strategy.

Through our literature surveys, it has become evident that the memory failure probability is highly dependent on the design of the memory bit-cells and the manufacturing technology of the embedded memory [34]. To the best of our knowledge, [41] is the only publication which presents memory error probabilities based on measurements performed on fabricated CMOS chips, and it should therefore offer more accurate quantitative assessments than the simulation-based results in [37] and [33]. For this reason, the adaptive voltage scaling algorithm described in later chapters of this thesis is developed based on the BER data provided by Wilkerson, Gao et al. in [41].

That said, it is important to point out that the main focus of this thesis is the adaptive voltage scaling algorithm itself and the corresponding predicted power savings. Although the memory BER is an important parameter of the algorithm, the method used to obtain it, whether by direct measurement or by simulation, is completely decoupled from the adaptive voltage scaling algorithm. From a design point of view, the memory failure probability as a function of supply voltage can simply be treated as an input parameter vector. For practical applications, designers are free to use any method necessary to obtain an accurate characterization of the embedded memory errors so that the adaptive voltage scaling algorithm can operate as effectively as possible.

Chapter 4

System Level LDPC Simulation with Injected Memory Errors

In Chapter 2, a memory-based LDPC decoder architecture was presented. In Chapter 3, the different types of embedded SRAM errors caused by dynamic voltage scaling were discussed in detail. In this chapter these two ideas are combined. The proposed memory-based LDPC decoder scheme must be able to function correctly under dynamic voltage scaling. To guarantee this, the impact of embedded memory errors on the decoding performance of the LDPC decoder must be measured. In this chapter, system-level simulations of the decoding performance of the LDPC decoder under injected memory errors are conducted and evaluated.

4.1 Simulation Setup

In this work, a generic LDPC decoder model implemented in C++ is used as the main experimental test bench. The LDPC decoder targets the IEEE802.11ad WiFi standard and uses the min-sum algorithm (MSA) with a layered decoding architecture. The decoder supports multiple code rates with a codeword length of 672 bits, and the quasi-cyclic parity-check matrix has a parallel degree of 42 (the number of submatrices into which the main parity-check matrix is divided). The total embedded memory used in such an LDPC decoder is estimated to be roughly 200Kbits. Numerous simulations of this decoder are used to quantify its performance degradation under injected memory errors. All simulations assume that the input to the LDPC decoder is a received codeword, modulated using BPSK, from an AWGN communication channel, and that both the codeword length and the modulation scheme are fixed. The tunable parameters of the test bench are the input channel SNR and the number of decoding iterations of the LDPC decoder. The iteration number is considered because of the common practice of early termination, where the iteration cycle of an LDPC decoder is terminated as soon as all parity checks are met. The overall performance of the LDPC decoder, in terms of its output bit error rate (BER), is benchmarked across simulation scenarios in which memory errors are either deliberately injected or omitted.
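The channel side of such a test bench can be illustrated with a minimal Python sketch; this is not the thesis's C++ model. The bit-to-symbol mapping (bit 0 to +1, bit 1 to -1), the reading of the channel SNR as Es/N0 with unit symbol energy, and the function name `bpsk_awgn_llrs` are all assumptions made for illustration. The frame below is random data rather than a valid LDPC codeword, and no LLR quantization is modeled.

```python
import math
import random

def bpsk_awgn_llrs(bits, snr_db, rng):
    """Map bits to BPSK symbols, add white Gaussian noise, return channel LLRs.

    Assumptions (not taken from the thesis): bit 0 -> +1, bit 1 -> -1,
    unit symbol energy, and snr_db interpreted as Es/N0 so that the
    noise variance per real dimension is N0/2 = 1 / (2 * snr_linear).
    """
    snr_linear = 10.0 ** (snr_db / 10.0)
    sigma = math.sqrt(1.0 / (2.0 * snr_linear))    # noise standard deviation
    llrs = []
    for b in bits:
        symbol = 1.0 - 2.0 * b
        received = symbol + rng.gauss(0.0, sigma)
        llrs.append(2.0 * received / sigma ** 2)   # log(P(bit=0|y) / P(bit=1|y))
    return llrs

rng = random.Random(1)
frame = [rng.randint(0, 1) for _ in range(672)]    # one 672-bit frame of random data
llrs = bpsk_awgn_llrs(frame, snr_db=3.0, rng=rng)
hard = [0 if value >= 0 else 1 for value in llrs]
print(sum(h != b for h, b in zip(hard, frame)), "raw bit errors before decoding")
```

A positive LLR favours bit 0 under this mapping, so the hard decision above counts the raw channel errors that the decoder would then have to correct.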

4.2 Injected Memory Errors

This section describes how memory errors are injected into the test bench. Since Chapter 3 shows that embedded memory errors due to voltage scaling can be modeled as random bit-flipping events, memory errors in the LDPC simulation are modeled by applying random bit inversions to LLR values every time they are updated. The quantity of simulated memory errors is governed by a bit flipping probability defined for each bit of an LLR value. In the C++ LDPC simulation, each time an LLR variable is updated, an error mask is applied to the LLR value via a bit-wise XOR (exclusive or) operation. The error mask is generated on the fly each time and has exactly the same binary length as the LLR variable being masked. A binary ‘1’ at a given bit location of the error mask results in a bit inversion, while a binary ‘0’ leaves the original bit intact. This allows precise control of the bit flipping events within each LLR variable. The error mask is generated bit by bit: a uniformly distributed random number generator draws an integer from a fixed integer range, and the mask bit is set to ‘1’ (bit inversion) if the integer falls within a particular sub-range and to ‘0’ (bit preservation) otherwise. The ratio between the sub-range and the full range is calculated from the bit flipping probability defined for that bit location of the LLR value. On average, therefore, the bit flipping events introduced to LLR variables occur at the predefined error probability and are independent of each other.
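The mask generation described above can be sketched as follows. This is an illustrative Python version, not the thesis's C++ code; the function names, the range size of 1,000,000, and the 6-bit LLR width in the usage example are hypothetical choices.

```python
import random

def make_error_mask(p_flip_per_bit, rng, resolution=1_000_000):
    """Build an error mask with one entry of p_flip_per_bit per LLR bit.

    p_flip_per_bit[0] is the flip probability of the LSB and the last
    entry that of the MSB. For each bit, an integer is drawn uniformly
    from [0, resolution) and the mask bit is set when the draw falls in
    the sub-range [0, p_flip * resolution).
    """
    mask = 0
    for position, p_flip in enumerate(p_flip_per_bit):
        threshold = round(p_flip * resolution)
        if rng.randrange(resolution) < threshold:
            mask |= 1 << position
    return mask

def update_llr(stored_word, p_flip_per_bit, rng):
    """Model one LLR update: XOR the stored word with a fresh error mask."""
    return stored_word ^ make_error_mask(p_flip_per_bit, rng)

# Usage: measure the observed flip rate for a hypothetical 6-bit LLR word.
rng = random.Random(7)
width, p_err, updates = 6, 1e-2, 20_000
flips = sum(bin(update_llr(0, [p_err] * width, rng)).count("1")
            for _ in range(updates))
print("observed bit flip rate:", flips / (updates * width))
```

Because the probabilities are passed per bit, setting the last entry to zero leaves the MSB untouched while the remaining bits are still exposed to errors.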

Because bit flipping errors are introduced at every LLR update during simulation, they physically represent memory errors that occur at different memory locations. For example, bit flipping errors can occur during initialization of an LLR variable, in which case they represent memory errors physically located in the intrinsic LLR memory. During LDPC iterations, bit flipping events represent memory errors physically located in the edge memory, where messages are passed between check nodes and variable nodes. At the end of each LDPC iteration, partially or fully converged LLR results are stored back to the initial LLR variable. Bit flipping events that occur at this time again represent memory errors physically located in the intrinsic LLR memory.

Bit flipping errors applied at every LLR variable update also represent different types of physical memory errors (i.e. read errors, write errors, and hold errors). From a system-level simulation point of view, applying bit inversion masks to a specific runtime variable in a C++ simulation does not distinguish between different memory error mechanisms. Physically generated memory errors (such as the read errors, write errors, and hold errors discussed in Chapter 3) all produce the same net effect on the LLR storage variable in the simulation code, i.e. errors appearing during the updates of LLR values. The inability to isolate different types of memory errors in the simulation is not necessarily an inconvenience. After all, the principal goal of the LDPC simulation in this work is to study the performance penalty and quantitative tolerance of a memory-based LDPC decoder to errors that occur internally within its storage elements.

An added benefit of simulating bit flipping events with a bit inversion mask is the ability to apply different error probabilities to different bit locations of an LLR value. More specifically, the bit flipping probability for the most significant bit (MSB) position of an LLR value can be set differently from that of the remaining LLR bits. The motivation for isolating the MSB is simple: depending on the binary encoding scheme of the LLR stored in memory, the MSB carries more weight in the stored value than any other bit. For example, if the LLR is encoded in 2’s complement form, inverting the MSB inverts the sign of the LLR and also alters its magnitude, because 2’s complement numbers wrap around. A small positive LLR value whose MSB is inverted therefore becomes a large negative number. Likewise, a received bit with a high LLR magnitude changes to its complement value with a low LLR magnitude. For a 2’s complement encoded LLR value, such a double inversion always shifts the LLR value by half of the numerical range. In contrast, a bit-flipping event on the MSB of a sign-magnitude encoded LLR only inverts the sign of the stored value. In extreme cases, this could result in a positive encoded bit being changed to a negative bit, thus demanding more correcting effort from the LDPC decoder. With this as background, two simulation scenarios are studied in this thesis. In the first scenario, bit flipping errors are applied identically to all bit locations of LLR values. In the second scenario, bit flipping errors are applied to all but the MSB locations of LLR values (simulating the MSB position being protected). The decoding performance of the LDPC decoder in these two scenarios is compared to help understand the significance of the MSB position in an LLR value.
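The effect of an MSB inversion under the two encodings can be checked with a short Python sketch. The 5-bit word width and the helper names are hypothetical; the thesis does not state the LLR width in this section.

```python
def twos_complement_value(word, width):
    """Interpret a width-bit word as a 2's complement integer."""
    return word - (1 << width) if word & (1 << (width - 1)) else word

def sign_magnitude_value(word, width):
    """Interpret a width-bit word as sign (MSB) plus magnitude (other bits)."""
    magnitude = word & ((1 << (width - 1)) - 1)
    return -magnitude if word & (1 << (width - 1)) else magnitude

WIDTH = 5                      # hypothetical LLR word width
MSB = 1 << (WIDTH - 1)

for word in (0b00001, 0b01111):
    flipped = word ^ MSB       # invert only the MSB
    print(f"{word:05b} -> {flipped:05b}:",
          "2's complement", twos_complement_value(word, WIDTH),
          "->", twos_complement_value(flipped, WIDTH),
          "| sign-magnitude", sign_magnitude_value(word, WIDTH),
          "->", sign_magnitude_value(flipped, WIDTH))
```

In 2's complement, +1 becomes -15 and +15 becomes -1, a shift of 16 (half of the 32-value range) in both cases. In sign-magnitude, the same flips give -1 and -15, so only the sign changes.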


4.3 Performance Benchmarking of the Native IEEE802.11ad LDPC Decoder

Figure 4.1: Reference LDPC Decoder Performance Simulation (no memory errors)
[ BER Plot with Multiple Iteration Numbers | Code Rate = 1/2 | Zero Injected Memory Error | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate | Curves: CR = 1/2 with iter num = 5, 10, 15, 20 ]

Prior to introducing memory errors to the system, it is important to benchmark the native performance of the LDPC decoder, so that any additional performance degradation due to memory errors becomes evident by comparison. The decoding performance of the LDPC decoder can be evaluated by plotting the output BER against the channel SNR. In the simulations illustrated by Fig. 4.1, the iteration number is varied to show its impact on the decoding performance. As illustrated in Fig. 4.1, for a constant code rate, a larger number of iterations results in better decoding performance (lower BER) for a given channel SNR level (shown as a vertical dashed line on the plot). The intercepts of the BER curves with the vertical SNR line quantify the coding gain obtained through iterations in terms of decoding BER values.

This simulation is conducted to illustrate the decoding performance gain due purely to a higher number of iterations. The important observation is that decoding performance can be improved by increasing the iteration number, which highlights the trade-off between channel SNR and iteration number. For a desired output BER performance requirement, illustrated as a horizontal line in Fig. 4.5, the intercepts of this horizontal line with the BER curves show that a higher number of iterations is required to compensate for a deterioration in channel SNR.

4.4 Benchmarking LDPC Decoder with Additive Memory Errors

With the addition of memory errors, the decoding performance of the LDPC decoder is expected to be inferior to that of a native LDPC decoder subject to channel errors alone. Since the memory error mechanism is uncorrelated with that of the channel error, the memory error can be interpreted as an independent source of error. The net effect of such an additional error source is studied in detail in this section.

In Fig. 4.2, simulations are conducted with a memory bit-flipping probability of 10^-2 (1 in 100). The setup demonstrates the impact of both code rate and memory errors on the decoding performance. Output BER curves of simulations with the higher code rate are plotted using solid lines and those with the lower code rate are plotted as dashed lines. A reference BER curve with no injected memory errors is also plotted in black. One observation from Fig. 4.2 is that, for a constant iteration number (iteration number = 5 in Fig. 4.2), simulations with a code rate of 1/2 offer better decoding performance than those with a code rate of 3/4. This difference is expected, since at code rate 1/2 a larger number of parity bits is employed to protect the codeword. More to the point, the performance difference due to code rate is observed consistently regardless of whether memory errors are added to the simulations.

From Fig. 4.2, another important observation can be made about the impact of memory errors at the MSB location of LLR registers. As illustrated by the deviation between the blue and red curves, a significant performance degradation is observed when memory errors are injected into all LLR bits. On the other hand, a negligible difference in decoding performance is observed when errors are injected into all but the MSB of the LLR (simulating the MSB location being protected against error injection). In fact, the injected-error simulation curves with protected MSB locations appear to coincide with the reference BER curves despite the relatively high bit-flipping probability of P_err = 10^-2, which is a true testament to the robustness of the LDPC decoding algorithm. It suggests that a negligible performance penalty is incurred when memory errors affect only the LSBs of stored LLR values, and that protecting the MSB position is enough to guard against a relatively large number of random memory errors.

Figure 4.2: LDPC Decoder Performance Comparisons with Added Memory Bit Flipping Errors (Iteration Number = 5).
[ Black: Reference simulation with no memory error. | Red: Memory errors present at all bit locations of LLR registers with P_err = 10^-2 (1 in 100). | Blue: Memory errors present at all but MSB bit locations of LLR registers with P_err = 10^-2 (1 in 100). | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate ]

Figure 4.3: LDPC Decoder Performance Comparisons with Added Memory Bit Flipping Errors (Iteration Number = 15).
[ Black: Reference simulations with no memory error. | Red: Memory errors present at all bit locations of LLR registers with P_err = 10^-2 (1 in 100). | Blue: Memory errors present at all but MSB bit locations of LLR registers with P_err = 10^-2 (1 in 100). | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate | Curves: Ref. and Err. for CR = 1/2 and CR = 3/4, iter num = 15 ]

To further validate these observations, a set of simulations similar to those shown in Fig. 4.2 is conducted with a much higher LDPC iteration number (iteration number = 15), shown in Fig. 4.3. A quick comparison between Fig. 4.2 and Fig. 4.3 leads to nearly identical conclusions. The BER performance gaps between simulations with 1/2 and 3/4 code rates are consistently maintained. A negligible performance degradation is observed between the reference BER curves and the output BER curves of the simulations with memory errors injected only into the LSBs of the LLR values. In addition to these similarities, an error floor phenomenon is observed in Fig. 4.3 where memory errors are uniformly injected into all LLR bits (shown as a solid red curve). The BER curve for the 1/2 code rate simulation begins to flatten once the input channel SNR exceeds 2dB. The flattening of the output BER slope marks the limit of the corrective ability of the LDPC decoder. As the input channel SNR increases above a certain point, the errors in each codeword coming from the channel diminish to such a degree that memory errors become the dominant source of error in the codeword. In this particular simulation setup, the dominance of memory errors begins when the input channel SNR rises above 2dB. Since the error events in memory are internal and the number of memory error events per iteration stays constant, the output BER is severely limited despite the continuing improvement in channel conditions.

From Fig. 4.2 and Fig. 4.3, one important observation that must be highlighted is that introducing memory errors to only the LSBs of the LLR values does not seem to impact the output performance of the LDPC decoder. This observation has been confirmed by simulations conducted under different code rates and with both high and low iteration counts. However, it is only valid for the input SNR range covered by the simulation. The impact of memory errors introduced to the LSBs of the LLR values may become apparent at higher input channel SNR values. High SNR values imply lower BERs, which in turn require significant simulation time. Due to limits on simulation time, for the remainder of this thesis we assume that memory errors are uniformly distributed among all LLR bits in the memory.

4.5 Trade-off Between Iteration Time and Supply Voltage

Having demonstrated the impact of memory errors on the decoding performance of the LDPC decoder, it is imperative to find a way to compensate for the deterioration in decoding performance. One strategy is to use an increased number of iterations. The coding gain obtained with higher iteration numbers is shown to be effective in improving decoding performance in the simulations of Fig. 4.1, Fig. 4.2, and Fig. 4.3.

Figure 4.4 compares a reference performance curve with a low iteration number of 5 against several simulation runs with an LLR bit flipping error probability of 10^-2 and various iteration numbers. As observed from the figure, the error curves with iteration numbers higher than 7 match and surpass the decoding performance of the reference curve within certain input SNR regions. This means that injected memory errors (here simulating memory errors induced by dynamic voltage scaling) are compensated by extra LDPC iterations, and an equivalent level of decoding performance is achieved within a particular range of input SNR. The cost of these added LDPC iteration cycles is a linear increase in decoding energy due to the prolonged iteration time. However, since the dynamic instantaneous power consumption scales down quadratically with supply voltage, the net saving is worth considering. A detailed discussion of the trade-off between power saving and extended iteration number is presented with the optimization algorithm in Chapter 5.

Figure 4.4: Demonstration of the Corrective Strategy of Using Higher Iteration Numbers to Counteract the Performance Degradation Caused by Injected Memory Errors.
[ Blue: Reference simulations with no memory error. | Other Colours: Simulations with injected memory errors to all LLR bit locations with P_err = 10^-2. | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate ]
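The linear-versus-quadratic argument can be made concrete with a small Python sketch. It rests on a deliberately simplified model that is an assumption here, not the thesis's optimization function: energy per iteration is taken as proportional to Vdd squared, the clock frequency is unchanged, and leakage is ignored. The function names and the 0.8 voltage ratio are illustrative.

```python
import math

def relative_energy(iters_scaled, iters_nominal, vdd_scaled, vdd_nominal):
    """Decoding energy at the scaled voltage relative to nominal operation.

    Simplified model (an assumption): energy grows linearly with the
    iteration count and quadratically with the supply voltage.
    """
    return (iters_scaled / iters_nominal) * (vdd_scaled / vdd_nominal) ** 2

def break_even_vdd_ratio(iters_scaled, iters_nominal):
    """Largest Vdd_scaled / Vdd_nominal at which extra iterations cost no energy."""
    return math.sqrt(iters_nominal / iters_scaled)

# Going from 5 to 7 iterations pays off only below this voltage ratio.
print(round(break_even_vdd_ratio(7, 5), 3))          # 0.845
# At a hypothetical ratio of 0.8, 7 iterations cost about 90% of nominal energy.
print(round(relative_energy(7, 5, 0.8, 1.0), 3))     # 0.896
```

Under this model, compensating with 7 iterations instead of 5 saves energy only if the supply can be lowered to less than about 84.5% of its nominal value.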

4.6 Establishing a Trade-off Relationship Between SNR and Number of Iterations for a Particular LDPC Decoding Performance

To describe quantitatively the relationship between input channel SNR, LDPC iteration number, and the probability of memory errors simulated during LDPC decoding, a benchmarking method is devised. The method begins by defining an output BER decoding performance target which satisfies the system BER specifications. This BER target depends on the application of the LDPC decoder and can be set to the maximum acceptable bit error rate output from the LDPC decoder to the rest of the system. Processing is then performed on a set of LDPC decoding simulation curves of various iteration numbers, all simulated under identical memory error conditions (i.e. equal memory bit flipping probability). The process finds, on each of these simulation curves, the point of intersection with the output BER target line and records the corresponding value of input channel SNR. The benchmarking process is then repeated for sets of simulation curves under different probabilities of simulated memory errors. The collected results can then be analysed and plotted to establish a trade-off relationship between the channel SNR and the number of iterations, which serves as the guideline for operating an LDPC decoder to correct memory errors. A detailed illustration of the benchmarking methodology is presented below.

As an example of the benchmarking process, Fig. 4.5 shows a set of reference LDPC simulation curves with no injected memory errors (representing an LDPC decoder operating under a nominal supply voltage). A predefined output BER requirement of 10^-5 is drawn horizontally on the plot. This BER target value is selected for demonstration purposes only; in practice, it has to be determined according to the system specification (see footnote 1). The intercepts between the BER target line and the LDPC simulation curves are identified and marked with the symbol ‘x’. The projections of these points onto the x-axis (SNR values) represent the input channel conditions needed to achieve the BER target for each simulation curve. These SNR values, along with the iteration numbers of their corresponding simulation curves, are recorded in Table 4.1, column 2. In general, lower SNR values belong to simulation runs with higher iteration numbers: channels with lower SNR (worse channel conditions) require more iterations of error correction to meet the desired decoding performance. Overall, the input channel SNR, the iteration number, and the memory error probability form the three degrees of freedom of the trade-off relationship for an LDPC decoder under a particular output BER target.

Footnote 1: Part of the reason why 10^-5 is selected as the BER target in this example is that an error floor (shown in Fig. 4.3, Fig. 4.4, and Fig. 4.6, and discussed earlier) begins to appear at an output BER level of around 10^-5. The flattened error floor on the decoding BER curve generally marks the point beyond which the LDPC decoder can no longer correct the errors present in codewords. To prevent the benchmarking data from falling within the error floor region, the target BER is deliberately selected to be above the error floor in this thesis.

Figure 4.5: Benchmarking the trade-off relationship between iteration number and input channel SNR. (Reference simulation runs with no memory error.)
[ Simulation curves: A set of reference LDPC decoding simulations with constant code rate of 1/2 and various iteration numbers. (Simulations conducted with zero simulated memory errors.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate ]
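The intercept step of the benchmarking process can be sketched in Python as below. The thesis does not say how the intersection points are computed, so the log-linear interpolation between neighbouring simulated points is an assumption, as are the function name and the sample curves, which are made-up values and not simulation results.

```python
import math

def snr_at_target_ber(snr_db, ber, target_ber):
    """Return the SNR at which a BER curve first crosses target_ber.

    Interpolates linearly in log10(BER) between the two neighbouring
    simulated points (an assumed method). Returns None when the curve
    never reaches the target, for example because of an error floor.
    """
    log_target = math.log10(target_ber)
    points = list(zip(snr_db, ber))
    for (s0, b0), (s1, b1) in zip(points, points[1:]):
        # Skip segments with a zero BER sample: log10 is undefined there.
        if b0 >= target_ber >= b1 and b0 > b1 > 0:
            l0, l1 = math.log10(b0), math.log10(b1)
            return s0 + (s1 - s0) * (l0 - log_target) / (l0 - l1)
    return None

# Hypothetical curves, for illustration only.
snr = [0.0, 1.0, 2.0, 3.0]
waterfall = [1e-2, 1e-3, 1e-4, 1e-6]
floored = [1e-2, 1e-3, 5e-4, 5e-4]

print(snr_at_target_ber(snr, waterfall, 1e-5))   # crossing lies between 2 dB and 3 dB
print(snr_at_target_ber(snr, floored, 1e-5))     # None: the curve flattens above the target
```

Repeating this for every iteration number and every memory error probability yields one SNR entry per cell of an operating table such as Table 4.1.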

Figure 4.6: Benchmarking the trade-off relationship between iteration number and input channel SNR. (Simulation runs with injected memory errors P_err = 10^-2 (1 in 100))
[ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of 10^-2 (1 in 100). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) | x-axis: Channel SNR in [dB] | y-axis: Bit Error Rate ]

Another benchmarking example is shown in Fig. 4.6, where the process is applied to a set of curves simulated with an injected memory error probability of P_err = 10^-2. The result is recorded in Table 4.1, column 3. The benchmarking process is also performed for several other sets of simulations with different simulated memory error probabilities. Detailed simulation plots are presented in Appendix A and the benchmarking results are recorded in Table 4.1. To visualize the data, each column of SNR values from Table 4.1 is plotted against iteration number, and all resulting curves are shown in Fig. 4.7. For a given decoding BER target (10^-5 in this example), each curve in Fig. 4.7 graphically illustrates the trade-off relationship between iteration number and input channel SNR. Each curve is simulated under a specific memory bit flipping probability, which corresponds to the memory errors induced by a particular level of voltage scaling. As an observable trend, when the iteration number increases, the required input SNR decreases and asymptotically approaches a final SNR value. The most significant rate of change in SNR is observed between 5 and 10 iterations, with a diminishing rate of change beyond 10 iterations.

The choice of iteration numbers (Table 4.1, column 1) in the benchmarking process is tailored to reflect this observation: most iteration numbers (5, 6, 7, 8, 9, 10) are used to resolve the region between 5 and 10 iterations, while only two iteration numbers (15, 20) cover the rest of the range. The maximum iteration number of 20 is selected to terminate the decoding regardless of success, because a limit has to be set on the overall iteration period in order to meet the latency requirement imposed by the application standard. In both the IEEE802.11ad and DVB-S2 standards, the maximum iteration number is set to 20 [23]. The simulated memory error probability for each set of simulation runs is also carefully chosen. Bit error probabilities are initially spaced by a decade on a logarithmic scale: 10^-2, 10^-3, and 10^-4 (or 1 in 100, 1 in 1000, and 1 in 10000). However, as can be observed in Fig. 4.7, the operating curves produced by these error probabilities are not evenly spaced graphically. To fill the uneven ‘gap’ in the operating space plot, intermediate error probabilities of 2×10^-3 (1 in 500), 1.25×10^-3 (1 in 800), and 2×10^-4 (1 in 5000) are used to produce additional sets of simulation runs. The benchmarking results of these additional sets of simulations provide a better understanding of the dynamics of an LDPC decoder subjected to memory errors.

4.7 Discussion

Table 4.1: LDPC Operating Table [Constructed from a generic LDPC decoder designed for IEEE802.11ad standard and simulated under BPSK modulation scheme and code rate = 1/2.]
All entries are channel SNR values in dB.

Iteration #   P_err = 0   P_err = 10^-2   P_err = 2×10^-3   P_err = 1.25×10^-3   P_err = 10^-3   P_err = 2×10^-4   P_err = 10^-4
              (no err.)   (1 in 100)      (1 in 500)        (1 in 800)           (1 in 1000)     (1 in 5000)       (1 in 10000)
5             3.4         5.8             4.6               3.9                  3.8             3.7               3.5
6             2.6         4.9             3.6               3.14                 3.09            3.11              2.9
7             2           4.8             2.8               2.61                 2.54            2.6               2.12
8             1.6         4.7             2.8               1.9                  1.9             2.14              1.8
9             1.3         4.6             3                 1.81                 1.69            1.74              1.35
10            1.1         4.6             2.8               1.54                 1.43            1.3               1.15
15            0.8         4               2.7               1.2                  1.05            1                 0.92
20            0.7         4.2             2.8               1                    0.94            0.9               0.72

Figure 4.7: LDPC decoder trade-off relationship between channel SNR and iteration number for BER target = 10^-5 and Code Rate = 1/2.
[ Each individual curve: Plot of channel SNR vs. iteration number, channel SNR data obtained from Table 4.1. Each curve corresponds to a particular simulated memory error condition. | x-axis: Iteration Number | y-axis: Channel SNR in [dB] | Curves: ref. P_err = 0 (no err.), P_err = 10^-2 (1 in 100), 2×10^-3 (1 in 500), 1.25×10^-3 (1 in 800), 10^-3 (1 in 1000), 2×10^-4 (1 in 5000), 10^-4 (1 in 10000) ]

In summary, a benchmarking plot such as Fig. 4.7 can be viewed as the LDPC decoder operating space for a particular decoding BER target. Each curve within the plot corresponds to a particular simulated memory error level, which in turn corresponds to a particular operating voltage of the LDPC decoder. In practice, the plot can be used as a look-up reference to determine the operating voltage and its corresponding iteration number given an input channel SNR value. If a line is drawn across the plot at a given input channel SNR, the curves that intersect this SNR line are the “viable” operating voltage curves, and the intersection points mark the corresponding iteration numbers. The look-up information offered by plots such as Fig. 4.7 provides important insight that can be used to govern the operation of the LDPC decoder under various voltage scaling schemes. Operating parameters obtained from the plot can be used to conduct a power dissipation analysis which yields the optimal operating conditions for an LDPC decoder. Such an analysis is discussed in Chapter 5.
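The look-up use of the operating space can be sketched in Python with the SNR values transcribed from Table 4.1. The function names and the rule of picking the smallest tabulated iteration count whose SNR requirement is met are illustrative assumptions; the thesis's own look-up table is described in Chapter 5.

```python
ITERATIONS = [5, 6, 7, 8, 9, 10, 15, 20]

# Channel SNR in dB needed to reach BER = 1e-5, keyed by memory bit flipping
# probability P_err; one entry per iteration count (values from Table 4.1).
REQUIRED_SNR_DB = {
    0.0:     [3.4, 2.6, 2.0, 1.6, 1.3, 1.1, 0.8, 0.7],
    1e-2:    [5.8, 4.9, 4.8, 4.7, 4.6, 4.6, 4.0, 4.2],
    2e-3:    [4.6, 3.6, 2.8, 2.8, 3.0, 2.8, 2.7, 2.8],
    1.25e-3: [3.9, 3.14, 2.61, 1.9, 1.81, 1.54, 1.2, 1.0],
    1e-3:    [3.8, 3.09, 2.54, 1.9, 1.69, 1.43, 1.05, 0.94],
    2e-4:    [3.7, 3.11, 2.6, 2.14, 1.74, 1.3, 1.0, 0.9],
    1e-4:    [3.5, 2.9, 2.12, 1.8, 1.35, 1.15, 0.92, 0.72],
}

def min_iterations(channel_snr_db, p_err):
    """Smallest tabulated iteration count that meets the BER target, or None."""
    for iters, required in zip(ITERATIONS, REQUIRED_SNR_DB[p_err]):
        if channel_snr_db >= required:
            return iters
    return None

def viable_operating_points(channel_snr_db):
    """Map each viable P_err level to its minimum iteration count."""
    viable = {}
    for p_err in REQUIRED_SNR_DB:
        iters = min_iterations(channel_snr_db, p_err)
        if iters is not None:
            viable[p_err] = iters
    return viable

print(viable_operating_points(3.0))
```

At a channel SNR of 3.0 dB, for instance, the P_err = 10^-2 column is not viable at any tabulated iteration count, while the other error levels need 6 or 7 iterations.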

It is worth mentioning that all LDPC performance requirements used in the benchmarking process in this chapter are for demonstration purposes only. The intention of this work is to present the proposed benchmarking methodology for an LDPC decoder operating under various memory bit-error conditions. The simulation model and the methodology are therefore developed to be as generic and parameterized as possible so that they can be applied to any particular use case.

It is also worth mentioning that the injected memory error simulations treat all four types of parametric memory errors mentioned in Chapter 3 equally, because from the system perspective the LDPC decoder does not distinguish between them. From an academic point of view, however, it is worth investigating the effect of the different types of errors separately. Quantitatively, the numbers of read errors and write errors differ when the memory is operated under the same reduced Vdd. In addition, the simulation results presented in this chapter assume that all memory errors have the same outcome, namely a bit inversion event. This is a simplified view. In reality, write errors are more likely to result in a persistent stuck-bit event, which makes them resemble a hard failure.

Chapter 5

LDPC Dynamic Voltage Scaling Control System

In the previous chapter, a typical LDPC decoder was simulated under various memory error conditions. The injection of memory errors was used to emulate the behaviour of the embedded memories of the LDPC decoder operating under reduced voltages. The simulated decoding performance of the LDPC decoder was benchmarked, and the collected data was used to construct a relationship describing the possible combinations of operating parameters that meet a predefined decoding performance target. From the relationship plotted in Fig. 4.7, it is clear that the input channel SNR, the memory error probability (which corresponds to the operating voltage), and the LDPC decoding iteration number are the three key parameters that govern the operation of the LDPC decoder. Typically, the channel SNR is an input to the system and is not considered a design parameter. The two remaining parameters, namely the operating voltage (related to memory error probability) and the iteration number, form a trade-off relationship. Intuitively, operating an LDPC decoder under reduced voltage decreases the dynamic power consumption quadratically; however, the memory errors induced by voltage scaling mean that a larger number of iterations is needed to maintain the same level of decoding performance. This prolongs the iteration time and thus proportionally increases the decoding energy expenditure. To assess the advantage of dynamic voltage scaling fairly, the benefit of dynamic power reduction must be carefully weighed against the penalty of a higher iteration number. The ultimate design goal of the optimization algorithm is to increase the overall LDPC decoding energy efficiency.

In this chapter, an algorithm which performs the trade-off optimization is formulated, and a dynamic voltage scaling control system is proposed to select, from a set of discrete operating supply voltages, the one that yields the maximum energy efficiency. The control system is designed with hardware implementation in mind and serves as an add-on module which operates in real time in tandem with the existing LDPC decoder. The result given by such a module can be used to directly alter the supply voltage of the embedded LDPC decoder memory. In addition, the control system is designed to be tunable and adaptable, with all relevant CMOS circuit behaviour modeled by internal variables. This way, it can also be used as an evaluation tool to predict the potential energy savings for different hardware implementations. The overall dynamic voltage scaling module consists of three parts: an operating Vdd look-up table, an optimization function, and a decision block. An overview of the entire system and a detailed description of its individual components are presented in this chapter.

5.1 Testbench and System Overview

A system-level block diagram and its associated test bench are shown in Fig. 5.1. All major components of the voltage scaling control module, with their internal signals, are enclosed by the dashed rectangle. All MATLAB functions and simulation scripts for the test bench are listed in Appendix B.

Figure 5.1: Overview of the LDPC Dynamic Voltage Scaling Control System and Testbench [ Block diagram. Input: SNR (dB) per c.u. | LDPC Dynamic Voltage Scaling Module (dashed): Vdd Operating Table, Optimization Function, Decision Block. | Signals from the Vdd Operating Table to the Optimization Function, each {0/1}: Nominal Vdd, Medium Vdd, Low Vdd, Vdd_Nom iter# <= 10, Vdd_Med iter# <= 10, Vdd_Low iter# <= 10. | The Optimization Function outputs the Optimized Vdd; the Decision Block outputs the Operating Vdd. | Test bench: the LDPC System Behavioural Model takes the Operating Vdd and the current input SNR and passes the iteration numbers under nominal Vdd and under operating Vdd to the Energy Saving Evaluation Module, which outputs the % of energy saving per frame. ]


The input of the control module is the SNR of the currently received packet frame. This SNR value can be obtained from the SSI (Signal Strength Indicator) and is updated for each channel use (c.u.), i.e. for each received packet. The operation of the control logic can be summarized as follows. Every time a packet is received, the control module reads a channel SNR value to estimate the channel condition. A high SNR value indicates strong signal strength and thus a good channel condition. In this case, the number of random errors introduced into the packet by channel noise is low, which provides the opportunity to operate the system below nominal Vdd.

The objective of the Vdd Operating Table, as the first functional block, is to evaluate the current channel condition and provide decoding feasibility assessments for distinct operating power levels. The feasibility assessments also include estimates of the ranges of iteration numbers for each proposed operating Vdd level. The feasibility reports on three distinct power supply levels (nominal Vdd, medium Vdd, and low Vdd) are encoded using three comparators and passed on to the second functional block, the Optimization Function. Given the supplied decoding feasibility assessments, more than one supply level may be deemed feasible for decoding the current codeword frame. It is therefore the responsibility of the optimization function to perform a trade-off analysis between reduced instantaneous power consumption and additional iteration cycles. Eventually, a unique choice is made among the selections supplied by the Vdd Operating Table, namely the one that yields the maximum decoding energy efficiency. Due to the time-varying nature of a typical wireless channel, the input channel SNR is expected to vary with a certain randomness. As such, the output of the Optimization Function cannot be directly applied to an LDPC decoder system without causing frequent fluctuations in its power supply. To eliminate the fluctuations, an output filter (Decision Block) with memory is added so that the output response is smoothed based on the current and previous channel SNR. The memory of the output filter introduces hysteresis to the whole control system, which reduces the rate of response to input changes. However, the unwanted side effect of the hysteresis is that the system is not able to cope with occasional rapid channel SNR deteriorations. It is therefore more desirable to tune the hysteresis of the control system to react quickly to fast channel deterioration, so that the decoding performance can be guaranteed at all times, while reacting relatively conservatively when the channel condition improves, in order to prevent frequent supply changes. For this reason, a comparator based decision mechanism is implemented towards the end of the functional block to selectively output either the current or the filtered Vdd suggestion, hence the name Decision Block. From a system perspective, the entire LDPC dynamic voltage scaling control module can be viewed as an open loop control system which contains memory.


Besides the three functional blocks contained in the voltage scaling control module, two additional functional blocks (LDPC System Behavioural Model and Energy Saving Evaluation Module) are shown in Fig. 5.1 to complete the test bench environment. These two functional blocks operate only in the simulation setup presented in this thesis and are not implemented in real hardware. The LDPC System Behavioural Model is in essence a look-up table, similar to the Vdd Operating Table, which stores the curves of the LDPC operating space shown in Fig. 4.7. However, since the LDPC System Behavioural Model is not implemented in hardware and is only used in simulations, its look-up table is constructed with high numerical resolution (a high number of data points, each stored as a floating-point variable in MATLAB). The LDPC System Behavioural Model performs two look-up operations. Based on the current channel input SNR, it first finds an estimate of the iteration number assuming the LDPC decoder is operating under nominal Vdd. Given the suggested optimal Vdd, it then performs another look-up to estimate the decoding iteration number under the optimal Vdd. The Energy Saving Evaluation Module can then use both of these iteration numbers and the suggested optimal Vdd to calculate the percentage of total energy saving for the current decoding frame.

5.2 Vdd Operating Table

The Vdd operating table is responsible for finding suitable supply voltages based on the current channel SNR. To be more precise, for a given channel SNR value, the operating table is designed to output all possible operating supply voltages with the estimates of the ranges of the iteration numbers required to achieve a predefined output BER threshold. The reference look-up is based on the information contained in the LDPC decoder trade-off relationship obtained previously in Fig. 4.7.

The most straightforward way of implementing the search function on the relationship graph shown in Fig. 4.7 is via a look-up table. With a finite number of input SNRs as the search queries, the corresponding output results can be stored and fetched. This, of course, requires a study of the quantization resolution of the input SNR as well as the shape of the trade-off graph, which directly determine how much storage is required to implement the function.

Alternatively, two important observations can be made from the LDPC trade-off relationship graph (Fig. 4.7) to dramatically simplify the implementation of the look-up table. First, the set of curves that correspond to different memory error probabilities can be roughly grouped into three distinct levels. From Fig. 4.7, the simulation curve associated with $P_{err} = 10^{-2}$ (or 1 in 100) marks the lowest decoding performance profile of the simulated LDPC decoder, because it requires the decoder to operate within a relatively high input SNR range. This curve can be interpreted as the LDPC decoder operating under a low Vdd, where the amount of induced memory error is high and the decoder can only operate under high input SNR. The simulation curve associated with $P_{err} = 2 \times 10^{-3}$ (or 1 in 500) marks the intermediate decoding performance profile. This curve can be interpreted as the LDPC decoder operating under an intermediate Vdd. For the rest of the simulation curves beyond $P_{err} = 1.25 \times 10^{-3}$ (or 1 in 800), including the reference simulation curve, only marginal performance deviation can be seen in Fig. 4.7. This allows these curves to be grouped to mark the nominal decoding performance profile of the LDPC decoder. For simplicity, the reference curve is elected to represent the decoder operating under nominal Vdd. Besides the above simplification, a second observation can be made on the shape of all simulation curves, which tend to settle asymptotically when the iteration number exceeds 10. Along with the first simplification, this observation allows the entire LDPC trade-off graph to be further quantized and subdivided into only a few distinct regions. Overall, the set of operating curves from Fig. 4.7 can be simplified to just three curves, each representing a distinct simulated memory error probability (or operating Vdd), as shown in Fig. 5.5. Furthermore, within each individual curve the iteration number axis can be subdivided into two regions, namely the region with iteration number below 10 and the region with iteration number between 10 and 20. Based on these approximations, the Vdd operating table can be implemented simply via a set of comparators, as shown in Fig. 5.2.

The reference values used by each comparator shown in Fig. 5.2 are extracted directly from Fig. 4.7. Three main comparators, shown as shaded blocks in Fig. 5.2, are used to output the suggested supply Vdd. Their reference values, SNR_Low, SNR_Medium, and SNR_Nominal, are chosen to be the asymptotic limit of each trade-off curve in Fig. 5.5. The asymptotic limit of each operating curve marks the minimum input SNR required for a decoder to operate at a particular supply voltage, and hence signifies the limit of operability of a particular Vdd. Three auxiliary comparators are used in addition to estimate iteration numbers for each supply level. Their reference values, SNR_Critical_Low, SNR_Critical_Medium, and SNR_Critical_Nominal, are selected from each operating curve. These are the SNR values that correspond to the iteration number 10. Beginning at position Iter_num = 10, the trade-off relationship curve begins to flatten and approach its asymptotic limit. Therefore, the SNR value sampled at this point makes a good threshold to quantize each operating curve into two regions.

Figure 5.2: Comparator Based Implementation of Vdd Operating Table Module [ Flow diagram. The input SNR (dB) per c.u. is compared against six thresholds. | Main comparators: >= SNR_Nominal, >= SNR_Medium, and >= SNR_Low set the Nominal, Medium, and Low Vdd indicators. | Auxiliary comparators: >= SNR_Critical_Nominal, >= SNR_Critical_Medium, and >= SNR_Critical_Low set the Vdd_Nom, Vdd_Med, and Vdd_Low iter# <= 10 indicators. | The Yes and No branches set the corresponding indicator to 1 and 0, respectively. ]

The working of the comparator based look-up table can be visualized by imagining a horizontal line (indicating a certain input SNR value) moving along the y-axis in Fig. 5.5. As the input SNR is steadily increased, the imaginary horizontal line begins to intersect with one or more of the three operating lines, each corresponding to a specific level of operating Vdd. When the SNR line surpasses the thresholds defined in the comparators, the comparators turn on and set the indicators to logical ‘1’. When one of the main Vdd comparators is turned on, the logical ‘1’ at its output indicates that it is possible to perform a successful decoding under the indicated Vdd. When one of the auxiliary comparators is turned on, the logical ‘1’ at its output further indicates that, besides a successful decoding under the indicated Vdd being possible, the decoding of the current codeword frame can be expected to terminate within 10 iterations. In short, when both the main and auxiliary comparators of a particular Vdd level are subjected to an increasing input SNR, the main comparator turns on first, indicating the possibility of a successful decoding. At this time, the output of the auxiliary comparator for that particular Vdd level shows a logical ‘0’ by default, indicating that the decoding of the current codeword frame is expected to terminate within 20 iterations¹. When the input SNR becomes high enough, the auxiliary comparator for this particular Vdd eventually turns on, indicating that the decoding of the current codeword frame is expected to terminate within 10 iterations.

Since the look-up reference can now be implemented by a series of comparators, the output of the Vdd Operating Table module can also be simplified. The results of all comparators can be combined and used directly as the output. This way the output is also conveniently coded in thermometer coding. All possible combinations of the outputs of the Vdd Operating Table are listed in Table 5.1, where discrete Vdd level indicators and estimates of the ranges of iteration numbers are represented using thermometer coding.

                                     Output States
Comparator Outputs                   1  2  3  4  5  6
Nominal Vdd indicator                1  1  1  1  1  1
Vdd_Nom iter# <= 10 indicator        0  1  1  1  1  1
Medium Vdd indicator                 0  0  1  1  1  1
Vdd_Med iter# <= 10 indicator        0  0  0  1  1  1
Low Vdd indicator                    0  0  0  0  1  1
Vdd_Low iter# <= 10 indicator        0  0  0  0  0  1

Table 5.1: All Possible Output States of the Six Comparators are Thermometer Coded in the Vdd Operating Table Module
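The comparator bank of Fig. 5.2 and the output states of Table 5.1 can be sketched in Python as follows. This is a minimal illustration and not the MATLAB code of Appendix B. The six threshold values are hypothetical placeholders: they are not read from Fig. 5.5 and only respect the ordering implied by Table 5.1. The names `TableOutput` and `vdd_operating_table` are likewise introduced here for illustration, and the all-zero output for an SNR below the nominal threshold is an assumption of the sketch.

```python
from typing import NamedTuple

# Hypothetical thresholds in dB. They are NOT read from Fig. 5.5; they only
# follow the ordering implied by Table 5.1.
SNR_NOMINAL, SNR_CRITICAL_NOMINAL = 1.5, 2.0
SNR_MEDIUM, SNR_CRITICAL_MEDIUM = 2.5, 3.0
SNR_LOW, SNR_CRITICAL_LOW = 4.0, 4.5


class TableOutput(NamedTuple):
    nominal_vdd: int
    nom_iter_le_10: int
    medium_vdd: int
    med_iter_le_10: int
    low_vdd: int
    low_iter_le_10: int


def vdd_operating_table(snr_db: float) -> TableOutput:
    """Evaluate the six comparators; each outputs 1 when its threshold is met."""
    return TableOutput(
        nominal_vdd=int(snr_db >= SNR_NOMINAL),
        nom_iter_le_10=int(snr_db >= SNR_CRITICAL_NOMINAL),
        medium_vdd=int(snr_db >= SNR_MEDIUM),
        med_iter_le_10=int(snr_db >= SNR_CRITICAL_MEDIUM),
        low_vdd=int(snr_db >= SNR_LOW),
        low_iter_le_10=int(snr_db >= SNR_CRITICAL_LOW),
    )


# Sweeping the SNR upwards walks through output states 1 to 6 of Table 5.1.
for snr in (1.5, 2.0, 2.5, 3.0, 4.0, 4.5):
    print(snr, tuple(vdd_operating_table(snr)))
```

Because every comparator tests the same input against an increasing threshold, the six bits can only switch on in order, which is why the output is thermometer coded.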

To complete the discussion, the supply Vdd of the LDPC decoder is quantized to three distinct levels purely because of the shape of Fig. 4.7. In practice, it is up to the designer to make the trade-off between implementation complexity and supply resolution. A higher number of discrete supply voltage options in this particular design implies a linear increase in the number of comparators used in the functional block and additional complexity in the following stages. A higher number of supply options also creates further complications in designing tunable power supplies to work with the adaptive voltage scaling algorithm.

¹ The maximum iteration number of 20 is set to define a maximum decoding latency.

5.3 Description of the Optimization Function

The function of the optimization block is to make the optimal choice from the suggested power supply levels provided by the previous stage. Its main task involves evaluating the trade-off between instantaneous power reduction and prolonged iteration period for all provided supply levels. The output result is a unique and optimized supply voltage level based on the current channel SNR. The optimization algorithm is formulated based on the following key parameters: the relation between supply voltage and dynamic power consumption, the relation between supply voltage and static power consumption, and the contribution of each to the total power consumption.

5.3.1 Optimization Algorithm

The dynamic power consumption for regular CMOS technology is defined by the following expression [42]:

$$P_{dynamic} = \beta \times C \times f \times V_{dd}^{2} \qquad (5.1)$$

where $C$ is the total lumped charging capacitance, $f$ the operating frequency, $\beta$ the activity factor, and $V_{dd}$ the supply voltage. Equation (5.1) shows that the dynamic power is proportional to the square of the supply voltage. We can thus define a dynamic power reduction parameter $R_D$ based on the ratio between a lowered operating Vdd and the nominal one.

$$\frac{P_{dynamic\_low}}{P_{dynamic\_nominal}} = \frac{V_{dd\_low}^{2}}{V_{dd\_nominal}^{2}} = \left(\frac{V_{dd\_low}}{V_{dd\_nominal}}\right)^{2} = R_D \qquad (5.2)$$

The static power consumption is more complicated to quantify than the dynamic power because it has several contributing sources. In general, MOSFET static leakage power can be broken down into the following constituents: reverse-biased diode leakage between the diffusion regions of the MOSFETs, gate induced drain leakage in the drain region, gate oxide tunneling, and subthreshold leakage [43]. Out of all these different sources, subthreshold leakage dominates over all other forms of leakage by orders of magnitude [44]. The subthreshold leakage power is often modeled by the following equation:


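$$P_{sub} = V_{dd} \times I_{sub} \qquad (5.3)$$

where the transition subthreshold current is defined by:

$$I_{sub} = K_1 W e^{-\frac{V_{th}}{n V_{\theta}}} \left(1 - e^{-\frac{V_{dd}}{V_{\theta}}}\right) \qquad (5.4)$$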

where $V_{th}$ is the threshold voltage, $V_{\theta}$ is the thermal voltage (roughly equal to 25 mV), $W$ is the gate width, and $K_1$ and $n$ are experimentally determined parameters. Equation (5.4) shows that the subthreshold leakage current depends on the threshold voltage of the technology and on the supply voltage. Although the subthreshold leakage current is a non-linear function of the supply voltage, a static power reduction factor $R_I$ can be defined for each of the few supply voltage levels of interest. These ratios can be calculated from the following expression:

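$$\frac{P_{static\_low}}{P_{static\_nominal}} = \frac{I_{static\_low}}{I_{static\_nominal}} \times \frac{V_{dd\_low}}{V_{dd\_nominal}} = \frac{1 - e^{-\frac{V_{dd\_low}}{V_{\theta}}}}{1 - e^{-\frac{V_{dd\_nominal}}{V_{\theta}}}} \times \frac{V_{dd\_low}}{V_{dd\_nominal}} = R_I \qquad (5.5)$$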
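As a quick numerical check of (5.2) and (5.5), the two reduction factors can be evaluated in Python. The sketch below is illustrative only; the function names are introduced here, and the 0.8 V and 1.0 V supplies in the usage lines are example values. The thermal voltage of 25 mV is the approximate figure quoted in the text.

```python
import math

V_THETA = 0.025  # thermal voltage in volts (roughly 25 mV, as quoted in the text)


def dynamic_reduction(vdd: float, vdd_nominal: float) -> float:
    """R_D from (5.2): square of the supply-voltage ratio."""
    return (vdd / vdd_nominal) ** 2


def static_reduction(vdd: float, vdd_nominal: float,
                     v_theta: float = V_THETA) -> float:
    """R_I from (5.5): subthreshold-current ratio times supply-voltage ratio."""
    current_ratio = (1 - math.exp(-vdd / v_theta)) / (
        1 - math.exp(-vdd_nominal / v_theta))
    return current_ratio * vdd / vdd_nominal


# Example supplies (illustrative values): 0.8 V against a 1.0 V nominal supply.
print(dynamic_reduction(0.8, 1.0))
print(static_reduction(0.8, 1.0))
```

For supply voltages many times larger than the thermal voltage, both exponential terms in (5.5) are negligible, so the current ratio is very close to one and $R_I$ reduces to approximately the plain voltage ratio.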

The last parameter required by the optimization algorithm, $\alpha$, is the fractional contribution of the static power to the total power consumption. It is defined as:

$$\frac{P_{static}}{P_{dynamic} + P_{static}} = \alpha; \qquad \frac{P_{dynamic}}{P_{dynamic} + P_{static}} = 1 - \alpha \qquad (5.6)$$

The static power contribution factor $\alpha$ is highly dependent on fabrication technology, and is therefore best determined from EDA design tools or estimated from ITRS data. The static power contribution factor is the principal element in computing the cost of the trade-off between an increase in the number of iterations and reduced dynamic power consumption. With the above parameters defined, the total energy saving during the decoding of each codeword frame can be calculated. First, define $T_{iter}$ as the decoding time period per iteration; the decoding energy per iteration $E_{iteration}$ can then be written as:

$$E_{iteration} = (P_{dynamic} + P_{static}) \times T_{iter} \qquad (5.7)$$

For a reduced operating Vdd, the iteration energy can be written by multiplying the nominal power components ($P_{dynamic}$ and $P_{static}$) by the reduction factors ($R_D$ and $R_I$):

$$E_{iteration\_low} = (R_D \times P_{dynamic} + R_I \times P_{static}) \times T_{iter} \qquad (5.8)$$


The ratio between the reduced iteration energy and the nominal iteration energy can be defined as $K$, an energy saving factor which encapsulates the effect of both $R_D$ and $R_I$:

$$\frac{E_{iteration\_low}}{E_{iteration}} = \frac{R_D \times P_{dynamic} + R_I \times P_{static}}{P_{dynamic} + P_{static}} = R_D \times \frac{P_{dynamic}}{P_{dynamic} + P_{static}} + R_I \times \frac{P_{static}}{P_{dynamic} + P_{static}} = R_D \times (1 - \alpha) + R_I \times \alpha = K \qquad (5.9)$$

Since $R_D$ and $R_I$ are dependent on Vdd, a different value of $K$ is calculated for each particular Vdd level. In this work, three $K$ values, K_Low, K_Med, and K_Nom, are computed to reflect the three possible operating supply voltage choices for the embedded memory. Finally, to calculate the total energy required during each frame of LDPC decoding, $E_{iteration}$ must be multiplied by the number of iterations. The total energy required for each Vdd level can be written as:

$$E_{frame\_low} = K_{Low} \times E_{iteration} \times \#Iteration_{low} \qquad (5.10)$$

$$E_{frame\_medium} = K_{Med} \times E_{iteration} \times \#Iteration_{medium} \qquad (5.11)$$

$$E_{frame\_nominal} = K_{Nom} \times E_{iteration} \times \#Iteration_{nominal} \qquad (5.12)$$

Evidently, $E_{iteration}$ is the common factor in the above equations. This means that the total energy per frame can be compared across Vdd levels without knowing either the iteration period $T_{iter}$ or the iteration energy $E_{iteration}$. The total energy saving per frame can be estimated by the following formula, and the voltage level which yields the lowest energy consumption² is determined to be the optimum operating voltage.

$$\text{Total energy saving factor} = K_{particular} \times \#Iteration_{particular} \qquad (5.13)$$

5.3.2 Implementation of the Optimization Algorithm

² $K_{particular}$ in (5.13) is assigned to be either K_Low, K_Med, or K_Nom, whichever yields the lowest energy consumption.

Figure 5.3: System Diagram of the Optimization Block [ N.B. The constants 10 and 20 held in the input registers of the muxes define the number of iterations to complete decoding as described in section 5.3.2. | For each of the Low, Med, and Nom Vdd levels: a multiplexer controlled by the corresponding Vdd iter# <= 10 {0/1} signal selects 20 (select 0) or 10 (select 1) as Iter #; a multiplier (X) scales it by K_Low, K_Med, or K_Nom; an enabling multiplexer controlled by the Low, Medium, or Nominal Vdd {0/1} signal passes either this product or Max Value. | The three results enter the Minimization block, which outputs the 3-bit Optimized Vdd {100, 010, 001}. ]

Figure 5.3 illustrates the design of the optimization function. The results from the previous stage are used to select different multiplexer inputs. Three signals, Vdd_Low iter#, Vdd_Med iter#, and Vdd_Nom iter#, control whether the respective supply level is taken to require 10 or 20 iterations to complete decoding. The selected iteration numbers are then multiplied by the reduction factors K_Low, K_Med, and K_Nom to compute the total energy saving factors defined in (5.13). Before the results are sent to the minimization block, all values are passed through three enabling multiplexers. These enabling multiplexers are controlled by LowVdd, MedVdd, and NomVdd. Each of the three multiplexers is responsible for a particular supply voltage level. Depending on whether the supply level of interest is deemed a viable operating option by the previous stage, the multiplexer passes either the computed total energy saving factor or a pre-defined maximum value to the minimization block. This enabling mechanism prevents unselected supply levels from corrupting the result of the minimization. The Max Value can be arbitrarily defined as K_Nom × 30, which is greater than all computed energy saving factors. Finally, the optimized result given by the minimization block is one-hot encoded and sent to the next stage.
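The data path of Fig. 5.3 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the thesis implementation: the K values are the 65nm parameters reported in Table 6.1, the assignment of the one-hot codes to supply levels is assumed, and ties in the minimization are resolved in favour of the higher supply, which the text does not specify. The name `optimize` is introduced here for illustration.

```python
from typing import Dict, Tuple

# Energy saving factors per Vdd level (values from Table 6.1, 65nm).
# Listed from nominal to low so that ties resolve to the higher supply
# (an assumption; the text does not specify tie handling at this stage).
K_FACTORS: Dict[str, float] = {"nom": 1.0, "med": 0.72887, "low": 0.667}
MAX_VALUE = K_FACTORS["nom"] * 30  # larger than any computed factor
# Assumed mapping of one-hot codes to levels.
ONE_HOT = {"low": "100", "med": "010", "nom": "001"}


def optimize(feasible: Dict[str, int],
             iter_le_10: Dict[str, int]) -> Tuple[str, str]:
    """Return the selected level and its one-hot code."""
    costs: Dict[str, float] = {}
    for level, k in K_FACTORS.items():
        iterations = 10 if iter_le_10[level] else 20  # iteration multiplexer
        factor = k * iterations                       # multiplier, (5.13)
        # Enabling multiplexer: infeasible levels are masked with MAX_VALUE.
        costs[level] = factor if feasible[level] else MAX_VALUE
    best = min(costs, key=lambda level: costs[level])  # minimization block
    return best, ONE_HOT[best]


# Output state 5 of Table 5.1: low Vdd is feasible but needs up to 20 iterations.
state5_feasible = {"nom": 1, "med": 1, "low": 1}
state5_iter = {"nom": 1, "med": 1, "low": 0}
print(optimize(state5_feasible, state5_iter))
```

With these K values, the low supply is only selected when it is expected to finish within 10 iterations: at 20 iterations its factor (0.667 × 20) exceeds that of the medium supply at 10 iterations (0.72887 × 10).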

5.4 Description of the Decision Block

The decision block is the final stage of the voltage scaling control module and is responsible for filtering undesired fluctuations of the final result. Figure 5.4 illustrates the working mechanism of the decision block. The key element of the decision block is a set of three memory elements, called N Registers, each capable of storing an integer between 0 and N. These registers act as accumulators that record the number of times a particular supply level is chosen by the optimization function. When a particular supply level is selected, its respective N Register is incremented while the registers corresponding to the other levels are decremented. The mechanism is implemented by three multiplexers that pass either +1 or −1 to be added to the existing value in the N Register. The select signals of these multiplexers are connected to the one-hot encoded optimization result from the previous stage.

Before the incremented or decremented count is written to the N Register, the result is checked against the storage limit of the register. Should the integer result lie outside the range [0, N] of the N Register, the Saturation block preserves the integrity of the data by hard-limiting the operand to 0 or N, respectively. The registered counts then go through a majority vote stage to determine which supply level has the highest hit count. To deal with the occasional case when two registers have exactly the same count, for example when one register is being decremented while another one is being incremented, an additional minimization block is employed to guarantee that the result is unique and favours the lower supply voltage of the two choices.

Figure 5.4: System Diagram of the Decision Block [ For each bit of the one-hot Optimized Vdd (bit0, bit1, bit2), a multiplexer selects +1 or -1, which is added to the corresponding N Register through a Saturation block. | The three register counts feed a Majority Vote block followed by a Min block, which resolves the case of more than one possible Vdd choice. | An output multiplexer, controlled by the comparison labelled "Optimized Vdd < Operating Vdd", selects between the voted result and the Optimized Vdd; the selection is stored in the Registered Decision register and output as the Operating Vdd. ]

So far, up to N previously chosen supply levels are used to smooth the output response. However, one of the design objectives is to allow the system to respond quickly when the channel SNR deteriorates. This helps to ensure that the system prioritises successful decoding at all times. An additional compare and select stage is employed to implement this function. The current operating output level is stored in the output register Registered Decision. This latched result is compared with the optimized Vdd level given directly by the optimization function. If the result from the previous stage is found to be higher than the current operating supply voltage, meaning that the channel condition has worsened and requires a higher supply voltage, the optimized result is immediately written to the output register. The name of the functional block (Decision Block) reflects the fact that the decision to bypass the result of the majority vote is only taken when the channel SNR deteriorates.
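The behaviour of the decision block described in this section can be sketched in Python as follows. This is an illustrative model only. The register depth N = 8, the initial register contents, and the class and attribute names are assumptions made for the sketch. In addition, when the suggested level equals the current operating level the sketch simply holds that level, which is one possible reading of the compare and select stage.

```python
from typing import Dict

LEVELS = ("low", "med", "nom")  # ordered from lowest to highest supply
RANK = {level: i for i, level in enumerate(LEVELS)}


class DecisionBlock:
    def __init__(self, n: int = 8, initial: str = "nom") -> None:
        self.n = n  # register depth N (assumed value)
        # Assumption: the register of the initial level starts saturated at N.
        self.counts: Dict[str, int] = {
            level: (n if level == initial else 0) for level in LEVELS}
        self.operating = initial  # models the Registered Decision register

    def update(self, optimized: str) -> str:
        # +1 for the selected level, -1 for the others, saturated to [0, N].
        for level in LEVELS:
            step = 1 if level == optimized else -1
            self.counts[level] = min(self.n, max(0, self.counts[level] + step))
        # Majority vote; a tie is resolved in favour of the lower supply.
        top = max(self.counts.values())
        voted = next(level for level in LEVELS if self.counts[level] == top)
        if RANK[optimized] >= RANK[self.operating]:
            # Channel has worsened (or is unchanged): follow the suggestion now.
            self.operating = optimized
        else:
            # Channel has improved: follow the smoothed majority vote.
            self.operating = voted
        return self.operating


block = DecisionBlock(n=8, initial="nom")
print([block.update("low") for _ in range(4)])  # slow move towards low Vdd
print(block.update("nom"))                      # immediate return to nominal
```

In this sketch the move to a lower supply takes N/2 consecutive suggestions, while a request for a higher supply takes effect on the same frame, which reflects the asymmetric response the section describes.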

5.5 LDPC Behavioural Model and Energy Saving Evaluation Module

The LDPC behavioural model and the energy saving evaluation module are two additional functional blocks required to complete the simulation test bench of the LDPC Dynamic Voltage Scaling module. These two additional functional blocks are only implemented for simulation purposes and will not be incorporated in the actual hardware design. Together with the voltage scaling control module, the complete test bench is used to evaluate the percentage of energy saving provided by the dynamic voltage scaling control module for a given channel SNR.

The inner working principles of these two functional blocks closely resemble those of the Vdd Operating Table and the Optimization Function. The LDPC System Behavioural Model essentially implements the reverse look-up function of the LDPC trade-off relationship shown in Fig. 5.5. Given the current channel SNR and the optimized supply voltage, the LDPC System Behavioural Model provides the iteration number under both reduced and nominal Vdd. Since it is only implemented for simulation purposes, the shape of the LDPC operating curve does not have to be quantized as in the Vdd Operating Table. This means the reverse look-up can have much higher resolution and thus provide a more accurate estimate of the iteration numbers. The estimated iteration numbers for an LDPC decoder operating under both nominal Vdd and optimized Vdd are then used by the Energy Saving Evaluation Module, which in essence implements the core part of the optimization algorithm ((5.9) and (5.13)), to calculate the exact percentage of energy saved during the current frame of decoding.
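One way to turn (5.13) into a percentage is to compare the energy factor at the operating Vdd with the factor at nominal Vdd. The Python sketch below assumes this form; the exact expression used by the Energy Saving Evaluation Module is not given in the text, and the function name and the iteration counts in the usage line are hypothetical.

```python
def energy_saving_percent(k_operating: float, iters_operating: float,
                          iters_nominal: float, k_nominal: float = 1.0) -> float:
    """Percentage of frame energy saved relative to decoding at nominal Vdd.

    Assumed form: 100 * (1 - (K_op * iter_op) / (K_nom * iter_nom)),
    where each product is the total energy saving factor of (5.13).
    """
    factor_operating = k_operating * iters_operating
    factor_nominal = k_nominal * iters_nominal
    return 100.0 * (1.0 - factor_operating / factor_nominal)


# Hypothetical frame: 6 iterations at low Vdd (K_Low = 0.667 from Table 6.1)
# against 5 iterations at nominal Vdd.
print(energy_saving_percent(0.667, 6, 5))
```

A negative result would indicate that the extra iterations at the reduced supply cost more energy than the lower per-iteration energy saves.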

5.6 Discussion

Figure 5.5: Simplified Operating Curves for Three Discrete Operating Vdd Supply Voltages of an LDPC Decoder. [ Plot of Channel SNR in [dB] (y-axis, 0 to 6) against Iteration Number (x-axis, 5 to 20) for the Low, Medium, and Nominal Vdd Levels. | Simplified from Fig. 4.7, decoding performance target BER = $10^{-5}$. | Low Vdd Level: Smoothed and interpolated from curve $P_{err} = 10^{-2}$ (or 1 in 100). | Medium Vdd Level: Smoothed and interpolated from curve $P_{err} = 2 \times 10^{-3}$ (or 1 in 500). | Nominal Vdd Level: Smoothed and interpolated from reference curve $P_{err} = 0$ (no err.). ]

In this chapter, an optimization algorithm and its proposed implementation, an adaptive voltage scaling control module, are presented based on the LDPC trade-off relationship shown in chapter 4. The LDPC trade-off relationship graph of Fig. 4.7 is constructed from simulation benchmarking results of a generic IEEE 802.11ad LDPC decoder with a target output BER of $10^{-5}$. The proposed implementation of the voltage scaling control module is based on a simplified version of the LDPC trade-off relationship graph, shown in Fig. 5.5, which has only three curves representing embedded memory error probabilities of $P_{err} = 10^{-2}$ (1 in 100), $P_{err} = 2 \times 10^{-3}$ (1 in 500), and $P_{err} = 0$ (reference). Each of these curves, based on its simulated error probability, is associated with a specific supply voltage Vdd. The actual Vdd values are determined from the data supplied by Wilkerson, Gao et al. [41], which are plotted again in Fig. 5.6 for illustration. Tracing the “2M Cache” curve in Fig. 5.6, the low Vdd (associated with $P_{err} = 10^{-2}$) is found to be 0.8V, the medium Vdd (associated with $P_{err} = 2 \times 10^{-3}$) is found to be 0.85V, and the nominal Vdd is known for the technology to be 1V [41]. By linking the simulation curves with actual voltage values determined from the literature survey, these Vdd voltages can be used in the next chapter to estimate the potential energy saving percentage of the voltage scaling control module. Since the entire test bench presented in this chapter is written and simulated in MATLAB, all variables and constants defined and used in this chapter are stored and computed with floating-point precision. In future work, quantization studies are required to convert these floating-point variables and constants into fixed-point numbers when the control algorithm is implemented in an actual ASIC design.

The optimization algorithm and the proposed dynamic voltage scaling control module presented in this work are based on the assumption that most SRAM based embedded memories in today’s ICs are generated using memory compilers and manufactured using standard CMOS logic. As such, the proposed optimization algorithm is generally applicable to many different types of memory-based FEC implementations. In addition, since the manufacturing process dependent variables of the algorithm are parameterized, the algorithm can be easily adapted to a more advanced CMOS technology in the future.

In the case where the proposed algorithm is to be used on embedded memories fabricated in a special process, some of the equations used by the algorithm have to be modified. More specifically, the equations and variables defined in (5.1) to (5.5) have to be replaced by appropriate equations defined by the technology. Once modified, the results of the revised optimization equations can be readily integrated with the rest of the adaptive voltage control module, because the process of obtaining the optimization variables is designed to be decoupled from the rest of the control mechanisms.


Figure 5.6: (Repeated Fig. 3.3) Probability of Bit-cell Failure vs. Vdd for Various Cache Structures [41] [ Plot of Probability of Bit Failure (y-axis) against Vdd [V] (x-axis). ]

Chapter 6

Dynamic Voltage Scaling Control Module Simulation

Having presented the implementation of the dynamic voltage scaling control module and its test bench environment in chapter 5, the objective of this chapter is to present the simulation results obtained and discuss the effectiveness of the proposed voltage scaling control mechanism.

6.1 Simulation Parameters

Technology   RI_Low   RD_Low   RI_Med   RD_Med    RI_Nom   RD_Nom   α
65nm         0.8      0.66     0.85     0.7225    1        1        0.05

Technology   K_Low    K_Med    K_Nom
65nm         0.667    0.72887  1

Table 6.1: LDPC Dynamic Voltage Scaling Module Simulation Parameters for 65nm CMOS Technology [41], [45].

All simulation parameters used by the LDPC simulation test bench are listed in table 6.1. These parameters are calculated from data borrowed from the literature survey in chapter 3. To begin with, the three distinct Vdd levels defined in Fig. 5.5 are selected based on the measurement data provided by Wilkerson, Gao et al. [41] in Fig. 3.3. More specifically, the three Vdd levels are chosen from the 2Mbit cache curve of Fig. 3.3, whose size falls within the range of interest of this work, as discussed in chapter 3. Tracing the 2Mbit probability of failure curve (Fig. 3.3), the bit-cell failure probabilities of 10^-2 (1 in 100) and 2×10^-3 (1 in 500) correspond to a low Vdd of 0.8V and a medium Vdd of 0.85V, respectively. Along with the nominal Vdd of 1.0V for Intel's 65nm CMOS technology, the three supply voltages can be used to calculate some of the simulation parameters. Three sets of power reduction factors RI and RD are calculated for the low, medium, and nominal Vdd from (5.5) and (5.2). Furthermore, the static power contribution α is extracted from ITRS 2010 data for CMOS cache logic (shown in Fig. 6.1) [45]. Finally, the three optimization parameters K Low, K Med, and K Nom, corresponding to the low, medium, and nominal Vdd, are calculated from (5.9).
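As a quick check of the table values, the three K parameters can be reproduced from the listed reduction factors. The sketch below assumes that (5.9) takes the weighted-sum form used in the optimization function of Appendix B.3, K = RI*α + RD*(1 - α); the variable names are illustrative only.

```matlab
% Reproduce the K parameters of Table 6.1 (sketch, assumed form of (5.9)).
alpha = 0.05;                  % static power contribution, Table 6.1
RI = [0.8  0.85   1];          % static reduction factors  [low, medium, nominal]
RD = [0.66 0.7225 1];          % dynamic reduction factors [low, medium, nominal]

% Per-iteration energy factor: static part weighted by alpha,
% dynamic part weighted by (1 - alpha).
K = RI * alpha + RD * (1 - alpha);

fprintf('K Low = %.5f, K Med = %.5f, K Nom = %.5f\n', K(1), K(2), K(3));
```

With α = 0.05 this gives K Low = 0.667 and K Med = 0.728875, which agrees with Table 6.1 to the printed precision.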

[Figure 6.1 plot: percentage of static power consumption (y-axis, 0 to 45) versus year (x-axis, 2003 to 2018), with data points labelled by technology node from 100nm down to 18nm.]

Figure 6.1: ITRS 2010 - Static Power Contribution to the Overall Power Consumption of Cache Logic [45]

6.2 Simulation Environment

Simulation of the dynamic voltage scaling control module and its test bench is conducted using MATLAB. The only input of the system, channel SNR, is swept from 1dB to 7dB in order to show the behaviour of the system over a broad operating range. While the SNR is steadily increased over time, the input SNR to the system is also assumed to be subject to random variations. The variations are centered around each sweeping point of the input SNR, with a standard deviation of 1dB above and below the sweeping point. This way, while the mean value of the input SNR is swept across its range, the instantaneous SNR input values exhibit subtle deviations with time, which help to test the dynamic response of the control system. The test bench is simulated for 3000 simulation runs, and all output parameters, such as the operating voltage level and the actual iteration numbers, are collected to calculate the percentage of energy saving.
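The input sweep described above can be sketched as follows. This is a vectorized restatement of the input-generation loop in Appendix B.1, which draws each sample uniformly from a 2dB window whose lower edge advances by 0.1dB every 50 frames; the variable names are illustrative and are not taken from the original script.

```matlab
% Generate a slowly swept, randomly perturbed channel SNR input (sketch).
num_steps    = 60;     % number of sweep points
pts_per_step = 50;     % frames simulated at each sweep point
snr_step     = 0.1;    % increment of the window's lower edge, in dB
jitter_span  = 2;      % width of the random window, in dB

lower_bound = (1:num_steps) * snr_step;          % lower edge of each window
mean_snr    = lower_bound + jitter_span / 2;     % sweep point (window centre)

% Repeat each lower edge once per frame, then add uniform jitter.
lower_rep = repelem(lower_bound, pts_per_step);
snr_input = lower_rep + jitter_span * rand(1, num_steps * pts_per_step);

fprintf('%d frames, mean SNR swept from %.1f dB to %.1f dB\n', ...
        numel(snr_input), mean_snr(1), mean_snr(end));
```

The 60 sweep points with 50 frames each account for the 3000 simulation runs quoted above.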

Since the medium operating voltage, 0.85V, and the low operating voltage, 0.8V, are quite close in value, a comparative study is done to evaluate the necessity of having a medium Vdd as a voltage scaling option. In parallel with the voltage control module design presented in chapter 5, a separate design with only two voltage scaling levels, Nominal (1.0V) and Low (0.8V), is simulated under the same setup. The goal is to see how much difference in decoding energy saving is obtained by having an extra medium-value Vdd supply option. An extra supply level intuitively leads to better decoding energy efficiency; the trade-off, however, is a more complex design of the control module. The quantitative comparison is intended to inform designers who make hardware implementation decisions.

6.3 Simulation Results

The simulation results for the dynamic voltage scaling control modules with three and with two discrete supply voltage levels are shown in Fig. 6.2 and Fig. 6.4, respectively. The x axis in all plots of Fig. 6.2 and Fig. 6.4 indicates the simulation run number. In each figure, the results of 3000 simulation runs are plotted, and each simulation run represents a complete frame of LDPC decoding. Within each figure, four different plots are shown. The channel input plot shows the temporal sweeping of the channel SNR. The operating Vdd plot shows the only output of the control system, which is the recommended operating Vdd shown as distinct Vdd levels (Nominal Vdd = 1.0V, Medium Vdd = 0.85V, and Low Vdd = 0.8V). In addition, the decoding iteration number and the percentage of total decoding energy saving are plotted for each simulation run.

Fig. 6.2 and Fig. 6.4 show the response of the system as the mean value of the input channel SNR is swept from low to high. From a long-term perspective, as the input SNR is steadily increased, the recommended operating Vdd at the output eventually drops from the nominal Vdd level to the low Vdd level. This is in accordance with the design objective, where a good channel condition is exploited to save decoding energy. However, before settling at a fixed Vdd level, the system alternates back and forth between two Vdd states. These intermittent changes can be observed around 1000 and 2000 simulation runs in Fig. 6.2 and around 2000 simulation runs in Fig. 6.4. This demonstrates the operation of the decision block added at the end of the control module. For a certain critical input channel SNR, the system attempts to adaptively lower the supply Vdd. However, due to the temporal variations of the channel SNR, the system conservatively raises the supply Vdd, because it is tuned to respond as soon as a rapid channel deterioration is detected. The strategy prioritizes decoding performance in a rapidly changing channel environment.

[Figure 6.2 plots, each against Simulation Run # (0 to 3000): Percentage of Energy Saving per Frame (percentage); Channel Input SNR (SNR [dB]); Operating Vdd (Operating Vdd [V], with levels Low Vdd (0.8V), Med. Vdd (0.85V), Nom. Vdd (1.0V)); Iteration Number During Each Simulation Run (# of iterations).]

Figure 6.2: Dynamic Voltage Scaling Control System Simulation [With 3 Power Supply Voltage Levels]

Figure 6.3: A zoomed-in view of Fig. 6.2 [Demonstrating the hysteresis nature of the Decision Block]

[Figure 6.4 plots, each against Simulation Run # (0 to 3000): percentage of energy saving per frame; Channel Input SNR (SNR [dB]); Operating Vdd [V]; # of iterations.]

Figure 6.4: Dynamic Voltage Scaling Control System Simulation [With 2 Power Supply Voltage Levels]

[Figure 6.5 plots, each against Simulation Run # (0 to 3000): percentage of energy saving per frame; Channel Input SNR (SNR [dB]); Operating Vdd [V]; # of iterations.]

Figure 6.5: Dynamic Voltage Scaling Control System Simulation [With Rayleigh fading channel (with maximum Doppler shift of 100Hz)]

Upon closer inspection, the dynamics of the adaptive voltage scaling algorithm are demonstrated more clearly. A zoomed-in view of Fig. 6.2 is shown in Fig. 6.3, covering the simulation runs around run number 600. In this range, the input of the system, channel SNR, fluctuates around 2dB. As can be observed at the output, despite the low input SNR, the system begins to attempt using a lower supply voltage for decoding operations. The regions on the graph where the system uses the intermediate Vdd (0.85V) correspond to the regions at the input where consecutive higher SNRs are observed. This shows the filtering effect of the output decision block, where a lowering of Vdd is only granted after a sufficient amount of channel improvement has been observed. In the middle of the intermediate Vdd region, there are occasional frames where the system steps back and operates at the default nominal Vdd (1V). This coincides with sudden drops of the input SNR and demonstrates the single-sided hysteresis nature of the decision block, where the system is biased to respond rapidly to an immediate channel deterioration.
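The behaviour of the decision block can be illustrated on a short, hand-made sequence. The sketch below condenses the filter registers, majority vote, and single-sided hysteresis of the main script in Appendix B.1 into one loop; the input sequence min_idx_seq is hypothetical and stands in for the per-frame output of the optimizer (1 = low, 2 = medium, 3 = nominal Vdd).

```matlab
% Decision block sketch: saturating vote counters plus single-sided hysteresis.
filter_reg_size = 5;                  % saturation limit of each counter
regs        = [0 0 0];                % counters for [low, medium, nominal]
vdd_current = 3;                      % start at nominal Vdd

% Hypothetical optimizer outputs: nominal, then medium with one bad frame.
min_idx_seq = [3 3 3 3 3 2 2 2 2 3 2 2 2];
vdd_trace   = zeros(size(min_idx_seq));

for n = 1:numel(min_idx_seq)
    min_idx = min_idx_seq(n);

    % Increment the counter of the requested level, decrement the others,
    % and saturate every counter to the range [0, filter_reg_size].
    delta = -ones(1, 3);
    delta(min_idx) = 1;
    regs = min(max(regs + delta, 0), filter_reg_size);

    % Majority vote; ties are resolved in favour of the lowest index.
    maj_idx = find(regs == max(regs), 1, 'first');

    if min_idx >= vdd_current
        vdd_current = min_idx;        % raise (or hold) Vdd immediately
    else
        vdd_current = maj_idx;        % lower Vdd only when the vote agrees
    end
    vdd_trace(n) = vdd_current;
end

disp(vdd_trace);
```

In this sequence the request for the medium Vdd has to persist for three frames before it is granted, whereas the single frame that requests the nominal Vdd is honoured at once, which is the asymmetry described above.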

In chapter 5, it has been shown, through the derivation of (5.13), that the total energy saving factor (a ratio that quantifies the total frame decoding energy saving in the embedded memory) can be obtained with knowledge of only K (a per-iteration energy saving factor computed from the operating Vdd) and the iteration number. By using (5.13), the total energy saving in the embedded memory for each simulation run is computed and then plotted as a percentage at the bottom of Fig. 6.2 and Fig. 6.4. Initially, when the channel SNR is low, the system has no other option but to operate at the nominal Vdd. This means that the optimized decoding energy equals the nominal decoding energy, hence the percentage saving is 0%. Later on, when the channel condition improves, the system is able to adaptively lower the supply voltage, which results in a positive energy saving percentage. The graph of the percentage of energy saving per frame can be integrated along its x-axis and then normalized by the total number of simulation runs (3000 in this case) to obtain an average percentage of energy saving across the entire input SNR range. This average percentage saving can be thought of as an indicator of the effectiveness of the voltage scaling control system across an input SNR range. The calculated average percentage of energy saving is 21.1% for the dynamic voltage control module with three Vdd levels and 14.31% for the control module with two Vdd levels. The reduction in energy saving is expected, because the voltage scaling action of the system with only two Vdd levels does not come into play until a much higher channel SNR. However, the added hardware complexity of the three-level system must also be kept in mind when choosing a design. The choice between a simpler hardware implementation and a more effective energy saving result is left to system-level designers, depending on the range of channel conditions in which the system is targeted to operate. More extensive simulations should be performed using channel models for the intended application. Figure 6.5 shows the result of a similar simulation, except that the channel SNR disturbance is modeled by a Rayleigh channel fading model (rather than Gaussian) with a maximum Doppler shift of 100Hz (the sampling frequency is selected according to the decoding latency requirement of the LDPC decoder, because each simulation sample point represents one complete frame of decoding). The calculated average percentage of energy saving in this case is 15.1%.
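The averaging step described above can be sketched as follows. The sketch assumes that the per-frame energy ratio of (5.13) has the form used in the energy saving evaluation module of Appendix B.5 (K times the iteration count, normalized to the nominal case). The five frames are hypothetical and are not simulation results; the K values are those of Table 6.1.

```matlab
% Average percentage of energy saving over a set of frames (sketch).
K = [0.667 0.728875 1];          % per-iteration factors [low, medium, nominal]

% Hypothetical per-frame records.
vdd_level = [3  3  2  1  1];     % operating level chosen for each frame
iter_nom  = [20 20 10 10 10];    % iterations needed at nominal Vdd
iter_opt  = [20 20 10 10 12];    % iterations needed at the chosen Vdd

% Energy ratio per frame: optimized energy over nominal energy.
ratio = (K(vdd_level) .* iter_opt) ./ (K(3) .* iter_nom);

% Percentage saving per frame, then the average over all frames
% (equivalent to integrating the plot and dividing by the run count).
saving_pct = 100 * (1 - ratio);
avg_saving = mean(saving_pct);

fprintf('average energy saving: %.2f%%\n', avg_saving);
```

Frames that run at the nominal Vdd contribute 0%, so the average depends strongly on how much of the SNR sweep lies above the thresholds at which a lower Vdd becomes available.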

6.4 Projected Energy Saving Potential with Technology Scaling

Due to the lack of benchmarking data describing the bit-cell failure probability of embedded SRAMs implemented in other technology nodes, it is difficult to predict operating voltages for more advanced technology nodes. Without the projected operating voltages, the estimation of energy saving is challenging. Nevertheless, the ITRS data from Fig. 6.1 clearly indicates the trend of an increasing contribution of static power to the overall power consumption with technology scaling. To reveal the net effect of this single factor, simulations are conducted with the assumption that the voltage scaling ratio is maintained as the technology scales down. This means that all simulation parameters in table 6.1 except α are held constant while the static power consumption factor α increases. The average energy saving results of these simulations (calculated by integrating the percentage of energy saving plot across time and dividing by the total number of simulation runs) are plotted in Fig. 6.6. A decrease in energy saving potential as technology is scaled is consistently observed for both the implementation with 3 Vdd levels and the one with 2 Vdd levels. To explain the observed graph, it is important to realize that voltage scaling affects static and dynamic power differently. For dynamic power, the instantaneous power reduction is quadratically related to voltage scaling, while a less aggressive dependence is observed for static power. It is also important to realize that the estimation presented in Fig. 6.6 assumes that all other simulation conditions are held constant. This includes the operating frequency, which directly affects the dynamic power of an IC. In fact, one frequently observed power saving technique is that, as technology scales down, the operating frequency of ICs is increased while the supply Vdd is decreased to improve the overall power consumption.
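The effect of a growing static power share on the per-iteration saving can be illustrated with a few lines of arithmetic. The sketch below again assumes the weighted-sum form of K used in Appendix B.3 and holds the low-Vdd reduction factors of Table 6.1 fixed; the α values are hypothetical sample points and are not read from Fig. 6.1.

```matlab
% Per-iteration saving at low Vdd as the static power share grows (sketch).
RI_low = 0.8;                         % static reduction factor at low Vdd
RD_low = 0.66;                        % dynamic reduction factor at low Vdd
alpha  = [0.05 0.15 0.25 0.35 0.45];  % hypothetical static power fractions

% K moves from RD_low towards RI_low as alpha increases.
K_low = RI_low * alpha + RD_low * (1 - alpha);
saving_per_iter_pct = 100 * (1 - K_low);

for n = 1:numel(alpha)
    fprintf('alpha = %.2f -> K Low = %.3f, saving per iteration = %.1f%%\n', ...
            alpha(n), K_low(n), saving_per_iter_pct(n));
end
```

Because RI Low is larger than RD Low, every increase in α pulls K Low towards the weaker static reduction factor, so the saving per iteration shrinks monotonically under these assumptions.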

[Figure 6.6 plot: energy saving percentage (y-axis, 10 to 22) versus year (x-axis, 2006 to 2018) for the 3 level Vdd and 2 level Vdd implementations, with technology nodes 65nm, 45nm, 32nm, 28nm, and 18nm marked.]

Figure 6.6: Estimated Energy Saving Trend with Technology Node (using the static power consumption estimates from Fig. 6.1)

Chapter 7

Conclusion

Driven by recent advancements in wireless communication, energy efficiency has become a key research area for modern electronic design. With embedded memory occupying a large percentage of the overall die area of an SOC realization, this work proposed a methodology for applying adaptive dynamic voltage scaling to the embedded memory in order to effectively reduce the overall energy consumption of a design with minimum additional hardware. In the general case, memory failures induced by voltage scaling decrease the storage reliability of a memory, thus limiting the use of voltage scaling. However, the specific use case of this work focuses on the embedded memory used to realize an LDPC decoder, which provides a very powerful iterative error correction mechanism.

In chapter 3, voltage scaling induced memory errors are described, and a particular survey from industry provides quantitative relationships between error probability and supply voltage reduction. In chapter 4, a system-level LDPC decoder simulation is carried out to investigate the decoding performance penalty due to additional memory errors. Based on the information obtained in chapter 3 and chapter 4, a dynamic voltage scaling control algorithm is proposed in chapter 5 which adaptively tunes the supply voltage of the embedded memory inside an LDPC decoder in order to obtain optimum energy efficiency while maintaining the required decoding performance. Chapter 6 presents simulation results of the proposed voltage scaling control algorithm, based on measurement data obtained from industry. For a memory-based LDPC decoder designed for IEEE 802.11ad, simulated for a BPSK modulation scheme with a code rate of 1/2 and an output BER target of 10^-5, the simulation demonstrates that a system-level energy saving of up to 20% can be obtained for an embedded memory block implemented in 65nm technology. Furthermore, with data provided by ITRS and certain stated assumptions, the simulation also predicts similar future energy savings as CMOS technology scales.


7.1 Future Work

With the design of the voltage scaling control algorithm simulated, the next logical step is to implement the actual circuit and measure its effectiveness in an actual CMOS chip. A suitable candidate for the embedded memory on which the voltage scaling control module operates would be the embedded intrinsic and extrinsic memories of the LDPC decoder in IEEE 802.11ad, on which many of the discussions and assumptions in this work are based. Alternatively, the general methodology presented in this work (including the memory error performance benchmarking method presented in chapter 4 and the voltage scaling control algorithm proposed in chapter 5) can be applied to a wide range of error resilient algorithms other than LDPC decoders [46] [47] [48].

The relationships between memory error probability and supply voltage presented in chapter 3 are based on data from academic and industry surveys. Memory failure data based on transistor-level simulations is often computationally expensive to obtain and less accurate than hardware measurement. In the future, it would also be beneficial to experimentally investigate the error behaviour of the embedded memory via actual hardware measurement. This can be achieved by building a memory testing environment which continuously collects and analyses memory errors while its supply voltage is independently adjusted. Such a memory tester would allow further investigation of the quantity of different types of memory errors (read errors and write errors) in response to voltage scaling.

Appendix A

LDPC BER Simulations


[Figure A.1 plot: bit error rate (y-axis, log scale, 10^-7 to 10^0) versus channel SNR in [dB] (x-axis, -4 to 8); curves for CR = 1/2 with iter num = 10, 15, and 20.]

Figure A.1: BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors Perr = 2×10^-3 (1 in 500))
[ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of 2×10^-3 (1 in 500). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) ]


[Figure A.2 plot: bit error rate (y-axis, log scale, 10^-7 to 10^0) versus channel SNR in [dB] (x-axis, -4 to 8); curves for CR = 1/2 with iter num = 10, 15, and 20.]

Figure A.2: BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors Perr = 1.2×10^-3 (1 in 800))
[ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of 1.2×10^-3 (1 in 800). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) ]


[Figure A.3 plot: bit error rate (y-axis, log scale, 10^-8 to 10^0) versus channel SNR in [dB] (x-axis, -4 to 4); curves for CR = 1/2 with iter num = 10, 15, and 20.]

Figure A.3: BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors Perr = 10^-3 (1 in 1000))
[ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of 10^-3 (1 in 1000). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) ]


[Figure A.4 plot: bit error rate (y-axis, log scale, 10^-8 to 10^0) versus channel SNR in [dB] (x-axis, -4 to 5); curves for CR = 1/2 with iter num = 10, 15, and 20.]

Figure A.4: BER vs Channel SNR as a function of iteration number. (Simulation runs with injected memory errors Perr = 2×10^-4 (1 in 5000))
[ Simulation Curves: A set of LDPC decoder simulations with constant code rate of 1/2 and various iteration numbers. (All simulations conducted with memory bit flipping error probability of 2×10^-4 (1 in 5000). Bit flipping errors applied equally to all LLR bit locations.) | Dashed line: Output LDPC BER performance target line (BER = 10^-5) ]

Appendix B

LDPC Dynamic Voltage Scaling Simulation - MATLAB Code

B.1 Main Simulation Script

% This is the main test bench simulation script for the adaptive voltage

% scaling algorithm designed for the IEEE 802.11ad LDPC decoder with

% performance BER target set to 10^-5.

% The script links the Vdd Operating Table, the Optimization Function,

% and then the Decision block (running average based output filter).

% The filter stage and the single sided hysteresis is implemented

% in this script.

% Results of the output are also used to calculate the energy saving

% percentages.

% ------------------------------------------------------------------------

% SIMULATION PARAMETERS

% ------------------------------------------------------------------------

% define the size of the register based output filter for the Decision

% Block.

filter_reg_size = 5;

% ------------------------------------------------------------------------

% INITIALIZATION

% ------------------------------------------------------------------------

% initialize filter registers


filter_reg_H = 0;

filter_reg_M = 0;

filter_reg_L = 0;

% initial Vdd operating level

vdd_current_register = 3; % initially set to Nominal Vdd

% -- 3 = nominal Vdd

% -- 2 = medium Vdd

% -- 1 = low Vdd

% define input SNR of the simulation

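snr_floor = 0.1; % increment of the lower SNR bound per sweep point, in dB

step_variant = 2; % width of the random SNR window, in dB

SNR_input_array = [];

SNR_sweep_array = [];

for j = 1:60

lower_SNR_bound = j*snr_floor;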

upper_SNR_bound = lower_SNR_bound + step_variant;

SNR_sweep_array = [SNR_sweep_array (upper_SNR_bound + ...

lower_SNR_bound)/2];

simulation_pts = 50;

SNR_input_array = [SNR_input_array (upper_SNR_bound - ...

lower_SNR_bound) .*rand(1,simulation_pts) + ...

lower_SNR_bound];

end

[length_sim_array width_sim_array] = size(SNR_input_array);

% initialize simulation results + intermediate values array (empty)

iter_norm_array = [];

iter_opt_array = [];

opt_vdd_array = [];

saving_array = [];

sim_idx = (1:width_sim_array)';

% -----------------------------------------------------------------------------

% SIMULATION

% -----------------------------------------------------------------------------


for i=1:width_sim_array

% connect the input

SNR_in = SNR_input_array(i);

% feed the SNR to the first LUT stage

[Vdd_H_ind, Vdd_M_ind, Vdd_L_ind, ...

Vdd_H_iter_num, Vdd_M_iter_num, ...

Vdd_L_iter_num] = vdd_operating_table(SNR_in);

% feed the output of the first LUT stage to the 2nd optimizer stage

[min_idx] = optimization_function(Vdd_H_ind, Vdd_M_ind, ...

Vdd_L_ind, Vdd_H_iter_num, Vdd_M_iter_num, Vdd_L_iter_num);

% then the rest is the register filter part.

if (min_idx == 1)

filter_reg_H = filter_reg_H - 1;

filter_reg_M = filter_reg_M - 1;

filter_reg_L = filter_reg_L + 1;

elseif (min_idx == 2)

filter_reg_H = filter_reg_H - 1;

filter_reg_M = filter_reg_M + 1;

filter_reg_L = filter_reg_L - 1;

else

filter_reg_H = filter_reg_H + 1;

filter_reg_M = filter_reg_M - 1;

filter_reg_L = filter_reg_L - 1;

end

% saturate the filter registers value and make sure that no

% out of the bound situation would be possible.

if (filter_reg_H > filter_reg_size)

filter_reg_H = filter_reg_size;

end

if (filter_reg_H < 0)

filter_reg_H = 0;

end


if (filter_reg_M > filter_reg_size)

filter_reg_M = filter_reg_size;

end

if (filter_reg_M < 0)

filter_reg_M = 0;

end

if (filter_reg_L > filter_reg_size)

filter_reg_L = filter_reg_size;

end

if (filter_reg_L < 0)

filter_reg_L = 0;

end

% implement the majority vote mechanism

filter_reg_array = [filter_reg_L filter_reg_M filter_reg_H];

maj_idx = find(filter_reg_array == max(filter_reg_array));

% implement the min function following the majority vote: when several

% registers tie for the maximum, pick the lowest index, because the

% index also indicates the voltage level.

maj_min_idx = min(maj_idx);

% implement the single sided hysteresis function.

% basically compare the current output with the result of the

% majority vote.

if (min_idx >= vdd_current_register)

% the channel has suddenly deteriorated (or is unchanged): an emergency

% boost in Vdd is needed, so assign the current vdd register

% the immediate result coming from the optimizer.

vdd_current_register = min_idx;

else

% the channel has suddenly improved, but in this case we

% still want to be a little bit conservative.

% Here we take the result coming off the filter

% registers, which adds a little bit of

% hysteresis.

vdd_current_register = maj_min_idx;

end


% put the optimized result into the system behavioural model

% and use it to estimate the iteration number.

[iter_nominal, iter_optimized] = ldpc_system_behavioral_model ...

(SNR_in, vdd_current_register);

% now calculate the power saving for both the static

% and dynamic energy figures for comparison purposes.

[save] = energy_saving_evaluation_module ...

(iter_nominal, iter_optimized, vdd_current_register);

% log the experimental data

iter_norm_array = [iter_norm_array iter_nominal];

iter_opt_array = [iter_opt_array iter_optimized];

opt_vdd_array = [opt_vdd_array vdd_current_register];

saving_array = [saving_array save];

end

B.2 Vdd Operating Table

function [Nominal_Vdd_indicator, Medium_Vdd_indicator, ...

Low_Vdd_indicator, Vdd_Nom_iter_indicator, ...

Vdd_Med_iter_indicator, Vdd_Low_iter_indicator] = ...

vdd_operating_table(SNR_in)

% This (comparator based) truth table function evaluates the

% input SSI (Signal Strength Indicator) and indicates whether operation

% is possible at each of three different supply voltage levels.

%

% The function is named after a lookup table for legacy reasons; in reality

% the whole table can be implemented using a handful of comparators.

%

% INPUTS:

% SNR_in: input SNR (obtained as SSI signal strength indicator)

%

% OUTPUTS:

% Nominal_Vdd_indicator: High Vdd (nominal) indicator

% Medium_Vdd_indicator: Medium Vdd indicator

% Low_Vdd_indicator: Low Vdd indicator


%

% Vdd_Nom_iter_indicator: estimated iteration number at High Vdd (nominal)

% Vdd_Med_iter_indicator: estimated iteration number at Medium Vdd

% Vdd_Low_iter_indicator: estimated iteration number at Low Vdd

%

% embedded parameters (determined by the LDPC operating space)

SNR_Nominal = 0.65;

SNR_Medium = 2.63;

SNR_Low = 3.75;

SNR_Critical_Nominal = 1.095;

SNR_Critical_Medium = 2.769;

SNR_Critical_Low = 4.284;

% first of all the Vdd High indicator should always be on, this option

% should always be available.

Nominal_Vdd_indicator = 1;

% compare the medium threshold and decide if the medium Vdd is okay.

if (SNR_in > SNR_Medium)

Medium_Vdd_indicator = 1;

else

Medium_Vdd_indicator = 0;

end

% compare the low Vdd threshold and decide if the low Vdd is okay.

if (SNR_in > SNR_Low)

Low_Vdd_indicator = 1;

else

Low_Vdd_indicator = 0;

end

% Iteration indicator estimation.

if (SNR_in >= SNR_Critical_Low)

Vdd_Low_iter_indicator = 10;

else

Vdd_Low_iter_indicator = 20;

end

if (SNR_in >= SNR_Critical_Medium)

Vdd_Med_iter_indicator = 10;


else

Vdd_Med_iter_indicator = 20;

end

if (SNR_in >= SNR_Critical_Nominal)

Vdd_Nom_iter_indicator = 10;

else

Vdd_Nom_iter_indicator = 20;

end

B.3 Optimization Function

function [min_idx] = optimization_function(Nominal_Vdd_indicator, ...

Medium_Vdd_indicator, Low_Vdd_indicator, Vdd_Nom_iter_indicator, ...

Vdd_Med_iter_indicator, Vdd_Low_iter_indicator)

% The optimization function is the second stage of the power optimization

% algorithm. It takes the output of the vdd_operating_table function and

% then decides which of the available supply levels offers the best

% energy efficiency.

% embedded parameters.

% dynamic reduction parameters

DRL = 0.66;

DRM = 0.7225;

DRH = 1;

% static reduction parameters

IRL = 0.8;

IRM = 0.85;

IRH = 1;

% contribution parameter of static power to the overall power consumption

% this should be less than 1 (basically a percentage).

fs = 0.42;

% first compute the lumped parameter required for the calculation of the

% optimization.

KL = IRL*fs + DRL*(1-fs);

KM = IRM*fs + DRM*(1-fs);

KH = IRH*fs + DRH*(1-fs);

% this should be a value which is bigger than the largest

% possible value produced by the optimization computation. (A simple

% way to achieve this is to set the iteration number to 30, which is an

% imaginary number that is unlikely to be reached.)

max_value = 30 * KH;

% first multiply the iteration value with the lumped parameter K

P_candidate_H = Vdd_Nom_iter_indicator * KH;

P_candidate_M = Vdd_Med_iter_indicator * KM;

P_candidate_L = Vdd_Low_iter_indicator * KL;

% finally implement the minimization function, which will find the minimum

% of the three.

% load the candidate first

if (Low_Vdd_indicator == 0)

Vdd_candidate(1) = max_value;

else

Vdd_candidate(1) = P_candidate_L;

end

if (Medium_Vdd_indicator == 0)

Vdd_candidate(2) = max_value;

else

Vdd_candidate(2) = P_candidate_M;

end

if (Nominal_Vdd_indicator == 0)

Vdd_candidate(3) = max_value;

else

Vdd_candidate(3) = P_candidate_H;

end

% compare the candidate and find out the minimum

[min_p,min_idx] = min(Vdd_candidate);

% return result being: 1-low power; 2-med power; 3-high power
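As a worked example of the two stages above, the sketch below evaluates a single operating point by hand. It inlines the thresholds of B.2 and the parameters of B.3 (including fs = 0.42 as used in this listing) instead of calling the functions, so that it runs on its own; the operating point of 3dB and the variable names are illustrative only.

```matlab
% Worked example: which Vdd level does the optimizer pick at 3 dB? (sketch)
SNR_in = 3.0;                         % hypothetical channel SNR, in dB
fs     = 0.42;                        % static power fraction used in B.3

% Lumped per-iteration factors for [low, medium, nominal] Vdd.
K = [0.8 0.85 1] * fs + [0.66 0.7225 1] * (1 - fs);

% Stage 1 (B.2): availability and iteration estimate of each level.
allowed = [SNR_in > 3.75, SNR_in > 2.63, true];
iters   = 20 - 10 * [SNR_in >= 4.284, SNR_in >= 2.769, SNR_in >= 1.095];

% Stage 2 (B.3): energy cost per frame; unavailable levels get max_value.
cost = iters .* K;
cost(~allowed) = 30 * K(3);

[~, min_idx] = min(cost);
fprintf('selected level: %d (1 = low, 2 = medium, 3 = nominal)\n', min_idx);
```

At 3dB the low Vdd is not yet available, and the medium and nominal levels both need 10 iterations, so the medium Vdd has the lowest cost and index 2 is returned.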

B.4 LDPC System Behavioral Model

function [iter_nominal, iter_optimized] = ...

ldpc_system_behavioral_model(SNR_in, optimized_level)


% this function describes the system behaviour of the LDPC decoder: it takes

% the SNR as input and outputs the iteration number, which is used to

% estimate the power saving.

% load the curve fitted data.

load('fitted_behaviour.mat');

% make the SNR an array so that they can be compared to the curve fitted

% points.

[curve_num, curve_num_x] = size(vdd_ref_pts.yfit);

SNR_array = SNR_in * ones(curve_num, curve_num_x);

% select the optimized level for comparison purposes.

% (level encoding follows the main script: 1 = low, 2 = medium, 3 = nominal)

if (optimized_level == 1)

% low power state

optimized_curve = vdd_low_pts.yfit;

optimized_x_axis = vdd_low_pts.xi;

elseif (optimized_level == 2)

% med power state

optimized_curve = vdd_med_pts.yfit;

optimized_x_axis = vdd_med_pts.xi;

else

% regular power state

optimized_curve = vdd_ref_pts.yfit;

optimized_x_axis = vdd_ref_pts.xi;

end

% assign the nominal look-up curve

nominal_curve = vdd_ref_pts.yfit;

% perform look-up to find the iteration number under nominal Vdd

[min_value, idx_normal] = min(abs(vdd_ref_pts.yfit - SNR_array));

iter_nominal = vdd_ref_pts.xi(idx_normal);

% perform look-up to find the iteration number under optimized Vdd

[min_value, idx_optimized] = min(abs(optimized_curve - SNR_array));

iter_optimized = optimized_x_axis(idx_optimized);


B.5 Energy Saving Evaluation Module

function [save] = energy_saving_evaluation_module(iter_nominal, ...

iter_optimized, optimized_vdd_level)

% this function compares the total energy used during each frame of

% decoding (each frame of decoding may take several iterations).

% define internal parameters

% dynamic reduction parameters

DRL = 0.66;

DRM = 0.7225;

DRH = 1;

% static reduction parameters

IRL = 0.8;

IRM = 0.85;

IRH = 1;

% contribution parameter of static power to the overall power consumption

% this should be less than 1 (basically a percentage).

fs = 0.42; % extracted from MXL datasheet.

% the original saving for nominal Vdd should be just 1 (because everything

% is normalized against it)

% assigned the appropriate level of the reduction parameter according to

% the input optimized vdd level.

if (optimized_vdd_level == 1)

DR = DRL;

IR = IRL;

elseif (optimized_vdd_level == 2)

DR = DRM;

IR = IRM;

else

DR = DRH;

IR = IRH;

end

% now calculate the operating energy. (optimized Vdd)

opt_energy_per_frame = (IR*fs + DR*(1-fs))*iter_optimized;

% calculate the nominal energy. (nominal Vdd)

nom_energy_per_frame = (IRH*fs + DRH*(1-fs))*iter_nominal;


% output the saving result as the ratio of optimized to nominal energy

% per frame (the percentage saving is 100*(1 - save)).

save = opt_energy_per_frame/nom_energy_per_frame;

Bibliography

[1] A. Allan, D. Edenfeld, W. Joyner, A. Kahng, M. Rodgers, and Y. Zorian, “2001 technology roadmap for semiconductors,” Computer, vol. 35, no. 1, pp. 42–53, 2002, ISSN: 0018-9162. DOI: 10.1109/2.976918.

[2] Y. Zorian, “Embedded memory test and repair: infrastructure IP for SOC yield,” in Test Conference, 2002. Proceedings. International, 2002, pp. 340–349. DOI: 10.1109/TEST.2002.1041777.

[3] M. Galib, I. J. Chang, and J. Kim, “Supply voltage decision methodology to minimize sram standby power under radiation environment,” Nuclear Science, IEEE Transactions on, vol. 62, no. 3, pp. 1349–1356, Jun. 2015, ISSN: 0018-9499. DOI: 10.1109/TNS.2015.2420094.

[4] A. Vaknin, O. Yona, and A. Teman, “A double-feedback 8T SRAM bitcell for low-voltage low-leakage operation,” in SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 2013 IEEE, Oct. 2013, pp. 1–2. DOI: 10.1109/S3S.2013.6716565.

[5] C. Wu, L. Zhang, Z. Lu, Y. Ma, and J. Zheng, “Leakage reduction of sub-55nm SRAM based on a feedback monitor scheme for standby voltage scaling,” in SoC Design Conference (ISOCC), 2010 International, Nov. 2010, pp. 315–318. DOI: 10.1109/SOCDC.2010.5682907.

[6] M. Khellah, D. Khalil, D. Somasekhar, Y. Ismail, T. Karnik, and V. De, “Effect of power supply noise on SRAM dynamic stability,” in VLSI Circuits, 2007 IEEE Symposium on, 2007, pp. 76–77. DOI: 10.1109/VLSIC.2007.4342772.

[7] C. Ming and B. Na, “An efficient and flexible embedded memory IP compiler,” in Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012 International Conference on, Oct. 2012, pp. 268–273. DOI: 10.1109/CyberC.2012.52.

82

http://dx.doi.org/10.1109/2.976918

http://dx.doi.org/10.1109/TEST.2002.1041777

http://dx.doi.org/10.1109/TEST.2002.1041777

http://dx.doi.org/10.1109/TNS.2015.2420094

http://dx.doi.org/10.1109/TNS.2015.2420094

http://dx.doi.org/10.1109/S3S.2013.6716565

http://dx.doi.org/10.1109/S3S.2013.6716565

http://dx.doi.org/10.1109/SOCDC.2010.5682907


http://dx.doi.org/10.1109/VLSIC.2007.4342772

http://dx.doi.org/10.1109/CyberC.2012.52

BIBLIOGRAPHY 83

[8] T. Tsang and R. Thukral, “An area-efficient 0.25 micron memory compiler designed for780 MHz operations,” in Electrical and Computer Engineering, 1999 IEEE Canadian

Conference on, vol. 1, May 1999, pp. 533–537. DOI: 10.1109/CCECE.1999.807255.

[9] J. Costello D.J. and J. Forney G.D., “Channel coding: the road to channel capacity,”Proceedings of the IEEE, vol. 95, no. 6, pp. 1150–1177, Jun. 2007, ISSN: 0018-9219.DOI: 10.1109/JPROC.2007.895188.

[10] P. H. Siegel. (May 2007). An introduction to low-density parity-check codes, Electricaland Computer Engineering University of California, San Diego.

[11] M. May, M. Alles, and N. Wehn, “A case study in reliability-aware design: a resilientLDPC code decoder,” in Design, Automation and Test in Europe, 2008. DATE ’08, Mar.2008, pp. 456–461. DOI: 10.1109/DATE.2008.4484723.

[12] M. Alles, T. Brack, and N. Wehn, “A reliability-aware LDPC code decoding algorithm,”in Vehicular Technology Conference, 2007. VTC2007-Spring. IEEE 65th, Apr. 2007,pp. 1544–1548. DOI: 10.1109/VETECS.2007.322.

[13] R. Gallager, “Low-density parity-check codes,” Information Theory, IRE Transactions

on, vol. 8, no. 1, pp. 21–28, 1962, ISSN: 0096-1000. DOI: 10.1109/TIT.1962.1057683.

[14] S. J. Johnson, Introducing Low-Density Parity-Check Codes. [Online]. Available: http://sigpromu.org/sarah/SJohnsonLDPCintro.pdf.

[15] Y.-M. Chang, A. Vila Casado, M.-C. Chang, and R. Wesel, “Lower-complexity layeredbelief-propagation decoding of LDPC codes,” in Communications, 2008. ICC ’08. IEEE

International Conference on, May 2008, pp. 1155–1160. DOI: 10.1109/ICC.2008.225.

[16] S. Lin, L. Chen, J. Xu, and I. Djurdjevic, “Near shannon limit quasi-cyclic low-densityparity-check codes,” in Global Telecommunications Conference, 2003. GLOBECOM

’03. IEEE, vol. 4, Dec. 2003, pp. 2030–2035. DOI: 10.1109/GLOCOM.2003.1258593.

[17] B. M. Leiner. (2005). LDPC codes - a brief tutorial, [Online]. Available: http://www.bernh.net/media/download/papers/ldpc.pdf.

[18] N. P. Bhavsar and B. Vala, “Design of hard and soft decision decoding algorithmsof LDPC,” English, International Journal of Computer Applications, vol. 90, no. 16,pp. 553–557, 2014.

http://dx.doi.org/10.1109/CCECE.1999.807255

http://dx.doi.org/10.1109/CCECE.1999.807255

http://dx.doi.org/10.1109/JPROC.2007.895188

http://dx.doi.org/10.1109/DATE.2008.4484723

http://dx.doi.org/10.1109/VETECS.2007.322

http://dx.doi.org/10.1109/TIT.1962.1057683

http://dx.doi.org/10.1109/TIT.1962.1057683

http://sigpromu.org/sarah/SJohnsonLDPCintro.pdf

http://sigpromu.org/sarah/SJohnsonLDPCintro.pdf

http://dx.doi.org/10.1109/ICC.2008.225

http://dx.doi.org/10.1109/ICC.2008.225

http://dx.doi.org/10.1109/GLOCOM.2003.1258593


http://www.bernh.net/media/download/papers/ldpc.pdf

http://www.bernh.net/media/download/papers/ldpc.pdf

BIBLIOGRAPHY 84

[19] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product al-gorithm,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 498–519, Feb.2001, ISSN: 0018-9448. DOI: 10.1109/18.910572.

[20] N. Wiberg, “Codes and decoding on general graphs,” PhD thesis, Linkoping University,1996.

[21] A. Anastasopoulos, “A comparison between the sum-product and the min-sum iterativedetection algorithms based on density evolution,” in Global Telecommunications Con-

ference, 2001. GLOBECOM ’01. IEEE, vol. 2, 2001, pp. 1021–1025. DOI: 10.1109/GLOCOM.2001.965572.

[22] S. Howard, C. Schlegel, and V. Gaudet, “A degree-matched check node approximationfor LDPC decoding,” in Information Theory, 2005. ISIT 2005. Proceedings. Interna-

tional Symposium on, Sep. 2005, pp. 1131–1135. DOI: 10.1109/ISIT.2005.1523516.

[23] S. Muller, M. Schreger, M. Kabutz, M. Alles, F. Kienle, and N. Wehn, “A novel LDPCdecoder for DVB-S2 IP,” in Design, Automation Test in Europe Conference Exhibition,

2009. DATE ’09., 2009, pp. 1308–1313. DOI: 10.1109/DATE.2009.5090867.

[24] A. Darabiha, “VLSI architectures for multi-Gbps low-density parity-check decoders,”PhD thesis, University of Toronto, 2008.

[25] E. Sharon, S. Litsyn, and J. Goldberger, “An efficient message-passing schedule forLDPC decoding,” in Electrical and Electronics Engineers in Israel, 2004. Proceedings.

2004 23rd IEEE Convention of, Sep. 2004, pp. 223–226. DOI: 10.1109/EEEI.2004.1361130.

[26] P. Urard, E. Yeo, L. Paumier, P. Georgelin, T. Michel, V. Lebars, E. Lantreibecq, andB. Gupta, “A 135Mb/s DVB-S2 compliant codec based on 64800b LDPC and BCHcodes,” in Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC.

2005 IEEE International, vol. 1, Feb. 2005, pp. 446–609. DOI: 10.1109/ISSCC.2005.1494061.

[27] T. Brandon, R. Hang, G. Block, V. C. Gaudet, B. Cockburn, S. Howard, C. Giasson,K. Boyle, P. Goud, S. S. Zeinoddin, A. Rapley, S. Bates, D. Elliott, and C. Schlegel,“A scalable LDPC decoder ASIC architecture with bit-serial message exchange,” Inte-

gration, the VLSI Journal, vol. 41, no. 3, pp. 385–398, 2008, ISSN: 0167-9260. DOI:http://dx.doi.org/10.1016/j.vlsi.2007.07.003.

http://dx.doi.org/10.1109/18.910572



http://dx.doi.org/10.1109/ISIT.2005.1523516

http://dx.doi.org/10.1109/ISIT.2005.1523516


http://dx.doi.org/10.1109/EEEI.2004.1361130

http://dx.doi.org/10.1109/EEEI.2004.1361130

http://dx.doi.org/10.1109/ISSCC.2005.1494061

http://dx.doi.org/10.1109/ISSCC.2005.1494061

http://dx.doi.org/http://dx.doi.org/10.1016/j.vlsi.2007.07.003

BIBLIOGRAPHY 85

[28] T. Iizuka, “Embedded memory: a key to high performance system VLSIs,” in VLSI Cir-

cuits, 1990. Digest of Technical Papers., 1990 Symposium on, Jun. 1990, pp. 1–4. DOI:10.1109/VLSIC.1990.111070.

[29] S. Cosemans, W. Dehaene, and F. Catthoor, “A low-power embedded SRAM for wire-less applications,” Solid-State Circuits, IEEE Journal of, vol. 42, no. 7, pp. 1607–1617,Jul. 2007, ISSN: 0018-9200. DOI: 10.1109/JSSC.2007.896693.

[30] V. Kumar and G. Khanna, “A novel 7T SRAM cell design for reducing leakage powerand improved stability,” in Advanced Communication Control and Computing Tech-

nologies (ICACCCT), 2014 International Conference on, May 2014, pp. 56–59. DOI:10.1109/ICACCCT.2014.7019158.

[31] S. Huang, Z. Huang, A. Kurokawa, and Y. Inoue, “A novel SRAM structure for leak-age power suppression in 45nm technology,” in Communications, Circuits and Systems,

2008. ICCCAS 2008. International Conference on, May 2008, pp. 1070–1074. DOI:10.1109/ICCCAS.2008.4657953.

[32] C. Wu, L. Zhang, Z. Lu, Y. Ma, and J. Zheng, “Leakage reduction of sub-55nm SRAMbased on a feedback monitor scheme for standby voltage scaling,” in SoC Design Confer-

ence (ISOCC), 2010 International, Nov. 2010, pp. 315–318. DOI: 10.1109/SOCDC.2010.5682907.

[33] J. Kulkarni, “Low voltage robust memory circuit design,” PhD thesis, Purdue University,2007.

[34] E. Vatajelu, A. Goomez-Pau, M. Renovell, and J. Figueras, “Transient noise failures inSRAM cells: dynamic noise margin metric,” in Test Symposium (ATS), 2011 20th Asian,2011, pp. 413–418. DOI: 10.1109/ATS.2011.64.

[35] B. Zhang, A. Arapostathis, S. Nassif, and M. Orshansky, “Analytical modeling of SRAMdynamic stability,” in Computer-Aided Design, 2006. ICCAD ’06. IEEE/ACM Interna-

tional Conference on, 2006, pp. 315–322. DOI: 10.1109/ICCAD.2006.320052.

[36] A. Kumar, J. Rabaey, and K. Ramchandran, “SRAM supply voltage scaling: a reliabilityperspective,” in Quality of Electronic Design, 2009. ISQED 2009. Quality Electronic

Design, Mar. 2009, pp. 782–787. DOI: 10.1109/ISQED.2009.4810392.

[37] S. Mukhopadhyay, “Designing robust and low-leakage VLSI circuits at the end of siliconroadmap: technology and circuit perspectives,” PhD thesis, Purdue University, 2006.

[38] A. Kumar, “SRAM leakage-power optimization framework: a system level approach,”PhD thesis, University of California, Berkeley, 2008.

http://dx.doi.org/10.1109/VLSIC.1990.111070

http://dx.doi.org/10.1109/JSSC.2007.896693

http://dx.doi.org/10.1109/ICACCCT.2014.7019158

http://dx.doi.org/10.1109/ICCCAS.2008.4657953



http://dx.doi.org/10.1109/ATS.2011.64

http://dx.doi.org/10.1109/ICCAD.2006.320052

http://dx.doi.org/10.1109/ISQED.2009.4810392

BIBLIOGRAPHY 86

[39] B. Calhoun and A. Chandrakasan, “Analyzing static noise margin for sub-thresholdSRAM in 65nm CMOS,” in Solid-State Circuits Conference, 2005. ESSCIRC 2005. Pro-

ceedings of the 31st European, Sep. 2005, pp. 363–366. DOI: 10.1109/ESSCIR.2005.1541635.

[40] A. Bhavnagarwala, S. Kosonocky, C. Radens, K. Stawiasz, R. Mann, Q. Ye, and K.Chin, “Fluctuation limits amp; scaling opportunities for CMOS SRAM cells,” in Elec-

tron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, Dec. 2005,pp. 659–662. DOI: 10.1109/IEDM.2005.1609437.

[41] C. Wilkerson, H. Gao, A. Alameldeen, Z. Chishti, M. Khellah, and S.-L. Lu, “Tradingoff cache capacity for reliability to enable low voltage operation,” in Computer Architec-

ture, 2008. ISCA ’08. 35th International Symposium on, Jun. 2008, pp. 203–214. DOI:10.1109/ISCA.2008.22.

[42] N. H. Weste and D. M. Harris, CMOS VLSI Design - A Circuit and System Perspective,4, Ed. Pearson Education, Inc., 2011.

[43] P. R. Panda, A. Shrivastava, B. V. N. Silpa, and K. Gummidipudi, Power-efficient System

Design, 1st Edition. Springer, 2010.

[44] R. Kumar and C. Ravikumar, “Leakage power estimation for deep submicron circuitsin an ASIC design environment,” in Design Automation Conference, 2002. Proceedings

of ASP-DAC 2002. 7th Asia and South Pacific and the 15th International Conference

on VLSI Design. Proceedings., 2002, pp. 45–50. DOI: 10.1109/ASPDAC.2002.994883.

[45] ITRS. (2010). Internaltional technology roadmap for semiconductors, [Online]. Avail-able: http://www.itrs.net/reports.html.

[46] M. May, M. Alles, and N. Wehn, “A case study in reliability-aware design: a resilientLDPC code decoder,” in Design, Automation and Test in Europe, 2008. DATE ’08, Mar.2008, pp. 456–461. DOI: 10.1109/DATE.2008.4484723.

[47] A. M. A. Hussien, M. S. Khairy, A. Khajeh, K. Amiri, A. M. Eltawil, and F. J. Kurdahi,“A combined channel and hardware noise resilient Viterbi decoder,” in Signals, Systems

and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar

Conference on, Nov. 2010, pp. 395–399. DOI: 10.1109/ACSSC.2010.5757543.

[48] M. S. Khairy, C. A. Shen, A. M. Eltawil, and F. Kurdahi, “Error resilient MIMO detectorfor memory-dominated wireless communication systems,” in Global Communications

Conference (GLOBECOM), 2012 IEEE, Dec. 2012, pp. 3566–3571. DOI: 10.1109/GLOCOM.2012.6503668.

http://dx.doi.org/10.1109/ESSCIR.2005.1541635

http://dx.doi.org/10.1109/ESSCIR.2005.1541635

http://dx.doi.org/10.1109/IEDM.2005.1609437

http://dx.doi.org/10.1109/ISCA.2008.22

http://dx.doi.org/10.1109/ASPDAC.2002.994883

http://dx.doi.org/10.1109/ASPDAC.2002.994883

http://www.itrs.net/reports.html


http://dx.doi.org/10.1109/ACSSC.2010.5757543


