Energy and Quality-Aware Multimedia Signal Processing

Yunus Emre, Student Member, IEEE, and Chaitali Chakrabarti, Fellow, IEEE

Abstract—This paper presents techniques to reduce energy with minimal degradation in system performance for multimedia signal processing algorithms. It first provides a survey of energy-saving techniques such as those based on voltage scaling, reducing the number of computations and reducing the dynamic range. While these techniques reduce energy, they also introduce errors that affect the performance quality. To compensate for these errors, techniques that exploit algorithm characteristics are presented. Next, several hybrid energy-saving techniques that further reduce the energy consumption with low performance degradation are presented. For instance, a combination of voltage scaling and dynamic range reduction is shown to achieve 85% energy saving in a low pass FIR filter for a fairly low noise level. A combination of computation reduction and dynamic range reduction for the Discrete Cosine Transform shows, on average, 33% to 46% reduction in energy consumption while incurring 0.5 dB to 1.5 dB loss in PSNR. Both of these techniques have very little overhead and achieve significant energy reduction with little quality degradation.

Index Terms—Error compensation, low-power, multimedia algorithms, variable precision, voltage scaling.

I. INTRODUCTION

PORTABLE multimedia devices have proliferated in the last two decades, and the number of applications supported by these devices has increased significantly. Each additional application comes at a cost of higher energy consumption, and since most of these devices are battery powered, it is important that every effort be made to reduce this cost. The challenge is to minimize the energy cost while executing increasingly complex functionalities with minimal degradation in algorithm performance quality. Fortunately, many multimedia applications do not need 100% correctness during computation, and energy-saving transformations are favored as long as the output quality is only mildly affected [1], [2].

Three of the most effective techniques for reducing energy consumption are voltage scaling [3]–[14], reduction in the number of computations [2], [15]–[20] and dynamic range adjustment [16], [18], [21]–[24]. While voltage scaling results in a significant reduction in energy consumption due to the quadratic dependence of energy consumption on supply voltage, voltage over-scaling (VOS) can lead to failures.

Manuscript received September 12, 2012; revised November 21, 2012; accepted December 30, 2012. Date of publication June 06, 2013; date of current version October 11, 2013. This work was supported in part by NSF CSR 0910699. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yiannis Andreopoulos.
The authors are with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: yemre@asu.edu; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2013.2266094

Techniques have been developed to mitigate the errors caused by critical path violations in the computation unit and in memory under VOS. While circuit-level techniques [25], [26] are quite effective, there are low overhead algorithm-level techniques that use the inherent redundancy and characteristics of the data to detect and correct errors that occur during the computation.

Unlike general purpose computing, most multimedia applications can provide decent quality even with a reduced number of computations, as long as the significant computations are retained. The basic idea is that not all components of the computation are equally significant, so for systems with limited resources, the more important computations are done first and the less important computations are performed later or even eliminated. Such a methodology has been applied to many image and video processing algorithms: filters, where multiplications with large coefficients have higher significance [2]; the Discrete Cosine Transform (DCT), where low frequency coefficients have higher significance [2], [16]; and the Discrete Wavelet Transform (DWT) [20], where low subband coefficients have higher significance. This is also the basis of incremental processing, where computations can be halted once decent quality is achieved [20].

Another popular energy-saving technique is dynamic range reduction in the datapath computation. Typically, low order bits are less important and so can be truncated to save energy. Such a methodology has been used in many multimedia applications such as filtering [16], DCT [22], [23], FFT [5], etc. However, in some applications such as motion estimation [18], the less significant bit computations are more important, since large values, which are the result of computations with the more significant bits, are discarded. While truncation reduces energy consumption, it also introduces errors due to operation with a reduced dynamic range. Simple techniques to compensate for these errors help reduce energy consumption while only mildly affecting the algorithm performance quality.

In this paper, we describe several energy-saving techniques that achieve minimal degradation in quality with low overhead. While some of these techniques are general, others have been geared to exploit algorithmic features and result in superior performance both in terms of energy consumption and algorithm quality. The key contributions of this paper are as follows:
• We provide a survey of general as well as algorithm-specific techniques that trade off energy with system performance for multimedia signal processing algorithms. Examples include FIR filtering, DCT, DWT, etc. These techniques are based on voltage scaling in datapath and memory, reduction in the number of computations and reduction in the dynamic range.

Fig. 1. 16-bit RCA under voltage scaling. (a) Energy delay profile, (b) Error distribution.

• We propose hybrid schemes that use combinations of these techniques to achieve even higher energy saving with smaller performance degradation. The performance overhead and energy savings of each scheme are quantified and analyzed.
• We study the combination of voltage scaling and dynamic range reduction in the context of low pass FIR filter applications. The errors due to the increase in critical path delay during voltage scaling are reduced by truncating the lower order bits, which shortens the critical path. The noise that is introduced due to truncation is compensated by using an unbiased estimator. For a MAC based FIR filter, such a scheme achieves 85% energy saving for a fairly low noise level.

• We study the combination of computation reduction and dynamic range reduction for the DCT. We propose a scheme that chooses which DCT coefficients are to be deactivated and the number of bits to be truncated based on the quality metric Q. We derive combinations of deactivation and truncation for different acceptable PSNR degradations across the whole range of Q. Simulation results show, on average, 33% to 46% reduction in energy consumption while incurring 0.5 dB to 1.5 dB degradation in the PSNR performance of JPEG.

The rest of the paper is organized as follows. A survey of low energy techniques is given in Section II, followed by hybrid schemes that combine different low energy techniques in Section III. Finally, Section IV concludes the paper.

II. ENERGY-SAVING TECHNIQUES

In this section, we describe the three main techniques for reducing energy consumption, namely, voltage scaling (Section II-A), reducing the number of computations (Section II-B) and reducing the dynamic range in computation (Section II-C). We describe the errors introduced by each of the techniques and ways to compensate for them to reduce algorithm-quality degradation.

Fig. 2. Block Diagram of Algorithm Noise Tolerant (ANT) Scheme.

A. Voltage Scaling

One of the most effective techniques for energy reduction is voltage scaling, due to the quadratic dependence of energy consumption on supply voltage [3]. Fig. 1(a) illustrates the normalized energy and delay plots of a 16-bit ripple carry adder (RCA) as a function of supply voltage. These were obtained with our in-house simulator based on ModelSim with 45 nm PTM models [27]; the normalization is with respect to operation at the 1 V nominal voltage. Voltage overscaling (VOS) refers to scaling the voltage beyond the value imposed by the critical delay of the circuitry. This may result in timing violations in the datapath, resulting in erroneous operation. Fig. 1(b) illustrates the error distribution of the 16-bit RCA under voltage scaling [8]. Note that most of the errors reside in the most significant bits, which can result in significant performance degradation.

1) Compensating Datapath Errors: To mitigate the errors due to critical path violations of the computation unit under voltage scaling, algorithm noise tolerance (ANT) has been used in [4]–[7]. Fig. 2 illustrates the general block diagram of the ANT scheme, which consists of a main block and a reduced computation block. The main block implements the original computation at full precision; thus its output $y_a$ is prone to critical path violations under voltage scaling. The reduced computation block is designed to generate a statistical replica $y_r$ of the original result with a shorter critical path. These two outputs are compared to detect any errors that may have occurred in the main block. Since VOS results in errors that are large in magnitude, the system chooses $y_a$ if the difference $|y_a - y_r|$ is smaller than a predetermined threshold (Thr), and $y_r$ otherwise.
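The selection rule can be captured in a few lines. The sketch below is a behavioral model, not the circuit implementation: the filter taps, the threshold and the injected error magnitude are illustrative assumptions.

    import numpy as np

    def ant_output(x, taps, thr, err_prob=0.05, rng=np.random.default_rng(0)):
        # Main block: full-precision output, prone to VOS timing errors,
        # modeled here as a rare, large additive error (an assumption).
        y_a = float(np.dot(taps, x))
        if rng.random() < err_prob:
            y_a += 2.0 ** 10
        # Reduced computation block: coarse replica with a shorter critical
        # path, modeled by heavily rounded coefficients.
        y_r = float(np.dot(np.round(taps, 1), x))
        # VOS errors are large in magnitude, so a threshold test suffices.
        return y_a if abs(y_a - y_r) < thr else y_r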


Fig. 3. (a) Delay distribution of a single SRAM read operation; (b) Read, write and total failure probability of SRAM for different voltage levels in 32 nm.

In an ANT based system, the reduced computation block needs to provide a good approximation of the original output, have low complexity circuitry to minimize the overall overhead, and have a shorter critical path to ensure error-free operation. Reduction in computation is achieved by using a reduced precision replica [4], [5], subsampling [6], and prediction based error correction [7]. ANT-based systems have been applied to multimedia applications such as FIR low pass filtering [4], [5], FFT [5], and motion estimation [6]. In [4], the correlation of the FIR low pass filter outputs is used to correct errors, if any; to minimize the overhead, a very simple low pass filter is employed to compute the estimates of the main block. In [5], the reduced computation block is based on a 4-bit MSB implementation of the FFT, while the main block operates on 8-bit data. In [6], a subsampled version of the original motion estimation block is used in the reduced computation block. All these methods achieve 20% to 40% energy reduction while incurring small performance degradation.

Recently, an algorithm-specific technique has been proposed in [8] that mitigates datapath errors during the computation of the 2D-DCT and quantization in JPEG. The technique exploits features of the encoded JPEG data to detect the VOS induced errors. The features are based on facts such as: two adjacent AC coefficients after the zig-zag scan have similar values; coefficients corresponding to higher frequencies generally have smaller values; and the number of sign extension bits is determined by the quantization level. The technique achieves high performance with small circuit overhead. Simulation results show that the proposed technique has a PSNR performance degradation of around 1.5 dB compared to the error-free case, and a 4 dB improvement compared to the no-correction case, at a compression rate of 0.75 bpp. The overhead of this technique is quite small; it requires three simple units, namely, a 3-bit majority voter, an 8-bit coefficient comparator and an 8-bit 2-input average calculator.

2) Compensating Memory Errors: SRAM failure analysis under voltage scaling has been investigated by several researchers [9]–[11], [28]–[31]. In [28], statistical models of random dopant fluctuations (RDF) are used to determine read and write failures and access time variations. In [29], the read and write noise margins of 6T SRAM cells are used to calculate the memory reliability. Fig. 3(a) illustrates the distribution of read access times at nominal and scaled voltage levels for a 32 nm SRAM cell under 40 mV RDF and 5% channel length variation. From the figure, we see that as the voltage scales down, the tail of the access time distribution becomes heavier, which manifests itself in access time errors at scaled voltages. Fig. 3(b) illustrates the read, write and total failure rates as the voltage scales from 0.8 V to 0.6 V [32]. At the nominal voltage of 0.9 V, the estimated BER is quite low; at scaled voltages, the BERs are very high, climbing steeply as the voltage drops from 0.7 V to 0.6 V. Such high error rates were also reported in [10], [29].

Several circuit, system and architecture-level techniques have been proposed to mitigate and/or compensate for memory failures. At the circuit level, different SRAM structures such as 8T and 10T cells have been proposed [10], [31]. In 8T and 10T structures, the data paths for read and write operations are separated to increase the robustness of these operations. However, the additional circuitry increases leakage power and circuit area by approximately 20% to 30%. The method in [10] stores the MSBs in a memory bank with 8T SRAM cells and the least significant bits (LSBs) in memory banks with 6T cells, using the fact that 8T SRAM cells are more robust than basic 6T SRAM cells at scaled voltages. Such a scheme achieves approximately 40% power reduction with a 15% increase in overall circuit area. The method in [11] operates the memory banks that store MSBs at a different voltage level than the ones that store LSBs; this is shown to achieve 45% power reduction with 10% degradation in image quality for a regular pixel based image-storage system.

Many techniques make use of error control coding (ECC) [8], [12], [32]–[34]. In [12], orthogonal Latin square codes are used to trade off cache size against correction capability. Extended Hamming codes, which provide single error correction and double error detection (SECDED), have been used for several years to combat failures in memory systems [32]–[34].


Their simple structure makes them appealing for applications that require low latency and power consumption. The memory area overhead of stronger codes is very large; thus unequal error protection (UEP), which combines strong and weak codes, is a better option. The main idea of UEP is to provide superior protection to the more important bits and thus enable the area overhead to be reduced without sacrificing performance [32]. For instance, in JPEG2000, the higher subband DWT outputs are more important and so should be protected with stronger codes. In [32], it has been shown that compared to a single ECC, UEP based on different SECDED codes has 35% lower MSE for the same overhead.

Algorithm-specific techniques have also been developed for memory-intensive multimedia applications in [13], [32], [35]. These techniques mitigate system degradation due to memory failures using additional features that are intrinsic to the algorithm. In [13], binarization and the second derivative of the image are used to detect error locations in different DWT subbands in JPEG2000; these are then corrected in an iterative fashion by flipping one bit at a time, starting from the MSB. In [35], application-aware methods have been proposed to reduce the power consumption of memories in video coding systems under VOS while maintaining performance by using simple filters after processing. Algorithm-specific techniques that exploit the characteristics of the DWT coefficients have been proposed in [32] for JPEG2000. These techniques identify and correct errors by exploiting the fact that DWT outputs at high subbands typically consist of small values and thus contain a small number of non-zero bits in the MSB planes; also, there is a similarity between the magnitudes of neighboring coefficients. Based on these features, an error is flagged when isolated non-zero MSBs are detected in the high frequency bands of the DWT output. These techniques achieve performance close to the error-free curves (only 1 dB degradation in PSNR) at 0.75 bpp. Even for a very high bit error rate, the algorithm-specific scheme can achieve a 7.9 dB performance improvement compared to the no-correction case, a 4 dB to 5 dB improvement compared to the (39, 32) Hamming ECC scheme, and only 2.8 dB degradation compared to the no-error case. The overhead of these techniques is very small; the additional circuits include a 9-bit counter, a 35-bit all-zero detector and a 4-bit comparator.
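The isolated non-zero MSB test lends itself to a compact software model. In the sketch below, the subband shape, the MSB plane index and the 3×3 neighborhood are illustrative assumptions rather than the circuit of [32].

    import numpy as np

    def flag_isolated_msbs(subband, msb_plane=7):
        # A coefficient whose MSB is set while all of its neighbors' MSBs
        # are clear is unlikely in a high subband, so it is flagged as a
        # probable memory-induced bit flip.
        msb = (np.abs(subband).astype(np.int64) >> msb_plane) & 1
        flags = np.zeros(msb.shape, dtype=bool)
        rows, cols = msb.shape
        for r in range(rows):
            for c in range(cols):
                if msb[r, c]:
                    nbrs = msb[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
                    flags[r, c] = (nbrs.sum() == 1)  # only this MSB is set
        return flags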

B. Reducing Number of Computations

The number of computations can be reduced by choosing a smarter algorithm with lower complexity. Examples of this include the Fast Fourier Transform implementation of the Discrete Fourier Transform, differential tree search based vector quantization [3], etc. Alternatively, the number of computations can be reduced at some cost in performance. For instance, in the block matching used in motion estimation, heuristic algorithms such as three step search and diamond search have lower complexity but sub-optimal performance. These search algorithms are used when the performance requirements are not stringent and/or the energy budget is low [19]. For each search algorithm, subsampling can be used to further reduce the number of computations [6], [17], [18].

Fig. 4. Normalized energy and truncation noise as a function of bit-width.

If 1/s is the subsampling ratio, then these schemes reduce the number of absolute difference computations in motion estimation by 1/s and result in significant energy reduction. However, there is a performance cost; for instance, subsampling increases the compression rate by 6.5% and reduces the PSNR by approximately 0.4 dB [18].

An effective way of reducing the number of computations with minimal degradation in system performance is to exploit the fact that different portions of the computation have different levels of significance for the overall system quality. One of the earliest works in this area was for least mean square (LMS) type adaptive filters [36], where the filter length was determined based on the energy consumption and system performance requirements. In DCT implementations, for instance, most of the image energy resides in the low frequency coefficients, and the higher frequency coefficients can be sacrificed when good enough quality is achieved [2], [16]. Similarly, in FIR filtering, larger filter taps contribute more to system performance, so sorting the impulse response of the filter in decreasing order of magnitude and computing on the larger coefficients first helps achieve energy saving with reduced overall quality degradation [16]. Other examples include multilevel DWT, where the coefficients are computed incrementally one bit plane at a time until the desired quality is achieved [20], and salient point detection, where the detection result is refined as the image precision improves [37].
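The significance-first idea for FIR filtering can be sketched as follows; the computation budget and the halting rule are illustrative assumptions.

    import numpy as np

    def significance_first_fir(x, taps, budget):
        # Process taps in decreasing order of magnitude so that the most
        # significant multiplications are done first; the remaining, least
        # significant taps are skipped once the budget is exhausted.
        order = np.argsort(-np.abs(taps))
        y = 0.0
        for k in order[:budget]:
            y += taps[k] * x[k]
        return y

    # Usage: a 9-tap filter evaluated with only its 5 largest taps.
    taps = np.array([0.02, 0.05, 0.12, 0.20, 0.22, 0.20, 0.12, 0.05, 0.02])
    print(significance_first_fir(np.ones(9), taps, budget=5))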

C. Reducing Dynamic Range in Computation

Reducing the datapath precision to lower the power consumption is a popular technique in signal processing systems. Typically, the high order bits contain most of the information while the low order bits capture the details. Fig. 4 illustrates the savings in energy consumption of a 16-bit RCA for different bit widths in 45 nm technology. Since the RCA has a regular structure, the energy reduction is proportional to the bit-width of the adder. For instance, at nominal voltage, we observe a 24% reduction in the energy consumption of the adder when 12 bits are used instead of 16.

One drawback of reduced precision arithmetic is that it introduces truncation errors. Fig. 4 also plots the truncation noise, defined as the magnitude of the difference between the output obtained with full precision data and the output obtained with truncated data, scaled by the full precision output. From Fig. 4, we see that while the truncation noise increases logarithmically, the energy saving of the adder increases linearly with the number of truncated bits.


Low order bit truncation can easily be applied to other multimedia applications such as filtering and the DCT; however, one of the main challenges is to compensate for the quality degradation caused by the reduced precision.

Additional energy saving can be obtained by approximating the computations in the datapath components [38]–[43]. Adders and multipliers that trade off accuracy for lower power consumption by reducing the carry chain have been proposed in [38]–[40]. For instance, the modified Kogge-Stone adder in [38] operates on a shorter critical path, but the errors are not significant since the probability of timing violations with a shorter carry chain is low. The RCA based error-tolerant adder proposed in [39] partitions the carry chain into variable width segments, in which the MSB side has longer segments than the LSB side to reduce the error. The multiplier architectures in [40] truncate the partial product generation at the LSB end, resulting in small truncation noise while achieving significant energy savings.

In addition, the building blocks of the adders and multipliers can be approximated by selectively removing some minterms of their Boolean functions [41]–[43]. For instance, [42] describes a 2×2 under-designed, inaccurate multiplier which is used to implement a larger multiplier for image processing applications; based on the system requirements, a correction term is introduced to reduce the degradation in algorithmic performance. A more general scheme is proposed in [43] to reduce the area of combinational logic for a given error rate threshold. During the synthesis phase, the number of literals used in the logic function is reduced by complementing minterms of the original function.

1) Low Order Bit Truncation: Bit truncation methods that remove low order bits have been very effective for motion estimation [17], [18], [24]. In [24], instead of using 8 bits, only 4 or 5 of the higher order bits are kept to reduce the activity in less important regions. In [18], the performance degradation and the increase in compressed data rate have been studied for low order bit truncation in the motion estimation used in H.264. Fig. 5 illustrates the average degradation over several video sequences for low order bit truncation ranging from 1 bit to 4 bits for the diamond search (DS) and three step search (TSS) strategies [18].

Here the performance metrics are ΔRate, which represents the change in compression rate, and ΔPSNR, which represents the change in PSNR; lower ΔRate and ΔPSNR correspond to better quality. Since motion estimation is based on subtraction of pixel values, the expected performance degradation is not very high; subtraction is more tolerant to truncation noise than addition or multiplication operations. In algorithms whose building blocks are multiplications and additions, such as the DCT and FIR filters, the truncation error has to be compensated to maintain good image quality. Next, we describe the proposed technique for compensating truncation error.

Truncation Errors—Analysis and Compensation: First, we investigate the effect of bit truncation on simple arithmetic operations such as addition, subtraction and multiplication. We describe the error characteristics for operations on unsigned numbers; the procedure can easily be extended to operations on two's complement and signed numbers. Next, we describe a method to reduce the effect of truncation based errors on system quality.

Fig. 5. Average performance loss due to bit truncation using DS and TSS.

The output of a DSP system after LSB truncation at time instant $n$ can be expressed as

$y_T(n) = y(n) - e(n)$

where $y(n)$ is the truncation-free output and $e(n)$ is the truncation induced error (noise), a random variable with mean $m_e$ and variance $\sigma_e^2$. The noise power can be represented by the mean square error (MSE), defined as $E[e^2(n)] = m_e^2 + \sigma_e^2$. In order to reduce the noise power, we propose a method that estimates the mean value of the truncation error during a pre-computation stage and compensates for it. We refer to this method as $m_e$-compensation. The overhead of this method is very small. Moreover, the noise power after $m_e$-compensation no longer depends on $m_e$ and is only a function of the variance of the truncation error.

Let us consider a system whose inputs are originally represented with $N_B$ bits. When $L$-bit truncation is employed, where $L < N_B$, an input $x$ becomes $x_T = x - e_x$. Assuming uniformly distributed input signals, we can express $e_x$, the truncation error for the input signal $x$, as

$e_x = \sum_{i=0}^{L-1} b_i 2^i$

where each $b_i$ is an independent, uniform random variable with two discrete values, 0 and 1. The expected value and variance of $e_x$ are given by

$m_{e_x} = m_b \sum_{i=0}^{L-1} 2^i = (2^L - 1)/2$   (1)

$\sigma_{e_x}^2 = \sigma_b^2 \sum_{i=0}^{L-1} 4^i = (4^L - 1)/12$   (2)

where $m_b = 1/2$ and $\sigma_b^2 = 1/4$ are the mean and variance of $b_i$. Using (1) and (2), we can compute the expected value and variance of the truncation error of an adder with inputs $x$ and $y$. Both inputs are independent, and in both cases the lower $L$ bits (out of $N_B$ bits) have been truncated; since the two input errors are independent, their means and variances simply add.


Fig. 6. Noise power for 16-bit multiplication with and without $m_e$-compensation.

Using a similar analysis, we can compute the expected value and variance for subtraction and multiplication. Details of the calculation for multiplication are given in the Appendix.

Fig. 6 illustrates how the noise power (MSE) of the 16-bit multiplication of unsigned numbers can be reduced with $m_e$-compensation. We see that the analytical results and the simulated results match very closely. Moreover, $m_e$ plays an important role in determining the noise power, and compensating for it helps reduce the MSE by $m_e^2$. Since the noise power grows with $m_e^2$, the proposed method helps in reducing the noise power for computations such as additions and multiplications. It does not help in the case of subtractions, since the $m_e$ of subtraction is 0. Furthermore, we see that the noise power of $(L+1)$-bit truncation with compensation and $L$-bit truncation without compensation are comparable. However, since the overhead of compensation is very small, a system with a larger number of truncation bits has larger energy savings, as will be shown in Section III-C. Next, we illustrate the use of the compensation method to compensate for errors in DCT and FIR filter computation.

Example 1—Discrete Cosine Transform (DCT): The 2-D DCT is typically implemented using 1-D DCTs along the rows (columns) followed by 1-D DCTs along the columns (rows). The 1-D DCT of size 8 that is used in JPEG can be expressed as

$Y_k = \frac{c(k)}{2} \sum_{n=0}^{7} x_n \cos\frac{(2n+1)k\pi}{16}, \quad c(0) = \frac{1}{\sqrt{2}}, \; c(k) = 1 \text{ for } k > 0,$

where the $x_n$'s are input pixels in row or column order and the $Y_k$'s are the corresponding outputs. Typically, the 8-point DCT is computed along the rows and the coefficients are transposed so that the data for the 8-point DCT along the columns can be obtained efficiently. The symmetry properties of the coefficient matrix are used to reduce the number of multiplications; one such method implements the even and odd coefficients with butterflies built from pairwise sums and differences of the input pixels and the constants $c_i = \cos(i\pi/16)$, $i = 1, \ldots, 7$.

Fig. 7 describes the architecture to compute four coefficients of the 8-point DCT used in JPEG. The AND gates at the inputs are used to implement input bit truncation. After pairwise addition and subtraction of the pixels, we obtain the first stage butterfly outputs $s_i = x_i + x_{7-i}$ and $d_i = x_i - x_{7-i}$, $i = 0, \ldots, 3$. For $Y_0$ and $Y_4$, common sub-expression elimination is used to obtain the results with a small number of computation units, as illustrated in Fig. 7(b). The implementation of $Y_2$ is illustrated in Fig. 7(c), a variant of which is used for $Y_6$; Fig. 7(d) shows the computation structure used to find one of the odd coefficients, and the remaining odd coefficients are computed using similar units.

We calculate the truncation noise (TN) for the DCT outputs for a 14-bit fixed point implementation of the DCT, in which 12 bits represent the integer part and the last 2 bits represent the fractional part of the computation. The expected errors due to truncation can be derived as follows. To simplify the analysis, we assume that all input values are uncorrelated, so that the expected truncation error of each input for $L$-bit truncation is $m_e = (2^L - 1)/2$. The expected truncation error of each output is then obtained by propagating $m_e$ through the corresponding additions, subtractions and multiplications of Fig. 7; for the outputs built from pairwise differences, the input error means cancel, so their expected truncation noise is zero.

The expected truncation noise values are used as unbiased estimators to compensate for the errors. Instead of compensating for the errors of all the outputs, we compensate only for the coefficients with the largest expected errors, which are also the most important ones; this keeps the complexity of the overhead circuitry small. Fig. 8 illustrates the compensation mechanism. The overhead of this scheme is a 14-bit adder at the output as well as the AND gates used to disable a selective set of input bits. The area and power overhead due to the extra processing elements is around 2% of the overall DCT implementation.


Fig. 7. Computation of 1-D DCT coefficients. (a) First stage butterfly; (b) $Y_0$ and $Y_4$ computation unit; (c) $Y_2$ unit; (d) odd coefficient unit.

Fig. 8. Modified DCT computation with truncation error compensation.

Fig. 9. Performance comparison between uncompensated and compensated bit truncation for DCT computation of the Baboon image.

Fig. 9 illustrates the performance improvement with the use of the unbiased estimators when low order bits are truncated in the DCT computation of the Baboon image. For a 1 bpp compression rate, 4-bit truncation causes a degradation of 1.3 dB, which is reduced to 0.6 dB with compensation. For the same 1 bpp compression rate, when 6 bits are truncated, the performance improvement is approximately 1.2 dB compared to the system without compensation. Thus, as the truncation level increases, we observe higher performance improvements in systems that use compensation.
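A numerical sketch of this mean-error compensation on the 8-point DCT is given below. It models input truncation only; the fixed point datapath of Fig. 7, which also truncates internally, is outside this toy model.

    import numpy as np

    def dct8(x):
        # Reference 8-point DCT-II as used in JPEG.
        n, k = np.meshgrid(np.arange(8), np.arange(8))
        C = 0.5 * np.cos((2 * n + 1) * k * np.pi / 16)
        C[0, :] /= np.sqrt(2)
        return C @ x

    L = 4
    m_e = (2**L - 1) / 2                  # per-input mean truncation error, from (1)
    x = np.random.default_rng(2).integers(0, 256, size=8).astype(float)
    x_t = np.floor(x / 2**L) * 2**L       # L-bit input truncation

    y, y_t = dct8(x), dct8(x_t)
    y_t[0] += dct8(np.full(8, m_e))[0]    # add the expected error back into Y0
    print(np.abs(y - y_t))                # the Y0 error shrinks, on average

Only $Y_0$ needs a correction in this simplified model: for $k > 0$ the DCT of a constant vector is zero, so pure input truncation introduces no bias into the other coefficients.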

Fig. 10. MAC implementation of a low pass filter with compensation.

Example 2—FIR Low Pass Filter: Consider an FIR low pass filter (LPF) with unsigned inputs and coefficients, as is typical in many multimedia algorithms. The output $y(n)$ of an $N$-tap filter with $N_B$-bit precision is given by

$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$

where $h(k)$ is the $k$'th coefficient of the filter and $x(n-k)$ represents the input value at time $n-k$. Such a computation can be implemented efficiently using MAC based architectures. We can calculate the unbiased estimator for $L$-bit truncation, assuming that the coefficients are less than one:

$\hat{e}_y = E\Big[\sum_{k=0}^{N-1} h(k)\, e_x(n-k)\Big] = m_e \sum_{k=0}^{N-1} h(k).$   (3)

When the filter coefficients are known, the estimator given in (3) reduces to

$\hat{e}_y = m_e H_{LPF}$

where $H_{LPF}$ represents the sum of the filter tap coefficients, $H_{LPF} = \sum_{k=0}^{N-1} h(k)$. As an example, for a 3×3 Gaussian filter, the unbiased estimator value is 3 for 3-bit truncation; this value increases to 15 for 5-bit truncation.


Fig. 11. Histogram plots of (a) AD values, (b) SAD values of all blocks, and (c) selected SAD values for the Football sequence.

Fig. 10 illustrates the block diagram of the proposed MAC based architecture for the LPF. The filter coefficients and input data are truncated using an array of AND gates before the multiplication; thus only the high order bits are active during the computation. After N cycles of MAC computation, the correction factor is applied to reduce the errors due to truncation. The performance results of this filter are given in Section III.
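A behavioral sketch of this architecture follows; the bit widths and the Gaussian kernel are illustrative, and the AND-gate truncation is modeled as integer masking.

    import numpy as np

    def mac_lpf_truncated(x, h_q, L, frac_bits=8):
        # MAC with the L low order bits of inputs and coefficients masked,
        # followed by a single m_e * H_LPF correction, as in Fig. 10. The
        # correction accounts for input truncation only (an approximation).
        mask = ~((1 << L) - 1)
        acc = 0
        for xi, hi in zip(x, h_q):
            acc += (int(xi) & mask) * (int(hi) & mask)
        m_e = (2**L - 1) / 2
        correction = m_e * sum(h_q)       # applied once after N MAC cycles
        return (acc + correction) / 2**frac_bits

    # Usage: 3x3 Gaussian kernel (1,2,1; 2,4,2; 1,2,1)/16 in 8-bit fixed point.
    h_q = [16, 32, 16, 32, 64, 32, 16, 32, 16]
    x = [100, 101, 99, 102, 100, 98, 101, 100, 99]
    print(mac_lpf_truncated(x, h_q, L=3))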

2) High Order Clipped Computation: It is not always the case that the low order bits are the less significant ones and can be dropped. In the sum of absolute difference (SAD) computation used in motion estimation, for instance, the high order bits can be dropped. The scheme proposed in [18] uses the statistics of the absolute difference (AD) and SAD computations to reduce the dynamic range and approximate the computations. Specifically, it exploits the facts that most of the AD values are small due to the locality of the current and reference blocks, and that most of the large AD values belong to blocks that are not likely to be selected, so these values can be approximated. Fig. 11 illustrates the distributions of the AD values, SAD values and selected SAD values for the Football video sequence. From the distributions, we see that the dynamic range of the selected SAD values is significantly lower than the dynamic range when all SAD values are taken into consideration. Thus it is not necessary to treat the MSBs with much care during SAD computation. The scheme in [18] detects large AD values using special logic, and the corresponding SAD values are updated with a correction factor. The resulting architecture has a lower critical path delay compared to the baseline architecture and significantly lower energy consumption. It achieves 37.5% energy reduction at nominal voltage and 68% reduction for iso-throughput, while incurring a 1.8% increase in compressed data size and approximately 1.3 dB reduction in PSNR.
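A sketch of high order clipping in the SAD computation (the clip threshold and the fixed penalty are illustrative assumptions, not the exact correction logic of [18]):

    import numpy as np

    def clipped_sad(cur, ref, clip=63, penalty=128):
        # ADs above `clip` belong to blocks that will almost never win the
        # search, so their exact value matters little; a fixed penalty keeps
        # them ranked behind good matches while the datapath only needs to
        # carry the low order bits of each AD.
        ad = np.abs(cur.astype(np.int64) - ref.astype(np.int64))
        correction = penalty * int((ad > clip).sum())
        return int(np.minimum(ad, clip).sum()) + correction

    rng = np.random.default_rng(3)
    cur = rng.integers(0, 256, (8, 8))
    print(clipped_sad(cur, cur + 2))   # nearby block: small SAD, no penalty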

III. HYBRID CONFIGURATIONS

A combination of the energy saving techniques of Section II helps achieve even higher energy savings, as will be demonstrated in this section.

A. Combining Computation Reduction and Voltage Scaling

Several significance driven techniques, in which the significant components have shorter delay and the less significant components have longer delay, have been proposed in [44]–[47]. These techniques are very effective in reducing the energy consumption without affecting the quality too much. At nominal voltage, all computations meet the critical path constraint, while at scaled voltage levels those computations whose critical path delay exceeds what the operating frequency allows are disabled. For instance, selective deactivation of DCT coefficients based on the operating voltage has been proposed in [44]. Since the low frequency DCT coefficients contain most of the input image energy, they are significant and are implemented with the shortest critical path. It has been shown that 41% to 90% power saving is possible compared to the baseline scheme, with up to 10 dB degradation in PSNR. A similar approach is applied in [45] for color interpolation, where only the less important computations are affected by voltage scaling and process variation. Such a scheme achieves 40% power savings with 5 dB PSNR degradation.

These significance driven techniques have also been applied to support vector machines, which are widely used in data-mining applications [47]. Here the number of support vectors and the number of features per vector are traded off, and voltage over-scaling is used at the circuit level to minimize the energy for a given quality of service. A more general method has been proposed in [48], where the behavior and cost of the computation components are modeled based on their importance to system performance; the supply voltage level for a specific part of the circuitry is determined based on bit significance (profit) and energy cost (investment).

B. Combining Voltage Scaling and Dynamic Range Reduction

Voltage scaling and dynamic range reduction are two complementary energy reduction techniques. While voltage scaling reduces the energy consumption, it increases the delay of the computation unit and can cause timing errors. However, if reduced precision operation is acceptable, the critical path of the computation is shorter and the timing errors due to voltage scaling can be avoided. We illustrate this with the help of a 16-bit adder example and then show the effectiveness of this method in achieving energy reduction with minimal quality degradation for a low pass FIR filter.


Fig. 12. 16-bit RCA. (a) Delay, (b) Energy, (c) Average Error per Operation distribution with and without precision reduction under voltage scaling.

Consider a simple 16-bit RCA implemented in ModelSim with 45 nm PTM models and simulated for uniformly distributed inputs. Fig. 12(a) illustrates the change in critical path delay under voltage scaling; the target delay of the adder at nominal voltage and full precision is indicated by a dashed line parallel to the x-axis. As expected, the critical path delay of the 16-bit adder increases rapidly as the voltage is scaled; for instance, at 0.8 V, the increase in critical path delay is approximately 45% of the target delay.

The increase in critical path delay can be countered by using a lower precision arithmetic unit that has a shorter critical path. For instance, the voltage scaling induced errors due to critical path violations when operating the 16-bit adder at 0.8 V are prevented by truncating the 5 low order bits and operating only on the 11 MSBs. Similarly, critical path violations at 0.7 V are prevented by operating on the 8 MSBs. Fig. 12(b) shows the difference in energy saving between a scheme where only voltage scaling is used and a scheme where a combination of voltage scaling and reduced precision is used. At 0.6 V, voltage scaling alone reduces the energy consumption by 63%, while the combination reduces it by 89%.
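The bit-width selection can be summarized in a few lines. The delay model below, a ripple delay proportional to the active bit-width scaled by a voltage dependent factor, is a simplification for illustration only.

    def bits_to_keep(total_bits, delay_scale, budget=1.0):
        # delay_scale: critical path delay at the scaled voltage relative to
        # the full-width delay at nominal voltage (e.g. ~1.45 at 0.8 V for
        # the RCA above); returns the largest bit-width meeting the budget.
        per_bit = budget / total_bits
        width = int(budget // (per_bit * delay_scale))
        return max(1, min(total_bits, width))

    print(bits_to_keep(16, 1.45))   # 11 MSBs at 0.8 V
    print(bits_to_keep(16, 2.00))   # 8 MSBs (illustrative factor for 0.7 V)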

Next, we analyze the average error induced by voltage scaling alone and by the combined technique. At nominal voltage, both systems operate at full precision, so the average error is zero. At scaled voltage levels, both systems have comparable average error per operation, as shown in Fig. 12(c). For instance, at 0.8 V, the 11-bit adder and the 16-bit adder have the same error per operation, but the 11-bit adder has 20% lower energy (Fig. 12(b)). The effect of VOS on adders has been formulated in [8], [49], which use the internal architecture of the adders to estimate the noise power. Furthermore, the truncation noise can be lowered using the compensation technique described in Section II. Even without compensation, the combined technique achieves much higher energy saving than voltage scaling alone for a comparable error per operation.

We see similar trends in the delay, energy and error performance of more complex adders such as the carry-lookahead adder (CLA).

Fig. 13. 16-bit CLA energy distribution with and without precision reduction under voltage scaling for comparable average error per operation.

Fig. 13 shows the normalized energy as a function of the supply voltage for a 16-bit CLA. We see that the CLA supports more aggressive truncation; for instance, when operated at 0.8 V, it uses only 8 bits (out of 16) and thus achieves higher energy savings. However, the average error per operation of the CLA is typically larger than that of the RCA at the same voltage level, and the RCA tends to have better energy performance for the same error level. This is in agreement with the result presented in [49].

Next, we present the results of this procedure on real image data. Consider processing the Lena image with a 3×3 Gaussian filter using a MAC based architecture with 8-bit precision. The multiplier is implemented using a carry save adder tree, and the final stage is implemented using an RCA. Both the inputs and the filter coefficients are truncated to the same order. Fig. 14(a) shows the noise power (VOS induced plus truncation) versus the normalized energy consumption for various levels of low order bit truncation without compensation. Each point on a curve corresponds to a specific supply voltage level. The noise power is calculated as the mean square error (MSE) between the LPF results obtained with voltage scaling and those obtained with nominal voltage operation. Note that MSE can be converted to PSNR using $\mathrm{PSNR} = 10 \log_{10}(255^2/\mathrm{MSE})$; however, we choose to use MSE here since it provides greater insight into the error performance.


Fig. 14. Noise power versus energy consumption of a 3×3 Gaussian filter on the Lena image under voltage scaling with (a) precision reduction, (b) precision reduction and compensation.

From this figure, we see that the full precision LPF (original) shows a large increase in noise level when the voltage is scaled to 0.9 V, with only about a 20% energy saving. On the other hand, 2-bit truncation operating at 0.9 V has lower noise power and a 45% energy saving. Thus, dynamic precision adjustment with voltage scaling achieves considerably better performance than voltage scaling alone. Next, we study the effect of truncation noise compensation.

Fig. 14(b) illustrates the performance of the LPF when the estimator described in Section II-C1 is applied. For 4-bit truncation operating at 0.9 V, the noise power is reduced by 66% when the compensation unit is used. The overhead is very small, since the compensation unit is activated only 1/N of the time, where N is the number of filter taps. At full precision, we have approximately 5% overhead compared to the original MAC unit because of the final adder illustrated in Fig. 10.

Finally, Fig. 15 shows the Pareto-optimal curves for voltage scaling in combination with truncation, with and without compensation. These curves are generated by connecting the best configurations, shown as circles in Figs. 14(a) and 14(b). We see that the combination scheme always achieves better performance than voltage scaling alone. Furthermore, truncation with compensation achieves higher energy saving for the same noise power; for instance, at a fixed MSE level, the FIR filter using compensation achieves 16% extra energy saving compared to the FIR filter with no compensation.

We repeat the analysis for three different 3×3 Gaussian filters and for two different MAC based architectures, one with an RCA in the final stage and the other with a CLA in the final stage. The MSE improvement is calculated as the difference between the MSE with and without compensation. We use four sample images (Baboon, Lena, Flight, and Pepper) and list the average MSE improvement in Table I. We see that the MAC with the RCA has a slightly higher MSE improvement.

Fig. 15. Performance comparison of a 3×3 Gaussian filter on the Lena image; original and reduced precision filter with and without compensation.

Fig. 16. DCT coefficient deactivation and low order bit truncation for the 1D-DCT coefficients.

While the MSE performance of the two MAC based systems is slightly different, both benefit from the use of this technique.


Fig. 17. Comparison between bit truncation level and DCT coefficient deactivation for the two highest frequency coefficients of the Baboon image.

We compare the proposed combination of voltage scaling and dynamic range reduction with the ANT technique for FIR filtering [4]. The reduced computation block in the ANT system is a filter that uses the 4 MSBs of both the filter coefficients and the input values. It consumes approximately 23% extra energy at nominal voltage; at 0.8 V, the ANT system achieves a 20% energy reduction at a given MSE noise power. In comparison, the proposed technique achieves an 85% energy reduction for the same level of noise power.

C. Combining Computation Reduction and Dynamic Range Reduction

Computation reduction and dynamic range reduction both try to keep the significant computations while removing the less significant portions of the computation. The combination is highly dependent on the quality requirement and the characteristics of the application. We illustrate this method using the DCT as a case study.

Here the combination is based on DCT coefficient deactivation and low order bit truncation. The DCT architecture under consideration is given in Fig. 7. In DCT coefficient deactivation, coefficients are deactivated starting from the highest frequency component of the 1-D DCT, $Y_7$; thus it is not possible to deactivate $Y_6$ without deactivating $Y_7$. In low order bit truncation, the inputs are truncated for the entire computation unit with a granularity of 2 bits. These two techniques are combined in such a way that the performance degradation is minimized. Fig. 16 illustrates the proposed methods for the 14-bit fixed point DCT implementation; the solid red line in Fig. 16 illustrates the scenario in which $Y_7$ is deactivated and 4 low order bits are truncated for the remaining coefficients. The above procedure can be implemented by controlling the AND gates at the inputs of each DCT coefficient computation unit, as illustrated in Fig. 7.

Next, we describe a scheme to combine coefficient deactivation and low order bit truncation. First, we note that there is a crossover point in performance beyond which it becomes better to deactivate a coefficient than to apply more aggressive bit truncation to it. Fig. 17 illustrates the PSNR performance of the Baboon image as a function of the low order bit truncation level for the two highest frequency coefficients, $Y_7$ and $Y_6$.

Fig. 18. Binary decision tree to choose the combination of coefficient deactivation and truncation level.

TABLE I
MSE IMPROVEMENT FOR DIFFERENT GAUSSIAN FILTERS

We see that deactivation of these coefficients becomes more attractive after truncating 7 bits (out of 14). To improve confidence in the crossover point, we investigate the performance of 6 sample images (Baboon, Lena, Flight, Pepper, House and Bridge) and find that it is better to deactivate a DCT coefficient rather than truncate 6 bits. Thus, in our procedure, we limit the low order bit truncation to 4 levels with a granularity of 2 bits, namely 0-bit (no truncation), 2-bit, 4-bit, and 6-bit truncation.

Next, we determine the order in which coefficient deactivation and low order bit truncation are applied using the binary decision tree illustrated in Fig. 18. We start from full precision, and at Level 1 choose between two competing schemes, 2-bit low order truncation and $Y_7$ deactivation, based on PSNR. If 2-bit truncation provides better performance, then we pick that branch, and at Level 2 we choose between 4-bit truncation and $Y_7$ deactivation with 2-bit low order bit truncation.


TABLE II
REDUCTION ORDER OF 6 SAMPLE IMAGES

Fig. 19. Performance of the resulting decision order generated using 6 training samples on test images (a) Lake, and (b) Elaine.

If instead $Y_7$ was deactivated at Level 1, then at Level 2 we choose between $Y_7$ deactivation with 2-bit truncation of all other coefficients, or deactivation of both $Y_6$ and $Y_7$.

Table II lists the reduction order for the 6 images obtained with the binary decision tree method. We read the table from left to right in increasing level number; for example, Level 4 for the Lena image corresponds to deactivating the highest frequency coefficients and truncating the low order bits of all remaining coefficients. Using a majority voter for each level (each column of Table II), we form a general order, which is given in the last row of Table II. In this order, 2-bit low order truncation is followed by $Y_7$ deactivation, after which further low order bit truncation is applied to all the coefficients. Note that, since we use the same computation units for a coefficient pair to minimize circuitry, we do not deactivate one member of a pair unless both members are eligible to be deactivated; thus at Level 7, the lone eligible member of a pair is not deactivated. The reduction order for the first eight levels of the majority voter result is also shown on the right of Table II.
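The level-by-level choice amounts to a greedy search, sketched below; psnr_of is a stand-in for encoding the training images with a candidate configuration, and the toy quality model is for illustration only.

    def next_level(config, psnr_of):
        # config = (num_deactivated, trunc_bits): either truncate two more
        # low order bits or deactivate the next highest frequency coefficient,
        # keeping whichever candidate yields the higher PSNR.
        deact, trunc = config
        candidates = []
        if trunc < 6:                 # truncation is limited to 6 bits
            candidates.append((deact, trunc + 2))
        if deact < 7:                 # Y0 is never deactivated
            candidates.append((deact + 1, trunc))
        return max(candidates, key=psnr_of)

    toy_psnr = lambda c: 40 - 2.0 * c[0] - 0.5 * c[1]   # illustrative only
    cfg = (0, 0)
    for level in range(4):
        cfg = next_level(cfg, toy_psnr)
        print(level + 1, cfg)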

Using Table II, we determine suitable configurations for three PSNR degradation schemes: i) Scheme-I (ΔPSNR ≤ 0.5 dB), ii) Scheme-II (ΔPSNR ≤ 1 dB), and iii) Scheme-III (ΔPSNR ≤ 1.5 dB). Here, ΔPSNR is defined as the reduction in the PSNR value of the modified scheme compared to the baseline scheme. We use 6 sample images (Lena, Pepper, Bridge, Baboon, Flight and House) in our evaluation. For a given quality metric (Q), as used in JPEG [50], we find all configurations that satisfy the ΔPSNR constraint and choose the one that provides the highest saving in computation.

TABLE III
FINAL ORDER FOR SCHEMES I, II, III

We use the majority voter order generated in Table II to determine the priority of a configuration. Table III lists the combination orders for the three schemes.

We test the effectiveness of the combination schemes given in Table III using five test images (Lake, Tank, Elaine, Feather and Boat). Fig. 19 illustrates the results for the Elaine and Lake images. For instance, for the Lake image, Scheme-II (ΔPSNR ≤ 1 dB) results in a PSNR of 32.7 dB, which is only 0.6 dB lower than the original PSNR. Thus the proposed method guarantees that the PSNR constraints are satisfied for Q values from 75 down to 5.

Next, we calculate the power consumption of the original and proposed schemes for different configurations. All multiplications are implemented using carry save adder structures. Table IV lists the active power, latency and area estimates for the configuration order given in Table II, using 45 nm PTM models.


TABLE IV
POWER AND AREA FOR ORIGINAL AND REDUCED COMPUTATION 1D-DCT

Fig. 20. Energy saving at different quality levels for three candidate schemes.

The proposed schemes have a marginal increase in circuit area (2.9%) and leakage (3.6%) compared to the original implementation, due to the extra units used to gate the inputs and compensate the truncation error. Overall, the proposed scheme provides flexible performance with reduced power consumption for different quality requirements. For instance, using the configuration order at Level 8, we save 61% of the power consumption and have 14% extra timing slack compared to the original full precision DCT engine. The timing slack can be absorbed by operating at 0.9 V (instead of 1 V), resulting in a 68% saving in power consumption.

Finally, we present the power savings of Schemes I, II and III for Q values from 75 to 5 in Fig. 20. We see that Scheme III always achieves the highest power saving due to its higher allowable degradation. As we move from the high quality (large Q) region to the low quality (small Q) region, the power savings increase, because the combinations used in the low quality region are quite aggressive in terms of power savings. On average, we achieve 33% power saving for Scheme I, 39% for Scheme II and 46% for Scheme III.

We compare the proposed scheme with the ANT technique [5], the significance driven technique in [44] and the adaptive truncation technique in [23]. The ANT based scheme, which uses a 4-bit MSB replica of the DCT in the reduced computation block, has overhead compared to the original DCT; we found that it achieves 23% energy saving with approximately 5 dB degradation in PSNR. The significance driven technique achieves 47% energy reduction with more than 4 dB degradation in PSNR [44]. The truncation based technique in [23] achieves up to 40% energy saving while inducing approximately a 0.2 reduction in MSSIM, which corresponds to approximately 15 dB loss in PSNR. In contrast, the proposed combination scheme achieves average energy savings of up to 46% with at most 1.5 dB PSNR degradation. Moreover, the proposed scheme provides a mechanism for higher energy saving at low Q settings while keeping the degradation low at high Q settings.

IV. CONCLUSION

In this paper, we presented several general as well as algorithm-specific techniques that trade off energy with system performance for multimedia signal processing algorithms.

We provided an overview of energy-saving techniques such as voltage scaling, reducing the number of computations and reducing the dynamic range. All these techniques introduce errors, which can be compensated by algorithm-level optimizations.

Next, we described several hybrid techniques that further reduce energy consumption while causing little reduction in quality. We investigated the combination of voltage scaling and dynamic range reduction and applied it to a low pass FIR filter. The proposed scheme achieved 85% energy saving for a fairly low noise level. We also studied the combination of computation reduction and dynamic range reduction for the DCT used in JPEG. Simulation results showed, on average, 33% to 46% reduction in energy consumption for a small 0.5 dB to 1.5 dB degradation in system performance. Thus algorithm-level optimizations can help reduce the energy consumption of many multimedia signal processing algorithms with only a mild degradation in quality.

APPENDIX

We present the expected variance of the error due to truncation for unsigned multiplication. The two inputs $X_1$ and $X_2$ are independent, and $L$-bit truncation is applied before multiplication to obtain $\hat{X}_1$ and $\hat{X}_2$. The error is $e = X_1 X_2 - \hat{X}_1 \hat{X}_2$, and its variance expands as

$$\sigma_e^2 = \mathrm{Var}(X_1 X_2) + \mathrm{Var}(\hat{X}_1 \hat{X}_2) - 2\,\mathrm{Cov}(X_1 X_2,\, \hat{X}_1 \hat{X}_2),$$

where $\mathrm{Cov}(\cdot\,,\cdot)$ represents the covariance operation. We use the variance and covariance properties for products of random variables to simplify the above expression.
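The identities in question, under the stated independence of the inputs (written here in our own notation, with $\mu_i = E[X_i]$, $\sigma_i^2 = \mathrm{Var}(X_i)$, and with $(X_1, \hat{X}_1)$ independent of $(X_2, \hat{X}_2)$), are

$$\mathrm{Var}(X_1 X_2) = \sigma_1^2 \sigma_2^2 + \sigma_1^2 \mu_2^2 + \sigma_2^2 \mu_1^2,$$
$$\mathrm{Cov}(X_1 X_2,\, \hat{X}_1 \hat{X}_2) = E[X_1 \hat{X}_1]\,E[X_2 \hat{X}_2] - E[X_1]E[\hat{X}_1]\,E[X_2]E[\hat{X}_2].$$

Substituting these (and the analogous expression for $\mathrm{Var}(\hat{X}_1 \hat{X}_2)$) into the variance expansion yields the expected error variance as a function of the truncation length $L$.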


REFERENCES

[1] S. Ghosh and K. Roy, “Parameter variation tolerance and error resiliency: New design paradigm for the nanoscale era,” Proc. IEEE, vol. 98, no. 10, pp. 1718–1751, Oct. 2010.
[2] S. H. Nawab, A. V. Oppenheim, A. Chandrakasan, J. M. Winograd, and J. T. Ludwig, “Approximate signal processing,” J. VLSI Signal Process., vol. 15, pp. 177–200, 1997.
[3] A. Chandrakasan and R. Brodersen, “Minimizing power consumption in digital CMOS circuits,” Proc. IEEE, vol. 83, no. 4, pp. 498–523, Apr. 1995.
[4] R. Hegde and N. R. Shanbhag, “Soft digital signal processing,” IEEE Trans. VLSI Syst., vol. 9, no. 12, pp. 813–823, Dec. 2001.
[5] B. Shim, S. R. Sridhara, and N. R. Shanbhag, “Reliable low-power digital signal processing via reduced precision redundancy,” IEEE Trans. VLSI Syst., vol. 12, no. 5, pp. 497–510, May 2004.
[6] G. Varatkar and N. R. Shanbhag, “Error-resilient motion estimation architecture,” IEEE Trans. VLSI Syst., vol. 16, no. 10, pp. 1399–1412, Oct. 2008.
[7] N. R. Shanbhag, R. A. Abdallah, R. Kumar, and D. L. Jones, “Stochastic computation,” in Proc. Design Automation Conf., Jun. 2010, pp. 859–864.
[8] Y. Emre and C. Chakrabarti, “Data-path and memory error compensation technique for low power JPEG implementation,” in Proc. Int. Conf. Acoustics, Speech and Signal Processing, May 2011, pp. 1589–1592.
[9] Y. Emre and C. Chakrabarti, “Memory error compensation techniques for JPEG2000,” in Proc. IEEE Workshop Signal Processing Syst., Oct. 2010, pp. 36–41.
[10] I. J. Chang, D. Mohapatra, and K. Roy, “A voltage-scalable & process variation resilient hybrid SRAM architecture for MPEG-4 video processors,” in Proc. Design and Automation Conf., 2009, pp. 670–675.
[11] M. Cho, J. Schlessman, W. Wolf, and S. Mukhopadhyay, “Accuracy-aware SRAM: A reconfigurable low power SRAM architecture for mobile multimedia applications,” in Proc. Asia and South Pacific Design Automation Conf., 2009, pp. 823–828.
[12] Z. Chishti et al., “Improving cache lifetime reliability at ultra-low voltages,” in Proc. Int. Symp. Microarchitecture, Dec. 2009, pp. 89–99.
[13] M. A. Makhzan, A. Khajeh, A. Eltawil, and F. J. Kurdahi, “A low power JPEG2000 encoder with iterative and fault tolerant error concealment,” IEEE Trans. VLSI Syst., vol. 17, no. 6, pp. 827–837, Jun. 2009.
[14] J. Sartori and R. Kumar, “Architecting processors to allow voltage/reliability tradeoffs,” in Proc. Int. Conf. Compilers, Architectures and Synthesis for Embedded Syst., Oct. 2011, pp. 115–124.
[15] K. J. Lin, S. Natarajan, and J. W. S. Liu, “Imprecise results: Utilizing partial computations in real-time systems,” in Proc. Real-Time Syst. Symp., Dec. 1987, pp. 210–217.
[16] A. Sinha, A. Wang, and A. Chandrakasan, “Energy scalable system design,” IEEE Trans. VLSI Syst., vol. 10, no. 4, pp. 135–145, Apr. 2002.
[17] C. Chen et al., “Analysis and architecture design of variable block-size motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, vol. 53, no. 2, pp. 578–593, Feb. 2006.
[18] Y. Emre and C. Chakrabarti, “Low energy motion estimation via selective approximation,” in Proc. IEEE Int. Conf. Application-Specific Systems, Architectures and Processors, Sep. 2011, pp. 176–183.
[19] C. Lian et al., “Power-aware multimedia: Concepts and challenges,” IEEE Circuits Syst. Mag., vol. 7, no. 2, pp. 26–34, 2007.
[20] Y. Andreopoulos and M. van der Schaar, “Incremental refinement of computation for the discrete wavelet transform,” IEEE Trans. Signal Process., vol. 56, no. 1, pp. 140–157, Jan. 2008.
[21] J. Y. F. Tong, D. Nagle, and R. A. Rutenbar, “Reducing power by optimizing the necessary precision/range of floating point arithmetic,” IEEE Trans. VLSI Syst., vol. 8, no. 6, pp. 273–286, Jun. 2000.
[22] J. Park, J. H. Choi, and K. Roy, “Dynamic bit-width adaptation in DCT: An approach to trade-off image quality and computation energy,” IEEE Trans. VLSI Syst., vol. 18, no. 5, pp. 787–793, May 2010.
[23] S. H. Kim, S. Mukhopadhyay, and M. Wolf, “System level energy optimization for error tolerant image compression,” IEEE Embedded Syst. Lett., vol. 2, pp. 81–84, Sep. 2010.
[24] Z. He, C. Tsui, K. Chan, and M. L. Liou, “Low-power VLSI design for motion estimation using adaptive pixel truncation,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 669–678, Aug. 2000.
[25] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proc. Int. Symp. Microarchitecture, Dec. 2003, pp. 7–18.
[26] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S. L. Lu, T. Karnik, and V. K. De, “Energy-efficient and metastability-immune resilient circuits for dynamic variation tolerance,” IEEE J. Solid-State Circuits, vol. 44, no. 1, pp. 49–63, Jan. 2009.
[27] Predictive Technology Model. [Online]. Available: ptm.asu.edu
[28] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, “Modeling of failure probability and statistical design of SRAM array for yield enhancement in nanoscaled CMOS,” IEEE Trans. CAD Integr. Circuits Syst., vol. 24, no. 12, pp. 1859–1880, Dec. 2005.
[29] K. Agarwal and S. Nassif, “Statistical analysis of SRAM cell stability,” in Proc. Design Automation Conf., Sep. 2006, pp. 57–62.
[30] K. Zhang, Embedded Memories for Nanoscale VLSIs. New York, NY, USA: Springer, 2009.
[31] G. K. Chen et al., “Yield-driven near-threshold SRAM design,” in Proc. Int. Conf. Comput. Aided Design, 2007, pp. 660–666.
[32] Y. Emre and C. Chakrabarti, “Techniques for compensating memory errors in JPEG2000,” IEEE Trans. VLSI Syst., vol. 21, no. 1, pp. 159–163, Jan. 2013.
[33] T. N. Rao and E. Fujiwara, Error Control Coding for Computer Systems. Englewood Cliffs, NJ, USA: Prentice Hall, 1989.
[34] J. Kim et al., “Multi-bit tolerant caches using two dimensional error coding,” in Proc. Int. Symp. Microarchitecture, 2007, pp. 197–209.
[35] F. J. Kurdahi, A. Eltawil, K. Yi, S. Cheng, and A. Khajeh, “Low-power multimedia system design by aggressive voltage scaling,” IEEE Trans. VLSI Syst., vol. 18, no. 5, pp. 852–856, May 2010.
[36] M. Goel and N. R. Shanbhag, “Dynamic algorithm transforms for low-power adaptive equalizers,” IEEE Trans. Signal Process., vol. 47, no. 10, pp. 2821–2832, Oct. 1999.
[37] Y. Andreopoulos and I. Patras, “Incremental refinement of image salient-point detection,” IEEE Trans. Image Process., vol. 17, no. 9, pp. 1685–1699, Sep. 2008.
[38] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in Proc. Design, Automation and Test Conf. in Europe, 2008, pp. 1250–1255.
[39] N. Zhu, W. Goh, W. Zhang, K. Yeo, and Z. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Trans. VLSI Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[40] M. D. Ercegovac and T. Lang, Digital Arithmetic. San Mateo, CA, USA: Morgan Kaufmann, 2004.
[41] M. R. Chowdhury and K. Mohanram, “Approximate logic circuits for low overhead, non-intrusive concurrent error detection,” in Proc. Design, Automation and Test Conf. in Europe, 2009, pp. 903–908.
[42] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power in a multiplier architecture,” J. Low Power Electron., pp. 490–501, Dec. 2011.
[43] D. Shin and S. K. Gupta, “Approximate logic synthesis for error tolerant applications,” in Proc. Design, Automation and Test Conf. in Europe, Mar. 2010, pp. 957–960.
[44] G. Karakonstantis, N. Banerjee, and K. Roy, “Process-variation resilient and voltage scalable DCT architecture for robust low-power computing,” IEEE Trans. VLSI Syst., vol. 18, no. 10, pp. 1461–1470, Oct. 2010.
[45] N. Banerjee et al., “Design methodology for low power dissipation and parametric robustness through output quality modulation: Application to color interpolation filtering,” IEEE Trans. CAD Integr. Circuits Syst., vol. 28, no. 8, pp. 1127–1137, Aug. 2009.
[46] D. Mohapatra, G. Karakonstantis, and K. Roy, “Significance driven computation: A voltage-scalable, variation-aware, quality-tuning motion estimator,” in Proc. Int. Symp. Low Power Electron. and Design, 2009, pp. 195–200.
[47] V. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S. T. Chakradhar, “Scalable effort hardware design: Exploiting algorithmic resiliency for energy efficiency,” in Proc. Design Automation Conf., Jun. 2010, pp. 555–560.
[48] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, “Probabilistic arithmetic and energy efficient embedded signal processing,” in Proc. Int. Conf. Compilers, Architectures and Synthesis for Embedded Syst., 2006, pp. 158–168.
[49] Y. Liu and T. Zhang, “On the selection of arithmetic unit structure in voltage overscaled soft digital signal processing,” in Proc. Int. Symp. Low Power Electron. and Design, Aug. 2007, pp. 250–255.
[50] T. Acharya and P.-S. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures. New York, NY, USA: Wiley Inter-Science, 2004.


Yunus Emre (S’10) received the B.E. degree in electrical engineering from Bilkent University, Ankara, Turkey, in 2006, and the M.S. and Ph.D. degrees in electrical engineering from Arizona State University, Tempe, AZ, USA, in 2008 and 2012, respectively.
His research interests include algorithm-architecture co-design for low power multimedia systems, error control for non-volatile and volatile memories, and variation tolerant design techniques for signal processing systems. Recently, he joined Apple Computers, working on system performance analysis of mobile devices.

Chaitali Chakrabarti (S’86–M’89–SM’02–F’11) received the B.Tech. degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1984, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, MD, USA, in 1986 and 1990, respectively.
She is a Professor with the Department of Electrical, Computer and Energy Engineering, Arizona State University (ASU), Tempe, AZ, USA. Her research interests include the areas of low power embedded systems design including memory optimization, high level synthesis and compilation, and VLSI architectures and algorithms for signal processing, image processing, and communications.
Prof. Chakrabarti was a recipient of the Research Initiation Award from the National Science Foundation in 1993, a Best Teacher Award from the College of Engineering and Applied Sciences, ASU, in 1994, and the Outstanding Educator Award from the IEEE Phoenix Section in 2001. She served as the Technical Committee Chair of the DISPS subcommittee, IEEE Signal Processing Society (2006–2007). She is now an Associate Editor of the Journal of VLSI Signal Processing Systems and the IEEE Transactions on Very Large Scale Integration Systems.