[IEEE 2012 25th International Conference on VLSI Design, Hyderabad, India, 7-11 January 2012]
Real-Time, Content Aware Camera–Algorithm–Hardware Co-Adaptation for
Minimal Power Video Encoding
Joshua W. Wells, Jayaram Natarajan, and Abhijit Chatterjee
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA
Email: {josh.wells, jayaram, chat}@gatech.edu
Irtaza Barlas
Impact Technologies, Atlanta, GA
Email: [email protected]
Abstract—In this paper, a content aware, low power video encoder design is presented in which the algorithms and hardware are co-optimized to adapt concurrently to video content in real time. Natural image statistical models are used to form spatiotemporal predictions about the content of future frames. A key innovation in this work is that the predictions are used as parameters in a feedback control loop to intelligently downsample (change the resolution of the frame image across different parts of the image) the video encoder input immediately at the camera, thus reducing the amount of work required by the encoder per frame. A multiresolution frame representation is used to produce regular data structures which allow for efficient hardware design. The hardware is co-optimized with the algorithm to reduce power based on the reduced input size resulting from the algorithm. The design also allows for selectable, graceful degradation of video quality while reducing power consumption.
Index Terms—adaptive video encoding, low-power DSP, self-aware
I. INTRODUCTION
The goal of video encoding is to reduce the amount of
redundant information in a video signal producing a more ef-
ficient, high-entropy signal. Modern video encoding standards
such as H.264 and MPEG-4 are capable of producing excellent
compression, but at the cost of high power consumption in
the video encoder. Performing real-time video encoding with
limited energy supply is a major issue, particularly for mobile
camera systems. The presented work is capable of overcoming
the problems associated with a limited energy system by
using predictive video encoding (PVE). PVE uses previously
encoded information to form a model and predict how future
information in the video should be encoded. This reduces the
cost associated with analyzing the signal to expose redundant
information.
A. Prior Work
Prior work in low-power video encoding has concentrated
on improving the energy efficiency of specific processing
elements (PEs) within the video encoder system. The motion
This material is based upon work supported by DARPA project W31P4Q-09-C-0538 and partially supported by the National Science Foundation under Grant No. CCR-0834484.
estimation (ME) kernel (described in Section II) typically
accounts for 60-80% of the computational load of a video
encoder [1]. Accordingly, prior works in low-power video
compression have focused on the ME kernel or other in-
dividual PEs to conserve energy [2], [3], [4], [5], [6], [7],
[8], [9], [10]. Notably, in [8], an error-tolerant, voltage over-
scaled ME design is proposed. Past work has been effective at
either reducing the workload of the ME kernel or reducing its
voltage directly at the expense of video precision. Although
meaningful work has been done to increase performance of
the various PEs of a video encoder, little progress has been
made to reduce the power consumption of the encoder as a system.
B. Novel Contributions
The proposed design reduces the power consumed by the entire video compression system by extending algorithm control all the way to the camera sensor itself. Rather than encoding all data captured by the camera, raw data is pre-compressed at the sensor without direct knowledge of the
spatial redundancies that may be present in the frame, resulting
in a smaller signal being presented to the core encoding
system. There are three major contributions described in this
paper:
• Pre-compression Camera Sensor
• Video Input Feedback Controller
• Dynamic Voltage and Frequency Scaling Control
Integrated into the typical video encoder design as shown in
Fig. 1, each of these elements performs an important role in
pre-compressing a video signal and reducing system power consumption.
Pre-compression of the signal is performed directly on
the camera sensor by sensing and transforming the signal
together [11]. Instead of the camera producing a typical full-
resolution frame, it outputs a multi-resolution version of the
frame referred to as a pre-compressed frame (PCF). Static
regions of a video such as background are sensed with lower
resolution while more dynamic regions are sensed with higher
resolution without loss of information or loss of detail in the decoded video sequence.
2012 25th International Conference on VLSI Design
1063-9667/12 $26.00 © 2012 IEEE
DOI 10.1109/VLSID.2012.78
Fig. 1. Proposed video encoder design depicting pre-compression camera sensor (PCCS), motion estimation (ME), motion compensation (MC), discrete cosine transform (DCT), inverse DCT, quantizer (Q) and inverse quantizer
The video input feedback controller (VIFC) dynamically
predicts the spatial redundancy of future frames based on
previous, encoded frames. The control signal from the VIFC
to the camera sensor indicates which regions of the next frame
are likely to be redundant and should be sensed with lower
resolution.
The dynamic voltage and frequency scaling control
(DVFSC) is responsible for controlling the voltage and clock
frequency of all core PEs which are shown in the blue regions
of Fig. 1. The DVFSC changes the dynamic voltage and
frequency scaling (DVFS) of each PE before each frame is
encoded. This is possible and necessary because the timing
requirements for each frame change based on its content. The
more redundant the information in a frame is, the smaller the
size of the PCF is. With less information to process by the core
PEs of the encoder, the slower the information can be processed
and still meet the timing restrictions of the frame rate. Each PE
is scaled so that the video encoder takes just under the frame
period to encode each PCF. The result is a video encoder that
is capable of large reductions in power based on the content
of the video.
C. Paper Overview
In Section II, a basic video encoder design will be reviewed
to help explain the details of the proposed design. Section III
describes the method used to predictively decrease the
information coming from the sensor. Section IV details the
video input feedback controller that drives this prediction.
Section V explains how the system power consumption is
reduced using DVFS. Sections VI and VII describe the methods
used to evaluate the proposed design and the conclusions drawn.
II. VIDEO ENCODER OVERVIEW
The general process of video encoding consists of first
removing temporally redundant information, followed by re-
moving spatially redundant information, using temporal and
spatial models, respectively. Rather than processing frames as
a whole, the frames are divided into equally sized blocks of
adjacent pixels and the blocks are processed independently.
As shown in Fig. 1, temporally redundant information is
removed by performing subtractions on each block of data
being encoded to reveal what has changed from the previous
frame. The resulting residual blocks are passed to the spatial
model. Spatial redundancy is reduced by performing a 2-
D discrete cosine transform (DCT) based transform on the
residual blocks. It is likely that a contiguous group of the
highest frequency coefficients in the DCT domain will be near
zero and can be neglected. Finally, a selected type of variable-
length coding (VLC) is used for the DCT coefficients. VLC is
an extensive topic in coding theory that will not be addressed
in this paper.
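As a sketch of the flow just described, the following Python processes a single block: subtract the previous-frame reference to remove temporal redundancy, transform the residual, then drop the high-frequency tail. The transform is left as a parameter and the keep-count of four coefficients is an illustrative assumption, not a value from any standard.

```python
def encode_block(block, reference, transform, keep=4):
    """Remove temporal redundancy (subtract the reference), then spatial
    redundancy (transform and drop the high-frequency tail)."""
    residual = [b - r for b, r in zip(block, reference)]
    coeffs = transform(residual)
    return coeffs[:keep]  # highest-frequency coefficients discarded

# With an identity "transform" the truncation behaviour is easy to see
out = encode_block([5, 5, 6, 9, 5, 5, 5, 5], [5] * 8, transform=list)
assert out == [0, 0, 1, 4]
```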
The ME kernel shown in Fig. 1 searches areas of the
previous frame for a group of pixels that is highly correlated
to the current block being encoded. This search is necessary
to account for moving objects in the video. Each block in
the frame being encoded is processed independently by the
ME kernel. Given the sheer volume of data, this is the most
computationally intense operation of the entire encoder [1].
The location of the best matching area from the previous frame
is described with a simple motion vector. The motion
compensation (MC) kernel uses this vector to produce the prediction. A
residual block is produced by subtracting the prediction from
the raw data of the block being encoded. The residual block
is passed on to the spatial model.
The spatial model transforms/quantizes each residual block
of data into the frequency domain using a DCT based trans-
formation. Typically, a portion of the highest frequency
coefficients are very close to zero after transformation and quantization
and are considered insignificant. The insignificant coefficients
are discarded. It is this insignificant high-frequency informa-
tion that the proposed design eliminates at the video sensor
itself. An inverse quantization and inverse DCT kernel are also
required in the spatial model to provide accurate reference
information for future frames to be encoded. The decoded
information is stored in the ref. buffer of Fig. 1. The stored
result is the exact data that is produced by the decoder and
used as a reference by both encoder and decoder.
III. VIDEO SENSOR PRE-COMPRESSION
After a given block has been processed by the temporal
model and the spatial model, the redundant data is revealed
as insignificant high-frequency coefficients. The elimination
of these coefficients is equivalent to applying a spatial low-
pass filter to the residual block. The effect of typical video
encoding can be described by
X[k, l] = DCT{b[m, n] − b′[m, n]} · H[k, l]   (1)

where X, b, b′, and H are the compressed block, raw input
block, reference block, and low-pass filter, respectively. The
same result can be achieved by first decimating the raw input
from the camera as represented by
X[k, l] = DCT{b[m, n] ∗ h[m, n] − b′[m, n] ∗ h[m, n]}   (2)

where h[m, n] is the spatial-domain counterpart of the filter H[k, l].
To perform the pre-compression before the signal is passed
through the temporal and spatial models, an estimate of the
magnitude of the DCT coefficients is required which will be
covered in Section IV.
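The equivalence of (1) and (2) is a consequence of DCT linearity: masking high-frequency coefficients of the transformed residual equals transforming the residual of low-pass-filtered inputs. A small numerical check, under the assumption that the filter h is realized as an ideal mask H in the DCT domain, using 1-D 8-point blocks for brevity:

```python
import math

def dct(x):
    """Orthonormal 1-D DCT-II of a length-N sequence."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        out.append((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)) * s)
    return out

def idct(X):
    """Inverse of the orthonormal DCT-II above (DCT-III)."""
    N = len(X)
    out = []
    for n in range(N):
        s = X[0] / math.sqrt(N)
        s += sum(math.sqrt(2.0 / N) * X[k] * math.cos(math.pi * (n + 0.5) * k / N)
                 for k in range(1, N))
        out.append(s)
    return out

# Raw input block b and reference block b' (hypothetical pixel values)
b  = [52, 55, 61, 66, 70, 61, 64, 73]
bp = [50, 54, 60, 64, 68, 60, 62, 70]

# H: keep only the 4 lowest-frequency coefficients (ideal DCT-domain low-pass)
H = [1, 1, 1, 1, 0, 0, 0, 0]

# Eq (1): transform the residual, then mask the high frequencies
X1 = [c * h for c, h in zip(dct([x - y for x, y in zip(b, bp)]), H)]

# Eq (2): low-pass filter each signal first, then subtract and transform
lp = lambda x: idct([c * h for c, h in zip(dct(x), H)])
X2 = dct([u - v for u, v in zip(lp(b), lp(bp))])

# By linearity of the DCT the two paths agree
assert max(abs(a - c) for a, c in zip(X1, X2)) < 1e-9
```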
Using CMOS separable transform image sensor (CSTIS),
the pre-compression of blocks can be moved all the way
to the input sensor itself [11]. Pre-compressed blocks that
have been sensed in this way are already described in the
DCT domain. This means the typical digital DCT function
of the encoder can be bypassed. However, a consequence of
sensing a pre-compressed block in the DCT domain is that
normal ME and MC are no longer possible. This is of little
consequence because as will be shown in Section IV, pre-
compressed regions of frames are relatively static and can be
assumed to have null motion vectors.
Non-contributing high-frequency content is removed before
the video signal is passed to the temporal model by sensing the
signal in the DCT domain for low frequency coefficients only.
If a sensed block is allowed to have a completely variable
number of low frequency coefficients included, blocks of
varying sizes will be produced which requires an undesirable,
complicated design for the PEs of the encoder. To mitigate
Fig. 2. Multiresolution quadtree structure
this problem, effective decimation of blocks is limited to a
quadtree structure as shown in Fig. 2 allowing the PEs to
remain unchanged. For a block to be effectively decimated
by the smallest increment, it must be concatenated with three
of its neighbors according to the quadtree structure and then
decimated by a factor of 2. For example, in Fig. 2, blocks A,
B, E, and F can be combined and decimated by a factor of
2 to yield block Q. Higher levels of decimation are allowed
as long as they adhere to the quadtree structure meaning that
all sibling blocks must be collapsed to a single parent. For
example, in Fig. 2, blocks A-P can be combined and decimated
by a factor of 4 to yield block U. The quadtree structure allows
the input frame to be represented as a multi-resolution image
where areas predicted to have more high-frequency residue
can be represented with more resolution and other areas with
less as illustrated by Fig. 3. Each block bounded by the black
lines in the figure represents the same amount of information
– a single block. To make viewing easier, the blocks have
been interpolated to fill the whole area they represent. Blocks
that occupy a larger area like the one in the top left quadrant
contain less detailed information.
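The smallest decimation increment described above can be sketched as follows: four NxN sibling blocks are concatenated into one 2Nx2N region and decimated by a factor of 2 back to a single NxN block. The 2x2 averaging kernel and the 4x4 block size are illustrative choices, not taken from the paper.

```python
def concat_quad(a, b, e, f):
    """Concatenate four NxN sibling blocks into one 2Nx2N parent region."""
    top = [ra + rb for ra, rb in zip(a, b)]
    bot = [re + rf for re, rf in zip(e, f)]
    return top + bot

def decimate2(block):
    """Decimate by a factor of 2 via 2x2 averaging (one simple choice of h)."""
    n = len(block)
    return [[(block[2*i][2*j] + block[2*i][2*j+1] +
              block[2*i+1][2*j] + block[2*i+1][2*j+1]) / 4.0
             for j in range(n // 2)] for i in range(n // 2)]

# Hypothetical constant-valued 4x4 blocks named after Fig. 2's top-left quadrant
N = 4
A = [[1.0] * N for _ in range(N)]
B = [[2.0] * N for _ in range(N)]
E = [[3.0] * N for _ in range(N)]
F = [[4.0] * N for _ in range(N)]

Q = decimate2(concat_quad(A, B, E, F))  # four blocks collapse to one NxN block
assert len(Q) == N and len(Q[0]) == N
assert Q[0][0] == 1.0 and Q[0][N-1] == 2.0 and Q[N-1][0] == 3.0
```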
Fig. 3. Hypothetical multi-resolution image
IV. VIDEO INPUT FEEDBACK CONTROLLER
The pre-compression camera sensor (PCCS) described in
Section III requires a control signal indicating the quadtree
structure that should be applied to the next frame. This is
accomplished by analyzing the spectrum of each encoded
block in the current PCF. A model is fit to the spectrum of
the block that predicts the amount of extra detail or missing
detail in the residual block. Based on the model parameters, a
decision is made to increase, decrease, or keep the resolution
level associated with the corresponding area of the next frame.
The power spectrum of a natural image can be modeled
using an exponential function of the form
|P_X(jω)| = β(ω + 1)^α   (3)
where β and α are the parameters for the model and ω is
the polar spatial frequency [12]. The expected magnitude of
the spectrum decays, following a power law with α < 0, with increasing
polar frequency. Once the two parameters for the model are
determined, the spectrum can be extended beyond sampled
frequencies to determine how much error is predicted to exist
in each block. The error can be measured in peak signal-to-
noise ratio (PSNR). The VIFC attempts to keep the PSNR of
the signal at a specified level. By pre-computing a family of
curves and storing it in a look-up table, power used to fit the
model parameters can be made negligible. The goal of the
image resolution control kernel is to minimize overall resolution
on the input image while maintaining a given PSNR.
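One way the VIFC could use the model in (3) is to extrapolate the fitted curve past the sampled frequencies and estimate the energy lost by dropping those coefficients. The parameter values and band edges below are assumptions for illustration; the paper obtains the fit from a precomputed family of curves in a look-up table.

```python
def model_magnitude(w, beta, alpha):
    """Eq. (3): expected spectral magnitude at polar frequency w (alpha < 0)."""
    return beta * (w + 1.0) ** alpha

def predicted_band_energy(beta, alpha, w_lo, w_hi):
    """Predicted coefficient energy over polar frequencies (w_lo, w_hi]."""
    return sum(model_magnitude(w, beta, alpha) ** 2
               for w in range(w_lo + 1, w_hi + 1))

# Assumed fit parameters and band edges (hypothetical values)
beta, alpha = 100.0, -2.0
tail = predicted_band_energy(beta, alpha, w_lo=4, w_hi=16)  # dropped band
full = predicted_band_energy(beta, alpha, w_lo=0, w_hi=16)  # all frequencies
fraction_lost = tail / full  # drives the raise/keep/lower resolution decision
```

The PSNR implied by `fraction_lost` would then be compared against the controller's target (e.g. 35 dB) to decide whether the corresponding region's resolution should increase, decrease, or stay the same.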
V. SYSTEM DYNAMIC VOLTAGE AND FREQUENCY SCALING
The PCCS modulates the size of the signal for each frame
presented to the video encoder. The video encoder also has
a fixed frame rate. Since the size of the input signal is
modulated, but the amount of time needed to process the
signal is static, the video encoder may process smaller frames
(less information) at a slower speed without error. Next, DVFS
is used to scale back the power consumption of the encoder
for all blocks (except for the control and decimation blocks).
When the voltage and frequency are scaled down, PE delay
time scales up. Based on the hardware implementation of
the encoder, a table is computed with optimal voltage and
frequency values for different frame sizes.
The DVFSC kernel is responsible for setting the proper volt-
age and clock frequency for each PE in the video encoder just
prior to the encoding of a frame. Transistor-level simulations
of each processing element reveal the relationship between
scaled voltage, frequency, and delay on the critical path. In
each PE, a static amount of work is required for each block
processed. For a given critical path delay, the amount of time
required to process a single block is known. The VIFC sends a
control signal to the DVFSC indicating the number of blocks
to be encoded in the next frame. The DVFSC uses a lookup
table to translate the block count into a voltage and frequency
level for each of the PEs and sets the voltage just prior to the
encoding of the frame. Essentially, voltage and frequency are
scaled down as the block count is scaled down from its largest
count (full resolution).
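The look-up step might be sketched as below; the table entries are hypothetical, since in the paper they come from transistor-level characterization of each PE's critical path.

```python
DVFS_TABLE = [
    # (max block count, core voltage in V, clock in MHz) - assumed values
    (1024, 0.7, 100),
    (2048, 0.8, 150),
    (3072, 0.9, 200),
    (4096, 1.0, 250),  # full resolution: nominal voltage and frequency
]

def dvfs_for_frame(block_count):
    """Translate the VIFC's block count for the next frame into the lowest
    (voltage, frequency) pair that still meets the frame-rate deadline."""
    for max_blocks, volts, mhz in DVFS_TABLE:
        if block_count <= max_blocks:
            return volts, mhz
    return DVFS_TABLE[-1][1:]  # clamp to nominal operation

assert dvfs_for_frame(900) == (0.7, 100)
assert dvfs_for_frame(4096) == (1.0, 250)
```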
VI. DESIGN ANALYSIS
Three topics must be addressed regarding the validation of
the proposed design. First, the VIFC prediction accuracy must
be measured. If the VIFC is not capable of producing valid
predictions, either power consumption or video quality will
suffer. Second, the video quality must be analyzed. When
a less than nominal resolution level is predicted, error will
be introduced into the encoded signal which will affect the
quality of the decoded sequence. Lastly, the power savings
from the proposed technique must be evaluated. The savings
will depend on the actual video content, so multiple test video
sequences are used.
A. VIFC Performance
As discussed in Section IV, the VIFC predicts how much
information in a transformed block is missing or alternatively,
unnecessary. To evaluate the accuracy of the predictor, a
sampling of frames is taken from the test video sequences
and indiscriminately decimated by a factor of 2, meaning
that groups of four blocks are combined into a single block.
Prediction blocks are subtracted from the decimated blocks
and the resulting residual blocks are transformed and passed
to the VIFC. The VIFC predicts the amount of missing
information in each residual test block. The directly computed
(correct) amount of error in each block is compared to the
predicted error, measured in PSNR as defined by
PSNR = 10 · log10(MAX_I^2 / MSE)   (4)
Fig. 4. Prediction accuracy for decimation by a factor of 2 (histogram of prediction error in PSNR)
MSE = (1/MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} |x1[m, n] − x2[m, n]|^2

where MAX_I is the maximum unsigned integer possible using
8 bits; x1[m, n] and x2[m, n] are the two signals being
compared. The difference between the two values represents
the prediction error. A histogram of the VIFC performance is
shown in Fig. 4. The results indicate that the proposed VIFC
design is valid.
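Eq. (4) is straightforward to compute; a minimal sketch for 8-bit signals (so MAX_I = 255), using 1-D sequences and hypothetical sample values:

```python
import math

def mse(x1, x2):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2)) / len(x1)

def psnr(x1, x2, max_i=255):
    """PSNR in dB per Eq. (4); max_i = 255 for 8-bit unsigned pixels."""
    e = mse(x1, x2)
    return float('inf') if e == 0 else 10.0 * math.log10(max_i ** 2 / e)

# e.g. a constant error of 16 gray levels gives MSE = 256
assert abs(psnr([0, 0, 0, 0], [16, 16, 16, 16]) - 24.05) < 0.01
```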
B. Video Quality
Well known test video sequences in the signal processing
community were used to evaluate the proposed design in
terms of quality. Each raw test sequence was dynamically
pre-compressed, simulating the method described in Section
III. The reduced signal was processed as depicted in Fig. 1
with the exception of bypassing the quantizer. The quantizer
is bypassed to avoid mistaking quantization error for pre-
compression error. Each frame is decoded and evaluated
against the original, raw test sequence. In each test signal, the
VIFC is tasked with predicting the resolution for each region
of the frame that will maintain a PSNR of at least 35dB. The
resulting PSNR of the test sequences are shown in Fig. 5. The
VIFC does an excellent job at maintaining the required signal
quality. It should also be noted that modern video encoding
methods are capable of producing an encoded sequence with a
quality similar to the proposed encoder, but no power savings
are achieved with modern methods.
C. Power Savings
Modern video compression standards do not specify how a
signal should be encoded. Instead, they specify what criteria
the encoded signal must meet. Therefore, many different
hardware implementations have been developed, and as a
result, it is very difficult to quantify power savings in terms
of Watts. Instead, for the purposes of this research, estimated
power savings are described in terms of the amount of power
that can be reduced compared to nominal operation.
Fig. 5. Decoded video test sequence quality (PSNR in dB vs. frame index) for the city, crew, harbour, ice, and soccer sequences
Fig. 6. Information content of each pre-processed frame relative to non-pre-processed information content (relative frame size vs. frame index) for the city, crew, harbour, ice, and soccer sequences
Power savings starts with pre-compression. Pre-compression
alone is capable of saving dynamic power by reducing the dig-
ital signal size and therefore reducing the amount of switching
in the circuit for each frame. The extent of the reduction in
switching depends on the content of the video. If there is a
large amount of temporal and spatial redundancy in the signal,
the VIFC will perform well and pre-compression will also be
effective, reducing the necessary switching in the digital circuit. For each
frame in the test sequences, the size of the pre-compressed
signal is shown relative to the size of the typical, non-pre-
compressed signal. Fig. 6 shows relative frame sizes. Due to
the difference in the content of each sequence, a different
amount of work is required for different frames while the
actual quality of each encoded signal remains constant as
shown in Fig. 5.
Although the reduced signals result in reduced switching
which decreases the dynamic power consumption of the
system, power utilization is not maximized until DVFS is
used to reduce the voltage for each PE. As described in
Section V, fixed frame rates allow the proposed encoder to
perform DVFS on each PE in the encoder. Since there is
no standardized encoder design, nominal power savings using
Fig. 7. Estimated dynamic power savings (relative power consumption vs. frame index) for the city, crew, harbour, ice, and soccer sequences
DVFS for dynamic power are generalized to an estimated
power relative to typical non-adaptive encoding. The relative
frame size is defined as
γ = S / S_max   (5)
where S is the frame size of an adaptively acquired frame
as shown in Fig. 6, and Smax is the size of the frame if it
were acquired normally (full resolution). The relative dynamic
power of an adaptive encoder can now be computed as
P_Rdyn = ((Vγ)^2 / V^2) · (fγ / f) = γ^3   (6)
where V is the nominal voltage of a non-adaptive encoder
and f is the average switching frequency. From (6), estimated
power results for the sample video sequences are computed.
The results are shown in Fig. 7. The presented estimates are
ideal. They do not account for incremental steps in DVFS, but
as the number of allowed levels is increased, the savings
should approach the computed estimates.
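Eqs. (5) and (6) reduce to a one-line computation; a sketch with an assumed half-size frame:

```python
def relative_dynamic_power(frame_size, full_frame_size):
    """Eqs. (5)-(6): relative frame size gamma = S / S_max, and relative
    dynamic power ((V*gamma)^2 / V^2) * (f*gamma / f) = gamma^3."""
    gamma = frame_size / full_frame_size
    return gamma ** 3

# A frame pre-compressed to half its full size ideally needs 1/8 the power
assert relative_dynamic_power(2048, 4096) == 0.125
```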
VII. CONCLUSION
The presented theory and results demonstrate the huge
potential in power savings for real-time embedded video
encoding systems. But to achieve such large savings, the
system needs to be considered as a whole. Improving the
efficiency of each PE in isolation has reached its limits. Although
the overall success of the video encoder depends on power-
efficient operations, the ultimate savings come from reducing
the number of operations needed throughout the entire system. For
a real-time video encoder to truly use minimal power, it must
adapt to the content of the video being processed. All parts of
the video encoder system (sensors, algorithm, and hardware)
must be able to adapt. The proposed design is capable of
dynamically scaling the amount of work performed and power
consumed throughout the system so that just enough energy
is used to meet the required goals.
REFERENCES
[1] P. Kuhn, Algorithms, Complexity Analysis, and VLSI Architectures for MPEG-4 Motion Estimation. Kluwer Academic Publishers, 1999.
[2] T. Chen, Y. Huang, and L. Chen, "Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC," in Proc. ICASSP, vol. 5, 2004, pp. 9–12.
[3] M. Kim, I. Hwang, and S. Chae, "A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264," in Proceedings of the 2005 Asia and South Pacific Design Automation Conference. ACM, 2005, p. 634.
[4] S. Yap and J. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 51, no. 7, pp. 384–389, 2004.
[5] H. Nakayama, T. Yoshitake, H. Komazaki, Y. Watanabe, H. Araki, K. Morioka, J. Li, L. Peilin, S. Lee, H. Kubosawa et al., "An MPEG-4 video LSI with an error-resilient codec core based on a fast motion estimation algorithm," in 2002 IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, vol. 1, 2002.
[6] D. Mohapatra, G. Karakonstantis, and K. Roy, "Significance driven computation: a voltage-scalable, variation-aware, quality-tuning motion estimator," in Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, 2009, pp. 195–200.
[7] H. Cheong, I. Chong, and A. Ortega, "Computation error tolerance in motion estimation algorithms," in IEEE International Conference on Image Processing (ICIP'06), 2006.
[8] G. Varatkar and N. Shanbhag, "Energy-efficient motion estimation using error-tolerance," in Proceedings of the 2006 International Symposium on Low Power Electronics and Design (ISLPED'06), Oct. 2006, pp. 113–118.
[9] G. V. Varatkar and N. R. Shanbhag, "Variation-tolerant motion estimation architecture," in 2007 IEEE Workshop on Signal Processing Systems, Oct. 2007, pp. 126–131.
[10] G. Varatkar and N. Shanbhag, "Error-resilient motion estimation architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 10, pp. 1399–1412, Oct. 2008.
[11] R. Robucci, J. Gray, L. Chiu, J. Romberg, and P. Hasler, "Compressive sensing on a CMOS separable-transform image sensor," Proceedings of the IEEE, vol. 98, no. 6, pp. 1089–1101, 2010.
[12] A. van der Schaaf and J. van Hateren, "Modelling the power spectra of natural images: statistics and information," Vision Research, vol. 36, no. 17, pp. 2759–2770, 1996.