Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Paper ID # J-STSP-VCHB-00081-2013 1
Abstract—Tiles is a new feature in the High Efficiency Video
Coding (HEVC) standard that divides a picture into independent,
rectangular regions. This division provides a number of
advantages. Specifically, it increases the “parallel friendliness” of
the new standard by enabling improved coding efficiency for
parallel architectures, as compared to previous slice-based
methods. Additionally, tiles facilitate improved maximum
transmission unit (MTU) size matching, reduced line buffer
memory, and additional region-of-interest functionality. In this
paper, we introduce the tiles feature and survey the performance
of the tool. Coding efficiency is reported for different
parallelization factors and MTU size requirements. Additionally,
a tile-based region of interest coding method is developed.
Index Terms—Video coding, Multicore processing, High
efficiency video coding, Tiles.
I. INTRODUCTION
THE ISO/IEC’s Moving Picture Experts Group (MPEG)
and the International Telecommunications Union’s (ITU-
T) Video Coding Experts Group (VCEG) have recently
concluded work on the first edition of the High Efficiency
Video Coding (HEVC) standard [3][4][5]. This standard was
developed collaboratively by the Joint Collaborative Team on
Video Coding (JCT-VC). For consumer applications, HEVC
has been reported to achieve 50% improvement in coding
efficiency when compared to previous coding standards such
as MPEG-4 AVC/ITU-T H.264 [1][5]. These coding gains are
achieved through a number of improvements that result in an
increase in computational complexity for both encoder and
decoder.
Manuscript received January 30, 2013; revised May 10, 2013; camera-ready version submitted June 21, 2013.
††Kiran Misra and Andrew Segall are with Sharp Laboratories of America, Inc., 5750 NW Pacific Rim Blvd, Camas, WA 98607, USA (e-mail: {misrak,asegall}@sharplabs.com).
†Michael Horowitz and Shilin Xu are with eBrisk Video, Inc., Suite 1450, 1055 West Hastings Street, Vancouver, BC, V6E 2E9, Canada (e-mail: {michael,shilin}@ebriskvideo.com).
‡Arild Fuldseth is with Cisco Systems, Oslo, Norway (e-mail: [email protected]).
‡‡Minhua Zhou is with Texas Instruments Inc., 12500 TI Blvd, Dallas, TX 75243, USA (e-mail: [email protected]).
Here, computational complexity refers to a combination of algorithmic operations and memory transfers. Algorithmic
operations correspond to the calculations required in a decoder
to convert bit-stream information to reconstructed pixel values
or in an encoder to convert the original pixel values to a bit-
stream. For hardware, this corresponds to logic gates; for
software, this corresponds to calculations on a CPU, GPU, or
other processing units. Memory transfers represent the amount
of data that must be stored and accessed to perform the
required calculations. Typical architectures contain multiple
memory types, ranging from high speed memory that is on-
chip (including caches near a CPU core) to lower speed
memory that is off-chip or farther from the core. In general,
on-chip memory is more expensive and therefore relatively
small. Additionally, for many architectures, the critical
bottleneck is the bandwidth necessary to transfer data from
off-chip to on-chip memory in time to complete the required
calculations.
The increase in computational complexity in HEVC compared
with earlier standards directly impacts the implementation and
design. For systems with a single-core processor, the increased
complexity requires higher clock speeds. This has the
additional cost of increased power consumption and heat
dissipation. For many applications of interest today, the
increased clock rate is not desirable.
An alternative solution for addressing the increased
computational complexity is parallelism. Parallelism in a video
system is not a new concept. For example, today’s software
based video conferencing systems that operate at resolutions
up to 1080p (1920x1080 pixels) and frame rates of 60 frames
per second (fps) rely on high-level parallelism (i.e., encoders
and decoders that can process different portions of a video
picture in a relatively independent fashion) despite using the
less computationally complex H.264/AVC and its scalable
extension SVC. With previous standards, high-level
parallelism within a picture may be realized by partitioning the
source frames using slices and assigning each slice to one of
several processing cores. Slices were originally designed to
map a bit-stream into smaller independently decodable chunks
for transmission. The size of a coded slice was typically
determined by the network characteristics; for example, the
size is often selected to be less than the maximum transmission
unit (MTU) size of the network being considered.
An overview of tiles in HEVC
Kiran Misra††, Member, IEEE, Andrew Segall††, Member, IEEE, Michael Horowitz†, Shilin Xu†, Arild Fuldseth‡, Minhua Zhou‡‡
In practice, using slices for parallelization has a number of disadvantages. For example, segmenting pixels into slices based only on network constraints often results in partitions that reduce the correlation present in the pixel data. This lowers the achievable coding efficiency.
Moreover, slices contain header information to facilitate
independent processing of pixel data. With the higher coding
efficiency of HEVC, this becomes problematic – it is possible
to transmit high resolution video at low bit rates such that the
overhead introduced by a slice header is not negligible.
Finally, for applications that require both parallelization and
packetization, it is difficult to use slices to achieve an optimal
partitioning for both goals.
Tiles provide an alternative partitioning that divides a picture
into rectangular sections that are processed in a relatively
independent fashion. Figure 1 illustrates an example where a
picture is partitioned into rectangular processing units called
tiles [7]. HEVC also provides additional tools for parallelism,
including both low-level and high-level methods. For high-
level parallelism, one alternative approach is wavefront
processing, which is further described in [9].
As described in this section, tiles have a number of desirable
properties for next generation video coding devices. In the rest
of this paper, we provide a detailed description of the tiles
feature in HEVC. The rest of the paper is organized as
follows: Section II provides background on slices and tiles. In Section III, we discuss constraints on tiles as described in the HEVC standard. In Section IV, an example usage of tiles is provided. Section V reports the coding efficiency improvement associated with the use of tiles in MTU size matching and high-level parallelism applications, and further demonstrates the efficacy of tiles in lightweight bit-stream rewriting. Finally, Section VI provides concluding remarks.
II. BACKGROUND
A. Slices
A video decoder consists of two fundamental processes: (a)
bit-stream parsing carried out by the entropy decoder and (b)
picture reconstruction carried out by the pixel processing
engine. The video bit-stream is typically organized in a causal
fashion where both the parsing and the reconstruction step for
the current location being processed depend on information
occurring earlier in the bit-stream. In practical applications, a
video bit-stream may be transmitted over lossy channels before
it arrives at the decoder. Loss of a part of the video bit-stream
would lead to an inability to parse and/or reconstruct
information later within the bit-stream. This causal
dependency propagates and therefore a single error may lead
to an inability to process a significant portion of the bit-stream
occurring after the error. To limit the propagation of error it is
important to break dependencies in processing. Earlier video
coding standards [1][2][6] achieved this by organizing the bit-
stream into independently parsable units called slices.
Within a picture, individual slices can be independently
reconstructed. In HEVC, slices define groups of independently
parsable coded tree blocks (CTBs). Slices contain CTBs which
follow raster scan order within a picture as shown in Figure 2.
Previous standards such as H.264/AVC include tools such as
flexible macroblock ordering (FMO) that enable defining
arbitrary shaped slices. However, while FMO provides
excellent capabilities in defining slice shapes, it unfortunately
requires frame-level decoupling of the deblocking filter from
the rest of the decoding process. Thus, in the context of H.264,
it is not possible to perform macroblock based deblocking
filtering with FMO. Largely as a result of this property, FMO
is not included in HEVC, as the frame-level deblocking
significantly increases memory bandwidth and leads to a
decrease in decoder performance. By contrast, with tiles, the
in-loop filtering process can be performed at the CTB-level
with the use of vertical column buffers. Moreover, tiles in
HEVC provide an additional benefit of having lower overhead since, unlike FMO, they do not have associated slice headers.

Figure 1 – Example illustrating rectangular picture partitioning and coded tree block (CTB) scanning order within a picture that is divided into nine tiles.

Figure 2 – Example illustrating slice-based picture partitioning of coded tree blocks (CTBs) following a raster scan order within the picture.

Figure 4 – Illustration of sample data from reconstructed frames to be stored in on-chip memory for three different CTB rows. The dashed lines show the sample data stored in memory.
In HEVC, slice partitioning may be based on network
constraints such as maximum transmission unit (MTU) or
pixel processing constraints such as maximum number of
CTBs to be contained within a slice. As can be seen in Figure
2, following the raster-scan order within a picture results in
partitioning which has a lower level of spatial correlation
within the picture. Additionally, every slice contains within it
an associated slice header which adds a non-negligible
overhead at lower bitrates. As a result of the reduced spatial correlation and the additional slice-header overhead, video coding efficiency suffers.
B. Tiles
A picture in HEVC is partitioned into coded tree blocks. In
addition, each picture may be partitioned into rows and
columns of CTBs, and the intersection of a row and column
results in a tile. Note that tiles are always aligned with CTB
boundaries. As a result of its partitioning flexibility, a tile may
be spatially more compact than a slice containing the same
number of coded tree blocks. This has the benefit of higher
correlation between pixels compared to slices. As an additional advantage, tiles do not carry headers, which further improves coding efficiency. The CTBs within a tile are processed in
raster scan order, and the tiles within a picture are processed in
raster scan order. An example of the above-described
partitioning is shown in Figure 1 where a picture is partitioned
into three columns and three rows of CTBs. The CTBs within
the first (upper-left) tile, depicted as numbered squares 1-12,
are scanned in raster scan order. After scanning the first tile,
the second tile follows (in tile raster scan order). Specifically,
as shown in Figure 1, CTB #13 in the second tile follows CTB
#12 in the first tile. Tiles allow the column and row boundaries to be specified with or without uniform spacing.
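The tile scan described above (CTBs in raster order within each tile, tiles in raster order within the picture) can be sketched as follows; the function name and the boundary-list representation are illustrative, not HEVC syntax:

```python
def tile_scan_order(pic_w_ctbs, pic_h_ctbs, col_bounds, row_bounds):
    """Return CTB (x, y) positions in tile scan order.

    col_bounds / row_bounds list the starting CTB column/row of each
    tile column/row; e.g. [0, 4, 10] for the 13-CTB-wide picture of
    Figure 1, whose first tile is 4 CTBs wide.
    """
    col_edges = col_bounds + [pic_w_ctbs]
    row_edges = row_bounds + [pic_h_ctbs]
    order = []
    for ty in range(len(row_bounds)):          # tiles in raster order
        for tx in range(len(col_bounds)):
            # CTBs in raster order within the current tile
            for y in range(row_edges[ty], row_edges[ty + 1]):
                for x in range(col_edges[tx], col_edges[tx + 1]):
                    order.append((x, y))
    return order
```

For the nine-tile picture of Figure 1, the thirteenth CTB visited is the upper-left CTB of the second tile, matching the numbering in the figure.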
The modified scan pattern has the advantage of reduced line
buffer requirements for motion estimation. Specifically, the
prediction of a CTB requires storing, in on-chip memory,
reconstructed pixel data (from previously coded frames) that
are candidates for motion compensation. This data is loaded
into on-chip memory and retained until no longer needed.
Without tiles, raster scanning of a picture results in storing
sample data equal to PicW*(2*SRy + CtbHeight), where PicW
is the width of the picture, SRy is the maximum vertical size of
a motion vector in full sample units and CtbHeight is the
height of a CTB in full sample units. With tiles, the modified
scan pattern results in a sample data storage requirement that is
approximately (TileW+2*SRx)*(2*SRy + CtbHeight), where
TileW is the width of a tile and SRx is the maximum
horizontal size of a motion vector. Using tiles, sample data
storage can be substantially reduced when PicW is
significantly larger than (TileW+2*SRx), which is typical.
Note that the above analysis assumes a single core encoder
that processes tiles sequentially. A graphical illustration of the
memory required for motion estimation using tiles is shown in
Figure 4.
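The two storage expressions above can be compared numerically; the picture, CTB, and search-range values below are illustrative, not taken from the paper:

```python
def me_buffer_samples(pic_w, ctb_h, sr_y, tile_w=None, sr_x=0):
    """Approximate reference-sample storage for motion estimation.

    Without tiles, the raster scan keeps a full-width band of
    pic_w * (2*sr_y + ctb_h) samples on chip; with tiles, only a
    (tile_w + 2*sr_x)-wide band is needed (single-core encoder,
    tiles processed sequentially, as assumed in the text).
    """
    width = pic_w if tile_w is None else tile_w + 2 * sr_x
    return width * (2 * sr_y + ctb_h)

# 3840-wide picture, 64-sample CTBs, +/-128 search range, 960-wide tiles:
no_tiles = me_buffer_samples(3840, 64, 128)                          # 1,228,800
with_tiles = me_buffer_samples(3840, 64, 128, tile_w=960, sr_x=128)  # 389,120
```

With these values the tiled scan needs roughly a third of the sample storage, consistent with the claim that the savings grow as PicW exceeds (TileW + 2*SRx).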
In addition to changing the CTB scanning process, tile
boundaries denote a break in coding dependencies.
Dependencies between tiles are disabled in the same manner as
between slices. Specifically, entropy coding and reconstruction
dependencies are not allowed across a tile boundary. This
includes motion vector prediction, intra prediction and context
selection. In-loop filtering is the only exception: it is allowed across tile boundaries, but can be disabled by a flag in the bit-stream. Thus, separate tiles may be encoded on
different processors with little inter-processor communication.
The breaking of coding dependencies at tile boundaries
implies that a decoder can process tiles in parallel with other
tiles. To facilitate parallel decoding the locations of tiles must
be signaled in the bit-stream. In HEVC, the bit-stream offsets of all but the first tile are explicitly transmitted in the slice header [10][18][19]. The location of the first tile, immediately following the slice header, is known to the decoder.

Figure 3 – Example illustrating tiles partitioning of a picture to identify region of interest. In the above example the tiles in the center column for the first two rows form the region of interest.
Tiles have been successfully demonstrated to be an effective
parallelism tool for pixel-load balancing in ultra-high
definition (UHD) video [8]. In addition to high-level
parallelism, tiles also facilitate improved MTU size matching.
Additionally, for some applications the rectangular pixel data
partitioning afforded by tiles facilitates region of interest
(ROI) based coding. Figure 3 illustrates an example where two
tiles are used to represent the ROI within a video source. Tiles
based ROI identification can be used to facilitate asymmetric
processing of pictures, where the tile corresponding to the ROI
is processed by a more computationally-capable core. This is a
desirable trait in some applications where the ROI’s encoder
rate-distortion decision process is computationally more
demanding due to the use of advanced search algorithms and
distortion metrics. Additionally, when tiles lying within an
ROI are coded independently, the subset of the bit-stream
corresponding to the ROI can be easily extracted and re-
constituted into another bit-stream with lower bit rate
requirements. An example application where tile partitioning
and ROI based coding is used to perform lightweight bit-
stream rewriting is demonstrated in section V.C.
An equally strong benefit of tiles is the reduction of memory
bandwidth and on-chip memory requirements. Specifically,
with a rectangular partitioning, the size of the line buffers
required for motion compensation and in-loop filtering is
dramatically reduced. (The line buffer width is reduced from the width of the picture to approximately the width of the tile; thus, even for two tiles, the reduction is nearly 50%.) For an
encoder with significant memory restrictions, this has the
additional advantage of higher coding efficiency, as the
reduced memory requirements of tiles enables a larger vertical
search range for such an encoder [12].
C. Complexity properties
For system designers the worst-case performance of a tool is
an important measure that needs to be considered when
designing solutions that are guaranteed to meet a minimum
constraint. With this view in mind, we describe the worst-case
per-picture execution time for slices and tiles. It is assumed
that the degree of parallelism afforded by slices and tiles is the
same so as to enable comparisons between them.
We consider a video coding environment with N ≥ 1 independent encoders and N independent decoders. The encoders and decoders have equal computational capability. Let t(e, i) represent the encoding time for CTB_i. A picture consists of a total of r rows of CTBs and c columns of CTBs. If a picture is partitioned into N slices, then the worst-case encoding time for the system is:

    T_enc,slices = max over all slices [ Σ_{CTB_i in slice} t(e, i) ] + T_dblk,slices + T_SAO,slices      (1)

where T_dblk,slices and T_SAO,slices represent the deblocking and sample adaptive offset filter encoding times at slice boundaries, respectively.

If a picture is partitioned into N uniform tiles rather than N uniform slices, then the number of CTBs at the boundary of tiles can always be made smaller than or equal to the number of CTBs at the slice boundaries. For N tiles, the worst-case encoding time for the system is:

    T_enc,tiles = max over all tiles [ Σ_{CTB_i in tile} t(e, i) ] + T_dblk,tiles + T_SAO,tiles      (2)

where T_dblk,tiles and T_SAO,tiles represent the deblocking and sample adaptive offset filter encoding times at tile boundaries, respectively. Similar expressions can be derived for the worst-case decoding time of the system.
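Equations (1) and (2) share the same form, so both can be evaluated with one helper; the names and the lumped boundary-filter term below are illustrative:

```python
def worst_case_time(ctb_times, partitions, boundary_filter_time=0.0):
    """Worst-case per-picture time for a set of parallel partitions.

    ctb_times: per-CTB processing time, indexed by CTB address.
    partitions: list of CTB-address lists (one list per slice or tile).
    The max over partitions models N equal cores running in parallel;
    boundary_filter_time lumps the deblocking and SAO boundary terms.
    """
    return max(sum(ctb_times[i] for i in part) for part in partitions) \
        + boundary_filter_time

# Eight equal-cost CTBs split over two cores, plus a small boundary cost:
t = worst_case_time([1.0] * 8, [[0, 1, 2, 3], [4, 5, 6, 7]], 0.5)  # 4.5
```

An uneven split (for example, a 6/2 partition of the same CTBs) raises the maximum, which is why load balancing across tiles matters.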
As an example, if we assume that (a) the only difference in execution time between slice- and tile-based processing lies in the processing of pixels at the partition boundaries, (b) the number of CTBs in a picture is large compared to N, and (c) tiles take on square shapes, then the number of boundary CTB edges shared between partitions is at most (min(N, r) − 1)·c + (r − 1) for the slice-based approach and 2·√(N·r·c) − r − c for the tile-based approach. Here the function min(x, y) returns the smaller of the two values x and y (and returns x when the two are equal). The second expression, representing the shared boundary for tiles, is strictly smaller than the first, indicating that, under the stated assumptions, tile-based parallel processing may be preferred. In practice, however, the non-square nature of tiles and the coding complexity of the video scene being coded make it necessary to further refine these complexity considerations.
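As a cross-check of the qualitative claim, the shared boundary can be counted directly for two simple geometries (band-of-rows slices and a perfect-square tile grid); these are special cases of the bounds discussed above, chosen so the counts are exact:

```python
import math

def band_slice_edges(n, r, c):
    """Shared CTB edges when an r x c CTB picture is cut into n
    horizontal bands (raster-scan slices aligned to CTB rows)."""
    return (n - 1) * c          # each cut crosses the full picture width

def square_tile_edges(n, r, c):
    """Shared CTB edges for a sqrt(n) x sqrt(n) grid of tiles
    (n is assumed to be a perfect square)."""
    k = math.isqrt(n)
    return (k - 1) * r + (k - 1) * c   # internal vertical + horizontal cuts

# 16-way parallelism on a picture of 34 x 60 CTBs of 64x64 samples:
# band_slice_edges(16, 34, 60) -> 900; square_tile_edges(16, 34, 60) -> 282
```

Consistent with the text, the tile grid shares far fewer boundary edges than the slice bands at the same degree of parallelism.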
In theory, it is possible to perform load balancing between
different processing cores based on estimated coding
complexity of different regions of a picture and redefining tile
boundaries. A good load-balancing algorithm would increase
resource utilization while reducing average processing times.
However, frequent changes in tile boundaries and the
associated change in scan pattern and buffering requirements
make software/hardware optimizations difficult to achieve. In
practice, system designers need to determine a good trade-off
between variable tile structures and optimized implementations
for their individual applications.
(a) (b)
Figure 5 – Interaction between tiles and slices. In the example above, the following are illustrated: (a) Three
complete tiles contained within a single slice (b) Two complete slices contained within each tile.
III. CONSTRAINTS ON TILES
In this section, we begin by listing the constraints related to
tiles in HEVC.
Supporting tiles in the HEVC system requires the transmission
of the tiles configuration information from an encoder to a
decoder. This includes column and row locations, loop filter
control and the bit-stream location information for the start of
all but the first tile in a picture. When uniform spacing is used, the tile boundaries are automatically distributed evenly across the picture, balancing the pixel load approximately evenly amongst the different tiles in a picture.
Alternatively, tile boundaries may be explicitly specified, for
example based on picture coding complexity. When more than one tile exists within a picture, the tile column widths and tile row heights are required to be greater than or equal to 256 and 64 luma samples, respectively. This constraint ensures that tiles cannot be too small. Additionally, the total number of
tiles within a picture is limited by constraining the maximum
number of tile columns and maximum number of tile rows
allowed within a picture based on the level of the bitstream
under consideration. These bounds are specified in Table A-1
of the HEVC standard and monotonically increase with
increasing level.
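A minimal validity check for a tile grid, following the constraints just described, might look as follows; the Table A-1 level limits are passed in as parameters rather than reproduced here:

```python
def check_tile_config(col_widths, row_heights, max_cols, max_rows):
    """Validate a tile grid against the constraints described above.

    col_widths / row_heights are in luma samples; max_cols / max_rows
    are the level-dependent limits from Table A-1 of the standard
    (values not reproduced here).
    """
    if len(col_widths) > 1 or len(row_heights) > 1:  # more than one tile
        if any(w < 256 for w in col_widths):
            return False    # tile columns must be >= 256 luma samples
        if any(h < 64 for h in row_heights):
            return False    # tile rows must be >= 64 luma samples
    return len(col_widths) <= max_cols and len(row_heights) <= max_rows

# The 2x2 grid from the 720p example in Section IV passes:
ok = check_tile_config([640, 640], [384, 336], max_cols=10, max_rows=10)
```

A configuration with a 200-sample-wide column, by contrast, would be rejected by the minimum-width rule.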
In HEVC, slice boundaries can also be introduced by the
encoder and need not be coincident with tile boundaries.
However, to manage decoder implementation complexity, the
combination of slices and tiles is constrained. Specifically,
either all coded blocks in a tile must belong to the same slice,
or all coded blocks in a slice must belong to the same tile.
Figure 5 illustrates the two constraints. In Figure 5a, all coded blocks within the three tiles belong to a single slice, illustrating the first constraint, while in Figure 5b, all coded blocks in each slice belong to the same tile. As a consequence of these constraints, note that a slice that does not start at the beginning of a tile cannot span multiple tiles.
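One way to read the slice/tile interaction constraint is that every slice/tile pair sharing coded blocks must be nested one way or the other. A sketch under that reading, assuming per-CTB partition maps (not HEVC syntax):

```python
def slices_tiles_legal(ctb_to_slice, ctb_to_tile):
    """Check that every slice/tile pair sharing CTBs is nested:
    the slice lies inside the tile, or the tile inside the slice.

    Both arguments map CTB address (list index) to a partition label.
    """
    def groups(labels):
        g = {}
        for ctb, lab in enumerate(labels):
            g.setdefault(lab, set()).add(ctb)
        return list(g.values())

    for s in groups(ctb_to_slice):
        for t in groups(ctb_to_tile):
            if s & t and not (s <= t or t <= s):
                return False    # overlapping but not nested: illegal
    return True

# Four CTBs, two tiles; one slice covering both tiles is legal,
# a slice straddling the tile boundary without containing a tile is not:
legal = slices_tiles_legal([0, 0, 0, 0], [0, 0, 1, 1])
illegal = slices_tiles_legal([0, 1, 1, 2], [0, 0, 1, 1])
```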
An additional constraint made on the tile system in HEVC is
that all tile locations within a bit-stream are provided to the
decoder. Conceptually, a single-core decoder could ignore this
location information while processing a bit-stream containing
tiles. This would result in a single core decoder following the
memory access pattern described in section II.B. However, the HEVC design requires the transmission of entry points for all but the first tile, with the goal of realizing two key benefits: the first is enabling the maximum amount of parallelization at the decoder; the second is allowing a bit-stream containing tiles to be decoded in raster-scan order. More specifically, a decoder may receive a bit-stream in tile raster-scan order but choose to decode it in the (alternative) raster-scan order of the frame. This is achieved by decoding the CTBs in
the first row of a first tile, saving the entropy coding state, and
then resetting the entropy coder to decode the CTB in the first
row of the neighboring tile. For example, as illustrated in Figure 5a, this corresponds to decoding CTBs 1, 2, 3, 4 (in that order); saving the entropy coding state for the current tile; and then resetting the entropy coder and continuing by decoding CTB 33 in the adjacent tile. This benefits single-core decoders, as the
single core device can decode a bit-stream containing tiles
without significant changes in processing or memory access
pattern.
Entry points also allow a bit-stream containing independently
decodable tiles to be extracted and re-constituted into a lower
bit rate stream. This requires further non-normative (encoder
side) constraints on the bit-stream being generated. These non-
normative constraints with the corresponding lightweight
rewriting experiments are listed in section V.C.
IV. EXAMPLE USAGE FOR TILES
We now take a more detailed look at an example use case for
tiles.
Video conferencing applications, which stand to benefit from the parallelism afforded by tiles, are used to demonstrate their usage. Software-based interactive video applications run on
platforms ranging from laptop and desktop computers to
tablets and smart phones. Modern desktop and laptop
computers use CPUs with four or more processing cores.
Many tablets and smartphones sporting dual-core and even quad-core ARM processors are already commercially available. One way to leverage multi-core computational
power for HEVC video encoding and decoding is to use tiles.
The following example describes the use of tiles syntax in the
context of a software-based 720p (1280x720) interactive video
application operating at 60 frames per second (fps). The
example application is designed for a hardware platform
containing an Intel Core i7 CPU which, accounting for hyper-threading, has eight virtual processing cores. The application
consists of several components including an HEVC encoder,
HEVC decoder, audio processing, user interface, etc. With the
relative computational complexity associated with each
component in mind, four virtual cores are allocated for HEVC
encoding, two for decoding (HEVC decoding generally
requires fewer computational resources than encoding) and the
remaining two cores are reserved for all other application
components.
To better take advantage of the processing capacity of the four
cores allocated for encoding, the input picture is partitioned
into four tiles. Since each core has identical computational
capabilities, it is desirable to partition a given picture so that
the encoding of each resulting tile requires the same
processing power. To achieve a proper processing load
balance, tiles in “active” picture regions requiring more
processing power are specified to be smaller than tiles in less
computationally demanding regions. One simple load
balancing strategy starts by partitioning the picture to be
encoded into tiles of roughly equal size and adapting the size
of the tiles over time depending on source content. Tile
locations and dimensions are specified in the picture parameter
set (PPS). The use of the PPS facilitates picture-to-picture tile
configuration changes that may be made in order to load
balance. A picture may be partitioned into four uniformly
spaced tiles (to facilitate load balancing). The resulting tiles
are more spatially compact than those resulting from other
partitioning strategies (e.g., four tiles side-by-side). In general,
tile compactness results in improved coding efficiency as
discussed earlier. The resulting left and right tile columns each
have a width of 640 luma samples. Assuming the coded tree blocks consist of 64x64 luma samples, setting the first tile
row height to 6 CTBs results in the top tile row having a height
of 384 luma samples while the bottom row has a height of 336
samples. In this way, four tiles having dimensions 640x384,
640x384, 640x336, and 640x336 counting clockwise from the
upper-left, are specified.
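The arithmetic behind the 2x2 partition above can be sketched as follows; the function and parameter names are illustrative:

```python
def tile_dims_720p(pic_w=1280, pic_h=720, ctb=64, first_row_ctbs=6):
    """Derive the four tile dimensions used in the 720p example.

    The picture is 20 CTB columns wide, split 10/10 -> 640-sample
    columns; the first tile row is 6 CTBs tall (384 samples), leaving
    336 samples for the bottom row (whose last CTB row is partial).
    """
    half_w = (pic_w // ctb // 2) * ctb   # 640 luma samples per tile column
    top_h = first_row_ctbs * ctb         # 384 luma samples
    bottom_h = pic_h - top_h             # 336 luma samples
    # Clockwise from the upper-left: top-left, top-right,
    # bottom-right, bottom-left.
    return [(half_w, top_h), (half_w, top_h),
            (half_w, bottom_h), (half_w, bottom_h)]
```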
Having specified the tile dimensions, the encoder partitions the
input picture into four tiles and sends the picture data
associated with each tile to a separate processing core for
encoding. In this way, the encoder may achieve full processor
utilization with very low delay. After encoding, the bits
produced by each core must be assembled into a coded slice in
decoding order (tiles are decoded in raster scan order within a
picture) prior to being placed in a data packet and sent to the
network for transport. For the sake of clarity, we shall assume
the bits for all tiles in a picture are contained within a single
coded slice. This assumption is not unreasonable for a 720p,
60 fps HEVC encoding in the context of video conferencing.
Parallel decoding requires a different approach. Where an
encoder has the flexibility to choose how to partition and
allocate portions of an input picture for parallel processing,
due to dependencies in the decoding process, the decoder
cannot arbitrarily partition an input bit-stream. To facilitate
parallel decoding, HEVC inserts information into a slice
header to signal entry points associated with tiles contained
within that slice. In this example, three entry points per slice are signaled to mark the locations in the bit-stream of the start of the second, third, and fourth tiles. The decoder receives the
coded slice, parses the slice header and determines the
associated PPS which is mandated to occur earlier in the bit-
stream than any slice referencing it. From the PPS, the decoder
may derive the number of tiles as well as the location and
spatial dimensions of each tile. In addition, the decoder
determines the location in the bit-stream of the tile entry point
from the slice header. The tile substreams may then be sent to
the two independent cores for decoding. In this example, each
core is assigned two tiles for processing at the decoder. An
important benefit of signaling entry points in the slice header is
that it facilitates raster-scan based decoding as described in
section III.
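Once the entry points are known, splitting the slice payload into tile substreams is a byte-slicing step. The sketch below ignores the actual slice-header syntax (entry_point_offset_minus1, etc.) and simply assumes byte offsets measured from the start of the payload:

```python
def split_tile_substreams(slice_payload, entry_offsets):
    """Split a coded-slice payload into per-tile substreams.

    entry_offsets: byte offsets (from the start of the payload) of the
    second, third, ... tiles, as recovered from the slice header; the
    first tile starts immediately after the header (offset 0).
    """
    starts = [0] + list(entry_offsets)
    ends = list(entry_offsets) + [len(slice_payload)]
    return [slice_payload[s:e] for s, e in zip(starts, ends)]

# Four tiles -> three signaled entry points; each decoding core
# can then be handed two of the resulting substreams:
subs = split_tile_substreams(b"AAAABBBCCCCCDD", [4, 7, 12])
```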
V. EXPERIMENTS
To assess the benefit of the tiles feature, we report the coding
efficiency improvement of the approach in a number of
configurations. These configurations include the cases of high-
level parallelization and network maximum transmission unit
(MTU) size matching. The efficacy of tiles is further
demonstrated using a lightweight bit-stream rewriting example.
Another experiment considering motion estimation with
limited on-chip memory is reported in [7].
The experiments reported here are conducted on test
sequences of different resolutions. Sequences used in
experiments are classified into five groups based on their
resolution. Class A sequences have the highest resolution of
2560x1600 and are cropped versions of ultra-high definition
(UHD) 4K resolution sequences. Class B sequences
correspond to full high definition (HD) sequences with a
resolution of 1920x1080. Class C and Class D sequences
correspond to WVGA and WQVGA resolutions of 832x480
and 416x240, respectively. Finally, Class E sequences are
typical of video conferencing applications and have 720p (i.e.,
1280x720 pixel) resolution.
For the experiments, Class A includes the Traffic and
PeopleOnStreet sequences; Class B includes the Kimono,
ParkScene, Cactus, BasketballDrive and BQTerrace
sequences; Class C includes the BasketballDrill, BQMall,
PartyScene and RaceHorses sequences, class D includes the
BasketballPass, BQSquare, BlowingBubbles and RaceHorses
sequences; Class E includes the FourPeople, Johnny and
KristenAndSara sequences.
Note that for the random access configuration, Class E
sequences are not tested, while for the low delay configurations,
Class A sequences are not tested. This is consistent with the
test conditions defined during the standardization process of
HEVC [15].
A. Comparing parallelism using Tiles versus Slices
(Experiment 1)
In a first experiment, we compare the high-level
parallelization performance of tiles to traditional slices. Here,
we select the tile size to be approximately equal to the size of
one WQVGA image frame (i.e., 416x240 pixels). The exact
slice and tile partitioning used for each class of sequences
is listed in Table 1. For the reference, we select the slice
size to have the same number of CTBs as a tile. Choosing tile
sizes approximately equal to WQVGA results in a single tile
and correspondingly a single slice for Class D sequences, so
the anchor and test data would have identical rate-distortion
performance. Consequently, Class D sequences are not tested
in this experiment.
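As a concrete check of the partitioning in Table 1, the CTB grid and per-tile CTB count can be worked out as follows; this is an illustrative sketch assuming 64x64 CTBs and uniform tile spacing.

```python
def ctb_grid(width, height, ctb=64):
    """Number of CTB columns and rows for a picture (ceiling division)."""
    return -(-width // ctb), -(-height // ctb)

# Class A (2560x1600) has a 40x25 CTB grid; a 5x5 tile grid yields
# 8x5-CTB tiles, i.e. 40 CTBs per tile -- matching the 40-CTB
# reference slices, so both partitions expose the same parallelism.
cols, rows = ctb_grid(2560, 1600)
tile_ctbs = (cols // 5) * (rows // 5)
```

The same arithmetic reproduces the Class E row of Table 1: a 1280x720 picture gives a 20x12 CTB grid, so a 4x3 tile grid yields 5x4-CTB tiles of 20 CTBs each.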
Experiments are conducted using HM 9.2 [14] and JCT-VC
main configuration common conditions [15] for class A to E
test sequences. Results from the comparison appear in Table 2.
As can be seen from the table, the tiles system provides
average luma BD-rate improvements of 2.2%, 2.2%, 5.4% and
5.5% for the main configurations of All Intra, Random Access,
Low Delay B and Low Delay P, respectively, compared to slices
with the same degree of parallelization.
B. MTU size matching using Tiles (Experiment 2)
In a second experiment, we compare the performance of the
tiles system for MTU size matching to traditional slices. Here,
an encoder divides a picture into slices that do not exceed
1500 bytes. This slice size is consistent with the MTU size of
an Ethernet v2 network. Tiles are used to improve the coding
efficiency of the system. We use column boundaries to divide
the picture, since we observe that columns result in more
square-like slice shapes, leading to higher spatial correlation.
The column widths used for each sequence class and encoder
configuration are listed in Table 3. The higher correlation
improves intra prediction, mode prediction and motion vector
coding, for example.
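The byte-limited slice formation used in this experiment can be sketched as a greedy packer over coded tile sizes. The helper below and its header-overhead parameter are hypothetical simplifications of the modified HM encoder, not its actual implementation.

```python
def pack_tiles_into_slices(tile_sizes, mtu=1500, header_bytes=0):
    """Greedily group whole coded tiles into byte-limited slices.

    tile_sizes: coded size in bytes of each tile, in tile-scan order.
    Each slice starts at a tile boundary and is closed before it would
    exceed the MTU; a single oversized tile still gets its own slice
    (in practice the encoder would then break the slice inside the
    tile or re-encode).
    """
    slices, current, size = [], [], header_bytes
    for i, tile_size in enumerate(tile_sizes):
        if current and size + tile_size > mtu:
            slices.append(current)       # close the current slice
            current, size = [], header_bytes
        current.append(i)
        size += tile_size
    if current:
        slices.append(current)
    return slices
```

For example, four tiles of 600 bytes each pack into two 1200-byte slices under a 1500-byte MTU.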
Experiments are conducted using HM 9.2 [14] and JCT-VC
main configuration common conditions [15] for class A to E
test sequences. A coding tree block size of 32x32 was used in
lieu of 64x64. Additionally, the HM-9.2 encoder is modified
to allow byte-limited slices that begin at the start of a tile to
extend to the end of another tile. Results for the experiment are
reported in Table 4. As can be seen from the table, tiles
improve the coding efficiency of HEVC in the MTU size
Table 1 - Slice and tile partitioning for experiment 1

              Reference                      Test
Class   Number      Slice sizes      Number of tiles    Tile dimensions
        of slices   (units of CTBs)  (horiz. x vert.)   (units of CTBs,
                                                        horiz. x vert.)
A       25          40               5 x 5              8 x 5
B       17          30               6 x 3              5 x 6
C       4           26               2 x 2              7 x 4
E       12          20               4 x 3              5 x 4
Table 2 - Encoder parallelization performance results for
experiment 1

All Intra Main
              Y        U        V
  Class A   -1.5%    -1.1%    -1.0%
  Class B   -1.9%    -1.6%    -1.5%
  Class C   -0.9%    -0.8%    -0.9%
  Class E   -4.5%    -4.0%    -4.1%
  Overall   -2.2%    -1.9%    -1.9%

Random Access Main
              Y        U        V
  Class A   -2.1%    -1.9%    -1.8%
  Class B   -3.3%    -3.8%    -2.8%
  Class C   -1.3%    -1.6%    -2.0%
  Class E     *        *        *
  Overall   -2.2%    -2.4%    -2.2%

Low Delay B Main
              Y        U        V
  Class A     *        *        *
  Class B   -3.5%    -3.3%    -2.9%
  Class C   -1.5%    -1.4%    -1.5%
  Class E  -11.3%   -10.3%   -10.1%
  Overall   -5.4%    -5.0%    -4.8%

Low Delay P Main
              Y        U        V
  Class A     *        *        *
  Class B   -4.1%    -3.9%    -3.4%
  Class C   -1.8%    -1.9%    -2.0%
  Class E  -10.7%    -9.7%   -10.1%
  Overall   -5.5%    -5.2%    -5.2%
matching scenario. Specifically, an average improvement of
2.1%, 1.1%, 0.4% and 0.4% luma BD-rate [16] is reported for
the All Intra, Random Access, Low Delay B and Low Delay P
main configurations, respectively. As the CTB
size decreases, the coding gain realized by using tiles
increases. For example, for 16x16 CTBs the gains due to tiles
have been shown to be 4.7%, 2.5%, and 0.9% luma BD-rate (on
average) for Intra, Random Access and Low Delay scenarios,
respectively [17]. Note that at extremely low bitrates, where a
single slice exists per picture, the coding efficiency benefits of
the compact representation using tiles are sometimes exceeded
by the losses incurred by breaking prediction dependencies at
tile boundaries. This is evidenced by the coding efficiency
losses observed for Class E sequences in the Low Delay B and
Low Delay P configurations.
Based on the above rate-distortion results, it is fair to conclude
that the utility of tiles for encoder parallelization and MTU
size matching is low for lower-resolution sequences such as
Class C and Class D.
C. Lightweight bit-stream rewriting using tiles based
region of interest coding (Experiment 3)
In a third experiment, we partition pictures into tiles and
identify one tile as containing the region-of-interest (ROI). To
ensure that the ROI is independently decodable from non-ROI
tiles, encoder restrictions prevent temporal predictions within
the ROI tile from referring to pixels outside the ROI in
reference pictures. Additionally, the application of the
deblocking and sample adaptive offset filters is disabled at tile
boundaries. Each picture in the video source is coded as a
single slice. The slice header contains location information
identifying the start of each tile. Using this entry point
information, a lightweight rewriting process extracts the tile
corresponding to the ROI from each picture and rewrites the
slice header and parameter sets to reconstitute a bit-stream
containing only the ROI tile.
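The extraction step of the rewriting process can be sketched as follows; the offset bookkeeping mirrors the entry point signalling, while the header rewriting itself is elided. The byte-size convention is an illustrative assumption.

```python
def extract_roi_substream(slice_payload, substream_sizes, roi_index):
    """Return the coded bytes of the ROI tile from a slice payload.

    substream_sizes lists the byte size of every tile substream except
    the last one, which runs to the end of the payload. A full
    rewriter would wrap the returned bytes in a new slice header and
    parameter sets describing a single-tile picture.
    """
    offsets = [0]
    for size in substream_sizes:
        offsets.append(offsets[-1] + size)
    start = offsets[roi_index]
    if roi_index < len(substream_sizes):
        return slice_payload[start:offsets[roi_index + 1]]
    return slice_payload[start:]  # ROI is the last tile
```

Because only byte ranges are copied and headers rewritten, a network middle box can perform this per picture without entropy decoding, which is what makes the process lightweight.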
Note that [20], which was recently adopted into the working
draft of version 2 of the HEVC standard, describes a way to
constrain the encoding process so that a decoder can correctly
decode specific sets of tiles. It also describes an encoder
constraint that avoids the need for ROI applications to disable
deblocking and sample adaptive offset filtering across tile
boundaries.
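The encoder restriction on temporal prediction can be expressed as an admissibility check on candidate motion vectors. The filter-reach margins below are assumed values standing in for the reach of the interpolation filters, and the integer-pel model is a simplification.

```python
def mv_allowed(mv, blk, roi, taps_before=3, taps_after=4):
    """True if motion compensation for blk stays inside the ROI tile.

    blk = (x, y, w, h) of the predicted block; roi = (x0, y0, x1, y1),
    half-open, in the reference picture. mv is an integer-pel motion
    vector. taps_before/taps_after model the reach of the interpolation
    filter around the nominal reference block (assumed values, applied
    conservatively even for integer-pel vectors).
    """
    x, y, w, h = blk
    x0, y0, x1, y1 = roi
    dx, dy = mv
    return (x + dx - taps_before >= x0 and
            y + dy - taps_before >= y0 and
            x + dx + w + taps_after <= x1 and
            y + dy + h + taps_after <= y1)
```

An encoder enforcing the restriction would simply discard candidate vectors for which this check fails during motion search.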
Table 3 - Column widths in units of 32x32 CTBs used
for experiment 2

           All Intra   Random Access   Low Delay B   Low Delay P
             Main          Main           Main          Main
Class A
  QP 22       40            5              *             *
  QP 27       40            7              *             *
  QP 32        4           10              *             *
  QP 37        4           10              *             *
Class B
  QP 22       30            6              4             4
  QP 27        5            6              8             8
  QP 32        6            8             10            10
  QP 37        8           10             15            15
Class C
  QP 22       13            4              3             3
  QP 27       13            7              4             4
  QP 32        3            7              7             7
  QP 37        4           13             13            13
Class D
  QP 22        7            4              4             4
  QP 27        7            7              7             7
  QP 32        2            7              7             7
  QP 37        4            7              7             7
Class E
  QP 22        4            *              7             7
  QP 27        5            *             20            20
  QP 32        5            *             20            20
  QP 37        7            *             20            20
Table 4 - MTU size matching performance results for
experiment 2

All Intra Main
              Y        U        V
  Class A   -0.9%    -0.3%    -0.3%
  Class B   -2.7%    -2.3%    -2.2%
  Class C   -1.8%    -1.4%    -1.5%
  Class D   -0.4%    -0.3%    -0.3%
  Class E   -4.8%    -4.2%    -4.0%
  Overall   -2.1%    -1.8%    -1.7%

Random Access Main
              Y        U        V
  Class A   -1.3%    -1.4%    -1.4%
  Class B   -1.9%    -2.1%    -1.9%
  Class C   -1.2%    -1.1%    -1.1%
  Class D    0.0%    -0.1%     0.0%
  Class E     *        *        *
  Overall   -1.1%    -1.2%    -1.1%

Low Delay B Main
              Y        U        V
  Class A     *        *        *
  Class B   -0.9%    -1.2%    -1.1%
  Class C   -0.6%    -0.4%    -0.3%
  Class D    0.0%    -0.1%    -0.1%
  Class E    0.5%     1.2%     0.7%
  Overall   -0.4%    -0.2%    -0.3%

Low Delay P Main
              Y        U        V
  Class A     *        *        *
  Class B   -1.0%    -1.2%    -1.2%
  Class C   -0.5%    -0.4%    -0.7%
  Class D    0.0%    -0.1%    -0.1%
  Class E    0.4%     0.5%     1.1%
  Overall   -0.4%    -0.4%    -0.4%
For this experiment, Class E sequences were used. The tile
partitioning and the ROI tile index used for the experiments
are listed in Table 5. The performance is measured using BD-rate.
The anchor bit rates correspond to the sum of the bit rates for
transmitting the full-resolution Class E sequence and a cropped
version corresponding to the ROI, each using a single tile per
picture. The quantization parameters used for
the experiment are 22, 27, 32 and 37. The anchor peak signal-
to-noise ratio (PSNR) corresponds to the PSNR of the
full-resolution Class E sequence with one tile. For the test data,
the bit rate of the full-resolution Class E sequence with the tile
configuration listed in Table 5 is used. The test PSNR also
corresponds to the full-resolution Class E sequence. The BD
rate measured using this set of anchor and test data is listed in
Table 6. This BD rate measure represents the bit rate savings
achieved by using a mechanism where only a single resolution
bit-stream is transmitted to a network middle box capable of
performing the lightweight rewriting process versus
transmitting two separate resolution bit-streams. Note that this
BD rate reflects the bit rate savings from the point of view of
an end-point device that receives the full-resolution Class E
bit-stream, and it represents average bandwidth savings of
43.9%, 28.5%, 21.1% and 23.0% for the All Intra, Random
Access, Low Delay B and Low Delay P main configurations,
respectively.
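The BD-rate measure [16] compares two rate-distortion curves by integrating the log-rate gap over their overlapping PSNR range. The sketch below uses piecewise-linear interpolation in place of the cubic polynomial fit of [16], a common simplification; the data points are illustrative.

```python
import math

def bd_rate(anchor, test):
    """Average bit-rate difference in % between two RD curves.

    anchor/test: lists of (rate, psnr) pairs. Negative means the test
    curve needs fewer bits for the same quality. Piecewise-linear
    interpolation of log-rate vs. PSNR stands in for the cubic fit
    of the original Bjontegaard metric.
    """
    def log_rate_at(points, psnr):
        pts = sorted(points, key=lambda p: p[1])
        for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
            if q0 <= psnr <= q1:
                t = (psnr - q0) / (q1 - q0)
                return math.log(r0) + t * (math.log(r1) - math.log(r0))
        raise ValueError("PSNR outside curve")

    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    n = 100  # integration samples over the overlapping PSNR range
    diffs = [log_rate_at(test, lo + (hi - lo) * i / n)
             - log_rate_at(anchor, lo + (hi - lo) * i / n)
             for i in range(n + 1)]
    avg = sum(diffs) / len(diffs)
    return (math.exp(avg) - 1) * 100
```

A test curve needing uniformly 10% fewer bits than the anchor at every PSNR yields a BD-rate of -10%.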
VI. CONCLUSION
The tiles-based design of HEVC provides multiple benefits for
managing the computational complexity of video encoding and
decoding. This is especially true for high resolution video data.
By breaking dependencies within a picture, high-level
parallelism for both the encoder and decoder can be achieved
without the overhead of traditional slices. It was demonstrated
that a tiles-based parallelism approach results in average
luma bit rate savings of 2.2% to 5.5% over a slice-based
approach. It was also shown that the compact CTB scan
pattern afforded by tiles can be used to improve the coding
efficiency of MTU size matching within HEVC. Moreover, by
altering the CTB scan pattern within an image, on-chip
memory requirements are reduced and coding efficiency
improvements can be achieved. Tiles can also be used to
perform region-of-interest based lightweight bit-stream
rewriting. The combination of high-level parallelism, resource
reduction and coding efficiency provides a very useful tool
within the HEVC system.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for
their valuable comments and feedback, which were extremely
helpful in improving the quality of the paper.
REFERENCES
[1] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec.
H.264 | ISO/IEC 14496-10, May 2003; Version 2: May 2004; Version 3:
Mar. 2005; Version 4: Sept. 2005; Version 5: June 2006; Version 7:
Apr. 2007; Version 8 (with SVC extension): consented July 2007.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview
of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst.
Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[3] “Joint Call for Proposals on Video Compression Technology,” ITU T
SG16/Q.6 Doc. VCEG-AM91, Kyoto, Japan, 2010.
[4] ITU-T Rec. H.265 and ISO/IEC 23008-2: High Efficiency Video
Coding, ITU-T and ISO/IEC, April 2013.
[5] Sullivan, G. J.; Ohm, J.-R.; Han, W.-J. & Wiegand, T. “Overview of the
High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 22, no. 12, pp.
1649-1668, 2012.
[6] S. Wenger and M. Horowitz, “FMO: Flexible Macroblock Ordering,”
JVT-C089, May 2002.
[7] A. Fuldseth, M. Horowitz, S. Xu, A. Segall. M. Zhou, “Tiles”, JCT-VC
F335, 6th Meeting of Joint Collaborative Team on Video Coding (JCT-
VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy
2011.
[8] M. Zhou, V. Sze, M. Budagavi, “Parallel tools in HEVC for high-
throughput processing,” Proceedings SPIE 8499, Applications of Digital
Image Processing XXXV, 849910, October 15, 2012.
Table 6 – Lightweight stream splitting based on ROI
using tiles (experiment 3)

All Intra Main
                      Y         U         V
  FourPeople        -39.4%    -39.9%    -39.7%
  Johnny            -46.5%    -46.9%    -47.0%
  KristenAndSara    -45.7%    -46.1%    -46.1%
  Overall           -43.9%    -44.3%    -44.3%

Random Access Main
                      Y         U         V
  FourPeople        -10.8%    -18.4%    -22.0%
  Johnny            -30.9%    -34.0%    -36.8%
  KristenAndSara    -43.7%    -51.0%    -52.2%
  Overall           -28.5%    -34.5%    -37.0%

Low Delay B Main
                      Y         U         V
  FourPeople         -6.1%     -8.2%     -9.2%
  Johnny            -17.9%    -21.0%    -23.5%
  KristenAndSara    -39.2%    -40.0%    -41.1%
  Overall           -21.1%    -23.0%    -24.6%

Low Delay P Main
                      Y         U         V
  FourPeople         -8.4%    -10.4%    -11.6%
  Johnny            -20.5%    -22.8%    -25.7%
  KristenAndSara    -40.1%    -40.9%    -42.3%
  Overall           -23.0%    -24.7%    -26.5%
Table 5 – Tile heights and widths in units of 64x64
CTBs for lightweight bit-stream rewriting using tiles
based ROI (experiment 3)

                  Tile column   Tile row   ROI tile index
                  widths        heights    (zero-based)
FourPeople        20            3, 5, 4    1
Johnny            4, 12, 4      12         1
KristenAndSara    2, 18         12         1
[9] Chi Ching Chi, Mauricio Alvarez-Mesa, Ben Juurlink, Gordon Clare,
Félix Henry, Stéphane Pateux and Thomas Schierl, "Parallel Scalability
and Efficiency of HEVC Parallelization Approaches," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 22,
no. 12, pp. 1827-1838, 2012.
[10] K. Misra and A. Segall, “Parallel decoding with Tiles”, JCT-VC F594,
6th Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy 2011.
[11] A. Fuldseth, “Replacing slices with tiles for high level parallelism,”
JCTVC-D227, 4th Meeting of Joint Collaborative Team on Video
Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Daegu, January 2011.
[12] M. Zhou, “Sub-picture based raster scanning coding order for HEVC
UHD video coding”, JCTVC-B062, 2nd Meeting of Joint Collaborative
Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Geneva, July, 2010.
[13] M. Horowitz and S. Xu, “Generalized slices,” JCTVC-D378, 4th
Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Daegu, January
2011.
[14] High efficiency test model software SVN repository
https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-9.2/
[15] F. Bossen, “Common HM test conditions and software reference
configurations”, JCT-VC I1100, 9th Meeting of Joint Collaborative
Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Geneva, May 2012.
[16] G. Bjontegaard, “Calculation of average PSNR differences between RD-
curves”, VCEG M33, March, 2001.
[17] M. Horowitz, S. Xu, E. S. Rye, and Y. Ye, “The effect of LCU size on
coding efficiency in the context of MTU size matching”, JCT-VC F596,
6th Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy 2011.
[18] Hendry, S. Jeong, S. W. Park, B. M. Jeon, K. Misra, A. Segall, "AHG4:
Harmonized method for signalling entry points of tiles and WPP
substreams," JCTVC-H0556, 8th Meeting of Joint Collaborative Team
on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, San Jose, February 2012.
[19] Y. -K. Wang , A. Segall, M. Horowitz, Hendry, W. Wade, F. Henry , T.
Lee, "Text for tiles, WPP and entropy slices," JCTVC-H0737, 8th
Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, San Jose, February
2012.
[20] Y. Wu, G. J. Sullivan, Y. Zhang, “Motion-constrained tile sets SEI
message”, JCTVC-M0235, 13th Meeting of Joint Collaborative Team on
Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Incheon, KR, April, 2013.
Kiran Misra (M’09) received his B.E. degree in
Electronics Engineering from Mumbai University, India,
in 1998. He received his M.S. and Ph.D. degrees in
Electrical and Computer Engineering in 2002 and 2010
from Michigan State University (MSU), East Lansing. He
has been a member of the IEEE since 2009. He was also the
recipient of MSU’s Graduate School Research
Enhancement and Summer Program Fellowship in 2008.
Dr. Misra joined Sharp Laboratories of America Inc. as a Post-doctoral
Researcher in 2010 where he is currently a Senior Researcher in the video
coding group. His research interests include video coding and image
compression, network coding, joint source and channel code design, wireless
networking, and stochastic modeling.
Andrew Segall (S’00–M’05) received the B.S. and M.S.
degrees in electrical engineering from Oklahoma State
University, Stillwater, in 1995 and 1997, respectively, and
the Ph.D. degree in electrical engineering from
Northwestern University, Evanston, IL, in 2002. He is
currently a Manager at Sharp Laboratories of America,
Camas, WA, where he leads groups performing research on
video coding and video processing algorithms for next
generation display devices. From 2002 to 2004, he was a Senior Engineer at
Pixcise, Inc., Palo Alto, CA, where he developed scalable compression
methods for high definition video. His research interests are in image and
video processing and include video coding, super resolution and scale space
theory.
Michael Horowitz received an A.B. degree with
distinction in physics from Cornell University, Ithaca, NY,
in 1986, an M.S. in electrical engineering from Columbia
University, New York City, NY, in 1988 and a Ph.D. in
electrical engineering from The University of Michigan,
Ann Arbor, in 1998.
He is Chief Technology Officer at eBrisk Video. Prior to
eBrisk, he led the engineering team at Vidyo that developed
the first commercially available H.264 SVC video codec. Earlier, at Polycom
he led the engineering team that developed the first commercially available
in-product H.264/AVC video codec. Dr. Horowitz is Managing Partner at
Applied Video Compression and is a member of the Technical Advisory
Board of Vivox, Inc.
Dr. Horowitz has served as chair for several ad hoc groups including the ad
hoc group on High-level Parallelism during the ITU-T | ISO/IEC Joint
Collaborative Team on Video Coding’s (JCT-VC) development of HEVC.
Shilin Xu received a B.E. degree in Communication Engineering in 2004 and
a Ph.D. degree in Electrical and Information Engineering in 2009, both from
Huazhong University of Science and Technology, Wuhan, China.
He has been a research engineer at eBrisk Video since 2010 and is actively
participating in the standardization of HEVC. Prior to eBrisk, he was an
assistant professor at Wuhan Institute of Technology, China, from 2009 to
2010.
Arild Fuldseth received his B.Sc. degree from the
Norwegian Institute of Technology in 1988, his M.Sc.
degree from North Carolina State University in 1989, and
his Ph.D. degree from Norwegian University of Science
and Technology in 1997, all degrees in Signal Processing.
From 1989 to 1994, he was a Research Scientist in
SINTEF, Trondheim, Norway. From 1997 to 2002 he was a
Manager of the signal processing group of Fast Search and
Transfer, Oslo, Norway. Since 2002 he has been with Tandberg Telecom,
Oslo, Norway (now part of Cisco Systems) where he is currently a Principal
Engineer working with video compression technology.
Minhua Zhou received his B.E. degree in Electronic Engineering and M.E.
degree in Communication & Electronic Systems from Shanghai Jiao Tong
University, Shanghai, P.R. China, in 1987 and 1990, respectively. He
received his Ph.D. degree in Electronic Engineering from Technical
University Braunschweig, Germany, in 1997. He received the Rudolf-Urtel
Prize in 1997 from the German Society for Film and Television Technologies
in recognition of his Ph.D. thesis work on “Optimization of MPEG-2 Video
Encoding”.
From 1993 to 1998, he was a Researcher at Heinrich-Hertz-Institute (HHI)
Berlin, Germany. Since 1998, he has been with Texas Instruments Inc., where he is
currently a research manager of video coding technology. His research
interests include video compression, video pre- and post-processing, end-to-
end video quality, joint algorithm and architecture optimization, and 3D
video.