Copyright (c) 2013 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Paper ID # J-STSP-VCHB-00081-2013 1
Abstract—Tiles is a new feature in the High Efficiency Video
Coding (HEVC) standard that divides a picture into independent,
rectangular regions. This division provides a number of
advantages. Specifically, it increases the “parallel friendliness” of
the new standard by enabling improved coding efficiency for
parallel architectures, as compared to previous slice-based
methods. Additionally, tiles facilitate improved maximum
transmission unit (MTU) size matching, reduced line buffer
memory, and additional region-of-interest functionality. In this
paper, we introduce the tiles feature and survey the performance
of the tool. Coding efficiency is reported for different
parallelization factors and MTU size requirements. Additionally,
a tile-based region of interest coding method is developed.
Index Terms—Video coding, Multicore processing, High
efficiency video coding, Tiles.
I. INTRODUCTION
THE ISO/IEC’s Moving Picture Experts Group (MPEG)
and the International Telecommunications Union’s (ITU-
T) Video Coding Experts Group (VCEG) have recently
concluded work on the first edition of the High Efficiency
Video Coding (HEVC) standard [3][4][5]. This standard was
developed collaboratively by the Joint Collaborative Team on
Video Coding (JCT-VC). For consumer applications, HEVC
has been reported to achieve 50% improvement in coding
efficiency when compared to previous coding standards such
as MPEG-4 AVC/ITU-T H.264 [1][5]. These coding gains are
achieved through a number of improvements that result in an
increase in computational complexity for both encoder and
decoder.
Manuscript received January 30, 2013; revised May 10, 2013; camera-ready version submitted June 21, 2013.
††Kiran Misra and Andrew Segall are with Sharp Laboratories of America, Inc., 5750 NW Pacific Rim Blvd, Camas, WA 98607, USA (e-mail: {misrak,asegall}@sharplabs.com).
†Michael Horowitz and Shilin Xu are with eBrisk Video, Inc., Suite 1450, 1055 West Hastings Street, Vancouver, BC, V6E 2E9, Canada (e-mail: {michael,shilin}@ebriskvideo.com).
‡Arild Fuldseth is with Cisco Systems, Oslo, Norway (e-mail: [email protected]).
‡‡Minhua Zhou is with Texas Instruments Inc., 12500 TI Blvd, Dallas, TX 75243, USA (e-mail: [email protected]).
Here, computational complexity refers to a combination of algorithmic operations and memory transfers. Algorithmic
operations correspond to the calculations required in a decoder
to convert bit-stream information to reconstructed pixel values
or in an encoder to convert the original pixel values to a bit-
stream. For hardware, this corresponds to logic gates; for
software, this corresponds to calculations on a CPU, GPU, or
other processing units. Memory transfers represent the amount
of data that must be stored and accessed to perform the
required calculations. Typical architectures contain multiple
memory types, ranging from high speed memory that is on-
chip (including caches near a CPU core) to lower speed
memory that is off-chip or farther from the core. In general,
on-chip memory is more expensive and therefore relatively
small. Additionally, for many architectures, the critical
bottleneck is the bandwidth necessary to transfer data from
off-chip to on-chip memory in time to complete the required
calculations.
The increase in computational complexity in HEVC compared
with earlier standards directly impacts the implementation and
design. For systems with a single-core processor, the increased
complexity requires higher clock speeds. This has the
additional cost of increased power consumption and heat
dissipation. For many applications of interest today, the
increased clock rate is not desirable.
An alternative solution for addressing the increased
computational complexity is parallelism. Parallelism in a video
system is not a new concept. For example, today’s software
based video conferencing systems that operate at resolutions
up to 1080p (1920x1080 pixels) and frame rates of 60 frames
per second (fps) rely on high-level parallelism (i.e., encoders
and decoders that can process different portions of a video
picture in a relatively independent fashion) despite using the
less computationally complex H.264/AVC and its scalable
extension SVC. With previous standards, high-level
parallelism within a picture may be realized by partitioning the
source frames using slices and assigning each slice to one of
several processing cores. Slices were originally designed to
map a bit-stream into smaller independently decodable chunks
for transmission. The size of a coded slice was typically
determined by the network characteristics; for example, the
size is often selected to be less than the maximum transmission
unit (MTU) size of the network being considered.
An overview of tiles in HEVC
Kiran Misra††, Member, IEEE, Andrew Segall††, Member, IEEE, Michael Horowitz†, Shilin Xu†, Arild Fuldseth‡, Minhua Zhou‡‡
In practice, using slices for parallelization has a number of disadvantages. For example, segmenting pixels into slices based only on network constraints often results in partitions that reduce the correlation present in the pixel data. This lowers the achievable coding efficiency.
Moreover, slices contain header information to facilitate
independent processing of pixel data. With the higher coding
efficiency of HEVC, this becomes problematic – it is possible
to transmit high resolution video at low bit rates such that the
overhead introduced by a slice header is not negligible.
Finally, for applications that require both parallelization and
packetization, it is difficult to use slices to achieve an optimal
partitioning for both goals.
Tiles provide an alternative partitioning that divides a picture
into rectangular sections that are processed in a relatively
independent fashion. Figure 1 illustrates an example where a
picture is partitioned into rectangular processing units called
tiles [7]. HEVC also provides additional tools for parallelism,
including both low-level and high-level methods. For high-
level parallelism, one alternative approach is wavefront
processing, which is further described in [9].
As described in this section, tiles have a number of desirable
properties for next generation video coding devices. In the rest
of this paper, we provide a detailed description of the tiles
feature in HEVC. The rest of the paper is organized as
follows: Section II provides background on slices and tiles. In Section III, we discuss constraints on tiles as described in the HEVC standard. In Section IV, an example usage of tiles is provided. Section V reports the coding efficiency improvement associated with the use of tiles in MTU size matching and high-level parallelism applications, and further demonstrates the efficacy of tiles in lightweight bit-stream rewriting. Finally, Section VI provides concluding remarks.
II. BACKGROUND
A. Slices
A video decoder consists of two fundamental processes: (a)
bit-stream parsing carried out by the entropy decoder and (b)
picture reconstruction carried out by the pixel processing
engine. The video bit-stream is typically organized in a causal
fashion where both the parsing and the reconstruction step for
the current location being processed depend on information
occurring earlier in the bit-stream. In practical applications, a
video bit-stream may be transmitted over lossy channels before
it arrives at the decoder. Loss of a part of the video bit-stream
would lead to an inability to parse and/or reconstruct
information later within the bit-stream. This causal
dependency propagates and therefore a single error may lead
to an inability to process a significant portion of the bit-stream
occurring after the error. To limit the propagation of error it is
important to break dependencies in processing. Earlier video
coding standards [1][2][6] achieved this by organizing the bit-
stream into independently parsable units called slices.
Within a picture, individual slices can be independently
reconstructed. In HEVC, slices define groups of independently
parsable coded tree blocks (CTBs). Slices contain CTBs which
follow raster scan order within a picture as shown in Figure 2.
Previous standards such as H.264/AVC include tools such as
flexible macroblock ordering (FMO) that enable defining
arbitrary shaped slices. However, while FMO provides
excellent capabilities in defining slice shapes, it unfortunately
requires frame-level decoupling of the deblocking filter from
the rest of the decoding process. Thus, in the context of H.264,
it is not possible to perform macroblock based deblocking
filtering with FMO. Largely as a result of this property, FMO
is not included in HEVC, as the frame-level deblocking
significantly increases memory bandwidth and leads to a
decrease in decoder performance. By contrast, with tiles, the
in-loop filtering process can be performed at the CTB-level
with the use of vertical column buffers. Moreover, tiles in
HEVC provide an additional benefit of having lower overhead since, unlike FMO, they do not have associated slice headers.

Figure 1 – Example illustrating rectangular picture partitioning and coded tree block (CTB) scanning order within a picture that is divided into nine tiles.

Figure 2 – Example illustrating slice-based picture partitioning of coded tree blocks (CTBs) following a raster scan order within the picture.

Figure 4 – Illustration of sample data from reconstructed frames to be stored in on-chip memory for three different CTB rows. The dashed lines show the sample data stored in memory.
In HEVC, slice partitioning may be based on network
constraints such as maximum transmission unit (MTU) or
pixel processing constraints such as maximum number of
CTBs to be contained within a slice. As can be seen in Figure
2, following the raster-scan order within a picture results in
partitioning which has a lower level of spatial correlation
within the picture. Additionally, every slice contains within it
an associated slice header which adds a non-negligible
overhead at lower bitrates. As a result of the reduced spatial correlation and the additional slice-header overhead, video coding efficiency suffers.
B. Tiles
A picture in HEVC is partitioned into coded tree blocks. In
addition, each picture may be partitioned into rows and
columns of CTBs, and the intersection of a row and column
results in a tile. Note that tiles are always aligned with CTB
boundaries. As a result of its partitioning flexibility, a tile may
be spatially more compact than a slice containing the same
number of coded tree blocks. This has the benefit of higher
correlation between pixels compared to slices. As an additional advantage, tiles do not carry headers, which further improves coding efficiency. The CTBs within a tile are processed in
raster scan order, and the tiles within a picture are processed in
raster scan order. An example of the above-described
partitioning is shown in Figure 1 where a picture is partitioned
into three columns and three rows of CTBs. The CTBs within
the first (upper-left) tile, depicted as numbered squares 1-12,
are scanned in raster scan order. After scanning the first tile,
the second tile follows (in tile raster scan order). Specifically,
as shown in Figure 1, CTB #13 in the second tile follows CTB
#12 in the first tile. Tiles allow the column and row boundaries to be specified with or without uniform spacing.
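The tile scan described above (CTBs in raster order within each tile, tiles in raster order within the picture) can be sketched as follows; the function name and the boundary-list representation are illustrative, not HEVC syntax:

```python
def tile_scan_order(pic_w_ctbs, pic_h_ctbs, col_bounds, row_bounds):
    """Return CTB (x, y) positions in tile scan order.

    col_bounds / row_bounds list the starting CTB column/row of each
    tile column/row; e.g. [0, 4, 10] for the 13-CTB-wide picture of
    Figure 1, whose first tile is 4 CTBs wide.
    """
    col_edges = col_bounds + [pic_w_ctbs]
    row_edges = row_bounds + [pic_h_ctbs]
    order = []
    for ty in range(len(row_bounds)):          # tiles in raster order
        for tx in range(len(col_bounds)):
            # CTBs in raster order within the current tile
            for y in range(row_edges[ty], row_edges[ty + 1]):
                for x in range(col_edges[tx], col_edges[tx + 1]):
                    order.append((x, y))
    return order
```

For the nine-tile picture of Figure 1, the thirteenth CTB visited is the upper-left CTB of the second tile, matching the numbering in the figure.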
The modified scan pattern has the advantage of reduced line
buffer requirements for motion estimation. Specifically, the
prediction of a CTB requires storing, in on-chip memory,
reconstructed pixel data (from previously coded frames) that
are candidates for motion compensation. This data is loaded
into on-chip memory and retained until no longer needed.
Without tiles, raster scanning of a picture results in storing
sample data equal to PicW*(2*SRy + CtbHeight), where PicW
is the width of the picture, SRy is the maximum vertical size of
a motion vector in full sample units and CtbHeight is the
height of a CTB in full sample units. With tiles, the modified
scan pattern results in a sample data storage requirement that is
approximately (TileW+2*SRx)*(2*SRy + CtbHeight), where
TileW is the width of a tile and SRx is the maximum
horizontal size of a motion vector. Using tiles, sample data
storage can be substantially reduced when PicW is
significantly larger than (TileW+2*SRx), which is typical.
Note that the above analysis assumes a single core encoder
that processes tiles sequentially. A graphical illustration of the
memory required for motion estimation using tiles is shown in
Figure 4.
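The two storage expressions above can be compared numerically; the picture, CTB, and search-range values below are illustrative, not taken from the paper:

```python
def me_buffer_samples(pic_w, ctb_h, sr_y, tile_w=None, sr_x=0):
    """Approximate reference-sample storage for motion estimation.

    Without tiles, the raster scan keeps a full-width band of
    pic_w * (2*sr_y + ctb_h) samples on chip; with tiles, only a
    (tile_w + 2*sr_x)-wide band is needed (single-core encoder,
    tiles processed sequentially, as assumed in the text).
    """
    width = pic_w if tile_w is None else tile_w + 2 * sr_x
    return width * (2 * sr_y + ctb_h)

# 3840-wide picture, 64-sample CTBs, +/-128 search range, 960-wide tiles:
no_tiles = me_buffer_samples(3840, 64, 128)                          # 1,228,800
with_tiles = me_buffer_samples(3840, 64, 128, tile_w=960, sr_x=128)  # 389,120
```

With these values the tiled scan needs roughly a third of the sample storage, consistent with the claim that the savings grow as PicW exceeds (TileW + 2*SRx).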
In addition to changing the CTB scanning process, tile
boundaries denote a break in coding dependencies.
Dependencies between tiles are disabled in the same manner as
between slices. Specifically, entropy coding and reconstruction
dependencies are not allowed across a tile boundary. This
includes motion vector prediction, intra prediction and context
selection. In-loop filtering is the only exception: it is allowed across tile boundaries, but can be disabled by a flag in the bit-stream. Thus, separate tiles may be encoded on
different processors with little inter-processor communication.
The breaking of coding dependencies at tile boundaries
implies that a decoder can process tiles in parallel with other
tiles. To facilitate parallel decoding the locations of tiles must
be signaled in the bit-stream. In HEVC, the bit-stream offsets of all but the first tile are explicitly transmitted in the slice header [10][18][19]. The location of the first tile, immediately following the slice header, is known to the decoder.

Figure 3 – Example illustrating tiles partitioning of a picture to identify region of interest. In the above example the tiles in the center column for the first two rows form the region of interest.
Tiles have been successfully demonstrated to be an effective
parallelism tool for pixel-load balancing in ultra-high
definition (UHD) video [8]. In addition to high-level
parallelism, tiles also facilitate improved MTU size matching.
Additionally, for some applications the rectangular pixel data
partitioning afforded by tiles facilitates region of interest
(ROI) based coding. Figure 3 illustrates an example where two
tiles are used to represent the ROI within a video source. Tiles
based ROI identification can be used to facilitate asymmetric
processing of pictures, where the tile corresponding to the ROI
is processed by a more computationally-capable core. This is a
desirable trait in some applications where the ROI’s encoder
rate-distortion decision process is computationally more
demanding due to the use of advanced search algorithms and
distortion metrics. Additionally, when tiles lying within an
ROI are coded independently, the subset of the bit-stream
corresponding to the ROI can be easily extracted and re-
constituted into another bit-stream with lower bit rate
requirements. An example application where tile partitioning
and ROI based coding is used to perform lightweight bit-
stream rewriting is demonstrated in section V.C.
An equally strong benefit of tiles is the reduction of memory
bandwidth and on-chip memory requirements. Specifically,
with a rectangular partitioning, the size of the line buffers
required for motion compensation and in-loop filtering is
dramatically reduced. (The line buffer width is reduced from the width of the picture to approximately the width of the tile; thus, even for two tiles, the reduction is nearly 50%.) For an
encoder with significant memory restrictions, this has the
additional advantage of higher coding efficiency, as the
reduced memory requirements of tiles enables a larger vertical
search range for such an encoder [12].
C. Complexity properties
For system designers the worst-case performance of a tool is
an important measure that needs to be considered when
designing solutions that are guaranteed to meet a minimum
constraint. With this view in mind, we describe the worst-case
per-picture execution time for slices and tiles. It is assumed
that the degree of parallelism afforded by slices and tiles is the
same so as to enable comparisons between them.
We consider a video coding environment with N ≥ 1 independent encoders and N independent decoders. The encoders and decoders have equal computational capability. Let t(e, i) represent the encoding time for CTB_i. A picture consists of a total of r rows of CTBs and c columns of CTBs. If a picture is partitioned into N slices, then the worst-case encoding time for the system is:

    T_enc,slices = max over all slices [ Σ_{CTB_i in slice} t(e, i) ] + T_dblk,slices + T_SAO,slices      (1)

where T_dblk,slices and T_SAO,slices represent the deblocking and sample adaptive offset filter encoding times at slice boundaries, respectively.

If a picture is partitioned into N uniform tiles rather than N uniform slices, then the number of CTBs at the boundary of tiles can always be made smaller than or equal to the number of CTBs at the slice boundaries. For N tiles, the worst-case encoding time for the system is:

    T_enc,tiles = max over all tiles [ Σ_{CTB_i in tile} t(e, i) ] + T_dblk,tiles + T_SAO,tiles      (2)

where T_dblk,tiles and T_SAO,tiles represent the deblocking and sample adaptive offset filter encoding times at tile boundaries, respectively. Similar expressions can be derived for the worst-case decoding time of the system.
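Equations (1) and (2) share the same form, so both can be evaluated with one helper; the names and the lumped boundary-filter term below are illustrative:

```python
def worst_case_time(ctb_times, partitions, boundary_filter_time=0.0):
    """Worst-case per-picture time for a set of parallel partitions.

    ctb_times: per-CTB processing time, indexed by CTB address.
    partitions: list of CTB-address lists (one list per slice or tile).
    The max over partitions models N equal cores running in parallel;
    boundary_filter_time lumps the deblocking and SAO boundary terms.
    """
    return max(sum(ctb_times[i] for i in part) for part in partitions) \
        + boundary_filter_time

# Eight equal-cost CTBs split over two cores, plus a small boundary cost:
t = worst_case_time([1.0] * 8, [[0, 1, 2, 3], [4, 5, 6, 7]], 0.5)  # 4.5
```

An uneven split (for example, a 6/2 partition of the same CTBs) raises the maximum, which is why load balancing across tiles matters.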
As an example, if we assume that (a) the only difference in execution time between slice- and tile-based processing lies in the processing of pixels at the partition boundaries, (b) the number of CTBs in a picture is large compared to N, and (c) tiles take on square shapes, then the number of boundary CTB edges shared between partitions is at most (min(N, r) − 1)·c + (r − 1) for the slice-based approach and 2·√(N·r·c) − r − c for the tile-based approach. Here the function min(x, y) returns the smaller of the two values x and y (and returns x when the two are equal). The second expression, representing the shared boundary for tiles, is strictly smaller than the first, indicating that, under the stated assumptions, tile-based parallel processing may be preferred. In practice, however, the non-square nature of tiles and the coding complexity of the video scene being coded make it necessary to further refine these complexity considerations.
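As a cross-check of the qualitative claim, the shared boundary can be counted directly for two simple geometries (band-of-rows slices and a perfect-square tile grid); these are special cases of the bounds discussed above, chosen so the counts are exact:

```python
import math

def band_slice_edges(n, r, c):
    """Shared CTB edges when an r x c CTB picture is cut into n
    horizontal bands (raster-scan slices aligned to CTB rows)."""
    return (n - 1) * c          # each cut crosses the full picture width

def square_tile_edges(n, r, c):
    """Shared CTB edges for a sqrt(n) x sqrt(n) grid of tiles
    (n is assumed to be a perfect square)."""
    k = math.isqrt(n)
    return (k - 1) * r + (k - 1) * c   # internal vertical + horizontal cuts

# 16-way parallelism on a picture of 34 x 60 CTBs of 64x64 samples:
# band_slice_edges(16, 34, 60) -> 900; square_tile_edges(16, 34, 60) -> 282
```

Consistent with the text, the tile grid shares far fewer boundary edges than the slice bands at the same degree of parallelism.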
In theory, it is possible to perform load balancing between
different processing cores based on estimated coding
complexity of different regions of a picture and redefining tile
boundaries. A good load-balancing algorithm would increase
resource utilization while reducing average processing times.
However, frequent changes in tile boundaries and the
associated change in scan pattern and buffering requirements
make software/hardware optimizations difficult to achieve. In
practice, system designers need to determine a good trade-off
between variable tile structures and optimized implementations
for their individual applications.
(a) (b)
Figure 5 – Interaction between tiles and slices. In the example above, the following are illustrated: (a) Three
complete tiles contained within a single slice (b) Two complete slices contained within each tile.
III. CONSTRAINTS ON TILES
In this section, we begin by listing the constraints related to
tiles in HEVC.
Supporting tiles in the HEVC system requires the transmission
of the tiles configuration information from an encoder to a
decoder. This includes column and row locations, loop filter
control and the bit-stream location information for the start of
all but the first tile in a picture. When uniform spacing is used, the tile boundaries are automatically distributed evenly across the picture, balancing the pixel load approximately evenly amongst the different tiles in a picture.
Alternatively, tile boundaries may be explicitly specified, for
example based on picture coding complexity. When more than one tile exists within a picture, the tile column widths and tile row heights are required to be greater than or equal to 256 and 64 luma samples, respectively. This constraint ensures that tiles cannot be too small. Additionally, the total number of
tiles within a picture is limited by constraining the maximum
number of tile columns and maximum number of tile rows
allowed within a picture based on the level of the bitstream
under consideration. These bounds are specified in Table A-1
of the HEVC standard and monotonically increase with
increasing level.
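A minimal validity check for a tile grid, following the constraints just described, might look as follows; the Table A-1 level limits are passed in as parameters rather than reproduced here:

```python
def check_tile_config(col_widths, row_heights, max_cols, max_rows):
    """Validate a tile grid against the constraints described above.

    col_widths / row_heights are in luma samples; max_cols / max_rows
    are the level-dependent limits from Table A-1 of the standard
    (values not reproduced here).
    """
    if len(col_widths) > 1 or len(row_heights) > 1:  # more than one tile
        if any(w < 256 for w in col_widths):
            return False    # tile columns must be >= 256 luma samples
        if any(h < 64 for h in row_heights):
            return False    # tile rows must be >= 64 luma samples
    return len(col_widths) <= max_cols and len(row_heights) <= max_rows

# The 2x2 grid from the 720p example in Section IV passes:
ok = check_tile_config([640, 640], [384, 336], max_cols=10, max_rows=10)
```

A configuration with a 200-sample-wide column, by contrast, would be rejected by the minimum-width rule.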
In HEVC, slice boundaries can also be introduced by the
encoder and need not be coincident with tile boundaries.
However, to manage decoder implementation complexity, the
combination of slices and tiles is constrained. Specifically,
either all coded blocks in a tile must belong to the same slice,
or all coded blocks in a slice must belong to the same tile.
Figure 5 illustrates the two constraints. In Figure 5a, all coded blocks within the three tiles belong to a single slice, illustrating the first constraint, while in Figure 5b, all coded blocks in each slice belong to the same tile. As a consequence of these constraints, note that a slice that does not start at the beginning of a tile cannot span multiple tiles.
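One way to read the slice/tile interaction constraint is that every slice/tile pair sharing coded blocks must be nested one way or the other. A sketch under that reading, assuming per-CTB partition maps (not HEVC syntax):

```python
def slices_tiles_legal(ctb_to_slice, ctb_to_tile):
    """Check that every slice/tile pair sharing CTBs is nested:
    the slice lies inside the tile, or the tile inside the slice.

    Both arguments map CTB address (list index) to a partition label.
    """
    def groups(labels):
        g = {}
        for ctb, lab in enumerate(labels):
            g.setdefault(lab, set()).add(ctb)
        return list(g.values())

    for s in groups(ctb_to_slice):
        for t in groups(ctb_to_tile):
            if s & t and not (s <= t or t <= s):
                return False    # overlapping but not nested: illegal
    return True

# Four CTBs, two tiles; one slice covering both tiles is legal,
# a slice straddling the tile boundary without containing a tile is not:
legal = slices_tiles_legal([0, 0, 0, 0], [0, 0, 1, 1])
illegal = slices_tiles_legal([0, 1, 1, 2], [0, 0, 1, 1])
```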
An additional constraint made on the tile system in HEVC is
that all tile locations within a bit-stream are provided to the
decoder. Conceptually, a single-core decoder could ignore this
location information while processing a bit-stream containing
tiles. This would result in a single core decoder following the
memory access pattern described in section II.B. However, the HEVC design requires the transmission of entry points for all but the first tile, with the goal of realizing two key benefits: the first is enabling the maximum amount of parallelization at the decoder; the second is allowing a bit-stream containing tiles to be decoded in raster-scan order. More specifically, a decoder may receive a bit-stream in tile raster-scan order but choose to decode it in the (alternative) raster-scan order of the frame. This is achieved by decoding the CTBs in
the first row of a first tile, saving the entropy coding state, and
then resetting the entropy coder to decode the CTB in the first
row of the neighboring tile. For example, as illustrated in Figure 5a, this corresponds to decoding CTBs 1, 2, 3, 4 (in that order); saving the entropy coding state for the current tile; and then resetting the entropy coder and continuing by decoding CTB 33 in the adjacent tile. This benefits single-core decoders, as the
single core device can decode a bit-stream containing tiles
without significant changes in processing or memory access
pattern.
Entry points also allow a bit-stream containing independently
decodable tiles to be extracted and re-constituted into a lower
bit rate stream. This requires further non-normative (encoder
side) constraints on the bit-stream being generated. These non-
normative constraints with the corresponding lightweight
rewriting experiments are listed in section V.C.
IV. EXAMPLE USAGE FOR TILES
We now take a more detailed look at an example use case for
tiles.
Video conferencing applications, which stand to benefit from the parallelism afforded by tiles, are used to demonstrate their usage. Software-based interactive video applications run on
platforms ranging from laptop and desktop computers to
tablets and smart phones. Modern desktop and laptop
computers use CPUs with four or more processing cores.
Many tablets and smartphones sporting dual-core and even quad-core ARM processors are already commercially available. One way to leverage multi-core computational
power for HEVC video encoding and decoding is to use tiles.
The following example describes the use of tiles syntax in the
context of a software-based 720p (1280x720) interactive video
application operating at 60 frames per second (fps). The
example application is designed for a hardware platform
containing an Intel Core i7 CPU which, accounting for hyper-threading, has eight virtual processing cores. The application
consists of several components including an HEVC encoder,
HEVC decoder, audio processing, user interface, etc. With the
relative computational complexity associated with each
component in mind, four virtual cores are allocated for HEVC
encoding, two for decoding (HEVC decoding generally
requires fewer computational resources than encoding) and the
remaining two cores are reserved for all other application
components.
To better take advantage of the processing capacity of the four
cores allocated for encoding, the input picture is partitioned
into four tiles. Since each core has identical computational
capabilities, it is desirable to partition a given picture so that
the encoding of each resulting tile requires the same
processing power. To achieve a proper processing load
balance, tiles in “active” picture regions requiring more
processing power are specified to be smaller than tiles in less
computationally demanding regions. One simple load
balancing strategy starts by partitioning the picture to be
encoded into tiles of roughly equal size and adapting the size
of the tiles over time depending on source content. Tile
locations and dimensions are specified in the picture parameter
set (PPS). The use of the PPS facilitates picture-to-picture tile
configuration changes that may be made in order to load
balance. A picture may be partitioned into four uniformly
spaced tiles (to facilitate load balancing). The resulting tiles
are more spatially compact than those resulting from other
partitioning strategies (e.g., four tiles side-by-side). In general,
tile compactness results in improved coding efficiency as
discussed earlier. The resulting left and right tile columns each
have a width of 640 luma samples. Assuming the coded tree blocks consist of 64x64 luma samples, setting the first tile
row height to 6 CTBs results in the top tile row having a height
of 384 luma samples while the bottom row has a height of 336
samples. In this way, four tiles having dimensions 640x384,
640x384, 640x336, and 640x336 counting clockwise from the
upper-left, are specified.
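The arithmetic behind the 2x2 partition above can be sketched as follows; the function and parameter names are illustrative:

```python
def tile_dims_720p(pic_w=1280, pic_h=720, ctb=64, first_row_ctbs=6):
    """Derive the four tile dimensions used in the 720p example.

    The picture is 20 CTB columns wide, split 10/10 -> 640-sample
    columns; the first tile row is 6 CTBs tall (384 samples), leaving
    336 samples for the bottom row (whose last CTB row is partial).
    """
    half_w = (pic_w // ctb // 2) * ctb   # 640 luma samples per tile column
    top_h = first_row_ctbs * ctb         # 384 luma samples
    bottom_h = pic_h - top_h             # 336 luma samples
    # Clockwise from the upper-left: top-left, top-right,
    # bottom-right, bottom-left.
    return [(half_w, top_h), (half_w, top_h),
            (half_w, bottom_h), (half_w, bottom_h)]
```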
Having specified the tile dimensions, the encoder partitions the
input picture into four tiles and sends the picture data
associated with each tile to a separate processing core for
encoding. In this way, the encoder may achieve full processor
utilization with very low delay. After encoding, the bits
produced by each core must be assembled into a coded slice in
decoding order (tiles are decoded in raster scan order within a
picture) prior to being placed in a data packet and sent to the
network for transport. For the sake of clarity, we shall assume
the bits for all tiles in a picture are contained within a single
coded slice. This assumption is not unreasonable for a 720p,
60 fps HEVC encoding in the context of video conferencing.
Parallel decoding requires a different approach. Where an
encoder has the flexibility to choose how to partition and
allocate portions of an input picture for parallel processing,
due to dependencies in the decoding process, the decoder
cannot arbitrarily partition an input bit-stream. To facilitate
parallel decoding, HEVC inserts information into a slice
header to signal entry points associated with tiles contained
within that slice. In this example, three entry points per slice are signaled to mark the locations in the bit-stream of the start of the second, third, and fourth tiles. The decoder receives the
coded slice, parses the slice header and determines the
associated PPS which is mandated to occur earlier in the bit-
stream than any slice referencing it. From the PPS, the decoder
may derive the number of tiles as well as the location and
spatial dimensions of each tile. In addition, the decoder
determines the location in the bit-stream of the tile entry point
from the slice header. The tile substreams may then be sent to
the two independent cores for decoding. In this example, each
core is assigned two tiles for processing at the decoder. An
important benefit of signaling entry points in the slice header is
that it facilitates raster-scan based decoding as described in
section III.
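Once the entry points are known, splitting the slice payload into tile substreams is a byte-slicing step. The sketch below ignores the actual slice-header syntax (entry_point_offset_minus1, etc.) and simply assumes byte offsets measured from the start of the payload:

```python
def split_tile_substreams(slice_payload, entry_offsets):
    """Split a coded-slice payload into per-tile substreams.

    entry_offsets: byte offsets (from the start of the payload) of the
    second, third, ... tiles, as recovered from the slice header; the
    first tile starts immediately after the header (offset 0).
    """
    starts = [0] + list(entry_offsets)
    ends = list(entry_offsets) + [len(slice_payload)]
    return [slice_payload[s:e] for s, e in zip(starts, ends)]

# Four tiles -> three signaled entry points; each decoding core
# can then be handed two of the resulting substreams:
subs = split_tile_substreams(b"AAAABBBCCCCCDD", [4, 7, 12])
```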
V. EXPERIMENTS
To assess the benefit of the tiles feature, we report the coding
efficiency improvement of the approach in a number of
configurations. These configurations include the cases of high-
level parallelization and network maximum transmission unit
(MTU) size matching. The efficacy of tiles is further
demonstrated using a lightweight bit-stream rewriting example.
Another experiment considering motion estimation with
limited on-chip memory is reported in [7].
The experiments reported here are conducted on test
sequences of different resolutions. Sequences used in
experiments are classified into five groups based on their
resolution. Class A sequences have the highest resolution of
2560x1600 and are cropped versions of ultra-high definition
(UHD) 4K resolution sequences. Class B sequences
correspond to full high definition (HD) sequences with a
resolution of 1920x1080. Class C and Class D sequences
correspond to WVGA and WQVGA resolutions of 832x480
and 416x240, respectively. Finally, Class E sequences are
typical of video conferencing applications and have 720p (i.e.,
1280x720 pixel) resolution.
For the experiments, Class A includes the Traffic and
PeopleOnStreet sequences; Class B includes the Kimono,
ParkScene, Cactus, BasketballDrive and BQTerrace
sequences; Class C includes the BasketballDrill, BQMall,
PartyScene and RaceHorses sequences, class D includes the
BasketballPass, BQSquare, BlowingBubbles and RaceHorses
sequences; Class E includes the FourPeople, Johnny and
KristenAndSara sequences.
Note that for the random access configuration, Class E
sequences are not tested, while for the low delay configurations,
Class A sequences are not tested. This is consistent with the
test conditions defined during the standardization process of
HEVC [15].
A. Comparing parallelism using Tiles versus Slices
(Experiment 1)
In a first experiment, we compare the high-level
parallelization performance of tiles to traditional slices. Here,
we select the tile size to be approximately equal to the size of
one WQVGA image frame (i.e., 416x240 pixels). The exact
slice and tile partitioning used for each class of sequences
is listed in Table 1. For the reference, we select the slice
size to have the same number of CTBs as a tile. Choosing tile
sizes approximately equal to WQVGA results in a single tile
and correspondingly a single slice for Class D sequences, so
the anchor and test data would have identical rate-distortion
performance. Consequently, Class D sequences are not tested
in this experiment.
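As a concrete check of the partitioning in Table 1, the CTB grid and per-tile CTB count can be worked out as follows; this is an illustrative sketch assuming 64x64 CTBs and uniform tile spacing.

```python
def ctb_grid(width, height, ctb=64):
    """Number of CTB columns and rows for a picture (ceiling division)."""
    return -(-width // ctb), -(-height // ctb)

# Class A (2560x1600) has a 40x25 CTB grid; a 5x5 tile grid yields
# 8x5-CTB tiles, i.e. 40 CTBs per tile -- matching the 40-CTB
# reference slices, so both partitions expose the same parallelism.
cols, rows = ctb_grid(2560, 1600)
tile_ctbs = (cols // 5) * (rows // 5)
```

The same arithmetic reproduces the Class E row of Table 1: a 1280x720 picture gives a 20x12 CTB grid, so a 4x3 tile grid yields 5x4-CTB tiles of 20 CTBs each.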
Experiments are conducted using HM 9.2 [14] and JCT-VC
main configuration common conditions [15] for class A to E
test sequences. Results from the comparison appear in Table 2.
As can be seen from the table, the tiles system provides
average luma BD-rate improvements of 2.2%, 2.2%, 5.4% and
5.5% for the main configurations of All Intra, Random Access,
Low Delay B and Low Delay P, respectively, compared to slices
with the same degree of parallelization.
B. MTU size matching using Tiles (Experiment 2)
In a second experiment, we compare the performance of the
tiles system for MTU size matching to traditional slices. Here,
an encoder divides a picture into slices that do not exceed
1500 bytes. This slice size is consistent with the MTU size of
an Ethernet v2 network. Tiles are used to improve the coding
efficiency of the system. We use column boundaries to divide
the picture, since we observe that columns result in more
square-like slice shapes, leading to higher spatial correlation.
The column widths used for each sequence class and encoder
configuration are listed in Table 3. The higher correlation
improves intra prediction, mode prediction and motion vector
coding, for example.
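The byte-limited slice formation used in this experiment can be sketched as a greedy packer over coded tile sizes. The helper below and its header-overhead parameter are hypothetical simplifications of the modified HM encoder, not its actual implementation.

```python
def pack_tiles_into_slices(tile_sizes, mtu=1500, header_bytes=0):
    """Greedily group whole coded tiles into byte-limited slices.

    tile_sizes: coded size in bytes of each tile, in tile-scan order.
    Each slice starts at a tile boundary and is closed before it would
    exceed the MTU; a single oversized tile still gets its own slice
    (in practice the encoder would then break the slice inside the
    tile or re-encode).
    """
    slices, current, size = [], [], header_bytes
    for i, tile_size in enumerate(tile_sizes):
        if current and size + tile_size > mtu:
            slices.append(current)       # close the current slice
            current, size = [], header_bytes
        current.append(i)
        size += tile_size
    if current:
        slices.append(current)
    return slices
```

For example, four tiles of 600 bytes each pack into two 1200-byte slices under a 1500-byte MTU.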
Experiments are conducted using HM 9.2 [14] and JCT-VC
main configuration common conditions [15] for class A to E
test sequences. A coding tree block size of 32x32 was used in
lieu of 64x64. Additionally, the HM-9.2 encoder is modified
to allow byte-limited slices that begin at the start of a tile to
extend to the end of another tile. Results for the experiment are
reported in Table 4. As can be seen from the table, tiles
improve the coding efficiency of HEVC in the MTU size
Table 1 - Slice and tile partitioning for experiment 1

              Reference                      Test
Class   Number      Slice sizes      Number of tiles    Tile dimensions
        of slices   (units of CTBs)  (horiz. x vert.)   (units of CTBs,
                                                        horiz. x vert.)
A       25          40               5 x 5              8 x 5
B       17          30               6 x 3              5 x 6
C       4           26               2 x 2              7 x 4
E       12          20               4 x 3              5 x 4
Table 2 - Encoder parallelization performance results for
experiment 1

All Intra Main
              Y        U        V
  Class A   -1.5%    -1.1%    -1.0%
  Class B   -1.9%    -1.6%    -1.5%
  Class C   -0.9%    -0.8%    -0.9%
  Class E   -4.5%    -4.0%    -4.1%
  Overall   -2.2%    -1.9%    -1.9%

Random Access Main
              Y        U        V
  Class A   -2.1%    -1.9%    -1.8%
  Class B   -3.3%    -3.8%    -2.8%
  Class C   -1.3%    -1.6%    -2.0%
  Class E     *        *        *
  Overall   -2.2%    -2.4%    -2.2%

Low Delay B Main
              Y        U        V
  Class A     *        *        *
  Class B   -3.5%    -3.3%    -2.9%
  Class C   -1.5%    -1.4%    -1.5%
  Class E  -11.3%   -10.3%   -10.1%
  Overall   -5.4%    -5.0%    -4.8%

Low Delay P Main
              Y        U        V
  Class A     *        *        *
  Class B   -4.1%    -3.9%    -3.4%
  Class C   -1.8%    -1.9%    -2.0%
  Class E  -10.7%    -9.7%   -10.1%
  Overall   -5.5%    -5.2%    -5.2%
matching scenario. Specifically, an average improvement of
2.1%, 1.1%, 0.4% and 0.4% luma BD-rate [16] is reported for
the All Intra, Random Access, Low Delay B and Low Delay P
main configurations, respectively. As the CTB
size decreases, the coding gain realized by using tiles
increases. For example, for 16x16 CTBs the gains due to tiles
have been shown to be 4.7%, 2.5%, and 0.9% luma BD-rate (on
average) for Intra, Random Access and Low Delay scenarios,
respectively [17]. Note that at extremely low bitrates, where a
single slice exists per picture, the coding efficiency benefits of
the compact representation using tiles are sometimes exceeded
by the losses incurred by breaking prediction dependencies at
tile boundaries. This is evidenced by the coding efficiency
losses observed for Class E sequences in the Low Delay B and
Low Delay P configurations.
Based on the above rate-distortion results, it is fair to conclude
that the utility of tiles for encoder parallelization and MTU
size matching is low for lower-resolution sequences such as
Class C and Class D.
C. Lightweight bit-stream rewriting using tiles based
region of interest coding (Experiment 3)
In a third experiment, we partition pictures into tiles and
identify one tile as containing the region-of-interest (ROI). To
ensure that the ROI is independently decodable from non-ROI
tiles, encoder restrictions prevent temporal predictions within
the ROI tile from referring to pixels outside the ROI in
reference pictures. Additionally, the application of the
deblocking and sample adaptive offset filters is disabled at tile
boundaries. Each picture in the video source is coded as a
single slice. The slice header contains location information
identifying the start of each tile. Using this entry point
information, a lightweight rewriting process extracts the tile
corresponding to the ROI from each picture and rewrites the
slice header and parameter sets to reconstitute a bit-stream
containing only the ROI tile.
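The extraction step of the rewriting process can be sketched as follows; the offset bookkeeping mirrors the entry point signalling, while the header rewriting itself is elided. The byte-size convention is an illustrative assumption.

```python
def extract_roi_substream(slice_payload, substream_sizes, roi_index):
    """Return the coded bytes of the ROI tile from a slice payload.

    substream_sizes lists the byte size of every tile substream except
    the last one, which runs to the end of the payload. A full
    rewriter would wrap the returned bytes in a new slice header and
    parameter sets describing a single-tile picture.
    """
    offsets = [0]
    for size in substream_sizes:
        offsets.append(offsets[-1] + size)
    start = offsets[roi_index]
    if roi_index < len(substream_sizes):
        return slice_payload[start:offsets[roi_index + 1]]
    return slice_payload[start:]  # ROI is the last tile
```

Because only byte ranges are copied and headers rewritten, a network middle box can perform this per picture without entropy decoding, which is what makes the process lightweight.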
Note that [20], which was recently adopted into the working
draft of version 2 of the HEVC standard, describes a way to
constrain the encoding process so that a decoder can correctly
decode specific sets of tiles. It also describes an encoder
constraint that avoids the need for ROI applications to disable
deblocking and sample adaptive offset filtering across tile
boundaries.
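The encoder restriction on temporal prediction can be expressed as an admissibility check on candidate motion vectors. The filter-reach margins below are assumed values standing in for the reach of the interpolation filters, and the integer-pel model is a simplification.

```python
def mv_allowed(mv, blk, roi, taps_before=3, taps_after=4):
    """True if motion compensation for blk stays inside the ROI tile.

    blk = (x, y, w, h) of the predicted block; roi = (x0, y0, x1, y1),
    half-open, in the reference picture. mv is an integer-pel motion
    vector. taps_before/taps_after model the reach of the interpolation
    filter around the nominal reference block (assumed values, applied
    conservatively even for integer-pel vectors).
    """
    x, y, w, h = blk
    x0, y0, x1, y1 = roi
    dx, dy = mv
    return (x + dx - taps_before >= x0 and
            y + dy - taps_before >= y0 and
            x + dx + w + taps_after <= x1 and
            y + dy + h + taps_after <= y1)
```

An encoder enforcing the restriction would simply discard candidate vectors for which this check fails during motion search.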
Table 3 - Column widths in units of 32x32 CTBs used
for experiment 2

           All Intra   Random Access   Low Delay B   Low Delay P
             Main          Main           Main          Main
Class A
  QP 22       40            5              *             *
  QP 27       40            7              *             *
  QP 32        4           10              *             *
  QP 37        4           10              *             *
Class B
  QP 22       30            6              4             4
  QP 27        5            6              8             8
  QP 32        6            8             10            10
  QP 37        8           10             15            15
Class C
  QP 22       13            4              3             3
  QP 27       13            7              4             4
  QP 32        3            7              7             7
  QP 37        4           13             13            13
Class D
  QP 22        7            4              4             4
  QP 27        7            7              7             7
  QP 32        2            7              7             7
  QP 37        4            7              7             7
Class E
  QP 22        4            *              7             7
  QP 27        5            *             20            20
  QP 32        5            *             20            20
  QP 37        7            *             20            20
Table 4 - MTU size matching performance results for
experiment 2

All Intra Main
              Y        U        V
  Class A   -0.9%    -0.3%    -0.3%
  Class B   -2.7%    -2.3%    -2.2%
  Class C   -1.8%    -1.4%    -1.5%
  Class D   -0.4%    -0.3%    -0.3%
  Class E   -4.8%    -4.2%    -4.0%
  Overall   -2.1%    -1.8%    -1.7%

Random Access Main
              Y        U        V
  Class A   -1.3%    -1.4%    -1.4%
  Class B   -1.9%    -2.1%    -1.9%
  Class C   -1.2%    -1.1%    -1.1%
  Class D    0.0%    -0.1%     0.0%
  Class E     *        *        *
  Overall   -1.1%    -1.2%    -1.1%

Low Delay B Main
              Y        U        V
  Class A     *        *        *
  Class B   -0.9%    -1.2%    -1.1%
  Class C   -0.6%    -0.4%    -0.3%
  Class D    0.0%    -0.1%    -0.1%
  Class E    0.5%     1.2%     0.7%
  Overall   -0.4%    -0.2%    -0.3%

Low Delay P Main
              Y        U        V
  Class A     *        *        *
  Class B   -1.0%    -1.2%    -1.2%
  Class C   -0.5%    -0.4%    -0.7%
  Class D    0.0%    -0.1%    -0.1%
  Class E    0.4%     0.5%     1.1%
  Overall   -0.4%    -0.4%    -0.4%
For this experiment, Class E sequences were used. The tile
partitioning and the ROI tile index used for the experiments
are listed in Table 5. The performance is measured using BD-rate.
The anchor bit rates correspond to the sum of the bit rates for
transmitting the full-resolution Class E sequence and a cropped
version corresponding to the ROI, each using a single tile per
picture. The quantization parameters used for
the experiment are 22, 27, 32 and 37. The anchor peak signal-
to-noise ratio (PSNR) corresponds to the PSNR of the
full-resolution Class E sequence with one tile. For the test data,
the bit rate of the full-resolution Class E sequence with the tile
configuration listed in Table 5 is used. The test PSNR also
corresponds to the full-resolution Class E sequence. The BD
rate measured using this set of anchor and test data is listed in
Table 6. This BD rate measure represents the bit rate savings
achieved by using a mechanism where only a single resolution
bit-stream is transmitted to a network middle box capable of
performing the lightweight rewriting process versus
transmitting two separate resolution bit-streams. Note that this
BD rate reflects the bit rate savings from the point of view of
an end-point device that receives the full-resolution Class E
bit-stream, and it represents average bandwidth savings of
43.9%, 28.5%, 21.1% and 23.0% for the All Intra, Random
Access, Low Delay B and Low Delay P main configurations,
respectively.
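The BD-rate measure [16] compares two rate-distortion curves by integrating the log-rate gap over their overlapping PSNR range. The sketch below uses piecewise-linear interpolation in place of the cubic polynomial fit of [16], a common simplification; the data points are illustrative.

```python
import math

def bd_rate(anchor, test):
    """Average bit-rate difference in % between two RD curves.

    anchor/test: lists of (rate, psnr) pairs. Negative means the test
    curve needs fewer bits for the same quality. Piecewise-linear
    interpolation of log-rate vs. PSNR stands in for the cubic fit
    of the original Bjontegaard metric.
    """
    def log_rate_at(points, psnr):
        pts = sorted(points, key=lambda p: p[1])
        for (r0, q0), (r1, q1) in zip(pts, pts[1:]):
            if q0 <= psnr <= q1:
                t = (psnr - q0) / (q1 - q0)
                return math.log(r0) + t * (math.log(r1) - math.log(r0))
        raise ValueError("PSNR outside curve")

    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    n = 100  # integration samples over the overlapping PSNR range
    diffs = [log_rate_at(test, lo + (hi - lo) * i / n)
             - log_rate_at(anchor, lo + (hi - lo) * i / n)
             for i in range(n + 1)]
    avg = sum(diffs) / len(diffs)
    return (math.exp(avg) - 1) * 100
```

A test curve needing uniformly 10% fewer bits than the anchor at every PSNR yields a BD-rate of -10%.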
VI. CONCLUSION
The tiles-based design of HEVC provides multiple benefits for
managing the computational complexity of video encoding and
decoding. This is especially true for high resolution video data.
By breaking dependencies within a picture, high-level
parallelism for both the encoder and decoder can be achieved
without the overhead of traditional slices. It was demonstrated
that a tiles-based parallelism approach results in average
luma bit rate savings of 2.2% to 5.5% over a slice-based
approach. It was also shown that the compact CTB scan
pattern afforded by tiles can be used to improve the coding
efficiency of MTU size matching within HEVC. Moreover, by
altering the CTB scan pattern within an image, on-chip
memory requirements are reduced and coding efficiency
improvements can be achieved. Tiles can also be used to
perform region-of-interest based lightweight bit-stream
rewriting. The combination of high-level parallelism, resource
reduction and coding efficiency provides a very useful tool
within the HEVC system.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for
their valuable comments and feedback, which were extremely
helpful in improving the quality of the paper.
REFERENCES
[1] Advanced Video Coding for Generic Audiovisual Services, ITU-T Rec.
H.264 | ISO/IEC 14496-10, May 2003; Version 2: May 2004; Version 3:
Mar. 2005; Version 4: Sept. 2005; Version 5: June 2006; Version 7:
Apr. 2007; Version 8 (with SVC extension): consented July 2007.
[2] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview
of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst.
Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[3] “Joint Call for Proposals on Video Compression Technology,” ITU T
SG16/Q.6 Doc. VCEG-AM91, Kyoto, Japan, 2010.
[4] ITU-T Rec. H.265 and ISO/IEC 23008-2: High Efficiency Video
Coding, ITU-T and ISO/IEC, April 2013.
[5] Sullivan, G. J.; Ohm, J.-R.; Han, W.-J. & Wiegand, T. “Overview of the
High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 22, no. 12, pp.
1649-1668, 2012.
[6] S. Wenger and M. Horowitz, “FMO: Flexible Macroblock Ordering,”
JVT-C089, May 2002.
[7] A. Fuldseth, M. Horowitz, S. Xu, A. Segall. M. Zhou, “Tiles”, JCT-VC
F335, 6th Meeting of Joint Collaborative Team on Video Coding (JCT-
VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy
2011.
[8] M. Zhou, V. Sze, M. Budagavi, “Parallel tools in HEVC for high-
throughput processing,” Proceedings SPIE 8499, Applications of Digital
Image Processing XXXV, 849910, October 15, 2012.
Table 6 – Lightweight stream splitting based on ROI
using tiles (experiment 3)

All Intra Main
                      Y         U         V
  FourPeople        -39.4%    -39.9%    -39.7%
  Johnny            -46.5%    -46.9%    -47.0%
  KristenAndSara    -45.7%    -46.1%    -46.1%
  Overall           -43.9%    -44.3%    -44.3%

Random Access Main
                      Y         U         V
  FourPeople        -10.8%    -18.4%    -22.0%
  Johnny            -30.9%    -34.0%    -36.8%
  KristenAndSara    -43.7%    -51.0%    -52.2%
  Overall           -28.5%    -34.5%    -37.0%

Low Delay B Main
                      Y         U         V
  FourPeople         -6.1%     -8.2%     -9.2%
  Johnny            -17.9%    -21.0%    -23.5%
  KristenAndSara    -39.2%    -40.0%    -41.1%
  Overall           -21.1%    -23.0%    -24.6%

Low Delay P Main
                      Y         U         V
  FourPeople         -8.4%    -10.4%    -11.6%
  Johnny            -20.5%    -22.8%    -25.7%
  KristenAndSara    -40.1%    -40.9%    -42.3%
  Overall           -23.0%    -24.7%    -26.5%
Table 5 – Tile heights and widths in units of 64x64
CTBs for lightweight bit-stream rewriting using tiles
based ROI (experiment 3)

                  Tile column   Tile row   ROI tile index
                  widths        heights    (zero-based)
FourPeople        20            3, 5, 4    1
Johnny            4, 12, 4      12         1
KristenAndSara    2, 18         12         1
[9] Chi Ching Chi, Mauricio Alvarez-Mesa, Ben Juurlink, Gordon Clare,
Félix Henry, Stéphane Pateux and Thomas Schierl, "Parallel Scalability
and Efficiency of HEVC Parallelization Approaches," IEEE
Transactions on Circuits and Systems for Video Technology, vol. 22,
no. 12, pp. 1827-1838, 2012.
[10] K. Misra and A. Segall, “Parallel decoding with Tiles”, JCT-VC F594,
6th Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy 2011.
[11] A. Fuldseth, “Replacing slices with tiles for high level parallelism,”
JCTVC-D227, 4th Meeting of Joint Collaborative Team on Video
Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Daegu, January 2011.
[12] M. Zhou, “Sub-picture based raster scanning coding order for HEVC
UHD video coding”, JCTVC-B062, 2nd Meeting of Joint Collaborative
Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Geneva, July, 2010.
[13] M. Horowitz and S. Xu, “Generalized slices,” JCTVC-D378, 4th
Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Daegu, January
2011.
[14] High efficiency test model software SVN repository
https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-9.2/
[15] F. Bossen, “Common HM test conditions and software reference
configurations”, JCT-VC I1100, 9th Meeting of Joint Collaborative
Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Geneva, May 2012.
[16] G. Bjontegaard, “Calculation of average PSNR differences between RD-
curves”, VCEG M33, March, 2001.
[17] M. Horowitz, S. Xu, E. S. Rye, and Y. Ye, “The effect of LCU size on
coding efficiency in the context of MTU size matching”, JCT-VC F596,
6th Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, Italy 2011.
[18] Hendry, S. Jeong, S. W. Park, B. M. Jeon, K. Misra, A. Segall, "AHG4:
Harmonized method for signalling entry points of tiles and WPP
substreams," JCTVC-H0556, 8th Meeting of Joint Collaborative Team
on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, San Jose, February 2012.
[19] Y. -K. Wang , A. Segall, M. Horowitz, Hendry, W. Wade, F. Henry , T.
Lee, "Text for tiles, WPP and entropy slices," JCTVC-H0737, 8th
Meeting of Joint Collaborative Team on Video Coding (JCT-VC) of
ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, San Jose, February
2012.
[20] Y. Wu, G. J. Sullivan, Y. Zhang, “Motion-constrained tile sets SEI
message”, JCTVC-M0235, 13th Meeting of Joint Collaborative Team on
Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC
JTC1/SC29/WG11, Incheon, KR, April, 2013.
Kiran Misra (M’09) received his B.E. degree in
Electronics Engineering from Mumbai University, India,
in 1998. He received his M.S. and Ph.D. degrees in
Electrical and Computer Engineering in 2002 and 2010
from Michigan State University (MSU), East Lansing. He
has been a member of the IEEE since 2009. He was also the
recipient of MSU’s Graduate School Research
Enhancement and Summer Program Fellowship in 2008.
Dr. Misra joined Sharp Laboratories of America Inc. as a Post-doctoral
Researcher in 2010 where he is currently a Senior Researcher in the video
coding group. His research interests include video coding and image
compression, network coding, joint source and channel code design, wireless
networking, and stochastic modeling.
Andrew Segall (S’00–M’05) received the B.S. and M.S.
degrees in electrical engineering from Oklahoma State
University, Stillwater, in 1995 and 1997, respectively, and
the Ph.D. degree in electrical engineering from
Northwestern University, Evanston, IL, in 2002. He is
currently a Manager at Sharp Laboratories of America,
Camas, WA, where he leads groups performing research on
video coding and video processing algorithms for next
generation display devices. From 2002 to 2004, he was a Senior Engineer at
Pixcise, Inc., Palo Alto, CA, where he developed scalable compression
methods for high definition video. His research interests are in image and
video processing and include video coding, super resolution and scale space
theory.
Michael Horowitz received an A.B. degree with
distinction in physics from Cornell University, Ithaca, NY,
in 1986, an M.S. in electrical engineering from Columbia
University, New York City, NY, in 1988 and a Ph.D. in
electrical engineering from The University of Michigan,
Ann Arbor, in 1998.
He is Chief Technology Officer at eBrisk Video. Prior to
eBrisk, he led the engineering team at Vidyo that developed
the first commercially available H.264 SVC video codec. Earlier, at Polycom
he led the engineering team that developed the first commercially available
in-product H.264/AVC video codec. Dr. Horowitz is Managing Partner at
Applied Video Compression and is a member of the Technical Advisory
Board of Vivox, Inc.
Dr. Horowitz has served as chair for several ad hoc groups including the ad
hoc group on High-level Parallelism during the ITU-T | ISO/IEC Joint
Collaborative Team on Video Coding’s (JCT-VC) development of HEVC.
Shilin Xu received a B.E. degree in Communication Engineering in 2004 and
a Ph.D. degree in Electrical and Information Engineering in 2009, both from
Huazhong University of Science and Technology, Wuhan, China.
He has been a research engineer at eBrisk Video since 2010 and is actively
participating in the standardization of HEVC. Prior to eBrisk, he was an
assistant professor at Wuhan Institute of Technology, China, from 2009 to
2010.
Arild Fuldseth received his B.Sc. degree from the
Norwegian Institute of Technology in 1988, his M.Sc.
degree from North Carolina State University in 1989, and
his Ph.D. degree from Norwegian University of Science
and Technology in 1997, all degrees in Signal Processing.
From 1989 to 1994, he was a Research Scientist in
SINTEF, Trondheim, Norway. From 1997 to 2002 he was a
Manager of the signal processing group of Fast Search and
Transfer, Oslo, Norway. Since 2002 he has been with Tandberg Telecom,
Oslo, Norway (now part of Cisco Systems) where he is currently a Principal
Engineer working with video compression technology.
Minhua Zhou received his B.E. degree in Electronic Engineering and M.E.
degree in Communication & Electronic Systems from Shanghai Jiao Tong
University, Shanghai, P.R. China, in 1987 and 1990, respectively. He
received his Ph.D. degree in Electronic Engineering from Technical
University Braunschweig, Germany, in 1997. He received the Rudolf-Urtel
Prize in 1997 from the German Society for Film and Television Technologies
in recognition of his Ph.D. thesis work on “Optimization of MPEG-2 Video
Encoding”.
From 1993 to 1998, he was a Researcher at Heinrich-Hertz-Institute (HHI)
Berlin, Germany. Since 1998, he has been with Texas Instruments Inc., where he is
currently a research manager of video coding technology. His research
interests include video compression, video pre- and post-processing, end-to-
end video quality, joint algorithm and architecture optimization, and 3D
video.