
Omnidirectional Media Format and Its Application to Immersive Video Streaming: An Overview

Yiling Xu, Shaowei Xie, Qiu Shen, Zhan Ma, Imed Bouazizi, and Ye-Kui Wang

Abstract—This article reviews the recent Omnidirectional MediA Format (OMAF) specification developed by the ISO/IEC Moving Picture Experts Group (MPEG), with a focus on enabling applications of 360° video, image, text and associated audio. Specifically, we survey the key features and the interfaces to existing file formats and transport protocols introduced in OMAF to enable rotation, projection, packing, delivery encapsulation and final consumption, as well as the media and presentation profiles for general interoperability. We also demonstrate the effectiveness of a perceptually optimized viewport-based immersive video streaming strategy under network bandwidth constraints, using our proposed analytical quality models.

Index Terms—Omnidirectional Media, Immersive Video, Viewport Adaptation, Perceptual Quality, Rate-Quality Optimization

I. INTRODUCTION

Media service for entertainment has evolved from conventional two-dimensional (2D) television displayed on a flat panel to recent 360° and immersive content navigated by wearing a head mounted display (HMD) [1], thanks to the advances of virtual reality (VR) technologies. Immersive media produced by a VR system, such as the HTC Vive or Oculus Rift, represents a virtualized space where the user can interact naturally as in the real world. A typical example of immersive media discussed in this context is omnidirectional media, covering the 360° viewing range with three degrees of freedom (DoF).

To ensure universal media access and interoperability for production, distribution, sharing and consumption [2], media is often encapsulated and signalled using standardized file formats and transport protocols. The Moving Picture Experts Group (MPEG), a working group of ISO/IEC, has developed a series of international standards to enforce the aforementioned media interoperability, including MPEG-2 Transport Stream (TS) [3], MPEG-4 Part 14 (MP4) [4], Dynamic Adaptive Streaming over HTTP (DASH) [5], MPEG Media Transport (MMT) [6], etc. Recently, driven by the rising demand for immersive media applications, MPEG has approved the MPEG-I work items (a.k.a. ISO/IEC 23090, coded representation of immersive media) focusing on specifications that enable future immersive applications. Among them, the second part of MPEG-I, which is detailed in this paper, specifies the

Y. Xu and S. Xie are with Shanghai Jiao Tong University ({yl.xu, sw.xie}@sjtu.edu.cn). Q. Shen and Z. Ma are with Nanjing University ({shenqiu, mazhan}@nju.edu.cn). I. Bouazizi is with Samsung Research America ([email protected]). Y.-K. Wang is with Huawei Central Research, Media Technologies Lab ([email protected]).

TABLE I
HISTORY OF OMAF STANDARDIZATION

Date       Status                               Location
Oct. 2015  OMAF Project Launch                  Geneva, CH
Feb. 2016  Technical Responses                  San Diego, USA
Jun. 2016  First Working Draft (WD)             Geneva, CH
Jan. 2017  Committee Draft (CD)                 Geneva, CH
Apr. 2017  Draft International Standard (DIS)   Hobart, AU
Oct. 2017  Final DIS (FDIS)                     Macau, CN

Omnidirectional MediA Format (OMAF) [7] to address the coding, encapsulation, presentation and consumption of omnidirectional video, image, text and associated audio, as well as the signaling and transport over DASH [5] and MMT [6].

A. Brief History of OMAF

To the best of our knowledge, OMAF is probably the first international standard on an immersive media format. Its associated standardization activities date back to October 2015 in Geneva, Switzerland, and concluded in October 2017 in Macau, China, with a final draft international standard (FDIS) awaiting ISO/IEC approval (see Table I). Tens of multimedia experts from over 30 participating parties worldwide, including corporations, research institutes and universities, have contributed to this standardization work. Extensions of OMAF are also expected in future studies to support more functionalities, such as 6DoF, interactive live services, etc. [2].

B. Scope and Organization of This Work

OMAF specifies the means to enable the consumption of received panoramic media captured by a set of cameras and microphones. Towards this goal, it defines interfaces that comply with existing standards, including the file format [8] and delivery signaling [9], [10], and associates the metadata carrying the information for rotation, projection, packing, orientation/viewport adaptation, etc.

We begin this article with an intuitive example in Section II, covering the life cycle from panoramic or omnidirectional video capturing, processing and delivery to presentation rendering. To support the exemplified omnidirectional media production and consumption, Section III first reviews the OMAF extensions to the ISO Base Media File Format, followed by the OMAF compliant media delivery over the popular DASH [11] and MMT [10] in Section IV. To ensure general interoperability, the OMAF associated media and presentation profiles are introduced in Section V. In Section VI, we present various viewport-dependent processing schemes for improving

Fig. 1. Life spans of an omnidirectional projected video from its production to the consumption.

the transmission performance of omnidirectional media, as well as the analysis of viewport-adaptive perceptual quality optimization. Finally, the conclusion is drawn in Section VII.

II. AN INTUITIVE EXAMPLE

Panoramic or omnidirectional¹ media can be captured by multiple cameras and audio sensors, as shown in Fig. 1. For the sake of simplicity, we emphasize the description of panoramic video. One way to produce panoramic video/images is using a parallel-camera array with many micro-lenses (or micro-cameras) [12]; another is the Fisheye configuration. All micro-cameras sense the environment simultaneously to produce synchronized consecutive images or video frames I(t), t = 0, 1, 2, ..., covering different fields of view (FoVs, or viewports²), which are stitched together [13]–[15] into a complete 360° spherical view. Basically, these micro-cameras are placed with sufficiently overlapped FoVs to extract keypoints (features) [16] for subsequent registration, calibration and blending. The spherically stitched video or images are then rotated (optionally), projected, packed (optionally), encoded and encapsulated for delivery over the network.

A spherical viewing environment, illustrated using the (X, Y, Z) coordinate axes shown in Fig. 2(a), allows the user to perform 3DoF interactions inside a virtualized space. This is often referred to as the global coordinate axes (GCA). The instantaneous viewport FoV(k) within I(t) at time-stamp t can be identified by a pair of sphere coordinates: azimuth (φ_k) and elevation (θ_k)³. This coincides with a distinct feature of our Human Visual System (HVS): only a viewport of the entire omnidirectional video is displayed at any particular time instant [17], [18], while in conventional video applications typically the entire video is displayed. This biologically inspired fact can be utilized to improve the performance of omnidirectional video systems through selective delivery depending on the user's FoV⁴ [20].

¹ Unless specified otherwise, we use "panoramic", "360°" and "omnidirectional" interchangeably throughout this work.

² We use viewport and FoV interchangeably in this work unless specified otherwise.

³ Note that φ ∈ [−180, 180) and θ ∈ [−90, 90].

⁴ More details regarding viewport or FoV dependent streaming can be found in Section VI of this work and Annex D of [19].

Intuitively, the performance improvement could come from lower transmission bandwidth and decoding complexity under the same resolution/quality of the video region presented to the user.

Referring to omnidirectional video delivery specifically, the network often exhibits highly dynamic channel conditions due to congestion, channel fading, etc., resulting in large bandwidth fluctuations. Typically, we would apply unequal quality scales across various sub-regions of the entire view. Intuitively, a higher quality representation is allocated to the sub-region corresponding to the current viewport (cf. Fig. 2(a)), but degraded-quality representations are used elsewhere. A higher quality representation often requires a higher bit rate [21], and vice versa. With such unequal quality allocation, the overall bandwidth requirement can be reduced significantly. One extreme case is to transmit only the content associated with the current FoV.

The instantaneous viewport can be parsed from the metadata fed back over the return channel for live services. For on-demand services, it is better to prepare multiple content copies in advance for later retrieval. These copies are capable of presenting different high-quality viewports, covering all potentially salient regions of interest to the users. The associated metadata signals can be found in Annexes E and F of the OMAF standard [19]. FoV(k) in Fig. 2(a) is exemplified as a salient region. Rotation [22] is then applied to adapt FoV(k), deviated from the equator in the GCA, to the equator in the local coordinate axes (LCA) shown in Fig. 2(b) for proper projection and efficient transmission. Correspondingly, an inverse rotation process, shown in Fig. 2(c), is invoked to transform the LCA back to the GCA according to the user's viewport orientation metadata, after retrieving the appropriate media copy. These inverse operations are achieved by adapting the rotation parameters, including 1) yaw (α_k), pitch (β_k) and roll (γ_k), all in units of degrees, with α_k and γ_k ∈ [−180, 180) and β_k ∈ [−90, 90], and 2) the sphere coordinates (φ_k, θ_k) relative to the LCA, to finally derive (φ′, θ′) relative to the GCA. Specific steps of the calculation can be found in Section 5 of [19].
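To make the LCA-to-GCA conversion concrete, the minimal Python sketch below composes yaw/pitch/roll rotation matrices and applies them to a point given in sphere coordinates. The Z-Y-X (yaw-pitch-roll) composition order and sign conventions are common choices assumed here for illustration; they are not a verbatim transcription of the normative equations in Section 5 of [19].

```python
import numpy as np

def sphere_to_unit(phi_deg, theta_deg):
    """Azimuth/elevation (degrees) to a unit vector on the sphere."""
    phi, theta = np.radians(phi_deg), np.radians(theta_deg)
    return np.array([np.cos(theta) * np.cos(phi),
                     np.cos(theta) * np.sin(phi),
                     np.sin(theta)])

def unit_to_sphere(v):
    """Unit vector back to azimuth/elevation in degrees."""
    phi = np.degrees(np.arctan2(v[1], v[0]))
    theta = np.degrees(np.arcsin(np.clip(v[2], -1.0, 1.0)))
    return phi, theta

def rotation_matrix(yaw_deg, pitch_deg, roll_deg):
    """Compose yaw (about Z), pitch (about Y) and roll (about X)."""
    a, b, g = np.radians([yaw_deg, pitch_deg, roll_deg])
    Rz = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
    Ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(g), -np.sin(g)],
                   [0, np.sin(g),  np.cos(g)]])
    return Rz @ Ry @ Rx

# Example: a point at the LCA equator mapped back to the GCA.
phi_gca, theta_gca = unit_to_sphere(
    rotation_matrix(yaw_deg=30, pitch_deg=45, roll_deg=0)
    @ sphere_to_unit(phi_deg=0, theta_deg=0))
print(phi_gca, theta_gca)
```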

Each rotated panoramic picture I(t) is then projected onto a 2D plane for better transmission. There are many projection schemes [23], but OMAF only suggests two of them, namely the equirectangular projection (ERP) and the cubemap projection (CMP),

Fig. 2. Illustration of the coordinate axes system and associated rotations: (a) a (X, Y, Z) based GCA with the current FoV(k) at (θ_k, φ_k); (b) GCA to LCA; (c) LCA to GCA via appropriate yaw, pitch and roll.

as specified in Section 5.2 of [19], because of their wide usage in practice. As shown in Fig. 3(a), the ERP projection process is close to how a world map is generated, but with the left-hand side being the east instead of the west, as the viewing perspective is opposite: in ERP, the user looks from the sphere center outward towards the inside surface of the sphere. Figure 3(b) draws the cube face arrangement of the cubemap projection format and the mapping of the cube faces onto the coordinate axes, where PX, NX, PY, NY, PZ and NZ denote the positive X, negative X, positive Y, negative Y, positive Z and negative Z directions, respectively.
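As an illustration of the ERP mapping, the short Python sketch below converts sphere coordinates (azimuth φ, elevation θ) to pixel positions of a W×H equirectangular picture. It is a minimal sketch of the standard equirectangular relation, not the exact sample-location formulas of [19].

```python
def erp_sphere_to_pixel(phi_deg, theta_deg, width, height):
    """Map azimuth [-180, 180) / elevation [-90, 90] to an ERP pixel (u, v).

    u grows rightward as azimuth decreases (phi = +180 at the left edge);
    v grows downward as elevation decreases (theta = +90 at the top).
    """
    u = (0.5 - phi_deg / 360.0) * width     # azimuth -> column
    v = (0.5 - theta_deg / 180.0) * height  # elevation -> row
    return u, v

# The picture center corresponds to (phi, theta) = (0, 0):
print(erp_sphere_to_pixel(0, 0, 5120, 2560))   # (2560.0, 1280.0)
```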

As aforementioned, we may apply different qualities across various sub-regions (or viewports) of the current projected picture I(t). Towards this goal, region-wise packing [24] and the corresponding region-wise quality ranking are preferred. Region-wise packing enables the manipulation (resampling, repositioning, etc.) of any sub-region in the projected picture before encoding; only rectangular region-wise packing operations are supported in the current phase of the OMAF specification [19]. Figure 4 illustrates a simple example of generating a packed picture through region-wise resampling and repositioning, e.g., native resolution for the "middle" sub-region close to the equator, but reduced resolutions for both the "top" and "bottom" sub-regions, within a projected picture. Each packed picture contains several sub-regions, among which only a specific region stays at the highest resolution (usually the original one). The packed picture may cover only a part of the entire sphere. Note that viewport dependent streaming can also be achieved via region-wise packing. The inverse packing process is described in Section 5.4 of the OMAF specification [19].
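The idea of Fig. 4 can be sketched in a few lines of Python: the ERP picture is split into top/middle/bottom bands, the top and bottom bands are downsampled, and the bands are repacked into a smaller picture. The band heights, scale factor and side-by-side arrangement below are illustrative assumptions, not values mandated by OMAF.

```python
import numpy as np

def pack_erp_regions(erp, band_rows, scale=2):
    """Toy rectangular region-wise packing for an ERP picture (H, W, C).

    The middle band keeps its native resolution; the top and bottom bands
    are downsampled by `scale` in both dimensions (nearest neighbour) and
    placed side by side below the middle band (one possible arrangement).
    """
    top = erp[:band_rows:scale, ::scale]
    middle = erp[band_rows:-band_rows]
    bottom = erp[-band_rows::scale, ::scale]
    packed_top_bottom = np.hstack([top, bottom])   # half width each
    return np.vstack([middle, packed_top_bottom])

erp = np.zeros((2560, 5120, 3), dtype=np.uint8)
packed = pack_erp_regions(erp, band_rows=640)
print(erp.shape, "->", packed.shape)   # (2560, 5120, 3) -> (1600, 5120, 3)
```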

In general, every packed video copy is stored as an independent track, with its dedicated projection, rotation, region-wise packing and other information encapsulated in the metadata (see Section III) for streaming and rendering. For example, an ERP rendered omnidirectional video can be processed into various packed copies according to the salient viewport/FoV dependent rotation, where the sub-regions close to the equator are encoded at the highest resolution (i.e., highest quality) while elsewhere is encoded at reduced resolution (i.e., reduced quality). Even for the same packed content, a variety of representations at different bit rates (or quality scales) would be required to enable adaptive streaming [6], [9]. Appropriate delivery (i.e., viewport selective, or the entire panoramic view) is then facilitated to transmit the content to the client for rendering.

An inverse processing chain is implemented according to the user's input (e.g., viewport/orientation, bandwidth, etc.) to retrieve the proper streams or sub-streams, as illustrated in Fig. 5, together with the data flow of capturing, processing and encapsulation. The projected or packed pictures (D) are encoded as coded images (E_i) or a coded video bitstream (E_v). The captured audio (B_a) is encoded as an audio bitstream (E_a). The interleaved media data are then encapsulated into a file for playback (F), or into a sequence of segments or media processing units (MPUs), complying with DASH or MMT respectively, for streaming (F_s) to an OMAF player. A file decapsulator processes the file (F′) or the received segments or MPUs (F′_s). It extracts the coded bitstreams (E′_a, E′_v and/or E′_i) and parses the metadata. The audio, video and/or images are then decoded into decoded signals (B′_a for audio, and D′ for images/video). The decoded pictures (D′) are projected onto the screen of an HMD or any other display device according to the viewport and orientation sensed by head/eye tracking devices. Likewise, the decoded audio (B′_a) is rendered through headphones or loudspeakers.

One special omnidirectional media example supported by OMAF is captured via one or more Fisheye lenses, where stitching and projection are intentionally skipped at the server side. The circular images captured by each Fisheye lens [25] are directly placed on a common video picture before encoding. Appropriate stitching, correction and rendering of the Fisheye captured media are realized at the client side. Compared with the parallel camera platform [12], Fisheye cameras offer a lower-cost way to enable the omnidirectional media experience. Details on Fisheye omnidirectional video can be found in Section 6 of the OMAF specification [19].

Referring to the aforementioned example, omnidirectional media extends conventional media with more interaction flexibility, but only a part of the 360° content coverage, i.e., the viewport or FoV, is effectively rendered to our vision system at any specific moment. Conventional media focuses on data processing and encapsulation in the time domain, whereas omnidirectional media exhibits not only temporal correlation, but also spatial correlation among viewports in a spherical space or a mapped 2D plane. It requires additional messages, such as the projection format, rotation structure, sphere coverage, region-wise packing and quality ranking, as well as the user's viewport/orientation signaling, to precisely describe the content and its associated properties (e.g., positions, quality ranking, delivery model, etc.), in response to the user's feedback. Towards this goal, OMAF defines the interfaces to augment the existing media file format (see Section III) and transport protocols (see Section IV), as well as the metadata to facilitate 3DoF navigation and selective viewport delivery (see Section VI). More details are unfolded in the following sections.

III. OMAF EXTENSION OF ISO BMFF FILE FORMAT

The most widely used file format for media exchange, management, editing and presentation is the ISO Base Media File Format (ISOBMFF) [8]. Files complying with the ISOBMFF consist of a series of flexible and extensible "boxes", within which data is self-contained. Each box is referred to by its type, normally four printable characters. Figure 6(a) shows the typical box hierarchy, where representative boxes, "ftyp" (file type box), "moov" (movie box), "trak" (track box) and "stbl" (sample table box), are highlighted. Herein, "moov" is often used to contain the data associated with the media. Readers are referred to the ISO/IEC 14496-12 specification [8] for more details.
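To illustrate this box structure, the Python sketch below walks the top-level boxes of an ISOBMFF file by reading the standard 8-byte box header (32-bit size plus a four-character type) and the 64-bit extended size when size equals 1. It only lists box types and sizes, does not interpret any OMAF-specific boxes, and the file name in the usage comment is hypothetical.

```python
import struct

def list_top_level_boxes(path):
    """Yield (box_type, size) for the top-level boxes of an ISOBMFF file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            if size == 1:                       # 64-bit largesize follows
                size = struct.unpack(">Q", f.read(8))[0]
                payload = size - 16
            elif size == 0:                     # box extends to end of file
                yield box_type.decode("ascii", "replace"), None
                break
            else:
                payload = size - 8
            yield box_type.decode("ascii", "replace"), size
            f.seek(payload, 1)                  # skip the box payload

# Typical output: ('ftyp', 32), ('moov', ...), ('mdat', ...)
# for t, s in list_top_level_boxes("omaf_clip.mp4"): print(t, s)
```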

Note that media data and metadata are stored separately in ISOBMFF, as illustrated in Fig. 6(b). The media data typically comprises the coded video, audio, etc., organized in access units or samples. Conventional metadata includes the media type, codec properties, timestamps, sample locations and sizes, random access indications and so on [8]. On top of ISOBMFF, OMAF specifies additional metadata as the interfaces to facilitate omnidirectional content authoring and consumption, covering rotation, projection, packing, encoding, delivery and presentation rendering. The main extensions include:

Fig. 3. Omnidirectional projection formats: (a) equirectangular projection (ERP), (b) cubemap projection (CMP).

• Projection format structure, indicating the mapping type of the projected picture onto the spherical coordinate system, contained in the projection format box ("prfr") in the sample entry, where its values correspond to different projection formats.

• Region-wise packing structure, specifying the packing process that loops over all regions, each of which has 1) a flag indicating the presence of guard bands for the packed region, 2) the packing type (as aforementioned, only rectangular region-wise packing is specified at this moment), 3) the mapping between a projected region and the respective packed region in the rectangular region packing structure, and 4) the guard band structure for the packed region, when guard bands are present.

• Rotation structure, providing the yaw, pitch and roll angles of the rotation that is applied to the unit sphere to convert the local coordinate axes to the global coordinate axes⁵, as shown in Fig. 2(c).

• Content coverage structure, providing the content coverage information, expressed by one or more sphere regions covered by the content relative to the global coordinate axes, and specifying the view type of every region, i.e., monoscopic, or the left or/and right view(s) of stereoscopic content.

• Sphere region structure, supporting the regionalization of omnidirectional videos; a region can be specified by four great circles defined by the four points cAzimuth1, cAzimuth2, cElevation1, cElevation2 and the centre point (centreAzimuth, centreElevation) (Fig. 7(a)), or by two azimuth circles and two elevation circles defined by the same five points (Fig. 7(b)); see also the sketch after this list.

• Region-wise quality ranking, indicating the relative quality of regions on the sphere (in the sphere region quality ranking box 'srqr') or in the 2D picture domain (in the 2D region quality ranking box '2dqr'), with a lower ranking value corresponding to a higher quality; an OMAF player can extract tracks with a lower ranking value covering the current viewport, but a higher ranking value elsewhere⁶.

• Timed metadata for sphere regions, indicating sphere regions and the purpose of the timed metadata tracks, such as initial viewing orientation, recommended viewport and timed text sphere location metadata. The initial viewing orientation (identified by the sample entry type 'invo') suggests a particular viewing orientation when randomly accessing a point in the stream, and whether it is still recommended even when playing the video continuously. The recommended viewport (identified by the sample entry type 'rcvp') indicates that a particular viewport trace is recommended for playback, based on the director's cut for best VR storytelling or based on statistical measures. The timed text sphere location metadata (identified by the sample entry type 'ttsl') signals the information for determining the position of the rendering plane in 3D space in a timely dynamic fashion.
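As a companion to the sphere region and quality ranking structures above, the Python sketch below checks whether a viewport center (azimuth, elevation) falls inside a sphere region given by its center and azimuth/elevation ranges. It uses the simpler two-azimuth-circle/two-elevation-circle interpretation of Fig. 7(b) and is an illustrative approximation, not the normative derivation in [19].

```python
def in_sphere_region(view_az, view_el, centre_az, centre_el,
                     az_range, el_range):
    """True if (view_az, view_el) lies in the region centred at
    (centre_az, centre_el) spanning az_range x el_range degrees,
    interpreted as bounded by two azimuth and two elevation circles."""
    d_az = (view_az - centre_az + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    d_el = view_el - centre_el
    return abs(d_az) <= az_range / 2.0 and abs(d_el) <= el_range / 2.0

# A viewport looking at (100, 10) is inside a 90 x 60 degree region at (90, 0):
print(in_sphere_region(100, 10, 90, 0, 90, 60))   # True
```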

⁵ In the case of stereoscopic omnidirectional video, the fields apply to each view individually.

⁶ It should be noted that the boundaries of the quality ranking sphere or 2D regions may or may not match the boundaries of the packed regions or the boundaries of the projected regions.

Fig. 4. An example of rectangular region-wise packing with the highest (native) resolution in the middle and reduced resolution at the top and bottom parts.

Fig. 5. OMAF data flow process for omnidirectional media from capturing, processing, encapsulation and consumption (inverse process) at the client.

Fig. 6. ISOBMFF compliant structure: (a) hierarchical structure of boxes, (b) an encapsulated media file.


Meanwhile, Fig. 8 gives an overview of the OMAF extensions to the ISOBMFF. The OMAF associated supplementary metadata is mainly contained in the SampleTableBox ('stbl') (see Fig. 6(a)), which comprises all the time and data indexing information of the media samples in a track.

IV. OMAF COMPLIANT MEDIA DELIVERY

The aforementioned OMAF extensions to the ISOBMFF ensure the flexible aggregation of important information for easy access and inclusion in a transport manifest, which is vital for media delivery and consumption over the network. Therefore, in this section we introduce the OMAF extensions to the popular DASH and MMT, respectively.

A. OMAF over DASH

DASH [9] supports media streaming delivery with exclusive control at the client. It is based on a hierarchical data model [11] aligned with the presentation in Fig. 9(a). Media content is composed of a single or multiple contiguous media content Periods in time, and each media content period is composed of one or multiple media content components encoded at various quality scales or bitrates, which are adapted to the dynamics induced by network conditions or other factors. The collection of encoded and deliverable versions of the media content and the appropriate descriptions forms a media Presentation, where the description is contained in a media presentation description (MPD) document as the delivery manifest.

Figure 9(b) illustrates the OMAF compliant content delivery flow over DASH. An MPD (G) is generated based on the segments (F_s) and other associated files. The newly defined omnidirectional media-specific descriptors (a.k.a. metadata) in the MPD include the projection format (PF) descriptor, region-wise packing (RWPK) descriptor, content coverage (CC) descriptor, spherical region-wise quality ranking (SRQR) descriptor, 2D region-wise quality ranking (2DQR) descriptor and fisheye omnidirectional video (FOMV) descriptor. These descriptors may be generated according to the information embedded in the segments, as discussed in Section III.

Fig. 7. The shape of the sphere regions expressing the content coverage: (a) specified by four great circles, (b) specified by two azimuth circles and two elevation circles.

Fig. 8. The overview of OMAF extensions of the ISOBMFF.

Fig. 9. OMAF over DASH: (a) the hierarchical data model in DASH, where the structured media content incorporates periods, adaptation sets, representations and segments, (b) illustration of the OMAF content delivery over DASH.

The server typically provides the segments (F_s) and the corresponding MPD (G) over the network to the client. The received packets (H′) are interpreted to reproduce the segments (F′_s) for rendering. A DASH client obtains the current viewing orientation or viewport, e.g., from the HMD that detects head and possibly also eye movements. By parsing the associated metadata, e.g., projection and region-wise packing, from the

Fig. 10. OMAF over MMT: (a) logical data model of MMT, (b) reference architecture of OMAF compliant media delivery using MMT.

MPD, the DASH client selectively chooses the appropriate Adaptation Set and Representation with the complying format, covering the current viewing orientation at the highest quality and at a bitrate that can be sustained by the underlying network.
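A minimal Python sketch of this client-side selection logic is given below. The representation fields (bandwidth, high-quality region center) are simplified placeholders for the SRQR/CC information carried in the MPD, not the actual MPD schema.

```python
from dataclasses import dataclass

@dataclass
class Representation:
    rep_id: str
    bandwidth: int        # bits per second, from the MPD
    hq_center_az: float   # azimuth of the high-quality (low ranking) region
    hq_center_el: float   # elevation of the high-quality region

def angular_distance(az1, el1, az2, el2):
    """Coarse angular distance between two orientations (degrees)."""
    d_az = abs((az1 - az2 + 180.0) % 360.0 - 180.0)
    return d_az + abs(el1 - el2)

def select_representation(reps, view_az, view_el, available_bps):
    """Pick a representation that fits the channel and whose high-quality
    region best matches the current viewport (highest bitrate on ties)."""
    feasible = [r for r in reps if r.bandwidth <= available_bps]
    if not feasible:
        feasible = [min(reps, key=lambda r: r.bandwidth)]   # fallback
    return min(feasible, key=lambda r: (angular_distance(
        r.hq_center_az, r.hq_center_el, view_az, view_el), -r.bandwidth))

reps = [Representation("front_hq", 8_000_000, 0, 0),
        Representation("back_hq", 8_000_000, 180, 0),
        Representation("front_lq", 3_000_000, 0, 0)]
print(select_representation(reps, view_az=10, view_el=5,
                            available_bps=5_000_000).rep_id)  # front_lq
```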

B. OMAF over MMT

MMT [10] is designed for the transport and delivery of coded media data for multimedia services over heterogeneous packet-switched networks, including Internet Protocol (IP) networks and digital broadcasting networks. The logical data model assumed for the operation of the MMT protocol is introduced in Fig. 10(a); it is preserved during delivery by maintaining the structural relationships among MPUs, Assets and Packages using signalling messages. Specifically, the collection of the encoded media data and its related metadata builds a Package, which may be delivered from one or more MMT sending entities to the MMT receiving entities [26]. Each piece of encoded media data of a Package, such as a piece of audio or video content, constitutes an Asset. Each MPU, in turn, constitutes a non-overlapping piece of an Asset and may be consumed independently by the presentation engine of the MMT receiving entity.

Figure 10(b) depicts the reference architecture for OMAF content delivery over MMT. The OMAF content may be described in asset delivery characteristics (ADC) to assist the MMT sending entity during the streaming process. The composition information (CI) and presentation information (PI) should contain information to ensure that MPUs conform to OMAF, enabling appropriate processing in a particular application [27]. MMT delivery may use the MMT protocol (MMTP) over UDP, TCP, or other alternatives. The player


receives information about the current viewing direction or viewport from the HMD. Viewport-dependent streaming can be achieved in MMT with either a client-based or a server-based approach.

Similar to the descriptors defined for DASH, MMT introduces a new asset descriptor called VR Information, which describes the projection type that is used, how the VR content is region-wise packed, what areas on the sphere it covers, and so on. The VR Information descriptor shall be present in all assets that carry OMAF formatted content. Besides, an application-specific signalling message is also designed to allow the delivery of application-specific information, as discussed in Annex F of [19].
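For the server-based viewport-dependent approach, the client periodically reports its viewing orientation back to the MMT sending entity. The sketch below serializes a hypothetical viewport feedback payload; the field layout is purely illustrative and does not reproduce the application-specific signalling message format defined in Annex F of [19].

```python
import struct, time

def pack_viewport_feedback(azimuth_deg, elevation_deg, asset_id):
    """Illustrative (non-normative) viewport feedback payload:
    asset id, millisecond timestamp, and orientation in centi-degrees."""
    return struct.pack(">IQii",
                       asset_id,
                       int(time.time() * 1000),
                       int(azimuth_deg * 100),
                       int(elevation_deg * 100))

payload = pack_viewport_feedback(azimuth_deg=-35.5, elevation_deg=12.0, asset_id=1)
print(len(payload))   # 20 bytes, to be carried in an application-specific message
```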

V. MEDIA AND PRESENTATION PROFILES

Profiles are defined to ensure the interoperability between different clients.

A. Media profiles

A media profile for timed media (or static media) is defined as requirements and constraints for a set of one or more ISOBMFF tracks (or items) of a single media type. The conformance of these tracks (or items) to a media profile is specified as a combination of:

• Specification of which sample entry (or item) type(s) are allowed, and which constraints and extensions are required in addition to those imposed by the sample entry (or item) type(s).

• Constraints on the samples of the tracks (or the content of the items), typically expressed as constraints on the elementary stream contained within the samples of the tracks (or within the items).

OMAF specifies 9 media profiles, including 3 video profiles, 2 audio profiles, 2 image profiles, and 2 timed text profiles.

Focusing on video content, the media profiles defined in OMAF are named the HEVC-based viewport-independent (HEVC VI), HEVC-based viewport-dependent (HEVC VD), and AVC-based viewport-dependent (AVC VD) OMAF video profiles, respectively. In the HEVC VI case, regular HEVC encoders, DASH/MMT packagers, DASH/MMT clients, file format parsers and HEVC decoder engines can be used for encoding, distribution and decoding, which means they do not need special features for handling viewport-dependent schemes. In contrast, the HEVC VD and AVC VD OMAF video profiles allow unconstrained use of rectangular region-wise packing, where the resolution or quality of the omnidirectional video can be emphasized in certain regions, e.g., according to the user's viewing orientation. In addition, it is possible to use extractors and obtain a conforming HEVC (or AVC) bitstream when tile-based (or slice-based) streaming is applied.

Accompanying the video encoding, supplemental enhancement information (SEI) messages [28] are embedded in the elementary bitstreams and can assist processes related to encapsulation, decoding, display or other purposes. Applicable to the video profiles, there are OMAF constraints depending on the presence of omnidirectional video SEI messages. For instance, when the bitstream contains a region-wise packing SEI message applying to a picture, RegionWisePackingBox shall be present in the sample entry applying to the sample containing the picture. And when present, RegionWisePackingBox shall signal the same information as in the region-wise packing SEI message(s).

Informatively, OMAF players conforming to the HEVC VI profile are expected to process either all referenced SEI messages or all allowed boxes within the SchemeInformationBox for the equirectangular projected video ('erpv') scheme type, while players conforming to the HEVC VD or AVC VD profile are expected to do so for the 'erpv' and packed equirectangular or cubemap projected video ('ercm') scheme types. When playing a file that contains initial viewing orientation metadata, OMAF players are expected to parse the initial viewing orientation metadata track associated with a media track and obey it when rendering the media track. Besides, for the HEVC VD or AVC VD profile, a player is also expected to parse both SphereRegionQualityRankingBox and 2DRegionQualityRankingBox, when present, of the tracks in a file and select the track that matches the user's viewing orientation, together with RegionWisePackingBox when necessary.

B. Presentation profiles

A presentation profile is defined as requirements and constraints for an OMAF file, containing tracks or items of any number of media types. More precisely, the specification of a presentation profile should refer to the specified media profiles and may include additional requirements or constraints. A file conforming to a presentation profile typically provides an omnidirectional audio-visual experience.

OMAF specifies two presentation profiles, i.e., the OMAF viewport-independent baseline presentation profile ('ompp') and the OMAF viewport-dependent baseline presentation profile ('ovdp'). As the names suggest, the 'ompp' profile requires neither viewport-dependent delivery nor viewport-dependent decoding, and it is intended to provide the highest interoperability and quality on HMDs (including mobile-powered HMDs [29]). Comparatively, the 'ovdp' profile requires viewport-dependent delivery and rendering, with the purpose of providing interoperability and quality on HMDs beyond the viewport resolution achievable with 'ompp'.

VI. VIEWPORT-DEPENDENT OMNIDIRECTIONAL VIDEO STREAMING

In practice, viewport adaptive streaming is highly preferred for omnidirectional video due to its lower bandwidth and decoding capability demands. Ideally, only the coded video data covering the current viewport needs to be streamed. However, this would cause a blackout when the user turns his/her head quickly to a new viewport for which the corresponding coded video data is not available. Therefore, this method could only work if the network round trip time (i.e., the end-to-end

Fig. 11. An example of four tracks, each of which presents a 360° coverage ERP frame, but with different high quality encoded regions (marked as "HQ").

latency) is extremely low, e.g., on the order of 10 milliseconds, which is not feasible, or at least a big challenge, today or in the near future. Thus, OMAF provides several informative methods for viewport-dependent operation, where each entire omnidirectional video is transmitted at various resolutions or qualities across different sub-regions (or sub-pictures). Please refer to the OMAF specification for more details [19].

A. Region-wise Quality Ranked Encoding

Multiple coded single-layer bitstreams corresponding to the same panoramic content are cached in different tracks at a server, each presenting an entire omnidirectional video frame with a different high quality (HQ) encoded region [30], as shown in Fig. 11. These HQ regions possibly coincide with the salient areas that are signaled via the region-wise quality ranking metadata (defined for 2D regions or sphere regions). The appropriate track (for the entire panoramic content), whose interpreted metadata matches the user's current viewport or orientation, is selectively transmitted and rendered.

B. Motion-constrained Sub-picture Encoding

Unequal qualities across various sub-regions can also be achieved by using motion-constrained sub-picture tracks, including motion-constrained tile sets (MCTS) of HEVC [31], [32], motion-constrained slice sets (MCSS) of AVC [33], and sub-picture-based content authoring [34], [35].

Towards this purpose, pictures are first spatially partitioned into sub-pictures, as either tiles or slices with motion compensation constrained therein, to allow the OMAF system to merge individual sub-picture tracks using a corresponding extractor track, based on the user's current or predicted viewport [36], [37].

We can naturally achieve parallel processing with the motion-constrained sub-picture strategy. To further reduce

Fig. 12. An example of parallel processing of HEVC MCTS-based sub-picture tracks at the same resolution.

Fig. 13. An example of region-wise packing of HEVC MCTS-based sub-picture tracks at different resolutions.

the bandwidth consumption, region-wise packing can be augmented as well. For the sake of simplicity, we briefly discuss examples using the HEVC MCTS scheme; similar ideas can be extended to MCSS as well.

Figure 12 presents an example of how sub-picture tracks of the same resolution can be used for tile-based omnidirectional video streaming [38]. A 4 × 2 tile grid is used to construct the motion-constrained tile sets, and two HEVC bitstreams of the same source content are encoded at different picture qualities and bitrates. Each motion-constrained tile set sequence is included in one sub-picture track, and an extractor track is also created. The OMAF player receives sub-picture tracks 1, 2, 5 and 6 at a particular quality and sub-picture tracks 3, 4, 7 and 8 at another quality, according to the viewing orientation. The extractor track is used to reconstruct a bitstream that can be decoded with a single HEVC decoder.
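The following Python sketch mimics the tile selection of Fig. 12 for a 4 × 2 grid over an ERP picture: tiles whose azimuth range overlaps the current viewport are requested at high quality and the remaining tiles at low quality. The tile indexing and the 90° viewport width are illustrative assumptions.

```python
def select_tile_qualities(view_az_deg, hfov_deg=90.0, cols=4, rows=2):
    """Map each tile of a cols x rows ERP grid to 'HQ' or 'LQ' depending on
    whether its azimuth range overlaps the viewport [view_az - hfov/2,
    view_az + hfov/2]. Tiles are numbered 1..cols*rows, row-major."""
    tile_width = 360.0 / cols
    half = hfov_deg / 2.0
    quality = {}
    for row in range(rows):
        for col in range(cols):
            tile_id = row * cols + col + 1
            # Tile azimuth span, with the left edge of tile 1 at +180 degrees.
            left = 180.0 - col * tile_width
            centre = left - tile_width / 2.0
            d_az = abs((centre - view_az_deg + 180.0) % 360.0 - 180.0)
            overlaps = d_az <= half + tile_width / 2.0
            quality[tile_id] = "HQ" if overlaps else "LQ"
    return quality

# Looking straight ahead (azimuth 0) selects the two middle columns in HQ:
print(select_tile_qualities(0.0))
# {1: 'LQ', 2: 'HQ', 3: 'HQ', 4: 'LQ', 5: 'LQ', 6: 'HQ', 7: 'HQ', 8: 'LQ'}
```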

Video resolution keeps growing, from millions of pixels to gigapixels, and from 4K to 8K (or even more). Even with the aforementioned MCTS, a large amount of network bandwidth is still required to sustain a stable service. Leveraging the natural instincts of our HVS, we can heavily degrade the regions deviated from the current FoV, and even the periphery of the same FoV [17], [18]. Therefore, we can augment region-wise packing on top of the MCTS.

An example of achieving a 5K effective ERP resolution with the HEVC-based viewport-dependent OMAF video profile is presented in Fig. 13. The content is first encoded at two spatial resolutions, i.e., 5120 × 2560 and 2560 × 1280, with 4 × 2 and 2 × 2 tile grids to align the inter-resolution MCTSs, respectively. An MCTS is then coded for each tile position, and two different sets of the low-resolution content are encoded, differentiated by a 90° yaw difference in the rotation angles. Consequently, each range of viewing orientations results in a different selection of high- and low-resolution MCTSs, which have no overlap on the sphere. An extractor track is created for each distinct viewport-adaptive MCTS selection.

C. Two-Layer Encoding

This scheme transmits an entire low-resolution or low-quality omnidirectional video stream (base layer) independently, while also using a viewport-dependent high-resolution or high-quality

Fig. 14. A two-layer strategy with the enhancement layer consisting of HEVC MCTS sub-picture bitstreams.

spatially partitioned stream (enhancement layer) for superimposed presentation.

Note that the enhancement layer can be generated through the same motion-constrained sub-picture strategy as discussed in Section VI-B, with spatially non-overlapping sub-picture tracks [39]. Figure 14 presents an example where the high-resolution HEVC bitstream uses a 4 × 2 tile grid for the motion-constrained tile sets and a low-resolution HEVC or AVC bitstream serves as the base layer. The presented extractor track for high-resolution MCTSs 3, 4, 7 and 8 covers certain viewing orientations. Two decoders are used in this example, one for the low-resolution omnidirectional AVC or HEVC bitstream and the other for the high-resolution extractor track. Such a two-layer scheme intuitively supports the region-wise packing mode as well.

D. Viewport-Adaptive Quality Optimized Streaming

The viewport-dependent streaming strategies for omnidirectional video suggested in OMAF mainly set the content within the current FoV at high quality (HQ, or native quality) but at reduced quality (RQ) elsewhere. Although this can significantly save transmission bandwidth, it typically involves a refinement from an RQ version to the corresponding HQ one when the user navigates his/her focus to a new FoV. Under these circumstances, the user experience intuitively depends on the gap between the quality scales (usually determined by the associated quantization stepsize q or/and spatial resolution s) across two consecutive FoVs, and on the refinement duration τ (i.e., how long the reduced quality lasts), as shown in Fig. 15.

Although the perceptual impact introduced by quality variations where both q and s are applied is inseparable [40],

Fig. 15. Illustration of the viewport adaptation for omnidirectional video streaming. Different color shades represent various quality levels of the current FoV: the darker blue means the original high-quality version, while the lighter one the reduced-quality copy.

we still assume a separable response of the q-impact and s-impact on the perceptual quality with respect to τ, in order to reduce the model complexity. We then perform subjective quality assessments to collect the mean opinion scores (MOSs) for various reduced-quality scales (i.e., different q or s) and various τ, when the quality is refined towards the highest one. In addition to obtaining the normalized quality Q of the q-impact or s-impact with respect to τ (denoted as Normalized Quality versus Quantization, NQQ, and Normalized Quality versus Spatial Resolution, NQS, respectively), we finally derive the overall analytical model Q(τ, q, s) via extensive subjective assessments and cross-validation [39], for evaluating the perceptual quality when adapting the FoV in immersive video applications, i.e.,

    Q(\tau, q, s) = Q_{\max} \cdot Q_{NQQ}(\tau, q) \cdot Q_{NQS}(\tau, s),    (1)

where

    Q_{NQQ}(\tau, q) = a(q) \cdot e^{-b(q) \cdot \tau} + (1 - a(q)),    (2)

    Q_{NQS}(\tau, s) = a(s) \cdot e^{-b(s) \cdot \tau} + (1 - a(s)),    (3)

    a(q), b(q) = \frac{k_1}{1 + k_2 \cdot q^{k_3}},    (4)

    a(s), b(s) = k_1 \cdot e^{-k_2 \cdot s} + k_3,    (5)

with all fixed parameters shown in Table II. Here, q = q_{\min}/q and s = s/s_{\max} are the normalized q and s, respectively. In practice, q_{\min} = 8, corresponding to quantization parameter (QP) 22, and s_{\max} is the native spatial resolution of the entire omnidirectional video.

Nevertheless, a smaller quality gap means a higher transmission bitrate and more data exchange, which in turn results in a longer τ. Therefore, a fundamental optimization can be resolved analytically with the model to determine an appropriate quality gap that balances the trade-off between the bandwidth requirement and the refinement duration. Ideally, the optimization can be formulated as

    \max_{\tau, q, s} Q,    (6)

    s.t. \; R_i^{FoV} + R_i^{RL_m} \le B,    (7)

    0 < q, s \le 1,    (8)

where

    \tau = \frac{R_i^{FoV} + R_i^{RL_m}}{B} \cdot T,    (9)

    R_i^{FoV}, R_i^{RL_m} = R(q, s).    (10)

R_i^{FoV} is the bitrate of the current FoV at the highest quality for the i-th segment in the time dimension, and R_i^{RL_m} is the bitrate of the auxiliary reduced-quality layer (RL) corresponding to the i-th segment coded at the m-th quality level. In addition, B and T denote the constrained streaming bandwidth and the minimum duration of a segment that is sufficient for decoding and rendering, respectively. Furthermore, R(q, s) is the rate model for compressed video considering the impacts of quantization stepsize and spatial resolution [41].

TABLE II
FIXED PARAMETERS FOR a AND b.

        k1      k2      k3
a(q)    0.8     39.55   2.73
b(q)    1.45    47.14   3.29
a(s)    0.8     4.65    0
b(s)    4.53    0.3     -3.37
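The analytical model (1)-(5) and the constrained search (6)-(10) can be prototyped in a few lines of Python, as sketched below with the Table II parameters. The rate model R(q, s) is a placeholder power-law stand-in for the model of [41] (whose exact form and parameters are not reproduced here), so the numbers it produces are purely illustrative.

```python
import math

# Table II parameters for a(.) and b(.): (k1, k2, k3)
K = {"a_q": (0.8, 39.55, 2.73), "b_q": (1.45, 47.14, 3.29),
     "a_s": (0.8, 4.65, 0.0),   "b_s": (4.53, 0.3, -3.37)}

def a_b_of_q(q):                       # Eq. (4), q normalized as q_min / q
    return tuple(k1 / (1 + k2 * q ** k3) for (k1, k2, k3) in (K["a_q"], K["b_q"]))

def a_b_of_s(s):                       # Eq. (5), s normalized as s / s_max
    return tuple(k1 * math.exp(-k2 * s) + k3 for (k1, k2, k3) in (K["a_s"], K["b_s"]))

def quality(tau, q, s, q_max=1.0):     # Eqs. (1)-(3)
    a_q, b_q = a_b_of_q(q)
    a_s, b_s = a_b_of_s(s)
    return (q_max * (a_q * math.exp(-b_q * tau) + 1 - a_q)
                  * (a_s * math.exp(-b_s * tau) + 1 - a_s))

def rate_placeholder(q, s, r_max=16e6):
    """Illustrative stand-in for R(q, s) of [41]: bitrate grows with the
    normalized quality scale q and the normalized resolution s."""
    return r_max * (q ** 1.2) * (s ** 0.9)

def optimize(bandwidth, T, r_fov, q_grid, s_grid):
    """Grid search over Eqs. (6)-(10): pick the RL scales maximizing Q."""
    best = None
    for q in q_grid:
        for s in s_grid:
            r_rl = rate_placeholder(q, s)
            if r_fov + r_rl > bandwidth:           # constraint (7)
                continue
            tau = (r_fov + r_rl) / bandwidth * T   # Eq. (9)
            cand = (quality(tau, q, s), q, s, tau)
            best = max(best, cand) if best else cand
    return best

q_grid = [q / 20 for q in range(1, 21)]            # normalized q in (0, 1]
print(optimize(bandwidth=12e6, T=1.0, r_fov=6e6,
               q_grid=q_grid, s_grid=[1 / 16, 1 / 4, 1.0]))
```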

Fig. 16. Examples to demonstrate the effectiveness of viewport-adaptive quality optimized streaming, applying our proposed model to the immersive video "Balboa": (a) illustration of the effect of T on the bandwidth-quality optimization, (b) optimal q_opt (continuous), s_opt (discrete), and Q_opt versus B; the initialization duration T of a segment is set to 5 seconds.


The initialization duration T may take different values for different transport protocols, e.g., DASH and MMT. To understand the impact of various T on the bandwidth-quality optimization, we use the 360° test sequence "Balboa" [42] as an example and plot the normalized maximum perceptual quality Q_opt against B under different T, along with the corresponding q_opt (the optimal q), as shown in Fig. 16(a). It can be seen that as T decreases, the optimal normalized quality Q improves at the same B, i.e., Q(T_1) > Q(T_2) for T_1 < T_2. Note that if T is small enough, e.g., 0.5 s, then when B is below a threshold the optimal quantization stepsize q_opt stays at 160 (the highest value in a practical codec), which means that the best strategy for adaptive FoV streaming in this case is to choose the RL with the worst quality to minimize the transmission bitrate; Q_opt then increases monotonically as B increases.

Meanwhile, Fig. 16(b) shows the optimal solutions versus a given B. Due to the assumption that s can only vary among discrete values (i.e., 1, 1/4 and 1/16), the corresponding q_opt does not decrease monotonically with B. Instead, q_opt jumps to meet the rate constraint whenever s_opt steps to the next higher value, and then decreases as B increases while s_opt stays unchanged. Because of the limited options for s and its interference with q, when q reaches q_min while s is still below 1, s will jump to the next level as B continues to increase. More interestingly, when s stays at a smaller value, q decreases much faster and Q improves accordingly.

Consequently, for viewport-adaptive perceptually optimized streaming, in addition to delivering the sub-picture tracks covering the whole scene at their respectively optimized quality scales, we can prepare multiple content copies applying region-wise packing in advance, i.e., selecting an optimal allocation of the quality scales across different regions, indicated with 'srqr' or '2dqr' in OMAF, under a limited bandwidth.

VII. CONCLUSION

In this paper, we introduced the emerging standardization of the file format for omnidirectional media, i.e., OMAF, along with considerations of content authoring and the streaming procedure. The main processes for generating an omnidirectional video for streaming usually include stitching (onto a spherical surface), rotation, projection (onto a 2D plane), region-wise packing, encapsulation and so on. OMAF defines appropriate interfaces and metadata extending the existing ISOBMFF for the file format, and the existing DASH and MMT for transport. In addition to streaming entire omnidirectional videos, OMAF supports viewport-dependent transmission schemes to improve the system performance. A series of examples of viewport dependent streaming was illustrated, and a perceptual quality optimization was given by leveraging the subjective quality models developed for omnidirectional video.

So far, OMAF has focused on omnidirectional media with 3DoF. Going forward, extra functionalities would be augmented, such as 3DoF+ (i.e., 3DoF with depth), 6DoF, and multi-user interaction, for an extension of OMAF in future studies.

ACKNOWLEDGMENT

The authors would like to thank all the MPEG experts who jointly produced the OMAF standard.

REFERENCES

[1] P. Dempsey, “The teardown: HTC Vive VR headset,” Engineering & Technology, vol. 11, no. 7-8, pp. 80–81, 2016.

[2] C. Timmerer, “Immersive media delivery: Overview of ongoing standardization activities,” IEEE Communications Standards Magazine, vol. 1, no. 4, pp. 71–74, 2017.

[3] MPEG-2 Transport Stream, “ITU-T Rec. H.222.0 | ISO/IEC 13818-1: Generic coding of moving pictures and associated audio information, Part 1: Systems,” ITU-T, Oct. 2014.

[4] D. Singer, W. Belknap, and G. Franceschini (Editors), “ISO/IEC 14496-14: Information technology, Coding of audio-visual objects, Part 14: MP4 file format,” Final Draft International Standard (FDIS), Feb. 2004.

[5] T. Stockhammer, “Dynamic adaptive streaming over HTTP: Standards and design principles,” in Proc. ACM Multimedia Systems (MMSys’11), Feb. 2011, pp. 133–144.

[6] J. Y. Lee, K. Park, Y. Lim, S. Aoki, and G. Fernando, “MMT: An emerging MPEG standard for multimedia delivery over the internet,” IEEE MultiMedia, vol. 20, no. 1, pp. 80–85, 2013.

[7] R. Skupin, Y. Sanchez, Y.-K. Wang, M. M. Hannuksela, J. Boyce, and M. Wien, “Standardization status of 360 degree video coding and delivery,” in IEEE Int. Conf. Visual Communications and Image Processing (VCIP’17), Dec. 2017, pp. 1–4.

[8] “ISO/IEC 14496-12: Information technology, Coding of audio-visual objects, Part 12: ISO base media file format,” Fourth edition, Jul. 2012.

[9] “ISO/IEC 23009-1: Information technology, Dynamic adaptive streaming over HTTP (DASH), Part 1: Media presentation description and segment formats,” Second edition, May 2014.

[10] K. P. (Editor), “ISO/IEC 23008-1: Information technology, High efficiency coding and media delivery in heterogeneous environments, Part 1: MPEG media transport (MMT),” Second edition, Jun. 2015.

[11] S. Xie, Y. Xu, and Z. Li, “DASH sub-representation with temporal QoE driven layering,” in IEEE Int. Conf. Multimedia & Expo Workshops (ICMEW’16), Jul. 2016, pp. 1–6.

[12] D. J. Brady, W. Pang, H. Li, Z. Ma, T. Yue, and X. Cao, “Parallel cameras,” Optica, vol. 5, no. 2, pp. 127–137, 2018.

[13] K.-C. Huang, P.-Y. Chien, C.-A. Chien, H.-C. Chang, and J.-I. Guo, “A 360-degree panoramic video system design,” in IEEE Int. Symp. VLSI Design, Automation and Test (VLSI-DAT’14), Apr. 2014, pp. 1–4.

[14] E. Adel, M. Elmogy, and H. Elbakry, “Image stitching based on feature extraction techniques: A survey,” International Journal of Computer Applications, vol. 99, no. 6, pp. 1–8, 2014.


[15] J. Jia and C.-K. Tang, “Image stitching using structure deformation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 617–631, 2008.

[16] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[17] P. Guo, Q. Shen, M. Huang, R. Zhou, X. Cao, and Z. Ma, “Modeling peripheral vision impact on perceptual quality of immersive images,” in IEEE Int. Conf. Visual Communications and Image Processing (VCIP’17), Dec. 2017, pp. 1–4.

[18] P. Guo, Q. Shen, Y. Meng, L. Xu, X. Cao, and Z. Ma, “Perceptual quality assessment of immersive images considering peripheral vision impact,” submitted to IEEE Trans. Image Processing, 2018.

[19] “ISO/IEC FDIS 23090-2: Information technology, Coded representation of immersive media (MPEG-I), Part 2: Omnidirectional media format,” Final Draft International Standard (FDIS), Feb. 2018.

[20] S. Xie, Y. Xu, Q. Qian, Q. Shen, Z. Ma, and W. Zhang, “Modeling the perceptual impact of viewport adaptation for immersive video,” in IEEE Int. Symp. Circuits and Systems (ISCAS’18), May 2018, pp. 1–5.

[21] Z. Ma, F. C. A. Fernandes, and Y. Wang, “Analytical rate model for compressed video considering impacts of spatial, temporal and amplitude resolutions,” in IEEE Int. Conf. Multimedia and Expo Workshops (ICMEW’13), Jul. 2013.

[22] E. Y. Byeongdoo Choi, “M39854: Viewport-dependent geometry rotation,” input contribution to OMAF, Jan. 2017.

[23] Y. Y. (Editor), “JVET-H1004: Algorithm descriptions of projection format conversion and video quality metrics in 360Lib version 5,” Oct. 2017.

[24] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, “Tile based HEVC video for head mounted displays,” in IEEE Int. Symp. Multimedia (ISM’16), Dec. 2016, pp. 399–400.

[25] T. Ho and M. Budagavi, “Dual-fisheye lens stitching for 360-degree imaging,” in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP’17), Mar. 2017, pp. 2172–2176.

[26] Y. Xu, H. Chen, W. Zhang, and J. Hwang, “Smart media transport: A burgeoning intelligent system for next generation multimedia convergence service over heterogeneous networks in China,” accepted by IEEE MultiMedia, 2018.

[27] Y. Lim, S. Aoki, I. Bouazizi, and J. Song, “New MPEG transport standard for next generation hybrid broadcasting system with IP,” IEEE Trans. Broadcasting, vol. 60, no. 2, pp. 160–169, 2014.

[28] B. B. (Editor), “JCT-VC-L1003: High efficiency video coding (HEVC) text specification draft 10 (for FDIS & last call),” output document of JCT-VC, Jan. 2013.

[29] Gear VR. [Online]. Available: https://www.samsung.com/global/galaxy/gear-vr/

[30] Y. Hu, S. Xie, Y. Xu, and J. Sun, “Dynamic VR live streaming over MMT,” in IEEE Int. Symp. Broadband Multimedia Systems and Broadcasting (BMSB’17), Jun. 2017, pp. 1–4.

[31] K. Misra, A. Segall, M. Horowitz, S. Xu, A. Fuldseth, and M. Zhou, “An overview of tiles in HEVC,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 969–977, 2013.

[32] Y. Sanchez, R. Skupin, and T. Schierl, “Compressed domain video processing for tile based panoramic streaming using HEVC,” in IEEE Int. Conf. Image Processing (ICIP’15), Sep. 2015, pp. 2244–2248.

[33] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.

[34] R. Van Brandenburg, O. Niamut, M. Prins, and H. Stokking, “Spatial segmentation for immersive media delivery,” in IEEE Int. Conf. Intelligence in Next Generation Networks (ICIN’11), Oct. 2011, pp. 151–156.

[35] P. R. Alface, J.-F. Macq, and N. Verzijp, “Interactive omnidirectional video delivery: A bandwidth-effective approach,” Bell Labs Technical Journal, vol. 16, no. 4, pp. 135–147, 2012.

[36] Y. Bao, H. Wu, T. Zhang, A. A. Ramli, and X. Liu, “Shooting a moving target: Motion-prediction-based transmission for 360-degree videos,” in Proc. IEEE Int. Conf. Big Data (Big Data’16), Dec. 2016, pp. 1161–1170.

[37] C.-L. Fan, J. Lee, W.-C. Lo, C.-Y. Huang, K.-T. Chen, and C.-H. Hsu, “Fixation prediction for 360° video streaming in head-mounted virtual reality,” in Proc. ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV’17), Jun. 2017, pp. 67–72.

[38] V. R. Gaddam, M. Riegler, R. Eg, C. Griwodz, and P. Halvorsen, “Tiling in interactive panoramic video: Approaches and evaluation,” IEEE Trans. Multimedia, vol. 18, no. 9, pp. 1819–1831, 2016.

[39] S. Xie, Q. Shen, Y. Xu, Q. Qian, S. Wang, Z. Ma, and W. Zhang, “Viewport adaptation-based immersive video streaming: Perceptual modeling and applications,” submitted to IEEE Trans. Image Processing, 2018.

[40] Y. Xue, Y.-F. Ou, Z. Ma, and Y. Wang, “Perceptual video quality assessment on a mobile platform considering both spatial resolution and quantization artifacts,” in IEEE Int. Packet Video Workshop (PVW’10), Dec. 2010, pp. 201–208.

[41] Z. Ma, H. Hu, M. Xu, and Y. Wang, “Rate model for compressed video considering impacts of spatial, temporal and amplitude resolutions and its applications for video coding and adaptation,” Computer Science, vol. abs/1206.2625, Jun. 2012. [Online]. Available: https://arxiv.org/abs/1206.2625

[42] JVET. [Online]. Available: ftp://ftp.ient.rwth-aachen.de