
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 21, NO. 8, AUGUST 2019 1905

A High-Efficiency Compressed Sensing-Based Terminal-to-Cloud Video Transmission System

Shuai Zheng, Xiao-Ping Zhang, Senior Member, IEEE, Jian Chen, Member, IEEE, and Yonghong Kuo

Abstract—With the rapid popularization of mobile intelligent terminals, mobile video and cloud services applications are widely used in people's lives. However, the resource-constrained nature of the terminals and the enormous amount of video information make efficient terminal-to-cloud data upload a challenge. To solve this problem, this paper proposes a compressed sensing-based high-efficiency video upload system for the terminal-to-cloud upload network. The system contains two main new components. First, to effectively remove the inter-frame redundant information, a high-efficiency encoder sampling scheme is developed by applying skip block-based residual compressed sensing sampling technology. For the time-varying channel state, the encoder can adaptively allocate the sampling rate for different video frames through the proposed adaptive sampling scheme. Second, a local secondary reconstruction-based multi-reference frames cross-recovery algorithm is developed at the decoder. It further improves the reconstruction quality and reduces the quality fluctuation of the recovered video frames to improve the user experience. Compared with the state-of-the-art reference systems reported in the literature, the proposed system achieves high-efficiency and high-quality terminal-to-cloud transmission.

Index Terms—Compressed sensing, terminal-to-cloud upload network, multi-reference frames, local secondary reconstruction based cross recovery.

I. INTRODUCTION

WITH the development of the mobile internet and wireless smart devices for video acquisition, mobile video services have become a critical part of our daily life. More and more people, especially from the young generation, are accustomed to recording, uploading, and sharing short videos on smartphones. However, in realistic scenarios, wireless mobile terminals are resource-constrained. The storage space, energy (battery capacity), and computing ability are all limited. The

Manuscript received July 11, 2018; revised November 2, 2018 and December 17, 2018; accepted December 20, 2018. Date of publication January 9, 2019; date of current version July 19, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61771366, in part by the "111" project under Grant B08038, and in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN239031. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sanjeev Mehrotra. (Corresponding author: Jian Chen.)

S. Zheng, J. Chen, and Y. Kuo are with the School of Telecommunications Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]).

X.-P. Zhang is with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2019.2891415

processing of massive video data has brought great challenges to these terminals.

To reduce the burden on wireless mobile terminals, cloud services, as emerging techniques, have received extensive attention [1]–[3]. Cloud services are utilized to store and process the data obtained from user terminals. The media cloud has strong computing power and almost infinite storage capacity. These characteristics have led to the emergence of various multimedia services combined with the cloud [4], [5]. A significance evaluation system for video data over the media cloud is proposed in [6]. It is applied to downlink video streaming services. Some research on video transcoding techniques in the media cloud is carried out in [7], [8]. In [9] and [10], a novel cloud-based distributed image coding scheme is proposed.

Conventional video coding schemes, such as the H.26x and MPEG-x series, have high computational complexity on the encoder side [11]. These schemes impose a huge computational burden on user terminals due to encoding mode prediction [12] and motion estimation (ME) and motion compensation (MC) processing [13], [14]. Compressed sensing (CS) [15]–[18] provides an effective way to reduce the complexity of the encoder. By reducing the sampling rate and synchronizing the data sampling and compression, CS greatly improves the coding efficiency of the encoder [11]. It has received wide attention in video processing research [19]–[25].

To further improve the coding efficiency, the distributed compressed video sensing (DCVS) scheme is proposed in [26]–[28]. It effectively combines distributed source coding and block-based compressed sensing (BCS) sampling technologies. For video CS decoding, a block-based CS smoothed projected Landweber reconstruction (BCS-SPL) algorithm is proposed by Fowler et al. [29], [30]. It achieves superior recovery quality with fast execution speed. To exploit the inter-frame correlation, a multihypothesis-based BCS-SPL (MH-BCS-SPL) reconstruction method is proposed in [31], [32]. The Tikhonov-based multihypothesis prediction model is applied to estimate the side information (SI) for decoding. However, the Tikhonov model is poor in the accuracy of SI estimation at a low sampling rate. To solve this problem, several improved multihypothesis-based CS video codec schemes are proposed in [33]–[35]. In [35], an elastic-net-based multihypothesis prediction (wEnet) model is proposed. By combining the l1 and l2 penalty terms, the wEnet model effectively improves the accuracy of SI estimation at a low sampling rate. Research on hypothesis set optimization and key frame secondary recovery is conducted in [36]–[39]. The recovery quality is further improved by improving the quality of

1520-9210 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Architecture of the terminal-to-cloud upload networks.

reference information. Fan et al. [40] propose a CS-based scalable video coding scheme to ensure recovery quality under packet loss. In [41]–[43], the image blocks of non-key frames are classified into multiple change levels. Different numbers of measurements are allocated to blocks with different change levels. This effectively improves the reconstruction quality of high-motion blocks.

Current studies on CS video codecs focus on the improvement of reconstruction quality. Some adaptive CS encoding schemes [11], [14], [41]–[43] are proposed to improve the encoding efficiency and the reconstruction quality of high-motion regions. However, the above adaptive encoding strategies mainly consider the different region features of the video frames. The effect of the wireless channel state on transmission efficiency is ignored. For the decoder, the improvement of objective reconstruction quality is the main focus of current work [34]–[38]. However, to avoid adding more computational burden to resource-constrained terminals, the inter-frame correlation of non-key frames is rarely used in decoding. The recovery quality of a non-key frame is significantly affected by its distance from the key frame, which leads to quality fluctuation among non-key frames. Moreover, the cloud-based distributed wireless transmission schemes proposed in [9] and [10] are designed for static images. An additional thumbnail, coded by conventional intra coding, has to be transmitted to the cloud for each image. The computation for a single frame is complex, so these schemes are not suitable for video coding applications. Therefore, the realization of an appropriate, high-quality CS-based terminal-to-cloud video upload scheme is still a challenge. Under a time-varying wireless channel environment, the customers' demand for high-speed and high-efficiency video uploading services cannot be guaranteed.

In this paper, we design a new CS-based high-efficiency video upload (CS-HEVU) system for the terminal-to-cloud upload network. As shown in Fig. 1, video data is acquired by wireless terminals at the encoder. After the compression and encoding process, it is transmitted to the cloud side by the base station. At the decoder, the cloud server reconstructs the video data and transcodes it for downlink services. The main contributions are as follows.

1) To reduce the amount of data to be transmitted and adaptively adjust the sampling rate according to the time-varying wireless channel state, we propose a novel adaptive video coding scheme that combines skip block-based residual sampling and adaptive sampling rate adjustment technologies. The proposed new encoder effectively removes the redundant information among the video frames in a low-complexity way. Under a poor channel state, the decline in user experience caused by communication interruption or link congestion is effectively mitigated.

2) To improve the quality of the recovered video, a local secondary reconstruction based multi-reference frames cross recovery method is proposed. By introducing the adjacent non-key frames as additional reference frames, more and better hypotheses are acquired in the new decoder. The new decoder improves the quality of the SI and of the final reconstructed video significantly. The design of the cross recovery scheme provides an effective way to alleviate the quality fluctuation of non-key frames.

The rest of this paper is organized as follows. Section II first gives a brief overview of the block diagram of the proposed CS-HEVU system. Then, the contributions and their specific implementation details are illustrated. The experimental results are given in Section III. Finally, Section IV concludes the paper.

II. THE PROPOSED CS-HEVU SYSTEM

The block diagram of the proposed CS-HEVU scheme is illustrated in Fig. 2. The modules with blue shading are the core steps of the proposed new scheme. The mathematical notations used in this section are summarized in Table I.

At the encoder, the group of pictures module first divides the video frames into groups of pictures (GoPs) with a fixed size. The first frame of each GoP is selected as the key frame; the other frames are non-key frames, named CS frames. The BCS module samples the input video frames by block-based CS sampling. Assume that vector u_k ∈ R^(N×1) denotes an image block k of size n × n; N = n × n is the total number of pixels in block k. The measurements y_k ∈ R^(M×1) obtained by BCS sampling are expressed as below,

y_k = Φu_k, (1)

where Φ ∈ R^(M×N) (M ≪ N) is the CS measurement matrix and M is the number of measurements in y_k. The sampling rate is defined as M/N. The module of side information acquisition based on BCS-SPL recovery is utilized to recover the key frame. The recovered key frame u_e^key is used as the reference frame in CS frame residual coding. In the block mode decision module, the reference block u_k^key of the current block u_k is obtained from u_e^key. Correspondingly, the residual block u_k^r of the current block in the CS frame is obtained by (2),

u_k^r = u_k − u_k^key. (2)

Then, the residual blocks are divided into non-skip blocks, named CS blocks, and skip blocks. The adaptive sampling rate module selects an appropriate sampling rate according to the channel state. In the residual based BCS module, the block mode flag is allocated to skip blocks and CS blocks, respectively. Each CS block is sampled by BCS as in formula (1). All of the measurements and flag information are transmitted to the cloud via the base station.
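As a concrete illustration of the block-based sampling in (1), the following Python sketch partitions a frame into n × n blocks and measures each with a shared random matrix Φ. The block size, sampling rate, and the Gaussian choice of Φ are illustrative assumptions; the paper does not fix them at this point.

```python
import numpy as np

def block_cs_sample(frame, block_size=16, rate=0.25, rng=None):
    """Block-based CS sampling (BCS): y_k = Phi @ u_k for each block, eq. (1).

    block_size, rate, and the Gaussian Phi are illustrative choices.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = block_size
    N = n * n                            # pixels per block
    M = max(1, int(round(rate * N)))     # measurements per block, rate = M/N
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)  # shared measurement matrix
    H, W = frame.shape
    measurements = []
    for r in range(0, H - n + 1, n):
        for c in range(0, W - n + 1, n):
            u_k = frame[r:r + n, c:c + n].reshape(N)  # vectorized block u_k
            measurements.append(Phi @ u_k)            # y_k = Phi u_k
    return Phi, measurements
```

With rate = 0.25 and 16 × 16 blocks, each 256-pixel block yields 64 measurements, i.e., a sampling rate of M/N = 0.25.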


Fig. 2. The proposed CS-based terminal-to-cloud video transmission scheme.

TABLE I
MATHEMATICAL NOTATIONS

At the decoder side, the key frame is recovered first. A two-stage recovery process is utilized for the reconstruction of the key frame in the proposed decoder. In the first step, the improved total variation (TV) algorithm [44] is utilized to perform the initial reconstruction. After the first recovered key frame is obtained, it is applied as the reference frame for the second reconstruction of the key frame. In the second step, the MH-BCS-SPL algorithm is performed to acquire the final recovered key frame. For the

Fig. 3. The proposed skip block based residual BCS sampling scheme.

recovery of CS frames, the BCS-SPL recovery module is first utilized to recover the same key reference frame u_e^key as at the encoder. Then, the module of multi-reference frames based hypothesis set acquisition is utilized to acquire the reference information. By sliding the block pixel by pixel, each block in the search window is obtained as a reference block, i.e., a hypothesis h_l of the current block. All of the reference blocks constitute the hypothesis set H = {h_1, ..., h_l, ..., h_L}, where L is the total number of hypotheses. The weight prediction module is utilized to calculate the weights of the different hypotheses. The final obtained SI is utilized in the subsequent recovery module. In the module of local secondary reconstruction based multi-reference frames cross recovery, the proposed reconstruction method is applied to recover the original CS frames. The detailed implementation of the proposed methods is given below. The proposed encoding scheme is described in Subsections II-A and II-B. Subsection II-C gives the proposed decoding scheme.

A. New Skip Block Based Residual BCS Sampling

For video sequences, the main characteristic is the high correlation among adjacent video frames. The temporal redundancy is much greater than the spatial redundancy. This means that many similar blocks among video frames, i.e., inter-frame redundant information, are transmitted to the cloud. This is unnecessary for the cloud, since the decoder can recover these similar blocks from the reference frames by a block matching method. To further improve the compression efficiency, removing the image blocks with high inter-frame similarity is an effective way.

To identify high-similarity blocks and skip them, Fig. 3 shows the proposed skip block based residual BCS sampling scheme. It consists of the following steps:


first, the BCS-SPL recovery module obtains the recovered key frame as the reference frame of the CS frames. Second, the residual acquisition module gets the residual between the CS frame and its reference frame. Third, the block mode decision module divides the residual frame into non-overlapping skip blocks and CS blocks. Finally, the BCS module samples the CS blocks and obtains the final measurements. This differs from existing methods such as [11], where the original key frame is used as the reference frame at the encoder. In our new encoder, the recovered key frame is employed as the reference frame, for the following reason. The original key frame is not known at the decoder, and there always exists a difference between the recovered key frame at the decoder and the original key frame at the encoder. Therefore, using the original key frame as the reference frame at the encoder would generate a systematic error in the recovery of the CS frames due to this difference. In our new system, the decoder and encoder use the same recovered key frame in the absence of channel noise to avoid such a systematic recovery error.

Note that some reconstruction algorithms, such as the MH-BCS-SPL algorithm, have better reconstruction quality, which makes the block mode decision more accurate. However, the computational complexity of the encoder would increase significantly at the same time. On the other hand, the BCS-SPL algorithm achieves a good tradeoff between the encoder complexity and the accuracy of the block mode decision. First, the BCS-SPL algorithm is more suitable for key frame reconstruction in the encoder due to its low complexity. Second, in a compressed sensing based video codec system, key frames usually have a much higher sampling rate than the non-key frames. For the encoder, the reconstruction quality of the BCS-SPL algorithm is acceptable in such cases.

For block mode decision, existing methods in [41]–[43] process it at the decoder. Block mode information is transmitted to the encoder over a feedback channel. Repeated information transmission further increases the pressure on the channel, especially in poor channel states. In our proposed system, the block mode decision is performed at the encoder with low computational complexity. After obtaining the recovered key frame u_e^key, the block in u_e^key that has the same position as the current block u_k is selected as the reference block u_k^key. The block mode decision for the residual block u_k^r is shown below,

SAD(u_k^r) = Σ_{x_p=1}^{N} |u_k(x_p) − u_k^key(x_p)|, (3)

u_k^r = { CS block,   if SAD(u_k^r) > T_SAD
        { skip block, otherwise,               (4)

where SAD(u_k^r) is the sum of absolute differences (SAD) between u_k and u_k^key; x_p represents the index of the pixel; T_SAD is the threshold of SAD in the block mode decision. When SAD(u_k^r) ≤ T_SAD, u_k^r is selected as a skip block, which means that block u_k has sufficient similarity to u_k^key. Otherwise, the residual block is selected as a CS block. The skip block is not sampled at the encoder. Only one bit F_k^r, as the flag, is transmitted to the decoder to distinguish the block mode. At the decoder, with the help of the flag map, the skip block is directly represented by

Fig. 4. The adaptive sampling rate selection scheme.

its reference block. On the other hand, the CS block is sampled by residual BCS sampling. The obtained measurements y_k^r are shown in (5),

y_k^r = Φu_k^r = Φ(u_k − u_k^key). (5)

Then the residual measurements y_k^r are transmitted to the decoder. Compared with the original image block, the residual CS block is sparser. The variance of the residual measurements is lower than that of the original CS measurements. Correspondingly, the measurements obtained by residual BCS sampling allow a lower quantization rate.
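The per-block encoding path of (3)–(5) can be sketched as follows. The threshold value and the function name are hypothetical; the paper leaves T_SAD unspecified.

```python
import numpy as np

T_SAD = 500.0  # illustrative threshold; the paper does not fix T_SAD

def encode_cs_frame_block(u_k, u_key_k, Phi, t_sad=T_SAD):
    """Skip-block residual BCS encoding of one block, eqs. (3)-(5).

    Returns (flag, payload): flag 0 marks a skip block (nothing sampled),
    flag 1 marks a CS block carrying residual measurements y_k^r.
    """
    sad = np.abs(u_k - u_key_k).sum()    # SAD(u_k^r), eq. (3)
    if sad <= t_sad:                     # sufficiently similar: skip block
        return 0, None
    u_r = u_k - u_key_k                  # residual block u_k^r, eq. (2)
    return 1, Phi @ u_r                  # y_k^r = Phi(u_k - u_k^key), eq. (5)
```

Only the one-bit flag (and, for CS blocks, the residual measurements) would be transmitted, matching the flag-map mechanism described above.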

B. Adaptive Sampling Rate Adjustment

In wireless communication systems, video transmission usually consumes more time than the transmission of static images or text. The channel state may therefore experience significant changes with higher probability, especially for fast-moving scenes. When the channel quality degrades, the channel cannot carry a high transmission rate of video data. The mismatch between the channel quality and the transmission rate leads to link congestion or possible communication interruption, causing degradation of the user experience. On the other hand, when the channel quality is good enough to support a higher transmission rate, transmission resources are wasted if the encoder keeps a low sending rate. In such a scenario, increasing the transmission rate can provide a better video experience for users.

To solve this mismatch problem, we propose an adaptive sampling rate adjustment method based on the channel state. Since the sampling rate is linearly proportional to the compression of the video, controlling the sampling rate controls the compression rate of the entire video and thus the data rate of the video transmission [45]. The transmission efficiency can be further improved by adaptively adjusting the sampling rate. As illustrated in Fig. 4, the base station first sends feedback information to the user terminal. According to the channel state, the rate selection model selects the appropriate rate to sample the CS blocks at the encoder. Note that the channel state is measured by the signal-to-noise ratio (SNR) and is updated after each frame transmission. After quantization and channel encoding, the measurements are transmitted to the cloud via the base station. Here we define three statuses for the channel state according to the value of the SNR. Correspondingly, three sampling rates are selected to realize the adaptive sampling. Table II gives the detailed sampling rate selection method under different channel states. When the channel SNR is larger than the upper decision threshold, i.e., SNR > T_SNR^Upper,


TABLE II
THE SAMPLING RATE SELECTION TABLE

Fig. 5. The block diagram of the proposed decoder.

the channel state is judged to be excellent. It means that the current channel can support high-speed data transmission. Therefore, the encoder selects a high sampling rate to sample the current frame. Meanwhile, the recovered quality of the video frames at the decoder is improved. When the channel SNR is lower than the lower decision threshold, i.e., SNR < T_SNR^Lower, the channel state is bad. The encoder has to select a low sampling rate to ease the transmission pressure and avoid transmission failure. Otherwise, a medium sampling rate is selected.
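A minimal sketch of the three-state selection in Table II follows. The threshold and rate values are placeholder assumptions; the paper specifies only the decision structure, not these numbers.

```python
def select_sampling_rate(snr_db,
                         t_upper=20.0, t_lower=10.0,
                         rates=(0.4, 0.2, 0.1)):
    """Three-level sampling-rate selection from the channel SNR (Table II).

    t_upper/t_lower stand in for T_SNR^Upper/T_SNR^Lower; the thresholds
    and the (high, medium, low) rates are illustrative placeholders.
    """
    high, medium, low = rates
    if snr_db > t_upper:      # excellent channel: high sampling rate
        return high
    if snr_db < t_lower:      # bad channel: low sampling rate
        return low
    return medium             # otherwise: medium sampling rate
```

Because the channel state is updated after each frame, this selection would run once per transmitted frame.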

C. Local Secondary Reconstruction Based Multi-Reference Frames Cross Recovery

At the decoder, it is challenging to recover the original video frames with high quality due to the high-efficiency compression of the proposed encoding scheme. Especially under a poor channel state, the new encoder samples the CS frames at the lowest sampling rate, which increases the difficulty of accurately reconstructing the original video frames. Moreover, the quality fluctuation of the recovered CS frames is also a problem. In CS-based video codec schemes, the GoP size is usually quite large to ensure compression efficiency. As the distance between the key frame and a CS frame increases, their inter-frame correlation decreases significantly, leading to a decline in SI quality. Correspondingly, the recovery quality of the CS frames in the middle of each GoP is much lower than that of the CS frames at both ends of the GoP. Therefore, realizing the accurate reconstruction of the video sequences is the focus of this subsection.

To solve these problems, a local secondary reconstruction based multi-reference frames cross recovery scheme is proposed in Fig. 5. It mainly consists of the following steps: key frame reconstruction, CS frame measurements recovery, reference information acquisition, and CS frame recovery. In the key frame reconstruction step, the key frame is first recovered by the improved TV model [44]. The reconstruction of block u_k is formulated in (6),

min_{u_k} c_1‖Du_k‖_2 + c_2‖Ψu_k‖_1 + c_3‖u_k − Wu_SI‖_2^2,
s.t. y = Φu_k, (6)

where Du_k is the gradient representation of u_k; Ψ is the sparse matrix based on a wavelet basis; ‖u_k − Wu_SI‖_2^2 is the non-local penalty term; u_SI is the non-local reference information set; W is the weight vector for the different reference information; c_1, c_2, and c_3 represent the weights of the different penalty terms. Similar to [44], the values of c_1 and c_2 are adjusted adaptively. Parameter c_1 is defined by the gradient of the pixel in the image block. For pixel u_{k,i} in block u_k, c_{1,i} is updated by c_{1,i} = max(1/Du_{k,i}, a) − b, where a > b > 0 are positive constants. Parameter c_2 is used to adaptively adjust the weight of the wavelet penalty term under different scales and frequency bands. Assume that Θ_{i,f} is the set of wavelet coefficients for the ith wavelet decomposition level and the fth frequency band; c_2 is updated by c_2 = 1/mean(Θ_{i,f}). c_3 is a constant. After getting the first recovered key frame, it is utilized as the SI in the second recovery of the key frame. The final recovered key frame is obtained by the MH-BCS-SPL [32] algorithm.
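The adaptive updates of c_1 and c_2 described above can be sketched as follows. The constants a and b, the small guard against zero gradients, and the use of coefficient magnitudes in the c_2 update are our assumptions, not values fixed by the text.

```python
import numpy as np

def update_c1(grad_u, a=10.0, b=1.0):
    """Per-pixel TV weight c_{1,i} = max(1/Du_{k,i}, a) - b, with a > b > 0.

    a and b are illustrative constants; eps guards the zero-gradient case.
    """
    eps = 1e-8
    return np.maximum(1.0 / (np.abs(grad_u) + eps), a) - b

def update_c2(theta_if):
    """Wavelet-band weight c_2 = 1/mean(Theta_{i,f}) for one subband.

    Magnitudes are used so the mean (and hence c_2) stays positive; this
    is an interpretation of the formula, not stated in the text.
    """
    return 1.0 / np.mean(np.abs(theta_if))
```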

For the recovery of a CS frame, the BCS-SPL algorithm [30] is first applied to obtain the same recovered key frame u_e^key as at the encoder. After obtaining the reference block u_k^key from u_e^key, the complete measurements y_k of the CS block u_k can be recovered as follows,

y_k = y_k^r + y_k^key = y_k^r + Φu_k^key, (7)

where y_k^r is the residual measurements of u_k received at the decoder; y_k^key is the measurements of u_k^key. After that, the proposed multiple reference frames acquisition and cross reconstruction scheme is utilized to reconstruct the CS frames. In the reference information acquisition step, the previously recovered CS frames are utilized as an additional supplement to the key reference frame. Compared with the key reference frame, a CS reference frame has better inter-frame correlation in some local areas, especially for the CS frames in the middle of a GoP. More high-quality hypotheses can be acquired from the previously recovered CS frames. Moreover, to further compensate for the imbalance in recovery quality, the search window for the CS frames in the middle of the GoP is expanded in our proposal. This is useful for acquiring more high-quality hypotheses outside the original search window. Note that the expansion of the search window is only performed in key reference frames.
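The decoder-side measurement recovery of (7), together with the flag-map handling of skip blocks, can be sketched per block as below; the function name and flag convention are hypothetical.

```python
import numpy as np

def recover_full_measurements(y_r, flag, u_key_k, Phi):
    """Rebuild complete block measurements at the decoder, eq. (7).

    A skip block (flag 0) carries no residual and is represented by its
    reference block, so its measurements are just Phi @ u_k^key; a CS
    block (flag 1) adds the received residual: y_k = y_k^r + Phi u_k^key.
    """
    y_key = Phi @ u_key_k                # y_k^key = Phi u_k^key
    if flag == 0:                        # skip block: no residual was sent
        return y_key
    return y_r + y_key                   # eq. (7)
```

By linearity of Φ, adding Φu_k^key to y_k^r = Φ(u_k − u_k^key) recovers exactly the measurements Φu_k that direct sampling of u_k would have produced.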

To efficiently acquire CS reference frames from the previously recovered CS frames, a local secondary reconstruction based cross recovery scheme is proposed for the recovery of the CS frame. Here, we set the GoP size to 8 as an example to explain the proposed scheme. As shown in Fig. 6, Key 1 and Key 2 are the key frames in the current GoP and the next GoP, respectively. CS′_i represents the ith CS frame of the current GoP in the initial recovery. CS_i represents the ith CS frame in the final recovery. Rec_i is the order of the reconstruction. LCS_7 and NCS_1 represent the last CS frame in the previous GoP and the first CS frame in the next GoP, respectively. In the proposed cross reconstruction scheme, there

1910 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 21, NO. 8, AUGUST 2019

Fig. 6. The local secondary reconstruction based multi-reference frames crossrecovery method.

are four reference frames for each CS frame to be recovered,i.e., two key reference frames and two previous recovered CSreference frames. Note that the two key reference frames arenecessary for the recovery of each CS frame. To facilitate thedescription of the reconstruction process, the acquisition of keyreference frames is not shown in Fig. 6 except CS′ 1 and CS′ 7.In the initial recovery of the first and the last CS frames, thereare no CS reference frames to provide hypotheses. The recoverystarts from both sides of the GoP. Then, it moves to the centeredCS frame. The first CS frame is first reconstructed, followed bythe last frame, the second frame etc. The reconstruction orderof the CS frames in GoP can be expressed as: CS 1 → CS 7 →CS 2 → CS 6 → CS 3 → CS 5 → CS 4. The first recovered fourCS frames would be reconstructed at a second time. The last CSframe in previous GoP and the second CS frame in current GoPare selected as the CS reference frames for the second recoveryof CS 1. For the second recovery of CS 7. The sixth CS frame incurrent GoP and the first CS frame in the next GoP are selectedas the CS reference frames.
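The alternating, outside-in reconstruction order CS_1 → CS_7 → CS_2 → CS_6 → CS_3 → CS_5 → CS_4 can be generated programmatically. The helper below is our own illustration (the function name and the generalization to other GoP sizes are not part of the original scheme); frame 0 denotes the key frame, so the CS frames are indices 1 to gop_size − 1.

```python
def cross_recovery_order(gop_size=8):
    """Cross recovery order of the CS frames in a GoP: alternate from
    both sides of the GoP toward the center."""
    left, right = 1, gop_size - 1
    order = []
    while left < right:
        order.append(left)    # next frame from the left side
        order.append(right)   # next frame from the right side
        left += 1
        right -= 1
    if left == right:         # odd number of CS frames: one center frame
        order.append(left)
    return order

print(cross_recovery_order(8))  # [1, 7, 2, 6, 3, 5, 4]
```

Ordering the frames this way ensures that by the time the centered CS frames are reconstructed, recovered CS frames are available on both sides to serve as CS reference frames.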

After obtaining the hypotheses from the selected reference frames, the hypothesis set optimization is performed by calculating the Euclidean distance Ed_{k,l} between the lth hypothesis h_l and the original CS block k as follows,

Ed_{k,l} = ‖y_k − Φ h_l‖_2.    (8)
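A minimal numpy sketch of (8) and of the distance-based hypothesis selection that follows it (all dimensions and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, L = 51, 256, 12   # measurements, block length, candidate hypotheses
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y_k = rng.standard_normal(M)        # measurements of block k
H = rng.standard_normal((N, L))     # one candidate hypothesis per column

# Equation (8): distance of each hypothesis in the measurement domain
Ed = np.linalg.norm(y_k[:, None] - Phi @ H, axis=0)

# Keep the T_num hypotheses with the smallest distance (T^hyp_num in the text)
T_num = 5
keep = np.argsort(Ed)[:T_num]
H_k = H[:, keep]   # optimized hypothesis set for block k
```

Computing the distances in the measurement domain avoids ever needing the original block at the decoder: only y_k and Φ are required.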

The hypotheses are sorted by their Euclidean distance from small to large. Assume that the final selected number of hypotheses is T^hyp_num, determined empirically. The first T^hyp_num hypotheses, which have higher quality, are selected, and the low-quality hypotheses are removed effectively. For the final hypothesis set H_k of block k, the optimal weight vector w^Enet_k of H_k is calculated by the wEnet model [35] as follows,

w^Enet_k = (1 + λ_2/M) arg min_w { ‖y_k − Q_k w‖²_2 + λ_1 ‖Γ_k w‖_1 + λ_2 ‖w‖²_2 },    (9)

Algorithm 1: The wEnet-Based MH-BCS-SPL Algorithm
Input: y_k, Φ, sparse basis Ψ, hypothesis set H_k
1: calculate the weight vector w^Enet_k by equation (9)
2: acquire the reference information: u^SI_k = H_k w^Enet_k
3: initialize: y_k = y_k − Φ u^SI_k, j = 0, u^(j)_k = Φ^T y_k
4: repeat
5:   Wiener filtering: u^(j)_k = Wiener(u^(j)_k)
6:   projected Landweber CS reconstruction [29] result of the current iteration: û^(j)_k = u^(j)_k + Φ^T (y_k − Φ u^(j)_k)
7:   thresholding in the sparse domain: θ^(j)_k = Threshold(Ψ û^(j)_k)
8:   inverse sparse representation: u^(j)_k = Ψ^{−1} θ^(j)_k
9:   projected Landweber CS reconstruction [29] result of the next iteration: u^(j+1)_k = u^(j)_k + Φ^T (y_k − Φ u^(j)_k)
10:  calculate the Euclidean distance between u^(j+1)_k and û^(j)_k: Ed^(j+1) = ‖u^(j+1)_k − û^(j)_k‖_2
11:  increase the iteration index: j = j + 1
12: until |Ed^(j) − Ed^(j−1)| < 10^−4
Output: the final result u_k = u^(j)_k + u^SI_k

where Q_k = Φ H_k ∈ R^{M × T^hyp_num} is the projection of the hypothesis set; λ_1 and λ_2 are real parameters; and Γ_k is a regularization matrix whose diagonal consists of the Euclidean distance of each hypothesis as follows,

Γ_k = diag(Ed_{k,1}, …, Ed_{k,T^hyp_num}).    (10)
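The objective in (9) is an elastic-net-style problem with distance-weighted ℓ1 regularization. The paper solves it with the wEnet model of [35]; the plain proximal-gradient (ISTA) solver below is only a stand-in sketch of that objective, with illustrative parameter values, not the solver actually used.

```python
import numpy as np

def wenet_weights(y, Q, Gamma, lam1=0.05, lam2=0.1, n_iter=500):
    """ISTA sketch of the weight prediction in (9).

    Gamma is the diagonal of the regularization matrix in (10); larger
    hypothesis distances shrink the corresponding weights harder.
    """
    M = Q.shape[0]
    w = np.zeros(Q.shape[1])
    # Step size = 1 / Lipschitz constant of the smooth part's gradient
    step = 1.0 / (2.0 * np.linalg.norm(Q, 2) ** 2 + 2.0 * lam2)
    for _ in range(n_iter):
        grad = 2.0 * Q.T @ (Q @ w - y) + 2.0 * lam2 * w  # smooth-part gradient
        z = w - step * grad
        thr = step * lam1 * Gamma                        # per-weight threshold
        w = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)  # soft thresholding
    return (1.0 + lam2 / M) * w   # elastic-net rescaling factor from (9)
```

The ℓ1 term drives the weights of distant (low-quality) hypotheses to exactly zero, while the ℓ2 term keeps the weights of correlated hypotheses stable.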

The reference information u^SI_k is obtained by calculating the weighted sum of the hypotheses, u^SI_k = H_k w^Enet_k. The final recovered CS frame is obtained by the wEnet-based MH-BCS-SPL algorithm, in which the wEnet model, instead of the Tikhonov model of the traditional MH-BCS-SPL algorithm, is utilized to perform the multihypothesis weight prediction. For CS block u_k, the detailed iterative reconstruction processing is shown in Algorithm 1, where j is the iteration index and Wiener(·) is the pixel-wise adaptive Wiener filtering using a 3 × 3 neighborhood. The projected Landweber CS reconstruction algorithm [29] is utilized to obtain the recovery result of each iteration. The thresholding processing Threshold(·) [29] is formulated as below,

θ^(j)_k = Ψ û^(j)_k if |Ψ û^(j)_k| ≥ τ, and 0 otherwise,    (11)

where τ = λ σ √(2 log N_c); λ is a constant factor to manage convergence; σ is estimated by the robust median estimator σ = median(|Ψ û^(j)_k|)/0.6745; and N_c is the number of transform coefficients of Ψ û^(j)_k. The recovered CS frame is finally obtained by combining the recovered CS blocks and the skip blocks.
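The Landweber-plus-thresholding structure of Algorithm 1 can be sketched as follows, using an orthonormal DCT as the sparse basis Ψ. This is only an illustration of that structure, not the full algorithm: the Wiener filtering step is omitted, a fixed iteration count replaces the convergence test on Ed^(j), and the measurement matrix is assumed to have orthonormal rows (as in BCS-SPL) so the unit-step Landweber iteration converges.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, used here as the sparse basis Psi."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] *= 1.0 / np.sqrt(2.0)
    return C * np.sqrt(2.0 / n)

def threshold(theta, lam=3.0):
    """Equation (11): hard thresholding with the robust median estimator."""
    sigma = np.median(np.abs(theta)) / 0.6745
    tau = lam * sigma * np.sqrt(2.0 * np.log(theta.size))
    return np.where(np.abs(theta) >= tau, theta, 0.0)

def landweber_recover(y, Phi, Psi, n_iter=100, lam=3.0):
    """Projected-Landweber loop of Algorithm 1 (Wiener step omitted)."""
    u = Phi.T @ y                              # initial back-projection
    for _ in range(n_iter):
        u_hat = u + Phi.T @ (y - Phi @ u)      # Landweber step
        theta = threshold(Psi @ u_hat, lam)    # sparse-domain thresholding
        u = Psi.T @ theta                      # inverse transform (Psi orthonormal)
    return u
```

In the full algorithm this loop runs on the residual y_k − Φ u^SI_k, and the side information u^SI_k is added back at the end.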

III. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed CS-HEVU scheme, simulation results are presented in this section. All experiments are performed in Matlab R2015b on 64-bit Windows 7 SP1 with an Intel(R) Core(TM) i7-4790 CPU at 3.60 GHz and 16 GB RAM. To measure the objective reconstruction quality at the decoder, the peak signal-to-noise ratio (PSNR) is used as the objective evaluation criterion. Assume that P_max is the maximum value of the pixels in the original image X_org; here, P_max = 255. P_r and P_c are the numbers of rows and columns of image pixels, respectively. The PSNR of the recovered image X_rec is calculated as,

PSNR = 10 log_10 (P²_max / MSE),    (12)

MSE = (1/(P_r P_c)) Σ_{i=0}^{P_r−1} Σ_{j=0}^{P_c−1} [X_org(i, j) − X_rec(i, j)]²,    (13)

where MSE is the mean square error between the original image X_org and the recovered image X_rec. To give a subjective evaluation of the recovered image quality, the feature similarity index for image quality assessment (FSIM) [46] is applied as the criterion of visual quality assessment. The FSIM index between X_org and X_rec is defined as,

FSIM = [ Σ_{x_p ∈ Ω} S_L(x_p) · PC_m(x_p) ] / [ Σ_{x_p ∈ Ω} PC_m(x_p) ],    (14)

where x_p represents the pixel index in the image; Ω is the whole image spatial domain; S_L(x_p) is the similarity index between X_org and X_rec; and PC_m(x_p) = max(PC_org(x_p), PC_rec(x_p)) is the phase congruency (PC) feature of the image, where PC_org(x_p) and PC_rec(x_p) denote the PC features of X_org and X_rec, respectively. The detailed calculation of S_L(x_p) and PC_m(x_p) can be found in [46].
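The two evaluation criteria reduce to a few lines of numpy. The function names below are ours, and the S_L and PC maps required by (14) must be computed as in [46]; here they are taken as given inputs, so only the pooling step of FSIM is shown.

```python
import numpy as np

def psnr(x_org, x_rec, p_max=255.0):
    """Equations (12)-(13): PSNR of the recovered image in dB."""
    diff = np.asarray(x_org, dtype=float) - np.asarray(x_rec, dtype=float)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(p_max ** 2 / mse)

def fsim_pool(S_L, PC_org, PC_rec):
    """Equation (14): FSIM pooling over the image domain, given the
    precomputed similarity map S_L and the two phase-congruency maps [46]."""
    PC_m = np.maximum(PC_org, PC_rec)   # pointwise maximum PC feature
    return float(np.sum(S_L * PC_m) / np.sum(PC_m))
```

The PC_m weighting makes FSIM emphasize perceptually salient regions (edges, textures), which is why it tracks subjective judgments more closely than the uniformly weighted PSNR.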

A. Parameter Setting

In this subsection, we show the parameter settings used in the simulation experiments. A Gaussian orthogonal projection matrix is utilized as the measurement matrix Φ. The standard test sequences Coastguard, Foreman, Mother-daughter, and Soccer are applied as the source information in this paper. The four selected sequences cover different motion characteristics: Coastguard and Mother-daughter are representative of high-motion and low-motion sequences, respectively; Foreman has a middle motion characteristic; and Soccer has complex picture texture and multi-target motion. The experimental results on these four sequences can therefore effectively illustrate the performance of the proposed CS-HEVU scheme. Moreover, the first 88 frames of each sequence completely contain the features of the whole sequence, and the reference schemes [32], [35]–[37] select the first 88 frames of each test sequence as the final tested frames in their experiments. To ensure fairness, the first 88 frames of each test sequence are applied as the test frames in our simulation.

The GoP size is set to 8 in this paper to balance the coding efficiency and the reconstruction quality. Note that the existing literature [30]–[32], [35]–[37] also takes a GoP size of 8 in its experiments. In existing CS-based video processing systems, the GoP size is usually large in order to reduce the ratio of key frames among the entire set of video frames and improve the coding efficiency of the encoder. However, as the GoP size increases, the correlation between the key frame and the non-key frames decreases significantly. The quality of the SI predicted at the decoder depends on this inter-frame correlation and decreases along with it, which degrades the reconstruction quality at the decoder. When the GoP size is set to 8, the encoding efficiency and the recovery quality are well balanced.

Generally, the block size is related to the measurement matrix size and the reconstruction quality. When the block size is large, the encoder must allocate a large storage space for the measurement matrix, and the computational cost of the encoder increases correspondingly. When the block size is small, block artifacts become a problem for the recovery at the decoder. We take a block size of 16 × 16 in this paper to balance the computational cost of the encoder and the recovery quality of the decoder. Moreover, in the current literature [30]–[32], [35]–[37], the block size is set to 16 × 16; ensuring the fairness of the comparison is the other consideration.

We obtain the threshold of the block mode decision, T_SAD, from a large number of empirical statistical experiments. When T_SAD = 300, sufficient skip blocks can be obtained for the low-motion sequences while the decoder maintains a high reconstruction quality. When T_SAD is small, the improvement of the coding efficiency for the low-motion sequences is not obvious, i.e., the number of skip blocks is small. When T_SAD is too large, many blocks that are not similar to the reference block would also be selected as skip blocks, and the quality of the reconstructed video would decrease significantly.
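The T_SAD decision can be sketched as below. The SAD comparison against the co-located reference block follows the description above, while the function signature and the use of floating-point blocks are our own illustration.

```python
import numpy as np

def is_skip_block(block, ref_block, t_sad=300.0):
    """Skip-block mode decision sketch: a block whose sum of absolute
    differences (SAD) against the co-located reference block falls
    below T_SAD is skipped, i.e., neither sampled nor transmitted."""
    sad = float(np.sum(np.abs(np.asarray(block, dtype=float)
                              - np.asarray(ref_block, dtype=float))))
    return sad < t_sad
```

For a 16 × 16 block, T_SAD = 300 corresponds to an average per-pixel difference of just over one gray level, so essentially static background blocks are the ones skipped.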

The number of finally selected hypotheses in CS frame recovery is also obtained by empirical statistical experiments. Generally, the computational complexity increases with T^hyp_num. When T^hyp_num is too large, the low-correlation hypotheses degrade the quality of the side information estimated by multihypothesis prediction; when T^hyp_num is too small, some useful hypotheses are removed in the hypothesis set optimization. After a large number of empirical statistical experiments, we find that a good quality can be obtained when T^hyp_num is in the range [50, 220] for the test videos. When T^hyp_num = 70, a good trade-off between the recovery quality and the computational complexity is achieved for the test videos in this paper; this value is robust across them and is used for all of the test videos. A more accurate value of T^hyp_num can be obtained by empirical experiments for specific videos.

The search window is constructed by extending a fixed number of pixels around the current block. The size of the search window mainly affects the performance of the multihypothesis weight prediction. In this paper, the window size is 15 pixels, to keep consistent with the reference schemes; for the centered non-key frames, the increment of the search window size is eight pixels. Moreover, the sampling rate of the key frames is 0.7, and the sampling rate of the CS frames varies from 0.1 to 0.3. The thresholds are T^Upper_SNR = 9.0130 dB and T^Lower_SNR = 4.0675 dB. In the sampling rate adaptive adjustment scheme, we set the basic sampling rate to 0.2. For the parameters c_1, c_2, and c_3 in (6), the initialization of c_1 and c_2 is 1, and a and b are 0.5 and 0.3, respectively, as in [44]; c_3 is usually set to the constant 1. Parameters λ_1 and λ_2 in (9) are the tuning parameters of the wEnet model. According to the research in [35], [47], the value of λ_2 is usually in the range [0.01, 0.2], and λ_1 is usually replaced by the number of steps s of the algorithm. In the proposal, to keep consistent with the reference scheme [35], we set λ_2 = 0.1 and s = 1000 × CS_ratio, where CS_ratio is the sampling rate. The parameter λ in the thresholding processing is set to 3.

TABLE III
THE COMPRESSION PERFORMANCE AT ENCODER

B. The Performance of Proposed Adaptive Residual Encoding

In this subsection, the compression performance of the novel encoder is validated. To show the validity of the skip block based residual BCS sampling scheme, the total number of skip blocks in the CS frames is first measured. The average absolute values of the measurements of CS frames sampled by the conventional BCS [32], [35]–[37] and by the new encoder are then given, respectively; this value affects the quantization process. Compared with the conventional CS encoder, the ratio of the reduced number of quantization bits in the new encoder is also measured, which directly shows the advantage of the new encoder in encoding efficiency. To verify the adaptive sampling scheme, the distribution of the sampling rate is measured at different channel states.

Table III gives the simulation results on residual BCS sampling. The number of blocks required to be transmitted is decreased effectively; especially for the low-motion video sequence Mother-daughter, the number of blocks to be transmitted is decreased by about 45 percent. For Coastguard, the representative of high-motion sequences, there are few skip blocks for two main reasons: a) the inter-frame correlation is poor in high-motion sequences; b) to maintain the low complexity of the encoder, the block at the corresponding position of the key frame is selected as the reference block of the current block, so complex matching calculations, such as motion compensation (MC) and motion estimation (ME), are not used in the encoder. The term average value of measurements in Table III means the average absolute value of the measurements of the non-key frames.

Fig. 7. The sampling rate distribution of CS frames at different channel states.

We can observe that the average absolute value of the measurements is decreased greatly in the proposed CS-HEVU scheme. Correspondingly, the number of quantization bits for the measurements can be reduced effectively. Under zero-order exponential Golomb coding, the term bits decrease in Table III gives the proportion of the reduced number of quantization bits for the non-key frames. Compared with the conventional encoder, the number of quantization bits is reduced by more than 40 percent in the new encoder; for the low-motion sequence Mother-daughter in particular, more than 80 percent of the bits are saved. This significantly reduces the pressure of data transmission under poor channel states.

We can observe from Fig. 7 that our CS-HEVU scheme can adaptively adjust the sampling rate according to the current channel state. When the average SNR is low, the video data must be transmitted over a bad channel; in this case, our scheme reduces the sampling rate, and more CS frames are allocated a low sampling rate to avoid communication interruption or congestion. As the channel SNR increases, the channel quality gradually improves and the channel has sufficient capacity to support a higher rate. Therefore, the number of CS frames with a high sampling rate increases gradually, as illustrated in Fig. 7. Correspondingly, the reconstruction quality can also be improved because more useful information is received at the decoder.
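The behavior shown in Fig. 7 can be illustrated with a simple three-level mapping that reuses the thresholds T^Upper_SNR and T^Lower_SNR and the rate range from Section III-A. The paper's exact per-frame allocation rule is not reproduced here, so the function below is only an illustrative sketch of the adaptation principle.

```python
def adapt_sampling_rate(channel_snr_db,
                        base_rate=0.2, low_rate=0.1, high_rate=0.3,
                        t_upper=9.0130, t_lower=4.0675):
    """Sketch of sampling-rate adaptation for CS frames: lower the
    rate under poor channel conditions, raise it under good ones.
    Thresholds and rate range follow Section III-A; the three-level
    mapping itself is an illustrative simplification."""
    if channel_snr_db < t_lower:
        return low_rate       # poor channel: avoid interruption/congestion
    if channel_snr_db > t_upper:
        return high_rate      # good channel: spend capacity on quality
    return base_rate          # in between: keep the basic sampling rate
```

The key point is that the adaptation happens at the encoder, before sampling, so the terminal never produces more measurements than the channel can carry.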

C. The Reconstruction Quality of CS Frames

To show the performance of the proposed CS frames reconstruction scheme, three CS frame recovery methods are performed under the same conditions: the latest improved multihypothesis-based CS frame recovery (IMH-CS) in [36]; the initial stage of CS-HEVU, i.e., multi-reference frames based cross recovery without local secondary reconstruction; and the complete local secondary reconstruction based multi-reference frames cross recovery method in our CS-HEVU scheme. Note that all of the above CS frame recovery methods use the same key frame reconstruction scheme as the CS-HEVU system. Moreover, to intuitively show the reconstruction performance, the sampling rate is fixed to 0.2 and there is no skip block at the encoder in this simulation.

Fig. 8. The CS frames recovery quality for different methods. (a) Coastguard. (b) Foreman. (c) Mother-daughter. (d) Soccer.

Fig. 8 shows the average PSNR of the CS frames at different positions in the GoP. It shows that our CS-HEVU scheme effectively improves the recovery quality, especially for the CS frames located in the middle of the GoP. Compared with the IMH-CS scheme, the initial stage of CS-HEVU improves the recovery quality by about 0.91 dB on average over all of the test sequences. This improvement mainly benefits from the introduction of the CS reference frames and the expansion of the search window for the centered CS frames. By introducing the local secondary reconstruction scheme, the quality is further improved by about 0.19 dB in the complete CS-HEVU scheme. The above improvements all come from the increase of good hypotheses in the hypothesis set, which improves the SI quality and, eventually, the recovery quality. For the Soccer sequence in particular, the PSNR is improved by about 2.33 dB and 0.38 dB on average, respectively, whereas for the Mother-daughter sequence the PSNR is only improved by about 0.20 dB and 0.04 dB on average, respectively. The reason is as follows. In high-motion sequences, the correlation between the key reference frame and the CS frame decreases greatly as the inter-frame distance increases, so the CS reference frames can provide more high-quality hypotheses than the key reference frames, and the improvement is more obvious. For low-motion sequences, the inter-frame correlation is less affected by the inter-frame distance, the key reference frames play a decisive role in the CS frame recovery, and therefore only a slight improvement is obtained.


Fig. 9. The comparison results of PSNR for different schemes with fixed sampling rate. (a) Coastguard. (b) Foreman. (c) Mother-daughter. (d) Soccer.

Though the hypotheses obtained from the secondarily recovered CS frames show some improvement, little of it is transferred to the SI prediction of the middle CS frames due to the weight allocation processing. Therefore, only a slight improvement is obtained for the middle three CS frames compared with the initial recovery stage. It can also be seen that there is a marked decrease in the recovery quality at position five in Fig. 8(b). The reason is as follows. In the IMH-CS recovery scheme, the recovery quality at position five is also lower than at its symmetrical position, i.e., position three, which means that the CS frames at position five have low correlation with the other frames as a whole. Since the proposed scheme relies closely on the inter-frame correlation, the quality improvement at this position is smaller than at the other positions.

D. The Comparison of Overall Performance

The performance of our CS-HEVU scheme is shown by comparison with state-of-the-art CS-based video codec systems, including the MH-BCS-SPL scheme [32], the MS-wEnet scheme [35], the Up-Se-AWEN-HHP scheme [36], and the KSR-DCVS scheme [37]. The MH-BCS-SPL scheme is a typical system in the CS-based video codec area and is also a basic algorithm in the proposed decoder. The MS-wEnet, Up-Se-AWEN-HHP, and KSR-DCVS schemes represent the state-of-the-art performance in recent research on CS-based video codecs. To ensure the fairness of the comparison, the GoP size, block size, and other parameters are consistent across all comparison schemes. The comparison results of PSNR under a fixed sampling rate and under the adaptive sampling rate are given in Fig. 9 and Fig. 10, respectively. The comparison results under the FSIM assessment criterion are given in Table IV. Fig. 11 to Fig. 13 give recovered video frames to show the visual quality intuitively. Table V gives the comparison results of the computational complexity, and the rate-distortion performance is shown in Fig. 14.

As shown in Fig. 9, compared with the reference schemes, our CS-HEVU system significantly improves the reconstruction quality. The average PSNR of the proposal is 3.7 dB higher than that of the MH-BCS-SPL scheme. Especially for the Coastguard sequence, the average PSNR increases by about 4.2 dB; the smallest average PSNR gain, 3.2 dB, comes from the Soccer sequence. Compared with the Up-Se-AWEN-HHP and MS-wEnet schemes, the average PSNR of the CS-HEVU system is improved by 1.7 dB and 1.9 dB, respectively. The smallest increases in recovery quality both come from the Mother-daughter sequence, where the average PSNR is improved by about 1.0 dB and 1.5 dB, respectively. Against the Up-Se-AWEN-HHP scheme, the largest average PSNR gain, 2.3 dB, comes from the Soccer sequence; against the MS-wEnet scheme, the largest gain, 2.4 dB, comes from the Foreman sequence. Compared with the KSR-DCVS scheme [37], about 1.7 dB of PSNR gain on average is obtained with the new CS-HEVU scheme: more than 1.95 dB for the Foreman sequence, with the smallest gain, 1.3 dB, coming from the Mother-daughter sequence.

Fig. 10. The comparison results of PSNR for different schemes with adaptive sampling rate. (a) Coastguard. (b) Foreman. (c) Mother-daughter. (d) Soccer.

By applying the proposed sampling rate adaptive adjustment technique to the encoders of the test schemes, Fig. 10 shows the comparison results under different channel states. We can observe that the reconstruction quality increases significantly as the channel quality improves. Under a variable sampling rate, compared with the MH-BCS-SPL scheme, the average PSNR is improved by about 3.8 dB in the proposed CS-HEVU scheme; compared with the MS-wEnet, Up-Se-AWEN-HHP, and KSR-DCVS schemes, the average PSNR is improved by about 1.8 dB. The sampling rate adaptive adjustment technique effectively improves the adaptability of the test schemes to different channel conditions.

Different from the objective image evaluation criterion PSNR, FSIM uses computational models to measure the image quality in accordance with subjective evaluations. According to the FSIM comparison results shown in Table IV, the proposed CS-HEVU scheme performs better than the reference schemes. Compared with the MH-BCS-SPL scheme, the FSIM is increased by about 0.026 on average; for the Coastguard sequence in particular, the FSIM of the CS-HEVU scheme is 0.04 higher than that of the MH-BCS-SPL scheme. Compared with the Up-Se-AWEN-HHP and MS-wEnet schemes, the FSIM value is improved by about 0.011 and 0.012, respectively, in the CS-HEVU scheme. The maximum average gains of FSIM both come from the Soccer sequence, at 0.024 and 0.022, respectively, followed by the Coastguard sequence, with average FSIM improvements of about 0.009 and 0.012, respectively. The Mother-daughter sequence contributes the minimum average gains of FSIM, at 0.002 and 0.003, respectively. Compared with the KSR-DCVS scheme, the average value of FSIM is improved by about 0.016 with the proposed CS-HEVU scheme; especially for the high-motion sequences Coastguard and Soccer, the FSIM is improved by about 0.022 and 0.027, respectively. According to the above results, we can conclude that the improvement of the recovery quality in our proposal is larger for high-motion sequences than for low-motion sequences. For high-motion or complex-motion sequences, many image textures in the current frame cannot be found in the key reference frames, which leads to distortion of the local image textures. In the CS-HEVU scheme, the decoder can provide more high-correlation reference blocks by introducing the adjacent non-key reference frames and expanding the search window, so the recovered image is closer to the original image.

Fig. 11. The comparison results of visual quality for the 54th Mother-daughter CS frame (sampling rate: 0.1). (a) Original image. (b) MH-BCS-SPL, PSNR 37.49 dB, FSIM 0.958. (c) Up-Se-AWEN-HHP, PSNR 39.54 dB, FSIM 0.974. (d) MS-wEnet, PSNR 38.93 dB, FSIM 0.971. (e) KSR-DCVS, PSNR 38.45 dB, FSIM 0.965. (f) CS-HEVU, PSNR 40.49 dB, FSIM 0.978.

TABLE IV
THE COMPARISON OF FSIM (BEST PERFORMANCE SHOWN IN BOLD)

To show the visual quality of the recovered video sequences under the different schemes, the recovery results of the 54th Mother-daughter CS frame and the 18th Soccer CS frame are given as examples in Fig. 11 and Fig. 12. We can observe that the proposed scheme presents higher recovery quality than the other schemes. The MH-BCS-SPL scheme has poor picture clarity, and all of the reference schemes perform poorly in the recovery of moving objects, as shown in the red box regions. In our CS-HEVU scheme, the picture textures and clarity are effectively improved, and the proposal preserves the edge details of moving objects better. The recovery results for a 4K-resolution color video sequence, FoodMarket, are also given in Fig. 13, which shows that the proposed CS-HEVU scheme is also suitable for processing high-definition color videos.

Fig. 12. The comparison results of visual quality for the 18th Soccer CS frame (sampling rate: 0.25). (a) Original image. (b) MH-BCS-SPL, PSNR 31.81 dB, FSIM 0.911. (c) Up-Se-AWEN-HHP, PSNR 31.65 dB, FSIM 0.911. (d) MS-wEnet, PSNR 31.87 dB, FSIM 0.913. (e) KSR-DCVS, PSNR 31.78 dB, FSIM 0.914. (f) CS-HEVU, PSNR 37.15 dB, FSIM 0.970.

Fig. 13. The recovery results for a 4K-resolution color video (sampling rate: 0.20). (a) The 8th original frame of FoodMarket. (b) The recovered image in CS-HEVU, PSNR 34.93 dB, FSIM 0.994.

TABLE V
THE AVERAGE TIME CONSUMPTION PER FRAME (UNIT: S)

To compare the computational complexity of the proposed CS-HEVU scheme, the average time consumption for processing each frame in the different schemes is given in Table V. The time consumption of our CS-HEVU scheme is about 1.3 times that of the Up-Se-AWEN-HHP and KSR-DCVS schemes at a sampling rate of 0.1, and about 1.7 times that of the MS-wEnet and MH-BCS-SPL schemes. The increase in time consumption in our CS-HEVU scheme mainly comes from three operations. First, the local secondary reconstruction for non-key frames increases the number of reconstructed frames. Second, more reference frames are applied in the proposed decoder, requiring more computation for hypothesis processing, such as the hypothesis set optimization and weight prediction. Third, at the new encoder, the BCS-SPL recovery of the key frame also increases the computational complexity. Compared with the conventional encoder in the reference schemes [32], [35]–[37], the total computational cost of the proposed encoder is increased by about 5 percent. The overall increase in computational complexity is moderate, as seen in Table V, while the new CS-HEVU scheme achieves a significant improvement of the reconstruction quality at the decoder.

Fig. 14. The rate-distortion performance for different schemes. (a) Coastguard. (b) Foreman. (c) Mother-daughter. (d) Soccer.

To show the efficiency of the proposed CS-HEVU system, the rate-distortion performance is measured under zero-order exponential Golomb coding. As shown in Fig. 14, the horizontal axis represents the number of bits to be transmitted per frame, and the vertical axis is the corresponding PSNR. It can be observed that the proposed CS-HEVU system is much superior to the other state-of-the-art schemes in rate-distortion performance. First, by introducing the skip block mode and residual coding into the encoder, the number of bits to be transmitted by the proposed encoder is reduced significantly. Second, the proposed local secondary reconstruction based multi-reference frames cross recovery scheme effectively improves the reconstruction quality of the decoder.

IV. CONCLUSION

In this paper, a compressed sensing based high-efficiency video upload (CS-HEVU) system is proposed for the terminal-to-cloud upload network. By combining the skip block based BCS residual sampling and the sampling rate adaptive adjustment technologies, a new skip block based adaptive encoder is developed. The inter-frame redundant information is reduced significantly, especially for low-motion sequences. The sampling rate adjustment technology can effectively cope with the time-varying channel state and improve the user experience under poor channel conditions. Simulation results show that the compression and transmission efficiency of the encoder is improved effectively. Furthermore, a new local secondary reconstruction based multi-reference frames cross recovery scheme is proposed to further improve the reconstruction performance of the decoder, and the quality fluctuation of the non-key frames is alleviated effectively. Compared with state-of-the-art compressed sensing systems reported in the literature, both the objective and subjective quality of the reconstructed video frames are improved significantly. The experimental results verify the superiority of the proposed new system in compression efficiency and reconstruction quality over other existing state-of-the-art systems.

High-definition videos have recently become increasingly popular due to the fast development of mobile wireless networks. While the proposed new scheme can be extended to high-resolution video application scenarios, the increased computational complexity caused by the higher video resolution could be a bottleneck in practical systems. To realize high-efficiency wireless transmission of high-resolution video sequences, efficiently combining the characteristics of high-resolution video with the joint design of the encoding scheme and the wireless transmission strategy is a challenging open problem that we will investigate in the future.

REFERENCES

[1] X. Dai, X. Wang, and N. Liu, “Optimal scheduling of data intensive appli-cations in cloud based video distribution services,” IEEE Trans. CircuitsSyst. Video Technol., vol. 27, no. 1, pp. 73–83, Jan. 2017.

[2] X. Li, M. A. Salehi, and M. Bayoumi, “VLSC: Video live streaming usingcloud services,” in Proc. IEEE Int. Conf. Big Data Cloud Comput., 2016,pp. 595–600.

[3] X. Li, M. A. Salehi, M. Bayoumi, N.-F. Tzeng, and R. Buyya, “Cost-efficient and robust on-demand video transcoding using heterogeneous cloud services,” IEEE Trans. Parallel Distrib. Syst., vol. 29, no. 3, pp. 556–571, Mar. 2018.

[4] S. Zhang, Y. Lin, and Q. Liu, “Secure and efficient video surveillance in cloud computing,” in Proc. IEEE Int. Conf. Mobile Ad Hoc Sensor Syst., 2014, pp. 222–226.

[5] J. Lee, J. Son, R. Hussain, and H. Oh, “Media cloud: A secure and efficient virtualization framework for media service,” in Proc. IEEE Int. Conf. Consum. Electron., 2014, pp. 79–80.

[6] J. Guo, B. Song, and X. Du, “Significance evaluation of video data over media cloud based on compressed sensing,” IEEE Trans. Multimedia, vol. 18, no. 7, pp. 1297–1304, Jul. 2016.

[7] G. Gao, W. Zhang, Y. Wen, Z. Wang, and W. Zhu, “Towards cost-efficient video transcoding in media cloud: Insights learned from user viewing patterns,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1286–1296, Aug. 2015.

[8] Y. Jin, Y. Wen, and C. Westphal, “Optimal transcoding and caching for adaptive streaming in media cloud: An analytical approach,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 12, pp. 1914–1925, Dec. 2015.

[9] X. Song, X. Peng, J. Xu, G. Shi, and F. Wu, “Cloud-based distributed image coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 12, pp. 1926–1940, Dec. 2015.

[10] X. Song, X. Peng, J. Xu, G. Shi, and F. Wu, “Distributed compressive sensing for cloud-based wireless image transmission,” IEEE Trans. Multimedia, vol. 19, no. 6, pp. 1351–1364, Jun. 2017.

[11] A. U. Rehman, G. A. Shah, and M. Tahir, “Compressed sensing based adaptive video coding for resource constrained devices,” in Proc. IEEE Conf. Wireless Commun. Mobile Comput., 2016, pp. 170–175.

[12] M. G. Sarwer, Q. J. Wu, and X.-P. Zhang, “Enhanced SATD based cost function for mode selection of H.264/AVC intra coding,” Signal, Image Video Process., vol. 7, no. 4, pp. 777–786, Jul. 2013.

[13] M. G. Sarwer, Q. J. Wu, and X.-P. Zhang, “A novel rate-distortion optimization method of H.264/AVC intra coder,” in Proc. IEEE Int. Conf. Image Process., 2011, pp. 3465–3468.

[14] S. Yu, F. Qiao, L. Luo, and H. Yang, “Increasing compression ratio of low complexity compressive sensing video encoder with application-aware configurable mechanism,” in Proc. IEEE Int. Conf. Commun. Signal Process., 2014, pp. 995–998.

[15] E. J. Candes, “Compressive sampling,” in Proc. Int. Congr. Mathematicians, Madrid, Spain, 2006, vol. 3, pp. 1433–1452.

[16] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.

[17] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.

[18] S. Mun and J. E. Fowler, “DPCM for quantized block-based compressed sensing of images,” in Proc. Eur. Signal Process. Conf., 2012, pp. 1424–1428.

[19] S. Pudlewski and T. Melodia, “On the performance of compressive video streaming for wireless multimedia sensor networks,” in Proc. IEEE Int. Conf. Commun., 2010, pp. 1–5.

[20] C. Li, H. Jiang, P. Wilford, Y. Zhang, and M. Scheutzow, “A new compressive video sensing framework for mobile broadcast,” IEEE Trans. Broadcast., vol. 59, no. 1, pp. 197–205, Mar. 2013.

[21] S. Roohi, M. Noorhosseini, J. Zamani, and H. S. Rad, “Low complexity distributed video coding using compressed sensing,” in Proc. Conf. Mach. Vis. Image Process., 2014, pp. 53–57.

[22] M. S. Hosseini and K. N. Plataniotis, “High-accuracy total variation with application to compressed video sensing,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 3869–3884, Sep. 2014.

[23] J. Chen, K.-X. Su, W.-X. Wang, and C.-D. Lan, “Residual distributed compressive video sensing based on double side information,” Acta Automatica Sinica, vol. 40, no. 10, pp. 2316–2323, Oct. 2014.

[24] R. Li, H. Liu, R. Xue, and Y. Li, “Compressive-sensing-based video codec by autoregressive prediction and adaptive residual recovery,” Int. J. Distrib. Sensor Netw., vol. 11, no. 8, pp. 1–19, Jan. 2015.

[25] Y. Song, G. Yang, H. Xie, D. Zhang, and X. Sun, “Residual domain dictionary learning for compressed sensing video recovery,” Multimedia Tools Appl., vol. 76, no. 7, pp. 10083–10096, Apr. 2017.

[26] L.-W. Kang and C.-S. Lu, “Distributed compressive video sensing,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2009, pp. 1169–1172.

[27] T. T. Do et al., “Distributed compressed video sensing,” in Proc. IEEE Int. Conf. Image Process., 2009, pp. 1393–1396.

[28] Y. Liu, X. Zhu, L. Zhang, and S. H. Cho, “Distributed compressed video sensing in camera sensor networks,” Int. J. Distrib. Sensor Netw., vol. 8, no. 12, pp. 1–10, Dec. 2012.

[29] S. Mun and J. E. Fowler, “Block compressed sensing of images using directional transforms,” in Proc. IEEE Int. Conf. Image Process., 2009, pp. 3021–3024.

[30] S. Mun and J. E. Fowler, “Residual reconstruction for block-based compressed sensing of video,” in Proc. Data Compression Conf., 2011, pp. 183–192.

[31] E. W. Tramel and J. E. Fowler, “Video compressed sensing with multihypothesis,” in Proc. Data Compression Conf., 2011, pp. 193–202.

[32] J. E. Fowler, S. Mun, and E. W. Tramel, “Block-based compressed sensing of images and video,” Found. Trends Signal Process., vol. 4, no. 4, pp. 297–416, Apr. 2012.

[33] J. Zhu, N. Cao, and Y. Meng, “Adaptive multihypothesis prediction algorithm for distributed compressive video sensing,” Int. J. Distrib. Sensor Netw., vol. 9, no. 5, pp. 718–720, May 2013.

[34] M. Azghani, M. Karimi, and F. Marvasti, “Multihypothesis compressed video sensing technique,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 4, pp. 627–635, Apr. 2016.

[35] J. Chen, Y. Chen, D. Qin, and Y. Kuo, “An elastic net-based hybrid hypothesis method for compressed video sensing,” Multimedia Tools Appl., vol. 74, no. 6, pp. 2085–2108, Mar. 2015.

[36] Y. Kuo, K. Wu, and J. Chen, “A scheme for distributed compressed video sensing based on hypothesis set optimization techniques,” Multidimensional Syst. Signal Process., vol. 28, no. 1, pp. 129–148, Jan. 2017.

[37] J. Chen, F. Xue, and Y. Kuo, “Distributed compressed video sensing based on key frame secondary reconstruction,” Multimedia Tools Appl., vol. 74, no. 12, pp. 14873–14889, Jun. 2018.

[38] J. Chen, N. Wang, F. Xue, and Y. Gao, “Distributed compressed video sensing based on the optimization of hypothesis set update technique,” Multimedia Tools Appl., vol. 76, no. 14, pp. 15735–15754, Jul. 2017.

[39] Y. Kuo, S. Wang, D. Qin, and J. Chen, “High-quality decoding method based on resampling and re-reconstruction,” Electron. Lett., vol. 49, no. 16, pp. 991–992, Aug. 2013.

[40] N. Fan, X. Zhu, Y. Liu, and L. Zhang, “Scalable distributed video coding using compressed sensing in wavelet domain,” in Proc. IEEE Veh. Technol. Conf., 2013, pp. 1–5.

[41] Z. Liu, A. Y. Elezzabi, and H. V. Zhao, “Maximum frame rate video acquisition using adaptive compressed sensing,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 11, pp. 1704–1718, Nov. 2011.


[42] J. Zhu, Y. Zhang, G. Han, and C. Zhu, “Block-based adaptive compressed sensing with feedback for DCVS,” in Proc. IEEE Int. Conf. Commun. Netw. China, 2014, pp. 625–630.

[43] X. Zhang, A. Wang, B. Zeng, and L. Liu, “Adaptive distributed compressed video sensing,” J. Inf. Hiding Multimedia Signal Process., vol. 5, no. 1, pp. 98–106, Jan. 2014.

[44] J. Chen, Y. Gao, C. Ma, and Y. Kuo, “Compressive sensing image reconstruction based on multiple regulation constraints,” Circuits, Syst. Signal Process., vol. 36, no. 4, pp. 1621–1638, Apr. 2016.

[45] S. Pudlewski, A. Prasanna, and T. Melodia, “Compressed-sensing-enabled video streaming for wireless multimedia sensor networks,” IEEE Trans. Mobile Comput., vol. 11, no. 6, pp. 1060–1072, Jun. 2012.

[46] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011.

[47] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Statist. Soc., vol. 67, no. 2, pp. 301–320, Mar. 2005.

Shuai Zheng was born in Henan, China. He received the bachelor’s degree in electronic information engineering from Henan Technology University, Zhengzhou, China, in 2015. He is currently working toward the Ph.D. degree at Xidian University, Xi’an, China. Since October 2018, he has been with Ryerson University, Toronto, ON, Canada, as a joint Ph.D. student sponsored by the China Scholarship Council. His current research interests include compressed sensing, distributed video coding, image and video processing, and multimedia communications.

Xiao-Ping Zhang (M’97–SM’02) received the B.S. and Ph.D. degrees from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, both in electronic engineering, and the MBA (with Honors) degree in finance, economics, and entrepreneurship from the University of Chicago Booth School of Business, Chicago, IL, USA.

Since Fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, where he is currently a Professor and the Director of the Communication and Signal Processing Applications Laboratory. He was the Program Director of Graduate Studies. He is cross-appointed to the Finance Department at the Ted Rogers School of Management at Ryerson University. He is a co-founder and the CEO of EidoSearch, an Ontario-based company offering a content-based search and analysis engine for financial data. His research interests include image and multimedia content analysis, statistical signal processing, sensor networks and electronic systems, machine learning, and applications in bioinformatics, finance, and marketing. He is a frequent consultant for biotech companies and investment firms.

Dr. Zhang is a registered Professional Engineer in Ontario, Canada, and a member of the Beta Gamma Sigma Honor Society. He is the General Co-Chair for ICASSP 2021 and was the General Co-Chair for the 2017 GlobalSIP Symposium on Signal and Information Processing for Finance and Business. He is an elected member of the ICME Steering Committee. He was the General Chair for MMSP’15, the Publicity Chair for ICME’06, and the Program Chair for ICIC’05 and ICIC’10. He was a Guest Editor for Multimedia Tools and Applications and the International Journal of Semantic Computing, and a tutorial speaker at ACMMM 2011, ISCAS 2013, ICIP 2013, ICASSP 2014, and IJCNN 2017. He is currently a Senior Area Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He is/was an Associate Editor for the IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON MULTIMEDIA, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE TRANSACTIONS ON SIGNAL PROCESSING, and IEEE SIGNAL PROCESSING LETTERS.

Jian Chen (M’14) received the B.S. degree from Xi’an Jiaotong University, Xi’an, China, in 1989, the M.S. degree from the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China, in 1992, and the Ph.D. degree in telecommunications engineering from Xidian University, Xi’an, China, in 2005. He is currently a Professor with the School of Telecommunications Engineering, Xidian University. He was a visiting scholar with the University of Manchester from 2007 to 2008. His research interests include cognitive radio, non-orthogonal multiple access, multimedia communications, and wireless sensor networks.

Yonghong Kuo received the B.S. and M.S. degrees from Xi’an Jiaotong University, Xi’an, China, in 1989 and 1992, respectively, and the Ph.D. degree from Peking Union Medical University, Beijing, China, in 1998. She is currently a Professor with the School of Telecommunications Engineering, Xidian University, Xi’an, China. Her research interests include signal processing, cognitive radio, and wireless sensor networks.