
Open access to the Proceedings of the 2018 USENIX Annual Technical Conference

is sponsored by USENIX.

VideoChef: Efficient Approximation for Streaming Video Processing Pipelines

Ran Xu, Jinkyu Koo, Rakesh Kumar, and Peter Bai, Purdue University; Subrata Mitra, Adobe Research; Sasa Misailovic, University of Illinois Urbana-Champaign;

Saurabh Bagchi, Purdue University

https://www.usenix.org/conference/atc18/presentation/xu-ran

This paper is included in the Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC ’18).

July 11–13, 2018 • Boston, MA, USA

ISBN 978-1-939133-02-1


VIDEOCHEF: Efficient Approximation for Streaming Video Processing Pipelines

Ran Xuα, Jinkyu Kooα, Rakesh Kumarα, Peter Baiα, Subrata Mitraβ
Sasa Misailovicγ, Saurabh Bagchiα

α: Purdue University, β: Adobe Research, γ: University of Illinois

Abstract

Many video streaming applications require low-latency processing on resource-constrained devices. To meet the latency and resource constraints, developers must often approximate filter computations. A key challenge to successfully tuning approximations is finding the optimal configuration, which may change across and within the input videos because it is content-dependent. Searching through the entire search space for every frame in the video stream is infeasible, while tuning the pipeline off-line, on a set of training videos, yields suboptimal results.

We present VIDEOCHEF, a system for approximate optimization of video pipelines. VIDEOCHEF finds the optimal configurations of approximate filters at runtime by leveraging the previously proposed concept of canary inputs, i.e., using small inputs to tune the accuracy of the computations and then transferring the approximate configurations to the full inputs. VIDEOCHEF is the first system to show that canary inputs can be used for complex streaming applications. The two key innovations of VIDEOCHEF are (1) an accurate error mapping from the approximate processing with downsampled inputs to that with full inputs and (2) a directed search that balances the cost of each search step with the estimated reduction in the run time.

We evaluate our approach on 106 videos obtained from YouTube, on a set of 9 video processing pipelines with a total of 10 distinct filters. Our results show significant performance improvement over the baseline and over a previous approach that uses canary inputs. We also perform a user study that shows that the videos produced by VIDEOCHEF are often acceptable to human subjects.

1 Introduction

Video processing has enabled many emerging applications such as augmented reality, virtual reality, and motion tracking. These applications implement complex video pipelines for video editing, scene understanding, object recognition, and object classification [14, 49]. They often consume significant computational resources, but also require short response times and low energy consumption. Often, the applications need to run on local machines instead of the cloud, due to latency [14], bandwidth [50], or privacy constraints [46].

To enable low-latency and low-energy video processing, we leverage the fact that most stages in the video pipeline are inherently approximate, because human perception is tolerant to modest differences in images and many end goals of video processing require only estimates (e.g., detecting object movement or counting the number of objects in a scene [5]). Many domain-specific algorithms have exposed algorithmic knobs that can, e.g., subsample the input images or replace expensive computations with lower-accuracy but faster alternatives [45, 41, 13, 26]. To complement domain-specific approximations, researchers have proposed various generic system-level techniques that expose additional knobs for optimizing the performance and energy of applications while trading off accuracy of the results. The techniques span compilers [39, 30, 42, 7, 36, 27], systems [3, 15, 18, 17, 31], and architectures [32, 29, 37, 36, 8].

Content-dependent Approximation. A fundamental challenge in uncovering the full power of both generic and domain-specific approximations is finding the configurations of these approximations that provide maximum savings while producing acceptable results for each given input. This challenge has two main parts.

First, the optimal approximation setting depends on the content of the video, not just on the algorithms being used in the processing pipeline. Often individual videos, or parts of the same video, should have different approximation settings, requiring the program to make the decisions at runtime. Second, the optimization needs to explore a large number of approximate configurations before selecting the optimal one for the given input, requiring the optimizer to construct off-line models. Systems like Green [3] and Paraprox [36] dynamically adapt


the computation using runtime checks of the intermediate results, while Capri [43] selects the approximation level from the input features at the program start. However, these systems rely on extensive off-line training to map the approximation levels to accuracy and performance.

To relax the dependency on off-line training, Laurenzano et al. [25] propose Input Responsive Approximation (IRA) for runtime recalibration with no offline training. IRA creates canary inputs, smaller representations of the full inputs (obtained via subsampling), and then reruns the computation on the canary input with different approximation settings, until it finds the most efficient setting that maintains the accuracy requirement (on the canary). While the concept is promising, the application of IRA to video processing pipelines is limited:

• IRA has been applied to individual computational kernels (in contrast to full pipelines). It is unclear how to capture the interactions between the stages of the pipeline, how often to calibrate, and what the optimal canary sizes are.

• IRA applies the approximation settings derived from the canary input to the full input, assuming that the errors for the canary and full inputs will be identical. However, this assumption is often incorrect (98% of cases, Figure 2) and leads to missed speedup opportunities.

• IRA's greedy search may introduce additional overheads and may not find good approximation settings efficiently because it has no notion of which points in the stream are appropriate for a search.

Our Solution: VIDEOCHEF. We present VIDEOCHEF, a fast and efficient processing pipeline for streaming videos. VIDEOCHEF can optimize the performance subject to accuracy constraints for the system-level and domain-specific approximations of all kernels in the video processing pipeline. Figure 1 presents VIDEOCHEF's end-to-end workflow:

• Like IRA, VIDEOCHEF uses small-sized canary inputs to guide the on-line search for approximation settings. However, unlike IRA, VIDEOCHEF is tailored for optimization of whole video processing pipelines, not just individual kernels.

• In contrast to IRA, VIDEOCHEF presents a finely tunable prediction model for mapping the error from the canary input to that with the original input. This prediction model is trained offline and hence does not add any runtime overhead. At the same time, it is much more lightweight than the full off-line training employed by other approaches.

• At runtime, VIDEOCHEF performs an efficient search through the space of approximation settings and ensures that the cost of the search does not overwhelm the benefit of approximating the computation.

Figure 1: End-to-end flow of approximate video processing with VIDEOCHEF. The video processing pipeline comprises multiple filters, which can be approximated to save computation at the expense of tolerable video quality. The offline and the online components of VIDEOCHEF work together to determine the best approximation setting for each approximable filter block.

We evaluate VIDEOCHEF with three error models and two search strategies, applied to a corpus of 106 YouTube videos from 8 content categories, which span the range of video features (e.g., color and motion). We analyze 10 filters arranged in 9 pipelines of size 3. We find that VIDEOCHEF reaches within 20% of the theoretical best performance, outperforms IRA by 14.6% averaged across all videos, and saves on average 39.1% over exact computation under a relatively strict quality requirement. Under a looser quality requirement, VIDEOCHEF reaches within 26.6% of the theoretical best and achieves even higher performance gains: 53.4% and 61.5% over IRA and exact computation, respectively.

While we have framed this discussion in terms of video processing, the novel contributions outlined below apply to other low-latency streaming applications with the fundamental requirement that the characteristics change to some extent from one segment of the stream to another, for instance, online video gaming, augmented reality, and virtual reality applications.

Contributions. We make the following contributions:

1. We present VIDEOCHEF, a system for performance and accuracy optimization of video streaming pipelines. It consists of off-line and on-line components that together adapt the application's approximation level to the desired output quality.

2. We build a predictive model to accurately estimate the quality degradation in the full output from the error generated when using the canary input. This enables more aggressive approximation settings for any approximation algorithm that has tunable knobs.

3. We propose an efficient and incremental search technique for the optimal approximation setting that


takes hints from the video encoding parameters to reduce the overhead of the search process.

4. We demonstrate the benefits of VIDEOCHEF through (1) quantitative evaluation on various real-world video contents and filters and (2) a user study.

2 Background and Motivation

Error Metric: At a high level, a video is composed of a sequence of image frames. To quantify the error in the output (the processed video) due to approximation, we measure the Peak Signal-to-Noise Ratio (PSNR) of the output video. PSNR is the average of the PSNRs of the individual frames in the output video. Suppose that a video consists of K frames where each frame has M×N pixels. Let Y_k(i, j) be the value of the pixel at position (i, j) on the k-th frame of the processed video without any use of approximation, and Z_k(i, j) be the value of the pixel when approximation was applied. Then, the PSNR of the approximate output is computed as follows:

    PSNR = \frac{1}{K} \sum_{k=0}^{K-1} 20 \log_{10} \frac{MaxValue}{\sqrt{MSE(Z_k, Y_k)}}    (1)

where MaxValue is the maximum possible pixel value in the frame and MSE(Z_k, Y_k) is the mean square error between Z_k and Y_k introduced by approximation, i.e., \frac{1}{M \times N} \sum_i \sum_j (Z_k(i, j) - Y_k(i, j))^2. Thus, the lower the PSNR, the higher the error in the output video.
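As a concrete illustration of Eq. (1), the sketch below computes the PSNR of an approximate video against its exact counterpart for 8-bit grayscale frames. The function name, the flat row-major frame layout, and MaxValue = 255 are assumptions made for illustration; they are not part of VIDEOCHEF's implementation.

    #include <stdint.h>
    #include <math.h>

    /* PSNR of an approximate video (Eq. 1): average over the frames of
     * 20*log10(MaxValue / sqrt(MSE)), with MaxValue = 255 for 8-bit pixels. */
    static double video_psnr(const uint8_t *exact[], const uint8_t *approx[],
                             int num_frames, int width, int height) {
        double psnr_sum = 0.0;
        int counted = 0;
        for (int k = 0; k < num_frames; k++) {
            double mse = 0.0;
            for (int p = 0; p < width * height; p++) {
                double d = (double)approx[k][p] - (double)exact[k][p];
                mse += d * d;
            }
            mse /= (double)(width * height);
            if (mse == 0.0)        /* PSNR is undefined when a frame is unchanged */
                continue;
            psnr_sum += 20.0 * log10(255.0 / sqrt(mse));
            counted++;
        }
        return counted ? psnr_sum / counted : INFINITY;
    }

Frames with zero MSE are skipped because PSNR is undefined for them, which mirrors the observation in Section 5.1 that PSNR is not defined when no error is introduced.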

Isn't the problem solved by IRA and Capri? IRA (Input Responsive Approximation) [25] and Capri [43] attempted to address the problem of selecting the optimal approximation level for individual inputs.

IRA [25] relies solely on canary inputs to search for the best approximation settings. Thus, it implicitly assumes that the magnitude of error for a particular approximation setting on the canary inputs is identical to the error with the same approximation settings on the full-sized inputs. However, Figure 2 shows our experiment with 424 real images and 216 different approximation settings. We found that for the same approximation settings, the PSNR of the full-sized outputs can be significantly different from the PSNR of the canary outputs. Most of the points (about 98%) are above the diagonal, indicating that the error on the full input is lower than that on the canary input for the same approximation level.

We attribute the difference to the higher variation between neighboring pixel values in canary inputs. Therefore, for the same approximation settings, the approximate processing on canary inputs gives lower PSNR. We found that on average, the PSNR of a full-sized output is 5.36 dB higher than the PSNR of the canary output. Therefore, IRA misses opportunities for more aggressive optimization that would still fit within the user-specified quality threshold.

Figure 2: The PSNR of the full-sized output versus the PSNR of the canary output, for the I-frames of 106 videos on one of our applications, the Boxblur-Vignette-Dilation video filter pipeline. The PSNR of the full output is higher for over 98% of the approximation settings, and 45.1% of the approximation settings lie in a zone that the IRA approach ignores but that actually satisfies the quality requirement.

Capri [43] rigorously addresses the problem of selecting the best approximation settings to minimize the computational cost while meeting the error bound. But Capri also fails in the video processing setting because it does not recalibrate itself with the stream and thus cannot change its approximation settings when the characteristics of the stream change. Further, it (1) relies on prior enumeration of all possible inputs, which is impossible in this target domain, and (2) performs the selection of approximation settings completely offline, which reduces the cost of the optimization but makes it non-responsive to changes in the input data.

3 Solution Overview

Figure 1 shows the end-to-end workflow of streaming video processing with approximation.

Approximation. Under normal processing, a video decoder converts the video into its constituent frames. Then a sequence of "filters" (synonymously, processing steps or pipeline stages) is applied to each frame. Examples include a blurring filter (e.g., at TSA airport checkpoint scanners) and an edge detection filter (e.g., for counting people in a scene). Finally, the transformed frames are optionally put together by a video encoder. To make such processing fast and resource efficient, VIDEOCHEF intelligently uses selective approximation (Section 3.1) during the computation of the filters. The user sets the constraint on the output video quality. An example specification is that the PSNR of the output video should be above 30 dB.

Accuracy Calibration with Canary Inputs. For each representative frame (called a "key frame" here), VIDEOCHEF determines a canary input (Section 3.2), which summarizes the full frame such that the dissimilarity between the full and the canary frame remains below a threshold. With the canary input, VIDEOCHEF occasionally recalibrates the approximation levels of the filters. It determines when to call the search algorithm


using domain-specific knowledge about the frames and scenes (Section 3.3). For this, we extract hints from the video decoder that let VIDEOCHEF determine the key frames. This amortizes the cost of the search across many frames of the video, with 80-120 frames being a typical range for MPEG-4 videos. In the absence of such a video decoder, we have a variant that triggers the search upon scene change detection.

Online Search for Optimal Tradeoffs. VIDEOCHEF searches for the approximation setting of each filter that gives the lowest execution time subject to a threshold on the output quality (Section 3.4). Since the search for approximation is done with the canary input, the error of the approximate computation is different from the error of the computation on the full input. VIDEOCHEF introduces a method to accurately map between these two errors (Section 3.5). In performing this estimation, we consider multiple variants of VIDEOCHEF, depending on what features are available to the predictor, such as a categorization of the video frame according to its image properties. Through this procedure, we aim to maximally leverage the approximation potential in the application and give flexible approximation choices.

3.1 Approximation techniques

The computations involved in filtering operations can be approximated by VIDEOCHEF in various ways, as long as each of the underlying approximation techniques exposes knobs that can be tuned to control the approximation levels (ALs). For example, in the three popular program transformation-based approximation techniques below, the variable approx_level is a tuning knob that controls the level of approximation. A higher value implies more aggressive approximation, leading most often to higher speedup but also higher error. These transformations are performed automatically by a compiler (LLVM in our case).

Loop perforation: In loop perforation [42, 30], the computation is reduced by skipping some iterations, as shown below.

    /* Execute only every approx_level-th iteration. */
    for (i = 0; i < n; i = i + approx_level)
        result = compute_result();

Loop truncation: In loop truncation [42, 30], the last few iterations of the computation are dropped, as shown in the following example:

    /* Drop the last approx_level iterations. */
    for (i = 0; i < (n - approx_level); i++)
        result = compute_result();

Loop memoization: In this technique [7, 36], we compute and cache the result for some iterations of a loop; for the other iterations we reuse the previously cached result.

    /* Recompute once every approx_level iterations; otherwise reuse the cache. */
    for (i = 0; i < n; i++)
        if (i % approx_level == 0)
            cached_result = result = compute_result();
        else
            result = cached_result;

3.2 Canary Inputs

To reduce the search overhead for finding the best approximation level within each frame of the video, we generate canary inputs for the frame following the work in [25]. A good canary input should meet two requirements: (1) it should be close enough to the original input so that the AL found with the canary is the same as the AL that would be computed from the original; (2) it should be small enough that the search process using the canary input is efficient. We first define the dissimilarity metrics used to compare the canary sample video and the full-sized video and then show how to choose the appropriate canary input.

Metrics of Dissimilarity. We define two metrics of dissimilarity. Let a full-sized video have K frames and M×N pixels in each frame, where each pixel has the property X_k(i, j) in frame k. A canary video has K frames with m×n pixels each, and the same property Y_k(i, j). The property could be one component in the YUV colorspace of an image, where the Y component determines the brightness of the color (known as "luminance") while the U and V components determine the color itself (known as "chroma"), and each ranges from 0 to 255. The "dissimilarity metric for mean" (SMM) is defined as follows (following [25]):

    m_{Full} = \frac{1}{M \times N \times K} \sum_{k=0}^{K-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} X_k(i, j)    (2)

    m_{Small} = \frac{1}{m \times n \times K} \sum_{k=0}^{K-1} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} Y_k(i, j)    (3)

    SMM = \frac{|m_{Small} - m_{Full}|}{m_{Full}}    (4)

When a pixel has a vector of values, such as in the YUV colorspace which has 3 values for the 3 components, the SMM metric is combined across the different elements of the vector. The combination could be a simple average or a weighted average; we use the latter due to the higher weight of the Y-channel in the YUV colorspace. Similarly, we define the "dissimilarity metric of standard deviation" (SMSD) to capture the dissimilarity in the standard deviation between the full input and the canary input.

Generating Candidate Canary Videos. Given a frame of the video from which to generate the canary video, we resize the frame to a fraction 1/N of its original size to create the canary video. Typical sizes that we find useful in our target domain are 1/16, 1/32, 1/64, 1/128, and 1/256 of the original size. Since the frame is a 2-D matrix of pixels, to resize it to 1/N of its original size, we shrink the width and height each to 1/√N of the full size by sub-sampling 1 pixel out of every √N pixels.

Reducing an input's size causes an at least proportional reduction in the amount of work inside the filter. Many filters that are finding increasing use are super-linear, where the benefit of using a small canary frame is even more significant. Two popular examples are determining


optical flow to measure motion [4] and morphological filters [10], where the value of each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbors. We compute the similarity between the full-sized video frame and the canary video frame according to the metrics SMM and SMSD. We set the maximum dissimilarity we can tolerate as a threshold parameter; we find 10% to be a practically useful threshold for both SMM and SMSD. Among all the qualified canary inputs, we select the smallest one as our final choice.
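The sketch below illustrates canary generation by subsampling and the per-channel SMM of Eq. (4) for a single frame. The function names, the 8-bit row-major frame layout, and the handling of a single channel are assumptions for illustration only.

    #include <stdint.h>
    #include <stdlib.h>
    #include <math.h>

    /* Subsample a w x h 8-bit channel by keeping 1 pixel out of every `step`
     * pixels in each dimension (step = sqrt(N) for a 1/N-sized canary). */
    static uint8_t *make_canary(const uint8_t *frame, int w, int h,
                                int step, int *cw, int *ch) {
        *cw = w / step;
        *ch = h / step;
        uint8_t *canary = malloc((size_t)(*cw) * (size_t)(*ch));
        for (int y = 0; y < *ch; y++)
            for (int x = 0; x < *cw; x++)
                canary[y * (*cw) + x] = frame[(y * step) * w + (x * step)];
        return canary;
    }

    /* SMM for one channel of one frame: |mean(canary) - mean(full)| / mean(full). */
    static double smm(const uint8_t *full, int w, int h,
                      const uint8_t *canary, int cw, int ch) {
        double m_full = 0.0, m_small = 0.0;
        for (int i = 0; i < w * h; i++)   m_full  += full[i];
        for (int i = 0; i < cw * ch; i++) m_small += canary[i];
        m_full  /= (double)(w * h);
        m_small /= (double)(cw * ch);
        return fabs(m_small - m_full) / m_full;
    }

A caller would try the candidate sizes (1/16 down to 1/256), keep those whose SMM and SMSD stay under the 10% threshold, and pick the smallest qualifying canary.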

3.3 Identifying Key Video Frames

Searching for the best AL for each approximable program block is computationally expensive. Conceptually, we want to repeat the search when the characteristics of the video change significantly enough that the optimal approximation setting is expected to be different. In practical terms, we want to perform such change-point detection without having to parse the content of the video. Video encoders already provide hints when the content of the scene has changed significantly.

We observe that videos have temporal locality, and many frames in the same group will have the same approximation setting. Therefore, we can perform a single search per group of frames. We leverage domain-specific knowledge about videos to automatically select the group boundaries in two ways:

Scene Change Detector. Our first approach is to recalibrate the approximation at the beginning of different scenes. This approach is general and works for any video format. There are mainly two classes of scene change detectors, namely pixel-based and histogram-based. The pixel-based methods are highly sensitive to camera and object motion. Histogram-based methods are good at detecting abrupt scene changes. To keep our overhead low (since the detection algorithm runs on every frame), we limit ourselves to detecting only abrupt scene changes and use canary frames for this detection.

We implement a histogram-based scene change detector using only the Y-channel of the canary frames [20]. We experimentally found that the Y-channel information was sufficient to detect abrupt scene changes, and we were more concerned about the overhead of the scene change detector than its accuracy. The algorithm detects a scene change whenever the sum of the absolute differences across all the bins of the histograms of two consecutive frames is greater than a predefined threshold (20% of the total pixels in our evaluation). Our experiments show that the optimal configuration (found through offline brute-force search) changes dramatically at scene change boundaries but stays relatively stable within a group of frames.
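A minimal sketch of such a detector follows: it builds a 256-bin histogram of the Y-channel of each canary frame and flags a scene change when the sum of absolute bin differences between consecutive frames exceeds a fraction of the total pixel count (20% in our evaluation). The function name and types are illustrative assumptions.

    #include <stdint.h>
    #include <stdlib.h>

    #define BINS 256

    /* Returns 1 if the Y-channel histograms of two consecutive canary frames
     * differ by more than threshold_frac of the total pixels (abrupt change). */
    static int is_scene_change(const uint8_t *prev_y, const uint8_t *cur_y,
                               int num_pixels, double threshold_frac) {
        long hist_prev[BINS] = {0}, hist_cur[BINS] = {0};
        for (int i = 0; i < num_pixels; i++) {
            hist_prev[prev_y[i]]++;
            hist_cur[cur_y[i]]++;
        }
        long diff = 0;
        for (int b = 0; b < BINS; b++)
            diff += labs(hist_cur[b] - hist_prev[b]);
        return diff > (long)(threshold_frac * num_pixels);
    }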

I-frame Selection for MPEG videos. The second solution takes advantage of I-frames, present in the popular H.264 encoder (which MPEG-4 and many other video formats follow). It defines three main types of frames: I-, P-, and B-frames [21]. An I-frame uses intra-prediction, meaning the predicted pixels within this frame are formed using only samples from the same frame. The P- and B-frames use inter-prediction, meaning a predicted pixel within such a frame depends on samples from the same frame as well as samples from other frames around it (the distinction between P- and B-frames is not relevant for our discussion).

When to insert an I-frame (also called a "reference frame") depends on the exact coding scheme being used, but in all such coding schemes that we are aware of, a big difference in the frame triggers the insertion of a new I-frame, since inter-coding would give almost as long a code as intra-coding. Further, because an I-frame does not have dependencies on other frames, it is easier to reconstruct and to perform the (exact or approximate) computation on. We see empirically that for the wide range of videos used in our evaluation, the average spacing between adjacent I-frames is 137 frames. Although specific to only some video formats, this results in a low sampling rate and consequently a low search overhead, while triggering the search at a suitable granularity.

3.4 Search with Canary Inputs

An approximable program block exposes one or more approximation knobs. The approx_level variable mentioned with the loop-based approximation techniques in Section 3.1 is an example of such a knob. In our notation, we use AL 1 to denote the exact computation. The higher the AL, the less accurate the computation and the higher the speedup. For a pipeline of cascaded filters, each having one or more approximation knobs, we have a vector of approximation settings per frame. We define a setting of the processing pipeline as the combination of ALs for each of the approximable program blocks in the video processing pipeline. For example, with an n-stage processing pipeline where each stage is approximable and has exactly one approximation knob, the setting is A = {a_1, a_2, ..., a_n}, where a_i denotes the AL of knob i.

To find the best approximation setting, we follow a search algorithm outlined as follows:

1. Start searching at a particular setting, typically (1, 1, ..., 1), corresponding to no approximation.

2. Select a group of candidate settings using the Candidate Selection Algorithm. The selection algorithm simply selects the next set of settings to try out in the search process. The greedy algorithm works as follows: given the current setting A^(0), we assume that we can reach the best AL by looking one AL further in each approximable block. So the candidates are {A^(i)}, with i = 1, ..., n, for n approximable blocks, where

    \vec{A}^{(j)} = \{a_1^{(0)}, \ldots, a_{j-1}^{(0)}, a_j^{(0)} + 1, a_{j+1}^{(0)}, \ldots, a_n^{(0)}\}

3. Decide whether to continue the search, i.e., whether it is worthwhile to try any of these candidate settings, using the Approximation Payoff Estimation Algorithm (Section 3.4.1). If not, return the current setting. This algorithm estimates whether the saving due to the more aggressive approximation can compensate for the time of the additional search step.

4. Try each worthwhile candidate setting from the set computed in the previous step. Use the ALs in the candidate settings to run the approximate computation on the canary video and compute the error metric for each candidate setting. Then map the error metric to that of the full-sized output.

5. Check for exit or iterate: if the error metrics of all the full video outputs exceed the error bound, return the current setting. Otherwise, the candidate setting that gives the lowest error becomes the next setting; go to (2) and iterate. A sketch of this loop in code follows.
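The sketch below captures the shape of this greedy loop for a pipeline with n approximation knobs. The helpers run_on_canary, map_canary_to_full_error, and payoff_is_positive stand in for the canary execution, the error mapping of Section 3.5, and the payoff test of Section 3.4.1; their names and signatures are ours, not part of VIDEOCHEF's API.

    #define MAX_KNOBS 16   /* assumed upper bound on approximable blocks */

    /* Placeholders for components described in Sections 3.4.1 and 3.5. */
    int    payoff_is_positive(int n, const int setting[], double cd);
    double run_on_canary(int n, const int setting[]);
    double map_canary_to_full_error(double canary_err, int n, const int setting[]);

    /* Greedy search over approximation levels (ALs), driven by the canary input.
     * `best` holds one AL per approximable block; `cd` is the canary downsampling. */
    static void search_setting(int n, int best[], double error_bound, double cd) {
        for (int i = 0; i < n; i++) best[i] = 1;           /* step 1: exact computation */
        for (;;) {
            if (!payoff_is_positive(n, best, cd))           /* step 3: worth another step? */
                return;
            int next[MAX_KNOBS];
            double next_err = 0.0;
            int found = 0;
            for (int j = 0; j < n; j++) {                   /* step 2: one-step candidates */
                int cand[MAX_KNOBS];
                for (int i = 0; i < n; i++) cand[i] = best[i];
                cand[j] += 1;
                double canary_err = run_on_canary(n, cand);                 /* step 4 */
                double full_err = map_canary_to_full_error(canary_err, n, cand);
                if (full_err <= error_bound && (!found || full_err < next_err)) {
                    next_err = full_err;
                    for (int i = 0; i < n; i++) next[i] = cand[i];
                    found = 1;
                }
            }
            if (!found) return;                             /* step 5: all candidates violate bound */
            for (int i = 0; i < n; i++) best[i] = next[i];  /* step 5: adopt lowest-error candidate */
        }
    }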

3.4.1 Approximation Payoff Estimation Algorithm

The goal of this algorithm is to estimate the benefit of executing the application with the newly found AL versus the cost of searching for the new AL, all for the key frame in question. Let the current setting be represented by A^(0) = {a_1^(0), a_2^(0), ..., a_n^(0)}. Recollect that this is the set of ALs for each of the approximable blocks in the application. Let the execution time of the application at setting A^(i) and with canary downsampling C_d be given by g(A^(i), C_d), where C_d = 1 denotes execution with the full input.

This algorithm works in a breadth-first fashion and attempts to prune some of the paths where exploring higher degrees of approximation for a particular knob cannot speed up the execution further and may lead to a slowdown due to associated overheads. From the current setting A^(0), let the next possible settings to explore be A^(1), A^(2), ..., A^(N). For example, with greedy search and n approximable blocks, there will be n possible next settings. The maximum possible benefit from exploring all the candidate next settings is calculated as:

    B = \max_{i=1}^{N} \left[ g(\vec{A}^{(0)}, 1) - g(\vec{A}^{(i)}, 1) \right]    (5)

This benefit B is simply the maximum reduction in execution time across all the possible candidate settings when run with the full input. However, to realize this gain, we have to pay the cost of searching, which can be expressed in terms of the overhead as follows:

    O = \sum_{i=1}^{N} g(\vec{A}^{(i)}, C_d)    (6)

This overhead O is simply the cost of executing the application with the next-step ALs, but with the canary input (and hence the downsampling ratio C_d). The decision for VIDEOCHEF becomes simple: if B > O, then continue the search; else stop and return the current setting.
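A sketch of this decision is below. The execution-time estimates come from the offline-trained cost model mentioned in Section 4; we represent that model as a lookup function predicted_time(), whose name and signature are assumptions for illustration.

    /* Assumed lookup into the offline cost model: predicted execution time of the
     * pipeline at `setting`, with canary downsampling cd (cd = 1.0 means full input). */
    double predicted_time(int n, const int setting[], double cd);

    /* Returns 1 if the best-case saving from the one-step candidates (Eq. 5)
     * exceeds the cost of evaluating all of them on the canary input (Eq. 6). */
    int payoff_is_positive(int n, const int current[], double cd) {
        double t_current = predicted_time(n, current, 1.0);
        double benefit = 0.0, overhead = 0.0;
        for (int j = 0; j < n; j++) {
            int cand[16];                              /* assumes at most 16 knobs */
            for (int i = 0; i < n; i++) cand[i] = current[i];
            cand[j] += 1;
            double saving = t_current - predicted_time(n, cand, 1.0);
            if (saving > benefit) benefit = saving;    /* B = max_i [g(A(0),1) - g(A(i),1)] */
            overhead += predicted_time(n, cand, cd);   /* O = sum_i g(A(i), Cd)            */
        }
        return benefit > overhead;
    }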

3.5 Error mapping model

We need an error mapping model to characterize the relation between the error in the canary output and the error in the full output, for the same approximation levels. This is important because we have seen empirically (Figure 2) that the canary errors are higher than the full-frame errors for most points. We propose three different mapping models, to be used according to the amount of knowledge available to the model.

3.5.1 Model-C

Suppose we know the error metric of a canary output, C. The error metric of a full-sized output, F, is estimated by a quadratic regression model as follows:

    F = w_0 + w_1 \times C + w_2 \times C^2    (7)

Offline, we calculate the ground truth of the pairs (C, F) for every possible AL setting A and for all the videos in the training set. In practice, we find that sub-sampling the space of possible ALs still provides accurate enough training, with a sub-sampling rate of 10% being adequate. Let us say that the error bound specified by the user is EB. Then clearly we want F < EB. However, due to the possible inaccuracy of the error mapping model, we want to explore a larger space so that we do not miss opportunities for approximation. Therefore, while training Model-C, we explore all the points where F ≤ EB + ∆, where ∆ is a user-configurable parameter for how far outside the tolerable region we want to explore in the model. We then solve for the unknown coefficients w_0, w_1, and w_2 in the model. We find empirically that for a large set of videos, this model reaches its limits with the quadratic regression function.
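As a sketch of what this offline fitting step could look like, the code below estimates w_0, w_1, w_2 of Eq. (7) by ordinary least squares over the (C, F) ground-truth pairs, solving the 3x3 normal equations. This is one straightforward way to perform the regression, under our own assumptions; it is not necessarily the exact procedure used in VIDEOCHEF.

    #include <math.h>

    /* Fit F ~= w0 + w1*C + w2*C^2 by least squares over m (C, F) pairs,
     * solving the 3x3 normal equations with Gaussian elimination.
     * Returns 0 on success, -1 if the system is singular. */
    static int fit_model_c(const double *c, const double *f, int m, double w[3]) {
        double A[3][3] = {{0}}, b[3] = {0};
        for (int s = 0; s < m; s++) {
            double x[3] = {1.0, c[s], c[s] * c[s]};
            for (int i = 0; i < 3; i++) {
                b[i] += x[i] * f[s];
                for (int j = 0; j < 3; j++) A[i][j] += x[i] * x[j];
            }
        }
        for (int i = 0; i < 3; i++) {               /* forward elimination */
            if (fabs(A[i][i]) < 1e-12) return -1;   /* singular system */
            for (int k = i + 1; k < 3; k++) {
                double r = A[k][i] / A[i][i];
                for (int j = i; j < 3; j++) A[k][j] -= r * A[i][j];
                b[k] -= r * b[i];
            }
        }
        for (int i = 2; i >= 0; i--) {              /* back substitution */
            w[i] = b[i];
            for (int j = i + 1; j < 3; j++) w[i] -= A[i][j] * w[j];
            w[i] /= A[i][i];
        }
        return 0;
    }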

3.5.2 Model-CA

Now suppose VIDEOCHEF has additional knowledge of which ALs were in effect. Given the error metric of a canary output C, which is computed approximately with ALs A = {a_1, a_2, ..., a_n}, we construct the input vector I = (1, C, A). The first element of this vector is the constant 1 and allows for a constant term in our equation for F. Then the error metric of a full-sized output F is estimated by a linear regression model as follows:

    F = \vec{I} \cdot w    (8)

where w is an (n+2)×1 coefficient vector. The goal of the training is to estimate the vector w. The elements of w provide the weights that multiply the different input components: the error in the canary output (C) and the different ALs. We use a similar offline training method as for Model-C.


3.5.3 Model-CAD

Many approximation techniques in image processing reduce the computational load by skipping a fraction of the rows of an image. Thus, the difference over rows is often related to the approximation quality. Inspired by this characteristic, we consider a new feature vector D = (d_1, d_2, d_3), where each d_k represents a feature extracted from one of the Y, U, and V channels of an image. The feature d_k is referred to as a row-difference feature and is defined as the mean absolute difference in pixel values of the same column between consecutive rows in each channel. Averaging over rows and columns, we use only one representative number d_k for each channel.

Considering an input vector I = (1, C, A, D), the error metric of a full-sized output F is estimated by a linear regression model as:

    F = \vec{I} \cdot w    (9)

where w is an (n+5)×1 coefficient vector. In the experimental results, we will see that Model-CAD outperforms the other models.
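As an illustration of Model-CAD's feature construction and prediction (Eq. 9), the sketch below computes the row-difference feature for one channel and evaluates F = I · w for a trained coefficient vector. The function names and the assumption of separate 8-bit Y, U, V planes are ours.

    #include <stdint.h>
    #include <stdlib.h>

    /* Row-difference feature d_k of Model-CAD: mean absolute difference between
     * consecutive rows, averaged over one w x h channel plane. */
    static double row_difference(const uint8_t *plane, int w, int h) {
        double sum = 0.0;
        for (int y = 1; y < h; y++)
            for (int x = 0; x < w; x++)
                sum += abs((int)plane[y * w + x] - (int)plane[(y - 1) * w + x]);
        return sum / ((double)(h - 1) * w);
    }

    /* Model-CAD prediction: F = I . w with I = (1, C, a_1..a_n, d_1, d_2, d_3),
     * so w has n + 5 entries. */
    static double predict_full_error(double canary_err, const int als[], int n,
                                     const double d[3], const double w[]) {
        double f = w[0] + w[1] * canary_err;
        for (int i = 0; i < n; i++) f += w[2 + i] * als[i];
        for (int k = 0; k < 3; k++) f += w[2 + n + k] * d[k];
        return f;
    }

During the online search, the predicted F (with the 3 dB safety margin discussed in Section 5.3) is what gets compared against the user's quality bound before a candidate setting is accepted.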

3.5.4 Non-linear models.

We have also tested more complex non-linear models to predict F, using artificial neural networks with all pixel information as input. However, considering the run-time complexity [23], we did not observe any significant benefit of the non-linear models over the linear models mentioned earlier. Thus, we do not report their results in the evaluation.

4 Implementation and Dataset

We use loop perforation and memoization [42, 30] to approximately filter the frames in the video. The implementation of VIDEOCHEF comprises an offline and an online component. The offline component uses a set of training videos (50% of the videos described under the dataset below) and creates models for the error mapping and for the cost and benefit of each step of the search. This last model is actually implemented as a lookup table, because the space is only piece-wise continuous. At runtime, VIDEOCHEF queries these models, using linear interpolation if needed, performs an efficient search to identify the optimal ALs, and runs each of the three filters in any pipeline with their optimal values.

VIDEOCHEF API. Our compiler pass identifies the approximable blocks using program annotations and then performs the relevant transformations to insert the approximation knobs to be tuned (such as approx_level in Sec. 3.1). We have provided support for similar annotations in other domain-specific languages that we have built in the past, and that helps to reduce the programmer

burden [28]. The user can then use the following API calls to enable VIDEOCHEF in the video pipeline:

• setCalibrationFrequency(f="I-frame"): This sets how frequently VIDEOCHEF will search for the best approximation settings. By default, VIDEOCHEF triggers a search at every I-frame. If f="x", then VIDEOCHEF will search every x-th frame.

• setQualityThreshold(b="30"): This sets the (lower) PSNR threshold that the approximated pipeline must deliver. The default is 30 dB. VIDEOCHEF exposes to the user approximate versions of many filters from the FFmpeg library, with names like deflate_approx. The developer can register a callback with the video decoder using the call void notifyIFrame(void *). A brief usage sketch follows.
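The sketch below shows how an application might invoke these calls. The header name, the string argument types, and the pipeline-construction step are assumptions for illustration; the paper specifies only the two configuration calls and the notifyIFrame(void *) callback.

    #include "videochef.h"   /* assumed header name */

    int main(void) {
        /* Recalibrate the approximation settings at every I-frame (the default). */
        setCalibrationFrequency("I-frame");

        /* The approximated pipeline must keep the output PSNR above 30 dB (the default). */
        setQualityThreshold("30");

        /* ... construct the pipeline from approximate FFmpeg filters such as
         * deflate_approx and feed it the decoded frames; the decoder invokes the
         * registered notifyIFrame(void *) callback at each I-frame, which is when
         * VIDEOCHEF runs its search for the best approximation setting ... */
        return 0;
    }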

Video Dataset. We used 106 YouTube MPEG-4 videos for our evaluation. We used libvideo, a lightweight .NET library [24], to download the videos. The videos were collected from 8 different categories to cover a spectrum of different motion and color artifacts in the frames: Lectures, Ads, Car Races, Entertainment, Movie trailers, Nature, News, and Sports. In the first step, a single seed video was downloaded from each category; then we downloaded all of YouTube's recommendations for the seed video, which turned out to belong to the same category as the seed video. Once the set of videos was collected, we randomly sub-sampled a 20-second clip from each video, motivated by a desire to bound the experiment time. For each category, we collected approximately 25 videos and filtered out those with low resolution (since the quality threshold was likely already breached in the original video).

5 Evaluation

We describe our benchmarks first and then the four experiments that evaluate the macro properties of VIDEOCHEF and then its various components.

Benchmarks. We construct our benchmark by including different video processing pipelines. Each video processing pipeline consists of 3 consecutive filters, which are selected from a pool of 10 video filters from the FFmpeg library. These filters are modified to support approximation with tuning knobs. To execute these filter pipelines, one needs to provide a video input and a quality threshold. The output is also a video, together with a quality metric for each frame. We have a total of 9 different filter pipelines.

Quality Metric. We use PSNR (Eq. 1) as the quality metric for the videos produced by the approximate pipelines. We present the results for two acceptable PSNR thresholds. The threshold of 30 dB is considered a typical lower bound for lossy image and video


Table 1: Summary of the analyzed pipelines. We denote the approximation applied to each filter: Loop Perforation (LP) or Memoization (M).

Name | Description, labeled with Approx. type      | Approximation Type                      | Approximation Levels
DEB  | Deflate(LP)-Emboss(LP)-Boxblur(M)           | Loop perforation (LP) & Memoization (M) | 1-6, 1-6, 1-6
DVE  | Deflate(LP)-Vignette(LP)-Emboss(LP)         | Loop perforation                        | 1-6, 1-6, 1-6
BVI  | Boxblur(M)-Vignette(LP)-Inflate(LP)         | Loop perforation & Memoization          | 1-6, 1-6, 1-6
UIV  | Unsharp(LP)-Inflate(LP)-Vignette(LP)        | Loop perforation                        | 1-6, 1-6, 1-6
DUE  | Dilation(LP)-Unsharp(LP)-Emboss(LP)         | Loop perforation                        | 1-6, 1-6, 1-6
BVD  | Boxblur(M)-Vignette(LP)-Dilation(LP)        | Loop perforation & Memoization          | 1-6, 1-6, 1-6
UEE  | Unsharp(LP)-Erosion(LP)-Emboss(LP)          | Loop perforation                        | 1-6, 1-6, 1-6
EUB  | Erosion(LP)-Unsharp(LP)-Boxblur(M)          | Loop perforation & Memoization          | 1-6, 1-6, 1-6
BUC  | Boxblur(M)-Unsharp(LP)-Colorbalance(LP)     | Loop perforation & Memoization          | 1-6, 1-6, 1-6

compression [48, 16]. The threshold of 20 dB is considered the lower bound for lossy wireless transmission [44].

Evaluation Metrics. We define improvement as the decrease in execution time, expressed as a percentage of the competing protocol. We define the speedup of our approach as

    Speedup = \frac{\text{Speed of our protocol}}{\text{Speed of compared protocol}} - 1

Setup. We split the input videos into three groups: training, validation, and test, with shares of 50%, 25%, and 25% of the video set. The experiments are done on an x86 server with a six-core Intel(R) Xeon CPU, 16 GB RAM, and Ubuntu Linux kernel 4.4. We used FFmpeg library version 3.0 (compiled with gcc 5.4.0).

Canary Input Selection. We fixed the canary size to be 1/64 of the original size because our preliminary experiments showed that this is a good parameter to guarantee a dissimilarity metric value of 10% or less. Thus, the overhead of canary size selection is not included in the results.

5.1 Performance and Quality Comparison for End-to-End Workflow

Figure 3 presents the results of the end-to-end workflow for the nine different video processing pipelines over all videos from the test set. Each plot presents the speedup relative to the exact pipeline for the following configurations (from left to right):

• Exact computation, with default parameters.

• Best static approximation, created by setting the AL that is just over the error threshold for all the frames in the training videos.

• IRA extended with a simple search policy that has a fixed interval of 10 frames. This number is chosen according to SAGE [37], which gives an analytic bound for a video processing setting.

• VIDEOCHEF version A, with I-frame detection.

• VIDEOCHEF version B, with scene change detection.

• Oracle version, which uses exhaustive search but does not incur search overhead. This sets the upper bound on performance.

For both VIDEOCHEF versions, we used the CAD error model with a 3 dB margin, following the analysis in Section 5.3.

Performance for 30 dB Threshold. Figure 3(a) shows that VIDEOCHEF version A reduces the execution time by 39.1% over exact computation and is within 20% of the Oracle. It outperforms both static approximation and IRA, by 29.9% and 14.6% respectively in the aggregate. The advantage holds for all the video filter pipelines, with the greatest savings relative to IRA in the Unsharp-Inflate-Vignette (UIV) pipeline, where we are 39.2%, 36.8%, and 29.5% better than exact computation, static approximation, and IRA, respectively. The search overhead for VIDEOCHEF (both versions A and B) is small (the yellow portions of the bars are barely visible), and yet it finds more aggressive approximations than the competing approaches (static or IRA), as shown by the shorter blue portions of the bars. The IRA approach, due to its assumption that the error in the canary output is identical to the error in the full output, cannot use aggressive ALs and thus cannot achieve the full speedup available through approximation. Between the two variants of VIDEOCHEF, the scene change detector (version B) is slower than the I-frame lookup (version A).

Performance for 20 dB Threshold. We also evaluate VIDEOCHEF with a quality threshold of 20 dB. Given a larger error budget, Figure 3(b) shows that VIDEOCHEF achieves a greater performance gain over exact computation (1.6x speedup). We also outperform static approximation and IRA by 53.4% and 23.1% and are within 26.6% of the Oracle results. Notice that the pipeline where we achieve the maximum performance gain over IRA changes from UIV to DVE.

Quality for 30 dB Threshold. Figure 4(a) shows that IRA and static approximation both achieve much higher quality than what the user specified (30 dB), an undesirable outcome here since it comes at the expense of higher execution time. VIDEOCHEF, on the other hand, tracks the Oracle quality quite closely, which in turn meets the user requirement. It does, however, drop below the threshold on some inputs, albeit by small amounts. This indicates that a future design feature should compensate for the tendency of VIDEOCHEF to sometimes drop below the target video quality, say by adding a penalty function when the AL brings it close to the boundary. Further, a carefully designed margin in the searching algorithm can reduce the violations of the quality


Figure 3: Mean execution times over all frames of all videos; (a) quality threshold = 30 dB, (b) quality threshold = 20 dB. Geometric means of the speedups are on the right.

Figure 4: Quality of each frame across the different video filter pipelines; (a) quality threshold = 30 dB, (b) quality threshold = 20 dB.

requirement while still achieving speedup. The careful reader will have noticed that for some pipelines, some protocol results are missing. This happens because for some pipelines no approximation is possible, so no error is introduced and hence PSNR is not defined.

We also use the percentage of frames that violate the quality threshold to characterize the robustness of each protocol. The violation rates of static approximation, IRA, and VIDEOCHEF versions A and B are 3.27%, 0.64%, 6.6%, and 4.79%, respectively. Although the two versions of VIDEOCHEF have higher violation rates, they are still within a typical user-acceptable threshold (5%). We consider that the violations may be due to two factors: (1) inaccurate error prediction for the key frame, and (2) the quality of non-key frames degrading and dropping below the threshold before a fresh key frame is identified and a search triggered. According to our modeling in Section 5.3, the violation due to the first factor is limited to at most 5%, while the second source of error may be inevitable as long as we do not search on every frame. Considering the trade-off between search overhead and tighter error control, VIDEOCHEF largely reduces the search overhead while still maintaining good quality.

Quality for 20 dB Threshold. Figure 4(b) shows the quality measurements of the different protocols across all the pipelines. The mean violation rates, averaged across all pipelines, of static approximation, IRA, and VIDEOCHEF versions A and B are 0%, 0.23%, 7.18%, and 3.93%, respectively. In the 20 dB quality threshold case, we see the advantage of scene change detection as an add-on in VIDEOCHEF version B in decreasing the violation rate, because it can accurately detect frames that differ greatly from the previous ones and trigger the required search for optimal approximation levels.

5.2 Speedup and Video Quality versus Approximation Levels

This experiment studies (1) how the execution time of each filter varies with the AL setting for that filter and (2) how the video quality varies with the AL setting. This result depends on the approximation technique but is independent of the VIDEOCHEF configuration used to decide on the AL. We show the results with all the videos in our dataset and 5 out of the 10 representative filters in Figure 5 (number of executed instructions) and Figure 6 (video quality). When showing the result for a specific filter, we execute only that filter and not the 3-stage pipeline. Here the results have higher variability due to the content-dependent effect. For the execution time, we normalize by the measure for exact computation.

Execution Time. Figure 5 shows that as the AL becomes higher, i.e., the approximation becomes more aggressive, the execution time decreases. But the rate of decrease slows down as the AL becomes higher, and the behaviors of the different filters in our evaluation are comparable. Note that this is a box plot, but there is little variation across the different videos and hence each AL gives a very tight result. This is expected because the amount of processing done in the filter, whether with exact or approximate computation, is not content dependent; only the effect of the approximation is content dependent.

Quality. Figure 6 shows the effect of the AL on the video quality when the full frame is used. The quality degrades as the approximation gets more aggressive, but the nature of the decrease is not uniform across all the filters. Even within each filter, the effect on quality depends on the


Figure 5: The normalized execution time for each filter as the Approximation Level (AL) is varied, across all 106 videos in our dataset; panels (a)-(e) show the Deflate, Emboss, Boxblur, Histeq, and Vignette filters. The number of CPU cycles is normalized by the measure for exact computation. As the AL increases, the execution time decreases, and this happens consistently across all videos and filters.

Figure 6: The video quality for each filter as the Approximation Level (AL) is varied, across all 106 videos in our dataset; panels (a)-(e) show the Deflate, Emboss, Boxblur, Histeq, and Vignette filters. The effect depends on the video content and the filter being used.

exact video frame, as implied by the vertical data spread for any given AL. We identify two forms of unpredictability in how the AL correlates with video quality: with the content (which video frame is being approximated) and with the filter. Due to these two factors, we do not try to come up with a closed-form curve for the prediction; rather, we do the actual computation with the canary input for a given AL setting, compute the PSNR, and then map it to the PSNR with the full input (Section 3.5). Contrast this with the execution time, where we create a lookup table through training, which is content independent, and just look it up during the online search (Section 3.4.1). The variability due to video content in the PSNR plots validates our rationale for doing the approximation in a content-dependent manner. This rationale is shared with [25], but it sets our work apart from approximation techniques that select the approximation configuration in a content-independent manner.

5.3 Evaluation of Error Mapping Models

In this experiment, we evaluate the quality of the various error mapping models in VIDEOCHEF. We trained the models on the training video set. Table 2 and Figure 7 present the performance of our models on the validation videos. Figure 7 shows that even the simple Model-C can greatly reduce the prediction error relative to IRA. Also, as we increase the level of knowledge, the models achieve higher prediction accuracy, and Model-CAD performs the best due to its use of features extracted from the frames. We can see that with our CAD model we can successfully control the error to within 2 dB in 80% of the cases and within 3 dB in 90% of the

Figure 7: Results of the error modeling in VIDEOCHEF, mapping error in the canary output to error in the full output; (a) validation set, (b) test set. The CAD model, which uses characteristics of the frame, performs best, though it is only slightly better than the CA model.

Table 2: F-1 measure of the different error mapping models, averaged over all pipelines. We regard IRA as a pass-through error mapping.

Models          | IRA    | C      | CA     | CAD
30 dB threshold | 0.8650 | 0.9576 | 0.9594 | 0.9686
20 dB threshold | 0.8007 | 0.9679 | 0.9660 | 0.9759

cases. Given these results, we set a 3 dB margin when mapping from the canary error to the full error.

The results on the test videos (Section 5.1) show that the cases in which we violate the quality requirement are within 10%.

5.4 User Perception Study

To evaluate whether the protocols cause any perceptual difference, we conducted a small user study with 16 participants. Users were recruited by emailing students of certain ECE classes. We picked 16 videos, 2 from each YouTube content category, randomly selected from our dataset. We processed each video (a 20-second snippet from each, as in the rest of the evaluation) using the Oracle approach and using VIDEOCHEF for the DEB pipeline.


Table 3: Results of the user study with 16 videos processed using Oracle and VIDEOCHEF.

Degree of difference | Percentage
No difference        | 58.59%
Little difference    | 34.77%
Large difference     | 6.64%
Total difference     | 0%

This pipeline was chosen because its results in the rest of the evaluation are representative and it produces videos that are still visually pleasing. In the experiment, we showed the two versions of each of the 16 videos concurrently, processed using the Oracle and VIDEOCHEF tools, without letting the participant know which window corresponded to which tool. All participants watched the videos independently. The participants were asked to rate the videos in four categories: Same, Little difference, Large difference, and Total difference. We gave guidance to the participants for the four categories as difference ∈ [0%, 5%), [5%, 20%), [20%, 50%), and ≥ 50%.

We show the results in Table 3. The percentage figure is the percentage of the total number of videos shown, which is 16 × 16 (number of videos × number of users). We conclude that 58.59% of the videos received a "no difference" rating between the Oracle and the VIDEOCHEF processed videos, while 34.77% received a "little difference" rating. Although 6.64% of the videos received a "large difference" rating, none of the videos received a "total difference" rating. This validates that, qualitatively, human perception does not register a significant difference in video quality due to approximate processing with VIDEOCHEF.

6 Related Work

Approximate Tradeoffs in Computations and Data. Researchers have presented various techniques for changing computations at the system level to trade accuracy for performance, e.g., in hardware [32, 47, 12, 11, 8], runtime systems [3, 18], and compilers [30, 42, 2, 39, 6]. A key challenge of approximate computing is finding good tradeoffs between accuracy and performance. For this, researchers have looked at both off-line autotuning [30, 42, 29, 38] and on-line dynamic adaptation [3, 18, 37, 22, 17]. In image processing, various techniques exist for synthesizing approximate filter versions, e.g., using genetic programming [45, 41, 13]. Recently, Lou et al. [26] presented "image perforation", an adaptive version of loop perforation tailored for individual image filters. Researchers have also proposed storing multimedia data in approximate memories, including standard [39, 34], solid-state [40], and multi-level cell memories specialized for video encodings [19]. We consider such storage approaches complementary to our computation-based technique for video encoding.

Input-Aware Approximation. Several techniques

provide input-aware approximations to monitor output quality and control the aggressiveness of the approximation during execution. Green [3] was an early approach that applied dynamic quality monitoring to adjust the level of approximation based on a user-defined quality function. More recently, input-aware approximation identifies classes of similar inputs and applies different approximations for each input class [9, 43]. Opprox [31] learns the control flow of the input-optimized program and then selects in which phase to approximate as well as how much to approximate. In contrast to our work, all these approaches use off-line models for prediction of input quality and do not craft smaller inputs at runtime. Ringenburg et al. [35] proposed online monitoring mechanisms, where a random subset of approximate outputs is compared with a precise output on a sampling basis, or the output of the current execution is predicted from past executions with similar inputs. Raha et al. [33] present a precise analysis of accuracy for a commonly used reduce-and-rank computational pattern. Rumba [22] and Topaz [1] detect outliers in intermediate computation results. In contrast to IRA [25] and our VIDEOCHEF, these approaches do not use canary inputs to guide the optimization and monitoring and therefore grapple with the overhead issue.

7 Conclusion

Fast and resource-efficient processing of videos is required in many scenarios. We built a resource-efficient and input-aware approximate video processing pipeline called VIDEOCHEF. VIDEOCHEF controls the approximation in each frame (using the properties of the frame) to meet the user's accuracy requirement. In particular, VIDEOCHEF uses a canary-input based approach for fast searching, as proposed in prior work, but overcomes some fundamental challenges by introducing a machine-learning based accurate error estimation technique and an input-aware search technique that finds the best approximation settings. We show that VIDEOCHEF can provide significant speedups in 9 different video processing pipelines while satisfying users' quality requirements.

8 Acknowledgments

This material is based in part upon work supported by the National Science Foundation under Grant Numbers CNS-1527262 and CCF-1703637 and by Sandia National Lab under their Academic Alliance Program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors. We thank Ashraf Mahgoub, Manish Gupta, Edward J. Delp, Fengqing Maggie Zhu, and Di Chen for useful discussions and advice.



References

[1] ACHOUR, S., AND RINARD, M. C. Approximate computation with outlier detection in Topaz. In OOPSLA (2015).

[2] ANSEL, J., WONG, Y. L., CHAN, C., OLSZEWSKI, M., EDELMAN, A., AND AMARASINGHE, S. Language and compiler support for auto-tuning variable-accuracy algorithms. In CGO (2011).

[3] BAEK, W., AND CHILIMBI, T. M. Green: a framework for supporting energy-conscious programming using controlled approximation. In PLDI (2010).

[4] BAO, L., YANG, Q., AND JIN, H. Fast edge-preserving patchmatch for large displacement optical flow. In IEEE CVPR (2014), pp. 3534–3541.

[5] BUCKLER, M., JAYASURIYA, S., AND SAMPSON, A. Reconfiguring the imaging pipeline for computer vision. In The IEEE International Conference on Computer Vision (ICCV) (2017).

[6] CARBIN, M., MISAILOVIC, S., AND RINARD, M. C. Verifying quantitative reliability for programs that execute on unreliable hardware. In OOPSLA (2013).

[7] CHAUDHURI, S., GULWANI, S., LUBLINERMAN, R., AND NAVIDPOUR, S. Proving programs robust. In FSE (2011).

[8] CHIPPA, V. K., MOHAPATRA, D., RAGHUNATHAN, A., ROY, K., AND CHAKRADHAR, S. T. Scalable effort hardware design: exploiting algorithmic resilience for energy efficiency. In DAC (2010).

[9] DING, Y., ANSEL, J., VEERAMACHANENI, K., SHEN, X., O’REILLY, U.-M., AND AMARASINGHE, S. Autotuning algorithmic choice for input sensitivity. In PLDI (2015).

[10] DOUGHERTY, E. R., AND LOTUFO, R. A. Hands-on morphological image processing, vol. 59. SPIE Press, 2003.

[11] ESMAEILZADEH, H., SAMPSON, A., CEZE, L., AND BURGER, D. Architecture support for disciplined approximate programming. In ASPLOS (2012).

[12] ESMAEILZADEH, H., SAMPSON, A., CEZE, L., AND BURGER, D. Neural acceleration for general-purpose approximate programs. In MICRO (2012), IEEE Computer Society.

[13] FARBMAN, Z., FATTAL, R., AND LISCHINSKI, D. Convolution pyramids. ACM Trans. Graph. 30, 6 (2011), 175–1.

[14] FOULADI, S., WAHBY, R. S., SHACKLETT, B., BALASUBRAMANIAM, K., ZENG, W., BHALERAO, R., SIVARAMAN, A., PORTER, G., AND WINSTEIN, K. Encoding, fast and slow: Low-latency video processing using thousands of tiny threads. In NSDI (2017), pp. 363–376.

[15] GOIRI, I., BIANCHINI, R., NAGARAKATTE, S., AND NGUYEN, T. D. Approxhadoop: Bringing approximations to mapreduce frameworks. In ASPLOS (2015).

[16] HAMZAOUI, R., AND SAUPE, D. Fractal image compression, in "Document and Image Compression". CRC Press, 2006.

[17] HOFFMANN, H. Jouleguard: energy guarantees for approximate applications. In SOSP (2015).

[18] HOFFMANN, H., SIDIROGLOU, S., CARBIN, M., MISAILOVIC, S., AGARWAL, A., AND RINARD, M. Dynamic knobs for responsive power-aware computing. In ASPLOS (2011).

[19] JEVDJIC, D., STRAUSS, K., CEZE, L., AND MALVAR, H. S. Approximate storage of compressed and encrypted videos. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (2017), ACM, pp. 361–373.

[20] JIANG, H., HELAL, A. S., ELMAGARMID, A. K., AND JOSHI, A. Scene change detection techniques for video database systems. Multimedia Systems 6, 3 (May 1998), 186–195.

[21] JUURLINK, B., ALVAREZ-MESA, M., CHI, C. C., AZEVEDO, A., MEENDERINCK, C., AND RAMIREZ, A. Understanding the application: An overview of the H.264 standard. Scalable Parallel Programming Applied to H.264/AVC Decoding (2012), 5–15.

[22] KHUDIA, D. S., ZAMIRAI, B., SAMADI, M., AND MAHLKE, S. Rumba: an online quality management system for approximate computing. In ISCA (2015).

[23] KIM, S. G., HARWANI, M., GRAMA, A., AND CHATERJI, S. Ep-dnn: A deep neural network-based global enhancer prediction algorithm. Scientific Reports 6 (2016), 38433.

[24] KO, J. Libvideo: A lightweight .NET library to download YouTube videos. https://github.com/jamesqo/libvideo, 2016.

[25] LAURENZANO, M. A., HILL, P., SAMADI, M., MAHLKE, S., MARS, J., AND TANG, L. Input responsiveness: using canary inputs to dynamically steer approximation. In PLDI (2016), ACM.

[26] LOU, L., NGUYEN, P., LAWRENCE, J., AND BARNES, C. Image perforation: Automatically accelerating image pipelines by intelligently skipping samples. ACM Transactions on Graphics (TOG) 35, 5 (2016), 153.

[27] MAHADIK, K., CHATERJI, S., ZHOU, B., KULKARNI, M., AND BAGCHI, S. Orion: Scaling genomic sequence matching with fine-grained parallelization. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2014), IEEE Press, pp. 449–460.

[28] MAHADIK, K., WRIGHT, C., ZHANG, J., KULKARNI, M., BAGCHI, S., AND CHATERJI, S. Sarvavid: a domain specific language for developing scalable computational genomics applications. In Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 34.

[29] MENG, J., CHAKRADHAR, S., AND RAGHUNATHAN, A. Best-effort parallel execution framework for recognition and mining applications. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (2009), IEEE, pp. 1–12.

[30] MISAILOVIC, S., SIDIROGLOU, S., HOFFMANN, H., AND RINARD, M. Quality of service profiling. In ICSE (2010).

[31] MITRA, S., GUPTA, M. K., MISAILOVIC, S., AND BAGCHI, S. Phase-aware optimization in approximate computing. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (CGO) (2017), IEEE Press, pp. 185–196.

[32] PALEM, K. V. Energy aware computing through probabilistic switching: A study of limits. IEEE Transactions on Computers 54, 9 (2005), 1123–1137.

[33] RAHA, A., VENKATARAMANI, S., RAGHUNATHAN, V., AND RAGHUNATHAN, A. Quality configurable reduce-and-rank for energy efficient approximate computing. In DATE (2015).

[34] RANJAN, A., RAHA, A., VENKATARAMANI, S., ROY, K., AND RAGHUNATHAN, A. Aslan: Synthesis of approximate sequential circuits. In DATE (2014).

[35] RINGENBURG, M., SAMPSON, A., ACKERMAN, I., CEZE, L., AND GROSSMAN, D. Monitoring and debugging the quality of results in approximate programs. In ASPLOS (2015).

[36] SAMADI, M., JAMSHIDI, D. A., LEE, J., AND MAHLKE, S. Paraprox: Pattern-based approximation for data parallel applications. In ASPLOS (2014).

[37] SAMADI, M., LEE, J., JAMSHIDI, D. A., HORMATI, A., AND MAHLKE, S. Sage: Self-tuning approximation for graphics engines. In MICRO (2013).

[38] SAMPSON, A., BAIXO, A., RANSFORD, B., MOREAU, T., YIP, J., CEZE, L., AND OSKIN, M. Accept: A programmer-guided compiler framework for practical approximate computing. University of Washington Technical Report UW-CSE-15-01 (2015).



[39] SAMPSON, A., DIETL, W., FORTUNA, E., GNANAPRAGASAM, D., CEZE, L., AND GROSSMAN, D. Enerj: Approximate data types for safe and general low-power computation. In PLDI (2011).

[40] SAMPSON, A., NELSON, J., STRAUSS, K., AND CEZE, L. Approximate storage in solid-state memories. ACM TOCS (2014).

[41] SHARMAN, K., ALCAZAR, A., AND LI, Y. Evolving signal processing algorithms by genetic programming. In Genetic Algorithms in Engineering Systems: Innovations and Applications, 1995. GALESIA. First International Conference on (Conf. Publ. No. 414) (1995), IET, pp. 473–480.

[42] SIDIROGLOU-DOUSKOS, S., MISAILOVIC, S., HOFFMANN, H., AND RINARD, M. Managing performance vs. accuracy trade-offs with loop perforation. In FSE (2011).

[43] SUI, X., LENHARTH, A., FUSSELL, D. S., AND PINGALI, K. Proactive control of approximate programs. In ASPLOS (2016).

[44] THOMOS, N., BOULGOURIS, N. V., AND STRINTZIS, M. G. Optimized transmission of JPEG2000 streams over wireless channels. IEEE Transactions on Image Processing 15, 1 (2006), 54–67.

[45] UESAKA, K., AND KAWAMATA, M. Evolutionary synthesis of digital filter structures using genetic programming. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 50, 12 (2003), 977–983.

[46] UNION, E. General data protection regulation. http://www.eugdpr.org/, 2017.

[47] VENKATARAMANI, S., CHIPPA, V. K., CHAKRADHAR, S. T., ROY, K., AND RAGHUNATHAN, A. Quality programmable vector processors for approximate computing. In MICRO (2013).

[48] WELSTEAD, S. T. Fractal and Wavelet Image Compression Techniques, 1st ed. Society of Photo-Optical Instrumentation Engineers (SPIE), Bellingham, WA, USA, 1999.

[49] ZHANG, H., ANANTHANARAYANAN, G., BODIK, P., PHILIPOSE, M., BAHL, P., AND FREEDMAN, M. J. Live video analytics at scale with approximation and delay-tolerance. In NSDI (2017), pp. 377–392.

[50] ZHANG, T., CHOWDHERY, A., BAHL, P. V., JAMIESON, K., AND BANERJEE, S. The design and implementation of a wireless video surveillance system. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (2015), ACM, pp. 426–438.
