
Brigham Young University
BYU ScholarsArchive
Theses and Dissertations

2021-12-13

Deep Parameter Selection For Classic Computer Vision Applications

Michael Whitney, Brigham Young University

Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Physical Sciences and Mathematics Commons

BYU ScholarsArchive Citation: Whitney, Michael, "Deep Parameter Selection For Classic Computer Vision Applications" (2021). Theses and Dissertations. 9351. https://scholarsarchive.byu.edu/etd/9351

This thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected].


Deep Parameter Selection for Classic Computer Vision Applications

Michael Whitney

A thesis submitted to the faculty of Brigham Young University

in partial fulfillment of the requirements for the degree of

Master of Science

Bryan Morse, Chair
David Wingate
Quinn Snell

Department of Computer Science

Brigham Young University

Copyright © 2021 Michael Whitney

All Rights Reserved


ABSTRACT

Deep Parameter Selection for Classic Computer Vision Applications

Michael Whitney
Department of Computer Science, BYU

Master of Science

A trend in computer vision today is to retire older, so-called "classic" methods in favor of ones based on deep neural networks. This has led to tremendous improvements in many areas, but for some problems deep neural solutions may not yet exist or be of practical application. For this and other reasons, classic methods are still widely used in a variety of applications. This paper explores the possibility of using deep neural networks to improve these older methods instead of replacing them. In particular, it addresses the issue of parameter selection in these algorithms by using a neural network to predict effective settings on a per-input basis. Specifically, we look at a straightforward and well-understood algorithm with one primary parameter: interactive graph-cut segmentation. This parameter balances region/boundary influences and heavily influences the resulting segmentation. Many practitioners tune this parameter using an ad hoc or empirically selected static setting, while others pre-analyze images to determine effective settings on a per-image basis. The best setting for this parameter for each image, or even for each target selection within an image, is highly sensitive to properties of the image and object, suggesting that a network might be able to recognize these properties and predict settings that would improve performance. We employ a lightweight network with minimal layers to avoid adding significant computational overhead with this pre-analysis step. The network predicts the segmentation performance for each of a set of discretely sampled values for this parameter and selects the one with the highest predicted performance. Results demonstrate that this per-image prediction and tuning performs better than a single empirically selected setting.

Keywords: deep learning, classic computer vision, graph-cut segmentation


ACKNOWLEDGMENTS

Thanks to my advisor, Dr. Morse, my committee, and fellow researchers in the

Computer Vision lab for helping me every step of the way. Thanks to my sweet wife, Macy,

my parents, and other family members for their love and support.


Table of Contents

List of Figures

List of Tables

1 Deep Parameter Selection For Classic Computer Vision Applications
1.1 Introduction
1.2 Background and Related Work
1.2.1 Interactive Graph-Cut Segmentation
1.2.2 Relation to Hyperparameter Optimization
1.3 Methods
1.3.1 Data and Preprocessing
1.3.2 Neural Network Architecture
1.3.3 Training
1.4 Results
1.4.1 Example Results
1.4.2 Quantitative Evaluation
1.4.3 Variations and Discussion
1.5 Discussion and Limitations
1.6 Conclusion


List of Figures

1.1 Example Outputs
1.2 Effects of λ
1.3 IOU Vectors
1.4 Network Overview
1.5 Qualitative Results
1.6 Random Selection Histograms


List of Tables

1.1 Pre-processing Table
1.2 Quantitative Results


Chapter 1

In Preparation:

Deep Parameter Selection For Classic Computer Vision Applications

This manuscript has not yet been accepted for publication.


Deep Parameter Selection For Classic Computer Vision Applications

Mike Whitney and Bryan Morse
Brigham Young University, Provo, UT
{mikeswhitney, morse}@byu.edu

Abstract

A trend in computer vision today is to retire older, so-called "classic" methods in favor of ones based on deep neural networks. This has led to tremendous improvements in many areas, but for some problems deep neural solutions may not yet exist or be of practical application. For this and other reasons, classic methods are still widely used in a variety of applications. This paper explores the possibility of using deep neural networks to improve these older methods instead of replacing them. In particular, it addresses the issue of parameter selection in these algorithms by using a neural network to predict effective settings on a per-input basis. Specifically, we look at a straightforward and well-understood algorithm with one primary parameter: interactive graph-cut segmentation. This parameter balances region/boundary influences and heavily influences the resulting segmentation. Many practitioners tune this parameter using an ad hoc or empirically selected static setting, while others pre-analyze images to determine effective settings on a per-image basis. The best setting for this parameter for each image, or even for each target selection within an image, is highly sensitive to properties of the image and object, suggesting that a network might be able to recognize these properties and predict settings that would improve performance. We employ a lightweight network with minimal layers to avoid adding significant computational overhead with this pre-analysis step. The network predicts the segmentation performance for each of a set of discretely sampled values for this parameter and selects the one with the highest predicted performance. Results demonstrate that this per-image prediction and tuning performs better than a single empirically selected setting.

1. Introduction

A commonly held belief in the computer vision community is that "classic" algorithms, those developed before the advent of deep neural networks, are outdated and should be retired. This assertion is true to a certain extent; however, "classic" algorithms still have their place in the field.

(a) Images (b) Static Setting (c) NN-Predicted

Figure 1: Example segmentations using a single empirically tuned parameter vs. per-image parameter prediction. An empirically tuned static parameter setting may work well on average across various image types, but for some images (a) it can sometimes perform poorly (b). Using per-image parameters predicted by a deep neural network can achieve improved results (c).

Deep neural networks can achieve state-of-the-art results for many problems, but they require large amounts of data, expensive hardware for training, and other computational resources such as memory and storage. For some problems, we have yet to develop deep-learning solutions that surpass prior methods, though that number seems to decrease with each conference or journal issue in the field. While deep neural-network approaches may not yet have supplanted classic algorithms for some applications, it may be possible for such networks to cooperate with or assist existing algorithms.


A significant and well-known problem with classic computer vision algorithms, and even with machine-learning algorithms, is that they often have parameters that can be difficult to tune. This leads to vision systems that are not as effective or robust as they could be.

The most common approach to setting these parameters is for the implementer of the algorithm to set them to fixed values, perhaps through their own experience working with the method, but often through simple trial and error (a form of optimization sometimes jokingly called "grad student descent"). A more principled approach is to gather a large corpus of data and empirically select a value that performs well in general across that set. Using a single, well-chosen set of parameters can perform well for some inputs or situations but often not well for others. Automatically tuning parameters on a per-input basis can improve results, as demonstrated in Fig. 1.

Previous efforts to automatically select parameter values on a per-input basis have involved using heuristics chosen by the developer based on their knowledge of the underlying algorithm (e.g., [15]). The use of heuristics can produce good results for some images but is limited to images that fit the preconceived motivation behind them.

Other approaches for per-input parameter tuning use classic machine learning with hand-crafted features (e.g., [13]). The use of machine learning to tune these parameters expands the set of images for which automatic tuning will work well, but the use of hand-crafted features is still limiting.

This raises the question of whether lightweight modern deep neural networks could be used to augment classic computer vision algorithms by predicting optimal, or at least improved, parameter tunings on a per-input basis. This paper attempts to address this question with a proof of concept using a simple, well-understood algorithm with a single parameter whose behavior is also well understood. It should be emphasized that the goal of this work is not to produce a better vision algorithm per se, or even to compete with existing algorithms, but rather to explore this particular question of whether lightweight neural networks can adaptively predict improved per-input parameter settings for existing algorithms.

In particular, this work uses the well-known interactive graph-cut segmentation algorithm first proposed by Boykov and Jolly [2]. Again, the goal is not to produce a better segmentation algorithm (especially given the substantial body of work using this and similar approaches over the last 20 years) but to see if this fairly straightforward algorithm can be improved using a lightweight neural network to predict per-image parameter settings. We specifically employ a lightweight network architecture to avoid introducing substantial overhead [5, 18].

Experimental results using this approach suggest that using a lightweight deep neural network to predict per-image parameter settings can improve the accuracy of this simple algorithm by as much as 17% of the potential possible improvement compared to using an empirically optimized static parameter setting.

2. Background and Related Work

Although we use the classic method of [2] as the subject of our approach, not the basis of it, we briefly review it so that readers unfamiliar with it may better understand the role of its key parameter. We also review previous methods for trying to automatically tune this parameter on a per-image basis.

2.1. Interactive Graph-Cut Segmentation

Interactive graph-cut segmentation aims to perform binary segmentation of an image given indications from the user of which object to select. The user provides rough "scribbles" to indicate the desired foreground and background regions. In addition to providing spatial information, these scribbles serve as representative samples of foreground/background pixels, which are then used to estimate the foreground and background color distributions respectively. For simple color-based selection, the pixels could be labeled as foreground or background based solely on their respective posterior probabilities. However, using color distributions alone can often produce disjoint segmentation regions. The core graph-cut framework incorporates a component that discourages disjoint object regions while encouraging boundaries around natural image edges.

The segmentation is framed as minimizing an objective function incorporating both of these elements:

    arg min_A E(A) = R(A) + λ B(A)    (1)

where A is a binary labeling of each pixel. R(A), often referred to as the region term, encourages labeling pixels according to the inferred foreground/background color distributions. B(A), often referred to as the boundary term, encourages breaks between foreground and background to align with natural image edges. λ ≥ 0 is a regularization factor that balances the relative importance of R(A) and B(A). The minimum of this objective function is then calculated by formulating it as the solution to a well-known minimum-cut problem from graph theory, for which solutions can be found in low-degree polynomial time [3]. We refer the reader to the vast body of research that has gone into graph-cut segmentation, including a recent survey of interactive segmentation methods [16].
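To make the λ-weighted energy concrete, the following is a minimal sketch of how such a min-cut segmentation could be set up using the PyMaxflow library. The thesis does not specify its implementation, so the library choice and the form of the region/boundary cost arrays are assumptions:

```python
import maxflow  # PyMaxflow: pip install PyMaxflow

def graph_cut_segment(region_fg, region_bg, boundary_w, lam):
    """Minimize E(A) = R(A) + lam * B(A) by min-cut (Eq. 1).

    region_fg / region_bg: per-pixel costs of labeling each pixel
        foreground / background (the region term R).
    boundary_w: per-pixel neighbor weights derived from image edges
        (the boundary term B).
    """
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(region_fg.shape)
    # Boundary term: 4-connected grid edges, scaled by lambda.
    g.add_grid_edges(nodes, weights=lam * boundary_w)
    # Region term: terminal capacities; following the common convention,
    # the capacity toward each terminal is the cost of the other label.
    g.add_grid_tedges(nodes, region_bg, region_fg)
    g.maxflow()
    # Boolean mask of the cut: one side is foreground, the other background.
    return g.get_grid_segments(nodes)
```

Note how λ enters only as a scale factor on the boundary edges; everything else about the graph is fixed by the image and the scribbles.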

The success of graph-cut segmentation is heavily dependent on the value of λ, which controls the relative weighting between the region and boundary terms, as demonstrated in Fig. 2.


(a) Image (b) λ too low for this image (c) λ too high for this image (d) λ best setting for this image

Figure 2: The effects of different λ settings on graph-cut segmentation. When the image (a) exhibits overlapping color distributions for foreground and background regions, setting λ too low may rely too heavily on these color distributions, often resulting in disjoint regions (b). Segmenting with a setting for λ that is too high causes some portions of the object to be missed (c).

When λ is set too low, the region term is relied on too heavily, and not enough penalty is applied to mislabeling similar neighbors, resulting in disjoint regions (2b). When λ is set too high, the boundary term is favored and a length bias is introduced, often resulting in missing portions of the intended selection (2c).

As can also be seen in Fig. 2, even when the optimal setting for λ is used (2d), the segmentation may still be partially incorrect. Most often, graph-cut algorithms are used in interactive segmentation, where the user has the opportunity to review the resulting segmentation and place additional foreground or background strokes as needed.

We are not the first to address the challenge of selecting optimal settings for λ. Similar to empirical priors in Bayesian estimation, one can select a value for λ that optimizes average performance over a large corpus of images with accompanying ground-truth segmentations. However, a single static value, while it can give the best performance on average, doesn't always give the best result for every image.

Other work has focused on selecting values for λ on a per-image basis. Price et al. [15] used the idea of color distributions to adjust the value of λ on a per-image basis. Both SnapCut [1] and LiveCut [14] adapt settings frame-by-frame for video segmentation based on each frame's contents and user feedback.

Other work explores using machine learning to tune this parameter. Peng and Veksler [13] use a machine-learning approach in which they teach an AdaBoost model to predict how well an image has been segmented, and then perform graph-cut segmentation some fixed number of times with differing λ values, selecting the segmentation that was predicted to be best. Running graph-cut segmentation over a range of λ values can be slow per segmentation, so we strive to have our network select an optimal λ value and perform the min-cut algorithm only once at inference time. The main difference between their work and this paper is that they take a shotgun approach, performing graph-cut segmentation multiple times to select the best λ value, while we try to predict the best value directly for each image.

2.2. Relation to Hyperparameter Optimization

In some ways this work is related to hyperparameter optimization, in which machine-learning architectures and governing parameters are selected. (For those less familiar with neural networks, the term "parameters" generally refers to the learned elements, while these other tunable factors are called "hyperparameters". We refer the interested reader to the introduction in [4].) As with parameter tuning in classic algorithms, the choice of hyperparameters can greatly affect the effectiveness of machine-learning algorithms. Hyperparameters, however, are typically determined in such a way that they work well for a set of representative inputs rather than based on individual instances, which is analogous to empirical tuning of parameters in classic algorithms. This work instead tries to select optimal parameters for each input image rather than finding a static value that works well on average across a corpus of images.

3. Methods

3.1. Data and Preprocessing

Although there are datasets for interactive segmentation that provide both ground-truth segmentations and interactive scribbles, they are typically too small for training. Many larger datasets for training instance-segmentation algorithms provide only corresponding ground-truth segmentations along with images. In particular, we use the Semantic Boundaries Dataset (SBD) [6], which itself incorporates portions of earlier datasets.

With ground-truth segmentations, user interaction can be simulated automatically to select each instance. The training data requirements for neural networks necessitate automation in the creation of user-provided scribbles.


Image              | ... | λ=0.28 | λ=0.33 | λ*=0.38 | λ=0.45 | λ=0.52 | ... | Max IOU
-------------------|-----|--------|--------|---------|--------|--------|-----|--------
153093             | ... | 0.67   | 0.66   | 0.65    | 0.65   | 0.57   | ... | 0.67
person 2007_001423 | ... | 0.46   | 0.47   | 0.46    | 0.43   | 0.40   | ... | 0.47
cat 2007_000528    | ... | 0.72   | 0.72   | 0.78    | 0.76   | 0.77   | ... | 0.78
bike 2008_002772   | ... | 0.49   | 0.53   | 0.53    | 0.53   | 0.54   | ... | 0.54
bike 2007_005878   | ... | 0.76   | 0.80   | 0.82    | 0.85   | 0.84   | ... | 0.85
...                | ... | ...    | ...    | ...     | ...    | ...    | ... | ...
Mean IOU           | ... | 0.461  | 0.462  | 0.464   | 0.460  | 0.458  | ... | 0.528

Table 1: A subset of the precomputed data for our training set, consisting of the IOU scores resulting from segmenting each image (row) with each sampled value of λ (column). The maximum of the column means (here 0.464) forms a lower bound for a predictor, since it can be achieved by a single static value, while the mean of the row maxima (here 0.528) forms a corresponding upper bound on what is achievable by a perfect predictor.

We use the method outlined by [12], in which small circles representing the strokes are added incrementally and the image is segmented after each iteration until some accuracy threshold is met or a predefined number of iterations have passed. The placement of each circle is determined by comparing the predicted segmentation against the ground truth and choosing the point of the error region that is farthest from its closest correctly labeled pixel.
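A minimal sketch of this simulation step, assuming the standard distance-transform formulation (the function name and exact convention here are illustrative, not the authors' code):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def next_scribble_center(pred, gt):
    """Pick the center of the next simulated scribble circle.

    pred, gt: boolean masks (current segmentation, ground truth).
    Returns the error pixel farthest from any correctly labeled pixel,
    plus the label (True = foreground) the scribble should carry.
    """
    error = pred != gt
    if not error.any():
        return None  # segmentation already matches the ground truth
    # Distance from each mislabeled pixel to the nearest correct pixel.
    dist = distance_transform_edt(error)
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return (y, x), bool(gt[y, x])
```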

Before discussing the training of our network, we must first address the issue that the value of λ in Eq. 1, while constrained to be positive, is only semi-bounded. The respective ranges of R(A) and B(A) may be implementation-dependent and can cause the reasonable values of λ to change as well. To overcome this issue, we reformulate Eq. 1 so as to constrain the values of λ to a consistent range:

    arg min_A E(A) = (1 − λ) R(A) + λ B(A)    (2)

Because we are solving for the set of labels A that minimizes E(A), only the relative, not absolute, magnitudes of the terms in Eq. 1 are relevant. The construction in Eq. 2, while seemingly redundant, serves to bound 0 ≤ λ ≤ 1, thus allowing us to sample values within a defined range. Because the relative weighting of the two terms in this formulation is non-linear with respect to λ, we have found that sampling values so that they are evenly spaced on a logarithmic scale works better than sampling at equal linear intervals. For the results presented here, we used 20 discrete values λ_k.
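For instance, such a log-spaced sample grid could be generated as below; the endpoints are an assumption inferred from the values shown in Table 1 and Fig. 5 (roughly 0.05 to 0.9), not values stated in the text:

```python
import numpy as np

# 20 lambda settings, evenly spaced on a logarithmic scale.
lambdas = np.geomspace(0.05, 0.9, num=20)
```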

We use the intersection-over-union (IOU) metric to determine the quality of a given segmentation, defined as the area of the intersection between the given and ground-truth segmentations divided by the area of their union. For each input image I_j and corresponding simulated user strokes, we denote the IOU score resulting from performing graph-cut segmentation using each sampled λ_k as S(I_j, λ_k). For each of the objects segmented (as there are sometimes multiple objects per image), the IOU scores are precomputed as a matrix with elements S_jk = S(I_j, λ_k) and stored for lookup.¹

(a) image (b) IOU score as a function of λ (c) image (d) IOU score as a function of λ

Figure 3: Graphs of IOU scores S_j for different values of λ. Note that these curves vary according to image properties, with the best results often obtained using different settings for different images.
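A short sketch of how this score matrix might be precomputed; `segment` stands in for the graph-cut routine sketched in Sect. 2.1, and `dataset` for an iterable of (image, simulated strokes, ground truth) triples, both assumptions:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union

# S[j, k] = IOU achieved on image j using lambda_k.
S = np.array([[iou(segment(image, strokes, lam), gt)
               for lam in lambdas]
              for image, strokes, gt in dataset])
```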

An example of this is shown in Table 1. Each row S_j of the table represents the IOU scores resulting from segmenting that image I_j with each of a set of sampled values λ_k. Here, we highlight the best value of λ (highest IOU score) for each row. Notice that this ideal setting can vary from image to image, as also illustrated in Fig. 3.

One can also observe that this table allows for straightforward calculation of two key quantities that bound the goals of prediction. Computing the mean of each column, s̄_k = (1/N) Σ_{j=1..N} S_jk, gives the average performance across the entire set of N images for each sampled λ_k, as illustrated in the bottom row of the table. The setting for λ that maximizes this average performance across the set of images is the empirically tuned best static setting described in Sect. 1, which we denote as λ*. It should be noted that this level of accuracy, S*, can be achieved without per-image parameter prediction and thus forms a lower bound for any reasonable predictor.

The maximum score in each row indicates the best achievable segmentation for that image, as found in the rightmost column. Computing the mean of those scores gives the best achievable average score, S_max = (1/N) Σ_{j=1..N} max_k S_jk. Note that S_max is the mean of the maximum for each row, while S* is the maximum of the mean for each column. S_max thus gives a corresponding upper bound for any parameter predictor.

¹Although some images contain multiple objects, for purposes of this discussion we will use the term "image" to refer to a unique combination of image and object-segmentation goal, even if some combinations reuse the same source images.

Figure 4: Network architecture. The input image and distance maps for the foreground/background strokes are combined into a single five-channel input to a SqueezeNet-based network that produces a vector of predicted IOU scores, one for each sampled value λ_k. This is compared to the result of a preprocessing step that uses these inputs, plus the corresponding ground-truth segmentations, to calculate the true target vector. The mean-squared-error loss between predicted and target vectors is then backpropagated to optimize the network during training.

All of these values are precomputed prior to training so that graph-cut segmentation is taken out of the loop during training iterations.
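Given the precomputed matrix S from above, the two bounds fall out of a pair of one-line reductions (variable names are illustrative):

```python
col_means = S.mean(axis=0)        # s_k: mean IOU for each sampled lambda
k_star = col_means.argmax()       # index of the best static setting
S_star = col_means[k_star]        # lower bound: best single lambda*
S_max = S.max(axis=1).mean()      # upper bound: perfect per-image choice
```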

3.2. Neural Network Architecture

One drawback of deep neural networks is that they often must be large in order to achieve state-of-the-art results, and these large architectures consume significant computational resources in terms of storage and power. Work has been done to lower the number of weights needed while still achieving good results, including SqueezeNet [9], which achieves an AlexNet [11] level of performance with less than 50MB of storage. We use SqueezeNet as the backbone of our network but change the input and output layers to fit the training style we use. We explored multiple training styles for predicting an optimal λ value, including classification, regression, and performance prediction.

Deep neural networks excel at classification, so a classification-based approach is appealing as a well-developed solution. Framing the problem as classification, the network output consists of a vector whose length is the number of discretely sampled values for λ. The largest activation in this vector indicates the predicted value, which for input image I_j we denote as λ+_j.

Framing the problem as regression, the network produces for an input image I_j a single predicted value λ+_j. During training this prediction is compared to the best-performing value for that image as determined by the preprocessing step, and the squared-error loss is backpropagated through the network.

Like classification, performance prediction produces a vector whose length is the number of sampled values for λ. Instead of attempting to produce a one-hot classification vector, however, the network aims to predict the resulting IOU score for each sampled λ_k.

Of the three approaches, we found performance prediction to be the most effective. The results of the other two methods are described and discussed further in Sect. 4.3; the three supervision styles are contrasted in the sketch below.
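A toy comparison of the three supervision signals, using random stand-ins for the network outputs and the precomputed IOU targets (all names and the λ grid endpoints here are illustrative):

```python
import torch
import torch.nn.functional as F

B, K = 32, 20                       # batch size, number of sampled lambdas
lambdas = torch.logspace(-1.3, -0.05, K)  # log-spaced grid, ~0.05 to ~0.9
S_rows = torch.rand(B, K)           # target IOU curves, one row per image
best_k = S_rows.argmax(dim=1)       # index of the best lambda per image

# 1. Classification: hit the single best index.
loss_cls = F.cross_entropy(torch.randn(B, K), best_k)
# 2. Regression: predict the best lambda value directly.
loss_reg = F.mse_loss(torch.rand(B), lambdas[best_k])
# 3. Performance prediction (used here): predict the whole IOU curve.
loss_perf = F.mse_loss(torch.rand(B, K), S_rows)
```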


The process of performance prediction is outlined in Fig. 4. The input x_j consists of a given image I_j with distance maps for the foreground and background user-selection strokes concatenated as additional channels. The ground-truth segmentation of I_j is used to precompute the IOU scores for each sampled λ_k. The network produces ŷ_j, an estimate of the target vector y_j = s_j, the precomputed IOU scores for image I_j segmented using each sampled λ_k. The largest element of this vector indicates the predicted best setting λ+_j for that image.
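One way to realize this architecture, sketched with torchvision's SqueezeNet; the five-channel stem and K-way head match the description above, but the exact layer shapes are assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn
from torchvision.models import squeezenet1_1

K = 20  # number of sampled lambda values

# SqueezeNet backbone with the classifier resized to K outputs, and the
# first conv widened from 3 to 5 input channels (RGB + FG/BG distance maps).
net = squeezenet1_1(num_classes=K)
net.features[0] = nn.Conv2d(5, 64, kernel_size=3, stride=2)

x = torch.randn(1, 5, 224, 224)     # image plus two stroke distance maps
iou_pred = net(x)                   # (1, K): predicted IOU for each lambda_k
lam_idx = iou_pred.argmax(dim=1)    # index of the predicted best setting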

3.3. Training

The network is trained on SBD's training set using a mean-squared-error loss function between ŷ_j and y_j:

    L_j = ‖ŷ_j − y_j‖²    (3)

The loss function is minimized using the Adam optimizer [10] for 50 epochs with 32 instances in each batch. Since the graph-cut segmentation algorithm can perform poorly on certain images regardless of the parameter setting, we pruned the dataset with a threshold best IOU score of 0.3 and a constraint that at least half of the IOU scores had to be unique. The network therefore focuses on learning an effective λ value for images where correct prediction will actually help. After this pruning, the remaining training set consisted of 10,946 instances.
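The corresponding training loop is standard; `train_set` is an assumed Dataset wrapper yielding the five-channel inputs and their precomputed IOU-score vectors:

```python
import torch
from torch.utils.data import DataLoader

opt = torch.optim.Adam(net.parameters())
loss_fn = torch.nn.MSELoss()
loader = DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(50):
    for x, y in loader:              # x: (B,5,H,W) inputs, y: (B,K) targets
        opt.zero_grad()
        loss = loss_fn(net(x), y)    # Eq. 3, averaged over the batch
        loss.backward()
        opt.step()
```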

4. Results

We evaluate our results on the validation set of the SBD dataset. As with training, we also pruned the dataset to remove images for which graph-cut segmentation performs poorly regardless of parameter setting. In other words, evaluating the network's ability to predict improved values for λ only makes sense if such improvement is possible. After pruning and counting each object in the images on its own, there are 3785 instances in the validation set.

4.1. Example Results

The visual results of our method follow what is generally accepted within the graph-cut segmentation literature. For the images shown in Figs. 1 and 5, when λ* proves to be too low a setting for that image, disjoint regions are introduced; and when λ* is too high for that image, portions of the object may be missed. Using the network-predicted settings results in improved performance for these images, with quantitative evaluation across the entire set presented and discussed in the remainder of this section.

For each of the examples in Fig. 5, using an empirically tuned λ* resulted in disjoint regions labeled as foreground and some parts within the foreground region labeled as background. In each case, our network predicted a higher value λ+, resulting in more coherent results.

(a) Image (b) λ* = 0.28 (c) λ+ = 0.52
(d) Image (e) λ* = 0.28 (f) λ+ = 0.68
(g) Image (h) λ* = 0.28 (i) λ+ = 0.90

Figure 5: A comparison between using an empirically tuned static setting and our per-image predicted settings. Note the disjoint region located within the foreground of each of the λ* results, as well as other discontiguous regions in the results for the first image. These issues are alleviated using per-image predicted settings.

Interestingly, our network rarely predicted a value λ+_j that was lower than λ* and outperformed it. For images with differently colored foreground and background regions, a low value for λ is sufficient and generally the best setting. An example of this can be clearly seen in Fig. 3a, where the foreground and background are easily differentiated using color alone, resulting in excellent performance with a low value for λ but deteriorating performance as that value is increased, as seen in Fig. 3b. Where this distinction is less clear, larger values for λ improve performance, as seen in Fig. 3c,d. Since a large portion of the dataset falls into the former category (clear color separation between foreground and background), the value λ* = 0.28 for this set is somewhat low, leaving little room for improvement using smaller values but significant room for improvement using larger values for images that fall into the latter category.

4.2. Quantitative Evaluation

We compare the mean performance using our predictions λ+_j, which we denote as S+, with that of using an empirically tuned static setting λ*, which we denote as S*, as well as with the maximum achievable performance S_max discussed in Sect. 3.1. Recall that our goal was to achieve a level of performance in the range S* < S+ ≤ S_max, i.e., better than can be achieved with a single static value while obviously bounded by the best achievable level of performance.


Score Type  | Mean IOU Score | % > S*
------------|----------------|-------
S_max       | 0.536          | n/a
S+          | 0.451          | 58.5%
Peng et al. | 0.447          | 57.7%
S*          | 0.413          | n/a
S_rand      | 0.367          | 33%

Table 2: A comparison between S_max, S+, S*, and random settings S_rand, in both average performance and the percentage of the dataset performing better than S*.

The results in Table 2 demonstrate that it is possible to achieve such performance using network-predicted per-image settings.

On average, using the static setting λ* results in a mean IOU score of S* = 0.413, while using our predicted settings λ+_j achieves a score of S+ = 0.451. Using predicted λ+_j rather than λ* improved the segmentation on 2216 of the images in the validation set while achieving the same performance as λ* on 205 of the images. The resulting increase in performance over using a static λ* is 17% of the maximum possible improvement.

The predictor is not perfect, though, and there are cases where our predictions λ+_j underperform the static setting λ*. However, we found that this occurs for only 36% of the SBD validation set. For 7.5% of the set, using predicted values resulted in performance equal to using a static λ*, but that is expected since for some images the best setting actually is λ*. Furthermore, the nature of graph-cut segmentation often leads to a broad "sweet spot" in which slightly different values of λ result in identical binary labels. For 58.5% of the set, using per-image predicted values λ+_j outperformed using the static λ*.

Although these results suggest the network is learning to make meaningful predictions, we also chose to compare its performance to simple random selection of values λ_j. We found that on average, random selection performed worse than using λ*, underperforming on 57.1% of the images in the dataset. By plotting a histogram of percentage improvements, as shown in Fig. 6, it can be seen that the majority of results were very similar to those for S*, for both predicted and random values. But the predicted settings skewed toward values that scored better than using λ*, while random settings skewed toward ones with worse performance. The fact that our method (S+ = 0.451) consistently performed well above S_rand = 0.367 shows that the network is indeed learning to select better values for λ. These summary statistics can be recovered directly from the precomputed score matrix, as sketched below.
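A sketch of this evaluation, assuming the validation score matrix S and a vector `pred_k` of per-image predicted indices; for simplicity the static index is recomputed on this matrix rather than carried over from training:

```python
import numpy as np

static = S[:, S.mean(axis=0).argmax()]   # per-image scores using lambda*
chosen = S[np.arange(len(S)), pred_k]    # per-image scores using lambda+

print("S+ =", chosen.mean(), " S* =", static.mean())
print("better:", (chosen > static).mean(),
      " equal:", (chosen == static).mean(),
      " worse:", (chosen < static).mean())
```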

4.3. Variations and Discussion

The results suggest other possible approaches beyond the presented method, several of which were tried and are discussed here.

(a) Prediction
(b) Random

Figure 6: Histograms of raw improvement using predicted and random settings, each compared to using an empirically tuned static setting. Note that while both have the majority of their results comparable to using λ*, the predicted set has a heavier positive tail and the random set has a heavier negative tail.

As mentioned in Sect. 3.2, we looked into training using both classification and regression approaches. We also looked into how much more effective a deeper, more expensive neural architecture would be, as well as compared training from scratch versus using pre-trained weights. The number of discrete values for λ was also revisited to see if finer-grained approaches result in better performance.

We implemented a classification-based approach to predict values for λ, but the results underperformed those achieved using a static setting. This is most likely due to the nature of the classification loss function, which focuses on individual labels (value settings) rather than the curve as a whole. That is, the network is penalized for not choosing the optimal value while not being rewarded for predicting a similar, almost-as-good value. It is often the case in graph-cut segmentation that near-optimal settings for λ produce adequate results, since the performance curves often display a large sweet spot. Predicting the entire IOU curve avoids this focus on the single best value and allows the network to learn more general characteristics, including the ability to predict values that improve performance even if they are not the optimal ones.

We also implemented a regression-based method, which directly predicts an actual value for λ rather than an index into a preset range of values. While this method is appealing since it can find values that would not otherwise be possible with a discrete set, it also underperformed using a single static setting.

In addition, we applied our method to several different sets of sampled values λ_k, including linearly spaced bins, more bins, and fewer bins, but we ultimately settled on the logarithmic sampling discussed in Sect. 3.1. Upon analysis of the performance using different sets of sampled values, we found that the majority of the S_max values occurred at the lower values of λ, motivating the change from linear to logarithmic sampling, which was confirmed in the results. We tried using fewer sampled values λ_k, but this did not provide sufficient granularity for sampling λ. Using a larger set of sampled settings could produce better predictions given an accordingly larger dataset, and perhaps a larger network, but we found that performance suffered for output vectors longer than used in the method presented here.

Intuitively, it seems that on many tasks a deeper network performs better than a shallower one, so we tried using deeper networks to compare with our (intentionally) shallower architecture. We implemented variations on VGG [17], ResNet [7], and DenseNet [8]. Although the deeper networks were also successful, in that segmentations using their predictions similarly outperformed those using λ*, they actually underperformed in comparison to our SqueezeNet backbone. This is likely due to over-fitting, since larger networks often require more data to be effective. Deeper networks may produce better results given a correspondingly larger dataset, but again, we intentionally chose a lighter-weight one to avoid significant overhead in this pre-analysis step.

Another common technique in designing and training deep neural networks is transfer learning, where the weights of a pre-trained network are used to initialize the weights for learning another task. Such transfers work well between similar tasks, but not as well for problems requiring substantially different local features [19]. When applied to this problem of parameter selection, though, transferring pre-trained backbone weights also seemed to underperform. This is most likely because the features needed to select the parameter differ from those needed for the tasks the backbones were originally trained on.

5. Discussion and Limitations

This research is a proof of concept, and the results presented here suggest potential for further success building on the general approach. The proposed method is designed to work for an algorithm with a single primary parameter and uses a discrete set of sampled values for that parameter. These limitations cause the algorithm to miss potential performance gains that could come from a continuous range of possible values, as well as from the effects other parameters could have on the algorithm. Furthermore, the pre-computation approach used here to avoid running the core algorithm "in the loop" during training may not extend beyond a small set of parameters.

To overcome some of these limitations, one could conceivably use an actor-critic style of training in which two networks are trained: one to select a value of λ from a continuous range (the actor) and another to predict how well a given λ would perform for a given image (the critic). The actor could also be extended to output multiple values representing other parameters to be tuned. A rough sketch of this idea follows.
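Entirely hypothetical, since this variant was not implemented in this work: a minimal sketch of how such an actor-critic pair might be structured, assuming a feature backbone that maps an image batch to flat 512-dimensional vectors:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an image to a continuous lambda in (0, 1)."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.head(self.backbone(x)))

class Critic(nn.Module):
    """Predicts the IOU an (image, lambda) pair would achieve."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim + 1, 1)

    def forward(self, x, lam):
        # lam: (B, 1) tensor of candidate lambda values
        return self.head(torch.cat([self.backbone(x), lam], dim=1))
```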

While using predicted per-image settings results in improved performance overall compared to an empirically tuned static setting, the predictions may sometimes be incorrect and lead to underperformance on individual images. This suggests a variation that could be used when user guidance is available (as it clearly is for interactive segmentation) in which the core algorithm is run twice, once with the predicted setting and once with the static one, with the user selecting the better result based on their knowledge of the intended outcome. Alternatively, if an automated evaluator of the results is available (such as in [13]), it could select between the two outcomes automatically without having to test a large set of possible settings.

Finally, this work used as its target for improvement a fairly basic method for interactive graph-cut segmentation, and although that method was highly novel at the time it was published, a large body of subsequent improvements has since been developed. As such, the overall system presented here falls short of bringing generic graph-cut segmentation up to the level of state-of-the-art methods for interactive segmentation. But that was never the point of this work, and the results here validate the core idea that modern deep neural networks can predict improved per-input parameter settings for existing algorithms.

6. Conclusion

The aim of this work has been to investigate whether deep neural networks can be used to automatically predict per-input parameter settings for classic computer vision algorithms. We have presented a network that predicts the performance for each of a discrete set of possible values of the single primary parameter in the seminal interactive graph-cut segmentation algorithm.


Using these network predictions outperforms using an empirically tuned static setting, both on average and for the majority of the images in the SBD validation set. The results show that the network indeed learns the features necessary for effective parameter selection, at least for the algorithm employed here. We believe this holds promise for future application of these ideas to other classic computer vision algorithms and systems.

References

[1] Xue Bai, Jue Wang, David Simons, and Guillermo Sapiro. Video SnapCut: Robust video object cutout using localized classifiers. In ACM SIGGRAPH 2009 Papers, SIGGRAPH '09, New York, NY, USA, 2009. Association for Computing Machinery.

[2] Yuri Boykov and Marie-Pierre Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In IEEE International Conference on Computer Vision (ICCV), 2001.

[3] Yuri Boykov and Vladimir Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.

[4] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer, 1st edition, 2008.

[5] Marco Forte, Brian Price, Scott Cohen, Ning Xu, and Francois Pitie. Interactive training and architecture for deep object selection. In IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2020.

[6] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[8] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.

[9] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.

[12] Hannes Nickisch, Carsten Rother, Pushmeet Kohli, and Christoph Rhemann. Learning an interactive segmentation system. In Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP '10, pages 274–281, New York, NY, USA, 2010. Association for Computing Machinery.

[13] Bo Peng and Olga Veksler. Parameter selection for graph cut based image segmentation. In Proceedings of the British Machine Vision Conference, pages 16.1–16.10. BMVA Press, 2008.

[14] Brian Price, Bryan Morse, and Scott Cohen. LiveCut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In IEEE International Conference on Computer Vision (ICCV), pages 779–786, 2009.

[15] Brian Price, Bryan Morse, and Scott Cohen. Geodesic graph cut for interactive image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3161–3168, 2010.

[16] Hiba Ramadan, Chaymae Lachqar, and Hamid Tairi. A survey of recent interactive image segmentation methods. Computational Visual Media, 6(4):355–384, 2020.

[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

[18] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas Huang. Deep interactive object selection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 373–381, 2016.

[19] Amir R. Zamir, Alexander Sax, William B. Shen, Leonidas J. Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
