abstract - arxivabstract this paper tackles the problem of video object segmen-tation. we are...

10
Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation Yu Liu 1 Lingqiao Liu 1 Haokui Zhang 2 Hamid Rezatofighi 1 Ian Reid 1 1 The University of Adelaide, Australia 2 Northwestern Polytechnical University, China Abstract This paper tackles the problem of video object segmen- tation. We are specifically concerned with the task of seg- menting all pixels of a target object in all frames, given the annotation mask in the first frame. Even when such anno- tation is available this remains a challenging problem be- cause of the changing appearance and shape of the object over time. In this paper, we tackle this task by formulating it as a meta-learning problem, where the base learner grasp- ing the semantic scene understanding for a general type of objects, and the meta learner quickly adapting the appear- ance of the target object with a few examples. Our proposed meta-learning method uses a closed form optimizer, the so- called “ridge regression”, which has been shown to be con- ducive for fast and better training convergence. Moreover, we propose a mechanism, named “block splitting”, to fur- ther speed up the training process as well as to reduce the number of learning parameters. In comparison with the- state-of-the art methods, our proposed framework achieves significant boost up in processing speed, while having very competitive performance compared to the best performing methods on the widely used datasets. 1. Introduction Fast and accurate video object segmentation plays an im- portant role in many real-world applications, including, but not limited to, film making [12], public surveillance [44], robotic vision [20]. The goal of video object segmentation is to distinguish an object of interest over video frames from its background at the pixel level. In contrast to many vision tasks such as image classifica- tion [17], face recognition [28] and object detection [32, 13] which the performance of the algorithms reach to the point of being suitable for real-world applications, the perfor- mance of video object segmentation algorithms are still far beyond the annotations performed by human [30]. This is Figure 1. A comparison of the quality and the speed of previ- ous video object segmentation methods (DAVIS2016 benchmark). We visualize the intersection-over-union (IoU) with respect to the frames-per-second (FPS). mainly because this problem does not benefit from avail- ability of a massive corpus of training data, unlike the other aforementioned tasks. Recently, deep learning-based approaches have shown promising progresses on video object segmentation task [3, 42, 24, 37]. However, they still struggle to satisfy both good accuracy and fast processing inference. In this paper, we aim to bridge this gap. Inspired by the meta-learning method of [2], we propose an intuitive yet powerful algorithm for video object segmen- tation, in which the reference frame is available with its an- notated mask. Our objective is to train a system that can “adapt” this annotation information to subsequent frames in a fast yet flexible way at inference time. Specifically, at inference time the reference frame (i.e. one with ground- truth annotation) is mapped to vector in a high dimensional 1 arXiv:1909.13046v1 [cs.CV] 28 Sep 2019

Upload: others

Post on 03-Apr-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

Meta Learning with Differentiable Closed-form Solver for Fast Video ObjectSegmentation

Yu Liu1 Lingqiao Liu1 Haokui Zhang2 Hamid Rezatofighi1

Ian Reid1

1The University of Adelaide, Australia2 Northwestern Polytechnical University, China

Abstract

This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in all frames, given theannotation mask in the first frame. Even when such anno-tation is available this remains a challenging problem be-cause of the changing appearance and shape of the objectover time. In this paper, we tackle this task by formulating itas a meta-learning problem, where the base learner grasp-ing the semantic scene understanding for a general type ofobjects, and the meta learner quickly adapting the appear-ance of the target object with a few examples. Our proposedmeta-learning method uses a closed form optimizer, the so-called “ridge regression”, which has been shown to be con-ducive for fast and better training convergence. Moreover,we propose a mechanism, named “block splitting”, to fur-ther speed up the training process as well as to reduce thenumber of learning parameters. In comparison with the-state-of-the art methods, our proposed framework achievessignificant boost up in processing speed, while having verycompetitive performance compared to the best performingmethods on the widely used datasets.

1. Introduction

Fast and accurate video object segmentation plays an im-portant role in many real-world applications, including, butnot limited to, film making [12], public surveillance [44],robotic vision [20]. The goal of video object segmentationis to distinguish an object of interest over video frames fromits background at the pixel level.

In contrast to many vision tasks such as image classifica-tion [17], face recognition [28] and object detection [32, 13]which the performance of the algorithms reach to the pointof being suitable for real-world applications, the perfor-mance of video object segmentation algorithms are still farbeyond the annotations performed by human [30]. This is

Figure 1. A comparison of the quality and the speed of previ-ous video object segmentation methods (DAVIS2016 benchmark).We visualize the intersection-over-union (IoU) with respect to theframes-per-second (FPS).

mainly because this problem does not benefit from avail-ability of a massive corpus of training data, unlike the otheraforementioned tasks.

Recently, deep learning-based approaches have shownpromising progresses on video object segmentation task [3,42, 24, 37]. However, they still struggle to satisfy both goodaccuracy and fast processing inference. In this paper, weaim to bridge this gap.

Inspired by the meta-learning method of [2], we proposean intuitive yet powerful algorithm for video object segmen-tation, in which the reference frame is available with its an-notated mask. Our objective is to train a system that can“adapt” this annotation information to subsequent framesin a fast yet flexible way at inference time. Specifically, atinference time the reference frame (i.e. one with ground-truth annotation) is mapped to vector in a high dimensional

1

arX

iv:1

909.

1304

6v1

[cs

.CV

] 2

8 Se

p 20

19

Page 2: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

Figure 2. Example result of our technique: The segmentation of the first frame (red) is used to learn the model of the specific object totrack, which is segmented in the rest of the frames independently (green). One every 10 frames shown of 50 in total.

embedding spaceX = φ(I) using a CNN φ. We then deter-mine using ridge regression [25], the coefficients of a ma-trix W that best maps X to the ground truth, Y = WX .W is then the video-specific “adaptor”, and it maps thefeature vectors for every query image (i.e. every other im-age in the video sequence) to their predicted segmentationmasks. Training comprises the process of learning the map-ping φ(.) by presenting the network with pairs of images(from a variety of videos but with each pair coming from thesame video), each with ground-truth annotation, and back-propagating the loss through φ. This is illustrated in Figure3 and described in more detail later in the paper.

We observe that a limitation of the proposed approachis that the ridge regression scales poorly with the dimen-sion of the feature feature produced by φ(.) because the op-timization requires an huge matrix inversion. We addressthis through the use of a “block splitting” method that ap-proximates the matrix in block diagonal form, meaning theinversion can be done much more efficiently.

Our main contributions are three-fold:

• A meta-learning based method for video object seg-mentation is developed, using a closed form solver(ridge regression) as the internal optimizer. This iscapable of performing fast gradient back-propagationand can adapt to previously unseen objects quicklywith very few samples. Inference (i.e. segmentationof the video) is a single forward pass per frame withno need for fine-tuning or post-processing.

• Ridge regression in high-dimensional feature spacescan be very slow, because of the need to invert a largematrix. We address this using a novel block splittingmechanism which we show greatly speeds the trainingprocess without damaging the performance.

• We demonstrate state-of-the-art video segmentationaccuracy relative to all others methods of comparableprocessing time, and even better accuracy than manyslower ones (see Figure 1).

2. Related Works

2.1. Semi-supervised Video Object Segmentation

The goal of video object segmentation is to ‘cutout’ thetarget object(s) from the entire input video sequence. Re-garding the amount of supervision utilized for video objectsegmentation, methods can be roughly put into two spec-trum, i.e. semi-supervised and unsupervised methods.

For semi-supervised video object segmentation, the an-notated mask of the first frame is given, and the algorithmis designed to predict the masks of the rest frames in thevideo. There are three categories in this spectrum. The firstone, which include MSK [29], MPNVOS [37] etc, is to useoptical flow to track the mask from the previous frame tothe current frame. Similarly, the second category formulatesthe optical flow and segmentation in two parallel branches,and utilizes the predicted mask from the previous frame as aguidance, some representatives are Segflow [7],VSOF [40],RGMP [43], OSNM [45] etc. The final class which keepsthe state-of-the-art performance on Davis benchmark [30] isto try to over-fit the appearance of the target object(s), andexpect the method can generalize in the subsequent frames.Specifically, OSVOS [3] uses one-shot learning mechanismto conduct fine-tuning on the first frame of test video tocapture the appearance of the target object(s), and conductinference on the rest frames. The drawbacks of OSVOSare: (1) it can not adapt to the unseen parts (2) when dra-matic changes of appearance happen in subsequent frames,the method’s performance significantly degrade. Inspiredby the overall design of OSVOS, there are some follow-ing methods which employ various additional ingredients toimprove the segmentation accuracy. In particular, OSVOS-S [24] combines the semantic instance information to re-move the noisy objects coming from the same category. On-VOS [42] utilizes on-line adaption mechanism to overcomethe limits of OSVOS when drastic appearance changes oc-cur. CINM [1] utilizes a CNN-based markov random field(MRF) to estimate the probabilities of the pixels belongingto the target object(s) in spatial domain, and employs opti-cal flow to track segmented pixels in temporal domain. Al-beit those methods improve the segmentation performanceof OSVOS to some extent, they are still time-consumingduring inference since the on-line fine-tuning is necessary.And usually, at lease a dense CRF [16] and more techniques

Page 3: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

CNN

CNN

Share

800 x hw

800 x hw

R.R.A

predict result

LossBackward and update

Referenceimage

Queryimage

t=0

t=1

t=n-1

t=n

Figure 3. Workflow of the proposed method. An image pair sampled from the same video as the input to the network. The first imageIR and its annotation MR as the reference frame, and the second image IQ and its annotation MQ ( or prediction PQ during inference)as the query frame. The image pair first passes through the feature extractor (DeepLabv2 [4] with ResNet101 [14]) to compute a 800Dembedding tensor FR, FQ. Then a mapping matrix W between FR and MR is calculated in the reference frame (Eq. 1) using ridgeregression. After that, the prediction result PQ in the query frame is acquired by multiplying FQ and W (Eq. 2). During training, theloss error between PQ and MQ is back-propagated to enhance the network’ adaptation ability between the reference frame and the queryframe. During inference, the reference frame (IR and MR) is always the first frame, and the query image IQ is the rest sequence from thesame video. Through iterative meta-learned, our network is capable of quickly adapting to unseen target object(s) with a few examples.

are applied as the post-processing step to acquire the bettersegmentation results.

In this paper, we mainly target to fast video object seg-mentation, since no optical flow and fine-tuning processesare used, the proposed method is appropriate for real-worldapplications.

2.2. Meta Learning

Meta learning is also named learning to learn [35, 26, 38]because its goal is to help the machine to be capable oflearning quickly, especially in the case with very few sam-ples for the new task(s). Generally speaking, meta learn-ing algorithms are composed of two components, i.e. baselearner and meta learner. According to their roles, baselearner is mainly in charge of handling with individualtasks, and meta learner is much like a coordinator, throughlearning individuals tasks, meta learner can boost the per-formance of base learners across the tasks.

Meta-learning is an alternative to the de-facto solutionthat has emerged in deep learning of pre-training a networkusing a large, generic dataset (eg ImageNet [8]) followed byfine-tuning with a problem-specific dataset. Meta-learningaims to replace the fine-tuning stage (which can still be veryexpensive) by training a network that has a degree of plastic-ity so that it can adapt rapidly to new tasks. For this reasonit has become a very active area recently, especially withregard to one-shot and few-shot learning problems [18, 9]

Recent approaches for meta-learning can be roughly put

into three categories: (i) metric learning for acquiring simi-larities [41, 36, 11]; (ii) learning optimizers for gaining up-date rules [10, 31]; and (iii) recurrent networks for reserv-ing the memory [33, 15]. In this work, we adopt the meta-learning algorithm that belongs to the category of learningoptimizers. Specifically, inspired by [2] which was origi-nally designed for image classification, we adopt ridge re-gression, which is a closed-form solution to the optimiza-tion problem. The reason for using it is because, comparedwith the widely-used SGD [19] in CNNs, ridge regressioncan propagate gradient efficiently, which is matched withthe goal of fast mapping. Through extensive experiments,we demonstrate that the proposed method is in the firstechelon regarding to speed for fast video object segmen-tation, while obtaining more accurate results without anypost-processing.

2.3. Fast Video Object Segmentation

A few previous methods proposed to tackle fast videoobject segmentation. In particular, FAVOS [6] first tracksthe part-based detection. Then, based on the tracked box,it generates the part-based segments and merges those partsaccording to a similarity score to form the final segmenta-tion results. The drawback of FAVOS is that it can not belearned in an end-to-end manner, and heavily relies on thepart-based detection performance. OSNM [45] proposes amodel which is composed of a modulator and a segmenta-tion network. Through encoding the mask prior, the mod-

Page 4: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

ular can help the segmentation network quickly adapt tothe target object. RGMP [43] shares the same spirit withOSNM. Specifically, it employs a Siamese encoder-decoderstructure to utilize the mask propagation, and further booststhe performance with synthetic data. The most similar workto ours is PML [5], which formulates the problem as a pixel-wise metric learning problem. Through the FCN [23], itmaps the pixels to high-dimensional space, and utilizes a re-vised triplet loss to encourage pixels belonging to the sameobject much closer than those belonging to different objects.Nearest neighbor (NN) is required for retrieval during in-ference. In contrast our meta-learning approach acquires amapping matrix between the high-dimensional feature andannotated mask in reference image using ridge regression,and then can be adapted rapidly to generate the predictionmask. Compared to baseline method PML [5], our methodachieves more accurate performance and is twice the speed.And with the same efficiency, the J mean of our method is3.4 percent better than OSNM [45] on the DAVIS2016 [30]validation set.

3. Methodology

3.1. Overview

We formulate the video object segmentation as a meta-learning problem. For each image pair which comes froma same video, ridge regression is used as the optimizerto learn the base learner. Meta learner is naturally builtthrough the training process. Once the meta learner islearned, it possesses the ability of fast mapping between theimage features and object masks, and can be adapted to un-seen objects quickly with the help of the reference image.

According to the phase that user input involved in thetraining loop, the current existing methods can be classifiedinto three categories.User input outside the network training loop This cat-egory utilizes the user input to fine-tune the network toover-fit the appearance cues of target object(s) during in-ference. The representatives are OSVOS [3] and its follow-ing works [24, 1, 42]. Since online fine-tuning is requiredduring inference, the drawback of these algorithms is time-consuming, which usually take seconds per image, thus isnot practical for the real-world applications.User input within the network training loop This cate-gory of work injects the user input as the additional inputfor training the network. Through this way, no online fine-tuning is needed. These algorithms incorporate the user in-put either by using a parallel network or concatenating theimage with the user input [43, 45]. One drawback of thiskind of methods is that the model needs to be recalculatedonce the user input changes, thus it is not practical for adap-tation especially for long videos.

User input is detached from the network training loop Incontrast to the previous methods, our algorithm shares thesame spirit with PML [5] in design. The network and userinput are detached, and the user input can be more flexible.Moreover, once the user input is given (for example, theannotation in the reference image), the network can quicklyadapt to the target objects without any extra operations.

3.2. Segmentation as Meta-Learning

For simplicity, we assume single-object segmentationcase, and the annotation of first frame is given as the userinput. Note that our method can also be applied for multi-objects and easily extended to other types of user input, e.g.,scribble, clicks etc.

We adopt the following notation:C denotes the number of feature channels (in our case 800).w, h denote the spatial resolution of the extracted features(in our case 1/8th of the orginal image size).FR, FQ are the feature tensors of size C × h×w producedby φX is a flattened tensor of FR or FQ, with shape h · w × CY is the flattened tensor of annotation mask MR or MQ,with shape h · w × 1W denotes the mapping matrix of size C × 1 between thefeature space and annotation mask.

As noted above, there are two components to the learner:(i) φ(.) an embedding model that maps images to a high-dimensional feature space, C × h × w; and (ii) an adaptorW of size C × 1, found using ridge regression, that mapsthe embedded features to a (flattened) segmentation mask(of size h · w × 1).Embedding Model We adopt DeeplabV2 [4] built on theResNet-101 [14] backbone structure as our feature extrac-tor φ. This choice allows a direct comparison of our methodwith the baseline, PML [5]. First, we use the pretrainedmodel on COCO [22] dataset as the initialization for seman-tic segmentation. Then the ASPP [4] layer for classificationis removed and replaced by our video-specific mapping W .Ridge Regression Ridge regression is a closed form solverand widely-used in machine learning community [34, 27].The learner seeks W that minimizes Λ as follows:

Λ(X,Y ) = arg minW

||XW − Y ||2 + λ||W ||2

= (XTX + λI)−1XTY(1)

where,X,Y andW are as defined above, and λ is a regular-ization parameter, and set to 5.0 in all of our experiments.As can be seen in Figure 3, during training, an image pair aswell as their annotations are sampled from the same videosequence. The feature FR extracted from the reference im-age IR (in the figure this is the first image) and its annotationMR will be used to calculate the mapping matrix W .

PQ = FQ ×W (2)

Page 5: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

(where we abuse notation and use the unflattened featuretensors for clarity)

For the query image IQ, likewise we compute the featureFQ, map these to the predicted segmentation maskPQ usingEquation 2 in which W is the matrix computed from thereference image and its ground truth. The loss between theprediction mask PQ and the annotation MQ for the queryprovides the back-propagation signal to improve φ’s abilityto produce adaptable features.

During inference in our case, the reference frame IR willbe always the first frame, for which the annotation mask isprovided, and the query frames IQ will be the rest of framesin the same video.

3.3. Block Splitting

Thanks to ridge regression, the computation of the map-ping matrix and gradient back-propagation are already veryfast compared with other algorithms, which also focus onvideo object segmentation.

F (X) = (XTX + λI)−1 (3)

During the experiments, we found the higher dimensionof the feature used as the input for meta-learning module,the more accurate segmentation results likely be achieved.However, we also observed that the higher dimension of thefeature being utilized, the slower of the training process.Specifically, during the computation of mapping matrix W,it involves a matrix inverse calculation. as denoted by Equa-tion 3, which will become the bottleneck of fast propagationwhen the very high dimensional feature is used.

In order to further speed up the training process of theproposed network, we deliver a block splitting mechanism,and its work principle as shown in Figure 4. In particular, ,our motivation is that the matrix inverse computation F (X)for much high-dimensional feature (eg. 800D) can be ap-proximated by the sum of the computations of that relativelow-dimensional features (eg. 200D × 4). From the workprinciple, it can be viewed that a n × n matrix can be ap-proximated by four n/4× n/4 irrelevant diagonal matrix.

The advantages of using the proposed block splittingmechanism are: Firstly, it can largely speed up the matrixinverse process involved in ridge regression, thus it savesthe training time to some extent. Secondly, through the ma-trix approximation step as aforementioned, the network pa-rameters involved in the ridge regression as well as memoryutilized in our network are reduced. The experimental evi-dence can be found in Ablation Study ( Section 5).

3.4. Training

Training Strategy For training, optimizer is SGD withmomentum 0.9, with weight decay 5e-4. We use theDeepLabV2 [4] with backbone network ResNet-101 [14]as the feature extractor, and the constant learning rate, i.e.

Splitting

Approximate

200800Computation cost: 200x200x4=160000Computation cost: 800x800=640000

Figure 4. Illustration of the proposed block splitting: during ma-trix inverse calculation of ridge regression, the computation of thehigher dimensional feature is approximated by the sum of compu-tation of that lower dimensional features. Which can effectivelyspeed up the training process as well as reducing the parametersand memory.

1.0e-5, is used during the whole training process. The di-mension of extracted feature is 800 outputed by the featureextractor, which is used as the input for the meta-learningmodule.Loss BCEWithLogitsLoss1 is employed for training theproposed network, it essentially is a combination of the Sig-moid layer and binary cross entropy (BCE) loss, it bene-fits from the log-sum-exp trick for numerical stability. Andcompared to BCE loss, it is more robust and less likely tocause numerical problem when computing the inverse ma-trix in the ridge regression step.

`(x, y) = L = {l1, ..., lN}Tln = −wn[yn · log δ(xn) + (1− yn) · log(1− δ(xn))]

(4)where N is the batch size. xn is the input of the loss calcu-lation, and yn (yn ∈ [0, 1]) is the ground truth label. wn isa rescaling weight given to the loss of each batch element.

4. Experiments4.1. Dataset

We verify the proposed method both on DAVIS2016 [30]and SegTrack v2 [21] datasets.

On DAVIS2016, which contains 50 pixel-level annotatedvideo sequences, and each video only contains one targetobject for segmenting. Among these 50 video sequences, 30video sequences as the training set with which the annotatedmask is provided for every frame. And another 20 video

1https://pytorch.org/docs/stable/nn.html

Page 6: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

Black

-swan

Libby

motoc

ross

-bump

s

Bmx

-bump

s

Figure 5. Qualitative results: Homogeneous sample of DAVIS sequences with our result overlaid

Method DAVIS16 Online-Tuning OptFlow CRF BS Speed(s)OFL 68.0 - 7 3 7 42.2BVS 60.0 - 7 7 7 0.37ConvGRU 70.1 7 3 7 7 20VPN 70.2 7 7 7 7 0.63MaskTrack-B 63.2 - 7 7 7 0.24SFL-B 67.4 7 3 7 7 0.30OSVOS-B 52.5 7 7 7 7 0.14OSNM 72.2 7 7 7 7 0.14PML 75.5 7 7 7 7 0.28Ours 75.8 7 7 7 7 0.145PLM 70.0 3 7 7 7 0.50SFL 74.8 3 7 7 7 7.9MaskTrack 69.8 3 7 7 7 12OSVOS 79.8 3 3 7 3 10

Table 1. Performance comparison of our approach with recent approaches on DAVIS 2016 Performance measured in mean IoU.

sequences as the validation set, and only the annotation ofthe first frame is allowed to access.

SegTrack v2 [21] is extended from SegTrack [39]dataset. Both of them contain the dense pixel-level anno-tation for each frame within each video. For segtrack v2dataset, we test our algorithm on all the sequences whichcontain one target object.

4.2. Results on DAVIS2016

Quantitative Results Table 1 shows the experimental re-sults on DAVIS2016 [30] on different methods. Apart fromthe performance (measured by J mean), switches for online-fining, using optical-flow, dense CRF (CRF) and boundarysnapping (BS) are also described. Meanwhile, the infer-ence time is also shown. In particular, compared with mostof the competitors, our algorithm shares the same or muchfaster processing time with superior performance regarding

the segmentation accuracy. Please note that, some methodswhich use much stronger backbone networks are not listedout for the purpose of fair comparison. Compared with OS-VOS [3], for which the online fine-tuning is necessary, ourmethod just takes a smaller fraction of time to do inference.Compared to the baseline method PML [5] which use thesame feature extractor, our method is twice faster and withbetter performance. Compared OSNM [45], with the sameefficiency, our method achieve 3.4 percent improvementsregarding to the segmentation accuracy.

Qualitative Results

Figure 5 demonstrates some visualized results of ourmethod. As shown in Figure 5, our method is not only goodat recovering object details (e.g., the results on the sequenceof blackswan), but also robust against heavy occlusions (eg.the results on the sequences bmx-bumps and libby, dramaticmovement as well as abrupt rotation (eg. the results on the

Page 7: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

Figure 6. Per-sequence results of mean region similarity (J ) . Sequences are sorted by our performance.

Figure 7. Visualized comparison between the proposed methodand other methods. With the red box to denote the error region.

sequence motocross-bumps). However, there are very fewscenarios which may lead to failure cases (denoted by thered box), and mainly caused by the (noisy) objects whichhave not appeared at the first frame of the video, and canbe easily cured by some post-processing steps, includingtracking [6], online adaptation [42, 5].

In Figure 7, we show some visualized results comparedwith OSVOS [3] and PML [5]. For the breakdance, scooter-

black and dance-jump sequences, which contain fast mov-ing and abrupt rotation, OSVOS [3] performs worse thanPML [5]. And for the dog sequence, PML [5] can notachieve a satisfied result due to the dramatic change of thelight conditions. However, on both of these two scenarios,the proposed method performs better than both of OSVOSand PML, which is benefit from robust adaptation ability ofour network.

4.3. Results on SegTrack Dataset

In Figure 8, some visualized results in the segTrack [39]dataset are shown. Which are acquired by direcly utl-ized the model trained on Davis2016 dataset. As can beseen, in most cases, our model maintain a good segmen-tation accuracy, and with a few case fails (as denoted bythe red box), which mainly due to the dramatically changesof the light conditions and exact same appearance betweenthe background and the target object. These results proveour method has a better generalization ability and can bequickly adapted to other unseen objects with very few ex-amples (here, only the annotation in the first frame is pro-vided).

5. Ablation Study5.1. Feature Dimension and Block Splitting

As mention in Section 3.3, since our meta learning mod-ule (ridge regression) requires the computation of matrix in-verse, the training speed will varies significantly regradingthe features with various dimensions utilized for this step.And based on the fact that low dimensional features usu-ally have the faster speed but lose some details of image

Page 8: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

Parachute

Girl

Monkey

Soldier

Worm

Frog

Figure 8. Qualitative results: Homogeneous sample of SegTrack sequences with our result overlaid

Split No Feature Speed Memory Computation Cost1 800 1.50 11590 640k2 400 1.23 11720 320k4 200 0.75 11580 160k8 100 0.86 11584 80k

Table 2. Ablation study on block splitting: feature dimension, run-ning speed, memory and computation cost with different settingsare listed out.

information. On the contrary, high dimensional features aretime-consuming but carry much rich information. We pro-pose a block splitting mechanism to train the meta learner.In Table 2, the splitting number (of feature), feature dimen-sion, running speed (per iteration), memory cost (of thewhole network), as well as computation cost (of the com-putation of matrix inverse) with different settings are listedout. As can be seen, with the feature dimension decreasing,the overall trend are running speed increasing, computationcost decreasing, dramatically. However the memory costreduce slightly, which mainly because of the backbone fea-ture extractor take up most of the memory usage. All thenumbers are tested on the single GPU card (with type ofGTX 1080).

5.2. Per Sequence Performance Analysis

In Figure 6, J mean of per sequence of different methodsare outlined. It is sorted according our algorithm’s perfor-

mance in each sub-sequence, which provides a more intu-itive understanding for the proposed algorithm. Firstly, theproposed method achieve a better video segmentation ac-curacy when compared to many other methods. Secondly,our algorithm works quite well on most of sequences, evenon the most challenging sequences, e.g., breakdance andbmx-tree, the J mean is above 0.5.Thirdly, benefit from thequick adaption ability of meta-learning, around half of se-quence achieve J mean over 0.8. Moreover, our method canwell recover the object details as well as robust against fastmovement and heavy occlusion, which are aligned with ourconclusion in Section 4.2

6. Conclusion

In this paper, we explore applying meta-learning intovideo object segmentation system. A closed form opti-mizer, i.e., ridge regression, is utilized to update the metalearner, which achieves fast speed while maintains the su-perior accuracy. Through iteratively meta-learned, the net-work is capable of conducting fast mapping on unseen ob-jects with a few examples available. Compared to the fine-tuning methods, our algorithm with similar performance butjust a smaller fraction time is required, which is appeal tothe real-world applications. In addition, a block splittingmechanism is delivered to speed up the training process,which also has the benefits of reducing parameters and sav-ing memory. In future work, we would like to use other

Page 9: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

basic optimizers, such as, Newton’s methods and logisticregression. Meanwhile, based on the flexible design of ourmeta-learner, instead of inferring the rest frames from thegiven whole annotation of the first frame. Inferring wholeobject from only part of annotation or user feedback is alsoworth to investigate.

References

[1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video ob-ject segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEEConference on Computer Vision and Pattern Recogni-tion, pages 5977–5986, 2018. 2, 4

[2] L. Bertinetto, J. F. Henriques, P. H. Torr, andA. Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.1, 3

[3] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool. One-shot videoobject segmentation. In CVPR 2017. IEEE, 2017. 1,2, 4, 6, 7

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy,and A. L. Yuille. Deeplab: Semantic image segmen-tation with deep convolutional nets, atrous convolu-tion, and fully connected crfs. IEEE transactions onpattern analysis and machine intelligence, 40(4):834–848, 2018. 3, 4, 5

[5] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool.Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition,pages 1189–1198, 2018. 4, 6, 7

[6] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, andM.-H. Yang. Fast and accurate online video ob-ject segmentation via tracking parts. arXiv preprintarXiv:1806.02323, 2018. 3, 7

[7] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang.Segflow: Joint learning for video object segmentationand optical flow. In Computer Vision (ICCV), 2017IEEE International Conference on, pages 686–695.IEEE, 2017. 2

[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, andL. Fei-Fei. Imagenet: A large-scale hierarchical im-age database. In 2009 IEEE conference on computervision and pattern recognition, pages 248–255. Ieee,2009. 3

[9] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learn-ing of object categories. IEEE transactions on pat-tern analysis and machine intelligence, 28(4):594–611, 2006. 3

[10] C. Finn, P. Abbeel, and S. Levine. Model-agnosticmeta-learning for fast adaptation of deep networks.arXiv preprint arXiv:1703.03400, 2017. 3

[11] V. Garcia and J. Bruna. Few-shot learning with graphneural networks. arXiv preprint arXiv:1711.04043,2017. 3

[12] J. P. Gee. Deep learning properties of good digitalgames: how far can they go? In Serious Games, pages89–104. Routledge, 2009. 1

[13] R. Girshick. Fast r-cnn. In The IEEE InternationalConference on Computer Vision (ICCV), December2015. 1

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep resid-ual learning for image recognition. In Proceedings ofthe IEEE conference on computer vision and patternrecognition, pages 770–778, 2016. 3, 4, 5

[15] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio.Learning to remember rare events. arXiv preprintarXiv:1703.03129, 2017. 3

[16] P. Krahenbuhl and V. Koltun. Efficient inference infully connected crfs with gaussian edge potentials. InAdvances in neural information processing systems,pages 109–117, 2011. 2

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im-agenet classification with deep convolutional neuralnetworks. In F. Pereira, C. J. C. Burges, L. Bottou,and K. Q. Weinberger, editors, Advances in Neural In-formation Processing Systems 25, pages 1097–1105.Curran Associates, Inc., 2012. 1

[18] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum.Human-level concept learning through probabilisticprogram induction. Science, 350(6266):1332–1338,2015. 3

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.Gradient-based learning applied to document recog-nition. Proceedings of the IEEE, 86(11):2278–2324,1998. 3

[20] I. Lenz, H. Lee, and A. Saxena. Deep learning fordetecting robotic grasps. The International Journal ofRobotics Research, 34(4-5):705–724, 2015. 1

[21] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg.Video segmentation by tracking many figure-groundsegments. In Proceedings of the IEEE InternationalConference on Computer Vision, pages 2192–2199,2013. 5, 6

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona,D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoftcoco: Common objects in context. In European con-ference on computer vision, pages 740–755. Springer,2014. 4

Page 10: Abstract - arXivAbstract This paper tackles the problem of video object segmen-tation. We are specifically concerned with the task of seg-menting all pixels of a target object in

[23] J. Long, E. Shelhamer, and T. Darrell. Fully convo-lutional networks for semantic segmentation. In Pro-ceedings of the IEEE conference on computer visionand pattern recognition, pages 3431–3440, 2015. 4

[24] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset,L. Leal-Taixe, D. Cremers, and L. Van Gool. Videoobject segmentation without temporal information.arXiv preprint arXiv:1709.06031, 2017. 1, 2, 4

[25] R. H. Myers and R. H. Myers. Classical and mod-ern regression with applications, volume 2. DuxburyPress Belmont, CA, 1990. 2

[26] D. K. Naik and R. Mammone. Meta-neural networksthat learn by learning. In Neural Networks, 1992.IJCNN., International Joint Conference on, volume 1,pages 437–442. IEEE, 1992. 3

[27] I. Nouretdinov, T. Melluish, and V. Vovk. Ridge re-gression confidence machine. In ICML, pages 385–392, 2001. 4

[28] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deepface recognition. In BMVC, volume 1, page 6, 2015.1

[29] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, andA. Sorkine-Hornung. Learning video object segmen-tation from static images. In Computer Vision and Pat-tern Recognition, volume 2, 2017. 2

[30] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool,M. Gross, and A. Sorkine-Hornung. A benchmarkdataset and evaluation methodology for video objectsegmentation. In The IEEE Conference on ComputerVision and Pattern Recognition (CVPR), June 2016. 1,2, 4, 5, 6

[31] S. Ravi and H. Larochelle. Optimization as a modelfor few-shot learning. 2016. 3

[32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi.You only look once: Unified, real-time object detec-tion. In The IEEE Conference on Computer Visionand Pattern Recognition (CVPR), June 2016. 1

[33] A. Santoro, S. Bartunov, M. Botvinick, D. Wier-stra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International confer-ence on machine learning, pages 1842–1850, 2016. 3

[34] C. Saunders, A. Gammerman, and V. Vovk. Ridgeregression learning algorithm in dual variables. 1998.4

[35] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: themeta-meta-... hook. PhD thesis, Technische Univer-sitat Munchen, 1987. 3

[36] J. Snell, K. Swersky, and R. Zemel. Prototypical net-works for few-shot learning. In Advances in Neural

Information Processing Systems, pages 4077–4087,2017. 3

[37] J. Sun, D. Yu, Y. Li, and C. Wang. Mask propagationnetwork for video object segmentation. arXiv preprintarXiv:1810.10289, 2018. 1, 2

[38] S. Thrun and L. Pratt. Learning to learn. SpringerScience & Business Media, 2012. 3

[39] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg.Motion coherent tracking using multi-label mrf opti-mization. International journal of computer vision,100(2):190–202, 2012. 6, 7

[40] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video seg-mentation via object flow. In The IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR),June 2016. 2

[41] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra,et al. Matching networks for one shot learning. InAdvances in Neural Information Processing Systems,pages 3630–3638, 2016. 3

[42] P. Voigtlaender and B. Leibe. Online adaptation ofconvolutional neural networks for video object seg-mentation. arXiv preprint arXiv:1706.09364, 2017.1, 2, 4, 7

[43] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim.Fast video object segmentation by reference-guidedmask propagation. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition,pages 7376–7385, 2018. 2, 4

[44] Z. Xu, C. Hu, and L. Mei. Video structured descriptiontechnology based intelligence analysis of surveillancevideos for public security applications. MultimediaTools and Applications, 75(19):12155–12172, 2016. 1

[45] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Kat-saggelos. Efficient video object segmentation via net-work modulation. algorithms, 29:15, 2018. 2, 3, 4,6