supplementary for roam: recurrently optimizing tracking model · 2020. 6. 11. · supplementary for...

Supplementary for ROAM: Recurrently Optimizing Tracking Model

Tianyu Yang (1,2), Pengfei Xu (3), Runbo Hu (3), Hua Chai (3), Antoni B. Chan (2)

(1) Tencent AI Lab, (2) City University of Hong Kong, (3) Didi Chuxing

1. Gradient Flow

Note that our recurrent model optimization process involves a gradient update step (see Eq. 9), which may require computing second derivatives when minimizing the meta loss. Specifically, we show the computation graph of the offline training process of our framework in Fig. 1, where the gradients flow backwards along the solid lines and ignore the dashed lines. For the initial frame (Fig. 1 left), we aim to learn a conducive tracking network $\theta^{(0)}$ and learning rates $\lambda^{(0)}$ that can adapt quickly to different videos with one gradient step, which is closely correlated with the tracking model $\theta^{(1)}$. Therefore, we back-propagate through both the gradient update step and the gradient of the update loss (i.e., we compute the second derivative). For the subsequent frames, our goal is to learn the online neural optimizer $O$, which is independent of the tracking model $\theta^{(t)}$ that is being optimized. Hence, we choose to ignore the gradient paths along the update loss gradient $\nabla_{\theta^{(t-1)}} \ell^{(t-1)}$ and the neural optimizer inputs $I^{(t-1)}$, which are composed of historical context vectors computed from the tracking model. The effect of this simplification is to focus the training more on improving the neural optimizer to reduce the loss, rather than on secondary items such as the LSTM input or the loss gradient.
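To make the difference between the two gradient paths concrete, the following is a minimal sketch on a scalar toy problem. It assumes the update step has the form θ' = θ − λ ∇ℓ(θ). The quadratic update loss and meta loss, the constants A, C and D, and all function names are illustrative assumptions and are not taken from the ROAM implementation. The full meta-gradient differentiates through the update-loss gradient (and therefore needs the second derivative of ℓ), whereas the simplified version treats that gradient as a constant, which corresponds to ignoring a dashed path.

```python
# Toy scalar illustration of back-propagating through one gradient update.
# All losses and constants here are assumptions made for this sketch.
A, C, D = 2.0, 1.0, 3.0


def update_loss_grad(theta):
    # Gradient of the assumed update loss l(theta) = 0.5 * A * (theta - C)**2.
    return A * (theta - C)


def one_step(theta, lr):
    # Assumed form of the update step: theta' = theta - lr * dl/dtheta.
    return theta - lr * update_loss_grad(theta)


def meta_loss(theta_new):
    # Assumed meta loss on the updated model: L(theta') = 0.5 * (theta' - D)**2.
    return 0.5 * (theta_new - D) ** 2


def meta_grad_full(theta, lr):
    # Chain rule through the update step AND the update-loss gradient:
    # dL/dtheta = L'(theta') * (1 - lr * l''(theta)), where l''(theta) = A.
    theta_new = one_step(theta, lr)
    return (theta_new - D) * (1.0 - lr * A)


def meta_grad_first_order(theta, lr):
    # The update-loss gradient is treated as a constant, so the
    # second-derivative term (lr * A) is dropped.
    theta_new = one_step(theta, lr)
    return theta_new - D


def meta_grad_numeric(theta, lr, eps=1e-6):
    # Central finite difference of the meta loss with respect to theta.
    hi = meta_loss(one_step(theta + eps, lr))
    lo = meta_loss(one_step(theta - eps, lr))
    return (hi - lo) / (2.0 * eps)


if __name__ == "__main__":
    theta0, lr0 = 0.0, 0.1
    print("full        :", meta_grad_full(theta0, lr0))
    print("first-order :", meta_grad_first_order(theta0, lr0))
    print("numeric     :", meta_grad_numeric(theta0, lr0))
```

In this toy setting the full meta-gradient agrees with the finite-difference estimate, while the first-order version differs from it by the factor (1 − λ ℓ''(θ)).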

Figure 1. Computation graph of the recurrent neural optimizer. Dashed lines are ignored during backpropagation of the gradient.

2. Speed Analysis

We further investigate the speed gain obtained by our recurrent neural optimizer and bounding box regression.

To make a fair speed comparison, we use the same feature backbone (VGG16) and tracking model (two stacked convolutional layers) to build the tracker. We compare our ROAM and ROAM++ with trackers using traditional SGD for model updating under different numbers of gradient steps in Table 1. ROAM++ runs faster than ROAM because it does not need to compute multi-scale features. With a similar run time, our ROAM achieves much better results than SGD-2 (AUC 0.681 vs. 0.593), showing the effectiveness of our recurrent neural optimizer. Note that ROAM and SGD-2, which both use two gradient steps, run at similar speeds (13.4 vs. 13.7 fps), demonstrating that our recurrent neural optimizer has negligible computational cost. Furthermore, vanilla SGD requires more iterations for the model to converge, thus leading to slower running speed.

Tracker    Speed (fps)    AUC
ROAM++     20.6           0.680
ROAM       13.4           0.681
SGD-2      13.7           0.593
SGD-4      12.5           0.597
SGD-8      10.6           0.621
SGD-16      8.0           0.645
SGD-32      5.5           0.644

Table 1. Speed comparison results on the OTB-2015 dataset. SGD-n means n gradient steps. SGD uses a learning rate of 7e-7. Note that both ROAM and ROAM++ use two gradient steps.
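As a small worked example, the frame rates in Table 1 can be converted to per-frame run times to put the gap between ROAM and SGD-2 in absolute terms. The snippet below is only an arithmetic sketch. The helper name `ms_per_frame` is hypothetical, and reading the ROAM vs. SGD-2 difference as the cost of the neural optimizer is an assumption, since other factors may also contribute to the measured speeds.

```python
# Speeds in frames per second, copied from Table 1.
fps = {
    "ROAM++": 20.6,
    "ROAM": 13.4,
    "SGD-2": 13.7,
    "SGD-4": 12.5,
    "SGD-8": 10.6,
    "SGD-16": 8.0,
    "SGD-32": 5.5,
}


def ms_per_frame(frames_per_second):
    # Convert a frame rate into the average time spent on one frame.
    return 1000.0 / frames_per_second


# Extra time per frame of ROAM relative to SGD-2 (both use two gradient steps).
overhead_ms = ms_per_frame(fps["ROAM"]) - ms_per_frame(fps["SGD-2"])

if __name__ == "__main__":
    for name, value in fps.items():
        print(f"{name:7s} {ms_per_frame(value):6.1f} ms/frame")
    print(f"ROAM vs. SGD-2 difference: {overhead_ms:.2f} ms/frame")
```

Under this reading, the difference is roughly 1.6 ms on a per-frame budget of about 75 ms.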

3. Update Loss and Meta Loss Comparison

We compare the update loss $\ell^{(t)}$ and the meta loss $L^{(t)}$ over time between ROAM and SGD after two gradient steps for all videos of OTB-2015 in Figure 2 and Figure 3, respectively. Our ROAM demonstrates better optimization ability in minimizing the tracking loss.

4. Impact of Training Data

We also train variants of our ROAM++ and ROAM with only ImageNet VID training data, which are denoted as VID-ROAM++ and VID-ROAM, respectively. Table 2 shows the comparison results on the VOT datasets. With only the VID dataset for training, our ROAM++ still outperforms our preliminary tracker ROAM. Furthermore, both our VID-ROAM++ and VID-ROAM outperform the baseline MetaTracker [1] by a large margin.

               VOT-2016                    VOT-2017
Tracker        EAO      A        R         EAO      A        R
ROAM++         0.4414   0.5987   0.1743    0.3803   0.5433   0.1945
VID-ROAM++     0.4147   0.5886   0.1749    0.3330   0.5513   0.2366
ROAM           0.3842   0.5558   0.1833    0.3313   0.5048   0.2263
VID-ROAM       0.3568   0.5468   0.2091    0.2971   0.5098   0.2491
MetaTracker    0.317    0.519    -         -        -        -

Table 2. Comparison results on the VOT-2016 and VOT-2017 datasets with different training datasets (EAO: expected average overlap, A: accuracy, R: robustness).

5. More Ablations

We investigate the impact of removing scale smoothing and multi-scale search from the ROAM tracker. We also present a variant of ROAM++ that does not resize the tracking model, to demonstrate the effectiveness of our resizable model. Table 3 shows the comparison results on the OTB dataset. Removing multi-scale search or filter resizing decreases the performance substantially (AUC 0.590 and 0.553, respectively), while removing scale smoothing has a relatively small effect (AUC 0.667 vs. 0.681).

Tracker                       AUC
ROAM                          0.681
ROAM++                        0.680
ROAM w/o scale smoothing      0.667
ROAM w/o multi-scale          0.590
ROAM++ w/o filter resizing    0.553

Table 3. Performance comparison of different variants on the OTB dataset.

6. Bounding Box Parametrization

We follow the bounding box parametrization of [2] when computing the smooth L1 loss in (7). Let $(x, y, w, h)$ be the center coordinates, width and height of the ground truth box, and $(x^*, y^*, w^*, h^*)$ be those of the predicted box. The parametrizations of the bounding boxes are:

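$$B = (t_x, t_y, t_w, t_h) = \left( \frac{x - x_a}{a_w},\ \frac{y - y_a}{a_h},\ \log\frac{w}{a_w},\ \log\frac{h}{a_h} \right),$$

$$B^* = (t_x^*, t_y^*, t_w^*, t_h^*) = \left( \frac{x^* - x_a}{a_w},\ \frac{y^* - y_a}{a_h},\ \log\frac{w^*}{a_w},\ \log\frac{h^*}{a_h} \right)$$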

where the anchor size $(a_w, a_h)$ is defined in (6) and this anchor is used at each spatial location on the feature map. $(x_a, y_a)$ is the center of each anchor. $R(F; \theta_{reg})$ (in Eq. 7) corresponds to $B^*$, which is the parametrized predicted bbox, and $B$ is the parametrized ground truth bbox. After getting $R(F; \theta_{reg})$, we compute the predicted object box $(x^*, y^*, w^*, h^*)$.
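The parametrization and its inverse can be sketched as follows. This is a minimal illustration rather than the ROAM implementation: the function names `encode_box` and `decode_box`, the tuple layout of the anchor, and the example numbers are assumptions, and the decoding step is simply the algebraic inverse of the parametrization with respect to the anchor center $(x_a, y_a)$ and anchor size $(a_w, a_h)$.

```python
import math


def encode_box(box, anchor):
    # box = (x, y, w, h): center coordinates, width and height.
    # anchor = (xa, ya, aw, ah): anchor center and anchor size (assumed layout).
    x, y, w, h = box
    xa, ya, aw, ah = anchor
    return (
        (x - xa) / aw,
        (y - ya) / ah,
        math.log(w / aw),
        math.log(h / ah),
    )


def decode_box(params, anchor):
    # Invert the parametrization to recover (x, y, w, h) from (tx, ty, tw, th).
    tx, ty, tw, th = params
    xa, ya, aw, ah = anchor
    return (
        tx * aw + xa,
        ty * ah + ya,
        aw * math.exp(tw),
        ah * math.exp(th),
    )


if __name__ == "__main__":
    anchor = (50.0, 40.0, 20.0, 10.0)   # hypothetical anchor
    gt_box = (55.0, 38.0, 30.0, 12.0)   # hypothetical ground truth box
    target = encode_box(gt_box, anchor)
    print("parametrized:", target)
    print("decoded     :", decode_box(target, anchor))
```

Encoding a box and then decoding it with the same anchor returns the original box up to floating-point error, which is a convenient sanity check for any implementation of this parametrization.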

References

[1] Eunbyung Park and Alexander C. Berg. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers. In ECCV, 2018.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, 2015.

[Figure 2: one panel per OTB-2015 video, Basketball through Woman; x-axis: frame number, y-axis: update loss; curves: ROAM and SGD.]

Figure 2. Update loss comparison between ROAM and SGD.

[Figure 3: one panel per OTB-2015 video, Basketball through Woman; x-axis: frame number, y-axis: meta loss; curves: ROAM and SGD.]

Figure 3. Meta loss comparison between ROAM and SGD.
