supplementary for roam: recurrently optimizing tracking model · 2020. 6. 11. · supplementary for...




Supplementary for ROAM: Recurrently Optimizing Tracking Model

Tianyu Yang1,2 Pengfei Xu3 Runbo Hu3 Hua Chai3 Antoni B. Chan2

1 Tencent AI Lab 2 City University of Hong Kong 3 Didi Chuxing

1. Gradient Flow

Note that our recurrent model optimization process involves a gradient update step (see Eq. 9), so minimizing the meta loss may require computing second derivatives. Specifically, we show the computation graph of the offline training process of our framework in Fig. 1, where the gradients flow backwards along the solid lines and ignore the dashed lines. For the initial frame (Fig. 1 left), we aim to learn a tracking network $\theta^{(0)}$ and learning rates $\lambda^{(0)}$ that can quickly adapt to different videos with one gradient step, which is closely correlated with the tracking model $\theta^{(1)}$. Therefore, we back-propagate through both the gradient update step and the gradient of the update loss (i.e., we compute the second derivative). For the subsequent frames, our goal is to learn the online neural optimizer $\mathcal{O}$, which is independent of the tracking model $\theta^{(t)}$ that is being optimized. Hence, we choose to ignore the gradient paths along the update loss gradient $\nabla_{\theta^{(t-1)}} \ell^{(t-1)}$ and the neural optimizer inputs $I^{(t-1)}$, which are composed of historical context vectors computed from the tracking model. The effect of this simplification is to focus the training on improving the neural optimizer to reduce the loss, rather than on secondary items such as the LSTM input or the loss gradient.

Figure 1. Computation graph of the recurrent neural optimizer. Dashed lines are ignored during backpropagation of the gradient.
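To make the difference between the two gradient paths concrete, the following is a minimal scalar sketch, not the training code of the paper. It assumes the update step has the form of a single gradient step with a learning rate, and it uses hypothetical toy losses (a quadratic update loss and a quadratic meta loss) and hypothetical values. Back-propagating through the update-loss gradient introduces a factor containing the second derivative of the update loss; treating that gradient as a constant drops it.

```python
# Toy scalar illustration; the losses, names, and values are hypothetical.

def update_loss_grad(theta, a):
    """Gradient of the toy update loss l(theta) = 0.5 * a * theta**2."""
    return a * theta

def meta_loss(theta, c):
    """Toy meta loss L(theta) = 0.5 * (theta - c)**2."""
    return 0.5 * (theta - c) ** 2

def one_step(theta0, lam, a):
    """Assumed update form: one gradient step with learning rate lam."""
    return theta0 - lam * update_loss_grad(theta0, a)

def meta_grad_full(theta0, lam, a, c):
    """dL/dtheta0 when back-propagating through the update-loss gradient.

    The factor (1 - lam * a) contains a, the second derivative of the
    toy update loss.
    """
    theta1 = one_step(theta0, lam, a)
    return (theta1 - c) * (1.0 - lam * a)

def meta_grad_truncated(theta0, lam, a, c):
    """dL/dtheta0 when the update-loss gradient is treated as a constant."""
    theta1 = one_step(theta0, lam, a)
    return theta1 - c

def finite_difference(theta0, lam, a, c, eps=1e-6):
    """Central-difference check of the full meta gradient."""
    upper = meta_loss(one_step(theta0 + eps, lam, a), c)
    lower = meta_loss(one_step(theta0 - eps, lam, a), c)
    return (upper - lower) / (2 * eps)

theta0, lam, a, c = 2.0, 0.1, 3.0, 0.5
full = meta_grad_full(theta0, lam, a, c)
truncated = meta_grad_truncated(theta0, lam, a, c)
numeric = finite_difference(theta0, lam, a, c)
print(full, truncated, numeric)
```

In this toy setting the full gradient agrees with the finite-difference estimate, while the truncated gradient differs by the factor (1 - lam * a).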

2. Speed Analysis

We further investigate the speed gain obtained by our recurrent neural optimizer and bounding box regression. To make a fair speed comparison, we use the same feature backbone (VGG16) and tracking model (two stacked convolutional layers) to build the tracker. We compare our ROAM and ROAM++ with trackers that use traditional SGD for model updating under different numbers of gradient steps in Table 1. ROAM++ runs faster than ROAM because it does not need to compute multi-scale features. With a similar run time, our ROAM achieves much better results than SGD-2, showing the effectiveness of our recurrent neural optimizer. Note that ROAM and SGD-2, which both use two gradient steps, run at similar speeds, demonstrating that our recurrent neural optimizer has negligible computational cost. Furthermore, vanilla SGD requires more iterations for the model to converge, which leads to slower running speed.

Tracker    Speed (fps)   AUC
ROAM++     20.6          0.680
ROAM       13.4          0.681
SGD-2      13.7          0.593
SGD-4      12.5          0.597
SGD-8      10.6          0.621
SGD-16      8.0          0.645
SGD-32      5.5          0.644

Table 1. Speed comparison results on the OTB-2015 dataset. SGD-n means n gradient steps. SGD uses a learning rate of 7e-7. Note that both ROAM and ROAM++ use two gradient steps.
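As a small worked example, the frame rates in Table 1 can be converted to per-frame latency, which makes the cost of the neural optimizer relative to SGD-2 easier to read. The sketch below only re-expresses the numbers from Table 1; the helper name `latency_ms` is introduced here for illustration.

```python
# Speed (fps) and AUC values copied from Table 1.
TABLE_1 = {
    "ROAM++": (20.6, 0.680),
    "ROAM": (13.4, 0.681),
    "SGD-2": (13.7, 0.593),
    "SGD-4": (12.5, 0.597),
    "SGD-8": (10.6, 0.621),
    "SGD-16": (8.0, 0.645),
    "SGD-32": (5.5, 0.644),
}

def latency_ms(fps):
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps

for name, (fps, auc) in TABLE_1.items():
    print(f"{name:8s} {latency_ms(fps):6.1f} ms/frame  AUC {auc:.3f}")

# Extra time per frame of ROAM relative to SGD-2 (both use two gradient steps).
overhead_ms = latency_ms(TABLE_1["ROAM"][0]) - latency_ms(TABLE_1["SGD-2"][0])
print(f"ROAM vs SGD-2 overhead: {overhead_ms:.2f} ms/frame")
```

By this conversion, ROAM spends under 2 ms more per frame than SGD-2.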

3. Update Loss and Meta Loss Comparison

We compare the update loss $\ell^{(t)}$ and the meta loss $L^{(t)}$ over time between ROAM and SGD after two gradient steps for all videos of OTB-2015 in Figure 2 and Figure 3, respectively. Our ROAM demonstrates a better ability to minimize the tracking loss.

4. Impact of Training Data

We also train variants of our ROAM++ and ROAM with the ImageNet VID training data, denoted as VID-ROAM++ and VID-ROAM, respectively. Table 2 shows the comparison results on the VOT datasets. With only the VID dataset for training, our ROAM++ still outperforms our preliminary tracker ROAM. Furthermore, both our VID-ROAM++ and VID-ROAM outperform the baseline MetaTracker [1] by a large margin.

              VOT-2016                    VOT-2017
Tracker       EAO     A       R           EAO     A       R
ROAM++        0.4414  0.5987  0.1743      0.3803  0.5433  0.1945
VID-ROAM++    0.4147  0.5886  0.1749      0.3330  0.5513  0.2366
ROAM          0.3842  0.5558  0.1833      0.3313  0.5048  0.2263
VID-ROAM      0.3568  0.5468  0.2091      0.2971  0.5098  0.2491
MetaTracker   0.317   0.519   -           -       -       -

Table 2. Comparison results on the VOT-2016 and VOT-2017 datasets with different training datasets. EAO: expected average overlap; A: accuracy; R: robustness.

5. More Ablations

We investigate the impact of removing scale smoothing and multi-scale search for the ROAM tracker. We also present a variant of ROAM++ that does not resize the tracking model, to demonstrate the effectiveness of our resizable model. Table 3 shows the comparison results on the OTB dataset. We can see that removing multi-scale search or filter resizing decreases the performance substantially, while removing scale smoothing has a relatively small effect.

Tracker                       AUC
ROAM                          0.681
ROAM++                        0.680
ROAM w/o scale smoothing      0.667
ROAM w/o multi-scale search   0.590
ROAM++ w/o filter resizing    0.553

Table 3. Performance comparison of different variants on the OTB dataset.

6. Bounding Box Parametrization

We follow the bounding box parametrization of [2] when computing the smooth L1 loss in Eq. (7). Let $(x, y, w, h)$ be the center coordinates, width and height of the ground-truth box, and $(x^*, y^*, w^*, h^*)$ those of the predicted box. The bounding boxes are parametrized as

$$B = (t_x, t_y, t_w, t_h) = \left( \frac{x - x_a}{a_w},\ \frac{y - y_a}{a_h},\ \log\frac{w}{a_w},\ \log\frac{h}{a_h} \right),$$

$$B^* = (t^*_x, t^*_y, t^*_w, t^*_h) = \left( \frac{x^* - x_a}{a_w},\ \frac{y^* - y_a}{a_h},\ \log\frac{w^*}{a_w},\ \log\frac{h^*}{a_h} \right),$$

where the anchor size $(a_w, a_h)$ is defined in Eq. (6), the same anchor is used at each spatial location of the feature map, and $(x_a, y_a)$ is the center of each anchor. $R(F; \theta_{reg})$ (in Eq. 7) corresponds to $B^*$, the parametrized predicted box, and $B$ is the parametrized ground-truth box. After obtaining $R(F; \theta_{reg})$, we compute the predicted object box $(x^*, y^*, w^*, h^*)$ by inverting this parametrization.
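The parametrization and its inverse can be sketched in a few lines of Python. This is a minimal illustration of the formulas in this section, following the convention of [2], and not the authors' implementation; the function names and the numeric anchor and box values are hypothetical.

```python
import math

def encode_box(box, anchor):
    """Map a box (x, y, w, h) to offsets (t_x, t_y, t_w, t_h).

    anchor is (x_a, y_a, a_w, a_h): anchor center and anchor size.
    """
    x, y, w, h = box
    xa, ya, aw, ah = anchor
    return (
        (x - xa) / aw,
        (y - ya) / ah,
        math.log(w / aw),
        math.log(h / ah),
    )

def decode_box(offsets, anchor):
    """Invert encode_box: recover (x, y, w, h) from the offsets."""
    tx, ty, tw, th = offsets
    xa, ya, aw, ah = anchor
    return (
        xa + tx * aw,
        ya + ty * ah,
        aw * math.exp(tw),
        ah * math.exp(th),
    )

# Hypothetical anchor and ground-truth box in pixel units.
anchor = (64.0, 64.0, 32.0, 48.0)
gt_box = (70.0, 58.0, 40.0, 36.0)

target = encode_box(gt_box, anchor)    # plays the role of B
recovered = decode_box(target, anchor)
print(target)
print(recovered)
```

Decoding the regression output with the same anchor is the step that turns the parametrized prediction back into a box in image coordinates.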

References

[1] Eunbyung Park and Alexander C. Berg. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers. In ECCV, 2018.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, 2015.

Figure 2. Update loss comparison between ROAM and SGD. (Plots not reproduced: one panel per OTB-2015 video, in alphabetical order from Basketball to Woman; x-axis: frame number; y-axis: update loss; curves: ROAM and SGD.)

Figure 3. Meta loss comparison between ROAM and SGD. (Plots not reproduced: one panel per OTB-2015 video, in alphabetical order from Basketball to Woman; x-axis: frame number; y-axis: meta loss; curves: ROAM and SGD.)