
Probabilistic 3D Multi-Object Tracking for Autonomous Driving

Hsu-kuang Chiu¹, Antonio Prioletti², Jie Li², and Jeannette Bohg¹

¹Stanford University, ²Toyota Research Institute

Abstract

3D multi-object tracking is a key module in autonomous driving applications that provides a reliable dynamic representation of the world to the planning module. In this paper, we present our on-line tracking method, which took first place in the NuScenes Tracking Challenge, held at the AI Driving Olympics Workshop at NeurIPS 2019. Our method estimates the object states with a Kalman Filter. We initialize the state covariance as well as the process and observation noise covariances with statistics from the training set. We also use the stochastic information from the Kalman Filter in the data association step by measuring the Mahalanobis distance between the predicted object states and the current object detections. Our experimental results on the NuScenes validation and test sets show that our method outperforms the AB3DMOT baseline method by a large margin in the Average Multi-Object Tracking Accuracy (AMOTA) metric. Our code will be available soon.¹

1. Introduction

3D multi-object tracking is essential for autonomous driving. Its aim is to estimate the location, orientation, and scale of all the objects in the environment over time. By taking temporal information into account, a tracking module can filter outliers in frame-by-frame object detectors and be robust to partial or full occlusions. Thereby, it promises to identify the trajectories of different categories of moving objects, such as pedestrians, bicycles, and cars. The resulting trajectories may then be used to infer motion patterns and driving behaviours for improved forecasting. This in turn helps planning to enable autonomous driving.

In this paper, we approach the 3D multi-object tracking problem with a Kalman Filter [4]. We model the state of each object with its 3D position, orientation and scale as well as linear and angular velocity. For the prediction step, we use a process model with constant linear and angular velocity. We model the unknown accelerations as Gaussian random variables. For the update step, we consider the detections provided by an object detector as measurements. Similar to the process model, we also model measurement noise as Gaussian random variables. To ensure robustness in multi-object tracking we found the following two steps to be essential: (i) we employ the Mahalanobis distance [5] for outlier detection and data association between predicted and actual object detections; (ii) we estimate the covariance matrices of the initial state and of the process and observation noise from the training data.

¹ https://github.com/eddyhkchiu/mahalanobis_3d_multi_object_tracking

For data association between the predicted and actual object detections, we found that using the Mahalanobis distance [5] is better than using the 3D Intersection-Over-Union (3D-IOU) as in the AB3DMOT [8] baseline and other previous works. Unlike the 3D-IOU, the Mahalanobis distance takes into account the uncertainty about the predicted object state as provided by the Kalman Filter in the form of the state covariance matrix. Moreover, the Mahalanobis distance can provide a distance measurement even when prediction and detection do not overlap. In this case, the 3D-IOU gives zero, which prevents any data association. However, non-overlapping detections and predictions are highly common in driving scenarios due to sudden accelerations and for smaller objects such as pedestrians and bicycles.

Correctly choosing the initial state and noise covariance matrices is fundamental for filter convergence. Moreover, the reliability of the Mahalanobis distance directly depends on the choice of these values, which thereby influence the quality of data association. We extract the statistics of the training data to perform this initialization. This also ensures that our experiments on the validation and test sets do not use any future or ground-truth information.

We evaluate our approach in the NuScenes Tracking Challenge [1] using the provided MEGVII [9] detection results as measurements. Our proposed method outperforms the AB3DMOT [8] baseline by a large margin in terms of the Average Multi-Object Tracking Accuracy (AMOTA) and took first place in the NuScenes Tracking Challenge.

arXiv:2001.05673v1 [cs.CV] 16 Jan 2020

2. Related Work

2.1. 3D Object Detection

The 3D object detection component provides object bounding boxes for each frame as measurements to 3D multi-object tracking systems. Therefore, the quality of the 3D object detector is essential for the final tracking accuracy. In general, most Lidar-based 3D object detection methods belong to one of two categories: voxel-based or point-based methods. Voxel-based methods first divide the 3D space into equally-sized 3D voxels to generate 3D feature tensors based on the points inside each voxel. Then the feature tensors are fed to 3D CNNs to predict the object bounding boxes. Point-based methods do not need the quantization step. Those methods directly apply PointNet++ [6] on the raw point cloud data for detecting the objects in the 3D space. A more recent work that achieves state-of-the-art results in the NuScenes Detection Challenge [1] is the MEGVII [9] model. This model is a voxel-based method and utilizes sparse 3D convolution to extract the semantic features. Then a region proposal network and class-balanced multi-head networks are used for final object detection. We use the MEGVII [9] detection results as measurements for the 3D multi-object tracker.

2.2. 3D Multi-Object Tracking

Several 3D multi-object tracking methods are extensions of 2D tracking methods. Weng et al. [8] propose AB3DMOT, a simple yet effective on-line tracking method based on a 3D Kalman Filter. Hu et al. [3] combine LSTM-based 3D motion estimation with 2D image deep feature association for solving the 3D tracking problem. Unlike the above methods, which rely only on the image or point cloud sensor input data, Argoverse [2] further uses map information, such as lanes and drivable areas, to improve 3D multi-object tracking accuracy in driving scenarios.

In our work, we use AB3DMOT [8] as our baseline. AB3DMOT uses a 3D Kalman Filter for tracking. Each of their Kalman Filter states includes the center position, rotation angle, length, width, and height of the object bounding box, and the center velocity, while excluding the angular velocity. Their Kalman Filter covariance matrices are identity matrices multiplied with a heuristically chosen scalar. Moreover, AB3DMOT uses the 3D-IOU as the affinity function and the Hungarian algorithm for data association.

In our approach, we propose to utilize the Mahalanobis distance [5] for measuring affinity between predictions and detections with or without direct overlap. This distance takes the uncertainty in the predictions into account and is standard practice for outlier detection in filtering methods [7]. In our approach, we also estimate the state and noise covariance matrices from the statistics of the training data. In experiments, we quantitatively show that these two measures improve the performance of the multi-object tracker by a large margin. We also included angular velocity in the state and found that qualitatively the trajectories look more accurate, especially in terms of object orientation.

3. A Kalman Filter for Multi-Object Tracking

In this section, we introduce our proposed 3D multi-object tracking algorithm built upon a Kalman Filter [4]. In the prediction step, we use a process model assuming constant linear and angular velocity. We assume that an object detector provides frame-by-frame measurements to the filter. These detections are matched to the predicted detections to then update the current object state estimates. The overall architecture is shown in Fig. 1. In the following sections, we first model the dynamical system we are estimating and then describe how we tune the open parameters in the filter as well as how we perform data association.

3.1. Object State

We model each object's state with a tuple of 11 variables:

$$s_t = (x, y, z, a, l, w, h, d_x, d_y, d_z, d_a)^T, \tag{1}$$

where $(x, y, z)$ represent the 3D object center position, $a$ represents the object orientation about the z-axis, $(l, w, h)$ represent the length, width, and height of the object's bounding box, and $(d_x, d_y, d_z, d_a)$ represent the change of $(x, y, z, a)$ from the previous frame to the current frame. The last four variables are the linear and angular velocity of the object center multiplied by a constant $\Delta t$. Note that we are tracking multiple objects and therefore maintain $M$ such states, one per tracked object in the scene.

3.2. Process Model

We model the dynamics of the moving objects using the following process model:

$$
\begin{aligned}
x_{t+1} &= x_t + d_{x_t} + q_{x_t}, & d_{x_{t+1}} &= d_{x_t} + q_{d_{x_t}} \\
y_{t+1} &= y_t + d_{y_t} + q_{y_t}, & d_{y_{t+1}} &= d_{y_t} + q_{d_{y_t}} \\
z_{t+1} &= z_t + d_{z_t} + q_{z_t}, & d_{z_{t+1}} &= d_{z_t} + q_{d_{z_t}} \\
a_{t+1} &= a_t + d_{a_t} + q_{a_t}, & d_{a_{t+1}} &= d_{a_t} + q_{d_{a_t}} \\
l_{t+1} &= l_t \\
w_{t+1} &= w_t \\
h_{t+1} &= h_t
\end{aligned}
$$

where we model the unknown linear and angular acceleration as random variables $(q_{x_t}, q_{y_t}, q_{z_t}, q_{a_t})$ and $(q_{d_{x_t}}, q_{d_{y_t}}, q_{d_{z_t}}, q_{d_{a_t}})$ that follow a Gaussian distribution with zero mean and covariance $Q$. We assume constant linear and angular velocity as well as constant object dimensions, i.e. they do not change during the prediction step. Note that those variables may change during the update step.

[Figure 1 diagram: for each time step (t-1, t, t+1), Sensor Input feeds a 3D Object Detector that outputs Detections; the Detections are matched by Mahalanobis distance to the Predictions from the Kalman Filter Predict step; the Kalman Filter Update step then produces the Tracking Result (Tracking IDs 1 to 6).]

Figure 1: Architecture Overview. We use 3D object detection results as measurements. At each timestep, we use the Mahalanobis distance [5] to compute the distance between object detections and predictions. Given this distance, we perform data association. The Kalman Filter [4] then updates the current state estimates. It uses a constant velocity model for predicting the mean and covariance of the state in the next time step.

We can then write the Kalman Filter prediction step in matrix form as follows:

$$\mu_{t+1} = A \mu_t \tag{2}$$

$$\Sigma_{t+1} = A \Sigma_t A^T + Q \tag{3}$$

where $\mu_t$ is the estimated mean of the true state $s$ at time $t$, and $\mu_{t+1}$ is the predicted state mean at time $t+1$. The matrix $A$ is the state transition matrix of the process model. The matrix $\Sigma_t$ is the state covariance at time $t$, and $\Sigma_{t+1}$ is the predicted state covariance at time $t+1$.
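As an illustration, the prediction step of Equations 2 and 3 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the released implementation: the state ordering follows Equation 1, while the function names `make_transition_matrix` and `predict` and all numeric values are hypothetical and chosen only for the example.

```python
import numpy as np

STATE_DIM = 11  # (x, y, z, a, l, w, h, dx, dy, dz, da), ordered as in Equation 1


def make_transition_matrix():
    """State transition matrix A of the constant-velocity process model."""
    A = np.eye(STATE_DIM)
    # x, y, z, a (indices 0..3) are advanced by dx, dy, dz, da (indices 7..10).
    for pos_idx, vel_idx in zip(range(0, 4), range(7, 11)):
        A[pos_idx, vel_idx] = 1.0
    return A


def predict(mu, sigma, Q):
    """Prediction step: mean A @ mu and covariance A @ sigma @ A.T + Q."""
    A = make_transition_matrix()
    return A @ mu, A @ sigma @ A.T + Q


# Hypothetical example values (illustrative only, not from the experiments).
mu = np.array([1.0, 2.0, 0.5, 0.1, 4.0, 2.0, 1.5, 0.5, -0.2, 0.0, 0.05])
sigma = np.eye(STATE_DIM)
Q = np.diag([0.1] * 4 + [0.0] * 3 + [0.1] * 4)
mu_pred, sigma_pred = predict(mu, sigma, Q)
```

In this example the box dimensions (l, w, h) and the velocity terms are carried over unchanged in the mean, while the predicted position variance grows because it inherits the velocity uncertainty and the process noise.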

3.3. Observation Model

We assume that an object detector provides us with $N$ frame-by-frame measurements $o$ of object states, i.e. position, orientation and bounding box scale. The number of detections may differ from the number of tracked, individual objects. For now, let us assume that we already matched one of the detections to an object state. In the next section, we provide detail on data association.

As detections are direct measurements of parts of the state $\mu$, the linear observation model has the following matrix form: $H_{7 \times 11} = [I \;\; 0]$. Similar to the process model, we assume that the observation noise follows a Gaussian distribution with zero mean and covariance $R$. Using this observation model and the predicted object state $\mu_{t+1}$, we can predict the next measurement $o_{t+1}$ and the innovation covariance $S_{t+1}$ that represents the uncertainty of the predicted object detection:

$$o_{t+1} = H \mu_{t+1} \tag{4}$$

$$S_{t+1} = H \Sigma_{t+1} H^T + R \tag{5}$$

We will discuss how we estimate the values of $\Sigma_0$, $Q$, and $R$ in Section 3.6.
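The observation model of Equations 4 and 5 can be sketched as follows. This is an illustrative sketch only: the function name `predict_measurement` and the numeric values of the predicted state, its covariance, and R are assumptions made for the example.

```python
import numpy as np

STATE_DIM, OBS_DIM = 11, 7

# H = [I 0]: a detection observes (x, y, z, a, l, w, h) but not the velocities.
H = np.hstack([np.eye(OBS_DIM), np.zeros((OBS_DIM, STATE_DIM - OBS_DIM))])


def predict_measurement(mu_pred, sigma_pred, R):
    """Predicted detection H @ mu (Eq. 4) and innovation covariance S (Eq. 5)."""
    o_pred = H @ mu_pred
    S = H @ sigma_pred @ H.T + R
    return o_pred, S


# Hypothetical values for illustration.
mu_pred = np.arange(STATE_DIM, dtype=float)
sigma_pred = 2.0 * np.eye(STATE_DIM)
R = 0.5 * np.eye(OBS_DIM)
o_pred, S = predict_measurement(mu_pred, sigma_pred, R)
```

Because H only selects the first seven state components, the predicted detection is simply the pose and size part of the predicted state, and S is the corresponding 7×7 block of the predicted covariance plus R.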

3.4. Data Association

We are using an object detector to provide the Kalman Filter with $N$ measurements. As the detector results can be noisy, we need to design a data association mechanism to decide which detection to pair with a predicted object state and which detections to treat as outliers. Previous work [8] has used 3D-IOU to measure the affinity between predictions and detections. We adopt the fairly standard practice [7] of using the Mahalanobis distance [5] instead. This distance $m$ measures the difference between predicted detections $H\mu_{t+1}$ and actual detections $o_{t+1}$ weighted by the uncertainty about the prediction as expressed through the innovation covariance $S_{t+1}$:

$$m = \sqrt{(o_{t+1} - H\mu_{t+1})^T S_{t+1}^{-1} (o_{t+1} - H\mu_{t+1})}. \tag{6}$$

We also adopt the orientation correction approach from the AB3DMOT [8] baseline. Specifically, when the angle difference between the detection and prediction is between 90 and 270 degrees, we rotate the prediction's angle by 180 degrees before calculating the Mahalanobis distance. Such a large angle difference usually stems from the detector outputting an incorrect facing direction of the object. Furthermore, it is unlikely that the object makes such a large turn in the short time duration between consecutive frames. In our experiments, we show that the Mahalanobis distance provides better tracking performance than the 3D-IOU.
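A minimal sketch of Equation 6 together with the orientation correction is given below. It assumes angles in radians and the detection ordering (x, y, z, a, l, w, h); the helper names `wrap_angle` and `mahalanobis_distance` and the example values are hypothetical and not taken from the released code.

```python
import numpy as np

ANGLE_IDX = 3  # index of the orientation a in a detection (x, y, z, a, l, w, h)


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi


def mahalanobis_distance(o_pred, S, o_det):
    """Equation 6, with the orientation correction applied to the prediction."""
    diff = np.asarray(o_det, dtype=float) - np.asarray(o_pred, dtype=float)
    angle_diff = wrap_angle(diff[ANGLE_IDX])
    # A wrapped difference above 90 degrees in magnitude corresponds to a raw
    # difference between 90 and 270 degrees: rotate the prediction by 180 degrees.
    if abs(angle_diff) > np.pi / 2.0:
        angle_diff = wrap_angle(angle_diff - np.pi)
    diff[ANGLE_IDX] = angle_diff
    # Solve S y = diff instead of forming the explicit inverse of S.
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))


# Hypothetical example: the detection is 2 m away and faces the opposite way.
o_pred = np.zeros(7)
S = 4.0 * np.eye(7)
o_det = np.array([2.0, 0.0, 0.0, np.pi, 0.0, 0.0, 0.0])
d_flipped = mahalanobis_distance(o_pred, S, o_det)
```

In the example the 180-degree heading disagreement is removed by the correction, so only the 2 m position offset contributes to the distance.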

Given the distances between all predictions and detections, we solve a bipartite matching problem to find the optimal pairing. Specifically, we employ a greedy algorithm with an upper bound threshold value to solve this problem. The algorithm is described in detail in Algorithm 1. Compared to the Hungarian algorithm as used in the AB3DMOT baseline [8], the greedy approach performs better as shown in our ablative analysis in Section 4.3.

3.5. Kalman Filter Update Step

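Given the matched pairs of detections and predictions, we can now update the predicted state mean and covariance at time $t+1$ by using the following equations:

$$K_{t+1} = \Sigma_{t+1} H^T S_{t+1}^{-1}$$

$$\mu_{t+1} = \mu_{t+1} + K_{t+1}(o_{t+1} - H\mu_{t+1})$$

$$\Sigma_{t+1} = (I - K_{t+1} H)\Sigma_{t+1}$$

where $K$ refers to the Kalman Gain and the matrix $I$ is an identity matrix. In the last two equations, the left-hand sides are the updated estimates, while $\mu_{t+1}$ and $\Sigma_{t+1}$ on the right-hand sides are the predicted mean and covariance from Equations 2 and 3.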

We also follow the aforementioned orientation correction in the update step. We adopt the birth-and-death memory module from the AB3DMOT [8] baseline: we initialize a track after having matches for 3 consecutive frames, and we terminate a track when it does not match any detection for 2 consecutive frames.
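The update equations of this section can be sketched in a few lines of NumPy. This is a hedged illustration rather than the released implementation: the function name `update` and the example values are assumptions, and the orientation correction and the birth-and-death logic are omitted for brevity.

```python
import numpy as np


def update(mu_pred, sigma_pred, o_det, H, R):
    """Kalman Filter update of a predicted state with a matched detection."""
    S = H @ sigma_pred @ H.T + R               # innovation covariance (Eq. 5)
    K = sigma_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    mu_new = mu_pred + K @ (o_det - H @ mu_pred)
    sigma_new = (np.eye(len(mu_pred)) - K @ H) @ sigma_pred
    return mu_new, sigma_new


# Hypothetical example: 11-dimensional state, 7-dimensional detection.
H = np.hstack([np.eye(7), np.zeros((7, 4))])
mu_pred = np.zeros(11)
sigma_pred = np.eye(11)
R = np.eye(7)
o_det = np.array([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
mu_new, sigma_new = update(mu_pred, sigma_pred, o_det, H, R)
```

With equal prediction and observation uncertainty, as in this example, the updated position lands halfway between the prediction and the detection, and the variance of every observed component is halved.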

3.6. Covariance Matrices Estimation

Rather than using identity matrices and heuristically chosen scalars to build the covariance matrices of the Kalman Filter as in AB3DMOT [8], we use the statistics of the training set data to estimate the initial state covariance and the process and observation noise covariances. Note that we did not use the statistics from the validation or test set, to make sure that our experiments do not use any future or ground-truth information in the evaluation.

Algorithm 1: Greedy Algorithm for Data Association at time t

Input:
    M predicted means and innovation covariance matrices, one per tracked object:
        P = {(µ[1], S[1]), (µ[2], S[2]), ..., (µ[M], S[M])}.
    N detections: D = {o[1], o[2], ..., o[N]}.
    A threshold T as the upper bound of a matched pair's Mahalanobis distance.
Output:
    List of bipartite matched pair indices sorted by the Mahalanobis distance.
Initialization:
    List ← ∅
    MatchedP ← ∅
    MatchedD ← ∅
    Distance ← array[M][N]
for i ← 1 to M do
    for j ← 1 to N do
        Distance[i][j] ← MahalanobisDistance((µ[i], S[i]), o[j])
    end
end
Pairs ← IndexPairsSortByValue(Distance)
for k ← 1 to length(Pairs) do
    (m, n) ← Pairs[k]
    if m ∉ MatchedP and n ∉ MatchedD then
        if Distance[m][n] < T then
            List ← append(List, (m, n))
            MatchedP ← MatchedP ∪ {m}
            MatchedD ← MatchedD ∪ {n}
        else
            break
        end
    end
end
return List
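Algorithm 1 can be sketched in plain Python as follows. The function name `greedy_match`, the 0-based indexing, and the example distance matrix are assumptions made for this illustration; the distance matrix is taken as given, so any distance function can be plugged in.

```python
def greedy_match(distance, threshold):
    """Greedy association over an M x N distance matrix (list of lists).

    Returns (prediction_index, detection_index) pairs, 0-based, in order of
    increasing distance, each with a distance below the threshold.
    """
    pairs = sorted(
        (d, i, j)
        for i, row in enumerate(distance)
        for j, d in enumerate(row)
    )
    matched_p, matched_d, matches = set(), set(), []
    for d, i, j in pairs:
        if d >= threshold:
            # Pairs are sorted, so every remaining pair is at least as far
            # apart; stopping here yields the same result as Algorithm 1.
            break
        if i in matched_p or j in matched_d:
            continue
        matches.append((i, j))
        matched_p.add(i)
        matched_d.add(j)
    return matches


# Hypothetical distances between 3 predictions (rows) and 2 detections (columns).
distance = [
    [0.5, 9.0],
    [0.7, 1.2],
    [8.0, 7.5],
]
matches = greedy_match(distance, threshold=3.0)
```

In this example prediction 0 takes detection 0, prediction 1 falls back to detection 1, and prediction 2 stays unmatched because its best remaining distance exceeds the threshold.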

Specifically, our process noise models the unknown linear and angular accelerations. Therefore, we analyse the variance in the ground-truth accelerations in the training data set. Let us denote the training set's ground-truth object center positions and rotation angles as $(x^{[m]}_t, y^{[m]}_t, z^{[m]}_t, a^{[m]}_t)$ for timestamp $t \in \{1, \dots, T\}$ and object index $m \in \{1, \dots, M\}$. We model the process noise covariance as a diagonal matrix in which the elements associated with the center positions and rotation angles, $(Q_{xx}, Q_{yy}, Q_{zz}, Q_{aa})$, are estimated as follows:

$$Q_{xx} = \mathrm{Var}\big((x^{[m]}_{t+1} - x^{[m]}_t) - (x^{[m]}_t - x^{[m]}_{t-1})\big) \tag{7}$$

$$Q_{yy} = \mathrm{Var}\big((y^{[m]}_{t+1} - y^{[m]}_t) - (y^{[m]}_t - y^{[m]}_{t-1})\big) \tag{8}$$

$$Q_{zz} = \mathrm{Var}\big((z^{[m]}_{t+1} - z^{[m]}_t) - (z^{[m]}_t - z^{[m]}_{t-1})\big) \tag{9}$$

$$Q_{aa} = \mathrm{Var}\big((a^{[m]}_{t+1} - a^{[m]}_t) - (a^{[m]}_t - a^{[m]}_{t-1})\big) \tag{10}$$

The above variances are calculated over $m \in \{1, \dots, M\}$ and $t \in \{2, \dots, T-1\}$. The elements of $Q$ associated with the center velocity and angular velocity, $(Q_{d_x d_x}, Q_{d_y d_y}, Q_{d_z d_z}, Q_{d_a d_a})$, are estimated in the same way:

$$(Q_{d_x d_x}, Q_{d_y d_y}, Q_{d_z d_z}, Q_{d_a d_a}) = (Q_{xx}, Q_{yy}, Q_{zz}, Q_{aa}) \tag{11}$$
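The estimation in Equations 7 to 11 can be sketched for a single state component as follows. The data layout (a dictionary from object index to a list of per-frame values), the function names, the use of the population variance, and the toy trajectories are all assumptions of this sketch; for the angle component, differences would additionally need to be wrapped, which is omitted here.

```python
from statistics import pvariance


def second_differences(values):
    """(v[t+1] - v[t]) - (v[t] - v[t-1]) for every interior time step t."""
    return [
        (values[t + 1] - values[t]) - (values[t] - values[t - 1])
        for t in range(1, len(values) - 1)
    ]


def estimate_process_variance(tracks):
    """Variance of the second differences pooled over all objects (Eqs. 7-10).

    `tracks` maps an object index m to its ground-truth values of one state
    component (for example x) at consecutive frames.
    """
    pooled = []
    for values in tracks.values():
        pooled.extend(second_differences(values))
    return pvariance(pooled)


# Hypothetical ground-truth x positions of two objects (illustrative only).
tracks_x = {
    0: [0.0, 1.0, 2.0, 3.0],  # constant velocity: second differences are 0
    1: [0.0, 1.0, 3.0, 6.0],  # accelerating: second differences are 1
}
q_xx = estimate_process_variance(tracks_x)
q_dxdx = q_xx  # Equation 11: the velocity entry reuses the same variance
```

The second difference is a discrete acceleration, so pooling it over all objects and frames gives one variance per state component, which fills the corresponding position and velocity entries on the diagonal of Q.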

One might think that the above estimation double counts the acceleration. However, the estimation is reasonable given our process model definition. For example, consider the x component of the state and its velocity-related component $d_x$ in the process model defined in Section 3.2:

$$x_{t+1} = x_t + d_{x_t} + q_{x_t} \tag{12}$$

$$d_{x_{t+1}} = d_{x_t} + q_{d_{x_t}} \tag{13}$$

To estimate the two noise terms $q_{x_t}$ and $q_{d_{x_t}}$, we have:

$$q_{x_t} = x_{t+1} - x_t - d_{x_t} \tag{14}$$

$$q_{d_{x_t}} = d_{x_{t+1}} - d_{x_t} \tag{15}$$

where the predicted state components $x_{t+1}$ and $d_{x_{t+1}}$ can be estimated using the corresponding ground-truth state components. The velocity-related components $d_{x_{t+1}}$ and $d_{x_t}$ can be approximated as $x_{t+1} - x_t$ and $x_t - x_{t-1}$ based on our state definition in Section 3.1. We can then derive the following:

$$q_{x_t} \approx x_{t+1} - x_t - d_{x_t} \tag{16}$$

$$\approx (x_{t+1} - x_t) - (x_t - x_{t-1}) \tag{17}$$

$$q_{d_{x_t}} \approx d_{x_{t+1}} - d_{x_t} \tag{18}$$

$$\approx (x_{t+1} - x_t) - (x_t - x_{t-1}) \tag{19}$$

The above approximation explains why we use the variance of accelerations to estimate the process model noise covariance in Equations 7 to 11.

Additionally, including the acceleration noise in both prediction equations, 12 and 13, adds robustness to the data association. In contrast, including the acceleration noise only in the velocity prediction in Equation 13 would underestimate the uncertainty when predicting the next position. Consider the case in which there is a very large real acceleration in the current time step which could not be accounted for in the previously estimated velocity. In this case, we will have large uncertainty in predicting the next velocity. But we will only have small uncertainty in predicting the next position if we do not include the acceleration noise in the position prediction equation. By adding this additional acceleration noise in the position prediction, we increase the predicted uncertainty of the position. This uncertainty is used within the Mahalanobis distance, and therefore the data association becomes more generous for matching and more robust. Similar reasoning also applies to the other state variables.
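The effect described above can be illustrated numerically with a one-dimensional version of the process model, with state (x, dx). The covariance and noise values below are hypothetical and chosen only to make the difference visible.

```python
import numpy as np

# 1D constant-velocity model for the state (x, dx).
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
sigma = np.diag([0.2, 0.1])  # hypothetical current state covariance
q = 0.5                      # hypothetical acceleration variance

Q_both = np.diag([q, q])             # noise on position and velocity (Eqs. 12, 13)
Q_velocity_only = np.diag([0.0, q])  # noise on the velocity only (Eq. 13)

var_x_both = (A @ sigma @ A.T + Q_both)[0, 0]
var_x_velocity_only = (A @ sigma @ A.T + Q_velocity_only)[0, 0]
```

With these values the predicted position variance is 0.8 when the noise enters both equations and 0.3 when it enters only the velocity equation. A larger predicted variance yields a smaller Mahalanobis distance for the same position residual, so a detection displaced by an unexpected acceleration is more likely to stay below the matching threshold.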

For the elements of $Q$ related to the length, width, and height, and for all off-diagonal elements of $Q$, we assume the value 0.

Our observation noise models the error in the object detector. Therefore, we analyse the error variance between ground-truth object poses and detections in the training set to choose the diagonal entries of $R$ and the initial state covariance $\Sigma_0$. For this, we first find the matching pairs of detection bounding boxes and ground-truth boxes, using the matching criterion that the 2D center distance is less than 2 meters. Given the matched pairs of detections and ground truth $(D^{[k]}_t, G^{[k]}_t)$ for timestamp $t \in \{1, \dots, T\}$ and matched pair index $k \in \{1, \dots, K\}$, where

$$D^{[k]}_t = (D^{[k]}_{x_t}, D^{[k]}_{y_t}, D^{[k]}_{z_t}, D^{[k]}_{a_t}, D^{[k]}_{l_t}, D^{[k]}_{w_t}, D^{[k]}_{h_t}) \tag{20}$$

$$G^{[k]}_t = (G^{[k]}_{x_t}, G^{[k]}_{y_t}, G^{[k]}_{z_t}, G^{[k]}_{a_t}, G^{[k]}_{l_t}, G^{[k]}_{w_t}, G^{[k]}_{h_t}) \tag{21}$$

we estimate the elements of the observation noise covariance matrix $R$ as follows:

$$R_{xx} = \mathrm{Var}(D^{[k]}_{x_t} - G^{[k]}_{x_t}) \tag{22}$$

$$R_{yy} = \mathrm{Var}(D^{[k]}_{y_t} - G^{[k]}_{y_t}) \tag{23}$$

$$R_{zz} = \mathrm{Var}(D^{[k]}_{z_t} - G^{[k]}_{z_t}) \tag{24}$$

$$R_{aa} = \mathrm{Var}(D^{[k]}_{a_t} - G^{[k]}_{a_t}) \tag{25}$$

$$R_{ll} = \mathrm{Var}(D^{[k]}_{l_t} - G^{[k]}_{l_t}) \tag{26}$$

$$R_{ww} = \mathrm{Var}(D^{[k]}_{w_t} - G^{[k]}_{w_t}) \tag{27}$$

$$R_{hh} = \mathrm{Var}(D^{[k]}_{h_t} - G^{[k]}_{h_t}) \tag{28}$$

The off-diagonal entries of $R$ are all zero. We set $\Sigma_0 = R$ as we initialize the multi-object tracker with the initial detection results.
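A minimal sketch of the estimation in Equations 22 to 28 is given below. Several details are assumptions of this sketch and not statements about the released code: boxes are represented as dictionaries, each detection is paired with its nearest ground-truth box (the text above only specifies the 2-meter criterion), the population variance is used, and angle errors are not wrapped.

```python
import math
from statistics import pvariance

COMPONENTS = ("x", "y", "z", "a", "l", "w", "h")


def box(x, y, z, a, l, w, h):
    """Hypothetical box representation used only in this sketch."""
    return dict(zip(COMPONENTS, (x, y, z, a, l, w, h)))


def center_distance_2d(b1, b2):
    return math.hypot(b1["x"] - b2["x"], b1["y"] - b2["y"])


def match_by_center_distance(detections, ground_truths, max_dist=2.0):
    """Pair each detection with its nearest ground-truth box within max_dist."""
    pairs = []
    for det in detections:
        if not ground_truths:
            break
        nearest = min(ground_truths, key=lambda gt: center_distance_2d(det, gt))
        if center_distance_2d(det, nearest) < max_dist:
            pairs.append((det, nearest))
    return pairs


def estimate_observation_variances(pairs):
    """Diagonal of R: variance of (detection - ground truth) per component."""
    return {c: pvariance([det[c] - gt[c] for det, gt in pairs]) for c in COMPONENTS}


# Hypothetical boxes from one frame (illustrative only).
ground_truths = [box(0.0, 0.0, 0.0, 0.0, 4.0, 2.0, 1.5), box(10.0, 5.0, 0.0, 0.0, 4.0, 2.0, 1.5)]
detections = [
    box(0.2, 0.0, 0.0, 0.0, 4.0, 2.0, 1.5),
    box(9.8, 5.0, 0.0, 0.0, 4.0, 2.0, 1.5),
    box(50.0, 50.0, 0.0, 0.0, 4.0, 2.0, 1.5),  # too far from any ground truth
]
pairs = match_by_center_distance(detections, ground_truths)
r_diag = estimate_observation_variances(pairs)
```

In practice the pairs would be pooled over all frames of the training set before the variances are computed; the single frame here only keeps the example short.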

4. Experiment Results

4.1. Evaluation Metrics

We follow the NuScenes Tracking Challenge [1] and use the Average Multi-Object Tracking Accuracy (AMOTA) as the main evaluation metric. AMOTA is defined as follows:

$$\mathrm{AMOTA} = \frac{1}{n-1} \sum_{r \in \{\frac{1}{n-1}, \frac{2}{n-1}, \dots, 1\}} \mathrm{MOTAR}, \tag{29}$$

Table 1: Tracking results for the validation set of NuScenes [1]: evaluation in terms of overall AMOTA and individual AMOTA for each object category in comparison with the AB3DMOT [8] baseline method, and variations of our method. In each column, the best obtained results are typeset in boldface. (*Our baseline implementation of applying AB3DMOT [8] on the MEGVII [9] detection result.)

Method                          Overall  bicycle  bus   car   motorcycle  pedestrian  trailer  truck
AB3DMOT [8]                     17.9     0.9      48.9  36.0  5.1         9.1         11.1     14.2
AB3DMOT [8] *                   50.9     21.8     74.3  69.4  39.0        58.7        35.3     58.1
Ours w/ 3D-IOU, threshold 0.01  52.7     23.2     73.9  72.1  40.4        66.7        34.4     58.3
Ours w/ 3D-IOU, threshold 0.1   49.2     22.3     74.4  68.2  38.9        47.1        35.9     57.8
Ours w/ 3D-IOU, threshold 0.25  43.9     21.3     73.9  63.3  35.1        21.6        36.9     54.9
Ours w/ Hungarian algorithm     49.8     24.2     68.4  63.9  42.9        70.0        27.6     52.0
Ours w/ default covariance      41.7     11.2     57.0  56.8  37.8        63.7        23.4     41.7
Ours w/o angular velocity       56.1     27.2     74.1  73.5  50.7        75.5        33.8     58.1
Ours                            56.1     27.2     74.1  73.5  50.6        75.5        33.7     58.0

Table 2: Tracking results for the test set of NuScenes [1]. The full tracking challenge leaderboard will be released to the public soon by the organizer.

Rank      Team Name                AMOTA
1         StanfordIPRL-TRI (Ours)  55.0
2         VV team                  37.1
3         CenterTrack              10.8
baseline  AB3DMOT [8]              15.1

where $n$ is the number of evaluation sample points, and $r$ is the evaluation targeted recall. MOTAR is the Recall-Normalized Multi-Object Tracking Accuracy, defined as follows:

$$\mathrm{MOTAR} = \max\left(0,\; 1 - \frac{IDS_r + FP_r + FN_r - (1 - r) \cdot P}{r \cdot P}\right), \tag{30}$$

where $P$ is the number of ground-truth positives, $IDS_r$ is the number of identity switches, $FP_r$ is the number of false positives, and $FN_r$ is the number of false negatives.
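As a worked illustration of Equations 29 and 30, the two metrics can be computed as follows. The function names and the error counts are hypothetical; the official NuScenes evaluation code should be used for any reported number.

```python
def motar(ids, fp, fn, num_gt, recall):
    """Recall-normalized MOTA at a given recall level (Equation 30)."""
    errors = ids + fp + fn - (1.0 - recall) * num_gt
    return max(0.0, 1.0 - errors / (recall * num_gt))


def amota(counts_by_recall, num_gt):
    """Average of MOTAR over the evaluated recall levels (Equation 29).

    `counts_by_recall` maps each recall r in {1/(n-1), ..., 1} to a tuple
    (IDS_r, FP_r, FN_r), so it holds n - 1 entries.
    """
    values = [
        motar(ids, fp, fn, num_gt, r)
        for r, (ids, fp, fn) in counts_by_recall.items()
    ]
    return sum(values) / len(values)


# Hypothetical counts for P = 100 ground-truth positives and n = 3.
counts = {
    0.5: (2, 8, 50),   # at recall 0.5, half of the positives are missed
    1.0: (5, 25, 0),
}
score = amota(counts, num_gt=100)
```

At recall 0.5 the term (1 - r) · P cancels the 50 false negatives that are unavoidable at that recall level, leaving 10 errors over r · P = 50, so MOTAR is 0.8; at recall 1.0 it is 0.7, and the AMOTA of this toy example is 0.75.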

4.2. Baseline Evaluation

We use AB3DMOT [8] as the baseline, as described earlier in Section 2.2. We report AB3DMOT's tracking result for the NuScenes validation set in the first row of Table 1 as reported by the NuScenes Tracking Challenge [1]. Additionally, we applied the AB3DMOT [8] open-source code to the MEGVII [9] detection results and generated a better baseline tracking result, as reported in the second row of Table 1. Currently, we do not know why the AMOTA numbers are different for the two implementations.

4.3. Quantitative Results and Ablations

We report our method's results on the validation set in Table 1. We also include the AB3DMOT [8] baseline validation result in Table 1. We can see that our method outperforms the official AB3DMOT baseline by a large margin (38.2 percentage points) in terms of the overall AMOTA. Our method also achieves a higher overall AMOTA, by 5.2 percentage points, than our baseline implementation of applying AB3DMOT [8] on the MEGVII [9] detection result.

Additionally, we perform an ablation study by replacing different components of our method with the corresponding components of AB3DMOT [8]. We report the results in Table 1. We can see that our proposed Mahalanobis distance-based data association method outperforms the 3D-IOU methods, especially in the categories of small objects, such as the bicycle, the motorcycle, and the pedestrian. For those objects, the 3D-IOU can be 0 even if the prediction and the detection are very close but do not overlap. In such cases, the 3D-IOU method will miss the match. However, our proposed Mahalanobis distance method can still correctly track the objects because it still provides distance measurements even when the 3D-IOU is zero. The Mahalanobis distance also takes the uncertainty about the prediction into account as estimated by the Kalman Filter.

We also find that the greedy algorithm performs better than the Hungarian algorithm during the data association process. Our data-driven covariance matrix estimation outperforms the heuristic choices when using our Mahalanobis distance-based tracking method.

One interesting finding of the ablation analysis is that excluding the angular velocity from the Kalman Filter state does not decrease the quantitative tracking performance in terms of the AMOTA. That is because the NuScenes tracking evaluation procedure uses the 2D center distance as the matching criterion when counting the numbers of false positives and false negatives. Therefore, the accuracy of the rotation angles is ignored in this evaluation metric. Although the AMOTA values barely change, our visualization results in Figure 2 show that including the angular velocity in the Kalman Filter state generates better and more realistic qualitative tracking results.

The NuScenes Tracking Challenge organizer shared the test set result of the top 3 participants and the AB3DMOT [8] baseline, as in Table 2. The full tracking challenge leaderboard will be released to the public soon by the organizer.

Figure 2: Bird-eye-view tracking visualization of cars. Panels: (a) AB3DMOT [8]; (b) Ours; (c) Ground-truth; (d) Our AB3DMOT [8] baseline; (e) Ours without angular velocity; (f) Input detection from MEGVII [9].

4.4. Qualitative Results

We show bird-eye-view visualizations of the tracking results of the AB3DMOT [8] baselines and of our method for the car category in Figure 2. We also show the ground-truth annotation and the input detection from MEGVII [9] as references. We draw the object bounding boxes from different timesteps of the same scene in a single plot. Different colors represent different instances of tracks or objects. The detection results only have a single color because no tracking id information is available.

We can see that AB3DMOT [8] has difficulty continuing to track when the object makes a sharp turn, as shown in Figure 2a and 2d. This is because the Kalman Filter's predicted 3D bounding box does not overlap with any detection box when the car is turning sharply. However, our Mahalanobis distance-based methods can still correctly track the car's motion as shown in Figure 2b and 2e, either with or without angular velocity in the Kalman Filter state. For the case without the angular velocity, the estimated car orientation during turning is clearly different from the detection or the ground-truth, as shown in Figure 2e, 2f, and 2c. Such an issue can be fixed by including the angular velocity in the Kalman Filter state as in our final proposed model, as shown in Figure 2b.

We show the bird-eye-view visualization for pedestrians in Figure 3. In this example, we can see that the input detection (Figure 3f) has some noise in the lower end of the longest track, potentially due to occlusions. The AB3DMOT [8] baseline as visualized in Figure 3a is unable to continue tracking the pedestrian. However, both of our proposed methods, either with or without angular velocity (Figure 3b and 3e), can correctly track the pedestrian's location and orientation.

5. Conclusion

We present an on-line 3D multi-object tracking method using the Mahalanobis distance in the data association step. Moreover, we use the statistics from the training set to estimate and initialize the Kalman Filter's covariance matrices. Our method better utilizes the stochastic information and outperforms the 3D-IOU-based AB3DMOT [8] baseline by a large margin in terms of the AMOTA evaluation metric in the NuScenes Tracking Challenge [1].

Figure 3: Bird-eye-view tracking visualization of pedestrians. Panels: (a) AB3DMOT [8]; (b) Ours; (c) Ground-truth; (d) Our AB3DMOT [8] baseline; (e) Ours without angular velocity; (f) Input detection from MEGVII [9].

6. Acknowledgement

Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

[1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.

[2] Ming-Fang Chang, John W Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3D tracking and forecasting with rich maps. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[3] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krähenbühl, Trevor Darrell, and Fisher Yu. Joint monocular 3D detection and tracking. 2019.

[4] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.

[5] Prasanta Chandra Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India, 1936.

[6] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.

[7] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005.

[8] Xinshuo Weng and Kris Kitani. A Baseline for 3D Multi-Object Tracking. arXiv preprint arXiv:1907.03961, 2019.

[9] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv preprint arXiv:1908.09492, 2019.
