
Person Re-identification in Videos by Analyzing Spatio-Temporal Tubes

Sk. Arif Ahmed, Member, IEEE, Debi Prosad Dogra, Member, IEEE, Heeseung Choi, Seungho Chae and Ig-Jae Kim

Abstract—Typical person re-identification frameworks search for the k best matches in a gallery of images that are often collected in varying conditions. The gallery may contain image sequences when re-identification is done on videos. However, such a process is time consuming, as re-identification has to be carried out multiple times. In this paper, we extract spatio-temporal sequences of frames (referred to as tubes) of moving persons and apply multi-stage processing to match a given query tube with a gallery of stored tubes recorded through other cameras. Initially, we apply a binary classifier to remove noisy images from the input query tube. In the next step, we use a key-pose detection-based query minimization, which reduces the length of the query tube by removing redundant frames. Finally, a 3-stage hierarchical re-identification framework is used to rank the output tubes as per the matching scores. Experiments with publicly available video re-identification datasets reveal that our framework is better than state-of-the-art methods. It ranks the tubes with an increased CMC accuracy of 6-8% across multiple datasets. Also, our method significantly reduces the number of false positives. A new video re-identification dataset, named Tube-based Re-identification Video Dataset (TRiViD), has been prepared with an aim to help the re-identification research community.

Index Terms—Trajectory analysis, anomaly detection, ELM, HTM, bio-inspired learning

I. INTRODUCTION

Person re-identification (Re-Id) is useful in various intelligent video surveillance applications. The task can be considered an image retrieval problem: a query image of a person (probe) is given, and we search for the person in a set of images extracted from different cameras (gallery). The query can be a single image [1] or multiple images [2]. Often, a multi-image query uses early fusion of images and generates an average query image [3]. Such a method consumes more computational power as compared to single image-based methods. Advanced hardware and efficient learning frameworks have encouraged researchers to focus on designing Re-Id systems applicable to videos. However, video-based re-identification research is still in its infancy [4], [5]. Even though the existing video Re-Id applications seem promising, such methods often fail in low-resolution videos, crowded environments, or in the presence of significant camera angle variations. It has also been observed that the query image or video has to be selected judiciously to obtain good retrieval results. Choosing an improper image or video may lead to poor retrieval quality.

Sk. Arif Ahmed (Email: [email protected]) is with NIT Durgapur, India. Debi Prosad Dogra (Email: [email protected]) is with IIT Bhubaneswar, India. Heeseung Choi (Email: [email protected]), Seungho Chae (Email: [email protected]), and Ig-Jae Kim (Email: [email protected]) are with KIST, South Korea.

In this paper, we detect and track humans in movement and construct spatio-temporal tubes that are used in the re-identification framework. We also propose a method for selecting an optimum set of key-pose images and use a 3-stage learning framework to re-identify persons appearing in different cameras. To accomplish this, we have made the following contributions in this paper:

• We propose a learning-based method to select an optimum set of key-pose images to reconstruct the query tube by minimizing its length in terms of number of frames.

• We propose a 3-stage hierarchical framework that has been built using (i) an SVDNet-guided Re-Id architecture, (ii) self-similarity estimation, and (iii) temporal correlation analysis to rank the tubes of the gallery.

• We introduce a new video dataset, named Tube-based Re-identification Video Dataset (TRiViD), that has been prepared with an aim to help the re-identification research community.

The rest of the paper is organized as follows. In Section II, we discuss the state of the art of person re-identification research. Section III presents the proposed Re-Id framework with its various components. Experimental results are presented in Section IV. Conclusion and future work are presented in Section V.

II. RELATED WORK

Person re-identification applications are growing rapidly in number. However, the enormous growth of CCTV surveillance has thrown up various challenges to the re-identification research community. The primary challenges are handling large volumes of data [6], [7], tracking in complex environments [8], [9], presence of groups [10], occlusion [11], varying pose and style across different cameras [2], [12]–[14], etc. The process of Re-Id can be categorized as image-guided [2], [10], [15], [16] or video-guided [4], [5], [17]–[19]. The image-guided methods typically use deep neural networks for feature representation and re-identification, whereas the video-guided methods typically use recurrent convolutional networks (RNNs) to embed temporal information such as optical flow [17], sequences of poses, etc. Table I summarizes recent progress in person re-identification. In recent years, late fusion of different scores [15], [20] has shown significant improvement in the final ranking. Our method is similar to a typical delayed or late fusion-guided method. We refine search results obtained using convolutional neural networks with the help of temporal correlation analysis.



Reference | Method | Paper title
Lv et al. [4] | Motion and image-based features | Recurrent Convolutional Network for Video-based Person Re-identification
Barman et al. [15] | Graph theory and multiple-algorithm fusion | SHaPE: A Novel Graph Theoretic Algorithm for Making Consensus-based Decisions in Person Re-identification Systems
Chang et al. [16] | Visual appearance and multiple semantic-level features | Multi-Level Factorisation Net for Person Re-Identification
Chen et al. [10] | Fusion of local similarity and group similarity based on DNN and CRF | Group Consistent Similarity Learning via Deep CRF for Person Re-Identification
Chen et al. [5] | Divides a long person sequence into short snippets and matches snippets for re-identification | Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding
Chung et al. [17] | Learns spatial and temporal similarity and uses weighted fusion | A Two Stream Siamese Convolutional Neural Network for Person Re-Identification
Deng et al. [2] | Learns self-similarity and domain dissimilarity | Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification
He et al. [21] | Deep pixel-level CNN for person re-identification from partially observed images | Deep Spatial Feature Reconstruction for Partial Person Re-identification: Alignment-free Approach
Huang et al. [11] | Augmented training data generation for person re-identification | Adversarially Occluded Samples for Person Re-identification
Kalayeh et al. [22] | Human semantic-parts model to train state-of-the-art deep networks and compute a weighted average | Human Semantic Parsing for Person Re-identification
Li et al. [23] | Attention model over distinct body parts for re-identification | Diversity Regularized Spatiotemporal Attention for Video-based Person Re-identification
Li et al. [24] | Harmonious attention network with pixel-level and bounding-box-level attention as features | Harmonious Attention Network for Person Re-Identification
Liu et al. [12] | Augments person poses to generate the training set used for re-identification | Pose Transferrable Person Re-Identification
Liu et al. [25] | Tracklets used for training and re-identification | Stepwise Metric Promotion for Unsupervised Video Person Re-identification
Lv et al. [4] | Transfer learning used to learn spatio-temporal patterns in an unsupervised manner | Unsupervised Cross-dataset Person Re-identification by Transfer Learning of Spatial-Temporal Patterns
Fu et al. [26] | Multi-scale feature representation with the correct scale chosen for matching | Multi-scale Deep Learning Architectures for Person Re-identification
Tomasi et al. [27] | Selection of good features for re-identification | Features for Multi-Target Multi-Camera Tracking and Re-Identification
Roy et al. [28] | Minimizes labeling effort by choosing the minimum set of images for labeling | Exploiting Transitivity for Learning Person Re-identification Models on a Budget
Sarfraz et al. [13] | Fine and coarse pose information for deep re-identification | A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking
Shen et al. [29] | Group-shuffling random walk network for fully utilizing train and test images | Deep Group-shuffling Random Walk for Person Re-identification
Shen et al. [30] | Kronecker Product Matching module to match feature maps of different persons in an end-to-end trainable deep neural network | End-to-End Deep Kronecker-Product Matching for Person Re-identification
Si et al. [31] | Learns context-aware feature sequences and performs attentive sequence comparison simultaneously | Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification
Wang et al. [32] | Deep architecture named BraidNet with a cascaded Wconv structure that learns comparison features of images | Person Re-identification with Cascaded Pairwise Convolutions
Wu et al. [18] | Exploits unsupervised CNN feature representation via stepwise learning | Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning
Xu et al. [33] | Body parts-based attention network for re-identification | Attention-Aware Compositional Network for Person Re-identification
Xu et al. [3] | Jointly Attentive Spatial-Temporal Pooling Network (ASTPN) applied to video sequences | Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification
Zhang et al. [19] | Sequential decision making used to identify each frame in a video | Multi-shot Pedestrian Re-identification via Sequential Decision Making
Zhong et al. [14] | Style transfer across different cameras to improve re-identification | Camera Style Adaptation for Person Re-identification

TABLE I: Recent progress in person re-identification research

Fig. 1: The proposed method for tube-to-tube re-identification. Our contributions are marked with circles. The method takes a tube as the query and ranks the gallery tubes by best match.

III. PROPOSED APPROACH

Our method can be regarded as tracking followed by re-identification. Moving persons are tracked using Simple Online Deep Tracking (SODT), which has been developed using the YOLO [34] framework. A tube is defined as the sequence of spatio-temporal frames of a moving person. Training is done using the videos captured by one camera. Videos captured by the cameras are used to construct the gallery of tubes. Assume a gallery (G) contains n tubes as given in (1).

$G = \{T_1, T_2, T_3, \ldots, T_n\}$ (1)

Suppose a tube (T) in the gallery contains m frames as given in (2).

$T = \{I_1, I_2, I_3, \ldots, I_m\}$ (2)

At the time of re-identification, a query tube is given as a probe. First, the noisy frames are eliminated and the query tube is minimized. Next, the frames of the revised query tube are passed through a 3-stage hierarchical re-ranking process to get the final ranking of the tubes in the gallery. The method is depicted in Figure 1.
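To make the notation of (1) and (2) and the overall control flow concrete, the following sketch shows one plausible data layout for tubes and the pipeline skeleton. Every name here (Tube, is_noisy, re_identify) is our own illustrative placeholder, and the stage bodies are stubs, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List
import numpy as np

@dataclass
class Tube:
    """A spatio-temporal tube: the frame sequence of one tracked person."""
    tube_id: int
    frames: List[np.ndarray] = field(default_factory=list)  # I_1, ..., I_m

def re_identify(query: Tube, gallery: List[Tube],
                is_noisy: Callable[[np.ndarray], bool],
                k: int = 20) -> List[Tube]:
    """Skeleton of the pipeline in Figure 1 (stages stubbed out)."""
    # Stage 0: a binary classifier drops blurry/cropped/low-quality frames.
    query.frames = [f for f in query.frames if not is_noisy(f)]
    # Stage 0b: key-pose query minimization (Sec. III-A) would shrink
    # query.frames further; Stages 1-3 (Secs. III-B to III-D) would then
    # score and re-rank the gallery tubes. Returned unchanged here.
    return gallery[:k]
```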

A. Query Minimization

Re-identification using multiple images usually performs better as compared to single image-based frameworks. However, the former consumes more computational power. Also, selecting a set of frames that can uniquely represent a tube can be challenging. To address this, we have used a deep similarity-matching architecture to select a set of representative frames based on pose dissimilarity. First, a query tube is passed through a binary classifier to remove noisy frames such as blurry, cropped, or low-quality ones. Next, a ResNet50 [35] framework has been trained using a few query tubes containing similar-looking images. The similarity cost ($\sigma_{ij}$) is calculated using (3).

$\sigma_{ij} = \mathrm{ResNet50}(I_i, I_j)$ (3)

The input tube contains m images, whereas the output query tube contains n images such that $n \ll m$. The images in the optimized query tube can be represented using (4).

$Q = \{I_1, I_2, I_3, \ldots, I_n\}$ (4)

The pairwise query cost function ($\xi$) for a given frame ($I_i$) and another frame ($I_j$) is defined in (5).

Page 3: Person Re-identification in Videos by Analyzing Spatio ... · Sk. Arif Ahmed, Member, IEEE, Debi Prosad Dogra, Member, IEEE, Heeseung Choi, Seungho Chae and Ig-Jae Kim Abstract—Typical

3

$\xi_{ij} = \min(\sigma_{ij}),\ \forall j$ (5)

The loss of energy is defined as given in (6).

$\gamma_{ij} = \max(\sigma_{ij}),\ \forall j$ (6)

The optimal query energy (E) is defined in (7), where $\bar{Q}$ is the set of images that are not included in $Q$ and $\phi$ is a weighting parameter called the query threshold (between 0 and 1). A larger $\phi$ produces a higher number of images in $Q$.

$E = \sum_{i,j \in Q} \phi\,\xi_{ij} + \sum_{i \in Q,\ j \in \bar{Q}} \gamma_{ij}$ (7)

Figure 2 depicts the steps and the minimized query images from the TRiViD dataset.
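Equations (3)-(7) do not fix a unique algorithm, but one plausible greedy reading is: repeatedly keep the frame least similar to the frames already kept, and stop once every remaining frame is similar (above φ) to something kept, so that a larger φ keeps more frames, consistent with Section IV-C. A minimal sketch under that assumption, taking L2-normalized ResNet50 embeddings of the tube frames as input:

```python
import numpy as np

def minimize_query(embeddings: np.ndarray, phi: float) -> list:
    """Greedy key-pose selection (one reading of Eqs. (3)-(7)).
    embeddings: (m, d) array of L2-normalized frame features.
    phi: query threshold in (0, 1]; larger phi keeps more frames.
    Returns indices of the frames kept in the minimized query tube Q."""
    m = embeddings.shape[0]
    sim = embeddings @ embeddings.T            # sigma: pairwise cosine similarity
    keep = [int(np.argmax(sim.sum(axis=1)))]   # seed with the most typical frame
    while len(keep) < m:
        cand = [j for j in range(m) if j not in keep]
        # similarity of each candidate to its closest already-kept frame
        nearest = np.array([sim[j, keep].max() for j in cand])
        if nearest.min() > phi:                # every pose is already represented
            break
        keep.append(cand[int(np.argmin(nearest))])  # keep the least similar pose
    return sorted(keep)
```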

B. Image Re-identification using SVDNet

Our proposed method uses single image-based re-identification at the top layer of the hierarchy. We have used the Singular Vector Decomposition Network (SVDNet) [1] as the baseline. It uses a convolutional neural network and an Eigenlayer before the fully connected layer. The Eigenlayer consists of a set of weights. Figure 3 demonstrates the architecture of a typical SVDNet. The output of SVDNet is a set of retrieved images with ranks up to k as given in (8).

$SVD = \{I_1, I_2, I_3, \ldots, I_k\}$ (8)
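For readers unfamiliar with SVDNet [1], the key property of the Eigenlayer is that its weight matrix is kept orthogonal: in the "restraint" step of training, the learned weights are decomposed by SVD and the right singular vectors are dropped. A minimal PyTorch sketch of that single step (our simplification; the full training in [1] alternates such restraint steps with "relaxation" fine-tuning phases):

```python
import torch

@torch.no_grad()
def restrain_eigenlayer(eigenlayer: torch.nn.Linear) -> None:
    """SVDNet restraint step: viewing the Eigenlayer weights as
    W (in_features x out_features) with W = U S V^T, replace W by U S,
    which makes the projection directions mutually orthogonal."""
    W = eigenlayer.weight.t()                  # paper orientation: (in, out)
    U, S, _ = torch.linalg.svd(W, full_matrices=False)
    eigenlayer.weight.copy_((U * S).t())       # drop V^T, keep U S
```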

C. Self Similarity Guided Re-ranking

In the next step, we aggregate the self-similarity scores with the SVDNet outputs. A typical ResNet50 [35] architecture has been trained to learn self-similarity scores using the tubes of the query set. We assume the images available in a tube are similar. Next, a similarity score between the query image and every output image of the SVD network, up to rank k, is calculated. Finally, the scores are averaged and the images are re-ranked. This step ensures that dissimilar images get pushed toward the end of the ranked sequence of retrieved images. Figure 4 illustrates this method.
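The fusion step itself reduces to averaging and re-sorting. A small sketch, where both score arrays are assumed precomputed and aligned with the rank-k SVDNet output (higher means more similar):

```python
import numpy as np

def self_similarity_rerank(svd_scores: np.ndarray,
                           self_sim_scores: np.ndarray) -> np.ndarray:
    """Average SVDNet scores with self-similarity scores for the k
    retrieved images and return the new ordering (indices, best first)."""
    fused = 0.5 * (svd_scores + self_sim_scores)
    return np.argsort(-fused)   # descending order of fused score

# e.g. an image ranked high by SVDNet but with a low self-similarity
# score to the query ends up demoted toward the end of the list.
```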

D. Tube Ranking by Temporal Correlation

The final step of the proposed method is to rank the tubes by temporal correlation among the retrieved images. We assume that images belonging to a single tube are temporally correlated, as they are extracted by detection and tracking. Let the result matrix up to rank k for the query tube Q after the first two stages be denoted by R. The weight of an image of R can be estimated using (9).

$R = \begin{bmatrix} I_{11} & I_{12} & I_{13} & \cdots & I_{1k} \\ I_{21} & I_{22} & I_{23} & \cdots & I_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_{j1} & I_{j2} & I_{j3} & \cdots & I_{jk} \end{bmatrix}$

$\alpha_{I_{jk}} = \frac{1}{r}$, where $r$ is the rank of $I_{jk}$ (9)

Similarly, the weight of a tube ($T_n$) can be estimated using (10).

$\beta_{T_n} = \dfrac{\#\ \text{of images of}\ T_n\ \text{in}\ R}{\max_{n'}\ \#\ \text{of images of}\ T_{n'}\ \text{in}\ R}$ (10)

Finally, the temporal correlation cost ($\tau_{I_{jk}}$) of an image in R can be estimated as given in (11).

$\tau_{I_{jk}} = \alpha_{I_{jk}} \times \beta_{T_n}$, for $I_{jk} \in T_n$ (11)

Based on the temporal correlation, the retrieved tubes are ranked. Let the ranked tubes up to k be represented using (12), where higher-ranked tubes have higher weights.

$R_{tube} = \{T_1, T_2, \ldots, T_k\}$ (12)

The final ranked images are extracted by taking the highest-scoring images from the tubes. The final ranked images are given in (13). Figure 5 explains the whole process of tube ranking and selection of the final set of frames.

$R_{image} = \{I_1, I_2, \ldots, I_l\}$, where $I_i \in T_i$ (13)
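Read together, (9)-(12) amount to reciprocal-rank voting: each retrieved image votes for its parent tube with weight α, scaled by how well-represented that tube is in R (β). The sketch below accumulates τ per tube; the per-tube summation is our assumption, since the text leaves the aggregation rule implicit:

```python
from collections import Counter

def rank_tubes(parent_tube_ids: list, k_final: int) -> list:
    """parent_tube_ids[r-1] is the tube id of the image ranked r in R
    (the output of the first two stages). Returns tube ids ordered by
    the accumulated temporal correlation cost tau."""
    counts = Counter(parent_tube_ids)           # images per tube in R
    max_count = max(counts.values())            # best-represented tube
    tau = Counter()
    for r, tube_id in enumerate(parent_tube_ids, start=1):
        alpha = 1.0 / r                         # Eq. (9): reciprocal rank
        beta = counts[tube_id] / max_count      # Eq. (10): tube coverage
        tau[tube_id] += alpha * beta            # Eq. (11), summed per tube
    return [t for t, _ in tau.most_common(k_final)]  # Eq. (12)
```

For example, rank_tubes([3, 3, 7, 3, 5], 2) returns [3, 7]: tube 3 dominates the top ranks and covers most of R, so it wins decisively.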

IV. EXPERIMENTS

We have evaluated our proposed approach on two public datasets, iLIDS-VID [36] and PRID-11 [37], that are often used for testing video-based re-identification frameworks. In addition, we have prepared a new re-identification dataset. It has been recorded using 2 cameras in an indoor environment with a moderately dense crowd (more than 10 people appearing within 4-6 sq. m), varying camera angles, and persons with similar clothing. Such situations have not yet been covered in existing re-identification video datasets. Details about these datasets are presented in Table II. Several experiments have been conducted to validate our method, and a thorough comparative analysis has been performed.

Dataset | Number of Cameras | Persons Re-appeared | Gallery Size | Challenges
PRID-11 [37] | 2 | 245 | 475 | Large volume
iLIDS-VID [36] | 2 | 119 | 300 | Clothing similarity
TRiViD | 2 | 47 | 342 | Dense crowd, tracking, similarity

TABLE II: Datasets used in our experiments. Only the TRiViD dataset is tracked to extract tubes; in the other datasets, the given sequences of images are treated as tubes.

Evaluation Metrics and Strategy: We have followed well-known experimental protocols for evaluating the method. For the iLIDS-VID and TRiViD dataset videos, the tubes are randomly split into 50% for training and 50% for testing. For PRID-11, we have followed the experimental setup proposed in [3], [5], [36], [38], [39]. Only the first 200 persons who appeared in both cameras of the PRID-11 dataset have been used in our experiments. A 10-fold cross-validation scheme has been adopted, and the average results are reported. We have prepared Cumulative Matching Characteristics (CMC) and mean average precision (mAP) curves to evaluate and compare the performance.
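For reference, the per-query CMC used throughout this section can be computed as below; the curves in Figures 6-8 are averages of such vectors over all queries and the 10 folds. A standard sketch, independent of the authors' evaluation code:

```python
import numpy as np

def cmc_single_query(ranked_ids: list, true_id, max_rank: int = 20) -> np.ndarray:
    """CMC vector for one query: entry r-1 is 1 if the correct identity
    appears among the top-r retrieved results, else 0."""
    hits = np.zeros(max_rank)
    for r, pid in enumerate(ranked_ids[:max_rank]):
        if pid == true_id:
            hits[r:] = 1.0   # a hit at rank r counts for all ranks >= r
            break
    return hits
```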


Fig. 2: Examples of an original tube (first row), detected noisy frames (second row), the tube after noise removal (third row), and the minimized tube for query execution (fourth row), taken from the TRiViD dataset.

Fig. 3: Architecture of the SVDNet used in the first stage of the re-identification framework shown in Figure 1. It contains an Eigenlayer before the fully connected layer. The Eigenlayer contains the weights to be used during training.

A. Comparative Analysis

As per the state of the art, our work, though unique in design, has some similarities with the video re-id methods proposed in [38], [40], the multiple query-based method [1], and the re-ranking method [20]. Therefore, we have compared our approach with these recently proposed methods. It has been observed that the proposed method can achieve a gain of up to 9.6% over the state-of-the-art methods when top-rank accuracy is estimated. Even when we compute the accuracy up to rank 20, our method keeps the upper hand with a margin of 3%. We consider this the key strength of the proposed method and claim it to be significant: it results from our method's reduction of false positives, an aspect that has received little attention from the re-identification research community. Figures 6-8 present the CMC curves and Table III summarizes the mAP up to rank 20 across the three datasets. Figure 9 shows a typical query and response on the PRID-11 dataset.

B. Computational Complexity Analysis

Re-identification in real time is a challenging task. All research work carried out so far presumes the gallery to be a pre-recorded set of images and tries to rank the best 5, 10, 15, or 20 images from the set. However, executing a single query takes considerable time when multiple images are involved in the query. We have carried out a comparative analysis of the computational complexity of various re-identification frameworks, including the proposed scheme. An Nvidia Quadro P5000 series GPU has been used to implement the frameworks. The results are reported in Figure 10. We have observed that the proposed tube-based re-identification framework takes less time as compared to the video re-id framework proposed in [38] and the multiple images-based re-id using SVDNet [1].

Method/Dataset | PRID | iLIDS | TRiViD
RCNN [38] | 81.2 | 74.6 | 79.11
TDL [40] | 78.2 | 74.1 | 80
Video-ReId [38] | 73.31 | 64.29 | 83.22
SVDNet [1] (single image) | 76.44 | 69 | 79.11
SVDNet [1] (multiple images) | 79.21 | 66.71 | 82.66
SVDNet+Re-rank [20] | 77.25 | 69.2 | 78.6
Proposed | 86.17 | 79.22 | 91.66

TABLE III: mAP (%) up to rank 20 across the three video datasets

C. Effect of φ

Our proposed method depends on the query threshold (φ). In this section, we present an analysis of the effect of φ on the results. Figure 11 depicts the average number of query images generated from various query tubes. It may be observed that a higher φ produces more query images.

Figure 12 depicts the average CMC obtained by varying φ. It may be observed that the accuracy does not increase significantly when φ is increased above 0.4.

Figure 13 presents the execution time (in seconds) obtained by varying the query threshold. It can also be observed that an increase in φ leads to a higher response time. Therefore, we have used φ = 0.4 in our experiments.


Fig. 4: The self-similarity estimation layer. It learns to measure self-similarity during training. We use ResNet50 [35] as the baseline. It takes a set of ranked images (SVDNet outputs) and produces a set of re-ranked images by introducing self-similarities between the query image and the retrieved images.

Fig. 5: Explanation of the re-identification process with the help of the proposed 3-stage framework depicted in Figure 1.


Fig. 6: The accuracy (CMC) on the PRID-11 dataset using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20], and the proposed method.


Fig. 7: The accuracy (CMC) on the iLIDS dataset using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20], and the proposed method.



Fig. 8: The accuracy (CMC) on the TRiViD dataset using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20], and the proposed method.

Fig. 9: Typical results obtained on the PRID-11 dataset using a single-image query [1], a video sequence [38], and the proposed method. Green boxes indicate correct retrievals.

D. Results After Various Stages

In this section, we present the effect of the various stages of the overall framework on the re-identification results. Table IV shows the accuracy (CMC) at each step of the proposed method. It may be observed that the proposed method gains 11% rank-1 accuracy after the first stage and 7% rank-1 accuracy after the second stage. The method gains 7% rank-20 accuracy in the first stage and 6% rank-20 accuracy after the second stage. Figure 14 shows an example of scores (true positives and false positives) during the self-similarity fusion. It may be observed that SVDNet output scores and similarity scores are high in the case of true positives, whereas similarity scores are relatively low in the case of false positives. More results can be found in the supplementary data.

V. CONCLUSION

In this paper, we propose a new person re-identification framework that is able to outperform existing re-identification schemes when applied to videos or sequences of frames. The method uses a CNN-based framework (SVDNet) at the beginning. A self-similarity layer is used to refine the SVDNet scores. Finally, a temporal correlation layer is used to aggregate multiple query outputs and to match tubes. A query optimization scheme has also been proposed to select an optimum set of images for a query tube. Our study reveals that the proposed method outperforms the state-of-the-art single image-based, multiple images-based, and video-based re-identification methods in several cases. The computational cost is also reasonably low.

One straightforward extension of the present work is to fuse methods such as camera pose-based [2], video-based [38], and description-based [16] approaches, which may lead to higher accuracy in complex situations. Group re-identification can also be attempted with a similar concept of tube-guided analysis.

ACKNOWLEDGMENT

The work has been funded under the KIST Flagship Project (Project No. XXXX) and the Global Knowledge Platform (GKP) of the Indo-Korea Science and Technology Center (IKST), executed at IIT Bhubaneswar under Project Code: XXX. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for this research.

REFERENCES

[1] Y. Sun, L. Zheng, W. Deng, and S. Wang, "SVDNet for pedestrian retrieval," in ICCV. IEEE, 2017, pp. 3820–3828.

[2] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in CVPR, vol. 1, no. 2, 2018, p. 6.

[3] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, "Jointly attentive spatial-temporal pooling networks for video-based person re-identification," in ICCV. IEEE, 2017, pp. 4743–4752.

[4] J. Lv, W. Chen, Q. Li, and C. Yang, "Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns," in CVPR, 2018, pp. 7948–7956.

[5] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang, "Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding," in CVPR, 2018, pp. 1169–1178.

[6] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, "MARS: A video benchmark for large-scale person re-identification," in ECCV. Springer, 2016, pp. 868–884.

[7] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in ICCV, 2015, pp. 1116–1124.

[8] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in ECCV. Springer, 2016, pp. 17–35.

[9] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, Q. Tian et al., "Person re-identification in the wild," in CVPR, vol. 1, 2017, p. 2.

[10] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang, "Group consistent similarity learning via deep CRF for person re-identification," in CVPR, 2018, pp. 8649–8658.

[11] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang, "Adversarially occluded samples for person re-identification," in CVPR, 2018, pp. 5098–5107.

[12] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, "Pose transferrable person re-identification," in CVPR, 2018, pp. 4099–4108.

[13] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, "A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking," in CVPR, 2018.

[14] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, "Camera style adaptation for person re-identification," in CVPR, 2018, pp. 5157–5166.

[15] A. Barman and S. K. Shah, "SHaPE: A novel graph theoretic algorithm for making consensus-based decisions in person re-identification systems," in ICCV. IEEE, 2017, pp. 1124–1133.



Fig. 10: Average response time (in seconds) for a given query across the datasets. We have taken 100 random query tubes and calculated the average response time using RCNN [38], TDL [40], Video re-id [38], SVDNet [1] (single image), SVDNet (multiple images), SVDNet+Re-rank [20], and the proposed method.

Method / Top Rank | PRID-11 [37] (1/5/10/20) | iLIDS [36] (1/5/10/20) | TRiViD (1/5/10/20)
SVDNet (multi-image) | 66/76/84/89 | 56/68/76/86 | 68/71/74/89
SVDNet + self-similarity | 69/77/84/89 | 61/71/79/86 | 71/77/76/91
SVDNet + self-similarity + temporal correlation (proposed) | 78/89/92/91 | 67/84/91/96 | 79/88/91/98

TABLE IV: Accuracy (CMC) at each step of the proposed method


Fig. 11: Average number of query images obtained by varying the query threshold (φ). We have taken 100 query sequences randomly, and the average number of optimized images is reported. It may be observed that a higher φ produces more query images.

[16] X. Chang, T. M. Hospedales, and T. Xiang, "Multi-level factorisation net for person re-identification," in CVPR, vol. 1, 2018, p. 2.

[17] D. Chung, K. Tahboub, and E. J. Delp, "A two stream siamese convolutional neural network for person re-identification," in ICCV, 2017.

[18] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, "Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning," in CVPR, 2018, pp. 5177–5186.

[19] J. Zhang, N. Wang, and L. Zhang, "Multi-shot pedestrian re-identification via sequential decision making," in CVPR, 2018.

[20] S. Paisitkriangkrai, C. Shen, and A. Van Den Hengel, "Learning to rank in person re-identification with metric ensembles," in CVPR, 2015, pp. 1846–1855.

[21] L. He, J. Liang, H. Li, and Z. Sun, "Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach," in CVPR, 2018, pp. 7073–7082.

[22] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah, "Human semantic parsing for person re-identification," in CVPR, 2018, pp. 1062–1071.

[23] S. Li, S. Bak, P. Carr, and X. Wang, "Diversity regularized spatiotemporal attention for video-based person re-identification," in CVPR, 2018, pp. 369–378.


Fig. 12: Accuracy (CMC) obtained by varying the query threshold (φ). We have taken 100 query sequences randomly and report the average. It may be observed that a higher φ may not produce higher accuracy.


[24] W. Li, X. Zhu, and S. Gong, "Harmonious attention network for person re-identification," in CVPR, vol. 1, 2018, p. 2.

[25] Z. Liu, D. Wang, and H. Lu, "Stepwise metric promotion for unsupervised video person re-identification," in ICCV. IEEE, 2017, pp. 2448–2457.

[26] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, "Multi-scale deep learning architectures for person re-identification," in ICCV, 2017.

[27] E. Ristani and C. Tomasi, "Features for multi-target multi-camera tracking and re-identification," in CVPR, 2018.

[28] S. Roy, S. Paul, N. E. Young, and A. K. Roy-Chowdhury, "Exploiting transitivity for learning person re-identification models on a budget," in CVPR, 2018, pp. 7064–7072.

[29] Y. Shen, H. Li, T. Xiao, S. Yi, D. Chen, and X. Wang, "Deep group-shuffling random walk for person re-identification," in CVPR, 2018, pp. 2265–2274.

[30] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, "End-to-end deep Kronecker-product matching for person re-identification," in CVPR, 2018, pp. 6886–6895.

[31] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang, "Dual attention matching network for context-aware feature sequence based person re-identification," arXiv preprint arXiv:1803.09937, 2018.



Fig. 13: Execution time obtained by varying φ. It may be observed that a higher φ takes more time to execute, as it produces more query images.

Fig. 14: Typical examples of SVDNet outputs and self-similarity scores on TRiViD (first two rows) and PRID-11 [37] (last row).


[32] Y. Wang, Z. Chen, F. Wu, and G. Wang, "Person re-identification with cascaded pairwise convolutions," in CVPR, 2018, pp. 1470–1478.

[33] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang, "Attention-aware compositional network for person re-identification," in CVPR, 2018.

[34] M. B. Jensen, K. Nasrollahi, and T. B. Moeslund, "Evaluating state-of-the-art object detector on challenging traffic light data," in CVPRW. IEEE, 2017, pp. 882–888.

[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[36] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video ranking," in ECCV. Springer, 2014, pp. 688–703.

[37] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, "Person re-identification by descriptive and discriminative classification," in SCIA. Springer, 2011, pp. 91–102.

[38] N. McLaughlin, J. Martinez del Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification," in CVPR, 2016, pp. 1325–1334.

[39] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, "See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification," in CVPR. IEEE, 2017, pp. 6776–6785.

[40] J. You, A. Wu, X. Li, and W.-S. Zheng, "Top-push video-based person re-identification," in CVPR, 2016, pp. 1345–1353.

Sk. Arif Ahmed obtained a master's degree in Computer Applications from West Bengal University of Technology. He is currently working as an Assistant Professor at Haldia Institute of Technology and is a Ph.D. candidate at NIT Durgapur, India. His areas of interest are computer vision, image processing, and scene analysis. He has published 10 research articles in international journals and conferences.

Debi Prosad Dogra is an Assistant Professor in the School of Electrical Sciences, IIT Bhubaneswar, India. He received his M.Tech degree from IIT Kanpur in 2003 after completing his B.Tech (2001) from HIT Haldia, India. After finishing his master's, he joined Haldia Institute of Technology as a faculty member in the Department of Computer Science & Engineering (2003-2006). He worked with ETRI, South Korea during 2006-2007 as a researcher. Dr. Dogra has published more than 75 international journal and conference papers in the areas of computer vision, image segmentation, and healthcare analysis. He is a member of IEEE.

Heeseung Choi received the B.S., M.S., and Ph.D. degrees in Electrical and Electronic Engineering from Yonsei University, Seoul, Korea, in 2004, 2006, and 2011, respectively. He has been a research member of BERC (Biometrics Engineering Research Center, Korea) and of Computer Science and Engineering at Michigan State University, USA. He is currently a research member at the Center for Imaging Media Research (CIMR) in the Korea Institute of Science and Technology (KIST). His research interests include computer vision, biometrics, image processing, forensic science, and pattern recognition.

Seungho Chae received the Ph.D. degree in computer science from Yonsei University, Seoul, Korea in 2018. He is currently a post-doctoral researcher at the Korea Institute of Science and Technology. His research interests lie in the fields of computer vision, augmented reality, and human-computer interaction. In particular, his research focuses on object tracking and person re-identification.


Ig-Jae Kim is currently the Director of the Center for Imaging Media Research, Korea Institute of Science and Technology (KIST), Seoul, South Korea. He is also an associate professor at Korea University of Science and Technology. He received his Ph.D. degree from the EECS of Seoul National University in 2009, and his M.S. and B.S. degrees from the EE of Yonsei University, Seoul, South Korea, in 1998 and 1996, respectively. He worked at the Massachusetts Institute of Technology (MIT) Media Lab as a postdoctoral researcher (2009-2010). He has published over 80 fully refereed papers in international journals and conferences, including ACM Transactions on Graphics, SIGGRAPH, Pattern Recognition, ESWA, etc. He is interested in pattern recognition, computer vision, computer graphics, and computational photography.