![Page 1: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/1.jpg)
Coupling Video Segmentation and Action Recognition
Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars
WACV 2014
![Page 2: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/2.jpg)
Outline
• Action Recognition: an essential task
• Problem Definition
• Motion segmentation using trajectories
• Our proposed methods
• Datasets and results
2
• Conclusion
![Page 3: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/3.jpg)
Action Recognition – an essential task
• Motivation -> Huge amount of videos
Upload rate in YouTube: 1
hour of video per second• Applications:
• Content-based search
• Summarization
• Intelligent fast forwarding
• Abnormality detection in surveillance
videos
• Scientific researches e.g. relation
between number of smoking scenes in
the movies and human addiction
• A key for human and robot interaction
3
![Page 4: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/4.jpg)
Our Question
Whether video segmentation can be
exploited for improved action
recognition?recognition?
Effect of ideal segmentation on classification accuracy
4
88.5%83.5%
accuracy improvement
(YouTube dataset)
![Page 5: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/5.jpg)
Trajectories
• State-of-the-art video segmentation algorithms use trajectories as their building
blocks.
• Densely sampled patches that are tracked over several frames, following the
underlying motion of the object or scene.
description
5
HOG HOF MBH
![Page 6: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/6.jpg)
Trajectories
6
![Page 7: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/7.jpg)
Motion Segmentation
• Motion segmentation reduces to cluster coherent, spatially close trajectories.
• Building a fully connected graph
o Each node corresponds to a trajectory.
o Weight of the edge between node i and node j depends on spatial distance
and shape of i and j trajectories.
• normalized-cut /spectral clustering on the graph• normalized-cut /spectral clustering on the graph
o Assign labels to each node corresponding to each object
• Motion segmentation is a fully bottom-up foreground/background segmentation
7
![Page 8: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/8.jpg)
An example of Bottom-up segmentation
8
![Page 9: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/9.jpg)
Recognition Pipeline
Learning
SegmentationAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
first video
.
9
Learning
SegmentationAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
nth video
.
.
.
![Page 10: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/10.jpg)
Our proposed methods
• We propose several methods that integrate segmentation and recognition:
• Segmentation
o Split action-related foreground and action-unrelated background in a top-down
fashion.
• Co-segmentation• Co-segmentation
o Multiple videos of the same action should have consistent segmentation; so
we segment a video leveraging segmentation of other videos.
• Iterative learning
o An iterative learning scheme that alternates between segmentation and
recognition.
• Kernels
o Mapping the original feature space with a non-linear kernel
10
![Page 11: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/11.jpg)
Segmentation – Top Down
• initial over-segmentation of the video in trajectory-groups
• Positive (action-related) trajectory-groups: those that have more than 25% overlap
with ground-truth bounding box
• Learning the similarities that trajectory-groups share across the
DATASET, independent of the action label. We call it actionness operator.
11
![Page 12: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/12.jpg)
Examples of top-down segmentation
12
![Page 13: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/13.jpg)
Segmentation – Top Down
model
SegmentationAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
first video
.
actionness
13
Learningactivity models
SegmentationAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
nth video
.
.
. actionness
![Page 14: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/14.jpg)
Segmentation – Top Down
hT
exhΘ−∝Θ=Ω ),1,(
Cost function actionness model
14
exh ∝Θ=Ω ),1,(
Cost is inversely proportional to score of actionness
K
ix 1,0∈binary labeling actionness score
![Page 15: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/15.jpg)
Co-segmentation
15
![Page 16: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/16.jpg)
Co-segmentation
model
foreground
background
BoW description
BoW description
Hf
Hb
first video
actionness
ith class
16
LearningCo-segmentation
Algorithm
foreground
background
BoW description
BoW description
Hf
Hb
nth video
ith class
![Page 17: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/17.jpg)
Co-segmentation
∑∑∈
′′
∈
∈′
−ΘΩ=Θe
iiv
i Ghh
iiii
Gh
ii
cj
jxxhhSimxhxE
),(
),(),,();( λ
unary term: actionness
17
∈∈ ′e
iiv
i GhhGh ),(
Co-segmentation cost function reward for similar foreground groups
sub-modular exact solution using graph-cut
![Page 18: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/18.jpg)
Iterative Learning
• The two previous methods, solve the segmentation and use its output during action classification but in this approach, we alternate between segmentation and recognition.
• Latent-SVM based approach (discriminate classes as much as possible the goal is not better segmentation but better classification). Latent variables are 1-0 labels of each trajectory-group.
18
group.
• 2 major restrictions of L-SVM
o sensitive to initialization
o works with linear models (will be discussed in next slides)
• This method can be also used in conjunction with the segmentation methods introduced in previous sections
![Page 19: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/19.jpg)
Iterative Learning
modelLearning
Co-segmentationAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
first video
actionness
19
LearningAlgorithm
foreground
background
BoW description
BoW description
Hf
Hb
nth video
![Page 20: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/20.jpg)
Iterative Learning
∑
∑∑∈
′′
∈
∈
Θ+
−ΘΩ=Θ′
eii
vi
globii
Ghh
iiii
Gh
actionnessii
cj
j
xh
xxhhSimxhxE
),,(
),(),,();(
2
),(
ϕλ
λ
20
∑∈
Θ+v
iGh
xh ),,(2 ϕλ
discriminative model
cost of discriminative labeling
![Page 21: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/21.jpg)
Kernels
• So far, we have used linear models for classification. While the iterative learning is a powerful tool, it is limited to linear model.
• Alternative is to map the features into a kernel.
• Excluding the iterative learning, all the other proposed methods can
21
•be used together with a kernel
linear
non-linear
![Page 22: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/22.jpg)
Datasets
• UCF-Sports
o 10 categories
o 150 videos
o extracted from sport broadcasts.
• YouTube
o 11 categories
o 1600 videos (quality 240×320)
o Handheld camera -> camera motion
Challenges: large intra-class variability in view point, speed of action and
cluttered background
22
![Page 23: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/23.jpg)
Results - YouTube
Method Recognition acc
FG/BG - Using ground-truth
bounding box (upper bound)
88.5%
Baseline BoW (No seg.) 83.5%
Bottom Up 83.6%
Top Down (Actionness) 85.0 %
Top Down + Co-segmentation 85.1%
23
![Page 24: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/24.jpg)
Results (Iterative) - YouTube
Initial
Segmentation
Method Recognition
accuracy
Random
Iteration
Iteration + Top Down
Iteration + Co-seg
Iteration + Top Down + Co-seg
85.0%
85.2%
85.7%
85.7%
Iteration 85.2%
Top Down
Iteration
Iteration + Top Down
Iteration + Co-seg
Iteration + Top Down + Co-seg
85.2%
86.1%
86.2%
86.7%
Ground-Truth
Iteration
Iteration + Top Down
Iteration + Co-seg
Iteration + Top Down + Co-seg
85.5%
86.4%
86.2%
86.7%
24
![Page 25: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/25.jpg)
Results (kernel) - YouTube
Method Recognition acc
Top Down segmentation +
kernel-SVM
86.2%
Top Down + Co-seg + 86.8%Top Down + Co-seg +
kernel-SVM
86.8%
25
![Page 26: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/26.jpg)
Results (State-of-the-art) - YouTube
Method Accuracy
Brendelet al. [1] 77.8%
Wang et al. [8] 84.2%
Sapienza et al. [4] 80.0% Sapienza et al. [4] 80.0%
Gaidon et al. [2] 87.9%
Iterative (1)
Kernel (2)
(1)+(2)
86.7%
86.8%
87.4%
26
![Page 27: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/27.jpg)
Results (State-of-the-art) – UCFsports
Method Accuracy
Lan et al. [5] 73.1%
Raptis et al. [7] 79.4%
Shapovalova et al. [3] 75.3%Shapovalova et al. [3] 75.3%
Todorovic et al. [6] 86.8%
Iterative (1)
Kernel (2)
(1)+(2)
81.5%
86.1%
86.1%
27
![Page 28: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/28.jpg)
Qualitative results for segmentation
28
![Page 29: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/29.jpg)
Conclusion
• A good video segmentation is fundamental to obtain accurate action recognition
• We have proposed and evaluate several ways to integrate segmentation and recognition
• Coupling segmentation and recognition in an iterative learning can always improve the recognition accuracy.
• An alternative way to obtain similar results is to map the features into a non-linear kernel.
29
![Page 30: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/30.jpg)
References[1] W. Brendel and S. Todorovic. Activities as time series of human postures. In ECCV. 2010
[2] A. Gaidon, Z. Harchaoui, and C. Schmid. A time series ker-nel for action recognition. In BMVC, 2011.
[3] N. Shapovalova, A. Vahdat, K. Cannons, T. Lan, and G. Mori. Similarity constrained latent support vector machine: An application to weakly supervised action classification. In ECCV, 2012.
[4] M. Sapienza, F. Cuzzolin, and P. Torr. Learning discriminative space-time actions from weakly labeled videos. In BMVC, 2012.
[5] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011.
[6] S. Todorovic. Human activities as stochastic kronecker graphs. In ECCV. 2012.
[7] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discrim-inative action parts from mid-level video representations. In CVPR, 2012.
[8] H. Wang, A. Kl¨ aser, C. Schmid, and C.-L. Liu. Action recog-nition by dense trajectories. In CVPR, 2011.
![Page 31: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/31.jpg)
Thanks For Your Attention!
Questions?
![Page 32: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/32.jpg)
Challenges• Intra-class variability
o In common with objects: varying viewpoints, backgrounds and partial
occlusions.
o Specific for actions: performed by different people, at different speeds
and in different ways.
• Uncertainty in actual extent of action
temporal delineation -> When does the action start/end?o temporal delineation -> When does the action start/end?
o spatial delineation -> Does the action include the whole actor or only a
part of that? Should objects that are involved be included as well?
• Number of training data
o cumbersome process of collecting data (accuracy of keyword- based
search for 235 terms: 10%)
o size of dataset quickly grows
32
![Page 33: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/33.jpg)
Segmentation – Top Down
∑=K
xhH ∑ −=K
33
∑=
=k
kkf xhH1
)1(1
∑=
−=K
k
kkb xhH
![Page 34: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/34.jpg)
Co-segmentation
• Take into account similar motion and appearance
characteristics that trajectory-group share with
trajectory-groups among other videos of same label.
• Building a graph from all trajectory-groups of all training
videos of class c: weight of each node is actionness
score of each trajectory-group and weight of edges are score of each trajectory-group and weight of edges are
similarity between inter-video-connected trajectory-
groups.
34
![Page 35: Coupling Video Segmentation and Action Recognition · • A good video segmentation is fundamental to obtain accurate action recognition • We have proposed and evaluate several](https://reader036.vdocument.in/reader036/viewer/2022071003/5fbfe7f30bfe54213f745a5f/html5/thumbnails/35.jpg)
Setting up experiments and parameters
o UCF Sports
o The dataset is split into 103 training and 47 test samples. This
separation reduces the chance of videos in the test set having strong
scene correlations with videos in the training set.
o performance measuring: mean per-class accuracy
o YouTube
o Dividing the dataset to 25 groups: leave-one-group-out cross validation
o performance measuring: average accuracy over all classes
o Parameters
o Trajectories parameters same as their authors
o Trajectory description: Histogram of Gradients (HOG), Histogram of
Flows (HOF) and Motion Boundary Histogram(MBH)
o Video description: BoW with vocabulary size of 4000
o α1 and α2 are tuned with cross-validation on training data.
35