Bag of Video-Words Video Representation
UCF VIRAT EFFORTS
Outline
Bag of video-words approach for video representation
- Feature detection
- Feature quantization
- Histogram-based video descriptor generation
Preliminary experimental results on aerial videos
Discussion on ways to improve the performance
Bag of video-words approach (I)
Interest Point Detector
Motion Feature Detection
Bag of video-words approach (II)
Video-word A
Video-word B
Video-word C
Feature Quantization: Codebook Generation
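The slide names codebook generation but not the clustering algorithm. A minimal sketch, assuming plain k-means over the local feature descriptors (a common choice for bag-of-words codebooks); `build_codebook`, `nearest_word`, and the toy descriptors are illustrative names and data, not taken from the slides.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))

def build_codebook(descriptors, num_words, iterations=20, seed=0):
    """Plain k-means (assumed here): each final centre is one video-word."""
    rng = random.Random(seed)
    centres = [list(d) for d in rng.sample(descriptors, num_words)]
    for _ in range(iterations):
        clusters = [[] for _ in range(num_words)]
        for d in descriptors:
            clusters[nearest_word(d, centres)].append(d)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return centres

# Toy 2-D descriptors forming two well-separated groups (made-up data).
toy_descriptors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
codebook = build_codebook(toy_descriptors, num_words=2)
words = [nearest_word(d, codebook) for d in toy_descriptors]
```

In this toy case the first three descriptors share one video-word and the last three share the other.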
Bag of video-words approach (III)
Histogram-based video descriptor generation
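The descriptor generation step can be sketched as counting, for each video, how many of its features fall on each video-word. This assumes nearest-centre assignment and L1 normalisation, neither of which the slide states; `video_histogram` is a hypothetical helper name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def video_histogram(descriptors, codebook):
    """Normalised count of video-words over all features of one video."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    # L1 normalisation (an assumption) makes videos with different
    # numbers of detected features comparable.
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Made-up codebook with two words and four features from one clip.
example_codebook = [[0.0, 0.0], [10.0, 10.0]]
example_features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [0.0, 0.0]]
example_histogram = video_histogram(example_features, example_codebook)
```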
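Similarity Metrics
Histogram Intersection
$$\mathrm{Sim}(H_a, H_b) = \sum_i \min\big(H_a(i), H_b(i)\big)$$
Chi-square distance
$$\mathrm{Sim}(H_a, H_b) = \exp\big(-\chi^2(H_a, H_b)\big) = \exp\Big(-\sum_i \frac{\big(H_a(i) - H_b(i)\big)^2}{H_a(i) + H_b(i)}\Big)$$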
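The two similarity measures named on this slide can be sketched as follows, assuming their usual definitions: the sum of bin-wise minima for histogram intersection, and the exponential of the negative chi-square distance. Skipping bins that are empty in both histograms is a choice made here to avoid division by zero.

```python
import math

def histogram_intersection(h_a, h_b):
    """Sum over bins of the smaller of the two bin values."""
    return sum(min(a, b) for a, b in zip(h_a, h_b))

def chi_square_similarity(h_a, h_b):
    """exp(-chi2), where chi2 sums (a - b)^2 / (a + b) over non-empty bins."""
    chi2 = sum((a - b) ** 2 / (a + b) for a, b in zip(h_a, h_b) if a + b > 0)
    return math.exp(-chi2)

# Two made-up normalised histograms over three video-words.
h_a = [0.5, 0.5, 0.0]
h_b = [0.25, 0.25, 0.5]
intersection_ab = histogram_intersection(h_a, h_b)
chi_square_ab = chi_square_similarity(h_a, h_b)
```

Both measures give 1 when a normalised histogram is compared with itself and get smaller as the histograms diverge.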
Classifiers
- Bayesian Classifier
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
  - Histogram Intersection Kernel
  - Chi-square Kernel
  - RBF (Radial Basis Function) Kernel
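As an illustration of how one of the listed classifiers uses the video histograms, here is a minimal K-nearest-neighbours sketch that ranks training clips by histogram intersection. The value k = 3, the toy histograms, and the function names are assumptions made for the example.

```python
from collections import Counter

def histogram_intersection(h_a, h_b):
    """Sum over bins of the smaller of the two bin values."""
    return sum(min(a, b) for a, b in zip(h_a, h_b))

def knn_classify(query, training_set, k=3):
    """training_set is a list of (histogram, label) pairs."""
    ranked = sorted(training_set,
                    key=lambda item: histogram_intersection(query, item[0]),
                    reverse=True)  # most similar clips first
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up two-word histograms for two action classes.
training = [([0.8, 0.2], "kicking"), ([0.7, 0.3], "kicking"),
            ([0.2, 0.8], "walking"), ([0.1, 0.9], "walking")]
predicted = knn_classify([0.75, 0.25], training)
```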
Experiments on Aerial videos Dataset
Blimp with a HD camera on a gimbal
11 Actions: Digging, gesturing, picking up, throwing, kicking, carrying object, walking, standing, running, entering vehicle, exiting vehicle
Clipping & Cropping Actions
- Optimal box is created so that the object of interest doesn't go out of view in all the frames (Start Frame to End Frame)
Start of Frame
End of Frame
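The slides do not say how the optimal box is computed. One plausible reading, sketched below, is the smallest box that contains the object's bounding box in every frame from the start frame to the end frame; the per-frame boxes are assumed to come from annotation or tracking, and `enclosing_box` is a hypothetical helper.

```python
def enclosing_box(per_frame_boxes):
    """Smallest box containing every per-frame box.

    Each box is (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    x_mins, y_mins, x_maxs, y_maxs = zip(*per_frame_boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))

# Made-up boxes of one actor over three frames.
boxes = [(10, 20, 50, 80), (15, 18, 60, 85), (5, 25, 55, 90)]
crop = enclosing_box(boxes)
```

Cropping every frame of the clip to `crop` keeps the object in view throughout.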
Feature Detection for Video Clips
200 Features
Digging Kicking Throwing Walking
Classification Results (I): “kicking” (22 clips) vs. “non-kicking” (22 clips)
Number of features per video   Codebook 50   Codebook 100   Codebook 200
50                             65.91%        79.55%         75.00%
100                            79.55%        77.27%         77.27%
200                            77.27%        79.55%         81.82%
Classification Results (II)
Classification Results (III) “Digging”, “Kicking”, “Walking”, “Throwing” (25 clips x 4)
Similarity Matrix (Histogram Intersection); rows and columns: digging, kicking, throwing, walking
Classification Results (V)
Average accuracy with different codebook sizes
Number of features per video   Codebook 100   Codebook 200   Codebook 300
200                            84.6%          85.0%          86.7%
Confusion table for the case of codebook size 300
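For reference, an average accuracy can be read off a confusion table as the mean of the per-class accuracies (diagonal entry divided by row sum). The counts below are made up for illustration and are not the results on the slide; whether the slides average per class or over all clips is not stated.

```python
def per_class_accuracy(confusion):
    """confusion[i][j] = number of clips of class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

def average_accuracy(confusion):
    """Mean of the per-class accuracies."""
    accuracies = per_class_accuracy(confusion)
    return sum(accuracies) / len(accuracies)

# Hypothetical confusion table for four classes with 25 clips each.
toy_confusion = [[20, 2, 1, 2],
                 [3, 19, 1, 2],
                 [1, 1, 22, 1],
                 [2, 3, 1, 19]]
toy_average = average_accuracy(toy_confusion)
```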
Misclassified examples (I)
“Walking” misclassified as “kicking”
Misclassified examples (II)
“Digging” misclassified as “walking”
Misclassified examples (III)
“Walking” misclassified as “throwing”
How to improve the performance?
Low-Level Features
- Stable motion features
- Different motion features
- Different motion feature sampling
- Hybrid of motion and static features
Video-words generation
- Unsupervised method: Hierarchical K-Means (David Nister, et al., CVPR 2006)
- Supervised methods: Random Forest (Bill Triggs, et al., NIPS 2007); “Visual Bits” (Rong Jin, et al., CVPR 2008)
Classifiers
- SVM kernels: histogram intersection vs. chi-square distance
- Multiple kernels
Stable motion features
- Motion compensation
- Video clipping and cropping
Start of Frame
End of Frame
Different Low-level Features
- Flattened gradient vector (magnitude)
- Histogram of Gradient (direction)
- Histogram of Optical Flow
- Combination of all types of features
Feature sampling
- Feature detection: Gabor filter or 3D Harris corner detection
- Random sampling
- Grid-based sampling
Bill Triggs et al., Sampling Strategies for Bag-of-Features Image Classification, ECCV 2006
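The two detector-free alternatives can be sketched as choosing spatiotemporal locations (x, y, t) at which local descriptors would then be extracted. The step sizes, sample count, and function names below are illustrative assumptions; descriptor extraction itself is not shown.

```python
import random

def grid_sample(width, height, num_frames, step_xy, step_t):
    """Regular grid of (x, y, t) locations over the video volume."""
    return [(x, y, t)
            for t in range(0, num_frames, step_t)
            for y in range(0, height, step_xy)
            for x in range(0, width, step_xy)]

def random_sample(width, height, num_frames, count, seed=0):
    """Uniformly random (x, y, t) locations inside the video volume."""
    rng = random.Random(seed)
    return [(rng.randrange(width), rng.randrange(height),
             rng.randrange(num_frames))
            for _ in range(count)]

grid_points = grid_sample(width=4, height=4, num_frames=2, step_xy=2, step_t=1)
random_points = random_sample(width=320, height=240, num_frames=100, count=200)
```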
Hybrid of Motion and Static Features (I)
Multiple-frame features (spatiotemporal, motion)
- 3D Harris
- Capture the local spatiotemporal information around the interest points
Single-frame features (spatial, static)
- 2D Harris detector
- MSER (Maximally Stable Extremal Regions) detector
- Perform action recognition from a sequence of instantaneous postures or poses
- Overcome the shortcoming of multiple-frame features, which require relatively stable camera motion
Hybrid of motion and static features
- Represent a video by the combination of multiple-frame and single-frame features
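The slides do not specify how the two feature types are combined. A minimal sketch, assuming each type has its own codebook and the two normalised histograms are simply concatenated with a relative weight; `hybrid_descriptor` and `motion_weight` are hypothetical names.

```python
def l1_normalise(histogram):
    """Scale a histogram so its bins sum to 1 (unchanged if all zero)."""
    total = sum(histogram)
    return [v / total for v in histogram] if total else list(histogram)

def hybrid_descriptor(motion_hist, static_hist, motion_weight=0.5):
    """Concatenate weighted motion and static histograms (an assumption)."""
    motion = l1_normalise(motion_hist)
    static = l1_normalise(static_hist)
    return ([motion_weight * v for v in motion] +
            [(1.0 - motion_weight) * v for v in static])

# Made-up word counts from a motion codebook and a static codebook.
combined = hybrid_descriptor([2, 2], [1, 3])
```

Another option is to keep the histograms separate and fuse them at the kernel level.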
Hybrid of Motion and Static Features (II)
Examples of 2D Harris and MSER features
2D Harris
MSER
Hybrid of Motion and Static Features (III)
Experiments on three action datasets
- KTH: 6 action categories, 600 videos
- UCF sports: 10 action categories, about 200 videos
- YouTube videos: 11 action categories, about 1,100 videos
KTH dataset
Boxing Clapping Waving
Walking Jogging Running
Experimental results on KTH dataset
Recognition using motion (L), static (M), and hybrid (R) features
Average accuracies shown on the slide: 92.66%, 82.96%, 87.65%
Results on UCF sports dataset
The average accuracies for the static, motion, and static+motion strategies are 74.5%, 79.6%, and 84.5%, respectively.
YouTube Video Dataset (I)
Cycling Diving Golf Swinging
Riding Juggling
YouTube Video Dataset (II)
Basketball Shooting Swinging Tennis Swinging
Volleyball Spiking Trampoline Jumping
Results on YouTube dataset
The average accuracies for motion, static, and hybrid features are 65.4%, 63.1%, and 71.2%, respectively.
Hierarchical K-Means (I)
Traditional k-means
- Slow when generating a large codebook
- Less discriminative when dealing with a large codebook
Hierarchical k-means
- Build a tree on the training features
- The children of a node are the clusters obtained by applying k-means to the features in that node
- Treat each node as a “word”, so the tree is a hierarchical codebook
D. Nister, Scalable Recognition with a Vocabulary Tree, CVPR 2006
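The tree construction described above can be sketched as recursive k-means: cluster the features, then cluster each cluster again until a depth limit is reached. The branching factor, depth, dictionary-based node layout, and toy data are assumptions for illustration, not details from the slides or the cited paper.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centres):
    """Group points by their nearest centre."""
    groups = [[] for _ in centres]
    for p in points:
        groups[min(range(len(centres)),
                   key=lambda j: sq_dist(p, centres[j]))].append(p)
    return groups

def kmeans(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        for j, members in enumerate(assign(points, centres)):
            if members:
                centres[j] = [sum(c) / len(members) for c in zip(*members)]
    return centres, assign(points, centres)

def build_tree(points, branching, depth):
    """Each node holds its children; every child stores its cluster centre."""
    node = {"children": []}
    if depth == 0 or len(points) < branching:
        return node  # leaf
    centres, groups = kmeans(points, branching)
    for centre, members in zip(centres, groups):
        if members:
            child = build_tree(members, branching, depth - 1)
            child["centre"] = centre
            node["children"].append(child)
    return node

def quantise(descriptor, node, path=()):
    """Descend to a leaf; the path of child indices identifies the words."""
    if not node["children"]:
        return path
    best = min(range(len(node["children"])),
               key=lambda j: sq_dist(descriptor, node["children"][j]["centre"]))
    return quantise(descriptor, node["children"][best], path + (best,))

toy_points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
tree = build_tree(toy_points, branching=2, depth=2)
path_near = quantise([0.1, 0.1], tree)
path_far = quantise([5.0, 5.0], tree)
```

Quantising a descriptor costs one comparison per child at each level, rather than one comparison per word as in a flat codebook.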
Hierarchical K-Means (II)
Advantages
- The tree also defines the quantization of features, so it integrates indexing and quantization in one structure
- It is much more efficient when generating a large codebook
- The word (node) frequency can be weighted by the inverse document frequency
- It can generate more discriminative words than flat k-means
- It supports a large codebook, which generally gives better performance
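The inverse-document-frequency weighting mentioned above can be sketched as follows, assuming the common form log(N / n_w), where N is the number of training videos and n_w the number of videos containing word w. The exact weighting used in the slides is not given, and the counts are made up.

```python
import math

def inverse_document_frequency(word_counts_per_video):
    """One weight per word: log(number of videos / videos containing it)."""
    num_videos = len(word_counts_per_video)
    num_words = len(word_counts_per_video[0])
    idf = []
    for w in range(num_words):
        containing = sum(1 for counts in word_counts_per_video if counts[w] > 0)
        idf.append(math.log(num_videos / containing) if containing else 0.0)
    return idf

def weight_histogram(counts, idf):
    """Multiply each word count by the weight of that word."""
    return [c * w for c, w in zip(counts, idf)]

# Made-up word counts for three videos over a three-word codebook.
videos = [[3, 0, 1], [2, 1, 0], [1, 0, 0]]
idf_weights = inverse_document_frequency(videos)
weighted = weight_histogram(videos[0], idf_weights)
```

A word that occurs in every video, like the first word here, receives weight zero because it does not help tell videos apart.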
Random Forests (I)
K-means based quantization methods
- Unsupervised
- Suffer from the high dimensionality of the features
Single-tree based methods
- Each path through the tree typically accesses only a few of the feature dimensions
- Fail to deal with the variance of the feature dimensions
- Fast, but performance is not even as good as k-means
Random Forests
- Build an ensemble of trees
- Each tree node is split by checking a randomly selected subset of the feature dimensions
- All trees are built using video or image labels (supervised method)
- Instead of using the trees as an ensemble classifier, treat all the leaves of all the trees as “words”
- The generated “words” are more meaningful and discriminative, since they contain class category information
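The idea of using leaves as words can be sketched with a small randomized forest. At each node a few random (dimension, threshold) candidates are tried and the one giving the purest label split is kept, and every leaf of every tree gets its own word index. The Gini criterion, number of tries, depth, and toy data are assumptions chosen for brevity and may differ from the cited method.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def grow(samples, rng, depth, counter, tries=5):
    """samples: list of (descriptor, label); counter[0] is the next word id."""
    def make_leaf():
        counter[0] += 1
        return {"leaf": counter[0] - 1}
    if depth == 0 or len({label for _, label in samples}) <= 1:
        return make_leaf()
    best = None
    for _ in range(tries):  # random candidate splits, scored with the labels
        dim = rng.randrange(len(samples[0][0]))
        threshold = rng.choice(samples)[0][dim]
        left = [s for s in samples if s[0][dim] < threshold]
        right = [s for s in samples if s[0][dim] >= threshold]
        if not left or not right:
            continue
        score = (len(left) * gini([l for _, l in left]) +
                 len(right) * gini([l for _, l in right]))
        if best is None or score < best[0]:
            best = (score, dim, threshold, left, right)
    if best is None:
        return make_leaf()
    _, dim, threshold, left, right = best
    return {"dim": dim, "threshold": threshold,
            "left": grow(left, rng, depth - 1, counter, tries),
            "right": grow(right, rng, depth - 1, counter, tries)}

def leaf_of(descriptor, node):
    while "leaf" not in node:
        side = "left" if descriptor[node["dim"]] < node["threshold"] else "right"
        node = node[side]
    return node["leaf"]

def build_forest(samples, num_trees=3, depth=3, seed=0):
    rng, counter = random.Random(seed), [0]
    forest = [grow(samples, rng, depth, counter) for _ in range(num_trees)]
    return forest, counter[0]  # counter[0] is the total number of words

def forest_histogram(descriptors, forest, num_words):
    """Each descriptor votes for one leaf (word) in every tree."""
    histogram = [0] * num_words
    for d in descriptors:
        for tree in forest:
            histogram[leaf_of(d, tree)] += 1
    return histogram

# Made-up labelled descriptors from two action classes.
labelled = [([0.0, 0.1], "dig"), ([0.2, 0.0], "dig"), ([0.1, 0.3], "dig"),
            ([5.0, 5.1], "kick"), ([5.2, 4.9], "kick"), ([4.9, 5.3], "kick")]
forest, num_words = build_forest(labelled)
clip_histogram = forest_histogram([d for d, _ in labelled], forest, num_words)
```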
Random Forests (II)
“Visual Bits” (I)
Both k-means and random forests
- Treat all the features equally when generating the codebooks
- Use hard assignment (each feature can only be assigned to one “word”)
“Visual Bits”
- Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008
- Trains a visual codebook for each category, so it can overcome the shortcomings of “hard assignment” of the features
- Integrates classification and codebook generation, so it is able to select the relevant features by weighting them
“Visual Bits” (II)
Classifiers
Kernel SVM
- Histogram Intersection
- Chi-square distance
Multiple kernels
- Fuse different types of features
- Fuse different distance metrics
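One simple way to fuse feature types at the kernel level is a weighted sum of per-feature kernels, sketched below with histogram intersection on a motion histogram and a static histogram. The fixed weights, the dictionary layout of a video, and the function names are assumptions; the slides do not say how the kernels are combined or how weights would be chosen.

```python
def histogram_intersection(h_a, h_b):
    """Sum over bins of the smaller of the two bin values."""
    return sum(min(a, b) for a, b in zip(h_a, h_b))

def fused_kernel(video_a, video_b, weights):
    """Weighted sum of one histogram-intersection kernel per feature type."""
    return sum(weight * histogram_intersection(video_a[name], video_b[name])
               for name, weight in weights.items())

def gram_matrix(videos, weights):
    """Kernel value for every pair of videos."""
    return [[fused_kernel(a, b, weights) for b in videos] for a in videos]

# Made-up normalised histograms for three videos and two feature types.
videos = [{"motion": [0.75, 0.25], "static": [0.5, 0.5]},
          {"motion": [0.25, 0.75], "static": [0.5, 0.5]},
          {"motion": [0.5, 0.5], "static": [0.25, 0.75]}]
gram = gram_matrix(videos, weights={"motion": 0.5, "static": 0.5})
```

A matrix like `gram` can be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.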
The end… Thank you!