
Unsupervised Learning of Human Actions Using Spatial-Temporal Words

Juan Carlos Niebles¹,², Hongcheng Wang¹, Li Fei-Fei¹
¹University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
²Universidad del Norte, Barranquilla, Colombia

References:
[1] J.C. Niebles, H. Wang, L. Fei-Fei. Unsupervised Learning of Human Actions Using Spatial-Temporal Words. Submitted, 2006.
[2] T. Hofmann. Probabilistic Latent Semantic Indexing. In SIGIR, 1999.

Feature representation
Feature description:
• Histogram of brightness gradients
Obtaining codebook:
• K-means clustering of video word descriptors
Representation:
• Histogram of video words from the codebook
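A minimal sketch of these two steps, assuming each interest point already comes with a flattened gradient-histogram descriptor (one row per point); the function names and codebook size below are illustrative, not taken from the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=500):
    """Cluster descriptors with K-means; the centroids act as video words.
    `n_words` is an assumed codebook size, not the paper's setting."""
    return KMeans(n_clusters=n_words, n_init=10).fit(descriptors)

def bag_of_words(codebook, descriptors):
    """Histogram of video-word assignments for one video sequence."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters)
```

Each sequence is then summarized by a single fixed-length histogram, regardless of its duration or the number of detected interest points.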

Summary
Problem statement: identifying and localizing different human actions in video sequences with moving backgrounds and a moving camera.
Contributions:
• Unsupervised learning of actions using a "bag of video words" representation
• Multiple-action localization and categorization in a single video
• Best reported performance on a standard dataset

Training data
• KTH human motion data (6 classes)
• SFU figure skating data (3 classes)

Algorithm
Learning: input video sequences → feature extraction and description → form codebook → represent each sequence as a bag of video words → learn a model for each class (Model 1, …, Model N).
Recognition: new sequence → feature extraction and description → represent it as a bag of video words → decide on the best model among the learned class models (Class 1, …, Class N).

Feature extraction
• Separable linear filters (2D Gaussian + 1D Gabor filters)
• A small video cube is extracted around each interest point
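A hedged sketch of this detector (the separable-filter response of Dollár et al., which the paper adopts): 2D Gaussian smoothing in space combined with a quadrature pair of 1D Gabor filters in time; interest points are local maxima of the response R. The parameter values here are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def response(video, sigma=2.0, tau=1.5):
    """video: (T, H, W) grayscale array. Returns the response volume R."""
    t = np.arange(-2 * tau, 2 * tau + 1)
    omega = 4.0 / tau  # common choice tying the Gabor frequency to tau
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    # 2D Gaussian smoothing on each frame (axes 1 and 2 are H and W).
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    # Quadrature pair of 1D temporal Gabor filters along axis 0 (time).
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2
```

A small spatio-temporal cube is then cut out around each local maximum of R and described by its brightness gradients.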

Long and complex videos

3-class model: Results on a long video sequence

6-class model: Results on a complex natural video

Exp II: 3 action classes of the figure skating dataset
Action classes: stand-spin, camel-spin, sit-spin

Only the words assigned to the most likely action category are shown.

Classification

Given a new video and the learned model, we can classify the video as belonging to one of the action categories.

$$P(w \mid d_{\text{test}}) = \sum_{k=1}^{K} P(z_k \mid d_{\text{test}})\, P(w \mid z_k)$$

$$\text{action category} = \arg\max_k \, P(z_k \mid d_{\text{test}})$$
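A minimal sketch of this classification step via pLSA "fold-in": the learned word-topic matrix $P(w \mid z)$ is kept fixed while $P(z \mid d_{\text{test}})$ is fitted to the test histogram by EM, and the category with the largest weight wins. The matrix `p_w_z` and all names are illustrative:

```python
import numpy as np

def classify(hist, p_w_z, n_iter=50):
    """hist: (W,) word counts of the test video; p_w_z: (W, K) learned P(w|z).
    Returns argmax_k P(z_k | d_test)."""
    n_words, n_topics = p_w_z.shape
    p_z_d = np.full(n_topics, 1.0 / n_topics)  # uniform init of P(z|d_test)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d).
        joint = p_w_z * p_z_d                                  # (W, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(z|d_test) from expected word counts.
        p_z_d = (hist[:, None] * post).sum(axis=0)
        p_z_d /= p_z_d.sum()
    return int(np.argmax(p_z_d))
```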

Localization

Given a new video, we can localize multiple motions:

1. Select topics with high $P(z_k \mid d_{\text{test}})$
2. Assign words to topics according to $P(w \mid z_k)$
3. Cluster words from the selected topics according to their spatial position
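A sketch of these three steps under stated assumptions: each detected word carries an (x, y) position, words are assigned to their most likely selected topic, and plain K-means stands in for the spatial grouping; the threshold and group count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def localize(word_ids, positions, p_z_d, p_w_z, topic_thresh=0.2, n_groups=2):
    """word_ids: (N,) codebook indices; positions: (N, 2) word locations."""
    selected = np.where(p_z_d > topic_thresh)[0]   # topics with high P(z|d_test)
    if selected.size == 0:
        return {}
    scores = p_w_z[word_ids][:, selected]          # P(w|z) per word, selected topics
    topic_of_word = selected[np.argmax(scores, axis=1)]
    centers = {}
    for z in selected:
        pts = positions[topic_of_word == z]
        if len(pts) >= n_groups:                   # cluster words spatially per topic
            centers[z] = KMeans(n_clusters=n_groups).fit(pts).cluster_centers_
    return centers
```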

Exp I: 6 action classes of the KTH dataset

Testing with the Caltech dataset

Performance:

Action classes: walking, running, jogging, boxing, handclapping, handwaving

Words are colored according to their most likely action category.

Only the words assigned to the most likely action category are shown.

Method          Recognition accuracy (%)
Ke et al.       62.96
Schuldt et al.  71.72
Dollar et al.   81.17
Our method      81.50

• Best performance
• Multiple actions
• Unlabeled training

Model

We deploy a pLSA (probabilistic Latent Semantic Analysis) model [2] for video analysis.

K = number of action categories

$$P(d_i, w_j) = P(d_i)\, P(w_j \mid d_i)$$

$$P(w_j \mid d_i) = \sum_{k=1}^{K} P(z_k \mid d_i)\, P(w_j \mid z_k)$$

where $P(w_j \mid z_k)$ are the action category vectors and $P(z_k \mid d_i)$ are the action category weights.

d: input video, z: action category, w: video word
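A minimal pLSA training sketch for this model: given a word-by-document count matrix built from the training histograms, EM alternates between the posterior $P(z \mid d, w)$ and the parameters $P(w \mid z)$ and $P(z \mid d)$. This is a generic pLSA implementation written for illustration, not the authors' code:

```python
import numpy as np

def plsa(counts, K, n_iter=100, seed=0):
    """counts: (W, D) word counts over training videos.
    Returns P(w|z) of shape (W, K) and P(z|d) of shape (K, D)."""
    rng = np.random.default_rng(seed)
    W, D = counts.shape
    p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d) for every (w, d).
        post = p_w_z[:, :, None] * p_z_d[None, :, :]        # (W, K, D)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: accumulate expected counts n(w, d) * P(z|d,w).
        nz = counts[:, None, :] * post                      # (W, K, D)
        p_w_z = nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0)
        p_z_d = nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0)
    return p_w_z, p_z_d
```

Here K is set to the number of action categories, so each learned topic can be read off directly as an action class.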
