Beyond Actions: Discriminative Models for Contextual Group Activities Tian Lan School of Computing Science Simon Fraser University August 12, 2010 M.Sc. Thesis Defense


Beyond Actions: Discriminative Models for Contextual Group Activities

Tian Lan
School of Computing Science
Simon Fraser University

August 12, 2010

M.Sc. Thesis Defense

Outline

• Introduction
• Group Activity Recognition with Context
– Structure-level (latent structures)
– Feature-level (Action Context descriptor)
• Experiments

Activity Recognition

• Goal: enable computers to analyze and understand human behavior.

[Figure: example frames labeled "Answering a phone" and "Kissing"]

Action vs. Activity

• Activity: a group of people forming a queue
• Action: standing in a queue and facing left

Activity Recognition

• Activity recognition is important: surveillance, entertainment, sport, HCI.
• Activity recognition is difficult: intra-class variation, background clutter, partial occlusion, etc.

Group Activity Recognition

• Motivation: human actions are rarely performed in isolation; the actions of individuals in a group can serve as context for each other.
• Goal: explore the benefit of contextual information in group activity recognition in challenging real-world applications.

Group Activity Recognition

Context

Group Activity Recognition

• Two types of context:
– group-person interaction
– person-person interaction

[Figure: example scene labeled "Talk"]

Latent Structured Model

[Figure: layered model. Activity layer: activity class y. Hidden layer: action classes h1, h2, ..., hn. Feature layer: x1, x2, ..., xn and x0, extracted from the image.]

Latent Structured Model

• group-person interaction
• person-person interaction
– Structure-level
– Feature-level

Difference from Previous Work

• Group Activity Recognition

Previous work
• Single-person action recognition (Schuldt et al., ICPR 04)
• Relatively simple activity recognition (Vaswani et al., CVPR 03)
• Datasets in controlled conditions

Our work
• Group activity recognition in realistic videos
• Two new types of contextual information
• A unified framework

Difference from Previous Work

• Latent Structured Models

Our work
• A latent structure for the hidden layer, inferred automatically during learning and inference

Previous work
• A pre-defined structure for the hidden layer, e.g. a tree (HCRF) (Quattoni et al., PAMI 07; Felzenszwalb et al., CVPR 08)

Outline

• Introduction
• Group Activity Recognition with Context
– Structure-level (latent structures)
– Feature-level (Action Context descriptor)
• Experiments

Structure-level Approach

[Figure: latent structured model with activity class y, action classes h1, h2, ..., hn, and image features x1, x2, ..., xn and x0]

person-person interaction:
– Structure-level
– Feature-level

Structure-level Approach

• Latent Structure

[Figure: example scenes labeled "Queue?", "Talk", and "Talk"]

Model Formulation

[Figure: model graph over y, h1, h2, ..., hn, x1, x2, ..., xn, and x0]

Potentials:
• Image-Activity
• Image-Action
• Action-Activity
• Action-Action

Input: image-label pair (x, h, y)
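The four potentials named on this slide suggest a score that adds one term per potential. The block below is a sketch reconstructed from those names, not the slide's own equation: the weight vectors w_0 to w_3, the feature maps φ_0 to φ_3, and the edge set E of the hidden-layer graph G are assumed symbols, and the dependence of the action-action term on y is likewise an assumption.

```latex
\begin{align*}
w^\top \Psi(x, h, y; G) ={}& w_0^\top \phi_0(x_0, y) && \text{(image-activity)} \\
&+ \sum_{j=1}^{n} w_1^\top \phi_1(x_j, h_j) && \text{(image-action)} \\
&+ \sum_{j=1}^{n} w_2^\top \phi_2(h_j, y) && \text{(action-activity)} \\
&+ \sum_{(j,k) \in E} w_3^\top \phi_3(h_j, h_k, y) && \text{(action-action)}
\end{align*}
```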

Inference

• Score an image x with activity label y
• Infer the latent variables

NP-hard!
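In symbols, scoring an image and inferring the latent variables can be sketched as one joint maximization. Here F_w and Ψ are assumed names; h_y denotes the action labels and G_y the hidden-layer graph for activity y.

```latex
\begin{align*}
F_w(x, y) &= \max_{G_y} \max_{h_y} \; w^\top \Psi(x, h_y, y; G_y) \\
y^{*} &= \arg\max_{y} F_w(x, y)
\end{align*}
```

The joint search over graphs and label assignments is what makes exact inference intractable.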

Inference

• Holding G_y fixed: infer h_y with loopy BP
• Holding h_y fixed: infer G_y with an ILP
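The alternation on this slide can be illustrated with a toy sketch. It is not the thesis implementation: exhaustive search stands in for loopy BP, a greedy degree-capped edge picker stands in for the ILP, and the names `unary`, `pair`, and `max_degree` are hypothetical. The pairwise table is assumed symmetric.

```python
import itertools

def score(h, edges, unary, pair):
    """Sum of unary terms plus pairwise terms on the chosen edges."""
    total = sum(unary[j][h[j]] for j in range(len(h)))
    total += sum(pair[h[j]][h[k]] for j, k in edges)
    return total

def best_labels(edges, unary, pair):
    """Graph fixed: exhaustive label search (stand-in for loopy BP)."""
    n, n_labels = len(unary), len(unary[0])
    return max(itertools.product(range(n_labels), repeat=n),
               key=lambda h: score(h, edges, unary, pair))

def best_edges(h, pair, max_degree):
    """Labels fixed: greedy edge choice under a degree cap (stand-in for the ILP)."""
    n = len(h)
    candidates = sorted(((pair[h[j]][h[k]], j, k)
                         for j in range(n) for k in range(j + 1, n)),
                        reverse=True)
    degree, edges = [0] * n, []
    for weight, j, k in candidates:
        if weight > 0 and degree[j] < max_degree and degree[k] < max_degree:
            edges.append((j, k))
            degree[j] += 1
            degree[k] += 1
    return edges

def alternate(unary, pair, max_degree=1, iters=10):
    """Alternate the two steps until neither the labels nor the graph change."""
    edges = []
    h = best_labels(edges, unary, pair)
    for _ in range(iters):
        new_edges = best_edges(h, pair, max_degree)
        new_h = best_labels(new_edges, unary, pair)
        if new_edges == edges and new_h == h:
            break
        edges, h = new_edges, new_h
    return h, edges

# Toy problem: 3 people, 2 action labels (all numbers are made up).
unary = [[1.0, 0.0], [0.0, 0.2], [0.6, 0.5]]
pair = [[0.5, -1.0], [-1.0, 0.8]]
h, edges = alternate(unary, pair)
print(h, edges)  # (0, 1, 0) [(0, 2)]
```

The exhaustive step only works for a handful of people; it is used here so the alternation itself stays easy to follow.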

Learning with Latent SVM

Optimization: Non-convex bundle method (Do & Artieres, ICML 09)
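For orientation, a latent SVM objective of the usual form is sketched below. It is an assumption rather than the slide's equation: C is a trade-off constant, ξ_n are slack variables, Δ is a loss between activity labels, and the action labels h^n are taken as observed in training, so only the graph G is latent for the ground-truth term.

```latex
\begin{align*}
\min_{w,\; \xi \ge 0} \quad & \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{n} \xi_n \\
\text{s.t.} \quad & \max_{G} w^\top \Psi(x^n, h^n, y^n; G)
 - \max_{G, h} w^\top \Psi(x^n, h, y; G) \;\ge\; \Delta(y, y^n) - \xi_n
 \quad \forall n,\ \forall y
\end{align*}
```

The first max makes the constraint non-convex in w, which is why a non-convex optimizer is needed.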

Feature-level Approach

[Figure: latent structured model with activity class y, action classes h1, h2, ..., hn, and image features x1, x2, ..., xn and x0]

person-person interaction:
– Structure-level
– Feature-level

Feature-level Approach

• Model

[Figure: model graph over y, h1, h2, ..., hn, x1, x2, ..., xn, and x0, annotated with "Action Context Descriptor"]

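Action Context Descriptor

[Figure, panels (a) to (c): a focal person and its context, with labels τ and z; the descriptor combines the action scores of the focal person with those of the context]

Action Context Descriptor pipeline:
• Feature Descriptor (e.g. HOG by Dalal & Triggs)
• Multi-class SVM
• Action class scores for each person
• max over the action class scores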
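The pooling step can be sketched as follows, assuming each person already has a vector of per-action SVM scores and each context region is given as a list of person indices. The function name, the region layout, and the zero fill for empty regions are all assumptions made for illustration.

```python
def action_context_descriptor(scores, regions, focal):
    """Concatenate the focal person's action scores with max-pooled
    action scores from each context region.

    scores:  list of per-person lists, one score per action class
    regions: list of context regions, each a list of person indices
    focal:   index of the focal person
    """
    n_actions = len(scores[focal])
    descriptor = list(scores[focal])
    for members in regions:
        if members:
            # Max over the people in this region, one value per action class.
            pooled = [max(scores[p][a] for p in members)
                      for a in range(n_actions)]
        else:
            # Assumption: an empty region contributes zeros.
            pooled = [0.0] * n_actions
        descriptor.extend(pooled)
    return descriptor

# Toy example: 3 people, 2 action classes, 2 context regions (made-up numbers).
scores = [[0.9, 0.1], [0.2, 0.7], [0.4, 0.3]]
descriptor = action_context_descriptor(scores, regions=[[1, 2], []], focal=0)
print(descriptor)  # [0.9, 0.1, 0.4, 0.7, 0.0, 0.0]
```

The descriptor length is fixed at (1 + number of regions) × number of action classes, regardless of how many people appear in the scene.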

Outline

• Introduction
• Group Activity Recognition with Context
– Structure-level (latent structures)
– Feature-level (Action Context descriptor)
• Experiments

Dataset

• Collective Activity Dataset (Choi et al. VS 09)

• 5 action categories: crossing, waiting, queuing, walking, talking. (per person)

• 44 video clips

Collective Activity Dataset

Dataset

• Nursing Home Dataset
• 2 activity categories: fall, non-fall (per image)
• 5 action categories: walking, standing, sitting, bending, and falling (per person)
• 22 video clips in total (2990 frames): 8 clips for testing, the rest for training; 1/3 are labeled as fall

Nursing Home Dataset

Baselines

• root (x0) + SVM (no structure)
• No connection
• Min-spanning tree
• Complete graph within r

[Figure: hidden-layer structures over h1, h2, h3, h4 for the baselines and for the structure-level approach]
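The three fixed hidden-layer structures used as baselines can be sketched as edge sets over person positions. Treating each person as a 2D point, using Euclidean distance, and including pairs at distance exactly r are assumptions; the slide does not specify them.

```python
import math

def no_connection(points):
    """Baseline with no edges between people."""
    return set()

def complete_within_radius(points, r):
    """Connect every pair of people whose distance is at most r."""
    n = len(points)
    return {(j, k) for j in range(n) for k in range(j + 1, n)
            if math.dist(points[j], points[k]) <= r}

def min_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph."""
    n = len(points)
    if n == 0:
        return set()
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        # Cheapest edge from the tree to a node not yet in it.
        _, j, k = min((math.dist(points[j], points[k]), j, k)
                      for j in in_tree for k in range(n) if k not in in_tree)
        edges.add((min(j, k), max(j, k)))
        in_tree.add(k)
    return edges

# Toy layout: two pairs of people standing far apart (made-up coordinates).
points = [(0, 0), (1, 0), (5, 0), (5, 1)]
print(sorted(complete_within_radius(points, 1.5)))  # [(0, 1), (2, 3)]
print(sorted(min_spanning_tree(points)))            # [(0, 1), (1, 2), (2, 3)]
```

Note how the spanning tree is forced to link the two distant pairs, while the radius graph leaves them separate.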

System Overview

Video → Person Detector → Person Descriptor → Model

Person Detector:
• Pedestrian detection by Felzenszwalb et al.
• Background subtraction

Person Descriptor:
• HOG by Dalal & Triggs
• LST by Loy et al., CVPR 09

Results – Collective Activity Dataset

Results – Correct Examples

Results – Incorrect Examples

[Figure: example frames labeled Crossing, Waiting, Walking, Talking, Queuing]

Results – Nursing Home Dataset

Results – Correct Examples

Results – Incorrect Examples

Conclusion

• A discriminative model for group activity recognition with context.

• Two new types of contextual information:
– group-person interaction
– person-person interaction
  • Structure-level: latent structure
  • Feature-level: Action Context descriptor

• Experimental results demonstrate the effectiveness of the proposed model

Future Work

• Modeling Complex Structures– Temporal dependencies among actions

• Contextual Feature Descriptors– How to encode discriminative context?

• Weakly supervised Learning– e.g. multiple instance learning for fall detection

Thank you!

Pairwise Weight

[Figure: pairwise weights between actions h_j and h_k for activity y]

Infer the graph structures

Results – Nursing Home Dataset
• 0/1 loss: optimizes overall accuracy

Results – Nursing Home Dataset
• New loss: optimizes mean per-class accuracy
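The difference between the two targets shows up on imbalanced data such as a fall/non-fall split. The sketch below computes both accuracies and a class-balanced loss; the slide does not give the form of the new loss, so the 1/(class size) weighting is an assumption, as are the function names.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of all examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, each class weighted equally."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / totals[c] for c in totals) / len(totals)

def zero_one_loss(y, y_true):
    """Every mistake costs 1, whatever the class."""
    return 0.0 if y == y_true else 1.0

def class_balanced_loss(y, y_true, class_counts):
    """Assumed form: a mistake costs 1 / (size of the true class)."""
    return 0.0 if y == y_true else 1.0 / class_counts[y_true]

# Toy labels with 1/3 falls; the classifier always predicts "non-fall".
y_true = ["fall"] * 3 + ["non-fall"] * 6
y_pred = ["non-fall"] * 9
print(overall_accuracy(y_true, y_pred))         # 6 of 9 correct
print(mean_per_class_accuracy(y_true, y_pred))  # 0.5
```

A classifier that ignores the rare class still gets two thirds of the examples right, but its mean per-class accuracy drops to chance, which is the behaviour a balanced loss penalizes.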

Person Detectors

• Collective Activity Dataset: Pedestrian Detector (Felzenszwalb et al., CVPR 08)
• Nursing Home Dataset: Video → Background Subtraction → Moving Regions

Person Descriptors

• Collective Activity Dataset: HOG
• Nursing Home Dataset: Local Spatial Temporal (LST) Descriptor (Loy et al., ICCV 09)

Results – Correct Examples

Results – Incorrect Examples

Results – Collective Activity Dataset

[Figure panels: Root+SVM, Structure-level, Feature-level]

Group Context Descriptor

[Figure: model graph over y, h1, h2, ..., hn, x1, x2, ..., xn, and x0]

Learning

• Training data consists of {xn,hn,yn}

[Figure panels: Structure-level, Feature-level, No connection]

Results – Nursing Home Dataset