gesture ronfard fribourg...1 machine learning methods in visual recognition of gesture and action...

41
Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010 1 Machine Learning Methods in Visual Machine Learning Methods in Visual Recognition of Gesture and Action Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes [email protected] Fribourg, Sumer School, June 21, 2010 Intro Intro Survey the most useful and promising techniques for real-life applications of face, gesture and full- body action recognition. Examine how those different classes of methods can be adapted to achieve invariance with respect to viewing directions and performing styles.

Upload: others

Post on 26-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

1

Machine Learning Methods in Visual Machine Learning Methods in Visual Recognition of Gesture and ActionRecognition of Gesture and Action

Rémi Ronfard

INRIA Rhone-Alpes

[email protected]

Fribourg, Sumer School, June 21, 2010

IntroIntro

Survey the most useful and promising techniques for real-life applications of face, gesture and full-body action recognition.

Examine how those different classes of methods can be adapted to achieve invariance with respect to viewing directions and performing styles.

Page 2: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

2

OUTLINEOUTLINE

• Part 1 - History and applications� Gesture recognition - a machine learning problem

◦ What is a gesture ? What does it look like ?

� Part 2 - Spatial structure of gestures◦ Face, hands and feet, full body

◦ Actor invariance

◦ View invariance

� Part 3 - Temporal structure of gestures◦ Templates, grammars and histograms

◦ Tracking, segmentation and recognition

Part 1 Part 1 -- HistoryHistory and applicationsand applications

� What is a gesture ?

� How can the computer use gesture ?

� How can the computer recognize gesture ?

Page 3: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

3

DifferentDifferent classes of classes of gesturesgestures

� Gestures are movements of the human body◦ With a communicative goal

◦ Or part of goal-directed action

� Action changes the world

� Gesture communicates

� Visual analysis of gesture and action◦ Detect and track human body parts

◦ Recognize their gestures and actions

DifferentDifferent classes of classes of GesturesGestures

� Gestures as poses

◦ Counting with fingers

◦ Making faces

◦ Pointing

� Gestures as movements

◦ Yes, no

◦ Come here, go away

◦ From here to there

� Single body part

◦ Hand gestures

◦ Facial expressions

� Multiple body parts

◦ Hand gestures

◦ Clapping hands

◦ Scratching head

◦ Blowing a kiss

◦ Are you crazy ?

Page 4: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

4

ExamplesExamples

� Source: Real-time Detectionand Interpretationof 3D DeicticGestures

� J. Richarz, T. Plötz and G. A. Fink, ICPR 2008.

GestureGesture MeasurementMeasurement

� Visual measurements◦ Markers

◦ Local features

◦ Body parts

◦ Shape

◦ Color and texture

◦ Motion

� From one or more cameras (binocular, trinocular and multiviewstereo)

� Physical measurements◦ Accelerometers

◦ Gravitometers

◦ RFID tags

◦ Laser scanners

◦ Radars, Lidars

◦ Time-Of-Flight (TOF) cameras

Page 5: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

5

Applications Applications ––readingreading signsign languagelanguage

� Source: Buehler et al. Oxford University, 2010

GestureGesture--basedbased remoteremote controlscontrols

� First proposed by Bill Freeman

� Now using Time-of-Flight Camera, allowING people to turn up the volume by moving their hand in a circle, switch the channel by swiping to the right, pause by extending their hands in a “stop” gesture, and so on.

� Source: Softkinetic-Optrima

Page 6: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

6

TeachingTeaching robots by robots by demonstrationdemonstration

� Gesturerecognition and imitation

� Visual gestureto motorcommands

MinorityMinority Report Report gesturegesture InterfaceInterface

� John Underkoffler helped Steven Spielberg design a futuristic human computer interface for a movie that takes place in the year 2054.

� Now introducing G-Speakby Oblong

Page 7: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

7

GestureGesture generationgeneration by imitationby imitation GestureGesture generationgeneration by imitationby imitation� Gesture

modeling and animation, ACM TOG 2008

� Michael Neff, U. California, Davis

� Michael Kipp, DFKI

� Irene Albrechand Hans-PeterSeidel, MPI Informatik

Page 8: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

8

OtherOther sensorssensors and applicationsand applications

� Gesture-Based Interface to Windows by Andrew Wilson and NuriaOliver, Microsoft Research

� Microsoft Kinect (NATAL)

OtherOther sensorssensors

� Sony Eyetoy

� Playstation Move

� Shadow SDK by OmekInteractive

� Etc.

Page 9: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

9

BinocularBinocular and and trinoculartrinocular stereostereo

PANASONIC 3D CAMERA COMPUTER THEATRE at MIT

OtherOther sensorssensors -- BIDI SCREEN BIDI SCREEN atat MIT MIT MEDIA LABMEDIA LAB

Page 10: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

10

Speech vs Speech vs gesturegesture

� Phonemes

� Phonetics◦ Mel-Cepstrum

Coefficients

◦ Speaker invariance

� Language models◦ Phonemes to words

◦ Word to word statistics

� Visemes and Kinemes◦ Facial Action Units

◦ Body Action Units

� Visual representations◦ Actor invariance

◦ View-invariance

◦ Temporal resolution

� Multiple contexts◦ Gesture lexicons

◦ Gesture profiles

gesturegesture in dialogue (in dialogue (mcneilmcneil, 1992), 1992)

� Communicative

◦ EMBLEM

� Yes, no, thumbs up

◦ DEICTIC

� pointing

◦ ICONIC/METAPHORIC

� Representing

◦ BEAT

� Stress and rythm

� Non-communicative

◦ ADAPTOR

� TOUCH SELF

� TOUCH OBJECT

� TOUCH OTHER

◦ GOAL-DIRECTED

� Raise arm to reach object

� Scratch head to ease pain

� Chase a mosquito

◦ UNVOLONTARY

� Embarassment

Page 11: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

11

Body Body languagelanguage

� Hand and Mind. What Gestures Reveal about thought, David McNeill, 1992.

� Gesture: Visible Action as Utterance, Adam Kendon, 2004.

� Gesture and Thought, David McNeill, 2007.

Tracking and recognitionTracking and recognition

� Source: Green and Guang, Quantifying and recognizing human movement patterns form monocular video images, IEEE Trans. on Circuits and Systems for Video Technology.

Page 12: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

12

GestureGesture and action recognitionand action recognition KTH FULL BODY GESTURE DATASETKTH FULL BODY GESTURE DATASET

Page 13: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

13

KECK DATASET (KECK DATASET (UnivUniv MARYLAND)MARYLAND)

Source: Lin, Jiang, Davis, Recognizing Actions by Shape-Motion Prototype Trees, ICCV, 2009.

INRIA IXMAS FullINRIA IXMAS Full--body body gesturegesture

datasetdataset

� 11 actions by 10 actors

� 3different poses

� 5 cameras

Page 14: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

14

DifferentDifferent classes of classes of problemsproblems

� Recognize gesture, givenbody part trajectories(easiest, but body part tracking is hard)

◦ Faces

◦ Hands

◦ Feet

� Recognize gesture, given human body trajectory (humantracking still hard)

� Recognize gesture, givenimage sequence(hardest)

DifferentDifferent classes of classes of problemsproblems

� Single actor

◦ Train on actor

◦ Test on actor

� Single view

◦ Train in one view

◦ Test on other view

� Multiple actors

◦ Train on one actor

◦ Test on other actor

� Multiple views

◦ Train in one view

◦ Test on other view

Page 15: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

15

Machine Machine learninglearning approachesapproaches

� TRAINING◦ Define gestures from

training examples

◦ One or more actors

◦ One or more views

◦ Given body parts

◦ Given human tracking

◦ Given sequence

� RECOGNITION◦ Recognize gestures from

test examples

◦ Same or different actors

◦ Same or different views

◦ Same modalities as in training

� EVALUATION◦ Precision and recall

◦ Generalization

The recognition The recognition problemproblem

� Variations

1. Given a test detectionbox for an actor, isactor performinggesture or action ?

2. Given a test detectionboxes for a body part, is body part performing gesture or action?

� Task - Given a test videoclip, does it contain gestureor action ?

1. Learn models with training set of labeled examples.

2. Evaluate precision and recall with testing set of labeled examples.

� Many approaches for encoding spatial and temporal structure of action

Page 16: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

16

The segmentation The segmentation problemproblem

� Given video clip of gesture or action, whereand when is it happening ?

� Given video clip, how many different gesturesand actions ? Where and when are theyhappening ?

� Fewer references◦ Motion maxima (Marr and Vaina)

◦ Sliding window and Dynamic Time-Warping

◦ Semi-Markov Models (HMMs and CRFs) provide a segmentation as a result of recognition

The The viewpointviewpoint recognition recognition problemproblem� Recognize gestures and

actions from differentviewpoints◦ How many training

viewpoints ?◦ How many testing

viewpoints ?

� Few approches◦ Exhaustive search◦ View-invariant features

(Perez)◦ Transformed Grammars

(Frey & Jojic)

� Few databases – CMU, IXMAS

Page 17: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

17

SpatiotemporalSpatiotemporal featuresfeatures

� Image histograms

◦ Space Bag of Words

◦ Color and texture (SIFT)

◦ Oriented Gradient (HOG)

� Image templates

◦ Face

◦ Hand

◦ Body

� Recognition by parts

◦ Pictorial structures

� Temporal histograms

◦ Time Bag of Words

◦ Oriented Flow (HOF)

◦ Spatio-temporal InterestPoints (STIPS)

� Temporal templates

◦ Motion History Images

� Recognition by parts

◦ Markov States and Transitions (HMM)

◦ Grammars

TaxonomyTaxonomy of of learninglearning methodsmethods

Space

Time

Body model Image model Spatial Bag-of-words

Template Body template Image template Bag of trajectories

Grammar Body grammar Image grammar Bag grammar

Temporal Bag-of-words

Bag of body parts Bag of keyframes Bag of events

Page 18: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

18

Part 2 Part 2 -- Spatial structure of Spatial structure of gesturegesture

� Body coordinates = Body models

� Image coordinates = Image models

� Unstructured space = Histograms

◦ spatial bag-of-words (BOW)

Body Body modelsmodels

� Detection and/or tracking of body parts

� Labeling of body parts (and actors)

� Movement and pose of body parts

� Pros: gestures are naturally defined in terms of body parts

� Cons: recognition cannot recover from tracking or detection errors

Page 19: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

19

Hand trackingHand tracking

� Color-to-Finger indexing

� Wang and Popovic, MIT Media Lab, 2010

UpperUpper body pose estimationbody pose estimation

� Source: Buehler, Oxford Univ. 2010

Page 20: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

20

Facial expressionsFacial expressions

Active Appearance Models◦ Geometric deformations

◦ Changes in appearance

� Tracking and recognition

� Requires training per actor

Source: Theobalt et al. Real-time Expression Cloning using Appearance Models. Multimodal Interfaces, 2007.

64 Facial action 64 Facial action unitsunits

Source: Cohn, Ambadar, Eckman, 2005.

Page 21: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

21

Full body Full body gesturegesture Full body Full body gesturegesture

Page 22: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

22

Image Image modelsmodels

� Do not assume body parts can be recognized

� Learn global model of image appearance and motion

� Pros: Gestures can be learned fromunsegmented examples; no body part labelingnecessary; fast.

� Cons: Requires multiple models for differentviewpoints, distances and body sizes.

Image Image DescriptorsDescriptors

� Silhouettes◦ Background-subtraction

◦ Image difference

� Colors and textures◦ SIFT images (dim=128)

� Oriented Gradients and Flows◦ Dense optical flow

◦ Image of HOG-HOF coefficients (dim=16)

� Self-similarity

Page 23: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

23

Image Image modelsmodels for full body for full body gesturegesture Optical flowOptical flow

Source: Efros et al. ICCV 2005

Page 24: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

24

RecognizingRecognizing hand hand movementsmovements

� A Method for Temporal Hand Gesture Recognition

� Joshua R. New

� Knowledge Systems Laboratory

� Jacksonville State University

Hand Hand gesturesgestures withwith image momentsimage moments

� Source: Freeman et al. Computer Vision for Interactive Computer Graphics. IEEE CGA, 1998.

Page 25: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

25

Orientation Orientation histogramshistograms Multiple body partsMultiple body parts

� Build templates of Gradient Orientations (HOG)

� Source: Dalal & Triggs, CVPR 2005

Page 26: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

26

HogHog--Hof Hof templatestemplates

Source: Dalal, 2005

Spatial «Spatial « BagBag--ofof--WordWord »» modelsmodels

� Discretize image descriptors into a finitelexicon of « visual words »

� Compute frequency of visual words in entireimage

� Object recognition: SIFT ◦ 128-dimensional descriptor◦ At « interest points » or regular grid

� Gesture recognition: HOG-HOF◦ 32-dimensional descriptor◦ At « spatio-temporal interest points » or regular

grid

Page 27: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

27

SpatioSpatio--temporaltemporal interestinterest pointspoints

� Source: Laptev, IJCV 2005

SpatioSpatio--temporaltemporal interestinterest pointspoints

� Source: Laptev, IJCV, 2005

Page 28: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

28

Part 3 Part 3 -- Temporal structure of Temporal structure of gesturegesture

Global time-slice = GestureTemplates

Moments in time = Gesture Grammars

Unstructured time = Gesture Histograms(Temporal Bag-Of-Words)

Temporal structure of Temporal structure of gesturegesture

� Source: Kipp et al. Behavior Markup Language

Page 29: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

29

Temporal structure of Temporal structure of gesturegesture

� Source: Kipp. Gesture Generation by Imitation

GestureGesture TemplateTemplatess

� Body is an N-dimensional articulated structure◦ recognize trajectory in N dimensions

� Template matching in N dimensions is hard !� Dimension reduction and/or simplifications◦ Head trajectory◦ Hand trajectories◦ Head pose

� Segmentation requires sliding window(convolution)

� Multiple templates for speed, duration and intensity

Page 30: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

30

Image model + Image model + GestureGesture TemplateTemplate

� Gesture is defined as a « spatio-temporalobject »

◦ Image x time

◦ Space x time

� Examples:

◦ Motion History Image

◦ Motion History Volume

Motion Motion historyhistory imageimage

Representation and Recognition of Action Using Temporal Templates, Bobick and Davis, PAMI 2001

Page 31: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

31

Motion Motion HistoryHistory VVolumesolumes

Free Viewpoint Action Recognition using Motion History Volumes. Weinland, Ronfard, Boyer, CVIU 2006

Action Recognition in 3D using voxels

Shape fromSilhouettes+ Motion HistoryImages

Recognition Recognition resultsresults

Dimension reduction (LDA) produces best results with 10 MHV coefficients

Average precision94%

Page 32: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

32

Action segmentationAction segmentation

Maxima of Motion Energy from MHV

� Used for training IXMAS classes

� Used for unsupervised learning(clustering) of novelaction classes

Automatic Discovery of Action Taxonomies from Multiple Views.Weinland, Ronfard, Boyer, CVPR 2006

Spatial BagSpatial Bag--ofof--WordsWords + + GestureGesture

TemplateTemplate

� Gesture as a bag of trajectories (bag of templates)◦ KLT features (« good features to track »)

◦ Compute lexicon of KLT trajectories� based on shape, velocity, duration, etc.

◦ Learn frequency of « trajectons » per gesture

� Pal and Kautz, 2010

� Also possible - gesture as a trajectory in the space of « visual word » frequencies (templateof bags)

Page 33: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

33

GestureGesture GrammarsGrammars

� Grammar decomposes gesture into« states» and states into observations

� Generative HMM ◦ P(body part motion and pose|state)

◦ P(state at t|state at t-1)

� Discriminative CRF� P(state|body part motion and pose)

� Probabilistic Context-free grammar◦ Repetition and grouping of states

Markov Markov modelsmodels

� Probabilistic Finite State Machines

� First-order Hidden Markov Models

� Segment Models

Page 34: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

34

ConditionalConditional RandomRandom FieldsFields

� Source: Hidden Conditional Random Fields for Gesture Recognition by Wang, Quattoni, Morency, Demirdjian & Darrell, CVPR 2006

ConditionalConditional RandomRandom FieldsFields

FB - Flip Back, SV - Shrink Vertically, EV - Expand Vertically, DB -Double Back, PB - Point and Back, EH – Expand Horizontally.

Green arrows are the motion trajectory of the fingertip and the numbers next to the arrows symbolize the order of these arrows.

Page 35: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

35

Image model + Image model + GrammarGrammar

� Grammar decomposes gesture into « states» and states into observations

� HMM is a probabilistic state machine◦ P(image motion and appearance|state)

◦ P(state at t|state at t-1)

� CRF conditioned on image observations� P(state|image motion and appearance)

� Controled settings

� Constant viewpoint

Learning of action Learning of action exemplarsexemplars

Goal – learn actions in 3D and recognize in 2D

Problems

� MHVs cannot be projected to novel views

� IXMAS actions not all simple

Solutions

� Metric mixture of 3D exemplars (Toyama & Blake,

2001)

� Transformed HMM for action & actor pose relative to camera (Frey & Jojic, 2001)

Action Recognition from Arbitrary Views using 3D Exemplars , Weinland, Boyer, Ronfard , 2007

Page 36: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

36

Recognition Recognition withwith action action exemplarsexemplars Recognition Recognition withwith action action exemplarsexemplars

Average 2D recognition 60%

Average 3D recognition 90%

Best compromise with cameras 2+4

Outperformed by STIPS and self-similarity (Perez) on single-viewrecognition

Page 37: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

37

Temporal Temporal Bag of Bag of WordsWords

� Histogram of image features over time

◦ BODY MODELS

◦ IMAGE MODELS

◦ SPACE BAG OF WORDS

� Recognition based on distance betweenhistograms

◦ Learned histograms

◦ Observed histogram

Image model + Image model + Bag of Bag of WordsWords

� Gesture from a single or few « keyframes »

◦ Still frame

◦ Keyframe selection during learning and recognition

◦ Same as ergodic HMM (no temporal structure)

� Gesture from many frames

◦ Discretize video frames into lexicon of « visualwords »

◦ Compute word frequencies over time

◦ P(gesture|sequence) = H(words)

Page 38: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

38

Source: Mori Structured Action Recognition. 2007.

Body model + Body model + Bag of Bag of WordsWords

� Temporal structure is not imposed◦ Invariant to speed, order, sync

◦ Useful for sparse observations

◦ Requires body part detection, not tracking

� Examples◦ Strike a pose

◦ Intermediate poses may be missing

� Applications ◦ Interleaved actions

◦ Partial observations

Page 39: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

39

SpatiSpatial + Temporal al + Temporal BagBag--ofof––WordsWords

� Build a discrete lexicon of « motion » words◦ in space and in time

� Learn the frequencies of motion words per gesture◦ H(word|gesture)

� Recognize most likely gesture from observed wordfrequencies over entire video segment (space and time)◦ Pros: Robust to partial observations and occlusions;

invariant to sizes and durations ; very good precision in benchmarks; good generalization

◦ Cons: Blind to higher-order spatial and temporal structure; hard to tune to specific cases

Spatial BagSpatial Bag--ofof--WordsWords + + GrammarGrammar

� Each state in the grammar is a bag of visualwords

� Decompose gesture into « states» withprobabilities conditioned on word frequencies◦ P(visual words |state at t) = H(words)

◦ P(state at t|state at t-1)

� Distances between histograms◦ X2(H_observed,H_learned)

◦ Gaussian distributions

Page 40: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

40

CONCLUSIONCONCLUSION

Gestures and actions are spatio-temporal patterns with internal structure and high complexity. The choice of suitable representations is often the most crucial aspect.

In the spatial domain, actions and gestures can be represented with body models, with image models, or with bag of isolated features.

In the temporal domain, they can be represented with templates, with grammars, or with bags of isolated features. By combining the spatial and temporal aspects of gesture and action, one is faced with a vast number of possible combinations.

AdditionalAdditional referencesreferences

� Vision-Based Gesture Recognition: A Review. Wu, Huang, 1999.

� Perceptual Interfaces, Matthew Turk and Mathias Kölsch, U. California, Santa Barbara, 2003.

� Vision-based hand pose estimation: A review. Erol et al, CVIU 2007.

� Gesture Recognition: A Survey. Mitra, Achary, IEEE Trans. Systems, Mand and Cybernetics, 2007.

� A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition. Weinland, Ronfard, Boyer, INRIA, 2010.

Page 41: Gesture Ronfard Fribourg...1 Machine Learning Methods in Visual Recognition of Gesture and Action Rémi Ronfard INRIA Rhone-Alpes remi.ronfard@inria.fr Fribourg, Sumer School, June21,

Visual Recognition of Gesture and Action, Rémi Ronfard, INRIA 6/21/2010

41

ThankThank YouYou