Human Action Recognition by Learning Bases of Action Attributes and Parts

Page 1: Human Action Recognition by Learning Bases of Action Attributes and Parts

Human Action Recognition by Learning Bases of Action Attributes and Parts

Bangpeng Yao1, Xiaoye Jiang2, Aditya Khosla1, Andy Lai Lin3, Leonidas Guibas1, and Li Fei-Fei1

{bangpeng,aditya86,guibas,feifeili}@cs.stanford.edu

{xiaoye,ydna}@stanford.edu

1 Computer Science Department, Stanford University

2 Institute for Computational & Mathematical Engineering, Stanford University

3 Electrical Engineering Department, Stanford University

Page 2: Human Action Recognition by Learning Bases of Action Attributes and Parts

Action Classification in Still Images

Riding bike

• Directly using low-level features for classification:

- Grouplet (Yao & Fei-Fei, 2010)
- Multiple kernel learning (Koniusz et al., 2010)
- Spatial pyramid (Delaitre et al., 2010)
- Random forest (Yao et al., 2011)

Page 6: Human Action Recognition by Learning Bases of Action Attributes and Parts

Action Classification in Still Images

Riding bike

• Human actions are more than just a class label:

- Riding a bike
- Sitting on a bike seat
- Wearing a helmet
- Pedaling the pedals
- …

High-level concepts:

- Attributes (e.g. riding)
- Parts: objects and human poses
- Interactions of attributes & parts

Page 7: Human Action Recognition by Learning Bases of Action Attributes and Parts

Attributes & Parts for Classification

• Human actions are more than just a class label.

Riding bike: riding a bike, wearing a helmet, pedaling the pedals, sitting on bike seat

Attributes, objects, and human poses in visual recognition:

- Attributes: Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011; Liu et al., 2011
- Objects: Gupta et al., 2009; Yao & Fei-Fei, 2010; Torresani et al., 2010; Li et al., 2010
- Human poses: Yao & Fei-Fei, 2010; Yang et al., 2010; Maji et al., 2011

Page 8: Human Action Recognition by Learning Bases of Action Attributes and Parts

Benefits of the Attribute & Part Representation

• Incorporate more human knowledge;
• Produce more descriptive intermediate outputs (Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011);
• Allow more discriminative classifiers (Torresani et al., 2010; Li et al., 2010; Maji et al., 2011; Liu et al., 2011);
• Attributes and parts carry complementary information, and hence improve classification performance.

Page 9: Human Action Recognition by Learning Bases of Action Attributes and Parts

Challenges We Need to Address

• How to model attributes and parts (objects & poses)?
• How to model their interactions?
• How to eliminate noise or inconsistency in the data?
• How to use attributes and parts for recognition?

Image examples: unexpected object; errors in detection; object does not appear

Page 10: Human Action Recognition by Learning Bases of Action Attributes and Parts

Outline

• Attributes and Parts in Human Actions
• Learning Bases of Attributes and Parts (modeling the interactions)
• Dataset & Experiments
• Conclusion

Page 13: Human Action Recognition by Learning Bases of Action Attributes and Parts

Action Attributes

• Semantic descriptions of actions;
• Usually related to verbs;
• A discriminative classifier is trained for each attribute.

Example attributes: cycling, pedaling, writing, phoning, jumping

Page 14: Human Action Recognition by Learning Bases of Action Attributes and Parts

Action Parts – Objects and Poses

• Objects (Li et al., 2010)
• Human poses – poselets (Bourdev & Malik, 2010)
• For each part (object or poselet), we have a pre-trained detector, e.g. a bike detector.

Page 15: Human Action Recognition by Learning Bases of Action Attributes and Parts

Putting Attributes and Parts Together

[Diagram: an image is passed through attribute classification (cycling, pedaling, writing, phoning, …), object detection, and poselet detection. The resulting confidence scores (color scale from high to low) are fed to an SVM classifier.]
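The pipeline on this slide can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper name `build_score_vector` and the random scores are hypothetical, and the dimensions (14 attributes, 27 objects, 150 poselets) are the ones this deck reports for the PASCAL VOC 2010 experiments. The resulting vector is what would be handed to a classifier such as an SVM.

```python
import numpy as np

def build_score_vector(attribute_scores, object_scores, poselet_scores):
    """Concatenate per-image confidence scores into a single vector a."""
    parts = [np.asarray(s, dtype=float).ravel()
             for s in (attribute_scores, object_scores, poselet_scores)]
    return np.concatenate(parts)

# Hypothetical scores standing in for real classifier / detector outputs.
rng = np.random.default_rng(0)
a = build_score_vector(rng.normal(size=14),   # attribute classifiers
                       rng.normal(size=27),   # object detectors
                       rng.normal(size=150))  # poselet detectors
print(a.shape)
```

Keeping the three score groups in a fixed order matters, because each dimension of the vector must refer to the same attribute or part for every image.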

Page 16: Human Action Recognition by Learning Bases of Action Attributes and Parts

Challenges We Need to Address

• How to model attributes and parts (objects & poses)?
• How to model their interactions?
• How to eliminate noise or inconsistency in the data?
• How to use attributes and parts for recognition?

Image examples: unexpected object; errors in detection; object does not appear

Page 18: Human Action Recognition by Learning Bases of Action Attributes and Parts

Outline

• Attributes and Parts in Human Actions
• Learning Bases of Attributes and Parts (modeling the interactions)
• Dataset & Experiments
• Conclusion

Page 24: Human Action Recognition by Learning Bases of Action Attributes and Parts

Bases of Attributes & Parts: Motivation

[Diagram: for one image, the ideal vector a_ideal over attributes and parts (cycling, pedaling, writing, phoning, …) is shown next to the real vector a of confidence scores. The real vector a is reconstructed from action bases Φ_1, Φ_2, Φ_3, Φ_4, … (sparse) weighted by reconstruction coefficients w = (w_1, w_2, w_3, w_4, …) (sparse). Color scale from high to low.]

Page 25: Human Action Recognition by Learning Bases of Action Attributes and Parts

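Bases of Attributes & Parts: Training

Input: real vectors a_1, …, a_N (N images)

Output: action bases Φ = [Φ_1, …, Φ_M] (M bases, sparse) and reconstruction coefficients W = [w_1, …, w_N] (sparse)

\[
\min_{\Phi \in \mathcal{C},\; w_i \in \mathbb{R}^M} \; \sum_{i=1}^{N} \left( \frac{1}{2} \| a_i - \Phi w_i \|_2^2 + \lambda \| w_i \|_1 \right)
\quad \text{s.t.} \quad
\mathcal{C} = \left\{ \Phi : \forall j,\; \| \Phi_j \|_1 + \frac{\gamma}{2} \| \Phi_j \|_2^2 \le 1 \right\}
\]

where λ and γ are regularization weights.

- Squared-error term: accurate reconstruction
- L1 regularization on w_i: sparsity of W
- Elastic net constraint on Φ_j: sparsity of Φ [Zou & Hastie, 2005]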
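To make the roles of the training terms concrete, here is a minimal Python sketch that only evaluates them; it does not perform the optimization. It assumes the objective is the squared reconstruction error plus an L1 penalty on the coefficients, and that each basis must satisfy an elastic net constraint. The names `training_objective`, `bases_are_feasible`, `lam`, and `gamma`, as well as the toy numbers, are assumptions made for illustration.

```python
import numpy as np

def training_objective(A, Phi, W, lam):
    """Sum over images of 0.5 * ||a_i - Phi w_i||_2^2 + lam * ||w_i||_1.

    A:   (P, N) matrix whose columns are the score vectors a_i
    Phi: (P, M) matrix whose columns are the action bases
    W:   (M, N) matrix whose columns are the coefficients w_i
    """
    residual = A - Phi @ W
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(W))

def bases_are_feasible(Phi, gamma):
    """Check ||Phi_j||_1 + (gamma / 2) * ||Phi_j||_2^2 <= 1 for every column j."""
    l1 = np.sum(np.abs(Phi), axis=0)
    l2_sq = np.sum(Phi ** 2, axis=0)
    return bool(np.all(l1 + 0.5 * gamma * l2_sq <= 1.0 + 1e-12))

# Toy example with made-up numbers: P=3 score dimensions, M=2 bases, N=2 images.
Phi = np.array([[0.5, 0.0],
                [0.0, 0.5],
                [0.0, 0.0]])
W = np.array([[2.0, 0.0],
              [0.0, 4.0]])
A = Phi @ W  # perfectly reconstructed, so only the L1 penalty contributes
objective = training_objective(A, Phi, W, lam=0.1)
feasible = bases_are_feasible(Phi, gamma=1.0)
print(objective, feasible)
```

A full training loop would alternate between updating W with Φ fixed and updating Φ with W fixed; that part is omitted here.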

Page 26: Human Action Recognition by Learning Bases of Action Attributes and Parts

Bases of Attributes & Parts: Testing

Input: real vector a and action bases Φ = [Φ_1, …, Φ_M] (M bases, sparse)

Output: reconstruction coefficients w (sparse)

\[
\min_{w \in \mathbb{R}^M} \; \frac{1}{2} \| a - \Phi w \|_2^2 + \lambda \| w \|_1
\]

- Squared-error term: accurate reconstruction
- L1 regularization: sparsity of w
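The test-time problem is an L1-regularized least-squares problem, which can be solved by many standard methods. The sketch below uses ISTA (iterative soft-thresholding), chosen here only for brevity; the deck does not say which solver the authors used. The function names, the weight `lam`, and the toy inputs are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink each entry of x toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(a, Phi, lam, n_iter=500):
    """Approximately solve min_w 0.5 * ||a - Phi w||_2^2 + lam * ||w||_1 with ISTA."""
    # Step size 1 / L, where L = ||Phi||_2^2 bounds the curvature of the quadratic term.
    step = 1.0 / (np.linalg.norm(Phi, 2) ** 2)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - a)  # gradient of the reconstruction term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy check: with identity bases the solution is the soft-thresholded input.
Phi = np.eye(3)
a = np.array([1.0, -0.2, 0.05])
w = sparse_code(a, Phi, lam=0.1)
print(w)
```

Small entries of `a` are set exactly to zero, which is the sparsity effect the L1 term is meant to produce.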

Page 29: Human Action Recognition by Learning Bases of Action Attributes and Parts

Bases of Attributes & Parts: Benefits

[Diagram: the real vector a is reconstructed from action bases Φ_1, Φ_2, Φ_3, Φ_4, … (sparse) weighted by reconstruction coefficients w = (w_1, w_2, w_3, w_4, …) (sparse), shown next to the ideal vector a_ideal; the output goes to an SVM classifier. Color scale from high to low.]

• Co-occurrence context;
• Reduce noise;
• Improve performance.

Page 30: Human Action Recognition by Learning Bases of Action Attributes and Parts

Outline

• Attributes and Parts in Human Actions
• Learning Bases of Attributes and Parts (modeling the interactions)
• Datasets & Experiments
• Conclusion

Page 31: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

Slide credit: Ivan Laptev

• 9 classes, 50-100 training / testing images per class

Page 32: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

• Average precision (%)

Class               SURREY_MK  UCLEAR_DOSP  POSELETS  Ours Conf_Score
Phoning             52.6       47.0         49.6      49.5
Playing instrument  53.5       57.8         43.2      56.6
Reading             35.9       26.9         27.7      31.4
Riding bike         81.0       78.8         83.7      82.3
Riding horse        89.3       89.7         89.4      89.3
Running             86.5       87.3         85.6      87.0
Taking photo        32.8       32.5         31.0      36.1
Using computer      59.2       60.0         59.1      67.7
Walking             68.6       70.1         67.9      73.0

- SURREY_MK, UCLEAR_DOSP: Best results from the challenge;
- POSELETS: Results from Maji et al., 2011;
- Ours Conf_Score: Concatenating attribute classification and part detection scores:
  14 attributes – trained from the trainval images;
  27 objects – taken from Li et al., NIPS 2010;
  150 poselets – taken from Bourdev & Malik, ICCV 2009.

Page 34: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

• Average precision (%)

Class               SURREY_MK  UCLEAR_DOSP  POSELETS  Ours Conf_Score  Ours Sparse_Base
Phoning             52.6       47.0         49.6      49.5             42.8
Playing instrument  53.5       57.8         43.2      56.6             60.8
Reading             35.9       26.9         27.7      31.4             41.5
Riding bike         81.0       78.8         83.7      82.3             80.2
Riding horse        89.3       89.7         89.4      89.3             90.6
Running             86.5       87.3         85.6      87.0             87.8
Taking photo        32.8       32.5         31.0      36.1             41.4
Using computer      59.2       60.0         59.1      67.7             66.1
Walking             68.6       70.1         67.9      73.0             74.4

- Ours Sparse_Base: Using the reconstruction coefficients as the input of SVM classifiers.
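The two "Ours" rows can be compared with a few lines of Python. The per-class numbers below are copied from the table on this slide; the mean over the nine classes and the list of improved classes are computed here and are not figures stated on the slide.

```python
classes = ["Phoning", "Playing instrument", "Reading", "Riding bike",
           "Riding horse", "Running", "Taking photo", "Using computer", "Walking"]

# Average precision (%) per class, in the order of `classes`.
ap = {
    "Ours Conf_Score":  [49.5, 56.6, 31.4, 82.3, 89.3, 87.0, 36.1, 67.7, 73.0],
    "Ours Sparse_Base": [42.8, 60.8, 41.5, 80.2, 90.6, 87.8, 41.4, 66.1, 74.4],
}

# Mean AP over the nine classes for each variant.
mean_ap = {name: sum(values) / len(values) for name, values in ap.items()}

# Classes where the sparse-bases variant beats the raw confidence scores.
improved = [c for c, conf, sparse
            in zip(classes, ap["Ours Conf_Score"], ap["Ours Sparse_Base"])
            if sparse > conf]

for name, value in mean_ap.items():
    print(f"{name}: mean AP = {value:.1f}")
print("Improved classes:", improved)
```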

Page 35: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

• Average precision (%): results table as on Page 34.

• 400 action bases

[Figure: a learned action base shown over attributes, objects, and poselets; label: riding.]

Page 36: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

• Average precision (%): results table as on Page 34.

• 400 action bases

[Figure: a learned action base shown over attributes, objects, and poselets; labels: using, sitting.]

Page 37: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2010 Action Dataset

• Average precision (%): results table as on Page 34.

• 400 action bases

[Figure: a learned action base shown over attributes, objects, and poselets; label: phoning.]

Page 39: Human Action Recognition by Learning Bases of Action Attributes and Parts

PASCAL VOC 2011 Action Dataset

Class               Others' best in comp9  Others' best in comp10  Our method
Jumping             71.6                   59.5                    66.7
Phoning             50.7                   31.3                    41.1
Playing instrument  77.5                   45.6                    60.8
Reading             37.8                   27.8                    42.2
Riding bike         88.8                   84.4                    90.5
Riding horse        90.2                   88.3                    92.2
Running             87.9                   77.6                    86.2
Taking photo        25.7                   31.0                    28.8
Using computer      58.9                   47.4                    63.5
Walking             59.5                   57.6                    64.2

• Our method ranks first in nine out of ten classes in comp10;
• Our method achieves the best performance in five out of ten classes if we consider both comp9 and comp10.
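The two counts in the bullets can be checked directly against the table. The short Python sketch below copies the numbers from this slide and counts the classes where our method beats the best comp10 entry, and where it beats the best entry of both competitions; the variable names are chosen for illustration.

```python
# (others' best in comp9, others' best in comp10, our method), per class
results = {
    "Jumping":            (71.6, 59.5, 66.7),
    "Phoning":            (50.7, 31.3, 41.1),
    "Playing instrument": (77.5, 45.6, 60.8),
    "Reading":            (37.8, 27.8, 42.2),
    "Riding bike":        (88.8, 84.4, 90.5),
    "Riding horse":       (90.2, 88.3, 92.2),
    "Running":            (87.9, 77.6, 86.2),
    "Taking photo":       (25.7, 31.0, 28.8),
    "Using computer":     (58.9, 47.4, 63.5),
    "Walking":            (59.5, 57.6, 64.2),
}

# Classes where our method is ahead of the best comp10 competitor.
first_in_comp10 = [c for c, (_, comp10, ours) in results.items() if ours > comp10]

# Classes where our method is ahead of the best entry in both competitions.
best_overall = [c for c, (comp9, comp10, ours) in results.items()
                if ours > max(comp9, comp10)]

print(len(first_in_comp10), "of", len(results), "classes in comp10")
print(len(best_overall), "of", len(results), "classes overall:", best_overall)
```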

Page 40: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees,
Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping,
Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse,
Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer,
Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper

http://vision.stanford.edu/Datasets/40actions.html

Page 41: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

• 40 actions – the largest number of action classes.
• Opportunity to study the relationships between actions.

Example images: washing dishes, cutting vegetables, writing on a board, writing on paper, fixing a bike, riding a bike

http://vision.stanford.edu/Datasets/40actions.html

Page 42: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

• 40 actions – the largest number of action classes.
• Opportunity to study the relationships between actions.
• 9532 images from Google, Flickr – the largest action dataset.
• Large pose variation and background clutter.
• Bounding-box annotations of humans.
• Upper body visible, so it is possible to explore human poses.
• More annotations are coming ...

http://vision.stanford.edu/Datasets/40actions.html

Page 43: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

• We use 45 attributes, 81 objects, and 150 poselets.
• Compare our method with the Locality-constrained Linear Coding (LLC, Wang et al., CVPR 2010) baseline.

[Figure: bar chart of average precision (axis from 0 to 1) for each of the 40 action classes, LLC vs. Our Method; class labels run from "Texting message" to "Riding a horse".]

Page 44: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

[Figure: bar chart of average precision (axis from 0 to 1) for each of the 40 action classes, LLC vs. Our Method; class labels run from "Texting message" to "Riding a horse".]

Compare with PASCAL VOC 2011 results:

- Reading: 42.2
- Phoning: 41.1
- Taking photo: 28.8
- Riding horse: 92.2
- Riding bike: 90.5
- Running: 86.2
- Jumping: 66.7
- Using computer: 63.5

Page 45: Human Action Recognition by Learning Bases of Action Attributes and Parts

Stanford 40 Actions

[Figure: bar chart of average precision (axis from 0 to 1) for each of the 40 action classes, LLC vs. Our Method; class labels run from "Texting message" to "Riding a horse". Two groups of classes are annotated: "Poses are relatively consistent" and "Very large pose variation".]

Page 46: Human Action Recognition by Learning Bases of Action Attributes and Parts

Outline

• Attributes and Parts in Human Actions
• Learning Bases of Attributes and Parts (modeling the interactions)
• Dataset & Experiments
• Conclusion

Page 47: Human Action Recognition by Learning Bases of Action Attributes and Parts

Conclusion

[Diagram: the real vector a over attributes and parts (cycling, pedaling, writing, phoning, …) is reconstructed from sparse action bases Φ_1, Φ_2, Φ_3, Φ_4, … weighted by sparse reconstruction coefficients w = (w_1, w_2, w_3, w_4, …), shown next to the ideal vector a_ideal. Color scale from high to low.]

Page 48: Human Action Recognition by Learning Bases of Action Attributes and Parts

Acknowledgement