activity recognition - justin liangjustin-liang.com/talks/activity_recognition.pdf · agenda...

Activity Recognition
Justin Liang
March 27, 2016

Agenda
• End-to-end Learning of Action Detection from Frame Glimpses in Videos. S. Yeung, O. Russakovsky, G. Mori, L. Fei-Fei. CVPR 2016.
• Detecting Events and Key Actors in Multi-Person Videos. V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy and L. Fei-Fei. CVPR 2016.

What is Activity Recognition
• The idea is to detect what event occurs in a video
  • Ex. diving, successful layup, failed layup, successful slam dunk, blocking, setting, standing
• Different subdomains of activity recognition:
  • Individual activity recognition
  • Group activity recognition
  • Temporal activity recognition
[Ibrahim et al. CVPR 2016]

End-to-end Learning of Action Detection from Frame Glimpses in Videos
• Paper from Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei in CVPR 2016.
• Objective:
  • Predict actions and their temporal bounds: how long and where they occur in a video clip. Video clips used are untrimmed.
• Key Contributions:
  • End-to-end approach to action detection and temporal localization in videos
  • Train an agent policy to skip video frames to find where the actions are in the video
  • Show that this method can outperform state-of-the-art results

Approach
• Action detection is a process of observation and refinement. Effectively choosing a sequence of frame observations allows us to quickly narrow down when the baseball swing occurs.

Approach (Pipeline)
• $o_n$: observation feature vector
• $h_n$: internal hidden state
• $d_n$: candidate detection
  • $s_n$: action start
  • $e_n$: action end
  • $c_n$: action confidence level
• $p_n$: indicator to emit action
• $l_{n+1}$: location of next observation, $l_n \in [0, 1]$

Observation Network
• Both the location $l_n$ and the video frame $v_{l_n}$ are mapped to a hidden space and then combined with a fully connected layer to produce the observation vector $o_n$
• $v_{l_n}$ is mapped using the VGG16 network, and fc7 features are extracted from it

Recurrent Network
• Observation features $o_n$ and the previous internal hidden state $h_{n-1}$ are inputs to the recurrent network $f_h$, parameterized by $\theta_h$, which produces $h_n$
• Candidate detection $d_n$:
  • $d_n = f_d(h_n; \theta_d)$, where $f_d$ is a fully connected layer
• Prediction indicator $p_n$:
  • $p_n = f_p(h_n; \theta_p)$, where $f_p$ is a fully connected layer
  • During training, $f_p$ is used to parameterize a Bernoulli distribution from which $p_n$ is sampled. At test time, the MAP estimate is used.
• Location of next observation $l_{n+1}$:
  • $l_{n+1} = f_l(h_n; \theta_l)$, where $f_l$ is a fully connected layer
  • During training, $l_{n+1}$ is sampled from a Gaussian distribution with mean $f_l(h_n; \theta_l)$ and fixed variance. At test time, the MAP estimate is used.
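The difference between train-time sampling and test-time MAP estimates for the prediction indicator and the next location can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the standard deviation `sigma=0.1`, and the clipping of the location to [0, 1] are assumptions.

```python
import random


def emit_prediction(prob, training, rng=random):
    """Return the binary prediction indicator given a Bernoulli parameter."""
    if training:
        # Training: sample from Bernoulli(prob).
        return 1 if rng.random() < prob else 0
    # Test: MAP estimate of a Bernoulli is its more likely outcome.
    return 1 if prob >= 0.5 else 0


def next_location(mean, training, sigma=0.1, rng=random):
    """Return the next observation location in [0, 1]."""
    if training:
        # Training: sample from a Gaussian with fixed (assumed) std sigma.
        loc = rng.gauss(mean, sigma)
    else:
        # Test: MAP estimate of a Gaussian is its mean.
        loc = mean
    # Clipping to the valid range is an assumption of this sketch.
    return min(1.0, max(0.0, loc))
```

Sampling during training is what lets the agent explore different observation sequences; at test time the deterministic choice makes the output reproducible.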

Training
• The goal is to train three outputs: candidate detection $d_n$, prediction indicator $p_n$, and location of next observation $l_{n+1}$
  • This is difficult because suitable loss and reward functions must be designed and the model has non-differentiable components
• We use backpropagation to train $d_n$ and REINFORCE to train $p_n$ and $l_{n+1}$

Training (Candidate Detection $d_n$)
• Match each candidate detection in $D = \{d_n \mid n = 1, \dots, N\}$ from the recurrent network to the ground truth $g_{1,\dots,M}$
• Matching function:
  • $y_{nm} = 1$ if $m = \arg\min_{j=1,\dots,M} \mathrm{dist}(l_n, g_j)$, and $y_{nm} = 0$ otherwise
  • $g_j = (s_j, e_j)$
  • $\mathrm{dist}(l_n, g_j) = \min(|s_j - l_n|, |e_j - l_n|)$
• Loss function:
  • $\sum_n L_{cls}(d_n) + \gamma \sum_n \sum_m \mathbb{1}[y_{nm} = 1] \, L_{loc}(d_n, g_m)$
  • $L_{cls}(d_n)$: cross-entropy loss on the detection confidence $c_n$
  • $L_{loc}(d_n, g_m)$: L2 loss to further minimize the distance $\|(s_n, e_n) - (s_m, e_m)\|$
• Optimize the loss using backpropagation
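The matching function and the combined loss can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: the cross-entropy targets are passed in as `targets` because the slide does not say how they are chosen, the L2 term is taken as a squared distance, and `gamma=1.0` is a placeholder.

```python
import math


def dist(l_n, g):
    """Distance from observation location l_n to the nearer boundary of g."""
    s, e = g
    return min(abs(s - l_n), abs(e - l_n))


def match(l_n, ground_truth):
    """Index m of the ground truth instance closest to l_n (y_nm = 1)."""
    return min(range(len(ground_truth)), key=lambda j: dist(l_n, ground_truth[j]))


def detection_loss(detections, locations, targets, ground_truth, gamma=1.0):
    """Sum of classification and localization terms over all detections.

    detections: list of (s_n, e_n, c_n); targets: assumed 0/1 labels for c_n.
    """
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for (s_n, e_n, c_n), l_n, t_n in zip(detections, locations, targets):
        # Cross-entropy on the detection confidence c_n.
        loss -= t_n * math.log(c_n + eps) + (1 - t_n) * math.log(1 - c_n + eps)
        # Squared L2 localization error against the matched ground truth.
        s_m, e_m = ground_truth[match(l_n, ground_truth)]
        loss += gamma * ((s_n - s_m) ** 2 + (e_n - e_m) ** 2)
    return loss
```

Because every term is differentiable in $s_n$, $e_n$, and $c_n$, this part of the model can be trained with ordinary backpropagation.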

Training (Location $l_{n+1}$ and Prediction Indicator $p_n$)
• Use REINFORCE to learn the observation and emission policies
• REINFORCE:
  • Objective: $J(\theta) = \sum_{a \in \mathcal{A}} p_\theta(a) \, r(a)$
    • $\mathcal{A}$: space of action sequences
    • $p_\theta(a)$: probability of action sequence $a$
    • $r(a)$: reward
  • Gradient: $\nabla J(\theta) = \sum_{a \in \mathcal{A}} p_\theta(a) \, \nabla \log p_\theta(a) \, r(a)$
    • This is a nontrivial optimization problem due to the high-dimensional space of possible action sequences
    • Instead, use Monte Carlo sampling to approximate the expectation:
    • $\nabla J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \sum_{n=1}^{N} \nabla \log \pi_\theta(a_n^i \mid h_{1:n}^i, a_{1:n-1}^i) \, R_n^i$
    • $K$: number of interaction sequences
    • $N$: number of RNN time steps
    • $\pi_\theta$: agent's policy
    • $a_n$: current action ($l_{n+1}$ or $p_n$)
    • $R_n$: cumulative reward from the current time step onward
    • $h_n$: hidden state
  • Optimize by maximizing the objective
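The Monte Carlo gradient estimate can be checked on a toy problem. The sketch below is not the paper's agent: it assumes a policy with a single parameter `theta` that emits 1 with probability sigmoid(theta) at every step, ignores the hidden state, and uses a made-up per-step reward equal to the action. For that toy setup the exact gradient is known, so the estimate can be compared against it.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def exact_gradient(theta, num_steps):
    """Exact dJ/dtheta for the toy problem: J = num_steps * sigmoid(theta)."""
    p = sigmoid(theta)
    return num_steps * p * (1.0 - p)


def reinforce_gradient(theta, num_sequences, num_steps, rng):
    """Monte Carlo estimate: (1/K) sum_i sum_n grad log pi(a_n^i) * R_n^i."""
    prob = sigmoid(theta)
    total = 0.0
    for _ in range(num_sequences):
        actions = [1 if rng.random() < prob else 0 for _ in range(num_steps)]
        reward_to_go = 0.0
        # Walk backwards so reward_to_go accumulates the reward from step n onward.
        for n in reversed(range(num_steps)):
            reward_to_go += float(actions[n])  # toy reward: r_n = a_n
            # d/dtheta log pi(a) for a Bernoulli(sigmoid(theta)) policy is a - prob.
            grad_log_prob = actions[n] - prob
            total += grad_log_prob * reward_to_go
    return total / num_sequences
```

For example, `reinforce_gradient(0.0, 20000, 3, random.Random(0))` should land close to `exact_gradient(0.0, 3)`, which is 0.75; the estimate gets noisier as the number of sampled sequences shrinks.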

Training (Location $l_{n+1}$ and Prediction Indicator $p_n$)
• Reward function:
  • Want high precision and recall
  • $r_N = R_p$ if $M > 0$ and $N_p = 0$; otherwise $r_N = N_+ R_+ + N_- R_-$
  • $N_p$: number of predictions emitted by the agent
  • $N_+$, $R_+$: number of true positive predictions and the reward for each
  • $N_-$, $R_-$: number of false positive predictions and the reward for each
  • $R_p$: penalty for not emitting a prediction when the number of ground truth instances $M > 0$
  • A prediction is correct if its overlap with a ground truth instance is greater than a threshold and higher than that of any other prediction
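The reward at the end of an episode can be written as a small function. This is a sketch: the reward values `r_pos`, `r_neg`, and `r_none` are placeholders rather than the values used in the paper, and it assumes every emitted prediction counts as either a true positive or a false positive.

```python
def sequence_reward(num_ground_truth, num_true_pos, num_false_pos,
                    r_pos=1.0, r_neg=-1.0, r_none=-1.0):
    """Reward r_N received once the agent has processed the whole clip."""
    # Assumption: N_p is the sum of true and false positive predictions.
    num_predictions = num_true_pos + num_false_pos
    if num_ground_truth > 0 and num_predictions == 0:
        # Penalty R_p: there was something to detect but nothing was emitted.
        return r_none
    # Otherwise reward true positives and penalize false positives.
    return num_true_pos * r_pos + num_false_pos * r_neg
```

A positive reward for true positives pushes recall up, while the negative reward for false positives keeps precision from collapsing.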

Strengths/Weaknesses of Approach
• Strengths:
  • Do not need to look at all the frames
  • End-to-end learning
• Weaknesses:
  • Need all the frames in a clip (cannot do online detection)
  • Can be difficult to learn the observation policy if the event contains less discriminative movements

Results
• Results from THUMOS'14 compared with the top 3 performers. mAP is reported for different IoU thresholds $\alpha$
• Ablation studies show that without the localization regression and without learning where to observe next, results are significantly worse
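The IoU threshold $\alpha$ refers to temporal overlap between a predicted segment and a ground truth segment. A minimal sketch of that overlap test follows; the function names and the (start, end) tuple layout are assumptions, and the full mAP computation is not shown.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) segments."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


def overlaps_enough(pred, truth, alpha):
    """True if the prediction overlaps the ground truth by at least alpha."""
    return temporal_iou(pred, truth) >= alpha
```

Raising `alpha` demands tighter temporal bounds, which is why mAP is reported at several thresholds.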

Results (Learned Observation Policy)

Future Direction
• Learn joint spatio-temporal observation policies

Detecting Events and Key Actors in Multi-Person Videos
• Paper from Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy and Li Fei-Fei in CVPR 2016.
• Objective:
  • Predict events and key actors in videos where multiple people are involved
• Key Contributions:
  • Introduce a large-scale basketball event dataset
  • Use attention to decide which people are most relevant to the action being performed
  • Show that the attention model results in better event recognition

Dataset
• Introduced a large dataset with multi-person action videos. The dataset consists of 257 NCAA games, each around 1.5 hours long. 11 different basketball events are densely annotated in the videos.

Approach
• Events in a team sport are performed by a set of key players. To recognize an event, it is sufficient to focus only on the players participating in it. For example, a "steal" event in basketball is defined by the action of the player attempting to pass the ball and the player stealing it.
• The idea is to focus on key players to predict events.

Approach (Pipeline)
• Each player track is processed by a BLSTM network. The output hidden state is processed by an attention model to identify key players.
• The thickness of the boxes shows the attention weights.
• Each video frame is processed by a BLSTM network.

Feature Extraction
• Each video frame $t$ is represented as a feature vector $f_t$ from the activation of the last fully connected layer of the Inception7 network.
• Each player $i$ bounding box is represented as a feature vector $p_{ti}$ from Inception7.

Event Classification
• Compute the global context vector for each frame $t$:
  • $h_t^f = \mathrm{BLSTM}_{frame}(h_{t-1}^f, h_{t+1}^f, f_t)$
• Next compute the hidden state of the event at time $t$:
  • $h_t^e = \mathrm{LSTM}(h_{t-1}^e, h_t^f, a_t)$
  • $a_t$ is the feature vector for the players from the attention model
• Predict the class label using $w_k^\top h_t^e$
• Squared hinge loss function:
  • $L = \frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{K} \max(0, 1 - y_k w_k^\top h_t^e)^2$
  • $y_k$ is 1 if the video belongs to class $k$ and -1 otherwise
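The squared hinge loss can be computed directly from per-frame class scores. In this sketch `scores[t][k]` stands for the already computed value $w_k^\top h_t^e$; the function name and the nested-list layout are assumptions made for illustration.

```python
def squared_hinge_loss(scores, true_class):
    """0.5 * sum over frames t and classes k of max(0, 1 - y_k * score)^2."""
    loss = 0.0
    for frame_scores in scores:
        for k, score in enumerate(frame_scores):
            # y_k is +1 for the video's class and -1 for every other class.
            y_k = 1.0 if k == true_class else -1.0
            loss += max(0.0, 1.0 - y_k * score) ** 2
    return 0.5 * loss
```

Scores that are on the correct side with a margin of at least 1 contribute nothing, so the loss only penalizes classes that are wrong or not confident enough.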

Attention
• How do we get the feature vector $a_t$ for the players from the attention model?

Attention Models (with tracking)
• Attention model with KLT tracking for player $i$ and frame $t$:
  • $h_{ti}^p = \mathrm{BLSTM}_{track}(h_{t-1,i}^p, h_{t+1,i}^p, p_{ti})$
  • $a_t^{track} = \sum_{i=1}^{N_t} \gamma_{ti}^{track} h_{ti}^p$
  • $\gamma_{ti}^{track} = \mathrm{softmax}(\phi(h_t^f, h_{ti}^p, h_{t-1}^e); \tau)$
• $a_t$: weighted combination over players in frame $t$
• $\gamma_{ti}$: attention weights
• $N_t$: number of player detections in frame $t$
• $\phi()$: multilayer perceptron
• $\tau$: softmax temperature

Attention Models (without tracking)
• Attention model without tracking:
  • $a_t^{notrack} = \sum_{i=1}^{N_t} \gamma_{ti}^{notrack} p_{ti}$
  • $\gamma_{ti}^{notrack} = \mathrm{softmax}(\phi(h_t^f, p_{ti}, h_{t-1}^e); \tau)$
• $a_t$: weighted combination over players in frame $t$
• $\gamma_{ti}$: attention weights
• $N_t$: number of player detections in frame $t$
• $\phi()$: multilayer perceptron
• $\tau$: softmax temperature
• $p_{ti}$: player feature vector from Inception7
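Both attention variants reduce to a temperature softmax over per-player scores followed by a weighted sum of player vectors. The sketch below takes the scores as given instead of computing them with the multilayer perceptron $\phi$, and it assumes the temperature divides the scores before the softmax, which is the usual convention but is not spelled out on the slide.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores scaled by 1 / temperature (assumed convention)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attend(player_features, scores, temperature=1.0):
    """Return (a_t, attention weights) for one frame."""
    weights = softmax(scores, temperature)
    dim = len(player_features[0])
    # a_t is the attention-weighted combination of the player vectors.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, player_features))
        for d in range(dim)
    ]
    return attended, weights
```

With equal scores every player gets the same weight and the result matches plain averaging; lowering the temperature concentrates the weight on the highest-scoring player.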

Strengths/Weaknesses of Approach
• Strengths:
  • Attention focuses on key players
• Weaknesses:
  • Need all the frames in a clip (cannot do online detection)
  • Model tends to be reluctant to switch attention between players in a scene

Results (Event Classification)
• Here we compare the ability to classify isolated video clips into 11 classes
• Attention is particularly good for shot-based events, where attending to the person making the shot or to the defenders can be useful

Results (Event Detection)
• Here we compare the ability to temporally localize events in untrimmed videos using a 4-second sliding window through all the videos
• Here, a steal event is particularly challenging as it is often mistaken for a pass
• Combining the player features by averaging, without using attention, also performs very well
  • Possibly because the algorithm has difficulty changing attention since we are dealing with untrimmed videos
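The sliding-window evaluation can be sketched as a generator of fixed-length segments. Only the 4-second window length comes from the slide; the 1-second stride and the function name are assumptions, and scoring each window with the event classifier is left out.

```python
def sliding_windows(duration, window=4.0, stride=1.0):
    """Return (start, end) segments of length `window` that fit in the video."""
    windows = []
    index = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while index * stride + window <= duration:
        start = index * stride
        windows.append((start, start + window))
        index += 1
    return windows
```

Each window would then be classified as an isolated clip, and the per-window scores give the temporal localization of events.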

Results (Attention)
• Attended player is in cyan and ball is in yellow
• Results show that the model attends to the player making the shot at the beginning

Results (Attention Heatmap)
• The distribution of attention shows that attention initially focuses on the shooter and then disperses later in the event

Wrap Up
• Questions?
• Suggestions?
