Learning Spatiotemporal Features with 3D Convolutional Networks
Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
Effective Video Descriptor
• Generic – can represent different types of video
• Compact – cheap to process and store
• Efficient – fast to compute
• Simple – easy to implement
3D Convolution and Pooling
• 3D convolution models temporal information better than 2D convolution.
  – 2D CONV: performed only spatially, so temporal information is lost.
  – 3D CONV: performed spatio-temporally, so temporal information is preserved.
• The same applies to pooling.
2D Convolution on 1-ch Input
• Result: 2D image.
2D Convolution on n-ch Input
• Result: 2D image.
3D Convolution on n-ch Input
• Result: volume.
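The difference between the three cases above can be checked on a toy input. The following is a minimal sketch in plain Python, assuming stride 1, no padding, and a single filter. The function names (`conv2d_multichannel`, `conv3d`, `shape`) are made up for this illustration and do not come from the C3D code base.

```python
def conv2d_multichannel(frames, kernel):
    """2D convolution (cross-correlation, as used in CNNs) over a stack of
    frames treated as channels.

    frames: nested list [L][H][W]; kernel: nested list [L][kh][kw].
    Returns a 2D map of size [H-kh+1][W-kw+1].
    """
    n, height, width = len(frames), len(frames[0]), len(frames[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            # The sum runs over ALL frames at once, so the temporal axis
            # is collapsed into a single value per spatial position.
            row.append(sum(frames[t][y + i][x + j] * kernel[t][i][j]
                           for t in range(n)
                           for i in range(kh)
                           for j in range(kw)))
        out.append(row)
    return out


def conv3d(frames, kernel):
    """3D convolution over [L][H][W] with a kernel [d][kh][kw], d <= L.

    Returns a volume of size [L-d+1][H-kh+1][W-kw+1].
    """
    n, d = len(frames), len(kernel)
    # Slide the kernel along time; every temporal position yields one 2D map,
    # so the output keeps a temporal axis.
    return [conv2d_multichannel(frames[t:t + d], kernel)
            for t in range(n - d + 1)]


def shape(a):
    """Shape of a regular nested list, e.g. (2, 3, 3)."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        a = a[0]
    return tuple(dims)


# Toy clip: 4 frames of 5x5 pixels, all ones.
video = [[[1] * 5 for _ in range(5)] for _ in range(4)]
k2d = [[[1] * 3 for _ in range(3)] for _ in range(4)]  # spans all 4 frames
k3d = [[[1] * 3 for _ in range(3)] for _ in range(3)]  # 3x3x3 kernel

image = conv2d_multichannel(video, k2d)
volume = conv3d(video, k3d)
print(shape(image))   # (3, 3)    -> 2D image, time is gone
print(shape(volume))  # (2, 3, 3) -> volume, time is kept
```

Writing `conv3d` as a 2D convolution repeated over sliding temporal windows is only a readability choice for this sketch. Real frameworks implement it as a single fused operation.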
Identify Best Architecture for 3D ConvNets (on UCF101)
• Common network settings
  – All video frames are resized to 128x171.
  – Videos are split into non-overlapping 16-frame clips.
  – Input: 3x16x128x171.
  – 5 convolution and pooling layers
  – 2 fully connected layers
  – Softmax loss layer to predict action labels
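The clip splitting and the resulting input size can be sketched as follows. This is a small illustration, not the authors' preprocessing code: the helper names are hypothetical, and dropping trailing frames that do not fill a whole clip is an assumption, since the slides do not say how leftovers are handled.

```python
CLIP_LEN = 16               # frames per clip (from the slide)
FRAME_H, FRAME_W = 128, 171  # resized frame size (from the slide)


def split_into_clips(num_frames, clip_len=CLIP_LEN):
    """Return (start, end) frame ranges of non-overlapping clips.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]


def clip_input_shape(channels=3, clip_len=CLIP_LEN,
                     height=FRAME_H, width=FRAME_W):
    """Shape of one network input: channels x frames x height x width."""
    return (channels, clip_len, height, width)


clips = split_into_clips(50)
print(clips)               # [(0, 16), (16, 32), (32, 48)]
print(clip_input_shape())  # (3, 16, 128, 171)
```

With a 50-frame video, frames 48 and 49 are discarded under this assumption. Padding or overlapping the last clip would be equally valid alternatives.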
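Identify Best Architecture for 3D ConvNets (on UCF101)
• Varying network architecture
  – Homogeneous temporal depth: every convolution layer uses the same kernel temporal depth.
    • Depth-d for d = 1, 3, 5, 7
  – Varying temporal depth: the kernel temporal depth changes across the 5 convolution layers.
    • Increasing: 3-3-5-5-7
    • Decreasing: 7-5-5-3-3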
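The candidate settings can be written down as a small configuration table, one kernel temporal depth per convolution layer. This is only a sketch: the 3x3 spatial kernel size is an assumption here, and the decreasing setting is taken as 7-5-5-3-3 so that it has one depth for each of the 5 convolution layers.

```python
NUM_CONV_LAYERS = 5  # from the common network settings
SPATIAL_SIZE = 3     # assumed spatial kernel size (3x3)


def homogeneous(depth, num_layers=NUM_CONV_LAYERS):
    """Same kernel temporal depth in every convolution layer."""
    return [depth] * num_layers


# One list of temporal depths per candidate architecture.
configs = {f"depth-{d}": homogeneous(d) for d in (1, 3, 5, 7)}
configs["increasing"] = [3, 3, 5, 5, 7]
configs["decreasing"] = [7, 5, 5, 3, 3]


def kernel_shapes(depths, spatial=SPATIAL_SIZE):
    """Kernel shape (temporal, height, width) for each convolution layer."""
    return [(d, spatial, spatial) for d in depths]


for name, depths in configs.items():
    # Every candidate must define exactly one depth per convolution layer.
    assert len(depths) == NUM_CONV_LAYERS
    print(name, kernel_shapes(depths))
```

Note that a depth-1 kernel covers a single frame at a time, so the depth-1 network has no temporal extent in its convolutions.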
3D Convolution Kernel Temporal Depth Search
Spatiotemporal Feature Learning
• Best network architecture
  – With 3x3x3 kernels
Spatiotemporal Feature Learning
• Dataset for training
  – Sports-1M dataset
    • Largest video classification benchmark
    • 1.1 million sports videos
    • 487 categories
Sports-1M Classification Results
C3D Video Descriptor
• The C3D model can be used as a feature extractor for various video analysis tasks.
  – Action recognition
  – Action similarity
  – Scene and object recognition
• Uses the fc6 activations as features
  – 4096 dimensions
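One way to turn per-clip fc6 activations into a single video-level descriptor is to average them and L2-normalize the result. The sketch below assumes that pooling scheme, which the slides do not state explicitly. The function name is hypothetical and the feature values are dummy numbers standing in for real network outputs.

```python
import math

FC6_DIM = 4096  # fc6 feature dimension (from the slide)


def video_descriptor(clip_features):
    """Average clip-level feature vectors, then L2-normalize.

    clip_features: list of equal-length lists of floats, one per clip.
    Assumption: mean pooling followed by L2 normalization.
    """
    if not clip_features:
        raise ValueError("need at least one clip feature vector")
    dim = len(clip_features[0])
    n = len(clip_features)
    mean = [sum(f[i] for f in clip_features) / n for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [v / norm for v in mean] if norm > 0 else mean


# Tiny 2-dimensional example that is easy to check by hand:
# mean = [3.0, 4.0], norm = 5.0
small = video_descriptor([[3.0, 0.0], [3.0, 8.0]])
print(small)  # approximately [0.6, 0.8]

# Dummy 4096-dimensional features for two clips of the same video.
dummy = [[1.0] * FC6_DIM, [3.0] * FC6_DIM]
descriptor = video_descriptor(dummy)
print(len(descriptor))  # 4096
```

The descriptor keeps the 4096 dimensions regardless of how many clips the video has, which is what lets a simple classifier consume it directly.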
Action Recognition
• Dataset: UCF101
  – 13,320 videos
  – 101 human action classes
Action Similarity Labeling
• Dataset: ASLAN
  – 3,631 videos
  – 432 action classes
Scene and Object Recognition
• Dataset: YUPENN
  – 420 videos
  – 14 scene categories
• Dataset: Maryland
  – 130 videos
  – 13 scene categories
Why C3D Features?
• Generic
• Compact
• Efficient
• Simple
Visualisation using the t-SNE method:
L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR
What Does C3D Learn?
Using deconvolution method in M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014
Useful Links
• http://vlg.cs.dartmouth.edu/c3d/
• https://github.com/facebook/C3D
Tools and software required:
- keras
- tensorflow
- ffmpeg (compiled from source)
- opencv (compiled from source)
Thank you