
Human Action Recognition Using a Hybrid Deep Learning Heuristic

Samarendra Chandan Bindu Dash, Informatica Pvt Ltd
Soumya Ranjan Mishra, Anil Neerukonda Institute of Technology and Science
K. Srujan Raju ( [email protected] ), CMR Technical Campus, https://orcid.org/0000-0001-5058-4313
L V Narasimha Prasad, Institute of Aeronautical Engineering

Research Article

Keywords: Human action recognition, 3D Convolutional Neural Network, Deep Neural Network, Optical flow, SIFT, Motion feature extraction

Posted Date: July 19th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-693703/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.


Human Action Recognition using a Hybrid Deep Learning Heuristic

Samarendra Chandan Bindu Dash1, Soumya Ranjan Mishra2, K. Srujan Raju3, L V Narasimha Prasad4

1 Informatica Pvt. Ltd
2 Department of Computer Science and Engineering, ANITS, Visakhapatnam, India
3 Department of Computer Science and Engineering, CMR Technical Campus, Hyderabad, India
4 Department of Computer Science and Engineering, Institute of Aeronautical Engineering, Hyderabad, India

{samarendra109, soumyaranjanmishra.in, [email protected], [email protected]}@gmail.com

Abstract. Human action recognition in surveillance video is currently one of the challenging research topics. Most works in this area either build classifiers on sophisticated handcrafted features or design deep learning based convolutional neural networks (CNNs) that act directly on raw inputs and extract meaningful information from the video. To capture the motion information between adjacent frames, 3D CNNs extract features in the temporal dimension along with the spatial dimension. Although this technique is very effective for human action recognition, it is limited to a small, fixed number of frames; moreover, human actions are not confined to a fixed number of frames and may span many frames. If we increase the size of the input window of the CNN, handling all the trainable parameters in the network becomes very complicated. Hence, it is advisable to encode high level motion features from other sources into the CNN model. In this paper, we propose a novel framework that extracts handcrafted high level motion features and deep CNN features in parallel to recognize human action. SIFT is used as the handcrafted feature to encode high level motion information from the maximum number of input video frames. The combination of deep and handcrafted features preserves longer temporal information from all the 2D frames present in the action video with minimal computational power. Finally, we pass the extracted SIFT features into a dense layer and concatenate them with the fully connected layer of the CNN for classification. We evaluate the proposed combined CNN framework against a standard 3D CNN and traditional handcrafted baselines, optical flow with SVM and SIFT with SVM, on the UCF and KTH human action datasets. The proposed framework achieves better performance in terms of computational cost and processing time compared with the other three methods.

Keywords: Human action recognition · 3D Convolutional Neural Network · Deep Neural Network · Optical flow · SIFT · Motion feature extraction


1 Introduction

Human action recognition in a real-world environment is highly applicable in various domains such as video surveillance, crowd behavior analysis, and human-robot interaction. However, recognizing actions accurately in real-world video is a challenging task due to many factors, including variation in viewpoint, scale, cluttered backgrounds, and more [9],[14],[22],[47]. A few authors have made certain assumptions about these factors while capturing video and have shown comparatively better results on this problem [12],[36]. Generally, human action recognition is a two-step process: in the first step, different types of features are extracted from the raw video frames, and in the second step, a classifier is trained on the extracted features to categorize the different action classes. Feature extraction is one of the most crucial parts of video and image analysis. Two types of feature extraction methods are widely used for video representation: handcrafted feature descriptors and deep learning based descriptors. A CNN (Convolutional Neural Network) is a deep learning based descriptor in which convolution and pooling operations are applied in successive layers to the raw input image to produce a fully connected layer. After several layers of trainable filters and pooling, it generates a hierarchy of complex features, which is very effective for visual object recognition tasks [7],[37],[40],[51].

There are a few works on human action recognition with 2D CNN models [32] using a single-frame architecture, where a feature vector is extracted for each frame. Such a 2D CNN architecture can be trained on large-scale human action datasets, but its major limitation is that it does not capture or model the temporal information between frames. Human action videos consist of multiple frames; therefore, a 3D CNN is more effective at modeling both spatial and temporal information over a short period of time [13]. Although 3D CNNs show tremendous success for human action classification from video clips, they are limited to a fixed number of input frames, whereas in practice a human action may be performed throughout the entire duration of the video. Running a 3D CNN on the complete sequence of 2D frames requires a huge computational cost: it takes around 3 to 4 days on UCF101 and nearly 2 months on Sports-1M to train a 3D ConvNet with a very extensive architecture [41]. Handcrafted SIFT descriptors and bag-of-visual-words (BoW) models achieve state-of-the-art performance for human action recognition with a better compromise between computational complexity and recognition rate. In our previous work, scale-invariant feature transform (SIFT) features were also used in Bezier cohort fusion in doubling states for human identity recognition [11].

To capture motion information from the entire video clip (the whole sequence of frames) with limited computational cost and complexity, we introduce a novel combined framework that recognizes human action by considering both handcrafted features and deep features in the proposed CNN model. Through this combination, both high and low level spatiotemporal information can be preserved from the entire video clip with minimal computational cost. Inspired by both manual handcrafted feature extraction and bio-inspired convolution operations, in this work we combine the two in different layers of a 3D CNN. In the first layer, we extract features by applying a hardwired kernel to the first seven frames of the input video to generate multiple channels of information, and then apply multiple layers of convolution and pooling in both the spatial and temporal dimensions to each channel separately. Along with the convolutions, we extract a keypoint-based SIFT descriptor from the raw input video frames (all the frames present in the action video) to model temporal information with minimal computational cost and time. Finally, we pass the extracted SIFT features into a dense layer and concatenate them with the fully connected layer of the CNN for classification.

We evaluate the proposed combined CNN framework on the UCF and KTH human action datasets. We take the first 7 frames for the CNN and the entire video clip (all frames) for SIFT, in parallel, for feature extraction. The concatenation of the two feature vectors is used for classification. Our experiments show that the proposed combined CNN framework outperforms state-of-the-art human action recognition systems with very low computational cost and minimal processing time. The major contributions of the proposed model are summarized below:

– First, the handcrafted SIFT feature is extracted from the raw input video clip (all frames), followed by PCA, to form a fully connected fixed-size vector.

– In parallel with SIFT, we extract features such as gradient images, optical flow, and the grey channel from the initial 7 cropped input frames for the convolution operation. The gradient of an image is obtained by applying the Sobel operator in the X and Y directions; similarly, optical flow is computed in the X and Y directions between two consecutive frames. The grey channel is also taken for all frames to generate different channels of information.

– We then apply multiple layers of convolution and pooling in both the spatial and temporal domains, with fixed kernel sizes, to all the different channels separately and generate a fully connected layer.

– Finally, we pass the extracted SIFT features into a dense layer and concatenate them with the fully connected layer of the CNN, followed by dropout and a sigmoid layer for final classification.

– We evaluate the proposed model on the UCF and KTH human action datasets and observe promising results compared with other state-of-the-art human action recognition models.

2 Related Work

In this section, we describe related work on human action classification, covering traditional handcrafted spatio-temporal feature representations and deep learning based approaches, followed by related combination strategies. Most human action recognition models consist of two primary modules: action representation and action classification. Converting an action video into feature vectors is generally referred to as action representation, and concluding an action label from the extracted feature vector is termed action classification.


Modeling human action from different features is a traditional machine learning technique, where different motion information is extracted from the input video and used to train a classifier. In this section, we discuss the various handcrafted and deep learning based features for human action recognition that are relevant to our proposed method.

Handcrafted feature representation. Handcrafted features are designed to encode the movement of the human body and the spatiotemporal changes in an action video [33],[4], including STIP based methods, spatiotemporal volume-based methods, and trajectories of skeleton joints. All these action features are used with traditional machine learning techniques such as SVM, boosting, and probability map models to recognize human action. The earliest approach to human action recognition was proposed in [2] using the motion energy image (MEI) and motion history image (MHI); the main idea behind this model is to design an action template and perform template matching. In [54], polar coordinates are used to divide the human body in the MHI, and a SIFT based motion descriptor is used to represent human action. The histogram of gradients (HOG) method was extended to the space-time dimension as 3DHOG to describe motion in video [17]. STIP based techniques are the most widely used methods for human action detection, in which interest points are tracked in the spatiotemporal dimension [5],[20]. 3D-Harris spatiotemporal feature points and other advanced techniques have been proposed in the space-time domain for effective human motion computation [21],[3]. In [31], a key region extraction technique using a spatiotemporal attention mechanism is proposed to design action features and a visual dictionary. In [34], a hybrid super vector representation of human action is proposed by modifying local spatiotemporal features and the visual dictionary construction. To extract the key regions of an action video, 3D-Harris spatiotemporal information and 3D scale-invariant features are combined, and visual word histograms are used to model human action [28]. STIP based human action recognition attracted the attention of many researchers. The methods mentioned above show excellent results on static backgrounds, but in the presence of camera motion they generate many unnecessary key points, which creates problems in recognizing object motion.

Generally, human joints (joint skeleton features) are invariant to camera motion and object appearance, and they provide the position of the joints accurately from different viewpoints (front, back, side). Recently, joints in the human skeleton have been used to represent human action. Improved dense trajectories (IDT) were proposed [45] to address human action recognition from skeleton joints; the displacement of the extracted human joints (feature points) is calculated using the optical flow approach. Several researchers have improved skeleton joint based approaches in various ways [44],[35]. Local motion trajectories are analyzed using split clustering [10] and used to represent the motion level in human action. Stacked Fisher vectors have also been used with IDT features to analyze human action. The main advantage of these trajectory-based methods is their invariance to viewpoint and appearance. However, they require an accurate detector of human skeleton joints in order to model the motion of those points for effective action recognition.


Deep feature extraction. Nowadays, deep neural networks combine both motion representation and classification components into a complete framework for the action recognition problem, which enhances classification accuracy [29],[18]. Based on the convolution operation, different types of 3D CNN architecture can be designed; the choice of architecture is entirely problem dependent and specific to a particular application. A multi-stream CNN is proposed in [42] to incorporate multiple features, such as the full frame, the human body, and motion-salient body part regions, together in a single network. A region sequence based CNN is proposed in [26] for human action recognition in small-scale image regions, where pooling layer information is encoded to capture more spatial information. Very recently, the capsule network (CapsNet) was introduced for the human action recognition problem [1]. CapsNet replaces neurons with vectors, which can retain the spatial relationships of features between frames; compared with neuron based CNNs, these vector based CapsNets have shown better performance at the same complexity on the KTH and UCF datasets. By mimicking the neural processing in the visual cortex of the human brain, a sparse-based neural response method is proposed in [24] for image classification. A biological model of the human vision system is proposed in [6], concerned with very low level visual processing; it simulates the properties of the retina and the lateral geniculate nucleus (LGN) in the early stages. Combined RGB and skeleton data are used in a CNN for human activity recognition [16], where multiple vision cues are combined for competitive results. In [46], traditional handcrafted motion features are trained through a CNN-based hallucination step, combining the dense trajectories of video frames with the output of the CNN. In [48], a novel regularization scheme on human skeleton joints is proposed to learn co-occurrence features using an LSTM network. A two-fold approach has been proposed for frame extraction and action recognition from video using a typical convolutional neural network [27]. Recently, a few authors have used 3D convolutional deep learning frameworks for medical image analysis, such as head and neck anatomy segmentation [55].

3 Proposed Model for Human Action Classification

3.1 Overview of the Framework

Fig. 1 shows an overview of our model. Our proposed method jointly trains both deep and handcrafted features from input videos. We use two feature extraction methods: deep features (using a CNN) and handcrafted features (using SIFT). The deep feature CNN deals with a fixed number of frames to avoid complexity in the network, while SIFT is calculated from all the 2D frames present in the action video to extract longer temporal features. Since we extract SIFT from the maximum number of frames, the dimension of the feature vector increases; to address this, we apply PCA to the extracted SIFT features to select a smaller number of components. These two feature extraction methods are explained in detail in Sections 3.2 and 3.3. Using them, we obtain two feature vectors that capture both spatial and temporal motion information from the entire input video clip. Finally, the two feature vectors are concatenated and classified at the classification layer of the CNN.

Fig. 1. Overview of the proposed framework: handcrafted features are extracted with SIFT from all non-overlapping frames and reduced with PCA, deep features are extracted with the CNN from frames 1 to 7, and the two fully connected vectors are concatenated for classification in the output layer.

3.2 Handcrafted features SIFT

After extensive research and experiments on image and video analysis for the human action recognition problem, it can be concluded that local patches sampled around detected interest points preserve the most discriminative and adequate information about human actions. The local patches represent the complete action video without any further preprocessing such as background subtraction or object tracking. In this paper, we use 3D SIFT as an additional handcrafted feature since it is robust and invariant to rotation and noise. The original 2D SIFT [25] is extended to the 3D video volume, where the third dimension is time, to capture space-time information. In particular, for consecutive frames, the motion edge history image (MEHI) proposed in [2] preserves maximum motion information with modest processing cost. Following this idea, we first extract moving points and track them throughout the full video, then extract a SIFT descriptor around every detected point to describe the spatial information of the frames. The displacement of the extracted points is described with trajectories. Inspired by the work in [15], moving point detection and displacement computation are done with two widely used techniques: background subtraction and Lucas-Kanade optical flow, respectively. Finally, a bag-of-features model is used to generate a histogram for every input action video. The complete feature extraction process is shown in Figure 2 and explained in the steps below, followed by a short code sketch:

1. First, each current video frame is subtracted from the reference frame to obtain the motion region (foreground).

2. After obtaining the motion region, salient points are detected in these regions using the Shi-Tomasi corner detector [39] to generate moving points.

3. We then extract a SIFT descriptor around every detected point in both the spatial and temporal domains. The spatial information captures the appearance around the points, and the temporal information represents the change of the feature points along the time axis in the 3D space-time volume.

4. Finally, a fully connected fixed-size vector is formed by grouping all the descriptors into a set of clusters using k-means; the bag-of-features vector is then formed by counting the number of descriptors in each cluster. This resulting feature vector is merged with the last fully connected layer of the CNN.
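The following Python sketch illustrates steps 1-4 with OpenCV and scikit-learn. It is a minimal illustration under stated assumptions, not the exact implementation: the difference threshold, corner detector settings, keypoint size, and codebook size are assumptions, and ref_frame is assumed to be a grayscale background frame of the same resolution as the video.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def video_descriptors(frames, ref_frame, diff_thresh=25):
    """Collect SIFT descriptors around moving points of every frame (steps 1-3).

    frames: list of BGR frames; ref_frame: grayscale background frame (assumption).
    """
    sift = cv2.SIFT_create()
    descriptors = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 1. motion region / foreground via background subtraction
        motion = cv2.absdiff(gray, ref_frame)
        _, mask = cv2.threshold(motion, diff_thresh, 255, cv2.THRESH_BINARY)
        # 2. Shi-Tomasi corners restricted to the motion mask (illustrative settings)
        corners = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                          qualityLevel=0.01, minDistance=5, mask=mask)
        if corners is None:
            continue
        # 3. SIFT description around every detected moving point
        kps = [cv2.KeyPoint(float(x), float(y), 8) for x, y in corners.reshape(-1, 2)]
        _, desc = sift.compute(gray, kps)
        if desc is not None:
            descriptors.append(desc)
    return np.vstack(descriptors)

def bag_of_features(all_video_descs, k=200):
    """4. Build a k-means codebook and a k-bin bag-of-features histogram per video."""
    codebook = KMeans(n_clusters=k, n_init=4, random_state=0).fit(np.vstack(all_video_descs))
    hists = [np.bincount(codebook.predict(d), minlength=k).astype(float)
             for d in all_video_descs]
    return codebook, np.array(hists)
```

In use, video_descriptors would be called once per training video and the resulting list of descriptor arrays passed to bag_of_features, giving one fixed-size histogram per video.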

PCA to reduce the dimension of the handcrafted SIFT features

The feature vector extracted from a larger number of sequential frames increases the dimensionality of the motion features. Since this SIFT feature vector must be concatenated with the last (classification) layer of the CNN, it would produce a very large vector, and the high dimensionality of the combined feature vector would make classification very complex. To address this problem, the subspace method proposed in [30] is used to reduce the dimension of the handcrafted SIFT features before the feature vectors are concatenated. We use PCA (principal component analysis) to select a smaller number of components from the extracted SIFT features before concatenation with the classification layer of the CNN, as shown in Figure 1. With PCA, we can select a few principal components that carry the maximum variation of the original components; the resulting principal components have sufficient power to represent the original features with a smaller dimension.
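A minimal scikit-learn sketch of this reduction step is shown below; the 95% explained-variance target and the stand-in data shapes are assumptions, since the paper only states that a smaller number of principal components is retained.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
bow_histograms = rng.random((120, 200))   # stand-in for 120 videos x 200-bin BoW vectors

# Retaining components explaining ~95% of the variance is an assumption.
pca = PCA(n_components=0.95)
sift_branch_input = pca.fit_transform(bow_histograms)
print(sift_branch_input.shape)            # (120, d) with d much smaller than 200
```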

3.3 Deep feature extraction using CNN

Fig. 2. Description of SIFT feature extraction: action video frame → background subtraction → motion detection → moving point detection → SIFT feature vector.

A 2D CNN computes features in a single (spatial) dimension and can extract very useful features from a single still frame. For video analysis, however, we need to capture the motion information between consecutive frames, so features must be computed in both dimensions (spatial and temporal). The 3D CNN was introduced by applying a 3D kernel to contiguous input video frames [13]. Through this convolution, the obtained feature maps are connected to multiple frames of the input video, which captures the motion information effectively. The mathematical representation of the 3D CNN is shown in Equation 1.

v_{ij}^{xy} = \tanh\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)} \right)   (1)

where v_{ij}^{xy} is the value at position (x, y) of the jth feature map in the ith layer, tanh is the hyperbolic tangent function, b_{ij} is the bias, m indexes the feature maps of the (i-1)th layer, w_{ijm}^{pq} is the value at position (p, q) of the kernel connected to the mth feature map, and P_i and Q_i are the height and width of the kernel.
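As a worked illustration of Equation 1, the NumPy sketch below computes one output feature map from a stack of previous-layer maps; the toy shapes and random values are purely illustrative assumptions.

```python
import numpy as np

def conv_feature_map(prev_maps, weights, bias):
    """Naive implementation of Eq. (1) for one output feature map j in layer i.

    prev_maps: array of shape (M, H, W) -- the M feature maps of layer i-1
    weights:   array of shape (M, P, Q) -- kernel w_{ijm} for each input map m
    bias:      scalar b_{ij}
    Returns an (H-P+1, W-Q+1) map of activations v_{ij}^{xy}.
    """
    M, H, W = prev_maps.shape
    _, P, Q = weights.shape
    out = np.empty((H - P + 1, W - Q + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            patch = prev_maps[:, x:x + P, y:y + Q]   # all m, spatial window over (p, q)
            out[x, y] = np.tanh(bias + np.sum(weights * patch))
    return out

# toy usage: 3 input maps of size 8x8, a 3x3 kernel per map
rng = np.random.default_rng(0)
v = conv_feature_map(rng.standard_normal((3, 8, 8)),
                     rng.standard_normal((3, 3, 3)), bias=0.1)
print(v.shape)  # (6, 6)
```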

3.4 3D CNN on video Frames

In order to capture deep spatial and temporal motion features from the action video, we adopt the state-of-the-art 3D CNN architecture proposed in [13], with a few modifications to combine the other handcrafted features at the classification layer. In our architecture, shown in Figure 3, we take 7 contiguous frames of size 60 × 40 as input. Instead of feeding the raw frames directly, we compute features such as the gradient in X and Y and the optical flow in X and Y for all 7 frames to produce several channels of information from the input video. The resulting channels of information act as the feature maps for the next layer. These hardwired features are employed before the convolution operation to encode human knowledge about motion, and this handcrafted information has shown better performance than providing the raw input directly to the network.

Fig. 3. 3D CNN architecture for deep feature extraction: hardwired layer HF1 (gradient X/Y, optical flow X/Y, and grey channels), 7×7×3 3D convolution (CONV2), 2×2 subsampling (SAM3), 7×6×3 3D convolution (CONV4), 3×3 subsampling (SAM5), 7×4 convolution (CONV6), and a fully connected layer.

Fig. 4. Gradient representation in X and Y direction.

The gradient in X and Y and the optical flow in the X and Y directions are explained below.

Gradient of an image (Sobel filter)

The gradient of an image is used to extract meaningful information from it. The X and Y gradient images are generated from the original image by convolving it with a filter; the gradient measures the change in intensity at each pixel along a direction. First, 7 frames of size 60 × 40 are taken from the input video, and the gradients in the X and Y directions are computed for each frame. Then, instead of the raw input, these gradients of all input frames are given to the network for the convolution operation. The gradient representation of an input image in the X and Y directions is shown in Figure 4.
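A minimal OpenCV sketch of this hardwired gradient step is given below; the 3×3 Sobel kernel size and the random test frame are assumptions for illustration.

```python
import cv2
import numpy as np

def gradient_channels(gray_frame):
    """Hardwired gradient channels for one 60x40 grey frame (illustrative sketch)."""
    gx = cv2.Sobel(gray_frame, cv2.CV_32F, 1, 0, ksize=3)   # gradient along X
    gy = cv2.Sobel(gray_frame, cv2.CV_32F, 0, 1, ksize=3)   # gradient along Y
    return gx, gy

frame = (np.random.default_rng(0).random((40, 60)) * 255).astype(np.uint8)
gx, gy = gradient_channels(frame)
print(gx.shape, gy.shape)   # (40, 60) (40, 60)
```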

Optical Flow

Optical flow between two consecutive video frames is the motion of pixel locations caused by the movement of objects in the scene. For example, if a pixel has intensity I(x, y, t) in the first frame at time t and it moves by (dx, dy) after time dt, the new pixel intensity is I(x + dx, y + dy, t + dt). In the optical flow calculation we assume that image pixel intensities are constant across frames, which gives the brightness constancy equation I(x, y, t) = I(x + dx, y + dy, t + dt). Taking the first-order Taylor series approximation of the right-hand side and applying the constancy assumption, we obtain

\frac{\partial I}{\partial x}\,\delta x + \frac{\partial I}{\partial y}\,\delta y + \frac{\partial I}{\partial t}\,\delta t = 0.

Dividing by δt gives the optical flow constraint equation

\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0,

where u = dx/dt and v = dy/dt are the pixel motions over time t. The unknowns u and v can be solved with the Lucas-Kanade method. Optical flow is very effective in the motion sequence analysis of human action in video data. We compute the gradient and the optical flow of the seven consecutive frames and generate multiple channels of information before the 3D convolution process. The optical flow calculation between frames is illustrated in Figure 5.

Fig. 5. Optical flow between two consecutive frames in an action video: a pixel at (x, y) with intensity I(x, y, t) at time t is displaced by (dx, dy) to (x + dx, y + dy) with intensity I(x + dx, y + dy, t + dt) at time t + dt.
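The sketch below computes the per-pixel flow channels u and v between two consecutive frames. Dense Farneback flow is used here as a stand-in solver (the paper mentions Lucas-Kanade); the parameter values and the toy shifted frame are assumptions.

```python
import cv2
import numpy as np

def flow_channels(prev_gray, next_gray):
    """Per-pixel optical flow channels u (X) and v (Y) between two consecutive frames.

    Dense Farneback flow is a stand-in; any solver of I_x*u + I_y*v + I_t = 0
    (e.g. Lucas-Kanade over local windows) could be substituted.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow[..., 0], flow[..., 1]   # u = dx/dt channel, v = dy/dt channel

rng = np.random.default_rng(0)
f1 = (rng.random((40, 60)) * 255).astype(np.uint8)
f2 = np.roll(f1, 1, axis=1)             # toy frame shifted one pixel to the right
u, v = flow_channels(f1, f2)
print(u.shape, v.shape)                 # (40, 60) (40, 60)
```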

3.5 Proposed Convolution Operation

Several CNN architectures can be built from the 3D convolution technique. As described above, in the first layer we extract the gradient in X and Y, the grey channel, and the optical flow for all the frames of the input video. After this feature extraction phase, multiple channels of information are generated in the form of feature maps in the second layer. The feature maps now contain the gradient along both the horizontal and vertical directions, the optical flow field in both the X and Y directions, and the grey pixel values for all input frames. Next, we apply the convolution operation in both the spatial and temporal domains on all these extracted feature channels with a kernel size of 7 × 7 × 3 (7 × 7 spatial and 3 temporal). Two rounds of convolution are applied to every channel separately to obtain the maximum number of feature maps. Then 2 × 2 subsampling is applied to the 2 sets of feature maps obtained from the previous convolution; subsampling layers reduce only the spatial resolution while keeping the same number of feature maps. 3D convolution with a kernel size of 7 × 6 × 3 is then applied to these 2 sets of subsampled feature maps; here we apply 3 different kernel combinations to each of the 2 sets and obtain 6 sets of feature maps. Next, 3 × 3 subsampling is performed on all the resulting feature maps to further reduce the spatial dimension. Since the temporal dimension is now reduced, the final convolution is applied only in the spatial dimension with a kernel size of 7 × 4. As a result, the final feature maps of size 1 × 1 are obtained as a fully connected layer.

Python and TensorFlow are used to implement the proposed CNN on the UCF and KTH datasets. The detailed implementation is based on the 3D CNN proposed in [13]. In the convolution layers, 7 × 7 × 3 filters are used to extract subregions with the ReLU activation function, and subsampling is done in the pooling layers. After several rounds of convolution and pooling, we obtain a 128-dimensional fully connected feature vector from the 7 input frames. All parameters of this model are initialized randomly, and training is done using the backpropagation algorithm proposed in [23].
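A simplified Keras sketch of this deep branch is given below. It is not the exact implementation: the paper convolves each hardwired channel separately, whereas here the five channels (grey, gradient X/Y, optical flow X/Y) are stacked into one 5-channel volume for brevity, and the filter counts are assumptions; the kernel and pooling sizes follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3d_cnn(num_frames=7, height=60, width=40, hardwired_channels=5):
    """Simplified sketch of the 3D CNN branch (assumed filter counts)."""
    inp = layers.Input(shape=(num_frames, height, width, hardwired_channels))
    x = layers.Conv3D(16, kernel_size=(3, 7, 7), activation='relu')(inp)   # 7x7x3 kernel
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)                        # 2x2 subsampling
    x = layers.Conv3D(32, kernel_size=(3, 7, 6), activation='relu')(x)     # 7x6x3 kernel
    x = layers.MaxPooling3D(pool_size=(1, 3, 3))(x)                        # 3x3 subsampling
    x = layers.Conv3D(64, kernel_size=(3, 7, 4), activation='relu')(x)     # 7x4 kernel
    x = layers.Flatten()(x)
    fc = layers.Dense(128, activation='relu', name='cnn_fc')(x)            # 128-d FC vector
    return models.Model(inp, fc, name='deep_branch')

build_3d_cnn().summary()
```

With a 7 × 60 × 40 input, the spatial resolution collapses to 1 × 1 after the last convolution, matching the description above, and the 128-dimensional fully connected vector is what is later concatenated with the SIFT branch.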

Fig. 6. Proposed SIFT-CNN architecture: input video frames are processed by human detection, the cropped frames 1 to 7 are given to the convolution stream, handcrafted SIFT features are extracted from all frames and reduced with PCA, and the two fully connected vectors are concatenated for classification in the output layer into the action classes Boxing, Carrying, Clapping, Digging, Jogging, Open Trunk, Running, Throwing, Walking, and Waving.

3.6 SIFT and CNN Aggregator

It is well established that the human vision system is a multi-scale process [8]. It is therefore advisable to integrate multiple levels of motion features for robust classification of human action; the combination of invariant and more delicate features produces a better representation of motion. As shown in Figure 6, we extract two feature vectors (using SIFT and the CNN) that describe the spatial and temporal information of the input video. In the CNN architecture above, we use only 7 input frames for the convolution operation, which is not enough to encode high level temporal motion information from the complete action video; yet if we increase the number of input frames, handling all the trainable parameters of the network becomes very complex. To address this problem, we use the keypoint based SIFT descriptor in parallel with the CNN to capture the motion information of all the frames present in the action video. In this approach, SIFT features are treated as spatially distinctive interest points, and the displacement of these points is computed to capture the motion around them. In the CNN architecture above, there is a global average pooling before the fully connected layer. We pass the extracted SIFT features into a dense layer and concatenate the CNN fully connected layer with the SIFT feature vector. After the concatenation of the two feature vectors, one more fully connected layer followed by dropout and one more dense layer with a sigmoid activation function are added for final classification. To avoid overfitting of the network, all pooling and dense layers are followed by a dropout layer. All training was performed using the Caffe [43] framework with the additional SIFT feature vector at the classification layer. Dropout rates of 0.4 and 0.6 were tested, and 0.6 gave better results in the proposed combined framework. Momentum and weight decay were set to 0.8 and 0.0005 to reduce overfitting. We adopt the approach in [38] for the learning rate and for testing.
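A Keras sketch of this aggregation head is shown below, written independently of the deep branch sketch for brevity: the 128-dimensional CNN vector, the SIFT dimension of 64, and the hidden layer width are assumptions; the 0.6 dropout and 0.8 momentum follow the text, and the sigmoid output is kept as reported (softmax would be the usual choice for mutually exclusive action classes).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fusion_head(cnn_fc_dim=128, sift_dim=64, num_classes=10, dropout=0.6):
    """Sketch of the SIFT-CNN aggregator head (assumed dimensions)."""
    cnn_in = layers.Input(shape=(cnn_fc_dim,), name='cnn_fc')     # CNN fully connected vector
    sift_in = layers.Input(shape=(sift_dim,), name='sift_pca')    # PCA-reduced SIFT vector
    sift_fc = layers.Dense(128, activation='relu')(sift_in)       # dense layer on SIFT branch
    x = layers.Concatenate()([cnn_in, sift_fc])                   # concatenate the two branches
    x = layers.Dense(128, activation='relu')(x)                   # extra fully connected layer
    x = layers.Dropout(dropout)(x)
    out = layers.Dense(num_classes, activation='sigmoid')(x)      # classification layer
    return models.Model([cnn_in, sift_in], out)

model = build_fusion_head()
model.compile(optimizer=tf.keras.optimizers.SGD(momentum=0.8),    # momentum 0.8 as in the text
              loss='categorical_crossentropy', metrics=['accuracy'])
```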

4 Experiments

4.1 Action Recognition On UCF dataset

We use the UCF and KTH datasets to evaluate the proposed combined CNN model for human action recognition. The UCF human action dataset consists of 10 actions performed by 12 actors, recorded from 3 different angles; we use the aerial view recordings for our experiments. The 10 actions, shown in Figure 8, are Boxing, Clapping, Digging, Jogging, Carrying, Open-Close Trunk, Running, Throwing, Walking, and Waving. All aerial view recordings contain the human action along with multiple objects and different backgrounds. Therefore, we first use a human detector to produce bounding boxes from the 960 × 540 video frames, which are then reduced to 60 × 40 for the convolution operation, as shown in Figure 7. Human detection with bounding boxes from the original frames follows [50].

Fig. 7. Human detection with bounding boxes from original frames in the UCF dataset.

Fig. 8. Sample human actions from the UCF dataset: Boxing, Carrying, Clapping, Digging, Jogging, Open Trunk, Running, Throwing, Walking, and Waving.

Since we take 7 input frames for the convolution process, the detection and tracking step is applied to all 7 frames to obtain a cube containing only the human action without background. We extract the human bounding boxes with a step size of 2 from the original frames; for example, for frame number 0, bounding boxes are also extracted for frames -6, -4, -2, 2, 4, and 6, all at the same position. Finally, the action video of size 960 × 540 is converted into a 60 × 40 patch cube, which is the input to our model in one stream. Along with this convolution stream, we compute SIFT for the same 960 × 540 input in parallel with the CNN. The SIFT descriptor is extracted around every detected point in both the spatial and temporal dimensions for all the frames present in the input video clip, as explained in Section 3.2. Finally, PCA is applied to these feature vectors to reduce their dimension, and the result is concatenated with the last layer of the proposed CNN.

To evaluate the performance of the combined CNN framework shown in Figure 6, we also report the results of a normal 3D CNN, optical flow (OF) with SVM, and SIFT with SVM on the UCF dataset; these baselines are denoted OF-SVM and SIFT-SVM. In OF-SVM, optical flow descriptors are generated and k-means clustering is used with different numbers of clusters; the clusters are used for vector quantization to assign each optical flow descriptor to its nearest codeword, and a bag-of-words (BoW) vector is then constructed for each video. A BoW vector is essentially a histogram that counts how frequently each optical flow codeword appears in the video. A linear SVM is then trained on these vectors for the different human actions. Similarly, in SIFT-SVM, SIFT feature extraction, k-means clustering to build the codebook, construction of the bag-of-words vector, and training with the SVM are carried out step by step for every video of the training set.
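For the OF-SVM and SIFT-SVM baselines, the classification step on the bag-of-words histograms can be sketched as follows; the histograms are assumed to be built as in the Section 3.2 sketch, and the stand-in data, regularization constant, and iteration limit are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# X_train / X_test: per-video bag-of-words histograms (optical flow or SIFT codewords);
# y_train / y_test: action labels. Random stand-in data is used here.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((80, 200)), rng.integers(0, 10, 80)
X_test, y_test = rng.random((40, 200)), rng.integers(0, 10, 40)

clf = LinearSVC(C=1.0, max_iter=5000).fit(X_train, y_train)   # linear SVM on BoW vectors
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```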

We use 40 random samples of the UCF dataset and report the performance of the 4 models with the confusion matrices shown in Figure 9. The confusion matrices clearly show that most actions are correctly classified in Figure 9(a) compared with Figures 9(b), 9(c), and 9(d). We can also observe that our combined CNN outperforms the normal 3D CNN, OF-SVM, and SIFT-SVM. In a few cases the combined CNN and the general 3D CNN show the same results; otherwise, the proposed combined CNN performs better for all actions. Precision and recall are computed for all models over the 10 actions, and the comparative bar charts for all 4 models are shown in Figure 10. The bar charts and Table 2 clearly show that the combined CNN achieves better precision and recall than the others. The average precision and recall over the 10 human actions for all models are plotted in Figure 11.

We also measured the time complexity of the handcrafted SIFT feature extraction from the input video and observed that point detection and description are fast, and the time needed for histogram formation is also very small. The key point detection algorithm generates few feature points because it detects only the moving points in the 2D frames. This helps to extract meaningful information from the maximum number of frames present in the action video with reduced computational complexity. Table 1 compares state-of-the-art methods for human action recognition on the UCF dataset [38],[49],[52],[53],[19]; we achieve a comparatively better result by considering longer temporal information with lower computational complexity and time.

4.2 Action Recognition On KTH dataset

We also evaluate all 4 models on the KTH human action dataset, which consists of 6 actions performed by 25 different subjects. We use a cube of 9 frames as input after foreground extraction, at a resolution of 80 × 60 to keep the memory requirements low. We use the same 3D CNN architecture as in Figure 3; since the image resolution is 80 × 60 and 9 frames are used, the convolution layer kernel sizes are 9 × 7, 7 × 7, and 6 × 4, and the subsampling layer kernel size is 3 × 3. After multiple layers of convolution and subsampling we obtain a flattened fully connected feature vector, and the last layer contains 6 units. The same setting is used for the combined CNN shown in Figure 6, with the additional SIFT features in parallel with the CNN. 16 random subjects are selected for training and 9 for testing; the same training and testing split is used for all 4 methods. Our combined CNN achieves very good accuracy compared with the remaining three, as shown in Table 3.

Table 1. Comparison with state-of-the-art methods on the UCF dataset.

Model                                            Accuracy (%)
Two stream with extra data [38]                  86.9
Two stream, regularized fusion [49]              88.4
Two stream with extra data, SVM fusion [38]      88.0
CNN, optical flow, LSTM [52]                     88.6
CNN, IDT, FV [53]                                89.6
IDT, FV, temporal scale invariance [19]          89.1
Ours                                             89.5

Fig. 9. Confusion matrices for all 4 models: (a) combined SIFT-CNN, (b) CNN, (c) OF-SVM, (d) SIFT-SVM.

Fig. 10. Precision and recall bar charts for all 4 models: (a) combined SIFT-CNN, (b) CNN, (c) optical flow with SVM, (d) SIFT with SVM.

Fig. 11. Average precision and recall for the combined SIFT-CNN, CNN, OF-SVM, and SIFT-SVM models.

Table 2. Precision and recall for all 4 methods on the UCF dataset.

Method                 Metric     Boxing  Carrying  Clapping  Digging  Jogging  OpenTrunk  Running  Throwing  Walking  Waving  Avg
Combined SIFT-CNN      Precision  0.878   0.809     0.947     0.85     0.809    0.945      0.9      0.795     0.837    0.971   0.8741
                       Recall     0.9     0.85      0.9       0.85     0.85     0.875      0.9      0.875     0.9      0.85    0.875
CNN                    Precision  0.868   0.68      0.829     0.837    0.809    0.882      0.864    0.795     0.813    0.918   0.8295
                       Recall     0.825   0.8       0.85      0.775    0.85     0.75       0.8      0.875     0.875    0.85    0.825
Optical flow with SVM  Precision  0.681   0.651     0.72      0.812    0.736    0.794      0.727    0.707     0.731    0.775   0.7334
                       Recall     0.75    0.7       0.775     0.65     0.7      0.675      0.8      0.725     0.75     0.775   0.73
SIFT with SVM          Precision  0.645   0.627     0.697     0.714    0.736    0.764      0.8      0.7       0.81     0.69    0.7183
                       Recall     0.775   0.675     0.75      0.625    0.7      0.65       0.8      0.7       0.75     0.725   0.715

Table 3. Recognition accuracy (%) on the KTH dataset.

Method             Boxing  Hand Clapping  Hand Waving  Jogging  Running  Walking  Avg
Combined SIFT-CNN  91      92             95           84       82       96       90
CNN                87      90             93           86       74       84       85.6
OF-SVM             85      84             86           74       90       72       81.8
SIFT-SVM           86      79             84           71       84       70       79

5 Conclusion

In this paper, we proposed a combined SIFT and CNN model for human action recognition. The model extracts features in both the spatial and temporal dimensions, like a standard 3D convolution. Along with convolution and subsampling, it also extracts additional handcrafted features from the entire input video clip (all frames). Finally, the last fully connected layer of the CNN is concatenated with the additional handcrafted SIFT feature vector. This combination of feature vectors preserves both spatial and longer temporal information from the entire video clip without increasing the input window of the CNN, and it also boosts model performance. We evaluated the combined CNN on the UCF and KTH datasets and observed that the combination of handcrafted features with the 3D CNN outperforms the alternatives on both datasets with fixed training and testing splits. We compared our model with a normal 3D CNN, OF with SVM, and SIFT with SVM on a fixed number of random samples from both datasets, and a comparison with state-of-the-art methods on the UCF dataset was also presented. However, we simply combine the deep and handcrafted features at the classification layer of the SIFT-CNN framework; no score-level fusion or weighting is applied when combining the two types of features. This is a drawback of the proposed model, because score-level fusion with fixed weights may significantly increase the accuracy of the network. Nowadays, depth sensor RGB-D data is receiving considerable attention in human action recognition, and commercial depth sensors are readily available. We believe our proposed framework can improve action recognition performance by combining handcrafted depth (RGB-D) data with static images, and we will explore the same combination strategy with RGB-D data in future work.

Declaration:

Author Contributions: All authors have equally contributed, and all authors have read and agreed to the published version of the manuscript.

Funding: This research received no funding from any organization or individual.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data availability Statement: Data sharing not applicable.

Conflict of Interest: The authors declare no conflict of interest.


References

1. Algamdi, A.M., Sanchez, V., Li, C.T.: Learning temporal information from spatial information using capsnets for human action recognition. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3867–3871. IEEE (2019)
2. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis & Machine Intelligence (3), 257–267 (2001)
3. Chakraborty, B., Holte, M.B., Moeslund, T.B., Gonzalez, J.: Selective spatio-temporal interest points. Computer Vision and Image Understanding 116(3), 396–410 (2012)
4. Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: PoTion: Pose motion representation for action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
5. Dawn, D.D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. The Visual Computer 32(3), 289–306 (2016)
6. Deng, L., Wang, Y., Liu, B., Liu, W., Qi, Y.: Biological modeling of human visual system for object recognition using GLoP filters and sparse coding on multi-manifolds. Machine Vision and Applications 29(6), 965–977 (2018)
7. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 677–691 (Apr 2017)
8. Donoho, D.L., Huo, X.: Beamlets and multiscale image analysis. In: Multiscale and Multiresolution Methods, pp. 149–196. Springer (2002)
9. Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal multiplier networks for video action recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (Jul 2017)
10. Gaidon, A., Harchaoui, Z., Schmid, C.: Activity representation with motion hierarchies. International Journal of Computer Vision 107(3), 219–238 (2014)
11. Garain, J., Mishra, S.R., Kumar, R.K., Kisku, D.R., Sanyal, G.: Bezier cohort fusion in doubling states for human identity recognition with multifaceted constrained faces. Arabian Journal for Science and Engineering 44(4), 3271–3287 (2019)
12. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: International Conference on Computer Vision (ICCV) (2007)
13. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221–231 (Jan 2013)
14. Junejo, I.N., Dexter, E., Laptev, I., Perez, P.: View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), 172–185 (Jan 2011)
15. Kawai, Y., Takahashi, M., Fujii, M., Naemura, M., Satoh, S.: NHK STRL at TRECVID 2010: Semantic indexing and surveillance event detection. In: TRECVID (2010)
16. Khaire, P., Kumar, P., Imran, J.: Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recognition Letters 115, 107–116 (2018)
17. Klaser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients (2008)

18. Kong, Y., Tao, Z., Fu, Y.: Deep sequential context networks for action prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (Jul 2017)
19. Lan, Z., Lin, M., Li, X., Hauptmann, A.G., Raj, B.: Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 204–212 (2015)
20. Laptev, Lindeberg: Space-time interest points. In: Proceedings Ninth IEEE International Conference on Computer Vision. IEEE (2003)
21. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2-3), 107–123 (2005)
22. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR 2011. IEEE (Jun 2011)
23. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
24. Li, H., Li, H., Wei, Y., Tang, Y., Wang, Q.: Sparse-based neural response for image classification. Neurocomputing 144, 198–207 (2014)
25. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
26. Ma, M., Marturi, N., Li, Y., Leonardis, A., Stolkin, R.: Region-sequence based six-stream CNN features for general and fine-grained human action recognition in videos. Pattern Recognition 76, 506–521 (2018)
27. Mishra, S.R., Mishra, T.K., Sanyal, G., Sarkar, A., Satapathy, S.C.: Real time human action recognition using triggered frame extraction and a typical CNN heuristic. Pattern Recognition Letters 135, 329–336 (2020)
28. Nazir, S., Yousaf, M.H., Velastin, S.A.: Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Computers & Electrical Engineering 72, 660–669 (2018)
29. Nunez, J.C., Cabido, R., Pantrigo, J.J., Montemayor, A.S., Velez, J.F.: Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76, 80–94 (2018)
30. Nguyen, D., Kim, K., Hong, H., Koo, J., Kim, M., Park, K.: Gender recognition from human-body images using visible-light and thermal camera videos based on a convolutional neural network for image feature extraction. Sensors 17(3), 637 (2017)
31. Nguyen, T.V., Song, Z., Yan, S.: STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology 25(1), 77–86 (2014)
32. Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., Barbano, P.: Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing 14(9), 1360–1371 (Sep 2005)
33. Patel, C.I., Garg, S., Zaveri, T., Banerjee, A., Patel, R.: Human action recognition using fusion of features for unconstrained video sequences. Computers & Electrical Engineering 70, 284–301 (2018)
34. Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding 150, 109–125 (2016)
35. Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked Fisher vectors. In: European Conference on Computer Vision. pp. 581–595. Springer (2014)
36. Ramezani, M., Yaghmaee, F.: A review on human action analysis in videos for retrieval applications. Artificial Intelligence Review 46(4), 485–514 (Mar 2016)

37. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. pp. 568–576. MIT Press (2014)
38. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems. pp. 568–576 (2014)
39. Tomasi, C., Kanade, T.: Detection and tracking of point features. Tech. Rep. CMU-CS-91-132, Carnegie Mellon University (1991)
40. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV). IEEE (Dec 2015)
41. Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)
42. Tu, Z., Xie, W., Qin, Q., Poppe, R., Veltkamp, R.C., Li, B., Yuan, J.: Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognition 79, 32–43 (2018)
43. Vedaldi, A., Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
44. Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for action recognition. International Journal of Computer Vision 119(3), 219–238 (2016)
45. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3551–3558 (2013)
46. Wang, L., Koniusz, P., Huynh, D.Q.: Hallucinating bag-of-words and Fisher vector IDT terms for CNN-based action recognition. arXiv preprint arXiv:1906.05910 (2019)
47. Wang, Y., Mori, G.: Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(7), 1310–1323 (Jul 2011)
48. Zhu, W., Lan, C., Xing, J., Zeng, W., Li, Y., Shen, L., Xie, X.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
49. Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the 23rd ACM International Conference on Multimedia. pp. 461–470. ACM (2015)
50. Yang, M., Lv, F., Xu, W., Gong, Y.: Detection driven adaptive multi-cue integration for multiple human tracking. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE (Sep 2009)
51. Yu, K., Xu, W., Gong, Y.: Deep learning with kernel regularization for visual recognition. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21, pp. 1889–1896. Curran Associates, Inc. (2009)
52. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4694–4702 (2015)
53. Zha, S., Luisier, F., Andrews, W., Srivastava, N., Salakhutdinov, R.: Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015)

54. Zhang, Z., Hu, Y., Chan, S., Chia, L.T.: Motion context: A new representation for human action recognition. In: European Conference on Computer Vision. pp. 817–829. Springer (2008)
55. Zhu, W., Huang, Y., Zeng, L., Chen, X., Liu, Y., Qian, Z., Du, N., Fan, W., Xie, X.: AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Medical Physics 46(2), 576–589 (2019)