human pose simulation and detection in real time using video … · 2020. 6. 16. · prasanth kumar...


SN Computer Science (2020) 1:148 https://doi.org/10.1007/s42979-020-00153-8

SN Computer Science

ORIGINAL RESEARCH

Human Pose Simulation and Detection in Real Time Using Video Streaming Data

Prasanth Kumar Ponnarassery1 · Gopichand Agnihotram1 · Pandurang Naik1

Received: 3 April 2020 / Accepted: 4 April 2020 / Published online: 4 May 2020 © Springer Nature Singapore Pte Ltd 2020

Abstract This paper proposes a simulation approach to detect different human poses in real time from streaming data. Real-time pose detection is critical for many applications in different domains. The available literature relies on models trained on huge amounts of data, and the methods use 3D cameras for accurate predictions. These methods also require large computational effort and GPU machines to obtain different human poses in real time. They need a high frame rate and use previous frames as well to predict human poses; if the frame rate is low, they fail to predict the poses accurately and efficiently. The proposed mechanism simulates different human poses, generates feature descriptors for each pose, and trains a model using a simple classifier. The trained model detects human poses in real time on video streaming data. The poses are predicted accurately at a low frame rate using simple 2D cameras, with reduced processing time and less computational effort. The proposed solution will be used to predict candidate poses or gestures in a virtual interview application.

Keywords Key point generation · Nodal points generation · Feature descriptors · Training model · Feedforward network · Pose detection model · Simulation · Classification

Introduction

Human pose detection is an important task for many computer vision applications such as human motion capture for human–computer interfaces, physiotherapy, ergonomics studies, robotic processes, and visual surveillance for finding anomalous human behavior. The present pose detection solution will be used as part of human–machine interaction to derive candidate gestures or poses in a virtual interview scenario, where the candidate attends an interview with a virtual system. The candidate's gestures are very important for effective communication between the machine and the candidate. This paper describes a simulation method to detect human poses: the nodal points (key points) of the human skeleton are simulated for each pose, and feature descriptors are computed from the extracted nodal points. These feature descriptors are used to train a simple classifier, for example a feedforward network, on the different poses. The trained model is then used for real-time detection of poses in video streaming data. The entire solution mechanism is explained in the following sections. In the solution, the nodal points of the human skeleton are extracted with a pretrained model based on the Deeper-Cut algorithm. The Deeper-Cut algorithm [9, 11] is based on the ResNet architecture, where the model is trained with the large MPII human pose data set [13] and the COCO data set [12] to extract nodal points or key points, which are the human joints that form the human skeleton structure. Further, these nodal points are generated for different orientations to compute the feature descriptors corresponding to each pose. The poses include a person standing, sitting, hand up, hand down, thumbs-up, fist, victory, head left, head right, head down, etc.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

* Gopichand Agnihotram [email protected]

Prasanth Kumar Ponnarassery [email protected]

Pandurang Naik [email protected]

1 Wipro Technology Limited, Wipro CTO Office, Wipro, Bangalore 560100, India

While detecting human poses, the intention is to avoid an expensive depth (3D) camera, which may provide very accurate results with less computation time but at a high cost. The approach proposed in this paper can provide accurate results on par with a depth camera by using only a normal web camera or laptop camera. Further, the system can give good results even with a simple training model, for example a feedforward network trained on the feature descriptors, which is used to detect the pose in real time in video streaming data.

A few papers [2, 14] use a segmented approach to detect the pose of a human body, where training is performed on each segment to obtain the pose of the human. In this paper, the different human poses are instead simulated as feature descriptor data. A model trained on the simulated data predicts the different human poses in real-time video streaming data.

The rest of the paper is organized as follows. “Related Work” reviews the literature on human pose detection. “Solution Approach” presents the detailed solution approach: the simulation of different human poses through the corresponding nodal points and human skeleton, the computation of feature descriptors from the nodal points for different human orientations, and the training of models on the feature descriptor data. “Application of Human Pose Detection in Real Time” deals with the virtual interview system application. “Conclusions and Future Work” summarizes the research findings and future enhancements, followed by the References.

Related Work

This section discusses the recent works carried out on human pose/gesture detection and hand gesture recognition using computer vision algorithms.

Xu [1] designed a real-time human–computer interaction system based on hand gestures. The whole system consists of three components: hand detection, gesture recognition, and human–computer interaction. The author observed that robust control of mouse and keyboard events plays a vital role in achieving a higher accuracy of gesture recognition. The author used convolutional neural networks to recognize these gestures with cheap monocular cameras. A Kalman filter is introduced to estimate the hand position, and the author observed that the mouse cursor is then controlled in a stable and smooth way. The developed system is highly extendable and can be used in human–robot or other human–machine interaction scenarios with more complex command formats than just mouse and keyboard events.

Song et al. [2] and Zimmermann and Brox [10] presented approaches for gesture recognition that track the body and hands simultaneously and recognize gestures continuously from an unsegmented and unbounded input stream. The system estimates the 3D coordinates of upper body joints and classifies the appearance of the hand into a set of canonical shapes. A multilayered filtering technique with a temporal sliding window is developed to enable online sequence labeling and segmentation. The authors also worked on multimodal gesture recognition and deep hierarchical sequence representation learning.

Bulat and Tzimiropoulos [3] used convolutional neural networks for human pose estimation. The authors' main contribution is a cascaded CNN architecture specifically designed for learning relationships and spatial context and for inferring the pose robustly even in the case of severe part occlusions. The authors proposed detection followed by regression using the CNN cascade. The first part of the cascade outputs part detection heatmaps, and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: it guides the network where to focus in the image and effectively encodes part constraints and context.

Tompson et al. [4] proposed a new hybrid architecture that consists of a deep convolutional network and a Markov random field to compute human poses. The authors discussed how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. The authors showed that the joint training of these two models improves the performance significantly.

Carfi et al. [5] describe online human gesture recognition using recurrent neural networks and wearable sensors. The authors argue that gestures are fundamental for robots that aim to interact naturally with humans. Wearable sensors are promising for monitoring human activity; in particular, the use of triaxial accelerometers for gesture recognition has been explored. Despite this, the state of the art lacks systems for reliable online gesture recognition using accelerometer data. The authors propose SLOTH, an architecture for online gesture recognition based on a wearable triaxial accelerometer and a recurrent neural network probabilistic classifier for continuous gesture detection.

Toshev and Szegedy [6] proposed a method for human pose estimation based on deep neural networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. The authors present a cascade of such DNN regressors, which results in high-performance pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple yet powerful formulation that capitalizes on recent advances in deep learning.

Oberweger et al. [7] evaluated several architectures for convolutional neural networks to predict the 3D joint locations of a hand given a depth map. The authors first showed that a prior on the 3D pose can be easily introduced and significantly improves the accuracy and reliability of the predictions. The authors also showed how to use the context efficiently to deal with ambiguities between fingers.

Ouyang et al. [8] considered different information sources for human pose estimation, such as visual appearance score, appearance mixture type, and deformation. The authors proposed a multi-source deep learning model to extract nonlinear representations from these different information sources. With the deep learning model, high-order human body patterns are extracted from these sources for pose estimation. The task of estimating body locations and the task of human detection are jointly learned using a unified deep model.

Solution Approach

The proposed approach uses the Deeper-Cut algorithm [9, 11] to derive the nodal points (key points) of a human, which define the human skeleton. The proposed solution uses these nodal points to compute pose feature descriptors for each human pose. Further, these nodal points are simulated for different orientations of the human to compute the feature descriptors corresponding to each pose. The feature descriptors are computed using cosine distances derived from the nodal points, as detailed in “Feature Descriptor Computation”. In a similar way, the data for each pose (such as sitting, standing, hand up, hand down, etc.) are simulated and used to train a feedforward neural network or any other classifier. After successful training on the required poses, the generated model can be used to predict the human pose in real time in video streaming data. The solution can be deployed as a platform to capture different human poses for effective communication with a virtual system, robotic control, detection of anomalous human behavior, and threat identification in real time.

The proposed solution has three stages: a simulation stage, a training stage, and a detection stage. The simulation stage takes the nodal points generated by the Deeper-Cut algorithm as input and simulates these nodal points for different poses; feature descriptors are computed for each pose. In the training stage, a feedforward neural network is trained on the simulated feature descriptors of each pose. Finally, in the detection stage, the human pose is identified using the trained model on real-time video streaming data. The stages of pose detection are shown in Fig. 1, and each stage is described in detail below.

Simulation of Different Poses

In this stage, the Deeper-Cut algorithm gives the initial set of nodal points of the human skeleton for the reference positions. Appropriate 3D transformations are applied to these nodal points to simulate the nodal points corresponding to different human poses. Feature descriptors are calculated for each set of transformed nodal points. Figure 2 shows the overall simulation process.

The system uses the initial human skeleton to derive different human skeleton poses. Each nodal point is free to move with respect to the other nodal points. The Deeper-Cut algorithm, which is used to obtain the initial human nodal points, uses residual neural networks for part detection. The method uses the MPII human pose [13] and COCO [12] data sets for training the model. The best-performing body part detection model, with 152 layers, is used. An extra layer is added to predict the relative position of the body parts, and clustering techniques are used to label the body parts. Finally, nodal points are generated for each body part. The nodal point generation and skeleton formation are shown in Fig. 3. The node numbering used for descriptor calculation is given in Fig. 4. If the frame contains multiple humans, the algorithm generates key points for each person. The method provides more than 90% accuracy on body parts such as head, shoulder, and hip, and more than 80% accuracy on elbow, knee, and ankle.

Fig. 1 Stages for pose detection in real time (blocks: Simulation of Different Poses; Training the Feature Descriptors; Real-Time Detection of Poses)

Fig. 2 Human pose simulation using nodal points (blocks: Human Skeleton with Nodal Points; Pose-wise Nodal Points Simulation; Feature Descriptor Computation; Model Simulation for each of the Poses)

The key points of multiple humans are detected using the Deeper-Cut algorithm, as shown in Fig. 5. Here, the key points are drawn as dots, circles, and thick circles to differentiate the persons. For the face, 128 facial key points are derived from neural networks; the key points and their numbering are shown in Figs. 6 and 7, respectively. For the palm, the key points and their numbering are shown in Fig. 8.

A pose is defined when the nodal points that constitute it appear in a particular arrangement on the two-dimensional (2D) image or a 2D plane. There can also be a special tolerance for each nodal point, or a range within which each nodal point can move in the 2D space while the same pose label is maintained. In some situations, it could be difficult to collect all possible orientations of the nodal points corresponding to each pose label. In this paper, this problem is mitigated by the key point generation approach, which generates the nodal points precisely for each pose and thereby improves the prediction accuracy.
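The tolerance idea can be illustrated with a minimal sketch, assuming a per-axis tolerance around each nodal point and uniform sampling inside it. The function name `jitter_nodal_points`, the coordinates, and the sampling scheme are hypothetical and are not taken from the paper; they only show how several arrangements can share one pose label.

```python
import numpy as np

def jitter_nodal_points(points_2d, tolerance, n_samples, seed=0):
    """Sample variants of a pose by moving each nodal point within its tolerance.

    points_2d: (N, 2) reference nodal points of one pose.
    tolerance: scalar or length-N sequence, maximum offset per axis (assumed).
    Returns an array of shape (n_samples, N, 2).
    """
    rng = np.random.default_rng(seed)
    points_2d = np.asarray(points_2d, dtype=float)
    # Broadcast a scalar tolerance to one value per nodal point.
    tol = np.broadcast_to(np.asarray(tolerance, dtype=float), points_2d.shape[:1])
    offsets = rng.uniform(-1.0, 1.0, size=(n_samples,) + points_2d.shape)
    return points_2d + offsets * tol[None, :, None]

# Hypothetical 2D nodal points for one pose; all variants keep the same label.
head_front = np.array([[0.0, 0.0], [0.0, 1.0], [-0.3, 1.2], [0.3, 1.2]])
variants = jitter_nodal_points(head_front, tolerance=0.05, n_samples=100)
```

Each row of `variants` could then be turned into a feature descriptor and stored with the label of the reference pose.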

Further, the key point generation is achieved using a rendering platform (e.g., an OpenGL-based rendering platform) in which the initial nodal points are loaded and transformed in three dimensions (3D) to recreate different poses. The initial nodal points can be obtained from the Deeper-Cut algorithm or created manually so that they closely resemble the normal human structure. The transformation is achieved by multiplying the nodal points by an appropriate transformation matrix, which happens inside the OpenGL pipeline. In this way, predefined and very realistic pose simulations can be generated. The projections of the transformed points (i.e., only the “x” and “y” coordinates) are used to calculate the feature descriptors, as described in the next subsection. All feature descriptors for each possible pose are saved, along with the pose label, into a file, which can be XML or any other format. This file is used in the training phase to train on the different human poses and create a trained model.
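The transformation and projection step can be sketched as follows. This is only an illustration under stated assumptions: NumPy stands in for the OpenGL rendering pipeline, the 3D coordinates are invented, and the mapping of rotation sign to the labels "head_left" and "head_right" is a hypothetical convention.

```python
import numpy as np

def rotation_y(angle_deg):
    """3x3 rotation matrix about the vertical (y) axis."""
    t = np.radians(angle_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def simulate_pose(points_3d, pivot, angle_deg):
    """Rotate nodal points about a pivot, then keep only the x, y projection."""
    rotated = (points_3d - pivot) @ rotation_y(angle_deg).T + pivot
    return rotated[:, :2]

# Hypothetical reference head points (x, y, z): neck, nose, and two eyes.
head = np.array([[0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.5],
                 [-0.3, 1.2, 0.4],
                 [0.3, 1.2, 0.4]])
neck = head[0]

# Sweep several angles so that each label covers a range of orientations.
samples = [("head_left", simulate_pose(head, neck, a)) for a in range(20, 61, 10)]
samples += [("head_right", simulate_pose(head, neck, -a)) for a in range(20, 61, 10)]
```

Rotating about a pivot such as the neck keeps the rest of the skeleton fixed, which matches the idea that each nodal point is free to move with respect to the others.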

Feature Descriptor Computation

The system generates the simulated nodal points for each pose as explained in the above section. This section describes the computation of the feature descriptors for each pose; the computed feature descriptors are used for training the pose model. The descriptors are computed using cosine similarity, or any other similarity measure, between all possible pairs of vectors defined on the nodal points. The steps involved in computing the feature vectors are explained in detail below.

Fig. 3 Key points and skeleton generation

Fig. 4 Node numbering for the skeleton

Fig. 5 Differentiating the multiple human key points

1. Forming the feature vector.

a. A vector is defined by connecting two of the nodal points. This step forms a suitable subset of vectors from the skeleton (for example, the human skeleton) that is sufficient to represent a particular human pose.

b. For example, for the detection of head-front, head-left, and head-right, only the nodal points “2”, “1”, “5”, “0”, “14”, “15”, “16”, and “17” of the skeleton in Fig. 4 are used. The vectors are defined so that they clearly reflect the spatial difference when the nodal points change. For example:

• Vec1 = Node 2 − Node 1
• Vec2 = Node 5 − Node 1
• Vec3 = Node 0 − Node 1
• Vec4 = Node 14 − Node 0
• Vec5 = Node 15 − Node 0
• Vec6 = Node 16 − Node 14
• Vec7 = Node 17 − Node 15

c. Hence, a subset of n nodal points yields (n − 1) vectors.

d. Figures 9, 10, and 11 represent the vectors for head front, head left, and head right.

2. Computing the cosine-distance from the vector pairs.

a. Here, the cosine distance between all possible pairs of 2D vectors is determined. If A and B are two vectors, the cosine distance between them is defined in (1):

$$D_c(A, B) = 1 - S_c(A, B), \tag{1}$$

where $D_c$ is the cosine distance and $S_c$ is the cosine similarity. The cosine similarity is given in Eq. (2):

$$S_c(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=0}^{n} A_i B_i}{\sqrt{\sum_{i=0}^{n} A_i^2}\,\sqrt{\sum_{i=0}^{n} B_i^2}}. \tag{2}$$

Fig. 6 128 facial key points

Fig. 7 Numbered facial key points

Fig. 8 Node numbering for the palm

3. Forming the feature descriptor from the cosine-distance values.

The cosine distance values calculated in step 2 are placed in a predefined sequence to form the feature descriptor. Thus, if there are n vectors, the descriptor contains $n^2$ cosine distance values. Any predefined sequence can be used for placing the cosine distance values in the descriptor, provided that the same sequence is followed in training and in detection. The feature descriptor V given in (3) lists the values $V_{ij}$ in such a sequence, where $V_{11} = V_{22} = \ldots = V_{nn} = 0$ and $V_{ij}$ is the cosine distance between vectors i and j, for i = 1, 2, …, n and j = 1, 2, …, n. The computed feature descriptor can then be used for training and detection with a feedforward neural network.

Finally, the feature descriptors of all generated poses and their labels are stored in an XML file or any other file format. This file is used as a data set for training on the poses (for example, human standing, human sitting, human turn left, human turn right, human turn up, and human turn down). The details of the training model are explained in the next section.
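Steps 1 to 3 can be sketched in a few lines, assuming the seven head vectors of step 1b and a row-by-row ordering of the $n^2$ values. The node coordinates, the function names, and the handling of zero-length vectors are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np

# Vector definitions from step 1b: (p, q) means Node p - Node q.
VECTOR_PAIRS = [(2, 1), (5, 1), (0, 1), (14, 0), (15, 0), (16, 14), (17, 15)]

def cosine_distance(a, b):
    """Eq. (1): D_c = 1 - S_c, with S_c the cosine similarity of Eq. (2)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # assumption: treat a zero-length vector as orthogonal
    return 1.0 - float(np.dot(a, b)) / denom

def feature_descriptor(nodes):
    """nodes: dict {node id: (x, y)}. Returns the n*n values of Eq. (3), row by row."""
    vecs = [np.subtract(nodes[p], nodes[q]) for p, q in VECTOR_PAIRS]
    n = len(vecs)
    desc = np.zeros((n, n))  # diagonal stays 0, i.e. V_ii = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                desc[i, j] = cosine_distance(vecs[i], vecs[j])
    return desc.ravel()

# Hypothetical 2D nodal points for a head-front pose (node ids as in Fig. 4).
nodes = {0: (0.0, 2.0), 1: (0.0, 0.0), 2: (-1.0, 0.0), 5: (1.0, 0.0),
         14: (-0.3, 2.3), 15: (0.3, 2.3), 16: (-0.7, 2.1), 17: (0.7, 2.1)}
descriptor = feature_descriptor(nodes)  # 7 vectors give 49 values
```

Because the cosine distance depends only on the angle between two vectors, a descriptor built this way does not change when the whole skeleton is uniformly scaled or translated in the image.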

$$V = [V_{11}, \ldots, V_{1n}, V_{21}, \ldots, V_{2n}, \ldots, V_{n1}, V_{n2}, \ldots, V_{nn}]. \tag{3}$$

Fig. 9 Face front

Fig. 10 Face left

Fig. 11 Face right

Training the Feature Descriptors

The feature descriptors of all generated poses are computed together with their labels, and a feedforward neural network or any other classifier is trained to classify the different poses. Here, training is described for a feedforward neural network. The mathematical representation of the labels or classes is given in (4).

$$C = (C_1, C_2, \ldots, C_n), \tag{4}$$

where $C_i$ is the ith class; examples of classes are sitting, standing, head left, and head right.

The feature descriptors for each class are given in (5). For the ith class $C_i$, the feature descriptors are given as

$$FD_i = [f_{i1}, f_{i2}, f_{i3}, \ldots, f_{in}], \quad \text{for } i = 1, 2, 3, \ldots, n. \tag{5}$$

Each feature descriptor is a vector of floating-point numbers; an example of one feature descriptor is given in (6):

$$f_{i1} = [-2.3, -5.6, 7.8, 6.9, 3.45, -8.98, 2.567, 0.567, 1.456, -7.89, \ldots, 4.56]. \tag{6}$$

Hence, the pairs of class labels and feature descriptors are given in (7); these data are fed into the feedforward neural network for training, as explained briefly below.

$$(C, FD) = [(C_1, FD_1), (C_2, FD_2), \ldots, (C_n, FD_n)]. \tag{7}$$

Here, the C's represent the classes and the FD's represent the feature descriptors.

The backpropagation algorithm is used for learning on a multilayer feedforward neural network. The method trains on the feature descriptors and the corresponding class labels, and it iteratively learns a set of weights for predicting the class label. A multilayer feedforward neural network consists of an input layer, one or more hidden layers, and an output layer. The multilayer feedforward network is shown in Fig. 12.

Each layer is made up of units. The inputs to the network correspond to the attributes measured by the feature descriptors for each training tuple. These inputs are fed simultaneously into the units making up the input layer. They pass through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units, known as a hidden layer. The outputs of the hidden layer units can be input to another hidden layer, and so on. The number of hidden layers is chosen by the user; in general, there are one or more hidden layers. The weighted outputs of the last hidden layer are input to the units making up the output layer, which gives the class label for the given input data. The real-time detection of human poses in video streaming data is described in the next section.
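A minimal sketch of such a network, with one hidden layer trained by backpropagation, is shown below. The layer size, learning rate, tanh activation, softmax output, and the synthetic two-class data are all assumptions chosen for illustration; the paper does not state its network configuration.

```python
import numpy as np

def train_mlp(X, y, n_classes, hidden=16, lr=0.05, epochs=1000, seed=0):
    """Train a one-hidden-layer feedforward network with backpropagation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot class labels
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                   # hidden layer outputs
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)          # softmax class probabilities
        d_logits = (P - Y) / len(X)                # cross-entropy gradient
        d_H = (d_logits @ W2.T) * (1.0 - H ** 2)   # backpropagate through tanh
        W2 -= lr * (H.T @ d_logits)
        b2 -= lr * d_logits.sum(axis=0)
        W1 -= lr * (X.T @ d_H)
        b1 -= lr * d_H.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Return the index of the most probable class for each descriptor."""
    W1, b1, W2, b2 = params
    return np.argmax(np.tanh(X @ W1 + b1) @ W2 + b2, axis=1)

# Synthetic stand-in for (C, FD): two classes of 49-value descriptors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.5, 0.1, (50, 49)), rng.normal(1.5, 0.1, (50, 49))])
y = np.array([0] * 50 + [1] * 50)
params = train_mlp(X, y, n_classes=2)
accuracy = float(np.mean(predict(params, X) == y))
```

In practice the rows of `X` would be the simulated feature descriptors and `y` the indices of their pose labels, read from the file produced in the simulation stage.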

Real‑Time Detection of Poses

This section describes the real-time detection of poses in video streaming data using the feature descriptors and trained models explained above. The steps for real-time pose detection are shown in Fig. 13.

The live streaming images/video frames captured from an RGB camera are fed to the Deeper-Cut algorithm to generate the nodal points of the person in the frame. Feature descriptors are calculated from these nodal points. Using the calculated feature descriptor and the previously trained model described in “Training the Feature Descriptors”, the posture of the person in the frame can be detected in real time.
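The detection stage can be outlined as a per-frame loop. The sketch below is an assumption about how the pieces fit together, not the authors' implementation: `extract_nodal_points`, `compute_descriptor`, and `classify` are hypothetical callables standing in for the Deeper-Cut model, the descriptor computation, and the trained classifier, and the stand-in components exist only to make the example runnable.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Sequence

def detect_poses(frames: Iterable[Any],
                 extract_nodal_points: Callable[[Any], Optional[Any]],
                 compute_descriptor: Callable[[Any], Any],
                 classify: Callable[[Any], int],
                 labels: Sequence[str]) -> Iterator[Optional[str]]:
    """Yield one pose label per frame, or None when no person is found."""
    for frame in frames:
        nodes = extract_nodal_points(frame)   # nodal points of the person
        if nodes is None:
            yield None
            continue
        descriptor = compute_descriptor(nodes)
        yield labels[classify(descriptor)]

# Stand-in components, for illustration only.
def fake_extractor(frame):
    return frame.get("nodes")

def fake_descriptor(nodes):
    return [sum(nodes)]

def fake_classifier(descriptor):
    return 0 if descriptor[0] < 1.0 else 1

frames = [{"nodes": [0.1, 0.2]}, {}, {"nodes": [0.9, 0.8]}]
results = list(detect_poses(frames, fake_extractor, fake_descriptor,
                            fake_classifier, ["sitting", "standing"]))
```

In this sketch each frame is classified on its own, without reference to earlier frames, which is one way to keep the detection usable at a low frame rate.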

Application of Human Pose Detection in Real Time

The solution is deployed in a virtual interview system in which the candidate is interviewed by a virtual avatar that asks the candidate a few technical questions. The candidate answers these questions to the avatar; some are multiple-choice questions, and others require descriptive answers. The virtual system continuously observes the candidate's poses through a normal RGB camera. Examples of observed poses are whether the candidate is sitting or standing, where the candidate is looking, and whether multiple persons appear in front of the camera. In these scenarios, the avatar communicates with the candidate, for example asking the candidate to sit down or stating that multiple persons should not be present in the interview. The avatar also observes whether the candidate is looking front, up, left, or right, and communicates this to the candidate. Sometimes the candidate also answers the avatar's questions with hand gestures, for example showing the answer number 2 or 3. Since the system is trained on those gestures, the avatar can understand them and the communication proceeds accordingly. In some cases, the candidate may also use victory, fist, or thumbs-up gestures while communicating with the virtual avatar. The solution can handle these poses and gestures for effective communication with the virtual system, where user gestures are important alongside voice. The solution is deployed on an Amazon Web Services (AWS) cloud server, and the streaming application runs on the client side (personal laptop, desktop, or mobile). The pose events are computed on AWS and displayed on the client side. The solution achieved more than 90% accuracy in real time for the computed poses, such as human standing, sitting, multiple person count, human looking up, down, left, and right, and hand gestures such as fist, victory, and thumbs-up. The same solution can also be used in robotic control systems, AR applications, VR applications, surveillance for detecting anomalous persons, sign language interpretation, industrial automation, touchless smart phones, assistive technology, and car safety systems.

Fig. 12 Feedforward neural network with multilayers

Fig. 13 Real-time pose detection for video streaming data (blocks: Live Streaming Images/Video Frames; Nodal Points Generation; Feature Descriptor Computation; Training Model; Real-Time Pose Detection)

Conclusions and Future Work

This paper proposed a solution for human pose detection using a simple 2D RGB camera. The method uses a nodal point simulation approach to generate the feature descriptors and a simple feedforward neural network to train the model and obtain the pose in real time from video streaming data. The proposed solution achieves more than 90% accuracy for the different human poses considered, on par with the use of a depth camera, which is very expensive. Thus, with simple 2D cameras, human poses can be detected in real time with more than 90% accuracy in a cost-effective way.

In the future, this solution will be extended to detect the poses of nonliving objects as well. Work is currently under way on different applications, such as identifying the rotation or orientation of a wind wheel and identifying railway signals. Work is also in progress on sign language interpretation and touchless smart phone applications.

Funding This study was funded by Wipro Technology Limited.

Compliance and Ethical Standards

Conflict of Interest The authors declare that they have no conflict of interest.

References

1. Xu P. A real-time hand gesture recognition and human-computer interaction system. arXiv preprint arXiv:1704.07296. 2017.

2. Song Y, Demirdjian D, Davis R. Continuous body and hand gesture recognition for natural human-computer interaction. ACM Trans Interact Intell Syst (TiiS). 2012;2(1):5.

3. Bulat A, Tzimiropoulos G. Human pose estimation via convolutional part heatmap regression. In: European Conference on Computer Vision. Cham: Springer; 2016.

4. Tompson JJ, Jain A, LeCun Y, Bregler C. Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in neural information processing systems; 2014. p. 1799–1807.

5. Carfi A, Motolese C, Bruno B, Mastrogiovanni F. Online human gesture recognition using recurrent neural networks and wearable sensors. In: 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), IEEE; 2018. p. 188–195.

6. Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH, USA; 2014. p. 24–27.

7. Oberweger M, Wohlhart P, Lepetit V. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807. 2015 Feb 24.

8. Ouyang W, Chu X, Wang X. Multi-source deep learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 2329–2336.

9. Rajchl M, et al. Deepcut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Trans Med Imaging. 2017;36(2):674–83.

10. Zimmermann C, Brox T. Learning to estimate 3d hand pose from single rgb images. In: Proceedings of the IEEE International Conference on Computer Vision; 2017. p. 4903–4911.

11. Wu Z, Shen C, Van Den Hengel A. Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recogn. 2019;90:119–33.

12. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Lawrence Zitnick C. Microsoft coco: Common objects in context. In: European conference on computer vision; Cham: Springer; 2014. p. 740–755.

13. Andriluka M, Pishchulin L, Gehler P, Schiele B. 2d human pose estimation: New benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 3686–3693.

14. Erol A, Bebis G, Nicolescu M, Boyle RD, Twombly X. Vision-based hand pose estimation: a review. Comput Vis Image Underst. 2007;108(1–2):52–73.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

