pose comparison for correct yoga posture measurement

Pose Comparison for Correct Yoga Posture Measurement

Albert Suryanto 1002391
Jason Swee 1002009
Naik Hiong Chiang (Frankie) 1002415
Rayson Lim 1002026

1. Executive Summary

This document reports the design process of a corrective pose measurement system built on computer vision concepts. It is an attempt to verify correct posture for different yoga poses and workouts in an intuitive way. OpenPose is used to extract two poses: one from a target image of the pose that the user is trying to perform, and one from an image of the user performing that pose. A similarity score is then calculated from the two poses. If the similarity score passes a certain threshold, we treat the user’s pose as sufficiently correct. Two methods of comparison were attempted, namely cosine similarity and a fully connected neural network. The neural network was found to outperform cosine similarity significantly when the two were evaluated through their receiver operating characteristic (ROC) curves.

2. Table of Contents

1. Executive Summary
2. Table of Contents
3. Background
4. Problem Framing
   4.1 Problem Statement
   4.2 Needs and Constraints
5. Designed Solution
   5.1 System Design
   5.2 OpenPose CNN
       5.2.1 Network Architecture
   5.3 Preprocessing
   5.4 Comparison Methodology
       5.4.1 Cosine Similarity
       5.4.2 ComparatorNet
6. Experiment Results
7. Implementation
   7.1 Hardware Specifications
   7.2 Software Specifications
8. Future Works
   8.1 Allow detection for multiple individuals
   8.2 Create a more robust dataset
   8.3 Expand to 3D representations
9. Contributions of Team Members
10. Bibliography

3. Background

Our bodies and our health are among our main concerns in life. Exercise is one of the ways we preserve our health, and yoga is one such exercise. In Singapore, yoga ranks 9th out of 40 sports activities in terms of participation level (Sports Index Participation Trends 2015, 2016). Research has also shown that certain yoga techniques may improve physical and mental health (Ross & Thomas, 2010). However, yoga can also cause injury if proper form and technique are not followed. According to the University of Sydney, yoga causes musculoskeletal pain in 10% of people and exacerbates 21% of existing injuries (University of Sydney, 2017). A separate study conducted in Canada found that 43% of yoga injuries are caused by physical activity such as over-stretching or performing a yoga pose (Russell, Gushue, Richmond, & McFaull, 2016). The researchers concluded that individuals who attempt yoga poses without proper form, or who push themselves beyond their capability or flexibility, are more likely to injure themselves during the exercise. According to the data that they collected, the most commonly suffered injuries during yoga are sprains, soft tissue injuries, and injuries to muscles, tendons or nerves.

In order to mitigate the risk of injury during yoga, it is of paramount importance for yoga practitioners to practice proper form and technique during the exercise. The aim of this study is to test the viability of using computer vision techniques to verify an individual’s yoga pose and check whether the pose is done correctly. The study uses OpenPose, an existing pose estimation neural network, to extract pose information from an image of a person.

4. Problem Framing

4.1 Problem Statement

Correct asana is vitally important to the yogi: it is necessary for preventing injuries and maximizing the benefits of yoga. However, asana correction is not a one-off exercise; correct form must be maintained over time. This is especially so for self-taught yogis, who have no way to check their form. Therefore, there is a need for a solution that lets a yogi check her form easily and quickly.

4.2 Needs and Constraints

In order to effectively help a yogi correct her form, the solution needs to be able to provide real-time visual evaluation so that the yogi can correct her form as soon as possible. On top of that, after-workout reviews should be available as well so that the yogi can learn from her past mistakes and improve on her future workouts. Finally, in order to not impede the yogi’s movement and potentially increase the likelihood of injury, the solution should not require the practitioner to put on any additional equipment such as wearable sensors.

5. Designed Solution

5.1 System Design

Our proposed solution is to use the OpenPose convolutional neural network (CNN) to obtain the human skeletons of the user’s pose and the target pose, and then to compare the two poses to obtain a similarity score. The target pose refers to the desired yoga pose that the user is trying to perform. The similarity score is used as a measure of the correctness of the user’s pose. Two comparison methods were tested in our implementation, namely cosine similarity and a neural network with a sigmoid activation function in the output layer.

OpenPose returns the coordinates and the confidence levels of the different body parts as output. The BODY_25 output format is used for this project; it returns the x and y coordinates of 25 body parts along with their confidence levels. The body parts are the nose, neck, shoulders, elbows, wrists, mid-hip, left hip, right hip, knees, ankles, eyes, ears, big toes, small toes and heels. Our hypothesis is that the coordinates of the various body parts extracted from an image of a human body contain enough information to determine whether a pose is performed correctly.
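As an illustration of the BODY_25 output described above, the following minimal sketch turns one person's keypoints into a dictionary keyed by body part. It assumes the keypoints arrive as a flat list of 75 numbers in the order x, y, confidence for each part, and it assumes the part ordering shown in `BODY_25_PARTS`; both should be checked against the OpenPose version in use. The names `BODY_25_PARTS` and `parse_body_25` are hypothetical helpers introduced here, not part of OpenPose.

```python
# Assumed BODY_25 part order (index -> name); verify against the OpenPose
# version in use before relying on it.
BODY_25_PARTS = [
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist", "MidHip", "RHip",
    "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar", "LBigToe",
    "LSmallToe", "LHeel", "RBigToe", "RSmallToe", "RHeel",
]


def parse_body_25(flat_keypoints):
    """Turn a flat [x0, y0, c0, x1, y1, c1, ...] list into {part: (x, y, c)}."""
    expected = 3 * len(BODY_25_PARTS)
    if len(flat_keypoints) != expected:
        raise ValueError("expected %d values, got %d" % (expected, len(flat_keypoints)))
    return {
        name: tuple(flat_keypoints[3 * i:3 * i + 3])
        for i, name in enumerate(BODY_25_PARTS)
    }
```

Keeping the parts in a fixed order matters for the later comparison step, because both poses must list the same body part at the same position in the vector.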

Figure 1: General system design. The image of the target pose and the image of the user’s pose are each passed through OpenPose to obtain the coordinates of the target pose and of the user’s pose; the comparator then turns the two sets of coordinates into a similarity score.

5.2 OpenPose CNN

Figure 2: OpenPose design. Extracted from: (Cao, Simon, Wei, & Sheikh, 2016)

OpenPose is a real-time multi-person 2D pose estimation system that uses Part Affinity Fields (PAFs). It is based on the CVPR 2016 Convolutional Pose Machine, which is used for the task of articulated pose estimation (Cao et al., 2016). OpenPose consists of a sequence of multi-class predictors that are trained to predict the location of specific body features, along with PAFs which encode the orientation of limbs.

There are five main parts in the OpenPose pipeline, as illustrated in Figure 2. The system takes in a coloured image of size $w \times h$ (Figure 2a) and produces a 2D map of the anatomical key points for each individual in the input image (Figure 2e). The system does so by using a feedforward network to estimate a set of 2D confidence maps of body part locations, $S$ (Figure 2b), and a set of 2D vector fields, $L$, which encode the orientation of body parts. The set $S = (S_1, S_2, \ldots, S_J)$ has $J$ confidence maps, one per part, where $S_j \in \mathbb{R}^{w \times h}$, $j \in \{1 \ldots J\}$. The set $L = (L_1, L_2, \ldots, L_C)$ has $C$ vector fields, one per limb, where $L_c \in \mathbb{R}^{w \times h \times 2}$, $c \in \{1 \ldots C\}$. Finally, the confidence maps and the PAFs are parsed by greedy inference (Figure 2d) to output the 2D skeletons for each individual present in the original input image.

5.2.1 Network Architecture

Figure 3: OpenPose neural network architecture. Extracted from: (Cao et al., 2016)

The OpenPose neural network uses multiple stages of convolutions in order to achieve good accuracy and performance (Cao et al., 2016). The input image is first processed using VGG-19 to obtain a set of feature maps F. The feature maps are then processed simultaneously by two branches to predict both the PAFs and the confidence maps. The architecture, as shown in Figure 3, has two stages with two branches each. The blue branches are responsible for predicting PAFs while the beige branches are responsible for creating part confidence maps. The outputs of each stage are concatenated with the feature maps F to form the input of the subsequent stage. Each stage has its own loss value, and predictions are refined over successive stages with intermediate supervision at each stage.

In stage 1, a set of part confidence maps, $S^1$, and a set of PAFs, $L^1$, are produced, where

$$S^1 = \rho^1(F), \qquad L^1 = \phi^1(F)$$

$\rho^1$ and $\phi^1$ are the CNNs for inference at stage 1. For subsequent stages, where $t \geq 2$, $S^t$ and $L^t$ are given by

$$S^t = \rho^t(F, S^{t-1}, L^{t-1}), \quad \forall t \geq 2$$

$$L^t = \phi^t(F, S^{t-1}, L^{t-1}), \quad \forall t \geq 2$$

In order to guide each stage of the network, two L2 loss functions are applied at the end of each stage, one for each branch. Specifically, the loss functions of the two branches at stage $t$ are

$$f_S^t = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \| S_j^t(p) - S_j^*(p) \|_2^2$$

$$f_L^t = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \| L_c^t(p) - L_c^*(p) \|_2^2$$

where $S_j^*$ is the ground truth part confidence map, $L_c^*$ is the ground truth part affinity vector field, and $W$ is a binary mask with $W(p) = 0$ where the annotation is missing at image location $p$. $W$ is used to avoid penalizing true positive predictions during training. The overall objective function is

$$f = \sum_{t=1}^{T} (f_S^t + f_L^t)$$
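To make the role of the binary mask concrete, here is a minimal sketch of a masked L2 loss for one branch at one stage. It is written in plain Python on nested lists purely for illustration and is not the OpenPose training code; the function name `masked_l2_loss` and the list-of-maps layout are assumptions. For the PAF branch, each of the two vector components can be passed as a separate map, since the squared norm is the sum of the squared components.

```python
def masked_l2_loss(pred_maps, target_maps, mask):
    """Sum of W(p) * (prediction - ground truth)^2 over all maps and pixels.

    pred_maps, target_maps: lists of 2D maps (each a list of rows).
    mask: one 2D map of 0/1 values, 0 where the annotation is missing.
    """
    total = 0.0
    for pred, target in zip(pred_maps, target_maps):
        for r, mask_row in enumerate(mask):
            for c, w in enumerate(mask_row):
                diff = pred[r][c] - target[r][c]
                total += w * diff * diff
    return total
```

Pixels where the mask is 0 contribute nothing, so a prediction at an unannotated location is neither rewarded nor penalized.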

5.3 Preprocessing

Before the coordinates from two images are compared, they are centered at (0, 0), scaled to fit a 1 by 1 square, flattened into a single vector and normalized. This ensures that the two sets of coordinates are compared on the same basis.

Figure 4: Scaled and centered coordinates. The pose is centered at (0, 0), with the axes marked at -0.5 and 0.5.

Figure 5: Original yoga image. Extracted from: (Christina, n.d.)

As mentioned above, the coordinates are centered at (0, 0) and scaled to fill a 1 by 1 square. The steps for centering and scaling are as follows. Let the sets of 25 x and y coordinates be $X$ and $Y$ respectively.

$$\text{width} = \max(X) - \min(X), \qquad \text{height} = \max(Y) - \min(Y)$$

$$\text{scalefactor} = \max(\text{width}, \text{height})$$

$$\text{new}X = \frac{X - \text{mean}(X)}{\text{scalefactor}}, \qquad \text{new}Y = \frac{Y - \text{mean}(Y)}{\text{scalefactor}}$$

After centering and scaling, the coordinates are flattened along with their confidence scores into a single vector of the form $[x_1, y_1, c_1, x_2, y_2, c_2, \ldots, x_{25}, y_{25}, c_{25}]$. This vector is then normalized by dividing it by its norm.
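The preprocessing steps can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name `preprocess_pose` is hypothetical, the norm is assumed to be the Euclidean norm, and undetected keypoints (which OpenPose typically reports as zeros) are not treated specially because the report does not say how they were handled.

```python
import math


def preprocess_pose(keypoints):
    """Center, scale, flatten and normalize one pose.

    keypoints: list of 25 (x, y, confidence) tuples.
    Returns a list of 75 floats with unit Euclidean norm (assumed norm).
    """
    xs = [k[0] for k in keypoints]
    ys = [k[1] for k in keypoints]
    # One common scale factor for both axes keeps the aspect ratio of the pose.
    scale = max(max(xs) - min(xs), max(ys) - min(ys))
    if scale == 0:
        raise ValueError("degenerate pose: all keypoints coincide")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    flat = []
    for x, y, c in keypoints:
        flat.extend([(x - mean_x) / scale, (y - mean_y) / scale, c])
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat]
```

Because the pose is centered and divided by its own extent, translating or uniformly resizing the person in the image leaves the resulting vector unchanged.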

5.4 Comparison Methodology

Two different comparison methodologies were tested, namely cosine similarity and a neural network.

5.4.1 Cosine Similarity

Figure 6: Illustration of cosine distance/similarity. Extracted from: (Wang, Chen, & Wu, 2017)

The first comparison method uses cosine similarity to compare the two sets of output coordinates from the OpenPose CNN. Because both vectors list the body parts in the same order, cosine similarity compares the coordinates of corresponding body parts. Mathematically, it measures the cosine of the angle between two vectors in a multi-dimensional space, as illustrated in Figure 6. Cosine similarity outputs 1 if the two vectors point in exactly the same direction and -1 if they point in opposite directions. In this context, the two vectors are the scaled, centered and flattened coordinates of two poses. Since we are working with normalized vectors, cosine similarity and Euclidean distance carry the same information: for unit vectors $a$ and $b$, $\|a - b\|^2 = 2(1 - \cos\theta)$, so the squared Euclidean distance decreases linearly as the cosine similarity increases.

The hypothesis for using cosine similarity here is that the size and proportions of the person in the image should not affect the determination of whether the pose is done correctly. The correctness of the pose should be fully determined by the relative positions of the different body parts. Hence, an angle-based similarity function is chosen instead of a magnitude-based function.
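A minimal sketch of the cosine similarity computation is given below, written in plain Python so that it has no dependencies (the project itself used the sklearn library). For unit vectors $a$ and $b$, the identity $\|a - b\|^2 = 2(1 - \cos\theta)$ links this score to the squared Euclidean distance. The function name `cosine_similarity` here is a local helper, not the sklearn function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score close to 1 means the two preprocessed pose vectors point in nearly the same direction; the pose is accepted when the score exceeds a chosen threshold.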

5.4.2 ComparatorNet

The second comparison method uses a neural network to determine whether the two sets of output coordinates from the OpenPose CNN indicate the same pose. The neural network is trained with supervised learning. The dataset of images used for training was obtained by scraping Google Images for photos of humans performing various yoga poses. The dataset covers 36 different yoga poses with a total of 1,540 images. These images were first fed into OpenPose to get the coordinates of the poses in each image. These coordinates were then processed according to the steps described in section 5.3.

Finally, each vector was concatenated with every other vector from the processed dataset to produce the inputs to our neural network, as illustrated in Figure 7. This is done so as to compare every image in our dataset with every other image. Performing this step creates 1,185,030 rows of training data (1540 × 1539 / 2 pairs). If the two vectors being concatenated come from the same yoga pose, we label their ground truth as 1, and as 0 otherwise. In other words, the ground truth is binary: 1 means that both images show the same pose and 0 means that they show different poses.

This method of processing our data creates a highly unbalanced dataset in which a large majority of the training rows have a ground truth of 0. This is true for the pose verification problem in general, as the number of ways a pose can be performed wrongly is much greater than the number of ways a pose can be performed correctly. In order to prevent the neural network from always predicting 0, we downsampled the majority class to match the number of observations in the minority class. The resulting balanced dataset has a total of 88,636 observations.
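The pairing and downsampling procedure can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's code: the function name `build_balanced_pairs` is hypothetical, random downsampling with a fixed seed is assumed, and the rows are shuffled at the end for convenience.

```python
import random
from itertools import combinations


def build_balanced_pairs(vectors, pose_labels, seed=0):
    """Pair every vector with every other one, then balance the two classes.

    vectors: list of preprocessed pose vectors (lists of floats).
    pose_labels: pose name for each vector.
    Returns a shuffled list of (concatenated_vector, ground_truth) rows.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(vectors)), 2):
        row = vectors[i] + vectors[j]  # list concatenation, e.g. 75 + 75 = 150 values
        if pose_labels[i] == pose_labels[j]:
            positives.append((row, 1))
        else:
            negatives.append((row, 0))
    rng = random.Random(seed)
    # Downsample whichever class is larger to the size of the smaller one.
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))
    else:
        positives = rng.sample(positives, len(negatives))
    rows = positives + negatives
    rng.shuffle(rows)
    return rows
```

With 1,540 images, `combinations` yields 1540 × 1539 / 2 = 1,185,030 pairs before balancing, matching the row count reported above.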

Figure 7: Concatenating pose vectors to produce input vector. The two pose vectors are joined end to end to form one input vector.

Figure 8 illustrates the neural network used. It consists of 2 fully connected (FC) layers, each followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function, and 1 FC layer followed by BN and a sigmoid activation function. A shallow network with BN was used to reduce overfitting. The output of the sigmoid activation function can be treated as the probability that the two input vectors represent the same yoga pose.
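To show how data flows through this architecture, the sketch below runs a forward pass of an FC-BN-ReLU, FC-BN-ReLU, FC-BN-Sigmoid stack with layer widths 128, 128 and 1 on a 150-value input. It is a dependency-free illustration with random, untrained weights and inference-mode batch normalization; the actual network was built with Keras, and all names here (`init_layer`, `fc_bn`, `comparator_forward`) as well as the weight initialization are assumptions.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random FC weights plus neutral BN statistics (untrained placeholder)."""
    return {
        "W": [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
              for _ in range(n_out)],
        "b": [0.0] * n_out,
        "mean": [0.0] * n_out, "var": [1.0] * n_out,
        "gamma": [1.0] * n_out, "beta": [0.0] * n_out,
    }


def fc_bn(x, layer, eps=1e-3):
    """Fully connected layer followed by inference-mode batch normalization."""
    out = []
    for k, weights in enumerate(layer["W"]):
        z = sum(w * xi for w, xi in zip(weights, x)) + layer["b"][k]
        z = (z - layer["mean"][k]) / math.sqrt(layer["var"][k] + eps)
        out.append(layer["gamma"][k] * z + layer["beta"][k])
    return out


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def comparator_forward(x, layers):
    """ReLU after every hidden FC+BN block, sigmoid after the last one."""
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in fc_bn(x, layer)]
    return sigmoid(fc_bn(x, layers[-1])[0])


rng = random.Random(0)
layers = [init_layer(150, 128, rng), init_layer(128, 128, rng), init_layer(128, 1, rng)]
```

Calling `comparator_forward(x, layers)` on a 150-value input returns a number strictly between 0 and 1, which after training would be read as the probability that the two concatenated poses match.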

The training results are illustrated in Figure 9 and Figure 10.

Figure 8: Neural network architecture. Input shape: (1, 150) → FC (128 nodes), BN, ReLU → FC (128 nodes), BN, ReLU → FC (1 node), BN, Sigmoid

Figure 9: Accuracy of network over 100 training epochs

Figure 10: Loss function value of network over 100 training epochs

6. Experiment Results

Method              Area under curve (AUC)
ComparatorNet       0.982
Cosine Similarity   0.709

Figure 13: AUC scores

The ROC curve for cosine similarity was produced using the same dataset and technique as for the neural network. Each image in the dataset is compared against every other image using cosine similarity; if the two images are of the same yoga pose, the ground truth is set to 1, and to 0 otherwise. The ComparatorNet is found to have a much higher AUC score than cosine similarity (0.982 versus 0.709).
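For intuition about what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen same-pose pair receives a higher score than a randomly chosen different-pose pair, with ties counted as half. This pairwise form is equivalent to the area under the ROC curve but is quadratic in the number of samples, so it is only meant as an illustration; the function name `roc_auc` is hypothetical and the project itself relied on sklearn.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect separation of matching and non-matching pairs.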

Figure 12: Receiver operating characteristic (ROC) curve for ComparatorNet

Figure 11: ROC for cosine similarity

7. Implementation

7.1 Hardware Specifications

GPU           GeForce GTX 965M
CPU           Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz
Memory (RAM)  16.0 GB
System type   64-bit OS

7.2 Software Specifications

Operating System

Operating system  Ubuntu 16.04.6 LTS
Kernel            Linux 4.15.0-47-generic
Architecture      x86-64

Programming Software

Python 3.5.2 was the language of choice.

Tools          Version
Python         3.5.2
Tensorflow     1.13.1
CUDA           10.1
Pytorch        1.0.1
Numpy          1.16.2
OpenCV Python  3.4.1
Keras          2.2.4
Scikit-learn   0.20.3
Sklearn        0.0

Figure 14: Software tools used and their versions

Our demo was implemented entirely in Python. A target video or image is taken as input, along with the video feed from the computer’s webcam, which is captured using OpenCV. Both inputs are fed into the OpenPose CNN to get the coordinates of the poses in both inputs. These coordinates are then passed into the comparator. If the output from the comparator is above a preset threshold value, our software considers the user’s pose to be equal to the target pose. After the comparison, both the target video and the webcam stream are displayed on the screen, along with an indication of whether the user is performing the pose correctly.

Cosine similarity is computed using the sklearn library. The ComparatorNet was built using Keras with Tensorflow as the backend. Since we are predicting two classes, binary cross entropy loss was used in training. The network was trained with the Adam optimizer, using a batch size of 32 for a total of 100 epochs. One third of all available data was used for testing, and a third of the remaining data was used as the validation set. The rest of the data was used as the training set. Finally, the two methods were compared using the roc_curve function from the sklearn library.
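The data split described above (one third for testing, then a third of the remainder for validation) can be sketched as follows. The function name `split_dataset`, the shuffling with a fixed seed and the floor rounding are assumptions; the report does not state how the split was implemented.

```python
import random


def split_dataset(rows, seed=0):
    """Return (train, validation, test) lists following the 1/3 then 1/3 rule."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = len(rows) // 3          # one third of all data
    test, rest = rows[:n_test], rows[n_test:]
    n_val = len(rest) // 3           # one third of what remains
    validation, train = rest[:n_val], rest[n_val:]
    return train, validation, test
```

Under these rounding assumptions, the 88,636 balanced observations would be divided into roughly 39,394 training, 19,697 validation and 29,545 test rows.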

In our demonstration script, the ComparatorNet was used. The threshold values were determined using the ComparatorNet’s ROC curve. The “Genius!” level was chosen to be the threshold level at 5% false positive rate (FPR), “Almost there!” to be at 7.5% FPR and “Nice try!” to be at 10% FPR. These levels have 93.7%, 96.6% and 97.9% true positive rates (TPR) respectively. Our code can be found at https://github.com/nosyarlin/CV-pose-detection
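One way to derive such thresholds is sketched below: given the comparator scores of known non-matching pairs, pick the cut-off that lets through at most the target share of them, then map a new score to the strictest feedback level it clears. This is an assumption-laden illustration rather than the project's code, which read the thresholds off the ROC curve; the function names `threshold_at_fpr` and `feedback_level` are hypothetical.

```python
def threshold_at_fpr(negative_scores, target_fpr):
    """Cut-off t such that the share of negatives scoring above t is <= target_fpr."""
    ordered = sorted(negative_scores, reverse=True)
    # Number of negatives allowed above the cut-off (epsilon guards float rounding).
    allowed = int(target_fpr * len(ordered) + 1e-9)
    if allowed >= len(ordered):
        return float("-inf")
    return ordered[allowed]


def feedback_level(score, levels):
    """levels: list of (message, cut-off) ordered from strictest to most lenient."""
    for message, cutoff in levels:
        if score > cutoff:
            return message
    return None
```

A lower false positive rate corresponds to a higher cut-off, so the “Genius!” level is checked first and “Nice try!” last.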

Figure 15: The screen shows a green tint if the user is doing the target pose correctly

Figure 16: The screen shows a red tint if the user is not doing the target pose correctly

8. Future Works

8.1 Allow detection for multiple individuals

Our model currently focuses on detecting the asana of a single person. However, OpenPose is capable of detecting the poses of multiple individuals in an image. Hence, we would like to extend our model to handle multiple people in one frame. This would be beneficial for a group of people who are trying to learn or practice yoga at the same time.

8.2 Create a more robust dataset

During our training process, it was found that there are many ways to do a pose wrongly and very few ways to do it right. On top of that, when someone performs a pose wrongly, he generally still looks like he is performing the intended pose, with small deviations from the target image (e.g. arms are not straight). In our data preparation, however, we compared correct yoga poses with each other and gave a ground truth label of 1 when the two images were of the same pose. Different yoga poses can look very different from each other, making it easy for a neural network to tell them apart. Therefore, this may have resulted in a network that is very good at telling apart two different yoga poses but not very good at detecting whether someone is performing a given pose wrongly. This requires further investigation, and it would be wise to add images of individuals performing poses wrongly to the dataset as well.

8.3 Expand to 3D representations

OpenPose is limited by the fact that a 2D image can only view someone from one angle. As a result, our method does not work very well when parts of the body are obscured (e.g. when someone’s arms are behind his back). Therefore, it will be beneficial to look into detecting poses using a 3D representation of the human body.

9. Contributions of Team Members

Task                              Lead(s)                   Assisted by
Data collection                   Albert, Frankie
Data preprocessing and cleaning   Rayson, Jason
Checkoff 2 slides                 All
Checkoff 2 presentation           Jason
Testing different pose machines   Rayson
ComparatorNet                     Rayson, Albert, Frankie
ROC-AUC comparison                Rayson
Development environment setup     Rayson                    Jason
Demo                              Rayson, Jason
Final presentation slides         Rayson
Final presentation                Rayson
Report                            Rayson                    Jason, Albert, Frankie
Poster                            Albert, Frankie           Rayson, Jason

10. Bibliography

Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. (2016). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. ArXiv:1611.08050 [Cs]. Retrieved from http://arxiv.org/abs/1611.08050

Ross, A., & Thomas, S. (2010). The Health Benefits of Yoga and Exercise: A Review of Comparison Studies. Journal of Alternative & Complementary Medicine, 16(1), 3–12. https://doi.org/10.1089/acm.2009.0044

Russell, K., Gushue, S., Richmond, S., & McFaull, S. (2016). Epidemiology of yoga-related injuries in Canada from 1991 to 2010: a case series study. International Journal of Injury Control & Safety Promotion, 23(3), 284–290. https://doi.org/10.1080/17457300.2015.1032981

Sports Index Participation Trends 2015. (2016, June). Retrieved from https://www.sportsingapore.gov.sg/~/media/Corporate/Files/About/Publications/Sports%20Index%202015.pdf

University of Sydney. (2017, June 27). Yoga more risky for causing musculoskeletal pain than you might think: Injury rate up to 10 times higher than previously reported. Retrieved April 21, 2019, from ScienceDaily website: https://www.sciencedaily.com/releases/2017/06/170627105433.htm

Wang, L., Chen, Z., & Wu, J. (2017). An Opportunistic Routing for Data Forwarding Based on Vehicle Mobility Association in Vehicular Ad Hoc Networks. Information, 8, 140. https://doi.org/10.3390/info8040140
