Computers & Graphics (2019)
TensorPose: Real-time pose estimation for interactive applications
Luiz Jose Schirmer Silva a, Djalma Lucio Soares da Silva a, Alberto Raposo a, Luiz Velho b, Helio Lopes a

a PUC-Rio - Pontifícia Universidade Católica do Rio de Janeiro; b IMPA - Instituto Nacional de Matemática Pura e Aplicada, Rio de Janeiro, Brazil

e-mail: [email protected] (Luiz Jose Schirmer Silva)
ARTICLE INFO

Article history: Received August 23, 2019

Keywords: Convolutional Neural Networks, Pose estimation, Real-time applications
ABSTRACT

The state of the art has outstanding results for 2D multi-person pose estimation using multi-stage Deep Neural Networks in images with high accuracy. However, the use of these models in real-time applications may be impractical, not just because they are computationally intensive, but also because they suffer from flickering, from the inability to capture temporal correlations among video frames, and from image degradation. To tackle these problems, we expand the use of pose estimation to motion capture in interactive applications. To do so, we propose a novel deep neural network with streamlined architecture and tensor decomposition for pose estimation with improved processing time, named TensorPose. We introduce an architecture for markerless motion capture using Convolutional Neural Networks combined with sparse optical flow and Kalman filters. We also apply this architecture in a multi-user environment, based on the Holojam framework, where it is possible to create simultaneous collaborative experiences.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Pose estimation is a challenging problem in computer vision with many real applications in areas including augmented reality, virtual reality, computer animation, and 3D scene reconstruction [1, 2, 3, 4, 5]. Usually, the problem to be addressed involves estimating the 2D human pose, i.e., the anatomical keypoints or body parts of persons in images or videos [6]. Recent works have presented great results for multi-person pose estimation in images, but still have some issues when considering videos [7]. These approaches can achieve high accuracy using architectures based on conventional convolutional neural networks; however, mistakes caused by occlusion and motion blur are not uncommon, and those models are computationally very intensive for real-time applications [7, 5]. Also, despite some related work on increasing performance, few papers explore the use of temporal coherence and the possibility of using such machine learning models for real-time applications involving animation [8].
OpenPose is considered the state-of-the-art approach to multi-person pose estimation, but it does not achieve the desired performance in terms of frames per second, which makes it difficult to use in interactive applications that require frame rates close to or above 30 FPS. To increase OpenPose's performance, we propose a new pose estimation model, called TensorPose. In our model, we replace conventional convolution operations by successive pointwise and regular convolutions in a reduced space. As we will show, the proposed modifications are analogous to applying a tensor decomposition, more specifically a higher-order singular value decomposition (HOSVD). Another issue that we study is temporal coherence, which is not present in previous works since they do not consider the relationship between the processed frames. To track the detected persons in videos and 3D captures, we use sparse optical flow and a Kalman filter to smooth the movement.
In this work, we also propose the creation of an architecture that uses markerless motion capture for real-time applications. Such an architecture brings us the possibility of creating environments for the development of shared and distributed applications. With the collected information, our architecture easily composes an environment for the development of applications that can involve motion tracking using WebGL or a game engine. This software infrastructure can be used to implement multi-user interaction and positional tracking of people and objects, creating a complete sense of immersion in a tangible space where it is possible to bridge the gap between the physical and virtual spaces. In other words, users in different physical environments and using various devices can share and explore a unique and integrated virtual environment.
In summary, the main contributions of this work are:

• a novel deep neural network with streamlined architecture and tensor decomposition for pose estimation with improved processing time;

• the extension of our model with sparse optical flow and Kalman filters for motion tracking; and

• an architecture for multi-user applications, based on the Holojam framework, where it is possible to create simultaneous collaborative experiences.
The paper is structured as follows. Section 2 presents related work on pose estimation and motion capture. Section 3 gives an overview of the pose estimation task using a neural network. Section 4 presents the main properties of and operations on tensors, as well as the decomposition that we use in our approach. Section 5 introduces the main aspects of our deep learning model and the software architecture for interactive applications. Section 6 shows the main aspects of virtual environment communication based on the Holojam framework. Section 7 shows the results and, finally, Section 8 concludes the work by discussing our contributions and proposing future work.
2. Related Work

2.1. Pose Machines and Tracking

Ramakrishna et al. [9] proposed an inference machine framework and presented a method for articulated human pose estimation, called Pose Machines. A pose machine consists of a sequence of multi-class predictors that is trained to predict the location of each part at each level of the hierarchy. Their method has the benefits of implicitly learning long-range dependencies between image and multi-part cues, tight integration between learning and inference, and a modular sequential design. The model was built on the inference machine framework so that it could learn reliable interconnections between body parts.
Wei et al. [10] introduced Convolutional Pose Machines (CPMs) for the task of pose estimation. CPMs expand the main idea of pose machine architectures and combine it with the advantages of convolutional neural networks. Their contributions were the ability to learn feature representations for both image and spatial context directly from data, and the ability to handle large training datasets efficiently. CPMs consist of a sequence of convolutional networks that repeatedly produce 2D confidence maps for the location of each body part. At each stage in a CPM, image features and the confidence maps produced by the previous stage are used as input. As stated by Wei et al. [10], the confidence maps provide the subsequent stage with an expressive non-parametric encoding of the spatial uncertainty of location for each part, allowing the CPM to learn rich image-dependent spatial models of the relationships between elements.
Cao et al. [6] proposed an efficient method for 2D multi-person pose estimation with high accuracy on multiple public datasets, extending the original Convolutional Pose Machines. They created a bottom-up representation of association scores via Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. They show that their approach encodes global context sufficiently well to allow for a combinatorial optimization algorithm that solves an assignment problem to detect a person's skeleton. However, they did not consider temporal coherence in the original work: when considering a video, there is no relation between persons detected in multiple frames. Also, this approach is computationally intense, demanding a lot of processing power. Following their algorithm and considering their benchmarks, we achieved only approximately 10 FPS using an Nvidia Titan Xp GPU.
Regarding the tracking problem, one solution is to use dense optical flow to adjust the predicted positions and generate a smooth movement across frames. The Thin-Slicing Network [11] achieved excellent results; however, this system is computationally very intensive and slower than the approaches proposed in other papers.

Luo et al. [7] achieved some improvements in terms of both accuracy and efficiency. They propose a recurrent CNN model with LSTM for video pose estimation. Although it considers occlusion, some incorrect predictions can be detected when a joint is not visible for an extended period. Also, their tests do not consider multi-person pose estimation.
2.2. Motion Capture and Mesh Recovery

Concerning motion capture, recent papers have presented promising solutions. Shafaei et al. [12] showed an efficient and inexpensive approach to markerless motion capture using a set of sensors, like the Microsoft Kinect. They split the multiview pose estimation task into three subproblems: dense classification, view aggregation, and pose estimation. Their approach also runs in real time. They do not assume a shape model or a specific application, such as home entertainment, being applicable to a wider range of scenarios. However, their method considers only one person and not a multi-user system. Also, they use depth sensors like the Kinect, and not regular cameras or stereo cameras as our architecture does.
Tome et al. [13] proposed a CNN-based approach for multi-camera markerless motion capture of the human body. Their technique makes use of 3D reasoning throughout a multi-stage procedure. They extended existing works on single-view reconstruction to a multi-camera setting and showed how such single-view methods could be enhanced by training on the multiview-based annotation of unlabelled data. However, this technique appears computationally intensive and, for multi-user detection, cannot be used in real-time applications.
Mehta et al. [8] introduced the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. However, their approach suffers when there are significant movement shifts between frames, as well as under occlusion. Also, the method is only capable of tracking a single person.
Kanazawa et al. [14] show an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. They used the generative human body model SMPL [15], which parameterizes the mesh by 3D joint angles and a low-dimensional linear shape space. Also, an adversarial model is presented to adjust the mesh, determining if a pose can be considered acceptable. An image is passed through a convolutional encoder and sent to a 3D regression module that infers the 3D representation of the human. The 3D parameters are also sent to a discriminator, whose goal is to tell whether these parameters come from a real human shape and pose. This mesh can be useful to generate humanoid animations, which can immediately be used by animators, modified, measured, manipulated, and retargeted. Again, this approach has a high computational cost and only considers single-person detection.
3. Pose estimation model

This section describes the main aspects of the architecture proposed by Cao et al. [6]. The network architecture consists of a feedforward neural network that predicts a set S of 2D "confidence maps", from which the body parts are located in an image, and also a set L of 2D vector fields that help identify the connection between two body parts to generate a skeleton. The set S = {S_1, S_2, ..., S_n} has n confidence maps, one for each part of the body (hand, elbow, head, etc.), and the set L = {L_1, L_2, ..., L_C} has C 2D vector fields, each used to construct the skeleton by identifying one of its members: an arm or a leg, for example. N candidates for each body part are generated, creating a bipartite graph for each pair. Finally, the results are analyzed following a greedy strategy, using the Hungarian algorithm with some modifications to create a connection between two parts. This process is performed in each frame.
The neural network is divided into two branches: one responsible for generating the confidence maps and the other for the affinity fields. Also, these two branches are divided into t stages, where the authors of the original paper set t = 6 stages. At first, the images are processed by the first ten layers of a VGG-19 convolutional network, generating a set F of features that is used in the subsequent stages of OpenPose. These features are used as input for the other layers of the network and, as a result, they produce a set S of confidence maps and a set L of affinity fields (vector fields). Let ρ and Φ represent the convolution layers for the confidence maps and affinity fields, respectively. At the end of this step, the results are concatenated with the initial set F to produce refined results. This process is performed for each subsequent stage.
To guide the network in its predictions, two objective functions are given at the end of each stage, one for each branch. The objective functions for both branches are based on the distance between the network output data and the ground truth generated for the confidence maps and vector fields, according to Equations 1 and 2 [6]:

f_S^t = \sum_{j=1}^{J} \sum_{p} W(p) \, \| S_j^t(p) - S_j^*(p) \|_2^2; \qquad (1)

f_L^t = \sum_{c=1}^{C} \sum_{p} W(p) \, \| L_c^t(p) - L_c^*(p) \|_2^2, \qquad (2)
where S_j^* and L_c^* represent the ground truth of the annotated data; J and C are, respectively, the numbers of confidence maps for each part of the body and of vector fields. W(p) is a binary mask with W(p) = 0 when there is no annotated data at point p of the image. The mask is used because there are images in the training dataset that are not entirely annotated, so it prevents true positive data from being penalized. The overall loss function is defined by the sum of Equations 1 and 2.
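To make the training objective concrete, the sketch below implements the masked L2 stage loss. This is a minimal PyTorch sketch under our reading of Equations 1 and 2, with illustrative tensor names, not the authors' training code.

import torch

def stage_loss(pred_maps, gt_maps, mask):
    """pred_maps, gt_maps: (B, C, H, W); mask: (B, 1, H, W), 0 where unannotated."""
    # W(p) zeroes out unannotated pixels so missing labels are not penalized.
    return (mask * (pred_maps - gt_maps) ** 2).sum()

def total_loss(preds_S, preds_L, gt_S, gt_L, mask):
    # Overall loss: sum the confidence-map and PAF terms over all t stages.
    return sum(stage_loss(S, gt_S, mask) for S in preds_S) + \
           sum(stage_loss(L, gt_L, mask) for L in preds_L)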
The training images, as well as their annotations, were obtained through the COCO (Common Objects in COntext) dataset [16], using more than 100,000 images for the task.
Following the architecture model, they first analyze the set S to identify the possible candidates for a part of the body. In total, 18 confidence maps are generated by branch 1, defined by the convolution network ρ. Each confidence map of the set S can be viewed as a heatmap. They used a technique called non-maximum suppression to identify the positions of potential candidates by getting the peaks in the image. With all possible candidates (points) obtained, it is necessary to find the real connections between them. Therefore, for each connection, a complete bipartite graph is built where each node is a part of the body, and each edge is a connection representing an arm or a leg, for example. To find the right relationship, they face a well-known problem from graph theory, that of finding the best match between the vertices of a bipartite graph: an assignment problem. For this task, the model uses the vector fields generated by the output Φ of the network. A line integral is computed along the segment described by each pair of candidates, since the line integral measures the influence of a given field along a line. Equation 3 shows how they compute the score of each segment represented by each pair of body parts. Given two points, d_{j1} and d_{j2}, as candidates for two parts of a skeleton forming a connection, the score of an edge connecting these two points is:
E = \int_{u=0}^{u=1} L_c(p(u)) \cdot \frac{d_{j2} - d_{j1}}{\| d_{j2} - d_{j1} \|_2} \, du, \qquad (3)

where L_c represents the vector field and p(u) interpolates the position between the two body parts d_{j1} and d_{j2}:

p(u) = (1 - u) \, d_{j1} + u \, d_{j2}. \qquad (4)
Afterward, to obtain the optimal solution, they used a classical method called the Hungarian algorithm. Regarding the detection of multiple people, determining the pose of each becomes an n-dimensional matching problem. Since this is considered an NP-hard problem, some restrictions can be made. First, a minimum number of vertices for each body segment is chosen to generate a spanning tree instead of the complete graph. Afterward, the problem is decomposed into a set of bipartite graphs to determine the matching between adjacent nodes independently. With all the connections defined, they take the union of all the connections that share common vertices to create the skeleton as a whole. Figure 1 illustrates the final result of the pose estimation.

Fig. 1. 2D pose estimation example.
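As an illustration of the scoring step, the following minimal NumPy sketch approximates the line integral of Equation 3 by sampling p(u) at evenly spaced points. Function and variable names are ours, not from the original implementation, and points are assumed to lie inside the image.

import numpy as np

def paf_edge_score(paf, d1, d2, n_samples=10):
    """paf: (H, W, 2) vector field L_c; d1, d2: (x, y) candidate keypoints."""
    v = np.asarray(d2, float) - np.asarray(d1, float)
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    v /= norm  # unit vector along the candidate limb
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        x, y = (1 - u) * np.asarray(d1, float) + u * np.asarray(d2, float)  # p(u), Eq. 4
        score += paf[int(round(y)), int(round(x))].dot(v)  # dot with the field
    return score / n_samples  # discrete approximation of the line integral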
4. Tensor Decompositions and Convolutional Neural Networks

This section presents tensor analysis and its relation to Convolutional Neural Networks, which leads to a better understanding of the connection between HOSVD theory and factorized convolutions. There is rising interest in exploring efficient architectures for neural networks, either for use in embedded device applications or for better allocation of resources and services, reducing network latency and size [17]. Several papers propose different architectures for Convolutional Neural Networks based on factorized convolutions [17, 18, 19], and in our opinion the hidden theory behind these applications is related to tensor algebra and higher-order decompositions, although few papers discuss these fundamentals.

A 2D convolutional layer of a neural network can be described as a 4-dimensional tensor, where its dimensions are defined by the number of columns and rows, the number of channels of an input image (e.g., the RGB channels), and the number of output channels. In this work, we are interested in improving the pose estimation network for fast applications. To do so, we decompose convolutional layers into smaller tensors, i.e., we make approximations of these layers. This approximation is essential since it reduces the number of operations performed by each layer and, in this way, accelerates our neural network model. The following subsections discuss general aspects of tensors as well as the Tucker decomposition [20, 21, 22, 23].
4.1. Tensor Properties

A tensor can be seen as a high-dimensional matrix, i.e., one with three or more dimensions. The order of a tensor T is the number of its dimensions, also known as ways or modes [22]. In a manner analogous to matrices' rows and columns, tensors have fibers. A matrix column is a mode-1 fiber and a matrix row is a mode-2 fiber [22, 23]. Third-order tensors have column, row, and depth fibers, for example.

In tensor analysis, there are operators that create subarrays by fixing some of a given tensor's indices. When we fix all indices leaving only one free, we create a fiber. In the same way, we create slices if we fix all indices but two. For a third-order tensor, the frontal, lateral, and horizontal slices are obtained by fixing the third, the second, and the first index, respectively.
Unfolding is similar to flattening a matrix: we stack the fibers of a tensor in a given way to obtain a matrix representation [24, 20, 22, 25]. For a tensor T ∈ R^{I_1 × I_2 × ... × I_N}, its mode-n unfolding is a matrix T_{[n]} ∈ R^{I_n × M}, with M = \prod_{k=1, k \neq n}^{N} I_k, mapping element (i_1, i_2, ..., i_N) to (i_n, j), with j = \sum_{k=1, k \neq n}^{N} i_k \prod_{m=k+1}^{N} I_m [22].

For example, if we have a 3D tensor T with dimensions (3, 3, 2) defined by the frontal slices

T_1 = [ 1 2 3 ; 4 5 6 ; 7 8 9 ],   T_2 = [ 10 11 12 ; 13 14 15 ; 16 17 18 ],

the mode-1 unfolding matrix is by definition given by:

T_{[1]} = [ 1 10 2 11 3 12 ; 4 13 5 14 6 15 ; 7 16 8 17 9 18 ]
The rank of a tensor T is the smallest number of rank-one tensors that generate T as their sum [22]. A rank-one tensor is an order-N tensor that can be written as the outer product of N vectors [22, 23], as in Equation 5. In other words, each element of T ∈ R^{I_1 × I_2 × ... × I_N} is the product of the corresponding vector elements, as defined in Equation 6:

T = v^{(1)} \circ v^{(2)} \circ ... \circ v^{(N)}, \qquad (5)

t_{i_1 i_2 ... i_N} = v^{(1)}_{i_1} v^{(2)}_{i_2} ... v^{(N)}_{i_N}. \qquad (6)
The mode-n product of a tensor T ∈ R^{I_1 × I_2 × ... × I_N} by a matrix M ∈ R^{J_n × I_n} is a tensor X with dimensions (I_1 × I_2 × ... × I_{n-1} × J_n × I_{n+1} × ... × I_N), given by Equation 7 [26]:

X = T \times_n M, \quad x_{i_1 ... i_{n-1} j_n i_{n+1} ... i_N} = \sum_{i_n} t_{i_1 ... i_N} \, m_{j_n i_n}. \qquad (7)
Another property is the Kruskal form of a tensor. It is similar to the matrix case, where a matrix M can be generated as a sum of outer products of two vectors. Consider a single tensor T: we can generate it as a sum of outer products of N vectors, where the number of terms in the sum is called the Kruskal rank of the tensor [22].
4.2. Tucker Decomposition

A Singular Value Decomposition (SVD) of a 2D matrix, i.e., a 2nd-order tensor, can be defined as in Equation 8:

M = U \Sigma V^T, \qquad (8)

where U is an m × m unitary matrix, Σ is an m × n rectangular diagonal matrix, and V^T is the conjugate transpose of an n × n unitary matrix. This equation can be rewritten as follows, in Equation 9 [26]:

M = C \times_1 U^1 \times_2 U^2, \qquad (9)

where U^1 = (u^1_1, u^1_2, ..., u^1_{I_1}) and U^2 = (u^2_1, u^2_2, ..., u^2_{I_2}) are unitary matrices, and C is a core matrix (I_1 × I_2).
The SVD can be generalized to higher-order tensors, where this approach is also known as the Higher-Order SVD (HOSVD) or Tucker decomposition [26]. Such a tool considers the orthonormal spaces associated with the different modes of a tensor [21]. For example, Equation 9 can be extended to 3D tensors as follows:

M = C \times_1 U^1 \times_2 U^2 \times_3 U^3, \qquad (10)

where U^1, U^2, U^3 contain the 1-mode, 2-mode, and 3-mode singular vectors, respectively, related to the column spaces of the M_{[1]}, M_{[2]}, and M_{[3]} matrix unfoldings, and C is a core tensor with the orthogonality property [26].
We can now apply these Tucker decomposition concepts to our domain [21]. Consider an order-4 kernel tensor T, which can be seen as a kernel in a convolution layer. All operations for its decomposition can be written in the form described in Equation 11 [21]:

T_{x,y,z,k} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} C_{r_1,r_2,r_3,r_4} \, U^1_{x,r_1} U^2_{y,r_2} U^3_{z,r_3} U^4_{k,r_4}, \qquad (11)
where C is a core tensor of size (R_1 × R_2 × R_3 × R_4) and the U matrices are factor matrices of sizes X × R_1, Y × R_2, Z × R_3, and K × R_4, respectively. In our domain, each dimension is associated with the number of columns and rows, the RGB channels, and the number of output channels in a convolutional layer of a neural network. Although this holds when the input is an RGB image, in successive layers the dimensions and numbers of channels vary. Thus, in intermediate layers, our tensors are defined by the spatial dimensions X and Y of the input, the number of filters of the input data (i.e., the "depth"), and the number of filters of the layer's output data. Considering this, the decomposition is performed over the weights of each convolution layer in our neural network and, as we can see, the convolution kernels are 4D tensors. We can boost the speed of a neural network by creating several factorized approximations of regular convolutions with only small reductions in accuracy. The next section presents the details of our model as well as its relation to tensor decompositions.
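To make the connection concrete, the following NumPy sketch computes a Tucker-2 (HOSVD) approximation of a 4D convolution kernel, decomposing only the two channel modes of Equation 11 and leaving the small spatial modes full. Shapes and ranks are illustrative, not the exact values used in training.

import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker2(kernel, r3, r4):
    """kernel: (X, Y, K, Z). Returns core C (X, Y, r3, r4) and factors U3, U4."""
    U3 = np.linalg.svd(unfold(kernel, 2), full_matrices=False)[0][:, :r3]  # (K, r3)
    U4 = np.linalg.svd(unfold(kernel, 3), full_matrices=False)[0][:, :r4]  # (Z, r4)
    # Contract the kernel with the factor matrices over the channel modes.
    core = np.einsum('xykz,kr,zs->xyrs', kernel, U3, U4)
    return core, U3, U4

kernel = np.random.randn(3, 3, 128, 128)
core, U3, U4 = tucker2(kernel, r3=28, r4=32)
approx = np.einsum('xyrs,kr,zs->xykz', core, U3, U4)  # Equation 11 reassembled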
5. TensorPose Network and Architecture

In this section, we describe the main aspects of our deep convolutional neural network model and the architecture for interactive applications. At first, we focus on the new CNN model. Secondly, we present the main aspects of our architecture for interactive applications, describing the modules for inference of 2D and 3D data and for distributed applications.
5.1. Network Model and Skeleton Tracking

In real-time applications, efficiency and approaches for reducing the computational cost are essential. Starting from the model of Cao et al. [6], we propose a new deep neural network based on a streamlined architecture and tensor decompositions. Initially, we modified the generation of the first features using the Depthwise Separable Convolutions of MobileNet [17], which is typically used for embedded devices. We replaced the original layers of the VGG network because their performance in this step represents the first bottleneck. Here, we decrease the number of operations required in the convolution process by making an approximation through successive convolutions. The MobileNet model is based on a form of factorized convolution which transforms a standard convolution into a depthwise and a pointwise convolution [17]. A Depthwise Separable Convolution deals not only with the spatial characteristics of the image but also with depth, i.e., the RGB channels. In this process, a kernel is separated into two others to perform two convolutions: a depthwise and a pointwise one. Essentially, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution applies a 1x1 kernel to combine the outputs.

A Depthwise Separable Convolution requires approximately 8 to 9 times less computation than a standard convolution, with only a small reduction in accuracy [17]. Based on the architecture of MobileNetV2 [27], we create 12 layers of depthwise separable convolutions instead of regular ones for feature extraction.
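For reference, a depthwise separable convolution of the kind used in our feature extractor can be written in a few lines. This is a minimal PyTorch sketch of the MobileNet building block [17], not our exact training code; normalization layers are omitted.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch applies one 3x3 filter per input channel (depthwise).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, padding=1, groups=in_ch)
        # The 1x1 convolution combines the per-channel outputs (pointwise).
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.act(self.depthwise(x))))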
Following this same idea, we create another lightweight structure for the intermediate layers of the network, intending to increase FPS performance. In the first three stages, we perform approximations using tensors to speed up the runtime performance without significant losses in network accuracy. Each intermediate convolution stage was replaced by a block consisting of a pointwise convolution, a standard convolution in a reduced space, and another pointwise convolution. For example, in the intermediate layers of our network, convolutions with n = 128 input and output channels, a 3x3 kernel, and stride 1x1 were replaced by a block containing a pointwise convolution with 128 input channels and 28 output channels, a regular convolution with 28 input and 32 output channels and a 3x3 kernel, and another pointwise convolution with 32 input channels and 128 output channels. We use this approach to change stages 1 to 3 of our network. This can be seen as analogous to the approximation of a regular convolution described by Kim et al. [21], where they used a Higher-Order SVD for this task. Similarly to depthwise separable convolutions, we split a regular convolution into three others, which drastically reduces the computation and model size.
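The block just described can be sketched as follows, a PyTorch sketch with the channel sizes quoted above; activations and normalization are omitted for brevity.

import torch.nn as nn

def factorized_block(in_ch=128, r3=28, r4=32, out_ch=128):
    return nn.Sequential(
        nn.Conv2d(in_ch, r3, kernel_size=1),          # pointwise: K -> R3
        nn.Conv2d(r3, r4, kernel_size=3, padding=1),  # regular conv in reduced space
        nn.Conv2d(r4, out_ch, kernel_size=1),         # pointwise: R4 -> Z
    )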
Considering the tensor decomposition theory stated in the previous sections, let us relate it to convolutional layers. A regular convolution maps an input tensor into another of different size by successive operations, as in Equation 12:

conv(x, y, z) = \sum_{i} \sum_{j} \sum_{k} \Theta_{(i,j,k,z)} \, W_{(x-i, y-j, k)}, \qquad (12)
where Θ is a kernel of size I × J × K × Z (for example, a kernel with size I × J = 3 × 3, with K = 3 channels (RGB), and Z = 128 convolution filters) and W is an input tensor of size X × Y × K, typically an image, with X and Y the image dimensions and K the number of channels. Replacing Θ by our approximation, this approach can be seen as a kernel tensor decomposition [21], as in Equation 13:

conv(x, y, z) = \sum_{i} \sum_{j} \sum_{k} approx \; W_{(x-i, y-j, k)}. \qquad (13)
Regarding Equation 11 in the previous section, Equation 14 defines the tensor named approx, computed by the HOSVD decomposition [20]:

approx = \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} C_{i,j,r_3,r_4} \, U^k_{r_3}(k) \, U^z_{r_4}(z), \qquad (14)
where U^k_{r_3} and U^z_{r_4} are the factor matrices of sizes K × R_3 and Z × R_4, respectively, and C is a core tensor of size (I, J, R_3, R_4). Notice that this equation uses only the R_3 and R_4 ranks. In our application, we suppress mode-1 (R_1) and mode-2 (R_2), which are associated with the spatial dimensions of an image, because they are quite small [21].
When replacing Θ in Equation 12 by Equation 14, we can rearrange the operations, obtaining three convolutional operations that approximate the original: two pointwise convolutions and one regular convolution in a reduced space. This leads us to the convolutional block previously described. Here, the first pointwise convolution reduces the number of channels from K to R_3; the regular convolution, which had K input and Z output channels, now has R_3 input channels and R_4 output channels; and the last pointwise convolution is used to get back to the original output size. This process not only speeds up our application but also reduces the number of parameters needed.
A regular convolution requires D^2 Z K X Y multiply-add operations, where D^2 refers to the dimensions of our 3 × 3 kernel and X and Y refer to the spatial dimensions of our data. Considering the decomposition, our approach has a speed-up ratio S defined by Equation 15 [21]:

S = \frac{D^2 Z K X Y}{K R_3 X Y + D^2 R_3 R_4 X Y + Z R_4 X Y}. \qquad (15)

Regarding our model, each decomposed layer requires approximately 9.3 times fewer operations than a regular convolution.
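Plugging the block sizes above into Equation 15 reproduces this figure (the spatial terms X and Y cancel):

D, K, Z, R3, R4 = 3, 128, 128, 28, 32
S = (D**2 * Z * K) / (K * R3 + D**2 * R3 * R4 + Z * R4)
print(round(S, 1))  # ~9.4, consistent with the roughly 9.3x reduction reported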
Figure 2 represents the architecture of our network. Here, we have more convolution layers, but the number of operations and weights is smaller, since each regular convolution is substituted by three factorized ones. As we can see in Figure 2, for stage 1 the overall number of convolutions, considering all blocks, is 3 × 3 + 2, and for stages 2 and 3 it is 5 × 3 + 2. We defined this model for the first three stages after feature extraction empirically, as we did not observe significant differences in the accuracy of our network compared to OpenPose. If we extend this approach to the other stages, for example to stage 4, we verified a more significant difference in accuracy, of more than 25%, leading to wrong inference results. Also, if we apply it to all 6 stages, we face an instability problem [21, 28], where it is difficult to find a good learning rate as well as how to initialize the weights. We believe this is an effect of applying stacked decomposed layers to the model, and for this reason we limited the number of decomposed stages to the first three. Figure 3 shows some results of inference using our network. As we can see, in most cases we achieve good detection performance, despite some small errors under high movement shifts and blurred images.
Besides frame rate issues, another problem identified in OpenPose was the lack of temporal coherence in the original paper. In other words, there is no relation between the objects detected in one frame and those detected in subsequent frames; therefore, the reference to each identified person is lost. For the context of our applications, it is necessary to create a module for this task. Temporal coherence enables us to implement real motion tracking, where the temporal correspondence between each pair of frames needs to be preserved, considering motion coherence between spatial neighbors and across the temporal dimension [29]. We can also use this to fix the tracking to a single person, removing problems involving people interacting, for example.
To solve this problem, in a first experiment, we created a module for temporal coherence using optical flow. Optical flow is the pattern of apparent motion of objects between two consecutive frames caused by the movement of an object or the camera. In other words, it is an approximation of image motion based upon local derivatives in a given sequence of images. Some patterns can affect the sequence of images, causing temporal variations in brightness. In sum, it specifies how image pixels move between subsequent frames.
Firstly, we tested the Kanade-Lucas-Tomasi (KLT) algorithm [30], and our preliminary results presented not only an increase in frame-rate performance but also smoothness in the motion capture. For the KLT algorithm, we pass the points of the skeletons detected by the network. At each interval of t frames, these points are iteratively tracked through the optical flow, where the previous frame, the positions of the previous points, and the next frame are passed to the detection function. With each new frame, the location of each point of the tracked skeletons is updated. If any position or point is lost in the process, the tracking is stopped; the network is then executed again and the skeletons recalculated. The same is done at the end of the interval of t frames to guarantee the consistency of the tracking. Normally we use an interval of 10 frames, for which we did not detect a greater tracking error when compared with the frame-by-frame technique. Here, the sparse optical flow deals with far fewer parameters than running the CNN and calculating the line integrals at each frame, since it tracks just the previous points detected by the CNN.

Fig. 2. TensorPose CNN model. The initial features are extracted using the first 12 layers of a MobileNetV2, and in the intermediary layers we replace conventional convolutions by a block containing a pointwise convolution, a regular convolution in a reduced space, and another pointwise convolution. We have more convolution layers, but the number of operations and weights is smaller, which leads to a boost in performance. At the end of each stage, the results are concatenated with the results of feature extraction from the MobileNet to refine the results.

Optical flow assumes that the intensity of the pixels of an object does not change with time and that neighboring pixels have the same pattern of movement. Classical approaches such as optical flow fail in environments with illumination changes and long-range motions. To address these issues, we changed the algorithm to use the Robust Local Optical Flow [31], following the framework proposed by Senst et al., which takes an illumination model into account to deal with varying illumination. The computational cost of this variation, used by the KLT method, has an upper bound of O(n N_i), where n is the number of computed motion vectors and N_i is the number of pixels of the larger image region i [31]. Also, to support the tracking process, we use a Kalman filter and Multiple Instance Learning (MIL). The Kalman filter [32] is used to predict the position of the skeleton in the frame after the current one, and to ensure that the references to each detected skeleton are maintained. Given the distance between the points calculated by the network and those predicted by the Kalman filter, a threshold is applied to ensure that the reference points are maintained; we fix this threshold at a radius of 20 pixels of error. Combining optical flow and the Kalman filter gives us the possibility to interpolate the data over subsequent frames, which enables us to smooth the captured movement. We also use MIL [33] to ensure that the detected person's reference is maintained: we use it as an additional feature to identify the bounding box of the detected head keypoints, and to maintain the reference to a person even when the network is called again, although its use is optional.
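The tracking loop can be sketched with OpenCV as follows. This minimal sketch uses the pyramidal Lucas-Kanade tracker and a constant-velocity Kalman filter with illustrative names; it omits the RLOF variant and the MIL tracker described above.

import cv2
import numpy as np

def track_interval(prev_gray, frames, keypoints, interval=10):
    """keypoints: (N, 1, 2) float32 skeleton points from the last CNN pass."""
    pts = keypoints
    for frame in frames[:interval]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
        if status.min() == 0:   # a point was lost: stop and rerun the CNN
            return None
        prev_gray = gray
    return pts                  # the CNN is rerun after the interval anyway

# A constant-velocity Kalman filter per keypoint: state (x, y, dx, dy).
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
kf.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)
kf.correct(np.array([[120.0], [240.0]], np.float32))  # observed keypoint (x, y)
prediction = kf.predict()  # predicted state for the next frame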
5.2. TensorPose Architecture

TensorPose was developed with the aim of being a platform for the development of 2D and 3D applications that involve motion capture. In this section, we describe the general aspects of its architecture, as well as the framework for network communication among different applications. Such applications can be computer animation, games, and also shared virtual experiences. One of our goals is to provide developers with a way to create environments for shared 2D or 3D experiences where users share a virtual environment, but not necessarily the same physical environment. The motion capture data can be used to track the activities of users in this kind of application. In our architecture, each module is independent in terms of video processing, inference of captured skeletons, temporal coherence, and information transmission. Figure 4 represents the architecture model.

As one can see in Figure 4, the architecture is composed of several modules. Next, we describe the functionality of each one.
Pose Client. This module is responsible for processing video data from different cameras. Each frame is prepared using the OpenCV library [34], where we also perform histogram equalization to improve the contrast of the images, and the data is sent to the inference module or the temporal coherence module. In this case, the resolution of the image is 1280x960, and the network resolution is 656x386 to fit in GPU memory.

Pose Inference. This neural network module receives data from the Pose Client and performs the inference of skeletons in an image. Also, post-processing is performed to solve the assignment problem. For each set of keypoints detected and for each person, a dictionary is created, which contains the 2D position of each body part and the connections between them. Each recognized person is given an ID, and their corresponding dictionary is added to a list of tracked persons that is sent to the module responsible for instantiating the objects that will represent the capture of those people.

Fig. 3. Results containing viewpoint variation and occlusion, which are common characteristics in images. Images from the COCO dataset, the Rio 2016 Olympics, and the Brazilian national basketball league.

Zed SDK. In this stage, we use the ZED stereo camera SDK [35][36] for 3D data capture. This module was developed using the Python API, where data from a 2D image is sent to the Pose Client to be processed and the 3D data is sent to the depth module.

Depth Front. This module is responsible for processing the depth data of one or more ZED cameras and receiving intrinsic camera data. The ZED has two cameras separated by 12 cm, which capture a high-resolution 3D video of the scene and estimate depth and motion by comparing the displacement of pixels between the left and right images. This module stores a distance value Z for each pixel in the image.

2D/3D Positions. This part of the architecture receives data from the positions inferred by the network and creates objects identifying each detected person. In the 2D case, it keeps the received positions in image coordinates. In the 3D case, it receives data from the Depth Front and transforms 2D coordinates into positions in the 3D world. It is also responsible for passing information to the temporal coherence module, processing the Kalman filter, identifying the points to be tracked, and ensuring the reference of the detected skeletons.

Temporal Coherence. This module tracks the points using the optical flow algorithm. It receives the initial data detected by the network and, every t frames, performs the tracking of the points.

Holojam Emitter. Uses the Holojam platform to send the captured data over the network. It is a Python client that communicates with the Holojam Server. The Holojam platform is described in the following section.
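As an illustration of the 2D-to-3D step performed by the 2D/3D Positions module, a standard pinhole back-projection using the camera intrinsics (fx, fy, cx, cy) and the per-pixel depth Z from the Depth Front looks as follows; names are illustrative, and this is a sketch, not the ZED SDK API.

def backproject(u, v, Z, fx, fy, cx, cy):
    """Map image pixel (u, v) with depth Z to camera-space (X, Y, Z)."""
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return X, Y, Z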
Fig. 4. TensorPose Architecture
6. Virtual environment communication with Holojam

Holojam is a virtual space sharing platform developed by the Future Reality Lab at New York University [37]. This platform consists of the Holojam Node library and the Holojam SDK project. It enables content creators to build complex location-based multiplayer VR experiences in a simple and unified Unity project. The development framework provides an extensible and clean interface, allowing rapid prototyping. Additionally, it abstracts away specific VR hardware, promoting flexible and customizable creation of virtual reality experiences. We use the Holojam protocol framework in our architecture to communicate among applications with very low latency, and to send motion capture information through the network, expanding its use beyond VR applications.
6.1. Holojam Node

We use Holojam Node, a client-server library developed in Node.js and targeted at applications running on a local network or over the web. One of its main characteristics is low-latency communication between the various clients and the server. It consists of the following components: relay, emitters, sinks, and clients.

The relay component is responsible for routing the messages (updates and events) in a Holojam network. It acts as a central server, collecting data received through a preconfigured (upstream) address, as well as broadcasting this data through a multicast (downstream) address.

In addition to the relay, a Holojam network also has several nodes, called endpoints. Endpoints are either emitters, which only send upstream data; sinks, which only receive data through the downstream; or clients, which receive downstream data and send upstream data. Holojam Node also provides a WebSocket interface through which one can receive and transmit data over the web.
6.2. Holojam Protocol

Packets routed through a Holojam network are either updates or events. An update is essentially an array of flakes (generic Holojam objects). An event has an array containing only one flake. There is also the notification, an abstraction of an event that includes just the label.
6.3. Holojam Objects

Holojam provides two types of objects: Nuggets and Flakes. These objects are defined through Google's FlatBuffers Interface Description Language (IDL). We use FlatBuffers to serialize data, as well as to stream data, because it has very low processing overhead. The structure of the objects used is presented below.

Nugget composition:

• It is of type event or update; the default is update.

• It is mandatory to have an array of flakes.
enum NuggetType : byte { UPDATE, EVENT }
// Message (update or event)
table Nugget {
scope : string; // Namespace
origin : string; // Source
type : NuggetType = UPDATE;
flakes : [Flake] (required); // Data array
}
Flake composition:

• A label is mandatory.
table Flake { // Data container
label : string (required);
// Optional data
vector3s : [Vector3];
vector4s : [Vector4];
floats : [float];
ints : [int];
bytes : [ubyte];
text : string;
}
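As an illustration, a keypoint Flake could be serialized in Python roughly as follows. This sketch assumes modules generated by flatc from the IDL above; the generated function names follow the classic FlatBuffers Python codegen style and are an assumption, not our exact serializer.

import flatbuffers
import Flake  # flatc-generated module (assumption)

builder = flatbuffers.Builder(1024)
label = builder.CreateString("skeleton_0")  # one Flake per tracked skeleton
Flake.FlakeStart(builder)
Flake.FlakeAddLabel(builder, label)
flake = Flake.FlakeEnd(builder)
builder.Finish(flake)
payload = builder.Output()  # bytes ready to send upstream to the relay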
6.4. Holojam SDK

The Holojam SDK project is a Unity3D project containing all the elements needed to create a multiplayer virtual reality application. It provides a C# API that applications can use or extend, allowing rapid prototyping. Also, it abstracts the configuration of the VR hardware used in the project. One of its main components is the implementation of the Holojam client in C#; thus, all Holojam objects can be integrated easily into a Unity project.
6.5. Holojam and TensorPose applications

All components of the Holojam platform were used in the TensorPose project applications, in 2D applications as well as in 3D and VR applications. Since TensorPose is implemented in Python, the components needed to use Holojam in TensorPose were also implemented in Python, including a serializer for the keypoints and the emitter. The serialization of the obtained keypoints was implemented from the Holojam FlatBuffers IDL. With this, it is possible to send the keypoints to the relay, giving access to the data to any sink/client/WebSocket connected to this relay.

Figure 5 shows the components of the various applications created in the TensorPose project and which other components they are related to.
Fig. 5. TensorPose network infrastructure. The Relay component is responsible for synchronizing and sending information over the network to the clients of our applications.
7. Experiments

Our proposal was tested and compared with the results obtained by OpenPose 1.3. The TensorPose code was implemented in Python, including the post-processing steps, such as the line integral calculation and the assignment problem. In training, 110,000 images were used, as well as 5,000 images in validation, with a maximum of 200 iterations per epoch. In the COCO dataset, approximately 150,000 people and 1.7 million labeled keypoints are available. Considering the inference, we use the COCO validation dataset to compare our modified model and OpenPose. COCO uses a metric called Object Keypoint Similarity (OKS), which measures how close a predicted keypoint is to the ground truth annotation. Also, COCO uses the mean average precision (AP) and average recall (AR) over 10 OKS thresholds as the main competition metrics [6]. In our initial tests, we consider precision and recall at 50% and 75% OKS (AP-50, AP-75, AR-50, AR-75), which are usually used for benchmarks. We test both models with the evaluation data from COCO and submit the results to the COCO evaluation server. Equation 16 shows the formula used to calculate the OKS:

OKS = \frac{\sum_i e^{-d_i^2 / (2 s^2 k_i^2)} \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}, \qquad (16)
where the d_i are the Euclidean distances between each detected keypoint and the ground truth, the v_i are the visibility flags of each annotated keypoint, and s × k (scale × keypoint constant) is related to the standard deviation of a Gaussian defining concentric circles whose radius varies by keypoint type. The metrics AP-50, AP-75, AR-50, and AR-75 relate to how close a prediction is to the annotated data considering the Gaussian standard deviation. The s × k term is computed by evaluating this Gaussian function, centered on the ground-truth position of a keypoint, with a standard deviation specific to the keypoint type, scaled by the area of the instance, measured in pixels [38]. Table 1 shows our results in contrast to OpenPose, and Figure 6 shows a visual comparison between the two techniques. Also, varying the OKS threshold from 0.5 to 0.95, we measure
both precision and recall for the COCO validation dataset, as we can see in Figure 7, where higher values mean that our predictions are closer to the ground truth annotations at each threshold.
Model        AP 0.50   AP 0.75   AR 0.50   AR 0.75
OpenPose     0.780     0.593     0.807     0.647
TensorPose   0.750     0.536     0.772     0.591

Table 1. Precision and recall for the OpenPose and TensorPose models.
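For reference, Equation 16 can be computed for a single detection with a few lines of NumPy. Names are illustrative, and this is a simplified sketch of the official COCO evaluation, not its implementation.

import numpy as np

def oks(pred, gt, vis, s, k):
    """pred, gt: (J, 2) keypoints; vis: (J,) visibility flags; k: (J,) constants."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances
    e = np.exp(-d2 / (2 * (s ** 2) * (k ** 2)))  # per-keypoint similarity
    labeled = vis > 0                            # delta(v_i > 0)
    return e[labeled].mean() if labeled.any() else 0.0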
Here, a higher OKS means a higher overlap between predicted keypoints and the ground truth. As we can see, our model has about 6.5% lower accuracy than OpenPose, which is mainly caused by our proposed simplification of the intermediary layers. Recall that this simplification has the advantage of requiring 9.3 times fewer operations in these layers than the original OpenPose. We focus on real-time applications, so this accuracy loss does not represent a significant problem, since some prediction errors are acceptable.
Following the benchmarking and error diagnosis for pose estimation proposed by Ruggero et al. [38], we analyze 4 types of localization errors using the annotations of the COCO dataset:

• Jitter: small position error around the correct keypoint location;

• Miss: large localization error, when the detected keypoint is far away from any body part;

• Inversion: confusion between semantically similar parts of a detected person, i.e., wrong body part associations for the same person due to problems like self-occlusion, for example;

• Swap: confusion between semantically similar parts of different persons, i.e., problems related to the association of body parts of a detected person with another.
Table 2 shows the totals of correct predictions and errors, and Table 3 presents the estimated error associated with each detected joint for our model. As we can see, most of the errors relate either to a small difference between the predicted keypoint and the annotated data, or to significant localization errors, which we believe are due to detected false positives. We can also see that most localization errors affect keypoints that are more sensitive to occlusion or that usually involve some interaction between people, such as the arms.
Total num. keypoints: 62790   Abs     Percentage
Good Prediction               45300   72.1453%
Jitter                        8039    12.80%
Inversion                     1519    2.41%
Swap                          613     0.976%
Miss                          7319    11.656%

Table 2. Totals of correct predictions and errors for the COCO validation dataset, in absolute values and percentages.
Keypoints    jitter %   inversion %   swap %   miss %
nose         9.6        0             2.4      5
eyes         16         4.8           4.2      7.4
ears         13.1       0.3           3.6      6.1
shoulders    12.8       12            21.9     7.1
elbows       12.6       4.6           17.5     14.3
wrists       9.7        10.8          15.2     21.8
hips         13.5       24.2          14       11.7
knees        7.1        22.7          11.1     13.4
ankles       5.5        20.6          10.1     13.2

Table 3. The frequency of localization errors over all the predicted keypoints. We have 62790 joints detected in 5000 images in this test.

In terms of runtime performance, we begin with the test of the CNN processing. Considering just one image
with 23 people, we compared our approach with OpenPose. The tests were performed on an Nvidia RTX 2060 GPU with 6 GB of memory, where we vary the scale of the tested image by a factor of 0.5 and repeat each test 1000 times. As we can see in Table 4, on average our approach performs better than OpenPose.
Image Resolution   OpenPose      TensorPose
328 x 193          ~55.08 ms     ~54.86 ms
656 x 386          ~120.22 ms    ~84.25 ms
984 x 579          ~174.75 ms    ~116.66 ms
1312 x 772         ~320.18 ms    ~210.41 ms

Table 4. CNN processing time for OpenPose 1.3 and TensorPose. We vary the scale of the tested image by a factor of 0.5, considering 4 image scales.
When processing videos, to avoid running the CNN, calculating the line integrals, and executing the Hungarian algorithm at every step, we use sparse optical flow. We get not only performance improvements, with a frame rate above 30 FPS, but also smoothness in motion tracking, thanks also to the Kalman filter. In the first test, we defined a fixed interval of 10 frames during which the optical flow is processed; afterward, its parameters are updated by a new processing step in the neural network. This interval was defined empirically, as we saw no significant difference in tracking accuracy. Our strategy, considering both the changes in the network and the use of optical flow during t frames, reduces the overall tracking time and increases the frame rate, as we can see in Table 5.
Similarly to OpenPose, we analyzed some limitations where our approach fails. We face the same problems with highly crowded images where people overlap: the approach tends to merge annotations from different people while missing others [6]. Also, our application has less precision than previous work due to the factorized layers; it has more difficulty detecting joints under occlusion, with slight noise in the inferred data. As we can see in Table 1, this is not a significant problem, as the results are comparable and we still have a good performance gain.
Subsequently, with the trained network, we defined an architecture for the creation of motion capture applications, connecting different cameras and applications, such as scenes in the Unity3D engine and web applications.
Fig. 6. Comparison between the OpenPose model and ours. The first figure is generated by OpenPose and the second by our model. As we can see, our model is slightly less accurate than the original work and fails to find the exact position of a body joint in particular cases. Due to the nature of our application, this issue can be considered acceptable. Image from the COCO dataset.
GPU                CPU                                 OpenPose     TensorPose
Nvidia Titan RTX   Intel Core i9 @ 3.3 GHz             ~11 FPS      ~36.8 FPS
Nvidia Tesla P40   Intel Xeon E5-2630 v3 @ 2.40 GHz    ~10 FPS      ~34.43 FPS
Nvidia TITAN Xp    Intel Core i7 @ 3.3 GHz             ~8.28 FPS    ~27 FPS
Nvidia RTX 2060    AMD Ryzen 7 1700 @ 3.2 GHz          ~6.121 FPS   ~18.27 FPS
Nvidia GTX 960     Intel Xeon E5-2630 v3 @ 2.40 GHz    ~3.73 FPS    ~12 FPS

Table 5. Frame rate comparison between OpenPose 1.3 and TensorPose. Our model achieves approximately 3x the performance of the OpenPose model used as a base.
[Figure 7 plot: precision vs. recall curves, areaRng: [all], maxDets: [20]; AP at OKS 0.50: .750, 0.55: .726, 0.60: .691, 0.65: .645, 0.70: .596, 0.75: .536, 0.80: .454, 0.85: .347, 0.90: .228, 0.95: .099]
Fig. 7. Precision and recall for all OKS thresholds. areaRng refers to the area of the annotated objects with size bigger than 32² pixels.
The initial module was developed for 2D applications where capture can be done with ordinary cameras such as webcams. Each frame is captured and sent to the detection module, which is in charge of performing the inference and detection of the people in the frame. Subsequently, the data of the captured persons is sent over the network using the Holojam framework [37].
In addition to 2D capture, tests involving stereo cameras were also conducted. We used the ZED camera to map three-dimensional world coordinates from two-dimensional image coordinates. The skeletal position is initially processed in the same way but, in a later step, it is transformed into 3D space using intrinsic camera data and depth information. As in the previous process, the capture and processing modules are independent, with Holojam also being used to send the information.
Some applications were developed as proofs of concept using the Unity 3D engine as well as WebGL. Regarding applications in Unity3D, the data is sent over the network by the Holojam Server using multicast. This application was developed for the use of multiple devices, but it is also possible to use unicast connections. Regarding the 3D scene in the Unity engine, "Holojam Unity" objects implement support for multiplayer, which includes communication with the Holojam Server, components with position recognition, and Avatar presence. These objects have a script component called "Holojam Network", which contains fields indicating how to communicate with the server.
The "Multicast Address" field indicates the IP address used to receive data from the server. This field is not relevant if there is already a unicast connection between this client and the server since, in this case, the client receives the data directly from the server. The "Server Address" field can be used to indicate the IP of the server directly. This address is used to send data to the server; if the indicated address is not local (not starting with 192), Holojam pings this address, which creates a connection between the server and this client. All data that is sent and received uses an update data structure; when Holojam sends an object, it uses the label "SendData" for such an update.
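A hedged sketch of a client joining such a multicast group, using the standard socket recipe; the group address and port below are placeholders, not the actual Holojam defaults, and the decoding of the update structure is omitted.

  import socket
  import struct

  MCAST_GRP, MCAST_PORT = "224.1.1.1", 9591  # placeholder group/port

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  sock.bind(("", MCAST_PORT))
  # Join the multicast group on all interfaces.
  mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
  sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

  while True:
      data, _ = sock.recvfrom(65535)  # one update per datagram
      # ... decode the update structure and hand it to the scene ...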
the tracked data of the real person using inverse kinematics. A6
manager implemented in the application reconstructs the entire7
body of the Actor from pairable elements corresponding to the8
connections between hands, elbows, shoulders, neck, head, and9
so on. The reconstructed data are used as targets for a 3D model10
in the scene. Figure 8 shows the Unity3D application.11
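As an illustration of this pairing step, the sketch below rebuilds limb segments from tracked joints; the bone list and joint names are assumptions for illustration, not the exact TensorPose/Holojam schema.

  # Pairable elements: each bone connects a parent joint to a child joint.
  BONES = [
      ("neck", "head"), ("neck", "left_shoulder"), ("neck", "right_shoulder"),
      ("left_shoulder", "left_elbow"), ("left_elbow", "left_hand"),
      ("right_shoulder", "right_elbow"), ("right_elbow", "right_hand"),
  ]

  def build_targets(joints):
      """joints maps a joint name to its tracked 3D position (or None)."""
      targets = {}
      for parent, child in BONES:
          # Only keep segments whose two endpoints were both tracked.
          if joints.get(parent) is not None and joints.get(child) is not None:
              targets[(parent, child)] = (joints[parent], joints[child])
      return targets  # fed to the IK solver as end-effector targets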
OpenPose also has a plugin for Unity 3D, although it is limited to running the inference locally and does not achieve the necessary performance. Their plugin often encounters problems on low-cost GPUs, where it consumes a lot of memory and frequently must be run in CPU mode. Also, unlike our application, their plugin does not deal with the possibility of creating shared virtual experiences where the physical presence of users in the same environment is not necessary. In our proposal, we can have a server dedicated to inference, without the need to make intensive use of the local machine. The entire project was developed considering network communication using the low-latency model of Holojam. All modules in Figure 4 were developed to be independent.
Similarly, a web application with a unicast connection was developed, as we can see in Figure 9. The entire application was written in JavaScript using WebGL. A script called "Manager" was implemented, responsible for receiving the data sent by the server, considering each object that identifies each tracked person and the position of their skeleton. The Manager instantiates objects called characters for the identification of actors, while also being in charge of updating their positions at each frame. It also manages the removal of actors from the scene if they are no longer in it or if the track has been lost. The characters are then rendered as a ragdoll using the position of each part of their skeleton directly, without inverse kinematics in this case.
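The Manager's bookkeeping can be summarized as follows; the actual client is written in JavaScript, so this Python sketch with a hypothetical Character object is only meant to convey the create/update/remove cycle.

  class Manager:
      def __init__(self, make_character):
          self.characters = {}
          self.make_character = make_character

      def on_update(self, update):
          seen = set()
          for person in update["people"]:
              pid = person["id"]
              seen.add(pid)
              # Instantiate a character the first time an actor appears.
              if pid not in self.characters:
                  self.characters[pid] = self.make_character(pid)
              self.characters[pid].set_skeleton(person["keypoints"])
          # Remove actors that left the scene or whose track was lost.
          for pid in list(self.characters):
              if pid not in seen:
                  self.characters.pop(pid).destroy()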
8. Conclusion
In this work we proposed a novel deep neural network with a streamlined architecture and tensor decomposition for pose estimation with improved processing time, named TensorPose. We adopted it in a real-time, multi-user, interactive motion capture application. Considering our results, we show an efficient optimization of the CNN model, since we achieved performance above 30 FPS, which in most cases represents 3× the performance of the original work. The accuracy of keypoint detection proves to be slightly lower. However, this does not represent a major problem in our applications, since our primary objective is the performance gain in terms of FPS. As we showed in Figure 3, some failure cases were detected, but they do not represent major problems in our applications. We also present statistics for each kind of error detected. As a proof of concept, we present two main applications, using Unity3D and WebGL, integrated with the Holojam platform. We show that it is possible to create shared virtual experiences using motion capture with TensorPose. As future work, we expect to adapt our CNN architecture for use on low-power devices [28]. To do so, we must continue to explore computational cost reduction. We intend to explore strategies to create a complete form of the factored network, in which all layers are in factored form, addressing the stability error by using different types of weight initialization.
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. The authors also thank CAPES and CNPq for the partial support of this research.
References
[1] Ranjan, R, Patel, VM, Chellappa, R. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019;41(1):121–135.
[2] Sindagi, VA, Patel, VM. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters 2018;107:3–16.
[3] Ge, L. Real-time 3D hand pose estimation from depth images. Ph.D. thesis; 2018.
[4] Schwarcz, S, Pollard, T. 3D human pose estimation from deep multi-view 2D pose. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018, p. 2326–2331.
[5] Lin, M, Lin, L, Liang, X, Wang, K, Cheng, H. Recurrent 3D pose sequence machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, p. 810–819.
[6] Cao, Z, Simon, T, Wei, SE, Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR. 2017.
[7] Luo, Y, Ren, J, Wang, Z, Sun, W, Pan, J, Liu, J, et al. LSTM pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 5207–5215.
[8] Mehta, D, Sridhar, S, Sotnychenko, O, Rhodin, H, Shafiei, M, Seidel, HP, et al. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) 2017;36(4):44.
[9] Ramakrishna, V, Munoz, D, Hebert, M, Bagnell, JA, Sheikh, Y. Pose machines: Articulated pose estimation via inference machines. In: European Conference on Computer Vision. Springer; 2014, p. 33–47.
[10] Wei, SE, Ramakrishna, V, Kanade, T, Sheikh, Y. Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, p. 4724–4732.
[11] Song, J, Wang, L, Van Gool, L, Hilliges, O. Thin-slicing network: A deep structured model for pose estimation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, p. 4220–4229.
[12] Shafaei, A, Little, JJ. Real-time human motion capture with multiple depth cameras. In: 2016 13th Conference on Computer and Robot Vision (CRV). IEEE; 2016, p. 24–31.
[13] Tome, D, Toso, M, Agapito, L, Russell, C. Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In: 2018 International Conference on 3D Vision (3DV). IEEE; 2018, p. 474–483.
[14] Kanazawa, A, Black, MJ, Jacobs, DW, Malik, J. End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 7122–7131.
[15] Loper, M, Mahmood, N, Romero, J, Pons-Moll, G, Black, MJ. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) 2015;34(6):248.
[16] Lin, TY, Maire, M, Belongie, S, Hays, J, Perona, P, Ramanan, D, et al. Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. Springer; 2014, p. 740–755.
[17] Howard, AG, Zhu, M, Chen, B, Kalenichenko, D, Wang, W, Weyand, T, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 2017.
Fig. 8. Unity3D application with TensorPose. One of our use cases is the development of applications that make use of shared virtual environments.
[18] Wang, M, Liu, B, Foroosh, H. Factorized convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, p. 545–553.
[19] Jin, J, Dundar, A, Culurciello, E. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474 2014.
[20] Tucker, LR. Some mathematical notes on three-mode factor analysis. Psychometrika 1966;31(3):279–311.
[21] Kim, YD, Park, E, Yoo, S, Choi, T, Yang, L, Shin, D. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530 2015.
[22] Kolda, TG, Bader, BW. Tensor decompositions and applications. SIAM Review 2009;51(3):455–500.
[23] Smith, S, Karypis, G. Accelerating the Tucker decomposition with compressed sparse tensors. In: European Conference on Parallel Processing. Springer; 2017, p. 653–668.
[24] Cichocki, A, Zdunek, R, Phan, AH, Amari, SI. Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons; 2009.
[25] De Lathauwer, L, De Moor, B, Vandewalle, J. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 2000;21(4):1253–1278.
[26] Symeonidis, P, Nanopoulos, A, Manolopoulos, Y. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering 2010;22(2):179–192.
[27] Sandler, M, Howard, A, Zhu, M, Zhmoginov, A, Chen, LC. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, p. 4510–4520.
[28] Zhou, S, Vinh, NX, Bailey, J, Jia, Y, Davidson, I. Accelerating online CP decompositions for higher order tensors. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2016, p. 1375–1384.
Fig. 9. TensorPose web test.
[29] Tsai, D, Flagg, M, Nakazawa, A, Rehg, JM. Motion coherent tracking using multi-label MRF optimization. International Journal of Computer Vision 2012;100(2):190–202.
[30] Kim, JS, Hwangbo, M, Kanade, T. Realtime affine-photometric KLT feature tracker on GPU in CUDA framework. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. IEEE; 2009, p. 886–893.
[31] Senst, T, Geistert, J, Sikora, T. Robust local optical flow: Long-range motions and varying illuminations. In: IEEE International Conference on Image Processing. Phoenix, AZ, USA: IEEE; 2016, p. 4478–4482. DOI: 10.1109/ICIP.2016.7533207.
[32] Chui, CK, Chen, G, et al. Kalman Filtering with Real-Time Applications. Springer; 2017.
[33] Babenko, B, Yang, MH, Belongie, S. Visual tracking with online multiple instance learning. In: CVPR. 2009.
[34] Bradski, G, Kaehler, A. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc.; 2008.
[35] Burbano, A, Vasiliu, M, Bouaziz, S. 3D cameras benchmark for human tracking in hybrid distributed smart camera networks. In: Proceedings of the 10th International Conference on Distributed Smart Camera. ACM; 2016, p. 76–83.
[36] Gupta, T, Li, H. Indoor mapping for smart cities - an affordable approach: Using Kinect sensor and ZED stereo camera. In: 2017 International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE; 2017, p. 1–8.
[37] Masson, T, Perlin, K, et al. Holo-doodle: An adaptation and expansion of collaborative Holojam virtual reality. In: ACM SIGGRAPH 2017 VR Village. ACM; 2017, p. 9.
[38] Ruggero Ronchi, M, Perona, P. Benchmarking and error diagnosis in multi-instance pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, p. 369–378.