
DRO (Deakin Research Online), Deakin University’s Research Repository. Deakin University CRICOS Provider Code: 00113B

An adaptable system for RGB-D based human body detection and pose estimation: incorporating attached props

Citation of the final article: Haggag, H., Hossny, M., Nahavandi, S. and Haggag, O. 2017, An adaptable system for RGB-D based human body detection and pose estimation: incorporating attached props, in SMC 2016 : IEEE International Conference on Systems, Man and Cybernetics, IEEE, Piscataway, N.J., pp. 1544-1549.

Published in its final form at https://doi.org/10.1109/SMC.2016.7844458.

This is the accepted manuscript.

© 2016, IEEE

Downloaded from DRO: http://hdl.handle.net/10536/DRO/DU:30094570


An Adaptable System for RGB-D based Human Body Detection and Pose Estimation: Incorporating Attached Props

H. Haggag, M. Hossny, S. Nahavandi, O. Haggag

Centre for Intelligent Systems Research, Deakin University, Australia

Abstract—One of the biggest challenges of RGB-D posture tracking is separating appendages such as briefcases, trolleys and backpacks from the human body. Markerless motion tracking relies on segmenting each depth frame into a finite set of body parts. This is achieved via supervised learning by assigning each pixel to a certain body part. The training image sets for the supervised learning are usually synthesised using popular motion capture databases and an ensemble of 3D models covering a wide range of anthropometric characteristics. In this paper, we propose a novel method for generating training data of human postures with attached objects. The results show a significant increase in body-part classification accuracy for subjects with props, from 60% to 94%, when using the generated image set.

Index Terms—RGB-D, Kinect, Pixel Labelling, Props

I. INTRODUCTION

Microsoft Kinect was introduced in 2010 as an external gaming peripheral for the Xbox 360. The technology of this RGB-D camera is patented by PrimeSense. The Microsoft Kinect, shown in Fig. 1, allowed people to play and interact with their console without carrying any controllers or needing a calibration pose. Nowadays, RGB-D cameras are used in areas far beyond gaming. The size [1], price [2] and accuracy [3]–[6] of these RGB-D cameras make them popular in many other research areas such as work safety and ergonomic assessments [4], [7], [8], biometrics [9], [10], simulated crowd interactions [11], human machine interaction [12], [13] and 3D scanning [14]–[16].

Human body tracking and labelling is closely associated with various research areas such as medicine, sports and ergonomics. Commercial depth cameras such as the Microsoft Kinect have attracted significant attention for achieving markerless motion tracking in these disciplines. Data generation pipelines such as Shotton’s [17] and Buys’ [18] are the main drivers behind the success of this technology. Both pipelines promote a data-intensive supervised machine learning paradigm by optimising a set of weak classifiers and training on a large number of synthetic postures.

The biggest challenge in markerless motion capture using RGB-D cameras, such as the Kinect, is attached props. In motion capture, a prop is an external object that interacts with the subject and depends on his or her tracked motion. In markerless motion capture, not only is the prop misrecognised, but it also interferes with other body parts and degrades their overall motion tracking. In marker-based (e.g. VICON) and gyroscopic (e.g. XSENS [19]) motion tracking solutions, the motion of the prop is tracked by placing a dedicated marker or a gyroscopic sensor on the prop. By definition, this solution is not feasible in markerless motion tracking. Therefore, the props must be integrated into the training algorithms for markerless systems.

Fig. 1: The Microsoft Kinect depth sensor features a depth camera, RGB camera, infrared sensor, tilting motor and four microphones.

A. Contribution

The contribution presented in this work is an extension of Buys’ framework presented in [18] to accommodate attached props. The modified framework generates virtualised training depth and label images for all body parts and an external object. We considered three types of bags (backpack, briefcase and trolley).

The rest of this paper is organised as follows. Section II recaps the posture generation procedure described in [17], [18]. Section III describes the method used to add and animate the props. Section IV describes the experiments and results. Finally, Section V concludes and outlines future work.

II. RGB-D POSTURE GENERATION

Collecting large training datasets from individuals with different anthropometric measures while performing different activities is both expensive and time consuming. To overcome this problem, real pre-captured motion data is used to animate a virtual human model [17], [18]. By doing this, different activities can be mapped to many human models with various mesh topologies. As illustrated in Fig. 2, the training pipeline is fairly simple. First, a set of digital human models is generated with parameters spanning a wide range of anthropometric characteristics. The next step is to articulate the 3D models into postures by animating them using captured Biovision Hierarchy (BVH) motion, as shown in Fig. 2-a. Both Shotton and Buys chose the CMU motion capture database [20]. The result of this step is a set of synthesised depth-rendered posture images and a corresponding set of labelled images, as demonstrated in Fig. 2-a. Both image sets are used in the random decision forest (RDF) learning module. The RDF is trained on dense representations obtained using the computationally efficient dynamic range feature extractor [17]. For each pixel p in a depth image D, a feature is extracted using

f_{\Theta}(D, p) = d_D\left(p + \frac{o_1}{d_D(p)}\right) - d_D\left(p + \frac{o_2}{d_D(p)}\right) \qquad (1)

where d_D(p) is the depth value of pixel p, and Θ = (o_1, o_2) are the location offsets of two other randomly chosen neighbouring pixels. The forest training algorithm is formulated as a MapReduce problem, allowing batch and scalable training. The training result is an ensemble of uncorrelated decision trees, each of which determines the probability distribution of an unseen depth pixel over a set of possible body parts. In order to discriminate appendages or props from the tracked human posture, new depth and label image sets must be generated, with depth images featuring different human postures with props.
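To make Eq. (1) concrete, the sketch below computes the feature for a single pixel with NumPy. It is a minimal illustration rather than the authors’ implementation: the offset values, the depth units and the large constant substituted for background or out-of-bounds probes are assumptions.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # large depth used for background/out-of-bounds probes (assumption)

def depth_at(D, u, v):
    """Depth lookup that returns a large constant for out-of-bounds probes."""
    h, w = D.shape
    if 0 <= v < h and 0 <= u < w:
        return D[v, u]
    return BACKGROUND_DEPTH

def feature(D, p, theta):
    """Depth-difference feature f_theta(D, p) of Eq. (1).

    D     : (H, W) depth image, e.g. in millimetres
    p     : (u, v) pixel coordinates
    theta : pair of offsets (o1, o2), each scaled by 1/d_D(p)
    """
    o1, o2 = theta
    u, v = p
    d_p = depth_at(D, u, v)
    u1, v1 = u + int(round(o1[0] / d_p)), v + int(round(o1[1] / d_p))
    u2, v2 = u + int(round(o2[0] / d_p)), v + int(round(o2[1] / d_p))
    return depth_at(D, u1, v1) - depth_at(D, u2, v2)

# Toy example: a flat wall at 2 m with a closer "body" region at 1.2 m.
D = np.full((240, 320), 2000.0)
D[100:180, 140:200] = 1200.0
print(feature(D, (195, 140), ((8000.0, 0.0), (-8000.0, 0.0))))  # large response at the body edge
```

Because the offsets are divided by the depth at p, the probe pattern shrinks for distant subjects and grows for near ones, which keeps the feature approximately invariant to the subject’s distance from the camera [17].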

III. ADDING AND ANIMATING PROPS

Segmenting the body parts of a human subject with external props faces three major challenges. The first challenge is related to the feature: the calculated features of the prop are usually influenced by the neighbouring body part. This usually happens when the prop is adjacent to a large body part. The second challenge is related to the overlaid depth pixels between body parts and the attached prop. These two challenges are addressed by incorporating the props into the training depth and label images. This leads us to the third challenge: rendering a realistic animation of the props according to the captured human motion. We have considered three levels of interaction between the attached prop and the human body. The first level is constant contact between the attached prop, such as a backpack, and multiple body parts. The second level features a constant attachment between a prop like a briefcase and one extremity (a hand). The third level features minimal contact between the prop and the body parts, such as with trolleys.

Animating props can be achieved in one of three approaches. The first approach is recapturing the motion of human subjects performing activities with props. This usually takes place by attaching markers to both the human subject and the prop. The second approach is to manually animate the props according to the human motion. However, this approach relies on the subjective viewpoints of 3D animation artists. Both approaches are expensive and time consuming.

The third approach, implemented in this work, is to employ a rigid body simulator to automatically animate the props according to the captured motion. The first step in this approach is modifying the BVH skeleton by adding an extra limb representing the new prop and fitting it with the prop’s 3D model. The rigid body simulation is then used to articulate the motion of the attached prop according to the captured human motion. Physics engine parameters such as the mass, friction and velocity of the moving body parts contribute to the position and orientation of the prop.

Fig. 2: The RGB-D posture generation pipeline articulates the anthropometric model using the pre-captured motion to create depth/label image pairs in (a). The 3D models of the props are animated using rigid body simulation in (b). The new synthesised depth/label image pairs with props are shown in (c).
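The paper does not name a specific physics engine, so the following is only an illustrative sketch using the open-source PyBullet engine: a zero-mass (kinematic) “hand” body replays a placeholder captured trajectory, and a dynamic box standing in for a briefcase is attached to it with a point-to-point constraint so the simulator resolves the prop’s position and orientation each frame. Shapes, masses and the trajectory are all assumptions.

```python
import pybullet as p

p.connect(p.DIRECT)                  # headless simulation (no GUI)
p.setGravity(0, 0, -9.81)
p.setTimeStep(1.0 / 120.0)           # simulate at 120 Hz (assumption)

# Kinematic "hand": zero mass, moved directly from the captured motion.
hand_shape = p.createCollisionShape(p.GEOM_SPHERE, radius=0.05)
hand = p.createMultiBody(baseMass=0.0,
                         baseCollisionShapeIndex=hand_shape,
                         basePosition=[0.0, 0.0, 1.0])

# Dynamic prop: a box standing in for the briefcase mesh.
prop_shape = p.createCollisionShape(p.GEOM_BOX, halfExtents=[0.20, 0.05, 0.15])
prop = p.createMultiBody(baseMass=2.0,
                         baseCollisionShapeIndex=prop_shape,
                         basePosition=[0.0, 0.0, 0.7])

# Attach the prop to the hand like a handle (point-to-point constraint).
p.createConstraint(hand, -1, prop, -1, p.JOINT_POINT2POINT,
                   [0, 0, 0], [0, 0, -0.05], [0, 0, 0.20])

# Replay a placeholder hand trajectory and record the simulated prop poses.
trajectory = [[0.005 * t, 0.0, 1.0] for t in range(240)]
prop_poses = []
for hand_pos in trajectory:
    p.resetBasePositionAndOrientation(hand, hand_pos, [0, 0, 0, 1])
    p.stepSimulation()
    prop_poses.append(p.getBasePositionAndOrientation(prop))

p.disconnect()
```

In the actual pipeline, the recorded poses would drive the prop’s 3D model before each synthetic depth/label frame is rendered, so the prop appears consistently in both the depth image and the label image.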


                Backpack        Briefcase       Trolley
Depth frames    (a)             (b)             (c)
Without prop    (d) 0.58416     (e) 0.60850     (f) 0.79047
With prop       (g) 0.94889     (h) 0.95593     (i) 0.93422

Fig. 3: Resulting similarity images of RDF trees evaluated on the depth frames in (a, b and c). The test was performed with 15-level trees trained without a prop in (d, e and f) and with a prop in (g, h and i). The similarity was calculated as the average (inverted) Hamming distance between corresponding pixels. The pixels of interest were masked to ignore the background.

IV. EXPERIMENTS AND RESULTS

The focus of our experiment is to assess the efficacy of training on postures with appendages. In this experiment we considered three scenarios. The first scenario is an external prop, such as a backpack, closely attached to the human body. In this scenario the prop and the human meshes collide with each other, and all three challenges arise: overlaid depth pixels, mesh collisions and misleading feature vectors. In the second scenario the external prop (briefcase) is attached to the human body and follows its animation. The challenge in this scenario is the overlay of the prop’s depth pixels on those of the occluded body part. The last scenario is an external prop, such as a trolley bag, following the human body yet not attached to it. Overlaid depth pixels and environmental physics parameters are the major challenges in this scenario.

We compared the recognition accuracy of different body parts with RDFs trained on images with and without props using the average Hamming distance [21]. Figure 3 shows the resulting similarity images of each scenario when using RDF trees (15 levels) trained with and without a prop. A confusion matrix was constructed to indicate the performance of the proposed method. As shown in Fig. 4, each column represents the distribution of testing input pixels among different body parts using RDFs trained without props (left) and with props (right). The results summarised in the confusion matrices demonstrate the advantage of training on human postures with props, as shown in Fig. 4-right. Training on human postures without props not only misclassifies the prop but also propagates errors to other body parts, as shown in Fig. 4-left.
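As a minimal sketch of the per-frame score used above (assuming the predicted and reference label images are integer arrays of the same size and that background pixels carry a dedicated label value; the names and the background convention are illustrative, not the authors’ code):

```python
import numpy as np

BACKGROUND = 0  # label value reserved for background pixels (assumption)

def masked_similarity(pred, ref):
    """Average inverted Hamming distance between two label images.

    Background pixels of the reference image are ignored, so the score is
    the fraction of foreground pixels assigned to the correct body-part
    (or prop) label.
    """
    mask = ref != BACKGROUND
    return np.mean(pred[mask] == ref[mask])

# Toy example: two 4x4 label images with a 2x2 foreground region.
ref = np.zeros((4, 4), dtype=int)
ref[1:3, 1:3] = [[1, 1], [2, 21]]   # body parts 1, 2 and a prop label 21
pred = ref.copy()
pred[2, 2] = 2                      # prop pixel misclassified as a body part
print(masked_similarity(pred, ref))  # 0.75
```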


[Fig. 4 panels: (a) Trained without Backpack, (b) Trained with Backpack, (c) Trained without Briefcase, (d) Trained with Briefcase, (e) Trained without Trolley, (f) Trained with Trolley. Each panel is a confusion matrix over the 21 labels: Head, Neck, RShoulder, LShoulder, RArm, LArm, RForearm, LForearm, RHand, LHand, Spine, LowerBack, RHipJoint, LHipJoint, RUpLeg, LUpLeg, RLeg, LLeg, RFoot, LFoot and Prop.]

Fig. 4: The confusion matrices representing the fraction of pixels assigned to each body part and prop of the input reference image when using 15-level random decision trees trained without a prop (left column) and with a prop (right column).


The similarity between the label images produced by trees trained with and without the prop and the input reference image was then measured by counting the number of pixels correctly assigned to each body part of the input reference image; the results are shown in Fig. 5. It is important to note that occluded body parts (due to the posture and the angle of the camera) were assigned a zero score in the confusion matrices because no pixels were assigned to them. This explains the poor accuracy for the shoulders and arms in both confusion matrices. The results presented in Fig. 3 and Fig. 4 show that the average accuracy increased from 60% to 94% by training on postures with props.

V. CONCLUSIONS

A new approach was proposed in which rigid body physics simulation is used to generate training data of the human body with attached props. Anthropometric human models, along with external props, were mapped to different animations of pre-captured data from the CMU motion capture database [20]. The results demonstrate successful recognition and identification of both the body parts and the attached props using this approach. Our approach records an average accuracy of 94% in identifying both the human body parts and the attached prop in all three scenarios. This rate can be further improved by designing an aggregated ensemble of RDFs to recognise different levels of posture detail. In order to achieve that, a special image fusion algorithm will be designed to fuse indexed images of the recognised labels from different RDFs, as described in [22].

ACKNOWLEDGMENT

This research was fully supported by the Centre for Intelligent Systems Research (CISR) at Deakin University. The motion data used in this research was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. The modified motion capture files from this work are available for download at http://www.deakin.edu.au/cisr/mocap

REFERENCES

[1] B. Lange, C. Y. Chang, E. Suma, B. Newman, A. S. Rizzo, and M. Bolas, “Development and evaluation of low cost game-based balance rehabilitation tool using the Microsoft Kinect sensor,” in IEEE Engineering in Medicine and Biology Society, 2011, pp. 1831–1834.

[2] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan, “Scanning 3D full human bodies using Kinects,” IEEE Transactions on Visualization and Computer Graphics, 2012, pp. 643–650.

[3] T. Dutta, “Evaluation of the Kinect sensor for 3-D kinematic measurement in the workplace,” Applied Ergonomics, 2012, pp. 645–649.

[4] H. Haggag, M. Hossny, D. Filippidis, D. Creighton, S. Nahavandi, and V. Puri, “Measuring depth accuracy in RGBD cameras,” in International Conference on Signal Processing and Communication Systems, 2013.

[5] K. Khoshelham and S. O. Elberink, “Accuracy and resolution of Kinect depth data for indoor mapping applications,” Sensors, 2012, pp. 1437–1454.

[6] H. Haggag, M. Hossny, S. Haggag, J. Xiao, S. Nahavandi, and D. Creighton, “LGT/VOT tracking performance evaluation of depth images,” in 9th International Conference on System of Systems Engineering (SOSE), 2014, pp. 284–288.

[7] H. Haggag, M. Hossny, S. Haggag, S. Nahavandi, and D. Creighton, “Safety applications using Kinect technology,” in IEEE International Conference on Systems, Man and Cybernetics (SMC), 2014, pp. 2164–2169.

Fig. 5: The number of pixels correctly assigned to each body part and prop of the input reference image when using 15-level random decision trees, for the (a) Backpack, (b) Briefcase and (c) Trolley scenarios.


[8] H. Haggag, M. Hossny, S. Nahavandi, and D. Creighton, “Real time ergonomic assessment for assembly operations using Kinect,” in 15th International Conference on Computer Modelling and Simulation (UKSim), 2013.

[9] M. Hossny, D. Filippidis, W. Abdelrahman, H. Zhou, M. Fielding, J. Mullins, L. Wei, D. Creighton, V. Puri, and S. Nahavandi, “Low cost multimodal facial recognition via Kinect sensors,” in The Land Warfare Conference, 2012, pp. 77–86.

[10] M. Zollhofer, M. Martinek, G. Greiner, M. Stamminger, and J. Süßmuth, “Automatic reconstruction of personalized avatars from 3D face scans,” Computer Animation and Virtual Worlds, 2011, pp. 195–202.

[11] L. Wei, V. Le, W. Abdelrahman, M. Hossny, D. Creighton, and S. Nahavandi, “Kinect crowd interaction,” in Asia Pacific Simulation Technology and Training, Adelaide, South Australia, 2012.

[12] K. F. Li, K. Lothrop, E. Gill, and S. Lau, “A web-based sign language translator using 3D video processing,” in 14th International Conference on Network-Based Information Systems, 2011, pp. 356–361.

[13] M. N. K. Boulos, B. J. Blanchard, C. Walker, J. Montero, A. Tripathy, and R. Gutierrez-Osuna, “Web GIS in practice X: a Microsoft Kinect natural user interface for Google Earth navigation,” International Journal of Health Geographics, 2011, p. 45.

[14] A. D. Wilson, “Using a depth camera as a touch sensor,” in ACM International Conference on Interactive Tabletops and Surfaces, 2010, pp. 69–72.

[15] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, “KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera,” in The 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 559–568.

[16] H. Haggag, M. Hossny, S. Haggag, S. Nahavandi, and D. Creighton, “Efficacy comparison of clustering systems for limb detection,” in 9th International Conference on System of Systems Engineering (SOSE), 2014, pp. 148–153.

[17] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, “Real-time human pose recognition in parts from single depth images,” Communications of the ACM, 2013, pp. 116–124.

[18] K. Buys, C. Cagniart, A. Baksheev, T. De Laet, J. De Schutter, and C. Pantofaru, “An adaptable system for RGB-D based human body detection and pose estimation,” Journal of Visual Communication and Image Representation, Elsevier, 2014, pp. 39–52.

[19] J. Tautges, A. Zinke, B. Krüger, J. Baumann, A. Weber, T. Helten, M. Müller, H.-P. Seidel, and B. Eberhardt, “Motion reconstruction using sparse accelerometer data,” ACM Transactions on Graphics, 2011, pp. 18:1–18:12.

[20] “CMU graphics lab motion capture database,” 2011.

[21] R. W. Hamming, “Error detecting and error correcting codes,” Bell System Technical Journal, 1950, pp. 147–160.

[22] M. Hossny, S. Nahavandi, and D. Creighton, “Color map-based image fusion,” in IEEE International Conference on Industrial Informatics (INDIN), 2008, pp. 52–56.
