2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops

Real-time Body Motion Analysis For Dance Pattern Recognition

Bernhard Kohn, Member IEEE, Aneta Nowakowska, Ahmed Nabil Belbachir, Member IEEE
AIT Austrian Institute of Technology GmbH
Donau-City-Str. 1/5.OG, 1220 Vienna, Austria
[email protected], [email protected], [email protected]

978-1-4673-1612-5/12/$31.00 ©2012 IEEE


Abstract

This paper presents an algorithm for real-time body motion analysis for dance pattern recognition using a dynamic stereo vision sensor. Dynamic stereo vision sensors asynchronously generate events upon scene dynamics, so that motion activities are segmented on-chip by the sensor. Using this sensor, body motion analysis and tracking can be performed efficiently. For dance pattern recognition we use a machine learning method based on the Hidden Markov Model. Emphasis is placed on the analysis of the suitability for use in embedded systems. For testing the algorithm we use a dance choreography consisting of eight different activities and a training set of 430 recorded activities performed by 15 different persons. A cross validation on the data reached an average recognition rate of 94%.

1. Introduction

Motivated by the goal of providing new systems for human-machine interaction, gesture recognition has been a well-investigated topic in the past years. Different technologies on which gesture recognition can be applied were established, including time-of-flight sensors and stereoscopic or structured-light cameras (e.g. Kinect). Many computer vision methods were studied for robust, real-time gesture and body motion recognition, including Hidden Markov Models [1, 2, 3], artificial neural networks [4, 5], decision trees [6] and support vector machines [7, 8].

A literature survey revealed that, in the field of gesture recognition, the Hidden Markov Model (HMM) exhibits the highest recognition rates. The HMM is applied in several different application areas such as speech recognition, email spam filtering, sign language recognition and many more. A well-known introduction to HMMs can be found in the paper by Rabiner and Juang [1].

Regarding gesture recognition based on HMMs, the best recognition rates reported so far are up to 93% [3, 9] for a small number of test cases (fewer than 100). Yamato et al. [2] reported a recognition rate of 96% when the training data includes the sequences of the test persons. They also noted that more realistic results are obtained with a leave-one-out cross validation, in which the test person is not included in the training data. Using a leave-one-out cross validation, Yamato et al. reached a recognition rate of 71%.

Most of these methods deal with processing a synchronous sequence of intensity images taken at equidistant time intervals. The employed methods mainly make use of object detection in single frames and of tracking the motion across the sequence of frames in a spatio-temporal representation of the posture.

In this paper an event-driven dynamic vision sensor [10, 11] is used, developed for capturing scene dynamics by asynchronously generating events upon relative light-intensity changes in the scene. This sensor generates a continuous stream of events representing the motion path of the scene dynamics at high temporal resolution. An advantage for real-time applications is the very low data volume of the sensor compared to a frame-based camera. As the sensor performs complete background subtraction on the fly, it is well suited for motion capturing and activity recognition of persons.

Several ideas on using this type of sensor for gesture recognition were presented at the Capo Caccia Cognitive Neuromorphic Engineering Workshops. A first publication regarding posture recognition was published by Chen et al. [12]. They applied a specific feedforward categorization algorithm which is especially suitable for use in embedded systems. In [13] first results for body motion analysis were published. In this work we explore the details of the Hidden Markov Model based method used for body motion analysis.

The structure of the paper is as follows. In section 2 we briefly introduce the dynamic vision sensor in the context of gesture recognition, the experimental setup for recording, and a description of the recorded activities. The applied method is explained in detail in section 3. The results are presented in section 4, and finally we close the paper with a discussion in section 5 and conclusions in section 6.


2. Dynamic vision sensor and experimental setup

The dynamic vision sensor consists of an array of 304x240 pixels built in a standard 0.18μm CMOS technology. The pixels respond to relative light-intensity changes by sending their position in the array asynchronously over a bus to a receiver unit. The pixel position, together with the information whether the light intensity has increased or decreased, is called an address event (AE). The AEs arrive at a multiplexer unit and are forwarded to an FPGA, which attaches a timestamp to each AE. The timestamp resolution can be adjusted up to 1μs. The FPGA is also capable of performing the calculation of the depth map. More details on the functionality of the sensor can be found in [10, 11].
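The event stream described above can be pictured as a sequence of timestamped tuples. The following Python sketch is an illustration only; the field names and record layout are our assumptions, not the sensor's actual interface:

```python
from dataclasses import dataclass

# One address event (AE): pixel position, polarity of the relative
# light-intensity change, and the FPGA-attached timestamp.
@dataclass
class AddressEvent:
    x: int          # column, 0..303
    y: int          # row, 0..239
    polarity: int   # +1 if intensity increased, -1 if it decreased
    t_us: int       # timestamp in microseconds (resolution up to 1 us)

# Example: three events produced by a moving edge
stream = [
    AddressEvent(120, 80, +1, 1000),
    AddressEvent(121, 80, +1, 1012),
    AddressEvent(122, 81, -1, 1025),
]
# Events arrive asynchronously; their timestamps define the temporal order.
assert all(a.t_us <= b.t_us for a, b in zip(stream, stream[1:]))
```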

In Figure 2 sample data from the dynamic vision sensor are shown in spatio-temporal form for a sequence of hand movements. The blue and red dots represent the address events generated in the most recent 20ms. The small gray dots are the events generated before the most recent 20ms. As mentioned before, the dynamic vision sensor only registers the moving parts of a person, so the background subtraction is done by the sensor without additional cost. Thus the processing can focus on the movements of real interest.

Using two dynamic vision sensors, it is possible to asynchronously reconstruct the depth information of moving objects efficiently in real time. The first algorithm for depth estimation on asynchronously generated data from an event-driven stereo sensor was described by Schraml et al. [14]. The processing unit of the dynamic stereo vision sensor embeds this event-based stereo vision algorithm, including the depth calculation.

In [13] we compared the robustness of the gesture recognition of a single dynamic vision sensor with the so-called overlay mode and with the calculated depth map. It was revealed that the best recognition rate was achieved in the overlay mode. In this mode the origin of the AEs (which detector of the stereo system they come from) is not used, so the shift, or disparity, between the two detectors is visible in the middle image of Figure 1. At the same time, using the overlay mode reduces the needed processing power, as no stereo matching must be performed.
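Conceptually, the overlay mode simply interleaves the events of both detectors by timestamp and discards the detector of origin; a minimal sketch under that interpretation (the tuple layout and sample values are ours):

```python
import heapq

# Events as (timestamp_us, x, y) tuples; the detector of origin is dropped,
# so the stereo disparity remains visible as a horizontal shift in the data.
left  = [(1000, 50, 30), (1040, 51, 30), (1090, 52, 31)]
right = [(1010, 58, 30), (1050, 59, 30), (1080, 60, 31)]

# Merge the two already time-ordered streams into one overlay stream.
overlay = list(heapq.merge(left, right))

assert len(overlay) == len(left) + len(right)
assert all(a[0] <= b[0] for a, b in zip(overlay, overlay[1:]))
# No stereo matching is performed, which is what saves processing power.
```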

Gesture recognition is applied in several fields, e.g. sign language, navigation of virtual environments, gaming and many others. In this paper we use the dynamic vision sensor for recording AEs of certain dance patterns. Gesture recognition is applied to recognize the different dance patterns (or activities) for a dance/fitness training game. A complete dance includes between 5 and 15 activities. Our aim is to recognize which activity a trainee is performing at the moment. In Figure 3 eight activities of the chosen dance choreography are represented as time sequences. Each activity lasts a couple of seconds. In this work we try to recognize and distinguish between these eight activities. For testing the algorithms, several recordings of the activities were made. Fifteen persons performed the eight activities up to 5 times each. In total 430 different recordings were acquired. Iterations with bad performance were not used for the training of the gesture recognition. To simulate a training session in a TV-room-like situation, the dynamic stereo vision sensor was mounted at a height of 1.3m (typical for a sensor placed on top of a TV set). The distance to the performing person is 2m. The data set consists of recordings of persons with different body height, clothing and clothing texture, skin color, hair color and gender.

Figure 2. Representation of the spatio-temporally generated address events by one dynamic vision detector as a reaction to a person moving his arms.

Figure 1. Mono data from one detector (left image), overlay data generated by data from both detectors (middle image) and color-coded stereo data (right image).


Figure 3. Snapshots of eight different activities A1-A8 shown as time sequences. The activities A3 and A4 are nearly identical, only the direction of rotation is different.

Activity A1: Arms front stretched and crossed

Activity A2: Arms pointing with 180 degree left right side rotation

Activity A3: Arms pointing with 360 degree axis left rotation

Activity A4: Arms pointing with 360 degree axis right rotation

Activity A5: Arms waving top down

Activity A6: Bent down with arms back crossed

Activity A7: Elbow to knee

Activity A8: Legs front stretched with shoulder rolling


3. Applied Method

For the learning step a left-right continuous HMM with mixed Gaussian output probability is used [3]. Yamato et al. used the Hidden Markov Model approach to distinguish different types of tennis strokes. The strokes are captured with video cameras and a set of feature vectors is calculated from binarized images. We chose a very similar feature vector. For each activity the recorded samples are used to calculate a set of feature vectors needed for the training of the Hidden Markov Model. In Figure 4 the workflow of the computation of the feature vector is shown.

a) The pixel array of the sensor is divided into blocks containing 8x8 pixels. This is done to reduce the pixel count.

b) Over a time period of 40ms, each arriving address event increases the count of its block by one.

c) As the pixel array consists of 304x240 pixels, this leads to a total of 1140 block count values.

d) At the end of the 40ms period the vector of block count values is normalized by the total number of pixels per block, in this case 64. This is called the relative pixel count. The resulting vector of relative pixel counts is standardized to a mean value of zero and divided by the standard deviation. To minimize rounding errors the vector is scaled by a factor of 10.

e) As one can easily imagine, the calculation time of training and classification (and the classification result itself) depends on the number of elements of the feature vector. Therefore an additional compression of the 1140 elements is applied by using a discrete cosine transformation (DCT). This was also successfully applied by Mendoza et al. [3].

f) The first 32 coefficients of the DCT are used for training the HMMs. For choosing the right number of coefficients we made measurements regarding classification time and results.

g) Train the Hidden Markov Model with the set of feature vectors.

h) Perform a leave-one-out cross validation with all available recordings.
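Steps a) to f) above can be sketched in a few lines. The block size, the 1140-element count, the standardization, and the DCT-II with 32 coefficients follow the text; everything else (function name, event format) is our illustrative choice:

```python
import math

BLOCKS_X, BLOCKS_Y = 304 // 8, 240 // 8   # 38 x 30 = 1140 blocks
PIXELS_PER_BLOCK = 64

def feature_vector(events, n_coeffs=32):
    """events: iterable of (x, y) address events from one 40 ms period."""
    # a) + b): count the arriving events per 8x8 pixel block
    counts = [0.0] * (BLOCKS_X * BLOCKS_Y)
    for x, y in events:
        counts[(y // 8) * BLOCKS_X + (x // 8)] += 1
    # d): relative pixel count, standardized to zero mean, scaled by 10
    rel = [c / PIXELS_PER_BLOCK for c in counts]
    mean = sum(rel) / len(rel)
    std = math.sqrt(sum((v - mean) ** 2 for v in rel) / len(rel)) or 1.0
    z = [10.0 * (v - mean) / std for v in rel]
    # e) + f): DCT-II over the 1140 values, keep the first n_coeffs coefficients
    n = len(z)
    return [sum(z[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n))
            for k in range(n_coeffs)]

fv = feature_vector([(10, 10), (11, 10), (200, 100)])
assert len(fv) == 32
```

Note that after standardization the DC coefficient (k = 0) is zero by construction, so the compressed vector encodes only the spatial distribution of activity, not its absolute amount.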

For benchmarking the recognition rate of the classification of the activities, we use a leave-one-out cross validation. In a leave-one-out cross validation there exist recordings of the activities from N persons. For the training of the Hidden Markov Models only N-1 persons are used. The testing is done with the samples of the left-out person. This is done for all N persons, which results in N evaluation matrices in total (one matrix for each left-out person). The sum of all evaluation matrices is called the confusion matrix, from which a mean recognition rate over all activities is computed.
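The leave-one-out evaluation over N persons can be sketched as follows; the `train` and `predict` callables stand in for the HMM machinery and are placeholders of our own:

```python
# recordings[p] = list of (feature_sequence, true_activity) for person p
def leave_one_out(recordings, train, predict, n_activities=8):
    persons = list(recordings)
    # confusion[true][predicted], accumulated over all left-out persons
    confusion = [[0] * n_activities for _ in range(n_activities)]
    for left_out in persons:
        # Train on the N-1 other persons only.
        train_data = [s for p in persons if p != left_out
                        for s in recordings[p]]
        models = train(train_data)              # e.g. one HMM per activity
        # Test on the samples of the left-out person.
        for seq, true_act in recordings[left_out]:
            confusion[true_act][predict(models, seq)] += 1
    return confusion

# Toy check with a trivial "classifier" that always predicts activity 0.
recs = {"p1": [([0.0], 0)], "p2": [([1.0], 1)]}
cm = leave_one_out(recs, train=lambda d: None, predict=lambda m, s: 0,
                   n_activities=2)
assert cm == [[1, 0], [1, 0]]
```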

Figure 4. Workflow of gesture recognition method with all applied steps.



4. Results

The main focus of this paper is the suitability of the used method and algorithm for application on an embedded system. For applying the HMM based gesture recognition algorithms, a new C++ library was developed. Speed measurements were taken on two Intel based platforms.

For simulating an embedded system, a Microspace PCX48 from Digital Logic based on an Intel Celeron M processor (1 GHz, 400MHz bus, 1 GB RAM) was used. For comparison with a current platform, a Dell Laptop E6420 based on an Intel i7-2620M CPU (2 cores @ 2.7GHz, 800MHz bus, 8GB RAM) was used.

In Table 1 the results of the two speed measurements are shown. The first pair of columns displays the training time for each activity on the two platforms. The second pair shows the mean timings for the classification. The algorithm is designed such that after each period of 40ms the classification is updated; the listed classification times are the mean times needed for this update.
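The per-period update amounts to scoring the observed feature sequence under each activity's HMM and reporting the best; a simplified sketch of the underlying forward recursion in the log domain (the emission model is abstracted into a caller-supplied function, our simplification of the paper's mixed-Gaussian case):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs)) if m > -math.inf else m

def log_forward(log_pi, log_A, log_b, obs):
    """Log-likelihood of an observation sequence under one HMM.
    log_pi[i]: initial state log-probs; log_A[i][j]: transition log-probs
    (upper-triangular for a left-right HMM, -inf elsewhere);
    log_b(i, o): emission log-prob of observation o in state i."""
    n = len(log_pi)
    alpha = [log_pi[i] + log_b(i, obs[0]) for i in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[i] + log_A[i][j] for i in range(n)])
                 + log_b(j, o) for j in range(n)]
    return logsumexp(alpha)

def classify(models, obs):
    # models: {activity: (log_pi, log_A, log_b)}; report the best-scoring HMM
    return max(models, key=lambda a: log_forward(*models[a], obs))

# Toy check: one-state HMM whose emissions have log-probability 0 everywhere
m = {"A1": ([0.0], [[0.0]], lambda i, o: 0.0)}
assert classify(m, [1, 2, 3]) == "A1"
```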

Activity   Training time               Classification per 40ms period
           Celeron M     i7-2620M      Celeron M     i7-2620M
           @ 1 GHz       @ 2.7 GHz     @ 1 GHz       @ 2.7 GHz

A1         592 s         33.8 s        20 ms         3 ms
A2         679 s         38.6 s        20 ms         3 ms
A3         376 s         21.0 s        20 ms         3 ms
A4         478 s         26.3 s        20 ms         3 ms
A5         1284 s        72.8 s        20 ms         3 ms
A6         295 s         16.2 s        20 ms         3 ms
A7         426 s         24.0 s        20 ms         3 ms
A8         262 s         14.2 s        20 ms         3 ms
Sum        4392 s        247.0 s       160 ms        24 ms

Table 1. Duration values for the training and for the period-based classification of each activity.

For the activity recognition, the overlay data are used for training the HMMs. A leave-one-out cross validation has been calculated with varying combinations of the numbers of Gaussian mixtures and states. The number of Gaussian mixtures (M) was varied from 4 to 8, the number of states (Q) from 8 to 14. The best resulting confusion matrix is shown in Table 2. The gesture recognition yields an average recognition rate of 94%. It can be noticed that the typical misclassifications occur for activity A8.

       A1   A2   A3   A4   A5   A6   A7   A8       G    E   R[%]
A1     58    1    0    0    0    0    0    0      58    1    98
A2      0   56    0    0    0    0    0    0      56    0   100
A3      0    3   44    0    0    0    1    0      44    4    92
A4      0    0    2   54    0    0    0    0      54    2    96
A5      0    0    0    0   55    0    0    0      55    0   100
A6      1    0    0    0    1   51    2    0      51    4    93
A7      0    0    0    0    0    0   57    0      57    0   100
A8      0    4    2    0    0    0    7   31      31   13    71

Sum                                              406   24    94

Table 2. Confusion matrix for the cross validation. Indices A1-A8 refer to the eight activities, G means good (correct) recognitions, E means erroneous recognitions, R is the recognition rate in percent.
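The G, E and R columns of Table 2 are related (up to rounding) by R = G/(G+E); a quick check against two rows of the table:

```python
def recognition_rate(good, errors):
    # R = G / (G + E), expressed in percent and rounded
    return round(100 * good / (good + errors))

# Row A3 of Table 2: G = 44, E = 4  ->  92%
assert recognition_rate(44, 4) == 92
# Bottom line of Table 2: G = 406, E = 24  ->  94% mean recognition rate
assert recognition_rate(406, 24) == 94
```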

5. Discussion

The HMM based gesture recognition algorithms were successfully applied to the data of the dynamic stereo vision sensor. Using the dynamic stereo vision sensor, a recognition rate of 94% is reached. Mendoza et al. [3] employed a single-video-based system and reached a recognition rate of 82%, whereas Wang et al. [9] used a video-based stereo system, which resulted in a recognition rate of 92%.

It is difficult to provide a thorough comparative analysis with state-of-the-art systems, because a single set of gestures would have to be used for the comparison. Within the scope of this paper, the gestures used by the present system differ from those of the works cited above. The recognition rates of the proposed system provide a first indication that the results are comparable with the state of the art.

The timing results in Table 1 clearly show that the training of the Hidden Markov Models is rather fast, needing around 4 minutes on an Intel dual-core based computer. The measured classification times show that an Intel Celeron M based computer can classify one activity per 40ms period in real time. For a dance pattern based game this is sufficient if it is known in advance which dance pattern should be performed. This is especially the case for games aimed at the fitness market. For games where the pattern of the activity can be chosen from a larger pool, at least an Intel dual-core based computer is needed; then it is possible to distinguish between several different dance patterns.

It should be mentioned that in these experiments a standard feature vector was chosen, which needs a costly DCT operation. There is high potential to find a more suitable feature vector that takes advantage of the data provided by the dynamic vision sensor.

Additionally, the recognition rate for activity A8 has potential for improvement. The rather high misclassification rate should be further analyzed.


6. Conclusion

In this work it was shown that the combination of a Hidden Markov Model and dynamic stereo vision sensor data in a spatio-temporal space has a large potential for efficient and robust gesture recognition. Recognition rates of up to 94% have been reached. Dynamic stereo vision sensors provide a continuous and asynchronous stream of data upon scene dynamics, thus offering on-chip background subtraction and generating motion data in a spatio-temporal representation (x,y,z). The algorithms seem to be suitable for integration into an embedded system. In a next step, new types of feature vectors which take advantage of the spatio-temporal data of the dynamic stereo vision sensor will be developed and explored.

Acknowledgment

This work is supported by the project grant SilverGame "aal-2009-2-113" running under the Ambient Assisted Living joint program of the European Commission. The authors would like to thank Reha-Zentrum Lübben for providing the dance choreography of the eight activities.

References

[1] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE Trans. Acoust., Speech, Signal Processing, pp. 4-16, 1986.

[2] J. Yamato, J. Ohya and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," Proc. CVPR '92 IEEE Computer Society Conf. Computer Vision and Pattern Recognition, pp. 379-385, 1992.

[3] A. M. Mendoza and N. Pérez de la Blanca, "HMM-based action recognition using contour histograms," Lecture Notes in Computer Science, pp. 394-401, 2007.

[4] H. Meng, N. Pears and C. Bailey, "A human action recognition system for embedded computer vision application," IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-6, 2007.

[5] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks," Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, CHI '91, pp. 237-242, 1991.

[6] M. J. Jeon, S. E. Yang and Z. Bien, "User adaptive hand gesture recognition using multivariate fuzzy decision tree and fuzzy garbage model," IEEE International Conference on Fuzzy Systems, Jeju Island, 2009.

[7] C. Schüldt, I. Laptev and B. Caputo, "Recognizing human actions: a local SVM approach," ICPR Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, pp. 32-36, 2004.

[8] S. Mitra and T. Acharya, "Gesture recognition: A survey," IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 37, no. 3, pp. 311-324, May 2007.

[9] Y. Wang, T. Yu, L. Shi and Z. Li, "Using human body gestures as inputs for gaming via depth analysis," IEEE International Conference on Multimedia and Expo, pp. 993-996, 2008.

[10] C. Posch, D. Matolin and R. Wohlgenannt, "A QVGA 143 dB Dynamic Range Frame-Free PWM Image Sensor With Lossless Pixel-Level Video Compression and Time-Domain CDS," IEEE Journal of Solid-State Circuits, vol. 46, pp. 259-275, 2011.

[11] P. Lichtsteiner, C. Posch and T. Delbrück, "A 128x128 120dB 15μs Latency Asynchronous Temporal Contrast Vision Sensor," IEEE Journal of Solid-State Circuits, vol. 43, pp. 566-576, 2008.

[12] S. Chen, P. Akselrod, B. Zhao, J. A. Perez Carrasco, B. Linares-Barranco and E. Culurciello, "Efficient Feedforward Categorization of Objects and Human Postures with Address-Event Image Sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 302-314, Feb. 2012.

[13] B. Kohn, A. N. Belbachir, T. Hahn and H. Kaufmann, "Event-driven Body Motion Analysis for Real-time Gesture Recognition," IEEE International Symposium on Circuits and Systems Conference Proceedings, to be published 2012.

[14] S. Schraml, A. N. Belbachir, N. Milosevic and P. Schoen, "Dynamic Stereo Vision for Real-time Tracking," Proc. of IEEE ISCAS, June 2010.

[15] A. N. Belbachir, Ed., Smart Cameras, Springer New York, 2009.
