
2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)

Comparing Models for Gesture Recognition of Children’s Bullying Behaviors

Michael Tsang, Vadim Korolik
Department of Computer Science
University of Southern California
Los Angeles, California 90089
Email: {tsangm,korolik}@usc.edu

Stefan Scherer
Institute of Creative Technologies
University of Southern California
Los Angeles, California 90094
Email: [email protected]

Maja Matarić
Department of Computer Science
University of Southern California
Los Angeles, California 90089
Email: [email protected]

Abstract—We explored gesture recognition applied to the problem of classifying natural physical bullying behaviors by children. To capture natural bullying behavior data, we developed a humanoid robot that used hand-coded gesture recognition to identify basic physical bullying gestures and responded by explaining why the gestures were inappropriate. Children interacted with the robot by trying various bullying behaviors, thereby allowing us to collect a natural bullying behavior dataset for training the classifiers. We trained three different sequence classifiers using the collected data and compared their effectiveness at classifying different types of common physical bullying behaviors. Overall, Hidden Conditional Random Fields achieved the highest average F1 score (0.645) over all tested gesture classes.

1. Introduction

Bullying among children is a global issue that can cause long-term negative psychological and behavioral problems for both the victim and the bully. Children who bully others tend to have higher instances of conduct problems and dislike of school, whereas victims of bullying show higher levels of insecurity, anxiety, depression, loneliness, and low self-esteem compared to their peers [7], [13]. Despite the prevalence and severity of bullying, to our knowledge no research has been conducted on the automatic recognition/classification of physical bullying behaviors. Such recognition and classification, especially if done in real time, can be useful for reporting and intervening to prevent negative effects. An example of an intervention could be a physical robot detecting a child’s bullying behavior and advising the child on what inappropriate behavior the child engaged in and why bullying is wrong.

In the computer vision community, human action recognition is typically studied on datasets of behaviors acted by adults performing in front of a camera in a structured fashion [29]. For example, specific behaviors may be acted in sequence in the same order, with every actor performing the target behaviors multiple times. However, our goal was to acquire children’s natural bullying behavior data. Nomura et al. (2015) found that children in a shopping mall in Tokyo had a tendency to show abusive, bullying behaviors towards a robot [15]. Informed by that work, we used a robot as a target of non-contact bullying behaviors by children for data collection purposes. Toward that end, we developed a humanoid anti-bullying robot that responds to perceived bullying behaviors by explaining why such behaviors are inappropriate. Children engaged in interacting with the robot tested out a variety of bullying behaviors to see how the robot would respond. We recorded a significant number of such playful, mock-bullying instances by children (see Figure 1) for use as a training set for the classification algorithm.

Figure 1. Example of bullying-type gestures children spontaneously demonstrated in front of the robot to elicit its response

Bullying can manifest in a variety of forms, including physical (kicking, hitting), verbal (name-calling, intimidation), social (gossip), and cyber [23]. According to the literature, punching, kicking, showing/waving a fist, and pointing are prevalent in aggressive bullying and teasing, and are especially damaging [2], [4], [21]. Therefore, in the scope of this paper, we focus on the detection of those behaviors in one-on-one bullying. In group bullying, i.e., the cases where children bullied the robot in groups, we select the child closest to the robot as the bully.

We hypothesized that the bullying behaviors we examined lend themselves to effective classification using gesture recognition methods. We tested this hypothesis by training and comparing the following models for gesture recognition on our dataset: Hidden Conditional Random Fields, Hidden Markov Models, and Dynamic Time Warping [25]. The results of our experiments support our hypothesis: even with a small number of children’s free-form bullying behaviors as training and testing examples, Hidden Conditional Random Fields was able to discriminate among all four bullying gesture classes and a null class.

This paper contributes a novel process for obtaining natural bullying behavior data from children and comparatively demonstrates an effective gesture recognition approach for bullying behavior classification. With natural bullying data comes the challenge of accounting for a large null class, since bullying occurrences are much less common than non-bullying (null) events. We address this challenge by accounting for the naturally imbalanced, noisy, and limited data associated with children’s misbehaviors, instead of relying on clean acted data.

We focus on testing the hypothesis that gesture recognition is appropriate for bullying behavior classification. The full dynamics of bullying were not modeled, as the focus was on determining the effectiveness of gesture recognition as a prerequisite. Toward that goal, we show that gesture recognition methods that have traditionally been tested on data from adults can also work on child data and on the specific task of bullying detection. We describe our approach to obtaining natural behavior data on children’s bullying behaviors using a humanoid robot as a bullying target and compare the performance of three validated classification models on those bullying behavior data.

2. Related Work

We briefly review existing literature relevant to physical bullying detection and behavior classification in child-robot interactions.

2.1. Bullying Detection

While little research has been conducted on the automatic detection of physical bullying behaviors specifically, related topics, such as human aggression and violence detection, have been explored. As a result, detection methods have been applied to a variety of domains, including the detection of aggressive behaviors by the elderly [5], violent human behavior in crowds [10], fight scenes in sports videos [14], violence in movies [1], [6], and physical bullying role-played by adults [26], [27].

Research in classifying aggression or violence has primarily focused on feature representations of RGB videos. For example, Chen et al. [5] studied binary motion descriptors in their classification of aggression in the elderly, Nievas et al. [14] used the motion scale-invariant feature transform and space-time interest points in classifying hockey fights, Hassner et al. [10] used flow vector descriptors in their classification of crowd violence, and Ye et al. [26], [27] used acceleration and gyroscope data for role-played bullying detection.

Classification of aggressive behavior has also been studied on depth camera (RGB-D) videos in various forms. As surveyed by Zhang et al. [29], many RGB-D datasets consist of one or more aggressive behaviors, such as punching, kicking, pushing, and throwing. Furthermore, many of these datasets are widely used in action recognition experiments with novel feature representations or computational models; however, none of these datasets consist solely of aggressive behaviors, and none of these datasets involve child actors.

In the context of our study, bullying detection is the detection of the subset of children’s physical behaviors that are described as bullying in the social science literature. Specifically, studies have found that intentional physical or emotional abuse, such as hitting, kicking, and pointing, directed toward a victim by a person of greater power or strength is considered bullying [4], [16], [17], [19], [21]. These gestures can be observed from RGB-D data in the form of skeletons, making them the basis for our detection method.

2.2. Classification of Behaviors in Child-Robot Interactions

Limited research to date has been conducted on the data collection and automatic classification of children’s behaviors in child-robot interactions. We describe several notable studies in this domain.

Leite et al. [12] studied and classified the nonverbal behaviors that children show when they disengage from social interactions with robots. The setting of the study is similar to ours, but we classify nonverbal behaviors when children bully a robot. In that study, data used to model disengagement were collected from child-robot interaction studies primarily in the form of videos, which were hand-annotated and processed with a facial tracking algorithm for features of children’s behaviors. Support Vector Machines were used to classify and rank the most discriminative features of disengagement.

Strohkorb et al. [22] recorded a group of children interacting with robots using a tablet device to identify dominant behavior of one child over the others. Video data of child-robot interactions were recorded and hand-annotated for features including gaze, utterance, and gestures, then used to model social dominance. Logistic Regression and Support Vector Machines were used for classification. The appearance of the robots was designed to engage the children in the child-robot interactions, similar to our choice of using a humanoid robot.

In contrast to those works, we study the recognition of children’s bullying behaviors by taking into account the temporal component of human movement. Hence, we apply sequence classifiers to our task of detecting children’s bullying behaviors.


Figure 2. Experimental setup: the robot responds to detected undesirable behaviors by explaining why they are inappropriate. The red tape marks the boundary children were not allowed to cross, to prevent them from touching and possibly harming the robot given the bullying context.

3. Methodology

3.1. Experimental Setup

To enable our data collection of children’s bullying behaviors, we endowed a humanoid robot with the ability to respond to bullying poses by children. The robot was programmed to respond to poses that appeared to be preparing for hitting, kicking, and shoving, as well as the poses of showing a fist and sticking out the tongue. We deployed the robot at our University Robotics Open House event attended by children aged 6 to 18. All children participated in the Open House with parental consent provided to their schools, and many of the children interacted with the robot over a four-hour data collection period. A demonstrator showed the audience of children the poses that the robot was programmed to detect, and then the children, either individually or in a group, freely interacted with the robot from one meter away (Figures 1, 2). Child supervisors and the demonstrator were always present, and participation in the activity was voluntary and could be terminated at any point.

The robot’s role in the interaction was to explain why certain behaviors are inappropriate. For example, when the robot detected someone pointing at it, it responded by saying “Stop it! Please don’t point at people because they won’t know why you are pointing at them.” Likewise, when the robot detected someone showing their fist, it responded “Stop it! You are showing your fist. Please don’t do that at people because it signals an intention to hurt them.” Similar responses were made for every gesture the robot was programmed to detect. These capabilities naturally inspired children to attempt various bullying actions in order to elicit the robot’s response. In this way, the robot was a target for children’s real or mock bullying behaviors, and we were able to record those natural behaviors from the robot’s perspective.
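
To make the response mechanism concrete, the following minimal sketch (not the authors’ deployed code) maps a detected pose label to the corresponding spoken explanation; the label strings and the text-to-speech hook `say` are assumptions introduced only for illustration.

    # Minimal sketch: dispatch a detected bullying pose to a spoken explanation.
    # The pose labels and the `say` text-to-speech callable are illustrative assumptions.
    RESPONSES = {
        "pointing": ("Stop it! Please don't point at people because they "
                     "won't know why you are pointing at them."),
        "showing_fist": ("Stop it! You are showing your fist. Please don't do "
                         "that at people because it signals an intention to hurt them."),
        # ...analogous messages for hitting, kicking, and sticking out the tongue
    }

    def respond(detected_pose, say):
        """Speak the explanation associated with a detected pose, if any."""
        message = RESPONSES.get(detected_pose)
        if message is not None:
            say(message)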

We used Bandit, an adolescent-sized humanoid robot torso mounted atop a Pioneer P3-DX mobile robot base. We collected data on children’s behaviors using a Kinect One^1 mounted on top of the robot (which was 1.12 meters tall) at a height of 1.35 meters and angled downwards at 13.8 degrees. This setup allowed the Kinect to capture children and adolescents up to 185 cm (6’ 1”) tall. The system recorded skeletal features, depth data, and body segmentation information while simultaneously detecting poses in real time from tracked skeletons. At any point in time, we limited tracking and data collection to one child by only processing the closest skeleton in a 60-degree horizontal field of view of the Kinect sensor.
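
The closest-skeleton constraint could be implemented along the following lines; this is an assumed sketch (the joint name, coordinate conventions, and data layout are not specified in the paper), not the system’s actual code.

    import math

    # Assumed sketch: keep only the skeleton nearest to the sensor whose torso lies
    # within a 60-degree horizontal field of view. Each skeleton is a dict mapping
    # joint names to (x, y, z) positions in meters, with z pointing away from the camera.
    def select_tracked_skeleton(skeletons, fov_deg=60.0):
        half_fov = math.radians(fov_deg / 2.0)
        candidates = []
        for skel in skeletons:
            x, _, z = skel["SpineMid"]             # torso joint; name assumed
            if z <= 0:
                continue                           # invalid depth or behind the sensor
            if abs(math.atan2(x, z)) <= half_fov:  # horizontal angle off the optical axis
                candidates.append((z, skel))
        return min(candidates, key=lambda c: c[0])[1] if candidates else None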

The real-time pose detector we used for data collection was heuristic-based. It normalized Kinect skeleton sizes, selected a set of representative skeletal joints for a pose, computed z-scores [8] of joint positions while the person held the pose, and compared the average of those features to a manually set threshold. The threshold was based on estimating when the same pose is shown again by another person. This detector is not practical for general, reliable bullying detection because it captures neither the sequential nature of gestures nor the subtleties of children’s bullying, since it is not data-driven. In our testing and data collection, the heuristic-based pose detector often failed, causing us to capture fewer poses overall. However, in spite of its limitations, the automatic approach was sufficiently effective for the data collection process, allowing us to capture a natural behavior dataset that was used for bullying gesture recognition training and evaluation.
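
A heuristic check of this kind might look like the sketch below; the exact joints, statistics, and threshold values used in the deployed detector are not given in the paper, so everything here is an assumption for illustration.

    import numpy as np

    # Assumed sketch of a z-score-based pose check: compare the held pose of a
    # size-normalized skeleton against per-joint statistics of a pose template.
    def matches_pose(normalized_joints, template_mean, template_std, threshold=1.5):
        """normalized_joints, template_mean, template_std: (J, 3) arrays for J
        representative joints; threshold: manually tuned cutoff on the mean |z-score|."""
        z = (normalized_joints - template_mean) / (template_std + 1e-6)
        return float(np.mean(np.abs(z))) < threshold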

3.2. Dataset

The dataset consists of Kinect One skeletal data of 49 boys and 17 girls performing bullying gestures. No identifiable data of these children were published or shared. Boys’ heights ranged from 0.77 m to 1.85 m, with an average of 1.30 m and a standard deviation of 0.17 m. Girls’ heights ranged from 1.04 m to 1.45 m, with an average of 1.26 m and a standard deviation of 0.11 m. The bullying gestures collected in our dataset consist of hitting, pointing, kicking, showing a fist, and null classes shown by children in front of the demo robot. We only processed the gestures that did not make physical contact with the robot or Kinect and were demonstrated using the child’s right hand.

A hitting gesture is represented by the onset and apex of a swipe or punch performed in front of the body. Likewise, pointing and kicking gestures are represented by their onset and apex, demonstrated in front of the body. Showing a fist is represented by raising the hand and displaying the fist at the robot as if preparing to punch it. Finally, the null class represents other behavioral sequences the children performed. These behaviors were diverse and included walking, leaning forward, hand waving, and standing idle.

1. http://www.xbox.com/en-US/xbox-one/accessories/kinect

Page 4: Comparing Models for Gesture Recognition of Children’s ......a humanoid robot that used hand-coded gesture recognition to identify basic physical bullying gestures and responded

TABLE 1. THE NUMBER OF CHILDREN THAT SHOWED A GESTURE A SPECIFIC NUMBER OF TIMES IN OUR DATASET.

Gesture class   0 times   1 time   2 times   3 times   4 times   5 times   6 times   7+ times
Pointing             18       25        13         5         1         3         1          0
Hitting              42       11         6         4         0         2         0          1
Showing Fist         43       16         3         1         2         0         0          1
Kicking              50       10         3         3         0         0         0          0
Null                  6       11        17         1         3         7         7         14

There were a total of 91 pointing, 54 hitting, 40 showing a fist, 25 kicking, and 302 null gestures. The average sequence lengths of pointing, hitting, showing a fist, kicking, and null are 33.5, 9.2, 46.8, 8.8, and 32.4 frames, respectively, at 15 frames per second^2.

While every child showed a bullying gesture, not all children showed every gesture. Out of the 66 total children, 48 showed pointing, 24 showed hitting, 23 showed their fist, 16 showed kicking, and 60 (nearly everyone) showed the null sequence. Furthermore, gesture repetition rates varied greatly among children. For example, 25 children showed pointing only once, 13 children showed pointing 2 times, 5 children showed pointing 3 times, and so on. The repetition rates for all gesture classes can be seen in Table 1.

3.3. Video Annotation

Because all recorded videos are of a protected participant class (children), we opted to annotate the videos ourselves. The lead author was the only data coder; all annotations were done using the Elan annotation software^3. To avoid biasing the annotations, we set strict objective guidelines on the start (onset) and end (offset) times of each behavior; the onset is defined as the moment a child begins a specific behavior and the offset as the moment s/he completes the same behavior. Each video consisted of one or several unique behaviors of various durations; we manually analyzed each video to determine the onset and offset of our restricted set of bullying behaviors. The large number of remaining video segments, including false positive behaviors, were labeled as the null class and are the reason for our dataset imbalance (Table 1).

4. Predicting Children’s Bullying Gestures

4.1. Procedure

A number of steps were needed to classify children’s bullying behaviors in our dataset. In order to prepare features, we used two approaches to processing the raw skeletal data. First, we computed normalized 3D positions of the raw skeleton by converting all pairwise distances between neighboring joints to unit length while preserving all joint angles, and by fixing the head joint at the origin. Second, we computed the velocities of all joints by subtracting the joint positions of every other raw skeleton in time and dividing by their time difference. Our use of normalized joint positions and velocities was inspired by the findings of Zanfir et al. on improving skeletal action recognition using simple descriptors [28]. A comparison of classifier performance for different feature representations can be seen in Figure 3.

2. Because our demo robot and Kinect One configuration used an underpowered mini-PC, the Kinect skeletal data were not captured at 30 Hz. To handle irregularity in the data time series, we sampled the data at 15 Hz.

3. https://tla.mpi.nl/tools/tla-tools/elan/

Figure 3. F1 score comparisons of feature representations for each gesture. "Positions" indicates features are raw skeletal joint positions, "NormPos" indicates features are normalized positions, "Velocities" indicates features are joint velocities, and "Norm+Vel" indicates features are a combination of joint velocities and normalized joint positions. For all feature representations, PCA was applied to the training data to capture 90% of feature variance, and HCRF was used for classification. Error bars indicate standard deviations of F1 scores from cross-validation.

Figure 4. F1 score comparisons of classifiers (DTW, DHMM, CHMM, HCRF) for each gesture. Error bars indicate standard deviations of F1 scores from cross-validation.
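
To make the feature preparation concrete, the sketch below implements the two descriptors in the spirit of the description above. The data layout, joint indices, and the bone list are assumptions (only a small subset of the 24 bones of the 25-joint Kinect skeleton is listed); it is illustrative rather than the authors’ exact code.

    import numpy as np

    HEAD = 3                                    # assumed index of the head joint
    BONES = [(3, 2), (2, 20), (20, 1), (1, 0)]  # hypothetical (parent, child) subset;
                                                # the full skeleton tree has 24 such pairs

    def normalize_skeleton(frame):
        """Rescale every listed bone to unit length, preserving joint angles,
        with the head joint fixed at the origin. frame: (25, 3) raw joint positions."""
        norm = np.zeros_like(frame)
        norm[HEAD] = 0.0
        for parent, child in BONES:             # parents are visited before children
            bone = frame[child] - frame[parent]
            norm[child] = norm[parent] + bone / (np.linalg.norm(bone) + 1e-8)
        return norm

    def joint_velocities(joints, times, step=2):
        """Finite differences over every other frame, divided by the time difference.
        joints: (T, 25, 3) raw positions; times: (T,) timestamps in seconds."""
        dt = (times[step:] - times[:-step])[:, None, None]
        return (joints[step:] - joints[:-step]) / dt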

Our data analysis indicates that gesture classes such as kicking and pointing require features that capture both stationary positions in pointing and fast movements in kicking. Figure 3 shows that, by themselves, velocity features outperform stationary features for classification of hitting and kicking, whereas solely using stationary features, in particular normalized positions, outperforms velocity features in classifying pointing and showing the fist. We achieve a balance of representing fast and stationary gestures by concatenating normalized joint position features and joint velocity features. This combined representation results in 150 features; the x, y, and z positions of all 25 joints were used in both the normalization and velocity computations. To reduce feature dimensionality, we applied Principal Component Analysis (PCA) to capture 90% of the variance of the original features in training sets and used the same principal components to classify test sets. The performance of the combined representation in gesture classification can be seen in the Norm+Vel bars in Figure 3 and in a comparison of classifiers that use this representation in Figure 4.
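
The combined representation and dimensionality reduction step could be sketched as follows, using scikit-learn’s PCA as a stand-in for whatever implementation was actually used; the array names are assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    def combine_features(norm_pos, vel):
        """Concatenate per-frame normalized positions and velocities, aligned in time:
        (T, 25, 3) + (T, 25, 3) -> (T, 150) feature vectors."""
        return np.concatenate([norm_pos.reshape(len(norm_pos), -1),
                               vel.reshape(len(vel), -1)], axis=1)

    def reduce_dimensionality(train_frames, test_frames, variance=0.90):
        """Fit PCA on training frames only, keeping 90% of the variance, and apply
        the same principal components to the test frames."""
        pca = PCA(n_components=variance)
        return pca.fit_transform(train_frames), pca.transform(test_frames)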

Using the combined joint velocity and normalized joint position feature representation, we compared the classification performance of Dynamic Time Warping (DTW), Discrete Hidden Markov Models (DHMMs), Continuous Hidden Markov Models (CHMMs), and Hidden Conditional Random Fields (HCRFs) on our data. Among sequence classifiers, DTW was chosen because it is a simple baseline, Hidden Markov Models (HMMs) because they are standard generative models, and HCRF because it is a discriminative model. HCRFs and HMMs both employ hidden states for the classification of sequences, but HCRFs only need one trained model for multi-class classification, whereas HMMs need separate models to be trained per class [24]. In contrast, DTW attempts to find an alignment between an input time series and a reference time series by computing the distance between them [3].
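
For reference, the DTW baseline computes an alignment cost between two sequences by dynamic programming, as in the minimal sketch below; a query gesture can then be assigned the class of its nearest reference template. The experiments themselves used the Gesture Recognition Toolkit’s implementation [9], not this code.

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic-time-warping cost between sequences a: (n, d) and b: (m, d)."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]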

HCRF experiments were conducted using the HCRF library^4, and HMM and DTW experiments were conducted using the Gesture Recognition Toolkit [9]. The DHMM had four states, and the HCRF had 10 states with a window size of 1 and an L2 regularization of 10. These parameters are consistent with the model comparison experiments performed by Wang et al. [24]. With these classifiers, we performed person-independent cross-validation experiments. Since our dataset contains 66 children, we used leave-11-children-out cross-validation, where gesture data from 11 children were retained as validation data for testing a model, and the remaining gesture data from 55 children were used to train the model. In order to handle random seeds in our models, we repeated the training of each model five times. The six rounds of classification experiments as part of cross-validation and the repeated training of our models result in 30 classification experiments per model.

4. http://sourceforge.net/projects/hcrf/

TABLE 2. COMPARISON OF CROSS-VALIDATION STATISTICS FOR MACRO-AVERAGED F1 SCORES.

Classifier   Average F1   Std. dev.
DTW               0.359      0.0923
DHMM              0.469      0.0597
CHMM              0.496      0.105
HCRF              0.645      0.0495

Figure 5. Confusion matrix for Dynamic Time Warping
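
The person-independent protocol can be sketched as a grouped split over child identities, assuming each recorded sequence carries the id of the child who produced it; this is illustrative, not the evaluation code that was used.

    import numpy as np

    def leave_k_children_out(child_ids, k=11, seed=0):
        """Yield (train_idx, test_idx) folds that hold out all data of k children at a
        time; with 66 children and k=11 this gives 6 folds, and repeating each model's
        training 5 times per fold yields the 30 experiments per model described above."""
        rng = np.random.default_rng(seed)
        children = rng.permutation(np.unique(child_ids))
        for held_out in np.array_split(children, len(children) // k):
            test_mask = np.isin(child_ids, held_out)
            yield np.where(~test_mask)[0], np.where(test_mask)[0]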

4.2. Evaluation

The primary evaluation metric we used was the F1 score [20], in order to capture classifier performance on both precision and recall of data with imbalanced class labels. We computed an F1 score for each gesture class to determine a classifier’s performance on specific gestures. To obtain a holistic performance score for a classifier across all gesture classes, we averaged the F1 scores over the gesture classes to produce a macro-averaged F1 score. We also generated confusion matrices to examine how classifiers confuse predicted and actual class labels, discussed below.
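
As a small illustration of the metric (using scikit-learn; the variable and label names are assumptions), per-class F1 scores are computed for the five gesture classes and their unweighted mean gives the macro-averaged F1:

    from sklearn.metrics import f1_score

    def evaluate(y_true, y_pred,
                 classes=("pointing", "hitting", "showing_fist", "kicking", "null")):
        """Return per-class F1 scores and their macro average."""
        per_class = f1_score(y_true, y_pred, labels=list(classes), average=None)
        return dict(zip(classes, per_class)), per_class.mean()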

4.3. Results and Discussion

Comparing F1 scores between classifiers for each gesture class reveals that HCRF obtains the highest scores at classifying children’s bullying gestures (Figure 4). The per-class F1 scores obtained by HCRF classifications are 0.79, 0.54, 0.64, 0.68, and 0.90 for pointing, hitting, showing the fist, kicking, and null, respectively. The superior performance of HCRF can also be seen in Table 2, which shows cross-validation statistics on macro-averaged F1 scores. In Table 2, HCRF achieves the highest average and lowest standard deviation of F1 scores in cross-validation.

Figure 6. Confusion matrix for Continuous HMM

Figure 7. Confusion matrix for Discrete HMM

Figure 8. Confusion matrix for HCRF

The confusion matrix of each classifier is given in Figures 5-8. A number of critical misclassifications can be seen in the confusion matrices. DTW confuses showing the fist with hitting and all of the gestures with null. CHMM confuses most gestures with pointing and null. The common confusion with the null class is likely caused by the dataset’s skewed class distribution towards that class, resulting from capturing natural behavior data. The HCRF is the only model to demonstrate the capability of discriminating all gesture classes in our dataset even in the presence of class imbalance, since the HCRF predicts true classes at the highest rates compared to other classes for all gestures (see Figure 8).

Contrary to the common understanding that simple models such as DTW and generative models such as CHMM and DHMM are better at generalizing to and modeling small datasets, such models do not necessarily outperform discriminative models such as HCRF [11]. In the case of our dataset, the HCRF discriminates all gesture classes while HMM and DTW mostly tend to misclassify gestures. We suspect the reason HCRF outperforms HMM is that HMM presumes independence of observations given latent variables, while HCRF makes no such assumption [18]. In addition, we believe that HCRF outperforms DTW because HCRF models hidden temporal dependency structure in its latent variables, whereas DTW does not have this capability.
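
For reference, the standard HCRF formulation (following [18], [24]) marginalizes over a sequence of hidden states h when scoring a gesture class y for an observation sequence x; the notation below is a generic sketch, not reproduced from this paper:

    % Hidden conditional random field: class posterior obtained by summing over
    % hidden state sequences h, with potential function \Psi and parameters \theta.
    P(y \mid \mathbf{x}; \theta)
      = \sum_{\mathbf{h}} P(y, \mathbf{h} \mid \mathbf{x}; \theta)
      = \frac{\sum_{\mathbf{h}} \exp\bigl(\Psi(y, \mathbf{h}, \mathbf{x}; \theta)\bigr)}
             {\sum_{y', \mathbf{h}} \exp\bigl(\Psi(y', \mathbf{h}, \mathbf{x}; \theta)\bigr)}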

The results of HCRF classification are not without problems. For example, the predicted label is often null when actual labels are other gestures. Showing the fist and hitting gestures are also often confused with pointing. The former problem may be attributed to the imbalanced dataset, and the latter problem may be due to the feature representations of hitting and pointing. There also seems to be a classification bias favoring long gesture sequences and more training data, which may explain why pointing and null classification are the best for not only the HCRF, but also CHMM and DTW.

Despite potential problems with HCRF classification of our dataset, HCRF still outperforms the other models, and its classification performance may be improved with more training data and a more balanced training set.

5. Lessons from the Child-Robot Interaction

In this section, we share our observations on the robot interaction task from the perspectives of the interaction itself and the affective computing technologies we used. Then, we explain how these observations inform future research on better understanding and addressing bullying.

One of the most notable lessons we learned in the child-robot interaction was that children were very engaged in interacting with the robot, sometimes revisiting the demo later in the day to mock bully the robot again. We also noticed that the responses made by the robot - correct or otherwise - encouraged more mock bullying. Dynamics of group interactions also played a role, where children were more likely to mock bully the robot when in a group than by themselves. Finally, we noticed that children almost always looked at the eyes of the robot when mock bullying it, rather than looking at the Kinect camera mounted above the robot.

The challenges associated with developing and using affective technologies for bullying detection are manifold. A significant challenge we faced involved minimizing the false positive rate of our classifiers, since a robot that falsely detects children’s bullying could lead to undeserved accusations of bullying. We know that, given sufficient data, classification performance can reach higher levels of accuracy; however, the approach will likely never yield perfect results. Given this fact, an important question to address is how to develop technologies for bullying detection that minimize false positives.

Another challenge we encountered was detecting group bullying. Although our work does not address group bullying, it poses clear difficulties: Kinect-type vision systems typically do not perceive more than 10 people, so it is difficult to track many individuals in a stable manner, and it is also non-trivial to maintain sustained tracking of a specific person.

In addition to these challenges, there are many other problems to be addressed in vision-based bullying detection. One of the most classic problems is collecting data at scale, so that state-of-the-art machine learning models can also be applied to bullying detection. More specific to bullying is the detection of nuanced behaviors, such as distinguishing between thumbs down and showing a fist, or detecting behind-the-back bullying. Another important area of study is identifying the bully, to enable the detection of repeated bullying as well as aid in bullying interventions.

6. Conclusions and Future Work

The ability to classify bullying gestures from children is necessary for automatically monitoring and intervening in cases of physical bullying at schools, playgrounds, and homes. To train effective recognizers, realistic data are needed, but socially sensitive behaviors are difficult to capture naturally.

We present a novel method of collecting data on children’s bullying gestures using an anti-bullying robot, which children engage with by naturally acting out bullying gestures in front of it. Using the collected data, we perform experiments with different sequence classifiers to compare their performance on discriminating gestures in the dataset. For per-class gesture recognition, we show that HCRF outperforms other models like HMM and DTW for every gesture based on F1 scores in our dataset, which features natural child bullying behaviors and is highly imbalanced. Neither HMM nor DTW performs comparably to HCRF across all gesture classes, suggesting that future domain-agnostic gesture recognition experiments should use the HCRF as a baseline model. Furthermore, this work offers a first glimpse at the classifier performance achievable in the domain of detecting children’s bullying gestures with natural, noisy, and limited data.

There are multiple lines of future work to explore in the gesture recognition of bullying behaviors. Since collecting data from children is challenging, it would be interesting to examine whether models trained on bullying gestures acted by adults can be used to correctly classify test gestures by children. Furthermore, it is worth exploring whether our examples of children’s bullying behaviors are enough to train sequence classifiers for real-time bullying recognition. Finally, the detection of synchronous bullying behaviors among groups of children is another natural extension of this work.

Acknowledgments

This material is supported by the National Science Foundation under award number IIS-1117279 and the U.S. Army Research Laboratory under contract number W911NF-14-D-0005. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Government, and no official endorsement should be inferred.

References

[1] E. Acar, F. Hopfgartner, and S. Albayrak. Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues. In Proceedings of the 21st ACM International Conference on Multimedia, pages 717–720. ACM, 2013.

[2] M. A. Barnett, S. R. Burns, F. W. Sanborn, J. S. Bartel, and S. J. Wilds. Antisocial and prosocial teasing among children: Perceptions and individual differences. Social Development, 13(2):292–310, 2004.

[3] D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.

[4] K. Bjorkqvist, K. M. Lagerspetz, and A. Kaukiainen. Do girls manipulate and boys fight? Developmental trends in regard to direct and indirect aggression. Aggressive Behavior, 18(2):117–127, 1992.

[5] D. Chen, H. Wactlar, M.-Y. Chen, C. Gao, A. Bharucha, and A. Hauptmann. Recognition of aggressive human behavior using binary local motion descriptors. In 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 5238–5241. IEEE, 2008.

[6] L.-H. Chen, H.-W. Hsu, L.-Y. Wang, and C.-W. Su. Violence detection in movies. In Computer Graphics, Imaging and Visualization (CGIV), 2011 Eighth International Conference on, pages 119–124. IEEE, 2011.

[7] W. E. Copeland, D. Wolke, A. Angold, and E. J. Costello. Adult psychiatric outcomes of bullying and being bullied by peers in childhood and adolescence. JAMA Psychiatry, 70(4):419–426, 2013.

[8] J. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[9] N. E. Gillian and J. A. Paradiso. The gesture recognition toolkit. Journal of Machine Learning Research, 15(1):3483–3487, 2014.

[10] T. Hassner, Y. Itcher, and O. Kliper-Gross. Violent flows: Real-time detection of violent crowd behavior. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–6. IEEE, 2012.


[11] A. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.

[12] I. Leite, M. McCoy, D. Ullman, N. Salomons, and B. Scassellati. Comparing models of disengagement in individual and group interactions. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 99–105. ACM, 2015.

[13] T. R. Nansel, M. Overpeck, R. S. Pilla, W. J. Ruan, B. Simons-Morton, and P. Scheidt. Bullying behaviors among US youth: Prevalence and association with psychosocial adjustment. JAMA, 285(16):2094–2100, 2001.

[14] E. B. Nievas, O. D. Suarez, G. B. García, and R. Sukthankar. Violence detection in video using computer vision techniques. In International Conference on Computer Analysis of Images and Patterns, pages 332–339. Springer, 2011.

[15] T. Nomura, T. Uratani, T. Kanda, K. Matsumoto, H. Kidokoro, Y. Suehiro, and S. Yamada. Why do children abuse robots? In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, pages 63–64. ACM, 2015.

[16] D. Olweus. The Revised Olweus Bully/Victim Questionnaire. University of Bergen, Research Center for Health Promotion, 1996.

[17] D. Olweus. Bully/victim problems in school: Facts and intervention. European Journal of Psychology of Education, 12(4):495–510, 1997.

[18] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007.

[19] J. P. Shapiro, R. F. Baumeister, and J. W. Kessler. A three-component model of children’s teasing: Aggression, humor, and ambiguity. Journal of Social and Clinical Psychology, 10(4):459–472, 1991.

[20] M. Sokolova and G. Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437, 2009.

[21] M. E. Solberg and D. Olweus. Prevalence estimation of school bullying with the Olweus Bully/Victim Questionnaire. Aggressive Behavior, 29(3):239–268, 2003.

[22] S. Strohkorb, I. Leite, N. Warren, and B. Scassellati. Classification of children’s social dominance in group interactions with robots. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 227–234. ACM, 2015.

[23] J. Wang, R. J. Iannotti, and T. R. Nansel. School bullying among adolescents in the United States: Physical, verbal, relational, and cyber. Journal of Adolescent Health, 45(4):368–375, 2009.

[24] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell. Hidden conditional random fields for gesture recognition. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1521–1527. IEEE, 2006.

[25] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2011.

[26] L. Ye, H. Ferdinando, T. Seppanen, and E. Alasaarela. Physical violence detection for preventing school bullying. Advances in Artificial Intelligence, 2014:5, 2014.

[27] L. Ye, H. Ferdinando, T. Seppanen, T. Huuki, and E. Alasaarela. An instance-based physical violence detection algorithm for school bullying prevention. In 2015 International Wireless Communications and Mobile Computing Conference (IWCMC), pages 1384–1388. IEEE, 2015.

[28] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2752–2759, 2013.

[29] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang. RGB-D-based action recognition datasets: A survey. Pattern Recognition, 60:86–105, 2016.