On the Possibility of Predicting Gaze Aversion to Improve Video-Chat Efficiency

Anonymous VCIP Submission
Paper ID: 64

Abstract—A possible way to make video chat more efficient is to only send video frames that are likely to be looked at by the remote participant. Gaze in dialog is intimately tied to dialog states and behaviors, so prediction of such times should be possible. To investigate, we collected data on both participants in 6 video-chat sessions, totalling 65 minutes, and created a model to predict whether a participant will be looking at the screen 300 milliseconds in the future, based on prosodic and gaze information available at the other side. A simple predictor had a precision of 42% at the equal error rate. While this is probably not good enough to be useful, improved performance should be readily achievable.

I. MOTIVATION

The principal challenge in telecommunications is maximizing the quality of experience that can be delivered for a given bandwidth. Adaptation can help: channels where instantaneous quality varies over time can support a better overall user experience than constant-rate transmission. To date, adaptation is generally based on the network state. We propose instead to adapt to the state of the user, in particular to the user’s attention level.

Specifically, we would like to make video chat more efficient. Video chat is popular and a significant data hog, especially for mobile users [1]. We propose to transmit less information for video frames that the remote user is unlikely to look at, exploiting the fact that people often look away (avert gaze) during dialog. This paper explores the feasibility of this idea.

II. BACKGROUND AND PRELIMINARY STUDIES

It has long been known that people in dyadic conversation often gaze away from their interlocutors [2], with reported percentages varying from 10% to 91%, depending on the individual and other factors. Work studying the relationship between gaze, dialog activities, and prosody in such conversations [3], [4], [5] has shown that gaze aversion is not randomly distributed, but relates to turn-taking behavior. For example, in one study, 73% of gaze aversions were turn-initial, with an average duration of 2.3 seconds [6]. Turn-taking behavior in turn relates to the prosody of the utterances of the participants [7], [8], suggesting that we may be able to predict gaze from prosody.

For telecommunications applications, there have been studies of gaze in multi-party scenarios and the question of who a participant will be looking at. More detailed gaze modeling has also been done, but so far only for non-interactive video [9], [10], [11], [12].

We did a preliminary study of the human ability to predict gaze. We made six over-the-shoulder videos of people Skyping, in a configuration with a one-way delay of about 600 milliseconds. We then had four untrained observers watch these using Elan; we stopped the video at random points and asked them to predict whether the remote participant would be looking at the local one 350 ms later. We thus obtained 250 informal predictions. 72% of the predictions of an upcoming gaze aversion were correct, and overall only 10% of their predictions were false positives. Thus gaze appears to be fairly predictable.

We did a second preliminary study to investigate the relationship between gaze aversion and the value of the displayed image to the user. In particular, since the focus of attention is not always at the gaze point, we need an estimate of how well gaze can serve as a proxy for attention. While aspects of visual acuity across the visual field have been well-studied, we are concerned specifically with viewing faces, and with sensitivity during normal conversation, rather than in single-task experiments. We therefore investigated the relation between gaze direction and the likelihood of noticing video impairments, again informally. In this study, two participants communicated over Skype, and occasionally one closed his or her eyes for about a second. These stimuli were cued by a silent random-interval timer visible to that one participant. Immediately afterwards the stimulus-generating participant said “freeze” and the other froze his or her gaze and reported the location, to within 10 degrees, by naming the nearest one of a grid of labeled dots on the screen and wall. They also reported whether or not they had noticed the closing of the eyes.

Fig. 1. Participant’s view, with webcam above and gaze tracker below the screen.

With 6 participants and 150 total trials, we found, as expected, that closings were noticed less to the extent that gaze was farther from the center of the screen. In particular, in cases where the reported gaze was greater than 15 degrees from the screen center, in either the x or y direction, the stimulus detection rate was 13%, far less than the 89% when gaze was closer, let alone the 93% when gaze was within 5 degrees of the center.

III. DATA COLLECTION

To investigate predictability we needed data. Since gaze behavior in telecommunications differs from that in face-to-face dialog [13] (for reasons probably including the lack of physical presence, poor image and audio quality, and degraded contingency due to delay), we created a new data collection. We set up a Skype connection between two quiet rooms in our laboratory. The Skype window filled the screen, and participants sat at a comfortable distance and adjusted the gaze tracker appropriately. We asked them not to move too much, because of the gaze tracker’s limitations. The image of the partner’s face on the screen typically subtended around 7 degrees horizontally, as suggested by Figure 1.

Each pair of participants had a 4–12 minute Skype conversation, on any topics they wished to talk about. Each pair was already acquainted and many were friends. All were affiliated with a Computer Science department. There were 12 participants and 6 conversations, totaling 65 minutes.

Each participant’s gaze was tracked by a Gazepoint GP3 Eye Tracker mounted below the monitor. Audio was recorded with head-mounted microphones and a Tascam DR-40. We recorded reference signals, consisting of three claps in front of the gaze tracker, before the conversation and again afterwards, for each side. Synchronizing with these gave a global, delay-free view of the behaviors. This was not, of course, what the participants experienced: in Skype the delay is not constant, and the audio and video delays may diverge. In this setup the video delays were around 120 to 150 ms, and the audio delays 200 to 250 ms. Video freezes were rare.

We counted as “gaze-off” those frames where either the gaze tracker reported no gaze detected or the gaze was away from the interlocutor’s face. The former were probably mostly hard-to-predict blinks, and the latter mostly more predictable gaze aversions. “Away from the face” was determined using two heuristics. First, since the participants were free to move, and because we did not capture images, we estimated the center of interest as the average x and y of the on-screen fixations. Second, based on the informal perceptual study mentioned above, gaze aversion was inferred when the current gaze direction was more than 0.7h away from this center of interest, where h is the screen height. This corresponds to about 14 degrees of visual angle.
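
To make the labeling procedure concrete, the following is a minimal sketch of these two heuristics in Python. The variable names and the per-frame data representation are our own assumptions; only the center-of-interest definition and the 0.7h threshold come from the description above.

import numpy as np

def label_gaze_off(gaze_x, gaze_y, valid, screen_h):
    """Label each frame gaze-off (True) or gaze-on (False).

    gaze_x, gaze_y: per-frame gaze coordinates in screen units
    valid: per-frame flag, False where the tracker reported no gaze
    screen_h: screen height in the same units
    """
    gaze_x = np.asarray(gaze_x, float)
    gaze_y = np.asarray(gaze_y, float)
    valid = np.asarray(valid, bool)

    # Heuristic 1: center of interest = mean of the valid on-screen fixations.
    cx, cy = gaze_x[valid].mean(), gaze_y[valid].mean()

    # Heuristic 2: gaze aversion if more than 0.7 * screen height from that
    # center (about 14 degrees of visual angle in our setup).
    dist = np.hypot(gaze_x - cx, gaze_y - cy)
    averted = dist > 0.7 * screen_h

    # Gaze-off = tracker lost the gaze (mostly blinks) or gaze averted.
    return ~valid | averted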

Using these definitions, the fraction of gaze-off frames averaged 29%, varying across participants from 12% to 45%.

IV. PREDICTIVE FEATURES

Given this data we investigated features that correlate with future gaze. In order to decide whether or not to send a packet at time t, we need to predict whether the remote user will be looking at the screen at time t + d, where d is the one-way delay. Delay implies, further, that the prediction algorithm will not have access to any recipient-side features more recent than t − d. Here we assume d = 300 ms, and accordingly consider only features that would be available at the local, sending end when the prediction must be made.

We first examined prosodic features. As noted above, gaze is related to turn-taking; we have also observed that participants tend to look away while laughing, when back-channeling, when closing out a topic, when thinking, and during affect bursts. Accordingly the prosodic feature set was designed to capture information useful for turn-taking and dialog-event prediction [14], [15], [16]. Our features were volume, pitch height, pitch range, pitch creakiness, and a speaking-rate proxy. Since we thought gaze might relate to patterns of behavior across time, we used feature averages over adjacent windows, for example, the average interlocutor volume over 1900 to 1100 ms before the prediction point, the average value over 1100 to 600 ms before, and so on. We included such features for both participants.
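
As an illustration of this windowing scheme, the sketch below averages a per-10-ms prosodic track over a set of windows placed before the prediction point. Only the first two window boundaries come from the text; the third window, the sampling rate, and the helper names are assumptions for illustration, and the full feature set used such averages for all five prosodic tracks of both participants.

import numpy as np

FRAME_MS = 10  # assume prosodic tracks sampled every 10 ms

# Windows given as (start, end) in ms before the prediction point; the first
# two are the ones named in the text, the third is hypothetical.
WINDOWS = [(1900, 1100), (1100, 600), (600, 300)]

def windowed_averages(track, t_ms, windows=WINDOWS):
    """Average one prosodic track (e.g. volume) over windows ending before t_ms."""
    feats = []
    for start, end in windows:
        lo = max((t_ms - start) // FRAME_MS, 0)
        hi = max((t_ms - end) // FRAME_MS, 0)
        feats.append(np.mean(track[lo:hi]) if hi > lo else 0.0)
    return feats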

Examining correlations, speaking rate was negatively correlated (–0.10) with upcoming gaze, perhaps because of turn-end behavior, where speakers are known both to tend to slow down and to look at the interlocutor. Interlocutor volume and speaking rate correlated positively with upcoming gaze, perhaps because increases in rate and volume tend to signify important information and attract attention.

We next examined gaze-location features. We used similarly coarse windows and again included features for both participants, as there is clearly some interplay. For example, it is reported that at turn end the current speaker usually looks directly at the interlocutor, and that at the subsequent turn start, the new speaker looks away. For the remote participant, whose gaze we were predicting, we used no features more recent than 300 ms in the past. Since gaze aversions in different directions may reflect different cognitive activities [17], we included features to encode this information. For convenient use in our linear model, explained below, we thresholded these so that, for example, the gaze-up-distance feature value was zero when the gaze was below the center point. Our features were gaze-up, gaze-down, gaze-right, gaze-left, gaze-distance-from-center, and gaze-on fraction.
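
A minimal sketch of these thresholded direction features follows, computed over one window of frames. The coordinate convention (y increasing downward, as in screen coordinates) and the function and variable names are our assumptions.

import numpy as np

def gaze_direction_features(gaze_x, gaze_y, on_screen, cx, cy):
    """Gaze features for one window, relative to the center of interest (cx, cy).

    Each directional feature is a mean distance that is zero when the gaze
    is on the other side of the center, i.e. the thresholding described above.
    """
    dx = np.asarray(gaze_x, float) - cx
    dy = np.asarray(gaze_y, float) - cy   # screen coordinates: +y is downward
    return {
        "gaze_up": np.mean(np.maximum(-dy, 0.0)),
        "gaze_down": np.mean(np.maximum(dy, 0.0)),
        "gaze_right": np.mean(np.maximum(dx, 0.0)),
        "gaze_left": np.mean(np.maximum(-dx, 0.0)),
        "gaze_distance_from_center": np.mean(np.hypot(dx, dy)),
        "gaze_on_fraction": np.mean(np.asarray(on_screen, float)),
    }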

TABLE I
PERCENT OF FRAMES IN EACH CONDITION

Gaze:           predicted on   predicted off    sum
actually on         67%              4%          71%
actually off        23%              6%          29%
sum                 90%             10%         100%

The most highly correlated feature was, unsurprisingly, the fraction of gaze-on by the speaker during the most recent available 100 ms window, namely that 700 to 600 ms before the frame to predict, with a correlation of 0.27. Gaze-away features correlated negatively, in the order right > down > left > up, perhaps because gaze aversions to the right tend to last longer. Interlocutor gaze features correlated in the same ways, although more weakly, perhaps reflecting the fact that gaze tends to be reciprocated. Overall, gaze features were far more predictive than the prosodic features.
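
These are simple linear (point-biserial) correlations between each feature and the binary future-gaze label; a sketch, with hypothetical variable names, is:

import numpy as np

def feature_correlations(X, y):
    """Pearson correlation of each feature column of X with the binary
    label y (1 = gaze-on 300 ms later, 0 = gaze-off)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)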

V. PREDICTION RESULTS

As this is an initial exploration, we chose to predict using a very simple model, linear regression with a threshold, as we wanted something easy to analyze. The input was 42 gaze features and 67 prosodic features. Varying the threshold makes the model predict likely gaze-off moments more or less aggressively, thereby varying the trade-off between reducing transmission and preserving quality.
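
A minimal sketch of such a thresholded linear-regression predictor is below. It assumes the scikit-learn LinearRegression estimator, hypothetical function names, and a gaze-on regression target; any least-squares fit would do equally well.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_predictor(X_train, y_train):
    """Fit a linear regression to the binary gaze-on label (1 = on, 0 = off)."""
    return LinearRegression().fit(X_train, y_train)

def predict_gaze_off(model, X, threshold):
    """Predict gaze-off wherever the regression score falls below threshold.

    Raising the threshold makes gaze-off predictions more aggressive (more
    transmission saved); lowering it makes them rarer but more precise.
    """
    return model.predict(X) < threshold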

A gaze-off prediction is correct, of course, if it matches the actual gaze state in the data 300 ms later. We evaluated our predictor by generating predictions every 10 ms, 707,830 in total. This we did by testing, for each participant’s data, the performance of a model trained on the other 11 participants. Thus each test was of the predictability of a participant unseen in the training data. The results reported are averages across all participants.
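
In other words, evaluation was leave-one-participant-out. A sketch of that loop, building on the hypothetical fit_predictor and predict_gaze_off above and an assumed per-participant data layout, is:

import numpy as np

def leave_one_participant_out(frames_by_participant, threshold):
    """frames_by_participant: dict of participant -> (X, y), one row per 10 ms.

    Returns per-participant precision and recall for gaze-off prediction."""
    results = {}
    for held_out, (X_test, y_test) in frames_by_participant.items():
        X_train = np.vstack([X for p, (X, y) in frames_by_participant.items()
                             if p != held_out])
        y_train = np.concatenate([y for p, (X, y) in frames_by_participant.items()
                                  if p != held_out])
        model = fit_predictor(X_train, y_train)
        pred_off = predict_gaze_off(model, X_test, threshold)
        true_off = (np.asarray(y_test) == 0)
        tp = np.sum(pred_off & true_off)
        precision = tp / max(pred_off.sum(), 1)
        recall = tp / max(true_off.sum(), 1)
        results[held_out] = (precision, recall)
    return results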

Figure 2 shows how the precision, recall, and weighted F-measure varied with the prediction threshold. The equal-error rate was 42%; that is, there was an operating point at which the fraction of the gaze-off frames correctly predicted as such was 42%, and so was the fraction of the gaze-off predictions that were correct. This precision is significantly above the 30% baseline which would be obtained by random guessing. For the F-measure we weighted the precision 30 times the recall, for reasons discussed below. The best-point value for this measure was 49%. Best-point performance varied greatly across speakers, from 18% to 86%.
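
The exact form of this weighting is not spelled out here; one natural reading, assumed for concreteness, is the weighted harmonic mean of precision P and recall R with precision given 30 times the weight of recall:

F30 = 31 · P · R / (P + 30 · R)

At the best point reported below (P = 53%, R = 19%) this gives roughly 0.50, consistent with the reported best-point value of 49%.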

Examining more closely the performance at the best point: the average precision was 53%, and the recall 19%. The confusion matrix at this operating point is given in Table I.

Fig. 2. Predictor Performance

It would be convenient if decent performance could be obtained with prosodic features alone, as they are easy to compute with any current hardware, or without using remote-speaker features. However, precision for feature sets without remote-speaker gaze was barely above chance.

VI. COST-BENEFIT ANALYSIS

While clearly our method is able to reduce transmission rates, incorrect predictions will, from the recipient’s perspective, manifest as a noticeable loss of quality. This section considers when this tradeoff is likely to be advantageous.

Unfortunately, while there are good quality-of-experience models for video transmission and for interactive audio, there are no suitable ones for video chat. Most quality modeling has been done for passive video viewing, but user behaviors and user preferences clearly differ for interactive applications such as video chat and teleconferencing [18], [19]. Moreover, different users have different preferences, especially different cost sensitivities, and video chat can be implemented and delivered in different ways [20], [21].

Nevertheless there is information in the literature from which we can attempt to roughly characterize the tradeoff. The primary determinants of subjective video quality are packet loss and bitrate (the product of frame rate and frame quality) [20]. The benefit of gaze-aversion modeling is a reduction in unneeded transmission, and this can be used to increase the bitrate at times when transmission is needed. A 30% increase in bitrate appears to generally provide a Mean-Opinion-Score (MOS) increase of about 0.3 points, according to the graphs in [20]. A 1% increase in packet loss appears to generally cause a MOS decrease of about 0.3 points, according to the graphs in [22]. The ratio of these two effects is the source of the weight of 30 used above.
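
Concretely, expressing both effects per percentage point gives the weight:

0.3 MOS / 30% bitrate ≈ 0.01 MOS per 1% of bitrate
0.3 MOS per 1% of packet loss
weight = 0.3 / 0.01 = 30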

From this we can estimate the utility of a system operating at the best point identified above. At this point 10% of the frames are being predicted as gaze-off, implying that, if transmission were suspended entirely at such times, the transmission reduction would be 10%. This means that the bitrate at times when transmission is needed can be increased by 11%, without increasing the overall bitrate. This would indicate a MOS benefit of 0.12. At this same point, due to false negatives, 4% of the total frames are actually gaze-on frames but incorrectly not sent. The fraction of useful frames sent is therefore, using the noticing ratios from our preliminary study, 89% of the gaze-on frames correctly sent plus 13% of the gaze-off frames incorrectly sent, or 65.5%. In comparison, the fraction of useful frames received if all are sent is 67.1%. Thus there is a 2.4% reduction in the quantity of useful frames, equivalent from the user’s perspective to a 2.4% increase in packet loss, which suggests a MOS decrement of 0.72 points, far outweighing the benefit. This estimate suggests that the performance of our current predictor is far from sufficient.
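
The rough arithmetic behind these figures, using the noticing rates from Section II and the per-percent MOS effects from the previous section (rounding aside), is:

benefit: 10% fewer frames sent → bitrate × 1/0.9 ≈ +11% → ≈ +0.11 MOS
useful frames if all are sent: 0.89 × 71% + 0.13 × 29% ≈ 67.1%
useful frames with the predictor: ≈ 65.5%, a relative loss of (67.1 − 65.5)/67.1 ≈ 2.4%
cost: 2.4 × 0.3 ≈ 0.72 MOS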

VII. FUTURE WORK

Direct experimental evaluation of the potential of this technique is needed. Given the widespread acceptance of video-chat systems that periodically freeze, it is possible that estimation based on other quality models, as done above, significantly underestimates the potential value of this method.

Nevertheless, it seems certain that much better gaze prediction will be needed to make gaze-adaptive transmission useful for video chat. This may be achievable. Better features, especially those supporting fine-grained modeling of gaze dynamics [23], [12], should be tried. In addition, image features and the words spoken may be helpful, although robustness and computational-complexity considerations may limit their utility in practice. Given the large individual differences, speaker-specific modeling is also likely to greatly improve results. Instead of trying to predict all gaze-off frames with one model, one might try composite models that fuse the outputs of individual models trained to predict different gaze events, such as blinks, aversion while thinking, and aversion while laughing. Models could also be conditioned on gender, age, relative status, dialog type, personality, dialect, and language. Finally, future work should obviously use more training data and better machine-learning models.

Other types of attention prediction might also be explored. Rather than merely predicting gaze on/off, one might try to predict the actual gaze location, for example to the eyes or the mouth, so as to devote higher resolution to the image region of interest. Also, while our data was collected in a distraction-free situation, people also video chat while reading email, playing games, talking to side participants, cooking, walking, and so on. Prediction of attention in non-distraction-free environments may be much easier.

VIII. SUMMARY

This paper has explored the possibility of using a predictive model of video-chat participant gaze to reduce pointless transmission. It identified relevant dialog states and predictive features, reported the first study of the predictability of gaze in dialog, and identified avenues for improvement.

ACKNOWLEDGMENT

[omitted for anonymous submission]

REFERENCES

[1] S. Jana, A. Pande, A. Chan, and P. Mohapatra, “Mobile video chat: issues and challenges,” IEEE Communications Magazine, vol. 51, no. 6, 2013.

[2] A. Kendon, “Some functions of gaze-direction in social interaction,” Acta Psychologica, vol. 26, pp. 22–63, 1967.

[3] F. Cummins, “Gaze and blinking in dyadic conversation: A study in coordinated behaviour among individuals,” Language and Cognitive Processes, vol. 27, no. 10, pp. 1525–1549, 2012.

[4] T. Kawahara, T. Iwatate, and K. Takanashi, “Prediction of turn-taking by combining prosodic and eye-gaze information in poster conversations,” in Interspeech, 2012.

[5] K. Jokinen, H. Furukawa, M. Nishida, and S. Yamamoto, “Gaze and turn-taking behavior in casual conversational interactions,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 3, p. 12, 2013.

[6] S. Andrist, B. Mutlu, and M. Gleicher, “Conversational gaze aversion for virtual agents,” in Proceedings of Intelligent Virtual Agents (IVA), 2013.

[7] A. Gravano and J. Hirschberg, “Turn-taking cues in task-oriented dialogue,” Computer Speech and Language, vol. 25, pp. 601–634, 2011.

[8] A. Raux and M. Eskenazi, “Optimizing the turn-taking behavior of task-oriented spoken dialog systems,” ACM Transactions on Speech and Language Processing (TSLP), vol. 9, p. 1, 2012.

[9] O. V. Komogortsev, “Gaze-contingent video compression with targeted gaze containment performance,” Journal of Electronic Imaging, vol. 18, no. 3, pp. 1–10, 2009.

[10] R. Yonetani, “Modeling video viewing behaviors for viewer state estimation,” in Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 1393–1396.

[11] Y. Feng, G. Cheung, W.-t. Tan, P. Le Callet, and Y. Ji, “Low-cost eye gaze prediction system for interactive networked video streaming,” IEEE Transactions on Multimedia, 2013.

[12] J.-W. Kim and J.-O. Kim, “Video viewer state estimation using gaze tracking and video content analysis,” in Visual Communications and Image Processing (VCIP). IEEE, 2013, pp. 1–6.

[13] G. Doherty-Sneddon and F. G. Phelps, “Gaze aversion: A response to cognitive or social difficulty?” Memory & Cognition, vol. 33, no. 4, pp. 727–733, 2005.

[14] E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, and A. Stolcke, “Modeling prosodic feature sequences for speaker recognition,” Speech Communication, vol. 46, pp. 455–472, 2005.

[15] K. P. Truong and D. A. van Leeuwen, “Automatic discrimination between laughter and speech,” Speech Communication, vol. 49, pp. 144–158, 2007.

[16] N. G. Ward, A. Vega, and T. Baumann, “Prosodic and temporal features for language modeling for dialog,” Speech Communication, vol. 54, pp. 161–174, 2011.

[17] H. Ehrlichman and A. Weinberger, “Lateral eye movements and hemispheric asymmetry: a critical review,” Psychological Bulletin, vol. 85, no. 5, p. 1080, 1978.

[18] S. Whittaker and B. O’Conaill, “The role of vision in face-to-face and mediated communication,” in Video-Mediated Communication, K. E. Finn, A. J. Sellen, and S. B. Wilbur, Eds. Lawrence Erlbaum Associates, 1997.

[19] L. S. Bohannon, A. M. Herbert, J. B. Pelz, and E. M. Rantanen, “Eye contact and video-mediated communication: A review,” Displays, 2012.

[20] G. W. Cermak, “Subjective video quality as a function of bit rate, frame rate, packet loss, and codec,” in International Workshop on Quality of Multimedia Experience (QoMEx). IEEE, 2009, pp. 41–46.

[21] C. Yu, Y. Xu, B. Liu, and Y. Liu, “Can you SEE me now? A measurement study of mobile video calls,” in Proceedings of IEEE Conference on Computer Communications (INFOCOM), 2014.

[22] S. Khorsandroo, R. M. Noor, and S. Khorsandroo, “A generic quantitative relationship between quality of experience and packet loss in video streaming services,” in Fourth International Conference on Ubiquitous and Future Networks (ICUFN). IEEE, 2012, pp. 352–356.

[23] O. V. Komogortsev, Y. S. Ryu, S. Marcos, and D. H. Koh, “Quick models for saccade amplitude prediction,” Journal of Eye Movement Research, vol. 3, 2009.