

DeepLogger: Extracting User Input Logs From 2D Gameplay Videos

Thanapong Intharah1,2 and Gabriel J. Brostow1

1 University College London, London, United Kingdom
2 Khon Kaen University, Khon Kaen, Thailand

[t.intharah and g.brostow]@cs.ucl.ac.uk

ABSTRACT
Game and player analysis would be much easier if user interactions were electronically logged and shared with game researchers. Understandably, sniffing software is perceived as invasive and a risk to privacy. To collect player analytics from large populations, we look to the millions of users who already publicly share video of their game playing. Though labor-intensive, we found that someone with experience of playing a specific game can watch a screen-cast of someone else playing, and can then infer approximately what buttons and controls the player pressed, and when. We seek to automatically convert video into such game-play transcripts, or logs.

We approach the task of inferring user interaction logs from video as a machine learning challenge. Specifically, we propose a supervised learning framework to first train a neural network on videos where real sniffer/instrumented software was collecting ground-truth logs. Then, once our DeepLogger network is trained, it should ideally infer log activities for each new input video that features gameplay of that game. These user-interaction logs can serve as sensor data for gaming analytics, or as supervision for training game-playing AIs. We evaluate the DeepLogger system for generating logs from two 2D games, Tetris [23] and Mega Man X [6], chosen to represent distinct game genres. Our system performs as well as human experts for the task of video-to-log transcription, and could allow game researchers to easily scale their data collection and analysis up to massive populations.

CCS Concepts
•Computing methodologies → Video summarization; Supervised learning by classification; •Human-centered computing → Interaction design theory, concepts and paradigms;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI PLAY '18, October 28–31, 2018, Melbourne, VIC, Australia
©2018 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-5624-4/18/10 ... $15.00
http://dx.doi.org/10.1145/3242671.3242674

Author Keywords
Convolutional Neural Network; User Input Log; Learning From Gameplay Video.

INTRODUCTION
To capture a player's experience means watching them, probing them with different scenarios, and interviewing them to understand how they felt in the game and afterward. Game analytics and large-scale game evaluations are also important, to supplement such careful analysis of individual players. But practical constraints influence the balance between these kinds of depth vs breadth analyses. For example, when studying how players experience different game levels, important insights about play-tuning and interfaces come from both small focus-groups and global-scale cohorts of players. When available, just the log files themselves provide invaluable insights [33, 26, 25, 27]. Critically, the sampled population size for each study is partly a question of time and cost.

The focus of this paper is to make the millions of hours of recorded gameplay videos usable, for each game's analytics and game-user research. In short, we seek to automatically convert each video into a log-file transcript, as if the player had used sniffer software while playing a known game. Privacy concerns and OS security, especially on mobile devices, keep gamers from generating and sharing log files of their game play. But millions of players are comfortable sharing videos online. On YouTube, there are 2.3m and 7.8m search-result videos for The Legend of Zelda: Breath of the Wild [10] and PlayerUnknown's Battlegrounds [11], respectively, after the games were released for only 12 months. Twitch.tv, a website expressly for live-streaming videos of games, has 2m unique players broadcasting every month. Until now, that rich data was hard to interpret.

We introduce DeepLogger, a Convolutional Neural Network (CNN) that is custom-built to generate player-computer interaction log-files from gameplay videos. An example of its intended use would be to collect information about a specific game, and how players do better or worse depending on whether they play using a keyboard, game controller, or a particular mobile phone model. A DeepLogger CNN would first be trained by a cooperative game-player. She would record video and key/button logs on a computer with a sniffer program installed. The trained DeepLogger could then be run on each of the 10^3 ... 10^6 relevant gameplay videos of that game. The


resulting log files could reveal trends and advantage-giving interfaces. Additionally, interface researchers who don't own a game's copyright or code can avoid the copyright-infringement risks that sometimes go along with building emulators [9].

Figure 1. Classic titles: Tetris from the Nintendo Entertainment System (NES) and Mega Man X from the Super Nintendo Entertainment System (SNES) are used in our experiments. Cover art © Nintendo Co., Ltd.

The proposed network is evaluated on a spectrum of quantitative metrics, through different scenarios: training and testing on different players' videos, different levels, and different video encoders. We picked two classic games from two popular console systems, Tetris (NES) [23] and Mega Man X (SNES) [6], as our test cases, shown in Figure 1.

Both obvious and unexpected challenges are set out in the next section. These are challenges faced by both humans and baseline networks, as they try to convert a video into a log file. The rest of the paper works through our solutions, and evaluation criteria for validating the DeepLogger approach.

RELATED WORK
We briefly summarize two different branches of work that relate to our own. First, we discuss projects where different gaming log-files served as the main input data to studies of user behavior and gameplay experiences. Second, we highlight machine learning projects that treat computer games as experimental platforms. Where feasible, these establish a connection between machine learning algorithms and computer game analytics research.

Understanding Game Users Through Log-files
A log-file records information as a stream of events or messages. Log files have been used extensively in game user research to extract important or hidden information about gamers' emotions or behavior. Here, we touch on some of the many research projects that utilize gaming log-files (with a variety of content) as initial sources of information to extract analytics and better understand player experiences.

Shute et al. [26, 25, 27] analyze user interaction logs in their stealth assessment of digital games for education. Stealth assessment is an assessment made from the interaction data collected during gameplay sessions; the assessment is not noticeable by the users/learners. [8] uses customized game log data to generate video summarizations of what happened during gameplay. In [31], Tveit et al. proposed using a player's

action log-files from Massive Multiplayer games to mine for user patterns of behavior. Zaman and MacKenzie [33] use user input logs and game logs to evaluate different input devices for touchscreen phones. Nacke et al. [21, 22] study the use of game session logs and game event logs to assess game player experience, and to better understand the gameplay itself. Smith and Nayar [29] successfully model a user's play style from the user's raw controller inputs, using Latent Dirichlet Allocation (LDA).

While these works use log information as a core input component, we are unaware of any practical tool that allows researchers to retrieve such information from publicly shared gameplay videos. To the best of our knowledge, the closest research aiming to extract a game information log from gameplay video is the work of Marczak et al. [18], which extracts gameplay metrics such as the health bar and in-game items using simple hand-coded image processing techniques. We, however, propose a system that allows researchers to automatically extract raw control-input information from existing videos. This should speed up our research community's ability to work with very large datasets, without conducting extremely laborious and costly data collection processes.

Using Machine Learning in Computer Game Research
Several attempts to probe Artificial Intelligence (AI) agents in the context of computer games played by humans have been made recently, through advancements in Machine Learning (ML) research and through newly available platforms. To allow the AI and ML research communities to validate their algorithms, a variety of games has been used. Each presents its own challenges, with platforms specifically chosen or custom-made by these researchers to more easily plug in their AI and ML algorithms as required. We highlight several of these well-known platforms.

OpenAI Gym [5] and ALE [3] provide environments for computer agents to play Atari 2600 games and many flash games. VizDoom [16] and DeepMind Lab [2] provide APIs which allow AI and ML agents to learn to play classic first-person shooter (FPS) games, e.g. Doom and Quake III respectively. Recently, RLE [4] was proposed as an environment for SNES games, and Project Malmo [15] is an environment for the game Minecraft.

These platforms encourage ML researchers to develop new agents, e.g. through Reinforcement Learning. Some examples are [12] for Tetris and [20] for Mario; Deep Reinforcement Learning agents [17, 7, 24] for Doom, [19] for Atari games, and NeuralKart [13] for Mario Kart 64; and the Supervised Learning agent TensorKart [14] for Mario Kart 64.

DeepLogger is a tool intended to complement these works, by allowing researchers to train their machine learning algorithms on human gameplay videos, in addition to, or in contrast with, having agents learn only by self-play.

BASELINES AND CHALLENGES THEY FACE
We start by illustrating the performance of human gamers when asked to estimate what buttons were pressed during the middle frame of a short video clip of someone else's gameplay.


Tetris                     Human          DeepLogger
Single-label Accuracy      0.8400±0.00    0.7885
Multi-label Accuracy       0.8400±0.00    0.7911
F1-score (Example-based)   0.8644±0.00    0.8882
F1-score (Label-based)     0.7794±0.02    0.5691

Mega Man X                 Human          DeepLogger
Single-label Accuracy      0.2250±0.18    0.5356
Multi-label Accuracy       0.4420±0.20    0.7060
F1-score (Example-based)   0.5936±0.19    0.8325
F1-score (Label-based)     0.4722±0.18    0.5363

Table 1. The performance of both human experts and the proposed DeepLogger system on estimating a gamer's controller inputs (the log) from gameplay videos only: Tetris and Mega Man X. The criteria (rows) are explained in the text; higher accuracies and F1-scores are better. Humans are better with Tetris videos, while DeepLogger does better with Mega Man X, possibly due to the UI complexity.

We then discuss the challenges faced by both a human and the automated systems, as we attempt to generate interaction logs from videos.

Baseline: Human Performance
In this section, we analyze how people fare when estimating user input logs from gameplay videos. A small user study was conducted online, with 8 gamers recruited to participate. The gamers were first screened, selecting only those who had experience playing Tetris and Mega Man X. In questionnaires, the gamers had to answer 50 questions for each of the two games. Each question displays a short video clip (Figure 2), randomly extracted from our gameplay videos. We ask the user to select all the check-boxes for game-control buttons that they think were pressed in the short video clip.

Figure 2. An example video clip of Tetris gameplay, with check-boxes for a user to indicate what buttons they think were pressed at the middle frame. While this is a demanding and time-consuming task, users were fairly successful when "transcribing" Tetris.

Table 1 summarizes our human subjects' performance, and previews DeepLogger's, on the task of predicting input logs from video. We can see that the system performs better than human experts on a harder game like Mega Man X, where there are many possible button-combinations, but it performs worse than human experts on Tetris, a game that is visually easier to decipher, with fewer possible button-combinations.

Figure 3. Button-combination frequencies for Tetris (a) and Mega Man X (b). For Tetris, the dominant input is Idle (nothing pressed), followed by three main buttons: Down, Left, and Right. For Mega Man X, the two main classes are Idle and Shoot, followed by the "Right and Shoot" combination.

Challenge: Class Imbalance
Class imbalance happens when some controls or button-combinations are used more often than others; see the button-combination statistics for Tetris and Mega Man X in Figure 3. It is quite common for game logs to have substantial imbalance. For instance, in a side-scrolling game such as Mega Man X, the character progresses by moving steadily right, so other direction-controls are pressed less often.

Substantial class imbalance, as found here, severely impacts the training of machine learning systems, and colors the performance metrics. For example, if one button-combination occurs 90% of the time, and the training optimization focuses on subset accuracy (no partial credit for imperfect key combinations), then regardless of input, the network will simply always predict that one combination.
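The degenerate-predictor point can be made concrete with a toy sketch (the 90/10 split below is illustrative, not taken from our data):

```python
# Hypothetical class distribution: 90% "Idle", 10% "Right+Shoot".
labels = ["Idle"] * 90 + ["Right+Shoot"] * 10

# A degenerate classifier that ignores its input entirely.
predictions = ["Idle"] * len(labels)

# Subset accuracy gives no partial credit: exact match or nothing.
subset_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(subset_accuracy)  # 0.9
```

Despite learning nothing about the player's inputs, this predictor already scores 0.9, which is why subset accuracy alone is a misleading training target here.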

In the Architecture section, we give the details of training our CNN while accounting for imbalanced data, and set out the performance metrics used throughout this paper.

Challenge: Multiple Control Buttons Per Record
At each time step, gamers can and do press multiple control keys at once. This leads to many possible unique button-combinations. For NES, eight control buttons can generate 2^8 combinations. For SNES, twelve control buttons can generate 2^12 combinations, though "select" and "start" are rarely pressed, leaving 2^10 most of the time.
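As a sanity check on the combinatorics, each controller state can be encoded as one bit per button (the button ordering below is illustrative, not the emulator's log format):

```python
from itertools import product

NES_BUTTONS = ["Up", "Down", "Left", "Right", "A", "B", "Select", "Start"]

def encode(pressed):
    """Encode a set of pressed buttons as a tuple of 0/1 flags."""
    return tuple(int(b in pressed) for b in NES_BUTTONS)

# Every subset of the 8 buttons is a distinct combination: 2^8 of them.
all_states = list(product([0, 1], repeat=len(NES_BUTTONS)))
print(len(all_states))         # 256
print(2 ** 12, 2 ** 10)        # SNES: 4096 total, 1024 without Select/Start
print(encode({"Right", "A"}))  # (0, 0, 0, 1, 1, 0, 0, 0)
```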


Notation: Q is the number of input buttons. x is the input of our classifier, in practice a short gameplay video clip. Instead of a single output y, Y is the classifier's output vector, representing the state of all the controller buttons. Y′ is the ground-truth label associated with x.

Our CNN would normally be trained using a standard multi-class loss function H,

H_{Y′}(Y) := −∑_{q}^{Q} Y′_q log Y*_q,   (1)

that compares the ground-truth output vector Y′ to the softmax output tensor of the network, Y*. Y* is

softmax(Y_q) := e^{Y_q} / ∑_{q}^{Q} e^{Y_q}.   (2)
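For reference, Equations (1) and (2) together are the usual softmax cross-entropy; a minimal NumPy sketch (our implementation uses TensorFlow's built-in ops):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - y.max())  # shift logits for numerical stability
    return e / e.sum()

def multiclass_loss(y_true, logits):
    """Eq. (1)-(2): H_{Y'}(Y) = -sum_q Y'_q * log(softmax(Y)_q)."""
    return -np.sum(y_true * np.log(softmax(logits)))

y_true = np.array([0.0, 1.0, 0.0])  # one-hot ground truth
logits = np.array([1.0, 3.0, 0.5])  # raw network outputs
print(multiclass_loss(y_true, logits))
```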

Training our CNN by minimizing the normal multi-class loss (1) will not work here, for two reasons. First, if each button is a disjoint class, then the loss will encourage the CNN to treat each button as mutually exclusive of the others, discouraging chords. Second, if we consider each button-combination as a class, the network will not be able to predict classes that weren't present in the training set.

The Losses section in our Architecture description details how our multi-label loss and multi-label-multi-choice loss are designed to cope with these problems.

Challenge: Many-to-One
This is a problem that we did not anticipate, and it may surprise readers who have not analyzed interaction logs. In most games, there are times when pressing two different buttons (or button-combinations) produces the same in-game result. We refer to these situations as functionally equivalent, or as many-to-one situations. Figure 4 depicts the confusion statistics of our two sample games.

This challenge is the obstacle that most affects human performance on the video-to-log task. We consider a multi-button-combination as a single class label, for simplicity.

Many-to-one happens when many classes generate one output. For example, when the game is unresponsive, no matter which keys are pressed, the subsequent frames are the same. To account for video examples with more than one associated label, we propose the multi-label-multi-choice loss to tackle the problem. The result is analyzed in the Result: Many-to-One section.

ARCHITECTURE
Convolutional Neural Networks (CNNs) have proved successful in a wide range of applications, especially on computer vision tasks such as image classification. We devise a new CNN architecture for transcribing user input logs from videos of gameplay, by combining existing CNN components and carefully choosing appropriate training procedures to tackle gameplay video challenges. Moreover, a new loss function, the multi-label-multi-choice loss, is proposed to tackle the many-to-one challenges, described in the Challenge: Many-to-One

Figure 4. Functionally equivalent button-combinations for Tetris, shown in (a), and Mega Man X, shown in (b), are visualized through confusion matrices. Each numerical entry indicates how many times the button-combination of that row produces the same visual output as another button-combination (column).


section. The network is implemented in TensorFlow [1]. The dataset and the code can be found on the project page.1 It is worth noting here that all the network settings (the number of convolutional and fully connected layers, as well as the dropout ratio and learning rate) were chosen empirically, through a number of experiments which are omitted from this paper.

The Network
Our DeepLogger network is composed of five 3D convolutional layers, each of which has rectified linear units (ReLU) as activation functions. The network then has five fully connected layers, with dropout between layers. Table 2 shows the structure of our proposed CNN.

Layer     Kernel dimensions               Number of kernels
Input     input dimensions = D×1×W×H
Conv1     3×5×5                           24
Conv2     3×5×5                           36
Conv3     3×5×5                           48
Conv4     3×3×3                           64
Conv5     3×3×3                           64
Dense6    output dimensions = 1164
Dropout   keep = 0.8
Dense7    output dimensions = 100
Dropout   keep = 0.8
Dense8    output dimensions = 50
Dropout   keep = 0.8
Dense9    output dimensions = 30
Dropout   keep = 0.8
Output    output dimensions = C

Table 2. Diagram of the proposed DeepLogger network, where D is the number of frames per clip: 21 frames for Tetris and 11 frames for Mega Man X. W×H are the image dimensions of the gameplay videos, which are the default screen dimensions of the NES, 256×224, for Tetris, and the SNES, 586×448, for Mega Man X. C is the number of buttons: 8 buttons for Tetris and 12 buttons for Mega Man X.

3D convolution layers
In this work, we treat a video as a temporal stack of 2D images (frames). Before passing each frame to the network, the RGB frame is converted to grayscale. Whereas 2D convolution layers collapse the whole temporal block at once, without considering the order of, or relations between, consecutive frames, 3D convolution layers slide over smaller temporal blocks. In Table 2, the "Conv" layers are 3D convolution layers. The temporal dimension of each kernel is set to 3, which means the network considers temporal relations among every three consecutive frames of the given temporal block.
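The temporal behavior of a 3D convolution can be illustrated along the frame axis alone: with a temporal kernel size of 3, each output value mixes exactly three consecutive frames (toy NumPy sketch with scalar "frames", ignoring the spatial dimensions and learned weights):

```python
import numpy as np

def temporal_conv(frames, kernel):
    """Valid 1-D convolution along the frame axis: each output value
    mixes exactly len(kernel) consecutive frames."""
    k = len(kernel)
    return np.array([np.dot(frames[i:i + k], kernel)
                     for i in range(len(frames) - k + 1)])

frames = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # five scalar "frames"
kernel = np.array([1.0, 1.0, 1.0]) / 3.0      # temporal kernel of size 3
print(temporal_conv(frames, kernel))          # [1. 2. 3.]
```

A 2D layer, by contrast, would mix all D frames in a single step (as input channels), with no sliding temporal window.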

Training
Since user input log data is extremely imbalanced, as shown in Figure 3, we train the network by over-sampling the non-majority classes. The effect is that in each mini-batch, the non-majority classes (button-combinations) have numbers of samples comparable to the majority class.
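A minimal sketch of this balancing scheme (assumptions: uniform cycling over classes and sampling with replacement; the actual batching code may differ):

```python
import random

def balanced_batch(examples_by_class, batch_size, rng=random):
    """Draw a mini-batch with roughly equal counts per button-combination
    class, sampling with replacement so rare classes can fill their share."""
    classes = sorted(examples_by_class)
    return [rng.choice(examples_by_class[classes[i % len(classes)]])
            for i in range(batch_size)]

# Hypothetical per-class example pools, imbalanced like Figure 3.
data = {"Idle": list(range(900)), "Down": list(range(60)), "Left": list(range(40))}
batch = balanced_batch(data, batch_size=16)
print(len(batch))  # 16
```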

We chose ±5 and ±10 consecutive frames per short clip, as one data point for Mega Man X and Tetris respectively. The network is trained with a mini-batch scheme, using batch size 16, for 50 epochs. We use a fixed learning rate of 1e-5 for both games.

1http://visual.cs.ucl.ac.uk/pubs/DeepLogger/

We compare and discuss the imbalanced training procedure against the normal training procedure, on both games, in the Experiments section.

Losses
As discussed in Challenge: Multiple Control Buttons Per Record, multiple input buttons can be pressed at the same time, and pressing one button is independent of pressing another. We frame this as a multi-label problem, where each class can occur independently. To train a multi-label NN classifier, the sigmoid cross-entropy loss is used:

H_{Y′}(Y) := −∑_q Y′_q log(σ(Y_q)),   (3)

where σ(t) = 1/(1 + e^{−t}) is the sigmoid function, applied to the output of the network.
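A NumPy sketch of Eq. (3) exactly as printed (note that the standard sigmoid cross-entropy would also include a (1 − Y′_q) log(1 − σ(Y_q)) term for unpressed buttons; in practice one would use TensorFlow's built-in op):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def multilabel_loss(y_true, logits):
    """Eq. (3) as printed: -sum_q Y'_q * log(sigmoid(Y_q)).
    Each button is treated as an independent binary label."""
    return -np.sum(y_true * np.log(sigmoid(logits)))

y_true = np.array([1.0, 0.0, 1.0])  # buttons 0 and 2 pressed
logits = np.array([2.0, -1.0, 0.5])
print(multilabel_loss(y_true, logits))
```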

However, an extremely challenging characteristic of gameplay video, the many-to-one issue, is not yet addressed; and from our inspection of the training data, these situations happen often, as shown in Figure 4.

To tackle the issue, we modify the sigmoid cross-entropy loss to consider multiple label choices. When more than one class (button-combination) generates the same visual output as another class, a functionally equivalent class, the loss should not penalize those classes even when they are not the ground-truth label.

To make our network aware of this, we generate new ground truth by replaying the emulator with the ground-truth input logs and checking which button-combinations are functionally equivalent to the ground-truth label. By doing that, the new ground-truth label for each example becomes a set of labels. We call these multi-choice labels.

The modified loss is designed to consider each ground-truth label as a set. It then looks for the best label from that label set, and computes the multi-label loss against that label. We name this loss the multi-label-multi-choice loss. It is mathematically defined as

H_{𝒴′}(Y) := min_{Y′ ∈ 𝒴′} ( −∑_q Y′_q log(σ(Y_q)) ),   (4)

where 𝒴′_i is the multi-choice ground-truth label set of example i.
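Eq. (4) can be sketched as a minimum over the multi-choice label set (NumPy, with hypothetical labels; the button assignments in the comments are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def multilabel_loss(y_true, logits):
    return -np.sum(y_true * np.log(sigmoid(logits)))

def multichoice_loss(label_set, logits):
    """Eq. (4): penalize only against the closest functionally
    equivalent label, so equivalent chords are not punished."""
    return min(multilabel_loss(y, logits) for y in label_set)

logits = np.array([3.0, -2.0, 3.0])
# Two hypothetical button-combinations with identical visual output:
label_set = [np.array([1.0, 0.0, 1.0]),   # e.g. Right + Shoot
             np.array([1.0, 1.0, 1.0])]   # e.g. Right + X + Shoot
print(multichoice_loss(label_set, logits))
```

Because the minimum picks whichever equivalent label the network's output is closest to, a prediction matching any member of the set incurs a small loss.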

EXPERIMENTS AND RESULTS
In this section, we discuss the experiments used to validate the different components of the network, as well as the whole system's performance. Table 3 reports the key figures from the CNN experiments. DeepLogger is our proposed network; DeepLogger2D is a modified version of our proposed network, used to validate the benefit of 3D filters in the convolutional layers; and VGGNet [28] is a baseline CNN, chosen for its broad acceptance in the computer vision community.

Data
We collected the datasets for both Tetris and Mega Man X using the BizHawk Emulator, version 2.1.1. Gamers for each


Tetris                     DeepLogger       DeepLogger2D     VGGNet [28]
Single-label Accuracy      0.7885 (0.7735)  0.5712 (0.6558)  0.6386 (0.6558)
Multi-label Accuracy       0.7911 (0.7735)  0.5734 (0.6558)  0.6298 (0.6553)
F1-score (Example-based)   0.8882 (0.8754)  0.7675 (0.7921)  0.7841 (0.7921)
F1-score (Label-based)     0.5691 (0.1865)  0.0874 (0.0000)  0.0391 (0.0000)

Mega Man X                 DeepLogger       DeepLogger2D     VGGNet [28]
Subset Accuracy            0.5356 (0.6014)  0.4126 (0.5088)  0.2730 (0.4480)
Multi-label Accuracy       0.7060 (0.7459)  0.6135 (0.6901)  0.5151 (0.6400)
F1-score (Example-based)   0.8325 (0.8587)  0.7740 (0.8278)  0.7179 (0.7970)
F1-score (Label-based)     0.5363 (0.4352)  0.3578 (0.3149)  0.2866 (0.2616)

Table 3. Performance of three CNNs, DeepLogger, DeepLogger2D, and VGGNet [28], on Tetris and Mega Man X. Figures in parentheses indicate the performance of each network when trained without over-sampling the non-majority classes.

game were recruited to play the game multiple times. For Mega Man X, we use one playthrough recorded from one gamer as our training data, and another recorded playthrough from the same gamer as the test data. For Tetris, the network was trained on twenty gameplay recordings of one level from one gamer; the test data comprises gameplay recordings of three different levels and two additional gamers. BizHawk records the user's control inputs at each time step of each gameplay session, in a log-file which can be played back later. We use this user input log, Y_i, and the associated screenshot image, im_i, to construct the training data for our model. One data point is a clip of 2n+1 frames, x_i = {im_{i−n}, im_{i−n+1}, ..., im_i, ..., im_{i+n−1}, im_{i+n}}, where n = 5 for Mega Man X and n = 10 for Tetris.
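Clip construction can be sketched as follows (a hypothetical frame list stands in for the grayscale images):

```python
def make_clip(frames, i, n):
    """Build the (2n+1)-frame data point x_i centered on frame i;
    only frames with full +/-n temporal context are usable."""
    if i - n < 0 or i + n >= len(frames):
        raise IndexError("frame %d lacks full +/-%d context" % (i, n))
    return frames[i - n : i + n + 1]

frames = list(range(100))     # stand-ins for grayscale frames im_0 .. im_99
clip = make_clip(frames, i=50, n=5)
print(len(clip))              # 11 frames, as used for Mega Man X (n = 5)
print(clip[len(clip) // 2])   # 50: the middle frame carries the label
```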

The data is split into an 80% training set, a 5% validation set, and a 15% test set, which amounts to 316,848 training examples for Tetris and 436,203 training examples for Mega Man X.

Performance Metrics
We benchmark the performance of our system over a range of different metrics [34]. All metrics and their characteristics are described below.

Single-label (Subset) Accuracy is a performance metric that captures the fraction of perfectly correct predictions. This performance metric only counts as correct the predictions where Y is identical to the ground truth for all buttons/controls. It is equivalent to accuracy in the single-label problem.

Multi-label Accuracy evaluates the fraction of correctly classified labels in the multi-label setting. It returns 1 if the predicted set of input buttons is identical to the ground truth, similar to the subset accuracy above, but it returns a non-zero number if the predicted result is partly correct, i.e. if some of the input buttons match. This metric allows a more fine-grained performance measurement of the classifiers, because it gives partial credit.

Example-based F1-Score measures the harmonic mean of Precision and Recall, where it first evaluates the performance of each example separately, and then returns the average value across the test set. Precision and Recall are complementary performance measurements to Accuracy, because they take into account class imbalance, via false-positive and false-negative statistics.

Label-based F1-Score computes an F1-score by first evaluating the performance of each class label (input button) separately, and then returning the average value across all class labels, where every class label is given equal weight (macro average).

Although the Example-based F1-Score is good for being sensitive to imbalanced data, it can be misled by very large numbers of test examples. The Label-based F1-Score, instead, focuses on measuring the performance of predicting individual class labels for each input button.
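The four metrics can be sketched for binary label matrices (rows are examples, columns are buttons); the multi-label accuracy below uses the common Jaccard-overlap definition, so treat the details as illustrative rather than the exact formulas of [34]:

```python
import numpy as np

def subset_accuracy(Y, P):
    """Fraction of examples where every button matches exactly."""
    return float(np.mean(np.all(Y == P, axis=1)))

def multilabel_accuracy(Y, P):
    """Per-example overlap (Jaccard) of pressed buttons, averaged."""
    inter = np.sum((Y == 1) & (P == 1), axis=1)
    union = np.sum((Y == 1) | (P == 1), axis=1)
    return float(np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1))))

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def example_based_f1(Y, P):
    """F1 per example, then averaged over the test set."""
    return float(np.mean([f1(np.sum((y == 1) & (p == 1)),
                             np.sum((y == 0) & (p == 1)),
                             np.sum((y == 1) & (p == 0)))
                          for y, p in zip(Y, P)]))

def label_based_f1(Y, P):
    """F1 per button column, macro-averaged with equal weight."""
    return float(np.mean([f1(np.sum((Y[:, q] == 1) & (P[:, q] == 1)),
                             np.sum((Y[:, q] == 0) & (P[:, q] == 1)),
                             np.sum((Y[:, q] == 1) & (P[:, q] == 0)))
                          for q in range(Y.shape[1])]))

Y = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]])  # toy ground-truth logs
P = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])  # toy predictions
print(subset_accuracy(Y, P))  # only the first row matches exactly
```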

Result: Overall Results
Table 3 presents the performance of the DeepLogger network on Tetris and Mega Man X. It is clear that DeepLogger has better performance across all four criteria, at least as compared to the 2D alternate version of DeepLogger, and against VGGNet [28], which is a standard architecture used in thousands of computer vision tasks. We touch on the numerical results below, with respect to the different challenges identified at the outset.

Figure 5 and Figure 6 present a detailed analysis of how the network performs on per-button and per-class (button-combination) predictions. For Tetris, the per-button accuracy chart (a) suggests that the classes rotate block clockwise and rotate block counter-clockwise are harder for the network to distinguish than the classes which translate the block. This leads to poor performance on button-combination predictions where both a direction and a rotation button are pressed. For Mega Man X, the per-button accuracy chart (b) indicates that the network always made mistakes at predicting button X, because this button does not map to any action. The network also got relatively low scores on button L, because L and R slide the inventory window left and right respectively, and there is no visual distinction between the two.

Figure 5. Detailed analysis of our DeepLogger network's performance on Tetris for per-button and per-class prediction. (a) shows the performance of per-button prediction and (b) shows the performance of per-class prediction. In both cases, rotation is the hardest to recognize. Small red X's on the per-button accuracy chart indicate there is no ground truth label for those buttons (Up and Select).

Result: Class Imbalance
We trained three different CNN architectures with and without over-sampling the non-majority classes, to balance the number of examples from each class of button-combinations. The results are listed in Table 3, where results after training with the over-sampling scheme are shown first, and results after training without it are shown in parentheses. From the table, we see that for Mega Man X, training with the over-sampling scheme can cause a slight performance drop on three metrics: single-label accuracy, multi-label accuracy, and example-based F1-score. However, the over-sampling scheme always improves the label-based F1-score.
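The paper does not spell out its exact resampling procedure, but a simple version of over-sampling the non-majority button-combination classes could look like this (a sketch under that assumption; `examples` and `labels` are hypothetical parallel lists):

```python
import random
from collections import defaultdict

def oversample(examples, labels):
    """Repeat examples of minority button-combination classes until every
    class has as many examples as the majority class."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[tuple(y)].append((x, y))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap to the majority count.
        balanced.extend(random.choices(group, k=target - len(group)))
    random.shuffle(balanced)
    return balanced
```

In practice one would apply this only to the training split, so that the test set keeps the game's natural class distribution.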

Result: 2D vs 3D Filters
This experiment demonstrates the benefit of using 3D filters for predicting an action from a sequence of images. Table 3 shows the performance of two identical networks, where one uses 3D convolutional layers (DeepLogger) and the other uses 2D convolutional layers (DeepLogger2D). Note that VGGNet [28] also uses 2D convolutional layers.

For both games, we can clearly see that for predicting user inputs from gameplay video, 3D convolutions significantly improve the performance of the CNNs across all metrics.
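To see why temporal filters help, consider a minimal single-channel "valid" 3D convolution (a toy NumPy sketch, not the authors' architecture): because the kernel spans several frames, it can respond to motion that no per-frame 2D filter can detect.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Minimal single-channel 'valid' 3D convolution (cross-correlation,
    as in CNN libraries) over a (T, H, W) clip.  The kernel spans time as
    well as space, so changes between frames affect the output."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```

For example, a kernel whose two temporal slices are +1 and -1 acts as a frame-difference detector, firing wherever a pixel changed between consecutive frames.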

Result: Many-to-One
Figure 7 shows the improvement in score when training with multi-label-multi-choice loss, compared to training with ordinary sigmoid cross-entropy loss.
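For reference, ordinary multi-label sigmoid cross-entropy, and one plausible way to exploit a set of equivalent labels, can be sketched as follows. The min-over-choices formulation is our illustrative assumption, not necessarily the paper's exact loss:

```python
import numpy as np

def sigmoid_bce(logits, labels):
    """Ordinary multi-label sigmoid cross-entropy, summed over buttons."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # numerical safety for log(0)
    return -np.sum(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

def multi_choice_loss(logits, label_choices):
    """Illustrative multi-label-multi-choice loss (an assumption): the
    network is penalized only against the closest functionally
    equivalent button-combination, so it is free to predict any of them."""
    return min(sigmoid_bce(logits, np.asarray(c)) for c in label_choices)
```

The intuition is that if pressing A or pressing A+X produce the same frames, the network should not be punished for predicting either one.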

To generate a multi-choice label from a normal single-choice input log, we rely here (only) on an emulator. We replay the recorded user input logs through the emulator, where at each subsequent time step, we try all possible button-combinations, looking for ones that generate the same visual output as the original single label. These functionally equivalent (at least in the short term) extra combinations are added to the list of multi-choice labels, supplementing the original ground truth.
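The search over button-combinations can be sketched generically; here `step` stands in for a deterministic emulator-step function (a hypothetical interface, since real emulator bindings vary):

```python
from itertools import product

def equivalent_labels(step, state, true_buttons, all_buttons):
    """Return every button-combination that yields the same next frame as
    the logged combination.  `step(state, combo)` is assumed to be a
    deterministic emulator step returning the resulting game state."""
    target = step(state, true_buttons)
    choices = []
    for combo in product((0, 1), repeat=len(all_buttons)):
        if step(state, combo) == target:
            choices.append(combo)
    return choices
```

Exhaustively trying all 2^B combinations per time step is what makes this expensive for real controllers, but each time step is independent, which is why the process parallelizes trivially.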

Figure 6. Detailed analysis of our DeepLogger network's performance on Mega Man X for per-button and per-class prediction. (a) shows the performance of per-button prediction and (b) shows the performance of per-class prediction. Binary labels in (b) along the horizontal axis represent pressing (1) and not pressing (0) each game controller button, and the binary codes preserve the button order from (a). Red X's on the per-button accuracy chart indicate there is no ground truth label for those buttons (R and Select).

Figure 7. Comparing loss functions. This bar chart compares two DeepLogger networks using four performance metrics. For the red bars, the network was trained with normal multi-label loss; for the blue bars, the network was first trained with normal multi-label loss on 90% of the training data, and then fine-tuned with multi-label-multi-choice loss on the remaining 10%. The chart is generated from the Tetris gameplay dataset. Under each measure, fine-tuning with multi-label-multi-choice loss yielded better scores.

This process takes substantial time per gameplay log-file, so Figure 7 only reports the performance when we fine-tune the network with multi-label-multi-choice loss on 10% of the training data. The process is trivially parallelizable if needed.


Tetris                   Base     Diff 1 level   Diff 2 levels   Diff Gamers   Diff Encoder
Single-label Accuracy    0.7885   0.7251         0.7988          0.7228        0.7633
Multi-Label Accuracy     0.7911   0.7269         0.8017          0.7235        0.7649
F1-score Example-based   0.8882   0.8489         0.8870          0.8441        0.8743
F1-score Label-based     0.5691   0.4529         0.5504          0.4137        0.6690

Table 4. Generalization evaluation. Base is the performance of the system when testing on gameplay video from the same level, gamer, and encoder. Diff 1 level shows the performance when testing on gameplay videos from a level that differs from the training data by 1 level; Diff 2 levels tests the same thing, but with a difference of 2 levels. Diff Gamers shows the performance when testing on gameplay videos from different gamers. Diff Encoder shows the performance when testing on gameplay videos downloaded from the video-sharing site YouTube, where compression and frame-rate changes can occur.

Result: Generalization
Table 4 shows experiments where the network is validated on further challenging tasks: training on one person's gameplay videos and testing on another person's; training on one level and testing on different levels; and training with locally collected video, then testing on videos that were encoded by YouTube and downloaded back again, as if scraped.

Result: CNN Embedding
This secondary benefit of our system opens some interesting opportunities. Figure 8 visualizes what is learned by the network. We can see that the network learns to ignore, or at least tolerate, the background of the screenshots, and focuses on the main character's action.

While the estimated logs are the main focus of this paper, the neural network's ability to embed “similar” frames near each other in feature space is valuable for other kinds of analytics. For example, a researcher could search, across all recorded gameplay of one or many users, for situations where the game takes a certain turn, or the player performs a certain chain of actions. These would correspond to points in the CNN's abstract feature space, and could be found more reliably than by, say, using the absolute time elapsed since the start of a level, because different players move at different speeds. Also, the same game played across different devices may produce different logs, but the appearance of interesting in-game situations will be comparatively consistent.
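Such a search reduces to a nearest-neighbor query in the embedding space (a minimal sketch; the feature vectors would come from a penultimate layer such as the network's "Dense9" activations):

```python
import numpy as np

def find_similar_clips(query_feat, clip_feats, k=3):
    """Return the indices of the k clips whose embedding features lie
    closest (Euclidean distance) to the query clip, plus all distances."""
    dists = np.linalg.norm(np.asarray(clip_feats) - np.asarray(query_feat), axis=1)
    return list(np.argsort(dists)[:k]), dists
```

The three best matches returned this way correspond to the Min1, Min2, and Min3 clips shown in Figure 8.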

DISCUSSION AND CONCLUSION
We have shown that it is possible to extract UI log information from gameplay videos. While the accuracy of our system still has room for improvement, compared with the 100% one would get from using sniffer software, our system could be applied to the millions of users' videos that are generated each month. Preserving users' privacy while collecting broad-scale usage statistics has a value that is presently hard to measure. Additionally, the CNN embedding gives researchers new opportunities to compare gameplay across game levels and across players. This is similar to the opportunity DeepLogger may present for AI researchers who want to train game-playing AIs without relying purely on self-play.

Our proposed network, along with the learning algorithm that finds functionally equivalent user commands, outperforms the baseline networks in Table 3. Presently, it performs on par with what humans can do when focusing their attention, as shown in Table 1. It seems from the Mega Man X experiments that human performance is not a ceiling on accuracy or F1 scores (people are slightly worse on Mega Man X), but further machine learning developments are needed to improve on those metrics.

The current network can only predict user input logs for a game where training data is available. In further work, it would be attractive to learn from many games, to try to generalize across games and increase the pool of available training data.

Because we frame the problem as classification, the system is limited to games with discrete inputs. Further work on transcribing games with continuous inputs is worth exploring, because modern games tend to use continuous inputs such as mouse trajectories, gyroscopes, or gestures.

Last but not least, in future work it would be interesting to extend the network to work with multi-player games. For 2-player games in which each of the two controllers is fixed to a certain part of the screen, such as Super Mario Kart [30], or to a certain character, such as Contra [32], the network might need only a small tweak to work. The more challenging research topic is transcribing games in which the controllers are not fixed to a certain aspect of the game.


Figure 8. The graph visualizes Euclidean distances between embedding features, produced by the "Dense9" layer of the network, for the query clip and other clips from different videos. The Min1, Min2, and Min3 clips are examples of the clips with the smallest distances to the query clip. The Y-axis represents distance and the X-axis represents time steps in the video of the retrieved gameplay.

REFERENCES
1. Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).

2. Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, and others. 2016. DeepMind Lab. arXiv preprint arXiv:1612.03801 (2016).

3. Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2015. The Arcade Learning Environment: An Evaluation Platform for General Agents. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15). AAAI Press, 4148–4152. http://dl.acm.org/citation.cfm?id=2832747.2832830

4. Nadav Bhonker, Shai Rozenberg, and Itay Hubara. 2016. Playing SNES in the Retro Learning Environment. arXiv preprint arXiv:1611.02205 (2016).

5. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. arXiv:1606.01540 (2016).

6. Capcom. 1993. Mega Man X. Game [SNES]. (17 December 1993).

7. Devendra Singh Chaplot and Guillaume Lample. 2017. Arnold: An Autonomous Agent to Play FPS Games. In AAAI. 5085–5086.

8. Yun-Gyung Cheong, Arnav Jhala, Byung-Chull Bae, and Robert Michael Young. 2008. Automatically Generating Summary Visualizations from Game Logs. In AIIDE. 167–172.

9. James Conley, Ed Andros, Priti Chinai, and Elise Lipkowitz. 2003. Use of a game over: Emulation and the video game industry, a white paper. Nw. J. Tech. & Intell. Prop. 2 (2003), 261.

10. Hidemaro Fujibayashi, Eiji Aonuma, Akihito Toda, and Satoru Takizawa. 2017. The Legend of Zelda: Breath of the Wild. Game [Nintendo Switch]. (3 March 2017).

11. Brendan Greene and Chang han Kim. 2017. PlayerUnknown's Battlegrounds. Game [Microsoft Windows]. (23 March 2017).

12. Alexander Groß, Jan Friedland, and Friedhelm Schwenker. 2008. Learning to play Tetris applying reinforcement learning methods. In ESANN. 131–136.

13. Harrison Ho, Varun Ramesh, and Eduardo Torres Montano. 2017. NeuralKart: A Real-Time Mario Kart 64 AI. (2017).

14. Kevin Hughes. 2016. TensorKart: self-driving MarioKart with TensorFlow. Blog. (28 December 2016). https://kevinhughes.ca/blog/tensor-kart Accessed April 13, 2018.

15. Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. 2016. The Malmo Platform for Artificial Intelligence Experimentation. In IJCAI. 4246–4247.


16. Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on. IEEE, 1–8.

17. Guillaume Lample and Devendra Singh Chaplot. 2017. Playing FPS Games with Deep Reinforcement Learning. In AAAI. 2140–2146.

18. Raphaël Marczak, Jasper van Vught, Gareth Schott, and Lennart E. Nacke. 2012. Feedback-based Gameplay Metrics: Measuring Player Experience via Automatic Visual Analysis. In Proceedings of The 8th Australasian Conference on Interactive Entertainment: Playing the System (IE '12). ACM, New York, NY, USA, Article 6, 10 pages. DOI: http://dx.doi.org/10.1145/2336727.2336733

19. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).

20. Shiwali Mohan and John E. Laird. 2009. Learning to play Mario. Tech. Rep. CCA-TR-2009-03 (2009).

21. Lennart Nacke, Craig Lindley, and Sophie Stellmach. 2008. Log who's playing: psychophysiological game analysis made easy through event logging. In Fun and Games. Springer, 150–157.

22. Lennart Nacke, Jonas Schild, and Joerg Niesenhaus. 2010. Gameplay experience testing with playability and usability surveys – An experimental pilot study. In Proceedings of the Fun and Games 2010 Workshop, NHTV Expertise Series, Vol. 10.

23. Alexey Pajitnov and Vladimir Pokhilko. 1984. Tetris. Game [NES]. (6 June 1984).

24. Dino Ratcliffe, Sam Devlin, Udo Kruschwitz, and Luca Citi. 2017. Clyde: A deep reinforcement learning DOOM playing agent. (2017).

25. Val Shute. 2015. Stealth assessment in video games. (2015).

26. Valerie J. Shute and Gregory R. Moore. 2017. Consistency and validity in game-based stealth assessment. Technology Enhanced Innovative Assessment: Development, Modeling, and Scoring from an Interdisciplinary Perspective (2017), 31–51.

27. Valerie J. Shute, Matthew Ventura, and Diego Zapata-Rivera. 2013. Stealth assessment in digital games. (2013).

28. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

29. Brian A. Smith and Shree K. Nayar. 2016. Mining Controller Inputs to Understand Gameplay. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (UIST '16). ACM, New York, NY, USA, 157–168. DOI: http://dx.doi.org/10.1145/2984511.2984543

30. Tadashi Sugiyama and Hideki Konno. 1992. Super Mario Kart. Game [SNES]. (27 August 1992).

31. Amund Tveit and Gisle B. Tveit. 2002. Game Usage Mining: Information Gathering for Knowledge Discovery in Massive Multiplayer Games. In International Conference on Internet Computing. 636–642.

32. Shigeharu Umezaki and Shinji Kitamoto. 1987. Contra. Game [NES]. (20 February 1987).

33. Loutfouz Zaman and I. Scott MacKenzie. 2013. Evaluation of nano-stick, foam buttons, and other input methods for gameplay on touchscreen phones. In International Conference on Multimedia and Human-Computer Interaction (MHCI). 69-1.

34. Min-Ling Zhang and Zhi-Hua Zhou. 2014. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 8 (2014), 1819–1837.