Copyright
by
Qiping Zhang
2021
The Thesis Committee for Qiping Zhang certifies that this is the approved version of the following thesis:
Interactive Learning from Implicit Human Feedback:
The EMPATHIC Framework
SUPERVISING COMMITTEE:
Peter Stone, Supervisor
Scott Niekum
Interactive Learning from Implicit Human Feedback:
The EMPATHIC Framework
by
Qiping Zhang
Thesis
Presented to the Faculty of the Graduate School
of The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science in Computer Science
The University of Texas at Austin
August 2021
To my parents, Mingjia Zhang and Xiaozhou Li.
Acknowledgments
I would first like to express my deepest appreciation to my advisor,
Dr. Peter Stone. Since I started to work with him at the beginning of my
master’s study, Peter has been patiently instructing me and guiding me into
the research world of reinforcement learning and robotics. No matter how busy
he was, Peter always managed to be responsive and available for any problems I
had with my research, participating in all meetings of our project with detailed
feedback every time. Without him, I could not have found my field of interest,
nor resolved to pursue a path of academic research. Meanwhile, I am
extremely grateful for the precious lessons he taught me on how to be a
successful researcher and supervisor, which will be a lifelong treasure in my
future career. Peter has shown me what an ideal advisor should be like, and
I hope that someday I can become as good a mentor as he is.
I would also like to give my sincere thanks to my second reader, Dr.
Scott Niekum. As my co-advisor, Scott has continuously provided insightful
ideas and feedback on our work, which would not have been possible
without his guidance and persistent help. Furthermore, I am extremely grateful
for his strong support and valuable suggestions when I was working on my
graduate school applications.
I owe a deep sense of gratitude to the principal investigator of our lihf
project, Dr. Brad Knox. Under his supervision, we developed the empathic
framework from the very beginning of his conception. This work would not
have been well completed and successfully published without Brad’s instruc-
tions on every detail, including the experiment design and writing. I am also
thankful for his valuable advice on how to do better research in many of our
previous chats, and I wish him the best of luck on his new journey at Google
after he finishes his work at Bosch.
I must thank my collaborator and great friend, Yuchen Cui, who has
been working together with me on this project over the past two years. I
sincerely appreciate all the time she spent discussing the research challenges
we were confronted with, shedding light on the concepts I was not familiar
with, and clearing up all my confusion while composing this work. Thank
you, Yuchen, for being a patient mentor to me as a senior Ph.D. student,
and for helping me improve multiple skills for academic research, such as
writing and formulating ideas. I hope our work together has also been a fun
experience for you, and that it makes a nice contribution to your dissertation.
I am also very thankful to Dr. Alessandro Allievi for his continuous help
with our research, for both his advice on the project and his timely support
whenever any resource was needed for the experiments (I still remember his
phone call on Christmas day when we urgently needed access to an account).
In addition, I would like to express my gratitude to my dear friends.
Bo Liu and Yifeng Zhu have always been my closest labmates to talk with,
especially when I got a bit lost or felt too stressed about my research. Also,
special thanks to Zhihong Luo for his guidance on conducting research, apply-
ing for graduate schools, and finding suitable advisors, together with the joy
and encouragement he brought.
I have hugely enjoyed the two years I spent in the Learning Agents Re-
search Group (LARG) during my master’s study. It is my genuine pleasure to
know all these great friends, who are all nice and smart people: Yunshu Du, Is-
han Durugkar, Justin Hart, Yuqian Jiang, Haresh Karnan, Sai Kiran, William
Macke, Reuth Mirsky, Brahma Pavse, Faraz Torabi, Xuesu Xiao, Harel Yedid-
sion, and Ruohan Zhang (in alphabetical order). Also, many thanks to the
other friends I met at UT: Suna Guo, Zhaoyuan He, Sahil Jain, Jordan Schnei-
der, Siming Yan, Chenxi Yang, and Zaiwei Zhang.
Finally, deepest thanks to my family, especially my parents Mingjia
Zhang and Xiaozhou Li, and my grandma Yuxin Miao, for their love and
support over these years that kept me going.
Interactive Learning from Implicit Human Feedback:
The EMPATHIC Framework
Qiping Zhang, M.S.Comp.Sci.
The University of Texas at Austin, 2021
SUPERVISOR: Peter Stone
Reactions such as gestures and facial expressions are an abundant, nat-
ural source of signal emitted by humans during interactions. An autonomous
agent could leverage an interpretation of such implicit human feedback to
improve its task performance at no cost to the human, contrasting with tradi-
tional agent teaching methods based on demonstrations or other intentionally
provided guidance. In this thesis, we first define the general problem of learning
from implicit human feedback, and propose a data-driven framework named
empathic as a solution, which includes two stages: (1) mapping implicit hu-
man feedback to corresponding task statistics; and (2) learning a task with the
constructed mapping. We first collect a human facial reaction dataset while
participants observe an agent execute a sub-optimal policy for a prescribed
training task. Then, we train a neural network to instantiate and demonstrate
the ability of the empathic framework to (1) infer reward ranking of events
from offline human reaction data in the training task; (2) improve the online
agent policy with live human reactions as they observe the training task; and
(3) generalize to a novel domain in which robot manipulation trajectories are
evaluated by the learned reaction mappings.
Table of Contents
Acknowledgments v
Abstract viii
List of Tables xi
List of Figures xii
Chapter 1. Introduction 1
1.1 Research Questions and Contributions . . . . . . . . . . . . . . 2
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2. Background 5
2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 5
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Interactive Reinforcement Learning . . . . . . . . . . . . 6
2.2.2 Facial Expression Recognition (FER) . . . . . . . . . . . 7
2.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 3. The LIHF Problem and The EMPATHIC Framework 9
3.1 The LIHF Problem . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 The EMPATHIC Framework . . . . . . . . . . . . . . . . . . . 11
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 4. Instantiation of the EMPATHIC Framework 15
4.1 Data Collection Domains and Protocol . . . . . . . . . . . . . 15
4.1.1 Robotaxi . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.2 Robotic Sorting Task . . . . . . . . . . . . . . . . . . . 16
4.1.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Reward Mapping Design . . . . . . . . . . . . . . . . . . . . . 18
4.2.1 Human Exploration of the Data . . . . . . . . . . . . . 18
4.2.2 Reaction Mapping Architecture . . . . . . . . . . . . . . 20
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Chapter 5. Reaction Mapping Results and Evaluation 25
5.1 Reward-ranking Performance in the Robotaxi Domain . . . . . 26
5.1.1 Learning Outcome of the Reaction Mapping . . . . . . . 27
5.1.2 Ablation Study for Predictive Model Design . . . . . . . 28
5.1.3 Preliminary Modeling of Other Task Statistics . . . . . 32
5.1.4 Effects of Different Belief Priors . . . . . . . . . . . . . 33
5.2 Online Learning in the Robotaxi Domain . . . . . . . . . . . . 37
5.3 Trajectory Ranking in Robotic Sorting Domain . . . . . . . . . 39
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 6. Conclusion and Future Work 43
Appendices 45
Appendix A. Supplemental Material 46
A.1 Experimental Domains and Data Collection Details . . . . . . 46
A.1.1 Robotaxi . . . . . . . . . . . . . . . . . . . . . . . . . . 46
A.1.2 Robotic Sorting Task . . . . . . . . . . . . . . . . . . . 48
A.1.3 Experimental Design . . . . . . . . . . . . . . . . . . . . 50
A.1.4 Participant Recruitment . . . . . . . . . . . . . . . . . . 52
A.2 Annotations of Human Reactions . . . . . . . . . . . . . . . . 53
A.2.1 The Annotation Interface . . . . . . . . . . . . . . . . . 53
A.2.2 Visualizations of Annotated Data . . . . . . . . . . . . . 56
A.3 Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
A.3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 57
A.3.2 Data Split of k-fold Cross Validation for Random Search 57
A.3.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 59
List of Tables
4.1 Human proxy test result: average τ values across participants are shown; a baseline that randomly picks rankings has an expected τ value of 0. . . . . . . 20
List of Figures
3.1 Graphical model for lihf (colored boxes and their identically colored labels represent conditional probability tables) . . . . . 10
3.2 Overview of empathic . . . . . . . . . . . . . . . . . . . . . . 12
4.1 Robotaxi environment . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Robotic sorting task . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Human proxy’s view: semantics are hidden with color masks; the dark green circle is the agent; the observer’s reaction is displayed; the detected face is enlarged; the background is colored by the last pickup. The left frame is the same game state shown in Fig. 4.1. . . . 20
4.4 The feature extraction pipeline and architecture of the reaction mapping . . . . . . . 21
5.1 Sorted per-subject Kendall’s τ for Robotaxi reward-ranking task 26
5.2 Loss profiles for training final models for reward-ranking evaluation . . . . . . . 29
5.3 Sorted per-subject Kendall’s τ for Robotaxi reward-ranking task on the human proxy test episodes . . . . . . . 30
5.4 Loss profiles for training different models (each model has its own set of best hyperparameters found through random search, except the end-to-end model) . . . . . . . 31
5.5 Loss profiles for training with other task statistics . . . . . . . 34
5.6 Performance of reward inference starting from different priors over reward rankings . . . . . . . 35
5.7 Trials of informal online learning sessions in Robotaxi . . . . . 38
5.8 Sample plot of trajectory positivity score over aggregated frames . . . . . . . 40
5.9 Sorted per-subject Kendall’s τ for evaluating robotic sorting trajectories . . . . . . . 40
5.10 Overall trajectory ranking by mean positivity score across subjects (each entry is colored by the trajectory’s final return: green for +2, yellow for 0, and red for −1) . . . . . . . 41
A.1 Robotaxi environment . . . . . . . . . . . . . . . . . . . . . . . 48
A.2 Robotic task table layout with object labels, from the perspective of the human observer . . . . . . . 48
A.3 Robotic sorting task trajectories with optimality segmentation 49
A.4 View of the annotation interface. The corresponding trajectory of Robotaxi is not displayed . . . . . . . 54
A.5 Proportion of annotated gestures . . . . . . . . . . . . . . . . 54
A.6 Histograms of non-zero reward events around feature onset . . 55
A.7 Diagram of data split for subject k . . . . . . . . . . . . . . . 58
Chapter 1
Introduction
When human observers are interested in the outcome of an agent’s
behavior, they often provide reactions that are not necessarily intended to
communicate to the agent. However, information carried by such reactions
about the perceived quality of the agent’s performance could instead be used
to improve task learning, provided the agent can interpret it effectively.
Furthermore, since humans generate such reactions instinctively, learning
from implicit human feedback places no extra burden on the users.
In this work, learning from implicit human feedback (lihf) is con-
sidered complementary to learning from explicit human teaching, which in-
cludes demonstrations [4], evaluative feedback [22, 23], or other communica-
tive modalities [2, 8, 11, 24, 36]. Though in most scenarios implicit feedback is
less informative and leads to more interpretation challenges than explicit sig-
nals, lihf induces no additional cost to human users by making use of existing
reactions.
In this thesis, we define the lihf problem, propose a general data-
driven framework to solve it, and implement and validate an instantiation of
this framework using facial expressions and head poses (referred to jointly as
facial reactions) as the modalities of human reactions.
1.1 Research Questions and Contributions
While previous computer vision research has achieved great success on
basic facial expression recognition tasks [14, 15, 27], it is not trivial
for a learning agent to interpret human expressions. Major challenges exist in
the following aspects:
1. The same facial expression can be interpreted in various ways, leading
an agent to distinct learning behaviors. For instance, a smile could mean
satisfaction, encouragement, amusement, or frustration [16]. As suggested
by recent cognitive science research [9, 12, 18, 33], facial expressions
also play an important role in regulating social interactions and
signaling contingent social action. Hence, different contexts and individuals
will also lead to different interpretations of facial expressions.
2. There is generally a variable delay between an event and the human
reaction it induces. Reactions can also occur in advance, when a human
anticipates an event. Both cases make it harder to interpret which event
induced the person’s reactions.
3. Natural human reactions can contain spontaneous micro-expressions
consisting of minor facial muscle movements lasting less than 500
milliseconds [34, 44], which can be hard to detect for computer vision
systems trained on common datasets containing only exaggerated facial
expressions [13, 28].
4. In many use cases, the human’s environment contains components other
than the agent and its task environment. It thus becomes more challenging
to infer the focus of a person’s reactions.
Therefore, this thesis aims to answer the following question:
Can an agent learn a task directly from implicit human feed-
back, without requiring explicit signals from humans?
In this thesis, we approach lihf with data-driven modeling that creates
a general reaction mapping from implicit human feedback to task statistics.
In doing so, we make the following contributions:
1. We motivate and frame the general problem of Learning from Implicit
Human Feedback (lihf), with the aim of better exploiting an under-utilized
data modality that already exists in natural human reactions.
2. We propose a broad data-driven framework to solve this problem, called
Evaluative MaPping for Affective Task-learning via Human Implicit Cues
(empathic).
3. We experimentally validate an instantiation of the empathic framework,
using human facial reactions as the data modality, and rewards as the
target task statistic.
1.2 Thesis Outline
The thesis is organized as follows: Chapter 2 goes over the related
work and the background related to Markov Decision Processes. Chapter 3
defines the problem of Learning from Implicit Human Feedback and presents
the formulation of the two-stage empathic framework. Chapter 4 provides the
details of the task domains and the data collection process for instantiating the
framework, and then demonstrates the reaction mapping design, including the
human exploration of the data and the architecture of the reaction mapping
pipeline. Chapter 5 shows the empirical results to validate that the learned
mappings from our instantiation of stage 1 effectively enable task learning in
stage 2, by testing three hypotheses in different settings. Finally, Chapter 6
discusses future work and concludes.
The content of this thesis mainly follows our paper on the empathic
framework published at the Conference on Robot Learning [10], which is
collaborative work with Yuchen Cui, under the supervision of Alessandro Allievi,
Peter Stone, Scott Niekum, and W. Bradley Knox. In this thesis, both of us
have jointly contributed to the conceptual formulation of lihf and empathic,
the human exploration of data, the instantiation pipeline of empathic, and
the online learning in Robotaxi. Yuchen’s contributions cover more of the con-
struction and investigation of trajectory ranking in the robotic sorting domain,
preliminary modeling of other task statistics, and effects of different belief pri-
ors. My work includes more of the design and analysis of the Robotaxi domain,
together with the reward-ranking outcome of the learned reaction mapping.
Chapter 2
Background
This chapter introduces the basic concept of Markov Decision Pro-
cesses, which is used for formally specifying the Learning from Implicit
Human Feedback (lihf) problem, and discusses related work.
2.1 Markov Decision Processes
Markov Decision Processes (MDPs) are often used to model se-
quential decision making of autonomous agents. An MDP is given by the
tuple 〈S,A, T,R, d0, γ〉, where:
• S is a set of states; A is a set of actions an agent can take;
• T : S×A×S → [0, 1] is a probability function describing state transition
based on actions;
• R : S × A× S → R is a real-valued reward function;
• d0 is a starting state distribution and γ ∈ [0, 1) is the discount factor.
Additionally, a policy π : S × A → [0, 1] maps each state and action to the
probability of choosing that action in that state; the policy is not part of
the MDP tuple itself.
An agent aims to find a policy that maximizes the expected return
E[∑_{t=0}^{∞} γ^t r_t], where r_t is the reward at time step t.
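This objective can be illustrated with a short computation (a minimal sketch; the reward sequence and discount factor below are illustrative values, not taken from the thesis):

```python
# Sketch: the discounted return sum_t gamma^t * r_t over a finite
# trajectory. The reward sequence here is an illustrative example.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over the trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A hypothetical 4-step trajectory; the result is close to
# 6 * 0.9 - 1 * 0.9**3 = 4.671.
print(discounted_return([0, 6, 0, -1], gamma=0.9))
```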
2.2 Related Work
Our work relates closely to the growing literature of interactive rein-
forcement learning (RL), or human-centered RL [17, 22, 25, 26, 29, 31, 35, 37,
40, 47] where agents learn from interactions with humans, with or without pre-
defined environmental rewards. In the empathic framework, implicit human
feedback refers to any multi-modal evaluative signals humans naturally emit
during human-robot interactions, such as facial expressions and head gestures,
as well as other vocalization modalities not intended for explicit communica-
tion. Others’ usage of “implicit feedback” has referred to the implied feedback
when a human refrains from giving explicit feedback [20, 30], to human bio-
magnetic signals [43], or to facial expressions [3, 19, 38]. This thesis focuses
on predicting task statistics from human facial features. Hence, this work also
intersects with the research field of facial expression recognition.
2.2.1 Interactive Reinforcement Learning
The tamer framework proposed by Knox et al. [22, 23] is the first to
enable RL agents to learn from explicit human feedback without environmental
rewards, by explicitly modeling human feedback with button clicks, drawing
inspiration from animal clicker-training. Veeriah et al. [39] propose learning
a value function based only on human facial expressions and agent actions,
using manual negative feedback as supervision. However, the policy of the
RL agent depends on the trainer’s facial expression only, without actually
reasoning about the task state. Arakawa et al. [3] detect human emotions
by adopting an existing facial expression recognition system, and then apply
a predefined mapping from emotions to tamer feedback. Nonetheless, they
fail to optimize and make the mapping effective for the downstream task.
Likewise, Zadok et al. [46] improves exploration of an RL agent by biasing
its behavior to increase the probability of smiling for human demonstrators,
which is modeled within the task. Meanwhile, Li et al. [26] show the possibility
of learning only from human facial expressions by generalizing tamer, using
a deep neural network to map the facial expressions of the trainer to positive
or negative feedback.
The approach proposed in this work differs from prior work by learning
a direct mapping from facial reactions to task statistics that does not depend
on any specific state-actions. Without requiring explicit signals throughout the
RL tasks, our work is the first to learn from subjects who were not told to
react or to serve as trainers.
2.2.2 Facial Expression Recognition (FER)
Facial expression recognition is a broad research field that spans multi-
ple areas, including psychology, neuroscience, cognitive science and computer
vision. Fasel and Luettin [15] provide an overview of traditional fer systems
and Li and Deng [27] introduce the latest fer systems based on deep learning.
Instead of conducting fer explicitly, our proposed method maps extracted
facial features to reward values.
Our work is closely related to the problem of dynamic fer, in which
time-series data are taken as inputs for temporal predictions. As discussed
by Li and Deng [27], modern fer systems often consist of two stages: data
pre-processing and predictive modeling with deep neural networks. Given our
small dataset, we employ an existing toolkit, OpenFace 2.0 [5, 6, 45], to
directly extract facial features that are sufficiently informative for modeling
human reactions. Furthermore, we also model the temporal aspect of
our task by extracting facial features in the frequency domain.
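As a rough illustration of frequency-domain feature extraction from a facial-feature time series, consider the sketch below. The window size and the choice of a windowed magnitude spectrum are our own assumptions for illustration, not the thesis's exact pipeline:

```python
import numpy as np

# Sketch: frequency-domain features from a 1-D facial-feature signal
# (e.g., one action-unit intensity over time). The window size and
# the use of rfft are illustrative assumptions, not the actual design.

def frequency_features(signal, window=32):
    """Magnitude spectrum of each non-overlapping window of the signal."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window
    feats = []
    for i in range(n_windows):
        chunk = signal[i * window:(i + 1) * window]
        feats.append(np.abs(np.fft.rfft(chunk)))
    return np.stack(feats)  # shape: (n_windows, window // 2 + 1)

feats = frequency_features(np.sin(np.linspace(0, 20, 128)), window=32)
print(feats.shape)  # (4, 17)
```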
2.2.3 Summary
In this chapter, we established the basic notation and definitions of
Markov Decision Processes, which will be applied and further extended to
specify the lihf problem; described the two major fields relevant to our work;
and surveyed the existing literature in those fields.
Chapter 3
The LIHF Problem and The EMPATHIC
Framework
In this chapter, we formally define the problem of Learning from
Implicit Human Feedback (lihf), following the definitions and notation for
Markov Decision Processes (MDPs) given in the background.
We then describe the structure of the empathic framework, which consists
of two stages and several elements necessary for instantiating it in specific
task domains. This formulation, initially presented in our paper published
at the Conference on Robot Learning [10], leads to our actual instantiation of the
framework and our proposed reaction mapping algorithm, to be discussed in
Chapter 4.
3.1 The LIHF Problem
The problem of Learning from Implicit Human Feedback (lihf)
asks how an agent can learn a task with information derived from human
reactions to its behavior.
We define lihf with the tuple 〈S, A, T, RH, XH, Ξ, d0, γ〉, in which S,
A, T, d0, and γ are defined exactly as in MDPs. The agent receives
observations x ∈ XH containing implicit feedback from some human H.
An observation function Ξ denotes the conditional probability over XH of
observing x, given a trajectory of state-actions and the human’s hidden reward
function RH. Furthermore, lihf states are generally defined more broadly than
task states, including all environmental and human factors that influence the
conditional probability of observing x.
The goal of an agent is to maximize the return under RH. How to
ground observations x ∈ XH containing implicit human feedback to evaluative
task statistics is at the core of solving lihf.
Figure 3.1: Graphical model for lihf (colored boxes and their identically colored labels represent conditional probability tables)
Figure 3.1 shows a graphical model for lihf that illustrates how the
data generation process is modeled. Its formulation resembles that of
Partially Observable MDPs, except that the partially observed variable is the
human's reward function rather than the state.
function RH is a temporally invariant component of the human’s internal state
SH. However, the observation XH containing implicit human feedback to agent
behavior can change over time since it is influenced by the human’s internal
model of the task domain and the agent’s behavior policy at a certain time
step.
Given an observation x ∈ XH, the current state s ∈ S, and the previous
action a ∈ A, the agent constructs its belief b ∈ B as a probabilistic memory
of arbitrary form and scope over the task domain. A belief could include, for
example, the probability distribution over possible reward functions, which the
agent could use to generate a policy (and therefore an action given the current
state) that maximizes expected return (aggregated single-step rewards r ∈ R)
under the unobserved human reward function RH. Note that reward is not
directly dependent on state and action – it is determined by the human entirely
(who can, for generality, internally maintain a history of states and actions,
and therefore can give non-Markovian rewards).
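As a toy illustration of such a belief, the sketch below (our own construction with a hypothetical likelihood function, not a model from the thesis) maintains a discrete distribution over candidate reward rankings of three event types and updates it by Bayes' rule:

```python
import itertools

# Sketch: a discrete belief over reward rankings of three event types,
# updated by Bayes' rule. The event names and the per-observation
# likelihood are illustrative stand-ins, not learned models.

events = ["passenger", "roadblock", "car"]
rankings = list(itertools.permutations(events))  # 6 candidate orderings
belief = {r: 1.0 / len(rankings) for r in rankings}  # uniform prior

def update(belief, likelihood_of):
    """One Bayes update: posterior is proportional to prior * likelihood."""
    posterior = {r: p * likelihood_of(r) for r, p in belief.items()}
    z = sum(posterior.values())
    return {r: p / z for r, p in posterior.items()}

# Toy likelihood: observations favor rankings that place "passenger" first.
belief = update(belief, lambda r: 0.9 if r[0] == "passenger" else 0.1)
best = max(belief, key=belief.get)
print(best[0])  # passenger
```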
3.2 The EMPATHIC Framework
With the mathematical formulation of lihf defined, a data-driven so-
lution to the problem is then proposed to infer relevant task statistics from
human reactions. Fig. 3.2 presents the formulation of the empathic framework,
consisting of two stages: (1) learning a mapping from implicit human
feedback to corresponding task statistics; and (2) using the trained reaction
mappings for task learning.
Figure 3.2: Overview of empathic
In the first stage, human observers are incentivized to want an agent to
succeed – to align the person’s RH with a known task reward function R – and
they are then recorded while watching the agent’s behaviors. Task statistics
are computed from R for every timestep to serve as supervisory labels, which
are used for training a mapping from synchronized human reaction recordings.
Since the state-actions of a task are not inputs, the learned reaction mappings
can be conveniently deployed on other tasks.
In the second stage, a human observes an agent attempt a task with
sparse or no environmental reward, and the human observer’s reactions to its
behavior are input to the mapping trained in the previous stage, to predict
unknown task statistics for improving the agent’s policy, either directly or with
other approaches, such as guiding exploration or inferring the reward function
RH that describes the human’s utility.
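The two-stage control flow can be sketched in Python. Every component below (the simulated observer, the noisy "reaction", the nearest-mean "mapping") is a toy stand-in of our own; only the stage structure mirrors the framework:

```python
import random

# Structural sketch of the two EMPATHIC stages with toy components.
# Stage 1 fits a reaction -> task-statistic mapping from supervised
# data; stage 2 uses that mapping on live reactions alone.

class SimHuman:
    """Toy observer whose 'reaction' is a noisy copy of the reward."""
    def react(self, reward):
        return reward + random.gauss(0, 0.1)

def stage_one(reward_sequence, observers):
    """Stage 1: learn the mapping from labeled (reaction, reward) pairs."""
    by_reward = {}
    for human in observers:
        for r in reward_sequence:
            by_reward.setdefault(r, []).append(human.react(r))
    centers = {r: sum(v) / len(v) for r, v in by_reward.items()}
    # The learned "mapping" predicts the reward with the closest mean reaction.
    return lambda reaction: min(centers, key=lambda r: abs(centers[r] - reaction))

def stage_two(mapping, human, true_rewards):
    """Stage 2: infer unknown task statistics from reactions alone."""
    return [mapping(human.react(r)) for r in true_rewards]

random.seed(0)
mapping = stage_one([+6, -1, -5, 0] * 20, [SimHuman()])
print(stage_two(mapping, SimHuman(), [+6, -5]))  # recovers [6, -5]
```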
Any instantiation of empathic can be achieved by specifying the
following elements:
• the reaction modality and the target task statistic(s);
• the end-user population(s);
• training task(s) for stage 1 and deployment task(s) for stage 2;
• an incentive structure for stage 1 to align human interests with task
performance; and
• policies or RL algorithms to control the observed agent in both stages.
Any specific task or person can be part of either of these two stages.
Note that empathic is defined broadly enough to include instantiations with
varying degrees of personalization – from learning a single reaction mapping
applicable to all humans to training a person-specific model – and of across-
task generalization. Though a single reaction mapping will be generally useful,
we hypothesize that personalized training on certain users or tasks will achieve
even more effective mappings, and may also alleviate the negative effects of
potential dataset bias from the first stage when the mapping is used among
underserved populations.
3.3 Summary
In this chapter, we provided a formal definition of the lihf problem by
establishing the mathematical expressions, together with a graphical model,
which resembles the concept of Partially Observable MDPs. We further pre-
sented an overview of empathic, specifying the two stages and the necessary
elements for achieving any instantiation of the framework. These elements
motivate Chapter 4, which details our instantiation using facial
reactions as the modality of implicit human feedback.
Chapter 4
Instantiation of the EMPATHIC Framework
In this chapter, we describe the details of our instantiation of the em-
pathic framework. We first describe the task domains in our experiments, together
with the data collection protocols. Then, we illustrate the design of our
reaction mapping algorithm, discussing the outcome of data exploration
via a human proxy test and presenting the mathematical formulation of the
reaction mapping architecture.
4.1 Data Collection Domains and Protocol
In this section we describe the experimental domains and data collection
process of our instantiation of empathic.
4.1.1 Robotaxi
We create Robotaxi as a simulated domain to collect implicit human
feedback data with known task statistics, instantiating the training task for
stage 1 of empathic presented in Chapter 3. Fig. 4.1 shows the visualization
viewed by the human observer. An agent (depicted as a yellow bus) acts in
a grid-based map. Rewards are associated with objects: +6 for picking up a
passenger; −1 for crashing into a roadblock; and −5 for crashing into a parked
car. The reward is 0 otherwise. An object disappears after the agent moves onto
it, and another object of the same type is spawned with a short delay at a
random unoccupied location. An episode starts with two objects of each type.
Figure 4.1: Robotaxi environment
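This reward structure can be written as a small lookup table (a sketch; the event names are labels we chose for illustration, not identifiers from the thesis's actual implementation):

```python
# Sketch: Robotaxi's reward structure as a lookup table. The event
# names are illustrative labels, not identifiers from the thesis code.
ROBOTAXI_REWARDS = {
    "pick_up_passenger": +6,
    "crash_into_roadblock": -1,
    "crash_into_parked_car": -5,
}

def reward(event=None):
    """Reward of an event; 0 for any step with no listed event."""
    return ROBOTAXI_REWARDS.get(event, 0)

print(reward("pick_up_passenger"), reward(None))  # 6 0
```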
4.1.2 Robotic Sorting Task
To test whether the learned reaction mapping can transfer
across task domains, a robotic manipulation task was further developed in our
paper published at the Conference on Robot Learning by Cui et al. [10],
instantiating the deployment task for stage 2 of empathic discussed in
Chapter 3. As shown in Fig. 4.2, the robot arm aims to recycle the aluminum
cans into the bin without sorting the other items. Recycling a can yields a
reward of +2, sorting any of the remaining objects yields −1, and the reward
is 0 in all other cases. The short episodes consist of fixed pick-up
trajectories, and each trajectory contains at most one non-zero reward event. Appx.A.1
provides further details of the task domains, especially our instantiation of the
policies and RL algorithms to control the observed agent in both stages, which
is another necessary element for empathic discussed in Chapter 3.
Figure 4.2: Robotic sorting task
4.1.3 Data Collection
Human subjects were recruited to participate in the interactions with
both agents. We informed them that their financial payout would be propor-
tional to the agent’s total reward in the end, after which they started to observe
the agents performing the tasks. Through this payout rule, we align human
interests with task performance and link human reactions to task statistics,
creating a mapping from in-game rewards to real-world value for the
participants; this instantiates the incentive structure for stage 1 of empathic
described in Chapter 3. Furthermore, we told the participants only that their
“reactions are being recorded for research purposes”, without revealing how
we planned to use their facial features, so that explicit teaching from
participants was minimized. Through this experimental design, our work differs
from previous works that involve explicit human training [3, 26, 39], aligning
with the research direction of the lihf problem by leveraging an existing data
modality in human-robot interactions.
Seventeen human participants observed 3 episodes of Robotaxi, and 14 of the participants observed 7 episodes of the robotic trash sorting task. After obtaining each participant's consent, we conducted the user studies in an isolated room, where participants were asked to watch the agents perform predetermined trajectories generated by suboptimal policies while videos of their facial reactions were recorded. At the end of each session, the subjects completed an exit survey. Further details are given in Appx.A.1, including the end-user populations for instantiating empathic discussed in Chapter 3.
4.2 Reward Mapping Design
In this section, we describe the result of a human proxy test for data
exploration, and introduce the architecture of our reaction mapping pipeline.
4.2.1 Human Exploration of the Data
As an initial step after data collection, we perform human exploration
of the data, in which the authors of our paper published at Conference on
Robot Learning [10] serve as proxies for a mapping. As is shown in Fig. 4.3, a
semantically anonymized version of each agent trajectory is created and viewed
by the authors, alongside a synchronized recording of the human participant’s
reactions. Afterwards, each human proxy ranks the reward values of the 3 object types, with one truncated episode from each of the 17 participants watched by all proxies. Kendall's rank correlation coefficient τ ∈ [−1, 1] [1] is then applied to evaluate the quality of our inference against the ground truth, where a higher τ value corresponds to stronger agreement between the two rankings.
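As a concrete illustration of the metric, Kendall's τ can be computed with scipy; the rankings below are hypothetical examples over the 3 Robotaxi object types, not data from our study:

```python
from scipy.stats import kendalltau

# Hypothetical rankings of the 3 Robotaxi object types
# (passenger, roadblock, parked car), encoded as rank positions.
true_ranking = [1, 2, 3]    # ground truth: +6 > -1 > -5
perfect_guess = [1, 2, 3]   # proxy agrees exactly
swapped_guess = [1, 3, 2]   # proxy swaps the two negative events

tau_perfect, _ = kendalltau(true_ranking, perfect_guess)
tau_swapped, _ = kendalltau(true_ranking, swapped_guess)
print(tau_perfect, tau_swapped)  # 1.0 and 1/3: one of three pairs is discordant
```

With 3 items there are 3 pairs, so a single swapped pair gives τ = (2 − 1)/3 = 1/3.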
Table 4.1 shows the mean τ scores for the human proxy test across
17 human participants, with a mean for each author. We also compare each
human proxy’s 17 τ scores with the baseline value τ = 0 for uniformly random
reward ranking, and show the p-values computed with Wilcoxon signed-rank
tests [42]. The results demonstrate that 5 out of 6 human proxies surpassed
uniformly random reward ranking, with 1 author performing significantly bet-
ter, even after adjusting a p < 0.05 threshold for multiple testing to p < 0.0083
using a Bonferroni correction [41]. This person's success indicates that although humans vary in their ability to perceive implicit feedback signals, human facial reactions contain sufficient information for reward ranking.
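The statistical test used above can be sketched as follows; the 17 τ scores are invented for illustration and do not reproduce Table 4.1:

```python
from scipy.stats import wilcoxon

# Hypothetical per-participant tau scores for one human proxy (17 values);
# the Wilcoxon signed-rank test compares their median against tau = 0.
taus = [0.6, 0.3, 0.8, 0.5, -0.2, 0.7, 0.4, 0.9, 0.1, 0.6,
        0.5, 0.3, 0.7, -0.1, 0.8, 0.4, 0.6]
stat, p = wilcoxon(taus)

# Bonferroni correction: with 6 proxies tested, the p < 0.05
# threshold tightens to p < 0.05 / 6 ≈ 0.0083.
print(p < 0.05 / 6)
```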
Meanwhile, based on our experience throughout the proxy test, we iden-
tify 7 common reaction gestures that helped us with reward ranking inference:
smile, pout, eyebrow-raise, eyebrow-frown, (vertical) head nod, head shake,
and eye-roll. We also annotate the video data with frame onsets and offsets
of these 7 gestures, together with the positive, negative, or neutral sentiment
of the gesture, without watching the corresponding trajectories beforehand.
A detailed analysis of the annotation outcomes that provides insight for the
computational modeling is included in Appx.A.2.
Figure 4.3: Human proxy's view: semantics are hidden with color masks; the dark green circle is the agent; the observer's reaction is displayed; the detected face is enlarged; the background is colored by the last pickup. The left frame is the same game state shown in Fig. 4.1.
                 Avg. τ    p-value
                  .569      .004
                  .216      .185
Human             .098      .319
Proxies          −.176      .179
                  .255      .123
                  .294      .059
Avg.              .209      .078
Table 4.1: Human proxy test result: average τ values across participants are shown; a baseline that randomly picks rankings has an expected τ value of 0.
4.2.2 Reaction Mapping Architecture
With the dataset of human facial reactions collected, we would like to
show the possibility of computationally modeling the implicit human feedback.
Figure 4.4: The feature extraction pipeline and architecture of the reaction mapping
To do so, a reaction mapping from temporal series of extracted facial features
(retrieved using a pre-trained toolkit) to the probability of reward categories
is constructed, by training a deep neural network for reward prediction with
supervised learning. Note that the input and output of this neural network
serve as our instantiation of the reaction modality and the target task statistics
for empathic respectively (as is listed in Chapter 3).
Fig. 4.4 shows the feature extraction pipeline and architecture of the
proposed deep network model to construct the reaction mapping. We use
OpenFace 2.0 [5, 6, 45] for facial feature extraction from human reaction videos
recorded at 30 image frames per second. Head pose and activation of facial
action units (FAUs) are extracted from each frame with the toolkit. Further,
we detect head nods and shakes by maintaining a running average of extracted
head-pose features and subtracting it from each incoming feature vector. We
then apply a fast Fourier transform to the residual signal to compute the frequency of head-pose changes, taking the resulting coefficients as head-motion features.
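A minimal sketch of this head-motion pipeline, assuming a synthetic 2 Hz head shake at 30 fps and an exponential running average (the averaging rate and signal parameters are illustrative, not the thesis's values):

```python
import numpy as np

rng = np.random.default_rng(0)
fps = 30
t = np.arange(2 * fps) / fps                  # 2 s of head-pose yaw samples
pose = 0.3 * np.sin(2 * np.pi * 2.0 * t)      # a 2 Hz head shake
pose += 0.05 * rng.standard_normal(t.size)    # sensor noise

alpha = 0.05                                  # running-average update rate
avg, residual = 0.0, np.empty_like(pose)
for i, x in enumerate(pose):
    avg = (1 - alpha) * avg + alpha * x       # running average of head pose
    residual[i] = x - avg                     # subtract it from each sample

spectrum = np.abs(np.fft.rfft(residual))      # FFT of the residual signal
freqs = np.fft.rfftfreq(residual.size, d=1 / fps)
dominant = freqs[np.argmax(spectrum[1:]) + 1] # skip the DC bin
print(dominant)                               # close to the 2 Hz shake
```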
With the set of facial features described above, we apply max pooling
on each dimension to merge the feature vectors of consecutive image frames.
This step outputs temporally aggregated feature vectors of the same size, also
enabling the sequence of input features to cover a sufficiently large time window
of human facial reactions. Further feature extraction details can be found in
Appx.A.3.1.
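The max-pooling aggregation step can be sketched as follows; the group size of 4 frames is an arbitrary illustration, not the window actually used:

```python
import numpy as np

def aggregate(frames: np.ndarray, group: int) -> np.ndarray:
    """Max-pool each feature dimension over groups of consecutive frames.

    frames: (num_frames, feat_dim) -> (num_frames // group, feat_dim)
    """
    n = (frames.shape[0] // group) * group          # drop any ragged tail
    return frames[:n].reshape(-1, group, frames.shape[1]).max(axis=1)

frames = np.arange(24, dtype=float).reshape(12, 2)  # 12 frames, 2 features
agg = aggregate(frames, group=4)
print(agg.shape)  # (3, 2)
print(agg[0])     # per-dimension max over frames 0-3 -> [6. 7.]
```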
We denote the sequence of raw input image frames from time t_0 to t as {X_{t_0}, ..., X_t}, with t_0 being the starting time of an episode and t being the time of the last image frame used to compute the T-th aggregated frame. Aggregated FAU features ϕ_FAU ∈ R^m and head-motion features ϕ_head ∈ R^n are extracted by the feature extractor Φ: (ϕ_FAU, ϕ_head)_T = Φ({X_{t_0}, ..., X_t}).
The input for a data sample is a temporal window of sequential aggregated features, starting from the (T−k)-th and ending at the (T+ℓ)-th aggregated frame, while the corresponding label is the reward category (i.e., −5, −1, or +6) that occurred during the time step containing the T-th aggregated frame. Note that future reaction data is also included for prediction, since some reactions occur after a reward event. We then encode the FAU features and the head-motion features separately, flattening each input temporal sequence into a single vector and encoding it with its own linear layer. After concatenating the two encodings, we input the resulting vector to a multilayer perceptron (MLP).
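The two-branch encoding described above can be sketched as a toy numpy forward pass; all dimensions, the single hidden layer, and the random weights are illustrative placeholders rather than the tuned architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the tuned values from the thesis):
m, n = 17, 8           # FAU / head-motion feature sizes per aggregated frame
k, l = 2, 3            # frames before / after the T-th aggregated frame
win = k + l + 1        # temporal window length
d, c = 32, 3           # per-modality encoding width, reward categories

# Each modality's temporal sequence is flattened and encoded by its own
# linear layer; the encodings are concatenated and fed to an MLP.
W_fau = rng.standard_normal((win * m, d)) * 0.1
W_head = rng.standard_normal((win * n, d)) * 0.1
W_mlp = rng.standard_normal((2 * d, c)) * 0.1

fau_seq = rng.standard_normal((win, m))      # aggregated FAU features
head_seq = rng.standard_normal((win, n))     # aggregated head-motion features

h = np.concatenate([fau_seq.reshape(-1) @ W_fau,
                    head_seq.reshape(-1) @ W_head])
z = np.maximum(h, 0) @ W_mlp                 # raw log-probabilities
probs = np.exp(z) / np.exp(z).sum()          # softmax over reward categories
print(z.shape, probs.sum())
```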
An auxiliary task of predicting the corresponding annotations {A_{T−k}, ..., A_{T+ℓ}} is also included to accelerate representation learning and serve as a regularizer, where each binary element of A indicates whether a reaction gesture takes place. The anticipated prediction output is a single flattened vector a ∈ {0, 1}^{10(k+ℓ+1)}. Empirical results show that adding the auxiliary task
leads to the lowest test loss, yet it is not required in the reward-ranking task
for achieving better-than-random performance. Meanwhile, we also take into
account the reward ordinality by adding a binary classification loss that com-
bines the two negative reward classes (i.e., −5, −1), so that the prediction
outcomes with a wrong sign can be further penalized. These results will be
discussed in detail in Chapter 5.
Let g_θ(·) represent the function of our MLP model, and z ∈ R^c denote the output of the network, which gives the raw log probabilities over the c reward categories. The ground-truth one-hot classification label is denoted by y ∈ {0, 1}^c. We then represent the output of the auxiliary task with o, while y_bin is the ground-truth binary class mentioned before, with z_bin being the corresponding binary prediction computed directly from z. Thus, we have:
(z, o)_T = g_θ({(ϕ_FAU, ϕ_head)_{T−k}, ..., (ϕ_FAU, ϕ_head)_{T+ℓ}})    (4.1)
The loss for optimization is given by:
L(θ) = −y · log(softmax(z)) − λ_1 y_bin · log(softmax(z_bin)) + λ_2 ||a − o||²    (4.2)
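The loss can be checked numerically for a single sample. The logits, targets, and coefficients below are illustrative, and merging the two negative-class logits with logaddexp is one plausible way to form the binary logits; the thesis does not specify the exact construction:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

z = np.array([0.2, 1.1, -0.4])     # raw log-probs over classes (-5, -1, +6)
y = np.array([0.0, 1.0, 0.0])      # ground truth: the -1 event

# Binary head: negative classes (-5, -1) merged versus +6.
z_bin = np.array([np.logaddexp(z[0], z[1]), z[2]])
y_bin = np.array([1.0, 0.0])       # true sign is negative

a = np.array([1.0, 0.0, 0.0, 1.0]) # gesture annotation targets (illustrative)
o = np.array([0.8, 0.1, 0.2, 0.7]) # auxiliary-head predictions

lam1, lam2 = 0.5, 0.1              # loss coefficients (illustrative)
loss = (-(y * np.log(softmax(z))).sum()
        - lam1 * (y_bin * np.log(softmax(z_bin))).sum()
        + lam2 * ((a - o) ** 2).sum())
print(loss > 0)
```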
We train our network with Adam [21] and apply random search [7] to find the best set of hyper-parameters, consisting of the input's window size (k and ℓ), learning rate, dropout rate, loss coefficients (λ_1 and λ_2), together with the depth and widths of the MLP. Our annotations of human facial reactions provide the candidate window sizes for the random search, as further described in Appx.A.2. In view of our small dataset, we use k-fold cross validation to facilitate the random search; we also randomly sample one episode of data from each human participant to form a holdout set for final evaluation. By evaluating across train-test data folds, we eventually select the set of parameters with the lowest mean test loss. See Appx.A.3 for further details of the reaction mapping construction.
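The model-selection loop can be sketched as follows; the hyper-parameter ranges and the per-fold loss function are stand-in placeholders for the real network training:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    # Illustrative search space (not the thesis's exact ranges).
    return {"lr": 10 ** rng.uniform(-4, -2),
            "dropout": rng.uniform(0.0, 0.5),
            "window": int(rng.integers(2, 8))}

def fold_test_loss(config, fold):
    # Placeholder: a real run would train on the other folds and
    # report this fold's test loss for the given config.
    return (np.log10(config["lr"]) + 3) ** 2 + config["dropout"] + 0.01 * fold

n_folds, n_trials = 17, 50
best_cfg, best_loss = None, float("inf")
for _ in range(n_trials):                     # random search
    cfg = sample_config()
    mean_loss = np.mean([fold_test_loss(cfg, f) for f in range(n_folds)])
    if mean_loss < best_loss:                 # keep lowest mean test loss
        best_cfg, best_loss = cfg, mean_loss
print(best_loss)
```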
4.3 Summary
In this chapter, we revealed the design details of our empathic instan-
tiation. We first provided the task specifications and data collection protocols
for creating a human reaction dataset. Further, we presented the outcomes of
a human proxy test to show that the reactions contain sufficient information
for training a reaction mapping network and performing task learning. Finally,
we illustrated the feature extraction pipeline and architecture of the proposed
deep network model, which are used for constructing the reaction mappings.
Chapter 5
Reaction Mapping Results and Evaluation
To validate that the learned mappings from our instantiation of stage 1 effectively enable task learning in stage 2, we test the following hypotheses, which were proposed in our paper published at the Conference on Robot Learning by Cui et al. [10] (here we refer to observers from stage 1 who have created data in the training set as “known subjects”):
• Hypothesis 1 [deployment setting is the same as training setting]. The
learned reaction mappings will outperform uniformly random reward
ranking, using reaction data from known subjects watching the Robo-
taxi task.
• Hypothesis 2 [generalizing H1 to online data from novel subjects]. The
learned reaction mappings will improve the online policy of a Robotaxi
agent via updates to its belief over reward functions, based on online
data from novel human observers.
• Hypothesis 3 [generalizing H1 to a different deployment task]. The
learned reaction mappings can be adapted to evaluate robotic-sorting-
task trajectories and will outperform uniformly random guessing on
return-based rankings of these trajectories, using reaction data from
known subjects.
Figure 5.1: Sorted per-subject Kendall’s τ for Robotaxi reward-ranking task
5.1 Reward-ranking Performance in the Robotaxi Domain
In this section, we focus on reward-ranking results in the Robotaxi domain. We first demonstrate the effectiveness of our learned reaction mapping network in predicting the reward function used for the task, together with an ablation study. Next, we provide results from preliminary modeling of other task statistics. Lastly, we show the influence of different initial belief priors on the inference of reward rankings.
5.1.1 Learning Outcome of the Reaction Mapping
To evaluate the effectiveness of the learned reaction mapping, we test
its performance on a reward-ranking task. Let q be the random variable for
reward event and x denote the detected human reactions. Let m represent the
discrete random variable over all possible reward functions, and we can regard
it as a reward ranking in the Robotaxi task. Our neural network gθ(·) models
P(q | x,m), which is the probability of a reward event given the human’s reac-
tion and a fixed reward ranking m. We aim to find the posterior distribution
over m: P(m | q, x). The proof for P(m | q, x) ∝ P(q | x,m)P(m) is given as
follows:
P(q, x,m) = P(q | x,m)P(x | m)P(m) (5.1)
= P(m | q, x)P(q, x) (5.2)
P(m | q, x)P(q | x)P(x) = P(q | x,m)P(x | m)P(m) (5.3)
x and m are assumed independent, since the human reacts to the observed reward event rather than to the reward function itself; therefore:
P(x | m) = P(x) (5.4)
Hence,
P(m | q, x)P(q | x) = P(q | x,m)P(m) (5.5)
P(q | x) is constant across all values of m; therefore: P(m | q, x) ∝ P(q | x,m)P(m).
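The resulting update can be sketched numerically: start from a prior over the 3! = 6 candidate rankings, multiply by the network's likelihood of each observed reward event, and renormalize. The likelihood values below are synthetic, with ranking 0 playing the role of the truth:

```python
import numpy as np

n_rankings = 6                                   # 3! orderings of 3 objects
belief = np.full(n_rankings, 1.0 / n_rankings)   # uniform prior P(m)

rng = np.random.default_rng(0)
for _ in range(10):                              # 10 observed reward events
    # Synthetic P(q | x, m) for each candidate ranking m; the true
    # ranking (index 0) explains the observed event best.
    lik = rng.uniform(0.05, 0.2, size=n_rankings)
    lik[0] = rng.uniform(0.4, 0.6)
    belief *= lik                                # P(m|q,x) ∝ P(q|x,m) P(m)
    belief /= belief.sum()                       # renormalize

map_ranking = int(np.argmax(belief))             # maximum a posteriori ranking
print(map_ranking)
```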
With the formula derived above, the target distribution P(m | q, x) can be found given the learned mapping g_θ(·) and a uniform prior over m. After
incorporating predicted mappings from all human reaction data in an episode,
we select the maximum a posteriori reward ranking as the estimation out-
come. To account for the impact of training randomness, the neural network is trained 4 times and the mean result is reported. The corresponding loss profiles are shown in Fig. 5.2. The red dotted line marks the mean validation loss across the 4 repetitions using the model selected with the lowest testing loss.
The performance of the learned reaction mapping is evaluated with the Wilcoxon signed-rank test on the holdout set of Robotaxi episodes. Fig. 5.1 shows that we obtain p = 0.0024 with the annotation-reliant auxiliary task and p = 0.0207 without it, demonstrating that the mappings achieve significantly better performance than uniformly random guessing (τ = 0), which supports H1.
Similarly, Fig. 5.3 shows the reward ranking result on the human proxy
test episodes, in which we use the same Robotaxi episodes as the human proxies
viewed for evaluation, while the remaining episodes are used for training the
network. With few exceptions, our model performs poorly on participants whom the human proxies also found difficult, and well on participants for whom the proxies performed well.
5.1.2 Ablation Study for Predictive Model Design
An ablation study is further conducted to validate the effectiveness of
our model design, focusing on the use of the auxiliary task and the input features. We
(a) Holdout data as evaluation set
(b) Human proxy episodes as evaluation set
Figure 5.2: Loss profiles for training final models for reward ranking evaluation.
Figure 5.3: Sorted per-subject Kendall's τ for Robotaxi reward-ranking task on the human proxy test episodes
generate the loss profiles along the training epochs across 17 subject train-
test-validation sets, respectively using the proposed model, the model without
auxiliary loss, the model with only FAU features, and the model with only
head-motion features. The hyperparameters are determined by random search performed independently for each variant, and we select the model with the lowest mean test loss using 17-fold cross validation, with one fold per subject. Fig. 5.4 indicates that our full model achieves the lowest test loss, as well as the lowest mean and variance of validation loss, compared with all the other variants.
As the last test case, we also train an end-to-end model with a Resnet-
18 CNN as feature extractor and an LSTM model for feature processing in a
time window. Fig. 5.4 shows that the model fails to achieve a training loss
lower than that obtained by directly outputting the label distribution. In
this experiment, random hyperparameter search cannot be performed due to
(a) Proposed model (b) Model trained without auxiliary task
(c) Only use FAU features as input (d) Only use head-motion features as input
(e) End-to-end model (Resnet+LSTM architecture)
Figure 5.4: Loss profiles for training different models (each model has its own set of best hyper-parameters found through random search, except the end-to-end model)
the high cost of training an end-to-end model; hence only manual tuning is conducted, which is likely an important reason for the failure. In addition, the small dataset size might also have been a constraint, which motivates our use of OpenFace [5, 6, 45] for feature extraction to alleviate the modeling burden.
5.1.3 Preliminary Modeling of Other Task Statistics
We also try modeling the following task statistics other than reward.
πb denotes the agent’s behavior policy, and π∗ represents the optimal policy
using the ground-truth reward function:
• Q-value of an action under the optimal policy: Q∗(s, a) = R(s, a) + V∗(s′)
• Optimality (0/1) of an action (1[·] is the indicator function): O(s, a) = 1[Q(s, a) = Q∗(s, a)]
• Q-value of an action under the behavior policy: Q^πb(s, a) = R(s, a) + V^πb(s′)
• Advantage of an action under the optimal policy: A∗(s, a) = Q∗(s, a) − V∗(s)
• Advantage of an action under the behavior policy: A^πb(s, a) = Q^πb(s, a) − V^πb(s)
• Surprise modeled as the difference in Q: S(s, a) = Q^πb(s, a) − Q∗(s, a)
To compute the Robotaxi agent’s policy, we apply an approximate opti-
mal policy by assuming a static gridworld map for each time step and running
value iteration. Such a policy could be considered optimal given that no more
than 2 objects of the same type exist in the map. Monte Carlo rollouts are
then used for value and Q-value estimation.
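A planner of this kind can be sketched with value iteration on a toy static grid; the grid size, goal location, and +6 reward are illustrative, while γ = 0.95 matches the setting described in Appx.A.1:

```python
import numpy as np

size, gamma = 4, 0.95
goal = (3, 3)                          # cell holding the highest-reward object
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((size, size))
for _ in range(100):                   # sweep until converged
    V_new = np.zeros_like(V)
    for r in range(size):
        for c in range(size):
            if (r, c) == goal:
                continue               # goal treated as absorbing (value 0)
            best = -np.inf
            for dr, dc in moves:
                nr = min(max(r + dr, 0), size - 1)   # clamp at the walls
                nc = min(max(c + dc, 0), size - 1)
                reward = 6.0 if (nr, nc) == goal else 0.0
                best = max(best, reward + gamma * V[nr, nc])
            V_new[r, c] = best
    V = V_new

print(round(V[0, 0], 3))  # 4.643 = 6 * 0.95**5, decaying with distance
```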
We continue to use cross entropy loss for classifying optimality, while
mean square error is employed for the other continuous statistics, with auxil-
iary loss added as in reward prediction. Fig. 5.5 shows that neither the testing
nor the validation loss manages to get below the baseline that simply outputs
the mean of the labels, with the two losses increasing while the training loss
decreases.
We attribute the overfitting observed when training models for these task statistics to time-aliasing in the training data: two adjacent training inputs from consecutive timesteps are very similar yet have different labels determined by each timestep's task statistics. Addressing this issue is an important direction for future research. Moreover, the discount factor in our experiments treats Robotaxi as an infinite-horizon MDP, whereas the episodes actually used are trajectories of finite length, which might also contribute to the failure in modeling the other statistics.
5.1.4 Effects of Different Belief Priors
As the last experiment on Robotaxi in our previous paper [10], we infer
over all possible reward rankings starting from various priors, to study how fast
our reaction mapping approach could recover from different (and potentially
wrong) initial beliefs. To do so, a pool of predictions from the learned Robotaxi
(a) O(s, a) (b) Q∗(s, a)
(c) Qπb(s, a) (d) A∗(s, a)
(e) Aπb(s, a) (f) S(s, a)
Figure 5.5: Loss profiles for training with other task statistics
Figure 5.6: Performance of reward inference starting from different priors over reward rankings.
reaction mapping on the holdout set is retrieved, from which we randomly
sample without replacement to update the reward belief. We test the following
cases of reward ranking priors:
• all uniform: uniform prior over all possible reward rankings (used in
all other experiments);
• P(worst ranking) = p: the reward mapping that ranks events in the
reverse of the correct ranking has prior probability mass p, and the rest
of the reward rankings uniformly share 1− p probability mass;
• P(best ranking) = p: the correct reward ranking has prior probabil-
ity mass p, and the rest of the reward rankings uniformly share 1 − p
probability mass; and
• P(second best ranking) = p: the reward ranking that correctly ranks the positive-reward event first but incorrectly orders the two negative-reward events has prior probability mass p, and the rest of the reward rankings uniformly share 1 − p probability mass.
The results are then evaluated with two metrics: the average Kendall's τ score of the most probable reward ranking, and the KL divergence between the current belief and a soft true belief, in which the probability of a given reward ranking is exp(λτ)/Z, with τ representing that ranking's Kendall's τ value. These two metrics are computed and recorded after different numbers of non-zero reward events have been observed and used by our model for prediction.
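The soft true belief and the KL metric can be sketched as follows, with λ = 2 as an illustrative weight and a hand-rolled Kendall's τ for 3-item rankings:

```python
import numpy as np
from itertools import permutations

def kendall_tau(a, b):
    """Kendall's tau between two rankings of equal length."""
    n, s = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((a[i] - a[j]) * (b[i] - b[j]))
    return s / (n * (n - 1) / 2)

true_rank = (1, 2, 3)
rankings = list(permutations((1, 2, 3)))          # all 6 candidate rankings
lam = 2.0                                         # illustrative weight
weights = np.array([np.exp(lam * kendall_tau(r, true_rank)) for r in rankings])
soft_belief = weights / weights.sum()             # exp(lam*tau)/Z per ranking

# KL divergence from a current (here uniform) belief to the soft belief.
current = np.full(len(rankings), 1 / len(rankings))
kl = float(np.sum(current * np.log(current / soft_belief)))
print(round(kl, 3))
```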
Fig. 5.6 shows the mean performance over 100 repetitions. The case of P(second best ranking) = 0.9, in which the order of the two negative rewards is mistaken, is the most difficult prior to recover from. After receiving enough data, our model converges to the ground-truth reward ranking in 4 out of the 6 scenarios mentioned above, while higher weight on the predicted likelihood generally leads to faster convergence. Given that these final results are sensitive to the choice of belief prior, initializing with a uniform distribution is preferable when we have no strong evidence for selecting a specific biased prior.
Throughout the study above we are using offline reaction data of human
subjects observing predetermined trajectories; the learning performance will differ if we instead use live human data to update the behavior policy via the belief in real time, as discussed in the following section.
5.2 Online Learning in the Robotaxi Domain
Given a learned reaction mapping model, we can further use the online
human facial reactions to interactively update the belief over reward rankings,
hence optimizing the behavior policy by having the agent follow the most likely
reward function at every timestep. To do so, we employ the reaction mapping
trained in stage 1 on human observers who have never appeared in the training
data, to validate that our model can improve the agent's policy in real time in stage 2. This experiment was conducted jointly as part of our paper published at the Conference on Robot Learning by Cui et al. [10].
Due to social distancing during the COVID-19 pandemic, the authors recruited their friends as participants and conducted the user study in their own homes, resulting in two aspects of informality: 1) participants were unpaid, hence the human internal model R_H and the real reward function R are not as aligned as in the collected dataset; 2) human observers did not get a chance to control the Robotaxi agent before watching the trajectories, a step normally included to familiarize them with the task domain. Though the overall performance is likely worsened by these constraints, we still observe positive experimental outcomes, shown in Fig. 5.7.
In 9 out of 10 trials, the agent receives a positive final return (p =
(a) Probability (b) Entropy over rewards
(c) Return (d) Kendall’s τ values
Figure 5.7: Trials of informal online learning sessions in Robotaxi
0.0134 under a binomial test), among which 8 trials achieve significantly higher return than a uniformly random policy. The belief also converges with low entropy to the optimal or second-optimal ranking (passenger ranked highest) in 3 out of 10 trials. Meanwhile, after only a small number of initial timesteps, the mean Kendall's τ value of the reward ranking becomes higher than that of random guessing. Furthermore, 7 of the 10 trials terminate with the highest probability assigned to reward mappings that correspond to optimal behaviors. These results moderately support H2.
5.3 Trajectory Ranking in Robotic Sorting Domain
As discussed by Cui et al. [10] in our paper published at the Conference on Robot Learning, we extend the trained Robotaxi reaction mapping to the robotic trash sorting domain by using only the binary reward classification loss. In this way, the binary prediction outcome can be regarded as a “positivity score” for each aggregated frame input.
In the user study, we select 8 different predetermined trajectories, 7 of
which are combined into an episode and watched by each human subject. A
trajectory contains one of the three possible return events: 1) recycling a can
(+2); 2) recycling any other wrong item (−1); 3) no object placed in the trash
bin (0). Empirically, human reactions correspond to high-level behaviors in the trajectories (i.e., pick and place object X), rather than the low-level joint torques. Therefore, the small time window size used in the Robotaxi domain is no longer sufficient to capture all the facial reactions needed for prediction.
Confronted with this additional challenge, we calculate the mean positivity score over all aggregated frames in each trajectory, treating the trajectory as a single extended action. We also assume that the facial reactions along each trajectory are generated by a single latent state, which reflects the human's internal model of the agent's behavior. Hence, we resolve the temporal incompatibility with Robotaxi by using this per-trajectory mean positivity score as the overall score of the entire trajectory.
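The scoring step can be sketched as follows; the per-frame positivity scores are synthetic, drawn so that frames from higher-return trajectories tend to score higher (scipy supplies Kendall's τ):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
returns = np.array([2, -1, 0, 2, -1, 0, 2])       # the 7 trajectories' returns

# Synthetic per-frame positivity scores: 40 aggregated frames per
# trajectory, centered on a value that tracks the trajectory's return.
scores = [rng.normal(loc=0.3 * ret, scale=0.1, size=40) for ret in returns]
mean_positivity = np.array([s.mean() for s in scores])  # per-trajectory score

tau, p = kendalltau(mean_positivity, returns)     # rank agreement with returns
print(tau > 0.5)
```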
Figure 5.8: Sample plot of trajectory positivity score over aggregated frames
Figure 5.9: Sorted per-subject Kendall's τ for evaluating robotic sorting trajectories
Figure 5.10: Overall trajectory ranking by mean positivity score across subjects (each entry is colored by the trajectory's final return: green for +2, yellow for 0, and red for −1)
The positivity scores along all 7 candidate trash-sorting trajectories (observed by one participant) are shown in Fig. 5.8. The Kendall's τ values for the trajectory rankings by each human participant are then presented in Fig. 5.9, in which the per-trajectory mean positivity scores are computed across the temporal domain. After further averaging the positivity scores across all human subjects, we show the final ranking of all trajectories in Fig. 5.10, where a Kendall's τ independence test achieves τ = 0.70 with p = 0.034, significantly outperforming uniformly random guessing, whose expected τ = 0. This result supports H3.
5.4 Summary
In this chapter, we presented and analyzed the experiment results of
our reaction mapping from the empathic framework. By proposing and val-
idating three hypotheses, we demonstrated that the learned mappings from
our instantiation of stage 1 effectively enable task learning in stage 2, both
when the deployment setting is the same as the training setting, and when we
generalize to online data from novel subjects and to a different deployment
task.
Chapter 6
Conclusion and Future Work
In this thesis we introduce the lihf problem and the empathic frame-
work for solving lihf. We demonstrate that our instantiation interprets human
facial reactions in both the training task and the deployment task, thus tak-
ing a significant step towards the ultimate goal of task learning from implicit
human feedback. It does so by successful application of a learned mapping
from human facial reactions to reward types for online agent learning and for
evaluating trajectories from a different domain. In this last section, we discuss
the limitations of our work and the potential aspects of future study.
• Experimental Design In this work, we demonstrate a single instantiation of empathic, yet in the current domains agent behaviors have no long-term consequences on the expected return, and the training and testing tasks are highly similar in their temporal characteristics and reward structures, which together limit the generality of the proposed framework for solving the lihf problem. Therefore, richer task environments could be designed in future work to further investigate facial reaction modeling for the prediction of long-term returns, which closely relates to changes in human observers' expectations.
• Human Models In many real-world scenarios, human observers usu-
ally concentrate on their own main tasks, only occasionally switching
their attention and providing implicit human feedback to our task of in-
terest. Therefore, we aim to further generalize our work and enable the
empathic framework to accurately infer the relevance of human reac-
tions to the agent’s behavior. Moreover, our current instantiation still
needs the ability to efficiently model the latent human states that reflect
changing expectations of agent behavior in the lihf problem, instead of
simply focusing on the recent and anticipated agent experience.
• Data Modalities In this work, only discrete reward categories are taken as targets of the facial reaction mapping. Future extensions may involve richer implicit human feedback modalities, including gaze and gestures, to more effectively model various task statistics in a broader range of practical tasks.
Appendices
Appendix A
Supplemental Material
A.1 Experimental Domains and Data Collection Details
A.1.1 Robotaxi
• Agent Transition Dynamics In the 8×8 grid-based map, the agent
has three actions available at each timestep: maintain direction, turn
left, or turn right. When the agent runs into the boundary of the map,
it is forced to turn left or right, in the direction of the farther boundary.
• Rewards There are three different types of objects associated with
non-zero rewards when encountered in the Robotaxi environment: if the
agent picks up a passenger, it gains a large reward of +6; if it runs into
a roadblock, it receives a small penalizing reward of −1; if the agent
crashes into a parked car, it receives a large penalizing reward of −5. All
other actions result in 0 reward.
• Object Regeneration At most 2 instances of the same object type are
present in the environment at any given time. An object disappears after
the agent moves onto it (a “pickup”), and another object of the same
type is spawned at a random unoccupied location 2 time steps after the
corresponding pickup.
• Agent Policy The agent executes a stochastic policy by choosing from
a set of 3 pseudo-optimal policies under 3 different reward mappings
from objects to the 3 reward values:
– Go for passenger: {passenger: +6, road-block: −1, parked-car: −5}
– Go for road-block: {passenger: −1, road-block: +6, parked-car: −5}
– Go for parked-car: {passenger: −1, road-block: −5, parked-car: +6}
The pseudo-optimal policies are computed in real time via value iteration
(discount factor γ = 0.95) on a static version of the current map, meaning
that objects neither disappear nor respawn when the agent moves onto
them. We simplify the state space in this manner because the true state space is too large to evaluate and would create too large a Q-function to store, yet this simplification finds a near-optimal policy that almost always takes the shortest path to an object of the type that its corresponding reward function considers to be of highest reward.
At the start of an episode, the agent selects 1 of these 3 policies. The
agent follows the selected policy, with a 0.1 probability at each time step
that the agent will reselect from the 3 policies. This 0.1 probability of the agent changing its plans was included because we speculated that it would elicit more human reactions, by making the agent typically exhibit plan-based behavior but sometimes change course, violating human expectations. All selections among the 3 policies
are done uniformly randomly.
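The behavior-policy process above can be sketched directly; the policy names are shorthand for the three pseudo-optimal policies:

```python
import random

random.seed(0)
POLICIES = ["go-for-passenger", "go-for-road-block", "go-for-parked-car"]

def run_episode(steps=60):
    policy = random.choice(POLICIES)           # uniform pick at episode start
    trace = []
    for _ in range(steps):
        if random.random() < 0.1:              # 0.1 chance of changing plans
            policy = random.choice(POLICIES)   # may re-pick the same policy
        trace.append(policy)
    return trace

trace = run_episode()
switches = sum(a != b for a, b in zip(trace, trace[1:]))
print(len(trace), switches)                    # 60 timesteps, a few switches
```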
(a) Car choice screen (b) Robotaxi game view
Figure A.1: Robotaxi environment
A.1.2 Robotic Sorting Task
Figure A.2: Robotic task table layout with object labels, from the perspective of the human observer
As described in our paper published at the Conference on Robot Learning
by Cui et al. [10], the robot in the trash sorting task executes predetermined
trajectories programmed through key-frame based kinesthetic teaching. The
7-DOF robotic arm is controlled at 40Hz with torque commands. The initial
table layout is shown by Fig. A.2, in which the objects in the green circles give
Figure A.3: Robotic sorting task trajectories with optimality segmentation
+2 rewards when moved into the bin and others give −1 rewards. Therefore,
the robot’s task is to sort the aluminum cans into the recycling bin, without
sorting the remaining objects by mistake.
We show the snapshots of the 8 arm trajectories used in the robotic
trash sorting task in Fig. A.3. Each of these trajectories consists of a fixed
sequence of torque commands, which might generate small variations in the
actual trajectories. Thus, we remove results from trajectories with qualitative departures, such as failures in object grasping. Each episode involves 1 or 2 target objects and terminates with a total reward of −1, 0, or +2.
Each trajectory of an episode can be further segmented into reaching,
grasping, transferring, placing and retracting sub-trajectories. The relative op-
timality of these sub-trajectories can be determined by whether the projected
outcome is desired. For example, reaching for a correct object and retracting from picking up a wrong object are both considered optimal, while reaching for and transferring a wrong object are both sub-optimal. Fig. A.3 also contains
annotations of these sub-trajectories’ optimality under each trajectory. Note
that these segmentations are only for illustration and are not actually used
in our reaction mapping algorithm.
A.1.3 Experimental Design
The instructions we give the participants in Robotaxi are as follows:
– Hello human! Welcome to Robo Valley, an experimental city where
humans don’t work but make money through hiring robots!
– You’ll start with $12 for hiring Robotaxi, and after each session you will
be paid or fined according to the performance of the autonomous vehicle
or robot.
– Your initial $12 will be given to you in poker chips. After each session,
we will add or take away your chips based on your vehicle’s score. At
the end, you can exchange your final count of poker chips for an Amazon
gift card of the same dollar value.
– For the 3 sessions with a Robotaxi, you begin by choosing an autonomous
vehicle to lease.
– The cost to lease one of these vehicles will be $4 each session.
– The vehicle earns $6 for every passenger it picks up, but it will be fined
$1 each time it hits a roadblock and fined $5 each time it hits a parked
car.
– You will watch the Robotaxi earn money for you, and your reactions to
its performance will be recorded for research purposes.
– You will have a chance to practice driving in this world, but the amount
earned during the test session won’t count towards your final payout.
The instructions we give the participants in the robotic sorting task are as
follows:
– For the robotic task, the robot is trying to sort recyclable cans out of a
set of objects.
– You will earn $2 for each correct item it sorts and be penalized $1
for each wrong item it puts in the trash bin.
– You will watch the robot earn money for you, and your reactions to its
performance will be recorded for research purposes.
The participants first control the agent themselves for a test session
to familiarize themselves with the Robotaxi task, removing a source of human
reactions changing in ways we cannot easily model. For the agent-controlled
sessions, the participants select an agent at the beginning of each episode of
Robotaxi. Fig. A.1a shows the view of this agent selection. Unbeknownst to the
subject, their selection of a vehicle only affects its appearance, not its policy.
This vehicle choice was included in the experimental design as a speculatively
justified tactic to increase the subject’s emotional investment in the agent’s
success, thereby better aligning R and RH as well as increasing their reactions.
At the start of the session, participants are given $12, which they must
soon spend to lease a Robotaxi agent before it begins its task. To make
their earnings and losses more tangible (and therefore, we speculate, elicit
greater reactions), participants are given poker chips equal to their current
total earnings. After each session they are paid or fined according to the
performance of the agent: their chips are increased or decreased based on the
score of Robotaxi. At the end of the entire experimental session, participants
exchange their final count of poker chips for an Amazon gift card of the same
dollar value.
A.1.4 Participant Recruitment
The participants we recruited are mostly college students in the com-
puter science department. Each participant filled out an exit survey of their
backgrounds after completing all episodes of observing an agent. The statistics
of these 17 subjects are given below:
• Gender: 10 participants are male and 7 are female.
• Age: The participants’ average age is 20. Ages range from 18 to 28
(inclusive).
• Robotics/AI background: 1 participant is not familiar with AI/robotics
technologies at all. 2 have neither worked in AI nor studied it technically,
but are familiar with AI and robotics. 13 have not worked in AI but have
taken classes related to AI or otherwise studied it technically. Only 1
has worked or done major research in robotics and/or AI.
• Ownership of robotics/AI-related products: 7 participants own robotics
or AI-related products, while 10 do not. The products include Google
Home, Roomba, and Amazon Echo.
A.2 Annotations of Human Reactions
A.2.1 The Annotation Interface
To gain a better understanding of the dataset and the LIHF problem
as a whole, two of the authors annotated the collected dataset. These an-
notations are not intended to serve as ground truth and are only used as
labels for an auxiliary task in our training of reaction mappings. Therefore,
training/calibration of annotators, evaluation of annotators via inter-rater
reliability scores, etc. are not important.
Figure A.4: View of the annotation interface. The corresponding trajectory of Robotaxi is not displayed.
Figure A.5: Proportion of annotated gestures
The interface for annotating the data is displayed in Fig. A.4. A human
annotator marks whether facial gestures
and head gestures are occurring in each frame, effectively marking the onset
and offset of such gestures. Annotation is performed without any visibility of
the corresponding game trajectory. The proportion of 7 reaction gestures in
the annotation is shown in Fig. A.5. Annotations provide several benefits in
this study: in our search for a modeling strategy, we found our first success-
ful reaction mapping while using annotations directly as the only supervisory
labels; annotations provide labels for an auxiliary task that regularizes
training of the reaction mapping, learned from features extracted automatically
via OpenFace [5, 6, 45], and speeds its representation learning (both important
for a small dataset); and an annotation-based analysis of our data helped us
find a temporal window of reaction data around an event that is effective for
inferring the reward types for that event.
Figure A.6: Histograms of non-zero reward events around feature onsets
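The per-frame annotations described above can be reduced to gesture onsets and offsets with a simple run-length pass. The sketch below illustrates the idea; the function name and the example data are ours, not from the thesis code:

```python
import numpy as np

def annotation_onsets(frames):
    """Convert per-frame binary gesture annotations into (onset, offset) pairs.

    `frames` is a 0/1 sequence with one entry per video frame; a gesture spans
    a maximal run of 1s. The returned offset is the first frame after the
    gesture ends (half-open interval)."""
    padded = np.concatenate(([0], np.asarray(frames), [0]))
    diff = np.diff(padded)
    onsets = np.flatnonzero(diff == 1)    # 0 -> 1 transitions
    offsets = np.flatnonzero(diff == -1)  # 1 -> 0 transitions
    return list(zip(onsets.tolist(), offsets.tolist()))

# Example: a gesture spanning frames 2-4 and a single-frame gesture at 7.
print(annotation_onsets([0, 0, 1, 1, 1, 0, 0, 1, 0]))  # [(2, 5), (7, 8)]
```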
A.2.2 Visualizations of Annotated Data
The annotations can be used to visualize the temporal relationship
between reaction onsets and events (rewards). Fig. A.6 shows example his-
tograms of reward events binned into time windows around feature onsets. As
demonstrated by Fig. A.6, the onsets of certain gestures such as eyebrow-frown
and head-nod correlate with negative and positive events, respectively (peaking
around 1.47s before the onset). While smile accounts for a large portion of
overall gestures (Fig. A.5), it does not correlate strongly with either positive
or negative events, contradicting the assumption made by several prior studies
that smile should always be treated as positive feedback [3, 26, 39, 46]. While
this observation could be specific to our experimental setting or domain, it
agrees with established research on the emotional meanings of smiles as shown
in the work of Hoque et al. [16] and Niedenthal et al. [32].
In these histograms, the contours of red and yellow bars are strikingly
similar in most subplots of Fig. A.6, which suggests that although an individual
may react differently to the events that provide −1 and −5 reward, it may be
hard to distinguish between them through single gestures. We also find that
reactions (across all gestures) are likely to occur between 2.8s before and 3.6s
after an event (shown as a peak in the histograms), which we use as a prior
for designing the set of candidate time windows that random hyperparameter
search draws from (see Appx. A.3.3).
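As an illustration of this analysis, the sketch below collects reward-event times relative to gesture onsets, the quantity binned in the Fig. A.6 histograms. Since reactions tend to onset between 2.8s before and 3.6s after an event, an event near a given onset lies between 3.6s before and 2.8s after that onset; the function name, default bounds, and example times are illustrative:

```python
import numpy as np

def event_offsets_near_onsets(onset_times, event_times, before=3.6, after=2.8):
    """Collect reward-event times relative to each gesture onset.

    Keeps events between `before` seconds before and `after` seconds after an
    onset (a negative offset means the event preceded the gesture)."""
    offsets = []
    for t_on in onset_times:
        for t_ev in event_times:
            dt = t_ev - t_on
            if -before <= dt <= after:
                offsets.append(dt)
    return offsets

# Hypothetical times in seconds; both matched events precede their onsets,
# consistent with the pre-onset peaks described above.
rel = event_offsets_near_onsets([10.0, 25.0], [8.5, 23.6, 40.0])
counts, edges = np.histogram(rel, bins=8, range=(-3.6, 2.8))
print(len(rel), int(counts.sum()))  # 2 2
```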
A.3 Model Design
A.3.1 Feature Extraction
The specific output data we use from OpenFace [5, 6, 45] are: [success,
AU01_c, AU02_c, AU04_c, AU05_c, AU06_c, AU07_c, AU09_c, AU10_c,
AU12_c, AU14_c, AU15_c, AU17_c, AU20_c, AU23_c, AU25_c, AU26_c,
AU28_c, AU45_c, AU01_r, AU02_r, AU04_r, AU05_r, AU06_r, AU07_r,
AU09_r, AU10_r, AU12_r, AU14_r, AU15_r, AU17_r, AU20_r, AU23_r,
AU25_r, AU26_r, AU45_r, pose_Tx, pose_Ty, pose_Tz, pose_Rx, pose_Ry,
pose_Rz].
The AUx_c signals are outputs from classifiers of facial action
unit (FAU) activation, and the AUx_r signals are from regression models
designed to capture the intensity of that activation. The pose_T and
pose_R signals are the detected head translation and rotation with respect to the
camera. Since the camera pose and the person’s position relative to the camera
vary from training time to application time, we explicitly model the
change in the detected person’s head pose by maintaining a running average
and subtracting the average from all incoming pose features. We then use a
time window of the past 50 feature frames and compute the Fourier transform
coefficients as the head-motion features we feed into the neural network.
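A minimal sketch of this head-pose pipeline (running-average centering followed by an FFT over the past 50 frames) is given below. The class and parameter names are illustrative, and the exact feature dimensionality used in the thesis may differ:

```python
import numpy as np
from collections import deque

class HeadMotionFeaturizer:
    """Running-average centering of head-pose frames plus FFT magnitudes over
    a sliding window of the past 50 frames (a sketch, not the thesis code)."""

    def __init__(self, n_pose=6, window=50):
        self.window = window
        self.buffer = deque(maxlen=window)   # most recent centered frames
        self.running_sum = np.zeros(n_pose)
        self.count = 0

    def step(self, pose_frame):
        # Update the running average and subtract it from the incoming frame,
        # compensating for camera/person placement differing across sessions.
        self.running_sum += pose_frame
        self.count += 1
        centered = pose_frame - self.running_sum / self.count
        self.buffer.append(centered)
        if len(self.buffer) < self.window:
            return None  # not enough history for a full window yet
        # FFT along time for each pose dimension; keep coefficient magnitudes.
        window_arr = np.stack(self.buffer)        # shape (50, 6)
        coeffs = np.fft.rfft(window_arr, axis=0)  # shape (26, 6), complex
        return np.abs(coeffs).ravel()

feat = HeadMotionFeaturizer()
rng = np.random.default_rng(0)
out = None
for _ in range(60):
    out = feat.step(rng.normal(size=6))
print(out.shape)  # (156,)
```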
A.3.2 Data Split of k-fold Cross Validation for Random Search
Figure A.7: Diagram of data split for subject k
During our search for hyperparameters that learn an effective reaction
mapping from data gathered in Robotaxi, we used a data-splitting method
designed to avoid overfitting and to have relatively large training sets despite
our small dataset size. Recall that each participant observes and reacts to 3
episodes. Of these 3 episodes, 1 is randomly chosen as a holdout episode and is
not used for training or testing except for final evaluation. With the remaining
2 episodes per subject, we split data such that different train-test-validation
sets are created for each subject, as shown in Fig. A.7. Specifically, we con-
struct a train-test-validation set for each subject by assigning one episode of
the target subject as the validation set, randomly sampling half (either the
first half or the second) of an unused episode from each subject into the test
set, and using the remaining data in the training set.
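The split just described can be sketched as follows; `episodes_by_subject` and the other names are illustrative, not from the thesis code:

```python
import random

def build_split(episodes_by_subject, target_subject, rng):
    """Build one subject's train/test/validation split (illustrative sketch).

    `episodes_by_subject` maps subject IDs to their 2 non-holdout episodes,
    each a list of samples. One episode of the target subject becomes the
    validation set; half (first or second, at random) of one remaining episode
    per subject goes to the test set; everything else is training data."""
    train, test = [], []
    val_idx = rng.randrange(2)
    validation = list(episodes_by_subject[target_subject][val_idx])
    for subject, episodes in episodes_by_subject.items():
        remaining = ([episodes[1 - val_idx]] if subject == target_subject
                     else list(episodes))
        test_ep = rng.randrange(len(remaining))  # episode sampled for testing
        for i, ep in enumerate(remaining):
            if i != test_ep:
                train.extend(ep)
                continue
            half = len(ep) // 2
            if rng.random() < 0.5:               # first or second half
                test.extend(ep[:half]); train.extend(ep[half:])
            else:
                test.extend(ep[half:]); train.extend(ep[:half])
    return train, test, validation

# 3 hypothetical subjects, 2 episodes each, 4 samples per episode.
eps = {s: [[(s, e, i) for i in range(4)] for e in range(2)] for s in "ABC"}
train, test, val = build_split(eps, "A", random.Random(0))
print(len(train), len(test), len(val))  # 14 6 4
```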
For a target subject, a model is trained on the subject’s corresponding
training set and tested after each epoch on the test set. The epoch with the
lowest test loss is chosen as the early stopping point, and the model
trained at this epoch is then evaluated on the validation set. The performance
of a hyperparameter set is defined as the mean of the cross entropy losses
across each subject’s validation set. The hyperparameter set with the lowest
such mean cross entropy loss is selected for evaluation on the holdout set.
The data split for evaluation on the holdout set is similar but simpler. From
the 2 episodes per subject that are not in the holdout set, half an episode is
randomly sampled into the test set and the rest goes into the training set. A
single model is trained (stopping at the lowest cross-entropy loss on the test
set) and then evaluated on the holdout set.
A.3.3 Hyperparameters
Random search is used to find the best set of hyperparameters, including
the input window size (k and l), learning rate, dropout rate, loss coefficients
(λ1 and λ2), and the depth and widths of the MLP hidden layers. Fig. A.6 indicates
that reactions are likely to onset between 2.8s before and 3.6s after an event.
Therefore, we convert the corresponding temporal window range into the
number of aggregated frames before and after a particular prediction point
(an aggregated frame) and use that as the range from which to sample the input window.
Each set of randomly sampled parameters is evaluated on all 17 train-test
folds and the set with the lowest average test loss is selected. For each model,
the weights with the lowest test loss are saved and evaluated on the validation
set.
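The search procedure can be sketched as below; the candidate ranges and the toy evaluation function are illustrative, not the exact ones used in the thesis:

```python
import random

def sample_hyperparameters(rng):
    """Draw one candidate hyperparameter set (ranges are illustrative)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "batch_size": rng.choice([8, 16, 32]),
        "k": rng.randint(0, 6),    # aggregated frames before the prediction point
        "l": rng.randint(0, 12),   # aggregated frames after the prediction point
        "dropout": rng.uniform(0.1, 0.7),
        "lambda1": rng.choice([0.5, 1, 2]),
        "lambda2": rng.choice([0.5, 1, 2]),
        "hidden_widths": rng.choice([[128, 64], [128, 128, 64, 8]]),
    }

def random_search(evaluate, n_trials=100, seed=0):
    """Keep the candidate whose mean test loss across all folds is lowest."""
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        mean_loss = evaluate(params)  # e.g. mean cross-entropy over 17 folds
        if mean_loss < best_loss:
            best, best_loss = params, mean_loss
    return best, best_loss

# Toy evaluation: pretend the loss equals the dropout rate, so the search
# should return the candidate with the lowest sampled dropout.
best, loss = random_search(lambda p: p["dropout"], n_trials=50)
print(0.1 <= loss <= 0.7, best["dropout"] == loss)  # True True
```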
The best hyperparameters found through random search are: {learning rate
= 0.001, batch size = 8, k = 0, l = 12, dropout rate = 0.6314, λ1 = 2, λ2 =
1}. Below is the best model architecture found through random search:
(facial_action_unit_input): Linear(in_features=455, out_features=64, bias=True)
(head_pose_input): Linear(in_features=702, out_features=32, bias=True)
(hidden): ModuleList(
(0): Linear(in_features=96, out_features=128, bias=True)
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1)
(2): LeakyReLU(negative_slope=0.01)
(3): Dropout(p=0.63, inplace=False)
(4): Linear(in_features=128, out_features=128, bias=True)
(5): BatchNorm1d(128, eps=1e-05, momentum=0.1)
(6): LeakyReLU(negative_slope=0.01)
(7): Dropout(p=0.63, inplace=False)
(8): Linear(in_features=128, out_features=64, bias=True)
(9): BatchNorm1d(64, eps=1e-05, momentum=0.1)
(10): LeakyReLU(negative_slope=0.01)
(11): Dropout(p=0.63, inplace=False)
(12): Linear(in_features=64, out_features=8, bias=True)
(13): BatchNorm1d(8, eps=1e-05, momentum=0.1)
(14): LeakyReLU(negative_slope=0.01)
(15): Dropout(p=0.63, inplace=False))
(out): Linear(in_features=8, out_features=3, bias=True)
(auxiliary_task): Linear(in_features=128, out_features=130, bias=True)
Bibliography
[1] Herve Abdi. The kendall rank correlation coefficient. Encyclopedia of
Measurement and Statistics. Sage, Thousand Oaks, CA, pages 508–510,
2007.
[2] Henny Admoni and Brian Scassellati. Social eye gaze in human-robot
interaction: a review. Journal of Human-Robot Interaction, 6(1):25–63,
2017.
[3] Riku Arakawa, Sosuke Kobayashi, Yuya Unno, Yuta Tsuboi, and Shin-
ichi Maeda. Dqn-tamer: Human-in-the-loop reinforcement learning with
intractable feedback. arXiv preprint arXiv:1810.11748, 2018.
[4] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning.
A survey of robot learning from demonstration. Robotics and autonomous
systems, 57(5):469–483, 2009.
[5] Tadas Baltrusaitis, Marwa Mahmoud, and Peter Robinson. Cross-dataset
learning and person-specific normalisation for automatic action unit de-
tection. In 2015 11th IEEE International Conference and Workshops
on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–6.
IEEE, 2015.
[6] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe
Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th
IEEE International Conference on Automatic Face & Gesture Recognition
(FG 2018), pages 59–66. IEEE, 2018.
[7] James Bergstra and Yoshua Bengio. Random search for hyper-parameter
optimization. The Journal of Machine Learning Research, 13(1):281–305,
2012.
[8] Sonia Chernova and Andrea L Thomaz. Robot learning from human
teachers. Synthesis Lectures on Artificial Intelligence and Machine Learn-
ing, 8(3):1–121, 2014.
[9] Carlos Crivelli and Alan J Fridlund. Facial displays are tools for social
influence. Trends in Cognitive Sciences, 22(5):388–399, 2018.
[10] Y Cui, Q Zhang, A Allievi, P Stone, S Niekum, and W Knox. The
empathic framework for task learning from implicit human feedback. In
Conference on Robot Learning, 2020.
[11] Yuchen Cui and Scott Niekum. Active reward learning from critiques.
In 2018 IEEE International Conference on Robotics and Automation
(ICRA), pages 6907–6914. IEEE, 2018.
[12] Matthew N Dailey, Carrie Joyce, Michael J Lyons, Miyuki Kamachi,
Hanae Ishi, Jiro Gyoba, and Garrison W Cottrell. Evidence and a computational
explanation of cultural differences in facial expression recognition.
Emotion, 10(6):874, 2010.
[13] Adrian K Davison, Walied Merghani, and Moi Hoon Yap. Objective
classes for micro-facial expression recognition. Journal of Imaging, 4(10):
119, 2018.
[14] Paul Ekman. Facial expressions. Handbook of cognition and emotion, 16
(301):e320, 1999.
[15] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a
survey. Pattern recognition, 36(1):259–275, 2003.
[16] Mohammed Ehsan Hoque, Daniel J McDuff, and Rosalind W Picard.
Exploring temporal patterns in classifying frustrated and delighted smiles.
IEEE Transactions on Affective Computing, 3(3):323–334, 2012.
[17] Charles Isbell, Christian R Shelton, Michael Kearns, Satinder Singh, and
Peter Stone. A social reinforcement learning agent. In Proceedings of
the fifth international conference on Autonomous agents, pages 377–384.
ACM, 2001.
[18] Rachael E Jack and Philippe G Schyns. The human face as a dynamic
tool for social communication. Current Biology, 25(14):R621–R634, 2015.
[19] Natasha Jaques, Jennifer McCleary, Jesse Engel, David Ha, Fred Bertsch,
Rosalind Picard, and Douglas Eck. Learning via social awareness: Improving
a deep generative sketching model with facial feedback. arXiv
preprint arXiv:1802.04877, 2018.
[20] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and
Geri Gay. Accurately interpreting clickthrough data as implicit feedback.
In ACM SIGIR Forum, volume 51, pages 4–11. ACM, New York, NY, USA,
2017.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] W Bradley Knox and Peter Stone. Interactively shaping agents via hu-
man reinforcement: The tamer framework. In Proceedings of the fifth
international conference on Knowledge capture, pages 9–16. ACM, 2009.
[23] W Bradley Knox, Peter Stone, and Cynthia Breazeal. Training a robot
via human feedback: A case study. In International Conference on Social
Robotics, pages 460–470. Springer, 2013.
[24] Oliver Kroemer, Scott Niekum, and George Konidaris. A review of robot
learning for manipulation: Challenges, representations, and algorithms.
arXiv preprint arXiv:1907.03146, 2019.
[25] Guangliang Li, Randy Gomez, Keisuke Nakamura, and Bo He. Human-
centered reinforcement learning: A survey. IEEE Transactions on
Human-Machine Systems, 2019.
[26] Guangliang Li, Hamdi Dibeklioglu, Shimon Whiteson, and Hayley Hung.
Facial feedback for reinforcement learning: a case study and offline anal-
ysis using the tamer framework. Autonomous Agents and Multi-Agent
Systems, 34(1):1–29, 2020.
[27] Shan Li and Weihong Deng. Deep facial expression recognition: A survey.
arXiv preprint arXiv:1804.08348, 2018.
[28] Xiaobai Li, Tomas Pfister, Xiaohua Huang, Guoying Zhao, and Matti
Pietikainen. A spontaneous micro-expression database: Inducement, col-
lection and baseline. In 2013 10th IEEE International Conference and
Workshops on Automatic face and gesture recognition (fg), pages 1–6.
IEEE, 2013.
[29] Jinying Lin, Zhen Ma, Randy Gomez, Keisuke Nakamura, Bo He, and
Guangliang Li. A review on interactive reinforcement learning from hu-
man social feedback. IEEE Access, 8:120757–120765, 2020.
[30] Robert Loftin, Bei Peng, James MacGlashan, Michael L Littman,
Matthew E Taylor, Jeff Huang, and David L Roberts. Learning some-
thing from nothing: Leveraging implicit human feedback strategies. In
The 23rd IEEE International Symposium on Robot and Human Interac-
tive Communication, pages 607–612. IEEE, 2014.
[31] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts,
Matthew E Taylor, and Michael L Littman. Interactive learning from
policy-dependent human feedback. arXiv preprint arXiv:1701.06049,
2017.
[32] Paula M Niedenthal, Martial Mermillod, Marcus Maringer, and Ursula
Hess. The simulation of smiles (sims) model: Embodied simulation and
the meaning of facial expression. Behavioral and brain sciences, 33(6):
417, 2010.
[33] Jaak Panksepp and Douglas Watt. What is basic about basic emotions?
lasting lessons from affective neuroscience. Emotion review, 3(4):387–396,
2011.
[34] Tomas Pfister, Xiaobai Li, Guoying Zhao, and Matti Pietikainen. Recog-
nising spontaneous facial micro-expressions. In 2011 international con-
ference on computer vision, pages 1449–1456. IEEE, 2011.
[35] Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi,
Jason P Carey, and Richard S Sutton. Online human training of a my-
oelectric prosthesis controller via actor-critic reinforcement learning. In
2011 IEEE International Conference on Rehabilitation Robotics, pages
1–7. IEEE, 2011.
[36] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia.
Active preference-based learning of reward functions. In Robotics: Science
and Systems, 2017.
[37] Halit Bener Suay and Sonia Chernova. Effect of human guidance and
state space size on interactive reinforcement learning. In 2011 Ro-Man,
pages 1–6. IEEE, 2011.
[38] Vivek Veeriah. Beyond clever hans: Learning from people without their
really trying. 2018.
[39] Vivek Veeriah, Patrick M Pilarski, and Richard S Sutton. Face valuing:
Training user interfaces with facial expressions and reinforcement learn-
ing. arXiv preprint arXiv:1606.02807, 2016.
[40] Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, and Peter Stone.
Deep tamer: Interactive agent shaping in high-dimensional state spaces.
In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[41] Eric W Weisstein. Bonferroni correction. https://mathworld.wolfram.com/,
2004.
[42] RF Woolson. Wilcoxon signed-rank test. Wiley encyclopedia of clinical
trials, pages 1–3, 2007.
[43] Duo Xu, Mohit Agarwal, Faramarz Fekri, and Raghupathy Sivakumar.
Playing games with implicit human feedback.
[44] Wen-Jing Yan, Qi Wu, Jing Liang, Yu-Hsin Chen, and Xiaolan Fu. How
fast are the leaked facial expressions: The duration of micro-expressions.
Journal of Nonverbal Behavior, 37(4):217–230, 2013.
[45] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe
Morency. Convolutional experts constrained local model for 3d facial
landmark detection. In Proceedings of the IEEE International Confer-
ence on Computer Vision Workshops, pages 2519–2528, 2017.
[46] Dean Zadok, Daniel McDuff, and Ashish Kapoor. Affect-based in-
trinsic rewards for learning general representations. arXiv preprint
arXiv:1912.00403, 2019.
[47] Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H Ballard, and Peter
Stone. Leveraging human guidance for deep reinforcement learning tasks.
arXiv preprint arXiv:1909.09906, 2019.