
Modelling Annotator Variability across Feature Spaces in the Temporal Analysis of Behaviour

Michael Camilleri

[University of Edinburgh crest: THE UNIVERSITY OF EDINBURGH]

Master of Science by Research
Institute for Adaptive and Neural Computation
School of Informatics
University of Edinburgh
2018

(Graduation date: 28 November 2018)

Abstract

Although behavioural phenotyping is an active area of research in the biological community, there is limited analysis in terms of temporal modelling of behavioural states. Moreover, while most of the data is obtained by human annotators indicating a behaviour from a predefined set of labels (which we call a schema), much of the research ignores the noise inherent in such data. Additionally, the schemas themselves sometimes change throughout the lifetime of the project.

In this thesis, we seek to address the problems of: (a) explicitly modelling uncertainty in annotator-labels, (b) learning in situations where portions of the data are represented in different labelling schemas (feature-spaces), and (c) constructing a temporal characterisation of multiple behavioural states and their dynamics. Our baseline model is a mixture-of-categoricals which models the inter-annotator variability. We then develop a novel way of unifying information across annotation schemas by means of a hierarchical model. This is then extended, first by incorporating position information and ultimately in the temporal domain through an adaptation of the Hidden Markov Model architecture. Our models are evaluated on a dataset provided by the MRC Harwell Institute, Oxfordshire, but should be applicable to general behaviour modelling dealing with human annotations. We further apply our models to inter-individual and day-night cycle characterisation of behaviour and show that our models are able to learn rich and plausible behavioural trends.


Acknowledgements

It is a must to show my heartfelt gratitude to those people without whom I would never have completed this work.

First off, a huge thanks goes to my supervisor, Prof. Chris Williams. It was his insight and knowledge that directed my efforts to fruitful outcomes. Above all, he was a solid reference point in my ups and downs on this journey, believing in me more than I did in myself.

Thanks goes also to our collaborators at the MRC Harwell Institute, Oxfordshire, who not only provided the data for this project but, especially Dr Sonia Rasneer Bains, put up with my lengthy emails with a divine patience. Equal gratitude is due to Prof. Chris Holmes, at the University of Oxford, who set up the contacts and provided yet another sounding board for my queries, and to Actual Analytics(TM), especially Dr Rowland Sillito, for their helpful comments in analysing the data from their rig.

A heartfelt thanks is due also to Dr. Timothy Hospedales, Dr. Adam Lopez (University of Edinburgh) and Prof. Adrian Muscat, who bore through my verbose reports and provided many a helpful remark (and maybe here I should spare a thought for my markers). Thanks in particular to Mrs Sally Galloway, who, despite her other commitments, went out of her way to help with any matter which arose, proofread much of my work and acted like a second mother to our cohort!

I must not forget my colleagues and office-mates, who put up with my constant state of panic and overly-energetic attitude. Equally, I must thank my parents who, while not academically minded, always lifted my spirits and kept me going when motivation was running low.

Finally, I acknowledge that this work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Michael Camilleri)


To God alone, Honour and Glory, forever and ever. Amen.


Table of Contents

Acronyms

1 Introduction
  1.1 Motivation
    1.1.1 The Case for Mouse Data
    1.1.2 Behaviour Modelling Goals
  1.2 Our Contributions
    1.2.1 Preliminary Analysis & Cleaning
    1.2.2 Developing the Models
    1.2.3 Evaluating the Models
  1.3 Report Organisation
  1.4 Mathematical Notation

2 Background
  2.1 Theoretical Foundations
    2.1.1 Preliminary Definitions
    2.1.2 Latent-Variables and Mixture Models
    2.1.3 Time Series
  2.2 Related Work
    2.2.1 Characterising Behaviour
    2.2.2 Dealing with Inter-Annotator Variability

3 Exploratory Data Analysis
  3.1 Data Summary
  3.2 Exploratory Analysis
    3.2.1 Annotator Data
    3.2.2 RFID
  3.3 Concluding Remarks

4 Methodologies
  4.1 Noisy-Annotator (NA) Model
    4.1.1 Probabilistic Description
    4.1.2 Estimating the Parameters
    4.1.3 Dealing with Missing Data
    4.1.4 Inferring Latent States
  4.2 Inter-Schema Annotator Consistency (ISAC) Model
    4.2.1 Missing Data
    4.2.2 Probabilistic Model
    4.2.3 Estimating the Parameters
  4.3 Positional Annotated Behaviour (PAB) Model
    4.3.1 Probabilistic Model
    4.3.2 Estimating the Parameters
  4.4 Temporally Annotated Positional (TAP) Model
    4.4.1 Probabilistic Model
    4.4.2 Estimating the Parameters
  4.5 Input/Output Temporally Annotated Positional (IOTAP)
    4.5.1 Motivation
    4.5.2 Probabilistic Definition
    4.5.3 Parameter Estimation

5 Results
  5.1 Test Setup
    5.1.1 Evaluation Metrics
    5.1.2 Cross-Validation Setup
  5.2 NA Model
    5.2.1 Test Setup
    5.2.2 Analysing Predictability
    5.2.3 Likelihood Evaluation
    5.2.4 Qualitative Analysis of the Results
  5.3 ISAC Model
    5.3.1 Test Setup
    5.3.2 Quantitative Analysis
    5.3.3 Qualitative Analysis
  5.4 PAB Model
    5.4.1 Test Setup
    5.4.2 Likelihood Evaluation
    5.4.3 Posterior Confidence Analysis
    5.4.4 Qualitative Analysis
    5.4.5 Concluding Remarks
  5.5 TAP Model
    5.5.1 Results on Synthetic Data
    5.5.2 Test Setup
    5.5.3 Quantitative Comparisons
    5.5.4 Qualitative Analysis
    5.5.5 TAP as a Generative Model
  5.6 IOTAP Extensions
    5.6.1 Individual Mouse Models (i-TAP)
    5.6.2 Light-Status Models (l-TAP)

6 Conclusion
  6.1 Our Contributions
    6.1.1 Summary of Work
    6.1.2 Key Outcomes
  6.2 Future Work
    6.2.1 Further Analysis
    6.2.2 Architectural Extensions

Glossary

Bibliography

A Data Summary and Extended Analysis
  A.1 Summary of the Available Data
    A.1.1 Data-Collection Setup
    A.1.2 Data Availability and Preprocessing
  A.2 Sanity Checks
    A.2.1 RFID Data
    A.2.2 Annotator Data
  A.3 Additional Exploratory Results

B Detailed Derivations
  B.1 NA Model
    B.1.1 E-Step
    B.1.2 M-Step
  B.2 ISAC Model
    B.2.1 E-Step
    B.2.2 M-Step
  B.3 PAB Model
    B.3.1 E-Step
    B.3.2 M-Step
  B.4 TAP Model
    B.4.1 E-Step
    B.4.2 M-Step
    B.4.3 Estimate
  B.5 IOTAP Models
    B.5.1 E-Step
    B.5.2 M-Step

C Supplementary Results
  C.1 Synthetic Data for the TAP Model
    C.1.1 Data Generation
    C.1.2 Metrics
    C.1.3 Analysing Stability
    C.1.4 Convergence Analysis
    C.1.5 Light-Status Simulations
  C.2 Additional Analysis on TAP Model
    C.2.1 Summary Statistics for generated model

D Extensions for Future Work
  D.1 Switching Temporally Annotated Positional (STAP) Model
  D.2 Coupled Temporally Annotated Positional (CTAP) Model
  D.3 Explicit Modelling of Position Dynamics

Acronyms

    i-TAP Individual Temporally Annotated Positional.

    l-TAP Light-sensitive Temporally Annotated Positional.

    ANOVA Analysis Of Variance.

    ARHMM Auto-Regressive Hidden Markov Model.

    BLO Best-Likelihood Offset.

    BRW Biased Random Walk.

    BST British Summer Time.

    CED Conditional-Entropy-Difference.

    CHMM Coupled Hidden Markov Model.

    CPT Conditional Probability Table.

    CRW Correlated Random Walk.

    CTAP Coupled Temporally Annotated Positional.

    d.o.f. degrees of freedom.

    d.p. decimal places.

    ED Entropy-Difference.

    EM Expectation-Maximisation.

    FSLDS Factorial Switching Linear-Dynamical-System.

    HCA Home-Cage Analysis.

    HDF5 Hierarchical Data Format (version 5).

    HGP Human Genome Project.


HMM Hidden Markov Model.

    HSMM Hidden Semi-Markov Model.

    IID independent and identically distributed.

    IMKC International Mouse Knockout Consortium.

    IMPC International Mouse Phenotype Consortium.

    IOHMM Input-Output Hidden Markov Model.

    IOTAP Input/Output Temporally Annotated Positional.

    IR Infra-Red.

    ISAC Inter-Schema Annotator Consistency.

    LDS Linear Dynamical System.

    MA(q) Moving-Average.

    MAD Mean-Absolute-Deviation.

    MAP Maximum-A-Posteriori.

    MC Markov Chain.

    MLE Maximum-Likelihood Estimate.

    MRP Markov-Renewal Process.

    ms milliseconds.

    MV Majority-Vote.

    NA Noisy-Annotator.

    NIS Not-In-Schema.

    NLL Negative-Log-Likelihood.

    PAB Positional Annotated Behaviour.

    PMF Probability Mass Function.

    RFID Radio-Frequency IDentification.

    RV Random Variable.

    SLDS Switching Linear-Dynamical-System.


STAP Switching Temporally Annotated Positional.

    TAP Temporally Annotated Positional.

    XML Extensible Markup Language.


Chapter 1

    Introduction

The analysis of animal (and human) behaviour has found applications in many fields including conservation, demographics, and medicine. Characterising behaviour is closely related to the domain of time-series modelling, which is itself a mature field. However, most behavioural models in the literature either (a) deal with absolute positions only and infer a few coarse high-level behavioural states (e.g. [26, 28, 33, 39, 43]), or (b) treat fine-grained activities in terms of frequencies of manifestation with little to no temporal modelling ([54, 71]). At the same time, the analysis has to contend with noisy observations, such as errors in human annotations, which are often ignored in the interest of simplicity [26, 33, 43].

This report details our effort to characterise the behaviour of mice based on human-annotated labels and position information. We develop methods to deal with unreliable annotations pertaining to different feature spaces (i.e. when the label-set changes during the term of data collection), and we do so in a probabilistic temporal setting, drawing on ideas from the time-series and crowd-sourcing literature. The characterisation is aimed towards identifying behavioural trends and how these may differ between genetic strains1: however, while our data is annotated mouse behaviours kindly provided by the MRC Harwell Institute, Oxfordshire2, the methods and models derived herein are applicable to the wider behavioural modelling field where there is uncertainty in the observations, and in particular when the data pertains to overlapping but different feature spaces.

    1.1 Motivation

While characterising animal behaviour has already lent itself to many aspects of medicine [2, 32, 38, 46], conservation [39, 43], demographics [33, 48] and psychology [46], the sequencing of the human genome by the International Human Genome Sequencing Consortium [25] at the turn of the 21st century provided yet another avenue for such research. More famously known as the Human Genome Project (HGP), this achievement opened wide the field of genetics, with far-reaching impacts on the methodologies for medical diagnosis, disease-prevention (and treatment), drug-discovery, organ-transplants and, especially for the focus of this project, social and behavioural analysis [69]. In particular, the nearly-complete gene database prompted a number of questions as to what happens to the individual when a particular gene, or parts of it, are missing or changed [25].

1 Note that our work focuses on the methodology of achieving this, rather than the actual analysis.
2 https://www.har.mrc.ac.uk/.

    1.1.1 The Case for Mouse Data

Answering such questions requires moving beyond 'reading' DNA and actually modifying it, through Genetic Engineering [9]. While the tools exist (such as CRISPR [78]), they bring with them a number of ethical considerations, not the least of which is that it is considered immoral (and impractical) to test gene modification directly on humans. To this end, biologists turn to the mouse as a stand-in. This is because the mouse shares almost 99% of its genes with humans in terms of functionality [71], is small enough to be able to deal with in large numbers, is very cost-effective to maintain, breeds rapidly and in large quantities, and is also similar to humans in anatomy and physiology.

Our project is related to behavioural phenotyping3, which involves monitoring live mice in well-controlled environments, and analysing the different traits between control mice (also known as Wild-types, i.e. strains which appear naturally 'in the wild') and mutant strains, typically referred to as Knockouts [29]. The latter refers to the fact that a particular gene has been suppressed, i.e. 'Knocked-out' (or sometimes replaced, which is referred to as 'Knock-in'), in the mutant strain, and the monitoring is used to identify what phenotypes (if any) are affected.

The HGP itself included the mouse as one of its model organisms [14]; indeed, its genome was completely sequenced less than two years after the human one. In 2006, the International Mouse Knockout Consortium (IMKC) [62] continued on the momentum by establishing a "high-throughput gene-targeting pipeline", through the development of an embryonic stem cell resource. Following in the footsteps of the HGP, it sought to provide a publicly available library of tools for biologists worldwide to use. Its success prompted the setting up of the International Mouse Phenotype Consortium (IMPC) [10]. Running from 2011 through 2021, the aim of this project is to analyse some 20,000 knockouts in the mouse genome and provide this data to researchers worldwide.

3 A phenotype is an observable difference in physical or behavioural aspects of an individual following a change in a gene.


    1.1.2 Behaviour Modelling Goals

The IMPC itself comprises a conglomeration of mouse genetics centres from around the globe, including the MRC Harwell Institute, Oxfordshire, with which this project is affiliated. Each centre contributes towards the phenotyping process, pooling all data in a central repository, while at the same time looking towards effective utilisation of the data.

The biologists at MRC Harwell are especially interested in behaviour modelling. Through the NC3R4 CRACK-IT5 initiative, they are actively monitoring wild-type and knockout strains through a custom Home-Cage Analysis (HCA) system developed by Actual Analytics(TM)6. This includes annotating mouse behaviour, which will be the main data modality used in this project. In correspondence with MRC Harwell, it emerged that the researchers are interested in the following tasks:

1. quantifying, with a view to modelling, inter-annotator variability and incorporating data from disparate sources into one model,

2. richer temporal modelling (as opposed to summary statistics) of the mouse behaviour, which in turn allows for,

    3. identifying behavioural differences between individuals,

    4. characterising different behavioural ‘regimes’ throughout the circadian cycle of the mice,

    5. prediction of the onset of certain states (such as aggression) given past actions, and,

    6. identifying social hierarchies and behaviour drivers,

with the end-goal of distinguishing phenotypes between strains. This project focuses on tasks 1 and 2, which we apply in a limited analysis of 3 and 4.

    1.2 Our Contributions

In achieving the above goals, we seek to address three main challenges in behaviour modelling: (a) explicitly modelling uncertainty/noise in human-annotated behaviour, through probabilistic mixture modelling, (b) building models which can incorporate data collected under different labelling schemas (feature-spaces), and (c) providing rich temporal modelling of multiple behavioural states and their dynamics. We achieve these by developing a time-series representation of behaviour, with explicit modelling of the noise and biases inherent in human labelling through hierarchical-temporal mixture modelling, which we train in an unsupervised manner.

4 https://www.nc3rs.org.uk/.
5 https://crackit.org.uk/.
6 http://www.actualanalytics.com/.


The novelty of our work is not only in applying the stated techniques to the problem of phenotyping, but also in addressing all three problems in a single architecture. Furthermore, to the best of our knowledge, the proposed method for unifying information across schemas (feature-spaces) has not been addressed before. In addition, we make available a set of software tools for dealing with the data: the code is available on request from the author7.

    1.2.1 Preliminary Analysis & Cleaning

Our data consists of behaviour annotations (according to a number of different labelling schemas), mouse positions and light-status. As with all projects dealing with real-world data, a significant effort was undertaken in data cleaning, consolidating the various formats and modalities, identifying the potentials and shortcomings of the material and, in the end, narrowing down the potential avenues for our models.

    1.2.2 Developing the Models

Within this project, we investigated a number of different models for characterising behaviour, as well as dealing with some of the nuances of the data. In keeping with the temporal-modelling literature, we logically group these into emission and transition models.

We focus first on an independent and identically distributed (IID) model of our data, relating a latent behavioural state to observed annotations (and positions). These are then extended in the temporal domain by introducing dependencies between successive states. Given that we do not have any ground-truth, all our models are trained in an unsupervised manner using the Expectation-Maximisation (EM) algorithm.

    Emission Models

1. Our simplest model deals solely with inter-annotator variability on a per-schema basis. We model the labels as a latent Mixture of Multivariate Categorical distributions, uniquely specifying the latent-state permutation using the assumption of broad annotator agreement (NA Model §4.1).

2. We extend the baseline model by incorporating labelling responsibility per experiment, differentiating between informative/uninformative unlabelling, and explicitly modelling inter-schema translation through an added emission layer, allowing for an inter-schema model to be learnt (ISAC Model §4.2).

    3. Finally, we also incorporate position information (PAB Model §4.3).
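Although the precise parameterisation is only given in Chapter 4, the generative story behind a mixture of multivariate categoricals over annotator labels can be sketched roughly as follows. All names and dimensions here are illustrative assumptions, not the thesis's actual choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: Z latent behavioural states, K annotators,
# L labels in a (single) schema. These are NOT the thesis's values.
Z, K, L = 4, 3, 5

pi = rng.dirichlet(np.ones(Z))                 # P(z): prior over latent behaviours
psi = rng.dirichlet(np.ones(L), size=(K, Z))   # psi[k, z, l] = P(annotator k emits l | state z)

def sample_annotations(n):
    """Sample latent states, then each annotator's label from its own CPT."""
    z = rng.choice(Z, size=n, p=pi)
    labels = np.array([[rng.choice(L, p=psi[k, zn]) for k in range(K)] for zn in z])
    return z, labels                           # labels has shape (n, K)

z, labels = sample_annotations(10)
```

Each annotator gets its own conditional distribution, which is what lets such a model capture per-annotator biases rather than treating disagreements as symmetric noise.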

    7email : [email protected]


    Temporal Model

4. We add a temporal dependence between successive latent behavioural states, based on a Hidden Markov Model (HMM) architecture (TAP Model §4.4).

5. We extend our model to deal with different individuals (while sharing emission probabilities) and lights-on/off cycle behaviour through an Input-Output Hidden Markov Model (IOHMM) type of architecture (IOTAP Extensions §4.5).

    1.2.3 Evaluating the Models

    We evaluate our models using the following criteria:

• Given that we do not have access to ground-truths for the state of the mice, the natural measure is to compute the (log-)likelihood given a set of parameters. Since the aim is to generalise as much as possible, we employ the standard practice of dividing our dataset into 'training' and 'validation/testing' portions, and report the likelihood on the latter.

• In specific cases, we introduce more relevant measures, such as entropy-based probability comparators.

• For the temporal models, we run simulations on synthetic data (thus controlling the generating process) that yield expectations of performance.

• Where applicable, we draw comparisons with established techniques, such as how our approaches compare to Majority-Vote (MV), and seek expert feedback from our collaborators at MRC Harwell.

• We also analyse, in a limited setting, the applicability of our models to two of the end-goals sought at MRC Harwell, namely identifying different individuals and behavioural regimes.
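As a rough illustration of the held-out likelihood criterion in the first bullet, the mean negative log-likelihood of validation data under a mixture of categoricals could be computed as below. This is a sketch under our own naming and array conventions, not the thesis's evaluation code:

```python
import numpy as np

def heldout_nll(X, pi, psi):
    """Mean negative log-likelihood of held-out annotations X, where
    P(x) = sum_z pi[z] * prod_k psi[k, z, x_k] (mixture of categoricals).
    X: (N, K) integer labels; pi: (Z,); psi: (K, Z, L)."""
    K = X.shape[1]
    # log P(x^(n), z) for every sample n and state z, summed in log-space
    log_joint = np.log(pi)[None, :] + sum(np.log(psi[k][:, X[:, k]]).T for k in range(K))
    # marginalise z stably, then average the per-sample log-likelihoods
    return float(-np.logaddexp.reduce(log_joint, axis=1).mean())

# Made-up parameters and 'held-out' labels, purely for illustration:
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(4))                 # 4 latent states
psi = rng.dirichlet(np.ones(5), size=(2, 4))   # 2 annotators, 5 labels
X_test = rng.integers(0, 5, size=(20, 2))      # 20 held-out samples
nll = heldout_nll(X_test, pi, psi)
```

In a train/validation split, one would fit `pi` and `psi` on the training portion and report `heldout_nll` on the remainder; lower is better.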

    1.3 Report Organisation

This dissertation is organised as follows. Chapter 2 introduces the reader to the core theoretical concepts used within this project, and briefly reviews the relevant literature. Chapter 3 presents an in-depth exploratory analysis of the data, followed by a description of the models and methodologies employed in chapter 4. The results achieved are reported in chapter 5, and we end this work with a discussion of the main findings and potential avenues for future work (chapter 6).

In the interest of brevity, we also relegate some material to the Appendices. Appendix A contains an in-depth description of the data, and additional analysis to Chapter 3. Detailed derivations for our models are provided in Appendix B, with Appendix C acting as companion to Chapter 5 in terms of documenting additional results. We also give further details on future extensions to our models in Appendix D.

    1.4 Mathematical Notation

    Unless otherwise noted, we follow the mathematical notations presented below:

• We often denote the size (dimensionality) of a vector/matrix/tensor by |X|: this should not be confused with the norm, where we use ||X||.

• Random Variables (RVs) (and deterministic observations) are represented by upper-case Latin characters. Indexing of the one-hot representation of categorical variables (i.e. potential manifestations) is done by the lower-case character of the same letter. We also use the upper-case variant, without indexing, to denote an observation in our data-set. To clarify,

  – X represents either the general variable, or the observation as it appears in the data, with the meaning inferred from context, and

  – X_x represents a specific (possible) manifestation (X_x is either 0 or 1).

  In addition, we sometimes refer to all random variables (excluding deterministic ones) collectively by D, and the observable ones (including deterministic) by O.

• Model parameters are often denoted by Greek letters. Again, we use the upper-case variant to denote the Probability Mass Function (PMF), and lower-case characters for specific values/dimensions (namely the individual event probabilities in the categorical distribution). To clarify, if our distribution over X ∈ {0,1}^{|X|} is captured by Ω, then we define:

  P(X_x = 1) ≡ ω_x

• Indexing of RVs/PMFs is always denoted by subscripts, with the dependent variable being the first index: the only exception to this is when we have multiples of the same variable (such as multiple annotators), in which case the first index is the identity of the variable:

  – if X is conditioned on Z, then ω_{x,z} ≡ P(X_x = 1 | Z_z = 1), while,

  – if there are multiple variables X indexed by j, then ω_{j,x,z} ≡ P(X_{j,x} = 1 | Z_z = 1).

• We use bracketed superscripts to indicate sample/time indexing, so as to avoid confusion with exponentiation, e.g. X^{(n)}. We also use a colon to denote a range: i.e. X^{(a:b)} includes all samples from a to b (inclusive).

• We typically reserve ⟨angle-brackets⟩ for expectations.
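To make the one-hot and indexing conventions above concrete, a small numerical illustration (the values are made up purely for this example):

```python
import numpy as np

# A categorical variable X with |X| = 3 possible manifestations, written
# one-hot: exactly one component X_x equals 1.
X = np.array([0, 1, 0])                      # the manifestation with x = 1 (0-indexed)

# A conditional PMF with omega[x, z] = P(X_x = 1 | Z_z = 1); per the
# convention above, the dependent variable x is the FIRST index.
omega = np.array([[0.7, 0.1],
                  [0.2, 0.3],
                  [0.1, 0.6]])
col_sums = omega.sum(axis=0)                 # each column is a distribution over x

# With one-hot X, P(X | Z_0 = 1) reduces to the inner product omega[:, 0] . X,
# which simply selects the entry omega[1, 0]:
p = float(omega[:, 0] @ X)
```

The inner-product trick is why one-hot encodings make the later likelihood expressions compact: products over manifestations collapse to simple table look-ups.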

Chapter 2

    Background

This chapter will introduce the theoretical foundations for the project, and also set the scene with an overview of related literature. We first discuss Latent-Variable Mixture Models and Discrete-Space Time Series, and then move on to the recent behavioural analysis literature.

    2.1 Theoretical Foundations

This project deals with a number of concepts in parameter estimation with incomplete data and time-series. We thus introduce the methodological preliminaries which will be referenced later on in the project: readers familiar with the concepts may skip to section 2.2.

    2.1.1 Preliminary Definitions

We consider a set of samples D indexed by some integer n. When dealing with time-series, we index D by the integer t to make the distinction explicit1. Each sample d^{(n)} in the Data can consist of multiple variables, each of which can be continuous or discrete (we deal exclusively with the latter), and is, in general, multi-dimensional. For the most part we assume that the samples are IID: i.e. there is no dependency between samples and no order in the sampling. This means that the joint distribution over the data is the product of the individual densities:

  P(D) = \prod_{n=1}^{N} P(d^{(n)}).    (2.1)

    However, when dealing with time-series, we characterise it as a Stochastic Process, which inthe most general form can be represented by Eq. 2.2:

    P(D) = P(d^{(1)}) ∏_{t=2}^{T} P(d^{(t)} | d^{(1:t−1)}).    (2.2)

    ¹There are also methods for dealing with continuous time, but these are outside the scope of this work.



    2.1.2 Latent-Variables and Mixture Models

    If the sample d^{(n)} is not completely observable — d^{(n)} = {z^{(n)}, x^{(n)}} with z^{(n)} hidden — then the model is a Latent-Variable Model [44, p. 339]. A key architecture that we repeatedly turn to is the mixture-model, which consists of a single latent categorical variable Z and a set of observed variables X (which can be of any form, including in continuous space), governed by conditional emission probabilities. In general, such models (in the IID case) can be represented as:

    P(D) = ∏_{n=1}^{N} P(z^{(n)}) P(x^{(n)} | z^{(n)}).    (2.3)

    Latent-Variable models can represent complex multi-variate distributions with a reduced parameter space, but more importantly, they allow us to reason about causes in the data. The downside is that parameter estimation is significantly more difficult [6]. Assuming we are using the Maximum-Likelihood Estimate (MLE), we seek to maximise the Observed Data Likelihood (where we have parametrised the model by Θ):

    Θ_MLE = argmax_Θ p(X | Θ) = argmax_Θ ∏_{n=1}^{N} Σ_{z^{(n)}} p(x^{(n)}, z^{(n)} | Θ)    (2.4)

    Since only X is observed, Z must be summed out: however, this is what generally makes direct optimisation intractable. Instead, assuming that estimation of the Complete Data (log-)likelihood is easier, we can take the expectation of the latent-states given the observations and maximise the Expected Complete Data (Log-)Likelihood, which, as proven in [44], lower-bounds the Observed Data (Log-)Likelihood. This technique is referred to as the Expectation-Maximisation (EM) algorithm [40], and consists of iterating between computing the Expectation of the latent-states conditioned on the observations (and the current parameter estimates), and subsequently Maximising the expected complete log-likelihood w.r.t. the parameters. Further details are provided when we apply this to our models in Chapter 4.
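To make the E- and M-steps concrete, the following is a minimal sketch of EM for a mixture of categoricals, the architecture on which our baseline model builds. The function name and parameterisation (`pi` for mixing weights, `omega` for emission probabilities) are our own, and the sketch assumes a single discrete observation per sample:

```python
import numpy as np

def em_mixture_of_categoricals(X, K, n_iter=50, seed=0):
    """Fit a K-component mixture of categoricals to integer labels X (N,)
    via EM. Returns the mixing weights pi (K,) and emissions omega (K, V)."""
    rng = np.random.default_rng(seed)
    N, V = len(X), int(X.max()) + 1
    pi = np.full(K, 1.0 / K)                       # uniform mixing weights
    omega = rng.dirichlet(np.ones(V), size=K)      # random emission init
    one_hot = np.eye(V)[X]                         # (N, V) indicator matrix
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to pi[k] * omega[k, x_n]
        r = (one_hot @ omega.T) * pi               # (N, K), unnormalised
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the expected counts
        Nk = r.sum(axis=0)                         # expected component sizes
        pi = Nk / N
        omega = (r.T @ one_hot) / Nk[:, None]      # rows sum to 1
    return pi, omega
```

The E-step computes responsibilities from the current parameters; the M-step re-estimates the parameters from the resulting expected counts, and each iteration is guaranteed not to decrease the observed-data likelihood.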

    2.1.3 Time Series

    The definition of a time-series in Eq. (2.2) corresponds to a fully connected graphical model and becomes intractable as the length of the series grows. We get around this by enforcing certain assumptions/structures on the model, which we will address in turn. We explore both fully-observable (X only) and latent-variable models, but do so only in the discrete domain: the reader is referred to [44] for an in-depth treatment of continuous-state-space models and related techniques.

    Markovian Models

    A simple discrete-space time-series is one which exhibits the Markov Property, and hence is often referred to as a Markov Chain (MC) [44, p. 589]. This property states that, for a particular length (of history) τ,

    P(x^{(t+1)} | x^{(1:t)}) = P(x^{(t+1)} | x^{(t−τ+1:t)}):    (2.5)

    i.e. the future x^{(t+1)} is independent of the past x^{(1:t−τ)} given the present x^{(t−τ+1:t)}. The length τ defines the order of the model. Fig. 2.1 represents a first-order MC. The model is fully specified using the initial distribution π = P(x^{(1)}) and the transition probabilities P(x^{(t+1)} | x^{(t)}). The latter is usually represented as a matrix, where each row sums to 1.

    [Chain: . . . → x^{(t−1)} → x^{(t)} → x^{(t+1)} → . . .]

    Figure 2.1: A Markov Chain as a Probabilistic Graphical Representation.

    Latent Processes

    The real world is noisy, and the Markov assumption is often too naïve to fully represent the data. The natural extension to the MC is to carry out the state transitions in a latent space and only observe ‘noisy’ versions of them. This is the basis of the Hidden Markov Model (HMM) [44, p. 603], which is graphically represented in Fig. 2.2 and mathematically by Eq. (2.6):

    P(D) = p(z^{(1)}) p(x^{(1)} | z^{(1)}) ∏_{t=2}^{T} p(z^{(t)} | z^{(t−1)}) p(x^{(t)} | z^{(t)}).    (2.6)

    The observed variable x may be continuous, although z is discrete. In effect, a HMM may be regarded as a time-dependent mixture model. This again implies non-trivial parameter estimation, typically through the Baum-Welch algorithm [44, p. 618], which uses the Forwards-Backwards algorithm to speed up the E-Step responsibility computation.

    [Latent chain: . . . → z^{(t−1)} → z^{(t)} → z^{(t+1)} → . . ., emitting x^{(t−1)}, x^{(t)}, x^{(t+1)} respectively]

    Figure 2.2: Graphical Representation of a (first-order) HMM.

    2.2 Related Work

    In this section we will describe some of the techniques employed in behaviour modelling. This is a voluminous field and a complete review is beyond the scope of our work: however, we discuss some of the challenges which crop up and how these may be solved, drawing also from methods applied to other mature fields in general temporal modelling. We also address the crowd-sourcing literature, which is relevant due to the human-annotated behaviour labels which form the main modality of our dataset (see Chapter 3).

    2.2.1 Characterising Behaviour

    Despite the vast literature on behavioural characterisation, the overlap with the temporal modelling field is a recent development [54]. We thus begin our discussion by describing traditional statistical metrics and their shortcomings, motivating the need for explicit consideration of time. We then address the challenges which behavioural modelling brings to the table, and suggest models from the wider temporal modelling community which can be applied in such cases.

    Summary Statistics (and some Extensions)

    Traditional behavioural models, especially as applied to detecting phenotypes, typically revolved around observing discrete behaviours in some (fixed) time-interval and counting histograms of the behaviours [54]. These would in turn be compared using a reference range [16] or statistical tests such as the Student's t-test or two-way Analysis Of Variance (ANOVA) [41, 57, 67]. However, information about significant phenotypic differences may be encoded in the transitions rather than the time spent in any activity itself [46, 51, 54, 70].

    A first step in solving this problem is to analyse patterns over sequences of activities. Magnusson [37] proposes an algorithm which searches over pairs of activities to extract patterns in children's social interactions, aiming to automatically detect structure/repetition in the presence of noise. The method operates in a bottom-up hierarchical approach, building ever-more-complex descriptors of the behaviour. The work was picked up by Casarrubea et al. [13], who applied the techniques towards studying anxiety in rats by tracking social interactions.

    An alternative (and complementary) approach is presented by Paraschiv-Ionescu et al. [46]. The authors use complexity-based descriptors over sequences of activity patterns in individuals with/without chronic pain. Their hypothesis is that the descriptors (as modelled by information entropy, sample entropy and Lempel-Ziv complexity [36]) provide sufficient covariates for classification, since the transition dynamics change under pain.

    The Temporal Dimension

    The above schemes, while a step towards capturing temporal patterns, may not account for all dynamics in behaviour [12]. They also lack generative capabilities and are limited to purely descriptive tasks. Time-series models of the form described in Section 2.1 bridge this gap by explicitly modelling conditional probabilities through time.

    HMMs (in their various forms) have been used, for example, in determining feeding behaviour of caterpillars [79], classifying movement patterns in southern bluefin tuna [47], distinguishing multiple levels of diving behaviour in short-finned pilot whales [51], and, closer to our application, differentiating between maternal behaviours across different mice strains [12]. However, literature in this area (as regards high-level behaviour modelling with more than a couple of states) appears somewhat sparse, as does work on coupled behaviour by individuals in confined environments [59].

    Instead, the majority of work focuses explicitly on fine-grained motion behaviour [1, 26, 39]. Characterised by the need for continuous-state-space models of locomotion, the method of choice is typically the Random Walk [35]: this is often extended by enforcing a preference on direction, either globally (Biased Random Walk (BRW) [45]) or temporally (Correlated Random Walk (CRW) [15]). Note that often these methods violate the Markovian assumption on location, but if reformulated (such as by augmenting the state with velocity [26]), the standard methods can be applied.

    Despite this, more involved models are needed to fit realistic behaviour. We explore some of the shortcomings of vanilla HMMs and random-walk models, and propose extensions from the temporal modelling literature.

    Better Temporal Modelling

    Perhaps the most glaring limitation is that such models fail to capture long-term changes in behavioural parameters. A direct approach is to add another layer of latent variables, which in the motion models leads to a Mixture of Random-Walks. This is employed by Jonsen et al. [28] for modelling seal pathways, by conditioning the parameters of a CRW on a binary variable. A similar model is employed by Morales et al. [43] for tracking elk, and by Patterson et al. in identifying resident and migratory behaviours in southern bluefin tuna [47].

    The key idea relates to causality: i.e. identifying the generative process which gives rise to the data, and hence allowing one to reason about the covariates governing the temporal process. The significance of this is that often it is these latent states which are of interest to scientists [79], and, as Morales and Ellner [42] show in their study of beetle movement across varied terrain, they can have a significant effect on observed behaviour. For example, Zucchini et al. [79] use a two-hidden-layer model to capture caterpillar feeding behaviour. The authors incorporate feedback from one of the layers to the other, explicitly encoding the influence that behaviour has on higher-level “motivational states”.

    In the more general case lies the domain of Switching Linear-Dynamical-Systems (SLDSs) [60] and their extensions. In particular, causal modelling allows expert knowledge to be used to simplify the architecture and maintain tractability. This is the motivation behind the family of Factorial Switching Linear-Dynamical-Systems (FSLDS) as presented in [76] (with the HMM counterpart explored in [22]). Quinn et al. [52] utilise this architecture to monitor the physiology of premature babies in intensive care. In addition, the authors incorporate an ‘X-Factor’ to account for any unmodelled dynamics. Further extensions to the SLDS model include Hierarchical [65] and purely Discriminative [21] variants.


    A related problem concerns the state dwell-time, which in HMMs is necessarily geometrically distributed [23]. A number of solutions have been proposed to rectify this. Causal models of the form of [79] get around this due to the non-Markovian nature of the state process. Indeed, this is reminiscent of the Auto-Regressive Hidden Markov Model (ARHMM) architectures of [77] and [66], which can be used to bias towards ‘smoother’ observations. This is achieved by conditioning the visible variables (and not just the latent ones) in time. A more direct approach is to explicitly encode dwell-time through a Hidden Semi-Markov Model (HSMM) [3], where the next state is conditioned not only on the previous state, but also on the time spent in that state. The downside is that the non-Markovian nature of HSMMs makes parameter estimation and inference more difficult. Guédon [23] gets around this through a state-duration counter which is initialised stochastically when a new latent-state is sampled and then counts down deterministically to 0, at which point the latent state is again allowed to change. Johnson [27] describes another approximation where each latent state is represented by a number of aggregate sub-states, each with the same emission probabilities as the original one: the desired dwell-time is approximated by tuning the transition probabilities within these sub-states. This is employed successfully by Langrock et al. to track bison movements [33].
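A rough sketch of the aggregate sub-state construction described by Johnson [27] follows. It assumes A is a semi-Markov transition matrix with zero diagonal, and the chained structure with a single shared `p_stay` parameter is an illustrative simplification of our own (the general construction tunes the within-chain probabilities individually):

```python
import numpy as np

def expand_substates(A, m, p_stay):
    """Aggregate sub-state expansion: each of the K latent states is replaced
    by a chain of m sub-states sharing the original state's emissions. A
    sub-state self-loops with probability p_stay and otherwise advances along
    the chain; leaving the final sub-state follows the original (zero-diagonal)
    transition matrix A (K, K). Dwell-times thus become sums of geometrics
    (negative-binomial-like) rather than geometric. Returns the expanded
    (K*m, K*m) transition matrix."""
    A = np.asarray(A, dtype=float)
    K = A.shape[0]
    A_exp = np.zeros((K * m, K * m))
    for k in range(K):
        for j in range(m):
            s = k * m + j                        # flat index of sub-state j of k
            A_exp[s, s] = p_stay
            if j < m - 1:
                A_exp[s, s + 1] = 1.0 - p_stay   # advance within the chain
            else:
                for k2 in range(K):              # exit to the first sub-state
                    if k2 != k:                  # of another macro-state
                        A_exp[s, k2 * m] = (1.0 - p_stay) * A[k, k2]
    return A_exp
```

Standard HMM machinery can then be run on the expanded matrix unchanged, which is precisely what makes the approximation attractive.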

    Handling multiple Individuals

    The work of [1, 33, 79] and [42] points towards the importance of considering the heterogeneity between individuals, which may be even more challenging than other covariates [42]. Between the two extremes of global (same parameters for all individuals) and individual models, Langrock et al. [33] investigate the use of a hierarchical model, where individual parameters are drawn from a common distribution. This reduces the parameter space (as opposed to the individual model) but maintains some flexibility in modelling distinct individual behaviour, and in fact proved superior in their modelling experiments (a situation mirrored also in [79]).

    The presence of multiple individuals gives rise to another challenge which has not been extensively explored: specifically, the effect that individuals have on each other's behaviour. This is especially significant in social animals [56], where ignoring such effects leads to incorrect modelling [55]. In his work with neonatal rats [58], Schank used agent-based modelling techniques drawn from the robotics literature, with knowledge of other pups factored in and trained using genetic algorithms. The improved aggregation models brought about by this coupling consideration allowed the author to identify significant differences in behaviour between 7- and 10-day-old pups. [72] achieves a globally-consistent model by restricting the mean movement pattern of an individual to be similar to the average directions of all nearby animals in the herd. This is expanded upon by Langrock et al. [34] by way of a hierarchical model, tracking both the group-centre and individual-specific models conditioned on this. A more realistic but also more involved model concerns the use of the Coupled Hidden Markov Model (CHMM) [8], as employed by Russel et al. in their model of Carpenter Ant motion [56], where the individual insect's transition is conditioned on the state of all other neighbours.


    2.2.2 Dealing with Inter-Annotator Variability

    As will become clearer in Chapter 3, a challenge which we had to contend with is that of inter-annotator agreement. We briefly review some potential solutions found in the literature, focusing on probabilistic (latent-variable) models — these are often superior to simple majority votes since they account for the reliability of each annotator, and can even deal with adversarial annotations [75].

    One of the earliest models for probabilistically quantifying the reliability of annotators is presented by Dawid and Skene [18] for consolidating clinical diagnoses. They formulate the problem as a mixture model with (multiple) multinomial observations conditioned on a latent ground truth. The authors derive the maximum likelihood solution, both under the condition of known ground truth and not, the latter making use of the EM algorithm [19]. The procedure is echoed in [63], this time for images.

    The above formulation models individual annotators' reliability through the conditional probabilities. Whitehill et al. [75] extend the model by also modelling the difficulty of individual samples, conditioning the probabilities on the reliability of each annotator and the difficulty of each image (sample). Wauthier and Jordan [74] address a similar scenario through a causal model which captures the sources of bias through shared random effects, and also enforces a natural hierarchy between labellers and gold-standards. In addition, they employ active-learning by treating the entire data-collection process as one task, to improve the model with more informative labelling. Plangprasopchok and Lerman [49] focus instead on different annotators selecting tags from their own unconstrained vocabulary. In an effort to maintain as much individuality as possible (and thus avoid dimensionality reduction techniques), they propose a causal model incorporating user interest and topic variables. The Bayesian approach is used to avoid specifying the number of topics/interests a priori.

  • Chapter 3

    Exploratory Data Analysis

    Analysing and cleaning uncurated (raw) lab data such as ours necessitated considerable effort, especially given the lack of a readily-available software infrastructure, which we had to develop¹. In the interest of brevity, this chapter only briefly describes the sources and modalities of the data, focusing instead on the findings from the Exploratory Analysis², which are fundamental in understanding our choices for the models described in the next chapter (4). The reader is directed to Appendix A for a more detailed description of the data.

    3.1 Data Summary

    Our dataset was provided by the MRC Harwell Institute, Oxfordshire, and consists of video, location and annotator labels about the behaviour of mice housed in cages of three. The 27.5 hours of available data is divided into 30-minute segments and relates to multiple individuals, all from wild-type strains: the full breakdown appears in Table A.1 (Appendix A). The mice are kept on a 12-hour light-cycle, with lights-on at 07:00 and lights-off at 19:00.

    Position data is in the form of a 3×6 coordinate system, picked up through Radio-Frequency IDentification (RFID) antennas, with positions numbered 1 through 18. The behaviour of each of the three mice is labelled by two to three annotators from a pool of eleven individuals, with responsibility for annotating changing between segments. Labelling involves specifying the start/end-times of exhibited behaviours. The data contains periods for which no label is given, which we refer to as missing data. The annotations follow one of four schemas (denoted I, II, III and V³), containing the labels shown in Fig. 3.1. Both position and behaviour are reported at a frequency of 25Hz (synchronised with the video-frame rate).

    ¹All code can be made available by request to the author.
    ²This analysis is an extension of that carried out as part of the IRDS Mini-Project [11]: however, the former study was on a much smaller portion of the data (eight segments), was much less extensive/structured than this, and served mainly to make an informed selection of which data to work with in the final project.
    ³Schema IV was used in [11] but does not appear in the segments provided for this project.



    [Figure 3.1 content, reconstructed:]
    Schema I (13 segments): groom, soccg, sniffing, microm, feed, fh, drink, rest, loco, climb, rear, aggr, fight
    Schema II (15 segments): groom, allogroom, drink, climb, aggression, fighting
    Schema III (17 segments): groom, feed, drink, resting, locomotion, climb, rear, aggression
    Schema V (10 segments): g, socg, nb, di, shred, f, fh, d, circling, c, r, aggr
    Short-hand (S): Grm, AGrm, Nst, Dig, Shrd, Snif, UMve, FedG, FedH, Drnk, Rest, Circ, Loco, Clmb, Rear, Aggr, Fght
    (Expanded annotations in the figure: social grooming, micromovement, feed on ground, feed from hopper, feed at hopper.)

    Figure 3.1: Annotation Schemas. The names reflect the labels used within the different schemas (‘nb’ refers to Nest-Building and ‘di’ to Digging). The numbers at the top are the internal categorical representation, while the last row, marked S, is our short-hand notation for referring to the labels. The last column indicates the number of segments corresponding to each schema.

    Data preprocessing involved matching the different data modalities, translating spurious samples to one of the four schemas, and synchronising lights-on/off timings (refer to §A.1.2). Video data, while not used in modelling behaviour, was utilised here to verify processing steps. A number of automated sanity checks were also carried out on the data, the results of which motivated us to only use the Antenna Index field from the RFID data as the most reliable indicator of position (details in §A.2).

    3.2 Exploratory Analysis

    Our exploratory analysis begins with an in-depth treatment of the nuances in the annotations: we then investigate the relationship between annotations and positional information, and conclude with behavioural/positional trends.

    3.2.1 Annotator Data

    The behavioural labels suffered from two limitations: missing data and inter-annotator disagreement. We analysed these in depth, especially how each manifests across schemas, aiming to find a method for handling both in a principled manner.

    Distribution of Behaviours

    It is helpful to first analyse the raw distribution of activities by schema, as displayed in Fig. 3.2. Schemas I and III are quite similar, with the most prolific activities being micromovement and resting: the other two schemas (II and V) exhibit significantly more missing (N/A) data. It also emerges that, apart from circling, no label may be dropped on the grounds of not appearing in the data.

    Missing Data

    Table 3.1 shows the percentage of samples which the annotators have not labelled, and which appear as N/A in Figure 3.2. The global statistics (top row) seem to indicate that a significant


    [Figure 3.2: four histogram panels — (a) Schema I, (b) Schema II, (c) Schema III, (d) Schema V — showing per-label counts and frequencies.]

    Figure 3.2: Histogram (distribution) of Activity Labels across Schemas. Each graph displays only the labels which are found in that schema, with a blank space for other labels: this allows for easy comparisons while distinguishing between missing labels (in the schema) and those with no counts. The statistics are computed by flattening the contribution of all annotators per segment (in effect tripling the number of samples), and are presented as both absolute counts (left-axis) and normalised frequencies (right-axis).

    amount (> 50% for some annotators) of the data is missing. However, closer inspection reveals that the level of missing data is correlated with the schema. To understand this, we postulate that missing data arises due to (a) the annotator being unsure about an activity⁴, (b) the mouse being in an ‘obstructed’ view, where the activity cannot be discerned, or (c) the schema not having a label for the exhibited behaviour. The third point seems to explain why schemas II and V, which notably do not account for the resting and micromovement behaviours, exhibit disproportionately higher levels of missing labels than schemas I and III: this is also evidenced by Fig. 3.2, where the unaccounted behaviours seem to be absorbed by the missing data.

    To test this hypothesis we also report (last column of Table 3.1, labelled ‘All’) the percentage of samples for which none of the annotators give a label, thus reducing the impact of factor (a). Note how in this case the missing data falls to 0 for schemas I and III, but is still significant

    ⁴Dr Bains, from MRC Harwell, confirmed that annotators were instructed not to annotate behaviours they are unsure about.


    for II/V. It should be said that discounting for (b) is difficult, since the same un-labelling across all annotators is to be expected: however, the fact that missing data correlates with the schemas, while there is no reason to believe that obstruction should, strengthens our conclusion that the phenomenon is largely due to (c).

    Annotators:    A1    A2    A3    A4    A5    A6    A7    A8    A9    A10   A12  | All
    Global (%)    24.9  39.0  51.3  51.3  11.4   3.7  19.0  22.5  59.7  50.4  64.1 | 24.0
    Schema I (%)   2.0   1.9   -     -     0.0   0.0   0.7   -     2.1   -     -   |  0.0
    Schema II (%) 54.7  52.3  67.6  51.3  57.2   -    51.3   -     -     -    64.1 | 49.4
    Schema III (%) 3.9   2.2   0.4   -     0.0   0.2  13.4   1.3   -     0.0   -   |  0.0
    Schema V (%)  76.5  66.4  58.7   -     -    45.1   -    75.3  67.3  75.7   -   | 57.8

    Table 3.1: Distribution (% samples) of missing labels across segments/annotators. The columns under the heading Annotators indicate the percentage of missing samples (per schema, and globally) for each annotator individually. The last column, ‘All’, shows the percentage of samples which do not have at least one label from any annotator.

    Inter-Annotator Variability

    We now turn to the challenge of inter-annotator disagreement. To quantify this, we use Fleiss' Kappa statistic [20], defined as:

    κ = (P̄ − P̄_e) / (1 − P̄_e),    (3.1)

    and is the ratio of the agreement achieved over chance (P̄ − P̄_e) to the best possible agreement over chance (1 − P̄_e). As a rule of thumb, κ-values above 0.7 are considered good agreement [61], although the value increases with fewer labels (making inter-schema comparisons less reliable).
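For reference, Eq. (3.1) can be computed directly from per-sample vote counts; this sketch (the function and its input layout are our own) assumes a constant number of annotators per sample, as Fleiss' formulation requires:

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa of Eq. (3.1). counts (N, V): counts[n, v] is the number
    of annotators assigning label v to sample n, assuming a constant number
    of annotators per sample."""
    counts = np.asarray(counts, dtype=float)
    n_annot = counts.sum(axis=1)[0]                # raters per sample
    # Per-sample agreement P_n, averaged to P-bar
    P_n = ((counts ** 2).sum(axis=1) - n_annot) / (n_annot * (n_annot - 1))
    P_bar = P_n.mean()
    # Chance agreement P-bar_e from the marginal label proportions
    p_v = counts.sum(axis=0) / counts.sum()
    P_bar_e = (p_v ** 2).sum()
    return (P_bar - P_bar_e) / (1.0 - P_bar_e)
```

Perfect agreement yields κ = 1, while κ can go negative when observed agreement falls below chance.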

             Fleiss' Kappa (Include)   Fleiss' Kappa (Ignore)   2 Agree (%)         3 Agree (%, Include)
    Schema   κ     Min    Max          κ     Min    Max         Include  Ignore     Global  Min   Max
    I        0.66  0.17   0.84         0.63  -0.41  0.84        97.1     97.1       69.8    47.7  91.5
    II       0.78  0.68   0.83         0.50   0.20  0.65        98.8     97.3       81.2    69.4  92.4
    III      0.67  0.38   0.80         0.63  -0.03  0.80        97.0     96.9       70.0    46.6  94.8
    V        0.69  0.59   0.85         0.28   0.01  0.55        98.2     94.6       76.3    56.3  91.2

    Table 3.2: Fleiss' Kappa and Agreement Statistics for the DataSet (on a per-schema basis). The Include/Ignore sub-columns refer to treatment of missing data. Minima and maxima are reported across Kappa/Percentage values computed individually per segment. 2 Agree reports the percentage of samples for which at least two of the annotators agree on a label: to avoid bias, we omit segments which only have two annotators. 3 Agree indicates where there is perfect agreement.

    Table 3.2 reports the statistic across schemas. Due to the significance of missing data, we report two values. The Include sub-columns are computed by treating the missing value as a label (aligning with the assumption of it encoding certain states): for the Ignore statistics, we discount unlabelled samples. In the former case there is little difference between the schemas, although the level of agreement is less than ideal. There is, however, a significant drop in agreement across all schemas when missing data is ignored (the κ column under the Ignore sub-heading), especially for schema V, which lends credibility to our former assumptions. For comparison, the same table reports the percentage of samples where a behaviour is agreed upon by all (3 Agree) and at least two (2 Agree) of the annotators. As can be seen, this indicates that in a good portion of samples we can infer a unique label using majority-voting.

    Given that we wish to build a model of the inter-annotator variability, it is useful to speculate as to why this arises. We hypothesize that (a) the exact timing at which an annotator specifies a label is highly subjective, especially considering the 40ms discretisation, (b) some activities may be difficult to distinguish, either intrinsically or due to observation angle, and finally, (c) an annotator may consistently permute certain labels due to subjective bias.

    Figure 3.3: Indicative timeline of a Segment with 3 annotators. This is Segment 12 from Schema III.

    To test these hypotheses, we visualise the data in two formats. Fig. 3.3 displays a time-line of colour-coded behaviours throughout a single segment. As can be seen, while the timing factor (a) does account for some of the variability, it certainly is not the only factor, and hence down-sampling alone is unlikely to solve the problem. There also seems to be little consistent permutation, although this is hard to ascertain from this kind of view.

    For a more rigorous depiction we turn to Hinton-plots of the ‘confusion’ matrices, a subset⁵ of which appear in Fig. 3.4. These are computed on a per-schema basis between each pair of annotators for all samples where the two annotators appear together. The plots show that:

    1. there is generally good agreement, with most observations along the main diagonal, although this varies with schema, e.g. (c) vs. (a) or (f),

    2. there is evidence of consistent mislabelling: in (e) annotator 5 consistently confuses feeding at the hopper with climbing,

    3. micromovement seems to be particularly hard to pin down, see (a), and

    4. some annotators tend to be more unsure than others, with more unlabelling e.g. (b).

    ⁵Fig. A.3 through A.6 in Appendix A illustrate the complete set of pairwise combinations in unnormalised form.


    [Figure 3.4: six Hinton-plot panels — (a) 1 vs 2: Schema I, (b) 6 vs 9: Schema I, (c) 6 vs 7: Schema III, (d) 1 vs 12: Schema II, (e) 1 vs 5: Schema II, (f) 6 vs 9: Schema V.]

    Figure 3.4: Selected Confusion Matrices across schemas. The first annotator's labels are recorded along the rows, and the second's in the columns. The size of the square indicates the weight for that cell. The counts along each row are normalised, so that the trends are visible: note however that this exaggerates the extent of overall disagreement.

    3.2.2 Radio-Frequency IDentification (RFID) Data

    We now turn to the RFID Data. Since these were intended as a further indicator of behaviour, we focused our analysis on the relationship between position and annotations. In the absence of a ground-truth, we pre-processed the samples to include only those for which a majority consensus can be inferred. Given a vector X^{(n)}, where X^{(n)}_x represents the sum of all votes given to x by all annotators for sample (n), we define the Majority-Vote label as:

$$
MV^{(n)} =
\begin{cases}
\operatorname{argmax}_x \big( X^{(n)}_x \big) & \text{if } \max_x \big( X^{(n)}_x \big) \ge 2 \\
\operatorname{argmax}_x \big( X^{(n)}_x \big) & \text{if } \sum_x X^{(n)}_x = 1 \\
\text{N/A} & \text{otherwise}
\end{cases}
\tag{3.2}
$$

    Global Densities

As with the annotator labels, it is beneficial to start off with an overview of the global distribution of positions. This allows us to identify any trends and acts as a control when analysing per-label distributions.


[Figure 3.5 comprises, for each of schemas I, II, III and V: (a) a 2D histogram of mouse positions (Ant.Y against Ant.X), and (b) histograms of inter-mouse separation ('Sep.Min', 'Sep.Max') over the 0 to 5 range, with raw counts on the left axis and normalised frequencies on the right.]

Figure 3.5: Position (a, top) and Separation (b, bottom) Distributions. The figures left to right indicate the individual schemas, I through V respectively. The separation plots show the nearest (blue) and furthest (orange) distance (computed according to the euclidean metric).

Fig 3.5 shows the distribution of instantaneous positions (a) and proximities (b) across all schemas. In the latter case, we report the minimum (closest) and maximum (farthest) separation (euclidean distance) since this allows unique plots and aggregation across all mice. This in itself is not a full picture of the behaviour: certain trends may be better exemplified by looking at pairs of mice6, but this is left as a future exercise.

In general, mouse position is distributed across the cage, with some preference towards the edges (this has been observed in, for example, [58]) or the right side, which is also where the feeding hopper is. Similarly, the proximity plots indicate a tendency for the mice to huddle and group together. There are some differences between schemas, which may be attributed to different individuals being analysed, but could also be due to time-of-day differences.

    Distribution of Position by Label

The full set of per-label positions appears in Appendix Figs A.7 through A.10 — we report here some indicative examples in Fig. 3.6. While certain activities are harder to pin down than others, some behaviours exhibit clear correlations with positions: compare locomotion (a) with resting (b). Activities involving climbing, such as feeding at the hopper (c), suffered from a limitation in the antenna range not picking up the RFID implant, causing somewhat less consistent results7.

The analysis also pointed out (and allowed us to rectify) an issue with the RFID data. Notwithstanding the antenna-range limitation, it was expected that behaviours like drinking should be highly correlated with the position of the hopper. However, preliminary results diverged

6This would be especially relevant when viewing the statistics in a per-activity breakdown, which ideally would involve a confusion matrix between each pair of activities between each pair of mice.

    7This was further verified by comparing with the video data.

  • 22 Chapter 3. Exploratory Data Analysis

[Figure 3.6 comprises three label-based Hinton plots of position, one panel-grid per schema: (a) Locomotion (Schema III), (b) Resting (Schema I), and (c) Feeding at Hopper (Schema II).]

Figure 3.6: Position distributions for indicative behaviours. The schema is indicated in brackets.

from this (Fig. 3.7 (a)). Further examination of the RFID data showed a time-lag issue, which was subsequently fixed through discussions with Dr Rowland Sillito from Actual Analytics™ (resulting in Fig. 3.7 (b): the spurious samples are again due to the RFID limitation).

[Figure 3.7 comprises two label-based Hinton plots of drinking positions: (a) Un-Synchronised, and (b) After Synchronisation.]

Figure 3.7: Drinking-specific distribution of positions before (a) and after (b) synchronisation. The approximate location of the drinking tube is marked in red. The counts shown are for Schema III.

    Separation statistics by Label

We report here some informative examples of behaviour-specific minimum/maximum proximities. The full data is available in the same figures as for the positions (A.7 through A.10).

[Figure 3.8 comprises per-label histograms of minimum (Sep.Min, blue) and maximum (Sep.Max, orange) inter-mouse separation over the 0 to 5.5 range, for three panels: (a) Resting (Schema I), (b) Allo-Grooming (Schema V), and (c) Fighting (Schema I).]

Figure 3.8: Separation distributions for indicative behaviours. The schema is indicated in brackets.

Some trends do emerge: resting is very peaked at the lower end of the range, since the mice tend to huddle together, and the same is true for social grooming. What is really indicative, however, is fighting8: in this case, the minimum proximity is centred at a single value (basically the adjacent cell), while the maximum proximity is on the higher end of the scale (presumably the other individual stays clear of the fight).

8Although the low sample count does not provide a very robust correlation.


    3.3 Concluding Remarks

Our conclusions, in light of the goals of the project and the results elicited in this exploration, are as follows.

• We down-sample the data to 1-second intervals, allowing for more manageable datasets and mitigating the time-lags between annotators.

• It is not possible to unify all schemas into a single over-arching grouping: i.e. the schemas must be modelled explicitly. We do however reduce the dimensionality by:

    – dropping Circling from Schema V as this is never observed in our data,

    – aggregating Aggression and Fighting in Schemas I and II as Aggression, and

    – aggregating Nest Building, Digging and Shredding in Schema V as Nesting.

    • Missing data must also be explicitly handled since in certain cases it is highly informative.

    A note on Downsampling

1. For the time field, we take the value at the start of the sample-aggregate: since our models depend more on the ordering than on the absolute values, this does not have any impact beyond simplicity of implementation.

2. We take the most prevalent position within the 1s aggregate, assuming limited motion in the period.

3. We also employ majority-vote for annotator labels (per-annotator individually). The absence of a label is treated as a label in its own right.

    In both position and labels, ties are broken in order of appearance.
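The three downsampling rules can be sketched as below. This is a minimal illustration under assumed data structures (records as `(time, position, {annotator: label})` tuples); the function names are ours, not the thesis implementation.

```python
from collections import Counter, defaultdict

def mode_first(seq):
    """Most frequent value; ties broken by order of first appearance,
    since Counter keys iterate in first-encounter order."""
    counts = Counter(seq)
    return max(counts, key=counts.get)

def downsample(samples, annotators, period=1.0):
    """Aggregate (t, position, {annotator: label}) records into bins.

    Rule 1: keep the time at the start of each sample-aggregate.
    Rule 2: keep the most prevalent position within the bin.
    Rule 3: per-annotator majority vote, with absence as its own label.
    """
    bins = defaultdict(list)
    for t, pos, labels in samples:
        bins[int(t // period)].append((t, pos, labels))
    out = []
    for b in sorted(bins):
        recs = bins[b]
        t0 = recs[0][0]                                    # Rule 1
        pos = mode_first([r[1] for r in recs])             # Rule 2
        lab = {a: mode_first([r[2].get(a, 'missing') for r in recs])
               for a in annotators}                        # Rule 3
        out.append((t0, pos, lab))
    return out
```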

Chapter 4

    Methodologies

This chapter will document the models which were applied to the data, describing incrementally the process towards building the full model: discussion of the results is postponed to Chapter 5. We motivate and define the architectures below, quoting necessary equations, but refer the reader to Appendix B for the full derivations.

We structure our models into two logical families. The first three treat the data in an IID fashion, and focus on the inter-annotator and inter-schema variability: the other two models are temporal extensions of the IID models. In the interest of simplicity, and given the amount of data available to us, we do not employ the full Bayesian approach in training our models, but rather the Maximum-A-Posteriori (MAP) approximation (described below).

    4.1 Noisy-Annotator (NA) Model

Chapter 3 (esp. §3.2.1) highlighted the significant inter-annotator disagreement, which poses a degree of uncertainty in any behavioural model. Our baseline architecture deals with this using a mixture of categorical distributions, which we term the Noisy-Annotator (NA) Model, and is based on the Dawid & Skene formulation [18].

    4.1.1 Probabilistic Description

Our model is presented in vector-plate notation in Fig. 4.1, corresponding to the following joint PMF:

$$
P(U, Z \mid \Pi, \Psi) = \prod_{n=1}^{N} \prod_{z=1}^{|Z|} \left( \pi_z \prod_{k=1}^{K} \prod_{u=1}^{|U|} \psi_{k,u,z}^{U^{(n)}_{k,u}} \right)^{Z^{(n)}_z}. \tag{4.1}
$$

This is a per-schema model: i.e. a separately-trained model is applied to data from each schema. Z is a 1-Hot encoded vector of size |Z| indicating the latent mouse Behaviour. We



[Figure 4.1: plate diagram with latent Z^{(n)} generating observed U^{(n)}_k, with plates over the K annotators and N samples.]

Figure 4.1: Inter-Annotator Variability explained as a mixture model with categorical observations.

also have observations from each Annotator k, denoted U_k (again a 1-Hot encoded vector of size |U| for all k). In our case, |Z| and |U| are equal (for all annotators), but change across schemas. The model is parametrised by two distributions. Π encodes the prior distribution over latent behaviours and is defined as:

$$
\Pi \in (0,1)^{|Z|} : \pi_z \equiv P(Z_z = 1), \qquad \sum_{z=1}^{|Z|} \pi_z = 1. \tag{4.2}
$$

Ψ is the Conditional Probability Table (CPT) of U_k:

$$
\Psi \in (0,1)^{K \times |U| \times |Z|} : \psi_{k,u,z} \equiv P(U_{k,u} = 1 \mid Z_z = 1), \qquad \sum_{u=1}^{|U|} \psi_{k,u,z} = 1 \;\; \forall k, z \tag{4.3}
$$

where k encodes the annotator index. Note that we collectively refer to both sets of parameters as Θ = (Π, Ψ). We introduce priors over both Π and Ψ and compute MAP rather than MLE estimates1. We use conjugate Dirichlet priors, parametrised by a vector of 'imaginary counts' α^θ:

$$
\operatorname{Dir}(\Pi) \equiv \frac{1}{\beta(\alpha^{\pi})} \prod_{z=1}^{|Z|} \pi_z^{\alpha^{\pi}_z - 1} \quad \text{and} \quad \operatorname{Dir}(\Psi_{k,z}) \equiv \frac{1}{\beta(\alpha^{\psi}_{k,z})} \prod_{u=1}^{|U|} \psi_{k,u,z}^{\alpha^{\psi}_{k,u,z} - 1} \tag{4.4}
$$

where β is the Multivariate Beta Function (for normalisation), and the prior2 over Ψ is independently specified for all z, k. We thus redefine our original PMF to include the contribution of the priors:

$$
P(U, Z \mid \Theta, \alpha) = \left[ \prod_{n=1}^{N} \prod_{z=1}^{|Z|} \left\{ \pi_z \prod_{k=1}^{K} \prod_{u=1}^{|U|} \psi_{k,u,z}^{U^{(n)}_{k,u}} \right\}^{Z^{(n)}_z} \right] \left( \operatorname{Dir}(\Pi) \prod_{z=1}^{|Z|} \prod_{k=1}^{K} \operatorname{Dir}(\Psi_{k,z}) \right). \tag{4.5}
$$

We define the MAP prediction for the latent state as the most probable z given the observation:

$$
Z^{(n)}_{\text{map}} = \operatorname{argmax}_z \left[ P\!\left( Z^{(n)}_z = 1 \mid U^{(n)}_1, \ldots, U^{(n)}_K \right) \right] \tag{4.6}
$$

1This is done mainly to avoid mathematical instability (division-by-zero) due to some emission configurations never being observed.

2Note that the π or ψ superscript is intended to identify different α counts, and should not be interpreted as an exponent.

  • 4.1. Noisy-Annotator (NA) Model 27

    4.1.2 Estimating the Parameters

Our goal is to find MAP estimates for the parameters Π and Ψ. To do so, we maximise the penalised Model (Log-)likelihood w.r.t. each of the parameters. Since we only observe U, however, we need to sum out Z in (4.5), which makes maximisation intractable3. The EM algorithm [40] gets around this by iterating between an (E)xpectation and a (M)aximisation step. In the interest of brevity, we report here the update equations for each step, and present the full derivation in Appendix B.1.

The E-Step consists of computing the expectation of the latent-states (Z) w.r.t. the observations and current estimate of the parameters4. The "responsibility" for state z^{(n)} to have generated observation U^{(n)} under our model is defined as:

$$
\gamma^{(n)}_z \equiv \mathbb{E}\!\left\langle Z^{(n)}_z \mid U^{(n)}, \Theta^{old} \right\rangle = \frac{\pi_z \prod_{k=1}^{K} \prod_{u=1}^{|U|} \psi_{k,u,z}^{U^{(n)}_{k,u}}}{\sum_{z'=1}^{|Z|} \pi_{z'} \prod_{k=1}^{K} \prod_{u=1}^{|U|} \psi_{k,u,z'}^{U^{(n)}_{k,u}}} \tag{4.7}
$$

Note how this is nothing but the conditional posterior on z, the maximum of which also gives us the MAP prediction in (4.6). During the M-Step, we then compute MAP estimates for each of the parameters:

$$
\hat{\pi}_z = \frac{\sum_{n=1}^{N} \gamma^{(n)}_z + \alpha^{\pi}_z - 1}{N + \sum_{z'=1}^{|Z|} \alpha^{\pi}_{z'} - |Z|}, \tag{4.8}
$$

$$
\hat{\psi}_{k,u,z} = \frac{\sum_{n=1}^{N} \gamma^{(n)}_z U^{(n)}_{k,u} + \alpha^{\psi}_{k,u,z} - 1}{\sum_{n=1}^{N} \sum_{u=1}^{|U|} \gamma^{(n)}_z U^{(n)}_{k,u} + \sum_{u=1}^{|U|} \alpha^{\psi}_{k,u,z} - |U|}. \tag{4.9}
$$

    4.1.3 Dealing with Missing Data

In this baseline model, we treat all missing data as strictly uninformative. The power of our model is that it naturally handles such missing data, both within training and when performing predictions: and it does so in the most natural manner, by ignoring non-observed variables.

Training Assuming that annotator k* is unobserved for some sample n*, then U^{(n*)}_{k*} must not affect the likelihood function, Eq. (4.5). This implies that the product over U should evaluate to 1: i.e. U^{(n*)}_{k*,u} = 0 ∀u ∈ (1, |U|) — this is guaranteed by the 1-hot encoding. In the extreme case where all annotators do not label a sample, the solution is also stable:

• In estimating the responsibility, Eq. (4.7), the one-hot encoding implies that the emission component is 1, and hence the responsibilities depend only on the prior probabilities (Π^{old}), which is as expected.

3To see why this is, recall that logs cannot be distributed over addition.

4Θ^{old}: however we omit the old superscript for clarity.

  • 28 Chapter 4. Methodologies

    • Updating Π depends only on the responsibilities.

• In updating Ψ, the summation is over all samples, and any undefined behaviour is mitigated not only by the sum over all samples, but also by the prior Dirichlet parameters.

Inference The predictive equation (4.6) is nothing but the maximisation of the responsibility update (4.7), and hence the same conclusions apply.

    4.1.4 Inferring Latent States

In comparing MAP predictions for Z, we require interpretation of the hidden states. However, the latent-space, by virtue of the problem formulation, is invariant to permutations of the latent-states (i.e. the optimum is indistinguishable w.r.t. these permutations, a phenomenon often referred to as 'Label-switching' [64]).

We get around this by picking the permutation (denoted by the permutation matrix P̂_σ) which jointly maximises the traces of the confusion matrices C_k between the MAP predictions and the labels for annotator k:

$$
\hat{P}_\sigma = \operatorname{argmax}_\sigma \left\{ \sum_{k=1}^{K} \operatorname{tr}\left( C_k P_\sigma \right) \right\}. \tag{4.10}
$$

To solve this efficiently, we first sum the confusion matrices for all annotators and, casting it into a non-negative minimisation problem, employ the (polynomial) Hungarian algorithm [31]. Intuitively, our scheme amounts to assigning each latent state to the annotation label with which it co-occurs most often, and is related to the family of 'Deterministic Relabelling Algorithms' summarised in [64]. The success of this is based on two assumptions:

    • Annotators agree more often than they disagree, so that a single-state can be inferred, and,

• They are largely predictable: i.e. their conditional emission probabilities are sufficiently peaked around a single state.

The first assumption is grounded in the high level of agreement in majority-voting (97% agreement on average, see sub-section 3.2.1). The level of predictability is explored in §5.2.2.
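The relabelling step of Eq. (4.10) can be illustrated as below. For the small label-spaces here a brute-force search over permutations suffices to show the idea; in practice one would substitute the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment). The function name and input layout are assumptions for this sketch.

```python
import numpy as np
from itertools import permutations

def align_states(confusions):
    """Best relabelling of latent states, in the spirit of Eq. (4.10).

    confusions: (K, S, S) confusion matrices C_k between MAP predictions
                (rows) and annotator k's labels (columns).
    Returns a tuple perm, mapping latent state z to label perm[z].
    """
    # Summing over annotators preserves the objective, since the trace
    # is linear: sum_k tr(C_k P) = tr((sum_k C_k) P).
    C = np.asarray(confusions).sum(axis=0)
    S = C.shape[0]
    # tr(C P_sigma) = sum_z C[z, sigma(z)]: pick the best assignment.
    return max(permutations(range(S)),
               key=lambda p: sum(C[z, p[z]] for z in range(S)))
```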

    4.2 Inter-Schema Annotator Consistency (ISAC) Model

The NA Model captures the reliability of each annotator: however, the information is not directly transferable between schemas. Based on the assumption that the consistency of each annotator transfers between schemas, the Inter-Schema Annotator Consistency (ISAC) Model (which represents one of the major contributions of our work) retains the IID assumption, but incorporates data from all schemas into one holistic model.


    4.2.1 Missing Data

Before describing the model, we motivate the need to explicitly model missing data, in light of our observations in Chapter 3. The results therein indicated the significance of missing data, which, we postulate, can arise in one of four scenarios5:

    (a) The annotator was not tasked with labeling the segment,

    (b) The mouse is hidden from view,

    (c) The annotator is unsure about the activity, or

    (d) The label is not available in the schema.

Identifying between these cases allows us to infer subsets of behaviours from the lack of data in case (d), while treating the other possibilities as uninformative6. We propose artificially assigning a Not-In-Schema (NIS) label to instances of (d), which we identify when all annotators fail to label a sample. It is assumed that the different annotator skill-levels reduce the effect of (c): this is supported by the results in the previous chapter (§3.2.1). The same results allow us to discount (b). Finally, we have access to information about (a).
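The NIS-assignment rule above can be sketched for a single sample as follows. This is a hypothetical helper under assumed data structures (labels as a dict of annotator to label-or-None, with the tasked annotators known from case (a)); it is not the thesis implementation.

```python
def add_nis_labels(labels, tasked):
    """Relabel case (d) as Not-In-Schema (NIS) for one sample.

    labels: {annotator: label or None}, over annotators tasked with this
            segment (untasked annotators, case (a), are excluded upstream).
    If every tasked annotator left the sample unlabelled, the behaviour is
    presumed absent from the schema and all entries become 'NIS';
    otherwise the gaps stay as uninformative missing data.
    """
    if tasked and all(labels.get(a) is None for a in tasked):
        return {a: 'NIS' for a in tasked}
    return dict(labels)
```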

Note that while the NA Model handles missing data, it does so in an uninformative manner, assigning the latent state to the most probable state under the prior distribution Π. The motivation behind the ISAC extension is specifically to make use of (a) information from all schemas, and (b) knowledge that the label is not in the schema, and to adjust posterior inference accordingly.

    4.2.2 Probabilistic Model

    Our proposed model appears