
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2679–2689, Brussels, Belgium, October 31 – November 4, 2018. ©2018 Association for Computational Linguistics


Deconvolutional Time Series Regression: A Technique for Modeling Temporally Diffuse Effects

Cory Shain, Department of Linguistics, The Ohio State University ([email protected])

William Schuler, Department of Linguistics, The Ohio State University ([email protected])

Abstract

Researchers in computational psycholinguistics frequently use linear models to study time series data generated by human subjects. However, time series may violate the assumptions of these models through temporal diffusion, where stimulus presentation has a lingering influence on the response as the rest of the experiment unfolds. This paper proposes a new statistical model that borrows from digital signal processing by recasting the predictors and response as convolutionally-related signals, using recent advances in machine learning to fit latent impulse response functions (IRFs) of arbitrary shape. A synthetic experiment shows successful recovery of true latent IRFs, and psycholinguistic experiments reveal plausible, replicable, and fine-grained estimates of latent temporal dynamics, with comparable or improved prediction quality relative to widely-used alternatives.

1 Introduction

Time series are abundant in many naturally-occurring phenomena of interest to science, and they frequently violate the assumptions of linear modeling and its generalizations. One confound that may be widespread in psycholinguistic data is temporal diffusion: the dependent variable may evolve slowly in response to its inputs, with the result that a particular predictor observed at a particular time may continue to exert an influence on the response as the rest of the process unfolds. If not properly controlled for, such a confound could have a detrimental impact on parameter estimation, model interpretation, and hypothesis testing.

The problem of temporal diffusion remains largely unsolved in the general case.¹ A standard approach for handling the possibility of temporally diffuse relationships between the predictors and the response is to use spillover or lag regressors, where the observed predictor value is used to predict subsequent observations of the response (Erlich and Rayner, 1983). But this strategy has several undesirable properties. First, the choice of spillover position(s) for a given predictor is difficult to motivate empirically. Second, in experiments with variably long trials the use of relative event indices obscures potentially important details about the actual amount of time that passed between events. And third, including multiple spillover positions per predictor quickly leads to parametric explosion on realistically complex models over realistically sized data sets, especially if random effects structures are included.

¹ Although recent work in computational psycholinguistics has begun to address separate but related problems in time series modeling (auto-correlation and non-stationarity) using generalized additive models (GAM) with a particular structure (Baayen et al., 2017, 2018).
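To make the spillover strategy concrete, the following is a minimal sketch in pandas, assuming a DataFrame sorted by trial with hypothetical column names `subject` and `surprisal`; it is not taken from any implementation discussed in this paper:

```python
import pandas as pd

def add_spillover(df, predictor, n_positions):
    """Add lagged copies of `predictor` (spillover regressors).

    Each spillover column holds the predictor value observed n trials
    earlier for the same subject, so a linear model can fit a separate
    coefficient per relative event index.
    """
    for n in range(1, n_positions + 1):
        df[f"{predictor}_s{n}"] = (
            df.groupby("subject")[predictor].shift(n).fillna(0.0)
        )
    return df

# Hypothetical usage: three spillover positions for one predictor.
# With P predictors and S positions this adds P * S columns (plus any
# random slopes), illustrating the parametric explosion noted above.
# df = add_spillover(df, "surprisal", 3)
```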

As a solution to the problem of temporal diffusion, this paper proposes deconvolutional time series regression (DTSR), a technique that directly models diffusion by learning parametric impulse response functions (IRFs) of the predictors that mediate their relationship to the response variable over time. Parametric deconvolution is difficult in the general case because the likelihood surface depends on the choice of IRF kernel, requiring the user to re-derive estimators for each unique model structure. Furthermore, arbitrary IRF kernels are not guaranteed to afford analytical estimator functions or unique real-valued solutions. However, recent advances in machine learning have led to libraries like Tensorflow (Abadi et al., 2015), which uses auto-differentiation to support optimization of arbitrary computation graphs, and Edward (Tran et al., 2016), which enables black box variational inference (BBVI) on Tensorflow graphs. While these libraries are typically used to build and train deep networks, DTSR uses them to overcome the aforementioned difficulties with general-purpose temporal deconvolution by eliminating the need for hand-derivation of estimators and sampling distributions for each model.


Figure 1: Effects in a linear time series model (panels: Predictor, Response; horizontal axis: trial index)

Figure 2: Linear time series model with spillover (panels: Predictor, Response; horizontal axis: trial index)


The IRFs learned by DTSR are interpretable as estimates of the temporal shape of predictors' influence on the response variable. By convolving predictors with their IRFs, DTSR is able to consider arbitrarily long histories of independent variable observations in generating a given prediction, and (in contrast to spillover) model complexity is constant on the length of the history window. DTSR is thus a parsimonious technique for directly measuring temporal diffusion.

Figures 1–3 illustrate the present proposal and how it differs from linear time series models. As shown in Figure 1, a standard linear model assumes conditional independence of the response from all preceding observations of the predictor. This independence assumption can be weakened by including additional spillover predictors (Figure 2), at a cost of requiring additional parameters. In both cases, only the relative order of events is considered, not their actual distance in time. By contrast, DTSR recasts the predictor and response vectors as streams of impulses and responses (respectively) localized in time. It then fits latent IRFs that govern the influence of each predictor value on the response as a function of time (Figure 3).

Figure 3: Effects of predictors in DTSR (panels: Response, convolved predictor, Predictor, latent IRF; horizontal axis: time)

This paper presents evidence that DTSR can (1) recover known underlying IRFs from synthetic data, (2) discover previously unknown temporal structure in human data (psycholinguistic reading time experiments), (3) provide support for the absence of temporal diffusion in settings where it might exist in principle, and (4) provide comparable (or in some cases improved) prediction quality to standard linear mixed-effects (LME) and generalized additive (GAM) models.

2 Related work

2.1 Non-deconvolutional time series modeling

The two most widely used tools for analyzing psycholinguistic time series are linear mixed effects regression (LME) (Bates et al., 2015) and generalized additive models (GAM) (Hastie and Tibshirani, 1986; Wood, 2006). LME learns a linear combination of the predictors that generates a given response variable. GAM generalizes linear models by allowing the response variable to be computed as the sum of smooth functions of one or more predictors.

In both approaches, responses are modeled as conditionally independent of preceding observations of predictors unless spillover terms are added, with the attendant drawbacks discussed in Section 1. To make this point more forcefully, take for example Shain et al. (2016), who find significant effects of constituent wrap-up (p = 2.33e-14) and dependency locality (p = 4.87e-10) in the Natural Stories self-paced reading corpus (Futrell et al., 2018). They argue that this constitutes the first strong evidence of memory effects in broad-coverage sentence processing. However, it turns out that when one baseline predictor, probabilistic context free grammar (PCFG) surprisal, is spilled over one position, the reported effects disappear: p = 0.816 for constituent wrap-up and p = 0.370 for dependency locality. Thus, a reasonable but ultimately inaccurate assumption about baseline effect timecourses can have a dramatic impact on the conclusions supported by the statistical model. DTSR offers a way forward by bringing temporal diffusion under direct statistical control.

2.2 Deconvolutional time series modeling

Deconvolutional modeling has long been used in a variety of scientific fields, including economics (Ramey, 2016), epidemiology (Goldstein et al., 2011), and neuroimaging (Friston et al., 1998). Non-parametric deconvolutional models quantize the time series and fit estimates for each time point within some window, similarly to the spillover approach discussed above. These estimates can be unconstrained, as in finite impulse response (FIR) models (Glover, 1999; Ward, 2006), or smoothed with some form of regularization (Goutte et al., 2000; Pedregosa et al., 2014). Additional post-hoc interpolation is necessary in order to obtain a closed-form continuous IRF. These non-parametric approaches are prone to parametric explosion as well as sparsity problems when trials are variably spaced in time.

Parametric deconvolutional approaches (i.e. specific instantiations of DTSR) have evolved in certain fields (e.g. fMRI modeling) to solve particular problems, generally with some independently-motivated IRF kernel like the hemodynamic response function (HRF) (Friston et al., 1998; Lindquist and Wager, 2007; Lindquist et al., 2009). However, to our knowledge DTSR constitutes the first mathematical formulation and software implementation of general-purpose mixed effects parametric deconvolutional regression for arbitrary impulse response kernels. DTSR also supports Bayesian inference, enabling quantification of uncertainty in the absence of analytic formulae for standard errors. With these properties, DTSR expands the range of possible applications of parametric deconvolution beyond those fields for which appropriate formulations have already been developed.

3 Model definition

This section presents the mathematical definition of DTSR. For readability, only a fixed effects model is presented below, since mixed modeling substantially complicates the equations. The full model definition is provided in Appendix A. Note that the full definition is used to construct all reading time models reported in subsequent sections, since they contain random effects.

Let $\mathbf{X} \in \mathbb{R}^{M \times K}$ be a design matrix of $M$ observations for $K$ predictor variables and $\mathbf{y} \in \mathbb{R}^{N}$ be a vector of $N$ responses, both of which contain contiguous temporally-sorted time series. DTSR models the relationship between $\mathbf{X}$ and $\mathbf{y}$ using parameters consisting of:

• a scalar intercept $\mu \in \mathbb{R}$

• a vector $\mathbf{u} \in \mathbb{R}^{K}$ of $K$ coefficients

• a matrix $\mathbf{A} \in \mathbb{R}^{R \times K}$ of $R$ IRF kernel parameters for $K$ fixed impulse vectors

• a scalar variance $\sigma^2 \in \mathbb{R}$ of the response

To define the convolution step, let $g_k$ for $k \in \{1, 2, \dots, K\}$ be a set of parametric IRF kernels, one for each predictor; let $\mathbf{a} \in \mathbb{R}^{M}$ and $\mathbf{b} \in \mathbb{R}^{N}$ be vectors of timestamps associated with each observation in $\mathbf{X}$ and $\mathbf{y}$, respectively; and let $\mathbf{c} \in \mathbb{N}^{M}$ and $\mathbf{d} \in \mathbb{N}^{N}$ be vectors of series IDs associated with each observation in $\mathbf{X}$ and $\mathbf{y}$, respectively. A filter $\mathbf{F} \in \mathbb{R}^{N \times M}$ admits only those observations in $\mathbf{X}$ that precede $\mathbf{y}_{[n]}$ in the same time series:

$$\mathbf{F}_{[n,m]} \overset{\mathrm{def}}{=} \begin{cases} 1 & \mathbf{c}_{[m]} = \mathbf{d}_{[n]} \wedge \mathbf{a}_{[m]} \le \mathbf{b}_{[n]} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The inputs $\mathbf{X}$ can be convolved with each IRF $g_k$ by premultiplication with sparse matrix $\mathbf{G}_k \in \mathbb{R}^{N \times M}$ for $k \in \{1, 2, \dots, K\}$ as defined below:

$$\mathbf{G}_k = g_k\!\left(\mathbf{b}\mathbf{1}^{\top} - \mathbf{1}\mathbf{a}^{\top};\, \mathbf{A}_{[*,k]}\right) \odot \mathbf{F} \qquad (2)$$

The convolution that yields the design matrix of convolved predictors $\mathbf{X}' \in \mathbb{R}^{N \times K}$ is then defined using products of the $\mathbf{G}$ matrices and the design matrix $\mathbf{X}$:²

$$\mathbf{X}'_{[*,k]} \overset{\mathrm{def}}{=} \mathbf{G}_k \mathbf{X}_{[*,k]} \qquad (3)$$

² This implementation of convolution is only exact when the predictors fully describe a discrete impulse signal. Exact convolution of samples from continuous signals is generally not possible because the signal is generally not analytically integrable. For continuous signals, DTSR can approximate the convolution as long as the predictor is interpolated between sample points at a fixed frequency prior to fitting.
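Equations (1)–(3) can be sketched directly in NumPy. The following is an illustrative dense implementation (the actual implementation uses sparse matrices and truncated histories); `irfs` is assumed to be a list of vectorized kernel functions, such as the ShiftedGamma kernel sketched in Section 4:

```python
import numpy as np

def convolve_predictors(X, a, b, c, d, irfs, params):
    """Discrete convolution by premultiplication, following Eqs. (1)-(3).

    X:      (M, K) design matrix of impulses
    a, b:   (M,), (N,) timestamps of impulses and responses
    c, d:   (M,), (N,) integer series IDs of impulses and responses
    irfs:   list of K vectorized kernels g_k(T, params_k)
    params: list of K kernel parameter tuples
    Returns X_conv: (N, K) matrix of convolved predictors.
    """
    # Eq. (1): F[n, m] = 1 iff impulse m precedes response n
    # within the same series.
    F = (c[None, :] == d[:, None]) & (a[None, :] <= b[:, None])
    # b 1^T - 1 a^T: time offset of each impulse from each response.
    T = b[:, None] - a[None, :]
    N, K = len(b), X.shape[1]
    X_conv = np.empty((N, K))
    for k in range(K):
        G_k = irfs[k](T, *params[k]) * F   # Eq. (2)
        X_conv[:, k] = G_k @ X[:, k]       # Eq. (3)
    return X_conv
```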

Page 4: Deconvolutional Time Series Regression: A Technique for ... · 2.1 Non-deconvolutional time series modeling The two most widely used tools for analyzing psy-cholinguistic time series

2682

Following convolution, DTSR is simply a linear model. The full model mean is the sum of (1) the intercept $\mu$ and (2) the product of the convolved predictor matrix $\mathbf{X}'$ and the coefficient vector $\mathbf{u}$:

$$\mathbf{y} \sim \mathcal{N}\!\left(\mu + \mathbf{X}'\mathbf{u},\, \sigma^2\right) \qquad (4)$$

4 Implementation

The present implementation defines the aforementioned equations³ as a Bayesian computation graph in Tensorflow and Edward and trains it with black box variational inference (BBVI) using the Nadam optimizer (Dozat, 2016)⁴ with a constant learning rate of 0.01 and minibatches of size 1024. For computational efficiency, histories are truncated at 128 timesteps. Prediction from the network uses an exponential moving average of parameter iterates with a decay rate of 0.998. Convergence was visually diagnosed.
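For illustration, the exponential moving average of parameter iterates used for prediction can be sketched as follows (a minimal NumPy sketch, not the Tensorflow implementation):

```python
import numpy as np

def ema(iterates, decay=0.998):
    """Exponential moving average of parameter iterates:
    shadow <- decay * shadow + (1 - decay) * theta at each step."""
    shadow = iterates[0].copy()
    for theta in iterates[1:]:
        shadow = decay * shadow + (1.0 - decay) * theta
    return shadow
```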

The present experiments use a ShiftedGamma IRF kernel:

$$f(x;\, \alpha, \beta, \delta) = \frac{\beta^{\alpha}(x - \delta)^{\alpha - 1} e^{-\beta(x - \delta)}}{\Gamma(\alpha)} \qquad (5)$$

This is simply the PDF of the Gamma distribution augmented with a shift parameter $\delta$ allowing the lower bound of the support of the distribution to deviate from 0. We constrain $\delta$ to be strictly negative, thereby allowing the model to find a non-zero instantaneous response. We also constrain $\alpha$ to be strictly greater than 1, which deconfounds the shape and shift parameters. All bounded variables are constrained using the softplus bijection:

$$\mathrm{softplus}(x) = \log(e^{x} + 1) \qquad (6)$$
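A sketch of the kernel and its constraints in NumPy/SciPy follows. The $\alpha > 1$ and $\delta < 0$ constraints are as described above; keeping $\beta$ positive is our assumption (it is required for the Gamma PDF to be well-defined):

```python
import numpy as np
from scipy.special import gammaln

def softplus(x):
    """Eq. (6): log(e^x + 1), computed stably via logaddexp."""
    return np.logaddexp(0.0, x)

def constrain(alpha_raw, beta_raw, delta_raw):
    """Map unconstrained optimizer variables to the constrained
    kernel parameters: alpha > 1, beta > 0 (assumed), delta < 0."""
    return 1.0 + softplus(alpha_raw), softplus(beta_raw), -softplus(delta_raw)

def shifted_gamma_irf(t, alpha, beta, delta):
    """Eq. (5): the Gamma PDF with support shifted to begin at delta,
    evaluated in log space for numerical stability."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    z = t - delta
    out = np.zeros_like(z)
    ok = z > 0
    out[ok] = np.exp(alpha * np.log(beta) + (alpha - 1.0) * np.log(z[ok])
                     - beta * z[ok] - gammaln(alpha))
    return out
```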

The ShiftedGamma kernel is used here because it can fit a wide range of response shapes and has precedent in the fMRI literature, where HRF kernels are often assumed to be Gamma-shaped (Lindquist et al., 2009).⁵

All parameters are given normal priors with unit variance. Prior means for the fixed IRF kernel parameters are domain-specific and discussed in the experiments sections below. To center the prior at an intercept-only model,⁶ prior means for the intercept $\mu$ and variance $\sigma^2$ are set (respectively) to the empirical mean and variance of the response, and prior means for both fixed coefficients and random effects⁷ are set to 0. Although the Bayesian implementation of DTSR is used for this study because it provides quantification of uncertainty, placing priors on the IRF kernel parameters is not crucial to the success of the system. In all experiments reported below, the MLE implementation arrives at similar solutions and achieves slightly better error.

³ As noted above, for expository purposes the definition in Section 3 only supports fixed-effects models. The full definition for mixed-effects DTSR models is provided in Appendix A. Mixed models are used throughout the experiments reported below.

⁴ The Adam optimizer (Kingma and Ba, 2014) with Nesterov momentum (Nesterov, 1983).

⁵ Other IRF kernels, including spline functions and composition of convolutions, are supported by the current implementation of DTSR but are not explored in these experiments. More details are provided in the software documentation.

In the interests of enabling the use of DTSR by the scientific community, the implementation of DTSR used here is offered as a documented open-source Python package with support for (1) Bayesian, variational Bayesian, and MLE inferences and (2) a variety of model structures and impulse response kernels. The Tensorflow backend also enables GPU acceleration where available. Source code and links to documentation are available at https://github.com/coryshain/dtsr.

5 Experiment 1: Synthetic data

An initial experiment fits DTSR estimates to synthetic data to determine whether the model can recover known ground truth IRFs. Synthetic data were generated using the following procedure. First, 20 input vectors of size 10,000 were drawn from a standard normal distribution. These values synthesize an impulse stream containing 20 covariates, each with 10,000 observations. A ShiftedGamma IRF was then drawn for each of the 20 covariates. Coefficients were drawn from a uniform distribution $U(-50, 50)$, and IRF parameters were drawn from the following distributions: $\alpha \sim U(1, 6)$, $\beta \sim U(0, 5)$, $\delta \sim U(-1, 0)$. The prior means for the corresponding IRF kernel parameters are placed at the centers of these ranges. The stream of responses was generated by convolving the covariates with their corresponding IRFs. Gaussian noise with standard deviation 20 was injected into the response following generation. The 10,000 trials were spaced 100ms apart. As shown in Figure 4, the DTSR estimates for the synthetic data are very similar to the ground truth, confirming that when the data-generating model matches the assumptions of DTSR, DTSR can recover its latent structure with high fidelity.

⁶ A model in which the response is insensitive to the model structure.

⁷ See Appendix A for the definition of the mixed-effects DTSR model, which includes random effects.
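The generation procedure can be sketched as follows, reusing the shifted_gamma_irf helper sketched in Section 4; the truncated history window is our simplification for tractability, not part of the stated procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, WIN = 10_000, 20, 128        # WIN: truncated history window (our simplification)
X = rng.standard_normal((M, K))    # impulse stream: 20 covariates x 10,000 draws
coefs = rng.uniform(-50, 50, K)    # coefficients ~ U(-50, 50)
alpha = rng.uniform(1, 6, K)       # IRF parameters ~ U(1,6), U(0,5), U(-1,0)
beta = rng.uniform(0, 5, K)
delta = rng.uniform(-1, 0, K)
t = np.arange(M) * 0.1             # trials spaced 100 ms apart

y = np.zeros(M)
for n in range(M):
    lo = max(0, n - WIN + 1)
    lag = t[n] - t[lo:n + 1]       # offsets of current and preceding impulses
    for k in range(K):
        w = shifted_gamma_irf(lag, alpha[k], beta[k], delta[k])
        y[n] += coefs[k] * (w @ X[lo:n + 1, k])
y += rng.normal(0.0, 20.0, M)      # injected Gaussian noise, sd 20
```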


Figure 4: Synthetic data. True IRFs (left) and estimated IRFs with 95% credible intervals (right). (Vertical axis: response; horizontal axis: time in seconds.)


6 Experiment 2: Human reading times

6.1 Background and experimental design

The main interest of DTSR is the potential to better understand real-world dynamical systems like the human sentence processing response. Therefore, Experiment 2 applies DTSR to three existing datasets of naturalistic reading: Natural Stories (Futrell et al., 2018), Dundee (Kennedy et al., 2003), and UCL (Frank et al., 2013).

Natural Stories is a self-paced reading (SPR) corpus consisting of narratives designed to provide context-rich, fluent-sounding stimuli that nonetheless contain many grammatical constructions that rarely occur naturally in texts. The public release of the corpus contains data collected from 181 subjects. The stimulus set contains 10 stories with a total of 485 sentences and 10,245 tokens, for a total of 848,768 fixation events.

Dundee is an eye-tracking (ET) corpus containing newspaper editorials read by 10 subjects, with incremental eye fixation data recorded during reading. The stimulus set contains 20 editorials with a total of 2,368 sentences and 51,502 tokens, for a total of 260,065 fixation events.

UCL is a reading corpus containing individual sentences that were extracted from novels written by amateur authors. The sentences were shuffled and presented in isolation to 42 subjects. The eye-tracking portion of the UCL corpus used in these experiments contains 205 sentences with a total of 1,931 tokens, for a total of 53,070 fixation events.

In all experiments, the response variable is log fixation duration (go-past duration for ET). Models use the following set of predictor variables in common use in psycholinguistics: Sentence Position (index of word in sentence), Trial (index of trial in series),⁸ Saccade Length (in words, ET only), Word Length (in characters), Unigram Logprob, and 5-gram Surprisal. Unigram Logprob and 5-gram Surprisal are computed by the KenLM toolkit (Heafield et al., 2013) trained on Gigaword 4 (Parker et al., 2009). In addition, DTSR enables fitting of a Rate predictor, which is simply a vector of ones, one for each observation, that is convolved using a latent IRF. Rate thus measures the response to density of stimulus presentation in the recent past. Since without deconvolution Rate is identical to the intercept, it is excluded from non-deconvolutional baseline models. Following standard practice in psycholinguistics, by-subject random coefficients for each of these predictors are included in all models (baseline and DTSR).⁹

ShiftedGamma IRFs are fitted to all predictors except Sentence Position, which is assigned a Dirac delta IRF (i.e. a linear coefficient) since it increases linearly within the sentence and is not expected to have a diffuse response. In plots, the Sentence Position estimate is shown as a stick function at time 0s. Prior means used for the IRF kernel parameters are $\alpha = 2$, $\beta = 5$, and $\delta = -0.5$. Together, these priors define an expected exponential-like IRF which decays to near-zero in about 1s, which seems plausible for human reading times. In practice they do not appear to be very constraining, since posterior means of fitted models often deviate quite far from these values.

⁸ Except UCL, which contains isolated sentences, in which case Trial is identical to Sentence Position.

⁹ By-subject IRF parameters were not used for this study because they substantially complicate the model and initial experiments using them showed little benefit on training data.
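To make the Rate predictor concrete, the following sketch shows a hypothetical impulse table (column names are illustrative, not the package's required schema); Rate is a constant column that only becomes informative once convolved with its latent IRF:

```python
import pandas as pd

# Hypothetical impulse table for one subject.
X = pd.DataFrame({
    "time":    [0.0, 0.31, 0.58, 0.94],  # stimulus onsets (s)
    "sentpos": [1, 2, 3, 4],             # Sentence Position
    "wlen":    [3, 7, 5, 9],             # Word Length (characters)
    "surp5":   [4.2, 9.1, 6.3, 11.0],    # 5-gram Surprisal
})
# Rate: a vector of ones. Unconvolved it is collinear with the
# intercept, so it appears only in deconvolutional models, where its
# convolution with a latent IRF measures recent stimulus density.
X["rate"] = 1.0
```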


Figure 5: Human data. Estimated IRFs with 95% credible intervals for Natural Stories (left), Dundee (center) and UCL (right); intervals are too tight to be seen. (Vertical axis: fixation duration in log ms; horizontal axis: time in ms.)

Existing work provides some expectations about the relationships of these variables to reading time. Processing difficulty is expected to increase with Saccade Length, Word Length, and 5-gram Surprisal, and positive linear relationships have been shown experimentally (Demberg and Keller, 2008). Unigram Logprob is expected to be negatively correlated with reading times, since more frequent words are expected to be easier to process. Sentence Position, Trial, and Rate index different kinds of change in the response over time and their relationship has not been carefully studied, in part for lack of deconvolutional regression tools. Although reading times tend to decrease over the course of the experiment (Baayen et al., 2018), suggesting an expected negative effect of Trial, this may be partially explained by temporal diffusion. For the present study, all predictors are rescaled by their standard deviations.¹⁰

In all reading experiments, data are partitioned into training (50%), development (25%) and test (25%) sets. Outlier filtering is also performed. For Natural Stories, following Shain et al. (2016), items are excluded if they have fixations shorter than 100ms or longer than 3000ms, if they start or end a sentence, or if subjects missed 4 or more subsequent comprehension questions. For Dundee, following van Schijndel and Schuler (2015), unfixated items are excluded as well as (1) items following saccades longer than 4 words and (2) starts and ends of sentences, screens, documents, and lines. For UCL, unfixated items are excluded as well as (1) items following saccades longer than 4 words and (2) sentence starts and ends. Partitioning and filtering are applied only to the response series. The entire predictor history remains visible to the model.

¹⁰ Except Rate, which has no variance and therefore cannot be scaled by its standard deviation of 0.
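As an illustration, the Natural Stories exclusions could be sketched in pandas as follows (column names such as fdur and missed_questions are hypothetical, not the corpus schema):

```python
import pandas as pd

def filter_natural_stories(items):
    """Sketch of the exclusions described above, assuming hypothetical
    columns fdur (ms), startofsentence, endofsentence (booleans), and
    missed_questions (per-subject comprehension misses)."""
    keep = (
        items["fdur"].between(100, 3000)
        & ~items["startofsentence"]
        & ~items["endofsentence"]
        & (items["missed_questions"] < 4)
    )
    # Applied to the response series only; the full predictor history
    # stays visible to the model.
    return items[keep]
```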

From a modeling perspective, the primary results of interest in Experiment 2 are the IRFs themselves and the insights they provide into human sentence processing. However, to check the reliability of the DTSR estimates, prediction quality on unseen data is compared to that of non-deconvolutional baseline models fitted with LME and GAM.¹¹ Both baselines are fitted with and without three preceding spillover positions for each predictor (baselines with spillover are designated throughout this paper with the suffix -S).¹²

¹¹ Formulae used to construct each model reported in this study are available in the associated code repository.

¹² This number of spillover positions is among the largest attested in the psycholinguistic literature because model complexity in LME and GAM increases substantially with each spillover position added, especially when by-subject random slopes are included for each spillover position for each variable. Indeed, many of the baseline models run for these experiments are already at the limits of tractability, as shown by the non-convergence reported in certain cells of Table 2. An advantage of the DTSR approach is that it can consider arbitrarily long histories at no cost to model complexity. While this permits DTSR to consider longer histories than its competitors (in these experiments, 128 timepoints vs. 4), DTSR is more constrained in its use of history since it must apply the same set of IRFs to all datapoints, while the baselines essentially fit separate models for each spillover position.

6.2 Results

The fitted IRFs for Natural Stories, Dundee, and UCL are shown in Figure 5. Effect sizes by corpus, computed here as the integral of each IRF over the first 10s, are shown in Table 1, along with 95% credible intervals (CI).
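The effect-size computation can be sketched as follows, under the assumption that the fitted coefficient scales the kernel, so the effect size is the coefficient times the kernel's integral over the first 10s:

```python
import numpy as np
from scipy.integrate import trapezoid

def effect_size(irf, params, coef, t_max=10.0, n=10_000):
    """Effect size as in Table 1: coefficient-scaled integral of the
    fitted IRF over the first t_max seconds (trapezoidal rule)."""
    t = np.linspace(0.0, t_max, n)
    return coef * trapezoid(irf(t, *params), t)

# e.g., with the ShiftedGamma sketch from Section 4:
# effect_size(shifted_gamma_irf, (2.0, 5.0, -0.5), 0.1)
```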


             Natural Stories              Dundee                       UCL
Predictor    Mean      2.5%     97.5%    Mean      2.5%     97.5%     Mean      2.5%     97.5%
Trial        -0.0053   -0.0057  -0.0049  -0.0085   -0.0010  -0.0071   —         —        —
Sent pos      0.0154    0.0148   0.0160   0.0004   -0.0013   0.0022    0.0340    0.0301   0.0379
Rate         -0.1853   -0.1858  -0.1848  -0.0649   -0.0659  -0.0640   -0.0806   -0.0832  -0.0781
Sac len      —         —        —         0.0249    0.0216   0.0207    0.0217    0.0209   0.0225
Word len      0.0020    0.0019   0.0021   0.0107    0.0105   0.0109   -8e-07    -1.7e-5   1.4e-5
Unigram       2.6e-6   -5e-6     2.2e-5  -2.0e-6   -3.9e-5   2.8e-5    1e-06    -4e-6     1.2e-5
5-gram        0.0057    0.0056   0.0059   0.0139    0.0134   0.0145    0.0159    0.0148   0.0171

Table 1: Effect sizes by corpus with 95% credible intervals based on 1024 posterior samples

The IRFs (curves) in these plots represent the expected change in the response over time from observing a unit impulse of the predictor. For example, the Dundee model estimates that observing a standard deviation of 5-gram surprisal engenders a slowdown of about 0.05 log ms instantaneously and a slowdown of about 0.03 log ms 250 ms after stimulus presentation. Because the response is reading time, positive IRFs represent inhibition and negative IRFs represent facilitation. Detailed interpretation of these curves is provided below in Section 6.3.

Table 2 shows prediction error from DTSR vs. baselines fitted to the same feature set. As shown, DTSR provides comparable or improved prediction performance relative to the baselines, even against the -S models, which are more heavily parameterized. DTSR outperforms LME models on unseen data across all corpora and generally improves upon or closely matches the performance of GAM (with no spillover). Compared to GAM-S (with three additional spillover positions), there is a clear advantage of DTSR for Natural Stories but not for the eye-tracking (ET) datasets. This is likely due to more pronounced temporal confounds in Natural Stories (especially of Rate, which the baseline models cannot estimate) compared to the other corpora.¹³ However, even in the absence of sufficiently diffuse effects to afford prediction improvements, the ability to measure diffusion directly is a major advantage of the DTSR model, since it can be used to detect the absence of diffusion in settings where it might in principle exist. Further discussion of the DTSR IRF estimates themselves is provided in Section 6.3.

As shown in Table 3, pooling across corpora, permutation testing reveals a significant improvement in MSE on test data of DTSR over each baseline system (p = 0.0001 for all comparisons).¹⁴

¹³ Note that GAM-S is more heavily parameterized than DTSR in that it fits multidimensional spline functions of each spillover position of each predictor. This makes it difficult to generalize information about effect timecourses from GAM fits, motivating the use of DTSR for studies in which timecourses are a quantity of interest.

¹⁴ To ensure comparability across corpora with different error variances, per-datum errors were first scaled by their standard deviations within each corpus. Standard deviations were computed over the joint set of error values in each pair of DTSR and baseline models.
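A paired sign-flip permutation test of this kind can be sketched as follows; this is a generic sketch over the scaled per-datum errors of footnote 14, not the paper's exact test implementation:

```python
import numpy as np

def paired_permutation_test(err_a, err_b, n_perm=10_000, seed=0):
    """Paired sign-flip permutation test on per-datum errors: under
    the null, which model produced which error in a pair is arbitrary,
    so the sign of each pairwise difference may be flipped freely."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(err_a) - np.asarray(err_b)
    observed = abs(diff.mean())
    # For realistically large datasets, generate flips in batches.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((flips * diff).mean(axis=1))
    # Add-one smoothing keeps the estimated p-value strictly positive,
    # e.g. 1/10001 ~= 0.0001 when no permutation beats the observed value.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```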

6.3 Discussion

Some key generalizations emerge from the DTSR estimates shown in Figure 5. The first is the pronounced facilitative role of Rate in all three models, but especially in Natural Stories. This means that fast reading in the recent past engenders fast reading in the present, because (1) observing a stimulus exerts a large-magnitude, diffuse, and negative (facilitative) influence on the subsequent response, and (2) the Rate contributions of the stimuli are additive. This result demonstrates an important pre-linguistic influence of inertia: a tendency toward slow overall change in base response rate. This effect is especially large-magnitude and diffuse in Natural Stories, which is self-paced reading and therefore differs in modality from the other datasets (which are eye-tracking). This suggests that SPR participants strongly habituate to repeated button pressing and stresses the importance of deconvolutional regression for bringing this low-level confound under control in analyzing SPR data, since it appears to have a large influence on the response and might otherwise confound model interpretation.

Second, effects are generally consistent with expectations: positive effects for Saccade Length, Word Length, and 5-gram Surprisal, and a negative effect of Trial. The null influence of Unigram Logprob is likely due to the presence in the model of both 5-gram Surprisal (which interpolates unigram probabilities) and Word Length (which is inversely correlated with Unigram Logprob). The biggest departure from prior expectations is the null estimate for Word Length in UCL. It appears that the contribution of Word Length in this corpus can be effectively explained by other variables.



         Natural Stories                Dundee                      UCL
System   Train     Dev       Test      Train     Dev      Test     Train     Dev      Test
LME      0.0803    0.0818    0.0815    0.2135    0.2133   0.2128   0.2613    0.2776   0.2561
LME-S    0.0789†   0.0807†   0.0804†   0.2099†   0.2103†  0.2095†  0.2509†   0.2754†  0.2557†
GAM      0.0798    0.0814    0.0810    0.2120    0.2116   0.2111   0.2576    0.2741   0.2538
GAM-S    0.0784    0.0802    0.0799    0.2083    0.2085   0.2078   0.2440    0.2661   0.2457
DTSR     0.0648    0.0655    0.0650    0.2100    0.2094   0.2088   0.2590    0.2752   0.2543

Table 2: Mean squared prediction error by system (daggers indicate convergence warnings)

Baseline   DTSR improvement (z-units)   p-value
LME        0.059                        0.0001***
LME-S      0.054                        0.0001***
GAM        0.057                        0.0001***
GAM-S      0.051                        0.0001***

Table 3: Overall pairwise significance of prediction improvement from DTSR vs. baselines


Third, the response estimates for Dundee and UCL (both of which are eye-tracking) are very similar, which suggests that DTSR is discovering replicable population-level features of the temporal profile for eye-tracking data.

Fourth, there is a general asymmetry in degree of diffusion between low-level perceptual-motor variables like Saccade Length and Word Length, whose responses tend to decay quickly, and the high-level 5-gram Surprisal variable, whose response tends to decay more slowly. This is consistent with expectations from the sentence processing literature. Perceptual-motor variables involve rapid bottom-up computation (e.g. visual processing or motor planning/execution) and are therefore not expected to have a diffuse response, while surprisal involves top-down computation of future words given context, which might be more computationally expensive and therefore engender a slower response. While this outcome is suggested e.g. by the aforementioned finding that spillover 1 winds up being a stronger position for a surprisal predictor in the Shain et al. (2016) models, DTSR permits direct investigation of these dynamics.

7 A note on hypothesis testing

As a Bayesian model, DTSR supports hypothesis testing by querying the variational posterior. For example, as shown in Table 1, the credible interval (CI) for 5-gram Surprisal in Natural Stories does not include zero (rejecting the null hypothesis of no effect), while the CI for Unigram Logprob does (failing to reject). To control for effects of multicollinearity, one could perform ablative tests of fitted null and alternative models using (1) likelihood comparison or (2) predictive performance on unseen data.
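Querying the posterior for such a test can be sketched generically as follows, where `samples` would be posterior draws of an effect size (e.g. the 1024 samples underlying Table 1):

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Central credible interval from posterior draws; the null of
    'no effect' is rejected when the interval excludes zero."""
    lo, hi = np.quantile(samples, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi, not (lo <= 0.0 <= hi)

# e.g. lo, hi, excludes_zero = credible_interval(posterior_draws)
```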

However, DTSR estimates are obtained through non-convex stochastic optimization, which complicates hypothesis testing because of possible estimation noise due to (1) convergence to a local but not global optimum, (2) imperfect convergence to the local optimum, and/or (3) Monte Carlo estimation of the test statistic via posterior sampling. It cannot therefore be guaranteed that hypothesis testing results are due to differences in model structure rather than differences in relative amounts of estimation noise introduced by the fitting procedure. Thus, p-values (and, consequently, hypothesis tests) based on direct comparison of DTSR models should be considered approximate.

However, even in situations where such uncertainty in hypothesis testing is not acceptable, DTSR is appropriate for certain important use cases. First, DTSR can be used for exploratory data analysis in order to empirically motivate the spillover structure of the linear model. Spillover variables can be excluded or included based on the degree of temporal diffusion revealed by DTSR, permitting construction of linear models that are both parsimonious and effective for controlling temporal diffusion. Second, DTSR can be used to fit a data transform which is then applied to the data prior to statistical analysis. This approach is identical in spirit to e.g. the use of the canonical HRF to convolve predictors in fMRI models prior to linear regression. However, since DTSR is domain-general, it can be a valuable component in any analysis toolchain for time series.

8 Conclusion

This paper presented a variational Bayesian deconvolutional time series regression method as a solution to the problem of temporal diffusion in psycholinguistic time series data and applied it to both synthetic and human responses in order to better understand and control for latent temporal dynamics. Results showed that DTSR can yield a plausible, replicable, parsimonious, insightful, and predictive model of a complex dynamical system like the human sentence processing response and therefore support the use of DTSR for psycholinguistic time series modeling. While the present study explored the use of DTSR to understand human reading times, DTSR can in principle also be used to deconvolve other kinds of response variables, such as the HRF in fMRI modeling or the power/coherence response in oscillatory measures like electroencephalography, suggesting a rich array of potential applications of DTSR in computational psycholinguistics.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. This work was supported by National Science Foundation grants #1551313 and #1816891. All views expressed are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous distributed systems.

Harald Baayen, Shravan Vasishth, Reinhold Kliegl, and Douglas Bates. 2017. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 94(Supplement C):206–234.

R. Harald Baayen, Jacolien van Rij, Cecile de Cat, and Simon Wood. 2018. Autocorrelated errors in experimental data in the language sciences: Some solutions offered by Generalized Additive Mixed Models. In Dirk Speelman, Kris Heylen, and Dirk Geeraerts, editors, Mixed Effects Regression Models in Linguistics. Springer, Berlin.

Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Timothy Dozat. 2016. Incorporating Nesterov momentum into Adam. In ICLR Workshop.

Kate Erlich and Keith Rayner. 1983. Pronoun assignment and semantic integration during reading: Eye movements and immediacy of processing. Journal of Verbal Learning & Verbal Behavior, 22:75–87.

Stefan L. Frank, Irene Fernandez Monsalve, Robin L. Thompson, and Gabriella Vigliocco. 2013. Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45:1182–1190.

Karl J. Friston, Oliver Josephs, Geraint Rees, and Robert Turner. 1998. Nonlinear event-related responses in fMRI. Magnetic Resonance in Medicine, pages 41–52.

Richard Futrell, Edward Gibson, Hal Tily, Anastasia Vishnevetsky, Steve Piantadosi, and Evelina Fedorenko. 2018. The Natural Stories corpus. In LREC 2018.

Edward Goldstein, Benjamin J. Cowling, Allison E. Aiello, Saki Takahashi, Gary King, Ying Lu, and Marc Lipsitch. 2011. Estimating incidence curves of several infections using symptom surveillance data. PLOS ONE, 6(8):1–8.

C. Goutte, F. A. Nielsen, and K. H. Hansen. 2000. Modeling the hemodynamic response in fMRI using smooth FIR filters. IEEE Transactions on Medical Imaging, 19(12):1188–1201.

Gary H. Glover. 1999. Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage, 9:416–429.

Trevor Hastie and Robert Tibshirani. 1986. Generalized additive models. Statistical Science, 1(3):297–310.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria.

Alan Kennedy, James Pynte, and Robin Hill. 2003. The Dundee corpus. In Proceedings of the 12th European Conference on Eye Movement.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Martin Lindquist and Tor Wager. 2007. Validity and power in hemodynamic response modeling: A comparison study and a new approach. Human Brain Mapping, 28:764–784.


Martin A. Lindquist, Ji Meng Loh, Lauren Y. Atlas, and Tor D. Wager. 2009. Modeling the hemodynamic response function in fMRI: Efficiency, bias and mis-modeling. NeuroImage, 45(1, Supplement 1):S187–S198.

Yurii E. Nesterov. 1983. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword LDC2009T13.

Fabian Pedregosa, Michael Eickenberg, Philippe Ciuciu, Alexandre Gramfort, and Bertrand Thirion. 2014. Data-driven HRF estimation for encoding and decoding models. NeuroImage, 104.

V. A. Ramey. 2016. Macroeconomic shocks and their propagation. In Handbook of Macroeconomics, volume 2, pages 71–162. Elsevier.

Cory Shain, Marten van Schijndel, Richard Futrell, Edward Gibson, and William Schuler. 2016. Memory access during incremental sentence processing causes reading time latency. In Proceedings of the Computational Linguistics for Linguistic Complexity Workshop, pages 49–58. Association for Computational Linguistics.

Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.

Marten van Schijndel and William Schuler. 2015. Hierarchic syntax improves reading time prediction. In Proceedings of NAACL-HLT 2015. Association for Computational Linguistics.

B. Douglas Ward. 2006. Deconvolution analysis of fMRI time series data.

Simon N. Wood. 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, Boca Raton.

A Definition of mixed effects DTSR

For expository purposes, in Section 3 the DTSR model was defined only for fixed effects. However, DTSR is compatible with mixed modeling and the implementation used here supports random effects in the model intercepts, coefficients, and IRF parameters. The full mixed-effects DTSR equations are presented below.

The definitions of $\mathbf{X}$, $\mathbf{y}$, $\mu$, $\sigma^2$, $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$, $\mathbf{d}$, $\mathbf{F}$, $M$, $N$, $K$, and $R$ presented in Section 3 are retained for the mixed model definition. The remaining variables and equations must be redefined to some extent. Mixed-effects DTSR models additionally contain the following parameters:

• a vector $\mathbf{o} \in \mathbb{R}^{O}$ of $O$ random intercepts

• a vector $\mathbf{u} \in \mathbb{R}^{U}$ of $U$ fixed coefficients

• a vector $\mathbf{v} \in \mathbb{R}^{V}$ of $V$ random coefficients

• a matrix $\mathbf{A} \in \mathbb{R}^{R \times L}$ of $R$ fixed IRF kernel parameters for $L$ fixed impulse vectors

• a matrix $\mathbf{B} \in \mathbb{R}^{R \times W}$ of $R$ random IRF kernel parameters for $W$ random impulse vectors

Random parameters $\mathbf{o}$, $\mathbf{v}$, and $\mathbf{B}$ are constrained to be zero-centered within each random grouping factor.

To support mixed modeling, the fixed and random effects must first be combined using additional utility matrices. Let $\mathbf{O} \in \{0,1\}^{N \times O}$ be a mask matrix for random intercepts. A vector $\mathbf{q} \in \mathbb{R}^{N}$ of intercepts is:

$$\mathbf{q} \overset{\mathrm{def}}{=} \mu + \mathbf{O}\mathbf{o} \qquad (7)$$

Let $\mathbf{U} \in \{0,1\}^{L \times U}$ be an indicator matrix for fixed coefficients, $\mathbf{V} \in \{0,1\}^{L \times V}$ be an indicator matrix for random coefficients, and $\mathbf{V}' \in \{0,1\}^{N \times V}$ be a mask matrix for random coefficients. A matrix $\mathbf{Q} \in \mathbb{R}^{N \times L}$ of coefficients is:

$$\mathbf{Q} \overset{\mathrm{def}}{=} \mathbf{1}\,(\mathbf{U}\mathbf{u})^{\top} + \mathbf{V}' \operatorname{diag}(\mathbf{v})\, \mathbf{V}^{\top} \qquad (8)$$

Let $\mathbf{W} \in \{0,1\}^{L \times W}$ be an indicator matrix for random IRF parameters and $\mathbf{W}'_1, \dots, \mathbf{W}'_N \in \{0,1\}^{R \times W}$ be mask matrices for random IRF parameters. Then matrices $\mathbf{P}_n \in \mathbb{R}^{R \times L}$ for $n \in \{1, 2, \dots, N\}$ are:

$$\mathbf{P}_n \overset{\mathrm{def}}{=} \mathbf{A} + (\mathbf{W}'_n \odot \mathbf{B})\,\mathbf{W}^{\top} \qquad (9)$$

In each equation above, the random effects parameters are masked using the random effects filter associated with each data point. $\mathbf{Q}$ and $\mathbf{P}_n$ are then transformed into the impulse vector space using the indicator matrices $\mathbf{V}$ and $\mathbf{W}$, respectively. This procedure sums the random effects associated with each data point and adds them to the population-level parameters.
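A dense NumPy transcription of Eqs. (7)–(9) may help make the masking explicit (a sketch; the actual implementation operates on sparse Tensorflow graphs):

```python
import numpy as np

def mixed_effects_parameters(mu, o, u, v, A, B, O, U, V, V_mask, W, W_masks):
    """Dense transcription of Eqs. (7)-(9). Shapes follow the text:
    O (N x O), U (L x U), V (L x V), V_mask = V' (N x V), W (L x W),
    W_masks = [W'_1, ..., W'_N], each (R x W)."""
    q = mu + O @ o                                    # Eq. (7): intercepts
    N, L = O.shape[0], U.shape[0]
    Q = (np.ones((N, 1)) * (U @ u)[None, :]
         + V_mask @ np.diag(v) @ V.T)                 # Eq. (8): coefficients
    P = [A + (W_n * B) @ W.T for W_n in W_masks]      # Eq. (9): IRF parameters
    return q, Q, P
```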

To define the convolution step, let $g_l$ for $l \in \{1, 2, \dots, L\}$ be parametric IRF kernels, one for each impulse. Convolution of $\mathbf{X}$ with each IRF kernel is performed by premultiplying the inputs $\mathbf{X}$ with sparse matrix $\mathbf{G}_l \in \mathbb{R}^{N \times M}$ for $l \in \{1, 2, \dots, L\}$:

$$(\mathbf{G}_l)_{[n,*]} \overset{\mathrm{def}}{=} g_l\!\left(\mathbf{b}_{[n]} - \mathbf{a}^{\top};\, (\mathbf{P}_n)_{[*,l]}\right) \odot \mathbf{F}_{[n,*]} \qquad (10)$$

Finally, let $\mathbf{L} \in \{0,1\}^{K \times L}$ be an indicator matrix mapping the $K$ predictors of $\mathbf{X}$ to the corresponding $L$ impulse vectors of the model.¹⁵ The convolution that yields the design matrix of convolved predictors $\mathbf{X}' \in \mathbb{R}^{N \times L}$ is then defined using a product of the convolution matrices $\mathbf{G}_l$, the design matrix $\mathbf{X}$, and the impulse indicator $\mathbf{L}$:

$$\mathbf{X}'_{[*,l]} \overset{\mathrm{def}}{=} \mathbf{G}_l \mathbf{X} \mathbf{L}_{[*,l]} \qquad (11)$$

The full model mean is the sum of (1) the intercepts and (2) the sum-product of the convolved predictors $\mathbf{X}'$ with the coefficient parameters $\mathbf{Q}$:

$$\mathbf{y} \sim \mathcal{N}\!\left(\mathbf{q} + (\mathbf{X}' \odot \mathbf{Q})\,\mathbf{1},\, \sigma^2\right) \qquad (12)$$

¹⁵ Predictors and impulse vectors are distinguished because in principle multiple IRFs can be applied to the same predictor. In the usual case where this distinction is not needed, $\mathbf{L}$ is identity and $K = L$.