arXiv:2004.11234v1 [math.OC] 22 Apr 2020

Memory and forecasting capacities of nonlinear recurrent networks

Lukas Gonon^1, Lyudmila Grigoryeva^2, and Juan-Pablo Ortega^{3,4}

Abstract

The notion of memory capacity, originally introduced for echo state and linear networks with independent inputs, is generalized to nonlinear recurrent networks with stationary but dependent inputs. The presence of dependence in the inputs makes natural the introduction of the network forecasting capacity, which measures the possibility of forecasting time series values using network states. Generic bounds for memory and forecasting capacities are formulated in terms of the number of neurons of the network and the autocovariance function of the input. These bounds generalize well-known estimates in the literature to a dependent-inputs setup. Finally, for linear recurrent networks with independent inputs it is proved that the memory capacity is given by the rank of the associated controllability matrix.

Key Words: memory capacity, forecasting capacity, recurrent neural network, reservoir computing, echo state network, ESN, linear recurrent network, machine learning, fading memory property, echo state property.

1 Introduction

Memory capacities have been introduced in [Jaeg 02] in the context of recurrent neural networks in general and of echo state networks (ESNs) [Matt 92, Matt 94, Jaeg 04] in particular, as a way to quantify the amount of information contained in the states of a state-space system in relation with past inputs.

In the original definition, the memory capacity was defined as the sum of the coefficients of determination of the different linear regressions that take the state of the system at a given time as covariates and the values of the input at a given point in the past as dependent variables. This notion has been the subject of much research in the reservoir computing literature [Whit 04, Gang 08, Herm 10, Damb 12, Bara 14, Coui 16, Fark 16, Goud 16, Char 17, Xue 17, Verz 19], where most of the efforts have been concentrated on linear and echo state systems. Analytical expressions for the capacity of time-delay reservoirs have been formulated in [Grig 15, Grig 16a], and various proposals for optimized reservoir architectures can be obtained by maximizing this parameter [Orti 12, Grig 14, Orti 19, Orti 20]. Additionally, memory capacities have been extensively compared with other related concepts like Fisher information-based criteria [Tino 13, Livi 16, Tino 18].

^1 Universität Sankt Gallen. Faculty of Mathematics and Statistics. Bodanstrasse 6. CH-9000 Sankt Gallen. Switzerland. [email protected]

^2 Department of Mathematics and Statistics. Universität Konstanz. Box 146. D-78457 Konstanz. Germany. [email protected]

^3 Universität Sankt Gallen. Faculty of Mathematics and Statistics. Bodanstrasse 6. CH-9000 Sankt Gallen. Switzerland. [email protected]

^4 Centre National de la Recherche Scientifique (CNRS). France.


All the above-mentioned works consider exclusively white noise as input signals and it is, to our knowledge, only in [Grig 16b] (under strong high-order stationarity assumptions) and in [Marz 17] (for linear recurrent networks) that the case involving dependent inputs has been treated. Since most signals that one encounters in applications exhibit some sort of temporal dependence, studying this case in detail is of obvious practical importance. Moreover, the presence of dependence makes it pertinent to consider not only memory capacities, but also the possibility of forecasting time series values using network states, that is, forecasting capacities. This notion was introduced for the first time in [Marz 17] and studied in detail for linear recurrent networks. That work shows, in particular, that linear networks optimized for memory capacity do not necessarily have a good forecasting capacity and vice versa.

The main contribution of this paper is the extension of the results on memory and forecasting capacities available in the literature for either linear recurrent or echo state networks, or for independent inputs, to general nonlinear systems and to dependent inputs that are only assumed to be stationary. More explicitly, it is known since [Jaeg 02] that the memory capacity of an ESN or a linear recurrent network defined using independent inputs is bounded above by its number of output neurons or, equivalently, by the dimensionality of the corresponding state-space representation. Moreover, in the linear case, it can be shown [Jaeg 02] that this capacity is maximal if and only if Kalman's controllability rank condition [Kalm 10, Sont 91, Sont 98] is satisfied. This paper extends those results as follows:

• Theorem 3.4 states generic bounds for the memory and the forecasting capacities of a generic recurrent network with linear readouts and stationary inputs, in terms of the dimensionality of the corresponding state-space representation and the autocovariance function of the input, which is assumed to be second-order stationary. These bounds reduce to those in [Jaeg 02] when the inputs are independent.

• Section 4 is exclusively devoted to the linear case. In that simpler situation, explicit expressions for the memory and forecasting capacities can be stated (see Proposition 4.1) in terms of the matrix parameters of the network and the autocorrelation properties of the input.

• In the classical linear case with independent inputs, the expressions in the previous point yield interesting relations (see Proposition 4.3) between maximum memory capacity, spectral properties of the connectivity matrix of the network, and the so-called Kalman characterization of the controllability of a linear system [Kalm 10]. This last condition has already been mentioned in [Jaeg 02] in relation with maximum capacity.

• Theorem 4.4 proves that the memory capacity of a linear recurrent network with independent inputs is given by the rank of its controllability matrix. The relation between maximum capacity of a system and the full rank of the controllability matrix established in [Jaeg 02] is obviously a particular case of this statement. Even though, to our knowledge, this is the first rigorous proof of the relation between network memory and the rank of the controllability matrix, that link has for a long time been part of the reservoir computing folklore. In particular, recent contributions are dedicated to the design of ingenious configurations that maximize that rank [Acei 17, Verz 20].

2 Recurrent neural networks with stationary inputs

The results in this paper apply to recurrent neural networks determined by state-space equations of the form:

$$\mathbf{x}_t = F(\mathbf{x}_{t-1}, \mathbf{z}_t), \qquad (2.1)$$
$$\mathbf{y}_t = h(\mathbf{x}_t) := W^\top \mathbf{x}_t + \mathbf{a}. \qquad (2.2)$$

These two relations form a state-space system, where the map $F: D_N \times D_d \longrightarrow D_N$, with $D_N \subset \mathbb{R}^N$ and $D_d \subset \mathbb{R}^d$, $N, d \in \mathbb{N}$, is called the state map and $h: \mathbb{R}^N \longrightarrow \mathbb{R}^m$ the readout or observation map which, all along this paper, will be assumed to be affine, that is, determined by a matrix $W \in \mathbb{M}_{N,m}$ and a vector $\mathbf{a} \in \mathbb{R}^m$, $m \in \mathbb{N}$. The inputs $(\mathbf{z}_t)_{t \in \mathbb{Z}}$ of the system, with $\mathbf{z}_t \in \mathbb{R}^d$, will in most cases be infinite paths of a discrete-time stochastic process.

We shall focus on state-space systems of the type (2.1)-(2.2) that determine an input/output system. This happens in the presence of the so-called echo state property (ESP), that is, when for any $\mathbf{z} \in (D_d)^{\mathbb{Z}}$ there exists a unique $\mathbf{y} \in (\mathbb{R}^m)^{\mathbb{Z}}$ such that (2.1)-(2.2) hold. In that case, we talk about the state-space filter $U^F_h: (D_d)^{\mathbb{Z}} \longrightarrow (\mathbb{R}^m)^{\mathbb{Z}}$ associated to the state-space system (2.1)-(2.2), defined by

$$U^F_h(\mathbf{z}) := \mathbf{y},$$

where $\mathbf{z} \in (D_d)^{\mathbb{Z}}$ and $\mathbf{y} \in (\mathbb{R}^m)^{\mathbb{Z}}$ are linked by (2.1)-(2.2) via the ESP. If the ESP holds at the level of the state equation (2.1), we can define a state filter $U^F: (D_d)^{\mathbb{Z}} \longrightarrow (D_N)^{\mathbb{Z}}$ and, in that case, we have that

$$U^F_h := h \circ U^F.$$

It is easy to show that state and state-space filters are automatically causal and time-invariant (see [Grig 18, Proposition 2.1]) and hence it suffices to work with their restrictions $U^F_h: (D_d)^{\mathbb{Z}_-} \longrightarrow (\mathbb{R}^m)^{\mathbb{Z}_-}$ to semi-infinite inputs and outputs. Moreover, $U^F_h$ determines a state-space functional $H^F_h: (D_d)^{\mathbb{Z}_-} \longrightarrow \mathbb{R}^m$ as $H^F_h(\mathbf{z}) := U^F_h(\mathbf{z})_0$, for all $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ (the same applies to $U^F$ and $H^F$ when the ESP holds at the level of the state equation). In the sequel we use the symbol $\mathbb{Z}_-$ to denote the negative integers including zero and $\mathbb{Z}^-$ the negative integers without zero.

The echo state property has received much attention in the context of the so-called echo state networks (ESNs) [Matt 92, Matt 93, Matt 94, Jaeg 04] (see, for instance, [Jaeg 10, Bueh 06, Yild 12, Bai 12, Wain 16, Manj 13, Gall 17]). Sufficient conditions for the ESP to hold in general systems have been formulated in [Grig 18, Grig 19, Gono 19], in most cases assuming that the state map is a contraction in the state variable. We now recall a result (see [Grig 19, Theorem 12] and [Gono 19, Proposition 1]) that ensures the ESP as well as a continuity property of importance for the sequel.

Proposition 2.1 Let $F: D_N \times D_d \longrightarrow D_N$ be a continuous state map such that $D_N$ is a compact subset of $\mathbb{R}^N$ and $F$ is a contraction on the first entry with constant $0 < c < 1$, that is,

$$\|F(\mathbf{x}_1, \mathbf{z}) - F(\mathbf{x}_2, \mathbf{z})\| \leq c\,\|\mathbf{x}_1 - \mathbf{x}_2\|,$$

for all $\mathbf{x}_1, \mathbf{x}_2 \in D_N$, $\mathbf{z} \in D_d$. Then, the associated system has the echo state property for any input in $(D_d)^{\mathbb{Z}_-}$. The associated filter $U^F: (D_d)^{\mathbb{Z}_-} \longrightarrow (D_N)^{\mathbb{Z}_-}$ is continuous with respect to the product topologies in $(D_d)^{\mathbb{Z}_-}$ and $(D_N)^{\mathbb{Z}_-}$.
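To make the setting concrete, the following sketch simulates a small state-space system of the form (2.1)-(2.2); the tanh state map, the random matrices, and all numerical choices are illustrative assumptions rather than prescriptions of the text. Rescaling $A$ so that $\sigma_{\max}(A) < 1$ makes the state map a contraction in its first argument, so Proposition 2.1 applies.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 50, 1, 2000                      # neurons, input dimension, path length

# Illustrative reservoir: rescaling A so that its largest singular value is 0.9
# makes x -> tanh(Ax + Cz) a contraction in x with constant 0.9 < 1 (tanh is
# 1-Lipschitz), so Proposition 2.1 yields the echo state property.
A = rng.normal(size=(N, N))
A *= 0.9 / np.linalg.norm(A, 2)            # operator (spectral) norm set to 0.9
C = rng.normal(size=(N, d))
W, a = rng.normal(size=N), 0.0             # affine readout h(x) = W^T x + a

z = rng.uniform(-1.0, 1.0, size=(T, d))    # an input path
x = np.zeros(N)
states, outputs = [], []
for t in range(T):
    x = np.tanh(A @ x + C @ z[t])          # state equation (2.1)
    states.append(x.copy())
    outputs.append(W @ x + a)              # readout equation (2.2)

print("sigma_max(A) =", np.linalg.norm(A, 2))
print("state trajectory shape:", np.array(states).shape)
```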

State-space morphisms. In this paragraph we discuss how maps between state spaces can produce different state-space representations for a given filter. Consider the state-space systems determined by the two pairs $(F_i, h_i)$, $i \in \{1, 2\}$, with $F_i: D_{N_i} \times D_d \longrightarrow D_{N_i}$ and $h_i: D_{N_i} \longrightarrow \mathbb{R}^m$.

Definition 2.2 A map $f: D_{N_1} \longrightarrow D_{N_2}$ is a morphism between the systems $(F_1, h_1)$ and $(F_2, h_2)$ whenever it satisfies the following two properties:

(i) System equivariance: $f(F_1(\mathbf{x}_1, \mathbf{z})) = F_2(f(\mathbf{x}_1), \mathbf{z})$, for all $\mathbf{x}_1 \in D_{N_1}$ and $\mathbf{z} \in D_d$.

(ii) Readout invariance: $h_1(\mathbf{x}_1) = h_2(f(\mathbf{x}_1))$, for all $\mathbf{x}_1 \in D_{N_1}$.

When the map $f$ has an inverse $f^{-1}$ and this inverse is also a morphism between the systems determined by the pairs $(F_1, h_1)$ and $(F_2, h_2)$, we say that $f$ is a system isomorphism and that the systems $(F_1, h_1)$ and $(F_2, h_2)$ are isomorphic. Given a system $F_1: D_{N_1} \times D_d \longrightarrow D_{N_1}$, $h_1: D_{N_1} \longrightarrow \mathbb{R}^m$ and a bijection $f: D_{N_1} \longrightarrow D_{N_2}$, the map $f$ is a system isomorphism with respect to the system $F_2: D_{N_2} \times D_d \longrightarrow D_{N_2}$, $h_2: D_{N_2} \longrightarrow \mathbb{R}^m$ defined by

$$F_2(\mathbf{x}_2, \mathbf{z}) := f(F_1(f^{-1}(\mathbf{x}_2), \mathbf{z})), \quad \text{for all } \mathbf{x}_2 \in D_{N_2},\ \mathbf{z} \in D_d, \qquad (2.3)$$
$$h_2(\mathbf{x}_2) := h_1(f^{-1}(\mathbf{x}_2)), \quad \text{for all } \mathbf{x}_2 \in D_{N_2}. \qquad (2.4)$$

Proposition 2.3 Let $(F_1, h_1)$ and $(F_2, h_2)$ be the two systems introduced above and let $f: D_{N_1} \longrightarrow D_{N_2}$ be a map. Then:

(i) If $f$ is system equivariant and $\mathbf{x}^1 \in (D_{N_1})^{\mathbb{Z}_-}$ is a solution for the state system associated to $F_1$ and the input $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$, then so is $(f(\mathbf{x}^1_t))_{t \in \mathbb{Z}_-} \in (D_{N_2})^{\mathbb{Z}_-}$ for the system associated to $F_2$ and the same input.

(ii) Suppose that the system determined by $(F_2, h_2)$ has the echo state property and assume that the state system determined by $F_1$ has at least one solution for each element $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$. If $f$ is a morphism between $(F_1, h_1)$ and $(F_2, h_2)$, then $(F_1, h_1)$ has the echo state property and, moreover,

$$U^{F_1}_{h_1} = U^{F_2}_{h_2}. \qquad (2.5)$$

(iii) If $f$ is a system isomorphism, then the implications in the previous two points are reversible.

Proof. (i) By hypothesis, $\mathbf{x}^1_t = F_1(\mathbf{x}^1_{t-1}, \mathbf{z}_t)$, for all $t \in \mathbb{Z}_-$. By system equivariance,

$$f(\mathbf{x}^1_t) = f(F_1(\mathbf{x}^1_{t-1}, \mathbf{z}_t)) = F_2(f(\mathbf{x}^1_{t-1}), \mathbf{z}_t), \quad \text{for all } t \in \mathbb{Z}_-,$$

as required.

(ii) In order to show that (2.5) holds, it suffices to prove that given any $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$, any solution $(\mathbf{y}^1, \mathbf{x}^1) \in (\mathbb{R}^m)^{\mathbb{Z}_-} \times (D_{N_1})^{\mathbb{Z}_-}$ of $(F_1, h_1)$ associated to $\mathbf{z}$, with $\mathbf{y}^1 := (h_1(\mathbf{x}^1_t))_{t \in \mathbb{Z}_-}$, which exists by the hypothesis on $F_1$, coincides with the unique solution $U^{F_2}_{h_2}(\mathbf{z})$ for the system $(F_2, h_2)$. Indeed, for any $t \in \mathbb{Z}_-$,

$$\mathbf{y}^1_t = h_1(F_1(\mathbf{x}^1_{t-1}, \mathbf{z}_t)) = h_2(f(F_1(\mathbf{x}^1_{t-1}, \mathbf{z}_t))) = h_2(F_2(f(\mathbf{x}^1_{t-1}), \mathbf{z}_t)).$$

Here, the second equality follows from readout invariance and the third one from system equivariance. This implies that $(\mathbf{y}^1, (f(\mathbf{x}^1_t))_{t \in \mathbb{Z}_-})$ is a solution of the system determined by $(F_2, h_2)$ for the input $\mathbf{z}$. By hypothesis, $(F_2, h_2)$ has the echo state property and hence $\mathbf{y}^1 = U^{F_2}_{h_2}(\mathbf{z})$ and, since $\mathbf{z} \in (D_d)^{\mathbb{Z}_-}$ is arbitrary, the result follows. Part (iii) is straightforward.

Input and output stochastic processes. We now fix a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ on which all random variables are defined. The triple consists of the sample space $\Omega$, which is the set of possible outcomes, the $\sigma$-algebra $\mathcal{A}$ (a set of subsets of $\Omega$, the events), and a probability measure $\mathbb{P}: \mathcal{A} \longrightarrow [0,1]$. The input and target signals are modeled by discrete-time stochastic processes $\mathbf{Z} = (\mathbf{Z}_t)_{t \in \mathbb{Z}_-}$ and $\mathbf{Y} = (\mathbf{Y}_t)_{t \in \mathbb{Z}_-}$ taking values in $D_d \subset \mathbb{R}^d$ and $\mathbb{R}^m$, respectively. Moreover, we write $\mathbf{Z}(\omega) = (\mathbf{Z}_t(\omega))_{t \in \mathbb{Z}_-}$ and $\mathbf{Y}(\omega) = (\mathbf{Y}_t(\omega))_{t \in \mathbb{Z}_-}$ for each outcome $\omega \in \Omega$ to denote the realizations or sample paths of $\mathbf{Z}$ and $\mathbf{Y}$, respectively. Since $\mathbf{Z}$ can be seen as a random sequence in $D_d \subset \mathbb{R}^d$, we write interchangeably $\mathbf{Z}: \mathbb{Z}_- \times \Omega \longrightarrow D_d$ and $\mathbf{Z}: \Omega \longrightarrow (D_d)^{\mathbb{Z}_-}$. The latter is necessarily measurable with respect to the Borel $\sigma$-algebra induced by the product topology in $(D_d)^{\mathbb{Z}_-}$. The same applies to the analogous assignments involving $\mathbf{Y}$.

We will most of the time work under stationarity hypotheses. We recall that the discrete-time process $\mathbf{Z}: \Omega \longrightarrow (D_d)^{\mathbb{Z}_-}$ is stationary whenever $T_{-\tau}(\mathbf{Z}) \stackrel{d}{=} \mathbf{Z}$, for any $\tau \in \mathbb{Z}_-$, where the symbol $\stackrel{d}{=}$ stands for equality in distribution and $T_{-\tau}: (\mathbb{R}^d)^{\mathbb{Z}_-} \longrightarrow (\mathbb{R}^d)^{\mathbb{Z}_-}$ is the time delay operator defined by $T_{-\tau}(\mathbf{z})_t := \mathbf{z}_{t+\tau}$ for any $t \in \mathbb{Z}_-$.

The definition of stationarity that we just formulated is usually known in the time series literature as strict stationarity [Broc 06]. When a process $\mathbf{Z}$ has second-order moments, that is, $\mathbf{Z}_t \in L^2(\Omega, \mathbb{R}^d)$, $t \in \mathbb{Z}_-$, then strict stationarity implies so-called second-order stationarity. We recall that a square-summable process $\mathbf{Z}: \Omega \longrightarrow (D_d)^{\mathbb{Z}_-}$ is second-order stationary whenever (i) there exists a constant $\boldsymbol{\mu} \in \mathbb{R}^d$ such that $E[\mathbf{Z}_t] = \boldsymbol{\mu}$, for all $t \in \mathbb{Z}_-$ (mean stationarity), and (ii) the autocovariance matrices $\mathrm{Cov}(\mathbf{Z}_t, \mathbf{Z}_{t+h})$ depend only on $h \in \mathbb{Z}$ and not on $t \in \mathbb{Z}_-$ (autocovariance stationarity), and we can hence define the autocovariance function $\gamma: \mathbb{Z} \longrightarrow \mathbb{R}$ as $\gamma(h) := \mathrm{Cov}(Z_t, Z_{t+h})$, which is necessarily even [Broc 06, Proposition 1.5.1]. If $\mathbf{Z}$ is mean stationary and condition (ii) only holds for $h = 0$, we say that $\mathbf{Z}$ is covariance stationary. Second-order stationarity and stationarity are only equivalent for Gaussian processes. If $\mathbf{Z}$ is autocovariance stationary and $\gamma(h) = 0$ for any non-zero $h \in \mathbb{Z}$, then we say that $\mathbf{Z}$ is a white noise.
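The bounds obtained below are expressed through the autocovariance function $\gamma$. As a purely illustrative aside, not part of the original exposition, $\gamma$ can be estimated from a sample path with the standard biased sample autocovariance; the sketch uses an AR(1) input as an example of a stationary but dependent process, and all parameter values are assumptions made for the example.

```python
import numpy as np

def sample_autocovariance(z, max_lag):
    """Biased sample autocovariance: gamma_hat(h) = (1/T) * sum_t (z_t - m)(z_{t+h} - m)."""
    z = np.asarray(z, dtype=float)
    T, m = len(z), z.mean()
    return np.array([np.sum((z[:T - h] - m) * (z[h:] - m)) / T
                     for h in range(max_lag + 1)])

# AR(1) example: z_t = phi * z_{t-1} + eps_t is stationary with dependent values
# and theoretical autocovariance gamma(h) = phi^|h| / (1 - phi^2) for unit-variance noise.
rng = np.random.default_rng(1)
phi, T = 0.6, 100_000
z = np.zeros(T)
for t in range(1, T):
    z[t] = phi * z[t - 1] + rng.normal()

print(np.round(sample_autocovariance(z, 5), 3))           # empirical gamma_hat(0..5)
print(np.round(phi ** np.arange(6) / (1 - phi ** 2), 3))  # theoretical values
```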

Corollary 2.4 Let $F: D_N \times D_d \longrightarrow D_N$ be a state map that satisfies the hypotheses of Proposition 2.1 or that, more generally, has the echo state property and whose associated filter $U^F: (D_d)^{\mathbb{Z}_-} \longrightarrow (D_N)^{\mathbb{Z}_-}$ is continuous with respect to the product topologies in $(D_d)^{\mathbb{Z}_-}$ and $(D_N)^{\mathbb{Z}_-}$. If the input process $\mathbf{Z}: \Omega \longrightarrow (D_d)^{\mathbb{Z}_-}$ is stationary, then so is the output $\mathbf{X} := U^F(\mathbf{Z}): \Omega \longrightarrow (D_N)^{\mathbb{Z}_-}$, as well as the joint processes $(T_{-\tau}(\mathbf{X}), \mathbf{Z})$ and $(\mathbf{X}, T_{-\tau}(\mathbf{Z}))$, for any $\tau \in \mathbb{Z}_-$.

Proof. The continuity (and hence the measurability) hypothesis on $U^F$ proves that $\mathbf{X} := U^F(\mathbf{Z})$ is stationary (see [Kall 02, page 157]). The joint processes $(T_{-\tau}(\mathbf{X}), \mathbf{Z})$ and $(\mathbf{X}, T_{-\tau}(\mathbf{Z}))$ are also stationary, as they are the images of $\mathbf{Z}$ by the measurable maps $(T_{-\tau} \circ U^F) \times I_{(\mathbb{R}^d)^{\mathbb{Z}_-}}$ and $U^F \times T_{-\tau}$, respectively.

The next result shows that if a state-space system has covariance stationary outputs and the corresponding covariance matrix is non-singular, then there exists an isomorphic state-space representation whose states have mean zero and orthonormal components. The stationarity condition in this result can be obtained, for example, by using Corollary 2.4.

Proposition 2.5 (Standardization of state-space realizations) Consider a state-space system as in (2.1)-(2.2) and suppose that the input process $\mathbf{Z}: \Omega \longrightarrow (D_d)^{\mathbb{Z}_-}$ is such that the associated state process $\mathbf{X}: \Omega \longrightarrow (D_N)^{\mathbb{Z}_-}$ is covariance stationary. Let $\boldsymbol{\mu} := E[\mathbf{X}_t]$ and suppose that the covariance matrix $\Gamma_{\mathbf{X}} := \mathrm{Cov}(\mathbf{X}_t, \mathbf{X}_t)$ is non-singular. Then, the map $f: \mathbb{R}^N \longrightarrow \mathbb{R}^N$ given by $f(\mathbf{x}) := \Gamma_{\mathbf{X}}^{-1/2}(\mathbf{x} - \boldsymbol{\mu})$ is a system isomorphism between the system (2.1)-(2.2) and the one with state map

$$\widetilde{F}(\mathbf{x}, \mathbf{z}) := \Gamma_{\mathbf{X}}^{-1/2}\left(F(\Gamma_{\mathbf{X}}^{1/2}\mathbf{x} + \boldsymbol{\mu}, \mathbf{z}) - \boldsymbol{\mu}\right) \qquad (2.6)$$

and readout

$$\widetilde{h}(\mathbf{x}) := (\Gamma_{\mathbf{X}}^{-1/2} W)^\top \mathbf{x} + W^\top \boldsymbol{\mu} + \mathbf{a}. \qquad (2.7)$$

Moreover, the state process $\widetilde{\mathbf{X}}$ associated to the system $\widetilde{F}$ and the input $\mathbf{Z}$ is variance stationary and

$$E[\widetilde{\mathbf{X}}_t] = \mathbf{0} \quad \text{and} \quad \mathrm{Cov}(\widetilde{\mathbf{X}}_t, \widetilde{\mathbf{X}}_t) = I_N. \qquad (2.8)$$

Proof. Since by hypothesis the matrix $\Gamma_{\mathbf{X}}$ is invertible, then so is its square root $\Gamma_{\mathbf{X}}^{1/2}$, as well as the map $f$, whose inverse $f^{-1}$ is given by $f^{-1}(\mathbf{x}) := \Gamma_{\mathbf{X}}^{1/2}\mathbf{x} + \boldsymbol{\mu}$. The fact that $f$ is a system isomorphism between $(F, h)$ and $(\widetilde{F}, \widetilde{h})$ is a consequence of the equalities (2.3)-(2.4). Parts (i) and (iii) of Proposition 2.3 guarantee that if $\mathbf{X}$ is the state process associated to $F$ and input $\mathbf{Z}$, then so is $\widetilde{\mathbf{X}}$, defined by $\widetilde{\mathbf{X}}_t := \Gamma_{\mathbf{X}}^{-1/2}(\mathbf{X}_t - \boldsymbol{\mu})$, $t \in \mathbb{Z}_-$, with respect to $\widetilde{F}$. The equalities (2.8) immediately follow.
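Numerically, the standardization of Proposition 2.5 is a whitening of the state process: the states are centered and multiplied by $\Gamma_{\mathbf{X}}^{-1/2}$. The sketch below is only an illustration under the assumption that the population covariance $\Gamma_{\mathbf{X}}$ is replaced by a sample estimate.

```python
import numpy as np

def standardize_states(X):
    """Return Gamma^{-1/2} (X_t - mu) for a (T, N) array of states, cf. Proposition 2.5.
    The population covariance is approximated here by the sample covariance."""
    mu = X.mean(axis=0)
    Gamma = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Gamma)                     # symmetric eigendecomposition
    Gamma_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return (X - mu) @ Gamma_inv_sqrt                 # Gamma_inv_sqrt is symmetric

# The standardized states have (approximately) zero mean and identity covariance, cf. (2.8).
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3)) + 1.0
X_std = standardize_states(X)
print(np.round(X_std.mean(axis=0), 3))
print(np.round(np.cov(X_std, rowvar=False), 3))
```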

3 Memory and forecasting capacity bounds for stationary inputs

The following definition extends the notion of memory capacity introduced in [Jaeg 02] to general nonlinear systems and to input signals that are stationary but not necessarily time-decorrelated.

Definition 3.1 Let $Z: \Omega \longrightarrow D^{\mathbb{Z}_-}$, $D \subset \mathbb{R}$, be a variance-stationary input and let $F$ be a state map that has the echo state property with respect to the paths of $Z$. Assume, moreover, that the associated state process $\mathbf{X}: \Omega \longrightarrow (D_N)^{\mathbb{Z}_-}$ defined by $\mathbf{X}_t := U^F(Z)_t$ is covariance stationary, as well as the joint processes $(T_{-\tau}(\mathbf{X}), Z)$ and $(\mathbf{X}, T_{-\tau}(Z))$, for any $\tau \in \mathbb{Z}_-$. We define the $\tau$-lag memory capacity $\mathrm{MC}_\tau$ (respectively, forecasting capacity $\mathrm{FC}_\tau$) of $F$ with respect to $Z$ as:

$$\mathrm{MC}_\tau := 1 - \frac{1}{\mathrm{Var}(Z_t)}\min_{\mathbf{W} \in \mathbb{R}^N,\ a \in \mathbb{R}} E\left[\left((T_{-\tau}Z)_t - \mathbf{W}^\top U^F(Z)_t - a\right)^2\right], \qquad (3.1)$$

$$\mathrm{FC}_\tau := 1 - \frac{1}{\mathrm{Var}(Z_t)}\min_{\mathbf{W} \in \mathbb{R}^N,\ a \in \mathbb{R}} E\left[\left(Z_t - \mathbf{W}^\top U^F(T_{-\tau}(Z))_t - a\right)^2\right]. \qquad (3.2)$$

The total memory capacity $\mathrm{MC}$ (respectively, total forecasting capacity $\mathrm{FC}$) of $F$ with respect to $Z$ is defined as:

$$\mathrm{MC} := \sum_{\tau \in \mathbb{Z}_-} \mathrm{MC}_\tau, \qquad \mathrm{FC} := \sum_{\tau \in \mathbb{Z}^-} \mathrm{FC}_\tau. \qquad (3.3)$$

Note that, by Corollary 2.4, the conditions of this definition are met when, for instance, $U^F$ is continuous with respect to the product topologies and the input process $Z$ is stationary. The proof of the following result is straightforward.
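In finite samples, $\mathrm{MC}_\tau$ and $\mathrm{FC}_\tau$ can be estimated as the coefficients of determination ($R^2$) of least-squares regressions of the lagged (respectively, led) input on the reservoir states, mirroring definitions (3.1)-(3.2). The following sketch is an illustrative estimator only; the reservoir, the AR(1) input, and all constants are assumptions made for the example.

```python
import numpy as np

def r2_capacity(states, targets):
    """R^2 of the best affine readout W^T x + a for the given targets;
    a finite-sample analogue of (3.1)-(3.2)."""
    Phi = np.hstack([states, np.ones((len(targets), 1))])
    coef, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return 1.0 - np.var(targets - Phi @ coef) / np.var(targets)

def mc_fc(states, z, lag):
    """For lag >= 1: memory capacity at tau = -lag (recover z_{t-lag} from X_t)
    and forecasting capacity at tau = -lag (predict z_{t+lag} from X_t)."""
    mc = r2_capacity(states[lag:], z[:-lag])
    fc = r2_capacity(states[:-lag], z[lag:])
    return mc, fc

# Illustrative tanh reservoir driven by an AR(1) input (stationary and dependent).
rng = np.random.default_rng(3)
N, T, phi = 20, 20_000, 0.5
A = rng.normal(size=(N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=N)
z = np.zeros(T)
for t in range(1, T):
    z[t] = phi * z[t - 1] + rng.normal()
X = np.zeros((T, N))
for t in range(1, T):
    X[t] = np.tanh(A @ X[t - 1] + C * z[t])

for lag in (1, 2, 5):
    mc, fc = mc_fc(X[1000:], z[1000:], lag)          # drop an initial washout
    print(f"lag {lag}: MC ~ {mc:.3f}, FC ~ {fc:.3f}")
```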

Lemma 3.2 In the conditions of Definition 3.1, and if the covariance matrix $\Gamma_{\mathbf{X}} := \mathrm{Cov}(\mathbf{X}_t, \mathbf{X}_t)$ is non-singular, then, for any $\tau \in \mathbb{Z}_-$:

$$\mathrm{MC}_\tau = \frac{\mathrm{Cov}(Z_{t+\tau}, \mathbf{X}_t)\,\Gamma_{\mathbf{X}}^{-1}\,\mathrm{Cov}(\mathbf{X}_t, Z_{t+\tau})}{\mathrm{Var}(Z_t)}, \qquad \mathrm{FC}_\tau = \frac{\mathrm{Cov}(Z_t, \mathbf{X}_{t+\tau})\,\Gamma_{\mathbf{X}}^{-1}\,\mathrm{Cov}(\mathbf{X}_{t+\tau}, Z_t)}{\mathrm{Var}(Z_t)}. \qquad (3.4)$$

The next result shows that the memory capacity of a state-space system with linear readouts is invariant with respect to linear system morphisms. This result will be used later on in Section 4 when we study the memory and forecasting capacities of linear systems. The proof uses the following elementary definition and linear algebraic fact: let $(V, \langle\cdot,\cdot\rangle_V)$ and $(W, \langle\cdot,\cdot\rangle_W)$ be two inner product spaces and let $f: V \longrightarrow W$ be a linear map between them. The dual map $f^*: W \longrightarrow V$ of $f$ is defined by

$$\langle f^*(\mathbf{w}), \mathbf{v}\rangle_V = \langle \mathbf{w}, f(\mathbf{v})\rangle_W, \quad \text{for any } \mathbf{v} \in V,\ \mathbf{w} \in W.$$

If the map $f$ is injective, then the inner product $\langle\cdot,\cdot\rangle_W$ in $W$ induces an inner product $\langle\cdot,\cdot\rangle_V$ in $V$ via the equality

$$\langle \mathbf{v}_1, \mathbf{v}_2\rangle_V := \langle f(\mathbf{v}_1), f(\mathbf{v}_2)\rangle_W. \qquad (3.5)$$

It is easy to see that, in that case, the dual map with respect to the inner product (3.5) in $V$ and $\langle\cdot,\cdot\rangle_W$ in $W$ satisfies

$$f^* \circ f = I_V, \qquad (3.6)$$

and, in particular, $f^*: W \longrightarrow V$ is surjective.

Lemma 3.3 Let $Z$ be a variance-stationary input and let $F_2: D_{N_2} \times D \longrightarrow D_{N_2}$ be a state map that satisfies the conditions of Definition 3.1. Let $F_1: D_{N_1} \times D \longrightarrow D_{N_1}$ be another state map and let $f: \mathbb{R}^{N_1} \longrightarrow \mathbb{R}^{N_2}$ be an injective linear system equivariant map between $F_1$ and $F_2$. Then, the memory and forecasting capacities of $F_1$ with respect to $Z$ are well-defined and coincide with those of $F_2$ with respect to $Z$.

Proof. First of all, note that the echo state property hypothesis on $F_2$ and part (ii) of Proposition 2.3 imply that the filter $U^{F_2}$ is well-defined and, moreover, for any $\mathbf{W} \in \mathbb{R}^{N_2}$, so is $U^{F_1}_{\mathbf{W} \circ f}$ and

$$U^{F_1}_{\mathbf{W} \circ f} = U^{F_2}_{\mathbf{W}}. \qquad (3.7)$$

Let $\langle\cdot,\cdot\rangle_{\mathbb{R}^{N_2}}$ be the Euclidean inner product in $\mathbb{R}^{N_2}$, let $\langle\cdot,\cdot\rangle_{\mathbb{R}^{N_1}}$ be the inner product induced in $\mathbb{R}^{N_1}$ by $\langle\cdot,\cdot\rangle_{\mathbb{R}^{N_2}}$ and the injective map $f$ using (3.5), and let $f^*$ be the corresponding dual map. It is easy to see that the equality (3.7) can be rewritten using $f^*$ as

$$U^{F_1}_{f^*(\mathbf{W})} = U^{F_2}_{\mathbf{W}}, \quad \text{for any } \mathbf{W} \in \mathbb{R}^{N_2}. \qquad (3.8)$$

Let now $\tau \in \mathbb{Z}_-$ and let $\mathrm{MC}_\tau$ be the $\tau$-lag memory capacity of $F_2$ with respect to $Z$. By definition and (3.8),

$$\mathrm{MC}_\tau = 1 - \frac{1}{\mathrm{Var}(Z_t)}\min_{\mathbf{W} \in \mathbb{R}^{N_2},\ a \in \mathbb{R}} E\left[\left((T_{-\tau}Z)_t - U^{F_2}_{\mathbf{W}}(Z)_t - a\right)^2\right]$$
$$= 1 - \frac{1}{\mathrm{Var}(Z_t)}\min_{\mathbf{W} \in \mathbb{R}^{N_2},\ a \in \mathbb{R}} E\left[\left((T_{-\tau}Z)_t - U^{F_1}_{f^*(\mathbf{W})}(Z)_t - a\right)^2\right]$$
$$= 1 - \frac{1}{\mathrm{Var}(Z_t)}\min_{\mathbf{W} \in \mathbb{R}^{N_1},\ a \in \mathbb{R}} E\left[\left((T_{-\tau}Z)_t - U^{F_1}_{\mathbf{W}}(Z)_t - a\right)^2\right],$$

which coincides with the $\tau$-lag memory capacity of $F_1$. Notice that in the last equality we used the surjectivity of $f^*$, which is a consequence of the injectivity of $f$ (see (3.6)). A similar statement can be written for the forecasting capacities.

The next result is the main contribution of this paper and generalizes the bounds formulated in [Jaeg 02] for the total memory capacity of an echo state network in the presence of independent inputs to general state systems with second-order stationary inputs.

Theorem 3.4 Suppose that we are in the conditions of Definition 3.1 and that the covariance matrix $\Gamma_{\mathbf{X}} := \mathrm{Cov}(\mathbf{X}_t, \mathbf{X}_t)$ is non-singular. Then:

(i) If the input process $Z$ is mean-stationary then, for any $\tau \in \mathbb{Z}_-$:

$$0 \leq \mathrm{MC}_\tau \leq 1 \quad \text{and} \quad 0 \leq \mathrm{FC}_\tau \leq 1. \qquad (3.9)$$

(ii) Suppose that, additionally, the input process $Z$ is second-order stationary with mean zero and autocovariance function $\gamma: \mathbb{Z} \longrightarrow \mathbb{R}$, and that for any $L \in \mathbb{Z}_-$ the symmetric matrices $H^L \in \mathbb{M}_{-L+1}$ determined by $H^L_{ij} := \gamma(|i-j|)$ are invertible. Then, if we use the symbol $C$ to denote both $\mathrm{MC}$ and $\mathrm{FC}$, we have:

$$0 \leq C \leq \frac{N}{\gamma(0)}\rho(H) \leq N\left(1 + \frac{2}{\gamma(0)}\sum_{j=1}^{\infty}|\gamma(j)|\right), \qquad (3.10)$$

where $\rho(H) := \lim_{L \to -\infty}\rho(H^L)$, with $\rho(H^L)$ the spectral radius of $H^L$.

(iii) In the same conditions as in part (ii), suppose that, additionally, the autocovariance function $\gamma$ is absolutely summable, that is, $\sum_{j=-\infty}^{\infty}|\gamma(j)| < +\infty$. In that case, the spectral density $f: [-\pi, \pi] \to \mathbb{R}$ of $Z$ is well-defined and given by

$$f(\lambda) := \frac{1}{2\pi}\sum_{n \in \mathbb{Z}} e^{-in\lambda}\gamma(n), \qquad (3.11)$$

and we have that

$$0 \leq C \leq \frac{2\pi N}{\gamma(0)} M_f \leq N\left(1 + \frac{2}{\gamma(0)}\sum_{j=1}^{\infty}|\gamma(j)|\right), \qquad (3.12)$$

where $M_f := \max_{\lambda \in [-\pi,\pi]} f(\lambda)$.

Proof. First of all, since by hypothesis $\Gamma_{\mathbf{X}}$ is non-singular, Proposition 2.5 and the second part of Proposition 2.3 allow us to replace the system (2.1)-(2.2) in the definitions (3.1) and (3.2) by its standardized counterpart, whose states $\widetilde{\mathbf{X}}_t$ are such that $E[\widetilde{\mathbf{X}}_t] = \mathbf{0}$ and $\mathrm{Cov}(\widetilde{\mathbf{X}}_t, \widetilde{\mathbf{X}}_t) = E[\widetilde{\mathbf{X}}_t\widetilde{\mathbf{X}}_t^\top] = I_N$. More specifically, if we denote $\sigma^2 := \mathrm{Var}(Z_t) = \gamma(0)$, we can write

$$\mathrm{MC}_\tau := 1 - \frac{1}{\sigma^2}\min_{\mathbf{W} \in \mathbb{R}^N,\ a \in \mathbb{R}} E\left[\left(Z_{t+\tau} - \mathbf{W}^\top\mathbf{X}_t - a\right)^2\right] = 1 - \frac{1}{\sigma^2}\min_{\mathbf{W} \in \mathbb{R}^N,\ a \in \mathbb{R}} E\left[\left(Z_{t+\tau} - \mathbf{W}^\top\widetilde{\mathbf{X}}_t - a\right)^2\right],$$

which, using Lemma 3.2 and the fact that $\Gamma_{\widetilde{\mathbf{X}}} = I_N$, can be rewritten as

$$\mathrm{MC}_\tau = \frac{1}{\sigma^2}\,\mathrm{Cov}(Z_{t+\tau}, \widetilde{\mathbf{X}}_t)\,\mathrm{Cov}(\widetilde{\mathbf{X}}_t, Z_{t+\tau}) = \frac{1}{\sigma^2}\sum_{i=1}^{N} E[\widetilde{X}^i_t Z_{t+\tau}]^2. \qquad (3.13)$$

Analogously, it is easy to show that

$$\mathrm{FC}_\tau = \frac{1}{\sigma^2}\sum_{i=1}^{N} E[\widetilde{X}^i_{t+\tau} Z_t]^2. \qquad (3.14)$$

If we now define $\widetilde{Z}_t := Z_t/\sigma \in L^2(\Omega, \mathbb{R})$, for any $t \in \mathbb{Z}_-$, it is clear that $\|\widetilde{Z}_t\|^2_{L^2} = E[\widetilde{Z}^2_t] = 1$. Moreover, the relation $\mathrm{Cov}(\widetilde{\mathbf{X}}_t, \widetilde{\mathbf{X}}_t) = I_N$ implies that the components $\widetilde{X}^i_t \in L^2(\Omega, \mathbb{R})$, $i \in \{1, \ldots, N\}$, form an orthonormal set, that is, $\langle\widetilde{X}^i_t, \widetilde{X}^j_t\rangle_{L^2} = \delta_{ij}$, where $\delta_{ij}$ stands for Kronecker's delta. We now prove the three parts of the theorem separately.

(i) The fact that $\mathrm{MC}_\tau, \mathrm{FC}_\tau \geq 0$ is obvious from (3.13) and (3.14). Let now $S_t = \mathrm{span}\{\widetilde{X}^1_t, \ldots, \widetilde{X}^N_t\} \subset L^2(\Omega, \mathbb{R})$ and let $P_{S_t}: L^2(\Omega, \mathbb{R}) \longrightarrow S_t$ be the corresponding orthogonal projection. Then,

$$1 = \|\widetilde{Z}_{t+\tau}\|^2_{L^2} \geq \|P_{S_t}(\widetilde{Z}_{t+\tau})\|^2_{L^2} = \Big\|\sum_{i=1}^{N}\widetilde{X}^i_t\,\langle\widetilde{X}^i_t, \widetilde{Z}_{t+\tau}\rangle_{L^2}\Big\|^2_{L^2} = \sum_{i=1}^{N} E[\widetilde{X}^i_t\widetilde{Z}_{t+\tau}]^2 = \mathrm{MC}_\tau.$$

The inequality $\mathrm{FC}_\tau \leq 1$ can be established analogously by considering the projection onto $S_{t+\tau}$ of the vector $\widetilde{Z}_t$.

(ii) Define first, for any $L \in \mathbb{Z}_-$, the vector $\mathbf{Z}^L := (Z_0, Z_{-1}, \ldots, Z_L)$, and

$$\mathrm{MC}^L := \sum_{\tau=0}^{L}\mathrm{MC}_\tau = \sum_{\tau=0}^{L}\sum_{i=1}^{N} E[\widetilde{X}^i_0\widetilde{Z}_\tau]^2, \qquad \mathrm{FC}^L := \sum_{\tau=-1}^{L}\mathrm{FC}_\tau = \sum_{\tau=-1}^{L}\sum_{i=1}^{N} E[\widetilde{X}^i_L\widetilde{Z}_{L-\tau}]^2. \qquad (3.15)$$

In these equalities we used (3.13) and (3.14) as well as the stationarity hypothesis. Now, using the invertibility hypothesis on the matrices $H^L$, we define the random vector $\widehat{\mathbf{Z}}^L := (H^L)^{-1/2}\mathbf{Z}^L$, whose components form an orthonormal set in $L^2(\Omega, \mathbb{R})$. Indeed, for any $i, j \in \{1, \ldots, -L+1\}$,

$$\langle\widehat{Z}^L_i, \widehat{Z}^L_j\rangle_{L^2} = E[\widehat{Z}^L_i\widehat{Z}^L_j] = \sum_{k,l=1}^{-L+1}(H^L)^{-1/2}_{ik}(H^L)^{-1/2}_{jl} E[Z_{-k+1}Z_{-l+1}] = \sum_{k,l=1}^{-L+1}(H^L)^{-1/2}_{ik}(H^L)^{-1/2}_{jl}\gamma(|k-l|) = \left((H^L)^{-1/2}H^L(H^L)^{-1/2}\right)_{ij} = \delta_{ij}. \qquad (3.16)$$

Let $S^L = \mathrm{span}\{\widehat{Z}^L_1, \ldots, \widehat{Z}^L_{-L+1}\} \subset L^2(\Omega, \mathbb{R})$ and let $P_{S^L}: L^2(\Omega, \mathbb{R}) \longrightarrow S^L$ be the corresponding orthogonal projection. By (3.15) we have that

$$\mathrm{MC}^L = \sum_{\tau=0}^{L}\sum_{i=1}^{N} E[\widetilde{X}^i_0\widetilde{Z}_\tau]^2 = \frac{1}{\gamma(0)}\sum_{\tau=0}^{L}\sum_{i=1}^{N} E\left[\widetilde{X}^i_0\left((H^L)^{1/2}\widehat{\mathbf{Z}}^L\right)_{-\tau+1}\right]^2 = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|(H^L)^{1/2}E[\widetilde{X}^i_0\widehat{\mathbf{Z}}^L]\right\|^2. \qquad (3.17)$$

Analogously,

$$\mathrm{FC}^L = \sum_{\tau=-1}^{L}\sum_{i=1}^{N} E[\widetilde{X}^i_L\widetilde{Z}_{L-\tau}]^2 = \frac{1}{\gamma(0)}\sum_{\tau=-1}^{L}\sum_{i=1}^{N} E\left[\widetilde{X}^i_L\left((H^L)^{1/2}\widehat{\mathbf{Z}}^L\right)_{-L+\tau+1}\right]^2 \leq \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|(H^L)^{1/2}E[\widetilde{X}^i_L\widehat{\mathbf{Z}}^L]\right\|^2. \qquad (3.18)$$

Now, by (3.16) and (3.17) we can write that

$$\mathrm{MC}^L \leq \frac{1}{\gamma(0)}\sum_{i=1}^{N}|||(H^L)^{1/2}|||_2^2\left\|E[\widetilde{X}^i_0\widehat{\mathbf{Z}}^L]\right\|^2 = \frac{1}{\gamma(0)}\sum_{i=1}^{N}|||(H^L)^{1/2}|||_2^2\left\|P_{S^L}(\widetilde{X}^i_0)\right\|^2_{L^2} \leq \frac{1}{\gamma(0)}\sum_{i=1}^{N}|||(H^L)^{1/2}|||_2^2\left\|\widetilde{X}^i_0\right\|^2_{L^2} \leq \frac{N}{\gamma(0)}|\lambda_{\max}(H^L)| = \frac{N}{\gamma(0)}\rho(H^L).$$

An identical inequality can be shown for $\mathrm{FC}^L$ using (3.18), which proves the first inequality in (3.10). The second inequality can be obtained by bounding $\rho(H^L)$ using Geršgorin's Disks Theorem (see [Horn 13, Theorem 6.1.1 and Corollary 6.1.5]). Indeed, due to this result:

$$\rho(H^L) \leq \gamma(0) + \max_{i \in \{1,\ldots,-L+1\}}\sum_{\substack{j \neq i \\ j \in \{1,\ldots,-L+1\}}}\left|H^L_{ij}\right| \leq \gamma(0) + 2\sum_{i=1}^{-L}|\gamma(i)|. \qquad (3.19)$$

(iii) The inequalities in (3.12) are a consequence of considering $H$ as the infinite symmetric Toeplitz matrix associated to the bi-infinite sequence of autocovariances $\{\gamma(j)\}_{j \in \mathbb{Z}}$ of $Z$. First, when the autocovariance function is absolutely summable, then (3.11) determines the spectral density of $Z$ by [Broc 06, Corollary 4.3.2]. Second, by [Gray 06, Lemma 6, page 194], the spectrum of $H$ is bounded above by the maximum of the function $2\pi f$, which, using (3.10), implies that $C \leq \frac{2\pi N}{\gamma(0)} M_f$. The last inequality is a consequence of

$$|f(\lambda)| \leq \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}|\gamma(j)| = \frac{1}{2\pi}\left(\gamma(0) + 2\sum_{j=1}^{\infty}|\gamma(j)|\right) < +\infty, \quad \text{for any } \lambda \in [-\pi, \pi].$$
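Both upper bounds in Theorem 3.4 (ii) are computable from the autocovariance function alone. The sketch below is an illustration only; it assumes the AR(1) autocovariance used as an example earlier, builds a finite Toeplitz matrix $H^L$, evaluates $N\rho(H^L)/\gamma(0)$, and compares it with the Geršgorin-type bound on the right-hand side of (3.10).

```python
import numpy as np

def capacity_bounds(gamma, N):
    """Bounds of Theorem 3.4 (ii) from a finite autocovariance vector
    gamma = (gamma(0), gamma(1), ..., gamma(L'))."""
    L = len(gamma)
    H = np.array([[gamma[abs(i - j)] for j in range(L)] for i in range(L)])
    rho = np.max(np.abs(np.linalg.eigvalsh(H)))        # spectral radius of H^L
    bound_spectral = N * rho / gamma[0]
    bound_gershgorin = N * (1.0 + 2.0 * np.sum(np.abs(gamma[1:])) / gamma[0])
    return bound_spectral, bound_gershgorin

phi, N = 0.6, 20
gamma = phi ** np.arange(200) / (1 - phi ** 2)         # AR(1) autocovariance
b_spec, b_gersh = capacity_bounds(gamma, N)
print(f"N * rho(H^L) / gamma(0) ~ {b_spec:.2f} <= Gershgorin bound {b_gersh:.2f}")
# For a white noise input both bounds collapse to N, in line with Corollary 3.5 below.
print(capacity_bounds(np.array([1.0]), N))
```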

This result and its proof are used in the next corollary to recover the memory capacity bounds proposed in [Jaeg 02] when using independent inputs and to show that, in that case, the forecasting capacity is zero.

Corollary 3.5 (Jaeger [Jaeg 02]) In the conditions of Theorem 3.4, if the input process $Z$ is independent, then

$$0 \leq \mathrm{MC} \leq N \quad \text{and} \quad \mathrm{FC} = 0. \qquad (3.20)$$

Proof. The first inequality is a straightforward consequence of (3.10) and the fact that for independent inputs $\gamma(h) = 0$, for all $h \neq 0$. The second one can be easily obtained from (3.14). Indeed, by the causality and the time-invariance [Grig 18, Proposition 2.1] of any filter induced by a state-space system of the type (2.1)-(2.2), for any $\tau \leq -1$, the random variables $\widetilde{X}^i_{t+\tau}$ and $Z_t$ in (3.14) are independent, and hence

$$\mathrm{FC}_\tau = \frac{1}{\sigma^2}\sum_{i=1}^{N} E[\widetilde{X}^i_{t+\tau}Z_t]^2 = \frac{1}{\sigma^2}\sum_{i=1}^{N} E[\widetilde{X}^i_{t+\tau}]^2 E[Z_t]^2 = 0.$$
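A quick numerical sanity check of this corollary, reusing the same kind of regression-based estimator sketched after Definition 3.1, is given below; the reservoir, the i.i.d. input, and the truncation of the total memory capacity at a finite number of lags are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, max_lag = 10, 200_000, 40
A = rng.normal(size=(N, N)); A *= 0.9 / np.linalg.norm(A, 2)
C = rng.normal(size=N)
z = rng.uniform(-1.0, 1.0, size=T)                 # independent (white noise) inputs
X = np.zeros((T, N))
for t in range(1, T):
    X[t] = np.tanh(A @ X[t - 1] + C * z[t])

def r2(states, targets):
    Phi = np.hstack([states, np.ones((len(targets), 1))])
    coef, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return 1.0 - np.var(targets - Phi @ coef) / np.var(targets)

# Truncated total memory capacity (lags 0..max_lag) and a few forecasting capacities.
mc_total = r2(X, z) + sum(r2(X[lag:], z[:-lag]) for lag in range(1, max_lag + 1))
fc_vals = [r2(X[:-lag], z[lag:]) for lag in range(1, 6)]
print(f"estimated MC ~ {mc_total:.2f} (upper bound N = {N})")
print("estimated FC at lags 1..5:", np.round(fc_vals, 4))
```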

4 The memory and forecasting capacities of linear systems

When the state equation (2.1) is linear and has the echo state property, both the memory and forecasting capacities in (3.3) can be explicitly written down in terms of the equation parameters, provided that the invertibility hypothesis on the covariance matrix of the states holds. This case has been studied for independent inputs and randomly generated linear systems in [Jaeg 02, Coui 16] and, more recently, in [Marz 17] for more general correlated inputs and diagonalizable linear systems. It is in that paper that it was pointed out for the first time how different the linear systems that maximize forecasting and memory capacities may be.

We split this section in two parts. In the first one we handle what we call the regular case, in which we assume the invertibility of the covariance matrix of the outputs. In the second one we see how, using Lemma 3.3, we can reduce the more general singular case to the regular one. This approach allows us to prove that, when the inputs are independent, the memory capacity of a linear system coincides with the rank of its associated controllability or reachability matrix. This is a generalization of the fact, already established in [Jaeg 02], that when the rank of the controllability matrix is maximal, the linear system has maximal capacity, that is, it coincides with the dimensionality of its state space. Different configurations that maximize the rank of the controllability matrix have been recently studied in [Acei 17, Verz 20].

4.1 The regular case

The capacity formulas that we state in the next result only require the stationarity of the input and the invertibility of the covariance matrix of the outputs. In this case we consider fully infinite inputs, that is, we shall be working with a stationary input process $Z: \Omega \longrightarrow \mathbb{R}^{\mathbb{Z}}$ that has second-order moments and an autocovariance function $\gamma: \mathbb{Z} \longrightarrow \mathbb{R}$, for which we shall construct the semi-infinite (respectively, doubly infinite) symmetric Toeplitz matrix $H_{ij} := \gamma(|i-j|)$, $i, j \in \mathbb{N}^+$ (respectively, $\overline{H}_{ij} := \gamma(|i-j|)$, $i, j \in \mathbb{Z}$).

Proposition 4.1 Consider the linear state system determined by the linear state map $F: \mathbb{R}^N \times \mathbb{R} \longrightarrow \mathbb{R}^N$ given by

$$F(\mathbf{x}, z) := A\mathbf{x} + \mathbf{C}z, \quad \text{where } \mathbf{C} \in \mathbb{R}^N,\ A \in \mathbb{M}_N, \text{ and } |||A|||_2 = \sigma_{\max}(A) < 1. \qquad (4.1)$$

Consider a zero-mean stationary input process $Z: \Omega \longrightarrow D^{\mathbb{Z}}$, with $D \subset \mathbb{R}$ compact, that has second-order moments and an absolutely summable autocovariance function $\gamma: \mathbb{Z} \longrightarrow \mathbb{R}$. Suppose also that the associated spectral density $f$ satisfies $f(\lambda) \geq 0$, $\lambda \in [-\pi, \pi]$, and that $f(\lambda) = 0$ holds only in at most a countable number of points.

Then $F$ has the echo state property and the associated filter $U^{A,\mathbf{C}}$ is such that its output $\mathbf{X} := U^{A,\mathbf{C}}(Z): \Omega \longrightarrow (D_N)^{\mathbb{Z}_-}$, as well as the joint processes $(T_{-\tau}(\mathbf{X}), Z)$ and $(\mathbf{X}, T_{-\tau}(Z))$, for any $\tau \in \mathbb{Z}_-$, are stationary. Suppose that the covariance matrix $\Gamma_{\mathbf{X}} := \mathrm{Cov}(\mathbf{X}_t, \mathbf{X}_t)$ is non-singular. Then:

(i) Consider the $N$ vectors $\mathbf{B}_1, \ldots, \mathbf{B}_N \in \ell^2_+(\mathbb{R})$ defined by

$$B^j_i := \left(\Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty} A^k\mathbf{C}\, H^{1/2}_{k+1,j}\right)_i, \quad j \in \mathbb{N}^+. \qquad (4.2)$$

These vectors form an orthonormal set in $\ell^2_+(\mathbb{R})$ and the memory capacity $\mathrm{MC}$ can be written as

$$\mathrm{MC} = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\langle\mathbf{B}_i, H\mathbf{B}_i\rangle_{\ell^2}. \qquad (4.3)$$

(ii) Consider the $N$ vectors $\overline{\mathbf{B}}_1, \ldots, \overline{\mathbf{B}}_N \in \ell^2(\mathbb{R})$ defined by

$$\overline{B}^j_i := \left(\Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty} A^k\mathbf{C}\, \overline{H}^{1/2}_{-k,j}\right)_i, \quad j \in \mathbb{Z}. \qquad (4.4)$$

These vectors form an orthonormal set in $\ell^2(\mathbb{R})$ and the forecasting capacity $\mathrm{FC}$ can be written as

$$\mathrm{FC} = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|P_{\mathbb{Z}^+}\left(\overline{H}^{1/2}\overline{\mathbf{B}}_i\right)\right\|^2_{\ell^2}, \qquad (4.5)$$

where $P_{\mathbb{Z}^+}: \ell^2(\mathbb{R}) \longrightarrow \ell^2(\mathbb{R})$ is the projection that sets to zero all the entries with non-positive index.

Proof. First of all, the condition $\sigma_{\max}(A) < 1$ in (4.1) implies that the state map is a contraction on the first entry. Second, as the input process takes values on a compact set, there exists a compact subset $D_N \subset \mathbb{R}^N$ (see [Gono 19, Remark 2]) such that the restriction $F: D_N \times D \longrightarrow D_N$ satisfies the hypotheses of Proposition 2.1 and of Corollary 2.4. This implies that the system associated to $F$ has the echo state property, as well as the stationarity of the filter output $\mathbf{X} = U^{A,\mathbf{C}}(Z)$ and of the joint processes in the statement. We recall that, in this case,

$$\mathbf{X}_t = U^{A,\mathbf{C}}(Z)_t = \sum_{j=0}^{\infty} A^j\mathbf{C}Z_{t-j}, \quad \text{for any } t \in \mathbb{Z}_-,$$

and hence, using the notation introduced in Proposition 2.5 and the invertibility hypothesis on $\Gamma_{\mathbf{X}}$, the standardized states $\widetilde{\mathbf{X}}_t$ are given by

$$\widetilde{\mathbf{X}}_t = \Gamma_{\mathbf{X}}^{-1/2}\mathbf{X}_t = \Gamma_{\mathbf{X}}^{-1/2}\sum_{j=0}^{\infty} A^j\mathbf{C}Z_{t-j}. \qquad (4.6)$$

Additionally, when the autocovariance function is absolutely summable, then the spectral density $f$ of $Z$ defined in (3.11) belongs to the so-called Wiener class and, moreover, if the hypothesis on it in the statement is satisfied, then the two matrices $H$ (semi-infinite) and $\overline{H}$ (doubly infinite) are invertible (see [Gray 06, Theorem 11]).

(i) Let $\mathbf{Z} := (Z_0, Z_{-1}, Z_{-2}, \ldots)$ and let $\widehat{\mathbf{Z}} := H^{-1/2}\mathbf{Z}$. An argument similar to (3.16) shows that $\langle\widehat{Z}_i, \widehat{Z}_j\rangle_{L^2} = \delta_{ij}$. Moreover, (4.6) implies that

$$\widetilde{\mathbf{X}}_0 = \Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty} A^k\mathbf{C}Z_{-k} = \Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty} A^k\mathbf{C}\left(H^{1/2}\widehat{\mathbf{Z}}\right)_{k+1} = \Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty}\sum_{j=1}^{\infty} A^k\mathbf{C}\, H^{1/2}_{k+1,j}\widehat{Z}_j,$$

which, using the definition in (4.2), can be rewritten componentwise as

$$\widetilde{X}^i_0 = \sum_{j=1}^{\infty} B^j_i\widehat{Z}_j, \quad i \in \{1, \ldots, N\}.$$

The $L^2$-orthonormality of the components $\widetilde{X}^i_0$ of $\widetilde{\mathbf{X}}_0$ implies the $\ell^2$-orthonormality of the vectors $\mathbf{B}_1, \ldots, \mathbf{B}_N \in \ell^2_+(\mathbb{R})$. Indeed, for any $i, j \in \{1, \ldots, N\}$,

$$\delta_{ij} = \langle\widetilde{X}^i_0, \widetilde{X}^j_0\rangle_{L^2} = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} B^k_i B^l_j\,\langle\widehat{Z}_k, \widehat{Z}_l\rangle_{L^2} = \sum_{k=1}^{\infty} B^k_i B^k_j = \langle\mathbf{B}_i, \mathbf{B}_j\rangle_{\ell^2_+}. \qquad (4.7)$$

We now prove (4.3). First, taking the limit $L \to -\infty$ in (3.17) we write

$$\mathrm{MC} = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|H^{1/2}E[\widetilde{X}^i_0\widehat{\mathbf{Z}}]\right\|^2 = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|H^{1/2}\sum_{j=1}^{\infty} B^j_i E[\widehat{Z}_j\widehat{\mathbf{Z}}]\right\|^2 = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|H^{1/2}\mathbf{B}_i\right\|^2_{\ell^2} = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\langle\mathbf{B}_i, H\mathbf{B}_i\rangle_{\ell^2}.$$

(ii) Let now $\overline{\mathbf{Z}} := (\ldots, Z_{-1}, Z_0, Z_1, \ldots)$ and let $\widehat{\overline{\mathbf{Z}}} := \overline{H}^{-1/2}\overline{\mathbf{Z}}$. An argument similar to (3.16) shows that $\langle\widehat{\overline{Z}}_i, \widehat{\overline{Z}}_j\rangle_{L^2} = \delta_{ij}$ for any $i, j \in \mathbb{Z}$. Also, in this case, (4.6) implies that

$$\widetilde{\mathbf{X}}_0 = \Gamma_{\mathbf{X}}^{-1/2}\sum_{k=0}^{\infty}\sum_{j=-\infty}^{\infty} A^k\mathbf{C}\,\overline{H}^{1/2}_{-k,j}\widehat{\overline{Z}}_j,$$

which, using the definition in (4.4), can be rewritten componentwise as

$$\widetilde{X}^i_0 = \sum_{j=-\infty}^{\infty}\overline{B}^j_i\widehat{\overline{Z}}_j, \quad i \in \{1, \ldots, N\}.$$

As in (4.7), the $L^2$-orthonormality of the components $\widetilde{X}^i_0$ of $\widetilde{\mathbf{X}}_0$ implies the $\ell^2$-orthonormality of the vectors $\overline{\mathbf{B}}_1, \ldots, \overline{\mathbf{B}}_N \in \ell^2(\mathbb{R})$. Finally, in order to establish (4.5), we use the stationarity hypothesis to rewrite the expression of the forecasting capacity in (3.14) as

$$\mathrm{FC} = \frac{1}{\gamma(0)}\sum_{\tau=1}^{\infty}\sum_{i=1}^{N} E[\widetilde{X}^i_0 Z_\tau]^2 = \frac{1}{\gamma(0)}\sum_{\tau=1}^{\infty}\sum_{i=1}^{N} E\left[\sum_{j=-\infty}^{\infty}\overline{B}^j_i\widehat{\overline{Z}}_j\left(\overline{H}^{1/2}\widehat{\overline{\mathbf{Z}}}\right)_\tau\right]^2 = \frac{1}{\gamma(0)}\sum_{\tau=1}^{\infty}\sum_{i=1}^{N}\left(\sum_{j,l=-\infty}^{\infty}\overline{B}^j_i\,\overline{H}^{1/2}_{\tau,l}\, E[\widehat{\overline{Z}}_j\widehat{\overline{Z}}_l]\right)^2$$
$$= \frac{1}{\gamma(0)}\sum_{\tau=1}^{\infty}\sum_{i=1}^{N}\left(\sum_{j=-\infty}^{\infty}\overline{B}^j_i\,\overline{H}^{1/2}_{\tau,j}\right)^2 = \frac{1}{\gamma(0)}\sum_{\tau=1}^{\infty}\sum_{i=1}^{N}\left(\overline{H}^{1/2}\overline{\mathbf{B}}_i\right)^2_\tau = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|P_{\mathbb{Z}^+}\left(\overline{H}^{1/2}\overline{\mathbf{B}}_i\right)\right\|^2_{\ell^2}.$$

When the inputs are second-order stationary and not autocorrelated ($Z$ is a white noise), then the formulas in the previous result can be used to give, for the linear case, a more informative version of Corollary 3.5, without the need to invoke input independence.

Corollary 4.2 Suppose that we are in the hypotheses of Proposition 4.1 and that, additionally, the input process $Z: \Omega \longrightarrow D^{\mathbb{Z}}$ is a white noise, that is, the autocovariance function $\gamma$ satisfies $\gamma(h) = 0$ for any non-zero $h \in \mathbb{Z}$. Then:

$$\mathrm{MC} = N \quad \text{and} \quad \mathrm{FC} = 0. \qquad (4.8)$$

Proof. The first equality in (4.8) is a straightforward consequence of (4.3) and of the fact that for white noise inputs $H = \gamma(0)I_{\ell^2_+(\mathbb{R})}$. The equality $\mathrm{FC} = 0$ follows from the fact that $\overline{H} = \gamma(0)I_{\ell^2(\mathbb{R})}$ and that $\overline{B}^j_i = 0$, for any $i \in \{1, \ldots, N\}$ and any $j \in \mathbb{Z}^+$, by (4.4). Then, by (4.5),

$$\mathrm{FC} = \frac{1}{\gamma(0)}\sum_{i=1}^{N}\left\|P_{\mathbb{Z}^+}\left(\overline{H}^{1/2}\overline{\mathbf{B}}_i\right)\right\|^2_{\ell^2} = \sum_{i=1}^{N}\left\|P_{\mathbb{Z}^+}(\overline{\mathbf{B}}_i)\right\|^2_{\ell^2} = 0.$$

An important conclusion of this corollary is that linear systems with white noise inputs that have a non-singular state covariance matrix $\Gamma_{\mathbf{X}}$ automatically have maximal memory capacity. This makes important the characterization of the invertibility of $\Gamma_{\mathbf{X}}$ in terms of the parameters $A \in \mathbb{M}_N$ (usually referred to as the connectivity matrix) and $\mathbf{C} \in \mathbb{R}^N$ in (4.1). In particular, the following proposition establishes a connection between the invertibility of $\Gamma_{\mathbf{X}}$ and Kalman's characterization of the controllability of a linear system [Kalm 10]. The relation between this condition and maximum capacity is already mentioned in [Jaeg 02].

Proposition 4.3 Consider the linear state system introduced in (4.1). Suppose that the input process $Z: \Omega \longrightarrow D^{\mathbb{Z}}$ is a white noise and that the connectivity matrix $A \in \mathbb{M}_N$ is diagonalizable. Let $\sigma(A) = \{\lambda_1, \ldots, \lambda_N\}$ be the spectrum of $A$ and let $\{\mathbf{v}_1, \ldots, \mathbf{v}_N\}$ be a basis of eigenvectors. The following statements are equivalent:

(i) The state covariance matrix $\Gamma_{\mathbf{X}} = \mathrm{Cov}(\mathbf{X}_t, \mathbf{X}_t) = \gamma(0)\sum_{j=0}^{\infty} A^j\mathbf{C}\mathbf{C}^\top(A^j)^\top$ is non-singular.

(ii) The eigenvalues in $\sigma(A)$ are all non-zero and distinct, and in the linear decomposition $\mathbf{C} = \sum_{i=1}^{N} c_i\mathbf{v}_i$ all the coefficients $c_i$, $i \in \{1, \ldots, N\}$, are non-zero.

(iii) The vectors $\{A\mathbf{C}, A^2\mathbf{C}, \ldots, A^N\mathbf{C}\}$ form a basis of $\mathbb{R}^N$.

(iv) Kalman controllability condition: the vectors $\{\mathbf{C}, A\mathbf{C}, \ldots, A^{N-1}\mathbf{C}\}$ form a basis of $\mathbb{R}^N$.

Proof. (i)⇔(ii) Using the notation introduced in the statement, notice that:

$$A\mathbf{C} = \sum_{i=1}^{N} c_i\lambda_i\mathbf{v}_i, \quad A^2\mathbf{C} = \sum_{i=1}^{N} c_i\lambda_i^2\mathbf{v}_i, \quad \ldots, \quad A^N\mathbf{C} = \sum_{i=1}^{N} c_i\lambda_i^N\mathbf{v}_i. \qquad (4.9)$$

Since by hypothesis $|||A|||_2 = \sigma_{\max}(A) < 1$, the spectral radius $\rho(A)$ of $A$ satisfies $\rho(A) \leq \sigma_{\max}(A) < 1$, and hence:

$$B := \sum_{j=0}^{\infty} A^j\mathbf{C}\mathbf{C}^\top(A^j)^\top = \sum_{k=0}^{\infty}\sum_{i,j=1}^{N}\lambda_i^k\lambda_j^k c_ic_j\mathbf{v}_i\mathbf{v}_j^\top = \sum_{i,j=1}^{N}\frac{c_ic_j}{1-\lambda_i\lambda_j}\mathbf{v}_i\mathbf{v}_j^\top,$$

which shows that in the matrix basis $\{\mathbf{v}_i\mathbf{v}_j^\top\}_{i,j\in\{1,\ldots,N\}}$ the matrix $B$ has components $\widetilde{B}_{ij} := \frac{c_ic_j}{1-\lambda_i\lambda_j}$ or, equivalently,

$$\widetilde{B} := \widetilde{\mathbf{C}}\widetilde{\mathbf{C}}^\top\odot D, \quad \text{with } \widetilde{\mathbf{C}} = (c_1, \ldots, c_N)^\top \text{ and } D \text{ defined by } D_{ij} := \frac{1}{1-\lambda_i\lambda_j}, \qquad (4.10)$$

for any $i, j \in \{1, \ldots, N\}$; the symbol $\odot$ stands for componentwise matrix multiplication (Hadamard product). Let $P \in \mathbb{M}_N$ be the invertible change-of-basis matrix between $\{\mathbf{v}_1, \ldots, \mathbf{v}_N\}$ and the canonical basis $\{\mathbf{e}_1, \ldots, \mathbf{e}_N\}$, that is, for any $i \in \{1, \ldots, N\}$ we have $\mathbf{v}_i = \sum_{k=1}^{N} P_{ik}\mathbf{e}_k$. It is easy to see that $B = P^\top\widetilde{B}P$ and hence the invertibility of $B$ (and hence of $\Gamma_{\mathbf{X}}$) is equivalent to the invertibility of $\widetilde{B}$ in (4.10), which we now characterize.

In order to provide an alternative expression for $\widetilde{B}$, recall that for any two vectors $\mathbf{v}, \mathbf{w} \in \mathbb{R}^N$ we can write $\mathbf{v}\odot\mathbf{w} = \mathrm{diag}(\mathbf{v})\mathbf{w}$, where $\mathrm{diag}(\mathbf{v}) \in \mathbb{M}_N$ is the diagonal matrix that has the entries of the vector $\mathbf{v}$ on the diagonal. Using this fact and the Hadamard product trace property (see [Horn 94, Lemma 5.1.4, page 305]) we have that

$$\langle\mathbf{v}, \widetilde{B}\mathbf{w}\rangle = \langle\mathbf{v}, (\widetilde{\mathbf{C}}\widetilde{\mathbf{C}}^\top\odot D)\mathbf{w}\rangle = \mathrm{trace}\left(\mathbf{v}^\top(\widetilde{\mathbf{C}}\widetilde{\mathbf{C}}^\top\odot D)\mathbf{w}\right) = \mathrm{trace}\left((\widetilde{\mathbf{C}}\widetilde{\mathbf{C}}^\top\odot D)\mathbf{w}\mathbf{v}^\top\right)$$
$$= \mathrm{trace}\left((\widetilde{\mathbf{C}}\widetilde{\mathbf{C}}^\top\odot\mathbf{v}\mathbf{w}^\top)D^\top\right) = \mathrm{trace}\left((\widetilde{\mathbf{C}}\odot\mathbf{v})(\widetilde{\mathbf{C}}\odot\mathbf{w})^\top D^\top\right)$$
$$= \mathrm{trace}\left(\mathrm{diag}(\widetilde{\mathbf{C}})\mathbf{v}\mathbf{w}^\top\mathrm{diag}(\widetilde{\mathbf{C}})D^\top\right) = \langle\mathbf{v}, \mathrm{diag}(\widetilde{\mathbf{C}})D\,\mathrm{diag}(\widetilde{\mathbf{C}})\mathbf{w}\rangle.$$

Since $\mathbf{v}, \mathbf{w} \in \mathbb{R}^N$ are arbitrary, this equality allows us to conclude that $\widetilde{B} = \mathrm{diag}(\widetilde{\mathbf{C}})D\,\mathrm{diag}(\widetilde{\mathbf{C}})$ and hence $\widetilde{B}$ is invertible if and only if both $\mathrm{diag}(\widetilde{\mathbf{C}})$ and $D$ are. The regularity of $\mathrm{diag}(\widetilde{\mathbf{C}})$ is equivalent to requiring that all the entries of the vector $\widetilde{\mathbf{C}}$ are non-zero. Regarding $D$, it can be shown by induction on the matrix dimension $N$ that

$$\det(D) = (-1)^N\frac{\prod_{i<j}^{N}(\lambda_i-\lambda_j)^2}{\prod_{i<j}^{N}(\lambda_i\lambda_j-1)^2\prod_{i=1}^{N}(\lambda_i^2-1)}.$$

Consequently, $D$ is invertible if and only if $\det(D) \neq 0$, which is equivalent to all the elements in the spectrum $\sigma(A)$ being distinct.

(ii)⇔(iii) The condition on the vectors $\{A\mathbf{C}, A^2\mathbf{C}, \ldots, A^N\mathbf{C}\}$ forming a basis of $\mathbb{R}^N$ is equivalent to the invertibility of the matrix $\widetilde{R}(A,\mathbf{C}) := (A\mathbf{C}|A^2\mathbf{C}|\cdots|A^N\mathbf{C})$. It is easy to see using (4.9) that

$$\widetilde{R}(A,\mathbf{C}) = P^\top\widetilde{\mathcal{R}}(A,\mathbf{C}), \qquad (4.11)$$

where $P$ is the invertible change-of-basis matrix in the previous point and $\widetilde{\mathcal{R}}(A,\mathbf{C})$ is given by

$$\widetilde{\mathcal{R}}(A,\mathbf{C}) := \begin{pmatrix} c_1\lambda_1 & c_1\lambda_1^2 & \cdots & c_1\lambda_1^N \\ c_2\lambda_2 & c_2\lambda_2^2 & \cdots & c_2\lambda_2^N \\ \vdots & \vdots & \ddots & \vdots \\ c_N\lambda_N & c_N\lambda_N^2 & \cdots & c_N\lambda_N^N \end{pmatrix}. \qquad (4.12)$$

Indeed, for any $i, j \in \{1, \ldots, N\}$,

$$\widetilde{R}(A,\mathbf{C})_{ij} = (A^j\mathbf{C})_i = \left(\sum_{k=1}^{N} c_kA^j\mathbf{v}_k\right)_i = \left(\sum_{k=1}^{N} c_k\lambda_k^j\mathbf{v}_k\right)_i = \left(\sum_{k,l=1}^{N} c_k\lambda_k^jP_{kl}\mathbf{e}_l\right)_i = \left(\sum_{k,l=1}^{N}\widetilde{\mathcal{R}}(A,\mathbf{C})_{kj}P_{kl}\mathbf{e}_l\right)_i$$
$$= \left(\sum_{l=1}^{N}\left(P^\top\widetilde{\mathcal{R}}(A,\mathbf{C})\right)_{lj}\mathbf{e}_l\right)_i = \sum_{l=1}^{N}\left(P^\top\widetilde{\mathcal{R}}(A,\mathbf{C})\right)_{lj}\mathbf{e}_i^\top\mathbf{e}_l = \left(P^\top\widetilde{\mathcal{R}}(A,\mathbf{C})\right)_{ij},$$

which proves (4.11). Now, using induction on the matrix dimension $N$, it can be shown that

$$\det(\widetilde{\mathcal{R}}(A,\mathbf{C})) = \prod_{i=1}^{N} c_i\lambda_i\prod_{i<j}^{N}(\lambda_i-\lambda_j).$$

The invertibility of $\widetilde{\mathcal{R}}(A,\mathbf{C})$ (or, equivalently, the invertibility of $\widetilde{R}(A,\mathbf{C})$) is equivalent to all the coefficients $c_i$ and all the eigenvalues $\lambda_i$ being non-zero (so that $\prod_{i=1}^{N} c_i\lambda_i$ is non-zero) and all the elements in $\sigma(A)$ being distinct (so that $\prod_{i<j}^{N}(\lambda_i-\lambda_j)$ is non-zero).

(iii)⇔(iv) The Kalman controllability condition on the vectors $\{\mathbf{C}, A\mathbf{C}, \ldots, A^{N-1}\mathbf{C}\}$ forming a basis of $\mathbb{R}^N$ is equivalent to the invertibility of the controllability or reachability matrix $R(A,\mathbf{C}) := (\mathbf{C}|A\mathbf{C}|\cdots|A^{N-1}\mathbf{C})$ (see [Sont 98] for this terminology). Following the same strategy that we used to prove (4.11), it is easy to see that $R(A,\mathbf{C}) = P^\top\mathcal{R}(A,\mathbf{C})$, where

$$\mathcal{R}(A,\mathbf{C}) := \begin{pmatrix} c_1 & c_1\lambda_1 & \cdots & c_1\lambda_1^{N-1} \\ c_2 & c_2\lambda_2 & \cdots & c_2\lambda_2^{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_N & c_N\lambda_N & \cdots & c_N\lambda_N^{N-1} \end{pmatrix}.$$

The matrix $\mathcal{R}(A,\mathbf{C})$ has the same rank as $\widetilde{\mathcal{R}}(A,\mathbf{C})$, since it can be obtained from $\widetilde{\mathcal{R}}(A,\mathbf{C})$ via elementary matrix operations, namely, by dividing each row $i$ of $\widetilde{\mathcal{R}}(A,\mathbf{C})$ by the corresponding eigenvalue $\lambda_i$.
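The equivalences of Proposition 4.3 can be checked numerically for a given pair $(A, \mathbf{C})$: the series defining $\Gamma_{\mathbf{X}}$ converges because $\rho(A) < 1$, and its rank can be compared with that of the Kalman controllability matrix. The sketch below is only an illustrative verification under assumed parameter choices, not part of the proof above.

```python
import numpy as np

def state_covariance(A, C, gamma0=1.0, tol=1e-14):
    """Gamma_X = gamma(0) * sum_j A^j C C^T (A^j)^T, summed until the terms are negligible."""
    term = np.outer(C, C)
    total = term.copy()
    while np.linalg.norm(term) > tol:
        term = A @ term @ A.T
        total += term
    return gamma0 * total

def controllability_matrix(A, C):
    """Kalman controllability matrix R(A, C) = (C | AC | ... | A^{N-1} C)."""
    N = A.shape[0]
    cols, v = [], C.copy()
    for _ in range(N):
        cols.append(v.copy())
        v = A @ v
    return np.column_stack(cols)

# Example with distinct non-zero eigenvalues and all eigen-coefficients non-zero:
# both Gamma_X and R(A, C) should then have full rank.
N = 4
A = np.diag([0.9, -0.5, 0.3, 0.7])
C = np.ones(N)
print(np.linalg.matrix_rank(state_covariance(A, C)),
      np.linalg.matrix_rank(controllability_matrix(A, C)))      # 4 4

# Setting one eigen-coefficient to zero breaks conditions (ii)-(iv) and makes Gamma_X singular.
C_bad = np.array([1.0, 0.0, 1.0, 1.0])
print(np.linalg.matrix_rank(state_covariance(A, C_bad)),
      np.linalg.matrix_rank(controllability_matrix(A, C_bad)))  # 3 3
```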

4.2 The singular case

In the following paragraphs we study the situation in which the covariance matrix $\Gamma_{\mathbf{X}}$ of the outputs in the presence of white noise inputs is not invertible. All the results formulated in the paper so far, and in particular the capacity formulas in (3.4), are no longer valid in this case. The strategy that we adopt to tackle this problem uses Lemma 3.3. More specifically, we show that whenever we are given a linear system whose covariance matrix $\Gamma_{\mathbf{X}}$ is not invertible, there exists another linear system, defined on a lower-dimensional state space, that generates the same filter and hence has the same capacities but, unlike the original system, has an invertible covariance matrix. This feature allows us to use for this system some of the results in the previous sections and, in particular, to compute its memory capacity in the presence of independent inputs which, as we establish in the next result, coincides with the rank of the controllability matrix $R(A,\mathbf{C})$.

Theorem 4.4 Consider the linear system $F(\mathbf{x}, z) := A\mathbf{x} + \mathbf{C}z$ introduced in (4.1) and suppose that the input process $Z: \Omega \longrightarrow D^{\mathbb{Z}}$ is a white noise.

(i) Let $R(A,\mathbf{C})$ be the controllability matrix of the linear system. Then:

$$\ker\Gamma_{\mathbf{X}} = \ker R(A,\mathbf{C})^\top. \qquad (4.13)$$

(ii) Let $\mathcal{X} := \mathrm{span}\{\mathbf{C}, A\mathbf{C}, \ldots, A^{N-1}\mathbf{C}\}$ and let $r := \dim(\mathcal{X}) = \mathrm{rank}\,R(A,\mathbf{C})$. Let $V \subset \mathbb{R}^N$ be such that $\mathbb{R}^N = \mathcal{X}\oplus V$ and let $i_{\mathcal{X}}: \mathcal{X} \longrightarrow \mathbb{R}^N$ and $\pi_{\mathcal{X}}: \mathbb{R}^N \longrightarrow \mathcal{X}$ be the injection and the projection associated to this splitting, respectively. Then, the linear system $\overline{F}: \mathcal{X}\times D \longrightarrow \mathcal{X}$ defined by

$$\overline{F}(\mathbf{x}, z) := \overline{A}\mathbf{x} + \overline{\mathbf{C}}z, \quad \text{determined by } \overline{A} := \pi_{\mathcal{X}}Ai_{\mathcal{X}} \text{ and } \overline{\mathbf{C}} := \pi_{\mathcal{X}}(\mathbf{C}), \qquad (4.14)$$

is well-defined and has the echo state property.

(iii) The inclusion $i_{\mathcal{X}}: \mathcal{X} \longrightarrow \mathbb{R}^N$ is an injective linear system equivariant map between $\overline{F}$ and $F$.

(iv) Let $\overline{\mathbf{X}}: \Omega \longrightarrow \mathcal{X}^{\mathbb{Z}_-}$ be the output of the filter determined by the state system $\overline{F}$. Then:

$$\mathrm{rank}\,R(\overline{A}, \overline{\mathbf{C}}) = r \quad \text{and, if } A \text{ is diagonalizable, then } \Gamma_{\overline{\mathbf{X}}} := \mathrm{Cov}(\overline{\mathbf{X}}_t, \overline{\mathbf{X}}_t) \text{ is invertible.} \qquad (4.15)$$

(v) If $A$ is diagonalizable, then the memory $\mathrm{MC}$ and forecasting $\mathrm{FC}$ capacities of $F$ with respect to $Z$ are given by

$$\mathrm{MC} = \mathrm{rank}\,R(A,\mathbf{C}) = \dim\left(\mathrm{span}\{\mathbf{C}, A\mathbf{C}, \ldots, A^{N-1}\mathbf{C}\}\right) \quad \text{and} \quad \mathrm{FC} = 0. \qquad (4.16)$$

Proof. (i) We first show that $\ker R(A,\mathbf{C})^\top \subset \ker\Gamma_{\mathbf{X}}$. Let $\mathbf{v} \in \ker R(A,\mathbf{C})^\top$; we show that $\mathbf{C}^\top(A^j)^\top\mathbf{v} = 0$ for all $j \in \mathbb{N}$. Indeed, the condition $\mathbf{v} \in \ker R(A,\mathbf{C})^\top$ implies that

$$\mathbf{C}^\top(A^j)^\top\mathbf{v} = 0, \quad \text{for all } j \in \{0, \ldots, N-1\}. \qquad (4.17)$$

Now, by the Hamilton-Cayley Theorem [Horn 13, Theorem 2.4.3.2], for any $j \geq N$ there exist constants $\beta^j_0, \ldots, \beta^j_{N-1}$ such that $A^j = \sum_{i=0}^{N-1}\beta^j_iA^i$, which, together with (4.17), implies that the equality holds for all $j \in \mathbb{N}$. Recall now that, by Proposition 4.3 part (i), $\Gamma_{\mathbf{X}} = \gamma(0)\sum_{j=0}^{\infty} A^j\mathbf{C}\mathbf{C}^\top(A^j)^\top$, and hence we can conclude that $\Gamma_{\mathbf{X}}(\mathbf{v}) = \gamma(0)\sum_{j=0}^{\infty} A^j\mathbf{C}\mathbf{C}^\top(A^j)^\top\mathbf{v} = \mathbf{0}$, that is, $\mathbf{v} \in \ker\Gamma_{\mathbf{X}}$. Conversely, if $\mathbf{v} \in \ker\Gamma_{\mathbf{X}}$, we have that $0 = \langle\mathbf{v}, \Gamma_{\mathbf{X}}(\mathbf{v})\rangle = \gamma(0)\sum_{j=0}^{\infty}\|\mathbf{C}^\top(A^j)^\top\mathbf{v}\|^2$, which implies that necessarily $\mathbf{C}^\top(A^j)^\top\mathbf{v} = 0$ for any $j \in \mathbb{N}$, and hence $\mathbf{v} \in \ker R(A,\mathbf{C})^\top$.

(ii) The system associated to $\overline{F}$ has the echo state property because, for any $\mathbf{x} \in \mathcal{X}$,

$$\|\overline{A}\mathbf{x}\|_2 = \|\pi_{\mathcal{X}}Ai_{\mathcal{X}}(\mathbf{x})\|_2 \leq \|Ai_{\mathcal{X}}(\mathbf{x})\|_2,$$

which implies that $|||\overline{A}|||_2 \leq |||A|||_2 = \sigma_{\max}(A) < 1$.

(iii) We first show that for any $\mathbf{x} \in \mathcal{X}$ and $z \in D$ we have that $Ai_{\mathcal{X}}\mathbf{x} + \mathbf{C}z \in \mathcal{X}$. Indeed, as $\mathbf{x} \in \mathcal{X}$, there exist constants $\alpha_0, \ldots, \alpha_{N-1}$ such that $\mathbf{x} = \sum_{i=0}^{N-1}\alpha_iA^i\mathbf{C}$ and hence

$$Ai_{\mathcal{X}}\mathbf{x} + \mathbf{C}z = \sum_{i=0}^{N-1}\alpha_iA^{i+1}\mathbf{C} + \mathbf{C}z = \mathbf{C}z + \sum_{i=1}^{N-1}\alpha_{i-1}A^i\mathbf{C} + \alpha_{N-1}A^N\mathbf{C} = \mathbf{C}z + \sum_{i=1}^{N-1}\alpha_{i-1}A^i\mathbf{C} + \alpha_{N-1}\sum_{j=0}^{N-1}\beta^N_jA^j\mathbf{C} \in \mathcal{X}, \qquad (4.18)$$

where the constants $\beta^N_0, \ldots, \beta^N_{N-1}$ satisfy $A^N = \sum_{i=0}^{N-1}\beta^N_iA^i$ and, as above, are a byproduct of the Hamilton-Cayley Theorem [Horn 13, Theorem 2.4.3.2]. We now show that $i_{\mathcal{X}}$ is a system equivariant map between $\overline{F}$ and $F$. For any $\mathbf{x} \in \mathcal{X}$ and $z \in D$,

$$i_{\mathcal{X}}\left(\overline{F}(\mathbf{x}, z)\right) = i_{\mathcal{X}}\pi_{\mathcal{X}}\left(Ai_{\mathcal{X}}(\mathbf{x}) + \mathbf{C}z\right) = Ai_{\mathcal{X}}(\mathbf{x}) + \mathbf{C}z = F(i_{\mathcal{X}}(\mathbf{x}), z),$$

where the second equality holds because $i_{\mathcal{X}}\circ\pi_{\mathcal{X}}|_{\mathcal{X}} = I_{\mathcal{X}}$ and, by (4.18), $Ai_{\mathcal{X}}\mathbf{x} + \mathbf{C}z \in \mathcal{X}$.

(iv) We prove the identity $\mathrm{rank}\,R(\overline{A}, \overline{\mathbf{C}}) = r$ by showing that

$$i_{\mathcal{X}}\left(\mathrm{span}\{\overline{\mathbf{C}}, \overline{A}\,\overline{\mathbf{C}}, \ldots, \overline{A}^{r-1}\overline{\mathbf{C}}\}\right) = \mathcal{X}.$$

First, using that $i_{\mathcal{X}}\circ\pi_{\mathcal{X}}|_{\mathcal{X}} = I_{\mathcal{X}}$ and that, by the Hamilton-Cayley Theorem, $A^j\mathbf{C} \in \mathcal{X}$ for all $j \in \mathbb{N}$, it is easy to conclude that for any $\mathbf{v} = \sum_{i=1}^{r}\alpha_i\overline{A}^{i-1}\overline{\mathbf{C}}$ we have that

$$i_{\mathcal{X}}(\mathbf{v}) = i_{\mathcal{X}}\left(\sum_{i=1}^{r}\alpha_i\overline{A}^{i-1}\overline{\mathbf{C}}\right) = \sum_{i=1}^{r}\alpha_iA^{i-1}\mathbf{C},$$

which shows that $i_{\mathcal{X}}\left(\mathrm{span}\{\overline{\mathbf{C}}, \overline{A}\,\overline{\mathbf{C}}, \ldots, \overline{A}^{r-1}\overline{\mathbf{C}}\}\right) \subset \mathcal{X}$. Conversely, let $\mathbf{v} = \sum_{i=1}^{N}\alpha_iA^{i-1}\mathbf{C} \in \mathcal{X}$. Using again that $i_{\mathcal{X}}\circ\pi_{\mathcal{X}}|_{\mathcal{X}} = I_{\mathcal{X}}$, we can write

$$\mathbf{v} = \sum_{i=1}^{N}\alpha_iA^{i-1}\mathbf{C} = \sum_{i=1}^{N}\alpha_i\underbrace{(i_{\mathcal{X}}\pi_{\mathcal{X}}A)\cdots(i_{\mathcal{X}}\pi_{\mathcal{X}}A)}_{(i-1)\text{-times}}(i_{\mathcal{X}}\pi_{\mathcal{X}}\mathbf{C}) = i_{\mathcal{X}}\left(\sum_{i=1}^{N}\alpha_i\overline{A}^{i-1}\overline{\mathbf{C}}\right) = i_{\mathcal{X}}\left(\sum_{i=1}^{r}\alpha_i\overline{A}^{i-1}\overline{\mathbf{C}} + \sum_{j=r+1}^{N}\sum_{k_j=1}^{r}\alpha_j\beta^j_{k_j}\overline{A}^{k_j-1}\overline{\mathbf{C}}\right),$$

which clearly belongs to $i_{\mathcal{X}}\left(\mathrm{span}\{\overline{\mathbf{C}}, \overline{A}\,\overline{\mathbf{C}}, \ldots, \overline{A}^{r-1}\overline{\mathbf{C}}\}\right)$. The constants $\beta^j_{k_j}$ are obtained, again, by using the Hamilton-Cayley Theorem. Finally, the invertibility of $\Gamma_{\overline{\mathbf{X}}}$ is a consequence of Proposition 4.3.

(v) The statement in part (iii) that we just proved and Lemma 3.3 imply that the memory $\mathrm{MC}$ and forecasting $\mathrm{FC}$ capacities of $F$ with respect to $Z$ coincide with those of $\overline{F}$ with respect to $Z$. Now, the statement in part (iv), Proposition 4.3, and Corollary 4.2 imply that those capacities coincide with $r$ and $0$, respectively. Finally, as $r = \mathrm{rank}\,R(A,\mathbf{C})$, the claim follows.
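Part (v) of Theorem 4.4 can be illustrated numerically: for a linear reservoir with white noise inputs and a deliberately rank-deficient controllability matrix, a regression-based estimate of the total memory capacity (as in the sketch after Definition 3.1) approaches $\mathrm{rank}\,R(A,\mathbf{C})$ rather than $N$. Everything below is an illustrative experiment under assumed parameter choices.

```python
import numpy as np

def controllability_rank(A, C):
    N = A.shape[0]
    cols, v = [], C.copy()
    for _ in range(N):
        cols.append(v.copy()); v = A @ v
    return np.linalg.matrix_rank(np.column_stack(cols))

def r2(states, targets):
    Phi = np.hstack([states, np.ones((len(targets), 1))])
    coef, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return 1.0 - np.var(targets - Phi @ coef) / np.var(targets)

# Block-diagonal A with C supported only on the first block: only a 3-dimensional
# subspace is reachable, so rank R(A, C) = 3 even though N = 6.
rng = np.random.default_rng(5)
A = np.zeros((6, 6))
A[:3, :3] = np.diag([0.8, 0.5, -0.6])
A[3:, 3:] = np.diag([0.7, -0.4, 0.2])
C = np.concatenate([np.ones(3), np.zeros(3)])
print("rank R(A, C) =", controllability_rank(A, C))

T, max_lag = 300_000, 60
z = rng.normal(size=T)                         # white noise input
X = np.zeros((T, 6))
for t in range(1, T):
    X[t] = A @ X[t - 1] + C * z[t]             # linear state equation (4.1)

mc_total = r2(X, z) + sum(r2(X[lag:], z[:-lag]) for lag in range(1, max_lag + 1))
print(f"estimated total MC ~ {mc_total:.2f}  (rank = 3, N = 6)")
```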

5 Conclusions

In this paper we have studied memory and forecasting capacities of generic nonlinear recurrent networks with respect to arbitrary stationary inputs that are not necessarily independent. In particular, we have stated upper bounds for those capacities in terms of the dimensionality of the network and the autocovariance of the inputs that generalize those formulated in [Jaeg 02] for independent inputs.

The approach followed in the paper is particularly advantageous for linear networks, for which explicit expressions can be formulated for the memory and the forecasting capacities. In the classical linear case with independent inputs, we have proved that the memory capacity of a linear recurrent network is given by the rank of its controllability matrix. This explicit and readily computable characterization of the memory capacity of those networks generalizes a well-known relation between maximal capacity and Kalman's controllability condition, formulated for the first time in [Jaeg 02]. This is, to our knowledge, the first rigorous proof of the relation between network memory and the rank of the controllability matrix, a link that has for a long time been part of the reservoir computing folklore.

The results in this paper suggest links between controllability and memory capacity for nonlinear recurrent systems that will be explored in forthcoming works.

Acknowledgments: LG and JPO acknowledge partial financial support coming from the Research Commission of the Universität Sankt Gallen and the Swiss National Science Foundation (grant number 200021 175801/1). JPO acknowledges partial financial support of the French ANR "BIPHOPROC" project (ANR-14-OHRI-0002-02). The authors thank the hospitality and the generosity of the FIM at ETH Zurich and the Division of Mathematical Sciences of the Nanyang Technological University, Singapore, where a significant portion of the results in this paper was obtained.

References

[Acei 17] P. V. Aceituno, Y. Gang, and Y.-Y. Liu. “Tailoring artificial neural networks for optimallearning”. jul 2017.

[Bai 12] Bai Zhang, D. J. Miller, and Yue Wang. “Nonlinear system modeling with random matri-ces: echo state networks revisited”. IEEE Transactions on Neural Networks and LearningSystems, Vol. 23, No. 1, pp. 175–182, jan 2012.

[Bara 14] P. Barancok and I. Farkas. “Memory capacity of input-driven echo state networks at theedge of chaos”. In: Proceedings of the International Conference on Artificial Neural Networks(ICANN), pp. 41–48, 2014.

[Broc 06] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, 2006.

[Bueh 06] M. Buehner and P. Young. “A tighter bound for the echo state property”. IEEE Transactionson Neural Networks, Vol. 17, No. 3, pp. 820–824, 2006.

[Char 17] A. S. Charles, D. Yin, and C. J. Rozell. “Distributed sequence memory of multidimensionalinputs in recurrent networks”. Tech. Rep., 2017.

[Coui 16] R. Couillet, G. Wainrib, H. Sevi, and H. T. Ali. “The asymptotic performance of linear echostate neural networks”. Journal of Machine Learning Research, Vol. 17, No. 178, pp. 1–35,2016.

[Damb 12] J. Dambre, D. Verstraeten, B. Schrauwen, and S. Massar. “Information processing capacityof dynamical systems”. Scientific reports, Vol. 2, No. 514, 2012.

[Fark 16] I. Farkas, R. Bosak, and P. Gergel. “Computational analysis of memory capacity in echostate networks”. Neural Networks, Vol. 83, pp. 109–120, 2016.

[Gall 17] C. Gallicchio and A. Micheli. “Echo state property of deep reservoir computing networks”.Cognitive Computation, Vol. 9, 2017.

[Gang 08] S. Ganguli, D. Huh, and H. Sompolinsky. “Memory traces in dynamical systems.”. Proceed-ings of the National Academy of Sciences of the United States of America, Vol. 105, No. 48,pp. 18970–5, dec 2008.

[Gono 19] L. Gonon, L. Grigoryeva, and J.-P. Ortega. “Risk bounds for reservoir computing”. Preprint, 2019.

[Goud 16] A. Goudarzi, S. Marzen, P. Banda, G. Feldman, M. R. Lakin, C. Teuscher, and D. Stefanovic. “Memory and information processing in recurrent neural networks”. Tech. Rep., 2016.

[Gray 06] R. M. Gray. “Toeplitz and Circulant Matrices: A Review”. Foundations and Trends in Communications and Information Theory, Vol. 2, No. 3, pp. 155–239, 2006.

[Grig 14] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Stochastic time series forecasting using time-delay reservoir computers: performance and universality”. Neural Networks, Vol. 55, pp. 59–71, 2014.

[Grig 15] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Optimal nonlinear information processing capacity in delay-based reservoir computers”. Scientific Reports, Vol. 5, No. 12858, pp. 1–11, 2015.

[Grig 16a] L. Grigoryeva, J. Henriques, L. Larger, and J.-P. Ortega. “Nonlinear memory capacity of parallel time-delay reservoir computers in the processing of multidimensional signals”. Neural Computation, Vol. 28, pp. 1411–1451, 2016.

[Grig 16b] L. Grigoryeva, J. Henriques, and J.-P. Ortega. “Reservoir computing: information processing of stationary signals”. In: Proceedings of the 19th IEEE International Conference on Computational Science and Engineering, pp. 496–503, 2016.

[Grig 18] L. Grigoryeva and J.-P. Ortega. “Echo state networks are universal”. Neural Networks, Vol. 108, pp. 495–508, 2018.

[Grig 19] L. Grigoryeva and J.-P. Ortega. “Differentiable reservoir computing”. Journal of Machine Learning Research, Vol. 20, No. 179, pp. 1–62, 2019.

[Herm 10] M. Hermans and B. Schrauwen. “Memory in linear recurrent neural networks in continuous time”. Neural Networks, Vol. 23, No. 3, pp. 341–355, apr 2010.

[Horn 13] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, second Ed., 2013.

[Horn 94] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1994.

[Jaeg 02] H. Jaeger. “Short term memory in echo state networks”. Fraunhofer Institute for Autonomous Intelligent Systems, Technical Report, Vol. 152, 2002.

[Jaeg 04] H. Jaeger and H. Haas. “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication”. Science, Vol. 304, No. 5667, pp. 78–80, 2004.

[Jaeg 10] H. Jaeger. “The ‘echo state’ approach to analysing and training recurrent neural networks with an erratum note”. Tech. Rep., German National Research Center for Information Technology, 2010.

[Kall 02] O. Kallenberg. Foundations of Modern Probability. Probability and Its Applications, Springer New York, 2002.

[Kalm 10] R. Kalman. “Lectures on Controllability and Observability”. In: Controllability and Observability, pp. 1–149, Springer, Berlin, Heidelberg, 2010.

[Livi 16] L. Livi, F. M. Bianchi, and C. Alippi. “Determination of the edge of criticality in echo state networks through Fisher information maximization”. 2016.

[Manj 13] G. Manjunath and H. Jaeger. “Echo state property linked to an input: exploring a fundamental characteristic of recurrent neural networks”. Neural Computation, Vol. 25, No. 3, pp. 671–696, 2013.

[Marz 17] S. Marzen. “Difference between memory and prediction in linear recurrent networks”. Physical Review E, Vol. 96, No. 3, pp. 1–7, 2017.

[Matt 92] M. B. Matthews. On the Uniform Approximation of Nonlinear Discrete-Time Fading-Memory Systems Using Neural Network Models. PhD thesis, ETH Zurich, 1992.

[Matt 93] M. B. Matthews. “Approximating nonlinear fading-memory operators using neural network models”. Circuits, Systems, and Signal Processing, Vol. 12, No. 2, pp. 279–307, jun 1993.

[Matt 94] M. Matthews and G. Moschytz. “The identification of nonlinear discrete-time fading-memory systems using neural network models”. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 41, No. 11, pp. 740–751, 1994.

[Orti 12] S. Ortín, L. Pesquera, and J. M. Gutierrez. “Memory and nonlinear mapping in reservoir computing with two uncoupled nonlinear delay nodes”. In: T. Gilbert, M. Kirkilionis, and G. Nicolis, Eds., Proceedings of the European Conference on Complex Systems, pp. 895–899, Springer International Publishing Switzerland, 2012.

[Orti 19] S. Ortín and L. Pesquera. “Tackling the trade-off between information processing capacity and rate in delay-based reservoir computers”. Frontiers in Physics, Vol. 7, p. 210, 2019.

[Orti 20] S. Ortín and L. Pesquera. “Delay-based reservoir computing: tackling performance degradation due to system response time”. Optics Letters, Vol. 45, No. 4, pp. 905–908, 2020.

[Sont 91] E. D. Sontag. “Kalman’s controllability rank condition: from linear to nonlinear”. In: A. C. Antoulas, Ed., Mathematical System Theory, pp. 453–462, Springer, 1991.

[Sont 98] E. D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer-Verlag, 1998.

[Tino 13] P. Tino and A. Rodan. “Short term memory in input-driven linear dynamical systems”. Neurocomputing, Vol. 112, pp. 58–63, 2013.

[Tino 18] P. Tino. “Asymptotic Fisher memory of randomized linear symmetric Echo State Networks”. Neurocomputing, Vol. 298, pp. 4–8, 2018.

[Verz 19] P. Verzelli, C. Alippi, and L. Livi. “Echo state networks with self-normalizing activations on the hyper-sphere”. Scientific Reports, Vol. 9, No. 1, p. 13887, dec 2019.

[Verz 20] P. Verzelli, C. Alippi, L. Livi, and P. Tino. “Input representation in recurrent neural networks dynamics”. mar 2020.

[Wain 16] G. Wainrib and M. N. Galtier. “A local echo state property through the largest Lyapunov exponent”. Neural Networks, Vol. 76, pp. 39–45, apr 2016.

[Whit 04] O. White, D. Lee, and H. Sompolinsky. “Short-Term Memory in Orthogonal Neural Networks”. Physical Review Letters, Vol. 92, No. 14, p. 148102, apr 2004.

[Xue 17] F. Xue, Q. Li, and X. Li. “The combination of circle topology and leaky integrator neurons remarkably improves the performance of echo state network on time series prediction”. PLoS ONE, Vol. 12, No. 7, p. e0181816, 2017.

[Yild 12] I. B. Yildiz, H. Jaeger, and S. J. Kiebel. “Re-visiting the echo state property”. Neural Networks, Vol. 35, pp. 1–9, nov 2012.