![Page 1: Machine Learning on Sequences - University of Pennsylvania](https://reader033.vdocument.in/reader033/viewer/2022051323/627bb350f6977235c75d6075/html5/thumbnails/1.jpg)
Machine Learning on Sequences
I GNNs are architectures specialized in learning from data defined over graph supports
I Several processes have a sequential nature. To learn from them, we need dedicated architectures
Machine Learning on Sequences
I Often, we want to learn properties of a sequence ⇒ Is the particle entering the forbidden area?
[Figure: particle trajectory observed at positions x1, x2, ..., x8]
I This problem is not just a simple sequence of classifications ⇒ yt = φ(xt)
I It is a (sequence of) classifications of a sequence ⇒ yt = φ(x1:t) = φ(xt , xt−1, . . . , x1)
Unbounded Memory Growth
I Predictions on a sequence depend on observation histories ⇒ yt = Φ(xt , xt−1, . . . , x1)
[Diagram: inputs x1, x2, x3, ..., xt with outputs y1, y2, y3, ..., yt; each yt depends on all inputs up to time t]
I Recurrent neural networks (RNNs) estimate a hidden state to avoid this unbounded memory growth
Markov Random Processes
I A stochastic process (random sequence) is said to be Markov or memoryless if
$p(x_{t+1} \mid x_{1:t}) = p(x_{t+1} \mid x_t)$
I Conditioning on the current value xt is the same as conditioning on the whole trajectory x1:t
⇒ The future, given the present, is independent of the past
⇒ For predicting the future, knowledge of the past is irrelevant
I Outputs (e.g., trajectory categories) are conditionally independent ⇒ $p(y_t \mid x_t) = p(y_t \mid x_{1:t})$
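The Markov property can be checked numerically on a small chain. The sketch below assumes a hypothetical 3-state transition matrix `P` and initial distribution `pi0` (neither comes from the slides), builds the joint distribution of three consecutive states, and confirms that conditioning on the full history gives the same result as conditioning on the current state only.

```python
import numpy as np

# Hypothetical 3-state chain: P[i, j] = p(x_{t+1} = j | x_t = i)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
pi0 = np.array([0.5, 0.3, 0.2])  # assumed distribution of x_1

# Joint p(x_1, x_2, x_3) = pi0[x_1] * P[x_1, x_2] * P[x_2, x_3]
joint = pi0[:, None, None] * P[:, :, None] * P[None, :, :]

# Condition on the whole history: p(x_3 | x_1, x_2)
cond_history = joint / joint.sum(axis=2, keepdims=True)

# Every value of x_1 yields the same conditional, namely p(x_3 | x_2) = P
markov_holds = all(np.allclose(cond_history[i], P) for i in range(3))
```

Here the equality holds by construction of the joint distribution; for a non-Markov process the slices `cond_history[i]` would differ across `i`.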
Learning in a Markov Process
I In a memoryless Markov process, learning reduces to a sequence of separate learning problems
I State evolution is a chain of memoryless transitions, and outputs depend on the current state only
[Diagram: states xt → xt+1 with transition p(xt+1 | xt); outputs yt and yt+1 drawn from p(yt | xt) and p(yt+1 | xt+1)]
I An AI to predict yt mimics the conditional distribution of the observations. The past is irrelevant
I The learned map Φ takes the place of the conditional distributions in the diagram ⇒ yt = Φ(xt), yt+1 = Φ(xt+1)
Recurrent Neural Networks
I Machine Learning in stochastic processes that are not Markov ⇒ The past is relevant in learning
All Processes are Markov. However Disguised
I The evolution of the trajectory is not a Markov Process if we observe positions only
I But it is Markov if we have access to velocities and accelerations ⇒ Hidden (unobserved) states
[Figure: the same trajectory, with observed positions x1, ..., x8 and hidden velocities and accelerations (v1, u1), ..., (v8, u8)]
I All systems are Markov ⇒ We often lack enough information to observe their Markov structure
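A minimal sketch of why positions alone are not Markov, assuming a simple constant-velocity motion model (the function `step` and all numbers are illustrative, not from the slides). Two particles that share the same current position but have different velocities move to different next positions, so the position by itself does not determine the future; the pair (position, velocity) does.

```python
import numpy as np

def step(position, velocity, dt=1.0):
    """One step of an assumed constant-velocity motion model."""
    return position + dt * velocity, velocity

# Two particles at the same current position but with different velocities
pos = np.array([1.0, 2.0])
next_a, _ = step(pos, np.array([1.0, 0.0]))
next_b, _ = step(pos, np.array([0.0, -1.0]))

# Same observed position, different futures: positions alone are not Markov
same_future = np.allclose(next_a, next_b)

# The velocity can be recovered from two consecutive positions, which is
# exactly the extra memory a position-only observer needs
prev_a = pos - np.array([1.0, 0.0])
velocity_estimate = pos - prev_a
```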
Hidden Markov Model
I Stochastic process xt follows a hidden Markov model if there exists a process zt such that
$p(z_{t+1} \mid z_{1:t}) = p(z_{t+1} \mid z_t)$ and $p(x_t \mid z_t) = p(x_t \mid z_{1:t})$
I The hidden state zt is a memoryless Markov stochastic process
I The observed state xt is conditionally independent of the past ⇒ it depends only on the current hidden state zt
I Outputs are also conditionally independent ⇒ $p(y_t \mid z_t) = p(y_t \mid z_{1:t})$
Machine Learning on Hidden Markov Models
I In a hidden Markov model learning is not equivalent to a sequence of learning problems
I The AI can try to mimic the conditional distribution p(yt∣∣ zt). But we don’t have access to zt
[Diagram: hidden states zt → zt+1 with transition p(zt+1 | zt); observations xt, xt+1 drawn from p(xt | zt), p(xt+1 | zt+1); outputs yt, yt+1 drawn from p(yt | zt), p(yt+1 | zt+1)]
I Recurrent Neural Network (RNN) ⇒ Use observed state xt to estimate hidden state zt
I In the learned counterpart, a map Φ takes the place of p(yt | zt) and is applied to estimates of the hidden states zt and zt+1 computed from the observations xt and xt+1
Recurrent Neural Networks
I A recurrent neural network is made up of two separate learning parametrizations
⇒ Φ1(xt , zt−1) ⇒ From observed state xt ⇒ and hidden state zt−1 ⇒ to hidden state update zt
⇒ Φ2(zt) ⇒ From updated hidden state zt ⇒ to output estimate yt
[Diagram: xt and zt−1 enter Φ1(xt , zt−1), which outputs zt; zt enters Φ2(zt), which outputs yt]
I It is a recurrent neural network because hidden states are fed back as inputs for the next time step
Hidden State Update AI
I Use a perceptron for the AI that updates the hidden state ⇒ $\Phi_1(x_t, z_{t-1}) = \sigma(A x_t + B z_{t-1})$
[Diagram: xt and zt−1 enter σ(Axt + Bzt−1), which outputs zt; zt enters Φ2(zt), which outputs yt]
I Number of learnable parameters ≡ Entries of A and B ⇒ Does not depend on the time index t
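The hidden state recursion can be sketched in a few lines of NumPy. This is an illustration under assumptions: σ is taken to be `tanh` (the slides do not fix the nonlinearity), and the sizes, the random weights, and the helper name `rnn_forward` are hypothetical. The same `A` and `B` are reused at every step, so the parameter count is the same for a sequence of length 4 or 400.

```python
import numpy as np

def rnn_forward(xs, A, B, z0):
    """Run z_t = tanh(A x_t + B z_{t-1}) over a sequence; return all z_t."""
    z, states = z0, []
    for x in xs:                 # the same A and B are reused at every step
        z = np.tanh(A @ x + B @ z)
        states.append(z)
    return np.stack(states)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                       # illustrative sizes
A = 0.5 * rng.standard_normal((n_hidden, n_in))
B = 0.5 * rng.standard_normal((n_hidden, n_hidden))

short = rnn_forward(rng.standard_normal((4, n_in)), A, B, np.zeros(n_hidden))
long = rnn_forward(rng.standard_normal((400, n_in)), A, B, np.zeros(n_hidden))
n_params = A.size + B.size   # entries of A and B, whatever the sequence length
```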
Output Prediction AI
I Use another perceptron for the AI that predicts the output ⇒ $\Phi_2(z_t) = \sigma(C z_t)$
[Diagram: xt and zt−1 enter σ(Axt + Bzt−1), which outputs zt; zt enters σ(Czt), which outputs yt]
I We can also use a multi-layer neural network for the output prediction AI ⇒ $\Phi_2(z_t) = \Phi(z_t; H)$
Time Gating
I We discuss the problem of vanishing/exploding gradients in recurrent neural networks
I We introduce gating mechanisms in the form of long short-term memories and gated recurrent units
Vanishing/Exploding Gradients for Long Term Dependencies
I In some tasks, the RNN may have to learn how to model long term dependencies of length T
I This poses a challenge ⇒ the Jacobian ∂zT/∂B will depend on a chain of multiplications by B
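[Diagram: unrolled recurrence up to state zt+T−1 and input xt+T−1, with accumulated factor B^T, the T-th power of B]
I If the eigenvalues of B are ≪ 1, the gradients tend to vanish, leading to exponentially smaller weights B
I If the eigenvalues of B are ≫ 1, the gradients tend to explode, leading to exponentially larger weights B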
Vanishing/Exploding Gradients for Long Term Dependencies (Example)
I Consider a simplification of the RNN where we omit the nonlinear function σ(·) and the inputs xt
zt = Bzt−1
I At time t = T , the state variable zT depends on the T th power of the matrix B
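$z_T = B^T z_{t-T}$ (the superscript T on B and Λ denotes the T-th power)
I If B admits an eigendecomposition $B = Q \Lambda Q^\top$, the recurrence can be rewritten as
$z_T = Q \Lambda^T Q^\top z_{t-T}$
⇒ Components along eigenvalues of magnitude less than one vanish, and components along eigenvalues of magnitude greater than one explode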
⇒ Any component of zt−T not aligned with the largest eigenvalues will be discarded
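The effect of the T-th power can be seen on a 2×2 example. The sketch below assumes a symmetric `B` built from a rotation basis `Q` and eigenvalues 1.1 and 0.9 (illustrative values only), and compares the direct matrix power with the eigenbasis expression.

```python
import numpy as np

# Symmetric B built from a chosen orthonormal basis Q and two eigenvalues
theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
eigenvalues = np.array([1.1, 0.9])     # one above one, one below
B = Q @ np.diag(eigenvalues) @ Q.T

T = 50
z0 = np.array([1.0, 1.0])
zT = np.linalg.matrix_power(B, T) @ z0

# Same computation in the eigenbasis: z_T = Q diag(lambda^T) Q^T z_0
zT_eig = Q @ (eigenvalues**T * (Q.T @ z0))

# Coordinates of z_T along each eigenvector
coords = Q.T @ zT
```

After 50 steps the coordinate along the eigenvalue 1.1 has grown by a factor of about 117, while the coordinate along 0.9 has shrunk by a factor of about 0.005, so `zT` is almost entirely aligned with the dominant eigenvector.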
Gating Mechanism
I To address the issue of vanishing gradients, we add a gating mechanism to RNNs
I Gates are scalars in [0, 1] acting on the current input and on the previous state
⇒ Control how much of the input and past time information should be taken into account
I The value of each gate is updated at every step of the sequence
⇒ Allows creating paths through time with derivatives that neither vanish nor explode
⇒ Creates dependency paths that allow encoding both short and long term dependencies
Long Short-Term Memory (LSTM)
I The most popular gated RNN architecture is the Long Short-Term Memory (LSTM) cell
I Three gates: a forget gate ft ∈ [0, 1], an input gate gt ∈ [0, 1], and a cell output gate qt ∈ [0, 1]
I Let xt be the input, zt the state, and define the internal memory st of the LSTM cell
I Memory st updated by applying the forget gate to st−1 and the input gate to the state update
$s_t = f_t s_{t-1} + g_t \sigma(A x_t + B z_{t-1})$
I State zt updated by applying the cell output gate to the internal cell memory st
$z_t = q_t \sigma(s_t)$
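The two update equations above translate directly into code. This sketch follows the simplified cell as written on the slide, not a full library LSTM: the gates `f`, `g`, `q` are passed in as given scalars in [0, 1], σ is assumed to be `tanh`, and the function name and sizes are hypothetical.

```python
import numpy as np

def lstm_step(x, z_prev, s_prev, A, B, f, g, q):
    """One step of the simplified LSTM cell: memory update, then state update."""
    s = f * s_prev + g * np.tanh(A @ x + B @ z_prev)   # s_t
    z = q * np.tanh(s)                                 # z_t
    return z, s

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((4, 4))
x, z_prev, s_prev = rng.standard_normal(2), np.zeros(4), rng.standard_normal(4)

# Forget gate open, input gate closed: the memory is carried over unchanged
_, s_keep = lstm_step(x, z_prev, s_prev, A, B, f=1.0, g=0.0, q=1.0)

# Forget gate closed, input gate open: the memory is overwritten by the update
_, s_new = lstm_step(x, z_prev, s_prev, A, B, f=0.0, g=1.0, q=1.0)
```

With `f = 1` and `g = 0` the memory passes through the step untouched, which is the kind of path through time along which derivatives neither vanish nor explode.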
Gated Recurrent Unit (GRU)
I The Gated Recurrent Unit (GRU) is a second popular gated version of the RNN
I Slight variation of LSTM ⇒ single gate ut ∈ [0, 1] plays the role of input and forget gates
$z_t = u_t z_{t-1} + (1 - u_t) \sigma(A x_t + r_t B z_{t-1})$
⇒ Reset gate rt ∈ [0, 1] controls contribution of previous state zt−1 to updated state
I Besides the LSTM and the GRU, many more variants of gating mechanisms for RNNs exist
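The GRU update above can be sketched in the same way. As before, this mirrors the slide's equation under assumptions: the gates `u` and `r` are supplied as scalars in [0, 1], σ is taken to be `tanh`, and the function name and sizes are illustrative.

```python
import numpy as np

def gru_step(x, z_prev, A, B, u, r):
    """One step of z_t = u z_{t-1} + (1 - u) tanh(A x_t + r B z_{t-1})."""
    candidate = np.tanh(A @ x + r * (B @ z_prev))
    return u * z_prev + (1.0 - u) * candidate

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((4, 4))
x, z_prev = rng.standard_normal(2), rng.standard_normal(4)

z_hold = gru_step(x, z_prev, A, B, u=1.0, r=1.0)    # keeps the previous state
z_reset = gru_step(x, z_prev, A, B, u=0.0, r=0.0)   # ignores the previous state
```

The single gate `u` trades off keeping the old state against writing the new candidate, which is how it plays the role of both the forget and input gates.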
Gate Computation
I In the LSTM and the GRU, the gates themselves are calculated as the outputs of RNNs
I For example, the forget gate ft of the LSTM has its own state variable z′t (as do all the other gates)
$z'_t = \sigma(A' x_t + B' z'_{t-1})$
I The forget gate ft is then calculated from the input xt and the state z′t as
$f_t = \mathrm{sigmoid}(U x_t + W z'_t)$
⇒ With U and W linear layers mapping the input and state features to a single scalar
⇒ And the sigmoid activation function ensuring gate values in the [0, 1] interval
Graph Recurrent Neural Networks
I We define Graph Recurrent Neural Networks (GRNNs) as particular cases of RNNs
From RNNs to GRNNs
I Consider a time varying process xt in which each of the signals is supported on a graph with shift operator S
[Figure: graph signals xt−2, xt−1, xt on the same graph at consecutive time instants]
I A graph recurrent neural network (GRNN) combines
⇒ A GNN because xt is supported on a graph
⇒ An RNN because xt is a sequence
A Recurrent Neural Network for Graph Signals
I An RNN has a hidden state zt updated with the perceptron ⇒ zt = σ(Axt + Bzt−1)
I And it has an output prediction yt given by the perceptron ⇒ yt = σ(Czt)
[Diagram: Axt and Bzt−1 are summed and passed through σ(·) to produce zt; Czt is passed through σ(·) to produce yt]
I The observed state xt and the output yt are graph signals supported on the graph shift operator S
⇒ The hidden state zt is constructed to be a graph signal supported on the graph shift operator S
Graph Recurrent Neural Networks
I Hidden and observed state are propagated through graph filters to update the hidden state
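$A = A(S) = \sum_{k=0}^{K-1} a_k S^k \qquad B = B(S) = \sum_{k=0}^{K-1} b_k S^k$
I The state update is ⇒ $z_t = \sigma\big[A(S) x_t + B(S) z_{t-1}\big] = \sigma\Big[\sum_{k=0}^{K-1} a_k S^k x_t + \sum_{k=0}^{K-1} b_k S^k z_{t-1}\Big]$
[Diagram: the filter outputs $\sum_k a_k S^k x_t$ and $\sum_k b_k S^k z_{t-1}$ are summed and passed through σ(·) to produce zt; a graph filter applied to zt followed by σ(·) produces yt]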
Graph Recurrent Neural Networks
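I The hidden state zt is propagated through a graph filter to produce the prediction of the output yt
$C = C(S) = \sum_{k=0}^{K-1} c_k S^k$
I The prediction of the output yt is given by ⇒ $y_t = \sigma\big[C(S) z_t\big] = \sigma\Big[\sum_{k=0}^{K-1} c_k S^k z_t\Big]$
[Diagram: the filter outputs $\sum_k a_k S^k x_t$ and $\sum_k b_k S^k z_{t-1}$ are summed and passed through σ(·) to produce zt; the filter $\sum_k c_k S^k z_t$ followed by σ(·) produces yt]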
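A single-feature GRNN step can be sketched with one helper that applies a polynomial graph filter. The code is an illustration under assumptions: σ is taken to be `tanh`, the 4-node cycle graph and the filter taps are made up, and `graph_filter` and `grnn_step` are hypothetical names.

```python
import numpy as np

def graph_filter(S, coeffs, x):
    """Apply sum_k coeffs[k] * S^k to the graph signal x."""
    shifted = np.asarray(x, dtype=float)   # S^0 x
    out = np.zeros_like(shifted)
    for c in coeffs:
        out += c * shifted
        shifted = S @ shifted              # next power of S applied to x
    return out

def grnn_step(S, x, z_prev, a, b, c):
    """Hidden state update followed by the output prediction."""
    z = np.tanh(graph_filter(S, a, x) + graph_filter(S, b, z_prev))
    y = np.tanh(graph_filter(S, c, z))
    return z, y

S = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)      # 4-node cycle, illustrative
a, b, c = [0.5, 0.2], [0.3, 0.1], [1.0, -0.4]  # K = 2 taps per filter
x = np.array([1.0, 0.0, 0.0, 0.0])
z, y = grnn_step(S, x, np.zeros(4), a, b, c)
```

The learnable parameters are the 3K filter taps, so their number depends neither on the sequence length nor on the number of nodes.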
Multiple Feature GRNNs
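I A GRNN is made up of a hidden state update perceptron and an output prediction perceptron
$z_t = \sigma\Big[\sum_{k=0}^{K-1} a_k S^k x_t + \sum_{k=0}^{K-1} b_k S^k z_{t-1}\Big] \qquad y_t = \sigma\Big[\sum_{k=0}^{K-1} c_k S^k z_t\Big]$
I Each of these filters can be replaced by a MIMO filter to yield a GRNN with multiple features
$Z_t = \sigma\Big[\sum_{k=0}^{K-1} S^k X_t A_k + \sum_{k=0}^{K-1} S^k Z_{t-1} B_k\Big] \qquad Y_t = \sigma\Big[\sum_{k=0}^{K-1} S^k Z_t C_k\Big]$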
I Multiple-feature hidden state Zt permits larger dimensionality relative to observed states
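A sketch of the multiple-feature step, assuming `tanh` for σ and hypothetical sizes (one input feature, four hidden features, two output features). The names `mimo_filter` and `mimo_grnn_step` are illustrative; each filter tap is now a matrix that mixes features.

```python
import numpy as np

def mimo_filter(S, X, H):
    """Return sum_k S^k X H[k]; X is (N, F_in), H is (K, F_in, F_out)."""
    shifted = np.asarray(X, dtype=float)   # S^0 X
    out = np.zeros((shifted.shape[0], H.shape[2]))
    for Hk in H:
        out += shifted @ Hk
        shifted = S @ shifted
    return out

def mimo_grnn_step(S, X, Z_prev, A, B, C):
    Z = np.tanh(mimo_filter(S, X, A) + mimo_filter(S, Z_prev, B))
    Y = np.tanh(mimo_filter(S, Z, C))
    return Z, Y

rng = np.random.default_rng(0)
N, K, F_x, F_z, F_y = 5, 3, 1, 4, 2        # illustrative sizes
S = rng.uniform(size=(N, N))
S = (S + S.T) / 2                          # symmetric shift operator
A = rng.standard_normal((K, F_x, F_z))
B = rng.standard_normal((K, F_z, F_z))
C = rng.standard_normal((K, F_z, F_y))
X = rng.standard_normal((N, F_x))
Z, Y = mimo_grnn_step(S, X, np.zeros((N, F_z)), A, B, C)
```

The hidden state `Z` has four features per node even though the observation `X` has one, which is the larger dimensionality the last bullet refers to.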
Spatial Gating
I We extend time gating to GRNNs to handle the problem of vanishing/exploding gradients
I We discuss long range graph dependencies and introduce node and edge gating
Gating in GRNNs
I Like RNNs, GRNNs may also experience the problem of vanishing/exploding gradients
⇒ Happens when eigenvalues of B(S) are much smaller/larger than 1
I Similarly to what we did for RNNs, we address it by adding gating operators to GRNNs
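$Z_t = \sigma\big(\hat{\mathcal{Q}}\{A_S(X_t)\} + \check{\mathcal{Q}}\{B_S(Z_{t-1})\}\big)$
I Input gate operator $\hat{\mathcal{Q}} : \mathbb{R}^{N \times H} \to \mathbb{R}^{N \times H}$ ⇒ controls the importance of the input Xt at time t
I Forget gate operator $\check{\mathcal{Q}} : \mathbb{R}^{N \times H} \to \mathbb{R}^{N \times H}$ ⇒ controls the importance of the state Zt−1 at time t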
Time-Gated GRNNs
I First type of gating for GRNNs is time gating ⇒ simple extension of input and forget gates of RNNs
I In the Time-Gated GRNN, the input and forget gate operators are expressed as
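$\hat{\mathcal{Q}}\{A_S(X_t)\} = \hat{q}_t A_S(X_t), \qquad \check{\mathcal{Q}}\{B_S(Z_{t-1})\} = \check{q}_t B_S(Z_{t-1})$
I Time gating multiplies the input and the state by scalar gates $\hat{q}_t \in [0, 1]$ and $\check{q}_t \in [0, 1]$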
I A single scalar gate is applied to the whole graph signal ⇒ same gate value for all nodes
Long Range Spatial Dependencies
I Even if the eigenvalues of B(S) are ∼ 1, spatial imbalances can cause gradients to vanish in space
⇒ Some nodes/paths might get assigned more importance than others in long range exchanges
I Example: graphs with community structure, where some nodes are highly connected within clusters
⇒ Gradients of $Z_T$ depend on successive products of B(S) ⇒ successive products of S
⇒ For large T, the entries of the matrix power $S^T$ associated with highly connected nodes will get densely populated
⇒ Overshadows community structure ⇒ can’t encode long processes that are local on the graph
Spatial Gating
I Node and edge structure of the graph allows for other forms of gating ⇒ spatial gating
I Node gating ⇒ one input and one forget gate for each node of the graph
[Figure: a 12-node graph shown before and after node gating]
I Edge gating ⇒ one input and one forget gate for each edge of the graph
[Figure: a 12-node graph shown before and after edge gating]
I Spatial gating strategies help encode long range spatial dependencies in graph processes
Node-Gated GRNNs
I In the Node-Gated GRNN, the input gate and forget gate operators are expressed as
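$\hat{\mathcal{Q}}\{A_S(X_t)\} = \mathrm{diag}(\hat{q}_t) A_S(X_t), \qquad \check{\mathcal{Q}}\{B_S(Z_{t-1})\} = \mathrm{diag}(\check{q}_t) B_S(Z_{t-1})$
I Gating operators correspond to multiplication of the input and state by diagonal matrices
⇒ The diagonals are the input and forget vector gates $\hat{q}_t \in [0, 1]^N$ and $\check{q}_t \in [0, 1]^N$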
I A scalar gate applied to each nodal component of the signal ⇒ different gate values for each node
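Node gating amounts to scaling each row (node) of the filter outputs by its own gate value. In the sketch below the arrays `AX` and `BZ` stand in for the filter outputs A_S(X_t) and B_S(Z_{t−1}), the gate values are random placeholders rather than learned quantities, and σ is assumed to be `tanh`.

```python
import numpy as np

rng = np.random.default_rng(1)
N, H = 6, 3
AX = rng.standard_normal((N, H))   # stands in for the filter output A_S(X_t)
BZ = rng.standard_normal((N, H))   # stands in for B_S(Z_{t-1})

q_in = rng.uniform(size=N)         # input gate, one value per node
q_forget = rng.uniform(size=N)     # forget gate, one value per node

Z = np.tanh(np.diag(q_in) @ AX + np.diag(q_forget) @ BZ)

# Multiplying by diag(q) scales row i (node i) by q[i]; broadcasting is equivalent
Z_broadcast = np.tanh(q_in[:, None] * AX + q_forget[:, None] * BZ)

# Time gating is the special case in which every node shares one gate value
Z_time = np.tanh(0.7 * AX + 0.2 * BZ)
Z_node_const = np.tanh(np.diag(np.full(N, 0.7)) @ AX
                       + np.diag(np.full(N, 0.2)) @ BZ)
```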
Edge-Gated GRNNs
I In the Edge-Gated GRNN, the input gate and forget gate operators are expressed as
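$\hat{\mathcal{Q}}\{A_S(X_t)\} = A_{S \odot \hat{Q}_t}(X_t), \qquad \check{\mathcal{Q}}\{B_S(Z_{t-1})\} = B_{S \odot \check{Q}_t}(Z_{t-1})$
I Gating operators correspond to elementwise multiplication of the shift operator by gate matrices
⇒ The matrices multiplying the GSOs are the input and forget matrix gates $\hat{Q}_t \in [0, 1]^{N \times N}$ and $\check{Q}_t \in [0, 1]^{N \times N}$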
I Separate gate for each edge ⇒ control the amount of information transmitted across edges
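Edge gating can be illustrated by running the same filter on the gated shift operator S ⊙ Q. The example assumes a 4-node path graph, a hand-picked gate matrix that fully closes one edge, and filter taps of all ones; none of these values come from the slides, and `graph_filter` is a hypothetical helper.

```python
import numpy as np

def graph_filter(S, coeffs, x):
    """Apply sum_k coeffs[k] * S^k to the graph signal x."""
    shifted = np.asarray(x, dtype=float)
    out = np.zeros_like(shifted)
    for c in coeffs:
        out += c * shifted
        shifted = S @ shifted
    return out

# Path graph 0 - 1 - 2 - 3
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

Q = np.ones((4, 4))
Q[0, 1] = Q[1, 0] = 0.0        # gate value 0 closes the edge between 0 and 1
S_gated = S * Q                # elementwise product of the GSO and the gates

impulse = np.array([1.0, 0.0, 0.0, 0.0])
taps = [1.0, 1.0, 1.0]
open_out = graph_filter(S, taps, impulse)          # spreads to nodes 1 and 2
gated_out = graph_filter(S_gated, taps, impulse)   # stays at node 0
```

Closing a single edge stops the information at node 0 from crossing it, while the rest of the graph is left untouched.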
Gate Computation
I Parameters of input and forget gate operators are the outputs of GRNNs themselves
I Input and forget gate states are expressed as
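$\hat{Z}_t = \sigma\big(\hat{A}_S(X_t) + \hat{B}_S(\hat{Z}_{t-1})\big) \qquad \check{Z}_t = \sigma\big(\check{A}_S(X_t) + \check{B}_S(\check{Z}_{t-1})\big)$
⇒ Gate computation takes different forms depending on the type of gating
I In the case of time gating, the gates are calculated as
$\hat{q}_t = \mathrm{sigmoid}\big(\hat{c}^\top \mathrm{vec}(\hat{Z}_t)\big) \qquad \check{q}_t = \mathrm{sigmoid}\big(\check{c}^\top \mathrm{vec}(\check{Z}_t)\big)$
⇒ Where $\hat{c} \in \mathbb{R}^{HN}$ and $\check{c} \in \mathbb{R}^{HN}$ are fully connected layers and the sigmoid ensures gates in [0, 1]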
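I In the case of node gating, the gates are calculated as
$\hat{q}_t = \mathrm{sigmoid}\big(\hat{C}_S(\hat{Z}_t)\big) \qquad \check{q}_t = \mathrm{sigmoid}\big(\check{C}_S(\check{Z}_t)\big)$
⇒ Where $\hat{C}_S$ and $\check{C}_S$ are graph convolutions and the sigmoid ensures gates in $[0, 1]^N$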
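I In the case of edge gating, the gates are calculated as
$[\hat{Q}_t]_{ij} = \mathrm{sigmoid}\big(\hat{c}^\top \big[\delta_i^\top \hat{Z}_t \hat{C} \,\|\, \delta_j^\top \hat{Z}_t \hat{C}\big]^\top\big) \qquad [\check{Q}_t]_{ij} = \mathrm{sigmoid}\big(\check{c}^\top \big[\delta_i^\top \check{Z}_t \check{C} \,\|\, \delta_j^\top \check{Z}_t \check{C}\big]^\top\big)$
⇒ Where δi and δj are N-dimensional Dirac deltas; $\hat{C} \in \mathbb{R}^{H \times H'}$ and $\check{C} \in \mathbb{R}^{H \times H'}$ are linear layers
⇒ And $\hat{c} \in \mathbb{R}^{2H' \times 1}$ and $\check{c} \in \mathbb{R}^{2H' \times 1}$ are fully connected layers applied to the concatenation || of the features of i and j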
Stability of GRNNs
I GRNNs can be seen as a time extension of GNNs, so they inherit the stability properties of GNNs
Relative Perturbation Model
Definition (Relative perturbation matrices)
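Given GSOs S and Ŝ, we define the set of relative perturbation matrices modulo permutation as
$\mathcal{E}(S, \hat{S}) = \big\{ E \in \mathbb{R}^{N \times N} : P^\top \hat{S} P = S + E S + S E^\top,\ P \in \mathcal{P} \big\}$ (1)
where $\mathcal{P} = \big\{ P \in \{0, 1\}^{N \times N} : P \mathbf{1} = \mathbf{1},\ P^\top \mathbf{1} = \mathbf{1} \big\}$.
I We consider that the distance between two graphs S and Ŝ is given by $d(S, \hat{S}) = \min_{E \in \mathcal{E}(S, \hat{S})} \|E\|$
I Notice that if Ŝ is a permutation of the shift matrix S, then we have d(S, Ŝ) = 0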
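The last remark can be verified on a small random graph: when Ŝ is a relabeling of S, the zero matrix belongs to the set in (1), so the distance is zero. The graph, the permutation, and the variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
W = rng.uniform(size=(N, N))
S = np.triu(W, 1) + np.triu(W, 1).T        # symmetric shift operator

perm = rng.permutation(N)
P0 = np.eye(N)[:, perm]                    # permutation matrix
S_hat = P0.T @ S @ P0                      # relabeled copy of the same graph

# Choosing P = P0^T in definition (1) gives P^T S_hat P = S, so E = 0 is feasible
P = P0.T
E = np.zeros((N, N))
residual = P.T @ S_hat @ P - (S + E @ S + S @ E.T)
distance_upper_bound = np.linalg.norm(E, 2)   # norm of a feasible E
```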
Integral Lipschitz filters
Definition (Integral Lipschitz filters)
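A filter $A(S) = \sum_{k=0}^{K-1} a_k S^k$ is integral Lipschitz if there exists C > 0 such that $a(\lambda) = \sum_{k=0}^{K-1} a_k \lambda^k$ satisfies
$|a(\lambda_2) - a(\lambda_1)| \le C \, \dfrac{|\lambda_2 - \lambda_1|}{|\lambda_1 + \lambda_2| / 2}$ (2)
for all λ1, λ2 ∈ R.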
I Integral Lipschitz filters also satisfy |λa′(λ)| ≤ C , where a′(λ) is the derivative of a(λ)
I Recall that the frequency response of integral Lipschitz filters becomes flat for large λ
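The derivative bound follows from (2) by letting the two frequencies approach each other. A short derivation, using only that the polynomial a(λ) is differentiable:

```latex
\begin{aligned}
&\text{Set } \lambda_1 = \lambda,\ \lambda_2 = \lambda + h \text{ in (2), with } \lambda \neq 0 \text{ and } 0 < |h| < 2|\lambda| : \\
&\qquad \frac{|a(\lambda + h) - a(\lambda)|}{|h|} \;\le\; \frac{2C}{|2\lambda + h|} . \\
&\text{Letting } h \to 0 : \qquad |a'(\lambda)| \;\le\; \frac{C}{|\lambda|}
\quad\Longrightarrow\quad |\lambda\, a'(\lambda)| \;\le\; C . \\
&\text{At } \lambda = 0 \text{ the product } \lambda\, a'(\lambda) \text{ is } 0 \le C .
\end{aligned}
```

The intermediate bound |a′(λ)| ≤ C/|λ| is what makes the frequency response flatten out as |λ| grows.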
Assumptions
I We consider a GRNN with FX = 1 input feature, FZ = 1 state feature, and FY = 1 output feature
$z_t = \sigma\big(A(S) x_t + B(S) z_{t-1}\big) \qquad y_t = \rho\big(C(S) z_t\big)$
(A1) A, B and C are integral Lipschitz with constants CA, CB and CC and ‖A‖ = ‖B‖ = ‖C‖ = 1
(A2) Nonlinearities σ and ρ satisfy: |σ(b)− σ(a)| ≤ |b − a| for all a, b ∈ R, σ(0) = ρ(0) = 0
(A3) Initial hidden state is identically zero, i.e., z0 = 0, and the xt satisfy ‖xt‖ ≤ ‖x‖ = 1 for all t
GRNN stability theorem
Theorem (Stability of GRNNs)
Let $S = V \Lambda V^H$ and Ŝ be the GSOs of the original and perturbed graph, and let $E = U M U^H \in \mathcal{E}(S, \hat{S})$ be such that $d(S, \hat{S}) \le \|E\| \le \varepsilon$. Let yt and ŷt be the outputs of the GRNNs running on S and Ŝ respectively, and satisfying assumptions (A1)-(A3). Then,
$\min_{P \in \mathcal{P}} \|y_t - P^\top \hat{y}_t\| \le C (1 + \sqrt{N} \delta)(t^2 + 3t)\, \varepsilon + O(\varepsilon^2)$ (3)
where $C = \max\{C_A, C_B, C_C\}$ and $\delta = (\|U - V\| + 1)^2 - 1$.
Discussion
I GRNNs are stable to relative perturbations with constant $C(1 + \sqrt{N}\delta)(T^2 + 3T)$, where T is the process length
I C could be set at a fixed value or learned from data through A, B and C ⇒ design parameter
I The term $(1 + \sqrt{N}\delta)$ is a property of the graph perturbation ⇒ cannot be controlled by design
I Eigenvector misalignment $\delta = (\|U - V\| + 1)^2 - 1$ measures commutativity of matrices S and E
I Polynomial dependence on T ⇒ due to recurrence relationship in the computation of zt
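To get a feel for how the stability constant scales with the process length, it can be evaluated directly. The function name and the numerical values below are illustrative and are not taken from the experiments; the O(ε²) term of the bound is ignored.

```python
import math

def grnn_stability_constant(C, N, delta, T):
    """First-order constant C (1 + sqrt(N) delta) (T^2 + 3T) multiplying epsilon."""
    return C * (1 + math.sqrt(N) * delta) * (T**2 + 3 * T)

# Illustrative values only
k8 = grnn_stability_constant(C=1.0, N=100, delta=0.1, T=8)
k16 = grnn_stability_constant(C=1.0, N=100, delta=0.1, T=16)
growth = k16 / k8   # doubling T multiplies the constant by roughly 3.5 here
```

Doubling the process length more than triples the constant, which reflects the quadratic dependence on T.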
Epidemic Modeling with GRNNs
I We use a GRNN, a GNN, and an RNN to track an epidemic on a high school friendship network
Epidemic Modeling
I Model the spread of an infectious disease over a friendship network as a graph process
I Graph is a symmetric friendship network corresponding to a high school in France
I Model the spread of the disease on the graph using Susceptible-Infectious-Removed (SIR) model
I Compare the performance of a GRNN, an RNN, and a GNN in predicting infections after 8 days
Friendship Network
I Real-world friendship network corresponding to 134 students from a high school in Marseille
I Each node of the graph represents a student
I Friendships are modeled as symmetric unweighted edges
I Isolated nodes are removed so that the graph is connected
I Assumption: friends are likely to be in contact with each other
Susceptible-Infectious-Removed (SIR) Disease Model
I Process starts with random seed infections on day 0 ⇒ probability $p_{\text{seed}} = 0.05$
I Each person is in one of the three SIR states ⇒ updated each day with the following rules
I Susceptible: can get the disease from an infected friend with probability $p_{\text{inf}} = 0.3$
I Infectious: can spread the disease for 4 days after being infected, after which they recover
I Removed: have overcome the disease and can no longer spread it or contract it
[Diagram: Susceptible → Infectious with p = 0.3 per day per infected friend; Infectious → Removed after 4 days]
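The daily update rules can be sketched as a simulation step on an adjacency matrix. This is an illustration under stated assumptions, not the code used for the experiments: transmissions from different infected friends are treated as independent events, the toy graph is made up, and `sir_step` is a hypothetical helper.

```python
import numpy as np

SUSCEPTIBLE, INFECTIOUS, REMOVED = 0, 1, 2

def sir_step(adj, state, days_infected, rng, p_inf=0.3, infectious_days=4):
    """Advance the SIR process by one day on a graph with adjacency adj."""
    infectious = state == INFECTIOUS
    n_inf_friends = adj @ infectious.astype(int)
    # Assumption: each infectious friend transmits independently with p_inf
    p_catch = 1.0 - (1.0 - p_inf) ** n_inf_friends
    newly_infected = (state == SUSCEPTIBLE) & (rng.random(state.size) < p_catch)

    days = days_infected + infectious            # one more infectious day
    recovered = infectious & (days >= infectious_days)

    new_state = state.copy()
    new_state[newly_infected] = INFECTIOUS
    new_state[recovered] = REMOVED
    return new_state, np.where(recovered, 0, days)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])                   # toy friendship graph
state = np.array([INFECTIOUS, SUSCEPTIBLE, SUSCEPTIBLE, SUSCEPTIBLE])
days = np.zeros(4, dtype=int)
history = [state]
for _ in range(8):
    state, days = sir_step(adj, state, days, rng)
    history.append(state)
```

In an actual experiment the initial state would be drawn with the seed probability 0.05 per node instead of being fixed by hand.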
Problem Setup
I Problem: given the node states, goal is to predict whether each node will be infected in 8 days
I Input: graph process xt where, at each time t, [xt ]i is given by
$[x_t]_i = \begin{cases} 0, & \text{if student } i \text{ is susceptible} \\ 1, & \text{if student } i \text{ is infected} \\ 2, & \text{if student } i \text{ is removed} \end{cases}$
I Output: binary graph process yt ⇒ our goal is only to track infections
$[y_t]_i = \begin{cases} 0, & \text{if student } i \text{ is susceptible or removed} \\ 1, & \text{if student } i \text{ is infected} \end{cases}$
I Given xt , xt+1, . . . , xt+7, we want to predict yt+8, yt+9, . . . , yt+15 ⇒ binary node classification
Objective Function
I Accuracy is not a good performance metric ⇒ does not distinguish true positives and true negatives
I In epidemic tracking, true positives are more important than true negatives ⇒ maximize F1 score
$F_1 = 2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
I Precision = True Positive/Predicted Positive
⇒ Proportion of correct positive predictions
I Recall = True Positive/All Actual Positive
⇒ Proportion of correctly predicted positives
| | Actual Positive | Actual Negative |
| --- | --- | --- |
| Predicted Positive | True Positive | False Positive |
| Predicted Negative | False Negative | True Negative |
I Loss function we minimize is 1− F1 ⇒ trade-off between minimizing FPs and FNs
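The 1 − F1 objective can be computed from hard binary predictions as sketched below. The helper `f1_loss` is hypothetical; it evaluates the metric on thresholded outputs, whereas gradient-based training would need a differentiable relaxation that the slides do not detail.

```python
import numpy as np

def f1_loss(y_true, y_pred):
    """Return 1 - F1 for binary labels and binary predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return 1.0                      # no true positives at all
    return 1.0 - 2 * precision * recall / (precision + recall)

# 5 infected students out of 100; the predictor says nobody is infected
y_true = np.array([1] * 5 + [0] * 95)
all_negative = np.zeros(100, dtype=int)
accuracy = np.mean(all_negative == y_true)     # high despite missing every case
loss_all_negative = f1_loss(y_true, all_negative)

loss_mixed = f1_loss([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

The all-negative predictor reaches 95% accuracy while its F1 loss is the worst possible value, which is why accuracy is a poor metric for this problem.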
Results
I We compare a GRNN with a GNN and an RNN, all with roughly the same number of parameters
⇒ In the GNN, the time instants become input features ⇒ parameters depend on T
⇒ In the RNN, the nodal components become input features ⇒ parameters depend on N
I GRNN improves upon RNN and GNN ⇒ exploits both spatial and temporal structure of the data