TRANSCRIPT
CS224W: Machine Learning with Graphs. Jure Leskovec, Weihua Hu, Stanford University
http://cs224w.stanford.edu
…
Output: Node embeddings. We can also embed larger network structures: subgraphs, whole graphs.
¡ Key idea: Generate node embeddings based on local network neighborhoods
[Figure: an input graph; target node B's local neighborhood (nodes A-F) defines how information flows to B]
¡ Intuition: Nodes aggregate information from their neighbors using neural networks
[Figure: the same computation graph for target node B, with a neural network at each aggregation step]
Scarselli et al., 2009b; Battaglia et al., 2016; Defferrard et al., 2016; Duvenaud et al., 2015; Hamilton et al., 2017a; Kearnes et al., 2016; Kipf & Welling, 2017; Lei et al., 2017; Li et al., 2016; Velickovic et al., 2018; Verma & Zhang, 2018; Ying et al., 2018; Zhang et al., 2018
[Figure: the computation graph again, with unspecified boxes ("?") at each aggregation step]
What’s inside the box?
¡ Many model variants have been proposed, with different choices of neural networks.
Graph Convolutional Networks [Kipf & Welling ICLR'2017]
Each aggregation step: Mean over neighbor messages → Linear + ReLU
GraphSAGE [Hamilton+ NeurIPS'2017]
Each aggregation step: MLP on each neighbor message → Max
¡ Intuition: Network neighborhood defines a computation graph
Every node defines a computation graph based on its neighborhood!
¡ Obtain node representations by neighbor aggregation
¡ Obtain graph representations by pooling node representations (a sketch follows below)
Pool (e.g., Sum, Average)
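To make these two steps concrete, here is a minimal numpy sketch; the toy graph, feature dimensions, and a single shared weight matrix are illustrative assumptions, not from the slides:

```python
import numpy as np

def gnn_layer(H, neighbors, W):
    """One round of neighbor aggregation: every node averages its
    neighbors' embeddings, then applies a shared Linear + ReLU."""
    out = np.zeros((H.shape[0], W.shape[1]))
    for v, nbrs in neighbors.items():
        msg = H[nbrs].mean(axis=0)         # aggregate messages from neighbors
        out[v] = np.maximum(0.0, msg @ W)  # transform: Linear + ReLU
    return out

# Toy graph on 4 nodes with edges 0-1, 0-2, 1-2, 2-3
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
H = np.random.randn(4, 8)   # initial node features
W = np.random.randn(8, 8)   # layer weights (shared across layers here)
for _ in range(2):          # 2 layers -> information from 2-hop neighborhoods
    H = gnn_layer(H, neighbors, W)

h_graph = H.sum(axis=0)     # graph embedding: pool node embeddings (sum)
```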
Graph Neural Networks have achieved state-of-the-art performance on:
¡ Node classification [Kipf+ ICLR’2017]
¡ Graph classification [Ying+ NeurIPS'2018]
¡ Link prediction [Zhang+ NeurIPS'2018]
Are GNNs perfect?
What are the limitations of GNNs?
¡ Some simple graph structures cannot be distinguished by conventional GNNs.
GCN and GraphSAGE fail to distinguish the two graphs.
Assume: input node features are uniform (denoted by the same node color).
¡ GNNs are not robust to noise in graph data.
[Figure: noise in the graph, (1) node feature perturbation or (2) edge addition/deletion, changes the GNN's class prediction (classes 1-3) for the target node]
1. Limitations of conventional GNNs in capturing graph structure
2. Vulnerability of GNNs to noise in graph data
3. Open questions & future directions
¡ Given two different graphs, can GNNs map them into different graph representations?
¡ This is an important condition for the classification scenario.
¡ This is essentially the graph isomorphism test problem.
¡ No polynomial-time algorithm is known for the general case.
¡ GNNs may not perfectly distinguish all graphs!
How well can GNNs perform the graph isomorphism test?
Answering this requires rethinking how GNNs capture graph structure.
¡ GNNs use different computational graphs to distinguish different graphs.
[Figure: two graphs, with nodes 1-4 and 1'-4', and the computation graph rooted at each node]
¡ A node's representation captures the rooted subtree structure around it.
¡ The most discriminative GNNs map different subtrees to different node representations (denoted by different colors).
Injectivity
¡ A function is injective if it maps different elements to different outputs.
¡ The entire neighbor aggregation is injective if every step of neighbor aggregation is injective.
Injective → Injective → the entire function is injective!
¡ Neighbor aggregation is essentially a function over a multi-set (a set with repeating elements).
Neighbor aggregation ≡ multi-set function
[Figure: examples of multi-sets; the same color indicates the same node features]
The discriminative power of GNNs can be characterized by that of multi-set functions.
Next: Analyzing GCN, GraphSAGE
Recall: GCN uses mean pooling.
Mean pooling → Linear + ReLU
GCN will fail to distinguish proportionally equivalent multi-sets.
Not injective!
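A quick numeric check of this failure mode (a minimal sketch; the one-hot "color" features are an illustrative assumption):

```python
import numpy as np

# Two one-hot "colors" standing in for node features
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

S1 = [a, b]           # multi-set {a, b}
S2 = [a, a, b, b]     # proportionally equivalent multi-set {a, a, b, b}

print(np.mean(S1, axis=0))   # [0.5 0.5]
print(np.mean(S2, axis=0))   # [0.5 0.5] -- identical: mean pooling is not injective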
Recall: GraphSAGE uses max pooling.
MLP + Max pooling
GraphSAGE even fails to distinguish multi-sets that have the same set of distinct elements.
Not injective!
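The analogous check for max pooling (same illustrative one-hot features as above); the last line previews that sum pooling does separate the two multi-sets:

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

S1 = [a, b]        # multi-set {a, b}
S2 = [a, a, b]     # same distinct elements, different multiplicities

print(np.max(S1, axis=0))   # [1. 1.]
print(np.max(S2, axis=0))   # [1. 1.] -- identical: max pooling is not injective
print(np.sum(S1, axis=0), np.sum(S2, axis=0))   # [1. 1.] vs [2. 1.] -- sum separates them
```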
How can we design an injective multi-set function using neural networks?
Theorem: Any injective multi-set function can be expressed as
$$\phi\Big(\sum_{x \in S} f(x)\Big)$$
where $S$ is a multi-set, the sum runs over the elements of the multi-set, and $\phi$ and $f$ are non-linear functions.
We can model $\phi$ and $f$ using multi-layer perceptrons (MLPs). Note: the MLP is a universal approximator.
$$\mathrm{MLP}_{\phi}\Big(\sum_{x \in S} \mathrm{MLP}_{f}(x)\Big)$$
Graph Isomorphism Network (GIN) [Xu+ ICLR’2019]
MLP + sum pooling
The GIN’s neighbor aggregation is injective!
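A minimal PyTorch-style sketch of one GIN aggregation step; the $(1 + \epsilon) \cdot h_v$ self-term follows the GIN paper, while the 2-layer MLP and dimensions are illustrative choices:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN step: h_v <- MLP((1 + eps) * h_v + sum of neighbor embeddings).
    Summation keeps the multi-set aggregation injective (unlike mean/max)."""
    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = eps
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, H, A):
        # H: (n, dim) node embeddings; A: (n, n) dense adjacency matrix.
        # A @ H computes the sum of each node's neighbor embeddings.
        return self.mlp((1 + self.eps) * H + A @ H)

# Graph-level readout would then be injective sum pooling: h_G = H.sum(dim=0)
```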
¡ Graph pooling is also a function over a multi-set. Sum pooling gives injective graph pooling!
[Figure: graph representations of the two graphs obtained by sum pooling over their node embeddings]
So far: GIN achieves maximal discriminative power by using injective neighbor aggregation.
How powerful is this?
¡ GIN is closely related to the Weisfeiler-Lehman (WL) graph isomorphism test (1968).
¡ The WL test is known to be capable of distinguishing most real-world graphs.
¡ Next: We will show GIN is as discriminative as the WL test.
¡ WL first maps different rooted subtrees to different colors
Similar to the injective neighbor aggregation in GIN
¡ WL then counts different colors.
Similar to graph pooling in GNN
[Figure: WL color counts for the two graphs, e.g., 4× one color vs. 1×, 1×, and 2× of three colors]
¡ Finally, WL compares the color counts.
The WL test and GIN are operationally equivalent.
[Figure: the set of graphs the WL test can distinguish coincides with the set of graphs GIN can distinguish]
Observation:
¡ GINs have the same discriminative power as the WL graph isomorphism test.
¡ The WL test has been known to distinguish most graphs, except for some corner cases.
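For concreteness, a minimal sketch of 1-dimensional WL color refinement (graphs and iteration count are illustrative). Run on a classic pair of 2-regular graphs, a 6-cycle vs. two triangles (an assumed example, since the slide figures are not recoverable), it returns identical color histograms, exactly such a corner case:

```python
from collections import Counter

def wl_histogram(adj, iters=3):
    """1-WL color refinement. adj maps node -> list of neighbors.
    Returns the histogram of final node colors."""
    color = {v: 0 for v in adj}   # uniform initial features
    for _ in range(iters):
        # New color = (own color, sorted multi-set of neighbor colors)
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: relabel[sig[v]] for v in adj}
    return Counter(color.values())

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # one 6-cycle
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],                # two 3-cycles
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_histogram(cycle6) == wl_histogram(triangles))       # True: WL cannot tell them apart
```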
[Excerpt: Murphy+, "Relational Pooling for Graph Representations", ICML 2019]

where CNN denotes a 2D CNN (LeCun et al., 1989; Krizhevsky et al., 2012) if there are no edge features and a 3D CNN (Ji et al., 2013) if there are edge features, [· on ·] is a concatenation of the representations, and MLP is a multilayer perceptron. Multi-resolution 3D convolutions (Qi et al., 2016) can be used to map variable-sized graphs into the same-sized representation for downstream layers.

GNNs. The function $\vec{f}$ can also be a graph neural network (GNN), a broad class of models that use the graph $G$ itself to define the computation graph. These are permutation-invariant by design, but we will show that their integration into RP can (1) make them more powerful and (2) speed up their computation via theoretically sound approximations. The GNNs we consider follow a message-passing (Gilmer et al., 2017) scheme defined by the recursion

$$h_u^{(l)} = \phi^{(l)}\Big(h_u^{(l-1)},\ \mathrm{JP}\big((h_v^{(l-1)})_{v \in \mathcal{N}(u)}\big)\Big), \quad (4)$$

where $\phi^{(l)}$ is a learnable function with distinct weights at each layer $1 \le l \le L$ of the computation graph, JP is a general (learnable) permutation-invariant function (Murphy et al., 2019), $\mathcal{N}(u)$ is the set of neighbors of $u \in V$, and $h_u^{(l)} \in \mathbb{R}^{d_h^{(l)}}$ is a vector describing the embedding of node $u$ at layer $l$. $h_u^{(0)}$ is the feature vector of node $u$, $(X^{(v)})_{u,\cdot}$, or can be assigned a constant $c$ if $u$ has no features. Under this framework, node embeddings can be used directly to predict node-level targets, or all node embeddings can be aggregated (via a learnable function) to form an embedding $h_G$ used for graph-wide tasks.

There are several variations of Equation 4 in the literature. Duvenaud et al. (2015) proposed using embeddings from all layers $l \in \{1, 2, \ldots, L\}$ for graph classification. Hamilton et al. (2017a) used a similar framework for node classification and link prediction tasks, using the embedding at the last layer, while Xu et al. (2018) extend Hamilton et al. (2017a) to once again use embeddings at all layers for node and link prediction tasks. Other improvements include attention (Velickovic et al., 2018). This approach can be derived from spectral graph convolutions (e.g., Kipf & Welling, 2017). More GNNs are discussed in Section 3.

Recently, Xu et al. (2019) and Morris et al. (2019) showed that these architectures are at most as powerful as the Weisfeiler-Lehman (WL) algorithm for testing graph isomorphism (Weisfeiler & Lehman, 1968), which itself effectively follows a message-passing scheme. Accordingly, we will broadly refer to models defined by Equation 4 as WL-GNNs. Xu et al. (2019) propose a WL-GNN called Graph Isomorphism Network (GIN) which is as powerful as the WL test in graphs with discrete features.

Can a WL-GNN be more powerful than the WL test? WL-GNNs inherit a shortcoming from the WL test (Cai et al., 1992; Arvind et al., 2017; Furer, 2017; Morris et al., 2019): node representations $h_u^{(l)}$ do not encode whether two nodes have the same neighbor or distinct neighbors with the same features, limiting their ability to learn an expressive representation of the entire graph. Consider a task where graphs represent molecules, where node features indicate atom type and edges denote the presence or absence of bonds. Here, the first WL-GNN layer cannot distinguish whether two (say) carbon atoms have a bond with the same carbon atom or bonds to two distinct carbon atoms. Successive layers of the WL-GNN update node representations, and the hope is that nodes eventually get unique representations (up to isomorphisms), and thus allow the WL-GNN to detect whether two nodes have the same neighbor based on the representations of their neighbors. However, if there are too few WL-GNN layers or complex cycles in the graph, the graph and its nodes will not be adequately represented.

Figure 1: $G_{skip}(11, 2)$ and $G_{skip}(11, 3)$. The WL test incorrectly deems these isomorphic.

To better understand this challenge, consider the extreme case illustrated by the two graphs in Figure 1. These are cycle graphs with $M = 11$ nodes where nodes that are $R \in \{2, 3\}$ 'hops' around the circle are connected by an edge. These highly symmetric graphs, which are special cases of circulant graphs (Vilfred, 2004), are formally defined in Definition 2.1, but the key point is that the WL test, and thus WL-GNNs, cannot distinguish these two nonisomorphic graphs.

Definition 2.1 (Circulant Skip Links (CSL) graphs): Let $R$ and $M$ be co-prime natural numbers¹ such that $R < M - 1$. $G_{skip}(M, R)$ denotes an undirected 4-regular graph with vertices $\{0, 1, \ldots, M-1\}$ whose edges form a cycle and have skip links. That is, for the cycle, $\{j, j+1\} \in E$ for $j \in \{0, \ldots, M-2\}$ and $\{M-1, 0\} \in E$. For the skip links, recursively define the sequence $s_1 = 0$, $s_{i+1} = (s_i + R) \bmod M$ and let $\{s_i, s_{i+1}\} \in E$ for any $i \in \mathbb{N}$.

We will use RP to help WL-GNNs overcome this shortcoming. Let $\vec{f}$ be a WL-GNN that we make permutation-sensitive by assigning each node an identifier that depends on $\pi$. Permutation-sensitive IDs prevent the RP sum from collapsing to just one term but, more importantly, help distinguish neighbors that otherwise appear identical. In particular, given any $\pi \in \Pi_{|V|}$, we append to the rows of $X_\pi^{(v)}$ one-hot encodings of the row number before computing $\vec{f}$. We can represent this by an augmented vertex attribute matrix…

¹Two numbers are co-prime if their only common factor is 1.
Follow-up work resolves these corner cases, but with exponential time complexity: [Murphy+ ICML 2019]
The two graphs look the same to the WL test because all of their nodes have the same local subtree structure!
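A small sketch constructing the CSL graphs of Definition 2.1 in the excerpt above (the function name is mine):

```python
from math import gcd

def csl_edges(M, R):
    """Edge set of G_skip(M, R) per Definition 2.1: an M-cycle plus
    skip links of stride R (requires gcd(M, R) == 1 and R < M - 1)."""
    assert gcd(M, R) == 1 and 1 < R < M - 1
    cycle = {frozenset((j, (j + 1) % M)) for j in range(M)}
    skips = {frozenset((s, (s + R) % M)) for s in range(M)}
    return cycle | skips

g2, g3 = csl_edges(11, 2), csl_edges(11, 3)
print(len(g2), len(g3))   # 22 and 22: both 4-regular on 11 nodes
# Both graphs are vertex-transitive, so 1-WL colors every node identically
# within each graph: identical histograms for two nonisomorphic graphs.
```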
¡ Graph classification: social and bio/chem graphs
Training accuracy of different GNN architectures: GIN fits the training data much better than GCN and GraphSAGE.
[Figure: training accuracy curves for the WL kernel, GIN, GCN, and GraphSAGE; the same trend holds across datasets]
¡ Graph classification: social and bio/chem graphs
GIN outperforms existing GNNs also in terms of test accuracy because it can better capture graph structure.
Data / Model   GIN (Powerful)   Sum-1   GCN (Mean)   GraphSAGE (Max)
RDT-B          92.4             90.0    50.0         50.0
IMDB-B         75.1             74.1    74.0         72.3
NCI1           82.7             82.0    80.2         77.7
MUTAG          89.4             90.0    85.6         85.1
Reddit dataset does not have node features!
¡ Existing GNNs use non-injective neighbor aggregation and thus have low discriminative power
¡ GIN uses injective neighbor aggregation, and is as discriminative as the WL graph isomorphism test
¡ GIN achieves state-of-the-art test performance in graph classification
¡ Deep Neural Networks are vulnerable to adversarial attacks!
¡ Attacks are often implemented as imperceptible noise that changes the prediction
[Figure: an imperceptible perturbation along the gradient direction changes the prediction]
¡ Adversaries are very common in applications of graph neural networks, e.g., search engines, recommender systems, social networks, etc.
¡ These adversaries will exploit any exposed vulnerabilities!
Are GNNs robust to adversarial attacks?
[Figure: a partially labeled graph; message passing propagates information from labeled nodes to unlabeled nodes (marked "?")]
¡ Here we focus on semi-supervised node classification using Graph Convolutional Neural Networks (GCN) [Kipf+ ICLR’2017].
Input: a partially labeled, attributed graph ("?" denotes an unlabeled node)
Goal: predict the labels of the unlabeled nodes
$A \in \{0,1\}^{N \times N}$: adjacency matrix; $X \in \{0,1\}^{N \times D}$: (binary) node attributes
$$\mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(1)})\, W^{(2)}\big)$$
$$\hat{A} \equiv \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}\text{: renormalized adjacency matrix}$$
Classification model: two-step GCN message passing
Training: minimize the cross-entropy loss on the labeled data
Testing: apply the model to predict the labels of unlabeled data
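A numpy sketch of this two-step model (forward pass only; training, label masking, and dimensions are omitted or illustrative):

```python
import numpy as np

def gcn_forward(A, X, W1, W2):
    """softmax(A_hat @ ReLU(A_hat @ X @ W1) @ W2), where
    A_hat = D~^{-1/2} (A + I) D~^{-1/2} is the renormalized adjacency."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H = np.maximum(0.0, A_hat @ X @ W1)      # step 1 of message passing
    logits = A_hat @ H @ W2                  # step 2 of message passing
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-wise softmax -> class probs Z
```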
Target node $t \in V$: the node whose classification label we want to change. Attacker nodes $S \subset V$: the nodes the attacker can modify.
¡ What are the attack possibilities in the real world?
Direct attack ($S = \{t\}$):
¡ Modify the target's features (example: change website content)
¡ Add connections to the target (example: buy likes/followers)
¡ Remove connections from the target (example: unfollow untrusted users)
Indirect attack ($t \notin S$):
¡ Modify the attackers' features
¡ Add connections to the attackers
¡ Remove connections from the attackers
Examples: hijack friends of the target; create a link/spam farm
How can we mathematically formalize these attack possibilities?
¡ Zügner+, Adversarial Attacks on Neural Networks for Graph Data, KDD’18
Maximize: the change of the predicted label of the target node
Subject to: limited noise in the graph
[Figure: adding small graph noise changes the GCN's predicted class for the target node]
$$\operatorname*{argmax}_{A',\,X'}\ \max_{c \neq c_{old}}\ \log Z^{*}_{v,c} - \log Z^{*}_{v,c_{old}}$$
$$\text{where } Z^{*} = f_{\theta^{*}}(A', X') = \mathrm{softmax}\big(\hat{A}'\,\mathrm{ReLU}(\hat{A}' X' W^{(1)})\, W^{(2)}\big),$$
$$\text{with } \theta^{*} = \operatorname*{argmin}_{\theta} \mathcal{L}(\theta;\, A', X'), \qquad \text{s.t. } (A', X') \approx (A, X)$$
¡ Find a modified graph that maximizes the change of the predicted label of the target node
Let’s parse the objective function!
$A'$: modified adjacency matrix; $X'$: modified node features; $c_{old}$: original prediction; $c$: new prediction (specified by the attacker); $v$: target node
Increase the log-likelihood of target node $v$ being predicted as $c$; decrease the log-likelihood of $v$ being predicted as $c_{old}$.
The GCN is trained on the modified graph, which is then used to predict the label of the target node.
The modified graph should be close to the original graph, e.g., a fixed budget of edge deletions/additions or node attribute perturbations.
¡ In practice, we cannot exactly solve the optimization problem because…
§ Graph modification is discrete (we cannot use simple gradient descent to optimize)
§ The inner loop involves expensive re-training of the GCN
¡ Some heuristics have been proposed to efficiently obtain an approximate solution (see the sketch below). For example:
§ Greedily choosing the graph modifications step by step
§ Simplifying the GCN by removing the ReLU activation (to work in closed form)
§ Etc.
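A hedged sketch of the greedy loop; the scoring function `score_flip` is a hypothetical placeholder for the closed-form surrogate objective of Zügner+ KDD'18, which is not reproduced here:

```python
def greedy_structure_attack(A, target, candidate_edges, score_flip, budget=5):
    """Greedily apply the single edge flip that most increases the attack
    objective, repeating until the modification budget is spent.
    A is a dense 0/1 adjacency matrix (e.g., a numpy array)."""
    for _ in range(budget):
        # score_flip(A, target, e) is assumed to estimate how much flipping
        # edge e increases log Z*_{target,c} - log Z*_{target,c_old}
        u, v = max(candidate_edges, key=lambda e: score_flip(A, target, e))
        A[u, v] = A[v, u] = 1 - A[u, v]   # flip edge (add if absent, else delete)
    return A
```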
More details in Zügner+ KDD’2018!
¡ Semi-supervised node classification with GCN
§ Class predictions for a single node, produced by 5 GCNs with different random initializations
[Figure: predicted probabilities (clean) over classes 1-7; class 6 is correct]
Classification margin: > 0 means correct classification, < 0 means incorrect classification
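In the notation of the attack objective above, the classification margin plotted in these figures is (a standard definition, spelled out here for clarity):

$$\mathrm{margin}(v) = Z^{*}_{v,\,c_{true}} - \max_{c \neq c_{true}} Z^{*}_{v,\,c}$$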
¡ The GCN prediction is easily manipulated by only 5 modifications of the graph structure (|V| ≈ 2k, |E| ≈ 5k)
[Figure: predicted probabilities after the attack (5 modifications) over classes 1-7]
GNNs are not robust to adversarial attacks!
¡ Chemistry: molecular graphs § Molecular property prediction: GNN(molecule) = toxic?
¡ Biology: protein-protein interaction networks § Protein function prediction: GNN(protein) = adenylate cyclase activity?
¡ Scarcity of labeled data § Labels require expensive experiments → models overfit to small training datasets
¡ Out-of-distribution prediction § In scientific discovery, test examples are very different from training examples → models typically perform poorly
¡ Pre-training GNNs [Hu+ 2019]
[Figure: pre-train the GNN on relevant, easy-to-obtain graph data (unlabeled molecules from a chemistry database), then fine-tune with chemical knowledge on downstream tasks 1…N]
¡ We have seen how to attack GNNs
¡ Open question: how to defend against the attacks?
Challenges:
¡ Tractable optimization on discrete graph data
¡ Achieving a good trade-off between accuracy and robustness