
Graph Embeddings
Alicia Frame, PhD
October 10, 2019

What’s an embedding?
How do these work?
- Motivating Example: Word2Vec
- Motivating Example: DeepWalk
Graph embeddings overview
Graph embedding techniques
Graph embeddings with Neo4j

2

Overview

What does the internet say?
- Google: “An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors”
- Wikipedia: “In mathematics, an embedding is one instance of some mathematical structure contained within another instance, such as a group that is a subgroup.”

3

TL;DR - what’s an embedding?

A way of mapping something (a document, an image, a graph) into a fixed length vector (or matrix) that captures key features while reducing the dimensionality

Graph embeddings are a specific type of embedding that translate graphs, or parts of graphs, to fixed length vectors (or tensors)

4

So what’s a graph embedding?

An embedding translates something complex into something a machine can work with:
- It represents the important features of the input object in a compact, low-dimensional format
- The embedded representation can be used as a feature for ML, for direct comparisons, or as an input representation for a DL model

Embeddings typically learn what’s important in an unsupervised, generalizable way.

5

But why bother?

6

Motivating Examples

How can I represent words in a way that I can use them mathematically?
- How similar are two words?
- Can I use the representation of a word in a model?

Naive approach: how similar are the strings?
- Hand-engineered rules?
- How many of each letter?

CAT = [10100000000000000001000000]
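As a toy illustration of the “how many of each letter” encoding, here is a minimal Python sketch (the helper name and the 26-letter alphabet are just illustrative assumptions):

```python
from collections import Counter
import string

def letter_count_vector(word):
    """Encode a word as a 26-dimensional vector of letter counts (A-Z)."""
    counts = Counter(word.upper())
    return [counts.get(letter, 0) for letter in string.ascii_uppercase]

print(letter_count_vector("CAT"))  # 1s at the positions for A, C, and T; 0s elsewhere
```

Two words with similar spellings get similar vectors, but “CAT” and “ACT” become identical while “CAT” and “KITTEN” look unrelated, which is why the following slides move on to context-based representations.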

7

Motivating example: Word Embeddings

Frequency matrix:

8

Motivating example: Word Embeddings

Weighted term frequency (TF-IDF)
Can we use documents to encode words?
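A hedged sketch of the TF-IDF idea with scikit-learn (the toy documents below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "tylenol is a pain reliever",
    "paracetamol is a pain reliever",
    "graphs connect nodes with relationships",
]

vectorizer = TfidfVectorizer()      # term frequency weighted by inverse document frequency
X = vectorizer.fit_transform(docs)  # sparse (n_docs x n_terms) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))         # each column is a crude document-based encoding of a term
```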

Word order probably matters too: Words that occur together have similar contexts.

9

Motivating example: Word Embeddings

- “Tylenol is a pain reliever” and “Paracetamol is a pain reliever” have the same context
- Co-occurrence: how often do two words appear in the same context window?
- Context window: a specific number of neighboring words, in a specific direction

Example corpus: “He is not lazy. He is intelligent. He is smart.”
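A minimal sketch of counting co-occurrences over this toy corpus, assuming a symmetric context window of two words (all names here are illustrative):

```python
from collections import defaultdict

corpus = ["he is not lazy", "he is intelligent", "he is smart"]
window = 2  # number of neighbors counted on each side of the target word

cooc = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[(target, tokens[j])] += 1

print(cooc[("he", "is")])   # 3 -- "he" and "is" co-occur in every sentence
print(cooc[("is", "not")])  # 1
```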

10

Motivating example: Word Embeddings


Why not stop here?
- You need more documents to really understand context … but the more documents you have, the bigger your matrix is
- Giant sparse matrices or vectors are cumbersome and uninformative

We need to reduce the dimensionality of our matrix

11

Motivating example: Word Embeddings

Count Based Methods: Linear algebra to the rescue?

Pros: preserves semantic relationships, accurate, known methods
Cons: huge memory requirements, not trained for a specific task
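Count-based methods typically factorize the large co-occurrence (or term-document) matrix to obtain dense word vectors. A hedged sketch using truncated SVD from scikit-learn (the random matrix is only a stand-in for a real co-occurrence matrix):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in for a large, sparse word-word co-occurrence matrix (vocab x vocab)
cooccurrence = np.random.poisson(1.0, size=(1000, 1000)).astype(float)

svd = TruncatedSVD(n_components=50, random_state=42)
word_vectors = svd.fit_transform(cooccurrence)  # one 50-dimensional vector per word

print(word_vectors.shape)  # (1000, 50)
```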

12

Motivating Example: Word Embeddings

13

Motivating Example: Word Embeddings

Predictive Methods: learn an embedding for a specific task

14

Motivating Example: Word Embeddings


The SkipGram model learns a vector representation for each word that maximizes the probability of that word given its surrounding context.

15

Motivating Example: Word Embeddings

Input: the target word, as a one-hot encoded vector

Output: a predicted probability, for each word in the corpus, that it appears in the context of the input word
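To make the training setup concrete, here is a hedged sketch of generating (input, output) skip-gram training pairs from a sentence, using a context window of two (the function name and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) training pairs for a skip-gram model."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

sentence = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(sentence):
    print(target, "->", context)  # e.g. the -> quick, the -> brown, quick -> the, ...
```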


16


Motivating Example: Word Embeddings

The hidden layer is a weight matrix with one row per word, and one column per neuron -- this is the embedding!

Maximize the probability that the next word is w_t given h:

Train model by maximizing the log-likelihood over the training set:

Skipgram model calculates:
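For reference, a standard formulation of these quantities (in the spirit of the classic word2vec write-ups, with $h$ the context and $V$ the vocabulary) is roughly:

$$P(w_t \mid h) = \operatorname{softmax}\!\big(\operatorname{score}(w_t, h)\big) = \frac{\exp\big(\operatorname{score}(w_t, h)\big)}{\sum_{w' \in V} \exp\big(\operatorname{score}(w', h)\big)}$$

$$J = \sum_{t} \log P(w_t \mid h_t)$$

where score(w, h) is typically the dot product between the word’s output vector and the hidden-layer representation of the context, and J is the log-likelihood maximized over the training set.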

18

(if we really want to get into the math)

19

Motivating Example: Word Embeddings

Word embeddings condense representations of the words while preserving context:

20

Cool, but what’s this got to do with graphs?

Motivating example: DeepWalk

21

How do we represent a node in a graph mathematically? Can we adapt word2vec?
- Each node is like a word
- The neighborhood around the node is the context window

Extract the context for each node by sampling random walks from the graph:

For every node in the graph, take n fixed length random walks (equivalent to sentences)
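A minimal sketch of sampling these random walks, assuming an undirected NetworkX graph (the graph, walk length, and walk count are illustrative, not the talk’s actual settings):

```python
import random
import networkx as nx

def random_walks(G, num_walks=10, walk_length=8, seed=42):
    """For every node, take num_walks fixed-length uniform random walks ("sentences")."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append([str(node) for node in walk])
    return walks

G = nx.karate_club_graph()  # small built-in example graph
walks = random_walks(G)
print(walks[0])             # one "sentence" of node IDs
```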

22

Motivating example: DeepWalk

Once we have our sentences, we can extract the context windows and learn weights using the same skip-gram model
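Continuing the sketch above, the walks can then be fed to an off-the-shelf skip-gram implementation such as gensim’s Word2Vec (gensim 4.x API; the hyperparameters are illustrative, not the ones used in the talk):

```python
from gensim.models import Word2Vec

# `walks` is the list of node-ID "sentences" from the random-walk sketch above
model = Word2Vec(
    sentences=walks,
    vector_size=64,  # embedding dimension
    window=5,        # context window over each walk
    sg=1,            # 1 = skip-gram
    min_count=1,
    workers=4,
)

node_vector = model.wv["0"]  # 64-dimensional embedding for node "0"
print(node_vector.shape)
```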

(Objective is to predict neighboring nodes given the target node)

23

Motivating example: DeepWalk

Embeddings are the hidden layer weights from the skipgram model

Note: there are also graph equivalents of the matrix-factorization and hand-engineered approaches we discussed for words.

24

Motivating example: DeepWalk

25

Graph Embeddings Overview

There are lots of graph embeddings...

26

What type of graph are you trying to create an embedding for?
- Monopartite graphs (DeepWalk is designed for these)
- Multipartite graphs (e.g. knowledge graphs)

What aspect of the graph are you trying to represent?
- Vertex embeddings: describe the connectivity of each node
- Path embeddings: traversals across the graph
- Graph embeddings: encode an entire graph into a single vector


Most techniques consist of:

- A similarity function that measures the similarity between nodes
- An encoder function that generates the node embedding
- A decoder function that reconstructs pairwise similarity
- A loss function that measures how good the reconstruction is

27

Node embedding overview

Shallow: the encoder function is an embedding lookup

Matrix Factorization:
- These techniques all rely on an adjacency matrix as input
- Matrix factorization is applied either directly to the input, or to some transformation of it

Random Walk:
- Obtain node co-occurrence via random walks
- Learn weights that optimize a similarity measure
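A hedged sketch of how the similarity/encoder/decoder/loss recipe looks for a shallow embedding: the encoder is just a row lookup in an embedding matrix, the decoder is a dot product, and the loss compares the decoded similarity to the adjacency matrix (all names, shapes, and the random stand-in data are illustrative):

```python
import numpy as np

num_nodes, dim = 100, 16
rng = np.random.default_rng(0)

Z = rng.normal(scale=0.1, size=(num_nodes, dim))     # embedding matrix: one row per node
A = rng.integers(0, 2, size=(num_nodes, num_nodes))  # stand-in adjacency matrix (similarity target)

def encoder(node_id):
    """Shallow encoder: an embedding lookup."""
    return Z[node_id]

def decoder(z_u, z_v):
    """Reconstruct pairwise similarity as a dot product."""
    return z_u @ z_v

def loss(u, v):
    """Squared error between decoded similarity and observed similarity."""
    return (decoder(encoder(u), encoder(v)) - A[u, v]) ** 2

print(loss(0, 1))
```

Training would adjust the rows of Z to minimize this loss over many node pairs; the matrix-factorization and random-walk families differ mainly in which notion of similarity they reconstruct.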

28

Shallow Graph Embedding Techniques

Drawbacks of these shallow techniques:
- Massive memory footprint; computationally intense
- Local-only perspective; assumes similar nodes are close together

31

Shallow Graph Embedding Techniques

Why not stick with these?
- Shallow embeddings are inefficient: no parameters are shared between nodes
- They can’t leverage node attributes
- They only generate embeddings for nodes present when the embedding was trained, which is problematic for large, evolving graphs

Newer methodologies compress information:
- Neighborhood autoencoder methods
- Neighborhood aggregation (sketched below)
- Convolutional autoencoders
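For the neighborhood-aggregation family mentioned above (e.g. GraphSAGE-style methods), a minimal, hedged sketch of one mean-aggregation step: a node’s representation is computed from its own features plus an aggregate of its neighbors’ features, so weights are shared across nodes and unseen nodes can still be embedded (all names and shapes are illustrative):

```python
import numpy as np

def mean_aggregate(node, features, neighbors, W_self, W_neigh):
    """One aggregation step: combine a node's features with the mean of its
    neighbors' features, then apply shared weight matrices and a ReLU."""
    neigh_mean = features[neighbors[node]].mean(axis=0)
    h = features[node] @ W_self + neigh_mean @ W_neigh
    return np.maximum(h, 0)

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))  # 5 nodes, 8 input features each
neighbors = {0: [1, 2], 1: [0], 2: [0, 3, 4], 3: [2], 4: [2]}
W_self = rng.normal(size=(8, 16))
W_neigh = rng.normal(size=(8, 16))

print(mean_aggregate(0, features, neighbors, W_self, W_neigh).shape)  # (16,)
```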

32

Shallow Embeddings

33

Autoencoder methods

Using Graph Embeddings

34

Why are we going to all this trouble?

35

Visualization & pattern discovery:
- Leverage lots of existing approaches: t-SNE plots, PCA

Clustering and community detection:
- Apply generic tabular-data approaches (e.g. k-means), while capturing both functional and structural roles
- Build KNN graphs based on embedding similarity

Node classification / semi-supervised learning:
- Predict missing node attributes

Link prediction:
- Predict edges not present in the graph
- Use either similarity measures/heuristics or ML pipelines
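A hedged sketch of two of these downstream uses, assuming node_embeddings is an (n_nodes x d) NumPy array produced by any of the methods above (the cluster count and k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
node_embeddings = rng.normal(size=(200, 64))  # stand-in for learned embeddings

# Community detection with a generic tabular approach
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(node_embeddings)

# Nearest neighbors by embedding similarity (cosine distance) -- the basis for a KNN graph
knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(node_embeddings)
distances, indices = knn.kneighbors(node_embeddings[:1])

print(clusters[:10])
print(indices)  # node 0 plus its 5 most similar nodes
```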

36

Why are we going to all this trouble?

Embeddings can make the graph algorithm library even more powerful!

Graph Embeddings in Neo4j

37

Two prototype implementations from Labs: DeepWalk & DeepGL
- DeepGL is more similar to a “hand-crafted” embedding
- Uses graph algorithms to generate features
- Diffusion of values across edges, plus dimensionality reduction

Neither is ready for production use, but lessons were learned:
- Lots of demand
- Memory intensive and not tuned for performance
- Deep learning is not easy in Java

Python is easy to get started with for experimentation, but doesn’t perform at scale

38

Neo4j Labs Implementations

We’re actively exploring the best ways to implement graph embeddings at scale, so please stay tuned.

39

...So what’s next?

1. A graph embedding is a fixed length vector of…
   a. Numbers
   b. Letters
   c. Nodes

2. An embedding is a ______________ representation of your data
   a. Human readable
   b. Lower dimensional
   c. Binary

3. What’s the name of the graph embedding we walked through in this presentation?

40

Hunger Games!
