CSCE 496/896 Lecture 9: word2vec and node2vec
TRANSCRIPT
Stephen Scott
(Adapted from Haluk Dogan)
Introduction
To apply recurrent architectures to text (e.g., an NLM), we need a numeric representation of words
The “Embedding lookup” block: where does the embedding come from?
  Could train it along with the rest of the network
  Or, could use an “off-the-shelf” embedding
    E.g., word2vec or GloVe
Embeddings are not limited to words: e.g., biological sequences, graphs, ...
  Graphs: node2vec
The xxxx2vec approach focuses on training embeddings based on context
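As a concrete illustration (not from the lecture), here is a minimal numpy sketch of what the embedding lookup amounts to: a word index selects one row of an embedding matrix. The toy vocabulary and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]        # toy vocabulary (assumed)
word_to_id = {w: i for i, w in enumerate(vocab)}

N, d = len(vocab), 3                              # vocabulary size, embedding dimension
W = rng.normal(size=(N, d))                       # N x d embedding matrix; learned in practice

def embed(word):
    """Embedding lookup: equivalent to one_hot(word) @ W, i.e., selecting one row of W."""
    return W[word_to_id[word]]

print(embed("cat"))                               # d-dimensional vector fed to the rest of the network
```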
Outline
word2vec
  Architectures
  Training
  Semantics of embedding
node2vec
Word2vec (Mikolov et al.)
Training is a variation of autoencoding: rather than mapping a word to itself, learn to map between a word and its context
Context-to-word: Continuous bag-of-words (CBOW)
Word-to-context: Skip-gram
Word2vec (Mikolov et al.): Architectures
CBOW: Predict current word w(t) based on context
Skip-gram: Predict context based on w(t)
One-hot input, linear hidden activation, softmax output
Word2vec (Mikolov et al.): CBOW
N = vocabulary size, d = embedding dimension
N × d matrix W is the shared weights from input to hidden
d × N matrix W′ is the weights from hidden to output
When the one-hot context vectors x_{t−2}, x_{t−1}, ..., x_{t+2} are input, the corresponding rows of W are summed to give v̂
Then compute the score vector v′ = v̂ W′ and softmax it
Train with cross-entropy
Use the ith column of W′ as the embedding of word i
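A minimal numpy sketch of one CBOW forward pass as described above; the toy sizes and the explicit softmax/cross-entropy computation are illustrative assumptions, not the lecture's reference code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 4                                   # vocabulary size, embedding dimension (toy values)
W = rng.normal(size=(N, d))                    # shared input-to-hidden weights
W_prime = rng.normal(size=(d, N))              # hidden-to-output weights

def cbow_forward(context_ids, target_id):
    """One CBOW step: sum the context rows of W, score with W', softmax, cross-entropy."""
    v_hat = W[context_ids].sum(axis=0)         # rows selected by the one-hot context vectors
    scores = v_hat @ W_prime                   # score vector v'
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax over the vocabulary
    loss = -np.log(probs[target_id])           # cross-entropy against the true center word
    return probs, loss

# Predict w(t) = word 3 from the context words w(t-2), w(t-1), w(t+1), w(t+2)
probs, loss = cbow_forward(context_ids=[1, 2, 4, 5], target_id=3)
print(loss)
```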
Word2vec (Mikolov et al.): Skip-gram
Symmetric to CBOW: use the ith row of W as the embedding
Goal is to maximize P(w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} | w_t)
Same as minimizing −log P(w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} | w_t)
Assume words are independent given w_t:
  P(w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} | w_t) = ∏_{j∈{−2,−1,1,2}} P(w_{t+j} | w_t)
Word2vec (Mikolov et al.): Skip-gram
Equivalent to maximizing the log probability
  ∑_{−c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)
Softmax output and linear activation imply
  P(w_O | w_I) = exp(v′_{w_O}^⊤ v_{w_I}) / ∑_{i=1}^{N} exp(v′_i^⊤ v_{w_I})
where v_{w_I} is w_I's (input word) row from W and v′_i is w_i's (output word) column from W′
I.e., trying to maximize the dot product (similarity) between words in the same context
Problem: N is big (≈ 10^5–10^7)
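For concreteness, a numpy sketch of this probability with toy matrices; the denominator sums over all N output columns, which is exactly the cost that negative sampling (next slide) avoids.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 4                                # toy vocabulary size and embedding dimension
W = rng.normal(size=(N, d))                 # row i is the input embedding v_{w_i}
W_prime = rng.normal(size=(d, N))           # column i is the output embedding v'_{w_i}

def p_out_given_in(w_O, w_I):
    """P(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / sum_i exp(v'_i^T v_{w_I})."""
    v_in = W[w_I]                           # input word's row of W
    scores = W_prime.T @ v_in               # dot product with every one of the N output columns
    scores -= scores.max()                  # numerical stability
    return np.exp(scores[w_O]) / np.exp(scores).sum()

print(p_out_given_in(w_O=2, w_I=7))
```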
Word2vec (Mikolov et al.): Skip-gram
Speed up evaluation via negative sampling
Update the weights for the target word and only a small number (5–20) of negative words
I.e., do not update for all N words
To estimate P(w_O | w_I), use
  log σ(v′_{w_O}^⊤ v_{w_I}) + ∑_{i=1}^{k} E_{w_i ∼ P_n(w)} [ log σ(−v′_{w_i}^⊤ v_{w_I}) ]
I.e., learn to distinguish the target word w_O from words drawn from the noise distribution
  P_n(w_i) = f(w_i)^{3/4} / ∑_{j=1}^{N} f(w_j)^{3/4}
where f(w_i) is the frequency of word w_i in the corpus
I.e., P_n(w_i) is the unigram distribution raised to the 3/4 power
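A numpy sketch of the negative-sampling objective for a single (w_I, w_O) pair; the toy word counts and k = 5 negatives are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 10, 4, 5                              # vocabulary size, dimension, negatives per pair
W = rng.normal(size=(N, d))                     # input embeddings (rows)
W_prime = rng.normal(size=(d, N))               # output embeddings (columns)

counts = rng.integers(1, 100, size=N).astype(float)   # toy corpus frequencies f(w_i)
P_n = counts ** 0.75                            # unigram counts raised to the 3/4 power
P_n /= P_n.sum()                                # noise distribution P_n(w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(w_I, w_O):
    """log sigma(v'_{w_O}^T v_{w_I}) + sum over k noise words of log sigma(-v'_{w_i}^T v_{w_I})."""
    v_in = W[w_I]
    positive = np.log(sigmoid(W_prime[:, w_O] @ v_in))
    noise_ids = rng.choice(N, size=k, p=P_n)    # only k words touched instead of all N
    negative = np.log(sigmoid(-(W_prime[:, noise_ids].T @ v_in))).sum()
    return positive + negative                  # maximize this (minimize its negation during SGD)

print(neg_sampling_objective(w_I=7, w_O=2))
```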
Word2vec (Mikolov et al.): Semantics
Distances between country and capital embeddings are similar
Word2vec (Mikolov et al.): Semantics
Analogies: a is to b as c is to d
Given normalized embeddings x_a, x_b, and x_c, compute y = x_b − x_a + x_c
Find d maximizing the cosine similarity x_d^⊤ y / (‖x_d‖ ‖y‖)
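A short sketch of the analogy computation; the vocabulary and embeddings below are random placeholders, so the printed answer is arbitrary, whereas with trained word2vec vectors the maximizer for ("paris", "france", "rome") would be "italy".

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "rome", "italy", "berlin"]      # toy vocabulary (assumed)
E = rng.normal(size=(len(vocab), 4))
E /= np.linalg.norm(E, axis=1, keepdims=True)               # normalized embeddings
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """a is to b as c is to ?: return the word d maximizing cosine(x_d, x_b - x_a + x_c)."""
    y = E[idx[b]] - E[idx[a]] + E[idx[c]]
    scores = E @ y / (np.linalg.norm(E, axis=1) * np.linalg.norm(y))
    for w in (a, b, c):                                      # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("paris", "france", "rome"))
```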
Node2vec (Grover and Leskovec, 2016)
Word2vec's approach generalizes beyond text
All we need to do is represent the context of an instance, so as to embed together instances with similar contexts
E.g., biological sequences, nodes in a graph
Node2vec defines its context for a node based on its local neighborhood, role in the graph, etc.
Node2vec (Grover and Leskovec, 2016): Notation
G = (V, E)
A is the |V| × |V| adjacency matrix
f : V → R^d is a mapping function from individual nodes to feature representations
  i.e., f is a |V| × d matrix
N_S(u) ⊂ V denotes a neighborhood of node u generated through a neighborhood sampling strategy S
Objective: Preserve local neighborhoods of nodes
Node2vec (Grover and Leskovec, 2016)
Organization of nodes is based on:
  Homophily: nodes that are highly interconnected and cluster together should embed near each other
  Structural roles: nodes with similar roles in the graph (e.g., hubs) should embed near each other
In the example graph, u and s1 belong to the same community of nodes, while u and s6, in two distinct communities, share the same structural role of a hub node
Goal:
  Embed nodes from the same network community closely together
  Nodes that share similar roles have similar embeddings
node2vec
Key Contribution: Defining a flexible notion of a node's network neighborhood.
1 BFS: captures the structural role of the vertex
  vertices can be far apart from each other but share a similar kind of neighboring vertices
2 DFS: captures community
  reflects the reachability/closeness of two nodes
  my friend's friend's friend has a higher chance of belonging to the same community as me
node2vec
Objective function
  max_f ∑_{u∈V} log P(N_S(u) | f(u))
Assumptions:
  Conditional independence: P(N_S(u) | f(u)) = ∏_{n_i∈N_S(u)} P(n_i | f(u))
  Symmetry in feature space: P(n_i | f(u)) = exp(f(n_i) · f(u)) / ∑_{v∈V} exp(f(v) · f(u))
The objective function then simplifies to
  max_f ∑_{u∈V} [ −log Z_u + ∑_{n_i∈N_S(u)} f(n_i) · f(u) ],  where Z_u = ∑_{v∈V} exp(f(v) · f(u))
Z_u is approximated with negative sampling
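A numpy sketch of the simplified objective evaluated for one node; here Z_u is computed exactly because the toy setting is tiny, whereas node2vec approximates it with negative sampling, and the neighborhood N_S(u) is just an assumed list (it would come from the random-walk sampling described next).

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 6, 3
F = rng.normal(size=(num_nodes, d))          # f : V -> R^d stored as a |V| x d matrix

def node_objective(u, neighborhood):
    """-log Z_u + sum over n_i in N_S(u) of f(n_i) . f(u), for a single source node u."""
    f_u = F[u]
    Z_u = np.exp(F @ f_u).sum()              # exact partition function (negative sampling in practice)
    return -np.log(Z_u) + sum(F[n] @ f_u for n in neighborhood)

print(node_objective(u=0, neighborhood=[1, 2, 3]))
```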
Node2vec (Grover and Leskovec, 2016): Neighborhood Sampling
Given a source node u, we simulate a random walk of fixed length ℓ:
  P(c_i = x | c_{i−1} = v) = π_{vx} / Z  if (v, x) ∈ E,  and 0 otherwise
where c_0 = u
π_{vx} is the unnormalized transition probability
Z is the normalization constant
The walk is 2nd-order Markovian
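One step of such a walk, sketched with an assumed dictionary of unnormalized transition probabilities π_{vx}:

```python
import random

def walk_step(v, pi):
    """Draw c_i = x given c_{i-1} = v with probability pi_vx / Z over the edges (v, x)."""
    neighbors = list(pi[v].keys())
    weights = list(pi[v].values())      # unnormalized pi_vx; random.choices divides by Z implicitly
    return random.choices(neighbors, weights=weights)[0]

# Toy unnormalized transition probabilities, keyed as pi[v][x] (illustrative only)
pi = {0: {1: 0.5, 2: 2.0}, 1: {0: 1.0}, 2: {0: 1.0}}
print(walk_step(0, pi))
```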
Node2vec (Grover and Leskovec, 2016): Neighborhood Sampling
Search bias α: π_{vx} = α_{pq}(t, x) · w_{vx}, where
  α_{pq}(t, x) = 1/p  if d_{tx} = 0
                 1    if d_{tx} = 1
                 1/q  if d_{tx} = 2
and d_{tx} is the shortest-path distance from the previous node t to the candidate node x
Return parameter p:
  Controls the likelihood of immediately revisiting a node in the walk
  If p > max(q, 1):
    less likely to sample an already visited node
    avoids 2-hop redundancy in sampling
  If p < min(q, 1):
    the walk tends to backtrack a step
    keeps the walk local
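A sketch of the bias function α_{pq}(t, x), assuming the graph is available as neighbor sets so that d_{tx} can be read off directly (x = t means d_{tx} = 0, x adjacent to t means d_{tx} = 1, otherwise d_{tx} = 2):

```python
def alpha(p, q, t, x, neighbors):
    """alpha_pq(t, x): 1/p if d_tx = 0, 1 if d_tx = 1, 1/q if d_tx = 2."""
    if x == t:                    # d_tx = 0: stepping straight back to the previous node t
        return 1.0 / p
    if x in neighbors[t]:         # d_tx = 1: x is also a direct neighbor of t
        return 1.0
    return 1.0 / q                # d_tx = 2: moving farther away from t

# Toy undirected graph as neighbor sets (illustrative only)
neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
print(alpha(p=1.0, q=0.5, t=0, x=3, neighbors=neighbors))   # d_tx = 2, so this prints 1/q = 2.0
```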
Node2vec (Grover and Leskovec, 2016): Neighborhood Sampling
In-out parameter q:
If q > 1: inward exploration
  local view
  BFS-like behavior
If q < 1: outward exploration
  global view
  DFS-like behavior
Node2vec (Grover and Leskovec, 2016): Algorithm
Implicit bias due to choice of the start node u
  Offset by simulating r random walks of fixed length ℓ starting from every node
Phases:
1 Preprocessing to compute transition probabilities
2 Random walks
3 Optimization using SGD
Each phase is parallelizable and executed asynchronously
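Pulling the three phases together, a sketch of the whole pipeline on a toy graph; the choice of gensim's Word2Vec as the skip-gram optimizer and all parameter values are assumptions for illustration, not the authors' reference implementation.

```python
import random
from gensim.models import Word2Vec    # assumes gensim >= 4.0; any skip-gram trainer would do

# Toy unweighted graph as neighbor sets (illustrative only)
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
p, q, r, length = 1.0, 0.5, 10, 20    # return/in-out parameters, walks per node, walk length

def alpha(t, x):
    """Search bias alpha_pq(t, x): 1/p, 1, or 1/q depending on d_tx."""
    return 1.0 / p if x == t else (1.0 if x in graph[t] else 1.0 / q)

def biased_walk(u):
    """Phase 2: one 2nd-order random walk of fixed length starting at node u."""
    walk = [u]
    while len(walk) < length:
        v = walk[-1]
        nbrs = list(graph[v])
        if len(walk) == 1:                              # first step: no previous node yet
            walk.append(random.choice(nbrs))
        else:
            w = [alpha(walk[-2], x) for x in nbrs]      # pi_vx = alpha_pq(t, x) * w_vx with w_vx = 1
            walk.append(random.choices(nbrs, weights=w)[0])
    return [str(v) for v in walk]                       # gensim expects string tokens

# Phase 1 (precomputing the transition probabilities) is folded into alpha() above for brevity.
walks = [biased_walk(u) for _ in range(r) for u in graph]

# Phase 3: optimize the skip-gram objective with negative sampling via SGD.
model = Word2Vec(walks, vector_size=16, window=5, sg=1, negative=5, min_count=0, workers=4)
print(model.wv["0"])                                    # learned embedding f(u) for node 0
```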