
Page 1

GPU-Accelerated Semantic Similarity Search at Scale

Kubilay Atasu, IBM Research - Zurich

in collaboration with Thomas Parnell, Celestine Duenner, Manolis Sifalakis, Haris Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi

Page 2


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Page 3

Why scalable similarity (i.e., nearest-neighbor) search?

Example: financial news analysis

§ 100k news entries every day

§ 100M entries in three years

§ searching, browsing, clustering

Need for a similarity/distance metric:

§ must be accurate and scalable

Image Source: http://social-dynamics.org/

Page 4

Word Mover’s Distance (WMD) for Semantic Similarity


M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Principle of WMD

[Figure: the sentences "The Queen to tour Canada" and "Royal visit to Halifax" are mapped into the word-embedding space, where related words (Queen/Royal, tour/visit, Canada/Halifax) lie close to one another.]

Page 5

High Cost of Word Mover’s Distance (WMD)

Method         Quality    Time complexity  GPU-friendly
WMD            Very high  Cubic            No
Relaxed WMD    High       Quadratic        Yes
Our solution   High       Linear           Yes

Word Mover’s Distance: very high quality, but very high complexity!

Dense and Sparse Linear Algebra on Graphics Processing Units (GPUs)

Sub-second query performance on very large data sets!

Page 6


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Page 7

WMD: Earth Mover’s Distance using Word Embeddings

[Figure: the cost matrix holds the pairwise word distances between the words of the two documents (Queen, tour, Canada vs. Royal, visit, Halifax); each document is represented as a bag-of-words histogram (Histogram 1 and Histogram 2).]

Solves a minimum-cost flow problem!

Cubic time complexity in the size of the histograms!
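Since WMD is exactly this minimum-cost flow (an optimal-transport linear program), the definition can be illustrated with a small CPU reference sketch built on SciPy's general-purpose LP solver. This is only a sketch of the definition, not the authors' implementation; the function name wmd and its argument conventions are ours.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(f1, f2, C):
    """Word Mover's Distance between two normalized word histograms.

    f1: (h1,) word weights of document 1, summing to 1
    f2: (h2,) word weights of document 2, summing to 1
    C:  (h1, h2) pairwise distances between the documents' words
    """
    h1, h2 = C.shape
    # Flow conservation: word i of doc 1 ships out exactly f1[i] units
    # of mass, and word j of doc 2 receives exactly f2[j] units.
    A_eq = np.zeros((h1 + h2, h1 * h2))
    for i in range(h1):
        A_eq[i, i * h2:(i + 1) * h2] = 1.0   # row sums of the flow matrix
    for j in range(h2):
        A_eq[h1 + j, j::h2] = 1.0            # column sums of the flow matrix
    b_eq = np.concatenate([f1, f2])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun  # minimum total transport cost
```

Solving this LP for every document pair is what makes exact WMD so expensive in practice, which motivates the cheaper bounds below.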

Page 8

Relaxed WMD (RWMD): A Lower Bound of WMD

[Figure: an h1 × h2 cost matrix is computed between Histogram 1 (h1 words) and Histogram 2 (h2 words).]

M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Quadratic time and space!

It is not possible to do better when comparing only two histograms!

Page 9

Problem Formulation: Input and Output Data Structures

Sparse matrix X1 (DATABASE): n1 histograms over a vocabulary of v words; X1[i,w] is the frequency of word w in histogram i.

Dense matrix E: v rows of embedding vectors of size m; E[w] is the embedding vector of word w.

Sparse matrix X2 (QUERY): n2 histograms over the same vocabulary of v words.

Dense matrix R: for each of the n2 query histograms, the k most similar histograms in X1.

Given two sets of histograms (X1 and X2) and the embedding vectors E: for each histogram in X2, compute the k most similar histograms in X1.
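Once a row of distances is available, extracting the k most similar database histograms is a partial sort. A minimal NumPy sketch of this last step (the helper name top_k is ours; D is assumed to be an n2 × n1 array of distances):

```python
import numpy as np

def top_k(D, k):
    """For each query (row of D, shape n2 x n1), return the indices of
    the k closest database histograms, sorted by increasing distance."""
    idx = np.argpartition(D, k, axis=1)[:, :k]             # k smallest, unordered
    order = np.argsort(np.take_along_axis(D, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)          # R: n2 x k
```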


Page 10

Computing the Cost Matrix: Dense Matrix-Matrix Multiplication

Compute pairwise distances between all words in histogram i and all words in histogram j.

Dense matrix T1,i (h1,i words × m): embedding vectors of all words in histogram i.

Dense matrix T2,j (h2,j words × m): embedding vectors of all words in histogram j.

Ci,j = T1,i ∘ T2,j        Complexity: O(h²m)

(Here ∘ denotes the pairwise-distance computation, realized as a dense matrix-matrix multiplication.)

Excellent candidate for GPU acceleration!
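A standard way to turn pairwise Euclidean distances into a single matrix product (and hence a GEMM call on the GPU) is the expansion ||a − b||² = ||a||² + ||b||² − 2a·b. A hedged NumPy sketch of the cost-matrix computation under that assumption (names are ours):

```python
import numpy as np

def cost_matrix(T1, T2):
    """Pairwise Euclidean distances between the rows of T1 (h1 x m)
    and T2 (h2 x m).  The dominant operation is the (h1 x m) by
    (m x h2) product T1 @ T2.T -- exactly the dense matrix-matrix
    multiplication that maps so well onto GPUs."""
    sq = ((T1 * T1).sum(axis=1)[:, None]     # ||a||^2 as a column
          + (T2 * T2).sum(axis=1)[None, :]   # ||b||^2 as a row
          - 2.0 * (T1 @ T2.T))               # -2 a.b via GEMM
    return np.sqrt(np.maximum(sq, 0.0))      # clamp rounding noise below zero
```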

Page 11

Word Mover’s Distance (WMD)

Dense matrix D (n2 × n1): D[j,i] is the distance between histogram j and histogram i.

Dense vector F1,i (h1,i entries): frequencies of the words in histogram i.

Given Ci,j, F1,i, and F2,j, compute D[j,i]:

Ci,j = T1,i ∘ T2,j
D[j,i] = WMD(F1,i, F2,j, Ci,j)

Complexity of computing D[j,i]: O(h²m + h³ log h)
Complexity of computing the full row D[j]: O(nh²m + nh³ log h)

Page 12

Relaxed Word Mover’s Distance – Quadratic Implementation

Dense matrix D (n2 × n1): D[j,i] is the distance between histogram j and histogram i.

Dense vector F1,i (h1,i entries): frequencies of the words in histogram i.

Given Ci,j, F1,i, and F2,j, compute D[j,i]:

Ci,j = T1,i ∘ T2,j
D[j,i] = F1,iᵀ · min(Ci,j)   (row-wise minimum of the cost matrix)

Complexity of computing D[j,i]: O(h²m)
Complexity of computing the full row D[j]: O(nh²m)
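A direct NumPy transcription of this quadratic RWMD (one direction of the lower bound; the function and argument names are ours):

```python
import numpy as np

def rwmd(F1, T1, T2):
    """Quadratic-time Relaxed WMD: each word of histogram i moves all
    of its weight to the single closest word of histogram j, so a
    row-wise minimum of the cost matrix replaces the min-cost flow.

    F1: (h1,) word weights; T1: (h1, m), T2: (h2, m) embeddings."""
    sq = ((T1 * T1).sum(1)[:, None] + (T2 * T2).sum(1)[None, :]
          - 2.0 * (T1 @ T2.T))               # GEMM trick from the sketch above
    C = np.sqrt(np.maximum(sq, 0.0))         # cost matrix Ci,j
    return F1 @ C.min(axis=1)                # D[j,i] = F1^T . rowmin(Ci,j)
```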


Page 13

Quadratic Implementation – GPU Mapping

T2,j, 0 ≤ j ≤ n2 − 1: streamed in, one document at a time (dense matrix T2,j, h2,j words × m).

T1,i, 0 ≤ i ≤ n1 − 1: resident in GPU memory.

[Figure: the database matrices T1,0, T1,1, …, T1,n1−1, with heights h1,0, h1,1, …, h1,n1−1, stacked in GPU memory; n1 histograms in total.]

Compute one row of D in parallel on a single GPU:

D[j] = F1ᵀ · min(T1 ∘ T2,j)

Memory requirement on one GPU: O(nhm)
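A CPU sketch of this mapping (under our own layout assumptions: the database embeddings are pre-concatenated into one tall resident matrix, and queries arrive one at a time):

```python
import numpy as np

def rwmd_one_row(F1_list, T1_cat, offsets, T2):
    """One full row D[j]: a single query against the whole database.

    T1_cat:  (sum of h1,i) x m, all database embeddings back to back
    offsets: (n1+1,) start offset of each histogram inside T1_cat
    F1_list: per-histogram word-weight vectors
    T2:      (h2,j x m) embeddings of the streamed query document
    """
    sq = ((T1_cat * T1_cat).sum(1)[:, None] + (T2 * T2).sum(1)[None, :]
          - 2.0 * (T1_cat @ T2.T))
    mins = np.sqrt(np.maximum(sq, 0.0)).min(axis=1)  # closest query word per db word
    return np.array([F1_list[i] @ mins[offsets[i]:offsets[i + 1]]
                     for i in range(len(F1_list))])
```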


Page 14

How far away is RWMD from WMD?


What is the fraction of overlap between top-k results of RWMD and WMD?

[Figure: RWMD vs. WMD precision. X-axis: RWMD selection rate (%), from 10 down to 0.001; Y-axis: precision (0 to 1). One curve per WMD selection rate: 0.001%, 0.01%, 0.1%, 1%, and 10%.]

Page 15


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Page 16

Relaxed WMD (RWMD): Redundancy

[Figure: Histogram 1 is compared against Histogram 2 and Histogram 3. A separate cost matrix (h1 rows each) is computed for every pair, even though the rows that correspond to common words are identical.]

Common words are the problem!

Redundancy can be eliminated!

Page 17

Overview of Linear-Complexity RWMD (LC-RWMD)

[Figure: the query documents X2 are distributed across a cluster of GPUs that hold the database documents X1.]

Phase 1: For each word in the vocabulary, compute the distance to the closest word in one query document. Store the results in a dense vector Z.

Phase 2: A sparse-matrix dense-vector multiplication between X1 and Z computes the distances between the query and the database documents.

Page 18

LC-RWMD: First Phase

Dense matrix T2,j (h2,j words × m): embedding vectors of the words in query histogram j.

Dense matrix E (v words × m): embedding vectors of the complete vocabulary.

Multiply E with the transpose of T2,j and compute row-wise minimums to obtain the dense vector Z (v × 1):

Z = min(E ∘ T2,j)

Complexity of the first phase: O(vhm)

Page 19

LC-RWMD: Second Phase

Dense vector Z (v × 1): for each word in the vocabulary, the distance to the closest word in the query histogram.

Sparse matrix X1 (n1 histograms × v words): weights of the database histograms in compressed sparse row (CSR) format.

A sparse matrix-vector multiplication computes the distances to all database documents at once:

D[j] = X1 × Z        Complexity: O(nh)

Overall complexity: O(vhm + nh)
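Putting both phases together for a single query, a minimal NumPy/SciPy sketch (a CPU stand-in for the CUBLAS GEMM of phase 1 and the CUSPARSE sparse matrix-vector product of phase 2; the function name is ours):

```python
import numpy as np
from scipy.sparse import csr_matrix

def lc_rwmd(X1, E, T2):
    """Linear-complexity RWMD of one query against the whole database.

    X1: (n1, v) CSR matrix of database word weights
    E:  (v, m) embeddings of the entire vocabulary
    T2: (h2, m) embeddings of the query's words
    """
    # Phase 1: one (v x m) x (m x h2) product plus a row-wise minimum
    # gives every vocabulary word its distance to the closest query word.
    sq = ((E * E).sum(1)[:, None] + (T2 * T2).sum(1)[None, :]
          - 2.0 * (E @ T2.T))
    Z = np.sqrt(np.maximum(sq, 0.0)).min(axis=1)   # dense vector of length v
    # Phase 2: one sparse matrix-vector product reuses Z across all n1
    # documents -- the work the quadratic method repeats per document.
    return X1 @ Z                                  # D[j], length n1

# usage sketch: D = lc_rwmd(csr_matrix(db_weights), embeddings, query_words)
```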

Page 20

Complexity Comparison (Time)

Complexity of linear RWMD: O(vhm + nh)

Complexity of quadratic RWMD: O(nh²m)

Improvement vs. quadratic RWMD: O(min(nh/v, hm))


h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

Comparing one query document with n database documents
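To see where the improvement factor comes from (a one-line derivation using the symbols above): the speed-up is the ratio of the two running times, and it is capped by whichever term of the linear cost dominates:

O(nh²m) / O(vhm + nh) = O(min(nh²m / (vhm), nh²m / (nh))) = O(min(nh/v, hm))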

Page 21

LC-RWMD – GPU Mapping

Sparse matrix X1 (n1 histograms × v words) is resident in GPU memory.

Dense matrix E (v words × m): the word embeddings are resident in GPU memory.

T2,j, 0 ≤ j ≤ n2 − 1: streamed in, one document at a time (dense matrix T2,j, h2,j words × m).

Memory requirement on one GPU: O(vm + nh + vh)

Page 22

Complexity Comparison (Space)

Space complexity of linear RWMD: O(vm + nh + vh)

Space complexity of quadratic RWMD: O(nhm)

Improvement w.r.t. quadratic RWMD: O(min(nh/v, nm/v, m))

n: # database documents, h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

Page 23

LC-RWMD: Dealing with Asymmetric Distances

Pass 1: sparse matrix X1 (DATABASE, n1 × v) against sparse matrix X2 (QUERY, n2 × v) yields distance matrix D1 (n2 × n1).

Pass 2: the roles are swapped: sparse matrix X2 (DATABASE, n2 × v) against sparse matrix X1 (QUERY, n1 × v) yields distance matrix D2 (n1 × n2).

D = max(D1ᵀ, D2)

The transpose, element-wise maximum, and top-k steps run on the CPU.
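In NumPy terms the combination step is a one-liner; a sketch under the shape conventions above (D1 is n2 × n1, D2 is n1 × n2; the helper name is ours):

```python
import numpy as np

def symmetrize(D1, D2):
    """Element-wise maximum of the two one-sided RWMD lower bounds.
    Since both are lower bounds on WMD, their maximum is a tighter,
    symmetric lower bound; top-k then runs on the rows of D."""
    return np.maximum(D1.T, D2)   # D, shape (n1, n2)
```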

Page 24


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Page 25

Speed-up vs. CPU-based RWMD

§ CPU: Intel® Core™ i7-6900K @ 3.20 GHz, 8 cores (SMT2), 64 GB memory, Intel® MKL

§ GPU: NVIDIA® Tesla® P100, 16 GB memory, CUDA 9.0 with CUBLAS and CUSPARSE

Set 1: h=150 words per histogram

Set 2: h=30 words per histogram

Google’s Word2Vec (Google News)

v = 3M words, m = 300 floating-point numbers

All operations in single-precision floating point.

[Figure: speed-up over the CPU baseline (log scale, 1 to 10000). Quadratic RWMD on GPU: 18x (Set 1) and 19x (Set 2). LC-RWMD on GPU: 2795x (Set 1) and 1330x (Set 2).]

Page 26

Runtime vs GPU-accelerated WMD


Time to compare one query doc with all database docs using 16 CPU processes + 16 GPUs

Set 1: n=1M docs, h=150 words per hist. Set 2: n=2.8M docs, h=30 words per hist.

[Figure: runtime in seconds (log scale, 0.001 to 1000) for WMD (16 GPUs) and LC-RWMD (16 GPUs). LC-RWMD is ~30000x faster on Set 1 and ~2000x faster on Set 2.]

Page 27

Comparison with WMD: Precision at Top-K for Set 2

[Figure: precision (y-axis, 0.15 to 0.95) vs. K (x-axis, 4 to 128). Curves: LC-RWMD very large, LC-RWMD large, WMD medium, LC-RWMD medium, WMD small, LC-RWMD small.]

§ small: 300-1000 examples per label

§ medium: 1k-10k examples per label

§ large: 10k-100k examples per label

§ very large: 100k-1M examples per label

Page 28

Summary

A linear complexity method for computing Relaxed Word Mover’s Distance

§ The original method proposed by Kusner et al. has quadratic complexity

§ ~30000-fold improvement in performance w.r.t. GPU-accelerated WMD

§ ~2800-fold improvement w.r.t. the CPU-based quadratic RWMD

Main insight: Big Data offers new ways of dealing with algorithmic complexity!

§ Reduce complexity by eliminating redundant and repetitive operations

§ Exploit the massive parallelism offered by GPUs and clusters of GPUs


Page 29

Business and Academic Impact

§ Being used by our business developers: ingestion of business news

§ Sub-second execution latency for similarity queries (100k docs per day)

§ Database of 100M documents using 16 NVIDIA ® Tesla ® P100 GPUs

§ Larger databases or higher ingestion rates? Simply add more GPUs!

§ IEEE Big Data 2017 Conference, ERCIM News, GTC 2018 Conference


Page 30

Future directions

§ Possible improvements:

§ CUDA streams to overlap CPU/GPU computation, half-precision support

§ Sinkhorn Distance to better approximate WMD (quadratic complexity)

§ Supervised training of word weights and word vectors (supervised WMD)

§ Limitations of bag-of-words: augment syntax trees with word vectors

§ Possible extensions:

§ Use FPGAs with hard floating-point cores and high-bandwidth memories

§ Similarity search in other domains: time series, images, genomics data


Page 31

Questions?

Page 32

Many-to-many LC-RWMD: First Phase

Dense matrix E (v words × m): E[w] is the vector representation of word w.

T2,j, 0 ≤ j ≤ n2 − 1: resident in GPU memory.

[Figure: the query matrices T2,0, T2,1, …, T2,n2−1, with heights h2,0, h2,1, …, h2,n2−1, stacked in GPU memory; n2 histograms in total.]

Dense matrix Z (v × n2):

Z = min(E ∘ T2)

Page 33

Many-to-many LC-RWMD: Second Phase

Z[w,j] stores, for each word w in the vocabulary, the distance to the closest word in query histogram j.

Sparse matrix X1 (n1 histograms × v words).

A sparse-matrix dense-matrix multiplication computes the full distance matrix:

D = X1 × Z
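A CPU sketch of the many-to-many variant (assuming Z, which is now dense v × n2, fits in memory; names are ours):

```python
import numpy as np

def lc_rwmd_many(X1, E, T2_list):
    """All queries at once: Z grows from a vector to a (v, n2) matrix,
    and phase 2 becomes a single sparse-dense matrix multiplication.

    X1: (n1, v) sparse CSR database; E: (v, m) vocabulary embeddings;
    T2_list: one (h2,j x m) embedding matrix per query document."""
    sqE = (E * E).sum(1)[:, None]                 # ||E[w]||^2, reused per query
    Z = np.empty((E.shape[0], len(T2_list)), dtype=E.dtype)
    for j, T2 in enumerate(T2_list):
        sq = sqE + (T2 * T2).sum(1)[None, :] - 2.0 * (E @ T2.T)
        Z[:, j] = np.sqrt(np.maximum(sq, 0.0)).min(axis=1)
    return X1 @ Z                                 # D: (n1, n2)
```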

Page 34

How far away is Word Centroid Distance from WMD?

[Figure: WCD vs. WMD precision. X-axis: WCD selection rate (%), from 10 down to 0.001; Y-axis: precision (0 to 1). One curve per WMD selection rate: 0.001%, 0.01%, 0.1%, 1%, and 10%.]

What is the fraction of overlap between top-k results of WCD and WMD?
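For reference, the Word Centroid Distance of Kusner et al. is the distance between the two documents' frequency-weighted embedding centroids, an even cheaper (but, as the plot suggests, looser) lower bound on WMD than RWMD. A minimal sketch:

```python
import numpy as np

def wcd(F1, T1, F2, T2):
    """Word Centroid Distance: Euclidean distance between the weighted
    centroids of the two documents' word embeddings.
    F1: (h1,), T1: (h1, m); F2: (h2,), T2: (h2, m)."""
    return np.linalg.norm(F1 @ T1 - F2 @ T2)
```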