
GPU-Accelerated Semantic Similarity Search at Scale

Kubilay Atasu, IBM Research - Zurich

in collaboration with Thomas Parnell, Celestine Duenner, Manolis Sifalakis, Haris Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Why scalable similarity (i.e., nearest neighbors) search?

Example: financial news analysis

§ 100k news entries every day

§ 100M entries in three years

§ searching, browsing, clustering


Need for a similarity/distance metric:

§ must be accurate and scalable

Image Source: http://social-dynamics.org/

Word Mover’s Distance (WMD) for Semantic Similarity


M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Principle of WMD

The Queen to tour Canada

Royal visit to Halifax

[Diagram: the words Queen, tour, Canada, Royal, visit, and Halifax mapped into the embedding space.]

High Cost of Word Mover’s Distance (WMD)


Method          Quality     Time complexity   GPU-friendly
WMD             Very high   Cubic             No
Relaxed WMD     High        Quadratic         Yes
Our solution    High        Linear            Yes

Word Mover’s Distance: very high quality, but very high complexity!

Dense and Sparse Linear Algebra on Graphics Processing Units (GPUs)

Sub-second query performance on very large data sets!


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

WMD: Earth Mover’s Distance using Word Embeddings


[Diagram: cost matrix of word distances between Histogram 1 (Queen, tour, Canada) and Histogram 2 (Royal, visit, Halifax), built from their bag-of-words representations.]

Solves a minimum-cost flow problem!

Cubic time complexity in the size of the histograms!
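To make the minimum-cost flow formulation concrete, here is a minimal illustrative sketch (not the authors' GPU implementation): it solves the transportation problem behind WMD for one pair of histograms with scipy.optimize.linprog. The function name wmd and its arguments are my own; f1 and f2 are assumed to be normalized word frequencies.

    import numpy as np
    from scipy.optimize import linprog

    def wmd(f1, f2, C):
        # f1 (h1,), f2 (h2,): normalized word frequencies; C (h1, h2): pairwise word distances.
        h1, h2 = C.shape
        # Flow variables T[i, j] >= 0, flattened row-major.
        A_eq = np.zeros((h1 + h2, h1 * h2))
        for i in range(h1):                     # outgoing flow of word i must equal f1[i]
            A_eq[i, i * h2:(i + 1) * h2] = 1.0
        for j in range(h2):                     # incoming flow of word j must equal f2[j]
            A_eq[h1 + j, j::h2] = 1.0
        b_eq = np.concatenate([f1, f2])
        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        return res.fun                          # minimum transport cost = WMD

Solving an optimization of this kind for every document pair is what drives the cubic cost noted above.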

Relaxed WMD (RWMD): A Lower Bound of WMD

[Diagram: cost matrix between Histogram 1 (h1 words) and Histogram 2 (h2 words).]

M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Quadratic time and space!

It is not possible to do better when comparing only two histograms!
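As a complement, a small sketch of the RWMD lower bound for one histogram pair, reusing the f1, f2, C conventions of the WMD sketch above (again my own naming, not the original code):

    def rwmd(f1, f2, C):
        # Relax one set of flow constraints at a time: each relaxed problem is solved by
        # sending all of a word's mass to its cheapest counterpart in the other histogram.
        lb1 = float(f1 @ C.min(axis=1))   # keep the row constraints, drop the column constraints
        lb2 = float(f2 @ C.min(axis=0))   # keep the column constraints, drop the row constraints
        return max(lb1, lb2)              # the tighter of the two lower bounds on WMD

The cost matrix C still has to be built, which is why this pairwise form remains quadratic in time and space.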

Problem Formulation: Input and Output Data Structures

Sparse matrix X1 (DATABASE)

# Words in vocabulary: v

# Hists = n1

X[i,w]: frequency of word w in histogram i

Dense matrix E

Size of embedding vectors: m

# Words = v

E[w]: embedding vector for word w

Sparse matrix X2 (QUERY)

# Words in vocabulary: v

# Hists = n2

Dense matrix R

R: the k most similar histograms in X1, one row per query histogram (# Hists = n2)

Given two sets of histograms (X1 and X2) and the embedding vectors E: for each histogram in X2, compute the K most similar histograms in X1.
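A hypothetical helper, for illustration only, showing how such structures might be assembled from a tokenized corpus (the names build_histograms and vocab are mine; the authors' ingestion pipeline is not described here):

    import numpy as np
    from scipy.sparse import csr_matrix

    def build_histograms(docs, vocab):
        # docs: list of token lists; vocab: dict mapping word -> column index (size v)
        rows, cols, vals = [], [], []
        for i, doc in enumerate(docs):
            counts = {}
            for w in doc:
                if w in vocab:
                    counts[vocab[w]] = counts.get(vocab[w], 0) + 1
            total = float(sum(counts.values())) or 1.0
            for c, n in counts.items():
                rows.append(i); cols.append(c); vals.append(n / total)   # normalized weights
        return csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)), dtype=np.float32)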


Computing the Cost Matrix: Dense Matrix-Matrix Multiplication

# Words = h1,i

Compute pairwise distances between all words in histogram i and all words in histogram j

# Words = h2,j

T1,i: embedding vectors of all words in hist i

Dense matrix T1,i

Size of embedding vectors: m

# Words = h1,i

T2,j: embedding vectors of all words in hist j

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Ci,j = T1,i ∘ T2,j

Complexity: O(h²m)


Excellent candidate for GPU acceleration!
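A sketch of this step, assuming Euclidean word distances and using the identity ||a − b||² = ||a||² + ||b||² − 2a·b so that one dense matrix-matrix product dominates (names are mine):

    def cost_matrix(T1, T2):
        # T1: (h1, m), T2: (h2, m) embedding vectors of the two histograms.
        sq1 = np.sum(T1 * T1, axis=1, keepdims=True)          # (h1, 1) squared norms
        sq2 = np.sum(T2 * T2, axis=1, keepdims=True).T        # (1, h2) squared norms
        sq = np.maximum(sq1 + sq2 - 2.0 * (T1 @ T2.T), 0.0)   # clamp tiny negatives
        return np.sqrt(sq)                                     # (h1, h2) pairwise distances

On a GPU the T1 @ T2.T term maps onto a CUBLAS GEMM, which is why this step is such a good fit for acceleration.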

Word Mover’s Distance (WMD)

Dense matrix D

D[j,i]: Distance between hist j and hist i

# Words = h1,i

F1,i: frequency of the words in histogram i

Dense vector F1,i

Ci,j = T1,i ∘ T2,j

D[j, i] = WMD(F1,i, F2,j, Ci,j)

Complexity of computing D[j, i]: O(h²m + h³ log h)

Complexity of computing D[j]: O(nh²m + nh³ log h)

Given Ci,j, F1,i, and F2,j, compute D[j,i]


Relaxed Word Mover’s Distance – Quadratic Implementation

Dense matrix D

# Hists = n2

D[j,i]: distance between hist j and hist i

# Hists = n1

# Words = h1,i

F1,i: frequency of the words in histogram i

Dense vector F1,i

D[j, i] = F1,iᵀ min(Ci,j)

Complexity of computing D[j, i]: O(h²m)

Complexity of computing D[j]: O(nh²m)

# Words = h1,i

# Words = h2,j

Ci,j = T1,i ∘ T2,j

Given Ci,j, F1,i, and F2,j, compute D[j,i]


Quadratic Implementation – GPU Mapping

T2,j, 0 ≤ j ≤ n2 − 1: streamed in one document at a time

Dense matrix T2,j

Size of word vectors: m

# Words = h2,j

T1,i, 0 ≤ i ≤ n1 − 1: resident in GPU memory

[Diagram: dense matrices T1,0 (h1,0 × m), T1,1 (h1,1 × m), ..., T1,i (h1,i × m), ..., T1,n1-1 (h1,n1-1 × m) resident in GPU memory, one per database histogram (# Hists: n1).]

Compute one row of D in parallel on a single GPU

D[j] = F1ᵀ min(T1 ∘ T2,j)

Memory requirement on one GPU: O(nhm)
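A sequential sketch of this scan (one-sided, following D[j] = F1ᵀ min(T1 ∘ T2,j)); it reuses cost_matrix() from the earlier sketch, and on the GPU the whole row of D is computed in parallel rather than in a loop:

    def rwmd_row(T2, database):
        # database: list of (T1, f1) pairs, one per database histogram
        return [float(f1 @ cost_matrix(T1, T2).min(axis=1)) for T1, f1 in database]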


How far away is RWMD from WMD?


What is the fraction of overlap between top-k results of RWMD and WMD?

[Chart: precision of RWMD vs WMD as a function of the RWMD selection rate (10% down to 0.001%), with curves for WMD selection rates of 10%, 1%, 0.1%, 0.01%, and 0.001%.]


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Relaxed WMD (RWMD): Redundancy


[Diagram: cost matrices between Histogram 1 (h1 words) and Histogram 2, and between Histogram 1 and Histogram 3.]

Common words are the problem!

Redundancy can be eliminated!

Overview of Linear-Complexity RWMD (LC-RWMD)


Query documents (X2)

Cluster of GPUs

Phase 1

For each word in the vocabulary, compute the distance to the closest word in one query doc. Store the results in a dense vector Z.

Phase 2

Dense vector Z

Sparse-matrix dense-vector multiplication between X1 and Z to compute the distances between the query and the database docs.

Database documents (X1)

Distribute X2 across GPUs

LC-RWMD: First Phase

T2,j: embedding vectors of the query histogram j

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Dense matrix E

Size of embedding vectors: m

# Words = v

E: embedding vectors of the complete vocabulary

Multiply E and the transpose of T2,j and compute the row-wise minimums

Dense vector Z (v × 1)

Z = min(E ∘ T2,j)


The intermediate product E ∘ T2,j is a v × h2,j matrix.

Complexity of the first phase: O(vhm)
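A sketch of phase 1, reusing cost_matrix() from the earlier sketch (the function name is mine):

    def lc_rwmd_phase1(E, T2):
        # E: (v, m) embeddings of the whole vocabulary; T2: (h2, m) embeddings of the query words.
        # One dense matrix product of size v x h2 dominates the cost: O(vhm).
        return cost_matrix(E, T2).min(axis=1)   # Z: (v,) distance to the closest query word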

LC-RWMD: Second Phase

Z: distance to the closest word in query histogram for each word in vocabulary

Dense vector Z (v × 1)

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Sparse matrix-vector multiply to compute the distances: D[j] = X1 × Z

Complexity: O(nh)

Overall complexity: O(vhm+nh)


X1: weights of the database histograms in compressed sparse row (csr) format
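A sketch of phase 2, assuming X1 is held as a SciPy CSR matrix of normalized word weights (names are mine):

    from scipy.sparse import csr_matrix

    def lc_rwmd_phase2(X1, Z):
        # X1: csr_matrix (n1, v); Z: (v,) vector from phase 1.
        return X1 @ Z   # (n1,) one-sided RWMD lower bounds from the query to every database doc

On the GPU this corresponds to a sparse-matrix dense-vector multiplication, one of the CUSPARSE operations listed in the results section.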

Complexity Comparison (Time)

Complexity of linear RWMD: O(vhm + nh)

Complexity of quadratic RWMD: O(nh²m)

Improvement vs. quadratic RWMD: O(min(nh/v, hm))


h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

Comparing one query document with n database documents
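For reference, a short derivation of the stated improvement factor (my own working, in the slide's notation):

\[
\frac{nh^2 m}{vhm + nh} \;=\; \Theta\!\left(\min\!\left(\frac{nh^2 m}{vhm},\, \frac{nh^2 m}{nh}\right)\right) \;=\; \Theta\!\left(\min\!\left(\frac{nh}{v},\, hm\right)\right),
\]

since a + b is within a factor of two of max(a, b) for positive a and b.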

LC-RWMD – GPU Mapping

X1 is resident in the GPU memory. T2,j, 0 ≤ j ≤ n2 − 1: streamed in one document at a time.

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Memory requirement on one GPU:

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Dense matrix E

Size of embedding vectors: m

The word-embedding matrix E is resident in GPU memory

# Words = v

O(vm + nh + vh)


Complexity Comparison (Space)

Space complexity of linear RWMD: O(vm + nh + vh)

Space complexity of quadratic RWMD: O(nhm)

Improvement w.r.t. quadratic RWMD: O(min(nh/v, nm/v, m))


n: # database documents, h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

LC-RWMD: Dealing with Asymmetric Distances

Sparse matrix X1 (DATABASE, n1 histograms over v words) queried with sparse matrix X2 (QUERY, n2 histograms over v words) yields Distance Matrix D1 (n2 × n1).

Swapping the roles, sparse matrix X2 (DATABASE) queried with sparse matrix X1 (QUERY) yields Distance Matrix D2 (n1 × n2).

D = max(D1ᵀ, D2)

Transpose, maximum, and top-k are computed on the CPU.
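A sketch of this final combination step on the CPU, assuming D1 is the (n2 × n1) matrix obtained with X1 as the database and D2 the (n1 × n2) matrix with the roles swapped (names are mine):

    import numpy as np

    def symmetric_rwmd_topk(D1, D2, k):
        D = np.maximum(D1.T, D2)                        # (n1, n2) element-wise max of the two bounds
        idx = np.argsort(D, axis=0)[:k]                 # indices of the k closest database docs per query
        return idx, np.take_along_axis(D, idx, axis=0)  # their distances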


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Speed-up vs CPU-based RWMD

§ CPU: Intel® Core® i7-6900K @ 3.20 GHz, 8 cores (SMT2), 64 GB memory, Intel® MKL

§ GPU: NVIDIA® Tesla® P100, 16 GB memory, CUDA 9.0 with CUBLAS and CUSPARSE


Set 1: h=150 words per histogram

Set 2: h=30 words per histogram

Google’s Word2Vec (Google News)

v = 3M words, m = 300 floating pt. nums

All operations in single-prec. floating pt.

[Bar chart (log scale, speed-up factor from 1 to 10000): RWMD on GPU achieves roughly 18x (Set 1) and 19x (Set 2) over the CPU baseline; LC-RWMD on GPU achieves roughly 2795x (Set 1) and 1330x (Set 2).]

Runtime vs GPU-accelerated WMD


Time to compare one query doc with all database docs using 16 CPU processes + 16 GPUs

Set 1: n=1M docs, h=150 words per hist. Set 2: n=2.8M docs, h=30 words per hist.

[Bar chart (log scale, runtime in secs from 0.001 to 1000): WMD (16 GPUs) vs LC-RWMD (16 GPUs) on Set 1 and Set 2; LC-RWMD is roughly 30000x faster on Set 1 and roughly 2000x faster on Set 2.]

Comparison with WMD: Precision at Top-K for Set 2

[Line chart: precision at top-K for K = 4 to 128; curves for LC-RWMD on the small, medium, large, and very large classes and for WMD on the small and medium classes.]

§ small: 300-1000 examples per label

§ medium: 1k-10k examples per label

§ large: 10k-100k examples per label

§ very large: 100k-1M examples per label


Summary

A linear-complexity method for computing the Relaxed Word Mover's Distance

§ The original method proposed by Kusner et al. has quadratic complexity

§ ~30000-fold improvement in performance w.r.t. GPU-accelerated WMD

§ ~2800-fold improvement w.r.t. CPU-accelerated quadratic RWMD

Main insight: Big Data offers new ways of dealing with algorithmic complexity!

§ Reduce complexity by eliminating redundant and repetitive operations

§ Exploit the massive parallelism offered by GPUs and clusters of GPUs


Business and Academic Impact

§ Being used by our business developers: ingestion of business news

§ Sub-second execution latency for similarity queries (100k docs per day)

§ Database of 100M documents using 16 NVIDIA ® Tesla ® P100 GPUs

§ Larger databases or higher ingestion rates? Simply add more GPUs!

§ IEEE Big Data 2017 Conference, ERCIM News, GTC 2018 Conference


Future directions

§ Possible improvements:

§ CUDA streams to overlap CPU/GPU computation, half-precision support

§ Sinkhorn Distance to better approximate WMD (quadratic complexity)

§ Supervised training of word weights and word vectors (supervised WMD)

§ Limitations of bag-of-words: augment syntax trees with word vectors

§ Possible extensions:

§ Use FPGAs with hard floating-point cores and high-bandwidth memories

§ Similarity search in other domains: time series, images, genomics data


Questions?

Many-to-many LC-RWMD: First Phase

Dense matrix E

Size of embedding vectors: m

# Words = v

E[w]: vector representation of word w

Dense matrix Z (v × n2)

Z = min(E ∘ T2)


T2,j, 0 ≤ j ≤ n2 − 1: resident in GPU memory

[Diagram: dense matrices T2,0 (h2,0 × m), T2,1 (h2,1 × m), ..., T2,j (h2,j × m), ..., T2,n2-1 (h2,n2-1 × m) resident in GPU memory, one per query histogram (# Hists: n2).]

Many-to-many LC-RWMD: Second Phase

Z[w,j] stores the distance from each word w in the vocabulary to the closest word in query histogram j

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Sparse-matrix dense-matrix multiply to compute D: D = X1 × Z


Dense matrix Z (v × n2)
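A sketch of the many-to-many variant, reusing lc_rwmd_phase1() from the earlier sketch: phase 1 builds one column of Z per query histogram, and phase 2 becomes a single sparse-matrix dense-matrix product (names are mine):

    def lc_rwmd_many_to_many(E, queries, X1):
        # queries: list of T2 arrays, one (h2_j, m) matrix per query histogram; X1: csr_matrix (n1, v)
        Z = np.stack([lc_rwmd_phase1(E, T2) for T2 in queries], axis=1)   # (v, n2)
        return X1 @ Z                                                     # (n1, n2) distance matrix D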

How far away is Word Centroid Distance from WMD?


[Chart: precision of WCD vs WMD as a function of the WCD selection rate (10% down to 0.001%), with curves for WMD selection rates of 10%, 1%, 0.1%, 0.01%, and 0.001%.]

What is the fraction of overlap between top-k results of WCD and WMD?
