
GPU-Accelerated Semantic Similarity Search at Scale

Kubilay Atasu, IBM Research - Zurich

in collaboration with Thomas Parnell, Celestine Duenner, Manolis Sifalakis, Haris Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Why scalable similarity (i.e., nearest neighbors) search?

Example: financial news analysis

§ 100k news entries every day

§ 100M entries in three years

§ searching, browsing, clustering


Need for a similarity/distance metric:

§ must be accurate and scalable

Image Source: http://social-dynamics.org/

Word Mover’s Distance (WMD) for Semantic Similarity


M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Principle of WMD

The Queen to tour Canada

Royal visit to Halifax

[Diagram: the words Queen, tour, Canada, Royal, visit, and Halifax mapped into the embedding space.]

High Cost of Word Mover’s Distance (WMD)


Method          Quality     Time complexity   GPU-friendly
WMD             Very high   Cubic             No
Relaxed WMD     High        Quadratic         Yes
Our solution    High        Linear            Yes

Word Mover’s Distance: very high quality, but very high complexity!

Dense and Sparse Linear Algebra on Graphics Processing Units (GPUs)

Sub-second query performance on very large data sets!


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

WMD: Earth Mover’s Distance using Word Embeddings


[Diagram: cost matrix of word distances between Histogram 1 (Queen, tour, Canada) and Histogram 2 (Royal, visit, Halifax), built from their bag-of-words representations.]

Solves a minimum-cost flow problem!

Cubic time complexity in the size of the histograms!
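To make the minimum-cost flow formulation concrete, here is a minimal illustrative sketch (not the authors' GPU implementation): it solves the transportation problem behind WMD for one pair of histograms with scipy.optimize.linprog. The function name wmd and its arguments are my own; f1 and f2 are assumed to be normalized word frequencies.

    import numpy as np
    from scipy.optimize import linprog

    def wmd(f1, f2, C):
        # f1 (h1,), f2 (h2,): normalized word frequencies; C (h1, h2): pairwise word distances.
        h1, h2 = C.shape
        # Flow variables T[i, j] >= 0, flattened row-major.
        A_eq = np.zeros((h1 + h2, h1 * h2))
        for i in range(h1):                     # outgoing flow of word i must equal f1[i]
            A_eq[i, i * h2:(i + 1) * h2] = 1.0
        for j in range(h2):                     # incoming flow of word j must equal f2[j]
            A_eq[h1 + j, j::h2] = 1.0
        b_eq = np.concatenate([f1, f2])
        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        return res.fun                          # minimum transport cost = WMD

Solving an optimization of this kind for every document pair is what drives the cubic cost noted above.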

Relaxed WMD (RWMD): A Lower Bound of WMD

[Diagram: cost matrix between Histogram 1 (h1 words) and Histogram 2 (h2 words).]

M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.

Quadratic time and space!

It is not possible to do better when comparing only two histograms!
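As a complement, a small sketch of the RWMD lower bound for one histogram pair, reusing the f1, f2, C conventions of the WMD sketch above (again my own naming, not the original code):

    def rwmd(f1, f2, C):
        # Relax one set of flow constraints at a time: each relaxed problem is solved by
        # sending all of a word's mass to its cheapest counterpart in the other histogram.
        lb1 = float(f1 @ C.min(axis=1))   # keep the row constraints, drop the column constraints
        lb2 = float(f2 @ C.min(axis=0))   # keep the column constraints, drop the row constraints
        return max(lb1, lb2)              # the tighter of the two lower bounds on WMD

The cost matrix C still has to be built, which is why this pairwise form remains quadratic in time and space.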

Problem Formulation: Input and Output Data Structures

Sparse matrix X1 (DATABASE)

# Words in vocabulary: v

# Hists = n1

X[i,w]: frequency of word w in histogram i

Dense matrix E

Size of embedding vectors: m

# Words = v

E[w]: embedding vector for word w

Sparse matrix X2 (QUERY)

# Words in vocabulary: v

# Hists = n2

Dense matrix R

R: the k most similar histograms in X1, one row per query histogram (# Hists = n2)

Given two sets of histograms (X1 and X2) and the embedding vectors E: for each histogram in X2, compute the K most similar histograms in X1.
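A hypothetical helper, for illustration only, showing how such structures might be assembled from a tokenized corpus (the names build_histograms and vocab are mine; the authors' ingestion pipeline is not described here):

    import numpy as np
    from scipy.sparse import csr_matrix

    def build_histograms(docs, vocab):
        # docs: list of token lists; vocab: dict mapping word -> column index (size v)
        rows, cols, vals = [], [], []
        for i, doc in enumerate(docs):
            counts = {}
            for w in doc:
                if w in vocab:
                    counts[vocab[w]] = counts.get(vocab[w], 0) + 1
            total = float(sum(counts.values())) or 1.0
            for c, n in counts.items():
                rows.append(i); cols.append(c); vals.append(n / total)   # normalized weights
        return csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)), dtype=np.float32)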


Computing the Cost Matrix: Dense Matrix-Matrix Multiplication

# Words = h1,i

Compute pairwise distances between all words in histogram i and all words in histogram j

# Words = h2,j

T1,i: embedding vectors of all words in hist i

Dense matrix T1,i

Size of embedding vectors: m

# Words = h1,i

T2,j: embedding vectors of all words in hist j

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Ci,j = T1,i ∘ T2,j

Complexity: O(h²m)


Excellent candidate for GPU acceleration!
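A sketch of this step, assuming Euclidean word distances and using the identity ||a − b||² = ||a||² + ||b||² − 2a·b so that one dense matrix-matrix product dominates (names are mine):

    def cost_matrix(T1, T2):
        # T1: (h1, m), T2: (h2, m) embedding vectors of the two histograms.
        sq1 = np.sum(T1 * T1, axis=1, keepdims=True)          # (h1, 1) squared norms
        sq2 = np.sum(T2 * T2, axis=1, keepdims=True).T        # (1, h2) squared norms
        sq = np.maximum(sq1 + sq2 - 2.0 * (T1 @ T2.T), 0.0)   # clamp tiny negatives
        return np.sqrt(sq)                                     # (h1, h2) pairwise distances

On a GPU the T1 @ T2.T term maps onto a CUBLAS GEMM, which is why this step is such a good fit for acceleration.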

Word Mover’s Distance (WMD)

Dense matrix D

D[j,i]: Distance between hist j and hist i

# Words = h1,i

F1,i: frequency of the words in histogram i

Dense vector F1,i

Ci,j = T1,i ∘ T2,j

D[j, i] = WMD(F1,i, F2,j, Ci,j)

Complexity of computing D[j, i]: O(h²m + h³ log h)

Complexity of computing D[j]: O(nh²m + nh³ log h)

Given Ci,j, F1,i, and F2,j, compute D[j,i]


Relaxed Word Mover’s Distance – Quadratic Implementation

Dense matrix D

# Hists = n2

D[j,i]: distance between hist j and hist i

# Hists = n1

# Words = h1,i

F1,i: frequency of the words in histogram i

Dense vector F1,i

D[j, i] = F1,iᵀ min(Ci,j)

Complexity of computing D[j, i]: O(h²m)

Complexity of computing D[j]: O(nh²m)

# Words = h1,i

# Words = h2,j

Ci,j = T1,i ∘ T2,j

Given Ci,j, F1,i, and F2,j, compute D[j,i]


Quadratic Implementation – GPU Mapping

T2,j, 0 ≤ j ≤ n2 − 1: streamed in one document at a time

Dense matrix T2,j

Size of word vectors: m

# Words = h2,j

T1,i, 0 ≤ i ≤ n1 − 1: resident in GPU memory

[Diagram: dense matrices T1,0 (h1,0 × m), T1,1 (h1,1 × m), ..., T1,i (h1,i × m), ..., T1,n1-1 (h1,n1-1 × m) resident in GPU memory, one per database histogram (# Hists: n1).]

Compute one row of D in parallel on a single GPU

D[j] = F1ᵀ min(T1 ∘ T2,j)

Memory requirement on one GPU: O(nhm)
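A sequential sketch of this scan (one-sided, following D[j] = F1ᵀ min(T1 ∘ T2,j)); it reuses cost_matrix() from the earlier sketch, and on the GPU the whole row of D is computed in parallel rather than in a loop:

    def rwmd_row(T2, database):
        # database: list of (T1, f1) pairs, one per database histogram
        return [float(f1 @ cost_matrix(T1, T2).min(axis=1)) for T1, f1 in database]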


How far away is RWMD from WMD?


What is the fraction of overlap between top-k results of RWMD and WMD?

[Chart: precision of RWMD vs WMD as a function of the RWMD selection rate (10% down to 0.001%), with curves for WMD selection rates of 10%, 1%, 0.1%, 0.01%, and 0.001%.]


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Relaxed WMD (RWMD): Redundancy


[Diagram: cost matrices between Histogram 1 (h1 words) and Histogram 2, and between Histogram 1 and Histogram 3.]

Common words are the problem!

Redundancy can be eliminated!

Overview of Linear-Complexity RWMD (LC-RWMD)


Query documents (X2)

Cluster of GPUs

Phase 1

For each word in the vocabulary, compute the distance to the closest word in one query doc. Store the results in a dense vector Z.

Phase 2

Dense vector Z

Sparse-matrix dense-vector multiplication between X1 and Z to compute the distances between the query and the database docs.

Database documents (X1)

Distribute X2 across GPUs

LC-RWMD: First Phase

T2,j: embedding vectors of the query histogram j

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Dense matrix E

Size of embedding vectors: m

# Words = v

E: embedding vectors of the complete vocabulary

Multiply E and the transpose of T2,j and compute the row-wise minimums

Dense vector Z (v × 1)

Z = min(E ∘ T2,j)


The intermediate product E ∘ T2,j is a v × h2,j matrix.

Complexity of the first phase: O(vhm)
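A sketch of phase 1, reusing cost_matrix() from the earlier sketch (the function name is mine):

    def lc_rwmd_phase1(E, T2):
        # E: (v, m) embeddings of the whole vocabulary; T2: (h2, m) embeddings of the query words.
        # One dense matrix product of size v x h2 dominates the cost: O(vhm).
        return cost_matrix(E, T2).min(axis=1)   # Z: (v,) distance to the closest query word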

LC-RWMD: Second Phase

Z: distance to the closest word in query histogram for each word in vocabulary

Dense vector Z (v × 1)

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Sparse matrix-vector multiply to compute the distances: D[j] = X1 × Z

Complexity: O(nh)

Overall complexity: O(vhm+nh)


X1: weights of the database histograms in compressed sparse row (csr) format
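A sketch of phase 2, assuming X1 is held as a SciPy CSR matrix of normalized word weights (names are mine):

    from scipy.sparse import csr_matrix

    def lc_rwmd_phase2(X1, Z):
        # X1: csr_matrix (n1, v); Z: (v,) vector from phase 1.
        return X1 @ Z   # (n1,) one-sided RWMD lower bounds from the query to every database doc

On the GPU this corresponds to a sparse-matrix dense-vector multiplication, one of the CUSPARSE operations listed in the results section.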

Complexity Comparison (Time)

Complexity of linear RWMD: O(vhm + nh)

Complexity of quadratic RWMD: O(nh²m)

Improvement vs. quadratic RWMD: O(min(nh/v, hm))


h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

Comparing one query document with n database documents
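For reference, a short derivation of the stated improvement factor (my own working, in the slide's notation):

\[
\frac{nh^2 m}{vhm + nh} \;=\; \Theta\!\left(\min\!\left(\frac{nh^2 m}{vhm},\, \frac{nh^2 m}{nh}\right)\right) \;=\; \Theta\!\left(\min\!\left(\frac{nh}{v},\, hm\right)\right),
\]

since a + b is within a factor of two of max(a, b) for positive a and b.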

LC-RWMD – GPU Mapping

X1 is resident in the GPU memory. T2,j, 0 ≤ j ≤ n2 − 1: streamed in one document at a time.

Dense matrix T2,j

Size of embedding vectors: m

# Words = h2,j

Memory requirement on one GPU:

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Dense matrix E

Size of embedding vectors: m

The word-embedding matrix E is resident in GPU memory

# Words = v

O(vm + nh + vh)


Complexity Comparison (Space)

Space complexity of linear RWMD: O(vm + nh + vh)

Space complexity of quadratic RWMD: O(nhm)

Improvement w.r.t. quadratic RWMD: O(min(nh/v, nm/v, m))


n: # database documents, h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary

LC-RWMD: Dealing with Asymmetric Distances

Sparse matrix X1 (DATABASE, n1 histograms over v words) queried with sparse matrix X2 (QUERY, n2 histograms over v words) yields Distance Matrix D1 (n2 × n1).

Swapping the roles, sparse matrix X2 (DATABASE) queried with sparse matrix X1 (QUERY) yields Distance Matrix D2 (n1 × n2).

D = max(D1ᵀ, D2)

Transpose, maximum, and top-k are computed on the CPU.
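A sketch of this final combination step on the CPU, assuming D1 is the (n2 × n1) matrix obtained with X1 as the database and D2 the (n1 × n2) matrix with the roles swapped (names are mine):

    import numpy as np

    def symmetric_rwmd_topk(D1, D2, k):
        D = np.maximum(D1.T, D2)                        # (n1, n2) element-wise max of the two bounds
        idx = np.argsort(D, axis=0)[:k]                 # indices of the k closest database docs per query
        return idx, np.take_along_axis(D, idx, axis=0)  # their distances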


Outline

§ Introduction

§ Background

§ Our solution

§ Our results

Speed-up vs CPU-based RWMD

§ CPU: Intel® Core® i7-6900K @ 3.20 GHz, 8 cores (SMT2), 64 GB memory, Intel® MKL

§ GPU: NVIDIA® Tesla® P100, 16 GB memory, CUDA 9.0 with CUBLAS and CUSPARSE


Set 1: h=150 words per histogram

Set 2: h=30 words per histogram

Google’s Word2Vec (Google News)

v = 3M words, m = 300 floating pt. nums

All operations in single-prec. floating pt.

[Bar chart (log scale, speed-up factor from 1 to 10000): RWMD on GPU achieves roughly 18x (Set 1) and 19x (Set 2) over the CPU baseline; LC-RWMD on GPU achieves roughly 2795x (Set 1) and 1330x (Set 2).]

Runtime vs GPU-accelerated WMD


Time to compare one query doc with all database docs using 16 CPU processes + 16 GPUs

Set 1: n=1M docs, h=150 words per hist. Set 2: n=2.8M docs, h=30 words per hist.

[Bar chart (log scale, runtime in secs from 0.001 to 1000): WMD (16 GPUs) vs LC-RWMD (16 GPUs) on Set 1 and Set 2; LC-RWMD is roughly 30000x faster on Set 1 and roughly 2000x faster on Set 2.]

Comparison with WMD: Precision at Top-K for Set 2

[Line chart: precision at top-K for K = 4 to 128; curves for LC-RWMD on the small, medium, large, and very large classes and for WMD on the small and medium classes.]

§ small: 300-1000 examples per label

§ medium: 1k-10k examples per label

§ large: 10k-100k examples per label

§ very large: 100k-1M examples per label


Summary

A linear-complexity method for computing the Relaxed Word Mover's Distance

§ The original method proposed by Kusner et al. has quadratic complexity

§ ~30000-fold improvement in performance w.r.t. GPU-accelerated WMD

§ ~2800-fold improvement w.r.t. CPU-accelerated quadratic RWMD

Main insight: Big Data offers new ways of dealing with algorithmic complexity!

§ Reduce complexity by eliminating redundant and repetitive operations

§ Exploit the massive parallelism offered by GPUs and clusters of GPUs


Business and Academic Impact

§ Being used by our business developers: ingestion of business news

§ Sub-second execution latency for similarity queries (100k docs per day)

§ Database of 100M documents using 16 NVIDIA ® Tesla ® P100 GPUs

§ Larger databases or higher ingestion rates? Simply add more GPUs!

§ IEEE Big Data 2017 Conference, ERCIM News, GTC 2018 Conference


Future directions

§ Possible improvements:

§ CUDA streams to overlap CPU/GPU computation, half-precision support

§ Sinkhorn Distance to better approximate WMD (quadratic complexity)

§ Supervised training of word weights and word vectors (supervised WMD)

§ Limitations of bag-of-words: augment syntax trees with word vectors

§ Possible extensions:

§ Use FPGAs with hard floating-point cores and high-bandwidth memories

§ Similarity search in other domains: time series, images, genomics data


Questions?

Many-to-many LC-RWMD: First Phase

Dense matrix E

Size of embedding vectors: m

# Words = v

E[w]: vector representation of word w

Dense matrix Z (v × n2)

Z = min(E ∘ T2)


T2,j, 0 ≤ j ≤ n2 − 1: resident in GPU memory

[Diagram: dense matrices T2,0 (h2,0 × m), T2,1 (h2,1 × m), ..., T2,j (h2,j × m), ..., T2,n2-1 (h2,n2-1 × m) resident in GPU memory, one per query histogram (# Hists: n2).]

Many-to-many LC-RWMD: Second Phase

Z[w,j] stores the distance from each word w in the vocabulary to the closest word in query histogram j

Sparse matrix X1 (# Hists = n1)

# Words in vocabulary: v

Sparse-matrix dense-matrix multiply to compute D: D = X1 × Z


Dense matrix Z (v × n2)
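A sketch of the many-to-many variant, reusing lc_rwmd_phase1() from the earlier sketch: phase 1 builds one column of Z per query histogram, and phase 2 becomes a single sparse-matrix dense-matrix product (names are mine):

    def lc_rwmd_many_to_many(E, queries, X1):
        # queries: list of T2 arrays, one (h2_j, m) matrix per query histogram; X1: csr_matrix (n1, v)
        Z = np.stack([lc_rwmd_phase1(E, T2) for T2 in queries], axis=1)   # (v, n2)
        return X1 @ Z                                                     # (n1, n2) distance matrix D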

How far away is Word Centroid Distance from WMD?


[Chart: precision of WCD vs WMD as a function of the WCD selection rate (10% down to 0.001%), with curves for WMD selection rates of 10%, 1%, 0.1%, 0.01%, and 0.001%.]

What is the fraction of overlap between top-k results of WCD and WMD?
