TRANSCRIPT

GPU-Accelerated Semantic Similarity Search at Scale
Kubilay Atasu, IBM Research - Zurich
in collaboration with Thomas Parnell, Celestine Duenner, Manolis Sifalakis, Haris Pozidis, Vasileios Vasileiadis, Michail Vlachos, Cesar Berrospi, Abdel Labbi
Outline
§ Introduction
§ Background
§ Our solution
§ Our results
Why scalable similarity (i.e., nearest-neighbor) search?
Example: financial news analysis
§ 100k news entries every day
§ 100M entries in three years
§ searching, browsing, clustering
Need for a similarity/distance metric:
§ must be accurate and scalable
Image source: http://social-dynamics.org/
Word Mover’s Distance (WMD) for Semantic Similarity
M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.
Principle of WMD
[Figure: the sentences "The Queen to tour Canada" and "Royal visit to Halifax" mapped into the word-embedding space; WMD measures how far the words of one sentence must travel to reach the words of the other.]
High Cost of Word Mover’s Distance (WMD)

                Quality     Time complexity   GPU friendly
WMD             Very high   Cubic             No
Relaxed WMD     High        Quadratic         Yes
Our solution    High        Linear            Yes

Word Mover’s Distance: very high quality, but very high complexity!
Dense and Sparse Linear Algebra on Graphics Processing Units (GPUs)
Sub-second query performance on very large data sets!
WMD: Earth Mover’s Distance using Word Embeddings
[Figure: two bag-of-words histograms (Histogram 1: Canada, Queen, tour; Histogram 2: Royal, visit, Halifax) linked through a cost matrix of pairwise word distances.]
Solves a minimum-cost flow problem!
Cubic time complexity in the size of the histograms!
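The minimum-cost flow formulation can be sketched with a general-purpose LP solver. This is only an illustrative toy (the `wmd` helper is mine, not the paper's solver; specialized network-flow solvers, which the cited cubic bound refers to, are far faster):

```python
import numpy as np
from scipy.optimize import linprog

def wmd(f1, f2, C):
    """Word Mover's Distance between two normalized histograms:
    minimum cost of transporting the word mass of f1 (length h1)
    into f2 (length h2) under the word-distance matrix C (h1 x h2)."""
    h1, h2 = C.shape
    # Flow variables T[u, v], flattened row-major.
    A_eq = np.zeros((h1 + h2, h1 * h2))
    for u in range(h1):
        A_eq[u, u * h2:(u + 1) * h2] = 1.0   # outflow: sum_v T[u,v] = f1[u]
    for v in range(h2):
        A_eq[h1 + v, v::h2] = 1.0            # inflow:  sum_u T[u,v] = f2[v]
    res = linprog(C.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([f1, f2]), bounds=(0, None))
    return res.fun

# Toy cost matrix over two words; moving all mass across costs 1.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
print(wmd(np.array([1.0, 0.0]), np.array([0.0, 1.0]), C))  # 1.0
```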
Relaxed WMD (RWMD): A Lower Bound of WMD
Histogram 1
His
togr
am 2
⊙
⊙
Cost Matrix
# Words = h1
# Words = h2
M. Kusner et al.: From Word Embeddings to Document Distances. ICML 2015.
Quadratic time and space!
It is not possible to do better when comparing only two histograms!
Problem Formulation: Input and Output Data Structures
§ Sparse matrix X1 (DATABASE): n1 histograms over a vocabulary of v words; X1[i,w] is the frequency of word w in histogram i
§ Dense matrix E: one embedding vector of size m per vocabulary word (v rows); E[w] is the embedding vector for word w
§ Sparse matrix X2 (QUERY): n2 histograms over the same vocabulary of v words
§ Dense matrix R: for each of the n2 query histograms, the k most similar histograms in X1
Given two sets of histograms (X1 and X2) and the embedding vectors E: for each histogram in X2, compute the K most similar histograms in X1.
Computing the Cost Matrix: Dense Matrix-Matrix Multiplication
Compute pairwise distances between all words in histogram i and all words in histogram j:
§ Dense matrix T1,i (h1,i x m): embedding vectors of all words in histogram i
§ Dense matrix T2,j (h2,j x m): embedding vectors of all words in histogram j
C(i,j) = T1,i ∘ T2,j^T
Complexity: O(h²m)
Excellent candidate for GPU acceleration!
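The pairwise-distance product reduces to one dense matrix multiply via the expansion ‖a−b‖² = ‖a‖² + ‖b‖² − 2a·b, which is what makes it GPU-friendly. A minimal NumPy sketch (assuming Euclidean word distances, as in Kusner et al.; on a GPU the matmul would run on cuBLAS):

```python
import numpy as np

def cost_matrix(T1, T2):
    """Pairwise Euclidean distances between the rows of T1 (h1 x m)
    and the rows of T2 (h2 x m), via one dense matrix multiply."""
    sq1 = (T1 ** 2).sum(axis=1)[:, None]   # h1 x 1
    sq2 = (T2 ** 2).sum(axis=1)[None, :]   # 1 x h2
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b ; clip guards round-off
    return np.sqrt(np.clip(sq1 + sq2 - 2.0 * T1 @ T2.T, 0.0, None))

rng = np.random.default_rng(0)
T1, T2 = rng.standard_normal((4, 3)), rng.standard_normal((5, 3))
C = cost_matrix(T1, T2)
print(C.shape)  # (4, 5)
```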
Word Mover’s Distance (WMD)
Given C(i,j), F1,i, and F2,j, compute D[j,i]:
§ Dense vector F1,i (length h1,i): frequencies of the words in histogram i
§ Dense matrix D (n2 x n1): D[j,i] is the distance between histogram j and histogram i
C(i,j) = T1,i ∘ T2,j^T
D[j,i] = WMD(F1,i, F2,j, C(i,j))
Complexity of computing D[j,i]: O(h²m + h³ log h)
Complexity of computing D[j] (one query against all n database docs): O(nh²m + nh³ log h)
Relaxed Word Mover’s Distance – Quadratic Implementation
Given C(i,j), F1,i, and F2,j, compute D[j,i]:
§ Dense vector F1,i (length h1,i): frequencies of the words in histogram i
§ Dense matrix D (n2 x n1): D[j,i] is the distance between histogram j and histogram i
C(i,j) = T1,i ∘ T2,j^T
D[j,i] = F1,i^T · min(C(i,j))   (row-wise minimum: for each word of histogram i, the distance to its closest word in histogram j)
Complexity of computing D[j,i]: O(h²m)
Complexity of computing D[j]: O(nh²m)
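The relaxed bound drops the flow constraints on one side: every word ships all of its mass to its closest counterpart. A minimal sketch (the function name `rwmd_one_sided` is mine, for illustration):

```python
import numpy as np

def rwmd_one_sided(f1, T1, T2):
    """One-sided RWMD lower bound: each word of histogram 1 ships all
    of its mass to the closest word of histogram 2.  O(h1 * h2 * m)."""
    # pairwise Euclidean distances, h1 x h2
    C = np.linalg.norm(T1[:, None, :] - T2[None, :, :], axis=2)
    return float(f1 @ C.min(axis=1))

# Identical histograms: every word's closest neighbour is itself.
T = np.array([[1.0, 0.0], [0.0, 1.0]])
f = np.array([0.5, 0.5])
print(rwmd_one_sided(f, T, T))  # 0.0
```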
Quadratic Implementation – GPU Mapping
§ T1,i, 0 ≤ i ≤ n1−1: resident in GPU memory (one dense matrix of size h1,i x m per database histogram)
§ T2,j, 0 ≤ j ≤ n2−1: streamed in one document at a time (dense matrix of size h2,j x m)
Compute one row of D in parallel on a single GPU:
D[j] = F1^T · min(T1 ∘ T2,j^T)
Memory requirement on one GPU: O(nhm)
How far away is RWMD from WMD?
What is the fraction of overlap between the top-k results of RWMD and WMD?
[Figure: "RWMD vs WMD – Precision". Precision of RWMD as a function of the RWMD selection fraction (10% down to 0.001%), with one curve per WMD selection fraction: 0.001%, 0.01%, 0.1%, 1%, 10%.]
Relaxed WMD (RWMD): Redundancy
[Figure: Histogram 1 is combined with Histogram 2 and again with Histogram 3 through two separate cost matrices.]
Common words are the problem: distances for a word that appears in many documents are recomputed for every comparison.
Redundancy can be eliminated!
Overview of Linear-Complexity RWMD (LC-RWMD)
Distribute the query documents (X2) across a cluster of GPUs; the database documents (X1) are replicated on each GPU.
Phase 1: for each word in the vocabulary, compute the distance to the closest word in one query document. Store the results in a dense vector Z.
Phase 2: sparse-matrix dense-vector multiplication between X1 and Z to compute the distances between the query and the database documents.
LC-RWMD: First Phase
§ Dense matrix E (v x m): embedding vectors of the complete vocabulary
§ Dense matrix T2,j (h2,j x m): embedding vectors of the query histogram j
Multiply E by the transpose of T2,j and compute row-wise minimums:
Z = min(E ∘ T2,j^T)   (Z is a dense vector of length v)
Complexity of the first phase: O(vhm)
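A minimal NumPy sketch of the first phase (the name `lc_rwmd_phase1` is mine; on a GPU the E ∘ T product would again be a cuBLAS matmul):

```python
import numpy as np

def lc_rwmd_phase1(E, T2j):
    """Phase 1: for every word in the vocabulary (E, v x m), the
    distance to its closest word in the query histogram (T2j, h x m).
    One matmul plus a row-wise minimum: O(v*h*m)."""
    sqE = (E ** 2).sum(axis=1)[:, None]
    sqT = (T2j ** 2).sum(axis=1)[None, :]
    D = np.sqrt(np.clip(sqE + sqT - 2.0 * E @ T2j.T, 0.0, None))  # v x h
    return D.min(axis=1)  # dense vector Z, length v

# Tiny vocabulary of 3 words; the query contains the first two.
E = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
T2j = E[:2]
Z = lc_rwmd_phase1(E, T2j)
print(Z.shape)  # (3,)
```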
LC-RWMD: Second Phase
§ Z (length v): distance to the closest word in the query histogram for each word in the vocabulary
§ Sparse matrix X1 (n1 x v): weights of the database histograms in compressed sparse row (CSR) format
Sparse matrix-vector multiply to compute the distances:
D[j] = X1 × Z
Complexity: O(nh)
Overall complexity: O(vhm + nh)
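A minimal sketch of the second phase with SciPy's CSR format (toy 3-word vocabulary and 2 database docs; on a GPU this SpMV would run on cuSPARSE):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Phase 2: distances from one query to all database docs in a single
# sparse matrix-vector product; O(nh) for nh nonzeros in X1.
X1 = csr_matrix(np.array([[0.5, 0.5, 0.0],
                          [0.0, 0.0, 1.0]]))  # 2 docs over a 3-word vocab
Z = np.array([0.0, 0.2, 0.4])                 # phase-1 output
D = X1 @ Z                                    # D ~ [0.1, 0.4]
print(D)
```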
Complexity Comparison (Time)
Comparing one query document with n database documents
(h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary)
Complexity of linear RWMD: O(vhm + nh)
Complexity of quadratic RWMD: O(nh²m)
Improvement vs. quadratic RWMD: O(min(nh²/v, hm))
LC-RWMD – GPU Mapping
§ Sparse matrix X1 (n1 x v): resident in GPU memory
§ Dense matrix E (v x m): word embeddings, resident in GPU memory
§ Dense matrix T2,j (h2,j x m), 0 ≤ j ≤ n2−1: streamed in one document at a time
Memory requirement on one GPU: O(vm + nh + vh)
Complexity Comparison (Space)
(n: # database documents, h: avg. size of histograms, m: size of word vectors, v: size of the vocabulary)
Space complexity of linear RWMD: O(vm + nh + vh)
Space complexity of quadratic RWMD: O(nhm)
Improvement w.r.t. quadratic RWMD: O(min(nh/v, nm/v, m))
LC-RWMD: Dealing with Asymmetric Distances
RWMD is asymmetric, so LC-RWMD is run in both directions:
§ X2 (n2 x v) as queries against database X1 (n1 x v), producing distance matrix D1 (n2 x n1)
§ X1 as queries against database X2, producing distance matrix D2 (n1 x n2)
D = max(D1, D2^T)
Transpose, maximum, and top-k on the CPU
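A minimal sketch of the combination step (the helper name `combine` is mine; elementwise max of the two one-sided bounds gives a tighter lower bound, then top-k selects the closest database docs per query):

```python
import numpy as np

def combine(D1, D2, k):
    """D1: n2 x n1 (X2 as queries), D2: n1 x n2 (roles swapped).
    Elementwise max of D1 and D2^T tightens the lower bound; then
    return the indices of the k closest database docs per query."""
    D = np.maximum(D1, D2.T)            # n2 x n1
    idx = np.argsort(D, axis=1)[:, :k]  # top-k smallest distances
    return D, idx

D1 = np.array([[0.1, 0.5]])        # 1 query, 2 database docs
D2 = np.array([[0.2], [0.3]])      # roles swapped
D, idx = combine(D1, D2, 1)
print(D, idx)  # [[0.2 0.5]] [[0]]
```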
Speed-up vs. CPU-based RWMD
§ CPU: Intel ® Core ® i7-6900K @ 3.20 GHz, 8 cores (SMT2), 64 GB memory, Intel ® MKL
§ GPU: NVIDIA ® Tesla ® P100, 16 GB memory, CUDA 9.0 with CUBLAS and CUSPARSE
§ Set 1: h=150 words per histogram; Set 2: h=30 words per histogram
§ Google’s Word2Vec (Google News): v = 3M words, m = 300 floating-point numbers
§ All operations in single-precision floating point
[Figure: speed-up over the CPU baseline (log scale, 1 to 10000). RWMD on GPU: ~18x (Set 1) and ~19x (Set 2); LC-RWMD on GPU: ~2795x (Set 1) and ~1330x (Set 2).]
Runtime vs. GPU-accelerated WMD
Time to compare one query doc with all database docs using 16 CPU processes + 16 GPUs
Set 1: n=1M docs, h=150 words per hist. Set 2: n=2.8M docs, h=30 words per hist.
[Figure: runtime in seconds (log scale, 0.001–1000 s) for WMD (16 GPUs) vs. LC-RWMD (16 GPUs); LC-RWMD is roughly 30000x faster on Set 1 and 2000x faster on Set 2.]
Comparison with WMD: Precision at Top-K for Set 2
§ small: 300-1000 examples per label
§ medium: 1k-10k examples per label
§ large: 10k-100k examples per label
§ very large: 100k-1M examples per label
[Figure: precision (y-axis, 0.15–0.95) vs. K (x-axis, 4 to 128), with curves for LC-RWMD on all four label classes and WMD on the small and medium classes.]
Summary
A linear-complexity method for computing the Relaxed Word Mover’s Distance:
§ The original method proposed by Kusner et al. has quadratic complexity
§ ~30000-fold improvement in performance w.r.t. GPU-accelerated WMD
§ ~2800-fold improvement w.r.t. CPU-accelerated quadratic RWMD
Main insight: Big Data offers new ways of dealing with algorithmic complexity!
§ Reduce complexity by eliminating redundant and repetitive operations
§ Exploit the massive parallelism offered by GPUs and clusters of GPUs
Business and Academic Impact
§ Being used by our business developers: ingestion of business news
§ Sub-second execution latency for similarity queries (100k docs per day)
§ Database of 100M documents using 16 NVIDIA ® Tesla ® P100 GPUs
§ Larger databases or higher ingestion rates? Simply add more GPUs!
§ IEEE Big Data 2017 Conference, ERCIM News, GTC 2018 Conference
Future directions
§ Possible improvements:
§ CUDA streams to overlap CPU/GPU computation, half-precision support
§ Sinkhorn Distance to better approximate WMD (quadratic complexity)
§ Supervised training of word weights and word vectors (supervised WMD)
§ Limitations of bag-of-words: augment syntax trees with word vectors
§ Possible extensions:
§ Use FPGAs with hard floating-point cores and high-bandwidth memories
§ Similarity search in other domains: time series, images, genomics data
Questions?
Many-to-many LC-RWMD: First Phase
§ Dense matrix E (v x m): E[w] is the vector representation of word w
§ T2,j, 0 ≤ j ≤ n2−1: resident in GPU memory (one dense matrix of size h2,j x m per query histogram)
Z = min(E ∘ T2^T)   (Z is a dense v x n2 matrix: one column of minimums per query histogram)
Many-to-many LC-RWMD: Second Phase
§ Z (v x n2): Z[w,j] stores the distance to the closest word in histogram j for each word w in the vocabulary
§ Sparse matrix X1 (n1 x v)
Sparse-matrix dense-matrix multiply to compute D:
D = X1 × Z
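A minimal sketch of the many-to-many second phase: stacking one phase-1 column per query turns the SpMV into a single sparse-dense matrix product (toy 3-word vocabulary, 2 database docs, 2 queries; cuSPARSE would supply the SpMM on a GPU):

```python
import numpy as np
from scipy.sparse import csr_matrix

X1 = csr_matrix(np.array([[0.5, 0.5, 0.0],
                          [0.0, 0.0, 1.0]]))  # n1=2 docs, v=3 words
Z = np.array([[0.0, 0.1],                     # v x n2: one column of
              [0.2, 0.0],                     # phase-1 minimums per query
              [0.4, 0.3]])
D = X1 @ Z                                    # n1 x n2 distance matrix
print(D)
```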
How far away is Word Centroid Distance from WMD?
What is the fraction of overlap between the top-k results of WCD and WMD?
[Figure: "WCD vs WMD – Precision". Precision of WCD as a function of the WCD selection fraction (10% down to 0.001%), with one curve per WMD selection fraction: 0.001%, 0.01%, 0.1%, 1%, 10%.]