Transcript of the GTC 2014 talk at on-demand.gputechconf.com/gtc/2014/presentations/s... (posted 25-Apr-2020)
INDEXING TEXT DOCUMENTS ON GPU
CAN YOU INDEX THE WEB IN REAL TIME?
Michael Frumkin (NVIDIA)
HIGH-LEVEL STEPS FOR INDEXING
Periodically collect all webpages
Index the pages by terms or phrases
— ASCII and UTF-8 encoding, remove HTML and XML tags
Generate index
Distribute index to the serving clusters
Serve search queries and feed Knowledge/Intelligence engines
WHAT IS THE INDEX OF THE WEB?
Sparse matrix on the order of 10^7 x 10^11
The columns represent documents, sorted by PageRank
The rows represent terms, sorted lexicographically
Each matrix element is a list of locations of the term in the document
DATA FLOW IN AN INDEXING CLUSTER
[Diagram: the Internet feeds Document Buckets; Document Batches flow through CPU and GPU indexers into Index Buckets, which are pushed to the Serving Cloud]
POTENTIAL INDEXING BOTTLENECKS
Data size characterization (based on 42 GB Wikipedia)
PCIe limit: docs => device -- do indexing => host (123 K docs/s)
Memory BW limit (1M docs/s):
— Data expansion (word location, line numbers)
— 10 comparisons per word (for average doc)
SM limit: 4-way divergence (800 K docs/s)
Overall upper bound 123 K docs/s (PCIe Gen3)
— achieved 23 K docs/s
1 T docs (20 PB of data) per day on a cluster with 1000 GPUs
SINGLE BOARD FOR INDEXING
[Diagram: a Document Batch is preprocessed, then indexed either on CPU (Tokenize => Map => Reduce => Index of a batch) or on GPU (Tokenize => Split => BucketSort => Reduce), producing an Index Bucket]
INDEXING ON CPU
Process a batch using 12 (p-)threads
Tokenizer
— Splits a line into terms
Map
map<string, set<int> > terms;
for (int i = 0; i < content.size(); ++i) {
  vector<string> fields;
  Tokenize(content[i], &fields);
  for (int j = 0; j < fields.size(); ++j)
    terms[fields[j]].insert(i);
}
Reduce
— Single scan over the map generates the index string
INDEXING ON GPU
Tokenizer
— Single thread per doc (parallel version not in this talk)
Splitter
— Single thread per doc
BucketSort
— 32 threads per doc, sensitive to load balancing
Reduce
— Single thread per doc
TOKENIZER
Single scan over the doc
Finds term boundaries
Computes histogram (224 buckets)
Uses static splitters of the docs
— Quintiles are the ideal splitters, but more expensive
Packing of the location info:
— (term_size & 0xFF) << 24 | (term_offset & 0x00FFFFFF)
Document content is immutable
[Diagram: the GPU tokenizer splits the document "banana tree grows fast in Hawaii" into the terms banana, tree, grows, fast, ...]
SPLITTER
Buckets filled by Tokenizer are very uneven, many are empty
Splitter spreads these buckets across 32 big buckets as evenly as possible
Does not deal with outliers (e.g. buckets of size > 100 K)
Output: 32 ordered buckets of unsorted terms
BUCKET SORT - INNERMOST LOOP
Key for performance
Merge(const TermComparator& comparator,
      const uint* loc1, const uint* loc2,
      uint* scratch, int terms_num) {
  int pos1 = 0, pos2 = 0, dst = 0;
  for (int i = 0; i < terms_num; ++i) {
    if (comparator.less(loc1[pos1], loc2[pos2])) {
      scratch[dst] = loc1[pos1++];
    } else {
      scratch[dst] = loc2[pos2++];
    }
    ++dst;
    // Code handling boundary checks
  }
}
Small divergence and the right data flow are key to high GPU performance
OVERLAP OF COMMUNICATIONS WITH COMPUTATIONS
PCIe is the resource that will be saturated first (after additional 5x speedup)
Send each batch to its own stream
Use cudaMemcpyAsync
This should fully hide the communication cost
Can't use cudaFree with multi-streaming: it serializes pending streams
Hence we can't use cudaMalloc, cudaMallocHost per batch
=> Custom memory management
MEMORY POOL
Must use cudaMalloc and cudaMallocHost only at the beginning of the program
MemPool class for Host and Device
— Calls cudaMalloc and cudaMallocHost once at the beginning
— Overloads all other calls to cudaMalloc and cudaMallocHost
Minimal change to the code
Double buffering
Trade-off
— num_batches_in_flight * MemPool_size < GGR_size
MULTI-STREAMING WIKIPEDIA BUCKET 0
MULTI-STREAMING WIKIPEDIA BUCKET 13
KEYS FOR HIGH PERFORMANCE
Coalesce CPU/GPU IO
Overlap I/O with computations
Minimize random access, use Shared Memory
[Charts for the multi-streaming slides: time in seconds per bucket (buckets 0-15) for 12-core Sandy Bridge, K20Xm, CPU without I/O overlap, K20c opt 1, K20c with I/O overlap, and K20c opt 2]
PERFORMANCE: CPU VS GPU
Literature collection: 4200 docs, 92 MB 3.1 x faster
Wikipedia: about 7 M docs, 42 GB 2.2 x faster
[Chart: time in seconds per bucket (buckets 0-16) for 12-core Sandy Bridge vs. K20Xm]
LUCENE: JAVA CUDA INTERFACE
Lucene/Solr are Apache indexing/search projects
LucidWorks develops commercial indexing and search engines based on Lucene/Solr
GPU indexer worked smoothly with Java
[Diagram: Indexer.java on the CPU calls Indexer.so (C++), which drives the C++/CUDA Indexer on the GPU; docs in, index out]
SUMMARY
CPU: STL-based map, set, fully parallel
GPU: BucketSort
Document sets:
— Literature collection: 4200 docs
— Wikipedia: 7 M docs split into 16 buckets
GPU 3.1x faster on Literature collection and 2.2x faster on Wikipedia
— K20Xm, 14 SMs, 732 MHz, vs i7 6 cores (2x hyper-threaded) @ 3.2 GHz
— 3.4 K docs/s (Literature) 23.1 K docs/s (Wikipedia)
Theoretical limiting resource: PCIe Gen3 - 123K docs/s
Indexing Wikipedia on 16 GPUs in 31 s (19 s average per bucket)
Questions?
[Closing diagram: the Internet feeds Document Buckets; CPU/GPU indexing produces Index Buckets, which feed "My Analytics Engine"]