Transcript of the GTC 2014 talk at on-demand.gputechconf.com/gtc/2014/presentations/s... (posted 25-Apr-2020)
INDEXING TEXT DOCUMENTS ON GPU
CAN YOU INDEX THE WEB IN REAL TIME?
Michael Frumkin (NVIDIA)
HIGH-LEVEL STEPS FOR INDEXING
Periodically collect all webpages
Index the pages by terms or phrases
— ASCII and UTF-8 encoding, remove HTML and XML tags
Generate index
Distribute index to the serving clusters
Serve search queries and feed Knowledge/Intelligence engines
WHAT IS THE INDEX OF THE WEB?
Sparse matrix on the order of 10^7 x 10^11
The columns represent documents, sorted by PageRank
The rows represent terms, sorted lexicographically
Each matrix element is a list of locations of the term in the document
DATA FLOW IN AN INDEXING CLUSTER
[Diagram: the Internet feeds Document Buckets; Document Batches flow through CPU and GPU indexers into Index Buckets, which are pushed to the Serving Cloud]
POTENTIAL INDEXING BOTTLENECKS
Data size characterization (based on 42 GB Wikipedia)
PCIe limit: docs => device -- do indexing => host (123 K docs/s)
Memory BW limit (1M docs/s):
— Data expansion (word location, line numbers)
— 10 comparisons per word (for average doc)
SM limit: 4-way divergence (800 K docs/s)
Overall upper bound 123 K docs/s (PCIe Gen3)
— achieved 23 K docs/s
1 T docs (20 PB of data) per day on a cluster with 1000 GPUs
SINGLE BOARD FOR INDEXING
[Diagram: a Document Batch is preprocessed, then indexed either on CPU (Tokenize => Map => Reduce => Index of a batch) or on GPU (Tokenize => Split => BucketSort => Reduce), producing an Index Bucket]
INDEXING ON CPU
Process a batch using 12 (p-)threads
Tokenizer
— Splits a line into terms
Map
map<string, set<int> > terms;
for (int i = 0; i < content.size(); ++i) {
  vector<string> fields;
  Tokenize(content[i], &fields);
  for (int j = 0; j < fields.size(); ++j)
    terms[fields[j]].insert(i);
}
Reduce
— Single scan over the map generates the index string
INDEXING ON GPU
Tokenizer
— Single thread per doc (parallel version not in this talk)
Splitter
— Single thread per doc
BucketSort
— 32 threads per doc, sensitive to load balancing
Reduce
— Single thread per doc
TOKENIZER
Single scan over the doc
Finds term boundaries
Computes histogram (224 buckets)
Uses static splitters of the docs
— Quintiles are the ideal splitters, but more expensive
Packing of the location info:
— (term_size & 0xFF) << 24 | (term_offset & 0x00FFFFFF)
Document content is immutable
[Diagram: the GPU tokenizer splits the document "banana tree grows fast in Hawaii" into the terms banana, tree, grows, fast, ...]
SPLITTER
Buckets filled by Tokenizer are very uneven, many are empty
Splitter spreads these buckets across 32 big buckets as evenly as possible
Does not deal with outliers (e.g. buckets of size > 100 K)
Output: 32 ordered buckets of unsorted terms
BUCKET SORT - INNERMOST LOOP
Key for performance
Merge(const TermComparator& comparator,
      const uint* loc1, const uint* loc2,
      uint* scratch, int terms_num) {
  int pos1 = 0, pos2 = 0, dst = 0;
  for (int i = 0; i < terms_num; ++i) {
    if (comparator.less(loc1[pos1], loc2[pos2])) {
      scratch[dst] = loc1[pos1++];
    } else {
      scratch[dst] = loc2[pos2++];
    }
    ++dst;
    // Code handling boundary checks
  }
}
Small divergence and the right data flow are key to high GPU performance
OVERLAP OF COMMUNICATIONS WITH COMPUTATIONS
PCIe is the resource that will be saturated first (after additional 5x speedup)
Send each batch to its own stream
Use cudaMemcpyAsync
This should fully hide the communication cost
Can't use cudaFree with multi-streaming: it serializes pending streams
Hence we can't use cudaMalloc, cudaMallocHost per batch
=> Custom memory management
MEMORY POOL
Must use cudaMalloc and cudaMallocHost only at the beginning of the program
MemPool class for Host and Device
— Calls cudaMalloc and cudaMallocHost once at the beginning
— Overloads all other calls to cudaMalloc and cudaMallocHost
Minimal change to the code
Double buffering
Trade-off
— num_batches_in_flight * MemPool_size < GGR_size
MULTI-STREAMING WIKIPEDIA BUCKET 0
MULTI-STREAMING WIKIPEDIA BUCKET 13
KEYS FOR HIGH PERFORMANCE
Coalesce CPU/GPU IO
Overlap I/O with computations
Minimize random access, use Shared Memory
[Charts for the multi-streaming slides: time in seconds per bucket (buckets 0-15) for 12-core Sandy Bridge, K20Xm, CPU without I/O overlap, K20c opt 1, K20c with I/O overlap, and K20c opt 2]
PERFORMANCE: CPU VS GPU
Literature collection: 4200 docs, 92 MB 3.1 x faster
Wikipedia: about 7 M docs, 42 GB 2.2 x faster
[Chart: time in seconds per bucket (buckets 0-16) for 12-core Sandy Bridge vs. K20Xm]
LUCENE: JAVA CUDA INTERFACE
Lucene/Solr are Apache indexing/search projects
LucidWorks develops commercial indexing and search engines based on Lucene/Solr
GPU indexer worked smoothly with Java
[Diagram: Indexer.java on the CPU calls Indexer.so (C++), which drives the C++/CUDA Indexer on the GPU; docs in, index out]
SUMMARY
CPU: STL-based map, set, fully parallel
GPU: BucketSort
Document sets:
— Literature collection: 4200 docs
— Wikipedia: 7 M docs split into 16 buckets
GPU 3.1x faster on Literature collection and 2.2x faster on Wikipedia
— K20Xm, 14 SMs, 732 MHz, vs i7 6 cores (2x hyper-threaded) @ 3.2 GHz
— 3.4 K docs/s (Literature) 23.1 K docs/s (Wikipedia)
Theoretical limiting resource: PCIe Gen3 - 123K docs/s
Indexing Wikipedia on 16 GPUs in 31 s (19 s average per bucket)
Questions?
[Closing diagram: the Internet feeds Document Buckets; CPU/GPU indexing produces Index Buckets, which feed "My Analytics Engine"]