1
An Implementation of the Language Model Based IR System on the GPU
Sudhanshu Khemka
2
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
3
Outline
• Background and Related Work
  • The GPU Architecture
  • The structure of an IR System
  • Ponte and Croft’s document scoring model
  • Clustering
  • Related Work
4
GPU Programming Model Allows the programmer to define a grid of thread blocks.
Each thread block executes independently of other blocks.
All threads in a thread block can also execute independently of each other; however, one can synchronize their execution using barrier synchronization methods, such as __syncthreads().
5
GPU Memory Hierarchy
6
Ding et al.’s GPU based architecture for IR
The GPU cannot access main memory directly.
Thus, there is a transfer cost associated with moving the data from the CPU’s main memory to the GPU’s global memory.
In some cases, this transfer cost outweighs the speedup obtained by using the GPU.
7
Structure of an IR system
Inverted Index and Smoothing
8
Inverted index (raw n-gram counts):

      Doc 1   Doc 2   Doc 3
NG1   5       6       7
NG2   5       0       0
NG3   0       0       7
NG4   0       6       0

After add-one smoothing:

      Doc 1   Doc 2   Doc 3
NG1   6/14    7/16    8/18
NG2   6/14    1/16    1/18
NG3   1/14    1/16    8/18
NG4   1/14    7/16    1/18
Each row of the inverted index is an inverted list; for example, the entry in row NG4, column Doc 2 records that NG4 occurs 6 times in Doc 2.
Smoothing assigns a small non-zero probability to n-grams that were not seen in the document.
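The add-one smoothing shown in the table above can be sketched on the CPU as follows. This is an illustrative reference implementation, not the deck's GPU code; the function name `add_one_smooth` and the `NG1`–`NG4` labels are ours.

```python
from fractions import Fraction

def add_one_smooth(counts):
    """Add-one (Laplace) smoothing of one document's n-gram counts.

    counts: dict mapping n-gram -> raw count in the document.
    Returns dict mapping n-gram -> (count + 1) / (total + vocabulary size),
    as exact fractions so they match the table entries.
    """
    vocab = len(counts)
    total = sum(counts.values())
    return {ng: Fraction(c + 1, total + vocab) for ng, c in counts.items()}

# Doc 1 from the table: NG1..NG4 with counts 5, 5, 0, 0
doc1 = {"NG1": 5, "NG2": 5, "NG3": 0, "NG4": 0}
probs = add_one_smooth(doc1)
# NG1 -> 6/14 and NG3 -> 1/14, matching the smoothed table above
```

Using exact fractions also makes it easy to check that the smoothed probabilities of a document sum to 1.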
9
The Language Model based approach to IR
Builds a probabilistic language model M_d for each document d
Ranks documents according to the probability of generating the query Q given their language model representation, P(Q|M_d)
Ponte and Croft’s model is an enhanced version of the above
P̂(t|M_d) = P̂_ml(t, M_d)^(1.0 − R_{t,d}) × P_avg(t)^(R_{t,d})   if tf(t,d) > 0
P̂(t|M_d) = cf_t / cs                                            otherwise
Ponte and Croft’s model
As we are estimating using a document-sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t, P_avg(t), to estimate P̂(t|M_d):

P_avg(t) = ( Σ_{d ∈ d_t} P̂_ml(t, M_d) ) / df_t

and they model the risk of using P_avg(t) with a geometric distribution:

R_{t,d} = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^tf(t,d)

where d_t is the set of documents containing t, df_t is the document frequency of t, and f̄_t is the mean frequency of t in those documents.
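The per-term estimate above can be written as a small CPU-side helper. This is a sketch of the formula only; the parameter names (`tf_td`, `p_avg_t`, etc.) are ours, and the precomputation of P_avg(t) and R_{t,d} is assumed to have happened elsewhere.

```python
def ponte_croft_term_prob(tf_td, dl_d, p_avg_t, r_td, cf_t, cs):
    """Ponte and Croft's smoothed estimate of P(t | M_d).

    tf_td:   frequency of term t in document d
    dl_d:    length of document d (tokens)
    p_avg_t: mean probability of t in documents containing t
    r_td:    risk factor R_{t,d} for trusting p_avg_t
    cf_t:    collection frequency of t; cs: collection size in tokens
    """
    if tf_td > 0:
        p_ml = tf_td / dl_d  # maximum likelihood estimate
        # geometric mixture of the ML estimate and the mean probability,
        # weighted by the risk R_{t,d}
        return (p_ml ** (1.0 - r_td)) * (p_avg_t ** r_td)
    return cf_t / cs  # back off to the collection model for unseen terms
```

With r_td = 0 the estimate reduces to the ML estimate, and with r_td = 1 it reduces to P_avg(t), which matches the role of the risk factor in the formula.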
11
Clustering
Enables a search engine to present information in a more effective manner by displaying similar documents together.
Particularly useful when the search term has different word senses
For example, consider the query “jaguar.”
jaguar can refer to a car, an animal, or the Apple operating system
If a user is searching for documents related to the animal jaguar, they will have to manually search through the top-k documents to find them.
Clustering alleviates this problem.
12
Related Work Ding et al. propose data parallel algorithms for compressing, decompressing,
and intersecting sorted inverted lists for a Vector Space model based information retrieval system.
Example of their list intersection algorithm to intersect two lists A and B:
Randomly pick a few elements from list A and, for each chosen element A_i, find the consecutive pair B_j, B_{j+1} in B such that B_j < A_i ≤ B_{j+1}.
This implicitly partitions both A and B into corresponding segments.
Intersect corresponding segments in parallel.
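The partitioning idea can be sketched sequentially as below. This is a CPU mimic of Ding et al.'s scheme, assuming sorted lists of distinct docIDs; the segment loop is what would run as independent thread blocks on the GPU, and the names (`partitioned_intersect`, `pivots`) are ours.

```python
import bisect

def partitioned_intersect(A, B, pivots):
    """Intersect two sorted lists of distinct docIDs by first splitting
    them into matching segments.  `pivots` are indices into A; the value
    just before each pivot is located in B by binary search, which gives
    the matching segment boundary.  Each segment pair is then
    intersected independently (sequentially here, in parallel on a GPU).
    """
    a_bounds = [0] + pivots + [len(A)]
    b_bounds = [0] + [bisect.bisect_right(B, A[p - 1]) for p in pivots] + [len(B)]
    result = []
    for k in range(len(a_bounds) - 1):            # one "thread block" per segment
        seg_a = A[a_bounds[k]:a_bounds[k + 1]]
        seg_b = set(B[b_bounds[k]:b_bounds[k + 1]])
        result.extend(x for x in seg_a if x in seg_b)
    return result
```

For example, intersecting [1, 3, 5, 7, 9, 11] with [2, 3, 7, 8, 11] using the pivot index 3 splits both lists at the value 5, and the two segment intersections together yield [3, 7, 11].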
13
Related Work contd Chang et al. implement hierarchical clustering on the GPU. However,
1) They apply clustering to DNA microarray experiments. We apply it to information retrieval.
2) They use the Pearson correlation coefficient as the distance metric to compute the distance between two elements. We use cosine similarity.
3) We present a more optimized version of their code.
14
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
15
Motivation and Goal
No published papers propose an implementation of the LM based IR system on the GPU
However, a probabilistic language model based approach to retrieval significantly outperforms standard tf.idf weighting (Ponte and Croft, 1998)
Goal: We hope to be the first to contribute algorithms to realize a Language model based IR system on the GPU
16
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
Good Turing Smoothing
Intuition: We estimate the probability of things that occur c times using the probability of things that occur c+1 times.
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
In the above, N_c is the number of N-grams that occur c times, and N is the total number of observed N-grams.
Bigram     Doc 1   Doc 2
a shoe     1       1
a cat      0       2
foo bar    2       0
a dog      1       2
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
2 phases:
1) Calculate the Nc values
2) Smooth counts
Calculating N_c values on the GPU
Doc1 counts:      1 0 2 1
After sorting:    0 1 1 2
Positions:        0 1 2 3
Stream compaction keeps the position of the first occurrence of each distinct value:
Distinct values:  0 1 2
Start positions:  0 1 3
Differencing adjacent start positions gives, for Doc1: N0 = 1, N1 = 2, N2 = 1
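The sort / compact / difference pipeline above can be mimicked sequentially as follows; the function name `nc_values` is ours, and on the GPU each step would be a data-parallel primitive (sort, stream compaction, adjacent difference) rather than a Python loop.

```python
def nc_values(counts):
    """Compute Good-Turing N_c values the way the slide's GPU pipeline
    does: sort the counts, keep the first position of each distinct
    value (stream compaction), then difference adjacent start positions.
    Returns a dict mapping c -> N_c.
    """
    s = sorted(counts)
    # stream compaction: positions where a new distinct value starts
    starts = [i for i in range(len(s)) if i == 0 or s[i] != s[i - 1]]
    values = [s[i] for i in starts]
    ends = starts[1:] + [len(s)]
    return {v: e - b for v, b, e in zip(values, starts, ends)}

# Doc1 from the slide: counts [1, 0, 2, 1]
# sorted -> [0, 1, 1, 2]; start positions -> [0, 1, 3]
# differences -> N0 = 1, N1 = 2, N2 = 1
```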
Smooth N-gram counts
Doc1 counts: 1 0 2 1, handled by Thread 0, Thread 1, Thread 2, Thread 3 respectively
Let one thread compute the smoothed count for each N-gram:
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
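The one-thread-per-n-gram smoothing step can be sketched as below. One caveat is ours, not the slide's: when N_{c+1} = 0 (no n-gram was seen c+1 times) the formula would zero out the count, so this sketch falls back to the raw count, a common practical workaround.

```python
def good_turing_counts(counts):
    """Good-Turing smoothed counts c* = (c+1) * N_{c+1} / N_c, one
    output per n-gram, mirroring the slide's one-thread-per-n-gram
    kernel.  Falls back to the raw count when N_{c+1} is 0 (assumption,
    not stated on the slide).
    """
    nc = {}
    for c in counts:                      # phase 1: N_c values
        nc[c] = nc.get(c, 0) + 1
    out = []
    for c in counts:                      # phase 2: smooth each count
        n_next = nc.get(c + 1, 0)
        out.append((c + 1) * n_next / nc[c] if n_next else float(c))
    return out

# Doc1 counts [1, 0, 2, 1] with N0=1, N1=2, N2=1:
# c=0 -> 1*2/1 = 2.0 ; c=1 -> 2*1/2 = 1.0 ; c=2 -> N3=0, kept at 2.0
```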
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1K to 2M.]
22
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
23
Kneser Ney smoothing The Good Turing algorithm assigns the same probability of occurrence to all
0 count n-grams
For example, if count(BURNISH THE) = count(BURNISH THOU) = 0, then using Good Turing
P(THE|BURNISH) = P(THOU|BURNISH)
However, intuitively, P(THE|BURNISH) > P(THOU| BURNISH), as THE is much more common than THOU
The Kneser Ney smoothing algorithm captures this intuition
Calculate P(w_i|w_{i-1}) based on the number of different contexts the word w_i has appeared in (assuming count(w_{i-1} w_i) = 0).
24
Kneser Ney smoothing
P_KN(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})   if count(w_{i-1} w_i) > 0
P_KN(w_i | w_{i-1}) = P_continuation(w_i)                    otherwise

where

P_continuation(w_i) = |{ w_{i-1} : count(w_{i-1} w_i) > 0 }| / |{ (w_{j-1}, w_j) : count(w_{j-1} w_j) > 0 }|

The GPU implementation computes this in four steps (Steps 1–4, detailed on the following slides).
25
GPU based implementation
Bigram dictionary:

Bigram          Count
an example      0
this is         4
this example    3

Step 1: Compute contextW[w_i] = |{ w_{i-1} : count(w_{i-1} w_i) > 0 }| for each w_i.
Launch a kernel such that each thread visits one bigram in the bigram dictionary and checks if count(w_{i-1} w_i) > 0. If yes, it increments contextW[w_i] by 1.

Result:
            an   example   is   this
contextW     0         1    1      0
26
Step 2: Compute the total number of bigram types with count > 0.
Apply a GPU based parallel reduction to the result of Step 1. (See the technical paper by Mark Harris for an efficient implementation of the parallel reduction operation on the GPU.) For our example, this total is 2.
Step 3: Compute count(w_{i-1}) for each w_{i-1}. As we have already completed Steps 1 and 2, this can easily be done by asking one thread to compute the total for each w_{i-1}.
Step 4: According to the value of count(w_{i-1} w_i), use the corresponding branch of the Kneser-Ney formula to obtain the smoothed probability array for the three bigrams:
.5   .5712   .4284
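The four steps can be mimicked sequentially as below. This sketch follows the slide's undiscounted variant (the published Kneser-Ney also subtracts a discount d from seen counts); the function name `kneser_ney` is ours, and the results 1/2, 4/7 ≈ .571 and 3/7 ≈ .428 match the slide's array up to rounding.

```python
def kneser_ney(bigram_counts):
    """Backoff sketch of the slide's Kneser-Ney variant.  Seen bigrams
    get the ML estimate count(w1 w2) / count(w1 .); unseen bigrams back
    off to the continuation probability: the number of distinct left
    contexts of w2 divided by the number of bigram types with count > 0.
    """
    context = {}      # step 1: distinct left contexts per word
    left_total = {}   # step 3: count(w1 .) for each w1
    types = 0         # step 2: number of bigram types with count > 0
    for (w1, w2), c in bigram_counts.items():
        left_total[w1] = left_total.get(w1, 0) + c
        if c > 0:
            context[w2] = context.get(w2, 0) + 1
            types += 1
    probs = {}
    for (w1, w2), c in bigram_counts.items():   # step 4: pick the branch
        if c > 0:
            probs[(w1, w2)] = c / left_total[w1]
        else:
            probs[(w1, w2)] = context.get(w2, 0) / types
    return probs

# Slide's dictionary: ("an","example"): 0, ("this","is"): 4, ("this","example"): 3
# -> P(example|an) = 1/2, P(is|this) = 4/7, P(example|this) = 3/7
```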
27
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1,000 to 2,000,000.]
28
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
29
Ponte and Croft’s document scoring model
The computation of one document’s score given the query is independent of the computation of any other document’s score given the query.
Embarrassingly parallel.
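The per-document score that each GPU thread would compute can be sketched as the standard Ponte-Croft query likelihood: the product of P̂(t|M_d) over query terms times the product of 1 − P̂(t|M_d) over the remaining vocabulary. The function below is our CPU-side sketch (in log space for numerical stability), assuming the smoothed per-term probabilities are already available.

```python
import math

def score_document(query_terms, vocab, term_prob):
    """Log query likelihood of one document under Ponte and Croft's
    model.  `term_prob` maps each vocabulary term to its smoothed
    P(t|M_d).  Each document's score depends only on its own model,
    which is what makes scoring embarrassingly parallel
    (one GPU thread per document).
    """
    q = set(query_terms)
    log_score = 0.0
    for t in vocab:
        p = term_prob[t]
        log_score += math.log(p if t in q else 1.0 - p)
    return log_score
```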
30
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1K to 100K.]
31
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
32
Single link hierarchical clusteringThe algorithm can be divided into two phases:
Phase 1: Compute pairwise similarity between documents, i.e., compute sim(d_i, d_j) for all i, j ∈ {1, …, N}.
Phase 2: Merging. During each iteration, merge the 2 most similar clusters. Let the new cluster be called X. Update similarity of X with all other active clusters. Find new most similar cluster for X.
Input matrix:

      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

We launch a 2 × 2 grid of thread blocks where each block’s dimension is also 2 × 2:

Block 0   Block 1
Block 2   Block 3

Threads of Block 1 and their document pairs:

Thread 0 (d0,d2)   Thread 1 (d0,d3)
Thread 2 (d1,d2)   Thread 3 (d1,d3)
Phase 1 : Computing pairwise distances
Block 0   Block 1
Block 2   Block 3

Threads of Block 1:

Thread 0 (d0,d2)   Thread 1 (d0,d3)
Thread 2 (d1,d2)   Thread 3 (d1,d3)
Each thread computes the similarity between a pair of documents
However, as the threads within a block share common documents, they can synchronize their execution. E.g., Both Thread 0 and Thread 1 in block 1 require document 0
The above is a very important observation as it allows us to exploit the shared memory of a block. We only need to load d0 into the block’s shared memory once. However, both thread 0 and thread 1 can use it.
Focus on block 1
Similarity computation for block 1
Process the input matrix in chunks. In order to process each chunk, each thread in block 1 loads 2 values into the shared memory.
      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

Shared memory of the block after the first chunk (columns NG1, NG2; the d0 row is loaded by thread 0):

.2  .3
.6  .1
.5  .1
.1  .1

Do a partial similarity computation. E.g., for doc0 and doc2, we can find the partial dot product (.2)(.5) + (.3)(.1). Store this result.
After processing the first chunk, move to the second chunk.
      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

Shared memory of the block after the second chunk (columns NG3, NG4):

.1  .4
.1  .2
.2  .2
.7  .1

Earlier we had computed the partial dot product (.2)(.5) + (.3)(.1) for doc0 and doc2.
Based on the next chunk, we can complete the dot product by adding (.1)(.2) + (.4)(.2).
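The chunked scheme above can be mimicked sequentially as below. This is a CPU sketch of the tiling idea only; the shared-memory loads become an outer loop over column chunks, and the names (`chunked_dot_products`, `row_pairs`) are ours.

```python
def chunked_dot_products(matrix, row_pairs, chunk):
    """Dot products of document-vector pairs computed chunk by chunk,
    mimicking the tiled shared-memory scheme: each pass covers `chunk`
    columns and adds a partial product, so a pair's result is only
    complete after the last chunk.
    """
    n_cols = len(matrix[0])
    partial = {pair: 0.0 for pair in row_pairs}
    for start in range(0, n_cols, chunk):        # one pass per chunk
        for (i, j) in row_pairs:                 # one "thread" per pair
            partial[(i, j)] += sum(
                matrix[i][k] * matrix[j][k]
                for k in range(start, min(start + chunk, n_cols)))
    return partial

# Input matrix from the slide (rows D0..D3 over NG1..NG4)
M = [[.2, .3, .1, .4],
     [.6, .1, .1, .2],
     [.5, .1, .2, .2],
     [.1, .1, .7, .1]]
# (d0, d2): chunk 1 gives (.2)(.5)+(.3)(.1); chunk 2 adds (.1)(.2)+(.4)(.2)
```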
GPU based pairwise distance computation
[Chart: log running time in ms for the CPU and GPU implementations as the number of documents grows from 256 to 4096.]
38
Phase 2: Merge clusters

for n ← 1 to N − 1
    i1 ← argmax_{i : I[i] = i} NBM[i].sim
    i2 ← I[NBM[i1].index]
    merge i1 and i2
    for i ← 1 to N
        do if I[i] = i and i ≠ i1 and i ≠ i2
            C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
        if I[i] = i2
            then I[i] ← i1
    NBM[i1] ← argmax_{X ∈ {C[i1][i] : I[i] = i and i ≠ i1}} X.sim
Implement parallel reduction on the GPU that directly returns NBM[i1]
Launch GPU kernel with blocks =
Implement the parallel reduction algorithm on the GPU that directly returns i1 and i2
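A simplified CPU sketch of the merge phase is below. It keeps an explicit set of active clusters instead of the I/NBM bookkeeping of the pseudocode, so it is our illustrative variant, not the deck's implementation; the argmax and the row update are the two places the slides replace with a parallel reduction and a one-thread-per-column kernel.

```python
def single_link(sim, n_clusters):
    """Naive single-link merge phase: repeatedly merge the two most
    similar active clusters and update similarities with the
    single-link rule max(sim(i1, k), sim(i2, k)).  `sim` is a symmetric
    matrix with zero diagonal; returns the list of merges performed
    until only `n_clusters` clusters remain.
    """
    n = len(sim)
    active = set(range(n))
    merges = []
    while len(active) > n_clusters:
        # most similar active pair (GPU: parallel reduction)
        i1, i2 = max(((i, j) for i in active for j in active if i < j),
                     key=lambda p: sim[p[0]][p[1]])
        merges.append((i1, i2))
        active.discard(i2)               # i2 is absorbed into i1
        for k in active:                 # GPU: one thread per column
            if k != i1:
                sim[i1][k] = sim[k][i1] = max(sim[i1][k], sim[i2][k])
    return merges
```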
GPU based merging
[Chart: running time in ms for the CPU and GPU implementations as the number of documents grows from 256 to 4096.]
40
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
41
Discussion
From our experiments we observed that GPU based algorithms are primarily useful when dealing with large datasets.
The GPU is suitable for solving problems that can be divided into non-overlapping subproblems.
If one is running several iterations of the same GPU code, one should minimize the data transfer between the CPU and the GPU within those iterations.
42
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
43
Conclusion We have contributed the following novel algorithms for GPU based IR:
1) A GPU based implementation of the Good Turing smoothing algorithm
2) A GPU based implementation of the Kneser Ney smoothing algorithm
3) An efficient implementation of Ponte and Croft’s document scoring model on the GPU
4) A GPU friendly version of the single link hierarchical clustering algorithm
We have experimentally shown that our GPU based implementations are significantly faster than similar CPU based implementations
Future work:
1) Implement pseudo relevance feedback on the GPU
2) Investigate methods to implement an image retrieval system on the GPU
44
References
[1] Cederman, D. and Tsigas, P. (2008). A Practical Quicksort Algorithm for Graphics Processors. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA '08), Springer-Verlag, Berlin, Heidelberg, 246-258.
[2] CUDPP. http://code.google.com/p/cudpp/
[3] Ding, S., He, J., and Suel, T. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web (WWW '09), ACM, New York, NY, USA, 421-430.
[4] Fagin, R., Kumar, R., and Sivakumar, D. Comparing top k lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 28-36.
[5] Harris, M. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
[6] Hoare, C.A.R. (1962). Quicksort. Computer Journal, Vol. 5, 1, 10-15.
45
[7] Indri. http://lemurproject.org/indri/
[8] Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-20.
[9] Jurafsky, D. and Martin, J. Speech and Language Processing.
[10] NVIDIA CUDA C Programming Guide.
[11] Ponte, J.M. and Croft, W.B. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), ACM, New York, NY, USA, 275-281.
[12] Salton, G., Wong, A., and Yang, C.S. A vector space model for automatic indexing. Commun. ACM 18, 11 (November 1975), 613-620.
[13] Sanders, J. and Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming.
[14] Spink, Amanda. U.S. versus European Web Searching Trends.
[15] Thrust. http://code.google.com/p/thrust/
46
Thank you!!!
47
Ponte and Croft’s model
For non-occurring terms, estimate P̂(t|M_d) = cf_t / cs, where cf_t is the raw count of term t in the collection and cs is the total number of tokens in the collection.