1
An Implementation of the Language Model Based IR System on the GPU
Sudhanshu Khemka
2
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
3
Outline
• Background and Related Work
  • The GPU Architecture
  • The structure of an IR System
  • Ponte and Croft’s document scoring model
  • Clustering
  • Related Work
4
GPU Programming Model Allows the programmer to define a grid of thread blocks.
Each thread block executes independently of other blocks.
All threads in a thread block can also execute independently of each other; however, one can synchronize their execution using barrier synchronization methods, such as __syncthreads().
5
GPU Memory Hierarchy
6
Ding et al.’s GPU based architecture for IR
The GPU cannot access main memory directly.
Thus, there is a transfer cost associated with moving the data from the CPU’s main memory to the GPU’s global memory.
In some cases, this transfer cost outweighs the speedup obtained by using the GPU.
7
Structure of an IR system
Inverted Index and Smoothing
8
Inverted index (raw n-gram counts):

      Doc 1   Doc 2   Doc 3
NG1   5       6       7
NG2   5       0       0
NG3   0       0       7
NG4   0       6       0

After add-one smoothing:

      Doc 1   Doc 2   Doc 3
NG1   6/14    7/16    8/18
NG2   6/14    1/16    1/18
NG3   1/14    1/16    8/18
NG4   1/14    7/16    1/18
Each row of the inverted index is an inverted list; for example, the entry in row NG4, column Doc 2 records that NG4 occurs 6 times in Doc 2.
Smoothing assigns a small non-zero probability to n-grams that were not seen in the document.
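The add-one smoothing shown in the table above can be sketched on the CPU as follows. This is an illustrative reference implementation, not the deck's GPU code; the function name `add_one_smooth` and the `NG1`–`NG4` labels are ours.

```python
from fractions import Fraction

def add_one_smooth(counts):
    """Add-one (Laplace) smoothing of one document's n-gram counts.

    counts: dict mapping n-gram -> raw count in the document.
    Returns dict mapping n-gram -> (count + 1) / (total + vocabulary size),
    as exact fractions so they match the table entries.
    """
    vocab = len(counts)
    total = sum(counts.values())
    return {ng: Fraction(c + 1, total + vocab) for ng, c in counts.items()}

# Doc 1 from the table: NG1..NG4 with counts 5, 5, 0, 0
doc1 = {"NG1": 5, "NG2": 5, "NG3": 0, "NG4": 0}
probs = add_one_smooth(doc1)
# NG1 -> 6/14 and NG3 -> 1/14, matching the smoothed table above
```

Using exact fractions also makes it easy to check that the smoothed probabilities of a document sum to 1.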
9
The Language Model based approach to IR
Builds a probabilistic language model M_d for each document d
Ranks documents according to the probability of generating the query Q given their language model representation, P(Q|M_d)
Ponte and Croft’s model is an enhanced version of the above
P̂(t|M_d) = P̂_ml(t, M_d)^(1.0 − R_{t,d}) × P_avg(t)^(R_{t,d})   if tf(t,d) > 0
P̂(t|M_d) = cf_t / cs                                            otherwise
Ponte and Croft’s model
As we are estimating using a document-sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t, P_avg(t), to estimate P̂(t|M_d):

P_avg(t) = ( Σ_{d ∈ d_t} P̂_ml(t, M_d) ) / df_t

and they model the risk of using P_avg(t) with a geometric distribution:

R_{t,d} = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^tf(t,d)

where d_t is the set of documents containing t, df_t is the document frequency of t, and f̄_t is the mean frequency of t in those documents.
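The per-term estimate above can be written as a small CPU-side helper. This is a sketch of the formula only; the parameter names (`tf_td`, `p_avg_t`, etc.) are ours, and the precomputation of P_avg(t) and R_{t,d} is assumed to have happened elsewhere.

```python
def ponte_croft_term_prob(tf_td, dl_d, p_avg_t, r_td, cf_t, cs):
    """Ponte and Croft's smoothed estimate of P(t | M_d).

    tf_td:   frequency of term t in document d
    dl_d:    length of document d (tokens)
    p_avg_t: mean probability of t in documents containing t
    r_td:    risk factor R_{t,d} for trusting p_avg_t
    cf_t:    collection frequency of t; cs: collection size in tokens
    """
    if tf_td > 0:
        p_ml = tf_td / dl_d  # maximum likelihood estimate
        # geometric mixture of the ML estimate and the mean probability,
        # weighted by the risk R_{t,d}
        return (p_ml ** (1.0 - r_td)) * (p_avg_t ** r_td)
    return cf_t / cs  # back off to the collection model for unseen terms
```

With r_td = 0 the estimate reduces to the ML estimate, and with r_td = 1 it reduces to P_avg(t), which matches the role of the risk factor in the formula.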
11
Clustering
Enables a search engine to present information in a more effective manner by displaying similar documents together.
Particularly useful when the search term has different word senses
For example, consider the query “jaguar.”
jaguar can refer to a car, an animal, or the Apple operating system
If a user is searching for documents related to the animal jaguar, they will have to manually search through the top-k documents to find them.
Clustering alleviates this problem.
12
Related Work Ding et al. propose data parallel algorithms for compressing, decompressing,
and intersecting sorted inverted lists for a Vector Space model based information retrieval system.
Example of their list intersection algorithm to intersect two lists A and B:
Randomly pick a few elements from list A and, for each chosen element A_i, find the consecutive pair B_j, B_{j+1} in B such that B_j < A_i ≤ B_{j+1}.
This implicitly partitions both A and B into corresponding segments.
Intersect corresponding segments in parallel.
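The partitioning idea can be sketched sequentially as below. This is a CPU mimic of Ding et al.'s scheme, assuming sorted lists of distinct docIDs; the segment loop is what would run as independent thread blocks on the GPU, and the names (`partitioned_intersect`, `pivots`) are ours.

```python
import bisect

def partitioned_intersect(A, B, pivots):
    """Intersect two sorted lists of distinct docIDs by first splitting
    them into matching segments.  `pivots` are indices into A; the value
    just before each pivot is located in B by binary search, which gives
    the matching segment boundary.  Each segment pair is then
    intersected independently (sequentially here, in parallel on a GPU).
    """
    a_bounds = [0] + pivots + [len(A)]
    b_bounds = [0] + [bisect.bisect_right(B, A[p - 1]) for p in pivots] + [len(B)]
    result = []
    for k in range(len(a_bounds) - 1):            # one "thread block" per segment
        seg_a = A[a_bounds[k]:a_bounds[k + 1]]
        seg_b = set(B[b_bounds[k]:b_bounds[k + 1]])
        result.extend(x for x in seg_a if x in seg_b)
    return result
```

For example, intersecting [1, 3, 5, 7, 9, 11] with [2, 3, 7, 8, 11] using the pivot index 3 splits both lists at the value 5, and the two segment intersections together yield [3, 7, 11].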
13
Related Work contd Chang et al. implement hierarchical clustering on the GPU. However,
1) They apply clustering to DNA microarray experiments. We apply it to information retrieval.
2) They use the Pearson correlation coefficient as the distance metric to compute the distance between two elements. We use cosine similarity.
3) We present a more optimized version of their code.
14
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
15
Motivation and Goal
No published papers propose an implementation of the LM based IR system on the GPU
However, a probabilistic language model based approach to retrieval significantly outperforms standard tf.idf weighting (Ponte and Croft, 1998)
Goal: We hope to be the first to contribute algorithms to realize a Language model based IR system on the GPU
16
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
Good Turing Smoothing
Intuition: We estimate the probability of things that occur c times using the probability of things that occur c+1 times.
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
In the above, N_c is the number of N-grams that occur c times, and N is the total number of observed N-grams.
Bigram     Doc 1   Doc 2
a shoe     1       1
a cat      0       2
foo bar    2       0
a dog      1       2
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
2 phases:
1) Calculate the Nc values
2) Smooth counts
Calculating N_c values on the GPU
Doc1 counts:      1 0 2 1
After sorting:    0 1 1 2
Positions:        0 1 2 3
Stream compaction keeps the position of the first occurrence of each distinct value:
Distinct values:  0 1 2
Start positions:  0 1 3
Differencing adjacent start positions gives, for Doc1: N0 = 1, N1 = 2, N2 = 1
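The sort / compact / difference pipeline above can be mimicked sequentially as follows; the function name `nc_values` is ours, and on the GPU each step would be a data-parallel primitive (sort, stream compaction, adjacent difference) rather than a Python loop.

```python
def nc_values(counts):
    """Compute Good-Turing N_c values the way the slide's GPU pipeline
    does: sort the counts, keep the first position of each distinct
    value (stream compaction), then difference adjacent start positions.
    Returns a dict mapping c -> N_c.
    """
    s = sorted(counts)
    # stream compaction: positions where a new distinct value starts
    starts = [i for i in range(len(s)) if i == 0 or s[i] != s[i - 1]]
    values = [s[i] for i in starts]
    ends = starts[1:] + [len(s)]
    return {v: e - b for v, b, e in zip(values, starts, ends)}

# Doc1 from the slide: counts [1, 0, 2, 1]
# sorted -> [0, 1, 1, 2]; start positions -> [0, 1, 3]
# differences -> N0 = 1, N1 = 2, N2 = 1
```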
Smooth N-gram counts
Doc1 counts: 1 0 2 1, handled by Thread 0, Thread 1, Thread 2, Thread 3 respectively
Let one thread compute the smoothed count for each N-gram:
Smoothed count: c* = (c+1) × N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
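The one-thread-per-n-gram smoothing step can be sketched as below. One caveat is ours, not the slide's: when N_{c+1} = 0 (no n-gram was seen c+1 times) the formula would zero out the count, so this sketch falls back to the raw count, a common practical workaround.

```python
def good_turing_counts(counts):
    """Good-Turing smoothed counts c* = (c+1) * N_{c+1} / N_c, one
    output per n-gram, mirroring the slide's one-thread-per-n-gram
    kernel.  Falls back to the raw count when N_{c+1} is 0 (assumption,
    not stated on the slide).
    """
    nc = {}
    for c in counts:                      # phase 1: N_c values
        nc[c] = nc.get(c, 0) + 1
    out = []
    for c in counts:                      # phase 2: smooth each count
        n_next = nc.get(c + 1, 0)
        out.append((c + 1) * n_next / nc[c] if n_next else float(c))
    return out

# Doc1 counts [1, 0, 2, 1] with N0=1, N1=2, N2=1:
# c=0 -> 1*2/1 = 2.0 ; c=1 -> 2*1/2 = 1.0 ; c=2 -> N3=0, kept at 2.0
```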
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1K to 2M.]
22
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
23
Kneser Ney smoothing The Good Turing algorithm assigns the same probability of occurrence to all
0 count n-grams
For example, if count(BURNISH THE) = count(BURNISH THOU) = 0, then using Good Turing
P(THE|BURNISH) = P(THOU|BURNISH)
However, intuitively, P(THE|BURNISH) > P(THOU| BURNISH), as THE is much more common than THOU
The Kneser Ney smoothing algorithm captures this intuition
Calculate P(w_i|w_{i-1}) based on the number of different contexts the word w_i has appeared in (assuming count(w_{i-1} w_i) = 0).
24
Kneser Ney smoothing
P_KN(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})   if count(w_{i-1} w_i) > 0
P_KN(w_i | w_{i-1}) = P_continuation(w_i)                    otherwise

where

P_continuation(w_i) = |{ w_{i-1} : count(w_{i-1} w_i) > 0 }| / |{ (w_{j-1}, w_j) : count(w_{j-1} w_j) > 0 }|

The GPU implementation computes this in four steps (Steps 1–4, detailed on the following slides).
25
GPU based implementation
Bigram dictionary:

Bigram          Count
an example      0
this is         4
this example    3

Step 1: Compute contextW[w_i] = |{ w_{i-1} : count(w_{i-1} w_i) > 0 }| for each w_i.
Launch a kernel such that each thread visits one bigram in the bigram dictionary and checks if count(w_{i-1} w_i) > 0. If yes, it increments contextW[w_i] by 1.

Result:
            an   example   is   this
contextW     0         1    1      0
26
Step 2: Compute the total number of bigram types with count > 0.
Apply a GPU based parallel reduction to the result of Step 1. (See the technical paper by Mark Harris for an efficient implementation of the parallel reduction operation on the GPU.) For our example, this total is 2.
Step 3: Compute count(w_{i-1}) for each w_{i-1}. As we have already completed Steps 1 and 2, this can easily be done by asking one thread to compute the total for each w_{i-1}.
Step 4: According to the value of count(w_{i-1} w_i), use the corresponding branch of the Kneser-Ney formula to obtain the smoothed probability array for the three bigrams:
.5   .5712   .4284
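The four steps can be mimicked sequentially as below. This sketch follows the slide's undiscounted variant (the published Kneser-Ney also subtracts a discount d from seen counts); the function name `kneser_ney` is ours, and the results 1/2, 4/7 ≈ .571 and 3/7 ≈ .428 match the slide's array up to rounding.

```python
def kneser_ney(bigram_counts):
    """Backoff sketch of the slide's Kneser-Ney variant.  Seen bigrams
    get the ML estimate count(w1 w2) / count(w1 .); unseen bigrams back
    off to the continuation probability: the number of distinct left
    contexts of w2 divided by the number of bigram types with count > 0.
    """
    context = {}      # step 1: distinct left contexts per word
    left_total = {}   # step 3: count(w1 .) for each w1
    types = 0         # step 2: number of bigram types with count > 0
    for (w1, w2), c in bigram_counts.items():
        left_total[w1] = left_total.get(w1, 0) + c
        if c > 0:
            context[w2] = context.get(w2, 0) + 1
            types += 1
    probs = {}
    for (w1, w2), c in bigram_counts.items():   # step 4: pick the branch
        if c > 0:
            probs[(w1, w2)] = c / left_total[w1]
        else:
            probs[(w1, w2)] = context.get(w2, 0) / types
    return probs

# Slide's dictionary: ("an","example"): 0, ("this","is"): 4, ("this","example"): 3
# -> P(example|an) = 1/2, P(is|this) = 4/7, P(example|this) = 3/7
```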
27
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1,000 to 2,000,000.]
28
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
29
Ponte and Croft’s document scoring model
The computation of one document’s score given the query is independent of the computation of any other document’s score given the query.
Embarrassingly parallel.
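The per-document score that each GPU thread would compute can be sketched as the standard Ponte-Croft query likelihood: the product of P̂(t|M_d) over query terms times the product of 1 − P̂(t|M_d) over the remaining vocabulary. The function below is our CPU-side sketch (in log space for numerical stability), assuming the smoothed per-term probabilities are already available.

```python
import math

def score_document(query_terms, vocab, term_prob):
    """Log query likelihood of one document under Ponte and Croft's
    model.  `term_prob` maps each vocabulary term to its smoothed
    P(t|M_d).  Each document's score depends only on its own model,
    which is what makes scoring embarrassingly parallel
    (one GPU thread per document).
    """
    q = set(query_terms)
    log_score = 0.0
    for t in vocab:
        p = term_prob[t]
        log_score += math.log(p if t in q else 1.0 - p)
    return log_score
```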
30
Experimental results
[Chart: running time in ms for the CPU and GPU implementations as the number of elements grows from 1K to 100K.]
31
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
32
Single link hierarchical clusteringThe algorithm can be divided into two phases:
Phase 1: Compute pairwise similarity between documents, i.e., compute sim(d_i, d_j) for all i, j ∈ {1, …, N}.
Phase 2: Merging. During each iteration, merge the 2 most similar clusters. Let the new cluster be called X. Update similarity of X with all other active clusters. Find new most similar cluster for X.
Input matrix:

      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

We launch a 2 × 2 grid of thread blocks where each block’s dimension is also 2 × 2:

Block 0   Block 1
Block 2   Block 3

Threads of Block 1 and their document pairs:

Thread 0 (d0,d2)   Thread 1 (d0,d3)
Thread 2 (d1,d2)   Thread 3 (d1,d3)
Phase 1 : Computing pairwise distances
Block 0   Block 1
Block 2   Block 3

Threads of Block 1:

Thread 0 (d0,d2)   Thread 1 (d0,d3)
Thread 2 (d1,d2)   Thread 3 (d1,d3)
Each thread computes the similarity between a pair of documents
However, as the threads within a block share common documents, they can synchronize their execution. E.g., Both Thread 0 and Thread 1 in block 1 require document 0
The above is a very important observation as it allows us to exploit the shared memory of a block. We only need to load d0 into the block’s shared memory once. However, both thread 0 and thread 1 can use it.
Focus on block 1
Similarity computation for block 1
Process the input matrix in chunks. In order to process each chunk, each thread in block 1 loads 2 values into the shared memory.
      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

Shared memory of the block after the first chunk (columns NG1, NG2; the d0 row is loaded by thread 0):

.2  .3
.6  .1
.5  .1
.1  .1

Do a partial similarity computation. E.g., for doc0 and doc2, we can find the partial dot product (.2)(.5) + (.3)(.1). Store this result.
After processing the first chunk, move to the second chunk.
      NG1   NG2   NG3   NG4
D0    .2    .3    .1    .4
D1    .6    .1    .1    .2
D2    .5    .1    .2    .2
D3    .1    .1    .7    .1

Shared memory of the block after the second chunk (columns NG3, NG4):

.1  .4
.1  .2
.2  .2
.7  .1

Earlier we had computed the partial dot product (.2)(.5) + (.3)(.1) for doc0 and doc2.
Based on the next chunk, we can complete the dot product by adding (.1)(.2) + (.4)(.2).
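The chunked scheme above can be mimicked sequentially as below. This is a CPU sketch of the tiling idea only; the shared-memory loads become an outer loop over column chunks, and the names (`chunked_dot_products`, `row_pairs`) are ours.

```python
def chunked_dot_products(matrix, row_pairs, chunk):
    """Dot products of document-vector pairs computed chunk by chunk,
    mimicking the tiled shared-memory scheme: each pass covers `chunk`
    columns and adds a partial product, so a pair's result is only
    complete after the last chunk.
    """
    n_cols = len(matrix[0])
    partial = {pair: 0.0 for pair in row_pairs}
    for start in range(0, n_cols, chunk):        # one pass per chunk
        for (i, j) in row_pairs:                 # one "thread" per pair
            partial[(i, j)] += sum(
                matrix[i][k] * matrix[j][k]
                for k in range(start, min(start + chunk, n_cols)))
    return partial

# Input matrix from the slide (rows D0..D3 over NG1..NG4)
M = [[.2, .3, .1, .4],
     [.6, .1, .1, .2],
     [.5, .1, .2, .2],
     [.1, .1, .7, .1]]
# (d0, d2): chunk 1 gives (.2)(.5)+(.3)(.1); chunk 2 adds (.1)(.2)+(.4)(.2)
```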
GPU based pairwise distance computation
[Chart: log running time in ms for the CPU and GPU implementations as the number of documents grows from 256 to 4096.]
38
Phase 2: Merge clusters

for n ← 1 to N − 1
    i1 ← argmax_{i : I[i] = i} NBM[i].sim
    i2 ← I[NBM[i1].index]
    merge i1 and i2
    for i ← 1 to N
        do if I[i] = i and i ≠ i1 and i ≠ i2
            C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
        if I[i] = i2
            then I[i] ← i1
    NBM[i1] ← argmax_{X ∈ {C[i1][i] : I[i] = i and i ≠ i1}} X.sim
Implement parallel reduction on the GPU that directly returns NBM[i1]
Launch GPU kernel with blocks =
Implement the parallel reduction algorithm on the GPU that directly returns i1 and i2
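A simplified CPU sketch of the merge phase is below. It keeps an explicit set of active clusters instead of the I/NBM bookkeeping of the pseudocode, so it is our illustrative variant, not the deck's implementation; the argmax and the row update are the two places the slides replace with a parallel reduction and a one-thread-per-column kernel.

```python
def single_link(sim, n_clusters):
    """Naive single-link merge phase: repeatedly merge the two most
    similar active clusters and update similarities with the
    single-link rule max(sim(i1, k), sim(i2, k)).  `sim` is a symmetric
    matrix with zero diagonal; returns the list of merges performed
    until only `n_clusters` clusters remain.
    """
    n = len(sim)
    active = set(range(n))
    merges = []
    while len(active) > n_clusters:
        # most similar active pair (GPU: parallel reduction)
        i1, i2 = max(((i, j) for i in active for j in active if i < j),
                     key=lambda p: sim[p[0]][p[1]])
        merges.append((i1, i2))
        active.discard(i2)               # i2 is absorbed into i1
        for k in active:                 # GPU: one thread per column
            if k != i1:
                sim[i1][k] = sim[k][i1] = max(sim[i1][k], sim[i2][k])
    return merges
```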
GPU based merging
[Chart: running time in ms for the CPU and GPU implementations as the number of documents grows from 256 to 4096.]
40
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
41
Discussion
From our experiments we observed that GPU based algorithms are primarily useful when dealing with large datasets.
The GPU is suitable for solving problems that can be divided into non-overlapping subproblems.
If one is running several iterations of the same GPU code, one should minimize the data transfer between the CPU and the GPU within those iterations.
42
Outline
• Background and Related Work
• Motivation and goal
• Our contributions:
  • A GPU based implementation of the Good Turing smoothing algorithm
  • A GPU based implementation of the Kneser Ney smoothing algorithm
  • An efficient implementation of Ponte and Croft’s document scoring model on the GPU
  • A GPU friendly version of the single link hierarchical clustering algorithm
• Discussion
• Conclusion
43
Conclusion We have contributed the following novel algorithms for GPU based IR:
1) A GPU based implementation of the Good Turing smoothing algorithm
2) A GPU based implementation of the Kneser Ney smoothing algorithm
3) An efficient implementation of Ponte and Croft’s document scoring model on the GPU
4) A GPU friendly version of the single link hierarchical clustering algorithm
We have experimentally shown that our GPU based implementations are significantly faster than similar CPU based implementations
Future work:
1) Implement pseudo relevance feedback on the GPU
2) Investigate methods to implement an image retrieval system on the GPU
44
References
[1] Cederman, D. and Tsigas, P. (2008). A Practical Quicksort Algorithm for Graphics Processors. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA '08), Springer-Verlag, Berlin, Heidelberg, 246-258.
[2] CUDPP. http://code.google.com/p/cudpp/
[3] Ding, S., He, J., and Suel, T. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web (WWW '09), ACM, New York, NY, USA, 421-430.
[4] Fagin, R., Kumar, R., and Sivakumar, D. Comparing top k lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 28-36.
[5] Harris, M. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
[6] Hoare, C.A.R. (1962). Quicksort. Computer Journal, Vol. 5, 1, 10-15.
45
[7] Indri. http://lemurproject.org/indri/
[8] Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-20.
[9] Jurafsky, D. and Martin, J. Speech and Language Processing.
[10] NVIDIA CUDA C Programming Guide.
[11] Ponte, J.M. and Croft, W.B. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), ACM, New York, NY, USA, 275-281.
[12] Salton, G., Wong, A., and Yang, C.S. A vector space model for automatic indexing. Commun. ACM 18, 11 (November 1975), 613-620.
[13] Sanders, J. and Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming.
[14] Spink, Amanda. U.S. versus European Web Searching Trends.
[15] Thrust. http://code.google.com/p/thrust/
46
Thank you!!!
47
Ponte and Croft’s model
For non-occurring terms, estimate P̂(t|M_d) = cf_t / cs, where cf_t is the raw count of term t in the collection and cs is the total number of tokens in the collection.