An Implementation of the Language Model Based IR System on the GPU


TRANSCRIPT

Page 1: An Implementation of the Language Model Based  IR System on the GPU


An Implementation of the Language Model Based IR System on the GPU

Sudhanshu Khemka

Page 2: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 3: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
  - The GPU Architecture
  - The structure of an IR System
  - Ponte and Croft's document scoring model
  - Clustering
  - Related Work

Page 4: An Implementation of the Language Model Based  IR System on the GPU

GPU Programming Model

Allows the programmer to define a grid of thread blocks.

Each thread block executes independently of other blocks.

All threads in a thread block can also execute independently of each other; however, one can synchronize their execution using barrier synchronization methods, such as __syncthreads().
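As a concrete illustration, here is a minimal CUDA sketch (the kernel name reverseTiles and the array names are hypothetical, not from the talk) that defines a grid of thread blocks and synchronizes the threads within each block using __syncthreads():

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block cooperatively reverses a 256-element
// tile of the input array, illustrating per-block shared memory and
// barrier synchronization with __syncthreads().
__global__ void reverseTiles(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();  // all threads in this block have filled the tile
    int j = blockDim.x - 1 - threadIdx.x;
    if (i < n) out[blockIdx.x * blockDim.x + j] = tile[threadIdx.x];
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    // Grid of 4 blocks, 256 threads each; blocks execute independently.
    reverseTiles<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);  // 255.0: last element of the first tile
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Each block reverses its own tile independently of the other blocks, mirroring the independence of thread blocks described above.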

Page 5: An Implementation of the Language Model Based  IR System on the GPU


GPU Memory Hierarchy

Page 6: An Implementation of the Language Model Based  IR System on the GPU


Ding et al.’s GPU based architecture for IR

GPU cannot access main memory directly.

Thus, there is a transfer cost associated with moving the data from the CPU's main memory to the GPU's global memory.

In some cases, this transfer cost is higher than the speedup obtained by using the GPU.
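To make this cost concrete, here is a minimal sketch (assuming a CUDA-capable device; the buffer size is illustrative) that times a host-to-device copy with CUDA events:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 24;                  // ~16M floats, ~64 MB (illustrative)
    float *h = (float *)malloc(n * sizeof(float));  // contents irrelevant for timing
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device transfer: %.2f ms\n", ms);
    // Any kernel speedup must exceed this cost to yield a net win.

    cudaFree(d); free(h);
    return 0;
}
```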

Page 7: An Implementation of the Language Model Based  IR System on the GPU


Structure of an IR system

Page 8: An Implementation of the Language Model Based  IR System on the GPU

Inverted Index and Smoothing

Inverted index (e.g., NG4 occurs 6 times in Doc 2):

        Doc 1   Doc 2   Doc 3
NG1       5       6       7
NG2       5       0       0
NG3       0       0       7
NG4       0       6       0

After add-one smoothing:

        Doc 1   Doc 2   Doc 3
NG1     6/14    7/16    8/18
NG2     6/14    1/16    1/18
NG3     1/14    1/16    8/18
NG4     1/14    7/16    1/18

Smoothing assigns a small non-zero probability to n-grams that were not seen in the document.
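A small sketch of the add-one computation above (plain host code with hypothetical array names; the GPU version would assign one thread per n-gram to the same formula):

```cuda
#include <cstdio>

// Add-one smoothing for a single document: P(ng) = (count + 1) / (total + V),
// where V is the number of distinct n-grams in the index.
int main() {
    const int V = 4;
    int counts[V] = {5, 5, 0, 0};            // Doc 1 column of the inverted index
    int total = 0;
    for (int i = 0; i < V; ++i) total += counts[i];
    for (int i = 0; i < V; ++i) {
        float p = (counts[i] + 1.0f) / (total + V);   // e.g., NG1: 6/14
        printf("NG%d: %d/%d = %.3f\n", i + 1, counts[i] + 1, total + V, p);
    }
    return 0;
}
```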

Page 9: An Implementation of the Language Model Based  IR System on the GPU


The Language Model based approach to IR

Builds a probabilistic language model Md for each document d

Ranks documents according to the probability of generating the query Q given their language model representation, P(Q | Md)

Ponte and Croft’s model is an enhanced version of the above

Page 10: An Implementation of the Language Model Based  IR System on the GPU

Ponte and Croft's model

$$\hat{P}(t \mid M_d) = \begin{cases} \hat{P}_{ml}(t, M_d)^{(1.0 - \hat{R}_{t,d})} \times \hat{P}_{avg}(t)^{\hat{R}_{t,d}} & \text{if } tf(t,d) > 0 \\[4pt] \dfrac{cf_t}{cs} & \text{otherwise} \end{cases}$$

As we are estimating from a document-sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t to estimate P̂(t | Md):

$$\hat{P}_{avg}(t) = \frac{\sum_{d : t \in d} \hat{P}_{ml}(t \mid M_d)}{df_t}$$

and the risk of using it is modeled geometrically:

$$\hat{R}_{t,d} = \left(\frac{1.0}{1.0 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1.0 + \bar{f}_t}\right)^{tf(t,d)}$$

where f̄t is the mean term frequency of term t in documents containing t.

Page 11: An Implementation of the Language Model Based  IR System on the GPU

Clustering

Enables a search engine to present information in a more effective manner by displaying similar documents together.

Particularly useful when the search term has different word senses

For example, consider the query “jaguar.”

jaguar can refer to a car, an animal, or the Apple operating system

If a user is searching for documents related to the animal jaguar, they will have to manually search through the top-k documents to find documents related to the animal.

Clustering alleviates this problem.

Page 12: An Implementation of the Language Model Based  IR System on the GPU

Related Work

Ding et al. propose data parallel algorithms for compressing, decompressing, and intersecting sorted inverted lists for a Vector Space model based information retrieval system.

Example of their list intersection algorithm to intersect two lists A and B:

Randomly pick a few elements from list A and, for each such element Ai, find the pair (Bj, Bj+1) in B such that Bj < Ai ≤ Bj+1.

This implicitly partitions both A and B into corresponding segments.

Intersect corresponding segments in parallel.
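A host-side sketch of the partitioning idea on made-up lists (in Ding et al.'s GPU version each segment pair is intersected by a different group of threads; intersectSegment and the single-pivot choice here are illustrative simplifications):

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

// Intersect one segment pair of two sorted lists by merging.
static void intersectSegment(const std::vector<int>& a, size_t a0, size_t a1,
                             const std::vector<int>& b, size_t b0, size_t b1) {
    while (a0 < a1 && b0 < b1) {
        if (a[a0] < b[b0]) ++a0;
        else if (b[b0] < a[a0]) ++b0;
        else { printf("%d ", a[a0]); ++a0; ++b0; }
    }
}

int main() {
    std::vector<int> A = {3, 7, 12, 19, 25, 31};
    std::vector<int> B = {2, 3, 8, 12, 20, 26, 30, 31};
    int pivot = A[2];  // pick an element of A, here 12
    // Split both lists around the pivot: left parts hold values <= pivot.
    size_t sa = std::upper_bound(A.begin(), A.end(), pivot) - A.begin();
    size_t sb = std::upper_bound(B.begin(), B.end(), pivot) - B.begin();
    intersectSegment(A, 0, sa, B, 0, sb);                 // could run on one block
    intersectSegment(A, sa, A.size(), B, sb, B.size());   // could run on another
    printf("\n");  // prints: 3 12 31
    return 0;
}
```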

Page 13: An Implementation of the Language Model Based  IR System on the GPU

Related Work (contd.)

Chang et al. implement hierarchical clustering on the GPU. However:

1) They apply clustering to DNA microarray experiments. We apply it to information retrieval.

2) They use the Pearson correlation coefficient as the distance metric to compute the distance between two elements. We use cosine similarity.

3) We present a more optimized version of their code.

Page 14: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 15: An Implementation of the Language Model Based  IR System on the GPU


Motivation and Goal

No published papers propose an implementation of the LM based IR system on the GPU

However, a probabilistic language model based approach to retrieval significantly outperforms standard tf.idf weighting (Ponte and Croft, 1998)

Goal: We hope to be the first to contribute algorithms to realize a language model based IR system on the GPU

Page 16: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 17: An Implementation of the Language Model Based  IR System on the GPU

Good-Turing Smoothing

Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.

Smoothed count: c* = (c+1) × Nc+1 / Nc

Smoothed probability: P(c*) = c* / N

In the above definitions, Nc is the number of n-grams that occur c times, and N is the total number of n-grams observed.

Page 18: An Implementation of the Language Model Based  IR System on the GPU

Example counts:

           Doc 1   Doc 2
a shoe       1       1
a cat        0       2
foo bar      2       0
a dog        1       2

Smoothed count: c* = (c+1) × Nc+1 / Nc

Smoothed probability: P(c*) = c* / N

2 phases:

1) Calculate the Nc values

2) Smooth the counts

Page 19: An Implementation of the Language Model Based  IR System on the GPU

Calculating Nc values on the GPU

Doc1:       1 0 2 1

Sort:       0 1 1 2

Positions:  0 1 2 3

Stream compaction keeps the positions where a new count value begins:

Values:     0 1 2
Positions:  0 1 3

Doc1: N0 = 1, N1 = 2, N2 = 1
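This phase maps naturally onto the Thrust library [15]; the sketch below is one possible realization, where thrust::sort plus thrust::reduce_by_key plays the role of the sort and stream-compaction steps (reduce_by_key directly emits each distinct count value together with its frequency Nc):

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

int main() {
    // N-gram counts for Doc1 from the example above.
    thrust::device_vector<int> counts(4);
    counts[0] = 1; counts[1] = 0; counts[2] = 2; counts[3] = 1;

    thrust::sort(counts.begin(), counts.end());      // 0 1 1 2

    // For each distinct count value c, sum a stream of 1s -> Nc.
    thrust::device_vector<int> c_values(4), Nc(4);
    auto end = thrust::reduce_by_key(counts.begin(), counts.end(),
                                     thrust::constant_iterator<int>(1),
                                     c_values.begin(), Nc.begin());
    int m = end.first - c_values.begin();
    for (int i = 0; i < m; ++i)
        printf("N%d = %d\n", (int)c_values[i], (int)Nc[i]);
    // Prints: N0 = 1, N1 = 2, N2 = 1
    return 0;
}
```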

Page 20: An Implementation of the Language Model Based  IR System on the GPU

Smooth n-gram counts

Doc1:  1 0 2 1

Thread 0   Thread 1   Thread 2   Thread 3

Let one thread compute the smoothed count for each n-gram:

Smoothed count: c* = (c+1) × Nc+1 / Nc

Smoothed probability: P(c*) = c* / N
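A minimal sketch of such a kernel (hypothetical names; it assumes the Nc array from the previous phase, and for brevity it smooths even the largest count, which real implementations typically leave unsmoothed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per n-gram: replace count c with c* = (c+1) * N_{c+1} / N_c.
__global__ void goodTuringSmooth(const int *counts, const int *Nc,
                                 float *smoothed, int numNgrams) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numNgrams) {
        int c = counts[i];
        smoothed[i] = (c + 1) * (float)Nc[c + 1] / (float)Nc[c];
    }
}

int main() {
    const int n = 4;
    int h_counts[n] = {1, 0, 2, 1};   // Doc1 from the example
    int h_Nc[4]     = {1, 2, 1, 0};   // N0=1, N1=2, N2=1 (N3=0 kept for the demo)
    int *d_counts, *d_Nc; float *d_out;
    cudaMalloc(&d_counts, n * sizeof(int));
    cudaMalloc(&d_Nc, 4 * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_counts, h_counts, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Nc, h_Nc, 4 * sizeof(int), cudaMemcpyHostToDevice);

    goodTuringSmooth<<<1, n>>>(d_counts, d_Nc, d_out, n);

    float h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("c*[%d] = %.2f\n", i, h_out[i]);
    cudaFree(d_counts); cudaFree(d_Nc); cudaFree(d_out);
    return 0;
}
```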

Page 21: An Implementation of the Language Model Based  IR System on the GPU

Experimental results

[Chart: CPU vs. GPU running time (ms) vs. number of elements, from 1K to 2M]

Page 22: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 23: An Implementation of the Language Model Based  IR System on the GPU

Kneser-Ney Smoothing

The Good-Turing algorithm assigns the same probability of occurrence to all zero-count n-grams.

For example, if count(BURNISH THE) = count(BURNISH THOU) = 0, then under Good-Turing

P(THE|BURNISH) = P(THOU|BURNISH)

However, intuitively, P(THE|BURNISH) > P(THOU|BURNISH), as THE is much more common than THOU

The Kneser Ney smoothing algorithm captures this intuition

Idea: calculate P(wi|wi-1) based on the number of different contexts the word wi has appeared in (assuming count(wi-1wi) = 0)

Page 24: An Implementation of the Language Model Based  IR System on the GPU

Kneser-Ney smoothing

$$P_{KN}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{count(w_{i-1}w_i) - D}{count(w_{i-1})} + \lambda(w_{i-1})\,P_{cont}(w_i) & \text{if } count(w_{i-1}w_i) > 0 \\[6pt] P_{cont}(w_i) & \text{otherwise} \end{cases}$$

where D is an absolute-discount constant (typically 0.75), the continuation probability is

$$P_{cont}(w_i) = \frac{|\{w_{i-1} : count(w_{i-1}w_i) > 0\}|}{|\{(w_{j-1}, w_j) : count(w_{j-1}w_j) > 0\}|}$$

and the backoff weight is

$$\lambda(w_{i-1}) = \frac{D}{count(w_{i-1})} \times |\{w : count(w_{i-1}w) > 0\}|$$

The GPU implementation computes this in four steps: Step 1 computes the numerator of P_cont for each wi; Step 2 computes the denominator of P_cont; Step 3 computes λ(wi-1) for each wi-1; Step 4 applies the appropriate branch of the formula.

Page 25: An Implementation of the Language Model Based  IR System on the GPU

GPU based implementation

Bigrams         Counts
an example        0
this is           4
this example      3

Step 1: Compute for each wi its context count, contextW[wi] = |{wi-1 : count(wi-1wi) > 0}| (the numerator of P_cont). A minimal kernel sketch follows below.

Launch a kernel such that each thread visits one bigram in the bigram dictionary and checks whether count(wi-1wi) > 0. If yes, it increments contextW[wi] by 1:

contextW:   0     1        1     0
           an   example   is   this
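A minimal sketch of the Step 1 kernel (the parallel-array layout secondWord/bigramCount and the use of atomicAdd are assumptions, not details from the talk; atomicAdd guards concurrent increments to the same word):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per bigram: if the bigram was observed, bump the context
// count of its second word w_i.
__global__ void countContexts(const int *secondWord, const int *bigramCount,
                              int *contextW, int numBigrams) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBigrams && bigramCount[i] > 0)
        atomicAdd(&contextW[secondWord[i]], 1);
}

int main() {
    // Vocabulary IDs: an=0, example=1, is=2, this=3.
    // Bigrams: "an example"(0), "this is"(4), "this example"(3).
    const int nb = 3, vocab = 4;
    int h_second[nb] = {1, 2, 1};
    int h_count[nb]  = {0, 4, 3};
    int *d_second, *d_count, *d_ctx;
    cudaMalloc(&d_second, nb * sizeof(int));
    cudaMalloc(&d_count, nb * sizeof(int));
    cudaMalloc(&d_ctx, vocab * sizeof(int));
    cudaMemcpy(d_second, h_second, nb * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, h_count, nb * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_ctx, 0, vocab * sizeof(int));

    countContexts<<<1, nb>>>(d_second, d_count, d_ctx, nb);

    int h_ctx[vocab];
    cudaMemcpy(h_ctx, d_ctx, vocab * sizeof(int), cudaMemcpyDeviceToHost);
    for (int w = 0; w < vocab; ++w) printf("contextW[%d] = %d\n", w, h_ctx[w]);
    // Prints 0 1 1 0 for {an, example, is, this}, matching the slide.
    return 0;
}
```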

Page 26: An Implementation of the Language Model Based  IR System on the GPU

Step 2: Compute the denominator of P_cont, the total number of distinct observed bigrams.

Apply a GPU based parallel reduction to the result of Step 1; see the technical paper by Mark Harris for an efficient implementation of the parallel reduction operation on the GPU (a minimal sketch follows below). For us, the total is 2.

Step 3: Compute λ(wi-1) for each wi-1. As we have already completed Steps 1 and 2, this can easily be done by asking one thread to compute the λ(wi-1) for each wi-1.

Step 4: According to the value of count(wi-1wi), we use the correct branch of the Kneser-Ney formula to get the following array:

.5   .5712   .4284
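Returning to Step 2, here is a minimal parallel-reduction sketch in the style of Mark Harris's note (single block and power-of-two block size assumed, which is enough to sum the contextW array from Step 1):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tree reduction in shared memory: each step halves the number of
// active threads until sdata[0] holds the block-wide sum.
__global__ void reduceSum(const int *in, int *out, int n) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    sdata[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

int main() {
    const int n = 4;
    int h[n] = {0, 1, 1, 0};   // contextW from Step 1
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);

    reduceSum<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);

    int total;
    cudaMemcpy(&total, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("distinct observed bigrams = %d\n", total);  // 2
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```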

Page 27: An Implementation of the Language Model Based  IR System on the GPU

Experimental results

[Chart: CPU vs. GPU running time (ms) vs. number of elements, from 1,000 to 2,000,000]

Page 28: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 29: An Implementation of the Language Model Based  IR System on the GPU

Ponte and Croft's document scoring model

The computation of the score of one document given the query is independent of the computation of the score of any other document given the query.

Embarrassingly parallel.
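A simplified one-thread-per-document scoring sketch (the dense termProb layout holding smoothed P̂(t|Md) values is a hypothetical simplification; the full Ponte and Croft score also multiplies (1 − P̂(t|Md)) over non-query terms, which is omitted here for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <math.h>

// One thread per document: accumulate log P(Q|Md) over the query terms.
__global__ void scoreDocs(const float *termProb, int numTerms,
                          const int *queryTerms, int queryLen,
                          float *score, int numDocs) {
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < numDocs) {
        float s = 0.0f;
        for (int q = 0; q < queryLen; ++q)
            s += logf(termProb[d * numTerms + queryTerms[q]]);
        score[d] = s;  // rank documents by descending score
    }
}

int main() {
    // 3 documents x 4 terms of smoothed probabilities (illustrative values).
    const int numDocs = 3, numTerms = 4, queryLen = 2;
    float h_prob[numDocs * numTerms] = {
        0.43f, 0.44f, 0.07f, 0.06f,
        0.39f, 0.06f, 0.06f, 0.49f,
        0.44f, 0.06f, 0.44f, 0.06f,
    };
    int h_query[queryLen] = {0, 2};   // query = {term0, term2}
    float *d_prob; int *d_query; float *d_score;
    cudaMalloc(&d_prob, sizeof(h_prob));
    cudaMalloc(&d_query, sizeof(h_query));
    cudaMalloc(&d_score, numDocs * sizeof(float));
    cudaMemcpy(d_prob, h_prob, sizeof(h_prob), cudaMemcpyHostToDevice);
    cudaMemcpy(d_query, h_query, sizeof(h_query), cudaMemcpyHostToDevice);

    scoreDocs<<<1, numDocs>>>(d_prob, numTerms, d_query, queryLen,
                              d_score, numDocs);

    float h_score[numDocs];
    cudaMemcpy(h_score, d_score, sizeof(h_score), cudaMemcpyDeviceToHost);
    for (int d = 0; d < numDocs; ++d)
        printf("log P(Q|M%d) = %.3f\n", d, h_score[d]);
    return 0;
}
```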

Page 30: An Implementation of the Language Model Based  IR System on the GPU

Experimental results

[Chart: CPU vs. GPU running time (ms) vs. number of elements, from 1K to 100K]

Page 31: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 32: An Implementation of the Language Model Based  IR System on the GPU

Single link hierarchical clustering

The algorithm can be divided into two phases:

Phase 1: Compute pairwise similarity between documents, i.e., compute sim(di, dj) for all i, j ∈ {1, …, N}.

Phase 2: Merging. During each iteration, merge the 2 most similar clusters. Let the new cluster be called X. Update the similarity of X with all other active clusters, and find the new most similar cluster for X.

Page 33: An Implementation of the Language Model Based  IR System on the GPU

Phase 1: Computing pairwise distances

Input matrix:

       NG1   NG2   NG3   NG4
D0     .2    .3    .1    .4
D1     .6    .1    .1    .2
D2     .5    .1    .2    .2
D3     .1    .1    .7    .1

We launch a 2×2 grid of thread blocks, where each block's dimension is also 2×2:

Block 0   Block 1
Block 2   Block 3

Thread assignments within block 1:

Thread 0: (d0, d2)   Thread 1: (d0, d3)
Thread 2: (d1, d2)   Thread 3: (d1, d3)

Page 34: An Implementation of the Language Model Based  IR System on the GPU

Focus on block 1:

Thread 0: (d0, d2)   Thread 1: (d0, d3)
Thread 2: (d1, d2)   Thread 3: (d1, d3)

Each thread computes the similarity between a pair of documents.

However, as the threads within a block share common documents, they can synchronize their execution and share data; e.g., both Thread 0 and Thread 1 in block 1 require document 0.

This is a very important observation, as it allows us to exploit the shared memory of the block: we only need to load d0 into the block's shared memory once, yet both Thread 0 and Thread 1 can use it.

Page 35: An Implementation of the Language Model Based  IR System on the GPU

Similarity computation for block 1

Process the input matrix in chunks. To process each chunk, each thread in block 1 loads 2 values into the shared memory.

Input matrix (the first chunk is columns NG1-NG2):

       NG1   NG2   NG3   NG4
D0     .2    .3    .1    .4
D1     .6    .1    .1    .2
D2     .5    .1    .2    .2
D3     .1    .1    .7    .1

Shared memory of the block (the row ".2 .3" is loaded by thread 0):

.2  .3
.6  .1
.5  .1
.1  .1

Do a partial similarity computation. E.g., for doc0 and doc2, we can compute a partial dot product as (.2)(.5) + (.3)(.1). Store this result.

Page 36: An Implementation of the Language Model Based  IR System on the GPU

After processing the first chunk, move to the second chunk (columns NG3-NG4):

       NG1   NG2   NG3   NG4
D0     .2    .3    .1    .4
D1     .6    .1    .1    .2
D2     .5    .1    .2    .2
D3     .1    .1    .7    .1

Shared memory of the block:

.1  .4
.1  .2
.2  .7
.2  .1

Earlier we had computed (.2)(.5) + (.3)(.1) for doc0 and doc2. Based on this next chunk, we can complete the dot product by adding (.1)(.2) + (.4)(.2).
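Putting the walkthrough together, here is a sketch of the tiled kernel (TILE, the array names, and the 4×4 example are illustrative; the rows are used as-is, so the kernel computes exactly the dot products from the walkthrough, and pre-normalizing the rows would turn them into cosine similarities):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 2   // block is TILE x TILE threads; chunks are TILE features wide

// Block (bx, by) computes similarities between documents by*TILE..by*TILE+1
// and bx*TILE..bx*TILE+1. Per chunk, every thread loads 2 values into shared
// memory, matching the walkthrough above.
__global__ void pairwiseSim(const float *docs, float *sim, int numDocs,
                            int numFeatures) {
    __shared__ float rowTile[TILE][TILE];
    __shared__ float colTile[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;  // first doc of the pair
    int col = blockIdx.x * TILE + threadIdx.x;  // second doc of the pair
    float acc = 0.0f;
    for (int chunk = 0; chunk < numFeatures; chunk += TILE) {
        rowTile[threadIdx.y][threadIdx.x] =
            docs[(blockIdx.y * TILE + threadIdx.y) * numFeatures + chunk + threadIdx.x];
        colTile[threadIdx.y][threadIdx.x] =
            docs[(blockIdx.x * TILE + threadIdx.y) * numFeatures + chunk + threadIdx.x];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)   // partial dot product for this chunk
            acc += rowTile[threadIdx.y][k] * colTile[threadIdx.x][k];
        __syncthreads();
    }
    if (row < numDocs && col < numDocs)
        sim[row * numDocs + col] = acc;
}

int main() {
    const int n = 4, f = 4;
    float h_docs[n * f] = {
        .2f, .3f, .1f, .4f,
        .6f, .1f, .1f, .2f,
        .5f, .1f, .2f, .2f,
        .1f, .1f, .7f, .1f,
    };
    float *d_docs, *d_sim;
    cudaMalloc(&d_docs, sizeof(h_docs));
    cudaMalloc(&d_sim, n * n * sizeof(float));
    cudaMemcpy(d_docs, h_docs, sizeof(h_docs), cudaMemcpyHostToDevice);

    dim3 grid(n / TILE, n / TILE), block(TILE, TILE);
    pairwiseSim<<<grid, block>>>(d_docs, d_sim, n, f);

    float h_sim[n * n];
    cudaMemcpy(h_sim, d_sim, sizeof(h_sim), cudaMemcpyDeviceToHost);
    printf("sim(d0, d2) = %.2f\n", h_sim[0 * n + 2]);
    // (.2)(.5) + (.3)(.1) + (.1)(.2) + (.4)(.2) = 0.23
    cudaFree(d_docs); cudaFree(d_sim);
    return 0;
}
```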

Page 37: An Implementation of the Language Model Based  IR System on the GPU

GPU based pairwise distance computation

[Chart: CPU vs. GPU log running time (ms) vs. number of documents, from 256 to 4096]

Page 38: An Implementation of the Language Model Based  IR System on the GPU

Phase 2: Merge clusters

for n ← 1 to N − 1
    do i1 ← argmax {i : I[i] = i} NBM[i].sim
       i2 ← I[NBM[i1].index]
       Merge i1 and i2
       for i ← 1 to N
           do if I[i] = i and i ≠ i1 and i ≠ i2
                  C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
              if I[i] = i2
                  then I[i] ← i1
       NBM[i1] ← argmax X ∈ {C[i1][i] : I[i] = i and i ≠ i1} X.sim

GPU adaptations (see the sketch after this list):

• Implement the parallel reduction algorithm on the GPU so that it directly returns i1 and i2
• Launch the GPU kernel with blocks =
• Implement a parallel reduction on the GPU that directly returns NBM[i1]
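A minimal sketch of the argmax-style reduction used to pick the best cluster (a single-block version with hypothetical names; it returns the index of the maximum similarity rather than the value, which is the building block for finding i1 and NBM[i1]):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Shared-memory tree reduction that tracks the argmax: each step keeps
// the (value, index) pair with the larger similarity.
__global__ void argmaxReduce(const float *sim, int *bestIdx, int n) {
    __shared__ float vals[256];
    __shared__ int   idxs[256];
    int tid = threadIdx.x;
    vals[tid] = (tid < n) ? sim[tid] : -1e30f;
    idxs[tid] = tid;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && vals[tid + s] > vals[tid]) {
            vals[tid] = vals[tid + s];
            idxs[tid] = idxs[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) *bestIdx = idxs[0];
}

int main() {
    const int n = 4;
    float h_sim[n] = {0.23f, 0.71f, 0.42f, 0.15f};  // illustrative similarities
    float *d_sim; int *d_best;
    cudaMalloc(&d_sim, n * sizeof(float));
    cudaMalloc(&d_best, sizeof(int));
    cudaMemcpy(d_sim, h_sim, n * sizeof(float), cudaMemcpyHostToDevice);

    argmaxReduce<<<1, 4>>>(d_sim, d_best, n);

    int best;
    cudaMemcpy(&best, d_best, sizeof(int), cudaMemcpyDeviceToHost);
    printf("most similar cluster index = %d\n", best);  // 1
    cudaFree(d_sim); cudaFree(d_best);
    return 0;
}
```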

Page 39: An Implementation of the Language Model Based  IR System on the GPU

GPU based merging

[Chart: CPU vs. GPU running time (ms) vs. number of documents, from 256 to 4096]

Page 40: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 41: An Implementation of the Language Model Based  IR System on the GPU

Discussion

From our experiments we observed that GPU based algorithms are primarily useful when dealing with large datasets.

The GPU is well suited to problems that can be divided into non-overlapping subproblems.

If one is running several iterations of the same GPU code, one should minimize the data transfer between the CPU and the GPU within those iterations.

Page 42: An Implementation of the Language Model Based  IR System on the GPU

Outline

• Background and Related Work
• Motivation and goal
• Our contributions:
  - A GPU based implementation of the Good-Turing smoothing algorithm
  - A GPU based implementation of the Kneser-Ney smoothing algorithm
  - An efficient implementation of Ponte and Croft's document scoring model on the GPU
  - A GPU-friendly version of the single-link hierarchical clustering algorithm
• Discussion
• Conclusion

Page 43: An Implementation of the Language Model Based  IR System on the GPU

Conclusion

We have contributed the following novel algorithms for GPU based IR:

1) A GPU based implementation of the Good-Turing smoothing algorithm
2) A GPU based implementation of the Kneser-Ney smoothing algorithm
3) An efficient implementation of Ponte and Croft's document scoring model on the GPU
4) A GPU-friendly version of the single-link hierarchical clustering algorithm

We have experimentally shown that our GPU based implementations are significantly faster than comparable CPU based implementations.

Future work:

1) Implement pseudo relevance feedback on the GPU
2) Investigate methods to implement an image retrieval system on the GPU

Page 44: An Implementation of the Language Model Based  IR System on the GPU

References

[1] Cederman, D. and Tsigas, P. (2008). A Practical Quicksort Algorithm for Graphics Processors. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA '08). Springer-Verlag, Berlin, Heidelberg, 246-258.

[2] CUDPP. http://code.google.com/p/cudpp/

[3] Ding, S., He, J., and Suel, T. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 421-430.

[4] Fagin, R., Kumar, R., and Sivakumar, D. Comparing top k lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 28-36.

[5] Harris, M. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

[6] Hoare, C.A.R. (1962). Quicksort. Computer Journal, Vol. 5, 1, 10-15.

Page 45: An Implementation of the Language Model Based  IR System on the GPU

[7] Indri. http://lemurproject.org/indri/

[8] Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-20.

[9] Jurafsky, D. and Martin, J. Speech and Language Processing.

[10] NVIDIA CUDA C Programming Guide.

[11] Ponte, J.M. and Croft, W.B. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM, New York, NY, USA, 275-281.

[12] Salton, G., Wong, A., and Yang, C.S. A vector space model for automatic indexing. Commun. ACM 18, 11 (November 1975), 613-620.

[13] Sanders, J. and Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming.

[14] Spink, Amanda. U.S. versus European Web Searching Trends.

[15] Thrust. http://code.google.com/p/thrust/

Page 46: An Implementation of the Language Model Based  IR System on the GPU


Thank you!!!

Page 47: An Implementation of the Language Model Based  IR System on the GPU

Ponte and Croft's model

For non-occurring terms, estimate as follows:

$$\hat{P}(t \mid M_d) = \frac{cf_t}{cs}$$

In the above, cf_t is the raw count of term t in the collection and cs is the total number of tokens in the collection.

As we are estimating from a document-sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t to estimate P̂(t | Md):

$$\hat{P}_{avg}(t) = \frac{\sum_{d : t \in d} \hat{P}_{ml}(t \mid M_d)}{df_t}$$

and the risk of using it is modeled geometrically:

$$\hat{R}_{t,d} = \left(\frac{1.0}{1.0 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1.0 + \bar{f}_t}\right)^{tf(t,d)}$$

where f̄t is the mean term frequency of term t in documents containing t.