
Source: developer.download.nvidia.com/GTC/PDF/GTC2012/Posters/P...

Parallel Data Mining Techniques on Graphics Processing Unit with CUDA

Introduction

Data mining is widely used in various domains and has significant applications. However, current data mining tools cannot meet the speed requirements of applications with large-scale databases. We propose three techniques to accelerate fundamental kernels in data mining algorithms on the CUDA platform: a scalable thread scheduling scheme for irregular patterns, a parallel distributed top-k scheme, and a parallel high dimension reduction scheme. They play a key role in our GUCAS_CU-Miner toolkit, which includes three representative data mining algorithms: CU-Apriori, CU-KNN, and CU-K-means.

Experimental Results

Hardware: HP xw8600 workstation, quad-core 2.66 GHz Intel Xeon CPU, 4 GB host memory, Tesla C1060 card (240 SPs at 1.30 GHz, 4 GB device memory)

Software: Red Hat Enterprise Linux WS 4.7, CUDA 2.1, gcc 3.4.6

Parallel Distributed Top-k Scheme

Problem: The top-k problem is to select the k minimum or maximum elements from a data collection.

Solution:

Step 1: Local sort. Divide the data collection into small partitions of equal size, then store and sort them in shared memory concurrently.

Step 2: Approximate top-k queue. Form a data collection from the heads of the sorted queues, then perform an insertion sort on it to pick out the approximate top-k queue.

Step 3: Global top-k queue. Take the approximate top-k queue A together with the k local sorted queues whose heads are in A, ordered by their heads, and apply the exclusive property to produce the global top-k queue.

This scheme not only cuts down the number of comparisons between elements, but also maximizes parallelization by classifying data concurrently.

Exclusive property: If element a is less than element b, which belongs to a sorted queue q, then any element greater than b in q cannot be less than a.
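The three steps above can be sketched sequentially in Python to make the data flow concrete. This is only an illustration under stated assumptions: the function name `distributed_top_k` and its parameters are hypothetical, the partitions are sorted one after another instead of concurrently in shared memory, `sorted()` stands in for the insertion sort of Step 2, and the sketch selects the k minimum elements.

```python
from bisect import insort


def distributed_top_k(data, k, partition_size):
    """Return the k smallest values of data in ascending order."""
    if k <= 0 or partition_size <= 0:
        raise ValueError("k and partition_size must be positive")

    # Step 1: local sort. On the GPU each partition would be sorted in
    # shared memory by its own thread block; here we sort them one by one.
    queues = [
        sorted(data[start:start + partition_size])
        for start in range(0, len(data), partition_size)
    ]

    # Step 2: approximate top-k queue built from the heads of the sorted
    # queues. The poster uses an insertion sort; sorted() stands in for it.
    heads = sorted((queue[0], index) for index, queue in enumerate(queues))
    approximate = heads[:k]

    # Step 3: global top-k queue. Only queues whose heads made it into the
    # approximate queue can still contribute smaller elements.
    result = [head for head, _ in approximate]
    for _, index in approximate:
        for value in queues[index][1:]:
            if len(result) < k:
                insort(result, value)
            elif value < result[-1]:
                insort(result, value)
                result.pop()  # drop the displaced largest candidate
            else:
                # Exclusive property: the rest of this sorted queue is at
                # least as large as value, so none of it can enter the result.
                break
    return result
```

A queue whose head is not among the k smallest heads is skipped entirely, because k elements no larger than its head already exist. Within a selected queue, the scan stops at the first element that cannot improve the result, which is where the exclusive property saves comparisons.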

Parallel High Dimension Reduction Scheme

Problem: High-dimensional data makes data manipulation and the storage of intermediate results very costly.

Solution:

1) We treat the same attribute (dimension) across all data objects as a vector and perform the reduction on each attribute rather than on each row. Each thread block handles exactly one attribute.

2) The sequential addressing reduction from the CUDA SDK is used for the one-dimensional reduction.

This scheme aims to maximize thread parallelism and to eliminate the shared memory overflow that occurs with many large data elements.
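As a rough illustration of the per-attribute layout, the following Python sketch reduces each column of a small table with the sequential addressing pattern, where the stride starts at half the buffer length and halves at every step. The function names are hypothetical, a sum is assumed as the reduction operator, the buffer is padded with zeros to a power of two, and the inner loop over `tid` models threads that would run in parallel within one thread block.

```python
def sequential_addressing_sum(values):
    """Sum a vector with a tree reduction that uses sequential addressing."""
    # Pad to a power of two so every step can halve the active range.
    size = 1
    while size < len(values):
        size *= 2
    buffer = list(values) + [0] * (size - len(values))

    stride = size // 2
    while stride > 0:
        # Each "thread" tid < stride adds the element one stride away,
        # so active threads touch a contiguous range of the buffer.
        for tid in range(stride):
            buffer[tid] += buffer[tid + stride]
        stride //= 2
    return buffer[0]


def reduce_by_attribute(rows):
    """Reduce every attribute (column) independently, one block per column."""
    if not rows:
        return []
    attribute_count = len(rows[0])
    return [
        sequential_addressing_sum([row[attribute] for row in rows])
        for attribute in range(attribute_count)
    ]
```

For example, `reduce_by_attribute([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` yields one sum per attribute. Because each block only ever holds a single attribute's values, the working set per block does not grow with the number of dimensions.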

Conclusion

Our experiments demonstrate that the proposed techniques work efficiently, and our toolkit indicates that the GPU + CUDA parallel architecture is feasible and promising for data mining applications.

Novelty: Our work is the first to propose a parallel distributed top-k scheme on CUDA, which can be treated as an independent algorithm.

GUCAS_CU-Miner

GUCAS_CU-Miner is a CUDA-based data mining toolkit that currently includes three algorithms. The aforementioned techniques play a key role in enhancing its performance.

CU-Apriori: Identifies frequently co-occurring itemsets in a database. We implement the Candidate Generation kernel as a 2D thread grid using our scalable thread scheduling scheme for irregular patterns, which gives simpler control and higher efficiency. Each thread tests whether two frequent itemsets can be joined.

CU-KNN: Classifies objects based on their closest reference objects. We implement the Neighbor Selecting kernel using our parallel distributed top-k scheme, since selecting the k nearest neighbors of a query object is a typical top-k problem.

CU-K-Means: Groups the data into clusters so that objects within a cluster are highly similar while objects in different clusters are highly dissimilar. Two CUDA kernels, the reduction of objects' attributes and counts and the detection of centroid movement, adopt our parallel high dimension reduction scheme to maximize thread parallelism.
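To illustrate the per-thread join test in candidate generation, here is a small Python sketch. It assumes the textbook Apriori join rule (two sorted k-itemsets join when they share their first k-1 items), which the poster does not spell out, so the rule and the names `can_join` and `generate_candidates` are assumptions. The double loop stands in for the 2D thread grid, and the pruning of candidates with infrequent subsets is left out.

```python
def can_join(first, second):
    """Textbook Apriori join test for two sorted itemsets of equal length."""
    return first[:-1] == second[:-1] and first[-1] < second[-1]


def generate_candidates(frequent_itemsets):
    """Join frequent k-itemsets into candidate (k+1)-itemsets."""
    itemsets = sorted(frequent_itemsets)
    count = len(itemsets)
    candidates = []
    # The (row, column) pair plays the role of a thread index in a 2D grid.
    for row in range(count):
        for column in range(count):
            if column <= row:
                # This "thread" has no useful pair to test and quits at once.
                continue
            first, second = itemsets[row], itemsets[column]
            if can_join(first, second):
                candidates.append(first + (second[-1],))
    return candidates
```

Only the pairs above the diagonal of the grid do real work. The remaining threads exit immediately, which matches the early-quit idea of the scalable thread scheduling scheme.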

Liheng Jian, Ying Liu ([email protected]), Shenshen Liang

School of Information Science and Engineering, Graduate University of Chinese Academy of Sciences, Beijing, China

Acknowledgements

This work has been supported by the 2009 NVIDIA Professor Partnership.

References 

[1] Liu Y, Pisharath J, Liao WK, Memik G, et al. (2004) Performance evaluation and characterization of scalable data mining algorithms. 16th IASTED International Conference on Parallel and Distributed Computing and Systems, pp. 620-625, MIT Cambridge, Massachusetts, USA.

[2] Garcia V, Debreuve E, Barlaud M. (2008) Fast k nearest neighbor search using GPU. IEEE Conference on Computer Vision and Pattern Recognition Workshops, Vol. 1-3, pp. 1107-1112.

[3] Fang WB, Lau KK, Lu M, et al. (2008) Parallel data mining on graphics processors. Technical Report HKUST-CS08-07, http://code.google.com/p/gpuminer/.

[4] Jian LH, Wang C, Liu Y, Liang SS, et al. (2011) Parallel data mining techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA). Journal of Supercomputing, DOI: 10.1007/s11227-011-0672-7.

Scalable Threads Scheduling Scheme for Irregular Pattern

Problem: In an irregular pattern, the size of a problem changes dynamically during processing, which poses a challenge for the CUDA computing model.

Solution:

1) Estimate the upper bound on the number of threads or thread blocks, then allocate the GPU resources accordingly.

2) Let a thread or thread block quit immediately if it is determined to be useless when processing begins.

Compared with restarting a GPU kernel with an updated size, this improves overall performance at the cost of some computation resources, since launching or exiting a thread block incurs trivial cost.
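The upper-bound-plus-early-exit idea can be modeled with a short Python sketch. It is a sequential stand-in for a single kernel launch, not CUDA code: `launch_with_upper_bound`, `block_size`, and the returned idle-thread count are hypothetical names introduced for illustration, and the loops emulate blocks and threads that would run in parallel on the GPU.

```python
def launch_with_upper_bound(work_items, upper_bound, block_size, task):
    """Emulate one launch sized for upper_bound with early-exiting threads."""
    if len(work_items) > upper_bound:
        raise ValueError("actual problem size exceeds the estimated bound")
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    block_count = (upper_bound + block_size - 1) // block_size
    results = [None] * len(work_items)
    idle_threads = 0

    for block in range(block_count):
        first_thread = block * block_size
        if first_thread >= len(work_items):
            # The whole block is useless for this problem size and quits.
            idle_threads += block_size
            continue
        for lane in range(block_size):
            thread_id = first_thread + lane
            if thread_id >= len(work_items):
                # A single useless thread quits immediately.
                idle_threads += 1
                continue
            results[thread_id] = task(work_items[thread_id])
    return results, idle_threads
```

With five work items, an upper bound of 16, and blocks of four threads, the sketch launches four blocks and reports eleven threads that exit without doing work. That wasted capacity is the price paid for never relaunching the kernel with a new size.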

The serial Apriori in NU-MineBench [1], Fast-KNN [2], and K-meansBitmap in GPUMiner [3] serve as baselines for our corresponding algorithms. We observe speedups of 13.5 times, 8.31 times, and 5 times, respectively, as shown in the figures. See [4] for more detail.