
CS 240A - Parallel Implementation of K Means Clustering on CUDA

Lan Liu, Pritha D N

December 9, 2016

Abstract

K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of the K-Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering.

1 K-Means Clustering

In this section, we provide an overview of K-Means clustering: the mathematical description and the sequential algorithm are presented, and the complexity of the sequential code is analyzed.

1.1 Description

K-Means clustering is one of the most widely used clustering methods in data mining. It aims to partition N given data points into K clusters so that each cluster is as homogeneous as possible; namely, each data point belongs to the cluster with the nearest mean.

Assume there are N data points in d dimensions, $(x_1, x_2, \ldots, x_N) \subset \mathbb{R}^d$, and we need to classify them into K clusters $S = (S_1, S_2, \ldots, S_K)$, where K is generally fixed a priori. Define the center of each cluster $S_i$ as the mean of all data points $x \in S_i$, i.e.

$$\mu_i \triangleq \frac{\sum_{x \in S_i} x}{|S_i|},$$

where $|S_i|$ denotes the size of $S_i$.

The goal of clustering is to minimize the total Euclidean distance from each data point to its cluster center, i.e., find the clustering S that minimizes the objective function:

$$\mathrm{cost}(S) = \sum_{i=1}^{K} \sum_{x \in S_i} \| x - \mu_i \|^2$$

Finding the global minimum of the objective function is computationally challenging (NP-hard). The commonly used algorithm is a heuristic that finds a local minimum instead of the global minimum; the trade-off is a much cheaper computational cost. The commonly used k-means algorithm is as follows:

Step 1: Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^d$ randomly.

Step 2: Repeat until convergence: {

Assignment step: to assign each data point $x_i$ to the nearest cluster, define $c_i$ as the membership index of $x_i$:

$$c_i := \operatorname*{argmin}_{j=1,\ldots,K} \| x_i - \mu_j \|^2$$

Update centroids step: for each j, update

$$\mu_j := \frac{\sum_{i=1}^{N} \mathbf{1}_{\{c_i = j\}}\, x_i}{\sum_{i=1}^{N} \mathbf{1}_{\{c_i = j\}}}$$

}

Since the clustering problem is generally not convex, there may be many local minima. For any fixed initial condition, it can easily be proved that cost(S) decreases at every iteration step, and thus the


algorithm converges to a unique local minimum depending on the given initial condition. For example, set x = (2, 4, 5, 6). If the initial centers are µ1 = 2, µ2 = 5, then S1 = (2), S2 = (4, 5, 6); if the initial centers are µ1 = 4, µ2 = 5, then S1 = (2, 4), S2 = (5, 6). The first is the global minimum with cost = 2, and the second is a saddle with cost = 2.5.
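Spelling out the cost arithmetic for this example (our verification of the figures above): in the first case the converged centers are 2 and 5, and in the second case they are 3 and 5.5, so

$$\mathrm{cost}_1 = 0 + (4-5)^2 + (5-5)^2 + (6-5)^2 = 2,$$

$$\mathrm{cost}_2 = (2-3)^2 + (4-3)^2 + (5-5.5)^2 + (6-5.5)^2 = 2.5.$$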

Considering this, one might generate several results using random initial centers and choose the best one, i.e., the one with the smallest objective function. This also drives us to apply parallel computing to save running time.

1.2 Algorithm

Based on the description, we apply the following heuristic K-Means clustering sequential algorithm, which is based on Professor Wei-keng Liao's k-means clustering code [1]. We have modified how the code recomputes the new cluster centers. The original code sums all data points in each new cluster to compute its center in every iteration step. Considering that only a portion of the data points change membership (fewer and fewer as the iteration proceeds), we instead handle only the changing data points, adding each one to its new cluster and removing it from its old cluster. This is more efficient, and our results verify it.

Step 1: Pick the first K data points as initial cluster centers.
Step 2: Assign each data point to the nearest cluster.
Step 3: For each reassigned data point, increase the membership-change counter by 1.
Step 4: Set the center of each cluster to the mean of all data points belonging to that cluster.
Step 5: Repeat steps 2-4 until convergence.

The pseudocode is as follows. Let N be the number of data points and K the number of clusters.

data[N]: the array of data objects
center[K]: the array of cluster centers
membership[N]: the array of data point memberships
clustersum[K]: the sum of the data points in the k-th cluster
clustersize[K]: the size of the k-th cluster
δ: the number of membership changes
threshold: critical value for the stop condition; we set it to 0.001

for i from 0 to K-1
    center[i] ← data[i]
do {
    δ ← 0
    for i from 0 to N-1
        mindis ← ||data[i] - center[0]||
        index ← 0
        for j from 1 to K-1
            distance ← ||data[i] - center[j]||
            if distance < mindis
                mindis ← distance
                index ← j
        if first iteration
            δ ← N
            membership[i] ← index
            clustersize[index] ← clustersize[index] + 1
            clustersum[index] ← clustersum[index] + data[i]
        else if membership[i] ≠ index
            δ ← δ + 1
            clustersize[index] ← clustersize[index] + 1
            clustersize[membership[i]] ← clustersize[membership[i]] - 1
            clustersum[index] ← clustersum[index] + data[i]
            clustersum[membership[i]] ← clustersum[membership[i]] - data[i]
            membership[i] ← index
    for j from 0 to K-1
        center[j] ← clustersum[j] / clustersize[j]
} while (δ/N > threshold)
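To make the incremental center update concrete, here is a minimal C sketch of the membership-change branch. The function and variable names mirror the pseudocode above but are otherwise our own illustrative assumptions, not verbatim project code:

/* Move data point i from its old cluster to cluster `index`,
 * updating the running sums and sizes incrementally instead of
 * recomputing every center from scratch. D is the data dimension. */
void move_point(const float *data_i, int i, int index, int D,
                int *membership, float **clustersum, int *clustersize,
                int *delta)
{
    int old = membership[i];
    (*delta)++;                            /* one more membership change */
    clustersize[index]++;
    clustersize[old]--;
    for (int d = 0; d < D; d++) {
        clustersum[index][d] += data_i[d]; /* add to the new cluster     */
        clustersum[old][d]   -= data_i[d]; /* remove from the old one    */
    }
    membership[i] = index;
}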


Note: the stop condition is δ/N < threshold, i.e., the number of membership changes is below 1‰ of all data points.

Complexity Analysis: The sequential code has complexity O(TKND), where T is the number of iterations, K is the number of clusters, N is the number of data points, and D is the dimension of each data point. There are two main parts in the code. The first part reassigns each data point to the nearest center; this requires computing the distance between each data point and each cluster center, so the complexity is O(NKD) per iteration step. The second part computes the center of each new cluster after the reassignment; it essentially requires computing K group means over N data points, and the complexity is O((N + K) * D) per iteration step. Clearly, part 1 is the dominant, time-consuming part, and since part 1 is an independent process for each data point, this inspires us to parallelize part 1. The platform for the parallel code could be MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet because of the high efficiency of GPUs in processing large-scale data.
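To put these costs in perspective (our arithmetic, using the Experiment 1 sizes from Section 3: N = 51200, K = 128, D = 1000):

$$N \cdot K \cdot D = 51200 \times 128 \times 1000 \approx 6.6 \times 10^9$$

distance terms per iteration for part 1, versus only about $(N + K) \cdot D \approx 5.1 \times 10^7$ operations for part 2.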

2 Parallelization of K-Means Using CUDA

In this section, we first introduce CUDA and the GPU nodes on Comet, then discuss parallelization strategies and the CUDA implementation on Comet, and finally use the CUDA Occupancy Calculator to determine the optimal number of threads per block.

2.1 CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of real-time, high-resolution 3D graphics compute-intensive tasks. GPUs have evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. In the computer game industry, GPUs are used for graphics rendering and for game physics calculations (physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more.

CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line.

2.1.1 NVIDIA GPUs on Comet

The Comet infrastructure provides 36 GPU nodes, each containing NVIDIA K80 GPUs. This series is popularly called the Tesla GPU series by NVIDIA.

GPUs: 2 NVIDIA K80
Cores per socket: 12
Sockets: 2
Clock speed: 2.5 GHz
Memory capacity: 128 GB DDR4 DRAM
Memory bandwidth: 120 GB/s
Flash memory: 320 GB

Table 1: GPU node in Comet

Figure 1: Left: how the CPU processes data; right: how the GPU processes data. On the CPU, each of p cores processes n/p data points; on the GPU, each thread gets access to one single data point.


Nvidia Tesla is Nvidia's brand name for their products targeting stream processing and/or general-purpose GPU computing.

Tesla is Nvidia's first microarchitecture to implement unified shaders. It was used with the GeForce 8 Series, GeForce 9 Series, GeForce 100 Series, GeForce 200 Series, and GeForce 300 Series of GPUs, manufactured in 90 nm, 80 nm, 65 nm, and 55 nm. It also found use in the GeForce 405, and in the workstation market in the Quadro FX, Quadro x000, Quadro NVS series, and Nvidia Tesla computing modules.

With their very high computational power (measured in floating point operations per second, or FLOPS) compared to microprocessors, the Tesla products target the high performance computing market.

Physical Limits for GPU Compute Capability=3.7 (Tesla X80) are in Table 2.

Threads per Warp: 32
Max Warps per Multiprocessor: 64
Max Thread Blocks per Multiprocessor: 16
Max Threads per Multiprocessor: 2048
Maximum Thread Block Size: 1024
Registers per Multiprocessor: 131072
Max Registers per Thread Block: 65536
Max Registers per Thread: 255
Shared Memory per Multiprocessor (bytes): 114688
Max Shared Memory per Block: 49152
Register allocation unit size: 256
Register allocation granularity: warp
Shared Memory allocation unit size: 256
Warp allocation granularity: 4

Table 2: Physical limits for GPU Compute Capability=3.7 (Tesla X80)

Figure 2: Thread Organization in CUDA

2.2 Parallelization of K-Means Clustering on CUDA

CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU node has access to thousands of threads, and each thread processes one single data point. The threads are grouped into blocks, and shared memory is restricted to each block. HOST and DEVICE do not share memory; under this configuration, we have to manually communicate messages between HOST and DEVICE.

As explained at the end of Section 1.2, we aim to parallelize the reassignment step, which computes the distance between each data point and each cluster center. The logic and order of the parallel algorithm are exactly the same as in the original sequential algorithm, but we have to take into account the communication between HOST and DEVICE:

Step 0: The HOST initializes the cluster centers and copies the N data coordinates to the DEVICE.
Step 1: The DEVICE copies the data memberships and the K cluster centers from the HOST.
Step 2: On the DEVICE, each thread processes a single data point: it computes the distance from that point to each cluster center and updates the point's membership, where tid = blockDim.x * blockIdx.x + threadIdx.x.
Step 3: The HOST copies the new data memberships from the DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence, then go to step 5.
Step 5: The HOST frees the allocated memory.

A minimal host-side sketch of this loop follows.
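The following C fragment sketches steps 1-4 on the HOST. The h_/d_ prefixes (HOST/DEVICE buffers) and the other variable names are assumptions for illustration, not verbatim project code; find_nearest_cluster is the kernel sketched at the end of this subsection.

do {
    /* Step 1: send the current cluster centers to the DEVICE. */
    cudaMemcpy(d_centers, h_centers, K * D * sizeof(float),
               cudaMemcpyHostToDevice);

    /* Step 2: one thread per data point finds its nearest center. */
    find_nearest_cluster<<<numBlocks, threadsPerBlock>>>(
        D, N, K, d_data, d_centers, d_membership, d_changed);
    cudaDeviceSynchronize();

    /* Step 3: copy memberships back; recompute centers on the HOST. */
    cudaMemcpy(h_membership, d_membership, N * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(h_changed, d_changed, N * sizeof(int),
               cudaMemcpyDeviceToHost);
    delta = 0;
    for (int i = 0; i < N; i++)
        delta += h_changed[i];            /* total membership changes */
    /* ...incremental center update as in Section 1.2... */
} while ((float)delta / N > threshold);   /* Step 4: loop until done  */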

There are several crucial points in the parallel code. First, it is easier for GPU threads to handle a 1D array than a 2D array, so we convert the N data points (in D dimensions) from a 2D array to a 1D array on the HOST and then send it to the DEVICE, i.e., DEVICEdata[i*numCoordinates+j] = HOSTdata[i][j] (the j-th coordinate of the i-th data point); the flattening is sketched below. Secondly, since different blocks do not share memory, we have to reduce the per-block numbers of membership changes to compute the total number of membership changes.
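A minimal sketch of the flattening and transfer (hostdata, d_data, and the other names are assumptions for illustration):

/* Flatten the 2D HOST array into a 1D buffer and copy it to the DEVICE. */
float *flat = (float *)malloc(N * D * sizeof(float));
for (int i = 0; i < N; i++)
    for (int j = 0; j < D; j++)
        flat[i * D + j] = hostdata[i][j];  /* j-th coord of i-th point */

float *d_data;
cudaMalloc((void **)&d_data, N * D * sizeof(float));
cudaMemcpy(d_data, flat, N * D * sizeof(float), cudaMemcpyHostToDevice);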

In our implementation, we set:

numThreadsPerClusterBlock = 128
numClusterBlocks = (N + numThreadsPerClusterBlock − 1) / numThreadsPerClusterBlock

The correctness of the parallel algorithm is measured by whether it produces the same clustering as the original sequential k-means algorithm. Our implementation performs the same steps as the sequential code in parallel without changing the logic, thus the correctness is expected.
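For concreteness, here is a minimal sketch of the nearest-cluster kernel. The name and signature match the profiling output in Section 3.1 (find_nearest_cluster(int, int, int, float*, float*, int*, int*)), but the parameter names and body are our illustrative reconstruction, not verbatim project code. With the block-level shared-memory reduction disabled (BLOCK_SHARED_MEM_OPTIMIZATION=0, as in the compile line in Section 2.3.1), each thread writes its own changed flag and the HOST sums them.

/* One thread per data point: find the nearest of numClusters centers
 * and record whether this point's membership changed. Arrays are
 * flattened as data[i * numCoords + d]. */
__global__ void find_nearest_cluster(int numCoords, int numObjs,
                                     int numClusters,
                                     float *data,            /* [numObjs * numCoords]     */
                                     float *centers,         /* [numClusters * numCoords] */
                                     int *membership,        /* [numObjs]                 */
                                     int *membershipChanged) /* [numObjs]                 */
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= numObjs) return;

    int index = 0;
    float minDist = 0.0f;
    for (int j = 0; j < numClusters; j++) {
        float dist = 0.0f;
        for (int d = 0; d < numCoords; d++) {
            float diff = data[tid * numCoords + d] - centers[j * numCoords + d];
            dist += diff * diff;   /* squared distance suffices for argmin */
        }
        if (j == 0 || dist < minDist) {
            minDist = dist;
            index = j;
        }
    }
    membershipChanged[tid] = (membership[tid] != index);
    membership[tid] = index;
}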

2.3 Determining the Optimal Number of Threads per Block

We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of threads per block.

2.3.1 Code Analysis

To determine the resource usage of each CUDA thread in our nearest-cluster kernel, we compiled our code with nvcc's --ptxas-options=-v flag. The following is the output of the compilation:

nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v -o cuda_kmeans.o -c cuda_kmeans.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_' for 'sm_20'
ptxas info : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 80 bytes cmem[0]

2.3.2 CUDA Occupancy Calculator

The CUDA Occupancy Calculator [3] allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail.

Maximizing the occupancy can help to cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. The GPU Occupancy Calculator can assist in choosing thread block size based on shared memory and register requirements.

For any input size, the shared memory used by our program is zero. We use the CUDA Occupancy Calculator with 128 threads per block. From the disassembly of the code, we find that our kernel function requires 18 registers. We provide the Compute Capability (3.7; GK210, X80), the number of threads per block (128), the shared memory size (112 KB for 3.7), and the number of registers required per thread (18) as input to the occupancy calculator. We indicate the results in the figures included. 128 threads per block gave the maximum occupancy for our program.
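As a back-of-the-envelope check of this result (our arithmetic, using the Table 2 limits): with 128 threads per block, a multiprocessor can host at most 2048 / 128 = 16 blocks, exactly its 16-block limit; 18 registers per thread costs roughly 18 × 128 ≈ 2.3K registers per block (somewhat more after warp-granularity rounding), so 16 blocks stay well under the 131072 registers available, and zero shared memory imposes no further limit. Hence

$$16 \text{ blocks} \times \frac{128}{32} \text{ warps per block} = 64 \text{ warps} = \text{the per-SM maximum},$$

i.e., 100% theoretical occupancy.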


Figure 3: Input to CUDA Occupancy Calculator

Figure 4: Output of CUDA Occupancy Calculator

Figure 5: Impact of varying Block Size

Figure 6: Impact of varying Register Count Per Thread

Figure 7: Impact of varying Shared Memory Usage Per Block


3 Parallel Performance Analysis

In this section, we present several experimental results to illustrate the parallel performance.

1. Experiment 1: Vary the size of the data set N, with the number of clusters fixed at K=128 and data dimension 1000. The performance results are in Table 3 and Figure 8. For fixed K=128, as the size N of the data set increases, the parallel speedup is around 40 and increases gradually. The GPU memory capacity on Comet is 11 GB, and we tested an 8 GB data set (2048000*1000): the parallel running time is 6175 sec (about 1.7 hours), while the sequential code is too slow to time, and we expect it to take around 3 days. The parallel code greatly outperforms the sequential code.

Size (float values)   Sequential (sec)   Parallel (sec)   Speedup
51200*1000            463.09             11.73            39.5
76800*1000            857.73             19.25            44.55
89600*1000            1182.43            24.82            47.6
115200*1000           1676.96            35.22            47.6
128000*1000           1794.91            41.23            43.53
512000*1000           >4 hrs             405.56           NA
2048000*1000          -                  6174.72          -

Table 3: Experiment 1. Parallel performance when varying data size, with K=128 fixed

Figure 8: Experiment 1. Parallel speedup versus size of data set, K=128

2. Experiment 2: For fixed size 51200*1000, vary the number of clusters, using δ/N < 0.001 as the stop condition. We tested K = 4, 16, 64, 128, 256, 512, 1024, 2048; the results are in Table 4 and Figures 9 and 10.

Figure 9 shows that the parallel speedup keeps increasing as K increases, while the slope of the curve decreases. This matches our expectation: per iteration, the computational cost of part 1 of the sequential code is O(N*D*K) and that of part 2 is O((N+K)*D), and in the parallel code only part 1 has been parallelized. After running T iterations, the speedup is

$$\frac{t_1}{t_p} = \frac{O(NDKT) + O((N+K)DT)}{\underbrace{O(NDKT)}_{\text{parallelized}} + O((N+K)DT)},$$

where the first term in the denominator is the part reduced by the GPU parallelization. As K gets larger, since N ≫ K, D > K, and N, D are both fixed, the time consumed by part 2 stays steady, and the larger K is, the more speedup is earned from part 1; thus the overall speedup increases. Figure 10 shows that as K increases, for fixed N, the number of iterations needed to converge decreases, which drives the parallel running time down even as K increases at the first stage.


K (num of clusters)   Iterations   Sequential (sec)   Parallel (sec)   Speedup
4                     71           63.81              18.04            3.54
16                    51           155.64             15.06            10.33
64                    29           338.55             10.51            32.20
128                   20           463.38             11.06            41.88
256                   16           739.15             12.30            60.11
512                   12           1105.98            13.14            84.14
1024                  10           1842.00            18.98            97.04
2048                  6            2207.78            21.19            104.17

Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size

Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size

Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for fixed data size


3. Experiment 3: For fixed size 51200*1000, vary the number of clusters with the number of iterations fixed at 30. We tested K = 4, 8, 16, 32, 64, 128, 256, 512. The results are in Table 5 and Figures 11 and 12. As in Experiment 2, we get an outstanding speedup, and the speedup goes up as K goes up, as shown in Figure 11. Experiment 3 reveals another scaling fact about the code: in Figure 12 we plot log2(K) against log2(running time). For the sequential case the plot is very close to a straight line with slope 1, which coincides with the fact that as K doubles, the sequential running time doubles, considering the complexity O(NDKT) + O((N+K)DT) (see the short derivation after Figure 12). For the parallel case it is a curve with increasing slope that is always less than 1, and the fitted slope is 0.28. This implies that the parallelism grows as K grows, gaining more speedup.

K (num of clusters)   Sequential (sec)   Parallel (sec)   Speedup
4                     27.82              7.72             3.60
8                     50.03              7.89             6.34
16                    94.50              8.33             11.34
32                    183.56             9.50             19.32
64                    361.92             12.31            29.40
128                   718.31             15.91            45.15
256                   1430.74            21.76            65.76
512                   2858.05            32.47            88.01

Table 5: Experiment 3. Performance when varying K, with the number of iterations fixed at 30

Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations

Figure 12: Experiment 3. Rate of growth of the sequential and parallel implementations
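One way to see the slope-1 behavior (our restatement of the claim above): with N, D, T fixed, the sequential time is approximately proportional to K,

$$t_{\mathrm{seq}}(K) \approx c \cdot NDKT \;\Rightarrow\; \log_2 t_{\mathrm{seq}} \approx \log_2 K + \mathrm{const},$$

so doubling K adds 1 to log2(t), giving a line of slope 1. In the parallel code, the per-iteration costs that were not parallelized do not grow with K at the same rate, so at small K they dominate the running time and the log-log slope stays below 1.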


3.1 Profiling

nvprof [2] presents an overview of the GPU kernels and memory copies in our program. The summary mode groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and, in the case of API-Trace mode, all CUDA API calls.

We perform four cudaMalloc calls: one for the input 2D data, one for the 2D cluster data, one for the 1D membership array, and one for the 1D membershipChanged array. In every iteration, we perform two Device-to-Host copies (membership and membershipChanged) and one Host-to-Device copy (the new cluster centers).

==180922== NVPROF is profiling process 180922, command: ./test_driver
==180922== Profiling application: ./test_driver
N = 51200
dimension = 1000
k = 128
threshold = 0.0010
Type: Parallel
Computation timing = 11.8963 sec
Loop iterations = 21

==180922== Profiling result:
Time(%)   Time       Calls   Avg        Min        Max        Name
98.99%    4.11982s   21      196.18ms   195.58ms   197.96ms   find_nearest_cluster(int, int, int, float*, float*, int*, int*)
 0.98%    40.635ms   23      1.7668ms   30.624us   39.102ms   [CUDA memcpy HtoD]
 0.03%    1.2578ms   42      29.946us   28.735us   31.104us   [CUDA memcpy DtoH]

==180922== API calls:
Time(%)   Time       Calls   Avg        Min        Max        Name
93.06%    4.12058s   21      196.22ms   195.62ms   198.00ms   cudaDeviceSynchronize
 5.79%    256.47ms   4       64.117ms   4.9510us   255.97ms   cudaMalloc
 1.02%    45.072ms   65      693.42us   82.267us   39.230ms   cudaMemcpy
 0.06%    2.6048ms   332     7.8450us   528ns      282.45us   cuDeviceGetAttribute
 0.03%    1.3272ms   4       331.79us   297.63us   344.04us   cuDeviceTotalMem
 0.02%    694.81us   21      33.086us   29.098us   47.856us   cudaLaunch
 0.01%    505.33us   3       168.44us   7.3170us   330.65us   cudaFree
 0.01%    237.02us   4       59.255us   56.222us   67.280us   cuDeviceGetName
 0.00%    119.83us   147     815ns      533ns      12.646us   cudaSetupArgument
 0.00%    28.884us   21      1.3750us   1.1180us   1.7890us   cudaConfigureCall
 0.00%    16.784us   21      799ns      740ns      922ns      cudaGetLastError
 0.00%    4.7380us   8       592ns      532ns      788ns      cuDeviceGet
 0.00%    3.9500us   2       1.9750us   895ns      3.0550us   cuDeviceGetCount

4 Conclusion and Future Work

Our analysis shows that we obtain a significant speedup (45X on average) over the sequential execution of K-Means clustering. In our project, we only parallelized the method that computes the nearest cluster. We optimized the calculation of the new cluster centers by adding each data point that changed membership to its new cluster group and subtracting it from its old cluster. Compared to recalculating cluster centers from scratch, this approach saved significant running time, although due to shortage of time we were not able to quantify the additional speedup. There is definitely scope for increased speedup (when the input dimension increases) if we parallelize the new cluster center calculation on CUDA as well.


References

[1] K-Means Algorithm: http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html, Wei-keng Liao, Northwestern University, 2005

[2] NVPROF : http://docs.nvidia.com/cuda/profiler-users-guide/#axzz4SHzfjCkf

[3] CUDA Occupancy Calculator : https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/

[4] Understanding CUDA : https://courses.engr.illinois.edu/ece498al/Syllabus.html
