Parallel K-means Clustering using CUDA
Lan Liu, Pritha D N
12/06/2016

Page 1: Parallel K means clustering using CUDA

Parallel K-means Clustering using CUDA

Lan Liu, Pritha D N
12/06/2016

Page 2: Parallel K means clustering using CUDA

Outline

● Performance Analysis

● Overview of CUDA

● Future Work


Page 3: Parallel K means clustering using CUDA

Review of Parallelization

● Complexity of the sequential K-means algorithm: O(N*D*K*T)
  N: number of data points. D: number of dimensions. K: number of clusters. T: number of iterations.
● Complexity of each iteration step:
  Part 1: for each data point, compute the distance to the K cluster centers and assign the point to the nearest one. O(N*D*K). Parallelized on CUDA (SIMD: single instruction, multiple data).
  Part 2: compute each new center as the mean of its cluster's data points. O((N+K)*D) --> O((2*delta+K)*D), where delta is the number of membership changes.

Page 4: Parallel K means clustering using CUDA

Modify Part 2

● Complexity of the sequential K-means algorithm: O(N*D*K*T)
  N: number of data points. D: number of dimensions. K: number of clusters. T: number of iterations.
● Complexity of each iteration step:
  Part 1: compute the distance to the K cluster centers and assign each data point to the nearest one. O(N*D*K).
  Part 2: compute each new center as the mean of its cluster's data points. O(N+K) --> O(2*delta+K), where delta is the number of membership changes.

Page 5: Parallel K means clustering using CUDA

Performance Analysis

Experiment 1: set K=128, D=1000, varying the data size N.

Page 6: Parallel K means clustering using CUDA

Performance Analysis

Experiment 2: set N=51200, D=1000, varying the number of clusters K.

Speedup (N >> K, D > K)

Page 7: Parallel K means clustering using CUDA

Performance Analysis

Experiment 3: set N=51200, D=1000, T=30, varying the number of clusters K.

Page 8: Parallel K means clustering using CUDA

Performance Analysis

Experiment 3, continued.

Slope ~ 1: when K doubles, the running time doubles.

Page 9: Parallel K means clustering using CUDA

CUDA - Execution of a CUDA program


find_nearest_cluster

Page 10: Parallel K means clustering using CUDA

CUDA - Memory Organisation


Page 11: Parallel K means clustering using CUDA

CUDA - Thread Organization

Execution resources are organized into Streaming Multiprocessors (SMs).

Blocks are assigned to Streaming Multiprocessors in arbitrary order. Blocks are further partitioned into warps. An SM executes only one of its resident warps at a time; the goal is to keep each SM maximally occupied.

Threads per Warp: 32
Max Warps per Multiprocessor: 64
Max Thread Blocks per Multiprocessor: 16

Page 12: Parallel K means clustering using CUDA

NVProf

GPU activities:
Time(%)   Time       Calls  Avg       Min       Max       Name
98.99%    4.11982s   21     196.18ms  195.58ms  197.96ms  find_nearest_cluster(int, int, int, float*, float*, int*, int*)
0.98%     40.635ms   23     1.7668ms  30.624us  39.102ms  [CUDA memcpy HtoD]
0.03%     1.2578ms   42     29.946us  28.735us  31.104us  [CUDA memcpy DtoH]

API calls:
Time(%)   Time       Calls  Avg       Min       Max       Name
93.06%    4.12058s   21     196.22ms  195.62ms  198.00ms  cudaDeviceSynchronize
5.79%     256.47ms   4      64.117ms  4.9510us  255.97ms  cudaMalloc
1.02%     45.072ms   65     693.42us  82.267us  39.230ms  cudaMemcpy

N = 51200, Dimension = 1000, K = 128, Loop iterations = 21

Page 13: Parallel K means clustering using CUDA

Future Work

1. Compare the time taken with implementations in OpenMP, MPI, and standard libraries (scikit-learn, MATLAB, etc.).

2. Apply the Map-Reduce methodology.

3. Efficiently parallelize Part 2.

Page 14: Parallel K means clustering using CUDA

Thank You! Questions?

Page 15: Parallel K means clustering using CUDA

Introduction to K-means Clustering

● Clustering algorithm used in data mining.

● Aims to partition N data points into K clusters, where each data point belongs to the cluster with the nearest mean.

● Objective function:

  J = Σ_{i=1}^{K} Σ_{x ∈ S_i} ||x - c_i||²

  K: number of clusters. c_i: center/mean of cluster i. |S_i|: number of data points in cluster i.

Page 16: Parallel K means clustering using CUDA

Parallelization: CUDA C Implementation

Step 0: HOST initializes the cluster centers and copies the N data coordinates to the DEVICE.

Step 1: Copy the data membership and the K cluster centers from HOST to DEVICE.

Step 2: On the DEVICE, each thread processes a single data point: it computes the distances and updates the point's membership.

Step 3: Copy the new membership back to the HOST and recompute the cluster centers.

Step 4: Check for convergence; if not converged, go back to Step 1.

Step 5: The HOST frees the allocated memory at the end.

Page 17: Parallel K means clustering using CUDA

Sequential Algorithm Steps [1]

1. Pick the first K data points as the initial cluster centers.
2. Assign each data point to the nearest cluster.
3. For each reassigned data point, d = d + 1.
4. Set each new cluster center to the mean of all data points belonging to that cluster.
5. Repeat steps 2-4 until convergence.

➔ How to define convergence?
Stop condition: d/N < 0.001, where d is the number of inputs that changed membership and N is the total number of data points.

[1] Reference: http://users.eecs.northwestern.edu/~wkliao/Kmeans/

Page 18: Parallel K means clustering using CUDA

Sequential Algorithm

Page 19: Parallel K means clustering using CUDA

GPU-based: CUDA C

If a CPU has p cores and n data points, each core processes n/p points.

A GPU processes one element per thread (the number of threads is very large, ~1000 or more).

The GPU is more effective than the CPU when dealing with large blocks of data in parallel.

In CUDA C: Host --> CPU, Device --> GPU. The host launches a kernel that executes on the device.