Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu

2nd International Workshop on GPUs and Scientific Applications
Galveston Island, TX


TRANSCRIPT

Page 1: Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox

{tgunarat,ssalpiti,achauhan,gcf}@cs.indiana.edu
2nd International Workshop on GPUs and Scientific Applications

Galveston Island, TX

Page 2

Iterative Statistical Applications

• Consist of iterative computation and communication steps
• Growing set of applications
  – Clustering, data mining, machine learning & dimension reduction applications
  – Driven by the data deluge & emerging computational fields

[Diagram: Compute → Communication → Reduce/barrier → New Iteration]

Page 3

Iterative Statistical Applications

• Data intensive
• Large loop-invariant data
• Smaller loop-variant delta between iterations
  – The result of an iteration
  – Broadcast to all workers of the next iteration
• High ratio of memory accesses to floating-point operations

[Diagram: Compute → Communication → Reduce/barrier → New Iteration]

Page 4

Motivation

• Important set of applications
• Increasing power and availability of GPGPU computing
• Cloud computing
  – Iterative MapReduce technologies
  – GPGPU computing in clouds

(image from http://aws.amazon.com/ec2/)

Page 5

Motivation

• A sample bioinformatics pipeline:

[Diagram: Gene Sequences → Pairwise Alignment & Distance Calculation (O(N×N)) → Distance Matrix → Clustering (O(N×N)) → Cluster Indices; Distance Matrix → Multi-Dimensional Scaling (O(N×N)) → Coordinates → Visualization → 3D Plot]

http://salsahpc.indiana.edu/

Page 6

Overview

• Three iterative statistical kernels implemented using OpenCL
  – KMeans Clustering
  – Multi-Dimensional Scaling
  – PageRank
• Optimized by
  – Reusing loop-invariant data
  – Utilizing different memory levels
  – Rearranging data storage layouts
  – Dividing work between CPU and GPU

Page 7

OpenCL

• Cross-platform, vendor-neutral, open standard
  – GPGPU, multi-core CPU, FPGA…
• Supports parallel programming in heterogeneous environments
• Compute kernels
  – Based on C99
  – Basic unit of executable code
• Work items
  – Single element of the execution domain
  – Grouped into work groups
    • Communication & synchronization within work groups

Page 8

OpenCL Memory Hierarchy

[Diagram: each compute unit holds work items with private memory and shares one local memory; all compute units access global GPU memory and constant memory, with the host CPU above global memory.]

Page 9

Environment

• NVIDIA Tesla C1060
  – 240 scalar processors
  – 4 GB global memory
  – 102 GB/s peak memory bandwidth
  – 16 KB shared memory per 8 cores
  – CUDA compute capability 1.3
  – Peak performance
    • 933 GFLOPS single precision (with SF)
    • 622 GFLOPS single precision (MAD)
    • 77.7 GFLOPS double precision

Page 10

KMeans Clustering

• Partitions a given data set into disjoint clusters
• Each iteration
  – Cluster assignment step
  – Centroid update step
• Flops per work item: 3DM + M
  – D: number of dimensions
  – M: number of centroids

Page 11

Re-using loop-invariant data
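The caching idea can be sketched as host-side pseudocode: the large loop-invariant input is uploaded to the device once, outside the iteration loop, while only the small loop-variant delta (e.g. the centroids) is transferred each iteration. `Device` below is a hypothetical stand-in for an OpenCL context/buffer API, used only to count transfers; it is not a real library.

```python
class Device:
    """Toy stand-in for an OpenCL device: tracks buffer uploads."""
    def __init__(self):
        self.transfers = 0
        self.buffers = {}

    def upload(self, name, data):
        self.transfers += 1
        self.buffers[name] = list(data)

def run_iterations(device, data, variant, n_iter, step):
    device.upload("data", data)            # invariant data: once, outside the loop
    for _ in range(n_iter):
        device.upload("delta", variant)    # small loop-variant delta: every iteration
        variant = step(device.buffers["data"], variant)
    return variant

dev = Device()
# Illustrative "kernel": accumulate the sum of the cached data each iteration.
result = run_iterations(dev, [1, 2, 3], [0], 5,
                        lambda data, v: [v[0] + sum(data)])
```

Without the caching, the large buffer would be re-uploaded every iteration, so transfers would grow with both data size and iteration count instead of only the latter.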

Page 12

KMeansClustering Optimizations

• Naïve (with data reuse)

[Chart: performance (GFLOPS) vs. number of data points (256 to 256,000,000)]

Page 13

KMeansClustering Optimizations

• Data points copied to local memory

[Chart: performance (GFLOPS) vs. number of data points; series: Naïve (A), Data in Local Memory (B)]

Page 14

KMeansClustering Optimizations

• Cluster centroid points copied to local memory

[Chart: performance (GFLOPS) vs. number of data points; series: Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D)]

Page 15

KMeansClustering Optimizations

• Local memory data points in column-major order

[Chart: performance (GFLOPS) vs. number of data points; series: Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D), D + Local Data Points Column Major]
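The layout change behind the column-major variant can be shown with simple index arithmetic (an illustration, not the paper's kernel; names are made up). With N points of D dimensions stored column-major, element (point i, dimension k) lives at k*N + i, so consecutive work items reading the same dimension touch consecutive addresses, the access pattern that coalescing and local-memory banks favor; row-major storage strides by D instead.

```python
def row_major(i, k, dims):
    # point i, dimension k, points stored one after another
    return i * dims + k

def col_major(i, k, n_points):
    # all values of dimension k stored contiguously
    return k * n_points + i

N, D = 8, 3
# Addresses touched by work items 0..3 when all read dimension 0:
rm = [row_major(i, 0, D) for i in range(4)]   # strided by D
cm = [col_major(i, 0, N) for i in range(4)]   # consecutive
```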

Page 16

KMeansClustering Performance

• Varying number of clusters (centroids)

[Chart: performance (GFLOPS) vs. number of data points; centroid counts: 50, 100, 200, 300, 360]

Page 17

KMeansClustering Performance

• Varying number of dimensions

[Chart: speedup (GPU vs. single-core CPU) vs. number of data points; series: 4D-300, 2D-300, 4D-100, 2D-100]

Page 18

KMeansClustering Performance

• Increasing number of iterations

[Charts: performance (GFLOPS) and time per iteration (ms, log scale) vs. number of data points; series: 5, 10, 15, 20 iterations]

Page 19

KMeans Clustering Overhead

[Chart: time (ms, log scale) and overhead (%) vs. number of data points; series: Double Compute, Regular (Single Compute), Compute Only, Overhead]

Page 20

Multi-Dimensional Scaling

• Maps a data set in a high-dimensional space to a data set in a lower-dimensional space
• Uses an N×N dissimilarity matrix as the input
  – Output usually in 3D (N×3) or 2D (N×2) space
• Flops per work item: 8DN + 7N + 3D + 1
  – D: target dimension
  – N: number of data points
• SMACOF MDS algorithm

http://salsahpc.indiana.edu/
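One SMACOF iteration can be sketched in pure Python (a simplified sketch with uniform weights, not the paper's OpenCL kernel; the Guttman transform X(k+1) = (1/N) B(X(k)) X(k) is the standard uniform-weight form of the algorithm named above):

```python
import math

def smacof_iteration(X, delta):
    """One SMACOF update with uniform weights; delta is the N x N
    dissimilarity matrix, X the current N x d embedding."""
    N, d = len(X), len(X[0])
    # Pairwise distances in the current embedding.
    dist = [[math.dist(X[i], X[j]) for j in range(N)] for i in range(N)]
    # B(X): b_ij = -delta_ij / d_ij off the diagonal, rows sum to zero.
    B = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j and dist[i][j] > 0:
                B[i][j] = -delta[i][j] / dist[i][j]
        B[i][i] = -sum(B[i][j] for j in range(N) if j != i)
    # Guttman transform: X_new = (1/N) * B @ X.
    return [[sum(B[i][j] * X[j][k] for j in range(N)) / N
             for k in range(d)] for i in range(N)]

def stress(X, delta):
    """Raw stress: sum over pairs of (delta_ij - d_ij)^2."""
    N = len(X)
    return sum((delta[i][j] - math.dist(X[i], X[j])) ** 2
               for i in range(N) for j in range(i + 1, N))

# Toy example: dissimilarities of a unit right triangle, embedding
# started from a shrunken copy; one iteration lowers the stress.
delta = [[0.0, 1.0, 1.0],
         [1.0, 0.0, math.sqrt(2)],
         [1.0, math.sqrt(2), 0.0]]
X0 = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
X1 = smacof_iteration(X0, delta)
```

SMACOF guarantees the stress is non-increasing from one iteration to the next, which is why per-iteration GFLOPS is a meaningful metric for this kernel.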

Page 21

MDS Optimizations

• Re-using loop-invariant data

[Chart: speedup from caching vs. number of data points (N up to 20,000), 0% to 90%]

Page 22

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Chart: performance (GFLOPS) vs. number of data points (N up to 20,000)]

Page 23

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Chart: performance (GFLOPS) vs. number of data points (N); series: Naïve, Results in Shared Mem]

Page 24

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Chart: performance (GFLOPS) vs. number of data points (N); series: Naïve, Results in Shared Mem, X(k) in shared mem]

Page 25

MDS Optimizations

• Naïve (with loop-invariant data reuse)

[Chart: performance (GFLOPS) vs. number of data points (N); series: Naïve, Results in Shared Mem, X(k) in shared mem, Data Points Coalesced]

Page 26

MDS Performance

• Increasing number of iterations

[Charts: GPU speedup and performance (GFLOPS) vs. number of data points (N up to 20,000); series: 10, 25, 50, 100 iterations]

Page 27

MDS Overhead

[Chart: time (ms, log scale) and overhead (%) vs. number of data points (N = 64 to 20,064); series: Double Compute, Regular (Single Compute), Compute Only Time, Overhead]

Page 28

PageRank

• Analyzes linkage information to measure the relative importance of web pages
• Sparse matrix and vector multiplication
• Web graph
  – Very sparse
  – Power-law distribution
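One power-iteration step of PageRank is exactly a sparse matrix-vector multiply followed by a damping correction. The sketch below (an illustration, not the paper's kernel) stores the link matrix in Compressed Sparse Row form; the damping factor 0.85 and the 3-page cycle graph are illustrative assumptions.

```python
DAMPING = 0.85  # conventional value; an assumption, not from the slides

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in Compressed Sparse Row form."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

def pagerank_step(values, col_idx, row_ptr, rank):
    """One iteration: rank' = (1 - d)/n + d * (M @ rank)."""
    n = len(rank)
    y = csr_matvec(values, col_idx, row_ptr, rank)
    return [(1 - DAMPING) / n + DAMPING * s for s in y]

# Toy 3-page cycle 0 -> 1 -> 2 -> 0, column-stochastic link matrix in CSR.
values  = [1.0, 1.0, 1.0]
col_idx = [2, 0, 1]
row_ptr = [0, 1, 2, 3]
rank1 = pagerank_step(values, col_idx, row_ptr, [1 / 3] * 3)
```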

Page 29

Sparse Matrix Representations

• ELLPACK
• Compressed Sparse Row (CSR)

http://www.nvidia.com/docs/IO/66889/nvr-2008-004.pdf
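A small worked example of the ELLPACK layout (illustrative code, not the paper's): every row is padded to the nonzero count of the longest row, which yields the dense, regular layout that maps well to one-work-item-per-row GPU kernels, but wastes space when row lengths are skewed, as they are in a power-law web graph.

```python
def to_ellpack(dense, pad_col=0, pad_val=0.0):
    """Convert a dense matrix to ELLPACK (values, column-indices) arrays."""
    max_nnz = max(sum(1 for v in row if v != 0) for row in dense)
    values, cols = [], []
    for row in dense:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        nz += [(pad_col, pad_val)] * (max_nnz - len(nz))  # pad short rows
        cols.append([j for j, _ in nz])
        values.append([v for _, v in nz])
    return values, cols

A = [
    [5.0, 0.0, 0.0],
    [0.0, 2.0, 3.0],
    [0.0, 0.0, 1.0],
]
ell_values, ell_cols = to_ellpack(A)
```

CSR, by contrast, stores only the actual nonzeros plus row pointers, so it wastes no space but gives each row a different amount of work.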

Page 30

PageRank implementations

[Chart: time (ms) vs. number of iterations (10 to 150); series: CPU only; K(i)<4 in ELL, K(i)>=4 in CPU; K(i)<7 in ELL, K(i)>=7 in CPU; K(i)<16 in ELL, K(i)>=16 in CPU]
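The hybrid variants above split the matrix by nonzeros per row, K(i): short rows fit the regular ELLPACK layout and go to the GPU, while the few long power-law rows are left to the CPU. A minimal sketch of that partitioning (the thresholds 4, 7, 16 are from the chart; the function and sample row lengths are illustrative):

```python
def split_rows(row_nnz, threshold):
    """Partition row indices by nonzero count: short rows for the GPU
    (ELLPACK), long rows for the CPU (e.g. CSR)."""
    gpu_rows = [i for i, k in enumerate(row_nnz) if k < threshold]
    cpu_rows = [i for i, k in enumerate(row_nnz) if k >= threshold]
    return gpu_rows, cpu_rows

row_nnz = [1, 3, 250, 2, 18, 5]   # hypothetical nonzeros per row
gpu_rows, cpu_rows = split_rows(row_nnz, 7)
```

Keeping the long rows off the GPU bounds the ELLPACK padding width at the threshold, so the padded storage and per-row work stay regular.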

Page 31

Lessons

• Reusing loop-invariant data
• Leveraging local memory
• Optimizing data layout
• Sharing work between CPU & GPU

Page 32

OpenCL experience

• Flexible programming environment
• Support for work-group-level synchronization primitives
• Lack of debugging support
• Lack of dynamic memory allocation
• More a compilation target than a user programming environment?

Page 33

Future Work

• Extending kernels to distributed environments
• Comparing with CUDA implementations
• Exploring more aggressive CPU/GPU sharing
• Studying more application kernels
• Data reuse in the pipeline

Page 34

Acknowledgements

• This work was started as a class project for CSCI-B649: Parallel Architectures (Spring 2010) at the IU School of Informatics and Computing.

• Thilina was supported by National Institutes of Health grant 5 RC2 HG005806-02.

• We thank Sueng-Hee Bae, BingJing Zang, Li Hui and the Salsa group (http://salsahpc.indiana.edu/) for the algorithmic insights.

Page 35

Questions

Page 36

Thank You!

Page 38

KMeansClustering Optimizations

• Data in global memory coalesced

[Chart: performance (GFLOPS) vs. number of data points; series: Naïve (A), Data in Local Memory (B), Data & Centers in Local Mem (C), C + Data Coalescing (D)]