CO-CLUSTERING USING CUDA

Uploaded by crystal-parks on 17-Dec-2015

TRANSCRIPT

Page 1

CO-CLUSTERING USING CUDA

Page 2

Co-Clustering Explained

- Problem: a large binary matrix of samples (rows) and features (columns)
  - What samples should be grouped together? Why? What are the shared features?
- Co-clustering provides the "why" explicitly: a correlated sample/feature pair
  - Row cluster: s1 and s3 are in a group
  - Column cluster: the distinguishing features are 2, 3, and 5
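To make this concrete, here is a small made-up matrix in Python (the values are illustrative; the slide does not give the actual matrix) in which rows s1 and s3 form a row cluster whose distinguishing features are 2, 3, and 5:

```python
# Hypothetical 4x5 binary matrix: rows are samples s1..s4,
# columns are features 1..5 (values invented to match the example).
matrix = [
    [0, 1, 1, 0, 1],  # s1
    [1, 0, 0, 1, 0],  # s2
    [0, 1, 1, 0, 1],  # s3
    [1, 0, 0, 1, 0],  # s4
]

# Row cluster {s1, s3} pairs with column cluster {2, 3, 5}:
row_cluster = [0, 2]     # 0-based indices of s1 and s3
col_cluster = [1, 2, 4]  # 0-based indices of features 2, 3, 5

# The co-cluster is the explicit "why": every entry in the block is 1.
assert all(matrix[r][c] == 1 for r in row_cluster for c in col_cluster)
```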

Page 3

Co-Clustering - Details

- Uses Information-Theoretic Co-clustering, as parallelized for the Hadoop architecture in "DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining", Papadimitriou et al., ICDM 2008
- Partition the entire matrix into row groups and column groups; minimize the length of the encoding of the resulting partitioned matrix
- Competing code-length factors: the number of row and column groups vs. the homogeneity of the clusters
- Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic
- Repeat for columns, then rows again, until a local optimum is found
- Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)

Credit: Chakrabarti et al., KDD 2004
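As a rough illustration of the objective, here is a Python sketch of the encoding cost under a given partition. It is a simplification: only the data-encoding term is kept (each block costs its size times the binary entropy of its ones-density), and the model-description terms of the full code length are dropped; all names are made up.

```python
import math

def block_cost(n_ones, size):
    """Code length of a block of `size` bits containing `n_ones` ones:
    size * binary entropy of the ones-density (0 for pure blocks)."""
    if size == 0 or n_ones == 0 or n_ones == size:
        return 0.0
    p = n_ones / size
    return size * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def total_cost(matrix, row_asg, col_asg, n_rg, n_cg):
    """Sum of per-block code lengths under the current row/column partition."""
    ones = [[0] * n_cg for _ in range(n_rg)]
    sizes = [[0] * n_cg for _ in range(n_rg)]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            ones[row_asg[i]][col_asg[j]] += v
            sizes[row_asg[i]][col_asg[j]] += 1
    return sum(block_cost(ones[g][h], sizes[g][h])
               for g in range(n_rg) for h in range(n_cg))
```

Homogeneous blocks encode for free, so the heuristic's pressure toward homogeneous clusters falls out directly, while more groups would be penalized by the (omitted) model-description terms.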

Page 4

Implementation - Basics

- Initial matrix generation: CPU
- Initial random row/column group assignment: CPU
- Memory structures are very simple: arrays of ints

Page 5

Implementation – Stats step 1

Statistics calculations:
- Calculates a statistic for each row within each column group; the statistic is the number of 1's the row has in that column group
- Straightforward parallelization (each thread works on one row at a time), using global memory

(Figure: example matrix with column-group labels 2 3 1 3 2 and row-group labels 3 5 1 1 4; Stat(Row 3, ColumnGroup 3) = 1)
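The per-thread work of this kernel can be sketched sequentially in Python (illustrative names; each iteration of the outer loop corresponds to one thread's row):

```python
def row_stats(matrix, col_asg, n_cg):
    """For each row, count the ones falling in each column group.
    On the GPU, one thread computes one row's counts at a time."""
    stats = []
    for row in matrix:                 # one "thread" per row
        counts = [0] * n_cg
        for j, v in enumerate(row):
            if v:
                counts[col_asg[j]] += 1
        stats.append(counts)
    return stats
```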

Page 6

Room For Improvement

- Calculate the row statistics according to the histogram algorithm from the textbook:
  - Block the columns; assign one thread block to each column block
  - Compute shared-memory histograms within each block
  - Merge back to global memory when finished
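A sequential Python model of that blocked-histogram scheme might look like the following, with a per-block scratch array standing in for shared memory (the names and the blocking factor are illustrative):

```python
def blocked_row_stats(matrix, col_asg, n_cg, block_width):
    """Histogram-style variant: columns are split into blocks; each block
    (one thread block on the GPU) builds a partial per-row histogram in
    its own scratch space, then the partials are merged back - the
    shared-memory -> global-memory merge step."""
    n_cols = len(matrix[0])
    global_stats = [[0] * n_cg for _ in matrix]
    for start in range(0, n_cols, block_width):   # one "thread block" per column block
        partial = [[0] * n_cg for _ in matrix]    # stands in for shared memory
        for i, row in enumerate(matrix):
            for j in range(start, min(start + block_width, n_cols)):
                if row[j]:
                    partial[i][col_asg[j]] += 1
        for i in range(len(matrix)):              # merge back to "global memory"
            for g in range(n_cg):
                global_stats[i][g] += partial[i][g]
    return global_stats
```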

Page 7

Implementation – Stats step 2

- Calculates the cost for each row group of each column group; essentially a reduce on the per-row data
- Block the rows; assign each block of rows to a thread block
- Use shared memory and atomics to build a histogram of all rows in a given row group
- Merge the shared histogram with the global histogram for that row group
- Iterate over all row groups

(Figure: same example matrix, column-group labels 2 3 1 3 2 and row-group labels 3 5 1 1 4; Stat(RowGroup 1, ColumnGroup 3) = 2)
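Stripped of the GPU details, the reduce itself might be sketched as (illustrative names; `per_row` holds the step-1 per-row counts):

```python
def group_stats(per_row, row_asg, n_rg, n_cg):
    """Reduce per-row column-group counts into counts per
    (row group, column group) pair.  On the GPU, each thread block
    accumulates its rows with shared-memory atomics, then merges its
    partial histogram into the global one for that row group."""
    stats = [[0] * n_cg for _ in range(n_rg)]
    for i, counts in enumerate(per_row):
        for g, c in enumerate(counts):
            stats[row_asg[i]][g] += c
    return stats
```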

Page 8

Implementation – Row/Col Group Optimization

- For each row, find the optimal group it could belong to
- Parallelized straightforwardly: one row per thread, looping with a stride to cover all rows
- Each row's calculation goes through all row groups and determines the global cost of moving the row to that group
- Move all rows to their optimal groups, then recompute statistics
- Repeat for column groups; continue alternating row and column groupings until convergence
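One row's share of this search could be sketched as follows (illustrative names; `cost_fn` stands in for the global encoding cost, and the actual move is deferred exactly as described above):

```python
def best_group(i, matrix, row_asg, col_asg, n_rg, n_cg, cost_fn):
    """Score row i against every row group and return the group giving
    the lowest global cost.  On the GPU, one thread does this per row."""
    original = row_asg[i]
    best, best_cost = original, None
    for g in range(n_rg):
        row_asg[i] = g                     # tentatively move the row
        c = cost_fn(matrix, row_asg, col_asg, n_rg, n_cg)
        if best_cost is None or c < best_cost:
            best, best_cost = g, c
    row_asg[i] = original                  # moves are applied only after all rows are scored
    return best
```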

Page 9

Room For Improvement

- Parallelization could be more sophisticated: block the rows and compute the cost of each row joining each row group in parallel, using shared-memory atomics to identify the minimum cost
- In practice, this algorithm heavily favors a small number of row and column groups, so the gain from that parallelization would be small

Page 10

Implementation – Outer Loop

- After a local minimum is found, change the initial number of row and column groups and retry
- Change the number of row groups or the number of column groups, up or down, and continue changing in that direction until the cost fails to decrease
- Try both directions in both dimensions before stopping
- The outer loop is performed on the CPU
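A sequential sketch of that outer loop (illustrative names; `cost_at` stands in for running the inner co-clustering to convergence at the given group counts and returning its final cost):

```python
def search_group_counts(cost_at, start_rg, start_cg):
    """Hill-climb over (num_row_groups, num_col_groups): from the current
    best counts, keep moving in a direction (up or down, in either
    dimension) while the cost keeps decreasing; try both directions in
    both dimensions before stopping."""
    best = (start_rg, start_cg)
    best_cost = cost_at(*best)
    improved = True
    while improved:
        improved = False
        for d_rg, d_cg in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n_rg, n_cg = best[0] + d_rg, best[1] + d_cg
            while n_rg >= 1 and n_cg >= 1:
                c = cost_at(n_rg, n_cg)
                if c >= best_cost:
                    break                  # cost failed to decrease: stop this direction
                best, best_cost = (n_rg, n_cg), c
                improved = True
                n_rg, n_cg = n_rg + d_rg, n_cg + d_cg
    return best, best_cost
```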

Page 11

Room for Improvement

- The outer loop could parallelize the inner-loop actions over different GPUs, with each GPU exploring a different dimension and direction in parallel

Page 12

Implementation – CPU + Validation

- The CPU implementation performed all the steps described earlier, but sequentially
- Validation:
  - Used the CPU implementation of the statistics calculations to validate the GPU stats calculations
  - The CPU and GPU log implementations differ, so cost calculations were validated by allowing a tolerance of 5% between results
  - Did not have time to validate the overall algorithm or visualize its outputs to check whether the co-clusters produced were reasonable
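The tolerance check amounts to a relative comparison, e.g. (an illustrative sketch, not the project's actual validation code):

```python
def within_tolerance(cpu_cost, gpu_cost, tol=0.05):
    """Accept the GPU cost if it agrees with the CPU cost to within a
    relative tolerance (default 5%), absorbing differences between the
    CPU and GPU log implementations."""
    return abs(cpu_cost - gpu_cost) <= tol * abs(cpu_cost)
```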

Page 13

Timing Measurements

- CPU implementation: time measured via clock_t / CLOCKS_PER_SEC
- GPU implementation: time measured via CUDA events

Page 14

Development Lessons Learned

- CUDA and structured data is a bad idea: even structs of arrays are impossible to deal with
- Host-side pointer math on device pointers does not work
- The CUDA API has really unfriendly error messages; take care to do very, very little through that API
- __device__ variables declared globally must be passed to kernels; otherwise you get runtime errors
- You can malloc and free memory from device code as of CUDA 3.2 (the allocations come from the device's global-memory heap)

Page 15

Development Lessons Learned, Cont.

- Visual Studio CUDA integration leaves a lot to be desired:
  - Even with all optimizations removed, you still can't set breakpoints everywhere
  - Many variables show as freed
  - No in-IDE, real-time compile errors in the editor
- But Visual Studio does give nice auto-complete and go-to-definition navigation
- No CUDA linker => separate files must be directly #include'd

Page 16

Experiment - Environment

- Float.cs.drexel.edu
- CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
- GPU: 2 NVIDIA GeForce GTX 580 GPUs @ 1544 MHz

Page 17

Experiment - Description

- Sequential (CPU) and parallel (GPU) implementations tested on square matrices of order 100, 1000, and 10000; larger matrices caused memory problems
- GPU tested with varying block and thread counts:
  - Num blocks: 10, 100, 5000
  - Num threads: 10, 100, 1024 (max)
- Resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
- Row and column groupings are important in the calculation of matrix statistics, since rows and columns are blocked by these groupings

Page 18

Experiment Results

(Figure: "Speedup - 10 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis speedup from 10 to 80)

Page 19

Experiment Results

- For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order
- I would expect the optimal configuration to be when num_blocks ~= num_row_groups ~= num_col_groups
- Slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially

Page 20

Experiment - Results

(Figure: "Speedup - 100 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis speedup from 10 to 80)

Page 21

Experiment Results

(Figure: "Speedup - 5000 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis speedup from 10 to 80)

Page 22

Experiment Results

- Interestingly, the maximum speedup was the same at all block counts: roughly speaking, as long as num_blocks * num_threads >= matrix order, the maximum speedup of ~70 is achieved
- 10 threads never got there; due to block scheduling overhead? Possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case
- Maxing out the thread count is counter-productive on smaller matrices
- Hypothesis: when the block count is excessive (as for small matrices), scheduling large blocks of threads that return immediately is costly

Page 23

Experiment Results

(Figure: "Efficiency - 10 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis efficiency from 0.01 to 0.08)

Page 24

Experiment Results

(Figure: "Efficiency - 100 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis efficiency from 0.005 to 0.05)

Page 25

Experiment Results

(Figure: "Efficiency - 5000 Blocks"; x-axis matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024), y-axis efficiency from 0.0002 to 0.0012)

Page 26

Experiment Results

- Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
- Hypothesis: the overhead of starting blocks and threads must be high enough that adding blocks and threads yields diminishing returns