
Page 1: Evaluating FERMI features for Data Mining Applications

Evaluating FERMI features for Data Mining Applications

Master's Thesis Presentation
Sinduja Muralidharan

Advised by: Dr. Gagan Agrawal

Page 2: Evaluating FERMI features for Data Mining Applications

Outline

• Motivation and Background
• The FERMI series and the TESLA series GPUs
• Reduction based Data Mining Algorithms
• Parallelization Methods for GPUs
• Experimental Evaluation
• Conclusion

Page 3: Evaluating FERMI features for Data Mining Applications

Background

• GPUs have recently emerged as a major player in high performance computing.
– Excellent price-to-performance ratio.
– Suitability and popularity of CUDA for programming a variety of high performance applications.
• GPU hardware and software have evolved rapidly.
– New GPU products and successive versions of CUDA have added new functionality and better performance.

Page 4: Evaluating FERMI features for Data Mining Applications

The FERMI GPU

• The Fermi series of cards:
– includes the C2050 and C2070 cards.
– is also referred to as the 20-series family of NVIDIA Tesla GPUs.
• Support for double precision atomic operations.
• Much larger shared memory/L1 cache, which can be configured as:
– 48 KB shared memory and 16 KB L1 cache, or
– 16 KB shared memory and 48 KB L1 cache.
• Presence of an L2 cache.

Page 5: Evaluating FERMI features for Data Mining Applications

TESLA vs FERMI

Page 6: Evaluating FERMI features for Data Mining Applications

Thesis Objective

• Optimizing and evaluating the new features of the Fermi series GPUs:
– Increased shared memory.
– Support for atomic operations on floating point data.
• Using three parallelization approaches on reduction based mining algorithms:
– Full replication in shared memory.
– Improving locking with inbuilt atomic operations.
– Creation of several hybrid versions for optimal performance.

Page 7: Evaluating FERMI features for Data Mining Applications

Generalized Reductions

• op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object.
• The specific elements of the reduction object that are updated depend on the results of previous processing.
• The data instances (or records, or transactions) are divided among the processing threads.
• The element of the reduction object updated in iteration i of the loop is determined as a result of previous processing (see the sketch below).
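The loop structure described above can be summarized as follows. This is a hedged sketch only: process() and op() are illustrative placeholders, not code from the thesis.

```cuda
/* Sketch of the generalized reduction structure (sequential form).
 * process() maps one data instance to an element of the reduction object,
 * op() is the commutative, associative combining function (e.g. accumulating
 * coordinates in k-means). Both names are hypothetical placeholders. */
int   process(const float *record, float *val_out);   /* hypothetical */
float op(float a, float b);                            /* hypothetical */

void generalized_reduction(const float *data, int num_instances,
                           int record_size, float *reduc)
{
    for (int i = 0; i < num_instances; i++) {
        float val;
        /* which element gets updated is known only after processing record i */
        int idx = process(&data[i * record_size], &val);
        reduc[idx] = op(reduc[idx], val);
    }
}
```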

Page 8: Evaluating FERMI features for Data Mining Applications

Parallelizing Generalized Reductions

• It is not possible to statically partition the reduction object so that different processors update disjoint portions; which portions are updated is only known at runtime:
– This can lead to race conditions.
– The execution time of the process function can take up a major chunk of the total execution time of a loop iteration, so runtime preprocessing and static scheduling techniques cannot be applied.
– The reduction object may be too large to keep replicas in memory without significant overheads.

Page 9: Evaluating FERMI features for Data Mining Applications

Earlier Parallelization Techniques

• Earlier attempts to parallelize the map-reduce class of applications were limited by:
– the lack of support for atomic operations on floating point numbers.
– the large number of threads required for effective parallelization.
• The larger shared memory on Fermi allows total replication of the reduction object for some thread configurations:
– largely avoiding the possibility of race conditions and thread contention.

Page 10: Evaluating FERMI features for Data Mining Applications

Full Replication

• In any shared memory system, the simplest way to avoid race conditions is to:
– have each thread keep its own copy of the reduction object in device memory and update it independently.
– at the end of each iteration, perform a global combination, either by a single thread or using a tree structure.
– copy the final object back to host memory (see the sketch below).
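A minimal CUDA sketch of this full replication scheme follows, assuming hypothetical process_dev()/op_dev() device helpers and an illustrative layout with one private copy per thread in global device memory.

```cuda
__device__ int   process_dev(const float *record, float *val_out);  // hypothetical
__device__ float op_dev(float a, float b);                          // hypothetical

// Each thread accumulates into its own copy of the reduction object, so no
// synchronization is needed during the main loop; the copies are merged
// afterwards (by one thread or a pairwise tree) and copied back to the host.
__global__ void full_replication_kernel(const float *data, int n, int record_size,
                                        float *reduc_copies, int obj_size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float *my_copy = reduc_copies + tid * obj_size;   // private copy in device memory

    for (int i = tid; i < n; i += nthreads) {         // records divided among threads
        float val;
        int idx = process_dev(&data[i * record_size], &val);
        my_copy[idx] = op_dev(my_copy[idx], val);
    }
}
// Afterwards: launch a combination kernel (or combine with a single thread),
// then cudaMemcpy() the final reduction object back to host memory.
```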

Page 11: Evaluating FERMI features for Data Mining Applications

Full Replication in Shared Memory

• The factors that affect the performance of the full replication mode of reduction:
– the size of the reduction object (which depends on the number of threads per multiprocessor).
– the amount of computation in comparison to the amount of data copied between devices.
– whether or not the global data can be copied into shared memory.
• On Tesla it was not possible to fit all the copies of the reduction object within the 16 KB of available shared memory:
– the higher latency device memory had to be used.

Page 12: Evaluating FERMI features for Data Mining Applications

Full Replication in Shared Memory (continued)

• The larger shared memory available on Fermi can hold all copies of the reduction object entirely within shared memory for smaller configurations:
– No race conditions or contention among threads, because each thread updates its own copy of the object.
– Global memory accesses are replaced by low latency shared memory accesses (see the sketch below).
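A hedged sketch of the shared-memory variant, assuming a small reduction object; the sizes below are illustrative choices so that one private copy per thread fits in Fermi's 48 KB shared memory configuration.

```cuda
#define OBJ_SIZE 40            // illustrative: e.g. k = 10 clusters x 4 floats (assumption)
#define THREADS_PER_BLOCK 128  // illustrative thread configuration

__device__ int   process_dev(const float *record, float *val_out);  // hypothetical
__device__ float op_dev(float a, float b);                          // hypothetical

__global__ void full_replication_shared_kernel(const float *data, int n,
                                               int record_size, float *block_out)
{
    // One private copy per thread, entirely in shared memory:
    // 128 threads x 40 floats x 4 bytes = 20 KB, within the 48 KB configuration.
    __shared__ float copies[THREADS_PER_BLOCK * OBJ_SIZE];
    float *my_copy = copies + threadIdx.x * OBJ_SIZE;
    for (int j = 0; j < OBJ_SIZE; j++) my_copy[j] = 0.0f;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads) {
        float val;
        int idx = process_dev(&data[i * record_size], &val);
        my_copy[idx] = op_dev(my_copy[idx], val);   // low latency, no contention
    }
    __syncthreads();
    // ... combine the per-thread copies within the block, write to block_out,
    // then perform the global combination across blocks.
}
```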

Page 13: Evaluating FERMI features for Data Mining Applications
Page 14: Evaluating FERMI features for Data Mining Applications

Locking Scheme

• The shared memories of different multiprocessors have no synchronization mechanism between them, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor.
• While performing updates on the reduction object, all threads of a thread block use locking to avoid race conditions.
• Finally, a global combination is performed on the updates accumulated on the different multiprocessors (see the sketch below).
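A hedged sketch of the locking scheme, using atomicAdd as the combining update purely for illustration (the thesis's actual update operation and object size may differ).

```cuda
#define OBJ_SIZE 1024   // illustrative: one copy per thread block fits in shared memory

__device__ int process_dev(const float *record, float *val_out);   // hypothetical

__global__ void locking_kernel(const float *data, int n, int record_size,
                               float *block_out)
{
    // One copy of the reduction object per multiprocessor (thread block).
    __shared__ float block_copy[OBJ_SIZE];
    for (int j = threadIdx.x; j < OBJ_SIZE; j += blockDim.x) block_copy[j] = 0.0f;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads) {
        float val;
        int idx = process_dev(&data[i * record_size], &val);
        atomicAdd(&block_copy[idx], val);   // all threads of the block contend here
    }
    __syncthreads();

    // Write out this multiprocessor's copy for the final global combination.
    for (int j = threadIdx.x; j < OBJ_SIZE; j += blockDim.x)
        block_out[blockIdx.x * OBJ_SIZE + j] = block_copy[j];
}
```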

Page 15: Evaluating FERMI features for Data Mining Applications

Locking: TESLA vs FERMI

• Fine grained locking:
– TESLA: wrapper based implementation of atomic floating point operations.
– FERMI: inbuilt support for atomic floating point operations (see the sketch below).
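The contrast the slide points to can be illustrated as follows. The Tesla-side wrapper shown here is the commonly used compare-and-swap emulation of a floating point atomic add, given as an assumption rather than the thesis's exact code; on Fermi the hardware atomicAdd for float (compute capability 2.0 and above) can be called directly.

```cuda
// TESLA-style wrapper: emulate a floating point atomic add with an integer
// compare-and-swap loop, retrying until no other thread changed the word.
__device__ float atomicAddWrapper(float *addr, float val)
{
    int *addr_as_int = (int *) addr;
    int old = *addr_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

// FERMI: the inbuilt atomic can be used directly on global or shared memory.
__device__ void fermi_update(float *cell, float val)
{
    atomicAdd(cell, val);   // native on compute capability >= 2.0
}
```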

Page 16: Evaluating FERMI features for Data Mining Applications

The Hybrid Scheme

• Full replication:
– A private copy of the reduction object is needed for each thread in a block.
– Larger reduction objects must be stored in the high latency global device memory.
– The cost of combination can be very high.
• Locking:
– A single copy of the reduction object is stored in shared memory.
– Eliminates the need to combine per-thread copies.
– Contention among the threads of a block is very high.
• Configuring an application with a larger number of threads per multiprocessor typically leads to better performance:
– latencies can be masked by context switching between warps.

Page 17: Evaluating FERMI features for Data Mining Applications
Page 18: Evaluating FERMI features for Data Mining Applications

The Hybrid Scheme (continued)

• When choosing the number of groups, M:
– M copies of the reduction object should still fit into shared memory.
– If the reduction object is big, the combination overhead is higher than the contention overhead.
– When the object is smaller, the contention overhead dominates the combination overhead.
• Since it is desirable to keep the contention overhead small, a larger number of groups is preferable.
• Several hybrid versions were created and evaluated on Fermi to study the optimal balance between contention and combination overheads (see the sketch below).
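A hedged sketch of one hybrid version, with illustrative sizes: the threads of a block are split into groups, each group shares one copy of the reduction object (updated atomically, so contention is limited to the group), and the group copies are combined before the global combination. The group size and object size below are assumptions, not the thesis's configuration.

```cuda
#define OBJ_SIZE          40    // illustrative reduction object size
#define THREADS_PER_BLOCK 128   // illustrative
#define THREADS_PER_GROUP 8     // e.g. an "8 per group" hybrid version (assumption)
#define NUM_GROUPS        (THREADS_PER_BLOCK / THREADS_PER_GROUP)   // M = 16 copies

__device__ int process_dev(const float *record, float *val_out);    // hypothetical

__global__ void hybrid_kernel(const float *data, int n, int record_size,
                              float *block_out)
{
    __shared__ float copies[NUM_GROUPS * OBJ_SIZE];   // M copies must fit in shared memory
    float *my_copy = copies + (threadIdx.x / THREADS_PER_GROUP) * OBJ_SIZE;

    for (int j = threadIdx.x; j < NUM_GROUPS * OBJ_SIZE; j += blockDim.x)
        copies[j] = 0.0f;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nthreads) {
        float val;
        int idx = process_dev(&data[i * record_size], &val);
        atomicAdd(&my_copy[idx], val);   // contention only among the group's threads
    }
    __syncthreads();

    // Combine the M group copies, then write out for the global combination.
    for (int j = threadIdx.x; j < OBJ_SIZE; j += blockDim.x) {
        float acc = 0.0f;
        for (int g = 0; g < NUM_GROUPS; g++) acc += copies[g * OBJ_SIZE + j];
        block_out[blockIdx.x * OBJ_SIZE + j] = acc;
    }
}
```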

Page 19: Evaluating FERMI features for Data Mining Applications

Experimental Evaluation

• Environment:
– TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.296 GHz, and 4 GB of device memory.
– FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz, and 3 GB of device memory.

Page 20: Evaluating FERMI features for Data Mining Applications

Observations

• For larger reduction objects, the hybrid approach generally outperforms the replication and locking approaches.
– The combination overhead dominates.
• For smaller reduction objects, full replication in shared memory yields the best performance.
– The contention overhead dominates.
• Inbuilt support for atomic floating point operations outperforms the previously used wrapper based implementation.

Page 21: Evaluating FERMI features for Data Mining Applications

K-Means Results

[Two charts of execution time (seconds) vs. number of threads (64, 128, 192, 256, 512) for k = 10: one for the wrapper based implementation of atomic floating point operations and one for the inbuilt support for atomic floating point operations. Series compared: Atomic, Replicate, Hybrid, and Full replication on SM.]

Page 22: Evaluating FERMI features for Data Mining Applications

K-Means Results

[Two charts of execution time (seconds) vs. number of threads (64, 128, 192, 256, 512) for k = 100: one for the wrapper based implementation of atomic floating point operations and one for the inbuilt support for atomic floating point operations. Series compared: Atomic, Replicate, and Hybrid.]

Page 23: Evaluating FERMI features for Data Mining Applications

K-Means Results

[Two charts of execution time (seconds) vs. threads per block (64–512) for the hybrid versions: k = 10 (4, 8, 16, 32, and 64 per group) and k = 100 (4, 8, and 16 per group).]

Page 24: Evaluating FERMI features for Data Mining Applications

PCA Results

[Two charts of execution time (seconds) vs. threads per block (64–512): comparison of parallelization schemes with the wrapper based implementation for 16 columns, and with inbuilt atomic floating point for 32 columns. Series compared: wrapper atomic, wrapper hybrid, inbuilt atomic, and inbuilt hybrid.]

Page 25: Evaluating FERMI features for Data Mining Applications

PCA Results

[Two charts of execution time (seconds) vs. threads per block (64–512) for the hybrid versions: 16 columns (4, 8, 16, and 32 per group) and 32 columns (4 and 8 per group).]

Page 26: Evaluating FERMI features for Data Mining Applications

kNN Results

[Two charts of execution time (seconds) vs. number of threads per block (64–512), comparing the schemes for k = 10 and k = 20. Series compared: atomic, replicate, hybrid, and full replication on SM.]

Page 27: Evaluating FERMI features for Data Mining Applications

kNN Results

[Two charts of execution time (seconds) vs. number of threads per block (64–512) for the hybrid versions of kNN, k = 10 and k = 20 (4, 8, 16, 32, and 64 per group).]

Page 28: Evaluating FERMI features for Data Mining Applications

Conclusions

• The new features of the Fermi series GPU cards:
– support for inbuilt atomic double precision operations.
– an increase in the amount of available shared memory.
• Evaluated using three reduction based data mining algorithms:
– The best scheme depends on the balance between the overheads of thread contention and global combination.
– For smaller clusters, contention is the dominant factor.
– For larger clusters, the combination overhead dominates.

Page 29: Evaluating FERMI features for Data Mining Applications

Thank You! Questions?