TRANSCRIPT
The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators
Marcin Pietroń1,2, Michał Karwatowski1,2, Kazimierz Wiatr1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków; 2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
Slide 2: Agenda
GPU acceleration
Code analysis and instrumentation
Experiments
Results
Conclusion and future work
Slide 3: GPUs as modern hardware accelerators
Computing power (over 1 TFLOPS)
Availability
High parallelism (SIMT architecture)
High-level programming tools (CUDA, OpenCL)
Slide 4: GPU hardware accelerators
A number of algorithms from different domains have been implemented on GPUs:
Linear algebra (e.g. cuBLAS, CULA)
Deep learning, neural networks, machine learning algorithms (e.g. SVM)
Computational intelligence (e.g. genetic and memetic algorithms)
Data and text mining
Slide 5: Code analysis
Implementation should be preceded by appropriate analysis
The analysis can be automated
Static analysis for finding hidden parallelism (Banerjee test, Range test, Omega test) and for data reuse and distribution
Profiling as dynamic analysis
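As an illustration of what these dependence tests check, below is a hypothetical pure-Java sketch of the GCD test, a simpler relative of the Banerjee test named above. The class and method names are ours, not the profiler's: it decides whether a write to a[c1*i + d1] and a read of a[c2*i + d2] can ever touch the same element.

```java
import java.math.BigInteger;

/** Hypothetical sketch of the GCD dependence test: a dependence between
 *  a write to a[c1*i + d1] and a read of a[c2*i + d2] is only possible
 *  if gcd(c1, c2) divides (d2 - d1). Illustrative only, not the profiler's API. */
public class GcdDependenceTest {

    /** Returns true if a dependence between the two references is possible. */
    static boolean mayDepend(int c1, int d1, int c2, int d2) {
        int g = BigInteger.valueOf(c1).gcd(BigInteger.valueOf(c2)).intValue();
        if (g == 0) {
            // both coefficients are zero: references depend on the constants alone
            return d1 == d2;
        }
        return (d2 - d1) % g == 0;
    }

    public static void main(String[] args) {
        // a[2*i] = ...; ... = a[2*i + 1];  gcd(2,2)=2 does not divide 1: independent
        System.out.println(mayDepend(2, 0, 2, 1));   // false
        // a[i] = ...; ... = a[i - 1];      gcd(1,1)=1 divides -1: possible dependence
        System.out.println(mayDepend(1, 0, 1, -1));  // true
    }
}
```

The second case is exactly the cross-iteration dependence shown on the instrumentation slide later: the read at i-1 reaches the write from the previous iteration, so the loop cannot be split into independent GPU threads as-is.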
Slide 6: Byte code analysis and instrumentation
Just-in-time byte code analysis
Appropriate instrumentation for profiling and static analysis
Results of analysis and profiling can be used to guide the implementation
Slide 7: System architecture
Slide 8: Byte code instrumentation
instrumenting array data read instructions,
instrumenting array data write instructions,
instrumenting array data read and write instructions to count the number of accesses and their standard deviation,
instrumenting reads and writes of single variables to count the number of accesses.
Slide 9: Byte code instrumentation

Original loop:

for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_2[i] = test_1[i-1] + 10;
}

Instrumented loop (test_1_mon records the iteration that last wrote each element; dist_vectors collects the observed dependence distances):

for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_1_mon[i] = i;
    test_2[i] = test_1[i-1] + 10;
    if (test_1_mon[i-1] < i) {
        dist_vectors[i-1] = i - test_1_mon[i-1];
    }
}
27: iconst_1
28: istore %6
30: iload %6
32: bipush 100
34: if_icmpge #93
37: aload_1
38: iload %6
40: bipush 100
42: iastore
43: aload_3
44: iload %6
46: iload %6
48: iastore
...
70: if_icmpge #87
73: aload %5
75: iload %6
77: iconst_1
78: isub
79: iload %6
81: aload_3
82: iload %6
84: iaload
85: isub
86: iastore
87: iinc %6 1
90: goto #30
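The effect of the instrumentation above can be reproduced in plain Java: a shadow array records the iteration that last wrote each element, and the distance between a read and its matching write is collected into dist_vectors. The array names follow the slide; the wrapper class and method are our own illustration, not the profiler's generated code.

```java
/** Hypothetical pure-Java rendering of the instrumented loop on the slide.
 *  test_1_mon is the shadow array holding the last writing iteration of each
 *  element of test_1; dist_vectors holds the observed dependence distances. */
public class ArrayAccessMonitor {

    static int[] computeDistances(int n) {
        int[] test_1 = new int[n];
        int[] test_2 = new int[n];
        int[] test_1_mon = new int[n];   // shadow array: last writing iteration
        int[] dist_vectors = new int[n]; // observed read-after-write distances

        for (int i = 1; i < n; i++) {
            test_1[i] = 100;
            test_1_mon[i] = i;                    // record the write
            test_2[i] = test_1[i - 1] + 10;       // read of the previous element
            if (test_1_mon[i - 1] < i) {          // written by an earlier iteration
                dist_vectors[i - 1] = i - test_1_mon[i - 1];
            }
        }
        return dist_vectors;
    }

    public static void main(String[] args) {
        // every monitored read is exactly one iteration behind its write
        System.out.println(computeDistances(100)[1]);  // 1
    }
}
```

A constant distance of 1 across all iterations tells the code generator that the loop carries a uniform flow dependence, which is the information the GPU implementation rules on the next slide consume.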
Slide 10: GPU implementation rules
if data is reused between iterations (between threads), it should be transferred to shared memory,
data reused by only a single iteration should be transferred to local memory (registers),
data which is reused, read-only, and accessed irregularly should be allocated in texture memory,
Slide 11: GPU implementation rules
common constant values used by all threads should be written to constant memory,
data with a single access but without coalescing should be transferred as a group, in a coalesced manner, to shared memory and then read from there for further computation.
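The rules on these two slides can be summarized, purely as an illustration, by a decision function over an access profile. The AccessProfile fields, the ordering of the checks, and the memory names are our assumptions for the sketch, not the profiler's actual internal representation:

```java
/** Hypothetical encoding of the memory-placement rules from the slides.
 *  Each array's dynamic profile (gathered by the instrumentation) is mapped
 *  to a target GPU memory space. Illustrative only. */
public class MemoryPlacement {

    enum Memory { CONSTANT, TEXTURE, SHARED, LOCAL, GLOBAL_COALESCED_STAGING }

    static class AccessProfile {
        boolean constantForAllThreads; // same constant value for every thread
        boolean reusedAcrossThreads;   // reused between iterations/threads
        boolean reusedWithinThread;    // reused by a single iteration only
        boolean readOnly;
        boolean regularAccess;         // regular (e.g. affine) access pattern
    }

    static Memory place(AccessProfile p) {
        if (p.constantForAllThreads) return Memory.CONSTANT;
        boolean reused = p.reusedAcrossThreads || p.reusedWithinThread;
        // reused, read-only, irregular accesses: texture memory
        if (reused && p.readOnly && !p.regularAccess) return Memory.TEXTURE;
        if (p.reusedAcrossThreads) return Memory.SHARED;
        if (p.reusedWithinThread)  return Memory.LOCAL;
        // single access without coalescing: stage through shared memory
        return Memory.GLOBAL_COALESCED_STAGING;
    }

    public static void main(String[] args) {
        AccessProfile p = new AccessProfile();
        p.reusedAcrossThreads = true;
        p.regularAccess = true;
        System.out.println(place(p));  // SHARED
    }
}
```

The precedence between the texture and shared rules is our own choice for the sketch; the slides do not specify how overlapping conditions are resolved.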
Slide 12: JCuda generation
Implementation can be done manually or in a partly automated way
The rules generate parallel code patterns
Slide 13: Experimental results

Size of matrix   GPU time [ms]   CPU time (MKL BLAS) [ms]
256×256          0.4             6
512×512          4.3             24
1024×1024        34              158
2048×2048        285             956
4096×4096        2817            990
Slide 14: Conclusions and future work
Implementation preceded by source code analysis helps adapt an algorithm to the GPU
Automated generation of parallel GPU code saves a lot of time
Based on byte code = portable
Further optimizations of code generation (memory access patterns) must be added to our system
Slide 15: Questions
?