The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators

Marcin Pietroń 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2

1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków

RUC 17-18.09.2015 Kraków



Agenda

GPU acceleration

Code analysis and instrumentation

Experiments

Results

Conclusion and future work

GPUs as modern hardware accelerators

Computing power (over 1 TFLOPS)

Availability

High parallelism (SIMT architecture)

High level programming tools (CUDA, OpenCL)

GPU hardware accelerators

Many algorithms from different domains have been implemented on GPUs:

Linear algebra (e.g. cuBLAS, CULA)

Deep learning, neural networks, machine learning algorithms (e.g. SVM)

Computational intelligence (e.g. genetic, memetic algorithms)

Data and text mining

Code analysis

A GPU implementation should be preceded by appropriate analysis of the code

The analysis can be automated

Static analysis for finding hidden parallelism (Banerjee test, Range Test, Omega Test) and for data reuse and distribution

Profiling as dynamic analysis
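To make the static side concrete, the following is a minimal sketch of a Banerjee-style bounds check for a single loop with affine subscripts. It is heavily simplified (one loop, one array dimension, no direction vectors), and the class and method names are hypothetical; they are not taken from the presented system.

```java
/** Simplified single-loop Banerjee bounds test (hypothetical helper, not the authors' code). */
final class BanerjeeTest {

    /**
     * Checks whether a write to A[a1*i + b1] and a read of A[a2*j + b2]
     * may touch the same element for some i, j in [lower, upper].
     * Returns false only when a dependence is provably impossible.
     */
    static boolean mayDepend(int a1, int b1, int a2, int b2, int lower, int upper) {
        // Same element  <=>  a1*i - a2*j == b2 - b1
        long target = (long) b2 - b1;

        // Extremes of a1*i over [lower, upper] (a linear term peaks at an endpoint)
        long minI = Math.min((long) a1 * lower, (long) a1 * upper);
        long maxI = Math.max((long) a1 * lower, (long) a1 * upper);

        // Extremes of -a2*j over [lower, upper]
        long minJ = Math.min(-(long) a2 * lower, -(long) a2 * upper);
        long maxJ = Math.max(-(long) a2 * lower, -(long) a2 * upper);

        // A dependence is possible only if the target lies inside the reachable range
        return minI + minJ <= target && target <= maxI + maxJ;
    }

    public static void main(String[] args) {
        // write x[i], read x[i - 1], i in [1, 99]: the indices can meet
        System.out.println(mayDepend(1, 0, 1, -1, 1, 99));
        // write x[i], read x[i + 200], i in [1, 99]: the indices never meet
        System.out.println(mayDepend(1, 0, 1, 200, 1, 99));
    }
}
```

When the check returns false, the loop iterations do not conflict on that array and can be mapped to independent GPU threads. When it returns true, a dependence is only possible, not proven, which is where profiling can supply the missing information.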

Byte code analysis and instrumentation

Byte code analysis performed just in time

Appropriate instrumentation for profiling and static analysis

Results of the analysis and profiling can be used to guide the implementation

System architecture

Byte code instrumentation

instrumenting array data read instructions,

instrumenting array data write instructions,

instrumenting array data read and write instructions to count the number of accesses and their standard deviation,

instrumenting reads and writes of single variables to count the number of accesses.
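The counters listed above could be collected by a small runtime class that the instrumented byte code calls on every array access. The sketch below is an assumption about how such a monitor might look: the class name `ArrayAccessMonitor`, its methods, and the choice to compute the standard deviation over the accessed indices are all hypothetical, since the slides do not say which quantity the deviation refers to.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical runtime monitor that instrumented code would call on each array access. */
final class ArrayAccessMonitor {

    /** Per-array counters plus running sums of the accessed indices. */
    private static final class Stats {
        long reads;
        long writes;
        double sum;    // sum of accessed indices
        double sumSq;  // sum of squared accessed indices
    }

    private final Map<String, Stats> statsByArray = new HashMap<>();

    void recordRead(String array, int index)  { update(array, index, false); }
    void recordWrite(String array, int index) { update(array, index, true); }

    private void update(String array, int index, boolean write) {
        Stats s = statsByArray.computeIfAbsent(array, k -> new Stats());
        if (write) s.writes++; else s.reads++;
        s.sum += index;
        s.sumSq += (double) index * index;
    }

    /** Total number of recorded reads and writes for the given array. */
    long accesses(String array) {
        Stats s = statsByArray.get(array);
        return s == null ? 0 : s.reads + s.writes;
    }

    /** Population standard deviation of the accessed indices (assumed meaning). */
    double indexStdDev(String array) {
        Stats s = statsByArray.get(array);
        if (s == null) return 0.0;
        long n = s.reads + s.writes;
        if (n == 0) return 0.0;
        double mean = s.sum / n;
        return Math.sqrt(Math.max(0.0, s.sumSq / n - mean * mean));
    }
}
```

A low deviation with many accesses would suggest that a few elements are reused heavily, while a high deviation suggests accesses spread over the whole array.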

Byte code instrumentation (example)

Original loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_2[i] = test_1[i-1] + 10;
    }

Instrumented loop:

    for (int i = 1; i < 100; i++) {
        test_1[i] = 100;
        test_1_mon[i] = i;
        test_2[i] = test_1[i-1] + 10;
        if (test_1_mon[i-1] < i) {
            dist_vectors[i-1] = i-test_1_mon[i];
        }
    }

Byte code of the instrumented loop (excerpt):

    27: iconst_1
    28: istore %6
    30: iload %6
    32: bipush 100
    34: if_icmpge #93
    37: aload_1
    38: iload %6
    40: bipush 100
    42: iastore
    43: aload_3
    44: iload %6
    46: iload %6
    48: iastore
    ...
    70: if_icmpge #87
    73: aload %5
    75: iload %6
    77: iconst_1
    78: isub
    79: iload %6
    81: aload_3
    82: iload %6
    84: iaload
    85: isub
    86: iastore
    87: iinc %6 1
    90: goto #30
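The monitoring idea in this example can be tried out as a standalone program. The sketch below is a variation under stated assumptions, not the slide's exact listing: the monitor array is initialised to -1 to mark elements not yet written in the loop, the monitor is indexed with the element that is actually read (`i - 1`), and the distance is stored at the reading iteration. The names `DistanceVectorDemo`, `lastWriter`, and `profile` are hypothetical.

```java
import java.util.Arrays;

/** Runnable variation of the distance-vector instrumentation (names are illustrative). */
final class DistanceVectorDemo {

    /** Runs the instrumented loop for i in [1, n) and returns the recorded distances. */
    static int[] profile(int n) {
        int[] test1 = new int[n];
        int[] test2 = new int[n];
        int[] lastWriter = new int[n];   // plays the role of test_1_mon
        int[] distVectors = new int[n];  // 0 means "no loop-carried dependence seen"
        Arrays.fill(lastWriter, -1);     // -1: element not written inside the loop yet

        for (int i = 1; i < n; i++) {
            test1[i] = 100;
            lastWriter[i] = i;                 // remember which iteration wrote test1[i]
            test2[i] = test1[i - 1] + 10;      // read of an element written earlier
            if (lastWriter[i - 1] >= 0) {
                // distance between the reading and the writing iteration
                distVectors[i] = i - lastWriter[i - 1];
            }
        }
        return distVectors;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(profile(100)));
    }
}
```

In this variation every iteration from the second one onward records a distance of 1, which says that iteration `i` depends on the value produced by iteration `i - 1`.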

GPU implementation rules

if data is reused between iterations (between threads), it should be transferred to shared memory,

data reused only within a single iteration should be transferred to local memory (registers),

data which is reused, read-only and accessed irregularly should be allocated in texture memory,

GPU implementation rules (continued)

common constant values used by all threads should be written to constant memory,

data that is accessed only once, but not in a coalesced manner, should be transferred as a group in a coalesced manner to shared memory and then read from shared memory for further computing.
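The five rules can be read as a decision procedure over the profiling results. The following sketch encodes them that way; the `AccessProfile` fields, the `MemorySpace` names, and in particular the order in which the rules are tested are assumptions made for illustration, since the slides do not state a precedence between rules.

```java
/** Hypothetical mapping from profiling results to a GPU memory space. */
final class MemoryPlacement {

    enum MemorySpace { CONSTANT, TEXTURE, SHARED, REGISTERS, SHARED_VIA_COALESCED_COPY, GLOBAL }

    /** Assumed summary of what the profiler learned about one array or variable. */
    static final class AccessProfile {
        boolean sameConstantForAllThreads;
        boolean readOnly;
        boolean reusedBetweenIterations;
        boolean reusedWithinOneIteration;
        boolean regularAccess;
        boolean coalescedAccess;
    }

    static MemorySpace choose(AccessProfile p) {
        // Common constant values used by all threads
        if (p.sameConstantForAllThreads && p.readOnly) return MemorySpace.CONSTANT;

        boolean reused = p.reusedBetweenIterations || p.reusedWithinOneIteration;
        // Reused, read-only, irregular accesses
        if (reused && p.readOnly && !p.regularAccess) return MemorySpace.TEXTURE;
        // Reused between iterations (between threads)
        if (p.reusedBetweenIterations) return MemorySpace.SHARED;
        // Reused only inside a single iteration
        if (p.reusedWithinOneIteration) return MemorySpace.REGISTERS;
        // Single access, not coalesced: stage through shared memory in coalesced groups
        if (!p.coalescedAccess) return MemorySpace.SHARED_VIA_COALESCED_COPY;
        // Single, already coalesced access: assumed to stay in global memory
        return MemorySpace.GLOBAL;
    }

    public static void main(String[] args) {
        AccessProfile p = new AccessProfile();
        p.reusedBetweenIterations = true;
        p.regularAccess = true;
        System.out.println(choose(p));
    }
}
```

The final `GLOBAL` case is not one of the listed rules; it is added here only so that every profile maps to some memory space.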

JCuda generation

The implementation can be done manually or partly automatically

The rules generate a set of parallel code patterns
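One way to picture a generated code pattern is a template that turns a dependence-free loop body into CUDA kernel source, which a JCuda host program could then compile and launch. The sketch below only produces the kernel text; the class `KernelPattern`, its method, and the fixed `out`/`in`/`n` parameter list are hypothetical and are not taken from the presented system.

```java
/** Hypothetical template for one parallel code pattern: one GPU thread per loop iteration. */
final class KernelPattern {

    /**
     * Emits CUDA C source for a loop without loop-carried dependences.
     * The loop body is expected to use the index variable i and the arrays out and in.
     */
    static String elementwiseKernel(String name, String type, String body) {
        return String.join("\n",
            // extern "C" keeps the kernel name unmangled so it can be looked up by name
            "extern \"C\" __global__ void " + name
                + "(" + type + "* out, const " + type + "* in, int n) {",
            "    int i = blockIdx.x * blockDim.x + threadIdx.x;  // iteration handled by this thread",
            "    if (i < n) {                                    // guard the last, partial block",
            "        " + body,
            "    }",
            "}");
    }

    public static void main(String[] args) {
        System.out.println(elementwiseKernel("add_ten", "int", "out[i] = in[i] + 10;"));
    }
}
```

Patterns for the other memory rules (for example staging a tile in shared memory) would follow the same idea with a different template.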

Experimental results

Matrix size    GPU time [ms]    CPU time (MKL BLAS) [ms]
256×256        0.4              6
512×512        4.3              24
1024×1024      34               158
2048×2048      285              956
4096×4096      2817             990
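The timings can be turned into speedup factors by dividing the CPU time by the GPU time for each matrix size. The short program below does this arithmetic on the values exactly as reported in the table; the class name `Speedup` is illustrative.

```java
/** Computes CPU/GPU speedup factors from the reported timings. */
final class Speedup {

    /** Speedup of the GPU run relative to the CPU run; values below 1 mean the CPU was faster. */
    static double speedup(double cpuMs, double gpuMs) {
        return cpuMs / gpuMs;
    }

    public static void main(String[] args) {
        String[] sizes = {"256x256", "512x512", "1024x1024", "2048x2048", "4096x4096"};
        double[] gpuMs = {0.4, 4.3, 34, 285, 2817};   // GPU time [ms], as reported
        double[] cpuMs = {6, 24, 158, 956, 990};      // CPU time (MKL BLAS) [ms], as reported

        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%-10s %.2f%n", sizes[i], speedup(cpuMs[i], gpuMs[i]));
        }
    }
}
```

With the reported numbers the factor falls as the matrix grows, and for the 4096×4096 row it drops below 1.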

Conclusions and future work

Preceding the implementation with code analysis helps adapt an algorithm to the GPU

Automated generation of parallel GPU code saves a lot of time

Based on byte code = portable

Optimizations in code generation (memory access patterns) still need further work in our system

Questions?