TRANSCRIPT
The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators
Marcin Pietroń1,2, Michał Karwatowski1,2, Kazimierz Wiatr1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków; 2 ACK Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków
RUC 17-18.09.2015 Kraków
Slide 2: Agenda
GPU acceleration
Code analysis and instrumentation
Experiments
Results
Conclusion and future work
Slide 3: GPUs as modern hardware accelerators
Computing power (over 1 TFLOPS)
Availability
High parallelism (SIMT architecture)
High-level programming tools (CUDA, OpenCL)
Slide 4: GPU hardware accelerators
A number of algorithms from different domains have been implemented on GPUs:
Linear algebra (e.g. cuBLAS, CULA)
Deep learning, neural networks, machine learning algorithms (e.g. SVM)
Computational intelligence (e.g. genetic and memetic algorithms)
Data and text mining
Slide 5: Code analysis
Implementation should be preceded by appropriate analysis
The analysis can be automated
Static analysis for finding hidden parallelism (Banerjee test, Range test, Omega test) and for data reuse and distribution
Profiling as dynamic analysis
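As an illustration of what these dependence tests check, below is a hypothetical pure-Java sketch of the GCD test, a simpler relative of the Banerjee test named above. The class and method names are ours, not the profiler's: it decides whether a write to a[c1*i + d1] and a read of a[c2*i + d2] can ever touch the same element.

```java
import java.math.BigInteger;

/** Hypothetical sketch of the GCD dependence test: a dependence between
 *  a write to a[c1*i + d1] and a read of a[c2*i + d2] is only possible
 *  if gcd(c1, c2) divides (d2 - d1). Illustrative only, not the profiler's API. */
public class GcdDependenceTest {

    /** Returns true if a dependence between the two references is possible. */
    static boolean mayDepend(int c1, int d1, int c2, int d2) {
        int g = BigInteger.valueOf(c1).gcd(BigInteger.valueOf(c2)).intValue();
        if (g == 0) {
            // both coefficients are zero: references depend on the constants alone
            return d1 == d2;
        }
        return (d2 - d1) % g == 0;
    }

    public static void main(String[] args) {
        // a[2*i] = ...; ... = a[2*i + 1];  gcd(2,2)=2 does not divide 1: independent
        System.out.println(mayDepend(2, 0, 2, 1));   // false
        // a[i] = ...; ... = a[i - 1];      gcd(1,1)=1 divides -1: possible dependence
        System.out.println(mayDepend(1, 0, 1, -1));  // true
    }
}
```

The second case is exactly the cross-iteration dependence shown on the instrumentation slide later: the read at i-1 reaches the write from the previous iteration, so the loop cannot be split into independent GPU threads as-is.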
Slide 6: Byte code analysis and instrumentation
Just-in-time byte code analysis
Appropriate instrumentation for profiling and static analysis
Results of analysis and profiling can be used to guide the implementation
Slide 7: System architecture
Slide 8: Byte code instrumentation
instrumenting array data read instructions,
instrumenting array data write instructions,
instrumenting array data read and write instructions to count the number of accesses and their standard deviation,
instrumenting reads and writes of single variables to count the number of accesses.
Slide 9: Byte code instrumentation

Original loop:

for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_2[i] = test_1[i-1] + 10;
}

Instrumented loop (test_1_mon records the iteration that last wrote each element; dist_vectors collects the observed dependence distances):

for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_1_mon[i] = i;
    test_2[i] = test_1[i-1] + 10;
    if (test_1_mon[i-1] < i) {
        dist_vectors[i-1] = i - test_1_mon[i-1];
    }
}
27: iconst_1
28: istore %6
30: iload %6
32: bipush 100
34: if_icmpge #93
37: aload_1
38: iload %6
40: bipush 100
42: iastore
43: aload_3
44: iload %6
46: iload %6
48: iastore
...
70: if_icmpge #87
73: aload %5
75: iload %6
77: iconst_1
78: isub
79: iload %6
81: aload_3
82: iload %6
84: iaload
85: isub
86: iastore
87: iinc %6 1
90: goto #30
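The effect of the instrumentation above can be reproduced in plain Java: a shadow array records the iteration that last wrote each element, and the distance between a read and its matching write is collected into dist_vectors. The array names follow the slide; the wrapper class and method are our own illustration, not the profiler's generated code.

```java
/** Hypothetical pure-Java rendering of the instrumented loop on the slide.
 *  test_1_mon is the shadow array holding the last writing iteration of each
 *  element of test_1; dist_vectors holds the observed dependence distances. */
public class ArrayAccessMonitor {

    static int[] computeDistances(int n) {
        int[] test_1 = new int[n];
        int[] test_2 = new int[n];
        int[] test_1_mon = new int[n];   // shadow array: last writing iteration
        int[] dist_vectors = new int[n]; // observed read-after-write distances

        for (int i = 1; i < n; i++) {
            test_1[i] = 100;
            test_1_mon[i] = i;                    // record the write
            test_2[i] = test_1[i - 1] + 10;       // read of the previous element
            if (test_1_mon[i - 1] < i) {          // written by an earlier iteration
                dist_vectors[i - 1] = i - test_1_mon[i - 1];
            }
        }
        return dist_vectors;
    }

    public static void main(String[] args) {
        // every monitored read is exactly one iteration behind its write
        System.out.println(computeDistances(100)[1]);  // 1
    }
}
```

A constant distance of 1 across all iterations tells the code generator that the loop carries a uniform flow dependence, which is the information the GPU implementation rules on the next slide consume.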
Slide 10: GPU implementation rules
if data is reused between iterations (between threads), it should be transferred to shared memory,
data reused by only a single iteration should be transferred to local memory (registers),
data which is reused, read-only, and accessed irregularly should be allocated in texture memory,
Slide 11: GPU implementation rules
common constant values used by all threads should be written to constant memory,
data with a single access but without coalescing should be transferred as a group, in a coalesced manner, to shared memory and then read from there for further computation.
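The rules on these two slides can be summarized, purely as an illustration, by a decision function over an access profile. The AccessProfile fields, the ordering of the checks, and the memory names are our assumptions for the sketch, not the profiler's actual internal representation:

```java
/** Hypothetical encoding of the memory-placement rules from the slides.
 *  Each array's dynamic profile (gathered by the instrumentation) is mapped
 *  to a target GPU memory space. Illustrative only. */
public class MemoryPlacement {

    enum Memory { CONSTANT, TEXTURE, SHARED, LOCAL, GLOBAL_COALESCED_STAGING }

    static class AccessProfile {
        boolean constantForAllThreads; // same constant value for every thread
        boolean reusedAcrossThreads;   // reused between iterations/threads
        boolean reusedWithinThread;    // reused by a single iteration only
        boolean readOnly;
        boolean regularAccess;         // regular (e.g. affine) access pattern
    }

    static Memory place(AccessProfile p) {
        if (p.constantForAllThreads) return Memory.CONSTANT;
        boolean reused = p.reusedAcrossThreads || p.reusedWithinThread;
        // reused, read-only, irregular accesses: texture memory
        if (reused && p.readOnly && !p.regularAccess) return Memory.TEXTURE;
        if (p.reusedAcrossThreads) return Memory.SHARED;
        if (p.reusedWithinThread)  return Memory.LOCAL;
        // single access without coalescing: stage through shared memory
        return Memory.GLOBAL_COALESCED_STAGING;
    }

    public static void main(String[] args) {
        AccessProfile p = new AccessProfile();
        p.reusedAcrossThreads = true;
        p.regularAccess = true;
        System.out.println(place(p));  // SHARED
    }
}
```

The precedence between the texture and shared rules is our own choice for the sketch; the slides do not specify how overlapping conditions are resolved.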
Slide 12: JCuda generation
Implementation can be done manually or in a partly automated way
The rules generate parallel code patterns
Slide 13: Experimental results

Size of matrix   GPU time [ms]   CPU time (MKL BLAS) [ms]
256×256          0.4             6
512×512          4.3             24
1024×1024        34              158
2048×2048        285             956
4096×4096        2817            990
Slide 14: Conclusions and future work
Implementation preceded by source code analysis helps adapt an algorithm to the GPU
Automated generation of parallel GPU code saves a lot of time
Based on byte code = portable
Further optimizations of code generation (memory access patterns) must be added to our system
Slide 15: Questions
?