TRANSCRIPT
Improving I/O with Compiler-Supported Parallelism
Why Should We Care About I/O?
Disk access speeds are much slower than processor and memory access speeds.
Disk I/O may be a major bottleneck in applications such as:
• scientific codes related to image processing
• multimedia applications
• out-of-core computations
Computational optimizations alone may not provide any significant improvements to these programs.
Why Should Compilers Be Involved?
Compilers have knowledge of both the application and the computer architecture or operating system.
Compilers can reduce the burden on the programmer and increase code portability by requiring little to no change in the user level program to achieve good performance on different architectures.
Compilers can automatically translate programs written in high-level languages, which may lack robust I/O or operating system interfaces, into higher performance languages that provide more control over low-level system activities.
Human Neuroimaging Lab http://www.hnl.bcm.tmc.edu/
The Human Neuroimaging Laboratory at the Baylor College of Medicine conducts research in the physiology and functional anatomy of the human brain using fMRI technology.
fMRI Technology
Functional Magnetic Resonance Imaging is a technique for determining which parts of the brain are activated when a person responds to stimuli. A high-resolution brain scan is followed by a series of low-resolution scans taken at regular time intervals. Brain activity is identified by increased blood flow to specific regions of the brain.
Motivating Application
The HNL wants to optimize a preprocessing application that normalizes brain images of human subjects to a canonical brain in order to make the images comparable and enable data analysis. The program uses calls to the SPM (Statistical Parametric Mapping) library.
Anna Youssefi, Ken Kennedy
Transformation: Loop Distribution & Parallelization
[Diagram: the original loop on a single processor vs. the distributed loops on Processors 1–4; pseudocode below]
Hand transformation on I/O-intensive loop in HNL preprocessing application
The original loop reads a different input file and writes a portion of a single output file on each iteration.
The loop is distributed into two separate loops: the first loop runs in parallel on four different processors; the second loop runs sequentially across all processors.
Standard compiler transformations are implemented by hand to parallelize the loop.
Dependence analysis can be used to automate the transformation.
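The transformation described above can be sketched as follows. This is a minimal Python stand-in for the hand transformation (the actual application uses SPM library calls, and file names and the processing step here are hypothetical): the READ+PROCESS loop is distributed out and run in parallel, while the WRITE loop stays sequential because every iteration writes a portion of one shared output file.

```python
from concurrent.futures import ThreadPoolExecutor

N = 192  # iterations in the original fused loop (48 per processor with 4 workers)

def read_and_process(i):
    # READ: each iteration would open a distinct input file (hypothetical);
    # PROCESS: normalize the image slice. A stand-in computation is used here.
    return i * 2

def run_distributed(workers=4):
    # Loop 1 (distributed, parallel): READ + PROCESS across `workers` workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(read_and_process, range(N)))

    # Loop 2 (sequential): WRITE each iteration's portion, in order,
    # since all iterations target portions of a single output file.
    output = []
    for r in results:
        output.append(r)  # stand-in for writing this iteration's portion
    return output
```

Threads are used here only for illustration; the poster's hand transformation used MPI processes, and a compiler could derive the same split automatically from dependence analysis.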
Performance Results
Performance of the transformed loop was constrained by shortcomings of the MPI (Message Passing Interface) implementation we used. This implementation relies on file I/O to share data and results in excessive communication times, as demonstrated by the broadcast overhead.
Even with these performance constraints, we achieved a 30–40% improvement in running time. We expect even better results from using a different MPI implementation.
Conclusion and Future Work
Through parallelization, we achieved a minimum of 30% improvement in the running time of an I/O-intensive loop. Standard compiler transformations can be extended to reveal the parallelism in such loops. We plan to implement compiler strategies to automate these transformations.
We also plan to implement compiler support for other application-level I/O transformations, such as converting synchronous to asynchronous I/O, prefetching and overlapping I/O with computation.
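One of those transformations, overlapping I/O with computation, can be sketched with a background prefetch thread: reads for future iterations proceed while the current iteration is being processed. This is an illustrative Python sketch, not the planned compiler implementation; the bounded queue depth and the stand-in read/process steps are assumptions.

```python
import threading
import queue

def prefetch_reader(items, q):
    # Background thread: READ ahead while the main thread computes.
    for item in items:
        q.put(item)      # stand-in for reading one input file
    q.put(None)          # sentinel: no more input

def overlapped_pipeline(items, depth=2):
    # Bounded queue limits read-ahead to `depth` outstanding reads.
    q = queue.Queue(maxsize=depth)
    t = threading.Thread(target=prefetch_reader, args=(items, q))
    t.start()

    results = []
    while True:
        data = q.get()
        if data is None:
            break
        results.append(data + 1)  # PROCESS overlaps with the next READ
    t.join()
    return results
```

The same structure applies to converting synchronous to asynchronous I/O: the read call is issued early and its completion is awaited only when the data is actually needed.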
Original loop (single processor):
    for i = 1 to 192: READ, PROCESS, WRITE

Distributed loop 1 (Processors 1–4, in parallel):
    for i = 1 to 48: READ, PROCESS

Distributed loop 2 (Processors 1–4, run sequentially):
    for i = 1 to 48: WRITE
[Bar chart: running time in seconds (0–300) for the sequential loop, the parallel loop, and the parallel loop with broadcast]