TRANSCRIPT
Improving I/O with Compiler-Supported Parallelism
Why Should We Care About I/O?
Disk access speeds are much slower than processor and memory access speeds.
Disk I/O may be a major bottleneck in applications such as:
• scientific codes related to image processing
• multimedia applications
• out-of-core computations
Computational optimizations alone may not provide any significant improvements to these programs.
Why Should Compilers Be Involved?
Compilers have knowledge of both the application and the computer architecture or operating system.
Compilers can reduce the burden on the programmer and increase code portability by requiring little to no change in the user level program to achieve good performance on different architectures.
Compilers can automatically translate programs written in high-level languages, which may lack robust I/O or operating system interfaces, into higher performance languages that provide more control over low-level system activities.
Human Neuroimaging Lab http://www.hnl.bcm.tmc.edu/
The Human Neuroimaging Laboratory at the Baylor College of Medicine conducts research in the physiology and functional anatomy of the human brain using fMRI technology.
fMRI Technology
Functional Magnetic Resonance Imaging is a technique for determining which parts of the brain are activated when a person responds to stimuli. A high-resolution brain scan is followed by a series of low-resolution scans taken at regular time intervals. Brain activity is identified by increased blood flow to specific regions of the brain.
Motivating Application
The HNL wants to optimize a preprocessing application that normalizes brain images of human subjects to a canonical brain in order to make the images comparable and enable data analysis. The program uses calls to the SPM (Statistical Parametric Mapping) library.
Anna Youssefi, Ken Kennedy
Transformation: Loop Distribution & Parallelization
[Diagram: the original loop on a single processor vs. the distributed loops on Processors 1–4; pseudocode below]
Hand transformation on I/O-intensive loop in HNL preprocessing application
The original loop reads a different input file and writes a portion of a single output file on each iteration.
The loop is distributed into two separate loops: the first loop runs in parallel on four different processors; the second loop runs sequentially across all processors.
Standard compiler transformations are implemented by hand to parallelize the loop.
Dependence analysis can be used to automate the transformation.
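The transformation described above can be sketched as follows. This is a minimal Python stand-in for the hand transformation (the actual application uses SPM library calls, and file names and the processing step here are hypothetical): the READ+PROCESS loop is distributed out and run in parallel, while the WRITE loop stays sequential because every iteration writes a portion of one shared output file.

```python
from concurrent.futures import ThreadPoolExecutor

N = 192  # iterations in the original fused loop (48 per processor with 4 workers)

def read_and_process(i):
    # READ: each iteration would open a distinct input file (hypothetical);
    # PROCESS: normalize the image slice. A stand-in computation is used here.
    return i * 2

def run_distributed(workers=4):
    # Loop 1 (distributed, parallel): READ + PROCESS across `workers` workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(read_and_process, range(N)))

    # Loop 2 (sequential): WRITE each iteration's portion, in order,
    # since all iterations target portions of a single output file.
    output = []
    for r in results:
        output.append(r)  # stand-in for writing this iteration's portion
    return output
```

Threads are used here only for illustration; the poster's hand transformation used MPI processes, and a compiler could derive the same split automatically from dependence analysis.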
Performance Results
Performance of the transformed loop was constrained by shortcomings of the MPI (Message Passing Interface) implementation we used. This implementation relies on file I/O to share data and results in excessive communication times, as demonstrated by the broadcast overhead.
Even with these performance constraints, we achieved a 30–40% improvement in running time. We expect even better results from using a different MPI implementation.
Conclusion and Future Work
Through parallelization, we achieved a minimum of 30% improvement in the running time of an I/O-intensive loop. Standard compiler transformations can be extended to reveal the parallelism in such loops. We plan to implement compiler strategies to automate these transformations.
We also plan to implement compiler support for other application-level I/O transformations, such as converting synchronous to asynchronous I/O, prefetching and overlapping I/O with computation.
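One of those transformations, overlapping I/O with computation, can be sketched with a background prefetch thread: reads for future iterations proceed while the current iteration is being processed. This is an illustrative Python sketch, not the planned compiler implementation; the bounded queue depth and the stand-in read/process steps are assumptions.

```python
import threading
import queue

def prefetch_reader(items, q):
    # Background thread: READ ahead while the main thread computes.
    for item in items:
        q.put(item)      # stand-in for reading one input file
    q.put(None)          # sentinel: no more input

def overlapped_pipeline(items, depth=2):
    # Bounded queue limits read-ahead to `depth` outstanding reads.
    q = queue.Queue(maxsize=depth)
    t = threading.Thread(target=prefetch_reader, args=(items, q))
    t.start()

    results = []
    while True:
        data = q.get()
        if data is None:
            break
        results.append(data + 1)  # PROCESS overlaps with the next READ
    t.join()
    return results
```

The same structure applies to converting synchronous to asynchronous I/O: the read call is issued early and its completion is awaited only when the data is actually needed.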
Original loop (single processor):
    for i = 1 to 192: READ, PROCESS, WRITE

Distributed loop 1 (Processors 1–4, in parallel):
    for i = 1 to 48: READ, PROCESS

Distributed loop 2 (Processors 1–4, run sequentially):
    for i = 1 to 48: WRITE
[Bar chart: running time in seconds (0–300) for the sequential loop, the parallel loop, and the parallel loop with broadcast]