Parallelization of the Telemedicine Benchmark for the Xbox 360 Architecture
Howard Wong, SURF-IT Fellow
Professor Jean-Luc Gaudiot, EECS
August 29, 2008
University of California, Irvine
PASCAL: PArallel Systems and Computer Architecture Lab.
Outline
Background (Benchmark, Platform)
Current Work
Methodology (Compiler, Data Set)
Results
Conclusions
Future Work
Background
Why Parallel Programming?
  − Advent of everyday multicomputers
  − Ultimate goal: Auto-parallelization
Basic concepts
  − Problems
  − Programming primitives
Telemedicine Benchmark
Platform – Xbox 360
  − 3 Cores
  − Graphics Engine
  − Vector Processing
[Diagram: how is Work divided among Core 1, Core 2, ..., Core n?]
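As an illustration of dividing work among cores with a threading primitive, here is a minimal pthreads sketch that splits an array into contiguous slices and sums each slice in its own thread. It is not taken from the benchmark; `parallel_sum`, `struct slice`, and `MAX_THREADS` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* arbitrary upper bound for this sketch */

/* One contiguous slice of the input, summed by one thread. */
struct slice {
    const int *data;
    size_t begin, end;   /* half-open range [begin, end) */
    long sum;            /* result written by the worker */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long total = 0;
    for (size_t i = s->begin; i < s->end; i++)
        total += s->data[i];
    s->sum = total;
    return NULL;
}

/* Sum n ints using up to nthreads workers and return the total. */
long parallel_sum(const int *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct slice part[MAX_THREADS];
    int started[MAX_THREADS];
    long total = 0;

    assert(data != NULL || n == 0);
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; t++) {
        part[t].data = data;
        part[t].begin = n * t / nthreads;
        part[t].end = n * (t + 1) / nthreads;
        part[t].sum = 0;
        started[t] = pthread_create(&tid[t], NULL, sum_slice, &part[t]) == 0;
        if (!started[t])
            sum_slice(&part[t]);       /* fall back to running inline */
    }
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += part[t].sum;          /* combine per-thread results */
    }
    return total;
}
```

Calling `parallel_sum(v, 10, 3)` on the integers 1 through 10 would give each of three threads a slice of three or four elements. Because every thread writes only to its own `struct slice`, no locking is needed in this particular case.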
Current Work
Goal: Identify the parallelization process
  − Efficiency measured in performance
  − Performance in relation to load
POSIX threads (pthreads) and OpenMP
Sorting Routines
  − 'fallbackSort': Making search 'brackets'
  − 'mainSort': Dependencies between loop iterations
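To make "dependencies between loop iterations" concrete, the following sketch contrasts a loop whose iterations are independent (and can take an OpenMP `parallel for`) with one that carries a dependency from each iteration to the next. Neither function comes from the bzip2 sources; `square_all` and `prefix_sum` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so the
 * iterations may run in any order or on different threads. */
void square_all(const int *in, int *out, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* Loop-carried dependency: a[i] needs the already-updated a[i - 1],
 * so this loop is left serial. Parallelizing it would require
 * restructuring the algorithm, not just adding a pragma. */
void prefix_sum(int *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

When compiled with gcc's `-fopenmp` flag the first loop is shared among threads; without the flag the pragma is ignored and the code runs serially with the same result.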
Methodology
Compilation
  − gcc or g++ version 4.2
Data Sets
  − Monkey brain image in PPM format
  − Derived data via netpbm
Test Platform
  − Xbox 360 with Ubuntu Linux
Images courtesy of Neuroscience Center, UC Davis, and Joerg Meyer, Center of GRAVITY, Calit2, UC Irvine.
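For readers unfamiliar with the PPM format, here is a small sketch that reads the header of a binary ("P6") PPM and computes the size of its pixel data. It is only an illustration of the file layout, not part of the benchmark; `ppm_parse_header`, `ppm_raster_bytes`, and `struct ppm_info` are hypothetical names, and `#` comment lines in the header are not handled.

```c
#include <assert.h>
#include <stdio.h>

struct ppm_info { int width, height, maxval; };

/* Parse "P6 <width> <height> <maxval>" from the header text.
 * Returns 0 on success, -1 if the text is not a valid P6 header.
 * Simplification: '#' comment lines are not supported. */
int ppm_parse_header(const char *header, struct ppm_info *out)
{
    assert(header != NULL && out != NULL);
    if (sscanf(header, "P6 %d %d %d",
               &out->width, &out->height, &out->maxval) != 3)
        return -1;
    if (out->width <= 0 || out->height <= 0 ||
        out->maxval <= 0 || out->maxval > 65535)
        return -1;
    return 0;
}

/* Pixel data size: 3 samples (R, G, B) per pixel, with 1 byte per
 * sample when maxval < 256 and 2 bytes per sample otherwise. */
long ppm_raster_bytes(const struct ppm_info *p)
{
    int bytes_per_sample = p->maxval < 256 ? 1 : 2;
    return 3L * p->width * p->height * bytes_per_sample;
}
```

Because the raster is uncompressed, its size grows directly with the image dimensions, which is what makes derived sub-images a convenient way to vary the load.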
Initial Results
[Figure: Speedup versus Number of Threads. Compression of brains.ppm; compared to bzip2. Series: bzip2mod, Linear. x-axis: No. of Threads (0 to 4); y-axis: Speedup (0.000 to 3.500).]
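The speedup plotted here is assumed to follow the usual definition: the run time of the serial reference (bzip2) divided by the run time of the parallel version. A minimal sketch of that arithmetic follows; `speedup` and `efficiency` are hypothetical helper names, and any timings passed in are the caller's own measurements.

```c
#include <assert.h>

/* Speedup: serial run time divided by parallel run time.
 * A value below 1.0 means the parallel version is slower. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup per thread. A value of 1.0
 * corresponds to ideal linear scaling. */
double efficiency(double t_serial, double t_parallel, int nthreads)
{
    assert(nthreads > 0);
    return speedup(t_serial, t_parallel) / nthreads;
}
```

For example, a made-up serial time of 3.0 s and a parallel time of 1.5 s on two threads gives a speedup of 2.0 and an efficiency of 1.0, which would sit on the linear reference line.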
Analysis
Possible thread contention
  − 'bitmap' of data as former optimization
  − Optimized for long runs of 0's or 1's
  − Extra mutex locks required
Thread Creation
  − Sorting algorithm called at least 300 times for the large image
  − Thread creation efficiency
  − Thread management structures
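The contention point can be sketched as follows: when several threads set bits in one shared bitmap, each update is a read-modify-write of a shared byte and needs a lock, so the threads queue up on the same mutex. This is an illustrative model only, not bzip2's actual data structure; `struct shared_bitmap`, `bitmap_set`, and `bitmap_fill_parallel` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NBITS 1024        /* arbitrary bitmap size for this sketch */
#define MAX_WORKERS 8

struct shared_bitmap {
    unsigned char bits[NBITS / 8];
    pthread_mutex_t lock;           /* one lock guards every byte */
};

struct bit_job { struct shared_bitmap *bm; size_t first, stride; };

void bitmap_init(struct shared_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
    pthread_mutex_init(&bm->lock, NULL);
}

/* Setting a bit is a read-modify-write of a shared byte, so it is locked. */
void bitmap_set(struct shared_bitmap *bm, size_t i)
{
    assert(i < NBITS);
    pthread_mutex_lock(&bm->lock);
    bm->bits[i / 8] |= (unsigned char)(1u << (i % 8));
    pthread_mutex_unlock(&bm->lock);
}

static void *set_strided(void *arg)
{
    struct bit_job *job = arg;
    for (size_t i = job->first; i < NBITS; i += job->stride)
        bitmap_set(job->bm, i);
    return NULL;
}

/* Worker t sets bits t, t + nworkers, ...; neighbouring bits in the same
 * byte therefore belong to different threads. Returns the number of bits
 * set afterwards. */
size_t bitmap_fill_parallel(struct shared_bitmap *bm, int nworkers)
{
    pthread_t tid[MAX_WORKERS];
    struct bit_job job[MAX_WORKERS];
    int started[MAX_WORKERS];
    size_t set = 0;

    assert(nworkers >= 1 && nworkers <= MAX_WORKERS);
    for (int t = 0; t < nworkers; t++) {
        job[t] = (struct bit_job){ bm, (size_t)t, (size_t)nworkers };
        started[t] = pthread_create(&tid[t], NULL, set_strided, &job[t]) == 0;
        if (!started[t])
            set_strided(&job[t]);   /* fall back to running inline */
    }
    for (int t = 0; t < nworkers; t++)
        if (started[t])
            pthread_join(tid[t], NULL);
    for (size_t i = 0; i < NBITS; i++)
        set += (bm->bits[i / 8] >> (i % 8)) & 1u;
    return set;
}
```

Every call to `bitmap_set` takes the same mutex, so adding threads adds waiting rather than throughput. A compact bit-packed layout that helps a single-threaded program can in this way become a bottleneck once it is shared.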
Results (Cont’d)
[Figure: Speedup versus Load (pbzip2 - 3 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (2.800 to 3.050).]
[Figure: Speedup versus Load (bzip2mod - 2 Threads). Compared to bzip2; 1/4, 1/2, whole image. x-axis: Fraction of Image Processed (0.000 to 1.000); y-axis: Speedup (0.630 to 0.690).]
Conclusions & Discussion
Speedup dependent on the load size
Possible improvements
  − Use a 'threadpool'
  − Create other important compression functions
  − Examine alternative algorithms with a parallel mindset
End result
  − Thread creation
  − Thread management overhead
  − Heavy contention
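One way the 'threadpool' idea could look in pthreads is sketched below: a fixed set of workers is created once and then reused for every task, so a routine called hundreds of times no longer pays for thread creation on each call. This is an assumption-laden illustration, not the project's implementation; all names (`struct pool`, `pool_start`, `pool_submit`, `pool_stop`, `pool_demo`) and the constants are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_THREADS 3   /* assumption: one worker per Xbox 360 core */
#define QUEUE_CAP 512    /* arbitrary; this sketch never grows the queue */
#define DEMO_MAX 300     /* echoes the 300 or more sort calls noted earlier */

struct task { void (*fn)(void *); void *arg; };
struct pool {
    pthread_t workers[POOL_THREADS];
    int nworkers;                 /* threads actually started */
    struct task queue[QUEUE_CAP]; /* ring buffer of pending tasks */
    size_t head, count;
    int stopping;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
};

/* Each worker is created once, then loops pulling tasks off the queue. */
static void *worker_main(void *p)
{
    struct pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stopping)
            pthread_cond_wait(&pool->not_empty, &pool->lock);
        if (pool->count == 0) {   /* stopping and queue drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        struct task t = pool->queue[pool->head];
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        pthread_mutex_unlock(&pool->lock);
        t.fn(t.arg);              /* run the task outside the lock */
    }
}

/* Start the workers; returns how many threads were created. */
int pool_start(struct pool *pool)
{
    pool->head = pool->count = 0;
    pool->stopping = pool->nworkers = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    for (int i = 0; i < POOL_THREADS; i++)
        if (pthread_create(&pool->workers[pool->nworkers], NULL, worker_main, pool) == 0)
            pool->nworkers++;
    return pool->nworkers;
}

/* Queue one task; the caller must not exceed QUEUE_CAP pending tasks. */
void pool_submit(struct pool *pool, void (*fn)(void *), void *arg)
{
    pthread_mutex_lock(&pool->lock);
    assert(pool->count < QUEUE_CAP);
    pool->queue[(pool->head + pool->count) % QUEUE_CAP] = (struct task){ fn, arg };
    pool->count++;
    pthread_cond_signal(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
}

/* Let queued tasks finish, then join the workers. */
void pool_stop(struct pool *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->stopping = 1;
    pthread_cond_broadcast(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->nworkers; i++)
        pthread_join(pool->workers[i], NULL);
    pthread_cond_destroy(&pool->not_empty);
    pthread_mutex_destroy(&pool->lock);
}

static void mark_done(void *arg) { *(int *)arg = 1; }

/* Demo: run ntasks tiny tasks; returns how many completed, or -1. */
int pool_demo(int ntasks)
{
    static int done[DEMO_MAX];
    struct pool pool;
    int completed = 0;
    assert(ntasks >= 0 && ntasks <= DEMO_MAX);
    if (pool_start(&pool) == 0)
        return -1;
    for (int i = 0; i < ntasks; i++) {
        done[i] = 0;
        pool_submit(&pool, mark_done, &done[i]);
    }
    pool_stop(&pool);
    for (int i = 0; i < ntasks; i++)
        completed += done[i];
    return completed;
}
```

In `pool_demo(300)`, three threads are created in total instead of one or more per task. The queue itself is still protected by a single mutex, so a pool addresses thread creation cost but not the contention problem.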
Questions for Future Work
What is the impact of thread creation?
Do the other TMB programs have the same features?
Can vector instructions improve program performance?
Are new, more efficient parallel programming primitives needed for our application?
Acknowledgments
Professor Jean-Luc Gaudiot and the PASCAL group
UC Davis Neuroscience Center
Professor Joerg Meyer, Center of GRAVITY, Calit2
Calit2 UROP