CSC 7600 Lecture 18: Applied Parallel Algorithms 4, Spring 2009
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 4
Dr. Hartmut Kaiser, Department of Computer Science, Louisiana State University, March 19th, 2009
Topics
Fourier Transforms
• Fourier analysis
• Discrete Fourier transform
• Fast Fourier transform
• Parallel Implementation
Parallel Sorting
• Bubble Sort
• Merge Sort
• Heap Sort
• Quick Sort
Puzzle of the Day
void copy(char *to, char const *from, int count)
{
    int n = (count + 3) / 4;
    switch (count % 4) {
    case 0: do { *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
Duff's device: what is going on here?
'case' defines jump labels only!
Missing 'break' makes code 'fall through'
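To make the puzzle concrete, here is a minimal sketch that places the device next to a plain byte-by-byte loop. The names `copy_simple` and `copy_duff` are illustrative, and the `count == 0` guard is an addition that is not on the slide: without it the do-while body runs once and copies four bytes even when nothing should be copied.

```c
#include <assert.h>

/* Reference version: one byte per loop iteration. */
void copy_simple(char *to, char const *from, int count)
{
    while (count-- > 0)
        *to++ = *from++;
}

/* Duff's device as on the slide, plus an added guard for count == 0. */
void copy_duff(char *to, char const *from, int count)
{
    assert(count >= 0);
    if (count == 0)
        return;
    int n = (count + 3) / 4;          /* number of passes through the loop   */
    switch (count % 4) {              /* jump into the middle of the loop    */
    case 0: do { *to++ = *from++;     /* later passes always copy four bytes */
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

For `count = 5` the switch jumps to `case 1`, copies one byte, and the loop then makes one full pass of four copies, so the loop test runs twice instead of five times.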
Time and Frequency Domain Representation of Signals
http://robots.freehostia.com/Radio/Image137.gif
• Two ways of looking at the same signal
Example 1: Time and frequency domain representations of a sine wave
http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf
Example 2
Time and frequency domain representations of a 4Hz + 12Hz Sine Wave
http://www.theparticle.com/cs/bc/mcs/signalnotes.pdf
Fourier Analysis
• Fourier analysis: Represent continuous functions by potentially infinite series of sine and cosine functions
http://zone.ni.com/cms/images/devzone/tut/a/8c34be30580.gif
NOTE: The signal sum is composed from sine and cosine functions
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
Nice demo: http://www.e-mri.org/image-formation/fourier-transform.html
Fourier Representation of Square Wave
• Spectrum extends to infinity
• As we move from left to right on the frequency axis, the amplitude of the components decreases monotonically
http://www.engr.colostate.edu/~dga/mechatronics/figures/4-5.gif
• Synthesis of a square wave (of zero DC component) from its frequency domain components
• Ideal square wave is represented by the thick black line
http://mathworld.wolfram.com/FourierSeriesSquareWave.html
Digital Signals
• Digital signal: a signal that is both discrete in time and quantized in amplitude
• Digital signals can be obtained by sampling analog signals
• The figure represents an analog-to-digital converter that performs sampling and quantization
http://www.solarisnetwork.com/term_Digital%20signal
http://www.cdt.luth.se/~johnny/courses/smd074_1999_2/CodingCompression/kap27/slide2.html
Digital Signal Processing
• Processing of digital signals with the help of a computer
Figure: Continuous Input → A/D Converter → Digital Signal Processing → D/A Converter → Continuous Output
http://www.ece.rochester.edu/courses/ECE446/Introduction%20to%20Digital%20Signal%20Processing.pdf
Advantages of Digital Signal Processing
• A digital system can simply be reprogrammed for other applications, ported to different hardware, or duplicated (reconfiguring an analog system means hardware redesign, testing, and verification)
• DSP provides better control of accuracy requirements (an analog system depends on strict component tolerances, and its response may drift with temperature)
• Digital signals can be easily stored without deterioration (analog signals are not easily transportable and often can't be processed off-line)
• More sophisticated signal processing algorithms can be implemented (it is difficult to perform precise mathematical operations in analog form)
Adapted from http://www-sigproc.eng.cam.ac.uk/~op205/3F3_1_Introduction_to_DSP.pdf
Why use Discrete Fourier Transform?
• Digital Signal Processing applications often require mapping of data in the time domain to its frequency domain counterparts
• Many applications in science and engineering
Example 1
• Spectrogram of Speech Signal
NOTE: Spectrogram is a 3D representation of signal amplitude vs time and frequency
http://www.visualizationsoftware.com/gram.html
http://ccrma.stanford.edu/~jos/st/Spectrogram_Speech.html
Example 2
• Removing blemishes from a photograph
DFT is used to convert image data from the spatial (2D) domain to the frequency domain before filtering, and to convert it back to the spatial domain afterwards
To filter an image in the frequency domain:
1. Compute F(u,v), the DFT of the image
2. Multiply F(u,v) by a filter function H(u,v)
3. Compute the inverse DFT of the result
Figure: Output of different Gaussian low-pass filters for removing blemishes
Adapted from www.comp.dit.ie/bmacnamee/materials/dip/lectures/ImageProcessing7-FrequencyFiltering.ppt
Discrete Fourier Transform (Qualitative)
• Discrete Fourier transform: Map a sequence over time to another sequence over frequency
– Signal strength as a function of time
– Fourier coefficients as a function of frequency
DFT Example (1/4)
16 data points representing signal strength over time
DFT Example (2/4)
DFT yields amplitudes and frequencies of sine/cosine functions
DFT Example (3/4)
Plot of four constituent sine/cosine functions and their sum
DFT Example (4/4)
Continuous function and original 16 samples
Formal Definition of DFT
• The DFT of a discrete signal x[n] with N sample points is defined as
$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{2\pi i \, nk / N}, \quad \text{for } 0 \le k < N$$
• Direct implementation of this equation requires $N^2$ complex additions and multiplications
NOTE: The DFT of an N-point sequence gives N points in the transform domain
http://cas.ensmp.fr/~chaplais/wavetour_presentation/transformees/Fourier/FFTUS.html
• Complex plane, relation of different powers of ω
Figure: the eight powers $\omega_8^k = e^{2\pi i k / 8}$, $k = 0, \dots, 7$, plotted on the unit circle in the complex plane (axes re and im, origin at 0,0)
Computing DFT
• Writing the previous definition of DFT in matrix form
• Matrix-vector product $F_n x$
– $x$ is the input vector (signal samples)
– Each element of $F_n$ is $f_{i,j} = \omega_n^{ij}$ for $0 \le i, j < n$, where $\omega_n$ is the primitive nth root of unity
NOTE: $\omega_n$ is a complex number defined as $\omega_n = e^{2\pi i / n}$
Example 1
How to compute the DFT of a vector having two elements?
• Example Vector: (2, 3)
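• $\omega_2$, the primitive square root of unity, is $-1$
$$\begin{pmatrix} \omega_2^{0 \cdot 0} & \omega_2^{0 \cdot 1} \\ \omega_2^{1 \cdot 0} & \omega_2^{1 \cdot 1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 5 \\ -1 \end{pmatrix}$$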
Example 2
How to compute the DFT of a vector having four elements?
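• Example vector: (1, 2, 4, 3)
• The primitive 4th root of unity is $i$
$$\begin{pmatrix} \omega_4^0 & \omega_4^0 & \omega_4^0 & \omega_4^0 \\ \omega_4^0 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ \omega_4^0 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ \omega_4^0 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 4 \\ 3 \end{pmatrix} = \begin{pmatrix} 10 \\ -3-i \\ 0 \\ -3+i \end{pmatrix}$$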
Why Fast Fourier Transform (FFT)?
• Reduce the computational operations required
• Straightforward implementation: Θ(n²)
• Fast Fourier transform: Θ(n log n)
– n log n << n² for large values of n
Fast Fourier Transform
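Fourier matrix $F_N$ can be decomposed into half-size Fourier matrices $F_{N/2}$:
$$F_N = \begin{pmatrix} I_{N/2} & D_{N/2} \\ I_{N/2} & -D_{N/2} \end{pmatrix} \begin{pmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{pmatrix} P_N$$
– $I$: identity matrix
– $P_N$: permutation matrix (row reordering: first even, then odd rows)
– $D_{N/2} = \mathrm{diag}(1, \omega_N, \omega_N^2, \dots, \omega_N^{N/2-1})$
Example (N = 4):
$$F_4 = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & i & i^2 & i^3 \\ 1 & i^2 & i^4 & i^6 \\ 1 & i^3 & i^6 & i^9 \end{pmatrix}$$
$$F_4 = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -i \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$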
• Based on divide-and-conquer strategy
• Suppose we want to compute f(x)
$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1}$$
• We define two new functions, $f^{[0]}$ and $f^{[1]}$
$$f^{[0]}(x) = a_0 + a_2 x + a_4 x^2 + \dots + a_{n-2} x^{n/2-1}$$
$$f^{[1]}(x) = a_1 + a_3 x + a_5 x^2 + \dots + a_{n-1} x^{n/2-1}$$
NOTE: Different FFT implementations exist
FFT (Cont…)
• Note: $f(x) = f^{[0]}(x^2) + x \, f^{[1]}(x^2)$
• Problem of evaluating $f(x)$ at n values of ω reduces to
a) Evaluating $f^{[0]}(x)$ and $f^{[1]}(x)$ at n/2 values of ω. That is, computing $f(x)$ at the points $\omega_n^0, \omega_n^1, \omega_n^2, \dots, \omega_n^{n-1}$ becomes evaluating $f^{[0]}$ and $f^{[1]}$ at $(\omega_n^0)^2, (\omega_n^1)^2, (\omega_n^2)^2, \dots, (\omega_n^{n/2-1})^2$
b) Performing $f^{[0]}(x^2) + x \, f^{[1]}(x^2)$
• Leads to a recursive algorithm with time complexity Θ(n log n)
Recursive Sequential Implementation of FFT
Recursive_FFT(a, n)
Parameter   n              Number of elements in a
            a[0…(n-1)]     Coefficients
Local       ω_n            Primitive nth root of unity
            ω              Evaluate polynomial at this point
            a^[0]          Even-numbered coefficients
            a^[1]          Odd-numbered coefficients
            y              Result of transform
            y^[0]          Result of FFT of a^[0]
            y^[1]          Result of FFT of a^[1]
if n = 1 then
    return a
else
    ω_n ← e^(2πi/n)
    ω ← 1
    a^[0] ← (a[0], a[2], …, a[n-2])
    a^[1] ← (a[1], a[3], …, a[n-1])
    y^[0] ← Recursive_FFT(a^[0], n/2)
    y^[1] ← Recursive_FFT(a^[1], n/2)
    for k ← 0 to n/2 - 1 do
        y[k] ← y^[0][k] + ω * y^[1][k]
        y[k+n/2] ← y^[0][k] - ω * y^[1][k]
        ω ← ω * ω_n
    end for
    return y
endif
Iterative Implementation Preferable
• Well-written iterative version performs fewer index computations than recursive version
• Iterative version evaluates key common sub-expression only once
• It is easier to derive a parallel FFT algorithm when the sequential algorithm is in iterative form
Recursive → Iterative (1/4)
We now discuss the derivation of an iterative algorithm starting with the recursive one
Recursive implementation of FFT for the input sequence (1,2,4,3) is shown below
Figure: call tree of fft(1,2,4,3)
– fft(1,2,4,3) calls fft(1,4) and fft(2,3) and returns (10,-3-i,0,-3+i)
– fft(1,4) calls fft(1) and fft(4) and returns (5,-3)
– fft(2,3) calls fft(2) and fft(3) and returns (5,-1)
– fft(1), fft(4), fft(2), fft(3) return (1), (4), (2), (3)
• Each rounded rectangle indicates an fft function call
• The function goes on dividing the vector in half until a scalar is obtained (NOTE: the DFT of a scalar is the scalar itself)
• The values returned as the result of each function call are indicated on the curved arrows
Recursive → Iterative (2/4)
• Determining which computations are performed for each function invocation
• For each rounded rectangle, the computation is of the form x + y(z) and x - y(z), which corresponds to the following statements of the recursive algorithm
    y[k] ← y^[0][k] + ω * y^[1][k]
    y[k+n/2] ← y^[0][k] - ω * y^[1][k]
Figure: computation performed in each function call
– fft(1,2,4,3): 5+1*5, -3+i*(-1), 5-1*5, -3-i*(-1)
– fft(1,4): 1+1*4, 1-1*4
– fft(2,3): 2+1*3, 2-1*3
– Scalar calls: 1, 4, 2, 3
Recursive → Iterative (3/4)
• This diagram tracks the propagation of data values (input vector at the bottom and FFT output at the top)
• Permutation stage: Index i of the input vector is replaced by rev(i), where rev(i) is the binary value of i read in reverse order (00=>00, 01=>10, 10=>01, 11=>11)
Figure: data flow from bottom to top
– Input vector: 1 2 4 3
– After permutation: 1 4 2 3
– First stage: 1+1*4, 1-1*4, 2+1*3, 2-1*3, giving 5, -3, 5, -1
– Second stage: 5+1*5, -3+i*(-1), 5-1*5, -3-i*(-1), giving 10, -3-i, 0, -3+i
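The permutation stage can be sketched in a few lines. The helper names `bit_reverse` and `permute_bit_reversed` are illustrative; the sketch assumes the vector length is 2^bits.

```c
#include <assert.h>

/* Reverse the lowest `bits` bits of i, e.g. bits = 3: 001 -> 100, 110 -> 011. */
unsigned bit_reverse(unsigned i, int bits)
{
    assert(bits >= 0 && bits < 32);
    unsigned r = 0;
    for (int b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);   /* move the lowest bit of i onto r */
        i >>= 1;
    }
    return r;
}

/* Permutation stage: out[i] = in[rev(i)] for a vector of 2^bits values. */
void permute_bit_reversed(const int *in, int *out, int bits)
{
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; ++i)
        out[i] = in[bit_reverse(i, bits)];
}
```

Applied to the input vector (1, 2, 4, 3) with bits = 2, the permutation produces (1, 4, 2, 3).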
Recursive → Iterative (4/4)
• Initially, the scalars are simply forwarded upwards as the DFT of a scalar is the scalar itself
• For other stages, computation of the output is performed using two values forwarded from the previous stage
• The arrows depicting data flow form butterfly patterns
• An iterative algorithm can be deduced from the previous diagram
• The computation represented in each row (excluding the bottommost row) corresponds to one iteration of the algorithm
• Hence log(n) iterations should be performed (log(4)=2 in the previous example)
• For each iteration the algorithm modifies the value of every index (here n indices)
Stages of Parallel Program Design
• Partition
– Divide problem into tasks
• Communicate
– Determine amount and pattern of communication
• Agglomerate
– Combine tasks
• Map
– Assign agglomerated tasks to processors
• Efficiency analysis
Adapted from http://nereida.deioc.ull.es/html/openmp/minnesotatutorial/content_openMP.html
Parallel FFT Program Design
• Domain decomposition
– Associate primitive task with each element of input vector a and corresponding element of output vector y
• Add channels to handle communications between tasks
FFT Task/Channel Graph (n=8)
• Long rounded rectangles represent tasks and arrows indicate communication between processes
Steps:
• Permute vector as follows (000=>000, 001=>100, …, 110=>011, 111=>111)
• Perform log(n) iterations (log(8)=3)
– stage 1 completed after iteration 1
– stage 2 completed after iteration 2
– stage 3 completed after iteration 3 (vector y after stage 3 gives the output)
NOTE: Vector y will contain the intermediate results of stage 1 and stage 2
Figure: task/channel graph with columns labeled stage 1, stage 2, stage 3
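Which task exchanges data with which in each stage can be sketched with one XOR. This assumes the stages follow the iterative algorithm on the permuted vector, so that stage 1 pairs neighbours, stage 2 pairs indices two apart, and stage 3 pairs indices four apart; the function name `butterfly_partner` is hypothetical.

```c
#include <assert.h>

/* Index whose value is combined with index i in stage `stage`
   (1-based) of an n-point butterfly, n a power of two. */
int butterfly_partner(int i, int stage, int n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    assert(i >= 0 && i < n);
    assert(stage >= 1 && (1 << stage) <= n);
    return i ^ (1 << (stage - 1));   /* flip bit (stage - 1) of i */
}
```

For n = 8, task 0 is paired with task 1 in stage 1, task 2 in stage 2, and task 4 in stage 3.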
Diagrammatic Representation of Profiling Results
Conventions:
C: represents a function compute(args) that accepts the propagated values and performs the computation x + y(z) and x - y(z) (refer to the slide "FFT (Cont…)")
S: represents the MPI_Send(args) command
R: represents the MPI_Recv(args) command
P: represents the function permute(args), which is basically permute(args) { ……… MPI_Send(args) ……… }
Unlabeled symbol: represents the time for which the process is idle
http://www.cs.uoregon.edu/research/paracomp/tau/tauprofile/images/petsc/
Figure: timeline of processes P0 to P7 across the Permutation Phase, Stage 1, Stage 2 and Stage 3, ending in outputs y[0] to y[7]; each row is a sequence of P, S, R and C symbols
NOTE: The diagram is oversimplified to enhance understanding of the butterfly diagram
Agglomeration and Mapping
• Agglomerate primitive tasks associated with contiguous elements of vector to reduce communication
• Map one agglomerated task to each process
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
![Page 49: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/49.jpg)
After Agglomeration, Mapping
[Figure labels: Input, Output]
• In general, an n-point FFT can be implemented on a multicomputer supporting p processes
• In this case, n=16 and p=4:
– a[0], a[1], a[2], a[3]: process 1
– a[4], a[5], a[6], a[7]: process 2
– and so on
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
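The block mapping in the n=16, p=4 example can be sketched in C as follows. This is only an illustration: the helper names block_owner, block_start and print_mapping are hypothetical, ranks are 0-based internally (the slide counts processes from 1), and p is assumed to divide n.

```c
#include <stdio.h>

/* Hypothetical helper: 0-based rank of the process that owns element i
 * when n elements are split into p contiguous blocks of n/p elements.
 * Assumes p divides n, as in the n=16, p=4 example. */
int block_owner(int i, int n, int p)
{
    return i / (n / p);
}

/* First global index held by process `rank` (same assumptions). */
int block_start(int rank, int n, int p)
{
    return rank * (n / p);
}

/* Print the element range of every process, numbering processes
 * from 1 as the slide does. */
void print_mapping(int n, int p)
{
    for (int rank = 0; rank < p; rank++) {
        int lo = block_start(rank, n, p);
        printf("process %d: a[%d] .. a[%d]\n", rank + 1, lo, lo + n / p - 1);
    }
}
```

Calling print_mapping(16, 4) lists four processes with four contiguous elements each.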
![Page 50: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/50.jpg)
Phases of Parallel FFT Algorithm
• Phase 1: Processes permute a’s (all-to-all communication)
• Phase 2:
– First log n – log p iterations of FFT
– No message passing is required
• Phase 3:
– Final log p iterations
– Processes organized as logical hypercube
– In each iteration every process swaps values with partner across a hypercube dimension
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
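The iteration counts of phases 2 and 3 and the choice of swap partner can be sketched in C. The helper names are hypothetical, n and p are assumed to be powers of two, and the order in which the hypercube dimensions are visited is not specified on the slide, so only the partner computation for a given dimension is shown.

```c
/* Integer log2 for powers of two (hypothetical helper). */
int ilog2(int x)
{
    int k = 0;
    while (x > 1) {
        x >>= 1;
        k++;
    }
    return k;
}

/* Phase 2: number of FFT iterations that need no message passing. */
int local_iterations(int n, int p)
{
    return ilog2(n) - ilog2(p);
}

/* Phase 3: number of iterations in which processes swap values. */
int exchange_iterations(int p)
{
    return ilog2(p);
}

/* Partner of `rank` across hypercube dimension d: flip bit d of the rank. */
int hypercube_partner(int rank, int d)
{
    return rank ^ (1 << d);
}
```

For n=16 and p=4 this gives two local iterations followed by two exchange iterations. Flipping the same bit twice returns the original rank, so the partner relation is symmetric.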
![Page 51: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/51.jpg)
Computation Complexity Analysis
• Sequential complexity: Θ(n log n)
• Each process performs an equal share of the computation
• Hence the computation complexity of the parallel implementation is Θ(n log n / p)
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
![Page 52: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/52.jpg)
Communication Complexity Analysis
• A maximum of ceil(n / p) elements of the vector are associated with a process
• In the all-to-all communication stage, every process swaps about n/p values with its counterpart
– Time complexity: Θ((n/p) log p)
• A total of log p iterations need communication with other processes (on average n/p swaps each)
– Time complexity: Θ((n/p) log p)
• Hence the total communication complexity of the parallel implementation is Θ((n/p) log p)
Adapted from slides (and text) of Parallel Programming in C with MPI and OpenMP by Michael Quinn
![Page 54: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/54.jpg)
Parallel Sorting
• Finding a permutation of a sequence [a1, a2, ..., an] such that a1 <= a2 <= … <= an
• Often we sort records based on a key
• Parallel sort results in:
– Partial sequences are sorted on all nodes
– Largest value on node N-1 is less than or equal to smallest value on node N
• Several ways to parallelize
– Chunk sequence, sort locally, merge back (bubblesort)
– Project algorithm structure onto communication and distribution scheme (quicksort)
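The two conditions a parallel sort must establish can be written as a check. The sketch below is a single-process illustration in C, not MPI code: the per-node data is modelled as an array of pointers, and the names is_sorted and is_globally_sorted are hypothetical.

```c
#include <stddef.h>

/* Returns 1 if v[0..n-1] is in non-decreasing order, 0 otherwise. */
int is_sorted(const int *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (v[i - 1] > v[i])
            return 0;
    return 1;
}

/* chunks[k] points to the data held by node k, counts[k] is its length.
 * Returns 1 if every chunk is sorted and the largest value on each node
 * is <= the smallest value on the next non-empty node. Empty chunks
 * are skipped. */
int is_globally_sorted(const int *const *chunks, const size_t *counts,
                       size_t nodes)
{
    int have_prev = 0;
    int prev_max = 0;

    for (size_t k = 0; k < nodes; k++) {
        if (counts[k] == 0)
            continue;
        if (!is_sorted(chunks[k], counts[k]))
            return 0;
        if (have_prev && prev_max > chunks[k][0])
            return 0;
        prev_max = chunks[k][counts[k] - 1];
        have_prev = 1;
    }
    return 1;
}
```

Because each chunk is checked to be sorted first, its first and last elements are its smallest and largest, so comparing only the boundary values is sufficient.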
![Page 55: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/55.jpg)
Bubble Sort
• The bubble sort is one of the oldest and simplest sorts in use. Unfortunately, it's also one of the slowest.
• The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort runs in linear O(n) time. The general case is O(n^2).
Pros: Simplicity and ease of implementation.
Cons: Extremely inefficient.
Reference: http://math.hws.edu/TMCM/java/xSortLab/
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/bubblesort.c
http://www.sci.hkbu.edu.hk
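The O(n) best case only applies to a variant that stops after a pass with no swaps, as the description above states. A minimal C sketch of that variant follows; the function name bubble_sort_early_exit and its pass-count return value are assumptions made for illustration, not taken from the referenced source file.

```c
#include <stddef.h>

/* Sorts v[0..n-1] in ascending order and returns the number of passes
 * made. Stops as soon as a full pass makes no swap, so an already
 * sorted array costs a single pass. */
size_t bubble_sort_early_exit(int *v, size_t n)
{
    size_t passes = 0;
    int swapped = 1;

    while (swapped && n > 1) {
        swapped = 0;
        for (size_t j = 0; j + 1 < n; j++) {
            if (v[j] > v[j + 1]) {
                int tmp = v[j];
                v[j] = v[j + 1];
                v[j + 1] = tmp;
                swapped = 1;
            }
        }
        passes++;
        n--; /* the largest remaining value is now in its final place */
    }
    return passes;
}
```

On a sorted input the function returns after one pass; on a reversed input of n items it needs n-1 passes.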
![Page 56: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/56.jpg)
Bubblesort
void sort(int *v, int n)
{
    int i, j;
    /* after each outer pass, v[i+1] holds its final value */
    for (i = n-2; i >= 0; i--)
        for (j = 0; j <= i; j++)
            if (v[j] > v[j+1]) {
                /* swap neighbours that are out of order */
                int tmp = v[j];
                v[j] = v[j+1];
                v[j+1] = tmp;
            }
}
![Page 58: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/58.jpg)
Discussion
• Bubble sort takes time proportional to N*N/2 for N data items
• This parallelization splits N data items into P chunks of N/P items, so time on one of the P processors is now proportional to (N/P*N/P)/2
– i.e. have reduced time by a factor of P*P!
• Bubble sort is much slower than quick sort!
– Better to run quick sort on single processor than bubble sort on many processors!
http://www.sci.hkbu.edu.hk
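The factor of P*P can be checked with a small calculation. The sketch below counts the comparisons made by the simple bubble sort (n*(n-1)/2 for n items) and ignores the cost of merging the sorted chunks; the function names are hypothetical and P is assumed to divide N.

```c
/* Comparisons made by the simple bubble sort on n items: n*(n-1)/2. */
long bubble_comparisons(long n)
{
    return n * (n - 1) / 2;
}

/* Comparisons one process makes when it sorts its own chunk of n/p
 * items (assumes p divides n; merging cost is not included). */
long chunk_comparisons(long n, long p)
{
    return bubble_comparisons(n / p);
}
```

For N=1000 and P=4 this gives 499500 comparisons sequentially and 31125 per process, a ratio of about 16, which matches P*P.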
![Page 60: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/60.jpg)
Merge Sort
• The merge sort splits the list to be sorted into two equal halves, and places them in separate arrays.
• Each array is recursively sorted, and then merged back together to form the final sorted list.
• Like most recursive sorts, the merge sort has an algorithmic complexity of O(n log n).
• Elementary implementations of the merge sort make use of three arrays - one for each half of the data set and one to store the sorted list in. The algorithm below merges the arrays in-place, so only two arrays are required. There are non-recursive versions of the merge sort, but they don't yield any significant performance enhancement over the recursive algorithm on most machines.
Pros: Marginally faster than the heap sort for larger sets.
Cons: At least twice the memory requirements of the other sorts; recursive.
Reference: http://math.hws.edu/TMCM/java/xSortLab/
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/mergesort.c
![Page 61: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/61.jpg)
Merge Sort
[cdekate@celeritas sort]$ mpirun -np 4 mergesort
1000000; 4 processors; 0.250000 secs
[cdekate@celeritas sort]$
![Page 62: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/62.jpg)
Mergesort
/* merge() is not shown on this slide */
void msort(int *A, int min, int max)
{
    int *C; /* dummy, just to fit the function */
    int mid = (min+max)/2;
    int lowerCount = mid - min + 1;
    int upperCount = max - mid;

    /* If the range consists of a single element, it's already sorted */
    if (max == min) {
        return;
    } else {
        /* Otherwise, sort the first half */
        msort(A, min, mid);
        /* Now sort the second half */
        msort(A, mid+1, max);
        /* Now merge the two halves */
        C = merge(A + min, lowerCount, A + mid + 1, upperCount);
    }
}
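Since the merge step is not shown, here is a self-contained sketch of a recursive merge sort in C that uses two arrays (the data and one scratch buffer). It is an assumption about how such a merge could look, not the merge() from the referenced mergesort.c; the names merge_ranges, msort_rec and merge_sort are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Merge the sorted ranges a[lo..mid] and a[mid+1..hi] through tmp,
 * then copy the merged result back into a. */
static void merge_ranges(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid + 1, k = lo;

    while (i <= mid && j <= hi)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++]; /* ties take the left item */
    while (i <= mid)
        tmp[k++] = a[i++];
    while (j <= hi)
        tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo + 1) * sizeof *a);
}

/* Recursively sort a[lo..hi] (inclusive bounds). */
static void msort_rec(int *a, int *tmp, int lo, int hi)
{
    if (lo >= hi)
        return; /* zero or one element: already sorted */
    int mid = lo + (hi - lo) / 2;
    msort_rec(a, tmp, lo, mid);
    msort_rec(a, tmp, mid + 1, hi);
    merge_ranges(a, tmp, lo, mid, hi);
}

/* Sort a[0..n-1]. Returns 0 on success, -1 if the scratch buffer
 * cannot be allocated. */
int merge_sort(int *a, int n)
{
    if (n < 2)
        return 0;
    int *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    msort_rec(a, tmp, 0, n - 1);
    free(tmp);
    return 0;
}
```

Allocating the scratch buffer once in merge_sort, instead of in every merge, keeps the extra memory at exactly one array of n items.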
![Page 65: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/65.jpg)
Heap Sort
• The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn't require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items.
• The heap sort works as its name suggests:
1. It begins by building a heap out of the data set.
2. It then removes the largest item and places it at the end of the sorted array.
3. After removing the largest item, it reconstructs the heap, removes the largest remaining item and places it in the next open position from the end of the sorted array.
4. This is repeated until there are no items left in the heap and the sorted array is full.
• Elementary implementations require two arrays - one to hold the heap and the other to hold the sorted elements.
• To do an in-place sort and save the space the second array would require, the algorithm below "cheats" by using the same array to store both the heap and the sorted array. Whenever an item is removed from the heap, it frees up a space at the end of the array that the removed item can be placed in.
Pros: In-place and non-recursive, making it a good choice for extremely large data sets.
Cons: Slower than the merge and quick sorts.
Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/heapsort.html
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/heapsort/heapsort.c
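The four steps and the in-place "cheat" can be sketched in C as follows. This is an illustrative version, not the referenced heapsort.c; the names sift_down and heap_sort are hypothetical.

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at `root`,
 * looking only at the first `end` elements of v. */
static void sift_down(int *v, size_t root, size_t end)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= end)
            return; /* root is a leaf */
        if (child + 1 < end && v[child + 1] > v[child])
            child++; /* pick the larger child */
        if (v[root] >= v[child])
            return;
        int tmp = v[root];
        v[root] = v[child];
        v[child] = tmp;
        root = child;
    }
}

/* In-place, non-recursive heap sort of v[0..n-1] in ascending order. */
void heap_sort(int *v, size_t n)
{
    if (n < 2)
        return;

    /* Step 1: build a max-heap from the data set. */
    for (size_t i = n / 2; i-- > 0; )
        sift_down(v, i, n);

    /* Steps 2 to 4: move the largest item into the slot freed at the end
     * of the heap, shrink the heap by one, and rebuild it. */
    for (size_t end = n - 1; end > 0; end--) {
        int tmp = v[0];
        v[0] = v[end];
        v[end] = tmp;
        sift_down(v, 0, end);
    }
}
```

The array is split into a heap at the front and a sorted region at the back; each removal shrinks the former and grows the latter by one slot, so no second array is needed.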
![Page 68: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/68.jpg)
Quick Sort
• The quick sort is an in-place, divide-and-conquer, massively recursive sort.
• Divide and Conquer Algorithms
– Algorithms that solve (conquer) problems by dividing them into smaller sub-problems until the problem is so small that it is trivially solved.
• In Place
– In place sorting algorithms don't require additional temporary space to store elements as they sort; they use the space originally occupied by the elements.
• Quicksort takes time proportional to N*N for N data items in the worst case, but usually N log N, which is much faster
– for 1,000,000 items, N log2 N ~ 1,000,000*20
• Constant communication cost: 2*N data items
– for 1,000,000 items, must send/receive 2*1,000,000 from/to root
• In general, processing/communication is proportional to (N*log2 N)/(2*N) = (log2 N)/2
– so for 1,000,000 items, only 20/2 = 10 times as much processing as communication
• Suggests can only get speedup, with this parallelization, for very large N
Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html
Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/qsort/qsort.c
http://www.sci.hkbu.edu.hk
![Page 69: CSC 7600 Lecture 18: Applied Parallel Algorithms 4 Spring 2009 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 4 Dr. Hartmut](https://reader038.vdocument.in/reader038/viewer/2022103006/56649e7c5503460f94b7e112/html5/thumbnails/69.jpg)
Quick Sort
• The recursive algorithm consists of four steps (which closely resemble the merge sort):
1. If there are one or fewer elements in the array to be sorted, return immediately.
2. Pick an element in the array to serve as a "pivot" point. (Usually the left-most element in the array is used.)
3. Split the array into two parts - one with elements larger than the pivot and the other with elements smaller than the pivot.
4. Recursively repeat the algorithm for both parts of the original array.
• The efficiency of the algorithm is strongly affected by which element is chosen as the pivot point.
• The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.
• If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected algorithmic complexity of O(n log n).
Pros: Extremely fast.
Cons: Very complex algorithm, massively recursive.
http://www.sci.hkbu.edu.hk
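The four steps, with the recommended random pivot, can be sketched in C as follows. This is an illustrative version, not the referenced qsort.c: the names are hypothetical, the partitioning scheme (Lomuto) is one choice among several, and rand() is used unseeded for simplicity.

```c
#include <stdlib.h>

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Sort v[lo..hi] (inclusive bounds) in ascending order. */
static void quick_sort_range(int *v, int lo, int hi)
{
    /* Step 1: one or fewer elements, nothing to do. */
    if (lo >= hi)
        return;

    /* Step 2: pick a random pivot and park it at the end of the range. */
    int p = lo + rand() % (hi - lo + 1);
    swap_int(&v[p], &v[hi]);
    int pivot = v[hi];

    /* Step 3: move elements smaller than the pivot to the front,
     * then place the pivot between the two parts. */
    int store = lo;
    for (int i = lo; i < hi; i++)
        if (v[i] < pivot)
            swap_int(&v[i], &v[store++]);
    swap_int(&v[store], &v[hi]);

    /* Step 4: recurse on both parts; the pivot is already in place. */
    quick_sort_range(v, lo, store - 1);
    quick_sort_range(v, store + 1, hi);
}

/* Sort v[0..n-1] in place. */
void quick_sort(int *v, int n)
{
    if (n > 1)
        quick_sort_range(v, 0, n - 1);
}
```

With a random pivot, an already sorted input no longer triggers the worst case that the left-most pivot suffers from, although inputs with many equal keys can still slow this particular partitioning scheme.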