Source: tel-zur.net/teaching/bgu/pp/lecture01_part1.pdf
TRANSCRIPT
Introduction to Parallel Processing
Dr. Guy Tel-Zur
Version 26-10-2014
Talk Outline
• Motivation
• Basic terms
• S/W: Methods of Parallelization
• Examples
• Profiling, Benchmarking and Performance Tuning
• H/W
• Supercomputers
• HTC and Condor
• Grid Computing and Cloud Computing
• Future Trends
A Definition from the Oxford Dictionary of Science:
A technique that allows more than one process – stream of activity – to be running at any given moment in a computer system, hence processes can be executed in parallel. This means that two or more processors are active among a group of processes at any instant.
Can We Multitask?
http://news.stanford.edu/news/2009/august24/multitask-research-study-082409.html
• Motivation
• Basic terms
• Parallelization methods
• Examples
• Profiling, Benchmarking and Performance Tuning
• Common H/W
• Supercomputers
• HTC and Condor
• The Grid
• Future trends
Source: https://download.nap.edu/catalog.php?record_id=13427
Parallel Processing
[Figure: a cloud of related terms – Parallel Computing, Embarrassingly Parallel, Parallel File System, Parallel Visualization, Supercomputing, Many Cores, Multi Cores, Clusters, Accelerators, SMP, Farming]
Image courtesy of Ioan Raicu, University of Chicago. Source: iSGTW Opinion – Many Task Computing: Bridging the performance-throughput gap, January 28, 2009.
The need for Parallel Processing
• Get the solution faster and/or solve a bigger problem
• Other considerations… (for and against) – Power -> Multi-Cores
• Serial processor limits
DEMO (script n_size.m):
N=input('Enter dimension: ');
A=rand(N);          % N-by-N random matrices
B=rand(N);
tic
C=A*B;              % time a dense matrix-matrix multiplication
toc
Memory limit
>> n_size
Enter dimension: 10
Elapsed time is 0.101245 seconds.
>> n_size
Enter dimension: 100
Elapsed time is 0.119061 seconds.
>> n_size
Enter dimension: 1000
Elapsed time is 0.146440 seconds.
>> n_size
Enter dimension: 10000
Error using rand
Out of memory. Type HELP MEMORY for your options.
Error in n_size (line 2)
A=rand(N);
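A quick back-of-the-envelope check (a sketch assuming 8-byte double-precision elements, with A, B and C all resident in memory at once) shows why N = 10000 fails on a small machine:

% Rough memory estimate for the failing case (all values are assumptions).
N = 10000;
bytes_per_matrix = N^2 * 8;                 % 8-byte doubles -> ~0.8 GB each
total_GB = 3 * bytes_per_matrix / 2^30;     % A, B and C together: ~2.2 GB
fprintf('Approximate memory needed: %.1f GB\n', total_GB);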
Demo… (Qt Octave)
Why Parallel Processing
• The universe is inherently parallel, so parallel models fit it best.
" א מז חיזוי מרחוק חישה " חישובית "ביולוגיה
The Demand for Computational Speed
Continual demand for greater computational speed from a computer system than is currently possible. Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a “reasonable” time period.
Exercise
• In a galaxy there are 10^11 stars
• Estimate the computing time for 100 iterations assuming O(N^2) interactions on a 1GFLOPS computer
• For 10^11 stars there are 10^22 interactions
• ×100 iterations → 10^24 operations
• Therefore the computing time:
• Conclusion: Improve the algorithm! Do approximations…hopefully n·log(n)
Solution
t = 10^24 / 10^9 = 10^15 sec ≈ 31,709,791 years
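The same arithmetic can be checked with a few lines of Octave (a sketch; 1 GFLOPS is the machine speed assumed in the exercise):

% N-body estimate: O(N^2) interactions per iteration, 100 iterations,
% on a machine performing 10^9 operations per second.
N     = 1e11;                       % stars in the galaxy
ops   = 100 * N^2;                  % ~10^24 operations in total
t_sec = ops / 1e9;                  % ~10^15 seconds
t_years = t_sec / (3600*24*365);    % ~3.17e7 years
fprintf('%.3g s = %.4g years\n', t_sec, t_years);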
Performance units
FLOPS = Floating-Point Operations per Second

MegaFLOPS (10^6)
GigaFLOPS (10^9)
TeraFLOPS (10^12) – 1997
PetaFLOPS (10^15) – 2008
ExaFLOPS (10^18) – 2020±2?
ZettaFLOPS (10^21) – 2032?????
Lego Turing machine – 0.001 FLOPS
http://rubens.ens-lyon.fr/
Large Memory Requirements
Use parallel computing to execute larger problems, which require more memory than exists on a single computer.
2004 Japan’s Earth Simulator (35TFLOPS)
2011 Japan’s K Computer (8.2PF)
An Aurora simulation
Source: SciDAC Review, Number 16, 2010
Molecular Dynamics
Source: SciDAC Review, Number 16, 2010
Other considerations
• Development cost
– Difficult to program and debug
– TCO, ROI…
Introduction to Parallel Processing
24/9/2010
Motivation, for anyone not yet convinced of the field's importance… a news item to reinforce the point.
• Motivation
• Basic terms
• Parallelization methods
• Examples
• Profiling, Benchmarking and Performance Tuning
• Common H/W
• Supercomputers
• HTC and Condor
• The Grid
• Future trends
Basic terms
• Buzzwords
• Flynn’s taxonomy
• Speedup and Efficiency
• Amdahl’s Law
• Load Imbalance
Buzzwords
Farming - Embarrassingly parallel
Parallel Computing - simultaneous use of multiple processors.
Symmetric Multiprocessing (SMP) - a single address space.
Cluster Computing - a combination of commodity units.
Supercomputing - Use of the fastest, biggest machines to solve large problems.
Michael Flynn
See Flynn’s paper “Some Computer Organizations and Their Effectiveness” in IEEE Xplore.
Flynn’s taxonomy
• single-instruction single-data streams (SISD)
• single-instruction multiple-data streams (SIMD)
• multiple-instruction single-data streams (MISD)
• multiple-instruction multiple-data streams (MIMD), which includes the SPMD (single program, multiple data) style
http://en.wikipedia.org/wiki/Flynn%27s_taxonomy
“Time” Terms
Serial time, ts = time of the best serial (1-processor) algorithm.
Parallel time, tp = time of the parallel algorithm + architecture to solve the problem using p processors.
Note: tp ≤ ts, but tp=1 ≥ ts; many times we assume t1 ≈ ts.
Most important basic concepts!
• Speedup: S = ts / tp, 0 ≤ speedup ≤ p
• Work (cost): W(p) = p · tp, ts ≤ W(p) ≤ ∞ (number of numerical operations)
• Efficiency: E = ts / (p · tp) = W(1) / W(p), 0 ≤ E ≤ 1
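A tiny Octave illustration of these definitions (the timings below are invented numbers, not measurements from the course):

% ts = best serial time, tp = parallel time on p processors (made-up values).
ts = 100;  p = 8;  tp = 16;
S = ts / tp;         % speedup     = 6.25
W = p * tp;          % work (cost) = 128
E = ts / (p * tp);   % efficiency  = 0.78
fprintf('speedup = %.2f, work = %g, efficiency = %.2f\n', S, W, E);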
Maximal Possible Speedup
Amdahl’s Law (1967)
f = serial fraction of the code
ts = serial (1-processor) time, tp = parallel time on n processors
The serial part takes f·ts, the parallelizable part takes (1 - f)·ts.

Parallel time: tp = f·ts + (1 - f)·ts / n

Speedup: S(n) = ts / tp = n / (1 + (n - 1)·f) = 1 / (f + (1 - f)/n)
Maximal Possible Efficiency
E = ts / (p · tp); 0 ≤ E ≤ 1
Amdahl’s Law
With only 5% of the computation being serial, the maximum speedup is 20
S(n → ∞) = 1/f
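A short Octave sketch of the bound for a 5% serial fraction (f = 0.05), using the formula above:

% Amdahl's law: S(n) = 1 / (f + (1 - f)/n); as n grows, S approaches 1/f.
f = 0.05;                        % serial fraction
n = [1 2 4 8 16 64 256 1e6];     % number of processors
S = 1 ./ (f + (1 - f) ./ n);
disp([n; S]');                   % the speedup saturates near 1/f = 20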
An Example of Amdahl’s Law
Amdahl’s Law bounds the speedup due to any improvement.
– Example: What will the speedup be if 20% of the execution time is spent in interprocessor communications, which we can improve by 10X?
S = T/T’ = 1 / [0.2/10 + 0.8] ≈ 1.22
=> Invest resources where the time is spent. The slowest portion will dominate.
Amdahl’s Law and Murphy’s Law: “If any system component can damage performance, it will.”
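The same bound applies to any partial improvement; a minimal sketch of the calculation above (fraction improved = 0.2, improvement factor = 10):

% Speedup from improving a fraction 'frac' of the run time by a factor 'k'.
amdahl = @(frac, k) 1 ./ ((1 - frac) + frac ./ k);
fprintf('S = %.2f\n', amdahl(0.2, 10));   % ~1.22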
http://mprc.pku.edu.cn/courses/architecture/autumn2005/reevaluating-Amdahls-law.pdf
Communications of the ACM, May 1988, Volume 31, Number 5, pp. 532-533.
Gustafson’s Law
• f is the fraction of the code that cannot be parallelized
• tp = f·tp + (1 - f)·tp (the parallel run time, split into its serial and parallelizable parts)
• ts = f·tp + (1 - f)·p·tp (the time the same run would take on one processor)
• S = ts / tp = f + (1 - f)·p – this is the Scaled Speedup
• S = f + p - f·p = p + (1 - p)·f = f + p·(1 - f)
• The Scaled Speedup is linear with p !
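A matching Octave sketch of the scaled speedup (again taking f = 0.05 as an assumed serial fraction):

% Gustafson's scaled speedup: S = f + (1 - f)*p,
% where f is the fraction of the code that cannot be parallelized.
f = 0.05;
p = [1 2 4 8 16 64 256];
S = f + (1 - f) .* p;    % grows (almost) linearly with p
disp([p; S]');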
http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., 1967, pp. 483-485.
In Gustafson’s model the computation time is held constant (instead of the problem size): increasing the number of CPUs solves a bigger problem and gives better results in the same time.
Benner, R.E., Gustafson, J.L., and Montry, G.R., “Development and analysis of scientific application programs on a 1024-processor hypercube,” SAND 88-0317, Sandia National Laboratories, Feb. 1988.
• Amdahl’s – fixed problem size (different run time)
• Gustafson’s – fixed run time (different problem size)
Computation/Communication Ratio
tcomp / tcomm = computation time / communication time
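As a rough illustration (all timing constants below are invented assumptions, not measurements from the lecture): for a 1-D domain decomposition, each process updates N/p points per step and exchanges two halo values with its neighbours.

% Toy computation/communication ratio for a 1-D domain decomposition.
N = 1e6;  p = 16;                 % grid points and processes (assumed)
t_flop = 1e-9;                    % seconds per grid-point update (assumed)
t_lat  = 1e-6;  t_byte = 1e-9;    % message latency and per-byte cost (assumed)
t_comp = (N/p) * t_flop;          % compute time per process per step
t_comm = 2 * (t_lat + 8*t_byte);  % two 8-byte halo exchanges
fprintf('tcomp/tcomm = %.1f\n', t_comp / t_comm);   % >> 1: computation dominates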
Overhead
Load Imbalance
• Static / Dynamic
Dynamic Partitioning – Domain Decomposition by Quad or Oct Trees
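A minimal Octave sketch of the quadtree idea (not the lecture’s code): recursively split a 2-D point set into quadrants until each cell holds at most a given number of points, so dense regions end up with more, smaller subdomains.

% quadtree.m - recursively partition 2-D points (rows of P) over the box
% [x0 x1 y0 y1]; report each leaf cell once it holds at most 'cap' points.
function quadtree(P, box, cap, depth)
  if nargin < 4, depth = 0; end
  if size(P,1) <= cap || depth > 20          % small enough (or safety limit)
    fprintf('leaf [%g %g]x[%g %g] : %d points\n', box, size(P,1));
    return;
  end
  xm = (box(1) + box(2)) / 2;                % split into four quadrants
  ym = (box(3) + box(4)) / 2;
  left = P(:,1) <= xm;  low = P(:,2) <= ym;
  quadtree(P( left &  low, :), [box(1) xm box(3) ym], cap, depth+1);
  quadtree(P(~left &  low, :), [xm box(2) box(3) ym], cap, depth+1);
  quadtree(P( left & ~low, :), [box(1) xm ym box(4)], cap, depth+1);
  quadtree(P(~left & ~low, :), [xm box(2) ym box(4)], cap, depth+1);
end

% Example: 1000 clustered points, at most 50 points per subdomain:
%   quadtree(rand(1000,2).^2, [0 1 0 1], 50);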
• Motivation
• Basic terms
• Parallelization Methods
• Examples
• Profiling, Benchmarking and Performance Tuning
• Common H/W
• Supercomputers
• HTC and Condor
• The Grid
• Future trends