of parallel tlrgh performance computingoperational intensity: example mahir multiplication...
TRANSCRIPT
Design of Parallel & tlrgh Performance ComputingReasoning about Performance I
Roofline Mode - Balance Principles
5. Roofline Model ( Williams et al. 2008 )
resources in a microarchitecture associated programthat found performance ! features
• peak performance F tfbnskycle ] • work W [ flops ]• memory Sanelarelth B esytyleyche ] • detertransferred cache a men Q Estes ]
for a given problem : min W= work Hime coupled Ly
min Q = 110 aplenty
Operational intensity I : Given a program ,assume
empty Cade :
Icu ) . Wcw )|QCu )lnturtdou : higs I a compute Sound
low I ←> memory soundIcu )
Example : vector Sun y=xty OCDasymptotic founds matrix . vector product y=A× OG )
on I fast Fourier transform Ocloju )matrix - matrix product C=AB Ocu )
Operational Intensity : Example Mahir Multiplication
Assumptions : cache
siugkn,
cache 86ek=8 doubles,
leade
We want to estimate Q.
W=2u3 flops
1.) Triple loop : : 8lb,
35'
Er
/
•
2) Blocked
-1.
tF÷#*%" (
a B Ca AD A'
B ⇐ AD
1. entry of C : ut8n doubles 1. Slack of C: 2ns doubles
2. end > ofc : same 2. sleek of C : same
o a o to a o o o 8
total : Sis doubles total : 2ns . ftp.2yddoushsIca ) = OC i ) bzff ⇒ Icik @City )
Roofline Rodell Williams et al. 20087
Computer : Program :
men I= W/Q I fbpslsyde ]bandwidth A
LLC tsyteskyck ] T = runtime I cycles ]
CPU peak perf . T P= WIT ( performance ) [ fbpskyck ]tfbpskyclei
Roofline plot : ( example T= 2, A=4 )
Ptfbpskyde ] sound jasedenp fP±pI ) reason :¥=Ww¥a=¥2f^
Sind
Sounds ( PET ) PZ p. I2-
: ⇒ logp a- legit leg B1 -
r
' k - I× ← program Isl ⇒ P 2- A
-
i
run on Some
I. guts input^ 1 1 1 1 > I I f6ps/J > tif
YYYZ1 2
Ppenatroudlnhhtidy : Upper Lonely
Icu ) =War )/Q(u ) × ,y : length n
,A ,B,C= nxn ,a : scalar
It flops / # Sitesmemo eaehe
asymptotic sest exact
daxpyy=a×ty0 ( 1 ) OCI ) 2- 1/12
dgemv y - Axty OCI ) 04 ) 11/4'
fft OCGSU ) Occogy ) -
dgemm C=ABtC ocn ) o- ( of ) E YI should fe
sharp ify=eade till3u2 . 82-8
Example . y= KMB
=) n E 700
back to slides
G. Balance Principles I Ckuey 86 )
Compete : Pegram :
men Qy : data transfer men a LLC
sLLC sire j W : work = # ops
CPU IT
The computer is called Salaam ifcompute time = data transfer the
ire . , assuming perfect utilization
¥ = Ese⇒ Ip=¥Jr
r
assume it → at increases ;how A nesalauee ?
a.) A → a. A ,✓ set A grows slower than 5 !s
. ) y → x. 8 what is × ?
Example I : Mahn multiplicities Ca ABTC
algorithm worth opkmel I= hot - QCTF )
Ip =¥←=o( of ) t.at, r→e?g
Example 2 : FFT / dorky
algorithm ark optimal I= @( logo )
Is = [email protected] → ya
pots are unnealrttfe .
7. Balance principles I ( Czechowski et al.
Lou )
coal : none detailed principles for multi ones
algorithm / architecture adesrgn ,assess webof HW tends
Computer ; Algorithm :
slow men PRAM : W : work,
Di depthdi lately ,
B : throughput,
fast men sing dnansaehon fired Qp, , : men . transfersof she A
. . . each poll perf IT
P,
Ppassumptions : optimal W
,W/Q
processors
Processor is faleuafd ;ftwo * set Qp , rn fun Qnx ?
- there are general fends
Them E Tamp ("
compute - tuned and some Arnet results
" as :
of a- put ⇐ > F 3- ¥a
Derivation principles .
.
1.) ESH mete Tmen : Idea did DAG in levels
level I : o o o a o o q,men parsley of tie t
levels : \o/M¥4qz
. . . . . . - - - - - - - .- . - . :
,
level D= o a a ° 0 ° ° 9D
In each level a- A model : a +9¥¥8 '
' '
⇒ Tuey X §g ( at 9¥ ) = a .D +0¥ Qt Qpinx
2.) Estimably Tap : Brent 's theorem
Taya
CD+¥¥-men . pan . amputee
Balance : Them a- Tamp < parallelism Gupute
⇒r÷a+oyt⇒laa¥ei+÷ , )^ T parallelism algorithmth qn¥
men. par . algorithm
Example 1 ; matrix multiplication
use Q >cr[¥-Fp, Chou > et al.
2004 ) ⇒ Icu ) = 0 ( ITF)⇒ stance pwuehple PIE OCFr )P
it → a. it resdauee : A → ap on p → a2jp → a. p " : ( A → as andp→ay) onCp → Pg )
Example 2 : sending #FT
⇒ PF zocigrg )
Back to slrdy