of parallel tlrgh performance computingoperational intensity: example mahir multiplication...

10
Design of Parallel & tlrgh Performance Computing Reasoning about Performance I Roofline Mode Balance Principles

Upload: others

Post on 23-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Design of Parallel & tlrgh Performance ComputingReasoning about Performance I

Roofline Mode - Balance Principles

Page 2: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

5. Roofline Model ( Williams et al. 2008 )

resources in a microarchitecture associated programthat found performance ! features

• peak performance F tfbnskycle ] • work W [ flops ]• memory Sanelarelth B esytyleyche ] • detertransferred cache a men Q Estes ]

for a given problem : min W= work Hime coupled Ly

min Q = 110 aplenty

Operational intensity I : Given a program ,assume

empty Cade :

Icu ) . Wcw )|QCu )lnturtdou : higs I a compute Sound

low I ←> memory soundIcu )

Example : vector Sun y=xty OCDasymptotic founds matrix . vector product y=A× OG )

on I fast Fourier transform Ocloju )matrix - matrix product C=AB Ocu )

Page 3: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Operational Intensity : Example Mahir Multiplication

Assumptions : cache

siugkn,

cache 86ek=8 doubles,

leade

We want to estimate Q.

W=2u3 flops

1.) Triple loop : : 8lb,

35'

Er

/

2) Blocked

-1.

tF÷#*%" (

a B Ca AD A'

B ⇐ AD

1. entry of C : ut8n doubles 1. Slack of C: 2ns doubles

2. end > ofc : same 2. sleek of C : same

o a o to a o o o 8

total : Sis doubles total : 2ns . ftp.2yddoushsIca ) = OC i ) bzff ⇒ Icik @City )

Page 4: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Roofline Rodell Williams et al. 20087

Computer : Program :

men I= W/Q I fbpslsyde ]bandwidth A

LLC tsyteskyck ] T = runtime I cycles ]

CPU peak perf . T P= WIT ( performance ) [ fbpskyck ]tfbpskyclei

Roofline plot : ( example T= 2, A=4 )

Ptfbpskyde ] sound jasedenp fP±pI ) reason :¥=Ww¥a=¥2f^

Sind

Sounds ( PET ) PZ p. I2-

: ⇒ logp a- legit leg B1 -

r

' k - I× ← program Isl ⇒ P 2- A

-

i

run on Some

I. guts input^ 1 1 1 1 > I I f6ps/J > tif

YYYZ1 2

Page 5: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Ppenatroudlnhhtidy : Upper Lonely

Icu ) =War )/Q(u ) × ,y : length n

,A ,B,C= nxn ,a : scalar

It flops / # Sitesmemo eaehe

asymptotic sest exact

daxpyy=a×ty0 ( 1 ) OCI ) 2- 1/12

dgemv y - Axty OCI ) 04 ) 11/4'

fft OCGSU ) Occogy ) -

dgemm C=ABtC ocn ) o- ( of ) E YI should fe

sharp ify=eade till3u2 . 82-8

Example . y= KMB

=) n E 700

back to slides

Page 6: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

G. Balance Principles I Ckuey 86 )

Compete : Pegram :

men Qy : data transfer men a LLC

sLLC sire j W : work = # ops

CPU IT

The computer is called Salaam ifcompute time = data transfer the

ire . , assuming perfect utilization

¥ = Ese⇒ Ip=¥Jr

r

assume it → at increases ;how A nesalauee ?

a.) A → a. A ,✓ set A grows slower than 5 !s

. ) y → x. 8 what is × ?

Page 7: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Example I : Mahn multiplicities Ca ABTC

algorithm worth opkmel I= hot - QCTF )

Ip =¥←=o( of ) t.at, r→e?g

Example 2 : FFT / dorky

algorithm ark optimal I= @( logo )

Is = [email protected] → ya

pots are unnealrttfe .

Page 8: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

7. Balance principles I ( Czechowski et al.

Lou )

coal : none detailed principles for multi ones

algorithm / architecture adesrgn ,assess webof HW tends

Computer ; Algorithm :

slow men PRAM : W : work,

Di depthdi lately ,

B : throughput,

fast men sing dnansaehon fired Qp, , : men . transfersof she A

. . . each poll perf IT

P,

Ppassumptions : optimal W

,W/Q

processors

Processor is faleuafd ;ftwo * set Qp , rn fun Qnx ?

- there are general fends

Them E Tamp ("

compute - tuned and some Arnet results

" as :

of a- put ⇐ > F 3- ¥a

Page 9: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Derivation principles .

.

1.) ESH mete Tmen : Idea did DAG in levels

level I : o o o a o o q,men parsley of tie t

levels : \o/M¥4qz

. . . . . . - - - - - - - .- . - . :

,

level D= o a a ° 0 ° ° 9D

In each level a- A model : a +9¥¥8 '

' '

⇒ Tuey X §g ( at 9¥ ) = a .D +0¥ Qt Qpinx

2.) Estimably Tap : Brent 's theorem

Taya

CD+¥¥-men . pan . amputee

Balance : Them a- Tamp < parallelism Gupute

⇒r÷a+oyt⇒laa¥ei+÷ , )^ T parallelism algorithmth qn¥

men. par . algorithm

Page 10: of Parallel tlrgh Performance ComputingOperational Intensity: Example Mahir Multiplication Assumptions: cache siugkn, cache 86ek=8 doubles leade We want to estimate Q W=2u3 flops 1.)

Example 1 ; matrix multiplication

use Q >cr[¥-Fp, Chou > et al.

2004 ) ⇒ Icu ) = 0 ( ITF)⇒ stance pwuehple PIE OCFr )P

it → a. it resdauee : A → ap on p → a2jp → a. p " : ( A → as andp→ay) onCp → Pg )

Example 2 : sending #FT

⇒ PF zocigrg )

Back to slrdy