CODE GPU WITH CUDA: IDENTIFYING PERFORMANCE LIMITERS
TRANSCRIPT
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
- How to identify performance limiters?
- What and how to measure?
- Why to profile?
- Profiling case study: transpose
- Code paths analysis
OUT OF SCOPE
- Visual Profiler opportunities
HOW TO IDENTIFY PERFORMANCE LIMITERS
Time
- Subsample when measuring performance
- Determine your code's wall time: that is what you will optimize
Profile
- Collect metrics and events
- Determine limiting factors (e.g. memory, divergence)
HOW TO IDENTIFY PERFORMANCE LIMITERS
Prototype
- Prototype kernel parts separately and time them
- Determine memory access or data dependency patterns
(Micro)benchmark
- Determine hardware characteristics
- Tune for a particular architecture or GPU class
Look into SASS
- Check compiler optimizations
- Look for further improvements
TIMING: WHAT TO MEASURE?
- Wall time: the time the user will see
- GPU time: time of a specific kernel
- CPU ⇔ GPU memory transfer time:
  - not considered in GPU time analysis
  - significantly impacts wall time
- Timing of data-dependent cases:
  - worst-case time
  - time of a single iteration
  - consider the probability of each path
HOW TO MEASURE? SYSTEM TIMER (UNIX)

#include <time.h>
#include <stdint.h>

double runKernel(const dim3 grid, const dim3 block)
{
    struct timespec startTime, endTime;
    clock_gettime(CLOCK_MONOTONIC, &startTime);
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize(); // kernel launches are asynchronous: wait for completion
    clock_gettime(CLOCK_MONOTONIC, &endTime);
    int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
    int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
    return (endNs - startNs) / 1000000.; // ns -> ms
}
Preferred for wall time measurement
HOW TO MEASURE? TIMING WITH CUDA EVENTS

double runKernel(const dim3 grid, const dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // block until the stop event has been recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
- Preferred for GPU time measurement
- Can be used with CUDA streams without synchronization
WHY TO PROFILE?
The profiler will not do your work for you, but it helps:
- to verify memory access patterns
- to identify bottlenecks
- to collect statistics in data-dependent workloads
- to check your hypotheses
- to understand how the hardware behaves
Think about profiling and benchmarking as scientific experiments.
DEVICE CODE PROFILER
- Events are hardware counters, usually reported per SM
  - The SM is selected by the profiler under the assumption that all SMs do approximately the same amount of work
  - Exceptions: L2 and DRAM counters
- Metrics are computed from event counts and hardware-specific properties (e.g. the number of SMs)
- A single run can collect only a few counters
  - The profiler repeats kernel launches to collect all of them
  - Results may vary between repeated runs
PROFILING FOR MEMORY
Memory metrics:
- Names containing load or store count from the software perspective (in terms of memory requests), e.g. local_store_transactions
- Names containing read or write count from the hardware perspective (in terms of bytes transferred), e.g. l2_subp0_read_sector_misses
Counters are incremented:
- per warp
- per cache line / transaction size
- per request/instruction
PROFILING FOR MEMORY
Access pattern efficiency:
- Check the ratio between bytes requested by the threads (application code) and bytes moved by the hardware (L2/DRAM)
- Use the g{ld,st}_transactions_per_request metrics
Throughput analysis:
- Compare the application's hardware throughput with the peak possible on your GPU (found in the documentation)
- Use g{ld,st}_requested_throughput
INSTRUCTIONS/BYTES RATIO
Profiler counters:
- instructions_issued, instructions_executed: incremented per warp, but "issued" includes replays
- global_store_transaction, uncached_global_load_transaction: a transaction can be 32, 64, or 128 bytes, so additional analysis is required to determine the average size
Compute the ratio:
(warpSize × instructions_issued) vs. (global_store_transaction + l1_global_load_miss) × avgTransactionSize
LIST OF EVENTS FOR SM_35
domain       event
texture (a)  tex{0,1,2,3}_cache_sector_{queries,misses}
             rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
             rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b)       fb_subp{0,1}_{read,write}_sectors
             l2_subp{0,1,2,3}_total_{read,write}_sector_queries
             l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
             l2_subp{0,1,2,3}_{read,write}_sector_misses
             l2_subp{0,1,2,3}_read_tex_sector_queries
             l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c)    g{ld,st}_inst_{8,16,32,64,128}bit
             rocache_gld_inst_{8,16,32,64,128}bit
LIST OF EVENTS FOR SM_35
domain  event
sm (d)  prof_trigger_0{0-7}
        {shared,local}_{load,store}
        g{ld,st}_request
        {local,l1_shared,__l1_global}_{load,store}_transactions
        l1_local_{load,store}_{hit,miss}
        l1_global_load_{hit,miss}
        uncached_global_load_transaction
        global_store_transaction
        shared_{load,store}_replay
        global_{ld,st}_mem_divergence_replays
LIST OF EVENTS FOR SM_35
domain  event
sm (d)  {threads,warps,sm_cta}_launched
        inst_issued{1,2}
        [thread_,not_predicated_off_thread_]inst_executed
        {atom,gred}_count
        active_{cycles,warps}
LIST OF METRICS FOR SM_35
metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate
LIST OF METRICS FOR SM_35
metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp
LIST OF METRICS FOR SM_35
metric
flops_{sp,dp}[_add,mul,fma]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots
ROI PROFILING

#include <cuda_profiler_api.h>

// algorithm setup code
cudaProfilerStart();
perf_test_cuda_accelerated_code();
cudaProfilerStop();

- Profile only the part you are optimizing right now: the profiler log stays shorter and simpler
- Does not add significant overhead to your code's runtime
- Use together with the --profile-from-start off nvprof option
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 \
    --metrics gld_transactions_per_request,gst_transactions_per_request \
    ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
CODE PATHS ANALYSIS
- The main idea: determine performance limiters by measuring different parts of the kernel independently
- Simple case: time memory-only and math-only versions of the kernel
- This shows how well memory operations overlap with arithmetic: compare the sum of the mem-only and math-only times with the full-kernel time
template <typename T>
__global__ void benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
{
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    T data = s[global_index];
    asm("" ::: "memory"); // compiler barrier: keep the load from being optimized away
    if (s && doStore)
        r[global_index] = sum(data);
}
DEVICE SIDE TIMING
- The device timer is located on the ROP or the SM, depending on the hardware revision
- It is relatively easy to compute per-thread values, but hard to analyze whole-kernel performance because of grid serialization
- Sometimes suitable for benchmarking

template <typename T, typename D, typename L>
__global__ void latency_kernel(T** a, int len, int stride, int inner_its, D* latency, L func)
{
    D start_time, end_time;
    volatile D sum_time = 0;
    for (int k = 0; k < inner_its; ++k)
    {
        T* j = ((T*)a) + threadIdx.y * len + threadIdx.x;
        start_time = clock64();
        for (int curr = 0; curr < len / stride; ++curr)
            j = func(j); // pointer chase: each load depends on the previous one
        end_time = clock64();
        sum_time += (end_time - start_time);
    }
    if (!threadIdx.x)
        atomicAdd(latency, sum_time);
}
FINAL WORDS
- Time
- Profile
- (Micro)benchmark
- Prototype
- Look into SASS
THE END
BY CUDA.GEEK / 2013–2015