CODE GPU WITH CUDA: IDENTIFYING PERFORMANCE LIMITERS
TRANSCRIPT
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
- How to identify performance limiters?
- What and how to measure?
- Why to profile?
- Profiling case study: transpose
- Code paths analysis
OUT OF SCOPE
- Visual Profiler opportunities
HOW TO IDENTIFY PERFORMANCE LIMITERS
Time
- Subsample when measuring performance
- Determine your code's wall time: that is what you will optimize
Profile
- Collect metrics and events
- Determine limiting factors (e.g. memory, divergence)
HOW TO IDENTIFY PERFORMANCE LIMITERS
Prototype
- Prototype kernel parts separately and time them
- Determine memory access or data dependency patterns
(Micro)benchmark
- Determine hardware characteristics
- Tune for a particular architecture or GPU class
Look into SASS
- Check compiler optimizations
- Look for further improvements
TIMING: WHAT TO MEASURE?
- Wall time: the time the user will see
- GPU time: time of a specific kernel
- CPU ⇔ GPU memory transfer time:
  - not considered in GPU time analysis
  - significantly impacts wall time
- Timing of data-dependent cases:
  - worst-case time
  - time of a single iteration
  - consider the probability of each path
HOW TO MEASURE? SYSTEM TIMER (UNIX)

#include <time.h>
#include <stdint.h>

double runKernel(const dim3 grid, const dim3 block)
{
    struct timespec startTime, endTime;
    clock_gettime(CLOCK_MONOTONIC, &startTime);
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize(); // kernel launches are asynchronous: wait for completion
    clock_gettime(CLOCK_MONOTONIC, &endTime);
    int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
    int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
    return (endNs - startNs) / 1000000.; // ns -> ms
}
Preferred for wall time measurement
HOW TO MEASURE? TIMING WITH CUDA EVENTS

double runKernel(const dim3 grid, const dim3 block)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // block until the stop event has been recorded
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
- Preferred for GPU time measurement
- Can be used with CUDA streams without synchronization
WHY TO PROFILE?
The profiler will not do your work for you, but it helps:
- to verify memory access patterns
- to identify bottlenecks
- to collect statistics in data-dependent workloads
- to check your hypotheses
- to understand how the hardware behaves
Think about profiling and benchmarking as scientific experiments.
DEVICE CODE PROFILER
- Events are hardware counters, usually reported per SM
  - The SM is selected by the profiler under the assumption that all SMs do approximately the same amount of work
  - Exceptions: L2 and DRAM counters
- Metrics are computed from event counts and hardware-specific properties (e.g. the number of SMs)
- A single run can collect only a few counters
  - The profiler repeats kernel launches to collect all of them
  - Results may vary between repeated runs
PROFILING FOR MEMORY
Memory metrics:
- Names containing load or store count from the software perspective (in terms of memory requests), e.g. local_store_transactions
- Names containing read or write count from the hardware perspective (in terms of bytes transferred), e.g. l2_subp0_read_sector_misses
Counters are incremented:
- per warp
- per cache line / transaction size
- per request/instruction
PROFILING FOR MEMORY
Access pattern efficiency:
- Check the ratio between bytes requested by the threads (application code) and bytes moved by the hardware (L2/DRAM)
- Use the g{ld,st}_transactions_per_request metrics
Throughput analysis:
- Compare the application's hardware throughput with the peak possible on your GPU (found in the documentation)
- Use g{ld,st}_requested_throughput
INSTRUCTIONS/BYTES RATIO
Profiler counters:
- instructions_issued, instructions_executed: incremented per warp, but "issued" includes replays
- global_store_transaction, uncached_global_load_transaction: a transaction can be 32, 64, or 128 bytes, so additional analysis is required to determine the average size
Compute the ratio:
(warpSize × instructions_issued) vs. (global_store_transaction + l1_global_load_miss) × avgTransactionSize
LIST OF EVENTS FOR SM_35
domain       event
texture (a)  tex{0,1,2,3}_cache_sector_{queries,misses}
             rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
             rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b)       fb_subp{0,1}_{read,write}_sectors
             l2_subp{0,1,2,3}_total_{read,write}_sector_queries
             l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
             l2_subp{0,1,2,3}_{read,write}_sector_misses
             l2_subp{0,1,2,3}_read_tex_sector_queries
             l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c)    g{ld,st}_inst_{8,16,32,64,128}bit
             rocache_gld_inst_{8,16,32,64,128}bit
LIST OF EVENTS FOR SM_35
domain  event
sm (d)  prof_trigger_0{0-7}
        {shared,local}_{load,store}
        g{ld,st}_request
        {local,l1_shared,__l1_global}_{load,store}_transactions
        l1_local_{load,store}_{hit,miss}
        l1_global_load_{hit,miss}
        uncached_global_load_transaction
        global_store_transaction
        shared_{load,store}_replay
        global_{ld,st}_mem_divergence_replays
LIST OF EVENTS FOR SM_35
domain  event
sm (d)  {threads,warps,sm_cta}_launched
        inst_issued{1,2}
        [thread_,not_predicated_off_thread_]inst_executed
        {atom,gred}_count
        active_{cycles,warps}
LIST OF METRICS FOR SM_35
metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate
LIST OF METRICS FOR SM_35
metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp
LIST OF METRICS FOR SM_35
metric
flops_{sp,dp}[_add,mul,fma]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots
ROI PROFILING

#include <cuda_profiler_api.h>

// algorithm setup code
cudaProfilerStart();
perf_test_cuda_accelerated_code();
cudaProfilerStop();

- Profile only the part you are optimizing right now: the profiler log stays shorter and simpler
- Does not add significant overhead to your code's runtime
- Use together with the --profile-from-start off nvprof option
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 \
    --metrics gld_transactions_per_request,gst_transactions_per_request \
    ./bin/demo_bench
CASE STUDY: MATRIX TRANSPOSE

$ nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench
CODE PATHS ANALYSIS
- The main idea: determine performance limiters by measuring different parts of the kernel independently
- Simple case: time memory-only and math-only versions of the kernel
- This shows how well memory operations overlap with arithmetic: compare the sum of the mem-only and math-only times with the full-kernel time
template <typename T>
__global__ void benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
{
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    T data = s[global_index];
    asm("" ::: "memory"); // compiler barrier: keep the load from being optimized away
    if (s && doStore)
        r[global_index] = sum(data);
}
DEVICE SIDE TIMING
- The device timer is located on the ROP or the SM, depending on the hardware revision
- It is relatively easy to compute per-thread values, but hard to analyze whole-kernel performance because of grid serialization
- Sometimes suitable for benchmarking

template <typename T, typename D, typename L>
__global__ void latency_kernel(T** a, int len, int stride, int inner_its, D* latency, L func)
{
    D start_time, end_time;
    volatile D sum_time = 0;
    for (int k = 0; k < inner_its; ++k)
    {
        T* j = ((T*)a) + threadIdx.y * len + threadIdx.x;
        start_time = clock64();
        for (int curr = 0; curr < len / stride; ++curr)
            j = func(j); // pointer chase: each load depends on the previous one
        end_time = clock64();
        sum_time += (end_time - start_time);
    }
    if (!threadIdx.x)
        atomicAdd(latency, sum_time);
}
FINAL WORDS
- Time
- Profile
- (Micro)benchmark
- Prototype
- Look into SASS
THE END
BY CUDA.GEEK / 2013–2015