code gpu with cuda - identifying performance limiters

Upload: marina-kolpakova
Posted on 05-Apr-2017

TRANSCRIPT

Page 1: Code GPU with CUDA - Identifying performance limiters

CODE GPU WITH CUDA: IDENTIFYING PERFORMANCE LIMITERS

Created by Marina Kolpakova for cuda.geek, Itseez

Page 2

OUTLINE

- How to identify performance limiters?
- What and how to measure?
- Why profile?
- Profiling case study: transpose
- Code paths analysis

Page 3

OUT OF SCOPE

- Visual Profiler capabilities

Page 4

HOW TO IDENTIFY PERFORMANCE LIMITERS

Time
- Subsample when measuring performance
- Determine your code's wall time; that is what you will optimize

Profile
- Collect metrics and events
- Determine limiting factors (e.g. memory, divergence)

Page 5

HOW TO IDENTIFY PERFORMANCE LIMITERS

Prototype
- Prototype kernel parts separately and time them
- Determine memory access or data dependency patterns

(Micro)benchmark
- Determine hardware characteristics
- Tune for a particular architecture or GPU class

Look into SASS
- Check compiler optimizations
- Look for further improvements

Page 6

TIMING: WHAT TO MEASURE?

- Wall time: the time the user will see
- GPU time: time of a specific kernel
- CPU ⇔ GPU memory transfer time:
  - not considered in GPU time analysis
  - significantly impacts wall time
- Timing of data-dependent cases:
  - worst-case time
  - time of a single iteration
  - consider probabilities

Page 7

HOW TO MEASURE? SYSTEM TIMER (UNIX)

    #include <time.h>

    double runKernel(const dim3 grid, const dim3 block)
    {
        struct timespec startTime, endTime;
        clock_gettime(CLOCK_MONOTONIC, &startTime);
        kernel<<<grid, block>>>();
        cudaDeviceSynchronize();
        clock_gettime(CLOCK_MONOTONIC, &endTime);

        int64_t startNs = (int64_t)startTime.tv_sec * 1000000000 + startTime.tv_nsec;
        int64_t endNs   = (int64_t)endTime.tv_sec   * 1000000000 + endTime.tv_nsec;
        return (endNs - startNs) / 1000000.; // get ms
    }

Preferred for wall time measurement

Page 8

HOW TO MEASURE? TIMING WITH CUDA EVENTS

    double runKernel(const dim3 grid, const dim3 block)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        kernel<<<grid, block>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

Preferred for GPU time measurement
Can be used with CUDA streams without synchronization

Page 9

WHY PROFILE?

The profiler will not do your work for you, but it helps:

- to verify memory access patterns
- to identify bottlenecks
- to collect statistics in data-dependent workloads
- to check your hypotheses
- to understand how the hardware behaves

Think about profiling and benchmarking as scientific experiments

Page 10

DEVICE CODE PROFILER

- Events are hardware counters, usually reported per SM
  - The SM id is selected by the profiler under the assumption that all SMs do approximately the same amount of work
  - Exceptions: L2 and DRAM counters
- Metrics are computed from event counts and hardware-specific properties (e.g. number of SMs)
- A single run can collect only a few counters
  - The profiler repeats kernel launches to collect all counters
  - Results may vary between repeated runs

Page 11

PROFILING FOR MEMORY

Memory metrics:
- those with load or store in the name count from the software perspective (in terms of memory requests), e.g. local_store_transactions
- those with read or write in the name count from the hardware perspective (in terms of bytes transferred), e.g. l2_subp0_read_sector_misses

Counters are incremented:
- per warp
- per cache line / transaction size
- per request/instruction

Page 12

PROFILING FOR MEMORY

Access pattern efficiency
- Check the ratio between bytes requested by the threads / application code and bytes moved by the hardware (L2/DRAM)
- Use the g{ld,st}_transactions_per_request metrics

Throughput analysis
- Compare the application's hardware throughput to what is possible on your GPU (can be found in the documentation)
- g{ld,st}_requested_throughput

Page 13

INSTRUCTIONS/BYTES RATIO

Profiler counters:
- instructions_issued, instructions_executed
  - incremented per warp, but "issued" includes replays
- global_store_transaction, uncached_global_load_transaction
  - a transaction can be 32, 64, or 128 bytes; additional analysis is required to determine the average

Compute the ratio:
(warpSize * instructions_issued) vs. (global_store_transaction + l1_global_load_miss) * avgTransactionSize

Page 14

LIST OF EVENTS FOR SM_35

domain      event
texture (a) tex{0,1,2,3}_cache_sector_{queries,misses}
            rocache_subp{0,1,2,3}_gld_warp_count_{32,64,128}b
            rocache_subp{0,1,2,3}_gld_thread_count_{32,64,128}b
L2 (b)      fb_subp{0,1}_{read,write}_sectors
            l2_subp{0,1,2,3}_total_{read,write}_sector_queries
            l2_subp{0,1,2,3}_{read,write}_{l1,system}_sector_queries
            l2_subp{0,1,2,3}_{read,write}_sector_misses
            l2_subp{0,1,2,3}_read_tex_sector_queries
            l2_subp{0,1,2,3}_read_{l1,tex}_hit_sectors
LD/ST (c)   g{ld,st}_inst_{8,16,32,64,128}bit
            rocache_gld_inst_{8,16,32,64,128}bit

Page 15

LIST OF EVENTS FOR SM_35

domain  event
sm (d)  prof_trigger_0{0-7}
        {shared,local}_{load,store}
        g{ld,st}_request
        {local,l1_shared,l1_global}_{load,store}_transactions
        l1_local_{load,store}_{hit,miss}
        l1_global_load_{hit,miss}
        uncached_global_load_transaction
        global_store_transaction
        shared_{load,store}_replay
        global_{ld,st}_mem_divergence_replays

Page 16

LIST OF EVENTS FOR SM_35

domain  event
sm (d)  {threads,warps,sm_cta}_launched
        inst_issued{1,2}
        [thread_,not_predicated_off_thread_]inst_executed
        {atom,gred}_count
        active_{cycles,warps}

Page 17

LIST OF METRICS FOR SM_35

metric
g{ld,st}_requested_throughput
tex_cache_{hit_rate,throughput}
dram_{read,write}_throughput
nc_gld_requested_throughput
{local,shared}_{load,store}_throughput
{l2,system}_{read,write}_throughput
g{st,ld}_{throughput,efficiency}
l2_{l1,texture}_read_{hit_rate,throughput}
l1_cache_{global,local}_hit_rate

Page 18

LIST OF METRICS FOR SM_35

metric
{local,shared}_{load,store}_transactions[_per_request]
gl{d,st}_transactions[_per_request]
{sysmem,dram,l2}_{read,write}_transactions
tex_cache_transactions
{inst,shared,global,global_cache,local}_replay_overhead
local_memory_overhead
shared_efficiency
achieved_occupancy
sm_efficiency[_instance]
ipc[_instance]
issued_ipc
inst_per_warp

Page 19

LIST OF METRICS FOR SM_35

metric
flops_{sp,dp}[_add,mul,fma]
warp_execution_efficiency
warp_nonpred_execution_efficiency
flops_sp_special
stall_{inst_fetch,exec_dependency,data_request,texture,sync,other}
{l1_shared,l2,tex,dram,system}_utilization
{cf,ldst}_{issued,executed}
{ldst,alu,cf,tex}_fu_utilization
issue_slot_utilization
inst_{issued,executed}
issue_slots

Page 20

ROI PROFILING

    #include <cuda_profiler_api.h>

    // algorithm setup code
    cudaProfilerStart();
    perf_test_cuda_accelerated_code();
    cudaProfilerStop();

- Profile only the part you are optimizing right now
- Produces a shorter and simpler profiler log
- Does not add significant overhead to your code's runtime
- Used with the --profile-from-start off nvprof option

Page 21

CASE STUDY: MATRIX TRANSPOSE

    & nvprof --devices 2 ./bin/demo_bench

Page 22

CASE STUDY: MATRIX TRANSPOSE

    & nvprof --devices 2 \
        --metrics gld_transactions_per_request,gst_transactions_per_request \
        ./bin/demo_bench

Page 23

CASE STUDY: MATRIX TRANSPOSE

    & nvprof --devices 2 --metrics shared_replay_overhead ./bin/demo_bench

Page 24

CODE PATHS ANALYSIS

- The main idea: determine performance limiters by measuring different parts independently
- Simple case: time memory-only and math-only versions of the kernel
- This shows how well memory operations overlap with arithmetic: compare the sum of the mem-only and math-only times to the full-kernel time

    template <typename T>
    __global__ void
    benchmark_contiguous_direct_load(T* s, typename T::value_type* r, bool doStore)
    {
        int global_index = threadIdx.x + blockDim.x * blockIdx.x;
        T data = s[global_index];
        asm("" ::: "memory");
        if (s && doStore)
            r[global_index] = sum(data);
    }

Page 25

DEVICE-SIDE TIMING

- The device timer is located on the ROP or SM, depending on the hardware revision
- It is relatively easy to compute per-thread values, but hard to analyze kernel performance due to grid serialization
- Sometimes it is suitable for benchmarking

    template <typename T, typename D, typename L>
    __global__ void latency_kernel(T** a, int len, int stride, int inner_its, D* latency, L func)
    {
        D start_time, end_time;
        volatile D sum_time = 0;
        for (int k = 0; k < inner_its; ++k)
        {
            T* j = ((T*)a) + threadIdx.y * len + threadIdx.x;
            start_time = clock64();
            for (int curr = 0; curr < len / stride; ++curr)
                j = func(j);
            end_time = clock64();
            sum_time += (end_time - start_time);
        }
        if (!threadIdx.x)
            atomicAdd(latency, sum_time);
    }

Page 26

FINAL WORDS

- Time
- Profile
- (Micro)benchmark
- Prototype
- Look into SASS

Page 27

THE END

LIST OF PRESENTATIONS

BY CUDA.GEEK / 2013–2015