Programming with CUDA
WS 08/09
Lecture 9
Thu, 20 Nov, 2008
Previously
CUDA Runtime Component
– Common Component
– Device Component
– Host Component: runtime & driver APIs
Today
Memory & instruction optimizations
Final projects – reminder
Instruction Processing
To execute an instruction on a warp of threads, the SM
– Reads in the instruction operands for each thread
– Executes the instruction on all threads
– Writes the result of each thread
Instruction Throughput
Maximized when
– Use of low throughput instructions is minimized
– Available memory bandwidth is used maximally
– The thread scheduler overlaps compute & memory operations
  Programs have a high arithmetic intensity per memory operation
  Each SM has many active threads
Instruction Throughput
Avoid low throughput instructions
– Be aware of clock cycles used per instruction
– There are often faster alternatives for math functions, e.g. sinf and __sinf
– Size of operands (24-bit, 32-bit) also makes a difference
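As a sketch of both points (a hypothetical kernel, not from the slides): the intrinsic __sinf maps to a fast hardware instruction at reduced accuracy, and __mul24 uses the fast 24-bit integer multiplier of these devices.

```cuda
// Sketch: faster alternatives for math instructions.
// __sinf is less accurate than sinf, especially for large arguments;
// __mul24 is correct only if both operands fit in 24 bits.
__global__ void wave(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float precise = sinf(in[i]);        // slow software routine
        float fast    = __sinf(in[i]);      // fast intrinsic
        int   scaled  = __mul24(i, 3);      // 24-bit integer multiply
        out[i] = precise - fast + (float)scaled;
    }
}
// Compiling with nvcc -use_fast_math replaces sinf by __sinf
// (and similar functions) throughout the program.
```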
Instruction Throughput
Avoid low throughput instructions
– Integer division and modulo are expensive
  Use bitwise operations (>>, &) instead
– Type conversion costs cycles
  char/short => int
  double <=> float
– Define float quantities with f, e.g. 1.0f
– Use float functions, e.g. expf
– Some devices (<= 1.2) demote double to float
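The shift/mask replacement works whenever the divisor is a power of two and the operand is unsigned — a minimal sketch (hypothetical kernel):

```cuda
// Sketch: replacing division and modulo by a power of two with
// bitwise operations. Valid for unsigned operands only.
__global__ void indexMath(unsigned int *q, unsigned int *r)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Expensive on this hardware:
    //   q[tid] = tid / 16;  r[tid] = tid % 16;
    q[tid] = tid >> 4;    // tid / 16, since 16 == 1 << 4
    r[tid] = tid & 15;    // tid % 16, since 15 == 16 - 1
}
```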
Instruction Throughput
Avoid branching
– Diverging threads in a warp are serialized
– Try to minimize the number of divergent warps
– Loop unrolling by the compiler can be controlled using #pragma unroll
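One way to minimize divergent warps, sketched below with an assumed kernel: make the branch condition depend only on the warp index, so all threads of a warp take the same path.

```cuda
// Sketch: divergence at warp granularity. The condition is uniform
// within a warp, so no serialization occurs.
__global__ void split(float *data)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // same value for a whole warp

    if (warp % 2 == 0)        // uniform within the warp: no divergence
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Unrolling controlled explicitly:
    float acc = 0.0f;
    #pragma unroll 4
    for (int k = 0; k < 16; ++k)
        acc += data[tid] * k;
    data[tid] = acc;
}
```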
Instruction Throughput
Avoid high latency memory instructions
– An SM takes 4 clock cycles to issue a memory instruction to a warp
– In case of local/global memory, there is an overhead of 400 to 600 cycles

__shared__ float shared;
__device__ float device;
shared = device;   // 4 + 4 + [400,600] cycles
Instruction Throughput
Avoid high latency memory instructions
– If local/global memory has to be accessed, surround the access with independent arithmetic instructions
  The SM can do math while accessing memory
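A sketch of this idea (hypothetical kernel): the polynomial below does not depend on the loaded value, so the scheduler can execute it while the 400–600 cycle load is in flight.

```cuda
// Sketch: hiding global-memory latency behind independent arithmetic.
__global__ void overlap(float *out, const float *gmem, float x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float g = gmem[tid];          // load issued; result needed later

    float poly = x * x + 2.0f * x + 1.0f;  // independent math,
    poly *= poly;                          // overlaps the load

    out[tid] = g + poly;          // first actual use of g
}
```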
Instruction Throughput
Cost of __syncthreads()
– The instruction itself takes 4 clock cycles for a warp
– Additional cycles are spent waiting for threads to catch up
Instruction Throughput
Effective memory bandwidth of each memory space (global, local, shared) depends on the memory access pattern
Device memory has higher latency and lower bandwidth than on-chip memory
– Minimize use of device memory
Instruction Throughput
Typical execution
– Each thread loads data from device to shared memory
– Synch threads, if necessary
– Each thread processes data in shared memory
– Synch threads, if necessary
– Write data from shared to device memory
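The steps above can be sketched as a small kernel (hypothetical example: block-wise array reversal). Each block stages a tile in shared memory, synchronizes, then reads the tile in reversed order while writing back to device memory.

```cuda
// Sketch of the typical load / sync / process / store pattern.
#define TILE 256

__global__ void reverseTiles(float *d_data)
{
    __shared__ float tile[TILE];
    int t   = threadIdx.x;
    int gid = blockIdx.x * TILE + t;

    tile[t] = d_data[gid];              // device -> shared
    __syncthreads();                    // tile must be complete before
                                        // any thread reads a neighbor
    d_data[gid] = tile[TILE - 1 - t];   // process + shared -> device
}
```

No second __syncthreads() is needed here because each thread writes only to its own global element and never touches shared memory again.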
Instruction Throughput
Global memory
– High latency, low bandwidth
– Not cached
– Right access patterns are crucial
Instruction Throughput
Global memory: alignment
– Supported word sizes: 4, 8, 16 bytes

__device__ type device[32];
type data = device[tid];

compiles to a single load instruction if
– type has a supported size
– type variables are aligned to sizeof(type): the address of the variable should be a multiple of sizeof(type)
Instruction Throughput
Global memory: alignment
– Alignment requirement is automatically fulfilled for built-in types
– For self-defined structures, alignment can be forced:

struct __align__(8)  { float a, b; }    myStruct8;
struct __align__(16) { float a, b, c; } myStruct12;
Instruction Throughput
Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly:

struct { float a, b, c, d, e; } myStruct20;
// five 32-bit load instructions
Instruction Throughput
Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly:

struct __align__(8) { float a, b, c, d, e; } myStruct20;
// three 64-bit load instructions
Instruction Throughput
Global memory: alignment
– Addresses of global variables are aligned to 256 bytes
– Align structures cleverly:

struct __align__(16) { float a, b, c, d, e; } myStruct20;
// two 128-bit load instructions
Instruction Throughput
Global memory: coalescing
– Size of a memory transaction on global memory can be 32 (>= 1.2), 64 or 128 bytes
– Used most efficiently when simultaneous memory accesses by threads in a half-warp can be coalesced into a single memory transaction
– Coalescing varies with compute capability
Instruction Throughput
Global memory: coalescing, <= 1.1
– Global memory accesses by threads in a half-warp are coalesced if
  Each thread accesses words of size
  – 4 bytes: one 64-byte memory operation
  – 8 bytes: one 128-byte memory operation
  – 16 bytes: two 128-byte memory operations
  All 16 words lie in the same (aligned) segment in global memory
  Threads access the words in sequence
Instruction Throughput
Global memory: coalescing, <= 1.1
– If any of the conditions is violated by a half-warp, thread memory accesses are serialized
– Coalesced access of larger sizes is slower than coalesced access of smaller sizes
  Still a lot more efficient than non-coalesced access
Instruction Throughput
Global memory: coalescing, >= 1.2
– Global memory accesses by threads in a half-warp are coalesced if the accessed words lie in the same aligned segment of the required size:
  32 bytes for 2-byte words
  64 bytes for 4-byte words
  128 bytes for 8/16-byte words
– Any access pattern is allowed
  Lower CC cards restrict access patterns
Instruction Throughput
Global memory: coalescing, >= 1.2
– If a half-warp addresses words in N different segments, N memory transactions are issued
  Lower CC cards issue 16
– Hardware automatically detects and optimizes for unused words, e.g. if the requested words lie in the lower or upper half of a 128-byte segment, a 64-byte transaction is issued
Instruction Throughput
Global memory: coalescing, >= 1.2
– Summary of memory transactions by threads in a half-warp:
  Find the memory segment containing the address requested by the lowest numbered active thread
  Find all other active threads requesting addresses in the same segment
  Reduce the transaction size, if possible
  Do the transaction; mark the serviced threads inactive
  Repeat until all threads are serviced
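The algorithm above can be followed with a small host-side simulation (a simplified model, not an NVIDIA API; it fixes the segment size at 64 bytes, i.e. 4-byte words, and ignores the size-reduction step):

```cuda
// Host-side sketch: how many transactions does a half-warp trigger
// on >= 1.2 hardware for a given set of addresses?
#include <cstdio>

int countTransactions(const unsigned addr[16], int segBytes)
{
    bool serviced[16] = {false};
    int transactions = 0;
    for (int i = 0; i < 16; ++i) {
        if (serviced[i]) continue;
        unsigned seg = addr[i] / segBytes;   // segment of the lowest
        for (int j = i; j < 16; ++j)         // numbered active thread
            if (!serviced[j] && addr[j] / segBytes == seg)
                serviced[j] = true;          // serviced together
        ++transactions;
    }
    return transactions;
}

int main()
{
    unsigned seq[16], strided[16];
    for (int t = 0; t < 16; ++t) {
        seq[t]     = 4 * t;     // consecutive 4-byte words: 1 transaction
        strided[t] = 128 * t;   // one word per segment: 16 transactions
    }
    printf("sequential: %d\n", countTransactions(seq, 64));
    printf("strided:    %d\n", countTransactions(strided, 64));
}
```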
Instruction Throughput
Global memory: coalescing
– General patterns:

TYPE* BaseAddress; // 1D array
// thread reads BaseAddress + tid

  TYPE must meet the size and alignment requirements
  If TYPE is larger than 16 bytes, split it into smaller objects that meet the requirements
Instruction Throughput
Global memory: coalescing
– General patterns:

TYPE* BaseAddress;
// 2D array of size: width x height
// thread (tx, ty) reads BaseAddress + tx*width + ty

  Size and alignment requirements hold
Instruction Throughput
Global memory: coalescing
– General patterns:
  Memory coalescing is achieved for all half-warps in a block if
  – the width of the block is a multiple of 16
  – width is a multiple of 16
  Arrays whose width is a multiple of 16 are accessed more efficiently
  – Useful to pad arrays up to multiples of 16
  – Done automatically by the cuMemAllocPitch / cudaMallocPitch functions
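A sketch of the padded allocation (hypothetical sizes): cudaMallocPitch returns a row pitch in bytes, at least width * sizeof(float) and rounded up to satisfy the alignment requirements; kernels must then address rows via the pitch.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *a, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // rows are addressed via the pitch, not via width
        float *row = (float*)((char*)a + y * pitch);
        row[x] *= 2.0f;
    }
}

int main()
{
    int width = 100, height = 64;   // 400 bytes per row: padding needed
    float *d_array;
    size_t pitch;                   // padded row stride in bytes
    cudaMallocPitch((void**)&d_array, &pitch,
                    width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_array, pitch, width, height);
    cudaFree(d_array);
}
```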
Instruction Throughput
Local memory
– Used for some internal variables
– Not cached
– As expensive as global memory
– As accesses are, by definition, per-thread, they are automatically coalesced
Instruction Throughput
Constant memory
– Cached
  Costs one read from device memory on a cache miss
  Otherwise, one cache read
– For threads in a half-warp, the cost of reading the cache is proportional to the number of different addresses read
  Recommended to have all threads in a half-warp read the same address
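A typical use matching this recommendation (hypothetical filter example): a small coefficient table in __constant__ memory, where every thread of a half-warp reads the same coeff[k] in each loop iteration.

```cuda
#include <cuda_runtime.h>

__constant__ float coeff[5];       // lives in constant memory

__global__ void convolve(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < n - 2) {
        float s = 0.0f;
        for (int k = 0; k < 5; ++k)        // all threads read the same
            s += coeff[k] * in[i + k - 2]; // coeff[k] simultaneously
        out[i] = s;
    }
}

// Host side, filled once before launching:
// float h_coeff[5] = {1, 4, 6, 4, 1};
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
```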
Instruction Throughput
Texture memory
– Cached
  Costs one read from device memory on a cache miss
  Otherwise, one cache read
– Texture cache is optimized for 2D spatial locality
  Recommended for threads in a warp to read neighboring texture addresses
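A sketch with the texture reference API of this CUDA generation (hypothetical example; this API is deprecated in modern CUDA): neighboring threads fetch neighboring texels, so most reads hit the texture cache.

```cuda
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void blur(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (tex1Dfetch(texRef, i - 1) +   // neighboring fetches
                  tex1Dfetch(texRef, i) +       // served mostly from
                  tex1Dfetch(texRef, i + 1))    // the texture cache
                 / 3.0f;
}

// Host side, before launching:
// cudaBindTexture(NULL, texRef, d_in, n * sizeof(float));
```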
Instruction Throughput
Shared memory
– On-chip
  As fast as registers, provided there are no bank conflicts between threads
– Divided into equally-sized modules, called banks
  If N requests fall in N separate banks, they are processed concurrently
  If N requests fall in the same bank, there is an N-way bank conflict
  – The N requests are serialized
Instruction Throughput
Shared memory: banks
– Successive 32-bit words are assigned to successive banks
– Bandwidth: 32 bits per 2 clock cycles
– Requests from a warp are split according to half-warps
  Threads in different half-warps cannot conflict
Instruction Throughput
Shared memory: bank conflicts

__shared__ char shared[32];
char data = shared[BaseIndex + tId];

– Why is this a bank conflict?
Instruction Throughput
Shared memory: bank conflicts

__shared__ char shared[32];
char data = shared[BaseIndex + tId];

– Multiple array members, e.g. shared[0], shared[1], shared[2] and shared[3], lie in the same bank
– Can be resolved as

char data = shared[BaseIndex + 4 * tId];
Instruction Throughput
Shared memory: bank conflicts

__shared__ double shared[32];
double data = shared[BaseIndex + tId];

– Why is this a bank conflict?
Instruction Throughput
Shared memory: bank conflicts

__shared__ double shared[32];
double data = shared[BaseIndex + tId];

– 2-way bank conflict because of a stride of two 32-bit words
Instruction Throughput
Shared memory: bank conflicts

__shared__ TYPE shared[32];
TYPE data = shared[BaseIndex + tId];
Instruction Throughput
Shared memory: bank conflicts

__shared__ TYPE shared[32];
TYPE data = shared[BaseIndex + tId];

– Three separate memory reads with no bank conflicts for

struct TYPE { float x, y, z; };

– Stride of three 32-bit words
Instruction Throughput
Shared memory: bank conflicts

__shared__ TYPE shared[32];
TYPE data = shared[BaseIndex + tId];

– Two separate memory reads, each with a 2-way bank conflict, for

struct TYPE { float x, y; };

– Stride of two 32-bit words, similar to double
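The examples above can be checked with a small host-side simulation (assumed model: 16 banks of 4-byte width per half-warp, as on these devices):

```cuda
// Host-side sketch: which bank does each thread of a half-warp hit
// for a given element size, and what is the worst conflict degree?
#include <cstdio>

int maxConflict(int elemBytes)
{
    int count[16] = {0};
    for (int t = 0; t < 16; ++t) {
        int bank = (t * elemBytes / 4) % 16;  // successive 32-bit
        ++count[bank];                        // words -> successive banks
    }
    int worst = 0;
    for (int b = 0; b < 16; ++b)
        if (count[b] > worst) worst = count[b];
    return worst;   // 1 means conflict-free
}

int main()
{
    printf("float  (4 B): %d-way\n", maxConflict(4));  // conflict-free
    printf("char   (1 B): %d-way\n", maxConflict(1));  // 4 threads/bank
    printf("double (8 B): %d-way\n", maxConflict(8));  // 2 threads/bank
}
```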
Final Projects
Reminder
– Form groups by next lecture
– Think of project ideas for your group
  You are encouraged to submit several ideas
– For each idea, submit a short text describing
  the problem you want to solve
  why you think it is suited for parallel computation
– Jens and I will assign you one of your suggested topics
Final Projects
Reminder
– If some people have not formed groups, Jens and I will assign you randomly to groups
– If you cannot think of any ideas, Jens and I will assign you some
– We will float around some write-ups of our own ideas; you may choose one of those
Final Projects
Timeline
– Thu, 20 Nov (today): float write-ups on ideas of Jens & Waqar
– Tue, 25 Nov: suggest groups and topics
– Thu, 27 Nov: groups and topics assigned
– Tue, 2 Dec: last chance to change groups/topics; groups and topics finalized
All for today
Next time
– More on bank conflicts
– Other optimizations