An embedded language for data-parallel programming
Master of Science Thesis in Computer Science
By Joel Svensson
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
GÖTEBORGS UNIVERSITY
Göteborg, Sweden
Obsidian: an embedded language for data-parallel programming
Data-parallel programming
General-purpose computations on the GPU (GPGPU)
Lava
NVIDIA 8800 GPU
Project Outline
An embedded language for data-parallel programming
Lava programming style using combinators
Generate C code for the NVIDIA GPU
Data-parallel programming
Single sequential program
Executed by a number of processing elements
Operating on different data
for j := 1 to log(n) do
for all k in parallel do
if ((k+1) mod 2^j) = 0 then
x[k] := x[k-2^(j-1)] + x[k]
fi
od
od
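Concretely, this scheme can be simulated sequentially. The sketch below (plain C, not from the thesis) runs the same loop; for n a power of two, it leaves the sum of all elements in x[n-1] after log2(n) rounds.

```c
#include <assert.h>

/* Sequential simulation of the data-parallel loop above.
   In round j, every "processing element" k with
   (k+1) mod 2^j == 0 adds in a partner element. Within a round,
   no element that is written is also read by another element,
   so a plain sequential loop gives the same result as the
   parallel version. After log2(n) rounds, x[n-1] holds the sum. */
static void parallel_sum_sim(int *x, int n)
{
    for (int j = 1; (1 << j) <= n; j++)   /* j = 1 .. log2(n) */
        for (int k = 0; k < n; k++)       /* "for all k in parallel" */
            if ((k + 1) % (1 << j) == 0)
                x[k] = x[k - (1 << (j - 1))] + x[k];
}
```

For example, on [1,2,3,4] the rounds produce [1,3,3,7] and then [1,3,3,10], so x[3] ends up as the total 10.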
GPGPU
GPUs are relatively cheap
High performance (hundreds of GFLOPS)
Applications:
Physics simulation
Bioinformatics
Sorting
www.gpgpu.org
GPU vs CPU GFLOPS Chart
NVIDIA 8800 GPUs
A set of SIMD multiprocessors
8 SIMD processing elements per multiprocessor
Up to 16 multiprocessors in one GPU
Giving 128 processing elements in total
www.nvidia.com
NVIDIA 8800 GPUs
NVIDIA Compute Unified Device Architecture
C compiler and libraries for the GPU
GPU as a highly parallel coprocessor
For use with NVIDIA's 8800 series GPUs
www.nvidia.com/cuda
CUDA Programming Model
High number of threads
Divided into blocks
Thread block:
Up to 512 threads
Divided into warps
Executed on one multiprocessor
CUDA Synchronisation
CUDA supplies a synchronisation primitive, __syncthreads()
Barrier synchronisation
Across all the threads of a block
Used to coordinate communication
Obsidian
Embedded in Haskell
Presents a high-level programmer's interface
Parallel computations described using combinators
CUDA C code is generated
Obsidian
Describes computations on arrays:
Length homogeneous
Sorting algorithms
Integer values
Limitations:
Currently limited to iterative sorting algorithms
Obsidian Programming
Basics:
Sequential composition of programs: ->-
Parallel composition of programs: parl
Index operations: rev, riffle, unriffle
Array operations: halve, conc
Apply or map: fun
Obsidian Programming
Array operations: halve, conc, oeSplit, shuffle
Obsidian Programming
Index operations: rev, riffle, unriffle

riffle = halve ->- shuffle
unriffle = oeSplit ->- conc
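In plain C terms (a sketch, not the generated code), riffle interleaves the two halves of an even-length array, and unriffle inverts it by collecting the even- and odd-indexed elements:

```c
#include <assert.h>
#include <string.h>

/* riffle = halve ->- shuffle: interleave the two halves. */
static void riffle_c(const int *in, int *out, int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[2 * i]     = in[i];          /* first half -> even slots */
        out[2 * i + 1] = in[n / 2 + i];  /* second half -> odd slots */
    }
}

/* unriffle = oeSplit ->- conc: evens first, then odds. */
static void unriffle_c(const int *in, int *out, int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]         = in[2 * i];      /* even-indexed elements */
        out[n / 2 + i] = in[2 * i + 1];  /* odd-indexed elements  */
    }
}
```

So riffle on [0,1,2,3,4,5,6,7] gives [0,4,1,5,2,6,3,7], and unriffle restores the original order.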
Obsidian Programming
Apply or map: fun
Sequential composition of programs: ->-
Parallel composition of programs: parl
Obsidian Programming: an example

rev_incr :: Arr (Exp Int) -> W (Arr (Exp Int))
rev_incr = rev ->- fun (+1) ->- sync

*Obsidian> execute rev_incr [1,2,3]
[4,3,2]
Obsidian Synchronisation
Synchronisation primitive: sync
All array elements are updated after a sync
Only applicable at the top level
Inherits its behaviour from CUDA's __syncthreads()
Generating C Code
Generate CUDA C code for the NVIDIA GPU
Executed as one block of threads
Pros:
Communication and synchronisation are possible
Cons:
Upper limit of 512 threads per block
Does not use the entire GPU
Generating C Code
Each thread is in charge of calculating one array element
Limits array size to 512 elements
Leads to some redundancy
Swap operations are performed by two threads in cooperation
Generating C CodeGenerating C Code
__global__ static void reverse(int *values, int n)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = shared[(n - 1) - tid];
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}
reverse = rev ->- sync
Generating C Code

__global__ static void example(int *values, int n)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    int tmp;
    shared[tid] = values[tid];
    __syncthreads();
    tmp = f(shared[i1], ..., shared[in]);
    __syncthreads();
    shared[tid] = tmp;
    __syncthreads();
    values[tid] = shared[tid];
}
Implementing a sorter
A two-sorter sorts a pair of values:
cmpSwap op (a,b) = ifThenElse (op a b) (a,b) (b,a)

Sort each pair of elements in an array:
sort2 = (pair ->- fun (cmpSwap (<*)) ->- unpair ->- sync)

*Obsidian> execute sort2 [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sort2 [2,1,2,1,2,1]
[1,2,1,2,1,2]
Implementing a sorter
A more efficient pairwise sort:
sortEvens = evens (cmpSwap (<*)) ->- sync

*Obsidian> execute sortEvens [2,3,5,1,6,7]
[2,3,1,5,6,7]
*Obsidian> execute sortEvens [2,1,2,1,2,1]
[1,2,1,2,1,2]
Implementing a sorter
evens
Implementing a sorter
A close relative of evens is odds:
sortOdds = odds (cmpSwap (<*)) ->- sync

*Obsidian> execute sortOdds [5,3,2,1,4,6]
[5,2,3,1,4,6]
*Obsidian> execute sortOdds [1,2,1,2,1,2]
[1,1,2,1,2,2]
Implementing a sorter
odds
Odd Even Transposition Sort
Sorter implemented using odds and evens:
sortOETCore = sortEvens ->- sortOdds

sortOET arr =
  let n = len arr
  in (repE (idiv (n+1) 2) sortOETCore) arr
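For reference, the same algorithm can be sketched in plain C (an illustration of the behaviour, not the generated CUDA code): each pass compare-swaps adjacent pairs starting at an even or odd index, and (n+1)/2 rounds of an even pass followed by an odd pass sort any n-element array.

```c
#include <assert.h>
#include <string.h>

/* One compare-exchange pass over adjacent pairs beginning at
   index `start` (0 mimics sortEvens, 1 mimics sortOdds). */
static void cmp_swap_pass(int *x, int n, int start)
{
    for (int i = start; i + 1 < n; i += 2)
        if (x[i] > x[i + 1]) {
            int t = x[i];
            x[i] = x[i + 1];
            x[i + 1] = t;
        }
}

/* Odd-even transposition sort: (n+1)/2 rounds of an even pass
   followed by an odd pass, mirroring sortOET above. */
static void oet_sort(int *x, int n)
{
    for (int r = 0; r < (n + 1) / 2; r++) {
        cmp_swap_pass(x, n, 0);
        cmp_swap_pass(x, n, 1);
    }
}
```

The even and odd passes reproduce the sortEvens and sortOdds examples above, e.g. [2,3,5,1,6,7] becomes [2,3,1,5,6,7] after one even pass.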
Odd Even Transposition Sort
VSort
Another iterative sorter
log²(n) depth

Built around a shuffle-exchange network:
shex f n = rep n (riffle ->- evens f ->- sync)
VSort
Merger implemented using shex:
bmergeIt n = shex (cmpSwap (<*)) n

*Obsidian> execute (shex (cmpSwap (<*)) 3) [2,4,6,8,7,5,3,1]
[1,2,3,4,5,6,7,8]
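What shex computes can be sketched in plain C (an illustration, not the generated code): each of the n stages riffles the array, interleaving its two halves, and then compare-swaps adjacent pairs; the sync between stages is implicit in a sequential simulation. This reproduces the example above for an array of length 2^n.

```c
#include <assert.h>
#include <string.h>

/* Plain-C sketch of shex f n with f = cmpSwap (<*): n stages,
   each of which riffles the array and then sorts every adjacent
   pair in place. */
static void shex_cmpswap(int *x, int len, int n)
{
    int tmp[len];                       /* C99 VLA scratch buffer */
    for (int s = 0; s < n; s++) {
        /* riffle = halve ->- shuffle: interleave the halves */
        for (int i = 0; i < len / 2; i++) {
            tmp[2 * i]     = x[i];
            tmp[2 * i + 1] = x[len / 2 + i];
        }
        memcpy(x, tmp, (size_t)len * sizeof *x);
        /* evens (cmpSwap (<*)): sort each adjacent pair */
        for (int i = 0; i + 1 < len; i += 2)
            if (x[i] > x[i + 1]) {
                int t = x[i];
                x[i] = x[i + 1];
                x[i + 1] = t;
            }
    }
}
```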
VSort
Sorter implemented using bmergeIt:
vmergeIt n = tblLook tautab ->- sync ->- bmergeIt n

vsortIt n = rep n (vmergeIt n)
Comparison of sorters
Six different sorters:
Bitonic sort on CPU
Odd Even Transposition sort
Three versions of VSort
CUDA Bitonic sort on GPU
Data and hardware:
288 MB of random data
CPU: 2.4 GHz Intel Core 2
GPU: 1.2 GHz NVIDIA 8800 GTS (shader clock)
Comparison of sorters
Related work
Pan:
Embedded in Haskell
Image synthesis
Generates C code

Vertigo:
Embedded in Haskell
Describes shaders
Generates GPU programs
Related work
PyGPU:
Embedded in Python
Uses Python's introspective abilities
Graphics applications
Related work
NESL:
Functional language
Nested data-parallelism
Compiles into VCode

Data Parallel Haskell:
Nested data-parallelism in Haskell
Future work
Solve the recursion dilemma
Enable the description of recursive sorters
Bitonic sort

Make use of the entire GPU
Optimise the generated code
More generality: not just sorters
Other target platforms
Future work
More generality:
Arr a -> Arr b (not just Arr Int -> Arr Int)
Matrices
Pairs of arrays to arrays
Arrays of pairs to arrays
Relax the length-homogeneity requirement