MapReduce As A Language for Parallel Computing
Wenguang CHEN, Dehao CHEN
Tsinghua University
Future Architecture
• Many alternatives
  – A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
  – Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
  – Heterogeneous (CELL, 1/8; FPGA speedup …)
• But programming them is not easy
• Each uses a different programming model; some are (relatively) easy, some are extremely difficult
  – OpenMP, MPI, MapReduce
  – CUDA, Brook
  – Verilog, SystemC
What makes parallel computing so difficult?
• Parallelism identification and expression
  – Auto-parallelization has failed so far
• Complex synchronization may be required
  – Data races and deadlocks, which are difficult to debug
• Load balance
…
Map-Reduce is promising
• Can only solve a subset of problems
  – But an important and fast-growing subset, such as indexing
• Easy to use
  – Programmers only need to write sequential code (see the interface sketch below)
  – The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally targeted distributed systems; now ported to GPU, CELL, multicore
  – But many dialects, which hurt portability
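As a rough illustration of how little the user has to write, a minimal MapReduce interface could look like the sketch below. This is a hypothetical interface for illustration only, not the actual Phoenix or Mars API; the names kv_t, user_map, user_reduce, and emit_fn are invented. The user supplies only the two sequential callbacks; the runtime owns splitting, scheduling, grouping, and synchronization.

    /* Hypothetical minimal MapReduce interface -- a sketch, not the real
     * Phoenix or Mars API. The user writes only these sequential callbacks. */
    typedef struct {
        const void *key;  int key_size;
        const void *val;  int val_size;
    } kv_t;

    typedef void (*emit_fn)(kv_t kv);

    /* The runtime calls this once per input chunk; the user emits intermediate pairs. */
    void user_map(const char *chunk, int len, emit_fn emit_intermediate);

    /* The runtime calls this once per distinct key, with that key's grouped values. */
    void user_reduce(const void *key, int key_size,
                     const kv_t *vals, int nvals, emit_fn emit);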
Limitations on GPUs
• Rely on the CPU to allocate memory
  – How to support variable-length data?
    • Combine size and offset information with the key/val pair (see the record-layout sketch below)
  – How to allocate the output buffer on the GPU?
    • Two-pass scan: get the counts first, then do the real execution
• Lack of lock support
  – How to synchronize to avoid write conflicts?
    • Memory is pre-allocated, so every thread knows where it should write
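For example, variable-length keys and values can be packed into flat byte buffers while each emitted pair is a fixed-size record of offsets and sizes; because the records have a fixed size, every thread can compute its own write position without locks. A minimal sketch, with names invented for illustration rather than taken from Mars's actual data layout:

    /* Sketch: variable-length key/val bytes live in flat arrays; each emitted
     * pair is a fixed-size record of offsets and sizes, so record i of thread t
     * lands at a position the thread can compute by itself (lock-free). */
    typedef struct {
        int key_offset;   /* byte offset of the key in the key buffer   */
        int key_size;
        int val_offset;   /* byte offset of the value in the val buffer */
        int val_size;
    } kv_record;

    /* Pre-allocated by the CPU before the kernel launch (sizes come from the
     * counting pass shown in the Mars pipeline below): */
    char      *d_key_buf;     /* all key bytes, packed                  */
    char      *d_val_buf;     /* all value bytes, packed                */
    kv_record *d_records;     /* one fixed-size entry per emitted pair  */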
MapReduce on Multi-core CPU (Phoenix [HPCA'07])
[Pipeline diagram: Input → Split → Map → Partition → Reduce → Merge → Output]
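The Partition stage routes intermediate pairs to reduce tasks so that equal keys always meet in the same bucket; a common way to do this is to hash the key, as in the illustrative function below. This is not Phoenix's actual partition code; the FNV-1a hash is just one reasonable choice.

    /* Sketch of a partition step: hash the key into one of the reduce buckets. */
    static unsigned int partition(const char *key, int key_size, int num_reduce_tasks)
    {
        unsigned int h = 2166136261u;              /* FNV-1a offset basis */
        for (int i = 0; i < key_size; i++)
            h = (h ^ (unsigned char)key[i]) * 16777619u;
        return h % num_reduce_tasks;
    }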
MapReduce on GPU (Mars [PACT'08])
[Pipeline diagram: Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output]
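In the pipeline above, the counting pass plus a prefix sum is what makes lock-free output possible: each thread first reports how much it will emit, an exclusive prefix sum turns those counts into per-thread write offsets (and a total buffer size for the CPU to allocate), and only then does the real map pass write its output. A rough sketch for word counting, with kernel and variable names invented for illustration rather than taken from Mars:

    /* Pass 1 (MapCount): each thread only counts the words it would emit
     * from its line of input; no output buffers exist yet. */
    __global__ void map_count(const char *buf, const int *line_start,
                              int nlines, int *counts)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nlines) return;
        const char *p = buf + line_start[t];
        int n = 0, in_word = 0;
        for (; *p != '\0' && *p != '\n'; p++) {
            if (*p != ' ' && !in_word) { in_word = 1; n++; }
            else if (*p == ' ')        { in_word = 0;      }
        }
        counts[t] = n;
    }

    /* Host: an exclusive prefix sum over counts[] yields offsets[], the first
     * output slot owned by each thread, and its total is the exact buffer size
     * to cudaMalloc. Pass 2 then writes with no locks or atomics. */
    __global__ void map_emit(const char *buf, const int *line_start,
                             int nlines, const int *offsets, int *word_pos)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nlines) return;
        const char *p = buf + line_start[t];
        int slot = offsets[t], in_word = 0;
        for (; *p != '\0' && *p != '\n'; p++) {
            if (*p != ' ' && !in_word) { in_word = 1; word_pos[slot++] = (int)(p - buf); }
            else if (*p == ' ')        { in_word = 0; }
        }
    }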
Program Example
• Word Count (Phoenix implementation)

    …
    for (i = 0; i < args->length; i++) {
        curr_ltr = toupper(data[i]);
        switch (state) {
        case IN_WORD:
            data[i] = curr_ltr;
            if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
                data[i] = 0;
                emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
                state = NOT_IN_WORD;
            }
            break;
    …
Program Example
• Word Count (Mars implementation)

    __device__ void GPU_MAP_COUNT_FUNC //(void *key, void *val, int keySize, int valSize)
    { …
        do { …
            if (*line != ' ') line++;
            else {
                line++;
                GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n'); …
    }

    __device__ void GPU_MAP_FUNC //(void *key, void *val, int keySize, int valSize)
    { …
        do { …
            if (*line != ' ') line++;
            else {
                line++;
                GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n'); …
    }
Pros and Cons
• Load balance
  – Phoenix: static + dynamic
  – Mars: static, assigns the same amount of map/reduce workload to each thread
• Pre-allocation
  – Lock-free
  – But requires the two-pass scan, which is not an efficient solution
• Sorting: the bottleneck of Mars
  – Phoenix uses insertion sort dynamically while emitting
  – Mars uses bitonic sort, O(n·log²n) (see the kernel sketch below)
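For reference, one step of a typical GPU bitonic sort looks like the sketch below; this is an illustrative kernel, not Mars's actual implementation. The host launches it about log n · log n times, which is where the O(n·log²n) cost comes from, but every comparison is data-independent and therefore maps well to the GPU.

    /* One compare-and-swap step of bitonic sort over n (power-of-two) keys. */
    __global__ void bitonic_step(unsigned int *keys, int j, int k)
    {
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int ixj = i ^ j;                 /* partner element         */
        if (ixj > i) {
            int ascending = ((i & k) == 0);       /* direction of this block */
            if (( ascending && keys[i] > keys[ixj]) ||
                (!ascending && keys[i] < keys[ixj])) {
                unsigned int tmp = keys[i];
                keys[i]   = keys[ixj];
                keys[ixj] = tmp;
            }
        }
    }

    /* Host loop: log n outer stages, each with up to log n inner steps:
     *   for (k = 2; k <= n; k <<= 1)
     *       for (j = k >> 1; j > 0; j >>= 1)
     *           bitonic_step<<<n / 256, 256>>>(d_keys, j, k);
     * i.e. O(log²n) passes over n elements: O(n·log n·log n) work in total. */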
Map-Reduce as a Language, not a library
• Can we have a portable Map-Reduce that could run across different architectures efficiently?
• Promising
  – Map-Reduce already specifies the parallelism well
  – No complex synchronization in user code
• But still difficult
  – Different architectures provide different features
    • Leads to either portability or performance problems
  – Use the compiler and runtime to hide the architecture differences, as we have done in supporting high-level languages such as C
[Diagram: C source → compiler, library & runtime → X86, Power, Sparc, …]
[Diagram, today: separate dialects (Map-Reduce Cluster, Map-Reduce Multicore, Map-Reduce GPU), each with its own library & runtime on its own platform (cluster, multicore, GPU)]
[Diagram, proposed: one general Map-Reduce language, with per-platform libraries & runtimes targeting cluster, multicore, and GPU]
Case study on nVidia GPU
• Portability
  – Host function support
    • Annotating libc and inlining
  – Dynamic memory allocation
    • A big problem; disallow it in user code?
• Performance
  – Memory hierarchy optimization (identifying global, shared, and read-only memory)
  – A typed language is preferable (int4 type acceleration …; see the sketch below)
  – Dynamic memory allocation (again!)
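As an example of why type information helps: when the compiler knows data is 16-byte-aligned int4, one thread can move four ints in a single 128-bit memory transaction instead of four separate 32-bit ones. A minimal illustrative kernel (not from the talk's toolchain; copy_int4 is an invented name):

    /* Copies n4 groups of four ints; each int4 load/store compiles to one
     * 128-bit global-memory transaction, so the kernel issues 4x fewer
     * memory instructions than an int-at-a-time version. */
    __global__ void copy_int4(const int4 *in, int4 *out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];
    }

    /* Launch example: copy_int4<<<(n4 + 255) / 256, 256>>>(d_in, d_out, n4); */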
More to explore
• …