MapReduce As A Language for Parallel Computing
Wenguang CHEN, Dehao CHEN
Tsinghua University
Future Architecture
• Many alternatives
  – A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
  – Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
  – Heterogeneous (CELL, 1/8; FPGA speedup …)
• But programming them is not easy
• Each uses a different programming model; some are (relatively) easy, some are extremely difficult
  – OpenMP, MPI, MapReduce
  – CUDA, Brook
  – Verilog, SystemC
What makes parallel computing so difficult?
• Parallelism identification and expression
  – Auto-parallelization has failed so far
• Complex synchronization may be required
  – Data races and deadlocks, which are difficult to debug
• Load balance
…
Map-Reduce is promising
• Can only solve a subset of problems
  – But an important and fast-growing subset, such as indexing
• Easy to use
  – Programmers only need to write sequential code (see the interface sketch below)
  – The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally targeted distributed systems; now ported to GPU, CELL, multicore
  – But many dialects, which hurt portability
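As a rough illustration of how little the user has to write, a minimal MapReduce interface could look like the sketch below. This is a hypothetical interface for illustration only, not the actual Phoenix or Mars API; the names kv_t, user_map, user_reduce, and emit_fn are invented. The user supplies only the two sequential callbacks; the runtime owns splitting, scheduling, grouping, and synchronization.

    /* Hypothetical minimal MapReduce interface -- a sketch, not the real
     * Phoenix or Mars API. The user writes only these sequential callbacks. */
    typedef struct {
        const void *key;  int key_size;
        const void *val;  int val_size;
    } kv_t;

    typedef void (*emit_fn)(kv_t kv);

    /* The runtime calls this once per input chunk; the user emits intermediate pairs. */
    void user_map(const char *chunk, int len, emit_fn emit_intermediate);

    /* The runtime calls this once per distinct key, with that key's grouped values. */
    void user_reduce(const void *key, int key_size,
                     const kv_t *vals, int nvals, emit_fn emit);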
Limitations on GPUs
• Rely on the CPU to allocate memory
  – How to support variable-length data?
    • Combine size and offset information with the key/val pair (see the record-layout sketch below)
  – How to allocate the output buffer on the GPU?
    • Two-pass scan: get the counts first, then do the real execution
• Lack of lock support
  – How to synchronize to avoid write conflicts?
    • Memory is pre-allocated, so every thread knows where it should write
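For example, variable-length keys and values can be packed into flat byte buffers while each emitted pair is a fixed-size record of offsets and sizes; because the records have a fixed size, every thread can compute its own write position without locks. A minimal sketch, with names invented for illustration rather than taken from Mars's actual data layout:

    /* Sketch: variable-length key/val bytes live in flat arrays; each emitted
     * pair is a fixed-size record of offsets and sizes, so record i of thread t
     * lands at a position the thread can compute by itself (lock-free). */
    typedef struct {
        int key_offset;   /* byte offset of the key in the key buffer   */
        int key_size;
        int val_offset;   /* byte offset of the value in the val buffer */
        int val_size;
    } kv_record;

    /* Pre-allocated by the CPU before the kernel launch (sizes come from the
     * counting pass shown in the Mars pipeline below): */
    char      *d_key_buf;     /* all key bytes, packed                  */
    char      *d_val_buf;     /* all value bytes, packed                */
    kv_record *d_records;     /* one fixed-size entry per emitted pair  */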
MapReduce on Multi-core CPU (Phoenix [HPCA'07])
[Pipeline diagram: Input → Split → Map → Partition → Reduce → Merge → Output]
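The Partition stage routes intermediate pairs to reduce tasks so that equal keys always meet in the same bucket; a common way to do this is to hash the key, as in the illustrative function below. This is not Phoenix's actual partition code; the FNV-1a hash is just one reasonable choice.

    /* Sketch of a partition step: hash the key into one of the reduce buckets. */
    static unsigned int partition(const char *key, int key_size, int num_reduce_tasks)
    {
        unsigned int h = 2166136261u;              /* FNV-1a offset basis */
        for (int i = 0; i < key_size; i++)
            h = (h ^ (unsigned char)key[i]) * 16777619u;
        return h % num_reduce_tasks;
    }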
MapReduce on GPU (Mars [PACT'08])
[Pipeline diagram: Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output]
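In the pipeline above, the counting pass plus a prefix sum is what makes lock-free output possible: each thread first reports how much it will emit, an exclusive prefix sum turns those counts into per-thread write offsets (and a total buffer size for the CPU to allocate), and only then does the real map pass write its output. A rough sketch for word counting, with kernel and variable names invented for illustration rather than taken from Mars:

    /* Pass 1 (MapCount): each thread only counts the words it would emit
     * from its line of input; no output buffers exist yet. */
    __global__ void map_count(const char *buf, const int *line_start,
                              int nlines, int *counts)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nlines) return;
        const char *p = buf + line_start[t];
        int n = 0, in_word = 0;
        for (; *p != '\0' && *p != '\n'; p++) {
            if (*p != ' ' && !in_word) { in_word = 1; n++; }
            else if (*p == ' ')        { in_word = 0;      }
        }
        counts[t] = n;
    }

    /* Host: an exclusive prefix sum over counts[] yields offsets[], the first
     * output slot owned by each thread, and its total is the exact buffer size
     * to cudaMalloc. Pass 2 then writes with no locks or atomics. */
    __global__ void map_emit(const char *buf, const int *line_start,
                             int nlines, const int *offsets, int *word_pos)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nlines) return;
        const char *p = buf + line_start[t];
        int slot = offsets[t], in_word = 0;
        for (; *p != '\0' && *p != '\n'; p++) {
            if (*p != ' ' && !in_word) { in_word = 1; word_pos[slot++] = (int)(p - buf); }
            else if (*p == ' ')        { in_word = 0; }
        }
    }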
Program Example
• Word Count (Phoenix implementation)

    …
    for (i = 0; i < args->length; i++) {
        curr_ltr = toupper(data[i]);
        switch (state) {
        case IN_WORD:
            data[i] = curr_ltr;
            if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
                data[i] = 0;
                emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
                state = NOT_IN_WORD;
            }
            break;
    …
Program Example
• Word Count (Mars implementation)

    __device__ void GPU_MAP_COUNT_FUNC //(void *key, void *val, int keySize, int valSize)
    { …
        do { …
            if (*line != ' ') line++;
            else {
                line++;
                GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n'); …
    }

    __device__ void GPU_MAP_FUNC //(void *key, void *val, int keySize, int valSize)
    { …
        do { …
            if (*line != ' ') line++;
            else {
                line++;
                GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
                while (*line == ' ') { line++; }
                wordSize = 0;
            }
        } while (*line != '\n'); …
    }
Pros and Cons
• Load balance
  – Phoenix: static + dynamic
  – Mars: static, assigns the same amount of map/reduce workload to each thread
• Pre-allocation
  – Lock-free
  – But requires the two-pass scan, which is not an efficient solution
• Sorting: the bottleneck of Mars
  – Phoenix uses insertion sort dynamically while emitting
  – Mars uses bitonic sort, O(n·log²n) (see the kernel sketch below)
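For reference, one step of a typical GPU bitonic sort looks like the sketch below; this is an illustrative kernel, not Mars's actual implementation. The host launches it about log n · log n times, which is where the O(n·log²n) cost comes from, but every comparison is data-independent and therefore maps well to the GPU.

    /* One compare-and-swap step of bitonic sort over n (power-of-two) keys. */
    __global__ void bitonic_step(unsigned int *keys, int j, int k)
    {
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int ixj = i ^ j;                 /* partner element         */
        if (ixj > i) {
            int ascending = ((i & k) == 0);       /* direction of this block */
            if (( ascending && keys[i] > keys[ixj]) ||
                (!ascending && keys[i] < keys[ixj])) {
                unsigned int tmp = keys[i];
                keys[i]   = keys[ixj];
                keys[ixj] = tmp;
            }
        }
    }

    /* Host loop: log n outer stages, each with up to log n inner steps:
     *   for (k = 2; k <= n; k <<= 1)
     *       for (j = k >> 1; j > 0; j >>= 1)
     *           bitonic_step<<<n / 256, 256>>>(d_keys, j, k);
     * i.e. O(log²n) passes over n elements: O(n·log n·log n) work in total. */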
Map-Reduce as a Language, not a library
• Can we have a portable Map-Reduce that could run across different architectures efficiently?
• Promising
  – Map-Reduce already specifies the parallelism well
  – No complex synchronization in user code
• But still difficult
  – Different architectures provide different features
    • Leads to either portability or performance problems
  – Use the compiler and runtime to hide the architecture differences, as we have done in supporting high-level languages such as C
[Diagram: C source → compiler, library & runtime → X86, Power, Sparc, …]
[Diagram, today: separate dialects (Map-Reduce Cluster, Map-Reduce Multicore, Map-Reduce GPU), each with its own library & runtime on its own platform (cluster, multicore, GPU)]
[Diagram, proposed: one general Map-Reduce language, with per-platform libraries & runtimes targeting cluster, multicore, and GPU]
Case study on nVidia GPU
• Portability
  – Host function support
    • Annotating libc and inlining
  – Dynamic memory allocation
    • A big problem; disallow it in user code?
• Performance
  – Memory hierarchy optimization (identifying global, shared, and read-only memory)
  – A typed language is preferable (int4 type acceleration …; see the sketch below)
  – Dynamic memory allocation (again!)
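As an example of why type information helps: when the compiler knows data is 16-byte-aligned int4, one thread can move four ints in a single 128-bit memory transaction instead of four separate 32-bit ones. A minimal illustrative kernel (not from the talk's toolchain; copy_int4 is an invented name):

    /* Copies n4 groups of four ints; each int4 load/store compiles to one
     * 128-bit global-memory transaction, so the kernel issues 4x fewer
     * memory instructions than an int-at-a-time version. */
    __global__ void copy_int4(const int4 *in, int4 *out, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            out[i] = in[i];
    }

    /* Launch example: copy_int4<<<(n4 + 255) / 256, 256>>>(d_in, d_out, n4); */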
More to explore
• …