the performance of mapreduce: an in-depth study

The Performance of MapReduce: An In-depth Study Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu, School of Computing, NUS Presented by Tang Kai

Upload: kevin-tong

Post on 13-Jan-2015


Page 1: The Performance of MapReduce: An In-depth Study

The Performance of MapReduce: An In-depth Study

Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu, School of Computing, NUS

Presented by Tang Kai

Page 2: The Performance of MapReduce: An In-depth Study

Index

Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark

Page 3: The Performance of MapReduce: An In-depth Study

Introduction

MapReduce-based systems are increasingly being used.
◦ Simple yet impressive interface: Map() and Reduce()
◦ Flexible: storage system independence
◦ Scalable
◦ Fine-grained fault tolerance

Page 4: The Performance of MapReduce: An In-depth Study

Motivation

Previous study
◦ Fundamental differences: schema support, data access, fault tolerance
◦ Benchmark: Parallel DB >> MR-based (parallel databases were much faster)

Page 5: The Performance of MapReduce: An In-depth Study

Objective

Is it not possible to have a flexible, scalable and efficient MapReduce-based system?

Work
◦ Identify several performance bottlenecks
◦ Manage the bottlenecks and tune performance using well-known engineering and database techniques

Conclusion
◦ Performance improved by 2.5x-3.5x

Page 6: The Performance of MapReduce: An In-depth Study

Index

Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark

Page 7: The Performance of MapReduce: An In-depth Study

Factors affecting Performance of MR

7 steps of a MapReduce job
1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce
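The step list can be made concrete with a small single-process sketch. This is only an illustration, not Hadoop's implementation: the function name `run_job`, its parameters and the hash partitioning are assumptions, and the first three steps are collapsed into one loop.

```python
from collections import defaultdict
from itertools import groupby


def run_job(lines, parse, map_fn, reduce_fn, num_reducers=2):
    """Toy single-process walk through the steps of a MapReduce job."""
    # Steps 1-3: read each raw line, parse it into a record, and apply
    # the user's map function to produce intermediate (key, value) pairs.
    intermediate = []
    for line in lines:
        record = parse(line)
        intermediate.extend(map_fn(record))

    # Step 4: map-side sort of the intermediate pairs by key.
    intermediate.sort(key=lambda kv: kv[0])

    # Step 5: shuffle, here a hash partition of the pairs across reducers.
    partitions = defaultdict(list)
    for key, value in intermediate:
        partitions[hash(key) % num_reducers].append((key, value))

    # Steps 6-7: merge each partition into key groups, then reduce each group.
    output = {}
    for part in partitions.values():
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            output[key] = reduce_fn(key, [value for _, value in group])
    return output
```

For example, with records of the hypothetical form `"sourceIP|adRevenue"`, `parse` can split on `|`, `map_fn` can emit one `(sourceIP, adRevenue)` pair, and `reduce_fn` can sum the values.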

Page 8: The Performance of MapReduce: An In-depth Study

Factors affecting Performance of MR

I/O mode
Indexing
Parsing
Sorting

Page 9: The Performance of MapReduce: An In-depth Study

I/O mode

Direct I/O
◦ Reads data from the disk directly
◦ Local
Streaming I/O
◦ Streams data from the storage system through an inter-process communication scheme, such as TCP/IP or JDBC
◦ Local and remote

Direct I/O is faster than streaming I/O by 10%-15%.

Page 10: The Performance of MapReduce: An In-depth Study

Indexing

Input of a MapReduce job
◦ A set of files stored in a distributed file system, e.g. HDFS: range indexes
◦ Input HDFS files that are not sorted, but each data chunk in the files is indexed by keys: block-level indexes
◦ Tables stored in database servers: database indexed tables

Indexing boosts the selection task by 2x-10x, depending on the selectivity.
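As a rough illustration of the block-level case, the sketch below builds one small index per data chunk and uses it to answer a selection without scanning every record. The function names and the in-memory list-of-chunks layout are assumptions made for the example, not the structures used in the study.

```python
def build_block_index(chunks, key_of):
    """For each chunk, map every key to the positions of its records."""
    index = []
    for chunk in chunks:
        positions = {}
        for pos, record in enumerate(chunk):
            positions.setdefault(key_of(record), []).append(pos)
        index.append(positions)
    return index


def select(chunks, index, key):
    """Return records with the given key, reading only indexed positions."""
    hits = []
    for chunk, positions in zip(chunks, index):
        for pos in positions.get(key, []):
            hits.append(chunk[pos])
    return hits
```

The chunks themselves stay unsorted; only the per-chunk index knows where each key lives, which is why the gain depends on how selective the query is.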

Page 11: The Performance of MapReduce: An In-depth Study

Parsing

Raw data -> <k,v> pair

Immutable decoding
◦ Read-only records (set once)
Mutable decoding

The mutable decoder is 10x faster.
◦ Boosts the selection task 2x overall
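The difference between the two decoding schemes can be sketched as follows. The `Record` class, the `|`-separated line format and both function names are assumptions for illustration; the 10x figure above comes from the slides' own decoders, and this sketch makes no claim about Python timings.

```python
class Record:
    """Hypothetical two-field record (source IP and ad revenue)."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self, source_ip="", ad_revenue=0.0):
        self.source_ip = source_ip
        self.ad_revenue = ad_revenue


def decode_immutable(lines):
    # Immutable decoding: a new record object is created for every line
    # and its fields are set once.
    for line in lines:
        ip, revenue = line.split("|")
        yield Record(ip, float(revenue))


def decode_mutable(lines):
    # Mutable decoding: one record object is allocated up front and its
    # fields are overwritten for every line.
    record = Record()
    for line in lines:
        ip, revenue = line.split("|")
        record.source_ip = ip
        record.ad_revenue = float(revenue)
        yield record
```

With the mutable variant the caller has to finish with each record before asking for the next one, because the same object is reused.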

Page 12: The Performance of MapReduce: An In-depth Study

Sorting

Map-side sorting affects the performance of aggregation.
◦ The cost of key comparison is non-trivial.

Example
◦ sourceIP in the UserVisits table
◦ Sort intermediate records.
◦ sourceIP is a variable-length string.
    String compare (byte-to-byte)
    Fingerprint compare (integer)

Fingerprint-based comparison is 4x-5x faster.
◦ 20%-25% overall
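A minimal sketch of fingerprint-based comparison, assuming CRC32 as the fingerprint function (the slides do not say which function was used): keys are ordered by an integer fingerprint first, and the string itself is compared only when two fingerprints are equal.

```python
import zlib


def fingerprint(key):
    # 32-bit integer fingerprint of a string key. CRC32 is chosen here
    # only for illustration.
    return zlib.crc32(key.encode("utf-8"))


def sort_by_fingerprint(keys):
    # Order by the integer fingerprint; fall back to the full string
    # only to break ties between equal fingerprints.
    return sorted(keys, key=lambda k: (fingerprint(k), k))
```

The resulting order is not lexicographic, but equal keys still end up next to each other, which is all that grouping for an aggregation needs.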

Page 13: The Performance of MapReduce: An In-depth Study

Pruning search space

Why
◦ 4 factors, resulting in a large search space (2*2*3*2)
◦ Budget limit on Amazon EC2

Greedy strategy
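To see why a greedy strategy helps, the sketch below tunes one factor at a time instead of trying every combination. The factor names, the values and the penalty numbers are made up for the example (only the sizes 2*2*3*2 follow the slide), and `toy_cost` stands in for a real benchmark run.

```python
from itertools import product

# Hypothetical factor names and values; only the sizes follow the slide.
FACTORS = {
    "io_mode": ["streaming", "direct"],
    "decoding": ["immutable", "mutable"],
    "parser": ["parser_a", "parser_b", "parser_c"],
    "sort_key": ["string", "fingerprint"],
}

# Made-up penalties standing in for one benchmark run; not measured values.
TOY_PENALTY = {
    "streaming": 2, "direct": 1,
    "immutable": 5, "mutable": 1,
    "parser_a": 3, "parser_b": 2, "parser_c": 4,
    "string": 2, "fingerprint": 1,
}


def toy_cost(config):
    return sum(TOY_PENALTY[value] for value in config.values())


def greedy_search(factors, cost):
    """Tune one factor at a time, keeping the best value found so far."""
    config = {name: values[0] for name, values in factors.items()}
    runs = 0
    for name, values in factors.items():
        best_value, best_cost = None, None
        for value in values:
            trial = dict(config, **{name: value})
            trial_cost = cost(trial)
            runs += 1
            if best_cost is None or trial_cost < best_cost:
                best_value, best_cost = value, trial_cost
        config[name] = best_value
    return config, runs


def exhaustive_size(factors):
    """Number of configurations an exhaustive search would have to run."""
    return len(list(product(*factors.values())))
```

With these sizes the greedy pass needs 2+2+3+2 = 9 runs, against 2*2*3*2 = 24 for the full search space. It is only guaranteed to find the best configuration when the factors do not interact, as in this additive toy cost.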

Page 14: The Performance of MapReduce: An In-depth Study

Pruning search space

Greedy Strategy
I/O mode: Direct I/O, Stream I/O
Parser: Hadoop Writable, Google’s ProtocolBuffer, Berkeley DB
Different sort schemes in various architectures
Benchmark: 3 datasets, 4 queries

Page 15: The Performance of MapReduce: An In-depth Study

Index

Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark

Page 16: The Performance of MapReduce: An In-depth Study

Implementation details

Hadoop 0.19.2 as code base
Direct I/O
◦ Modification of the data node implementation
Text decoder
◦ Immutable: same as DeWitt
◦ Mutable: implemented by ourselves
Binary decoder
◦ Hadoop
    Immutable: Writable decoder
    Mutable: implemented by ourselves using the Hadoop API
◦ Google Protocol Buffers
    Built-in compiler -> mutable
    Immutable: implemented by ourselves
◦ Berkeley DB
    BDB binding API (mutable)

Page 17: The Performance of MapReduce: An In-depth Study

Environment details

Amazon EC2 (Elastic Compute Cloud)
◦ 7.5 GB memory
◦ 2 virtual cores
◦ 64-bit Fedora 8
Tuning EC2 disk I/O by shifting peak time
Hadoop settings
◦ HDFS block size: 512 MB
◦ JVM heap size: 1024 MB

Page 18: The Performance of MapReduce: An In-depth Study

Index

Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark

Page 19: The Performance of MapReduce: An In-depth Study

Benchmark for I/O

Results for different I/O modes
◦ Single node
◦ No-op job with map and without reduce

Page 20: The Performance of MapReduce: An In-depth Study

Benchmark for parsing

Results for record parsing
◦ Run in a Java process instead of a MapReduce job
◦ Timing starts after the data is loaded into memory
Mutable > Immutable
◦ Mutable text > mutable binary

Page 21: The Performance of MapReduce: An In-depth Study

Benchmark for Grep Task

Between Hadoop-based systems
◦ Cache factor
Between Hadoop-based systems and Parallel DB
◦ Close

Page 22: The Performance of MapReduce: An In-depth Study

Benchmark for Selection Task

Selection task -> scan -> index
Caching
Indexing

Page 23: The Performance of MapReduce: An In-depth Study

Benchmark for Aggregation Task

Parsing: 2x faster
Sorting: 20%-25% faster
◦ Not significant in the small aggregation task

Large: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

Small: SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
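The two queries differ only in the grouping key, which a short sketch makes visible. The function name `aggregate` and the in-memory list of `(sourceIP, adRevenue)` pairs are assumptions for illustration; `prefix_len=7` plays the role of `SUBSTR(sourceIP, 1, 7)`.

```python
from collections import defaultdict


def aggregate(visits, prefix_len=None):
    """Sum adRevenue per sourceIP, or per sourceIP prefix if prefix_len is set."""
    totals = defaultdict(float)
    for source_ip, ad_revenue in visits:
        # The large task groups by the full IP; the small task groups by
        # its first prefix_len characters, so far fewer groups result.
        key = source_ip if prefix_len is None else source_ip[:prefix_len]
        totals[key] += ad_revenue
    return dict(totals)
```

Because the small task produces far fewer distinct keys, there are fewer intermediate records to sort, which fits the note that the sorting gain is not significant there.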

Page 24: The Performance of MapReduce: An In-depth Study

Benchmark for Join Task

On decoding scheme
Comparison of tuned MR-based & Parallel DB

Page 25: The Performance of MapReduce: An In-depth Study

Cons & Future work

Cons
◦ The changes need to be committed to, or forked from, the Hadoop source code tree.
◦ A complete framework is needed instead of miscellaneous patches.
◦ Various API support: CLI and Web rather than only Java.
Future work
◦ Provide a query parser, optimizer, etc. to build a complete solution
◦ Elastic power-aware data intensive Cloud

http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
