HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
DESCRIPTION
An increasing number of popular applications are data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications such as data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google, for example, leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion.

In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations behind MapReduce, then describe its dataflow, and next show a couple of example applications. Finally, we present our research project on the Hadoop Distributed File System (HDFS).

The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous. Data locality is not taken into account when launching speculative map tasks, because most map tasks are assumed to be data-local. Unfortunately, neither the homogeneity nor the data-locality assumption holds in virtualized data centers, and we show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. We therefore address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored on each node to improve data-processing performance. Experimental results with two real data-intensive applications show that our data placement strategy consistently improves MapReduce performance by rebalancing data across the nodes before a data-intensive application runs on a heterogeneous Hadoop cluster.

TRANSCRIPT
1
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
Department of Computer Science and Software Engineering
Auburn University
http://www.eng.auburn.edu/~xqin
Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)
2
Motivation
• Large-Scale Data Processing
  – Want to use 1000s of CPUs
  – But don't want the hassle of managing things
• MapReduce provides
  – Automatic parallelization & distribution
  – Fault tolerance
  – I/O scheduling
  – Monitoring & status updates
3
Map/Reduce
• Map/Reduce
  – Programming model from Lisp (and other functional languages)
• Many problems can be phrased this way
• Easy to distribute across nodes
• Nice retry/failure semantics
4
Distributed Grep
[Diagram: a very big data set is split into chunks; grep runs on each chunk in parallel; the per-chunk matches are concatenated (cat) into the final list of all matches.]
5
Distributed Word Count
[Diagram: a very big data set is split into chunks; count runs on each chunk in parallel; the partial counts are merged into a single merged count.]
6
Map Reduce
• Map:
  – Accepts input key/value pair
  – Emits intermediate key/value pair
• Reduce:
  – Accepts intermediate key/value* pair
  – Emits output key/value pair

[Diagram: very big data → MAP → intermediate pairs routed by a partitioning function → REDUCE → result.]
7
Map in Lisp (Scheme)
• (map f list [list2 list3 …])
• (map square '(1 2 3 4))
  – (1 4 9 16)
• (reduce + '(1 4 9 16))
  – (+ 16 (+ 9 (+ 4 1)))
  – 30
• (reduce + (map square (map – l1 l2)))
  (square is a unary operator; + is a binary operator)
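For readers more comfortable with Python, a rough equivalent of the Scheme snippets above (not part of the original slides) might look like this sketch, with functools.reduce playing the role of Scheme's reduce:

```python
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)           # 30

# (reduce + (map square (map - l1 l2))): sum of squared element-wise differences
l1, l2 = [5, 7, 9], [1, 2, 3]
ssd = reduce(lambda a, b: a + b,
             map(lambda d: d * d, map(lambda a, b: a - b, l1, l2)))  # 77
```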
8
Map/Reduce ala Google
• map(key, val) is run on each item in set
  – emits new-key / new-val pairs
• reduce(key, vals) is run for each unique key emitted by map()
  – emits final output
9
Count words in docs
– Input consists of (url, contents) pairs
– map(key=url, val=contents):
  • For each word w in contents, emit (w, "1")
– reduce(key=word, values=uniq_counts):
  • Sum all "1"s in values list
  • Emit result "(word, sum)"
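A minimal single-process sketch of the word-count map and reduce functions above (this is not the Hadoop API; the names map_fn, reduce_fn, and run are illustrative):

```python
from collections import defaultdict

def map_fn(url, contents):
    # emit (word, 1) for every word in the document
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # sum all partial counts for one word
    return word, sum(counts)

def run(docs):
    intermediate = defaultdict(list)
    for url, contents in docs.items():            # map phase
        for word, one in map_fn(url, contents):
            intermediate[word].append(one)        # group by key ("shuffle")
    return dict(reduce_fn(w, c) for w, c in intermediate.items())  # reduce phase

print(run({"doc1": "see bob throw", "doc2": "see spot run"}))
# {'see': 2, 'bob': 1, 'throw': 1, 'spot': 1, 'run': 1}
```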
10
Count, Illustrated
map(key=url, val=contents):
  For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
  Sum all "1"s in values list
  Emit result "(word, sum)"

Input documents: "see bob throw", "see spot run"
Map output: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
11
Grep
– Input consists of (url+offset, single line)
– map(key=url+offset, val=line):
  • If contents matches regexp, emit (line, "1")
– reduce(key=line, values=uniq_counts):
  • Don't do anything; just emit line
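Sketched in the same style, a distributed-grep mapper and reducer might look like the following; the pattern "error" is just a placeholder regexp, and the two functions could be driven by the same kind of toy runner as the word-count sketch above:

```python
import re

PATTERN = re.compile(r"error")  # placeholder regular expression

def map_fn(key, line):
    # key is (url, offset); emit the line once if it matches the pattern
    if PATTERN.search(line):
        yield line, 1

def reduce_fn(line, counts):
    # identity reduce: just pass the matching line through
    return line
```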
12
Reverse Web-Link Graph
• Map
  – For each URL linking to target, …
  – Output <target, source> pairs
• Reduce
  – Concatenate list of all source URLs
  – Output: <target, list(source)> pairs
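A small single-process sketch of this link-graph reversal (the page-to-links dictionary is invented sample data, not from the slides):

```python
from collections import defaultdict

def map_fn(source_url, links):
    # links: list of target URLs found on the source page
    for target in links:
        yield target, source_url

def reduce_fn(target, sources):
    # collect every page that links to this target
    return target, list(sources)

pages = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
inverted = defaultdict(list)
for src, links in pages.items():
    for tgt, s in map_fn(src, links):
        inverted[tgt].append(s)
print({t: reduce_fn(t, s)[1] for t, s in inverted.items()})
# {'b.com': ['a.com'], 'c.com': ['a.com', 'b.com']}
```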
13
Model is Widely Applicable
MapReduce Programs in the Google Source Tree

Example uses:
distributed grep         distributed sort          web link-graph reversal
term-vector per host     web access log stats      inverted index construction
document clustering      machine learning          statistical machine translation
...                      ...                       ...
14
Implementation Overview

Typical cluster:
• 100s/1000s of 2-CPU x86 machines, 2–4 GB of memory
• Limited bisection bandwidth
• Storage is on local IDE disks
• GFS: distributed file system manages data (SOSP '03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines

Implementation is a C++ library linked into user programs
15
Execution
• How is this distributed?
1. Partition input key/value pairs into chunks, run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted values for each unique emitted key
3. Now partition the space of output map keys, and run reduce() in parallel (see the sketch below)
• If map() or reduce() fails, re-execute!
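The three steps can be mimicked by a toy single-process driver like the one below (the names run_job and num_reducers are assumptions for illustration; the real system distributes these phases across machines and handles failures):

```python
from collections import defaultdict

def run_job(map_fn, reduce_fn, inputs, num_reducers=2):
    # 1. Run map() over every input pair (here sequentially, not in parallel)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in inputs:
        for k, v in map_fn(key, value):
            # 2. Partition intermediate keys and consolidate values per key
            partitions[hash(k) % num_reducers][k].append(v)
    # 3. Run reduce() over each partition of the intermediate key space
    output = {}
    for part in partitions:
        for k, values in part.items():
            output[k] = reduce_fn(k, values)
    return output

# Word count expressed through the toy driver
print(run_job(lambda k, v: [(w, 1) for w in v.split()],
              lambda k, vs: sum(vs),
              [("doc1", "a b a")]))   # {'a': 2, 'b': 1}
```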
16
Job Processing
[Diagram: one JobTracker coordinating TaskTracker 0 through TaskTracker 5 for a "grep" job.]

1. Client submits the "grep" job, indicating code and input files
2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the tasktrackers
3. After map(), tasktrackers exchange map output to build the reduce() keyspace
4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work
5. reduce() output may go to NDFS
17
Execution
18
Parallel Execution
19
Task Granularity & Pipelining
• Fine-granularity tasks: map tasks >> machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing
• Often use 200,000 map & 5,000 reduce tasks
• Running on 2,000 machines
20
MapReduce outside Google
• Hadoop (Java)
  – Emulates MapReduce and GFS
• The architecture of Hadoop MapReduce and DFS is master/slave

            Master       Slave
MapReduce   jobtracker   tasktracker
DFS         namenode     datanode
Improving MapReduce Performance through Data Placement in
Heterogeneous Hadoop Clusters
Download Software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc
This HDFS-HC tool was described in our paper - Improving MapReduce Performance via Data Placement in Heterogeneous Hadoop Clusters - by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
22
Hadoop Overview
(J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04, pages 137–150, 2004)
23
One time setup
• Set hadoop-site.xml and slaves
• Start the namenode
• Run Hadoop MapReduce and DFS
• Upload your data to DFS
• Run your job…
• Download your data from DFS
24
Hadoop Distributed File System
(http://lucene.apache.org/hadoop)
25
Motivational Example
[Diagram: task timelines in minutes for Node A (fast, 1 task/min), Node B (2x slower), and Node C (3x slower).]
26
The Native Strategy
[Diagram: per-node time in minutes under the native data placement for Nodes A, B, and C, broken into loading, transferring, and processing; the task counts shown are 6, 3, and 2.]
27
Our Solution: Reducing Data Transfer Time

[Diagram: per-node time in minutes after redistribution (Nodes A', B', and C'), broken into loading, transferring, and processing; the task counts shown are 6, 3, and 2.]
28
Preliminary Results
Impact of data placement on performance of grep
29
Challenges
• Does the computing ratio depend on the application?
• Initial data distribution
• Data skew problem
  – New data arrival
  – Data deletion
  – New node joining
  – Data updating
30
Measure Computing Ratios
• Computing ratio
• Fast machines process large data sets
[Diagram: task timelines for Node A (1 task/min), Node B (2x slower), and Node C (3x slower).]
31
Steps to Measure Computing Ratios
Node     Response time (s)   Ratio   # of File Fragments   Speed
Node A   10                  1       6                     Fastest
Node B   20                  2       3                     Average
Node C   30                  3       2                     Slowest
1. Run the application on each node with the same size of data and collect each node's response time
2. Set the ratio of the shortest response time to 1, and set the ratios of the other nodes accordingly
3. Calculate the least common multiple of these ratios
4. Compute the portion (number of file fragments) for each node, as sketched below
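A small sketch of this calculation using the response times from the table above; math.lcm needs Python 3.9+, and the integer LCM step assumes the ratios round to whole numbers (the 3.3 ratios measured later would need rounding):

```python
from math import lcm  # Python 3.9+

# Measured response times (seconds) for the same input size on each node
response = {"A": 10, "B": 20, "C": 30}

fastest = min(response.values())
ratios = {n: t / fastest for n, t in response.items()}        # A: 1.0, B: 2.0, C: 3.0
common = lcm(*(round(r) for r in ratios.values()))            # lcm(1, 2, 3) = 6
fragments = {n: round(common / r) for n, r in ratios.items()} # A: 6, B: 3, C: 2

print(ratios, fragments)
```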
32
Initial Data Distribution
• Input files are split into 64 MB blocks
• Round-robin data distribution algorithm (a sketch follows the diagram)

[Diagram: the namenode dispatches the blocks of File1 (1–9, a–c) to datanodes A, B, and C.]

Portion 3:2:1
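One way to sketch such a portion-weighted round-robin placement (illustrative only, assuming the portions, e.g. 3:2:1, have already been derived from the computing ratios; this is not the actual HDFS-HC block placer):

```python
from itertools import cycle

def place_blocks(num_blocks, portions):
    """Assign block ids to nodes round-robin, weighted by each node's portion."""
    # Expand e.g. {"A": 3, "B": 2, "C": 1} into the repeating sequence A A A B B C
    order = [node for node, p in portions.items() for _ in range(p)]
    assignment = {node: [] for node in portions}
    for block_id, node in zip(range(num_blocks), cycle(order)):
        assignment[node].append(block_id)
    return assignment

print(place_blocks(12, {"A": 3, "B": 2, "C": 1}))
# Node A holds 6 blocks, B holds 4, C holds 2
```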
33
Data Redistribution
1. Get the network topology, the ratio, and the utilization of each node
2. Build and sort two lists: an under-utilized node list L1 and an over-utilized node list L2
3. Select a source node and a destination node from the lists
4. Transfer data
5. Repeat steps 3 and 4 until the lists are empty (a sketch follows the diagram below)
[Diagram: the namenode moves blocks from the over-utilized nodes (list L2) to the under-utilized nodes (list L1) until the data held by datanodes A, B, and C matches the 3:2:1 portion.]
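A rough sketch of that redistribution loop, with utilization reduced to simple per-node block counts and illustrative names; the real HDFS-HC module works on HDFS blocks, network topology, and utilization thresholds:

```python
def rebalance(current, target):
    """Move blocks from over-utilized to under-utilized nodes until each node
    holds (approximately) its target number of blocks."""
    # Build the two sorted lists: over-utilized (L2) and under-utilized (L1)
    over = sorted((n for n in current if current[n] > target[n]),
                  key=lambda n: current[n] - target[n], reverse=True)
    under = sorted((n for n in current if current[n] < target[n]),
                   key=lambda n: target[n] - current[n], reverse=True)
    moves = []
    while over and under:
        src, dst = over[0], under[0]
        n = min(current[src] - target[src], target[dst] - current[dst])
        current[src] -= n
        current[dst] += n
        moves.append((src, dst, n))          # the "transfer data" step
        if current[src] == target[src]:
            over.pop(0)
        if current[dst] == target[dst]:
            under.pop(0)
    return moves

print(rebalance({"A": 4, "B": 4, "C": 4}, {"A": 6, "B": 4, "C": 2}))
# [('C', 'A', 2)]
```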
34
Sharing Files among Multiple Applications
• The computing ratio depends on the data-intensive application
  – Redistribution
  – Redundancy
35
Experimental Environment
Five nodes in a heterogeneous Hadoop cluster
Node     CPU Model          CPU (Hz)            L1 Cache (KB)
Node A   Intel Core 2 Duo   2 × 1 GHz = 2 GHz   204
Node B   Intel Celeron      2.8 GHz             256
Node C   Intel Pentium 3    1.2 GHz             256
Node D   Intel Pentium 3    1.2 GHz             256
Node E   Intel Pentium 3    1.2 GHz             256
36
Grep and WordCount
• Grep is a tool that searches for a regular expression in a text file
• WordCount is a program that counts the words in a text file
37
Computing ratio for two applications
Computing ratios of the five nodes with respect to the Grep and WordCount applications

Computing Node   Ratio for Grep   Ratio for WordCount
Node A           1                1
Node B           2                2
Node C           3.3              5
Node D           3.3              5
Node E           3.3              5
38
Response Time of Grep and WordCount on Each Node
• The computing ratio is application dependent
• The computing ratio is independent of data size
39
Six Data Placement Decisions
40
Impact of data placement on performance of Grep
41
Impact of data placement on performance of WordCount
42
Conclusion
• Identified the performance degradation caused by heterogeneity.
• Designed and implemented a data placement mechanism in HDFS.
43
Future Work
• Data redundancy issue
• Dynamic data distribution mechanism
• Prefetching
44
Fellowship Program, Samuel Ginn College of Engineering at Auburn University
• Dean's Fellowship: $32,000 per year plus tuition fellowship
• College Fellowship: $24,000 per year plus tuition fellowship
• Departmental Fellowship: $20,000 per year plus tuition fellowship
• Tuition Fellowships: provide a full tuition waiver for a student with a 25 percent or greater full-time-equivalent (FTE) assignment. Both graduate research assistants (GRAs) and graduate teaching assistants (GTAs) are eligible.
45
http://www.eng.auburn.edu/programs/grad-school/fellowship-program/
46
http://www.eng.auburn.edu
Download the presentation slides:
http://www.slideshare.net/xqin74
Google: slideshare Xiao Qin
50
Questions