Big Data and Apache Hadoop’s MapReduce
Michael Hahsler
Computer Science and Engineering
Southern Methodist University
January 23, 2012
Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Table of Contents
1 Introduction
2 Hadoop Distributed File System
3 MapReduce
4 More Examples
Single Machine
[Figure: a single machine with disk, CPU, and memory]
Typical setup for data processing/mining!
What are the problems with big data?
Big Data Challenge
Data Sources
Internet and social networks
Sensors and science
Needed Infrastructure
Scale to thousands of CPUs
Run on cheap commodity hardware (fault-tolerant hardware is expensive!)
Automatically handle data replication and node failure
Data distribution and load balancing
Easy to implement solutions (thinking in terms of parallel computing is hard!)
Hardware: Cluster Architecture
[Figure: cluster architecture. Each node has its own disk, CPU, and memory. The nodes of a rack share a rack switch (Rack 1 ... Rack n), and the rack switches are connected through a backbone switch. A rack contains 16-64 nodes. One node is marked: Node failure!]
Software: Apache Hadoop
What is Apache Hadoop?
A software framework that supports data-intensive (petabytes) distributed applications under a free license.
...inspired by Google’s MapReduce and Google File System (GFS).
Hadoop provides:
Distributed file system HDFS
API to work with MapReduce
Job configuration and scheduling
Track progress and utilization
Written in the Java programming language.
What Problems Can Be Solved with Hadoop?
Characteristics
Processing can easily be parallelized (simple computations)
Process large amounts of unstructured data
Running batch jobs is acceptable
Examples
Creating statistics (word counting)
Searching (distributed grep)
Sorting
Indexing (postings list)
Document clustering
Graph algorithms (e.g., PageRank)
Who uses Hadoop?
Adobe
Amazon.com
AOL
IBM
Microsoft
NY Times
Yahoo!
...
Source: http://wiki.apache.org/hadoop/PoweredBy
The Hadoop Distributed File System (HDFS)
Source: http://hadoop.apache.org/common/docs/current/hdfs_design.html
Files are split into large blocks of 64 MB (a typical file system uses 4 kB blocks)
Replication (2-3x on different racks) → Fault-tolerance
Master node stores meta information
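To make the block and replication numbers concrete, here is a back-of-the-envelope sketch in Python. The 64 MB block size and the replication range come from the slide. The function name `hdfs_footprint`, the choice of 3 replicas, and the 1 GB example file are assumptions for illustration and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # HDFS block size from the slide
REPLICATION = 3      # assumed: upper end of the 2-3x range on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of stored block copies, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    copies = blocks * replication
    # Assumption: a partially filled last block only takes the space of its data,
    # so raw storage is simply file size times replication (metadata ignored).
    raw_storage_mb = file_size_mb * replication
    return blocks, copies, raw_storage_mb


blocks, copies, raw_mb = hdfs_footprint(1024)  # hypothetical 1 GB file
print(blocks, copies, raw_mb)  # 16 48 3072
```

With 4 kB blocks the same file would need 262,144 block entries, which illustrates why large blocks keep the meta information on the master node small.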
MapReduce
Basic Idea
1 Apply a map function to each input element and emit key/value pairs.
map(k1, v1) → list(k2, v2)
2 Summarize the results for each key using a reduce function.
reduce(k2, list(v2)) → (k2, v3)
The user only has to specify the map and reduce functions and the framework takes care of the rest!
The user does not have to think about concurrency, load balancing, data distribution, fault-tolerance!
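The two signatures can be illustrated with a small single-process sketch in Python. It only mimics the data flow (map, group by key, reduce) and does nothing distributed. The names `map_reduce`, `parity_map`, and `sum_reduce` and the toy input are assumptions made for this example and are not part of Hadoop.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map, shuffle/sort, and reduce sequentially in one process.

    inputs:    iterable of (k1, v1) pairs
    map_fn:    (k1, v1) -> iterable of (k2, v2) pairs
    reduce_fn: (k2, list of v2) -> (k2, v3)
    """
    # Map phase: apply map_fn to every input element.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle/sort phase: group all values by their intermediate key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: summarize the values of each key, in sorted key order.
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


# Toy example (hypothetical): sum the even and the odd numbers separately.
def parity_map(_, number):
    return [("even" if number % 2 == 0 else "odd", number)]


def sum_reduce(key, values):
    return (key, sum(values))


result = map_reduce(enumerate([1, 2, 3, 4, 5]), parity_map, sum_reduce)
print(result)  # [('even', 6), ('odd', 9)]
```

The user-supplied parts are only `parity_map` and `sum_reduce`. Everything inside `map_reduce` stands in for what the framework would do across many machines.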
Example: Word Counting
Count how often each word appears in a large number of documents.
void map(String name, String document) {
    // name: document name
    // document: document contents
    for each word w in document:
        EmitIntermediate(w, "1")
}

void reduce(String word, Iterator partialCounts) {
    // word: a word
    // partialCounts: a list of aggregated partial counts
    int sum = 0;
    for each pc in partialCounts:
        sum += ParseInt(pc);
    Emit(word, AsString(sum));
}
Example: Word Counting
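[Figure: data flow through the stages split, map, shuffle/sort, reduce]
split (one input split per line):
    the quick brown fox
    the lazy old dog
    the quick brown dog
map (pairs emitted per split):
    (the, 1) (quick, 1) (brown, 1) (fox, 1)
    (the, 1) (lazy, 1) (old, 1) (dog, 1)
    (the, 1) (quick, 1) (brown, 1) (dog, 1)
shuffle/sort (pairs grouped by key):
    (the, 1) (the, 1) (the, 1)
    (quick, 1) (quick, 1)
    (brown, 1) (brown, 1)
    (fox, 1)
    (lazy, 1)
    (old, 1)
    (dog, 1) (dog, 1)
reduce (one result per key):
    the, 3
    quick, 2
    brown, 2
    fox, 1
    lazy, 1
    old, 1
    dog, 2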
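The split, map, shuffle/sort, and reduce stages for the three example sentences can be traced with a short Python sketch. It runs in a single process and follows the pseudocode of the word counting slide, with integer counts instead of strings. The function names `map_words`, `shuffle`, and `reduce_counts` are assumptions for illustration.

```python
from collections import defaultdict

# Split: each sentence plays the role of one input split.
splits = [
    "the quick brown fox",
    "the lazy old dog",
    "the quick brown dog",
]


def map_words(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle/sort: group the emitted counts by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(groups)


def reduce_counts(word, partial_counts):
    """Reduce: add up the partial counts for one word."""
    return word, sum(partial_counts)


mapped = [pair for split in splits for pair in map_words(split)]
grouped = shuffle(mapped)
counts = dict(reduce_counts(word, pcs) for word, pcs in grouped.items())
print(counts)
# {'the': 3, 'quick': 2, 'brown': 2, 'fox': 1, 'lazy': 1, 'old': 1, 'dog': 2}
```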
Execution of a MapReduce Job
Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.
Fault-tolerance
[Figure: cluster of racks, each rack with 16-64 nodes (disk, CPU, memory) behind a rack switch, and the rack switches connected through a backbone switch. The master pings the workers. The node running Task 1 on Block 1 fails, and the master reschedules Task 1 on another node that stores a copy of Block 1.]
Master re-executes tasks for failed workers.
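The rescheduling step can be sketched as follows. This is a simplified illustration, not Hadoop's scheduler: the function name `reschedule_failed`, the node and task names, and the least-loaded selection rule are all assumptions, and the sketch ignores where block replicas are stored.

```python
def reschedule_failed(assignments, alive):
    """Reassign every task whose worker did not answer the master's ping.

    assignments: dict task -> worker currently responsible for it
    alive:       set of workers that answered the ping
    Returns a new dict task -> worker with all tasks on live workers.
    """
    if not alive:
        raise RuntimeError("no live workers left")
    live = sorted(alive)

    # Count the tasks already placed on each live worker.
    load = {worker: 0 for worker in live}
    for worker in assignments.values():
        if worker in load:
            load[worker] += 1

    new_assignments = {}
    for task, worker in sorted(assignments.items()):
        if worker not in load:
            # The worker failed: re-execute the task on the least loaded live worker.
            worker = min(live, key=lambda w: load[w])
            load[worker] += 1
        new_assignments[task] = worker
    return new_assignments


assignments = {"task1": "node1", "task2": "node2", "task3": "node3"}
alive = {"node2", "node3"}  # node1 did not answer the ping
rescheduled = reschedule_failed(assignments, alive)
print(rescheduled)
# {'task1': 'node2', 'task2': 'node2', 'task3': 'node3'}
```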
Locality
[Figure: cluster of racks, each rack with 16-64 nodes (disk, CPU, memory) behind a rack switch, and the rack switches connected through a backbone switch. Block 1 and Block 2 are stored on nodes in the cluster. Annotation from the master: schedule the task which needs Block 2 on the node that stores it, or at least on the same rack!]
Schedule map tasks close to their data to conserve network bandwidth.
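A locality-aware placement rule can be sketched as a three-step preference: a node that stores the block, then a node on the same rack, then any idle node. The function `pick_node`, the node and rack names, and the labels it returns are assumptions for illustration and do not reproduce Hadoop's actual scheduler.

```python
def pick_node(block, replicas, rack_of, idle_nodes):
    """Pick an idle node for a map task that reads `block`.

    replicas:   dict block -> set of nodes that store a copy of the block
    rack_of:    dict node -> rack identifier
    idle_nodes: list of nodes that can accept a task
    Preference order: node-local, then rack-local, then any idle node.
    """
    holders = replicas.get(block, set())

    # 1. Node-local: an idle node that already stores the block.
    for node in idle_nodes:
        if node in holders:
            return node, "node-local"

    # 2. Rack-local: an idle node on the same rack as some replica.
    holder_racks = {rack_of[node] for node in holders}
    for node in idle_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"

    # 3. Otherwise the block has to travel across the backbone switch.
    if idle_nodes:
        return idle_nodes[0], "off-rack"
    return None, "no idle node"


rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
replicas = {"block1": {"n1"}, "block2": {"n2", "n3"}}

print(pick_node("block2", replicas, rack_of, ["n1", "n2"]))  # ('n2', 'node-local')
print(pick_node("block2", replicas, rack_of, ["n1", "n4"]))  # ('n1', 'rack-local')
print(pick_node("block1", replicas, rack_of, ["n3"]))        # ('n3', 'off-rack')
```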
More Properties
Load Balancing: Subdivide work into many small tasks (≫ # of workers). Since the master dynamically assigns tasks to idle nodes, this automatically provides load balancing.
Chaining: MapReduce operations can be chained (output of one operation is input for another operation) to solve more complicated computations.
Backup tasks: Reduce phase can only start after all map tasks are finished (use backup tasks to avoid “stragglers”).
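Chaining can be illustrated with two small jobs where the output of the first is the input of the second. The sketch below is a single-process stand-in. The helper `run_job` and the two example jobs (word count, then grouping words by their count) are assumptions for illustration, not part of Hadoop.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Minimal single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


# Job 1: word count.
def count_map(_, line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    return (word, sum(ones))


# Job 2: invert the result to list the words that share each count.
def invert_map(word, count):
    return [(count, word)]


def invert_reduce(count, words):
    return (count, sorted(words))


lines = [(0, "the quick brown fox"), (1, "the lazy old dog"), (2, "the quick brown dog")]
word_counts = run_job(lines, count_map, count_reduce)             # output of job 1
words_by_count = run_job(word_counts, invert_map, invert_reduce)  # is the input of job 2
print(words_by_count)
# [(1, ['fox', 'lazy', 'old']), (2, ['brown', 'dog', 'quick']), (3, ['the'])]
```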
Example: Distributed Grep
Map task?
Reduce task?
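The slide leaves both questions open. One possible answer, following the description in the Dean and Ghemawat paper, is that the map task emits a line if it matches the pattern and the reduce task is an identity function that copies the intermediate data to the output. The Python sketch below runs in a single process. The function names, the choice of the document name as key, and the sample records are assumptions for illustration.

```python
import re
from collections import defaultdict


def grep_map(name, line, pattern):
    """Map: emit the line (keyed by its source name) if it matches the pattern."""
    if re.search(pattern, line):
        return [(name, line)]
    return []


def grep_reduce(name, lines):
    """Reduce: identity function, passes the matching lines through unchanged."""
    return (name, lines)


def distributed_grep(records, pattern):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for name, line in records:
        for key, value in grep_map(name, line, pattern):
            groups[key].append(value)
    return [grep_reduce(key, groups[key]) for key in sorted(groups)]


records = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy old dog"),
    ("doc3", "the quick brown dog"),
]
matches = distributed_grep(records, r"dog$")
print(matches)
# [('doc2', ['the lazy old dog']), ('doc3', ['the quick brown dog'])]
```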
Example: Creating an Inverted Index
Map task?
Reduce task?
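The slide leaves both questions open. One possible answer, following the description in the Dean and Ghemawat paper, is that the map task emits (word, document ID) pairs and the reduce task sorts the document IDs of each word into a postings list. The Python sketch below runs in a single process. The function names, the integer document IDs, and the sample documents are assumptions for illustration.

```python
from collections import defaultdict


def index_map(doc_id, text):
    """Map: emit one (word, document ID) pair per distinct word in the document."""
    return [(word, doc_id) for word in set(text.split())]


def index_reduce(word, doc_ids):
    """Reduce: sort the document IDs to form the postings list of the word."""
    return (word, sorted(doc_ids))


def build_inverted_index(documents):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for doc_id, text in documents:
        for word, emitted_id in index_map(doc_id, text):
            groups[word].append(emitted_id)
    return dict(index_reduce(word, groups[word]) for word in sorted(groups))


documents = [
    (1, "the quick brown fox"),
    (2, "the lazy old dog"),
    (3, "the quick brown dog"),
]
index = build_inverted_index(documents)
print(index["the"], index["dog"], index["fox"])  # [1, 2, 3] [2, 3] [1]
```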
Related Projects
Apache HBase: open source, non-relational, distributed database modeled after Google’s BigTable.
Apache Hive: data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Provides a SQL-like query language called HiveQL.
Apache Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java.
Reading
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html
The Apache Hadoop Project
http://hadoop.apache.org/