8/14/2019 ROSEdu Tech Talks Prezentarea 09: Hadoop
Vlad Ureche
ROSEdu Tech Talks
-
Contents
MapReduce
Hadoop
HDFS
HBase
Example
-
MapReduce (1)
Google paper released in 2004
labs.google.com/papers/mapreduce-osdi04.pdf
Context
Google cluster: many nodes, many hardware failures
Lots of data
Idea:
Separate the administrative part from the algorithms
Create a framework for all algorithms
Move computation instead of moving data
-
MapReduce (2)
"MapReduce is a programming model and an associated implementation for processing and generating large data sets."
"Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines."
-
MapReduce (3)
[Figure: DATA is split across three MAP tasks; their output goes through SHUFFLE and SORT and is then fed to three REDUCE tasks. The sample items in the diagram are labelled N1, N2 and N3.]
-
MapReduce (4)
Map: (key1, value1) -> List(key2, value2)
Reduce: (key2, List(value2)) -> List(value2)
Key1, key2: anything that can be compared and checked for equality
Value1, value2: anything
Map and Reduce functions are up to you!
Fault tolerance, scheduling, concurrency
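The map, shuffle/sort and reduce steps can be sketched in a single Python process. This is only an illustration, assuming the signatures from the MapReduce paper (map takes a (k1, v1) pair and returns a list of (k2, v2) pairs; reduce takes k2 and the list of its values and returns a list of values). The names run_mapreduce, max_len_map and max_len_reduce are hypothetical and are not part of any framework.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for map -> shuffle/sort -> reduce."""
    # Map phase: every (key1, value1) record yields a list of (key2, value2) pairs.
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(map_fn(key1, value1))

    # Shuffle and sort: order the pairs by key2 so equal keys become adjacent.
    # This is why keys must be comparable and testable for equality.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: call reduce_fn once per distinct key2 with all of its values.
    results = {}
    for key2, group in groupby(intermediate, key=itemgetter(0)):
        values = [value2 for _, value2 in group]
        results[key2] = reduce_fn(key2, values)
    return results


# Hypothetical job: length of the longest line seen in each file.
def max_len_map(filename, line):
    return [(filename, len(line))]


def max_len_reduce(filename, lengths):
    return [max(lengths)]


records = [("a.txt", "hello"), ("b.txt", "hi"), ("a.txt", "hello world")]
result = run_mapreduce(records, max_len_map, max_len_reduce)
```

Only the two small functions at the bottom are specific to the job; the driver is what a real framework replaces with distributed execution, fault tolerance and scheduling.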
-
MapReduce (5)
Example: count occurrences of 2-grams in a book
The quick brown fox
The quick
quick brown
brown fox
Input: The book
Map: List()
Reduce: sizeof(List)
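A minimal Python sketch of the 2-gram count, run locally instead of on a cluster. It assumes the map step emits the value 1 for every 2-gram (the slide does not say which value is emitted; any value works, because the reduce step only takes the size of the list). The function names and the two-line "book" are made up for the illustration.

```python
from collections import defaultdict


def map_bigrams(text):
    """Map: emit one (2-gram, 1) pair for each pair of adjacent words."""
    words = text.split()
    return [(f"{first} {second}", 1) for first, second in zip(words, words[1:])]


def reduce_count(bigram, values):
    """Reduce: the count is simply the size of the value list."""
    return len(values)


def count_bigrams(chunks):
    # Group the mapped pairs by 2-gram; this plays the role of shuffle and sort.
    grouped = defaultdict(list)
    for chunk in chunks:
        for bigram, one in map_bigrams(chunk):
            grouped[bigram].append(one)
    return {bigram: reduce_count(bigram, values) for bigram, values in grouped.items()}


# Hypothetical input: the book split into two chunks.
book = ["The quick brown fox", "The quick red fox"]
counts = count_bigrams(book)
```

Note that this sketch misses 2-grams that straddle a chunk boundary; how the input is split is a separate decision.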
-
MapReduce (6)
When should you use MR?
Lots of data
Jobs can be parallel
Lots of machines
When not to use MR?
Intensive computation on small data
Jobs depend on each other
-
Hadoop (1)
Hadoop is an open-source implementation of the MapReduce framework
It is a top-level project of the Apache Software Foundation
Appeared two years after the MapReduce paper
Developed by companies:
Yahoo, Cloudera
And independent submitters
-
Hadoop (2)
Used by a wide range of companies and institutions; see the list:
http://wiki.apache.org/hadoop/PoweredBy
-
Hadoop (3)
[Figure: one JobTracker coordinating five TaskTrackers]
Completely automated
Jobs are scheduled based on data locality
Speculative execution
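The idea of scheduling by data locality can be illustrated with a toy assignment function. This is not Hadoop's scheduler, only a sketch of the preference it expresses: give a task to a free node that already stores the input block, and fall back to any free node otherwise. The function assign_tasks, the block names and the node names are all hypothetical.

```python
def assign_tasks(block_locations, free_trackers):
    """Assign each input block to a free tracker, preferring one that stores it."""
    free = list(free_trackers)
    assignment = {}
    for block, replicas in block_locations.items():
        if not free:
            break  # no capacity left; remaining blocks wait for a later round
        # Trackers that hold a replica can read the block without network transfer.
        local = [tracker for tracker in free if tracker in replicas]
        chosen = local[0] if local else free[0]
        assignment[block] = (chosen, bool(local))
        free.remove(chosen)
    return assignment


# Hypothetical cluster state: which nodes store a replica of each block.
blocks = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-c"],
}
plan = assign_tasks(blocks, ["node-a", "node-b", "node-d"])
```

In the resulting plan the first two blocks are processed where they are stored, while the third has to be read over the network by node-d.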
-
Hadoop (4)
Code
Is open source
Java
Build scripts
Bash scripts
Configuration files
-
Hadoop (5)
Is part of a larger ecosystem
HDFS: distributed file system
HBase: distributed, column-oriented database
Mahout: machine learning algorithm library
Nutch: web crawler
And lots of other stuff
-
Hadoop example
Ad click log
User information database (age, location)
How could you use that to your advantage?
Mahout: machine learning framework
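One possible answer, sketched in Python: combine the click log with the user database and estimate a click probability per (location, age, ad) group. The record layout, the sample data and the function names are assumptions made for this illustration; the slides do not specify them, and a machine learning library such as Mahout would go well beyond this simple ratio.

```python
from collections import defaultdict

# Hypothetical user database: user id -> profile.
USERS = {
    "u1": {"age": 16, "location": "Romania"},
    "u2": {"age": 16, "location": "Romania"},
    "u3": {"age": 40, "location": "Germany"},
}

# Hypothetical ad log: (user id, ad, whether the ad was clicked).
AD_LOG = [
    ("u1", "ad-1", True),
    ("u2", "ad-1", False),
    ("u1", "ad-1", False),
    ("u2", "ad-1", True),
    ("u3", "ad-1", False),
    ("u3", "ad-1", False),
]


def map_impression(user_id, ad, clicked):
    """Map: key each impression by the user's demographic group and the ad."""
    profile = USERS[user_id]
    return (profile["location"], profile["age"], ad), 1 if clicked else 0


def reduce_click_rate(flags):
    """Reduce: fraction of impressions in the group that were clicked."""
    return sum(flags) / len(flags)


def click_rates(log):
    grouped = defaultdict(list)
    for user_id, ad, clicked in log:
        key, flag = map_impression(user_id, ad, clicked)
        grouped[key].append(flag)
    return {key: reduce_click_rate(flags) for key, flags in grouped.items()}


rates = click_rates(AD_LOG)
```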
-
HDFS (1)
Distributed file system
Modelled after the GFS paper
labs.google.com/papers/gfs-sosp2003.pdf
Stores multiple copies of data
Seek time >> Scan time
Move computation vs Move data
Small File Problem (TM)
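A back-of-the-envelope calculation shows why seek time matters and why many small files hurt. The figures below (10 ms per seek, 100 MB/s sequential read, one seek per file) are assumptions chosen for round numbers, not measurements, and the model ignores every other cost.

```python
SEEK_SECONDS = 0.010                    # assumed cost of one seek
TRANSFER_BYTES_PER_SEC = 100 * 10**6    # assumed sequential read rate (100 MB/s)


def read_time(total_bytes, file_count):
    """Estimated seconds to read the data, paying one seek per file."""
    seek_cost = file_count * SEEK_SECONDS
    scan_cost = total_bytes / TRANSFER_BYTES_PER_SEC
    return seek_cost + scan_cost


one_gigabyte = 10**9
one_big_file = read_time(one_gigabyte, 1)               # 1 seek, 10 s of scanning
many_small_files = read_time(one_gigabyte, 100_000)     # 100,000 files of 10 KB each

print(f"one file:      {one_big_file:.2f} s")
print(f"100,000 files: {many_small_files:.2f} s")
```

Under these assumptions the same gigabyte takes about 10 seconds as one file and about 1010 seconds as 100,000 small files, almost all of it spent seeking.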
-
HDFS (3)
Part of Hadoop
Open source
Java
Build scripts
Bash scripts
Configuration files
-
HBase
Distributed, column-oriented, sparse hash table
Data is stored in HDFS
Based on the BigTable paper by Google
labs.google.com/papers/bigtable.html
-
HBase (2)
Table
Key
Columns
Column Families
key=location:Romania;age:16;sex=M
ads:copiutze.ro.clickProbability = 0.0018
ads:copiutze.ro.bestPlacement = calendarPage
stats:clickProbability=0.0015
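The row above can be modelled as nested dictionaries: row key, then column family, then column, then value. This is only a picture of the data model, not the HBase client API, and it leaves out details such as cell timestamps. The helper get_cell is a hypothetical name.

```python
# table -> row key -> column family -> column -> value
table = {
    "location:Romania;age:16;sex=M": {
        "ads": {
            "copiutze.ro.clickProbability": 0.0018,
            "copiutze.ro.bestPlacement": "calendarPage",
        },
        "stats": {
            "clickProbability": 0.0015,
        },
    },
}


def get_cell(table, row_key, family, column, default=None):
    """Look up one cell; absent cells cost nothing to store (the table is sparse)."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


row = "location:Romania;age:16;sex=M"
ad_probability = get_cell(table, row, "ads", "copiutze.ro.clickProbability")
missing = get_cell(table, row, "ads", "example.org.clickProbability")
```

Rows do not need to share the same columns: a column that is never written for a row simply does not exist, which is what "sparse" means here.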
-
HBase (3)
Distribution scheme similar to that of HDFS
Uses HDFS to store its files
[Figure: one Master coordinating three Region Servers]
-
Conclusion
MapReduce
Lots of input data
Parallel jobs
Lots of computers
We could also talk about
Mahout: machine learning
Nutch: web crawling
Lucene/Solr: search engine
Pig, Cascading: frameworks over Hadoop
-
Questions?
-
Thank you!