Dr. Sandeep G. Deshmukh
Introduction to Hadoop
Contents
❑ Big Data Primers
❑ Hadoop
➢ Hadoop Distributed File System (HDFS)
➢ MapReduce
❑ Examples
Show of Hands
Big Data Primers: Source of Data
Big Data Primers: The four Vs
Big Data Primers: Size does matter
Big Data Primers: Vertical vs. Horizontal Scaling
[Figure: vertical scaling (a bigger machine) vs. horizontal scaling (more machines)]
Big Data Primers: The scale of infrastructure
Then what is
Distributed Programming: Even more chaos...
Network File System vs. Distributed File System
[Figure: a master node coordinating slave nodes]
Motivation - Traditional Distributed Systems
❑ Processor bound
❑ Using multiple machines
❑ Developer is burdened with managing too many things
➢ Synchronization
➢ Failures
❑ Data moves from the shared disk to the compute node
❑ Cost of maintaining clusters
❑ No scalability on demand
What we need
❑ Handling failure
➢ One computer fails once in 1000 days
➢ 1000 computers: one failure per day
❑ Petabytes of data to be processed in parallel
➢ 1 HDD ≈ 100 MB/sec
➢ 1000 HDDs in parallel ≈ 100 GB/sec
❑ Easy scalability
➢ Performance increases/decreases in proportion to the number of nodes
What we’ve got: Hadoop!
❑ Created by Doug Cutting
❑ Started as a module in Nutch and then matured into an Apache project
❑ Named after his son's stuffed elephant
What we’ve got: Hadoop!
❑ Fault-tolerant file system
➢ Hadoop Distributed File System (HDFS)
➢ Modeled on the Google File System
❑ Takes computation to data
➢ Data locality
❑ Scalability
➢ The program remains the same for 10, 100, 1000, … nodes
➢ With corresponding performance improvement
❑ Parallel computation using MapReduce
❑ Other components: Pig, HBase, Hive, ZooKeeper
Hadoop: Myth vs. Truth

Myth | Truth
Hadoop is a database | HDFS is a distributed file system
Hadoop is a data warehouse | It complements a warehouse; it is not a substitute
Hadoop is a complete, single product | It is an ecosystem, not a single product; HDFS and MapReduce are the key components
Hadoop is used only for unstructured data and web analytics | It enables many types of analytics
Who is using Hadoop
HDFS: Hadoop Distributed File System
HDFS - Contents
❑ Components
❑ Architecture
❑ Tasks / Services
Terminology
❑ HDFS: Hadoop Distributed File System
❑ Datanode: A DataNode stores data in HDFS.
❑ Namenode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
➢ Active: actively serves requests
➢ Standby: becomes Active if the current Active NameNode fails
❑ Secondary Namenode:
➢ A helper node for the NameNode (not a hot standby)
➢ Periodically checkpoints the file-system metadata, which helps the NameNode restart faster
Components of HDFS
❑ Active NameNode
❑ Standby NameNode
❑ Secondary NameNode
❑ DataNodes
Architecture
Storing a file on HDFS
Motivation: reliability, availability, network bandwidth
➢ The input file (say 1 TB) is split into smaller chunks/blocks of 128 MB
➢ The chunks are stored on multiple slave nodes as independent files
➢ To ensure that data is not lost, replicas are stored as follows:
➢ One on the local node
➢ One on the local rack (in case the local node fails)
➢ One on a remote rack (in case the local rack fails)
➢ Others randomly placed
➢ The default replication factor is 3
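On a real cluster, the block size and replication factor described above map to two properties in hdfs-site.xml (property names from Hadoop 2.x; a minimal sketch, with values mirroring the defaults mentioned above):

```xml
<!-- hdfs-site.xml: block size and replication settings -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- default replication factor -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB blocks -->
  </property>
</configuration>
```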
[Figure: a file split into blocks B1 … Bn; each block replicated on DataNodes across two racks (Hub 1, Hub 2), with the master node tracking block locations over 8-gigabit and 1-gigabit network links]
Tasks of NameNode
❑ Manages the file system: mapping files to blocks and blocks to DataNodes
❑ Maintains the status of DataNodes
➢ Heartbeat
■ Each DataNode sends a heartbeat at regular intervals
■ If no heartbeat is received, the DataNode is declared dead
➢ Blockreport
■ Each DataNode sends the list of blocks it holds
■ Used to check the health of HDFS
NameNode Functions
❑ Replication
➢ On DataNode failure
➢ On disk failure
➢ On block corruption
❑ Data integrity
➢ Checksum for each block
➢ Stored in a hidden file
❑ Rebalancing (balancer tool), triggered by:
➢ Addition of new nodes
➢ Decommissioning of nodes
➢ Deletion of some files
HDFS Robustness
❑ Safemode
➢ At startup: no replication is possible
➢ The NameNode receives Heartbeats and Blockreports from DataNodes
➢ A defined percentage of blocks must meet the configured replication factor
➢ All is well ☺ ➔ exit Safemode
❑ Then replicate blocks wherever necessary
HDFS Summary
❑ Fault tolerant
❑ Scalable
❑ Reliable
❑ Files are distributed in large blocks for
➢ Efficient reads
➢ Parallel access
Break Time
Questions?
MapReduce
❑ Origin
❑ Example
❑ Programming model
What is MapReduce?
❑ It is a powerful paradigm for parallel computation
❑ Hadoop uses MapReduce to execute jobs on files in HDFS
❑ Hadoop will intelligently distribute computation over cluster
❑ Take computation to data
Origin: Functional Programming
map f [a, b, c] = [f(a), f(b), f(c)]
Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument
Example:
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)]
= [1,4,9]
Origin: Functional Programming
reduce f [a, b, c] = f(a, b, c)
Returns a value constructed by combining all items in the list (the second argument) using a function (the first argument)
Example:
reduce sum [1, 4, 9] = sum(1, 4, 9)
= 14
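Both primitives exist in Java's stream library; as a quick sanity check of the two formulas above (the class name is ours, for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceOrigin {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);

        // map sq [1, 2, 3] = [1, 4, 9]
        List<Integer> squares = input.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // reduce sum [1, 4, 9] = 14
        int sum = squares.stream().reduce(0, Integer::sum);

        System.out.println(squares + " -> " + sum);
    }
}
```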
Example: Sum of squares
Input: [1, 2, 3, 4]
MAPPER (M1, M2, M3, M4): Sq(1) Sq(2) Sq(3) Sq(4)
Intermediate output: 1, 4, 9, 16
REDUCER (R1) output: 30
Example: Sum of squares of even and odd
Input: [1, 2, 3, 4]
MAPPER (M1, M2, M3, M4): Sq(1) Sq(2) Sq(3) Sq(4)
Intermediate output: (odd, 1) (even, 4) (odd, 9) (even, 16)
REDUCER (R1, R2) output: (even, 20) (odd, 10)
Programming model - key, value pairs
Format of input-output: (key, value)
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)
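These signatures can be sketched with plain Java generics; the interface names and the word-count instance below are ours for illustration, not Hadoop's actual API:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class Signatures {
    // Map: (k1, v1) -> list(k2, v2)
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce: (k2, list(v2)) -> list(k3, v3)
    interface Reducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    public static void main(String[] args) {
        // Word-count instance: (lineNo, line) -> [(word, 1), ...]
        Mapper<Integer, String, String, Integer> mapper = (lineNo, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.split("\\s+"))
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            return out;
        };
        // (word, [1, 1, ...]) -> [(word, count)]
        Reducer<String, Integer, String, Integer> reducer = (word, ones) ->
            Collections.singletonList(new AbstractMap.SimpleEntry<>(
                word, ones.stream().mapToInt(Integer::intValue).sum()));

        List<Map.Entry<String, Integer>> mapped = mapper.map(1, "to be or not to be");
        List<Map.Entry<String, Integer>> reduced = reducer.reduce("to", Arrays.asList(1, 1));
        System.out.println(mapped + " / " + reduced);
    }
}
```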
Sum of squares of even, odd and prime
Input: [1, 2, 3, 4]
MAPPER (M1, M2, M3, M4): Sq(1) Sq(2) Sq(3) Sq(4)
Intermediate output: (odd, 1); (even, 4) (prime, 4); (odd, 9) (prime, 9); (even, 16)
REDUCER (R1, R2, R3) output: (even, 20) (odd, 10) (prime, 13)
Many keys, many values
Format of input-output: (key, value)
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)
Input: a 1 TB text file containing color names - Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey, …
Desired output:
Occurrence of the colors Blue and Green

The file is split into chunks f.001 … f.00n stored on nodes N1 … Nn (e.g. N1 holds f.001: Blue Purple Blue Red Green Blue Maroon Green Yellow). On each node:
➢ MAPPER: grep -E 'Blue|Green' keeps only the matching records (for f.001: Blue Blue Green Blue Green)
➢ sort | uniq -c groups and counts identical colors
➢ COMBINER: awk '{arr[$1]++;} END {print arr["Blue"], arr["Green"]}' emits the local counts (e.g. Blue=500 Green=200 on one node, Blue=420 Green=200 on another)
➢ REDUCER: awk '{arr[$1]+=$2;} END {print arr["Blue"], arr["Green"]}' sums the per-node counts into the final answer (Blue=3000 Green=5500)
MapReduce Overview
INPUT → MAP → SHUFFLE → REDUCE → OUTPUT
➢ Map works on one record at a time
➢ Reduce works on the output of Map
MapReduce Overview (with a Combiner)
INPUT → MAP → COMBINE → REDUCE → OUTPUT
➢ Combine works on the output of Map
➢ Reduce works on the output of Combine
MapReduce Overview
❑ Takes computation to data
MapReduce Overview
❑ Mapper, Reducer and Combiner act on <key, value> pairs
❑ The Mapper gets one record at a time as input
❑ The Combiner (if present) works on the output of the Map
❑ The Reducer works on the output of the Map (or of the Combiner, if present)
❑ The Combiner can be thought of as a local Reducer
➢ It reduces the output of the Maps executed on the same node
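The "Combiner as local Reducer" idea can be sketched in plain Java, with no Hadoop involved (the class name and the two-node split below are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    public static void main(String[] args) {
        // Two "nodes", each holding one chunk of the color file.
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("Blue", "Green", "Blue"),
            Arrays.asList("Green", "Blue", "Green"));

        // Combiner: local reduce on each node's map output.
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            Map<String, Integer> local = new TreeMap<>();
            for (String color : chunk) local.merge(color, 1, Integer::sum);
            partials.add(local); // only these small maps cross the network
        }

        // Reducer: merge the per-node partial counts.
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((k, v) -> total.merge(k, v, Integer::sum));
        System.out.println(total);
    }
}
```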
MapReduce Summary
What Hadoop is not..
❑ Not for interactive file access
❑ Not meant for a large number of small files, but for a small number of large files
❑ MapReduce is not suitable for every application
Hadoop: Take Home
❑ Takes computation to data
❑ Suitable for large data centric operations
❑ Scalable on demand
❑ Fault tolerant and highly transparent
Questions?
❑ Coming up next …
➢ First Hadoop program
➢ Second Hadoop program
Your first program in hadoop
Open up any tutorial on Hadoop and the first program you see will be WordCount ☺
Task:
➢ Given a text file, generate a list of words with the number of times each of them appears in the file
Input:
➢ Plain text file
Expected output:
➢ <word, frequency> pairs for all words in the file

Example input:
hadoop is a framework written in java
hadoop supports parallel processing and is a simple framework

Example output:
<hadoop, 2> <is, 2> <a, 2> <java, 1> <framework, 2> <written, 1> <in, 1> <and, 1> <supports, 1> <parallel, 1> <processing, 1> <simple, 1>
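Before writing the Hadoop version, the map-then-reduce logic can be checked locally in plain Java (the class name WordCountLocal is ours; a single in-memory map stands in for the shuffle):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountLocal {
    public static void main(String[] args) {
        String text = "hadoop is a framework written in java\n"
                    + "hadoop supports parallel processing and is a simple framework";

        // "Map" phase: emit (word, 1) for every word;
        // "Reduce" phase: sum the 1s per word. Here both are
        // collapsed into a single in-memory count.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```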
Your second program in hadoop
Task:
➢ Given a text file containing numbers, one per line, compute the sum of squares of the odd, even and prime numbers
Input:
➢ File containing integers, one per line
Expected output:
➢ <type, sum of squares> for odd, even and prime
1253563794
<odd, 302> <even, 278> <prime, 323>
Your second program in hadoop

File on HDFS (one number per line): 3, 9, 6, 2, 3, 7, 8

Map: square each input value; output is (key, value) pairs
3 → (odd, 9) (prime, 9)
9 → (odd, 81)
6 → (even, 36)
2 → (even, 4) (prime, 4)
3 → (odd, 9) (prime, 9)
7 → (odd, 49) (prime, 49)
8 → (even, 64)

Reducer: sum; input is (key, list of values), output is (key, value)
odd: <9, 81, 9, 49> → <odd, 148>
even: <36, 4, 64> → <even, 104>
prime: <9, 4, 9, 49> → <prime, 71>
Your second program in hadoop

Map (invoked on a record):
void map(int x) {
    int sq = x * x;
    if (odd(x))   print("odd", sq);
    if (even(x))  print("even", sq);
    if (prime(x)) print("prime", sq);
}

Reduce (invoked on a key):
void reduce(Key key, List l) {
    int sum = 0;
    for (y in l) {
        sum += y;
    }
    print(key, sum);
}

Library functions:
boolean odd(int x) { … }
boolean even(int x) { … }
boolean prime(int x) { … }
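The pseudocode above can be simulated end-to-end in plain Java, with an in-memory map standing in for the shuffle (the class name is ours; note that 7 is prime, so its square contributes to the prime sum):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OddEvenPrimeLocal {
    static boolean prime(int x) {
        if (x < 2) return false;
        for (int d = 2; d * d <= x; d++) if (x % d == 0) return false;
        return true;
    }

    public static void main(String[] args) {
        int[] input = {3, 9, 6, 2, 3, 7, 8};

        // Map phase: emit (key, square) pairs, grouped by key
        // as the shuffle would do.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int x : input) {
            int sq = x * x;
            if (x % 2 == 1) grouped.computeIfAbsent("odd", k -> new ArrayList<>()).add(sq);
            if (x % 2 == 0) grouped.computeIfAbsent("even", k -> new ArrayList<>()).add(sq);
            if (prime(x))   grouped.computeIfAbsent("prime", k -> new ArrayList<>()).add(sq);
        }

        // Reduce phase: sum the list of values for each key.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, v) -> result.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        System.out.println(result);
    }
}
```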
Your second program in hadoop: the driver

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "OddEvenPrime");

    job.setJarByClass(OddEvenPrime.class);
    job.setMapperClass(OddEvenPrimeMapper.class);
    job.setCombinerClass(OddEvenPrimeReducer.class);
    job.setReducerClass(OddEvenPrimeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(TextInputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Questions?
❑ Coming up next …
➢ More examples
➢ Hadoop Ecosystem
Example: Counting Fans
Example: Counting Fans
❑ Problem: give crowd statistics
➢ Count the fans supporting India and Pakistan
[Figure: stadium crowd; counts 45882 and 67917]
❑ Traditional Way
➢ Central processing
➢ Every fan comes to the centre and presses the India/Pak button
❑ Issues
➢ Slow; bottlenecks
➢ Only one processor
➢ Processing time is determined by the speed at which people (the data) can move
Example: Counting Fans
Hadoop Way
❑ Appoint processors per block (MAPPER)
❑ Send them to each block and ask them to send a signal for each person
❑ A central processor aggregates the results (REDUCER)
❑ The block processors can be made smarter by aggregating locally and sending only the aggregated value (COMBINER)
Example: Counting Fans
[Figure: per-block counts (7320, 8112, 11289, 10676, 12225, …) aggregated locally by Combiners, then summed by the Reducer]
Hadoop Ecosystem: Basics
Hadoop Distributions
Who is using Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
References: For understanding Hadoop
❑ Official Hadoop website - http://hadoop.apache.org/
❑ Hadoop presentation wiki - http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile
❑ http://developer.yahoo.com/hadoop/
❑ http://wiki.apache.org/hadoop/
❑ http://www.cloudera.com/hadoop-training/
❑ http://developer.yahoo.com/hadoop/tutorial/module2.html#basics
References: Setup and Installation
❑ Installing on Ubuntu
➢ http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
➢ http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
❑ Installing on Debian
➢ http://archive.cloudera.com/docs/_apt.html
Further Reading
❑ Hadoop: The Definitive Guide - Tom White
❑ http://developer.yahoo.com/hadoop/tutorial/
❑ http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
Questions?
Acknowledgements
Surabhi Pendse
Sayali Kulkarni
Parth Shah