Lecture 4: DFS & MapReduce I · aidanhogan.com/teaching/cc5212-1-2016/MDP2016-04.pdf · 2016. 3. 30. · ...
CC5212‐1 PROCESAMIENTO MASIVO DE DATOS (Massive Data Processing), OTOÑO (Autumn) 2016
Lecture 4: DFS & MapReduce I
Aidan Hogan
Fundamentals of Distributed Systems
MASSIVE DATA PROCESSING (THE GOOGLE WAY …)
Inside Google circa 1997/98
Inside Google circa 2015
Google’s Cooling System
Google’s Recycling Initiative
Google Architecture (ca. 1998)
Information Retrieval
• Crawling
• Inverted indexing
• Word‐counts
• Link‐counts
• greps/sorts
• PageRank
• Updates…
Google Engineering
• Massive amounts of data
• Each task needs communication protocols
• Each task needs fault tolerance
• Multiple tasks running concurrently
Ad hoc solutions would repeat the same code
Google Engineering
• Google File System
– Store data across multiple machines
– Transparent Distributed File System
– Replication / Self‐healing
• MapReduce
– Programming abstraction for distributed tasks
– Handles fault tolerance
– Supports many “simple” distributed tasks!
• BigTable, Pregel, Percolator, Dremel …
Google Re‐Engineering
Google File System (GFS)
MapReduce
BigTable
GOOGLE FILE SYSTEM (GFS)
What is a File‐System?
• Breaks files into chunks (or clusters)
• Remembers the sequence of clusters
• Records directory/file structure
• Tracks file meta‐data
– File size
– Last access
– Permissions
– Locks
What is a Distributed File‐System
• Same thing, but distributed
• Transparency: Like a normal file‐system
• Flexibility: Can mount new machines
• Reliability: Has replication
• Performance: Fast storage/retrieval
• Scalability: Can store a lot of data / support a lot of machines
What would transparency / flexibility / reliability / performance / scalability mean for a distributed file system?
Google File System (GFS)
• Files are huge
• Files often read or appended
– Writes in the middle of a file not (really) supported
• Concurrency important
• Failures are frequent
• Streaming important
GFS: Pipelined Writes
[Figure: a Master and five chunk‐servers (slaves) A1, B1, C1, D1, E1; each chunk of blue.txt (150 MB: 3 chunks) and orange.txt (100 MB: 2 chunks) is stored on three chunk‐servers]
• 64 MB per chunk
• 64‐bit label for each chunk
• Assume replication factor of 3
File System (In‐Memory)
/blue.txt [3 chunks]
1: {A1, C1, E1}
2: {A1, B1, D1}
3: {B1, D1, E1}
/orange.txt [2 chunks]
1: {B1, D1, E1}
2: {A1, C1, E1}
GFS: Pipelined Writes (In Words)
1. Client asks Master to write a file2. Master returns a primary chunkserver and
secondary chunkservers3. Client writes to primary chunkserver and tells it
the secondary chunkservers4. Primary chunkserver passes data onto
secondary chunkserver, which passes on …5. When finished, message comes back through
the pipeline that all chunkservers have written– Otherwise client sends again
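The pipelined write can be sketched as a small simulation. This is only an illustration in plain Python, not GFS code: the function names `chunks_needed` and `pipelined_write`, the `storage` dictionary standing in for the chunkservers' disks, and the `down` set of failed servers are all assumptions made for the example. The 64 MB chunk size comes from the slides.

```python
import math

CHUNK_MB = 64  # chunk size stated on the slides


def chunks_needed(size_mb):
    """Number of 64 MB chunks a file of the given size occupies."""
    return math.ceil(size_mb / CHUNK_MB)


def pipelined_write(chunk, pipeline, storage, down=frozenset()):
    """Pass `chunk` along the pipeline (primary first, then secondaries).

    Returns True only if every chunkserver in the pipeline stored the chunk,
    mimicking the acknowledgement that travels back through the pipeline.
    """
    if not pipeline:
        return True  # end of the pipeline: the acknowledgement starts here
    head, rest = pipeline[0], pipeline[1:]
    if head in down:
        return False  # a server failed: the client would have to send again
    storage.setdefault(head, []).append(chunk)
    return pipelined_write(chunk, rest, storage, down)


storage = {}
ok = pipelined_write("blue.txt#1", ["A1", "C1", "E1"], storage)
failed = pipelined_write("blue.txt#2", ["A1", "B1", "D1"], {}, down={"B1"})
```

Here `ok` is true because all three replicas were written, while `failed` is false because the hypothetical server `B1` was marked as down, which is the case where the client sends again.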
GFS: Fault Tolerance
[Figure: chunk‐server C1 has failed; the Master and the remaining chunk‐servers (slaves) A1, B1, D1, E1 hold blue.txt (150 MB: 3 chunks) and orange.txt (100 MB: 2 chunks), and the chunks that C1 held are copied to other chunk‐servers]
• 64 MB per chunk
• 64‐bit label for each chunk
• Assume replication factor of 3
File System (In‐Memory)
/blue.txt [3 chunks]
1: {A1, B1, E1}
2: {A1, B1, D1}
3: {B1, D1, E1}
/orange.txt [2 chunks]
1: {B1, D1, E1}
2: {A1, D1, E1}
GFS: Fault Tolerance (In Words)
• Master sends regular “Heartbeat” pings
• If a chunkserver doesn’t respond
1. Master finds out what chunks it had
2. Master assigns new chunkserver for each chunk
3. Master tells new chunkserver to copy from a specific existing chunkserver
• Chunks are prioritised by number of remaining replicas, then by demand
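The recovery steps can be sketched as follows. This is a simplified illustration, not GFS's actual algorithm: the names `replica_load` and `handle_dead_server` are invented here, the target is assumed to be the least-loaded live server that does not already hold the chunk, and the "then by demand" tie-break from the slide is left out. The chunk map is the one shown in the in-memory file system of the slides before C1 fails.

```python
def replica_load(server, chunk_map):
    """Number of chunk replicas currently assigned to a server."""
    return sum(1 for replicas in chunk_map.values() if server in replicas)


def handle_dead_server(dead, chunk_map, servers):
    """Re-replicate every chunk held by `dead`; returns (chunk, source, target) orders."""
    alive = [s for s in servers if s != dead]
    # 1. Find out which chunks the dead server had.
    lost = [c for c, replicas in chunk_map.items() if dead in replicas]
    for c in lost:
        chunk_map[c].discard(dead)
    # Chunks with the fewest remaining replicas are handled first.
    lost.sort(key=lambda c: len(chunk_map[c]))
    copy_orders = []
    for c in lost:
        # 3. Copy from a specific surviving replica (here: the first by name).
        source = min(chunk_map[c])
        # 2. Assign a new server (assumption: the least loaded one).
        candidates = [s for s in alive if s not in chunk_map[c]]
        target = min(candidates, key=lambda s: replica_load(s, chunk_map))
        chunk_map[c].add(target)
        copy_orders.append((c, source, target))
    return copy_orders


chunk_map = {
    ("/blue.txt", 1): {"A1", "C1", "E1"},
    ("/blue.txt", 2): {"A1", "B1", "D1"},
    ("/blue.txt", 3): {"B1", "D1", "E1"},
    ("/orange.txt", 1): {"B1", "D1", "E1"},
    ("/orange.txt", 2): {"A1", "C1", "E1"},
}
orders = handle_dead_server("C1", chunk_map, ["A1", "B1", "C1", "D1", "E1"])
```

With this particular load rule the sketch happens to reproduce the mapping shown after the failure: chunk 1 of /blue.txt moves to B1 and chunk 2 of /orange.txt moves to D1. A real master would also need to handle the case where no candidate server exists.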
GFS: Direct Reads
[Figure: a Client tells the Master “I’m looking for /blue.txt” and then reads chunks 1, 2 and 3 directly from the chunk‐servers (slaves) A1, B1, C1, D1, E1]
File System (In‐Memory)
/blue.txt [3 chunks]
1: {A1, C1, E1}
2: {A1, B1, D1}
3: {B1, D1, E1}
/orange.txt [2 chunks]
1: {B1, D1, E1}
2: {A1, C1, E1}
GFS: Direct Reads (In Words)
1. Client asks Master for file
2. Master returns location of a chunk
– Returns a ranked list of replicas
3. Client reads chunk directly from chunkserver
4. Client asks Master for next chunk
Software makes this transparent for the client!
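A direct read can be sketched like this. It is a toy model under stated assumptions: `master_index` (file path to a list of ranked replica lists, one per chunk) and `chunkservers` (server name to stored chunks) are hypothetical data structures, and a server missing from `chunkservers` stands for one that is unreachable.

```python
def read_file(path, master_index, chunkservers):
    """Read all chunks of `path`; the master only supplies locations, never data."""
    chunks = []
    for ranked_replicas in master_index[path]:  # one ranked list per chunk, in order
        chunk_no = len(chunks) + 1
        for server in ranked_replicas:  # try replicas in the master's ranked order
            stored = chunkservers.get(server, {})
            if (path, chunk_no) in stored:
                chunks.append(stored[(path, chunk_no)])  # data comes from the chunkserver
                break
        else:
            raise IOError(f"no reachable replica for chunk {chunk_no} of {path}")
    return b"".join(chunks)


master_index = {"/blue.txt": [["A1", "C1", "E1"], ["A1", "B1", "D1"], ["B1", "D1", "E1"]]}
chunkservers = {
    # A1 is deliberately absent to imitate an unreachable first-choice replica.
    "C1": {("/blue.txt", 1): b"one-"},
    "B1": {("/blue.txt", 2): b"two-", ("/blue.txt", 3): b"three"},
}
content = read_file("/blue.txt", master_index, chunkservers)
```

The design point the sketch tries to show is that file data never passes through the master, which keeps master traffic small.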
GFS: Modification Consistency
Master assigns a lease to one replica: a “primary replica”
Client wants to change a file:
1. Client asks Master for the replicas (incl. primary)
2. Master returns replica info to the client
3. Client sends change data
4. Client asks primary to execute the changes
5. Primary asks secondaries to change
6. Secondaries acknowledge to primary
7. Primary acknowledges to client
Data & Control Decoupled
GFS: Rack Awareness
GFS: Rack Awareness
[Figure: Rack A, Rack B and Rack C, each with its own rack switch, connected to two core switches]
GFS: Rack Awareness
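[Figure: Rack A (nodes A1 to A5) and Rack B (nodes B1 to B5), each behind a rack switch, joined by a core switch; the nodes holding replicas of chunks 1 and 2 of /orange.txt are marked]
Files:
/orange.txt
1: {A1, A4, B3}
2: {A5, B1, B5}
Racks:
A: {A1, A2, A3, A4, A5}
B: {B1, B2, B3, B4, B5}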
GFS: Rack Awareness (In Words)
• Make sure replicas not on same rack
– In case rack switch fails!
• But communication can be slower:
– Within rack: pass one switch (rack switch)
– Across racks: pass three switches (two racks and a core)
• (Typically) pick two racks with low traffic
– Two nodes in same rack, one node in another rack
• (Assuming 3x replication)
• Only necessary if more than one rack!
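A rack-aware placement along these lines might look as follows. This is a minimal sketch assuming 3x replication: the function names, the `traffic` scores per rack, and the choice of simply taking the first nodes listed in each rack are all assumptions for illustration.

```python
def switch_hops(node_a, node_b, rack_of):
    """Switches passed between two different nodes: 1 within a rack, 3 across racks."""
    return 1 if rack_of[node_a] == rack_of[node_b] else 3


def place_replicas(racks, traffic):
    """Pick three nodes: two in the quietest rack and one in the next quietest."""
    by_traffic = sorted(racks, key=lambda r: traffic[r])
    if len(by_traffic) == 1:
        # Only one rack: rack awareness is not needed.
        return racks[by_traffic[0]][:3]
    first, second = by_traffic[:2]
    return racks[first][:2] + racks[second][:1]


racks = {
    "A": ["A1", "A2", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
}
rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
placement = place_replicas(racks, traffic={"A": 1, "B": 5})
```

With these made-up traffic scores the placement is two nodes on rack A and one on rack B, so a failure of either rack switch still leaves at least one reachable replica.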
GFS: Other Operations
Rebalancing: Spread storage out evenly
Deletion:
• Just rename the file with hidden file name
– To recover, rename back to original version
– Otherwise, three days later will be wiped
Monitoring Stale Replicas: Dead slave reappears with old data: master keeps version info and will recycle old chunks
GFS: Weaknesses?
• Master node single point of failure
– Use hardware replication
– Logs and checkpoints!
• Master node is a bottleneck
– Use more powerful machine
– Minimise master node traffic
• Master‐node metadata kept in memory
– Each chunk needs 64 bytes
– Chunk data can be queried from each slave
What do you see as the core weaknesses of the Google File System?
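To get a feel for the in-memory metadata cost, the two figures from the slides (64 MB chunks, 64 bytes of metadata per chunk) can be combined in a back-of-the-envelope calculation. The sketch below assumes one metadata record per chunk (not per replica), takes MB to mean 2^20 bytes, and ignores per-file namespace data; the function name is made up for the example.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB per chunk (slides), taken as 64 * 2^20 bytes
METADATA_PER_CHUNK = 64   # 64 bytes of master metadata per chunk (slides)


def master_metadata_bytes(stored_bytes):
    """Approximate master memory needed to track `stored_bytes` of file data."""
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pib = 2**50
needed = master_metadata_bytes(one_pib)  # 2^24 chunks * 64 bytes = 2^30 bytes
```

Under these assumptions, tracking 2^50 bytes of file data takes about 2^30 bytes (1 GiB) of master memory, which suggests why keeping the metadata in memory is workable but still bounds how much a single master can index.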
GFS: White‐Paper
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Google Re‐Engineering
Google File System (GFS)
HDFS
• HDFS‐to‐GFS
– Data‐node = Chunkserver/Slave
– Name‐node = Master
• HDFS does not support modifications
• Otherwise pretty much the same except …
– GFS is proprietary (hidden in Google)
– HDFS is open source (Apache!)
HDFS Interfaces
HDFS Interfaces
GOOGLE’S MAP‐REDUCE
Google Re‐Engineering
Google File System (GFS)
MapReduce
MapReduce in Google
• Divide & Conquer:
1. Word count
2. Total searches per user
3. PageRank
4. Inverted‐indexing
5. …
How could we do a distributed top‐k word count?
MapReduce: Word Count
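[Figure: word‐count data flow over machines A1, B1, C1: Input (Distr. File Sys., chunks 1, 2, 3) → Map() → (Partition/Shuffle) (Distr. Sort) → Reduce() → Output]
Partitions by first letter: a‐k, l‐q, r‐z
a‐k: a 10,023; aa 6,234; …
l‐q: lab 8,123; label 983; …
r‐z: rag 543; rat 1,233; …
Better partitioning method?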
MapReduce (in more detail)
1. Input: Read from the cluster (e.g., a DFS)
– Chunks raw data for mappers
– Maps raw data to initial (key_in, value_in) pairs
2. Map: For each (key_in, value_in) pair, generate zero‐to‐many (key_map, value_map) pairs
– key_in / value_in can be diff. type to key_map / value_map
What might Input do in the word‐count case?
What might Map do in the word‐count case?
MapReduce (in more detail)
3. Partition: Assign sets of key_map values to reducer machines
4. Shuffle: Data are moved from mappers to reducers (e.g., using DFS)
5. Comparison/Sort: Each reducer sorts the data by key using a comparison function
– Sort is taken care of by the framework
How might Partition work in the word‐count case?
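Two possible answers can be sketched side by side: a range partitioner like the a‐k / l‐q / r‐z split in the word-count figure, and a hash partitioner that spreads keys more evenly. Both functions below are illustrative assumptions, not Hadoop's implementation; `zlib.crc32` is used only because Python's built-in `hash` of strings changes between runs.

```python
import zlib


def range_partition(word):
    """Range partitioning by first letter into three reducers: a-k, l-q, r-z."""
    first = word[0].lower()
    if first <= "k":
        return 0
    if first <= "q":
        return 1
    return 2


def hash_partition(word, num_reducers):
    """Hash partitioning: the same key always lands on the same reducer."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


assignments = {w: range_partition(w) for w in ["aa", "lab", "label", "rag", "rat"]}
```

Range partitioning keeps the global output sorted across reducers but can be skewed if many words start with the same letters; hash partitioning usually balances load better at the cost of that global order.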
MapReduce
6. Reduce: Takes a bag of (key_map, value_map) pairs with the same key_map value, and produces zero‐to‐many outputs for each bag
– Typically zero‐or‐one outputs
7. Output: Merge‐sorts the results from the reducers / writes to stable storage
How might Reduce work in the word‐count case?
MapReduce: Word Count PseudoCode
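The slide's pseudocode did not survive extraction, so what follows is a stand-in rather than the original: a single-process Python sketch of word count in the MapReduce style. The driver `run_mapreduce` is an invented helper that imitates partition, shuffle and sort with an in-memory dictionary, and the line index is used in place of the file offset as the input key.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: emit (word, 1) for every word on the line; the input key is ignored."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum the bag of counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: group map output by key, sort the keys, then reduce each bag."""
    groups = defaultdict(list)
    for key_in, value_in in records:
        for key_map, value_map in map_fn(key_in, value_in):
            groups[key_map].append(value_map)  # stands in for partition + shuffle
    output = []
    for key_map in sorted(groups):  # stands in for the framework's sort
        output.extend(reduce_fn(key_map, groups[key_map]))
    return output


lines = ["the rat saw the lab", "the lab label"]
result = run_mapreduce(list(enumerate(lines)), map_fn, reduce_fn)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what a framework would provide, distributed over many machines.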
MapReduce: Scholar Example
MapReduce: Scholar Example
Assume that in Google Scholar we have inputs like:paperA citedBy paperB
How can we use MapReduce to count the total incoming citations per paper?
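One possible answer, sketched in plain Python: the map step emits the cited paper (the first token of each line) with a count of 1, and the reduce step sums the counts per paper. The function names are assumptions, and the grouping loop again stands in for the framework's shuffle and sort. The sample lines are the citation lines shown later in the deck.

```python
from collections import defaultdict


def map_citation(line):
    """Map: 'paperA citedBy paperB' -> (paperA, 1), since paperA receives the citation."""
    cited, relation, citing = line.split()
    if relation == "citedBy":
        yield cited, 1


def reduce_citations(paper, ones):
    """Reduce: total incoming citations for one paper."""
    yield paper, sum(ones)


def count_citations(lines):
    """Toy driver imitating the shuffle/sort between map and reduce."""
    groups = defaultdict(list)
    for line in lines:
        for paper, one in map_citation(line):
            groups[paper].append(one)
    return dict(
        pair for paper in sorted(groups) for pair in reduce_citations(paper, groups[paper])
    )


citations = count_citations([
    "paperA citedBy paperB",
    "paperA citedBy paperD",
    "paperC citedBy paperD",
    "paperD citedBy paperB",
])
```

Structurally this is the word-count job again, with the cited paper playing the role of the word.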
MapReduce as a Dist. Sys.
• Transparency: Abstracts physical machines
• Flexibility: Can mount new machines; can run a variety of types of jobs
• Reliability: Tasks are monitored by a master node using a heart‐beat; dead jobs restart
• Performance: Depends on the application code but exploits parallelism!
• Scalability: Depends on the application code but can serve as the basis for massive data processing!
MapReduce: Benefits for Programmers
• Takes care of low‐level implementation:
– Easy to handle inputs and output
– No need to handle network communication
– No need to write sorts or joins
• Abstracts machines (transparency)
– Fault tolerance (through heart‐beats)
– Abstracts physical locations
– Add / remove machines
– Load balancing
MapReduce: Benefits for Programmers
Time for more important things …
HADOOP OVERVIEW
Hadoop Architecture
Client
NameNode JobTracker
DataNode 1
DataNode 2
…
DataNode n
JobNode 1
JobNode 2
…
JobNode n
HDFS: Traditional / SPOF
1. NameNode appends edits to log file
2. SecondaryNameNode copies log file and image, makes checkpoint, copies image back
3. NameNode loads image on start‐up and makes remaining edits
SecondaryNameNode is not a backup NameNode
NameNode
DataNode 1
…
DataNode n
SecondaryNameNode
copy dfs/blue.txt
rm dfs/orange.txt
rmdir dfs/
mkdir new/
mv new/ dfs/
fsimage
What is the secondary name‐node?
• Name‐node quickly logs all file‐system actions in a sequential (but messy) way
• Secondary name‐node keeps the main fsimage file up‐to‐date based on logs
• When the primary name‐node boots back up, it loads the fsimage file and applies the remaining log to it
• Hence secondary name‐node helps make boot‐ups faster, helps keep file system image up‐to‐date and takes load away from primary
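The interplay between the edit log and the fsimage can be sketched with a toy model. Everything here is a simplification assumed for illustration: the image is just a set of file paths, the log holds `("add", path)` and `("rm", path)` tuples rather than real HDFS edit records, and `checkpoint` and `restart` are invented names.

```python
def apply_edits(image, edits):
    """Replay logged edits on top of an image and return the resulting image."""
    files = set(image)
    for op, path in edits:
        if op == "add":
            files.add(path)
        elif op == "rm":
            files.discard(path)
    return files


def checkpoint(image, edit_log):
    """Secondary name-node's job: fold the log into the image, leaving an empty log."""
    return apply_edits(image, edit_log), []


def restart(image, remaining_log):
    """Primary name-node on boot: load the image, then apply whatever log is left."""
    return apply_edits(image, remaining_log)


image = {"dfs/orange.txt"}
log = [("add", "dfs/blue.txt"), ("rm", "dfs/orange.txt"), ("add", "new/notes.txt")]

# Checkpoint after the first two edits; one more edit arrives before the restart.
new_image, _ = checkpoint(image, log[:2])
state_after_boot = restart(new_image, log[2:])
```

Booting from the checkpointed image plus the short remaining log gives the same state as replaying the whole log from the old image, but with less work at start-up, which is the benefit the slide describes.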
Hadoop: High Availability
[Figure: an Active NameNode and a Standby NameNode, each with a JournalManager and fs edits, connected to JournalNode 1 … JournalNode n; the figure also shows an fsimage and arrows numbered 1 and 2]
PROGRAMMING WITH HADOOP
1. Input/Output (cmd)
> hdfs dfs
1. Input/Output (Java)
Creates a file system for default configuration
Check if the file exists; if so delete
Create file and write a message
Open and read back
1. Input (Java)
2. Map
Mapper<InputKeyType,InputValueType,MapKeyType,MapValueType>
(input) key: file offset.
(input) value: line of the file.
context: handles output and logging.
Emit output
(Writable for values)
Same order
(not needed in the running example)
(WritableComparable for keys/values)
Needed for default partition function
Needed to sort keys
New Interface
Same as before
(not needed in the running example)
3. Partition
PartitionerInterface
(This happens to be the default partition method!)
(not needed in the running example)
4. Shuffle
5. Sort/Comparison
Methods in WritableComparator
(not needed in the running example)
6. Reduce
Reducer<MapKey,MapValue,OutputKey,OutputValue>
key: as emitted from map
values: iterator over all values for that key
context for output
Write to output
7. Output / Input (Java)
Creates a file system for default configuration
Check if the file exists; if so delete
Create file and write a message
Open and read back
7. Output (Java)
Control Flow
Create a JobClient, a JobConf and pass it the main class
Set the type of map and output keys and values in the configuration
Set input and output paths
Set the mapper class
Set the reducer class (and optionally “combiner”)
Run and wait for job to complete.
More in Hadoop: Combiner
• Map‐side “mini‐reduction”
• Keeps a fixed‐size buffer in memory
• Reduce within that buffer
– e.g., count words in buffer
– Lessens bandwidth needs
• In Hadoop: can simply use Reducer class
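The effect of a combiner can be sketched for the word-count case. This is an illustration only: `map_words` and `combine` are invented names, and instead of a fixed-size buffer the sketch combines the whole output of one map call, which is a simplification.

```python
from collections import Counter


def map_words(line):
    """Map output without a combiner: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Map-side mini-reduction: sum counts per word before anything is shuffled."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


pairs = map_words("the rat the lab the")
combined = combine(pairs)
```

Here five pairs shrink to three before they would cross the network. Reusing the reducer as the combiner is safe for word count because summation is associative and commutative, so partial sums can themselves be summed by the reducer; a reduce function without that property could not be reused this way.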
More in Hadoop: Counters
Context has a group of maps of counters
More in Hadoop: Chaining Jobs
• Sometimes we need to chain jobs
• In Hadoop, can pass a set of Jobs to the client
• x.addDependingJob(y)
More in Hadoop: Distributed Cache
• Some tasks need “global knowledge”
– For example, a white‐list of conference venues and journals that should be considered in the citation count
– Typically small
• Use a distributed cache:
– Makes data available locally to all nodes
IF WE HAVE TIME …
MapReduce: Scholar Example
Assume that in Google Scholar we have two inputs like:
paperA citedBy paperB
paperA citedBy paperD
paperC citedBy paperD
paperD citedBy paperB
paperA author1 author2 author3
paperA author2 author1
paperC author4 author2 author5
paperD author6
How can we use MapReduce to count the total incoming citations per author?
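One way to approach this is to chain two jobs: the first joins both inputs on the paper and emits each author with that paper's citation count, and the second sums per author. The sketch below is an assumption-laden illustration in plain Python, not the lecture's solution: the function names and the `"CITE"`/`"AUTHOR"` tags are invented, and the sample input uses one author line per paper (the second paperA line from the slide is left out to keep the example unambiguous).

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for shuffle/sort: collect all values emitted for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def job1_map(line):
    """Job 1 map: key both kinds of input line by paper, tagging the value."""
    tokens = line.split()
    paper = tokens[0]
    if len(tokens) == 3 and tokens[1] == "citedBy":
        yield paper, ("CITE", 1)
    else:
        for author in tokens[1:]:
            yield paper, ("AUTHOR", author)


def job1_reduce(paper, values):
    """Job 1 reduce: join on paper, emit (author, citations of this paper)."""
    cites = sum(v for tag, v in values if tag == "CITE")
    for tag, v in values:
        if tag == "AUTHOR":
            yield v, cites


def job2_reduce(author, counts):
    """Job 2 reduce: total citations over all of the author's papers."""
    yield author, sum(counts)


def citations_per_author(lines):
    mapped = (pair for line in lines for pair in job1_map(line))
    stage1 = [out for paper, vals in group_by_key(mapped).items()
              for out in job1_reduce(paper, vals)]
    return dict(out for author, counts in sorted(group_by_key(stage1).items())
                for out in job2_reduce(author, counts))


totals = citations_per_author([
    "paperA citedBy paperB", "paperA citedBy paperD",
    "paperC citedBy paperD", "paperD citedBy paperB",
    "paperA author1 author2 author3",
    "paperC author4 author2 author5",
    "paperD author6",
])
```

In this sample, author2 wrote paperA (2 citations) and paperC (1 citation), giving 3 in total. The second job's map step is the identity, so it is omitted from the sketch.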
RECAP
Distributed File Systems
• Google File System (GFS)
– Master and Chunkslaves
– Replicated pipelined writes
– Direct reads
– Minimising master traffic
– Fault‐tolerance: self‐healing
– Rack awareness
– Consistency and modifications
• Hadoop Distributed File System
– NameNode and DataNodes
MapReduce
1. Input
2. Map
3. Partition
4. Shuffle
5. Comparison/Sort
6. Reduce
7. Output
MapReduce/GFS Revision
• GFS: distributed file system
– Implemented as HDFS
• MapReduce: distributed processing framework
– Implemented as Hadoop
Hadoop
• FileSystem
• Mapper<InputKey,InputValue,MapKey,MapValue>
• OutputCollector<OutputKey,OutputValue>
• Writable, WritableComparable<Key>
• Partitioner<KeyType,ValueType>
• Reducer<MapKey,MapValue,OutputKey,OutputValue>
• JobClient/JobConf
…