![Page 1: Advanced topics on Mapreduce with Hadoop Jiaheng Lu Department of Computer Science Renmin University of China](https://reader036.vdocument.in/reader036/viewer/2022062511/551a7811550346761a8b4c4c/html5/thumbnails/1.jpg)
Advanced topics on Mapreduce with Hadoop
Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net
Outline
Brief Review
Chaining MapReduce Jobs
Join in MapReduce
Bloom Filter
Brief Review
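A parallel programming framework
Divide and merge
[Figure: the input data is divided into splits (split0, split1, split2); each split is processed by a map task (the mappers); the shuffle routes map output to the reduce tasks (the reducers), which write the output data (output0, output1).]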
Chaining MapReduce jobs
Chaining in a sequence
Chaining with complex dependency
Chaining preprocessing and postprocessing steps
Chaining in a sequence
Simple and straightforward
[MAP | REDUCE]+; MAP+ | REDUCE | MAP*
The output of one job is the input to the next
Similar to Unix pipes
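// Configure the job with the old org.apache.hadoop.mapred API.
// in and out are assumed to be Path objects defined elsewhere.
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
// Add the first mapper to the chain: input <LongWritable, Text>, output <Text, Text>.
// The boolean argument (true) passes key/value pairs to the next step by value.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, LongWritable.class, Text.class, Text.class, Text.class, true, map1Conf);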
Chaining with complex dependency
Jobs are not chained in a linear fashion
Use the addDependingJob() method to add dependency information:
x.addDependingJob(y): job x will not start until job y has finished
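The ordering that such dependency information implies can be sketched in plain Java. This is only a simulation under stated assumptions: `SimpleJobGraph`, its methods, and the job names in the usage note are hypothetical and are not Hadoop classes; the sketch only shows how "x waits for y" turns into an execution order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java simulation of dependency-ordered job execution.
// SimpleJobGraph is a hypothetical class, not part of Hadoop.
class SimpleJobGraph {
    // job name -> names of the jobs it must wait for
    private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name) {
        dependencies.putIfAbsent(name, new LinkedHashSet<>());
    }

    // Mirrors the meaning of x.addDependingJob(y): x will not start until y has finished.
    void addDependingJob(String x, String y) {
        addJob(x);
        addJob(y);
        dependencies.get(x).add(y);
    }

    // Returns one valid execution order; throws if the dependencies form a cycle.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> finished = new LinkedHashSet<>();
        while (finished.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> e : dependencies.entrySet()) {
                // A job may run once every job it depends on has finished.
                if (!finished.contains(e.getKey()) && finished.containsAll(e.getValue())) {
                    finished.add(e.getKey());
                    order.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic dependency");
            }
        }
        return order;
    }
}
```

For instance, if a hypothetical "join" job depends on "cleanLogs" and "cleanUsers", and "report" depends on "join", the two cleaning jobs come first, then "join", then "report".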
Chaining preprocessing and postprocessing steps
Example: removing stop words in information retrieval (IR)
Approaches:
  Separate jobs: inefficient
  Chaining those steps into a single job
Use ChainMapper.addMapper() and ChainReducer.setReducer()
Map+ | Reduce | Map*
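The Map+ | Reduce | Map* shape can be illustrated with a plain-Java simulation. Everything here is an assumption made for illustration: `ChainSketch`, its method names, the three stop words, and the `minCount` postprocessing step are hypothetical, and no Hadoop class is involved. The point is only that two map steps, one reduce step, and one trailing map step run as a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of a Map+ | Reduce | Map* chain; none of this is Hadoop API.
class ChainSketch {
    // Illustrative stop-word list (an assumption, not from the slides).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "of"));

    // Map 1: split a line into lowercase tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Map 2 (preprocessing): drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) kept.add(t);
        }
        return kept;
    }

    // Reduce: count occurrences of each remaining word.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // Map 3 (postprocessing): keep only words that occur at least minCount times.
    static Map<String, Integer> keepFrequent(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    // The whole chain runs as one job, with no intermediate output between steps.
    static Map<String, Integer> run(List<String> lines, int minCount) {
        List<String> words = new ArrayList<>();
        for (String line : lines) words.addAll(removeStopWords(tokenize(line)));
        return keepFrequent(count(words), minCount);
    }
}
```

Running the steps as separate jobs would write and re-read intermediate output between each of them, which is the inefficiency the slide refers to.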
Join in MapReduce
Reduce-side join
Broadcast join
Map-side filtering and Reduce-side join
  A given key
  A range from dataset (broadcast)
  A Bloom filter
Reduce-side join
Map
  output <key, value>
  key: the join key; value: the record tagged with its data source
Reduce
  do a full cross-product of the values from the two sources
  output the combined results
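Example
table x (columns a, b): (1, ab), (1, cd), (4, ef)
table y (columns a, c): (1, b), (2, d), (4, c)
map() output for table x (key = join key, value = tagged record): 1 -> x ab; 1 -> x cd; 4 -> x ef
map() output for table y: 1 -> y b; 2 -> y d; 4 -> y c
shuffle() (key -> value list): 1 -> [x ab, x cd, y b]; 2 -> [y d]; 4 -> [x ef, y c]
reduce() output (columns a, b, c): (1, ab, b), (1, cd, b), (4, ef, c)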
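The reduce-side join example above can be replayed with a small plain-Java simulation. It is a sketch, not Hadoop code: `ReduceSideJoinSketch` and its methods are hypothetical names, a sorted in-memory map stands in for the shuffle, and each row is assumed to be a two-element array of join key and value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {
    // Map phase: emit (join key, "tag value"), where the tag names the source table.
    static void map(String tag, String[][] table, Map<String, List<String>> shuffled) {
        for (String[] row : table) {
            // Grouping by key in a shared map stands in for the shuffle.
            shuffled.computeIfAbsent(row[0], k -> new ArrayList<>()).add(tag + " " + row[1]);
        }
    }

    // Reduce phase: cross product of the x-values and y-values that share a key.
    static List<String> reduce(Map<String, List<String>> shuffled) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            List<String> xs = new ArrayList<>();
            List<String> ys = new ArrayList<>();
            for (String tagged : e.getValue()) {
                String value = tagged.substring(2); // strip the "x " or "y " tag
                if (tagged.startsWith("x ")) xs.add(value); else ys.add(value);
            }
            for (String x : xs) {
                for (String y : ys) {
                    output.add(e.getKey() + " " + x + " " + y);
                }
            }
        }
        return output;
    }

    static List<String> join(String[][] x, String[][] y) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        map("x", x, shuffled);
        map("y", y, shuffled);
        return reduce(shuffled);
    }
}
```

With the two example tables, key 2 reaches the reducer with a value from table y only, so its cross product is empty and it produces no output row.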
Broadcast join (replicated join)
Broadcast the smaller table
Do the join in Map()
  Using the distributed cache
  DistributedCache.addCacheFile()
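A broadcast join amounts to a hash lookup inside the map step, which the following plain-Java sketch illustrates. The class and method names are hypothetical, the tables are passed as in-memory arrays of (join key, value) pairs, and the in-memory map stands in for the copy of the small table that each mapper would load from the distributed cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a broadcast (replicated) join; no Hadoop classes are used.
class BroadcastJoinSketch {
    // Load the smaller table into memory, indexed by join key.
    // In a real job, every mapper would build this from the broadcast file.
    static Map<String, List<String>> loadSmallTable(String[][] small) {
        Map<String, List<String>> byKey = new HashMap<>();
        for (String[] row : small) {
            byKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return byKey;
    }

    // Map-only join: each record of the larger table probes the in-memory table.
    static List<String> mapJoin(String[][] large, Map<String, List<String>> smallTable) {
        List<String> output = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = smallTable.getOrDefault(row[0], Collections.<String>emptyList());
            for (String match : matches) {
                output.add(row[0] + " " + row[1] + " " + match);
            }
        }
        return output;
    }
}
```

No shuffle and no reduce step are needed, which is the advantage of this approach when the smaller table fits in memory.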
Map-side filtering and Reduce-side join
Join key: student IDs from info
  generate an IDs file from info
  broadcast join
What if the IDs file can't be stored in memory?
  a Bloom filter
A Bloom Filter
Introduction
Implementation of Bloom filter
Use in MapReduce join
Introduction to Bloom Filter
A space-efficient data structure of constant size for testing set membership
Operations: add(), contains()
No false negatives and a small probability of false positives
Implementation of bloom filter
Allocate a bit array
Add elements
  generate k indexes
  set the k bits to 1
Test elements
  generate k indexes
  all k bits are 1 >> true, not all are 1 >> false
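The add and test steps can be sketched in a few lines of Java. This is a minimal illustration, not the implementation the slides refer to: `SimpleBloomFilter` is a hypothetical class, and deriving the k indexes from `String.hashCode()` by double hashing is an assumption chosen for brevity.

```java
import java.util.BitSet;

// A minimal Bloom filter sketch. The hashing scheme is an assumption made for
// illustration; production filters use stronger, independent hash functions.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size; // number of bits in the array
    private final int k;    // number of indexes generated per element

    SimpleBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Generate the i-th index from two base hashes (double hashing).
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // force a non-zero step between probes
        return Math.floorMod(h1 + i * h2, size);
    }

    // Add: set the k bits to 1.
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // Test: true only if all k bits are 1. A true result may be a false positive.
    boolean contains(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) return false;
        }
        return true;
    }
}
```

Because add() only ever sets bits, an element that was added always passes contains(), which is why there are no false negatives.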
Example (bit array of 10 positions, indexed 0 to 9)
① initial state: 0 0 0 0 0 0 0 0 0 0
② add x(0,2,6): 1 0 1 0 0 0 1 0 0 0
③ add y(0,3,9): 1 0 1 1 0 0 1 0 0 1
④ contain m(1,3,9): 1 0 1 1 0 0 1 0 0 1, bit 1 is 0 >> × (false)
⑤ contain n(0,2,9): 1 0 1 1 0 0 1 0 0 1, all three bits are 1 >> √ (a false positive)
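The five states of this example can be replayed directly, taking the index triples from the slide as given instead of computing them with hash functions. `BloomExampleSketch` and its method names are hypothetical.

```java
import java.util.BitSet;

// Replays the 10-bit example; the indexes are supplied by hand rather than hashed.
class BloomExampleSketch {
    private final BitSet bits = new BitSet(10);

    // Add: set every given bit to 1.
    void add(int... indexes) {
        for (int i : indexes) bits.set(i);
    }

    // Test: true only if every given bit is 1.
    boolean contains(int... indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) return false;
        }
        return true;
    }

    // Render bits 0 to 9 as a string of '0' and '1' characters.
    String state() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10; i++) sb.append(bits.get(i) ? '1' : '0');
        return sb.toString();
    }
}
```

After add(0, 2, 6) and add(0, 3, 9), contains(1, 3, 9) is false because bit 1 was never set, while contains(0, 2, 9) is true even though n was never added: bits 0 and 2 were set by x and bit 9 by y.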
Use in MapReduce join
Run a separate subjob to create a Bloom filter
Broadcast the Bloom filter and use it in Map() of the join job
Drop the records that cannot match, and do the join in reduce
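These three steps can be put together in a plain-Java simulation. It is a sketch under stated assumptions: `BloomJoinSketch`, the 64-bit filter size, k = 3, and the hashing scheme are all hypothetical choices, in-memory maps stand in for the shuffle, and each row is a (join key, value) pair.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map-side Bloom filtering followed by a reduce-side join.
class BloomJoinSketch {
    static final int SIZE = 64; // bits in the filter (illustrative)
    static final int K = 3;     // indexes per key (illustrative)

    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force a non-zero step between probes
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Subjob: build the filter from the join keys of table x.
    static BitSet buildFilter(String[][] x) {
        BitSet bits = new BitSet(SIZE);
        for (String[] row : x) {
            for (int i = 0; i < K; i++) bits.set(index(row[0], i));
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }

    // Join job: the map step drops y records that fail the filter,
    // and the reduce step joins the surviving records by key.
    static List<String> join(String[][] x, String[][] y) {
        BitSet filter = buildFilter(x);
        Map<String, List<String>> xByKey = new TreeMap<>();
        Map<String, List<String>> yByKey = new TreeMap<>();
        for (String[] row : x) {
            xByKey.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
        }
        for (String[] row : y) {
            if (mightContain(filter, row[0])) { // map-side filtering
                yByKey.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1]);
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : xByKey.entrySet()) {
            List<String> ys = yByKey.getOrDefault(e.getKey(), Collections.<String>emptyList());
            for (String xv : e.getValue()) {
                for (String yv : ys) output.add(e.getKey() + " " + xv + " " + yv);
            }
        }
        return output;
    }
}
```

A false positive only lets an unneeded record through to the reduce step, where it finds no partner and is discarded, so the join result stays correct; the filter affects how much data is shuffled, not the answer.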
THANK YOU
Hadoop