![Page 1: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/1.jpg)
Some tips for effective map reducing
CHRISTOPHER SEVERS, eBay
eBay Netanya, December 2nd, 2013
![Page 2: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/2.jpg)
THE AGENDA
![Page 3: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/3.jpg)
THE AGENDA
1. Quick survey of the current landscape for Hadoop tools
2. A light comparison of the best functional tools
3. General advice
4. Some code samples
PRESENTATION TITLE GOES HERE
![Page 4: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/4.jpg)
THE ALTERNATIVES
I promise this part will be quick
![Page 5: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/5.jpg)
VANILLA MAPREDUCE

    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: args[0] is the input path, args[1] is the output path.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
        }
    }
![Page 6: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/6.jpg)
PIG
•Apache Pig is a really great tool for quick, ad-hoc data analysis
•While we can do amazing things with it, I’m not sure we should
•Anything complicated requires User Defined Functions (UDFs)
•UDFs require a separate code base
•Now you have to maintain two separate languages for no good reason
![Page 7: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/7.jpg)
APACHE HIVE
•On previous slide: s/Pig/Hive/g
![Page 8: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/8.jpg)
GENERAL ADVICE
Do this, not that
![Page 9: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/9.jpg)
DO
•Use a higher level abstraction like distributed lists
•Use objects instead of tuples
•Use a good serialization format
•Always check for data quality
•Use flatMap for uncertain computations
•Develop reusable reductions (monoids!)
•Prefer map side operations when possible
•Always check for data skew
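
Two of these points can be sketched on plain Scala collections: a named record type in place of a tuple, and simple checks for data quality and key skew. This is only an illustration; the `Click` type, its fields, and the helper names are hypothetical and not taken from the talk.

```scala
// Hypothetical record type; the field names are illustrative only.
// Named fields read better than a (String, String, Double) tuple.
final case class Click(userId: String, itemId: String, price: Double)

object DoExamples {
  // Data-quality check: keep only records whose fields look sane.
  def isValid(c: Click): Boolean =
    c.userId.nonEmpty && c.itemId.nonEmpty && c.price >= 0.0

  // Skew check: fraction of all records that belong to the most common key.
  // A value close to 1.0 means one reducer would receive almost everything.
  def topKeyShare[K](keys: Seq[K]): Double =
    if (keys.isEmpty) 0.0
    else keys.groupBy(identity).values.map(_.size).max.toDouble / keys.size
}
```

On a real data set the same checks would typically run on a sample before the full job is submitted.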
![Page 10: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/10.jpg)
DON’T
•Never use nulls
•Don’t use too many levels of nesting
•Don’t use shared state
•Don’t use iteration (too much)
•Try not to start with a complicated approach
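
As a rough sketch of the first and third points, the snippet below wraps a possibly-null lookup in `Option` and replaces a shared mutable counter with a fold. The object and method names are made up for the example.

```scala
object DontExamples {
  // Option(x) turns a possibly-null reference into Some(x) or None,
  // so callers never see a null from the Java map.
  def lookup(table: java.util.Map[String, String], key: String): Option[String] =
    Option(table.get(key))

  // No shared mutable counter: the running total is threaded through foldLeft.
  def totalLength(words: Seq[String]): Int =
    words.foldLeft(0)((acc, w) => acc + w.length)
}
```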
![Page 11: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/11.jpg)
SCALDING AND SCOOBI
This is what we use at eBay
![Page 12: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/12.jpg)
SOME SCALA CODE

    // getStuff and write are placeholders for reading input and writing output
    val myLines = getStuff
    val myWords = myLines.flatMap(w => w.split("\\s+"))
    val myWordsGrouped = myWords.groupBy(identity)
    val countedWords = myWordsGrouped.mapValues(x => x.size)
    write(countedWords)
![Page 13: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/13.jpg)
SOME SCALDING CODE
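
    // Scalding typed API; path and output are assumed to be defined elsewhere
    val myLines = TypedPipe.from(TextLine(path))
    val myWords = myLines.flatMap(w => w.split(" "))
      .groupBy(identity)
      .size
    myWords.write(TypedTsv[(String, Long)](output))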
![Page 14: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/14.jpg)
WHAT HAPPENED ON THE PREVIOUS SLIDE?
•flatMap()
–Similar to map, but a one-to-many rather than one-to-one mapping (each input yields zero or more outputs)
–Use when the desired result has some probability of occurring
–Can handle errors with the Option (Maybe) monad. A None value is discarded
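
A minimal sketch of both uses on ordinary Scala collections follows: splitting lines into words (one-to-many), and parsing values that may be malformed (an uncertain computation). The object and method names are illustrative, not part of Scalding or Scoobi.

```scala
import scala.util.Try

object FlatMapExample {
  // One-to-many: each line yields zero or more words.
  def words(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => line.split("\\s+").filter(_.nonEmpty))

  // Uncertain computation: parsing may fail, so return an Option.
  def parseInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // flatMap keeps the contents of every Some and drops every None.
  def parseAll(raw: Seq[String]): Seq[Int] = raw.flatMap(s => parseInt(s))
}
```

Malformed records simply disappear from the output instead of failing the whole job, which is why counting the dropped records separately is usually worthwhile.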
![Page 15: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/15.jpg)
MORE EXPLANATION
•groupBy()
–Takes a function that generates a key from the given value
–Logically the result can be thought of as an associative array: key -> List of values
–In Scalding this doesn’t necessarily force a Hadoop reduce phase; it depends on what comes after
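
The "key -> List of values" picture is easy to see on an in-memory collection. The sketch below uses the standard library's `groupBy`, not Scalding's, and the choice of the first letter as the key is arbitrary.

```scala
object GroupByExample {
  // The key function maps each word to its first letter.
  // The result is an associative array from key to the values sharing that key.
  def byFirstLetter(words: Seq[String]): Map[Char, Seq[String]] =
    words.filter(_.nonEmpty).groupBy(_.head)
}
```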
![Page 16: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/16.jpg)
THE BEST PART
•size
–This part is pure magic
–size is actually sugar for .map( t => 1L).sum
–sum has an implicit argument, mon: Monoid[T]
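
To make the desugaring concrete, here is a self-contained sketch in which `sum` takes its monoid implicitly and `size` is written as "map to 1L, then sum". The `Monoid` trait below is a hand-rolled stand-in for illustration, not Algebird's actual definition.

```scala
object SizeExample {
  // Hand-rolled stand-in for a monoid type class (an assumption for this sketch).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longAddition: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // sum receives the monoid as an implicit argument, mirroring mon: Monoid[T].
  def sum[T](values: Seq[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)((l, r) => mon.plus(l, r))

  // size expressed as "map every element to 1L, then sum".
  def size[A](values: Seq[A]): Long = sum(values.map(_ => 1L))
}
```

Swapping in a different monoid changes what the aggregation computes without touching `sum` itself.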
![Page 17: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/17.jpg)
MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
•From Wikipedia:
–A monoid is an algebraic structure with a single associative binary operation and an identity element.
•Almost everything you want to do is a monoid
–Standard addition of numeric types is the most common
–List/map/set/string concatenation
–Top k elements
–Bloom filter, count-min sketch, HyperLogLog
–Stochastic gradient descent
–Histograms
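
Two entries from the list, histograms and top k elements, can be written as monoids in a few lines. This is a sketch with a locally defined `Monoid` trait; a library such as Algebird provides its own versions, which may differ in detail.

```scala
object MonoidExamples {
  // Local stand-in for a monoid type class (an assumption for this sketch).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Histogram: merge two count maps by adding the counts for each key.
  val histogram: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero = Map.empty[String, Long]
    def plus(l: Map[String, Long], r: Map[String, Long]): Map[String, Long] =
      r.foldLeft(l) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }
  }

  // Top k: keep the k largest values seen so far, in descending order.
  // zero is an identity for lists that are already sorted and hold at most k items.
  def topK(k: Int): Monoid[List[Int]] = new Monoid[List[Int]] {
    val zero = List.empty[Int]
    def plus(l: List[Int], r: List[Int]): List[Int] =
      (l ++ r).sorted(Ordering[Int].reverse).take(k)
  }
}
```

Because both operations are associative, partial results can be combined in any grouping, which is what lets a framework aggregate on the map side before the shuffle.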
![Page 18: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/18.jpg)
MORE MONOID STUFF
•If you are aggregating, you are probably using a monoid
•Scalding has Algebird and monoid support baked in
•Scoobi can use Algebird (or any other monoid library) with almost no work
–combine { case (l,r) => monoid.plus(l,r) }
•Algebird handles tuples with ease
•Very easy to define monoids for your own types
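
The last two bullets can be illustrated without any library: a monoid for a user-defined type, and a pair monoid built from two component monoids. Everything here (`Average`, `pairMonoid`, the local `combine` helper) is hypothetical and only mirrors the shape of the snippet above; it is not the Scoobi or Algebird API.

```scala
object OwnTypeMonoid {
  // Local stand-in for a monoid type class (an assumption for this sketch).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Hypothetical user type: enough state to compute a mean after aggregation.
  final case class Average(count: Long, total: Double) {
    def mean: Double = if (count == 0L) 0.0 else total / count
  }

  implicit val averageMonoid: Monoid[Average] = new Monoid[Average] {
    val zero = Average(0L, 0.0)
    def plus(l: Average, r: Average): Average =
      Average(l.count + r.count, l.total + r.total)
  }

  // A pair of monoids is itself a monoid: combine component by component.
  implicit def pairMonoid[A, B](implicit a: Monoid[A], b: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      val zero = (a.zero, b.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (a.plus(l._1, r._1), b.plus(l._2, r._2))
    }

  // Local helper that reduces a sequence with the monoid's plus.
  def combine[T](values: Seq[T])(implicit monoid: Monoid[T]): T =
    values.foldLeft(monoid.zero) { case (l, r) => monoid.plus(l, r) }
}
```

Storing the count and the total rather than the mean itself is the design choice that makes the type aggregatable: means cannot be added, but their ingredients can.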
![Page 19: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/19.jpg)
ADVANTAGES
•Type checking
–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)
•Single language
–Scala is a full programming language
•Productivity
–Since the code you write looks like collections code you can use the Scala REPL to prototype
•Clarity
–Write code as a series of operations and let the job planner smash it all together
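
As an example of the productivity point, the word count can be prototyped against an in-memory `Seq` before any cluster is involved. This sketch uses only the Scala standard library; the object and method names are made up.

```scala
object WordCountPrototype {
  // In-memory stand-in for the job: same flatMap / groupBy / size shape,
  // but running on a local Seq instead of a distributed data set.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("\\s+").filter(_.nonEmpty))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

Pasting this into the REPL and calling `WordCountPrototype.wordCount(Seq("a b a", "b a"))` is a quick way to check the logic before porting the same chain of operations to a distributed job.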
![Page 20: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/20.jpg)
CONCLUSION
We’re almost done!
![Page 21: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/21.jpg)
THINGS TO TAKE AWAY
•MapReduce is a functional problem, so we should use functional tools
•You can increase productivity, safety, and maintainability all at once with no downside
•Thinking of data flows in a functional way opens up many new possibilities
•The community is awesome
![Page 22: Some tips for effective map reducing](https://reader035.vdocument.in/reader035/viewer/2022062520/56816358550346895dd40fb6/html5/thumbnails/22.jpg)
THANKS!
•Questions/comments?