Some tips for effective map reducing

CHRISTOPHER SEVERS, eBay

eBay Netanya, December 2nd, 2013



THE AGENDA


1. Quick survey of the current landscape for Hadoop tools

2. A light comparison of the best functional tools

3. General advice

4. Some code samples


THE ALTERNATIVES

I promise this part will be quick


VANILLA MAPREDUCE


    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every whitespace-separated token in a line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCount.class); // lets Hadoop locate the job jar on a cluster

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
        }
    }


PIG

•Apache Pig is a really great tool for quick, ad-hoc data analysis

•While we can do amazing things with it, I’m not sure we should

•Anything complicated requires User Defined Functions (UDFs)

•UDFs require a separate code base

•Now you have to maintain two separate languages for no good reason


APACHE HIVE

•On previous slide: s/Pig/Hive/g


GENERAL ADVICE

Do this, not that


DO

•Use a higher-level abstraction like distributed lists

•Use objects instead of tuples

•Use a good serialization format

•Always check for data quality

•Use flatMap for uncertain computations

•Develop reusable reductions (monoids!)

•Prefer map side operations when possible

•Always check for data skew
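Two of these tips (objects instead of tuples, and checking for data skew) can be sketched in plain Scala. The `Click` record, its fields, and the `heaviestKeyShare` helper are hypothetical names chosen for illustration; they do not come from the talk or from any Hadoop library.

```scala
object SkewCheck {
  // Hypothetical record type: named fields instead of an anonymous (String, String) tuple.
  final case class Click(userId: String, itemId: String)

  // Fraction of all records that share the most frequent key (0.0 for empty input).
  // A value close to 1.0 suggests one reducer would receive most of the data.
  def heaviestKeyShare[A, K](records: Seq[A])(key: A => K): Double =
    if (records.isEmpty) 0.0
    else {
      val largestGroup = records.groupBy(key).values.map(_.size).max
      largestGroup.toDouble / records.size
    }
}
```

Running a check like this on a sample of the input, before choosing a grouping key, is one cheap way to spot a hot key early.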


DON’T

•Never use nulls

•Don’t use too many levels of nesting

•Don’t use shared state

•Don’t use iteration (too much)

•Try not to start with a complicated approach
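A minimal sketch of the first and third points in plain Scala: wrap a possibly-null value in `Option` at the boundary, and aggregate with a fold rather than a shared mutable counter. `legacyLookup` is a hypothetical stand-in for any API that may return null.

```scala
object NoNullsNoSharedState {
  // Hypothetical legacy call that returns null when the id is unknown.
  private def legacyLookup(itemId: String): String =
    if (itemId == "item-1") "books" else null

  // Option(x) is None when x is null, so null never escapes this function.
  def lookupCategory(itemId: String): Option[String] =
    Option(legacyLookup(itemId))

  // foldLeft threads the running total through the computation;
  // nothing outside the fold is mutated.
  def totalLength(words: Seq[String]): Int =
    words.foldLeft(0)((total, word) => total + word.length)
}
```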


SCALDING AND SCOOBI

This is what we use at eBay


SOME SCALA CODE

    // getStuff and write are placeholders for reading the input and writing the output
    val myLines = getStuff
    val myWords = myLines.flatMap(w => w.split("\\s+"))
    val myWordsGrouped = myWords.groupBy(identity)
    val countedWords = myWordsGrouped.mapValues(x => x.size)
    write(countedWords)
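For prototyping in the REPL, the same pipeline can be written against an in-memory collection. This is a sketch in plain Scala collections; `wordCount` and its `Seq[String]` input are illustrative assumptions standing in for the slide's `getStuff` and `write`.

```scala
object LocalWordCount {
  // Split each line on whitespace, group identical words, and count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty) // a line with leading whitespace yields an empty first token
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The final step uses `map` rather than `mapValues` because `mapValues` is deprecated in Scala 2.13, where it returns a lazy view instead of a strict `Map`.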


SOME SCALDING CODE

    val myLines = TypedPipe.from(TextLine(path))
    val myWords = myLines.flatMap(w => w.split(" "))
      .groupBy(identity)
      .size
    myWords.write(TypedTsv[(String, Long)](output))


WHAT HAPPENED ON THE PREVIOUS SLIDE?

•flatMap()

–Similar to map, but a one-to-many rather than one-to-one mapping

–Use when a computation may or may not produce a result

–Can handle errors with the Option (Maybe) monad: a None value is discarded
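The error-handling use of flatMap can be sketched in plain Scala as follows. `parseInt` and `validInts` are hypothetical helper names; the same pattern should carry over to a distributed pipeline, where malformed records are dropped instead of failing the job.

```scala
import scala.util.Try

object SafeParse {
  // Parsing may fail, so the result is an Option rather than an exception or a null.
  def parseInt(raw: String): Option[Int] = Try(raw.trim.toInt).toOption

  // flatMap keeps the contents of every Some and drops every None.
  def validInts(raw: Seq[String]): Seq[Int] = raw.flatMap(s => parseInt(s))
}
```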


MORE EXPLANATION

•groupBy()

–Takes a function that generates a key from the given value

–Logically the result can be thought of as an associative array: key -> List of values

–In Scalding this doesn’t necessarily force a Hadoop reduce phase; it depends on what comes after
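The "key -> List of values" picture is easy to see on an in-memory collection. This sketch uses plain Scala collections rather than Scalding, and `byInitial` is a hypothetical helper whose key function is the first character of each word.

```scala
object GroupByExample {
  // groupBy applies the key function to every element and collects
  // the elements that produced the same key, preserving their order.
  def byInitial(words: Seq[String]): Map[Char, Seq[String]] =
    words.filter(_.nonEmpty).groupBy(word => word.head)
}
```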


THE BEST PART

•size

–This part is pure magic

–size is actually sugar for .map(t => 1L).sum

–sum has an implicit argument, mon: Monoid[T]
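The desugaring can be imitated on ordinary collections. The sketch below defines its own minimal `Monoid` trait with `zero` and `plus` (the two core operations Algebird's `Monoid` also exposes); it is an illustration of the idea, not Scalding's actual implementation.

```scala
object SizeAsSum {
  // Minimal monoid type class, defined locally for this sketch.
  trait Monoid[T] {
    def zero: T
    def plus(left: T, right: T): T
  }

  implicit val longAddition: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(left: Long, right: Long): Long = left + right
  }

  // Like the slide's sum: the compiler supplies the monoid implicitly.
  def sum[T](values: Seq[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)

  // size expressed as "map every element to 1L, then sum".
  def size[A](values: Seq[A]): Long = sum(values.map(_ => 1L))
}
```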


MONOIDS: WHY YOU SHOULD CARE ABOUT MATH

•From Wikipedia:

–A monoid is an algebraic structure with a single associative binary operation and an identity element.

•Almost everything you want to do is a monoid

–Standard addition of numeric types is the most common

–List/map/set/string concatenation

–Top k elements

–Bloom filter, count-min sketch, HyperLogLog

–Stochastic gradient descent

–Histograms
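Associativity is what makes monoids useful for MapReduce: partial results can be computed per chunk and merged afterwards without changing the answer. The sketch below shows this for word-count maps in plain Scala; the `Monoid` trait, `countMerge`, and `sumInChunks` are names invented for this illustration, with chunks standing in for map-side partitions.

```scala
object MonoidPartials {
  trait Monoid[T] {
    def zero: T
    def plus(left: T, right: T): T
  }

  // Word-count maps form a monoid: the empty map is the identity,
  // and plus adds the counts key by key.
  val countMerge: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero: Map[String, Long] = Map.empty
    def plus(left: Map[String, Long], right: Map[String, Long]): Map[String, Long] =
      right.foldLeft(left) { case (acc, (word, n)) =>
        acc.updated(word, acc.getOrElse(word, 0L) + n)
      }
  }

  def sumAll[T](values: Seq[T], mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)

  // Simulates map-side combining: reduce each chunk on its own,
  // then merge the partial results. chunkSize must be positive.
  def sumInChunks[T](values: Seq[T], chunkSize: Int, mon: Monoid[T]): T =
    sumAll(values.grouped(chunkSize).map(chunk => sumAll(chunk, mon)).toSeq, mon)
}
```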


MORE MONOID STUFF

•If you are aggregating, you are probably using a monoid

•Scalding has Algebird and monoid support baked in

•Scoobi can use Algebird (or any other monoid library) with almost no work

–combine { case (l,r) => monoid.plus(l,r) }

•Algebird handles tuples with ease

•Very easy to define monoids for your own types
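To give a feel for the last two points, here is a sketch of a monoid for a user-defined type and a monoid for pairs derived from the monoids of the components. It uses a locally defined `Monoid` trait rather than Algebird itself, and `Stats` is a hypothetical domain type; Algebird's own tuple support may be implemented differently.

```scala
object CustomMonoids {
  trait Monoid[T] {
    def zero: T
    def plus(left: T, right: T): T
  }

  // Hypothetical domain type: enough state to compute a mean after aggregation.
  final case class Stats(count: Long, total: Double) {
    def mean: Double = if (count == 0L) 0.0 else total / count
  }

  implicit val statsMonoid: Monoid[Stats] = new Monoid[Stats] {
    val zero: Stats = Stats(0L, 0.0)
    def plus(left: Stats, right: Stats): Stats =
      Stats(left.count + right.count, left.total + right.total)
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(left: Long, right: Long): Long = left + right
  }

  // A pair of monoids is itself a monoid, combined component by component.
  implicit def pairMonoid[A, B](implicit ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      val zero: (A, B) = (ma.zero, mb.zero)
      def plus(left: (A, B), right: (A, B)): (A, B) =
        (ma.plus(left._1, right._1), mb.plus(left._2, right._2))
    }

  def sum[T](values: Seq[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)
}
```

Keeping a count and a total, rather than a running mean, is the design choice that makes the combination associative: means cannot be merged directly, but their ingredients can.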


ADVANTAGES

•Type checking

–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)

•Single language

–Scala is a full programming language

•Productivity

–Since the code you write looks like collections code, you can use the Scala REPL to prototype

•Clarity

–Write code as a series of operations and let the job planner smash it all together


CONCLUSION

We’re almost done!


THINGS TO TAKE AWAY

•MapReduce is a functional problem, so we should use functional tools

•You can increase productivity, safety, and maintainability all at once with no downside

•Thinking of data flows in a functional way opens up many new possibilities

•The community is awesome


THANKS!

•Questions/comments?
