![Page 1: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/1.jpg)
Scalding
Hadoop Word Count in < 70 lines of code
Konrad 'ktoso' Malawski
JARCamp #3, 12.04.2013
Friday, April 12, 13
![Page 2: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/2.jpg)
Scalding
Hadoop Word Count in 4 lines of code
![Page 3: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/3.jpg)
![Page 13: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/13.jpg)
Agenda
Why Scalding? (10%) +
Hadoop Basics (20%) +
Enter Cascading (40%) +
Hello Scalding (30%) =
100%
![Page 14: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/14.jpg)
Why Scalding?
Word Count in Types
type Word = String
type Count = Int
String => Map[Word, Count]
![Page 23: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/23.jpg)
Why Scalding?
Word Count in Scala
val text = "a a a b b"
def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }
wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
![Page 29: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/29.jpg)
Stuff > Memory
Scala collections... fun, but memory bound!
val text = "so many words... waaah! ..."   // in Memory
text
  .split(" ")                              // in Memory
  .map(a => (a, 1))                        // in Memory
  .groupBy(_._1)                           // in Memory
  .map(a => (a._1, a._2.map(_._2).sum))    // in Memory
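The slide's point is that the input and every intermediate collection are fully materialized. As a rough illustration (not from the talk), the sketch below streams the input line by line in plain Java, so that only the counts map stays in memory. The class name `StreamingWordCount` is hypothetical. This only postpones the problem: it still runs on one machine and fails once the input or the set of distinct words outgrows it.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the talk: single-machine streaming word count.
public class StreamingWordCount {

    // Reads one line at a time; only the counts map is kept in memory.
    public static Map<String, Integer> count(BufferedReader reader) {
        Map<String, Integer> counts = new TreeMap<>();
        reader.lines().forEach(line -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("a a a\nb b"));
        System.out.println(count(reader)); // prints {a=3, b=2}
    }
}
```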
![Page 30: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/30.jpg)
Apache Hadoop (HDFS + MR)
http://hadoop.apache.org/
![Page 31: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/31.jpg)
Why Scalding?
Word Count in Hadoop MR
package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
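To make the roles of the Map and Reduce classes easier to follow, here is a minimal in-memory sketch of the three phases (map, shuffle, reduce) in plain Java, with no Hadoop dependency. The class `MapReduceSketch` and its method names are hypothetical and only mimic the data flow. Real Hadoop distributes these phases across machines and spills to disk.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of the MapReduce data flow.
public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair per token, as Map.map() does above.
    public static List<Map.Entry<String, Integer>> mapPhase(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key (the framework's job in Hadoop).
    public static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key, as Reduce.reduce() does above.
    public static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs the three phases in order over all input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(mapPhase(line));
        }
        return reducePhase(shufflePhase(pairs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a a a b b"))); // prints {a=3, b=2}
    }
}
```

The actual word-count logic is only a few lines here as well. Most of the Hadoop version is job configuration and type plumbing.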
![Page 33: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/33.jpg)
Trivia: How old is Hadoop?
![Page 34: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/34.jpg)
![Page 35: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/35.jpg)
![Page 36: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/36.jpg)
![Page 37: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/37.jpg)
![Page 38: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/38.jpg)
![Page 39: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/39.jpg)
![Page 40: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/40.jpg)
![Page 41: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/41.jpg)
![Page 42: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/42.jpg)
Cascading
www.cascading.org/
![Page 46: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/46.jpg)
Cascading is
Taps & Pipes & Sinks
![Page 55: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/55.jpg)
1: Distributed Copy
// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");

// build the Flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run!
flowConnector.connect(flowDef).complete();
![Page 56: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/56.jpg)
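1. DCP - Full Code
// imports assume the Cascading 2.x package layout
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];

    // tell Hadoop which jar contains the job
    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    // source and sink taps: tab-separated text with a header row
    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    Pipe copyPipe = new Pipe("copy");

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}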
![Page 62: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/62.jpg)
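2: Word Count
// imports assume the Cascading 2.x package layout
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main(String[] args) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
        .setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}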
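To see what the delimiter regex and the split, group, count sequence amount to, the sketch below mirrors the same steps with `java.util.regex` and a map, without Cascading. `TokenCountSketch` and its methods are hypothetical names. The sketch shows only how the regular expression behaves in plain Java, and makes no claim about how `RegexSplitGenerator` handles edge cases such as empty tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical plain-Java analogue of the Each / GroupBy / Every pipeline.
public class TokenCountSketch {

    // Same character class as on the slide: space, [ ] ( ) , and .
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Analogue of the Each + splitter step: one text line in, many tokens out.
    public static List<String> tokens(String line) {
        return Arrays.asList(DELIMITERS.split(line));
    }

    // Analogue of GroupBy(token) followed by Every(Count): rows counted per token.
    public static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokens(line)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tokens("rain (shadow), dry."));
        System.out.println(count(Arrays.asList("rain shadow", "rain (shadow), dry.")));
    }
}
```

In this plain-Java version, adjacent delimiters such as `), ` produce empty strings between them, so an empty token is counted alongside the real words. A real job would usually filter such tokens out before grouping.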
![Page 68: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/68.jpg)
2: Word Count
String docPath = args[ 0 ]; String wcPath = args[ 1 ];
Properties properties = new Properties(); AppProps.setApplicationJarClass( props, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );
// create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef() .setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); } }
2: Word Count
Friday, April 12, 13
![Page 69: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/69.jpg)
2: Word Count
String docPath = args[ 0 ]; String wcPath = args[ 1 ];
Properties properties = new Properties(); AppProps.setApplicationJarClass( props, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );
// create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef() .setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); } }
2: Word Count
Friday, April 12, 13
![Page 70: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/70.jpg)
2: Word Count
String docPath = args[ 0 ]; String wcPath = args[ 1 ];
Properties properties = new Properties(); AppProps.setApplicationJarClass( props, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );
// create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef() .setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); } }
2: Word Count
Friday, April 12, 13
![Page 71: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/71.jpg)
2: Word Count

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // split the "text" field into tokens on spaces, brackets, parentheses, commas and periods
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
    }
    }
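A minimal sketch of what the Word Count flow computes, using plain in-memory Scala collections in place of Hadoop taps. The names `WordCountSketch` and `wordCount`, and the step that drops empty tokens, are assumptions made for illustration; they are not part of Cascading.

```scala
object WordCountSketch {
  // Same character class as the RegexSplitGenerator in the slide:
  // space, square brackets, parentheses, comma, period
  val Delimiters = "[ \\[\\]\\(\\),.]"

  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(Delimiters)) // roughly: Each + RegexSplitGenerator (text -> token)
      .filter(_.nonEmpty)           // assumption: empty tokens are not interesting
      .groupBy(identity)            // roughly: GroupBy( token )
      .map { case (token, occurrences) => token -> occurrences.size } // roughly: Every + Count
}
```

For example, `WordCountSketch.wordCount(Seq("a b, a", "(b) c."))` counts `a` twice, `b` twice and `c` once.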
![Page 78: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/78.jpg)
Cascading - how?

    // pseudo code...

    val flow = FlowDef
    val flowConnector: FlowDef => List[MRJob] = ...

    val jobs: List[MRJob] = flowConnector(flow)

    HadoopCluster.execute(jobs)
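The pseudo code can be turned into a small runnable toy. This is only a sketch under strong simplifying assumptions: `Op`, `MapOp`, `GroupOp`, `MRJob` and the planning rule (one job per grouping step) are hypothetical stand-ins invented here, and the real Cascading planner is considerably more involved.

```scala
object PlannerSketch {
  // Hypothetical stand-ins for illustration; none of these are Cascading classes
  sealed trait Op
  final case class MapOp(name: String) extends Op   // per-record work, e.g. an Each
  final case class GroupOp(name: String) extends Op // needs a shuffle, e.g. GroupBy + Every

  type FlowDef = List[Op]
  final case class MRJob(mapSide: List[Op], reduceSide: Option[GroupOp])

  // Toy flow connector: cut the flow into one job per grouping step
  val flowConnector: FlowDef => List[MRJob] = flow => plan(flow)

  private def plan(flow: FlowDef): List[MRJob] =
    if (flow.isEmpty) Nil
    else {
      // everything up to the next grouping step runs on the map side of one job
      val (mapSide, rest) = flow.span(_.isInstanceOf[MapOp])
      rest match {
        case (group: GroupOp) :: tail => MRJob(mapSide, Some(group)) :: plan(tail)
        case _                        => List(MRJob(mapSide, None)) // map-only job
      }
    }

  val flow: FlowDef = List(MapOp("split"), GroupOp("count"), MapOp("format"))
  val jobs: List[MRJob] = flowConnector(flow)
}
```

In this toy model the three-step flow becomes two jobs: one with a map and a reduce side, and a trailing map-only job.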
![Page 81: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/81.jpg)
Cascading tips

    Pipe assembly = new Pipe( "assembly" );
    assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug() );
    // ...

    // head and tail have same name
    FlowDef flowDef = new FlowDef()
      .setName( "debug" )
      .addSource( "assembly", source )
      .addSink( "assembly", sink )
      .addTail( assembly );

    flowDef.setDebugLevel( DebugLevel.NONE );

flowConnector will NOT create the Debug pipe!
![Page 82: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/82.jpg)
Scalding = Scala + Cascading
Twitter Scalding
github.com/twitter/scalding
![Page 83: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/83.jpg)
Scalding API
![Page 93: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/93.jpg)
map

Scala:

    val data = 1 :: 2 :: 3 :: Nil

    val doubled = data map { _ * 2 }  // Int => Int

Scalding:

    IterableSource(data, 'number)
      .map('number -> 'doubled) { n: Int => n * 2 }  // Int => Int; must choose type!

    // both fields end up in the Pipe: 'number stays, 'doubled becomes available
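A rough in-memory model of what `map` does to the fields in a Pipe, assuming each tuple can be pictured as a `Map` from field name to value. `Row` and `mapField` are hypothetical helpers written for this sketch; they are not Scalding API.

```scala
object MapFieldsSketch {
  // Hypothetical stand-in for a tuple with named fields
  type Row = Map[String, Int]

  // Model of .map(in -> out)(fn): the input field is kept, the output field is added
  def mapField(rows: List[Row], in: String, out: String)(fn: Int => Int): List[Row] =
    rows.map(row => row + (out -> fn(row(in))))

  val pipe: List[Row] = List(1, 2, 3).map(n => Map("number" -> n))
  val result: List[Row] = mapField(pipe, "number", "doubled")(_ * 2)
}
```

Every row in `result` carries both `number` and `doubled`.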
![Page 106: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/106.jpg)
mapTo

Scala:

    var data = 1 :: 2 :: 3 :: Nil

    val doubled = data map { _ * 2 }  // Int => Int
    data = null                       // release reference

Scalding:

    IterableSource(data, 'number)
      .mapTo('number -> 'doubled) { n: Int => n * 2 }  // Int => Int

    // 'doubled stays in Pipe, 'number is removed
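The difference from `map` can be pictured with a small in-memory model, again assuming a tuple is a `Map` from field name to value. `Row` and `mapToField` are hypothetical names for this sketch, not Scalding API.

```scala
object MapToFieldsSketch {
  // Hypothetical stand-in for a tuple with named fields
  type Row = Map[String, Int]

  // Model of .mapTo(in -> out)(fn): only the output field survives
  def mapToField(rows: List[Row], in: String, out: String)(fn: Int => Int): List[Row] =
    rows.map(row => Map(out -> fn(row(in))))

  val pipe: List[Row] = List(1, 2, 3).map(n => Map("number" -> n))
  val result: List[Row] = mapToField(pipe, "number", "doubled")(_ * 2)
}
```

Each row in `result` holds only `doubled`; the `number` field is gone, which is the field-level counterpart of releasing the reference in the Scala version.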
![Page 118: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/118.jpg)
flatMap

Scala:

    val data = "1" :: "2,2" :: "3,3,3" :: Nil  // List[String]

    val numbers = data flatMap { line =>       // String
      line.split(",")                          // Array[String]
    } map { _.toInt }                          // List[Int]

    numbers                                    // List[Int]
    numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

    TextLine(data)                                                  // like List[String]
      .flatMap('line -> 'word) { line: String => line.split(",") }  // like List[String]
      .map('word -> 'number) { word: String => word.toInt }         // like List[Int]

    // MR map outside: the map is a separate pipe operation after flatMap
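A rough in-memory model of `flatMap` on fields, assuming a tuple can be pictured as a `Map` from field name to value. `Row` and `flatMapField` are hypothetical helpers for this sketch, not Scalding API. It shows that one input row can produce several output rows, each still carrying the input field.

```scala
object FlatMapFieldsSketch {
  // Hypothetical stand-in for a tuple with named fields
  type Row = Map[String, String]

  // Model of .flatMap(in -> out)(fn): one output row per produced value,
  // with the input field kept alongside the new one
  def flatMapField(rows: List[Row], in: String, out: String)(
      fn: String => Iterable[String]): List[Row] =
    rows.flatMap(row => fn(row(in)).map(value => row + (out -> value)))

  val lines: List[Row] = List("1", "2,2").map(line => Map("line" -> line))
  val words: List[Row] = flatMapField(lines, "line", "word")(_.split(",").toList)
}
```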
![Page 130: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/130.jpg)
flatMap

Scala:

    val data = "1" :: "2,2" :: "3,3,3" :: Nil  // List[String]

    val numbers = data flatMap { line =>       // String
      line.split(",").map(_.toInt)             // Array[Int]
    }

    numbers                                    // List[Int]
    numbers should equal (List(1, 2, 2, 3, 3, 3))

Scalding:

    TextLine(data)                                                               // like List[String]
      .flatMap('line -> 'word) { line: String => line.split(",").map(_.toInt) }  // like List[Int]

    // map inside Scala: the conversion happens within the flatMap function
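Both flatMap variants give the same numbers. A small self-contained check in plain Scala, using `assert` instead of the ScalaTest matcher from the slide (the object and value names are only for illustration):

```scala
object FlatMapVariants {
  val data = "1" :: "2,2" :: "3,3,3" :: Nil

  // map outside: split first, convert in a separate step
  val mapOutside: List[Int] = data.flatMap(line => line.split(",")).map(_.toInt)

  // map inside: convert within the function passed to flatMap
  val mapInside: List[Int] = data.flatMap(line => line.split(",").map(_.toInt))

  assert(mapOutside == mapInside)
}
```

On plain collections the choice is mostly a matter of style; in a Pipe, the first form adds an extra operation to the assembly.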
![Page 131: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/131.jpg)
groupBy
Friday, April 12, 13
![Page 132: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/132.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
groupByScala:
Friday, April 12, 13
![Page 133: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/133.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groupByScala:
Friday, April 12, 13
![Page 134: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/134.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groupByScala:
Friday, April 12, 13
![Page 135: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/135.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))
groupByScala:
Friday, April 12, 13
![Page 136: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/136.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))groups(false) should equal (List(30, 42))
groupByScala:
Friday, April 12, 13
![Page 137: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/137.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))groups(false) should equal (List(30, 42))
groupByScala:
Friday, April 12, 13
![Page 138: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/138.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))groups(false) should equal (List(30, 42))
groupBy
IterableSource(List(1, 2, 30, 42), 'num)
Scala:
Scalding:
Friday, April 12, 13
![Page 139: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/139.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))groups(false) should equal (List(30, 42))
groupBy
IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 }
Scala:
Scalding:
Friday, April 12, 13
![Page 140: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/140.jpg)
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, Int]
groups(true) should equal (List(1, 2))groups(false) should equal (List(30, 42))
groupBy
IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size }
Scala:
Scalding:
Friday, April 12, 13
![Page 141: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/141.jpg)
groupBy
Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]
groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }
groups all tuples with an equal 'lessThanTen value
![Page 142: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/142.jpg)
groupBy
Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]
groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size }
groups all tuples with an equal 'lessThanTen value
'lessThanTenCounts
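The Scala half of this comparison can be run on its own. A minimal sketch using only the standard library; the object name `GroupByDemo` and the `counts` value are illustrative and not from the slides. `counts` mirrors locally what `_.size` computes per group in the Scalding pipeline:

```scala
object GroupByDemo {
  val data: List[Int] = 1 :: 2 :: 30 :: 42 :: Nil

  // groupBy keys every element by the result of the predicate
  val groups: Map[Boolean, List[Int]] = data.groupBy(_ < 10)

  // Local analogue of `_.size`: the number of elements in each group
  val counts: Map[Boolean, Int] = groups.map { case (key, values) => key -> values.size }
}
```

Note that the plain Scala `groupBy` keeps the grouped elements, while the Scalding `groupBy` only keeps what the group function aggregates.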
![Page 143: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/143.jpg)
groupBy
Scalding:
![Page 144: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/144.jpg)
groupBy
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
![Page 145: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/145.jpg)
groupBy
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
![Page 146: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/146.jpg)
groupBy
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('num -> 'total) }
![Page 147: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/147.jpg)
groupBy
Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.sum('num -> 'total) }
'total = [3, 72]
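The per-group totals can be checked locally with plain Scala collections: 1 + 2 = 3 for the numbers below ten and 30 + 42 = 72 for the rest. A small sketch, assuming nothing beyond the standard library (the names `GroupSumDemo` and `totals` are made up for this example):

```scala
object GroupSumDemo {
  val data: List[Int] = List(1, 2, 30, 42)

  // Group by the same predicate as the pipeline, then sum each group
  val totals: Map[Boolean, Int] =
    data.groupBy(_ < 10).map { case (lessThanTen, nums) => lessThanTen -> nums.sum }
}
```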
![Page 148: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/148.jpg)
Scalding API
![Page 149: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/149.jpg)
Scalding API
project / discard
![Page 150: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/150.jpg)
Scalding API
project / discard
map / mapTo
![Page 151: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/151.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
![Page 152: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/152.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
![Page 153: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/153.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
![Page 154: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/154.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
![Page 155: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/155.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
![Page 156: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/156.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
![Page 157: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/157.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
debug
![Page 158: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/158.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
debug
Group operations
![Page 159: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/159.jpg)
Scalding API
project / discard
map / mapTo
flatMap / flatMapTo
rename
filter
unique
groupBy / groupAll / groupRandom / shuffle
limit
debug
Group operations
joins
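Several of these operations have a close analogue on ordinary Scala collections, which can help when reading a pipeline. A rough sketch using only the standard library: the `Row` case class, its fields, and the sample data are invented for illustration, and the Scalding operations named in the comments work on named tuple fields in a distributed pipe rather than on in-memory lists.

```scala
object ApiAnalogues {
  // Hypothetical record standing in for a tuple with fields 'name and 'age
  final case class Row(name: String, age: Int)

  val rows: List[Row] = List(Row("ann", 31), Row("bob", 17), Row("ann", 31), Row("cy", 45))

  val projected: List[String] = rows.map(_.name)            // roughly: project('name)
  val adults: List[Row]       = rows.filter(_.age >= 18)    // roughly: filter('age) { ... }
  val letters: List[Char]     = rows.flatMap(_.name.toList) // roughly: flatMap('name -> 'letter) { ... }
  val distinctRows: List[Row] = rows.distinct               // roughly: unique('name, 'age)
  val firstTwo: List[Row]     = rows.take(2)                // roughly: limit(2)
}
```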
![Page 160: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/160.jpg)
Distributed Copy in Scalding
class WordCountJob(args: Args) extends Job(args) {
![Page 161: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/161.jpg)
Distributed Copy in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val input = Tsv(args("input"))
  val output = Tsv(args("output"))
![Page 162: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/162.jpg)
Distributed Copy in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val input = Tsv(args("input"))
  val output = Tsv(args("output"))
  input.read.write(output)
}
![Page 163: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/163.jpg)
Distributed Copy in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val input = Tsv(args("input"))
  val output = Tsv(args("output"))
  input.read.write(output)
}
The End.
![Page 164: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/164.jpg)
Main Class - "Runner"
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding
object ScaldingJobRunner extends App {
  ToolRunner.run(new Configuration, new scalding.Tool, args)
}
![Page 165: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/165.jpg)
Main Class - "Runner"
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding
object ScaldingJobRunner extends App {
  ToolRunner.run(new Configuration, new scalding.Tool, args)
}
args: from App
![Page 166: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/166.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
}
![Page 167: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/167.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
}
![Page 168: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/168.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
}
![Page 169: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/169.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
  def tokenize(text: String): Array[String] = implemented
}
![Page 170: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/170.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
  def tokenize(text: String): Array[String] = implemented
}
![Page 171: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/171.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { group => group.size }
  def tokenize(text: String): Array[String] = implemented
}
![Page 172: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/172.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { group => group.size('count) }
  def tokenize(text: String): Array[String] = implemented
}
![Page 173: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/173.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))
  def tokenize(text: String): Array[String] = implemented
}
![Page 174: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/174.jpg)
Word Count in Scalding
class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")
  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))
  def tokenize(text: String): Array[String] = implemented
}
4 lines
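The same flatMap, groupBy, size shape can be tried without a cluster on plain Scala collections. This is only a local sketch, not the Scalding job: the slides leave `tokenize` unspecified, so the version here (lower-casing and splitting on runs of non-letter, non-digit characters) is an assumption, and `LocalWordCount` and `wordCount` are hypothetical names.

```scala
object LocalWordCount {
  // Assumed tokenizer: lower-case, split on anything that is not a letter or digit
  def tokenize(text: String): Array[String] =
    text.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty)

  // Lines to words (flatMap), equal words together (groupBy), count per word (size)
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toSeq)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, `LocalWordCount.wordCount(Seq("to be or not to be", "To be!"))` counts `to` and `be` three times each, and `or` and `not` once each.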
![Page 175: Scalding - Hadoop Word Count in LESS than 70 lines of code](https://reader033.vdocument.in/reader033/viewer/2022051400/54c67f1e4a795973728b4582/html5/thumbnails/175.jpg)
Dzięki! Thanks! ありがとう!
Konrad Malawski @ java.pl
t: ktosopl / g: ktoso
b: blog.project13.pl