osdc.fr 2012 :: Cascalog: logic programming for Hadoop

Cascalog: logic programming for Hadoop

Bertrand Dechoux, October 13, 2012

Saturday, October 13, 2012

MapReduce: what about you?

Python
▶ map(function, iterable, ...)
▶ reduce(function, iterable[, initializer])

Perl
▶ map BLOCK LIST
▶ reduce BLOCK LIST

Ruby
▶ map {|item| block} -> new_ary / collect {|item| block} -> new_ary
▶ reduce(initial, sym) -> obj / inject(initial, sym) -> obj

Smalltalk
▶ collect:aBlock=TheArray
▶ inject: thisValue into: binaryBlock

PHP
▶ array array_map ( callable $callback, array $arr1 [, array $...])
▶ mixed array_reduce (array $input, callable $function [, mixed $initial = NULL])
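As a quick illustration of the two primitives listed above, here is a minimal Python 3 sketch. Two details differ from the signatures on the slide: in Python 3 `reduce` lives in `functools`, and `map` returns a lazy iterator. The word list is invented sample data, not something from the talk.

```python
from functools import reduce

# Hypothetical sample data, not from the talk.
words = ["hadoop", "cascading", "cascalog"]

# map: apply a function to every element (lazy in Python 3, hence list()).
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result, starting from 0.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```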


Hadoop MapReduce: the theory

Map
▶ Map(k1, v1) -> list(k2, v2)

Reduce
▶ Reduce(k2, list(v2)) -> list(k3, v3)


Hadoop MapReduce: the theory

Map
▶ Map(k1, v1) -> list(k2, v2)
▶ SortByKey(list(k2, v2)) -> list(k2, v2)

Reduce
▶ MergeByKey(list, list, ...) -> list(k2, list(v2))
▶ Reduce(k2, list(v2)) -> list(k3, v3)
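The four signatures above can be sketched as a single-process word count in Python. This is only an illustration of the data flow: the names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and a real Hadoop job sorts on each map task and merges on the reduce side across machines rather than in one list.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_offset, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word of the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word.
    return [(word, sum(counts))]

def run_job(lines):
    # Map phase over (offset, line) records.
    mapped = [kv for offset, line in enumerate(lines) for kv in map_fn(offset, line)]
    # SortByKey: order the intermediate pairs by key.
    mapped.sort(key=itemgetter(0))
    # MergeByKey: collect all values of the same key into one list.
    grouped = [(key, [v for _, v in pairs])
               for key, pairs in groupby(mapped, key=itemgetter(0))]
    # Reduce phase, one call per key.
    return [out for key, values in grouped for out in reduce_fn(key, values)]

result = run_job(["a b a", "b c"])
print(result)
```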


Hadoop MapReduce: in practice


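import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token of every input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts received for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    // Lets Hadoop locate the jar containing these classes on the cluster.
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}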


Cascading: necessary abstractions


Cascading: ‘field algebra’?!


Cascalog: logic programming for Hadoop

(my-predicate ?var1 42 ?var3 :> ?var4 ?var5)


Cascalog: select ... from ...

(?<- (stdout) [?person] (person ?person))




(?<- (stdout) [?person ?age] (age ?person ?age))





(?<- (stdout) [?age] (age _ ?age))






(?<- (stdout) [?person] (age ?person 42))
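To make the variable, underscore and constant positions in these queries concrete, here is a rough in-memory analogue in plain Python (not Cascalog). The `age` rows are invented, since the talk does not show its data, and the sketch does not model how Cascalog orders or deduplicates results.

```python
# Hypothetical sample relation; the talk does not show its data.
age = [("alice", 28), ("bob", 42), ("carol", 35), ("dave", 42)]

# (age ?person ?age): both columns are bound to variables and returned.
person_and_age = [(person, years) for person, years in age]

# (age _ ?age): the underscore ignores the first column.
ages = [(years,) for _, years in age]

# (age ?person 42): a constant instead of a variable keeps only matching rows.
aged_42 = [(person,) for person, years in age if years == 42]

print(ages, aged_42)
```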


Cascalog: select ... from ... where

(?<- (stdout) [?person ?age]
     (age ?person ?age)
     (< ?age 30))


Cascalog: select ... as ... from ...

(?<- (stdout) [?person ?junior]
     (age ?person ?age)
     (< ?age 30 :> ?junior))
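The same predicate `(< ?age 30)` plays two roles: without an output variable it filters rows (the "where" query), and with `:> ?junior` its result becomes a new column (the "as" query). A plain-Python analogue, using invented sample rows:

```python
# Hypothetical sample relation; the talk does not show its data.
age = [("alice", 28), ("bob", 42), ("carol", 35)]

# (< ?age 30) with no output variable: rows failing the test are dropped.
under_30 = [(person, years) for person, years in age if years < 30]

# (< ?age 30 :> ?junior): the boolean result is bound to ?junior, no row is dropped.
with_flag = [(person, years < 30) for person, years in age]

print(under_30, with_flag)
```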


Cascalog: select count(*) from ... group by ...

(?<- (stdout) [?count]
     (age _ _)
     (c/count ?count))


Cascalog: select count(*) from ... group by ...

(?<- (stdout) [?junior ?count]
     (age _ ?age)
     (< ?age 30 :> ?junior)
     (c/count ?count))
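In the two counting queries, the aggregator `c/count` is applied per group, and the groups are defined by the non-aggregated output variables: none in the first query (one global count), `?junior` in the second. A plain-Python analogue on invented rows:

```python
from collections import Counter

# Hypothetical sample relation; the talk does not show its data.
age = [("alice", 28), ("bob", 42), ("carol", 35), ("dave", 19)]

# [?count] with (age _ _): no grouping variable, so a single global count.
total = len(age)

# [?junior ?count]: rows are grouped by the computed ?junior flag, then counted.
by_junior = Counter(years < 30 for _, years in age)
rows = sorted(by_junior.items())

print(total, rows)
```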


Cascalog: select ... from ... join ...

(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))
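No join keyword appears in the query: using the same variable `?person` in both generators is what joins them, and only people present in both relations survive. A plain-Python analogue of that inner join, with invented rows:

```python
# Hypothetical sample relations; the talk does not show its data.
age = [("alice", 28), ("bob", 42), ("carol", 35)]
gender = [("alice", "f"), ("bob", "m"), ("erin", "f")]

# Index the second relation by the shared variable ?person.
gender_by_person = {}
for person, g in gender:
    gender_by_person.setdefault(person, []).append(g)

# Keep only people found on both sides (inner join on ?person).
joined = [(person, years, g)
          for person, years in age
          for g in gender_by_person.get(person, [])]

print(joined)
```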


Cascalog: select ... from ... (select ...)

(let [many-follows (<- [?person]
                       (follows ?person _)
                       (c/count ?count)
                       (> ?count 2))]
  (?<- (stdout) [?personA ?personB]
       (many-follows ?personA)
       (many-follows ?personB)
       (follows ?personA ?personB)))
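The subquery `many-follows` is defined with `<-` (no execution) and then reused like any other generator. A plain-Python analogue with an invented `follows` relation: first compute who follows more than two people, then keep the follow pairs where both sides belong to that set.

```python
from collections import Counter

# Hypothetical sample relation (follower, followed); not from the talk.
follows = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "alice"), ("bob", "carol"), ("bob", "dave"),
    ("carol", "alice"),
]

# Subquery: people who follow more than 2 others.
counts = Counter(follower for follower, _ in follows)
many_follows = {person for person, n in counts.items() if n > 2}

# Outer query: follow pairs where both people are in many_follows.
pairs = [(a, b) for a, b in follows if a in many_follows and b in many_follows]

print(sorted(many_follows), pairs)
```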


Cascalog: define your own functions

(defn toUpperCase [person] (.toUpperCase person))

(?<- (stdout) [?PERSON]
     (person ?person)
     (toUpperCase ?person :> ?PERSON))
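Here an ordinary function is applied to each row and its result is bound to a new variable through `:>`. The plain-Python analogue is a function mapped over the rows. The `person` rows and the name `to_upper_case` are invented for the illustration.

```python
def to_upper_case(person):
    # Same job as the Clojure function: delegate to the string's own method.
    return person.upper()

# Hypothetical sample relation with a single column.
person = [("alice",), ("bob",)]

# (toUpperCase ?person :> ?PERSON): one output value per input row.
upper = [(to_upper_case(name),) for (name,) in person]

print(upper)
```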


A conclusion?

‘New’ datastores, ‘new’ kinds of querying
▶ Cascalog, RDF, Datomic, Neo4j ...

An affinity between the functional paradigm
▶ and data processing?
▶ And you? Cascalog, but also...


...

PIG



?

http://blog.xebia.fr/author/bdechoux/

@BertrandDechoux

