osdc.fr 2012 :: Cascalog: logic programming for Hadoop
DESCRIPTION
Hadoop has become a reference in the Big Data world, and MapReduce a new paradigm for exploiting data. Implementing data processing directly with MapReduce certainly gives the most flexibility, but it amounts to writing assembly. Cascalog is arguably the most concise alternative. Built on Clojure, it keeps you in a familiar environment (the JVM) while providing a very useful abstraction through logic programming.
TRANSCRIPT
Cascalog: logic programming for Hadoop
Bertrand Dechoux, October 13, 2012
Saturday, October 13, 2012
MapReduce: what about you?
Python
▶ map(function, iterable, ...)
▶ reduce(function, iterable[, initializer])
Perl
▶ map BLOCK LIST
▶ reduce BLOCK LIST
Ruby
▶ map {|item| block} -> new_ary / collect {|item| block} -> new_ary
▶ reduce(initial, sym) -> obj / inject(initial, sym) -> obj
Smalltalk
▶ collect:aBlock=TheArray
▶ inject: thisValue into: binaryBlock
PHP
▶ array array_map ( callable $callback, array $arr1 [, array $...])
▶ mixed array_reduce ( array $input, callable $function [, mixed $initial = NULL])
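As a small illustration of the two primitives listed above, here is a sketch in Python. The sample data is hypothetical, and note that in Python 3 `reduce` is imported from `functools` rather than being a builtin.

```python
from functools import reduce  # in Python 3, reduce lives in functools

words = ["hadoop", "cascalog", "clojure"]  # hypothetical sample data

# map: apply a function to every element
lengths = list(map(len, words))

# reduce: fold the elements into a single value, starting from 0
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```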
Hadoop MapReduce: the theory
Map
▶ Map(k1,v1) -> list(k2,v2)
Reduce
▶ Reduce(k2, list(v2)) -> list(k3,v3)
Hadoop MapReduce: the theory
Map
▶ Map(k1,v1) -> list(k2,v2)
▶ SortByKey(list(k2,v2)) -> list(k2,v2)
Reduce
▶ MergeByKey(list,list,...) -> list(k2,list(v2))
▶ Reduce(k2, list(v2)) -> list(k3,v3)
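The four steps above can be mimicked in a single process. The sketch below is only a toy model in Python, not Hadoop code: the names `map_fn`, `reduce_fn` and `run` and the input lines are hypothetical, and a real cluster sorts on each mapper and merges on the reducer side instead of sorting one list.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(_key, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(k3, v3): sum the counts of one word
    return [(word, sum(counts))]


def run(records):
    mapped = [pair for k, v in records for pair in map_fn(k, v)]
    # SortByKey: order the intermediate pairs by key
    mapped.sort(key=itemgetter(0))
    # MergeByKey: gather all values that share a key
    grouped = [(k, [v for _, v in group])
               for k, group in groupby(mapped, key=itemgetter(0))]
    return [out for k, values in grouped for out in reduce_fn(k, values)]


lines = [(0, "to be or not"), (1, "to be")]  # hypothetical input
result = run(lines)
print(result)
```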
Hadoop MapReduce: in practice
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class WordCount {

  // Mapper: emits (word, 1) for every token of each input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures the job and waits for it to finish.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Cascading: necessary abstractions
Cascading: ‘field algebra’?!
Cascalog: logic programming for Hadoop
(my-predicate ?var1 42 ?var3 :> ?var4 ?var5)
Cascalog: select ... from ...
(?<- (stdout) [?person] (person ?person))
(?<- (stdout) [?person ?age] (age ?person ?age))
(?<- (stdout) [?age] (age _ ?age))
(?<- (stdout) [?person] (age ?person 42))
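To make the reading of these four queries concrete, here is a rough Python analogy. It is not Cascalog and does not run on Hadoop; the `person` and `age` lists are hypothetical in-memory stand-ins for the generators used on the slide.

```python
# Hypothetical in-memory stand-ins for the `person` and `age` generators
person = [("alice",), ("bob",), ("carol",)]
age = [("alice", 28), ("bob", 42), ("carol", 35)]

# (person ?person): every person
all_people = [p for (p,) in person]

# (age ?person ?age): both columns
people_ages = [(p, a) for (p, a) in age]

# (age _ ?age): the underscore ignores the first column
ages = [a for (_, a) in age]

# (age ?person 42): a constant constrains the matching column
aged_42 = [p for (p, a) in age if a == 42]

print(all_people, people_ages, ages, aged_42)
```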
Cascalog: select ... from ... where
(?<- (stdout) [?person ?age]
     (age ?person ?age)
     (< ?age 30))
Cascalog: select ... as ... from ...
(?<- (stdout) [?person ?junior]
     (age ?person ?age)
     (< ?age 30 :> ?junior))
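The difference between a predicate used as a filter and the same predicate with an output variable (`:>`) can be sketched with a Python analogy. The `age` data is hypothetical and this is only an illustration of the semantics, not Cascalog itself.

```python
age = [("alice", 28), ("bob", 42), ("carol", 35)]  # hypothetical data

# (< ?age 30) with no output variable: the predicate acts as a filter
under_30 = [(p, a) for (p, a) in age if a < 30]

# (< ?age 30 :> ?junior): the result is bound to a new variable instead
junior_flags = [(p, a < 30) for (p, a) in age]

print(under_30, junior_flags)
```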
Cascalog: select count(*) from ... group by ...
(?<- (stdout) [?count]
     (age _ _)
     (c/count ?count))
Cascalog: select count(*) from ... group by ...
(?<- (stdout) [?junior ?count]
     (age _ ?age)
     (< ?age 30 :> ?junior)
     (c/count ?count))
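In these count queries the grouping is implicit: the non-aggregated output variables (here `?junior`) act as the grouping key. A Python analogy over hypothetical data, using `collections.Counter`, might look like this:

```python
from collections import Counter

age = [("alice", 28), ("bob", 42), ("carol", 35)]  # hypothetical data

# (c/count ?count) with no other output variable: one global count
total = len(age)

# ?junior is an output variable next to the aggregate, so it is the group key
by_junior = Counter(a < 30 for (_, a) in age)

print(total, dict(by_junior))
```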
Cascalog: select ... from ... join ...
(?<- (stdout) [?person ?age ?gender]
     (age ?person ?age)
     (gender ?person ?gender))
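The join is expressed simply by reusing `?person` in both predicates. As a Python analogy (hypothetical data, and assuming at most one gender row per person), this behaves like an inner join on that shared column:

```python
age = [("alice", 28), ("bob", 42), ("carol", 35)]        # hypothetical data
gender = [("alice", "f"), ("bob", "m"), ("dave", "m")]   # hypothetical data

# Index one side by the shared variable (assumes one row per person)
gender_by_person = dict(gender)

# Keep only the people present on both sides, like an inner join
joined = [(p, a, gender_by_person[p])
          for (p, a) in age if p in gender_by_person]

print(joined)
```

Here carol and dave drop out because each appears in only one of the two datasets.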
Cascalog: select ... from ... (select ...)
(let [many-follows (<- [?person]
                       (follows ?person _)
                       (c/count ?count)
                       (> ?count 2))]
  (?<- (stdout) [?personA ?personB]
       (many-follows ?personA)
       (many-follows ?personB)
       (follows ?personA ?personB)))
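The subquery first selects people who follow more than two others, and the outer query then keeps the follow relations where both ends belong to that set. A Python analogy, with a hypothetical `follows` dataset, could be sketched as:

```python
from collections import Counter

# Hypothetical (follower, followed) pairs
follows = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "alice"), ("bob", "carol"), ("bob", "dave"),
    ("carol", "alice"),
]

# Subquery: count follows per person and keep those with more than 2
out_degree = Counter(a for (a, _) in follows)
many_follows = {p for p, n in out_degree.items() if n > 2}

# Outer query: A follows B, and both A and B are in many_follows
pairs = [(a, b) for (a, b) in follows
         if a in many_follows and b in many_follows]

print(sorted(many_follows), pairs)
```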
Cascalog: define your own functions
(defn toUpperCase [person] (.toUpperCase person))
(?<- (stdout) [?PERSON]
     (person ?person)
     (toUpperCase ?person :> ?PERSON))
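The same idea, an ordinary function applied to each tuple with its result bound to a new variable, can be mirrored in Python. The function name `to_upper_case` and the data are hypothetical and serve only as an analogy.

```python
def to_upper_case(person):
    # Plain function, applied once per tuple
    return person.upper()


person = [("alice",), ("bob",)]  # hypothetical data

# (toUpperCase ?person :> ?PERSON): bind the function result to a new variable
upper_people = [to_upper_case(p) for (p,) in person]

print(upper_people)
```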
A conclusion?
‘New’ datastores, ‘new’ kinds of querying
▶ Cascalog, RDF, Datomic, Neo4j ...
An affinity between the functional paradigm
▶ and data processing?
▶ And you? Cascalog, but also...
...
PIG
?
http://blog.xebia.fr/author/bdechoux/
@BertrandDechoux