data-intensive programming mapreducedip/slides/slides2.pdf · 2016-09-09 · public class mymapper...

27
Data-intensive programming MapReduce Timo Aaltonen Department of Pervasive Computing

Upload: others

Post on 24-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Data-intensiveprogrammingMapReduce

TimoAaltonenDepartmentofPervasiveComputing

Page 2: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Outline

• MapReduce

Page 3: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

OriginalMapandReduce

• map(foo,i)– Applyfunctionfootoeveryitemofi andreturnalistoftheresults

– Example:map(square,[1,2,3,4])=[1,4,9,16]• reduce(bar,i)– Applytwoargumentfunctionbar

• cumulativelytotheitemsofi,• fromlefttoright• toreducethevaluesini toasinglevalue.

– Example:reduce(sum,[1,4,9])=(((1+4)+9)+16)=30

Page 4: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce

• MapReduce isaprogramming model fordistributed processing oflarge datasets

• Scales nearly linearly– Twice asmany nodes ->twice asfast– Achieved by exploiting datalocality• Computation ismoved close tothe data

• Simple programming model– Programmer only needs towrite two functions:Map andReduce

Page 5: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Map &Reduce

• Mapmaps inputdatato(key,value)pairs• Reduce processes thelist ofvalues foragivenkey

• TheMapReduce framework (such asHadoop)takes care oftherest– Distributes thejob among nodes–Moves thedatatoandfrom thenodes– Handles node failures– etc.

Page 6: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Sheenaisa

punkrockerSheena

isa

punkrockernowWellshe’sa

punkpunk

MAP

Sheena 1is 1a 1

punk 1

rocker 1

rockerSheena

isa

punk

nowWellshe’sa

11111

1111

punk 1punk 1

1

Sheena

is

1

a

1

SHUFFLE

Sheena 21

REDUCE

is 23

a 31

punk

1punk 37

Node 1

Node 2

Node 3

Node D

Node C

Node B

Node A

Page 7: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce

MAPSheena Sheena 1

SHUFFLE REDUCEis 1a 1

punk 1rockerSheena

is…

1111

isa

punkrockerSheena

is…

1

SheenaSheena 21

is

1

a

1

is 23

a 31

punk

1punk 37

Page 8: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce

MAP REDUCE

Map(k1,v1)→list(k2,v2) Reduce(k2,list(v2))→list(v3)

Page 9: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Map &Reduce inHadoop

• InHadoop,Map andReduce functions can bewritten in– Java• org.apache.hadoop.mapreduce.lib

– C++using Hadoop Pipes– any language,using Hadoop Streaming

• Also anumber ofthird partyprogrammingframeworks forHadoop MapReduce– ForJava,Scala,Python,Ruby,PHP,…

Page 10: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Mapper Javaexample

public class WordCount {public class MyMapper extends

Mapper<LongWritable, Text, Text, IntWritable> {

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String txt = value.toString(); String[] items = txt.split("\\s"); for(int i = 0; i < items.length; i++){

word = new Text( items[i]);one = new IntWritable(1);context.write(word, one);

}}

Inputkey andvalue types Outputkey andvalue types

• TheMapper inputtypes depend onthedefined InputFormat• Bydefault TextInputFormat

• Key(LongWritable):positioninthefile• Value(Text):theline

Page 11: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Reducer Javaexample

public class MyReducer extendsReducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context)throws IOException, InterruptedException {

int sum = 0;for (IntWritable value : values) {sum = sum + value.get();

}context.write(key, new IntWritable(sum));

}}

Inputkey andvalue types Outputkey andvalue types

Page 12: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Jobofthe MapReduce example• The maindriver program configures theMapReduce job andsubmits ittothe Hadoop YARNcluster:

public static void main( String[] args) throws Exception {…Job job = new Job();job.setJarByClass(WordCount.class);job.setJobName(”Word Count");

job.setMapperClass(MyMapper.class);job.setReducerClass(MyReducer.class);

Page 13: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Inputkey andvalue types

Outputkey andvalue types

• Inputandoutputvalue classes are defined:

job.setMapOutputKeyClass(Text.class);job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);job.setOutputValueClass(IntWritable.class);

Page 14: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Jobofthe MapReduce example

• FileInputFormat feeds the inputsplits tomapinstances from wc-files directoryFileInputFormat.addInputPath(job, new Path(”wc-files"));

• The result iswritten towc-outputdirectory:FileInputFormat.setOutputPath(job, new Path(”wc-output"));

• The job waits until Hadoop runtime environmenthas executed the program:job.waitForCompletion(true);

} // main} // class

Page 15: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce WordCount Demo

• Buildingtheprogram:%mkdir wordcount_classes%javac -classpath `hadoop classpath`\-dwc_classes WordCount.java%jar-cvf wordcount.jar -Cwordcount_classes/.

• SendWordCount toYARN:%hadoop jarwordcount.jar fi.tut.WordCount /dataoutput

Page 16: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduceWordCount Demo

• Theresult:%hdfs dfs -lsoutputFound2items-rw-r--r-- 1hduser supergroup 02016-09-0811:46output/_SUCCESS-rw-r--r-- 1hduser supergroup 4709232016-09-0811:46output/part-r-00000%hdfs dfs -getoutput/part-r-00000%tail-10part-r-00000zoology, 2zu 2À 2à 4çela, 1…

Page 17: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

FaultToleranceandSpeculativeExecution

• Faultsarehandledbyrestartingtasks• Allmanagedinthebackground• Noneedtomanageside-effectsorprocessstate

• Speculativeexecutionhelpspreventbottlenecks

Page 18: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Combiners

AA

11

A 1B 1

Map node 1

A

Reduce node forkey A

A111

A 3

Combiner

Page 19: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Combiners

• Combiner can ”compress”dataonamapper nodebefore sending it forward

• Combiner input/outputtypes must equal themapper outputtypes

• InHadoop Java,Combiners use theReducerinterface

job.setCombinerClass(MyReducer.class);

Page 20: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Reducer asaCombiner• Reducer can be used asaCombiner if it iscommutative andassociative– Eg.max is

• max(1,2,max(3,4,5))=max(max(2,4),max(1,5,3))

• true forany order offunction applications…– Eg.avg isnot

• avg(1,2,avg(3,4,5))=2.33333≠avg(avg(2,4),avg(1,5,3))=3

• Note:if Reducer isnot c&a,Combiners can still be used– TheCombiner justhas tobe different from theReduceranddesigned forthespecific case

Page 21: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

AddingaCombinertoWordCount

walk,1run,1walk,1

run,1walk,2

Map

Combiner

Shuffle

Page 22: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Hadoop Streaming

• Map andReduce functions can be implementedinany language withtheHadoop Streaming API

• Inputisread from standard input• Outputiswritten tostandard output• Input/outputitems are lines oftheformkey\tvalue– \t isthetabulator character

• Reducer inputlines are grouped by key– Onereducer instance may receive multiple keys

Page 23: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Run Hadoop Streaming

• Debug using Unixpipes:cat sample.txt | ./mapper.py | sort | ./reducer.py

• OnHadoop:hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \-input sample.txt \-output output \-mapper ./mapper.py \-reducer ./reducer.py

Page 24: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce Examples• Thesearefromhttps://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

• CountingandSumming– WordCount

• Filtering(“Grepping”),Parsing,andValidation– Problem:Thereisasetofrecordsanditisrequiredtocollectallrecordsthatmeetsomeconditionortransformeachrecord(independentlyfromotherrecords)intoanotherrepresentation

– Solution:Mappertakesrecordsonebyoneandemitsaccepteditemsortheirtransformedversions

Page 25: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

MapReduce Examples• Collating– ProblemStatement:Thereisasetofitemsandsomefunctionofoneitem.Itisrequiredtosaveallitemsthathavethesamevalueoffunctionintoonefileorperformsomeothercomputationthatrequiresallsuchitemstobeprocessedasagroup.Themosttypicalexampleisbuildingofinvertedindexes.

– Solution:Mappercomputesagivenfunctionforeachitemandemitsvalueofthefunctionasakeyanditemitselfasavalue.Reducerobtainsallitemsgroupedbyfunctionvalueandprocessorsavethem.Incaseofinvertedindexes,itemsareterms(words)andfunctionisadocumentIDwherethetermwasfound.

Page 26: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Index

Thispagecontainstext

Mypagecontainstoo

PageA

PageB

This,Apage,Acontains,Atext,A

My,Bpage,Bcontains,Btoo,B

This: Apage:A,Bcontains:A,Btext:Atoo:B

Reducedoutput

Page 27: Data-intensive programming MapReducedip/slides/slides2.pdf · 2016-09-09 · public class MyMapper extends Mapper

Conclusions

• Maptasksspreadouttheload(thelogic)– mayhavehundredsormillionsmappers– fastforlargeamountsofdata