CS378 – Big Data Programming
Lecture 14 Join Patterns
CS378 - Fall 2016 Big Data Programming 1
Review
• Assignment 6 – Reduce-side join
  – User session and impression data
• Questions/issues?
• Review: info in syslog
• AvroMultipleInputs
Join Patterns
• Review: Suppose we want to join many sources, only one of which is large
  – User sessions (large)
  – Map from cities to DMA (demographic marketing area)
  – …
• This is called a replicated join
  – All the small files will be replicated to all machines
Replicated Join
• Can be done completely in mappers
  – No need for sort, shuffle, or reduce
  – Files are replicated with DistributedCache
• Restrictions:
  – All but one of the inputs must fit in memory
  – Can only accomplish an inner join, or
  – A left outer join where the large data source is the “left” part
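
The replicated join described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the idea, not the course's or Hadoop's implementation: the class name `ReplicatedJoinSketch`, the record layout `{city, sessionId}`, and the DMA code `DMA-1` are all hypothetical. The small data set is held in an in-memory map (standing in for a file distributed through DistributedCache), and each large-side record is joined with a single lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of a replicated (map-side) join; no Hadoop classes are used. */
final class ReplicatedJoinSketch {

    /**
     * Joins large-side records against a small in-memory lookup table.
     * Each large-side record is {city, sessionId} (a made-up layout);
     * each output row is "sessionId,city,dma".
     */
    static List<String> join(List<String[]> sessions,
                             Map<String, String> cityToDma,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] session : sessions) {
            String city = session[0];
            String dma = cityToDma.get(city); // in-memory lookup, no shuffle needed
            if (dma != null) {
                out.add(session[1] + "," + city + "," + dma);
            } else if (leftOuter) {
                // left outer join: keep the large-side record with a null match
                out.add(session[1] + "," + city + ",null");
            }
            // inner join: unmatched large-side records are dropped
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cityToDma = new HashMap<>();
        cityToDma.put("Austin", "DMA-1"); // hypothetical DMA code
        List<String[]> sessions = new ArrayList<>();
        sessions.add(new String[] {"Austin", "s1"});
        sessions.add(new String[] {"Waco", "s2"});
        System.out.println(join(sessions, cityToDma, false)); // inner join
        System.out.println(join(sessions, cityToDma, true));  // left outer join
    }
}
```

The sketch also suggests why the large data source has to be the “left” side: each mapper sees only its own slice of the large input, so it cannot tell which small-side keys were never matched anywhere.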
Replicated Join - Data Flow
Figure 5-2 from MapReduce Design Patterns
Join Patterns
• OK, so replicated join was interesting, but more than one of my data sources is large.
• Is there a way to do a map-side join in this case?
• Or is reduce-side join my only option?
• If we organize the input data in a specific way, we can do this on the map side.
Composite Join
• Hadoop class CompositeInputFormat
• Restricted to inner, or full outer join
• Input data sets must have the same # of partitions
  – Each input partition must be sorted by key
  – All records for a particular key must be in the same partition
• Seems pretty restrictive…
Composite Join
• These conditions might exist for data from other MapReduce jobs where:
• The jobs had the same # of reducers
  – Recall that input data sets must be partitioned in the same way
• The jobs had the same foreign key
• Output files aren’t splittable
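
To see why the same number of reducers matters, here is a small plain-Java sketch. It assumes both upstream jobs used hash partitioning of the form `(hash & Integer.MAX_VALUE) % numReducers`, which is the formula Hadoop's default HashPartitioner applies. `String.hashCode()` stands in for the hash of a Hadoop key type, and the class name `PartitionAlignmentSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of hash partitioning; String.hashCode() stands in for a Hadoop key's hash. */
final class PartitionAlignmentSketch {

    /** Partition index for a key, using the hash-modulo formula. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Assigns each key to a partition, giving one key list per reducer. */
    static List<List<String>> partition(List<String> keys, int numReducers) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numReducers)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "e");
        // Same reducer count in both jobs: a key lands in the same partition index.
        System.out.println(partition(keys, 4));
        // Different reducer count: the same key can land in a different index.
        System.out.println(partitionFor("c", 4) + " vs " + partitionFor("c", 3));
    }
}
```

With equal reducer counts and the same key, partition i of one data set holds exactly the keys that partition i of the other could hold, so the two can be paired up file by file. With different counts that correspondence is lost.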
Composite Join
• If all those conditions are true, this join works
  – Map-side only, so it’s efficient if we can use it.
• If you find that you are preparing and formatting the data only to be able to use composite join
• It’s probably not worth it.
• Just use a reduce-side join.
Composite Join – Data
Composite Join – Data Flow
Composite Join Input
• In the driver code (run() method)
  – Get the file names from the command line
  – Specify the input format, join type, and files
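    // conf is assumed to be the job's JobConf (old org.apache.hadoop.mapred API)
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" selects the join type; file1 and file2 are the input paths
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, file1, file2));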
Composite Join Input
• How might this implement inner join?
  – Outer join?
• Could we do any other join type?
  – Left outer? Anti-join?
• Output: TupleWritable
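
One way to think about the questions above is a merge over two key-sorted inputs. The sketch below is plain Java and does not claim to show how CompositeInputFormat is implemented internally. The class name `SortedMergeJoinSketch` and the output format `key,[left,right]` (a rough stand-in for a TupleWritable) are hypothetical, and it assumes each key appears at most once per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of a merge join over two key-sorted inputs. */
final class SortedMergeJoinSketch {

    /**
     * Merges two lists of {key, value} records, each sorted by key and with
     * at most one record per key. Output rows are "key,[left,right]".
     * With outer == false only matching keys are emitted (inner join);
     * with outer == true unmatched keys from either side are also emitted.
     */
    static List<String> join(List<String[]> left, List<String[]> right, boolean outer) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp == 0) {            // key present on both sides
                out.add(row(left.get(i)[0], left.get(i)[1], right.get(j)[1]));
                i++;
                j++;
            } else if (cmp < 0) {      // key only on the left side
                if (outer) out.add(row(left.get(i)[0], left.get(i)[1], null));
                i++;
            } else {                   // key only on the right side
                if (outer) out.add(row(right.get(j)[0], null, right.get(j)[1]));
                j++;
            }
        }
        if (outer) {                   // drain whichever side still has records
            for (; i < left.size(); i++) out.add(row(left.get(i)[0], left.get(i)[1], null));
            for (; j < right.size(); j++) out.add(row(right.get(j)[0], null, right.get(j)[1]));
        }
        return out;
    }

    private static String row(String key, String leftValue, String rightValue) {
        return key + ",[" + leftValue + "," + rightValue + "]";
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"d", "4"});
        List<String[]> right = Arrays.asList(
            new String[] {"b", "X"}, new String[] {"c", "Y"}, new String[] {"d", "Z"});
        System.out.println(join(left, right, false)); // inner join
        System.out.println(join(left, right, true));  // full outer join
    }
}
```

In this sketch, a left outer join or an anti-join would only change which of the three branches emit a row, which is one way to approach the second question on the slide.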
One More Join Pattern
• Suppose we wanted to compare all cars currently available (for sale) to all other cars
  – To identify “similar” cars
  – Usage: “I like this car, show me others like it”
• This join is called “Cartesian Product”
  – Comparing N items to M items requires N x M comparisons
  – Not straightforward to do with map-reduce
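Cartesian Product
• Pairs every record with every other record
  – No keys needed
  – N x M results, for data sets of size N, M
• Map-only job
• But still expensive to compute
• Input format class: CartesianInputFormat (a custom class from the MapReduce Design Patterns examples, not shipped with Hadoop)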
Cartesian Product
• To accomplish this join, we’ll need to pair every record with every other record
• We can start with the approach for composite join
• For composite join, each mapper read two files
  – They had the same key set
  – The data was sorted by key
  – We don’t care about the keys, just the ‘two file input’
Composite Join – Data Flow
One Mapper, Two Inputs
• For composite join, the key order allowed us to:
  – Read each of the two files only once
  – Work very much like merge sort
• For Cartesian product
  – For each record in data set 1
  – We’ll read every record in data set 2
  – This pair of records is passed to the mapper
• We’d accomplish this with a custom input format
  – RecordReader resets data set 2 for each input of data set 1
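
The reset behaviour described above can be sketched in plain Java as a small pair-producing reader. This is an illustration under stated assumptions, not the actual RecordReader of any Hadoop input format: the class name `CartesianReaderSketch` and its `nextPair()` method are hypothetical, and in-memory lists stand in for the two input splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of a reader that pairs every record of data set 1 with every record of data set 2. */
final class CartesianReaderSketch {
    private final List<String> dataSet1;
    private final List<String> dataSet2;
    private int i = 0; // current position in data set 1
    private int j = 0; // next position to read in data set 2

    CartesianReaderSketch(List<String> dataSet1, List<String> dataSet2) {
        this.dataSet1 = dataSet1;
        this.dataSet2 = dataSet2;
    }

    /** Returns the next {left, right} pair, or null once all pairs have been produced. */
    String[] nextPair() {
        if (dataSet2.isEmpty()) {
            return null; // nothing to pair with
        }
        if (j == dataSet2.size()) {
            i++;   // move to the next record of data set 1
            j = 0; // "reset": start reading data set 2 from the beginning again
        }
        if (i >= dataSet1.size()) {
            return null; // data set 1 exhausted
        }
        return new String[] {dataSet1.get(i), dataSet2.get(j++)};
    }

    public static void main(String[] args) {
        CartesianReaderSketch reader = new CartesianReaderSketch(
            Arrays.asList("car1", "car2"), Arrays.asList("carA", "carB", "carC"));
        List<String> pairs = new ArrayList<>();
        for (String[] p = reader.nextPair(); p != null; p = reader.nextPair()) {
            pairs.add(p[0] + "-" + p[1]); // each pair is what a mapper would receive
        }
        System.out.println(pairs.size() + " pairs: " + pairs);
    }
}
```

Data set 2 is read once per record of data set 1, so the reader produces N x M pairs and does N passes over data set 2. That repeated reading is why the job stays expensive even though it is map-only.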
Cartesian Product – Data Flow