CS 378 – Big Data Programming
Lecture 14: Join Patterns
CS 378 - Fall 2016



Review

•  Assignment 6: Reduce-side join
    –  User session and impression data
•  Questions/issues?
•  Review: info in syslog
•  AvroMultipleInputs


Join Patterns

•  Review: Suppose we want to join many sources, only one of which is large
    –  User sessions (large)
    –  Map from cities to DMA (demographic marketing area)
    –  …
•  This is called a replicated join
    –  All the small files will be replicated to all machines


Replicated Join

•  Can be done completely in mappers
    –  No need for sort, shuffle, or reduce
    –  Files are replicated with DistributedCache
•  Restrictions:
    –  All but one of the inputs must fit in memory
    –  Can only accomplish an inner join, or
    –  A left outer join where the large data source is the "left" part
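The map-side logic on this slide can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the small table is taken to be already loaded into a HashMap (in a real job this would typically happen once per mapper, from the file copy provided by DistributedCache), and the class name ReplicatedJoinSketch, the record layout, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch of replicated (map-side) join logic.
class ReplicatedJoinSketch {

    // sessions: large-side records as {userId, city} (assumed layout).
    // cityToDma: the small table, held entirely in memory.
    // leftOuter = false gives an inner join; true keeps unmatched large-side records.
    static List<String> join(List<String[]> sessions,
                             Map<String, String> cityToDma,
                             boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] session : sessions) {          // stands in for one map() call per record
            String dma = cityToDma.get(session[1]);  // in-memory lookup, no shuffle needed
            if (dma != null) {
                out.add(session[0] + "," + session[1] + "," + dma);
            } else if (leftOuter) {
                out.add(session[0] + "," + session[1] + ",null");
            }
            // Inner join: a record with no match is simply dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cityToDma = new HashMap<>();
        cityToDma.put("Austin", "DMA-A");            // made-up sample values
        List<String[]> sessions = Arrays.asList(
                new String[] {"u1", "Austin"},
                new String[] {"u2", "Waco"});
        System.out.println(join(sessions, cityToDma, false));
        System.out.println(join(sessions, cityToDma, true));
    }
}
```

The restriction to inner and left outer joins follows from the loop: each mapper sees only its own slice of the large input, so it cannot tell which small-table entries were never matched anywhere.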


Replicated Join - Data Flow

Figure 5-2 from MapReduce Design Patterns


Join Patterns

•  OK, so replicated join was interesting, but more than one of my data sources is large.
•  Is there a way to do a map-side join in this case?
•  Or is reduce-side join my only option?
•  If we organize the input data in a specific way,
•  We can do this on the map-side.


Composite Join

•  Hadoop class CompositeInputFormat
•  Restricted to inner, or full outer join
•  Input datasets must have the same # of partitions
    –  Each input partition must be sorted by key
    –  All records for a particular key must be in the same partition
•  Seems pretty restrictive…


Composite Join

•  These conditions might exist for data from other MapReduce jobs where:
•  The jobs had the same # of reducers
    –  Recall that input datasets must be partitioned in the same way
•  The jobs had the same foreign key
•  Output files aren't splittable
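A short sketch may help show why the number of reducers matters. Hadoop's default HashPartitioner assigns a key to reducer (hashCode & Integer.MAX_VALUE) % numReduceTasks, so two jobs that emit the same key line up partition by partition only when they use the same reducer count. The class name PartitionAlignmentSketch and the sample keys are hypothetical, and String.hashCode() stands in here for the hash of the real key type.

```java
// Hypothetical sketch: which output partition a key lands in under hash partitioning.
class PartitionAlignmentSketch {

    // Same formula as Hadoop's default HashPartitioner; String.hashCode() is an
    // assumption standing in for the hashCode of the job's actual key class.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"user1", "user2", "user3"};   // made-up join keys
        for (String key : keys) {
            System.out.println(key
                    + ": with 4 reducers -> partition " + partitionFor(key, 4)
                    + ", with 5 reducers -> partition " + partitionFor(key, 5));
        }
    }
}
```

With equal reducer counts, partition i of one job's output holds exactly the keys found in partition i of the other's, and each reducer's output is already sorted by key, which matches the composite join requirements listed earlier.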


Composite Join

•  If all those conditions are true, this join works
    –  Map-side only, so it's efficient if we can use it.
•  If you find that you are preparing and formatting the data only to be able to use composite join
•  It's probably not worth it.
•  Just use a reduce-side join.


Composite Join – Data


Composite Join – Data Flow


Composite Join Input

•  In the driver code (run() method)
    –  Get the file names from the command line
    –  Specify the input format, join type, and files

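    // conf is a JobConf (old org.apache.hadoop.mapred API)
    conf.setInputFormat(CompositeInputFormat.class);
    // Join expression: join type, input format of each source, and the input files
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, file1, file2));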


Composite Join Input

•  How might this implement inner join?
    –  Outer join?
•  Could we do any other join type?
    –  Left outer? Anti-join?
•  Output: TupleWritable
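One way to think about the questions above is as a merge over two key-sorted inputs. The following plain-Java sketch is not the CompositeInputFormat implementation; it is a simplified illustration that assumes each input holds at most one record per key, and the class name MergeJoinSketch and the "key=[left,right]" output format are hypothetical stand-ins for a key and its TupleWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: inner and full outer join of two inputs sorted by key.
class MergeJoinSketch {

    // left and right hold {key, value} pairs sorted by key, one record per key (assumption).
    // outer = false gives an inner join; true gives a full outer join.
    static List<String> join(List<String[]> left, List<String[]> right, boolean outer) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i);
            String[] r = right.get(j);
            int cmp = l[0].compareTo(r[0]);
            if (cmp == 0) {                       // key present on both sides
                out.add(l[0] + "=[" + l[1] + "," + r[1] + "]");
                i++;
                j++;
            } else if (cmp < 0) {                 // key only in the left input
                if (outer) out.add(l[0] + "=[" + l[1] + ",null]");
                i++;
            } else {                              // key only in the right input
                if (outer) out.add(r[0] + "=[null," + r[1] + "]");
                j++;
            }
        }
        if (outer) {                              // whatever remains has no partner
            for (; i < left.size(); i++) out.add(left.get(i)[0] + "=[" + left.get(i)[1] + ",null]");
            for (; j < right.size(); j++) out.add(right.get(j)[0] + "=[null," + right.get(j)[1] + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"d", "4"});
        List<String[]> right = Arrays.asList(
                new String[] {"b", "x"}, new String[] {"c", "y"}, new String[] {"d", "z"});
        System.out.println(join(left, right, false));
        System.out.println(join(left, right, true));
    }
}
```

Because both inputs are sorted, each is read once from start to finish. In this sketch a left outer join or an anti-join would only change which of the three branches emit a record.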


One More Join Pattern

•  Suppose we wanted to compare all cars currently available (for sale) to all other cars
    –  To identify "similar" cars
    –  Usage: "I like this car, show me others like it"
•  This join is called "Cartesian Product"
    –  Comparing N items to M items requires N x M comparisons
    –  Not straightforward to do with map-reduce


Cartesian Product

•  Pairs every record with every other record
    –  No keys needed
    –  N x M results, for datasets of size N, M
•  Map-only job
•  But still expensive to compute
•  Hadoop class: CartesianInputFormat


Cartesian Product

•  To accomplish this join, we'll need to pair every record with every other record
•  We can start with the approach for composite join
•  For composite join, each mapper read two files
    –  They had the same key set
    –  The data was sorted by key
    –  We don't care about the keys, just the 'two file input'


Composite Join – Data Flow


One Mapper, Two Inputs

•  For composite join, the key order allowed us to:
    –  Read each of the two files only once
    –  Work very much like merge sort
•  For Cartesian product
    –  For each record in dataset 1
    –  We'll read every record in dataset 2
    –  This pair of records is passed to the mapper
•  We'd accomplish this with a custom input format
    –  RecordReader resets dataset 2 for each input of dataset 1
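The reset behavior described above can be sketched with a plain Java iterator. This is not a Hadoop RecordReader. The class name CartesianPairReaderSketch is hypothetical, in-memory lists stand in for the two input files, and a Supplier that returns a fresh iterator stands in for re-opening dataset 2. Records are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical sketch of a reader that pairs each record of dataset 1
// with every record of dataset 2, resetting dataset 2 each time.
class CartesianPairReaderSketch implements Iterator<String[]> {
    private final Iterator<String> dataset1;
    private final Supplier<Iterator<String>> reopenDataset2;
    private Iterator<String> dataset2;   // current pass over dataset 2
    private String current1;             // current record of dataset 1

    CartesianPairReaderSketch(Iterator<String> dataset1,
                              Supplier<Iterator<String>> reopenDataset2) {
        this.dataset1 = dataset1;
        this.reopenDataset2 = reopenDataset2;
    }

    @Override
    public boolean hasNext() {
        // Move to the next dataset 1 record whenever dataset 2 is exhausted.
        while (current1 == null || !dataset2.hasNext()) {
            if (!dataset1.hasNext()) {
                return false;
            }
            current1 = dataset1.next();
            dataset2 = reopenDataset2.get();   // reset dataset 2 to its start
        }
        return true;
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return new String[] {current1, dataset2.next()};   // the pair handed to the mapper
    }

    // Convenience helper: collect all pairs as "left|right" strings.
    static List<String> allPairs(List<String> first, List<String> second) {
        List<String> pairs = new ArrayList<>();
        Iterator<String[]> reader =
                new CartesianPairReaderSketch(first.iterator(), second::iterator);
        while (reader.hasNext()) {
            String[] pair = reader.next();
            pairs.add(pair[0] + "|" + pair[1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(allPairs(Arrays.asList("car1", "car2"),
                                    Arrays.asList("carA", "carB", "carC")));
    }
}
```

Dataset 1 is read once, while dataset 2 is read once per record of dataset 1, which is where the N x M cost comes from.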


Cartesian Product – Data Flow
