Yelp dataset analysis


Upload: arjun-sehgal

Post on 15-Jan-2017


[AUTHORNAME] 2

YELP DATASET ANALYSIS REPORT

1. Summary of number of reviews by US City, by Categories

In order to analyze according to these given conditions, I have made use of two of the given datasets from the Yelp Academic Challenge Dataset, i.e. the Business and Reviews datasets. Both datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, as the schema of the datasets is highly nested with mixed data types. The .jar files for the various components of elephant-bird were loaded through the properties tab on the Pig Editor in the Hue web UI. The 3 .jar files were /user/cloudera/elephant-bird-core-4.13.jar, /user/cloudera/elephant-bird-hadoop-compat-4.13.jar and /user/cloudera/elephant-bird-pig-4.13.jar, available respectively at:

I. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-core/4.13
II. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-hadoop-compat/4.13
III. http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig/4.13

NOTE: The above 3 .jar files have been used in all of the questions.

Once the .json file for the dataset has been loaded as a map, it is stored in a generic variable. Then from that variable we generate the fields we require for analysis using the format name_of_map#'field_name' as field_name. As an example, if we loaded the business dataset with the map name business and we wish to generate the business_id field, then our syntax will look like business#'business_id' as business_id. In this case I have generated the fields categories, city, business_id, state, latitude and longitude from the business dataset and the fields business_id and review_id from the reviews dataset.

To obtain US cities, we first filter the businesses based on the edge coordinates of the USA mainland. However, as cities like Waterloo are still present, we then filter by state to remove the Canadian provinces of Ontario and Quebec. The two relations are then joined on their common field business_id in a new variable joined. Once they have been joined, I generated the city and categories for each of the records in joined. As the categories given in the business dataset are nested and each business can be classified under several different categories, I flattened the categories so that we can identify each category associated with the business individually. Once the categories have been flattened, I grouped the flattened variable by city and categories, so that we can see the results grouped accordingly. However, once we group by any field, the schema changes. So in order to extract the desired result, for each of the records in the grouped variable, I have flattened the group key back into city and categories and then generated the count of reviews associated with it. Finally, I have ordered the results by city, so that the final output shows the number of reviews for each business category within each city in the dataset. I then stored the final variable into a folder in HDFS using PigStorage, making it a tab-separated values file.

A few exceptions which I noted while analyzing the output are that a few records do not have any city mentioned in their city field, while in some records the same city has been specified differently, like 110. Las Vegas and Las Vegas. Such discrepancies can cause minor fluctuations while analyzing the output dataset.
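The flatten, group and count steps described above can be sketched outside Pig as follows. This is only an illustration in Python, not the script used for the report; the sample records and the function name count_reviews_by_city_and_category are made up for the example.

```python
from collections import Counter

# Hypothetical joined records: one entry per review, carrying the city and
# the category list of the reviewed business (sample values are made up).
joined = [
    {"city": "Las Vegas", "categories": ["Restaurants", "Pizza"]},
    {"city": "Las Vegas", "categories": ["Restaurants"]},
    {"city": "Phoenix", "categories": ["Automotive", "Auto Repair"]},
]


def count_reviews_by_city_and_category(records):
    """Flatten each record's categories, then count rows per (city, category)."""
    counts = Counter()
    for record in records:
        for category in record["categories"]:  # plays the role of FLATTEN(categories)
            counts[(record["city"], category)] += 1  # GROUP BY (city, category) + COUNT
    # Order by city; ties are broken by category so the output is deterministic.
    return sorted((city, cat, n) for (city, cat), n in counts.items())


for city, category, n in count_reviews_by_city_and_category(joined):
    print(f"{city}\t{category}\t{n}")
```

Because a review of a business with several categories is counted once per category, the per-category counts within a city can add up to more than the number of reviews for that city.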

Basic analysis of the number of reviews in Tableau suggests that the largest number of reviews has come from the city of Las Vegas, as shown by Figure 1, and that the largest number of reviews for any individual category is for the Restaurants category, as shown by Figure 2.

Figure 1


Figure 2

PIG SCRIPT:

-- Load the business dataset as a map and extract the required fields
A = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp:map[]);
business = FOREACH A GENERATE yelp#'categories' as categories, yelp#'business_id' as business_id, yelp#'city' as city, yelp#'state' as state, (float)yelp#'latitude' as latitude, (float)yelp#'longitude' as longitude;
-- Keep businesses inside the USA mainland bounding box, then drop Ontario and Quebec
coordinates_business = FILTER business BY (latitude < 49.384472) AND (latitude > 24.520833) AND (longitude < -66.950) AND (longitude > -124.766667);
us_business = FILTER coordinates_business BY NOT ((state matches '.*ON.*') OR (state matches '.*QC.*'));
businesses = FOREACH us_business GENERATE categories, business_id, city;
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review:map[]);
revie = FOREACH B GENERATE review#'business_id' as business_id, review#'review_id' as review_id;
joined = JOIN businesses by business_id, revie by business_id;
-- One row per (review, category), then count rows per (city, category)
flatting = FOREACH joined GENERATE city, FLATTEN(categories);
grouped = GROUP flatting by (city, categories);
results = FOREACH grouped GENERATE FLATTEN(group) AS (city, categories), COUNT(flatting);
finals = ORDER results by city;
STORE finals INTO './Q1' USING PigStorage('\t');

TRUNCATED OUTPUT:

	Magicians	3
	Event Planning & Services	3
110. Las Vegas	Automotive	12
110. Las Vegas	Auto Repair	12
Ahwatukee	Pet Boarding/Pet Sitting	10
Ahwatukee	Fitness & Instruction	20
Ahwatukee	Sewing & Alterations	20
Ahwatukee	Health & Medical	13
Ahwatukee	Hotels & Travel	14
Ahwatukee	Eyelash Service	4
Ahwatukee	Carpet Cleaning	4
Ahwatukee	Specialty Food	3
Ahwatukee	Local Services	30
Ahwatukee	Health Markets	3
Ahwatukee	Pediatricians	13
Ahwatukee	Home Services	6
Ahwatukee	Beauty & Spas	4
Ahwatukee	Truck Rental	6
Ahwatukee	Self Storage	6



2. Ranking of cities on the basis of stars in each category

In order to analyze according to these given conditions, I have made use of two of the given datasets from the Yelp Academic Challenge Dataset, i.e. the Business and Reviews datasets, as in the last example. Both datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, as the schema of the datasets is highly nested with mixed data types. The .jar files for the various components of elephant-bird were loaded through the properties tab on the Pig Editor in the Hue web UI. Once the .json file for the dataset has been loaded as a map, it is stored in a generic variable. Then from that variable we generate the fields we require for analysis using the format name_of_map#'field_name' as field_name. As an example, if we loaded the business dataset with the map name business and we wish to generate the categories field, then our syntax will look like business#'categories' as categories.

In this case I have generated the fields categories, city and business_id from the business dataset and the fields business_id and stars from the reviews dataset. As we stored all the data in the .json file as a map of key-value pairs, whenever we extract a number we have to typecast it by specifying a data type like int or float before we generate the field from the data loaded using the Twitter elephant-bird API. The two are then joined using their common field, i.e. business_id. Once they have been joined, I generated the city, stars and categories for each of the records in the joined variable. As the categories given in the business dataset are nested and each business can be classified under several different categories, I flattened the categories so that we can identify each category associated with the business individually. Once the categories have been flattened, I grouped the flattened variable by city and categories, so that we can see the results grouped accordingly. However, once we group by any field, the schema changes. So in order to extract the desired result, for each of the records in the grouped variable, I have flattened the group key back into city and categories, generated the average value of the stars within that group and renamed the calculated field as rankings. It should be noted that in order to access the stars field we have to mention the name of the variable in which the field stars exists. In this case we called the stars field using the syntax flatting.stars.

Just as noted in the last part, the problem of the same city appearing under different names also exists here, like Las Vegas and 110. Las Vegas. The mean rating has been found as 3.747, with the minimum and maximum rating values as 1.0 and 5.0. The median average rating is 3.758.
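As an illustration of the group-and-average step, the following Python sketch computes the average stars per (category, city) pair and orders the result by category and then by descending rating. It is not the script used for the report; the sample rows and the function name average_stars are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows after the join and flatten: (city, stars, category).
rows = [
    ("Phoenix", 5, "Accessories"),
    ("Phoenix", 4, "Accessories"),
    ("Madison", 4, "Accessories"),
    ("Las Vegas", 5, "ATV Rentals/Tours"),
]


def average_stars(rows):
    """Average stars per (category, city), ordered by category, then rating descending."""
    stars_by_group = defaultdict(list)
    for city, stars, category in rows:
        stars_by_group[(category, city)].append(stars)
    results = [
        (category, city, sum(values) / len(values))
        for (category, city), values in stars_by_group.items()
    ]
    # Same ordering idea as ORDER BY categories, rankings DESC.
    return sorted(results, key=lambda r: (r[0], -r[2]))


for category, city, ranking in average_stars(rows):
    print(f"{category}\t{city}\t{ranking}")
```

The string comparison is case-sensitive, which is why a category such as ATV Rentals/Tours sorts before Accessories.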

Figure 3 shows the number of cities grouped by the average rating of their businesses across all categories, using the ranges: Less than 1.5 Stars, 1.5-3 Stars, 3-4.5 Stars, Greater than 4.5 Stars. We can see that most of the cities have businesses in the 3-4.5 range.

Figure 4 similarly illustrates the number of categories grouped according to their average ratings, placed in the ranges: Not Good (less than 1.5 stars), Fair (1.5-3 stars), Good (3-4.5 stars), Excellent (above 4.5 stars). We can see that almost 40% of the categories have a Good rating, i.e. 3-4.5 stars.

Figure 3

Figure 4


3. Average Rank for Businesses within 5 miles of Carnegie Mellon University, Pittsburgh, PA

In order to analyze according to these given conditions, I have made use of two of the given datasets from the Yelp Academic Challenge Dataset, i.e. the Business and Reviews datasets, as in the last example. Both datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, as the schema of the datasets is highly nested with mixed data types. The .jar files for the various components of elephant-bird were loaded through the properties tab on the Pig Editor in the Hue web UI. Once the .json file for the dataset has been loaded as a map, it is stored in a generic variable. Then from that variable we generate the fields we require for analysis using the format name_of_map#'field_name' as field_name. As an example, if we loaded the business dataset with the map name business and we wish to generate the categories field, then our syntax will look like business#'categories' as categories.

In this case I have generated the fields categories, latitude, longitude and business_id from the business dataset and the fields business_id and stars from the reviews dataset. As we stored all the data in the .json file as a map of key-value pairs, whenever we extract a number we have to typecast it by specifying a data type like int or float before we generate the field from the data loaded using the Twitter elephant-bird API. In order to then obtain

PIG SCRIPT:

A = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp:map[]);
business = FOREACH A GENERATE yelp#'categories' as categories, yelp#'city' as city, yelp#'business_id' as business_id;
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review:map[]);
-- Numbers read from the map must be cast explicitly
revie = FOREACH B GENERATE review#'business_id' as business_id, (int)review#'stars' as stars;
joined = JOIN business by business_id, revie by business_id;
flatting = FOREACH joined GENERATE city, stars, FLATTEN(categories);
grouped = GROUP flatting by (categories, city);
-- Average review stars per (category, city)
results = FOREACH grouped GENERATE FLATTEN(group) AS (categories, city), AVG(flatting.stars) AS rankings;
outputer = ORDER results BY categories, rankings DESC;
STORE outputer INTO './Q2' USING PigStorage('\t');

TRUNCATED OUTPUT:

ATV Rentals/Tours	Sun City	5.0
ATV Rentals/Tours	Wickenburg	5.0
ATV Rentals/Tours	Las Vegas	4.829787234042553
ATV Rentals/Tours	Henderson	4.615384615384615
ATV Rentals/Tours	Phoenix	4.514285714285714
ATV Rentals/Tours	North Las Vegas	3.825
Accessories	North Las Vegas	5.0
Accessories	Cave Creek	5.0
Accessories	Middleton	4.5
Accessories	Verdun	4.333333333333333
Accessories	Madison	4.016666666666667
Accessories	Gilbert	4.0
Accessories	Pineville	4.0
Accessories	Phoenix	3.969924812030075
Accessories	Pittsburgh	3.9586206896551723
Accessories	Las Vegas	3.936688311688312
Accessories	Westmount	3.9166666666666665
Accessories	Fort Mill	3.888888888888889
Accessories	Champaign	3.8636363636363638
Accessories	Queen Creek	3.857142857142857
Accessories	Charlotte	3.8114285714285714
Accessories	Surprise	3.8
Accessories	Scottsdale	3.757798165137615
Accessories	Karlsruhe	3.75
Accessories	Peoria	3.7419354838709675
Accessories	Paradise Valley	3.7333333333333334
Accessories	Goodyear	3.6666666666666665
Accessories	Edinburgh	3.6125


the businesses within 5 miles of the specified location of Carnegie Mellon University, I specified the limits of latitude and longitude which have to be satisfied if these businesses are to be located within the given area. These limits were then used with the FILTER BY command, which gave the desired results. The two are then joined using their common field, i.e. business_id. Once they have been joined, I generated the stars and categories for each of the records in the joined variable. As the categories given in the business dataset are nested and each business can be classified under several different categories, I flattened the categories so that we can identify each category associated with the business individually. Once the categories have been flattened, I grouped the flattened variable by categories, so that we can see the results grouped accordingly. However, once we group by any field, the schema changes. So in order to extract the desired result, for each of the records in the grouped variable, I have flattened the group key back into categories, generated the average value of the stars within that group and renamed the calculated field as rankings. It should be noted that in order to access the stars field we have to mention the name of the variable in which the field stars exists. In this case we called the stars field using the syntax flattened.stars. The mean rating from the businesses in range has been found as 3.901, with the minimum and maximum rating values as 1.0 and 5.0. The median average rating is 3.971.
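The limits used in the Pig script for this question are the midpoint plus or minus 0.0833333 degrees in both latitude and longitude. As a rough check of how such limits relate to a distance in miles, the Python sketch below derives a square bounding box from a centre point and a distance. It assumes about 69 statute miles per degree of latitude and a spherical Earth, and the function names bounding_box and in_box are made up for the example; it is not how the limits in the report were obtained.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # assumption: approximate statute miles per degree of latitude


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) for a square around a point."""
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def in_box(lat, lon, box):
    """Same test as the FILTER BY condition: strictly inside all four limits."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat < lat < max_lat and min_lon < lon < max_lon


# Centre point taken as the midpoint of each pair of limits in the Pig script.
CENTER_LAT = (40.5245131 + 40.3578471) / 2
CENTER_LON = (-80.0261624 + -79.8594964) / 2

box = bounding_box(CENTER_LAT, CENTER_LON, 5)
print(box)
```

Under the same approximation, 0.0833333 degrees corresponds to roughly 5.75 miles north to south and roughly 4.4 miles east to west at this latitude, so the box in the script is only an approximation of a 5 mile distance.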

Figure 5

Figure 6


In Figure 5 above, I have represented only the starting portion of a figure which shows the rating for each category relative to the average rating for all categories within the region. The straight line in the figure represents the average value of ratings across all categories. In Figure 6, I have selected a few of the categories which would suit the area around a college campus. As expected, we can notice that businesses like boxing, educational stores, guitar stores, books and bike rentals have their ratings on the higher side, all being above 4. Such ratings are expected as the area is around a college and the majority of the crowd would be students. Also, among the categories we selected, businesses like rehabilitation, engraving and retirement homes do not have ratings as high as the other selected categories.

4. Reviewers Ranked by number of reviews & Category-wise analysis of Top 10 Reviewers

In order to analyze according to these given conditions, I have made use of three datasets from the Yelp Academic Challenge Dataset, i.e. the Business, Users and Reviews datasets. All the datasets were loaded in Pig using Twitter's elephant-bird JsonLoader, as the schema of the datasets is highly nested with mixed data types. The .jar files for the various components of elephant-bird were loaded through the properties tab on the Pig Editor in the Hue web UI. I have generated the fields review_count, name and user_id from the users dataset, the fields categories and business_id from the business dataset, and the fields business_id, user_id and stars from the reviews dataset. As we stored all the data in the .json file as a map of key-value pairs, we

PIG SCRIPT:

A = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp:map[]);
business = FOREACH A GENERATE yelp#'categories' as categories, yelp#'business_id' as business_id, (float)yelp#'latitude' as latitude, (float)yelp#'longitude' as longitude;
-- Keep only businesses inside the latitude/longitude limits around CMU
business_in_range = FILTER business BY (latitude < 40.5245131) AND (latitude > 40.3578471) AND (longitude > -80.0261624) AND (longitude < -79.8594964);
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review:map[]);
revie = FOREACH B GENERATE review#'business_id' as business_id, (int)review#'stars' as stars;
joined = JOIN business_in_range by business_id, revie by business_id;
flattened = FOREACH joined GENERATE stars, FLATTEN(categories);
grouped = GROUP flattened by categories;
-- Average review stars per category
results = FOREACH grouped GENERATE FLATTEN(group) AS categories, AVG(flattened.stars) AS rankings;
outputer = ORDER results BY categories;
STORE outputer INTO './Q3' USING PigStorage('\t');

TRUNCATED OUTPUT:

Accessories	3.9863013698630136
Accountants	5.0
Active Life	4.100411039342337
Acupuncture	4.241379310344827
Adult Education	5.0
Adult Entertainment	3.55
Advertising	4.125
African	4.625
Airport Shuttles	4.717948717948718
American (New)	3.761559037065342
American (Traditional)	3.5828891622249555
Amusement Parks	4.169014084507042
Animal Shelters	4.132075471698113
Antiques	4.291666666666667
Apartments	2.8396946564885495
Appliances	3.087719298245614
Appliances & Repair	3.533333333333333
Aquariums	3.9019607843137254
Arcades	3.3969465648854964
Argentine	4.697594501718213
Art Classes	4.627450980392157
Art Galleries	4.11344537815126
Art Schools	4.291666666666667
Art Supplies	4.146341463414634
Arts & Crafts	4.328638497652582
Arts & Entertainment	4.093375214163335
Asian Fusion	3.693535514764565


have to make sure that whenever we extract a number we typecast it by specifying a data type like int or float before we generate the field from the data loaded using the Twitter elephant-bird API.

The first part of the question involves finding the reviewers and sorting them by their number of reviews in descending order, keeping the user with the most reviews at the top. For this we only need one dataset, the users dataset. We load the data and generate the user_id and review_count. Further, we can simply use the ORDER BY command of Pig to sort them according to the field we wish. To sort them in descending fashion, we use the DESC option along with ORDER BY.

For the second part, I have ordered the users by their review count and limited the set to only the 10 users with the highest number of reviews. I have then joined this with the reviews dataset generated previously. This is then further joined with the business dataset. The result is then flattened and grouped according to the reviewers' names and the categories, along with the average rating for each category they have reviewed. The average rating for the top 10 reviewers has been found as 3.678, with the minimum and maximum rating values as 1.0 and 5.0. The median average rating is 3.692. The average number of reviews has been found as 28, with the minimum and maximum number of reviews by a single user being 0 and 10,320. The total number of reviews in the dataset was 15,261,802.
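The ordering and top 10 selection can be illustrated with a small Python sketch. The user records and the function names below are hypothetical; the sketch only mirrors the idea of ORDER BY ... DESC followed by LIMIT.

```python
import heapq

# Hypothetical user records (user_id, name, review_count); values are made up.
users = [
    ("u1", "Ann", 120),
    ("u2", "Bob", 45),
    ("u3", "Cy", 300),
    ("u4", "Dee", 0),
]


def rank_by_review_count(users):
    """Sort all users by review_count, highest first (ORDER BY review_count DESC)."""
    return sorted(users, key=lambda u: u[2], reverse=True)


def top_reviewers(users, n=10):
    """Return the n users with the most reviews (ORDER BY ... DESC, then LIMIT n)."""
    return heapq.nlargest(n, users, key=lambda u: u[2])


for user_id, name, review_count in rank_by_review_count(users):
    print(f"{user_id}\t{name}\t{review_count}")
```

Selecting the top n directly avoids sorting the full list when only a handful of users is needed, although for a one-off analysis a full sort followed by a slice works just as well.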

Figure 7 shows the average rating given by each of the 10 reviewers with the highest number of reviews. From the figure we can conclude that the user with the highest average rating is Shila.

Figure 7

Figure 8


Figure 8 shows the number of reviews per user from the given dataset. In this figure I have limited the number of users to 10, to achieve greater clarity. From this we infer that though the highest average rating is for Shila, Victor has the greatest number of reviews.

5. Ratings of the Top 10 & Bottom 10 Food Businesses around CMU by month

For this question, I first loaded the business dataset and generated the name, categories, business_id, latitude, longitude and stars fields from it. Then I filtered the data according to the location limits around CMU. After this I generated the columns other than latitude and longitude. In this same step I converted the categories field to a bag by using the TOBAG function. This bag was then converted to a string in the same step using the BagToString function. This was done so that I could filter the businesses easily depending on whether they had Food in their categories or not. For this I used the matches operator, which tests a string against a regular expression and returns TRUE if it matches and FALSE otherwise; the pattern '.*Food.*' matches any string that contains Food. After this I ordered the remaining businesses by their stars and then limited them to 10. For the bottom 10 I ordered the businesses in ascending order and for the top 10 in descending order.
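Pig's matches operator requires the regular expression to match the whole string, which is why the pattern is written as '.*Food.*' rather than just 'Food'. The Python sketch below mimics that behaviour with re.fullmatch; the helper name pig_matches and the sample category strings are made up for the illustration.

```python
import re


def pig_matches(value, pattern):
    """Mimic Pig's `matches`: the pattern must match the entire string."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical category strings for three businesses.
samples = ["Fast Food, Burgers", "Automotive", "Seafood"]

for text in samples:
    print(text, pig_matches(text, ".*Food.*"))
```

The comparison is case-sensitive, so a string such as Seafood does not match, while Fast Food or Specialty Food does.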

PIG SCRIPT:

A = LOAD './yelp_academic_dataset_user.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp:map[]);
users = FOREACH A GENERATE yelp#'user_id' as user_id, yelp#'name' as name, (int)yelp#'review_count' as review_count;
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review:map[]);
revie = FOREACH B GENERATE review#'business_id' as business_id, (int)review#'stars' as stars, review#'user_id' as user_id;
C = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (busines:map[]);
business = FOREACH C GENERATE busines#'business_id' as business_id, busines#'categories' as categories;
-- Part a: all reviewers ranked by number of reviews
ordered_users = ORDER users by review_count DESC;
STORE ordered_users INTO './Q4a' using PigStorage('\t');
-- Part b: average stars per category for the top 10 reviewers
ordered = ORDER users by review_count DESC;
top_10 = LIMIT ordered 10;
joined = JOIN top_10 by user_id, revie by user_id;
generator = FOREACH joined GENERATE name as Name, stars as Stars, business_id as Business_ID;
join_again = JOIN generator by Business_ID, business by business_id;
flattened = FOREACH join_again GENERATE Stars, Name, FLATTEN(categories);
grouped = GROUP flattened by (Name, categories);
results = FOREACH grouped GENERATE FLATTEN(group) AS (Name, categories), AVG(flattened.Stars) AS rankings;
STORE results INTO './Q4b' USING PigStorage('\t');

TRUNCATED OUTPUT:

Neal	Bars	4.25
Neal	Food	3.4615384615384617
Neal	Thai	2.0
Neal	Cafes	3.5
Neal	Greek	3.5
Neal	Pizza	2.0
Neal	Taxis	3.0
Neal	Hotels	4.0
Neal	Indian	3.0
Neal	Burgers	3.75
Neal	Italian	4.0
Neal	Lounges	4.333333333333333
Neal	Mexican	3.0
Neal	Resorts	5.0
Neal	Airports	3.3333333333333335
Neal	Bakeries	3.0
Neal	Caterers	5.0
Neal	Day Spas	5.0
Neal	Desserts	3.0
Neal	Shopping	4.2
Neal	Fast Food	3.5714285714285716
Neal	Nightlife	4.25
Neal	Automotive	4.0
Neal	Bookstores	5.0
Neal	Car Rental	4.0
Neal	Drugstores	4.0
Neal	Sushi Bars	5.0


After joining it with the reviews dataset, I generated the columns by making use of their position in the joined table and then naming them. For extracting the month from the date, I have used the SUBSTRING function. As the date was already in chararray format and specified as 'yyyy-mm-dd', we can simply denote the starting and ending positions and extract the month from the given dates. Then I grouped the result according to business_id, name and month. I have included the business_id as well because there are cases where many businesses have the same name, which can cause ambiguity. This was seen in the bottom 10 businesses, where McDonald's was present 3 times. Therefore, using business_id along with the name helps us to identify them individually. We obtained the two output files separately and then used the Hadoop HDFS commands to concatenate them into a single file, which was then copied from the local file system to HDFS. Q5.txt is the final output file. For this we use the commands:

hadoop dfs -mv /user/cloudera/BOTTOM_10_FOODS/part-r-00000 /user/cloudera/TOP_10_FOODS/part-r-00001
hadoop fs -rm /user/cloudera/TOP_10_FOODS/_SUCCESS
hadoop fs -getmerge /user/cloudera/TOP_10_FOODS/ /user/cloudera/Q5.txt
hadoop fs -copyFromLocal /user/cloudera/Q5.txt /user/cloudera/
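The month extraction described above relies on the fixed 'yyyy-mm-dd' layout of the date string: the month occupies the characters at zero-based positions 5 and 6, which is what SUBSTRING with a start index of 5 and a stop index of 7 selects. A minimal Python illustration, with a made-up helper name month_of and made-up sample dates:

```python
def month_of(date_string):
    """Return the month of a 'yyyy-mm-dd' string as an int."""
    # Positions 5 and 6 hold the two month digits; the stop index 7 is exclusive.
    return int(date_string[5:7])


print(month_of("2016-03-15"))
print(month_of("2014-12-01"))
```

Casting to int drops the leading zero, so March is reported as 3 rather than 03, which matches the month column in the truncated output.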

In the truncated output below, I have shown both the output files before concatenation. In Figure 9 we can see the Top 10 Food Businesses as located on the map.

Below, in Figure 10, we can see the Bottom 10 Food Businesses as located on the map.

It can be seen in both of the maps shown above that all the businesses are located near Carnegie Mellon University.

Figure 9

Figure 10


PIG SCRIPT:

A = LOAD './yelp_academic_dataset_business.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (yelp:map[]);
business = FOREACH A GENERATE yelp#'categories' as categories, (float)yelp#'stars' as stars, yelp#'name' as name, yelp#'business_id' as business_id, (float)yelp#'latitude' as latitude, (float)yelp#'longitude' as longitude;
business_in_range = FILTER business BY (latitude < 40.5245131) AND (latitude > 40.3578471) AND (longitude > -80.0261624) AND (longitude < -79.8594964);
-- Turn the categories into a single string so that it can be filtered with matches
binrange = FOREACH business_in_range GENERATE name, stars, business_id, org.apache.pig.builtin.BagToString(TOBAG(categories)) as category;
filters = FILTER binrange BY category matches '.*Food.*';
ordered = ORDER filters by stars DESC;
top_10 = limit ordered 10;
B = LOAD './yelp_academic_dataset_review.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS (review:map[]);
rev = FOREACH B GENERATE (float)review#'rating' as ratings, review#'business_id' as business_id, review#'date' as date;
joining = JOIN top_10 by business_id, rev by business_id;
-- Positional fields of the join: $0 name, $1 stars (business-level), $2 business_id, $3 category, $4 ratings, $5 business_id, $6 date
top_join = FOREACH joining GENERATE $0 as name, $1 as ratings, $5 as business_id, (int)SUBSTRING($6,5,7) as month;
grouped = GROUP top_join by (business_id, name, month);
flatting = FOREACH grouped GENERATE FLATTEN(group) as (business_id, name, month), AVG(top_join.ratings);
STORE flatting INTO './TOP_10_FOODS' using PigStorage('\t');
-- Same steps for the bottom 10, ordering by stars ascending
orderedbottom = ORDER filters by stars;
bottom_10 = limit orderedbottom 10;
joiningb = JOIN bottom_10 by business_id, rev by business_id;
bottom_join = FOREACH joiningb GENERATE $0 as name, $1 as ratings, $5 as business_id, (int)SUBSTRING($6,5,7) as month;
groupedb = GROUP bottom_join by (business_id, name, month);
flattingb = FOREACH groupedb GENERATE FLATTEN(group) as (business_id, name, month), AVG(bottom_join.ratings);
STORE flattingb INTO './BOTTOM_10_FOODS' using PigStorage('\t');

TRUNCATED OUTPUT FOR TOP 10:

08eRFhpedodAf6atSRK09g	The Colombian Spot	12	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	1	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	2	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	3	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	5	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	6	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	7	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	8	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	9	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	10	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	11	5.0
0h2My97xjAjc1pNrMM266Q	Five Points Artisan Bakeshop	12	5.0
3EMEYlCxPiygL4Tu_z9beQ	Fine Wine & Good Spirits	4	5.0
3EMEYlCxPiygL4Tu_z9beQ	Fine Wine & Good Spirits	5	5.0
3EMEYlCxPiygL4Tu_z9beQ	Fine Wine & Good Spirits	11	5.0

TRUNCATED OUTPUT FOR BOTTOM 10:

6C1Igw4BzRmg5Et8GSVfpA	Seven Eleven Penn Avenue	5	1.5
6C1Igw4BzRmg5Et8GSVfpA	Seven Eleven Penn Avenue	6	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	1	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	3	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	4	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	7	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	8	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	9	1.5
9KsHPdF1-P_CiXnvugdQvQ	Foodland	11	1.5
BbIh5NTizhV4Fq_mLmNkpg	Long John Silver's	1	1.5
BbIh5NTizhV4Fq_mLmNkpg	Long John Silver's	7	1.5
BbIh5NTizhV4Fq_mLmNkpg	Long John Silver's	8	1.5
CL3tZqbYT7B5zgewKCS6-Q	McDonald's	8	1.0
CL3tZqbYT7B5zgewKCS6-Q	McDonald's	12	1.0
KT8KJ4zt-IPqpLzACdpEZg	Wendy's	2	1.5
KT8KJ4zt-IPqpLzACdpEZg	Wendy's	3	1.5