learning apache mahout classification related/pdfs and books... · 2015-05-24 · the entropy...
TRANSCRIPT
![Page 1: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/1.jpg)
![Page 2: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/2.jpg)
![Page 3: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/3.jpg)
LearningApacheMahoutClassification
![Page 4: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/4.jpg)
TableofContents
LearningApacheMahoutClassification
Credits
AbouttheAuthor
AbouttheReviewers
www.PacktPub.com
Supportfiles,eBooks,discountoffers,andmore
Whysubscribe?
FreeaccessforPacktaccountholders
Preface
Whatthisbookcovers
Whatyouneedforthisbook
Whothisbookisfor
Conventions
Readerfeedback
Customersupport
Downloadingtheexamplecode
Downloadingthecolorimagesofthisbook
Errata
Piracy
Questions
1.ClassificationinDataAnalysis
Introducingtheclassification
Applicationoftheclassificationsystem
Workingoftheclassificationsystem
Classificationalgorithms
Modelevaluationtechniques
Theconfusionmatrix
TheReceiverOperatingCharacteristics(ROC)graph
AreaundertheROCcurve
![Page 5: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/5.jpg)
Theentropymatrix
Summary
2.ApacheMahout
IntroducingApacheMahout
AlgorithmssupportedinMahout
ReasonsforMahoutbeingagoodchoiceforclassification
InstallingMahout
BuildingMahoutfromsourceusingMaven
InstallingMaven
BuildingMahoutcode
SettingupadevelopmentenvironmentusingEclipse
SettingupMahoutforaWindowsuser
Summary
3.LearningLogisticRegression/SGDUsingMahout
Introducingregression
Understandinglinearregression
Costfunction
Gradientdescent
Logisticregression
StochasticGradientDescent
UsingMahoutforlogisticregression
Summary
4.LearningtheNaïveBayesClassificationUsingMahout
IntroducingconditionalprobabilityandtheBayesrule
UnderstandingtheNaïveBayesalgorithm
Understandingthetermsusedintextclassification
UsingtheNaïveBayesalgorithminApacheMahout
Summary
5.LearningtheHiddenMarkovModelUsingMahout
Deterministicandnondeterministicpatterns
TheMarkovprocess
![Page 6: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/6.jpg)
IntroducingtheHiddenMarkovModel
UsingMahoutfortheHiddenMarkovModel
Summary
6.LearningRandomForestUsingMahout
Decisiontree
Randomforest
UsingMahoutforRandomforest
StepstousetheRandomforestalgorithminMahout
Summary
7.LearningMultilayerPerceptronUsingMahout
Neuralnetworkandneurons
MultilayerPerceptron
MLPimplementationinMahout
UsingMahoutforMLP
StepstousetheMLPalgorithminMahout
Summary
8.MahoutChangesintheUpcomingRelease
Mahoutnewchanges
MahoutScalaandSparkbindings
ApacheSpark
UsingMahout’sSparkshell
H2Oplatformintegration
Summary
9.BuildinganE-mailClassificationSystemUsingApacheMahout
Spame-maildataset
CreatingthemodelusingtheAssassindataset
Programtouseaclassifiermodel
Testingtheprogram
Secondusecaseasanexercise
TheASFe-maildataset
Classifierstuning
![Page 7: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/7.jpg)
Summary
Index
![Page 8: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/8.jpg)
![Page 9: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/9.jpg)
LearningApacheMahoutClassification
![Page 10: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/10.jpg)
![Page 11: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/11.jpg)
LearningApacheMahoutClassificationCopyright©2015PacktPublishing
Allrightsreserved.Nopartofthisbookmaybereproduced,storedinaretrievalsystem,ortransmittedinanyformorbyanymeans,withoutthepriorwrittenpermissionofthepublisher,exceptinthecaseofbriefquotationsembeddedincriticalarticlesorreviews.
Everyefforthasbeenmadeinthepreparationofthisbooktoensuretheaccuracyoftheinformationpresented.However,theinformationcontainedinthisbookissoldwithoutwarranty,eitherexpressorimplied.Neithertheauthor,norPacktPublishing,anditsdealersanddistributorswillbeheldliableforanydamagescausedorallegedtobecauseddirectlyorindirectlybythisbook.
PacktPublishinghasendeavoredtoprovidetrademarkinformationaboutallofthecompaniesandproductsmentionedinthisbookbytheappropriateuseofcapitals.However,PacktPublishingcannotguaranteetheaccuracyofthisinformation.
Firstpublished:February2015
Productionreference:1210215
PublishedbyPacktPublishingLtd.
LiveryPlace
35LiveryStreet
BirminghamB32PB,UK.
ISBN978-1-78355-495-9
www.packtpub.com
![Page 12: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/12.jpg)
![Page 13: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/13.jpg)
CreditsAuthor
AshishGupta
Reviewers
SivaPrakash
TharinduRusira
VishnuViswanath
CommissioningEditor
AkramHussain
AcquisitionEditor
ReshmaRaman
ContentDevelopmentEditor
MerwynD’souza
TechnicalEditors
MonicaJohn
NovinaKewalramani
ShrutiRawool
CopyEditors
SarangChari
GladsonMonteiro
AartiSaldanha
RashmiSawant
ProjectCoordinator
NehaBhatnagar
Proofreaders
SimranBhogal
SteveMaguire
Indexer
MonicaAjmeraMehta
Graphics
SheetalAute
![Page 14: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/14.jpg)
AbhinashSahu
ProductionCoordinator
ConidonMiranda
CoverWork
ConidonMiranda
![Page 15: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/15.jpg)
![Page 16: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/16.jpg)
AbouttheAuthorAshishGuptahasbeenworkinginthefieldofsoftwaredevelopmentforthelast8years.Hehasworkedindifferentcompanies,suchasSAPLabsandCaterpillar,asasoftwaredeveloper.Whileworkingforastart-upwherehewasresponsibleforpredictingpotentialcustomersfornewfashionapparelsusingsocialmedia,hedevelopedaninterestinthefieldofmachinelearning.Sincethen,hehasworkedonusingbigdatatechnologiesandmachinelearningfordifferentindustries,includingretail,finance,insurance,andsoon.Hehasapassionforlearningnewtechnologiesandsharingtheknowledgethusgainedwithothers.HehasorganizedmanybootcampsfortheApacheMahoutandHadoopecosystem.
Firstofall,Iwouldliketothankopensourcecommunitiesfortheircontinuouseffortsindevelopinggreatsoftwareforall.IwouldliketothankMerwynD’SouzaandReshmaRaman,myeditorsforthisproject.Specialthankstothereviewersofthisbook.
Nothingcanbeaccomplishedwithoutthesupportoffamily,friends,andlovedones.Iwouldliketothankmyfriends,family,andespeciallymywifeandmysonfortheircontinuoussupportthroughoutthewritingofthisbook.
![Page 17: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/17.jpg)
![Page 18: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/18.jpg)
AbouttheReviewersSivaPrakashisworkingasatechleadinBangalore.Hehasextensivedevelopmentexperienceintheanalysis,design,development,implementation,andmaintenanceofvariousdesktop,mobile,andweb-basedapplications.Helovestrekking,traveling,music,readingbooks,andblogging.
YoucanfindhimonLinkedInathttps://www.linkedin.com/in/techsivam.
TharinduRusiraiscurrentlyacomputerscienceandengineeringundergraduateattheUniversityofMoratuwa,SriLanka.Asastudentresearcher,hehasstronginterestsinmachinelearning,compilers,andhigh-performancecomputing.
TharinduhasalsoworkedasaresearchanddevelopmentsoftwareengineeringinternatZaiziAsia(Pvt)Ltd.,wherehefirststartedusingApacheMahoutduringtheimplementationofanenterprise-levelcontentmanagementandinformationretrievalsystem.
HeseesthepotentialofApacheMahoutasascalablemachinelearninglibraryforindustry-levelimplementationsandhasevencontributedtotheMahout0.9release,thelateststablereleaseofMahout.
HeisavailableonLinkedInathttps://www.linkedin.com/in/trusira.
VishnuViswanathisaseniorbigdatadeveloperwhohasmanyyearsofindustrialexpertiseinthearenaofmachinelearning.Heisatechenthusiastandispassionateaboutbigdataandhasexpertiseonmostbig-data-relatedtechnologies.
YoucanfindhimonLinkedInathttp://in.linkedin.com/in/vishnuviswanath25.
![Page 19: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/19.jpg)
![Page 20: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/20.jpg)
www.PacktPub.com
![Page 21: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/21.jpg)
Supportfiles,eBooks,discountoffers,andmoreForsupportfilesanddownloadsrelatedtoyourbook,pleasevisitwww.PacktPub.com.
DidyouknowthatPacktofferseBookversionsofeverybookpublished,withPDFandePubfilesavailable?YoucanupgradetotheeBookversionatwww.PacktPub.comandasaprintbookcustomer,youareentitledtoadiscountontheeBookcopy.Getintouchwithusat<[email protected]>formoredetails.
Atwww.PacktPub.com,youcanalsoreadacollectionoffreetechnicalarticles,signupforarangeoffreenewslettersandreceiveexclusivediscountsandoffersonPacktbooksandeBooks.
https://www2.packtpub.com/books/subscription/packtlib
DoyouneedinstantsolutionstoyourITquestions?PacktLibisPackt’sonlinedigitalbooklibrary.Here,youcansearch,access,andreadPackt’sentirelibraryofbooks.
![Page 22: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/22.jpg)
Whysubscribe?FullysearchableacrosseverybookpublishedbyPacktCopyandpaste,print,andbookmarkcontentOndemandandaccessibleviaawebbrowser
![Page 23: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/23.jpg)
FreeaccessforPacktaccountholdersIfyouhaveanaccountwithPacktatwww.PacktPub.com,youcanusethistoaccessPacktLibtodayandview9entirelyfreebooks.Simplyuseyourlogincredentialsforimmediateaccess.
![Page 24: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/24.jpg)
![Page 25: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/25.jpg)
PrefaceThankstotheprogressmadeinthehardwareindustries,ourstoragecapacityhasincreased,andbecauseofthis,therearemanyorganizationswhowanttostorealltypesofeventsforanalyticspurposes.Thishasgivenbirthtoaneweraofmachinelearning.Thefieldofmachinelearningisverycomplexandwritingthesealgorithmsisnotapieceofcake.ApacheMahoutprovidesuswithreadymadealgorithmsintheareaofmachinelearningandsavesusfromthecomplextaskofalgorithmimplementation.
TheintentionofthisbookistocoverclassificationalgorithmsavailableinApacheMahout.Whetheryouhavealreadyworkedonclassificationalgorithmsusingsomeothertoolorarecompletelynewtothefield,thisbookwillhelpyou.So,startreadingthisbooktoexploretheclassificationalgorithmsinoneofthemostpopularopensourceprojectswhichenjoysstrongcommunitysupport:ApacheMahout.
![Page 26: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/26.jpg)
WhatthisbookcoversChapter1,ClassificationinDataAnalysis,providesanintroductiontotheclassificationconceptindataanalysis.Thischapterwillcoverthebasicsofclassification,similaritymatrix,andalgorithmsavailableinthisarea.
Chapter2,ApacheMahout,providesanintroductiontoApacheMahoutanditsinstallationprocess.Further,thischapterwilltalkaboutwhyitisagoodchoiceforclassification.
Chapter3,LearningLogisticRegression/SGDUsingMahout,discusseslogisticregressionandStochasticGradientDescent,andhowdeveloperscanuseMahouttouseSGD.
Chapter4,LearningtheNaïveBayesClassificationUsingMahout,discussestheBayesTheorem,NaïveBayesclassification,andhowwecanuseMahouttobuildNaïveBayesclassifier.
Chapter5,LearningtheHiddenMarkovModelUsingMahout,coverstheHMMandhowtouseMahout’sHMMalgorithms.
Chapter6,LearningRandomForestUsingMahout,discussestheRandomforestalgorithmindetail,andhowtouseMahout’sRandomforestimplementation.
Chapter7,LearningMultilayerPerceptronUsingMahout,discussesMahoutasanearlylevelimplementationofaneuralnetwork.WewilldiscussMultilayerPerceptroninthischapter.Further,wewilluseMahout’simplementationofMLP.
Chapter8,MahoutChangesintheUpcomingRelease,discussesMahoutasaworkinprogress.WewilldiscussthenewmajorchangesintheupcomingreleaseofMahout.
Chapter9,BuildinganE-mailClassificationSystemUsingApacheMahout,providestwousecasesofe-mailclassification—spammailclassificationande-mailclassificationbasedontheprojectthemailbelongsto.Wewillcreatethemodel,andusethismodelinaprogramthatwillsimulatetherealworkingenvironment.
![Page 27: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/27.jpg)
![Page 28: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/28.jpg)
WhatyouneedforthisbookTousetheexamplesinthisbook,youshouldhavethefollowingsoftwareinstalledonyoursystem:
Java1.6orhigherEclipseHadoopMahout;wewilldiscusstheinstallationinChapter2,ApacheMahout,ofthisbookMaven,dependingonhowyouinstallMahout
![Page 29: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/29.jpg)
![Page 30: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/30.jpg)
WhothisbookisforIfyouareadatascientistwhohassomeexperiencewiththeHadoopecosystemandmachinelearningmethodsandwanttotryoutclassificationonlargedatasetsusingMahout,thisbookisidealforyou.KnowledgeofJavaisessential.
![Page 31: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/31.jpg)
![Page 32: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/32.jpg)
ConventionsInthisbook,youwillfindanumberoftextstylesthatdistinguishbetweendifferentkindsofinformation.Herearesomeexamplesofthesestylesandanexplanationoftheirmeaning.
Codewordsintext,databasetablenames,foldernames,filenames,fileextensions,pathnames,dummyURLs,userinput,andTwitterhandlesareshownasfollows:“Extractthesourcecodeandensurethatthefoldercontainsthepom.xmlfile.”
Ablockofcodeissetasfollows:
publicstaticMap<String,Integer>readDictionary(Configurationconf,
PathdictionaryPath){
Map<String,Integer>dictionary=newHashMap<String,Integer>();
for(Pair<Text,IntWritable>pair:newSequenceFileIterable<Text,
IntWritable>(dictionaryPath,true,conf)){
dictionary.put(pair.getFirst().toString(),
pair.getSecond().get());
}
returndictionary;
}
Whenwewishtodrawyourattentiontoaparticularpartofacodeblock,therelevantlinesoritemsaresetinbold:
publicstaticMap<String,Integer>readDictionary(Configurationconf,
PathdictionaryPath){
Map<String,Integer>dictionary=newHashMap<String,Integer>();
for(Pair<Text,IntWritable>pair:newSequenceFileIterable<Text,
IntWritable>(dictionaryPath,true,conf)){
dictionary.put(pair.getFirst().toString(),
pair.getSecond().get());
}
returndictionary;
}
Anycommand-lineinputoroutputiswrittenasfollows:
hadoopfs-mkdir/user/hue/KDDTrain
hadoopfs-mkdir/user/hue/KDDTest
hadoopfs–put/tmp/KDDTrain+_20Percent.arff/user/hue/KDDTrain
hadoopfs–put/tmp/KDDTest+.arff/user/hue/KDDTest
Newtermsandimportantwordsareshowninbold.Wordsthatyouseeonthescreen,forexample,inmenusordialogboxes,appearinthetextlikethis:“Now,navigatetothelocationformahout-distribution-0.9andclickonFinish.”
NoteWarningsorimportantnotesappearinaboxlikethis.
TipTipsandtricksappearlikethis.
![Page 33: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/33.jpg)
![Page 34: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/34.jpg)
![Page 35: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/35.jpg)
ReaderfeedbackFeedbackfromourreadersisalwayswelcome.Letusknowwhatyouthinkaboutthisbook—whatyoulikedordisliked.Readerfeedbackisimportantforusasithelpsusdeveloptitlesthatyouwillreallygetthemostoutof.
Tosendusgeneralfeedback,simplye-mail<[email protected]>,andmentionthebook’stitleinthesubjectofyourmessage.
Ifthereisatopicthatyouhaveexpertiseinandyouareinterestedineitherwritingorcontributingtoabook,seeourauthorguideatwww.packtpub.com/authors.
![Page 36: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/36.jpg)
![Page 37: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/37.jpg)
CustomersupportNowthatyouaretheproudownerofaPacktbook,wehaveanumberofthingstohelpyoutogetthemostfromyourpurchase.
![Page 38: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/38.jpg)
DownloadingtheexamplecodeYoucandownloadtheexamplecodefilesfromyouraccountathttp://www.packtpub.comforallthePacktPublishingbooksyouhavepurchased.Ifyoupurchasedthisbookelsewhere,youcanvisithttp://www.packtpub.com/supportandregistertohavethefilese-maileddirectlytoyou.
![Page 39: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/39.jpg)
DownloadingthecolorimagesofthisbookWealsoprovideyouwithaPDFfilethathascolorimagesofthescreenshots/diagramsusedinthisbook.Thecolorimageswillhelpyoubetterunderstandthechangesintheoutput.Youcandownloadthisfilefromhttp://www.packtpub.com/sites/default/files/downloads/4959OS_ColoredImages.pdf.
![Page 40: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/40.jpg)
ErrataAlthoughwehavetakeneverycaretoensuretheaccuracyofourcontent,mistakesdohappen.Ifyoufindamistakeinoneofourbooks—maybeamistakeinthetextorthecode—wewouldbegratefulifyoucouldreportthistous.Bydoingso,youcansaveotherreadersfromfrustrationandhelpusimprovesubsequentversionsofthisbook.Ifyoufindanyerrata,pleasereportthembyvisitinghttp://www.packtpub.com/submit-errata,selectingyourbook,clickingontheErrataSubmissionFormlink,andenteringthedetailsofyourerrata.Onceyourerrataareverified,yoursubmissionwillbeacceptedandtheerratawillbeuploadedtoourwebsiteoraddedtoanylistofexistingerrataundertheErratasectionofthattitle.
Toviewthepreviouslysubmittederrata,gotohttps://www.packtpub.com/books/content/supportandenterthenameofthebookinthesearchfield.TherequiredinformationwillappearundertheErratasection.
![Page 41: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/41.jpg)
PiracyPiracyofcopyrightedmaterialontheInternetisanongoingproblemacrossallmedia.AtPackt,wetaketheprotectionofourcopyrightandlicensesveryseriously.IfyoucomeacrossanyillegalcopiesofourworksinanyformontheInternet,pleaseprovideuswiththelocationaddressorwebsitenameimmediatelysothatwecanpursuearemedy.
Pleasecontactusat<[email protected]>withalinktothesuspectedpiratedmaterial.
Weappreciateyourhelpinprotectingourauthorsandourabilitytobringyouvaluablecontent.
![Page 42: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/42.jpg)
QuestionsIfyouhaveaproblemwithanyaspectofthisbook,youcancontactusat<[email protected]>,andwewilldoourbesttoaddresstheproblem.
![Page 43: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/43.jpg)
![Page 44: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/44.jpg)
Chapter1.ClassificationinDataAnalysisInthelastdecade,wesawahugegrowthinsocialnetworkingande-commercesites.IamsurethatyoumusthavegotinformationaboutthisbookonFacebook,Twitter,orsomeothersite.Chancesarealsohighthatyouarereadingane-copyofthisbookafterorderingitonyourphoneortablet.
ThismustgiveyouanideaofhowmuchdatawearegeneratingovertheInterneteverysingleday.Now,inordertoobtainallnecessaryinformationfromthedata,wenotonlycreatedatabutalsostorethisdata.Thisdataisextremelyusefultogetsomeimportantinsightsintothebusiness.Theanalysisofthisdatacanincreasethecustomerbaseandcreateprofitsfortheorganization.Taketheexampleofane-commercesite.Youvisitthesitetobuysomebook.Yougetinformationaboutbooksonrelatedtopicsorthesametopic,publisher,orwriter,andthishelpsyoutotakebetterdecisions,whichalsohelpsthesitetoknowmoreaboutitscustomers.Thiswilleventuallyleadtoanincreaseinsales.
Findingrelateditemsorsuggestinganewitemtotheuserisallpartofthedatascienceinwhichweanalyzethedataandtrytogetusefulpatterns.
Dataanalysisistheprocessofinspectinghistoricaldataandcreatingmodelstogetusefulinformationthatisrequiredtohelpindecisionmaking.Itishelpfulinmanyindustries,suchase-commerce,banking,finance,healthcare,telecommunications,retail,oceanography,andmanymore.
Let’staketheexampleofaweatherforecastingsystem.Itisasystemthatcanpredictthestateoftheatmosphereataparticularlocation.Inthisprocess,scientistscollecthistoricaldataoftheatmosphereofthatlocationandtrytocreateamodelbasedonittopredicthowtheatmospherewillevolveoveraperiodoftime.
Inmachinelearning,classificationistheautomationofthedecision-makingprocessthatlearnsfromexamplesofthepastandemulatesthosedecisionsautomatically.Emulatingthedecisionsautomaticallyisacoreconceptinpredictiveanalytics.Inthischapter,wewilllookatthefollowingpoints:
UnderstandingclassificationWorkingofclassificationsystemsClassificationalgorithmsModelevaluationmethods
![Page 45: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/45.jpg)
IntroducingtheclassificationThewordclassificationalwaysremindsusofourbiologyclass,wherewelearnedabouttheclassificationofanimals.Welearnedaboutdifferentcategoriesofanimals,suchasmammals,reptiles,birds,amphibians,andsoon.
Ifyourememberhowthesecategoriesaredefined,youwillrealizethattherewerecertainpropertiesthatscientistsfoundinexistinganimals,andbasedontheseproperties,theycategorizedanewanimal.
Otherreal-lifeexamplesofclassificationcouldbe,forinstance,whenyouvisitthedoctor.He/sheasksyoucertainquestions,andbasedonyouranswers,he/sheisabletoidentifywhetheryouhaveacertaindiseaseornot.
Classificationisthecategorizationofpotentialanswers,andinmachinelearning,wewanttoautomatethisprocess.Biologicalclassificationisanexampleofmulticlassclassificationandfindingthediseaseisanexampleofbinaryclassification.
Indataanalysis,wewanttousemachinelearningconcepts.Toanalyzethedata,wewanttobuildasystemthatcanhelpustofindoutwhichclassanindividualitembelongsto.Usually,theseclassesaremutuallyexclusive.Arelatedprobleminthisareaisfindingouttheprobabilitythatanindividualbelongstoacertainclass.
Classificationisasupervisedlearningtechnique.Inthistechnique,machines—basedonhistoricaldata—learnandgainthecapabilitiestopredicttheunknown.Inmachinelearning,anotherpopulartechniqueisunsupervisedlearning.Insupervisedlearning,wealreadyknowtheoutputcategories,butinunsupervisedlearning,weknownothingabouttheoutput.Let’sunderstandthiswithaquickexample:supposewehaveafruitbasket,andwewanttoclassifyfruits.Whenwesayclassify,itmeansthatinthetrainingdata,wealreadyhaveoutputvariables,suchassizeandcolor,andweknowwhetherthecolorisredandthesizeisfrom2.3”to3.7”.Wewillclassifythatfruitasanapple.Oppositetothis,inunsupervisedlearning,wewanttoseparatedifferentfruits,andwedonothaveanyoutputinformationinthetrainingdataset,sothelearningalgorithmwillseparatedifferentfruitsbasedondifferentfeaturespresentinthedataset,butitwillnotbeabletolabelthem.Inotherwords,itwillnotbeabletotellwhichoneisanappleandwhichoneisabanana,althoughitwillbeabletoseparatethem.
![Page 46: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/46.jpg)
ApplicationoftheclassificationsystemClassificationisusedforprediction.Inthecaseofe-mailcategorization,itisusedtoclassifye-mailasspamornotspam.Nowadays,Gmailisclassifyinge-mailsasprimary,social,andpromotionalaswell.Classificationisusefulinpredictingcreditcardfrauds,tocategorizecustomersforeligibilityofloans,andsoon.Itisalsousedtopredictcustomerchurnintheinsuranceandtelecomindustries.Itisusefulinthehealthcareindustryaswell.Basedonhistoricaldata,itisusefulinclassifyingparticularsymptomsofadiseasetopredictthediseaseinadvance.Classificationcanbeusedtoclassifytropicalcyclones.So,itisusefulacrossallindustries.
![Page 47: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/47.jpg)
WorkingoftheclassificationsystemLet’sunderstandtheclassificationprocessinmoredetail.Intheprocessofclassification,withthedatasetgiventous,wetrytofindoutinformativevariablesusingwhichwecanreducetheuncertaintyandcategorizesomething.Theseinformativevariablesarecalledexplanatoryvariablesorfeatures.
Thefinalcategoriesthatweareinterestedarecalledtargetvariablesorlabels.Explanatoryvariablescanbeanyofthefollowingforms:
Continuous(numerictypes)CategoricalWord-likeText-like
NoteIfnumerictypesarenotusefulforanymathematicalfunctions,thosewillbecountedascategorical(zipcodes,streetnumbers,andsoon).
So,forexample,wehaveadatasetofcustomer’s’loanapplications,andwewanttobuildaclassifiertofindoutwhetheranewcustomeriseligibleforaloanornot.Inthisdataset,wecanhavethefollowingfields:
CustomerAgeCustomerIncome(PA)CustomerAccountBalanceLoanGranted
Fromthesefields,CustomerAge,CustomerIncome(PA)andCustomerAccountBalancewillworkasexplanatoryvariablesandLoanGrantedwillbethetargetvariable,asshowninthefollowingscreenshot:
Tounderstandthecreationoftheclassifier,weneedtounderstandafewterms,asshowninthefollowingdiagram:
![Page 48: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/48.jpg)
Trainingdataset:Fromthegivendataset,aportionofthedataisusedtocreatethetrainingdataset(itcouldbe70percentofthegivendata).Thisdatasetisusedtobuildtheclassifier.Allthefeaturesetsareusedinthisdataset.Testdataset:Thedatasetthatisleftafterthetrainingdatasetisusedtotestthecreatedmodel.Withthisdata,onlythefeaturesetisusedandthemodelisusedtopredictthetargetvariablesorlabels.Model:Thisisusedtounderstandthealgorithmusedtogeneratethetargetvariables.
Whilebuildingaclassifier,wefollowthesesteps:
CollectinghistoricaldataCleaningdata(alotofactivitiesareinvolvedhere,suchasspaceremoval,andsoon)DefiningtargetvariablesDefiningexplanatoryvariablesSelectinganalgorithmTrainingthemodel(usingthetrainingdataset)RunningtestdataEvaluatingthemodelAdjustingexplanatoryvariablesRerunningthetest
Whilepreparingthemodel,oneshouldtakecareofoutlierdetection.Outlierdetectionisamethodtofindoutitemsthatdonotconformtoanexpectedpatterninadataset.Outliersinaninputdatasetcanmisleadthetrainingprocessofanalgorithm.Thiscanaffectthemodelaccuracy.Therearealgorithmstofindouttheseoutliersinthedatasets.Distance-
![Page 49: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/49.jpg)
basedtechniquesandfuzzy-logic-basedmethodsaremostlyusedtofindoutoutliersinthedataset.Let’stalkaboutoneexampletounderstandtheoutliers.
Wehaveasetofnumbers,andwewanttofindoutthemeanofthesenumbers:
10,75,10,15,20,85,25,30,25
Justplotthesenumbersandtheresultwillbeasshowninthefollowingscreenshot:
Clearly,thenumbers75and85areoutliers(farawayintheplotfromtheothernumbers).
Mean=sumofvalues/numberofvalues=32.78
Meanwithouttheoutliers:=19.29
So,nowyoucanunderstandhowoutlierscanaffecttheresults.
Whilecreatingthemodel,wecanencountertwomajorlyoccurringproblems—OverfittingandUnderfitting.
Overfittingoccurswhenthealgorithmcapturesthenoiseofthedata,andthealgorithmfitsthedatatoowell.Generally,itoccursifweuseallthegivendatatobuildthemodelusingpurememorization.Insteadoffindingoutthegeneralizingpattern,themodeljustmemorizesthepattern.Usually,inthecaseofoverfitting,themodelgetsmorecomplex,anditisallowedtopickupspuriouscorrelations.Thesecorrelationsarespecifictotrainingdatasetsanddonotrepresentcharacteristicsofthewholedatasetingeneral.
Thefollowingdiagramisanexampleofoverfitting.Anoutlierispresent,andthealgorithmconsidersthatandcreatesamodelthatperfectlyclassifiesthetrainingset,butbecauseofthis,thetestdataiswronglyclassified(boththerectanglesareclassifiedasstarsinthetestdata):
![Page 50: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/50.jpg)
Thereisnosinglemethodtoavoidoverfitting;however,wehavesomeapproaches,suchasareductioninthenumberoffeaturesandtheregularizationofafewofthefeatures.Anotherwayistotrainthemodelwithsomedatasetandtestwiththeremainingdataset.Acommonmethodcalledcross-validationisusedtogeneratemultipleperformancemeasures.Inthisway,asingledatasetissplitandusedforthecreationofperformancemeasures.
Underfittingoccurswhenthealgorithmcannotcapturethepatternsinthedata,andthedatadoesnotfitwell.Underfittingisalsoknownashighbias.Itmeansyouralgorithmhassuchastrongbiastowardsitshypothesisthatitdoesnotfitthedatawell.Foranunderfittingerror,moredatawillnothelp.Itcanincreasethetrainingerror.Moreexplanatoryvariablescanhelptodealwiththeunderfittingproblem.Moreexplanatoryfieldswillexpandthehypothesisspaceandwillbeusefultoovercomethisproblem.
Bothoverfittingandunderfittingprovidepoorresultswithnewdatasets.
![Page 51: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/51.jpg)
![Page 52: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/52.jpg)
ClassificationalgorithmsWewillnowdiscussthefollowingalgorithmsthataresupportedbyApacheMahoutinthisbook:
Logisticregression/StochasticGradientDescent(SGD):Weusuallyreadregressionalongwithclassification,butactually,thereisadifferencebetweenthetwo.Classificationinvolvesacategoricaltargetvariable,whileregressioninvolvesanumerictargetvariable.Classificationpredictswhethersomethingwillhappen,andregressionpredictshowmuchofsomethingwillhappen.WewillcoverthisalgorithminChapter3,LearningLogisticRegression/SGDUsingMahout.MahoutsupportslogisticregressiontrainedviaStochasticGradientDescent.NaïveBayesclassification:Thisisaverypopularalgorithmfortextclassification.NaïveBayesusestheconceptofprobabilitytoclassifynewitems.ItisbasedontheBayestheorem.WewilldiscussthisalgorithminChapter4,LearningtheNaïveBayesClassificationUsingMahout.Inthischapter,wewillseehowMahoutisusefulinclassifyingtext,whichisrequiredinthedataanalysisfield.Wewilldiscussvectorization,bagofwords,n-grams,andothertermsusedintextclassification.HiddenMarkovModel(HMM):Thisisusedinvariousfields,suchasspeechrecognition,parts-of-speechtagging,geneprediction,time-seriesanalysis,andsoon.InHMM,weobserveasequenceofemissionsbutdonothaveasequenceofstateswhichamodelusestogeneratetheemission.InChapter5,LearningtheHiddenMarkovModelUsingMahout,wewilltakeonemorealgorithmsupportedbyMahoutHiddenMarkovModel.WewilldiscussHMMindetailandseehowMahoutsupportsthisalgorithm.RandomForest:Thisisthemostwidelyusedalgorithminclassification.RandomForestconsistsofacollectionofsimpletreepredictors,eachcapableofproducingaresponsewhenpresentedwithasetofexplanatoryvariables.InChapter6,LearningRandomForestUsingMahout,wewilldiscussthisalgorithmindetailandalsotalkabouthowtouseMahouttoimplementthisalgorithm.Multi-layerPerceptron(MLP):InChapter7,LearningMultilayerPerceptronUsingMahout,wewilldiscussthisnewlyimplementedalgorithminMahout.AnMLPconsistsofmultiplelayersofnodesinadirectedgraph,witheachlayerfullyconnectedtothenextone.Itisabasefortheimplementationofneuralnetworks.WewilldiscussneuralnetworksalittlebutonlyafteradetaileddiscussiononMLPinMahout.
WewilldiscussalltheclassificationalgorithmssupportedbyApacheMahoutinthisbook,andwewillalsocheckthemodelevaluationtechniquesprovidedbyApacheMahout.
![Page 53: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/53.jpg)
![Page 54: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/54.jpg)
ModelevaluationtechniquesWecannothaveasingleevaluationmetricthatcanfitalltheclassifiermodels,butwecanfindoutsomecommonissuesinevaluation,andwehavetechniquestodealwiththem.WewilldiscussthefollowingtechniquesthatareusedinMahout:
ConfusionmatrixROCgraphAUCEntropymatrix
![Page 55: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/55.jpg)
TheconfusionmatrixTheconfusionmatrixprovidesuswiththenumberofcorrectandincorrectpredictionsmadebythemodelcomparedwiththeactualoutcomes(targetvalues)inthedata.AconfusionmatrixisaN*Nmatrix,whereNisthenumberoflabels(classes).Eachcolumnisaninstanceinthepredictedclass,andeachrowisaninstanceintheactualclass.Usingthismatrix,wecanfindouthowoneclassisconfusedwithanother.Let’sassumethatwehaveaclassifierthatclassifiesthreefruits:strawberries,cherries,andgrapes.Assumingthatwehaveasampleof24fruits:7strawberries,8cherries,and9grapes,theresultingconfusionmatrixwillbeasshowninthefollowingtable:
Predictedclassesbymodel
Actualclass
Strawberries Cherries Grapes
Strawberries 4 3 0
Cherries 2 5 1
Grapes 0 1 8
So,inthismodel,fromthe8strawberries,3wereclassifiedascherries.Fromthe8cherries,2wereclassifiedasstrawberries,and1isclassifiedasagrape.Fromthe9grapes,1isclassifiedasacherry.Fromthismatrix,wewillcreatethetableofconfusion.Thetableofconfusionhastworowsandtwocolumnsthatreportabouttruepositive,truenegative,falsepositive,andfalsenegative.
So,ifwebuildthistableforaparticularclass,let’ssayforstrawberries,itwouldbeasfollows:
TruePositive
4(actualstrawberriesclassifiedcorrectly)(a)
FalsePositive
2(cherriesthatwereclassifiedasstrawberries)(b)
FalseNegative
3(strawberrieswronglyclassifiedascherries)(c)
TrueNegative
15(allotherfruitscorrectlynotclassifiedasstrawberries)(d)
Usingthistableofconfusion,wecanfindoutthefollowingterms:
Accuracy:Thisistheproportionofthetotalnumberofpredictionsthatwerecorrectlyclassified.Itiscalculatedas(TruePositive+TrueNegative)/Positive+Negative.Therefore,accuracy=(a+d)/(a+b+c+d).Precisionorpositivepredictivevalue:Thisistheproportionofpositivecasesthatwerecorrectlyclassified.Itiscalculatedas(TruePositive)/(TruePositive+FalsePositive).Therefore,precision=a/(a+b).Negativepredictivevalue:Thisistheproportionofnegativecasesthatwereclassifiedcorrectly.ItiscalculatedasTrueNegative/(TrueNegative+FalseNegative).Therefore,negativepredictivevalue=d/(c+d).Sensitivity/truepositiverate/recall:Thisistheproportionoftheactualpositive
![Page 56: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/56.jpg)
casesthatwerecorrectlyidentified.ItiscalculatedasTruePositive/(TruePositive+FalseNegative).Therefore,sensitivity=a/(a+c).Specificity:Thisistheproportionoftheactualnegativecases.ItiscalculatedasTrueNegative/(FalsePositive+TrueNegative).Therefore,specificity=d/(b+d).F1score:Thisisthemeasureofatest’saccuracy,anditiscalculatedasfollows:F1=2.((Positivepredictivevalue(precision)*sensitivity(recall))/(Positivepredictivevalue(precision)+sensitivity(recall))).
![Page 57: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/57.jpg)
TheReceiverOperatingCharacteristics(ROC)graphROCisatwo-dimensionalplotofaclassifierwithfalsepositiverateonthexaxisandtruepositiverateontheyaxis.Thelowerpoint(0,0)inthefigurerepresentsneverissuingapositiveclassification.Point(0,1)representsperfectclassification.Thediagonalfrom(0,0)to(1,1)dividestheROCspace.Pointsabovethediagonalrepresentgoodclassificationresults,andpointsbelowthelinerepresentpoorresults,asshowninthefollowingdiagram:
![Page 58: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/58.jpg)
AreaundertheROCcurveThisistheareaundertheROCcurveandisalsoknownasAUC.Itisusedtomeasurethequalityoftheclassificationmodel.Inpractice,mostoftheclassificationmodelshaveanAUCbetween0.5and1.Thecloserthevalueisto1,thegreaterisyourclassifier.
![Page 59: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/59.jpg)
TheentropymatrixBeforegoingintothedetailsoftheentropymatrix,firstweneedtounderstandentropy.TheconceptofentropyininformationtheorywasdevelopedbyShannon.
Entropyisameasureofdisorderthatcanbeappliedtoaset.Itisdefinedas:
Entropy=-p1log(p1)–p2log(p2)-…….
Eachpistheprobabilityofaparticularpropertywithintheset.Let’srevisitourcustomerloanapplicationdataset.Forexample,assumingwehaveasetof10customersfromwhich6areeligibleforaloanand4arenot.Here,wehavetwoproperties(classes):eligibleornoteligible.
P(eligible)=6/10=0.6
P(noteligible)=4/10=0.4
So,entropyofthedatasetwillbe:
Entropy=-[0.6*log2(0.6)+0.4*log2(0.4)]
=-[0.6*-0.74+0.4*-1.32]
=0.972
Entropyisusefulinacquiringknowledgeofinformationgain.Informationgainmeasuresthechangeinentropyduetoanynewinformationbeingaddedinmodelcreation.So,ifentropydecreasesfromnewinformation,itindicatesthatthemodelisperformingwellnow.Informationgainiscalculatedas:
IG(classes,subclasses)=entropy(class)–(p(subclass1)*entropy(subclass1)+p(subclass2)*entropy(subclass2)+…)
Entropymatrixisbasicallythesameastheconfusionmatrixdefinedearlier;theonlydifferenceisthattheelementsinthematrixaretheaveragesofthelogoftheprobabilityscoreforeachtrueorestimatedcategorycombination.Agoodmodelwillhavesmallnegativenumbersalongthediagonalandwillhavelargenegativenumbersintheoff-diagonalposition.
![Page 60: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/60.jpg)
![Page 61: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/61.jpg)
SummaryWehavediscussedclassificationanditsapplicationsandalsowhatalgorithmandclassifierevaluationtechniquesaresupportedbyMahout.Wediscussedtechniqueslikeconfusionmatrix,ROCgraph,AUC,andentropymatrix.
Now,wewillmovetothenextchapterandsetupApacheMahoutandthedeveloperenvironment.WewillalsodiscussthearchitectureofApacheMahoutandfindoutwhyMahoutisagoodchoiceforclassification.
![Page 62: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/62.jpg)
![Page 63: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/63.jpg)
Chapter2.ApacheMahoutInthepreviouschapter,wediscussedclassificationandlookedintothealgorithmsprovidedbyMahoutinthisarea.Beforegoingtothosealgorithms,weneedtounderstandMahoutanditsinstallation.Inthischapter,wewillexplorethefollowingtopics:
WhatisApacheMahout?AlgorithmssupportedinMahoutWhyisitagoodchoiceforclassificationproblems?SettingupthesystemforMahoutdevelopment
![Page 64: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/64.jpg)
IntroducingApacheMahoutAmahoutisapersonwhoridesandcontrolsanelephant.MostofthealgorithmsinApacheMahoutareimplementedontopofHadoop,whichisanotherApache-licensedprojectandhasthesymbolofanelephant(http://hadoop.apache.org/).AsApacheMahoutridesoverHadoop,thisnameisjustified.
ApacheMahoutisaprojectofApacheSoftwareFoundationthathasimplementationsofmachinelearningalgorithms.MahoutwasstartedasasubprojectoftheApacheLuceneprojectin2008.Aftersometime,anopensourceprojectnamedTaste,whichwasdevelopedforcollaborativefiltering,anditwasabsorbedintoMahout.MahoutiswritteninJavaandprovidesscalablemachinelearningalgorithms.Mahoutisthedefaultchoiceformachinelearningproblemsinwhichthedataistoolargetofitintoasinglemachine.MahoutprovidesJavalibrariesanddoesnotprovideanyuserinterfaceorserver.Itisaframeworkoftoolstobeusedandadaptedbydevelopers.
Tosumitup,Mahoutprovidesyouwithimplementationsofthemostfrequentlyusedmachinelearningalgorithmsintheareaofclassification,clustering,andrecommendation.Insteadofspendingtimewritingalgorithms,itprovidesuswithready-to-consumesolutions.
MahoutusesHadoopforitsalgorithms,butsomeofthealgorithmscanalsorunwithoutHadoop.Currently,Mahoutsupportsthefollowingusecases:
Recommendation:Thistakestheuserdataandtriestopredictitemsthattheusermightlike.Withthisusecase,youcanseeallthesitesthataresellinggoodstotheuser.Basedonyourpreviousaction,theywilltrytofindoutunknownitemsthatcouldbeofuse.Oneexamplecanbethis:assoonasyouselectsomebookfromAmazon,thewebsitewillshowyoualistofotherbooksunderthetitle,CustomersWhoBoughtThisItemAlsoBought.Italsoshowsthetitle,WhatOtherItemsDoCustomersBuyAfterViewingThisItem?AnotherexampleofrecommendationisthatwhileplayingvideosonYouTube,itrecommendsthatyoulistentosomeothervideosbasedonyourselection.MahoutprovidesfullAPIsupporttodevelopyour
![Page 65: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/65.jpg)
ownuser-basedoritem-basedrecommendationengine.Classification:Asdefinedintheearlierchapter,classificationdecideshowmuchanitembelongstooneparticularcategory.E-mailclassificationforfilteringoutspamisaclassicexampleofclassification.MahoutprovidesarichsetofAPIstobuildyourownclassificationmodel.Forexample,Mahoutcanbeusedtobuildadocumentclassifierorane-mailclassifier.Clustering:Thisisatechniquethattriestogroupitemstogetherbasedonsomesortofsimilarity.Here,wefindthedifferentclustersofitemsbasedoncertainproperties,andwedonotknowthenameoftheclusterinadvance.Themaindifferencebetweenclusteringandclassificationisthatinclassification,weknowtheendclassname.Clusteringisusefulinfindingoutdifferentcustomersegments.GoogleNewsusestheclusteringtechniqueinordertogroupnews.Forclustering,Mahouthasalreadyimplementedsomeofthemostpopularalgorithmsinthisarea,suchask-means,fuzzyk-means,canopy,andsoon.Dimensionalreduction:Aswediscussedinthepreviouschapter,featuresarecalleddimensions.Dimensionalreductionistheprocessofreducingthenumberofrandomvariablesunderconsideration.Thismakesdataeasytouse.Mahoutprovidesalgorithmsfordimensionalreduction.SingularvaluedecompositionandLanczosareexamplesofthealgorithmsthatMahoutprovides.Topicmodeling:Topicmodelingisusedtocapturetheabstractideaofadocument.Atopicmodelisamodelthatassociatesprobabilitydistributionwitheachdocumentovertopics.Giventhatadocumentisaboutaparticulartopic,onewouldexpectparticularwordstoappearinthedocumentmoreorlessfrequently.“Football”and“goal”willappearmoreinadocumentaboutsports.LatentDirichletAllocation(LDA)isapowerfullearningalgorithmfortopicmodeling.InMahout,collapsedvariationalBayesisimplementedforLDA.
![Page 66: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/66.jpg)
![Page 67: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/67.jpg)
AlgorithmssupportedinMahoutTheimplementationofalgorithmsinMahoutcanbecategorizedintotwogroups:
Sequentialalgorithms:ThesealgorithmsareexecutedsequentiallyanddonotuseHadoopscalableprocessing.TheyareusuallytheonesderivedfromTaste.Forexample:user-basedcollaborativefiltering,logisticregression,HiddenMarkovModel,multi-layerperceptron,singularvaluedecomposition.Parallelalgorithms:ThesealgorithmscansupportpetabytesofdatausingHadoop’smapandhencereduceparallelprocessing.Forexample,RandomForest,NaïveBayes,canopyclustering,k-meansclustering,spectralclustering,andsoon.
![Page 68: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/68.jpg)
![Page 69: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/69.jpg)
ReasonsforMahoutbeingagoodchoiceforclassificationInmachinelearningsystems,themoredatayouuse,themoreaccuratethesystembuiltwillbe.Mahout,whichusesHadoopforscalability,iswayaheadofothersintermsofhandlinghugedatasets.Asthenumberoftrainingsetsincreases,Mahout’sperformancealsoincreases.Iftheinputsizefortrainingexamplesisfrom1millionto10million,thenMahoutisanexcellentchoice.
Forclassificationproblems,increaseddatafortrainingisdesirableasitcanimprovetheaccuracyofthemodel.Generally,asthenumberofdatasetsincreases,memoryrequirementalsoincreases,andalgorithmsbecomeslow,butMahout’sscalableandparallelalgorithmsworkbetterwithregardstothetimetaken.Eachnewmachineaddeddecreasesthetrainingtimeandprovideshigherperformance.
![Page 70: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/70.jpg)
![Page 71: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/71.jpg)
InstallingMahoutNowlet’strytheslightlychallengingpartofthisbook:Mahoutinstallation.Basedoncommonexperiences,Ihavecomeupwiththefollowingquestionsorconcernsthatusersfacebeforeinstallation:
IdonotknowanythingaboutMaven.HowwillIcompileMahoutbuild?HowcanIsetupEclipsetowritemyownprogramsinMahout?HowcanIinstallMahoutonaWindowssystem?
So,wewillinstallMahoutwiththehelpofthefollowingsteps.Eachstepisindependentfromtheother.Youcanchooseanyoneofthese:
BuildingMahoutcodeusingMavenSettingupadevelopmentenvironmentusingEclipseSettingupMahoutforaWindowsuser
Beforeanyofthesteps,someoftheprerequisitesare:
YoushouldhaveJavainstalledonyoursystem.Wikihowisagoodsourceforthisathttp://www.wikihow.com/Install-Java-on-LinuxYoushouldhaveHadoopinstalledonyoursystemfromthehttp://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleNodeSetup.htmlURL
![Page 72: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/72.jpg)
BuildingMahoutfromsourceusingMavenMahout’sbuildandreleasesystemisbasedonMaven.
InstallingMaven1. Createthefolder/usr/local/maven,asfollows:
mkdir/usr/local/maven
2. Downloadthedistributionapache-maven-x.y.z-bin.tar.gzfromtheMavensite(http://maven.apache.org/download.cgi)andmovethisto/usr/local/maven,asfollows:
mvapache-maven-x.y.z-bin.tar.gz/usr/local/maven
3. Unpacktothelocation/usr/local/maven,asfollows:
tar–xvfapache-maven-x.y.z-bin.tar.gz
4. Editthe.bashrcfile,asfollows:
exportM2_HOME=/usr/local/apache-maven-x.y.z
exportM2=$M2_HOME/bin
exportPATH=$M2:$PATH
NoteFortheEclipseIDE,gotoHelpandselectInstallnewSoftware.ClickontheAddbutton,andinthepopup,typethenameM2Eclipse,providethelinkhttp://download.eclipse.org/technology/m2e/releases,andclickonOK.
BuildingMahoutcodeBydefault,MahoutassumesthatHadoopisalreadyinstalledonthesystem.MahoutusestheHADOOP_HOMEandHADOOP_CONF_DIRenvironmentvariablestoaccessHadoopclusterconfigurations.ForsettingupMahout,executethefollowingsteps:
1. DownloadtheMahoutdistributionfilemahout-distribution-0.9-src.tar.gzfromthelocationhttp://archive.apache.org/dist/mahout/0.9/.
2. ChooseaninstallationdirectoryforMahout(/usr/local/Mahout),andplacethedownloadedsourceinthefolder.Extractthesourcecodeandensurethatthefoldercontainsthepom.xmlfile.Thefollowingistheexactlocationofthesource:
tar-xvfmahout-distribution-0.9-src.tar.gz
3. InstalltheMahoutMavenproject,andskipthetestcaseswhileinstalling,asfollows:
mvninstall-Dmaven.test.skip=true
4. SettheMAHOUT_HOMEenvironmentvariableinthe~/.bashrcfile,andupdatethePATHvariablewiththeMahoutbindirectory:
exportMAHOUT_HOME=/user/local/mahout/mahout-distribution-0.9
![Page 73: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/73.jpg)
exportPATH=$PATH:$MAHOUT_HOME/bin
5. TotesttheMahoutinstallation,executethecommand:mahout.Thiswilllisttheavailableprogramswithinthedistributionbundle,asshowninthefollowingscreenshot:
![Page 74: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/74.jpg)
SettingupadevelopmentenvironmentusingEclipseForthissetup,youshouldhaveMaveninstalledonthesystemandtheMavenpluginforEclipse.RefertotheInstallingMavenstepexplainedintheprevioussection.Thissetupcanbedoneinthefollowingsteps:
1. DownloadtheMahoutdistributionfilemahout-distribution-0.9-src.tar.gzfromthelocationhttp://archive.apache.org/dist/mahout/0.9/andunzipthis:
tarxzfmahout-distribution-0.9-src.tar.gz
2. Let’screateafoldernamedworkspaceunder/usr/local/workspace,asfollows:
mkdir/usr/local/workspace
3. Movethedownloadeddistributiontothisfolder(fromthedownloadsfolder),asfollows:
mvmahout-distribution-0.9/usr/local/workspace/
4. Movetothefolder/usr/local/workspace/mahout-distribution-0.9andmakeanEclipseproject(thiscommandcantakeuptoanhour):
mvneclipse:eclipse
5. SettheMahouthomeinthe.bashrcfile,asexplainedearlierintheBuildingMahoutcodesection.
6. NowopenEclipse.Selectthefile,importMaven,andExistingMavenProjects.Now,navigatetothelocationformahout-distribution-0.9andclickonFinish.
![Page 75: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/75.jpg)
![Page 76: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/76.jpg)
SettingupMahoutforaWindowsuserAWindowsusercanuseCygwin(alargecollectionofGNUandopensourcetoolsthatprovidesfunctionalitysimilartoaLinuxdistributiononWindows)tosetuptheirenvironment.Thereisalsoanotherwaythatiseasytouse,asshowninthefollowingsteps:
1. DownloadHortonworksSandboxforvirtualboxonyoursystemfromthelocationhttp://hortonworks.com/products/hortonworks-sandbox/#install.HortonworksSandboxonyoursystemwillbeapseudo-distributedmodeofHadoop.
2. Logintotheconsole.UseAlt+F5oralternativelydownloadPuttyandprovide127.0.0.1asthehostnameand2222intheport,asshowninthefollowingfigure.Loginwiththeusernamerootandpassword-hadoop.
3. Enterthefollowingcommand:
yuminstallmahout
Now,youwillseeascreenlikethis:
![Page 77: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/77.jpg)
4. Entery,andyourMahoutwillstartinstalling.Oncethisisdone,youcantestbytypingthecommandmahoutandthiswillshowyouthesamescreenasshownintheSettingupadevelopmentenvironmentusingEclipserecipeseenearlier.
![Page 78: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/78.jpg)
![Page 79: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/79.jpg)
SummaryWediscussedApacheMahoutindetailinthischapter.WecoveredtheprocessofinstallingMahoutonoursystem,alongwithsettingupadevelopmentenvironmentthatisreadytoexecuteMahoutalgorithms.WehavealsotakenalookatthereasonsbehindMahoutbeingconsideredagoodchoiceforclassification.Now,wemovetothenextwherewewillunderstandaboutlogisticregressionandlearnabouttheprocessthatneedstobefollowedtoexecuteourfirstalgorithminMahout.
![Page 80: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/80.jpg)
![Page 81: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/81.jpg)
Chapter3.LearningLogisticRegression/SGDUsingMahoutInsteadofjumpingdirectlyintologisticregression,let’strytounderstandafewofitsconcepts.Inthischapter,wewillexplorethefollowingtopics:
IntroducingregressionUnderstandinglinearregressionCostfunctionGradientdescentLogisticregressionUnderstandingSGDUsingMahoutforlogisticregression
![Page 82: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/82.jpg)
IntroducingregressionRegressionanalysisisusedforpredictionandforecasting.Itisusedtofindouttherelationshipbetweenexplanatoryvariablesandtargetvariables.Essentially,itisastatisticalmodelthatisusedtofindouttherelationshipamongvariablespresentinthedatasets.Anexamplethatyoucanrefertoforabetterunderstandingofthistermisthis:determinetheearningsofworkersinaparticularindustry.Here,wewilltrytofindoutthefactorsthataffectaworker’ssalary.Thesefactorscanbeage,education,yearsofexperience,particularskillset,location,andsoon.Wewilltrytomakeamodelthatwilltakeallthesevariablesintoconsiderationandtrytopredictthesalary.Inregressionanalysis,wecharacterizethevariationofthetargetvariablearoundtheregressionfunction,whichcanbedescribedbyaprobabilitydistributionthatisalsoofinterest.Thereareanumberofregressionanalysistechniquesthatareavailable.Forexample,linearregression,ordinaryleastsquaresregression,logisticregression,andsoon.
![Page 83: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/83.jpg)
UnderstandinglinearregressionInlinearregression,wecreateamodeltopredictthevalueofatargetvariablewiththehelpofanexplanatoryvariable.Tounderstandthisbetter,let’slookatanexample.
AcompanyXthatdealsinsellingcoffeehasnoticedthatinthemonthofmonsoon,theirsalesincreasedtoquiteanextent.Sotheyhavecomeupwithaformulatofindtherelationbetweenrainandtheirpercupcoffeesale,whichisshownasfollows:
C=1.5R+800
So,for2mmofrain,thereisademandof803cupsofcoffee.Nowifyougointominutedetails,youwillrealizethatwehavethedataforrainfallandpercupcoffeesale,andwearetryingtobuildamodelthatcanpredictthedemandforcoffeebasedontherainfall.Wehavedataintheformof(R1,C1),(R2,C2)….(Ri,Ci).Here,wewillbuildthemodelinamannerthatkeepstheerrorintheactualandpredictedvaluesataminimum.
CostfunctionIntheequationC=1.5R+800,thetwovalues1.5and800areparametersandthesevaluesaffecttheendresult.WecanwritethisequationasC=p0+p1R.Aswediscussedearlier,ourgoalistoreducethedifferencebetweentheactualvalueandthepredictedvalue,andthisisdependentonthevaluesofp0andp1.Let’sassumethatthepredictedvalueisCpandtheactualvalueisCsothatthedifferencewillbe(Cp-C).Thiscanbewrittenas(p0+p1R-C).Tominimizethiserror,wedefinetheerrorfunction,whichisalsocalledthecostfunction.
Thecostfunctioncanbedefinedwiththefollowingformula:
Here,iistheithsampleandNisthenumberoftrainingexamples.Wecalculatecostsfordifferentsetsofp0andp1andfinallyselectthep0andp1thatgivestheleastcost(C).Thisisthemodelthatwillbeusedtomakepredictionsfornewinput.
GradientdescentGradientdescentstartswithaninitialsetofparametervalues,p0andp1,anditerativelymovestowardsasetofparametervaluesthatminimizesthecostfunction.Wecanvisualizethiserrorfunctiongraphically,wherewidthandlengthcanbeconsideredastheparametersp0andp1andheightasthecostfunction.Ourgoalistofindthevaluesforp0andp1inawaythatourcostfunctionwillbeminimal.Westartthealgorithmwithsomevaluesofp0andp1anditerativelyworktowardstheminimumvalue.Agoodwaytoensurethatthegradientdescentisworkingcorrectlyistomakesurethatthecostfunctiondecreasesforeachiteration.Inthiscase,thecostfunctionsurfaceisconvexandwewilltrytofindouttheminimumvalue.Thiscanbeseeninthefollowingfigure:
![Page 84: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/84.jpg)
![Page 85: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/85.jpg)
LogisticregressionLogisticregressionisusedtoascertaintheprobabilityofanevent.Generally,logisticregressionreferstoproblemswheretheoutcomeisbinary,forexample,inbuildingamodelthatisbasedonacustomer’sincome,traveluses,gender,andotherfeaturestopredictwhetherheorshewillbuyaparticularcarornot.So,theanswerwillbeasimpleyesorno.Whentheoutcomeiscomposedofmorethanonecategory,thisiscalledmultinomiallogisticregression.
Logisticregressionisbasedonthesigmoidfunction.Predictorvariablesarecombinedwithlinearweightandthenpassedtothisfunction,whichgeneratestheoutputintherangeof0–1.Anoutputcloseto1indicatesthatanitembelongstoacertainclass.Let’sfirstunderstandthesigmoidorlogisticfunction.Itcanbedefinedbythefollowingformula:
F(z)=1/1+e(-z)
Withasingleexplanatoryvariable,zwillbedefinedasz=β0+β1*x.Thisequationisexplainedasfollows:
z:Thisiscalledthedependentvariable.Thisisthevariablethatwewouldliketopredict.Duringthecreationofthemodel,wehavethisvariablewithusinthetrainingset,andwebuildthemodeltopredictthisvariable.Theknownvaluesofzarecalledobservedvalues.x:Thisistheexplanatoryorindependentvariable.Thesevariablesareusedtopredictthedependentvariablez.Forexample,topredictthesalesofanewlylaunchedproductataparticularlocation,wemightincludeexplanatoryvariablessuchasthepriceoftheproduct,theaverageincomeofthepeopleofthatlocation,andsoon.β0:Thisiscalledtheregressionintercept.Ifallexplanatoryvariablesarezero,thenthisparameterisequaltothedependentvariablez.β1:Thesearevaluesforeachexplanatoryvariable.
Thegraphofthelogisticfunctionisasfollows:
![Page 86: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/86.jpg)
Withalittlebitofmathematics,wecanchangethisequationasfollows:
ln(F(x)/(1-F(x))=β0+β1*x
Inthecaseoflinearregression,thecostfunctiongraphwasconvex,buthere,itisnotgoingtobeconvex.Findingtheminimumvaluesforparametersinawaythatourpredictedoutputisclosetotheactualonewillbedifficult.Inacostfunction,whilecalculatingforlogisticregression,wewillreplaceourCpvalueoflinearregressionwiththefunctionF(z).Tomakeconvexlogisticregressioncostfunctions,wewillreplace(p0+p1Ri-Ci)2withoneofthefollowing:
log(1/1+e(-(β0+β1*x)))iftheactualoccurrenceofaneventis1,thisfunctionwillrepresentthecost.log(1-(1/1+e(-(β0+β1*x))))iftheactualoccurrenceofaneventis0,thisfunctionwillrepresentthecost.
Wewillhavetorememberthatinlogisticregression,wecalculatetheclassprobability.So,iftheprobabilityofaneventoccurring(customerbuyingacar,beingdefrauded,andsoon)isp,theprobabilityofnon-occurrenceis1-p.
![Page 87: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/87.jpg)
StochasticGradientDescentGradientdescentminimizesthecostfunction.Forverylargedatasets,gradientdescentisaveryexpensiveprocedure.StochasticGradientDescent(SGD)isamodificationofthegradientdescentalgorithmtohandlelargedatasets.Gradientdescentcomputesthegradientusingthewholedataset,whileSGDcomputesthegradientusingasinglesample.So,gradientdescentloadsthefulldatasetandtriestofindoutthelocalminimumonthegraphandthenrepeatthefullprocessagain,whileSGDadjuststhecostfunctionforeverysample,onebyone.AmajoradvantagethatSGDhasovergradientdescentisthatitsspeedofcomputationisawholelotfaster.LargedatasetsinRAMgenerallycannotbeheldasthestorageislimited.InSGD,theburdenontheRAMisreduced,whereineachsampleorbatchofsamplesareloadedandworkedwith,theresultsforwhicharestored,andsoon.
![Page 88: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/88.jpg)
UsingMahoutforlogisticregressionMahouthasimplementationsforlogisticregressionusingSGD.Itisveryeasytounderstandanduse.Solet’sgetstarted.
Dataset
WewillusetheWisconsinDiagnosticBreastCancer(WDBC)dataset.Thisisadatasetforbreastcancertumorsanddataisavailablefrom1995onwards.Ithas569instancesofbreasttumorcasesandhas30featurestopredictthediagnosis,whichiscategorizedaseitherbenignormalignant.
NoteMoredetailsontheprecedingdatasetisavailableathttp://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names.
Preparingthetrainingandtestdata
Youcandownloadthewdbc.datadatasetfromhttp://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.
Now,saveitasaCSVfileandincludethefollowingheaderline:ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,worstconcavepoints,worstsymmentry,worstfractaldimensions
Now,wewillhavetoperformthefollowingstepstopreparethisdatatobeusedbytheMahoutlogisticregressionalgorithm:
1. Wewillmakethetargetclassnumeric.Inthiscase,thesecondfielddiagnosisisthetargetvariable.Wewillchangemalignantto0andbenignto1.Usethefollowingcodesnippettointroducethechanges.Wecanusethisstrategyforsmalldatasets,butforhugedatasets,wehavedifferentstrategies,whichwewillcoverinChapter4,LearningtheNaïveBayesClassificationUsingMahout:
publicvoidconvertTargetToInteger()throwsIOException{
//Readthedata
BufferedReaderbr=newBufferedReader(newFileReader("wdbc.csv"));
Stringline=null;
//Createthefiletosavetheresulteddata
FilewdbcData=newFile("<YourDestinationlocationforfile.>");
FileWriterfw=newFileWriter(wdbcData);
//Weareaddingheadertothenewfile
fw.write("ID_Number"+","+"Diagnosis"+","+"Radius"+","+"Texture"+","+"Pe
rimeter"+","+"Area"+","+"Smoothness"+","+"Compactness"+","+"Concavity"+
","+"ConcavePoints"+","+"Symmetry"+","+"Fractal_Dimension"+","+"RadiusS
tdError"+","+"TextureStdError"+","+"PerimeterStdError"+","+"AreaStdErro
r"+","+"SmoothnessStdError"+","+"CompactnessStdError"+","+"ConcavityStd
Error"+","+"ConcavePointStdError"+","+"Symmetrystderror"+","+"FractalDi
mensionStderror"+","+"WorstRadius"+","+"worsttexture"+","+"worstperimet
er"+","+"worstarea"+","+"worstsmoothness"+","+"worstcompactness"+","+"w
orstconcavity"+","+"worstconcavepoints"+","+"worstsymmentry"+","+"worst
fractaldimensions"+"\n");
![Page 89: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/89.jpg)
/*Inthewhileloopwearereadinglinebylineandcheckingthelast
field-parts[1]andchangingittonumericvalueaccordingly*/
while((line=br.readLine())!=null){
String[]parts=line.split(",");
if(parts[1].equals("M")){
fw.write(parts[0]+","+"0"+","+parts[2]+","+parts[3]+","+parts[4]+","+pa
rts[5]+","+parts[6]+","+parts[7]+","+parts[8]+","+parts[9]+","+parts[10
]+","+parts[11]+","+parts[12]+","+parts[13]+","+parts[14]+","+parts[15]
+","+parts[16]+","+parts[17]+","+parts[18]+","+parts[19]+","+parts[20]+
","+parts[21]+","+parts[22]+","+parts[23]+","+parts[24]+","+parts[25]+"
,"+parts[26]+","+parts[27]+","+parts[28]+","+parts[29]+","+parts[30]+",
"+parts[31]+"\n");
}
if(parts[1].equals("B")){
fw.write(parts[0]+","+"1"+","+parts[2]+","+parts[3]+","+parts[4]+","+pa
rts[5]+","+parts[6]+","+parts[7]+","+parts[8]+","+parts[9]+","+parts[10
]+","+parts[11]+","+parts[12]+","+parts[13]+","+parts[14]+","+parts[15]
+","+parts[16]+","+parts[17]+","+parts[18]+","+parts[19]+","+parts[20]+
","+parts[21]+","+parts[22]+","+parts[23]+","+parts[24]+","+parts[25]+"
,"+parts[26]+","+parts[27]+","+parts[28]+","+parts[29]+","+parts[30]+",
"+parts[31]+"\n");
}
}
fw.close();
br.close();
}
TipDownloadingtheexamplecode
Youcandownloadtheexamplecodefilesfromyouraccountathttp://www.packtpub.comforallthePacktPublishingbooksyouhavepurchased.Ifyoupurchasedthisbookelsewhere,youcanvisithttp://www.packtpub.com/supportandregistertohavethefilese-maileddirectlytoyou.
2. Wewillhavetosplitthedatasetintotrainingandtestdatasetsandthenshufflethedatasetssothatwecanmixthemup,whichcanbedoneusingthefollowingcodesnippet:
publicvoiddataPrepration()throwsException{
//Readingthedatasetcreatedbyearliermethod
convertTargetToIntegerandhereweareusinggoogleguavaapi's.
List<String>result=
Resources.readLines(Resources.getResource("wdbc.csv"),Charsets.UTF_8);
//Thisistoremoveheaderbeforetherandomizationprocess.
Otherwiseitcanappearinthemiddleofdataset.
List<String>raw=result.subList(1,570);
Randomrandom=newRandom();
//Shufflingthedataset.
Collections.shuffle(raw,random);
//Splittingdatasetintotrainingandtestexamples.
![Page 90: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/90.jpg)
List<String>train=raw.subList(0,470);
List<String>test=raw.subList(470,569);
FiletrainingData=newFile("<yourLocation>/wdbcTrain.csv");
FiletestData=newFile("<yourLocation>/wdbcTest.csv");
writeCSV(train,trainingData);
writeCSV(test,testData);
}
//Thismethodiswritingthelisttodesiredfilelocation.
publicvoidwriteCSV(List<String>list,Filefile)throwsIOException{
FileWriterfw=newFileWriter(file);
fw.write("ID_Number"+","+"Diagnosis"+","+"Radius"+","+"Texture"+","+"Pe
rimeter"+","+"Area"+","+"Smoothness"+","+"Compactness"+","+"Concavity"+
","+"ConcavePoints"+","+"Symmetry"+","+"Fractal_Dimension"+","+"RadiusS
tdError"+","+"TextureStdError"+","+"PerimeterStdError"+","+"AreaStdErro
r"+","+"SmoothnessStdError"+","+"CompactnessStdError"+","+"ConcavityStd
Error"+","+"ConcavePointStdError"+","+"Symmetrystderror"+","+"FractalDi
mensionStderror"+","+"WorstRadius"+","+"worsttexture"+","+"worstperimet
er"+","+"worstarea"+","+"worstsmoothness"+","+"worstcompactness"+","+"w
orstconcavity"+","+"worstconcavepoints"+","+"worstsymmentry"+","+"worst
fractaldimensions"+"\n");
for(inti=0;i<list.size();i++){
fw.write(list.get(i)+"\n");
}
fw.close();
}
Trainingthemodel
Wewillusethetrainingdatasetandtrainlogisticalgorithmtopreparethemodel.Usethefollowingcommandtocreatethemodel:
mahouttrainlogistic--input/tmp/wdbcTrain.csv--output/tmp//model--
targetDiagnosis--categories2--predictorsRadiusTexturePerimeterArea
SmoothnessCompactnessConcavityConcavePointsSymmetryFractal_Dimension
RadiusStdErrorTextureStdErrorPerimeterStdErrorAreaStdError
SmoothnessStdErrorCompactnessStdErrorConcavityStdError
ConcavePointStdErrorSymmetrystderrorFractalDimensionStderrorWorstRadius
worsttextureworstperimeterworstareaworstsmoothnessworstcompactness
worstconcavityworstconcavepointsworstsymmentryworstfractaldimensions--
typesnumeric--features30--passes90--rate300
Thiscommandwillgiveyouthefollowingoutput:
![Page 91: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/91.jpg)
Let’sunderstandtheparametersusedinthiscommand:
trainlogistic:ThisisthealgorithmthatMahoutprovidestobuildthemodelusingyourinputparameters.input:Thisisthelocationoftheinputfile.output:Thisisthelocationofthemodelfile.target:Thisisthenameofthetargetvariablethatwewanttopredictfromthedataset.categories:Thisreferstothenumberofpredictedclasses.predictors:Thisfeaturesinthedatasetusedtopredictthetargetvariable.types:Thisisalistofthetypesofpredictorvariables.(Hereallarenumericbutitcouldbewordortextaswell.)features:Thisisthesizeofthefeaturevectorusedtobuildthemodel.passes:Thisspecifiesthenumberoftimestheinputdatashouldbere-examinedduringtraining.Smallinputfilesmayneedtobeexamineddozensoftimes.Verylargeinputfilesprobablydon’tevenneedtobecompletelyexamined.rate:Thissetstheinitiallearningrate.Thiscanbelargeifyouhavelotsofdataoruselotsofpassesbecauseitdecreasesprogressivelyasdataisexamined.
Nowourmodelisreadytomoveontothenextstepofevaluation.Toevaluatethemodelfurther,wecanusethesamedatasetandchecktheconfusionandAUCmatrix.Thecommandforthiswillbeasfollows:
mahoutrunlogistic--input/tmp/wdbcTrain.csv--model/tmp//model--auc--
confusion
runlogistic:Thisisthealgorithmtorunthelogisticregressionmodeloveraninputdatasetmodel:Thisisthelocationofthemodelfileauc:ThisprintstheAUCscoreforthemodelversustheinputdataafterthedataisreadconfusion:Thisprintstheconfusionmatrixforaparticularthreshold
Theoutputofthepreviouscommandisshowninthefollowingscreenshot:
![Page 92: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/92.jpg)
Now,thesematricesshowthatthemodelisnotbad.Having0.88asthevalueforAUCisgood,butwewillcheckthisontestdataaswell.Theconfusionmatrixinformsusthatoutof172malignanttumors,ithascorrectlyclassified151instancesandthat34benigntumorsarealsoclassifiedasmalignant.Inthecaseofbenigntumors,outof298,ithascorrectlyclassified264.
Ifthemodeldoesnotprovidegoodresults,wehaveanumberofoptions.
Changetheparametersinthefeaturevector,increasingthemifweareselectingfewfeatures.Thisshouldbedoneoneatatime,andweshouldtesttheresultagainwitheachgeneratedmodel.WeshouldgetamodelwhereAUCiscloseto1.
Let’srunthesamealgorithmontestdataaswell:
mahoutrunlogistic--input/tmp/wdbcTest.csv--model/tmp//model--auc–
confusion
Sothismodelworksalmostthesameontestdataaswell.Ithasclassified34outofthe40malignanttumorscorrectly.
![Page 93: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/93.jpg)
![Page 94: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/94.jpg)
SummaryInthischapter,wediscussedlogisticregressionandhowwecanusethisalgorithmavailableinApacheMahout.WeusedtheWisconsinDiagnosticBreastCancerdatasetandrandomlybrokeitintotwodatasets:onefortrainingandtheotherfortesting.WecreatedthelogisticregressionmodelusingMahoutandalsorantestdataoverthismodel.Now,wewillmoveontothenextchapterwhereyouwilllearnabouttheNaïveBayesclassificationandalsothemostfrequentlyusedclassificationtechnique:textclassification.
![Page 95: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/95.jpg)
![Page 96: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/96.jpg)
Chapter4.LearningtheNaïveBayesClassificationUsingMahoutInthischapter,wewillusetheNaïveBayesclassificationalgorithmtoclassifyasetofdocuments.Classifyingtextdocumentsisalittletrickybecauseofthedatapreparationstepsinvolved.Inthischapter,wewillexplorethefollowingtopics:
ConditionalprobabilityandtheBayesruleUnderstandingtheNaïveBayesalgorithmUnderstandingtermsusedintextclassificationUsingtheNaïveBayesalgorithminApacheMahout
![Page 97: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/97.jpg)
IntroducingconditionalprobabilityandtheBayesruleBeforelearningtheNaïveBayesalgorithm,youshouldhaveanunderstandingofconditionalprobabilityandtheBayesrule.
Inverysimpleterms,conditionalprobabilityistheprobabilitythatsomethingwillhappen,giventhatsomethingelsehasalreadyhappened.ItisexpressedasP(A/B),whichcanbereadasprobabilityofAgivenB,anditfindstheprobabilityoftheoccurrenceofeventAonceeventBhasalreadyhappened.
Mathematically,itisdefinedasfollows:
Forexample,ifyouchooseacardfromastandardcarddeckandifyouwereaskedabouttheprobabilityforthecardtobeadiamond,youwouldquicklysay13/52or0.25,asthereare13diamondcardsinthedeck.However,ifyouthenlookatthecardanddeclarethatitisred,thenwewillhavenarrowedthepossibilitiesforthecardto26possiblecards,andtheprobabilitythatthecardisadiamondnowis13/26=0.5.So,ifwedefineAasadiamondcardandBasaredcard,thenP(A/B)willbetheprobabilityofthecardbeingadiamond,givenitisred.
Sometimes,foragivenpairofevents,conditionalprobabilityishardtocalculate,andBayes’theoremhelpsusherebygivingtherelationshipbetweentwoconditionalprobabilities.
Bayes’theoremisdefinedasfollows:
Thetermsintheformulaaredefinedasfollows:
P(A):ThisiscalledpriorprobabilityorpriorP(B/A):ThisiscalledconditionalprobabilityorlikelihoodP(B):ThisiscalledmarginalprobabilityP(A/B):Thisiscalledposteriorprobabilityorposterior
Thefollowingformulaisderivedonlyfromtheconditionalprobabilityformula.WecandefineP(B/A)asfollows:
![Page 98: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/98.jpg)
Whenrearranged,theformulabecomesthis:
Now,fromtheprecedingconditionalprobabilityformula,wegetthefollowing:
Let’stakeanexamplethatwillhelpustounderstandhowBayes’theoremisapplied.
Acancertestgivesapositiveresultwithaprobabilityof97percentwhenthepatientisindeedaffectedbycancer,whileitgivesanegativeresultwith99percentprobabilitywhenthepatientisnotaffectedbycancer.Ifapatientisdrawnatrandomfromapopulationwhere0.2percentoftheindividualsareaffectedbycancerandheorsheisfoundtobepositive,whatistheprobabilitythatheorsheisindeedaffectedbycancer?Inprobabilisticterms,whatweknowaboutthisproblemcanbedefinedasfollows:
P(positive|cancer)=0.97
P(positive|nocancer)=1-0.99=0.01
P(cancer)=0.002
P(nocancer)=1-0.002=0.998
P(positive)=P(positive|cancer)P(cancer)+P(positive|nocancer)P(nocancer)
=0.97*0.002+0.01*0.998
=0.01192
NowP(cancer|positive)=(0.97*0.002)/0.01192=0.1628
Soevenwhenfoundpositive,theprobabilityofthepatientbeingaffectedbycancerinthisexampleisaround16percent.
![Page 99: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/99.jpg)
![Page 100: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/100.jpg)
UnderstandingtheNaïveBayesalgorithmInBayes’theorem,wehaveseenthattheoutcomeisbasedonlyononeevidence,butinclassificationproblems,wehavemultipleevidencesandwehavetopredicttheoutcome.InNaïveBayes,weuncouplemultiplepiecesofevidenceandtreateachoneofthemindependently.Itisdefinedasfollows:
P(outcome|multipleEvidence))=P(Evidence1|outcome)*P(Evidence2|outcome)*P(Evidence3|outcome)…./P(Evidence)
Runthisformulaforeachpossibleoutcome.Sincewearetryingtoclassify,eachoutcomewillbecalledaclass.Ourtaskistolookattheevidence(features)toconsiderhowlikelyitisforittobeofaparticularclassandthenassignitaccordingly.Theclassthathasthehighestprobabilitygetsassignedtothatcombinationofevidences.Let’sunderstandthiswithanexample.
Let’ssaythatwehavedataon1,000piecesoffruit.Theyhappentobebananas,apples,orsomeotherfruit.Weareawareofthreecharacteristicsofeachfruit:
Size:TheyareeitherlongornotlongTaste:TheyareeithersweetornotsweetColor:Theyareeitheryellowornotyellow
Assumethatwehaveadatasetlikethefollowing:
Fruittype Taste–sweet Taste–notsweet Color–yellow Color–notyellow Size–long Size–notlong Total
Banana 350 150 450 50 400 100 500
Apple 150 150 100 200 0 300 300
Other 150 50 50 150 100 100 200
Total 650 350 600 400 500 500 1000
Nowlet’slookatthethingswehave:
P(Banana)=500/1000=0.5
P(Apple)=300/1000=0.3
P(Other)=200/1000=0.2
Let’slookattheprobabilityofthefeatures:
P(Sweet)=650/1000=0.65
P(Yellow)=600/1000=0.6
P(long)=500/1000=0.5
P(notSweet)=350/1000=0.35
P(notyellow)=400/1000=0.4
![Page 101: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/101.jpg)
P(notlong)=500/1000=0.5
Nowwewanttoknowwhatfruitwewillhaveifitisnotyellowandnotlongandsweet.Theprobabilityofitbeinganappleisasfollows:
P(Apple|sweet,notlong,notyellow)=P(sweet|Apple)*P(notlong|Apple)*P(notyellow|Apple)*P(Apple)/P(sweet)*P(notlong)*P(notyellow)
=0.5*1*0.67*0.3/P(Evidence)
=0.1005/P(Evidence)
Theprobabilityofitbeingabananaisthis:
P(banana|sweet,notlong,notyellow)=P(sweet|banana)*P(notlong|banana)*P(notyellow|banana)*P(banana)/P(sweet)*P(notlong)*P(notyellow)
=0.7*0.2*0.1*0.5/P(Evidence)
=0.007/P(Evidence)
Theprobabilityofitbeinganyotherfruitisasfollows:
P(otherfruit|sweet,notlong,notyellow)=P(sweet|otherfruit)*P(notlong|otherfruit)*P(notyellow|otherfruit)*P(otherfruit)/P(sweet)*P(notlong)*P(notyellow)
=0.75*0.5*0.75*0.2/P(Evidence)
=0.05625/P(Evidence)
Sofromtheresults,youcanseethatifthefruitissweet,notlong,andnotyellow,thenthehighestprobabilityisthatitwillbeanapple.Sofindoutthehighestprobabilityandassigntheunknownitemtothatclass.
NaïveBayesisaverygoodchoicefortextclassification.BeforewemoveontotextclassificationusingNaïveBayesinMahout,let’sunderstandafewtermsthatarereallyusefulfortextclassification.
![Page 102: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/102.jpg)
![Page 103: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/103.jpg)
UnderstandingthetermsusedintextclassificationTopreparedatasothatitcanbeusedbyaclassifierisacomplexprocess.Fromrawdata,wecancollectexplanatoryandtargetvariablesandencodethemasvectors,whichistheinputoftheclassifier.
Vectorsareorderedlistsofvaluesasdefinedintwo-dimensionalspace.Youcantakeacluefromcoordinategeometryaswell.Apoint(3,4)isapointinthexandyplanes.InMahout,itisdifferent.Here,avectorcanhave(3,4)or10,000dimensions.
Mahoutprovidessupportforcreatingvectors.TherearetwotypesofvectorimplementationsinMahout:sparseanddensevectors.Thereareafewtermsthatweneedtounderstandfortextclassification:
Bagofwords:Thisconsiderseachdocumentasacollectionofwords.Thisignoreswordorder,grammar,andpunctuation.So,ifeverywordisafeature,thencalculatingthefeaturevalueofthedocumentwordisrepresentedasatoken.Itisgiventhevalue1ifitispresentor0ifnot.Termfrequency:Thisconsidersthewordcountinthedocumentinsteadof0and1.Sotheimportanceofawordincreaseswiththenumberoftimesitappearsinthedocument.Considerthefollowingexamplesentence:
ApplehaslaunchediPhoneanditwillcontinuetolaunchsuchproducts.OthercompetitorsarealsoplanningtolaunchproductssimilartothatofiPhone.
Thefollowingisthetablethatrepresentstermfrequency:
Term Count
Apple 1
Launch 3
iPhone 2
Product 2
Plan 1
Thefollowingtechniquesareusuallyappliedtocomeupwiththistypeoftable:
Stemmingofwords:Withthis,thesuffixisremovedfromthewordso“launched”,“launches”,and“launch”areallconsideredas“launch”.Casenormalization:Withthis,everytermisconvertedtolowercase.Stopwordremoval:Therearesomewordsthatarealmostpresentineverydocument.Wecallthesewordsstopwords.Duringanimportantfeatureextractionfromadocument,thesewordscomeintoaccountandtheywillnotbehelpfulinthe
![Page 104: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/104.jpg)
overallcalculation.Examplesofthesewordsare“is,are,the,that,andsoon.”So,whileextracting,wewillignorethesekindofwords.Inversedocumentfrequency:Thisisconsideredastheboostatermgetsforbeingrare.Atermshouldnotbetoocommon.Ifatermoccursineverydocument,itisnotgoodforclassification.Thefewerdocumentsinwhichatermoccurs,themoresignificantitislikelytobeforthedocumentsitdoesoccurin.Foratermt,inversedocumentfrequencyiscalculatedasfollows:
IDF(t)=1+log(totalnumberofdocuments/numberofdocumentscontainingt)
Termfrequencyandinversetermfrequency:Thisisoneofthepopularrepresentationsofthetext.Itistheproductoftermfrequencyandinversedocumentfrequency,asfollows:
TFIDF(t,d)=TF(t,d)*IDF(t)
Eachdocumentisafeaturevectorandacollectionofdocumentsisasetofthesefeaturevectorsandthissetworksastheinputfortheclassification.Nowthatweunderstandthebasicconceptsbehindthevectorcreationoftextdocuments,let’smoveontothenextsectionwherewewillclassifytextdocumentsusingtheNaïveBayesalgorithm.
![Page 105: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/105.jpg)
![Page 106: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/106.jpg)
UsingtheNaïveBayesalgorithminApacheMahoutWewilluseadatasetof20newsgroupsforthisexercise.The20newsgroupsdatasetisastandarddatasetcommonlyusedformachinelearningresearch.Thedataisobtainedfromtranscriptsofseveralmonthsofpostingsmadein20Usenetnewsgroupsfromtheearly1990s.Thisdatasetconsistsofmessages,oneperfile.Eachfilebeginswithheaderlinesthatspecifythingssuchaswhosentthemessage,howlongitis,whatkindofsoftwarewasused,andthesubject.Ablanklinefollowsandthenthemessagebodyfollowsasunformattedtext.
Downloadthe20news-bydate.tar.gzdatasetfromhttp://qwone.com/~jason/20Newsgroups/.ThefollowingstepsareusedtobuildtheNaïveBayesclassifierusingMahout:
1. Createa20newsdatadirectoryandunzipthedatahere:
mkdir/tmp/20newsdata
cd/tmp/20newsdata
tar–xzvf/tmp/20news-bydate.tar.gz
2. Youwillseetwofoldersunder20newsdata:20news-bydate-testand20news-bydate-train.Nowcreateanotherdirectorycalled20newsdataallandmergeboththetrainingandtestdataofthe20newsgroups.
3. Comeoutofthedirectoryandmovetothehomedirectoryandexecutethefollowing:
mkdir/tmp/20newsdataall
cp–R/20newsdata/*/*/tmp/20newsdataall
4. CreateadirectoryinHadoopandsavethisdatainHDFSformat:
hadoopfs–mkdir/user/hue/20newsdata
hadoopfs–put/tmp/20newsdataall/user/hue/20newsdata
5. Converttherawdataintoasequencefile.Theseqdirectorycommandwillgeneratesequencefilesfromadirectory.SequencefilesareusedinHadoop.Asequencefileisaflatfilethatconsistsofbinarykey/valuepairs.WeareconvertingthefilesintosequencefilessothatitcanbeprocessedinHadoop,whichcanbedoneusingthefollowingcommand:
bin/mahoutseqdirectory-i/user/hue/20newsdata/20newsdataall-o
/user/hue/20newsdataseq-out
Theoutputoftheprecedingcommandcanbeseeninthefollowingscreenshot:
![Page 107: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/107.jpg)
6. Convertthesequencefileintoasparsevectorusingthefollowingcommand:
bin/mahoutseq2sparse-i/user/hue/20newsdataseq-out/part-m-00000-o
/user/hue/20newsdatavec-lnorm-nv-wttfidf
Thetermsusedintheprecedingcommandareasfollows:
lnorm:Thisisfortheoutputvectortobelognormalizednv:Thisreferstonamedvectorswt:Thisreferstothekindofweighttouse;here,weusetfidf
Theoutputoftheprecedingcommandontheconsoleisshowninthefollowingscreenshot:
7. Splitthesetofvectorstotrainandtestthemodel:
bin/mahoutsplit-i/user/hue/20newsdatavec/tfidf-vectors--
trainingOutput/user/hue/20newsdatatrain--testOutput
![Page 108: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/108.jpg)
/user/hue/20newsdatatest--randomSelectionPct40--overwrite--
sequenceFiles-xmsequential
Thetermsusedintheprecedingcommandareasfollows:
randomSelectionPct:Thisdividesthepercentageofdataintotestingandtrainingdatasets.Here,60percentisfortestingand40percentfortraining.xm:Thisreferstotheexecutionmethodtouse:sequentialormapreduce.Thedefaultismapreduce.
8. Nowtrainthemodel:
bin/mahouttrainnb-i/user/hue/20newsdatatrain-el-o/user/hue/model
-li/user/hue/labelindex-ow-c
9. Testthemodelusingthefollowingcommand:
bin/mahouttestnb-i/user/hue/20newsdatatest-m/user/hue/model/-l
/user/hue/labelindex-ow-o/user/hue/results
Theoutputoftheprecedingcommandontheconsoleisshowninthefollowingscreenshot:
![Page 109: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/109.jpg)
WegettheresultofourNaïveBayesclassifierforthe20newsgroups.
![Page 110: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/110.jpg)
![Page 111: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/111.jpg)
SummaryInthischapter,wediscussedtheNaïveBayesalgorithm.Thisalgorithmisasimplisticyethighlyregardedstatisticalmodelthatiswidelyusedinbothindustryandacademia,anditproducesgoodresultsonmanyoccasions.WeinitiallydiscussedconditionalprobabilityandtheBayesrule.WethensawanexampleoftheNaïveBayesalgorithm.Youlearnedabouttheapproachestoconverttextintoavectorformat,whichisaninputforclassifiers.Finally,weusedthe20newsgroupsdatasettobuildaclassifierusingtheNaïveBayesalgorithminMahout.Inthenextchapter,wewillcontinueourjourneyofexploringclassificationalgorithmsinMahoutwiththeHiddenMarkovmodelimplementation.
![Page 112: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/112.jpg)
![Page 113: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/113.jpg)
Chapter5.LearningtheHiddenMarkovModelUsingMahoutInthischapter,wewillcoveroneofthemostinterestingtopicsofclassificationtechniques:theHiddenMarkovModel(HMM).TounderstandtheHMM,wewillcoverthefollowingtopicsinthischapter:
DeterministicandnondeterministicpatternsTheMarkovprocessIntroducingtheHMMUsingMahoutfortheHMM
![Page 114: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/114.jpg)
DeterministicandnondeterministicpatternsInadeterministicsystem,eachstateissolelydependentonthestateitwaspreviouslyin.Forexample,let’stakethecaseofasetoftrafficlights.Thesequenceoflightsisred→green→amber→red.So,hereweknowwhatstatewillfollowafterthecurrentstate.Oncethetransitionsareknown,deterministicsystemsareeasytounderstand.
Fornondeterministicpatterns,consideranexampleofapersonnamedBobwhohashissnacksat4:00P.M.everyday.Let’ssayhehasanyoneofthethreeitemsfromthemenu:icecream,juice,orcake.Wecannotsayforsurewhatitemhewillhavethenextday,evenifweknowwhathehadtoday.Thisisanexampleofanondeterministicpattern.
![Page 115: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/115.jpg)
![Page 116: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/116.jpg)
TheMarkovprocessIntheMarkovprocess,thenextstateisdependentonthepreviousstates.Ifweassumethatwehaveannstatesystem,thenthenextstateisdependentonthepreviousnstates.Thisprocessiscalledannmodelorder.IntheMarkovprocess,wemakethechoiceforthenextstateprobabilistically.So,consideringourpreviousexample,ifBobhadjuicetoday,hecanhavejuice,icecream,orcakethenextday.Inthesameway,wecanreachanystateinthesystemfromthepreviousstate.TheMarkovprocessisshowninthefollowingdiagram:
Ifwehavenstatesinaprocess,thenwecanreachanystatewithn2transitions.Wehaveaprobabilityofmovingtoanystate,andhence,wewillhaven2probabilitiesofdoingthis.ForaMarkovprocess,wewillhavethefollowingthreeitems:
States:Thisreferstothestatesinthesystem.Inourexample,let’ssaytherearethreestates:state1,state2,andstate3.Transitionmatrix:Thiswillhavetheprobabilitiesofmovingfromonestatetoanyotherstate.Anexampleofthetransitionmatrixisshowninthefollowingscreenshot:
Thismatrixshowsthatifthesystemwasinstate1yesterday,thentheprobabilityofittoremaininthesamestatetodaywillbe0.1.
Initialstatevector:Thisisthevectoroftheinitialstateofthesystem.(Anyoneofthestateswillhaveaprobabilityof1andtherestwillhaveaprobabilityof0inthisvector.)
![Page 117: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/117.jpg)
![Page 118: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/118.jpg)
![Page 119: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/119.jpg)
IntroducingtheHiddenMarkovModelTheHiddenMarkovModel(HMM)isaclassificationtechniquetopredictthestatesofasystembyobservingtheoutcomeswithouthavingaccesstotheactualstatesthemselves.ItisaMarkovmodelinwhichthestatesarehidden.
Let’scontinuewithBob’ssnackexamplewesawearlier.Nowassumewehaveonemoresetofeventsinplacethatisdirectlyobservable.WeknowwhatBobhaseatenforlunchandhissnacksintakeisrelatedtohislunch.So,wehaveanobservationstate,whichisBob’slunch,andhiddenstates,whicharehissnacksintake.WewanttobuildanalgorithmthatcanforecastwhatwouldbeBob’schoiceofsnackbasedonhislunch.
InadditiontothetransitionprobabilitymatrixintheHiddenMarkovModel,wehaveonemorematrixthatiscalledanemissionmatrix.Thismatrixcontainstheprobabilityoftheobservablestate,provideditisassignedahiddenstate.Theemissionmatrixisasfollows:
P(observablestate|onestate)
So,aHiddenMarkovModelhasthefollowingproperties:
Statevector:ThiscontainstheprobabilityofthehiddenmodeltobeinaparticularstateatthestartTransitionmatrix:Thishastheprobabilitiesofahiddenstate,giventheprevioushiddenstateEmissionmatrix:Giventhatthehiddenmodelisinaparticularhiddenstate,thishastheprobabilitiesofobservingaparticularobservablestateHiddenstates:ThisreferstothestatesofthesystemthatcanbedefinedbytheHiddenMarkovModelObservablestate:Thestatesthatarevisibleintheprocess
UsingtheHiddenMarkovModel,threetypesofproblemscanbesolved.ThefirsttwoarerelatedtothepatternrecognitionproblemandthethirdtypeofproblemgeneratesaHiddenMarkovModel,givenasequenceofobservations.Let’slookatthesethreetypes
![Page 120: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/120.jpg)
ofproblems:
Evaluation:Thisisfindingouttheprobabilityofanobservedsequence,givenanHMM.FromthenumberofdifferentHMMsthatdescribedifferentsystemsandasequenceofobservations,ourgoalwillbetofindoutwhichHMMwillmostprobablygeneratetherequiredsequence.WeusetheforwardalgorithmtocalculatetheprobabilityofanobservationsequencewhenaparticularHMMisgivenandfindoutthemostprobableHMM.Decoding:Thisisfindingthemostprobablesequenceofhiddenstatesfromsomeobservations.WeusetheViterbialgorithmtodeterminethemostprobablesequenceofhiddenstateswhenyouhaveasequenceofobservationsandanHMM.Learning:LearningisgeneratingtheHMMfromasequenceofobservations.So,ifwehavesuchasequence,wemaywonderwhichisthemostlikelymodeltogeneratethissequence.Theforward-backwardalgorithmsareusefulinsolvingthisproblem.
TheHiddenMarkovModelisusedindifferentapplicationssuchasspeechrecognition,handwrittenletterrecognition,genomeanalysis,partsofspeechtagging,customerbehaviormodeling,andsoon.
![Page 121: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/121.jpg)
![Page 122: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/122.jpg)
UsingMahoutfortheHiddenMarkovModelApacheMahouthastheimplementationoftheHiddenMarkovModel.Itisavailableintheorg.apache.mahout.classifier.sequencelearning.hmmpackage.
Theoverallimplementationisprovidedbyeightdifferentclasses:
HMMModel:ThisisthemainclassthatdefinestheHiddenMarkovModel.HmmTrainer:ThisclasshasalgorithmsthatareusedtotraintheHiddenMarkovModel.Themainalgorithmsaresupervisedlearning,unsupervisedlearning,andunsupervisedBaum-Welch.HmmEvaluator:ThisclassprovidesdifferentmethodstoevaluateanHMMmodel.Thefollowingusecasesarecoveredinthisclass:
Generatingasequenceofoutputstatesfromamodel(prediction)Computingthelikelihoodthatagivenmodelwillgeneratethegivensequenceofoutputstates(modellikelihood)Computingthemostlikelyhiddensequenceforagivenmodelandagivenobservedsequence(decoding)
HmmAlgorithms:ThisclasscontainsimplementationsofthethreemajorHMMalgorithms:forward,backward,andViterbi.HmmUtils:ThisisautilityclassandprovidesmethodstohandleHMMmodelobjects.RandomSequenceGenerator:Thisisacommand-linetooltogenerateasequencebythegivenHMM.BaumWelchTrainer:ThisistheclasstotrainHMMfromtheconsole.ViterbiEvaluator:Thisisalsoacommand-linetoolforViterbievaluation.
Now,let’sworkwithBob’sexample.
Thefollowingisthegivenmatrixandtheinitialprobabilityvector:
Icecream Cake Juice
0.36 0.51 0.13
Thefollowingwillbethestatetransitionmatrix:
Icecream Cake Juice
Icecream 0.365 0.500 0.135
Cake 0.250 0.125 0.625
Juice 0.365 0.265 0.370
![Page 123: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/123.jpg)
Thefollowingwillbetheemissionmatrix:
Spicyfood Normalfood Nofood
Icecream 0.1 0.2 0.7
Cake 0.5 0.25 0.25
Juice 0.80 0.10 0.10
Nowwewillexecuteacommand-line-basedexampleofthisproblem.WehavethreehiddenstatesofwhatBob’seatenforsnacks:ice-cream,cake,orjuice.Then,wehavethreeobservablestatesofwhatheishavingatlunch:spicyfood,normalfood,ornofoodatall.Now,thefollowingarethestepstoexecutefromthecommandline:
1. Createadirectorywiththenamehmm:mkdir/tmp/hmm.Gotothisdirectoryandcreatethesampleinputfileoftheobservedstates.ThiswillincludeasequenceofBob’slunchhabit:spicyfood(state0),normalfood(state1),andnofood(state2).Executethefollowingcommand:
echo"012221100212111122200000022200000
011112222202120212110001010212121211
002202110">hmm-input
2. RuntheBaumWelchalgorithmtotrainthemodelusingthefollowingcommand:
mahoutbaumwelch-i/tmp/hmm/hmm-input-o/tmp/hmm/hmm-model-nh3-no
3-e.0001-m1000
Theparametersusedintheprecedingcommandareasfollows:
i:Thisistheinputfilelocationo:Thisistheoutputlocationforthemodelnh:Thisisthenumberofhiddenstates.Inourexample,itisthree(icecream,juice,orcake)no:Thisisthenumberofobservablestates.Inourexample,itisthree(spicy,normal,ornofood)e:Thisistheepsilonnumber.Thisistheconvergencethresholdvaluem:Thisisthemaximumiterationnumber
Thefollowingscreenshotshowstheoutputonexecutingthepreviouscommand:
![Page 124: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/124.jpg)
3. NowwehaveanHMMmodelthatcanbeusedtobuildapredictedsequence.Wewillrunthemodeltopredictthenext15statesoftheobservablesequenceusingthefollowingcommand:
mahouthmmpredict-m/tmp/hmm/hmm-model-o/tmp/hmm/hmm-predictions-l
10
Theparametersusedintheprecedingcommandareasfollows:
m:ThisisthepathfortheHMMmodel
o:Thisistheoutputdirectorypath
l:Thisisthelengthofthegeneratedsequence
4. Toviewthepredictionforthenext10observablestates,usethefollowingcommand:
mahouthmmpredict-m/tmp/hmm/hmm-model-o/tmp/hmm/hmm-predictions-l
10
Theoutputofthepreviouscommandisshowninthefollowingscreenshot:
Fromtheoutput,wecansaythatthenextobservablestatesforBob’slunchwillbespicy,spicy,spicy,normal,normal,nofood,nofood,nofood,nofood,andnofood.
5. Now,wewilluseonemorealgorithmtopredictthehiddenstate.WewillusetheViterbialgorithmtopredictthehiddenstatesforagivenobservationalstate’ssequence.Wewillfirstcreatethesequenceoftheobservationalstateusingthefollowingcommand:
echo"012021100112">/tmp/hmm/hmm-viterbi-input
6. WewillusetheViterbicommand-lineoptiontogeneratetheoutputwiththelikelihoodofgeneratingthissequence:
mahoutviterbi--input/tmp/hmm/hmm-viterbi-input--outputtmp/hmm/hmm-
viterbi-output--model/tmp/hmm/hmm-model--likelihood
![Page 125: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/125.jpg)
Theparametersusedintheprecedingcommandareasfollows:
input:Thisistheinputlocationofthefileoutput:ThisistheoutputlocationoftheViterbialgorithm’soutputmodel:ThisistheHMMmodellocationthatwecreatedearlierlikelihood:Thisisthecomputedlikelihoodoftheobservedsequence
Thefollowingscreenshotshowstheoutputonexecutingthepreviouscommand:
7. PredictionsfromtheViterbiaresavedintheoutputfileandcanbeseenusingthecatcommand:
cat/tmp/hmm/hmm-viterbi-output
Thefollowingoutputshowsthepredictionsforthehiddenstate:
![Page 126: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/126.jpg)
![Page 127: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/127.jpg)
SummaryInthischapter,wediscussedanotherclassificationtechnique:theHiddenMarkovModel.Youlearnedaboutdeterministicandnondeterministicpatterns.WealsotouchedupontheMarkovprocessandHiddenMarkovprocessingeneral.WecheckedtheclassesimplementedinsideMahouttosupporttheHiddenMarkovModel.WetookupanexampletocreatetheHMMmodelandfurtherusedthismodeltopredicttheobservationalstate’ssequence.WeusedtheViterbialgorithmimplementedinMahouttopredictthehiddenstatesinthesystem.
Now,inthenextchapter,wewillcoveronemoreinterestingalgorithmusedinclassificationarea:Randomforest.
![Page 128: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/128.jpg)
![Page 129: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/129.jpg)
Chapter6.LearningRandomForestUsingMahoutRandomforestisoneofthemostpopulartechniquesinclassification.Itstartswithamachinelearningtechniquecalleddecisiontree.Inthischapter,wewillexplorethefollowingtopics:
DecisiontreeRandomforestUsingMahoutforRandomforest
![Page 130: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/130.jpg)
DecisiontreeAdecisiontreeisusedforclassificationandregressionproblems.Insimpleterms,itisapredictivemodelthatusesbinaryrulestocalculatethetargetvariable.Inadecisiontree,weuseaniterativeprocessofsplittingthedataintopartitions,thenwesplititfurtheronbranches.Asinotherclassificationmodelcreationprocesses,westartwiththetrainingdatasetinwhichtargetvariablesorclasslabelsaredefined.Thealgorithmtriestobreakalltherecordsintrainingdatasetsintotwopartsbasedononeoftheexplanatoryvariables.Thepartitioningisthenappliedtoeachnewpartition,andthisprocessiscontinueduntilnomorepartitioningcanbedone.Thecoreofthealgorithmistofindouttherulethatdeterminestheinitialsplit.Therearealgorithmstocreatedecisiontrees,suchasIterativeDichotomiser3(ID3),ClassificationandRegressionTree(CART),Chi-squaredAutomaticInteractionDetector(CHAID),andsoon.AgoodexplanationforID3canbefoundathttp://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.
Formingtheexplanatoryvariablestochoosethebestsplitterinanode,thealgorithmconsiderseachvariableinturn.Everypossiblesplitisconsideredandtried,andthebestsplitistheonethatproducesthelargestdecreaseindiversityoftheclassificationlabelwithineachpartition.Thisisrepeatedforallvariables,andthewinnerischosenasthebestsplitterforthatnode.Theprocessiscontinuedinthenextnodeuntilwereachanodewherewecanmakethedecision.
Wecreateadecisiontreefromatrainingdatasetsoitcansufferfromtheoverfittingproblem.Thisbehaviorcreatesaproblemwithrealdatasets.Toimprovethissituation,aprocesscalledpruningisused.Inthisprocess,weremovethebranchesandleavesofthetreetoimprovetheperformance.Algorithmsusedtobuildthetreeworkbestatthestartingorrootnodesincealltheinformationisavailablethere.Lateron,witheachsplit,dataislessandtowardstheendofthetree,aparticularnodecanshowpatternsthatarerelatedtothesetofdatawhichisusedtosplit.Thesepatternscreateproblemswhenweusethemtopredicttherealdataset.Pruningmethodsletthetreegrowandremovethesmallerbranchesthatfailtogeneralize.Nowtakeanexampletounderstandthedecisiontree.
Considerwehaveairisflowerdataset.Thisdatasetishugelypopularinthemachinelearningfield.ItwasintroducedbySirRonaldFisher.Itcontains50samplesfromeachofthreespeciesofirisflower(Irissetosa,Irisvirginica,andIrisversicolor).Thefourexplanatoryvariablesarethelengthandwidthofthesepalsandpetalsincentimeters,andthetargetvariableistheclasstowhichtheflowerbelongs.
![Page 131: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/131.jpg)
Asyoucanseeintheprecedingdiagram,allthegroupswereearlierconsideredasSentosaspeciesandthentheexplanatoryvariableandpetallengthwerefurtherusedtodividethegroups.Ateachstep,thecalculationformisclassifieditemswasalsodone,whichshowshowmanyitemswerewronglyclassified.Moreover,thepetalwidthvariablewastakenintoaccount.Usually,itemsatleafnodesarecorrectlyclassified.
![Page 132: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/132.jpg)
![Page 133: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/133.jpg)
RandomforestTheRandomforestalgorithmwasdevelopedbyLeoBreimanandAdeleCutler.Randomforestsgrowmanyclassificationtrees.Theyareanensemblelearningmethodforclassificationandregressionthatconstructsanumberofdecisiontreesattrainingtimeandalsooutputstheclassthatisthemodeoftheclassesoutputtedbyindividualtrees.
Singledecisiontreesshowthebias–variancetradeoff.Sotheyusuallyhavehighvarianceorhighbias.Thefollowingaretheparametersinthealgorithm:
Bias:ThisisanerrorcausedbyanerroneousassumptioninthelearningalgorithmVariance:Thisisanerrorthatrangesfromsensitivitytosmallfluctuationsinthetrainingset
Randomforestsattempttomitigatethisproblembyaveragingtofindanaturalbalancebetweentwoextremes.ARandomforestworksontheideaofbagging,whichistoaveragenoisyandunbiasedmodelstocreateamodelwithlowvariance.ARandomforestalgorithmworksasalargecollectionofdecorrelateddecisiontrees.TounderstandtheideaofaRandomforestalgorithm,let’sworkwithanexample.
Considerwehaveatrainingdatasetthathaslotsoffeatures(explanatoryvariables)andtargetvariablesorclasses:
Wecreateasamplesetfromthegivendataset:
Adifferentsetofrandomfeaturesweretakenintoaccounttocreatetherandomsub-dataset.Now,fromthesesub-datasets,differentdecisiontreeswillbecreated.Soactuallywehavecreatedaforestofthedifferentdecisiontrees.Usingthesedifferenttrees,wewillcreatearankingsystemforalltheclassifiers.Topredicttheclassofanewunknownitem,wewilluseallthedecisiontreesandseparatelyfindoutwhichclassthesetreesare
![Page 134: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/134.jpg)
predicting.Seethefollowingdiagramforabetterunderstandingofthisconcept:
Differentdecisiontreestopredicttheclassofanunknownitem
Inthisparticularcase,wehavefourdifferentdecisiontrees.Wepredicttheclassofanunknowndatasetwitheachofthetrees.Aspertheprecedingfigure,thefirstdecisiontreeprovidesclass2asthepredictedclass,theseconddecisiontreepredictsclass5,thethirddecisiontreepredictsclass5,andthefourthdecisiontreepredictsclass3.Now,aRandomforestwillvoteforeachclass.Sowehaveonevoteeachforclass2andclass3andtwovotesforclass5.Therefore,ithasdecidedthatforthenewunknowndataset,thepredictedclassisclass5.Sotheclassthatgetsahighervoteisdecidedforthenewdataset.ARandomforesthasalotofbenefitsinclassificationandafewofthemarementionedinthefollowinglist:
CombinationoflearningmodelsincreasestheaccuracyoftheclassificationRunseffectivelyonlargedatasetsaswellThegeneratedforestcanbesavedandusedforotherdatasetsaswellCanhandlealargeamountofexplanatoryvariables
NowthatwehaveunderstoodtheRandomforesttheoretically,let’smoveontoMahoutandusetheRandomforestalgorithm,whichisavailableinApacheMahout.
![Page 135: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/135.jpg)
![Page 136: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/136.jpg)
UsingMahoutforRandomforestMahouthasimplementationfortheRandomforestalgorithm.Itisveryeasytounderstandanduse.Solet’sgetstarted.
Dataset
WewillusetheNSL-KDDdataset.Since1999,KDD‘99hasbeenthemostwidelyuseddatasetfortheevaluationofanomalydetectionmethods.ThisdatasetispreparedbyS.J.StolfoandisbuiltbasedonthedatacapturedintheDARPA‘98IDSevaluationprogram(R.P.Lippmann,D.J.Fried,I.Graf,J.W.Haines,K.R.Kendall,D.McClung,D.Weber,S.E.Webster,D.Wyschogrod,R.K.Cunningham,andM.A.Zissman,“Evaluatingintrusiondetectionsystems:The1998darpaoff-lineintrusiondetectionevaluation,”discex,vol.02,p.1012,2000).
DARPA‘98isabout4GBofcompressedraw(binary)tcpdumpdataof7weeksofnetworktraffic,whichcanbeprocessedintoabout5millionconnectionrecords,eachwithabout100bytes.Thetwoweeksoftestdatahavearound2millionconnectionrecords.TheKDDtrainingdatasetconsistsofapproximately4,900,000singleconnectionvectors,eachofwhichcontains41featuresandislabeledaseithernormaloranattack,withexactlyonespecificattacktype.
NSL-KDDisadatasetsuggestedtosolvesomeoftheinherentproblemsoftheKDD‘99dataset.Youcandownloadthisdatasetfromhttp://nsl.cs.unb.ca/NSL-KDD/.
WewilldownloadtheKDDTrain+_20Percent.ARFFandKDDTest+.ARFFdatasets.
NoteInKDDTrain+_20Percent.ARFFandKDDTest+.ARFF,removethefirst44lines(that
![Page 137: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/137.jpg)
is,alllinesstartingwith@attribute).Ifthisisnotdone,wewillnotbeabletogenerateadescriptorfile.
![Page 138: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/138.jpg)
StepstousetheRandomforestalgorithminMahoutThestepstoimplementtheRandomforestalgorithminApacheMahoutareasfollows:
1. Transferthetestandtrainingdatasetstohdfsusingthefollowingcommands:
hadoopfs-mkdir/user/hue/KDDTrain
hadoopfs-mkdir/user/hue/KDDTest
hadoopfs–put/tmp/KDDTrain+_20Percent.arff/user/hue/KDDTrain
hadoopfs–put/tmp/KDDTest+.arff/user/hue/KDDTest
2. Generatethedescriptorfile.BeforeyoubuildaRandomforestmodelbasedonthetrainingdatainKDDTrain+.arff,adescriptorfileisrequired.Thisisbecauseallinformationinthetrainingdatasetneedstobelabeled.Fromthelabeleddataset,thealgorithmcanunderstandwhichoneisnumericalandcategorical.Usethefollowingcommandtogeneratedescriptorfile:
hadoopjar$MAHOUT_HOME/core/target/mahout-core-xyz.job.jar
org.apache.mahout.classifier.df.tools.Describe
-p/user/hue/KDDTrain/KDDTrain+_20Percent.arff
-f/user/hue/KDDTrain/KDDTrain+.info
-dN3C2NC4NC8N2C19NL
Jar:Mahoutcorejar(xyzstandsforversion).IfyouhavedirectlyinstalledMahout,itcanbefoundunderthe/usr/lib/mahoutfolder.ThemainclassDescribeisusedhereandittakesthreeparameters:
Theppathforthedatatobedescribed.
Theflocationforthegenerateddescriptorfile.
distheinformationfortheattributeonthedata.N3C2NC4NC8N2C19NLdefinesthatthedatasetisstartingwithanumeric(N),followedbythreecategoricalattributes,andsoon.Inthelast,Ldefinesthelabel.
Theoutputofthepreviouscommandisshowninthefollowingscreenshot:
3. BuildtheRandomforestusingthefollowingcommand:
hadoopjar$MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar
org.apache.mahout.classifier.df.mapreduce.BuildForest
-Dmapred.max.split.size=1874231-d
/user/hue/KDDTrain/KDDTrain+_20Percent.arff
-ds/user/hue/KDDTrain/KDDTrain+.info
-sl5-p-t100–o/user/hue/nsl-forest
Jar:Mahoutexamplejar(xyzstandsforversion).IfyouhavedirectlyinstalledMahout,itcanbefoundunderthe/usr/lib/mahoutfolder.Themainclassbuild
![Page 139: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/139.jpg)
forestisusedtobuildtheforestwithotherarguments,whichareshowninthefollowinglist:
Dmapred.max.split.sizeindicatestoHadoopthemaximumsizeofeachpartition.
dstandsforthedatapath.
dsstandsforthelocationofthedescriptorfile.
slisavariabletoselectrandomlyateachtreenode.Here,eachtreeisbuiltusingfiverandomlyselectedattributespernode.
pusespartialdataimplementation.
tstandsforthenumberoftreestogrow.Here,thecommandsbuild100treesusingpartialimplementation.
ostandsfortheoutputpaththatwillcontainthedecisionforest.
Intheend,theprocesswillshowthefollowingresult:
4. Usethismodeltoclassifythenewdataset:
hadoopjar$MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar
org.apache.mahout.classifier.df.mapreduce.TestForest
-i/user/hue/KDDTest/KDDTest+.arff
-ds/user/hue/KDDTrain/KDDTrain+.info-m/user/hue/nsl-forest-a–mr
-o/user/hue/predictions
Jar:Mahoutexamplejar(xyzstandsforversion).IfyouhavedirectlyinstalledMahout,itcanbefoundunderthe/usr/lib/mahoutfolder.Theclasstotesttheforesthasthefollowingparameters:
Iindicatesthepathforthetestdata
dsstandsforthelocationofthedescriptorfile
mstandsforthelocationofthegeneratedforestfromthepreviouscommand
![Page 140: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/140.jpg)
ainformstoruntheanalyzertocomputetheconfusionmatrix
mrinformshadooptodistributetheclassification
ostandsforthelocationtostorethepredictionsin
Thejobprovidesthefollowingconfusionmatrix:
So,fromtheconfusionmatrix,itisclearthat9,396instanceswerecorrectlyclassifiedand315normalinstanceswereincorrectlyclassifiedasanomalies.Andtheaccuracypercentageis77.7635(correctlyclassifiedinstancesbythemodel/classifiedinstances).Theoutputfileinthepredictionfoldercontainsthelistwhere0and1.0definesthenormaldatasetand1definestheanomaly.
![Page 141: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/141.jpg)
![Page 142: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/142.jpg)
SummaryInthischapter,wediscussedtheRandomforestalgorithm.WestartedourdiscussionbyunderstandingthedecisiontreeandcontinuedwithanunderstandingoftheRandomforest.WetookuptheNSL-KDDdataset,whichisusedtobuildpredictivesystemsforcybersecurity.WeusedMahouttobuildtheRandomforesttree,anduseditwiththetestdatasetandgeneratedtheconfusionmatrixandotherstatisticsfortheoutput.
Inthenextchapter,wewilllookatthefinalclassificationalgorithmavailableinApacheMahout.Sostaytuned!
![Page 143: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/143.jpg)
![Page 144: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/144.jpg)
Chapter7.LearningMultilayerPerceptronUsingMahoutTounderstandaMultilayerPerceptron(MLP),wewillfirstexploreonemorepopularmachinelearningtechnique:neuralnetwork.Inthischapter,wewillexplorethefollowingtopics:
NeuralnetworkandneuronsMLPUsingMahoutforMLPimplementation
![Page 145: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/145.jpg)
NeuralnetworkandneuronsNeuralnetworkisanoldalgorithm,anditwasdevelopedwithagoalinmind:toprovidethecomputerwithabrain.Neuralnetworkisinspiredbythebiologicalstructureofthehumanbrainwheremultipleneuronsareconnectedandformcolumnsandlayers.Aneuronisanelectricallyexcitablecellthatprocessesandtransmitsinformationthroughelectricalandchemicalsignals.Perceptualinputentersintotheneuralnetworkthroughoursensoryorgansandisthenfurtherprocessedintohigherlevels.Let’sunderstandhowneuronsworkinourbrain.
Neuronsarecomputationalunitsinthebrainthatcollecttheinputfrominputnerves,whicharecalleddendrites.Theyperformcomputationontheseinputmessagesandsendtheoutputusingoutputnerves,whicharecalledaxons.Seethefollowingfigure(http://vv.carleton.ca/~neil/neural/neuron-a.html):
Onthesamelines,wedevelopaneuralnetworkincomputers.Wecanrepresentaneuroninouralgorithmasshowninthefollowingfigure:
![Page 146: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/146.jpg)
Here,x1,x2,andx3arethefeaturevectors,andtheyareassignedtoafunctionf,whichwilldothecomputationandprovidetheoutput.Thisactivationfunctionisusuallychosenfromthefamilyofsigmoidalfunctions(asdefinedinChapter3,LearningLogisticRegression/SGDUsingMahout).Inthecaseofclassificationproblems,softmaxactivationfunctionsareused.Inclassificationproblems,wewanttheoutputastheprobabilitiesoftargetclasses.So,itisdesirablefortheoutputtoliebetween0and1andthesumcloseto1.Softmaxfunctionenforcestheseconstraints.Itisageneralizationofthelogisticfunction.Moredetailsonsoftmaxfunctioncanbefoundathttp://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html.
![Page 147: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/147.jpg)
![Page 148: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/148.jpg)
MultilayerPerceptronAneuralnetworkorartificialneuralnetworkgenerallyreferstoanMLPnetwork.Wedefinedneuronasanimplementationincomputersintheprevioussection.AnMLPnetworkconsistsofmultiplelayersoftheseneuronunits.Let’sunderstandaperceptronnetworkofthreelayers,asshowninthenextfigure.ThefirstlayeroftheMLPrepresentstheinputandhasnootherpurposethanroutingtheinputtoeveryconnectedunitinafeed-forwardfashion.Thesecondlayeriscalledhiddenlayers,andthelastlayerservesthespecialpurposeofdeterminingtheoutput.Theactivationofneuronsinthehiddenlayerscanbedefinedasthesumoftheweightofalltheinput.Neuron1inlayer2isdefinedasfollows:
Y12=g(w110x0+w111x1+w112x2+w113x3)
Thefirstpartwhere*x0=0*iscalledthebiasandcanbeusedasanoffset,independentoftheinput.Neuron2inlayer2isdefinedasfollows:
Y22=g(w120x0+w121x1+w122x2+w123x3)
Neuron3inlayer2isdefinedasfollows:
Y32=g(w130x0+w131x1+w132x2+w133x3)
Here,gisasigmoidfunction,asdefinedinChapter3,LearningLogisticRegression/SGDUsingMahout.Thefunctionisasfollows:
g(z)=1/1+e(-z)
InthisMLPnetworkoutput,fromeachinputandhiddenlayers,neuronunitsaredistributedtoothernodes,andthisiswhythistypeofnetworkiscalledafullyconnectednetwork.Inthisnetwork,novaluesarefedbacktothepreviouslayer.(Feedforwardis
![Page 149: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/149.jpg)
anotherstrategyandisalsoknownasbackpropagation.Detailsonthiscanbefoundathttp://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html.)
AnMLPnetworkcanhavemorethanonehiddenlayer.TogetthevalueoftheweightssothatwecangetthepredictedvalueascloseaspossibletotheactualoneisatrainingprocessoftheMLP.Tobuildaneffectivenetwork,weconsideralotofitemssuchasthenumberofhiddenlayersandneuronunitsineachlayer,thecostfunctiontominimizetheerrorinpredictedandactualvalues,andsoon.
Nowlet’sdiscusstwomoreimportantandproblematicquestionsthatarisewhencreatinganMLPnetwork:
Howmanyhiddenlayersshouldoneuseforthenetwork?Howmanynumbersofhiddenunits(neuronunits)shouldoneuseinahiddenlayer?
Zerohiddenlayersarerequiredtoresolvelinearlyseparabledata.Assumingyourdatadoesrequireseparationbyanon-lineartechnique,alwaysstartwithonehiddenlayer.Almostcertainly,that’sallyouwillneed.IfyourdataisseparableusinganMLP,thenthisMLPprobablyonlyneedsasinglehiddenlayer.Inordertoselectthenumberofunitsindifferentlayers,thesearetheguidelines:
Inputlayer:Thisreferstothenumberofexplanatoryvariablesinthemodelplusoneforthebiasnode.Outputlayer:Inthecaseofclassification,thisreferstothenumberoftargetvariables,andinthecaseofregression,thisisobviouslyone.Hiddenlayer:Startyournetworkwithonehiddenlayerandusethenumberofneuronunitsequivalenttotheunitsintheinputlayer.Thebestwayistotrainseveralneuralnetworkswithdifferentnumbersofhiddenlayersandhiddenneuronsandmeasuretheperformanceofthesenetworksusingcross-validation.Youcanstickwiththenumberthatyieldsthebest-performingnetwork.Problemsthatrequiretwohiddenlayersarerarelyencountered.However,neuralnetworksthathavemorethanonehiddenlayercanrepresentfunctionswithanykindofshape.Thereiscurrentlynotheorytojustifytheuseofneuralnetworkswithmorethantwohiddenlayers.Infact,formanypracticalproblems,thereisnoreasontouseanymorethanonehiddenlayer.Anetworkwithnohiddenlayerisonlycapableofrepresentinglinearlyseparablefunctions.Networkswithonelayercanapproximateanyfunctionthatcontainsacontinuousmappingfromonefinitespacetoanother,andnetworkswithtwohiddenlayerscanrepresentanarbitrarydecisionboundarytoarbitraryaccuracywithrationalactivationfunctionsandcanapproximateanysmoothmappingtoanyaccuracy(Chapter5ofthebookIntroductiontoNeuralNetworksforJava).Numberofneuronsorhiddenunits:Usethenumberofneuronunitsequivalenttotheunitsintheinputlayer.Thenumberofhiddenunitsshouldbelessthantwicethenumberofunitsintheinputlayer.Anotherruletocalculatethisis(numberofinputunits+numberofoutputunits)*2/3.
Dothetestingforgeneralizationerrors,trainingerrors,bias,andvariance.Whenageneralizationerrordips,thenjustbeforeitbeginstoincreaseagain,thenumbersofnodesareusuallyfoundtobeperfectatthispoint.
![Page 150: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/150.jpg)
Nowlet’smoveontothenextsectionandexplorehowwecanuseMahoutforanMLP.
![Page 151: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/151.jpg)
![Page 152: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/152.jpg)
MLPimplementationinMahoutTheMLPimplementationisbasedonamoregeneralneuralnetworkclass.ItisimplementedtorunonasinglemachineusingStochasticGradientDescent,wheretheweightsareupdatedusingonedatapointatatime.
Thenumberoflayersandunitsperlayercanbespecifiedmanuallyanddeterminesthewholetopologywitheachunitbeingfullyconnectedtothepreviouslayer.Abiasunitisautomaticallyaddedtotheinputofeverylayer.Abiasunitishelpfulforshiftingtheactivationfunctiontotheleftorright.Itislikeaddingacoefficienttothelinearfunction.
Currently,thelogisticsigmoidisusedasasquashingfunctionineveryhiddenandoutputlayer.
Thecommand-lineversiondoesnotperformiterationsthatleadtobadresultsonsmalldatasets.AnotherrestrictionisthattheCLIversionoftheMLPonlysupportsclassification,sincethelabelshavetobegivenexplicitlywhenexecutingtheimplementationinthecommandline.
Alearnedmodelcanbestoredandupdatedwithnewtraininginstancesusingthe`--update`flag.Theoutputoftheclassificationresultissavedasa.txtfileandonlyconsistsoftheassignedlabels.Apartfromthecommand-lineinterface,itispossibletoconstructandcompilemorespecializedneuralnetworksusingtheAPIandinterfacesinthemrlegacypackage.(Thecorepackageisrenamedasmrlegacy.)
Inthecommandline,weuseTrainMultilayerPerceptronandRunMultilayerPerceptronclassesthatareavailableinthemrlegacypackagewiththreeotherclasses:Neuralnetwork.java,NeuralNetworkFunctions.java,andMultilayerPerceptron.java.Forthisparticularimplementation,userscanfreelycontrolthetopologyoftheMLP,includingthefollowing:
ThesizeoftheinputlayerThenumberofhiddenlayersThesizeofeachhiddenlayerThesizeoftheoutputlayerThecostfunctionThesquashingfunction
Themodelistrainedinanonlinelearningapproach,wheretheweightsofneuronsintheMLPisupdatedandincrementedusingthebackPropagationalgorithmproposedbyRumelhart,D.E.,Hinton,G.E.,andWilliams,R.J.(1986),Learningrepresentationsbyback-propagatingerrors.Nature,323,533-536.
![Page 153: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/153.jpg)
![Page 154: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/154.jpg)
UsingMahoutforMLPMahouthasimplementationforanMLPnetwork.TheMLPimplementationiscurrentlylocatedintheMap-Reduce-Legacypackage.Aswithotherclassificationalgorithms,twoseparatedclassesareimplementedtotrainandusethisclassifier.Fortrainingtheclassifier,theorg.apache.mahout.classifier.mlp.TrainMultilayerPerceptronclass,andforrunningtheclassifier,theorg.apache.mahout.classifier.mlp.RunMultilayerPerceptronclassisused.Thereareanumberofparametersdefinedthatareusedwiththeseclasses,butwewilldiscusstheseparametersoncewerunourexampleonadataset.
Dataset
Inthischapter,wewilltrainanMLPtoclassifytheirisdataset.Theirisflowerdatasetcontainsdataofthreeflowerspecies,whereeachdatapointconsistsoffourfeatures.ThisdatasetwasintroducedbySirRonaldFisher.Itconsistsof50samplesfromeachofthreespeciesofiris.ThesespeciesareIrissetosa,Irisvirginica,andIrisversicolor.Fourfeaturesweremeasuredfromeachsample:
SepallengthSepalwidthPetallengthPetalwidth
Allmeasurementsareincentimeters.Youcandownloadthisdatasetfromhttps://archive.ics.uci.edu/ml/machine-learning-databases/iris/andsaveitasa.csvfile,asshowninthefollowingscreenshot:
Thisdatasetwilllooklikethethefollowingscreenshot:
![Page 155: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/155.jpg)
![Page 156: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/156.jpg)
StepstousetheMLPalgorithminMahoutThestepstousetheMLPalgorithminMahoutareasfollows:
1. CreatetheMLPmodel.
TocreatetheMLPmodel,wewillusetheTrainMultilayerPerceptronclass.Usethefollowingcommandtogeneratethemodel:
bin/mahoutorg.apache.mahout.classifier.mlp.TrainMultilayerPerceptron-
i/tmp/irisdata.csv-labelsIris-setosaIris-versicolorIris-virginica
-mo/tmp/model.model-ls483-l0.2-m0.35-r0.0001
Youcanalsorunusingthecorejar:Mahoutcorejar(xyzstandsfortheversion).IfyouhavedirectlyinstalledMahout,itcanbefoundunderthe/usr/lib/mahoutfolder.Executethefollowingcommand:
Java–cp/usr/lib/mahout/mahout-core-xyz-job.jar
org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron-i
/tmp/irisdata.csv-labelsIris-setosaIris-versicolorIris-virginica-
mo/user/hue/mlp/model.model-ls483-l0.2-m0.35-r0.0001
TheTrainMultilayerPerceptronclassisusedhereandittakesdifferentparameters.Also,iisthepathfortheinputdataset.Here,wehaveputthedatasetunderthe/tmpfolder(localfilesystem).Additionally,labelsaredefinedinthedataset.Herewehavethefollowinglabels:
moistheoutputlocationforthecreatedmodel.lsisthenumberofunitsperlayer,includinginput,hidden,andoutputlayers.Thisparameterspecifiesthetopologyofthenetwork.Here,wehave4astheinputfeature,8forthehiddenlayer,and3fortheoutputclassnumber.listhelearningratethatisusedforweightupdates.Thedefaultis0.5.Toapproximategradientdescent,neuralnetworksaretrainedwithalgorithms.Learningispossibleeitherbybatchoronlinemethods.Inbatchtraining,weightchangesareaccumulatedoveranentirepresentationofthetrainingdata(anepoch)beforebeingapplied,whileonlinetrainingupdatesweighsafterthepresentationofeachtrainingexample(instance).Moredetailscanbefoundathttp://axon.cs.byu.edu/papers/Wilson.nn03.batch.pdf.misthemomentumweightthatisusedforgradientdescent.Thismustbeintherangebetween0–1.0.ristheregularizationvaluefortheweightvector.Thismustbeintherangebetween0–0.1.Itisusedtopreventoverfitting.
![Page 157: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/157.jpg)
2. Totest/runtheMLPclassificationofthetrainedmodel,wecanusethefollowingcommand:
bin/mahoutorg.apache.mahout.classifier.mlp.RunMultilayerPerceptron-i
/tmp/irisdata.csv-cr03-mo/tmp/model.model-o/tmp/labelResult.txt
YoucanalsorunusingtheMahoutcorejar(xyzstandsforversion).IfyouhavedirectlyinstalledMahout,itcanbefoundunderthe/usr/lib/mahoutfolder.Executethefollowingcommand:
Java–cp/usr/lib/mahout/mahout-core-xyz-job.jar
org.apache.mahout.classifier.mlp.RunMultilayerPerceptron-i
/tmp/irisdata.csv-cr03-mo/tmp/model.model-o/tmp/labelResult.txt
TheRunMultilayerPerceptronclassisemployedheretousethemodel.Thisclassalsotakesdifferentparameters,whichareasfollows:
iindicatestheinputdatasetlocationcristherangeofcolumnstousefromtheinputfile,startingwith0(thatis,`-cr05`forincludingthefirstsixcolumnsonly)moisthelocationofthemodelbuiltearlieroisthepathtostorelabeledresultsfromrunningthemodel
![Page 158: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/158.jpg)
![Page 159: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/159.jpg)
SummaryInthischapter,wediscussedoneofthenewlyimplementedalgorithmsinMahout:MLP.WestartedourdiscussionbyunderstandingneuralnetworksandneuronunitsandcontinuedourdiscussionfurthertounderstandtheMLPnetworkalgorithm.Wediscussedhowtochoosedifferentlayerunits.WethenmovedtoMahoutandusedtheirisdatasettotestandrunanMLPalgorithmimplementedinMahout.Withthis,wehavefinishedourdiscussiononclassificationalgorithmsavailableinApacheMahout.
NowwemoveontothenextchapterofthisbookwherewewilldiscussthenewchangescomingupinthenewMahoutrelease.
![Page 160: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/160.jpg)
![Page 161: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/161.jpg)
Chapter8.MahoutChangesintheUpcomingReleaseMahoutisacommunity-drivenprojectanditscommunityisverystrong.Thiscommunitydecidedonsomeofthemajorchangesintheupcoming1.0release.Inthischapter,wewillexploretheupcomingchangesanddevelopmentsinApacheMahout.Wewilllookatthefollowingtopicsinbrief:
NewchangesdueinMahout1.0ApacheSparkH20-platform-relatedworkinApacheMahout
![Page 162: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/162.jpg)
MahoutnewchangesMahoutwasusingthemapreduceprogrammingmodeltohandlelargedatasets.FromtheendofApril2014,thecommunitydecidedtostoptheimplementationofthenewmapreducealgorithm.Thisdecisionhasavalidreason.Mahout’scodebasewillbemovingtomoderndataprocessingsystemsthatofferaricherprogrammingmodelandmoreefficientexecutionthanHadoop’sMapReduce.
MahouthasstarteditsimplementationonthetopofDomainSpecificLanguage(DSL)forlinearalgebraicoperations.ProgramswritteninthisDSLareautomaticallyoptimizedandexecutedinparallelonApacheSpark.ScalaDSLandalgebraicoptimizerisScalaandSparkbindingforMahout.
![Page 163: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/163.jpg)
MahoutScalaandSparkbindingsWithMahoutScalabindingsandMahoutSparkbindingsforlinearalgebrasubroutines,developersinMahoutaretryingtobringsemanticexplicitnesstoMahout’sin-coreandout-of-corelinearalgebrasubroutines.TheyaredoingthiswhileaddingthebenefitsofthestrongprogrammingenvironmentofScalaandcapitalizingonscalabilitybenefitsofSparkandGraphX.ScalabindingisusedtoprovidesupportforScalaDSL,andthiswillmakewritingmachinelearningprogramseasier.
MahoutScalaandSparkbindingsarepackagesthataimtoprovideanR-likelookandfeeltoMahout’sin-coreandout-of-coreSpark-backedlinearalgebra.AnimportantpartofSparkbindingsistheexpressionoptimizer.Thisoptimizerlooksattheentireexpressionanddecidesonhowitcanbesimplifiedandwhichphysicaloperatorsshouldbepicked.Ahigh-leveldiagramofthebindingstackisshowninthefollowingfigure(https://issues.apache.org/jira/secure/attachment/12638098/BindingsStack.jpg):
TheSparkbindingshellhasalsobeenimplementedinMahout1.0.Let’sunderstandtheApacheSparkprojectfirstandthenwewillrevisittheSparkbindingshellinMahout.
![Page 164: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/164.jpg)
![Page 165: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/165.jpg)
ApacheSparkApacheSparkisanopensource,in-memory,general-purposecomputingsystem.Spark’sin-memorytechniqueprovidesperformancethatis100timesfaster.InsteadofHadoop-likedisk-basedcomputation,Sparkusesclustermemorytouploadallthedataintothememory,andthisdatacanbequeriedrepeatedly.
ApacheSparkprovideshigh-levelAPIsinJava,Python,andScalaandanoptimizedenginethatsupportsgeneralexecutiongraphs.Itprovidesthefollowinghigh-leveltools:
SparkSQL:ThisisforSQLandstructureddataprocessing.MLib:ThisisSpark’sscalablemachinelearninglibrarythatconsistsofcommonlearningalgorithmsandutilities,includingclassification,regression,clustering,collaborativefiltering,dimensionalityreduction,aswellastheunderlyingoptimizationprimitives.GraphX:ThisisthenewSparkAPIforgraphsandgraph-parallelcomputation.Sparkstreaming:Thiscancollectdatafrommanysourcesandafterprocessingthisdata,itusescomplexalgorithmsandcanpushthedatatofilesystems,databases,andlivedashboards.
AsSparkisgainingpopularityamongdatascientists,theMahoutcommunityisalsoquicklyworkingonmakingMahoutalgorithmsfunctiononSpark’sexecutionenginetospeedupitscalculation10to100timesfaster.MahoutprovidesseveralimportantbuildingblockstocreaterecommendationsusingSpark.Spark-itemsimilaritycanbeusedtocreateotherpeoplealsolikedthesethingskindofrecommendationsandwhenpairedwithasearchenginecanpersonalizerecommendationsforindividualusers.Spark-rowsimilaritycanprovidenon-personalizedcontentbasedonrecommendationsandwhenpairedwithasearchenginecanbeusedtopersonalizecontentbasedonrecommendations(http://comments.gmane.org/gmane.comp.apache.mahout.scm/6513).
![Page 166: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/166.jpg)
UsingMahout’sSparkshellYoucanuseMahout’sSparkshellbyreferringtothefollowingsteps:
1. DownloadSparkfromhttp://spark.apache.org/downloads.html.2. Createanewfolderwiththenamesparkusingthefollowingcommandandmovethe
downloadedfilethere:
mkdir/tmp/spark
mv~/Downloads/spark-1.1.1.tgz/tmp/spark
3. Unpackthearchivedfileinafolderusingthefollowingcommand:
cd/tmp/spark
tarxzfspark-1.1.1.tgz
4. Thiswillunzipthefileunder/tmp/spark/spark-1.1.1.Now,movetothenewlycreatedfolderandrunthefollowingcommand:
cd/spark-1.1.1
sbt/sbtassembly
ThiswillbuildSparkonyoursystemasshowninthefollowingscreenshot:
5. NowcreateaMahoutdirectoryandmovethefiletoitusingthefollowingcommand:
mkdir/tmp/Mahout
6. CheckoutthemasterbranchofMahoutfromGitHubusingthefollowingcommand:
gitclonehttps://github.com/apache/mahoutmahout
![Page 167: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/167.jpg)
Theoutputoftheprecedingcommandisshowninthefollowingscreenshot:
7. ChangeyourdirectorytothenewlycreatedMahoutdirectoryandbuildMahout:
cdmahout
mvn-DskipTestscleaninstall
Theoutputoftheprecedingcommandisshowninthefollowingscreenshot:
8. MovetothedirectorywhereyouunpackedSparkandtypethefollowingcommandtostartSparklocally:
cd/tmp/spark/spark-1.1.1
sbin/start-all-sh
Theoutputoftheprecedingcommandisshowninthefollowingscreenshot:
9. Openabrowser;pointittohttp://localhost:8080/tocheckwhetherSparkhas
![Page 168: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/168.jpg)
successfullystarted.CopytheURLoftheSparkmasteratthetopofthepage(itstartswithspark://).
10. Definethefollowingenvironmentvariables:
exportMAHOUT_HOME=[directoryintowhichyoucheckedoutMahout]
exportSPARK_HOME=[directorywhereyouunpackedSpark]
exportMASTER=[urloftheSparkmaster]
11. Finally,changetothedirectorywhereyouunpackedMahoutandtypebin/mahoutspark-shell;youshouldseetheshellstartingandgetthemahout>prompt.
NowyourMahoutSparkshellisreadyandyoucanstartplayingwithdata.Formoreinformationonthistopic,seetheimplementationsectionathttps://mahout.apache.org/users/sparkbindings/play-with-shell.html.
![Page 169: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/169.jpg)
![Page 170: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/170.jpg)
H2OplatformintegrationAsdiscussedearlier,anexperimentalworktointegrateMahoutandtheH2Oplatformisalsoinprogress.TheintegrationprovidesanH2ObackendtotheMahoutalgebraDSL.
H2OmakesHadoopdomath!H2Oscalesstatistics,machinelearning,andmathoverbigdata.Itisextensibleanduserscanbuildblocksusingsimplemathlegosinthecore.H2OkeepsfamiliarinterfacessuchasR,Excel,andJSONsothatbigdataenthusiastsandexpertscanexplore,munge,model,andscoredatasetsusingarangeofsimple-to-advancedalgorithms.Datacollectioniseasy,whiledecisionmakingishard.H2Omakesitfastandeasytoderiveinsightsfromyourdatathroughfasterandbetterpredictivemodeling.Italsohasavisionofonlinescoringandmodelinginasingleplatform(http://0xdata.com/download/).
H2Oisfundamentallyapeer-to-peersystem.H2Onodesjointogethertoformacloudonwhichhigh-performancedistributedmathcanbeexecuted.Eachnodejoinsacloudofagivenname.Multiplecloudscanexistonthesamenetworkatthesametimeaslongastheirnamesaredifferent.Multiplenodescanexistonthesameserveraswell(theycanevenbelongtothesamecloud).
TheMahoutH2OintegrationisfitintothismodelbyhavingN-1workernodesandonedrivernode,allbelongingtothesamecloudname.Thedefaultcloudnameusedfortheintegrationismah2out.Cloudshavetobespunupaspertheirtask/job.
Moredetailscanbefoundathttps://issues.apache.org/jira/browse/MAHOUT-1500.
![Page 171: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/171.jpg)
![Page 172: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/172.jpg)
SummaryInthischapter,wediscussedtheupcomingreleaseofMahout1.0,andthechangesthatarecurrentlygoingon.WealsoglancedthroughSpark,Scalabinding,andApacheSpark.Wealsodiscussedahigh-leveloverviewofH2OMahoutintegration.
Nowlet’smoveontothefinalchapterofthisbookwherewewilldevelopaproduction-readyclassifier.
![Page 173: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/173.jpg)
![Page 174: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/174.jpg)
Chapter9.BuildinganE-mailClassificationSystemUsingApacheMahoutInthischapter,wewillcreateaclassifiersystemusingMahout.Inordertobuildthissystem,wewillcoverthefollowingtopics:
GettingthedatasetPreparationofthedatasetPreparingthemodelTrainingthemodel
Inthischapter,wewilltargetthecreationoftwodifferentclassifiers.Thefirstonewillbeaneasyonebecauseyoucanbothcreateandtestitonapseudo-distributedHadoopinstallation.Forthesecondclassifier,Iwillprovideyouwithallthedetails,soyoucanrunitusingyourfullydistributedHadoopinstallation.Iwillcountthesecondoneasahands-onexerciseforthereadersofthisbook.
Firstofall,let’sunderstandtheproblemstatementforthefirstusecase.Nowadays,inmostofthee-mailsystems,weseethate-mailsareclassifiedasspamornotspam.E-mailsthatarenotspamaredelivereddirectlyintoourinboxbutspame-mailsarestoredinafoldercalledSpam.Usually,basedonacertainpatternsuchasmessagesubject,sender’se-mailaddress,orcertainkeywordsinthemessagebody,wecategorizeanincominge-mailasspam.WewillcreateaclassifierusingMahout,whichwillclassifyane-mailintospamornotspam.WewilluseSpamAssassin,anApacheopensourceprojectdatasetforthistask.
Forthesecondusecase,wewillcreateaclassifier,whichcanpredictagroupofincominge-mails.Asanopensourceproject,therearelotsofprojectsundertheApachesoftwarefoundation,suchasApacheMahout,ApacheHadoop,ApacheSolr,andsoon.WewilltaketheApacheSoftwareFoundation(ASF)e-maildatasetandusingthis,wewillcreateandtrainourmodelsothatourmodelcanpredictanewincominge-mail.So,basedoncertainfeatures,wewillbeabletopredictwhichgroupanewincominge-mailbelongsto.
InMahout’sclassificationproblem,wewillhavetoidentifyapatterninthedatasettohelpuspredictthegroupofanewe-mail.Wealreadyhaveadataset,whichisseparatedbyprojectnames.WewillusetheASFpublice-mailarchivesdatasetforthisusecase.
Now,let’sconsiderourfirstusecase:spame-maildetectionclassifier.
![Page 175: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/175.jpg)
Spame-maildatasetAsImentioned,wewillbeusingtheApacheSpamAssassinprojectsdataset.ApacheSpamAssassinisanopensourcespamfilter.Download20021010_easy_ham.tarand20021010_spam.tarfromhttp://spamassassin.apache.org/publiccorpus/,asshowninthefollowingscreenshot:
![Page 176: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/176.jpg)
![Page 177: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/177.jpg)
CreatingthemodelusingtheAssassindatasetWecancreatethemodelwiththehelpofthefollowingsteps:
1. Createafolderundertmpwiththenamedataset,andthenclickonthefolderandunzipthedatasetsusingthefollowingcommand:
mkdir/tmp/assassin/dataset
tar–xvf/tmp/assassin/20021010_easy_ham.tar.bz2
tar–xvf/tmp/assassin/20021010_spam.tar.bz2
Thiswillcreatetwofoldersunderthedatasetfolder,easy_hamandspam,asshowninthefollowingscreenshot:
2. CreateafolderinHdfsandmovethisdatasetintoHadoop:
hadoopfs-mkdir/user/hue/assassin/
hadoopfs–put/tmp/assassin/dataset/user/hue/assassin
tar–xvf/tmp/assassin/20021010_spam.tar.bz2
Nowourdatapreparationisdone.Wehavedownloadedthedataandmovedthisdataintohdfs.Let’smoveontothenextstep.
3. ConvertthisdataintosequencefilessothatwecanprocessitusingHadoop:
bin/mahoutseqdirectory–i/user/hue/assassin/dataset–o
/user/hue/assassinseq-out
![Page 178: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/178.jpg)
4. Convertthesequencefileintosparsevector(Mahoutalgorithmsacceptinputinvectorformat,whichiswhyweareconvertingthesequencefileintosparsevector)byusingthefollowingcommand:
bin/mahoutseq2sparse-i/user/hue/assassinseq-out/part-m-00000-o
/user/hue/assassinvec-lnorm-nv-wttfidf
Thecommandintheprecedingscreenshotisexplainedasfollows:
lnorm:Thiscommandisusedforoutputvectortobelognormalized.nv:Thiscommandisusedfornamedvector.wt:Thiscommandisusedtoidentifythekindofweighttouse.Hereweusetf-idf.
5. Splitthesetofvectorsfortrainingandtestingthemodel,asfollows:
![Page 179: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/179.jpg)
bin/mahoutsplit-i/user/hue/assassinvec/tfidf-vectors--
trainingOutput/user/hue/assassindatatrain--testOutput
/user/hue/assassindatatest--randomSelectionPct20--overwrite--
sequenceFiles-xmsequential
Theprecedingcommandcanbeexplainedasfollows:
TherandomSelectionPctparameterdividesthepercentageofdataintotestandtrainingdatasets.Inthiscase,it’s80percentfortestand20percentfortraining.Thexmparameterspecifieswhatportionofthetf(tf-idf)vectorsistobeusedexpressedintimesthestandarddeviation.Thesigmasymbolspecifiesthedocumentfrequenciesofthesevectors.Itcanbeusedtoremovereallyhighfrequencyterms.Itisexpressedasadoublevalue.Agoodvaluetobespecifiedis3.0.Ifthevalueislessthan0,novectorswillbefilteredout.
6. Now,trainthemodelusingthefollowingcommand:
bin/mahouttrainnb-i/user/hue/assassindatatrain-el-o
/user/hue/prodmodel-li/user/hue/prodlabelindex-ow-c
7. Now,testthemodelusingthefollowingcommand:
bin/mahouttestnb-i/user/hue/assassindatatest-m/user/hue/prodmodel/
-l/user/hue/prodlabelindex-ow-o/user/hue/prodresults
![Page 180: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/180.jpg)
Youcanseefromtheresultsthattheoutputisdisplayedontheconsole.Asperthematrix,thesystemhascorrectlyclassified99.53percentoftheinstancesgiven.
Wecanusethiscreatedmodeltoclassifynewdocuments.Todothis,wecaneitheruseaJavaprogramorcreateaservletthatcanbedeployedonourserver.
Let’stakeanexampleofaJavaprogramincontinuationofthisexercise.
![Page 181: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/181.jpg)
![Page 182: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/182.jpg)
ProgramtouseaclassifiermodelWewillcreateaJavaprogramthatwilluseourmodeltoclassifynewe-mails.Thisprogramwilltakemodel,labelindex,dictionary-file,documentfrequency,andtextfileasinputandwillgenerateascoreforthecategories.Thecategorywillbedecidedbasedonthehigherscores.
Let’shavealookatthisprogramstepbystep:
The.jarfilesrequiredtomakeacompilationofthisprogramareasfollows:
Hadoop-core-x.y.x.jar
Mahout-core-xyz.jar
Mahout-integration-xyz.jar
Mahout-math-xyz.jar
Theimportstatementsarelistedasfollows.WearediscussingthisbecausetherearelotsofchangesintheMahoutreleasesandpeopleusuallyfinditdifficulttogetthecorrectclasses.
importjava.io.BufferedReader;
importjava.io.FileReader;
importjava.io.StringReader;
importjava.util.HashMap;
importjava.util.Map;
importorg.apache.hadoop.conf.Configuration;
importorg.apache.hadoop.fs.Path;
importorg.apache.lucene.analysis.Analyzer;
importorg.apache.lucene.analysis.TokenStream;
importorg.apache.lucene.analysis.standard.StandardAnalyzer;
import
org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
importorg.apache.lucene.util.Version;
importorg.apache.mahout.classifier.naivebayes.BayesUtils;
importorg.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import
org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
importorg.apache.mahout.common.Pair;
import
org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
importorg.apache.mahout.math.RandomAccessSparseVector;
importorg.apache.mahout.math.Vector;
importorg.apache.mahout.math.Vector.Element;
importorg.apache.mahout.vectorizer.TFIDF;
importorg.apache.hadoop.io.*;
importcom.google.common.collect.ConcurrentHashMultiset;
importcom.google.common.collect.Multiset;
![Page 183: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/183.jpg)
Thesupportingmethodstoreadthedictionaryareasfollows:
publicstaticMap<String,Integer>readDictionary(Configurationconf,
PathdictionaryPath){
Map<String,Integer>dictionary=newHashMap<String,Integer>();
for(Pair<Text,IntWritable>pair:newSequenceFileIterable<Text,
IntWritable>(dictionaryPath,true,conf)){
dictionary.put(pair.getFirst().toString(),pair.getSecond().get());
}
returndictionary;
}
Thesupportingmethodstoreadthedocumentfrequencyareasfollows:
publicstaticMap<Integer,Long>readDocumentFrequency(Configuration
conf,PathdocumentFrequencyPath){
Map<Integer,Long>documentFrequency=newHashMap<Integer,Long>();
for(Pair<IntWritable,LongWritable>pair:new
SequenceFileIterable<IntWritable,LongWritable>(documentFrequencyPath,
true,conf)){
documentFrequency.put(pair.getFirst().get(),
pair.getSecond().get());
}
returndocumentFrequency;
}
Thefirstpartofthemainmethodisusedtoperformthefollowingactions:
GettingtheinputLoadingthemodelInitializingStandardNaiveBayesClassifierusingourcreatedmodelReadinglabelindex,documentfrequency,anddictionarycreatedwhilecreatingthevectorfromthedataset
Thefollowingcodecanbeusedfortheprecedingactions:
publicstaticvoidmain(String[]args)throwsException{
if(args.length<5){
System.out.println("Arguments:[model][labelindex]
[dictionary][documentfrequency][newfile]");
return;
}
StringmodelPath=args[0];
StringlabelIndexPath=args[1];
StringdictionaryPath=args[2];
StringdocumentFrequencyPath=args[3];
StringnewDataPath=args[4];
Configurationconfiguration=newConfiguration();//modelisa
matrix(wordId,labelId)=>probabilityscore
NaiveBayesModelmodel=NaiveBayesModel.materialize(new
Path(modelPath),configuration);
StandardNaiveBayesClassifierclassifier=new
StandardNaiveBayesClassifier(model);
//labelsisamaplabel=>classId
Map<Integer,String>labels=
BayesUtils.readLabelIndex(configuration,newPath(labelIndexPath));
![Page 184: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/184.jpg)
Map<String,Integer>dictionary=readDictionary(configuration,
newPath(dictionaryPath));
Map<Integer,Long>documentFrequency=
readDocumentFrequency(configuration,new
Path(documentFrequencyPath));
Thesecondpartofthemainmethodisusedtoextractwordsfromthee-mail:
Analyzeranalyzer=newStandardAnalyzer(Version.LUCENE_CURRENT);
intlabelCount=labels.size();
intdocumentCount=documentFrequency.get(-1).intValue();
System.out.println("Numberoflabels:"+labelCount);
System.out.println("Numberofdocumentsintrainingset:"+
documentCount);
BufferedReaderreader=newBufferedReader(new
FileReader(newDataPath));
while(true){
Stringline=reader.readLine();
if(line==null){
break;
}
ConcurrentHashMultiset<Object>words=
ConcurrentHashMultiset.create();
//extractwordsfrommail
TokenStreamts=analyzer.tokenStream("text",new
StringReader(line));
CharTermAttributetermAtt=ts.addAttribute(CharTermAttribute.class);
ts.reset();
intwordCount=0;
while(ts.incrementToken()){
if(termAtt.length()>0){
Stringword=
ts.getAttribute(CharTermAttribute.class).toString();
IntegerwordId=dictionary.get(word);
//ifthewordisnotinthedictionary,skipit
if(wordId!=null){
words.add(word);
wordCount++;
}
}
}
ts.close();
Thethirdpartofthemainmethodisusedtocreatevectoroftheidwordandthetf-idfweights:
Vectorvector=newRandomAccessSparseVector(10000);
TFIDFtfidf=newTFIDF();
for(Multiset.Entryentry:words.entrySet()){
Stringword=(String)entry.getElement();
intcount=entry.getCount();
IntegerwordId=dictionary.get(word);
Longfreq=documentFrequency.get(wordId);
![Page 185: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/185.jpg)
doubletfIdfValue=tfidf.calculate(count,freq.intValue(),
wordCount,documentCount);
vector.setQuick(wordId,tfIdfValue);
}
Inthefourthpartofthemainmethod,withclassifier,wegetthescoreforeachlabelandassignthee-mailtothehigherscoredlabel:
VectorresultVector=classifier.classifyFull(vector);
doublebestScore=-Double.MAX_VALUE;
intbestCategoryId=-1;
for(inti=0;i<resultVector.size();i++){
Elemente1=resultVector.getElement(i);
intcategoryId=e1.index();
doublescore=e1.get();
if(score>bestScore){
bestScore=score;
bestCategoryId=categoryId;
}
System.out.print(""+labels.get(categoryId)+":"+score);
}
System.out.println("=>"+labels.get(bestCategoryId));
}
}
Now,putallthesecodesunderoneclassandcreatethe.jarfileofthisclass.Wewillusethis.jarfiletotestournewe-mails.
![Page 186: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/186.jpg)
![Page 187: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/187.jpg)
TestingtheprogramTotesttheprogram,performthefollowingsteps:
1. Createafoldernamedassassinmodeltestinthelocaldirectory,asfollows:
mkdir/tmp/assassinmodeltest
2. Tousethismodel,getthefollowingfilesfromhdfsto/tmp/assassinmodeltest:
Fortheearliercreatedmodel,usethefollowingcommand:
hadoopfs–get/user/hue/prodmodel/tmp/assassinmodeltest
Forlabelindex,usethefollowingcommand:
hadoopfs–get/user/hue/prodlabelindex/tmp/assassinmodeltest
Fordf-countsfromtheassassinvecfolder(changethenameofthepart-00000filetodf-count),usethefollowingcommands:
hadoopfs–get/user/hue/assassinvec/df-count
/tmp/assassinmodeltest
dictionary.file-0fromthesameassassinvecfolder
hadoopfs–get/user/hue/assassinvec/dictionary.file-0
/tmp/assassinmodeltest
3. Under/tmp/assassinmodeltest,createafilewiththemessageshowninthefollowingscreenshot:
4. Now,runtheprogramusingthefollowingcommand:
Java–cp/tmp/assassinmodeltest/spamclassifier.jar:/usr/lib/mahout/*
com.packt.spamfilter.TestClassifier/tmp/assassinmodeltest
/tmp/assassinmodeltest/prodlabelindex
/tmp/assassinmodeltest/dictionary.file-0/tmp/assassinmodeltest/df-
count/tmp/assassinmodeltest/testemail
5. Now,updatetheteste-mailfilewiththemessageshowninthefollowingscreenshot:
![Page 188: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/188.jpg)
6. Runtheprogramagainusingthesamecommandasgiveninstep4andviewtheresultasfollows:
Now,wehaveaprogramreadythatcanuseourclassifiermodelandpredicttheunknownitems.Let’smoveontooursecondusecase.
![Page 189: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/189.jpg)
![Page 190: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/190.jpg)
SecondusecaseasanexerciseAsdiscussedatthestartofthischapter,wewillnowworkonasecondusecase,wherewewillpredictthecategoryofanewe-mail.
![Page 191: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/191.jpg)
TheASFe-maildatasetTheApacheSoftwareFoundatione-maildatasetispartitionedbyproject.Thise-maildatasetcanbefoundathttp://aws.amazon.com/datasets/7791434387204566.
Asmallerdatasetcanbefoundathttp://files.grantingersoll.com/ibm.tar.gz.(Refertohttp://lucidworks.com/blog/scaling-mahout/).Usethisdatatoperformthefollowingsteps:
1. Movethisdatatothefolderofyourchoice(/tmp/asfmail)andunzipthefolder:
mkdir/tmp/asfmail
tar–xvfibm.tar
2. Movethedatasettohdfs:
hadoopfs-put/tmp/asfmail/ibm/content/user/hue/asfmail
3. ConvertthemboxfilesintoHadoop’sSequenceFileformatusingMahout’sSequenceFilesFromMailArchivesasfollows:
mahoutorg.apache.mahout.text.SequenceFilesFromMailArchives--charset
"UTF-8"--body--subject--input/user/hue/asfmail/content--output
/user/hue/asfmailout
![Page 192: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/192.jpg)
4. Convertthesequencefileintosparsevector:
mahoutseq2sparse--input/user/hue/asfmailout--output
/user/hue/asfmailseqsp--norm2--weightTFIDF--namedVector--
maxDFPercent90--minSupport2--analyzerName
org.apache.mahout.text.MailArchivesClusteringAnalyzer
5. Modifythelabels:
mahoutorg.apache.mahout.classifier.email.PrepEmailDriver--input
/user/hue/asfmailseqsp--output/user/hue/asfmailseqsplabel--
maxItemsPerLabel1000
Now,thenextthreestepsaresimilartotheonesweperformedearlier:
1. Splitthedatasetintotrainingandtestdatasetsusingthefollowingcommand:
mahoutsplit--input/user/hue/asfmailseqsplabel--trainingOutput
/user/hue/asfmailtrain--testOutput/user/hue/asfmailtest--
randomSelectionPct20--overwrite--sequenceFiles
2. Trainthemodelusingthetrainingdatasetasfollows:
mahouttrainnb-i/user/hue/asfmailtrain-o/user/hue/asfmailmodel-
extractLabels--labelIndex/user/hue/asfmaillabels
3. Testthemodelusingthetestdataset:
mahouttestnb-i/user/hue/asfmailtest-m/user/hue/asfmailmodel--
labelIndex/user/hue/asfmaillabels
Asyoumayhavenoticed,allthestepsareexactlyidenticaltotheonesweperformedearlier.Hereby,Ileavethistopicasanexerciseforyoutocreateyourownclassifiersystemusingthismodel.Youcanusehintsasprovidedforthespamfilterclassifier.Wenowmoveourdiscussiontotuningourclassifier.Let’stakeabriefoverviewofthebestpracticesinthisarea.
![Page 193: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/193.jpg)
![Page 194: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/194.jpg)
ClassifierstuningWealreadydiscussedclassifiers’evaluationtechniquesinChapter1,ClassificationinDataAnalysis.Justasareminder,weevaluateourmodelusingtechniquessuchasconfusionmatrix,entropymatrix,areaundercurve,andsoon.
Fromtheexplanatoryvariables,wecreatethefeaturevector.Tocheckhowaparticularmodelisworking,thesefeaturevectorsneedtobeinvestigated.InMahout,thereisaclassavailableforthis,ModelDissector.Ittakesthefollowingthreeinputs:
Features:Thisclasstakesafeaturevectortouse(destructively)TraceDictionary:ThisclasstakesatracedictionarycontainingvariablesandthelocationsinthefeaturevectorthatareaffectedbythemLearner:Thisclasstakesthemodelthatweareprobingtofindweightsonfeatures
ModelDissectortweaksthefeaturevectorandobserveshowthemodeloutputchanges.Bytakinganaverageofthenumberofexamples,wecandeterminetheeffectofdifferentexplanatoryvariables.
ModelDissectorhasasummarymethod,whichreturnsthemostimportantfeatureswiththeirweights,mostimportantcategory,andthetopfewcategoriesthattheyaffect.
TheoutputofModelDissectorishelpfulintroubleshootingproblemsinawronglycreatedmodel.
Moredetailsforthecodecanbefoundathttps://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/sgd/ModelDissector.java
Whileimprovingtheoutputoftheclassifier,oneshouldtakecarewithtwocommonlyoccurringproblems:targetleak,andbrokenfeatureextraction.
Ifthemodelisshowingresultsthataretoogoodtobetrueoranoutputbeyondexpectations,wecouldhaveaproblemwithtargetleak.Thiserrorcomesonceinformationfromthetargetvariableisincludedintheexplanatoryvariables,whichareusedtotraintheclassifier.Inthisinstance,theclassifierwillworktoowellforthetestdataset.
Ontheotherhand,brokenfeatureextractionoccurswhenfeatureextractionisbroken.Thistypeofclassifiershowstheoppositeresultfromthetargetleakclassifiers.Here,themodelprovidesresultspoorerthanexpected.
Totunetheclassifier,wecanusenewexplanatoryvariables,transformationsofexplanatoryvariables,andcanalsoeliminatesomeofthevariables.Weshouldalsotrydifferentlearningalgorithmstocreatethemodelandchooseanalgorithm,whichisgoodinperformance,trainingtime,andspeed.
MoredetailsontuningcanbefoundinChapter16,DeployingaclassifierinthebookMahoutinAction.
![Page 195: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/195.jpg)
![Page 196: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/196.jpg)
SummaryInthischapter,wediscussedcreatingourownproductionreadyclassifiermodel.Wetookuptwousecaseshere,oneforane-mailspamfilterandtheotherforclassifyingthee-mailaspertheprojects.WeuseddatasetsforApacheSpamAssassinforthee-mailfilterandASFforthee-mailclassifier.
Wealsosawhowtoincreasetheperformanceofyourmodel.
SoyouarenowreadytoimplementclassifiersusingApacheMahoutforyourownrealworldusecases.Happylearning!
![Page 197: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/197.jpg)
IndexA
algorithms,classificationLogisticregression/ClassificationalgorithmsStochasticGradientDescent(SGD)/ClassificationalgorithmsNaïveBayesclassification/ClassificationalgorithmsHiddenMarkovModel(HMM)/Classificationalgorithmsrandomforest/ClassificationalgorithmsMulti-layerperceptron(MLP)/Classificationalgorithms
ApacheSpamAssassinproject/Spame-maildatasetApacheSpark
about/ApacheSparkSparkSQL/ApacheSparkMLib/ApacheSparkGraphX/ApacheSparkSparkstreaming/ApacheSpark
ASFe-maildatasetabout/TheASFe-maildatasetURL/TheASFe-maildataset
Assassindatasetused,forcreatingmodel/CreatingthemodelusingtheAssassindataset
AUC(areaundertheROCcurve)/AreaundertheROCcurveaxons/Neuralnetworkandneurons
![Page 198: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/198.jpg)
Bbackpropagation/MultilayerPerceptronBagofwords/UnderstandingthetermsusedintextclassificationBaumWelchTrainerclass/UsingMahoutfortheHiddenMarkovModelBayesrule
about/IntroducingconditionalprobabilityandtheBayesrulebindingstack
URL/MahoutScalaandSparkbindings
![Page 199: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/199.jpg)
CChi-squaredAutomaticInteractionDetector(CHAID)/Decisiontreeclassification
about/Introducingtheclassification,IntroducingApacheMahoutapplication/Applicationoftheclassificationsystemsystem,working/Workingoftheclassificationsystemalgorithms/Classificationalgorithms
ClassificationandRegressionTree(CART)/Decisiontreeclassifier
trainingdataset/Workingoftheclassificationsystemtestdataset/Workingoftheclassificationsystemmodel/Workingoftheclassificationsystembuilding/Workingoftheclassificationsystem
classifiermodelusing,programfor/Programtouseaclassifiermodel
classifierstuning/Classifierstuning
clusteringabout/IntroducingApacheMahout
conditionalprobabilityabout/IntroducingconditionalprobabilityandtheBayesrule
confusionmatrixabout/TheconfusionmatrixAccuracy/TheconfusionmatrixPrecisionorpositivepredictivevalue/TheconfusionmatrixNegativepredictivevalue/TheconfusionmatrixSensitivity/truepositiverate/recall/TheconfusionmatrixSpecificity/TheconfusionmatrixF1score/Theconfusionmatrix
costfunction,linearregressionabout/Costfunction
![Page 200: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/200.jpg)
DDARPA‘98/UsingMahoutforRandomforestdataanalysis
classification/Introducingtheclassificationdecisiontree
about/Decisiontreedendrites/Neuralnetworkandneuronsdependentvariable/Logisticregressiondeterministicpatterns/Deterministicandnondeterministicpatternsdevelopmentenvironment
settingup,Eclipseused/SettingupadevelopmentenvironmentusingEclipsedimensionalreduction
about/IntroducingApacheMahoutDomainSpecificLanguage(DSL)/Mahoutnewchanges
![Page 201: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/201.jpg)
EEclipse
used,forbuildingdevelopmentenvironment/SettingupadevelopmentenvironmentusingEclipse
emissionmatrix,HMM/IntroducingtheHiddenMarkovModelEntropymatrix
about/Theentropymatrixexplanatoryvariable/Logisticregressionexplanatoryvariables
about/Workingoftheclassificationsystem
![Page 202: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/202.jpg)
Ggradientdescent
about/Gradientdescentsigmoidfunction/Logisticregressionlogisticfunction/Logisticregression
GraphX/ApacheSpark
![Page 203: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/203.jpg)
HH2Oplatform
integration/H2OplatformintegrationURL/H2Oplatformintegration
HadoopURL/IntroducingApacheMahout,InstallingMahout
hiddenlayer,MLPnetwork/MultilayerPerceptronhiddenlayers,MLPnetwork/MultilayerPerceptronHiddenMarkovModel(HMM)/Classificationalgorithmshiddenstates,HMM/IntroducingtheHiddenMarkovModelHMM
about/IntroducingtheHiddenMarkovModelproperties/IntroducingtheHiddenMarkovModelstatevector/IntroducingtheHiddenMarkovModeltransitionmatrix/IntroducingtheHiddenMarkovModelemissionmatrix/IntroducingtheHiddenMarkovModelhiddenstates/IntroducingtheHiddenMarkovModelobservablestate/IntroducingtheHiddenMarkovModelMahoutused/UsingMahoutfortheHiddenMarkovModelModelclass/UsingMahoutfortheHiddenMarkovModelHmmTrainerclass/UsingMahoutfortheHiddenMarkovModelHmmEvaluatorclass/UsingMahoutfortheHiddenMarkovModelHmmAlgorithmsclass/UsingMahoutfortheHiddenMarkovModelHmmUtilsclass/UsingMahoutfortheHiddenMarkovModelRandomSequencerGenerator/UsingMahoutfortheHiddenMarkovModelBaumWelchTrainerclass/UsingMahoutfortheHiddenMarkovModelViterbiEvaluatorclass/UsingMahoutfortheHiddenMarkovModelinputcommand/UsingMahoutfortheHiddenMarkovModeloutputcommand/UsingMahoutfortheHiddenMarkovModelmodelcommand/UsingMahoutfortheHiddenMarkovModellikelihoodcommand/UsingMahoutfortheHiddenMarkovModel
HMM,issuesevaluation/IntroducingtheHiddenMarkovModeldecoding/IntroducingtheHiddenMarkovModellearning/IntroducingtheHiddenMarkovModel
HmmAlgorithmsclass/UsingMahoutfortheHiddenMarkovModelHmmEvaluatorclass/UsingMahoutfortheHiddenMarkovModelHMMModelclass/UsingMahoutfortheHiddenMarkovModelHmmTrainerclass/UsingMahoutfortheHiddenMarkovModelHmmUtilsclass/UsingMahoutfortheHiddenMarkovModelHortonworksSandbox
URL/SettingupMahoutforaWindowsuser
![Page 204: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/204.jpg)
IiInitialstatevector,Markovprocess/TheMarkovprocessindependentvariable/Logisticregressioninputlayer,MLPnetwork/MultilayerPerceptronirisdataset
URL/UsingMahoutforMLPIterativeDichotomiser3(ID3)
URL/Decisiontree
![Page 205: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/205.jpg)
JJava
URL/InstallingMahout
![Page 206: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/206.jpg)
Llabels
about/WorkingoftheclassificationsystemLatentDirichletAllocation(LDA)/IntroducingApacheMahoutlinearregression
about/Understandinglinearregressioncostfunction/Costfunctiongradientdescent/Gradientdescent
logisticfunction/Logisticregressionlogisticregression/Classificationalgorithms
about/LogisticregressionMahout,usingfor/UsingMahoutforlogisticregressiondataset/UsingMahoutforlogisticregressiontrainingandtestdata,preparing/UsingMahoutforlogisticregressionmodel,training/UsingMahoutforlogisticregressiontrainlogistic/UsingMahoutforlogisticregressioninput/UsingMahoutforlogisticregressionoutput/UsingMahoutforlogisticregressiontarget/UsingMahoutforlogisticregressioncategories/UsingMahoutforlogisticregressionpredictors/UsingMahoutforlogisticregressiontypes/UsingMahoutforlogisticregressionfeatures/UsingMahoutforlogisticregressionpasses/UsingMahoutforlogisticregressionrate/UsingMahoutforlogisticregressionrunlogistic/UsingMahoutforlogisticregressionmodel/UsingMahoutforlogisticregressionauc/UsingMahoutforlogisticregressionconfusion/UsingMahoutforlogisticregression
![Page 207: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/207.jpg)
MM2Eclipse
URL/InstallingMavenMahout
about/IntroducingApacheMahoutusecases/IntroducingApacheMahoutfeatures/ReasonsforMahoutbeingagoodchoiceforclassificationinstalling/InstallingMahoutprerequisites/InstallingMahoutbuildingfromsource,Mavenused/BuildingMahoutfromsourceusingMavenMaven,installing/InstallingMavencode,building/BuildingMahoutcodedistributionfile,URL/BuildingMahoutcode,SettingupadevelopmentenvironmentusingEclipsesettingup,forWindowsuser/SettingupMahoutforaWindowsuserused,forlogisticregression/UsingMahoutforlogisticregressionNaïveBayesalgorithm/UsingtheNaïveBayesalgorithminApacheMahoutusing,forHMM/UsingMahoutfortheHiddenMarkovModelusing,forRandomforestalgorithm/UsingMahoutforRandomforestRandomforestalgorithm,implementing/StepstousetheRandomforestalgorithminMahoutMLP,implementing/MLPimplementationinMahoutusing,forMLP/UsingMahoutforMLPMLPalgorithm,using/StepstousetheMLPalgorithminMahoutupdations/MahoutnewchangesScalabindings/MahoutScalaandSparkbindingsSparkbindings/MahoutScalaandSparkbindingsSparkshell,using/UsingMahout’sSparkshellH2Oplatform,integration/H2Oplatformintegration
Mahout,algorithmsabout/AlgorithmssupportedinMahoutsequentialalgorithms/AlgorithmssupportedinMahoutparallelalgorithms/AlgorithmssupportedinMahout
Mahout,usecasesrecommendation/IntroducingApacheMahoutclassification/IntroducingApacheMahoutclustering/IntroducingApacheMahoutdimensionalreduction/IntroducingApacheMahouttopicmodeling/IntroducingApacheMahout
MahoutScalabindingsabout/MahoutScalaandSparkbindings
MahoutSparkbindingsabout/MahoutScalaandSparkbindings
![Page 208: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/208.jpg)
Markovprocessabout/TheMarkovprocessstates/TheMarkovprocesstransitionmatrix/TheMarkovprocessTransitionmatrix/TheMarkovprocessInitialstatevector/TheMarkovprocess
Mavenused,forbuildingMahoutfromsource/BuildingMahoutfromsourceusingMaveninstalling/InstallingMavenURL/InstallingMaven
MLib/ApacheSparkMLP
implementing,inMahout/MLPimplementationinMahoutMahoutused/UsingMahoutforMLPirisdataset/UsingMahoutforMLP
MLPalgorithmusing,inMahout/StepstousetheMLPalgorithminMahout
MLPnetworkabout/MultilayerPerceptronhiddenlayers/MultilayerPerceptronbackpropagation/MultilayerPerceptronzerohiddenlayers/MultilayerPerceptroninputlayer/MultilayerPerceptronoutputlayer/MultilayerPerceptronhiddenlayer/MultilayerPerceptronnumberofneuronsorhiddenunits/MultilayerPerceptron
modelcreating,Assassindatasetused/CreatingthemodelusingtheAssassindatasetclassifiermodel,programforusing/Programtouseaclassifiermodel
model,evaluationconfusionmatrix/TheconfusionmatrixReceiverOperatingCharacteristics(ROC)graph/TheReceiverOperatingCharacteristics(ROC)graphareaundertheROCcurve(AUC)/AreaundertheROCcurveEntropymatrix/Theentropymatrix
model,issuesoverfitting/Workingoftheclassificationsystemunderfitting/Workingoftheclassificationsystem
ModelDissectorFeaturesclass/ClassifierstuningTraceDictionaryclass/ClassifierstuningLearnerclass/Classifierstuningabout/Classifierstuning
![Page 209: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/209.jpg)
Multi-layerperceptron(MLP)/Classificationalgorithms
![Page 210: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/210.jpg)
NNaïveBayesalgorithm
about/UnderstandingtheNaïveBayesalgorithminApacheMahout/UsingtheNaïveBayesalgorithminApacheMahout
NaïveBayesclassification/Classificationalgorithmsneuralnetwork
about/Neuralnetworkandneuronsneurons
about/NeuralnetworkandneuronsURL/Neuralnetworkandneurons
nondeterministicpatterns/DeterministicandnondeterministicpatternsNSL-KDDdataset
URL/UsingMahoutforRandomforest
![Page 211: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/211.jpg)
Oobservablestate,HMM/IntroducingtheHiddenMarkovModeloutlierdetection
about/Workingoftheclassificationsystemoutputlayer,MLPnetwork/MultilayerPerceptronoverfitting,model
issues/Workingoftheclassificationsystem
![Page 212: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/212.jpg)
Pparallelalgorithms/AlgorithmssupportedinMahoutprogram
testing/Testingtheprogrampruning/Decisiontree
![Page 213: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/213.jpg)
Rrandomforest/ClassificationalgorithmsRandomforestalgorithm
about/RandomforestBiasparameter/RandomforestVarianceparameter/RandomforestMahoutused/UsingMahoutforRandomforestNSL-KDDdataset/UsingMahoutforRandomforestdataset/UsingMahoutforRandomforestimplementing,inMahout/StepstousetheRandomforestalgorithminMahout
RandomSequencerGenerator/UsingMahoutfortheHiddenMarkovModelReceiverOperatingCharacteristics(ROC)graph
about/TheReceiverOperatingCharacteristics(ROC)graphregression
about/Introducingregressionlinearregression/Understandinglinearregression
regressionintercept/Logisticregression
![Page 214: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/214.jpg)
Ssequentialalgorithms/AlgorithmssupportedinMahoutsigmoidfunction/Logisticregressionsoftmaxfunction
URL/Neuralnetworkandneuronsspame-maildatasetclassifier
about/Spame-maildatasetSpark
URL/UsingMahout’sSparkshellbinding,URL/UsingMahout’sSparkshell
Spark-item/ApacheSparkSpark-row/ApacheSparkSparkshell
using/UsingMahout’sSparkshellSparkSQL/ApacheSparkSparkstreaming/ApacheSparkstates,Markovprocess/TheMarkovprocessstatevector,HMM/IntroducingtheHiddenMarkovModelStochasticGradientDescent(SGD)/Classificationalgorithms
about/StochasticGradientDescent
![Page 215: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/215.jpg)
Ttargetvariables
about/WorkingoftheclassificationsystemTermfrequency/Understandingthetermsusedintextclassificationtermfrequency
Stemmingofwords/UnderstandingthetermsusedintextclassificationCasenormalization/UnderstandingthetermsusedintextclassificationStopwordremoval/UnderstandingthetermsusedintextclassificationInversedocumentfrequency/UnderstandingthetermsusedintextclassificationTermfrequencyandinversetermfrequency/Understandingthetermsusedintextclassification
textclassificationabout/Understandingthetermsusedintextclassification
topicmodelingabout/IntroducingApacheMahout
transitionmatrix,HMM/IntroducingtheHiddenMarkovModeltransitionmatrix,Markovprocess/TheMarkovprocess
![Page 216: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/216.jpg)
Uunderfitting,model
issues/Workingoftheclassificationsystem
![Page 217: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/217.jpg)
Vvectors
about/UnderstandingthetermsusedintextclassificationViterbiEvaluatorclass/UsingMahoutfortheHiddenMarkovModel
![Page 218: Learning Apache Mahout Classification Related/PDFs and Books... · 2015-05-24 · The entropy matrix Summary 2. Apache Mahout Introducing Apache Mahout Algorithms supported in Mahout](https://reader034.vdocument.in/reader034/viewer/2022050215/5f6154694a44c479a03739dc/html5/thumbnails/218.jpg)
WWindows
user,Mahoutsettingupfor/SettingupMahoutforaWindowsuserWisconsinDiagnosticBreastCancer(WDBC)dataset
URL/UsingMahoutforlogisticregression