BI Over Petabytes: Meet Apache Mahout
Industrial Strength Machine Learning
April 2009
http://lucene.apache.org/mahout/
4/22/09 jeff@windwardsolutions.com

BI and ML
• Business Intelligence
  – OLAP
  – Analytics
  – Data mining
  – Performance analysis
  – Text mining
  – Predictive analysis
• Machine Learning
  – Classification
  – Clustering
  – Regression
  – Collaborative filtering
  – Evolutionary algorithms

What is Machine Learning?
• “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)
• Types of ML algorithms
  – Supervised: Using labeled training data, create a function that predicts output for unseen inputs
  – Unsupervised: Using unlabeled data, create a function that can predict output
  – Semi-supervised: Uses labeled and unlabeled data

One Common ML Example
Google.com: Text Clustering

Another Common Example
Amazon.com: Collaborative Filtering

Where ML is Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection

Current Situation
• Vast amounts of data are now available via the Internet
• Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad)
• Sophisticated analytics are needed to turn data into information people can use
• Active Machine Learning research community and research/proprietary implementations of ML algorithms
• The world needs scalable implementations of ML under open license - ASF

History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Mahout project formed under Apache Lucene
  – January 25, 2008
  – Mahout 0.1 release April, 2009

Who We Are (so far)
Grant Ingersoll
Karl Wettin
Isabel Drost
Ted Dunning
Jeff Eastman
Dawid Weiss
Otis Gospodnetic
Erik Hatcher
Sean Owen
Ozgur Yilmazel

Release 0.1 Code Base
• Matrix & Vector library
  – Memory resident sparse & dense implementations
• Classification
  – Naïve Bayes, Complementary Naïve Bayes
• Clustering
  – Canopy
  – K-Means, fuzzy K-Means
  – Mean Shift
  – Dirichlet Process
• Collaborative Filtering
  – Taste
• Evolutionary Algorithms
  – Watchmaker
• Utilities
  – Distance Measures
  – Parameters
Highly scalable, parallel implementations on the Apache Hadoop platform

Examples: Clustering
• Canopy
  – Single pass (fast approximation) assigns every point to a single cluster
  – Inputs: DistanceMeasure, T1, T2 canopy values
• Mean Shift
  – Iterative process converges on modes of density distribution
  – Inputs: DistanceMeasure, T1, T2 values, convergence criteria
• K-Means
  – Iterative process converges on a single, ‘best’ assignment of points to clusters
  – Inputs: DistanceMeasure, initial clusters, convergence criteria
• Fuzzy K-Means
  – Like K-Means but uses probability density function to weight all points against all clusters
• Dirichlet Process
  – Bayesian: incorporates prior domain knowledge as a mixture of models
  – Iterative process converges on multiple, ‘most likely’ answers
  – Inputs:
    • Number of models, number of iterations to perform
    • Model (parameters, observations, probability density function)
    • ModelDistribution (prior, posterior sampling)

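
The Canopy bullet above can be illustrated with a minimal, plain-Java sketch. This is one common formulation of canopy clustering, not Mahout's own code, which the slides do not show. The class name CanopySketch, the Euclidean distance standing in for DistanceMeasure, and all method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopySketch {

    /** Hypothetical stand-in for the DistanceMeasure input: Euclidean distance. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Single pass: a point not within T2 of any existing center starts a new canopy. */
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("expected T1 > T2");
        }
        List<double[]> centers = new ArrayList<>();
        for (double[] p : points) {
            boolean tightlyCovered = false;
            for (double[] c : centers) {
                if (distance(p, c) < t2) {
                    tightlyCovered = true;
                    break;
                }
            }
            if (!tightlyCovered) {
                centers.add(p);
            }
        }
        return centers;
    }

    /** Indices of points within the loose threshold T1 of a center; canopies may overlap. */
    static List<Integer> looseMembers(List<double[]> points, double[] center, double t1) {
        List<Integer> members = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            if (distance(points.get(i), center) < t1) {
                members.add(i);
            }
        }
        return members;
    }

    /** Final assignment: each point goes to exactly one canopy, the nearest center. */
    static int nearestCenter(double[] p, List<double[]> centers) {
        int best = 0;
        for (int k = 1; k < centers.size(); k++) {
            if (distance(p, centers.get(k)) < distance(p, centers.get(best))) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0.5, 0},
            new double[] {10, 10}, new double[] {10.5, 10});
        List<double[]> centers = canopyCenters(points, 3.0, 1.0);
        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> canopy " + nearestCenter(p, centers));
        }
    }
}
```

In this sketch T2 decides when a new canopy is created and T1 decides loose membership, which is why T1 is required to be larger than T2.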
Sample Data [figure]

Canopy Clusters [figure]

Mean Shift Clusters [figure]

K-Means Clusters [figure]

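
The K-Means inputs listed earlier (a distance measure, initial clusters, convergence criteria) map directly onto a short loop. The following is a plain-Java sketch under stated assumptions, not Mahout's implementation: Euclidean distance stands in for DistanceMeasure, and convergenceDelta and maxIterations are hypothetical names for the convergence criteria.

```java
import java.util.Arrays;

final class KMeansSketch {

    /** Hypothetical stand-in for the DistanceMeasure input: Euclidean distance. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the center closest to p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (distance(p, centers[k]) < distance(p, centers[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Iterates until no center moves by convergenceDelta or more, or maxIterations is reached. */
    static double[][] cluster(double[][] points, double[][] initialCenters,
                              double convergenceDelta, int maxIterations) {
        int dims = points[0].length;
        double[][] centers = new double[initialCenters.length][];
        for (int k = 0; k < centers.length; k++) {
            centers[k] = initialCenters[k].clone(); // leave the caller's array untouched
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[centers.length][dims];
            int[] counts = new int[centers.length];
            // assignment step: each point contributes to its nearest center
            for (double[] p : points) {
                int k = nearest(p, centers);
                counts[k]++;
                for (int d = 0; d < dims; d++) {
                    sums[k][d] += p[d];
                }
            }
            // update step: move each non-empty cluster's center to the mean of its points
            double maxShift = 0.0;
            for (int k = 0; k < centers.length; k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) {
                    updated[d] = sums[k][d] / counts[k];
                }
                maxShift = Math.max(maxShift, distance(centers[k], updated));
                centers[k] = updated;
            }
            if (maxShift < convergenceDelta) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        double[][] initial = {{0, 0}, {10, 0}};
        System.out.println(Arrays.deepToString(cluster(points, initial, 1e-6, 50)));
    }
}
```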
Fuzzy K-Means Clusters [figure]

Dirichlet Process Clusters [figure]

Sample Data (Again) [figure]

Apache Hadoop
• Uses clusters of (5-10,000) general purpose Linux boxes
• HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
• Map/Reduce API simplifies programming of algorithms that operate over vast data sets
• Hbase offers Google BigTable style of schema-less, temporal database
• PIG offers higher level language for manipulating very large data sets that reduces the need for M/R programming
• Zookeeper is a highly available and reliable coordination system used to synchronize state between applications
• Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of data sets
http://hadoop.apache.org

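
To make the Map/Reduce bullet concrete, here is a small in-memory sketch of the programming model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step folds each group into one result. It deliberately does not use the Hadoop API; the class MapReduceSketch and its methods are hypothetical, and word counting is chosen only because it is the simplest example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {

    /** Map step: emit (word, 1) for every whitespace-separated word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    /** Reduce step: sum all values emitted for one key. */
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Driver: runs map over every line, groups by key (the "shuffle"), then reduces. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> e : map(line)) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

On a real cluster the framework, not the driver loop, handles the grouping, data movement, and failure recovery; the programmer supplies only the map and reduce functions.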
The Hadoop Iceberg
[figure; labels: Map/Reduce Code, Storage Replication, Process Scheduling, Failure Handling, Data Movement, Disk Management, Network Management, Monitoring]
(http://hadoop.apache.org)

Reference Dirichlet Implementation

private void iterate(int iteration, DirichletState<Observation> state) {
  // create new posterior models
  Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());

  // iterate over the samples, assigning each to a model
  for (Observation x : sampleData) {
    // compute normalized vector of probabilities that x is described by each model
    Vector pi = normalizedProbabilities(state, x);
    // then pick one cluster by sampling a Multinomial distribution based upon them
    // see: http://en.wikipedia.org/wiki/Multinomial_distribution
    int k = UncommonDistributions.rMultinom(pi);
    // ask the selected model to observe the datum
    newModels[k].observe(x);
  }

  // update the state from the new models
  state.update(newModels);
}

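
The step that picks one model by sampling a multinomial distribution can be sketched as a single draw from a normalized probability vector. The body of UncommonDistributions.rMultinom is not shown in the slides, so the following is only an assumption about what such a draw might look like; the class MultinomialSketch and the method sampleIndex are hypothetical names.

```java
import java.util.Random;

final class MultinomialSketch {

    /**
     * Draws one index k with probability pi[k], assuming pi is non-negative and sums to 1.
     * Walks the cumulative distribution until it exceeds a uniform random number.
     */
    static int sampleIndex(double[] pi, Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        for (int k = 0; k < pi.length; k++) {
            cumulative += pi[k];
            if (u < cumulative) {
                return k;
            }
        }
        // floating-point rounding can leave the total slightly below 1
        return pi.length - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] pi = {0.2, 0.5, 0.3};
        int[] counts = new int[pi.length];
        for (int i = 0; i < 10000; i++) {
            counts[sampleIndex(pi, rng)]++;
        }
        for (int k = 0; k < counts.length; k++) {
            System.out.println("model " + k + ": " + counts[k] + " draws");
        }
    }
}
```

Because the choice is random rather than a hard "most probable" assignment, repeated iterations can explore several plausible clusterings, which matches the slide's description of converging on multiple ‘most likely’ answers.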
Dirichlet Mapper on Hadoop

public void map(WritableComparable<?> key, Text value,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  // read the next sample point
  Vector sample = DenseVector.decodeFormat(value.toString());
  // compute a vector of probabilities that sample is described by each model
  Vector pi = normalizedProbabilities(state, sample);
  // then pick one model by sampling a Multinomial distribution based upon them
  // see: http://en.wikipedia.org/wiki/Multinomial_distribution
  int k = UncommonDistributions.rMultinom(pi);
  // output value with key of selected model
  output.collect(new Text(String.valueOf(k)), value);
}

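
The mapper relies on a vector of probabilities that the sample is described by each model. The slides do not show how normalizedProbabilities is computed, so the sketch below is an assumption: each model's probability density at the sample is weighted by a mixture weight, and the results are rescaled to sum to 1. The class name, the one-dimensional normal density, and the mixture array are all hypothetical.

```java
import java.util.Arrays;

final class NormalizedProbabilitiesSketch {

    /** Density of a one-dimensional normal distribution, used here as an example model pdf. */
    static double normalPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /**
     * Weights each model's density by its mixture weight and rescales so the result sums to 1.
     * Falls back to a uniform vector when every weighted density is zero.
     */
    static double[] normalizedProbabilities(double[] mixture, double[] densities) {
        double[] pi = new double[mixture.length];
        double total = 0.0;
        for (int k = 0; k < pi.length; k++) {
            pi[k] = mixture[k] * densities[k];
            total += pi[k];
        }
        for (int k = 0; k < pi.length; k++) {
            pi[k] = total > 0.0 ? pi[k] / total : 1.0 / pi.length;
        }
        return pi;
    }

    public static void main(String[] args) {
        double x = 1.0; // the sample
        double[] densities = {normalPdf(x, 0.0, 1.0), normalPdf(x, 5.0, 1.0)};
        double[] mixture = {0.5, 0.5};
        System.out.println(Arrays.toString(normalizedProbabilities(mixture, densities)));
    }
}
```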
Map/Reduce Jobs Use Local Data [figure]

Dirichlet Reducer on Hadoop

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  // load the model for this set of values
  int k = Integer.parseInt(key.toString());
  Model<Vector> model = newModels[k];
  while (values.hasNext()) {
    Vector v = DenseVector.decodeFormat(values.next().toString());
    // ask the selected model to observe the datum
    model.observe(v);
  }
  // compute & set new model parameters based upon the observations
  model.computeParameters();
  state.clusters.get(k).setModel(model);
  // output the cluster state for the next iteration
  output.collect(key, new Text(state.clusters.get(k).asFormatString()));
}

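
The reducer's observe and computeParameters calls suggest a model that accumulates statistics and then re-estimates its parameters. A minimal sketch of that pattern follows, assuming a model whose only parameter is the mean of the points it has observed. The class MeanModelSketch is hypothetical and is not one of Mahout's Model implementations; only the two method names are taken from the slide.

```java
import java.util.Arrays;

final class MeanModelSketch {
    private final double[] sum;   // running per-dimension sum of observed points
    private int count;            // number of observed points
    private double[] mean;        // current parameter estimate

    MeanModelSketch(int dims) {
        this.sum = new double[dims];
        this.mean = new double[dims];
    }

    /** Accumulates one data point; parameters are not changed until computeParameters(). */
    void observe(double[] x) {
        for (int i = 0; i < sum.length; i++) {
            sum[i] += x[i];
        }
        count++;
    }

    /** Re-estimates the mean from everything observed so far; a no-op if nothing was observed. */
    void computeParameters() {
        if (count == 0) {
            return;
        }
        double[] updated = new double[sum.length];
        for (int i = 0; i < sum.length; i++) {
            updated[i] = sum[i] / count;
        }
        mean = updated;
    }

    double[] getMean() {
        return mean.clone();
    }

    public static void main(String[] args) {
        MeanModelSketch model = new MeanModelSketch(2);
        model.observe(new double[] {1, 2});
        model.observe(new double[] {3, 4});
        model.computeParameters();
        System.out.println(Arrays.toString(model.getMean()));
    }
}
```

Separating observation from parameter estimation is what lets the reducer stream through its values once and update the model only at the end.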
Conclusion
• This is just the beginning
• High demand for scalable machine learning
• Contributors are needed who have
  – Interest, enthusiasm & programming ability
  – Test driven development skills
  – Comfort with the scary math (or bravery)
  – Interest and/or proficiency with Hadoop
  – Some large data sets you want to analyze
http://lucene.apache.org/mahout/