BI Over Petabytes: Meet Apache Mahout

Industrial Strength Machine Learning
April 2009

http://lucene.apache.org/mahout/

4/22/09 1 jeff@windwardsolutions.com

BI and ML

•  Business Intelligence
   – OLAP
   – Analytics
   – Data mining
   – Performance analysis
   – Text mining
   – Predictive analysis
•  Machine Learning
   – Classification
   – Clustering
   – Regression
   – Collaborative filtering
   – Evolutionary algorithms

What is Machine Learning?

•  “Machine learning is the subfield of artificial intelligence that is concerned with the design and development of algorithms that allow computers to improve their performance over time…” (http://en.wikipedia.org/wiki/Machine_learning)
•  Types of ML algorithms
   – Supervised: Using labeled training data, create a function that predicts output for unseen inputs
   – Unsupervised: Using unlabeled data, create a function that can predict output
   – Semi-supervised: Uses labeled and unlabeled data

One Common ML Example

Google.com
Text Clustering

Another Common Example

Amazon.com
Collaborative Filtering

Where ML is Used Today

•  Internet search clustering
•  Knowledge management systems
•  Social network mapping
•  Taxonomy transformations
•  Marketing analytics
•  Recommendation systems
•  Log analysis & event filtering
•  SPAM filtering, fraud detection

Current Situation

•  Vast amounts of data are now available via the Internet
•  Platforms now exist to run computations over large datasets (MapReduce, Hadoop, Dryad)
•  Sophisticated analytics are needed to turn data into information people can use
•  Active Machine Learning research community and research/proprietary implementations of ML algorithms
•  The world needs scalable implementations of ML under open license - ASF

History of Mahout

•  Summer 2007
   – Developers needed scalable ML
   – Mailing list formed
•  Community formed
   – Apache contributors
   – Academia & industry
   – Lots of initial interest
•  Mahout project formed under Apache Lucene
   – January 25, 2008
   – Mahout 0.1 release April, 2009

Who We Are (so far)

Grant Ingersoll
Karl Wettin
Isabel Drost
Ted Dunning
Jeff Eastman
Dawid Weiss
Otis Gospodnetic
Erik Hatcher
Sean Owen
Ozgur Yilmazel

Release 0.1 Code Base

•  Matrix & Vector library
   – Memory resident sparse & dense implementations
•  Classification
   – Naïve Bayes, Complementary Naïve Bayes
•  Clustering
   – Canopy
   – K-Means, fuzzy K-Means
   – Mean Shift
   – Dirichlet Process
•  Collaborative Filtering
   – Taste
•  Evolutionary Algorithms
   – Watchmaker
•  Utilities
   – Distance Measures
   – Parameters

Highly scalable, parallel implementations on the Apache Hadoop platform

Examples: Clustering

•  Canopy
   – Single pass (fast approximation) assigns every point to a single cluster
   – Inputs: Distance Measure, T1, T2 canopy values
•  Mean Shift
   – Iterative process converges on modes of density distribution
   – Inputs: Distance Measure, T1, T2 values, convergence criteria
•  K-Means
   – Iterative process converges on a single, ‘best’ assignment of points to clusters
   – Inputs: Distance Measure, initial clusters, convergence criteria
•  Fuzzy K-Means
   – Like K-Means but uses probability density function to weight all points against all clusters
•  Dirichlet Process
   – Bayesian: incorporates prior domain knowledge as a mixture of models
   – Iterative process converges on multiple, ‘most likely’ answers
   – Inputs:
      •  Number of models, number of iterations to perform
      •  Model (parameters, observations, probability density function)
      •  Model Distribution (prior, posterior sampling)
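
To make the Canopy entry above more concrete, here is a minimal plain-Java sketch of a single-pass canopy build using a distance measure and the T1/T2 thresholds. It is an illustration under stated assumptions, not Mahout's implementation: the class `CanopySketch`, its nested `Canopy` type, the Euclidean `distance` helper, and the `nearestCanopy` assignment step are all hypothetical names introduced here, and T1 is assumed to be larger than T2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative single-pass canopy clustering; not Mahout's own code. */
public class CanopySketch {

    /** A canopy: its founding point (used as center) plus loosely bound members. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Euclidean distance, standing in for the DistanceMeasure input (an assumption). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * One pass over the points. A point within T1 of a canopy center joins
     * that canopy; a point within T2 of any center is "strongly bound" and
     * does not found a canopy of its own.
     */
    static List<Canopy> buildCanopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = distance(p, c.center);
                if (d < t1) {
                    c.members.add(p);
                }
                if (d < t2) {
                    stronglyBound = true;
                }
            }
            if (!stronglyBound) {
                Canopy c = new Canopy(p);
                c.members.add(p);
                canopies.add(c);
            }
        }
        return canopies;
    }

    /** Index of the canopy whose center is closest to p (single-cluster assignment). */
    static int nearestCanopy(List<Canopy> canopies, double[] p) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < canopies.size(); i++) {
            double d = distance(p, canopies.get(i).center);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0.0, 0.0}, new double[] {0.5, 0.0},
            new double[] {10.0, 10.0}, new double[] {10.5, 10.0});
        List<Canopy> canopies = buildCanopies(points, 3.0, 1.0);
        System.out.println("canopies: " + canopies.size());
        for (double[] p : points) {
            System.out.println(Arrays.toString(p) + " -> canopy " + nearestCanopy(canopies, p));
        }
    }
}
```

Because the build loop can place a point in more than one canopy when canopies overlap within T1, the sketch adds a separate nearest-center step to give every point exactly one cluster, matching the slide's description.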

Sample Data

Canopy Clusters

Mean Shift Clusters

K-Means Clusters

Fuzzy K-Means Clusters

Dirichlet Process Clusters

Sample Data (Again)

Apache Hadoop

•  Uses clusters of (5-10,000) general purpose Linux boxes
•  HDFS supports redundant file storage and streaming access in the face of predictable hardware failures
•  Map/Reduce API simplifies programming of algorithms that operate over vast datasets
•  HBase offers Google BigTable style of schema-less, temporal database
•  Pig offers higher level language for manipulating very large datasets that reduces the need for M/R programming
•  ZooKeeper is a highly available and reliable coordination system used to synchronize state between applications
•  Hive is a data warehouse infrastructure that provides data summarization, ad hoc querying and analysis of datasets

http://hadoop.apache.org

The Hadoop Iceberg

[Diagram labels: Map/Reduce Code, Storage Replication, Process Scheduling, Failure Handling, Data Movement, Disk Management, Network Management, Monitoring]

(http://hadoop.apache.org)

Reference Dirichlet Implementation

    private void iterate(int iteration, DirichletState<Observation> state) {

      // create new posterior models
      Model<Observation>[] newModels = modelFactory.sampleFromPosterior(state.getModels());

      // iterate over the samples, assigning each to a model
      for (Observation x : sampleData) {
        // compute normalized vector of probabilities that x is described by each model
        Vector pi = normalizedProbabilities(state, x);
        // then pick one cluster by sampling a Multinomial distribution based upon them
        // see: http://en.wikipedia.org/wiki/Multinomial_distribution
        int k = UncommonDistributions.rMultinom(pi);
        // ask the selected model to observe the datum
        newModels[k].observe(x);
      }

      // update the state from the new models
      state.update(newModels);
    }
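
The step that picks `k` from the normalized probabilities can be illustrated in isolation. The slides do not show how `UncommonDistributions.rMultinom` works internally, so the following is only a sketch of one common way to draw a single index from a probability vector (inverse-CDF sampling). The class `MultinomialSketch` and its methods `normalize` and `sampleIndex` are hypothetical names, and plain `double[]` arrays stand in for the `Vector` type.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative draw of one index from a probability vector; not Mahout's code. */
public class MultinomialSketch {

    /** Scales non-negative weights so that they sum to 1. */
    static double[] normalize(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        double[] pi = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            pi[i] = weights[i] / total;
        }
        return pi;
    }

    /** Returns index k with probability pi[k], by walking the cumulative sum. */
    static int sampleIndex(double[] pi, Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        for (int k = 0; k < pi.length; k++) {
            cumulative += pi[k];
            if (u < cumulative) {
                return k;
            }
        }
        return pi.length - 1; // guards against floating-point rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        double[] pi = normalize(new double[] {1.0, 3.0, 6.0});
        int[] counts = new int[pi.length];
        for (int i = 0; i < 10_000; i++) {
            counts[sampleIndex(pi, rng)]++;
        }
        // Over many draws the counts should be roughly proportional to pi.
        System.out.println(Arrays.toString(counts));
    }
}
```

Sampling the assignment, rather than always taking the most probable model, is what lets the iterative process explore several plausible clusterings instead of settling on a single one.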

Dirichlet Mapper on Hadoop

    public void map(WritableComparable<?> key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // read the next sample point
      Vector sample = DenseVector.decodeFormat(value.toString());
      // compute a vector of probabilities that sample is described by each model
      Vector pi = normalizedProbabilities(state, sample);
      // then pick one model by sampling a Multinomial distribution based upon them
      // see: http://en.wikipedia.org/wiki/Multinomial_distribution
      int k = UncommonDistributions.rMultinom(pi);
      // output value with key of selected model
      output.collect(new Text(String.valueOf(k)), value);
    }

Map/Reduce Jobs Use Local Data

Dirichlet Reducer on Hadoop

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // load the model for this set of values
      int k = Integer.parseInt(key.toString());
      Model<Vector> model = newModels[k];
      while (values.hasNext()) {
        Vector v = DenseVector.decodeFormat(values.next().toString());
        // ask the selected model to observe the datum
        model.observe(v);
      }
      // compute & set new model parameters based upon the observations
      model.computeParameters();
      state.clusters.get(k).setModel(model);
      // output the cluster state for the next iteration
      output.collect(key, new Text(state.clusters.get(k).asFormatString()));
    }
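
The mapper and reducer above rely on Hadoop grouping every value that shares a key before `reduce` is called. The following in-memory sketch imitates that map, group-by-key, reduce flow without any Hadoop dependency. It is a simplification under stated assumptions: samples are one-dimensional, each "model" is just a mean, and the map step picks the nearest mean deterministically instead of sampling a Multinomial distribution as the slides do. The class `MapReduceFlowSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map -> group-by-key -> reduce flow; no Hadoop involved. */
public class MapReduceFlowSketch {

    /** Map step: emit the index of the model mean closest to the sample. */
    static int mapToModel(double sample, double[] means) {
        int best = 0;
        for (int k = 1; k < means.length; k++) {
            if (Math.abs(sample - means[k]) < Math.abs(sample - means[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Shuffle step: group samples by the key that the map step emitted. */
    static Map<Integer, List<Double>> shuffle(double[] samples, double[] means) {
        Map<Integer, List<Double>> grouped = new TreeMap<>();
        for (double s : samples) {
            grouped.computeIfAbsent(mapToModel(s, means), key -> new ArrayList<>()).add(s);
        }
        return grouped;
    }

    /** Reduce step: recompute one model's mean from the samples it observed. */
    static double reduce(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        double[] samples = {1.0, 2.0, 9.0, 11.0};
        double[] means = {0.0, 10.0};
        for (Map.Entry<Integer, List<Double>> e : shuffle(samples, means).entrySet()) {
            System.out.println("model " + e.getKey() + " -> new mean " + reduce(e.getValue()));
        }
    }
}
```

On a real cluster the grouping is done by the framework between the two phases, which is why the mapper only has to choose a key and the reducer only has to fold the values it is handed.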

Conclusion

•  This is just the beginning
•  High demand for scalable machine learning
•  Contributors are needed who have
   – Interest, enthusiasm & programming ability
   – Test driven development skills
   – Comfort with the scary math (or bravery)
   – Interest and/or proficiency with Hadoop
   – Some large data sets you want to analyze

http://lucene.apache.org/mahout/
