Orchestrating the Intelligent Web with Apache Mahout

Presented by Aneesha Bakharia
Twitter: aneesha
Email: [email protected]

Upload: aneeshabakharia

Post on 11-May-2015

DESCRIPTION

Presentation on Apache Mahout at Linux Conference Australia 2011

TRANSCRIPT

Page 1: Orchestrating the Intelligent Web with Apache Mahout

Orchestrating the Intelligent Web with Apache Mahout

Presented by Aneesha Bakharia
Twitter: aneesha
Email: [email protected]

Page 2: Orchestrating the Intelligent Web with Apache Mahout

What is Apache Mahout?

• Open source
• Machine Learning Java library
• Scalable (Apache Hadoop)
• Framework for developing, testing and deploying large-scale algorithms

http://mahout.apache.org/

Page 3: Orchestrating the Intelligent Web with Apache Mahout

What’s in a Name?

• Mahout is Hindi for Elephant Driver

Page 4: Orchestrating the Intelligent Web with Apache Mahout

What is Apache Mahout?

• Framework
  – Vector Math/Matrices (e.g. SVD)
  – Collections
  – Hadoop
• Algorithms
  – Classification, Clustering, etc.
• Your Application???
  – You can orchestrate the intelligent web!!!

Page 5: Orchestrating the Intelligent Web with Apache Mahout

A New Breed of Developer

• Key Skills
  – Databases
  – Programming
  – Networking
  – Security
• …but now also
  – distributed data processing is fast becoming an essential part of the developer’s toolbox.

Page 6: Orchestrating the Intelligent Web with Apache Mahout

You never know where you will use Probability and Statistics!!!!

Video snippet from Equilibrium: http://en.wikipedia.org/wiki/Equilibrium_%28film%29

Page 7: Orchestrating the Intelligent Web with Apache Mahout

You never know what you will discover!!!!

Page 8: Orchestrating the Intelligent Web with Apache Mahout

Where people swear in the United States?

http://flowingdata.com/2011/01/25/where-people-swear-in-the-united-states/

Page 9: Orchestrating the Intelligent Web with Apache Mahout

Algorithms in Apache Mahout

• Recommendation (collaborative filtering)
• Clustering
• Classification
• Evolutionary Algorithms

Page 10: Orchestrating the Intelligent Web with Apache Mahout

Algorithms in Apache Mahout

• Top 10 algorithms in data mining

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.

Already supported: k-Means, Apriori (FP-Growth), kNN, Naive Bayes, SVM (coming)

Page 11: Orchestrating the Intelligent Web with Apache Mahout

Requirements

• Java 1.6 (check with: java -version)
• Maven 2.2 (check with: mvn --version)
• Hadoop 0.20

Page 12: Orchestrating the Intelligent Web with Apache Mahout

Running Mahout

• Command line launcher
  bin/mahout (This shows the list of algorithms)

Valid program names are:
  canopy: : Canopy clustering
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  dirichlet: : Dirichlet Clustering
  fkmeans: : Fuzzy K-means clustering
  fpg: : Frequent Pattern Growth
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lda: : Latent Dirchlet Allocation
  ldatopics: : LDA Print Topics
  lucene.vector: : Generate Vectors from a Lucene index
  matrixmult: : Take the product of two matrices
  meanshift: : Mean Shift clustering
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  …..

Page 13: Orchestrating the Intelligent Web with Apache Mahout

Running Mahout

• Run any algorithm, e.g. kmeans, locally
  bin/mahout kmeans --help

Job-Specific Options:
  --input (-i) input
  --output (-o) output
  --distanceMeasure (-dm) e.g. SquaredEuclidean
  --numClusters (-k) k

Page 14: Orchestrating the Intelligent Web with Apache Mahout

Running Mahout

• Scale out: runs on a cluster as per the conf files in the Hadoop directory

• export HADOOP_HOME=/pathto/hadoop-0.20.2/

• Need to use the driver classes
  KMeansDriver.runJob(Path input, Path output ...)

Page 15: Orchestrating the Intelligent Web with Apache Mahout

Clustering

• Unsupervised Machine Learning technique
• Organise items into clusters/groups based upon similarity
• Good for finding patterns and exploring data

Page 16: Orchestrating the Intelligent Web with Apache Mahout

Clustering

• Lots of Algorithms: k-means, Fuzzy K-means, Mean Shift, Canopy, Dirichlet Process, Latent Dirichlet Allocation

• Similarity Distance Measures
  – Euclidean
  – Cosine
  – Tanimoto
  – Manhattan
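
The four distance measures listed above can be sketched in plain Python. This is an illustrative sketch only, not Mahout's implementation: the function names are made up for this example, vectors are plain lists of equal length, and zero vectors are not handled.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors (ignores vector length).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def tanimoto_distance(a, b):
    # 1 - dot / (|a|^2 + |b|^2 - dot); for 0/1 vectors this is the Jaccard distance.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

doc_a = [10.0, 2.0, 4.0]
doc_b = [5.0, 1.0, 2.0]  # same direction as doc_a, half the length
print(euclidean(doc_a, doc_b), manhattan(doc_a, doc_b))
print(cosine_distance(doc_a, doc_b), tanimoto_distance(doc_a, doc_b))
```

Cosine distance treats doc_a and doc_b as near-identical because they point in the same direction, while Euclidean and Manhattan see them as far apart; which behaviour is right depends on the data.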

Page 17: Orchestrating the Intelligent Web with Apache Mahout

Vectors

• Documents
  Bag of words
    word1 => 10
    word2 => 2
    word3 => 4
  Resulting vector [10.0, 2.0, 4.0, .... ]
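
A minimal sketch of the bag-of-words idea above, in plain Python rather than Mahout. The `bag_of_words` helper and the fixed three-word vocabulary are assumptions made for this illustration; real vectorizers build the vocabulary from the corpus.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Count each word, then read the counts off in vocabulary order.
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

vocabulary = ["word1", "word2", "word3"]
# Build a toy document with the same counts as the slide.
document = " ".join(["word1"] * 10 + ["word2"] * 2 + ["word3"] * 4)
vector = bag_of_words(document, vocabulary)
print(vector)
```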

Page 18: Orchestrating the Intelligent Web with Apache Mahout

Range of Vectorization Tools

• Collate multiple words (n-grams)
• Normalization
• TF-IDF
• Stop word removal
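
The four steps above can be sketched as follows. This is a simplified illustration, not what Mahout does internally: the stop word list is a tiny made-up sample, the TF-IDF formula is one common variant (raw count times log(N / document frequency)), and all function names are hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list

def tokenize(text):
    # Lowercase, split on whitespace, drop stop words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def bigrams(tokens):
    # Collate adjacent word pairs (n-grams with n = 2).
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(documents):
    # Weight = term count in the document * log(N / number of documents containing the term).
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    return [
        {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]

def l2_normalize(weights):
    # Scale a weight vector to unit length so long documents do not dominate.
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {t: v / norm for t, v in weights.items()}

docs = ["the elephant is big", "the elephant driver", "a big data cluster"]
weights = tf_idf(docs)
print(bigrams(tokenize(docs[2])))
print(l2_normalize(weights[1]))
```

Terms that appear in every document get a weight of zero under this formula, which is the point of the IDF factor: they carry no information for telling documents apart.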

Page 19: Orchestrating the Intelligent Web with Apache Mahout

kmeans Example

• Set of text files in a directory
• Use seqdirectory to convert the files to SequenceFiles
  bin/mahout seqdirectory -i <input> -o <seq-output>
• Use seq2sparse to convert to sparse vectors
  bin/mahout seq2sparse -i <seq-output> -o <vector-output>
• Run kmeans with k=5
  bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 5
• View output
  bin/mahout clusterdump
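
To show what the kmeans step computes, here is a small single-machine sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not Mahout's distributed implementation; seeding with the first k points and the toy 2-D data are assumptions made to keep the example deterministic.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Component-wise average of a non-empty list of points.
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, max_iterations=10):
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignments

points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [8.0, 9.0], [2.0, 1.0], [9.0, 8.0]]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```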

Page 20: Orchestrating the Intelligent Web with Apache Mahout

Easy enough, but

• How do you know k?
• Data Exploration is required to find the
  • k for your purposes
  • Similarity distance for your purpose
• Role for the Data Scientist
• Explore, Model, Test and Evaluate

Page 21: Orchestrating the Intelligent Web with Apache Mahout

Recommender Engines

• The technique users encounter the most
• Recommend products (books, movies, etc.) based upon past actions
• Infer tastes and preferences to identify unknown items of interest

Page 22: Orchestrating the Intelligent Web with Apache Mahout

Recommendation

• Algorithms: user-based and item-based recommendation

• Framework for storage, online and offline computation

• Similarity Measures
  – Cosine
  – Tanimoto
  – Pearson
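
A rough sketch of item-based recommendation with cosine similarity, written in plain Python rather than against Mahout's recommender classes. The ratings data, user names and scoring rule (sum of similarity times the user's rating) are all assumptions chosen for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0},
    "bob": {"book_a": 4.0, "book_b": 2.0, "book_c": 5.0},
    "carol": {"book_b": 4.0, "book_c": 4.0},
}

def item_vector(item):
    # Represent an item by the ratings users gave it.
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(user, top_n=3):
    seen = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {}
    for candidate in all_items - set(seen):
        candidate_vector = item_vector(candidate)
        # Score an unseen item by its similarity to items the user already rated.
        scores[candidate] = sum(
            cosine(candidate_vector, item_vector(item)) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))
```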

Page 23: Orchestrating the Intelligent Web with Apache Mahout

Frequent Pattern Mining

• Discover interesting patterns based upon how items occur in a sequence

• Example: Sales Transactions
  (Bread, Milk and Eggs)
  (Nappies, Beer)

• Parallel FPGrowth Algorithm
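
To make the idea concrete, the sketch below counts item pairs that occur together in at least `min_support` transactions. It is a brute-force count, not the FP-Growth algorithm (which avoids enumerating candidates by building a prefix tree); the extra transactions beyond the two on the slide are invented for the example.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk", "eggs"},
    {"nappies", "beer"},
    {"bread", "milk"},            # invented for illustration
    {"nappies", "beer", "milk"},  # invented for illustration
]

def frequent_pairs(transactions, min_support):
    counts = Counter()
    for basket in transactions:
        # Sort so each pair has one canonical ordering.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```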

Page 24: Orchestrating the Intelligent Web with Apache Mahout

Classification

• Set of classes/categories (observed pattern)
• Decide if a new input matches a category
• Supervised technique – need training
• E.g. spam or not

Page 25: Orchestrating the Intelligent Web with Apache Mahout

Classification

• Algorithms: Naive Bayes, Random Forest Decision Tree, SVM coming

• Learn a model from a manually labelled training dataset

• Predict the class of an unseen object based on features
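
The learn-then-predict cycle above can be illustrated with a tiny multinomial Naive Bayes text classifier. This is a teaching sketch in plain Python, not Mahout's Naive Bayes job; the four training messages and the "spam"/"ham" labels are made up, and add-one (Laplace) smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    # Learn class frequencies and per-class word counts from labelled text.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("cash prize", model))
print(classify("monday meeting", model))
```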

Page 26: Orchestrating the Intelligent Web with Apache Mahout

Latent Dirichlet Allocation

– Convert text to term-document matrix
– LDA produces
  • word-theme mapping
  • theme-document mapping
  • Allows topic overlap
– Need to specify number of Topics (k)

Page 27: Orchestrating the Intelligent Web with Apache Mahout

Latent Dirichlet Allocation

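Diagram: the tweets (Tweet 1, Tweet 2, Tweet 3) are converted to a Term-Document Matrix; LDA, given the number of themes (k), produces a Topic to Word Mapping and a Tweet to Topic Mapping.

Term-Document Matrix

         Word 1   Word 2   Word n
Doc 1    1        0        2
Doc 2    0        1        0
Doc 3    0        1        1

Specify number of themes (k)

Topic to Word Mapping

          Word 1   Word 2   Word n
Topic 1   0.5      0        1
Topic 2   0        0.5      0

X

Tweet to Topic Mapping

         Topic 1   Topic 2
Doc 1    1         0
Doc 2    0         1
Doc 3    0         1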
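
One way to read the "X" in the diagram is as a matrix product: multiplying the tweet-to-topic mapping by the topic-to-word mapping gives, for each tweet, the word weights implied by its topics. That reading is an assumption, and the sketch below only multiplies the slide's example numbers; it does not run LDA.

```python
# Values copied from the slide's example tables.
doc_topic = [
    [1, 0],  # Doc 1 belongs to Topic 1
    [0, 1],  # Doc 2 belongs to Topic 2
    [0, 1],  # Doc 3 belongs to Topic 2
]
topic_word = [
    [0.5, 0, 1],  # Topic 1 weights for Word 1, Word 2, Word n
    [0, 0.5, 0],  # Topic 2 weights
]

def matmul(a, b):
    # Plain nested-loop matrix product: (rows of a) x (columns of b).
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

implied_word_weights = matmul(doc_topic, topic_word)
for row in implied_word_weights:
    print(row)
```

Comparing each printed row with the matching row of the Term-Document Matrix shows the same words lighting up for Doc 1 and Doc 2, though not at the same scale.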

Page 28: Orchestrating the Intelligent Web with Apache Mahout

Latent Dirichlet Allocation

– Run LDA
  bin/mahout lda --input <PATH> --output <PATH> --numTopics 20

– View Topics
  bin/mahout LDAPrintTopics --input <PATH> --output <PATH> --dictionaryType sequencefile

Page 29: Orchestrating the Intelligent Web with Apache Mahout

Suggesting Twitter Lists

– Twitter introduced Lists to group people you follow so you can see only their timeline of tweets

– Build an application that could recommend people that should be grouped in the same list.

– Use LDA because it allows overlapping list membership, which is great because people talk about multiple topics.

Page 30: Orchestrating the Intelligent Web with Apache Mahout

Suggesting Twitter Lists

– Twitter API Tasks
  • Get list of people that a user follows
  • Retrieve tweets for each person
  • Save Lists back to Twitter

– Data Processing
  • Combine all tweets for a person
  • Remove stop words
  • Stem words
  • Create a user-word matrix
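
The data processing steps above could look roughly like this. Everything here is a stand-in: the tweets are invented, the stop word list is a tiny sample, and `crude_stem` is a naive suffix stripper used only for illustration (a real pipeline would use a proper stemmer such as Porter's and handle punctuation).

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "and", "to", "of", "i"}  # tiny illustrative list

def crude_stem(word):
    # Naive suffix stripping; NOT a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def user_word_matrix(tweets):
    # 1. Combine all tweets for a person.
    combined = defaultdict(list)
    for user, text in tweets:
        combined[user].extend(text.lower().split())
    # 2. Remove stop words, 3. stem, then count words per user.
    counts = {
        user: Counter(crude_stem(w) for w in words if w not in STOP_WORDS)
        for user, words in combined.items()
    }
    # 4. Build the user-word matrix over a shared, sorted vocabulary.
    vocabulary = sorted({w for c in counts.values() for w in c})
    matrix = {user: [c[w] for w in vocabulary] for user, c in counts.items()}
    return vocabulary, matrix

tweets = [
    ("user_a", "Clustering tweets is fun"),
    ("user_a", "I clustered the tweets"),
    ("user_b", "Cooking and eating"),
]
vocabulary, matrix = user_word_matrix(tweets)
print(vocabulary)
print(matrix)
```

Each row of the resulting matrix plays the role of a "document" for LDA, so a user whose row loads on two topics can be suggested for two lists.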

Page 31: Orchestrating the Intelligent Web with Apache Mahout

Suggesting Twitter Lists

– Web UI
  • Authenticate to Twitter
  • Display suggested lists (based on estimate of k)
    (Could also display the important tweets that place the person in the group?)
  • Allow users to change k, i.e. decide on the number of Lists
  • Allow group re-organisation with jQuery sortables

Page 32: Orchestrating the Intelligent Web with Apache Mahout

Gently Getting into Machine Learning and Data Mining

• Programming Collective Intelligence by Toby Segaran

• Mahout in Action by Owen, Anil, Dunning and Friedman

Page 33: Orchestrating the Intelligent Web with Apache Mahout

Summary

• Mahout offers good abstraction for building intelligent web applications

• Skills in data analysis and exploration are now more important than ever

• Mahout is a good platform for distributed algorithm development

Page 34: Orchestrating the Intelligent Web with Apache Mahout

Fascinating Algorithms

• My Top 3 algorithms
  – Some interesting and some disturbing and interesting at the same time

Page 35: Orchestrating the Intelligent Web with Apache Mahout

Fascinating Algorithms

• No 3 – Identifying Manipulated Images
  http://www.technologyreview.com/computing/20423/page1/

Page 36: Orchestrating the Intelligent Web with Apache Mahout

Fascinating Algorithms

• No 2 – Seam Carving (Content Aware Resizing)
  Example: http://swieskowski.net/carve/

Page 37: Orchestrating the Intelligent Web with Apache Mahout

Disturbing Algorithms

• No 1 – Digital Face Beautification
  http://leyvand.com/research/beautification/dfb_sketch.pdf

Page 38: Orchestrating the Intelligent Web with Apache Mahout

Disturbing Algorithms

• No 1 – Digital Face Beautification
  http://leyvand.com/research/beautification/dfb_sketch.pdf

Page 39: Orchestrating the Intelligent Web with Apache Mahout

Disturbing Algorithms

• No 1 – Digital Face Beautification
  http://leyvand.com/research/beautification/dfb_sketch.pdf

Image from Shrek Copyright Dreamworks

Page 40: Orchestrating the Intelligent Web with Apache Mahout

Discussion/Questions

• What will you build?