05 k-means clustering

K-Means Clustering

Page 1: 05 k-means clustering

K-Means Clustering

Page 2: 05 k-means clustering

Scale!

• Scale to large datasets

– Hadoop MapReduce implementations that scale linearly with data.

– Fast sequential algorithms whose runtime doesn’t depend on the size of the data

– Goal: To be as fast as possible for any algorithm

• Scalable to support your business case

– Apache Software License 2

• Scalable community

– Vibrant, responsive and diverse

– Come to the mailing list and find out more

Page 3: 05 k-means clustering

History of Mahout

• Summer 2007

– Developers needed scalable ML

– Mailing list formed

• Community formed

– Apache contributors

– Academia & industry

– Lots of initial interest

• Project formed under Apache Lucene

– January 25, 2008

Page 4: 05 k-means clustering

Where is ML Used Today

• Internet search clustering

• Knowledge management systems

• Social network mapping

• Taxonomy transformations

• Marketing analytics

• Recommendation systems

• Log analysis & event filtering

• SPAM filtering, fraud detection

Page 5: 05 k-means clustering

Clustering

• Call it fuzzy grouping based on a notion of similarity

Page 6: 05 k-means clustering

Mahout Clustering

• Plenty of algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet

• Group similar-looking objects

• Notion of similarity: distance measure

– Euclidean

– Cosine

– Tanimoto

– Manhattan

Page 7: 05 k-means clustering

Classification

• Predicting the type of a new object based on its features

• The types are predetermined

[Figure: example objects labeled Dog and Cat]

Page 8: 05 k-means clustering

Mahout Classification

• Plenty of algorithms

– Naïve Bayes

– Complementary Naïve Bayes

– Random Forests

– Logistic Regression (SGD)

– Support Vector Machines (patch ready)

• Learn a model from manually classified data

• Predict the class of a new object based on its features and the learned model

Page 9: 05 k-means clustering

Understanding data - Vectors

• The point X = 5, Y = 3, i.e. (5, 3)

• The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])

[Figure: the point (5, 3) plotted on X and Y axes]
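In Mahout's math library the same point can be held either densely or sparsely; a minimal sketch, assuming the org.apache.mahout.math vector classes found in recent releases:

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Dense: every dimension is stored, like Array([5, 3])
Vector dense = new DenseVector(new double[] {5, 3});

// Sparse: only the non-zero dimensions are stored, like HashMap([0 => 5], [1 => 3])
Vector sparse = new RandomAccessSparseVector(2);
sparse.set(0, 5);
sparse.set(1, 3);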

Page 10: 05 k-means clustering

Representing Vectors – The basics

• Now think 3, 4, 5, … n-dimensional

• Think of a document as a bag of words.

“she sells sea shells on the sea shore”

• Now map them to integers

she => 0

sells => 1

sea => 2

and so on

• The resulting vector [1.0, 1.0, 2.0, … ]
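As a concrete sketch of this mapping in plain Java (the class and variable names are illustrative, not Mahout code): build the word-to-integer dictionary on the fly and count occurrences.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BagOfWords {
  public static void main(String[] args) {
    String doc = "she sells sea shells on the sea shore";

    // Assign each distinct word the next free integer index (the "dictionary")
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    List<Double> counts = new ArrayList<Double>();
    for (String word : doc.split("\\s+")) {
      Integer index = dictionary.get(word);
      if (index == null) {
        index = dictionary.size();
        dictionary.put(word, index);
        counts.add(0.0);
      }
      counts.set(index, counts.get(index) + 1.0);
    }

    // she => 0, sells => 1, sea => 2, ... and the vector [1.0, 1.0, 2.0, ...]
    System.out.println(dictionary);
    System.out.println(counts);
  }
}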

Page 11: 05 k-means clustering

Vectors

• Imagine one dimension for each word.

• Each dimension is also called a feature

• Two techniques

– Dictionary Based

– Randomizer Based

Page 12: 05 k-means clustering

K-Means clustering

• K-means clustering is a classical clustering algorithm that uses an expectation-maximization-like technique to partition a number of data points into k clusters.

• K-means clustering is commonly used for a number of classification applications.

• It is often run on extremely large data sets, on the order of hundreds of millions of points and tens of gigabytes of data.

• Because k-means is run on such large data sets, and because of certain characteristics of the algorithm, it is a good candidate for parallelization.

Page 13: 05 k-means clustering

K-Means clustering

• At its simplest, the algorithm is given as inputs a set of n d-dimensional points and a number of desired clusters k.

• For the purposes of this explanation, we will consider points in a Euclidean space.

• However, the clustering algorithm will work in any space, provided a distance metric is given as input as well.

• Initially, k points are chosen as cluster centers.

• There is no fixed way to determine these initial points; instead a number of heuristics are used. Most commonly, k points are chosen at random from the sample of n points.

• Once the k initial centers are chosen, the distance is calculated from every point in the set to each of the k centers, and each point is assigned to the cluster center whose distance is closest.

• Using this assignment of points to cluster centers, each cluster center is recalculated as the centroid of its member points.

• This process is then iterated until convergence is reached.

• That is, points are reassigned to centers and centroids recalculated until the k cluster centers shift by less than some delta value.

• Because of the way k-means works, it has the weakness that it can converge to a local optimum rather than a global optimum.

• This happens when the initial k center points are chosen badly.

Page 14: 05 k-means clustering

Pseudocode

iterate {
  Compute the distance from all points to all k centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to each k-center
  Replace the k-centers with the new averages
}
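The same loop in plain Java, as a minimal sequential sketch (random initial centers, squared Euclidean distance, and a convergence delta; illustrative code, not Mahout's implementation):

import java.util.Arrays;
import java.util.Random;

public class SimpleKMeans {

  // One full run of k-means on n points in d dimensions.
  static double[][] cluster(double[][] points, int k, int maxIterations, double delta) {
    Random random = new Random(42);
    int d = points[0].length;

    // Pick k initial centers at random from the sample (one common heuristic)
    double[][] centers = new double[k][];
    for (int c = 0; c < k; c++) {
      centers[c] = points[random.nextInt(points.length)].clone();
    }

    for (int iter = 0; iter < maxIterations; iter++) {
      double[][] sums = new double[k][d];
      int[] counts = new int[k];

      // Assign each point to its nearest center
      for (double[] p : points) {
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double dist = 0;
          for (int j = 0; j < d; j++) {
            double diff = p[j] - centers[c][j];
            dist += diff * diff;
          }
          if (dist < best) { best = dist; nearest = c; }
        }
        for (int j = 0; j < d; j++) sums[nearest][j] += p[j];
        counts[nearest]++;
      }

      // Recompute each center as the centroid of its members; stop when the shift is small
      double maxShift = 0;
      for (int c = 0; c < k; c++) {
        if (counts[c] == 0) continue;  // empty cluster: keep the old center
        for (int j = 0; j < d; j++) {
          double updated = sums[c][j] / counts[c];
          maxShift = Math.max(maxShift, Math.abs(updated - centers[c][j]));
          centers[c][j] = updated;
        }
      }
      if (maxShift < delta) break;
    }
    return centers;
  }

  public static void main(String[] args) {
    double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {0.5, 1.5}, {9, 8.5}};
    System.out.println(Arrays.deepToString(cluster(points, 2, 10, 0.001)));
  }
}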

Page 15: 05 k-means clustering

K-Means clustering

[Figure: data points with cluster centers c1, c2, c3]

Page 16: 05 k-means clustering

K-Means clustering

[Figure: points assigned to their nearest of the centers c1, c2, c3]

Page 17: 05 k-means clustering

K-Means clustering

[Figure: old and new positions of centers c1, c2, c3 after recomputing centroids]

Page 18: 05 k-means clustering

K-Means clustering

[Figure: final clusters around the converged centers c1, c2, c3]

Page 19: 05 k-means clustering

Parallelizing k-means

• In order to parallelize k-means, we want a scheme in which we can operate on each point in the data set independently.

• In the first step of each iteration, it is necessary to compute the distance from each point to each of the k cluster centers and assign that point to the cluster with the minimum distance.

• Thus, there is a small amount of shared data, namely the cluster centers.

• However, this is small in comparison to the number of data points.

• So the parallelization scheme involves duplicating the cluster centers; once they are duplicated, each data point can be operated on independently of the others and we gain a nice speedup. A sketch of this decomposition follows below.
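A minimal sketch of that decomposition in plain Java (illustrative method names, not the actual Hadoop Mapper/Reducer classes): the map side sees one point plus the broadcast centers, and the reduce side averages the points that were assigned to one center.

import java.util.List;

public class KMeansMapReduceSketch {

  // Map step: the centers are duplicated to every mapper; each point is processed
  // independently and emitted keyed by the index of its nearest center.
  static int map(double[] point, double[][] centers) {
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int c = 0; c < centers.length; c++) {
      double dist = 0;
      for (int j = 0; j < point.length; j++) {
        double diff = point[j] - centers[c][j];
        dist += diff * diff;
      }
      if (dist < best) { best = dist; nearest = c; }
    }
    return nearest;  // emit (nearest, point)
  }

  // Reduce step: all points assigned to one center arrive together;
  // their mean becomes that center's new position for the next iteration.
  static double[] reduce(List<double[]> assignedPoints) {
    int d = assignedPoints.get(0).length;
    double[] centroid = new double[d];
    for (double[] p : assignedPoints) {
      for (int j = 0; j < d; j++) centroid[j] += p[j];
    }
    for (int j = 0; j < d; j++) centroid[j] /= assignedPoints.size();
    return centroid;
  }
}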

Page 20: 05 k-means clustering

Step 1 – Convert dataset into a Hadoop Sequence File

• SGML files

– $ mkdir -p reuters-sgm
– $ cd reuters-sgm && tar xzf ../reuters21578.tar.gz && cd ..

• Extract content from SGML to text files

– $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters/out

Page 21: 05 k-means clustering

Step 1 – Convert dataset into a Hadoop Sequence File

• Use seqdirectory tool to convert text file into a Hadoop Sequence File

– $ mahout seqdirectory \

-i reuters/out \

-o reuters/seqdir \

-c UTF-8 \

-chunk 5

Page 22: 05 k-means clustering

Hadoop Sequence File

• Sequence of records, where each record is a <Key, Value> pair

– <Key1, Value1>

– <Key2, Value2>

– …

– …

– …

– <Keyn, Valuen>

• Key and Value need to be of class org.apache.hadoop.io.Text

– Key = Record name or File name or unique identifier

– Value = Content as UTF-8 encoded string

• TIP: Dump data from your database directly into Hadoop Sequence Files

Page 23: 05 k-means clustering

Writing to Sequence Files

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");

// One record per document: key = document id, value = UTF-8 content
SequenceFile.Writer writer = new SequenceFile.Writer(
    fs, conf, path, Text.class, Text.class);

for (int i = 0; i < MAX_DOCS; i++) {
  // documents is assumed to be a collection of (id, content) pairs
  writer.append(new Text(documents.get(i).getId()),
                new Text(documents.get(i).getContent()));
}

writer.close();
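A matching read-side sketch, assuming the same Text/Text layout written above (the Reader constructor shown is the older Hadoop form that pairs with the writer on this slide; the path is the same illustrative one):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(
    fs, new Path("testdata/part-00000"), conf);

Text key = new Text();
Text value = new Text();
while (reader.next(key, value)) {
  // key = document id, value = UTF-8 content, exactly as written above
  System.out.println(key + " => " + value.toString().length() + " chars");
}
reader.close();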

Page 24: 05 k-means clustering

Generate Vectors from Sequence Files

• Steps

1. Compute Dictionary

2. Assign integers for words

3. Compute feature weights

4. Create a vector for each document using the word-integer mapping and feature weights

Or

• Simply run $ mahout seq2sparse
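Step 3's weighting, for the common TF-IDF case, boils down to a one-line formula; a minimal sketch (Mahout's exact smoothing and normalization options may differ):

// tf  = occurrences of the term in this document
// df  = number of documents (out of numDocs) containing the term
// A common form: weight = tf * log(numDocs / df)
static double tfIdf(int tf, int df, int numDocs) {
  return tf * Math.log((double) numDocs / df);
}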

Page 25: 05 k-means clustering

Generate Vectors from Sequence Files

• $ mahout seq2sparse \
    -i reuters/seqdir/ \
    -o reuters/sparse

• Important options

– Ngrams

– Lucene Analyzer for tokenizing

– Feature pruning

• Min support

• Max document frequency

• Min LLR (for ngrams)

– Weighting method

• TF vs. TFIDF

• lp-Norm

• Log normalize length

Page 26: 05 k-means clustering

Start K-Means clustering

• $ mahout kmeans \
    -i reuters/sparse/tfidf/ \
    -c reuters-kmeans-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 \
    -x 10 -k 20 -ow

• Things to watch out for

– Number of iterations

– Convergence delta

– Distance Measure

– Creating assignments

Page 27: 05 k-means clustering

Inspect clusters

• $ bin/mahout clusterdump \
    -s reuters-kmeans/clusters-9 \
    -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20

Typical output

:VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …

Top Terms:

iran => 3.1861672217321213

strike => 2.567886952727918

iranian => 2.133417966282966

union => 2.116033937940266

said => 2.101773806290277

workers => 2.066259451354332

gulf => 1.9501374918521601

had => 1.6077752463145605

he => 1.5355078004962228

Page 28: 05 k-means clustering

FAQs

• How to get rid of useless words

• How to see document-to-cluster assignments

• How to choose appropriate weighting

• How to run this on a cluster

• How to scale

• How to choose k

• How to improve similarity measurement

Page 29: 05 k-means clustering

FAQs

• How to get rid of useless words

– Use StopwordsAnalyzer

• How to see document-to-cluster assignments

– Run the clustering process at the end of centroid generation using -cl

• How to choose appropriate weighting

– If it's long text, go with TFIDF. Use normalization if documents differ in length.

• How to run this on a cluster

– Set the HADOOP_CONF directory to point to your Hadoop cluster conf directory

• How to scale

– Use a small value of k to partially cluster the data, then do full clustering on each cluster.

Page 30: 05 k-means clustering

FAQs

• How to choose k

– Figure it out from the data you have: trial and error

– Or use Canopy Clustering and distance threshold to figure it out

– Or use Spectral clustering

• How to improve Similarity Measurement

– Not all features are equal

– A small weight difference for certain feature types can create a large semantic difference

– Use WeightedDistanceMeasure

– Or write a custom DistanceMeasure (a small sketch follows below)
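For example, a weighted Euclidean distance lets heavier features dominate the comparison; a minimal sketch as a plain function (the real Mahout DistanceMeasure interface has additional methods, and the weights here are illustrative):

// Weighted Euclidean distance: features with larger weights contribute more
static double weightedEuclidean(double[] a, double[] b, double[] weights) {
  double sum = 0;
  for (int i = 0; i < a.length; i++) {
    double diff = a[i] - b[i];
    sum += weights[i] * diff * diff;
  }
  return Math.sqrt(sum);
}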

Page 31: 05 k-means clustering

More clustering algorithms

• Canopy

• Fuzzy K-Means

• Mean Shift

• Dirichlet process clustering

• Spectral clustering.

Page 32: 05 k-means clustering

End of session

Day 4: K-Means Clustering