Machine Learning Hands-On: Clustering
TRANSCRIPT
MACHINE LEARNING - CLUSTERING
WHAT’S ON THE MENU - RECOMMENDATIONS
1. Why so popular
2. Supervised vs Unsupervised Learning
3. Applications
4. K-means clustering
5. Frameworks and platforms
6. Wrap-up
MACHINE LEARNING
http://videolectures.net/Top/Computer_Science/Machine_Learning/
WHY IS MACHINE LEARNING (CS 229) THE MOST POPULAR COURSE AT STANFORD? - ANDREW NG
WHAT CAN YOU TELL ME ABOUT X?
Supervised vs unsupervised learning
Supervised: given an object with an observed set of features X1, …, Xn
and a response Y, the goal is to predict Y using X1, …, Xn.
Typical methods: regression and classification.
Unsupervised: given an object with an observed set of features X1, …, Xn
(and no response), the goal is to discover relationships or groups among
the variables or observations. Clustering algorithms try to find natural
groupings in the data, i.e. sets of similar observations.
Typical methods: principal component analysis (PCA), expectation
maximization (EM) and clustering (k-means and its variations).
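The contrast can be sketched in a few lines of scikit-learn (the data, feature values and parameters here are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # observed features X1, X2

# Supervised: a response Y is given; the goal is to predict Y from X.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X, y)
print(model.coef_)                      # close to [3, -2]

# Unsupervised: no Y; the goal is to discover groups in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(clusters))            # size of each discovered cluster
```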
APPLICATIONS
Market segmentation: given market research results, how to find the best
customer segments.
Anomaly detection: find fraud, detect network attacks, or discover problems in
servers or other sensor-equipped machinery. It is important to be able to detect
new types of anomalies that have never been seen before.
Healthcare: assigning areas to hospitals by accident-proneness, gene clustering.
GROUPING UNLABELED ITEMS USING K-MEANS CLUSTERING
STRENGTHS & WEAKNESSES
Strengths:
• Always converges
• Scales well
• Easy to implement
Weaknesses:
• Can converge at a local minimum
• Slow on very large datasets
• Sensitive to a poor choice of k
GROUPING UNLABELED ITEMS USING K-MEANS CLUSTERING
SIMILARITY
There are several ways of measuring similarity between observations:
Manhattan distance
Euclidean distance
Cosine distance
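The three measures can be computed directly with NumPy; the two vectors below are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

manhattan = np.sum(np.abs(a - b))             # L1: sum of coordinate gaps
euclidean = np.sqrt(np.sum((a - b) ** 2))     # L2: straight-line distance
cosine = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cos(angle)

print(manhattan, euclidean, cosine)  # 6.0, ~3.742, 0.0 (b is parallel to a)
```

Note that cosine distance ignores magnitude: b = 2a gives distance 0, while the Manhattan and Euclidean distances are nonzero.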
K-MEANS PSEUDO CODE
Randomly create k points as starting centroids
Repeat until convergence (while any point changes its cluster assignment):
----------------------------------------------------------------
Cluster assignment step:
For every point
    Calculate the distance between the point and every centroid
    Assign the point to the cluster with the lowest distance
----------------------------------------------------------------
Move centroid step:
For every cluster
    Calculate the mean of the points in that cluster
    Move the centroid to that mean
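The pseudo code maps almost line for line onto a small NumPy implementation; this is a minimal sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following the pseudo code: assign, move, repeat."""
    rng = np.random.default_rng(seed)
    # Randomly create k starting centroids (here: k distinct data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    prev = None
    for _ in range(max_iter):
        # Cluster assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Converged when no point changes its cluster assignment.
        if prev is not None and np.array_equal(assignment, prev):
            break
        prev = assignment
        # Move centroid step: each centroid moves to the mean of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

# Two well-separated blobs should be recovered cleanly.
pts = np.vstack([np.zeros((20, 2)), np.ones((20, 2)) * 5])
c, a = kmeans(pts, k=2)
```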
COST FUNCTION & RANDOM INITIALIZATION
J(c(1 to m), µ(1 to K)) is the average squared distance from each point x(i)
to the centroid of its assigned cluster.
for i = 1 to 100 {
    randomly initialize k-means
    run k-means to get cluster assignments c(1 to m) and centroid positions µ(1 to K)
    compute the cost function J(c(1 to m), µ(1 to K))
}
Pick the clustering that gave the lowest J(c(1 to m), µ(1 to K))
Cluster assignment step: minimize J with respect to c(1 to m) while holding µ(1 to K) fixed
Move centroid step: minimize J with respect to µ(1 to K)
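The restart loop can be sketched with scikit-learn, where the fitted model's `inertia_` attribute plays the role of the cost J (the data below is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

best_J, best_model = np.inf, None
for i in range(100):
    # One run per random initialization (n_init=1, varying seed).
    km = KMeans(n_clusters=2, n_init=1, random_state=i).fit(X)
    if km.inertia_ < best_J:            # compute J; keep the lowest
        best_J, best_model = km.inertia_, km
```

In practice `KMeans(n_clusters=2, n_init=100)` performs this loop internally and returns the lowest-cost clustering.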
PERFORMANCE CONSIDERATIONS
K-means
K-means has a computational complexity of O(iKnm), where
i is the number of iterations,
K is the number of clusters,
n is the number of observations,
m is the number of features.
Improvements:
• Reducing the average number of iterations.
• Parallel implementation of K-means by leveraging Hadoop or Spark.
• Reducing the number of outliers and noisy features by filtering with a smoothing algorithm.
• Decreasing the dimensionality of the model.
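One concrete speed-up in this spirit (not named in the slides, so treat it as an assumption) is scikit-learn's MiniBatchKMeans, which updates centroids from small random batches instead of the full dataset on every iteration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))       # n = 10,000 observations, m = 8 features

# Each update touches only batch_size points, reducing the cost per iteration.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3,
                      random_state=0).fit(X)
print(mbk.cluster_centers_.shape)      # (5, 8): K centroids, m features
```

Reducing m beforehand (e.g. with the PCA mentioned earlier) shrinks the same O(iKnm) bound from the other direction.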
FRAMEWORKS
Java: Weka, Mahout, Spark
Python: scikit-learn, PySpark, Pylearn2 (Theano)
C++: Shogun
.NET: Encog
https://github.com/josephmisiti/awesome-machine-learning
PLATFORMS - IBM BLUEMIX
PLATFORMS – MICROSOFT AZURE ML
REFERENCES
http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
http://www-bcf.usc.edu/~gareth/ISL/
BOOKS