Machine Learning Hands-On: Clustering
TRANSCRIPT
MACHINE LEARNING - CLUSTERING
WHAT’S ON THE MENU - RECOMMENDATIONS
1. Why so popular
2. Supervised vs Unsupervised Learning
3. Applications
4. K-means clustering
5. Frameworks and platforms
6. Wrap-up
MACHINE LEARNING
http://videolectures.net/Top/Computer_Science/Machine_Learning/
WHY IS MACHINE LEARNING (CS 229) THE MOST POPULAR COURSE AT STANFORD? - ANDREW NG
WHAT CAN YOU TELL ME ABOUT X?
Supervised vs unsupervised learning
Supervised: given an object with an observed set of features X1, …, Xn
and a response Y, the goal is to predict Y using X1, …, Xn.
Typical methods: regression and classification.
Unsupervised: given an object with an observed set of features X1, …, Xn
(and no response), the goal is to discover relationships or groups among
the variables or observations. Clustering algorithms try to find natural
groupings in the data, i.e. sets of similar observations.
Typical methods: principal component analysis (PCA), expectation
maximization (EM) and clustering (k-means and its variations).
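The contrast can be sketched in a few lines of scikit-learn (the data, feature values and parameters here are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # observed features X1, X2

# Supervised: a response Y is given; the goal is to predict Y from X.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
model = LinearRegression().fit(X, y)
print(model.coef_)                      # close to [3, -2]

# Unsupervised: no Y; the goal is to discover groups in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(clusters))            # size of each discovered cluster
```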
APPLICATIONS
Market segmentation: given market research results, how to find the best
customer segments.
Anomaly detection: find fraud, detect network attacks, or discover problems in
servers or other sensor-equipped machinery. It is important to be able to detect
new types of anomalies that have never been seen before.
Healthcare: assigning areas to hospitals by accident-proneness, gene clustering.
GROUPING UNLABELED ITEMS USING K-MEANS CLUSTERING
STRENGTHS & WEAKNESSES
Strengths:
• Always converges
• Scales well
• Easy to implement
Weaknesses:
• Can converge at a local minimum
• Slow on very large datasets
• Sensitive to a poor choice of k
GROUPING UNLABELED ITEMS USING K-MEANS CLUSTERING
SIMILARITY
There are several ways of measuring similarity between observations:
Manhattan distance
Euclidean distance
Cosine distance
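The three measures can be computed directly with NumPy; the two vectors below are made up for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

manhattan = np.sum(np.abs(a - b))             # L1: sum of coordinate gaps
euclidean = np.sqrt(np.sum((a - b) ** 2))     # L2: straight-line distance
cosine = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1 - cos(angle)

print(manhattan, euclidean, cosine)  # 6.0, ~3.742, 0.0 (b is parallel to a)
```

Note that cosine distance ignores magnitude: b = 2a gives distance 0, while the Manhattan and Euclidean distances are nonzero.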
K-MEANS PSEUDO CODE
Randomly create k points as starting centroids
Repeat until convergence (while any point changes its cluster assignment):
----------------------------------------------------------------
Cluster assignment step:
For every point
    Calculate the distance between the point and every centroid
    Assign the point to the cluster with the lowest distance
----------------------------------------------------------------
Move centroid step:
For every cluster
    Calculate the mean of the points in that cluster
    Move the centroid to that mean
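The pseudo code maps almost line for line onto a small NumPy implementation; this is a minimal sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following the pseudo code: assign, move, repeat."""
    rng = np.random.default_rng(seed)
    # Randomly create k starting centroids (here: k distinct data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    prev = None
    for _ in range(max_iter):
        # Cluster assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Converged when no point changes its cluster assignment.
        if prev is not None and np.array_equal(assignment, prev):
            break
        prev = assignment
        # Move centroid step: each centroid moves to the mean of its cluster.
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

# Two well-separated blobs should be recovered cleanly.
pts = np.vstack([np.zeros((20, 2)), np.ones((20, 2)) * 5])
c, a = kmeans(pts, k=2)
```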
COST FUNCTION & RANDOM INITIALIZATION
J(c(1 to m), µ(1 to K)) is the average squared distance from each point x(i)
to the centroid of its assigned cluster.
for i = 1 to 100 {
    randomly initialize k-means
    run k-means to get cluster assignments c(1 to m) and centroid positions µ(1 to K)
    compute the cost function J(c(1 to m), µ(1 to K))
}
Pick the clustering that gave the lowest J(c(1 to m), µ(1 to K))
Cluster assignment step: minimize J with respect to c(1 to m) while holding µ(1 to K) fixed
Move centroid step: minimize J with respect to µ(1 to K)
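The restart loop can be sketched with scikit-learn, where the fitted model's `inertia_` attribute plays the role of the cost J (the data below is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

best_J, best_model = np.inf, None
for i in range(100):
    # One run per random initialization (n_init=1, varying seed).
    km = KMeans(n_clusters=2, n_init=1, random_state=i).fit(X)
    if km.inertia_ < best_J:            # compute J; keep the lowest
        best_J, best_model = km.inertia_, km
```

In practice `KMeans(n_clusters=2, n_init=100)` performs this loop internally and returns the lowest-cost clustering.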
PERFORMANCE CONSIDERATIONS
K-means
K-means has a computational complexity of O(iKnm), where
i is the number of iterations,
K is the number of clusters,
n is the number of observations,
m is the number of features.
Improvements:
• Reducing the average number of iterations.
• Parallel implementation of K-means by leveraging Hadoop or Spark.
• Reducing the number of outliers and noisy features by filtering with a smoothing algorithm.
• Decreasing the dimensionality of the model.
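One concrete speed-up in this spirit (not named in the slides, so treat it as an assumption) is scikit-learn's MiniBatchKMeans, which updates centroids from small random batches instead of the full dataset on every iteration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))       # n = 10,000 observations, m = 8 features

# Each update touches only batch_size points, reducing the cost per iteration.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=3,
                      random_state=0).fit(X)
print(mbk.cluster_centers_.shape)      # (5, 8): K centroids, m features
```

Reducing m beforehand (e.g. with the PCA mentioned earlier) shrinks the same O(iKnm) bound from the other direction.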
FRAMEWORKS
Java: Weka, Mahout, Spark
Python: scikit-learn, PySpark, Pylearn2 (Theano)
C++: Shogun
.NET: Encog
https://github.com/josephmisiti/awesome-machine-learning
PLATFORMS - IBM BLUEMIX
PLATFORMS – MICROSOFT AZURE ML
REFERENCES
http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
http://www-bcf.usc.edu/~gareth/ISL/
BOOKS