machine learning clustering

Mauritius JEDI Machine Learning & Big Data

Clustering Algorithms

Nadeem Oozeer

Machine learning:

• Supervised vs Unsupervised.

– Supervised learning - an outcome variable is available to guide the learning process.

• there must be a training data set in which the solution is already known.

– Unsupervised learning - the outcomes are unknown.

• cluster the data to reveal meaningful partitions and hierarchies

https://docs.google.com/document/d/1HpkI6eIhluwrwwAuJOFK1aI88gorrrX_NWUx4q_-6Qc/edit#heading=h.f9i5pm9h61xt

https://docs.google.com/document/d/1HpkI6eIhluwrwwAuJOFK1aI88gorrrX_NWUx4q_-6Qc/edit#heading=h.x09bqspl47ev

Clustering:

• Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure

[Figure: individual samples gathered into clusters (groups)]

• In this case clustering is carried out using the Euclidean distance as a measure.
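
As a minimal sketch of the distance measure mentioned above, the Euclidean distance between two samples can be computed as follows. The points p, q and r are hypothetical values chosen only for illustration; they are not taken from the slides.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D samples, for illustration only.
p = (0.0, 0.0)
q = (3.0, 4.0)
r = (0.5, 0.5)

print(euclidean(p, q))                    # 5.0
print(euclidean(p, r) < euclidean(p, q))  # True: r is more similar to p than q is
```

A smaller distance means two samples are more similar, so samples that are close together tend to end up in the same cluster.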

Clustering:

• What is clustering good for?

– Market segmentation - group customers into different market segments

– Social network analysis - Facebook "smartlists"

– Organizing computer clusters and data centers for network layout and location

– Astronomical data analysis - Understanding galaxy formation

Galaxy Clustering:

• Multi-wavelength data obtained for galaxy clusters

– Aim: determine robust criteria for the inclusion of a galaxy in a galaxy cluster

– Note: the physical parameters of the galaxy cluster can be heavily influenced by wrongly included candidates

Credit: HST

Clustering Algorithms:

• Hierarchy methods

– statistical method used to build a cluster by arranging elements at various levels

Dendrogram:

• Each level will then represent a possible cluster.

• The height in the dendrogram shows the level of similarity at which any two clusters are joined

• The closer to the bottom two clusters are joined, the more similar they are

• Finding groups from a dendrogram is not simple and is very often subjective
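
The dendrogram idea above can be sketched with a small agglomerative procedure. This is only an illustration under stated assumptions: it uses single linkage (distance between the closest members of two clusters), which the slides do not specify, and the points and the helper names single_linkage and n_clusters_at are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Repeatedly merge the two closest clusters.
    Returns the merges as (cluster_a, cluster_b, height), where height is
    the distance at which the two clusters were joined."""
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

def n_clusters_at(merges, n_points, cut):
    """Number of clusters left if the dendrogram is cut at height `cut`."""
    return n_points - sum(1 for _, _, h in merges if h <= cut)

# Hypothetical 2-D points, for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = single_linkage(points)
for a, b, h in merges:
    print(a, b, round(h, 2))

print(n_clusters_at(merges, len(points), 2.0))  # 3
print(n_clusters_at(merges, len(points), 6.0))  # 2
```

Merges with a small height sit near the bottom of the dendrogram and join very similar clusters. The two different cut heights give different numbers of groups, which is why reading groups off a dendrogram is subjective.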

• Partitioning methods

– make an initial division of the database and then use an iterative strategy to further divide it into sections

– here each object belongs to exactly one cluster

Credit: Legodi, 2014

K-means:

K-means algorithm:

1. Given n objects, initialize k cluster centers

2. Assign each object to its closest cluster center

3. Update the center for each cluster

4. Repeat steps 2 and 3 until no cluster center changes
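
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes the Euclidean distance, picks the initial centers at random from the data, and caps the number of iterations. The function name kmeans, its parameters seed and max_iter, and the sample points are all hypothetical.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: initialize k cluster centers (here: k distinct data points).
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each object to its closest cluster center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centers[c]))
            groups[nearest].append(p)
        # Step 3: update each center to the mean of its members.
        new_centers = []
        for c, members in enumerate(groups):
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 4: stop once no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, groups = kmeans(points, k=2)
for center, members in zip(centers, groups):
    print(center, members)
```

Changing k changes the partition, so it is worth rerunning the sketch with different values of k and comparing the resulting groups.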

• Experiment: Pack of cards, dominoes

• Apply the K-means algorithm to the Shapley data

– Change the number of potential clusters and see how the clustering differs

K Nearest Neighbors (k-NN):

• One of the simplest of all machine learning classifiers

• Differs from other machine learning techniques, in that it doesn't produce a model.

• It does however require a distance measure and the selection of K.

• First, the K training data points nearest to the new observation are found.

• These K points determine the class of the new observation.

1-NN

• Simple idea: label a new point the same as the closest known point

Label it red.

1-NN: Aspects of an Instance-Based Learner

1. A distance metric

– Euclidean

2. How many nearby neighbors to look at?

– One

3. A weighting function (optional)

– Unused

4. How to fit with the local points?

– Just predict the same output as the nearest neighbor.

k-NN

• Generalizes 1-NN to smooth away noise in the labels

• A new point is now assigned the most frequent label of its k nearest neighbors

Label it red, when k = 3

Label it blue, when k = 7
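
The effect of k on the predicted label can be sketched as follows. This is a minimal illustration assuming the Euclidean distance and an unweighted majority vote; the function name knn_predict and the training points are hypothetical and were chosen so that the label changes between k = 3 and k = 7, as in the figure.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    """train is a list of (point, label) pairs.
    Returns the most frequent label among the k points nearest to query."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points, for illustration only.
train = [
    ((0.5, 0.0), "red"),
    ((0.0, 0.6), "red"),
    ((1.0, 0.0), "blue"),
    ((0.0, 1.5), "blue"),
    ((2.0, 0.0), "blue"),
    ((-2.0, 0.5), "blue"),
    ((3.0, 0.0), "red"),
    ((4.0, 4.0), "red"),
]
query = (0.0, 0.0)

print(knn_predict(train, query, k=1))  # red  (1-NN: label of the closest point)
print(knn_predict(train, query, k=3))  # red  (2 red, 1 blue)
print(knn_predict(train, query, k=7))  # blue (3 red, 4 blue)
```

No model is fitted here: the training points are simply stored and searched at prediction time. With two classes, an odd k avoids tied votes.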
