Machine Learning Basics: Classification and Clustering

Humberto Marchezi, [email protected], November 2015

Upload: humberto-marchezi

Post on 16-Apr-2017

Page 1: Machine Learning Basics

Machine Learning Basics

Classification and Clustering

Humberto Marchezi

[email protected]

November 2015

Page 2: Machine Learning Basics

Definitions

A mix of pattern recognition, artificial intelligence and a bit of data mining

Solves a given task without being explicitly programmed to do so; instead, it makes predictions from the data provided

Machine learning algorithms can be divided into 3 categories:

Supervised learning

Unsupervised learning

Reinforcement learning

Problem types

Classification

Regression

Clustering

etc.

Page 3: Machine Learning Basics

Algorithms

Supervised Learning

Naive Bayesian Classifier

Linear/Polynomial/Logistic/Multinomial Regression

Neural Networks

etc.

Unsupervised Learning

K-means / K-medoids

Principal Component Analysis

Gaussian Distribution (Anomaly Detection)

etc.

Page 4: Machine Learning Basics

Naive Bayes Classifier

Classifies information based on a probabilistic model score

Score for a category $C_k$ with features $f_1, f_2, f_3, \dots, f_n$:

$$p(C_k \mid f_1, f_2, \dots, f_n) = \frac{P(C_k)\, p(f_1 \mid C_k)\, p(f_2 \mid C_k) \cdots p(f_n \mid C_k)}{p(f_1)\, p(f_2) \cdots p(f_n)}$$

For a text classifier, the features above are the individual words in the sentence (bag-of-words model)

In this form it is also known as the multinomial naive Bayes classifier

Page 5: Machine Learning Basics

Naive Bayes Classifier: Concrete Example

Ingredients
  2 tbsp salt
  lemon

Instructions
  Cut lemon
  Pour salt

Page 6: Machine Learning Basics

Naive Bayes Classifier: Concrete Example

Ingredients
word      occurrences
2         1
tbsp      1
salt      1
lemon     1
total     4
examples  2

Instructions
word      occurrences
cut       1
lemon     1
pour      1
salt      1
total     4
examples  2

Global
word      occurrences
2         1
tbsp      1
salt      2
lemon     2
cut       1
pour      1
total     8
examples  4
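The counts above can be reproduced with a few lines of Python. This is only a sketch: the names `TRAINING` and `count_words` are made up for illustration, and it assumes words are lower-cased and split on whitespace, which matches the tables but is not stated in the slides.

```python
from collections import Counter

# Training examples from the previous slide, grouped by category.
TRAINING = {
    "ingredients": ["2 tbsp salt", "lemon"],
    "instructions": ["Cut lemon", "Pour salt"],
}

def count_words(examples):
    """Count lower-cased, whitespace-separated words over all examples."""
    counts = Counter()
    for text in examples:
        counts.update(text.lower().split())
    return counts

# One table per category, plus the global table as the sum of both.
per_category = {name: count_words(texts) for name, texts in TRAINING.items()}
global_counts = sum(per_category.values(), Counter())

for name, counts in per_category.items():
    print(name, dict(counts), "total:", sum(counts.values()))
print("global", dict(global_counts), "total:", sum(global_counts.values()))
```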

Page 7: Machine Learning Basics

Naive Bayes Classifier: Concrete Example

Ingredients, P(Ingredients) = 1/2
word   probability
2      1/4
tbsp   1/4
salt   1/4
lemon  1/4

Instructions, P(Instructions) = 1/2
word   probability
cut    1/4
lemon  1/4
pour   1/4
salt   1/4

Global
word   probability
2      1/8
tbsp   1/8
salt   2/8
lemon  2/8
cut    1/8
pour   1/8
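As a rough illustration, the probabilities above follow from the counts by dividing each count by its table total, and the category priors by dividing the examples per category by all examples. The names `COUNTS`, `EXAMPLES` and `to_probabilities` are hypothetical; exact fractions are used so the values match the tables.

```python
from fractions import Fraction

# Word counts and number of examples per category, copied from the count tables.
COUNTS = {
    "ingredients": {"2": 1, "tbsp": 1, "salt": 1, "lemon": 1},
    "instructions": {"cut": 1, "lemon": 1, "pour": 1, "salt": 1},
}
EXAMPLES = {"ingredients": 2, "instructions": 2}

def to_probabilities(counts):
    """Divide every count by the table total."""
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}

# Prior of each category: share of training examples.
total_examples = sum(EXAMPLES.values())
priors = {c: Fraction(n, total_examples) for c, n in EXAMPLES.items()}

# Per-category word probabilities.
likelihoods = {c: to_probabilities(counts) for c, counts in COUNTS.items()}

# Global word probabilities, from the merged counts.
global_counts = {}
for counts in COUNTS.values():
    for word, n in counts.items():
        global_counts[word] = global_counts.get(word, 0) + n
evidence = to_probabilities(global_counts)

print(priors)
print(likelihoods)
print(evidence)
```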

Page 8: Machine Learning Basics

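Naive Bayes Classifier: Concrete Example

Query: '1 tbsp salt'

Ingredients (I)

$$p(I \mid \text{'1'}, \text{'tbsp'}, \text{'salt'}) = \frac{P(I)\, p(\text{'1'} \mid I)\, p(\text{'tbsp'} \mid I)\, p(\text{'salt'} \mid I)}{p(\text{'1'})\, p(\text{'tbsp'})\, p(\text{'salt'})} = \frac{0.5 \times 0.0001 \times 0.25 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 1$$

Instructions (D)

$$p(D \mid \text{'1'}, \text{'tbsp'}, \text{'salt'}) = \frac{P(D)\, p(\text{'1'} \mid D)\, p(\text{'tbsp'} \mid D)\, p(\text{'salt'} \mid D)}{p(\text{'1'})\, p(\text{'tbsp'})\, p(\text{'salt'})} = \frac{0.5 \times 0.0001 \times 0.0001 \times 0.25}{0.0001 \times 0.125 \times 0.25} = 0.0004$$

Result: Ingredients (since it has the highest score)

Note: 0.0001 is the probability assigned to an unknown word, such as '1' here or 'tbsp' under Instructions (it cannot be zero, because a single zero factor would wipe out the whole product)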
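The two scores can be checked with a short script. This is a sketch under the slide's assumptions: the probability tables are hard-coded, unknown words get the fixed fallback 0.0001, and the function names `score` and `classify` are made up for the example.

```python
import math

UNKNOWN = 0.0001  # fallback probability for unseen words, as in the note above

PRIORS = {"ingredients": 0.5, "instructions": 0.5}
LIKELIHOODS = {
    "ingredients": {"2": 0.25, "tbsp": 0.25, "salt": 0.25, "lemon": 0.25},
    "instructions": {"cut": 0.25, "lemon": 0.25, "pour": 0.25, "salt": 0.25},
}
GLOBAL = {"2": 0.125, "tbsp": 0.125, "salt": 0.25,
          "lemon": 0.25, "cut": 0.125, "pour": 0.125}

def score(category, words):
    """Prior times per-word likelihoods, divided by the global word probabilities."""
    numerator = PRIORS[category] * math.prod(
        LIKELIHOODS[category].get(w, UNKNOWN) for w in words
    )
    denominator = math.prod(GLOBAL.get(w, UNKNOWN) for w in words)
    return numerator / denominator

def classify(text):
    """Return the best-scoring category and all scores for a query string."""
    words = text.lower().split()
    scores = {category: score(category, words) for category in PRIORS}
    return max(scores, key=scores.get), scores

label, scores = classify("1 tbsp salt")
print(label, scores)
```

Since the denominator is the same for every category, it does not change which category wins; it only rescales the scores.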

Page 9: Machine Learning Basics

Naive Bayes Classifier: Examples

Classify email as spam or not spam

Document type classification

Document sections classification

Image Classification

Page 10: Machine Learning Basics

K-Means

Unsupervised learning algorithm to identify clusters

Find clusters for unlabeled data

Algorithm

k-means
  Choose K examples as initial centroids
  While centroids move
    1) For each example x_i, choose its closest centroid and store the distance c_i
    2) Recompute the centroid of each cluster as the mean of its examples
  end
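The loop above might look like this in plain Python. It is a minimal sketch, not a reference implementation: it assumes Euclidean distance, points given as tuples of floats, and caller-supplied initial centroids; `closest`, `mean` and `k_means` are illustrative names, and an empty cluster simply keeps its old centroid.

```python
import math

def closest(point, centroids):
    """Return (index, distance) of the centroid nearest to point."""
    distances = [math.dist(point, c) for c in centroids]
    index = distances.index(min(distances))
    return index, distances[index]

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Step 1: assign each example to its closest centroid.
        labels = [closest(p, centroids)[0] for p in points]
        # Step 2: move each centroid to the mean of its cluster.
        moved = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            moved.append(mean(members) if members else old)
        if moved == centroids:  # centroids stopped moving
            break
        centroids = moved
    # Final assignment and distances c_i against the final centroids.
    assigned = [closest(p, centroids) for p in points]
    return centroids, [k for k, _ in assigned], [d for _, d in assigned]

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels, distances = k_means(points, [points[0], points[3]])
print(centroids, labels)
```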

Page 11: Machine Learning Basics

K-Means: example steps to converge to the final solution

Figure: Taken from https://en.wikipedia.org/wiki/File:K_Means_Example_Step_2.svg

Page 12: Machine Learning Basics

K-Means: How to avoid sub-optimal results?

Figure: Generated from http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Page 13: Machine Learning Basics

K-Means: How to avoid sub-optimal results?

k-means
  Repeat N times
    Randomly choose K examples as initial centroids
    While centroids move
      1) For each example x_i, choose its closest centroid and store the distance c_i
      2) Recompute the centroid of each cluster as the mean of its examples
    end
    Calculate the result cost (average distance of examples to their centroids)
    If the result cost is lower than the best so far, keep this result
  end (repeat)
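A possible Python version of the restart loop is sketched below. It assumes Euclidean distance and points as tuples; `assign`, `k_means` and `best_of_n` are illustrative names, and the fixed `seed` is only there to make the run repeatable.

```python
import math
import random

def assign(points, centroids):
    """For each point, return (index of closest centroid, distance to it)."""
    result = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        k = distances.index(min(distances))
        result.append((k, distances[k]))
    return result

def k_means(points, centroids, max_iterations=100):
    """Plain k-means from the given initial centroids; returns (centroids, cost)."""
    for _ in range(max_iterations):
        labels = [k for k, _ in assign(points, centroids)]
        moved = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                moved.append(old)  # keep an empty cluster's centroid in place
        if moved == centroids:
            break
        centroids = moved
    # Cost: average distance of the examples to their centroids.
    cost = sum(d for _, d in assign(points, centroids)) / len(points)
    return centroids, cost

def best_of_n(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the lowest-cost result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        initial = rng.sample(points, k)  # random examples as initial centroids
        centroids, cost = k_means(points, initial)
        if best is None or cost < best[1]:
            best = (centroids, cost)
    return best

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, cost = best_of_n(points, k=2)
print(centroids, cost)
```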

Page 14: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: K-means elbow method

Page 15: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Solution for k=1

Page 16: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Solution for k=2

Page 17: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Solution for k=3

Page 18: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Solution for k=4

Page 19: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Solution for k=5

Page 20: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: Cluster costs

Page 21: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Elbow method
  Repeat for K = 1, 2, 3, ... clusters
    Run K-Means
    Compute the average cost for K clusters, $\frac{1}{n} \sum_{i=1}^{n} c_i$ (or simply $\sum_{i=1}^{n} c_i$), where n is the number of examples
  end (repeat)
  Plot the cost for each K and choose the one located at the "elbow"
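The cost curve for the elbow method could be computed roughly as follows. This sketch makes several assumptions that are not in the slides: Euclidean distance, a few random restarts per K to reduce the effect of bad initial centroids, and a fixed seed; `k_means_cost` and `elbow_costs` are hypothetical names. Plotting is left out, so the costs are just printed.

```python
import math
import random

def k_means_cost(points, k, rng, max_iterations=100):
    """Run k-means once and return the average point-to-centroid distance."""
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        moved = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if moved == centroids:  # centroids stopped moving
            break
        centroids = moved
    # Average cost: (1/n) * sum of c_i over all n examples.
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def elbow_costs(points, max_k, restarts=5, seed=0):
    """Return {K: lowest average cost over a few restarts} for K = 1..max_k."""
    rng = random.Random(seed)
    return {
        k: min(k_means_cost(points, k, rng) for _ in range(restarts))
        for k in range(1, max_k + 1)
    }

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
costs = elbow_costs(points, max_k=4)
for k, cost in costs.items():
    print(k, round(cost, 3))
```

On this toy data the cost drops sharply from K=1 to K=2 and only slightly afterwards, which is the kind of bend the method looks for.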

Page 22: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: K-means elbow method

Page 23: Machine Learning Basics

K-Means: Elbow Method - How to identify the number of clusters?

Figure: K-means elbow method

It is not always possible to find an elbow (for example, when the examples are evenly distributed)

Best practice: tie the number of clusters to a business meaning

Page 24: Machine Learning Basics

K-Means: Examples

Figure: Customer segmentation with k-means

Page 25: Machine Learning Basics

K-Means: Examples

Figure: Identify related news and articles

Page 26: Machine Learning Basics

K-Means: Examples

Figure: Image color reduction - http://opencv-python-tutroals.readthedocs.org/en/latest/_images/oc_color_quantization.jpg

Page 27: Machine Learning Basics

References and Resources

1 Coursera Machine Learning

https://www.coursera.org/learn/machine-learning

2 Naive Bayes Classifier - Wikipedia

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

3 K-Means Clustering - Wikipedia

https://en.wikipedia.org/wiki/K-means_clustering

4 Visualizing K-Means Clustering

http://www.naftaliharris.com/blog/visualizing-k-means-clustering/

5 Naive Bayes for Image Processing

http://www.cs.ubc.ca/~lowe/papers/12mccannCVPR.pdf

6 Document Clustering with K-Means

http://www.codeproject.com/Articles/439890/Text-Documents-Clustering-using-K-Means-Algorithm