CS685 : Special Topics in Data Mining, UKY
The UNIVERSITY of KENTUCKY
Semi-Supervised Clustering
CS 685: Special Topics in Data Mining, Spring 2008
Jinze Liu
Outline of lecture
• Overview of clustering and classification
• What is semi-supervised learning?
  – Semi-supervised clustering
  – Semi-supervised classification
• Semi-supervised clustering
  – What is semi-supervised clustering?
  – Why semi-supervised clustering?
  – Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering
• Unsupervised clustering
  – Group similar objects together to find clusters
    • Minimize intra-cluster distance
    • Maximize inter-cluster distance
• Supervised classification
  – A class label for each training sample is given
  – Build a model from the training data
  – Predict class labels on unseen future data points
What is clustering?
⢠Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
What is Classification?
[Figure: a learning algorithm induces a model from the training set (induction); the model is then applied to the test set (deduction).]

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?
Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)
Classification algorithms
• Decision Trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-Nearest-Neighbor classifiers
• Logistic Regression
• Neural Networks
• Linear Discriminant Analysis (LDA)
Supervised Classification Example
[Figure: labeled points from two classes and the decision boundary learned from them]
Unsupervised Clustering Example
[Figure: unlabeled points grouped into two clusters]
Semi-Supervised Learning
• Combines labeled and unlabeled data during training to improve performance:
  – Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  – Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

[Figure: a spectrum running from unsupervised clustering, through semi-supervised learning, to supervised classification]
Semi-Supervised Classification Example
[Figure: a few labeled points plus many unlabeled points; the unlabeled data refine the decision boundary]
Semi-Supervised Classification
• Algorithms:
  – Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  – Co-training [Blum:COLT98]
  – Transductive SVMs [Vapnik:98, Joachims:ICML99]
  – Graph-based algorithms
• Assumptions:
  – A known, fixed set of categories is given in the labeled data.
  – The goal is to improve classification of examples into these known categories.
• More discussion next week
Semi-Supervised Clustering Example
[Figure: unlabeled points clustered with the help of a few labeled seed points]
Second Semi-Supervised Clustering Example
[Figure: the same points with different supervision, leading to a different grouping]
Semi-supervised clustering: problem definition
• Input:
  – A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  – A small amount of domain knowledge
• Output:
  – A partitioning of the objects into k clusters (possibly with some discarded as outliers)
• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering?
• Why not clustering?
  – The clusters produced may not be the ones required.
  – Sometimes there are multiple possible groupings.
• Why not classification?
  – Sometimes there is insufficient labeled data.
• Potential applications
  – Bioinformatics (gene and protein clustering)
  – Document hierarchy construction
  – News/email categorization
  – Image categorization
Semi-Supervised Clustering
• Domain knowledge
  – Partial label information is given
  – Apply some constraints (must-links and cannot-links)
• Approaches
  – Search-based semi-supervised clustering (this class)
    • Alter the clustering algorithm using the constraints
  – Similarity-based semi-supervised clustering (next class)
    • Alter the similarity measure based on the constraints
  – Combination of both
Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
  – Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99]
  – Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
  – Using the labeled data to initialize clusters in an iterative refinement algorithm such as K-Means [Basu:ICML02]
Overview of K-Means Clustering
⢠K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
Algorithm:
• Initialize K cluster centers {μ_l}, l = 1, …, K, randomly.
• Repeat until convergence:
  – Cluster assignment step: assign each data point x to the cluster X_l whose center μ_l is nearest to x in L2 distance.
  – Center re-estimation step: re-estimate each cluster center μ_l as the mean of the points in that cluster.
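The two alternating steps above can be sketched directly in NumPy. This is a minimal illustration written for these notes (not code from the lecture); the function name and the choice of sampling data points as initial centers are mine.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: random initialization, then alternate the
    assignment and re-estimation steps until labels stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by sampling K distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for it in range(n_iters):
        # Cluster assignment step: nearest center under L2 distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # converged
        labels = new_labels
        # Center re-estimation step: mean of the points in each cluster
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
    return labels, centers
```

With two well-separated groups of points and k = 2, the returned labels separate the groups.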
K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  J = Σ_{l=1}^{K} Σ_{x_i ∈ X_l} ||x_i − μ_l||²

• Initialization of the K cluster centers:
  – Totally random
  – Random perturbation from the global mean
  – A heuristic that ensures well-separated centers
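The objective is easy to compute for a given assignment. A small sketch (function name mine, not from the lecture):

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared L2 distances from each point to its assigned
    cluster center: J = sum_l sum_{x_i in X_l} ||x_i - mu_l||^2."""
    return float(sum(np.sum((X[labels == l] - centers[l]) ** 2)
                     for l in range(len(centers))))
```

For example, two points at (0,0) and (2,0) assigned to a single center at (1,0) give J = 1 + 1 = 2.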
K-Means Example
[Figure sequence: randomly initialize means; assign points to clusters; re-estimate means; re-assign points to clusters; re-estimate means; re-assign points; re-estimate means and converge]
Semi-Supervised K-Means
• Partial label information is given
  – Seeded K-Means
  – Constrained K-Means
• Constraints (must-link, cannot-link)
  – COP K-Means
Semi-Supervised K-Means for partially labeled data
• Seeded K-Means:
  – Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  – Seed points are used only for initialization, not in subsequent steps.
• Constrained K-Means:
  – Labeled data provided by the user are used to initialize the K-Means algorithm.
  – Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of non-seed data are re-estimated.
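Both variants differ from plain K-Means only in initialization and, for the constrained version, in the assignment step. A hypothetical sketch following the description above (the function name and the use of -1 for "unlabeled" are my conventions, not from Basu et al.):

```python
import numpy as np

def seeded_kmeans(X, k, seed_labels, constrained=False, n_iters=100):
    """Seeded / Constrained K-Means sketch.
    seed_labels[i] is the cluster label of point i, or -1 if unlabeled.
    With constrained=True, seed points keep their labels in every
    assignment step; otherwise seeds are used only for initialization."""
    seed_labels = np.asarray(seed_labels)
    # Initialization: center l is the mean of the seed points labeled l
    centers = np.vstack([X[seed_labels == l].mean(axis=0) for l in range(k)])
    labels = np.full(len(X), -1)
    for it in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if constrained:
            # Constrained K-Means: seed labels never change
            seeded = seed_labels >= 0
            new_labels[seeded] = seed_labels[seeded]
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
    return labels, centers
```

With clean, well-separated seeds the two variants behave the same; they diverge only when a seed point lies closer to another cluster's center.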
Seeded K-Means
Use labeled data to find the initial centroids and then run K-Means.
The labels of seed points may change.
Seeded K-Means Example
[Figure sequence: initialize means using labeled data; assign points to clusters; re-estimate means; assign points and converge — note that one seed point's label changes]
Constrained K-Means
Use labeled data to find the initial centroids and then run K-Means.
The labels of seed points do not change.
Constrained K-Means Example
[Figure sequence: initialize means using labeled data; assign points to clusters; re-estimate means and converge — seed labels never change]
COP K-Means
• COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that those points cannot later be chosen as the center of another cluster).
• Algorithm: during the cluster assignment step, each point is assigned to its nearest cluster that does not violate any of its constraints. If no such assignment exists, abort.
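The constraint-checking assignment step might look like the sketch below. This is a simplified illustration of the Wagstaff et al. procedure, not their code: it uses plain random initialization (skipping their constraint-aware initialization) and signals failure by returning None.

```python
import numpy as np

def cop_kmeans(X, k, must_link, cannot_link, n_iters=100, seed=0):
    """COP K-Means sketch: assign each point to the nearest center that
    violates none of its constraints; return None if that is impossible."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for it in range(n_iters):
        new_labels = np.full(len(X), -1)
        for i in range(len(X)):
            # Try clusters from nearest to farthest
            for l in np.argsort(np.linalg.norm(centers - X[i], axis=1)):
                ok = True
                for a, b in must_link:
                    j = b if a == i else a if b == i else None
                    # must-link partner already placed in another cluster
                    if j is not None and new_labels[j] not in (-1, l):
                        ok = False
                for a, b in cannot_link:
                    j = b if a == i else a if b == i else None
                    # cannot-link partner already placed in this cluster
                    if j is not None and new_labels[j] == l:
                        ok = False
                if ok:
                    new_labels[i] = l
                    break
            if new_labels[i] == -1:
                return None  # no constraint-respecting assignment: abort
        if it > 0 and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for l in range(k):
            if np.any(new_labels == l):
                centers[l] = X[new_labels == l].mean(axis=0)
    return labels
```

Giving the same pair both a must-link and a cannot-link reproduces the failure case illustrated later: no assignment can satisfy both, so the algorithm aborts.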
COP K-Means Algorithm
[Figure: the COP K-Means pseudocode]
Illustration
[Figure 1: a new point with a must-link — the must-link determines its label, so it is assigned to the red class]
[Figure 2: a new point with a cannot-link — the cannot-link rules out the other class, so it is assigned to the red class]
[Figure 3: a new point with a conflicting must-link and cannot-link — no assignment satisfies both, so the clustering algorithm fails]
Summary
• Seeded and Constrained K-Means: partially labeled data
• COP K-Means: constraints (must-link and cannot-link)
• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  – May not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  – Less sensitive to noise in the seeds.
• Experiments show that semi-supervised K-Means outperforms traditional K-Means.
References
• Semi-supervised Clustering by Seeding
  – http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
• Constrained K-means Clustering with Background Knowledge
  – http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Next class
• Topics
  – Similarity-based semi-supervised clustering
• Readings
  – Distance Metric Learning, with Application to Clustering with Side-Information
    • http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  – From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    • http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf