CS480 Introduction to Machine Learning Semi-Supervised Learning
Edith Law
Basic Idea
• Labeled data can be expensive or difficult to collect.
– Human annotation is slow and boring!
– Labels can require experts or special devices to acquire.
– We would prefer to get better performance for free: unlabeled data!
• Goal of semi-supervised learning is to exploit both labeled and unlabeled examples.
Example of Hard-to-Get Labels
• Task: speech analysis
• Switchboard dataset
• telephone conversation transcription
• 400 hours of annotation time for each hour of speech
Example of Hard-to-Get Labels
• Task: natural language parsing
• Penn Chinese Treebank
• 2 years for 4000 sentences
“The National Track and Field Championship has finished.”
Example of Not-So-Hard-to-Get Labels
For some tasks, it may not be too difficult to label 1000+ instances.
Task: image categorization of “eclipse”
Example of Not-So-Hard-to-Get Labels
There are ways, like the ESP game, to encourage “human computation” for more labels.
Nonetheless, we may be able to learn better/faster if we make use of both labeled and unlabeled examples.
Can Unlabeled Data Help?
If we assume that each class is a coherent group (e.g., Gaussian), then unlabeled data can help identify the boundary more accurately.
Semi-supervised learning is possible because we can make assumptions about the relationship between the distribution of unlabeled data P(x) and the target label.
Notation
Given:
• Labeled data: $(X_l, Y_l) = \{x_{1:l}, y_{1:l}\}$, available during training
• Unlabeled data: $X_u = \{x_{l+1:n}\}$, available during training
• Test data: $X_{test} = \{x_{n+1:N}\}$, NOT available during training
Usually $l \ll n$, i.e., there is much more unlabeled data than labeled data.
Notation
• supervised learning (classification/regression): $\{(x_{1:n}, y_{1:n})\}$
• (inductive) semi-supervised learning (classification/regression): $\{(x_{1:l}, y_{1:l}), x_{l+1:n}, x_{test}\}$
• (transductive) semi-supervised learning (classification/regression): $\{(x_{1:l}, y_{1:l}), x_{l+1:n}\}$
• semi-supervised clustering: $\{x_{1:n}, \text{must-links}, \text{cannot-links}\}$
• unsupervised learning (e.g., clustering): $\{x_{1:n}\}$
Overview
• Self-Training • Co-Training • Gaussian Mixture Model (GMM) • Other: Graph-based methods, Semi-Supervised SVM
Self-Training Algorithm
1. Pick your favourite classification method.
2. Train a classifier f using labeled data $(X_l, Y_l)$.
3. Use f to classify all unlabeled items x in $X_u$.
4. Pick x* with the highest confidence.
5. Add (x*, f(x*)) to labeled data.
6. Repeat.
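The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier is 1-nearest-neighbour, confidence is taken to be the negative distance to the nearest labeled point, exactly one instance is added per round, and the name `self_train_1nn` is hypothetical.

```python
import math

def nearest_labeled(x, labeled):
    """Return (distance, label) of the labeled point closest to x."""
    return min((math.dist(x, xl), yl) for xl, yl in labeled)

def self_train_1nn(labeled, unlabeled):
    """Self-training with a 1-NN classifier (illustrative sketch).

    labeled   : list of (point, label) pairs, points are tuples of floats
    unlabeled : list of points
    Assumption: confidence = smallness of the distance to the nearest
    labeled point; other classifiers would supply their own score.
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while unlabeled:
        # Steps 2-3: 1-NN "training" just stores the data; classify all of X_u.
        scored = [(nearest_labeled(x, labeled), x) for x in unlabeled]
        # Step 4: the most confident x* is the one with the smallest distance.
        (_, label), x_star = min(scored)
        # Step 5: move (x*, f(x*)) into the labeled data, then repeat (step 6).
        labeled.append((x_star, label))
        unlabeled.remove(x_star)
    return labeled
```

Calling `self_train_1nn([((0.0,), "a"), ((10.0,), "b")], [(1.0,), (2.0,), (9.0,)])` labels the points one at a time, always starting with the one closest to the current labeled set.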
Self-Training Algorithm (Pseudocode)
Self-Training Algorithm
• Assumption: One’s own high-confidence predictions are correct.
– This is likely to be the case when the classes form well-separated clusters.
• Variations:
– Add a few most confident (x, f(x)) to labeled data.
– Add all (x, f(x)) to labeled data.
– Add all (x, f(x)) to labeled data, weighting each by confidence.
Self-Training Algorithm: A Concrete Example
Self Training with 1NN: Well Separated Case
Self Training with 1NN: Outlier Case
Self-Training Example: Image Categorization
• Train a Naive Bayes classifier on two initial images
Self-Training Example: Image Categorization
Each image is divided into small patches: a 10 x 10 grid, with random patch sizes of 10-20.
Represent each patch:
- normalize patches
- define a dictionary of 200 “visual words”
- represent a patch by the index of its closest visual word.
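A small sketch of the patch representation described above. It assumes patches are already normalized and flattened into vectors, that "closest" means Euclidean distance, and that the dictionary is simply given (the slides do not say how it is built). Turning the indices into per-image counts is also an assumption, chosen because count features are a common input to Naive Bayes.

```python
import math

def closest_word_index(patch, dictionary):
    """Index of the visual word (a vector) nearest to the patch vector."""
    return min(range(len(dictionary)),
               key=lambda j: math.dist(patch, dictionary[j]))

def bag_of_visual_words(patches, dictionary):
    """Count how often each visual word is the closest one to a patch."""
    counts = [0] * len(dictionary)
    for patch in patches:
        counts[closest_word_index(patch, dictionary)] += 1
    return counts
```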
Self-Training Example: Image Categorization
• Train a Naive Bayes classifier on two initial images
• Classify unlabeled data, sort by confidence log Pr(y=astronomy | x)
Self-Training Example: Image Categorization
• Add the most confident images and predicted labels to the labeled data
• Retrain classifier and repeat
Self-Training Algorithm: Advantages
• The simplest semi-supervised learning method.
• A wrapper method, i.e., the technique can be applied to any existing learning algorithm.
• Often used in real tasks like natural language processing.
Self-Training Algorithm: Disadvantages
• Early mistakes could reinforce themselves.
- Heuristic solutions exist, e.g., “un-label” an instance if its confidence falls below a threshold.
• Cannot say too much in terms of convergence, but …
- There are special cases when self-training is equivalent to the Expectation-Maximization (EM) algorithm.
- There are also special cases (e.g., linear functions) when the closed-form solution is known.
Leveraging Multiple Views
Suppose we are classifying a collection of university webpages into two categories: Student vs Faculty.
The webpage can be classified by
• view 1: the text on the webpage (e.g., words such as “professor”, “grant”, “graduated”)
• view 2: the words in all the hyperlinks that point to that webpage (e.g., “my advisor”, “my Ph.D. advisor”, “my mentor”)
Leveraging Multiple Views
In named entity classification, we classify a named entity (i.e., a proper name such as “Washington State” or “Mr. Washington”) with a class label (e.g., Person or Location).
Each instance is represented by two views:
• the set of words that make up the named entity itself
• the words in the context of the named entity
Co-Training
but there are so many ways to express Person or Location!
So, how can we learn using no labeled data?
Co-Training
Inst 1: “headquartered in” seems to indicate y=Location.
Inst 3: “Kazakhstan” must be a Location since it appears with the context “headquartered in”.
Inst 4: “flew to” must be about Location because it is in the context of “Kazakhstan”.
“Location”
Even though we have never seen the word “China”!
Co-Training
Inst 2: “Mr.” seems to indicate y=Person.
Inst 5: “partner at” must be about a Person because it appears in the context of “Mr.”.
“Person”
Even though we have never seen the words “Robert Jordan”!
Co-Training
• Appropriate in settings where the features are redundant.
• The idea is to train two classifiers on the disjoint features, and make the two classifiers take turns teaching each other.
• Co-training is different from self-training in that we use two classifiers in turn, operating on different views of the instances.
Co-Training Algorithm
1. Train two classifiers: $f^{(1)}$ from $(X_l^{(1)}, Y_l)$, $f^{(2)}$ from $(X_l^{(2)}, Y_l)$.
2. Classify $X_u$ with $f^{(1)}$ and $f^{(2)}$ separately.
3. Add $f^{(1)}$’s k most confident $(x, f^{(1)}(x))$ to $f^{(2)}$’s labeled data.
4. Add $f^{(2)}$’s k most confident $(x, f^{(2)}(x))$ to $f^{(1)}$’s labeled data.
5. Repeat.
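A minimal sketch of the five steps above, under several simplifying assumptions that are not in the slides: each view is a single number, both classifiers are 1-nearest-neighbour with distance as the (inverse) confidence, and an instance is removed from the unlabeled pool once either classifier has labeled it. The name `co_train` and its parameters are hypothetical.

```python
def nearest(v, labeled_view):
    """Return (distance, label) of the closest labeled value in one view."""
    return min((abs(v - vl), yl) for vl, yl in labeled_view)

def co_train(labeled, unlabeled, k=1, rounds=10):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2).

    Returns the two labeled pools, one per classifier.
    """
    # Each classifier keeps its own labeled pool (steps 3-4 add to the other's).
    pools = [list(labeled), list(labeled)]
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        for view in (0, 1):
            other = 1 - view
            # Step 1: "train" the 1-NN classifier for this view on its pool.
            train = [(x[view], y) for x, y in pools[view]]
            # Step 2: classify X_u using this view only, most confident first.
            scored = sorted((nearest(x[view], train), x) for x in unlabeled)
            # Steps 3-4: hand the k most confident predictions to the other pool.
            for (_, label), x in scored[:k]:
                pools[other].append((x, label))
                unlabeled.remove(x)
    return pools
```

Keeping two separate pools follows the wording of steps 3 and 4; a variant with one shared labeled set would merge `pools[0]` and `pools[1]`.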
Co-Training (Pseudocode) (Blum and Mitchell, 1998)
Co-Training (Blum and Mitchell, 1998)
• begin with 12 labeled web pages (academic courses)
• provide 1000 additional unlabeled web pages
• average error learning from labeled data: 11.1%
• average error co-training: 5%
Co-Training
Pros:
• wrapper method: it does not matter what algorithms are used for the two classifiers $f^{(1)}$ and $f^{(2)}$
• requires that classifiers can assign a confidence score to their predictions, which is used to select which unlabeled instances to turn into additional training data for the other view
• less sensitive to mistakes than self-training
Cons:
• a feature split may not exist
• a classifier that uses both feature sets may have better performance
Co-Training: Assumptions
• existence of two views $x = [x^{(1)}, x^{(2)}]$
• assumption #1: each view alone is sufficient to make good classifications, given enough labeled data.
• assumption #2: the two views are conditionally independent given the class label.
- knowing one view doesn’t affect what we will observe for the other view
- e.g., context “headquartered in” does not favour any particular location.
Co-Training
Many real-world problems of this type
• Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
• Web page classification [Blum, Mitchell 98]
• Word sense disambiguation [Yarowsky 95]
• Speech recognition [de Sa, Ballard 98]
• Visual classification of cars [Levin, Viola, Freund 03]
Modeling Data using Distributions
In mixture models, we assume that we have a set of hidden mixture components {1,…,K} which generated the data.
Generative Story of Gaussian Mixture Model
Idea: Fit data with a combination of Gaussian distributions.
Specifically, we assume that the data we observed has been generated by repeating the following two steps:
• pick a mixture component: $y \sim p(y) = \mathrm{Multinomial}(\pi)$, where the mixing proportions $\pi$ are a vector of probabilities that sum to 1
• generate x based on the selected mixture component: $x \sim p(x \mid y = k) = \mathcal{N}(x; \mu_k, \sigma_k^2)$
to give the joint distribution:
$$p(x, y) = p(y)\, p(x \mid y)$$
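The two-step generative story can be simulated directly. The sketch below assumes one-dimensional data; the function name `sample_gmm` and the fixed seed are illustrative choices. Note that the `sigma` argument holds standard deviations, because that is what `random.gauss` expects.

```python
import random

def sample_gmm(pi, mu, sigma, n, seed=0):
    """Draw n pairs (x, y) from a 1-D Gaussian mixture.

    pi    : mixing proportions (sum to 1)
    mu    : component means
    sigma : component standard deviations
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Step 1: pick a mixture component y ~ Multinomial(pi).
        k = rng.choices(range(len(pi)), weights=pi)[0]
        # Step 2: generate x from the selected component, x ~ N(mu_k, sigma_k^2).
        x = rng.gauss(mu[k], sigma[k])
        data.append((x, k))
    return data
```

Dropping the second element of each pair gives exactly the situation of unlabeled data: the x values are observed, the component y is hidden.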
Gaussian Mixture Model
In general, we can compute the probability density function p(x) by summing out y:
$$p(x) = \sum_y p(y)\, p(x \mid y) = \sum_{k=1}^{K} p(y = k)\, p(x \mid y = k) = \sum_{k=1}^{K} \pi_k\, p(x \mid y = k)$$
where
• $\pi_k = p(y = k)$ is the probability of the kth mixture component
• $p(x \mid y = k) = \mathcal{N}(x; \mu_k, \sigma_k^2)$ is the probability of x in the kth mixture component
Goal: estimate parameters $\pi_k, \mu_k, \sigma_k^2$
Gaussian Mixture Model
Why do we care to learn the parameters $\pi_k, \mu_k, \sigma_k^2$?
• Think of each mixture component as a class.
• The unlabeled data tells us how instances from all classes, mixed together, are distributed.
• If we know how the instances from each class are distributed, then we can decompose the mixture into individual classes.
GMM: Supervised vs Semi-Supervised
• Given labeled data, assume each class has a Gaussian distribution …
GMM: Supervised vs Semi-Supervised
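Model parameters: $\theta = \{\pi_1, \mu_1, \Sigma_1, \pi_2, \mu_2, \Sigma_2\}$
The GMM:
$$p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta) = \pi_y\, \mathcal{N}(x; \mu_y, \Sigma_y)$$
Classification:
$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$$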
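The classification rule is Bayes' rule applied to the joint distribution. A sketch for the one-dimensional case (an assumption made to keep the code short; the slide uses covariance matrices), with hypothetical function names:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, pi, mu, var):
    """Return [p(y = k | x, theta) for each class k]."""
    # Joint p(x, y = k | theta) = pi_k * N(x; mu_k, var_k).
    joint = [p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)]
    # The denominator sums the joint over all classes, i.e. p(x | theta).
    total = sum(joint)
    return [j / total for j in joint]
```

A point halfway between two equally weighted components with equal variance gets posterior 0.5 for each class, which is where the decision boundary lies.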
GMM: Supervised vs Semi-Supervised
• Given labeled data, assume each class has a Gaussian distribution.
• The most likely model and its decision boundary is …
• Add unlabeled data …
• With unlabeled data, the most likely model and decision boundary changed.
GMM: Supervised vs Semi-Supervised
Decision boundaries are different because they maximize different quantities.
Finding the Parameters
If we have labeled data, then it’s just a matter of finding the maximum likelihood estimate of the parameters.
$$\hat{\theta} = \arg\max_\theta \log p(D \mid \theta) = \arg\max_\theta \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta)$$
For a K-class GMM, we solve the constrained optimization problem:
$$\arg\max_\theta \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta) \quad \text{s.t.} \quad \sum_{k=1}^{K} p(y = k \mid \theta) = 1$$
by setting up the Lagrangian, taking the partial derivatives of the Lagrangian with respect to the parameters, setting them to 0 and solving.
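Solving the Lagrangian for a Gaussian mixture with fully labeled data gives the familiar closed-form estimates: the class frequency for each mixing proportion, and the per-class sample mean and (biased) sample variance. A one-dimensional sketch, assuming classes are numbered 0..K-1 and each class has at least one labeled example; the function name is hypothetical:

```python
def fit_labeled_gmm(xs, ys, K):
    """Maximum likelihood estimates (pi, mu, var) from labeled 1-D data."""
    l = len(xs)
    pi, mu, var = [], [], []
    for k in range(K):
        members = [x for x, y in zip(xs, ys) if y == k]
        # pi_k: fraction of labeled instances in class k.
        pi.append(len(members) / l)
        # mu_k: mean of the instances in class k.
        m = sum(members) / len(members)
        mu.append(m)
        # var_k: mean squared deviation from mu_k (divides by the count, not count - 1).
        var.append(sum((x - m) ** 2 for x in members) / len(members))
    return pi, mu, var
```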
Finding the Parameters: Labeled vs Unlabeled Data
With labeled data:
$$\log P(D \mid \theta) = \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta) = \log \prod_{i=1}^{l} p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)$$
With labeled and unlabeled data:
$$\log P(D \mid \theta) = \log \left( \prod_{i=1}^{l} p(x_i, y_i \mid \theta) \prod_{i=l+1}^{l+u} p(x_i \mid \theta) \right) = \log \left( \prod_{i=1}^{l} p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \prod_{i=l+1}^{l+u} \sum_{k=1}^{K} p(y_i = k \mid \theta)\, p(x_i \mid y_i = k, \theta) \right)$$
With unlabeled data, we are left with a marginal probability in the data likelihood, i.e., the probability of generating x from any of the classes (because we do not know which class the instances belong to).
Challenge: find the MLE of the parameters when there are unobservable or hidden variables.
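To make the difference between the two likelihoods concrete, the sketch below evaluates the log-likelihood for a one-dimensional mixture: labeled instances contribute the joint term, unlabeled instances contribute the marginal (a sum over classes). The one-dimensional setting and the function names are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(labeled, unlabeled, pi, mu, var):
    """log P(D | theta) for labeled pairs (x, y) and unlabeled values x."""
    ll = 0.0
    for x, y in labeled:
        # Labeled term: log of p(y | theta) * p(x | y, theta).
        ll += math.log(pi[y] * normal_pdf(x, mu[y], var[y]))
    for x in unlabeled:
        # Unlabeled term: log of the marginal, summing over all classes k.
        ll += math.log(sum(p * normal_pdf(x, m, v)
                           for p, m, v in zip(pi, mu, var)))
    return ll
```

The sum inside the logarithm is what prevents a simple closed-form solution and motivates an iterative method.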
Expectation Maximization
Iterative method for learning the maximum likelihood estimate of a probabilistic model, when the model contains unobservable variables.
Main idea:
• If we had sufficient statistics for the data (e.g. counts of possible values), we could easily maximize the likelihood.
• With missing data, we “fantasize” how the data should look based on the current parameter setting, i.e., compute expected sufficient statistics.
• Then we choose the parameter setting that maximizes the likelihood, based on these statistics.
Expectation Maximization
• Start with some initial parameter setting.
• Repeat (as long as desired):
– Expectation (E) step: Complete the data by assigning “values” to the missing items.
– Maximization (M) step: Compute the maximum likelihood parameter setting based on the completed data.
A Simple Algorithm: K-means clustering
• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
– Step 1: Pick the desired number of clusters, K.
– Step 2: Assume a parametric distribution for each class (e.g., Normal).
– Step 3: Randomly estimate the parameters of the K distributions.
• Iterate, until convergence:
– Step 4 (expectation step): Assign instances to the most likely classes based on the current parametric distributions.
– Step 5 (maximization step): Estimate the parametric distribution of each class based on the latest assignment.
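A sketch of steps 4 and 5 for one-dimensional data. It assumes every class is a Gaussian with the same variance, in which case the most likely class is simply the one with the nearest mean. Initial centers are passed in explicitly (rather than drawn at random as in step 3) so that the result is reproducible; the function name is hypothetical.

```python
def kmeans_1d(xs, centers, max_iter=100):
    """Hard-assignment clustering of 1-D values; returns (centers, assignment)."""
    centers = list(centers)
    assign = []
    for _ in range(max_iter):
        # Step 4 (expectation step): assign each instance to its closest center.
        assign = [min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
                  for x in xs]
        # Step 5 (maximization step): re-estimate each center as the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, a in zip(xs, assign) if a == k]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # converged: the centers no longer change
        centers = new_centers
    return centers, assign
```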
EM for 2-Class GMM
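• Start with some initial parameter setting: $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi_1, \pi_2$
• Repeat (as long as desired):
– Expectation (E) step (with unlabeled data): for every instance i, compute the responsibilities, e.g., class 1 is responsible for $x_i$ with probability $\gamma_{i,1}$ (“soft” assignment of classes):
$$\gamma_{i,k} \leftarrow P(y_i = k \mid x_i) = \frac{\pi_k\, \mathcal{N}(x_i; \mu_k, \sigma_k^2)}{\sum_{k'=1}^{2} \pi_{k'}\, \mathcal{N}(x_i; \mu_{k'}, \sigma_{k'}^2)}$$
– Maximization (M) step (with labeled and unlabeled data): compute the weighted means, variances and mixing probabilities for each component k:
$$\pi_k \leftarrow \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,k}, \qquad \mu_k \leftarrow \frac{1}{\sum_{i=1}^{N} \gamma_{i,k}} \sum_{i=1}^{N} \gamma_{i,k}\, x_i, \qquad \sigma_k^2 \leftarrow \frac{1}{\sum_{i=1}^{N} \gamma_{i,k}} \sum_{i=1}^{N} \gamma_{i,k}\, (x_i - \mu_k)^2$$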
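The E and M steps above can be sketched as follows for one-dimensional data. Two choices here are assumptions rather than something stated on the slide: labeled instances are given a fixed one-hot responsibility for their known class, and the variance is floored at a small constant to avoid division by zero. The function names are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(labeled, unlabeled, pi, mu, var, n_iter=50):
    """Semi-supervised EM for a 1-D GMM; returns updated (pi, mu, var)."""
    pi, mu, var = list(pi), list(mu), list(var)  # do not modify the caller's lists
    K = len(pi)
    xs = [x for x, _ in labeled] + list(unlabeled)
    N = len(xs)
    for _ in range(n_iter):
        # E-step: labeled points keep a one-hot responsibility (assumption);
        # unlabeled points get soft responsibilities gamma[i][k].
        gamma = [[1.0 if y == k else 0.0 for k in range(K)] for _, y in labeled]
        for x in unlabeled:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
            total = sum(joint)
            gamma.append([j / total for j in joint])
        # M-step: weighted mixing proportions, means and variances.
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            pi[k] = Nk / N
            mu[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / Nk
            v = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, xs)) / Nk
            var[k] = max(v, 1e-6)  # illustrative floor, not part of the slide
    return pi, mu, var
```

With `labeled = []` this reduces to ordinary unsupervised EM for a mixture of Gaussians.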
EM Algorithm: In General
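• Setup:
– Observed data: $D = (X_l, Y_l, X_u)$
– Hidden data: $H = Y_u$
– $p(D \mid \theta) = \sum_H p(D, H \mid \theta)$
• Goal: Find $\theta$ to maximize $p(D \mid \theta)$
• Algorithm:
– Start with some arbitrary $\theta^{(0)}$.
– E-step: Estimate $p(H \mid D, \theta^{(t)})$
– M-step: Find $\theta^{(t+1)} = \arg\max_\theta \sum_H p(H \mid D, \theta^{(t)}) \log p(D, H \mid \theta)$
• Comments: EM iteratively improves $p(D \mid \theta)$. It converges to a local maximum of $p(D \mid \theta)$.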
EM Algorithm: Example (Ipeirotis et al., 2010)
EM Algorithm: Example (Ipeirotis et al., 2010)
Expectation Maximization: Properties
• Likelihood function is guaranteed to improve (or stay the same) with each iteration.
• Iterations can stop when no more improvements are achieved.
• Convergence to a local optimum of the likelihood function.
• Re-starts with different initial parameters are often necessary.
• Time complexity (per iteration) depends on model structure.
• EM is very useful in practice!
Assumptions of GMM
The assumption is that data comes from the mixture, where the number of components, prior p(y) and conditional probability p(x|y) are all correct.
Alternative Method: Cluster then Label
Instead of running EM with the probabilistic generative model using the labeled data:
1. Run the clustering algorithm assuming all data is unlabeled.
2. Train a supervised learner using the labeled instances that fall into each cluster.
3. Use the predictor to label unlabeled instances in that cluster.
Pro:
• Another simple wrapper method
• Do not have to assume a probabilistic mixture model.
Con:
• Can be difficult to analyze; labels within a cluster may disagree.
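A sketch of steps 2 and 3, taking the cluster assignment from step 1 as given. The "supervised learner" inside each cluster is assumed to be a simple majority vote over the labeled instances in that cluster, which is only one possible choice; the function name is hypothetical.

```python
from collections import Counter

def cluster_then_label(cluster_of, labels):
    """Label unlabeled instances by majority vote within their cluster.

    cluster_of : cluster id for every instance (labeled and unlabeled)
    labels     : the known label, or None for an unlabeled instance
    """
    # Step 2: "train" a majority-vote predictor per cluster from its labeled members.
    votes = {}
    for c, y in zip(cluster_of, labels):
        if y is not None:
            votes.setdefault(c, Counter())[y] += 1
    # Step 3: use each cluster's predictor to label its unlabeled instances.
    predicted = []
    for c, y in zip(cluster_of, labels):
        if y is not None:
            predicted.append(y)  # keep the given label
        elif c in votes:
            predicted.append(votes[c].most_common(1)[0][0])
        else:
            predicted.append(None)  # cluster contains no labeled instance
    return predicted
```

The last branch shows one practical difficulty: a cluster without any labeled instance cannot be labeled by this procedure.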
Cluster then Label (Pseudocode)
Cluster then Label: Single vs Complete Linkage
Graph-Based Methods
Graph-Based Methods
• assumption: similar unlabeled data have similar labels.
Example (Jerry Zhu’s Ph.D. thesis): two images of ‘2’ with a large Euclidean distance between them can still be connected by a path in a Euclidean-distance kNN graph.
Graph-Based Methods (Jerry Zhu’s Ph.D. thesis)
• idea: construct similarity graphs to model local neighbourhood relations between data points
- nodes: labeled and unlabeled examples
- edges: similarity weights computed from features
• different types of graphs
- k-nearest neighbour graph
- fully connected graph, weight decays with distance
- εNN graph: two nodes are connected if the distance is <= ε.
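The three graph types can be sketched as follows. Assumptions: distance is Euclidean, the kNN graph is made undirected by keeping an edge if either endpoint selects the other, and the weight that "decays with distance" is a Gaussian kernel with a hypothetical `bandwidth` parameter.

```python
import math

def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as a set of edges (i, j) with i < j."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def eps_graph(points, eps):
    """epsilon-NN graph: connect i and j when their distance is <= eps."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps}

def gaussian_weight(p, q, bandwidth):
    """Edge weight for a fully connected graph; decays with distance."""
    return math.exp(-math.dist(p, q) ** 2 / (2 * bandwidth ** 2))
```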
Graph-Based Methods: Assumptions
key assumption: labels are “smooth” with respect to the graph, such that they vary slowly on the graph. That is, if two instances are connected by a strong edge, their labels tend to be the same.
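One simple way to obtain labels that vary slowly on the graph is to repeatedly replace each unlabeled node's score by the weighted average of its neighbours' scores, while keeping labeled nodes fixed. This is a sketch of that averaging idea for two classes (0 and 1); it is offered as an illustration of the smoothness assumption, not as the specific method of the lecture, and the function name is hypothetical.

```python
def propagate_labels(weights, labels, n_iter=100):
    """Spread 0/1 labels over a weighted undirected graph.

    weights : dict {(i, j): w} of edges
    labels  : list with 0 or 1 for labeled nodes, None for unlabeled nodes
    Returns a score in [0, 1] per node; scores above 0.5 lean towards class 1.
    """
    n = len(labels)
    nbrs = {i: [] for i in range(n)}
    for (i, j), w in weights.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    # Unlabeled nodes start at 0.5; labeled nodes are clamped to their label.
    f = [0.5 if y is None else float(y) for y in labels]
    for _ in range(n_iter):
        for i in range(n):
            if labels[i] is None and nbrs[i]:
                # Weighted average of the neighbours' current scores.
                f[i] = (sum(w * f[j] for j, w in nbrs[i])
                        / sum(w for _, w in nbrs[i]))
    return f
```

On a chain of four nodes with the two ends labeled 0 and 1, the interior scores settle at 1/3 and 2/3: strongly connected nodes end up with similar values.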
Semi-Supervised SVM
• Standard SVM: maximize labeled margin.
• Semi-supervised SVMs: maximize unlabeled margin
- enumerate all $2^u$ possible labelings of $X_u$
- build one standard SVM for each labeling
- pick the SVM with the largest margin.
Semi-Supervised SVM
[Figure: objective functions. SVM: hinge loss. S3VM: hinge loss plus hat loss.]
Semi-Supervised SVM
Hinge loss: prefers y f(x) >= 1, i.e., instances that are on the correct side.
Hat loss: prefers f(x) <= -1 or f(x) >= 1, i.e., that unlabeled instances are away from the decision boundary f(x) = 0.
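The two losses in their standard form, written out so the preferences above can be checked numerically. The hinge loss needs the label y in {-1, +1}; the hat loss does not use a label at all, which is why it can be applied to unlabeled instances.

```python
def hinge_loss(y, fx):
    """max(1 - y * f(x), 0): zero once y * f(x) >= 1."""
    return max(1 - y * fx, 0.0)

def hat_loss(fx):
    """max(1 - |f(x)|, 0): zero once f(x) <= -1 or f(x) >= 1."""
    return max(1 - abs(fx), 0.0)
```

The hat loss peaks at f(x) = 0, so an unlabeled point sitting on the decision boundary is penalized the most.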
Semi-Supervised SVM
The third term prefers unlabeled points outside the margin.
In other words, the decision boundary f=0 should be placed so that there are few unlabeled data points near it.
Semi-Supervised SVM
Avoiding dense unlabeled regions may not be the right assumption!
Assumption: the classes are well-separated, such that the decision boundary falls into a low density region in the feature space, and does not cut through dense unlabeled data.
Can Unlabeled Data Help?
Caveats: Model assumptions can make up for a lack of labeled data, but they can also hurt the quality of the predictor when they are wrong.
Active vs Semi-supervised Learning
Both try to attack the same problem: make the most of unlabeled data.
Slides based on Jerry Zhu’s book and tutorials
https://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
http://pages.cs.wisc.edu/~jerryzhu/pub/AERFAI06ssl.pdf
What you should know
• Problem definition for semi-supervised learning.
• Self-training method, pros/cons
• Basic idea of co-training
• Gaussian mixture models as a generative method
• What the “E” and “M” steps of the EM algorithm mean
• General idea of graph-based semi-supervised learning
• General idea of semi-supervised SVM