CS480 Introduction to Machine Learning Semi-Supervised Learning

Edith Law

Basic Idea

• Labeled data can be expensive or difficult to collect.
  – Human annotation is slow and boring!
  – Labels can require experts or special devices to acquire.
  – We would prefer to get better performance for free: unlabeled data!

• Goal of semi-supervised learning is to exploit both labeled and unlabeled examples.

Example of Hard-to-Get Labels

• Task: speech analysis
• Switchboard dataset
• telephone conversation transcription
• 400 hours of annotation time for each hour of speech

Example of Hard-to-Get Labels

• Task: natural language parsing
• Penn Chinese Treebank
• 2 years for 4000 sentences

“The National Track and Field Championship has finished.”

Example of Not-So-Hard-to-Get Labels

For some tasks, it may not be too difficult to label 1000+ instances.

Task: image categorization of “eclipse”

There are ways, like the ESP game, to encourage “human computation” for more labels.

Nonetheless, we may be able to learn better/faster if we make use of both labeled and unlabeled examples.

Can Unlabeled Data Help?

If we assume that each class is a coherent group (e.g., Gaussian), then unlabeled data can help identify the boundary more accurately.

Semi-supervised learning is possible because we can make assumptions about the relationship between the distribution of unlabeled data P(x) and the target label.

Notation

Given:

• Labeled data: (Xl, Yl) = {x1:l, y1:l}, available during training

• Unlabeled data: Xu = {xl+1:n}, available during training

• Test data: Xtest = {xn+1:N}, NOT available during training

Usually l ≪ n, i.e., there is much more unlabeled data than labeled data.

Notation

• supervised learning (classification/regression): {(x1:n, y1:n)}
• (inductive) semi-supervised learning (classification/regression): {(x1:l, y1:l), xl+1:n, xtest}
• (transductive) semi-supervised learning (classification/regression): {(x1:l, y1:l), xl+1:n}
• semi-supervised clustering: {x1:n, must-links, cannot-links}
• unsupervised learning (e.g., clustering): {x1:n}

Overview

• Self-Training
• Co-Training
• Gaussian Mixture Model (GMM)
• Other: Graph-based methods, Semi-Supervised SVM

Self-Training Algorithm

1. Pick your favourite classification method.
2. Train a classifier f using labeled data (Xl, Yl).
3. Use f to classify all unlabeled items x in Xu.
4. Pick x* with the highest confidence.
5. Add (x*, f(x*)) to the labeled data.
6. Repeat.

Self-Training Algorithm (Pseudocode)
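A minimal sketch of the self-training loop, assuming a 1-nearest-neighbour base classifier and taking "confidence" to be closeness to the nearest labeled point. Both choices, the function names, and the toy data are assumptions made for illustration; any classifier that yields a confidence score could be plugged in instead.

```python
import math

def nearest_labeled(x, labeled):
    """Return (distance, label) of the labeled point closest to x."""
    return min((math.dist(x, xl), yl) for xl, yl in labeled)

def self_train_1nn(labeled, unlabeled):
    """Self-training with a 1NN base classifier.

    Assumption for this sketch: the most confident prediction is the one
    for the unlabeled point closest to any already-labeled point.
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    while unlabeled:
        # Classify every unlabeled item and pick the most confident one.
        best = min(unlabeled, key=lambda x: nearest_labeled(x, labeled)[0])
        _, y_hat = nearest_labeled(best, labeled)
        # Move it, with its predicted label, into the labeled set.
        labeled.append((best, y_hat))
        unlabeled.remove(best)
    return labeled

# Hypothetical toy data: two well-separated groups on a line.
seed = [((0.0, 0.0), "a"), ((10.0, 0.0), "b")]
pool = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (9.0, 0.0), (8.0, 0.0), (7.0, 0.0)]
result = dict(self_train_1nn(seed, pool))
print(result)
```

In this well-separated toy case the labels spread outward from each seed point; an outlier lying between the two groups could instead pull labels across the gap, which is the failure mode of self-training.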

Self-Training Algorithm

• Assumption: One’s own high confidence predictions are correct.
  – This is likely to be the case when the classes are from well-separated clusters.

• Variations:
  – Add a few most confident (x, f(x)) to labeled data.
  – Add all (x, f(x)) to labeled data.
  – Add all (x, f(x)) to labeled data, weigh each by confidence.

Self-Training Algorithm: A Concrete Example

Self Training with 1NN: Well Separated Case

Self Training with 1NN: Outlier Case

Self-Training Example: Image Categorization

• Train a Naive Bayes classifier on two initial images.
  - Each image is divided into small patches: a 10 x 10 grid, with a random size of 10-20.
  - Represent each patch: normalize the patches, define a dictionary of 200 “visual words”, and represent a patch by the index of its closest visual word.

• Classify unlabeled data, sort by confidence log Pr(y=astronomy | x).

• Add the most confident images and predicted labels to the labeled data.

• Retrain the classifier and repeat.

Self-Training Algorithm: Advantages

• The simplest semi-supervised learning method.

• A wrapper method, i.e., the technique can be applied to any existing learning algorithm.

• Often used in real tasks like natural language processing.

Self-Training Algorithm: Disadvantages

• Early mistakes could reinforce themselves.
  - Heuristic solutions, e.g. “un-label” an instance if its confidence falls below a threshold.

• Cannot say too much in terms of convergence, but …
  - There are special cases when self-training is equivalent to the Expectation-Maximization (EM) algorithm.
  - There are also special cases (e.g. linear functions) when the closed-form solution is known.


Leveraging Multiple Views

Suppose we are classifying a collection of university webpages into two categories: Student vs Faculty.

The webpage can be classified by
• view 1: the text on the webpage (e.g., words such as “professor”, “grant”, “graduated”)
• view 2: the words in all the hyperlinks that point to that webpage (e.g., “my advisor”, “my Ph.D. advisor”, “my mentor”)

Leveraging Multiple Views

In named entity classification, we classify a named entity (i.e., a proper name such as “Washington State” or “Mr. Washington”) by a class label (e.g., Person or Location).

Each instance is represented by two views:
• the set of words that make up the named entity itself
• the words in the context of the named entity

Co-Training

but there are so many ways to express Person or Location!

So, how can we learn using no labeled data?

Co-Training

Inst 1: “headquartered in” seems to indicate y=Location.
Inst 3: “Kazakhstan” must be a Location since it appears with the context “headquartered in”.
Inst 4: “flew to” must be about Location because it is in the context of “Kazakhstan”.

“Location”

Even though we have never seen the word “China”!

Co-Training

Inst 2: “Mr.” seems to indicate y=Person.
Inst 5: “partner at” must be about a Person because it appears in the context of “Mr.”

“Person”

Even though we have never seen the words “Robert Jordan”!

Co-Training

• Appropriate in settings where the features are redundant.
• The idea is to train two classifiers on the disjoint features, and make the two classifiers take turns teaching each other.
• Co-training is different from self-training in that we use two classifiers in turn, operating on different views of the instances.

Co-Training Algorithm

1. Train two classifiers: f(1) from (Xl(1), Yl), f(2) from (Xl(2), Yl).
2. Classify Xu with f(1) and f(2) separately.
3. Add f(1)’s k-most-confident (x, f(1)(x)) to f(2)’s labeled data.
4. Add f(2)’s k-most-confident (x, f(2)(x)) to f(1)’s labeled data.
5. Repeat.

Co-Training (Pseudocode) (Blum and Mitchell, 1998)
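A minimal sketch of the co-training loop in the spirit of the named-entity example. It is not Blum and Mitchell's exact procedure: each "classifier" here is just a table of label counts per feature value in its own view, confidence is the vote fraction, and the instances and function names are hypothetical toy choices.

```python
from collections import Counter, defaultdict

def train(examples, view):
    """Count labels per feature value in one view (stand-in for any classifier)."""
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[x[view]][y] += 1
    return counts

def predict(model, x, view):
    """Return (label, confidence); confidence is 0.0 for unseen feature values."""
    votes = model.get(x[view])
    if not votes:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / sum(votes.values())

def co_train(labeled, unlabeled, k=1):
    train_sets = [list(labeled), list(labeled)]  # labeled data of f(1) and f(2)
    pool = list(unlabeled)
    while pool:
        models = [train(train_sets[v], v) for v in (0, 1)]
        added = False
        for v in (0, 1):
            # Rank the remaining pool by this view's confidence.
            scored = sorted(((predict(models[v], x, v), x) for x in pool),
                            key=lambda t: t[0][1], reverse=True)
            for (label, conf), x in scored[:k]:
                if conf > 0.0 and x in pool:
                    # This view's classifier teaches the other classifier.
                    train_sets[1 - v].append((x, label))
                    pool.remove(x)
                    added = True
        if not added:
            break  # neither classifier can label anything else
    return dict(train_sets[0] + train_sets[1])

# Each instance is (view 1: entity token, view 2: context); toy data only.
seed = [(("Washington State", "headquartered in"), "Location"),
        (("Mr.", "vice president of"), "Person")]
pool = [("Kazakhstan", "headquartered in"), ("Kazakhstan", "flew to"),
        ("China", "flew to"), ("Mr.", "partner at"),
        ("Robert Jordan", "partner at")]
labels = co_train(seed, pool)
print(labels)
```

The entity "China" never co-occurs with a seed feature, yet it ends up labeled because the context "flew to" was first learned through "Kazakhstan".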

Co-Training (Blum and Mitchell, 1998)

• begin with 12 labeled web pages (academic courses)
• provide 1000 additional unlabeled web pages
• average error learning from labeled data: 11.1%
• average error co-training: 5%

Co-Training

Pros:
• wrapper method: it does not matter what algorithms are used for the two classifiers f(1) and f(2)
• requires that classifiers can assign a confidence score to their prediction, which is used to select which unlabeled instances to turn into additional training data for the other view
• less sensitive to mistakes than self-training

Cons:
• a feature split may not exist
• a classifier that uses both feature sets may have better performance

Co-Training: Assumptions

• existence of two views x = [x(1), x(2)]
• assumption #1: each view alone is sufficient to make good classifications, given enough labeled data.
• assumption #2: the two views are conditionally independent given the class label.
  - knowing one view doesn’t affect what we will observe for the other view
  - e.g., context “headquartered in” does not favour any particular location.

Co-Training

Many real-world problems of this type

• Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]

• Web page classification [Blum, Mitchell 98]

• Word sense disambiguation [Yarowsky 95]

• Speech recognition [de Sa, Ballard 98]

• Visual classification of cars [Levin, Viola, Freund 03]


Modeling Data using Distributions

In mixture models, we assume that we have a set of hidden mixture components {1,…,K} which generated the data.

Generative Story of Gaussian Mixture Model

Idea: Fit data with a combination of Gaussian distributions.

Specifically, we assume that the data we observed was generated by repeating the following two steps:

• pick a mixture component: $y \sim p(y) = \mathrm{Multinomial}(\pi)$, where $\pi$ holds the mixing proportions (a vector of probabilities that sum to 1)

• generate x based on the selected mixture component: $x \sim p(x \mid y = k) = \mathcal{N}(x; \mu_k, \sigma_k^2)$

to give the joint distribution:

$$p(x, y) = p(y)\, p(x \mid y)$$

Gaussian Mixture Model

In general, we can compute the probability density function p(x) by summing out y:

$$p(x) = \sum_{y} p(y)\, p(x \mid y) = \sum_{k=1}^{K} p(y = k)\, p(x \mid y = k) = \sum_{k=1}^{K} \pi_k\, p(x \mid y = k)$$

where

• $\pi_k = p(y = k)$ is the probability of the kth mixture component
• $p(x \mid y = k) = \mathcal{N}(x \mid \mu_k, \sigma_k^2)$ is the probability of x in the kth mixture component

Goal: estimate parameters $\pi_k, \mu_k, \sigma_k^2$
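A small sketch of a one-dimensional Gaussian mixture: sampling follows the two-step generative story, and the density sums out the component y. The parameter values and function names are hypothetical, chosen only for illustration.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, pis, mus, variances):
    """p(x) = sum_k pi_k * N(x; mu_k, var_k), i.e. y summed out."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances))

def gmm_sample(pis, mus, variances, rng):
    """Pick a component y ~ Multinomial(pi), then draw x from its Gaussian."""
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], math.sqrt(variances[k])), k

# Hypothetical two-component mixture.
pis, mus, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
rng = random.Random(0)
samples = [gmm_sample(pis, mus, variances, rng) for _ in range(5)]

# Sanity check: a Riemann sum of the density over [-12, 12] is close to 1.
area = sum(gmm_pdf(-12 + 0.01 * i, pis, mus, variances) * 0.01 for i in range(2400))
print(samples, area)
```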

Gaussian Mixture Model

• Think of each mixture component as a class.

• The unlabeled data tells us how instances from all classes, mixed together, are distributed.

• If we know how the instances from each class are distributed, then we can decompose the mixture into individual classes.

Why do we care to learn these parameters $\pi_k, \mu_k, \sigma_k^2$?

GMM: Supervised vs Semi-Supervised

• Given labeled data, assume each class has a Gaussian distribution …


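Model parameters: $\theta = \{\pi_1, \mu_1, \Sigma_1, \pi_2, \mu_2, \Sigma_2\}$

The GMM:

$$p(x, y \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta) = \pi_y\, \mathcal{N}(x; \mu_y, \Sigma_y)$$

Classification:

$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)}$$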


• The most likely model and its decision boundary …

• Add unlabeled data …

• With unlabeled data, the most likely model and decision boundary change.


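Decision boundaries are different because they maximize different quantities: the likelihood of the labeled data alone versus the likelihood of the labeled and unlabeled data together.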

Finding the Parameters

If we have labeled data, then it’s just a matter of finding the maximum likelihood estimate of the parameters.

$$\hat{\theta} = \arg\max_{\theta} \log p(D \mid \theta) = \arg\max_{\theta} \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta)$$

For a K-class GMM, we solve the constrained optimization problem

$$\arg\max_{\theta} \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta) \quad \text{s.t.} \quad \sum_{k=1}^{K} p(y = k \mid \theta) = 1$$

by setting up the Lagrangian, taking the partial derivatives of the Lagrangian with respect to the parameters, setting them to 0, and solving.
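As a worked illustration of this recipe for the labeled-data case, the derivation below solves for the mixing proportions. The multiplier $\beta$ and the class counts $l_k$ are notation introduced here, not taken from the slides. The mean and covariance estimates follow the same way, and since the constraint does not involve them they reduce to per-class sample statistics.

```latex
\begin{align*}
\Lambda(\theta, \beta)
  &= \sum_{i=1}^{l} \Big[ \log \pi_{y_i} + \log \mathcal{N}(x_i; \mu_{y_i}, \Sigma_{y_i}) \Big]
     - \beta \Big( \sum_{k=1}^{K} \pi_k - 1 \Big) \\
\frac{\partial \Lambda}{\partial \pi_k}
  &= \frac{l_k}{\pi_k} - \beta = 0
  \;\Rightarrow\; \pi_k = \frac{l_k}{\beta},
  \qquad l_k = \big|\{\, i : y_i = k \,\}\big| \\
\sum_{k=1}^{K} \pi_k = 1
  &\;\Rightarrow\; \beta = l,
  \qquad \hat{\pi}_k = \frac{l_k}{l} \\
\hat{\mu}_k
  &= \frac{1}{l_k} \sum_{i : y_i = k} x_i,
  \qquad
  \hat{\Sigma}_k = \frac{1}{l_k} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top}
\end{align*}
```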

Finding the Parameters: Labeled vs Unlabeled Data

With labeled data:

$$\log p(D \mid \theta) = \log \prod_{i=1}^{l} p(x_i, y_i \mid \theta) = \log \prod_{i=1}^{l} p(y_i \mid \theta)\, p(x_i \mid y_i, \theta)$$

With labeled and unlabeled data:

$$\log p(D \mid \theta) = \log \left[ \prod_{i=1}^{l} p(x_i, y_i \mid \theta) \prod_{i=l+1}^{l+u} p(x_i \mid \theta) \right] = \log \left[ \prod_{i=1}^{l} p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) \prod_{i=l+1}^{l+u} \sum_{k=1}^{K} p(y_i = k \mid \theta)\, p(x_i \mid y_i = k, \theta) \right]$$

With unlabeled data, we are left with a marginal probability in the data likelihood (the sum over k), i.e., the probability of generating x from any of the classes (because we do not know which class the instances belong to).

Challenge: find the MLE of the parameters when there are unobservable or hidden variables.

Expectation Maximization

Iterative method for learning the maximum likelihood estimate of a probabilistic model, when the model contains unobservable variables.

Main idea:

• If we had sufficient statistics for the data (e.g. counts of possible values), we could easily maximize the likelihood.

• With missing data, we “fantasize” how the data should look based on the current parameter setting, i.e., compute expected sufficient statistics.

• Then we choose the parameter setting that maximizes the likelihood, based on these statistics.

Expectation Maximization

• Start with some initial parameter setting.

• Repeat (as long as desired):
  – Expectation (E) step: Complete the data by assigning “values” to the missing items.
  – Maximization (M) step: Compute the maximum likelihood parameter setting based on the completed data.

A Simple Algorithm: K-means clustering

• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g. Normal).
  – Step 3: Randomly estimate the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4 (expectation step): Assign instances to the most likely classes based on the current parametric distributions.
  – Step 5 (maximization step): Estimate the parametric distribution of each class based on the latest assignment.
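The steps above can be sketched as plain K-means with Euclidean distance, which corresponds to assuming equal-variance spherical Gaussians and making hard assignments. The initial centers are passed in rather than drawn at random so that the run is reproducible; that choice, the function name, and the toy points are assumptions of this sketch.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate hard assignments (Step 4) and mean updates (Step 5)."""
    centers = list(centers)
    assign = []
    for _ in range(max_iter):
        # Step 4: assign each instance to its closest center.
        assign = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Step 5: re-estimate each center as the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center as is
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, assign

# Hypothetical toy data: two groups, with both initial centers in one group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assign = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
print(centers, assign)
```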

EM for 2-Class GMM

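• Start with some initial parameter setting: $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \pi_1, \pi_2$

• Repeat (as long as desired):

– Expectation (E) step (with unlabeled data): for every instance i, compute the responsibilities, e.g., class 1 is responsible for $x_i$ with probability $\gamma_{i,1}$ (a “soft” assignment of classes):

$$\gamma_{i,k} \leftarrow P(y_i = k \mid x_i) = \frac{\pi_k\, \mathcal{N}(x_i; \mu_k, \sigma_k^2)}{\sum_{k'=1}^{2} \pi_{k'}\, \mathcal{N}(x_i; \mu_{k'}, \sigma_{k'}^2)}$$

– Maximization (M) step (with labeled and unlabeled data): compute the weighted means, variances and mixing probabilities for each component k:

$$\pi_k \leftarrow \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,k}, \qquad \mu_k \leftarrow \frac{1}{\sum_{i=1}^{N} \gamma_{i,k}} \sum_{i=1}^{N} \gamma_{i,k}\, x_i, \qquad \sigma_k^2 \leftarrow \frac{1}{\sum_{i=1}^{N} \gamma_{i,k}} \sum_{i=1}^{N} \gamma_{i,k}\, (x_i - \mu_k)^2$$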
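A minimal sketch of these updates for one-dimensional data. Several details are assumptions of the sketch rather than part of the slides: labeled points keep fixed one-hot responsibilities, the initial parameters are spread evenly over the data range, variances are floored at a small constant to avoid collapse, and the data are toy values.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(x_unlabeled, x_labeled=(), y_labeled=(), K=2, n_iter=50):
    xs = list(x_labeled) + list(x_unlabeled)
    n = len(xs)
    # Labeled points: responsibility 1 for their known class (assumption).
    fixed = [[1.0 if y == k else 0.0 for k in range(K)] for y in y_labeled]
    # Arbitrary initial parameter setting for this sketch.
    lo, hi = min(xs), max(xs)
    mus = [lo + (hi - lo) * (k + 0.5) / K for k in range(K)]
    variances = [((hi - lo) / K) ** 2 or 1.0] * K
    pis = [1.0 / K] * K
    for _ in range(n_iter):
        # E-step: soft class assignments for the unlabeled points.
        gammas = list(fixed)
        for x in x_unlabeled:
            joint = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(joint)
            gammas.append([j / total for j in joint])
        # M-step: weighted mixing proportions, means and variances.
        for k in range(K):
            nk = sum(g[k] for g in gammas)
            pis[k] = nk / n
            mus[k] = sum(g[k] * x for g, x in zip(gammas, xs)) / nk
            var = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gammas, xs)) / nk
            variances[k] = max(var, 1e-6)  # variance floor (assumption)
    return pis, mus, variances

# Hypothetical data: one labeled point per class plus ten unlabeled points.
unlabeled = [-0.2, 0.1, 0.3, -0.1, 4.8, 5.1, 5.3, 4.9, 5.0, 5.2]
pis, mus, variances = em_gmm(unlabeled, x_labeled=[0.0, 5.0], y_labeled=[0, 1])
print(pis, mus, variances)
```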

EM Algorithm: In General

• Setup:
  – Observed data: D = (Xl, Yl, Xu)
  – Hidden data: H = Yu
  – p(D|θ) = ∑H p(D, H|θ)

• Goal: Find θ to maximize p(D|θ)

• Algorithm:
  – Start with some arbitrary θ(0).
  – E-step: Estimate p(H|D, θ(t)) under the current parameters θ(t).
  – M-step: Find θ(t+1) = argmaxθ ∑H p(H|D, θ(t)) log p(D, H|θ), i.e., maximize the expected complete-data log-likelihood.

• Comments: EM iteratively improves p(D|θ). It converges to a local maximum of the likelihood.

EM Algorithm: Example (Ipeirotis et al., 2010)

Expectation Maximization: Properties

• Likelihood function is guaranteed to improve (or stay the same) with each iteration.

• Iterations can stop when no more improvements are achieved.

• Convergence to a local optimum of the likelihood function.

• Re-starts with different initial parameters are often necessary.

• Time complexity (per iteration) depends on model structure.

• EM is very useful in practice!

Assumptions of GMM

The assumption is that data comes from the mixture, where the number of components, prior p(y) and conditional probability p(x|y) are all correct.

Alternative Method: Cluster then Label

Instead of running EM with the probabilistic generative model using the labeled data:

1. Run the clustering algorithm assuming all data is unlabeled.

2. Train a supervised learner using the labeled instances that fall into each cluster.

3. Use the predictor to label unlabeled instances in that cluster.

Pro:
• Another simple wrapper method.
• Do not have to assume a probabilistic mixture model.

Con:
• Can be difficult to analyze; labels within a cluster may disagree.

Cluster then Label (Pseudocode)
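A minimal sketch of cluster-then-label, assuming the clustering has already been run and that the per-cluster supervised learner is a simple majority vote. The fallback to the overall majority label for clusters without any labeled member is an assumption of this sketch, as are the names and the toy input.

```python
from collections import Counter

def cluster_then_label(cluster_ids, labels):
    """cluster_ids[i] is the cluster of instance i; labels[i] is its label or None."""
    # "Train" one majority-vote predictor per cluster from its labeled members.
    votes = {}
    for c, y in zip(cluster_ids, labels):
        if y is not None:
            votes.setdefault(c, Counter())[y] += 1
    overall = Counter(y for y in labels if y is not None).most_common(1)[0][0]
    predicted = []
    for c, y in zip(cluster_ids, labels):
        if y is not None:
            predicted.append(y)  # keep the given label
        elif c in votes:
            predicted.append(votes[c].most_common(1)[0][0])
        else:
            predicted.append(overall)  # cluster without labeled members
    return predicted

# Hypothetical clustering of six instances, three of them labeled.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["a", None, "a", "b", None, None]
predicted = cluster_then_label(cluster_ids, labels)
print(predicted)
```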

Cluster then Label: Single vs Complete Linkage

Graph-Based Methods

• assumption: similar unlabeled data have similar labels.

(Jerry Zhu’s Ph.D. thesis)

two images of ‘2’ with large Euclidean distances

a path in a Euclidean distance kNN graph between them

Graph-Based Methods (Jerry Zhu’s Ph.D. thesis)

• idea: construct similarity graphs to model local neighbourhood relations between data points
  - nodes: labeled and unlabeled examples
  - edges: similarity weights computed from features

• different types of graphs
  - k-nearest neighbour graph
  - fully connected graph, weight decay with distance
  - eNN graph: two nodes are connected if the distance is <= e.
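The graph types above can be sketched as follows. The symmetrization rule for the kNN graph (an edge exists if either node is among the other's k nearest) and the Gaussian form of the distance-decaying weight are common choices assumed here, not something the slides specify.

```python
import math

def knn_graph(points, k):
    """Undirected kNN graph as a set of (i, j) index pairs with i < j."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def eps_graph(points, eps):
    """eNN graph: connect two nodes if their distance is <= eps."""
    return {(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) <= eps}

def gaussian_weight(p, q, sigma=1.0):
    """Edge weight for a fully connected graph, decaying with distance."""
    return math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2))

# Hypothetical points on a line, with one far-away point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
knn_edges = knn_graph(points, k=1)
eps_edges = eps_graph(points, eps=1.0)
print(knn_edges, eps_edges)
```

The far-away point is isolated in the eNN graph but still attached in the kNN graph, which is one practical difference between the two constructions.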

Graph-Based Methods: Assumptions

key assumption: labels are “smooth” with respect to the graph, such that they vary slowly on the graph. That is, if two instances are connected by a strong edge, their labels tend to be the same.

Semi-Supervised SVM

• Standard SVM: maximize the labeled margin.

• Semi-supervised SVMs: maximize the unlabeled margin
  - Enumerate all 2^u possible labelings of Xu
  - Build one standard SVM for each labeling
  - Pick the SVM with the largest margin.

Semi-Supervised SVM

SVM: hinge loss (on labeled data)

S3VM: hinge loss (on labeled data) and hat loss (on unlabeled data)

Semi-Supervised SVM

Hinge loss: prefers y f(x) >= 1, i.e., labeled instances that are on the correct side of the margin.

Hat loss: prefers f(x) <= -1 or f(x) >= 1, i.e., that unlabeled instances are away from the decision boundary f(x) = 0.

Semi-Supervised SVM

The third term prefers unlabeled points outside the margin.

In other words, the decision boundary f = 0 wants to be placed so that there are few unlabeled points near it.
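A small sketch of the two losses and of an objective with three terms, for a linear f(x) = w·x + b. The exact objective is not reproduced in this text; the form used here (hinge loss on labeled data, a squared-norm regularizer, and hat loss on unlabeled data as the third term) is a commonly used one and should be read as an assumption, as should the names `lam1` and `lam2` and the toy inputs.

```python
def hinge_loss(y, fx):
    """Zero once a labeled point satisfies y * f(x) >= 1."""
    return max(0.0, 1.0 - y * fx)

def hat_loss(fx):
    """Zero once an unlabeled point satisfies |f(x)| >= 1; peaks at f(x) = 0."""
    return max(0.0, 1.0 - abs(fx))

def s3vm_objective(w, b, labeled, unlabeled, lam1, lam2):
    """Assumed form: hinge on labeled + lam1 * ||w||^2 + lam2 * hat on unlabeled."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    labeled_term = sum(hinge_loss(y, f(x)) for x, y in labeled)
    regularizer = lam1 * sum(wi * wi for wi in w)
    unlabeled_term = lam2 * sum(hat_loss(f(x)) for x in unlabeled)
    return labeled_term + regularizer + unlabeled_term

# Hypothetical 1-D example: one unlabeled point sits on the boundary f(x) = 0.
labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(0.0,), (3.0,)]
value = s3vm_objective((1.0,), 0.0, labeled, unlabeled, lam1=0.5, lam2=2.0)
print(value)
```

Only the unlabeled point at the boundary is penalized by the third term, so shifting the boundary away from it would lower the objective.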

Semi-Supervised SVM

Avoiding dense unlabeled regions may not be the right assumption!

Assumption: the classes are well-separated, such that the decision boundary falls into a low density region in the feature space, and does not cut through dense unlabeled data.


Can Unlabeled Data Help?

Caveat: Model assumptions can make up for a lack of labeled data, but incorrect assumptions can also hurt the quality of the predictor.

Active vs Semi-supervised Learning

Both try to attack the same problem: make the most of unlabeled data.

Slides based on Jerry Zhu’s book and tutorials

https://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
http://pages.cs.wisc.edu/~jerryzhu/pub/AERFAI06ssl.pdf

What you should know

• Problem definition for semi-supervised learning.
• Self-training method, pros/cons
• Basic idea of co-training
• Gaussian mixture models as a generative method
• What the “E” and “M” steps of the EM algorithm mean
• General idea of graph-based semi-supervised learning
• General idea of semi-supervised SVM
