Paired Sampling in Density-Sensitive Active Learning

Pinar Donmez joint work with Jaime G. Carbonell

Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Outline

Problem setting
Motivation
Our approach
Experiments
Conclusion

Setting

X: feature space; label set Y = {-1, +1}
Data D ~ X × Y
D = T ∪ U

T: training set (labeled)
U: unlabeled set
T is small initially, U is large

Active learning: choose the most informative samples to label
Goal: high performance with the fewest labeling requests

Motivation

Optimize the decision boundary placement
Sampling disproportionately on one side may not be optimal
Maximize the likelihood of straddling the boundary with paired samples

Three factors affect sampling:
Local density
Conditional entropy maximization
Utility score

Illustrative Example

Left figure (paired sampling): significant shift in the current hypothesis; large reduction in version space

Right figure (single point sampling): small shift in the current hypothesis; small reduction in version space

Density-Sensitive Distance

Cluster hypothesis: the decision boundary should NOT cut clusters
squeeze distances in high-density regions
increase distances in low-density regions

Solution: density-sensitive distance
find the weakest link along each path in a graph G
a better way to avoid outliers (i.e. a very short edge in a long path)

Chapelle & Zien (2005)
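The "weakest link" idea can be sketched as a minimax path distance: for every pair of points, take the path in G whose longest edge is as short as possible, and use that longest edge as the distance. The sketch below assumes a fully connected graph with Euclidean edge lengths and shows only the plain (unsoftened) variant; the softened formulation of Chapelle & Zien (2005) is not reproduced here. The function name is hypothetical.

```python
import math

def minimax_path_distances(points):
    """Return a matrix d where d[i][j] is the smallest possible
    'longest edge' over all paths from point i to point j."""
    n = len(points)
    # Start from plain Euclidean edge lengths on a fully connected graph (assumption).
    d = [[math.dist(p, q) for q in points] for p in points]
    # Floyd-Warshall with (min, max) in place of (min, +).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(d[i][k], d[k][j])  # weakest link of the path through k
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

# Two 1-D clusters separated by a gap between 2 and 10.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
d = minimax_path_distances(points)
```

Within a cluster the distance shrinks to the largest hop between neighbors, while any pair spanning the two clusters is at least as far apart as the gap, which is the squeezing and stretching effect described above.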

Density-Sensitive Distance

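Apply MDS (multi-dimensional scaling) to the density-sensitive distances to obtain a Euclidean embedding

Find the eigenvalues and eigenvectors of the doubly centered squared-distance matrix
Pick the first p eigenvectors s.t. [condition not preserved]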
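For reference, textbook classical MDS can be written as below. The exact matrix and the criterion for choosing p shown on the slide are not preserved in this text, so the notation (H, B, D^(2)) is an assumption and the choice of p is left open.

```latex
\begin{aligned}
H &= I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}
  && \text{(centering matrix)} \\
B &= -\tfrac{1}{2}\, H D^{(2)} H, \qquad D^{(2)}_{ij} = d_{ij}^{2}
  && \text{(doubly centered squared distances)} \\
B &= V \Lambda V^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n
  && \text{(eigendecomposition)} \\
\hat{X} &= V_p\, \Lambda_p^{1/2}
  && \text{(embedding from the first } p \text{ eigenpairs, } \lambda_p > 0\text{)}
\end{aligned}
```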

Active Sampling Procedure

Given a training set T in MDS space:

1. Train a logistic regression classifier on T

2. For all pairs of unlabeled points in U, compute the pairwise score S

3. Choose the pair with the maximum score

4. Repeat 1-3
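The loop above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the tiny gradient-descent logistic regression, the `oracle` callback that supplies labels, and all function names are assumptions, and `placeholder_score` is only a stand-in for the scoring function S, whose exact form is not given in this text.

```python
import math
from itertools import combinations

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; labels are -1/+1, bias is w[-1]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            g = -yi * sigmoid(-yi * z)  # d/dz of log(1 + exp(-y z))
            for j, xj in enumerate(xi):
                grad[j] += g * xj
            grad[-1] += g
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def prob_pos(w, x):
    """P(y = +1 | x) under the fitted model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def paired_active_learning(X_lab, y_lab, pool, oracle, score, rounds):
    """Retrain, score every unlabeled pair, query the best pair, repeat."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(pool)
    for _ in range(rounds):
        if len(pool) < 2:
            break
        w = train_logreg(X_lab, y_lab)  # step 1
        i, j = max(combinations(range(len(pool)), 2),
                   key=lambda ij: score(w, pool[ij[0]], pool[ij[1]]))  # steps 2-3
        for idx in (j, i):  # j > i, so pop the larger index first
            x = pool.pop(idx)
            X_lab.append(x)
            y_lab.append(oracle(x))  # labeling request
    return train_logreg(X_lab, y_lab), X_lab, y_lab

def placeholder_score(w, xi, xj):
    """Stand-in for S (assumption): squared distance times summed predictive entropy."""
    def entropy(p):
        return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)
    return math.dist(xi, xj) ** 2 * (entropy(prob_pos(w, xi)) + entropy(prob_pos(w, xj)))

# Toy 1-D run: one labeled point per class, six unlabeled points, two rounds.
oracle = lambda x: 1 if x[0] > 0 else -1
w, X_lab, y_lab = paired_active_learning(
    [(-2.0,), (2.0,)], [-1, 1],
    [(-3.0,), (-1.0,), (-0.5,), (0.5,), (1.0,), (3.0,)],
    oracle, placeholder_score, rounds=2)
```

Passing the score as a callable keeps the loop independent of how S is defined, so a different scoring function can be swapped in without touching the procedure.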

Details of the Scoring Function S

Two components of S:
1. Likelihood of a pair having opposite labels (straddling the decision boundary)
2. Utility of the pair

By the cluster assumption, the decision boundary should not cut clusters => points in different clusters are likely to have different labels

In the transformed space, points in different clusters have low similarity (large distance)

Thus, we can estimate the likelihood of opposite labels from the pairwise distance in the transformed space

An Analysis Justifying our Claim

Pairwise distances are divided into bins
Pairs are assigned to bins according to their distances
For each bin, the relative frequency of pairs with opposite class labels is computed
This graph (empirically) shows that the likelihood of two points having opposite labels increases monotonically with the pairwise distance between them.

* This graph is plotted on the g50c dataset.
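The binning analysis can be sketched as below. Equal-width bins over the observed distance range are an assumption (the slide does not say how bins were chosen), the function name is hypothetical, and the toy data is illustrative rather than g50c.

```python
import math
from itertools import combinations

def opposite_label_rate_by_distance(points, labels, n_bins):
    """For each distance bin, return the fraction of pairs whose labels differ
    (None for an empty bin). Assumes at least two distinct points."""
    pairs = [(math.dist(points[i], points[j]), labels[i] != labels[j])
             for i, j in combinations(range(len(points)), 2)]
    d_max = max(d for d, _ in pairs)
    total = [0] * n_bins
    opposite = [0] * n_bins
    for d, differs in pairs:
        b = min(int(n_bins * d / d_max), n_bins - 1)  # equal-width bins (assumption)
        total[b] += 1
        opposite[b] += differs  # True counts as 1
    return [opposite[b] / total[b] if total[b] else None for b in range(n_bins)]

# Toy example: two well-separated 1-D clusters with opposite labels.
rates = opposite_label_rate_by_distance(
    [(0.0,), (1.0,), (10.0,), (11.0,)], [-1, -1, 1, 1], n_bins=2)
# rates == [0.0, 1.0]: short-distance pairs share a label, long-distance pairs do not.
```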

Utility Function

Two components:
Local density: depends on the number of close neighbors and their proximity
Conditional entropy

For binary problems
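The slide's formula is not preserved in this text. As a reference point, the standard Shannon conditional entropy for two classes is H(Y|x) = -Σ_y P(y|x) log P(y|x) with y in {-1, +1}; the sketch below computes it from P(y = +1 | x). The function name is hypothetical, and whether the slide used exactly this form is an assumption.

```python
import math

def binary_conditional_entropy(p_pos):
    """H(Y|x) in nats for P(y=+1|x) = p_pos; 0 * log(0) is taken as 0."""
    h = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0.0:
            h -= p * math.log(p)
    return h
```

The value peaks at p_pos = 0.5 (maximum uncertainty, log 2 nats) and falls to zero as the classifier becomes certain in either direction.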

Uncertainty-Weighted Density

captures the density of a given point and the information content of its neighbors

novelty: each neighbor’s contribution is weighted by its uncertainty
reduces the effect of highly certain neighbors
dense points with highly uncertain neighbors become important
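One way to picture this quantity is a sum over neighbors in which each neighbor contributes its proximity times its predictive uncertainty. The sketch below is an assumption-laden illustration, not the formula from the slide: the Gaussian-style proximity weight exp(-distance²), the use of entropy as the uncertainty measure, and the function names are all hypothetical choices.

```python
import math

def entropy(p_pos):
    """Binary entropy in nats of P(y=+1|x) = p_pos."""
    return -sum(q * math.log(q) for q in (p_pos, 1.0 - p_pos) if q > 0.0)

def uncertainty_weighted_density(x, neighbors, prob_pos):
    """Sum over neighbors of proximity(x, x_k) * uncertainty(x_k).

    prob_pos is a callable returning P(y=+1 | x_k) from the current classifier.
    """
    total = 0.0
    for xk in neighbors:
        proximity = math.exp(-math.dist(x, xk) ** 2)  # assumed Gaussian-style weight
        total += proximity * entropy(prob_pos(xk))    # certain neighbors add little
    return total

# A point with close, uncertain neighbors scores higher than one whose
# close neighbors are already confidently classified.
near = [(0.1, 0.0), (0.0, 0.1)]
high = uncertainty_weighted_density((0.0, 0.0), near, lambda xk: 0.5)
low = uncertainty_weighted_density((0.0, 0.0), near, lambda xk: 0.999)
```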

Utility Function

utility of a pair is

regularize information content (entropy) of the pair
proximity-weighted information content of neighbors

Experimental Data

pair with maximum score selected

Six binary datasets

Experiment Setting

For each data set:
start with 2 labeled data points (1 +, 1 -)
run each method for 20 iterations
results averaged over 10 runs

Baselines:
Uncertainty Sampling
Density-only Sampling
Representative Sampling (Xu et al., 2003)
Random Sampling

Results

Conclusion

Our contributions:
combine uncertainty, density, and dissimilarity across the decision boundary
proximity-weighted conditional entropy selection is effective for active learning

Results show our method significantly outperforms baselines in
error reduction
fewer labeling requests than others to achieve the same performance

Thank You!
