Active Learning and Experimental Design Anwar Sadat Delgado Monica


TRANSCRIPT

Page 1: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning and Experimental Design
Anwar Sadat, Delgado Monica

Page 2: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Outline

1. Introduction
2. Scenarios (Approaches to Querying)
3. Query Strategy Frameworks
   a. Uncertainty Sampling
   b. Query-by-Committee
   c. Expected Model Change
   d. Expected Error Reduction
   e. Variance Reduction
   f. Density-Weighted Methods
4. Problem Setting Variants
5. Practical Considerations
6. Conclusion

Page 3: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Definition

Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. In the statistics literature it is sometimes also called optimal experimental design.

Settles, Burr (2010); Olsson, Fredrik (2009)

Page 4: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Introduction: Learning Paradigms

Figure: supervised learning, unsupervised learning, and active learning.

Page 5: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Passive Learning vs. Active Learning

General Schema of a passive learner

• For all supervised and unsupervised learning tasks, the passive learner needs to gather a significant amount of data randomly sampled from the underlying population distribution and then induce a classifier or model
• Cons:
  • Gathering labeled data is often too expensive (time, money)

Page 6: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Passive Learning vs. Active Learning

• Difference: the ability to ask queries about the world based upon the past queries and responses

• The notion of what exactly a query is and what response it receives will depend upon the exact task at hand

• Pro:
  • The entire data set need not be labeled; it is the task of the learner to request the relevant labels

General schema of an active learner

Page 7: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning Heuristic

• Start with a pool of unlabeled data (U)
• Pick a few points at random and get their labels using an oracle (e.g. a human annotator)
• Repeat the following:
  1. Fit a classifier to the labels seen so far
  2. Pick the BEST unlabeled point to get a label for
     • (closest to the boundary?)
     • (most uncertain?)
     • (most likely to decrease overall uncertainty?)
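A minimal sketch of this loop in Python, using least-confident uncertainty as the "BEST point" criterion; the synthetic dataset, classifier choice, and query budget below are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the heuristic above: seed with a few labels, then
# repeatedly fit and query the most uncertain pool instance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_oracle = make_classification(n_samples=500, n_features=10, random_state=0)

# Seed set: one example per class so the first fit is well defined.
labeled = [int(np.flatnonzero(y_oracle == c)[0]) for c in np.unique(y_oracle)]
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                                            # query budget
    model.fit(X[labeled], y_oracle[labeled])                   # 1. fit to labels seen so far
    proba = model.predict_proba(X[unlabeled])
    best = unlabeled[int(np.argmax(1.0 - proba.max(axis=1)))]  # 2. most uncertain point
    labeled.append(best)                                       # oracle reveals y_oracle[best]
    unlabeled.remove(best)
```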


Page 8: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (II)

Start: unlabeled data

Page 9: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (III)

Label a random subset

Page 10: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (IV)

Fit a classifier to labeled data


Page 11: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (V)

Pick the BEST next point to label (e.g. closest to boundary)


Page 12: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (VI)

Fit a classifier to labeled data

Page 13: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (VII)

Pick the BEST next point to label (e.g. closest to boundary)


Page 14: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning (VIII)

Fit a classifier to labeled data

Page 15: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Learning Examples
• Active learning is well motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain
• Applications:
  • Speech recognition
  • Information extraction
  • Classification and filtering

Comparison of Uncertainty Sampling (Active Learning) vs Random sampling for text classification of Baseball vs Hockey

Page 16: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Scenarios (Approaches to Querying)

Three Main Active Learning Scenarios (Settles, 2010)

Page 17: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Membership Query Synthesis
• The learner may request labels for any unlabeled instance in the input space, including queries that the learner generates de novo, rather than those sampled from some underlying natural distribution.
• Pros:
  • Computationally tractable (for finite domains)
  • Can be extended to regression tasks
  • A promising approach for automating experiments that do not require a human annotator
• Cons:
  • A human annotator may have difficulty interpreting and labeling arbitrary instances

Page 18: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Stream-Based Selective Sampling
• Generally used when obtaining an unlabeled instance is free or relatively cheap: the query can first be sampled from the actual distribution, and then the learner can decide whether or not to request its label.
• This approach is called stream-based or sequential active learning, as each unlabeled instance is typically drawn one at a time from the data source, and the learner must decide whether to query or discard it.
• Pros:
  • Better suited when memory or processing power may be limited, as with mobile and embedded devices.
• Cons:
  • The assumption that unlabeled data is available for free might not always hold.

Page 19: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Pool-Based Sampling

Figure: the pool-based active learning cycle (Settles, 2010)

Page 20: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Pool-Based Sampling
• It assumes that there is a small set of labeled data L and a large pool of unlabeled data U available
• The learner is supplied with a set of unlabeled examples from which it can select queries
• Pros:
  • Probably the most widely used sampling scenario (text classification, information extraction, image classification, video classification, speech recognition, cancer diagnosis)
  • Applied to many real-world tasks
• Cons:
  • Computationally intensive
  • Needs to evaluate the entire pool at each iteration

Page 21: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Query Strategy Frameworks

1. Uncertainty Sampling
2. Query-By-Committee
3. Expected Model Change
4. Expected Error Reduction
5. Variance Reduction
6. Density-Weighted Methods

Page 22: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (I)
• An active learner queries the instances about which it is least certain how to label (e.g. closest to the decision boundary)
• There are basically three strategies: least confident, margin sampling, and entropy

Figure: the instances closest to the boundary are queried.

Page 23: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (II)
• For problems with three or more class labels, it might query the instances whose prediction is the least confident
• The uncertainty measure (least confident):
  x*_LC = argmax_x [ 1 − P_θ(ŷ | x) ]
• Where ŷ = argmax_y P_θ(y | x) is the class label with the highest posterior probability under model θ
• Pros:
  • Appropriate if the objective function is to reduce classification error
• Cons:
  • This strategy only considers information about the most probable label

Page 24: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (III)
• Partial solution: margin sampling:
  x*_M = argmin_x [ P_θ(ŷ1 | x) − P_θ(ŷ2 | x) ]
• Where ŷ1 and ŷ2 are the first and second most probable class labels under the model
• Pros:
  • Corrects the least-confident strategy by incorporating the posterior of the second most likely label
  • Appropriate if the objective function is to reduce classification error
• Cons:
  • For problems with very large label sets, it still ignores the output distribution for the remaining classes

Page 25: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (IV)
• Solution: entropy:
  x*_H = argmax_x − Σ_i P_θ(y_i | x) log P_θ(y_i | x)
• Where y_i ranges over all possible labelings
• Entropy is an information-theoretic measure that represents the amount of information needed to “encode” a distribution
• Pros:
  • Generalizes easily to probabilistic multi-label classifiers and to more complex structured instances (e.g. sequences, trees)
  • Appropriate if the objective function is to minimize log-loss
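The three measures above can be sketched in a few lines of Python, given a matrix of predicted class posteriors; assuming the scores come from any probabilistic classifier's predict_proba output is an assumption of this sketch rather than part of the slides.

```python
import numpy as np

def least_confident(proba):
    # 1 - P(y_hat | x); query the argmax of this score
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # difference between the two most probable labels; query the argmin
    part = np.sort(proba, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(proba):
    # -sum_i P(y_i|x) log P(y_i|x); query the argmax
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

proba = np.array([[0.10, 0.80, 0.10],   # fairly confident prediction
                  [0.40, 0.35, 0.25]])  # uncertain prediction
print(least_confident(proba), margin(proba), entropy(proba))
```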

Page 26: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (V)

Heat maps illustrating the query behavior for a three-label classification problem

Page 27: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Uncertainty Sampling (VI)
• This strategy may also be employed with non-probabilistic classifiers
• It is also applicable to regression problems (i.e., learning tasks where the output variable is a continuous value rather than a set of discrete class labels)
• The learner queries the unlabeled instance for which the model has the highest output variance in its prediction
• Active learning for regression problems is generally referred to as optimal experimental design

Page 28: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Query-By-Committee (I)
• The QBC approach involves maintaining a committee C of models which are all trained on the current labeled set L, but represent competing hypotheses
• Each committee member votes on the labeling of query candidates
• Pick the instances generating the most disagreement among the hypotheses
• Goal of QBC: minimize the version space (the set of hypotheses that are consistent with the current labeled training data L), i.e. constrain the size of this space as much as possible with as few labeled instances as possible

Page 29: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Query-By-Committee (II)

Figure: version space examples for (a) linear and (b) axis-parallel box classifiers. All hypotheses are consistent with the labeled training data in L (as indicated by shaded polygons), but each represents a different model in the version space.
• To implement a QBC selection algorithm:
  • Construct a committee of models that represent different regions of the version space
  • Have some measure of disagreement among committee members

Page 30: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Query-By-Committee (III)
• Two approaches for measuring the level of disagreement:
• 1) Vote entropy:
  x*_VE = argmax_x − Σ_i ( V(y_i) / C ) log ( V(y_i) / C )
• Where y_i ranges over all possible labelings, V(y_i) is the number of “votes” a label receives from the committee members, and C is the committee size
• 2) Kullback-Leibler (KL) divergence:
  x*_KL = argmax_x (1/C) Σ_{c=1..C} D( P_{θ(c)} || P_C )
• Where:
  D( P_{θ(c)} || P_C ) = Σ_i P_{θ(c)}(y_i | x) log [ P_{θ(c)}(y_i | x) / P_C(y_i | x) ]
• θ(c) represents a particular model in the committee, and C is the committee as a whole
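A sketch of both disagreement measures, assuming the committee members' predictions are stacked into an array of shape (C, n_instances, n_classes); the array layout and the epsilon smoothing are assumptions of this sketch.

```python
import numpy as np

def vote_entropy(committee_proba):
    # committee_proba: shape (C, n_instances, n_classes)
    C, _, n_classes = committee_proba.shape
    votes = committee_proba.argmax(axis=2)                     # each member's hard vote
    frac = np.stack([(votes == k).sum(axis=0) for k in range(n_classes)],
                    axis=1) / C                                # V(y_i) / C
    return -np.sum(frac * np.log(frac + 1e-12), axis=1)        # query the argmax

def kl_disagreement(committee_proba):
    consensus = committee_proba.mean(axis=0)                   # consensus P_C(y_i | x)
    kl = np.sum(committee_proba *
                np.log((committee_proba + 1e-12) / (consensus + 1e-12)), axis=2)
    return kl.mean(axis=0)                                     # mean D(P_theta_c || P_C); query the argmax
```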


Page 31: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Query-By-Committee (IV)
• So the “consensus” probability that y_i is the correct label is given by:
  P_C(y_i | x) = (1/C) Σ_{c=1..C} P_{θ(c)}(y_i | x)
• KL divergence is an information-theoretic measure of the difference between two probability distributions
• Aside from QBC, other strategies attempt to minimize the version space:
  • A selective sampling algorithm that uses a committee of two neural networks, the “most specific” and the “most general” models (Cohn et al., 1994)
  • A pool-based margin strategy for SVMs (Tong and Koller, 2000)
  • The membership query algorithms (Angluin; King et al.)

Page 32: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Expected Model Change (I)
• A decision-theoretic approach
• It selects the instance that would impart the greatest change to the current model if we knew its label

Figure: the model before and after adding the queried instance (panels “Before” and “After”).

Page 33: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Expected Model Change (II)
• Expected Gradient Length (EGL)
• It applies to learners that use gradient-based training
• The learner should query the instance x which, if labeled and added to L, would result in the new training gradient of the largest magnitude
• It calculates the length as an expectation over the possible labelings:
  x*_EGL = argmax_x Σ_i P_θ(y_i | x) ‖ ∇ℓ_θ( L ∪ <x, y_i> ) ‖
• With ∇ℓ_θ(L) the gradient of the objective function ℓ with respect to the model parameters θ
• ∇ℓ_θ( L ∪ <x, y> ) is the new gradient that would be obtained by adding the training tuple <x, y> to L
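A rough sketch of EGL for a binary logistic-regression learner. It relies on the common approximation that the training gradient over L is near zero at convergence, so only the candidate's own gradient contribution matters; the assumption that the classes are ordered as [0, 1] is mine.

```python
# Sketch of EGL: expected norm of the gradient that the candidate <x, y>
# would add, weighted by the current posterior P_theta(y | x).
import numpy as np

def egl_scores(model, X_pool):
    proba = model.predict_proba(X_pool)        # assumes model.classes_ == [0, 1]
    scores = []
    for x, p in zip(X_pool, proba):
        expected = 0.0
        for y, p_y in zip((0.0, 1.0), p):
            grad = (p[1] - y) * x              # gradient of the log-loss of <x, y> w.r.t. the weights
            expected += p_y * np.linalg.norm(grad)
        scores.append(expected)
    return np.array(scores)                    # query the argmax
```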

Page 34: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Expected Model Change (III)
• It prefers instances that are likely to most influence the model, regardless of the resulting query label
• Cons:
  • Computationally expensive if both the feature space and the set of labelings are very large
  • If features are not properly scaled, the EGL approach can be led astray (solution: parameter regularization)

Page 35: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Expected Error Reduction (I)
• A decision-theoretic approach
• Query the instances that would most reduce error: measure how much the model's generalization error is likely to be reduced
• The idea is to estimate the expected future error of a model trained using L ∪ <x, y> on the remaining unlabeled instances in U, and query the instance with minimal expected future error (risk)
• One approach: minimize the expected 0/1-loss:
  x*_{0/1} = argmin_x Σ_i P_θ(y_i | x) ( Σ_{u=1..U} 1 − P_{θ+<x,y_i>}(ŷ | x^(u)) )
• Where θ+<x,y_i> refers to the new model after it has been re-trained with the training tuple <x, y_i> added to L
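A brute-force sketch of the 0/1-loss variant above, assuming a scikit-learn-style classifier; every candidate/label pair triggers a full retrain, which is exactly the computational cost discussed on the next slide.

```python
import numpy as np
from sklearn.base import clone

def expected_error_scores(model, X_lab, y_lab, X_pool):
    proba_now = model.predict_proba(X_pool)
    scores = []
    for i, x in enumerate(X_pool):
        risk = 0.0
        for j, y in enumerate(model.classes_):
            # retrain with the candidate <x, y> added to L ...
            m = clone(model).fit(np.vstack([X_lab, x[None, :]]),
                                 np.append(y_lab, y))
            # ... and measure the remaining uncertainty over the pool U
            risk += proba_now[i, j] * np.sum(1.0 - m.predict_proba(X_pool).max(axis=1))
        scores.append(risk)
    return np.array(scores)    # query the argmin
```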


Page 36: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Expected Error Reduction (II)
• Objectives:
  • To reduce the expected total number of incorrect predictions
  • To minimize the expected log-loss, which is equivalent to reducing the expected entropy over U (maximizing the expected information gain of the query x, or the mutual information of the output variables over x and U)
• It has been successfully used with naïve Bayes, Gaussian random fields, logistic regression, and SVMs
• Cons:
  • The most computationally expensive query framework (it requires estimating the expected future error over U for each query, and a new model must be incrementally re-trained for each possible query labeling)
  • Because of the complexity, it is usually used only for simple binary classification tasks

Page 37: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Variance Reduction (I)
• Objective: reduce generalization error indirectly by minimizing output variance
• Considering a regression problem, the learner's expected future error can be decomposed as:
  E_T[ (ŷ − y)² | x ] = E[ (y − E[y|x])² ]  (noise)
                      + ( E_L[ŷ] − E[y|x] )²  (bias)
                      + E_L[ ( ŷ − E_L[ŷ] )² ]  (variance)
• Where E[·] denotes the expectation over the conditional density P(y|x), E_L[·] the expectation over the labeled set L, and E_T[·] the expectation over both

Page 38: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Variance Reduction (II)
• The variance reduction query selection strategy then becomes:
  x*_VR = argmin_x ⟨σ²_ŷ⟩
• Here ⟨σ²_ŷ⟩ is the estimated mean output variance once the model has been trained with a label for the variable x
• The output variance can be expressed in terms of the Fisher information matrix F(θ) (MacKay, 1992), roughly:
  σ²_ŷ ≈ (∂ŷ/∂θ)ᵀ F(θ)⁻¹ (∂ŷ/∂θ)
• In other words, to minimize the variance over its parameter estimates, an active learner should select data that maximizes its Fisher information (or minimizes the inverse thereof)
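A sketch of the Fisher-information view for a binary logistic-regression model, where F(θ) = Σ p(1−p) x xᵀ over the labeled data. Scoring candidates by the drop in tr(F⁻¹) is one (A-optimal style) reading of "minimize the variance over the parameter estimates"; the ridge term is an assumption added only to keep F invertible.

```python
import numpy as np

def fisher_information(X, p, ridge=1e-3):
    # F(theta) = sum_i p_i (1 - p_i) x_i x_i^T  (binary logistic regression)
    W = p * (1.0 - p)
    F = (X * W[:, None]).T @ X
    return F + ridge * np.eye(X.shape[1])      # ridge term (assumed) keeps F invertible

def variance_reduction_scores(X_lab, p_lab, X_pool, p_pool):
    F = fisher_information(X_lab, p_lab)
    base = np.trace(np.linalg.inv(F))          # summed parameter variance before querying
    scores = []
    for x, p in zip(X_pool, p_pool):
        F_new = F + p * (1.0 - p) * np.outer(x, x)
        scores.append(base - np.trace(np.linalg.inv(F_new)))
    return np.array(scores)                    # query the argmax (largest variance drop)
```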


Page 39: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Density-Weighted Methods (I)
• The previous methods may spend time querying possible outliers simply because they are controversial

Figure: uncertainty sampling can be a poor strategy for classification. Since A is on the decision boundary, it would be queried as the most uncertain. However, querying B is likely to result in more information about the data distribution as a whole.

• The basic idea is to query instances that are both the most “informative” and the most “representative” (i.e. that inhabit dense regions of the input space)

Page 40: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Density-Weighted Methods (II)
• 1) The information density framework (Settles and Craven, 2008)
• Main idea: informative instances should not only be those which are uncertain, but also those which are “representative” of the underlying distribution.

Figure: one candidate is very close to the decision boundary; another is in the densest region while still very close to the decision boundary.

Page 41: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Density-Weighted Methods (III)
• The Settles and Craven approach:
  x*_ID = argmax_x φ_A(x) × ( (1/U) Σ_{u=1..U} sim(x, x^(u)) )^β
• Where φ_A(x) represents the informativeness of x according to some “base” query strategy A, such as an uncertainty sampling or QBC approach. The second term weights the informativeness of x by its average similarity to all other instances in the input distribution (as approximated by U), subject to a parameter β that controls the relative importance of the density term.
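A sketch of the weighting above, with cosine similarity standing in for sim(·,·) and any base informativeness scores (e.g. one of the uncertainty measures from earlier) passed in as base_scores; both choices are assumptions of the sketch.

```python
import numpy as np

def information_density(base_scores, X_pool, beta=1.0):
    # cosine similarity of every pool instance to every other pool instance
    Xn = X_pool / (np.linalg.norm(X_pool, axis=1, keepdims=True) + 1e-12)
    density = (Xn @ Xn.T).mean(axis=1)        # average similarity to U
    return base_scores * density ** beta      # query the argmax
```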


Page 42: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Density-Weighted Methods (IV)
• McCallum and Nigam (1998) also developed a density-weighted QBC approach for text classification with naïve Bayes (a special case of information density)
• Fujii et al. (1998) considered a query strategy for nearest-neighbor methods that selects queries that are least similar to the labeled instances in L, and most similar to the unlabeled instances in U.
• Nguyen and Smeulders (2004) proposed a density-based approach that first clusters instances and tries to avoid querying outliers by propagating label information to instances in the same cluster.
• Xu et al. (2007) use clustering to construct sets of queries for batch-mode active learning with SVMs.

Page 43: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Density-Weighted Methods (V)
• Pros:
  • If densities can be pre-computed efficiently and cached for later use, the time required to select the next query is essentially no different than with the base informativeness measure alone (e.g., uncertainty sampling)
  • This is advantageous for conducting active learning interactively with oracles in real time

Page 44: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Usage Comparison of Different Active Learning Approaches

[Guyon, Cawley, Dror, Lemaire 2009] [Active Learning Challenge] http://www.causality.inf.ethz.ch/activelearning.php#cont


Page 45: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Problem Setting Variants

Some common problem setting variants, generalizations, and extensions of traditional active learning work:

1. Active learning for structured outputs
2. Active feature acquisition and classification
3. Active class selection
4. Active clustering

Page 46: Active Learning and Experimental Design Anwar Sadat Delgado Monica

AL for Structured Outputs (I)
• Many important learning problems involve predicting structured outputs on instances, such as sequences and trees

Figure: example of an information extraction problem viewed as a sequence labeling task

Page 47: Active Learning and Experimental Design Anwar Sadat Delgado Monica

AL for Structured Outputs (II)
• Unlike simpler classification tasks, each instance x in this setting is not represented by a single feature vector, but rather by a structured sequence of feature vectors
• The appropriate label for a token often depends on its context in the sequence
• Labels are typically predicted by a sequence model based on a probabilistic finite state machine, such as CRFs or HMMs
• Settles and Craven, "An Analysis of Active Learning Strategies for Sequence Labeling Tasks" (2008), discuss how many active learning strategies can be adapted for structured labeling tasks.

Page 48: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Feature Acquisition and Classification (I)
• In some learning domains, instances may have incomplete feature descriptions; for example, many data mining tasks in modern business are characterized by naturally incomplete customer data
• The task of the model is to classify instances using this incomplete information as the feature set
• Active feature acquisition seeks to alleviate these problems by allowing the learner to request more complete feature information, under the assumption that more information can be requested at extra cost/effort

Page 49: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Feature Acquisition and Classification (II)
• The goal in active feature acquisition is to select the most informative features to obtain during training, rather than randomly or exhaustively acquiring all new features for all training instances
• There have been multiple approaches to solving this feature acquisition problem, e.g.:
  • Zheng and Padmanabhan proposed two “single-pass” approaches:
    1. Attempt to impute the missing values, and then acquire the ones about which the model has the least confidence
    2. Alternatively, impute these values, train a classifier on the imputed training instances, and only acquire feature values for the instances which are misclassified
  • Incremental active feature acquisition may instead acquire values for a few salient features at a time:
    1. By selecting a small batch of misclassified examples (Melville et al., 2004)

Page 50: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Feature Acquisition and Classification (III)
    2. Taking a decision-theoretic approach and acquiring the feature values which are expected to maximize some utility function (Saar-Tsechansky et al., 2009)
• Some work in active feature acquisition has considered cases in which feature acquisition can be done at test time rather than during training

• The difference between these learning settings and typical active learning is that the “oracle” provides salient feature values rather than training labels
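A rough sketch in the spirit of the first "single-pass" idea above (impute, then acquire where the model is least confident); the mean imputer, classifier, and budget are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

def acquisition_candidates(X_incomplete, y, budget=10):
    # impute missing values (NaNs), fit, then spend the budget where the
    # model is least confident
    X_imp = SimpleImputer(strategy="mean").fit_transform(X_incomplete)
    model = LogisticRegression(max_iter=1000).fit(X_imp, y)
    confidence = model.predict_proba(X_imp).max(axis=1)
    return np.argsort(confidence)[:budget]    # rows whose true feature values to acquire
```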


Page 51: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Class Selection
• We have until now assumed that the data needed for training the model is freely available and that obtaining labels involves a cost
• In an active class selection problem, the assumption is reversed: the learner is able to query a known class label, and obtaining each instance for it incurs a cost
• Lomasky et al. (2007) propose several active class selection query algorithms for an “artificial nose” task:
  • A machine learns to discriminate between different vapor types (the class labels) which must be chemically synthesized (to generate the instances)

Page 52: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Active Clustering
• Clustering algorithms are the most common examples of unsupervised learning
• Since active learning aims at selecting data that will reduce the model's uncertainty, unsupervised active learning is counterintuitive
• Hofmann and Buhmann have proposed an active clustering algorithm for proximity data, based on an expected value of information criterion
  • The idea is to generate (or subsample) the unlabeled instances in such a way that they self-organize into groupings with less overlap or noise than clusters induced using random sampling

Page 53: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Practical Considerations
• Until very recently the main question has been “Can machines learn with fewer training instances if they ask questions?”
  • Yes (subject to some assumptions, e.g. a single oracle, the oracle is always right, the cost of labeling instances is uniform)
• These assumptions don't hold in most real-world situations
• The research trend has now taken a turn to try and answer the question “Can machines learn more economically if they ask questions?”

Page 54: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Noisy Oracles
• The assumption that the oracle is always correct doesn't hold in real-world scenarios
• Crowdsourcing tools and annotation games make it possible to average out the result by getting labels from multiple oracles
• Question: when should the learner query a potentially noisy oracle, or even re-request the label for an instance that seems off?
• Research has shown that labeling quality can be improved by selective relabeling
• Questions still remaining for research:
  • How can the learner incorporate varying levels of noise from the same oracle?
  • What if repeated labeling is not able to improve quality at all?

Page 55: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Variable Labeling Costs
• The aim is to reduce the total labeling cost; simply reducing the total number of labels doesn't solve the problem
• Annotation costs can be reduced by using trained models to assist in labeling query instances by pre-labeling them (similar to data parsing)
• Kapoor et al. propose a method that accounts for labeling costs and mislabeling costs. Queries are evaluated by summing the labeling cost, taking into account the probability of mislabeling depending on the quality of the oracle (the labeling cost and quality must be known beforehand)

Page 56: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Batch-Mode Active Learning
• Most active learning research is oriented toward settings where queries are selected serially
• In the presence of multiple annotators, this method slows down the learning process
• Batch-mode active learning allows the learner to query instances in groups; this approach is better suited for parallel labeling
• The challenge is to form the query set: querying the “Q-best” queries according to the instance-level strategy usually causes information overlap and is not helpful
• Research is oriented toward approaches that specifically incorporate diversity in the queries, e.g. Xu et al. (2007) query the centroids of different clusters (see the sketch below)
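A sketch of one diversity-aware batch strategy in the spirit of Xu et al. (2007): cluster the most uncertain pool instances and take the point nearest each centroid instead of the Q-best list; the k-means choice and the top_k cut-off are assumptions of the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_batch(X_pool, uncertainty, batch_size=5, top_k=50):
    top = np.argsort(uncertainty)[-top_k:]                 # most uncertain candidates
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=0).fit(X_pool[top])
    batch = []
    for center in km.cluster_centers_:
        d = np.linalg.norm(X_pool[top] - center, axis=1)
        batch.append(int(top[np.argmin(d)]))               # representative per cluster
    return batch                                           # indices to label in parallel
```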


Page 57: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Alternative Query Types
• Settles et al. (2008) suggest multiple-instance (MI) learning. Here instances are grouped into bags and the bags are labeled rather than the instances.
  • A bag is labeled positive if any of its instances is positive
  • A bag is labeled negative if and only if all its instances are negative
• It is easier to label bags than to label individual instances
• Mixed-granularity MI learning was also proposed by Settles: the learner is capable of querying instances as well as bags
• Querying on features as opposed to labels is also an alternative. Feature-based learning can also be interleaved with instance learning to improve accuracy (e.g. the presence of the word “football” in a text document usually means it is a sports document)
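A tiny illustration of the bag-labeling rule above (positive if any instance is positive, negative only if all are negative):

```python
def bag_label(instance_labels):
    # positive if any instance is positive, negative only if all are negative
    return int(any(instance_labels))

print(bag_label([0, 0, 1]))   # 1: positive bag
print(bag_label([0, 0, 0]))   # 0: negative bag
```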

Page 58: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Multi-Task Active Learning
• The same data instance can be labeled in multiple ways for different subtasks
• It is also sometimes more economical to label all tasks of an instance simultaneously
• Qi et al. (2008) study a different multi-task active learning scenario, in which images may be labeled for several binary classification tasks in parallel. For example, an image might be labeled as containing a beach, sunset, mountain, field, etc., which are not all mutually exclusive; however, they are not entirely independent, either

Page 59: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Stopping Criteria
• When is it good to stop?
  1. When the cost of acquiring new training data is greater than the cost of the errors made by the current model
  2. When the learner recognizes that the accuracy plateau has been reached
• The real stopping criterion for practical applications is based on economic or other external factors, which likely come well before an intrinsic learner-decided threshold

Page 60: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Analysis of Active Learning
• A number of research efforts in the field, and work done by large organizations like Google, IBM, and CiteSeer, suggest that active learning works
• Its success is, however, very closely tied to the quality of the annotators
• It does reduce the number of instance labels required to achieve a given level of accuracy in most of the studied research
• However, there is reported research in which the users opted out of active learning because they lacked confidence in the approach

Page 61: Active Learning and Experimental Design Anwar Sadat Delgado Monica

Conclusions
• Active learning is a growing area of research in machine learning, driven by the fact that the amount of available data is ever growing and can be obtained cheaply, while labels remain costly
• A lot of the research is aimed at improving the accuracy of the learner with a smaller number of queries
• Current research is aimed at implementing active learning in practical scenarios and problem solving