Page 1

Transdisciplinary Data Science, Minneapolis, 6th of September 2016

Active Sampling for Optimizing Prediction Model Reliability

Georg Krempl

Knowledge Management & Discovery, Faculty of Computer Science, Otto-von-Guericke University

Magdeburg, Germany

Special thanks to D. Kottke, M. Spiliopoulou, V. Lemaire, Ch. Beyer, T.C. Ha, E. Hüllermeier, J. Stefanowski, N. Adams, and B. Pfahringer.

Page 2

Motivation

Big Data, but . . .

- the expert's time is scarce,
- storage & processing capacities are limited

Selection is important

- efficient allocation of limited resources
- sample where we expect something interesting

Our focus

- (unexpected) change: change detection & mining
- "uncertain" regions: active learning

Page 3

Preface

Key points

- Using approaches like probabilistic active learning, AL improves learning efficiency
- Uncertainty sampling is problematic, as it ignores the uncertainty of the model itself
- Balancing exploration & exploitation is important, particularly in non-stationary environments
- Considering the true posterior in the expectation might also be beneficial outside probabilistic active learning

Open issues

- Unified concept of "uncertainty" in AL
- Evaluation & performance bounds for AL in streams
- Budget management with unsupervised change detection
- Sample reusability [Tomanek and Morik, 2011] & AL combinations [Beyer et al., 2015]
- Use for other active learning problems (active class/feature selection)

Page 4

Active Learning³

Aliases and Historical Remarks

- Optimal experimental design [Fedorov, 1972]
- Learning with queries (later denoted query synthesis) [Angluin, 1988]
- Selective sampling [Cohn et al., 1990]

Active Learning Tasks¹

- Labels: Active label acquisition
- Features: Active feature (value) acquisition
- Whole instances: Active class selection (or class-conditional example acquisition)

Active Learning Scenarios²

- Query synthesis, i.e. the example is generated upon query
- Pool of unlabelled data U, static, repeated access
- Stream, instances arrive sequentially, no repeated access

¹ [Attenberg et al., 2011]  ² [Settles, 2012]  ³ See e.g. [Settles, 2012, Cohn, 2010].

Page 5

Active Learning

- Labels are costly (i.r.t. features)
- Active learner controls the labelling process
- Objective: Strategy for selection of the most valuable labels
- Baseline: Random selection
- AL in streams with static concepts: well-studied, e.g. in surveys by
  - [Settles, 2009]: Section on stream-based selective sampling
  - [Fu et al., 2012]: Section on AL on streaming data platform
- Our focus: AL in non-stationary streams (selective sampling in evolving streams)

[Diagram: a data stream over time; instances with features x1, x2, x3 yield predictions ŷ1, ŷ2, ŷ3, the active learner issues label requests r1, r3, and labels y1, y3 are returned.]

Page 6

Overview on Active Learning Strategies

Selected Active Learning Strategies⁴

- Version Space Partitioning & Query by Committee
- Uncertainty Sampling
- Decision Theoretic Approaches
  - Loss Minimisation: Expected Error & Variance Reduction
  - Probabilistic Active Learning

⁴ Generic, i.e. usable with different classifier technologies.

Page 7

Version Space Partitioning⁵

- Version Space Partitioning [Ruff and Dietterich, 1989]: Selection based on disagreement between hypotheses
- Query by Committee [Seung et al., 1992]:
  - Disagreement within an ensemble of classifiers
  - Requires constructing a diverse ensemble of classifiers
  - Combinations with clustering (mixture models)

[Figure: two classifiers over features x1 and x2; instances in the region where classifier 1 and classifier 2 disagree are selected.]

⁵ See [Ruff and Dietterich, 1989].

Page 8

Uncertainty Sampling⁷

- Information theoretic approach
- Uses the classifier's uncertainty as proxy
- Common uncertainty measures⁶
  - Posterior-based:
    - Confidence: abs(P(y = +|x) − P(y = −|x))
    - Entropy: −∑_{y∈{+,−}} p(y|x) · log(p(y|x))
  - Margin: distance to decision boundary
- Fast: O(|U|), where U is the set of unlabelled instances
- But do these measures really capture the uncertainty?

⁶ See e.g. [Settles, 2012].  ⁷ See [Roy and McCallum, 2001].
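To make the three measures above concrete, here is a minimal sketch (an illustration, not code from the talk or from the OPAL package) that scores binary candidates by confidence, entropy and margin, operating on an array of class posteriors such as one obtained from a scikit-learn predict_proba call; the helper name is hypothetical.

```python
import numpy as np

def uncertainty_scores(proba):
    """proba: array of shape (n_candidates, 2) with P(y=-|x), P(y=+|x).
    Returns three scores per candidate; higher = more uncertain."""
    p_neg, p_pos = proba[:, 0], proba[:, 1]
    # Confidence-based: a small |P(+|x) - P(-|x)| means high uncertainty.
    confidence = np.abs(p_pos - p_neg)
    # Entropy: -sum_y p(y|x) log p(y|x), maximal at p = 0.5.
    eps = 1e-12
    entropy = -(p_pos * np.log(p_pos + eps) + p_neg * np.log(p_neg + eps))
    # Margin (for probabilistic output): distance of P(+|x) from the 0.5 boundary.
    margin = np.abs(p_pos - 0.5)
    return 1.0 - confidence, entropy, 0.5 - margin

# Usage: pick, for example, the candidate with the highest entropy.
proba = np.array([[0.9, 0.1], [0.55, 0.45], [0.3, 0.7]])
conf_u, ent_u, marg_u = uncertainty_scores(proba)
query_idx = int(np.argmax(ent_u))  # here: the instance with posterior 0.55/0.45
```

All three scores need one pass over the unlabelled pool, which matches the O(|U|) cost noted above.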

Page 9

Exemplary AL Situations

[Figure: exemplary AL situations arranged by the number of labels n (low vs. high) and the observed distribution of labels p̂ (uniform vs. non-uniform), quadrants I-IV.]

- a label's value depends on the label information in its neighbourhood
- label information:
  - number of labels
  - share of classes
- uncertainty sampling ignores the number of similar labels

Page 10

Measuring the Uncertainty

Problem with the above measures

- Focus on exploitation, failure at exploration [Beyer et al., 2015]
- "Uncertainty" measures ignore the uncertainty of the prediction model itself (cmp. epistemic vs. aleatoric uncertainty in [Senge et al., 2014])

Extensions: Combined measures

- [Fu et al., 2012]
  - uncertainty
  - instance correlation (within batch)
- [Reitmaier and Sick, 2013] 4DS approach, considering:
  - distance to the decision boundary
  - diversity of samples in the query set
  - density
  - class prior
- [Weigl et al., 2015]
  - conflict: overlap of opposing classes
  - ignorance: proximity of the nearest decision boundary

Page 11

Decision Theoretic Approaches

Expected Error Reduction [Cohn et al., 1996, Roy and McCallum, 2001]

- Aim: Minimise the error after selection & retraining
- Model the unknown label realisation as a random variable:

  x∗ = argmin_x  E_{y|L} [ ∑_{x′∈U} E_{y′|L′=L∪{(x,y)}} [ ŷ′ ≠ y′ ] ]

- Better results reported than for uncertainty sampling [Settles, 2012]
- Relies on a maximum-likelihood posterior estimate [Chapelle, 2005]
- Performance estimation relies on an evaluation set (using L or by self-labelling U)
- High computational complexity: O(|U|²)
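As a rough illustration of this selection rule (not the implementation used in the talk), the sketch below evaluates, for each candidate, the expected self-labelled error on U after retraining with each possible label realisation; it assumes a scikit-learn-style classifier with fit/predict_proba, and the function name expected_error_reduction is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def expected_error_reduction(clf, X_lab, y_lab, X_unlab):
    """Return the index of the candidate minimising the expected error on U."""
    best_idx, best_risk = None, np.inf
    for i, x in enumerate(X_unlab):
        p_y = clf.predict_proba(x.reshape(1, -1))[0]       # P(y|x, L)
        risk = 0.0
        for y_cls, p in zip(clf.classes_, p_y):
            # Retrain on L u {(x, y)} and estimate the expected 0/1 error on U.
            clf_new = clone(clf).fit(np.vstack([X_lab, x]),
                                     np.append(y_lab, y_cls))
            proba_u = clf_new.predict_proba(X_unlab)
            risk += p * np.sum(1.0 - proba_u.max(axis=1))   # expected misclassifications
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return best_idx

# Usage on a toy problem (note the quadratic cost in |U|, as stated above):
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(4, 2)) + np.array([[2, 2], [2, 2], [-2, -2], [-2, -2]])
y_lab = np.array([1, 1, 0, 0])
X_unlab = rng.normal(size=(20, 2))
clf = GaussianNB().fit(X_lab, y_lab)
pick = expected_error_reduction(clf, X_lab, y_lab, X_unlab)
```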

Page 12

Probabilistic Active Learning

Motivation

- The true posterior in a candidate's neighbourhood is unknown:
  - Explicitly model the uncertainty associated with the posterior value: take the expectation not only over the candidate instance's label realisation y, but also over the true posterior p in its neighbourhood:

    argmax_x  E_p [ E_{y|p} [ performance(L ∪ {(x, y)}) ] ]

- The impact of a label is largest in its direct neighbourhood:
  - Evaluate the change in classification performance only therein
- A label's influence depends on the number of similar labels to follow:
  - Consider not only the very next label, but m subsequent similar labels at once
- Active learning under unequal misclassification costs is considered challenging:
  - Consider a cost-sensitive performance measure
- Handling large candidate sets (e.g. big pools/data streams):
  - Derive a fast, closed-form solution for the expected misclassification loss reduction

Page 13

Probabilistic Active Learning

Limitations

- Separates classifier and active selector (similar to uncertainty sampling)
- Depends on an appropriate neighbourhood definition and probabilistic estimates for ls = (n, p̂)
- Performance gain is approximated within the neighbourhood (evaluating globally is possible, but computationally costly)

References

- Implementations in Java, Python, MATLAB are available (open source) at http://kmd.cs.ovgu.de/res/opal/
- Optimised Probabilistic Active Learning (OPAL). Krempl, Kottke, Lemaire. Machine Learning 100(2), 2015.

Page 14

Optimised Probabilistic Active Learning in a Nutshell

Illustrative Example

[Illustration: a one-dimensional data set with labelled (− / +) and unlabelled instances, and a candidate marked "?".]

- Given: Data set with labelled (− / +) and unlabelled instances
- Objective: Determine the expected gain of labelling e.g. the candidate "?"
- What label information do we have already?
- Summarise the label information in its neighbourhood:
  For example, by using a probabilistic classifier, kernel frequency estimates, label counts, . . .
  - Number of labels: n = 2
  - Share of positives therein (i.e. posterior estimate): p̂ = 1/2
- Summarise as label statistics: ls = (n = 2, p̂ = 0.5)
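One way to obtain such label statistics, in line with the kernel frequency estimates mentioned above, is sketched here (an assumed Gaussian-kernel variant for illustration, not the reference implementation from http://kmd.cs.ovgu.de/res/opal/): n is the kernel-weighted count of labels near the candidate and p̂ the weighted share of positives.

```python
import numpy as np

def label_statistics(x_cand, X_lab, y_lab, bandwidth=1.0):
    """Kernel frequency estimates around a candidate.
    Returns ls = (n, p_hat): weighted label count and share of positives."""
    if len(X_lab) == 0:
        return 0.0, 0.5  # no labels yet: uniform prior over the posterior
    d2 = np.sum((X_lab - x_cand) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)      # Gaussian kernel weights
    k_pos = np.sum(w[y_lab == 1])               # weighted positive frequency
    k_neg = np.sum(w[y_lab == 0])               # weighted negative frequency
    n = k_pos + k_neg
    p_hat = k_pos / n if n > 0 else 0.5
    return n, p_hat

# Example matching the slide: two labels (one +, one -) close to the candidate.
X_lab = np.array([[0.4], [0.6]])
y_lab = np.array([1, 0])
n, p_hat = label_statistics(np.array([0.5]), X_lab, y_lab, bandwidth=0.5)
# n is roughly 2 (both labels nearby), p_hat is 0.5
```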

Page 15

Probabilistic Active Learning

Probabilistic Gain⁸

  pgain(ls) = E_p [ E_{y|p} [ gain_p(ls, y) ] ]
            = ∫₀¹ Beta_{α,β}(p) · ∑_{y∈{0,1}} Ber_p(y) · gain_p(ls, y) dp

with:
- ls = (n, p̂): Label statistics
- y: Candidate's label realisation
- p: True posterior at the candidate's position

- This probabilistic gain quantifies
  - the expected change in classification performance
  - at the candidate's position in feature space,
  - in each and every future classification there,
  - given that one additional label is acquired.
- Weight pgain with the density dx over labelled and unlabelled data at the candidate's position.
- Select the candidate with the highest density-weighted probabilistic gain.

⁸ See [Krempl et al., 2014].
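A minimal numerical sketch of this expectation, assuming accuracy as the performance measure and a simple grid approximation of the integral over p (an illustration of the formula above, not the authors' closed-form implementation):

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import trapezoid

def perf(p_true, p_obs):
    """Accuracy of the decision implied by the observed posterior p_obs,
    evaluated under the true posterior p_true."""
    return p_true if p_obs >= 0.5 else 1.0 - p_true

def pgain(n, p_hat, grid=2001):
    """Myopic probabilistic gain for label statistics ls = (n, p_hat)."""
    a, b = n * p_hat + 1.0, n * (1.0 - p_hat) + 1.0   # positives + 1, negatives + 1
    ps = np.linspace(0.0, 1.0, grid)                  # integration grid over true p
    dens = beta.pdf(ps, a, b)
    gain = np.zeros_like(ps)
    for y, ber in ((1, ps), (0, 1.0 - ps)):           # Ber_p(y) for y in {1, 0}
        p_obs_new = (n * p_hat + y) / (n + 1.0)       # updated posterior estimate
        g = np.array([perf(p, p_obs_new) - perf(p, p_hat) for p in ps])
        gain += ber * g
    return trapezoid(dens * gain, ps)                 # E_p[ E_{y|p}[gain] ]

# A sparsely labelled, uncertain neighbourhood gains more than a well-explored one:
print(pgain(2, 0.5), pgain(11, 10 / 11))
```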

Page 16

Probabilistic Active Learning – Interpretation

[Plot: normalised likelihood of the true posterior p for no labels (n = 0), few labels (n = 2, p̂ = 0.5), more labels (n = 3, p̂ = 2/3), and many labels (n = 11, p̂ = 10/11); the peaks lie at 0.5, 0.67 and 0.91 and sharpen as n grows.]

- Uniform prior: Prior to the first label's arrival, all values of p are assumed equally plausible.
- A Bayesian approach yields a normalised likelihood that corresponds to a Beta distribution with parameters:
  α = number of positive labels plus one
  β = number of negative labels plus one
- The plot shows the normalised likelihoods for different values of α, β.
- The peak of this function becomes more distinct the more labels are obtained.
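The Beta-shaped likelihood above is easy to reproduce; a short, purely illustrative sketch computing the normalised likelihoods for the three label statistics from the plot:

```python
import numpy as np
from scipy.stats import beta

p = np.linspace(0, 1, 501)
cases = {"few (n=2, p^=0.5)": (2, 0.5),
         "more (n=3, p^=2/3)": (3, 2 / 3),
         "many (n=11, p^=10/11)": (11, 10 / 11)}

for name, (n, p_hat) in cases.items():
    a = n * p_hat + 1          # number of positive labels plus one
    b = n * (1 - p_hat) + 1    # number of negative labels plus one
    lik = beta.pdf(p, a, b)
    mode = p[np.argmax(lik)]   # peaks at 0.5, ~0.67, ~0.91 respectively
    print(f"{name}: Beta({a:.0f},{b:.0f}), peak at p={mode:.2f}, height={lik.max():.2f}")
```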

Page 17

Non-Myopic Extension of PAL

Myopic Probabilistic Gain

  pgain(ls) = E_p [ E_{y|p} [ gain_p(ls, y) ] ]
            = ∫₀¹ Beta_{α,β}(p) · ∑_{y∈{0,1}} Ber_p(y) · gain_p(ls, y) dp

with:
- ls = (n, p̂): Label statistics
- y: Candidate's label realisation
- p: True posterior at the candidate's position

Non-Myopic Extension

- In the future not a single label is purchased, but a set of labels according to a given budget m
- We need to optimise the performance gain when acquiring this set of labels!
- Brute-force approach: Calculate the gain for all combinations
- But: The ordering (of arrival) is irrelevant; it suffices to consider the varying number k of positives among the m acquired labels

Page 18

Non-Myopic Probabilistic Gain

  G_OPAL(ls, τ, m) = (1/m) · E_p [ E_k [ gain_p(ls, k, m) ] ]                                        (1)
                   = (1/m) · ∫₀¹ Beta_{α,β}(p) · ∑_{0≤k≤m} Bin_{m,p}(k) · gain_p(ls, k, m) dp        (2)

with:
- ls = (n, p̂): Label statistics
- p: True posterior at the candidate's position
- m: Number of candidates to be acquired (budget)
- k: Number of candidates with positive label realisations
- τ: false positive costs (will be explained shortly)

- with the performance gain as the difference between future and current performance:

  gain_p(ls, k, m) = perf_p( (n·p̂ + k) / (n + m) ) − perf_p(p̂)                                       (3)

Page 19

Cost-Sensitive Classification

Given a situation with
- p ∈ [0, 1]: true posterior probability of the positive class in a neighbourhood
- q ∈ [0, 1]: share of instances therein that are classified as positive
- cost_FP = τ ∈ [0, 1]: cost of each false positive classification (so cost_FN = 1 − τ)

Misclassification Loss as Performance Measure

  MLoss(p, q) = p · (1 − q) · cost_FN + (1 − p) · q · cost_FP                                        (4)
              = p · (1 − q) · (1 − τ) + (1 − p) · q · τ  =  q · (τ − p) + p · (1 − τ)                (5)

Resulting Cost-Optimal Classification

  q∗ = { 0       if p < τ
         1 − τ   if p = τ
         1       if p > τ }                                                                          (6)

Misclassification Loss under Cost-Optimal Classification

  perf_{p,τ}(p̂) = −ML_{p,τ}(p̂) = − { p · (1 − τ)   if p̂ < τ
                                      τ · (1 − τ)   if p̂ = τ
                                      τ · (1 − p)   if p̂ > τ }                                       (7)
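A small sketch of equations (4)-(7), handy for sanity-checking the case distinctions (purely illustrative; the variable names are mine, not from the talk):

```python
def mloss(p, q, tau):
    """Misclassification loss, eq. (4)-(5): cost_FN = 1 - tau, cost_FP = tau."""
    return p * (1 - q) * (1 - tau) + (1 - p) * q * tau

def q_opt(p_obs, tau):
    """Cost-optimal share of positive classifications, eq. (6)."""
    if p_obs < tau:
        return 0.0
    if p_obs > tau:
        return 1.0
    return 1.0 - tau

def ml_opt(p_true, p_obs, tau):
    """Misclassification loss of the cost-optimal decision based on the
    observed posterior p_obs, evaluated under the true posterior p_true (eq. 7)."""
    return mloss(p_true, q_opt(p_obs, tau), tau)

# With cheap false positives (tau = 0.1), positives are predicted already at p > 0.1:
assert q_opt(0.2, tau=0.1) == 1.0
# Check eq. (7): for p_obs > tau the loss equals tau * (1 - p_true).
assert abs(ml_opt(0.8, 0.2, 0.1) - 0.1 * (1 - 0.8)) < 1e-12
```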

Page 20

Optimised Probabilistic Active Learning

Non-Myopic, Cost-Sensitive Probabilistic Gain

- Combining misclassification loss as performance measure and
- the non-myopic probabilistic gain yields the
- probabilistic misclassification loss reduction:

  G_OPAL(ls, τ, m) = (1/m) · ∫₀¹ Beta_{α,β}(p) · ∑_{k=0}^{m} Bin_{m,p}(k) · ( ML_{p,τ}(p̂) − ML_{p,τ}( (n·p̂ + k)/(n + m) ) ) dp        (8)

Closed-Form Solution

  G_OPAL(n, p̂, τ, m) = ((n + 1)/m) · C(n, n·p̂) · ( IML(n, p̂, τ, 0, 0) − ∑_{k=0}^{m} IML(n, p̂, τ, m, k) )                             (9)

  IML(n, p̂, τ, m, k) = C(m, k) · { (1 − τ) · Γ(1 − k + m + n − n·p̂) · Γ(2 + k + n·p̂) / Γ(3 + m + n)    if (n·p̂ + k)/(n + m) < τ
                                    (τ − τ²) · Γ(1 − k + m + n − n·p̂) · Γ(1 + k + n·p̂) / Γ(2 + m + n)    if (n·p̂ + k)/(n + m) = τ
                                    τ · Γ(2 − k + m + n − n·p̂) · Γ(1 + k + n·p̂) / Γ(3 + m + n)           if (n·p̂ + k)/(n + m) > τ }   (10)

  where C(·,·) denotes the binomial coefficient.
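Since the closed form in (9)-(10) is easy to mistype, a numerical evaluation of equation (8) is a convenient cross-check; the following is an illustrative sketch only (grid approximation of the integral, with the eq.-(7) loss inlined), not the OPAL reference code.

```python
import numpy as np
from scipy.stats import beta, binom
from scipy.integrate import trapezoid

def ml_opt(p_true, p_obs, tau):
    # Eq. (7): loss of the cost-optimal decision implied by p_obs, under true p_true.
    if p_obs < tau:
        return p_true * (1 - tau)
    if p_obs > tau:
        return tau * (1 - p_true)
    return tau * (1 - tau)

def gopal_numeric(n, p_hat, tau, m, grid=2001):
    """Numerical evaluation of eq. (8): expected misclassification loss reduction."""
    a, b = n * p_hat + 1.0, n * (1.0 - p_hat) + 1.0
    ps = np.linspace(0.0, 1.0, grid)
    total = np.zeros_like(ps)
    for k in range(m + 1):
        p_future = (n * p_hat + k) / (n + m)          # future posterior estimate
        gain_k = np.array([ml_opt(p, p_hat, tau) - ml_opt(p, p_future, tau) for p in ps])
        total += binom.pmf(k, m, ps) * gain_k         # Bin_{m,p}(k) * gain
    return trapezoid(beta.pdf(ps, a, b) * total, ps) / m

print(gopal_numeric(n=2,  p_hat=0.5,     tau=0.5, m=3))  # sparsely labelled, uncertain: larger gain
print(gopal_numeric(n=11, p_hat=10 / 11, tau=0.5, m=3))  # well-explored region: gain close to zero
```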

Page 21

Probabilistic Gain – GOPAL

Probabilistic Gain for Equal Misclassification Costs

[Plot: probabilistic gain as a function of the observed posterior p̂ and the number of labels n.]

- The probabilistic gain in accuracy as a function of ls = (n, p̂) is
  - monotone with variable n,
  - symmetric with respect to p̂ = 0.5,
  - zero for irrelevant candidates.
- Compare to uncertainty (in confidence), which is constant w.r.t. n:

[Plot: confidence-based uncertainty as a function of the posterior only.]

Page 22

GOPAL as function of observed posterior

[Plots of GOPAL as a function of the observed posterior p̂ = P̂r(+|x) and the number of labels n, for different cost ratios τ = 0.1, 0.25, 0.5 (rows). The left column shows the myopic GOPAL, the centre column shows the non-myopic GOPAL, and the right column shows the difference between the two (on a logarithmic scale).]

Page 23

GOPAL as function of observed posterior (2)

- Expected average reduction in misclassification loss in each subsequent (cost-optimising) classification
- With increasing n (compared to the remaining budget m), the already observed posterior p̂ is weighted more strongly; the difference between the expected future IML(n, p̂, τ, m, k) and the current performance IML(n, p̂, τ, 0, 0) converges towards zero, thus candidates in well-explored regions become less interesting.
- Unequal misclassification costs: GOPAL is not symmetric around τ. Instead, sampling instances from regions where potentially a more costly error is made is favoured:
- The probabilistic gain is higher in regions where instances are currently classified as negative, as the possible error therein is more expensive.

Page 24

Visualisation of GOPAL-Values

Label statistics and GOPAL-values for the exemplary points A-E:

  point   p̂    n     dx   |  GOPAL ·1   GOPAL ·dx
  A       .5   6.0   .16  |  .0278      .00435
  B       .8   3.6   .15  |  .0011      .00016
  C       .8   0.5   .14  |  .0459      .00617
  D       .5   0.4   .13  |  .0737      .00982
  E       .5   0.1   .02  |  .0817      .00164

Figure: Visualisation of GOPAL-values for τ = 0.5 on a one-dimensional data set with labelled (red resp. green dots) and unlabelled (grey dots) data points. The upper plot shows the kernel frequency estimates (KFE) for each class and the corresponding GOPAL-value (blue curve). The lower plot shows the density (grey area) and the density-weighted GOPAL-values (blue curve). Additionally, the negative confidence values from the Uncertainty Sampling approach are plotted for comparison. For the exemplary data points (A-E) the corresponding label statistics and the unweighted (·1) and density-weighted (·dx) GOPAL-values are given in the table.

Page 25

Evaluation – Setup

Experimental Setup

- Competitors: myopic, cost-sensitive PAL (csPAL), Uncertainty Sampling without (U.S.) and with self-training (U.S. st), Certainty Sampling (C.S.), Expected Error Reduction with beta-prior (Chap) or cost-sensitive extension (Marg), non-myopic expected entropy reduction (Zhao)
- same classifiers (Parzen window classifier with Gaussian kernels)
- implemented in MATLAB and run on the same platform,
- with the same (dataset-specific, pre-tuned) bandwidth parameter,
- on several synthetic and real-world data sets,
- using cross-validation (100 random permutations),
- reporting learning curves as the arithmetic mean of misclassification loss, and wins at learning steps.
- More results are at our website http://kmd.cs.ovgu.de/res/opal/

Page 26

Overall Classification Performance

20 labels acquired:

  OPAL vs.    csPAL   U.S.   U.S. st   C.S.   Marg¹   Chap¹   Zhao¹   Rand
  τ∗ = 0.10   47%     62%∗   70%∗      72%∗   66%∗    56%∗    72%∗    62%∗
  τ∗ = 0.25   51%∗    63%∗   75%∗      88%∗   81%∗    62%∗    70%∗    65%∗
  τ∗ = 0.50    1%     64%∗   72%∗      92%∗   87%∗    63%∗    69%∗    68%∗
  τ∗ = 0.75   53%∗    60%∗   67%∗      86%∗   80%∗    50%∗    48%∗    58%∗
  τ∗ = 0.90   42%     61%∗   66%∗      77%∗   75%∗    53%∗    57%∗    62%∗

40 labels acquired:

  OPAL vs.    csPAL   U.S.   U.S. st   C.S.   Marg¹   Chap¹   Zhao¹   Rand
  τ∗ = 0.10   43%     55%∗   71%∗      75%∗   69%∗    62%∗    69%∗    57%∗
  τ∗ = 0.25   56%∗    59%∗   73%∗      89%∗   79%∗    65%∗    69%∗    58%∗
  τ∗ = 0.50    4%     61%∗   72%∗      93%∗   89%∗    74%∗    76%∗    62%∗
  τ∗ = 0.75   57%∗    64%∗   71%∗      90%∗   81%∗    59%∗    56%∗    54%∗
  τ∗ = 0.90   46%     55%∗   63%∗      82%∗   77%∗    57%∗    64%∗    56%∗

Table: Percentages of runs over all data sets where OPAL performs better than its competitor. Significantly better performance is denoted by ∗, significantly worse performance by †. The significance level in the one-sided Wilcoxon signed-rank test was 0.001 in both cases. Algorithms are marked with ¹ if not every data set could be used in the evaluation due to their long execution time.

Page 27

Runtime

  Data   OPAL    csPAL   U.S.    U.S. st   C.S.    Marg     Chap    Zhao    Rand
  See    1.867   0.254   0.206   0.468     0.162   43.535   51.87   254.8   0.015
  Che    1.905   0.249   0.201   0.452     0.183   54.897   56.60   319.9   0.016
  Che2   1.968   0.261   0.199   0.510     0.198   66.282   69.68   440.7   0.015
  Ver    1.987   0.269   0.202   0.653     0.207   71.126   78.66   451.7   0.015
  Mam    2.580   0.353   0.268   3.913     0.277   192.86   280.1   1577    0.016
  Sim    2.827   0.335   0.239   2.422     0.202   242.98   302.6   1641    0.016
  YeaU   2.993   0.379   0.272   9.318     0.260   285.51   499.9   3050    0.017
  Aba    7.000   1.001   0.703   136.1     0.706   NaN      NaN     NaN     0.023

Table: Average execution time (in seconds), rows ordered by ascending data set size. All differences w.r.t. OPAL are significant (level 0.001, one-sided Wilcoxon signed-rank test).

Page 28

Active Learning in Non-Stationary Environments

Page 29

Active Learning in Non-Stationary Environments

Challenges and Open Issues

- Adaptation to Change:
  exploitation-exploration tradeoff, e.g. [Osugi et al., 2005, Guyon et al., 2011];
  representativeness and diversity [Fu et al., 2012]
- Limited Computational Resources:
  online processing, limited storage capacity;
  chunk-based vs. instance-wise processing
- Budget Management & Change Detection
- Evaluation & Performance Bounds

Page 30

Nonstationarity: Insufficient Exploration and Lock-In

Motivation: Why not simply apply active learning strategies from static (iid) streams?

- Example: Uncertainty sampling, drifting distributions
- The error is never even noticed!
- Lock-in on an outdated hypothesis
- Caveat: Drift might occur anywhere in the feature space, see e.g. [Zliobaite et al., 2011]
- Remedy: Sampling from the whole feature space: budget management

[Illustration: model vs. reality drifting apart over time.]

Page 31

AL in Evolving Streams: Stream Processing

Chunk-Based, Myopic Processing

Motivation: The processing protocol allows data to be processed in chunks, with single-instance labelling requests

Advantage: Usable as a wrapper for static approaches

Relevant work: Exemplary approaches
  Decision Tree: [Huang and Dong, 2007]
  SVM: [Lindstrom et al., 2010]
  Ensemble: [Zhu et al., 2007, Zhu et al., 2010, Masud et al., 2010, Ienco et al., 2013, Krempl et al., 2015a]

Chunk-Based with Batches

Motivation: The processing protocol requires requests to be made in batches

Advantage: Non-myopic, practical advantages

Challenge: Requires a non-myopic selection technique

Relevant work: [Chakraborty et al., 2011, Chakraborty et al., 2014]

Page 32

AL in Evolving Streams: Stream Processing

Instance-Wise Processing

Motivation: The processing protocol requires a label request to be made immediately upon arrival

Advantage: Online processing

Problems: Requires a dedicated AL approach & budget management (see below)

Relevant work:
- Budget management [Zhu et al., 2010, Zliobaite et al., 2011, Zliobaite et al., 2013, Kottke et al., 2015]
- [Masud et al., 2010] use outlier detection to monitor changes in regions of previously low density.

Page 33

Challenges in Evolving Streams: Budget Management

Development of methods for estimating and controlling the labelling budget over time.

Motivation 1: Estimating the required labelling effort over time:
- Static context: decreasing labelling efforts through convergence
- Dynamic context: not necessarily the case...

Motivation 2: Balance of labelling costs over time:
- Simplistic approach: random sampling of a fixed percentage
- More efficient active budget management strategies?

Relevant work:
- [Zhu et al., 2010]
  - Minimum-variance approach for estimating the number of required instances,
  - Random sampling for diversity over the feature space.
- [Zliobaite et al., 2011, Zliobaite et al., 2013]
  - Variable uncertainty: sampling the least certain instances in a window, and adjusting the window if drift is suspected (a small sketch of this idea follows below).
  - VU with randomisation: as above, but include randomness for diversity over the feature space.
- [Kottke et al., 2015]
  - Explicit distinction between temporal and spatial selection
  - Temporal: Incremental Percentile Filter with trend-correction
  - Spatial: Probabilistic Active Learning
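To make the budget-management idea concrete, here is a rough sketch in the spirit of the variable-uncertainty strategy of [Zliobaite et al., 2011]; this is my own simplified rendering with assumed parameter names (theta, step), not the published pseudocode: a confidence threshold is tightened after each query and relaxed otherwise, while a budget counter keeps the overall labelling fraction bounded.

```python
class VariableUncertaintyBudget:
    """Simplified variable-uncertainty querying with a labelling budget
    (illustrative sketch; parameters theta/step are assumptions, not from the paper)."""

    def __init__(self, budget=0.1, theta=1.0, step=0.01):
        self.budget = budget       # maximal fraction of instances to label
        self.theta = theta         # confidence threshold below which we query
        self.step = step           # adaptation speed of the threshold
        self.seen = 0
        self.queried = 0

    def query(self, max_posterior):
        """Decide for one incoming instance whether to request its label.
        max_posterior: the classifier's maximum class posterior for this instance."""
        self.seen += 1
        spent = self.queried / max(self.seen, 1)
        if spent >= self.budget:                 # budget exhausted: never query
            return False
        if max_posterior < self.theta:           # uncertain enough: query
            self.queried += 1
            self.theta *= (1.0 - self.step)      # tighten, to spread queries over time
            return True
        self.theta *= (1.0 + self.step)          # relax, so queries eventually happen
        return False                             # even far from the current boundary

# Usage: feed the stream's posteriors one by one.
al = VariableUncertaintyBudget(budget=0.1)
decisions = [al.query(p) for p in (0.95, 0.6, 0.99, 0.55, 0.8)]
```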

Page 34

Challenges in Evolving Streams: Change Detection

Change Detection: Monitoring the feature distribution for changes

Motivation: Unlabelled instances are cheap and their distribution is unbiased. Some changes in the feature distribution might hint at concept drift.

Advantage: Requires no labelled instances

Problems: Changes in the posterior might go unnoticed (false negatives); changes can also trigger false alarms (covariate drift without concept drift).

Relevant work:
- [Fan et al., 2004] and [Huang and Dong, 2007] monitor changes in the distributions at the leaves of a decision tree
- [Masud et al., 2010] use outlier detection to monitor changes in regions of previously low density.

Page 35

Challenges in Evolving Streams: Evaluation & Performance Bounds

Evaluation

Challenge: When to measure the performance? Does early or final performance matter more? Spatio-temporal evaluation with learning curves?

Relevant work:
  Passive, Streams: prequential evaluation [Gama et al., 2013], recovery analysis [Shaker and Hüllermeier, 2013]
  Active, Static: learning curves, but open issues: [Evans et al., 2013]
  Active, Streams: none

Performance Bounds

Challenge: How to model the nonstationary distribution?

Related work: [Yang, 2011] (limited to covariate drift)

Page 36

AL for Evolving Streams: Literature Overview

Columns: Reference | Stream Handling | Drift Type | AL Strategy | Required Budget

[Fan et al., 2004] | online | feature | triggered Rand | fixed, on event
  A change detector on P(X) triggers random sampling; a predefined budget is spent upon change detection.
[Huang and Dong, 2007] | chunks | feature | triggered US | fixed, on event
  As above, but Naive Bayes-based uncertainty sampling.
[Zhu et al., 2007] | chunks | any | MinVar QbC | fixed
  A fixed proportion of a new chunk is labelled randomly and used to train a new classifier; the ensemble variance is used for selecting among the remaining instances. [Zhu et al., 2010] extends this work and determines the required number of labels automatically.
[Masud et al., 2010] | chunks | any | QbC, outlier | varying
  An ensemble of pseudopoints (labelled clusters) is maintained; labels are requested for outliers outside all pseudopoint ranges and for instances with high disagreement (QbC).
[Lindstrom et al., 2010] | chunks | posterior | US | fixed
  The distance to the hyperplane of an SVM classifier is used for selection.
[Liu and Wang, 2011] | online | posterior | US QbC | varying
  An ensemble of field classifiers is maintained; for a new instance the ensemble variance is compared to the historical average.
[Chu et al., 2011] | online | any | US | managed
  Uses the uncertainty of a linear probit model, but (1) model uncertainty is incorporated explicitly and (2) importance weighting is used for de-biasing.
[Zliobaite et al., 2011] | online | any | US, Rnd | managed
  Discusses the problem of drift in arbitrary locations of the feature space and several methods for budget management. [Zliobaite et al., 2013] is an extension of this work.
[Ryu et al., 2012] | online | any | QbC, outlier | varying
  Ensemble: a new base classifier is learnt on demand on suspicious samples, i.e. instances outside the mean-variance range of previous chunks. Classifier weights are adjusted based on feature distribution similarity.
[Cheng et al., 2013] | online | – | US+Density | –
  Uses adaptively weighted uncertainty and density scores, no drift detection.
[Ienco et al., 2013] | chunks | any | US, Clustering | fixed
  Combines clustering (for diversity) with uncertainty sampling, feeds labels to an arbitrary classifier.
[Ienco et al., 2014] | online | any | US+Density | –
  High density-focused uncertainty sampling.
[Krempl et al., 2015a] | chunks | any | OPAL+Clustering | fixed
  Combines clustering (for speed and diversity) with OPAL [Krempl et al., 2015b].
[Kottke et al., 2015] | online | any | OPAL+Budget | fixed
  Distinguishes spatial and temporal selection, combines percentile filtering with PAL.

Acronyms: Rnd = Random Sampling, US = Uncertainty Sampling, QbC = Query-by-Committee, MinVar = Minimum Variance

Page 37

Summary

- Using approaches like probabilistic active learning, AL improves learning efficiency
- Uncertainty sampling is problematic, as it ignores the uncertainty of the model itself
- Balancing exploration & exploitation is important, particularly in non-stationary environments
- Considering the true posterior in the expectation might also be beneficial outside probabilistic active learning

Open Issues

- Unified concept of "uncertainty" in AL
- Evaluation & performance bounds for AL in streams
- Budget management with unsupervised change detection
- Sample reusability [Tomanek and Morik, 2011] & AL combinations [Beyer et al., 2015]
- Use for other active learning problems (active class/feature selection)
- Workshop on Active Learning: Applications, Foundations and Emerging Trends: http://vincentlemaire-labs.fr/iknow2016/

Thank you for your attention! Questions?

Page 38

Bibliography I

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319–342.

Attenberg, J., Melville, P., Provost, F., and Saar-Tsechansky, M. (2011). Selective data acquisition for machine learning. In Krishnapuram, B., Yu, S., and Rao, R. B., editors, Cost-Sensitive Machine Learning. CRC Press, Inc., Boca Raton, FL, USA, 1st edition.

Beyer, C., Krempl, G., and Lemaire, V. (2015). How to select information that matters: A comparative study on active learning strategies for classification. In Proc. of the 15th Int. Conf. on Knowledge Technologies and Data-Driven Business (i-KNOW 2015), pages 2:1–2:8. ACM.

Chakraborty, S., Balasubramanian, V., and Panchanathan, S. (2011). Optimal batch selection for active learning in multi-label classification. In Candan, K. S., Panchanathan, S., Prabhakaran, B., Sundaram, H., Feng, W.-C., and Sebe, N., editors, Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28 - December 1, 2011, pages 1413–1416. ACM.

Page 39

Bibliography II

Chakraborty, S., Balasubramanian, V., and Panchanathan, S. (2014). Adaptive batch mode active learning. IEEE Transactions on Neural Networks and Learning Systems, 26(8).

Chapelle, O. (2005). Active learning for Parzen window classifier. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 49–56.

Cheng, Y., Chen, Z., Liu, L., Wang, J., Agrawal, A., and Choudhary, A. (2013). Feedback-driven multiclass active learning for data streams. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13), pages 1311–1320.

Chu, W., Zinkevich, M., Li, L., Thomas, A., and Tseng, B. (2011). Unbiased online active learning in data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 195–203, New York, NY, USA. ACM.

Cohn, D. (2010). Active learning. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning, pages 10–14. Springer.

Page 40

Bibliography III

Cohn, D., Atlas, L., Ladner, R., El-Sharkawi, M., Marks, R., Aggoune, M., and Park, D. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145.

Evans, L. P., Adams, N. M., and Anagnostopoulos, C. (2013). When does active learning work? In Tucker, A., Höppner, F., Siebes, A., and Swift, S., editors, Advances in Intelligent Data Analysis XII, 12th International Symposium, IDA 2013, London, UK, October 2013, volume 8207 of Lecture Notes in Computer Science, pages 174–185. Springer.

Fan, W., Huang, Y.-a., Wang, H., and Yu, P. S. (2004). Active mining of data streams. In Proceedings of the 4th SIAM International Conference on Data Mining, SDM 2004, USA, pages 457–461.

Fedorov, V. V. (1972). Theory of Optimal Experiments Design. Academic Press.

Page 41

Bibliography IV

Fu, Y., Zhu, X., and Li, B. (2012). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249–283.

Gama, J., Sebastiao, R., and Rodrigues, P. P. (2013). On evaluating stream learning algorithms. Machine Learning, 90:317–346.

Guyon, I., Cawley, G., Dror, G., Lemaire, V., and Statnikov, A., editors (2011). Active Learning Challenge, volume 6 of Challenges in Machine Learning. Microtome Publishing.

Huang, S. and Dong, Y. (2007). An active learning system for mining time-changing data streams. Intelligent Data Analysis, 11(4):401–419.

Ienco, D., Bifet, A., Zliobaite, I., and Pfahringer, B. (2013). Clustering based active learning for evolving data streams. In Fürnkranz, J., Hüllermeier, E., and Higuchi, T., editors, Proceedings of the 16th Int. Conf. on Discovery Science (DS), Singapore, volume 8140 of Lecture Notes in Artificial Intelligence, pages 79–93. Springer.

Page 42

Bibliography V

Ienco, D., Pfahringer, B., and Zliobaite, I. (2014). High density-focused uncertainty sampling for active learning over evolving stream data. In Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 133–148.

Kottke, D., Krempl, G., and Spiliopoulou, M. (2015). Probabilistic active learning in data streams. In Fromont, E., Bie, T. D., and Leeuwen, M. v., editors, Advances in Intelligent Data Analysis XIV, 14th Int. Symposium, IDA 2015, St. Etienne, France, volume 9385 of Lecture Notes in Computer Science, pages 145–157. Springer.

Krempl, G., Ha, T. C., and Spiliopoulou, M. (2015a). Clustering-based optimised probabilistic active learning (COPAL). In Japkowicz, N. and Matwin, S., editors, Proc. of the 18th Int. Conf. on Discovery Science (DS 2015), volume 9356 of Lecture Notes in Computer Science, pages 101–115. Springer.

Krempl, G., Kottke, D., and Lemaire, V. (2015b). Optimised probabilistic active learning (OPAL) for fast, non-myopic, cost-sensitive active classification. Machine Learning, 100(2). Special Issue of ECML PKDD 2015.

Page 43

Bibliography VI

Krempl, G., Kottke, D., and Spiliopoulou, M. (2014). Probabilistic active learning: Towards combining versatility, optimality and efficiency. In Dzeroski, S., Panov, P., Kocev, D., and Todorovski, L., editors, Proceedings of the 17th Int. Conf. on Discovery Science (DS), Bled, volume 8777 of Lecture Notes in Computer Science, pages 168–179. Springer.

Lindstrom, P., Delany, S., and Mac Namee, B. (2010). Handling concept drift in a text data stream constrained by high labelling cost. In Proceedings of the 23rd Int. Florida Artificial Intelligence Research Society Conference (FLAIRS 2010), pages 32–37.

Liu, W. and Wang, T. (2011). Online active multi-field learning for efficient email spam filtering. Knowledge and Information Systems. Online First.

Masud, M. M., Gao, J., Khan, L., Han, J., and Thuraisingham, B. (2010). Classification and novel class detection in data streams with active mining. In Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Volume Part II, PAKDD 2010, pages 311–324, Berlin, Heidelberg. Springer-Verlag.

Page 44

Bibliography VII

Osugi, T., Kim, D., and Scott, S. (2005). Balancing exploration and exploitation: A new algorithm for active machine learning. In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE.

Reitmaier, T. and Sick, B. (2013). Let us know your decision: Pool-based active training of a generative classifier with the selection strategy 4DS. Information Sciences, 230:106–131.

Roy, N. and McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proc. of the 18th Int. Conf. on Machine Learning, ICML 2001, Williamstown, MA, USA, pages 441–448, San Francisco, CA, USA. Morgan Kaufmann.

Ruff, R. A. and Dietterich, T. (1989). What good are experiments? In Proc. of the Sixth Int. Workshop on Machine Learning.

Ryu, J. W., Kantardzic, M. M., Kim, M.-W., and Khil, A. R. (2012). An efficient method of building an ensemble of classifiers in streaming data. In Big Data Analytics, pages 122–133. Springer.

Page 45

Bibliography VIII

Senge, R., Bosner, S., Dembczynski, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., and Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, Madison, Wisconsin, USA.

Settles, B. (2012). Active Learning. Number 18 in Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers.

Seung, H. S., Opper, M., and Sompolinsky, H. (1992). Query by committee. In M.K., W. and L.G., V., editors, Proc. of the Fifth Workshop on Computational Learning Theory. Morgan Kaufmann.

Page 46

Bibliography IX

Shaker, A. and Hüllermeier, E. (2013). Recovery analysis for adaptive learning from non-stationary data streams. In Proceedings of the CORES 2013 Special Session on Data Stream Classification and Big Data Analytics, volume 226 of Advances in Intelligent and Soft Computing, pages 289–298. Springer.

Tomanek, K. and Morik, K. (2011). Inspecting sample reusability for active learning. In Guyon, I., Cawley, G. C., Dror, G., Lemaire, V., and Statnikov, A. R., editors, AISTATS Workshop on Active Learning and Experimental Design, volume 16 of JMLR Proceedings, pages 169–181. JMLR.org.

Weigl, E., Heidl, W., Lughofer, E., Radauer, T., and Eitzinger, C. (2015). On improving performance of surface inspection systems by online active learning and flexible classifier updates. Machine Vision and Applications, 27(1):103–127.

Yang, L. (2011). Active learning with a drifting distribution. Neural Information Processing Systems.

Zhao, Y., Yang, G., Xu, X., and Ji, Q. (2012). A near-optimal non-myopic active learning method. In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, pages 1715–1718. IEEE.

Page 47

Bibliography X

Zhu, X., Zhang, P., Lin, X., and Shi, Y. (2007). Active learning from data streams. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 757–762, Washington, DC, USA. IEEE Computer Society.

Zhu, X., Zhang, P., Lin, X., and Shi, Y. (2010). Active learning from stream data using optimal weight classifier ensemble. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(6):1607–1621.

Zliobaite, I., Bifet, A., Pfahringer, B., and Holmes, G. (2011). Active learning with evolving streaming data. In Proceedings of the 21st European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD'11), volume 6913 of Lecture Notes in Computer Science, pages 597–612. Springer.

Zliobaite, I., Bifet, A., Pfahringer, B., and Holmes, G. (2013). Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):27–39.
