
Graz, the 18th of October 2016

Active Learning: Applications, Foundations & Emerging Trends

Workshop & Tutorial at IKNOW 2016

Daniel Kottke¹, Georg Krempl², Vincent Lemaire³, Edwin Lughofer⁴

¹ University of Kassel, Kassel, Germany
² Otto-von-Guericke University Magdeburg, Germany
³ Orange Labs, Lannion, France
⁴ Johannes Kepler University, Linz, Austria


Schedule

Morning Session

10:30-12:30 Tutorial and Discussion

Afternoon Session

14:00-14:20 MapView: Graphical Data Representation for Active Learning by E. Weigl, A. Walch, U. Neissl, P. Meyer-Heye, Th. Radauer, E. Lughofer, W. Heidl and Ch. Eitzinger

14:20-14:40 Active Learning with SVM for Land Cover Classification - What Can Go Wrong? by S. Wuttke, W. Middelmann and U. Stilla

14:40-15:00 Dynamic Parameter Adaptation of SVM Based Active Learning Methodology by J. Smailović, M. Grčar, N. Lavrač and M. Žnidaršič

15:00-15:20 Investigating Exploratory Capabilities of Uncertainty Sampling using SVMs in Active Learning by D. Lang, D. Kottke, G. Krempl and M. Spiliopoulou

15:20-15:40 Active Subtopic Detection in Multitopic Data by B. Bergner and G. Krempl

15:40-16:00 Closing


Part 1: Introduction

- Motivation, Task & Scenarios
- Selected Approaches
  - Version Space Partitioning & Query by Committee
  - Uncertainty Sampling
  - Expected Error Reduction
  - Probabilistic Active Learning
- From Pools to Evolving Streams
- A First Summary

Part 2: Online Active Learning & Applications (presented by Edwin Lughofer)

Part 3: Evaluation in Active Learning (presented by Daniel Kottke)


Motivating Applications

Credit Scoring & Fraud Detection

- predict from revenue data whether a client will pay or default
- predict whether a credit card transaction is fraudulent or legitimate
- relevant e.g. for banks or e-commerce companies

Brain-Computer Interfaces

- predict from EEG patterns the action the user desires
- relevant e.g. for intelligent prostheses

Historical Map Annotation

- identify the annotations in historical maps from scanned pixel data


Motivating Applications

(Supervised) Machine Learning Tasks

- Historical data, e.g. previous clients' records
- Generate a training sample with explanatory variables (e.g. profit) and a class label (e.g. default)
- Estimate distributions: the joint distribution d(x, y) or the posterior distribution d(y|x) = d(x, y) / d(x)
- Derive the decision boundary at the intersections of the posterior distributions
- Make automated predictions for new instances, e.g. predict a new client's class label
- Done! (?)


Motivating Applications

Challenge

- Some labels are expensive
- Labelling all historical instances might be impossible

Exemplary Applications

- Credit Scoring & Fraud Detection: e.g. costly to accept high-risk clients for model building; impossible to investigate all credit card transactions
- Brain-Computer Interfaces: e.g. performing calibration tasks can be tedious for the user
- Historical Map Annotation: e.g. the domain expert might be expensive or have limited time


Motivation

Big Data, but . . .

- The expert's time is scarce
- Storage & processing capacities are limited

Selection is important

- Efficient allocation of limited resources
- Sample where we expect something interesting


Active Learning¹

Setting

- Some information is costly (some is not)
- The active learner controls the selection process

Objective

- Select the most valuable information
- Baseline: random selection

Historical Remarks

- Optimal experimental design [Fedorov, 1972]
- Learning with queries / query synthesis [Angluin, 1988]
- Selective sampling [Cohn et al., 1990]

¹ See e.g. [Settles, 2012, Cohn, 2010].


Selective Data Acquisition Tasks²

Active Learning Scenarios

- Query synthesis: an example is generated upon query
- Pool U of unlabelled data: static, repeated access
- Stream: sequential arrival, no repeated access

Type of Selected Information

- Active label acquisition
- Active feature (value) acquisition
- Active class selection, also denoted active class-conditional example acquisition
- . . .

[Figure: a data stream of instances x1, ..., x5 with labels y1, ..., y5 arriving over time]

² Own categorization, inspired by [Attenberg et al., 2011, Saar-Tsechansky et al., 2009, Settles, 2009].


Overview of Active Learning Strategies

Selected Active Learning Strategies³

- Version Space Partitioning & Query by Committee
- Uncertainty Sampling
- Decision-Theoretic Approaches
- Loss Minimisation: Expected Error & Variance Reduction
- Probabilistic Active Learning

³ Generic, i.e. usable with different classifier technologies.


Version Space Partitioning⁴

- Version Space Partitioning [Ruff and Dietterich, 1989]: selection based on the disagreement between hypotheses
- Query by Committee [Seung et al., 1992] (sketched in code below):
  - Disagreement within an ensemble of classifiers
  - Requires constructing a diverse ensemble of classifiers
  - Combinations with clustering (mixture models)

[Figure: two classifiers in feature space (x1, x2); candidates in their region of disagreement are queried]

⁴ See [Ruff and Dietterich, 1989].
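As an illustration only, here is a minimal Query-by-Committee sketch using vote entropy as the disagreement measure. The committee construction via bagging, the helper name vote_entropy, and the demo data are assumptions for this sketch, not part of the tutorial:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_lab, X_pool = rng.randn(20, 2), rng.randn(100, 2)   # toy labelled set and pool
y_lab = (X_lab[:, 0] > 0).astype(int)

# A diverse committee via bagging (any ensemble construction works).
committee = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5,
                              random_state=0).fit(X_lab, y_lab)

def vote_entropy(committee, X_pool):
    """Disagreement as the entropy of the committee's vote distribution."""
    votes = np.stack([m.predict(X_pool) for m in committee.estimators_])
    scores = np.zeros(X_pool.shape[0])
    for c in np.unique(votes):
        frac = (votes == c).mean(axis=0)              # share of votes for class c
        scores -= np.where(frac > 0, frac * np.log(frac), 0.0)
    return scores

query_idx = np.argmax(vote_entropy(committee, X_pool))  # most disputed candidate
```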


Uncertainty Sampling⁶

- Information-theoretic approach
- Uses the classifier's uncertainty as a proxy
- Common uncertainty measures⁵ (sketched in code below):
  - Posterior-based:
    Confidence: |P(y = +|x) − P(y = −|x)|
    Entropy: −Σ_{y∈{+,−}} P(y|x) · log P(y|x)
  - Margin: distance to the decision boundary
- Fast: O(|U|), where U is the set of unlabelled instances
- But do these measures really capture the uncertainty?

⁵ See e.g. [Settles, 2012]. ⁶ See [Roy and McCallum, 2001].
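A minimal sketch of the posterior-based measures, assuming binary posteriors; the helper name and example values are illustrative:

```python
import numpy as np

def uncertainty_scores(p_pos: np.ndarray) -> dict:
    """Posterior-based uncertainty measures; p_pos holds P(y = +|x) per candidate."""
    p_neg = 1.0 - p_pos
    confidence = np.abs(p_pos - p_neg)              # small value = uncertain
    eps = 1e-12                                      # avoid log(0)
    entropy = -(p_pos * np.log(p_pos + eps) + p_neg * np.log(p_neg + eps))
    return {"confidence": confidence, "entropy": entropy}

# Query the least confident (equivalently, highest-entropy) candidate.
scores = uncertainty_scores(np.array([0.9, 0.55, 0.2]))
query_idx = np.argmin(scores["confidence"])          # -> 1, the most uncertain one
```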


Exemplary AL Situations

[Figure: four exemplary AL situations (I-IV) in a candidate's neighbourhood, arranged by the number of labels n (low vs. high) and the observed distribution of labels p̂ (uniform vs. non-uniform)]

- a label's value depends on the label information in its neighbourhood
- label information:
  - number of labels
  - share of classes
- uncertainty sampling ignores the number of similar labels


Measuring the Uncertainty

Problems with the above measures

- They focus on exploitation and fail at exploration [Beyer et al., 2015]
- "Uncertainty" measures ignore the uncertainty of the prediction model; compare epistemic vs. aleatoric uncertainty in [Senge et al., 2014]

Extensions: Combined measures

- [Fu et al., 2012]:
  - uncertainty
  - instance correlation (within a batch)
- [Reitmaier and Sick, 2013], the 4DS approach, considering:
  - distance to the decision boundary
  - diversity of samples in the query set
  - density
  - class prior
- [Zliobaite et al., 2013]:
  - uncertainty sampling combined with
  - randomisation for better exploration
- [Weigl et al., 2015]:
  - conflict: overlap of opposing classes
  - ignorance: proximity of the nearest decision boundary


Decision-Theoretic Approaches

Expected Error Reduction [Cohn et al., 1996, Roy and McCallum, 2001]

- Aim: minimise the error after selection & retraining
- Model the unknown label realisation as a random variable:

  x* = argmin_x E_{y|L} [ Σ_{x'∈U} E_{y'|L'=L∪{(x,y)}} [ y' ≠ ŷ' ] ]

- Better results reported than for uncertainty sampling [Settles, 2012]
- Relies on a maximum-likelihood posterior estimate [Chapelle, 2005]
- Performance estimation relies on an evaluation set (using L or by self-labelling U)
- High computational complexity: O(|U|²), mirrored in the sketch below
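To make the structure concrete, a minimal pool-based sketch of expected error reduction, assuming a scikit-learn-style probabilistic classifier; the function name and demo data are illustrative:

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def expected_error_reduction(clf, X_lab, y_lab, X_pool):
    """For each candidate x: take the expectation over its unknown label y,
    retrain on L ∪ {(x, y)}, and estimate the error on the pool U by
    self-labelling (expected 0/1 error is 1 - max posterior per instance).
    The double loop over U gives the O(|U|^2) cost noted above."""
    best_idx, best_risk = None, np.inf
    for i, x in enumerate(X_pool):
        p_y = clf.predict_proba(x.reshape(1, -1))[0]        # P(y | x, L)
        risk = 0.0
        for y, p in zip(clf.classes_, p_y):                  # label realisations
            model = clone(clf).fit(np.vstack([X_lab, x]), np.append(y_lab, y))
            proba = model.predict_proba(X_pool)
            risk += p * np.sum(1.0 - proba.max(axis=1))      # expected error on U
        if risk < best_risk:
            best_idx, best_risk = i, risk
    return best_idx

rng = np.random.RandomState(0)
X_lab, X_pool = rng.randn(10, 2), rng.randn(30, 2)
y_lab = (X_lab[:, 0] > 0).astype(int)
clf = GaussianNB().fit(X_lab, y_lab)
print(expected_error_reduction(clf, X_lab, y_lab, X_pool))
```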


Probabilistic Active Learning

Motivation

- Given a dataset with a set of labelled instances L and a pool of unlabelled instances U containing a candidate x
- The true posterior in a candidate's neighbourhood is unknown: explicitly model the uncertainty associated with the posterior value. The expectation is taken not only over the candidate instance's label realisation y, but also over the true posterior p in its neighbourhood:

  pgain(x) = E_p [ E_{y|p} [ performancegain_p(L ∪ {(x, y)}) ] ]

- The impact of a label is largest in its direct neighbourhood: evaluate the change in classification performance only therein


Probabilistic Active Learning

Limitations

- Separates classifier and active selector (similar to uncertainty sampling)
- Depends on an appropriate neighbourhood definition and on probabilistic estimates for ls = (n, p̂)
- Performance gain is approximated within the neighbourhood (evaluating globally is possible, but computationally costly)

References

- Implementations in Java, Python, and MATLAB are available (open source) at http://kmd.cs.ovgu.de/res/opal/
- Probabilistic Active Learning (PAL). Krempl, Kottke, Spiliopoulou. Discovery Science 2014.
- Optimised Probabilistic Active Learning (OPAL). Krempl, Kottke, Lemaire. Machine Learning 100(2), 2015.
- Multi-Class Probabilistic Active Learning (McPAL). Kottke, Krempl, Spiliopoulou. ECAI 2016.


Probabilistic Active Learning in a Nutshell

Illustrative Example

[Figure: dataset with two labelled instances (one -, one +), several unlabelled instances, and a candidate marked ?]

- Given: a dataset with labelled (-/+) and unlabelled instances
- Objective: determine the expected gain of labelling e.g. the candidate ?
- What label information do we have already? Summarise the label information in its neighbourhood, for example by using a probabilistic classifier, kernel frequency estimates, label counts, . . .
  - Number of labels: n = 2
  - Share of positives therein (i.e. the posterior estimate): p̂ = 1/2
- Summarise as label statistics: ls = (n = 2, p̂ = 0.5), as in the sketch below
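A minimal sketch of one of the options named above, kernel frequency estimates; the bandwidth, the helper name, and the demo data are assumptions for illustration:

```python
import numpy as np

def label_statistics(x_cand, X_lab, y_lab, bandwidth=1.0):
    """Label statistics ls = (n, p_hat) in the candidate's neighbourhood,
    via Gaussian kernel frequency estimates (binary labels in {0, 1})."""
    d2 = np.sum((X_lab - x_cand) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # kernel weight per labelled point
    n = w.sum()                                 # (weighted) number of labels
    p_hat = (w * y_lab).sum() / n if n > 0 else 0.5
    return n, p_hat

# Two labelled points close to the candidate, one per class -> ls ≈ (2, 0.5)
X_lab = np.array([[0.0, 0.1], [0.1, 0.0]])
y_lab = np.array([1, 0])
print(label_statistics(np.zeros(2), X_lab, y_lab))
```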


Probabilistic Active Learning

Probabilistic Gain⁷

pgain(ls) = E_p [ E_{y|p} [ gain_p(ls, y) ] ]
          = ∫₀¹ Beta_{α,β}(p) · Σ_{y∈{0,1}} Ber_p(y) · gain_p(ls, y) dp

with:
- ls = (n, p̂): label statistics
- y: the candidate's label realisation
- p: the true posterior at the candidate's position

- This probabilistic gain quantifies
  - the expected change in classification performance
  - at the candidate's position in feature space,
  - in each and every future classification there,
  - given that one additional label is acquired.
- Weight pgain with the density d_x over labelled and unlabelled data at the candidate's position.
- Select the candidate with the highest density-weighted probabilistic gain.

⁷ See [Krempl et al., 2014].
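A numerical sketch of this double expectation, assuming accuracy as the performance measure (the following slides use misclassification loss instead); grid size, names, and tie-breaking toward the negative class are illustrative choices:

```python
import numpy as np
from scipy import stats

def perf(p, p_hat):
    """Point accuracy when predicting the majority class according to p_hat."""
    return np.where(p_hat > 0.5, p, 1.0 - p)   # ties broken toward the negative class

def pgain(n, p_hat, grid=2000):
    """Probabilistic gain: outer expectation over p ~ Beta(alpha, beta) with
    alpha = n*p_hat + 1 and beta = n*(1 - p_hat) + 1 (counts plus one),
    inner expectation over the label realisation y ~ Bernoulli(p)."""
    alpha, beta = n * p_hat + 1.0, n * (1.0 - p_hat) + 1.0
    ps = (np.arange(grid) + 0.5) / grid              # midpoint grid on (0, 1)
    dens = stats.beta.pdf(ps, alpha, beta)
    gain = np.zeros_like(ps)
    for y in (0, 1):
        p_new = (n * p_hat + y) / (n + 1.0)          # updated estimate after label y
        lik = ps if y == 1 else 1.0 - ps             # Ber_p(y)
        gain += lik * (perf(ps, p_new) - perf(ps, p_hat))
    return float(np.sum(dens * gain) / grid)

print(pgain(n=2, p_hat=0.5))   # the candidate from the illustrative example -> 0.1
```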


Probabilistic Active Learning – Interpretation

[Figure: normalised likelihood of the true posterior p for different label statistics: no labels (n=0), few labels (n=2, p̂=0.5), more labels (n=3, p̂=2/3), many labels (n=11, p̂=10/11); peaks at 0.5, 0.67, and 0.91]

- Uniform prior: prior to the first label's arrival, all values of p are assumed equally plausible.
- A Bayesian approach yields a normalised likelihood that corresponds to a Beta distribution with parameters:
  α = number of positive labels plus one
  β = number of negative labels plus one
- Above: plot of the normalised likelihoods for different values of α, β (reproduced in the sketch below)
- The peak of this function becomes the more distinct, the more labels are obtained.
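A small sketch that reproduces the figure from the label statistics shown in its legend; plotting style is an assumption:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

ps = np.linspace(0, 1, 200)
for n, p_hat, label in [(0, 0.5, "no labels"), (2, 0.5, "few labels"),
                        (3, 2/3, "more labels"), (11, 10/11, "many labels")]:
    a, b = n * p_hat + 1, n * (1 - p_hat) + 1    # alpha/beta = counts plus one
    plt.plot(ps, stats.beta.pdf(ps, a, b), label=f"{label} (n={n})")
plt.xlabel("true posterior p")
plt.ylabel("normalised likelihood")
plt.legend()
plt.show()
```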


Non-Myopic Extension of PAL

Myopic Probabilistic Gain

pgain(ls) = E_p [ E_{y|p} [ gain_p(ls, y) ] ]
          = ∫₀¹ Beta_{α,β}(p) · Σ_{y∈{0,1}} Ber_p(y) · gain_p(ls, y) dp

with:
- ls = (n, p̂): label statistics
- y: the candidate's label realisation
- p: the true posterior at the candidate's position

Non-Myopic Extension

- Not a single label is purchased in the future, but a set of labels according to a given budget m
- We need to optimise the performance gain when acquiring this set of labels!
- Brute-force approach: calculate the gain for all label combinations
- But: the ordering (of arrival) is irrelevant (in pools); it suffices to consider the varying number k of positives among the m acquired labels


Non-Myopic Probabilistic Gain

G_OPAL(ls, m) = (1/m) · E_p [ E_k [ gain_p(ls, k, m) ] ]
              = (1/m) · ∫₀¹ Beta_{α,β}(p) · Σ_{0≤k≤m} Bin_{m,p}(k) · gain_p(ls, k, m) dp

with:
- ls = (n, p̂): label statistics
- p: the true posterior at the candidate's position
- m: number of candidates to be acquired (budget)
- k: number of candidates with positive label realisations

- with the performance gain as the difference between future and current performance:

  gain_p(ls, k, m) = perf_p( (n·p̂ + k) / (n + m) ) − perf_p(p̂)
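Extending the earlier numerical sketch from one label to a budget of m labels, with k ~ Binomial(m, p) positives; as before, accuracy stands in for the performance measure and all names are illustrative:

```python
import numpy as np
from scipy import stats

def perf(p, p_hat):
    return np.where(p_hat > 0.5, p, 1.0 - p)   # accuracy variant, for illustration

def gopal_numeric(n, p_hat, m, grid=2000):
    """Non-myopic probabilistic gain: average over k ~ Bin(m, p) positives
    among the m acquired labels, then over p ~ Beta(alpha, beta)."""
    alpha, beta = n * p_hat + 1.0, n * (1.0 - p_hat) + 1.0
    ps = (np.arange(grid) + 0.5) / grid
    dens = stats.beta.pdf(ps, alpha, beta)
    inner = np.zeros_like(ps)
    for k in range(m + 1):
        p_new = (n * p_hat + k) / (n + m)               # future posterior estimate
        w = stats.binom.pmf(k, m, ps)                   # Bin_{m,p}(k), per grid point
        inner += w * (perf(ps, p_new) - perf(ps, p_hat))
    return float(np.sum(dens * inner) / grid) / m       # averaged per label

print(gopal_numeric(n=2, p_hat=0.5, m=3))
```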


Cost-Sensitive Classification

Given a situation with

- p ∈ [0, 1]: the true posterior probability of the positive class in a neighbourhood
- q ∈ [0, 1]: the share of instances therein that are classified as positive
- cost_FP = τ ∈ [0, 1]: the cost of each false-positive classification
- misclassification loss as the performance measure

Resulting Cost-Optimal Classification

q* = { 0       if p < τ
     { 1 − τ   if p = τ      (1)
     { 1       if p > τ

Misclassification Loss under Cost-Optimal Classification

perf_{p,τ}(p) = −ML_{p,τ}(p) = − { p · (1 − τ)   if p < τ
                                 { τ · (1 − τ)   if p = τ      (2)
                                 { τ · (1 − p)   if p > τ
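A small sketch of equations (1) and (2); the generalisation to an estimate p_hat is what the gain on the previous slide plugs in via (n·p̂ + k)/(n + m). Function names are illustrative:

```python
def q_star(p, tau):
    """Cost-optimal share of positive classifications, eq. (1)."""
    return 0.0 if p < tau else (1.0 if p > tau else 1.0 - tau)

def ml(p, tau, p_hat=None):
    """Misclassification loss ML_{p,tau} when classifying according to the
    estimate p_hat (defaults to the true posterior, which yields eq. (2)):
    false negatives cost (1 - tau), false positives cost tau."""
    q = q_star(p if p_hat is None else p_hat, tau)
    return (1.0 - q) * p * (1.0 - tau) + q * (1.0 - p) * tau

print(ml(0.3, 0.5), ml(0.7, 0.5))   # symmetric costs: 0.15 each
```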


Fast Closed-Form Solution

Non-Myopic, Cost-Sensitive Probabilistic Gain

- Combining misclassification loss as the performance measure with
- the non-myopic probabilistic gain yields the
- probabilistic misclassification loss reduction:

G_OPAL(ls, τ, m) = (1/m) · ∫₀¹ Beta_{α,β}(p) · Σ_{k=0}^{m} Bin_{m,p}(k) · ( ML_{p,τ}(p̂) − ML_{p,τ}( (n·p̂ + k) / (n + m) ) ) dp

Closed-Form Solution

G_OPAL(n, p̂, τ, m) = ((n + 1) / m) · C(n, n·p̂) · ( IML(n, p̂, τ, 0, 0) − Σ_{k=0}^{m} IML(n, p̂, τ, m, k) )

IML(n, p̂, τ, m, k) = C(m, k) ·
  { (1 − τ) · Γ(1 − k + m + n − n·p̂) · Γ(2 + k + n·p̂) / Γ(3 + m + n)    if (n·p̂ + k)/(n + m) < τ
  { (τ − τ²) · Γ(1 − k + m + n − n·p̂) · Γ(1 + k + n·p̂) / Γ(2 + m + n)   if (n·p̂ + k)/(n + m) = τ
  { τ · Γ(2 − k + m + n − n·p̂) · Γ(1 + k + n·p̂) / Γ(3 + m + n)          if (n·p̂ + k)/(n + m) > τ

(C(·,·) denotes the binomial coefficient; n·p̂ is the number of positive labels)
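The open-source implementations linked on the Limitations slide are the reference; purely as an illustration, the closed form above might be evaluated in a numerically stable way via log-Gamma functions. A sketch, assuming n·p̂ is the integer count of positive labels:

```python
import numpy as np
from math import comb
from scipy.special import gammaln

def iml(n, np_hat, tau, m, k):
    """One IML term of the closed form; np_hat = n * p_hat (integer count).
    The equality branch is rarely hit with floating-point arguments."""
    p_new = (np_hat + k) / (n + m) if (n + m) > 0 else 0.5
    if p_new < tau:
        val = (np.log(1 - tau) + gammaln(1 - k + m + n - np_hat)
               + gammaln(2 + k + np_hat) - gammaln(3 + m + n))
    elif p_new > tau:
        val = (np.log(tau) + gammaln(2 - k + m + n - np_hat)
               + gammaln(1 + k + np_hat) - gammaln(3 + m + n))
    else:
        val = (np.log(tau - tau ** 2) + gammaln(1 - k + m + n - np_hat)
               + gammaln(1 + k + np_hat) - gammaln(2 + m + n))
    return comb(m, k) * np.exp(val)

def gopal(n, p_hat, tau, m):
    """Closed-form G_OPAL as stated on the slide."""
    np_hat = round(n * p_hat)                  # number of positive labels
    pre = (n + 1) / m * comb(n, np_hat)
    return pre * (iml(n, np_hat, tau, 0, 0)
                  - sum(iml(n, np_hat, tau, m, k) for k in range(m + 1)))

print(gopal(n=2, p_hat=0.5, tau=0.5, m=3))
```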


Probabilistic Gain – G_OPAL

[Figure: the probabilistic gain for equal misclassification costs, plotted as a surface over the observed posterior p̂ (0 to 1) and the knowledge n (0 to 5), with gain values between 0 and 0.2]

- The probabilistic gain in accuracy as a function of ls = (n, p̂) is
  - monotone in n,
  - symmetric with respect to p̂ = 0.5,
  - zero for irrelevant candidates.
- Compare to uncertainty (in confidence), which is constant w.r.t. n:

[Figure: uncertainty as a function of the posterior alone, peaking at a posterior of 0.5]

- Unequal misclassification costs: asymmetric, as sampling from the "cheaper" class is preferred to avoid potentially costly errors


Evaluation – Setup

Experimental Setup

- OPAL compared against its myopic, cost-sensitive PAL counterpart (csPAL) and: Uncertainty Sampling without (U.S.) and with self-training (U.S. st), Certainty Sampling (C.S.), Expected Error Reduction with a Beta prior (Chap) or a cost-sensitive extension (Marg), and non-myopic expected entropy reduction (Zhao)
- Same classifier (Parzen window classifier with Gaussian kernels),
- implemented in MATLAB and run on the same platform,
- with the same (dataset-specific, pre-tuned) bandwidth parameter,
- on several synthetic and real-world data sets,
- using cross-validation (100 random permutations),
- reporting learning curves as the arithmetic mean of the misclassification loss, and wins at learning steps.
- More results are on our website: http://kmd.cs.ovgu.de/res/opal/


Overall Classification Performance

OPAL vs. competitor, 20 labels acquired:

τ∗      csPAL   U.S.   U.S. st   C.S.   Marg¹   Chap¹   Zhao¹   Rand
0.10    47%     62%∗   70%∗      72%∗   66%∗    56%∗    72%∗    62%∗
0.25    51%∗    63%∗   75%∗      88%∗   81%∗    62%∗    70%∗    65%∗
0.50     1%     64%∗   72%∗      92%∗   87%∗    63%∗    69%∗    68%∗
0.75    53%∗    60%∗   67%∗      86%∗   80%∗    50%∗    48%∗    58%∗
0.90    42%     61%∗   66%∗      77%∗   75%∗    53%∗    57%∗    62%∗

OPAL vs. competitor, 40 labels acquired:

τ∗      csPAL   U.S.   U.S. st   C.S.   Marg¹   Chap¹   Zhao¹   Rand
0.10    43%     55%∗   71%∗      75%∗   69%∗    62%∗    69%∗    57%∗
0.25    56%∗    59%∗   73%∗      89%∗   79%∗    65%∗    69%∗    58%∗
0.50     4%     61%∗   72%∗      93%∗   89%∗    74%∗    76%∗    62%∗
0.75    57%∗    64%∗   71%∗      90%∗   81%∗    59%∗    56%∗    54%∗
0.90    46%     55%∗   63%∗      82%∗   77%∗    57%∗    64%∗    56%∗

Table: Percentages of runs over all data sets where OPAL performs better than its competitor. Significantly better performance is denoted by ∗, significantly worse performance by †. The significance level in the one-sided Wilcoxon signed-rank test was 0.001 for both. Algorithms are marked with ¹ if not every data set could be used in the evaluation, due to their long execution time.


Multi-Class Extension (McPAL)⁸

Motivation

- Many applications involve multinomial (rather than binary) labels (i.e. C > 2)

Task & Notation

- As before, a pool of labelled (x⃗, y) ∈ L and unlabelled (x⃗, ·) ∈ U instances
- Multi-class (C > 2, not binary) classification: y ∼ Multinomial(p⃗), with
  - the instance's feature vector x⃗,
  - the instance's true posterior vector p⃗ = (p_1, . . . , p_C),
  - the instance's label statistics k⃗ = (k_1, . . . , k_C),
  - a realisation of m ≤ M additional labels: l⃗ = (l_1, . . . , l_C) ∈ N^C, s.t. Σ l_i = m

Our Contributions

- Modelling as a probabilistic active learning problem and deriving a closed-form solution
- Identification & evaluation of three influence factors

⁸ Kottke, Krempl, Lang, Teschner, Spiliopoulou, ECAI, 2016.


Multi-Class Extension (McPAL): Selection Score

alScore(x⃗ | L, U) = P(x⃗ | L ∪ U) · perfGain( cl(x⃗ | L) )      (3)
                     [impact]       [posterior & reliability]

perfGain(k⃗) = max_{m≤M} (1/m) · ( expPerf(k⃗, m) − expPerf(k⃗, 0) )      (4)
                                  [new perf.]     [curr. perf.]

expPerf(k⃗, m) = E_p⃗ [ E_l⃗ [ perf( k⃗ + l⃗ | p⃗ ) ] ]      (5)

= Σ_l⃗ [ ∏_{j=Σ(k_i+1)}^{(Σ(k_i+l_i+d_i+1))−1} (1/j) ] · [ ∏_i ∏_{j=k_i+1}^{k_i+l_i+d_i} j ] · Γ((Σ l_i) + 1) / ∏ Γ(l_i + 1)      (6)

where l⃗ ∈ N^C is a label realisation (the index set of the sum in eq. (6), enumerated in the sketch below)
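A small sketch of the combinatorial ingredients of eq. (6): enumerating the label realisations l⃗ with Σ l_i = m, and the multinomial coefficient Γ((Σ l_i) + 1) / ∏ Γ(l_i + 1). The helper names are illustrative:

```python
from itertools import product
from math import factorial

def label_realisations(C, m):
    """All vectors l = (l_1, ..., l_C) of non-negative integers with sum(l) == m,
    i.e. the index set of the sum in eq. (6)."""
    for l in product(range(m + 1), repeat=C):
        if sum(l) == m:
            yield l

def multinomial_coeff(l):
    """Gamma((sum l_i) + 1) / prod Gamma(l_i + 1) = m! / (l_1! ... l_C!)."""
    out = factorial(sum(l))
    for li in l:
        out //= factorial(li)
    return out

for l in label_realisations(C=3, m=2):
    print(l, multinomial_coeff(l))   # e.g. (0, 1, 1) -> 2
```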


From Pools to Evolving Streams

[Figure: a data stream of instances x1, ..., x5 with labels y1, ..., y5 arriving over time]

Data Stream

- Instances arrive sequentially
- Possibly infinite number of instances
- Non-stationary distributions (drift)
- "Big Data" is often streaming data

General Challenges

- Adaptation to change
- Limited computational resources

Active Learning-Specific Challenges

- Budget management & change detection
- Evaluation & performance guarantees
- . . .


Classification in Evolving Data Streams

Chunk-Based Processing (Krempl, Ha, Spiliopoulou, DS, 2015)

- Clustering-based approach (COPAL)
- Diversity-maximising micro selection, and PAL-based macro selection
- Amnesic (COPAL-A) and incremental (COPAL-I) variants
- Experimental results: COPAL-I is better than COPAL-A; the quality of the clustering has a large impact on the results

One-by-One Processing (Kottke, Krempl, Spiliopoulou, IDA, 2015)

- Notion of temporal usefulness, complementing spatial usefulness
- Budget management: it is guaranteed that the budget restriction is met
- Temporal selection: Balanced Incremental Quantile Filter (BIQF)
- Spatial selection: Probabilistic Active Learning
- Experimental results: the combination of BIQF and PAL is best for small budgets (the quantile-filter idea is sketched below)
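An illustrative sketch of the temporal-selection idea behind BIQF, not the published algorithm: keep a sliding window of recent usefulness scores and acquire a label when the current score lies above the empirical (1 − budget) quantile of that window. The balancing mechanism that actually guarantees the budget restriction is omitted here:

```python
import bisect

class QuantileFilter:
    """Sliding-window quantile filter over usefulness scores (simplified)."""

    def __init__(self, budget=0.1, window=100):
        self.budget, self.window = budget, window
        self.history = []   # scores in arrival order (for eviction)
        self.ordered = []   # the same scores, kept sorted (for quantiles)

    def query(self, score):
        """Return True if a label should be acquired for the current instance."""
        if self.ordered:
            idx = min(int((1.0 - self.budget) * len(self.ordered)),
                      len(self.ordered) - 1)
            acquire = score >= self.ordered[idx]   # above the quantile threshold?
        else:
            acquire = True                          # no history yet: acquire
        bisect.insort(self.ordered, score)          # update the window
        self.history.append(score)
        if len(self.history) > self.window:
            self.ordered.remove(self.history.pop(0))
        return acquire

f = QuantileFilter(budget=0.2, window=50)
decisions = [f.query(s) for s in [0.3, 0.1, 0.5, 0.2, 0.6]]
```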


Summary of this part

- The active learning problem: applications where collecting ground truth (e.g. labels) is not possible for every single example.
  Efficient allocation of limited resources: sample where we expect something insightful.
- Different tasks and scenarios: query synthesis, pool-based or stream-based sampling; active acquisition of labels, features, or instances from specific classes, . . .
- Uncertainty Sampling & Expected Error Reduction sometimes perform poorly because they ignore other types of uncertainty.
  Use a combination with other measures, or a probabilistic approach:
- Probabilistic Active Learning: the expected gain in classification performance.
  Models the label realisation and the true posterior as random variables; considers the posterior estimate, its reliability, and the impact as influence factors.
  Decision-theoretic, non-myopic, cost-sensitive, multi-class; fast and competitive performance.


Thank you!

Questions?


Bibliography I

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319-342.

Attenberg, J., Melville, P., Provost, F., and Saar-Tsechansky, M. (2011). Selective data acquisition for machine learning. In Cost-Sensitive Machine Learning. CRC Press.

Beyer, C., Krempl, G., and Lemaire, V. (2015). How to select information that matters: A comparative study on active learning strategies for classification. In Proc. of the 15th Int. Conf. on Knowledge Technologies and Data-Driven Business (i-KNOW 2015), pages 2:1-2:8. ACM.

Chapelle, O. (2005). Active learning for Parzen window classifier. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 49-56.

Cohn, D. (2010). Active learning. In Sammut, C. and Webb, G. I., editors, Encyclopedia of Machine Learning, pages 10-14. Springer.


Bibliography II

Cohn, D., Atlas, L., Ladner, R., El-Sharkawi, M., Marks, R., Aggoune, M., and Park, D. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145.

Fedorov, V. V. (1972). Theory of Optimal Experiments Design. Academic Press.

Fu, Y., Zhu, X., and Li, B. (2012). A survey on instance selection for active learning. Knowledge and Information Systems, 35(2):249-283.

Krempl, G., Kottke, D., and Spiliopoulou, M. (2014). Probabilistic active learning: Towards combining versatility, optimality and efficiency. In Džeroski, S., Panov, P., Kocev, D., and Todorovski, L., editors, Proceedings of the 17th Int. Conf. on Discovery Science (DS), Bled, volume 8777 of Lecture Notes in Computer Science, pages 168-179. Springer.


Bibliography III

Reitmaier, T. and Sick, B. (2013). Let us know your decision: Pool-based active training of a generative classifier with the selection strategy 4DS. Information Sciences, 230:106-131.

Roy, N. and McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proc. of the 18th Int. Conf. on Machine Learning, ICML 2001, Williamstown, MA, USA, pages 441-448, San Francisco, CA, USA. Morgan Kaufmann.

Ruff, R. A. and Dietterich, T. (1989). What good are experiments? In Proc. of the Sixth Int. Workshop on Machine Learning.

Saar-Tsechansky, M., Melville, P., and Provost, F. (2009). Active feature-value acquisition. Management Science, 55(4):664-684.

Senge, R., Bösner, S., Dembczyński, K., Haasenritter, J., Hirsch, O., Donner-Banzhoff, N., and Hüllermeier, E. (2014). Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16-29.


Bibliography IV

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, Madison, Wisconsin, USA.

Settles, B. (2012). Active Learning. Number 18 in Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan and Claypool Publishers.

Seung, H. S., Opper, M., and Sompolinsky, H. (1992). Query by committee. In Warmuth, M. K. and Valiant, L. G., editors, Proc. of the Fifth Workshop on Computational Learning Theory. Morgan Kaufmann.

Weigl, E., Heidl, W., Lughofer, E., Radauer, T., and Eitzinger, C. (2015). On improving performance of surface inspection systems by online active learning and flexible classifier updates. Machine Vision and Applications, 27(1):103-127.

Zhao, Y., Yang, G., Xu, X., and Ji, Q. (2012). A near-optimal non-myopic active learning method. In Proceedings of the 21st International Conference on Pattern Recognition, ICPR 2012, Tsukuba, Japan, November 11-15, 2012, pages 1715-1718. IEEE.


Bibliography V

Žliobaitė, I., Bifet, A., Pfahringer, B., and Holmes, G. (2013). Active learning with drifting streaming data. IEEE Transactions on Neural Networks and Learning Systems, 25(1):27-39.
