
Page 1: Active Learning for Probabilistic Models

LARC-IMS Workshop

Active Learning for Probabilistic Models

Lee Wee Sun
Department of Computer Science
National University of Singapore

[email protected]

Page 2: Active Learning for Probabilistic Models


Probabilistic Models in Networked Environments

• Probabilistic graphical models are powerful tools in networked environments

• Example task: Given some labeled nodes, what are the labels of remaining nodes?

• May also need to learn parameters of model (later)

[Figure: labeling university web pages with a CRF; some nodes are labeled Faculty, Student, or Project, and the remaining nodes are marked ?]

Page 3: Active Learning for Probabilistic Models


Active Learning

• Given a budget of k queries, which nodes to query to maximize performance on remaining nodes?

• What are reasonable performance measures with provable guarantees for greedy methods?

[Figure: labeling university web pages with a CRF; labeled nodes (Faculty, Student, Project) and unlabeled nodes (?)]

Page 4: Active Learning for Probabilistic Models


Entropy

• First consider non-adaptive policies
• Chain rule of entropy: H(Y_G) = H(Y_1) + H(Y_2 | Y_1), where Y_1 is the set of selected variables and Y_2 the remaining (target) variables
• H(Y_G) is constant, so maximizing the entropy H(Y_1) of the selected variables minimizes the conditional entropy H(Y_2 | Y_1) of the target

Page 5: Active Learning for Probabilistic Models


• Greedy method
– Given the already selected set S, add the variable Y_i that maximizes H(Y_i | Y_S)
• Near optimality: H(Y_{S_k}) ≥ (1 - 1/e) max_{|S| ≤ k} H(Y_S), where S_k is the greedily selected set of size k, because of the submodularity of entropy.
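A minimal sketch of this greedy rule (not from the slides), assuming the joint distribution is given as an explicit table over a few binary variables; this is tractable only for toy cases, and a real graphical model would compute these entropies with probabilistic inference instead:

```python
# Greedy max-entropy selection on an explicit joint table.
import numpy as np

def entropy(p):
    """Shannon entropy of a distribution given as a 1-D array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(joint, i, S):
    """H(Y_i | Y_S) from the full joint table joint[y1, ..., yn] = P(Y = y)."""
    keep = sorted(S + [i])
    other = tuple(a for a in range(joint.ndim) if a not in keep)
    marg = joint.sum(axis=other)   # P(Y_i, Y_S)
    # H(Y_i | Y_S) = H(Y_i, Y_S) - H(Y_S)
    return entropy(marg.ravel()) - entropy(marg.sum(axis=keep.index(i)).ravel())

def greedy_max_entropy(joint, k):
    """Greedily add the variable with the largest conditional entropy."""
    S = []
    for _ in range(k):
        S.append(max((j for j in range(joint.ndim) if j not in S),
                     key=lambda j: conditional_entropy(joint, j, S)))
    return S

# Example: a random joint distribution over 4 binary variables.
rng = np.random.default_rng(0)
joint = rng.random((2,) * 4)
joint /= joint.sum()
print(greedy_max_entropy(joint, 2))
```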

Page 6: Active Learning for Probabilistic Models


Submodularity

• Diminishing return property: for a set function f, sets S ⊆ T, and element i ∉ T, f(S ∪ {i}) - f(S) ≥ f(T ∪ {i}) - f(T); see the check below
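A tiny numeric illustration (not from the slides) using set coverage, a standard submodular function: the marginal gain of an element added to a small set is at least its gain when added to a larger set.

```python
# Diminishing returns for the coverage function f(S) = |union of sets in S|.
def coverage(S, sets):
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
S, T, i = {0}, {0, 1}, 2                 # S is a subset of T, i not in T
gain_S = coverage(S | {i}, sets) - coverage(S, sets)   # gains {3, 4, 5}: 3
gain_T = coverage(T | {i}, sets) - coverage(T, sets)   # gains {4, 5}: 2
assert gain_S >= gain_T                  # f(S+i) - f(S) >= f(T+i) - f(T)
print(gain_S, gain_T)
```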

Page 7: Active Learning for Probabilistic Models


Adaptive Policy

• What about adaptive policies?

[Figure: a non-adaptive policy fixes a sequence of k queries in advance, while an adaptive policy is a tree that branches on each observed label]

Page 8: Active Learning for Probabilistic Models


• Let ρ be a path down the policy tree (the sequence of queried nodes and observed labels), and let the policy entropy be H(π) = -∑_ρ p(ρ) log p(ρ). Then we can show H(π) + ∑_ρ p(ρ) H(Y_G | ρ) = H(Y_G), where Y_G is the graph labeling
• Corresponds to the chain rule in the non-adaptive case: maximizing the policy entropy minimizes the expected conditional entropy of the labeling

Page 9: Active Learning for Probabilistic Models


• Recap: Greedy algorithm is near-optimal for non-adaptive case

• For adaptive case, consider greedy algorithm that selects the variable with the largest entropy conditioned on the observations

• Unfortunately, for the adaptive case, we can show that for every α > 0 there is a probabilistic model on which the greedy policy's entropy is less than α times the optimal policy entropy, so no constant factor approximation holds

Page 10: Active Learning for Probabilistic Models


Tsallis Entropy and Gibbs Error

• In statistical mechanics, Tsallis entropy is a generalization of Shannon entropy: S_q(p) = (1 - ∑_y p(y)^q) / (q - 1)
• Shannon entropy is the special case as q → 1
• We call the case q = 2, where S_2(p) = 1 - ∑_y p(y)^2, the Gibbs error
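A small numeric check of these definitions (a sketch, assuming natural log for Shannon entropy):

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_y p(y)^q) / (q - 1)."""
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def shannon(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.3, 0.2])
print(tsallis(p, 1.000001))   # approaches Shannon entropy as q -> 1
print(shannon(p))
print(tsallis(p, 2.0))        # Gibbs error: 1 - sum_y p(y)^2 = 0.62
```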

Page 11: Active Learning for Probabilistic Models


Properties of Gibbs Error

• Gibbs error is the expected error of the Gibbs classifier

– Gibbs classifier: Draw a labeling from the distribution and use the labeling as the prediction

• At most twice Bayes (best possible) error.
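A quick sanity check of both properties on random distributions; the factor-of-two bound follows because ∑_y p(y)^2 ≥ (max_y p(y))^2 ≥ 2 max_y p(y) - 1:

```python
# Gibbs error (expected error of sampling the prediction from p) is at
# most twice the Bayes error (error of predicting the most likely label).
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.random(4)
    p /= p.sum()
    gibbs = 1.0 - np.sum(p ** 2)    # P(sampled label != true label)
    bayes = 1.0 - np.max(p)         # error of the argmax prediction
    assert gibbs <= 2.0 * bayes
    print(f"Gibbs {gibbs:.3f} <= 2 x Bayes {bayes:.3f}")
```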

Page 12: Active Learning for Probabilistic Models


• Gibbs error is a lower bound to entropy: using ln(1/p) ≥ 1 - p, H(p) = ∑_y p(y) ln(1/p(y)) ≥ ∑_y p(y)(1 - p(y)) = 1 - ∑_y p(y)^2
– Maximizing policy Gibbs error therefore maximizes a lower bound to the policy entropy

Page 13: Active Learning for Probabilistic Models


• Policy Gibbs error: the Gibbs error of the distribution over paths ρ of the policy tree, 1 - ∑_ρ p(ρ)^2 = ∑_ρ p(ρ)(1 - p(ρ))

Page 14: Active Learning for Probabilistic Models


• Maximizing policy Gibbs error minimizes the expected weighted posterior Gibbs error
• The objective splits into a version space term and a posterior Gibbs error term, so each query makes progress on either the version space or the posterior Gibbs error

Page 15: Active Learning for Probabilistic Models


Gibbs Error and Adaptive Policies

• Greedy algorithm: select the node i with the largest conditional Gibbs error 1 - ∑_y p(Y_i = y | observations)^2

• Near-optimality holds for the case of policy Gibbs error (in contrast to policy entropy)
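A minimal sketch of this query rule; `marginals(observed)` is a hypothetical helper (not from the slides) standing in for inference in the model, returning each unqueried node's label marginal:

```python
import numpy as np

def gibbs_error(p):
    """Gibbs error of a marginal: 1 - sum_y p(y)^2."""
    return 1.0 - np.sum(np.asarray(p) ** 2)

def select_query(marginals, observed):
    """Pick the unqueried node whose marginal has the largest Gibbs error."""
    scores = {i: gibbs_error(p) for i, p in marginals(observed).items()}
    return max(scores, key=scores.get)

# Toy stand-in for inference: fixed marginals over three unlabeled nodes.
def toy_marginals(observed):
    return {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.7, 0.3]}

print(select_query(toy_marginals, observed={}))  # node 1, the most uncertain
```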

Page 16: Active Learning for Probabilistic Models


• Proof idea:
– Show that policy Gibbs error is the same as the expected version space reduction.
– The version space is the total probability of the remaining labelings of the unlabeled nodes (labelings that are consistent with the labeled nodes).
– The version space reduction function is adaptive submodular, giving the required result for policy Gibbs error (using the result of Golovin and Krause).


Page 17: Active Learning for Probabilistic Models


Adaptive Submodularity

[Figure: policy tree with a path ρ that is a prefix of a longer path ρ′, and a node x_3 queried along the way]

• Diminishing return property
– Let Δ(x_i | ρ) be the expected reduction in version space when x_i is queried after path ρ and its label y is received
– Adaptive submodular because Δ(x_i | ρ) ≥ Δ(x_i | ρ′) whenever ρ is a prefix of ρ′

Page 18: Active Learning for Probabilistic Models


Worst Case Version Space

• Maximizing policy Gibbs error maximizes the expected version space reduction
• Related greedy algorithm: select the least confident variable
– Select the variable with the smallest maximum label probability
• Approximately maximizes the worst case version space reduction
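A matching sketch of the least-confidence rule, reusing the hypothetical `marginals(observed)` interface from the earlier sketch:

```python
def select_least_confident(marginals, observed):
    """Pick the node whose most likely label has the smallest probability."""
    scores = {i: max(p) for i, p in marginals(observed).items()}
    return min(scores, key=scores.get)

def toy_marginals(observed):
    return {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.7, 0.3]}

print(select_least_confident(toy_marginals, observed={}))  # node 1
```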

Page 19: Active Learning for Probabilistic Models


• Let OPT be the worst case version space reduction achieved by the best adaptive policy with the same budget
• The greedy strategy that selects the least confident variable achieves worst case version space reduction at least (1 - 1/e) OPT, because the version space reduction function is pointwise submodular

Page 20: Active Learning for Probabilistic Models


Pointwise Submodularity

• Let V(S,y) be the version space remaining if y is the true labeling of all nodes and subset S has been labeled

• 1 - V(S, y) is pointwise submodular: it is submodular in S for every fixed labeling y
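A small numeric check on a toy distribution (one example, not a proof), where V(S, y) is computed as the probability mass of labelings agreeing with y on S:

```python
# Check diminishing returns of f(S) = 1 - V(S, y) for a fixed labeling y.
p = {(0, 0, 0): 0.4, (0, 1, 0): 0.3, (1, 1, 0): 0.2, (1, 1, 1): 0.1}

def V(S, y):
    return sum(q for z, q in p.items() if all(z[j] == y[j] for j in S))

y = (0, 1, 0)                          # fix the "true" labeling
f = lambda S: 1.0 - V(S, y)
for S in [set(), {0}]:
    T = S | {1}                        # S subset of T; add node 2 to both
    assert f(S | {2}) - f(S) >= f(T | {2}) - f(T) - 1e-12
print("diminishing returns hold on this example")
```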

Page 21: Active Learning for Probabilistic Models


Summary So Far …

Greedy Algorithm | Criteria | Optimality Property
Select maximum entropy variable | Entropy of selected variables | No constant factor approximation
Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1-1/e (adaptive submodular)
Select least confident variable | Worst case version space reduction | 1-1/e (pointwise submodular)

Page 22: Active Learning for Probabilistic Models


Learning Parameters

• Take a Bayesian approach
• Put a prior over the parameters
• Integrate away the parameters when computing the probability of a labeling: p(y) = ∫ p(y | θ) p(θ) dθ
• Also works in the commonly encountered pool-based active learning scenario (independent instances, with no dependencies other than through the parameters)
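A minimal illustration of integrating away the parameters, using a Beta-Bernoulli model as a stand-in (the slides use a Bayesian CRF; this is only the simplest case with a closed form):

```python
def predictive_prob_one(a, b, labels):
    """P(next label = 1 | observed labels) under a Beta(a, b) prior.

    Equals the integral over theta of theta * Beta(theta; a + k, b + n - k),
    where k is the number of 1s among the n observed labels, so the
    parameter theta never appears explicitly.
    """
    k, n = sum(labels), len(labels)
    return (a + k) / (a + b + n)

print(predictive_prob_one(1.0, 1.0, [1, 1, 0, 1]))  # (1 + 3) / (2 + 4) = 0.667
```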

Page 23: Active Learning for Probabilistic Models


Experiments

• Named entity recognition with Bayesian CRF on CoNLL 2003 dataset

• Greedy algorithms perform similarly to one another and better than passive learning (random)

[Figure: 'Performance on NER': F1 AUC (y-axis roughly 71 to 77) for Passive, MaxEnt, Least Conf, and Gibbs Err]

Page 24: Active Learning for Probabilistic Models


Weakness of Gibbs Error

• A labeling is considered incorrect if even one component does not agree

[Figure: two labelings of the university web page graph (Faculty, Student, Project) that differ on only one node, yet count as completely different labelings]

Page 25: Active Learning for Probabilistic Models


Generalized Gibbs Error

• Generalize Gibbs error to use a loss function L: the generalized Gibbs error is ∑_{y,y′} p(y) p(y′) L(y, y′), the expected loss of the Gibbs classifier
• Example: Hamming loss, 1 - F-score, etc.
• Reduces to Gibbs error when L(y, y′) = 1 - δ(y, y′), where
– δ(y, y′) = 1 when y = y′, and
– δ(y, y′) = 0 otherwise
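A small sketch of the generalized Gibbs error ∑_{y,y′} p(y) p(y′) L(y, y′) on a toy distribution over labelings, showing that the 0-1 loss recovers the plain Gibbs error:

```python
import itertools

def generalized_gibbs_error(p, loss):
    """p: dict mapping labelings (tuples) to probabilities."""
    return sum(p[y] * p[z] * loss(y, z)
               for y, z in itertools.product(p, repeat=2))

def hamming(y, z):
    return sum(a != b for a, b in zip(y, z)) / len(y)

def zero_one(y, z):                      # L(y, z) = 1 - delta(y, z)
    return float(y != z)

p = {(0, 0): 0.5, (0, 1): 0.3, (1, 1): 0.2}
print(generalized_gibbs_error(p, hamming))   # expected per-node disagreement
print(generalized_gibbs_error(p, zero_one))  # 1 - sum_y p(y)^2 = 0.62
```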


Page 26: Active Learning for Probabilistic Models


• Generalized policy Gibbs error (to maximize): the prior generalized Gibbs error minus the expected remaining weighted generalized Gibbs error, where the remaining term at a path ρ is taken over labelings that agree with y on ρ

Page 27: Active Learning for Probabilistic Models


• Generalized policy Gibbs error is the average, over labelings drawn from the prior, of a version space reduction quantity evaluated at the observed path
• Call this function the generalized version space reduction function
• Unfortunately, it is not adaptive submodular for arbitrary L


Page 28: Active Learning for Probabilistic Models


• However, the generalized version space reduction function is pointwise submodular
– Has a good approximation in the worst case

Page 29: Active Learning for Probabilistic Models


• Hedging against the worst case labeling may be too conservative
• Can hedge against the total generalized version space among surviving labelings instead


Page 30: Active Learning for Probabilistic Models


• Call this the total generalized version space reduction function
• The total generalized version space reduction function is pointwise submodular
– Has a good approximation in the worst case

Page 31: Active Learning for Probabilistic Models


Summary

Greedy Algorithm | Criteria | Optimality Property
Select maximum entropy variable | Entropy of selected variables | No constant factor approximation
Select maximum Gibbs error variable | Policy Gibbs error (expected version space reduction) | 1-1/e (adaptive submodular)
Select least confident variable | Worst case version space reduction | 1-1/e (pointwise submodular)
Select variable that maximizes worst case generalized version space reduction | Worst case generalized version space reduction | 1-1/e (pointwise submodular)
Select variable that maximizes worst case total generalized version space reduction | Worst case total generalized version space reduction | 1-1/e (pointwise submodular)

Page 32: Active Learning for Probabilistic Models


Experiments

• Text classification on the 20 Newsgroups dataset
• Classify 7 pairs of newsgroups
• AUC for classification error
• Max Gibbs error vs total generalized version space with Hamming loss

[Figure: 'Gibbs vs Hamming' scatter plot of AUC for Hamming against AUC for Gibbs, both axes roughly 74 to 88]

Page 33: Active Learning for Probabilistic Models


Acknowledgements

• Joint work with
– Nguyen Viet Cuong (NUS)
– Ye Nan (NUS)
– Adam Chai (DSO)
– Chieu Hai Leong (DSO)