An interactive approach for cleaning noisy observations in Bayesian networks with the help of an expert
Andrés R. Masegosa and Serafín Moral
Department of Computer Science and Artificial Intelligence
University of Granada
Granada, September 2012
PGM 2012 Granada (Spain) 1/17
Introduction
Evidence Gathering Process
Sensor failures (GPS, vision, etc.).
Noisy transmissions in a communication channel.
Outliers are a particular case of noisy observations.
Human errors with the GUI of the system.
...
Introduction
New Cleaning Methods: misspelled words in smartphones
System detects a corrupted noisy observation.
System displays alternative words (i.e. fixed observations).
The user ultimately decides which is the correct one.
The system interacts with the user.
Introduction
An Interactive Data Cleaning Method
Noisy observations: some values of the observations are noisy (i.e. different from their actual values).
Our data model is a Bayesian network over multinomial data.
The noisy process needs to be explicitly modelled.
We assume the existence of an expert able to provide knowledge about specific parts of the observation vector.
Modelling Noisy Observations and Expert Knowledge
Notation
We assume that we have a set of observable variables: O = {O1, ...,Op}.
o is a particular observation vector.
P(O) is modelled by a given Bayesian network.
We assume there is noise when observing these variables.
O′ = {O′1, ..., O′p} are the noisy observable variables.
o′ is a particular noisy observation vector.
Our goal
To detect the noisy observations: o′i ≠ oi.
To recover the true observations: o = {o1, ..., op}.
... with the help of an expert.
Modelling Noisy Observations and Expert Knowledge
Modelling noisy observations and expert knowledge
Noisy Observations
O′i is the noisy observable variable, and Ni indicates whether there is a noisy observation.
The conditional P(O′i | Oi, Ni) defines the noise model.
Expert Knowledge
Oei is the variable which receives the expert knowledge, and Ei indicates whether that knowledge is correct.
The conditional P(Oei | Oi, Ei) defines how the expert gives wrong answers.
Modelling Noisy Observations and Expert Knowledge
A model for noisy observations and expert knowledge
Automatic cleaning method
Recover the most probable assignment of the observable variables given the noisy observations:

oMPE = arg max_o P(O = o | O′ = o′)

Not a good solution: there exist alternative explanations, O = o, with non-negligible probability.
Use expert knowledge to discard those alternative explanations.
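As a toy illustration of the oMPE computation, the sketch below enumerates a small tabulated joint P(O) (standing in for a Bayesian network) under an assumed independent symmetric flip-noise model; the probability table and the 0.2 noise rate are illustrative values, not taken from the slides.

```python
def mpe(p_o, noise_rate, o_prime):
    """Brute-force MPE: arg max_o P(O = o) * prod_i P(o'_i | o_i)
    under an independent symmetric flip-noise model."""
    best, best_score = None, -1.0
    for o, prior in p_o.items():
        lik = 1.0
        for oi, opi in zip(o, o_prime):
            lik *= (1 - noise_rate) if oi == opi else noise_rate
        score = prior * lik
        if score > best_score:
            best, best_score = o, score
    return best

# Illustrative joint over two correlated binary observables.
p_o = {(0, 0): 0.5, (1, 1): 0.4, (0, 1): 0.06, (1, 0): 0.04}
print(mpe(p_o, noise_rate=0.2, o_prime=(0, 1)))  # → (0, 0): the second value is corrected
```

With a high enough noise rate, the correlated prior overrides the discordant observation (0, 1); with a low noise rate the observation would be kept as-is.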
Interactive Cleaning: Entropy Based Approach
Cleaning noisy observations with the help of expert knowledge
Entropy Based Approach
Reduce the conditional entropy of the true observations:

H(O | O′ = o′)

The lower this entropy, the stronger our confidence in the oMPE.
Our strategy is to request from the expert the knowledge which most reduces the above entropy (the highest information gain):

IG(O, Oei | o′)
The expert should submit his/her belief about the true value of Oi.
The Oei with the highest information gain is the one with the highest entropy:

arg max_{Oei} IG(O, Oei | o′) = arg max_{Oei} (H(Oei | O′ = o′) − H(Ei))
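Assuming H(Ei) is equal across variables (so the selection reduces to the posterior-entropy term), a minimal sketch of this selection rule might look as follows; the posterior marginals are made-up numbers standing in for the result of propagation in the network.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def pick_query(posteriors, h_e=0.0):
    """Select the index i maximizing H(Oe_i | O' = o') - H(E_i),
    with H(E_i) assumed equal (h_e) for every variable."""
    gains = [entropy(p) - h_e for p in posteriors]
    return max(range(len(gains)), key=gains.__getitem__)

# Made-up posterior marginals P(O_i | o') for three binary variables.
posteriors = [[0.95, 0.05], [0.55, 0.45], [0.8, 0.2]]
print(pick_query(posteriors))  # → 1 (the most uncertain variable)
```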
Interactive Cleaning: Entropy Based Approach
Entropy Based Approach
Algorithm
1: oe = ∅.
2: repeat
3:   Compute the Oei variable with the highest information gain:
       Oemax = arg max_{Oei} IG(O; Oei | o′, oe)
4:   if IG(O; Oemax | o′, oe) > λ then
5:     Ask the expert about the corresponding Oi.
6:     oe = oe ∪ oemax.
7:   end if
8: until end
9:
10: return oMPE = arg max_o P(O = o | o′, oe)
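This querying loop can be sketched as below, with two simplifications that are not in the slides: the expert is treated as an oracle who always answers correctly, and posteriors are updated by clamping the queried variable rather than by re-propagating in the network.

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def interactive_clean(posteriors, true_values, lam):
    """Query loop: repeatedly ask about the most uncertain variable,
    clamp its posterior to the expert's answer, and stop once every
    remaining entropy falls below the threshold lam."""
    posteriors = [list(p) for p in posteriors]
    asked = []
    while True:
        ents = [entropy(p) for p in posteriors]
        i = max(range(len(ents)), key=ents.__getitem__)
        if ents[i] <= lam:
            break
        # Oracle assumption: the expert reveals the true value of O_i.
        posteriors[i] = [1.0 if v == true_values[i] else 0.0
                         for v in range(len(posteriors[i]))]
        asked.append(i)
    # Return the queried variables and the resulting per-variable guesses.
    return asked, [max(range(len(p)), key=p.__getitem__) for p in posteriors]

posteriors = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(interactive_clean(posteriors, true_values=[0, 1, 0], lam=0.5))
# → ([1, 2], [0, 1, 0])
```

With λ = 0.5, the two most uncertain variables are queried and the first is left to the automatic decision.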
Interactive Cleaning: Cost Based Approach
Cost Based Approach
Simplifications
The decision problem is solved for a particular noisy observation vector o′.
Cost of fixing computed as a sum of independent costs: CF = ∑_{i=1}^{p} CFi.
We assume that we have p different decision problems:
Problem Di involves decisions Ai and {F1, ..., Fp}. When solving Di we do not ask for the rest of the variables.
Interactive Cleaning: Cost Based Approach
Cost Based Approach
Solving Problem Di
Expected cost gain: the difference in the expected cost between asking and not asking about Oi:

CG(Ai | o′, oe) = c(Dni) − c(Dai)

c(Dni): expected cost when not asking about Oi.
c(Dai): expected cost when asking about Oi.
Fixing Decisions: we compute for each decision Fj the value f*j such that

f*j = arg min_{fj} ∑_{oj} CFj(fj, oj) P(oj | o′, oe)
The minimization problem can be solved in constant time after the observations have been propagated:
0/1 cost error: select the oj with the highest probability.
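The fixing rule can be sketched directly from the formula. The cost matrices below are illustrative: with 0/1 costs the rule reduces to the posterior mode, while an asymmetric matrix can pick a fix other than the most probable value.

```python
def best_fix(cost, posterior):
    """f*_j = arg min_f  sum_o cost[f][o] * P(o | o', oe):
    choose the fix with minimum expected cost under the
    posterior over the true value of O_j."""
    expected = [sum(c * p for c, p in zip(row, posterior)) for row in cost]
    return min(range(len(cost)), key=expected.__getitem__)

# 0/1 cost: the rule reduces to the posterior mode.
zero_one = [[0, 1], [1, 0]]
print(best_fix(zero_one, [0.3, 0.7]))  # → 1

# Asymmetric (made-up) costs: wrongly fixing to value 0 costs 5,
# so the cheapest fix need not be the most probable value.
asym = [[0, 5], [1, 0]]
print(best_fix(asym, [0.7, 0.3]))  # → 1
```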
Interactive Cleaning: Cost Based Approach
Cost Based Approach
Algorithm
1: oe = ∅.
2: repeat
3:   Compute the decision Ai with the highest expected cost gain:
       Amax = arg max_{Ai} CG(Ai | o′, oe)
4:   if CG(Amax | o′, oe) > 0 then
5:     Ask the expert about the corresponding Oi.
6:     oe = oe ∪ oemax.
7:   end if
8: until end
9:
10: return f*j = arg min_{fj} ∑_{oj} CFj(fj, oj) P(oj | o′, oe)
EM algorithm to estimate the noise rate
EM Algorithm to estimate the unknown noise rates τi
We are given a set of M noisy observations: D = {o′(1), ..., o′(M)}.
The EM algorithm is applied to obtain the MAP estimate of the parameters τ = (τ1, ..., τp).
O = {O1, ...,Op} and N = {N1, ...,Np} are the hidden variables.
Expectation step: given a current estimate τ<k>, compute P(Ni = noise | o′(j), τ<k>) by propagating in the extended BN for the j-th data sample.
Maximization step:
τi<k+1> = ∑_j P(Ni = noise | o′(j); τ<k>) / M
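A one-variable toy version of this EM loop can be written directly from the two steps; the symmetric flip-noise model and the known marginal P(O) (standing in for propagation in the extended BN) are simplifying assumptions, not part of the slides.

```python
def em_noise_rate(data, p_true, tau0=0.5, iters=50):
    """Toy EM for a single binary variable under symmetric flip noise.
    E-step: responsibility P(N = noise | o', tau) by Bayes, using the
    known marginal p_true in place of BN propagation.
    M-step: tau <- average responsibility over the M samples."""
    tau = tau0
    for _ in range(iters):
        resp = []
        for o in data:
            noisy = tau * p_true[1 - o]        # observed value was flipped
            clean = (1 - tau) * p_true[o]      # observed value is the true one
            resp.append(noisy / (noisy + clean))
        tau = sum(resp) / len(resp)            # the M-step update shown above
    return tau

# 10% of the samples disagree with a heavily skewed marginal P(O=1)=0.95,
# so EM settles near the maximum-likelihood noise rate of about 0.056.
data = [1] * 90 + [0] * 10
print(round(em_noise_rate(data, p_true=[0.05, 0.95]), 3))
```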
Experimental Evaluation
Experimental Set-up
Experimental Evaluation
Experiments with 5% noise rate
Noise Rate Precision
With no expert knowledge, only a minor proportion of the errors are identified and we might introduce new errors.
The introduction of expert knowledge boosts the precision of the detected errors.
Experimental Evaluation
Experiments with 5% noise rate
PNR with EM algorithm
Quite similar behavior.
EM is able to accurately estimate the unknown noise rates.
Conclusions and Future Work
Conclusions and Future Work
Conclusions:
It can be quite hard to recover the true observations even with very low noise rates.
The interaction with an expert really helps.
However, the performance strongly depends on the particular model.
Future Work:
Extend this method assuming that we know neither the parameters of the network nor its structure.
Apply this methodology to supervised classification problems (the class-noise problem).