Outline: Introduction, The plug-in method, The Bayesian Approach, Other Techniques
Introduction to Entropy Estimation
Catharina Olsen
Département d'Informatique, Machine Learning Group, Université Libre de Bruxelles
6th February 2008
Content

Estimators:
- Maximum Likelihood
- Miller-Madow
- Bayes
- Shrinkage
- Nemenman-Shafee-Bialek (NSB)
- Best Upper Bounds (BUB)
- estimation using B-spline functions
Entropy

Setup:
- experiment with p possible outcomes B_1, ..., B_p
- θ_i is the probability for B_i to occur, with \sum_{i=1}^p \theta_i = 1
- repeat the experiment n times; B_i occurs y_i times (n = \sum_{i=1}^p y_i)

The entropy of the resulting distribution is defined as

H(\theta) := -\sum_{i=1}^p \theta_i \log \theta_i   (1)
Entropy Estimators - MLE / Miller-Madow

Maximum Likelihood Estimator (MLE)

H^{ML}(\theta) := -\sum_{i=1}^p \theta_i^{ML} \log \theta_i^{ML}   (2)

where \theta_i^{ML} := y_i / n.

Property: H^{ML} is negatively biased everywhere!
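As a minimal Python sketch (not part of the slides; the function name and the list-of-counts input are illustrative), the plug-in estimator reads:

```python
from math import log

def entropy_ml(y):
    """Maximum likelihood (plug-in) entropy estimate from bin counts y.

    theta_i^ML = y_i / n; empty bins contribute nothing (0 log 0 := 0).
    """
    n = sum(y)
    return -sum((yi / n) * log(yi / n) for yi in y if yi > 0)
```

For uniform counts over p bins this returns log p, the maximum possible value; on sampled data it systematically underestimates H, which is the negative bias noted above.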
Entropy Estimators - MLE / Miller-Madow

Miller-Madow estimator: the MLE with a bias correction

H^{MM}(\theta) := H^{ML}(y) + \frac{m - 1}{2n}   (3)

Algorithm
- Sort data points into p bins; occurrences y_i, i = 1, ..., p
- n = \sum_i y_i, the number of data points
- \theta_i^{ML} = y_i / n
- find the number of bins with zero probability p_0, then m = p - p_0
- H^{MM} = -\sum_i \theta_i^{ML} \log \theta_i^{ML} + \frac{m-1}{2n}
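The steps above as a Python sketch (names illustrative):

```python
from math import log

def entropy_miller_madow(y):
    """Miller-Madow estimate: plug-in entropy plus the (m - 1)/(2n) correction,
    where m = p - p0 is the number of bins with nonzero counts."""
    n = sum(y)
    h_ml = -sum((yi / n) * log(yi / n) for yi in y if yi > 0)
    m = sum(1 for yi in y if yi > 0)
    return h_ml + (m - 1) / (2 * n)
```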
Entropy Estimators - Shrink

Shrinkage estimate (with uniform target t_i = 1/p and u_i = y_i/n)

\theta_i^{shrink} = \lambda t_i + (1 - \lambda) u_i   (4)
                  = \begin{cases} \lambda^* \frac{1}{p} + (1 - \lambda^*) \frac{y_i}{n} & \lambda^* < 1 \\ \frac{1}{p} & \lambda^* \ge 1, \end{cases}   (5)

where

\lambda^* = \frac{p (n^2 - \sum_i y_i^2)}{(n - 1)(p \sum_i y_i^2 - n^2)},   (6)

which minimizes the mean squared error E\{\sum_i (\hat\theta_i - \theta_i)^2\}.

(The calculation of \lambda^* is given in the appendix.)
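A Python sketch of Eqs. (4)-(6) with target t_i = 1/p and u_i = y_i/n (function name illustrative):

```python
from math import log

def entropy_shrink(y):
    """Shrinkage entropy estimate: shrink the ML frequencies y_i/n toward the
    uniform target 1/p with intensity lambda* from Eq. (6), then plug in."""
    p, n = len(y), sum(y)
    s2 = sum(yi * yi for yi in y)
    denom = (n - 1) * (p * s2 - n * n)
    # lambda* >= 1 (or a vanishing denominator, as for perfectly uniform
    # counts) means full shrinkage to the uniform distribution
    lam = 1.0 if denom <= 0 else min(1.0, p * (n * n - s2) / denom)
    theta = [lam / p + (1 - lam) * yi / n for yi in y]
    return -sum(t * log(t) for t in theta if t > 0)
```

Note that shrinking toward the uniform target pulls the estimate upward relative to the plug-in value, counteracting the MLE's negative bias.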
Bayesian Approach

What is the Bayesian approach?

Bayes: P(\theta | n) = \frac{P(n | \theta) P(\theta)}{P(n)}   (7)

P(n | \theta) \propto \prod_i \theta_i^{n_i}

Prior: P(\theta) \propto \Delta(\theta) \prod_i \Theta(\theta_i) \theta_i^{a-1}   (Dirichlet distribution)

where

\delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{else} \end{cases} \qquad \Theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{else} \end{cases} \qquad \Delta(\theta) := \delta\left(\sum_i \theta_i - 1\right)

Instead of estimating the probabilities \theta_i, a function F of \theta is estimated:

\bar F = \int d\theta \, F(\theta) P(\theta | n).   (8)
Theory

How do we calculate the moments of a function F(\theta)?

The Bayesian estimator with uniform prior for F^k(\theta) is given by \frac{I[F^k(\theta), n]}{I[1, n]},
where I[F(\theta), n] := \int d\theta \, F(\theta) P(\theta) \prod_i \theta_i^{n_i}.

How is this evaluated? (Here m denotes the number of bins and N := \sum_i n_i.)

I[1, n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)}

I[\log^{r_1}(\theta_1) \cdots \log^{r_m}(\theta_m), n] = \partial^{r_1}_{n_1} \cdots \partial^{r_m}_{n_m} I[1, n]

I[\log(\theta_u), n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)} \, \Phi_1(n_u + 1, N + m)

⇒ Bayes estimator for the entropy:

\frac{I[H(\theta), n]}{I[1, n]} = -\sum_i \frac{n_i + 1}{N + m} \, \Delta\Phi^{(1)}(n_i + 2, N + m + 1),

where \Delta\Phi^{(n)}(z_1, z_2) := \partial^n_{z_1} \log\Gamma(z_1) - \partial^n_{z_2} \log\Gamma(z_2).
Bayesian Approach

What estimators do we get for F(\theta) = \theta?

The Bayesian estimator for \theta_i with uniform prior is given by

\theta_i = \frac{y_i + 1}{n + p}.   (9)

If the prior is chosen to be Dirichlet distributed with parameter a, then the estimator is given by

\theta_i = \frac{y_i + a}{n + pa}.   (10)
Entropy Estimators - Bayes

Bayesian estimator

\theta_i^{Bayes} = \frac{y_i + a}{n + pa},   (11)

where a is the parameter of the Dirichlet distribution, which is here assumed as the prior.

Algorithm
- Sort data points into p bins; occurrences y_i, i = 1, ..., p
- n = \sum_i y_i, the number of data points
- \theta_i^{Bayes} = (y_i + a)/(n + pa)
- H^{Bayes} = -\sum_i \theta_i^{Bayes} \log \theta_i^{Bayes}
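The algorithm above as a Python sketch (names illustrative):

```python
from math import log

def entropy_bayes(y, a):
    """Plug-in entropy from the Bayesian point estimates of Eq. (11),
    theta_i = (y_i + a)/(n + p*a), with Dirichlet parameter a."""
    p, n = len(y), sum(y)
    theta = [(yi + a) / (n + p * a) for yi in y]
    return -sum(t * log(t) for t in theta if t > 0)
```

Choosing a = 0 recovers the ML frequencies.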
Bayesian Approach

What properties does this estimator have?

If \lambda = \frac{pa}{pa + n}, then \theta_i^{Bayes} = \theta_i^{shrink}, i = 1, ..., p.

There is a one-to-one correspondence between \lambda and a in Dirichlet(a):

a = \frac{n}{p} \left( \frac{\lambda}{1 - \lambda} \right).   (12)

Common values for a:
- a \to 0: maximum likelihood estimator
- a = 1/2: Krichevsky-Trofimov
- a = 1/p: Schurmann-Grassberger
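The correspondence of Eq. (12) and the equality of the Bayes and shrinkage point estimates can be checked numerically; a small sketch (names and example values illustrative):

```python
def a_from_lambda(lam, n, p):
    """Dirichlet parameter a equivalent to shrinkage intensity lambda (Eq. 12)."""
    return (n / p) * (lam / (1 - lam))

def lambda_from_a(a, n, p):
    """Inverse map: lambda = p*a / (p*a + n)."""
    return p * a / (p * a + n)

# round trip: the two maps are inverses of each other
n, p = 100, 10
for lam in (0.1, 0.5, 0.9):
    assert abs(lambda_from_a(a_from_lambda(lam, n, p), n, p) - lam) < 1e-12

# theta^Bayes equals theta^shrink under the correspondence
y, a = [60, 30, 10], 2.0
n, p = sum(y), len(y)
lam = lambda_from_a(a, n, p)
for yi in y:
    bayes = (yi + a) / (n + p * a)
    shrink = lam / p + (1 - lam) * yi / n
    assert abs(bayes - shrink) < 1e-12
```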
Bayesian Approach

What estimators do we get for F(\theta) = H(\theta)?

Taking the Dirichlet distribution with parameter a as the prior, the estimator is

E(H) = -\sum_{i=1}^p E(\theta_i \log \theta_i)
     = \frac{1}{n + pa} \sum_{i=1}^p (y_i + a) \left( \psi(n + pa + 1) - \psi(y_i + a + 1) \right),

where \psi(z) = \frac{d \ln \Gamma(z)}{dz} is the digamma function.
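A numerical sketch of this formula in stdlib Python (the digamma implementation via recurrence plus asymptotic series is an assumption of this sketch, not from the slides):

```python
from math import log

def digamma(z):
    """Digamma psi(z) = d ln Gamma(z) / dz, via the recurrence
    psi(z) = psi(z + 1) - 1/z and an asymptotic series for large z."""
    result = 0.0
    while z < 6.0:
        result -= 1.0 / z
        z += 1.0
    inv = 1.0 / z
    # ln z - 1/(2z) - 1/(12 z^2) + 1/(120 z^4) - 1/(252 z^6)
    return result + log(z) - 0.5 * inv - inv**2 * (1/12 - inv**2 * (1/120 - inv**2 / 252))

def entropy_bayes_dirichlet(y, a):
    """Posterior mean of H under a Dirichlet(a) prior (formula above)."""
    p, n = len(y), sum(y)
    return sum((yi + a) * (digamma(n + p * a + 1) - digamma(yi + a + 1))
               for yi in y) / (n + p * a)
```

For well-sampled data the posterior mean approaches the true entropy, e.g. two bins with 5000 counts each give a value very close to log 2.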
Bayesian Approach

Properties
- The estimator minimizes the MSE from H.
- It is strongly influenced by the parameter a.
NSB prior

The prior attempts to spread the probability density of H(\theta) over the whole interval [0, \log p] with near uniformity:

P_{NSB}(\theta) \propto \int da \, \frac{dH(a; p)}{da} \, P_a(\theta),   (13)

where \xi := H(a; p) = \psi_0(pa + 1) - \psi_0(a + 1) is the average entropy of distributions drawn from

P_a(\theta) = \frac{1}{Z(a; p)} \left[ \prod_i \theta_i^{a-1} \right] \delta\!\left( \sum_i \theta_i - 1 \right)   (14)

The NSB entropy estimator is given by

H^{NSB} = \frac{\int d\xi \, q(\xi, n) \, E[H_a(n)]}{\int d\xi \, q(\xi, n)},   (15)

where q(\xi, n) = \frac{\Gamma[p a(\xi)]}{\Gamma[n + p a(\xi)]} \prod_i \frac{\Gamma[y_i + a(\xi)]}{\Gamma[a(\xi)]}.
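Eq. (15) can be approximated by direct numerical integration over a, substituting d\xi = (d\xi/da) da. A rough stdlib sketch; the grid resolution, the cutoff a_max, and the digamma implementation are assumptions of this sketch:

```python
from math import lgamma, log, exp

def digamma(z):
    """Digamma via recurrence psi(z) = psi(z+1) - 1/z plus asymptotic series."""
    result = 0.0
    while z < 6.0:
        result -= 1.0 / z
        z += 1.0
    inv = 1.0 / z
    return result + log(z) - 0.5 * inv - inv**2 * (1/12 - inv**2 * (1/120 - inv**2 / 252))

def entropy_nsb(y, grid=1000, a_max=50.0):
    """Numerical sketch of Eq. (15): average the posterior mean entropy E[H_a]
    over a, weighted by the evidence q and the Jacobian d(xi)/da."""
    p, n = len(y), sum(y)

    def xi(a):    # average prior entropy xi(a) = psi0(p*a + 1) - psi0(a + 1)
        return digamma(p * a + 1) - digamma(a + 1)

    def e_h(a):   # posterior mean entropy for fixed a (previous slide's formula)
        return sum((yi + a) * (digamma(n + p * a + 1) - digamma(yi + a + 1))
                   for yi in y) / (n + p * a)

    def log_q(a): # log of Gamma(pa)/Gamma(n+pa) * prod_i Gamma(yi+a)/Gamma(a)
        return (lgamma(p * a) - lgamma(n + p * a)
                + sum(lgamma(yi + a) - lgamma(a) for yi in y))

    da = a_max / grid
    a_vals = [da * (i + 0.5) for i in range(grid)]
    lq = [log_q(a) for a in a_vals]
    m = max(lq)   # subtract the max log-weight to avoid underflow
    num = den = 0.0
    for a, l in zip(a_vals, lq):
        jac = (xi(a + 1e-4) - xi(a - 1e-4)) / 2e-4  # d(xi)/da, finite difference
        w = exp(l - m) * jac
        num += w * e_h(a)
        den += w
    return num / den
```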
NSB prior

Properties:
- good estimator: low bias even in the undersampled regime
- calculation is slow (numerical integration over ξ)
Other Techniques: BUB Estimator, B-Splines, Appendix
Best Upper Bound (BUB) Estimator

Idea: choose the order-N polynomial that is the best estimator of the entropy in the space of all such polynomials:

H^{BUB} = \sum_{j=0}^{N} a_{j,N} h_j,   (16)

where h_j := \sum_{i=1}^p 1\{n_i = j\} are the count statistics (the number of bins containing exactly j samples).

Bounds on the bias and variance:

\max_\theta Var(H^{BUB}) < N \max_{0 \le j < N} (a_{j+1,N} - a_{j,N})^2
\max_\theta |B(H^{BUB})| \le 2 M^*(f, H_a)

where B_{j,N}(\theta) := \binom{N}{j} \theta^j (1 - \theta)^{N-j} and M^*(f, H_a) := \sup_x \left( f(x) \left| H(x) - \sum_j a_{j,N} B_{j,N}(x) \right| \right).
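The estimator is linear in the count statistics h_j. A small sketch (names illustrative; computing the optimal coefficients is the regularized regression in the algorithm below, so here Miller-Madow-style coefficients are simply plugged in for illustration):

```python
from math import log

def count_statistics(y, N):
    """h_j = number of bins whose count equals j, for j = 0..N."""
    h = [0] * (N + 1)
    for yi in y:
        h[yi] += 1
    return h

def h_bub(y, coeffs):
    """H_BUB = sum_j a_{j,N} h_j (Eq. 16) for precomputed coefficients a_{j,N}."""
    h = count_statistics(y, len(coeffs) - 1)
    return sum(a_j * h_j for a_j, h_j in zip(coeffs, h))

# With a_0 = 0 and a_{N,j} = -(j/N) log(j/N) + (1 - j/N)/(2N) for j >= 1
# (the initialization used for j > k in the algorithm), H_BUB reduces to the
# Miller-Madow estimate.
y = [5, 5, 5, 5]
N = sum(y)
coeffs = [0.0] + [-(j / N) * log(j / N) + (1 - j / N) / (2 * N)
                  for j in range(1, N + 1)]
```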
Best Upper Bound (BUB) Estimator

Algorithm
- Set
  f(\theta) = \begin{cases} m & \theta < 1/m \\ 1/\theta & \theta \ge 1/m \end{cases}   (17)
  and c^*(f) = constant.
- For 1 < k < K \ll N:
  - set a_{N,j} = -\frac{j}{N} \log \frac{j}{N} + \frac{1 - j/N}{2N} for all j > k
  - calculate
    a_{N, j \le k} = \left( X'X_{j \le k} + \frac{N}{c^*(f)^2} (D'D + I_k' I_k) \right)^{-1} \left( X'Y_{j \le k} + \frac{N a_{k+1}}{c^*(f)^2} e_k \right),
    where X'X = \langle B_{j,N} f, B_{k,N} f \rangle and X'Y = \left\langle B_{j,N} f, \left( H - \sum_j a_{j,N} B_{j,N} \right) f \right\rangle.
- Choose a_{N,j} to minimize \max_p B(H_a)^2 + \max_p Var_p(H_a).
Best Upper Bound (BUB) Estimator
Some results
B-Splines

A generalization of the classical binning method.

The data points can now be assigned to more than one bin i, with weights given by the B-spline functions B_{i,k}, where k is the spline order.
B-Spline

Definition of the knot vector t_i for a number of bins i = 1, ..., M and a given spline order k = 1, ..., M - 1:

t_i := \begin{cases} 0 & i < k \\ i - k + 1 & k \le i \le M - 1 \\ M - 1 - k + 2 & i > M - 1 \end{cases}   (18)

B-spline functions (Cox-de Boor recursion):

B_{i,1}(z) := \begin{cases} 1 & t_i \le z < t_{i+1} \\ 0 & \text{else} \end{cases}

B_{i,k}(z) := B_{i,k-1}(z) \, \frac{z - t_i}{t_{i+k-1} - t_i} + B_{i+1,k-1}(z) \, \frac{t_{i+k} - z}{t_{i+k} - t_{i+1}}
B-Splines

Algorithm
Input: variable x; bins a_i, i = 1, ..., M_x; spline order k. Output: entropy.
- Determine the knot vector t.
- Determine B_{i,k}(x) = B_{i,k}(z), where z is the transformation of x to [0, M_x - k + 1].
- Determine the M_x weighting coefficients for each x_u from B_{i,k}(x_u).
- Sum over all x_u and determine p(a_i) = \frac{1}{N} \sum_{u=1}^{N} B_{i,k}(x_u) for each bin a_i.
- Determine the entropy H(p).
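The steps above as a self-contained Python sketch (the linear transformation to [0, M - k + 1] and the function names are assumptions of this sketch):

```python
from math import log

def knots(M, k):
    """Knot vector of Eq. (18), 0-indexed, with M + k entries."""
    return [0 if i < k else (i - k + 1 if i <= M - 1 else M - k + 1)
            for i in range(M + k)]

def bspline(i, k, z, t):
    """Cox-de Boor recursion for B_{i,k}(z); 0/0 terms are taken as 0."""
    if k == 1:
        return 1.0 if t[i] <= z < t[i + 1] else 0.0
    left = 0.0 if t[i + k - 1] == t[i] else \
        (z - t[i]) / (t[i + k - 1] - t[i]) * bspline(i, k - 1, z, t)
    right = 0.0 if t[i + k] == t[i + 1] else \
        (t[i + k] - z) / (t[i + k] - t[i + 1]) * bspline(i + 1, k - 1, z, t)
    return left + right

def entropy_bspline(x, M, k):
    """B-spline binned entropy: each point is spread over up to k bins."""
    t = knots(M, k)
    lo, hi = min(x), max(x)
    # transform x to [0, M - k + 1); the (1 - eps) factor keeps the largest
    # point inside the half-open support of the last basis function
    z = [(xi - lo) / (hi - lo) * (M - k + 1) * (1 - 1e-9) for xi in x]
    n = len(x)
    p = [sum(bspline(i, k, zu, t) for zu in z) / n for i in range(M)]
    return -sum(pi * log(pi) for pi in p if pi > 0)
```

With k = 1 this reduces to simple binning.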
B-Splines

Properties
- Simple binning corresponds to k = 1.
- The spline order determines the shape of the weight functions and thereby the number of bins each data point is assigned to.
- All weights belonging to one data point sum to unity.
B-Splines
Some results
Further work
- Integrate the estimators NSB, BUB and B-spline into the R package Minet.
- Question: how important is the estimator for network inference methods?
  - the estimators already implemented do not significantly change the outcome of the inference method
  - introducing NA values leads to robustness issues
Reference list

Jean Hausser: Improving Entropy Estimation and the Inference of Genetic Regulatory Networks. Master thesis, 2006.

Ilya Nemenman, Fariel Shafee and William Bialek: Entropy and Inference, Revisited. Adv. Neural Inf. Proc. Syst. 14, Cambridge, MA, 2002.

Carsten O. Daub, Ralf Steuer, Joachim Selbig and Sebastian Kloska: Estimating mutual information using B-spline functions - an improved similarity measure for analyzing gene expression data. BMC Bioinformatics, 2004.

David H. Wolpert and David R. Wolf: Estimating Functions of Probability Distributions from a Finite Set of Samples (Part 1 and Part 2), 1993.
Juliane Schäfer and Korbinian Strimmer: A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statist. Appl. Genet. Mol. Biol. 4: 32, 2005.

Liam Paninski: Estimation of Entropy and Mutual Information. Neural Computation 15(6): 1191-1253, 2003.
IntroductionThe plug-in method
The Bayesian ApproachOther Techniques
BUB EstimatorB-SplinesAppendix
Dirichlet distribution

The probability density function with parameters a_1, ..., a_p:

f(x_1, \ldots, x_p; a_1, \ldots, a_p) = \frac{1}{B(a)} \prod_{i=1}^p x_i^{a_i - 1},   (19)

where the normalizing constant is given by

B(a) = \frac{\prod_{i=1}^p \Gamma(a_i)}{\Gamma\!\left( \sum_{i=1}^p a_i \right)}.   (20)

The Gamma function:

\Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt   (21)

If a_1 = \ldots = a_p, we denote the common value by a.
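A quick numerical check of Eqs. (19)-(20) in Python, evaluating B(a) in log space via lgamma for stability (function name illustrative):

```python
from math import lgamma, log, exp

def dirichlet_pdf(x, a):
    """Dirichlet density of Eq. (19), with the normalizing constant B(a)
    of Eq. (20) computed via log-gamma."""
    log_B = sum(lgamma(ai) for ai in a) - lgamma(sum(a))
    return exp(sum((ai - 1) * log(xi) for xi, ai in zip(x, a)) - log_B)
```

For a_1 = a_2 = 1 the density is uniform on the simplex and equals 1 everywhere.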
The calculation of λ*

E\left\{ \sum_{i=1}^p (\hat\theta_i - \theta_i)^2 \right\} = E\left\{ \sum_{i=1}^p (u_i^* - \theta_i)^2 \right\}
= \sum_i Var(u_i^*) + [E(u_i^* - \theta_i)]^2
= \sum_i Var(\lambda t_i + (1 - \lambda) u_i) + [E(\lambda t_i + (1 - \lambda) u_i) - \theta_i]^2
= \sum_i \lambda^2 Var(t_i) + (1 - \lambda)^2 Var(u_i) + 2\lambda(1 - \lambda) Cov(u_i, t_i) + [\lambda E(t_i - u_i) + Bias(u_i)]^2
The calculation of λ*

Differentiating with respect to λ and setting the derivative to zero gives

\lambda^* = \frac{\sum_i Var(u_i) - Cov(t_i, u_i) - Bias(u_i) E(t_i - u_i)}{\sum_i E[(t_i - u_i)^2]}

Take Var(u_i) := \frac{u_i (1 - u_i)}{n - 1}. The bias of u_i is zero. Then

\lambda^* = \frac{p (n^2 - \sum_i y_i^2)}{(n - 1)(p \sum_i y_i^2 - n^2)}.   (22)
Properties of λ*

Properties:
- the smaller the estimate's variance, the smaller λ*
- λ* depends on the correlation between the estimation errors of u and t
- with increasing mean squared difference between u and t, λ* decreases (protects against misspecifying the target t)
- if the estimator is biased towards the target, the shrinkage intensity is reduced