
Introduction to Entropy Estimation

Catharina Olsen

Département d'Informatique, Machine Learning Group, Université Libre de Bruxelles

6th February 2008




Content

Estimators:

Maximum Likelihood

Miller Madow

Bayes

Shrinkage

Nemenman-Shafee-Bialek (NSB)

Best Upper Bounds (BUB)

using B-spline functions




Entropy

Setup:

experiment with p possible outcomes $B_1, \dots, B_p$

$\theta_i$: probability for $B_i$ to occur, with $\sum_{i=1}^{p} \theta_i = 1$

repeat the experiment n times; $B_i$ occurs $y_i$ times ($n = \sum_{i=1}^{p} y_i$)

The entropy of the resulting distribution is defined as

$$H(\theta) := -\sum_{i=1}^{p} \theta_i \log \theta_i \qquad (1)$$




Entropy Estimators - MLE / Miller-Madow

Maximum Likelihood Estimator (MLE)

$$\hat{H}^{ML} := -\sum_{i=1}^{p} \hat\theta_i^{ML} \log \hat\theta_i^{ML} \qquad (2)$$

where $\hat\theta_i^{ML} := y_i / n$.

Property: $\hat{H}^{ML}$ is negatively biased everywhere!
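The negative bias can be checked exactly in a small case. The sketch below is an illustration only, assuming p = 2 outcomes and natural logarithms; the function names are hypothetical and not part of the slides. It sums the plug-in entropy over all possible binomial counts to get the exact expectation and compares it with the true entropy.

```python
import math

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy in nats; 0 log 0 is taken as 0."""
    n = sum(counts)
    return -sum((y / n) * math.log(y / n) for y in counts if y > 0)

def expected_plugin_entropy(theta, n):
    """Exact E[H_ML] for p = 2 outcomes, summing over all binomial counts."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * theta**k * (1 - theta)**(n - k)
        total += prob * plugin_entropy([k, n - k])
    return total

theta, n = 0.3, 10  # illustrative values, not taken from the slides
true_h = -(theta * math.log(theta) + (1 - theta) * math.log(1 - theta))
bias = expected_plugin_entropy(theta, n) - true_h
print(f"true H = {true_h:.4f}, bias of H_ML = {bias:.4f}")
```

The bias is negative because the entropy is a concave function of the frequencies (Jensen's inequality).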




Entropy Estimators - MLE / Miller-Madow

Miller-Madow estimator: the MLE with a bias correction

$$\hat{H}^{MM} := \hat{H}^{ML} + \frac{m - 1}{2n} \qquad (3)$$

Algorithm

Sort data points into p bins; occurrence $y_i$, $i = 1, \dots, p$

$n = \sum_i y_i$: number of data points

$\hat\theta_i^{ML} = y_i / n$

find the number of bins with zero probability $p_0$ → $m = p - p_0$

$\hat{H}^{MM} = -\sum_i \hat\theta_i^{ML} \log \hat\theta_i^{ML} + \frac{m - 1}{2n}$
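The algorithm steps above can be sketched in a few lines. This is a minimal illustration, assuming the data have already been sorted into bins and are passed as a list of counts $y_i$; the function names are hypothetical.

```python
import math

def entropy_ml(counts):
    """Maximum likelihood (plug-in) entropy in nats from bin counts y_i."""
    n = sum(counts)
    return -sum((y / n) * math.log(y / n) for y in counts if y > 0)

def entropy_miller_madow(counts):
    """Plug-in entropy plus the Miller-Madow correction (m - 1) / (2n)."""
    n = sum(counts)
    m = sum(1 for y in counts if y > 0)  # m = p - p0: number of non-empty bins
    return entropy_ml(counts) + (m - 1) / (2 * n)

counts = [4, 2, 3, 0, 1]  # illustrative counts: p = 5 bins, n = 10 data points
print(entropy_ml(counts), entropy_miller_madow(counts))
```

For these counts m = 4 and n = 10, so the correction adds 3/20 to the plug-in value.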




Entropy Estimators - Shrink

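Shrinkage estimate

$$\hat\theta_i^{shrink} = \lambda t_i + (1 - \lambda) u_i \qquad (4)$$

$$= \begin{cases} \lambda^* \frac{1}{p} + (1 - \lambda^*) \frac{y_i}{n} & \lambda^* < 1 \\ \frac{1}{p} & \lambda^* \ge 1, \end{cases} \qquad (5)$$

with target $t_i = 1/p$ and unbiased estimate $u_i = y_i / n$, where

$$\lambda^* = \frac{p \left( n^2 - \sum_i y_i^2 \right)}{(n - 1) \left( p \sum_i y_i^2 - n^2 \right)}, \qquad (6)$$

which minimizes the mean square error $E\left\{ \sum_i (\hat\theta_i - \theta_i)^2 \right\}$.

calculation: see the appendix slides "The calculation of λ∗"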
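Equations (5) and (6) translate directly into code. The following is a minimal sketch, assuming bin counts as input and natural logarithms; the function names are hypothetical, and returning full shrinkage when the denominator of (6) vanishes (exactly uniform counts or n = 1) is an assumption of this sketch rather than something stated on the slides.

```python
import math

def shrinkage_intensity(counts):
    """Estimated shrinkage intensity lambda* of equation (6), clipped to [0, 1]."""
    n, p = sum(counts), len(counts)
    sum_sq = sum(y * y for y in counts)
    denom = (n - 1) * (p * sum_sq - n * n)
    if denom <= 0:  # uniform counts or n = 1: shrink fully (assumption)
        return 1.0
    return min(1.0, max(0.0, p * (n * n - sum_sq) / denom))

def shrinkage_frequencies(counts):
    """Shrinkage estimate (5): mix of the uniform target 1/p and y_i / n."""
    n, p = sum(counts), len(counts)
    lam = shrinkage_intensity(counts)
    return [lam / p + (1 - lam) * y / n for y in counts]

def entropy_shrink(counts):
    """Plug-in entropy of the shrinkage frequencies, in nats."""
    return -sum(t * math.log(t) for t in shrinkage_frequencies(counts) if t > 0)

counts = [4, 2, 3, 0, 1]  # illustrative counts
print(shrinkage_intensity(counts), entropy_shrink(counts))
```

Because the target is uniform, every bin receives a strictly positive frequency whenever the intensity is positive, including bins with zero counts.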




Bayesian Approach

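What is the Bayesian approach?

Bayes:

$$P(\theta \mid n) = \frac{P(n \mid \theta) P(\theta)}{P(n)} \qquad (7)$$

Likelihood: $P(n \mid \theta) \propto \prod_i \theta_i^{n_i}$

Prior: $P(\theta) \propto \Delta(\theta) \prod_i \Theta(\theta_i)\, \theta_i^{a-1}$ (Dirichlet distribution), where

$$\delta(x) = \begin{cases} 1, & \text{if } x = 0, \\ 0, & \text{else,} \end{cases} \qquad \Theta(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{else,} \end{cases} \qquad \Delta(\theta) := \delta\left( \sum_i \theta_i - 1 \right).$$

Instead of estimating the probabilities $\theta_i$, a function F of $\theta$ is estimated:

$$\hat{F} = \int d\theta\, F(\theta) P(\theta \mid n). \qquad (8)$$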




Theory

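How do we calculate the moments of a function $F(\theta)$?

The Bayesian estimator with uniform prior for $F^k(\theta)$ is given by $\frac{I[F^k(\theta), n]}{I[1, n]}$,

where $I[F(\theta), n] := \int d\theta\, F(\theta) P(\theta) \prod_i \theta_i^{n_i}$.

Here $n = (n_1, \dots, n_m)$ is the vector of counts, $N = \sum_i n_i$ the total number of observations and m the number of bins (written $y_i$, n and p on the other slides).

How is this evaluated?

$I[1, n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)}$

$I[\log^{r_1}(\theta_1) \cdots \log^{r_m}(\theta_m), n] = \partial_{n_1}^{r_1} \cdots \partial_{n_m}^{r_m} I[1, n]$

$I[\log(\theta_u), n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)} \times \Delta\Phi^{(1)}(n_u + 1, N + m)$

⇒ Bayes estimator for the entropy, $F(\theta) = H(\theta)$:

$$\frac{I[F(\theta), n]}{I[1, n]} = -\sum_i \frac{n_i + 1}{N + m}\, \Delta\Phi^{(1)}(n_i + 2, N + m + 1),$$

where $\Delta\Phi^{(n)}(z_1, z_2) := \partial_{z_1}^{n} \log \Gamma(z_1) - \partial_{z_2}^{n} \log \Gamma(z_2)$.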




Bayesian Approach

What estimators do we get for $F(\theta) = \theta$?

The Bayesian estimator for $\theta_i$ with uniform prior is given by

$$\hat\theta_i = \frac{y_i + 1}{n + p}. \qquad (9)$$

If the prior is chosen to be Dirichlet distributed with parameter a, then the estimator is given by

$$\hat\theta_i = \frac{y_i + a}{n + pa}. \qquad (10)$$




Entropy Estimators - Bayes

Bayesian estimator

$$\hat\theta_i^{Bayes} = \frac{y_i + a}{n + pa}, \qquad (11)$$

where a is the parameter of the Dirichlet distribution, which is here assumed as the prior.

Algorithm

Sort data points into p bins; occurrence $y_i$, $i = 1, \dots, p$

$n = \sum_i y_i$: number of data points

$\hat\theta_i^{Bayes} = \frac{y_i + a}{n + pa}$

$\hat{H}^{Bayes} = -\sum_i \hat\theta_i^{Bayes} \log \hat\theta_i^{Bayes}$
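A minimal sketch of this algorithm, assuming bin counts as input and natural logarithms; the function names are hypothetical. The parameter a acts as a pseudocount added to every bin before normalizing.

```python
import math

def bayes_frequencies(counts, a):
    """Posterior mean frequencies (y_i + a) / (n + p a) under a Dirichlet(a) prior."""
    n, p = sum(counts), len(counts)
    return [(y + a) / (n + p * a) for y in counts]

def entropy_bayes_plugin(counts, a):
    """Plug-in entropy (nats) of the Bayesian frequency estimates."""
    return -sum(t * math.log(t) for t in bayes_frequencies(counts, a) if t > 0)

counts = [4, 2, 3, 0, 1]  # illustrative counts
for label, a in [("a = 1/2", 0.5), ("a = 1 (uniform prior)", 1.0),
                 ("a = 1/p", 1 / len(counts))]:
    print(label, entropy_bayes_plugin(counts, a))
```

Setting a = 0 recovers the maximum likelihood frequencies $y_i / n$.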




Bayesian Approach

What properties does this estimator have?

If $\lambda = \frac{pa}{pa + n}$, then $\hat\theta_i^{Bayes} = \hat\theta_i^{shrink}$, $i = 1, \dots, p$.

There is a one-to-one correspondence between $\lambda$ and a in Dirichlet(a):

$$a = \frac{n}{p} \left( \frac{\lambda}{1 - \lambda} \right). \qquad (12)$$

Common values for a:

$a \to 0$: maximum likelihood estimator

$a = \frac{1}{2}$: Krichevsky, Trofimov

$a = \frac{1}{p}$: Schürmann, Grassberger
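The correspondence can be verified by substitution. The short derivation below is a worked check, assuming the uniform target $t_i = 1/p$ and $u_i = y_i / n$ of the shrinkage estimator:

```latex
\begin{align*}
\hat\theta_i^{shrink}
  &= \lambda \frac{1}{p} + (1 - \lambda) \frac{y_i}{n}
   = \frac{pa}{pa + n} \cdot \frac{1}{p} + \frac{n}{pa + n} \cdot \frac{y_i}{n} \\
  &= \frac{y_i + a}{n + pa} = \hat\theta_i^{Bayes}, \\[4pt]
\lambda = \frac{pa}{pa + n}
  &\;\Longleftrightarrow\; \lambda n = pa (1 - \lambda)
  \;\Longleftrightarrow\; a = \frac{n}{p} \cdot \frac{\lambda}{1 - \lambda}.
\end{align*}
```

A larger pseudocount a therefore corresponds to stronger shrinkage towards the uniform distribution.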




Bayesian Approach

What estimators do we get for $F(\theta) = H(\theta)$?

Taking the Dirichlet distribution with parameter a as the prior, the estimator is

$$E(H) = -\sum_{i=1}^{p} E(\theta_i \log \theta_i) = \frac{1}{n + pa} \sum_{i=1}^{p} (y_i + a) \left( \psi(n + pa + 1) - \psi(y_i + a + 1) \right),$$

where $\psi(z) = \frac{d \ln \Gamma(z)}{dz}$ is the digamma function.
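The posterior mean entropy can be evaluated with only the standard library if the digamma function is approximated. The sketch below is an illustration under that assumption: the `digamma` helper (recurrence plus a truncated asymptotic series) and the function names are hypothetical, and a library routine for $\psi$ could be substituted.

```python
import math

def digamma(x):
    """Digamma psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:  # recurrence psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def entropy_bayes_posterior_mean(counts, a):
    """E(H) under a Dirichlet(a) prior, in nats, from bin counts y_i."""
    n, p = sum(counts), len(counts)
    total = n + p * a
    return sum((y + a) * (digamma(total + 1) - digamma(y + a + 1))
               for y in counts) / total

counts = [4, 2, 3, 0, 1]  # illustrative counts
print(entropy_bayes_posterior_mean(counts, a=1.0))
```

With no data and a = 1 for two bins, the formula gives $\psi(3) - \psi(2) = 1/2$, the average entropy of a uniformly distributed $\theta$.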




Bayesian Approach

Properties

The estimator $\hat{H}$ minimizes the mean squared error (MSE) with respect to H.

It is strongly influenced by the parameter a.




NSB prior

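The prior attempts to spread the probability density of $H(\theta)$ over the whole interval $[0, \log p]$ with near uniformity.

$$P_{NSB}(\theta) \propto \int da\, \frac{dH(a; p)}{da}\, P_a(\theta), \qquad (13)$$

where $\xi := H(a; p) = \psi_0(pa + 1) - \psi_0(a + 1)$ is the average entropy of distributions chosen from

$$P_a(\theta) = \frac{1}{Z(a; p)} \left[ \prod_i \theta_i^{a-1} \right] \delta\left( \sum_i \theta_i - 1 \right). \qquad (14)$$

The NSB entropy estimator is given by

$$\hat{H}^{NSB} = \frac{\int d\xi\, q(\xi, n)\, E[H_a(n)]}{\int d\xi\, q(\xi, n)}, \qquad (15)$$

where $q(\xi, n) = \frac{\Gamma[p a(\xi)]}{\Gamma[n + p a(\xi)]} \prod_i \frac{\Gamma[y_i + a(\xi)]}{\Gamma[a(\xi)]}$.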
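Equation (15) can be approximated numerically by integrating over a instead of $\xi$, using $d\xi = \frac{d\xi}{da}\, da$ with $\frac{d\xi}{da} = p\, \psi_1(pa + 1) - \psi_1(a + 1)$. The sketch below is a rough illustration, not a reference implementation: the log-spaced grid, its range, the simple Riemann sum, and the series approximations for digamma and trigamma are all assumptions of this example, and $E[H_a(n)]$ is taken to be the Dirichlet posterior mean entropy for fixed a.

```python
import math

def digamma(x):
    """Digamma psi(x), x > 0, via recurrence and asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252)))

def trigamma(x):
    """Trigamma psi_1(x), x > 0, via recurrence and asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42)))

def posterior_mean_entropy(counts, a):
    """Posterior mean entropy for a fixed Dirichlet parameter a."""
    total = sum(counts) + len(counts) * a
    return sum((y + a) * (digamma(total + 1) - digamma(y + a + 1))
               for y in counts) / total

def log_q(counts, a):
    """Logarithm of q(xi, n) expressed as a function of a."""
    n, p = sum(counts), len(counts)
    return (math.lgamma(p * a) - math.lgamma(n + p * a)
            + sum(math.lgamma(y + a) - math.lgamma(a) for y in counts))

def entropy_nsb(counts, a_min=1e-3, a_max=1e3, steps=400):
    """Riemann-sum approximation of (15) on a grid that is uniform in log a."""
    p = len(counts)
    if p < 2:
        raise ValueError("need at least two bins")
    log_w, values = [], []
    for s in range(steps + 1):
        a = a_min * (a_max / a_min) ** (s / steps)
        dxi_da = p * trigamma(p * a + 1) - trigamma(a + 1)
        # d(xi) = dxi_da * da, and da = a * d(log a) on the log-spaced grid
        log_w.append(log_q(counts, a) + math.log(dxi_da) + math.log(a))
        values.append(posterior_mean_entropy(counts, a))
    shift = max(log_w)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - shift) for lw in log_w]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(entropy_nsb([4, 2, 3, 0, 1]))
```

The constant grid spacing cancels between numerator and denominator, so it does not appear in the sum. Evaluating the inner quantities on several hundred grid points is also a reminder of why the calculation is comparatively slow.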




NSB prior

Properties:

good estimator

calculation is slow





Best Upper Bound (BUB) Estimator

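Idea: choose the polynomial of degree N that is the best estimator of the entropy in the space of all such polynomials.

$$\hat{H}^{BUB} = \sum_{j=0}^{N} a_{j,N}\, h_j, \qquad (16)$$

where $h_j := \sum_{i=1}^{p} 1_{\{n_i = j\}}$ are the count statistics.

Bounds on the bias and variance:

$\max_\theta \mathrm{Var}(\hat{H}^{BUB}) < N \max_{0 \le j < N} (a_{j+1,N} - a_{j,N})^2$,

$\max_\theta |B(\hat{H}^{BUB})| \le 2 M^*(f, \hat{H}_a)$,

where $B_{j,N}(\theta) := \binom{N}{j} \theta^j (1 - \theta)^{N-j}$ and $M^*(f, \hat{H}_a) := \sup_x \left( f(x) \left| H(x) - \sum_j a_{j,N} B_{j,N}(x) \right| \right)$.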
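The linear form (16) is easy to illustrate. The sketch below assumes that N is the number of samples and $n_i$ the count in bin i; the function names are hypothetical. It does not compute the optimized BUB coefficients. Instead it plugs in the coefficients $a_{j,N} = -\frac{j}{N} \log \frac{j}{N}$, for which (16) reduces to the maximum likelihood estimator, to show how the count statistics $h_j$ enter.

```python
import math

def count_statistics(counts):
    """h_j = number of bins that were observed exactly j times, j = 0..N."""
    big_n = sum(counts)
    h = [0] * (big_n + 1)
    for y in counts:
        h[y] += 1
    return h

def linear_entropy_estimate(counts, coeffs):
    """Estimator of the form sum_j a_{j,N} h_j, as in equation (16)."""
    return sum(a_j * h_j for a_j, h_j in zip(coeffs, count_statistics(counts)))

def ml_coefficients(big_n):
    """Coefficients that reproduce the plug-in estimator (a_0 = 0)."""
    return [0.0] + [-(j / big_n) * math.log(j / big_n)
                    for j in range(1, big_n + 1)]

counts = [4, 2, 3, 0, 1]  # illustrative counts, N = 10
print(count_statistics(counts))
print(linear_entropy_estimate(counts, ml_coefficients(sum(counts))))
```

Any estimator of this family is fully specified by its N + 1 coefficients, which is what makes the search for the best coefficients a finite-dimensional problem.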






Algorithm

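Set

$$f(\theta) = \begin{cases} m & \theta < 1/m \\ 1/\theta & \theta \ge 1/m \end{cases} \qquad (17)$$

and $c^*(f) = $ constant.

for $1 < k < K \ll N$:

set $a_{j,N} = -\frac{j}{N} \log \frac{j}{N} + \frac{1 - j/N}{2N}$ for all $j > k$

Calculate

$$a_{j \le k, N} = \left( X'X_{j \le k} + \frac{N}{c^*(f)^2} (D'D + I_k' I_k) \right)^{-1} \left( X'Y_{j \le k} + \frac{N a_{k+1}}{c^*(f)^2}\, e_k \right),$$

where $X'X = \langle B_{j,N} f, B_{k,N} f \rangle$ and $X'Y = \left\langle B_{j,N} f, \left( H - \sum_j a_{j,N} B_{j,N} \right) f \right\rangle$.

choose $a_{j,N}$ to minimize $\max_p B(\hat{H}_a)^2 + \max_p \mathrm{Var}_p(\hat{H}_a)$.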






Some results





B-Splines

A generalization to the classical binning method.

The data points can now be assigned to more than one bin i, with weights given by the B-spline functions $B_{i,k}$, where k is the spline order.





B-Spline

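Definition of a knot vector $t_i$ for a number of bins $i = 1, \dots, M$ and one given spline order $k = 1, \dots, M - 1$:

$$t_i := \begin{cases} 0 & i < k \\ i - k + 1 & k \le i \le M - 1 \\ M - 1 - k + 2 & i > M - 1 \end{cases} \qquad (18)$$

B-spline functions

$$B_{i,1}(z) := \begin{cases} 1 & t_i \le z < t_{i+1} \\ 0 & \text{else,} \end{cases}$$

$$B_{i,k}(z) := B_{i,k-1}(z)\, \frac{z - t_i}{t_{i+k-1} - t_i} + B_{i+1,k-1}(z)\, \frac{t_{i+k} - z}{t_{i+k} - t_{i+1}}.$$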





B-Splines

Algorithm

Input: variable x; bins $a_i$, $i = 1, \dots, M_x$; spline order k

Output: entropy

Determine knot vector t.

Determine $B_{i,k}(x) = B_{i,k}(z)$, where z is the transformation of x to $[0, M_x - k + 1]$.

Determine $M_x$ weighting coefficients for each $x_u$ from $B_{i,k}(x_u)$.

Sum over all $x_u$ and determine $p(a_i) = \frac{1}{N} \sum_{u=1}^{N} B_{i,k}(x_u)$ for each bin $a_i$.

Determine entropy $H(p)$.
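The steps above can be sketched as follows. This is a minimal illustration with several assumptions that are not on the slides: bins and knots are indexed from 0, terms with a zero denominator in the recursion are treated as 0, x is mapped linearly from its observed range, and the right end of the domain is nudged inside the last half-open knot interval. The function names are hypothetical.

```python
import math

def knot_vector(M, k):
    """Knots t_0 .. t_{M+k-1} following equation (18), with 0-based indices."""
    return [0 if i < k else (i - k + 1 if i <= M - 1 else M - k + 1)
            for i in range(M + k)]

def bspline(i, k, z, t):
    """B_{i,k}(z) by the recursion; 0/0 terms are taken as 0 (assumption)."""
    if k == 1:
        return 1.0 if t[i] <= z < t[i + 1] else 0.0
    left = right = 0.0
    d_left = t[i + k - 1] - t[i]
    if d_left > 0:
        left = (z - t[i]) / d_left * bspline(i, k - 1, z, t)
    d_right = t[i + k] - t[i + 1]
    if d_right > 0:
        right = (t[i + k] - z) / d_right * bspline(i + 1, k - 1, z, t)
    return left + right

def bspline_weights(x, x_min, x_max, M, k):
    """Weights of one data point x over the M bins."""
    t = knot_vector(M, k)
    z_max = M - k + 1
    z = (x - x_min) / (x_max - x_min) * z_max   # map x to [0, M - k + 1]
    z = min(max(z, 0.0), z_max - 1e-9)          # stay inside the last interval
    return [bspline(i, k, z, t) for i in range(M)]

def bspline_entropy(data, M, k):
    """Entropy (nats) of the bin probabilities p(a_i) = mean of B_{i,k}(x_u)."""
    x_min, x_max = min(data), max(data)
    p = [0.0] * M
    for x in data:
        for i, w in enumerate(bspline_weights(x, x_min, x_max, M, k)):
            p[i] += w / len(data)
    return -sum(q * math.log(q) for q in p if q > 0)

data = [0.0, 0.1, 0.5, 0.9, 1.0]  # illustrative data
print(bspline_entropy(data, M=4, k=3))
```

With k = 1 each point falls into exactly one bin, so the function reduces to the entropy of an ordinary histogram.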





B-Splines

Properties

The simple binning corresponds to k = 1.

The spline order determines the shape of the weight functions and thereby the number of bins each of the data points is assigned to.

All weights belonging to one data point sum up to unity.





B-Splines

Some results





Further work

Integrate estimators NSB, BUB, B-Spline in R package Minet

Question: How important is the estimator in network inference methods?

the already implemented estimators do not significantly change the outcome of the inference method

introducing NA values leads to robustness issues





Reference list

Jean Hausser

Improving Entropy Estimation and the Inference of Genetic Regulatory

Master Thesis, 2006.

Ilya Nemenman, Fariel Shafee and William Bialek

Entropy and Inference, Revisited

Adv. Neural Inf. Proc. Syst. 14, Cambridge, MA, 2002.

Carsten O Daub, Ralf Steuer, Joachim Selbig and Sebastian Kloska

Estimating mutual information using B-spline functions - an improved similarity measure for analyzing gene expression data

BMC Bioinformatics, 2004.

David H. Wolpert, David R. Wolf

Estimating Functions of Probability Distributions from a finite set of samples (Part 1 and Part 2)

1993.




Reference list

Juliane Schäfer and Korbinian Strimmer

A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics

Statist. Appl. Genet. Mol. Biol. 4: 32, 2005.

Liam Paninski

Estimation of Entropy and Mutual Information

Neural Computation 15(6): 1191-1253, 2003.





Dirichlet distribution

The probability density function with parameters $a_1, \dots, a_p$:

$$f(x_1, \dots, x_p; a_1, \dots, a_p) = \frac{1}{B(a)} \prod_{i=1}^{p} x_i^{a_i - 1}, \qquad (19)$$

where the normalizing constant is given by

$$B(a) = \frac{\prod_{i=1}^{p} \Gamma(a_i)}{\Gamma\left( \sum_{i=1}^{p} a_i \right)}. \qquad (20)$$

The Gamma function:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt \qquad (21)$$

If $a_1 = \dots = a_p$, then we denote the common value by a.





The calculation of λ∗

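$$E\left\{ \sum_{i=1}^{p} (\hat\theta_i - \theta_i)^2 \right\} = E\left\{ \sum_{i=1}^{p} (u_i^* - \theta_i)^2 \right\} = \sum_i \mathrm{Var}(u_i^*) + [E(u_i^* - \theta_i)]^2$$

$$= \sum_i \mathrm{Var}(\lambda t_i + (1 - \lambda) u_i) + [E(\lambda t_i + (1 - \lambda) u_i) - \theta_i]^2$$

$$= \sum_i \lambda^2 \mathrm{Var}(t_i) + (1 - \lambda)^2 \mathrm{Var}(u_i) + 2 \lambda (1 - \lambda) \mathrm{Cov}(u_i, t_i) + [\lambda E(t_i - u_i) + \mathrm{Bias}(u_i)]^2$$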





The calculation of λ∗

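Differentiating with respect to $\lambda$ and setting the result equal to zero gives

$$\lambda^* = \frac{\sum_i \left[ \mathrm{Var}(u_i) - \mathrm{Cov}(t_i, u_i) - \mathrm{Bias}(u_i)\, E(t_i - u_i) \right]}{\sum_i E[(t_i - u_i)^2]}$$

Take $\mathrm{Var}(u_i) := \frac{u_i (1 - u_i)}{n - 1}$. The bias of $u_i$ is zero, and the target $t_i = 1/p$ is constant, so $\mathrm{Cov}(t_i, u_i) = 0$. Then, with $u_i = y_i / n$,

$$\lambda^* = \frac{p \left( n^2 - \sum_i y_i^2 \right)}{(n - 1) \left( p \sum_i y_i^2 - n^2 \right)}. \qquad (22)$$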





Properties of λ∗

Properties:

the smaller the estimate's variance, the smaller $\lambda^*$

$\lambda^*$ depends on the correlation between the estimation errors of u and t

with increasing mean squared difference between u and t, $\lambda^*$ decreases (protects from misspecifying the target t)

if the estimator is biased towards the target, the shrinkage intensity is reduced

