Introduction to Entropy Estimation

Catharina Olsen

Département d'Informatique, Machine Learning Group

Université Libre de Bruxelles

6th February 2008


Content

Estimators:

Maximum Likelihood

Miller-Madow

Bayes

Shrinkage

Nemenman-Shafee-Bialek (NSB)

Best Upper Bounds (BUB)

using B-spline functions


Entropy

Setup:

experiment with $p$ possible outcomes $B_1, \dots, B_p$

$\theta_i$: probability for $B_i$ to occur, with $\sum_{i=1}^{p} \theta_i = 1$

repeat the experiment $n$ times; $B_i$ will occur $y_i$ times ($n = \sum_{i=1}^{p} y_i$)

The entropy of the resulting distribution is defined as

$$H(\theta) := -\sum_{i=1}^{p} \theta_i \log \theta_i \qquad (1)$$
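As a minimal sketch of definition (1): the function name `entropy`, the use of the natural logarithm, and the convention that terms with $\theta_i = 0$ contribute zero are assumptions made for this illustration, not part of the slides.

```python
import math

def entropy(theta):
    """Entropy H(theta) = -sum_i theta_i * log(theta_i), in nats."""
    if abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Assumed convention: outcomes with theta_i = 0 contribute nothing,
    # since x * log(x) -> 0 as x -> 0.
    return -sum(t * math.log(t) for t in theta if t > 0)

uniform = [0.25] * 4              # p = 4 equally likely outcomes
skewed = [0.7, 0.1, 0.1, 0.1]     # same p, less uncertainty

print(entropy(uniform))
print(entropy(skewed))
```

The uniform distribution attains the maximum value $\log p$; any other distribution on the same $p$ outcomes has lower entropy.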


Entropy Estimators - MLE / Miller-Madow

Maximum Likelihood Estimator (MLE)

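$$\hat{H}^{ML} := -\sum_{i=1}^{p} \hat{\theta}_i^{ML} \log \hat{\theta}_i^{ML} \qquad (2)$$

where $\hat{\theta}_i^{ML} := y_i / n$.

Property: $\hat{H}^{ML}$ is negatively biased everywhere, i.e. its expected value never exceeds $H(\theta)$, whatever the true $\theta$.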
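The negative bias can be checked exactly in a small case. The sketch below assumes $p = 2$ outcomes, so that the counts follow a binomial distribution and the expectation of the plug-in estimate can be summed over all outcomes; the helper names are hypothetical.

```python
import math

def entropy_ml(counts):
    """Plug-in (maximum likelihood) entropy estimate from bin counts."""
    n = sum(counts)
    return -sum((y / n) * math.log(y / n) for y in counts if y > 0)

def expected_ml_two_outcomes(theta1, n):
    """Exact E[H_ML] for p = 2, summing over all binomial outcomes."""
    total = 0.0
    for y1 in range(n + 1):
        prob = math.comb(n, y1) * theta1**y1 * (1 - theta1) ** (n - y1)
        total += prob * entropy_ml([y1, n - y1])
    return total

true_h = math.log(2)  # entropy of a fair coin
for n in (4, 16, 64):
    # The printed bias is negative and shrinks in magnitude as n grows.
    print(n, expected_ml_two_outcomes(0.5, n) - true_h)
```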


Entropy Estimators - MLE / Miller-Madow

Miller-Madow estimator

The MLE with a bias correction:

$$\hat{H}^{MM} := \hat{H}^{ML} + \frac{m - 1}{2n} \qquad (3)$$

Algorithm

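Sort the data points into $p$ bins; bin $i$ has $y_i$ occurrences, $i = 1, \dots, p$.

$n = \sum_i y_i$ is the number of data points.

$\hat{\theta}_i^{ML} = y_i / n$

Find the number $p_0$ of bins with zero probability and set $m = p - p_0$.

$\hat{H}^{MM} = -\sum_i \hat{\theta}_i^{ML} \log \hat{\theta}_i^{ML} + \frac{m - 1}{2n}$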
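The steps above can be sketched as follows, assuming the data have already been binned so that the input is the list of counts $y_i$; the function name and the example counts are illustrative only.

```python
import math

def entropy_miller_madow(counts):
    """Miller-Madow estimate: plug-in entropy plus (m - 1) / (2n)."""
    n = sum(counts)                            # number of data points
    theta_ml = [y / n for y in counts]         # ML probabilities y_i / n
    m = sum(1 for y in counts if y > 0)        # m = p - p0 (non-empty bins)
    h_ml = -sum(t * math.log(t) for t in theta_ml if t > 0)
    return h_ml + (m - 1) / (2 * n)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]     # hypothetical binned data
print(entropy_miller_madow(counts))
```

Because the correction term is non-negative, the result is never below the plug-in estimate, which counteracts the negative bias of the MLE.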


Entropy Estimators - Shrink

Shrinkage estimate

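$$\begin{aligned} \hat{\theta}_i^{shrink} &= \lambda t_i + (1 - \lambda) u_i & (4) \\ &= \begin{cases} \lambda^* \frac{1}{p} + (1 - \lambda^*) \frac{y_i}{n} & \lambda^* < 1 \\ \frac{1}{p} & \lambda^* \ge 1, \end{cases} & (5) \end{aligned}$$

with target $t_i = 1/p$ and unshrunk estimate $u_i = y_i / n$, where

$$\lambda^* = \frac{p \, (n^2 - \sum_i y_i^2)}{(n - 1)(p \sum_i y_i^2 - n^2)}, \qquad (6)$$

which minimizes the mean square error $E\{\sum_i (\hat{\theta}_i - \theta_i)^2\}$.

The calculation of $\lambda^*$ is given in the appendix.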
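A minimal sketch of equations (5) and (6) on binned counts. The function names are hypothetical, and returning $\lambda^* = 1$ when the denominator of (6) vanishes (a single observation, or counts that are already exactly uniform) is an assumption of this sketch rather than something stated on the slide.

```python
import math

def shrinkage_intensity(counts):
    """lambda* from equation (6), capped at 1 as in equation (5)."""
    n, p = sum(counts), len(counts)
    sum_sq = sum(y * y for y in counts)
    denom = (n - 1) * (p * sum_sq - n * n)
    if denom == 0:
        return 1.0  # assumption: fall back to the uniform target
    return min(1.0, p * (n * n - sum_sq) / denom)

def shrinkage_probs(counts):
    """theta_i = lambda* / p + (1 - lambda*) * y_i / n."""
    n, p = sum(counts), len(counts)
    lam = shrinkage_intensity(counts)
    return [lam / p + (1 - lam) * y / n for y in counts]

def entropy_shrink(counts):
    """Plug the shrunken probabilities into the entropy definition."""
    return -sum(t * math.log(t) for t in shrinkage_probs(counts) if t > 0)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical binned data
print(shrinkage_intensity(counts), entropy_shrink(counts))
```

Empty bins receive a strictly positive probability whenever $\lambda^* > 0$, which is what distinguishes this estimate from the plug-in one.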


Bayesian Approach

What is the Bayesian approach?

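Bayes:

$$P(\theta \mid n) = \frac{P(n \mid \theta) \, P(\theta)}{P(n)} \qquad (7)$$

$$P(n \mid \theta) \propto \prod_i \theta_i^{n_i}$$

Prior (Dirichlet distribution):

$$P(\theta) \propto \Delta(\theta) \prod_i \Theta(\theta_i) \, \theta_i^{a-1}$$

$$\delta(x) = \begin{cases} 1, & \text{if } x = 0, \\ 0, & \text{else,} \end{cases} \qquad \Theta(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{else,} \end{cases} \qquad \Delta(\theta) := \delta\Bigl(\sum_i \theta_i - 1\Bigr)$$

Instead of estimating the probabilities $\theta_i$, a function $F$ of $\theta$ is estimated:

$$\hat{F} = \int d\theta \, F(\theta) \, P(\theta \mid n). \qquad (8)$$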


Theory

How do we calculate the moments of a function F (θ)?

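The Bayesian estimator with uniform prior for $F^k(\theta)$ is given by $\frac{I[F^k(\theta), n]}{I[1, n]}$, where

$$I[F(\theta), n] := \int d\theta \, F(\theta) \, P(\theta) \prod_i \theta_i^{n_i}.$$

How is this evaluated?

$$I[1, n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)}$$

$$I[\log^{r_1}(\theta_1) \cdots \log^{r_m}(\theta_m), n] = \partial_{n_1}^{r_1} \cdots \partial_{n_m}^{r_m} I[1, n]$$

$$I[\log(\theta_u), n] = \frac{\prod_i \Gamma(n_i + 1)}{\Gamma(N + m)} \times \Delta\Phi^{(1)}(n_u + 1, N + m)$$

$\Rightarrow$ Bayes estimator for the entropy, $F(\theta) = H(\theta)$:

$$\frac{I[H(\theta), n]}{I[1, n]} = -\sum_i \frac{n_i + 1}{N + m} \, \Delta\Phi^{(1)}(n_i + 2, N + m + 1),$$

where $\Delta\Phi^{(n)}(z_1, z_2) := \partial_{z_1}^{n} \log \Gamma(z_1) - \partial_{z_2}^{n} \log \Gamma(z_2)$.

Here $n_i$ are the counts, $N = \sum_i n_i$ is the sample size and $m$ is the number of bins (written $y_i$, $n$ and $p$ on the other slides).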


Bayesian Approach

What estimators do we get for $F(\theta) = \theta$?

The Bayesian estimator for $\theta_i$ with uniform prior is given by

$$\hat{\theta}_i = \frac{y_i + 1}{n + p}. \qquad (9)$$

If the prior is chosen to be Dirichlet distributed with parameter $a$, then the estimator is given by

$$\hat{\theta}_i = \frac{y_i + a}{n + pa}. \qquad (10)$$


Entropy Estimators - Bayes

Bayesian estimator

$$\hat{\theta}_i^{Bayes} = \frac{y_i + a}{n + pa}, \qquad (11)$$

where $a$ is the parameter of the Dirichlet distribution, which is here assumed as the prior.

Algorithm

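Sort the data points into $p$ bins; bin $i$ has $y_i$ occurrences, $i = 1, \dots, p$.

$n = \sum_i y_i$ is the number of data points.

$\hat{\theta}_i^{Bayes} = \frac{y_i + a}{n + pa}$

$\hat{H}^{Bayes} = -\sum_i \hat{\theta}_i^{Bayes} \log \hat{\theta}_i^{Bayes}$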
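A minimal sketch of this algorithm, assuming binned counts as input; the function names and the example values of $a$ are chosen only for illustration.

```python
import math

def bayes_probs(counts, a):
    """Posterior-mean probabilities (y_i + a) / (n + p * a)."""
    n, p = sum(counts), len(counts)
    return [(y + a) / (n + p * a) for y in counts]

def entropy_bayes(counts, a):
    """Entropy of the Bayes probability estimates (plug-in form)."""
    return -sum(t * math.log(t) for t in bayes_probs(counts, a) if t > 0)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical binned data
for a in (0.5, 1.0):
    # Larger a pulls the probabilities towards 1/p and raises the estimate.
    print(a, entropy_bayes(counts, a))
```

With $a > 0$ every bin, including the empty ones, gets a positive probability, so the result depends visibly on the pseudocount when $n$ is small relative to $p$.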


Bayesian Approach

What properties does this estimator have?

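If $\lambda = \frac{pa}{pa + n}$, then $\hat{\theta}_i^{Bayes} = \hat{\theta}_i^{shrink}$, $i = 1, \dots, p$.

There is a one-to-one correspondence between $\lambda$ and $a$ in Dirichlet($a$):

$$a = \frac{n}{p} \left( \frac{\lambda}{1 - \lambda} \right). \qquad (12)$$

Common values for $a$:

$a \to 0$: maximum likelihood estimator

$a = \frac{1}{2}$: Krichevsky, Trofimov

$a = \frac{1}{p}$: Schürmann, Grassberger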
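The correspondence can be verified numerically. The sketch below uses hypothetical helper names and example counts; it checks that shrinking $y_i / n$ towards $1/p$ with $\lambda = pa / (pa + n)$ reproduces the Bayes estimate, and that equation (12) inverts the mapping.

```python
def bayes_probs(counts, a):
    """Bayes estimate (y_i + a) / (n + p * a)."""
    n, p = sum(counts), len(counts)
    return [(y + a) / (n + p * a) for y in counts]

def shrink_probs(counts, lam):
    """Shrinkage estimate lam / p + (1 - lam) * y_i / n for a given lam."""
    n, p = sum(counts), len(counts)
    return [lam / p + (1 - lam) * y / n for y in counts]

def lam_from_a(a, n, p):
    return p * a / (p * a + n)

def a_from_lam(lam, n, p):
    return (n / p) * lam / (1 - lam)     # equation (12)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical binned data
n, p = sum(counts), len(counts)
for a in (1 / p, 0.5, 1.0):
    lam = lam_from_a(a, n, p)
    gap = max(abs(b - s) for b, s in
              zip(bayes_probs(counts, a), shrink_probs(counts, lam)))
    print(a, lam, gap, a_from_lam(lam, n, p))
```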


Bayesian Approach

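What estimators do we get for $F(\theta) = H(\theta)$?

Taking the Dirichlet distribution with parameter $a$ as the prior, the estimator is

$$\begin{aligned} E(H) &= -\sum_{i=1}^{p} E(\theta_i \log(\theta_i)) \\ &= \frac{1}{n + pa} \sum_{i=1}^{p} (y_i + a) \bigl( \psi(n + pa + 1) - \psi(y_i + a + 1) \bigr), \end{aligned}$$

where $\psi(z) = \frac{d \ln \Gamma(z)}{dz}$ is the digamma function.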
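A sketch of this posterior-mean formula in plain Python. Since the standard library has no digamma function, the sketch includes its own approximation (recurrence plus asymptotic series); that helper and all names are assumptions of this illustration.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift up with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series for large arguments."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result

def entropy_posterior_mean(counts, a):
    """E(H) under a symmetric Dirichlet(a) prior, as in the formula above."""
    n, p = sum(counts), len(counts)
    total = n + p * a
    return sum((y + a) * (digamma(total + 1) - digamma(y + a + 1))
               for y in counts) / total

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical binned data
for a in (0.5, 1.0):
    print(a, entropy_posterior_mean(counts, a))
```

Note that this is the posterior mean of $H$ itself, which in general differs from the entropy of the posterior-mean probabilities $\hat{\theta}_i^{Bayes}$.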


Bayesian Approach

Properties

The estimator $\hat{H} = E(H)$ minimizes the mean squared error with respect to $H$.

It is strongly influenced by the parameter a.


NSB prior

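The prior attempts to spread the probability density of $H(\theta)$ over the whole interval $[0, \log p]$ with near uniformity.

$$P_{NSB}(\theta) \propto \int da \, \frac{dH(a; p)}{da} \, P_a(\theta), \qquad (13)$$

where $\xi := H(a; p) = \psi_0(pa + 1) - \psi_0(a + 1)$ is the average entropy of distributions chosen from

$$P_a(\theta) = \frac{1}{Z(a; p)} \Bigl[ \prod_i \theta_i^{a-1} \Bigr] \delta\Bigl( \sum_i \theta_i - 1 \Bigr). \qquad (14)$$

The NSB entropy estimator is given by

$$\hat{H}^{NSB} = \frac{\int d\xi \, q(\xi, n) \, E[H_a(n)]}{\int d\xi \, q(\xi, n)}, \qquad (15)$$

where $q(\xi, n) = \frac{\Gamma[p \, a(\xi)]}{\Gamma[n + p \, a(\xi)]} \prod_i \frac{\Gamma[y_i + a(\xi)]}{\Gamma[a(\xi)]}$.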
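Equation (15) can be approximated numerically. The sketch below is one possible discretization, not the authors' implementation: it assumes a uniform midpoint grid in $\xi$, inverts $\xi(a)$ by bisection on $\log a$, takes $E[H_a(n)]$ to be the Dirichlet posterior mean of the entropy for parameter $a(\xi)$, and works with $\log q$ for numerical stability. All function names, the grid size and the bisection bounds are assumptions.

```python
import math

def digamma(x):
    """psi(x) for x > 0 via recurrence and asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def avg_entropy(a, p):
    """xi(a) = psi(p*a + 1) - psi(a + 1), increasing from 0 to log(p)."""
    return digamma(p * a + 1) - digamma(a + 1)

def a_of_xi(xi, p):
    """Invert xi(a) by bisection on log(a) (bounds are an assumption)."""
    lo, hi = -25.0, 25.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if avg_entropy(math.exp(mid), p) < xi:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

def posterior_mean_entropy(counts, a):
    """E[H | a, counts] for a symmetric Dirichlet(a) prior."""
    n, p = sum(counts), len(counts)
    total = n + p * a
    return sum((y + a) * (digamma(total + 1) - digamma(y + a + 1))
               for y in counts) / total

def log_q(counts, a):
    """log of q(xi, n) from the slide, evaluated at a = a(xi)."""
    n, p = sum(counts), len(counts)
    return (math.lgamma(p * a) - math.lgamma(n + p * a)
            + sum(math.lgamma(y + a) - math.lgamma(a) for y in counts))

def entropy_nsb(counts, grid=200):
    """Midpoint-rule approximation of equation (15)."""
    p = len(counts)
    a_vals = [a_of_xi((j + 0.5) / grid * math.log(p), p) for j in range(grid)]
    logs = [log_q(counts, a) for a in a_vals]
    top = max(logs)                       # rescale to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    num = sum(w * posterior_mean_entropy(counts, a)
              for w, a in zip(weights, a_vals))
    return num / sum(weights)

print(entropy_nsb([4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]))
```

The repeated root finding and the sum over the grid illustrate why the calculation is more expensive than for the closed-form estimators.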


NSB prior

Properties:

good estimator

calculation is slow


Best Upper Bound (BUB) Estimator

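Idea: choose the polynomial of degree $N$ that is the best estimator of the entropy in the space of all those polynomials.

$$\hat{H}^{BUB} = \sum_{j=0}^{N} a_{j,N} \, h_j, \qquad (16)$$

where $h_j := \sum_{i=1}^{p} 1_{\{n_i = j\}}$ are the count statistics.

Bounds on the bias and variance:

$$\max_\theta \mathrm{Var}(\hat{H}^{BUB}) < N \max_{0 \le j < N} (a_{j+1,N} - a_{j,N})^2,$$

$$\max_\theta |B(\hat{H}^{BUB})| \le 2 M^*(f, \hat{H}_a),$$

where $B_{j,N}(\theta) := \binom{N}{j} \theta^j (1 - \theta)^{N-j}$ and

$$M^*(f, \hat{H}_a) := \sup_x \Bigl( f(x) \, \Bigl| H(x) - \sum_j a_{j,N} B_{j,N}(x) \Bigr| \Bigr).$$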
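The linear form (16) covers the earlier estimators as special cases, which the following sketch illustrates. It does not compute the BUB coefficients themselves; it only shows that the coefficient choice $a_{j,N} = -\frac{j}{N} \log \frac{j}{N}$ reproduces the plug-in estimate and that adding $\frac{1 - j/N}{2N}$ for $j \ge 1$ reproduces Miller-Madow. Setting $a_{0,N} = 0$ and all function names are assumptions of this illustration.

```python
import math
from collections import Counter

def count_statistics(counts):
    """h_j = number of bins whose count n_i equals j, for j = 0..N."""
    N = sum(counts)
    freq = Counter(counts)
    return [freq.get(j, 0) for j in range(N + 1)]

def linear_estimate(coeffs, h):
    """Estimator of the form sum_j a_{j,N} * h_j, as in equation (16)."""
    return sum(a * hj for a, hj in zip(coeffs, h))

def ml_coefficients(N):
    """Coefficients that reproduce the plug-in (ML) estimate."""
    return [0.0] + [-(j / N) * math.log(j / N) for j in range(1, N + 1)]

def miller_madow_coefficients(N):
    """ML coefficients plus (1 - j/N) / (2N) for j >= 1 (a_0 = 0 assumed)."""
    return [0.0] + [-(j / N) * math.log(j / N) + (1 - j / N) / (2 * N)
                    for j in range(1, N + 1)]

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical counts n_i
N = sum(counts)
h = count_statistics(counts)
print(h[:5])
print(linear_estimate(ml_coefficients(N), h))
print(linear_estimate(miller_madow_coefficients(N), h))
```

The second identity follows because $\sum_{j \ge 1} h_j = m$ and $\sum_j j \, h_j = N$, so the extra terms add up to $(m - 1) / (2N)$.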


Best Upper Bound (BUB) Estimator

Algorithm

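Set

$$f(\theta) = \begin{cases} m & \theta < 1/m \\ 1/\theta & \theta \ge 1/m \end{cases} \qquad (17)$$

and $c^*(f) = \text{constant}$.

For $1 < k < K \ll N$:

Set $a_{j,N} = -\frac{j}{N} \log \frac{j}{N} + \frac{1 - j/N}{2N}$ for all $j > k$.

Calculate

$$a_{j \le k, N} = \Bigl( X'X_{j \le k} + \frac{N}{c^*(f)^2} (D'D + I_k' I_k) \Bigr)^{-1} \Bigl( X'Y_{j \le k} + \frac{N a_{k+1}}{c^*(f)^2} e_k \Bigr),$$

where $X'X = \langle B_{j,N} f, B_{k,N} f \rangle$ and $X'Y = \bigl\langle B_{j,N} f, (H - \sum_j a_{j,N} B_{j,N}) f \bigr\rangle$.

Choose $a_{j,N}$ to minimize $\max_p B(\hat{H}_a)^2 + \max_p \mathrm{Var}_p(\hat{H}_a)$.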


Best Upper Bound (BUB) Estimator

Some results


B-Splines

A generalization of the classical binning method.

The data points can now be assigned to more than one bin $i$, with weights given by the B-spline functions $B_{i,k}$, where $k$ is the spline order.


B-Spline

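Definition of a knot vector $t_i$ for a number of bins $i = 1, \dots, M$ and one given spline order $k = 1, \dots, M - 1$:

$$t_i := \begin{cases} 0 & i < k \\ i - k + 1 & k \le i \le M - 1 \\ M - 1 - k + 2 & i > M - 1 \end{cases} \qquad (18)$$

B-spline functions:

$$B_{i,1}(z) := \begin{cases} 1 & t_i \le z < t_{i+1} \\ 0 & \text{else.} \end{cases}$$

$$B_{i,k}(z) := B_{i,k-1}(z) \, \frac{z - t_i}{t_{i+k-1} - t_i} + B_{i+1,k-1}(z) \, \frac{t_{i+k} - z}{t_{i+k} - t_{i+1}}$$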


B-Splines

Algorithm

Input: variable $x$; bins $a_i$, $i = 1, \dots, M_x$; spline order $k$.

Output: entropy.

Determine the knot vector $t$.

Determine $B_{i,k}(x) = B_{i,k}(z)$, where $z$ is the transformation of $x$ to $[0, M_x - k + 1]$.

Determine the $M_x$ weighting coefficients for each $x_u$ from $B_{i,k}(x_u)$.

Sum over all $x_u$ and determine $p(a_i) = \frac{1}{N} \sum_{u=1}^{N} B_{i,k}(x_u)$ for each bin $a_i$.

Determine the entropy $H(p)$.
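The algorithm can be sketched in plain Python as follows. Several details are assumptions of this sketch rather than statements from the slides: knots and basis functions are indexed from 0 (knots $t_0, \dots, t_{M+k-1}$, basis functions $B_{0,k}, \dots, B_{M-1,k}$), terms with a zero denominator in the recursion are treated as zero, $x$ is mapped linearly from its observed range, and the right end point is moved slightly inside the domain so that the half-open intervals of $B_{i,1}$ still cover it.

```python
import math

def knot_vector(M, k):
    """Knots t_0..t_{M+k-1} following equation (18) (0-based index assumed)."""
    t = []
    for i in range(M + k):
        if i < k:
            t.append(0)
        elif i <= M - 1:
            t.append(i - k + 1)
        else:
            t.append(M - 1 - k + 2)
    return t

def bspline(i, k, z, t):
    """B_{i,k}(z) by the recursion above; 0/0 terms count as zero."""
    if k == 1:
        return 1.0 if t[i] <= z < t[i + 1] else 0.0
    left_den = t[i + k - 1] - t[i]
    right_den = t[i + k] - t[i + 1]
    left = (z - t[i]) / left_den * bspline(i, k - 1, z, t) if left_den > 0 else 0.0
    right = (t[i + k] - z) / right_den * bspline(i + 1, k - 1, z, t) if right_den > 0 else 0.0
    return left + right

def bspline_entropy(xs, M, k):
    """Return (entropy, bin probabilities) for data xs, M bins, order k."""
    t = knot_vector(M, k)
    lo, hi = min(xs), max(xs)          # requires hi > lo
    z_max = M - k + 1
    probs = [0.0] * M
    for x in xs:
        z = (x - lo) / (hi - lo) * z_max
        z = min(z, z_max - 1e-9)       # keep the maximum inside the domain
        for i in range(M):
            probs[i] += bspline(i, k, z, t) / len(xs)
    h = -sum(q * math.log(q) for q in probs if q > 0)
    return h, probs

xs = [0.1, 0.4, 0.35, 0.8, 0.95, 0.5]  # hypothetical measurements
for k in (1, 2, 3):
    h, probs = bspline_entropy(xs, M=5, k=k)
    print(k, h, sum(probs))
```

For $k = 1$ each point falls into exactly one bin, so the sketch reduces to equal-width binning followed by the plug-in entropy.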


B-Splines

Properties

The simple binning corresponds to k = 1.

The spline order determines the shape of the weight functions and thereby the number of bins each of the data points is assigned to.

All weights belonging to one data point sum up to unity.


B-Splines

Some results


Further work

Integrate estimators NSB, BUB, B-Spline in R package Minet

Question: How important is the choice of estimator in network inference methods?

the already implemented estimators do not significantly change the outcome of the inference method

introducing NA values leads to robustness issues


Reference list

Jean Hausser

Improving Entropy Estimation and the Inference of Genetic Regulatory

Master Thesis, 2006.

Ilya Nemenman, Fariel Shafee and William Bialek

Entropy and Inference, Revisited

Adv. Neural Inf. Proc. Syst. 14, Cambridge, MA, 2002.

Carsten O Daub, Ralf Steuer, Joachim Selbig and Sebastian Kloska

Estimating mutual information using B-spline functions - an improved similarity measure for analyzing gene expression data

BMC Bioinformatics, 2004.

David H. Wolpert, David R. Wolf

Estimating Functions of Probability Distributions from a Finite Set of Samples (Part 1 and Part 2)

1993.


Reference list

Juliane Schäfer and Korbinian Strimmer

A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics

Statist. Appl. Genet. Mol. Biol. 4: 32, 2005.

Liam Paninski

Estimation of Entropy and Mutual Information

Neural Computation 15(6): 1191-1253, 2003.


Dirichlet distribution

The probability density function with parameters $a_1, \dots, a_p$:

$$f(x_1, \dots, x_p; a_1, \dots, a_p) = \frac{1}{B(a)} \prod_{i=1}^{p} x_i^{a_i - 1}, \qquad (19)$$

where the normalizing constant is given by

$$B(a) = \frac{\prod_{i=1}^{p} \Gamma(a_i)}{\Gamma\bigl(\sum_{i=1}^{p} a_i\bigr)}. \qquad (20)$$

The Gamma function:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt \qquad (21)$$

If $a_1 = \dots = a_p$, then we denote the values by $a$.


The calculation of λ∗

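$$\begin{aligned} E\Bigl\{ \sum_{i=1}^{p} (\hat{\theta}_i - \theta_i)^2 \Bigr\} &= E\Bigl\{ \sum_{i=1}^{p} (u_i^* - \theta_i)^2 \Bigr\} = \sum_i \mathrm{Var}(u_i^*) + [E(u_i^* - \theta_i)]^2 \\ &= \sum_i \mathrm{Var}(\lambda t_i + (1 - \lambda) u_i) + [E(\lambda t_i + (1 - \lambda) u_i) - \theta_i]^2 \\ &= \sum_i \lambda^2 \mathrm{Var}(t_i) + (1 - \lambda)^2 \mathrm{Var}(u_i) + 2 \lambda (1 - \lambda) \mathrm{Cov}(u_i, t_i) + [\lambda E(t_i - u_i) + \mathrm{Bias}(u_i)]^2 \end{aligned}$$

Here $u_i^* = \lambda t_i + (1 - \lambda) u_i$ denotes the shrinkage estimate.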


The calculation of λ∗

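Differentiating with respect to $\lambda$ and setting the derivative equal to zero gives

$$\lambda^* = \frac{\sum_i \mathrm{Var}(u_i) - \mathrm{Cov}(t_i, u_i) - \mathrm{Bias}(u_i) \, E(t_i - u_i)}{\sum_i E[(t_i - u_i)^2]}$$

Take $\mathrm{Var}(u_i) := \frac{u_i (1 - u_i)}{n - 1}$. The bias of $u_i$ is zero. Then

$$\lambda^* = \frac{p \, (n^2 - \sum_i y_i^2)}{(n - 1)(p \sum_i y_i^2 - n^2)}. \qquad (22)$$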
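The last step can be written out as follows. This worked reduction assumes the fixed uniform target $t_i = 1/p$ (so that $\mathrm{Var}(t_i) = \mathrm{Cov}(t_i, u_i) = 0$), $u_i = y_i / n$, and that the expectation $E[(t_i - u_i)^2]$ is replaced by its observed value $(t_i - u_i)^2$; the last point is an assumption made here to arrive at (22).

```latex
% Assumptions: t_i = 1/p fixed, u_i = y_i / n, Bias(u_i) = 0,
% E[(t_i - u_i)^2] estimated by the observed (t_i - u_i)^2.
\begin{align*}
\sum_i \mathrm{Var}(u_i)
  &= \frac{1}{n-1} \sum_i u_i (1 - u_i)
   = \frac{1}{n-1} \Bigl( 1 - \frac{\sum_i y_i^2}{n^2} \Bigr)
   = \frac{n^2 - \sum_i y_i^2}{n^2 (n-1)}, \\
\sum_i (t_i - u_i)^2
  &= \sum_i u_i^2 - \frac{2}{p} \sum_i u_i + \frac{1}{p}
   = \frac{\sum_i y_i^2}{n^2} - \frac{1}{p}
   = \frac{p \sum_i y_i^2 - n^2}{p \, n^2}, \\
\lambda^*
  &= \frac{n^2 - \sum_i y_i^2}{n^2 (n-1)} \cdot \frac{p \, n^2}{p \sum_i y_i^2 - n^2}
   = \frac{p \, (n^2 - \sum_i y_i^2)}{(n-1)(p \sum_i y_i^2 - n^2)}.
\end{align*}
```

Both sums use $\sum_i u_i = 1$; the factor $n^2$ cancels in the ratio, which leaves the expression in terms of the counts only.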


Properties of λ∗

Properties:

the smaller the variance of the estimate $u$, the smaller $\lambda^*$

$\lambda^*$ depends on the correlation between the estimation errors of $u$ and $t$

with increasing mean squared difference between $u$ and $t$, $\lambda^*$ decreases (this protects against misspecifying the target $t$)

if the estimator is biased towards the target, the shrinkage intensity is reduced
