Entropy Estimation and Applications to Decision Trees


Page 1: Entropy Estimation and Applications to Decision Trees

Entropy Estimation and Applications to Decision Trees

Page 2: Entropy Estimation and Applications to Decision Trees

Estimation

Distribution over K=8 classes. Repeat 50,000 times:

1. Generate N samples
2. Estimate the entropy from the samples

(A simulation sketch follows the figure below.)

[Figure: the class distribution over the 8 classes, and histograms of the plugin entropy estimate over 50,000 replicates for N=10, N=100, and N=50,000; the true entropy is H=1.289.]
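Below is a minimal sketch of this experiment in Python. The class distribution p is a placeholder drawn at random (the slide's actual distribution, with H=1.289, is not recoverable from the transcript), so the printed numbers will differ; the qualitative picture — negative bias and high variance at small N, concentration at large N — is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
p = rng.dirichlet(np.ones(K))           # hypothetical class distribution

def plugin_entropy(counts):
    """Plugin (maximum likelihood) entropy estimate from class counts."""
    q = counts / counts.sum()
    q = q[q > 0]                        # treat 0 * log 0 as 0
    return -np.sum(q * np.log(q))

def simulate(N, replicates):
    """Sampling distribution of the plugin estimate at sample size N."""
    counts = rng.multinomial(N, p, size=replicates)
    return np.array([plugin_entropy(c) for c in counts])

true_H = -np.sum(p * np.log(p))
for N in (10, 100, 50_000):
    est = simulate(N, replicates=1_000)  # fewer replicates than the slide, for speed
    print(f"N={N:6d}  mean={est.mean():.3f}  std={est.std():.3f}  true={true_H:.3f}")
```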

Page 3: Entropy Estimation and Applications to Decision Trees

Estimation

[Figure: histogram of the plugin entropy estimate (N=100, 50,000 replicates), with the true entropy and 2 standard deviation estimates shown.]

Estimating the true entropy

Goals:

1. Consistency: a large N guarantees the correct result
2. Low variance: the variation of the estimates is small
3. Low bias: the expected estimate should be correct

Page 4: Entropy Estimation and Applications to Decision Trees

Discrete Entropy Estimators

Page 5: Entropy Estimation and Applications to Decision Trees

Experimental Results

• UCI classification data sets
• Accuracy on the test set
• Plugin vs. Grassberger (a sketch of the Grassberger estimator follows below)
• Better trees

Source: [Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]
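For reference, a sketch of a Grassberger bias-corrected estimator next to the plugin estimate. This follows the commonly cited Grassberger (2003) form H_G = log N − (1/N) Σ_i h_i G(h_i) with G(h) = ψ(h) + ½(−1)^h (ψ((h+1)/2) − ψ(h/2)); whether the cited paper uses exactly this variant is an assumption here.

```python
import numpy as np
from scipy.special import digamma

def grassberger_entropy(counts):
    """Bias-corrected entropy estimate from class counts h_1, ..., h_K."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    # G(h) = psi(h) + (1/2) * (-1)^h * (psi((h+1)/2) - psi(h/2))
    sign = np.where(counts % 2 == 0, 1.0, -1.0)
    G = digamma(counts) + 0.5 * sign * (digamma((counts + 1) / 2) - digamma(counts / 2))
    return np.log(n) - np.sum(counts * G) / n

# Example: compare against the plugin estimate on a small sample.
counts = np.array([3, 1, 0, 2, 1, 0, 2, 1])
q = counts[counts > 0] / counts.sum()
print("plugin     :", -np.sum(q * np.log(q)))
print("grassberger:", grassberger_entropy(counts))
```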

Page 6: Entropy Estimation and Applications to Decision Trees

Differential Entropy Estimation

• In regression, the differential entropy
  – measures the remaining uncertainty about y
  – is a function of the distribution

$$H(q) = -\int q(y \mid x) \, \log q(y \mid x) \, \mathrm{d}y$$

• Problem
  – q is not from a parametric family

• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation

Page 7: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

• Multivariate Normal distribution
  – Estimate the covariance matrix of all y vectors
  – Plugin estimate of the entropy:

$$H(\hat{\mathcal{N}}) = \frac{d}{2} + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\hat{\Sigma}|$$

  – Uniform minimum variance unbiased estimator (UMVUE):

$$H(Y) = \frac{d}{2} \log e\pi + \frac{1}{2} \log \Big| \sum_{y \in Y} y y^T \Big| - \frac{1}{2} \sum_{j=1}^{d} \psi\!\Big(\frac{n+1-j}{2}\Big)$$

[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
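A sketch of both Gaussian estimates in Python. The UMVUE is implemented as written on the slide, which implicitly assumes centered data (the scatter matrix is the sum of y yᵀ over the samples); this reading of the slide, rather than the exact estimator of Ahmed & Gokhale, is an assumption.

```python
import numpy as np
from scipy.special import digamma

def gaussian_plugin_entropy(Y):
    """Plugin estimate: d/2 + (d/2) log 2*pi + (1/2) log |Sigma_hat|."""
    n, d = Y.shape
    cov = np.cov(Y, rowvar=False)
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(cov)[1]

def gaussian_umvue_entropy(Y):
    """UMVUE per the slide: (d/2) log(e*pi) + (1/2) log|sum_y y y^T|
       - (1/2) sum_{j=1..d} psi((n+1-j)/2), assuming centered data."""
    n, d = Y.shape
    scatter = Y.T @ Y                   # sum over samples of y y^T
    j = np.arange(1, d + 1)
    return (0.5 * d * np.log(np.e * np.pi)
            + 0.5 * np.linalg.slogdet(scatter)[1]
            - 0.5 * np.sum(digamma((n + 1 - j) / 2)))

rng = np.random.default_rng(0)
d, n = 3, 50
Y = rng.standard_normal((n, d))         # true entropy: (d/2) log(2*pi*e)
print("true   :", 0.5 * d * np.log(2 * np.pi * np.e))
print("plugin :", gaussian_plugin_entropy(Y))
print("umvue  :", gaussian_umvue_entropy(Y))
```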

Page 8: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

Page 9: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

Page 10: Entropy Estimation and Applications to Decision Trees

Solution 2: Non-parametric entropy estimation

• Minimal assumptions on the distribution
• Nearest neighbour estimate:

$$H_{1NN} = \frac{d}{n} \sum_{i=1}^{n} \log \rho_i + \log(n-1) + \gamma + \log V_d$$

  – $\rho_i$: nearest-neighbour distance of sample $i$
  – $\gamma$: Euler-Mascheroni constant
  – $V_d$: volume of the $d$-dimensional unit hypersphere

• Other estimators: KDE, spanning tree, k-NN, etc.

[Kozachenko, Leonenko, "Sample estimate of the entropy of a random vector", Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, "Nonparametric entropy estimation: An overview", 2001]
[Wang, Kulkarni, Verdú, "Universal estimation of information measures for analog sources", FnT Comm. Inf. Th., 2009]
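A sketch of the Kozachenko-Leonenko 1NN estimator, directly from the formula above; checking it against a standard normal sample (whose differential entropy is (d/2) log 2πe) is just an illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.spatial import cKDTree

def knn_entropy_1nn(Y):
    """H = (d/n) sum_i log rho_i + log(n-1) + gamma + log V_d."""
    n, d = Y.shape
    tree = cKDTree(Y)
    # k=2 because each point's nearest neighbour in the tree is itself.
    rho = tree.query(Y, k=2)[0][:, 1]
    log_Vd = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)  # unit-ball volume
    return (d / n) * np.sum(np.log(rho)) + np.log(n - 1) + np.euler_gamma + log_Vd

rng = np.random.default_rng(0)
d, n = 2, 2000
Y = rng.standard_normal((n, d))
print("true:", 0.5 * d * np.log(2 * np.pi * np.e))
print("1NN :", knn_entropy_1nn(Y))
```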

Page 11: Entropy Estimation and Applications to Decision Trees

Solution 2: Non-parametric estimation

Page 12: Entropy Estimation and Applications to Decision Trees

Experimental Results

[Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]

Page 13: Entropy Estimation and Applications to Decision Trees

Streaming Decision Trees

Page 14: Entropy Estimation and Applications to Decision Trees

Streaming Data

[Figure (repeated from earlier): histogram of the plugin entropy estimate (N=100, 50,000 replicates), with the true entropy and 2 standard deviation estimates shown.]

• “Infinite data” setting

• 10 possible splits and their scores
• When to stop and make a decision?

Page 15: Entropy Estimation and Applications to Decision Trees

Streaming Decision Trees

[Domingos, Hulten, "Mining High-Speed Data Streams", KDD 2000]
[Jin, Agrawal, "Efficient Decision Tree Construction on Streaming Data", KDD 2003]
[Loh, Nowozin, "Faster Hoeffding racing: Bernstein races via jackknife estimates", ALT 2013]

• Score splits on a subset of samples only

• Domingos/Hulten (Hoeffding Trees), 2000:
  – Compute the sample count n needed for a given precision (see the sketch below)
  – Streaming decision tree induction
  – Incorrect confidence intervals, but works well in practice

• Jin/Agrawal, 2003:
  – Tighter confidence interval; asymptotic derivation using the delta method

• Loh/Nowozin, 2013:
  – Racing algorithm (bad splits are removed early)
  – Finite-sample confidence intervals for entropy and Gini
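A minimal sketch of the Hoeffding-style stopping rule: split once the gap between the best and second-best split score exceeds the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). The scores, range R, and δ below are made-up placeholders; a real streaming learner maintains these statistics incrementally.

```python
import math

def hoeffding_bound(R, delta, n):
    """With prob. >= 1-delta, the true mean of n observations bounded in a
    range of width R lies within eps of the sample mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(scores, n, R=1.0, delta=1e-6):
    """Decide once the best split beats the runner-up by more than eps."""
    ranked = sorted(scores, reverse=True)
    eps = hoeffding_bound(R, delta, n)
    return ranked[0] - ranked[1] > eps

# Example: 10 candidate split scores estimated from n streamed samples.
scores = [0.41, 0.38, 0.22, 0.19, 0.15, 0.12, 0.09, 0.07, 0.05, 0.02]
for n in (100, 1_000, 10_000):
    print(n, should_split(scores, n))
```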

Page 16: Entropy Estimation and Applications to Decision Trees

Multivariate Delta Method

Theorem (multivariate delta method). Let $(T_n)$ be a sequence of $k$-dimensional random vectors such that $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then

$$\sqrt{n}\,\big(g(T_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0, \; \nabla g(\theta)^T \Sigma \, \nabla g(\theta)\big).$$

[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]
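To make the theorem concrete: for the plugin entropy g(p̂) of multinomial frequencies, ∇g(p) = −(1 + log p) and Σ = diag(p) − ppᵀ, which gives the asymptotic variance (Σ_i p_i log² p_i − H²)/n. The sketch below, with an arbitrary assumed p, compares this prediction with the empirical standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.2, 0.15, 0.1, 0.05, 0.05, 0.03, 0.02])
H = -np.sum(p * np.log(p))

def plugin_entropy(counts):
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

n, replicates = 1_000, 20_000
est = np.array([plugin_entropy(c) for c in rng.multinomial(n, p, size=replicates)])

# Delta method: gradient^T Sigma gradient / n (the constant terms cancel).
delta_var = (np.sum(p * np.log(p) ** 2) - H ** 2) / n
print("empirical std   :", est.std())
print("delta method std:", np.sqrt(delta_var))
```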

Page 17: Entropy Estimation and Applications to Decision Trees

Delta Method for the Information Gain

[Figure: two histograms over the 8 classes, one for each split choice (left/right).]

• 8 classes, 2 choices (left/right)
• $p_{s,i}$: probability of choice $s$, class $i$
• $I(p)$: mutual information (infogain)
• Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal

Multivariate delta method: for the plugin estimate $I(\hat{p})$ computed from the empirical frequencies $\hat{p}$, we have

$$\sqrt{n}\,\big(I(\hat{p}) - I(p)\big) \xrightarrow{d} \mathcal{N}\big(0, \; \nabla I(p)^T \Sigma \, \nabla I(p)\big).$$

[Small, "Expansions and Asymptotics for Statistics", CRC, 2010]
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]
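A sketch of the plugin infogain itself, computed from a 2×K table of empirical counts (the quantity whose asymptotic distribution the slide derives); the counts below are made up for illustration.

```python
import numpy as np

def plugin_infogain(counts):
    """I(S; C) = H(C) - sum_s p_s H(C | S=s) from a 2 x K count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()

    def H(q):
        q = q[q > 0] / q.sum()
        return -np.sum(q * np.log(q))

    p_s = counts.sum(axis=1) / n                 # marginal over the split choice
    return H(counts.sum(axis=0)) - np.sum([ps * H(row) for ps, row in zip(p_s, counts)])

counts = np.array([[30, 5, 2, 1, 0, 1, 0, 1],    # left branch, 8 classes
                   [2, 1, 25, 10, 3, 4, 3, 2]])  # right branch
print("plugin infogain:", plugin_infogain(counts))
```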

Page 18: Entropy Estimation and Applications to Decision Trees

Delta Method Example

As $n \to \infty$, the true infogain is fixed and the plugin estimate converges to it:

[Figure, left: plugin infogain estimate and standard deviation vs. sample size, 10,000 replicates, with the infogain estimate and the infogain truth shown. Right: asymptotic variance of the information gain, comparing the empirical stddev against the delta method stddev.]

Page 19: Entropy Estimation and Applications to Decision Trees

Conclusion on Entropy Estimation

• A statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• The distribution of the estimate is relevant in the streaming setting