Entropy Estimation and Applications to Decision Trees


Page 1: Entropy Estimation and Applications to Decision Trees

Entropy Estimation and Applications to Decision Trees

Page 2: Entropy Estimation and Applications to Decision Trees

Estimation

Distribution over K=8 classes. Repeat 50,000 times:

1. Generate N samples
2. Estimate the entropy from the samples

(A simulation sketch follows the figure below.)

[Figure: the class distribution over the 8 classes, and histograms of the plugin entropy estimate over 50,000 replicates for N=10, N=100, and N=50,000; the true entropy is H=1.289.]
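Below is a minimal sketch of this experiment in Python. The class distribution p is a placeholder drawn at random (the slide's actual distribution, with H=1.289, is not recoverable from the transcript), so the printed numbers will differ; the qualitative picture — negative bias and high variance at small N, concentration at large N — is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8
p = rng.dirichlet(np.ones(K))           # hypothetical class distribution

def plugin_entropy(counts):
    """Plugin (maximum likelihood) entropy estimate from class counts."""
    q = counts / counts.sum()
    q = q[q > 0]                        # treat 0 * log 0 as 0
    return -np.sum(q * np.log(q))

def simulate(N, replicates):
    """Sampling distribution of the plugin estimate at sample size N."""
    counts = rng.multinomial(N, p, size=replicates)
    return np.array([plugin_entropy(c) for c in counts])

true_H = -np.sum(p * np.log(p))
for N in (10, 100, 50_000):
    est = simulate(N, replicates=1_000)  # fewer replicates than the slide, for speed
    print(f"N={N:6d}  mean={est.mean():.3f}  std={est.std():.3f}  true={true_H:.3f}")
```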

Page 3: Entropy Estimation and Applications to Decision Trees

Estimation

[Figure: histogram of the plugin entropy estimate (N=100, 50,000 replicates), with the true entropy and 2 standard deviation estimates shown.]

Estimating the true entropy

Goals:

1. Consistency: a large N guarantees the correct result
2. Low variance: the variation of the estimates is small
3. Low bias: the expected estimate should be correct

Page 4: Entropy Estimation and Applications to Decision Trees

Discrete Entropy Estimators

Page 5: Entropy Estimation and Applications to Decision Trees

Experimental Results

• UCI classification data sets
• Accuracy on the test set
• Plugin vs. Grassberger (a sketch of the Grassberger estimator follows below)
• Better trees

Source: [Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", ICML 2012]
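For reference, a sketch of a Grassberger bias-corrected estimator next to the plugin estimate. This follows the commonly cited Grassberger (2003) form H_G = log N − (1/N) Σ_i h_i G(h_i) with G(h) = ψ(h) + ½(−1)^h (ψ((h+1)/2) − ψ(h/2)); whether the cited paper uses exactly this variant is an assumption here.

```python
import numpy as np
from scipy.special import digamma

def grassberger_entropy(counts):
    """Bias-corrected entropy estimate from class counts h_1, ..., h_K."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    # G(h) = psi(h) + (1/2) * (-1)^h * (psi((h+1)/2) - psi(h/2))
    sign = np.where(counts % 2 == 0, 1.0, -1.0)
    G = digamma(counts) + 0.5 * sign * (digamma((counts + 1) / 2) - digamma(counts / 2))
    return np.log(n) - np.sum(counts * G) / n

# Example: compare against the plugin estimate on a small sample.
counts = np.array([3, 1, 0, 2, 1, 0, 2, 1])
q = counts[counts > 0] / counts.sum()
print("plugin     :", -np.sum(q * np.log(q)))
print("grassberger:", grassberger_entropy(counts))
```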

Page 6: Entropy Estimation and Applications to Decision Trees

Differential Entropy Estimation

• In regression, the differential entropy
  – measures the remaining uncertainty about y
  – is a function of the distribution

$$H(q) = -\int q(y \mid x) \, \log q(y \mid x) \, \mathrm{d}y$$

• Problem
  – q is not from a parametric family

• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation

Page 7: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

• Multivariate Normal distribution
  – Estimate the covariance matrix of all y vectors
  – Plugin estimate of the entropy:

$$H(\hat{\mathcal{N}}) = \frac{d}{2} + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\hat{\Sigma}|$$

  – Uniform minimum variance unbiased estimator (UMVUE):

$$H(Y) = \frac{d}{2} \log e\pi + \frac{1}{2} \log \Big| \sum_{y \in Y} y y^T \Big| - \frac{1}{2} \sum_{j=1}^{d} \psi\!\Big(\frac{n+1-j}{2}\Big)$$

[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
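A sketch of both Gaussian estimates in Python. The UMVUE is implemented as written on the slide, which implicitly assumes centered data (the scatter matrix is the sum of y yᵀ over the samples); this reading of the slide, rather than the exact estimator of Ahmed & Gokhale, is an assumption.

```python
import numpy as np
from scipy.special import digamma

def gaussian_plugin_entropy(Y):
    """Plugin estimate: d/2 + (d/2) log 2*pi + (1/2) log |Sigma_hat|."""
    n, d = Y.shape
    cov = np.cov(Y, rowvar=False)
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(cov)[1]

def gaussian_umvue_entropy(Y):
    """UMVUE per the slide: (d/2) log(e*pi) + (1/2) log|sum_y y y^T|
       - (1/2) sum_{j=1..d} psi((n+1-j)/2), assuming centered data."""
    n, d = Y.shape
    scatter = Y.T @ Y                   # sum over samples of y y^T
    j = np.arange(1, d + 1)
    return (0.5 * d * np.log(np.e * np.pi)
            + 0.5 * np.linalg.slogdet(scatter)[1]
            - 0.5 * np.sum(digamma((n + 1 - j) / 2)))

rng = np.random.default_rng(0)
d, n = 3, 50
Y = rng.standard_normal((n, d))         # true entropy: (d/2) log(2*pi*e)
print("true   :", 0.5 * d * np.log(2 * np.pi * np.e))
print("plugin :", gaussian_plugin_entropy(Y))
print("umvue  :", gaussian_umvue_entropy(Y))
```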

Page 8: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

Page 9: Entropy Estimation and Applications to Decision Trees

Solution 1: parametric family

Page 10: Entropy Estimation and Applications to Decision Trees

Solution 2: Non-parametric entropy estimation

• Minimal assumptions on the distribution
• Nearest neighbour estimate:

$$H_{1NN} = \frac{d}{n} \sum_{i=1}^{n} \log \rho_i + \log(n-1) + \gamma + \log V_d$$

  – $\rho_i$: nearest-neighbour distance of sample $i$
  – $\gamma$: Euler-Mascheroni constant
  – $V_d$: volume of the $d$-dimensional unit hypersphere

• Other estimators: KDE, spanning tree, k-NN, etc.

[Kozachenko, Leonenko, "Sample estimate of the entropy of a random vector", Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, "Nonparametric entropy estimation: An overview", 2001]
[Wang, Kulkarni, Verdú, "Universal estimation of information measures for analog sources", FnT Comm. Inf. Th., 2009]
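A sketch of the Kozachenko-Leonenko 1NN estimator, directly from the formula above; checking it against a standard normal sample (whose differential entropy is (d/2) log 2πe) is just an illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.spatial import cKDTree

def knn_entropy_1nn(Y):
    """H = (d/n) sum_i log rho_i + log(n-1) + gamma + log V_d."""
    n, d = Y.shape
    tree = cKDTree(Y)
    # k=2 because each point's nearest neighbour in the tree is itself.
    rho = tree.query(Y, k=2)[0][:, 1]
    log_Vd = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)  # unit-ball volume
    return (d / n) * np.sum(np.log(rho)) + np.log(n - 1) + np.euler_gamma + log_Vd

rng = np.random.default_rng(0)
d, n = 2, 2000
Y = rng.standard_normal((n, d))
print("true:", 0.5 * d * np.log(2 * np.pi * np.e))
print("1NN :", knn_entropy_1nn(Y))
```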

Page 11: Entropy Estimation and Applications to Decision Trees

Solution 2: Non-parametric estimation

Page 12: Entropy Estimation and Applications to Decision Trees

Experimental Results

[Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]

Page 13: Entropy Estimation and Applications to Decision Trees

Streaming Decision Trees

Page 14: Entropy Estimation and Applications to Decision Trees

Streaming Data

[Figure (repeated from earlier): histogram of the plugin entropy estimate (N=100, 50,000 replicates), with the true entropy and 2 standard deviation estimates shown.]

• “Infinite data” setting

• 10 possible splits and their scores
• When to stop and make a decision?

Page 15: Entropy Estimation and Applications to Decision Trees

Streaming Decision Trees

[Domingos, Hulten, "Mining High-Speed Data Streams", KDD 2000]
[Jin, Agrawal, "Efficient Decision Tree Construction on Streaming Data", KDD 2003]
[Loh, Nowozin, "Faster Hoeffding racing: Bernstein races via jackknife estimates", ALT 2013]

• Score splits on a subset of samples only

• Domingos/Hulten (Hoeffding Trees), 2000:
  – Compute the sample count n needed for a given precision (see the sketch below)
  – Streaming decision tree induction
  – Incorrect confidence intervals, but works well in practice

• Jin/Agrawal, 2003:
  – Tighter confidence interval; asymptotic derivation using the delta method

• Loh/Nowozin, 2013:
  – Racing algorithm (bad splits are removed early)
  – Finite-sample confidence intervals for entropy and Gini
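A minimal sketch of the Hoeffding-style stopping rule: split once the gap between the best and second-best split score exceeds the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)). The scores, range R, and δ below are made-up placeholders; a real streaming learner maintains these statistics incrementally.

```python
import math

def hoeffding_bound(R, delta, n):
    """With prob. >= 1-delta, the true mean of n observations bounded in a
    range of width R lies within eps of the sample mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(scores, n, R=1.0, delta=1e-6):
    """Decide once the best split beats the runner-up by more than eps."""
    ranked = sorted(scores, reverse=True)
    eps = hoeffding_bound(R, delta, n)
    return ranked[0] - ranked[1] > eps

# Example: 10 candidate split scores estimated from n streamed samples.
scores = [0.41, 0.38, 0.22, 0.19, 0.15, 0.12, 0.09, 0.07, 0.05, 0.02]
for n in (100, 1_000, 10_000):
    print(n, should_split(scores, n))
```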

Page 16: Entropy Estimation and Applications to Decision Trees

Multivariate Delta Method

Theorem (multivariate delta method). Let $(T_n)$ be a sequence of $k$-dimensional random vectors such that $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then

$$\sqrt{n}\,\big(g(T_n) - g(\theta)\big) \xrightarrow{d} \mathcal{N}\big(0, \; \nabla g(\theta)^T \Sigma \, \nabla g(\theta)\big).$$

[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]
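To make the theorem concrete: for the plugin entropy g(p̂) of multinomial frequencies, ∇g(p) = −(1 + log p) and Σ = diag(p) − ppᵀ, which gives the asymptotic variance (Σ_i p_i log² p_i − H²)/n. The sketch below, with an arbitrary assumed p, compares this prediction with the empirical standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.2, 0.15, 0.1, 0.05, 0.05, 0.03, 0.02])
H = -np.sum(p * np.log(p))

def plugin_entropy(counts):
    q = counts[counts > 0] / counts.sum()
    return -np.sum(q * np.log(q))

n, replicates = 1_000, 20_000
est = np.array([plugin_entropy(c) for c in rng.multinomial(n, p, size=replicates)])

# Delta method: gradient^T Sigma gradient / n (the constant terms cancel).
delta_var = (np.sum(p * np.log(p) ** 2) - H ** 2) / n
print("empirical std   :", est.std())
print("delta method std:", np.sqrt(delta_var))
```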

Page 17: Entropy Estimation and Applications to Decision Trees

Delta Method for the Information Gain

[Figure: two histograms over the 8 classes, one for each split choice (left/right).]

• 8 classes, 2 choices (left/right)
• $p_{s,i}$: probability of choice $s$, class $i$
• $I(p)$: mutual information (infogain)
• Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal

Multivariate delta method: for the plugin estimate $I(\hat{p})$ computed from the empirical frequencies $\hat{p}$, we have

$$\sqrt{n}\,\big(I(\hat{p}) - I(p)\big) \xrightarrow{d} \mathcal{N}\big(0, \; \nabla I(p)^T \Sigma \, \nabla I(p)\big).$$

[Small, "Expansions and Asymptotics for Statistics", CRC, 2010]
[DasGupta, "Asymptotic Theory of Statistics and Probability", Springer, 2008]
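A sketch of the plugin infogain itself, computed from a 2×K table of empirical counts (the quantity whose asymptotic distribution the slide derives); the counts below are made up for illustration.

```python
import numpy as np

def plugin_infogain(counts):
    """I(S; C) = H(C) - sum_s p_s H(C | S=s) from a 2 x K count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()

    def H(q):
        q = q[q > 0] / q.sum()
        return -np.sum(q * np.log(q))

    p_s = counts.sum(axis=1) / n                 # marginal over the split choice
    return H(counts.sum(axis=0)) - np.sum([ps * H(row) for ps, row in zip(p_s, counts)])

counts = np.array([[30, 5, 2, 1, 0, 1, 0, 1],    # left branch, 8 classes
                   [2, 1, 25, 10, 3, 4, 3, 2]])  # right branch
print("plugin infogain:", plugin_infogain(counts))
```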

Page 18: Entropy Estimation and Applications to Decision Trees

Delta Method Example

As $n \to \infty$, the true infogain is fixed and the plugin estimate converges to it:

[Figure, left: plugin infogain estimate and standard deviation vs. sample size, 10,000 replicates, with the infogain estimate and the infogain truth shown. Right: asymptotic variance of the information gain, comparing the empirical stddev against the delta method stddev.]

Page 19: Entropy Estimation and Applications to Decision Trees

Conclusion on Entropy Estimation

• A statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• The distribution of the estimate is relevant in the streaming setting