

P. Wen et al. (Eds.): RSKT 2009, LNCS 5589, pp. 389-396, 2009.

© Springer-Verlag Berlin Heidelberg 2009

    Estimation of Mutual Information: A Survey

Janett Walters-Williams 1,2,* and Yan Li 2

1 School of Computing and Information Technology, University of Technology, Jamaica, Kingston 6, Jamaica W.I.

2 Department of Mathematics and Computing, Centre for Systems Biology, University of Southern Queensland, QLD 4350, Australia

[email protected], [email protected]

Abstract. A common problem found in statistics, signal processing, data analysis and image processing research is the estimation of mutual information, which tends to be difficult. The aim of this survey is threefold: an introduction for those new to the field, an overview for those working in the field and a reference for those searching for literature on different estimation methods. In this paper, comparison studies on mutual information estimation are considered. The paper starts with a description of entropy and mutual information and closes with a discussion of the performance of different estimation methods and some future challenges.

    Keywords: Mutual Information, Entropy, Density Estimation.

    1 Introduction

Mutual Information (MI) is a nonlinear measure that captures both linear and nonlinear correlations. MI is well known to be difficult to estimate, yet it is a natural measure of the dependence between random variables, taking into account the whole dependence structure of the variables.

There has been work on the estimation of MI, but up to 2008, to the best of our knowledge, there has been no research comparing the different categories of commonly used estimators. Given the varying results in the literature on these estimators, this paper seeks to draw conclusions on their performance up to 2008.

We aim to introduce and explain MI and to give an overview of the literature on different mutual information estimators. We start at the basics, with the definition of entropy and its interpretation. We then turn to mutual information, presenting its multiple forms of definition, its properties and the applications to which it is applied.

The survey classifies the estimators into two main categories: nonparametric density and parametric density. For each category we look at the commonly used types of estimators in that area. Finally, having considered a number of comparison studies, we discuss the results of years of research and also some challenges that still lie ahead.

    *Corresponding author.


The paper is organized as follows. In Section 2, the concept of entropy is introduced. Section 3 describes MI in general. In Section 4, different methods for the estimation of MI are presented. Section 5 describes comparison studies of different estimation methods, and finally Section 6 presents the discussion and conclusion.

    2 Entropy

The concept of entropy was developed in response to the observation that a certain amount of functional energy released from combustion reactions was always lost to dissipation or friction and thus not transformed into useful work. In 1948, while working at Bell Telephone Laboratories, the electrical engineer Claude Shannon set out to mathematically quantify the statistical nature of lost information in phone-line signals. To do this, Shannon developed the very general concept of information entropy, a fundamental cornerstone of information theory. He published his famous paper, A Mathematical Theory of Communication, containing a section on what he calls Choice, Uncertainty and Entropy. There he introduced an H function:

H = -K \sum_{i=1}^{k} p(i) \log p(i)    (1)

where K is a positive constant. Entropy works well when describing the order, uncertainty or variability of a single variable; however, it cannot work properly for more than one variable. This is where joint entropy, mutual information and conditional entropy come in.

(a) Joint entropy. The joint entropy of a pair of discrete random variables X and Y is defined as

H(X,Y) = -\sum_{x}\sum_{y} p(x,y) \log p(x,y)    (2)

where p(x,y) is the joint distribution of the variables. The chain rule for joint entropy states that the total uncertainty about the value of X and Y is equal to the uncertainty about X plus the (average) uncertainty about Y once you know X:

H(X,Y) = H(X) + H(Y|X)    (3)

(b) Conditional Entropy (Equivocation). Conditional entropy measures how much entropy a random variable X has remaining if the value of a second random variable Y is known. It is referred to as the entropy of X conditional on Y, written H(X|Y) and defined as:

H(X|Y) = -\sum_{x}\sum_{y} p(y)\, p(x|y) \log p(x|y)    (4)

The chain rule for conditional entropy is

H(Y|X) = H(X,Y) - H(X)    (5)


(c) Marginal Entropy. Marginal or absolute entropy is defined as:

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)    (6)

where n represents the number of events x_i with probabilities p(x_i) (i = 1, 2, ..., n).

Joint entropy, conditional entropy and marginal entropy together give the chain rule for entropy, which can be written as

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)    (7)
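To make Eqs. (2)-(7) concrete, the following minimal sketch (not from the paper) computes the marginal, joint and conditional entropies of a small, arbitrarily chosen discrete joint distribution and checks the chain rule numerically; the probability table p_xy and all names are illustrative assumptions.

```python
import numpy as np

# Arbitrary example joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.20, 0.10, 0.05],
                 [0.05, 0.30, 0.30]])

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

def entropy(p):
    """Shannon entropy in bits (Eqs. (1), (6)), skipping zero-probability terms."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X = entropy(p_x)              # marginal entropy, Eq. (6)
H_Y = entropy(p_y)
H_XY = entropy(p_xy.ravel())    # joint entropy, Eq. (2)
H_Y_given_X = H_XY - H_X        # conditional entropy via Eq. (5)

# Chain rule, Eq. (7): H(X,Y) = H(X) + H(Y|X)
assert np.isclose(H_XY, H_X + H_Y_given_X)
print(H_X, H_Y, H_XY, H_Y_given_X)
```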

    3 Mutual Information

Mutual Information (MI), also known as transinformation, was first introduced in classical information theory by Shannon in 1948. It is considered to be a nonparametric measure of relevance [1] that measures the mutual dependence of two variables, both linear and nonlinear, for which it has a natural generalization. It therefore looks at the amount of uncertainty that is lost from one variable when the other is known. MI, represented as I(X:Y), in truth measures the reduction in uncertainty in X which results from knowing Y, i.e. it indicates how much information Y conveys about X. Mutual information has the following properties:

(i) I(X:Y) = I(Y:X)

It is symmetric.

(ii) I(X:Y) ≥ 0

It is always non-negative between X and Y; the uncertainty of X cannot be increased by learning of Y.

It also has the following properties:

(iii) I(X:X) = H(X)

The information X contains about itself is the entropy of X.

(iv) I(X:Y) ≤ H(X) and I(X:Y) ≤ H(Y)

The information variables contain about each other can never be greater than the information in the variables themselves.

When I(X:Y) = 0, the information in X is in no way related to Y, i.e. no knowledge is gained about X when Y is given and vice versa; X and Y are then strictly independent. Mutual information can be calculated using entropy (considered to be the best way), using probability densities, or using the Kullback-Leibler divergence.
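As a hedged illustration of the routes just mentioned, the short sketch below (with illustrative probabilities, not taken from the paper) computes I(X:Y) once from the entropies, I(X:Y) = H(X) + H(Y) - H(X,Y), and once as the Kullback-Leibler divergence between p(x,y) and p(x)p(y), and confirms that the two agree.

```python
import numpy as np

p_xy = np.array([[0.20, 0.10, 0.05],    # illustrative joint distribution
                 [0.05, 0.30, 0.30]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Route 1: via entropies.
mi_entropy = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Route 2: via the Kullback-Leibler divergence D(p(x,y) || p(x)p(y)).
indep = np.outer(p_x, p_y)
mask = p_xy > 0
mi_kl = np.sum(p_xy[mask] * np.log2(p_xy[mask] / indep[mask]))

assert np.isclose(mi_entropy, mi_kl)
print(mi_entropy)                        # MI in bits
```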


    4 Estimating Mutual Information

MI is considered to be very powerful, yet it is difficult to estimate [2]. Estimation can therefore be unreliable, noisy and even biased. To use the definition in terms of entropy, the density has to be estimated. This problem has severely restricted the use of mutual information in ICA estimation and many other applications. In recent years researchers have designed different ways of estimating MI. Some researchers have used approximations of mutual information based on polynomial density expansions, which led to the use of higher-order cumulants. The approximation is valid, however, only when the density is not far from the Gaussian density function. More sophisticated approximations of mutual information have been constructed. Some have estimated MI by binning the coordinate axes, by the use of histograms, as well as by wavelets. All have, however, sought to estimate a density P(x) given a finite number of data points x_1, ..., x_N drawn from that density function. There are two basic approaches to estimation, parametric and nonparametric, and this paper surveys some of these methods.

Nonparametric estimation is a statistical method that allows the functional form of the regression function to be flexible. As a result, the procedures of nonparametric estimation have no meaningful associated parameters. Parametric estimation, by contrast, makes assumptions about the functional form of the regression, and the estimate is of those parameters that are free. Techniques for estimating MI include histogram-based, adaptive partitioning, spline, kernel density and nearest neighbour methods. The choices of the parameters in these techniques are often made blindly, i.e. no reliable measure is used for the choice. The estimation is, however, very sensitive to those parameters, especially in small, noisy sample conditions [1]. Nonparametric density estimators include the histogram-based estimator, adaptive partitioning of the XY plane, the kernel density estimator (KDE), the B-spline estimator, the k-nearest neighbour (KNN) estimator and the wavelet density estimator (WDE).
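As a rough sketch of the simplest of these, an equidistant histogram plug-in estimator can be written in a few lines; the bin count, sample size and test distribution below are illustrative assumptions, not settings taken from the surveyed papers.

```python
import numpy as np

def histogram_mi(x, y, bins=16):
    """Equidistant-histogram (plug-in) MI estimate in nats.

    The samples are binned on equal-width grids, the counts are normalised
    into a joint probability table and plugged into the discrete MI formula.
    The bin count is a free parameter chosen blindly here.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

# Illustrative check on correlated Gaussian data, where the true value
# is -0.5 * ln(1 - rho^2).
rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(5000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(5000)
print(histogram_mi(x, y), -0.5 * np.log(1 - rho**2))
```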

Parametric density estimation is normally referred to as parameter estimation. A given form for the density function is assumed, i.e. that the data come from a known family of distributions, such as the normal (Gaussian), log-normal, exponential or Weibull, and the parameters of the function (e.g. mean and variance) are then optimized by fitting the model to the data set. Parametric density estimators include the Bayesian estimator, the Edgeworth estimator, the maximum likelihood (ML) estimator and the least-squares estimator.
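For contrast, a minimal parametric sketch under an assumed jointly Gaussian model is shown below: the ML estimates of the model parameters reduce to the sample correlation, and the model's MI then has a closed form. The function name and test data are illustrative, and the estimate is only as trustworthy as the Gaussian assumption.

```python
import numpy as np

def gaussian_mi(x, y):
    """Parametric MI estimate (nats) for scalar x, y under a Gaussian model.

    Fitting a bivariate Gaussian by maximum likelihood amounts to taking the
    sample correlation rho; the fitted model's MI is -0.5 * ln(1 - rho^2).
    If the true distribution is far from Gaussian, the estimate is biased.
    """
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho**2)

# Illustrative check: for truly Gaussian data this matches the exact value.
rng = np.random.default_rng(1)
rho = 0.6
x = rng.standard_normal(2000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(2000)
print(gaussian_mi(x, y), -0.5 * np.log(1 - rho**2))
```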

    5 Comparison Studies of Mutual Information Estimation

By comparison studies we mean all papers written with the intention of comparing several different estimators of MI, as well as those that present a new method.

    5.1 Parametric vs. Nonparametric Methods

It is worth noting that it is not entirely accurate to say that nonparametric methods are "model-free" or free of distribution assumptions. For example, some kind of distance measure has to be used to identify the "nearest" neighbour. Although such methods do not assume a specific distribution, the distance measure is


distribution-related in some sense (Euclidean and Mahalanobis distances are closely related to the multivariate Gaussian distribution). Compared to parametric methods, nonparametric ones are only "vaguely" or "remotely" related to specific distributions and are therefore more flexible and less sensitive to violations of distribution assumptions.

Another characteristic found to be helpful in discriminating between the two is that the number of parameters in parametric models is fixed a priori and independent of the size of the dataset, while the number of statistics used for nonparametric models is usually dependent on the size of the dataset (e.g. more statistics for larger datasets).

    5.2 Comparison of Estimators

(a) Types of Histograms. Daub et al. [2] compared the two types of histogram-based estimators, equidistant and equiprobable, and found that the equiprobable histogram-based estimator is more accurate.

(b) Histogram vs AP. Trappenberg et al. [3] compared the equidistant histogram-based method, the adaptive histogram-based method of Darbellay and Vajda, and the Gram-Charlier polynomial expansion, and concluded that all three estimators gave reasonable estimates of the theoretical mutual information, but the adaptive histogram-based method converged faster with the sample size.

(c) Histogram vs KDE. Histogram-based methods and kernel density estimation are the two principal differentiable estimators of mutual information [4]; however, the histogram-based estimator is the simplest nonparametric density estimator and the one most frequently encountered, because it is easy to calculate and understand.

(d) B-Spline vs KDE. MI calculated from KDE does not show a linear behavior but rather an asymptotic one with a linear tail for large data sets. Values are slightly greater than those produced when using the B-spline method. According to Daub et al. [2], the B-spline method is computationally faster than the kernel density method and also improves on the simple binning method. It was found that B-splines also avoid the time-consuming numerical integration steps for which kernel density estimators are noted.

(e) B-Spline vs Histogram. In the classical histogram-based method, data points close to bin boundaries can cross over to a neighboring bin due to noise or fluctuations; in this way they introduce an additional variance into the computed estimate [5]. Even for sets of a moderate size, this variance is not negligible. To overcome this problem, Daub et al. [2] proposed a generalized histogram method, which uses B-spline functions to assign data points to bins. B-splines also somewhat alleviate the choice-of-origin problem of histogram-based methods by smoothing the effect of the transition of data points between bins due to shifts in origin.
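The following sketch illustrates the idea behind this generalized histogram: each sample's mass is shared smoothly between neighbouring bins through B-spline weights. For brevity it uses linear (order-2) B-splines and an arbitrary bin count, which are simplifying assumptions rather than the exact settings of Daub et al. [2].

```python
import numpy as np

def bspline_weights(x, bins=10):
    """Order-2 (linear) B-spline membership of each sample in each bin.

    Instead of falling entirely into one bin, every sample splits its unit
    weight between its two neighbouring bin centres, which softens the
    boundary effects of a classical histogram. Returns an (n_samples, bins)
    matrix whose rows sum to one.
    """
    t = (x - x.min()) / (x.max() - x.min()) * (bins - 1)   # map onto [0, bins-1]
    left = np.clip(np.floor(t).astype(int), 0, bins - 2)
    frac = t - left
    w = np.zeros((len(x), bins))
    rows = np.arange(len(x))
    w[rows, left] = 1.0 - frac
    w[rows, left + 1] = frac
    return w

def bspline_mi(x, y, bins=10):
    """Generalized-histogram MI estimate (nats) from B-spline bin weights."""
    wx, wy = bspline_weights(x, bins), bspline_weights(y, bins)
    p_xy = wx.T @ wy / len(x)            # smoothed joint bin probabilities
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / np.outer(p_x, p_y)[mask]))
```

It can be tested on the same correlated Gaussian data used for the plain histogram sketch earlier; the smoothed binning typically reduces the boundary-induced variance for moderate sample sizes.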

(f) B-Spline vs KNN. Rossi et al. [6] stated that estimating MI with B-splines reduces the cost of feature selection. When compared to KNN, it was found that KNN has a total complexity of O(n^3 p^2) (because the estimation is linear in the dimension n and quadratic in the number of data points p), while the worst-case complexity of the B-spline estimator is lower at O(n^3 p), thus giving a smaller computation time.

(g) WDE vs other Nonparametric Estimators. In statistics, amongst other applications, wavelets [7] have been used to build suitable nonparametric density estimators. A major drawback of classical series estimators is that they appear to be poor at estimating local properties of the density. This is due to the fact that orthogonal systems,


like the Fourier one, have poor time/frequency localization properties. On the contrary, wavelets are localized both in time and in frequency, making wavelet estimators well able to capture local features. Indeed, it has been shown that KDEs tend to be inferior to WDEs [8].

(h) KDE vs KNN. A practical drawback of the KNN-based approach is that the estimation accuracy depends on the value of k, and there seems to be no systematic strategy to choose the value of k appropriately. Kraskov et al. [9] created a KNN estimator and found that for Gaussian distributions KNN performed better. This was reinforced when Papana et al. [10] compared the two along with the histogram-based method. They found that KNN was computationally more effective when fine partitions were sought, due to the use of effective data structures in the search for neighbors. They concluded that KNN was the better choice, as fine partitions capture the fine structure of chaotic data and because KNN is not significantly corrupted by noise. They found, therefore, that the k-nearest neighbor estimator is more stable and less affected by the method-specific parameter.
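A compact sketch in the spirit of the Kraskov et al. [9] estimator (their first algorithm) is given below; the neighbour count k, the small shrink factor used to enforce a strict radius, and the function name are illustrative choices rather than prescriptions from the surveyed papers.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_mi(x, y, k=3):
    """KNN MI estimate in nats, following the Kraskov et al. [9] recipe.

    For each point, eps is the Chebyshev distance to its k-th neighbour in
    the joint (x, y) space; nx and ny count marginal neighbours strictly
    inside that radius, and MI = psi(k) + psi(N) - <psi(nx+1) + psi(ny+1)>.
    """
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    xy = np.hstack([x, y])

    # k+1 because each query point is returned as its own nearest neighbour.
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]

    x_tree, y_tree = cKDTree(x), cKDTree(y)
    shrink = 1.0 - 1e-10                 # approximate a strict '<' radius
    nx = np.array([len(x_tree.query_ball_point(x[i], eps[i] * shrink, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(y_tree.query_ball_point(y[i], eps[i] * shrink, p=np.inf)) - 1
                   for i in range(n)])

    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```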

(i) AP vs ML. Being a parametric technique, ML estimation is applicable only if the distribution which governs the data is known, the mutual information of that distribution is known analytically, and the maximum likelihood equations can be solved for that particular distribution. It is clear that ML estimators have an unfair advantage over any nonparametric estimator applied to data coming from such a distribution. Darbellay and Vajda [11] compared AP with ML. They found that adaptive partitioning appears to be asymptotically unbiased and efficient. They also found that, unlike ML, it is in principle applicable to any distribution and is intuitively easy to understand.

(j) ML vs KDE. Suzuki et al. [12] consider KDE to be a naive approach to estimating MI, since the densities p_xy(x,y), p_x(x) and p_y(y) are separately estimated from samples and the estimated densities are then used for computing MI. After evaluations, they stated that the bandwidth of the kernel functions could be optimized based on likelihood cross-validation, so there remain no open tuning parameters in this approach. However, density estimation is known to be a hard problem, and division by estimated densities is involved when approximating MI, which tends to expand the estimation error. ML does not have this estimation error.
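A minimal sketch of this naive KDE plug-in route, using scipy's Gaussian KDE with its default bandwidth rather than the likelihood cross-validation mentioned above, is shown here; all names and choices are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mi(x, y):
    """Naive KDE plug-in MI estimate (nats) for scalar samples x and y.

    The joint and marginal densities are estimated separately with Gaussian
    kernels and MI is approximated by the sample average of
    log p(x, y) - log p(x) - log p(y). The division by estimated densities
    tends to amplify the estimation error, as noted in the text.
    """
    p_xy = gaussian_kde(np.vstack([x, y]))
    p_x, p_y = gaussian_kde(x), gaussian_kde(y)
    log_ratio = (np.log(p_xy(np.vstack([x, y])))
                 - np.log(p_x(x)) - np.log(p_y(y)))
    return float(np.mean(log_ratio))
```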

(k) ML vs KNN. Using KNN as an estimator for MI means that there is no simple replacement of entropies with their estimates; rather, the estimator is designed to cancel the error of the individual entropy estimates. It has, however, no systematic strategy for choosing the value of k appropriately. Suzuki et al. [12] found that KNN works well if the value of k is optimally chosen, but there is no model selection method for determining the number of nearest neighbors. ML does not have this limitation.

(l) EDGE vs ML. Suzuki et al. [12] found that if the underlying distribution is close to the normal distribution, the approximation is quite accurate and the EDGE method works very well. However, if the distribution is far from the normal distribution, the approximation error gets large and the EDGE method therefore performs poorly and may be unreliable. In contrast, ML performs reasonably well for both kinds of distribution.

(m) EDGE vs KDE and Histogram. Research has shown that estimating differential entropy by the Edgeworth expansion avoids the density estimation problems, although it makes sense only for "close"-to-Gaussian distributions. Further research shows that the order of the Edgeworth approximation of differential entropy is O(N^{-3/2}), while the KDE approximation is of order O(N^{-1/2}), where N is the size of the processed sample. This means that KDE cannot be used for differential entropy, while the Edgeworth expansion of neg-entropy produces very good approximations also for higher-dimensional Gaussian distributions [5].

(n) ML vs Bayesian. ML is prone to overfitting [13]. This can occur when the size of the data set is not large enough compared to the number of degrees of freedom of the chosen model. The Bayesian method fixes this problem of ML in that it deals with how to determine the best number of model parameters. It is, therefore, very useful where large data sets are hard to come by, e.g. in neuroscience.

(o) LSMI vs KNN and KDE. Comparing KDE, KNN and LSMI, Suzuki et al. [14] found density estimation to be a hard problem, and therefore the KDE-based method may not be so effective in practice. Although KNN performed better than KDE, there was a problem in choosing the number k appropriately. Their research showed that LSMI overcame the limitations of both KDE and KNN.

(p) LSMI vs EDGE. Suzuki et al. [14] found that although EDGE is quite accurate and works well if the underlying distribution is close to the normal distribution, when the distribution is far from normal the approximation error gets large and EDGE becomes unreliable. LSMI is distribution-free and therefore does not suffer from these problems.

    6 Discussion and Conclusion

There have been many comparisons of different estimation methods. Table 1 shows the order of performance of those discussed within this paper. It can be shown that both KNN and KDE converge to the true probability density as N → ∞, provided that the cell volume V shrinks with N and k grows with N appropriately. It can be seen, therefore, that KNN and KDE truly outperform the histogram methods.

From the conclusions of the comparison studies it can be inferred that estimating MI by parametric density produces a more effective methodology. This is supported by researchers [5, 11, 12, 14] who have cast nonparametric methods into parametric frameworks, such as WDE or KDE into an ML framework, in doing so moving the problem into the parametric realm. When both categories are combined, the ranking of the methods discussed in this paper is: (i) Equidistant Histogram, (ii) Equiprobable Histogram, (iii) Adaptive Partitioning, (iv) Kernel Density, (v) K-Nearest Neighbor, (vi) B-Spline, (vii) Wavelet, (viii) Edgeworth, (ix) Least Squares, (x) Maximum Likelihood and (xi) Bayesian.

Table 1. Estimators based on performance in each category

    Nonparametric Density        Parametric Density
    Equidistant Histogram        Edgeworth
    Equiprobable Histogram       Least Square
    Adaptive Partitioning        Maximum Likelihood
    Kernel Density               Bayesian
    K-Nearest Neighbor
    B-Spline
    Wavelet


Since parametric methods appear to perform better, the question remains why nonparametric methods are still the methods of choice for most estimation problems.

To date there is still research into the development of new ways to estimate MI. Research will continue on: (i) their performance, (ii) the investigation of optimal parameters, (iii) linear and nonlinear datasets, and (iv) applications.

The challenge is how to create an estimation method that covers both parametric and nonparametric density methodologies and can still be applied effectively to most, if not all, applications. From the continuing interest in the measure it can be deduced that mutual information will remain popular in the near future. It is already a successful measure for many applications and it can undoubtedly be adapted and extended to aid in many more problems.

    References

1. François, D., Wertz, V., Verleysen, M.: Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pp. 239-244 (2006)

2. Daub, C.O., Steuer, R., Selbig, J., Kloska, S.: BMC Bioinformatics 5 (2004)

3. Trappenberg, T., Ouyang, J., Back, A.: Journal of Latex Class Files 1, 8 (2002)

4. Fransens, R., Strecha, C., Van Gool, L.: Proceedings of Theory and Applications of Knowledge-Driven Image Information Mining with Focus on Earth Observation (ESA-EUSC) (2004)

5. Hlaváčková-Schindler, K.: Information Theory and Statistical Learning. Springer Science & Business Media LLC (2009)

6. Rossi, F., François, D., Wertz, V., Meurens, M., Verleysen, M.: Chemometrics and Intelligent Laboratory Systems 86 (2007)

7. Aumônier, S.: Generalized Correlation Power Analysis. In: ECRYPT Workshop on Tools for Cryptanalysis (2007)

8. Vannucci, M.: Technical Report. Duke University (1998)

9. Kraskov, A., Stögbauer, H., Andrzejak, R.G., Grassberger, P.: http://arxiv.org/abs/q-bio/0311039

10. Papana, A., Kugiumtzis, D.: Nonlinear Phenomena in Complex Systems 11, 2 (2008)

11. Darbellay, G.A., Vajda, I.: IEEE Transactions on Information Theory 45, 4 (1999)

12. Suzuki, T., Sugiyama, M., Sese, J., Kanamori, T.: Proceedings of JMLR: Workshop and Conference 4 (2008)

13. Endres, D., Földiák, P.: IEEE Transactions on Information Theory 51, 11 (2005)

14. Suzuki, T., Sugiyama, M., Sese, J., Kanamori, T.: Proceedings of the Seventh Asia-Pacific Bioinformatics Conference (APBC) (2009)