

arXiv:1312.7214v1 [stat.ME] 27 Dec 2013

Copula Correlation: An Equitable And Consistently Estimable Measure For Association

A. Adam Ding and Li Yi∗

Abstract

Reshef et al. (Science, 2011) proposed an important new concept of equitability for good measures of association between two random variables. To this end, they proposed a novel measure, the maximal information coefficient (MIC). However, Kinney and Atwal (2013) showed that MIC in fact is not equitable. Others pointed out that MIC often has low power to detect association. We clarify the paradox on the equitability of MIC and mutual information (MI) by separating the equitability and estimability of the parameters. We prove that MI is not consistently estimable over a class of continuous distributions. Several existing association measures in the literature are studied from these two perspectives. We propose a new correlation measure, the copula correlation, which is half the L1-distance of the copula density from independence. Among the association measures studied, this new measure is the only one to be both equitable and consistently estimable. Simulations and a real data analysis confirm its good theoretical properties.

Key Words: Equitability; maximal information coefficient; Copula; rate of convergence; mutual information; distance correlation.

∗ A. Adam Ding is Associate Professor of Mathematics, Northeastern University, Boston, MA; email: [email protected]. Li Yi is PhD candidate of Mathematics, Northeastern University, Boston, MA; email: [email protected].


1 INTRODUCTION

With the advance of modern technology, the size of available data keeps exploding. Data

mining is increasingly used to keep up with the trend, and to explore complex relationships

among a vast number of variables. Nonlinear relationships are as important as linear ones in data exploration. Hence traditional measures such as Pearson's linear correlation coefficient are no longer adequate for today's big data analysis. Reshef et al.

(2011) proposed the concept of equitability, that is, a measure of dependence (association)

should give equal importance to linear and nonlinear relationships. For this purpose, they

proposed a novel maximal information coefficient (MIC) measure.

The MIC measure has stimulated great interest and further studies in the statistical community. Speed (2011) praised it as “a correlation for the 21st century”. It has been quickly

adopted by many researchers in data analysis. However, its mathematical and statistical properties have still not been studied very well. There are also criticisms of the measure based on those properties.

Simon and Tibshirani (2011) showed that MIC often has low power for detecting dependence in comparison to the distance correlation (dcor) proposed by Szekely et al. (2007).

The dcor asymptotically has the same power as Pearson’s correlation in the linear case,

and can also detect all nonlinear relationships asymptotically. Thus, Simon and Tibshirani

(2011) recommended dcor over MIC for detecting dependence among variables. However,

dcor does not have the equitable property.

Kinney and Atwal (2013) give a strict mathematical definition of the R2-equitability described in Reshef et al. (2011)'s MIC paper. They discovered that no non-trivial statistic

can be R2-equitable, and hence proposed a replacement definition of self-equitability. Inter-

estingly, the MIC is also not self-equitable, but mutual information (MI) is self-equitable.

This contradicts the observation that MI is not equitable but MIC is equitable in the simulation studies of Reshef et al. (2011) and Reshef et al. (2013). This paradox can be explained by the fact that the simulated finite-sample studies are also affected by the statistical errors

in estimators for MI and MIC. Hence we need to separate the studies of equitability of a

measure from its estimability.


We relate the study of equitability to another popular line of research on the copula – a

joint probability distribution with uniform marginals. Sklar’s Theorem decomposes any joint

probability distribution into two components: the marginal distributions and the copula.

The copula captures all the association information among the variables. Hence an equitable

dependence measure should be copula-based. We propose a definition of weak-equitability

that is equivalent to the requirement of copula-based measures. Therefore, it is natural

to reexpress the association measures in terms of the copula. While little attention has been paid to directly creating an association measure from the copula, many probability distances can be applied to the copula to create such measures. Some of those copula-based

probability distances have been used for testing of independence (Genest and Remillard,

2004; Genest et al., 2007; Kojadinovic and Holmes, 2009).

Recasting the problem in terms of copula provides additional insights. As Omelka et al.

(2009) and Segers (2012) pointed out, some standard mathematical conditions on the proba-

bility distributions do not hold for many common copulas. We need to study the convergence

of copula estimators under non-standard settings. Under a non-standard setting, we provide

a theoretical proof that the mutual information (MI)’s minimax risk is infinite. While MI is

known to be very hard to estimate under continuous distributions, this fact had not previously been established mathematically. Our result is the first theoretical result demonstrating that MI is

not consistently estimable for continuous random variables. The difficulty of accurately estimating MI explains how the theoretically equitable MI measure appears to be non-equitable

for finite sample data analysis in Reshef et al. (2011).

Based on the copula, we propose a new association measure, the copula correlation (Ccor),

which is defined as half the L1-distance of the copula density function from independence.

We theoretically study the MI, MIC, dcor, Ccor and several other dependence measures in

literature from the perspective of both equitability and estimability. The Ccor is the only

measure with the desirable property of being both self-equitable and consistently estimable.

This makes it a strong candidate as a dependence measure in data exploration.

In Section 2, we introduce the definition of several dependence measures and study their

equitability. We study the convergence of estimators for MI and Ccor in Section 3. Section 4


compares these dependence measures further through numerical examples. We then end the

paper with a summary discussion.

2 ASSOCIATION MEASURES AND THEIR EQUITABILITY

In this section, we review several classes of association measures D(X;Y) between two random variables X and Y in the literature and introduce our proposed new measure. We

review and strictly define the concept of equitability, that is, the association measure treats

all types of relationships equally. We conduct a systematic analysis for the equitability of

these association measures in this section. For simplicity, we will be focusing on continuous univariate random variables X and Y. Section 5 briefly discusses extensions to higher-dimensional and other types of random variables.

2.1 Several Classes Of Association Measures From Independence Characterization

The most commonly used association measure is Pearson's linear correlation coefficient ρ(X;Y) = Cov(X,Y)/√(Var(X)Var(Y)), where Cov(X,Y) denotes the covariance between X and Y, and Var(X) denotes the variance of X. The linear correlation coefficient ρ is good at characterizing linear relationships between X and Y: |ρ| = 1 for a perfectly deterministic linear relationship and ρ = 0 when X and Y are independent. However, for nonlinear relationships, dependent X and Y can also have a zero linear correlation coefficient ρ.
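As a quick numerical sketch of this point (our own toy example, not from the paper): take Y = X² with X placed symmetrically around zero. The relationship is perfectly deterministic, yet Pearson's ρ is exactly zero because Cov(X, X²) = E[X³] = 0.

```python
# Toy illustration: a deterministic nonlinear relationship with zero
# Pearson correlation (not from the paper).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

x = [-2, -1, 0, 1, 2]      # symmetric design points
y = [v * v for v in x]     # Y = X^2, fully determined by X
print(pearson(x, y))       # 0.0 despite perfect dependence
```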

Several classes of association measures have the desirable property that D(X ; Y ) = 0

if and only if X and Y are statistically independent. They use different but equivalent

mathematical characterizations of the statistical independence between X and Y of a similar

form:

fX,Y (x, y) = fX(x)fY (y) for all x, y. (1)

Here f_{X,Y} can be either the joint cumulative distribution function (CDF) F_{X,Y}(x,y) = Pr(X ≤ x, Y ≤ y), the joint characteristic function φ_{X,Y}(s,t) = E[e^{i(Xs+Yt)}] with E[·] denoting the expectation, or the joint probability density function p_{X,Y}. Then f_X and f_Y are the


corresponding marginal functions: CDFs F_X(x) = Pr(X ≤ x) and F_Y, or characteristic functions φ_X(s) = E[e^{iXs}] and φ_Y, or probability density functions p_X and p_Y.

Due to the characterization (1), it is natural to use a functional distance between the

joint function fX,Y and the product of marginal functions fXfY as an association measure

D(X;Y). Such a measure D(X;Y) equals zero if and only if f_{X,Y} = f_X f_Y always, i.e., X and Y are independent. The first class of association measures uses CDFs in the characterization (1). In that case, using the L∞ and L2 norms as functional distances leads to, respectively, the Kolmogorov-Smirnov criterion

KS(X;Y) = max_{x,y} |F_{X,Y}(x,y) − F_X(x)F_Y(y)|,   (2)

and the Cramer-von Mises criterion

CVM(X;Y) = ∫∫ |F_{X,Y}(x,y) − F_X(x)F_Y(y)|² F_X(dx)F_Y(dy)
         = ∫∫ |F_{X,Y}(x,y) − F_X(x)F_Y(y)|² p_X(x)p_Y(y) dx dy.   (3)
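The plug-in version of (2) can be sketched by replacing the CDFs with their empirical counterparts and maximizing over the sample points. This estimator form is our own illustration, not a construction from the paper:

```python
# Sketch of a plug-in Kolmogorov-Smirnov criterion (2) using empirical CDFs,
# maximized over the sample points (illustrative only).
def ks_empirical(xs, ys):
    n = len(xs)
    best = 0.0
    for x0, y0 in zip(xs, ys):
        fxy = sum(1 for x, y in zip(xs, ys) if x <= x0 and y <= y0) / n
        fx = sum(1 for x in xs if x <= x0) / n
        fy = sum(1 for y in ys if y <= y0) / n
        best = max(best, abs(fxy - fx * fy))
    return best

# For Y = X, F_{X,Y}(x, x) = F_X(x), so the criterion is max_t |t - t^2| = 0.25,
# consistent with the KS value 0.25 that Table 1 reports for Example D.
xs = [i / 100 for i in range(1, 101)]
print(ks_empirical(xs, xs))  # 0.25
```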

For the second class of association measures, using the characteristic functions in the characterization (1) leads to the distance covariance of Szekely and Rizzo (2009):

dCov²(X;Y) = ∫∫ [ |φ_{X,Y}(s,t) − φ_X(s)φ_Y(t)|² / (|s|²|t|²) ] dt ds.   (4)

The third class of association measures uses the probability density functions in the characterization (1). This class includes the popular mutual information (MI) criterion

MI(X;Y) = E_{X,Y}[log p_{X,Y}(X,Y) − log(p_X(X)p_Y(Y))]
        = ∫∫ [log p_{X,Y}(x,y) − log(p_X(x)p_Y(y))] p_{X,Y}(x,y) dx dy.   (5)

In the later subsection 2.4, we will see that these association measures from the first two

classes are not equitable. Hence we propose to consider more association measures in the

third class. Specifically, we consider the class of Copula-Distance

CD_α = ∫∫ |p_{X,Y}(x,y) − p_X(x)p_Y(y)|^α [p_X(x)p_Y(y)]^{1−α} dx dy,   (6)

for α > 0. The name comes from the fact (shown in subsection 2.3) that CDα is the Lα

distance of the copula density from the uniform density (i.e., independence) on the unit

square.


The association measures above have different ranges, making comparison among them

difficult. For example, the KS criterion’s value falls between 0 and 1, while the MI takes value

between 0 and ∞. For easy interpretation, it is customary to transform these measures into

correlation measures with values between 0 and 1. Szekely et al. (2007) defined the distance

correlation from the distance covariance in (4) as

dcor(X;Y) = dCov(X;Y) / √( dCov(X;X) dCov(Y;Y) ).   (7)

Similarly, we can define mutual information correlation from MI in (5) as (Joe, 1989)

MIcor = √(1 − e^{−2 MI}).   (8)
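A quick sanity check of the transform (8) (our own sketch, not from the paper): for a bivariate Gaussian with correlation ρ, the standard closed form MI = −(1/2) log(1 − ρ²) makes MIcor recover |ρ| exactly.

```python
import math

def micor(mi):
    """Transform mutual information into a [0, 1] correlation, eq. (8)."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi))

# For a bivariate Gaussian with correlation rho, MI = -0.5*log(1 - rho^2)
# (a standard closed form), so MIcor recovers |rho|:
for rho in (0.0, 0.3, 0.6, 0.9):
    mi = -0.5 * math.log(1.0 - rho ** 2)
    print(round(micor(mi), 10))  # prints rho
```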

For the Copula-Distance, we notice that 0 ≤ CD_1 ≤ 2. Hence we propose a new correlation measure, the copula correlation (Ccor), as

Ccor = (1/2) CD_1 = (1/2) ∫∫ |p_{X,Y}(x,y) − p_X(x)p_Y(y)| dx dy.   (9)
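To make (9) concrete, here is an illustrative numerical evaluation (our own toy example, not from the paper) for a "checkerboard" density on the unit square: p(x,y) = 2 when x and y fall in the same half of [0,1], and 0 otherwise. Both marginals are Uniform(0,1), so |p(x,y) − 1| = 1 everywhere and Ccor = 1/2.

```python
# Toy numerical check of definition (9) by midpoint-rule integration.
def p_joint(x, y):
    """Checkerboard joint density: 2 on the two diagonal quadrants, else 0."""
    return 2.0 if (x < 0.5) == (y < 0.5) else 0.0

def ccor_numeric(p, n=200):
    """Midpoint-rule approximation of 0.5 * integral of |p(x,y) - 1|."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x, y = (i + 0.5) * h, (j + 0.5) * h
            total += abs(p(x, y) - 1.0) * h * h
    return 0.5 * total

print(ccor_numeric(p_joint))  # approximately 0.5
```

For the independence density p(x,y) = 1 the same routine returns 0, matching the characterization that Ccor = 0 if and only if X and Y are independent.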

2.2 Parameters, Estimators and MIC

The association measures in Section 2.1 are all parameters. Sometimes the same names also

refer to the corresponding sample statistics. Let (X1, Y1), ..., (Xn, Yn) be a random sample

of size n from the joint distribution of (X,Y). Then the sample statistic ρ_n = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / √( Σ_{i=1}^n (X_i − X̄)² Σ_{i=1}^n (Y_i − Ȳ)² ) is also called Pearson's correlation coefficient.

In fact, ρ_n is an estimator for ρ, and converges at the parametric rate of n^{−1/2} assuming the random variables have finite first two moments. The first two classes of measures have natural empirical estimators, replacing CDFs and characteristic functions by their empirical versions. Particularly, Szekely et al. (2007) showed that the resulting dcor_n statistic is the sample correlation of centered distances between pairs of (X_i, Y_i) and (X_j, Y_j). The last class of association measures uses the probability density functions instead, and these are harder to estimate. For continuous X and Y, simply plugging in empirical density functions does not result in good estimators for the association measures. However, we will see in Section 2.4 that the first two classes of measures do not have the equitability property. Hence we will need to study the harder-to-estimate measures MIcor and Ccor.


The MIC introduced in Reshef et al. (2011) is in fact a definition of a sample statistic,

not a parameter. On the data set (X1, Y1), ..., (Xn, Yn), they first consider putting these n

data points into a grid G of bX × bY bins. Then the mutual information MIG for the grid

is computed from the empirical frequencies of the data on the grid. The MIC statistic is

defined as the maximum value of MI_G / log[min(b_X, b_Y)] over all possible grids G with the total number of bins b_X b_Y bounded by B = n^{0.6}. That is,

MIC_n = max_{b_X b_Y < B} MI_G / log[min(b_X, b_Y)].   (10)

The MICn is always bounded between 0 and 1 since 0 ≤ MIG ≤ log[min(bX , bY )].

The corresponding parameter MIC for the joint distribution of X and Y can be defined

as the limit of the sample statistic for big sample sizes, MIC = lim_{n→∞} MIC_n. We notice

that this definition depends on the tuning parameter B and the implicit assumption that

the limit exists. Hence the MIC parameter may change with different selections of B(n).

This is in contrast to the usual statistical literature where the parameter definition is fixed

but its estimator may contain some tuning parameter B(n). Because the MIC parameter is

only defined as a limit, the theoretical study on its mathematical properties is very hard.
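The normalization in (10) can be sketched as follows. Note this simplification searches only equal-frequency grids; the actual MIC statistic of Reshef et al. (2011) also optimizes the placement of the grid lines, so this is an illustration of the definition and its normalization, not a faithful implementation.

```python
import math

def mi_grid(xs, ys, bx, by):
    """Empirical mutual information MI_G on an equal-frequency bx-by grid."""
    n = len(xs)
    # rank of each observation (no ties assumed in this sketch)
    rx = {i: r for r, i in enumerate(sorted(range(n), key=lambda k: xs[k]))}
    ry = {i: r for r, i in enumerate(sorted(range(n), key=lambda k: ys[k]))}
    counts = {}
    for i in range(n):
        cell = (rx[i] * bx // n, ry[i] * by // n)
        counts[cell] = counts.get(cell, 0) + 1
    row = [0] * bx
    col = [0] * by
    for (a, b), c in counts.items():
        row[a] += c
        col[b] += c
    # MI_G = sum over cells of p_ab * log( p_ab / (p_a * p_b) )
    return sum((c / n) * math.log(c * n / (row[a] * col[b]))
               for (a, b), c in counts.items())

def mic_equifreq(xs, ys):
    """Simplified MIC (10): maximize MI_G / log(min(bx, by)) over
    equal-frequency grids with bx * by bounded by B = n^0.6."""
    n = len(xs)
    B = n ** 0.6
    best = 0.0
    for bx in range(2, int(B) + 1):
        for by in range(2, int(B) + 1):
            if bx * by <= B:
                best = max(best, mi_grid(xs, ys, bx, by) / math.log(min(bx, by)))
    return best

xs = [i / 100 for i in range(100)]
print(mic_equifreq(xs, xs))  # close to 1 for a noiseless monotone relationship
```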

As we introduce the strict mathematical definition for equitability in the next two subsections 2.3 and 2.4, we can see that equitability should be a property for the parameter

not for the statistic. For any data set, there are always statistical errors between the sample

statistic and the true parameter. A good association measure should be equitable and can

also be well estimated (i.e., the statistical errors can be controlled).

2.3 Copula and Weak-equitability

Equitability refers to the concept that an association measure D[X;Y] should remain

invariant under certain transformations of the random variables X and Y . For example, if

we change the unit of X (or Y), the values of X (or Y) change by a constant multiple, but this should not affect the association measure D[X;Y] at all. Similarly, if we apply a monotone

transformation on X (e.g. the commonly used logarithmic or exponential transformation),

then the association with Y should not be affected and the measure D[X ; Y ] should remain

the same. This leads to the following definition of weak-equitability.


Definition 1 A dependence measure D[X ; Y ] is weakly-equitable if and only if D[X ; Y ] =

D[f(X); Y ] whenever f is a strictly monotone continuous deterministic function.

The weak-equitability property is related to the popular copula concept. Sklar's theorem ensures that for any joint distribution F_{X,Y}, there exists a copula C – a probability

distribution on the unit square – such that

F_{X,Y}(x,y) = C[F_X(x), F_Y(y)] for all x, y.   (11)

The copula captures all the association between X and Y. For a continuous joint distribution F_{X,Y}, the copula C is also continuous, with copula density function c(u,v) = ∂²C(u,v)/∂u∂v for (u,v) ∈ [0,1] × [0,1].

We call a dependence measure D[X ; Y ] symmetric if D[X ; Y ] = D[Y ;X ] for all random

variables X and Y . Then a symmetric dependence measure D[X ; Y ] is weakly-equitable

if and only if D[X ; Y ] depends on the copula C(u, v) only and is not affected by the

marginals FX(x) and FY (y). Clearly the corresponding sample statistic for a symmetric

weakly-equitable D[X ; Y ] should be a rank-based statistic.

For the association measures mentioned above, the Kolmogorov-Smirnov criterion KS, the Cramer-von Mises criterion CVM, the mutual information MI, the Copula-Distance CD_α and the maximal information coefficient MIC are all weakly-equitable. The linear correlation coefficient ρ and the distance covariance dCov are not weakly-equitable. However, it is easy to get a weakly-equitable version of these statistics by calculating them on the ranks of X and Y. For example, the weakly-equitable version of the linear correlation coefficient ρ_n is Spearman's rank correlation

coefficient

ρ_{s,n} = Σ_{i=1}^n (R_{X,i} − R̄_X)(R_{Y,i} − R̄_Y) / √( Σ_{i=1}^n (R_{X,i} − R̄_X)² Σ_{i=1}^n (R_{Y,i} − R̄_Y)² ),   (12)

where R_{X,i} and R_{Y,i} denote the ranks of X_i and Y_i. The corresponding parameter is

ρ_s = lim_{n→∞} ρ_{s,n} = Cov(U,V) / √( Var(U) Var(V) ),   (13)

with U = F_X(X) and V = F_Y(Y) each following the univariate uniform distribution, and the copula C(u,v) is the joint distribution of U and V.
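A minimal sketch of (12) (illustrative, not the paper's code): Spearman's ρ_{s,n} is just Pearson's correlation computed on the ranks, which makes its weak-equitability evident, since a strictly monotone transformation of X leaves the ranks unchanged.

```python
import math

def ranks(vs):
    """Rank each value 1..n (no ties assumed in this sketch)."""
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    r = [0] * len(vs)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson's correlation of the rank vectors, eq. (12)."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    m = (n + 1) / 2  # mean rank
    num = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - m) ** 2 for a in rx) * sum((b - m) ** 2 for b in ry))
    return num / den

x = [0.1, 0.7, 0.3, 0.9, 0.5]
y = [2.0, 1.1, 1.8, 0.4, 1.5]
# The monotone map exp(.) leaves the ranks, and hence rho_s, unchanged:
print(spearman(x, y) == spearman([math.exp(v) for v in x], y))  # True
```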


We can rewrite the Copula-Distance of (6) in terms of the copula density function as

CD_α = ∫∫ |p_{X,Y}(x,y) − p_X(x)p_Y(y)|^α [p_X(x)p_Y(y)]^{1−α} dx dy = ∫₀¹ ∫₀¹ |c(u,v) − 1|^α du dv.   (14)

Hence the copula correlation (9) is in fact half the L1 distance of the copula density function

from independence

Ccor = (1/2) CD_1 = (1/2) ∫₀¹ ∫₀¹ |c(u,v) − 1| du dv.   (15)

The weakly-equitable statistics KS, CVM and MI in equations (2), (3) and (5) similarly

have copula-based equivalent definitions as follows:

KS = max_{0≤u≤1, 0≤v≤1} |C(u,v) − uv|,
CVM = ∫₀¹ ∫₀¹ |C(u,v) − uv|² du dv,
MI = ∫₀¹ ∫₀¹ log[c(u,v)] c(u,v) du dv.   (16)
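As a concrete illustration of (15) (our own example, not from the paper), take the Farlie-Gumbel-Morgenstern (FGM) copula, whose density is c(u,v) = 1 + θ(1 − 2u)(1 − 2v) for |θ| ≤ 1. Since ∫₀¹ |1 − 2u| du = 1/2, equation (15) gives the closed form Ccor = |θ|/8, which a midpoint-rule integration reproduces:

```python
# Ccor for the FGM copula density c(u,v) = 1 + theta*(1-2u)*(1-2v), |theta| <= 1.
# Closed form from (15): Ccor = 0.5 * |theta| * (1/2) * (1/2) = |theta| / 8.
def ccor_fgm(theta, n=400):
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        for j in range(n):
            v = (j + 0.5) * h
            c = 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)
            total += abs(c - 1.0) * h * h
    return 0.5 * total

print(round(ccor_fgm(0.8), 6))  # 0.1, i.e. 0.8 / 8
```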

We will focus on these copula-based equivalent versions in the rest of the paper, as they allow

better understanding of the issues at hand.

2.4 Self-equitable Measures

A weakly-equitable measure D(X;Y) remains invariant under strictly monotone continuous transformations f(·): D[X;Y] = D[f(X);Y]. This equitability concept can be extended

to the invariance under all transformations in the regression setting Y = f(X) + ε where ε

denotes the random noise that is independent of X conditional on f(X). That is, D(X ; Y )

remains invariant when f(X) contains all information of X about Y . Mathematically, this

leads to the following self-equitability definition.

Definition 2 (Kinney and Atwal, 2013) A dependence measure D[X ; Y ] is self-equitable if

and only if D[X;Y] = D[f(X);Y] whenever f is a deterministic function and X ↔ f(X) ↔ Y forms a Markov chain.

Kinney and Atwal (2013) showed that self-equitability is characterized by a commonly used inequality in information theory.


Definition 3 A dependence measure D[X ; Y ] satisfies the Data Processing Inequality

(DPI) if and only if D[X ; Y ] ≥ D[X ;Z] whenever the random variables X, Y, Z form a

Markov chain X ↔ Y ↔ Z.

All dependence measures satisfying DPI are self-equitable. Since mutual information

MI satisfies DPI, it is self-equitable. Similarly we derive the self-equitability of the Copula-

Distance CDα.

Theorem 1 The Copula-Distance CDα is self-equitable when α ≥ 1.

The proof of Theorem 1 is provided in Appendix A1.

As a direct result of Theorem 1, the copula correlation Ccor = (1/2)CD_1 is also self-equitable.

Examples MIcor Ccor MIC dcor KS CVM

A 0.94 0.625 1 0.56 0.19 0.0035

B 0.94 0.625 0.95 0.82 0.19 0.0074

C 0.94 0.625 1 0.87 0.25 0.0084

D 1 1 1 1 0.25 0.0111

E 0.97 0.750 1 0.94 0.25 0.0098

F 0.87 0.500 1 0.79 0.25 0.0069

Table 1: The values of several association measures on some examples. For each example distribution, the graph shows its probability density function: the white regions have zero density and the shaded regions have constant densities. The dark regions have densities twice as big as the densities on the light grey regions.

We illustrate the equitability/non-equitability of all association measures above on some

examples of simple probability distributions on the unit square. These examples are modified

from those in Kinney and Atwal (2013). When (X, Y ) follows the distribution in Example A,

(f(X), Y ) follows the distribution in Example B with the invertible piece-wise transformation

f(x) = x for 0 ≤ x ≤ 1/4; f(x) = x + 1/2 for 1/4 < x ≤ 1/2; f(x) = x − 1/4 for 1/2 < x ≤ 1. And (X, g(Y)) follows the distribution in Example C with the invertible piece-wise transformation g(x) = x for 0 ≤ x ≤ 1/4; g(x) = x + 1/4 for 1/4 < x ≤ 3/4; g(x) = x − 1/2 for 3/4 < x ≤ 1. It is easy to check that X ↔ f(X) ↔ Y

and X ↔ g(Y) ↔ Y are both Markov chains. Hence the self-equitable measures MIcor and Ccor remain constant across the first three examples A, B and C. The other measures MIC, dcor, KS and CVM do not have the same value across the first three examples, and thus are all not self-equitable.

The next three examples D-F show increasing noise levels. Correspondingly, a good correlation measure should show decreasing values in the order of D, E and F. Notice that MIC and KS remain constant across Examples D, E and F. Hence they fail to correctly reflect the noise levels as good correlation measures should.

Another desirable property for a correlation measure is that the deterministic relationship

should result in the maximal correlation value of 1. We can see that KS is far less than one

for the deterministic relationship Y = X in Example D. From these examples, we can see

that MIcor and Ccor are the only two self-equitable correlation measures with desirable

properties.

We note again that the mathematically strict definitions of equitability above are defined for the parameter D(X;Y), not for its estimator D_n. Equitability should be a property of the parameter. Since the estimator D_n varies from data set to data set, we should not define on it a property involving invariance under different models. The paradoxical observation that MIC is equitable but MI is not in Reshef et al. (2011) comes from their attempt to decide equitability on the estimator D_n instead. To see how that approach can be misleading, we plot the correlation measures Ccor and MIC for nine different functions

under different noise levels in Figure 1. Similar plots were used by Reshef et al. (2011) and

Reshef et al. (2013) to study the equitability of association measures.

The first two columns of graphs in Figure 1 plot the estimated Ccor and MIC values from simulated data sets of sizes n = 100 and n = 10000 respectively. We estimate Ccor using the

estimator in equations (20) and (A.9). The estimator is described in detail in Appendix A4.

The data is generated from the regression model Y = f(X)+ε with X uniformly distributed


[Figure 1 here: two rows of three panels plot Ccor (top row) and MIC (bottom row) against the noise level (0 to 2). The columns show the estimated values at n = 100, the estimated values at n = 10000, and the true parameter values; the nine functions A-I are distinguished in the legend. The true-MIC panel is marked with a "?" since the MIC parameter values cannot be computed directly.]

Figure 1: The correlation measures Ccor and MIC across nine different functions at variousnoise levels at different sample sizes.


on [0, 1] and the noise ε uniformly distributed on [−0.5l, 0.5l]. The nine functions are all

chosen to have maximum value of 1 and minimum value of 0 over the domain of X , and are

listed in Table 2. We chose the uniformly distributed noise so that the true Ccor values can

be easily calculated at noise levels l = 0, 0.1, ..., 2. These true values are plotted in the last

graph in the first row of Figure 1. To assess the variation of the association measures across

different functions under the same noise level as in Reshef et al. (2011), we need to look at the

last graph of the true values. The first two graphs of estimated values contain estimation

errors and the patterns change with the sample size. The increasing sample size reduces

the estimation errors and makes the plot of estimated values increasingly similar to the plot

of true values. However, we do not know beforehand what sample size is sufficient to make the patterns for estimated values similar enough to the patterns for true values. The

true parameter MIC value cannot be calculated directly for different noise levels as it is only defined as a limit of statistics. The patterns observed using simulated data sets of size n = 10,000 are still likely to be misleading. At least we know the last two functions (circle

and cross) should have true MIC = 1 in the noiseless case (l = 0). But the estimated MIC

values for those two functions are still less than 0.8 for n = 10, 000.

A  Linear               y = x
B  Quadratic            y = x²
C  Square Root          y = √x
D  Cubic                y = x³
E  Centered Cubic       y = 4(x − 1/2)³
F  Centered Quadratic   y = 4x(1 − x)
G  Cosine (Period 1)    y = (1/2)[cos(2πx) + 1]
H  Circle               (x − 1/2)² + y² = 1/4
I  Cross                (y − 1/2)² = (x − 1/2)²

Table 2: The functional relationships studied in Figure 1.

On a side note, the MIC calculation is also very computationally intensive for large data sets. We only plotted sample sizes up to n = 10,000 in Figure 1 since it is computationally prohibitive to generate a similar plot for MIC at n = 100,000. In contrast, we were able to

generate the plot for Ccor at n = 100, 000 in half an hour.


Reshef et al. (2011) observed large variation of MI among different functions at the same

noise level and concluded that mutual information is not an equitable measure. However,

these variations are in part due to the estimation errors. In the noiseless situation, MI = ∞ (correspondingly, MIcor = 1) for these functions, and the large variations are purely due to the statistical error of the MI estimator. In the next section, we show that MI cannot be consistently

estimated over a class of continuous copulas. This explains the contradictory claims about

the equitability of mutual information MI in Reshef et al. (2011) and Kinney and Atwal

(2013). We also show that the proposed copula correlation Ccor is consistently estimable

and we consider it the candidate of choice for an equitable correlation measure.

3 STATISTICAL ERROR IN THE ASSOCIATION MEASURE ESTIMATION

We now turn our attention to the statistical errors in estimating the association measures.

Particularly we focus on the two self-equitable measures MI and Ccor.

From equation (16), the KS and CVM are defined through the underlying copula function

C(u, v). We use the notations KS(C) and CVM(C) to emphasize their dependence on the

copula C. Then we can estimate them by the plug-in estimators KS(C_n) and CVM(C_n), where C_n(u,v) denotes the empirical estimator of the copula function C(u,v). Since C_n(u,v) converges to C(u,v) at the parametric rate of n^{−1/2} (Omelka et al., 2009; Segers, 2012), KS and CVM can also be estimated at the parametric rate of n^{−1/2}.

In contrast, the measures MI and Ccor are defined through the copula density function c(u,v) in equations (15) and (16). For discrete distributions, the empirical density function c_n(u,v) also converges to the true density c(u,v) at the rate of n^{−1/2}. Hence the plug-in estimator MI(c_n) converges to MI at the parametric rate of n^{−1/2} (Joe, 1989) for discrete distributions. For continuous distributions, a kernel density estimator c_n(u,v) (Moon et al., 1995; Khan et al., 2007; Reshef et al., 2011) has been used in the plug-in estimator MI(c_n). However, such an estimator does not converge to MI at the parametric rate of n^{−1/2} for continuous distributions.

The convergence of density estimators has been well studied in the literature. With


bounded m-th derivatives, the two-dimensional kernel density estimator converges to the true density at the rate of n^{−(m+1)/(2m+4)} (Silverman, 1986; Scott, 1992). However, the copula density function is harder to estimate because it is often unbounded, hence its derivatives

are also unbounded. For example, the commonly used bivariate Gaussian density is bounded.

However, after transforming to the unit square through the copula decomposition (11), the corresponding Gaussian copula density is unbounded except in the independence case. The

copula density cannot be accurately estimated where its value c(u,v) is big. Because the mutual information MI overweights the region where c(u,v) is big, it cannot be consistently estimated. On the other hand, our proposed Ccor does not need accurate density estimation in the region where c(u,v) is big. We show that MI is not consistently estimable in subsection 3.1, and provide a consistent estimator of Ccor in subsection 3.2.
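To see the unboundedness concretely (a standard calculation, not reproduced in the paper): along the diagonal u = v = Φ(x), the Gaussian copula density with correlation 0 < ρ < 1 equals exp(ρx²/(1+ρ))/√(1−ρ²), which diverges as x → ∞, i.e. as u → 1.

```python
import math

def gauss_copula_density_diag(x, rho):
    """Gaussian copula density at (u, u) with u = Phi(x); standard closed form
    obtained by dividing the bivariate normal density by the product of the
    marginal normal densities."""
    return math.exp(rho * x * x / (1.0 + rho)) / math.sqrt(1.0 - rho ** 2)

# The density grows without bound toward the corner u = v = 1:
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(round(gauss_copula_density_diag(x, 0.5), 3))
```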

3.1 The Mutual Information Is Not Consistently Estimable

We study the minimax rate of convergence for mutual information for continuous copula

distributions. Starting from Farrell (1972), the minimax rate of convergence for density

estimation has been studied for the class of functions whose derivatives satisfy the Holder

condition. Since the minimax rate agrees with the achievable convergence rate by the kernel

estimator, it is also the optimal convergence rate of density estimation under those Holder

classes.

For simplicity, let us consider imposing the Holder condition on the copula density itself

(i.e., on the 0-th derivative). That is, the density function satisfies

|c(u_1, v_1) − c(u_2, v_2)| ≤ M_1 ‖(u_1 − u_2, v_1 − v_2)‖   (17)

for a constant M1 and all u1, v1, u2, v2 values between 0 and 1. Here and in the following

‖ · ‖ refers to the Euclidean norm. However, most commonly used continuous copula density

functions are unbounded (Omelka et al., 2009; Segers, 2012) and hence do not satisfy the

Holder condition (17). Therefore, we need to investigate the problem with some nonstandard

conditions.

Since the Hölder condition cannot hold on the region where c(u, v) is large, we impose it only on the region where the copula density is small. Specifically, we assume that the Hölder condition (17) holds only on the region A_M = {(u, v) : c(u, v) < M} for a constant M > 1. That is, |c(u1, v1) − c(u2, v2)| ≤ M1 ‖(u1 − u2, v1 − v2)‖ whenever (u1, v1) ∈ A_M and (u2, v2) ∈ A_M. This condition is satisfied by all common continuous copulas listed in the book by Nelsen (2006). For example, all Gaussian copulas satisfy the Hölder condition (17) on A_M for some constants M > 1 and M1 > 0. If (17) holds on A_M for particular M and M1 values, then it also holds on A_M for all smaller M values and all larger M1 values. Without loss of generality, we assume that M is close to 1 and M1 is a large constant.
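To see concretely why the Hölder condition can only be imposed on the low-density region, the following sketch (ours, not part of the paper; it uses SciPy and the standard closed form of the Gaussian copula density) evaluates the Gaussian copula density along the diagonal and shows it growing without bound as (u, v) approaches the corner (1, 1):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, v, rho):
    """Standard closed-form Gaussian copula density c(u, v) with correlation rho."""
    x, y = norm.ppf(u), norm.ppf(v)
    log_c = (-0.5 * np.log(1.0 - rho ** 2)
             - (rho ** 2 * (x ** 2 + y ** 2) - 2.0 * rho * x * y)
             / (2.0 * (1.0 - rho ** 2)))
    return np.exp(log_c)

# Along the diagonal u = v -> 1 the density diverges for rho > 0,
# so the copula density cannot satisfy a uniform Holder (or even boundedness)
# condition over the whole unit square.
us = np.array([0.9, 0.99, 0.999, 0.9999])
vals = gaussian_copula_density(us, us, rho=0.5)
print(vals)  # strictly increasing, unbounded as u -> 1
```

The function name and the grid of u-values are our illustrative choices; the density formula itself is the standard one.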

Let C denote the class of continuous copulas whose density satisfies the Hölder condition (17) on A_M. We can then study the minimax risk for estimating MI(C) for C ∈ C. Without loss of generality, we consider a data set {(U1, V1), ..., (Un, Vn)} consisting of independent observations from a copula distribution C ∈ C.

Theorem 2 Let MIn be any estimator of the mutual information MI in equation (16) based on the observations (U1, V1), ..., (Un, Vn) from a copula distribution C ∈ C, and let MIcorn be any estimator of the MIcor in equation (8). Then

sup_{C∈C} E[|MIn(C) − MI(C)|] = ∞,  and  sup_{C∈C} E[|MIcorn(C) − MIcor(C)|] ≥ a2 > 0,   (18)

for a positive constant a2.

The proof of Theorem 2 uses a method of Le Cam (Le Cam, 1973, 1986), finding a pair of copulas that are the hardest to distinguish. That is, we can find a pair of copulas C1 and C2 in the class C such that C1 and C2 are arbitrarily close in Hellinger distance but their mutual informations are very different. Then no estimator can estimate MI well at both copulas C1 and C2, leading to a lower bound on the minimax risk. The detailed proof is provided in Appendix A2.

Theorem 2 implies that MI and MIcor cannot be estimated consistently over the class C, since the minimax risk is bounded away from zero. The difficulty in estimating MI is due to the fact that it overweights the region with large density values c(u, v). From equation (16), we see that MI is the expectation of log[c(u, v)] under the true copula distribution c(u, v). In contrast, the Ccor in (15) takes the expectation under the independence case, similar to CVM. This allows Ccor to be consistently estimated over the class C, as shown in subsection 3.2.
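The Gaussian copula gives a concrete illustration of this unboundedness: its mutual information has the well-known closed form MI = −(1/2) log(1 − ρ²), which diverges as |ρ| → 1, while Ccor is bounded by 1 by construction. A short sketch (ours; the closed form and the transform (8) are standard):

```python
import numpy as np

def gaussian_copula_mi(rho):
    """Mutual information of the Gaussian copula: MI = -0.5 * log(1 - rho^2)."""
    return -0.5 * np.log(1.0 - rho ** 2)

for rho in [0.9, 0.99, 0.999, 0.9999]:
    mi = gaussian_copula_mi(rho)
    micor = np.sqrt(1.0 - np.exp(-2.0 * mi))  # transform (8); equals |rho| here
    print(rho, mi, micor)
# MI grows without bound as rho -> 1, while MIcor and Ccor stay bounded by 1.
```

Note that under the transform (8) the Gaussian copula gives MIcor = |ρ| exactly, since 1 − e^(−2MI) = ρ².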

3.2 The Consistent Estimation Of Copula Correlation

The proposed copula correlation measure Ccor can be consistently estimated because the region of large copula density values has little effect on it. To see this, we derive an alternative expression of Ccor from (15). Let x+ = max(x, 0) denote the non-negative part of x. Then

∫₀¹∫₀¹ [c(u, v) − 1]+ du dv − ∫₀¹∫₀¹ [1 − c(u, v)]+ du dv = ∫₀¹∫₀¹ [c(u, v) − 1] du dv = 1 − 1 = 0.

Hence ∫₀¹∫₀¹ [c(u, v) − 1]+ du dv = ∫₀¹∫₀¹ [1 − c(u, v)]+ du dv. Therefore,

∫₀¹∫₀¹ |c(u, v) − 1| du dv = ∫₀¹∫₀¹ [c(u, v) − 1]+ du dv + ∫₀¹∫₀¹ [1 − c(u, v)]+ du dv = 2 ∫₀¹∫₀¹ [1 − c(u, v)]+ du dv.

We then get the alternative expression of Ccor from (15):

Ccor = (1/2) ∫₀¹∫₀¹ |c(u, v) − 1| du dv = ∫₀¹∫₀¹ [1 − c(u, v)]+ du dv.   (19)
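As a quick numerical sanity check of the identity behind (19) (our own illustration, not part of the paper), we can take the Farlie-Gumbel-Morgenstern (FGM) copula density c(u, v) = 1 + θ(1 − 2u)(1 − 2v), for which Ccor works out in closed form to |θ|/8, and verify on a grid that the two half-integrals agree:

```python
import numpy as np

theta = 0.8
n = 400  # midpoint grid; the integrand is piecewise linear, so this is exact
u = (np.arange(n) + 0.5) / n
U, V = np.meshgrid(u, u)
c = 1.0 + theta * (1.0 - 2.0 * U) * (1.0 - 2.0 * V)  # FGM copula density

area = 1.0 / n ** 2
pos_part = np.sum(np.maximum(c - 1.0, 0.0)) * area  # integral of [c - 1]_+
neg_part = np.sum(np.maximum(1.0 - c, 0.0)) * area  # integral of [1 - c]_+
ccor = 0.5 * np.sum(np.abs(c - 1.0)) * area          # definition (15)

print(pos_part, neg_part, ccor)  # both parts equal; Ccor = |theta|/8 = 0.1
```

The grid size and the choice θ = 0.8 are arbitrary; any FGM parameter θ ∈ [−1, 1] gives the same agreement.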

In the new expression (19), Ccor depends only on [1 − c(u, v)]+, which is nonzero only when c(u, v) < 1. To estimate Ccor well, we only need the density estimator ĉn(u, v) to be accurate at points (u, v) with low copula density. Specifically, we consider the plug-in estimator

Ĉcor = Ccor(ĉn) = ∫₀¹∫₀¹ [1 − ĉn(u, v)]+ du dv,   (20)

where ĉn(u, v) = (1/(nh²)) Σ_{i=1}^n K((u − Ui)/h) K((v − Vi)/h) is a kernel density estimator with kernel K(·) and bandwidth h.
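A minimal sketch of the plug-in estimator (20) might look as follows. This is our own illustrative implementation, not the authors' code: we pick a product Epanechnikov kernel, the bandwidth h = n^(−1/4) discussed after Theorem 3, rank-based pseudo-observations, and a midpoint grid for the integral.

```python
import numpy as np
from scipy.stats import rankdata

def epanechnikov(t):
    """Epanechnikov kernel, compact support [-1, 1]."""
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def ccor_plugin(x, y, grid=100):
    """Plug-in estimator: integral of [1 - c_n(u, v)]_+ over the unit square."""
    n = len(x)
    u = rankdata(x) / (n + 1)  # rank-based pseudo-observations
    v = rankdata(y) / (n + 1)
    h = n ** (-0.25)           # bandwidth choice from the discussion of Theorem 3
    g = (np.arange(grid) + 0.5) / grid
    # Kernel weights on the grid: (grid x n) matrices for each coordinate.
    Ku = epanechnikov((g[:, None] - u[None, :]) / h)
    Kv = epanechnikov((g[:, None] - v[None, :]) / h)
    c_n = Ku @ Kv.T / (n * h ** 2)  # c_n(g_i, g_j) on the grid
    return np.mean(np.maximum(1.0 - c_n, 0.0))

rng = np.random.default_rng(0)
x_ind = rng.uniform(size=500)
y_ind = rng.uniform(size=500)               # independent pair
x_dep = rng.uniform(size=500)
y_dep = x_dep + 0.01 * rng.normal(size=500)  # near-deterministic relation
print(ccor_plugin(x_ind, y_ind), ccor_plugin(x_dep, y_dep))
```

Note that the plain kernel estimator is biased downward near the boundary of the unit square, which inflates the estimate under independence; a practical implementation would use a boundary-corrected kernel.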

To analyze the statistical error of Ĉcor, we examine the error in the low copula density region separately from that in the high copula density region. Specifically, let M2 be a constant between 1 and M, say M2 = (M + 1)/2. We then separate the unit square into the low copula density region A_{M2} = {(u, v) : c(u, v) ≤ M2} and the high copula density region A^c_{M2} = {(u, v) : c(u, v) > M2}. We now have Ccor = T1(c) + T2(c), where

T1(c) = ∫∫_{A_{M2}} [1 − c(u, v)]+ du dv  and  T2(c) = ∫∫_{A^c_{M2}} [1 − c(u, v)]+ du dv.

Since the Hölder condition (17) holds on A_M, the classical error rate O(h + (nh²)^{−1/2}) for the kernel density estimator holds for |ĉn(u, v) − c(u, v)| on the low copula density region A_{M2}. Hence the error |T1(ĉn) − T1(c)| is also bounded by O(h + (nh²)^{−1/2}). While the density estimation error |ĉn(u, v) − c(u, v)| can be unbounded on the high copula density region A^c_{M2}, it propagates into the error of Ĉcor only when ĉn(u, v) < 1. We can show that the overall propagated error |T2(ĉn) − T2(c)| is controlled at the higher order O((nh²)^{−1}). Therefore, the error of Ĉcor is controlled by the classical kernel density estimation error rate, as summarized in the following Theorem 3.

Theorem 3 Let ĉn(u, v) = (1/(nh²)) Σ_{i=1}^n K((u − Ui)/h) K((v − Vi)/h) be a kernel estimator of the copula density based on the observations (U1, V1), ..., (Un, Vn). We assume the following conditions:

1. The bandwidth h → 0 and nh² → ∞.
2. The kernel K has compact support [−1, 1].
3. ∫ K(x) dx = 1, ∫ x K(x) dx = 0 and µ2 = ∫ x² K(x) dx > 0.

Then the plug-in estimator Ĉcor = Ccor(ĉn) in (20) has the risk bound

sup_{C∈C} E[|Ĉcor − Ccor|] ≤ 2√M1 h + 2µ2/√(nh²) + M5/(nh²)   (21)

for some finite constant M5 > 0.

The detailed proof of Theorem 3 is provided in Appendix A3. From (21), if we choose the bandwidth h = n^{−1/4}, then Ĉcor converges to the true value Ccor at the rate O(n^{−1/4}). Thus Ccor can be consistently estimated, in contrast to the results on MI and MIcor in subsection 3.1.

Theorem 3 provides only an upper bound for the statistical error of the plug-in estimator Ĉcor. The actual error may be smaller. In fact, the error |T1(c) − T1(ĉn)| can be controlled at O(n^{−1/2}) using the kernel density estimator ĉn (Bickel and Ritov, 2003). We have not derived the optimal rate of convergence here, but the upper bound already shows that Ccor is much easier to estimate than MI and MIcor. As in classical kernel density estimation theory, if the Hölder condition holds on A_M for the m-th derivatives of the copula density, the upper bound on the convergence rate can be further improved to O(n^{−(m+1)/(2m+4)}).

The technical conditions 1-3 in Theorem 3 are classical conditions on the bandwidth and the kernel. We have used the bivariate product kernel for technical simplicity. Other variations of the conditions in the literature may be used. For example, it is possible to relax the compact support condition 2 to allow using the Gaussian kernel.

In practice, the (Ui, Vi)'s are not observed. The estimator is conventionally calculated on the rank-based pseudo-observations (Ûi, V̂i) = (R_{X,i}/(n + 1), R_{Y,i}/(n + 1)), where R_{X,i} and R_{Y,i} denote the ranks of Xi and Yi. The risk bound still holds using the (Ûi, V̂i)'s.
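Computing the pseudo-observations is a one-liner with SciPy's `rankdata` (a sketch with made-up data; continuous data have no ties, so the default tie-handling is immaterial):

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(x, y):
    """Rank-based pseudo-observations (U_i, V_i) = (R_X,i/(n+1), R_Y,i/(n+1))."""
    n = len(x)
    return rankdata(x) / (n + 1), rankdata(y) / (n + 1)

x = np.array([3.2, 1.5, 4.8, 2.9])
y = np.array([10.0, 7.1, 9.3, 8.8])
u, v = pseudo_observations(x, y)
print(u)  # [0.6 0.2 0.8 0.4]
```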

4 NUMERICAL STUDIES

In this section, we conduct two numerical studies of the finite sample properties of the proposed Ccor and compare it with several other measures. We first conduct a simulation study of the power of Ccor and other measures for detecting association. We then apply Ccor to a data set of social, economic, health, and political indicators from the World Health Organization (WHO). This WHO data set was analyzed by Reshef et al. (2011) and is available from their website http://www.exploredata.net. Their maximal information-based nonparametric exploration (MINE) package is available on the same website. We used their MINE package to calculate MIC.

4.1 Power Comparison

We analyze the power of MIC, the linear correlation ρ, the distance correlation dcor and our copula correlation Ccor to detect association on simulated data sets. We choose the same simulation setting as in the study by Simon and Tibshirani (2011). Data sets of sample size n = 320 were generated from the regression model Y = f(X) + ε with Gaussian error ε ∼ N(0, σ²), and the four measures were applied to test the null hypothesis that X and Y are independent. The sample size n = 320 is the same as that used in the simulations by Reshef et al. (2011). The rejection threshold at the α = 0.05 level was determined from 500 data sets generated under the null hypothesis. The rejection rates over 500 simulated data sets from each alternative were then used to estimate the power.
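The simulation protocol above can be sketched in a few lines. This is our own illustration of the two-step procedure (calibrate the threshold under the null, then estimate power under an alternative), using the absolute Pearson correlation as a stand-in statistic and fewer replications than the paper's 500:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 320, 200

def stat(x, y):
    """Test statistic: absolute Pearson correlation (stand-in for Ccor/MIC/dcor)."""
    return abs(np.corrcoef(x, y)[0, 1])

# Step 1: rejection threshold = 95th percentile of the statistic under H0.
null_stats = [stat(rng.uniform(size=n), rng.normal(size=n)) for _ in range(n_sims)]
threshold = np.quantile(null_stats, 0.95)

# Step 2: power = rejection rate under the alternative Y = f(X) + sigma * eps.
def power(f, sigma):
    hits = 0
    for _ in range(n_sims):
        x = rng.uniform(size=n)
        y = f(x) + sigma * rng.normal(size=n)
        hits += stat(x, y) > threshold
    return hits / n_sims

print(power(lambda x: x, 0.1))      # strong linear signal: power near 1
print(power(lambda x: 0 * x, 1.0))  # pure noise: power near the 0.05 level
```

Swapping `stat` for an implementation of Ccor, MIC or dcor reproduces the protocol of the study; the statistic, seed and replication count here are our choices.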

We simulated data from the eight functions f(·) in Simon and Tibshirani (2011) plus a new functional relationship, the Cross pattern. These nine functions are listed in Table 3. The powers are calculated at 31 different noise levels: σ = 0, 0.1σ0, ..., 3σ0. We note that the σ0 value was chosen differently for each functional relationship so that the power curves clearly show a decreasing pattern within the simulated noise range. One simulated data set for each combination of noise level and functional relationship is plotted in Table 3.

Type and formula (each plotted at noise levels 0, 0.5, 1, 1.5, 2, 2.5 and 3, in units of σ0):

Linear: y = x
Quadratic: y = 4(x − 1/2)²
Cubic: y = 80(x − 1/3)³ − 12(x − 1/3)
Power x^{1/4}: y = x^{1/4}
Step function: y = 1{x > 1/2}
Sine, period 1/2: y = sin(4πx)
Circle: (x − 1/2)² + y² = 1/4
Cross: (y − 1/2)² = (x − 1/2)²
Sine, period 1/8: y = sin(8πx)

Table 3: The simulated data in the power study.

The power curves for testing the independence null hypothesis by MIC, the linear correlation ρ, the distance correlation dcor and the copula correlation Ccor are plotted in Figure 2. Without Ccor, the other three measures follow the conclusion of Simon and Tibshirani (2011): dcor has the best power in all but one case, and MIC beats dcor only in the case of the high-frequency periodic function. This seems to suggest that dcor is a better association measure for detecting nonlinear relationships.


[Figure 2: nine panels (Linear, Quadratic, Cubic, x^{1/4}, Step function, Sine: period 1/2, Circle, Cross, Sine: period 1/8), each showing power versus noise level (0 to 3) for cor, dcor, MIC and Ccor.]

Figure 2: The power curves for testing independence using the four correlation measures.


However, that conclusion no longer holds when we also include Ccor in the comparison. The dcor has a clear power advantage over MIC and Ccor in only four of the nine cases, and the magnitude of its advantage in those four cases pales in comparison to the magnitude of its disadvantage in the last three cases. A more careful examination also reveals that those four cases are ones where Pearson's linear correlation ρ also performs well. Hence Ccor and MIC are more useful than dcor for finding novel functional relationships that are not detected by Pearson's linear correlation. Ccor and MIC are adept at detecting different types of functional relationships: MIC is able to detect the high-frequency periodic signal, while Ccor works well for the many-to-many relationships where the other measures fail.

We can better understand the performance of these measures by combining the power curves shown in Figure 2 with the corresponding data patterns plotted in Table 3. The four cases where dcor has the biggest power advantage over Ccor and MIC are: (1) the linear function with noise level 1.5σ0; (2) the cubic function with noise level 1.5σ0; (3) the power function x^{1/4} with noise level 1.0σ0; and (4) the step function with noise level 0.5σ0. The corresponding data plots for those four cases in Table 3 look very random to the naked eye. The detection of the association in those cases comes from the underlying monotone increasing trends, and Pearson's linear correlation is also very good at detecting such monotone trends. On the other hand, Ccor has the biggest power advantage over the other measures in two cases: the circle with noise level 2.0σ0 and the cross with noise level 1.0σ0. While the noise obscures the functional relationships somewhat in these two cases, the non-random patterns are still clearly visible to the naked eye. It is quite reasonable to expect the detection power in these cases to be very close to 100%, yet only Ccor achieves this. The other association measures all emphasize particular features of dependence and were not able to pick up these obvious associations. In contrast, MIC has its biggest power advantage over the other measures for the high-frequency periodic function with noise level 1.0σ0. The corresponding plot in Table 3 does not show a visible non-random pattern. It is rather remarkable that MIC is able to detect this type of signal with near 100% power in this case. More mathematical

study of MIC is needed to understand why it is good at detecting high-frequency periodic signals and what other types of signals it can detect. Ccor, on the other hand, seems to find many relationships ignored by linear correlation, particularly those visible to the human eye.

Thus, Ccor would be a very useful tool for automated scanning for possible associations in

huge data sets.

We note that Ccor and MIC aim to be equitable, putting equal importance on the linear function and other types of deterministic relationships. This necessarily leads to lower correlation values than Pearson's ρ in linear cases, in order to give other types of functional relationships a fair shake. Therefore, Ccor and MIC will lose some power to detect the functions that Pearson's linear correlation can detect. This is more than compensated by their ability to detect relationships that escape detection by Pearson's linear correlation. While dcor impressively seems to find every relationship detectable by Pearson's linear correlation, this limits its ability to detect other types of association. In practice, Pearson's linear correlation would always be the first measure we check, so it is not a concern that a new association measure may lose power where Pearson's linear correlation works. Ccor and MIC are well suited as second or third measures to check for types of association missed by the first scan using linear correlation.

The above study concentrates on the power of detecting association by the different measures. It is worth pointing out that power is neither the only criterion nor the most important criterion for assessing the usefulness of equitable association measures. The equitable measures are used to rank the detected association relationships without giving preference to the linear relationship. Therefore, losing some power for detecting the linear relationship in order to give more consideration to other types of association is in fact a desirable property.

4.2 Analysis Of WHO Data

We now apply the new measure Ccor to the WHO data set. We repeat the analysis in Reshef et al. (2011) by calculating the pairwise correlations among the 357 variables in the data set. The first variable contains the ID numbers of the countries, from 1 to 202. These numerical values have no intrinsic meaning, so the correlations between the first variable and the other variables are meaningless. We therefore drop the first variable and calculate the pairwise correlations only among the remaining 356 variables. There is a large amount of missing data in the data set, and for some pairs of variables the available sample size is very small. Since our estimator of Ccor uses copula density estimation, its accuracy under a very small sample size is questionable. Therefore, we calculate Ccor only on those pairs with at least n = 50 common observations. This results in 49286 pairwise correlations in total.
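The screening rule above (keep only variable pairs with at least 50 jointly observed rows) can be sketched as follows. This is our own illustration on a tiny synthetic data frame, using pandas; the function name and data are hypothetical:

```python
import numpy as np
import pandas as pd
from itertools import combinations

def pairs_with_enough_data(df, min_n=50):
    """Return (var1, var2, common_n) for pairs with >= min_n jointly observed rows."""
    notna = df.notna().to_numpy()
    cols = list(df.columns)
    keep = []
    for i, j in combinations(range(len(cols)), 2):
        common = int(np.sum(notna[:, i] & notna[:, j]))
        if common >= min_n:
            keep.append((cols[i], cols[j], common))
    return keep

# Tiny synthetic example: variable "b" is mostly missing.
rng = np.random.default_rng(2)
df = pd.DataFrame({"a": rng.normal(size=100),
                   "b": np.where(rng.uniform(size=100) < 0.8, np.nan, 1.0),
                   "c": rng.normal(size=100)})
print(pairs_with_enough_data(df, min_n=50))  # only the pair (a, c) survives
```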

We first look at some pairs of variables studied by Reshef et al. (2011). Figure 3 plots the data along with the linear correlation (cor), MIC and Ccor values for the examples 4C-4H in Reshef et al. (2011). We can see that Ccor and MIC qualitatively give the same conclusions in those examples. They both give low correlations in the first case. They both detect some clear nonrandom relationships with weak linear correlations. They give lower correlation values than ρ in the two cases with high linear correlations, but still large enough to detect the relationship. There are some differences in the numerical values between Ccor and MIC. The biggest difference occurs for the third case in the first row, with MIC = 0.72 and Ccor = 0.46.

To compare the estimates of Ccor and MIC, we plot their values for all 49286 pairs in the WHO data set in Figure 4. We can see that the values fall in a band around the diagonal. This means that Ccor and MIC generally rate the pairwise associations similarly. To investigate the different rankings by these two measures, we examine three pairs of variables that have very similar values in one measure but a big difference in the other measure. These three pairs are labeled as A, B and C on the graph in Figure 4. We plot the data for these variables in Figure 5. Since Ccor and MIC are both rank-based, we also plot these data in ranks to avoid any specious pattern due to the scales of the variables. As we can see from Figure 5, the latter two cases (B and C) both seem to have strong linear relationships with some noise. While the noise patterns differ between Figures 5B and 5C, the average amount of noise looks about the same. The first case, Figure 5A, is clearly much noisier than the other two. This pattern is correctly reflected by Ccor, which assigns similar correlation values to the latter two cases while giving the first case a much lower correlation

value.

[Figure 3 panels: (1) Dentist Density (per 10,000) vs. Life Lost to Injuries (%yrs): cor=0.12, MIC=0.1, Ccor=0.18; (2) Children Per Woman vs. Life Expectancy (Years): cor=−0.84, MIC=0.61, Ccor=0.52; (3) Number of Physicians vs. Deaths due to HIV/AIDS: cor=−0.12, MIC=0.72, Ccor=0.46; (4) Income / Person (Int$) vs. Adult (Female) Obesity (%): cor=0.04, MIC=0.50, Ccor=0.42; (5) Health Exp. / Person (US$) vs. Measles Imm. Disparity (%): cor=−0.14, MIC=0.54, Ccor=0.43; (6) Gross Nat'l Inc / Person (Int$) vs. Health Exp. / Person (Int$): cor=0.93, MIC=0.85, Ccor=0.82.]

Figure 3: The raw data and estimated correlation measures for several example cases in Reshef et al. (2011).

However, MIC assigns about the same correlation value to the first two cases and a much higher correlation value to the third case. This certainly does not agree with the

observed data patterns. In particular, MIC assigns a correlation value of 1 to case 5C, which is far from a noiseless deterministic relationship. From these observations, it appears that Ccor reflects the noise level better than MIC. Thus Ccor appears to be the better equitable correlation measure.

5 DISCUSSIONS AND CONCLUSIONS

For simplicity of presentation, we have worked only with bivariate continuous distributions. The definition of Ccor in equation (9) also works for discrete distributions, where the densities pXY(x, y), pX(x) and pY(y) are defined with respect to the appropriate counting measures. Similarly, Ccor can be defined for mixed-type random variables using the densities under appropriate dominating measures in equation (9). The extension to higher dimensions is straightforward: for p-dimensional X and q-dimensional Y, the densities pXY(x, y), pX(x) and pY(y) are functions of p + q, p and q dimensions respectively. However, high-dimensional density estimation is much harder, and the statistical properties of Ccor in higher dimensions are the subject of future research.

We studied the equitability of several association measures in this paper. We explained the paradoxical equitability results on MI and MIC in the literature by separating estimability from equitability. To this end, we proved that MI is not consistently estimable over a class of continuous copulas. This provides the first quantitative mathematical result on the difficulty of estimating MI for continuous distributions.

We proposed a novel association measure, the copula correlation (Ccor). We proved that Ccor is both self-equitable and consistently estimable; it is the only known association measure with both properties. Numerical studies show that Ccor performs as intended. It ranks the correlations more reasonably than MIC on the WHO data set, and it is good at finding functional relationships missed by other correlation measures.


[Figure 4: scatterplot of MIC versus Ccor (both roughly 0.2 to 1) for all pairs, with three cases labeled A, B and C.]

Figure 4: The Ccor and MIC values for all pairs in the WHO data. The three cases labeled on the graph are shown in detail in Figure 5.


[Figure 5 panels, each shown for both raw data and ranks: A: Malnutrition Prevalence vs. Boys Completing Primary Schl(%), MIC=0.64, Ccor=0.33; B: DTP3 Imm. in 1yr olds(%) vs. Hib3 Imm. in 1yr olds(%), MIC=0.64, Ccor=0.71; C: Income Per Person vs. Oil Consumption Per Person, MIC=1, Ccor=0.70.]

Figure 5: The comparison of Ccor and MIC on three example cases.


Computationally, Ccor is faster than MIC by several orders of magnitude for large sample sizes. Based on these studies, Ccor will be a very useful new tool for exploring complex associations in big data sets.

References

Bickel, P. J. and Ritov, Y. (2003). Nonparametric estimators which can be "plugged-in". The Annals of Statistics, 31(4):1033-1053.

Donoho, D. L. and Liu, R. C. (1991). Geometrizing rates of convergence, II. The Annals of Statistics, 19(2):633-667.

Farrell, R. H. (1972). On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. The Annals of Mathematical Statistics, 43(1):170-180.

Genest, C., Quessy, J.-F., and Remillard, B. (2007). Asymptotic local efficiency of Cramér-von Mises tests for multivariate independence. The Annals of Statistics, 35(1):166-191.

Genest, C. and Remillard, B. (2004). Test of independence and randomness based on the

empirical copula process. Test, 13(2):335–369.

Joe, H. (1989). Relative entropy measures of multivariate dependence. Journal of the Amer-

ican Statistical Association, 84(405):157–164.

Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth selection

for density estimation. Journal of the American Statistical Association, 91(433):401–407.

Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson, D. J., Protopopescu,

V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation

methods for quantifying the dependence among short and noisy data. Phys. Rev. E,

76:026209.

Kinney, J. B. and Atwal, G. S. (2013). Equitability, mutual information, and the maximal

information coefficient. arXiv preprint arXiv:1301.7745.


Kojadinovic, I. and Holmes, M. (2009). Tests of independence among continuous random vectors based on Cramér-von Mises functionals of the empirical copula process. Journal of Multivariate Analysis, 100(6):1137-1154.

Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. The Annals

of Statistics, pages 38–53.

Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer series in

statistics. Springer, New York, NY.

Moon, Y. I., Rajagopalan, B., and Lall, U. (1995). Estimation of mutual information using

kernel density estimators. Physical Review E, 52(3):2318–2321.

Nelsen, R. B. (2006). An Introduction to Copulas (Springer Series in Statistics). Springer-

Verlag New York, Inc., Secaucus, NJ, USA.

Omelka, M., Gijbels, I., and Veraverbeke, N. (2009). Improved kernel estimation of copulas:

weak convergence and goodness-of-fit testing. The Annals of Statistics, 37(5B):3023–3058.

Reshef, D., Reshef, Y., Mitzenmacher, M., and Sabeti, P. (2013). Equitability analysis of

the maximal information coefficient, with comparisons. arXiv preprint arXiv:1301.6314.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh,

P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associ-

ations in large data sets. Science, 334(6062):1518–1524.

Scott, D. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization.

Wiley Series in Probability and Statistics. Wiley.

Segers, J. (2012). Asymptotics of empirical copula processes under non-restrictive smooth-

ness assumptions. Bernoulli, 18(3):764–782.

Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26.

CRC press.


Simon, N. and Tibshirani, R. (2011). Comment on "Detecting novel associations in large data sets" by Reshef et al., Science Dec 16, 2011. Science.

Speed, T. (2011). A correlation for the 21st century. Science, 334(6062):1502–1503.

Szekely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. The Annals of Applied Statistics, pages 1236-1265.

Szekely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence

by correlation of distances. The Annals of Statistics, 35(6):2769–2794.

Wand, M. and Jones, C. (1994). Multivariate plug-in bandwidth selection. Computational

Statistics, 9(2):97–116.

Wand, M. and Jones, M. (1993). Comparison of smoothing parameterizations in bivariate

kernel density estimation. Journal of the American Statistical Association, 88(422):520–

528.

Appendices

A1 Proof of Theorem 1.

We check the DPI for CDα on random variables X, Y, Z forming a Markov chain X ↔ Y ↔ Z. Consider U = FX(X), V = FY(Y) and W = FZ(Z). Then U, V and W each follow the uniform distribution, and their joint density function cXYZ(u, v, w) is a 3-dimensional copula density. Similarly, cXY(u, v) denotes the joint density of U and V, and cYZ(v, w) denotes the joint density of V and W. In the following, we suppress the subscripts on these copula densities c(·) when the arguments clearly indicate which one is being used. Now, by the copula property, ∫₀¹ c(v, w) dw = 1 for all v ∈ [0, 1]. Hence,

CDα(X; Y) = ∫₀¹∫₀¹ |c(u, v) − 1|^α du dv = ∫₀¹∫₀¹∫₀¹ |c(u, v) − 1|^α c(v, w) du dv dw.


For α ≥ 1, |x − 1|^α is convex in x. Hence, applying Jensen's inequality,

CDα(X; Y) = ∫₀¹∫₀¹∫₀¹ |c(u, v) − 1|^α c(v, w) du dv dw ≥ ∫₀¹∫₀¹ |∫₀¹ c(u, v) c(v, w) dv − 1|^α du dw.

Since X ↔ Y ↔ Z, we have c(u, v, w) = c(u, v) c(v, w) for all u, v, w values. Therefore ∫₀¹ c(u, v) c(v, w) dv = ∫₀¹ c(u, v, w) dv = c(u, w). This leads to

CDα(X; Y) ≥ ∫₀¹∫₀¹ |∫₀¹ c(u, v) c(v, w) dv − 1|^α du dw = ∫₀¹∫₀¹ |c(u, w) − 1|^α du dw = CDα(X; Z).

Hence the DPI holds for CDα. Thus CDα is self-equitable.
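A numeric sanity check of the DPI just proved (our own, not part of the paper) can be run with the discrete analog of CD1, i.e. the counting-measure version mentioned in Section 5: for a Markov chain X → Y → Z on finite supports, CD1(X; Y) = Σ |p(x, y) − p(x) p(y)| should never be smaller than CD1(X; Z).

```python
import numpy as np

rng = np.random.default_rng(3)

def cd1(p_joint):
    """Discrete analog of CD_1: sum over cells of |p(x, y) - p(x) p(y)|."""
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    return np.sum(np.abs(p_joint - px * py))

# Random Markov chain X -> Y -> Z on small finite supports.
p_x = rng.dirichlet(np.ones(3))                  # p(x)
p_y_given_x = rng.dirichlet(np.ones(4), size=3)  # rows: p(y | x)
p_z_given_y = rng.dirichlet(np.ones(3), size=4)  # rows: p(z | y)

p_xy = p_x[:, None] * p_y_given_x                # p(x, y)
p_xz = p_xy @ p_z_given_y                        # p(x, z) = sum_y p(x, y) p(z | y)

print(cd1(p_xy), cd1(p_xz))  # DPI: CD1(X;Y) >= CD1(X;Z)
```

Any random seed gives the same ordering, since the DPI for this total-variation-type dependence measure follows from the convexity argument above.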

A2 Proof of Theorem 2.

To prove the theorem, we use Le Cam's method (Le Cam, 1973) to find a lower bound on the minimax risk of estimating the mutual information MI. To do this, we use a more convenient form of Le Cam's method developed by Donoho and Liu (1991). Define the modulus of continuity of a functional T over the class F with respect to the Hellinger distance as in equation (1.1) of Donoho and Liu (1991):

w(ε) = sup{|T(F1) − T(F2)| : H(F1, F2) ≤ ε, Fi ∈ F}.   (A.1)

Here H(F1, F2) denotes the Hellinger distance between F1 and F2. Then the minimax rate of convergence for estimating T(F) over the class F is bounded below by w(n^{−1/2}).

We now look for a pair of density functions c1(u, v) and c2(u, v) on the unit square whose distributions are close in Hellinger distance but far apart in mutual information. This provides a lower bound on the modulus of continuity for the mutual information MI over the class C, and hence a lower bound on the minimax risk. We provide an outline of the proof here.

We first divide the unit square into three disjoint regions R1, R2 and R3 with R1 ∪ R2 ∪ R3 = [0, 1] × [0, 1]. The first density function c1(u, v) puts probability masses δ, a and 1 − a − δ uniformly on the regions R1, R2 and R3 respectively. Here a is an arbitrary small fixed value, for example a = 0.01. For now, we take δ to be another small fixed value. The areas of the regions are chosen so that c1(u, v) = M on region R2 and c1(u, v) = M* on region R1 for a very big M*. The second density function c2(u, v), compared to c1(u, v), moves a small probability mass ε from R1 to R2. We will see that the Hellinger distance between c1 and c2 is of the same order as ε, but the change in MI is unbounded for big M*. Hence the modulus of continuity w(ε) is unbounded for the mutual information MI, and MI cannot be consistently estimated over the class C.

Specifically, the region R1 is chosen to be a narrow strip immediately above the diagonal, R1 = {(u, v) : −δ1 < u − v < 0}, and R2 is a narrow strip immediately below the diagonal, R2 = {(u, v) : 0 ≤ u − v < δ2}. The remaining region is R3 = [0, 1] × [0, 1] \ (R1 ∪ R2). The values of δ1 and δ2 are chosen so that the areas of regions R1 and R2 are δ/M* and a/M respectively. Then clearly c1(u, v) = M* on R1, c1(u, v) = M on R2, and c1(u, v) = (1 − a − δ)/(1 − a/M − δ/M*) on R3; while c2(u, v) = M* − ε(M*/δ) on R1, c2(u, v) = M + ε(M/a) on R2, and c2(u, v) = c1(u, v) on R3. See Figure 6.

Then we have
$$
\begin{aligned}
2H^2(c_1, c_2) &= \iint \Big(\sqrt{c_2(u,v)} - \sqrt{c_1(u,v)}\Big)^2\, du\, dv \\
&= \Big(\sqrt{M^* - \varepsilon(M^*/\delta)} - \sqrt{M^*}\Big)^2 \delta/M^* + \Big(\sqrt{M + \varepsilon(M/a)} - \sqrt{M}\Big)^2 a/M \\
&= \delta\Big(\sqrt{1 - \varepsilon/\delta} - 1\Big)^2 + a\Big(\sqrt{1 + \varepsilon/a} - 1\Big)^2 \\
&= \delta\big(\varepsilon/(2\delta)\big)^2 + a\big(\varepsilon/(2a)\big)^2 + o(\varepsilon^2) = \varepsilon^2\Big(\frac{1}{4\delta} + \frac{1}{4a}\Big) + o(\varepsilon^2).
\end{aligned}
$$
Hence the Hellinger distance is of the same order as ε:
$$
H(c_1, c_2) = \varepsilon\sqrt{\frac{1}{8\delta} + \frac{1}{8a}} + o(\varepsilon).
$$
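As a quick sanity check on this expansion, the exact squared Hellinger distance for the piecewise-constant pair can be compared numerically with the leading term ε²(1/(4δ) + 1/(4a)). The values of a, δ and ε below are illustrative, not from the paper:

```python
import math

# Illustrative constants (not from the paper): mass a on R2, delta on R1.
a, delta = 0.01, 0.05
eps = 1e-4  # small mass moved from R1 to R2

# Exact: 2H^2 = delta*(sqrt(1 - eps/delta) - 1)^2 + a*(sqrt(1 + eps/a) - 1)^2
exact = delta * (math.sqrt(1 - eps / delta) - 1) ** 2 \
      + a * (math.sqrt(1 + eps / a) - 1) ** 2
# Leading term of the expansion: eps^2 * (1/(4*delta) + 1/(4*a))
approx = eps ** 2 * (1 / (4 * delta) + 1 / (4 * a))

print(exact, approx)  # the two agree up to an o(eps^2) remainder
```

Shrinking ε further drives the relative discrepancy to zero, as the o(ε²) term predicts.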

On the other hand, the difference in the mutual information is
$$
\begin{aligned}
MI(c_1) - MI(c_2) &= \delta \log(M^*) + a \log(M) - (\delta - \varepsilon) \log[M^* - \varepsilon(M^*/\delta)] - (a + \varepsilon) \log[M + \varepsilon(M/a)] \\
&= \varepsilon \log(M^*) - \varepsilon \log(M) - (\delta - \varepsilon) \log(1 - \varepsilon/\delta) - (a + \varepsilon) \log(1 + \varepsilon/a).
\end{aligned}
\tag{A.2}
$$
Here M, δ and a are fixed constants. Hence when M∗ → ∞, this difference in MI also goes to ∞. For example, if we let M∗ = e^{1/ε²}, then the modulus of continuity satisfies w(ε) ≥ O(1/ε).
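To see the divergence concretely, the closed form (A.2) can be evaluated along M∗ = e^{1/ε²}, where the leading term ε log(M∗) = 1/ε dominates: the MI gap blows up even as the Hellinger distance (of order ε) shrinks. The constants a, δ and M below are illustrative, not from the paper:

```python
import math

a, delta, M = 0.01, 0.05, 5.0  # illustrative fixed constants

def mi_gap(eps):
    """MI(c1) - MI(c2) from (A.2) with M* = exp(1/eps^2), computed via log(M*)."""
    log_Mstar = 1.0 / eps ** 2
    return (eps * log_Mstar - eps * math.log(M)
            - (delta - eps) * math.log(1 - eps / delta)
            - (a + eps) * math.log(1 + eps / a))

for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, mi_gap(eps))  # gap grows like 1/eps as eps shrinks
```

Working with log(M∗) = 1/ε² directly avoids overflow, since M∗ itself is astronomically large.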


Figure 6: The plot shows the regions R1, R2 and R3 in the unit square of (u, v), with R1 and R2 narrow strips of widths δ1 and δ2 along the diagonal. The other two narrow strips neighboring R1 and R2 are for the continuity correction mentioned at the end of the proof.

That means the rate of convergence is at least of order w(n^{−1/2}) = O(n^{1/2}) → ∞. In other words, MI cannot be consistently estimated.

The small Hellinger distance between c1 and c2 can lead to an unbounded difference between MI(c1) and MI(c2) because MI itself is unbounded. After the transformation MIcor = √(1 − e^{−2MI}) in (8), the mutual information correlation is bounded. The difference between MIcor(c1) and MIcor(c2) in the above example is actually small, since MI is big for both c1 and c2 (both MIcor values are then close to one, so their difference is close to zero). However, MIcor is also very hard to estimate over the class C. To see this, we follow the same reasoning as above but modify the example densities c1 and c2. First, we notice that for any pair of densities c1 and c2,

$$
\begin{aligned}
|MIcor(c_1) - MIcor(c_2)| &= \Big|\sqrt{1 - e^{-2MI(c_1)}} - \sqrt{1 - e^{-2MI(c_2)}}\Big| \\
&= \frac{\big|[1 - e^{-2MI(c_1)}] - [1 - e^{-2MI(c_2)}]\big|}{\sqrt{1 - e^{-2MI(c_1)}} + \sqrt{1 - e^{-2MI(c_2)}}} \\
&\ge \tfrac{1}{2}\,\big|e^{-2MI(c_1)} - e^{-2MI(c_2)}\big| \\
&= \tfrac{1}{2}\, e^{-2MI(c_2)}\,\big|1 - e^{-2[MI(c_1) - MI(c_2)]}\big| \;\ge\; \tfrac{1}{2}\, e^{-2MI(c_1)}\,\big|1 - e^{-2[MI(c_1) - MI(c_2)]}\big|,
\end{aligned}
$$
where the last inequality holds when MI(c_1) ≥ MI(c_2), as in our construction.

For the difference MIcor(c1) − MIcor(c2) to be of the same order as the difference MI(c1) − MI(c2), we need to keep MI(c1) at constant order as ε → 0.

Therefore, we modify the above c1 to put probability mass δ = 2ε in region R1, varying with the ε value instead of fixed as before. And we set M∗ = e^{1/ε}, leading to
$$
\begin{aligned}
MI(c_1) &= \delta \log(M^*) + a \log(M) + (1 - a - \delta) \log[(1 - a - \delta)/(1 - a/M - \delta/M^*)] \\
&= 2 + a \log(M) + (1 - a - 2\varepsilon) \log[(1 - a - 2\varepsilon)/(1 - a/M - 2\varepsilon e^{-1/\varepsilon})],
\end{aligned}
$$
which converges to a fixed constant a_1 = 2 + a log(M) + (1 − a) log[(1 − a)/(1 − a/M)] as ε → 0. Using (A.2), recalling that δ = 2ε and M∗ = e^{1/ε}, we have
$$
\begin{aligned}
MI(c_1) - MI(c_2) &= \varepsilon \log(M^*) - \varepsilon \log(M) - (\delta - \varepsilon)\log(1 - \varepsilon/\delta) - (a + \varepsilon)\log(1 + \varepsilon/a) \\
&= 1 - \varepsilon \log(M) - \varepsilon \log(1/2) - (a + \varepsilon)\log(1 + \varepsilon/a),
\end{aligned}
$$

which converges to 1 as ε → 0. Hence we have

$$
\lim_{\varepsilon \to 0} w(\varepsilon) \ge \lim_{\varepsilon \to 0} \tfrac{1}{2}\, e^{-2MI(c_1)}\big|1 - e^{-2[MI(c_1) - MI(c_2)]}\big| = \tfrac{1}{2}\, e^{-2a_1}(1 - e^{-2}),
$$
a positive constant a_2 = e^{−2a_1}(1 − e^{−2})/2. Therefore, MIcor cannot be estimated consistently over the class C either.
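The modified construction can be checked numerically by computing MI(c1) and MI(c2) directly from the piecewise masses and densities, with δ = 2ε and log(M∗) = 1/ε kept in log form to avoid overflow. The MIcor gap then stays bounded away from zero as ε → 0. The constants a and M below are illustrative, not from the paper:

```python
import math

a, M = 0.01, 5.0  # illustrative constants

def micor_gap(eps):
    """MIcor(c1) - MIcor(c2) for the modified example: delta = 2*eps, log M* = 1/eps."""
    delta, log_Mstar = 2 * eps, 1 / eps
    # shared R3 term: mass (1-a-delta) at density (1-a-delta)/(1 - a/M - delta/M*)
    area3 = 1 - a / M - delta * math.exp(-log_Mstar)
    t3 = (1 - a - delta) * math.log((1 - a - delta) / area3)
    mi1 = delta * log_Mstar + a * math.log(M) + t3
    mi2 = ((delta - eps) * (log_Mstar + math.log(1 - eps / delta))
           + (a + eps) * (math.log(M) + math.log(1 + eps / a)) + t3)
    micor = lambda mi: math.sqrt(1 - math.exp(-2 * mi))
    return micor(mi1) - micor(mi2)

for eps in [1e-2, 1e-3, 1e-4]:
    print(eps, micor_gap(eps))  # stays above a positive constant as eps shrinks
```

The printed gaps are roughly constant in ε and exceed the theoretical floor a_2 = e^{−2a_1}(1 − e^{−2})/2 for these constants.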

The above outlines the main idea of the proof, ignoring some mathematical subtleties. One is that the example densities c1 and c2 are only piecewise continuous on the three regions, not truly continuous as required for the class C. This can be easily remedied by connecting the three pieces linearly. Specifically, we set the densities ci(u, v) = M, i = 1, 2, on the boundary between R1 and R3, {(u, v) : u − v = −δ1}, and on the boundary between R2 and R3, {(u, v) : u − v = δ2}. Then we use two narrow strips within R3, {(u, v) : −δ3 ≤ u − v ≤ −δ1} and {(u, v) : δ2 ≤ u − v ≤ δ4}, to connect the constant ci(u, v) values on the rest of region R3 with the boundary value ci(u, v) = M continuously, through ci(u, v)'s that are linear in u − v on the two strips and satisfy the Hölder condition (17). By the Hölder condition (17), the connection can be made with strips of width at most (M − 1 + a + δ)/M1. This continuity modification does not affect the calculation of the difference MI(c1) − MI(c2) above, as c1 and c2 only differ on regions R1 and R2. Within regions R1 and R2, the densities c1 and c2 can similarly be connected continuously, linearly in u − v. As there is no Hölder condition on A^c_M, the connections within R1 and R2 can be as steep as we want. Clearly the order obtained through the above calculations will not change if we make these connections very steep so that their effect is negligible.


Another technical subtlety is that the c1 and c2 defined above are only densities on the unit square, not copula densities, which require uniform marginal distributions. However, it is clear that the marginal densities of the ci's are uniform over the interval (δ3, 1 − δ4) and linear on the rest of the interval near the two end points 0 and 1. The copula densities c∗i corresponding to the ci's can be calculated directly through Sklar's decomposition (11). It is easy to see that the order of the modulus of continuity w(ε) remains the same when the corresponding copula densities c∗i are used instead.

A3 Proof of Theorem 3.

Let M2 be a constant between 1 and M, say M2 = (M + 1)/2. Denote A_{M2} = {(u, v) : c(u, v) ≤ M2}. Then denote
$$
T_1(c) = \iint_{A_{M_2}} [1 - c(u,v)]_+ \, du\, dv \quad \text{and} \quad T_2(c) = \iint_{A_{M_2}^c} [1 - c(u,v)]_+ \, du\, dv,
$$
so that Ccor = T_1(c) + T_2(c).

For a density estimator ĉ_n(u, v), we obtain the corresponding copula correlation estimator by plugging ĉ_n(u, v) into the Ccor expression. Hence Ĉcor = T_1(ĉ_n) + T_2(ĉ_n). We now bound the errors in estimating T1 and T2 separately.

T1 involves the integral over the (u, v) points in A_{M2} only. Those points are contained in the set of low density points where the Hölder condition holds. Hence we can apply the usual bounds for kernel density estimation. Particularly, let
$$
\bar{c}_n(u, v) = E[\hat{c}_n(u, v)] = \iint K(s) K(t)\, c(u + hs, v + ht)\, ds\, dt
$$
denote the expectation of the density estimator ĉ_n. Then the bias in density estimation is bounded by
$$
|\bar{c}_n(u, v) - c(u, v)| \le \iint K(s) K(t)\, |c(u + hs, v + ht) - c(u, v)|\, ds\, dt.
$$
For (u, v) ∈ A_{M2}, we have (u + hs, v + ht) ∈ A_M for h ≤ (M − M2)/(√2 M1), |s| ≤ 1 and |t| ≤ 1. Since the support of K(·) is [−1, 1], for small enough h, the bias is bounded using the Hölder condition by
$$
\int_{-1}^{1}\int_{-1}^{1} K(s) K(t)\, M_1 h (|s| + |t|)\, ds\, dt \le 2 M_1 h \int_{-1}^{1}\int_{-1}^{1} K(s) K(t)\, ds\, dt = 2 M_1 h. \tag{A.3}
$$
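Bound (A.3) can be sanity-checked numerically. The sketch below uses an illustrative copula density (not from the paper) with uniform margins and Hölder constant M1 = 2πα, and compares the worst window-average bias of the square kernel at interior points against 2 M1 h:

```python
import numpy as np

# Illustrative density: c(u, v) = 1 + alpha*cos(2*pi*u)*cos(2*pi*v).
# Its margins are uniform, and |c(x) - c(y)| <= M1*(|x1-y1| + |x2-y2|)
# with M1 = 2*pi*alpha, matching the Holder condition used in (A.3).
alpha = 0.5
M1 = 2 * np.pi * alpha
h = 0.05

def c(u, v):
    return 1 + alpha * np.cos(2 * np.pi * u) * np.cos(2 * np.pi * v)

# Square kernel K = 1/2 on [-1, 1]: E[c_n(u,v)] is the average of c over the window.
s = np.linspace(-1, 1, 201)
S, T = np.meshgrid(s, s)
worst = 0.0
# Interior grid points only, so every kernel window stays inside the unit square.
for u in np.linspace(0.1, 0.9, 9):
    for v in np.linspace(0.1, 0.9, 9):
        cbar = np.mean(c(u + h * S, v + h * T))
        worst = max(worst, abs(cbar - c(u, v)))

print(worst, 2 * M1 * h)  # observed worst bias vs the bound 2*M1*h
```

The observed bias is well inside the bound here; (A.3) is a worst-case bound over the whole Hölder class, so slack for any particular smooth density is expected.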


The variance of ĉ_n is given by
$$
Var[\hat{c}_n(u, v)] = \frac{1}{n} Var\Big[\frac{1}{h^2} K\Big(\frac{u - U_1}{h}\Big) K\Big(\frac{v - V_1}{h}\Big)\Big] \le \frac{1}{n h^2} \int_{-1}^{1}\int_{-1}^{1} K^2(s) K^2(t)\, c(u + hs, v + ht)\, ds\, dt.
$$
Hence by the same arguments as above, for small enough h, the variance is bounded by
$$
\frac{1}{n h^2} \int_{-1}^{1}\int_{-1}^{1} K^2(s) K^2(t)\, [c(u, v) + M_1 h (|s| + |t|)]\, ds\, dt \le \frac{1}{n h^2}\, \mu_2^2\, [c(u, v) + 2 M_1 h], \tag{A.4}
$$
where $\mu_2 = \int_{-1}^{1} K^2(t)\, dt$. Combining (A.3) and (A.4), we get
$$
E\{[\hat{c}_n(u, v) - c(u, v)]^2\} \le 4 M_1 h^2 + \frac{1}{n h^2}\, \mu_2^2\, [c(u, v) + 2 M_1 h]. \tag{A.5}
$$

The integral of the right hand side over the region A_{M2} is bounded by its integral over the whole unit square (u, v) ∈ [0, 1] × [0, 1]. For h small enough that 2 M1 h ≤ 1, we get
$$
E\Big\{\iint_{A_{M_2}} [\hat{c}_n(u, v) - c(u, v)]^2\, du\, dv\Big\} \le \int_0^1\!\!\int_0^1 \Big\{4 M_1 h^2 + \frac{1}{n h^2}\, \mu_2^2\, [c(u, v) + 1]\Big\}\, du\, dv = 4 M_1 h^2 + \frac{2 \mu_2^2}{n h^2}. \tag{A.6}
$$

Hence
$$
\Big\{E \iint_{A_{M_2}} |\hat{c}_n(u, v) - c(u, v)|\, du\, dv\Big\}^2 \le E\Big\{\iint_{A_{M_2}} [\hat{c}_n(u, v) - c(u, v)]^2\, du\, dv\Big\} \le 4 M_1 h^2 + \frac{2 \mu_2^2}{n h^2} \le \Big(2\sqrt{M_1}\, h + \frac{2 \mu_2}{\sqrt{n}\, h}\Big)^2.
$$
That is,
$$
E|T_1(\hat{c}_n) - T_1(c)| \le E \iint_{A_{M_2}} |\hat{c}_n(u, v) - c(u, v)|\, du\, dv \le 2\sqrt{M_1}\, h + \frac{2 \mu_2}{\sqrt{n}\, h}. \tag{A.7}
$$

Now we look at the error bound on A^c_{M2}. Since the Hölder condition does not hold there, we cannot control the error in ĉ_n on A^c_{M2}. Notice that
$$
Var[\hat{c}_n(u, v)] = \frac{1}{n} Var\Big[\frac{1}{h^2} K\Big(\frac{u - U_1}{h}\Big) K\Big(\frac{v - V_1}{h}\Big)\Big]
$$
may be unbounded since c(u, v) is unbounded on A^c_{M2}. However,
$$
Var[\hat{c}_n(u, v)] \le \frac{1}{n h^2} \int_{-1}^{1}\int_{-1}^{1} K^2(s) K^2(t)\, c(u + hs, v + ht)\, ds\, dt \le \frac{1}{n h^2}\, M_K^2\, E[\hat{c}_n(u, v)],
$$
where $M_K = \max_{0 \le t \le 1} K(t)$.

Let 1{ĉ_n(u, v) < 1} be the indicator variable of the event ĉ_n(u, v) < 1. Then
$$
\Pr[\hat{c}_n(u, v) < 1] = E[\mathbf{1}\{\hat{c}_n(u, v) < 1\}] \le \frac{Var[\hat{c}_n(u, v)]}{[\bar{c}_n(u, v) - 1]^2}
$$
by Chebyshev's inequality. Let M3 be a constant between 1 and M2, say M3 = (1 + M2)/2 > 1. Then for any point (u, v) ∈ A^c_{M2}, when h is small enough, the h-square centered at (u, v) is contained in A^c_{M3}. Hence $\bar{c}_n(u, v) = \iint K(s) K(t)\, c(u + hs, v + ht)\, ds\, dt \ge M_3$. Since the function x/(x − 1)² is strictly decreasing on (1, ∞), let M4 = M3/(M3 − 1)²; then
$$
E[\mathbf{1}\{\hat{c}_n(u, v) < 1\}] \le \frac{Var[\hat{c}_n(u, v)]}{[\bar{c}_n(u, v) - 1]^2} \le \frac{M_K^2}{n h^2} \cdot \frac{\bar{c}_n(u, v)}{[\bar{c}_n(u, v) - 1]^2} \le \frac{M_K^2 M_4}{n h^2}.
$$

Hence
$$
E|T_2(\hat{c}_n) - T_2(c)| = E[T_2(\hat{c}_n)] = E \iint_{A_{M_2}^c} [1 - \hat{c}_n(u, v)]_+\, du\, dv \le \iint_{A_{M_2}^c} E[\mathbf{1}\{\hat{c}_n(u, v) < 1\}]\, du\, dv \le \frac{M_K^2 M_4}{n h^2}. \tag{A.8}
$$
Here T_2(c) = 0 because c(u, v) > M2 > 1 on A^c_{M2}, and [1 − ĉ_n]_+ ≤ 1{ĉ_n < 1}.

Combining (A.7) and (A.8),
$$
E|\widehat{Ccor} - Ccor| \le 2\sqrt{M_1}\, h + \frac{2 \mu_2}{\sqrt{n}\, h} + \frac{M_K^2 M_4}{n h^2}.
$$
This is (21) with M_5 = M_K^2 M_4.
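The three terms of this bound trade off in h. For the square kernel used in the next section, µ2 = ∫ K²(t) dt = 1/2; the remaining constants M1 and M5 below are placeholders, not values from the paper. A quick numeric sketch confirms that h = b·n^{−1/4} balances the bound: scaled by n^{1/4}, it settles at the constant 2√M1·b + 2µ2/b:

```python
import math

M1, mu2, M5 = 1.0, 0.5, 1.0  # M1, M5 are placeholder constants; mu2 = 1/2 for K = 1/2 on [-1,1]
b = 0.25                     # bandwidth constant, h = b * n^(-1/4)

def bound(n):
    """Right-hand side of the error bound (21) at h = b*n^(-1/4)."""
    h = b * n ** -0.25
    return 2 * math.sqrt(M1) * h + 2 * mu2 / (math.sqrt(n) * h) + M5 / (n * h * h)

# Scaled by n^(1/4), the bound decreases toward 2*sqrt(M1)*b + 2*mu2/b.
for n in [1e4, 1e8, 1e12]:
    print(n, bound(n) * n ** 0.25)
```

The first two terms are both O(n^{−1/4}) at this bandwidth, while the third term is O(n^{−1/2}) and vanishes faster, matching the O(n^{−1/4}) rate of Theorem 3.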

A4 Ccor Estimation With Kernel Copula Density Estimator: Bandwidth Selection And Finite Sample Correction.

We estimate Ccor using the plug-in estimator (20). For the compact support kernel K(·), we took the constant function on [−1, 1]; that is, K(x) = 1/2 for −1 ≤ x ≤ 1. Hence the resulting bivariate kernel is simply a square (u ± h, v ± h).
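The plug-in estimator with this square kernel can be sketched in a few lines. The function name ccor_hat, the grid resolution, and the use of rescaled ranks as pseudo-observations are illustrative choices (not specified by the paper), and no boundary or finite-sample correction is applied here:

```python
import numpy as np

def ccor_hat(x, y, grid=50):
    """Plug-in copula correlation estimate with the square (boxcar) kernel.
    Sketch only: illustrative pseudo-observation and grid choices, no correction."""
    n = len(x)
    h = 0.25 * n ** -0.25                        # bandwidth h = 0.25*n^(-1/4)
    u = (np.argsort(np.argsort(x)) + 0.5) / n    # rescaled ranks as pseudo-observations
    v = (np.argsort(np.argsort(y)) + 0.5) / n
    gu, gv = np.meshgrid(np.linspace(0, 1, grid), np.linspace(0, 1, grid))
    # Square kernel: the density estimate at a grid point is the fraction of
    # observations in the (u +/- h, v +/- h) window, divided by the window area.
    inside = (np.abs(gu.ravel()[:, None] - u) <= h) & \
             (np.abs(gv.ravel()[:, None] - v) <= h)
    c_hat = inside.sum(axis=1) / (n * (2 * h) ** 2)
    # Ccor = integral of [1 - c]_+ over the unit square, approximated on the grid.
    return np.mean(np.maximum(0.0, 1.0 - c_hat))
```

On independent data this returns a small value (inflated somewhat by boundary deflation of the density, which the paper's finite-sample correction addresses), while strongly dependent data concentrate the copula mass and push the estimate up.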

We first make a finite-sample correction to Ĉcor. For any fixed sample size n and fixed bandwidth h, the estimator Ĉcor can never reach the values 1 and 0. This problem diminishes for large sample sizes as Ĉcor converges to the true value by Theorem 3. However, it can be a serious problem for real applications, where the sample size is always finite. We make a linear correction, replacing the plug-in estimate Ĉcor by
$$
(\widehat{Ccor} - Cmin)/(Cmax - Cmin). \tag{A.9}
$$
Here Cmax and Cmin are the maximum and minimum possible values of Ĉcor and are functions of n and h. Cmax is the Ĉcor value on perfectly matched U and V: Ui = Vi, i = 1, ..., n. Cmin is calculated on the most evenly distributed possible configuration of the (Ui, Vi)'s. That is, for Ui arranged in increasing order, the Vi's are arranged in evenly distributed columns with neighboring Vi's separated by distance 2h within each column. The reported values in the numerical studies throughout the paper are for this finite-sample corrected estimator.
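The correction (A.9) itself is a one-line rescaling; the Cmin and Cmax inputs would come from the matched and evenly-spread configurations described above. In this sketch the clipping to [0, 1] is an extra safeguard not stated in the paper, and the printed values are purely illustrative:

```python
def ccor_corrected(raw, cmin, cmax):
    """Linear finite-sample correction (A.9): rescale so the attainable
    range [cmin, cmax] (functions of n and h) maps onto [0, 1]."""
    val = (raw - cmin) / (cmax - cmin)
    return min(1.0, max(0.0, val))  # clip: safeguard, not stated in the paper

print(ccor_corrected(0.5, 0.1, 0.9))  # -> 0.5
```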

We now turn attention to the choice of bandwidth. Theorem 3 suggests the bandwidth h = b · n^{−1/4} for a constant b. While asymptotically any b value works, for any finite sample different b values make a big difference. There is an extensive literature on bandwidth selection for density estimation. Wand and Jones (1993) and Wand and Jones (1994) provided plug-in formulas for choosing the bandwidth in multivariate density estimation. However, those formulas cannot be directly used here, since they are calculated under conditions inappropriate for copula density estimation, as argued in Section 3. They were calculated for other types of kernels and a Gaussian reference distribution, which is not a copula distribution. Also, minimizing the estimation error of Ccor is different from minimizing the error in the density function c(u, v). In any case, we first still tried to plug into Ĉcor the bivariate density estimate from the function kde2d() in R with its default bandwidth. This is similar to what is done with MI estimation by Khan et al. (2007) and Reshef et al. (2011). The resulting estimator Ĉcor is acceptable for big sample sizes, but can be much improved upon for moderate sample sizes smaller than thousands.

Therefore, we used an empirical approach to decide on the constant b for bandwidth selection. For the nine functions listed in Table 2, we calculated the true values of Ccor at various noise levels. Then we estimated Ccor on generated noisy data sets using different bandwidth values at sample sizes of n = 10², 10³, 10⁴ and 10⁵. The averages of Ĉcor over 100 randomly generated noisy data sets are compared to the true Ccor values to decide on an optimal b value. From this simulation, we decided on the bandwidth h = 0.25n^{−1/4}.


Figure 7 plots the simulation results using h = 0.25n^{−1/4}. We can see that the performance of Ĉcor improves as the sample size increases, giving very accurate estimates of Ccor for big sample sizes. For illustration, we show the plots with bandwidths h = 0.1n^{−1/4} and h = 0.5n^{−1/4} in Figure 8 and Figure 9 respectively. Those bandwidth choices are clearly either too small or too big.

All the reported numerical results in this paper use the plug-in estimator Ĉcor in equation (A.9) with the square kernel and bandwidth h = 0.25n^{−1/4}. This choice works well in the numerical studies. Further investigation of other kernel and bandwidth choices is a future research topic. Data-based adaptive bandwidth selection (Jones et al., 1996) could also be investigated.

Another possible future research direction is to consider the Ĉcor estimator over a range of varying bandwidths. This idea is motivated by the MIC measure. Although MIC is theoretically not equitable, Reshef et al. (2011) demonstrated some good attributes of MIC in finite samples. More mathematical investigation of MIC is warranted to understand its behaviour. Studies by Reshef et al. (2013) indicate that taking the maximum value of the MI statistics over varying sizes of grids is essential to its stability across different functional relationships in finite samples. It can be proven that taking the maximum of the plug-in Ĉcor estimator over a range of varying bandwidths still results in a consistent estimator. It would be interesting to investigate whether such estimators can exhibit some good attributes of MIC in finite samples.


Figure 7: The comparison of Ccor with its estimated values under different sample sizes (n = 10², 10³, 10⁴, 10⁵), plotted against the noise level for nine relationships: A Linear, B Quadratic, C Square Root, D Cubic, E Centered Cubic, F Centered Quadratic, G Cosine, H Circle, I Cross. This estimator uses the square kernel density estimator with bandwidth h = 0.25n^{−1/4}.


Figure 8: The comparison of Ccor with its estimated values under different sample sizes (n = 10², 10³, 10⁴, 10⁵), plotted against the noise level for the same nine relationships as in Figure 7. This estimator uses the square kernel density estimator with bandwidth h = 0.1n^{−1/4}.


Figure 9: The comparison of Ccor with its estimated values under different sample sizes (n = 10², 10³, 10⁴, 10⁵), plotted against the noise level for the same nine relationships as in Figure 7. This estimator uses the square kernel density estimator with bandwidth h = 0.5n^{−1/4}.
