Nonparametric Estimation of a Cdf with Mixtures of Cdf Concentrated on Small Intervals

Mohammed Haddou*  François Perron*

CRM-3210, January 2006

*Department of Mathematics and Statistics, University of Montreal, P.O. Box 6128, "Centre-ville," Montréal, Québec, Canada. Corresponding author: [email protected]



Abstract

In this paper, we propose a new nonparametric approach for estimating a cumulative distribution function F using finite mixtures. This new approach is meant to be an alternative to classical methods, including the kernel method, spline-based methods and polynomial-based methods. The properties of the proposed estimator are studied. In particular, Lp convergence for 1 ≤ p ≤ ∞ is obtained, together with the corresponding rates of convergence. Furthermore, the asymptotic behavior of the estimator is established in terms of convergence in law. We obtain the same asymptotic statistical properties as those obtained with the empirical distribution function, under the same regularity conditions. Simulations and examples illustrate the approach.


1 Introduction

We are interested in estimating a cumulative distribution function (cdf) F with support on an interval I of R, bounded or not, based on a sample X1, X2, . . . , Xn from F. Several works have been devoted to the estimation of a cdf. Most of these works require regularity conditions such as the existence of the density function, the density being Lipschitz, or F being twice differentiable, for example. Our aim is to obtain a smooth estimate when necessary, with strong asymptotic results such as those obtained using the empirical distribution function, under the same regularity conditions. Furthermore, we wish to obtain interesting results even for small sample sizes. We hope, on the basis of the theoretical and simulation results we obtain, that the reader will be convinced we achieved these goals.

The most frequently used estimator for a cdf F is the empirical (sample) distribution function (edf) Fn, where

Fn(x) = (1/n) ∑_{i=1}^{n} I(Xi ≤ x)

(I being the indicator function). Here nFn(x) has a binomial distribution B(n, F(x)) and Fn indicates the location of the observations. Also, the edf has a long list of good statistical properties: for example, it is first-order efficient in the minimax sense, and Fn(x) is the unique minimum variance unbiased estimator of F(x) (see Dvoretzky & al. (1956) and Lehmann & Casella (1998), chapter 2). Furthermore, the edf is the nonparametric maximum likelihood estimator of F and plays a central role in nonparametric simulation and the bootstrap (see Efron & Tibshirani (1993), p. 310). For a review of some properties of the edf see, e.g., Csaki (1984), Stute (1982) and Serfling (1980). The fact that Fn is a step function even when the underlying cdf F is continuous has called for the need (in certain areas of application, like estimating the density) for smooth(er) estimators of F. Many smooth estimators have been proposed in the literature. Most of these estimators are based on smoothing the edf. One that has been extensively studied is the kernel estimator, say Fh. Nadaraya (1964) established, under appropriate regularity conditions, the almost sure uniform convergence of Fh. Watson and Leadbetter (1964) proved the asymptotic normality of Fh, and Winter (1979) showed that Fh has the Chung-Smirnov property. Azzalini (1981) derived an asymptotic expression for the mean squared error of Fh. Falk (1983), Mammitzsch (1984), Swanepoel (1988) and Jones (1990) analyzed mean integrated squared error properties of Fh, proving that the smoothed estimator is asymptotically more efficient than the empirical one. Shirahata and Chu (1992) showed that the superiority of kernel estimators does not necessarily hold in the sense of the integrated squared error. Sarda (1993), Altman and Leger (1995) and Chu (1995) are devoted to the problem of bandwidth selection for Fh. See as well Shao & Xiang (1997), Bowman et al. (1998) and Alvarez & al. (2000). Alternative methods have been proposed as well.
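As a concrete illustration (ours, not taken from the paper), the edf is straightforward to compute from a sorted sample; `edf` below is a hypothetical helper built only on the definition Fn(x) = #{i : Xi ≤ x}/n.

```python
from bisect import bisect_right

def edf(sample):
    """Return the empirical distribution function Fn of a sample,
    Fn(x) = #{i : X_i <= x} / n, as a callable."""
    xs = sorted(sample)
    n = len(xs)
    # bisect_right counts how many sorted observations are <= x
    return lambda x: bisect_right(xs, x) / n

Fn = edf([0.2, 0.5, 0.9, 0.4])
print(Fn(0.45))  # 0.5: two of the four observations are <= 0.45
```

The returned function is the right-continuous step function discussed above.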

Several estimators using splines have been investigated. Wahba (1976) proposed the method called histospline (a spline-smoothed histogram), which uses a cubic spline to smooth the edf. Restle (1999) proposed another estimator based on smoothing the edf using cubic splines. He and Shi (1998) use quadratic splines to estimate the cdf. We may mention as well Ramsay (1988, 1998), who uses monotone (regression) splines to estimate a monotone function.

Nonparametric Bayesian estimators have been proposed by, e.g., Perron & Mengersen (2001), who use mixtures of triangular distributions (quadratic splines). Hansen & Lauritzen (2002) use Dirichlet processes to model the prior in order to estimate a concave cdf.

Chaubey & Sen (1996; 2002 in the multivariate case) propose the estimation of a smooth cdf F based on a Poisson operator (using Hille's theorem) to smooth the edf. Babu & al. (2002) propose an estimator based on Bernstein (operator) polynomials to smooth the edf. For other approaches and references see, e.g., Efromovitch (2001).

In order to estimate a cdf F, we start from the fact that the best estimation based on the observations cannot do better than the best approximation based on the fact that F is known. The present work has two goals. The first one is to develop a method for approximating a cdf F. The aim is to approximate the space of all cdfs on the interval I by a finite dimensional space, of dimension m, say. The second one is to apply the approximation to the edf Fn and thereby to estimate F. We would like to mention here that our estimator is not necessarily smooth. In fact, our approach concerns the estimation of a cdf without further conditions on the function F.

When F is known (section 2), a basic approximation G of F is a step function with (discontinuity) jumps of the same amplitude. It is then enough to choose m jumps in the interval I in order to obtain an approximation G such that ‖F − G‖∞ ≤ 1/(2(m+1)) (uniform upper bound). The problem then reduces to determining the location of the jumps (i.e., the nodes); this is an m-dimensional problem. Still, one might prefer other options than working with step functions. Therefore, we seek smoother alternatives to the step function G. In section 2, we develop an approach where, by using a smoothing parameter k, we are able to construct an approximation Gk, in the form of a finite mixture of cdfs (which we call basis functions), such that, using m nodes, we obtain ‖F − Gk‖∞ ≤ k/(2(m+1)). Furthermore, in our construction, we allow the possibility of using an instrumental function H which, when chosen close to F, helps in obtaining a better approximation (in fact, when H is taken as F, then G = F; see lemma 2.3). By analogy with the Bayesian approach, the function H is seen here as a "prior" distribution (a pseudo-prior). The basis functions that enter in the definition of G possess a hierarchical structure, with supports that depend on a vector of nodes. Each cdf (descendant), element of the basis, is a mixture of two cdfs (generators) at a lower level, and the mixture of the two is performed using a third (fixed) cdf, say H. A choice of the nodes that ensures a uniform bound on ‖F − G‖∞ is given in Lemma 2.2. A construction of the basis is given in section 2.3. The approximation G is smooth in general, but it allows for discontinuity jumps when necessary or suitable.

We think that the approximation results we obtain are new and could be used in approximation theory. In particular, we give a probabilistic interpretation of our construction which allows, among other things, to show in a simple way how the construction of monotone splines gives the property of monotonicity. In particular, we construct the monotone spline basis in a probabilistic manner (which is of interest in itself). Furthermore, we think that the (parameter) function H, and the role it plays in the approximation process, is unique to our method.

When F is unknown (section 3), we apply the approximation to the edf Fn. The edf is then used to choose the nodes (among the order statistics) that define the supports of the basis functions and thus define the estimator F̂ of F (Lemma 3.1). With the right choice of the nodes, we obtain ‖Fn − F̂‖∞ ≤ k/(2(m+1)), which allows us to prove almost sure uniform convergence of F̂ to F (Lemma 3.2). We then give, under certain conditions on m and n, the rates of the L∞ convergence and of the Lp convergence for p ≥ 1. We show as well that the estimator F̂ has the Chung-Smirnov property. In section 3.5, we establish the asymptotic behavior of the estimator in terms of the convergence in law of √n ‖F̂ − F‖∞. In sections 3.4 and 3.5, statistics equivalent to the Kolmogorov-Smirnov and Cramer-Von Mises statistics are introduced (where Fn is replaced by F̂). We then obtain asymptotic results similar to those obtained using Fn. In section 3.6, we give some uniform upper bounds on the variance, bias, MSE, and other quantities (lemma 3.11). Note that no condition on F other than being a cdf is assumed here, compared to other methods where, usually, F is supposed to be differentiable, and sometimes twice differentiable. Furthermore, our approach works on bounded and unbounded supports, with no need for the transformations and adjustments that may cause a loss of certain (asymptotic) properties of the estimator.

In section 4, we present numerical simulations to illustrate the performance of the proposed estimator. We think that a series of examples will help in understanding the choices to make concerning the different parameters involved in the construction of the estimator. General guidelines are given relative to the choice of these parameters. Comparisons with recent works are made throughout the paper. Finally, we would like to mention that direct applications are considered in future works. Among these are the estimation of a density function (in progress), the estimation of the survival function and related functions, and some bootstrap applications (smoothed bootstrap).

2 Approximation of functions and properties

In this section we discuss the problem of approximating a cumulative distribution function (cdf) F defined on an interval I ⊂ R, i.e., a non-decreasing, right-continuous function with values in [0, 1] such that F(x)(1 − F(x)) = 0 for all x ∉ I. Let us denote by F(I) the space of all cdfs on I, and let us define for F1, F2 ∈ F(I) the (usual supremum) metric

d_sup(F1, F2) = sup_{x ∈ R} |F1(x) − F2(x)|,

which we denote by ‖F1 − F2‖∞. The space (F(I), d_sup) is a complete metric space.

In a statistical setup, we may have X1, X2, . . . , Xn, a sample from a common distribution Pθ, where the parameter θ depends on different quantities, including a parameter F, F ∈ F. Under a frequentist approach, we may consider the maximum likelihood estimator of θ, but this is well known to lead to overfitting problems, the space F being too large. In a Bayesian context, we may instead establish a prior on Θ, the space of θ, but this may be complicated since the space is infinite dimensional. The technical difficulties of working on F itself thus impel us to consider alternative finite dimensional approximate spaces.

2.1 Approximation and loss

Assume that Fm = {Hω : ω ∈ Ω}, Ω ⊂ R^m, is a space of dimension m approximating F. Two questions arise: what is the loss associated with the use of Fm instead of F, and can we find an approximate space Fm for which this loss is small? A measure of the loss associated with the approximation of F by Fm can be given by

λ(F, Fm) = sup_{F ∈ F} inf_{H ∈ Fm} d_sup(F, H).

Hence, λ(F, Fm) is always bounded from above by one. The second question may now be phrased as follows: given ε > 0, can we find an approximate space Fm such that λ(F, Fm) < ε? And, if so, is there a simple upper bound on m? In practice we would prefer to have both m and ε small! In the next section, we introduce an approximate space that achieves these goals. This approximate space consists of finite mixtures of cdfs.

2.2 The mixture

Let a = inf{x : x ∈ I} and b = sup{x : x ∈ I}. The approximate space Fm we choose to work with is based on m points a ≤ y1 ≤ · · · ≤ ym ≤ b called nodes (knots) and a parameter k called the smoothing parameter. For convenience, we set yj = a for j = −k+2, . . . , 0 and yj = b for j = m+1, . . . , m+k−1 (multiple nodes of order at least k−1 at the end points a and b). Any Gk ∈ Fm is a finite mixture and satisfies the following representation:

Gk = ∑_{j=1}^{m+k−1} ωkj Gk,j = (1/((k−1)(m+1))) ∑_{j=1}^{m+1} ∑_{l=j}^{j+k−1} Gk,l,

with weights

ωkj = min(j, k−1, m+k−j) / ((k−1)(m+1)), for j = 1, 2, . . . , m+k−1.

The weights ωkj are therefore positive and add up to 1. Each function Gk,j is a cdf on the interval Ik,j = [y_{j−k+1}, y_j] for j = 1, 2, . . . , m+k−1. The elements Gk,j are called basis functions. A construction of Gk,j is given in the next section. We further suppose that m ≥ k−1 > 0. The function Gk thus defined is therefore an element of F(I) (hence a cdf by construction).
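For illustration (a sketch of ours, not code from the paper), the weights ωkj can be computed directly from this formula; the helper name `weights` is an assumption.

```python
def weights(k, m):
    """Mixture weights w_{kj} = min(j, k-1, m+k-j) / ((k-1)(m+1)), j = 1, ..., m+k-1."""
    assert m >= k - 1 > 0  # the condition imposed on m and k
    return [min(j, k - 1, m + k - j) / ((k - 1) * (m + 1)) for j in range(1, m + k)]

w = weights(3, 7)           # k = 3, m = 7: m+k-1 = 9 weights
print(w[0], w[-1], sum(w))  # end weights 1/16, and the weights add up to 1
```

The two end weights are smaller than the interior ones, which matches the table accompanying Figure 2 below.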

2.3 The basis

The basis elements are such that Gk,j ≥ Gk,j+1 for j = 1, 2, . . . , m+k−2. For convenience, we set Gk,0 = 1 and Gk,m+k = 0. There is a hierarchical structure in the construction of the basis, based on a fixed cdf H on I. Let us define on Ik,j, j = 1, 2, . . . , m+k−1, the following functions:

Hk,j(x) = (H(x) − H(y_{j−k+1})) / (H(y_j) − H(y_{j−k+1}))   if H(y_{j−k+1}) < H(y_j),
Hk,j(x) = I(x ≥ y_j)                                         if H(y_{j−k+1}) = H(y_j).

The hierarchical structure is then the following:

Step 1 (initial step): G1,j(x) = I(x ≥ y_j) for all x ∈ I and j = 1, . . . , m (Dirac cdfs).

Step l + 1: G_{l+1,j} = H_{l+1,j} G_{l,j−1} + (1 − H_{l+1,j}) G_{l,j}.
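The recursion above can be sketched in code. The following is our own minimal implementation (the names `make_basis`, `nodes` are assumptions, not the authors'); it takes H as a callable cdf, pads the nodes with the required multiplicities at the end points, and clamps the ratio defining H_{l,j} so that each mixing cdf extends to the whole line.

```python
def make_basis(k, nodes, H, a=0.0, b=1.0):
    """Basis cdfs G_{k,j}, j = 1, ..., m+k-1, built by the hierarchical recursion
    G_{l+1,j} = H_{l+1,j} G_{l,j-1} + (1 - H_{l+1,j}) G_{l,j}."""
    m = len(nodes)

    def y(j):  # nodes, with multiple nodes at the end points a and b
        return a if j <= 0 else (b if j > m else nodes[j - 1])

    def Hlj(l, j):  # mixing cdf H_{l,j} on [y_{j-l+1}, y_j]
        lo, hi = H(y(j - l + 1)), H(y(j))
        if lo < hi:
            # clamping extends the ratio to a cdf on the whole line
            return lambda x: min(max((H(x) - lo) / (hi - lo), 0.0), 1.0)
        return lambda x: 1.0 if x >= y(j) else 0.0

    # Step 1: Dirac cdfs G_{1,j}(x) = I(x >= y_j), j = 1, ..., m
    G = [lambda x, j=j: 1.0 if x >= y(j) else 0.0 for j in range(1, m + 1)]
    for l in range(1, k):
        # boundary conventions G_{l,0} = 1 and G_{l,m+l} = 0
        padded = [lambda x: 1.0] + G + [lambda x: 0.0]
        G = [lambda x, h=Hlj(l + 1, j), left=padded[j - 1], right=padded[j]:
                 h(x) * left(x) + (1.0 - h(x)) * right(x)
             for j in range(1, m + l + 1)]
    return G

# Sanity check of Lemma 2.3 below: with H = F = Unif(0,1) and y_1 = F^{-1}(1/2),
# the mixture with weights 1/2, 1/2 (k = 2, m = 1) reproduces F.
U = lambda x: min(max(x, 0.0), 1.0)
G2 = make_basis(2, [0.5], U)
Gk = lambda x: 0.5 * G2[0](x) + 0.5 * G2[1](x)
print(Gk(0.25), Gk(0.6))  # approximately 0.25 and 0.6
```

Each level of the loop mixes neighboring basis functions of the previous level, which is exactly why monotonicity is preserved at every stage.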


Clearly Gk,j ≥ Gk,j+1 for all k and j, j = 1, 2, . . . , m+k−2. The fact that Gk,j is a cdf on Ik,j = [y_{j−k+1}, y_j] comes from the next lemma.

Lemma 2.1 We have the following results,

(a) Let F1, F2 be two cdfs on I, an interval of R, such that F1 ≥ F2, and let F3 be a cdf on R. The function F(2) defined by

F(2) = F3 F1 + (1 − F3) F2

is then a cdf on I.

(b) The function F2∧3 = F3 + (1 − F3)F2 = F2 + (1 − F2)F3 is a cdf on I (the extreme case F1 ≡ 1). In particular, if F1 = F2 = 1, then F(2) = 1.

(c) The function F1∨3 = F3 F1 is a cdf on I (the extreme case F2 ≡ 0). In particular, if F1 = F2 = 0, then F(2) = 0.

Proof. (a) Let X1 be a random variable (r.v.) with cdf F1. Let U be a r.v. independent of X1 and uniformly distributed on [0, 1]. Set

X2 = F2^{−1}[(1 − U) sup{F1(x) : x < X1} + U F1(X1)].

The cdf of X2 is then F2 and we have P[X2 ≥ X1] = 1. Let X3 be a random variable independent of (X1, X2) and with cdf F3. If we denote by X(2) the second order statistic based on X1, X2 and X3, then

P[X(2) ≤ x] = P[X(2) ≤ x | X3 ≤ x] P[X3 ≤ x] + P[X(2) ≤ x | X3 > x] P[X3 > x]
            = F1(x)F3(x) + F2(x)(1 − F3(x))
            = F(2)(x) for all x ∈ I;

recall that X1 ≤ X2 with probability 1. Note that F2 ≤ F(2) ≤ F1 and X1 ≤ X(2) ≤ X2 with probability one.

(b) The function F2∧3 is the cdf of the r.v. min(X2, X3).

(c) The function F1∨3 is the cdf of the r.v. max(X1, X3).

Remarks :

(1) Suppose we have multiple nodes of multiplicity l at yj, e.g., yj = yj+1 = · · · = y_{j+l−1}; then we obtain G1,j = G1,j+1 = · · · = G1,j+l−1 and therefore

G_{l,j+l−1}(x) = G_{l−1,j+l−2}(x) = · · · = G_{1,j}(x) if yj = y_{j+l−1}.

(2) If a node yj is of multiplicity l with l ≥ k, then the function G will have a jump at yj. If H is continuous, then the amplitude of the jump is given by (l − k + 1)+/(m + 1), where a+ = max(0, a).

(3) If H is chosen as the cdf of a uniform distribution on I (bounded), then the basis functions Gk,j become piecewise polynomials of degree k − 1 and the function Gk becomes a monotone spline. So, for k = 4, e.g., the approximation is a cubic (monotone) spline.

(4) The function H may depend on the nodes.

(5) If H is flat between two nodes, then Gk is flat between these two nodes.

(6) If H has a discontinuity jump at a point x0, then Gk has a jump at x0.

(7) In figure 1, we consider the case where F is the cdf of a Beta(2, 4) and H is taken as the cdf of a Beta(3, 3), and we show the building of an element of the basis for k = 4, namely G4,4 (a cdf on [y1, y4]), through the different stages involved. The function in (d) (G4,4) is the mixture of the two functions in (c) (G3,3 and G3,4). Each function in (c) is a mixture of two functions in (b), etc.


[Figure 1, with F ≡ Beta(2, 4) and H ≡ Beta(3, 3), shows G4,4 as a function of cdfs at a lower level, over the nodes y−1 = y0 = 0, y1, . . . , y4, y5 = y6 = 1: (a) the Dirac cdfs G1,1, G1,2, G1,3 and G1,4; (b) G2,2, G2,3 and G2,4; (c) G3,3 and G3,4; (d) G4,4 = H4,4 G3,3 + (1 − H4,4) G3,4.]

Figure 1: The building of the basis function G4,4.

2.4 The choice of the nodes and the parameters H and k.

When k = 1, Gk is a step function. In order to obtain a good approximation for F, it is natural to set

yj = F^{−1}(j/(m+1)),

where F^{−1} is the generalized inverse of F, F^{−1}(x) = inf{t : F(t) ≥ x}. For k > 1, we shall keep the same nodes. In fact, we have the following lemma on the quality of the approximation.

Lemma 2.2 (Choice of the nodes.) If we take yi = F^{−1}(i/(m+1)) for i = 1, . . . , m, where F^{−1} denotes the generalized inverse of F, F^{−1}(t) = inf{x : F(x) ≥ t}, then we obtain

‖F − Gk‖∞ ≤ k/(2(m+1)).

Proof. Let x ∈ I \ {a, y1, . . . , ym, b}; then there exists i ∈ {0, 1, . . . , m} for which yi < x < yi+1. We have, on the one hand,

i/(m+1) ≤ F(x) < (i+1)/(m+1),

and on the other hand,

(i+1)/(m+1) − k/(2(m+1)) ≤ ∑_{j=1}^{i} ωkj ≤ Gk(x) ≤ ∑_{j=1}^{i+k−1} ωkj ≤ i/(m+1) + k/(2(m+1)).

Hence, for x ∈ I \ {a, y1, . . . , ym, b},

−k/(2(m+1)) ≤ F(x) − Gk(x) < k/(2(m+1)),

that is,

|F(x) − Gk(x)| ≤ k/(2(m+1)), for all x ∈ I \ {a, y1, . . . , ym, b}.

Finally, since F and Gk are continuous from the right, we obtain

‖F − Gk‖∞ ≤ k/(2(m+1)).
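As a numerical sketch (ours, with hypothetical helper names), the nodes of Lemma 2.2 can be obtained by bisection from any non-decreasing cdf with known support, using only the definition F^{-1}(t) = inf{x : F(x) ≥ t}.

```python
from math import sqrt

def gen_inverse(F, t, lo, hi, tol=1e-12):
    """Generalized inverse F^{-1}(t) = inf{x : F(x) >= t} by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) >= t:
            hi = mid   # mid satisfies F(mid) >= t: the infimum is at or below mid
        else:
            lo = mid
    return hi

def choose_nodes(F, m, lo, hi):
    """Nodes y_j = F^{-1}(j/(m+1)), j = 1, ..., m, as in Lemma 2.2."""
    return [gen_inverse(F, j / (m + 1), lo, hi) for j in range(1, m + 1)]

ys = choose_nodes(lambda x: x * x, 7, 0.0, 1.0)  # F(x) = x^2 on [0,1], so y_j = sqrt(j/8)
print(ys[3])  # close to sqrt(4/8) ~ 0.7071
```

Bisection works here because F is non-decreasing, so {x : F(x) ≥ t} is an interval extending to the right.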

It follows that for a given ε > 0, we have λ(F, Fm) ≤ ε whenever m > (k/2) ε^{−1}, where Fm is the approximate space of dimension m. We need to emphasize here that the choice of the nodes is really critical. The function H is like a guess for F. When the nodes are adequately selected, a poor choice for H cannot be dramatic. However, we hope that a perfect guess will imply that Gk = F for all k > 1. In fact, this is true in the case where F is continuous (see the next lemma).

Lemma 2.3 (H ≡ F ⟹ Gk ≡ F.) If F is a continuous cdf on an interval I ⊂ R, yj = F^{−1}(j/(m+1)) for j = 1, 2, . . . , m, and H = F, then Gk = F for all k, k ≤ m+1.

Proof. The proof is done by induction on k, m being fixed. The result holds for k = 2. Suppose it is also true for k = 2, . . . , l, where l ≤ m+1. For all x ∈ R we have

G_{l+1}(x) = ∑_{j=1}^{m+l} ω_{l+1,j} G_{l+1,j}(x)
           = ∑_{j=1}^{m+l} ω_{l+1,j} [H_{l+1,j}(x) G_{l,j−1}(x) + (1 − H_{l+1,j}(x)) G_{l,j}(x)]
           = ω_{l+1,1} H_{l+1,1}(x) G_{l,0}(x)
             + ∑_{j=1}^{m+l−1} [ω_{l+1,j+1} H_{l+1,j+1}(x) + ω_{l+1,j}(1 − H_{l+1,j}(x))] G_{l,j}(x)
             + ω_{l+1,m+l} (1 − H_{l+1,m+l}(x)) G_{l,m+l}(x).

Note that G_{l,0}(x) = 1 and G_{l,m+l}(x) = 0 for all x. Therefore,

G_{l+1}(x) = (1/l) F(x) I_{(−∞,y1]}(x) + (1/(l(m+1))) I_{(y1,∞)}(x)
             + ∑_{j=1}^{m+l−1} { ((l−1)/l) ω_{l,j} G_{l,j}(x) + (1/l) F(x) I_{(yj,yj+1]}(x)
             − (j/(l(m+1))) I_{(yj,yj+1]}(x) + (1/(l(m+1))) I_{(yj+1,∞)}(x) }.

By induction, we have ∑_{j=1}^{m+l−1} ω_{l,j} G_{l,j}(x) = F(x) for all x. Furthermore, we have

1. I_{(−∞,y1]}(x) + ∑_{j=1}^{m+l−1} I_{(yj,yj+1]}(x) = 1.

2. I_{(y1,∞)}(x) − ∑_{j=1}^{m+l−1} j I_{(yj,yj+1]}(x) = −∑_{j=1}^{m} (j−1) I_{(yj,yj+1]}(x).

3. ∑_{j=1}^{m+l−1} I_{(yj+1,∞)}(x) = ∑_{j=1}^{m} I_{(yj+1,∞)}(x) = ∑_{j=1}^{m} ∑_{i=j+1}^{m} I_{(yi,yi+1]}(x)
   = ∑_{i=2}^{m} ∑_{j=1}^{i−1} I_{(yi,yi+1]}(x) = ∑_{i=2}^{m} (i−1) I_{(yi,yi+1]}(x) = ∑_{j=1}^{m} (j−1) I_{(yj,yj+1]}(x).

Thus, G_{l+1}(x) = F(x) for all x.


[Figure 2, with F ≡ Beta(2, 4), H ≡ Beta(3, 3), k = 3 and m = 7 nodes, shows: (a) the cdfs F and H; (b) the densities f and h; (c) plots of F, G and the weighted basis functions wj Gk,j, 1 ≤ j ≤ 9; (d) a plot of |F(x) − G(x)| against the uniform bound k/(2(m+1)) = 1/4. The nodes and the corresponding values are:

j      :  1       2       3       4       5       6       7
yj     :  0.1275  0.1937  0.2537  0.3138  0.3785  0.4541  0.5563
F(yj)  :  0.125   0.25    0.375   0.5     0.625   0.75    0.875
Gk(yj) :  0.1023  0.2375  0.3652  0.4910  0.6158  0.7390  0.8530

with y−1 = y0 = 0, y8 = y9 = 1, and weights wk,j = 1/16, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/16 for j = 1, . . . , 9.]

Figure 2: Example of a Beta(2, 4).

Next, we give two examples. The first one concerns a smooth cdf, namely a Beta(2, 4), and the second one concerns a cdf with a discontinuity jump.

Example 1: Let F be the cdf of a Beta(2, 4) distribution. For illustration, we choose to take k = 3 and only m = 7 nodes, and for H we take a symmetric distribution on [0, 1], a Beta(3, 3) distribution. The function H is seen as a prior distribution. When no knowledge about F is available and I is bounded, we could choose H as the cdf of a uniform distribution, for example. In figure 2, we have plotted the cdfs F and H ("the prior") and the corresponding densities f and h. Figure 2(c) shows the plots of the approximation G together with F and the 9 functions wkj Gk,j. The weights at the end points are equal to 1/(2(m+1)). In figure 2(d) we have plotted the (absolute) error function |F(x) − G(x)|, and we can see that ‖F − G‖∞ is relatively far from the uniform upper bound k/(2(m+1)) = 1/4.

Example 2: In this example, we consider a cdf F on [0, 1] with a discontinuity jump at x = 1/2. The function F is given by

F(x) = x/2                 if x < 1/2,
F(x) = (1 + (2x − 1)²)/2   otherwise.

There are multiple nodes at x = 1/2, so that G jumps at x = 1/2. For this example we choose to take k = 3 and m = 10, 30, 50, and 100. The function H is chosen as the cdf of a uniform distribution on [0, 1]. In this case, the elements Gk,j of the basis are piecewise polynomials of degree 2 = k − 1, so that G becomes a quadratic (monotone) spline. It is clear that prior knowledge about this feature of F (the jump) would have made us choose another H that reflects this feature, and therefore obtain a better approximation. This ability of the approximation G, and thus of the estimator F̂ (see section 3), to be smooth in the regions where we expect F to be so, and to follow the jumps of F whenever there are any, makes our method more general than its competitors and helps us obtain better estimates of F.


[Figure 3 shows the plots of F and G for k = 3, H ≡ Unif(0, 1) and m = 10, 30, 50, 100.]

Figure 3: Example of a cdf on [0, 1] with a discontinuity jump at 1/2.

3 Estimation and statistical results

In this section we apply the results of section 2 to the estimation of an unknown distribution function F ∈ F(I) based on a sample X1, . . . , Xn from F. Let Fn(x) = (1/n) ∑_{i=1}^{n} I(Xi ≤ x) denote the empirical distribution function, and let us denote G by F̂ (where we drop the subscript k for simplicity). In the next subsections, we are going to look at the asymptotic behavior of the estimator F̂ of F. In section 3.1, we give a uniform upper bound on ‖Fn − F̂‖∞. In section 3.2 we establish the (almost sure) uniform convergence of F̂ to F (lemma 3.2); then, by adding some conditions on m and n, we are able to give, in lemma 3.3, the rate of the uniform convergence. In section 3.3, we give a uniform upper bound on the Lp norm ‖F̂ − F‖p (p ≥ 1) (and by doing that we prove the Lp convergence of F̂ to F and give the rate of this convergence). In section 3.4, we define a Cramer-Von Mises-like statistic and give a result relative to it (lemma 3.6). In section 3.5, we define a statistic similar to the Kolmogorov-Smirnov one. We establish the convergence in law of √n ‖F̂ − F‖∞ (lemma 3.9) and we show that the estimator F̂ has the Chung-Smirnov property. We end the section by giving some uniform results (uniform upper bounds) concerning the bias, variance, MSE and other quantities (Lemma 3.11). We would like to note here that, by adding some conditions on m and n, we are able to obtain the asymptotic results and properties that the Kolmogorov-Smirnov and the Cramer-Von Mises statistics have.

3.1 Distance of F̂ to Fn.

When F is unknown, we use the edf Fn to choose the nodes (among the order statistics). The selected nodes are given by yj = Fn^{−1}(j/(m+1)) for j = 1, . . . , m. In other words, we approximate the edf. We have the next lemma.

Lemma 3.1 (Distance of F̂ to Fn.) The above choice of the nodes implies that yj = x_(⌈nj/(m+1)⌉), and in this case we obtain

‖Fn − F̂‖∞ ≤ k/(2(m+1)),

where ⌈a⌉ = min{n ∈ Z : a ≤ n} (the ceiling) and x_(·) denotes the order statistics. In the case where m = n, we have yj = x_(j) for 1 ≤ j ≤ n and we obtain

‖Fn − F̂‖∞ ≤ k/(2(n+1)).

We may compare this upper bound to the one obtained in the interesting paper by Babu & al. (2002) (theorem 2.2). There, the authors propose a (degree m polynomial) estimator Fn,m based on Bernstein polynomials and show that if F is differentiable with density f that is Lipschitz of order 1, then at best we have

‖Fn,m − Fn‖∞ = O((log(n)/n)^{3/4}) almost surely (when m = n).

Compare this rate of convergence with the one in lemma 3.1 (equation (20)), where we suppose F to be a cdf without further conditions.

3.2 Uniform convergence of F̂ to F (L∞ convergence.)

Using the previous lemma, we show that one can obtain (almost sure) uniform convergence of F̂ to F.

Lemma 3.2 We have

lim_{m,n→∞} ‖F̂ − F‖∞ = 0 a.s.

Proof. By taking yj = x_(⌈nj/(m+1)⌉), we obtain

‖F̂ − F‖∞ ≤ ‖F̂ − Fn‖∞ + ‖Fn − F‖∞ ≤ k/(2(m+1)) + ‖Fn − F‖∞.

According to the Glivenko-Cantelli theorem, ‖Fn − F‖∞ converges almost surely to 0 as n tends to ∞. Therefore ‖F̂ − F‖∞ converges almost surely to 0 as m and n tend to ∞.

By adding conditions on m and n, we are able to control the rate of the uniform convergence.

Lemma 3.3 (Rate of the uniform convergence.) If sup_n [n/(m² log log(n))] < ∞, then ‖F̂ − F‖∞ tends to 0 almost surely at the rate √(2 log log(n)/n).

Proof. We have

lim ‖F̂ − F‖∞ √(n/(2 log log(n))) ≤ lim (k/(2(m+1))) √(n/(2 log log(n))) + lim ‖Fn − F‖∞ √(n/(2 log log(n))),

and we have the result by noting that, by the law of the iterated logarithm,

lim ‖Fn − F‖∞ √(n/(2 log log(n))) ≤ 1/2, with probability one.

3.3 Lp convergence of F̂ to F.

Let us first recall the following inequalities:

x^s + y^s ≤ (x + y)^s ≤ 2^{s−1} (x^s + y^s), for every s ≥ 1 and x, y > 0,

2^{s−1} (x^s + y^s) ≤ (x + y)^s ≤ x^s + y^s, for every 0 ≤ s ≤ 1 and x, y > 0,

and

P{‖F − Fn‖∞ > t} ≤ 2 e^{−2nt²}, for all t > 0.

For the last inequality, see Massart (1990) and Devroye (2001). We have the next lemma.


Lemma 3.4 Let p ≥ 1 and let us denote by ‖·‖p the Lp-norm, i.e.,

‖G‖p = ‖G‖p,K = ( ∫_{−∞}^{∞} |G(x)|^p dK(x) )^{1/p}, for all G, K ∈ F(I).

We have

‖F̂ − F‖p,F ≤ { 2^{p−1} [ p Γ(p/2)/(2n)^{p/2} + (k/(2(m+1)))^p ] }^{1/p}
            ≤ 2^{−1/p} [ √2 (p Γ(p/2))^{1/p}/√n + k/(m+1) ].

Proof. Since

|F̂(x) − F(x)| ≤ ‖F − Fn‖∞ + k/(2(m+1)), for every x ∈ I,

we obtain

|F̂(x) − F(x)|^p ≤ [ ‖F − Fn‖∞ + k/(2(m+1)) ]^p ≤ 2^{p−1} [ ‖F − Fn‖∞^p + (k/(2(m+1)))^p ].

It follows that

‖F̂ − F‖p,F^p ≤ 2^{p−1} [ E ‖F − Fn‖∞^p + (k/(2(m+1)))^p ].

We obtain

E ‖F − Fn‖∞^p = ∫_0^∞ P(‖F − Fn‖∞ > t^{1/p}) dt ≤ 2 ∫_0^∞ e^{−2n t^{2/p}} dt = p ∫_0^∞ u^{p/2−1} e^{−2nu} du = p Γ(p/2)/(2n)^{p/2}.

Hence,

‖F̂ − F‖p,F ≤ { 2^{p−1} [ p Γ(p/2)/(2n)^{p/2} + (k/(2(m+1)))^p ] }^{1/p}
            ≤ 2^{−1/p} [ √2 (p Γ(p/2))^{1/p}/√n + k/(m+1) ].
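Both bounds of Lemma 3.4 are explicit, so they can be evaluated numerically; below is our own sketch (hypothetical helper names) using `math.gamma`.

```python
from math import gamma, sqrt

def lp_bound(p, n, m, k):
    """Sharper bound of Lemma 3.4: {2^{p-1}[p Gamma(p/2)/(2n)^{p/2} + (k/(2(m+1)))^p]}^{1/p}."""
    inner = p * gamma(p / 2.0) / (2.0 * n) ** (p / 2.0) + (k / (2.0 * (m + 1))) ** p
    return (2.0 ** (p - 1) * inner) ** (1.0 / p)

def lp_bound_loose(p, n, m, k):
    """Looser closed form: 2^{-1/p} [ sqrt(2) (p Gamma(p/2))^{1/p} / sqrt(n) + k/(m+1) ]."""
    return 2.0 ** (-1.0 / p) * (sqrt(2.0) * (p * gamma(p / 2.0)) ** (1.0 / p) / sqrt(n)
                                + k / (m + 1.0))

# For p = 2 the sharper bound reduces to sqrt(2 (1/n + k^2/(4(m+1)^2))), as used in Lemma 3.6
print(lp_bound(2, 100, 10, 3), lp_bound_loose(2, 100, 10, 3))
```

Evaluating both for a few (p, n, m, k) shows how the m-dependent term dominates unless m grows with n.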

3.4 A Cramer-Von Mises like statistic.

The Cramer-Von Mises statistic is given by (see, e.g., Serfling (1980))

Cn = n ∫_{−∞}^{∞} [Fn(x) − F(x)]² dF(x).

We have the following well-known result.


Lemma 3.5 (Finkelstein (1971).) With probability 1,

lim Cn/(2 log log n) = 1/π².

Let us define a similar statistic based on F̂ in the following way:

C*n = n ∫_{−∞}^{∞} [F̂(x) − F(x)]² dF(x),

so that we obtain the next lemma.

Lemma 3.6 With probability 1,

lim C*n/(2 log log n) ≤ (k²/4) lim n/(m² log log n),

so that

lim C*n/(2 log log n) = 0 if lim n/(m² log log n) = 0.

Proof. We have (cf. Lemma 3.4)

‖F̂ − F‖₂² ≤ 2 (1/n + k²/(4(m+1)²)),

thus

C*n/(2 log log n) ≤ (k²/4) · n/((m+1)² log log n) + 1/(log log n).

Remark: Note that if we choose m such that

lim n/(m² log log n) = 0,

then we obtain

lim Cn / lim C*n = ∞.

3.5 Asymptotic behavior of F̂.

Let us recall the following well-known results (see Serfling (1980), for example).

Theorem 3.7 (Kolmogorov, 1933.) If F is continuous and if we set Dn = ‖F − Fn‖∞, we obtain

lim_{n→∞} Pr(√n Dn ≤ d) = ∑_{j=−∞}^{∞} (−1)^j e^{−2j²d²}, d > 0.
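For reference, the limiting distribution on the right-hand side is easy to evaluate by truncating the series (a sketch of ours; the critical-value comment assumes the classical Kolmogorov-Smirnov tables).

```python
from math import exp

def kolmogorov_cdf(d, terms=100):
    """K(d) = sum_{j=-inf}^{inf} (-1)^j exp(-2 j^2 d^2)
            = 1 + 2 sum_{j>=1} (-1)^j exp(-2 j^2 d^2), for d > 0."""
    return 1.0 + 2.0 * sum((-1) ** j * exp(-2.0 * j * j * d * d)
                           for j in range(1, terms + 1))

print(kolmogorov_cdf(1.36))  # ~0.9505: 1.36 is the classical 5% critical value for sqrt(n) Dn
```

The series converges extremely fast, so a modest truncation already gives machine precision for d bounded away from 0.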

Theorem 3.8 (Smirnov, 1941.) Let us introduce the two quantities

D+n = sup_x [Fn(x) − F(x)] and D−n = sup_x [F(x) − Fn(x)].

If F is continuous, then we have the following result:

lim_n P(√n D+n > d) = lim_n P(√n D−n > d) = e^{−2d²}, d > 0.


Lemma 3.9 (Asymptotic behavior of \(\bar F\).) If F is continuous and m is chosen such that \(n/m^2 \to 0\) as \(n, m \to \infty\), then

\[
\lim_{m,n\to\infty} \Pr(\sqrt{n}\, \bar D_n \le d) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2d^2}, \quad d > 0,
\]

where \(\bar D_n = \|\bar F - F\|_\infty\).

Proof. We have

\[
\bigl|\, \|\bar F - F\|_\infty - \|F_n - F\|_\infty \,\bigr| \le \|\bar F - F_n\|_\infty \le \frac{k}{2(m+1)}.
\]

Hence,

\[
-\frac{k\sqrt{n}}{2(m+1)} + \sqrt{n}\, D_n \le \sqrt{n}\, \bar D_n \le \sqrt{n}\, D_n + \frac{k\sqrt{n}}{2(m+1)}.
\]

We obtain,

\[
\Pr\!\left(\sqrt{n}\, D_n \le d - \frac{k\sqrt{n}}{2(m+1)}\right)
\le \Pr\!\left(\sqrt{n}\, \bar D_n \le d\right)
\le \Pr\!\left(\sqrt{n}\, D_n \le d + \frac{k\sqrt{n}}{2(m+1)}\right),
\]

and since (Kolmogorov's theorem)

\[
\lim_{n\to\infty} \Pr(\sqrt{n}\, D_n \le d) = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2j^2d^2}, \quad d > 0,
\]

while \(k\sqrt{n}/(2(m+1)) \to 0\) under the condition \(n/m^2 \to 0\), the result then follows.

Lemma 3.10 Let's put

\[
\bar D_n^+ = \sup_x\, [\bar F(x) - F(x)] \quad \text{and} \quad \bar D_n^- = \sup_x\, [F(x) - \bar F(x)].
\]

If F is continuous and m is chosen such that \(n/m^2 \to 0\) as \(n, m \to \infty\), then

\[
\lim_{n\to\infty} P(\sqrt{n}\, \bar D_n^+ > d) = \lim_{n\to\infty} P(\sqrt{n}\, \bar D_n^- > d) = e^{-2d^2}, \quad d > 0.
\]

Remarks

(1) The Chung-Smirnov property. Note that (cf. Lemma 3.3.) if m is chosen such that \(\lim n/(m^2 \log\log n) = 0\), then \(\bar F\) has the Chung-Smirnov property, i.e., with probability one,

\[
\limsup_{n\to\infty}\, (2n/\log\log n)^{1/2}\, \|\bar F - F\|_\infty \le 1.
\]

(2) Pointwise convergence in law. If m is chosen such that \(\lim n/m^2 = 0\), then

\[
\sqrt{n}\,\bigl(\bar F(x) - F(x)\bigr) \xrightarrow[n,m\to\infty]{\;D\;} N\bigl(0,\, F(x)(1 - F(x))\bigr) \quad \text{for all } x.
\]

(3) Let's recall the two following inequalities:

\[
\|\bar F - F\|_\infty \le \frac{k}{2(m+1)} + \|F_n - F\|_\infty,
\]

and

\[
P\bigl(\|F_n - F\|_\infty > t\bigr) \le 2\,e^{-2nt^2}, \quad \text{for all } t > 0.
\]

Then for all \(t > 0\), if m is chosen such that \(m \ge k/t - 1\), we obtain

\[
P\bigl(\|\bar F - F\|_\infty > t\bigr) \le 2\,e^{-nt^2/2}.
\]
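Remark (3) gives a distribution-free finite-sample guarantee: pick m from k and the target accuracy t, and the DKW-type tail bound follows. A small sketch (our own, with illustrative numbers):

```python
import math

def min_m(k, t):
    """Smallest integer m with m >= k/t - 1, so that k/(2(m+1)) <= t/2."""
    return max(0, math.ceil(k / t - 1))

def tail_bound(n, t):
    """Bound 2 exp(-n t^2 / 2) on P(||Fbar - F||_inf > t), valid once
    m >= k/t - 1 (remark (3) above)."""
    return 2.0 * math.exp(-n * t * t / 2.0)

# Example: k = 4 and accuracy t = 0.05 require m >= 79; with n = 4000
# observations the tail bound is 2 e^{-5}, about 0.0135.
```

This makes the interplay explicit: t fixes a minimal m, and n then controls the confidence level.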


3.6 Some uniform results concerning the bias, variance, MSE and other quantities.

In the next lemma, we give uniform upper bounds on the bias, variance, MSE and other quantities.

Lemma 3.11 We have the following results:

(a) \(|\mathrm{Bias}(\bar F)| = |E(\bar F - F)| \le \frac{k}{2(m+1)} = \delta_m\). Therefore \(\bar F\) is asymptotically unbiased.

(b) \(\mathrm{MSE}(\bar F) \le \left(\delta_m + \frac{1}{2\sqrt{n}}\right)^2 \le 2\left(\delta_m^2 + \frac{1}{4n}\right)\).

(c) \(\mathrm{Var}(\bar F) \le \left(\delta_m + \frac{1}{2\sqrt{n}}\right)^2 \le 2\left(\delta_m^2 + \frac{1}{4n}\right)\).

(d) \(E\bigl(\|\bar F - F\|_\infty\bigr) \le \delta_m + \frac{1}{\sqrt{n}}\).

(e) \(\mathrm{Var}\bigl(\|\bar F - F\|_\infty\bigr) \le 2\left(\delta_m^2 + \frac{1}{n}\right)\).

Proof.

(a) Bias:

\[
|\mathrm{Bias}| = |E(\bar F - F)| = |E(\bar F - F_n)| \le E\,|\bar F - F_n| \le \delta_m.
\]

(b) MSE:

\[
\mathrm{MSE}(\bar F) = E(\bar F - F)^2 = E(\bar F - F_n + F_n - F)^2
\le \left(\sqrt{E(\bar F - F_n)^2} + \sqrt{E(F_n - F)^2}\right)^2
\le \left(\delta_m + \frac{1}{2\sqrt{n}}\right)^2 \le 2\left(\delta_m^2 + \frac{1}{4n}\right).
\]

(c) Variance: \(\mathrm{Var}(\bar F) \le \mathrm{MSE}(\bar F)\).

(d) We have,

\[
E\bigl(\|\bar F - F\|_\infty\bigr) \le E(\|\bar F - F_n\|_\infty) + E(\|F_n - F\|_\infty) \le \delta_m + \frac{1}{\sqrt{n}}.
\]

(e) We have

\[
\mathrm{Var}\bigl(\|\bar F - F\|_\infty\bigr)
= E\bigl(\|\bar F - F\|_\infty^2\bigr) - E^2\bigl(\|\bar F - F\|_\infty\bigr)
\le E\bigl(\|\bar F - F\|_\infty^2\bigr)
\le E\bigl(\|\bar F - F_n\|_\infty + \|F_n - F\|_\infty\bigr)^2
\le 2\,E\bigl(\|\bar F - F_n\|_\infty^2 + \|F_n - F\|_\infty^2\bigr)
\le 2\bigl(\delta_m^2 + E\,\|F_n - F\|_\infty^2\bigr)
\le 2\left(\delta_m^2 + \frac{1}{n}\right).
\]
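For concrete sample sizes these bounds are easy to tabulate. A minimal sketch (our own; the settings n = 100, m = 15, k = 4 match the simulations of Section 4):

```python
import math

def delta_m(k, m):
    """delta_m = k / (2 (m + 1)), the uniform bias bound of Lemma 3.11(a)."""
    return k / (2.0 * (m + 1))

def mse_bound(k, m, n):
    """Uniform MSE bound (delta_m + 1/(2 sqrt(n)))^2 of Lemma 3.11(b)."""
    return (delta_m(k, m) + 0.5 / math.sqrt(n)) ** 2

def sup_norm_mean_bound(k, m, n):
    """Bound delta_m + 1/sqrt(n) on E ||Fbar - F||_inf, Lemma 3.11(d)."""
    return delta_m(k, m) + 1.0 / math.sqrt(n)

# With k = 4, m = 15, n = 100: delta_m = 0.125, MSE bound = 0.030625,
# and the sup-norm mean bound is 0.225.
```

As the formulas show, the bias term \(\delta_m\) dominates unless m grows with n, which is exactly the trade-off discussed in Section 4.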


4 Guidelines on the choice of the parameters k, H, m and simulations.

We start this section by explaining the choice of the different parameters involved in the construction of the estimator, i.e. k, H and m. When m and k are chosen, we work with the nodes as in Lemma 3.1. Here is a summary guideline for the choice of the parameters.

1. The choice of k ("the smoothing parameter"). The parameter k is seen here as a smoothing parameter: the larger k, the smoother the estimator. When H is taken as the cdf of a uniform distribution (on a bounded support), the basis functions \(G_{k,j}\) become a basis for the monotone splines of order k − 1 (with variable nodes). So, to have a cubic spline, for example, we need to take k = 4. However, a large value of k makes the jumps of \(\bar F\) more difficult to obtain. In fact, to have a jump at \(y_j\), a multiple node of order r (say \(y_j = y_{j+1} = \cdots = y_{j+r-1}\)), we need to have r ≥ k. In our simulations, we take k = 4, unless we have some knowledge about certain features of F.

2. The choice of H (the instrumental cdf). By analogy with the Bayesian approach, H is seen as a prior distribution. In fact, if H is equal to F (not possible in practice!), then we obtain excellent results, as mentioned in Section 2.4 (Lemma 2.3.). A choice of H that is close to F helps mostly in the regions where F is flat or almost flat (in particular in the tails), that is, in those regions where the density f of F (when it exists) is very small. Prior knowledge about certain features of the distribution, like unimodality, asymmetry, concavity, discontinuity jumps, etc., dictates the choice of H and therefore helps obtain a better estimate. When the support I of F is bounded but no other knowledge about F is available, H might be taken as the cdf of a uniform distribution. Note that when the support I of F is not bounded, a choice other than the uniform distribution is necessary; this choice will have an influence on \(\bar F\) in the tails. If H has a discontinuity jump at a given point, the estimator will present a jump at that same point, so prior information of this kind helps get a better estimate of F, especially when the sample size is small.

3. The choice of m (link between H and \(F_n\)). The parameter m = m(n) depends on n in general. When H is smooth, small values of m induce a very smooth \(\bar F\), whereas large values of m make \(\bar F\) stick to \(F_n\), the edf, while still remaining smooth (unless we have enough multiple nodes to make \(\bar F\) jump). By analogy with kernel methods, we may compare 1/m to h, the bandwidth (window). Recall (Lemma 3.9.) that by choosing m such that \(n/m^2 \to 0\), we are able to obtain the asymptotic distribution of \(\sqrt{n}\,\|\bar F - F\|_\infty\); we might use this condition to help us choose m. The choice of m is also associated with H: if one strongly believes that H is close to F, then choosing m small is good. Furthermore, if m is small relative to n, then the spacings will be very stable. If m is small, \(\bar F\) looks like H, and if m is large, \(\bar F\) looks like \(F_n\). Note that the choice of m is more important than the choice of k.
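One concrete schedule satisfying the condition \(n/m^2 \to 0\) of Lemma 3.9 is \(m(n) = \lceil n^{\alpha} \rceil\) for any \(\alpha > 1/2\). The exponent 2/3 in the sketch below is our own illustrative choice, not a recommendation made in the paper:

```python
import math

def choose_m(n, alpha=2 / 3):
    """A possible schedule m(n) = ceil(n^alpha); any alpha > 1/2 makes
    n / m^2 -> 0, the condition of Lemma 3.9 (alpha = 2/3 is illustrative)."""
    return math.ceil(n ** alpha)

# The ratio n / m(n)^2 shrinks as n grows, e.g. for n = 100, 10000:
# 100 / 22^2 ~ 0.21 and 10000 / 465^2 ~ 0.046.
```

In practice one would still temper this schedule with the considerations above: a trusted H argues for a smaller m, a distrusted one for a larger m.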

4.1 Numerical examples

To illustrate the performance of the estimator, we have chosen six examples, four of which concern cdfs with bounded supports and two with unbounded ones. In Figure 4 we have plotted the cdfs and corresponding densities of the selected distributions together with the chosen cdf H (dashed). For each of these cdfs, we have computed 10 estimates and plotted them together with the respective true distribution (cf. Figure 5). In what follows, we comment on the results obtained for the corresponding examples (cf. Figure 5):

(a) The standard normal, N(0, 1). In the case of the standard normal distribution, we choose to take H as a \(N(-0.5, (1.5)^2)\). In Figure 5.a we show the graphs of 10 estimates from 10 samples with n = 100, m = 15 and k = 4.

(b) The exponential distribution, Exp(1). For the exponential distribution, we choose to take H as the cdf of an Exp(0.5) ("large variance"). In Figure 5.b we show the graphs of 10 estimates from 10 samples with n = 100, m = 15 and k = 4.


(c) Mixture of beta distributions. We choose to take the following mixture:

\[
\tfrac12\,\mathrm{Beta}(10, 20) + \tfrac12\,\mathrm{Beta}(30, 20).
\]

The function H is taken symmetric to reflect "absence" of prior knowledge; we choose to use a Beta(9, 9). We have taken 10 samples with n = 100, m = 15 and k = 4. We can see in Figure 5.c that the estimator does quite well. For our simulations, we take k = 4 in general, unless we know about a particular feature of F, like being very smooth; in that case, we may take k greater than 4.

(d) Another mixture of beta distributions (with a flat region). We choose to take another mixture with a flat area, that is:

\[
\tfrac12\,\mathrm{Beta}(10, 40) + \tfrac12\,\mathrm{Beta}(40, 20).
\]

We choose for H the mixture \(\tfrac23\,\mathrm{Beta}(4, 13) + \tfrac13\,\mathrm{Beta}(25, 20)\) (suppose we had prior knowledge that F was bimodal and we had an idea where the modes were). We have taken 10 samples with n = 200, m = 30 and k = 4. We can see in Figure 5.d that the estimator does well. It is clear that prior knowledge about the flat region would have helped in obtaining a better estimate.

(e) A cdf with an infinite derivative. We take the following cdf on [0, 1],

\[
F(x) =
\begin{cases}
\dfrac{1 - (1 - 2x)^{1/8}}{2} & \text{if } x < \dfrac12,\\[2mm]
\dfrac{1 + (2x - 1)^{1/8}}{2} & \text{otherwise.}
\end{cases}
\]

The cdf H is chosen as a Beta(10, 10) (symmetric). We have taken 10 samples with n = 100, m = 20 and k = 4. It is clear that no polynomial based estimator would have followed the infinite slope.

(f) A distribution with a jump. Like in example 2, we consider a cdf F on [0, 1] with a discontinuity jump at x = 1/2. The function F is given by

\[
F(x) =
\begin{cases}
\dfrac{x}{2} & \text{if } x < \dfrac12,\\[2mm]
\dfrac{1 + (2x - 1)^2}{2} & \text{otherwise.}
\end{cases}
\]

The cdf H is chosen as a U(0, 1) to reflect the fact that no prior knowledge is available. We have taken 10 samples with n = 200, m = 30 and k = 4. Figure 5.f shows that the estimator does well. An accumulation of nodes provokes the jump in \(\bar F\). Clearly, another choice of H, reflecting the discontinuity jump, would have led to a better estimate.
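The beta-mixture experiments above are easy to replicate. The sampler below is our own sketch (the paper does not give its simulation code); it draws from a generic finite mixture of beta distributions using only the standard library:

```python
import random

def sample_beta_mixture(n, components, weights, seed=None):
    """Draw n observations from a mixture of Beta(a, b) components.
    `components` is a list of (a, b) pairs, `weights` the mixing weights."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # First pick a component according to the mixing weights,
        # then draw from the corresponding beta distribution.
        a, b = rng.choices(components, weights=weights)[0]
        out.append(rng.betavariate(a, b))
    return out

# Example (c): 0.5 Beta(10, 20) + 0.5 Beta(30, 20), n = 100 as in the paper.
sample = sample_beta_mixture(100, [(10, 20), (30, 20)], [0.5, 0.5], seed=1)
```

Feeding such samples to the estimator with n = 100, m = 15, k = 4 reproduces the setting of Figure 5.c.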


[Figure 4: panels F1 to F6 showing, for each example distribution, the density f, the cdf F and the instrumental cdf H (dashed).]

Figure 4: Plots of the distributions used for numerical evaluations with the respective chosen H.


[Figure 5: panels (a) n = 100, m = 15, k = 4; (b) n = 100, m = 15, k = 4; (c) n = 100, m = 15, k = 4; (d) n = 200, m = 30, k = 4; (e) n = 100, m = 20, k = 4; (f) n = 200, m = 30, k = 4.]

Figure 5: Plots of the true distribution together with the estimates from 10 samples.


References

[1] Altman, N., Léger, C. (1995). "Bandwidth selection for kernel distribution function estimation." J. Statist. Plann. Inference 46, 195-214.

[2] Azzalini, A. (1981). "A note on the estimation of a distribution function and quantiles by a kernel method." Biometrika 68 (1), 326-328.

[3] Bowman, A., Hall, P., Prvan, T. (1998). "Bandwidth selection for the smoothing of distribution functions." Biometrika 85 (4), 799-808.

[4] Babu, G. J., Canty, A. J., Chaubey, Y. P. (2002). "Application of Bernstein polynomials for smooth estimation of a distribution and density function." J. Statist. Plann. Inference 105 (2), 377-392.

[5] Chaubey, Y. P., Sen, P. K. (1996). "On smooth estimation of survival and density functions." Statist. Decisions 14, 1-22.

[6] Chu, I. S. (1995). "Bootstrap smoothing parameter selection for distribution function estimation." Math. Japon. 41 (1), 189-197.

[7] Csáki, E. (1984). "Empirical distribution function." Handbook of Statistics (P. R. Krishnaiah and P. K. Sen, eds), vol. 4, 405-430.

[8] Devroye, L., Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer-Verlag, New York.

[9] Dvoretzky, A., Kiefer, J., Wolfowitz, J. (1956). "Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator." Ann. Math. Statist. 27 (3), 642-669.

[10] Efromovich, S. (2001). "Second order efficient estimating a smooth distribution function and its applications." Methodology and Computing in Applied Probability 9, 179-198.

[11] Efron, B., Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, London.

[12] Falk, M. (1983). "Relative efficiency and deficiency of kernel type estimators of smooth distribution functions." Statist. Neerlandica 37, 73-83.

[13] Finkelstein, H. (1971). "The law of the iterated logarithm for empirical distributions." Ann. Math. Statist. 42, 607-615.

[14] Hansen, B. M., Lauritzen, S. L. (2002). "Nonparametric Bayes inference for concave distribution functions." Statist. Neerlandica 56 (1), 110-127.

[15] He, X., Shi, P. (1998). "Monotone B-spline smoothing." J. Amer. Statist. Assoc. 93 (442), 643-650.

[16] Jones, M. C. (1990). "The performance of kernel density functions in kernel distribution function estimation." Statist. Probab. Lett. 9, 129-132.

[17] Lehmann, E. L., Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York.

[18] Mammitzsch, V. (1984). "On the asymptotically optimal solution within a certain class of kernel type estimators." Statist. Decisions 2, 247-255.

[19] Massart, P. (1990). "The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality." Ann. Probab. 18, 1269-1283.

[20] Nadaraya, E. A. (1964). "Some new estimates for distribution functions." Theory Probab. Appl. 15, 497-500.

[21] Perron, F., Mengersen, K. (2001). "Bayesian nonparametric modeling using mixtures of triangular distributions." Biometrics 57, 518-528.

[22] Restle, E. M. (1999). "Estimating distribution functions with smoothing splines." Technical Report Statap.1999.5, DMA, EPF Lausanne, CH.

[23] Sarda, P. (1993). "Smoothing parameter selection for smooth distribution functions." J. Statist. Plann. Inference 35, 65-75.

[24] Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York.

[25] Shao, Y., Xiang, X. (1997). "Some extensions of the asymptotics of a kernel estimator of a distribution function." Statist. Probab. Lett. 34, 301-308.

[26] Shirahata, S., Chu, I. S. (1992). "Integrated squared error of kernel-type estimator of distribution function." Ann. Inst. Statist. Math. 44 (3), 579-591.

[27] Stute, W. (1982). "The oscillation behavior of empirical processes." Ann. Probab. 10, 86-107.

[28] Swanepoel, J. W. H. (1988). "Mean integrated squared error properties and optimal kernels when estimating a distribution function." Comm. Statist. Theory Methods 17 (11), 3785-3799.

[29] de Uña-Álvarez, J., González-Manteiga, W., Cadarso-Suárez, C. (2000). "Kernel distribution function estimation under the Koziol-Green model." J. Statist. Plann. Inference 87, 199-219.

[30] Wahba, G. (1976). "Histosplines with knots which are order statistics." J. Roy. Statist. Soc. B 38, 140-151.


François Perron, Mohammed Haddou
Dép. de mathématiques et de statistique, Université de Montréal
C.P. 6128, succursale "Centre-ville," Montréal, QC H3C 3J7
[email protected] [email protected]
