lecture 1: nonparametric estimation of distribution ...cai/as2015/lecture1.pdfas the parametric...

About this course Nonparametric statistics Nonparametric estimation of distribution functions and quantiles

Lecture 1: Nonparametric Estimation ofDistribution Functions and Quantiles

Applied Statistics 2014

1 / 32

About the course

12 weeks: 2 hours a week

Mainly on nonparametric methods in statistics

Exercises: Some are theoretical, and others are practical and involvesome computational methods

Prerequisites: Basic knowledge of probability and statistics, and acertain level of mathematical maturity.

Instructor: Juan Juan Cai email: [email protected]

Teaching assistant: Aurelius Zilko email: [email protected]

course website:http://dutiosb.twi.tudelft.nl/~cai/teaching.html

2 / 32

About the course

12 weeks: 2 hours a week

Mainly on nonparametric methods in statistics

Exercises: Some are theoretical, and others are practical and involvesome computational methods

Prerequisites: Basic knowledge of probability and statistics, and acertain level of mathematical maturity.

Instructor: Juan Juan Cai email: [email protected]

Teaching assistant: Aurelius Zilko email: [email protected]

course website:http://dutiosb.twi.tudelft.nl/~cai/teaching.html

2 / 32

Evaluation

Each week except the first week, there will be one or two group pre-sentations. Each group has 15 minutes at most.

Presenting a given paper, to explain what you think is most importantabout the paper, like main contribution, novelty, findings etc.Or demonstrating the implementation of a R code constructed by yourgroup to solve an exercise.Please send your PDF slides and R code to me and Zilko.

A final exam (written, open book): 10 points (maximum).

Suppose you get a points (out of 10) from the presentation and bpoints from the exam. Your final grade equals min{a/10 + b, 10}.

3 / 32

Aim of this course

Obtain some knowledge of nonparametric methods in statistics

Study several modern and popular nonparametric methods instatistics, and understand their pros and cons

Understand some of the theory behind nonparametric models.

4 / 32

Topics for this course...

Tentative list:

Nonparametric estimation of distribution functions and quantiles(notes and Ch. 2 of Wasseman All of Nonparametric Statistics).

Goodness of fit (notes).

Permutation tests (article + notes).

Bootstrapping

Kernel density estimator

Smoothing: general concepts (Ch. 4 of Wasserman).

Nonparametric regression (Ch. 5 of Wasserman).

Extreme value theory (notes).

5 / 32

Topics for this course...

Tentative list:

Nonparametric estimation of distribution functions and quantiles(notes and Ch. 2 of Wasseman All of Nonparametric Statistics).

Goodness of fit (notes).

Permutation tests (article + notes).

Bootstrapping

Kernel density estimator

Smoothing: general concepts (Ch. 4 of Wasserman).

Nonparametric regression (Ch. 5 of Wasserman).

Extreme value theory (notes).

5 / 32

What is nonparametric statistics?

Wolfowitz (1942):

We shall refer to this situation (where a distribution is com-pletely determined by the knowledge of its finite parameter set)as the parametric case, and denote the opposite case, where thefunctional forms of the distributions are unknown, as the non-parametric case.

Randles, Hettmansperger and Casella (2004)

Nonparametric statistics can and should be broadly defined toinclude all methodology that does not use a model based on asingle parametric family.

Wasserman (2005)

The basic idea of nonparametric inference is to use data to inferan unknown quantity while making as few assumptions aspossible.

6 / 32

Wolfowitz (1942):

Wasserman (2005)

6 / 32

Wolfowitz (1942):

Wasserman (2005)

6 / 32

Parametric modelsDefinitionIf the data X has a probability distribution P = Pθ where θ ∈ Θ ⊂ Rd,then we speak of a parametric model.

Example: 799 Waiting times at an IT helpdesk:

Requirement: (at least) 99% of the waiting times should be below oneminute.Is the requirement satisfied, based on the data?

7 / 32

Parametric modelsDefinitionIf the data X has a probability distribution P = Pθ where θ ∈ Θ ⊂ Rd,then we speak of a parametric model.

Example: 799 Waiting times at an IT helpdesk:

Requirement: (at least) 99% of the waiting times should be below oneminute.Is the requirement satisfied, based on the data?

7 / 32

Parametric models

Build a probability model. Assume X1, . . . , Xn are a sample from either

A a Gamma distribution

pθ,m(x) =θe−θx(θx)m−1

(m− 1)!, x > 0;

B or a Lognormal distribution

pµ,σ(x) =1

xσ√

2πexp

((log x− µ)2

, x > 0.

8 / 32

Parametric models

PA(W ≤ 1) ≈ 0.993 PB(W ≤ 1) ≈ 0.968.

Conclusions can be very sensitive to parametric assumptions.

9 / 32

Parametric models

PA(W ≤ 1) ≈ 0.993 PB(W ≤ 1) ≈ 0.968.

Conclusions can be very sensitive to parametric assumptions.9 / 32

Advantages and disadvantages of parametric models

Advantages:

Convenience: parametric models are generally easier to work with.

Efficiency: If a parametric model is correct, then P methods aremore efficient than NP methods. (However, the loss in efficiency ofNP methods is often small.)

Interpretation: Sometimes parametric models are easier to interpret.

Disadvantages:

Sometimes it is hard to find a suitable parametric model.

High risk of Misspecification: assuming a (the?) wrong model.

10 / 32

Advantages and disadvantages of parametric models

Advantages:

Convenience: parametric models are generally easier to work with.

Efficiency: If a parametric model is correct, then P methods aremore efficient than NP methods. (However, the loss in efficiency ofNP methods is often small.)

Interpretation: Sometimes parametric models are easier to interpret.

Disadvantages:

Sometimes it is hard to find a suitable parametric model.

High risk of Misspecification: assuming a (the?) wrong model.

10 / 32

Estimating a distribution function

Let X1, . . . , Xn be a random sample (i.i.d.) from some unknowndistribution function F (discrete or continuous).

Empirical distribution function

The empirical (cumulative) distribution function (EDF) is defined by

F̂n(x) =1

1{Xi ≤ x}.

1{Xi ≤ x} =

{1 if Xi ≤ x0 otherwise

11 / 32

Estimating the cumulative distribution function (CDF)

Properties:

Since nF̂n(x) ∼ Bin(n, F (x)), easy to see that

E[F̂n(x)

]= F (x), Var

(F̂n(x)

)=F (x)(1− F (x))

Glivenko-Cantelli Theorem (fundamental theorem of statistics)For any distribution function F ,

supx∈R|F̂n(x)− F (x)| → 0 a.s. (n→∞).

Dvoretzky-Kiefer-Wolfowith (DKW) inequality: for any ε > 0, forany n,

(supx∈R|F̂n(x)− F (x)| > ε

)≤ 2e−2nε

12 / 32

A confidence-band for F

Using the DKW inequality we can construct a confidence band for F . Letthe significance level α ∈ (0, 1):

Ln(x) = max{F̂n(x)−

α, 0},

Un(x) = max{F̂n(x) +

α, 1},

DKW confidence bandFor any CDF F and all n

P(Ln(x) ≤ F (x) ≤ Un(x) for all x) > 1− α .

13 / 32

Nerve Data

14 2. Estimating the cdf and Statistical Functionals

0.0 0.5 1.0 1.5

FIGURE 2.1. Nerve data. Each vertical line represents one data point. The solidline is the empirical distribution function. The lines above and below the middleline are a 95 percent confidence band.

2.3 Example (Nerve data). Cox and Lewis (1966) reported 799 waiting times

between successive pulses along a nerve fiber. Figure 2.1 shows the data and

the empirical cdf !Fn. !

The following theorem gives some properties of !Fn(x).

2.4 Theorem. Let X1, . . . , Xn ! F and let !Fn be the empirical cdf. Then:

1. At any fixed value of x,

!Fn(x)#

= F (x) and V"

!Fn(x)#

=F (x)(1 " F (x))

Thus, MSE = F (x)(1!F (x))n # 0 and hence !Fn(x)

P"#F (x).

2. (Glivenko–Cantelli Theorem).

| !Fn(x) " F (x)| a.s."# 0.

3. (Dvoretzky–Kiefer–Wolfowitz (DKW) inequality). For any ! > 0,

|F (x) " !Fn(x)| > !

%$ 2e!2n!2 . (2.5)

14 / 32

Confidence intervals for F (x) at a fixed point x

Find ln, un such that

P(ln ≤ F (x) ≤ un) ≥ 1− α.

SincenF̂n(x) ∼ Bin(n, F (x)),

the problem can be restated:

Observe Y ∼ Bin(n, p), find a CI for p.

1 Exact (Clopper-Pearson)

2 Asymptotic (Wald)

3 Wilson method

4 Asymptotic using variance stablilizing transformation

15 / 32

Confidence intervals for F (x) at a fixed point x

Find ln, un such that

P(ln ≤ F (x) ≤ un) ≥ 1− α.

SincenF̂n(x) ∼ Bin(n, F (x)),

the problem can be restated:

Observe Y ∼ Bin(n, p), find a CI for p.

1 Exact (Clopper-Pearson)

2 Asymptotic (Wald)

3 Wilson method

4 Asymptotic using variance stablilizing transformation

15 / 32

1. Exact (Clopper-Pearson)

Pp(Y = k) =(nk

)pk(1− p)n−k, k = 0, . . . , n.

For an observed value y, the Clopper-Pearson interval (equal-tail) isdefined by

[pL, pU ],

where pL is the solution of

2= PpL(Y ≥ y) =

)piL(1− pL)n−i;

and pU is the solution of

2= PpU (Y ≤ y) =

)piU (1− pU )n−i.

This interval is in general conservative, with coverage probability always≥ 1− α (due to the discreteness of Y , exact coverage is not possible).

16 / 32

1. Exact (Clopper-Pearson)

Pp(Y = k) =(nk

)pk(1− p)n−k, k = 0, . . . , n.

For an observed value y, the Clopper-Pearson interval (equal-tail) isdefined by

[pL, pU ],

where pL is the solution of

2= PpL(Y ≥ y) =

)piL(1− pL)n−i;

and pU is the solution of

2= PpU (Y ≤ y) =

)piU (1− pU )n−i.

This interval is in general conservative, with coverage probability always≥ 1− α (due to the discreteness of Y , exact coverage is not possible).

16 / 32

2. Asymptotic (Wald)Let p̂n = Y/n. By the central limit theorem

p̂n − p√p(1− p)

D→ N (0, 1).

Strong law of large numbers p̂na.s.−→ p

Slutsky’s lemma:

p̂n − p√p̂n(1− p̂n)

D→ N (0, 1).

Isolating p, we obtain the following CI for p[p̂n − zα/2

√p̂n(1− p̂n)

n, p̂n + zα/2

√p̂n(1− p̂n)

where zα/2 denotes the upper α/2-quantile of the standard normaldistribution.

17 / 32

p̂n − p√p(1− p)

D→ N (0, 1).

Slutsky’s lemma:

p̂n − p√p̂n(1− p̂n)

D→ N (0, 1).

√p̂n(1− p̂n)

n, p̂n + zα/2

√p̂n(1− p̂n)

17 / 32

p̂n − p√p(1− p)

D→ N (0, 1).

Slutsky’s lemma:

p̂n − p√p̂n(1− p̂n)

D→ N (0, 1).

√p̂n(1− p̂n)

n, p̂n + zα/2

√p̂n(1− p̂n)

17 / 32

3. Wilson Method

In the derivation of the Wald interval, forget about the plug-in stepfor p̂n.

Asymptotically, for large n

(−zα/2 ≤

p̂n − p√p(1− p)

≤ zα/2)≈ 1− α.

Solving for p gives the desired confidence interval, which hasendpoints

p̂n +z2α/22n ± zα/2

√p̂n(1−p̂n)+z2α/2/(4n)

1 + z2α/2/n.

18 / 32

3. Wilson Method

In the derivation of the Wald interval, forget about the plug-in stepfor p̂n.

Asymptotically, for large n

(−zα/2 ≤

p̂n − p√p(1− p)

≤ zα/2)≈ 1− α.

Solving for p gives the desired confidence interval, which hasendpoints

p̂n +z2α/22n ± zα/2

√p̂n(1−p̂n)+z2α/2/(4n)

1 + z2α/2/n.

18 / 32

4. Asymptotic using variance stabilizing transformation

Apply a variance stabilizing transformation

Suppose ϕ : R→ R is differentiable at p with ϕ′(p) 6= 0. By theδ-method 1

√n(ϕ(p̂n)− ϕ(p))

D→ N (0, p(1− p)(ϕ′(p))2) .

Take ϕ so that ϕ′(p) = 1/(√p(1− p). That is take

ϕ(x) = 2 arcsin√x.

Zn(p) := 2√n(arcsin(

√p̂n)− arcsin(

√p))

D→ N (0, 1).

1Let Yn be a sequence of random variables and ϕ : R→ R a map that is

differentiable at µ and ϕ′(µ) 6= 0. Suppose√n(Yn − µ)

D→ N (0, 1). Then√n(ϕ(Yn)− ϕ(µ))

D→ N (0, (ϕ′(µ))2).19 / 32

√n(ϕ(p̂n)− ϕ(p))

D→ N (0, p(1− p)(ϕ′(p))2) .

√p̂n)− arcsin(

√p))

D→ N (0, 1).

D→ N (0, 1). Then√n(ϕ(Yn)− ϕ(µ))

D→ N (0, (ϕ′(µ))2).19 / 32

√n(ϕ(p̂n)− ϕ(p))

D→ N (0, p(1− p)(ϕ′(p))2) .

√p̂n)− arcsin(

√p))

D→ N (0, 1).

D→ N (0, 1). Then√n(ϕ(Yn)− ϕ(µ))

D→ N (0, (ϕ′(µ))2).19 / 32

Center p in P(−zα/2 ≤ Zn(p) ≤ zα/2

)≈ 1− α with

√p̂n)− arcsin(

√p))

The resulting confidence interval has endpoints

(arcsin(

√p̂n)− zα/2

), sin2

(arcsin(

√p̂n) +

20 / 32

Center p in P(−zα/2 ≤ Zn(p) ≤ zα/2

)≈ 1− α with

√p̂n)− arcsin(

√p))

The resulting confidence interval has endpoints

(arcsin(

√p̂n)− zα/2

), sin2

(arcsin(

√p̂n) +

20 / 32

Coverage of binomial confidence intervals

Study coverage probability of a CI. If the CI takes the form[l(Y ), u(Y )], evaluate

Pr(p ∈ [l(Y ), u(Y )]) for Y ∼ Bin(n, p).

Not easy, so we conduct a simulation study, where we instead lookat the coverage fraction.

Monte Carlo simulation scheme:

1 Choose n and p ∈ (0, 1).

2 Generate 1000 Bin(n, p) random variables.

3 For each realization, compute the 95% confidence interval(Clopper-Pearson, Wald (asymptotic), Wilson and variancestabilizing).

4 Compute the fraction of times that the CIs contain the trueparameter p.

21 / 32

Study coverage probability of a CI. If the CI takes the form[l(Y ), u(Y )], evaluate

Pr(p ∈ [l(Y ), u(Y )]) for Y ∼ Bin(n, p).

Not easy, so we conduct a simulation study, where we instead lookat the coverage fraction.

Monte Carlo simulation scheme:

1 Choose n and p ∈ (0, 1).

2 Generate 1000 Bin(n, p) random variables.

3 For each realization, compute the 95% confidence interval(Clopper-Pearson, Wald (asymptotic), Wilson and variancestabilizing).

4 Compute the fraction of times that the CIs contain the trueparameter p.

21 / 32

22 / 32

23 / 32

24 / 32

Estimating quantiles using order statistics

Quantile functionThe quantile function of F is its generalized inverse function given by

qp = F−1(p) = inf{u | F (u) ≥ p} p ∈ (0, 1).

Order statisticsSuppose X1, . . . , Xn is a random sample from a continuous distribution,then the order statistics are denoted by

X(1) < X(2) < · · · < X(n).

25 / 32

qp = F−1(p) = inf{u | F (u) ≥ p} p ∈ (0, 1).

X(1) < X(2) < · · · < X(n).

25 / 32

qp = F−1(p) = inf{u | F (u) ≥ p} p ∈ (0, 1).

X(1) < X(2) < · · · < X(n).

25 / 32

If we estimate F−1 by the empirical quantile function F̂−1n , then theestimator of qp is then the order statistics X(i), where p ∈

(i−1n , in

To find the (1− α) confidence interval for qp, the main idea is tochoose r < s such that

P(X(r) < qp ≤ X(s)

)≥ 1− α.

First note thatP(X(r) < qp ≤ X(s)

)= 1−

(P(X(r) ≥ qp

(X(s) < qp

Find r and s such that

P(X(r) ≥ qp

)≤ α/2 and P

(X(s) < qp

)≤ α/2.

26 / 32

If we estimate F−1 by the empirical quantile function F̂−1n , then theestimator of qp is then the order statistics X(i), where p ∈

(i−1n , in

To find the (1− α) confidence interval for qp, the main idea is tochoose r < s such that

P(X(r) < qp ≤ X(s)

)≥ 1− α.

First note thatP(X(r) < qp ≤ X(s)

)= 1−

(P(X(r) ≥ qp

(X(s) < qp

Find r and s such that

P(X(r) ≥ qp

)≤ α/2 and P

(X(s) < qp

)≤ α/2.

26 / 32

Computing P(X(s) < qp

{X(s) < qp} is equivalent to say that at least s observations are lessthan qp.

Write N =∑ni=1 1{Xi < qp}. Suppose that F is continuous. Then

P(X1 < qp) = P(X1 ≤ qp) = p. Thus N ∼ bin(n, p) and

P(X(s) ≤ qp

)= P(N ≥ s) =

)pi(1− p)n−i.

Similarly, we have P(X(r) ≥ qp

)= P(N < r).

27 / 32

Computing P(X(s) < qp

{X(s) < qp} is equivalent to say that at least s observations are lessthan qp.

Write N =∑ni=1 1{Xi < qp}. Suppose that F is continuous. Then

P(X1 < qp) = P(X1 ≤ qp) = p. Thus N ∼ bin(n, p) and

P(X(s) ≤ qp

)= P(N ≥ s) =

)pi(1− p)n−i.

Similarly, we have P(X(r) ≥ qp

)= P(N < r).

27 / 32

A confidence interval for qp

Let X(0) = −∞ and X(n+1) =∞.

The (1− α) confidence interval for qp is given by

(X(r), X(s)],

where r = max{k ∈ {0, 1, . . . , n} : P(N < k) ≤ α/2} ands = min{k ∈ {1, . . . , n+ 1} : P(N ≥ k) ≤ α/2} .

28 / 32

Back to the waiting time data

PA(W ≤ 1) ≈ 0.993 PB(W ≤ 1) ≈ 0.968.

q̂0.99 ≈ 0.90 and a 95% confidence interval for q0.99 is given by (0.81,

29 / 32

Back to the waiting time data

PA(W ≤ 1) ≈ 0.993 PB(W ≤ 1) ≈ 0.968.

q̂0.99 ≈ 0.90 and a 95% confidence interval for q0.99 is given by (0.81,

1.21]29 / 32

Useful R commands

Method R-function within packageEmpirical distr. function ecdf statsBinomial confidence interval binconf Hmisc

30 / 32

Group Presentation (February 9)

Group 1Present the paper “Interval Estimation for a Binomial Proportion”(Brown-Cai-DasGupta 2001). Choose two methods and compare theperformance by simulation.

31 / 32

Group Presentation (February 9)

Group 2Write an R-function that implements a confidence interval for the p-thquantile based on the order-statistics of the data. Simulate 1000 timesa sample of size 10 from a standard normal distribution and computethe coverage of the confidence interval with p = 0.5. Repeat forsample sizes 50, 100 and 1000.

32 / 32

lecture 1: nonparametric estimation of distribution ...cai/as2015/lecture1.pdfas the parametric...

Documents