lecture 4: statistics review ii date: 9/5/02 hypothesis tests: power estimation: likelihood,...

Lecture 4: Statistics Review II

Date: 9/5/02

Hypothesis tests: power

Estimation: likelihood, moment estimation, least squares

Statistical properties of estimators

Types of Errors

False positive (Type I): Probability ($\alpha$) that H0 is rejected when it is true.

False negative (Type II): Probability ($\beta$) of accepting H0 when it is false.

| | Accept H0 | Reject H0 |
|---|---|---|
| H0 true | $1-\alpha$ | Type I = $\alpha$ |
| H0 false | Type II = $\beta$ | power = $1-\beta$ |

Example: Error

H0: Central Chi-Square

HA: Non-Central Chi-Square with non-centrality parameter E(G)

[Figure: distributions under H0 and HA, with areas labeled $1-\alpha$ and $1-\beta$.]

Power of a Test

Definition: The statistical power of a test is defined as $1-\beta$.

The power is only defined when HA is defined, the experimental conditions (e.g. sample size) are known and the significance level has been selected.

Example: calculate sample size needed to obtain particular linkage detection power.
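As a rough illustration of how power depends on the non-centrality parameter, the sketch below handles only the 1-degree-of-freedom case, where a non-central chi-square variable is the square of a shifted standard normal. The function name `power_chi2_1df` and the per-observation non-centrality of 0.05 are assumptions made for this example, not values from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def power_chi2_1df(noncentrality, alpha=0.05):
    """Power (1 - beta) of a 1-df chi-square test.

    Assumes the test statistic is (Z + sqrt(noncentrality))**2 with Z
    standard normal, which holds only for 1 degree of freedom.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)   # reject H0 when |Z| > z
    shift = sqrt(noncentrality)             # mean of Z under HA
    return (1 - std_normal.cdf(z - shift)) + std_normal.cdf(-z - shift)


# Hypothetical per-observation non-centrality, for illustration only.
UNIT_NONCENTRALITY = 0.05

for n in (50, 100, 200):
    print(n, round(power_chi2_1df(n * UNIT_NONCENTRALITY), 3))
```

With zero non-centrality the function returns the significance level itself, which is a quick sanity check that HA collapsing onto H0 gives power equal to $\alpha$.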

Estimation

Hypothesis testing allows us to make qualitative conclusions regarding the suitability or not of a statement (H0).

Often we want to make quantitative inference, e.g. an actual estimate of the recombination fraction, not just evidence that genes are linked.

Estimation

Definition: Point estimation is the process of estimating a specific value of the parameter based on observed data X1, X2,…,Xn.

Definition: Interval estimation is the process of estimating upper and lower limits within which the unknown parameter occurs with certain probability.

Definition: The maximum likelihood estimator (MLE) is the value $\hat\theta$ which maximizes the likelihood function.

The MLE is obtained by:

• analytically solving the score equation, S = 0

• grid search

• Newton-Raphson iteration

• Expectation and Maximization (EM) algorithm

Maximum Likelihood Estimation

MLE: Grid Search

Plot likelihood L or log likelihood l vs. parameter throughout the parameter space.

Obtain MLE by visual inspection or search algorithm.

MLE: Grid Search Algorithm

1. Initially estimate $\theta_0$. Pick a step size $\Delta$.

2. At step n, evaluate L (or l) on both sides of $\theta_n$.

3. Choose $\theta_{n+1} = \theta_n + \Delta$ if L is increasing to the right, else choose $\theta_{n+1} = \theta_n - \Delta$.

4. Repeat steps 2 and 3 until the estimate no longer advances. Choose a smaller $\Delta$, and repeat steps 2, 3, and 4 until the desired accuracy is met.
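The four steps above can be sketched for a single parameter as follows. This is a minimal illustration, not the lecture's implementation: the function name `grid_search`, the tenfold step reduction, and the binomial counts (8 recombinants out of 100 gametes) are all assumptions chosen for the example.

```python
from math import log


def grid_search(loglik, theta0, step, tol=1e-6, lower=1e-9, upper=1 - 1e-9):
    """Climb the log likelihood in steps, shrinking the step when stuck."""
    theta = theta0
    while step > tol:
        moved = True
        while moved:                      # steps 2 and 3, repeated
            moved = False
            here = loglik(theta)
            right = min(theta + step, upper)
            left = max(theta - step, lower)
            if loglik(right) > here:
                theta, moved = right, True
            elif loglik(left) > here:
                theta, moved = left, True
        step /= 10                        # step 4: choose a smaller step
    return theta


# Hypothetical data: K recombinants among N gametes.
K, N = 8, 100


def binomial_loglik(theta):
    return K * log(theta) + (N - K) * log(1 - theta)


theta_hat = grid_search(binomial_loglik, theta0=0.3, step=0.1)
print(round(theta_hat, 4))
```

Because the search only ever moves uphill, it stops at whichever peak is nearest the starting value, so trying several starting values is advisable when the likelihood may have more than one peak.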

MLE: Grid Search Problems

Multiple peaks result in a failure to find the global maximum likelihood.

Solving for multiple simultaneous parameters gets computationally intensive and difficult to interpret visually when there are more than two parameters.

Example: 2-Locus with Non-Penetrant Allele

Let $\theta$ be the recombination fraction between marker A and gene of interest B.

Let $\phi$ be the probability that the allele of the gene of interest (f) fails to be detected at the phenotype level (i.e. $1-\phi$ is the penetrance).

Cross +F/+F × –f/–f. Score gametes of an F1 +F/–f individual for +/– phenotype and P/p phenotype, where P means F or non-penetrant f and p means penetrant f.

[Diagram: marker A (+ or –) linked to gene B (F or f).]

Experimental Data

P(+F gamete) = P(–f gamete) = $0.5(1-\theta)$

P(–F gamete) = P(+f gamete) = $0.5\,\theta$

P(+P gamete) = $0.5(1-\theta) + 0.5\,\theta\phi$

P(–P gamete) = $0.5\,\theta + 0.5(1-\theta)\phi$

P(+p gamete) = $0.5\,\theta(1-\phi)$

P(–p gamete) = $0.5(1-\theta)(1-\phi)$

Observe $n_{+P}$, $n_{-P}$, $n_{+p}$, $n_{-p}$.

Experimental Data: Log Likelihood

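$$L \propto \left[(1-\theta)+\theta\phi\right]^{n_{+P}} \left[\theta+(1-\theta)\phi\right]^{n_{-P}} \left[\theta(1-\phi)\right]^{n_{+p}} \left[(1-\theta)(1-\phi)\right]^{n_{-p}}$$

$$l = \log L = n_{+P}\log\left[(1-\theta)+\theta\phi\right] + n_{-P}\log\left[\theta+(1-\theta)\phi\right] + n_{+p}\log\left[\theta(1-\phi)\right] + n_{-p}\log\left[(1-\theta)(1-\phi)\right] + \text{constant}$$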

Experimental Data: Grid Search
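A two-parameter grid search for this example might look like the sketch below. It evaluates the log likelihood built from the gamete probabilities given earlier, using the counts listed on the Bootstrap Example slide. The grid resolution, the restriction of the recombination fraction to (0, 0.5), and the names `log_likelihood` and `grid_maximum` are assumptions made for illustration.

```python
from math import log

# Counts from the Bootstrap Example slide ("-" stands for the minus marker).
COUNTS = {"+P": 168, "+p": 3, "-P": 52, "-p": 163}


def log_likelihood(theta, phi, n=COUNTS):
    """Log likelihood of recombination fraction theta and non-penetrance phi."""
    return (n["+P"] * log(0.5 * (1 - theta) + 0.5 * theta * phi)
            + n["-P"] * log(0.5 * theta + 0.5 * (1 - theta) * phi)
            + n["+p"] * log(0.5 * theta * (1 - phi))
            + n["-p"] * log(0.5 * (1 - theta) * (1 - phi)))


def grid_maximum(points=200):
    """Evaluate the log likelihood on a regular grid and return the best point."""
    best = None
    for i in range(1, points):
        theta = 0.5 * i / points          # theta in (0, 0.5)
        for j in range(1, points):
            phi = j / points              # phi in (0, 1)
            value = log_likelihood(theta, phi)
            if best is None or value > best[0]:
                best = (value, theta, phi)
    return best


max_loglik, theta_hat, phi_hat = grid_maximum()
print(theta_hat, phi_hat, round(max_loglik, 2))
```

The cost grows as the number of grid points raised to the number of parameters, which is why the approach becomes impractical beyond two or three parameters.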

Newton-Raphson: One Parameter

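Let $S(\theta)$ be the score. Then the MLE is obtained through the equation

$$S(\hat\theta) = 0$$

Taylor expansion of $S(\theta)$ for $\theta_m$ near the MLE gives

$$S(\hat\theta) \approx S(\theta_m) + (\hat\theta - \theta_m)\,\frac{dS(\theta_m)}{d\theta} = 0$$

$$\hat\theta \approx \theta_m - \frac{S(\theta_m)}{dS(\theta_m)/d\theta}$$

$$\theta_{m+1} = \theta_m - \frac{S(\theta_m)}{dS(\theta_m)/d\theta}$$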
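The one-parameter update can be sketched as below for a binomial likelihood, where the score and its derivative have simple closed forms. The counts (8 recombinants among 100 gametes), the starting value, and the clamping bounds are hypothetical choices for illustration; the clamp is one simple way to keep new estimates inside the parameter space.

```python
def newton_raphson(score, score_derivative, theta0, tol=1e-10, max_iter=100,
                   lower=1e-6, upper=1 - 1e-6):
    """Iterate theta <- theta - S(theta) / S'(theta), keeping theta in bounds."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_derivative(theta)
        new_theta = min(max(theta - step, lower), upper)  # bound new estimate
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


# Hypothetical data: K recombinants among N gametes.
K, N = 8, 100


def score(theta):
    """First derivative of K*log(theta) + (N-K)*log(1-theta)."""
    return K / theta - (N - K) / (1 - theta)


def score_derivative(theta):
    """Second derivative of the binomial log likelihood."""
    return -K / theta ** 2 - (N - K) / (1 - theta) ** 2


theta_hat = newton_raphson(score, score_derivative, theta0=0.05)
print(theta_hat)
```

Starting from 0.2 instead of 0.05, the first unclamped step would land below zero for these counts, which is the situation the bounds are meant to handle.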

Newton-Raphson: Analysis

NR fits a parabola to the likelihood function at the point of the current parameter estimate. Obtain a new parameter estimate at the maximum of the parabola.

NR may fail when there are multiple peaks.

NR may fail when the information is zero (when the estimate is at the extremes).

Recommendations: Use multiple starting values. Bound new estimates.

Newton-Raphson: Multiple Parameters

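$$\theta_{m+1} = \theta_m + \frac{1}{N}\, I^{-1}(\theta_m)\, S(\theta_m)$$

N is the total sample size.

$S(\theta_m)$ is the score vector evaluated at $\theta_m$.

$I^{-1}(\theta_m)$ is the inverse information matrix evaluated at $\theta_m$.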

EM Algorithm: Incomplete Data

The notion of incomplete data:

| | AB | ab | Ab | aB |
|---|---|---|---|---|
| AB | AB/AB | AB/ab | AB/Ab | AB/aB |
| ab | ab/AB | ab/ab | ab/Ab | ab/aB |
| Ab | Ab/AB | Ab/ab | Ab/Ab | Ab/aB |
| aB | aB/AB | aB/ab | aB/Ab | aB/aB |

EM Algorithm: Example of Incomplete Data

+P gamete may result from nonrecombinant +F or from recombinant, non-penetrant +f.

+p gamete can only result from penetrant, recombinant +f.

–P gamete can result from recombinant –F or from nonrecombinant, non-penetrant –f gene.

–p gamete can result only from nonrecombinant –f.

EM Algorithm

Make an initial guess $\theta_0$.

Expectation step: Pretend that $\theta_n$ for iteration n is true. Estimate the complete data. This usually requires the distribution of the complete data conditional on the observed data. For example: P(recombinant | observed phenotype).

Maximization step: Compute the maximum likelihood estimate $\theta_{n+1}$ from the completed data of step n.

Repeat E & M steps until the likelihood converges.

Example: E Step

E Step:

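$$E(\text{recombinants}) = \sum_{i=1}^{4} n_i\, P(\text{recombinant} \mid i)$$

$$E(\text{non-penetrants}) = \sum_{i=1}^{4} n_i\, P(\text{non-penetrant} \mid i)$$

where i runs over the four observed gamete classes. For example,

$$P(\text{non-penetrant} \mid {+P}) = \frac{P(\text{non-penetrant AND } {+P})}{P({+P})} = \frac{0.5\,\theta\phi}{0.5(1-\theta) + 0.5\,\theta\phi}$$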

Example: M Step

M Step:

$$\theta_{n+1} = \frac{E(\text{recombinants})}{N}$$

$$\phi_{n+1} = \frac{E(\text{non-penetrants})}{N_{\text{recessive}}}$$
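The E and M steps for the two-locus example could be coded roughly as follows, using the counts from the Bootstrap Example slide. Two points are assumptions of this sketch rather than statements from the slides: the recessive count in the M step is taken to be the expected number of gametes carrying f (expected non-penetrants plus the observed +p and –p gametes), and convergence is judged by the change in the parameter estimates.

```python
# Counts from the Bootstrap Example slide ("-" stands for the minus marker).
COUNTS = {"+P": 168, "+p": 3, "-P": 52, "-p": 163}


def em_step(theta, phi, n=COUNTS):
    """One E step followed by one M step."""
    # E step: conditional probabilities given the observed class.
    p_plus_P = (1 - theta) + theta * phi       # proportional to P(+P)
    p_minus_P = theta + (1 - theta) * phi      # proportional to P(-P)
    recombinants = (n["+P"] * theta * phi / p_plus_P
                    + n["-P"] * theta / p_minus_P
                    + n["+p"])                 # every +p is recombinant, no -p is
    non_penetrants = (n["+P"] * theta * phi / p_plus_P
                      + n["-P"] * (1 - theta) * phi / p_minus_P)
    # M step: complete-data maximum likelihood estimates.
    total = sum(n.values())
    # Assumption: N_recessive is the expected number of f-carrying gametes.
    n_recessive = non_penetrants + n["+p"] + n["-p"]
    return recombinants / total, non_penetrants / n_recessive


def em(theta=0.25, phi=0.5, tol=1e-10, max_iter=10000):
    """Repeat E and M steps until the estimates stop changing."""
    for _ in range(max_iter):
        new_theta, new_phi = em_step(theta, phi)
        if abs(new_theta - theta) < tol and abs(new_phi - phi) < tol:
            return new_theta, new_phi
        theta, phi = new_theta, new_phi
    return theta, phi


theta_hat, phi_hat = em()
print(round(theta_hat, 4), round(phi_hat, 4))
```

Each iteration needs only the four counts and a few divisions, so EM is cheap here even though it typically takes more iterations than Newton-Raphson.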

Moment Estimation

Obtain equations for the population moments in terms of the parameters to estimate.

Substitute the sample moments and solve for the parameters.

For example: binomial distribution

$$m_1 = np$$

$$\hat p = \frac{m_1}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Example: Moment Estimation

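The moment generating function of the multinomial counts is

$$\mathrm{mgf}(t_{+P}, t_{-P}, t_{+p}, t_{-p}) = \left(p_{+P}e^{t_{+P}} + p_{-P}e^{t_{-P}} + p_{+p}e^{t_{+p}} + p_{-p}e^{t_{-p}}\right)^{n}$$

First moments:

$$m_{+P} = np_{+P}, \quad m_{-P} = np_{-P}, \quad m_{+p} = np_{+p}, \quad m_{-p} = np_{-p}$$

Substituting the observed frequencies f:

$$\hat p_{+P} = \frac{f_{+P}}{n}, \quad \hat p_{-P} = \frac{f_{-P}}{n}, \quad \hat p_{+p} = \frac{f_{+p}}{n}, \quad \hat p_{-p} = \frac{f_{-p}}{n}$$

Solve for $\hat\theta$ and $\hat\phi$ from:

$$\hat p_{+P} = 0.5\left[(1-\hat\theta) + \hat\theta\hat\phi\right]$$

$$\hat p_{-P} = 0.5\left[\hat\theta + (1-\hat\theta)\hat\phi\right]$$

$$\hat p_{+p} = 0.5\,\hat\theta(1-\hat\phi)$$

$$\hat p_{-p} = 0.5\,(1-\hat\theta)(1-\hat\phi)$$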
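With four moment equations and only two unknowns, the answer depends on which equations are solved. The sketch below, using the counts from the Bootstrap Example slide, solves once from the +p and –p equations and once from the +P and –P equations. Both function names and the choice of equation pairs are assumptions of this illustration.

```python
# Counts from the Bootstrap Example slide ("-" stands for the minus marker).
COUNTS = {"+P": 168, "+p": 3, "-P": 52, "-p": 163}
n = sum(COUNTS.values())
p_hat = {cls: count / n for cls, count in COUNTS.items()}


def moments_from_p_classes(p):
    """Solve p(+p) = 0.5*theta*(1-phi) and p(-p) = 0.5*(1-theta)*(1-phi)."""
    total = p["+p"] + p["-p"]          # equals 0.5 * (1 - phi)
    theta = p["+p"] / total
    phi = 1 - 2 * total
    return theta, phi


def moments_from_P_classes(p):
    """Solve the +P and -P equations instead."""
    phi = 2 * (p["+P"] + p["-P"]) - 1  # since p(+P) + p(-P) = 0.5 * (1 + phi)
    theta = (1 - 2 * p["+P"]) / (1 - phi)
    return theta, phi


theta_a, phi_a = moments_from_p_classes(p_hat)
theta_b, phi_b = moments_from_P_classes(p_hat)
print(round(theta_a, 4), round(phi_a, 4))
print(round(theta_b, 4), round(phi_b, 4))
```

The two pairs of equations agree on the non-penetrance estimate but give quite different recombination fraction estimates for these counts, which illustrates how moment equations can fail to give a unique solution.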

Moment Estimation: Problems

Large sample properties of ML estimators are usually better than those for the corresponding moment estimators.

Sometimes solution of moments equations are not unique.

Least Squares Estimation

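$$Y = X\beta + \varepsilon$$

$$\varepsilon'\varepsilon = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$

$$\hat\beta = (X'X)^{-1}X'Y$$

$$SSE_{\text{full}} = Y'Y - \hat\beta'X'Y$$

$$SSE_{\text{reduced}} = Y'Y - \hat\beta_R'X_R'Y$$

$$F = (N-k)\left(\frac{SSE_{\text{reduced}}}{SSE_{\text{full}}} - 1\right) \sim F_{1,\,N-k}$$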
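For a concrete feel of these formulas, the sketch below fits a straight line (intercept plus slope, so k = 2) by solving the normal equations directly and compares it with the intercept-only reduced model. The five data points are hypothetical, and the function name `ols_line` is an assumption of this example.

```python
def ols_line(x, y):
    """Least squares fit of y = b0 + b1*x, with SSEs and the F statistic."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    syy = sum(b * b for b in y)                       # Y'Y
    # Normal equations X'X beta = X'Y with X = [1, x], solved by Cramer's rule.
    det = n * sxx - sx * sx
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    sse_full = syy - (intercept * sy + slope * sxy)   # Y'Y - beta' X'Y
    sse_reduced = syy - (sy / n) * sy                 # intercept-only model
    k = 2                                             # parameters in full model
    f_stat = (sse_reduced - sse_full) / (sse_full / (n - k))
    return intercept, slope, sse_full, sse_reduced, f_stat


# Hypothetical data for illustration.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope, sse_full, sse_reduced, f_stat = ols_line(x, y)
print(intercept, slope, sse_full, sse_reduced, f_stat)
```

A large F statistic here says the slope term removes far more residual variation than would be expected by chance with 1 and N - k degrees of freedom.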

Variance of an Estimator

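• Suppose k independent estimates are available for $\theta$:

$$\hat\sigma^2_{\hat\theta} = \frac{1}{k-1}\sum_{i=1}^{k}\left(\hat\theta_i - \bar\theta\right)^2, \qquad \bar\theta = \frac{1}{k}\sum_{i=1}^{k}\hat\theta_i$$

• Suppose you have a large sample, then the variance of the MLE is approximately:

$$\sigma^2_{\hat\theta} \approx \frac{1}{n\,I(\hat\theta)}$$

(the Cramér-Rao lower bound for the variance)

• Empirical estimates using resampling techniques.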

Variance: Linear Estimator

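$$\mathrm{Var}\left(c_1\hat\theta_1 + \cdots + c_k\hat\theta_k\right) = \sum_{i=1}^{k} c_i^2\,\mathrm{Var}(\hat\theta_i) + 2\sum_{i=1}^{k}\sum_{j>i}^{k} c_i c_j\,\mathrm{Cov}(\hat\theta_i, \hat\theta_j)$$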

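Variance: General Function $f(\theta_1, \theta_2, \ldots, \theta_k)$

$$\mathrm{Var}\left[f(\hat\theta_1, \ldots, \hat\theta_k)\right] \approx \sum_{i=1}^{k}\left(\frac{df}{d\theta_i}\right)^2 \mathrm{Var}(\hat\theta_i) + 2\sum_{i=1}^{k}\sum_{j>i}^{k}\frac{df}{d\theta_i}\frac{df}{d\theta_j}\,\mathrm{Cov}(\hat\theta_i, \hat\theta_j)$$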
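The approximation for a general function can be evaluated numerically when the derivatives are awkward to write down. The sketch below uses central differences for the derivatives; the function name `delta_method_variance`, the product function, and the estimates and covariance matrix are hypothetical values chosen only to show the mechanics.

```python
def delta_method_variance(f, estimates, cov, h=1e-6):
    """Approximate Var[f(estimates)] from the covariance matrix of the estimates."""
    k = len(estimates)
    grad = []
    for i in range(k):
        up = list(estimates)
        down = list(estimates)
        up[i] += h
        down[i] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))   # df/dtheta_i
    variance = sum(grad[i] ** 2 * cov[i][i] for i in range(k))
    variance += 2 * sum(grad[i] * grad[j] * cov[i][j]
                        for i in range(k) for j in range(i + 1, k))
    return variance


# Hypothetical estimates and covariance matrix.
estimates = (0.2, 0.5)
cov = [[0.01, 0.002],
       [0.002, 0.04]]


def product(a, b):
    return a * b


print(delta_method_variance(product, estimates, cov))
```

For the product of two estimates the derivatives are just the other estimate, so the result can be checked by hand against the formula.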

Bias

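The mean squared error (MSE) is defined as

$$\mathrm{MSE} = E\left[(\hat\theta - \theta)^2\right] = E\left[(\hat\theta - E\hat\theta)^2\right] + \left(E\hat\theta - \theta\right)^2 = \mathrm{Var}(\hat\theta) + \mathrm{bias}^2$$

where $\mathrm{bias} = E\hat\theta - \theta$.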

If an estimator is unbiased, the MSE and variance are the same.

Estimating Bias

Bootstrap:

$$\mathrm{Bias}_B = \frac{1}{b}\sum_{i=1}^{b}\hat\theta_i - \hat\theta$$

$\hat\theta_i$: bootstrap estimator for bootstrap trial i

$\hat\theta$: original estimate

Confidence Interval

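Because of sampling error, the estimate $\hat\theta$ is not exactly the true parameter value $\theta$. A confidence interval $(\hat\theta_L, \hat\theta_U)$ satisfies

$$P(\hat\theta_L \le \theta \le \hat\theta_U) = 1-\alpha$$

A confidence interval is symmetric if

$$P(\theta < \hat\theta_L) = P(\theta > \hat\theta_U) = \alpha/2$$

A confidence interval is non-symmetric if

$$P(\theta < \hat\theta_L) \ne P(\theta > \hat\theta_U)$$

A confidence interval is one-sided if

$$P(\theta < \hat\theta_L) = 0 \quad \text{OR} \quad P(\theta > \hat\theta_U) = 0$$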

Confidence Interval: Normal Approximation I

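Need a pivotal quantity, i.e. a quantity that depends on the data and the parameters but whose distribution does not depend on the parameters.

If the estimate $\hat\theta$ is unbiased and normally distributed with variance $\sigma^2_{\hat\theta}$, then the pivotal quantity is

$$Z = \frac{\hat\theta - \theta}{\sigma_{\hat\theta}}$$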

Confidence Interval: Normal Approximation II

The MLE is asymptotically normally distributed.

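$$P\left(-z_{1-\alpha/2} \le \frac{\hat\theta - \theta}{\sigma_{\hat\theta}} \le z_{1-\alpha/2}\right) = 1-\alpha$$

$$P\left(\hat\theta - z_{1-\alpha/2}\,\sigma_{\hat\theta} \le \theta \le \hat\theta + z_{1-\alpha/2}\,\sigma_{\hat\theta}\right) = 1-\alpha$$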

Confidence Interval: Nonparametric Approximation

$$CDF(x) = P(\hat\theta_b \le x)$$

$$\left(CDF^{-1}(\alpha/2),\; CDF^{-1}(1-\alpha/2)\right)$$

This is the percentile method.

Bootstrap Example

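Draw a multinomial sample with the observed proportions to generate a bootstrap dataset, and estimate the parameters $\theta$ and $\phi$ from it. Repeat b times to obtain bootstrap estimates $\hat\theta_b$ and $\hat\phi_b$.

| | +P | +p | –P | –p |
|---|---|---|---|---|
| Count | 168 | 3 | 52 | 163 |
| Proportion | 0.44 | 0.01 | 0.13 | 0.42 |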
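A bootstrap run over the counts in the table might be sketched as follows. To keep the example short, each bootstrap dataset is summarized with a simple closed-form moment estimate based on the +p and –p classes instead of a full maximum likelihood fit; that choice, the seed, the number of replicates, and the function names are assumptions of this illustration.

```python
import random

# Observed counts ("-" stands for the minus marker).
COUNTS = {"+P": 168, "+p": 3, "-P": 52, "-p": 163}


def estimate(counts):
    """Moment estimates (theta, phi) from the +p and -p classes (an assumption)."""
    n = sum(counts.values())
    recessive = counts["+p"] + counts["-p"]
    return counts["+p"] / recessive, 1 - 2 * recessive / n


def bootstrap(counts, b=2000, alpha=0.05, seed=1):
    """Bootstrap bias and percentile interval for the recombination fraction."""
    rng = random.Random(seed)
    classes = list(counts)
    weights = [counts[c] for c in classes]
    n = sum(weights)
    thetas = []
    for _ in range(b):
        draw = rng.choices(classes, weights=weights, k=n)   # multinomial sample
        resampled = {c: draw.count(c) for c in classes}
        thetas.append(estimate(resampled)[0])
    thetas.sort()
    theta_hat = estimate(counts)[0]
    bias = sum(thetas) / b - theta_hat
    lower = thetas[int(0.5 * alpha * b)]                    # percentile method
    upper = thetas[int((1 - 0.5 * alpha) * b) - 1]
    return theta_hat, bias, (lower, upper)


theta_hat, bias, interval = bootstrap(COUNTS)
print(theta_hat, bias, interval)
```

Because only three +p gametes were observed, the bootstrap distribution of the recombination fraction is coarse and skewed, which is a case where the percentile interval can differ noticeably from a normal-approximation interval.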

Confidence Interval: Likelihood Approach

Let $L_{\max}$ be the maximum likelihood for a given model. Find the parameter values $\theta_L$ and $\theta_U$ such that

$$\log L_{\max} - \log L(\theta) = 2$$

Then $(\theta_L, \theta_U)$ serves as a confidence interval.

LOD Score Support

The LOD score support for a confidence interval is

$$\log_{10} L_{\max} - \log_{10} L$$

where L is the likelihood at the limit values of the parameter.

In practice, you plot the LOD score support for various values of the parameter and choose the upper and lower bounds such that the LOD score support is 1.
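The procedure of scanning parameter values and keeping those with LOD score support of at most 1 can be sketched as below for a one-parameter binomial likelihood. The counts (8 recombinants among 100 gametes), the grid resolution, and the function names are hypothetical choices for illustration.

```python
from math import log10

# Hypothetical data: K recombinants among N gametes.
K, N = 8, 100


def log10_likelihood(theta):
    return K * log10(theta) + (N - K) * log10(1 - theta)


def lod_support_interval(drop=1.0, points=10000):
    """Smallest and largest grid values whose LOD score support is <= drop."""
    max_ll = log10_likelihood(K / N)              # likelihood at the MLE
    grid = [i / points for i in range(1, points)]
    inside = [t for t in grid if max_ll - log10_likelihood(t) <= drop]
    return min(inside), max(inside)


lower, upper = lod_support_interval()
print(lower, upper)
```

The interval is not symmetric about the estimate because the binomial likelihood falls off faster toward zero than toward larger values when the estimate is small.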

Choosing Good Confidence Intervals

The actual coverage probability should be close to the confidence coefficient.

Should be biologically relevant. For example, the following interval for a recombination fraction is not, since a recombination fraction cannot exceed 0.5:

(0.1, 0.6)

Good Estimator: Consistent

An estimator is mean squared error consistent if the MSE approaches zero as the sample size approaches infinity.

An estimator is simple consistent if, for every $\epsilon > 0$,

$$\lim_{n\to\infty} P\left(|\hat\theta - \theta| < \epsilon\right) = 1$$

Good Estimator: Unbiased

An unbiased estimator is usually better than a biased one, but this may not always be true. If the variance is larger, what have we gained?

There are bootstrap techniques for obtaining a bias-corrected estimate. These are computationally more intensive than bootstrap, but sometimes worth it.

Good Estimator: Asymptotically Normal

If the pivotal quantity

$$\frac{\hat\theta - \theta}{\sigma_{\hat\theta}}$$

is normal with mean 0 and variance 1 as the sample size goes to infinity, this can be a very convenient property of the estimator.

Good Estimator: Confidence Interval

A good estimator should have a good way to obtain a confidence interval. MLEs are good in this way if the sample size is large enough.

Sample Size for Power

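[Figure: distributions under H0 and HA, with areas labeled $1-\alpha$ and $1-\beta$.]

Need $E(G) > \chi^2_{1-\beta,\,\alpha,\,df}$, the non-centrality required for power $1-\beta$ at significance level $\alpha$ with $df$ degrees of freedom.

$$E(G) = n\,E(G_{\text{unit}})$$

$$n > \frac{\chi^2_{1-\beta,\,\alpha,\,df}}{E(G_{\text{unit}})}$$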
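A sample size calculation along these lines can be sketched for the 1-degree-of-freedom case, where the power of the non-central chi-square test can be written with the standard normal distribution. The bisection search, the function names, and the per-observation value of 0.05 for E(G_unit) are assumptions of this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_chi2_1df(noncentrality, alpha):
    """Power of a 1-df chi-square test (valid for 1 degree of freedom only)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = sqrt(noncentrality)
    return (1 - std_normal.cdf(z - shift)) + std_normal.cdf(-z - shift)


def required_noncentrality(power, alpha=0.05):
    """Find by bisection the non-centrality that gives the target power."""
    low, high = 0.0, 1000.0
    for _ in range(200):
        mid = (low + high) / 2
        if power_chi2_1df(mid, alpha) < power:
            low = mid
        else:
            high = mid
    return high


def sample_size(power, unit_noncentrality, alpha=0.05):
    """Smallest n with n * E(G_unit) at least the required non-centrality."""
    return ceil(required_noncentrality(power, alpha) / unit_noncentrality)


# Hypothetical E(G_unit) of 0.05 per observation.
print(required_noncentrality(0.8), sample_size(0.8, 0.05))
```

Since the required non-centrality is fixed by the power, significance level, and degrees of freedom, the sample size scales inversely with the information each observation contributes.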

Sample Size for Target Confidence Interval

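Confidence interval by normal approximation is

$$\hat\theta \pm z_{1-\alpha/2}\,\sigma_{\hat\theta}$$

The bigger the range $2\,z_{1-\alpha/2}\,\sigma_{\hat\theta}$, the less precise the confidence interval.

Suppose we wish to have

$$2\,z_{1-\alpha/2}\,\sigma_{\hat\theta} \le d$$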

Sample Size for Target Confidence Interval II

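Then,

$$2\,z_{1-\alpha/2}\sqrt{\frac{1}{n\,I(\theta)}} \le d$$

$$n \ge \frac{1}{I(\theta)}\left(\frac{2\,z_{1-\alpha/2}}{d}\right)^2$$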
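As a small worked case of the sample size bound, the sketch below uses a binomial parameter, for which the information per observation is 1 / (θ(1 - θ)). The planning value θ = 0.1, the target width d = 0.1, and the function name are hypothetical choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_width(information, d, alpha=0.05):
    """Smallest n with 2 * z * sqrt(1 / (n * I)) <= d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((2 * z / d) ** 2 / information)


# Hypothetical planning value for a binomial parameter.
theta = 0.1
information = 1 / (theta * (1 - theta))   # per-observation information
n_required = sample_size_for_width(information, d=0.1)
print(n_required)
```

Halving the target width d multiplies the required sample size by four, since n grows with the square of 1/d.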

Summary

• Distributions

• Likelihood and Maximum Likelihood Estimation

• Hypothesis Tests

• Confidence Intervals

• Comparison of estimators

• Sample size calculations
