Lecture 2: Parameter Estimation and Evaluation of Support

Parameter Estimation

“The problem of estimation is of more central importance [than hypothesis testing]... for in almost all situations we know that the effect whose significance we are measuring is perfectly real, however small; what is at issue is its magnitude.” (Edwards, 1992, pg. 2)

“An insignificant result, far from telling us that the effect is non-existent, merely warns us that the sample was not large enough to reveal it.” (Edwards, 1992, pg. 2)

Parameter Estimation

Finding Maximum Likelihood Estimates (MLEs)

- Local optimization (optim)
  » Gradient methods
  » Simplex (Nelder-Mead)

- Global optimization
  » Simulated Annealing (anneal)
  » Genetic Algorithms (rgenoud)

Evaluating the strength of evidence (“support”) for different parameter estimates

- Support Intervals
  » Asymptotic Support Intervals
  » Simultaneous Support Intervals

- The shape of likelihood surfaces around MLEs

Parameter estimation: finding peaks on likelihood “surfaces”...

The way likelihood varies across sets of parameter values defines a likelihood “surface”...

[Figure: log-likelihood (y-axis, -155 to -147) plotted against the parameter estimate (x-axis, 2 to 2.8)]

The goal of parameter estimation is to find the peak of the likelihood surface... (optimization)

Local vs Global Optimization

“Fast” local optimization methods

- Large family of methods, widely used for nonlinear regression in commercial software packages

“Brute force” global optimization methods

- Grid search

- Genetic algorithms

- Simulated annealing

[Figure: log-likelihood (y-axis, -14 to 0) plotted against parameter value (x-axis, 0 to 20), with a local optimum and a global optimum marked]

Local Optimization – Gradient Methods

Derivative-based (Newton-Raphson) methods:

Likelihood surface:

$l(\theta \mid y) = p(y \mid \theta)$

$\frac{dl}{d\theta} = 0$

General approach: Vary the parameter estimate systematically and search for the point where the first derivative (slope) of the likelihood function is zero... (using numerical methods to estimate the derivative, and checking the second derivative to make sure it is a maximum, not a minimum)
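The general approach above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular package: the Poisson model, the data in `counts`, and the names `log_likelihood` and `newton_raphson_max` are all assumptions made for the example. Derivatives are estimated by central finite differences, and the sign of the second derivative is checked at the end.

```python
import math

def log_likelihood(lam, counts):
    """Poisson log-likelihood for rate lam, dropping the constant log(y!) terms."""
    return sum(y * math.log(lam) - lam for y in counts)

def newton_raphson_max(f, x0, h=1e-4, tol=1e-8, max_iter=100):
    """Climb to a maximum of f using numerically estimated derivatives."""
    x = x0
    for _ in range(max_iter):
        slope = (f(x + h) - f(x - h)) / (2 * h)                # first derivative
        curvature = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2  # second derivative
        step = slope / curvature
        x -= step                                              # Newton-Raphson update
        if abs(step) < tol:
            break
    if curvature >= 0:
        raise ValueError("zero slope found, but it is not a maximum")
    return x

counts = [2, 4, 3, 5, 1, 3]  # hypothetical data
mle = newton_raphson_max(lambda lam: log_likelihood(lam, counts), x0=2.0)
print(round(mle, 4))  # the analytical MLE for this model is the sample mean, 3.0
```

Like any local method, the result depends on the starting value `x0`; on a surface with several peaks it climbs to whichever one is nearest.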

Local Optimization – No Gradient

The Simplex (Nelder Mead) method

- Much simpler to program

- Does not require calculation or estimation of a derivative

- No general theoretical proof that it works (but lots of happy practitioners…)

Global Optimization

“Virtually nothing is known about finding global extrema in general.”

“There are tantalizing hints that so-called “annealing methods” may lead to important progress on global (optimization)...”

Quote from Press et al. (1986) Numerical Recipes

Global Optimization – Grid Searches

Simplest form of optimization (and rarely used in practice)

- Systematically search parameter space at a grid of points

Can be useful for visualization of the broad features of a likelihood surface
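A grid search is simple enough to write out in full. The sketch below assumes a normal model with two parameters (mean and standard deviation) and a small made-up dataset; the function names and grid spacing are illustrative choices only.

```python
import math

def log_likelihood(mu, sd, data):
    """Normal log-likelihood of the data for mean mu and standard deviation sd."""
    return sum(-math.log(sd) - 0.5 * math.log(2 * math.pi)
               - (x - mu) ** 2 / (2 * sd ** 2) for x in data)

def grid_search(data, mu_grid, sd_grid):
    """Evaluate every grid point and return (best log-likelihood, mu, sd)."""
    best = None
    for mu in mu_grid:
        for sd in sd_grid:
            ll = log_likelihood(mu, sd, data)
            if best is None or ll > best[0]:
                best = (ll, mu, sd)
    return best

data = [2.1, 2.5, 2.3, 2.8, 2.4]                 # hypothetical observations
mu_grid = [2.0 + 0.01 * i for i in range(81)]    # 2.00, 2.01, ..., 2.80
sd_grid = [0.05 + 0.01 * i for i in range(96)]   # 0.05, 0.06, ..., 1.00
best_ll, best_mu, best_sd = grid_search(data, mu_grid, sd_grid)
print(best_mu, best_sd, best_ll)
```

The number of evaluations is the product of the grid sizes (here 81 * 96), which is why the approach becomes impractical as parameters are added.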

Global Optimization – Genetic Algorithms

Based on a fairly literal analogy with evolution

- Start with a reasonably large “population” of parameter sets

- Calculate the “fitness” (likelihood) of each individual set of parameters

- Create the next generation of parameter sets based on the fitness of the “parents”, and various rules for recombination of subsets of parameters (genes)

- Let the population evolve until fitness reaches a maximum asymptote
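The steps above might look like the following in Python. This is only a sketch of the idea, not the algorithm in rgenoud or any other package: the two-parameter `fitness` surface is a stand-in for a real log-likelihood, and the selection rule (the fitter half survives), the recombination rule (each gene copied from a random parent), and the Gaussian mutation are all assumptions chosen for brevity.

```python
import random

def fitness(params):
    """Stand-in log-likelihood surface with its peak at (2.4, 0.5)."""
    a, b = params
    return -(a - 2.4) ** 2 - (b - 0.5) ** 2

def evolve(fitness, lb, ub, pop_size=40, generations=60, mut_sd=0.1, seed=1):
    rng = random.Random(seed)
    n = len(lb)
    # Start with a population of random parameter sets inside the bounds
    pop = [[rng.uniform(lb[i], ub[i]) for i in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # fittest first
        history.append(fitness(pop[0]))
        parents = pop[:pop_size // 2]                # the fitter half survives and breeds
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]        # recombination
            child = [min(max(g + rng.gauss(0, mut_sd), lb[i]), ub[i])   # mutation, kept in bounds
                     for i, g in enumerate(child)]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness), history

best, history = evolve(fitness, lb=[0.0, 0.0], ub=[5.0, 5.0])
print(best, history[0], history[-1])
```

Because the parents are carried over unchanged, the best fitness in `history` can never decrease from one generation to the next; plotting it is one way to judge whether the population has reached its asymptote.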

Global optimization - Simulated Annealing

Analogy with the physical process of annealing:

- Start the process at a high “temperature”

- Gradually reduce the temperature according to an annealing schedule

Always accept uphill moves (i.e. an increase in likelihood)

Accept downhill moves according to the Metropolis algorithm:

$p = e^{-lh/t}$

p = probability of accepting a downhill move

lh = magnitude of change in likelihood

t = temperature
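As a small illustration of the acceptance rule, the Python sketch below computes the probability of accepting a downhill move and applies the rule to a proposed step. The function names are hypothetical; the formula is the one used in the annealing code later in these notes, exp((fp - f)/t) for fp < f.

```python
import math
import random

def accept_probability(drop, t):
    """Probability of accepting a downhill move of size drop at temperature t."""
    return math.exp(-drop / t)

def accept_move(f_current, f_proposed, t, rng=random):
    """Always accept uphill moves; accept downhill moves with probability p."""
    if f_proposed >= f_current:
        return True
    return rng.random() < accept_probability(f_current - f_proposed, t)

# The same 2-unit drop in likelihood at the four temperatures t = 5, 3, 1, 0.5
for t in (5, 3, 1, 0.5):
    print(t, round(accept_probability(2.0, t), 3))
```

At high temperature almost any downhill step is accepted, so the search wanders widely; as t falls, the search behaves more and more like a strict hill climber.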

Effect of temperature (t)

[Figure: probability of accepting a downhill move (y-axis, 0 to 1) plotted against the drop in likelihood (x-axis, 0 to 10), with curves for t = 5, t = 3, t = 1 and t = 0.5, where $p = e^{-lh/t}$]

Simulated Annealing in practice...

REFERENCES:

Goffe, W. L., G. D. Ferrier, and J. Rogers. 1994. Global optimization of statistical functions with simulated annealing. Journal of Econometrics 60:65-99.

Corana et al. 1987. Minimizing multimodal functions of continuous variables with the simulated annealing algorithm. ACM Transactions on Mathematical Software 13:262-280.

A version with automatic adjustment of range...

[Diagram labels: lower bound, current value, upper bound; search range (step size)]

Constraints – setting limits for the search...

Biological limits

- Values that make no sense biologically (be careful...)

Algebraic limits

- Values for which the model is undefined (i.e. dividing by zero...)

Bottom line: global optimization methods let you cast your net widely, at the cost of computer time...

Simulated Annealing - Initialization

Set

- Annealing schedule
  » Initial temperature (t) (3.0)
  » Rate of reduction in temperature (rt) (0.95)
  » Interval between drops in temperature (nt) (100)
  » Interval between changes in range (ns) (20)

- Parameter values
  » Initial values (x)
  » Upper and lower bounds (lb,ub)
  » Initial range (vm)

Typical values are shown in the final parentheses of each annealing schedule setting...

Begin {a single iteration}

{copy the current parameter array (x) to a temporary holder (xp) for this iteration}

xp := x;

{choose a new value for the parameter in use (puse)}

xp[puse] := x[puse] + ((random*2 - 1)*vm[puse]);

{check if the new value is out of bounds }

if xp[puse] < lb[puse] then xp[puse] := x[puse] - (random * (x[puse]-lb[puse]));

if xp[puse] > ub[puse] then xp[puse] := x[puse] + (random * (ub[puse]-x[puse]));

Simulated Annealing – Step 1

Pick a new set of parameter values (by varying just 1 parameter)

vm is the range

lb is the lower bound

ub is the upper bound


{call the likelihood function with the new set of parameter values}
likeli(xp,fp); {fp = new likelihood}

{accept the new values if likelihood increases or at least stays the same}
if (fp >= f) then
begin
  x := xp;
  f := fp;
  nacp[puse] := nacp[puse] + 1;
  if (fp > fopt) then {if this is a new maximum, update the maximum likelihood}
  begin
    xopt := xp;
    fopt := fp;
    opteval := eval;
    BestFit; {update display of maximum r}
  end;
end

Accept the step if it leads uphill...


else {use Metropolis criteria to determine whether to accept a downhill move}
begin
  try
    {fp < f, so the code below is a shortcut for exp(-1.0*(abs(f-fp)/t))}
    p := exp((fp-f)/t); {t = current temperature}
  except
    on EUnderflow do p := 0;
  end;
  pp := random;
  if pp < p then
  begin
    x := xp;
    f := fp;
    nacp[puse] := nacp[puse] + 1;
  end;
end;

Use the Metropolis algorithm to decide whether to accept a downhill step...


{after nused * ns cycles, adjust VM so that half of evaluations are accepted}
If eval mod (nused*ns) = 0 then
begin
  for i := 0 to npmax do
    if xvary[i] then
    begin
      ratio := nacp[i]/ns;
      {C controls the adjustment of VM (range) - references suggest setting at 2.0}
      if (ratio > 0.6) then
        vm[i] := vm[i]*(1.0+c[i]*((ratio - 0.6)/0.4))
      else if ratio < 0.4 then
        vm[i] := vm[i]/(1.0+c[i]*((0.4 - ratio)/0.4));
      if vm[i] > (ub[i]-lb[i]) then
        vm[i] := ub[i] - lb[i];
    end;
  {reset nacp[i]}
  for i := 1 to npmax do
    nacp[i] := 0;
end;

Periodically adjust the range (VM) within which new steps are chosen...

ns is typically ~ 20

This part is strictly ad hoc...

Effect of C on Adjusting Range...

[Figure: fractional change in range (y-axis, 0 to 6) plotted against the fraction of steps accepted (x-axis, 0 to 1), with curves for C = 1, C = 2 and C = 4]
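The effect of C can be tabulated directly from the adjustment rule in the code: the range grows when more than 60% of steps are accepted, shrinks when fewer than 40% are accepted, and is left alone in between. The Python function below is a sketch of that rule only; the name `range_multiplier` is an assumption for illustration.

```python
def range_multiplier(ratio, c=2.0):
    """Factor applied to the range vm, given the fraction of steps accepted."""
    if ratio > 0.6:
        return 1.0 + c * (ratio - 0.6) / 0.4          # too many accepted: widen the range
    if ratio < 0.4:
        return 1.0 / (1.0 + c * (0.4 - ratio) / 0.4)  # too few accepted: narrow the range
    return 1.0                                        # between 0.4 and 0.6: no change

for c in (1.0, 2.0, 4.0):
    row = [round(range_multiplier(r / 10, c), 2) for r in range(11)]
    print("C =", c, row)
```

With C = 2, the range is at most tripled (all steps accepted) or cut to a third (no steps accepted) at each adjustment.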

Simulated Annealing Code – Final Step

{after nused * ns * nt cycles, reduce temperature t}
If eval mod (nused*ns*nt) = 0 then
begin
  t := rt * t;
  {store current maximum lhood in history list}
  lhist[eval div (nused*ns*nt)].iter := eval;
  lhist[eval div (nused*ns*nt)].lhood := fopt;
end;

Reduce the “temperature” according to the annealing schedule

rt = the factor by which the temperature is multiplied at each drop (t := rt * t), so rt = 0.95 is a 5% reduction.

I typically set nt = 100 (a very slow annealing)

NOTE: Goffe et al. restart the search at the previous MLE estimates each time the temperature drops... (I don’t)
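Putting the steps together, the whole loop can be sketched compactly in Python. This is an illustrative translation of the fragments above, not the original program. It assumes that `nused` is the number of parameters being varied, that the parameters are perturbed in turn, and that the initial range is the full width between the bounds. The normal log-likelihood and the data are made up, and `nt` is lowered from the typical 100 to 5 only so that the example cools within a small number of evaluations.

```python
import math
import random

def anneal(loglik, x, lb, ub, t=3.0, rt=0.95, nt=100, ns=20, c=2.0,
           max_eval=20000, seed=1):
    rng = random.Random(seed)
    n = len(x)                                  # parameters in use ("nused")
    x = list(x)
    vm = [ub[i] - lb[i] for i in range(n)]      # initial range (assumed: full width)
    nacp = [0] * n                              # accepted steps per parameter
    f = loglik(x)
    xopt, fopt = list(x), f
    for ev in range(1, max_eval + 1):
        puse = (ev - 1) % n                     # vary one parameter at a time
        xp = list(x)
        xp[puse] = x[puse] + (rng.random() * 2 - 1) * vm[puse]
        if xp[puse] < lb[puse]:                 # out of bounds: redraw inside the bounds
            xp[puse] = x[puse] - rng.random() * (x[puse] - lb[puse])
        if xp[puse] > ub[puse]:
            xp[puse] = x[puse] + rng.random() * (ub[puse] - x[puse])
        fp = loglik(xp)
        # Accept uphill moves always, downhill moves by the Metropolis rule
        if fp >= f or rng.random() < math.exp((fp - f) / t):
            x, f = xp, fp
            nacp[puse] += 1
            if fp > fopt:
                xopt, fopt = list(xp), fp
        if ev % (n * ns) == 0:                  # periodically adjust the range
            for i in range(n):
                ratio = nacp[i] / ns
                if ratio > 0.6:
                    vm[i] *= 1.0 + c * (ratio - 0.6) / 0.4
                elif ratio < 0.4:
                    vm[i] /= 1.0 + c * (0.4 - ratio) / 0.4
                vm[i] = min(vm[i], ub[i] - lb[i])
            nacp = [0] * n
        if ev % (n * ns * nt) == 0:             # annealing schedule
            t *= rt
    return xopt, fopt

def log_likelihood(params, data):
    """Normal log-likelihood for params = [mean, standard deviation]."""
    mu, sd = params
    return sum(-math.log(sd) - 0.5 * math.log(2 * math.pi)
               - (y - mu) ** 2 / (2 * sd ** 2) for y in data)

data = [2.1, 2.5, 2.3, 2.8, 2.4]                # hypothetical observations
start, lb, ub = [1.0, 1.0], [0.0, 0.05], [5.0, 5.0]
xopt, fopt = anneal(lambda p: log_likelihood(p, data), start, lb, ub, nt=5)
print(xopt, fopt)
```

The sketch follows the "I don't" variant: the search continues from the current position after each temperature drop rather than restarting from the best values found so far.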

How many iterations?...

[Figure: likelihood (y-axis, -11460 to -11300) plotted against iteration (x-axis, 0 to 5,000,000)]

Logistic regression of windthrow susceptibility (188 parameters): 5 million is not enough!

[Figure: maximum likelihood (y-axis, -232 to -218) plotted against iteration (x-axis, 0 to 500,000)]

Red maple leaf litterfall (6 parameters): 500,000 is way more than necessary!

What would constitute convergence?...

Optimization - Summary

No hard and fast rules for any optimization – be willing to explore alternate options.

Be wary of initial values used in local optimization when the model is at all complicated

How about a hybrid approach? Start with simulated annealing, then switch to a local optimization…

Evaluating the strength of evidence for the MLE

Now that you have an MLE, how should you evaluate it?

(Hint: think about the shape of the likelihood function, not just the MLE)

Strength of evidence for particular parameter estimates –

“Support”

Likelihood provides an objective measure of the strength of evidence for different parameter estimates...

Log-likelihood = “Support” (Edwards 1992)

[Figure: log-likelihood (y-axis, -155 to -147) plotted against the parameter estimate (x-axis, 2 to 2.8)]

Fisher’s “Score” and “Information”

“Score” (a function) = First derivative (slope) of the log-likelihood function with respect to the parameter

- So, S(θ) = 0 at the maximum likelihood estimate of θ

“Information” (a number) = -1 * Second derivative (acceleration) of the log-likelihood function, evaluated at the MLE.

- So this is a number: a measure of how steeply likelihood drops off as you move away from the MLE

- In general, the inverse of the “information” approximates the variance of the parameter estimate, so more information means a smaller variance…
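A worked example may help. For a Poisson model with rate λ and n observed counts, the score is sum(y)/λ - n and the information at the MLE is sum(y)/λ², both obtained by differentiating the log-likelihood. The Python sketch below evaluates them for a made-up dataset; the model, the data, and the function names are assumptions for illustration. The inverse of the information is used as the approximate variance of the estimate.

```python
import math

def score(lam, counts):
    """First derivative of the Poisson log-likelihood with respect to lam."""
    return sum(counts) / lam - len(counts)

def information(lam, counts):
    """-1 * second derivative of the Poisson log-likelihood, evaluated at lam."""
    return sum(counts) / lam ** 2

counts = [2, 4, 3, 5, 1, 3]            # hypothetical data
mle = sum(counts) / len(counts)        # the MLE of a Poisson rate is the sample mean
info = information(mle, counts)
variance = 1.0 / info                  # inverse of the information
print(score(mle, counts), info, math.sqrt(variance))
```

A sharper peak gives a larger information value and therefore a smaller standard error.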

Profile Likelihood

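Evaluate support (information) for a range of values of a given parameter by treating all other parameters as “nuisance” parameters and holding them at their MLEs… (Strictly, a profile likelihood re-optimizes the nuisance parameters at each value of the parameter of interest; holding them fixed at their MLEs gives a slice through the likelihood surface.)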

[Figure: axes are Parameter 1 (x-axis) and Parameter 2 (y-axis)]

Asymptotic vs. Simultaneous M-Unit Support Limits

Asymptotic Support Limits (based on Profile Likelihood):

- Hold all other parameters at their MLE values, and systematically vary the remaining parameter until the log-likelihood declines by a chosen amount (m units of support)...

[Figure: log-likelihood (y-axis, -155 to -147) plotted against the parameter estimate (x-axis, 2 to 2.8), with the maximum likelihood estimate and the 2-unit support interval marked]

What should “m” be? (2 is a good number, and is roughly analogous to a 95% CI)
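The search for these limits can be sketched as a simple outward scan from the MLE. The Python below is a minimal illustration under stated assumptions: a normal model with made-up data, a fixed step size, and hypothetical function names. All parameters other than the one indexed by `i` stay at their MLE values.

```python
import math

def log_likelihood(params, data):
    """Normal log-likelihood for params = [mean, standard deviation]."""
    mu, sd = params
    return sum(-math.log(sd) - 0.5 * math.log(2 * math.pi)
               - (y - mu) ** 2 / (2 * sd ** 2) for y in data)

def support_limits(loglik, mle, i, m=2.0, step=0.001, max_steps=100000):
    """Step parameter i away from its MLE until support has dropped by m units."""
    peak = loglik(mle)
    limits = []
    for direction in (-1, 1):                  # scan downward, then upward
        x = list(mle)
        for _ in range(max_steps):
            if peak - loglik(x) >= m:
                break
            x[i] += direction * step
        limits.append(x[i])
    return tuple(limits)

data = [2.1, 2.5, 2.3, 2.8, 2.4]               # hypothetical observations
mu_hat = sum(data) / len(data)                 # MLE of the mean
sd_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in data) / len(data))  # MLE of the sd
lower, upper = support_limits(lambda p: log_likelihood(p, data),
                              [mu_hat, sd_hat], i=0)
print(lower, mu_hat, upper)
```

The precision of the limits is set by `step`; a bisection search would be faster, but the scan mirrors the description above more directly.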

Asymptotic vs. Simultaneous M-Unit Support Limits

Simultaneous:

- Resampling method: draw a very large number of random sets of parameters and calculate the log-likelihood of each. The m-unit simultaneous support limits for parameter xi are the lowest and highest values of xi among the parameter sets whose log-likelihood is within m units of support of the maximum...

In practice, it can require an enormous number of iterations to do this if there are more than a few parameters
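A minimal Python sketch of the resampling idea follows. It assumes uniform random draws between user-supplied bounds, a normal model, and made-up data; the function names and the number of draws are illustrative. Limits found this way can only approach the true simultaneous limits from the inside, which is one reason so many draws are needed.

```python
import math
import random

def log_likelihood(params, data):
    """Normal log-likelihood for params = [mean, standard deviation]."""
    mu, sd = params
    return sum(-math.log(sd) - 0.5 * math.log(2 * math.pi)
               - (y - mu) ** 2 / (2 * sd ** 2) for y in data)

def simultaneous_limits(loglik, mle, lb, ub, m=2.0, draws=20000, seed=1):
    """Range of each parameter over random sets within m units of the peak."""
    rng = random.Random(seed)
    peak = loglik(mle)
    lo, hi = list(mle), list(mle)
    for _ in range(draws):
        x = [rng.uniform(lb[i], ub[i]) for i in range(len(mle))]
        if peak - loglik(x) <= m:              # this set is within m units of support
            lo = [min(a, b) for a, b in zip(lo, x)]
            hi = [max(a, b) for a, b in zip(hi, x)]
    return lo, hi

data = [2.1, 2.5, 2.3, 2.8, 2.4]               # hypothetical observations
mu_hat = sum(data) / len(data)
sd_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in data) / len(data))
lo, hi = simultaneous_limits(lambda p: log_likelihood(p, data),
                             [mu_hat, sd_hat], lb=[2.0, 0.1], ub=[3.0, 0.6])
print(lo, hi)
```

For the mean in this example the simultaneous limits come out wider than the limits obtained by holding the standard deviation at its MLE, because the other parameter is free to compensate.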

Asymptotic vs. Simultaneous Support Limits

[Figure: a hypothetical likelihood surface for 2 parameters (Parameter 1 on the x-axis, Parameter 2 on the y-axis), showing the contour for a 2-unit drop in support, the asymptotic 2-unit support limits for P1, and the simultaneous 2-unit support limits for P1]

Other measures of strength of evidence for different parameter estimates

Edwards (1992; Chapter 5)

- Various measures of the “shape” of the likelihood surface in the vicinity of the MLE...

How pointed is the peak?...

Bootstrap methods

Bootstrap methods can be used to estimate the variances of parameter estimates

- In simple terms:
  » Generate many replicates of the dataset by sampling with replacement (bootstraps)
  » Estimate parameters for each of the datasets
  » Use the variance of the parameter estimates as a bootstrap estimate of the variance
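The three steps can be sketched in Python as follows. The dataset is made up, and the estimator is simply the sample mean (the MLE of a normal mean) so that the example stays short; in practice the estimator would be a full likelihood fit for each replicate.

```python
import random
import statistics

def bootstrap_variance(data, estimator, replicates=2000, seed=1):
    """Variance of an estimator across datasets resampled with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        resample = [rng.choice(data) for _ in data]   # same size as the original data
        estimates.append(estimator(resample))
    return statistics.variance(estimates)

data = [2.1, 2.5, 2.3, 2.8, 2.4]                      # hypothetical observations
var_mean = bootstrap_variance(data, statistics.mean)
print(var_mean, var_mean ** 0.5)                      # variance and standard error
```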

Evaluating Support for Parameter Estimates: A Frequentist Approach

Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix

- Hessian = matrix of second partial derivatives of the log-likelihood function with respect to the parameters, evaluated at the maximum likelihood estimates

- Its negative is also called the (observed) “Information Matrix”, after Fisher

- Provides a measure of the steepness of the likelihood surface in the region of the optimum

- Can be generated in R using optim and fdHess

Example from R

The Hessian matrix returned when maximizing a log-likelihood is a numerical approximation of the matrix of second partial derivatives of the log-likelihood function, evaluated at the maximum likelihood estimates; its negative approximates Fisher's Information Matrix. Thus, it's a measure of the steepness of the drop in the likelihood surface as you move away from the MLE.

> res$hessian
           a          b       sd
a   -150.182  -2758.360   -0.201
b  -2758.360 -67984.416   -5.925
sd    -0.202     -5.926 -299.422

(sample output from an analysis that estimates two parameters and a variance term)

More from R

Now invert the negative of the Hessian matrix to get the matrix of parameter variances and covariances:

> solve(-1*res$hessian)
               a             b            sd
a   2.613229e-02 -1.060277e-03  3.370998e-06
b  -1.060277e-03  5.772835e-05 -4.278866e-07
sd  3.370998e-06 -4.278866e-07  3.339775e-03

The square roots of the diagonals of the inverted negative Hessian are the standard errors*:

> sqrt(diag(solve(-1*res$hessian)))
      a        b       sd
 0.1616 0.007597  0.05779

(*and the MLE ± 1.96 * S.E. gives an approximate 95% C.I.…)
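For readers working outside R, the same arithmetic can be checked with a few lines of Python. This sketch reuses the Hessian values printed in the sample output and inverts the negative Hessian with a small hand-written Gauss-Jordan routine; the function `invert` is a helper written for this example, not part of any library.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I]
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]            # the right-hand half is now the inverse

# Hessian from the sample output (rows and columns: a, b, sd)
hessian = [[-150.182, -2758.360, -0.201],
           [-2758.360, -67984.416, -5.925],
           [-0.202, -5.926, -299.422]]
neg_hessian = [[-v for v in row] for row in hessian]
vcov = invert(neg_hessian)                     # variance-covariance matrix
std_errors = [math.sqrt(vcov[i][i]) for i in range(3)]
print(std_errors)
```

The results agree with the R output above to within the rounding of the printed Hessian.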
