alison bowling maximum likelihood. general linear model

ALISON BOWLING

MAXIMUM LIKELIHOOD

GENERAL LINEAR MODEL

• εᵢ ~ i.i.d. N(0, σ²)
• Residuals are
  • Independent and identically distributed
  • Normally distributed
  • Mean 0, variance σ²

• What to do when the normality assumption does not hold?

• We can fit an alternative distribution
  • This requires Maximum Likelihood methods.

ALTERNATIVE DISTRIBUTIONS

• Binomial (proportions)
  • P (event occurring), 1 - P (event not occurring)

• Poisson (count data)

MAXIMUM LIKELIHOOD

• Myung, J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90–100.
• Standard approach to parameter estimation and inference in statistics
• Many of the inference methods in statistics are based on MLE.
  • Chi-square test
  • Bayesian methods
  • Modelling of random effects

PROBABILITY DISTRIBUTIONS

• Imagine a biased coin, with probability of heads w = 0.7, that is tossed 10 times.
• The following probability distribution can be computed using the binomial probability formula.

[Figure: binomial probability distribution for 10 tosses with w = 0.7. Horizontal axis: Number of Heads (0 to 10). Vertical axis: Probability of result (f(y)), from 0 to 0.3.]

This is a probability distribution.
• It gives the probability of obtaining each particular outcome for 10 tosses of a coin with w = .7
• 7 heads is more likely to occur than any other outcome
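The distribution described above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is illustrative and not part of the original slides.

```python
from math import comb

def binomial_pmf(y, n, w):
    """Probability of y heads in n tosses when P(heads) = w."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

n, w = 10, 0.7
probs = [binomial_pmf(y, n, w) for y in range(n + 1)]

# Print the probability of each possible number of heads.
for y, p in enumerate(probs):
    print(f"{y:2d} heads: {p:.4f}")

# The outcome with the highest probability.
most_likely = max(range(n + 1), key=lambda y: probs[y])
print("Most likely number of heads:", most_likely)
```

The probabilities sum to 1, and the largest one belongs to 7 heads.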

LIKELIHOOD FUNCTION

• Suppose we don't know w, but have tossed the coin 10 times and obtained y = 7 heads.
• What is the most likely value of w?
• This may be obtained from the likelihood function.
  • This is a function of the parameter, w, given the data, y.
• The most likely value of w is at the peak of this function.
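As a rough illustration, the likelihood function for y = 7 heads in 10 tosses can be evaluated over a grid of candidate values of w and its peak located. The grid spacing of 0.01 is an arbitrary choice made for this sketch.

```python
from math import comb

def likelihood(w, y=7, n=10):
    """Likelihood of the parameter w given y heads in n tosses."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

# Candidate values of w: 0.01, 0.02, ..., 0.99
grid = [i / 100 for i in range(1, 100)]

# The value of w at which the likelihood is highest.
best_w = max(grid, key=likelihood)
print("Most likely value of w:", best_w)
```

Here the data (y) are held fixed and w varies, which is the reverse of the probability distribution shown earlier.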

MAXIMUM LIKELIHOOD ESTIMATION

• We are interested in finding the probability distribution that underlies the data that have been collected.
• We are consequently interested in finding the parameter value(s) that correspond to the desired probability distribution.
• The maximum likelihood estimate (MLE) is the maximum (peak) of the likelihood function
  • This may be obtained by setting the first derivative of the likelihood function to zero and solving for the parameter.
  • To make sure this is a peak (and not a valley), the second derivative is also checked (it must be negative at the solution).
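For the coin example (y = 7 heads in n = 10 tosses), the calculus can be sketched as follows. The derivation works with the log of the likelihood, which peaks at the same value of w as the likelihood itself; the notation is an assumption made for this sketch rather than the slides' own.

```latex
\begin{align*}
L(w \mid y) &= \binom{n}{y}\, w^{y} (1-w)^{n-y} \\
\ln L(w \mid y) &= \ln\binom{n}{y} + y \ln w + (n-y)\ln(1-w) \\
\frac{d \ln L}{dw} &= \frac{y}{w} - \frac{n-y}{1-w} = 0
  \quad\Longrightarrow\quad \hat{w} = \frac{y}{n} = \frac{7}{10} = 0.7 \\
\frac{d^{2} \ln L}{dw^{2}} &= -\frac{y}{w^{2}} - \frac{n-y}{(1-w)^{2}} < 0
\end{align*}
```

The second derivative is negative for every w between 0 and 1, so the solution is a peak.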

ITERATIVE METHOD

• For very simple scenarios, the maximum can be obtained using calculus, as in the example.
• This is usually not possible, especially when the model involves many parameters.
• Instead, the maximum is found by an iterative series of trial-and-error steps.
  • Start with a value of the parameter, w, and compute the likelihood of the observed data given this value.
  • Then try another value, and see if the likelihood is higher.
  • If so, keep going
  • Stop when the maximum is found (the solution converges).
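The trial-and-error idea can be sketched for the coin example as below. This is a deliberately naive search written for illustration, not the algorithm any particular package uses; the starting value, step size and tolerance are arbitrary assumptions.

```python
def likelihood(w, y=7, n=10):
    """Binomial likelihood of w (constant factor omitted) for y heads in n tosses."""
    return w**y * (1 - w)**(n - y)

def trial_and_error(start=0.5, step=0.1, tol=1e-6):
    w = start
    while step > tol:
        # Try a value on each side of the current guess, staying inside (0, 1).
        candidates = [c for c in (w - step, w + step) if 0 < c < 1]
        best = max(candidates, key=likelihood)
        if likelihood(best) > likelihood(w):
            w = best      # likelihood is higher: keep going
        else:
            step /= 2     # no improvement: search more finely
    return w              # step is now tiny: treat as converged

w_hat = trial_and_error()
print(f"Estimated w: {w_hat:.4f}")
```

The search settles close to w = 0.7, the same answer the calculus gives.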

MLE ALGORITHMS

• Different algorithms are used to obtain the result
  • EM: expectation-maximisation algorithm
  • Newton-Raphson
  • Fisher scoring
• SPSS uses both the Newton-Raphson and the Fisher scoring methods.
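To give a feel for one of these algorithms, here is a minimal Newton-Raphson sketch for the single-parameter coin example (y = 7 heads in n = 10 tosses). It is an illustration only and does not reproduce how SPSS implements the method; the starting value and tolerance are assumptions.

```python
def score(w, y=7, n=10):
    """First derivative of the binomial log likelihood with respect to w."""
    return y / w - (n - y) / (1 - w)

def hessian(w, y=7, n=10):
    """Second derivative of the binomial log likelihood with respect to w."""
    return -y / w**2 - (n - y) / (1 - w)**2

def newton_raphson(w=0.4, tol=1e-10, max_iter=50):
    for iteration in range(1, max_iter + 1):
        step = score(w) / hessian(w)
        w = w - step              # Newton-Raphson update
        if abs(step) < tol:       # converged: the update no longer changes w
            return w, iteration
    raise RuntimeError("Newton-Raphson did not converge")

w_hat, n_iter = newton_raphson()
print(f"w = {w_hat:.6f} after {n_iter} iterations")
```

Each update uses both the slope and the curvature of the log likelihood, so far fewer steps are needed than with simple trial and error.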

LOG LIKELIHOOD

• The computation of the likelihood involves multiplying the probabilities for each individual outcome
  • This can be computationally intensive.
• For this reason, the log of the likelihood is computed instead.
  • Instead of multiplying the probabilities, their logs are added.
  • Log (A × B) = Log A + Log B
• We maximise the log of the likelihood rather than the likelihood itself, for computational convenience.
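The equivalence between multiplying probabilities and adding their logs can be checked directly. The sequence of tosses below is hypothetical (any ordering with 7 heads and 3 tails would do), and the binomial coefficient is left out because only a single ordered sequence is considered.

```python
from math import isclose, log, prod

w = 0.7
# Hypothetical sequence of 10 tosses: 1 = head, 0 = tail (7 heads in total).
tosses = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

# Probability of each individual outcome under w.
probs = [w if t == 1 else 1 - w for t in tosses]

likelihood = prod(probs)                       # multiply the probabilities
log_likelihood = sum(log(p) for p in probs)    # add the log probabilities

print(f"Likelihood:          {likelihood:.6f}")
print(f"Log likelihood:      {log_likelihood:.4f}")
print(f"log(likelihood):     {log(likelihood):.4f}")
```

The last two printed values agree, and the log likelihood is negative because every probability is below 1.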

-2LL

• The log likelihood is the sum of the log probabilities associated with the predicted and actual outcomes.
  • -2LL is analogous to the residual sum of squares in OLS regression.
  • The larger the -2LL, the greater the unexplained variation (the poorer the fit).
• The log likelihood is usually negative, and can be made positive by multiplying by -1.
• We also multiply by 2, because differences in -2LL between nested models approximately follow a chi-square distribution, which enables us to obtain p values to compare models.
• This value is -2LL

EVALUATING MODELS

• Using OLS we use R² to evaluate models.
  • i.e. does the addition of a predictor produce a significant increase in R²?
• R² is based on Sums of Squares, which we do not have when using ML.
• We use the -2LL, Deviance, and Information Criteria to evaluate models using ML.
  • Unlike R², -2LL is not meaningful in its own right.
  • It is used to compare with other models.

DEVIANCE

• Deviance is a measure of lack of fit.
  • Measures how much worse the model is than a perfectly fitting model.
• Deviance can be used to obtain a measure of pseudo-R²

• Pseudo-R² = 1 - (deviance of the fitted model / deviance of the intercept-only model)

LIKELIHOOD RATIO STATISTIC

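• L_R = likelihood of the reduced model (without the parameters)
• L_F = likelihood of the full model (with the parameters)
• G² = -2 ln(L_R / L_F) = (-2LL of the reduced model) - (-2LL of the full model)
• G² ~ χ² with r degrees of freedom, where r = df_full - df_reduced
• G² compares the fitted model with the intercept-only model.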

MAXIMUM LIKELIHOOD IN SPSS

• Logistic regression.
  • Used with a binomial outcome variable
  • E.g. yes, no; correct, incorrect; married, not married.
• Generalised Linear Models
  • Allows a range of non-linear models to be fitted.

BAR-TAILED GODWIT DATA

• Dependent variable is a count:
  • Maximum number of birds observed at each estuary for each year
• Independent variables
  • Estuary: Richmond, Hastings, Clarence, Hunter, Tweed
    • Categorical
  • Year: 1981 – 2014.
    • Continuous (centred to 0 at 1981).
• Research question:
  • Does the number of Bar-tailed Godwits in the Richmond Estuary remain stable, or improve, compared to the other estuaries?

STEP 1: GRAPH THE DATA

It is obvious that these data have problems.

Counts in the Hunter estuary are much higher than in the other estuaries, and have much greater variance.

STEP 2: DUMMY CODE THE ESTUARY DATA

Richmond  Clarence  Hunter  Hastings  Tweed
0         1         0       0         0
0         0         1       0         0
0         0         0       1         0
0         0         0       0         1

Use Richmond as the comparison category. Each of the other estuaries may be compared in turn with Richmond.
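Dummy coding with Richmond as the comparison category can be sketched as follows. The function and variable names are illustrative; in SPSS the dummy variables would be created through the menus or syntax instead.

```python
estuaries = ["Richmond", "Clarence", "Hunter", "Hastings", "Tweed"]
reference = "Richmond"

# One dummy variable for every estuary except the comparison category.
dummy_names = [e for e in estuaries if e != reference]

def dummy_code(estuary):
    """Return the dummy variables (0/1) for one observation's estuary."""
    if estuary not in estuaries:
        raise ValueError(f"Unknown estuary: {estuary}")
    return {name: int(estuary == name) for name in dummy_names}

for e in estuaries:
    print(f"{e:9s}", dummy_code(e))
```

Richmond scores 0 on every dummy variable, so each coefficient for a dummy variable is a comparison of that estuary with Richmond.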

STEP 3: RUN OLS ANALYSIS OF THE DATA

• I will just include Hunter in this analysis to illustrate.
• Model:

• Including just the Year0:

• There is a non-significant change in Godwit numbers over the years.

OLS DATA ANALYSIS

• Including the estuary and estuary * Year0 interaction.

There is a significant increase in R² when Hunter and the Hunter × Year0 interaction are included in the model.

INTERPRETATION OF THE FULL MODEL

• At Year0 = 0, the predicted Godwit count for Richmond = 292 birds
• Change in numbers over the years for Richmond = -4.4
• At Year0 = 0, the difference between numbers in the Hunter and Richmond = 1449.7 (p < .001)
• Over 24 years, the difference in rate of change for Hunter, compared with Richmond, is -15.2 (p = .031)
  • i.e. there is a steeper decline in bird numbers in the Hunter estuary than in the Richmond estuary.
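The interpretation above can be checked by writing the fitted OLS equation out as a function. This sketch uses the four coefficients quoted on the slide and assumes, as is usual for regression coefficients, that -4.4 and -15.2 are per-year slopes; the function name is illustrative.

```python
def predicted_count(year0, hunter):
    """Predicted Godwit count from the full OLS model.

    year0  : years since 1981
    hunter : 1 for the Hunter estuary, 0 for Richmond
    """
    return 292 - 4.4 * year0 + 1449.7 * hunter - 15.2 * hunter * year0

richmond_start = predicted_count(0, 0)
hunter_start = predicted_count(0, 1)

# Slope for each estuary: change in predicted count per one-year increase.
richmond_slope = predicted_count(1, 0) - predicted_count(0, 0)
hunter_slope = predicted_count(1, 1) - predicted_count(0, 1)

print(f"Richmond at Year0 = 0: {richmond_start:.1f}")
print(f"Hunter at Year0 = 0:   {hunter_start:.1f}")
print(f"Richmond slope: {richmond_slope:.1f}, Hunter slope: {hunter_slope:.1f}")
```

Under these assumptions the Hunter slope is the Richmond slope plus the interaction coefficient, which is why the decline is steeper in the Hunter.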

CHECKING RESIDUALS….

• Residuals are not normally distributed.
• The assumptions for a linear model are not met!!

WHAT TO DO?

• We could try a transformation of the DV
  • A square root transformation is better, but not perfect
• We could use a non-linear model
  • The data are counts, and we could use either a Poisson or Negative Binomial distribution
  • We will use a Negative Binomial (for reasons that will be explained later)
  • Use Generalized Linear Models for the analysis.

INTERCEPT ONLY MODEL

• No predictors are included, and the model simply tests whether the overall number of BT Godwits is different to zero.
• The log likelihood is -827.26
• -2LL = 1654.53

MODEL WITH THREE PARAMETERS

• Running the model including Year0, Hunter and Hunter*Year0 gives the following Goodness of Fit measures:
  • Log likelihood = -781.3
  • -2LL = 1562.6

COMPARING THE TWO MODELS

• -2LL for intercept only model = 1654.53
• -2LL for full model (with parameters) = 1562.6
• Likelihood ratio (G²) = 1654.5 – 1562.6 = 91.9
• df = 3, p < .001
• Therefore the model including the three parameters is a better fit to the data than just the intercept only model.
• Limitations:
  1. The models must be nested (one model must be contained within the other)
  2. The data sets must be identical
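The comparison above can be reproduced from the two -2LL values. This sketch avoids external libraries by using the closed-form upper-tail probability of a chi-square distribution with 3 degrees of freedom, so the helper `chi2_sf_df3` only applies to df = 3; with SciPy available, a general chi-square function would normally be used instead.

```python
from math import erfc, exp, pi, sqrt

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

neg2ll_reduced = 1654.53   # intercept-only model
neg2ll_full = 1562.6       # Year0, Hunter and Hunter*Year0

g2 = neg2ll_reduced - neg2ll_full   # likelihood ratio statistic
p_value = chi2_sf_df3(g2)           # df = 3 extra parameters

print(f"G2 = {g2:.2f}, df = 3, p = {p_value:.3g}")
```

The p value is far below .001, in line with the conclusion that the three-parameter model fits better than the intercept-only model.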

INFORMATION CRITERIA

• Akaike's Information Criterion: AIC = -2LL + 2k
• Schwarz's Bayesian Criterion: BIC = -2LL + k × ln(N)
  • k = number of parameters
  • N = number of participants
• Can be used with non-nested models
• These IC are similar to adjusted R²
• The more parameters you have, the better a model is likely to fit the data.
• The IC take this into account by penalising for additional parameters and, for BIC, the number of participants.
• Better fitting models have lower values of the IC.
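The two criteria can be sketched as small functions, using the standard forms AIC = -2LL + 2k and BIC = -2LL + k·ln(N). The example uses the -2LL of 2075 reported later in these slides for the sensation-only Poisson model with k = 2 (intercept and slope). N = 400 is an assumption: the sample size is not stated in the slides, and this value was chosen because it reproduces the reported BIC of 2087.

```python
from math import log

def aic(neg2ll, k):
    """Akaike's Information Criterion."""
    return neg2ll + 2 * k

def bic(neg2ll, k, n):
    """Schwarz's Bayesian Criterion."""
    return neg2ll + k * log(n)

neg2ll = 2075   # sensation-only Poisson model (from the slides)
k = 2           # intercept + sensation
n = 400         # ASSUMED sample size, not stated in the slides

model_aic = aic(neg2ll, k)
model_bic = bic(neg2ll, k, n)

print(f"AIC = {model_aic:.2f}")
print(f"BIC = {model_bic:.2f}")
```

Because ln(N) exceeds 2 once N is above about 7, BIC penalises each extra parameter more heavily than AIC in most data sets.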

ANALYSIS OF COUNT DATA

• Coxe, S., West, S.G. and Aiken, L. (2009). The analysis of count data: a gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment, 91, 121–136.
  • Poisson regression
  • Overdispersed Poisson regression models
  • Negative binomial regression models
  • Models which address problems with zeros.

ANALYSIS OF COUNT DATA

• Count data are discrete numbers
  • Usually not normally distributed.
  • E.g. number of drinks on a Saturday night.
• Modelled by a Poisson distribution.
  • This has one parameter, μ.

POISSON MODEL

• Assumptions: (Y|X) ~ Poi(μ), Var(Y|X) = φμ, φ = 1

• i.e. The residuals have a Poisson distribution.
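The assumption that the variance equals the mean (φ = 1) can be illustrated numerically from the Poisson probability function. The value μ = 2.5 is hypothetical, and the sum is truncated at 59 because probabilities beyond that point are negligible for this μ.

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """Probability of observing a count of y when the Poisson mean is mu."""
    return exp(-mu) * mu**y / factorial(y)

mu = 2.5            # hypothetical mean count
ys = range(60)      # counts 0..59 cover essentially all the probability

mean = sum(y * poisson_pmf(y, mu) for y in ys)
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in ys)

print(f"Mean = {mean:.4f}, Variance = {variance:.4f}")
```

Both quantities come out equal to μ, which is the property that overdispersed data violate.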

EXAMPLE: DRINKS DATA

• Coxe et al. Poisson dataset in SPSS format.
  • Sensation: mean score on a sensation seeking scale (1-7)
  • Gender (0 = female, 1 = male)
  • Y: number of drinks on a Saturday night.

OLS REGRESSION

• Intercept < 0
  • When sensation = 0, number of drinks is negative!!
• Residuals are not normally distributed.
• OLS has problems!!

POISSON REGRESSION: PARAMETERS

• Sensation only

• When sensation = 0, drinks = e^(-.14) = .86
• For every 1 unit change in sensation, number of drinks is multiplied by e^(.231) = 1.26.
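Because a Poisson regression models the log of the expected count, coefficients are exponentiated before being interpreted. The sketch below takes the intercept as -0.14 and the sensation coefficient as 0.231 (the value whose exponential is 1.26); the function name is illustrative.

```python
from math import exp

b0 = -0.14    # intercept (log scale)
b1 = 0.231    # sensation coefficient (log scale)

def predicted_drinks(sensation):
    """Predicted number of drinks for a given sensation seeking score."""
    return exp(b0 + b1 * sensation)

multiplier = exp(b1)   # change in predicted drinks per 1-unit increase

print(f"Predicted drinks at sensation = 0: {predicted_drinks(0):.2f}")
print(f"Multiplier per unit of sensation:  {multiplier:.2f}")
for s in (3, 4, 5):
    print(f"sensation = {s}: {predicted_drinks(s):.2f} drinks")
```

Each one-unit increase multiplies the prediction by the same factor, so the effect is proportional rather than additive as in OLS.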

POISSON REGRESSION: MODEL FIT

• Sensation only: Model fit

• G² = 35.07
  • Model fits better than the intercept only model
• Deviance = 1151
• -2LL = -2 × (-1037.5) = 2075
• BIC = 2087
• Deviance for the intercept-only model = 1186 (check)

• Pseudo-R2 =
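One common deviance-based definition is pseudo-R² = 1 - (deviance of the fitted model / deviance of the intercept-only model). Assuming that definition is the one intended here, and taking the two deviances quoted above at face value (the intercept-only figure is flagged "check" on the slide), the arithmetic would be:

```python
deviance_model = 1151   # sensation-only Poisson model (from the slide)
deviance_null = 1186    # intercept-only model (flagged "check" on the slide)

# ASSUMED definition: proportional reduction in deviance.
pseudo_r2 = 1 - deviance_model / deviance_null

print(f"Pseudo-R2 = {pseudo_r2:.4f}")
```

The difference between the two deviances (35) also matches the reported G² of 35.07 after rounding.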

POISSON REGRESSION: PARAMETERS

• Sensation and Gender as predictors

• What is the effect of gender on number of drinks consumed (holding sensation constant)??

EFFECT OF GENDER

• Intercept = -.789 (for gender = 0; female)
  • Exp(-.789) = .45
  • Females are predicted to drink .45 drinks on a Saturday night (when sensation seeking = 0)
• B = .839 (gender = 1: male)
  • Exp(.839) = 2.3
  • Males drink 2.3 times as many drinks as females (holding sensation seeking constant).

POISSON REGRESSION: MODEL FIT

• -2LL = -2 × (-941.4) = 1882.8
• BIC = 1900.77
• Model including gender is a substantially better fit than the sensation model alone
  • (1900 vs 2087)

Pseudo-R2 =

MODEL ADEQUACY

• Save deviance residuals and predicted values, and plot the residuals against predicted values.

OVERDISPERSION

• A Poisson distribution has only one parameter, μ, where μ is both the mean and the variance of the distribution.
• Often the variance of a set of data is greater than the mean
  • The data are overdispersed.
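A quick first check for overdispersion is to compare the sample variance of the counts with their mean. The counts below are invented purely for illustration and are not from any dataset in these slides; a model-based check would look at the dispersion of the residuals instead.

```python
from statistics import mean, variance

# HYPOTHETICAL counts, for illustration only.
counts = [0, 0, 1, 1, 2, 3, 5, 8, 12, 20]

sample_mean = mean(counts)
sample_variance = variance(counts)          # sample variance (n - 1 denominator)
dispersion_ratio = sample_variance / sample_mean

print(f"Mean = {sample_mean:.2f}")
print(f"Variance = {sample_variance:.2f}")
print(f"Variance / mean = {dispersion_ratio:.2f}")
```

A ratio well above 1, as here, suggests the variance is larger than a Poisson distribution allows.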

OVERDISPERSED POISSON REGRESSION MODELS

• A second parameter, φ, is estimated to scale the variance.
• The parameter estimates from the overdispersed model are the same as with the simple model, but standard errors are larger.
• Use information criteria to compare models

NEGATIVE BINOMIAL MODELS

• Negative binomial models use a Poisson distribution, but allow individuals with the same predictor values to differ in the Poisson mean that is fitted.

HOMEWORK

• Use PGSI Data.sav (Leigh's Honours data)
  • DV = PGSI (Score on Problem Gambling Severity Scale)
  • Predictors = GABS, FreqCoded
• Run a Poisson regression to predict PGSI from GABS
  • Does GABS significantly predict PGSI score?
    • Look at the likelihood ratio (G²)
  • Interpret the coefficients for the intercept and GABS
• Run a second regression including FreqCode (as a continuous variable) in the model.
  • Does this second predictor improve the model fit?
    • (hint – look at the BIC for the two models)
