Advanced Statistical Methods: Beyond Linear Regression

John R. Stevens, Utah State University

Notes 2. Statistical Methods I

Mathematics Educators Workshop, 28 March 2009
http://www.stat.usu.edu/~jrstevens/pcmi

What would your students know to do with these data?

Obs  Flight  Temp  Damage
1    STS1    66    NO
2    STS9    70    NO
3    STS51B  75    NO
4    STS2    70    YES
5    STS41B  57    YES
6    STS51G  70    NO
7    STS3    69    NO
8    STS41C  63    YES
9    STS51F  81    NO
10   STS4    80
11   STS41D  70    YES
12   STS51I  76    NO
13   STS5    68    NO
14   STS41G  78    NO
15   STS51J  79    NO
16   STS6    67    NO
17   STS51A  67    NO
18   STS61A  75    YES
19   STS7    72    NO
20   STS51C  53    YES
21   STS61B  76    NO
22   STS8    73    NO
23   STS51D  67    NO
24   STS61C  58    YES

Two Sample t-test

data:  Temp by Damage
t = 3.1032, df = 21, p-value = 0.005383
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.774344 14.047085
sample estimates:
 mean in group NO  mean in group YES
         72.12500           63.71429
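The t statistic above can be checked by hand. The following is a minimal Python sketch, assuming the equal-variance (pooled) form of the test, which is consistent with df = 21 for 16 + 7 flights. Flight STS4 is left out because no damage value is listed for it; the helper names are illustrative only.

```python
from math import sqrt

# Launch temperatures from the data table, split by damage status.
# Flight STS4 (obs. 10) has no damage value listed and is left out.
temp_no = [66, 70, 75, 70, 69, 81, 76, 68, 78, 79, 67, 67, 72, 76, 73, 67]
temp_yes = [70, 57, 63, 70, 75, 53, 58]

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq_dev(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    df = len(a) + len(b) - 2
    pooled_var = (sum_sq_dev(a) + sum_sq_dev(b)) / df
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se, df

t_stat, df = pooled_t(temp_no, temp_yes)
print(mean(temp_no), mean(temp_yes), round(t_stat, 4), df)
```

The group means and t statistic should agree with the output above. Note that the calculation treats temperature as the response and damage as the grouping factor, which is exactly the modelling choice the next slide questions.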

Does the t-test make sense here?
Traditional: Treatment Group mean vs. Control Group mean

What is the response variable?
  Temperature? [Quantitative, Continuous]
  Damage? [Qualitative]

Traditional Statistical Model 1
Linear Regression: predict continuous response from [quantitative] predictors
  Y = weight, X = height
  Y = income, X = education level
  Y = first-semester GPA, X = parents' income
  Y = temperature, X = damage (0 = no, 1 = yes)

Can also control for other [possibly categorical] factors (covariates):
  Sex
  Major
  State of Origin
  Number of Siblings

Traditional Statistical Model 2
Logistic Regression: predict binary response from [quantitative] predictors
  Y = 0 (graduate within 5 years) vs. Y = 1 (not); X = first-semester GPA
  Y = 0 (no damage) vs. Y = 1 (damage); X = temperature
  Y = 0 (survive) vs. Y = 1 (death); X = dosage (dose-response model)
Can also control for other factors, or covariates:
  Race, Sex
  Genotype
p = P(Y=1 | relevant factors) = probability that Y=1, given the state of the relevant factors

Traditional Dose-Response Model
p = probability of death at dose d:

$$p = \frac{\exp(\beta_0 + \beta_1 d)}{1 + \exp(\beta_0 + \beta_1 d)}$$

Look at what affects the shape of the curve, the LD50 (the dose that is lethal to 50% of subjects), etc.
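A small Python sketch of the curve and its LD50, assuming the usual logistic form p = 1 / (1 + exp(-(b0 + b1 d))). Under that form the LD50 is the dose where b0 + b1 d = 0. The coefficient values below are hypothetical and chosen only for illustration.

```python
from math import exp

def death_prob(dose, b0, b1):
    """Logistic dose-response curve: P(death) at a given dose."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * dose)))

def ld50(b0, b1):
    """Dose at which P(death) = 0.5, i.e. where b0 + b1 * dose = 0."""
    return -b0 / b1

# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 2.0
for dose in [0.0, 1.0, ld50(b0, b1), 3.0, 4.0]:
    print(dose, round(death_prob(dose, b0, b1), 4))
```

A larger b1 steepens the curve around the LD50, while changing b0 with b1 fixed slides the curve along the dose axis.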

Fitting the Dose-Response Model
Why logistic regression?
  $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 d$
  $\beta_0$ = place-holder constant
  $\beta_1$ = effect of dosage d
To estimate parameters:
  Newton-Raphson iterative process to maximize the likelihood of the model
  Compare Y=0 (no damage) with Y=1 (damage) groups

Likelihood Function (to be maximized)

$$L(\beta_0, \beta_1) = \prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}$$

  each factor is the likelihood for obs. i
  multiply probabilities (independence)

Estimation by IRLS (Iteratively Reweighted Least Squares)

equivalent: Newton-Raphson algorithm for iteratively solving score equations

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  15.0429     7.3786   2.039   0.0415 *
Temp         -0.2322     0.1082  -2.145   0.0320 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
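The Newton-Raphson iteration can be sketched directly for this two-parameter model. The Python below is a minimal illustration and not the routine that produced the output above. It assumes the 23 flights with a recorded damage value, coded 1 = YES, and inverts the 2x2 information matrix by hand.

```python
from math import exp, sqrt

# Temperature and damage indicator (1 = YES) for the 23 flights with a
# recorded outcome, in table order (STS4 is omitted).
temp = [66, 70, 75, 70, 57, 70, 69, 63, 81, 70, 76, 68, 78, 79, 67, 67, 75,
        72, 53, 76, 73, 67, 58]
damage = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
          0, 1, 0, 0, 0, 1]

def fit_logistic(x, y, max_iter=50, tol=1e-8):
    """Newton-Raphson for logit(p) = b0 + b1 * x; returns estimates and SEs."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: derivative of the log-likelihood
        u0 = sum(yi - pi for yi, pi in zip(y, p))
        u1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix [[a, b], [b, c]] with weights p(1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Newton step: inverse information times score
        s0 = (c * u0 - b * u1) / det
        s1 = (a * u1 - b * u0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    # Standard errors: square roots of the inverse-information diagonal
    return b0, b1, sqrt(c / det), sqrt(a / det)

b0_hat, b1_hat, se0, se1 = fit_logistic(temp, damage)
print(round(b0_hat, 4), round(b1_hat, 4), round(se0, 4), round(se1, 4))
```

Starting from zero, the iteration should settle within a handful of steps on values close to the Estimate and Std. Error columns above. Each step is a weighted least-squares solve, which is where the name IRLS comes from.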

What if the data were even better?
Complete separation of points

What should happen to our slope estimate?

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    928.9   913821.4   0.001        1
Temp           -14.4    14106.7  -0.001        1

Failure?
Shape of likelihood function

Large Standard Errors
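Why the fit breaks down can be seen numerically. The sketch below uses a small, invented data set with complete separation (not the data behind the output above) and evaluates the logistic log-likelihood as the curve is made steeper about a fixed midpoint.

```python
from math import exp, log

# Hypothetical, completely separated data: every x below 64.5 has y = 1,
# every x above 64.5 has y = 0 (values invented for illustration).
x = [53, 57, 58, 63, 66, 67, 70, 75]
y = [1, 1, 1, 1, 0, 0, 0, 0]

def log_lik(b0, b1):
    """Logistic log-likelihood: sum of y * eta - log(1 + exp(eta))."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        total += yi * eta - log(1.0 + exp(eta))
    return total

# Steepen the curve while keeping its midpoint at x = 64.5.
slopes = [-0.5, -1.0, -2.0, -4.0, -8.0]
lls = [log_lik(-s * 64.5, s) for s in slopes]
for s, ll in zip(slopes, lls):
    print(s, ll)
```

The log-likelihood keeps climbing toward 0 as the slope grows in magnitude, so no finite maximizer exists. The likelihood surface becomes flat in that direction, which is what inflates the standard errors.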

Solution only in 2006

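Rather than maximizing the likelihood $L(\beta)$, consider a penalty and maximize a penalized likelihood, for example Firth's form $L^*(\beta) = L(\beta)\,|I(\beta)|^{1/2}$, where $I(\beta)$ is the Fisher information: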

Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood

                  coef   se(coef)    Chisq            p
(Intercept) 30.4129282 16.5145441 11.35235 0.0007535240
Temp        -0.4832632  0.2528934 13.06178 0.0003013835
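A minimal Python sketch of one penalized fit, assuming the penalty is Firth's (the log-likelihood plus half the log-determinant of the Fisher information). It iterates on the Firth-modified score, in which each residual y - p is adjusted by h (1/2 - p), with h the hat-matrix diagonal. The data are a hypothetical, completely separated toy set, not the data behind the output above, and the sketch makes no attempt to reproduce the profile-likelihood intervals.

```python
from math import exp

# Hypothetical, completely separated data (x already centered).
x = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
y = [0, 0, 0, 1, 1, 1]

def firth_fit(x, y, max_iter=100, tol=1e-8):
    """Firth-penalized logistic fit for logit(p) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information [[a, b], [b, c]] for the design (1, x)
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        c = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * c - b * b
        # Hat-matrix diagonals: h_i = w_i * (1, x_i) I^{-1} (1, x_i)'
        h = [wi * (c - 2.0 * b * xi + a * xi * xi) / det
             for wi, xi in zip(w, x)]
        # Firth-modified residuals: y_i - p_i + h_i * (1/2 - p_i)
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Step: inverse information times modified score
        s0 = (c * u0 - b * u1) / det
        s1 = (a * u1 - b * u0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1

b0_pen, b1_pen = firth_fit(x, y)
print(b0_pen, b1_pen)
```

On these separated data the ordinary likelihood has no finite maximizer, yet the penalized fit returns a finite slope. The penalty term shrinks toward zero information as the curve steepens, which pulls the estimate back.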

Beetle Data

Phosphine   Total
Dosage      Receiving   Total    Total        Survivors Observed at Genotype
(mg/L)      Dosage      Deaths   Survivors    -/B   -/H   -/A   +/B   +/H   +/A
0           98          0        98           31    27    10    6     20    4
0.003       100         16       84           18    26    10    6     20    4
0.004       100         68       32           10    4     3     5     7     4
0.005       100         78       22           1     4     7     2     6     2
0.01        100         77       23           0     1     9     8     5     0
0.05        300         270      30           0     0     0     5     20    5
0.1         400         383      17           0     0     0     0     10    7
0.2         750         740      10           0     0     0     0     0     10
0.3         500         490      10           0     0     0     0     0     10
0.4         500         492      8            0     0     0     0     0     8
1.0         7,850       7,806    44           0     0     0     0     0     44
Total       10,798      10,420   378

Dose-response model
Recall simple model:

$p_{ij}$ = Pr(Y=1 | dosage level j and genotype level i)

But when is genotype (covariate $G_i$) observed?

Coefficients:
                Estimate  Std. Error    z value  Pr(>|z|)
(Intercept)   -2.657e+01   8.901e+04  -2.98e-04         1
dose          -7.541e-26   1.596e+07  -4.72e-33         1
G1+           -3.386e-28   1.064e+05  -3.18e-33         1
G2B           -1.344e-14   1.092e+05  -1.23e-19         1
G2H           -3.349e-28   1.095e+05  -3.06e-33         1
dose:G1+       7.541e-26   1.596e+07   4.72e-33         1
dose:G2B       3.984e-12   3.075e+07   1.30e-19         1
dose:G2H       7.754e-26   2.760e+07   2.81e-33         1
G1+:G2B        1.344e-14   1.465e+05   9.17e-20         1
G1+:G2H        3.395e-28   1.327e+05   2.56e-33         1
dose:G1+:G2B  -3.984e-12   3.098e+07  -1.29e-19         1
dose:G1+:G2H  -7.756e-26   2.763e+07  -2.81e-33         1

Before we fix this, first a little detour

A Multivariate Gaussian Mixture
Component j is MVN($\mu_j$, $\Sigma_j$) with proportion $\pi_j$

The Maximum Likelihood Approach

A Possible Work-Around
Keys here:
  the true group memberships are unknown (latent)
  statisticians specialize in unknown quantities

A reasonable approach
1. Randomly assign group memberships $\delta$, and estimate group means $\mu_j$, covariance matrices $\Sigma_j$, and mixing proportions $\pi_j$
2. Given those values, calculate (for each obs.) $\gamma_j = E[\delta_j \mid \theta]$ = P(obs. in group j)
3. Update estimates for $\mu_j$, $\Sigma_j$, and $\pi_j$, weighting each observation by these $\gamma_j$
4. Repeat steps 2 and 3 to convergence

Plotting character and color indicate most likely component
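The four steps can be sketched in a few lines of Python. This is a deliberately simplified illustration with several assumptions of its own: one dimension instead of a multivariate normal, exactly two components, simulated data, a fixed number of iterations in place of a convergence check, and a deterministic split at the overall mean for step 1 instead of a random assignment (random starts in this toy setting can take much longer to separate).

```python
import random
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def em_two_gaussians(data, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture."""
    # Step 1 (simplified): split at the overall mean instead of at random.
    centre = sum(data) / len(data)
    gamma = [1.0 if x > centre else 0.0 for x in data]  # P(obs. in group 2)
    for _ in range(n_iter):
        # Step 3: weighted estimates of proportions, means and variances
        n2 = sum(gamma)
        n1 = len(data) - n2
        mu1 = sum((1 - g) * x for g, x in zip(gamma, data)) / n1
        mu2 = sum(g * x for g, x in zip(gamma, data)) / n2
        var1 = sum((1 - g) * (x - mu1) ** 2 for g, x in zip(gamma, data)) / n1
        var2 = sum(g * (x - mu2) ** 2 for g, x in zip(gamma, data)) / n2
        prop2 = n2 / len(data)
        # Step 2: membership probabilities given the current estimates
        gamma = []
        for x in data:
            d1 = (1 - prop2) * normal_pdf(x, mu1, var1)
            d2 = prop2 * normal_pdf(x, mu2, var2)
            gamma.append(d2 / (d1 + d2))
    return (1 - prop2, mu1, var1), (prop2, mu2, var2)

# Simulated data: two groups of 200 with means 0 and 5, unit variance.
rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)]
        + [rng.gauss(5.0, 1.0) for _ in range(200)])
comp1, comp2 = em_two_gaussians(data)
print(comp1)
print(comp2)
```

Each tuple holds (proportion, mean, variance) for one component. Assigning every observation to the component with the larger membership probability gives the "most likely component" labelling used in the plot.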

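The EM (Baum-Welch) Algorithm
- maximization made easier with $Z^m$ = latent (unobserved) data; T = (Z, $Z^m$) = complete data
1. Start with initial guesses $\theta^{(0)}$ for the parameters
2. Expectation: At the kth iteration, compute $Q(\theta', \theta^{(k)}) = E[\ell(\theta'; T) \mid Z, \theta^{(k)}]$
3. Maximization: Obtain estimate $\theta^{(k+1)}$ by maximizing $Q(\theta', \theta^{(k)})$ over $\theta'$
4. Iterate steps 2 and 3 to convergence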

Beetle Data Notation
Observed values
Unobserved (latent) values
If $N_{ij}$ had been observed:

How $N_{ij}$ can be [latently] considered:

Likelihood Function
Parameters $\theta$ = (p, P) and complete data T = (n, N)
After simplification:

Mechanism of missing data suggests EM algorithm

Missing at Random (MAR)
Necessary assumption for usual EM applications
Covariate x is MAR if the probability of observing x does not depend on x or any other unobserved covariate, but may depend on the response and other observed covariates (Ibrahim 1990)
Here genotype is observed only for survivors, and for all subjects at zero dosage

Initialization Step
Two classes of marginal information here:
  For all dosage levels j observe
  At zero dosage level observe for genotype i
    Allows estimate of $P_i$
Consider marginal distn. of missing categorical covariate (genotype)
Using zero dosage level:

This is the key: the marginal distribution of the missing categorical covariate
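One natural reading of this step, offered here as an assumption since the slide's formula is not reproduced, is that the genotype proportions $P_i$ are initialized from the zero-dosage row of the beetle table, where every beetle survived and was therefore genotyped. A short Python sketch:

```python
genotypes = ["-/B", "-/H", "-/A", "+/B", "+/H", "+/A"]
# Survivors by genotype at zero dosage (first row of the beetle table).
zero_dose_survivors = [31, 27, 10, 6, 20, 4]

total = sum(zero_dose_survivors)  # all 98 beetles at zero dosage survived
P_init = {g: n / total for g, n in zip(genotypes, zero_dose_survivors)}

for g in genotypes:
    print(g, round(P_init[g], 4))
```

These proportions sum to one and give the EM iterations a starting marginal distribution for the genotype of beetles whose genotype was never observed.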

Expectation Step
Dropping constants and :

Need to evaluate:

(*)

Expectation Step
Bayes Formula:

Multinomial (*)

Expectation Step
For :
  Not needed for maximization; only affects EM convergence rate
  Direct calculation from multinomial distn. is possible but computationally prohibitive
  Need to employ some approximation strategy
  Second-order Taylor series about , using Binet's formula (*)

Expectation Step
Consider Binet's formula (like Stirling's):

Have:

Use a second-order Taylor series approximation taken about as a function of :(*)
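The exact form of the approximation used on the slide is not shown here. As a rough illustration of why Stirling-type formulas help, the Python sketch below compares log(n!) with the leading terms of Binet's formula for log Gamma(z), which coincide with Stirling's approximation. The function name is illustrative only.

```python
from math import lgamma, log, pi

def stirling_log_gamma(z):
    """Leading terms of Binet's formula for log Gamma(z)."""
    return (z - 0.5) * log(z) - z + 0.5 * log(2.0 * pi)

for n in [5, 50, 500]:
    exact = lgamma(n + 1)               # log(n!)
    approx = stirling_log_gamma(n + 1)
    print(n, exact, approx, exact - approx)
```

The error is positive and smaller than 1/(12 z), so for counts in the hundreds or thousands the closed form is a smooth and accurate stand-in for the log-factorial terms of a multinomial probability. That smoothness is also what makes a Taylor expansion of it practical.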

Maximization StepPortion of related to :

Portion of related to :
  by Lagrange multipliers
  by Newton-Raphson iterations, with some parameterization (*)

Convergence

Dose Response Curves (log scale)

EM Results

                      Confidence
Genotype   LD50     L95       U95      t
-/B        0.0035   0.0031    0.0039   3.99
-/H        0.0033   0.0028    0.0038   4.98
-/A        0.0290   -7.1862   7.2442   0.13
+/B        0.0484   0.0123    0.0845   0.09
+/H        0.0664   0.0407    0.0921   4.20
+/A        0.7382   0.1428    1.3336   1.36

t: test statistic for H0: no dosage effect
Also noted on the slide: separation of points

Topics Used Here
Calculus
  Differentiation & Integration (including vector differentiation)
  Lagrange Multipliers
  Taylor Series Expansions
Linear Algebra
  Determinants & Eigenvalues
  Inverting [computationally/nearly singular] Matrices
  Positive Definiteness
Probability
  Distributions: Multivariate Normal, Binomial, Multinomial
  Bayes Formula
Statistics
  Logistic Regression
  Separation of Points
  [Penalized] Likelihood Maximization
  EM Algorithm
Biology: a little time and communication
