
Page 1: Unit 2b: Dealing “Rationally” with Nonlinear Relationships

© Andrew Ho, Harvard Graduate School of Education
http://xkcd.com/314/

Page 2:

• Introducing a theory-driven approach to fitting nonlinear models to data
• Fitting a nonlinear model and interpreting results
• Polynomial regression

Course Roadmap: Unit 2b (today's topic area highlighted)

Multiple Regression Analysis (MRA): $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical data points.
• If your residuals are not independent, replace OLS by GLS regression analysis, use individual growth modeling, or specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Factor Analysis (EFA or CFA?); use Cluster Analysis.
• If your outcome vs. predictor relationship is non-linear (today's topic area): use non-linear regression analysis, or transform the outcome or predictor.

Page 3:

Two General Approaches to Fitting Nonlinear Relationships

Theory-Driven, “Rational” Approach (this class; harder to apply, easier to interpret):
• Use theory, or knowledge of the field, to postulate a non-linear model for the hypothesized relationship between outcome and predictor.
• Use nonlinear regression analysis to fit the postulated trend, and conduct all of your statistical inference there.
• Interpret the parameter estimates directly, and produce plots of findings.

Data-Driven, “Empirical” Approach (last class; easier to apply, harder to interpret):
• Find an ad-hoc transformation of either the outcome or the predictor, or both, that renders their relationship linear.
• Use regular linear regression analysis to fit a linear trend in the transformed world, and conduct all statistical inference there.
• De-transform the fitted model to produce plots of findings, and tell the substantive story in the untransformed world.

Page 4:

Theory-Driven, “Rational” Approach:
• Use theory, or knowledge of the field, to postulate a non-linear model for the hypothesized relationship between outcome and predictor.
• Use nonlinear regression analysis to fit the postulated trend, and conduct all of your statistical inference there.
• Interpret the parameter estimates directly, and produce plots of findings.

Theory: Pioneers in mathematical psychology, in the mid-20th century, theorized that human learning was state-dependent – that the rate at which individuals learned was proportional to the amount that they had left to learn.

This led psychologists, like Nancy Bayley, to hypothesize that IQ had a negative exponential trajectory with age:

$IQ = \lambda\,[1 - e^{-\gamma \cdot AGE}]$

Under this theory, the shape of the IQ/AGE trend in the BAYLEY data would look like this:

[Sketch: hypothesized negative exponential trend of IQ against AGE.]

Because the meaning of the model parameters is not immediately obvious, we need to build intuition about the shape of negative exponential curves … by sketching a few plots.

http://www.foundalis.com/lan/hw/grkhandw.htm http://www.livingwaterbiblegames.com/greek-alphabet-handwriting.html

lambda (λ): Greek “l”; gamma (γ): Greek “g”
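One way to see where this functional form comes from (a sketch of the algebra, under the assumption that learning starts at zero at AGE = 0): if the rate of learning is proportional to the amount left to learn, with ceiling λ and rate constant γ, then

$$\frac{d\,IQ}{d\,AGE} = \gamma\,[\lambda - IQ], \qquad IQ(0) = 0 \quad\Longrightarrow\quad IQ = \lambda\,[1 - e^{-\gamma \cdot AGE}].$$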

Page 5:

You can build intuition about how the shape of a negative exponential curve depends on the values of its parameters by sketching curves for prototypical parameter values, fixing all but one and varying the others. Let’s start with parameter λ…

Figure I.2(b).1. Examples of hypothetical negative exponential curves, $IQ = \lambda\,[1 - e^{-\gamma \cdot AGE}]$, for λ equal to 200, 250, and 300 (γ = .04). [Plot of IQ against AGE, 0–100 months: each curve rises from 0 and levels off at its own value of λ.]

Conclusion? Parameter λ is the upper asymptote: the larger λ, the higher the asymptote.

Sliders in Excel. Properties: 1) Linked Cell, 2) Min/Max.
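If you would rather sketch these curves in Stata than with Excel sliders, here is a minimal sketch (my own illustration; it needs no dataset in memory):

* Sketch IQ = lambda*(1 - exp(-gamma*AGE)) for lambda = 200, 250, 300 (gamma = .04).
twoway (function y = 200*(1 - exp(-0.04*x)), range(0 100))  ///
       (function y = 250*(1 - exp(-0.04*x)), range(0 100))  ///
       (function y = 300*(1 - exp(-0.04*x)), range(0 100)), ///
       ytitle("IQ") xtitle("AGE")                           ///
       legend(order(1 "lambda = 200" 2 "lambda = 250" 3 "lambda = 300"))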

Page 6:

And here’s how the values of parameter γ affect the shape …

Figure I.2(b).2. Examples of hypothetical negative exponential curves, $IQ = \lambda\,[1 - e^{-\gamma \cdot AGE}]$, for γ equal to .01, .04, and .07 (λ = 200). [Plot of IQ against AGE, 0–100 months: all three curves share the asymptote of 200, but curves with larger γ approach it more quickly.]

Conclusion? Parameter γ determines the rate at which the asymptote is approached: the higher the value of γ, the more rapid the approach (see later).
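A supplementary way to see why larger γ means a faster approach (a short calculus sketch, not a new result): the curve is steepest at AGE = 0, and its initial slope is the product λγ:

$$\frac{d\,IQ}{d\,AGE} = \lambda\gamma\, e^{-\gamma \cdot AGE}, \qquad \left.\frac{d\,IQ}{d\,AGE}\right|_{AGE=0} = \lambda\gamma.$$

So with λ = 200, the curves for γ = .01, .04, and .07 start out gaining roughly 2, 8, and 14 IQ points per month, respectively.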

Page 7:

Fitting a hypothesized negative exponential curve to the BAYLEY data, using nl, proceeds by an iterative process of informed guessing … if you were to do it by hand, here is an initial (pretty bad) guess. What might the next step be?

[Plot: observed child IQ against AGE (months, 0–70) for the BAYLEY data, together with the curve implied by the initial guess, which jumps almost immediately to its asymptote of 225 while the observed IQs climb far more gradually.]

$SSE = e_1^2 + e_2^2 + e_3^2 + e_4^2 + \cdots + e_{21}^2$

Initial guess for the fitted curve: $\hat\lambda = 225$; $\hat\gamma = 1$.

In the next step, would you increase or decrease the initial estimate of λ? Increase or decrease the initial estimate of γ?

http://www.dynamicgeometry.com/JavaSketchpad/Gallery/Other_Explorations_and_Amusements/Least_Squares.html
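If you wanted to compute the SSE for a trial guess by hand in Stata, here is a minimal sketch (assuming the BAYLEY data with variables IQ and AGE are in memory):

* Fitted values and squared residuals under the trial guess lambda = 225, gamma = 1.
generate FIT0 = 225*(1 - exp(-1*AGE))
generate E0SQ = (IQ - FIT0)^2

* The sum of E0SQ is the SSE for this guess; it should match the "Iteration 0"
* residual SS that nl reports when started at these same values.
quietly summarize E0SQ
display "SSE at the initial guess = " r(sum)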

Page 8:

There’s another useful way of looking at the iterative journey to a final fitted model: think of it as a hike through a mountainous region of SSELAND, whose map grid is laid out in units of λ and γ, and we keep going downhill.

[Sketch: the SSE surface over the (λ, γ) grid, with the hike descending from Step 0 through Steps 1–5 toward a minimum, and a “???” marking whether the walk has reached the lowest point.]

The problem: How do we know our “local minimum” is our “global minimum”?

You might try a number of different starting points and see if you converge to the same answer. Also, always visualize fit if you can.
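Here is a sketch of that advice in Stata (my own illustration, assuming IQ and AGE are in memory, and assuming nl stores the residual sum of squares in e(rss)):

* Refit the model from several different starting values for gamma and compare.
foreach g in 0.01 0.1 0.5 1 {
    display _newline "Starting values: lambda = 225, gamma = `g'"
    nl (IQ = {lambda}*(1 - exp(-{gamma}*AGE))), initial(lambda 225 gamma `g')
    display "Converged: lambda = " _b[/lambda] "  gamma = " _b[/gamma] "  SSE = " e(rss)
}

If every start lands on the same λ̂, γ̂, and SSE, the worry that a local minimum is not the global minimum is at least less pressing.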

Page 9:

Unit 2b .do File ... programming Stata to conduct a non-linear regression analysis …

*--------------------------------------------------------------------------------
* Hypothesize and fit a nonlinear relationship directly.
*--------------------------------------------------------------------------------

* Specify the hypothesized non-linear model and conduct nonlinear regression
* analysis, providing some sensible initial guesses ("start values") for the
* parameter estimates.
nl (IQ = {lambda}*(1-exp(-{gamma}*AGE))), initial(lambda 225 gamma 1)

* Output the predicted values and raw residuals for brief diagnosis:
predict PREDICTED, yhat
predict RESID, resid

* Other standard diagnostic statistics can also be output.

nl is the STATA routine for fitting nonlinear regression models by least squares

You not only have to identify the outcome and predictors, you also have to provide the hypothesized model. Stata recognizes the variable names in the model (here “IQ” & “AGE”) and assumes that other “names” in the model (here “lambda” & “gamma”) are parameters you want to estimate.

You have to provide some sensible initial guesses (“starting values”) for the parameter estimates. This is where your hike begins.

You can output diagnostic datasets, as in linear regression analysis, including diagnostic statistics, although they are limited due to the nonlinear fit. (I choose not to do a full accounting and output only residuals and fits, to retain focus on the nonlinear modeling itself. But much of what you already know still applies.)

Warning. The hypothesized model is fitted to the data ITERATIVELY, by a process of guessing parameter estimates and then successively refining that guess, while attending to a best-fit criterion. The process stops when parameter estimates have “converged” on the “best” answer. With difficult problems, this can sometimes take a lot of steps, lead to loops, or, worse, lead you to a suboptimal answer. Adjusting starting values and convergence criteria can help.

Page 10:

Here is the actual sequence of refinements to the Sum of Squared Residuals made by Stata as it iterated towards a final fitted negative exponential curve for the BAYLEY data (each line shows the iteration step # and the resulting sum of squared errors, SSE):

Iteration 0:  residual SS = 70735.96
Iteration 1:  residual SS = 13439.45
Iteration 2:  residual SS = 5794.627
Iteration 3:  residual SS = 695.1685
Iteration 4:  residual SS = 670.2241
Iteration 5:  residual SS = 670.2171
Iteration 6:  residual SS = 670.2171

STATA began the iterative fitting process at “Step Zero” by computing the SSE associated with the initial guesses that I had provided …

… clearly, my initial guesses were not good!

The computer regards the fitting process as having “converged” when SSE is reduced by less than one millionth between any two contiguous steps … you can modify this criterion, and choose your own.

Over the next three steps, STATA focused rapidly on better estimates of the parameters, and SSE plummeted from over 70,000 to just under 700.


Then, STATA spent a couple of steps trying to refine the final estimates, without much luck … making only a marginal improvement to SSE.

And it quit, when between Step #5 and Step #6, it could not reduce SSE any further …

Page 11:

      Source |       SS           df       MS
-------------+----------------------------------    Number of obs =         21
       Model |  388067.783         2  194033.891    R-squared     =     0.9983
    Residual |  670.217063        19  35.2745823    Adj R-squared =     0.9981
-------------+----------------------------------    Root MSE      =   5.939241
       Total |      388738        21  18511.3333    Res. dev.     =   132.3201

------------------------------------------------------------------------------
          IQ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     /lambda |   248.0641   6.051146    40.99   0.000     235.3989    260.7293
      /gamma |   .0412756   .0019789    20.86   0.000     .0371337    .0454174
------------------------------------------------------------------------------

Here are the t-statistics and p-values for each predictor. They test the usual marginal null hypotheses of no population effect on the outcome, for the respective predictor variable, given all else in the model.

A familiar quantity?

Approximate standard errors: $s.e.(\hat\lambda) = 6.051$; $s.e.(\hat\gamma) = 0.002$.

95% confidence intervals on each regression parameter.

Final parameter estimates: $\hat\lambda = 248.1$; $\hat\gamma = 0.041$.

Usual R² statistic, 0.9983, with a standard interpretation.

The estimated value of λ -- which equals 248.1 -- describes the asymptote of the trajectory; it's the estimated ceiling on the child's learning.

Interpretation of the estimated value of γ -- which equals 0.0413 -- is not immediately obvious, but remember that it is related to the rate at which the trajectory approaches the asymptote.

Mathematical learning theorists were able to show that an estimate of the half-life of the child's learning could be obtained from the estimated value of γ: $\widehat{t_{1/2}} = \ln(2)/\hat\gamma = 0.693/0.0413 \approx 16.8$ months.

The “rational approach” provides parameter estimates that have an intuitive meaning in the context of the theory that provided the hypothesized regression model …
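In Stata, one way to get this half-life and a delta-method standard error right after the nl fit (a sketch; the expression ln(2)/γ is taken from the half-life idea above, and I am assuming the post-nl coefficients are referenced as _b[/lambda] and _b[/gamma], as in the output legend):

* Estimated half-life of learning, ln(2)/gamma, with a delta-method standard error.
* Run immediately after the nl command.
nlcom (halflife: ln(2)/_b[/gamma])

With γ̂ = 0.0413, this reproduces the roughly 16.8-month half-life.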

Page 12:

[Figure: Bayley Infant IQ Score (0–250) against Infant's Age (Months, 0–60), with the fitted negative exponential curve overlaid on the observed data.]

Half-life: $\widehat{t_{1/2}} = 16.8$ months

Asymptote: $\hat\lambda = 248.1$

Does it Fit? R² statistic = 0.9983.

Pretty darn good, but don't forget this is one individual with time series data.
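A minimal sketch of how a plot like this could be produced in Stata, assuming PREDICTED was saved by the earlier predict call:

* Overlay the fitted negative exponential curve on the observed data.
twoway (scatter IQ AGE) (line PREDICTED AGE, sort), ///
    ytitle("Bayley Infant IQ Score") xtitle("Infant's Age (Months)")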

Page 13:

We may check the residuals for violations of regression assumptions.

Residual Diagnostics, Normality

[Figure: histogram of the residuals (Frequency against Residuals, roughly −10 to 10) and a normal quantile plot of the residuals against the inverse normal.]

. swilk RESID

                   Shapiro-Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
       RESID |         21    0.93706      1.542     0.876    0.19054

Insufficient evidence to reject the null hypothesis that the residuals are normally distributed in the population.

A bit of a heavy lower tail in the residual distribution, but there’s not much to say given the low sample size…
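A sketch of the commands behind these normality diagnostics, assuming RESID was saved by the earlier predict call:

* Normality diagnostics for the nl residuals.
histogram RESID, frequency     // distribution of the residuals
qnorm RESID                    // residuals against the inverse normal
swilk RESID                    // Shapiro-Wilk test for normality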

Page 14:

Because we have time series data, we might begin to ask about autocorrelation…

Residual Diagnostics, Heteroscedasticity, Autocorrelation

A look at the residuals seems to hint at heteroscedasticity, but it is difficult to make that claim with this small sample size. Is this consistent with greater measurement error at the center of raw-score test scales (test theory), with error reduced towards the asymptote?

Adjacent residuals do show signs of being correlated, as negatives tend to predict adjacent negatives and positives tend to predict adjacent positives.

[Figure: residuals (−10 to 10) plotted against Infant's Age (Months, 0–60), shown twice: once to assess changes in spread, once to highlight runs of same-signed adjacent residuals.]
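A sketch of one way to look at both issues in Stata (the lagged-residual correlation is my own crude illustration, not a formal test):

* Residuals against AGE, to eyeball changes in spread.
scatter RESID AGE, yline(0) xtitle("Infant's Age (Months)")

* Crude autocorrelation check: correlate each residual with the previous one.
sort AGE
generate LAGRESID = RESID[_n-1]
correlate RESID LAGRESID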

Page 15:

Polynomial Regression: Interacting Variables with Themselves

$Y = \beta_0 + \beta_1 X_1 + \beta_2 (X_1 * X_1) + \varepsilon$

[Plot: a fitted quadratic, $\hat{Y} = 0.50 + 1.20X_1 - 0.10X_1^2$, graphed for X from 0 to 12 (Y from 0 to 5).]

Q: What if the effect of a given predictor differed by levels of that very predictor, i.e., the “effect of X₁” differed by levels of X₁?

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$

“I test the following hypotheses… wives’ percentage of income is associated with divorce in an inverted U-shaped curve such that the odds of divorce are highest when spouses’ economic contributions are similar”

Source: Rogers, SJ (2004). Dollars, dependency, and divorce: Four perspectives on the role of wives’ income. Journal of Marriage and Family, 66, 59-74.

Quadratic model: We allow a predictor's effect to differ according to levels of that predictor. The test on β₂ provides a test of whether the quadratic term (model) is necessary.

All quadratics are non-monotonic—they both rise and fall (or fall and rise)

However, quadratic regression can fit monotonic curves as well: As with all interactions, we have to be careful about extrapolation.
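A sketch of fitting the quadratic in Stata, here recycling the BAYLEY IQ-on-AGE example used throughout (factor-variable notation builds the squared term for us):

* Quadratic regression of IQ on AGE: c.AGE##c.AGE expands to AGE and AGE*AGE.
regress IQ c.AGE##c.AGE

The t-test on the c.AGE#c.AGE row is the test on β₂ described above.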

Page 16:

Fitting polynomial regression models for IQ on AGE (n = 21)

                 Linear      Quadratic     Cubic         Quartic
  AGE            3.730***    8.321***      10.91***      12.76***
                (0.302)     (0.417)       (0.583)       (1.050)
  AGE2                      -0.0796***    -0.199***     -0.352***
                            (0.00697)     (0.0242)      (0.0782)
  AGE3                                     0.00139***    0.00555*
                                          (0.000276)    (0.00205)
  AGE4                                                  -0.0000353
                                                        (0.0000172)
  _cons          41.18***    3.862        -7.766*       -12.72**
                (8.095)     (4.369)       (3.675)       (4.150)
  R-sq           0.889       0.987         0.995         0.996
  mss            86074.4     95502.6       96280.9       96389.8
  rss            10731.4     1303.2        524.9         416.0
  df_m           1           2             3             4
  df_r           19          18            17            16
  F              152.4       659.6         1039.5        926.9

Standard errors in parentheses. AGE2 = AGE^2, AGE3 = AGE^3, AGE4 = AGE^4.
* p<0.05, ** p<0.01, *** p<0.001

Page 17:

Higher-Order Polynomials: Less Rational Than Empirical

[Four panels: Linear, Quadratic, Cubic, and Quartic fits of Bayley Infant IQ Score (0–250) against Infant's Age (Months, 0–60).]

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 X^4 + \varepsilon$

A quadratic model may have a loose argument for being theory-driven, but polynomial regression is largely a data-driven exercise.

An advantage of polynomial regression over Box-Cox is a built-in framework for testing the hypothesis that an additional order added to the polynomial is useful for prediction.
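For example, here is a sketch of that incremental testing in Stata (nestreg and the explicit power terms AGE2–AGE4 are my choices for illustration):

* Build the power terms and test each added polynomial order in sequence.
generate AGE2 = AGE^2
generate AGE3 = AGE^3
generate AGE4 = AGE^4
nestreg: regress IQ (AGE) (AGE2) (AGE3) (AGE4)

* Equivalently, after fitting the quartic directly, the test on AGE4 asks
* whether the fourth-order term adds predictive value.
regress IQ AGE AGE2 AGE3 AGE4
test AGE4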
