Introduction to and Overview of DEF2: An R software package for cross-cultural research

E. Anthon Eff, Malcolm M. Dow, Wes Routon

Anthropological Sciences Conference, Albuquerque, March 18, 2014

The two major problems with cross-cultural data analysis addressed by DEF2 are:

Missing Data

All of the major cross-cultural data sets have substantial missing data. Single imputation methods (mean substitution, regression-predicted scores, hot deck, etc.) yield coefficient variance estimates that are biased downward, and are no longer recommended. Data editing procedures, e.g. listwise deletion, generally result in small samples (loss of power) and also require very strong assumptions about why data are missing. These assumptions are very unlikely to hold.

DEF2 employs the Multiple Imputation by Chained Equations (mice) approach to handling missing data.

Non-Independence of Sample Units

Sample cases in cross-cultural and cross-national data are frequently not independent of one another due to various inter-societal network processes: cultural trait borrowing, conquest, emulation, inheritance from ancestral populations, etc. This is the classic Galton’s Problem in anthropology, understood more generally as the problem of cultural trait transmission.

DEF2 addresses this issue by incorporating networks of relations into regression models, and employing instrumental variables procedures to generate consistent and relatively efficient estimates.

Step 1 of the DEF2 Approach to Multiple Imputation of Missing Data: finding auxiliary variables.

The mice procedure imputes values for missing observations on the variables specified in the structural regression model of interest, using both these variables themselves plus a set of auxiliary variables.

Ideal auxiliary variables are usually a subset of those with no missing values in the full data set. Auxiliary variables must be correlated with the variables in the structural regression model that have missing values, since the imputation procedure is designed to “borrow” information from them to help impute the missing values.

DEF2 will employ auxiliary variables provided by the user. Alternatively, DEF2 will identify suitable auxiliary variables as follows:

1. Identify all categorical, ordinal, and interval variables with no missing values in the complete data set.

2. Identify the variables that one wants to impute and, one at a time, treat each as a dependent variable:

i) regress (using binary/ordinal logit, multinomial, OLS) the dependent variable on the covariate that provides the highest correlation, and save the residual

ii) add to the regression model the covariate that correlates highest with the residual, and calculate the new residual

iii) repeat the above steps 8 times (or more)

iv) calculate the relative importance of predictors, drop variables that fall below a given threshold, and recalculate the residual

v) repeat steps ii – iv.
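The greedy search in steps i) to iii) can be sketched in Python with NumPy. This is a simplified illustration rather than DEF2's own code: it assumes every regression is OLS (no logit or multinomial models), skips the relative-importance pruning of step iv), and the function name `select_auxiliary` and the simulated data are hypothetical.

```python
import numpy as np

def select_auxiliary(target, candidates, n_steps=8):
    """Greedy forward selection of auxiliary variables for one target variable.

    target: 1-D array with NaN where values are missing.
    candidates: 2-D array of fully observed covariates (one column each).
    Returns the candidate column indices in the order they were added.
    """
    observed = ~np.isnan(target)
    y = target[observed]
    X = candidates[observed]
    chosen = []
    resid = y - y.mean()
    for _ in range(min(n_steps, X.shape[1])):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        # Pick the covariate most strongly correlated with the current residual.
        strength = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in remaining]
        chosen.append(remaining[int(np.argmax(strength))])
        # Refit by OLS on all chosen covariates and update the residual.
        design = np.column_stack([np.ones(len(y)), X[:, chosen]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
    return chosen

# Simulated data (assumption): the target depends on columns 3 and 0 only.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(200, 6))
target = 2.0 * candidates[:, 3] + candidates[:, 0] + rng.normal(scale=0.5, size=200)
target[:20] = np.nan  # pretend the first 20 cases are missing
picked = select_auxiliary(target, candidates, n_steps=2)
print(picked)
```

Only the cases with an observed target value enter the regressions, since the aim is to find covariates that predict the variable where it is known.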

Step 2: Create m complete data sets

• The mice procedure is repeated m times to create m copies of the data set, each containing different sets of imputed values.

• Since each data set is now complete, each can be analyzed using any of the usual statistical models that require complete data.

• m = 10 - 100 is currently suggested, depending on sample size and amounts of missing data.

Step 3: Analyzing the data and pooling the results: Rubin’s Rules

Separate analyses of the m multiply imputed samples generate m estimates of any statistic of interest. In the general case, for any statistic Q, an analysis of m data sets yields $\hat{Q}^{(j)}$ and $U^{(j)}$, the estimates of the statistic and its variance for the jth data set (j = 1, 2, …, m). The multiple imputation point estimate of each parameter Q is simply the mean of the m estimates:

$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}^{(j)}$$

Analyzing the data and pooling the results, cont….

To calculate the variance of this estimate, both the m estimated variances $U^{(j)}$ of each $\hat{Q}^{(j)}$ and the variance in the $\hat{Q}^{(j)}$ across the m estimations must be combined. First, the mean of the m estimated variances for each parameter is obtained as a simple average:

$$W = \frac{1}{m} \sum_{j=1}^{m} U^{(j)}$$

This quantity is known as the within-imputation variance.


Next, the variance in the m estimated values $\hat{Q}^{(j)}$ is calculated as:

$$B = \frac{1}{m - 1} \sum_{j=1}^{m} \left(\hat{Q}^{(j)} - \bar{Q}\right)^{2}$$

This quantity is known as the between-imputation variance. These two variances are then combined to get the total variance in the combined estimate of Q:

$$T = W + \frac{m + 1}{m} B$$


Rubin (1987: 79) shows that the following quantity is approximately distributed as a t-distribution:

$$\frac{Q - \bar{Q}}{\sqrt{T}} \sim t_{df}$$

where the degrees of freedom, df, is given by

$$df = (m - 1)\left(1 + \frac{mW}{(m + 1)B}\right)^{2}$$


Rubin’s pooling procedures can be done with any statistic generated by the statistical method employed to analyze the m imputed data sets.
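As a minimal sketch of the pooling step, the Python function below applies Rubin's rules to a list of m estimates and their variances. The function name `pool_rubin` and the five example values are made up for illustration; they are not DEF2 output.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance W
    b = variance(estimates)      # between-imputation variance B (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance T
    # Degrees of freedom; undefined if all m estimates are identical (B = 0).
    df = (m - 1) * (1 + m * w / ((m + 1) * b)) ** 2
    return q_bar, w, b, t, df

# Hypothetical coefficient estimates and squared standard errors from m = 5 fits.
q_hat = [0.30, 0.35, 0.40, 0.33, 0.37]
u = [0.010, 0.012, 0.011, 0.009, 0.013]
q_bar, w, b, t, df = pool_rubin(q_hat, u)
print(q_bar, t ** 0.5, df)
```

The pooled standard error is the square root of T, and a test or confidence interval for Q would use a t-distribution with df degrees of freedom.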

Galton’s Problem

Incorporating inter-societal networks into network autocorrelation effects regression models

What processes might be inducing non-independence?

Spatial Diffusion: societies in close proximity have more opportunity to emulate, conform to, adopt, borrow, etc. neighbors’ behaviors, beliefs, customs, rituals… (horizontal diffusion.)

Language similarity: Similarity due to populations splitting off from same ancestral population. (vertical diffusion.)

Religion: Marriage practices spread world-wide by the colonization of large swaths of the world by European Christian nations.

Equivalence: units “similarly situated” in a network and not necessarily proximate. E.g., economic similarity, core/periphery in world system, colonial status, ecological setting, …

Assessing non-independence: Tobler’s First Law of Geography

“Everything is related to everything else, but near things are more closely related than distant things.”

This “law” suggests that the scores on variable y for the ith society should be similar to the scores of those societies with which it has the closest relationships. Call these societies i’s “neighborhood set.”

If so, yi should be similar to the weighted average of the set of y scores for i’s neighborhood set, where the weights indicate relative closeness.

If the N scores on y are significantly correlated with the N weighted average scores, conclude that the y variable is auto-(self)-correlated.

Weighting sample units.

First, construct an N×N connectivity matrix C of pair-wise relatedness scores among sample units, and then row-normalize C to unity to get the required weights matrix W. That is, wij = cij ⁄ Σj cij.

(If a variable y is premultiplied by W, i.e. Wy, the product will be an Nx1 vector of weighted averages that are on the same scale as y.)

Raw Connectivity Matrix C    Weights Matrix W                     y    Wy

0 1 1 1 0 0 0                0    1/3  1/3  1/3  0    0    0      6    7
1 0 0 1 0 0 0                1/2  0    0    1/2  0    0    0      5    7
1 0 0 1 0 0 0                1/2  0    0    1/2  0    0    0      8    7
0 1 1 0 1 0 0                0    1/3  1/3  0    1/3  0    0      8    5.3
0 0 0 1 0 1 1                0    0    0    1/3  0    1/3  1/3    3    3.3
0 0 0 0 1 0 1                0    0    0    0    1/2  0    1/2    1    2
0 0 0 0 1 1 0                0    0    0    0    1/2  1/2  0      1    2
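The worked example above can be reproduced with a few lines of plain Python. The sketch row-normalizes the 7×7 connectivity matrix C, computes Wy, and then correlates y with Wy as an informal check for autocorrelation. The helper names are hypothetical, and the code assumes every unit has at least one neighbor so that no row of C sums to zero.

```python
C = [
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
y = [6, 5, 8, 8, 3, 1, 1]

def row_normalize(c):
    """Divide each row by its sum so that every row of W sums to 1."""
    return [[cij / sum(row) for cij in row] for row in c]

def lag(w, values):
    """Return Wy: for each unit, the weighted average of its neighbors' values."""
    return [sum(wij * vj for wij, vj in zip(row, values)) for row in w]

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

W = row_normalize(C)
Wy = lag(W, y)
r = pearson(y, Wy)
print([round(v, 1) for v in Wy], round(r, 2))
```

A high positive correlation between y and Wy is what Tobler's law would lead one to expect; a formal significance test would still be needed before concluding that y is autocorrelated.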

Incorporating autocorrelated variables into multiple regression

Most cross-cultural researchers are usually interested in testing whether hypothesized predictor variables are acting on a dependent variable, as well as what processes are inducing autocorrelation in it.

The Network Autocorrelation Regression Effects Models in DEF2 do just that.

The most commonly used network autocorrelation regression model is:

Network Autocorrelation Effects model:

y = α + ρWy + Xβ + ε

Where: W is a row-normalized N×N weighting matrix with wij > 0 if i and j are related, 0 otherwise, and wii = 0 for all i; ρ is the network autocorrelation coefficient; y is an N×1 vector; Wy is an N×1 vector where each element i is a weighted average of y values for i’s neighborhood set; X is an N×k matrix of exogenous variables; β is a k×1 vector of coefficients; ε is an N×1 vector of error terms.

Also called the Network “Lag” model, by analogy to time series, since W acts similarly to the lag operator in time series models, except that W lags the y variable in other kinds of social and physical “spaces.”

This is the model currently implemented in DEF2

Estimating the network autocorrelation effects regression model

y = α + ρWy + Xβ + ε

MLE: Maximum Likelihood Estimation. This is usually the method of choice.

But the log-likelihood function contains the term ln|A|, where A= (I – ρW). Since A is asymmetric and usually not sparse, finding the eigenvalues is computationally burdensome for large N. And, for more than two endogenous Wy variables, the likelihood function is intractable.

OLS: Ordinary Least Squares. Basic assumption of OLS is that all r.h.s. variables be independent of (uncorrelated with) the error term ε. If not, all coefficient estimates (ρ and β) are biased and inconsistent. Here, y is by definition a function of ε, so Wy is also a function of ε. That is, Cov(Wy, ε) ≠ 0. Wy is thus an endogenous regressor.

IV: Instrumental Variables (IV). Provides a way to obtain consistent parameter estimates for models with endogenous variables. 2SLS is an IV estimation procedure. Can deal with large samples and multiple endogenous variables. DEF2 uses IV estimation procedures.

An “intuitive” view of the IV regression approach

OLS model: y = α + ρWy + ε

[Path diagram with nodes Z, Wy, y, and ε]

Z is an instrument for Wy if

Cov(Z,ε) = 0 (Z is valid) and Cov(Z,Wy) ≠ 0 (Z is relevant).

So, need to find an additional variable(s) Z that is correlated with Wy but uncorrelated with ε to serve as an instrument for Wy.

An “intuitive” view of the 2SLS IV estimation procedure

Consider again the network effects model y = α + ρWy + Xβ + ε

Suppose we use WX, the lagged values of X, as an instrument for Wy.

Step 1. Using OLS, estimate Wy = a + WXc + υ

Save the predicted scores ŷw = â + WXĉ

Step 2. Again using OLS, estimate y = α + ρ ŷw + Xβ + ε

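(Note: the standard errors reported by the step 2 OLS regression are incorrect, because its residuals are computed from the predicted scores ŷw rather than from Wy itself. This is not an issue for the 1-step procedures used in all the usual software packages.)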
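The following Python/NumPy sketch runs 2SLS on simulated data. It is an illustration under stated assumptions, not DEF2's implementation: the ring-shaped W, the parameter values, and the function name `two_sls` are all hypothetical. Unlike the intuitive two-step description, the first stage here also includes X, as standard 2SLS does, and the standard errors are computed from residuals that use the observed Wy.

```python
import numpy as np

def two_sls(y, endog, exog, instruments):
    """2SLS for y = a + rho*endog + exog*b + e, instrumenting endog."""
    n = len(y)
    ones = np.ones((n, 1))
    R = np.column_stack([ones, endog, exog])        # regressors: 1, Wy, X
    Z = np.column_stack([ones, exog, instruments])  # instruments: 1, X, WX
    # First stage: project every regressor onto the instrument space.
    R_hat = Z @ np.linalg.lstsq(Z, R, rcond=None)[0]
    # Second stage, solved in one step.
    coef = np.linalg.solve(R_hat.T @ R, R_hat.T @ y)
    resid = y - R @ coef                            # uses observed Wy, not fitted Wy
    sigma2 = resid @ resid / (n - R.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(R_hat.T @ R_hat)))
    return coef, se, resid

# Simulated network (assumption): each unit is tied to 2 neighbors on each side of a ring.
rng = np.random.default_rng(1)
n, alpha, rho, beta = 500, 1.0, 0.5, 1.0
idx = np.arange(n)
C = np.zeros((n, n))
for step in (1, 2):
    C[idx, (idx + step) % n] = 1
    C[idx, (idx - step) % n] = 1
W = C / C.sum(axis=1, keepdims=True)

x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
# Reduced form: y = (I - rho*W)^(-1) (alpha + x*beta + eps)
y = np.linalg.solve(np.eye(n) - rho * W, alpha + beta * x + eps)

coef, se, resid = two_sls(y, W @ y, x, W @ x)
alpha_hat, rho_hat, beta_hat = coef
print(rho_hat, beta_hat, se)
```

Solving the second stage in one step, with residuals based on the observed Wy, is what avoids the incorrect standard errors of a literal two-regression approach.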

2SLS Estimation of the network autocorrelation effects regression model with IVs: general case

y = α + Xβ + ε

Where to get appropriate instruments?

Usually, it’s hard to find additional variables that meet the conditions required. Variables that affect the endogenous variable(s) are often also likely to affect the dependent variable.

Kelejian and Prucha (1998) show that the set of {WX, W²X, W³X, …} variables are optimal as instruments for Wy, where W², W³, … are the 2-step and 3-step connections between sample units. In practice, the WX variables or some subset of them will usually be sufficient.
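Building these instruments amounts to repeated multiplication by W. A short NumPy sketch, with a hypothetical helper name and a made-up three-unit example:

```python
import numpy as np

def lagged_instruments(W, X, order=2):
    """Stack WX, W^2 X, ..., W^order X column-wise as candidate instruments."""
    blocks, lagged = [], X
    for _ in range(order):
        lagged = W @ lagged   # one more step through the network
        blocks.append(lagged)
    return np.column_stack(blocks)

# Tiny example (assumption): unit 0 is tied to units 1 and 2, which are tied only to unit 0.
W = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
X = np.array([[1.0], [2.0], [3.0]])
Z = lagged_instruments(W, X, order=2)
print(Z)
```

The first column is WX and the second is W²X; with several X variables each lag contributes one column per variable.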

Evaluating the quality of the instrumental variables

Quality of 2SLS estimators depends on the quality of the IVs. Require that Cov(Z,ε) = 0: IVs must be valid. IV estimation is vulnerable on this point. Tests are available only if there are more instruments than endogenous variables (overidentification.)

IVs also need to be relevant, i.e., they should predict endogenous variables independently of other exogenous variables. Shea (1997) proposed a partial R² measure of instrument relevance for multiple endogenous variable models.

Marginal association between the endogenous variable(s) and Z is known as the “weak” instruments problem. Some diagnostics are available.

No perfect collinearity among the exogenous variables.

Overidentification tests

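If there are more instrumental variables than endogenous variables (e.g. more than 1 instrument available for Wy), can test the null hypothesis that the instruments are uncorrelated with the errors, against the alternative that at least one of them is correlated with the errors.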

Sargan (1958) is the best known test:

Ts = N·R²u ~ χ²

(with df = #IVs - #endogenous variables)

where R²u is the R² of the OLS regression of the 2SLS residuals on the IVs.
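A minimal NumPy sketch of the Sargan statistic follows. It assumes the instrument matrix Z contains a constant, the exogenous regressors, and the excluded instruments, so that the degrees of freedom equal the number of columns of Z minus the number of second-stage regressors. The function name and simulated inputs are hypothetical, and the p-value step is left out (a chi-square survival function such as `scipy.stats.chi2.sf` could supply it).

```python
import numpy as np

def sargan_statistic(resid, Z, n_regressors):
    """Return (N * R^2, df) from regressing 2SLS residuals on all instruments Z."""
    n = len(resid)
    fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
    rss = np.sum((resid - fitted) ** 2)
    tss = np.sum((resid - resid.mean()) ** 2)
    r2 = 1.0 - rss / tss
    df = Z.shape[1] - n_regressors  # excluded instruments minus endogenous variables
    return n * r2, df

# Simulated inputs (assumption): constant, one X, and two excluded instruments.
rng = np.random.default_rng(2)
n = 100
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
noise = rng.normal(size=n)
# Residuals orthogonal to Z mimic valid instruments.
resid_ok = noise - Z @ np.linalg.lstsq(Z, noise, rcond=None)[0]
# Residuals that load on the last instrument mimic an invalid instrument.
resid_bad = resid_ok + 0.5 * Z[:, 3]

stat_ok, df = sargan_statistic(resid_ok, Z, n_regressors=3)
stat_bad, _ = sargan_statistic(resid_bad, Z, n_regressors=3)
print(stat_ok, stat_bad, df)
```

A large statistic relative to the χ² distribution with df degrees of freedom is evidence against the validity of at least one instrument.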

Basmann (1960) provides an alternate, though similar, test.

Kirby and Bollen (2009) discuss additional variants of Sargan and Basmann in the context of SEM.

“Weak” Instruments

Bound et al. (1995) show that when the instruments are only weakly correlated with the endogenous variables, IV estimates are biased in the same direction as OLS estimates, and may be more biased than OLS. In addition, weak IV regression estimates may not be consistent.

Staiger and Stock (1997) suggest that the partial F-statistic from the increase in the regression R² after adding the auxiliary instruments to the exogenous variables in the first stage regression should be greater than 10.

Stock and Yogo (2005) provide tables that give some guidance as to how much greater than 10 the F-statistic may have to be.
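The first-stage partial F-statistic can be sketched in NumPy by comparing the first-stage regression with and without the excluded instruments. The function name and the simulated strong and weak cases below are hypothetical illustrations.

```python
import numpy as np

def first_stage_f(endog, exog, instruments):
    """Partial F for adding the instruments to a first stage that already has exog."""
    n = len(endog)
    restricted = np.column_stack([np.ones(n), exog])
    full = np.column_stack([restricted, instruments])

    def rss(design):
        fitted = design @ np.linalg.lstsq(design, endog, rcond=None)[0]
        return np.sum((endog - fitted) ** 2)

    q = full.shape[1] - restricted.shape[1]  # number of excluded instruments
    return ((rss(restricted) - rss(full)) / q) / (rss(full) / (n - full.shape[1]))

# Simulated first stages (assumption): two candidate instruments, one exogenous x.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=(n, 2))
wy_strong = 1 + x + 0.8 * z[:, 0] + 0.8 * z[:, 1] + rng.normal(size=n)
wy_weak = 1 + x + rng.normal(size=n)  # instruments are irrelevant here

f_strong = first_stage_f(wy_strong, x, z)
f_weak = first_stage_f(wy_weak, x, z)
print(f_strong, f_weak)
```

In the strong case the statistic lies far above the threshold of 10, while in the weak case it stays near 1.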

Example: Monogamy in the Pre-industrial World

Multiple proposed determinants of the long-term historical shift in marriage preference from polygynous to monogamous unions are tested using data from the Standard Cross-Cultural Sample.

Determinants of Monogamy (adapted from Dow and Eff 2013)

Columns: Theoretical perspective; Primary sources; Determinants (expected sign)

Agent-level perspectives

Males provide essential resources
Primary sources: Orians 1969; Borgerhoff-Mulder et al 1990; Marlowe 2000; Low 2003; Alexander et al 1979
Determinants: male resource inequality (-), female economic contribution (-), beneficial natural environment (-)

Female intra-sexual aggression
Primary sources: Gowaty 1996; Reichard 2003
Determinants: endemic violence (-)

Male intra-sexual aggression
Primary sources: Emlen & Oring 1977; Hawkes et al 1995; Marlowe 2000; Borgerhoff-Mulder 1990; Quinlan and Quinlan 2007; van Schaik and Dunbar 1990; Wrangham et al 1999; Ember & Ember 1992
Determinants: endemic violence (-), social control (-)

Extrinsic risk
Primary sources: Quinlan and Quinlan 2007; Del Giudice 2009; Low 1988, 1990, 2003, 2007
Determinants: pathogen stress (-)

Group-level processes

Collective action in small-scale societies
Primary sources: Olson 1971; Alexander et al 1979; Price 1999; Betzig 1986
Determinants: (inverse of) societal scale (-)

Socially Imposed Monogamy (SIM)
Primary sources: Alexander et al 1979; Betzig 1986
Determinants: societal scale (+)

Cultural trait transmission
Primary sources: Divale and Seda 2001; Dow and Eff 2009; Herlihy 2005; Price 1999
Determinants: distance (+), language (+), modernization (+)

W matrices employed

Geographical Distance: the WD matrix is described in Dow and Eff (2009), where cij = (1/dij)². Use only the nearest 20 societies.

Language similarity: the WL matrix is described in Eff (2008), where cij = e-score(ij)

If the Ws are collinear, can combine them into a single matrix:

WDL = πDWD + πLWL where 0 ≤ πD, πL ≤ 1 and πD + πL =1

Then, estimate the model for all combinations of the weights in WDL and select as “best” the matrix that maximizes R²iv.

Also obtain information on the weights that yield the “best” combined W.
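The grid search over composite matrices can be sketched as follows in NumPy. The `score` argument is a hypothetical stand-in for whatever fit criterion is used (the slides name R²iv from the IV regression); the toy matrices and the toy scoring function exist only to make the example runnable.

```python
import numpy as np

def best_composite(WD, WL, score, n_grid=101):
    """Search pi_D on a grid in [0, 1] and return the best weight and composite matrix."""
    best = None
    for pi_d in np.linspace(0.0, 1.0, n_grid):
        W = pi_d * WD + (1.0 - pi_d) * WL  # pi_L = 1 - pi_D
        s = score(W)
        if best is None or s > best[0]:
            best = (s, pi_d, W)
    return best[1], best[2]

# Toy row-normalized matrices for three societies (assumption).
WD = np.array([[0.0, 1.0, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 1.0, 0.0]])
WL = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [0.5, 0.5, 0.0]])

# Toy criterion (assumption): prefer composites whose (0, 1) weight is close to 0.6.
def toy_score(W):
    return -abs(W[0, 1] - 0.6)

pi_d, W_best = best_composite(WD, WL, toy_score)
print(pi_d, W_best.sum(axis=1))
```

Because both inputs are row-normalized and the weights sum to 1, every composite matrix is row-normalized as well.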

2SLS estimation of network autocorrelation regression model using composite distance/language W matrix. Dependent variable is a Box-Cox transform of the percentage of married females in monogamous marriages [(monofem^λ – 1)/λ]

Unrestricted model

Variable    Description                            Std coef   p-value   VIF
Wy          network lag term                        0.354     0.000     1.463
modern      modernization                           0.166     0.026     1.113
pathstr     pathogen stress                        -0.253     0.006     1.721
violence    intra-societal violence                -0.165     0.034     1.155
environ     beneficent environment                  0.217     0.007     1.330
femecon     female economic contribution           -0.216     0.002     1.097
techlev     technological level                     0.141     0.063     1.606
socont      social control over sexual relations    0.093     0.286     1.232
resineq     resource inequality                     0.054     0.631     2.040
socscale    scale of society                       -0.008     0.929     2.074

R² = 0.466

Restricted model

Variable    Description                            Std coef   p-value   partitioned R²
Wy          network lag term                        0.361     0.000     0.178
modern      modernization                           0.173     0.020     0.038
pathstr     pathogen stress                        -0.257     0.003     0.099
violence    intra-societal violence                -0.155     0.044     0.043
femecon     female economic contribution           -0.208     0.003     0.052
environ     beneficent environment                  0.188     0.011     0.018
techlev     technological level                     0.179     0.008     0.026

R² = 0.453

Restricted model: F-statistics and p-values on diagnostic tests

Test                     Null hypothesis                       F-stat    p-value
Hausman                  H0: Wy exogenous                       4.242     0.040
Ramsey RESET             H0: model correct functional form      0.269     0.604
Breusch-Pagan            H0: residuals homoskedastic            2.678     0.102
Wald-restrictions        H0: dropped variables have coef=0      0.494     0.483
Shapiro-Wilk             H0: residuals normally distributed     1.826     0.177
LM error (geographic)    H0: residuals not autocorrelated       2.353     0.125
LM error (language)      H0: residuals not autocorrelated       0.043     0.836
LM error (ecological)    H0: residuals not autocorrelated       0.690     0.406
Sargan Test              H0: residuals uncorrelated with IVs    0.459     0.498
Staiger-Stock weak IVs   F = 15.00

Notes: Dependent variable is (monofem^λ – 1)/λ (Box-Cox transformation), where λ = 4.157. Coefficient p-values from bootstrap standard errors (1,000 replications). All estimations from multiply imputed (m=15) data; only observations non-missing for the dependent variable (N=143) are used in the m regressions. Composite matrix weights: distance=0.78, language=0.22.

Summary:

• DEF2 is a new statistical package designed for cross-cultural and cross-national data sets.

• Given the ubiquity of missing data in such data sets, DEF2 includes a suite of programs for multiple imputation of missing data

• Given that sample units in comparative data sets are non-independent due to various processes of cultural trait diffusion, DEF2 includes a suite of programs to implement network autocorrelation effects models.

• Available ??? Where and How, Anthon and Doug.
