Screening the Data: Tedious but essential!

Upload: domenic-harper

Post on 24-Dec-2015


TRANSCRIPT

Page 1: Screening the Data Tedious but essential!. Missing Data Missing Not at Random (MNAR) Missing at Random (MAR) Missing Completely at Random (MCAR)

Screening the Data

Tedious but essential!


Missing Data

• Missing Not at Random (MNAR)
• Missing at Random (MAR)
• Missing Completely at Random (MCAR)


Missing Not at Random (MNAR)

• There are missing cases on Y.
• Missingness is related to the value of Y.
• Faculty salaries: those with high salaries may be reluctant to reveal them.
• Estimates of the mean of Y will be biased if we use just the available data.


Missing at Random (MAR)

• Missingness on Y is not related to the value of Y.
• Or it is related, but only through other variables on which we have data.
• Faculty salary is related to rank.
• Higher rank = higher salary.
• If missingness is random within each rank, within-rank estimates will be unbiased.
• Overall mean = weighted sum of within-rank estimates (weights = proportion of cases at each rank).
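
The weighted-sum idea can be sketched in a few lines of Python. This is only an illustration: the ranks, salaries, and the helper name `weighted_mean_by_rank` are made up, and the sketch assumes missingness really is random within each rank.

```python
from statistics import mean

# Hypothetical data: (rank, salary in $1000s); None marks a missing salary.
faculty = [
    ("assistant", 60), ("assistant", 64), ("assistant", None), ("assistant", 62),
    ("associate", 80), ("associate", None), ("associate", 84),
    ("full", 110), ("full", None), ("full", None), ("full", 114),
]

def weighted_mean_by_rank(rows):
    """Average the within-rank means, weighting each by the rank's share of all cases."""
    ranks = sorted({rank for rank, _ in rows})
    total = len(rows)
    estimate = 0.0
    for rank in ranks:
        salaries = [s for r, s in rows if r == rank and s is not None]
        share = sum(1 for r, _ in rows if r == rank) / total  # counts missing cases too
        estimate += share * mean(salaries)
    return estimate

available = [s for _, s in faculty if s is not None]
naive = mean(available)                    # ignores rank
adjusted = weighted_mean_by_rank(faculty)  # weights within-rank means by rank size
print(round(naive, 2), round(adjusted, 2))
```

In this made-up sample the full professors are the most likely to have missing salaries, so the naive mean of the available data comes out lower than the rank-weighted estimate.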


Missing Completely at Random (MCAR)

• There is no variable, observed or not, that is related to missingness of Y.

• Ideal, not likely ever absolutely true.


Finding Patterns of Missingness

• There is specialized software. You do not have it.

• Can use SAS.
• Can use SPSS with home license code.
• Create a missingness dummy variable.
• 0 = not missing, 1 = missing
• Relate missingness to other variables.
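
As a language-neutral illustration of the dummy-variable approach, here is a minimal Python sketch. The records and the variable names `x`, `y`, and `y_miss` are hypothetical; comparing group means is just one simple way to relate missingness to another variable.

```python
from statistics import mean

# Hypothetical records: y may be missing (None); x is fully observed.
records = [
    {"id": 1, "x": 2, "y": 10},
    {"id": 2, "x": 5, "y": None},
    {"id": 3, "x": 3, "y": 12},
    {"id": 4, "x": 6, "y": None},
    {"id": 5, "x": 1, "y": 9},
]

# Missingness dummy: 0 = not missing, 1 = missing.
for rec in records:
    rec["y_miss"] = 1 if rec["y"] is None else 0

# Relate missingness to another variable by comparing group means of x.
x_when_missing = mean(r["x"] for r in records if r["y_miss"] == 1)
x_when_present = mean(r["x"] for r in records if r["y_miss"] == 0)
print(x_when_missing, x_when_present)
```

A clear difference between the two means suggests that missingness on y depends on x, which argues against MCAR.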


Dealing with MCAR Data

• Delete Cases: Will create no bias, but will lower power and precision.

• Mean Substitution: For each missing value, substitute the group mean on that variable. No bias for means, but will reduce standard deviations.
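
A small Python sketch shows why mean substitution leaves the mean alone but shrinks the standard deviation. The scores are hypothetical, and for brevity the sketch treats all cases as a single group.

```python
from statistics import mean, stdev

scores = [4, 8, None, 6, None, 10, 2]    # hypothetical; None = missing

observed = [s for s in scores if s is not None]
fill = mean(observed)                    # the (single-group) mean
imputed = [fill if s is None else s for s in scores]

print(mean(observed), mean(imputed))     # the mean is unchanged
print(stdev(observed), stdev(imputed))   # the standard deviation shrinks
```

The imputed values add nothing to the sum of squared deviations while the number of cases grows, so the standard deviation has to drop.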


Dealing with MCAR Data

• Regression: For each missing score, develop a multiple regression to predict score from other variables. Impute that predicted score. Regression towards mean will reduce variability.


Dealing with MAR Data

• Deletion of Variables: If another variable can serve as a proxy.

• Multiple Imputation – specialized software, may eliminate bias
– Involves resampling techniques to generate several sets of predictions of missing scores
– Analyze each set and then average the results across sets.


Dealing with MNAR Data

• Sophisticated methods may reduce, but not eliminate, bias.

• Pairwise Correlation Matrix – use as input to multivariate procedures. Different correlations will be based on different subsets of the data. Can produce very strange results, not recommended.


Missing Item Data Within Unidimensional Scale

• Assume each item measures the same construct.

• For each subject, compute the mean of the items which do have data.

• Set to missing the scale scores for subjects who have answered fewer than a threshold number of items.
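
The scoring rule above can be sketched as follows. The function name `scale_score`, the five-item scale, and the threshold of four answered items are all assumptions made for illustration.

```python
def scale_score(items, min_answered):
    """Mean of the answered items, or None if too few items were answered."""
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return None                      # scale score is set to missing
    return sum(answered) / len(answered)

# Hypothetical 5-item scale; require at least 4 answered items.
print(scale_score([3, 4, None, 5, 4], min_answered=4))
print(scale_score([3, None, None, 5, None], min_answered=4))
```

Using the mean rather than the sum keeps scores on the same metric no matter how many items a subject answered.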


Identifying Outliers

• Univariate: Box-and-whisker plots
• Multivariate: Compute Mahalanobis Distance or Leverage. Investigate cases with high values. Use outlier dummy variable to compare outliers with inliers.
• Regression Diagnostics:
o Leverage: Cases with unusual values on the predictor variables


Outliers

o Standardized Residuals: Cases whose actual Y is far from predicted Y.

o Cook’s D: Cases with values that make them have great influence on the regression solution.


Dealing with Outliers

• Investigate: May be bad data. May be able to correct the data, may not. May represent cases not properly considered part of the population of interest.

• Out-of-Range Values: Even if not outliers, these are bad data that need correction.


Dealing with Outliers

• Set to Missing: If all else fails.
• Delete the Case: For example, if convinced the respondent was not even reading the questions.
– “I frequently visit planets outside of our solar system.”
– “I make all of my own clothes.”
• Delete the Variable: Last resort when it has many cases with missing data.


Dealing with Outliers

• Transform the Variable: If outliers are valid but contributing to skewness.

• Change the Score: For example, reduce a very high score to a value slightly higher than the highest remaining score. See Howell’s discussion of “Winsorizing.”
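
A minimal Python sketch of the "change the score" rule follows. The function name, the cutoff of 30, and the step of 1 are hypothetical choices. This follows the slide's rule; classical Winsorizing is usually described as replacing extreme scores with the nearest remaining score itself, often in both tails.

```python
def pull_in_high_outliers(scores, cutoff, step=1):
    """Replace scores above `cutoff` with (highest remaining score + step)."""
    kept = [s for s in scores if s <= cutoff]
    ceiling = max(kept) + step           # a small bit above the highest remaining score
    return [s if s <= cutoff else ceiling for s in scores]

ages = [18, 19, 19, 20, 21, 22, 47]      # hypothetical data
print(pull_in_high_outliers(ages, cutoff=30))
```

The changed case still has the highest score, so its rank order is preserved while its pull on the mean and on the skewness is reduced.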


Assumptions of the Analysis

• Check Outliers First: Dealing with outliers may resolve the problems below.

• Normality: Look at plots and measures of skewness and kurtosis. Ignore tests of significance, like Kolmogorov-Smirnov. May need to use a different analysis.

• Homogeneity of Variance: Does the variance differ considerably across groups? May need to transform or use different analysis.
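
For the normality check, the skewness and kurtosis measures can be computed directly from the moments. The sketch below uses the simple moment-based formulas; statistical packages such as SAS typically report sample-adjusted versions, so their values will differ somewhat in small samples. The data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (g1) and excess kurtosis (g2)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 12]       # hypothetical data with a long right tail
print(skewness_and_kurtosis(symmetric))
print(skewness_and_kurtosis(right_skewed))
```

A symmetric distribution gives skewness 0; a long right tail gives positive skewness, and a normal distribution has excess kurtosis 0.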


Assumptions of the Analysis

• Homoscedasticity: Carefully inspect the residuals. May need to transform data or use a different analysis.

• Homogeneity of Variance/Covariance Matrices (across groups): Box’s M.

• Sphericity: For univariate-approach related samples ANOVA. Check with Mauchly’s Test. Correct the df or use a multivariate approach instead.


Assumptions of the Analysis

• Homogeneity of Regression: In ANCOVA, we assume the relationship between Y and the predictors is constant across groups. Test the Groups x Predictor(s) interactions.

• Linear Relationships: Look at plots. If necessary, transform variables or use curvilinear techniques.


Multicollinearity

• One predictor is nearly perfectly correlated with (a linear combination of) the other predictors.

• Makes the regression coefficients unstable across random samples from the same population.

• Complicates the interpretation of unique effects.


Detecting Multicollinearity

• For each predictor, compute the R² between it and the other predictors. If very high (.9 or more), there is a problem.

• SAS will compute tolerance = (1 - that R²). If very low, there is a problem.

• If R² = 1, the correlation matrix is singular and cannot be inverted, so the analysis crashes.
– Predictors = Verbal SAT, Math SAT, Total SAT.


Variance Inflation Factor

• VIF = 1/tolerance. If high, there is a problem.

• How High?
• Some say 10, some say 5, a few say 2.5.
• If R² = .9, tolerance = .1, VIF = 10.
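
The arithmetic linking R², tolerance, and VIF can be checked with a short Python sketch. The function names are illustrative only. It shows that the three suggested cutoffs (2.5, 5, 10) correspond to a predictor's R² with the other predictors of .6, .8, and .9.

```python
def tolerance(r_squared):
    """Tolerance = 1 - R², where R² comes from predicting one predictor from the others."""
    return 1.0 - r_squared

def vif(r_squared):
    """Variance inflation factor = 1 / tolerance."""
    tol = tolerance(r_squared)
    if tol == 0:
        raise ZeroDivisionError("R² = 1: the predictors are perfectly collinear")
    return 1.0 / tol

for r2 in (0.6, 0.8, 0.9):
    print(r2, round(tolerance(r2), 2), round(vif(r2), 2))
```

When R² = 1 (as with Verbal SAT, Math SAT, and Total SAT together), tolerance is 0 and the VIF is undefined, which the sketch reports as an error.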


Dealing with Multicollinearity

• Drop a Predictor – may resolve the problem.

• Combine Predictors – into a composite variable

• Principal Components Analysis – conduct the analysis on the resulting weighted linear combinations of the variables. Can then transform the results back to the original variables.


SAS 1

• Look at the command lines in the SAS program.
• Always give every case a unique ID number, so you can locate it later.
• Label variables if their SAS name is not informative.
• input ID 1-3 @5 (Q1-Q138) (1.);
  label Q1='Sex' Q3='Age';


SAS 2

• Recode values that represent missing data.

• On several variables, such as “number of biological brothers,” response 5 was “do not know.”

• if Q15 = 5 then Q15 = . ; if Q16 = 5 then q16 = . ;


SAS 3 & 4

• Transform variable to reduce positive skewness

• age_sr = sqrt(Q3); age_log = log10(Q3); age_inv = -1/(Q3);

• Dichotomize variable – transformation of last resort.

• if q3 = 1 then age_di = 1; else if q3 > 1 then age_di = 2;


SAS 5 & 6

• Create composite variable
• SIBS = Q15 + Q16;
• Transform to reduce positive skewness
• sibs_sr = sqrt(sibs); sibs_log = log10(sibs); sibs_in = -1/sibs;


SAS 7

• Create mental variable and associated missingness variable.

• MENTAL = Q62 + Q65 + Q67;
  MentalMiss = 0;
  if Mental = . then MentalMiss = 1;


SAS 8

• Transform to reduce negative skewness
• Mental2 = Mental*Mental; Mental3 = Mental**3; Ment_exp = EXP(Mental);
  R_Ment = 13 - Mental; R_Ment_sr = sqrt(R_Ment); R_Ment_log = log10(R_Ment);


SAS 9

• Dichotomize Mental
• if 0 LE Mental LE 9 then Ment_di=1;
  else if Mental > 9 then Ment_di=2;

• Be careful: in comparisons, SAS treats a missing value as smaller than any number, which is why the lower bound (0 LE Mental) is needed.


SAS 10

• Check for missing data and out-of-range values.

• proc means min max n nmiss;var q1-q10 q50-q70; run;


SAS 11

• Check for skewness & kurtosis
• proc means min max n nmiss skewness kurtosis;
  var Q3 age_sr -- Mental Mental2 -- R_Ment_log; run;


SAS 12

• Check distributions of variables with few values

• proc freq;tables q3 age_di sibs mental ment_di; run;


SAS 13

• Locate cases with bad data
• data duh; set delmel; if q9 > 3;
  proc print; var q9; id id; run;

• Case 159 has an out-of-range value on item Q9.


SAS 14

• Check correlates of missingness.
• proc corr nosimple data=delmel; var MentalMiss; with Q1 Q3 Q5 Q6 sibs; run;

• MentalMiss negatively correlated with sibs.
• Duh, some subjects have missing data on number of brothers or number of sisters.
• Instead of Mental = Q62+Q65+Q67, use Mental = Mean(of Q62 Q65 Q67);


Multidimensional Outliers

• Investigate observations with leverage greater than 2p/n, “where n is the number of observations used to fit the model, and p is the number of parameters in the model.”

• 4 variables: Q1 Q3 Q6 mental + intercept
• 193 observations
• 2*5/193 = .052
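
The 2p/n cutoff is easy to verify, and for the special case of a single predictor the leverage values themselves have a closed form, 1/n + (x - mean)²/Sxx. The Python sketch below uses that one-predictor formula only as an illustration; the ages are hypothetical, and the four-predictor case on the slide needs the full hat matrix, which SAS computes.

```python
def leverage_cutoff(p, n):
    """Rule-of-thumb cutoff 2p/n (p parameters including the intercept, n observations)."""
    return 2 * p / n

def simple_regression_leverage(x):
    """Hat values for a one-predictor regression: 1/n + (x - mean)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((v - x_bar) ** 2 for v in x)
    return [1 / n + (v - x_bar) ** 2 / sxx for v in x]

print(round(leverage_cutoff(5, 193), 3))          # the slide's case: 4 predictors + intercept

ages = [18, 19, 19, 20, 20, 21, 21, 22, 23, 45]   # hypothetical predictor values
hats = simple_regression_leverage(ages)
cutoff = leverage_cutoff(2, len(ages))            # slope + intercept
flagged = [a for a, h in zip(ages, hats) if h > cutoff]
print(flagged)
```

The leverages always sum to p, so 2p/n is twice the average leverage; only the one unusually old case exceeds it in this made-up sample.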


SAS 15

• Identify multivariate outliers
• proc reg data=delmel; model id = Q1 Q3 Q6 mental;
  output out=hat H=Leverage; run;
  data outliers; set hat; if leverage > .052;


SAS 15

• Identify multivariate outliers
• proc print; var id Q1 Q3 Q6 mental leverage; run;
  proc means mean; var Q1 Q3 Q6 mental; run;

• As a group, the outliers are older than the overall sample.

• All three students aged 25 or older are included among the outliers.


Survey Scoundrels

• These sloths do not even read the questions, they just answer randomly to get whatever incentive is available for completing the survey.

• My daughter’s shock upon discovering this.

• Monitor how long it takes respondents to complete the survey.


Items to Help Detect Scoundrels

• Repeat same item, compare responses.
• “I frequently visit with aliens from other planets.”
• “I make all of my own clothes.”
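
These checks, together with monitoring completion time, could be combined into a simple screening rule. The Python sketch below is purely illustrative: the field names, the 120-second minimum, the allowed gap of 1 between repeated items, and the 1-5 agreement scale are all assumptions, not values from the slides.

```python
def flag_scoundrel(resp, min_seconds=120, max_repeat_gap=1):
    """Return the reasons (possibly none) for doubting one respondent's answers."""
    reasons = []
    if resp["seconds"] < min_seconds:                 # finished implausibly fast
        reasons.append("too fast")
    if abs(resp["item"] - resp["item_repeat"]) > max_repeat_gap:
        reasons.append("inconsistent repeat")         # same item, different answers
    if resp["visits_aliens"] >= 4:                    # 4-5 = agree on a 1-5 scale
        reasons.append("implausible answer")
    return reasons

# Hypothetical respondents.
respondents = [
    {"id": 1, "seconds": 600, "item": 4, "item_repeat": 4, "visits_aliens": 1},
    {"id": 2, "seconds": 45, "item": 1, "item_repeat": 5, "visits_aliens": 5},
]
for r in respondents:
    print(r["id"], flag_scoundrel(r))
```

Flagged cases are candidates for investigation, not automatic deletion.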
