analyzing missing data

Upload: melvin-a-vidar

Post on 03-Jun-2018

227 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 Analyzing Missing Data

    1/49

    SW388R7Data Analysis &

    Computers II

    Slide 1

    Analyzing Missing Data

    Introduction

    ro!lems

    "sing Scripts

  • 8/11/2019 Analyzing Missing Data

    2/49

    ters II

    Slide 2Missing data and data analysis

    Missing data is a pro!lem in multi#ariate data !ecausea case $ill !e e%cluded rom t'e analysis i it ismissing data or any #aria!le included in t'e analysis(

    I our sample is large) $e may !e a!le to allo$ casesto !e e%cluded(

    I our sample is small) $e $ill try to use a su!stitutionmet'od so t'at $e can retain enoug' cases to 'a#e

    su icient po$er to detect e ects(In eit'er case) $e need to ma*e certain t'at $eunderstand t'e potential impact t'at missing datamay 'a#e on our analysis(

  • 8/11/2019 Analyzing Missing Data

    3/49

    ters II

    Slide 3+ools or e#aluating missing data

    S SS 'as a speci ic pac*age or e#aluating missingdata) !ut it is included under t'e "+ license(

    In place o t'is pac*age) $e $ill irst e%aminemissing data using S SS statistics and procedures(

    A ter studying t'e standard S SS procedures t'at $e

    can use to e%amine missing data) $e $ill use an S SSscript t'at $ill produce t'e output needed ormissing data analysis $it'out re,uiring us to issue allo t'e S SS commands indi#idually(

  • 8/11/2019 Analyzing Missing Data

    4/49

    ters II

    Slide 4-ey issues in missing data analysis

    We $ill ocus on t'ree *ey issues or e#aluatingmissing data.

    +'e num!er o cases missing per #aria!le

    +'e num!er o #aria!les missing per case+'e pattern o correlations among #aria!lescreated to represent missing and #alid data(

    /urt'er analysis may !e re,uired depending on t'epro!lems identi ied in t'ese analyses(

  • 8/11/2019 Analyzing Missing Data

    5/49

    ters II

    Slide 5ro!lem 1

    1( 0ased on a missing data analysis or t'e #aria!lesemployment status) num!er o 'ours $or*ed in t'e past $ee*)sel employment) go#ernmental employment) andoccupational prestige score in t'e dataset 2SS 444(sa#) is t'eollo$ing statement true) alse) or an incorrect application o astatistic5

    +'e #aria!les num!er o 'ours $or*ed in t'e past $ee* andemployment status are missing data or more t'an 'al o t'ecases in t'e data set and s'ould !e e%amined care ully !e oredeciding 'o$ to 'andle missing data(

    1( +rue ( +rue $it' caution 3( /alse

    6( Incorrect application o a statistic

  • 8/11/2019 Analyzing Missing Data

    6/49

    ters II

    Slide 6Identi ying t'e num!er o cases in t'e data set

    This problem wants to know if a variable ismissing data for more than half the cases.

    Our first task is to identify the number ofcases that meets that criterion.

    If we scroll to the bottom of the data set,we see than there are 270 cases in the dataset.

    270 2 ! "#$.

    If any variable included in the analysis hasmore than "#$ missing cases, the answer tothe problem will be true.

  • 8/11/2019 Analyzing Missing Data

    7/49

    ters II

    Slide 7Re,uest re,uency distri!utions

    %e will use the output forfre&uency distributions tofind the number of missingcases for each variable.

    'elect the Frequencies ( )Descriptive Statistics command from the Analyze menu.

  • 8/11/2019 Analyzing Missing Data

    8/49

    ters II

    Slide 8Completing t'e speci ication or re,uencies

    Second , click on the O*button to complete there&uest for statisticaloutput.

    First , move the fivevariables included in theproblem statement tothe list bo+ for variables.

  • 8/11/2019 Analyzing Missing Data

    9/49

    ters II

    Slide 9um!er o missing cases or eac' #aria!le

    In the table of statistics atthe top of the re&uenciesoutput, there is a tabledetailing the number ofmissing cases for eachvariable in the analysis.

    -one of the variables has more than "#$ missing cases, althoughnumber of hours worked in the past week comes close.

    The answer to the &uestion is false .

  • 8/11/2019 Analyzing Missing Data

    10/49

    Slide!

    ro!lem

    ( 0ased on a missing data analysis or t'e #aria!lesemployment status) num!er o 'ours $or*ed in t'e past$ee*) sel employment) go#ernmental employment) andoccupational prestige score in t'e dataset 2SS 444(sa#) is t'eollo$ing statement true) alse) or an incorrect application o a

    statistic5

    16 cases are missing data or more t'an 'al o t'e #aria!les int'e analysis and s'ould !e e%amined care ully !e ore deciding'o$ to 'andle missing data(

    1( +rue ( +rue $it' caution 3( /alse

    6( Incorrect application o a statistic

  • 8/11/2019 Analyzing Missing Data

    11/49

    Slide Create a #aria!le t'at counts missing data

    %e want to know howmany of the five variablesin the analysis hadmissing data for each

    case in the data set.

    %e will create a variablecontaining thisinformation that uses an' '' function to countthe number of variableswith missing data.

    To compute a newvariable, select theCompute (command from theTransform menu.

  • 8/11/2019 Analyzing Missing Data

    12/49

    Slide2

    nter speci ications or ne$ #aria!le

    Third , click on the

    up arrow button tomove the NMISS function into the-umeric /+pressionte+t bo+.

    First , type in the name forthe new variable nmiss inthe Target variable te+t bo+.

    Second , scroll down the listof functions and highlightthe NMISS function.

  • 8/11/2019 Analyzing Missing Data

    13/49

    Slide3

    nter speci ications or ne$ #aria!le

    The NMISS function ismoved into the -umeric/+pression te+t bo+.

    Second , click on theright arrow button tomove the variablename into the functionarguments.

    To add the list ofvariables to count

    missing data for,we first highlightthe first variable toinclude in thefunction, wrkstat .

  • 8/11/2019 Analyzing Missing Data

    14/49

    Slide4

    nter speci ications or ne$ #aria!le

    First , before we add anothervariable to the function, wetype a comma to separate thenames of the variables.

    Third , click on theright arrow button tomove the variablename into the functionarguments.

    Second , to addthe ne+t variablewe highlight thesecond variable toinclude in thefunction, hrs1 .

  • 8/11/2019 Analyzing Missing Data

    15/49

    Slide5

    Complete speci ications or ne$ #aria!le

    ontinue adding variables to

    function until all of thevariables specified in theproblem have been added.

    1e sure to type a commabetween the variable names.

    %hen all of the variables havebeen added to the function,click on the O button tocomplete the specifications.

  • 8/11/2019 Analyzing Missing Data

    16/49

    Slide6

    +'e nmiss #aria!le in t'e data editor

    If we scroll the worksheetto the right, we see the newvariable that ' '' has ustcomputed for us.

  • 8/11/2019 Analyzing Missing Data

    17/49

    Slide7

    A re,uency distri!ution or nmiss

    To answer the&uestion of how manycases had each of thepossible numbers ofmissing value, wecreate a fre&uencydistribution. 'elect the Frequencies ( )

    Descriptive Statistics command from the Analyze menu.

  • 8/11/2019 Analyzing Missing Data

    18/49

    Slide8

    Completing t'e speci ication or re,uencies

    Second , click on the O*button to complete there&uest for statisticaloutput.

    First , move the nmiss variable to the list ofvariables.

  • 8/11/2019 Analyzing Missing Data

    19/49

    Slide9

    +'e re,uency distri!ution

    ' '' produces a fre&uencydistribution for the nmiss variable.

    "70 cases had valid, non3missing values for all $variables. 4$ cases had onemissing value5 " case had 2missing values5 and "6 caseshad missing values for 6variables.

  • 8/11/2019 Analyzing Missing Data

    20/49

    Slide2!

    Ans$ering t'e pro!lem

    The problem asked whetheror not "6 cases had missingdata for more than half thevariables. or a set of fivevariables, cases that had #,6, or $ missing valueswould meet thisre&uirement.

    The number of cases with#, 6, or $ missing values is"6.

    The answer to the problemis true .

  • 8/11/2019 Analyzing Missing Data

    21/49

    Slide2

    ro!lem 3

    3( 0ased on a missing data analysis or t'e #aria!lesemployment status) num!er o 'ours $or*ed in t'e past$ee*) sel employment) go#ernmental employment) andoccupational prestige score in t'e dataset 2SS 444(sa#) is t'eollo$ing statement true) alse) or an incorrect application o astatistic5 "se 4(41 as t'e le#el o signi icance(

    A ter e%cluding cases $it' missing data or more t'an 'al ot'e #aria!les rom t'e analysis i necessary) t'e presence ostatistically signi icant correlations in t'e matri% o dic'otomousmissing9#alid #aria!les suggests t'at t'e missing data pattern

    may not !e random(

    1( +rue ( +rue $it' caution 3( /alse

    6( Incorrect application o a statistic

  • 8/11/2019 Analyzing Missing Data

    22/49

    Slide22

    Compute #alid9missing dic'otomous #aria!les

    To evaluate the pattern ofmissing data, we need tocompute dichotomousvalid missing variables foreach of the five variablesincluded in the analysis.

    %e will compute the newvariable using the 8ecodecommand.

    To create the newvariable, select the!eco"e # IntoDi$$erent %aria&les'from the (rans$orm menu.

  • 8/11/2019 Analyzing Missing Data

    23/49

    Slide23

    nter speci ications or ne$ #aria!le

    First , move the firstvariable in the analysis,wrkstat , into the Numeric%aria&le )* Output %aria&le te+t bo+.

    Second , type the name for the newvariable into the -ame te+t bo+. 9yconvention is to add an underscorecharacter to the end of the variable name.

    If this would make the variable more than4 characters long, delete characters fromthe end of the original variable name.

  • 8/11/2019 Analyzing Missing Data

    24/49

    Slide24

    nter speci ications or ne$ #aria!le

    Next , type the label for thenew variable into the :abelte+t bo+. 9y convention is toadd the phrase ; %ali"+Missin,-

    to the end of the variablelabel for the original variable.

    Finally , click onthe hange buttonto add the name ofthe dichotomousvariable to theNumeric Variable ->Output Variable te%t!o%(

  • 8/11/2019 Analyzing Missing Data

    25/49

    Slide25

    nter speci ications or ne$ #aria!le

    To specify the values for thenew variable, click on the Ol"an" New %alues' button.

  • 8/11/2019 Analyzing Missing Data

    26/49

    Slide26

    C'ange t'e #alue or missing data

    The dichotomous variable should becoded " if the variable has a valid value,0 if the variable has a missing value.

    First , markthe System) oruser)missin, option button.

    Second , type 0 inthe

  • 8/11/2019 Analyzing Missing Data

    27/49

    Slide27

    C'ange t'e #alue or #alid data

    First , markthe All othervalues option

    button.

    Second , type " inthe

  • 8/11/2019 Analyzing Missing Data

    28/49

    Slide28

    Complete t'e #alue speci ications

    =aving entered the valuesfor recoding the variableinto dichotomous values, weclick on the Continue buttonto complete this dialog bo+.

  • 8/11/2019 Analyzing Missing Data

    29/49

    Slide29

    Complete t'e recode speci ications

    =aving entered specifications for thenew variable and the values forrecoding the variable into dichotomousvalues, we click on the O button toproduce the new variable.

  • 8/11/2019 Analyzing Missing Data

    30/49

    Slide3!

    +'e dic'otomous #aria!le

    The procedure for creating a dichotomous

    valid missing variable is repeated for thefour other variables in the analysis> hrs",wrkslf, wrkgovt, and prestg40.

  • 8/11/2019 Analyzing Missing Data

    31/49

    Slide3

    /iltering cases $it' e%cessi#e missing #aria!les

    To filter cases included infurther analysis, we choosethe Select Cases' command from the Data menu.

    The problem calls for us toe+clude cases that havemissing data for more thanhalf of the variables.

    %e do this by selecting in,or filtering, cases that havefewer than half missingvariables, i.e. less than #missing variables.

  • 8/11/2019 Analyzing Missing Data

    32/49

    Slide32

    nter speci ications or selecting cases

    Second , click on the I$' button to enter thecriteria for includingcases.

    First , click on the I$con"ition is satis$ie" option button on theSelect panel.

  • 8/11/2019 Analyzing Missing Data

    33/49

    Slide33

    nter speci ications or selecting cases

    Second , clickon the Continue button tocomplete the Ifspecification.

    First , enter the criteriafor including cases>

    nmiss ? #

  • 8/11/2019 Analyzing Missing Data

    34/49

    Slide34

    Complete t'e speci ications or selecting cases

    To complete thespecifications, clickon the O button.

  • 8/11/2019 Analyzing Missing Data

    35/49

    Slide35

    Cases e%cluded rom urt'er analyses

    ' '' marks the cases that will not beincluded in further analyses by drawing

    a slash mark through the case number.

    %e can verify that the selection isworking correctly by noting that thecase which is omitted had 6 missingvariables.

  • 8/11/2019 Analyzing Missing Data

    36/49

    Slide36

    Correlating t'e dic'otomous #aria!les

    To compute a correlationmatri+ for the dichotomousvariables, select theCorrelate command fromthe Analyze menu.

  • 8/11/2019 Analyzing Missing Data

    37/49

    Slide37

    Speci ications or correlations

    Second , click onthe O button tocomplete there&uest.

    First , move thedichotomous variablesto the variables list bo+.

  • 8/11/2019 Analyzing Missing Data

    38/49

    Slide38

    Correlations

    . a . a . a . a . a

    . . . . .256 256 256 256 256

    . a 1 -.049 . a -.042

    . . .437 . .501

    256 256 256 256 256

    . a -.049 1 . a -.010

    . .437 . . .877

    256 256 256 256 256

    . a . a . a . a . a

    . . . . .256 256 256 256 256

    . a -.042 -.010 . a 1

    . .501 .877 . .256 256 256 256 256

    Pearson Correlation

    Sig. (2-tailed)NPearson CorrelationSig. (2-tailed)

    N

    Pearson CorrelationSig. (2-tailed)N

    Pearson CorrelationSig. (2-tailed)NPearson CorrelationSig. (2-tailed)N

    L !"# $#C% S& &'S

    ( alid *issing)

    N'*!%# "$ +"'#S,"# % L S& ,%%( alid *issing)

    # S%L$-%*P "#,"# S $"#S"*%!" /

    ( alid *issing)

    " & "# P# &%%*PL"/%%( alid *issing)

    #S "CC'P & "N LP#%S& % SC"#%(1980) ( alid *issing)

    L !"#$#C%

    S& &'S( alid *is

    sing)

    N'*!%#"$ +"'#S,"# %

    L S& ,%%( alid *issin

    g)

    # S%L$-%*P"# ,"# S

    $"#S"*%!" /( alid *issin

    g)

    " & "#P# &%

    %*PL"/%%( alid *issi

    ng)

    #S"CC'P& "N L

    P#%S&% SC"#%

    (1980)( alid *is

    sing)

    Cannot e o ted e a se at least one o t e aria les is onstant.a.

    +'e correlation matri%

    The correlation matri+ issymmetric along the diagonal;shown by the blue line@. Thecorrelation for any pair ofvariables is included twice inthe table. 'o we only countthe correlations below the

    diagonal ;the cells with theyellow background@.

  • 8/11/2019 Analyzing Missing Data

    39/49

    Slide39

    Correlations

    . a .a . a . a . a

    . . . . .256 256 256 256 256

    . a 1 -.049 . a -.042

    . . .437 . .501

    256 256 256 256 256

    . a -.049 1 . a -.010

    . .437 . . .877

    256 256 256 256 256

    . a .a . a . a . a

    . . . . .256 256 256 256 256

    . a -.042 -.010 . a 1

    . .501 .877 . .256 256 256 256 256

    Pearson Correlation

    Sig. (2-tailed)NPearson CorrelationSig. (2-tailed)

    N

    Pearson CorrelationSig. (2-tailed)N

    Pearson CorrelationSig. (2-tailed)NPearson CorrelationSig. (2-tailed)N

    L !"# $#C% S& &'S

    ( alid *issing)

    N'*!%# "$ +"'#S,"# % L S& ,%%( alid *issing)

    # S%L$-%*P "#,"# S $"#S"*%!" /

    ( alid *issing)

    " & "# P# &%%*PL"/%%( alid *issing)

    #S "CC'P & "N LP#%S& % SC"#%(1980) ( alid *issing)

    L !"#$#C%

    S& &'S( alid *is

    sing)

    N'*!%#"$ +"'#S,"# %

    L S& ,%%( alid *issin

    g)

    # S%L$-%*P"# ,"# S

    $"#S"*%!" /( alid *issin

    g)

    " & "#P# &%

    %*PL"/%%( alid *issi

    ng)

    #S"CC'P& "N L

    P#%S&% SC"#%

    (1980)( alid *is

    sing)

    Cannot e o ted e a se at least one o t e aria les is onstant.a.

    +'e correlation matri%

    The correlations marked withfootnote a could not becomputed because one of thevariables was a constant, i.e.the dichotomous variable hasthe same value for all cases.

    This happens when one of thevalid missing variables has nomissing cases, so that all ofthe cases have a value of "and none have a value of 0.

  • 8/11/2019 Analyzing Missing Data

    40/49

    Slide4!

    Correlations

    .a . a . a . a . a

    . . . . .256 256 256 256 256

    .a 1 -.049 . a -.042

    . . .437 . .501

    256 256 256 256 256

    .a -.049 1 . a -.010

    . .437 . . .877

    256 256 256 256 256

    .a . a . a . a . a

    . . . . .256 256 256 256 256

    .a -.042 -.010 . a 1

    . .501 .877 . .256 256 256 256 256

    Pearson Correlation

    Sig. (2-tailed)NPearson CorrelationSig. (2-tailed)

    N

    Pearson CorrelationSig. (2-tailed)N

    Pearson CorrelationSig. (2-tailed)NPearson CorrelationSig. (2-tailed)N

    L !"# $#C% S& &'S

    ( alid *issing)

    N'*!%# "$ +"'#S,"# % L S& ,%%( alid *issing)

    # S%L$-%*P "#,"# S $"#S"*%!" /

    ( alid *issing)

    " & "# P# &%%*PL"/%%( alid *issing)

    #S "CC'P & "N LP#%S& % SC"#%(1980) ( alid *issing)

    L !"#$#C%

    S& &'S( alid *is

    sing)

    N'*!%#"$ +"'#S,"# %

    L S& ,%%( alid *issin

    g)

    # S%L$-%*P"# ,"# S

    $"#S"*%!" /( alid *issin

    g)

    " & "#P# &%

    %*PL"/%%( alid *issi

    ng)

    #S"CC'P& "N L

    P#%S&% SC"#%

    (1980)( alid *is

    sing)

    Cannot e o ted e a se at least one o t e aria les is onstant.a.

    +'e correlation matri%

    In the cells for which the correlationcould be computed, the probabilitiesindicating significance are 0.6#7,0.$0", and 0.477.

    -one of the correlations arestatistically significant. The answer tothe &uestion is false . %e do notneed to be concerned about a missingdata problem for this set of variables.

  • 8/11/2019 Analyzing Missing Data

    41/49

    Slide4

    "sing scripts

    +'e process o e#aluating missing data re,uiresnumerous S SS procedures and outputs t'at are timeconsuming to produce(

    +'ese procedures can !e automated !y creating anS SS script( A script is a program t'at e%ecutes ase,uence o S SS commands(

    +'oug't $riting scripts is not part o t'is course) $ecan ta*e ad#antage o scripts t'at I use to reduce t'e!urdensome tas*s o e#aluating missing data(

  • 8/11/2019 Analyzing Missing Data

    42/49

    Slide42

    "sing a script or missing data

    +'e script :MissingDataC'ec*(s!s; $ill produce all ot'e output $e 'a#e used or e#aluating missing data)as $ell as ot'er outputs descri!ed in t'e te%t!oo*(

    a#igate to t'e lin* :S SS Scripts and Synta%; on t'ecourse $e! page(

    Do$nload t'e script ile :MissingDataC'ec*(e%e; toyour computer and install it) ollo$ing t'e directionson t'e $e! page(

  • 8/11/2019 Analyzing Missing Data

    43/49

    Slide43

  • 8/11/2019 Analyzing Missing Data

    44/49

    Slide44

    In#o*e t'e script

    To invoke the script, selectthe 8un 'cript( commandin the Atilities menu.

  • 8/11/2019 Analyzing Missing Data

    45/49

    Slide45

    Select t'e missing data script

    First , navigate to the folder where you put the script.If you followed the directions, you will have a file withan B.'1'B e+tension in the >C'%#4487 folder.

    If you only see a file with an D./E/F e+tension in thefolder, you should double click on that file to e+tractthe script file to the >C'%#4487 folder.

    Third , click on!un button tostart the script.

    Second , click on thescript name to highlightit.

  • 8/11/2019 Analyzing Missing Data

    46/49

    Slide46

    +'e script dialog

    The script dialog bo+ actssimilarly to ' '' dialogbo+es. Gou select thevariables to include in theanalysis and choose optionsfor the output.

  • 8/11/2019 Analyzing Missing Data

    47/49

    Slide47

    Complete t'e speci ications

    'elect the variables for theanalysis. This analysis usesthe variables for the e+ampleon page $H in the te+tbook.

    lick on the O*button to producethe output.

    The checkbo+esare marked toproduce theoutput we needfor our problems.

    The onlyadditional optionis to compute thet3tests and chi3s&uare tests forall of thevariables.

  • 8/11/2019 Analyzing Missing Data

    48/49

    Slide48

    +'e script inis'es

    If you ' '' output viewer isopen, you will see the outputproduced in that window.

    'ince it may take a while toproduce the output, andsince there are times whenit appears that nothing ishappening, there is an alertto tell you when the script isfinished.

    Anless you are absolutelysure something has gonewrong, let the script rununtil you see this alert.

    %hen you see this alert,click on the O* button.

  • 8/11/2019 Analyzing Missing Data

    49/49

    Slide49