Data-out Experience and Challenges in Blaise 5

Mohammad Mushtaq
April Beaulé

Institute for Social Research, University of Michigan

Technical Series Paper #16-02

Funding for this research was provided by the National Science Foundation (SES 1157698).
Abstract: The Blaise 5 web interface was used to collect data for the Opt-In Pilot 2015. At the time of data collection, only a limited number of data-out utilities were available, which made data extraction from Blaise 5 a challenging task. The available data-out options were flat file and XML formats, and the tools previously developed for Blaise 4 did not work with these options. Using the available data-out options in Blaise 5, the data were extracted into a Blaise 4 compatible flat file and then imported into SAS; the resulting wide tables had 11,664 and 11,970 variables from the two survey instruments, V1 and V2 respectively. Using knowledge of core interview instruments from prior years, Blaise 4 extraction, and a reverse-engineering method of data transformation, the wide data files were transformed into a relational data structure. First, identify the interview sections (Housing, Employment, etc.) and create data files by section – this is a household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop instance into level-one tables for each section (parent table), removing loop-one variables from the parent table. Third, some sections had nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this relational transformation, the table structure matches PSID 2013. Using the same method, the data from All Stars 2016 can be compared with data from PSID (CATI) 2015. The data-out task would have been much simpler if the ASCII-Relational output option in Manipula had been made available sooner, so that projects like the PSID could benefit from existing data processing tools. A second major hurdle for which we had to find a solution in Blaise 5 is identifying questions that are 'on the route' but not answered versus questions 'not on the route'.
For self-administered web questionnaires where respondents may just skip a question, it is imperative for data processing that we can easily identify which questions were presented and not answered so that we can properly document the universe of each question in the codebook and generate the appropriate frequencies for valid responses.
1. Introduction
The Panel Study of Income Dynamics (PSID) is a nationally representative longitudinal study of approximately 9,000 U.S. families. Since 2001, the PSID has used Blaise as its main software for data collection. Due to the size and complexity of the instrument, PSID staff work with tables extracted in looped blocks rather than a single flat file. As the PSID moves to more web-based or mixed-mode data collections, we have begun to run pilot projects using Blaise 5. This paper discusses the challenges we encountered in extracting data from Blaise 5. We also discuss potential future challenges in data handling and documentation for web-based interviews, where respondents' ability to skip questions creates challenges for variable documentation. We begin with an overview of the specific challenges of Blaise 5 extraction, where our goal was to extract in the same format as Blaise 4.8 CATI. Second, we review the issues surrounding determining an acceptable partial interview in the web self-interviewing environment. Finally, we discuss the data processing challenges that Blaise 5 self-administered web projects present for final variable handling and documentation.

2. Background

In the early years of the study, the data collection instrument was a paper-and-pencil questionnaire. In 1992, the PSID automated the survey instrument using SurveyCraft software. By 2001, the Survey Research Center (SRC) at the University of Michigan had begun using Blaise, and the PSID was re-programmed at that time based on the new software. Just as the sample continues to grow, so has the instrument. In 1968, the first year of the PSID, the interview took approximately 20 minutes to administer and the staff released 447 variables. By 2015, the average interview length was 78 minutes and we plan to release 5,493 variables on the Family File. The PSID is a family-level instrument, meaning that one respondent reports details about all the individuals in the family.
In order to capture individual-level information, we have to set the threshold for array sizes higher than the average family size to allow data collection for larger families. These large array sizes, as well as the need for many embedded arrays, make flat extraction unwieldy and cumbersome given the large number of blank positions in most interviews. In all previous versions of Blaise, we extracted 'relational blocks' using an extraction tool based on Manipula that was designed in-house at the University of Michigan. This allowed us to handle the data efficiently and keep only rows that contain data while dropping blank rows. However, with the Blaise 5 pilot projects we have launched over the last year, we have faced several challenges extracting data in the same format given the limited number of extraction tools available in Blaise 5 at the time of data collection.
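The "keep only rows that contain data" rule above can be illustrated with a short sketch. This is not the PSID's in-house Manipula tool; it is a minimal Python illustration with hypothetical column names, showing how blank loop instances are dropped so that large sparse arrays do not inflate the extracted table.

```python
# Illustration (hypothetical columns, not the real PSID extract): keep only
# loop rows with at least one non-blank value outside the key columns.
import csv
import io

raw = io.StringIO(
    "SID,INSTANCE,JOB_TITLE,JOB_PAY\n"
    "1001,1,Teacher,52000\n"
    "1001,2,,\n"                 # blank loop instance: should be dropped
    "1002,1,Nurse,61000\n"
)

reader = csv.DictReader(raw)
keys = ("SID", "INSTANCE")
rows = [r for r in reader
        if any(v.strip() for k, v in r.items() if k not in keys)]

for r in rows:
    print(r["SID"], r["INSTANCE"], r["JOB_TITLE"])
```

Only the two populated instances survive; the all-blank instance for SID 1001 is discarded, which is what keeps the relational-block extracts compact.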
3. Blaise 5 Data Extraction and Relational Transformation

At the time of the Opt-In Pilot 2015, Blaise 5 Manipula extraction with ASCII-relational output was not available. As previously developed tools did not work with the Blaise 5 data-out options, data extraction became a challenging task.

The goal of this task was to create SAS data files that are relational and have the same structure as PSID 2015, so that data can easily be compared between two waves of data collection. To achieve this goal, wide-format extraction and a reverse-engineering approach were used to transform the data into the desired relational structure.

Blaise ETL tools (export functions) are available from the "Home" and "Data Viewer" tabs. The ETL Export function from the "Home" tab does not export all cases. It only keeps 400 cases from the database – whether the first 400, the last 400, or a random set of 400 cases, we don't know.

3.1 Data Extraction: After careful examination of all tabs and available options, below are the steps that we used for successful extraction.

1. Open the Blaise 5 database using the "Database" option from the "Home" tab. With the database opened, it also opens the "Data Viewer" with the "Browse Data" tab.
2. From the "Data Viewer" tab, click "Export Data" and select the Output Format as below.
3. Select the Output File Specification as below and click Next.
4. Skip the Blaise Data Interface File (click Next), then select Export from the "Data Exporter" tab.

With these steps, data and open text have been extracted into two files. The wide data file PSID15.txt has 1,489 data rows (observations) and 11,664 columns (variables). The delimiter is as specified in step #3. The first row is the column header, which will be processed further to create sections and sub-sections from the wide file.
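The header row produced by these steps drives everything downstream. A minimal sketch of how the header can be split into per-section column lists follows; the "A_"/"BC_" prefix convention here is hypothetical (the real headers follow the Blaise field paths from the data model).

```python
# Sketch: group wide-file header columns by section prefix.
# The naming convention is a hypothetical stand-in for the real
# Blaise field-path headers.
from collections import defaultdict

header = ["SID", "A_MORTGAGE1_RATE", "A_MORTGAGE2_RATE",
          "BC_JOB1_TITLE", "BC_JOB2_TITLE", "F_VEHICLES"]

sections = defaultdict(list)
for col in header:
    if col == "SID":              # the primary key joins every section file
        continue
    sections[col.split("_", 1)[0]].append(col)

print(sorted(sections))
```

Each section's column list then becomes the blueprint for one household-level file keyed by SID.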
3.2 SAS Transformation: Using knowledge of previous years' core interview instruments and a reverse-engineering method, the wide data file was transformed into a relational data structure. First, identify the sections (A, BC, F, G, R, W, P, H, M, KL and J1/J2) and create data files by section – this is the household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop-one instance into child-level tables for each section (parent table), removing loop-one variables from the parent table. Third, sections P, H and J have nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this transformation, the V1 database has 52 tables and 2,554 variables. The resulting table structure matches the PSID 2015 core; therefore, the relational structure can be used to compare data with CATI 2015. Below are details of the transformation process:

1. Create a SAS table from the wide file; the table has 1,489 rows and 11,664 columns.
2. Identify sections and sub-sections from the column headers extracted by the extraction process. In the wide file, section A has 149 variables and three sub-sections – mortgage, foreclosure and computer usage – and each sub-section has two loops. The column layout of section A in the wide file is:

   section A … mortgage instance #1 … mortgage instance #2 … foreclosure instance #1 … foreclosure instance #2 … computer usage instance #1 … computer usage instance #2 … end of section A
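Recognizing those sub-section/instance groups from the column headers can be sketched as follows; the regular-expression naming pattern is hypothetical, standing in for the real header convention.

```python
# Sketch: split section A's columns into (sub-section, instance) groups,
# e.g. mortgage instance #1 vs #2. The "A_<SUB><N>_<VAR>" pattern is a
# hypothetical stand-in for the real header naming.
import re
from collections import defaultdict

cols = ["A_MORTG1_RATE", "A_MORTG1_YEARS", "A_MORTG2_RATE",
        "A_MORTG2_YEARS", "A_FORECL1_YEAR", "A_FORECL2_YEAR"]

groups = defaultdict(list)
for c in cols:
    m = re.match(r"A_([A-Z]+)(\d+)_(\w+)", c)
    sub, inst, var = m.group(1), int(m.group(2)), m.group(3)
    groups[(sub, inst)].append(var)

print(groups[("MORTG", 1)])
```

Each key of `groups` corresponds to one future row of a child table, which is what makes the stacking step below mechanical.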
Section BC has 1,040 variables in the wide file and several sub-sections, and each sub-section has more than 10 loops. Sections P, H and J have second-level nesting, which makes the relational conversion more complicated.

3. From our experience with PSID data, section A (parent table) has one row per sampleid (primary key), and mortgage, foreclosure and computer usage (child tables) are sub-sections where each row is unique by sampleid and instance number. We used SAS data step programming to write the SAS code. It created tables for all loops within each sub-section, and then all loops were stacked.
   a. SAS code for foreclosure loop #1
   b. SAS code for foreclosure loop #2
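The stacking step described in item 3 (the authors' actual code is SAS data steps, shown in the figures above) can be illustrated with an equivalent Python sketch over made-up data: each loop instance's columns become one row of the child table, keyed by SID and instance number.

```python
# Illustration of the stacking step (the authors used SAS data steps;
# this is an equivalent sketch with made-up data). Blank instances are
# dropped here, mirroring the relational-block approach.
wide = [
    {"SID": 1001, "FORECL1_YEAR": "2012", "FORECL2_YEAR": "2014"},
    {"SID": 1002, "FORECL1_YEAR": "2010", "FORECL2_YEAR": ""},
]

child = []
for row in wide:
    for inst in (1, 2):
        year = row[f"FORECL{inst}_YEAR"]
        if year:                       # skip blank loop instances
            child.append({"SID": row["SID"], "INSTANCE": inst, "YEAR": year})

print(len(child))
```

The child table is unique by (SID, INSTANCE), matching the parent/child structure described for section A.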
4. The above example is from two loops and two variables, and the output table is one level below the parent table. The process becomes more complicated when output tables are two steps deeper and/or the number of loops is significantly large.
   c. SAS code for stacking data from the foreclosure loops

3.3 Missing data:

1. Numeric variables from loops, as specified in the Blaise data model, can assume either numeric or character data type in SAS when the delimited file is read into SAS using a data step and the "dsd" option. In the wide file, each instance is a separate set of variables. If a numeric variable has missing data on the first row, then its data type is set as character. Therefore, the same variable from different instances can have mixed data types, and the stacking process can't stack the instances successfully.
2. Character variables from different instances can be of different lengths. If the instance-1 variable has a smaller length than the other variables, then the stacking process needs additional adjustment, or else data will be truncated in the stacking process.

In the absence of ASCII-Relational output options from Blaise 5 and Manipula, the data extraction task from the Blaise 5 database required a lot of resources, a good understanding of the PSID questionnaire from previous waves, and excellent data management skills in SAS.
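The type-mismatch hazard in item 1 above is a general one, not unique to SAS: if instance 1 happens to be blank on the first row, a naive reader types that column as text while instance 2 comes in numeric, and the stacked column mixes types. A minimal Python sketch of the fix is to normalize every value to one type before stacking.

```python
# Illustration of the mixed-type stacking problem and its fix:
# normalize all values (blank -> missing, else numeric) before stacking.
instance1 = ["", "52000"]        # read as text because row 1 is blank
instance2 = [61000, 47000]       # read as numeric

def normalize(v):
    """Map blanks to None and everything else to int."""
    if v in ("", None):
        return None
    return int(v)

stacked = [normalize(v) for v in instance1] + [normalize(v) for v in instance2]
print(stacked)
```

In SAS the analogous adjustment is forcing a consistent type and length before the data step appends the instances.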
4. Determining a completed interview

The PSID has started completing supplemental interviews to the main study by offering respondents mixed-mode data collection using primarily a Web/PAPI protocol. The web-collected records pose several challenges for the data processing team. The first main challenge is the handling of partially completed interviews. In CATI interviews, the interviewer codes an interim or final disposition based on whether an interview has gone far enough to be considered a partial. In contrast, with self-administered web interviews we know when a form has been submitted by the respondent, but we do not have a sense of its 'completeness'. For those cases where an interview has been started but not yet submitted, we are also unsure about its status and about whether the respondent will go back and make further attempts to complete it. These ambiguities make it more difficult to monitor data collection while it is in the field and add extra burden for data processing staff to adjudicate which cases we will accept as completed interviews for the final dataset. In CATI mode, the interviewer develops a sense of item-level refusals and may in fact suspend the interview or code the case out as a refusal if the respondent refuses to answer a large proportion of questions. Alternatively, interviewers alert the processing team to problematic cases by triggering a flag in the observation section of the interview. These interviewer-initiated triggers for further scrutiny are not possible in self-administered web surveys, and alternative approaches are required to deem an interview an acceptable completed case: for web self-administered surveys, the Data Processing (DP) team works closely with the lead researcher to determine an acceptable level of item nonresponse based on the forms submitted by the respondent that are automatically coded as complete by the sample management system.
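One way such an item-nonresponse threshold might be operationalized is sketched below; the data, the mean-based reference, and the 0.5 cutoff factor are illustrative assumptions, not a PSID rule.

```python
# Sketch of an item-nonresponse threshold: count non-missing values per
# case, use the sample mean as a reference, and flag cases far below it.
# The 0.5 cutoff factor and data are made-up illustrations.
cases = {
    "C1": [1, 2, None, 4, 5],
    "C2": [None, None, None, 1, None],
    "C3": [1, 2, 3, 4, None],
}

counts = {cid: sum(v is not None for v in vals) for cid, vals in cases.items()}
mean = sum(counts.values()) / len(counts)
flagged = sorted(cid for cid, n in counts.items() if n < 0.5 * mean)
print(flagged)
```

Cases below the cutoff would then go to the DP team and lead researcher for adjudication rather than being accepted automatically.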
Generally, we may look at the entire sample and determine the average number of variables with non-missing values to set the acceptable threshold. This approach presents challenges as instruments increase in complexity and the number of variables 'on-route' or 'off-route' (see below) varies significantly between interviews. In many situations, cases with questionable 'completeness' have to be reviewed on a case-by-case basis, which is often time-consuming.

5. Problem of determining questions 'on-route' vs 'off-route'
One of the most basic fundamentals in data documentation is identifying the universe of each variable. In the PSID, we have long documented the inverse of the universe in our codebooks (Figure A). The INAP (Inappropriate code) lists the criteria the respondent met in order to skip this particular question.
Figure A: Example Codebook for PSID Family Level Variable in CATI

In a CATI interview, we can determine which questions are skipped easily because all questions that are 'on-route' (i.e., those that are presented based on pathing) must be answered by the interviewer, even if the response is 'Don't Know' or 'Refused'. In contrast, in self-administered web interviews the general norm is to simply let the respondent pass a question without an answer. Similarly, 'Don't Know' and 'Refused' are not explicit options available to the respondent at all, as the aim is to encourage the respondent to provide a valid (non-missing) response.

For web data-out in Blaise 5, there is no distinction between variables with a system missing code because the variable was 'off-route' (i.e., those variables that are never presented based on pathing) versus variables that were 'on-route' but for which the respondent did not provide a response and instead just skipped the question. Determining the difference between these two scenarios is imperative once we move to the documentation stage that describes the universe of those who did not receive this question because of routing. Because this critical distinction is not currently a feature that we have, the Data Processing (DP) team must essentially re-write all the skip patterns in the Blaise instrument using cleaning/analysis software like SAS. The DP team members use the 'Box and Arrow' documentation of the questionnaire, often written in a Word document (Figure B), or visual documentation of the questionnaire to determine all the possible skip patterns and assign a unique code to questions that are empty because they were 'off-route'.
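Re-expressing a skip pattern in analysis code, as described above, amounts to classifying each empty value by its routing. A minimal Python sketch follows; the routing rule and the codes are hypothetical illustrations, not actual PSID logic.

```python
# Sketch of separating 'off-route' (INAP) from 'on-route but skipped'.
# The routing rule (home owners get the rate question) and the codes
# are hypothetical, not actual PSID skip logic.
INAP, NOT_ASCERTAINED = 0, 9

def code_mortgage_rate(owns_home, rate):
    """The rate question is on-route only for home owners."""
    if not owns_home:
        return INAP              # never presented: off-route
    if rate is None:
        return NOT_ASCERTAINED   # presented but skipped
    return rate

print(code_mortgage_rate(False, None))
print(code_mortgage_rate(True, None))
print(code_mortgage_rate(True, 4.25))
```

Every such rule has to be transcribed by hand from the Box and Arrow documentation, which is exactly the duplicated effort the text describes.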
Figure B: Example PSID Box and Arrow Documentation of CATI Questionnaire

For basic questionnaires with few skip patterns, this is not unduly burdensome. However, if we plan to attempt data collection on the web with more complex instruments containing many loops, arrays, and embedded arrays, this becomes ever more challenging for the SAS programmer and error-prone. Unlike the Blaise programming phase, which includes many rounds of testing by several testers, the SAS programmers do not have the same amount of time or testing resources to check their programs and output for integrity.

Writing the logic of the entire questionnaire, first by the Blaise programmer and then again by the SAS programmer, is also extremely inefficient and may affect data integrity if the SAS programmer makes an error in determining the correct skip flow (Figure C). It also may offset some of the presumed cost savings of having the respondent rather than the interviewer complete the questionnaire. Since the Blaise system knows which questions were 'on-route' vs 'off-route' at the time of the interview based on the data model version, it appears there should be a way to communicate those differences in the storage and extraction of data from the bdbx automatically.
Figure C: Example SAS code for Recoding Skipped Variables to Not Ascertained (NA) = 9
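Figure C itself is a SAS screenshot that is not reproduced here; the following is a hedged Python rendering of the same kind of recode: for every variable a case had on-route, replace a skipped (missing) value with NA = 9, while leaving off-route variables untouched.

```python
# Hedged Python rendering of the Figure C recode (the original is SAS):
# set on-route missing values to NA = 9; off-route values stay as-is.
NA = 9

def recode_skipped(record, on_route):
    """Return a copy with on-route missing values set to NA."""
    return {k: (NA if k in on_route and v is None else v)
            for k, v in record.items()}

rec = {"A1": 5, "A2": None, "A3": None}
print(recode_skipped(rec, on_route={"A1", "A2"}))
```

Here A2 was presented but skipped and becomes 9, while A3 was never presented and keeps its system-missing value for later INAP coding.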
6. Conclusion and Summary

Using knowledge of core interview instruments from prior years, Blaise 4 extraction, and a reverse-engineering method of data transformation, the wide data files from Blaise 5 were transformed into a relational data structure. First, identify the interview sections (Housing, Employment, etc.) and create data files by section – this is the household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop instance into level-one tables for each section (parent table), removing loop-one variables from the parent table. Third, some sections had nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this relational transformation, the table structure matches PSID 2015. It is likely that many large surveys like the PSID will continue to explore data collection via the web, given cost-cutting pressures from funders as well as increasing respondent coverage as more and more families gain access to computers and the internet. The expectation from both funders and the user community is that data collected via the web will have the same standards of cleaning and documentation as data collected via CATI. The current tools offered in Blaise 5 do not allow us to easily check and assign appropriate codes for 'on-route' but not answered versus 'off-route'. This is a significant drawback for data processing and documentation. If tools could be provided to automatically assign fields with this distinction, then we could also build on that information to develop algorithms for determining the 'completeness' of a web interview.
Instead of case-by-case checking of completeness, we could flag outliers that have more than the average share of missing responses by dividing the number of presented-but-skipped questions by the number of overall questions that were 'on-route'.
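The proposed screening rule can be sketched directly; the counts and the above-average cutoff are illustrative assumptions.

```python
# Sketch of the proposed rule: share of presented-but-skipped questions
# among all 'on-route' questions, flagging cases above the sample average.
# Counts are illustrative.
cases = {"C1": (3, 60), "C2": (30, 50), "C3": (5, 70)}  # (skipped, on_route)

ratios = {cid: skipped / on_route
          for cid, (skipped, on_route) in cases.items()}
avg = sum(ratios.values()) / len(ratios)
outliers = sorted(cid for cid, r in ratios.items() if r > avg)
print(outliers)
```

Only the flagged cases would then need manual review, replacing the current case-by-case adjudication.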