Data-out Experience and Challenges in Blaise 5

Mohammad Mushtaq
April Beaulé

Institute for Social Research, University of Michigan

Technical Series Paper #16-02

Funding for this research was provided by the National Science Foundation (SES 1157698).
Abstract: The Blaise 5 web interface was used to collect data for the Opt-In Pilot 2015. At the time of data collection, only a limited number of data-out utilities were available, which made data extraction from Blaise 5 a challenging task. The available data-out options were flat file and XML formats, and the tools previously developed for Blaise 4 did not work with these options. Using the available data-out options in Blaise 5, the data were extracted into a Blaise 4 compatible flat file and then imported into SAS; the resulting wide tables had 11,664 and 11,970 variables from the two survey instruments, V1 and V2 respectively. Using knowledge of core interview instruments from prior years, Blaise 4 extraction, and a reverse-engineering method of data transformation, the wide data files were transformed into a relational data structure. First, identify the interview sections (Housing, Employment, etc.) and create data files by section – this is a household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop instance into level-one tables for each section (parent table), removing loop-one variables from the parent table. Third, some sections had nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this relational transformation, the table structure matches PSID 2013. Using the same method, the data from All Stars 2016 can be compared with data from PSID (CATI) 2015. The data-out task would have been much simpler if the ASCII-Relational output option in Manipula had been made available sooner, so that projects like the PSID could benefit from existing data processing tools. A second major hurdle for which we had to find a solution in Blaise 5 is identifying questions that are 'on the route' but not answered versus questions 'not on the route'.
For self-administered web questionnaires where respondents may just skip a question, it is imperative for data processing that we can easily identify which questions were presented and not answered so that we can properly document the universe of each question in the codebook and generate the appropriate frequencies for valid responses.
1. Introduction
The Panel Study of Income Dynamics (PSID) is a nationally representative longitudinal study of approximately 9,000 U.S. families. Since 2001, the PSID has used Blaise as its main software for data collection. Due to the size and complexity of the instrument, PSID staff work with tables extracted in looped blocks rather than a single flat file. As the PSID moves to more web-based or mixed-mode data collections, we have begun to run pilot projects using Blaise 5. This paper discusses the challenges we encountered in extracting data from Blaise 5. We also discuss potential future challenges in data handling and documentation for web-based interviews, where respondents' ability to skip questions creates challenges for variable documentation. We begin with an overview of the specific challenges of Blaise 5 extraction, where our goal was to extract in the same format as Blaise 4.8 CATI. Second, we review the issues surrounding determining an acceptable partial interview in the web self-interviewing environment. Finally, we discuss the data processing challenges that Blaise 5 self-administered web projects present for final variable handling and documentation.

2. Background

In the early years of the study, the data collection instrument was a paper-and-pencil questionnaire. In 1992, the PSID automated the survey instrument using SurveyCraft software. By 2001, the Survey Research Center (SRC) at the University of Michigan had begun using Blaise, and the PSID was re-programmed at that time based on the new software. Just as the sample continues to grow, so has the instrument. In 1968, the first year of the PSID, the interview took approximately 20 minutes to administer and the staff released 447 variables. By 2015, the average interview length was 78 minutes and we plan to release 5,493 variables on the Family File. The PSID is a family-level instrument, meaning that one respondent reports details about all the individuals in the family.
In order to capture individual-level information, we have to set the threshold for array sizes higher than the average family size to allow data collection for larger families. These large array sizes, as well as the need for many embedded arrays, make flat extraction unwieldy and cumbersome given the large number of blank positions in most interviews. In all previous versions of Blaise, we extracted 'relational blocks' using an extraction tool based on Manipula that was designed in-house at the University of Michigan. This allowed us to handle the data efficiently and keep only rows that contain data while dropping blank rows. However, with the Blaise 5 pilot projects we have launched over the last year, we have faced several challenges extracting data in the same format given the limited number of extraction tools available in Blaise 5 at the time of data collection.
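The "keep only rows that contain data" rule above can be illustrated with a short sketch. This is not the PSID's in-house Manipula tool; it is a minimal Python illustration with hypothetical column names, showing how blank loop instances are dropped so that large sparse arrays do not inflate the extracted table.

```python
# Illustration (hypothetical columns, not the real PSID extract): keep only
# loop rows with at least one non-blank value outside the key columns.
import csv
import io

raw = io.StringIO(
    "SID,INSTANCE,JOB_TITLE,JOB_PAY\n"
    "1001,1,Teacher,52000\n"
    "1001,2,,\n"                 # blank loop instance: should be dropped
    "1002,1,Nurse,61000\n"
)

reader = csv.DictReader(raw)
keys = ("SID", "INSTANCE")
rows = [r for r in reader
        if any(v.strip() for k, v in r.items() if k not in keys)]

for r in rows:
    print(r["SID"], r["INSTANCE"], r["JOB_TITLE"])
```

Only the two populated instances survive; the all-blank instance for SID 1001 is discarded, which is what keeps the relational-block extracts compact.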
3. Blaise 5 Data Extraction and Relational Transformation

At the time of the Opt-In Pilot 2015, Blaise 5 Manipula extraction with ASCII-relational output was not available. As previously developed tools did not work with the Blaise 5 data-out options, data extraction became a challenging task.

The goal of this task was to create SAS data files that are relational and have the same structure as PSID 2015, so that data can easily be compared between two waves of data collection. To achieve this goal, wide-format extraction and a reverse-engineering approach were used to transform the data into the desired relational structure.

Blaise ETL tools (export functions) are available from the "Home" and "Data Viewer" tabs. The ETL Export function from the "Home" tab does not export all cases. It only keeps 400 cases from the database – whether the first 400, the last 400, or a random set of 400 cases, we don't know.

3.1 Data Extraction: After careful examination of all tabs and available options, below are the steps that we used for successful extraction.

1. Open the Blaise 5 database using the "Database" option from the "Home" tab. With the database opened, it also opens the "Data Viewer" with the "Browse Data" tab.
2. From the "Data Viewer" tab, click "Export Data" and select the Output Format as below.
3. Select the Output File Specification as below and click Next.
4. Skip the Blaise Data Interface File (click Next), then select Export from the "Data Exporter" tab.

With these steps, data and open text have been extracted into two files. The wide data file PSID15.txt has 1,489 data rows (observations) and 11,664 columns (variables). The delimiter is as specified in step #3. The first row is the column header, which will be processed further to create sections and sub-sections from the wide file.
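The header row produced by these steps drives everything downstream. A minimal sketch of how the header can be split into per-section column lists follows; the "A_"/"BC_" prefix convention here is hypothetical (the real headers follow the Blaise field paths from the data model).

```python
# Sketch: group wide-file header columns by section prefix.
# The naming convention is a hypothetical stand-in for the real
# Blaise field-path headers.
from collections import defaultdict

header = ["SID", "A_MORTGAGE1_RATE", "A_MORTGAGE2_RATE",
          "BC_JOB1_TITLE", "BC_JOB2_TITLE", "F_VEHICLES"]

sections = defaultdict(list)
for col in header:
    if col == "SID":              # the primary key joins every section file
        continue
    sections[col.split("_", 1)[0]].append(col)

print(sorted(sections))
```

Each section's column list then becomes the blueprint for one household-level file keyed by SID.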
3.2 SAS Transformation: Using knowledge of previous years' core interview instruments and a reverse-engineering method, the wide data file was transformed into a relational data structure. First, identify the sections (A, BC, F, G, R, W, P, H, M, KL and J1/J2) and create data files by section – this is the household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop-one instance into child-level tables for each section (parent table), removing loop-one variables from the parent table. Third, sections P, H and J have nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this transformation, the V1 database has 52 tables and 2,554 variables. The resulting table structure matches the PSID 2015 core; therefore, the relational structure can be used to compare data with CATI 2015. Below are details of the transformation process:

1. Create a SAS table from the wide file; the table has 1,489 rows and 11,664 columns.
2. Identify sections and sub-sections from the column headers extracted by the extraction process. In the wide file, section A has 149 variables and three sub-sections – mortgage, foreclosure and computer usage – and each sub-section has two loops. The column layout of section A in the wide file is:

   section A … mortgage instance #1 … mortgage instance #2 … foreclosure instance #1 … foreclosure instance #2 … computer usage instance #1 … computer usage instance #2 … end of section A
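Recognizing those sub-section/instance groups from the column headers can be sketched as follows; the regular-expression naming pattern is hypothetical, standing in for the real header convention.

```python
# Sketch: split section A's columns into (sub-section, instance) groups,
# e.g. mortgage instance #1 vs #2. The "A_<SUB><N>_<VAR>" pattern is a
# hypothetical stand-in for the real header naming.
import re
from collections import defaultdict

cols = ["A_MORTG1_RATE", "A_MORTG1_YEARS", "A_MORTG2_RATE",
        "A_MORTG2_YEARS", "A_FORECL1_YEAR", "A_FORECL2_YEAR"]

groups = defaultdict(list)
for c in cols:
    m = re.match(r"A_([A-Z]+)(\d+)_(\w+)", c)
    sub, inst, var = m.group(1), int(m.group(2)), m.group(3)
    groups[(sub, inst)].append(var)

print(groups[("MORTG", 1)])
```

Each key of `groups` corresponds to one future row of a child table, which is what makes the stacking step below mechanical.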
Section BC has 1,040 variables in the wide file and several sub-sections, and each sub-section has more than 10 loops. Sections P, H and J have second-level nesting, which makes the relational conversion more complicated.

3. From our experience with PSID data, section A (parent table) has one row per sampleid (primary key), and mortgage, foreclosure and computer usage (child tables) are sub-sections where each row is unique by sampleid and instance number. We used SAS data step programming to write the SAS code. It created tables for all loops within each sub-section, and then all loops were stacked.
   a. SAS code for foreclosure loop #1
   b. SAS code for foreclosure loop #2
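The stacking step described in item 3 (the authors' actual code is SAS data steps, shown in the figures above) can be illustrated with an equivalent Python sketch over made-up data: each loop instance's columns become one row of the child table, keyed by SID and instance number.

```python
# Illustration of the stacking step (the authors used SAS data steps;
# this is an equivalent sketch with made-up data). Blank instances are
# dropped here, mirroring the relational-block approach.
wide = [
    {"SID": 1001, "FORECL1_YEAR": "2012", "FORECL2_YEAR": "2014"},
    {"SID": 1002, "FORECL1_YEAR": "2010", "FORECL2_YEAR": ""},
]

child = []
for row in wide:
    for inst in (1, 2):
        year = row[f"FORECL{inst}_YEAR"]
        if year:                       # skip blank loop instances
            child.append({"SID": row["SID"], "INSTANCE": inst, "YEAR": year})

print(len(child))
```

The child table is unique by (SID, INSTANCE), matching the parent/child structure described for section A.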
4. The above example is from two loops and two variables, and the output table is one level below the parent table. The process becomes more complicated when output tables are two steps deeper and/or the number of loops is significantly large.
   c. SAS code for stacking data from the foreclosure loops

3.3 Missing data:

1. Numeric variables from loops, as specified in the Blaise data model, can assume either numeric or character data type in SAS when the delimited file is read into SAS using a data step and the "dsd" option. In the wide file, each instance is a separate set of variables. If a numeric variable has missing data on the first row, then its data type is set as character. Therefore, the same variable from different instances can have mixed data types, and the stacking process can't stack the instances successfully.
2. Character variables from different instances can be of different lengths. If the instance-1 variable has a smaller length than the other variables, then the stacking process needs additional adjustment, or else data will be truncated in the stacking process.

In the absence of ASCII-Relational output options from Blaise 5 and Manipula, the data extraction task from the Blaise 5 database required a lot of resources, a good understanding of the PSID questionnaire from previous waves, and excellent data management skills in SAS.
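The type-mismatch hazard in item 1 above is a general one, not unique to SAS: if instance 1 happens to be blank on the first row, a naive reader types that column as text while instance 2 comes in numeric, and the stacked column mixes types. A minimal Python sketch of the fix is to normalize every value to one type before stacking.

```python
# Illustration of the mixed-type stacking problem and its fix:
# normalize all values (blank -> missing, else numeric) before stacking.
instance1 = ["", "52000"]        # read as text because row 1 is blank
instance2 = [61000, 47000]       # read as numeric

def normalize(v):
    """Map blanks to None and everything else to int."""
    if v in ("", None):
        return None
    return int(v)

stacked = [normalize(v) for v in instance1] + [normalize(v) for v in instance2]
print(stacked)
```

In SAS the analogous adjustment is forcing a consistent type and length before the data step appends the instances.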
4. Determining a completed interview

The PSID has started completing supplemental interviews to the main study by offering respondents mixed-mode data collection using primarily a Web/PAPI protocol. The web-collected records pose several challenges for the data processing team. The first main challenge is the handling of partially completed interviews. In CATI interviews, the interviewer codes an interim or final disposition based on whether an interview has gone far enough to be considered a partial. In contrast, with self-administered web interviews we know when a form has been submitted by the respondent, but we do not have a sense of its 'completeness'. For those cases where an interview has been started but not yet submitted, we are also unsure about its status and about whether the respondent will go back and make further attempts to complete it. These ambiguities make it more difficult to monitor data collection while it is in the field and add extra burden for data processing staff to adjudicate which cases we will accept as completed interviews for the final dataset. In CATI mode, the interviewer develops a sense of item-level refusals and may in fact suspend the interview or code the case out as a refusal if the respondent refuses to answer a large proportion of questions. Alternatively, interviewers alert the processing team to problematic cases by triggering a flag in the observation section of the interview. These interviewer-initiated triggers for further scrutiny are not possible in self-administered web surveys, and alternative approaches are required to deem an interview an acceptable completed case: for web self-administered surveys, the Data Processing (DP) team works closely with the lead researcher to determine an acceptable level of item nonresponse based on the forms submitted by the respondent that are automatically coded as complete by the sample management system.
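One way such an item-nonresponse threshold might be operationalized is sketched below; the data, the mean-based reference, and the 0.5 cutoff factor are illustrative assumptions, not a PSID rule.

```python
# Sketch of an item-nonresponse threshold: count non-missing values per
# case, use the sample mean as a reference, and flag cases far below it.
# The 0.5 cutoff factor and data are made-up illustrations.
cases = {
    "C1": [1, 2, None, 4, 5],
    "C2": [None, None, None, 1, None],
    "C3": [1, 2, 3, 4, None],
}

counts = {cid: sum(v is not None for v in vals) for cid, vals in cases.items()}
mean = sum(counts.values()) / len(counts)
flagged = sorted(cid for cid, n in counts.items() if n < 0.5 * mean)
print(flagged)
```

Cases below the cutoff would then go to the DP team and lead researcher for adjudication rather than being accepted automatically.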
Generally, we may look at the entire sample and determine the average number of variables with non-missing values to set the acceptable threshold. This approach presents challenges as instruments increase in complexity and the number of variables 'on-route' or 'off-route' (see below) varies significantly between interviews. In many situations, cases with questionable 'completeness' have to be reviewed on a case-by-case basis, which is often time-consuming.

5. Problem of determining questions 'on-route' vs 'off-route'
One of the most basic fundamentals in data documentation is identifying the universe of each variable. In the PSID, we have long documented the inverse of the universe in our codebooks (Figure A). The INAP (Inappropriate code) lists the criteria the respondent met in order to skip this particular question.
Figure A: Example Codebook for PSID Family Level Variable in CATI

In a CATI interview, we can determine which questions are skipped easily because all questions that are 'on-route' (i.e., those that are presented based on pathing) must be answered by the interviewer, even if the response is 'Don't Know' or 'Refused'. In contrast, in self-administered web interviews the general norm is to simply let the respondent pass a question without an answer. Similarly, 'Don't Know' and 'Refused' are not explicit options available to the respondent at all, as the aim is to encourage the respondent to provide a valid (non-missing) response.

For web data-out in Blaise 5, there is no distinction between variables with a system missing code because the variable was 'off-route' (i.e., those variables that are never presented based on pathing) versus variables that were 'on-route' but for which the respondent did not provide a response and instead just skipped the question. Determining the difference between these two scenarios is imperative once we move to the documentation stage that describes the universe of those who did not receive this question because of routing. Because this critical distinction is not currently a feature that we have, the Data Processing (DP) team must essentially re-write all the skip patterns in the Blaise instrument using cleaning/analysis software like SAS. The DP team members use the 'Box and Arrow' documentation of the questionnaire, often written in a Word document (Figure B), or visual documentation of the questionnaire to determine all the possible skip patterns and assign a unique code to questions that are empty because they were 'off-route'.
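Re-expressing a skip pattern in analysis code, as described above, amounts to classifying each empty value by its routing. A minimal Python sketch follows; the routing rule and the codes are hypothetical illustrations, not actual PSID logic.

```python
# Sketch of separating 'off-route' (INAP) from 'on-route but skipped'.
# The routing rule (home owners get the rate question) and the codes
# are hypothetical, not actual PSID skip logic.
INAP, NOT_ASCERTAINED = 0, 9

def code_mortgage_rate(owns_home, rate):
    """The rate question is on-route only for home owners."""
    if not owns_home:
        return INAP              # never presented: off-route
    if rate is None:
        return NOT_ASCERTAINED   # presented but skipped
    return rate

print(code_mortgage_rate(False, None))
print(code_mortgage_rate(True, None))
print(code_mortgage_rate(True, 4.25))
```

Every such rule has to be transcribed by hand from the Box and Arrow documentation, which is exactly the duplicated effort the text describes.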
Figure B: Example PSID Box and Arrow Documentation of CATI Questionnaire

For basic questionnaires with few skip patterns, this is not unduly burdensome. However, if we plan to attempt data collection on the web with more complex instruments containing many loops, arrays, and embedded arrays, this becomes ever more challenging for the SAS programmer and error-prone. Unlike the Blaise programming phase, which includes many rounds of testing by several testers, the SAS programmers do not have the same amount of time or testing resources to check their programs and output for integrity.

Writing the logic of the entire questionnaire, first by the Blaise programmer and then again by the SAS programmer, is also extremely inefficient and may affect data integrity if the SAS programmer makes an error in determining the correct skip flow (Figure C). It also may offset some of the presumed cost savings of having the respondent rather than the interviewer complete the questionnaire. Since the Blaise system knows which questions were 'on-route' vs 'off-route' at the time of the interview based on the data model version, it appears there should be a way to communicate those differences in the storage and extraction of data from the bdbx automatically.
Figure C: Example SAS code for Recoding Skipped Variables to Not Ascertained (NA) = 9
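Figure C itself is a SAS screenshot that is not reproduced here; the following is a hedged Python rendering of the same kind of recode: for every variable a case had on-route, replace a skipped (missing) value with NA = 9, while leaving off-route variables untouched.

```python
# Hedged Python rendering of the Figure C recode (the original is SAS):
# set on-route missing values to NA = 9; off-route values stay as-is.
NA = 9

def recode_skipped(record, on_route):
    """Return a copy with on-route missing values set to NA."""
    return {k: (NA if k in on_route and v is None else v)
            for k, v in record.items()}

rec = {"A1": 5, "A2": None, "A3": None}
print(recode_skipped(rec, on_route={"A1", "A2"}))
```

Here A2 was presented but skipped and becomes 9, while A3 was never presented and keeps its system-missing value for later INAP coding.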
6. Conclusion and Summary

Using knowledge of core interview instruments from prior years, Blaise 4 extraction, and a reverse-engineering method of data transformation, the wide data files from Blaise 5 were transformed into a relational data structure. First, identify the interview sections (Housing, Employment, etc.) and create data files by section – this is the household-level file with sample id (SID) as primary key. Second, identify the arrays and loops in each section and stack data by loop instance into level-one tables for each section (parent table), removing loop-one variables from the parent table. Third, some sections had nested loops (a loop inside a loop); data from these sections were further stacked by loop-two instance, and loop-two variables were removed from the loop-one table. As a result of this relational transformation, the table structure matches PSID 2015. It is likely that many large surveys like the PSID will continue to explore data collection via the web, given cost-cutting pressures from funders as well as increasing respondent coverage as more and more families gain access to computers and the internet. The expectation from both funders and the user community is that data collected via the web will have the same standards of cleaning and documentation as data collected via CATI. The current tools offered in Blaise 5 do not allow us to easily check and assign appropriate codes for 'on-route' but not answered versus 'off-route'. This is a significant drawback for data processing and documentation. If tools could be provided to automatically assign fields with this distinction, then we could also build on that information to develop algorithms for determining the 'completeness' of a web interview.
Instead of case-by-case checking of completeness, we could flag outliers that have more than the average share of missing responses by dividing the number of presented-but-skipped questions by the number of overall questions that were 'on-route'.
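The proposed screening rule can be sketched directly; the counts and the above-average cutoff are illustrative assumptions.

```python
# Sketch of the proposed rule: share of presented-but-skipped questions
# among all 'on-route' questions, flagging cases above the sample average.
# Counts are illustrative.
cases = {"C1": (3, 60), "C2": (30, 50), "C3": (5, 70)}  # (skipped, on_route)

ratios = {cid: skipped / on_route
          for cid, (skipped, on_route) in cases.items()}
avg = sum(ratios.values()) / len(ratios)
outliers = sorted(cid for cid, r in ratios.items() if r > avg)
print(outliers)
```

Only the flagged cases would then need manual review, replacing the current case-by-case adjudication.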