Population Research Seminar Series

Session 3: Evaluating Survey Questions

Jack Fowler

Center for Survey Research

Survey and Statistical Methods Core

There are 7 kinds of standards for questions

1. Ask the right question

2. Cognitive standards

3. Usability standards

4. Interpersonal standards

5. Psychometric standards

6. Multi-mode standards

7. Multi-lingual standards

Ways to evaluate survey instruments and wording

1. Appraisal forms

2. Focus groups

3. Cognitive testing

4. Pretests with observation and/or debriefing (for self-administration) + paradata

5. Pretests with behavior coding (for interviews)

6. Split-ballot experiments

7. Analytic assessments of reliability and validity

Appraisal forms

• Can flag issues that are known to affect understanding or usability

• However, most of them rely on some judgments (not clear, difficult to recall) that require additional information

Focus groups

• Recruit a few groups of 6-8 people to come talk about the issues the survey will cover

Focus groups

• Content standards

– What people's relevant experiences and situations are

– What people know and think about a topic

– What kinds of answers they give

• Cognitive standards

– Vocabulary—what different candidate words mean to people and what words they use

– What questions they can answer

Strengths of focus groups

1. An efficient way to gather info about several people at a time

2. Sometimes being in a group stimulates ideas or raises issues that would not be found one-on-one

Weaknesses of focus groups

• Do not get to probe into the way individuals understand and answer questions

• May not hear all the issues, because each person only gets so much air time

• Probing individuals is what one-on-one cognitive interviews do

Cognitive testing—a little history

• In 1984, an NSF conference brought together survey researchers and cognitive scientists

• NCHS established a laboratory with NSF funding in the late 1980s

• The Bureau of the Census started a lab a little later

• Cognitive testing did not start to become common until the early 1990s

What that means

• Questions that were designed before the mid-1990s were usually not evaluated for whether people understood them or whether their answers meant what the researchers hoped they meant

Answering questions

1. Comprehending what is being asked

2. Having information relevant to the answer

3. Working with the information to put it in the form needed to answer

4. Providing the answer

Error happens

• When one of these four steps is handled problematically, validity is at risk

• The goal of question design is to minimize the risk of problems at each of the steps of the question-answering process

• The goal of cognitive testing is to evaluate how well each of these steps is performed when someone answers a question

Cognitive testing

• Recruit volunteers

• Usually pay them

• Interviews can take 1–1.5 hours

• Interviews are almost always tape recorded and reviewed after the interview

Protocols

• There is a lot of diversity in the way interviews are done:

• Think aloud

• Follow-up probes after first asking the respondent to answer the test question

• Probes can be highly structured, or interviewers can be given a lot of flexibility

Potentially, probes can focus on each aspect of the process

1. What the question is asking

2. What the respondent knows or thinks

3. Refinement needed to get material needed for the answer

4. Turning relevant material into an answer

What the question means

• IN THE PAST YEAR, HOW MANY TIMES HAVE YOU SEEN OR TALKED WITH A DOCTOR ABOUT YOUR OWN HEALTH?

Could you say in your own words what the question is asking?

Do you think the question includes times when you talked with a doctor on the telephone?

Do you think the question includes times when you exchanged e-mails with a doctor?

What the respondent knows or thinks

• How many times do you think you have been to a doctor's office about your own health in the last year?

• How do you remember that?

• Have you had your eyes checked in the last year by a doctor? (Did you include that in your answer?)

• Was there a time when you saw a nurse practitioner, but not the doctor? (Did you count that?)

Refining info to get close to answer

• So, could you take me through the process you went through to decide how many times you saw or talked with a doctor in the last year?

• Were there any doubtful cases that you decided to leave out but thought about including?

• Any that you included but considered leaving out?

Continued

• How confident are you that you have the right answer?

• If you are in error, what would you guess is the most likely kind of error you could have made?

Providing an answer

• Did you consider more than one possible answer before you gave your answer?

• How did you decide which one to give?

• Is there any way in which the answer you gave might not be a good reflection of the true answer?

Example

• Do you think we are doing too much, too little, or about the right amount to fight terrorism?

Example

• How would you rate the importance of the following behaviors that might promote good health—very important, somewhat important, or not important at all?

• Doing moderate exercise for 20 minutes at least 3 days a week

• Having at least 2 glasses of wine, cans of beer, or 1.5 oz drinks of alcohol every day

What cognitive testing can evaluate

• Asking the right question?

• Cognitive standards

• Maybe psychometric issues

• Multi-lingual standards

Conclusion

• The results depend on the investigator juxtaposing the information provided in the cognitive interview with the idealized way a respondent should answer if the answer is going to be a "valid" measure of the target construct

Usability testing

• Observation

• Debriefing

• Analysis of errors of navigation

• Paradata (when computer-based)

Pretests with behavior coding

Pretests of interviews with some people like the planned respondents have been standard for years

Main input from interviewer debriefing

Not very reliable

Not very informative

What should the question-answer process look like?

1. I (interviewer) reads the question exactly as worded

2. R (respondent) understands the question as intended

3. R retrieves the information needed to answer

4. R puts the answer in the form required and tells the interviewer

Premises of behavior coding

1. Deviations from this ideal may reflect problems that are threats to the validity of the data

2. The wording of a question is often the direct cause of these problems

3. The presence of problems can often be inferred from the behavior of interviewers or respondents

How to do behavior coding

• Can be done live using an observer

• Mainly done by recording the interview and then having trained coders listen to the recordings

• NOTE: Experience shows that almost everyone who is willing to be interviewed agrees to be tape recorded

What to code

• Unit of observation is usually the question

• The core focus of most coding is to count deviations from an ideal question-and-answer process

Question reading

• Read exactly as worded

• Minor changes

• Major changes

• Interrupted

Question answering

• Adequate (codable) answer given

• Inadequate (uncodable) answer given

• Qualified answer ("I think," "It might be…")

• Refusals and "don't knows"

Other aspects of R behavior

• Asks for clarification

• Asks for all or part of the question to be repeated

Sample output for each question

• Often results are reported like this:

– % read exactly as worded

– % interrupted reading

– % asked for clarification

– % gave inadequate answer
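
Turning such codes into per-question rates is a simple tally. Here is a minimal sketch in Python; the codes, labels, and sample data are hypothetical, not from the presentation:

```python
from collections import Counter

# Hypothetical behavior codes assigned to one question in 10 pretest
# interviews. A question can receive more than one code per interview:
#   E = read exactly as worded     I = interrupted reading
#   C = R asked for clarification  N = R gave inadequate answer
interviews = [
    ["E"], ["E", "C"], ["I", "N"], ["E"], ["E", "C", "N"],
    ["E"], ["I"], ["E", "N"], ["E"], ["E", "C"],
]

n = len(interviews)
counts = Counter(code for codes in interviews for code in codes)
for code, label in [("E", "read exactly as worded"),
                    ("I", "interrupted reading"),
                    ("C", "asked for clarification"),
                    ("N", "gave inadequate answer")]:
    print(f"% {label}: {100 * counts[code] / n:.0f}%")
```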

Let’s try some behavior coding

What do results tell us? (Some examples)

1. Interruptions often occur when there is dangling material in the question, such as a definition after the question.

• How many children do you have? Do not include stepchildren.

What do results mean?

2. Inadequate answers often occur when it is unclear how to answer the question.

• When did you move to New York?

• Unclear whether the interviewer wants a date, years ago, or a stage in the life cycle—and how precise the answer has to be.

What do results mean?

3. Requests for clarification often mean there is an unclear term or concept.

• When did you move to New York?

• Does that mean the city, the metropolitan area, or the state?

Effects on Data

• One of the clearest findings is that the higher the rate at which interviewers have to probe to get an adequate answer, the higher the interviewer-related error

More effects on data

• There is evidence that qualified answers and response latency are related to the "accuracy" of responses

Strengths of behavior coding as a question evaluation method

• Low cost—easily integrated into a pretest

• Evaluation is of how questions work under realistic data collection conditions

• Results are reliable

• Results are objective—i.e., not dependent on an individual's subjective assessment

• Results are quantitative

Weaknesses of Behavior Coding

1. Sometimes hard to diagnose the reason for results and how to fix it

2. Can't always tell if an observed problem actually affects the data

3. Some question problems (such as comprehension) do not show up in behavior coding

Split-ballot experiments

• About how many months has it been since you last saw or talked to a dentist? Include all types of dentists, such as orthodontists, oral surgeons, or all other dental specialists, as well as dental hygienists.

• About how many months has it been since you last went to a dentist office for any type of dental care?

Results

                              ORIGINAL    ALTERNATIVE
6 months or less                60%          57%
More than 6 months but
not more than 1 year            14%          18%
More than 1 year                26%          25%
TOTAL                          100%         100%
                               (n=77)       (n=79)

Split-ballot experiment

• In the last 12 months, how often did you get an appointment for regular or routine health care as soon as you wanted: always, usually, sometimes, never? ("Always" recoded to "Yes")

• In the last 12 months, were you always able to get an appointment as soon as you wanted?

Results

          ORIGINAL    ALTERNATIVE
Yes         47%          66%
No          53%          34%
Total      100%         100%
           (n=261)      (n=299)
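
One common way to judge whether a wording difference like this matters is a chi-square test of independence on the two response distributions. A minimal sketch in Python, with counts reconstructed from the percentages and sample sizes above (the use of scipy here is an assumption about tooling, not part of the talk):

```python
from scipy.stats import chi2_contingency

# Counts reconstructed from the reported percentages:
# rows = question version, columns = (Yes, No)
table = [
    [123, 138],  # Original:    47% Yes, 53% No (n = 261)
    [197, 102],  # Alternative: 66% Yes, 34% No (n = 299)
]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.4g}")
# A small p suggests the two wordings produce different estimates.
```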

Split-ballot experiments are needed to find out if wording affects estimates

They also are key to evaluating whether or not questions meet:

Multi-mode standards

Multi-lingual standards

Validity must be established via theory or consistency with other evidence

Psychometric analysis

• Reliability

• Validity

Reliability

• Extent to which the same people, who should not have changed much, give the same answer at two points in time

• Extent to which people for whom the true value is the same give the same answers
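
The first kind of reliability is commonly summarized as the correlation between answers at the two time points. A minimal sketch with made-up data (using Pearson's r, one common choice; nothing here is from the talk):

```python
from scipy.stats import pearsonr

# Hypothetical answers from the same 8 respondents, asked twice
time1 = [3, 5, 2, 4, 4, 1, 5, 3]
time2 = [3, 4, 2, 4, 5, 1, 5, 2]

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability (Pearson r) = {r:.2f}")
```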

Validity

• Extent to which answers measure what they are supposed to measure

• Note: Answers can be reliable and yet not be valid

How to measure validity

• Correspondence between answers and some other "true" measure of what is to be measured. A record-check study is a good example.

• Correspondence between answers and other answers or measures that are supposed to be measuring, or related to, the same thing

And what is bias?

• A systematic difference between survey answers and the "true" scores

• Example: On average, people report their weight as lower than a scale reading

• Answers can be biased and valid: that is, they can be systematically different from the true scores but still correlate highly with them
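
A tiny numeric illustration of "biased yet valid" using the weight example (the numbers are made up): if everyone under-reports by the same amount, answers correlate perfectly with the scale readings but are systematically low.

```python
from statistics import mean
from scipy.stats import pearsonr

# Hypothetical scale readings vs. self-reports that are uniformly 5 lbs low
scale    = [150, 180, 165, 200, 140]
reported = [w - 5 for w in scale]

r, _ = pearsonr(scale, reported)
bias = mean(rep - true for rep, true in zip(reported, scale))
print(f"validity (correlation with scale) = {r:.2f}")  # 1.00
print(f"bias = {bias:+.1f} lbs")                       # -5.0
```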

More on Bias

• And bias is only meaningful with respect to factual questions

• Since it is not possible to obtain a calibrated "true score" for a subjective state (happiness?), bias is meaningless

What does it mean to have a "validated measure"?

• Validity = a correlation coefficient

• If it is a positive number, that means there is evidence that the answers measure the "true scores" to some extent…

Qualifications…

• In the population in which the data were collected

• Under the conditions in which the data were collected

• To the extent that you can be convincing that whatever it is correlated with has something to do with the true value you are trying to measure

Psychometric Testing

• In the end, evidence that you are measuring what you want to measure is the ultimate question standard.

• Often, it is hard to gather that evidence

• And it usually requires forethought and special effort to have the data to assess validity

Thank you.
