
Detecting and Explaining Aberrant Responding to the Outcome Questionnaire–45

Judith M. Conijn (1), Wilco H. M. Emons (1), Kim De Jong (2,3), and Klaas Sijtsma (1)

(1) School of Social and Behavioral Sciences, Tilburg University, Tilburg, Netherlands
(2) Institute of Psychology, Leiden University, Leiden, Netherlands
(3) Research Department, GGZ Noord-Holland-Noord, Netherlands

Corresponding Author: Judith M. Conijn, Department of Clinical Psychology, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, Netherlands. Email: [email protected]

Assessment, 1-12. © The Author(s) 2014. Reprints and permissions: sagepub.com/journalsPermissions.nav. DOI: 10.1177/1073191114560882.

Abstract

We applied item response theory based person-fit analysis (PFA) to data of the Outcome Questionnaire-45 (OQ-45) to investigate the prevalence and causes of aberrant responding in a sample of Dutch clinical outpatients. The l_z^p person-fit statistic was used to detect misfitting item-score patterns, and the standardized residual statistic was used to identify the source of the misfit in the item-score patterns identified as misfitting. Logistic regression analysis was used to predict person misfit from clinical diagnosis, OQ-45 total score, and Global Assessment of Functioning code. The l_z^p statistic classified 12.6% of the item-score patterns as misfitting. Person misfit was positively related to the severity of psychological distress. Furthermore, patients with psychotic disorders, somatoform disorders, or substance-related disorders were more likely to show misfit than the baseline group of patients with mood and anxiety disorders. The results suggest that general outcome measures such as the OQ-45 are not equally appropriate for patients with different disorders. Our study emphasizes the importance of person-misfit detection in clinical practice.

Keywords: aberrant responding, item response theory, outcome measurement, Outcome Questionnaire–45, person-fit analysis

During the previous two decades, the growing interest in the quality of mental health care has led to an increased use of self-report outcome measures (Holloway, 2002). To monitor the effectiveness of treatment for individual patients, outcome measures that assess symptom severity and daily functioning are repeatedly administered during treatment. Based on the repeated measurements, the treatment plan can be altered if recovery does not proceed as expected (Lambert & Shimokawa, 2011). Furthermore, mental health care providers use these outcome data to evaluate treatment results at the institutional level, and insurance companies, health care managers, and other regulatory bodies use outcome measures for policy decisions aimed at improving cost-effectiveness (Bickman & Salzer, 1997). Examples of frequently used outcome measures are the Outcome Questionnaire–45 (OQ-45; Lambert et al., 2004), the Brief Symptom Inventory (BSI; Derogatis, 1993), and the Clinical Outcomes in Routine Evaluation–Outcome Measure (CORE-OM; Evans et al., 2002).

Given the importance of outcome measures for individual decision making in mental health care, their psychometric properties are a major concern (e.g., Doucette & Wolf, 2009; Pirkis et al., 2005). However, even if instruments have excellent psychometric properties, persons may respond aberrantly to clinical and personality scales, thus producing invalid test scores. In fact, response inconsistency to personality and psychopathology self-report inventories was found to be positively related to indicators of psychological distress, psychological problems, and negative affect (Conijn, Emons, Van Assen, Pedersen, & Sijtsma, 2013; Reise & Waller, 1993; Woods, Oltmanns, & Turkheimer, 2008), which suggests that mental health care patients may be inclined to respond aberrantly. Cognitive deficits that are commonly observed in mental illness may explain concentration problems that interfere with the quality of self-reports (Atre-Vaidya et al., 1998; Cuijpers, Li, Hofmann, & Andersson, 2010). However, potential causes of aberrant responding are numerous, including lack of motivation, response styles, idiosyncratic interpretation of item content, and low traitedness. Traitedness refers to the applicability of the trait to the respondent (Tellegen, 1988).


Aberrant responding provides clinicians with invalid information and, as a result, adversely affects the quality of treatment and diagnosis decisions (Conrad et al., 2010; Handel, Ben-Porath, Tellegen, & Archer, 2010). Person-fit analysis (PFA) involves statistical methods to detect aberrant item-score patterns that are due to aberrant responding. Conrad et al. (2010) provided an example of the potential of PFA for mental health care by using PFA to screen for atypical symptom profiles among persons at intake for drug or alcohol dependence treatment. They found that the persons with aberrant item-score patterns required different treatments than persons with model-consistent item-score patterns and concluded that PFA may detect inconsistencies that have important implications for treatment and diagnosis decisions. As self-report outcome measures are increasingly used to make treatment decisions in clinical practice, PFA may be a valuable screening tool in outcome measurement.

The importance of detecting aberrant responding has long been recognized. Both the original and current versions of the Minnesota Multiphasic Personality Inventory (Butcher et al., 2001; Handel et al., 2010) include scales to detect different types of aberrant responding. Examples are lie scales to detect faking good or faking bad, and indices based on the consistency of the responses to items that are either highly similar or opposite in content, such as the Variable Response Inconsistency (VRIN) scale to detect random responding and the True Response Inconsistency (TRIN) scale to detect acquiescence. Despite the importance of validity scales, outcome questionnaires typically do not include specialized scales for detecting aberrant responding (Lambert & Hawkins, 2004). One possible explanation is that with the increasing demand for cost-effectiveness, time for assessment has been reduced greatly (Wood, Garb, Lilienfeld, & Nezworski, 2002). Consequently, outcome questionnaires are required to be short and efficient, which limits the use of validity scales consisting of additional items (e.g., lie scales) and limits the construction of TRIN and VRIN scales because fewer item pairs with similar or opposite content are available.
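To illustrate the rationale behind such consistency indices, the following minimal sketch computes a VRIN-style score from a handful of content-similar item pairs; the pairs, the 0-4 scale, and the example patterns are invented for illustration and are not the MMPI's actual items.

```python
import numpy as np

# Hypothetical VRIN-style index: the item pairs below are invented and are NOT the
# MMPI's actual pairs. Each pair is assumed to have highly similar content, so large
# score differences within pairs suggest random or careless responding.
similar_pairs = [(0, 7), (2, 11), (5, 14)]   # 0-based indices into a 15-item scale

def inconsistency_index(responses: np.ndarray, pairs) -> int:
    """Sum of absolute score differences across content-similar item pairs."""
    return int(sum(abs(int(responses[i]) - int(responses[j])) for i, j in pairs))

rng = np.random.default_rng(0)
careful = np.array([3, 1, 2, 0, 4, 2, 1, 3, 0, 2, 2, 2, 3, 0, 2])   # consistent pattern
random_resp = rng.integers(0, 5, size=15)                            # careless pattern
print(inconsistency_index(careful, similar_pairs),
      inconsistency_index(random_resp, similar_pairs))
```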

Person-Fit Analysis in Outcome Measurement

In this study, we used PFA to investigate the prevalence and possible causes of aberrant responding in outcome measurement by means of the OQ-45. In PFA, person-fit statistics signal whether an individual's item-score pattern is inconsistent with the item-score pattern expected under the particular measurement model (Meijer & Sijtsma, 2001). A significant discrepancy between the observed item-score pattern and the expected item-score pattern provides evidence of person misfit. Person misfit means that the individual's test score is unlikely to be meaningful in terms of the trait being measured.

For noncognitive data, the l_z person-fit statistic (Drasgow, Levine, & McLaughlin, 1987) is one of the best performing and most popular person-fit statistics (Emons, 2008; Ferrando, 2012; Karabatsos, 2003). To determine whether an item-score pattern shows significant misfit, statistic l_z is compared with a cutoff value obtained under the item response theory (IRT; Embretson & Reise, 2000) model that serves as the null model of consistency (De la Torre & Deng, 2008; Nering, 1995). Statistic l_z detects various types of aberrant responding, such as acquiescence and extreme response style, but the statistic is most powerful for detecting random responding (Emons, 2008). In detecting random responding to 57 items measuring the Big Five personality factors, PFA has been found to outperform an inconsistency index based on the rationale of the Minnesota Multiphasic Personality Inventory VRIN scale (Egberink, 2010, pp. 94-100).

An advantage of statistic l_z and other person-fit statistics for application to outcome measurement is that they can be used to detect invalid test scores on any self-report scale that is consistent with an IRT model. Also, the rise of computerized and IRT-based outcome monitoring (e.g., the Patient Reported Outcomes Measurement Information System; Cella et al., 2007) renders the implementation of PFA feasible. Along with the computer-generated test score, a person-fit value may be provided to the clinician, serving as an alarm bell warning that the test score may be invalid and that further inquiry may be useful.

Follow-up PFA of item-score patterns flagged as misfitting can help the clinician infer possible explanations for an individual's observed aberrant responding. In personality measurement, Ferrando (2010) used item-score residuals for follow-up PFA and found that a person who had an aberrant item-score pattern on an extraversion scale showed unexpectedly low scores on items referring to situations in which the person could make a fool of himself. This result suggested that the aberrant responding was due to fear of being rejected. For another person, follow-up PFA suggested inattentiveness to reversed item wording. In outcome measurement, follow-up PFA for individual patients can inform the clinician about the sources of the misfit, and clinicians can discuss the unexpected item scores with the patients to obtain a better understanding of the patient's psychological profile.

PFA primarily focuses on individuals, but it can also be used to explain individual differences in aberrant responding at the group level; for examples, see Conijn, Emons, and Sijtsma (2014) and Conijn et al. (2013). In outcome measurement, PFA can be used to investigate the extent to which general measures are suited for assessing patients suffering from different disorders. General outcome measures, such as the OQ-45 and the CORE-OM, use items that assess the most common symptoms of psychopathology, such as those observed in depression and anxiety disorders (Lambert & Hawkins, 2004), and are also used to assess patients suffering from different, specific disorders, varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.

Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data of a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses using the l_z statistic as the dependent variable to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general, and the potential of PFA for improving outcome-measurement practice.

Method

Participants

We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37; SD = 13). Apart from gender and age, no other demographic information was collected.

Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) patients with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%); anxiety disorders (20%); disorders usually first diagnosed in infancy, childhood, or adolescence (8%); personality disorders (7%); adjustment disorders (6%); somatoform disorders (3%); eating disorders (2%); and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinician had access to the OQ-45 data, but because the OQ-45 is not a diagnostic instrument, it is unlikely that diagnoses were based on the OQ-45 results.

Measures

The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale with scores ranging from 0 (never) through 4 (almost always), with higher scores indicating more psychological distress.

In this study, we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested better fit for either a three-factor model based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or a one-factor model (Mueller, Lambert, & Burlingame, 1998). In this study, we further investigated the fit of the theoretical three-factor model.
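As a point of reference, coefficient alpha can be computed directly from an item-score matrix; the sketch below uses purely simulated data rather than the OQ-45 sample.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an n_persons x n_items matrix of item scores."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()     # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of the total score
    return k / (k - 1) * (1 - item_var / total_var)

# Purely illustrative: 300 simulated respondents answering a 25-item subscale (0-4)
rng = np.random.default_rng(0)
sim = rng.integers(0, 5, size=(300, 25))
print(round(cronbach_alpha(sim), 2))
```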

Explanatory Variables for Person Misfit

Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.

Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that misfit was less likely for patients suffering from these symptoms than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously into one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).

Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. At the core of the GRM are the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).

Satisfactory GRM fit to the data is a prerequisite for the application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean squared error of approximation (RMSEA) and the standardized root mean residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).
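To make the GRM concrete, the following sketch (our illustration, not the authors' software; the item parameters are made up) computes graded-response category probabilities from logistic ISRFs.

```python
import numpy as np

def grm_category_probs(theta: float, a: float, b: np.ndarray) -> np.ndarray:
    """Category probabilities P(X = m | theta) for one polytomous item under the GRM.

    a : discrimination parameter; b : increasing step (threshold) parameters.
    Each ISRF P(X >= m | theta) is logistic; category probabilities are
    differences between adjacent ISRFs.
    """
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= m | theta), m = 1..M
    cum = np.concatenate(([1.0], isrf, [0.0]))      # add P(X >= 0) = 1 and P(X >= M+1) = 0
    return cum[:-1] - cum[1:]

# Illustrative 5-category item (scores 0-4) with four ordered thresholds
probs = grm_category_probs(theta=0.5, a=1.2, b=np.array([-1.5, -0.5, 0.5, 1.5]))
print(probs.round(3), probs.sum())                  # probabilities sum to 1
```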

Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis.

Each entry lists: common DSM-IV diagnoses included; n; mean l_zm^p; number detected; percentage detected.

Mood and anxiety disorder (a): depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder; n = 1,786; mean l_zm^p = 0.28; detected 229 (12.8%).
Somatoform disorder: pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder; n = 82; mean l_zm^p = 0.16; detected 16 (19.5%).
Attention deficit hyperactivity disorder (ADHD): predominantly inattentive, combined hyperactive-impulsive and inattentive; n = 198; mean l_zm^p = 0.08; detected 15 (7.6%).
Psychotic disorder: schizophrenia, psychotic disorder not otherwise specified; n = 26; mean l_zm^p = −0.10; detected 7 (26.9%).
Borderline personality disorder: borderline personality disorder; n = 53; mean l_zm^p = 0.35; detected 2 (3.8%).
Impulse-control disorders not elsewhere classified: impulse-control disorder, intermittent explosive disorder; n = 58; mean l_zm^p = 0.02; detected 10 (17.2%).
Eating disorder: eating disorder not otherwise specified, bulimia nervosa; n = 67; mean l_zm^p = 0.38; detected 4 (6.0%).
Substance-related disorder: cannabis-related disorders, alcohol-related disorders; n = 58; mean l_zm^p = 0.09; detected 13 (22.4%).
Social and relational problem: phase of life problem, partner relational problem, identity problem; n = 186; mean l_zm^p = 0.26; detected 20 (10.8%).

a. Including 65% patients with a mood disorder.


Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted by l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics. Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reverse-worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious because they may reflect gross under- or overreporting of symptoms.

Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.

Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM using the item parameters and the person's estimated θ value. For each generated item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
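The procedure can be sketched as follows in Python; this is our simplified illustration rather than the authors' C++ program, with a crude grid-based θ estimate standing in for the MULTILOG/Baker-Kim algorithms and made-up item parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, b):
    """Category probabilities (items x categories) under a GRM with
    discriminations a (shape J) and ordered thresholds b (shape J x M)."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))        # P(X >= m), m = 1..M
    cum = np.hstack([np.ones((len(a), 1)), isrf, np.zeros((len(a), 1))])
    return cum[:, :-1] - cum[:, 1:]

def lz_p(x, theta, a, b):
    """Standardized log-likelihood person-fit statistic (Appendix A, Equations A1-A2)."""
    p = grm_probs(theta, a, b)
    logp = np.log(p)
    observed = logp[np.arange(len(x)), x].sum()                   # l^p of the observed pattern
    expected = (p * logp).sum()                                   # E(l^p)
    variance = (p * logp**2).sum(axis=1) - (p * logp).sum(axis=1) ** 2
    return (observed - expected) / np.sqrt(variance.sum())

def estimate_theta(x, a, b, grid=np.linspace(-4, 4, 161)):
    """Crude grid-based ML estimate of theta (illustration only, not MULTILOG)."""
    ll = [np.log(grm_probs(t, a, b))[np.arange(len(x)), x].sum() for t in grid]
    return grid[int(np.argmax(ll))]

def bootstrap_p_value(x, a, b, n_rep=1000):
    """Person-specific bootstrap p value for l_z^p, following the procedure above."""
    theta_hat = estimate_theta(x, a, b)
    lz_obs = lz_p(x, theta_hat, a, b)
    probs = grm_probs(theta_hat, a, b)
    null = np.empty(n_rep)
    for r in range(n_rep):
        sim = np.array([rng.choice(len(p), p=p) for p in probs])  # pattern simulated at theta_hat
        null[r] = lz_p(sim, estimate_theta(sim, a, b), a, b)      # theta re-estimated per pattern
    return float(np.mean(null <= lz_obs))                         # one-tailed p value

# Tiny illustration: 10 five-category items, one randomly responding patient
a = np.full(10, 1.2)
b = np.tile(np.array([-1.5, -0.5, 0.5, 1.5]), (10, 1))
x = rng.integers(0, 5, size=10)
print(round(bootstrap_p_value(x, a, b), 3))       # flag as misfitting if p < .05
```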

Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).

Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
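A hedged sketch of such a model, using statsmodels and made-up variable names and data (not the study's data set), might look like the following.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame; the column names are ours, not the authors'.
# 'misfit' is the dichotomous l_zm^p classification (1 = misfit); 'diagnosis'
# stands in for the nine-category factor (only four categories are mocked up here).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "misfit": rng.integers(0, 2, n),
    "gender": rng.integers(0, 2, n),          # 0 = male, 1 = female
    "occasion": rng.integers(0, 2, n),        # 0 = at intake, 1 = during treatment
    "diagnosis": rng.choice(["mood_anxiety", "somatoform", "psychotic", "substance"], n),
    "gaf": rng.normal(55, 10, n),
    "oq45_total": rng.normal(70, 20, n),
})

# Dummy-code diagnosis with mood/anxiety disorders as the baseline category (cf. Model 2)
X = pd.get_dummies(df.drop(columns="misfit"), columns=["diagnosis"])
X = X.drop(columns="diagnosis_mood_anxiety").astype(float)
X = sm.add_constant(X)

result = sm.Logit(df["misfit"], X).fit(disp=False)
print(result.summary())
```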

Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measured substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13; SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were only found for some items of the SD and SR subscales, respectively. Thus, EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.

Figure 1. Standardized residuals plotted by item number for fitting patient #663 (l_zm^p = 3.54; upper panel) and misfitting patient #2752 (l_zm^p = −7.96; lower panel). Each panel covers the SR, IR, and SD subscales.

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence; 88% of the responses in the most extreme category). We concluded that, despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.

Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45; SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient #663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient #2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient #663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.

Patient #2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had several unexpectedly high as well as unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, all but 10 item scores were either 0s or 4s. Hence, apart from potential content-related misfit, an extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were less likely to show misfit than the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included the GAF code and the OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 total score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.
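These probabilities follow from the Model 2 coefficients in Table 2 via the logistic function; a minimal arithmetic check (with the other predictors held at their baseline values) is shown below.

```python
import numpy as np

def inv_logit(x: float) -> float:
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

intercept = -1.93                      # Model 2 intercept (baseline: mood/anxiety disorders)
effects = {"mood/anxiety (baseline)": 0.00, "somatoform": 0.74,
           "psychotic": 1.13, "substance related": 0.69}

for category, beta in effects.items():
    # Estimated misfit probability for this category, other predictors held at 0
    print(f"{category}: {inv_logit(intercept + beta):.2f}")
# mood/anxiety 0.13, somatoform 0.23, psychotic 0.31, substance related 0.22
```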

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit).

Predictor: Model 1 | Model 2
Intercept: −1.84 (0.11)*** | −1.93 (0.11)***
Gender: −0.12 (0.13) | −0.12 (0.13)
Measurement occasion: −0.17 (0.27) | −0.18 (0.27)
Diagnosis category (a):
  Somatoform: 0.57 (0.29)* | 0.74 (0.29)*
  ADHD: −0.58 (0.28)* | −0.39 (0.28)
  Psychotic: 1.05 (0.46)* | 1.13 (0.47)*
  Borderline: −1.30 (0.72) | −1.39 (0.73)
  Impulse control: 0.35 (0.36) | 0.57 (0.36)
  Eating disorders: −1.10 (0.60) | −0.97 (0.60)
  Substance related: 0.66 (0.33)* | 0.69 (0.33)*
  Social/relational: −0.20 (0.26) | 0.08 (0.27)
GAF code: — | −0.17 (0.07)*
OQ-45 total score: — | 0.26 (0.07)***

Note. N = 2,434. The correlation between the GAF code and OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.

Discussion

We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception; in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test taking, and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, as the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible, general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures rather than general outcome measures should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms, and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for assessing severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores, but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). In that case, PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analyses such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990) to study whether patients with the same disorder show similar patterns of misfit.

To conclude, our results have two main implications pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (indexed j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x of person i is given by

  l_i^p(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i).   (A1)

The standardized log-likelihood is defined as

  l_{z_i}^p(\mathbf{x}) = \frac{l_i^p(\mathbf{x}) - E[l_i^p(\mathbf{x})]}{\{\mathrm{VAR}[l_i^p(\mathbf{x})]\}^{1/2}},   (A2)

where E(l^p) is the expected value and VAR(l^p) is the variance of l^p.

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

  e_{ij} = X_{ij} - E(X_{ij}),   (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

  \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2.   (A4)

The standardized residual is given by

  z_{e_{ij}} = \frac{e_{ij}}{[\mathrm{VAR}(e_{ij})]^{1/2}}.   (A5)

To compute z_{e_{ij}}, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
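A minimal sketch implementing Equations A3 to A5 (with made-up category probabilities standing in for GRM-derived ones) is given below.

```python
import numpy as np

def standardized_residuals(x: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Standardized item-score residuals z_e (Equations A3-A5).

    x : observed item scores, shape (J,)
    P : category probabilities at the estimated theta, shape (J, M + 1)
    """
    m = np.arange(P.shape[1])                 # possible scores 0, ..., M
    expected = P @ m                          # E(X_j) per item
    variance = P @ m**2 - expected**2         # VAR(X_j) per item
    return (x - expected) / np.sqrt(variance)

# Illustration with made-up category probabilities for three 5-category items
P = np.array([[0.05, 0.15, 0.30, 0.30, 0.20],
              [0.40, 0.30, 0.15, 0.10, 0.05],
              [0.10, 0.20, 0.40, 0.20, 0.10]])
x = np.array([4, 4, 0])
print(standardized_residuals(x, P).round(2))   # |z_e| > 1.64 flags an unexpected item score
```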

Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as

  \mathrm{logit}[P(\boldsymbol{\theta})] = 1.702(\boldsymbol{\alpha}'\boldsymbol{\theta} + \delta).

Vector θ denotes the latent traits and has a multivariate standard normal distribution; the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) to generate replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error and patterns due to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation.

Each entry lists M (SD) and range for the SR, IR, and SD subscales, respectively.

Discrimination
α_θ1: SR 0.76 (0.47), 0.23 to 1.46; IR 0.84 (0.37), 0.36 to 1.65; SD 0.90 (0.26), 0.52 to 1.32
α_θ2: SR 0.26 (0.50), −0.09 to 1.31; IR 0.02 (0.37), −0.48 to 0.73; SD 0.03 (0.40), −0.74 to 0.54
α_θ3: SR —; IR —; SD 0.02 (0.27), −0.56 to 0.53

Threshold
δ1: SR 0.98 (0.70), −0.25 to 1.84; IR 1.12 (0.66), 0.14 to 2.21; SD 1.76 (0.84), −0.11 to 3.08
δ2: SR 0.10 (0.60), −0.88 to 0.80; IR 0.07 (0.52), −0.98 to 0.95; SD 0.77 (0.65), −0.70 to 1.75
δ3: SR −0.82 (0.53), −1.57 to −0.19; IR −1.11 (0.67), −2.40 to −0.18; SD −0.40 (0.53), −1.67 to 0.62
δ4: SR −1.84 (0.50), −2.60 to −1.26; IR −2.28 (0.87), −3.62 to −1.09; SD −1.74 (0.53), −2.87 to −0.70

Latent-trait correlations
r_θ1θ2: SR .20; IR .50; SD −.42
r_θ1θ3: SR —; IR —; SD .53
r_θ2θ3: SR —; IR —; SD −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items, the δ shift was positive, and for the negatively worded items, the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27 and 25, respectively; for moderate acquiescence, the percentages were 20 and 45; and for strong acquiescence, 12 and 88.
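The following sketch illustrates how such misfitting patterns can be injected into simulated data; it uses a simplified unidimensional GRM with made-up parameters as a stand-in for the MIRT generation described above.

```python
import numpy as np

rng = np.random.default_rng(42)
J, M = 24, 4                                        # items and highest category score

def grm_probs(theta, a, b):
    """Category probabilities (J x M+1) under a unidimensional GRM (illustrative stand-in)."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))
    cum = np.hstack([np.ones((len(a), 1)), isrf, np.zeros((len(a), 1))])
    return cum[:, :-1] - cum[:, 1:]

def draw_pattern(theta, a, b):
    P = grm_probs(theta, a, b)
    return np.array([rng.choice(M + 1, p=p) for p in P])

a = np.full(J, 1.0)
b = np.sort(rng.normal(0.0, 1.0, size=(J, M)), axis=1)   # ordered step parameters

theta = rng.normal()
clean = draw_pattern(theta, a, b)                   # model-consistent pattern

# Random-error misfit: overwrite k item scores with uniformly random categories
k = 20
idx = rng.choice(J, size=k, replace=False)
random_error = clean.copy()
random_error[idx] = rng.integers(0, M + 1, size=k)

# Acquiescence misfit: shift the step parameters so that extreme agreement becomes
# far more likely (analogous to shifting the deltas by 1-3 points for weak/moderate/strong)
shift = 3.0                                         # "strong" acquiescence, illustrative
acquiescent = draw_pattern(theta, a, b - shift)

print(clean, random_error, acquiescent, sep="\n")
```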

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087; first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., . . . on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory, multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI Brief Symptom Inventory: Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.

Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus Short Course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic%201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics’ accuracy: A Monte Carlo study of the aberrance rate’s influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
