Mike Zeddies
SurvMeth 721
TSE II
Winter 2013
Final Version

Applying Total Survey Error Methods to a Philosophical Experiment

Abstract

Experimental philosophy (or X-Phi) has been a growing field of quantitative social science since 2001. Experimental philosophers typically perform cognitive experiments by presenting subjects with philosophical vignettes that measure cognitive judgments, though method and mode can vary substantially. Vignettes are typically presented via self-administered modes, often PAPI or WASI, and require substantial design; hence we may consider them from a survey research perspective, in particular by employing the Total Survey Error (TSE) paradigm. Experimental philosophers themselves have already begun citing the survey research literature, and research in experimental philosophy often intersects with research in cognitive science and political science, both related to the survey methodological discipline. Machery, Olivola, and de Blanc (MOB) in 2009 published the results of a one-item instrument, manipulated experimentally and administered at four international sites, that also collected additional demographic data on respondents. This paper investigates that study from the TSE perspective, finding in particular an age-related effect on response propensity, congruent with MOB's attention to cognitive processes, as well as further details regarding the sub-population estimates that MOB express interest in. This paper critiques MOB's questionnaire design and data-handling procedures, and uncovers a processing error that may seriously undermine their results. Nevertheless, MOB's efforts may suggest a model for quantifying operationalization error, and this paper suggests some directions for future survey research that utilize the methods of experimental philosophy.

INTRODUCTION

Experimental philosophy is a growing field generating new and interesting data about human cognition. A groundbreaking article in this field was Weinberg, Nichols and Stich 2001, which demonstrated meaningful differences in cognitive reasoning between individuals from different cultural backgrounds. A follow-up to this article was Machery et al 2004, which showed different styles of philosophical thinking among East Asian (Hong Kong) students and Western (American) students. Among the larger social science community there is growing interest in these claims of different cognitive (or "philosophical") styles between individuals from different cultural groups, as defined and explored for example in Nisbett et al 2001. Building on the findings of Nisbett et al, Machery et al presented evidence that East Asian thinkers tend to use "descriptivist" intuitions (for example, "Shakespeare" means "whoever wrote Hamlet") and Western thinkers tend to use "causal-historical" intuitions (for example, "Shakespeare" means "the Elizabethan actor, whether he wrote Hamlet or not"). Such findings, if true, may have serious implications for survey research, in terms of anticipating the different ways respondents may react to certain kinds of survey questions, or even to the survey experience in general. They could also affect the methods used in the cognitive testing of survey items and questions, or the expectations researchers ought to have of focus groups.

There have been a number of efforts to follow up on Machery et al's results. Lam 2010, for example, attempted to replicate Machery's results, but either failed to find an effect, or in some cases found a reversed effect between East Asians and Westerners. Lam surmised that the real effect was not cultural but linguistic, that is, that it depended both on the language of the subject and on the language of the questionnaire he or she received. Several papers have been exchanged on these particular claims, and these questions seem of a piece with the larger debate within the social science community regarding the applicability of research conducted on North American and Western European test subjects (see Henrich, Heine, and Norenzayan 2009 and commentary).

In fact, experimental philosophers have already begun incorporating survey research methods in their experiments, which involve the composition of philosophical vignette items, usually followed by brief batteries of questions. Feltz and Cokely 2007 and Cokely and Feltz 2008 discovered an item-order effect (one that held only for women) in Knobe 2003's well-known experiment uncovering asymmetries in ethical reasoning. Beebe and Buckwalter 2010 cited Norbert Schwarz's 1991 paper on response scales while replicating Knobe 2003's findings, using a 1-to-7 Likert scale in place of the -3-to-3 scale used in Knobe's original experiment. John Turri (unpublished) has also shown how an unfolding item design can improve question comprehension in philosophical experiments, eliminating certain findings in this area of research. These particular findings may or may not be relevant to survey research, but they show that experimental philosophers are increasingly aware of survey research methodology. It seems it is time for survey research, in turn, to begin paying close attention to these developments in experimental philosophy.

Bringing the TSE Perspective to Experimental Philosophy

This paper may be among the first analyses (if not the only one) of a philosophical experiment from the total survey error (TSE) perspective. As a result, its methods and conclusions may be somewhat exploratory and tentative. This approach will not, for example, examine the components of mean squared error (MSE) for this study, even though this is a traditional path for TSE analysis (see Biemer 2010, for example). One reason for this is that MOB do not report survey statistics; their publication focuses on a comparison of proportions for answers to a pair of two-valued questions, hence they employ only a chi-square test of their results. TSE research does not always analyze MSE and its components: see, for example, Curtin, Presser and Singer 2005, who used response rates and the CASRO eligibility rate e to show how these measures altered trends in telephone survey nonresponse. (Admittedly, non-response is a rather important topic for TSE.) Despite my own paper's somewhat unorthodox approach to TSE, I believe it will be fruitful in drawing attention to meaningful concerns that survey researchers might have with experimental philosophy, and also in improving the strength and validity of future findings in X-Phi.

TSE is often formulated as having six primary components (see a recent review by Groves and Lyberg 2010). My analysis may briefly touch on each of these, but it will be primarily focused on errors of measurement related to the validity of MOB's outcome of interest (answer choice on a philosophical item), as well as potential measurement error related to the design of their philosophical items. An examination of processing error, using the additional variables MOB collected but did not report on, will also help shed light on the robustness of their results, and on the robustness of any effects our analysis may uncover. A brief examination of the distribution of these variables will also provide some perspective on errors of representation (coverage, sampling, and non-response), but this will not be the primary focus of this paper. As a field experiment, MOB did not use traditional sampling methods, but we can consider MOB's inferences from their multi-site experiment to the world population in light of coverage error, and their random assignment of cases (by instrument type) as a process analogous to random sampling.

Outline of Paper

I will first describe Machery, Olivola and de Blanc's research for the benefit of those readers unfamiliar with it, and form my own research hypothesis about the data. Next I will describe the instrument and the data, and present a logistic regression model for analysis, including a brief discussion of coding and recoding. I then present the results of the research: this begins with an analysis of the complexity of the instrument items and moves to a re-examination of MOB's results. Next, I present my own logistic models, beginning with comparisons of single-variable models, moving to the complete model, and finishing with an examination of possible reduced models. These models are varied using different scenarios coded on respondent exposure to living or studying abroad. This is followed by a discussion of those results and their potential relationship with survey error. Particular attention is drawn to the processing error underlying the scenarios used in the analysis. I conclude that the TSE paradigm can be effective in analyzing and qualifying the results of experimental philosophy research.

Machery, Olivola and de Blanc 2009

One criticism of Machery et al 2004 was presented by Marti 2008, who suggested their analysis had not been complete. She claimed they had only examined "meta-linguistic" intuitions, or intuitions about how our words should theoretically refer to objects in our world, rather than "linguistic" intuitions, or intuitions about how people actually use words to refer to objects. An example will suffice: one might think that "Shakespeare" means "whoever wrote Hamlet", and that is a meta-linguistic intuition. But the same person, watching someone else talk about "Shakespeare", might think that person actually intends to mean someone else, and that is a linguistic intuition. Meta-linguistic intuitions could be said to describe how we think words ought to refer to things, and linguistic intuitions could be said to describe how we think words are intended to refer to things by their speakers, whether rightly or wrongly. As a result, Machery et al produced a follow-up experiment in Machery, Olivola and de Blanc 2009 that purported to fill the gap Marti identified.

The researchers presented participants with both a linguistic item and a meta-linguistic item. Both items had two responses: the first indicated a "descriptivist" intuition, and the second indicated a "causal-historical" intuition (as described in the Introduction above). The "causal-historical" intuition is the one favored by philosophers in the Western tradition, and has generally been seen as the "correct" answer ever since the work of Saul Kripke in the 1970s (hence such answers are also called "Kripkean" answers).


Machery et al collected data at four sites: two in India (at different colleges), one in France, and one in Mongolia. The Indian sites were presented with an English-language instrument (provided in MOB 2009), the French site with a French-language instrument, and the Mongolian site with a Mongolian-language instrument. The researchers experimentally divided each site by presenting one or the other item (linguistic or meta-linguistic) to each respondent via a randomization process. Additional variables were collected on various measures, mostly demographic, but different variables were collected at different sites; only a subset of these variables was collected at all four sites. The respondents' answers (answer A or answer B) were recorded. The results were published in Analysis in 2009; within each of the three sub-samples (the combined Indian sites, the French site, and the Mongolian site), no significant differences were found between responses to the two item types.

Figure 1. MOB’s graphic from their 2009 results

MOB showed that responses to questions about linguistic usage correspond with responses to questions about meta-linguistic concepts, hence they conclude that when differences in question response are observed (such as in Machery et al 2004), these reflect genuine differences in overall cognition, and not some other effect. Philosophers may therefore, they argue, proceed with experimental philosophy research without worrying that a linguistic effect among a national or cultural population might differ from a meta-linguistic effect among the same population (and vice versa): the detection of one effect will necessarily imply the presence of the other. Because differences among national and cultural populations have been observed in experimental samples in some philosophical studies, MOB's argument could have serious implications for the conclusions we draw from that research in relation to cultural variation (or similarity) in cognition.

Thus, to summarize MOB’s findings:


At each of MOB’s experimental sites, respondents chose the “Kripkean” or “causal-historical” answer (i.e. answer B) more often than the “descriptivist” answer (i.e. answer A).

Although respondents did choose answer B more often for the "linguistic" item than for the "meta-linguistic" item, these differences were not statistically significant within each sub-sample (authors' reported chi-squares: India <.9 [1, N=83], France <.4 [1, N=66], Mongolia <.8 [1, N=78]). In other words, answer B was also chosen more often than A regardless of the item presented.

Slightly fewer French cases selected answer B to the meta-linguistic item than answer A, but again this proportion was not found to be significantly different than the proportion of French cases who selected answer B for the linguistic item.

My own chi-squared comparison among all six proportions likewise indicates no meaningful differences, χ2 [1, N=233] <.74.

Research Hypothesis

Regardless of whether MOB's argument is true, their procedures may be examined to determine potential sources of error. As simple as the design may be, their study is effectively a survey: it presented a one-item questionnaire, with experimental variation, to respondents at multiple sites, hence (I argue) it can be analyzed from the TSE perspective. I therefore propose that the additional variables MOB collected, especially those related to language, can be used to investigate potential sources of error in MOB's experiment, and perhaps also to explain the philosophical phenomena they describe. I hypothesize that these variables will help us uncover error in MOB, and that this will mitigate their findings. If my hypothesis fails, that will help validate MOB's findings. I will emphasize two primary sources of error: measurement error from item complexity, and processing error (related also to measurement and/or item nonresponse error on process-related variables).

Item Complexity

To both a casual and an expert observer, MOB's items seem unusually difficult (see the Data section below). Survey researchers continue to develop standards for evaluating item design. One set of standards relates to the complexity of the item, and its relationship to comprehension by a respondent (the first step in Tourangeau's schema of the response process, and hence of central importance to data collection and potential survey error). The plausibility of this relationship seems fairly straightforward and is discussed at length in the literature: to comprehend an item, respondents must linguistically understand its lexis, syntax, semantics, and pragmatics. Two types of measures have general support among theorists as measures of item complexity and difficulty: question length and question complexity, as measured variously by Knauper et al. 1997, Holbrook, Cho and Johnson 2006, and Yan and Tourangeau 2008 (see especially Schaeffer and Dykema 2011, pp. 918-19). Knauper et al. and Holbrook, Cho and Johnson suggest word counts as a measure of question length, and Yan and Tourangeau 2008 point to the number of clauses per question, as well as the number of words per clause. Perhaps an even more useful, though less quantifiable, measure is the "level of abstraction" proposed by Holbrook, Cho and Johnson, where "abstract items" were defined as "those for which the major concept introduced by the question was not grounded in physical reality". Thus we will be examining MOB's items for potential complexity and its relationship to measurement error.
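As an aside, word-count measures of this kind are easy to reproduce mechanically; a minimal Stata sketch, using the opening sentence of MOB's vignette (quoted in full in the Data section below):

* Count the words in a question fragment with Stata's built-in wordcount() function
display wordcount("Ivy is a high school student in Hong Kong.")   // returns 9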

Other Factors in Item Comprehension

MOB's results may also follow from errors in interpretation of their items. Ji, Zhang and Nisbett 2004 demonstrated an effect on experimental results based on both the language of the test subject and the test subject's cultural background, which varied in particular by age of second-language acquisition. Kleiner, Pan and Bouic 2009 note that teams of translators speaking different languages and translating survey instruments into those languages favored different strategies for translation. Thus it is worth studying both the language of the instrument and the native language of respondents in order to uncover potential effects of comprehension difficulties. Lam 2010, mentioned above, also drew attention to the interaction of the language of the instrument and of the participant in philosophical experiments. Thus I examine variables MOB collected indicating 1) exposure to travel abroad, as indicated above, 2) a shared primary language with the language of the questionnaire, 3) the language of the questionnaire itself, as well as 4) educational status. The language variables in particular relate to potential errors in comprehension; educational status does not relate directly to these potential errors, but it does relate indirectly, as increased levels of education may increase the chances of exposure to foreign languages, and of exposure to those languages at increasingly complex levels of comprehension.

Finally, I will include an alternative test of MOB’s research hypothesis, that philosophical item type (meta-linguistic vs. linguistic) does not predict differences in respondent answers to those items, by including instrument or item type (linguistic or meta-linguistic) in my regression model as a predictor variable.

Potential Processing Error

I have also uncovered various sources of error that, at the very least, violate best practices for data handling. For example, MOB performed three experiments in three different countries, but appear to have applied different rules for including subjects in their final analysis, eliminating subjects with periods of foreign residence for one country but ignoring this indicator for the other countries. This suggests that their results are not really comparable, despite the authors' assumption that they are. Thus I further propose that adding and removing these problematic cases will change the research findings.

DATA AND METHODS

The researchers provided me with their data on request. Data from each sub-sample (India, France, and Mongolia) were collected and stored in a separate file. Because the variables collected were quite different at each site, a significant amount of editing was necessary to combine these files into a single file for analysis. Only those variables collected in all three sub-samples were used for modeling; one variable collected only in the Indian and French samples (indicating living abroad) was also used to build alternate scenarios for the analysis of processing error.


The instrument was a self-administered paper-and-pencil questionnaire, though there was a site administrator at each site. The different questionnaires were as follows:

Item 1 (the “linguistic” item):

Ivy is a high school student in Hong Kong. In her astronomy class, she was taught that Tsu Ch’ung Chih was the man who first determined the precise time of the summer and winter solstices. But, like all her classmates, this is the only thing she has heard about Tsu Ch’ung Chih.

Now suppose that Tsu Ch’ung Chih did not really make this discovery. He stole it from an astronomer who died soon after making the discovery. But the theft remained entirely undetected and Tsu Ch’ung Chih became famous for the discovery of the precise times of the solstices.

Everybody is like Ivy in this respect; the claim that Tsu Ch’ung Chih determined the solstice times is the only thing people have heard about him. Having read the above story and accepting that it is true, when Ivy says, ‘Tsu Ch’ung Chih was a great astronomer’, do you think that her claim is: (A) true or (B) false?

Item 2 (the “meta-linguistic item”):

Ivy is a high school student in Hong Kong. In her astronomy class, she was taught that Tsu Ch’ung Chih was the man who first determined the precise time of the summer and winter solstices. But, like all her classmates, this is the only thing she has heard about Tsu Ch’ung Chih.

Now suppose that Tsu Ch’ung Chih did not really make this discovery. He stole it from an astronomer who died soon after making the discovery. But the theft remained entirely undetected and Tsu Ch’ung Chih became famous for the discovery of the precise times of the solstices.

Everybody is like Ivy in this respect; the claim that Tsu Ch’ung Chih determined the solstice times is the only thing people have heard about him. Having read the above story and accepting that it is true, when Ivy uses the name ‘Tsu Ch’ung Chih’, who do you think she is actually talking about:

(A) the person who (unbeknownst to Ivy) really determined the solstice times?

or

(B) the person who is widely believed to have discovered the solstice times, but actually stole this discovery and claimed credit for it?

My variables of interest include:

Answer Choice

This forms both MOB’s outcome variable and my own: answer “A” or “B” as described in the items above.


                 A                           B
                 Linguistic     Meta        Linguistic     Meta
India            16             18          31             24
France           7              26          9              24
Mongolia         12             19          23             24

Two Mongolian cases did not report an answer.

Instrument-Item Assignment

                 Meta-Linguistic item    Linguistic item
India            42                      47
France           50                      16
Mongolia         43                      35

The French assignment of items deviates from random expectations significantly (χ2 [1, N=66] <.00003). The cause of this is unknown. This could be viewed as a kind of sampling error, and will be discussed below.
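As a rough check on this figure, the deviation can be reproduced as a one-degree-of-freedom goodness-of-fit test, assuming a 50/50 randomization target (33 cases expected per item); a minimal Stata sketch using the counts above:

* Pearson goodness-of-fit for the French assignment: 50 meta-linguistic vs. 16 linguistic, 33 expected each
display ((50-33)^2 + (16-33)^2)/33               // test statistic, roughly 17.5
display chi2tail(1, ((50-33)^2 + (16-33)^2)/33)  // upper-tail p-value, roughly 2.9e-05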

Age
                 Full sample average    MOB average         2010 median    Range
India            20.3                   20.4                25.9           18-26
France           30.3                   32.3                39.7           18-70
Mongolia         29.3                   29.4 (corrected)    25.8           18-62

The MOB average apparently excludes the six Indian cases that went abroad for education. I have been unable to determine the source of the discrepancy for France between the full sample average and the MOB average. The mean age for Mongolia was misprinted in MOB as 39.3; the "corrected MOB" average above apparently excludes the two Mongolian cases with item missing data on answer choice, i.e. the outcome variable of interest.

Gender
                 Male/Female ratio, 15-64 (as of 2013)    Full sample ratio
India            1.06                                     39/50 (.78)
France           1.00                                     26/37 (.70)
Mongolia         1.00                                     53/26 (2.03)

Three French cases lacked values on the gender variable. The deviation of the Mongolian sample from the expected proportions is significant (χ2 [1, N=79] <.002).

Language
                 % Speaking Language of Instrument-Item Sample (= Language of Instrument)
India            30.3%
France           90.5%
Mongolia         97.5%

N_India = 89, N_France = 66, N_Mongolia = 80, N_Total = 235

Education (Highest Completed)

                 India    France    Mongolia
High School      38       13        41
Some College     --       21        36
College          51       2         3
Masters-level    --       19        --
Ph.D.            --       7         --
Totals:          89       62        80

Four French cases were missing data on this variable. In addition, although 38 Indian cases did report "High School" as their highest completed level, it is unclear at what point the researchers considered "some college" to begin. It may be that "some college" begins at the point one leaves college without a degree; nevertheless, it is entirely unclear whether this was explained to any of the respondents.

Education and Living Abroad

Education abroad was recorded for all three samples: six Indian cases indicated study abroad, as did nine French cases (with three missing data on this item) and one Mongolian case.

As for living abroad, this variable was actually collected in two different ways: in India, it was collected with an item that measured whether a participant had spent his or her entire life in India, while in France, it was collected with an item that measured where a participant had spent most of his or her life. In Mongolia, this information was not collected at all.

Hence, all Indian cases who had not studied abroad indicated they had lived in India their entire lives; the six who had studied abroad indicated they had not lived in India their entire lives. For the French sample, by contrast, the "living" variable had no such relationship with the "study abroad" variable. Eleven French cases indicated they had spent most of their lives outside of France; seven cases either were missing data on this item or provided invalid answers (for example: "sous le soleil", or "under the sun"; "planete terre", or "Planet Earth"; and so on), leaving 73% of French cases who indicated they had lived most of their lives in France.

Other variables were collected for both the Indian and French sub-samples but these were not used by either MOB or by myself for analysis.

STATISTICAL ANALYSIS

I use a logistic regression model, with answer choice ("A" or "B") as the response/outcome or dependent variable, and, as independent variables, the six variables that are shared across all sub-samples: age, gender, native language (recoded), education (recoded), sub-sample (= language of instrument/country of collection; coded), and instrument-item type. I also generate an indicator variable for education abroad, and use it to vary the cases included in my analyses, beginning with the "published" sample that left out six cases from the Indian sub-sample, but no others. Our hypothesis is that these variables together will predict outcomes on MOB's philosophical experiment.

Thus we have the model:

ln(p̂ / (1 − p̂)) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + ε

Where p̂ is the proportion of respondents who answer “B” to an item

X1 = age of respondent

X2 = gender of respondent

X3 = indication of respondent native language shared with instrument

X4 = respondent college graduate status

X5 = sub-sample (language of instrument/country of collection)

X6 = instrument-item type ("linguistic"/"meta-linguistic")

These variables also have relatively low item-missing-data rates:

Variable                   Missing Data
Age                        2%
Gender                     2%
Shared Native Language     1%
College Graduate Status    2%
Sub-Sample/Language        0%
Instrument-Item Type       0%

Table 1. Item Missing Data Rates

Software

The Stata 12 software package was used for analysis, utilizing the logit command.

Recoding

As described above, the data had to be re-worked substantially before they were ready for analysis. Since the native language of respondents was recorded, it could be matched with the language of the instrument to create a two-valued variable that measures whether the respondent was reading the instrument in his or her native language. This could have implications for comprehension of difficult items, and also provides one way to examine the issues that Lam 2010 raises, for example, when he claims that respondents will answer items in foreign languages differently than items in their own native tongue. Likewise, after investigating levels of education, I decided that to eliminate ambiguity it was best collapsed into just two levels: one expressing college graduate status, and the other expressing the lack of a college degree.

For my indicator variables, two variables were constructed that measure, on the one hand, exposure to education abroad, and on the other, exposure to living abroad. These were used either independently or jointly to include or exclude different sub-groups from the sample, in an effort not only to replicate the sample the authors used to reach their substantive conclusions, but also to test those conclusions by enforcing the exclusion rules either less or more stringently. Results from these tests will be shown below; otherwise the text refers to the “published” sample that the authors used, which did not include 6 Indian cases. (MOB collected an additional variable on nationality that occasionally recorded mixed or foreign nationality, but because the relationship of this variable to actual living abroad was not specified, I ignored it for the purposes of analysis.)
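The recoding and the model fit can be summarized in a brief Stata 12 sketch. The variable names used here (answer, native_lang, instr_lang, educ_level, gender, site, item_type) are hypothetical placeholders rather than the names in MOB's files, and the education recode assumes a numeric coding in which values of 3 and above indicate a completed college degree:

* Recode the shared variables and fit the binary logit described above
gen byte answerB     = (answer == "B") if answer != ""            // outcome: 1 = chose answer B
gen byte sharedlang  = (native_lang == instr_lang)                // native language matches instrument language
gen byte collegegrad = (educ_level >= 3) if !missing(educ_level)  // collapse education: degree vs. no degree
logit answerB age i.gender sharedlang collegegrad i.site i.item_type

The scenario variations described below then amount to re-running the same call on different subsets of cases (for example, dropping or retaining the cases with exposure abroad), rather than changing the model itself.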

RESULTS

Item Complexity

We begin by evaluating the vignette items for item design effects. The word count for each of MOB's items is rather large: 158 for the "linguistic" item and 189 for the "meta" item. Compare these to a notoriously difficult item in the SCA instrument, item A28-A28a, at 110 words. Thus the Machery et al items rate fairly high by professional standards. It is true that both are rated at a 7.5- or 7.6-grade level (on the Flesch-Kincaid scale, compared to, for example, 11.4 for SCA A28-A28a), but this may have to do with the relative simplicity of each individual sentence: the Flesch-Kincaid grade level weights words per sentence and syllables per word, so a long passage built of short sentences can still receive a low grade. Examining the number of clauses, SCA item A28-A28a includes 11 clauses for 10 words/clause; MOB's "linguistic" item contains 15 clauses for 10.5 words/clause, and MOB's "meta" item contains 18 clauses for 10.5 words/clause. It seems clear that they compare to SCA item A28-A28a in terms of complexity, and even surpass it, according to Yan and Tourangeau's measures.

The vocabulary is also relatively demanding, as rated by the QUAID online system (which singled out "solstices", "classmates", "astronomer", "theft", and "undetected", as well as part of the name "Tsu Ch'ung Chih", as unfamiliar terms), and QUAID also suggested the use of "now" was vague and the meaning of "true" imprecise. To be sure, QUAID found a few unfamiliar terms in SCA A28-A28a ("forecasters", "percent", and "inflation"), found the SCA item particularly riddled with vague terms ("few", "there", "certain", "small", "very", and "more"), and judged its syntax unusually complex. But again, this item is well known as an exceptionally difficult one within the SCA instrument, and the MOB items exceed it by some measures (word count, unfamiliar terms).

With regard to abstraction, expert review (by three experts) of MOB's items indicated that each of them does possess some degree of abstraction (both receiving an average rating of .5, "somewhat abstract", on a 0.0-0.5-1.0 assignment of values to the three levels of abstraction), though these results are preliminary and imprecise. It at least seems legitimate to claim that these item characteristics ought to operate on MOB's sample by increasing comprehension difficulty.


Re-Running MOB’s Chi-Square Analysis of Proportions

These considerations have an effect on the results of analysis: here we re-run the proportions that the authors originally presented in their study, using three different alternatives:

the "full" scenario, with the six Indian cases put back in;

the "compromise" scenario, applying the authors' same standard for study abroad to the French and Mongolian sub-samples as to the Indian sub-sample, including throwing out cases with item missing data on study abroad (all of which are found in the French sub-sample);

and the "consistent" scenario, applying the strictest standard possible, by throwing out any cases with any indication of residence abroad, whether on the education-abroad indicator or the living-abroad indicator. This includes excluding cases with item missing data and invalid responses on the original MOB variables used to build these indicators (all such cases are found in the French sub-sample). This also necessitates ignoring the Mongolian sub-sample altogether, since a variable indicating residence abroad, over and above studying abroad, was not collected for it.

Below I place in bold proportions that differ from the “published” sample proportions. I also draw a thicker border around those proportions that fall below the “Kripkean line” of 50%, indicating a “descriptivist” tendency among the sub-sample (again, generally viewed by philosophers as the “incorrect” answer). Finally, I make semi-transparent proportions whose distribution violates the chi-squared rule of thumb that no more than 20% of a contingency table’s cells should fall below an expected value of 5, rendering chi-squared validation tests inappropriate. I also eliminate proportions when they vanish from the sample.

Figure 2. Proportion of B answers ("full" scenario), by sub-sample (India, France, Mongolia) and item type (Truth-Value/linguistic vs. Speaking About/meta-linguistic). [Bar chart not reproduced; y-axis 0% to 100%.]


Figure 3. Proportion of B answers ("compromise" scenario), by sub-sample (India, France, Mongolia) and item type (Truth-Value/linguistic vs. Speaking About/meta-linguistic). [Bar chart not reproduced; y-axis 0% to 100%.]

Figure 4. Proportion of B answers ("consistent" scenario), by sub-sample (India and France; Mongolia is excluded under this scenario) and item type (Truth-Value/linguistic vs. Speaking About/meta-linguistic). [Bar chart not reproduced; y-axis 0% to 100%.]

As we see, this does alter the proportions, especially for the "consistent" standard, which leaves only three French cases that saw the linguistic item (1 answering A, 2 answering B), leaving us unable to calculate the chi-squared evaluation of the proportions for the French sub-population (with frequencies less than 5 in half of the cells), and which ignores the Mongolian sample altogether. For these last two scenarios, a Fisher exact test may be appropriate: if so, I find a two-tailed p<.73 for the "compromise" scenario, and a two-tailed p<.6 for the "consistent" scenario.
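For reference, Fisher's exact test can be computed directly from raw cell counts with Stata's immediate tabulate command. The counts below are the published-scenario French answers from the Answer Choice table above (not the reduced-scenario counts to which the quoted p-values apply); they are used purely to illustrate the mechanics:

* 2x2 table of item type (rows: linguistic, meta-linguistic) by answer choice (columns: A, B)
tabi 7 9 \ 26 24, exact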

Logistic Regression Analysis

Next, we test our exploratory models singly, on the "published" sample (excluding only the six Indian cases that studied abroad, plus the two Mongolian cases missing data on answer choice), by building simple models out of each one of our six predictors (leaving aside constants for the sake of clarity):

                                          β Coeff.    s.e.       z       p       n
Age                                       -0.0266†    (0.0137)   -1.95   0.053   222
Instrument-Item (1=linguistic, 2=meta)     0.4788†    (0.2773)    1.73   0.084   227
Native speaker (0=no, 1=yes)              -0.1863     (0.2980)   -0.63   0.532   224
Instrument lang. (reference=English)                                             227
  French                                  -0.5173     (0.3348)   -1.54   0.122
  Mongolian                               -0.1011     (0.3241)   -0.31   0.755
College grad. (0=no, 1=yes)               -0.0344     (0.2810)   -0.12   0.903   223
Gender (1=male, 2=female)                  0.0769     (0.2719)    0.28   0.777   223

†significant at .10   *significant at .05   **significant at .01

Table 2. Table of coefficients for single-variable logistic regression models (using "published" sample; constants excluded; significant predictors in bold)

Our initial analysis suggests that age may be a fruitful effect to examine more closely; it just barely escapes significance. The other variables, perhaps surprisingly, seem to suggest no effect at all. The item type may hint at an effect, but fails significance.

Next we run the most likely candidates for inclusion in our model through one-variable logistic regressions, varying the regressions by all four sample scenarios: "full", "published", "compromise", and "consistent".


Model:            Age-only                              Instrument-Item-only (1=linguistic, 2=meta)
Scenario:         β Coeff. (s.e.)      p       n        β Coeff. (s.e.)      p       n
Full              -0.0254† (0.0135)    0.060   228      0.4542 (0.2724)      0.095   233
Published         -0.0266† (0.0137)    0.053   222      0.4788 (0.2773)      0.084   227
Compromise        -0.0307* (0.0146)    0.036   211      0.4989 (0.2873)      0.082   216
Consistent        -0.0500* (0.0216)    0.021   128      0.6283 (0.2873)      0.102   128

†significant at .10   *significant at .05

Table 3. Single-variable logistic regression on Age and Item, varied by scenario.

We now move on to build our full model around age, including not only item type but also shared native language with the instrument, instrument language (= sub-sample), college graduate status, and gender, to see how well age performs in the presence of these other variables in the model. Here we compare only the "consistent" scenario to the "published" scenario: age performed the best in the "consistent" scenario above, so a contrast between the full model measured on that scenario, and the full model measured on MOB's "published" scenario, should draw a fairly direct and meaningful distinction.

                                     "Published" scenario (n=219)        "Consistent" scenario (n=125)
                                     β Coeff.    s.e.      p              β Coeff.    s.e.      p
Age                                  -0.0212     0.1627    0.193          -0.0598*    0.0302    0.048
Item (1=linguistic, 2=meta)           0.4230     0.3001    0.159           0.4296     0.4343    0.323
Native speaker (0=no, 1=yes)         -0.0051     0.4486    0.991          -0.2734     0.5188    0.915
Instr. lang. (reference=English)
  French                             -0.2285     0.4606    0.620           0.1558     0.6106    0.799
  Mongolian                          -0.0766     0.5026    0.879           ----        ----      ----
College grad. (0=no, 1=yes)          -0.0679     0.3713    0.855           0.4154     0.4339    0.598
Gender (1=male, 2=female)             0.1565     0.2881    0.587           0.0406     0.3820    0.915
β0                                    0.6907     0.5235    0.187           1.4416†    0.7464    0.053
Model L2 p<                           .4285                                .2583

†significant at .10   *significant at .05

Table 4. Full model, on the "published" scenario and the "consistent" scenario.

In the published sample, no predictor achieves significance, once more supporting MOB's overall conclusion that cognition or intuition about meta-linguistic and linguistic uses of words does not in general vary. Although the significance of age worsens in this full model, we again see that age gains significance under the "consistent" scenario, whereas the other variables do not. The poor fit of the model and of its individual variables indicates that reduction is necessary; interaction effects are not examined.

A simple solution to reducing this model (eliminating all variables save age alone) is not unreasonable, but as a means of testing this reduction, and also to demonstrate the effect of the potential processing error, I pair age below with each other variable in a series of two-variable models to measure the strength of the age effect when controlling for the other variables one by one, and then vary these comparisons by scenario, from the "published" scenario (where we learned above that age does not perform well in the full model) through the "compromise" scenario to the "consistent" scenario (where we learned age becomes significant in the full model). We saw above how well age performs alone, and how well it performs in the full model; now we carefully vary the single-variable model by increasing the variable count to two, one variable at a time, by scenario. I also include the p value from each model's likelihood ratio chi-square test, as an indication of overall model performance.

"Published" scenario
                                          Model 1      Model 2      Model 3      Model 4      Model 5
Age                                       -0.0233†     -0.0236      -0.0233      -0.0281*     -0.0260*
                                          (0.0139)     (0.0145)     (0.0154)     (0.0138)     (0.0136)
Instrument-Item (1=linguistic, 2=meta)     0.4475
                                          (0.2838)
Native Speaker (0=no, 1=yes)                           -0.0680
                                                       (0.3138)
Instrument Lang. (reference=English)
  French                                                            -0.3525
                                                                    (0.3696)
  Mongolian                                                          0.0466
                                                                    (0.3573)
College Graduate (0=no, 1=yes)                                                   -0.0332
                                                                                 (0.2851)
Gender (1=male, 2=female)                                                                      0.1135
                                                                                              (0.2756)
β0                                         0.7230       0.9018*      0.9941**     1.0469**     0.9427*
                                          (0.4209)     (0.4401)     (0.3881)     (0.4029)     (0.4101)
n                                          222          221          222          220          220
Model L2 p<                                0.0403       0.1947       0.1429       0.1152       0.1403

"Compromise" scenario
                                          Model 1      Model 2      Model 3      Model 4      Model 5
Age                                       -0.0276†     -0.0282†     -0.0277†     -0.0324*     -0.0299*
                                          (0.0148)     (0.0157)     (0.0166)     (0.0148)     (0.0146)
Instrument-Item (1=linguistic, 2=meta)     0.4354
                                          (0.2932)
Native Speaker (0=no, 1=yes)                           -0.0133
                                                       (0.3297)
Instrument Lang. (reference=English)
  French                                                            -0.4012
                                                                    (0.3888)
  Mongolian                                                          0.1119
                                                                    (0.3620)
College Graduate (0=no, 1=yes)                                                   -0.0605
                                                                                 (0.2942)
Gender (1=male, 2=female)                                                                      0.1133
                                                                                              (0.2836)
β0                                         0.8456†      1.0353*      1.0838**     1.1516**     1.0433*
                                          (0.4373)     (0.4721)     (0.4082)     (0.4283)
n                                          211          210          211          210
Model L2 p<                                0.0336       0.1508       0.0843       0.0795       0.1042

"Consistent" scenario
                                          Model 1      Model 2      Model 3      Model 4      Model 5
Age                                       -0.0444*     -0.0471*     -0.0262†     -0.0598*     -0.0489*
                                          (0.0222)     (0.0471)     (0.0165)     (0.0237)     (0.0215)
Instrument-Item (0=linguistic, 1=meta)     0.4155
                                          (0.3995)
Native Speaker (0=no, 1=yes)                           -0.0297
                                                       (0.3927)
Instrument Lang. (reference=English)
  French                                                            -0.0638
                                                                    (0.4587)
College Graduate (0=no, 1=yes)                                                    0.3813
                                                                                 (0.3786)
Gender (0=male, 1=female)                                                                      0.0593
                                                                                              (0.3702)
β0                                         1.2429*      1.4778*      1.0528**     1.5328**     1.4822*
                                          (0.6051)     (0.6756)     (0.4065)     (0.5808)     (0.5749)
n                                          128          127          128          127          127
Model L2 p<                                0.0265       0.0822       0.0454       0.0171       0.0517

†significant at .10   *significant at .05   **significant at .01

Table 5. Reduced comparison models: two-variable models pairing Age with each other predictor (Models 1-5), varied by scenario.

Although one might get the mistaken impression that this is nothing more than a “fishing expedition” (an examination of as many relationships as possible, in the hopes of increasing the chances of finding one that is significant), my intent here is simply to demonstrate the strength (or weakness) of age as a predictor of answer choice. We will discuss these results below.

DISCUSSION

Re-running MOB's experiment and re-reporting the proportions does suggest that the confidence they exhibit in their results may be somewhat premature. While there is nothing here that invalidates their results, there is some cause for concern with respect to the inconsistency of their rules for including and excluding cases from their analysis, and the potential effects that might have on the selection of the chi-squared statistic as the measure of their success. (Admittedly, the Fisher Exact Test, if applied appropriately here, does help salvage their results somewhat.)

Arguably the major concept introduced by MOB's items is that of linguistic reference, which seems like an inherently abstract concept, and at best their major concept is that of the truth-value of hypothetical attitudinal statements. We used expert review to determine the level of abstraction, and the items were determined to be at least somewhat abstract, as well as exhibiting a significant degree of complexity. Thus we ought to expect to observe a cognition-related effect in our logistic regression. We did observe an age effect on answer outcome that, when tested alone, came close to significance, negatively predicting answer B. The negative value indicates that age may make respondents less likely to choose the "correct" ("Kripkean", or "causal-historical") answer on the linguistic instrument (answer B) compared to the "incorrect" answer on the meta-linguistic instrument (answer A). An age-related effect would make intuitive sense; previous research has shown that with age come declining cognitive abilities. For the linguistic item, the true/false answer option may especially activate satisficing behavior; for the meta-linguistic item, the length of the answer options may also activate satisficing behavior. (Jon Krosnick has also recently written on whether introductory sentences may trigger acquiescent responses.) It also makes sense that those with less-than-optimal cognitive skills may fail to arrive at answer B, since this is the answer that philosophers argue optimal cognition should lead us towards. This conclusion must be somewhat mitigated, since we might then expect to see a positive association between age and the selection of meta-A, but this was not observed. Nevertheless, the observed effect (negative association with meta-B) is consistent with what survey research would expect.

This age effect varied across scenarios, achieving clear significance and growing in strength as individuals with exposure to study or living abroad were eliminated from the sample. These eliminations may be helping to remove a potentially confounding factor (exposure to foreign and multi-lingual cultures), hence it is possible that we are observing a valid effect of age that gains strength when the scenarios are varied. If the criteria for these scenarios are valid, they should have been consistently applied by the researchers. Hence the potential effect of the observed processing error: moving from the "full" (most inclusive) sample to the "consistent" (least inclusive) sample effectively doubles the effect that a 1-year change in age has on the odds of choosing a "B" answer. For the "full" scenario, a 1-year increase in age multiplies the odds of choosing a "B" answer over an "A" answer by .9749, or roughly 2.5% lower odds; for the "consistent" scenario, the odds multiplier is .9512, or roughly 4.9% lower odds. The corresponding predicted probability falls from .4937 to .4872, lowering the probability of a "B" answer by about .65 percentage points for each 1-year change in age. This is perhaps the most direct measure of error I present here.
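For reference, the odds multipliers quoted here follow directly from exponentiating the age coefficients reported in Table 3; a quick check in Stata:

* Convert the Table 3 age coefficients to per-year odds multipliers
display exp(-0.0254)   // "full" scenario: ~.975, i.e. roughly 2.5% lower odds per additional year of age
display exp(-0.0500)   // "consistent" scenario: ~.951, i.e. roughly 4.9% lower odds per additional year of age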

We observed no significance on item type, nor did its effect vary across scenarios (hovering around the .1 significance level), corroborating MOB's conclusion that despite some differences in answer rates by item, the proportions on each item were similar.

The age effect also held in the logistic regression when the available variables were added, though only under the most exclusive "consistent" scenario. However, the model was a poor fit, and hence a reduction was undertaken. The results seemed to uphold a consistent performance of age across many variations of models. The model chi-squared statistics for those models suggested that only a handful of them were worth serious consideration, the best being those that were evaluated using the "consistent" scenario, yet even for those models the added variables did not achieve significance. It would appear that a single-variable model of age alone, regardless of scenario, is the best model choice. We may point to Table 3 as an illustration of the effect of age, as well as of the variation in the strength of that effect based on scenario.

This at the very least suggests that philosophers are correctly operationalizing their survey concepts. Their aim is to study human cognition in its various forms; in this study, their aim was to measure human cognition or philosophical intuition on a "linguistic" item and a "meta-linguistic" item. The negative relationship with age suggests that their items do measure cognitive practices and abilities. This also suggests a new avenue for the quantitative measurement of operationalization by survey researchers: searching for, or specifically designing, variables inferentially related to the survey response process, and then examining their relationships with the responses to a particular item within the same survey.

Validity- and Measurement-Related Caveats

More problematic, however, is the intent of Machery et al to measure cross-cultural variation (or similarity) in cognition. This research builds on MOB's earlier research (and its criticism) in Machery et al 2004, which in turn was an extension of the path-breaking original study by Weinberg, Nichols and Stich in 2001 on observed differences in cross-cultural cognition. However, Weinberg et al based their comparisons on the self-reported ancestry of research participants, whereas Machery et al have based their comparisons solely on the location of the data collection sites. These operationalizations of "culture" are obviously quite different in principle, and could also produce quite different results in practice; indeed, the relationship of either operationalization with "culture" does not seem entirely straightforward. The authors did collect self-reports on "nationality", but this is not the same thing as ancestry, and furthermore this variable apparently did not appear on the Mongolian instrument at all. The concept of "culture" seems poorly operationalized in a paper designed to play an important role in the ongoing conversation on cross-cultural differences in cognition.

A Role for Processing Error

Furthermore, the authors seem to have made a processing error after data collection. Six Indian cases were thrown out of the Indian sample, apparently because these cases each spent some number of years traveling and studying abroad (outside of India). However, despite the fact that this data was also collected for both the French and Mongolian samples, no such edits were made to their sub-samples. Indeed, neither the motivation nor the criteria for these edits seem entirely clear: it could be that the authors wished only to analyze cases that had not been exposed to foreign cultures, in an effort to isolate any effect of culture on cognition, but if so they do not present this argument.

The Indian and French samples also collected data not just on study abroad but on living abroad; however, once again, this variable was not collected for the Mongolian sub-sample, and furthermore the variables collected for the Indian and French samples were different: the Indian sub-sample measured whether respondents had spent their "entire lives" in India, whereas the French sub-sample measured only where a respondent had spent "most of their lives living". This discrepancy could lead to significant measurement error: to give some idea of the differences, 100% of the included Indian sample indicated they had lived their entire lives in India (remember, the other six were thrown out), whereas only 73% of the French sample indicated they had even lived "most of their lives" in France. We leave aside for now further discussion of this potentially problematic concept of "travel/living" abroad.

Failure of Language-Related Variables to Predict Outcomes

It should be noted that my language-related variables did not pass tests of significance. It seems that language did not play a detectable role in the comprehension of MOB's items, a partial failure of my hypothesis. The reasons for this could be several: perhaps MOB achieved excellent instrument translations; perhaps these particular items do not lend themselves to language-related effects on measurement; or perhaps language-related effects on survey measurement are less powerful than suspected.


Total Survey Error—A Summary and Directions for Future Research

We now summarize the major sources of survey error, as examined in this study.

Measurement

Validity – MOB’s items do seem to validly measure linguistic and meta-linguistic effects: we observed a relatively consistent age effect that holds not only individually, but also in the presence of other model variables, including item type. From a survey-methodological perspective, this age effect implies that MOB’s items do in fact measure cognitive processes, because the effect negatively predicts “B” answers for older respondents, suggesting that they are resorting to satisficing strategies as expected. It is hard to think of why else this age effect would be observed. It is somewhat surprising that none of the other explanatory variables—instrument language, shared native language with participant, sample location, and college graduate status—indicated any effect on the experimental outcome, since these are all related to comprehension of survey items, especially in cross-cultural research. These negative results suggest that MOB’s item design does not contribute to error; they have designed items that do not rely on high education levels for comprehension, and have correctly translated their instrument into several languages. It may be that age-related effects on comprehension are expressed via different processes than education- and language-related effects; if so, this could be a contribution to research on cognition. Further research is likely called for in this area, on a larger sample that is more representative of the population of interest, and that collects a wider range of covariates, both demographic and cognitive.

Measurement error – although once again we did not explore the effect of measurement error on the MSE of MOB’s variables, nevertheless there are some meaningful things to say about measurement error. First of all, although item non-response was low for each model variable, item non-response and invalid responses on the variables used to determine sample scenarios (study abroad and living abroad) were significantly higher. This suggests that experimental philosophy research needs to pay more attention to data quality and collection procedures. Furthermore, as noted above, it is not clear that these variables were designed correctly or appropriately for each sub-sample, causing for example the need to drop the entire Mongolian sub-sample when applying a consistent standard of living abroad. While this admittedly does not matter in the sense that using this variable to adjust the sample (achieving the “consistent” sample scenario) still did not improve the predictive ability of the instrument type (corroborating MOB’s hypothesis that cognition is consistent across meta-linguistic and linguistic items), it does matter in the sense that invalidating one-third of a sample leaves us far less able to investigate potential sources of error in MOB’s study. One recommendation to Experimental philosophy might be to take their ancillary variables as seriously as their research variables, viewing the entire data-collection process as the implementation of a single instrument measuring variables of equal importance.

In addition, because we detected an age effect on answer choice, more research is needed on item-order and answer-order effects. Any survey methodologist knows that these are important phenomena in survey research, and if age is producing satisficing error in philosophical experiments, further studies should be designed to test this effect and explore its extent; a sketch of one such test follows.
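
MOB’s data contain no order manipulation, so the following is only a sketch, under assumed variable names (chose_b, age, answer_order, item_type), of how a future experiment could test whether answer-order effects vary with age: an age-by-order interaction in a logistic model.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("future_order_experiment.csv")  # hypothetical future data
m = smf.logit("chose_b ~ age * C(answer_order) + C(item_type)", data=df).fit()
print(m.summary())  # the age-by-answer_order interaction term is the quantity of interest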

Processing error – As seen in the results of the scenario variations, processing error did not change the significance level of the experimental-condition variable (i.e., item/instrument type), implying that including or excluding cases based on exposure to living and/or studying abroad does not bias MOB’s results. However, our analysis of their comparison of proportions does suggest that while processing error might not reverse MOB’s hypothesis, it does undermine confidence in the validity of their conclusions, again indicating that some bias due to processing error is at work. Processing error also affects the significance level of the age effect: the effect grows stronger the more consistently the authors’ processing rules are applied to the sample, so processing error biases the estimated strength of this age effect.
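
A minimal sketch of the scenario comparison discussed above: re-fitting one logistic model on each sample scenario and watching how the p-values for item type and age move as the exclusion rules tighten. The filter columns shown here are hypothetical stand-ins for the paper’s actual scenario definitions.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mob_responses.csv")  # hypothetical data file
scenarios = {
    "full": df,
    "published": df[df["excluded_by_mob"] == 0],                              # MOB's own exclusions
    "consistent": df[(df["excluded_by_mob"] == 0) & (df["lived_abroad"] == 0)],
}

for name, sub in scenarios.items():
    m = smf.logit("chose_b ~ age + C(item_type)", data=sub).fit(disp=False)
    print(name, m.pvalues.round(3).to_dict())  # expect item_type stable, age strengthening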

Representation

Touching briefly on issues of representation, we might still wonder whether inferences from these unusual and quite different sub-samples (one drawn entirely from two colleges in India, for example) to the general population are valid. In terms of sampling error, the analogue here is assignment to condition, which, surprisingly, seems to have deviated from the expected proportions in the French sub-sample. This may have had implications for MOB’s results, but we did not analyze it, and it could be a fruitful direction for future research. Non-response was not a focus of our research, though item non-response plays an important role in our discussion of the valid inclusion or exclusion of cases, and we have briefly emphasized the importance of quality data-collection techniques.
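
A minimal sketch of how the assignment-to-condition imbalance noted above could be checked: a chi-square goodness-of-fit test of observed condition counts in a sub-sample against an even split. The counts are placeholders, not MOB’s published figures.

from scipy.stats import chisquare

observed = [34, 18]                 # hypothetical counts per experimental condition
expected = [sum(observed) / 2] * 2  # even assignment expected under the design
stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")  # a small p indicates assignment deviated from expectation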

CONCLUSION

The results of my TSE analysis have met with limited to moderate success. Although my methods were somewhat unorthodox and did not focus on examining components of MSE, I was able to provide some support for MOB's (negative) findings. These negative findings are not affected by the processing error we detected; they hold in models that do detect a significant effect of age on answer choice, as well as in much larger models that add variables related to language comprehension and education level. The age effect likewise holds in these larger models.

That age effect also helps to extend MOB's research on cognition by uncovering information about the within-population variation of answer choice in this experiment. I showed there are a priori reasons, based on survey-methodological principles, for expecting to find this age effect, and it seems unlikely that there is an alternative explanation for it. From the TSE perspective, this suggests a potential age-related source of error (in the non-pejorative sense) in philosophical experiments. It would be worthwhile, at least from the survey research perspective, to explore this age effect on linguistic cognition in more depth: a larger-scale survey with several experimental items, well-designed items to collect demographic variables, and a more careful and rigorous field implementation.

Although the processing error did not seem to affect the significance levels of item type in my logistic regression, my replications of MOB's results under scenarios based on different criteria for excluding cases did somewhat qualify their claims. Using their full sample without exclusions did not significantly alter their results, but the two scenarios that applied their exclusion criteria for exposure to living abroad more rigorously (the "compromise" scenario and the "consistent" scenario) showed that their published chi-square analysis did not seem to be an appropriate test of their hypothesis. This at least underscores the lesson that field procedures should be determined in advance and applied consistently, one way or another, in order to avoid processing error. This is especially true when these procedures can result in the exclusion of cases from the final analysis. Indeed, this processing error may have influenced the choice of statistical test that MOB applied to their data.

Another example of the consequences of processing error again relates to the observed age effect. This effect, while close to the traditional .05 significance level in the single-variable model even for the "full" and "published" samples, does not pass this level until the more consistent rules for excluding cases are applied. The fact that the age effect grows stronger as more and more cases are excluded for exposure to living abroad is intriguing. This trend also held for the full-model results and for the sets of two-variable model comparisons. Without attention to processing error, this effect might never have been appreciated. It may also be an example of how survey errors interact: the age effect (I argue) is a type of measurement error, which is revealed in full only when processing error is examined. And once this age effect is established, it can be applied to the analysis of specification error and used to confirm, somewhat ironically, that no specification error has occurred. The relationship of studying and living abroad to philosophical (and survey) item comprehension may need to be studied in more detail.

Since X-Phi was launched around 2001, several ongoing conversations have ensued, focusing especially on careful attention to language and item/instrument design. MOB’s work lies at the heart of one of these conversations, and this paper has tried to contribute by paying its own kind of careful attention to item design and language, from a survey research perspective. Applying the TSE perspective to MOB’s instruments has been fruitful in demonstrating that they correctly operationalized their research concepts: they designed items whose performance in the field coheres with what survey researchers know about cognition. This paper has also employed the TSE paradigm to show how ignoring survey error in X-Phi research (specifically processing error) can lead to potentially unwarranted results, confound interesting and important effects, and undermine confidence in philosophical researchers’ ability to support their findings. It is this author’s hope that further efforts will be made to improve data collection in experimental philosophy, and that a dialogue will open between this growing field and the well-established field of survey research, to the benefit of both disciplines.

Bibliography

Beebe, James R., and Jensen, Mark. 2010. “Surprising Connections Between Knowledge and Action: The Robustness of the Epistemic Side-Effect Effect.” Philosophical Psychology, 25:689-715.

Biemer, Paul. 2010. “Total Survey Error: Design, Implementation, and Evaluation.” Public Opinion Quarterly, 74(5):817-848.

Brewer, Marilynn B., and Gardner, Wendi. 1996. “Who Is This ‘We’? Levels of Collective Identity and Self Representations.” Journal of Personality and Social Psychology, 71:83-93.

Cokely, Edward, and Feltz, Adam. 2008. “Individual Differences, Judgment Biases, and Theory-of-Mind: Deconstructing the Intentional Action Side Effect Asymmetry.” Journal of Research in Personality, 43:18-24.

Feltz, Adam, and Cokely, Edward. 2007. “An Anomaly in Intentional Action Ascription: More Evidence of Folk Diversity.” Proceedings of the 29th Annual Meeting of the Cognitive Science Society, 1748.

Groves, Robert, and Lyberg, Lars. 2010. “Total Survey Error: Past, Present and Future.” Public Opinion Quarterly, 74(5):849-879.

Haberstroh, Susanne, Oyserman, Daphna, Schwarz, Norbert, Kühnen, Ulrich, and Ji, Li-Jun. 2002. “Is the Interdependent Self More Sensitive to Question Context Than the Independent Self? Self-Construal and the Observation of Conversational Norms.” Journal of Experimental Social Psychology, 38:323-329.

Henrich, Joseph, Heine, Stephen J., and Norenzayan, Ara. 2010. “The Weirdest People in the World.” Behavioral and Brain Sciences, 33:61-135.

Holbrook, Allyson, Cho, Young, Johnson, Timothy. 2006. “The Impact of Question and Respondent Characteristics on Comprehension and Mapping Difficulties.” Public Opinion Quarterly, 70(4): 565-595.

Ji, Li-Jun, Zhang, Zhiyong, and Nisbett, Richard E. 2004. “Is it culture, or is it language? Examination of language effects in cross-cultural research.” Journal of Personality and Social Psychology, 87(1), 57-65.

Knobe, Joshua. 2003. “Intentional action and side-effects in ordinary language.” Analysis, 63: 190-193.

Kleiner, Brian, Pan, Yuling, and Bouic, Jerelyn. 2009. “The Impact of Instructions on Survey Translation: An Experimental Study.” Survey Research Methods, 3(3): 113-122.

Knäuper, Bärbel, Schwarz, Norbert, Park, Denise, and Fritsch, Andreas. 2007. “The Perils of Interpreting Age Differences in Attitude Reports: Question Order Effects Decrease With Age.” Journal of Official Statistics, 23(4):515-528.

Lam, Barry. 2010. “Are Cantonese Speakers Really Descriptivists? Revisiting Cross-Cultural Semantics.” Cognition, 115:320-329.

Machery, Edouard, Mallon, Ron, Nichols, Shaun, and Stich, Stephen P. 2004. “Semantics, cross-cultural style.” Cognition, 92(3): 1-12.

Machery, Edouard, Olivola, Christopher Y., and de Blanc, Molly. 2009. “Linguistic and metalinguistic intuitions in the philosophy of language.” Analysis, 69(4):1-12.

Machery, Edouard, Deutsch, Max, Mallon, Ron, Nichols, Shaun, Sytsma, Justin, Stich, Stephen. 2010. “Semantic Intuitions: Reply to Lam.” Cognition, 117: 363-366.

Marti, G. 2009. “Against Semantic Multi-Culturalism.” Analysis, 69:42-48.

Nisbett, Richard E., Peng, Kaiping, Choi, Incheol, and Norenzayan, Ara. 2001. “Culture and Systems of Thought: Holistic Versus Analytic Cognition.” Psychological Review, 108(2):291-310.

Norenzayan, Ara, Smith, Edward E., Kim, Beom Jun, and Nisbett, R.E. 2002. “Cultural Preferences for Formal Versus Intuitive Reasoning.” Cognitive Science, 26:653-684.

Schaeffer, Nora Cate, and Dykema, Jennifer. 2011. “Questions for Surveys: Current Trends and Future Directions.” Public Opinion Quarterly, Special Issue 2011, 75(5):909-961.

Schwarz, Norbert. 1995. “What Respondents Learn from Questionnaires: The Survey Interview and the Logic of Conversation.” International Statistical Review, 63(2):153-177.

Stich, Stephen, and Nisbett, Richard. 1980. “Justification and the Psychology of Human Reasoning.” Philosophy of Science, 47(2):188-202.

Sytsma, Justin, and Livengood, Jonathan. 2011. “A New Perspective Concerning Experiments on Semantic Intuitions.” Australasian Journal of Philosophy, 89(2):315-332.

Tversky, Amos, and Kahneman, Daniel. 1974. “Judgment Under Uncertainty: Heuristics and Biases.” Science (New Series), 185(4157):1124-1131.

Weinberg, Jonathan, Nichols, Shaun, and Stich, Stephen. 2001. “Normativity and epistemic intuitions.” Philosophical Topics, 29(1&2): 429-460.

Yan, Ting, and Tourangeau, Roger. 2008. “Fast Times and Easy Questions: The Effects of Age, Experience and Question Complexity on Web Survey Response Times.” Applied Cognitive Psychology, 22:51-68.
