
Cognitive Simplicity and Consideration Sets

by

John R. Hauser, Olivier Toubia, Theodoros Evgeniou, Rene Befurt, and Daria Silinskaia

John R. Hauser is the Kirin Professor of Marketing, MIT Sloan School of Management, Massachusetts Institute of Technology, E40-179, One Amherst Street, Cambridge, MA 02142, (617) 253-2929, fax (617) 253-7597, [email protected].

Olivier Toubia is the David W. Zalaznick Associate Professor of Business, Columbia Business School, Columbia University, 522 Uris Hall, 3022 Broadway, New York, NY, 10027, (212) 854-8243, [email protected].

Theodoros Evgeniou is an Associate Professor of Decision Sciences and Technology Management, INSEAD, Boulevard de Constance 77300, Fontainebleau, FR, (33) 1 60 72 45 46, [email protected].

Rene Befurt is a Visiting Scholar at the MIT Sloan School of Management, Massachusetts Institute of Technology, E40-157, One Amherst Street, Cambridge, MA 02142, (857) 753-7531, [email protected].

Daria Silinskaia is a doctoral student at the MIT Sloan School of Management, Massachusetts Institute of Technology, E40-170, One Amherst Street, Cambridge, MA 02142, (617) 253-2268, [email protected].

We would like to thank Daniel Bailiff (AMS), Simon Blanchard (PSU), Robert Bordley (GM), Anja Dieckmann (GfK), Holger Dietrich (GfK), Min Ding (PSU), Steven Gaskin (AMS), Patricia Hawkins (GM), Phillip Keenan (GM), Clarence Lee (MIT), Carl Mela (Duke), Andy Norton (GM), Daniel Roesch (GM), Matt Seleve (MIT), Glen Urban (MIT), Limor Weisberg, and Kaifu Zhang (INSEAD) for their insights, inspiration, and help on this project. This paper has benefited from presentations at the Analysis Group Boston, the Columbia Business School, the Digital Business Conference at MIT, General Motors, the London Business School, Northeastern University, the Marketing Science Conference in Vancouver, B.C., and the Seventh Triennial Choice Symposium at the University of Pennsylvania. Upon publication, Matlab software and the US data are available from the authors.

Abstract

We develop and test methods to identify cognitively-simple decision rules that explain

which products consumers select for their consideration sets. Drawing on qualitative research we

propose disjunctions-of-conjunctions (DOC) decision rules that generalize well-studied decision

models such as disjunctive, conjunctive, lexicographic, and subset conjunctive rules. We draw on behavioral insights about cognitive simplicity and illustrate how these insights enhance

DOC rules. Using synthetic and empirical data we compare cognitively-simple DOC-based rules

to extant compensatory and non-compensatory rules. Synthetic data suggest that estimation

methods matched to data-generating rules predict validation data best. Empirically we observe

consumers’ consideration sets for global positioning systems for both estimation and validation

data. On validation data DOC-based rules, which account for cognitive simplicity (and market

commonalities), fit significantly better than other rules. This result is robust with respect to sam-

ple (German representative vs. US student), format by which consideration is measured (four

formats tested), and presentation of profiles (pictures vs. text). We illustrate that gains due to

cognitive simplicity apply to alternative estimation methods and are robust with respect to alter-

native means to estimate the benchmark rules. Empirically, our analyses suggest that cogni-

tively-simple DOC rules predict validation data well and imply different managerial insights.

Keywords: Consideration sets, non-compensatory decisions, consumer heuristics, statistical

learning, machine learning, revealed preference, conjoint analysis, cognitive

complexity, cognitive simplicity, environmental regularity, lexicography, logical

analysis of data, decision trees, combinatorial optimization.


1. Introduction and Focus

We focus on decision rules that consumers use to form consideration sets. We explore

decision rules (disjunctions of conjunctions) which generalize disjunctive, conjunctive, lexico-

graphic, and subset conjunctive rules that have been shown to explain and predict consideration

decisions. In theory, disjunctions of conjunctions (DOC) can involve complex logical patterns,

but prior research suggests that consumers use relatively simple rules when deciding on their

consideration sets. We therefore enforce cognitive simplicity.

Although the concept of DOC rules comes from in-depth qualitative interviews, our focus

in this paper is to develop and test methods that incorporate cognitive simplicity while inferring

decision rules directly from data in which consumers indicate which products (or product pro-

files) they would or would not consider. Such “revealed” inferences complement data-

augmentation methods in which only choices are observed as well as more-intensive and, poten-

tially intrusive, measurements of decision rules such as process tracing, information display

boards, eye-tracking, and in-depth qualitative research. For example, to the extent that valida-

tion-data predictions are best if the estimation approach matches the “true” decision rules, re-

vealed-inference methods can be used to test hypotheses developed from more-intensive meas-

urement. Behavioral experiments can vary context, observe consideration-sets, and infer the

likely context-dependent decision rules. Moreover, to the extent that non-compensatory decision

rules predict better, conjoint-like simulators based on such rules might be used to evaluate mana-

gerial actions designed to affect consumers’ consideration decisions.

Building on evidence from a variety of fields that simple rules, if they are used, balance

consumer benefits with cognitive processing costs, we modify estimation methods to incorporate

the cognitive simplicity of the decision rules. Synthetic data experiments suggest that modifica-

tions which better match the data-generating decision rules predict better. Empirical data, in

which German consumers evaluate Global Positioning Systems (GPSs), suggest that the pro-

posed cognitively-simple DOC decision rules predict consideration (from a new set of stimuli)

better than either compensatory rules or existing non-compensatory rules. The results are robust

with respect to country (Germany vs. the US), sample (random vs. student), stimuli (text vs. pic-

tures), and data collection format (four different versions). We examine managerial implications

and close by demonstrating that the performance of cognitively-simple DOC rules is not unique

to a single estimation method for DOC rules or the benchmarks.


To provide context, we begin with a short discussion of the managerial and scientific im-

portance of consideration sets and review evidence that consumers use cognitively-simple deci-

sion rules when evaluating many products as is common in consideration decisions. Subsequent

sections review existing methods, introduce disjunctions-of-conjunctions decision rules, present

four formal results which motivate cognitive simplicity, and develop one example estimation

method for cognitively-simple DOC rules. We then describe the simulation experiments, em-

pirical tests, robustness checks, managerial implications, and alternative estimation methods.

2. Evidence for Consideration Sets and for Cognitive Simplicity

When consumers are faced with a large number of alternative products, as is increasingly

common in today’s retail and web-based shopping environments, they typically screen the full

set of products down to a smaller, more-manageable consideration set which they evaluate fur-

ther (e.g., Bronnenberg and Vanhonacker 1996; DeSarbo et al., 1996; Hauser and Wernerfelt

1990; Jedidi, Kohli and DeSarbo, 1996; Mehta, Rajiv, and Srinivasan, 2003; Montgomery and

Svenson 1976; Payne 1976; Roberts and Lattin, 1991; Shocker et al., 1991; Wu and Rangas-

wamy 2003). Understanding the formation of consideration sets can be managerially important.

For example, consideration sets for packaged goods are typically 3-4 products rather than the 30-

40 products that are available while in automobiles the typical consumer focuses on 5-6 options

out of the 350+ make-model combinations on the market (Hauser and Wernerfelt 1990, Urban

and Hauser 2004). Marketing strategies that encourage consumers to add a product to the con-

sideration set increase the odds of a purchase dramatically. For example, top management at

General Motors (GM) is investing heavily to increase consideration of GM automobiles because

GM believes that its vehicles are much better than consumers perceive them to be. GM believes

that the key barrier to sales is that consumers reject GM before seriously evaluating GM products

(e.g., only 36% of California consumers will even consider a GM vehicle). This strategic initia-

tive has led to tactics such as bringing test drives to consumers, directed customer relationship

management, moderated community groups, web-based showrooms, and web-based auto-choice

advisors (Rhoads, Urban, and Sultan 2004). Indeed, GM has begun to apply the measurement

protocols and models explored in this paper.

Most experimental evidence suggests that consumers make consideration decisions with

relatively simple rules that enable them to make good decisions while avoiding excess cognitive

effort (e.g., Bettman, Luce and Payne 1998; Gigerenzer and Todd 1999; Payne, Johnson and


Bettman 1988, 1993; Simon 1955; Shugan 1980). Some researchers suggest that such simple

rules lead to better decision outcomes than more-complex compensatory rules in the decision en-

vironments that consumers normally face (e.g., Bröder 2000; Gigerenzer and Goldstein 1996;

Hogarth and Karelaia 2005; Martignon and Hoffrage 2002). This perspective is consistent with

economic theories of consideration-set formation which posit that consumers balance search

costs and the option value of utility maximization (Hauser and Wernerfelt 1990; Roberts and

Lattin 1991). Low search-and-evaluation-cost rules might be the most efficient search or evalua-

tion methods.

From our perspective, we do not require that every consumer use a simple decision rule for consideration decisions, nor that consumers use simple rules in all contexts; we require only that it is scientifically interesting and managerially relevant to study cognitively simple decision rules. In this paper we

focus on the consideration decision. Our analyses are consistent with existing, well-studied mod-

els of choice from among considered products.

3. Established Models of Decision Rules for Consideration Decisions

Figure 1 illustrates sixteen features that consumers use to evaluate handheld GPSs. These

features were chosen as the most important based on two pretests of 58 and 56 consumers, re-

spectively. Ten of the features are represented by text and icons while the remaining six features

are represented by text and visual cues. (We review the detailed measurement formats in Section

9 and test alternative presentation formats, including text-only, in Section 10.) We focus on data

in which respondents are asked to indicate which of several profiles (32 in our experiments) they

would consider. Respondents are free to select any size consideration set. In some formats they

must classify each profile as considered or not considered; in other formats they do not need to

evaluate every profile. In this paper we explore situations in which features are described by

finitely many levels as is common in most conjoint-analysis applications. The concept of cogni-

tive simplicity also applies to continuous features, an issue we address at the end of this paper.


Figure 1 Features of Handheld GPSs

Let $j$ index the profiles, $\ell$ index the levels, $f$ index the features (sometimes called "attributes" in the literature), and $h$ index the respondents. Let $J$, $L$, $F$, and $H$ be the corresponding numbers of profiles, levels, features, and respondents. For ease of exposition only, we do not write $J$, $L$, and $F$ as dependent (e.g., $L_f$). Our models and estimation can (and do) handle such dependency, but the notation is cumbersome. Let $x_{jf\ell} = 1$ if profile $j$ has feature $f$ at level $\ell$; otherwise $x_{jf\ell} = 0$. Let $\vec{x}_j$ be the binary vector (of length $LF$) describing profile $j$. Let $y_{hj} = 1$ if we observe that respondent $h$ considers profile $j$; otherwise $y_{hj} = 0$. Let $\vec{y}_h$ be the binary vector describing respondent $h$'s consideration decisions. All notation is summarized in Appendix 1.

Compensatory Decision Rules

If consumers are utility-maximizing, then they will consider a profile if its utility is above some threshold, $T_h$, which accounts for search and processing costs. If $\vec{\beta}_h$ is the vector of partworths for respondent $h$, then $h$'s evaluation of the utility of profile $j$ is $\vec{x}_j'\vec{\beta}_h + \epsilon_{hj}$, where $\epsilon_{hj}$ is an error term drawn from an extreme-value distribution. Subsuming the threshold in the scaling of the partworths yields the standard logit model (e.g., Swait and Erdem 2007, p. 691). We defer estimation of this and other benchmark models to Section 7.

Non-compensatory Decision Rules

Commonly-studied non-compensatory rules are disjunctive, conjunctive, lexicographic,

elimination-by-aspects, and subset conjunctive rules (e.g., Gilbride and Allenby 2004, 2006;


Jedidi and Kohli 2005; Montgomery and Svenson 1976; Ordóñez, Benson and Beach 1999;

Payne, Bettman, and Johnson 1988; Yee et al. 2007). Subset conjunctive rules generalize dis-

junctive and conjunctive rules (Jedidi and Kohli 2005). For consideration decisions, they also

generalize lexicographic rules and deterministic elimination-by-aspects.1

Disjunctive Rules

In a disjunctive rule, a profile is considered if at least one of the features is at an "acceptable" (or satisfactory) level. Let $a_{hf\ell} = 1$ if level $\ell$ of feature $f$ is acceptable to respondent $h$; otherwise $a_{hf\ell} = 0$. Let $\vec{a}_h$ be the binary vector of acceptabilities for respondent $h$. A disjunctive rule states that respondent $h$ considers profile $j$ if $\vec{x}_j'\vec{a}_h \ge 1$.

Conjunctive Rules

In a conjunctive rule, a profile is considered if all of the features are at an acceptable level. With this definition, a feature may have no effect on consideration if all levels of that feature are acceptable. (Conjunctive rules usually assume a larger set of acceptable levels than disjunctive rules, but this is not required.) Because the use in each rule is clear in context, we use the same notation: in a conjunctive rule, respondent $h$ considers profile $j$ if $\vec{x}_j'\vec{a}_h = F$.

Subset Conjunctive Rules

In a subset conjunctive rule, a profile is considered if at least $S$ features are at an acceptable level.2 Using the same notation, respondent $h$ considers profile $j$ if $\vec{x}_j'\vec{a}_h \ge S$. Clearly, a disjunctive rule is a special case where $S = 1$ and, because $\vec{x}_j'\vec{a}_h$ can never exceed $F$, a conjunctive rule is a special case where $S = F$. We denote subset conjunctive rules by Subset($S$).
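To make the notation concrete, the following Python sketch (our illustration, not part of the original analysis; the toy profile and acceptability vectors are hypothetical) evaluates a Subset(S) screening rule and its disjunctive and conjunctive special cases:

    # Sketch: disjunctive, conjunctive, and Subset(S) screening rules for one respondent.
    # x_j and a_h are hypothetical binary dicts keyed by (feature, level) aspects.

    def count_acceptable(x_j, a_h):
        """Number of profile j's aspects that respondent h finds acceptable (x_j' a_h)."""
        return sum(1 for aspect, present in x_j.items() if present and a_h.get(aspect, 0) == 1)

    def subset_rule(x_j, a_h, S):
        """Subset(S): consider the profile if at least S features are at acceptable levels."""
        return count_acceptable(x_j, a_h) >= S

    def disjunctive_rule(x_j, a_h):
        return subset_rule(x_j, a_h, S=1)          # special case S = 1

    def conjunctive_rule(x_j, a_h, F):
        return subset_rule(x_j, a_h, S=F)          # special case S = F

    # Toy example with three features (battery life, track log, price).
    x_j = {("battery", "30h"): 1, ("tracklog", "yes"): 1, ("price", "$299"): 1}
    a_h = {("battery", "30h"): 1, ("tracklog", "yes"): 1, ("price", "$249"): 1}
    print(subset_rule(x_j, a_h, S=2))       # True: two of the three features are acceptable
    print(conjunctive_rule(x_j, a_h, F=3))  # False: the $299 level is not acceptable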

Lexicographic Rules

In a lexicographic rule, respondents decide on an order of features and an order of levels

within features. Respondents first rank profiles on levels within the first feature. They use the

1 Tversky's (1972) elimination-by-aspects (EBA) rule has been used in its defined-aspect-order form by researchers such as Hogarth and Karelaia (2005), Johnson, Meyer, and Ghose (1989), Montgomery and Svenson (1976), and Payne, Bettman, and Johnson (1988). If the aspect order is defined, then the EBA rule for consideration is the same as a lexicographic-by-aspects rule. Additionally, if aspect measures are non-zero for a limited set of acceptable levels, then probabilistic EBA is the same as a conjunctive rule.

2 Subset(S) rules are equivalent to "image-theory" rules in organizational behavior (Ordóñez, Benson and Beach 1999). Image-theory rules are defined as rejecting options if F − S features are below thresholds. Such rules are mathematically equivalent to accepting S features.


next feature in the lexicographic order only when profiles are tied on all higher-ranked features.

Lexicographic rules have been applied to consideration decisions by allowing ties and assuming

that only the first S features affect consideration (e.g., Yee et al. 2007).3 However, if we do not

distinguish ranks within consideration sets, then the only feature ordering that matters is that the

first S features have acceptable levels and the exact feature ordering is not unique. For example,

if a respondent considers any GPS with an extra bright, high resolution display, then we get the

same consideration set whether brightness is ranked before resolution or vice versa. In consid-

eration decisions, a lexicographic rule will be indistinguishable from a conjunctive rule if $\vec{a}_h$ is coded such that the appropriate levels of the first $S$ features are acceptable (and the remaining features are coded as not affecting the decision). With this coding, a lexicographic rule is equivalent to a conjunctive rule, which, in turn, can be written as a subset conjunctive rule.

Because the disjunctive, conjunctive, and lexicographic rules (and most common forms

of EBA) can be written as subset conjunctive rules, we adopt subset conjunctive rules as our

non-compensatory benchmark. We now discuss empirically-reasonable non-compensatory

screening rules that cannot be written as subset conjunctive rules.

4. Disjunctions of Conjunctions (DOC)

Our initial motivation for disjunctions of conjunctions came from qualitative discussions

with consumers. For example, when we first began interviewing respondents about GPSs, we

heard respondents express rules for handheld GPSs that were based on one or more conjunctive-

like criteria. A respondent might be willing to consider a GPS with a B&W screen if the GPS is

small and the screen is high resolution, but would require a color screen on a large GPS. Such

rules can be written as logical patterns: (B&W screen ∧ small size ∧ high resolution) ∨ (color

screen ∧ large size), where ∧ is the logical “and” and ∨ is the logical “or.” Patterns might also

include negations (¬), for example, a consumer might accept a B&W screen as long as the GPS

is less than the highest price of $399: (B&W screen ∧ ¬ $399).

In a qualitative study sponsored by General Motors some respondents considered auto-

mobiles based on two or more conjunctive-like criteria (Anonymous 2008). That study used in-

depth interviewing for 38 automobile consumers who were asked to describe their consideration

decisions for 100 real automobiles that were balanced to market data.

3 Yee et al. (2007) also consider lexicographic rules in which the ranking of profiles is inferred from feature-level combinations (called "aspects"). The conceptual arguments in this paragraph also apply to such rules.

For example, the follow-


ing respondent considers automobiles that satisfy either of two criteria. The first criterion is

clearly conjunctive (good styling, good interior room, excellent mileage). The second criterion

allows cars that are “hotrods.” “Hotrods” usually have poor interior room and poor mileage.

[I would consider the Toyota Yaris because] the styling is pretty good, lot of interior room, mileage is supposed to be out of this world. I definitely [would] consider [the Infinity M-Sedan], though I would probably consider the G35 before the "M". I like the idea of a kind of a hotrod.

All interviews were video-recorded and the videos were evaluated by independent judges

who were blind to any hypotheses about consumers’ decision rules (Hughes and Garrett 1990;

Perreault and Leigh 1989). Most respondents made consideration decisions rapidly (89% aver-

aged less than 5 seconds per profile) and most used non-compensatory decision rules (76%).

Typically, consumers used conjunctive-like criteria defined on specific levels of features. Some

consumers would consider an automobile if it satisfied at least one of multiple criteria (a disjunc-

tion of two or more conjunctions).

We seek to formalize these qualitative insights with a class of decision rules that general-

izes previously-proposed rules. First, following Tversky (1972) we define an aspect as a binary

descriptor such as “B&W screen.” A profile either has or does not have an aspect. A pattern is a

conjunction of aspects or their negations such as (B&W screen ∧ ¬ $399). We define the size, s,

of a pattern as the number of aspects in the pattern. For example, (B&W screen ∧ ¬ $399) has

size s = 2. If p indexes patterns, then we say that a profile j matches pattern p if profile j contains

all aspects (or negations) in pattern p.

We study rules where the respondents consider a profile if the profile matches one or

more target patterns. Because each pattern is a conjunction, these logical rules are disjunctions of

conjunctions (DOC). DOC rules generalize both disjunctive and conjunctive rules and are con-

sistent with the qualitative interviews. While other logical patterns are possible, we show that

DOC rules are sufficiently general to fit any observed consideration decisions.
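As a hypothetical illustration (ours, not the authors' code), the sketch below encodes the GPS example above as a disjunction of two conjunctive patterns and checks whether a profile, represented as a set of aspects, matches at least one of them:

    # Sketch: a DOC rule as a list of conjunctive patterns over aspects.
    # Each pattern is a list of (aspect, required) pairs; required=False encodes a negation.

    def matches_pattern(profile_aspects, pattern):
        """The profile matches if it has every required aspect and none of the negated ones."""
        return all((aspect in profile_aspects) == required for aspect, required in pattern)

    def doc_considers(profile_aspects, doc_rule):
        """A DOC rule accepts the profile if it matches at least one conjunctive pattern."""
        return any(matches_pattern(profile_aspects, p) for p in doc_rule)

    # (B&W screen AND small size AND high resolution) OR (color screen AND large size)
    doc_rule = [
        [("B&W screen", True), ("small size", True), ("high resolution", True)],
        [("color screen", True), ("large size", True)],
    ]
    print(doc_considers({"color screen", "large size", "high resolution"}, doc_rule))  # True
    # A pattern with a negation, e.g. (B&W screen AND NOT $399):
    print(doc_considers({"B&W screen", "$249"}, [[("B&W screen", True), ("$399", False)]]))  # True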

Formal Definition of DOC Rules

For any set of features and levels there are many potential conjunctions of aspects. If all $F$ features were binary, there would be $3^F - 1$ possible patterns. The factor of 3 comes from the

fact that every conjunctive pattern could either ignore a feature, contain a feature, or contain its

negation. There are many more patterns if the features have many levels. Based on the literature


cited earlier, we expect that decision rules are cognitively simple. Thus, we expect the number

of aspects in a pattern to be small. To capture this concept, we define DOC(S) as the set of DOC

rules in which the maximum size of the patterns is S. (It will be clear in context whether S refers

to the maximal pattern size in DOC(S) or the subset sizes in Subset(S).)

For a set of allowable patterns, let $w_{hp} = 1$ if pattern $p$ is one of the patterns describing respondent $h$'s decision rule and let $m_{jp} = 1$ if profile $j$ matches pattern $p$. Otherwise, $w_{hp}$ and $m_{jp}$ are zero. Let $\vec{w}_h$ and $\vec{m}_j$ be the corresponding binary vectors with length equal to the number of allowable patterns in a DOC rule. Then a DOC rule implies that respondent $h$ considers profile $j$ if $\vec{m}_j'\vec{w}_h \ge 1$.

While the binary vectors $\vec{w}_h$ and $\vec{m}_j$ play a role analogous to partworth and feature vectors in conjoint analysis, their length can be quite large because the number of allowable patterns grows rapidly with $S$. For example, if all 16 features in Figure 1 were binary, then there would be 32 patterns for $S = 1$, 512 for $S = 2$, 4,992 for $S = 3$, and 34,112 for $S = 4$, growing to almost 20 million for $S = 10$. Fortunately, behavioral theory suggests we can limit our search to cognitively-simple DOC rules.
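These counts can be reproduced directly. Assuming 16 binary features (our illustration), each pattern of size $s$ picks $s$ features and, for each, one of two aspects (the level or its negation), so the number of allowable patterns of size at most $S$ is $\sum_{s=1}^{S} \binom{16}{s} 2^s$:

    # Sketch: number of allowable conjunctive patterns of size at most S for 16 binary features.
    from math import comb

    def num_patterns(F, max_size):
        return sum(comb(F, s) * 2**s for s in range(1, max_size + 1))

    for S in (1, 2, 3, 4, 10):
        print(S, num_patterns(16, S))   # 32, 512, 4992, 34112, 19502912 -- matching the text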

Relationship of DOC Rules to Other Non-compensatory Rules

DOC rules generalize Subset(S) rules and are more flexible. For example, (B&W screen

∧ small size ∧ high resolution) ∨ (color screen ∧ large size) cannot be written as a subset con-

junctive rule. DOC rules nest other logical rules and match specific rules for some values of S.

In particular, we have the following results.

Result 1. The following sets of rules are equivalent: (a) disjunctive rules, (b) Subset(1) rules, and (c) DOC(1) rules.

Result 2. Conjunctive rules are equivalent to Subset(F) rules which, in turn, are a subset of the DOC(F) rules, where F is the number of features.

Result 3. All Subset(S) rules can be written as a DOC(S) rule, but not all DOC(S) rules can be written as a Subset(S) rule.

The formal proofs to Results 1, 2, and 3 are contained in Appendix 2. We have already

argued that disjunctive and conjunctive rules are equivalent to Subset(1) and Subset(F), respec-

tively. Part (c) of Result 1 follows because a conjunction of size S = 1 is just a single aspect and,

if we include all relevant aspects in a DOC(1) rule, it is a disjunction of the aspects. The second


half of Result 2 follows because DOC(F) rules allow conjunctions up to the number of features.

Result 3 follows similar logic. A Subset(S) rule implies that a profile is considered if any

subset of S features is at an acceptable level. Each subset of S features corresponds to one con-

junction. Because at least one of the conjunctions needs to be satisfied, this is just a disjunction

of conjunctions. DOC(S) allows disjunctions of conjunctions that are not allowed with a Sub-

set(S) rule and allows conjunctions with fewer than S features. For example, suppose there were

three features: battery life, track log, and price, and consider a Subset(2) rule with the following

acceptable levels: 30-hour battery life, track log (yes), $249, and $299. This rule can be

written as a DOC(2) rule: (30 hours ∧ yes) ∨ (30 hours ∧ $299) ∨ (30 hours ∧ $249) ∨ (yes ∧

$299) ∨ (yes ∧ $249). However, the rule (30 hours ∧ yes) ∨ ($249) can be written as a DOC(2)

rule, but not a Subset(2) rule.

Results 1 and 2 are important because they make predictions that we test with synthetic

data. For example, we should not be surprised if either a DOC-based or a Subset(1)-based esti-

mation method does well on data generated with disjunctive rules (Result 1) or if either a DOC-

based or a Subset(F)-based estimation method does well on data generated with conjunctive

rules (Result 2). Result 3 implies that the comparison of Subset(S) rules and DOC rules is inter-

esting. A DOC-based estimation may not do as well as a Subset(S)-based estimation on data

generated with Subset(S) rules or vice versa. Of course, the ability to fit a rule to data depends

upon our ability to estimate the parameters of a rule. We turn now to estimation.

5. Estimation of DOC Rules: Issues of Complexity

One strength of DOC rules is their generality, but this also presents a challenge to estima-

tion that infers DOC rules from observed consideration decisions. The following result illus-

trates the generality of DOC rules:

Result 4. Any set of considered profiles can be fit perfectly with at least one DOC rule. Moreover, the DOC rule need not be unique.

To establish Result 4 we recognize that every considered profile, j, is a set of aspects. Let

pj be a pattern of length F that contains all aspects in j and only those aspects. Clearly, pattern pj

matches profile j. This pattern will match no other profile that is not identical to j. A disjunction

of the pj patterns (for all considered j) will match all considered profiles but no other profiles.

For example, fix all features in Figure 1 except battery life, track log (yes or no), and price and

suppose that respondent h considers only two profiles (out of J): {“30-hour battery,” “track log,”


$299} and {“15-hour battery,” “track log,” $249}. These data could be fit perfectly with (“30-

hour battery" ∧ "track log" ∧ "$299") ∨ ("15-hour battery" ∧ "track log" ∧ "$249").

The second half of Result 4 is clear by counterexample. Suppose that respondent h con-

siders only GPSs that have “30-hour battery” and a “track log.” The pattern (“30 hour battery” ∧

“track log”) will fit the data. However, this DOC rule with one size-2 pattern is equivalent to

another DOC rule in which we have a disjunction of two size-3 patterns each of which includes

the simple rule combined with any aspect or its negation: (“30 hour battery” ∧ “track log” ∧

$249) ∨ (“30 hour battery” ∧ “track log” ∧ ¬ $249). Expanding to size-F patterns we find a very

large number of rules consistent with the observed data.
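To make Result 4 concrete, here is a small sketch (ours; as in the example above, profiles are described only by the three free features) that builds the trivially perfect DOC rule, one conjunction per considered profile. It reproduces the estimation data exactly but does not generalize, which is why complexity control is needed:

    # Sketch of Result 4: one conjunctive pattern per considered profile fits the
    # estimation data perfectly but is maximally complex and generalizes poorly.

    def perfect_fit_doc_rule(considered_profiles):
        """Each considered profile (a frozenset of aspects) becomes its own conjunction."""
        return [frozenset(p) for p in considered_profiles]

    def considers(profile_aspects, doc_rule):
        return any(pattern <= set(profile_aspects) for pattern in doc_rule)

    considered = [frozenset({"30-hour battery", "track log", "$299"}),
                  frozenset({"15-hour battery", "track log", "$249"})]
    rule = perfect_fit_doc_rule(considered)
    print(all(considers(p, rule) for p in considered))                 # True: perfect in-sample fit
    print(considers({"30-hour battery", "track log", "$249"}, rule))   # False: no generalization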

Result 4 implies a very important challenge: if for any observed consideration set we can

find at least one, and possibly many, DOC rules that fit the data perfectly, then, without further

constraints, we are likely to over-fit the observed data. Fortunately, the behavioral literature

(cited earlier) suggests a solution to the dilemma of Result 4. Experimental evidence suggests

that consumers use simple rules (e.g., Gigerenzer and Goldstein 1996; Payne, Bettman, and

Johnson 1993). This experimental evidence suggests further that simple rules do well for deci-

sions that consumers face on a day-to-day basis. These behavioral hypotheses are consistent

with the statistical learning literature which recommends avoiding complexity when estimating

the parameters of models (e.g., Cucker and Smale 2002; Evgeniou, Boussios and Zacharia 2005;

Hastie, Tibshirani and Friedman 2001; Langley 1996; Vapnik 1998).

Based on both literatures we propose to focus on simplicity by placing limits on cognitive

complexity. A simpler decision rule that is consistent with experimental evidence may not fit es-

timation data perfectly, but may predict validation data better – because it is more likely to repre-

sent the consumer’s true decision rule or because complexity control mitigates Result 4 and

avoids overfitting. (Simulations in Section 8 hint that the former is a sufficient explanation; ei-

ther or both explanations are consistent with our empirical tests.)

However, even if we reward cognitive simplicity, we might not identify unique patterns

that have empirical validity. To help identify DOC rules for individual respondents we use

“market” information. There are at least two motivations for using “market” information. The

first motivation is an analogy to population shrinkage which enhances accuracy in hierarchical

Bayesian models (e.g., Rossi and Allenby 2003). The second motivation is drawn from the be-

havioral literature which hypothesizes that consumers use simple rules because they “capitalize


on environmental regularities to make smart inferences" (Chase, Hertwig and Gigerenzer 1998, p. 209). Gigerenzer and Selten (2001) argue further that simple rules are "ecologically rational."

Similarly, Payne, Johnson and Bettman (1993, pp. 97-99) demonstrate that the performance of

simple decision rules varies with the decision environment. By inference, commonalities among

respondents facing similar decision environments provide valuable information on which rules

are more likely for a respondent. Either motivation is sufficient to suggest that “market” behavior

provides valuable information for identifying DOC rules for each respondent.

6. Identifying Decision Rules by Accounting for Cognitive Simplicity

In this section we illustrate an estimation method that identifies DOC rules while ac-

counting for cognitive simplicity (and market commonalities). We choose a statistical-learning

algorithm because the modifications for cognitive simplicity and market information are trans-

parent. We believe that the predictive performance is due to the cognitively-simple DOC rules

rather than the specific estimation method. For example, in Section 12 we illustrate how cogni-

tively-simple DOC rules can be identified with a different statistical-learning algorithm (logical

analysis of data). We also demonstrate that (1) algorithms (such as decision trees) without cog-

nitive complexity control and market information do less well and (2) statistical-learning bench-

marks do less well than statistical-learning methods for cognitively-simple DOC rules. To the

extent that another approach to estimating cognitively-simple DOC rules improves predictions

relative to the methods we test, our results are conservative.4

4 Bayesian methods for cognitively-simple DOC rules may face practical challenges due to the length of the $\vec{w}_h$ vector. Such formulations require further research. See Section 12.

The basic data we observe, for a set of respondents and profiles, is whether or not a re-

spondent considers a profile (yhj). We seek to identify the patterns respondent h uses to evaluate

profiles. Fit on a calibration sample is maximized when we select patterns such that profile $j$ is considered if $\vec{m}_j'\vec{w}_h \ge 1$ and not considered if $\vec{m}_j'\vec{w}_h = 0$. (Recall that $\vec{m}_j$ identifies the patterns which match profile $j$ and $\vec{w}_h$ is a binary vector that identifies patterns.)

To measure errors, we define penalty variables. Let $\xi_{hj}^{+}$ be non-negative integers such that $\vec{m}_j'\vec{w}_h \le \xi_{hj}^{+}$. Based on this constraint, $\xi_{hj}^{+}$ will equal 1 (or greater) whenever DOC rules predict that profile $j$ is considered. Similarly, let $\xi_{hj}^{-}$ be non-negative integers such that $\vec{m}_j'\vec{w}_h \ge 1 - \xi_{hj}^{-}$. $\xi_{hj}^{-}$ will equal 1 (or greater) whenever DOC rules predict that profile $j$ is not considered.

We can now define prediction errors. Because $y_{hj} = 1$ when we observe that profile $j$ is considered in the estimation data, $y_{hj}\xi_{hj}^{-}$ will equal 1 (or greater) when we observe profile $j$ is considered but we predict it is not considered (false negative predictions). Similarly, $(1 - y_{hj})\xi_{hj}^{+}$ will equal 1 (or greater) for false positive predictions. Part of our objective in estimating DOC rules is to minimize the sum of these false prediction errors in the estimation data. (We might also weigh false positives more or less than false negatives. We address this issue in Section 9.)

We can define cognitive simplicity in many ways. In this section we penalize the number

of patterns and favor simple rules by allowing only patterns that have a maximum length of S.

(Other penalties, such as the length of the patterns, are also possible. Section 12 provides an il-

lustration.) If $\vec{e}$ is a vector of 1's, of length equal to the number of potential patterns, then we measure complexity by $\vec{e}\,'\vec{w}_h$, which counts the number of patterns in $\vec{w}_h$.

We balance fit and cognitive simplicity with the following loss function. ($\gamma_c$ is a parameter that tells us how much to penalize the lack of cognitive simplicity.)

(1)   $\sum_{j=1}^{J} [y_{hj}\xi_{hj}^{-} + (1 - y_{hj})\xi_{hj}^{+}] + \gamma_c\,\vec{e}\,'\vec{w}_h$

Result 4 cautions us about non-uniqueness even if we favor cognitive simplicity, hence we in-

corporate information from the “market.” In particular, we choose decision rules that are more

likely to match profiles that are considered by other respondents in the market. If Mj is the (mar-

ket) percent of respondents who consider profile j, then we “shrink” to the market with an addi-

tional criterion in the loss function (γM is a parameter that tells us how much to weigh market

considerations).

(2)   $\gamma_M \sum_{j=1}^{J} [M_j\,\xi_{hj}^{-} + (1 - M_j)\,\xi_{hj}^{+}]$

If we select γM to be small, Equation 2 will break ties among those patterns that minimize Equa-

tion 1. Such market-based constraints have proven valuable in other marketing applications

(e.g., Evgeniou, Pontil and Toubia 2007).
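To fix ideas, the sketch below (ours, with hypothetical inputs) evaluates the combined criterion of Equations 1 and 2 for a candidate rule, with the penalty variables taking their smallest feasible values given the constraints defined above:

    # Sketch: DOCMP loss for one respondent, combining fit (Eq. 1) and market shrinkage (Eq. 2).
    # w_h: binary list over allowable patterns; m[j][p] = 1 if profile j matches pattern p.

    def docmp_loss(w_h, m, y_h, M, gamma_c, gamma_M):
        loss = 0.0
        for j in range(len(y_h)):
            mw = sum(mp * wp for mp, wp in zip(m[j], w_h))    # m_j' w_h
            xi_plus = mw                                      # smallest integer with m_j' w_h <= xi+
            xi_minus = max(0, 1 - mw)                         # smallest integer with m_j' w_h >= 1 - xi-
            loss += y_h[j] * xi_minus + (1 - y_h[j]) * xi_plus           # Eq. 1: false negatives/positives
            loss += gamma_M * (M[j] * xi_minus + (1 - M[j]) * xi_plus)   # Eq. 2: shrink toward the market
        return loss + gamma_c * sum(w_h)                                 # Eq. 1: penalize number of patterns

    # Hypothetical example: 3 profiles, 2 allowable patterns.
    m   = [[1, 0], [0, 0], [1, 1]]   # which patterns each profile matches
    y_h = [1, 0, 0]                  # respondent h considers only profile 0
    M   = [0.6, 0.2, 0.4]            # market shares of respondents considering each profile
    print(docmp_loss([1, 0], m, y_h, M, gamma_c=0.1, gamma_M=0.01))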

There are many estimation methods that we might consider. Statistical learning provides


one transparent estimation method to illustrate the concepts of estimating DOC patterns while

accounting for cognitive simplicity and market commonalities. In particular, we use the integer

program in Equation 3, which, for simplicity, we call DOCMP. (The cognitive simplicity con-

straint on $S$ is implicit in the definition of $\vec{w}_h$.)

(3)   $\min_{\{\vec{w}_h, \vec{\xi}_h\}} \; \sum_{j=1}^{J} [y_{hj}\xi_{hj}^{-} + (1 - y_{hj})\xi_{hj}^{+}] + \gamma_M \sum_{j=1}^{J} [M_j\,\xi_{hj}^{-} + (1 - M_j)\,\xi_{hj}^{+}] + \gamma_c\,\vec{e}\,'\vec{w}_h$

Subject to: $\vec{m}_j'\vec{w}_h \le \xi_{hj}^{+}$ for all $j = 1$ to $J$
$\vec{m}_j'\vec{w}_h \ge 1 - \xi_{hj}^{-}$ for all $j = 1$ to $J$
$\xi_{hj}^{+}, \xi_{hj}^{-} \ge 0$, $\vec{w}_h$ a binary vector

Solving the Mathematical Program (DOCMP)

DOCMP is equivalent to the set-covering problem and, hence, is an NP-hard integer program (Cormen et al. 2001). Fortunately, efficient approximation algorithms have been developed and tested for this class of problems. For example, a greedy heuristic runs in polynomial time (Feige 1998; Lund and Yannakakis 1994). The greedy approximation adds non-empty patterns sequentially by choosing the patterns based on the greatest reduction in the objective function and stopping when no further reduction is feasible. Alternatively, DOCMP can be solved approximately with a linear-programming relaxation in which we first allow the elements of $\vec{w}_h$ to be continuous on [0, 1], then round up to 1.0 any positive element that is above a threshold. Formulated thus, the linear-programming relaxation is similar to the "LASSO" method in statistical learning. The "LASSO" method usually provides sparse solutions in which relatively few patterns are chosen (Hastie, Tibshirani, and Friedman 2003, and references therein). In our estimations, we use both the greedy and the relaxation methods, choosing the solution that provides the better value of the objective function. These solution methods scale sufficiently well for cognitively-simple values of $S$ and can easily handle the 16-feature empirical application in Section 9.
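The greedy approximation can be sketched as follows (our reconstruction of the general idea, not the authors' Matlab implementation); `loss` is any callable that returns the Equation 3 objective for a candidate binary pattern-indicator vector, for example the hypothetical docmp_loss sketch above with the data and tuning parameters bound in:

    # Sketch: greedy approximation for DOCMP. Starting from the empty rule, repeatedly add
    # the pattern that most reduces the objective; stop when no addition helps.

    def greedy_docmp(loss, num_patterns):
        w = [0] * num_patterns
        best = loss(w)
        while True:
            best_p, best_val = None, best
            for p in range(num_patterns):
                if w[p] == 0:
                    w[p] = 1
                    val = loss(w)
                    w[p] = 0
                    if val < best_val:
                        best_p, best_val = p, val
            if best_p is None:          # no remaining pattern reduces the objective
                return w, best
            w[best_p], best = 1, best_val

    # Usage with the earlier toy data:
    # w_hat, value = greedy_docmp(lambda w: docmp_loss(w, m, y_h, M, 0.1, 0.01), num_patterns=2)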

Choosing DOCMP's Tuning Parameters with Leave-one-out Cross Validation

DOCMP, when solved, chooses the number of patterns automatically. However, there

are two explicit tuning parameters in Equation 3, γc and γM. These tuning parameters tell us how


much to penalize the number of patterns (one measure of cognitive simplicity) and how much to

shrink h’s patterns toward the market.

We set γM to an arbitrary small number so that market information is used only to break

ties among patterns. We select γc with a method called leave-one-out cross validation. Leave-

one-out cross validation has been used successfully in both the statistical learning and marketing

literatures (e.g., Cooil, Winer and Rados 1987; Efron and Tibshirani 1997; Evgeniou, Pontil and

Toubia 2007, Hastie, Tibshirani, and Friedman 2003; Kearns and Ron 1999; Kohavi 1995; Shao

1993; Toubia, Evgeniou and Hauser 2007; Zhang 2003). Specifically, for each potential value of

γc we leave out one profile from the estimation data and use Equation 3 to identify patterns with

data on the remaining J–1 profiles. We predict consideration for the left-out profile, repeat for

each profile, and sum errors over respondents, choosing γc to minimize leave-one-out cross-

validation errors on the estimation data. No data from any holdout or validation observations are

used in leave-one-out cross validation. We test sensitivity to the choice of γc for both the

calibration and validation samples and, in Section 12, we examine an algorithm that does not use

leave-one-out cross validation.
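A sketch of the leave-one-out search for γc for a single respondent follows (ours; fit_docmp and predict are hypothetical placeholders for solving Equation 3 and applying the estimated rule, and the paper sums these errors over respondents):

    # Sketch: choose gamma_c by leave-one-out cross validation on the estimation profiles only.

    def choose_gamma_c(profiles, decisions, grid, fit_docmp, predict):
        best_gamma, best_errors = None, float("inf")
        for gamma_c in grid:
            errors = 0
            for j in range(len(profiles)):                    # leave profile j out
                train_x = profiles[:j] + profiles[j + 1:]
                train_y = decisions[:j] + decisions[j + 1:]
                rule = fit_docmp(train_x, train_y, gamma_c)   # hypothetical solver for Equation 3
                errors += int(predict(rule, profiles[j]) != decisions[j])
            if errors < best_errors:
                best_gamma, best_errors = gamma_c, errors
        return best_gamma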

We enforce an upper bound on pattern length by fixing S. We choose S = 4 as

appropriate to the goals of this paper. In the simulation experiments F = 4, hence S = 4 provides

a reasonable test of DOCMP. In the empirical application S = 4 provides a rich set of possible

DOC rules (30,000-plus potential patterns) and is consistent with the DOC rules articulated in

qualitative pretests.5 More importantly, a fixed S provides a conservative perspective on whether

accounting for cognitive complexity improves predictions. Similarly, the performance of

DOCMP with S = 4 is a conservative indicator of what is possible when other estimation

methods are modified to identify cognitively-simple DOC decision rules.

7. Benchmark Decision Rules

We compare cognitively-simple DOC rules to compensatory, conjunctive, disjunctive,

subset conjunctive, and lexicographic rules. Results 1-3, and the degeneracy of lexicographic

rules for consideration decisions, imply that conjunctive, disjunctive, and lexicographic rules are

special cases of Subset(S). These benchmarks provide a broad sampling from previous proposals

5 Large S is neither consistent with behavioral theory nor parsimonious. For example, when S = 7 any proposed estimation method must deal with almost 2 million patterns. We choose to be conservative in our use of empirical validation data and, hence, fix S to an empirically-reasonable value.


and enable us to test DOC rules vs. other rules of consideration decisions. In Section 12 we

address other estimation methods for cognitively-simple DOC rules.

To estimate our benchmarks, we use published hierarchical Bayes (HB) methods. We

retain the basic Bayesian formulations cited in the references, modified slightly for consideration

decisions. These formulations have been applied widely and have been shown to predict well

(e.g., Arora and Huber 2001; Gilbride and Allenby 2004, 2006; Rossi and Allenby 2003).

The benchmark rules can also be estimated with statistical-learning methods (less

common in the marketing literature). As a check on the robustness of our empirical

comparisons, we formulate statistical-learning methods analogous to DOCMP for both

compensatory and Subset(S) rules (see Appendix 6). The performance is comparable to the

Bayesian methods and does not change the basic interpretations of the empirical comparisons.

HB Compensatory Estimation

Respondent $h$ considers profile $j$ if $\vec{x}_j'\vec{\beta}_h + \epsilon_{hj}$ is above a threshold.6 Subsuming the threshold in the partworths, we get a standard logit likelihood function:

(4)   $\Pr(y_{hj} = 1 \mid \vec{x}_j, \vec{\beta}_h) = \dfrac{e^{\vec{x}_j'\vec{\beta}_h}}{1 + e^{\vec{x}_j'\vec{\beta}_h}}$

and $\Pr(y_{hj} = 0 \mid \vec{x}_j, \vec{\beta}_h) = 1 - \Pr(y_{hj} = 1 \mid \vec{x}_j, \vec{\beta}_h)$. We impose a first-stage prior on $\vec{\beta}_h$ that is normally distributed with mean $\vec{\beta}_0$ and covariance $D$. The second-stage prior on $D$ is inverse-Wishart with parameters equal to $I/(Q+3)$ and $Q+3$, where $Q$ is the number of parameters to be estimated and $I$ is an identity matrix. We use diffuse priors on $\vec{\beta}_0$. Inference is based on a Monte Carlo Markov chain with 20,000 iterations, the first 10,000 of which are used for burn-in.

6 An additive model can represent a lexicographic or conjunctive model, for example, if one partworth is greater than the sum of the other partworths. To account for this phenomenon, Hogarth and Karelaia (2005), Martignon and Hoffrage (2002), and Yee et al. (2007) constrain an additive model such that the partworths are compensatory. Their data suggest that an additive model outperforms a constrained compensatory model. Hence, HB Compensatory is a conservative benchmark for DOC-based estimation.
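For concreteness, a minimal sketch of the likelihood in Equation 4 (ours; the binary profile coding and partworths are hypothetical):

    # Sketch: logit consideration probability of Equation 4 for one profile.
    import math

    def consider_probability(x_j, beta_h):
        """Pr(y_hj = 1 | x_j, beta_h) = exp(x_j' beta_h) / (1 + exp(x_j' beta_h))."""
        utility = sum(x * b for x, b in zip(x_j, beta_h))
        return 1.0 / (1.0 + math.exp(-utility))

    print(consider_probability([1, 0, 1], [0.8, -0.4, 1.1]))   # about 0.87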

HB Subset(S) Estimation (includes Disjunctive and Conjunctive)

We use the hierarchical Bayes model of Gilbride and Allenby (2004) modified to estimate Jedidi and Kohli's (2005) subset conjunctive rules. The modifications reflect differences in data and the generalization in models. In particular, we observe consideration directly while it is a latent construct in the Gilbride-Allenby formulation. We also do not impose constraints that


levels within a feature are ordered. This allows us to address multi-level features in which there

is no defined ordering. The Subset(S)-based likelihood function is:

(5)   $\Pr(y_{hj} = 1 \mid \vec{x}_j, \vec{a}_h) = \begin{cases} b_1 & \text{if } \vec{x}_j'\vec{a}_h \ge S \\ b_2 & \text{if } \vec{x}_j'\vec{a}_h < S \end{cases}$

where again $\Pr(y_{hj} = 0 \mid \vec{x}_j, \vec{a}_h) = 1 - \Pr(y_{hj} = 1 \mid \vec{x}_j, \vec{a}_h)$. The parameters $b_1$ and $b_2$ model response errors. Specifically, a profile is considered with probability $b_1$ if it satisfies a Subset($S$) rule; a profile is considered with probability $b_2$ if it does not.

The first-stage prior on each $a_{hf\ell}$ is a binomial distribution with parameter $\theta_{f\ell}$. The second-stage priors are beta for $b_1$ and $b_2$ and Dirichlet for the $\theta_{f\ell}$'s. (We use the same distributions and parameterization used by Gilbride and Allenby 2004.) We impose the constraint $b_1 > b_2$ with rejection sampling (e.g., Allenby, Arora and Ginter 1995). Inference is based on 20,000 iterations of the Monte Carlo Markov chain, the first 10,000 of which are used for burn-in. Because the set of possible acceptabilities, $\vec{a}_h$, is large, we follow Gilbride and Allenby (2004, p. 404) and use a "Griddy Gibbs" algorithm. Details are available in Gilbride and Allenby and are summarized in Appendix 3.
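A small sketch of the response-error likelihood in Equation 5 (ours; the aspect-coded inputs are hypothetical):

    # Sketch: Subset(S) likelihood with response errors b1 > b2 (Equation 5).

    def subset_likelihood(x_j, a_h, S, b1, b2, y_hj):
        """Pr(y_hj | x_j, a_h): b1 if the profile satisfies the Subset(S) rule, b2 otherwise."""
        satisfies = sum(x * a for x, a in zip(x_j, a_h)) >= S    # x_j' a_h >= S ?
        p_consider = b1 if satisfies else b2
        return p_consider if y_hj == 1 else 1.0 - p_consider

    print(subset_likelihood([1, 0, 1, 1], [1, 1, 0, 1], S=2, b1=0.9, b2=0.1, y_hj=1))   # 0.9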

8. Simulation Experiments

We seek to test whether matched decision rules predict better than decision rules that are

mismatched, where we say a decision rule is matched if both the estimation and the data genera-

tion are based on that decision rule. For example, we might expect compensatory-based estima-

tion to predict better than non-compensatory-based estimation when the data are generated with a

compensatory rule. We might also expect Subset(S)-based estimation to predict better when the

data are generated with Subset(S) and DOC-based estimation to predict better when the data are

generated with DOC rules, unless the generating rules are equivalent as per Results 1-3.

We consider products with 4 features, each with 4 levels. We generate two orthogonal

designs of 32 profiles each.7 The first orthogonal design is used for estimation (including leave-

one-out cross validation). The second orthogonal design is used purely for validation. We gen-

erate data independently for each of eight “true” decision rules: compensatory, disjunctive [same

7 Orthogonal designs might pose problems for leave-one-out cross validation (Evgeniou, Pontil and Toubia 2007). Thus, the choice of orthogonal designs favors the HB benchmark estimation methods and provides a conservative test of DOCMP. We return to this issue at the end of Section 9.


as Subset(1) and DOC(1)], Subset(2), Subset(3), conjunctive [same as Subset(4) and lexico-

graphic for consideration sets], DOC(2), DOC(3), and DOC(4). For each decision rule we gen-

erate estimation and validation data for four independent sets of 100 respondents.

We generated the data to be consistent with the HB formulations: normally-distributed

partworths and binomial sampling from logit probabilities for the compensatory rules; Dirichlet-

distributed acceptability parameters and binomial sampling for choices for Subset(S) rules. The

DOC-based data were based on Dirichlet-distributed pattern weights and a binomial distribution

for choices with the same $b_1$, $b_2$ probabilities as for Subset(S). To maintain consistency among

alternative data-generation rules, we calibrated the decision rules to be as parsimonious as feasi-

ble and to hold the average number of considered profiles constant across decision rules (8 pro-

files; a number consistent with our empirical data). Details are provided in Appendix 4.
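As one illustration of this data-generation step (ours, with arbitrary parameter values), the sketch below draws compensatory-rule consideration decisions: normally distributed partworths, then binomial sampling from the logit probabilities:

    # Sketch: synthetic consideration data for compensatory-rule respondents.
    import math, random

    def generate_compensatory_data(profiles, mean_beta, sd_beta, num_respondents, seed=0):
        rng = random.Random(seed)
        data = []
        for _ in range(num_respondents):
            beta = [rng.gauss(mu, sd_beta) for mu in mean_beta]        # respondent partworths
            decisions = []
            for x in profiles:
                u = sum(xi * bi for xi, bi in zip(x, beta))
                p = 1.0 / (1.0 + math.exp(-u))                         # logit consideration probability
                decisions.append(1 if rng.random() < p else 0)         # binomial sampling
            data.append(decisions)
        return data

    # Hypothetical binary-coded profiles and partworth means:
    profiles = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1]]
    print(generate_compensatory_data(profiles, [0.5, -0.5, 1.0, 0.2], sd_beta=1.0, num_respondents=2))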

In making comparisons, we take Results 1-3 into account. For example, both DOCMP

and HB Subset(1) are matched to a disjunctive rule. The results are given in Table 1. The per-

centages in bold are the best predictions on validation data (or not significantly different from the

best) for the indicated data-generation decision rule.

Table 1. Out-of-Sample Hit Rate (Each Estimation Method and Each Data-Generation Decision Rule)

Hit rate for the indicated estimation method:

Data-Generation Decision Rule        HB Compensatory  HB Subset(1)  HB Subset(2)  HB Subset(3)  HB Subset(4)  DOCMP
Compensatory                              74.6%*          45.2%        59.3%         66.7%         72.4%       72.8%
Subset(2)                                 78.5%           71.1%        88.0%*        85.4%         80.3%       84.5%
Subset(3)                                 78.6%           61.3%        81.9%         87.2%*        80.9%       83.8%
Conjunctive [Subset(4)]                   78.7%           60.3%        80.7%         87.1%         89.0%*      89.2%*
Disjunctive [DOC(1), Subset(1)]           84.4%           85.6%        86.4%         86.1%         83.7%       90.8%*
DOC(2)                                    77.6%           70.6%        76.1%         78.6%         78.8%       87.0%*
DOC(3)                                    76.3%           51.0%        65.4%         76.4%         77.8%       83.3%*
DOC(4)                                    74.8%           53.7%        65.8%         75.0%         76.9%       82.9%*

*Best predictive hit rate, or not significantly different than the best at the 0.05 level, for that decision rule (row).


Table 1 has a distinctly diagonal flavor. Predictions are usually best whenever estimation

is matched to the decision rule by which synthetic respondents make consideration decisions.

There is redundancy for conjunctive and disjunctive rules. For conjunctive rules, the predictive

abilities of DOCMP and HB Subset(4) are not statistically different. However, for disjunctive

rules DOCMP does better than HB Subset(1). Table 1 suggests that an estimation method pre-

dicts well on average when it matches the true decision rule. This is a necessary, but not suffi-

cient, condition for using this set of estimation methods to attempt to infer whether DOC rules

are a reasonable description of empirical consideration decisions.

9. Empirical Application – Global Positioning Systems (GPSs)

Using the sixteen features in Figure 1 we generated an orthogonal design of 32 GPS pro-

files.8 We then developed four alternative formats by which to measure consideration. These

respondent task formats were developed based on qualitative pretests to approximate the shop-

ping environment for GPSs. Each respondent task format was implemented in a web-based sur-

vey and pretested extensively with over 55 potential respondents from the target market. At the

end of the pretests respondents found the tasks easy to understand and felt that the task formats

were reasonable representations of the GPS market.

We invited two sets of respondents to complete the web-based tasks: a representative

sample of German consumers who were familiar with GPSs and a US-based student sample. In

this section we describe results from our primary format using the German sample of representa-

tive consumers. We defer to Section 10 discussion of the student sample, the other formats, and

a text-only version.

Figure 2 provides screen-shots in English and German for the basic format. A “bullpen”

is on the far left. As respondents move their cursor over a generic image in the bullpen, a GPS

appears in the middle panel. If respondents click on the generic image, they can evaluate the

GPS in the middle panel deciding whether or not to consider it. If they decide to consider the

GPS, its image appears in the right panel. Respondents can toggle between current consideration

sets and their current not-consider sets. There are many ways in which they can change their

mind, for example, putting a GPS back or moving it from the consideration set to the not-

consider set, or vice versa. In this format respondents continue until all GPSs are evaluated.

8 To make the task realistic and to avoid dominated profiles (Johnson, Meyer and Ghose 1989), price was manipulated as a two-level price increment. Profile prices were based on this increment plus additive feature-based costs. We return to the issue of orthogonal designs at the end of this section.


Figure 2 Consideration Task in One of the Formats (English and German)

Because decision rules are often context dependent (Payne, Bettman and Johnson 1988,

1993), it is possible that forcing respondents to evaluate every profile would influence decision

rules. Thus, we tested two formats that do not require respondents to evaluate all GPSs. We also

tested a format in which respondents saw profiles randomly. See Section 10.

Before respondents made consideration decisions, they reviewed screens that described

GPSs in general and each of the GPS features. They also viewed instruction screens for the con-

sideration task and instructions that encouraged incentive compatibility. Following the consid-


eration task respondents ranked profiles within the consideration set (data not used in this paper)

and then completed tasks designed to cleanse memory. These tasks included short brain-teaser

questions that direct respondents’ attention away from GPSs. Following the memory-cleansing

tasks, respondents completed the consideration task a second time, but for a different orthogonal

set of GPSs. These second consideration decisions are validation data and are not used in the es-

timation of any rules.

Respondents were drawn from a web-based panel of consumers maintained by the GfK

Group. Initial screening eliminated respondents who had no interest in buying a GPS and no ex-

perience using a GPS. Those respondents who completed the questionnaire received an incen-

tive of 200 points toward general prizes (Punkte) and were entered in a lottery in which they

could win one of the GPSs (plus cash) that they considered. This lottery was designed to be in-

centive compatible as in Ding (2007) and Ding, Grewal, and Liechty (2005). (Respondents who

completed only the screening questionnaire received 15 Punkte.)

In total 2,320 panelists were invited to answer the screening questions. The incidence

rate (percent eligible) was 64%, the response rate was 47%, and the completion rate was 93%.

Respondents were assigned randomly to one of the five task formats (the basic format in Figure

2, three alternative formats, and a text-only format). After eliminating respondents who had null

consideration sets or null not-consider sets in the estimation task, we retained 580 respondents.

The average size of the consideration set (estimation data) for the task format in Figure 2 was 7.8

profiles. There was considerable variation among respondents (standard deviation was 4.8 pro-

files). The average size of the consideration set in the validation task was smaller, 7.2 profiles,

but not significantly different. Validation consideration set sizes had an equally large standard

deviation (4.8 profiles).

Predictive Tests

Initially, we estimate HB Compensatory, HB Subset(S) models for S = 1 to 4, and

DOCMP. Table 2 summarizes the ability of each estimation method (calibrated on the estima-

tion task) to predict consideration for the validation task. DOC-based estimation is significantly

better than all benchmark estimation methods on the ability to predict consideration (hit rate).

The next best method is HB Compensatory estimation.

Interestingly, if we were to examine hit rate alone on this data set, and limit ourselves to

estimation methods other than DOCMP, we might conclude erroneously that a compensatory


rule has the best hit rate. Given the robustness of the linear model for empirical data (e.g.,

Dawes 1979; Dawes and Corrigan 1974), this is not surprising. Including DOC-based estimation

gives, potentially, a different interpretation: cognitively-simple non-compensatory rules, DOC,

have the best hit rate.9

Table 2 Empirical Comparison of Estimation Methods

(Representative German Sample, Task Format in Figure 2)

Estimation method    Overall hit rate†    Relative hit-rate improvement    K-L divergence percentage

HB Compensatory 78.5% 34.4% 15.0%

HB Subset(1) [Disjunctive] 66.7% -1.7% 17.8%

HB Subset(2) 69.1% 5.8% 21.6%

HB Subset(3) 74.8% 23.0% 24.9%

HB Subset(4) 75.4% 24.0% 24.7%

DOCMP [Disjunctions of conjunctions] 81.9%* 44.8%* 32.0%*

† Number of profiles predicted correctly, divided by 32. * Best at the 0.05 level.

Hit rate was sufficient for the synthetic-data experiments because we sought relative pre-

dictive ability. However, hit rates alone are difficult to interpret for empirical consideration de-

cisions. For example, if a respondent considers 8 profiles (of 32) in the validation sample, then

naively predicting the respondent considers nothing will predict all not-considered profiles cor-

rectly and give a hit rate of 75%. If 8 profiles were considered in the validation sample, a ran-

dom prediction that 25% of the profiles are considered gives an average hit rate of 62.5%

((0.25)² + (0.75)²). To account for this phenomenon, Srinivasan (1988), Srinivasan and Park

(1997) and, in a related situation, Payne, Bettman and Johnson (1993, p. 128) use a relative

measure: (observed hit rate – random hit rate)/(100% - random hit rate). This relative measure is

given in Table 2 where, in our case, “random” is the expected hit rate obtained on the validation

sample if we randomly predict consideration in proportion to that observed on the estimation

sample.
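As a concrete illustration of this adjustment, the short Python sketch below computes the random-prediction benchmark and the relative hit rate for one respondent; the function and variable names are ours and are only meant to illustrate the formula quoted above.

```python
# Sketch: the relative hit-rate measure of Srinivasan (1988).
# "Random" predicts consideration in proportion to the share of profiles
# considered on the estimation task, as described in the text.

def random_hit_rate(share_considered: float) -> float:
    """Expected hit rate of a random prediction, e.g., 0.625 when 8 of 32 profiles are considered."""
    p = share_considered
    return p * p + (1.0 - p) * (1.0 - p)

def relative_hit_rate(observed_hit_rate: float, share_considered: float) -> float:
    """(observed hit rate - random hit rate) / (100% - random hit rate)."""
    r = random_hit_rate(share_considered)
    return (observed_hit_rate - r) / (1.0 - r)

print(random_hit_rate(8 / 32))            # 0.625, matching the example in the text
print(relative_hit_rate(0.75, 8 / 32))    # roughly 0.333 for a 75% observed hit rate
```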

Another issue with hit rate is that it places equal weight on false positives and false nega-

9 This is a paramorphic statement. We are hesitant to conclude that the model with the best hit rate is the best isomorphic description of respondents’ decision rules. The simulation results are necessary, but not sufficient, to conclude that DOC is the true decision model.


tives. If we knew the managerial situation we could weigh these two types of error differently in

Equation 1. Absent managerial weights we turn to information theory for a natural way to com-

bine false positives and false negatives. This measure is the Kullback-Leibler (K-L) divergence

(Chaloner and Verdinelli 1995; Kullback and Leibler 1951; Lindley 1956). It measures the ex-

pected gain in Shannon’s information relative to a random model. Because the null model de-

pends upon the number of profiles considered, we normalize by comparing the K-L divergence

for a model to the K-L divergence for perfect prediction (e.g., as in Hauser 1978). Appendix 5

provides formulae for the K-L divergence percentage. This measure complements relative hit

rate because none of the estimation methods is designed to maximize K-L divergence. The K-L

divergences, reported in Table 2, are consistent with those from overall and relative hit rates;

cognitively-simple DOC-based estimation is significantly better than the other rule/estimation

methods.

Empirical Evidence Implies Relatively Simple DOC Rules

Although DOCMP encourages cognitive simplicity, the estimation could choose cogni-

tively complex rules if they were the best rules on the calibration data. Despite the large number

of potential patterns, DOCMP chose relatively simple rules for our data. Only 7.1% of the re-

spondents used more than one pattern and no one used more than two patterns. For most respon-

dents DOCMP appears to predict well because it focuses on a relatively few specific patterns and

is flexible about pattern length (subject to complexity). Subset(S) rules are less specific and less

flexible; they require a disjunction of conjunctions of size S. The pure disjunctive rule might be

too simple; it does not appear to predict well.

To further examine the advantage of simple rules, we re-estimated DOCMP without ac-

counting for cognitive simplicity and market commonalities. The relative hit rate for this model

(25.9%) was comparable to the non-DOC benchmarks and significantly worse than the full

DOCMP model (p < 0.001).10

Statistical-Learning Benchmarks and LOOCV Sensitivity

Bayesian estimation for the compensatory and Subset(S) benchmarks is common in mar-

keting. However, it is possible that the results in Table 2 are due to the use of statistical-learning

and/or leave-one-out cross validation (LOOCV) rather than DOC rules, cognitive simplicity, and

10 The hit rate for the reduced model was 75.7% and the K-L divergence was 29.6%. Eliminating either complexity or market commonality, but not both, gives intermediate results.


market commonality. To test this hypothesis, we re-estimated the compensatory and Subset(S)

benchmarks using integer programs formulated to be as similar to DOCMP as feasible

(CompMP and SubsetMP – see Appendix 6). The statistical-learning methods predicted better

than the Bayesian methods for compensatory and less well for Subset(S); DOCMP was better

than both CompMP and SubsetMP on both K-L percentage and hit rate with most comparisons

significant.11 This suggests that the higher performance of DOCMP is due at least partly to the

use of cognitively-simple DOC rules.

On our data DOCMP’s performance is relatively insensitive to γc. For γc =1 to 4.5 the

CV hit rate (used to select γc) varies from 80.8% to 81.6% and validation hit rate varies from

81.9% to 82.4%. This robustness is consistent with Evgeniou, Pontil and Toubia (2007).

In summary, the performance of cognitively-simple DOC rules does not appear to be due

solely to either statistical learning methods or to LOOCV.

Sensitivity to Orthogonal Designs

There has been significant research in marketing on efficient experimental designs for

choice-based conjoint experiments (Arora and Huber 2001; Huber and Zwerina 1996; Kanninen

2002; Toubia and Hauser 2007), but we are unaware of any research on efficient experimental

designs for consideration decisions or for the estimation of cognitively-simple DOC rules. When

decisions are made with respect to the full set of 32 profiles, aspects are uncorrelated up to the

resolution of the design and, if there were no errors, we should be able to identify DOC patterns.

However, when one profile is removed for LOOCV, aspects are no longer uncorrelated and pat-

terns may not be defined uniquely – especially if one of the considered profiles is left out. As a

mild test, we re-estimated the two best-performing rules, DOCMP and HB Comp, with only the

17 of 32 most-popular profiles (#’s 16-17 were tied). DOCMP remained significantly better on

both comparison measures: DOCMP achieved a K-L of 29.4% and a hit rate of 79.3%; HB

Comp achieved a K-L percentage of 15.8% and a hit rate of 76.3%.

Until the issue of optimal DOC-consideration experimental designs is resolved, the per-

formance of DOCMP remains a conservative test of cognitively-simple DOC rules. Improved or

adaptive experimental designs might improve performance.

11 K-L percentage: CompMP 23.0%, p < 0.001, SubsetMP 11.8%, p < 0.001. Hit rate: CompMP 80.6%, p = 0.08, SubsetMP, p < 0.001. Statistical tests compare to DOCMP’s K-L percentage of 32.0% and hit rate of 81.9%.


Summary of Empirical Results (Initial Tests)

DOC-based estimation appears to yield simple rules, achieve good hit rates, and provide

information about future consideration decisions. Some of this improvement is due to a focus on

DOC rules and some due to accounting for cognitive simplicity and market commonality. These

results appear to be robust with respect to the method by which the benchmarks are estimated.

10. Robustness: Target Population, Task Format, and Profile Representation

The results in Table 2 are promising, but we would like to know whether this predictive ability

is an anomaly or a more robust finding. For example, we would like to examine

hypotheses that the predictive ability is unique to the GfK respondents, to the task format in Fig-

ure 2, or to the way we present profiles.

US Student Sample vs. Representative German Sample

We replicated the GPS measurement with a sample of MBA students at a US university.

Students were invited to an English-language website (e.g., first panel of Figure 2). As incen-

tives, and to maintain incentive-compatibility, they were entered in a lottery with a 1-in-25

chance of winning a laptop bag worth $100 and a 1-in-100 chance of winning a combination of

cash and one of the GPSs that they considered. The response rate for US students was lower,

26%, and consideration-set sizes were, on average, larger. Despite the differences in sample, re-

sponse rate, incentives, and consideration-set size, DOCMP was still the best estimation method

on both the hit-rate and K-L divergence metrics. See Table 3.

Table 3 Replication with a US Student Sample

Estimation method    Overall hit rate†    Relative hit-rate improvement    K-L divergence percentage

HB Compensatory 78.9% 39.2% 19.4%

HB Subset(1) [Disjunctive] 61.2% -11.6% 20.6%

HB Subset(2) 72.9% 22.0% 27.9%

HB Subset(3) 72.0% 19.6% 26.3%

HB Subset(4) 72.7% 21.5% 26.6%

DOCMP [Disjunctions of conjunctions] 82.3%* 49.2%* 36.5%*

† Number of profiles predicted correctly, divided by 32. * Best at the 0.05 level.


Variations in Task Formats

In Figure 2 respondents must evaluate every profile (“evaluate all profiles”). However,

such a restriction may be neither necessary nor descriptive. For example, Ordóñez, Benson and

Beach (1999) argue that consumers screen products by rejecting products that they would not

consider further. Because choice rules are context dependent (e.g., Payne, Bettman and Johnson

1993), the task format could influence the propensity to use a DOC rule.

To examine context sensitivity, we tested alternative task formats. One format asked re-

spondents to indicate only the profiles they would consider (“consider only”); another asked re-

spondents to indicate only the profiles they would reject (“reject only”). The tasks were other-

wise identical to “evaluate all profiles.” We also tested a “no browsing” format in which re-

spondents evaluated profiles sequentially (in a randomized order). Representative screen shots

for these formats are shown in Appendix 7.

Table 4 Comparison of Predictive Ability for Different Task Formats

K-L Divergence Percentage for Each Respondent Task Format (respondents were assigned randomly to format)

Estimation method    Evaluate all profiles    Consider only    Reject only    No browsing    Text only

HB Compensatory 17.8% 5.7% 14.6% 17.6% 13.9%

HB Subset(1) [Disjunctive] 15.0% 9.3% 20.5% 17.8% 11.2%

HB Subset(2) 21.6% 13.1% 27.9% 26.0% 18.5%

HB Subset(3) 24.9% 15.5% 28.2% 25.9% 20.7%

HB Subset(4) 24.7% 15.5% 27.9% 25.7% 21.3%

DOCMP [Disjunctions of conjunctions] 32.0%* 29.4%* 42.1%* 34.1%* 30.5%*

* Best at the 0.05 level.

We first examine predictive ability where, for simplicity, we show only the K-L diver-

gence percentage for the German respondents. See Table 4. DOC-based estimation was signifi-

cantly better than all benchmarks. It was also significantly better on German-respondent hit rates

and on the US student respondents for both hit rates and K-L divergence.12 For ease of compari-

12 Tables available from the authors. The German sample sizes were 93, 135, 94, 123, and 135, respectively, for the formats in Table 4.


son, we repeat the K-L divergence percentages for “evaluate all profiles” (Figure 2).

As predicted by the evaluation-cost theory of consideration-set formation, respondents

considered fewer profiles when the relative evaluation cost (for consideration) was higher: 4.3

profiles in “consider only,” 7.8 in “evaluate all,” and 10.6 in “reject only.” As predicted by the

theory of context dependence, the propensity to use a second DOC pattern varied as well. Second

disjunctions were more common when consideration sets were larger: 0% for “consider only,”

7.1% for “evaluate all,” and 9.8% for “reject only.” While our data cannot distinguish whether

the differences are due to the size of the consideration set or due to differential evaluation efforts

induced by task variation, these data illustrate how revealed-preference non-compensatory esti-

mation provides a non-intrusive indicator that complements more direct (but intrusive) measures.

Text-Only vs. Visual Representation of the GPS Profiles

The profile representations in Figure 1 were designed by a professional graphic artist and

were pretested extensively. Pretests suggested which features should be included in the “JPEGs”

and which features should be included as satellite icons. Nonetheless, it is possible that the rela-

tive predictive ability of the estimation methods might depend upon the specific visual represen-

tations of the profiles. To examine this hypothesis we included a task format that was identical

to the task in Figure 2 except that all features were described by text rather than pictures, icons,

and text (see Appendix 7). The results are given in the last column of Table 4. DOC-based es-

timation is again the best predictive method. Interestingly, there is no significant difference be-

tween picture-representations and text-representations for DOCMP predictions (t = 0.40).

Summary of Robustness Tests

The relative predictive ability of the tested methods appears to be robust with respect to:

• respondent sample (representative German vs. US student),

• format of the respondent task (evaluate all profiles, consider only, reject only,

or no browsing),

• presentation of the stimuli (pictures vs. text).

11. Managerial Implications and Category Context

To investigate whether the empirical data for GPSs leads to differential insights we com-

pare the estimated cognitively-simple DOC rules with the estimated compensatory rules. We

compare measures of relative influence in the consumer’s decision, the implied value to the firm


of feature improvements, and two indicators of co-occurrence.

Comparing Conjunctive Features to Compensatory Partworths

In our data, compensatory rules suggest that price is the most important feature represent-

ing, on average, 14% of the relative importance. DOC rules suggest that price is even more in-

fluential in screening: 70% of the respondents include one or more price levels in a conjunction

and 92% of those use price as a rejection mechanism. This result is face valid for a relatively

new and unfamiliar category such as GPSs. The next highest screening features are the mini-

USB port (36%), an extra bright display (25%), and a color display (21%) with relative compen-

satory importances of 10%, 8%, and 11%, respectively.

Value of Feature Improvements

Ofek and Srinivasan (2002, p. 401) propose that the value of a feature be defined as “the

incremental price the firm would charge per unit improvement in the product attribute (assumed

to be infinitesimal) if it were to hold market share (or sales) constant." In DOC rules features and

price levels are discrete, hence we modify their definition slightly. We compute the incremental

improvement in market share if a feature is added for an additional $50 in price. Because this

calculation is sensitive to the base product, we select the features of the base product randomly.
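To make the calculation concrete, here is a small Python sketch of the consideration-share change when a base profile gains a feature and moves up one price level under estimated DOC rules. The representation of rules (lists of conjunctions, each a frozenset of required aspects) and all names are our own illustration, not the authors' code or data.

```python
# Sketch: change in consideration share when a base profile gains a feature
# and moves to a higher price level, under estimated DOC rules.
# A pattern is a frozenset of aspects that must all be present (a conjunction);
# a respondent's rule is a list of patterns (a disjunction of conjunctions);
# a profile is a set of aspects. All names and data are illustrative.

from typing import FrozenSet, List, Set

Pattern = FrozenSet[str]
DOCRule = List[Pattern]

def considers(rule: DOCRule, profile: Set[str]) -> bool:
    """A profile is considered if it matches at least one conjunction."""
    return any(pattern <= profile for pattern in rule)

def consideration_share(rules: List[DOCRule], profile: Set[str]) -> float:
    """Share of respondents whose DOC rule accepts the profile."""
    return sum(considers(r, profile) for r in rules) / len(rules)

def share_change(rules, base, added_aspect, old_price, new_price):
    """Share after adding the feature and swapping the price aspect, minus the base share."""
    improved = (base - {old_price}) | {added_aspect, new_price}
    return consideration_share(rules, improved) - consideration_share(rules, base)

# Toy example with two respondents and hypothetical aspect labels.
rules = [
    [frozenset({"color_display", "price_low"})],
    [frozenset({"garmin"}), frozenset({"extra_bright", "price_low"})],
]
base = {"garmin", "bw_display", "price_low"}
print(share_change(rules, base, "color_display", "price_low", "price_high"))
```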

We illustrate two of the many differences between DOC rules and compensatory rules.

(1) According to HB Compensatory, the Magellan brand has higher average partworths. Further,

54% of the respondents have higher (consideration-set) partworths for Magellan compared to

Garmin. However, according to DOCMP about 12% of the respondents screen on brand and

82% of those prefer Garmin. As a result, DOC rules predict that that consideration share would

increase if we switch to Garmin and raise the price by $50, but compensatory rules predict that

consideration share would decrease. (2) The HB compensatory model predicts that “extra

bright” is the highest-valued feature improvement yielding an 11% increase for the $50 price.

However, DOC rules predict a much smaller improvement (2%) because many of the respon-

dents who screen on “extra bright” also eliminate on the higher price.

Comparing Correlation with Co-occurrence

Correlation and co-occurrence matrices are somewhat different data summaries.13 To

13 Correlation among partworths for compensatory rules. Relative co-occurrence of features within a pattern for DOC rules. Cutoff of 20.5% defined by correlation significance at the 0.05 level.


fully appreciate their implications we would need to embed DOC rules or compensatory rules

within a product-line optimization. While this is beyond the scope of this paper, we obtain initial

insight by comparing these two summaries of co-variation.

Estimated compensatory rules imply complex covariation with over 77% of the entries

significant. Cognitively-simple DOC rules imply simpler covariation with 21% of the entries

above the comparable cutoff. If a full product-line optimization model were developed, DOC-

based rules might imply that products need fewer features in order to be considered. However,

because DOC rules also tend to be heterogeneous, the co-occurrence matrix might also imply a

broader product line. We leave full investigation to future research.
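For readers who want to reproduce this kind of summary, the sketch below computes the relative co-occurrence of aspect pairs within DOC patterns across respondents; the rule representation and the normalization (share of respondents with at least one pattern containing both aspects) are our assumptions rather than the authors' exact procedure.

```python
# Sketch: relative co-occurrence of aspects within DOC patterns.
# Each respondent's rule is a list of patterns (frozensets of aspects).
# For each pair of aspects we report the share of respondents who have
# at least one pattern containing both aspects. Illustrative only.

from itertools import combinations
from collections import Counter

def cooccurrence_shares(rules):
    counts = Counter()
    for rule in rules:
        pairs_for_respondent = set()
        for pattern in rule:
            for a, b in combinations(sorted(pattern), 2):
                pairs_for_respondent.add((a, b))
        counts.update(pairs_for_respondent)   # count each pair at most once per respondent
    n = len(rules)
    return {pair: c / n for pair, c in counts.items()}

rules = [
    [frozenset({"price_low", "color_display"})],
    [frozenset({"price_low", "color_display", "garmin"}), frozenset({"mini_usb"})],
]
print(cooccurrence_shares(rules))
# e.g., {('color_display', 'price_low'): 1.0, ('color_display', 'garmin'): 0.5, ...}
```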

Summary of Managerial Comparisons

We have illustrated a few of the many managerial differences for the GPS market. These

differences have face validity, but we are cautious about generalizations. The GPS category was

relatively new and unfamiliar to many respondents at the time of our study. We expect that this

domain favors cognitively simple rules. For example, Yee et al. (2007) found more lexico-

graphic rules in Smart Phones, which were new to the market at the time of their study, than they

found in personal computers. It is possible that relatively more DOC rules are used to screen

GPSs than would be used in a familiar category such as standard cell phones.

12. Alternative Estimation Methods

DOCMP predicts well and is robust, but it is not the only estimation method that

can be used to estimate DOC rules while favoring cognitive simplicity and market commonal-

ities. In this section we (1) illustrate a non-LOOCV DOC-based estimation method and (2) sug-

gest how other popular methods can be modified to identify cognitively-simple DOC rules.

Logical Analysis of Data

Logical analysis of data (LAD) attempts to identify minimal sets of features to

distinguish “positive” events from “negative” events (Boros et al. 1997; 2000). In its basic

form, LAD uses a greedy algorithm to find the fewest patterns necessary to match the set of

considered profiles. The union of these patterns is a DOC rule. When we applied LAD to our

data we were able to achieve a relative hit rate of 40.3% and a K-L divergence percentage of

32.5%, both of which are significantly better than the non-DOC models in Table 2. Basic LAD is

significantly worse than DOCMP on hit rate (p = 0.03) but not K-L divergence (p = 0.79).
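To convey the flavor of the basic algorithm, the following Python sketch performs a greedy, LAD-style search: it repeatedly adds the conjunction (up to a maximum length) that covers the most still-unmatched considered profiles while matching no rejected profile. This is our simplified reading of the approach with hypothetical data, not the Boros et al. implementation.

```python
# Sketch of a greedy, LAD-style search for DOC patterns for one respondent.
# Profiles are sets of aspects; considered/rejected are lists of profiles.
# A candidate pattern is a frozenset of aspects that must all be present.
# Data structures and parameters are illustrative only.

from itertools import combinations

def candidate_patterns(profiles, max_len):
    """All conjunctions of up to max_len aspects that appear in some considered profile."""
    cands = set()
    for prof in profiles:
        for size in range(1, max_len + 1):
            cands.update(frozenset(c) for c in combinations(sorted(prof), size))
    return cands

def greedy_lad(considered, rejected, max_len=4, max_patterns=2):
    """Greedily pick patterns that cover considered profiles and match no rejected profile."""
    cands = [p for p in candidate_patterns(considered, max_len)
             if not any(p <= r for r in rejected)]
    uncovered = list(considered)
    rule = []
    while uncovered and cands and len(rule) < max_patterns:
        best = max(cands, key=lambda p: sum(p <= prof for prof in uncovered))
        if not any(best <= prof for prof in uncovered):
            break                        # no legal pattern covers anything further
        rule.append(best)
        uncovered = [prof for prof in uncovered if not best <= prof]
    return rule

considered = [{"price_low", "color_display", "garmin"}, {"price_low", "color_display"}]
rejected = [{"price_high", "color_display"}, {"price_low", "bw_display"}]
print(greedy_lad(considered, rejected))   # e.g., [frozenset({'color_display', 'price_low'})]
```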


We next modified LAD to favor cognitive simplicity and account for market commonal-

ities. To favor cognitive simplicity we limit the number of patterns, P, and the length of the pat-

terns, S. We break ties based on market commonalities. We call this method LAD-DOC(P, S).

We begin with P = 2 and S = 4. For the German data in Table 2, the relative hit rate im-

proves to 45.6%, which is significantly better than basic LAD (p = 0.002). The K-L divergence

improved to 34.6%, but was only marginally significantly better (p = 0.07). LAD-DOC(2, 4) was

slightly better than DOCMP, but not significantly so (p = 0.74 and 0.12, respectively). We gain

similar insight from all task formats and both samples.14 When we expand the comparisons to

include LAD-DOC(3, 4) and LAD-DOC(4, 4), we get similar results.

The LAD-DOC results suggest that the performance of cognitively-simple DOC rules is

not unique to DOCMP estimation. LAD can be used to identify DOC rules and can be modified

to account for cognitive complexity and market commonalities.

Decision Trees

Decision trees, as proposed by Currim, Meyer and Le (1988) for modeling consumer

choice, are compatible with DOC rules for classification data (consider vs. not consider). In the

growth phase, decision trees select the aspect that best splits profiles into considered vs. not

considered. Subsequent splits are conditioned on prior splits. For example, we might split first

on “B&W” vs. “color,” then split “B&W” based on screen size and split “color” based on

resolution. With enough levels, decision trees fit estimation data perfectly (similar to Result 4),

hence researchers either prune the tree with a defined criterion (usually a minimum threshold on

increased fit) or grow the tree subject to a stopping criterion on the tree’s growth (e.g., Breiman,

et al. 1984).

Each node in a decision tree is a conjunction, hence the set of all “positive” nodes is a

DOC rule. However, because the logical structure is limited to a tree-structure, a decision tree

often takes more than S levels to represent a DOC(S) model. For example, suppose we generate

errorless data with the DOC(2) rule: (a ∧ b) ∨ (c ∧ d). To represent these data, a decision tree

would require up to 4 levels and produce either (a ∧ b) ∨ (a ∧ ¬b ∧ c ∧ d) ∨ (¬a ∧ c ∧ d) or

14 When we compare all formats and both data sets on both hit rates and K-L divergence, two comparisons were marginally significant (one favoring DOCMP and one favoring LAD-DOC(2, 4)). These differences were not significant when we corrected for the fact that we ran 18 simultaneous t-tests (p > 0.10).


equivalent reflections.15 This DOC(3) rule is logically equivalent to (a ∧ b) ∨ (c ∧ d), but more

complex in both the number of patterns and pattern lengths. To impose cognitive simplicity we

would have to address these representation and equivalence issues.
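The logical equivalence claimed here is easy to verify by brute force. The short Python sketch below enumerates all 16 profiles on the four binary aspects and checks that the tree-induced rule accepts exactly the same profiles as (a ∧ b) ∨ (c ∧ d); the encoding is ours and is purely illustrative.

```python
# Sketch: verify that the tree-induced rule is logically equivalent to the
# generating DOC(2) rule (a and b) or (c and d) on all 2^4 profiles.

from itertools import product

def doc2(a, b, c, d):
    return (a and b) or (c and d)

def tree_rule(a, b, c, d):
    # (a ∧ b) ∨ (a ∧ ¬b ∧ c ∧ d) ∨ (¬a ∧ c ∧ d), as in the text
    return (a and b) or (a and not b and c and d) or (not a and c and d)

assert all(doc2(*p) == tree_rule(*p) for p in product([False, True], repeat=4))
print("equivalent on all 16 profiles")
```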

As a test, we applied the Currim, Meyer and Le (1988) decision tree to the data in Table

2. We achieved a relative hit rate of 38.5% and a K-L divergence of 28.4%, both excellent, but

not as good as those obtained with DOCMP and LAD-DOC estimation.16 While many

unresolved theoretical and practical issues remain in how best to incorporate cognitive simplicity

and market commonalities into decision trees, we have no reason to doubt that once these issues

are resolved, decision trees can be developed to estimate cognitively-simple DOC rules.

Continuous Models

Conjunctions are analogous to interactions in a multilinear model; DOC decision rules

are analogous to a limited set of interactions (Bordley and Kirkwood 2004; Mela and Lehmann

1995). Thus, in principle, we might use continuous estimation to identify DOC decision rules.

For example, Mela and Lehmann (1995) use finite-mixture methods to estimate interactions in a

two-feature model. In addition, continuous models can be extended to estimate “weight”

parameters for the interactions and thresholds on continuous features.

We do not wish to minimize either the practical or theoretical challenges of scaling

continuous models from a few features to many features. For example, without enforcing

cognitive simplicity there are over 130,000 interactions to be estimated for our GPS application.

Cognitive simplicity constrains the number of parameters and, potentially, improves predictive

ability, but would still require over 30,000 interactions to be estimated. Nonetheless, with

sufficient creativity and experimentation researchers might extend either finite-mixture,

Bayesian, simulated-maximum-likelihood, or kernel estimators to find feasible and practical

methods to estimate continuously-specified DOC rules (Evgeniou, Boussios, and Zacharia 2005;

Mela and Lehmann 1995; Rossi and Allenby 2003; Swait and Erdem 2007).
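To see why the parameter count grows so quickly, the sketch below counts the conjunctions of length at most S that can be formed when each pattern uses at most one level per feature. The feature-and-level structure in the example is hypothetical and is meant only to illustrate the order of magnitude, not to reproduce the counts quoted above.

```python
# Sketch: number of conjunctions (patterns) of size <= S over a set of aspects,
# excluding patterns that combine two levels of the same feature.
# The level counts below are made up for illustration.

from itertools import combinations
from math import prod

def num_patterns(levels_per_feature, S):
    """Count conjunctions of at most S aspects with at most one level per feature."""
    total = 0
    for size in range(1, S + 1):
        for chosen in combinations(levels_per_feature, size):
            total += prod(chosen)        # pick one level from each chosen feature
    return total

# Hypothetical structure: 16 binary features (two aspects each).
print(num_patterns([2] * 16, S=4))       # 34,112 patterns under these assumptions
```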

In summary, we posit that the predictive ability in Tables 2, 3, and 4 is due to cogni-

tively-simple DOC rules combined with market commonalities, rather than the particular estima-

tion method used to identify such rules. At least one other estimation method, modified to ac-

15 Depending on the incidence of profiles, the decision tree might also produce (c ∧ d) ∨ (c ∧ ¬d ∧ a ∧ b) ∨ (¬c ∧ a ∧ b), which is also logically equivalent to (a ∧ b) ∨ (c ∧ d). Other logically equivalent patterns are also feasible.

16 LAD-DOC (p = 0.002) and DOCMP (p = 0.01) are significantly better on relative hit rate. LAD-DOC (p = 0.002) is significantly better and DOCMP is better (p = 0.06) on information percentage.


count for cognitive simplicity and market commonalities, LAD-DOC, does well on our data.

Furthermore, statistical-learning algorithms which estimate compensatory and Subset(S) rules do

not predict as well as comparable algorithms which estimate DOC rules. We are optimistic that

the phenomena we explore in this paper are relevant to many estimation methods.

13. Summary and Future Directions

Consumers often make decisions with a two-stage process in which they first form a con-

sideration set and then choose a product from that set. Two-stage processes are managerially

relevant – a product cannot be purchased if it is not considered and a product that is often con-

sidered has a better chance of being purchased. Evidence from a variety of perspectives suggests

that consideration decisions are based on cognitively-simple decision rules, especially when

there are many features or many product alternatives to be evaluated.

In this paper we explore a generalization of existing non-compensatory decision rules:

disjunctions of conjunctions (DOC) and their relationship to cognitive simplicity and market

commonalities. While we illustrate estimation with DOCMP and LAD-DOC, we posit the basic

concepts can be implemented with many extant estimation methods.

We test cognitively-simple DOC models with synthetic and empirical data. The simula-

tion experiments suggest predictive ability is maximized when the estimation method matches

the decision rule used to generate the data. The empirical data suggest that cognitively-simple

DOC-based rules have better predictive ability than the benchmark rules. This result is robust

across sample, respondent task format, profile presentation, and estimation method. While good

predictive ability does not guarantee that consumers actually use a DOC decision rule, the pre-

dictive ability is encouraging and suggests future research with different product categories, dif-

ferent samples, different task formats, and, perhaps, other forms of cognitive simplicity.

The identified DOC rules are simple. Our field experiments suggest that one or two pat-

terns per consumer are sufficient. Further, DOCMP and LAD-DOC perform better when we en-

force cognitive simplicity and market commonalities, which is consistent with the existing experi-

mental literature (e.g., Gigerenzer and Goldstein 1996; Payne, Bettman and Johnson 1988,

1993).

We did not address explicitly the choice stage of a consider-then-choice rule. Such re-

search is promising and complementary. For example, Gaskin et al. (2007) combine greedoid

methods for consideration with adaptive polyhedral conjoint methods to estimate two-stage


choice models. The two-stage models outperform one-stage compensatory methods. There is

also a rich history in marketing of two-stage models in which consideration is a latent, unob-

served construct (e.g., Andrews and Srinivasan 1995; Gensch 1987; Gilbride and Allenby 2004;

Siddarth, Bucklin, and Morrison 1995; Swait and Erdem 2007). We believe that DOC rules

combined with cognitive simplicity could complement these lines of research.

Finally, our empirical test represents a single category, GPSs, and decisions among prod-

uct profiles described by features with finitely many levels. There were a large number of fea-

tures and products in this category were relatively new to our respondents. Both characteristics

are likely to favor cognitively-simple decision rules. We posit that cognitively-simple DOC

rules are relevant in many but not all categories, retail environments, and contexts.


References

Allenby, Greg M., Neeraj Arora, and James L. Ginter (1995), “Incorporating Prior Knowledge into the Analysis of Conjoint Studies,” Journal of Marketing Research, 32, (May), 152-162.

Andrews, Rick L. and T. C. Srinivasan (1995), “Studying Consideration Effects in Empirical Choice Models Using Scanner Panel Data,” Journal of Marketing Research, 32, (February), 30-41.

Arora, Neeraj and Joel Huber (2001), “Improving Parameter Estimates and Model Prediction by Aggregate Customization in Choice Experiments,” Journal of Consumer Research, 28, (September), 273-283.

Anonymous (2008), “Qualitative Evidence of Non-compensatory Processes in the Consideration of New Automobile Models.”

Bettman, James R., Mary Frances Luce, and John W. Payne (1998), “Constructive Consumer Choice Processes,” Journal of Consumer Research, 25(3), 187-217.

Bordley, Robert F. and Craig W. Kirkwood (2004), “Multiattribute Preference Analysis with Performance Targets,” Operations Research, 52, 6, (November-December), 823-835.

Boros, Endre, Peter L. Hammer, Toshihide Ibaraki, and Alexander Kogan (1997), “Logical Analysis of Numerical Data,” Mathematical Programming, 79, (August), 163-190.

------, ------, ------, ------, Eddy Mayoraz, and Ilya Muchnik (2000), “An Implementation of Logical Analysis of Data,” IEEE Transactions on Knowledge and Data Engineering, 12(2), 292-306.

Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone (1984), Classification and Regression Trees, (Belmont, CA: Wadsworth).

Bröder, Arndt (2000), “Assessing the Empirical Validity of the “Take the Best” Heuristic as a Model of Human Probabilistic Inference,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 5, 1332-1346.

Bronnenberg, Bart J., and Wilfried R. Vanhonacker (1996), “Limited Choice Sets, Local Price Response, and Implied Measures of Price Competition,” Journal of Marketing Research, 33 (May), 163-173.

Chaloner, Kathryn and Isabella Verdinelli (1995), “Bayesian Experimental Design: A Review,” Statistical Science, 10, 3, 273-304.

Chase, Valerie M., Ralph Hertwig, and Gerd Gigerenzer (1998), “Visions of Rationality,” Trends in Cognitive Sciences, 2, 6, (June), 206-214.

Cooil, Bruce, Russell S. Winer and David L. Rados (1987), “Cross-Validation for Prediction,” Journal of Marketing Research, 24, (August), 271-279.

Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest and Clifford Stein (2001), Introduction to Algorithms, 2E, (Cambridge, MA: MIT Press).

Cucker, Felipe, and Steve Smale (2002), “On the Mathematical Foundations of Learning,” Bulletin of the American Mathematical Society, 39(1), 1-49.


Currim, Imran S., Robert J. Meyer, and Nhan T. Le (1988), “Disaggregate Tree-Structured Modeling of Consumer Choice Data,” Journal of Marketing Research, 25(August), 253-265.

Dawes, Robyn M. (1979), “The Robust Beauty of Improper Linear Models in Decision Making,” American Psychologist, 34, 571-582.

------ and Bernard Corrigan (1974), “Linear Models in Decision Making,” Psychological Bulletin, 81, 95-106.

DeSarbo, Wayne S., Donald R. Lehmann, Gregory Carpenter, and Indrajit Sinha (1996), “A Stochastic Multidimensional Unfolding Approach for Representing Phased Decision Outcomes,” Psychometrika, 61(3), 485-508.

Ding, Min (2007), “An Incentive-Aligned Mechanism for Conjoint Analysis,” Journal of Marketing Research, 44, (May), 214-223.

------, Rajdeep Grewal, and John Liechty (2005), “Incentive-Aligned Conjoint Analysis,” Journal of Marketing Research, 42, (February), 67–82.

Efron, Bradley and Robert Tibshirani (1997), “Improvements on Cross-Validation: The .632+ Bootstrap Method,” Journal of the American Statistical Association, 92, 438, (June), 548-560.

Evgeniou, Theodoros, Constantinos Boussios, and Giorgos Zacharia (2005), “Generalized Robust Conjoint Estimation,” Marketing Science, 24(3), 415-429.

------, Massimiliano Pontil, and Olivier Toubia (2007), “A Convex Optimization Approach to Modeling Heterogeneity in Conjoint Estimation,” Marketing Science, 26, 6, (November-December), 805-818.

Feige, Uriel (1998), “A threshold of ln n for approximating set cover,” Journal of the Association for Computing Machinery, 45(4), 634 – 652.

Gaskin, Steven, Theodoros Evgeniou, Daniel Bailiff, John Hauser (2007), “Two-Stage Models: Identifying Non-Compensatory Heuristics for the Consideration Set then Adaptive Polyhedral Methods Within the Consideration Set,” Proceedings of the Sawtooth Software Conference in Santa Rosa, CA, October 17-19.

Gensch, Dennis H. (1987), “A Two-stage Disaggregate Attribute Choice Model,” Marketing Science, 6, (Summer), 223-231.

Gigerenzer, Gerd and Daniel G. Goldstein (1996), “Reasoning the Fast and Frugal Way: Models of Bounded Rationality,” Psychological Review, 103(4), 650-669.

------ and Reinhard Selten (2001), "Rethinking rationality", in Gerd Gigerenzer and Reinhard Selten, eds, Bounded Rationality: The Adaptive Toolbox, (Cambridge, MA: MIT Press).

------, Peter M. Todd, and the ABC Research Group (1999), Simple Heuristics That Make Us Smart, (Oxford, UK: Oxford University Press).

Gilbride, Timothy J. and Greg M. Allenby (2004), “A Choice Model with Conjunctive, Disjunctive, and Compensatory Screening Rules,” Marketing Science, 23(3), 391-406.

------ and ------ (2006), “Estimating Heterogeneous EBA and Economic Screening Rule Choice Models,” Marketing Science, 25, 5, (September-October), 494-509.


Hastie, Trevor, Robert Tibshirani, Jerome H. Friedman (2003), The Elements of Statistical Learning, (New York, NY: Springer Series in Statistics).

Hauser, John R. (1978), "Testing the Accuracy, Usefulness and Significance of Probabilistic Models: An Information Theoretic Approach," Operations Research, Vol. 26, No. 3, (May-June), 406-421.

------ and Birger Wernerfelt (1990), “An Evaluation Cost Model of Consideration Sets,” Journal of Consumer Research, 16 (March), 393-408.

Hogarth, Robin M. and Natalia Karelaia (2005), “Simple Models for Multiattribute Choice with Many Alternatives: When It Does and Does Not Pay to Face Trade-offs with Binary Attributes,” Management Science, 51, 12, (December), 1860-1872.

Huber, Joel, and Klaus Zwerina (1996), “The Importance of Utility Balance in Efficient Choice Designs,” Journal of Marketing Research, 33 (August), 307-317.

Hughes, Marie Adele and Dennis E. Garrett (1990), “Intercoder Reliability Estimation Approaches in Marketing: A Generalizability Theory Framework for Quantitative Data,” Journal of Marketing Research, 27, (May), 185-195.

Jedidi, Kamel and Rajeev Kohli (2005), “Probabilistic Subset-Conjunctive Models for Heterogeneous Consumers,” Journal of Marketing Research, 42 (4), 483-494.

------, ------ and Wayne S. DeSarbo (1996), “Consideration Sets in Conjoint Analysis,” Journal of Marketing Research, 33 (August), 364-372.

Johnson, Eric J., Robert J. Meyer, and Sanjoy Ghose (1989), “When Choice Models Fail: Compensatory Models in Negatively Correlated Environments,” Journal of Marketing Research, 26, (August), 255-290.

Kanninen, Barbara J. (2002), “Optimal Design for Multinomial Choice Experiments,” Journal of Marketing Research, 39, (May), 214-227.

Kearns, Michael and Dana Ron (1999), “Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation,” Neural Computation, 11, 1427–1453.

Kohavi, Ron (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. 2, 12, 1137-1143.

Kohli, Rajeev, and Kamel Jedidi (2007), “Representation and Inference of Lexicographic Preference Models and Their Variants,” Marketing Science, 26(3), 380-399.

Kullback, Solomon, and Leibler, Richard A. (1951), “On Information and Sufficiency,” Annals of Mathematical Statistics, 22, 79-86.

Langley, Pat (1996), Elements of Machine Learning, (San Francisco, CA: Morgan Kaufmann).

Lindley, Dennis V. (1956), “On a Measure of the Information Provided by an Experiment,” The Annals of Mathematical Statistics, 27, 4 (December), 986-1005.

Lund, Carsten, and Mihalis Yannakakis (1994), “On the Hardness of Approximating Minimization Problems,” Journal of the Association for Computing Machinery, 41(5), 960-981.

Martignon, Laura and Ulrich Hoffrage (2002), “Fast, Frugal, and Fit: Simple Heuristics for Paired Comparisons,” Theory and Decision, 52, 29-71.


Mehta, Nitin, Surendra Rajiv, and Kannan Srinivasan (2003), “Price Uncertainty and Consumer Search: A Structural Model of Consideration Set Formation,” Marketing Science, 22(1), 58-84.

Mela, Carl F. and Donald R. Lehmann (1995), “Using Fuzzy Set Theoretic Techniques to Identify Preference Rules From Interactions in the Linear Model: An Empirical Study,” Fuzzy Sets and Systems, 71, 165-181.

Montgomery, H. and O. Svenson (1976), “On Decision Rules and Information Processing Strategies for Choices among Multiattribute Alternatives,” Scandinavian Journal of Psychology, 17, 283-291.

Ofek, Elie, and V. Srinivasan (2002), “How Much Does the Market Value an Improvement in a Product Attribute?,” Marketing Science, 21, 4, (Fall), 398-411.

Ordóñez, Lisa D., Lehmann Benson III, and Lee Roy Beach (1999), “Testing the Compatibility Test: How Instructions, Accountability, and Anticipated Regret Affect Prechoice Screening of Options,” Organizational Behavior and Human Decision Processes, 78, 1, (April), 63-80.

Payne, John W. (1976), “Task Complexity and Contingent Processing in Decision Making: An Information Search,” Organizational Behavior and Human Performance, 16, 366-387.

------, James R. Bettman and Eric J. Johnson (1988), “Adaptive Strategy Selection in Decision Making,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 534-552.

------, ------ and ------ (1993), The Adaptive Decision Maker, (Cambridge, UK: Cambridge University Press).

Perreault, William D., Jr. and Laurence E. Leigh (1989), “Reliability of Nominal Data Based on Qualitative Judgments,” Journal of Marketing Research, 26, (May), 135-148.

Rhoads, Bryan, Glen L. Urban, and Fareena Sultan (2004), “Building Customer Trust Through Adaptive Site Design,” MSI Conference, Yale University, New Haven, CT, December 11.

Roberts, John H., and James M. Lattin (1991), “Development and Testing of a Model of Consideration Set Composition,” Journal of Marketing Research, 28 (November), 429-440.

Rossi, Peter E., Greg M. Allenby (2003), “Bayesian Statistics and Marketing,” Marketing Science, 22(3), 304-328.

Shao, Jun (1993), “Linear Model Selection by Cross-Validation,” Journal of the American Statistical Association, 88, 422, (June), 486-494.

Shocker, Allan D., Moshe Ben-Akiva, Bruno Boccara, and Prakash Nedungadi (1991), “Consideration Set Influences on Consumer Decision-Making and Choice: Issues, Models, and Suggestions,” Marketing Letters, 2(3), 181-197.

Shugan, Steven (1980), “The Cost of Thinking,” Journal of Consumer Research, 7(2), 99-111.

Siddarth, S., Randolph E. Bucklin, and Donald G. Morrison (1995), “Making the Cut: Modeling and Analyzing Choice Set Restriction in Scanner Panel Data,” Journal of Marketing Research, 32, (August), 255-266.

Simon, Herbert A. (1955), “A Behavioral Model of Rational Choice,” The Quarterly Journal of Economics, 69(1), 99-118.


Srinivasan, V. (1988), “A Conjunctive-Compensatory Approach to The Self-Explication of Multiattributed Preferences,” Decision Sciences, 295-305.

------ and Chan Su Park (1997), “Surprising Robustness of the Self-Explicated Approach to Customer Preference Structure Measurement,” Journal of Marketing Research, 34, (May), 286-291.

Swait, Joffre and Tülin Erdem (2007), “Brand Effects on Choice and Choice Set Formation Under Uncertainty,” Marketing Science, 26, 5, (September-October), 679-697.

Toubia, Olivier, Theodoros Evgeniou, and John Hauser (2007), “Optimization-Based and Machine-Learning Methods for Conjoint Analysis: Estimation and Question Design,” in Anders Gustafsson, Andreas Herrmann and Frank Huber, Eds, Conjoint Measurement: Methods and Applications, 4E, (New York, NY: Springer).

------ and John R. Hauser (2007), “On Managerially Efficient Experimental Designs,” Marketing Science, 26, 6, (November-December), 851-858.

Tversky, Amos (1972), “Elimination by Aspects: a Theory of Choice,” Psychological Review, 79(4), 281-299.

Urban, Glen L. and John R. Hauser (2004), “’Listening-In’ to Find and Explore New Combinations of Customer Needs,” Journal of Marketing, 68, (April), 72-87.

Vapnik, Vladimir (1998), Statistical Learning Theory, (New York, NY: John Wiley and Sons).

Wu, Jianan and Arvind Rangaswamy (2003), “A Fuzzy Set Model of Search and Consideration with an Application to an Online Market,” Marketing Science, 22(3), 411-434.

Yee, Michael, Ely Dahan, John R. Hauser and James Orlin (2007), “Greedoid-Based Noncompensatory Inference,” Marketing Science, 26, 4, (July-August), 532-549.

Zhang, Tong (2003), “Leave One Out Bounds for Kernel Methods,” Neural Computation, 15, 1397–1437.


Appendix 1: Summary of Notation and Acronyms

$a_{hf\ell}$  binary indicator of whether level $\ell$ of feature $f$ is acceptable to respondent $h$ (disjunctive, conjunctive, or subset conjunctive models; use varies by model)

$\vec{a}_h$  binary vector of acceptabilities for respondent $h$

$b_1, b_2$  parameters of the HB subset conjunctive model, respectively, the probability that a profile is considered if $\vec{x}_j'\vec{a}_h \ge S$ and the probability it is not considered if $\vec{x}_j'\vec{a}_h < S$

$\vec{e}$  a vector of 1's of length equal to the number of potential patterns

$D$  covariance matrix used in estimating HB Compensatory

$f$  indexes features; $F$ is the total number of features

$h$  indexes respondents (mnemonic to households); $H$ is the total number of respondents

$I$  the identity matrix of size equal to the total number of aspects

$j$  indexes profiles; $J$ is the total number of profiles

$\ell$  indexes levels within features; $L$ is the total number of levels

$m_{jp}$  binary indicator of whether profile $j$ matches pattern $p$

$\vec{m}_j$  binary vector describing profile $j$ by the patterns it matches

$M_j$  percent of respondents in the sample (“market”) that consider profile $j$

$p$  indexes patterns; also used for significance level in t-tests when clear in context

$P$  maximum number of patterns [LAD-DOC(P, S) estimation]

$Q$  number of partworths (compensatory model)

$s$  size of a pattern (number of aspects in a conjunction)

$S$  maximum subset size [Subset(S) model] or maximum number of aspects in a conjunctive pattern [DOC(S) model, LAD-DOC(P, S) estimation]

$T_h$  threshold for respondent $h$ in the compensatory model

$w_{hp}$  binary indicator of whether respondent $h$ considers profiles with pattern $p$

$\vec{w}_h$  binary vector indicating the patterns used by respondent $h$

$x_{jf\ell}$  binary indicator of whether profile $j$ has feature $f$ at level $\ell$

$\vec{x}_j$  binary vector describing profile $j$

$y_{hj}$  binary indicator of whether respondent $h$ considers profile $j$

$\vec{y}_h$  binary vector describing respondent $h$’s consideration decisions

$\vec{\beta}_h$  vector of partworths (compensatory model) for respondent $h$

$\varepsilon_{hj}$  extreme value error in the compensatory model

$\gamma_c, \gamma_M$  parameters penalizing, respectively, complexity and deviation from the “market”

$\xi^{+}_{hj}$  non-negative integer that indicates a model predicts consideration if $\xi^{+}_{hj} \ge 1$

$\xi^{-}_{hj}$  non-negative integer that indicates a model predicts non-consideration if $\xi^{-}_{hj} \ge 1$

DOC(S)  set of disjunctions-of-conjunctions models; $S$, when indicated, is the maximum size of the patterns

DOCMP  combinatorial optimization estimation for DOC models (see Equation 3)

LAD-DOC(P, S)  alternative estimation method for DOC models in which we limit both the number of patterns, $P$, and the size of the patterns, $S$

Subset(S)  set of subset conjunctive models with maximum subset size of $S$


Appendix 2: Proofs to the Results in the Text

Result 1. The following sets of rules are equivalent: (a) disjunctive rules, (b) Subset(1) rules, and (c) DOC(1) rules.

Proof. A disjunctive rule requires $\vec{x}_j'\vec{a}_h \ge 1$; a Subset(S) rule requires $\vec{x}_j'\vec{a}_h \ge S$; a DOC(S) rule requires $\vec{m}_j'\vec{w}_h \ge 1$. Clearly the first two rules are equivalent with $S = 1$. For DOC(1) recognize that all patterns are single aspects, hence $\vec{m}_j$ and $\vec{w}_h$ correspond one-to-one with aspects: $\vec{m}_j$ can be recoded to match $\vec{x}_j$ and $\vec{w}_h$ can be recoded to match $\vec{a}_h$.

Result 2. Conjunctive rules are equivalent to Subset(F) rules which, in turn, are a subset of the DOC(F) rules, where F is the number of features.

Proof. A conjunctive rule requires $\vec{x}_j'\vec{a}_h = F$. Setting $S = F$ establishes the first statement. The second statement follows directly from Result 3 with $S = F$.

Result 3. A Subset(S) rule can be written as a DOC(S) rule, but not all DOC(S) rules can be written as a Subset(S) rule.

Proof. $\vec{x}_j'\vec{a}_h \ge S$ holds if any $S$ aspects are acceptable. Therefore $\vec{x}_j$ must match at least one pattern of length $S$. Let $\Sigma_S$ be the set of such patterns; then $\vec{x}_j$ matches at least one element of $\Sigma_S$. Consider the DOC(S) rule defined by $w_{hp} = 1$ for every pattern $p$ in $\Sigma_S$. The inequality $\vec{x}_j'\vec{a}_h \ge S$ holds if and only if $\vec{m}_j'\vec{w}_h \ge 1$, establishing that a Subset(S) rule can be written as a DOC(S) rule. By definition, a DOC(S) rule also includes patterns of size less than $S$; hence, $\vec{x}_j'\vec{a}_h < S$ for some DOC(S) rules. This establishes the second statement.

Result 4. Any set of considered profiles can be fit perfectly with at least one DOC rule. Moreover, the DOC rule need not be unique.

Proof. For each considered profile, create a pattern of size $F$ that matches that profile. This pattern will not match any other profile because $F$ aspects establish a profile uniquely. Create $\vec{w}_h$ such that $w_{hp} = 1$ for all such patterns and $w_{hp} = 0$ otherwise. Then $\vec{m}_j'\vec{w}_h = 1$ if profile $j$ is considered and $\vec{m}_j'\vec{w}_h = 0$ otherwise. The second half of the result is established by the examples in the text, which establish the existence of non-unique DOC rules.
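Result 4's construction is mechanical enough to check directly. The sketch below builds, for a hypothetical respondent, one full-length pattern per considered profile and verifies that the resulting DOC rule reproduces the consideration decisions exactly; the feature names and data are invented for illustration.

```python
# Sketch: the Result 4 construction. Each profile has one level per feature;
# a pattern that pins down every feature matches exactly one profile, so the
# union of such patterns fits any consideration set perfectly.

profiles = [            # hypothetical profiles: one level per feature
    {"brand": "garmin", "display": "color", "price": "low"},
    {"brand": "magellan", "display": "bw", "price": "low"},
    {"brand": "garmin", "display": "bw", "price": "high"},
]
considered = [True, False, True]   # hypothetical consideration decisions

# Build one full-length conjunction per considered profile.
rule = [frozenset(p.items()) for p, c in zip(profiles, considered) if c]

def doc_considers(rule, profile):
    return any(pattern <= frozenset(profile.items()) for pattern in rule)

assert [doc_considers(rule, p) for p in profiles] == considered
print("DOC rule fits the consideration set perfectly")
```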


Appendix 3: HB Estimation of the Subset Conjunctive Model

All posterior distributions are known, hence we use Monte Carlo Markov chains (MCMC) with Gibbs sampling. Recall that $S$ is fixed.

$\Pr(a_{hf\ell} \mid \text{other acceptabilities}, \vec{y}_h, \theta_{f\ell}, b_1, b_2)$. We follow Gilbride and Allenby (2004, p. 404) and use a “Griddy Gibbs” algorithm. For each $h$ we update the acceptabilities, $a_{hf\ell}$, aspect by aspect. For each candidate set of acceptabilities we compute the likelihood as if we kept all other acceptabilities constant, replacing only the candidate $a^c_{hf\ell}$. The likelihood is based on Equation 5 and the prior on the $\theta_{f\ell}$’s. The probability of drawing $a^c_{hf\ell}$ is then proportional to the likelihood times the prior, summed over the set of possible candidates.

$\Pr(\theta_{f\ell} \mid \vec{y}_h, \vec{a}_h, b_1, b_2)$. The $\theta_{f\ell}$’s are drawn successively, hence we require the marginal of the Dirichlet distribution – the beta distribution. Because the beta distribution is conjugate to the binomial likelihood, we draw $\theta_{f\ell}$ from $\text{Beta}[\,6 + \sum_h a_{hf\ell},\; 6 + \sum_h (1 - a_{hf\ell})\,]$.

$\Pr(b_1, b_2 \mid \vec{y}_h, \vec{a}_h, \theta_{f\ell})$. Because the beta distribution is conjugate to the binomial likelihood, we draw $b_1$ from $\text{Beta}[\,1 + \sum_{h,j} y_{hj}\,\delta(\vec{x}_j'\vec{a}_h \ge S),\; 1 + \sum_{h,j} (1 - y_{hj})\,\delta(\vec{x}_j'\vec{a}_h \ge S)\,]$ and we draw $b_2$ from $\text{Beta}[\,1 + \sum_{h,j} y_{hj}\,\delta(\vec{x}_j'\vec{a}_h < S),\; 1 + \sum_{h,j} (1 - y_{hj})\,\delta(\vec{x}_j'\vec{a}_h < S)\,]$, where $\delta(\bullet)$ is the indicator function.
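As a concrete illustration of the conjugate updates described above, the NumPy sketch below draws θfℓ, b1, and b2 from their beta full conditionals given current acceptabilities. The array shapes and prior constants follow our reading of this appendix; the Griddy-Gibbs update of the acceptabilities themselves is omitted, and the sketch is illustrative rather than the authors' code.

```python
# Sketch: conjugate beta draws in the Gibbs sampler for the Subset(S) model.
# a[h, f, l] = 1 if level l of feature f is acceptable to respondent h;
# y[h, j] = 1 if respondent h considered profile j; x[j, f, l] is the design.
# Shapes and prior constants follow our reading of the appendix (illustrative).

import numpy as np

rng = np.random.default_rng(0)

def draw_theta(a):
    """theta_fl ~ Beta(6 + sum_h a_hfl, 6 + sum_h (1 - a_hfl))."""
    return rng.beta(6 + a.sum(axis=0), 6 + (1 - a).sum(axis=0))

def draw_b(a, x, y, S):
    """b1, b2 ~ Beta with counts split by whether x_j'a_h >= S."""
    counts = np.einsum("jfl,hfl->hj", x, a)      # acceptable aspects per (h, j)
    above = counts >= S
    b1 = rng.beta(1 + (y * above).sum(), 1 + ((1 - y) * above).sum())
    b2 = rng.beta(1 + (y * ~above).sum(), 1 + ((1 - y) * ~above).sum())
    return b1, b2

# Tiny synthetic example: 3 respondents, 4 profiles, 2 features x 2 levels.
a = rng.integers(0, 2, size=(3, 2, 2))
x = rng.integers(0, 2, size=(4, 2, 2))
y = rng.integers(0, 2, size=(3, 4))
print(draw_theta(a).shape, draw_b(a, x, y, S=2))
```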

Appendix 4: Generation of Synthetic Data for Simulation Experiments

Compensatory model: We drew partworths from a normal distribution that was zero-

mean except for the intercept. The covariance matrix was I/2. We adjusted the value of the in-

tercept (to 1.5) such that respondents considered, on average, approximately 8 profiles. Profiles

were identified as considered with Bernoulli sampling from logit probabilities.

Subset conjunctive model: We drew each acceptability parameter from a binomial dis-

tribution with the same parameters for all features and levels. We adjusted the binomial prob-

abilities such that respondents considered, on average, approximately 8 profiles. This gave us

0.06, 0.23, 0.43, and 0.69 for S = 1 to 4. We set b1 = 0.95 and b2 = 0.05.

Disjunctions of conjunctions model: We drew binary pattern weights from a Dirichlet

distribution adjusting the marginal binomial probabilities such that respondents considered, on


average, approximately 8 profiles. This gave us 0.025, 0.018, and 0.017 for S = 2 to 4. We simulated consideration decisions such that the probability of considering a profile with a matching pattern was 0.95 and the probability of considering a profile without a matching pattern was 0.05.
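A sketch of this generator under simplifying assumptions: patterns of size at most S are enumerated explicitly, and each pattern is switched on independently with the marginal probability reported above (standing in for the Dirichlet step). The function name and the feature_of_aspect mapping are ours.

```python
import numpy as np
from itertools import combinations

# Marginal probabilities reported above that a pattern receives weight 1, by S.
P_PATTERN = {2: 0.025, 3: 0.018, 4: 0.017}

def simulate_doc(X, feature_of_aspect, S, seed=0):
    """X: (J x A) binary profile-by-aspect matrix (placeholder design).
    feature_of_aspect: length-A array mapping each aspect to its feature, so that
    a conjunction uses at most one aspect per feature.
    A profile is considered with probability 0.95 if it matches at least one
    active pattern and with probability 0.05 otherwise."""
    rng = np.random.default_rng(seed)
    A = X.shape[1]
    patterns = [c for size in range(1, S + 1)
                for c in combinations(range(A), size)
                if len({feature_of_aspect[a] for a in c}) == size]
    active = [p for p in patterns if rng.random() < P_PATTERN[S]]
    match = np.array([any(all(x[a] == 1 for a in p) for p in active) for x in X])
    return rng.binomial(1, np.where(match, 0.95, 0.05))
```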

Appendix 5: Kullback-Leibler Divergence for Consideration Data

To describe this statistic, we introduce additional notation. Let $q_j$ be the null probability that profile j is considered and let $r_j$ be the probability that profile j is considered based on the model and the observations. The K-L divergence for respondent h is $\sum_j \{\, r_j \ln[r_j/q_j] + (1 - r_j)\ln[(1 - r_j)/(1 - q_j)] \,\}$. To use the K-L divergence for discrete predictions we let $z_{hj}$ and $\hat{z}_{hj}$ be the indicator variables for validation consideration, that is, $z_{hj} = 1$ if respondent h considers profile j and $\hat{z}_{hj} = 1$ if respondent h is predicted to consider profile j. They are zero otherwise.

Let $C_e = \sum_j y_{hj}$ be the number of profiles considered in the estimation task. Let $C_v = \sum_j z_{hj}$ and $\hat{C}_v = \sum_j \hat{z}_{hj}$ be the corresponding observed and predicted numbers for the validation task. Let $F_n = \sum_j z_{hj}(1 - \hat{z}_{hj})$ be the number of false negatives (observed as considered but predicted as not considered) and $F_p = \sum_j (1 - z_{hj})\hat{z}_{hj}$ be the number of false positives (observed as not considered but predicted as considered). ($F_n$ and $F_p$ are not to be confused with F, the number of features as used in the text.) Substituting $q_j = C_e/J$ and, for $r_j$, the corresponding observed shares among profiles predicted to be considered ($\hat{z}_{hj} = 1$) and predicted not to be considered ($\hat{z}_{hj} = 0$), we obtain the K-L divergence for a model being evaluated. The second expression expands the summations and simplifies the fractions.

K-L divergence $= \displaystyle\sum_{j:\,\hat{z}_{hj}=1}\left[\frac{\hat{C}_v - F_p}{\hat{C}_v}\,\ln\!\left(\frac{(\hat{C}_v - F_p)/\hat{C}_v}{C_e/J}\right) + \frac{F_p}{\hat{C}_v}\,\ln\!\left(\frac{F_p/\hat{C}_v}{(J - C_e)/J}\right)\right]$

$\qquad\qquad + \displaystyle\sum_{j:\,\hat{z}_{hj}=0}\left[\frac{F_n}{J - \hat{C}_v}\,\ln\!\left(\frac{F_n/(J - \hat{C}_v)}{C_e/J}\right) + \frac{J - \hat{C}_v - F_n}{J - \hat{C}_v}\,\ln\!\left(\frac{(J - \hat{C}_v - F_n)/(J - \hat{C}_v)}{(J - C_e)/J}\right)\right]$

$= (\hat{C}_v - F_p)\,\ln\!\left[\dfrac{J(\hat{C}_v - F_p)}{C_e\,\hat{C}_v}\right] + F_p\,\ln\!\left[\dfrac{J\,F_p}{(J - C_e)\,\hat{C}_v}\right] + F_n\,\ln\!\left[\dfrac{J\,F_n}{C_e\,(J - \hat{C}_v)}\right] + (J - \hat{C}_v - F_n)\,\ln\!\left[\dfrac{J(J - \hat{C}_v - F_n)}{(J - C_e)(J - \hat{C}_v)}\right]$

The perfect-prediction benchmark sets $\hat{z}_{hj} = z_{hj}$, hence $F_n = F_p = 0$ and $\hat{C}_v = C_v$. The relative K-L divergence is the K-L divergence for the model versus the null model, divided by the K-L divergence for perfect prediction versus the null model.
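The divergence and its relative version follow directly from the four counts. A sketch implementing the expanded expression reconstructed above, with the convention that a zero count contributes zero (0 ln 0 = 0); the function names are ours.

```python
import numpy as np

def kl_divergence(z, z_hat, C_e, J):
    """K-L divergence (model vs. null) from validation consideration indicators z,
    discrete predictions z_hat, the number of profiles C_e considered in the
    estimation task, and the total number of validation profiles J."""
    C_v_hat = int(z_hat.sum())
    F_n = int((z * (1 - z_hat)).sum())     # false negatives
    F_p = int(((1 - z) * z_hat).sum())     # false positives

    def term(count, num, den):
        # contributes count * ln(num/den); zero counts contribute zero (0 ln 0 = 0)
        return 0.0 if count == 0 else count * np.log(num / den)

    return (term(C_v_hat - F_p, J * (C_v_hat - F_p), C_e * C_v_hat)
            + term(F_p, J * F_p, (J - C_e) * C_v_hat)
            + term(F_n, J * F_n, C_e * (J - C_v_hat))
            + term(J - C_v_hat - F_n, J * (J - C_v_hat - F_n),
                   (J - C_e) * (J - C_v_hat)))

def relative_kl(z, z_hat, C_e, J):
    """Relative K-L divergence: model vs. null divided by perfect prediction vs. null,
    where perfect prediction sets z_hat = z."""
    return kl_divergence(z, z_hat, C_e, J) / kl_divergence(z, z, C_e, J)
```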


Appendix 6: Statistical Learning for Compensatory and Subset Rules

The following mathematical programs were formulated to be as similar as feasible to DOCMP. Both can be simplified with algebraic substitutions. As in HB Compensatory, we subsume the threshold in the partworths estimated by CompMP. We set K to a number that is large relative to $T_h$. For comparability and to be conservative, we set S = 4 in SubsetMP.

CompMP:

$\min_{\{\vec{\beta}_h,\,\vec{\xi}_h\}} \;\; \sum_{j=1}^{J} y_{hj}\,\xi^{-}_{hj} \;+\; \gamma \sum_{j=1}^{J} (1 - y_{hj})\,\xi^{+}_{hj} \;+\; \gamma_c\, \vec{e}\,'\vec{\beta}_h$

Subject to: $\vec{x}_j'\vec{\beta}_h \;\le\; T_h + K\,\xi^{+}_{hj}$ for all j = 1 to J

$\vec{x}_j'\vec{\beta}_h \;\ge\; T_h\,(1 - \xi^{-}_{hj})$ for all j = 1 to J

$\xi^{+}_{hj} \ge 0$, $\xi^{-}_{hj} \ge 0$, $\vec{\beta}_h \ge 0$
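Because the partworths and slack variables are continuous, CompMP as written above is a linear program. The sketch below sets it up with scipy.optimize.linprog; the variable ordering, default parameter values, and function name are ours and do not reproduce the authors' software.

```python
import numpy as np
from scipy.optimize import linprog

def solve_compmp(X, y, T=1.0, K=100.0, gamma=1.0, gamma_c=0.01):
    """Estimate one respondent's partworths under the CompMP formulation above.

    X: (J x A) binary profile-by-aspect matrix; y: length-J 0/1 consideration vector.
    Decision vector z = [beta (A), xi_plus (J), xi_minus (J)], all >= 0.
    Objective: sum_j y_j xi_minus_j + gamma * sum_j (1 - y_j) xi_plus_j
               + gamma_c * sum(beta).
    Constraints: x_j' beta <= T + K xi_plus_j  and  x_j' beta >= T (1 - xi_minus_j).
    """
    J, A = X.shape
    c = np.concatenate([gamma_c * np.ones(A),      # complexity penalty on beta
                        gamma * (1 - y),           # false-positive slacks
                        y.astype(float)])          # false-negative slacks

    # x_j' beta - K xi_plus_j <= T
    A_upper = np.hstack([X, -K * np.eye(J), np.zeros((J, J))])
    b_upper = np.full(J, T)
    # -x_j' beta - T xi_minus_j <= -T  (i.e., x_j' beta >= T (1 - xi_minus_j))
    A_lower = np.hstack([-X, np.zeros((J, J)), -T * np.eye(J)])
    b_lower = np.full(J, -T)

    res = linprog(c,
                  A_ub=np.vstack([A_upper, A_lower]),
                  b_ub=np.concatenate([b_upper, b_lower]),
                  bounds=[(0, None)] * (A + 2 * J))
    beta = res.x[:A]
    return beta, res
```

A profile is then predicted to be considered when $\vec{x}_j'\vec{\beta}_h$ exceeds the threshold. SubsetMP below is analogous but requires integrality constraints on $\vec{a}_h$ and S.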

SubsetMP:

$\min_{\{\vec{a}_h,\,\vec{\xi}_h,\,S\}} \;\; \sum_{j=1}^{J} y_{hj}\,\xi^{-}_{hj} \;+\; \gamma \sum_{j=1}^{J} (1 - y_{hj})\,\xi^{+}_{hj} \;+\; \gamma_c\, S$

Subject to: $\vec{x}_j'\vec{a}_h \;\le\; S + \xi^{+}_{hj}$ for all j = 1 to J

$\vec{x}_j'\vec{a}_h \;\ge\; S\,(1 - \xi^{-}_{hj})$ for all j = 1 to J

$\xi^{+}_{hj} \ge 0$, $\xi^{-}_{hj} \ge 0$, $\vec{a}_h$ a binary vector, $S > 0$, integer


Appendix 7: Consider-only, Reject-only, No-browsing, Text-only, and Example Feature-Introduction Screenshots

Screenshots are shown in English, except for the text-only format. German versions, and

other screenshots from the surveys, are available from the authors.
