Metaphors of Human Thinking, version 2 1
Usability Inspection by Metaphors of Human Thinking Compared to Heuristic Evaluation
Kasper Hornbæk & Erik Frøkjær*
University of Copenhagen, Denmark
* The authors contributed equally to the paper.
Abstract
A new usability inspection technique based on metaphors of human thinking (MOT) has been
experimentally compared to heuristic evaluation (HE). The aim of MOT is to focus inspection on users’
mental activity and to make inspection easily applicable to different devices and use contexts. Building
upon classical introspective psychology, MOT bases inspection on metaphors of habit formation, stream
of thought, awareness and associations, the relation between utterances and thought, and knowing. An
experiment was conducted in which 87 novices evaluated a large web application, and its key developer
assessed the problems found. Compared to HE, MOT uncovered usability problems that were assessed
as more severe on users and also appeared more complex to repair. The evaluators using HE found more
cosmetic problems. The time spent learning and performing an evaluation with MOT was shorter. A
discussion of strengths and weaknesses of MOT and HE is provided, which shows how MOT can be an
effective alternative or supplement to HE.
Keywords
Human thinking, inspection technique, usability evaluation method, heuristic evaluation,
quantitative study
1. Introduction
The current study presents an experimental comparison of how novices identify usability problems
with the aid of heuristic evaluation (Molich & Nielsen, 1990; Nielsen & Molich, 1990; Nielsen & Mack,
1994) on the one hand and with the aid of a novel technique based on metaphors of human thinking
(Frøkjær & Hornbæk, 2002; Hornbæk & Frøkjær, 2002) on the other.
Inspection techniques aim at uncovering potential usability problems by having evaluators inspect
user interfaces with a set of guidelines or questions (Nielsen & Molich, 1994). The most widely adopted
inspection technique is heuristic evaluation (Rosenbaum, Rohn, & Humberg, 2000; Vredenburg, Mao,
Smith, & Carey, 2002), which uses as the basis for evaluation a list of heuristics such as “be consistent”
or “minimize the users’ memory load” (Molich & Nielsen, 1990, p. 339). Heuristic evaluation has been
found to be a low-cost supplement to empirical techniques for identifying usability problems (Nielsen, 1994).
In addition, designers may use heuristic evaluation and other inspection techniques throughout the
design process (Shneiderman, 1998).
Despite its widespread use, heuristic evaluation has severe limitations. Problems identified with the
aid of heuristic evaluation are often not found in user testing or actual use of the application—they are
so-called false positives. In a recent study, more than half of the problems found were false positives
(Cockton & Woolrych, 2001). Heuristic evaluation also seems to find many problems that pose only
slight inconvenience to the user, so-called cosmetic problems (Nielsen, 1992; Frøkjær & Larusdottir,
1999). Further, the heuristics used for evaluation are to some degree device-dependent and assume a
certain context of use. The heuristics used in Molich & Nielsen (1990), for example, are aimed at
WIMP-style interfaces used on a desktop computer. New heuristics have had to be developed for e-
commerce (Nielsen, Molich, Snyder, & Farrell, 2001) and groupware (Pinelle & Gutwin, 2002); similar
efforts are found in mobile computing (Pascoe, Ryan, & Morse, 2000).
As an attempt to cope with these problems, we have proposed an inspection technique (Frøkjær &
Hornbæk, 2002; Hornbæk & Frøkjær, 2002) based on descriptions of human thinking from classical
introspective psychology (James, 1890; Naur, 1995; Naur, 2001). The technique builds upon metaphoric
descriptions of central aspects of human thinking, e.g. habit, awareness, and associations. In addition to
being independent of a certain context of use, we hypothesize that compared to heuristic evaluation this
technique (1) finds more usability problems that are severe on users; (2) uncovers deeper problems, i.e.
problems that do not mainly concern surface-level and easily correctable problems; and (3) provides
more useful input to the development process, e.g. through better design ideas or through identifying
problems that were previously unknown to designers.
The aim of the present paper is to investigate these hypotheses through an experimental comparison
of the inspection technique based on metaphors of human thinking (MOT) and heuristic evaluation
(HE). As is commonly done in experiments with usability evaluation methods, the techniques are
compared by the number of problems they identify. Several studies have investigated how usability
problems impact systems development (Whiteside, Bennett, & Holtzblatt, 1988; Sawyer, Flanders, &
Wixon, 1996; John & Marks, 1997; Hertzum, 1999). Such analyses may lead to a more realistic and
possibly different appreciation of techniques. Inspired by these studies, it is examined if the techniques
differ in generating design ideas or in the severity of the problems identified as assessed by the key
manager/developer of the application.
More broadly, the paper aims at strengthening HCI research and practice by a contribution towards
developing a device-independent, psychology-based evaluation method.
2. Presentation of the Inspection Techniques
2.1 Heuristic evaluation (HE)
Heuristic evaluation as presented by Nielsen and Molich (Molich & Nielsen, 1990; Nielsen &
Molich, 1990) was based on a set of general usability principles, so-called heuristics, used to inspect
user interfaces. In its most common form (Nielsen, 1993), HE consists of ten heuristics labeled: Simple
and natural dialogue (H1); Speak the users’ language (H2); Minimize the users’ memory load (H3);
Consistency (H4); Feedback (H5); Clearly marked exits (H6); Shortcuts (H7); Good error messages
(H8); Prevent errors (H9); Help and documentation (H10).
The recommended procedure for HE involves a small group of evaluators. First, each evaluator
goes through the interface, examines all the interface elements and judges their compliance with the 10
heuristics. For any specific interface element, the evaluator may further consider additional usability
principles or circumstances that seem to be relevant. After the evaluators’ individual usability
inspections, the group of evaluators discusses and summarizes their results to reach a more
comprehensive report of the usability problems of the interface.
2.2 Evaluation by metaphors of human thinking (MOT)
Compared to HE, the aim of MOT is to focus inspection on users’ mental activity, through
metaphors inspired by classical introspective psychology. Below MOT is summarized by describing the
five supporting metaphors and the underlying understanding of human thinking. For each metaphor, a
few key questions to consider in a usability inspection and an example of its use in design are given.
Metaphors in the HCI literature have been used in describing certain styles of interfaces, e.g. the desktop
metaphor (Johnson et al., 1989), and as a vehicle for representing and developing designs of interfaces
(Erickson, 1990; Madsen, 1994; Neale & Carroll, 1997). The current paper uses the term metaphors
differently, in that the metaphors are not in any way intended as interface metaphors, nor are the
metaphors imagined to form part of designs. Rather, the aim of the metaphors is to support the
evaluator/systems designer in a focused study of how well certain important aspects of human thinking
are taken into account in the user interface under inspection. The metaphors are intended to stimulate
thinking, generate insight, and break fixed conceptions. These uses of metaphors have been thoroughly
studied in the literature on creative thinking (Gardner, 1982; Kogan, 1983) and illustratively applied, for
example, by Sfard (1998) in the educational domain.
2.2.1 Metaphor M1: Habit Formation is Like a Landscape Eroded by Water
Habits shape most of human thought activity and behavior (e.g. as physical habits, automaticity, all
linguistic activity, and habits of reasoning). This metaphor indicates how a person’s formation of
habits leads to more efficient actions and less conscious effort, just as a landscape adapts through erosion
to allow a more efficient and smooth flow of water. Creeks and rivers will, depending on changes in water
flow, find new ways or become arid and sand up, in the same way as a person’s habits will adjust to new
circumstances and, if unpracticed, vanish. Usability inspection with M1 calls for considering: Are
existing habits supported? Can effective new habits, when necessary or appropriate, be developed? Can
the user use common key combinations? Is it possible for the user to predict the layout and
functioning of the interface, a requisite for forming habits?
In design, there is an abundance of examples of user interfaces that violate human habits. One
example is adaptive menus, used for example in Microsoft Office 2000. Adaptive menus change the
layout of the menu according to how often menu items are used, for example by removing or changing
the position of items seldom used. However, adaptive menus make it impossible to form habits in the
selection of menu items, since their position may be different from when they were previously selected.
A study by Somberg (1987) showed the efficiency of constant position placement of menu items
compared to menus that change based on use frequency. Somberg, however, did not explicitly link habit
formation to the usefulness of constant placement of menu items.
2.2.2 Metaphor M2: Thinking as a Stream of Thought
Human thinking is experienced as a stream of thought, e.g. in the continuity of our thinking, and in
the richness and wholeness of a person’s mental objects, of consciousness, of emotions and subjective
life. This metaphor was proposed by William James (1890, vol. I, p. 239) to emphasize how
consciousness does not appear to itself chopped up in bits: ‘Such words as “chain” or “train” do not
describe it fitly. It is nothing jointed; it flows’. Particular issues can be distinguished and retained in a
person’s stream of thought with a sense of sameness, as anchor points, which function as ‘the keel and
backbone of human thinking’ (James, 1890, vol. I, p. 459). Usability inspection with M2 calls for
considering: Is the flow of users’ thought supported in the interface by recognizability, stability and
continuity? Does the application make visible and easily accessible interface elements that relate to the
anchor points of users’ thinking about their tasks? Does the application help users to resume interrupted
tasks?
In design, a simple, yet effective, attempt to recreate part of the richness of the stream of thought
when users return to resume interrupted work, is Raskin's (2000) design of the Canon Cat. When the
Canon Cat is started, the display immediately shows up as it was before work was suspended. Not only
does this allow the user to start thinking about the task at hand while the system is booting; it also
provides help in remembering and recreating the stream of thought as it was when work was interrupted.
2.2.3 Metaphor M3: Awareness as a Jumping Octopus in a Pile of Rags
Here the dynamics of human thinking are considered; i.e., the awareness shaped through a focus of
attention, the fringes of mental objects, association, and reasoning. This metaphor was proposed by
Peter Naur (1995, pp. 214-215) to indicate how the state of thought at any moment has a field of central
awareness, that part of the rag pile in which the body of the octopus is located; but at the same time has
a fringe of vague and shifting connections and feelings, illustrated by the arms of the octopus stretching
out into other parts of the rag pile. The jumping about of the octopus indicates how the state of human
thinking changes from one moment to the next. Usability inspection with M3 calls for considering: Are
users’ associations supported through flexible means of focusing within a stable context? Do users
associate interface elements with the actions and objects they represent? Can words in the interface be
expected to create useful associations for the user? Can the user switch flexibly between different parts
of the interface?
In design, an example of a problematic solution is a use of modal dialog boxes that prevents the user
from switching to potentially relevant information—in Microsoft Word, for example, it is not possible to
switch back to the document to look for a good file name once the 'save as ...' dialog has been opened.
2.2.4 Metaphor M4: Utterances as Splashes over Water
Here the focus is on the incompleteness of utterances in relation to the thinking underlying them
and the ephemeral character of those utterances. This metaphor was proposed by Naur (1995, pp. 214-
215) to emphasize how utterances are incomplete expressions of the complexity of a person’s current
mental object, in the same way as the splashes over the waves tell little about the rolling sea below.
Usability inspection with M4 calls for considering: Are changing and incomplete utterances supported
by the interface? Are alternative ways of expressing the same information available? Are the
interpretations of users’ input in the application made clear? Does the application make a wider
interpretation of users’ input than users intend or are aware of?
For design, one implication of the metaphor of utterances as splashes over the waves is that we must
expect users to describe the same objects and functions incompletely and in a variety of ways. Furnas et
al. (1987) investigated the diversity in words used for describing commands and everyday objects. On
the average, two participants described the same command or object by the same term with less than
20% probability. The most popular name was chosen only in 15-35% of the cases. Furnas et al.'s
suggestion for relieving this problem is called the unlimited alias approach, where terms unknown to the
system may be interactively related to existing commands or object names. This solution is coherent
with the metaphor and uses interactivity to clarify the intentions of the user. However, it would partly go
against the metaphor of habit formation.
2.2.5 Metaphor M5: Knowing as a Building Site in Progress
Human knowing is always under construction and incomplete. This metaphor was also proposed by
Naur (1995, pp. 214-215) and is meant to indicate the mixture of order and inconsistency characterizing
any person’s insight. These insights group themselves in many ways, the groups being mutually
dependent by many degrees, some closely, some slightly. As an incomplete building may be employed
as shelter, so the insights had by a person in any particular field may be useful even if restricted in
scope. Usability inspection with M5 calls for considering: Are users forced by the application to depend
on complete or accurate knowledge? Is it required that users pay special attention to technical or
configuration details before beginning to work? Do more complex tasks build on the knowledge users
have acquired from simpler tasks? Are users supported in remembering and understanding information
in the application?
In design, mental models have been extensively discussed. Consider as an example Norman's
(1983) description of the use of calculators. He argues that the use of calculators is characterized by
users' incomplete understanding of the calculators, by the instability of the understanding, by
superstitions about how calculators work, and by the lack of boundaries in the users' understanding of
one calculator and another. These observations by Norman are perfectly coherent with the ideas
expressed by the metaphor of knowing.
2.2.6 Procedure for Performing a MOT Evaluation
The basic procedure when using the metaphors for evaluating user interfaces is to inspect the interface,
noting when it supports or violates the aspects of human thinking that the metaphors and key questions
aim to capture. This enables the evaluators to identify potential usability problems. The evaluation will
result in a list of usability problems, each described with reference to the application and to metaphors
that were used to uncover the problem. The usability problems may then be given a severity rating, and
suggestions may be made as to how to correct the problem. In Hornbæk & Frøkjær (2002) each of the
metaphors and their implications for user interfaces are described in more detail, and a procedure of how
to do a usability inspection based upon these metaphors is proposed. This procedure is quite similar to
the one described above for HE.
The steps in the procedure are:
1. Familiarize yourself with the application.
2. Find three tasks that users typically would do with the application. These tasks may be thought
up, may be based on observations of users, or may be based on scenarios used in systems
development.
3. Try to do the tasks with the application. Identify major problems found in this way. Use the key
questions and the metaphors to find usability problems.
4. Do the tasks again. This time, take the perspective of each of the metaphors at a time and work
through the tasks. Use the key questions and the metaphors to find usability problems.
5. If more time is left, find some more tasks. See if new problems arise in steps 3 and 4 for those
tasks. Iterate steps 3 and 4 until each new task reveals few new problems or until no time is left.
MOT differs, however, from HE in one crucial aspect: HE aims to provide simple guidelines with
straightforward interpretations, but MOT provides guidelines that are complex and require evaluators’
active interpretation. While the techniques thus appear very different, no studies have empirically
evaluated their differences, nor do data exist as to whether MOT can be easily understood and applied.
These observations provide the rationale for the experiment described next.
3. Method
In the experiment, participants using either MOT or HE inspected a web application; the problems
found were consolidated to a common list; and the key manager/developer of the application assessed
the problems.
3.1 Objectives
The experiment has as its objectives to compare MOT and HE by (1) the number of problems they
identify, (2) how evaluators assess the techniques, and (3) how the key developer of the application
assesses the usability problems found.
3.2 Application
The web application inspected is a portal for students at the University of Copenhagen to course
administration, e-mail, information on grades, university news, etc., see http://punkt.ku.dk. The
application builds upon and exchanges data with five existing, administrative systems. The version of
the application inspected took approximately five person-months to develop.
3.3 Participants
As a compulsory part of a first-year university course in multimedia technology, 87 computer
science students used either HE or MOT to evaluate the web application. Participation was anonymous,
and the students were free to choose whether their data could be included in the analysis.
3.4 Procedure for participants’ inspection
Forty-four participants received as description of MOT a pseudonymized version of Hornbæk &
Frøkjær (2002); forty-three participants received as description of HE pages 19-20 and 115-163 from
Nielsen (1993). Each participant individually performed the evaluation supported by scenarios made
available by the developers of the web application. The participants were instructed to write for each
usability problem identified (a) a brief title, (b) a detailed description, (c) an identification of the
metaphors or heuristics that helped uncover the problem, and (d) a seriousness rating.
Participants chose seriousness ratings from a commonly used scale (Molich, 1994, p. 111): Rate 1 is
given to a critical problem that gives rise to frequent catastrophes which should be corrected before the
system is put into use. This grade is for those few problems that are so serious that the user is better
served by a delay in the delivery of the system; Rate 2 is given to a serious problem that occasionally
gives rise to catastrophes which should be corrected in the next version; and Rate 3 is given to a
cosmetic problem that should be corrected sometime when an opportunity arises.
3.5 Consolidation of problems
In order to find problems that were similar to each other, a consolidation of the problems was
undertaken. In this consolidation, the two authors grouped together problems perceived as alike. The
consolidation was done over a five-day period, with at least two passes over each problem. The
consolidation was done blind to what technique had produced the problems.
This consolidation resulted in a list of 341 consolidated problems. Each consolidated problem
consisted of one or more problems. Figure 1 gives a preview of the relation between problems and
consolidated problems; the Results section will treat this relation in detail.
In order to test the reliability of the consolidation, an independent rater tried consolidating a random
subset of the problems. The rater received 53 problems together with the list of the consolidated
problems from which these 53 problems had been deleted. For each problem, the rater either grouped
together that problem with a consolidated problem, or noted that the problem was not similar to any of
the consolidated problems. Using Cohen’s kappa, the interrater reliability between ratings was κ=.77,
suggesting an excellent agreement beyond chance (Fleiss, 1981).
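The agreement statistic used here can be illustrated with a minimal sketch. It assumes that each rater's decision for a problem is recorded as a label: either the identifier of a consolidated problem or a marker such as "none" for no match. The function name `cohens_kappa` and the example labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items.

    ratings_a and ratings_b are equal-length sequences of labels,
    one label per item (hypothetical representation for illustration).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: four problems, one disagreement.
authors = ["P1", "P1", "P2", "none"]
rater = ["P1", "P2", "P2", "none"]
kappa = cohens_kappa(authors, rater)
```

With these made-up labels, the raters agree on three of four items and chance agreement is 5/16, so kappa is 7/11 (about .64).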
3.6 The client’s assessment of problems
The current study tries to make a realistic assessment of usability problems—similar to what goes on
when managers and developers read and prioritize problems in usability reports. The person
representing the client in this study is the person who decides what to do about the problems. This
approach to assessment is close to what has been called the impact ratio of inspections (Sawyer et al.,
1996), which has a similar, pragmatic view of how to assess usability problems. A related approach has
been suggested by Hartson, Andre, & Williges (2001):
“Perhaps an alternative way to evaluate post-usability testing utility of UEM [Usability evaluation
methods] outputs is by asking real-world usability practitioners to rate their perceptions of usefulness of
problem reports in meeting their analysis and redesign needs within the development cycle for
interaction design in their own real development environment”.
Furthermore, the realistic assessment of usability problems comes closer to meeting the challenge
put forward by Dennis Wixon (2003) of going beyond using counts on problem lists to compare
techniques.
Developers have a vested interest in minimizing redesign in order to meet time and cost-constraints
and thus may be inherently biased in their assessment of usability problems. However, in practice these
are the circumstances that determine which problems are addressed and how. Thus, a crucial part of the
experiment was having the consolidated problems assessed by persons who were developing the web
application, here called the client. In this experiment, the person who managed the development of the
web application and was responsible for developing the design represented the client. The usability
problems found have been used in the client’s development of a new version of the application.
For each consolidated problem the client was asked to assess aspects considered likely to influence
how to prioritize and revise the design. The aspects considered were:
• Severity of the problem. The severity of the problem related to users’ ability to do their tasks was
judged as 1 (very critical problem), 2 (serious problem), 3 (cosmetic problem), or % (not a problem).
Note that this grading is different from the students’ seriousness ratings in that only the nature of the
problem is being assessed, not when the problem should be corrected which is contingent upon
resources within the development organization.
• Design ideas generated by the problem. The client stated (yes or no) whether the development team,
from the problem descriptions, got any good ideas for how to resolve the problem.
• The novelty of the problem. The client stated (yes or no) whether the problem brought up a new or
surprising difficulty with the web application.
• The perceived complexity of solving the problem. The client also assessed how complex it would be
to reach a clear and coherent proposal for how to change the web application so as to remove the
problem. The client used a four-rate scale to judge complexity: (1) very complex problem: will take
several weeks to make a new design, possibly involving outside expert assistance; (2) complex
problem: a suggestion may be arrived at by experts in the development group in a few weeks; (3)
middle complexity: new design can be devised in a few days; (4) simple: while the actual
implementation may take long, a solution to the problem can be found in a few hours.
The assessment was done from a list, which for each consolidated problem showed all the problems
that it was consolidated from. The client performed the rating blind to what technique had produced the
problems, and he was not familiar with what techniques were studied.
4. Results
Table 1 summarizes the differences in problems between techniques; Table 2 shows the overlap
between problems as determined by the consolidation of problems. An overall multivariate analysis of
variance on these data shows a significant difference between techniques, Wilks’ lambda=.715, F(8, 78)
= 3.88, p < .001. With this test protecting against experiment-wise error, the data from the two tables
are analyzed below with individual analyses of variance. Note that ratings and other ordinal data are analyzed
with parametric tests, justified by the large number of observations—all results have, however, been
corroborated with non-parametric tests.
4.1 Number of problems and participants’ seriousness rating
There was no significant difference between the number of problems participants identified with the
two techniques, F(1, 85) = 1.76, p > .1. Between participants, large differences exist in the number of
problems uncovered; for example, one participant found only 2 problems, while another found 28.
Participants’ ratings of the seriousness of the problems found differed only marginally between
techniques, F(1, 85) = 2.98, p =.09. Problems found by participants using MOT (Mean, M = 2.14;
standard deviation, SD = 1.31) were reported marginally more serious than were problems found by HE
(M = 2.28; SD = 1.05).
4.2 Client’s assessment
Analyzing the client’s assessment of the severity of problems, a significant difference between
techniques was found, F(1, 85) = 15.51, p < .001. The client assessed problems identified with MOT as
more severe (M = 2.21; SD = 0.73) than problems found by HE (M = 2.42; SD = 0.87). As can be seen
from Table 1, 49% of the problems identified with HE were assessed as cosmetic problems by the client;
only 33% of the problems found with MOT were assessed as cosmetic. The number of problems that the
client did not perceive as usability problems was surprisingly small, between 1% and 3%, compared to
what other studies report. As mentioned earlier, Cockton & Woolrych (2001) found more than half of
the problems identified with HE to be false positives. In Frøkjær & Larusdottir (1999), HE stood out
by identifying 23% false problems compared to cognitive walkthrough (0%) and thinking aloud (1%).
The client’s assessment of the number of design ideas to be gained from the problem descriptions
was not significantly different between techniques, F(1, 85) = 0.35, p > .5.
Concerning the number of novel problems, HE identified significantly more than MOT, F(1, 85) =
14.59, p < .001. On the average, participants using HE found 3.8 (SD = 2.8) problems that the client
considered novel; using MOT participants on the average found 2.0 (SD = 1.5) such problems. In
interpreting this difference, it seems that novel problems may be of two kinds, either severe problems
that the client had previously overlooked, or cosmetic and somewhat esoteric problems found by only
one participant. For both techniques, novel problems were mostly of the second kind, since novel
problems on the average were less severe (M = 2.31; SD = 0.75), less complex (M = 3.48; SD = 0.71),
and 41% were only found by one participant.
The complexity of the problems identified was significantly different between techniques, F(1, 85)
= 12.94, p < .001. The client assessed problems found with MOT as more complex to solve (M = 3.00,
SD = 0.80) compared to those found by HE (M = 3.21, SD = 0.96). As shown in Table 1, approximately
20% more problems considered ‘complex’ were found with MOT compared to HE; around 60% more
problems considered ‘simple’ were found with HE compared to MOT.
Note that severity and complexity are correlated, Spearman’s rho = .40, p < .001, suggesting that in
this study severity and complexity cannot be regarded as being independent.
The data from the client’s assessment are supported by the marginally significant difference in
participants’ ratings of problem seriousness, where problems found by MOT were rated as more serious.
4.3 Overlap between evaluators and techniques
One use of the consolidation of problems is to describe the overlap between participants using the
same technique and the overlap between techniques, see Table 2 and Figure 1.
Between techniques, a significant difference was found in the number of problems identified by
only one participant, F(1, 85) = 6.58, p < .05. On the average, participants using HE found 78% more
one-participant problems compared to participants using MOT. Incidentally, the one-participant
problems found by MOT and those found by HE have a comparably low average severity of 2.72 (MOT:
SD = 0.80, HE: SD = 0.62). Related to the number of one-participant problems, participants
using MOT found problems that were more generally agreed upon among the participants as usability
problems. Using a measure from research looking at the evaluator effect (Hertzum & Jacobsen, 2001),
the average overlap in problems found by two evaluators using MOT—the so-called any-two agreement
measure—was 9.2%, while the any-two agreement for HE was 7.2%.
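As a rough illustration of the any-two agreement measure, the sketch below averages, over all pairs of evaluators, the number of problems both found divided by the number of problems either found. It assumes each evaluator's findings are available as a set of consolidated-problem identifiers. The function name `any_two_agreement` and the example sets are hypothetical and are not the study's data.

```python
from itertools import combinations


def any_two_agreement(problem_sets):
    """Average of |Pi & Pj| / |Pi | Pj| over all pairs of evaluators.

    problem_sets is a list of sets, one per evaluator, holding the
    identifiers of the consolidated problems that evaluator reported.
    """
    pairs = list(combinations(problem_sets, 2))
    if not pairs:
        raise ValueError("at least two evaluators are required")
    total = 0.0
    for found_i, found_j in pairs:
        union = found_i | found_j
        # A pair in which neither evaluator found anything counts as 0.
        total += len(found_i & found_j) / len(union) if union else 0.0
    return total / len(pairs)


# Hypothetical example with three evaluators.
evaluators = [{1, 2, 3}, {2, 3, 4}, {5}]
agreement = any_two_agreement(evaluators)
```

In this made-up example, only the first pair shares problems (2 of 4), so the agreement is 0.5 averaged over three pairs, roughly 17%.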
HE found 74% of the problems found by MOT; MOT found 61% of the problems found by HE. The
large number of one-participant problems found by HE resulted in the total number of different
problems found being larger for HE (249), compared to MOT (181). Thus in some sense, HE resulted in
a broader class of problems.
4.4 Use of individual metaphors and heuristics
A central question with both MOT and HE is the utility of individual metaphors and heuristics in
helping evaluators predict usability problems. Table 3 shows some initial data that might help explore
this question in the context of novice evaluators and the particular web application evaluated.
For the heuristics, large differences exist in the number of problems identified. Heuristic H1
(“Simple and natural dialogue”) helped participants find 205 problems, 62 of which were single-
participant problems; H7 (“Shortcuts”) found only 28 problems. In this study, H8 (“Good error
messages”) stands out as particularly good at finding severe problems. Previously, Nielsen (1992)
showed that evaluators have difficulty in applying slightly different versions of heuristics H6, H8, and
H9. In our study too, H6 (“Clearly marked exits”) seems difficult to use, in that it identified problems
with a low severity rating. H1 contributed the highest absolute number of one-participant problems
of any heuristic or metaphor. This high contribution perhaps suggests that H1 is understood and
applied very differently among participants, though in part this result may be an artifact
of the ordering of the heuristics.
The metaphors show less variation in the number of problems found than the heuristics do. M5
(“Knowing as a site of buildings”) seems to help participants the most in finding severe usability
problems. M4 (“Utterances as splashes over water”) found fewer problems than the other metaphors and
the problems found were also slightly less severe.
Comparing heuristics and metaphors, it is noteworthy that problems found with metaphors were
most frequently found with more than one metaphor (see Table 3). Problems found with HE frequently
overlapped with problems found with the metaphors.
4.5 Learning and usage experiences
Participants reported spending on average 4.0 hours (SD = 2.3) on reading about and performing the
inspection with MOT, and 5.8 hours (SD = 3.8) with HE. This difference is significant (Mann-Whitney
U = 546.5, z = -2.88, p < .01). However, the descriptions of the two inspection techniques differed
considerably in length: the MOT description contained 10,500 words and the HE description
approximately 17,000 words. As participants were not asked to distinguish between time spent reading
and time spent performing the evaluation, we can only note that participants reported spending less
time using MOT, but cannot say why.
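For readers unfamiliar with the test, the sketch below shows how a Mann-Whitney U statistic and its normal approximation can be computed from two independent samples. The hour values are hypothetical and do not reproduce the reported U = 546.5 or z = -2.88. The sketch uses average ranks for ties and omits tie and continuity corrections; whether such corrections were applied in the analysis above is not stated, so this choice is an assumption.

```python
import math


def mann_whitney_u(x, y):
    """Return (U, z, p) for two independent samples.

    Ties receive average ranks; the normal approximation is used
    without tie or continuity correction (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    values = sorted(list(x) + list(y))

    # Map each distinct value to the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_sum_x = sum(rank_of[v] for v in x)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)

    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, z, p


# Hypothetical self-reported hours (not the study's data).
mot_hours = [2.0, 3.0, 3.5, 4.0, 5.0]
he_hours = [4.5, 5.5, 6.0, 7.0, 9.0]

u, z, p = mann_whitney_u(mot_hours, he_hours)
print(f"U = {u}, z = {z:.2f}, p = {p:.3f}")
```

A rank-based test of this kind is a reasonable choice for self-reported times, which are typically skewed and have unequal spread between groups, as the standard deviations of 2.3 and 3.8 hours suggest.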
The participants wrote in groups of 3-4 persons about their experiences with learning and using the
inspection technique that they had tried. The main impression from these reports is that MOT was more
difficult to learn. The ideas about human thinking and the metaphors seem to have been quite
demanding to understand, but the examples and key questions to consider during an inspection (see
Hornbæk & Frøkjær, 2002) were reported to be helpful. The description of HE was generally acknowledged
to be well presented. Note that neither technique was introduced orally, to ensure that the
participants remained uninformed about our specific interest in MOT. The participants were familiar with
Shneiderman’s eight golden rules of interface design (1998, pp. 74-75) from lectures and the course
textbook. This familiarity may have served partly as an introduction to HE.
For both HE and MOT, many groups reported difficulty in choosing which heuristics or
metaphors to associate with a specific usability problem. Some users of MOT appear to have had
difficulty because some of the metaphors are closely related. This difficulty was mentioned less often for
the heuristics. For HE it was not always easy to find a relevant heuristic, for instance if functionality was
missing. HE participants often mentioned that they felt that some of the problems identified were mainly
a matter of taste. Some groups acknowledged how the metaphors supported an understanding not only
of usability problems in an interface, but also of possible elegant design solutions.
5. Discussion
We now return to the hypotheses outlined in the introduction. Concerning our first hypothesis, the
experiment showed that MOT, compared to HE, found problems that were assessed by the client as
more severe on users. In addition, participants using MOT found fewer one-participant problems, which
across techniques were assessed by the client as less severe. Concerning the second hypothesis,
problems found with MOT are assessed on the average as more complex to repair, suggesting that they
go deeper than problems found with HE. Concerning our third hypothesis, we did not find any
difference between techniques in the number of design ideas the client got from the descriptions of
problems. However, HE results in more problems that are assessed as novel and surprising. In this way,
HE identifies a broader class of problems, although these problems are mostly one-participant problems
assessed by the client to be cosmetic. Concerning learning, participants seem equally
able to pick up the two techniques; they report needing less time to learn and perform an evaluation
with MOT, but find MOT more difficult to learn.
Overall, the experiment shows inspection by metaphors of human thinking to be a promising
alternative and supplement to HE. We find this initial result encouraging, since HE has been refined for
many years and has consistently performed well when compared to other inspection techniques. The
results need to be validated in further experiments to address some of the limitations of and questions
raised by the experiment, two of which are discussed below.
First, in this experiment only novices’ use of MOT was studied. Previous studies of inspection
techniques, e.g. Nielsen (1992), show that more experienced evaluators find more problems and that the
kinds of problems found are also likely to be different. It is not known whether HCI practitioners in industry, for
example, will react to and use MOT differently. However, novices are an important audience for
effective inspection techniques. Each year, thousands of computing students need to receive a first, and
perhaps their only, introduction to evaluation of user interfaces.
Second, in this experiment only one application was evaluated. While this application shares some
characteristics with many other web applications, one cannot conclude from this study that MOT leads
to better inspections for all applications. The experiment does not shed light on whether MOT is more
applicable to new devices or use contexts. An experiment addressing this question would necessarily
place HE at a disadvantage, because at least some of the heuristics are closely associated with particular
use-contexts and interaction styles.
An important issue is how to improve MOT and HE based on the results from the experiment. For
the metaphors, more examples are needed of how non-WIMP interfaces violate or respect aspects of
thinking as captured by the metaphors. For the heuristics, the usage of individual
heuristics suggests that, for example, H7 (“Shortcuts”) may need a broader formulation when web
applications are inspected.
The techniques could be combined so as to uncover a broader range of problems, because there is
limited overlap between the problems found with MOT and HE (see Figure 1). Perhaps other usability
evaluation methods might be more complementary in finding different problems, e.g. heuristic
evaluation and think aloud, which Frøkjær & Larusdottir (1999) found to be a very useful combination.
In future work, we need to explore in more detail the qualitative differences in the problems found. Do
they concern different kinds of usability problems? And if so, is there utility in combining the methods?
Alternatives to the way the severity of problems was evaluated—e.g. using think aloud tests as a
gold standard—have their own limitations. For think aloud tests, it is difficult to extract problems from
test data in a structured way (Cockton & Lavery, 1999). However, the pragmatic assessment of usability
problems herein may be extended still further. For example, one could study throughout all or most
of a development project how problems found by different techniques are used, corrected,
discussed, or put aside. A possible unexplored strength of MOT is the utility of the technique in
design, for example with experienced designers as the evaluators participating in real design and
development processes; this would explore more fully the third hypothesis mentioned in the
Introduction. The current experiment was not designed to make an extensive assessment of this
hypothesis.
Finally, it would be interesting to investigate in more detail how the metaphors are applied during
evaluation, for example through a study of evaluators thinking aloud.
6. Conclusion
A new usability inspection technique based on metaphors of human thinking (MOT) has been
experimentally compared to heuristic evaluation (HE). MOT focuses inspection on users’ mental
activity through five metaphors of essential aspects of human thinking. The experiment showed that
MOT, compared to HE, uncovered more of the usability problems that were assessed by the key
developer and manager to be severe on users and complex to repair. In addition, the evaluators using
MOT showed stronger agreement, finding the same problems more often; they also used less time
to perform their evaluation, but found the technique more difficult to learn.
It is remarkable how MOT in this first experimental study has given good results compared to HE,
the usability inspection technique most widely used in industry. HE usually performs very well in
comparison with other inspection techniques, e.g. cognitive walkthrough and GOMS-based techniques.
It must be emphasized that these results have to be challenged by further studies. What happens when
MOT is used for evaluating interfaces in non-traditional use contexts, when the evaluators are more
proficient, or when MOT is used in design work?
Acknowledgements
We wish to thank Peter Naur for permission to cite his writings and for invaluable support in our
efforts to develop the metaphors as an inspection technique. Claus Damgaard, the key developer and
manager of the web application, made a huge effort in thoroughly assessing the usability problems,
without which this study would have been impossible.
References
Cockton, G. & Woolrych, A. (2001). Understanding Inspection Methods: Lessons from an Assessment
of Heuristic Evaluation. In A. Blandford, J. Vanderdonckt, & P. D. Gray (Eds.), People and
Computers XV: Joint Proceedings of HCI 2001 and IHM 2001 (pp. 171-192). Berlin: Springer-
Verlag
Cockton, G. & Lavery, D. (1999). A Framework for Usability Problem Extraction. In M. A. Sasse & C.
Johnson (Eds.), Seventh IFIP Conference on Human-Computer Interaction Incorporating
HCI'99 (pp. 344-352). IOS Press
Erickson, T. D. (1990). Working with Interface Metaphors. In B. Laurel (Ed.), The Art of Human
Computer Interface Design (pp. 65-73). Reading, MA: Addison-Wesley.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New York NY: John Wiley & Sons.
Frøkjær, E. & Hornbæk, K. (2002). Metaphors of Human Thinking in HCI: Habit, Stream of Thought,
Awareness, Utterance, and Knowing. In Proceedings of HF2002/OzCHI 2002
Frøkjær, E. & Larusdottir, M. (1999). Predicting of Usability: Comparing Method Combinations. In
10th International Conference of the Information Resources Management Association.
Furnas, G. W., Landauer, T. K., Gomez, L. M., & Dumais, S. T. (1987). The Vocabulary Problem in
Human-System Communication. Communications of the ACM, 30 (11), 964-971
Gardner, H. (1982). Art, Mind and Brain: A cognitive approach to creativity. New York: Basic Books.
Hartson, H. R., Andre, T. S., & Williges, R. C. (2001). Criteria for evaluating usability evaluation
methods. International Journal of Human-Computer Interaction, 13(4), 373-410.
Hertzum, M. (1999). User testing in industry: A case study of laboratory, workshop, and field tests. In
A. Kobsa & C. Stephanidis (Eds.), User interfaces for all. Proceedings. 5. ERCIM workshop (pp.
59-72).
Hertzum, M. & Jacobsen, N. E. (2001). The evaluator effect: A chilling fact about usability evaluation
methods. International Journal of Human-Computer Interaction, 13, 421-443.
Hornbæk, K. & Frøkjær, E. (2002). Evaluating User Interfaces with Metaphors of Human Thinking. In
N. Carbonell & C. Stephanidis (Eds.), Proceedings of 7th ERCIM Workshop "User Interfaces for
All". Also in N. Carbonell & C. Stephanidis (2003), "Universal access", Lecture Notes in
Computer Science 2615 (pp. 486-507). Berlin: Springer-Verlag.
James, W. (1890). Principles of Psychology. Henry Holt & Co.
John, B. E. & Marks, S. J. (1997). Tracking the effectiveness of usability evaluation methods. Behaviour
and Information Technology, 16(4/5), 188-202.
Johnson, J., Roberts, T., Verplank, W., Smith, D., Irby, C., Bear, M. et al. (1989). The Xerox Star: A
Retrospective. IEEE Computer, 22(9), 11-29.
Kogan, N. (1983). Stylistic variation in childhood and adolescence: Creativity, metaphor, and cognitive
styles. In J. H. Flavell & E. M. Markman (Eds.), Handbook of Child Psychology: Vol. 3.
Cognitive Development (pp. 630-705). New York, NY: Wiley.
Madsen, K. H. (1994). A Guide to Metaphorical Design. Communications of the ACM, 37(12), 57-62.
Molich, R. (1994). Brugervenlige edb-systemer (in Danish). Teknisk Forlag.
Molich, R. & Nielsen, J. (1990). Improving a Human-Computer Dialogue. Communications of the ACM,
33(3), 338-348.
Naur, P. (1995). Knowing and the Mystique of Logic and Rules. Dordrecht, the Netherlands: Kluwer
Academic Publishers.
Naur, P. (2001). Anti-philosophical Dictionary. naur.com Publishing.
Neale, D. C. & Carroll, J. M. (1997). The Role of Metaphors in User Interface Design. In
M. G. Helander, T. K. Landauer, & P. V. Prabhu (Eds.), Handbook of Human-computer
Interaction (2nd ed., pp. 441-462). Amsterdam: Elsevier.
Nielsen, J. (1992). Finding Usability Problems Through Heuristic Evaluation Usability Walkthroughs.
In P. Baursfield, J. Bennet, & G. Lynch (Eds.), ACM CHI'92 Conference on Human Factors in
Computing Systems (pp. 373-380). New York, NY: ACM Press
Nielsen, J. (1993). Usability Engineering. San Diego CA: Academic Press.
Nielsen, J. (1994). Guerrilla HCI: Using Discount Usability Engineering to Penetrate the Intimidation
Barrier. In R. G. Bias & D. J. Mayhew (Eds.), Cost-Justifying Usability (pp. 245-272). Boston,
MA: Academic Press.
Nielsen, J. & Mack, R. L. (1994). Usability Inspection Methods. Wiley and Sons Inc.
Nielsen, J. & Molich, R. (1990). Heuristic evaluation of user interfaces. In (pp. 249-256).
Nielsen, J., Molich, R., Snyder, C., & Farrell, S. (2001). E-commerce User Experience. Fremont CA:
Nielsen Norman Group.
Norman, D. (1983). Observations on Mental Models. In D. Gentner & A. L. Stevens (Eds.), Mental
Models (pp. 7-14). Hillsdale, NJ: Erlbaum.
Pascoe, J., Ryan, N., & Morse, D. (2000). Using while moving: HCI issues in fieldwork environments.
ACM Transactions on Computer-Human Interaction, 7(3), 417-437.
Pinelle, D. & Gutwin, C. (2002). Groupware walkthrough: adding context to groupware usability
evaluation. In D. Wixon & L. Terveen (Eds.), ACM Conference on Human Factors in
Computing (pp. 455-462). New York, NY: ACM Press
Raskin, J. (2000). The Humane Interface: New Directions for Designing Interactive Systems. Addison-
Wesley, Reading MA.
Rosenbaum, S., Rohn, J., & Humberg, J. (2000). A toolkit for strategic usability: results from
workshops, panels, and surveys. In Proceedings of the CHI 2000 conference on Human factors
in computing systems (pp. 337-344). New York, NY: ACM Press
Sawyer, P., Flanders, A., & Wixon, D. (1996). Making a difference - The impact of inspections. In M. J.
Tauber (Ed.), ACM Conference on Human Factors in Computing (pp. 376-382). New York, NY:
ACM Press
Sfard, A. (1998). On two metaphors for learning and on the dangers of choosing just one. Educational
Researcher, 27(2), 4-13.
Shneiderman, B. (1998). Designing the User Interface. (3rd ed.) Reading, MA: Addison-Wesley.
Somberg, B. L. (1987). A Comparison of Rule-Based and Positionally Constant Arrangements of
Computer Menu Items. In Proceedings of CHI+GI'87, 255-260
Vredenburg, K., Mao, J.-Y., Smith, P. W., & Carey, T. (2002). A Survey of User-Centered Design
Practice. In L. Terveen, D. Wixon, E. Comstock, & A. Sasse (Eds.), Proceedings of ACM
Conference on Human Factors in Computing Systems (pp. 472-478). New York NY: ACM Press
Whiteside, J., Bennett, J., & Holtzblatt, K. (1988). Usability engineering: Our experience and evolution.
In M. Helander (Ed.), Handbook of Human-Computer Interaction (pp. 791-817). Amsterdam:
Elsevier.
Wixon, D. (2003). Evaluating usability methods: Why the Current Literature fails the Practitioner.
interactions, 10, 29-34.
Table 1.
The client’s assessment of usability problems found by participants using either heuristic evaluation
(HE) or evaluation by metaphors of human thinking (MOT).
                        HE (43 participants)     MOT (44 participants)
                        Mean (SD)      %         Mean (SD)      %
Number of problems      11.3 (6.2)     -         9.6 (5.7)      -
Severity (avg.)***      2.4 (0.9)      -         2.2 (0.7)      -
  Very critical (1)     0.8 (1.1)      7%        1.2 (1.1)      12%
  Serious (2)           4.8 (3.0)      42%       5.0 (3.6)      52%
  Cosmetic (3)          5.6 (4.2)      49%       3.2 (2.8)      33%
  Not a problem (%)     0.1 (0.4)      1%        0.3 (0.5)      3%
Complexity (avg.)***    3.2 (1.0)      -         3.00 (0.8)     -
  Very complex (1)      0.1 (0.3)      1%        0.02 (0.2)     0%
  Complex (2)           2.7 (1.9)      24%       3.3 (2.5)      34%
  Middle comp. (3)      2.8 (2.0)      24%       2.3 (1.9)      23%
  Simple (4)            5.2 (3.7)      46%       3.3 (2.6)      35%
  Not graded (%)        0.5 (0.8)      5%        0.7 (0.9)      7%
Novel problems***       3.8 (2.8)      34%       2.0 (1.5)      28%
Design ideas            2.5 (1.9)      22%       2.2 (2.2)      23%
Note: *** = p < .001; averages are weighted by the number of problems; HE=heuristic evaluation;
MOT=evaluation by metaphors of thinking. Due to rounding errors percentages may not add up.
Table 2.
Overlap between techniques and participants in problems found.
                        HE (N=43)                MOT (N=44)
                        Mean (SD)      %         Mean (SD)      %
Number of problems      11.3 (6.2)     -         9.6 (5.7)      -
Found by both tech.     6.9 (3.6)      61%       7.2 (4.3)      74%
Found with one tech.
  Many participants     1.3 (1.4)      11%       0.7 (1.2)      7%
  One participant*      3.2 (3.0)      28%       1.8 (1.8)      19%
Note: * = p < .05; HE=heuristic evaluation; MOT=evaluation by metaphors of thinking.
Table 3.
Relation between and use of individual heuristics and metaphors.
              No. prob.   Single prob.   Avg. severity   Problems also found with
Heuristics    800         30%            2.4             -
  H1          205         30%            2.5             H2 (52%); M3 (49%)
  H2          117         25%            2.5             H1 (74%); H4 (56%)
  H3          63          29%            2.4             H1 (70%); M2 (58%)
  H4          134         24%            2.3             H1 (67%); M3 (54%)
  H5          43          35%            2.3             H1 (42%); M2 (28%)
  H6          41          34%            2.5             H4 (55%); M2 (41%)
  H7          28          36%            2.2             H1, H3, M3 (57%)
  H8          41          37%            2.0             H9 (63%); H2 (49%)
  H9          64          28%            2.2             H1 (56%); H2, M1 (41%)
  H10         64          36%            2.6             H1 (73%); H2 (55%)
Metaphors     573         18%            2.2             -
  M1          124         16%            2.2             M2 (69%); M3 (67%)
  M2          169         21%            2.2             M3 (74%); M1 (66%)
  M3          141         16%            2.3             M2 (75%); M1 (63%)
  M4          66          21%            2.4             M3 (59%); M2 (52%)
  M5          73          19%            2.0             M2 (66%); M1 (64%)
Note: H1 to H10 refer to the heuristics mentioned in Nielsen (1993); M1 to M5 refer to the
metaphors described in Hornbæk & Frøkjær (2002). Section 2 describes these heuristics and metaphors.
Severity is measured on a 1 to 3 scale, with 1 being the most severe. Participants could indicate more
than one metaphor per problem, which is why the sum of column two is higher than the number of
problems. The rightmost column shows the heuristics/metaphors with the two largest overlaps with the
heuristic/metaphor in the row.
Figure Caption
Figure 1. The relation between problems and consolidated problems. The top part of the figure
shows the problems identified with the two techniques. The circles at the bottom part of the figure show
the consolidated problems. The arrows indicate the number and percentage of problems in each group of
consolidated problems.
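[Figure 1 (diagram not reproduced): problems identified with HE (N=487) and MOT (N=424), linked by
arrows to three groups of consolidated problems: common to both techniques, specific to HE, and
specific to MOT. The technique-specific groups are each divided into problems found by one
participant and problems found by more participants. Percentages shown on the arrows: HE 61%, 28%,
11%; MOT 74%, 19%, 7%.]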