
Reliability of Usability Evaluations

Review of Literature and How To Minimize Problems

Armen Chakmakjian, HF750 – SP11, 3/24/11

Table of Contents

Introduction: Is Usability Testing Reliable?

The optimal number under question

The layering of effects on results

The evaluator effect bias

Analytical data as a solution

Evaluation of the Evaluation Literature

Summary and Conclusion

Works Cited


Introduction: Is Usability Testing Reliable?

Usability testing has become an integral part of the software design life cycle for products in many industries. A substantial body of literature describes how to conduct an effective usability test. This paper describes some of those methods and addresses a fundamental issue debated among practitioners: whether the results are reliable or comprehensive in a scientific sense.

All phases and methodologies have been under scrutiny for several years now. Topics under scrutiny include:

- The optimal number of users to test
- The observer's effect on the test
- The moderator's effect on the test
- The evaluators' effect on the conclusions
- Whether expert reviews are better than user testing
- Whether a specific task plan is better than free-form use

In reviewing the pertinent literature, this paper attempts to explain each of these items and the expert opinion in each area. Finally, the paper describes the implications of these issues for the practitioner and proposes some possible solutions.

The optimal number under question

Starting with the work of Virzi at GTE Laboratories (Virzi, 1992), a body of research coalesced around an optimal number of roughly five test participants. Those five would generally find about 80% of the detectable issues, and the most severe usability problems tended to be identified by the first few subjects. More recently, Law and Hvannberg (Law & Hvannberg, 2004) have shown from a probabilistic point of view that this magic number, and the methodology behind it, may be problematic: individual task scenario events are not independent, and individual problems do not have an equal likelihood of being identified.
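To make the debate concrete, it helps to spell out the arithmetic behind the claim. It rests on a cumulative discovery model in which every problem is assumed to have the same, independent probability p of being detected by any one participant. The sketch below uses p = 0.31, a figure commonly quoted in this literature rather than a value taken from the studies cited here; Law and Hvannberg's objection is precisely that the assumptions of independence and equal likelihood do not hold.

```python
# Minimal sketch of the cumulative problem-discovery model behind the
# "five users" guideline: Found(n) = 1 - (1 - p)^n, where p is the average
# probability that a single participant detects a given problem.
# p = 0.31 is an illustrative value from this literature, not a constant;
# when p varies across problems and tasks, this curve is optimistic.

def proportion_found(n_participants: int, p: float = 0.31) -> float:
    """Expected proportion of problems found by n participants."""
    return 1.0 - (1.0 - p) ** n_participants

if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} participants -> {proportion_found(n):.0%} of problems")
    # With p = 0.31, five participants find roughly 84%, which is where the
    # familiar "five users find about 80% of problems" claim comes from.
```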

Lindgaard and Chattratichart point out that the magic number five is relied upon excessively by practitioners (Lindgaard & Chattratichart, 2007). They contend that by concentrating on the number of participants rather than on task coverage, many problems are missed. They based their argument on their own evaluation of the results of the CUE-4 study (Molich & Dumas, 2006), which compared the findings of expert review teams against each other. That study and one of its predecessors, CUE-2 (Molich, Ede, Kaasgaard, & Karyukin, 2004), examined the consistency of findings between teams and organizations. In both cases, roughly 75% of the reported problems were unique to a single team, and only a small number of problems were common across the teams involved. As Molich pointed out, and as Lindgaard later may have inferred, consistency of method and task may be a cure for this.

The layering of effects on results

The difficulty here is that human evaluators introduce layered effects, and simply using the same task list may not produce consistent results. For example, observers of the same test may record different findings. As far back as 1992, it was apparent (Donkers, Tombaugh, & Dillon, 1992) that the performance of observers recording usability problems was affected both by the obviousness of the problems and by the observers' prior knowledge of them. Even simply watching and recording (for instance, from behind a two-way mirror) is affected by ancillary factors in the process. Further studies have shown that post-test evaluations can also be affected (Jacobsen, Hertzum, & John, 1998). Jacobsen points out that the results of their study "questions the use of data from usability tests as a baseline for comparison to other usability evaluation methods" (Jacobsen, Hertzum, & John, 1998).

The evaluator effect bias

Further clouding the waters of usability testing reliability is a study of the evaluator effect by Hertzum, Jacobsen, and Molich (Hertzum, Jacobsen, & Molich, 2002), in which they examined what happens when evaluators compare their results with each other. The mere fact that two evaluators found issues in the same area led them to feel that their own results were justified and that it was unnecessary to involve more of the team in the evaluation. Even more damning, all of the evaluators missed several of the same problems that were clearly usability issues (Hertzum, Jacobsen, & Molich, 2002).
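To illustrate what "perceived agreement in spite of disparate observations" looks like in numbers, the evaluator effect is often quantified with an any-two agreement measure: the average overlap between every pair of evaluators' problem sets. The sketch below is a minimal illustration of that idea, not code from the cited studies; the evaluator names and problem IDs are hypothetical.

```python
from itertools import combinations

def any_two_agreement(problem_sets: dict[str, set[str]]) -> float:
    """Average Jaccard overlap between every pair of evaluators' problem sets.

    Low values indicate a strong evaluator effect: evaluators watching the
    same sessions report largely different problems.
    """
    pairs = combinations(problem_sets.values(), 2)
    overlaps = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

if __name__ == "__main__":
    # Hypothetical problem IDs reported by three evaluators of the same test.
    reports = {
        "evaluator_A": {"P1", "P2", "P3", "P7"},
        "evaluator_B": {"P2", "P4", "P7"},
        "evaluator_C": {"P2", "P5", "P6"},
    }
    print(f"any-two agreement: {any_two_agreement(reports):.2f}")
```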

Analytical data as a solution

Some have argued that, to ameliorate the problems of rigor and inconsistency of results, quantitative and qualitative methodologies have to be used in conjunction, or that if the testing is done with one method alone, the client should be made aware of the limitations of the resulting data (Hughes, 1999). Others feel that using more quantitative techniques within the contextual environment of use could remove bias. In particular, equipment that records behavior as video, along with mouse and keyboard activity, during a think-aloud test would help (Christensen & Frøkjær, 2010). In other words, this might make remote testing a more reliable format if the quantitative data could be collected easily.
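As a rough illustration of what such instrumentation might involve, the sketch below logs timestamped, low-level interaction events to a stream that could later be aligned with an evaluator's notes or a think-aloud recording. The event names, element identifiers, and JSON format are hypothetical and are not taken from the Christensen and Frøkjær tooling.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class InteractionEvent:
    """One low-level user action, timestamped so it can be aligned with
    video or think-aloud audio from the same session."""
    timestamp: float  # seconds since session start
    kind: str         # e.g. "click", "keypress", "scroll"
    target: str       # hypothetical UI element identifier
    detail: str = ""

class SessionLog:
    """Append-only event stream for one usability-test session."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.start = time.monotonic()
        self.events: list[InteractionEvent] = []

    def record(self, kind: str, target: str, detail: str = "") -> None:
        self.events.append(
            InteractionEvent(time.monotonic() - self.start, kind, target, detail)
        )

    def to_json(self) -> str:
        return json.dumps(
            {"session": self.session_id, "events": [asdict(e) for e in self.events]},
            indent=2,
        )

if __name__ == "__main__":
    log = SessionLog("participant-03")
    log.record("click", "checkout_button")
    log.record("keypress", "promo_code_field", detail="SAVE10")
    print(log.to_json())
```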

Evaluation of the Evaluation Literature

The literature in the whole area of usability evaluation seems to fall into two camps: promoting a particular technique, or damning the reliability of many of the techniques and rubrics that are considered etched in stone (like the commandment of five users). One article argues forcefully that usability evaluations are part of an HCI tool kit and have an appropriate place in the design cycle, but are not the only tool in the kit (Greenberg & Buxton, 2008).

Summary and Conclusion

In the end, the lesson for the practitioner is that usability evaluation is neither just a think-aloud qualitative test nor just Google Analytics tracking a user's every click (not to mention eye tracking). As many of the studies presented here have shown, the reliability of usability evaluation can be biased by the rigor applied to the particular technique and by the makeup of the team doing the evaluation. In fact, when usability professionals share and compare results, they can erroneously affirm the correctness of a single set of results.

The implications for the practitioner are quite profound, and the lessons can be summarized as follows:

- Keep the test target and expectations small and focused; software can be big.
- Have a well-defined task list that keeps the user's think-aloud evaluation focused on the problem being evaluated.

- Use some amount of technically objective data, such as keystrokes and mouse clicks, to create a data stream that can be correlated with the evaluators' observations.
- Treat expert reviews and user evaluations as complementary techniques, but be prepared for the results to vary widely.

Works Cited

Christensen, L., & Frøkjær, E. (2010). Distributed Usability Evaluation: Enabling Large-scale Usability Evaluation with User-controlled Instrumentation. Proceedings of NordiCHI 2010 (pp. 118-127). Reykjavik, Iceland: ACM.

Donkers, A. M., Tombaugh, J. W., & Dillon, R. F. (1992). Observer Accuracy in Usability Testing: The Effects of Obviousness and Prior Knowledge of Usability Problems. Carleton University, Department of Psychology, Ottawa, Ontario, Canada.

Greenberg, S., & Buxton, B. (2008). Usability Evaluation Considered Harmful (Some of the Time). CHI 2008 Proceedings (pp. 111-121). Florence, Italy: ACM.

Hertzum, M., Jacobsen, N. E., & Molich, R. (2002). Usability Inspections by Groups of Specialists: Perceived Agreement in Spite of Disparate Observations. CHI 2002 (pp. 662-663). Denmark.

Hughes, M. (1999, November). Rigor in Usability Testing. Technical Communication, (4), 488-494.

Jacobsen, N. E., Hertzum, M., & John, B. E. (1998). The Evaluator Effect in Usability Tests. CHI 98 (pp. 255-256). ACM.

Law, E. L.-C., & Hvannberg, E. T. (2004). Analysis of Combinatorial User Effect in International Usability Tests. CHI 2004, 6, 9-16. Vienna, Austria.

Lindgaard, G., & Chattratichart, J. (2007). Usability Testing: What Have We Overlooked? CHI 2007 Proceedings (pp. 1415-1424). San Jose, CA.

Molich, R., & Dumas, J. S. (2006). Comparative Usability Evaluation (CUE-4). Behaviour and Information Technology, preprint.

Molich, R., Ede, M. R., Kaasgaard, K., & Karyukin, B. (2004). Comparative usability evaluation. Behaviour & Information Technology, 23(1), 65-74.

Virzi, R. A. (1992). Refining the Test Phase of Usability Evaluation: How Many Subjects Is Enough? Human Factors, 34, 457-471.
