
Quantitative Evaluation

John Kelleher, IT Sligo

Definition Methods

Performance/Predictive Modeling
  GOMS/KLM
  Fitts’ Law
Controlled Experiments & Statistical Analysis
  Without measurement, success is undefined
Formal Usability Study
  to compare two designs on measurable aspects
    time required
    number of errors
    effectiveness for achieving very specific tasks

GOMS Model

Card, Moran & Newell (1983)
Models the knowledge and cognitive processes involved when users interact with systems.
Goals
  refer to a particular state the user wants to achieve
Operators
  refer to the cognitive processes and physical actions that need to be performed in order to attain those goals
Methods
  are learned procedures for accomplishing the goals, consisting of the exact sequence of steps required
Selection Rules
  are used to determine which method to select when there is more than one available for a given stage of a task

GOMS: Example of deleting word in MS Word

Goal: delete a word in a sentence

Method for accomplishing goal of deleting a word using menu option:
  Step 1: Recall that word to be deleted has to be highlighted
  Step 2: Recall that command is ‘cut’
  Step 3: Recall that command ‘cut’ is in edit menu
  Step 4: Accomplish goal of selecting and executing the ‘cut’ command
  Step 5: Return with goal accomplished

GOMS: Example of deleting word in MS Word

Method for accomplishing goal of deleting a word using delete key:
  Step 1: Recall where to position cursor in relation to word to be deleted
  Step 2: Recall which key is delete key
  Step 3: Press ‘delete’ key to delete each letter
  Step 4: Return with goal accomplished

GOMS: Example of deleting word in MS Word

Operators to use in above methods:
  Click mouse
  Drag cursor over text
  Select menu
  Move cursor to command
  Press keyboard key

Selection Rules to decide which method to use:
  1: Delete text using mouse and selecting from menu if large amount of text is to be deleted
  2: Delete text using delete key if small number of letters is to be deleted
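
The goal, the two methods and the selection rules above can be read as a small decision procedure. The Python sketch below is only an illustration of that reading: the identifiers and, in particular, the 10-character threshold separating a "large amount of text" from a "small number of letters" are assumptions made for the example, not part of GOMS.

```python
# Illustrative sketch of the GOMS word-deletion example.
# All identifiers and the threshold below are assumptions for this example.

METHODS = {
    "menu": [
        "Recall that word to be deleted has to be highlighted",
        "Recall that command is 'cut'",
        "Recall that command 'cut' is in edit menu",
        "Accomplish goal of selecting and executing the 'cut' command",
        "Return with goal accomplished",
    ],
    "delete_key": [
        "Recall where to position cursor in relation to word to be deleted",
        "Recall which key is delete key",
        "Press 'delete' key to delete each letter",
        "Return with goal accomplished",
    ],
}

LARGE_TEXT_THRESHOLD = 10  # characters; hypothetical cut-off


def select_method(chars_to_delete: int) -> str:
    """Selection rules: menu for large amounts of text, delete key otherwise."""
    if chars_to_delete > LARGE_TEXT_THRESHOLD:
        return "menu"
    return "delete_key"


def plan(chars_to_delete: int) -> list[str]:
    """Return the steps of the method selected for the goal 'delete text'."""
    return METHODS[select_method(chars_to_delete)]


if __name__ == "__main__":
    for step in plan(4):
        print(step)
```

Calling plan(4) lists the delete-key steps, while plan(200) lists the menu steps.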

Keystroke Level Model

Well-known analytic evaluation technique
Derived from the Model Human Processor (MHP) by Card, Moran & Newell (1983)
Provides detailed quantitative (numerical) information on user performance
Sufficient for predicting speed of interaction with a user interface
Basic time prediction components empirically derived

KLM Constants

Operator  Description                                          Time (sec)
K         Pressing a single key or button                      0.35 (average)
            Skilled typist (55 wpm)                            0.22
            Average typist (40 wpm)                            0.28
            User unfamiliar with the keyboard                  1.20
            Pressing shift or control key                      0.08
P         Pointing with a mouse or other device to a target    1.10
          on a display
            Clicking the mouse or similar device               0.20
H         Homing hands on the keyboard or other device         0.40
D         Drawing a line using a mouse                         Variable, depending on the length of line
M         Mentally preparing to do something                   1.35
          (e.g. making a decision)
R(t)      System response time, counted only if it causes      t
          the user to wait when carrying out their task

Task in Text Editor Using GOMS

Create new file
Type in “Hello, World.”
Save document as “Hello”
Print document
Exit editor

Assume:
  System response time is 0, or comparable across systems (constant)
  Average skilled typist (55 wpm), K = 0.2
  Editor is started, hands in lap

All Mouse

Task            KLM                    Time (sec)
Open New File   H+P+B+P+B              2.8
Type words      H + 15*K               3.4
Save            H+P+B+P+B+H+5*K+K      4.4
Print           H+P+B+P+B+P+BB         4.1
Exit            P+B+P+BB               2.5
Total                                  17.2

Shortcuts

Task            KLM                    Time (sec)
Open New File   H + (2*K)              0.8
Type words      15*K                   3.0
Save            (2*K) + (5*K) + K      1.6
Print           (2*K) + K              0.6
Exit            (2*K) + K              0.6
Total                                  6.6
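
The All Mouse and Shortcuts totals can be checked mechanically by summing operator times. The sketch below assumes K = 0.2, P = 1.1 and H = 0.4 seconds as used in the task description, and B = 0.1 seconds for a single button press or release. The value of B is not stated in the slides; it is inferred here because it reproduces the listed times, and it makes a click (BB) equal to 0.2 seconds. The dictionary layout and function names are illustrative only.

```python
# Keystroke-Level Model estimate for the "Hello, World." task.
# Operator times in seconds. B = 0.1 is an assumption inferred from the
# table totals; a click (B B) then costs 0.2, matching the constants table.
OPERATOR_TIME = {"K": 0.2, "P": 1.1, "H": 0.4, "B": 0.1}

# Each task is written as a space-separated operator sequence.
ALL_MOUSE = {
    "Open New File": "H P B P B",
    "Type words": "H " + "K " * 15,
    "Save": "H P B P B H " + "K " * 6,
    "Print": "H P B P B P B B",
    "Exit": "P B P B B",
}

SHORTCUTS = {
    "Open New File": "H K K",
    "Type words": "K " * 15,
    "Save": "K " * 8,
    "Print": "K K K",
    "Exit": "K K K",
}


def task_time(sequence: str) -> float:
    """Sum the operator times of one space-separated operator sequence."""
    return sum(OPERATOR_TIME[op] for op in sequence.split())


def total_time(design: dict[str, str]) -> float:
    """Sum the task times of every task in a design."""
    return sum(task_time(seq) for seq in design.values())


if __name__ == "__main__":
    for name, design in (("All mouse", ALL_MOUSE), ("Shortcuts", SHORTCUTS)):
        for task, seq in design.items():
            print(f"{name:10} {task:14} {task_time(seq):5.1f}")
        print(f"{name:10} {'Total':14} {total_time(design):5.1f}")
```

Under these assumptions the all-mouse design takes 17.2 seconds and the shortcut design 6.6 seconds, which is the kind of comparison between alternatives that KLM is suited to.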

KLM Applicability

User interface with a limited number of features
Repetitive task execution
Really only useful for comparative study among alternatives, albeit sensitive to minor changes
  Project Ernestine
Caveats
  assumes expert behaviour: no errors tolerated
  user already knows the sequence of operations that he or she is going to perform
  time estimates best followed up by empirical studies
  ambiguity regarding M operator
  assumes serial processing

Fitts’ Law

Predicts time taken to reach a target using a pointing device

T = k log2(D/S + 0.5), k ≈ 100 msec
where
  T = time to move the hand to a target
  D = distance between hand and target
  S = size of target

Highlights corners of screen as good targets
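
As a rough illustration of the formula, the following Python sketch evaluates T = k log2(D/S + 0.5) with k = 100 msec. The function name and the example distances are assumptions chosen for the illustration; D and S only need to share the same unit.

```python
import math

K_MSEC = 100.0  # slope constant from the slide, roughly 100 msec


def fitts_time(distance: float, size: float, k: float = K_MSEC) -> float:
    """Predicted movement time in msec: T = k * log2(D/S + 0.5).

    The formula yields non-positive times when D/S <= 0.5, so it is only
    meaningful for targets further away than half their own size.
    """
    if distance <= 0 or size <= 0:
        raise ValueError("distance and size must both be positive")
    return k * math.log2(distance / size + 0.5)


if __name__ == "__main__":
    # Same distance, two target sizes: the larger target is reached faster.
    print(fitts_time(150, 20))  # D/S + 0.5 = 8, so 3 * k
    print(fitts_time(150, 40))
```

Doubling the target size at the same distance lowers the predicted time. One common reading of the remark about screen corners is that the pointer cannot overshoot them, so their effective size S is very large.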

Performance measures

Time: easy to measure and suitable for statistical analysis. E.g. learning time, task completion time.
Errors: show where problems exist within a system. Suggest the cause of a difficulty.
Patterns of system use: study the patterns of use in different sections. Preference for and avoidance of sections in a system.
Amount of work done in a given time.

Other measures

Subjective impression measures
Attitude measures: use questionnaires or interviews
  Rated aesthetics
  Rated ease of learning
  Stated decision to purchase
Composite measures
  Weighted averages of the above
  E.g. efficiency = throughput / number of errors
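
A composite measure is simple arithmetic over the individual measures. The sketch below shows one possible way to compute the efficiency ratio and a weighted average; the function names, the handling of an error-free session (returned as infinity) and the sample ratings and weights are all assumptions made for illustration.

```python
import math


def efficiency(throughput: float, errors: int) -> float:
    """Composite measure from the slide: throughput / number of errors."""
    if errors < 0:
        raise ValueError("errors must not be negative")
    if errors == 0:
        # Assumption: an error-free session is reported as infinitely efficient.
        return math.inf
    return throughput / errors


def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of named scores, using the weight given for each name."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


if __name__ == "__main__":
    print(efficiency(throughput=12, errors=3))
    # Hypothetical ratings on a 1-5 scale, ease of learning weighted 3x.
    ratings = {"aesthetics": 4, "ease_of_learning": 3}
    weights = {"aesthetics": 1, "ease_of_learning": 3}
    print(weighted_average(ratings, weights))
```

The weights encode a judgement about which measure matters more, so they should be fixed before the data are collected.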

Controlled experiments

Designed to test predictions arising from an explicit hypothesis that arises out of an underlying theory
Allows comparison of systems, fine-tuning of details ...
Strives for
  lucid and testable hypothesis
  quantitative measurement
  measure of confidence in results obtained (statistics)
  replicability of experiment
  control of variables and conditions
  removal of experimenter bias

Ben Shneiderman (Univ. Maryland US)

Experiments have:
Two Parents:
  ‘a practical problem’
  ‘a theoretical foundation’
Three Children:
  ‘help in resolving the practical problems’
  ‘refinements to the theory’
  ‘advice to future experimenters who work on the same problem’

Designing Experiments

Formulating the hypotheses
Developing predictions from the hypotheses
Choosing a means to test the predictions
Identifying all the variables that might affect the results of the experiment
Deciding which are the independent variables, dependent variables and which variables need to be controlled by some means


Usability Laboratory


Designing Experiments (contd.)

Designing the experimental task and method
Subject selection
Deciding the experimental design, data collection method and controlling confounding variables
Deciding on the appropriate statistical or other analysis
Carrying out a pilot study

The Experimental Method

a) Begin with a lucid, testable hypothesis

Example 1:
“There is no difference in the number of cavities in children and teenagers using Crest and No-teeth toothpaste”

The Experimental Method

Example 2:
“There is no difference in user performance (time and error rate) when selecting a single item from a pop-up or a pull-down menu, regardless of the subject’s previous expertise in using a mouse or using the different menu types”

The Experimental Method

b) Explicitly state the independent variables that are to be altered

Independent variable
  the things you manipulate independent of how a subject behaves
  determines a modification to the conditions the subjects undergo
  may arise from subjects being classified into different groups
In toothpaste experiment
  toothpaste type: uses Crest or No-teeth toothpaste
  age: <= 11 years or > 11 years
In menu experiment
  menu type: pop-up or pull-down
  menu length: 3, 6, 9, 12, 15
  subject type: expert or novice
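
To see how the independent variables of the menu experiment combine into experimental conditions, the levels can be crossed. The sketch below assumes a full factorial design, in which every level of each variable is combined with every level of the others; the slides list the variables but do not state that every combination is tested.

```python
from itertools import product

# Independent variables and their levels, as listed for the menu experiment.
MENU_TYPES = ("pop-up", "pull-down")
MENU_LENGTHS = (3, 6, 9, 12, 15)
SUBJECT_TYPES = ("expert", "novice")


def conditions() -> list[tuple[str, int, str]]:
    """Cross all levels (full factorial design; an assumption, see above)."""
    return list(product(MENU_TYPES, MENU_LENGTHS, SUBJECT_TYPES))


if __name__ == "__main__":
    all_conditions = conditions()
    print(len(all_conditions), "conditions")
    for menu_type, length, subject in all_conditions[:4]:
        print(menu_type, length, subject)
```

With 2 menu types, 5 menu lengths and 2 subject types this gives 2 x 5 x 2 = 20 conditions, which shows how quickly each added variable increases the size of the experiment.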

The Experimental Method

c) Carefully choose the dependent variables that will be measured

Dependent variables
  Measures to demonstrate the effects of the independent variables
Properties
  Readily observable
  Stable and reliable so that they do not vary under constant experimental conditions
  Sensitive to the effects of the independent variables
  Readily related to some scale of measurement

Dependent variables

Some commonly used dependent variables
  Number of errors made
  Time taken to complete a given task
  Time taken to recover from an error
In menu experiment
  time to select an item
  selection errors made
In toothpaste experiment
  number of cavities
  frequency of brushing

What is an experiment?

Three criteria
  The experimenter must systematically manipulate one or more independent variables in the domain under investigation.
  The manipulation must be made under controlled conditions, such that all variables which could affect the outcome of the experiment are controlled (see confounding variables, next).
  The experimenter must measure some un-manipulated feature that changes, or is assumed to change, as a function of the manipulated independent variable.

Confounding variables

Variables that are not independent variables but are permitted to vary during the experiment

“The logic of experiments is to hold variables-not-of-interest constant among conditions, systematically manipulate independent variables, and observe the effects of the manipulation on the dependent variables.”

Sources of variation

Variations in the task performed
The effect of the treatment (i.e. the user interface improvements that we made)
Individual differences between experimental subjects (e.g. IQ)
Different stimuli for each task
Distractions during the trial (sneezing, dropping things)
Motivation of the subject
Accidental hints or intervention by the experimenter
Other random factors

Examples of Confounding

Order effects
  Tasks done early in testing are slower and more prone to error. Tasks done late in testing may be affected by user fatigue.
Carry-over effects
  A difference occurs if one condition follows another. E.g. learning text editor commands.
Experience factors
  People in one condition have more/less relevant experience than in others.
Experimenter/subject bias
  The experimenter systematically treats some subjects differently from others, or subjects have different motivation levels.
Other uncontrolled variables
  Time of day, system load.

Confounding Prevention

Randomization
  Negates the order effect.
  Random assignment to conditions is used to ensure that any effect due to unknown differences among users or conditions is random.
Counterbalancing
  Addresses order and carry-over effects.
  Test half of the users in condition 1 first, and the other half in condition 2 first.
  Different permutations of condition order can be used.
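
Counterbalancing can be sketched as cycling through the permutations of the condition order. The function below is a minimal illustration; its name and signature are assumptions, and participants would normally be put in random order before orders are assigned.

```python
from itertools import cycle, permutations


def counterbalance(participants, conditions):
    """Give each participant a condition order, cycling through all permutations.

    Every order is used equally often when the number of participants is a
    multiple of the number of permutations.
    """
    orders = cycle(permutations(conditions))
    return {person: order for person, order in zip(participants, orders)}


if __name__ == "__main__":
    plan = counterbalance(["Elizabeth", "Michael", "Steven", "Richard"], ["A", "B"])
    for person, order in plan.items():
        print(person, ",".join(order))
```

With two conditions, half of the participants start with A and half with B. With three conditions there are six possible orders, so the number of participants should be a multiple of six for full balance.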

Allocation of participants

Judiciously select and assign subjects to groups to control variability

a) Between-Groups Experiment
  Two groups of test users, same tasks for both groups.
  Randomly assign users to two equally-sized groups.
  Group A uses only system A, group B only system B.
b) Within-Groups Experiment
  One group of test users.
  Each user performs equivalent tasks on both systems.
  Randomly assign users to two equally-sized pools.
  Pool A uses system A first, pool B system B first.
c) Matched-pairs
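
Random assignment to two equally-sized groups, as used in the between-groups design, can be sketched as a shuffle followed by a split. The function name and the optional seed parameter are assumptions for illustration; the seed only makes an allocation reproducible for record keeping.

```python
import random


def assign_between_groups(users, seed=None):
    """Randomly split users into groups A and B of (near-)equal size."""
    rng = random.Random(seed)
    shuffled = list(users)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of users, group B gets the extra person.
    return {"A": shuffled[:half], "B": shuffled[half:]}


if __name__ == "__main__":
    users = ["John", "James", "Mary", "Stuart", "Dave", "May", "Ann", "Phil"]
    groups = assign_between_groups(users, seed=1)
    print("System A:", groups["A"])
    print("System B:", groups["B"])
```

The same split can serve a within-groups experiment: the two halves then become the pools that use system A first and system B first.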

Example Designs

Between Groups

System A    System B
John        Dave
James       May
Mary        Ann
Stuart      Phil

  Advantages
    No transfer of learning effects
    Less arduous on participants
  Disadvantages
    Requires more participants
    Large individual variation in user skills

Within Groups

Participant   Sequence
Elizabeth     A,B
Michael       B,A
Steven        A,B
Richard       B,A

  Advantages
    Is more powerful statistically (can compare the same person across different conditions, thus isolating effects of individual differences)
    Requires fewer participants than between-groups
  Disadvantages
    Learning effects
    Fatigue effects

Experimental Details

Order of tasks
  choose one simple order (simple -> complex), unless doing a within-groups experiment
Training
  depends on how the real system will be used
What if someone doesn’t finish?
  assign a very large time and a large number of errors
Pilot study
  helps you fix problems with the study
  do two, first with colleagues, then with real users


Sample Size

Depends on desired confidence level and confidence interval.

Confidence level of 95% often used for research, 80% ok for practical development.

Rule of thumb: 16-20 test users.

Analysing the numbers

Example: trying to get task time <= 30 min.
  test gives: 20, 15, 35, 80, 10, 20
  mean (average) = 30
  looks good!
  wrong answer, not certain of anything
  always chart results
Factors contributing to our uncertainty
  small number of test users (n = 6)
  results are very variable (standard deviation ≈ 26)
    std. dev. measures dispersal from the mean
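
The uncertainty in this example can be made concrete with a confidence interval for the mean. The sketch below uses only the Python standard library; the critical value 2.571 is the two-sided 95% value of Student's t for 5 degrees of freedom taken from standard tables, and treating the task times as roughly normally distributed is an assumption.

```python
import math
import statistics

TIMES = [20, 15, 35, 80, 10, 20]  # task times in minutes, from the example
T_CRIT_95_DF5 = 2.571  # two-sided 95% critical value of Student's t, df = 5


def confidence_interval(samples, t_crit):
    """Return (mean, sample standard deviation, (low, high)) for the mean."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (divides by n - 1)
    half_width = t_crit * sd / math.sqrt(len(samples))
    return mean, sd, (mean - half_width, mean + half_width)


if __name__ == "__main__":
    mean, sd, (low, high) = confidence_interval(TIMES, T_CRIT_95_DF5)
    print(f"mean = {mean:.1f}, sd = {sd:.1f}")
    print(f"95% confidence interval: {low:.1f} to {high:.1f} minutes")
```

For these six values the interval runs from roughly 3 to 57 minutes, so the data do not support the claim that the task time is at most 30 minutes.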

Experimental Evaluation

Advantages
  Powerful method (depending on the effects investigated)
  Quantitative data for statistical analysis
  Can compare different groups of users
  Reliability and validity good
  Replicable
Disadvantages
  High resource demands
  Requires knowledge of experimental method
  Time spent on experiments can mean evaluation is difficult to integrate into design cycle
  Tasks can be artificial and restricted
  Cannot always generalise to full system in typical working situation
  All human behaviour variables cannot be controlled
  Little recognition of work, time, motivational & social context
  Subject’s ideas, thoughts, beliefs largely ignored

Summary

Allows comparison of alternative designs
Collects objective, quantitative data (bottom-line data)
Needs significant number of test users (16-20)
Usable only later in development process
Requires administrator expertise
Cannot provide why-information (process data)
Formal studies can reveal detailed information but take extensive time/effort
Applicability:
  system location dangerous or impractical
  for constrained single user systems
  to allow controlled manipulation of use

Summary (contd.)

Advantages and disadvantages
  sophisticated & expensive equipment
  uninterrupted environment
  Hawthorne principle
