The art and science of measuring people: reliability, validity, operationalizing
Post on 18-Dec-2015
Overview of design and analysis
Posing a usability question
Conceptualizing the question
Operationalizing the related concepts
Identifying Independent, Dependent, & Controlled Variables
Developing the Hypothesis
Choosing the testing method
What method is appropriate for the current situation? (experiment, observation, surveys etc.)
>> choice of method as a trade-off between control and realism
Experimental, Quasi-Experimental and Non-Experimental Methods
[Figure: the trade-off between experimental control and experimental realism]
Collecting data
The art of finding and recruiting participants
A practical view of randomization:
- Randomization and pseudo-randomization
- Random selection and random assignment
Practical issues about sample size and statistical power
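The distinction between random selection and random assignment can be made concrete with a short script. Everything here is illustrative: the participant pool, the sample size, and the fixed seed are invented for the example.

```python
import random

# Hypothetical participant pool; names are illustrative only.
pool = [f"participant_{i:03d}" for i in range(200)]

# Random selection: draw a sample from the population.
rng = random.Random(42)          # fixed seed so the draw is reproducible
sample = rng.sample(pool, 20)    # 20 participants selected at random

# Random assignment: split the selected sample across conditions.
shuffled = sample[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:10], shuffled[10:]

print(len(treatment), len(control))   # 10 10
print(set(treatment) & set(control))  # set() -- groups are disjoint
```

Selection determines who gets studied (external validity); assignment determines who gets which condition (internal validity). The two are independent choices.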
Analyzing the data: Basic Statistics
Levels of measurement: nominal, ordinal, interval, and ratio
Mean, median, standard deviation
Testing mean differences
Significance levels and what they mean
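The descriptive statistics and a basic test of mean differences can be sketched with the standard library alone. The task-completion times below are fabricated for illustration; the t statistic is the pooled-variance two-sample version.

```python
import statistics as stats

# Hypothetical task-completion times (seconds) on two site designs.
fast_site = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.8]
slow_site = [14.2, 13.1, 15.8, 12.9, 16.4, 13.7, 14.9, 15.1]

mean_fast = stats.mean(fast_site)
median_fast = stats.median(fast_site)
sd_fast = stats.stdev(fast_site)   # sample standard deviation (n - 1)

# Two-sample t statistic (pooled variance).
n1, n2 = len(fast_site), len(slow_site)
var_pooled = ((n1 - 1) * stats.variance(fast_site) +
              (n2 - 1) * stats.variance(slow_site)) / (n1 + n2 - 2)
t = (stats.mean(slow_site) - mean_fast) / (var_pooled * (1/n1 + 1/n2)) ** 0.5

print(round(mean_fast, 2), round(median_fast, 2), round(sd_fast, 2))
print(round(t, 2))   # compare against the t critical value, df = n1 + n2 - 2
```

The significance level is then read off by comparing t to the critical value for the chosen alpha, which is exactly what "testing mean differences" amounts to.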
Analysis of experimental designs: Single Factor Experiments
Statistical Hypothesis Testing
Estimates of Experimental Error
Estimates of Treatment Effects
Evaluation of the Null Hypothesis
Various ANOVA models
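A single-factor ANOVA boils down to comparing the between-treatment variance (the estimate of treatment effects) to the within-treatment variance (the estimate of experimental error). A minimal sketch, using made-up error counts under three hypothetical interface conditions:

```python
import statistics as stats

# Hypothetical error counts under three interface conditions (one factor).
groups = {
    "menu":    [4, 6, 5, 7, 5],
    "toolbar": [8, 9, 7, 10, 9],
    "voice":   [6, 5, 7, 6, 6],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = stats.mean(all_scores)
k = len(groups)                 # number of treatments
n_total = len(all_scores)

# Between-groups sum of squares: reflects treatment effects.
ss_between = sum(len(g) * (stats.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-groups sum of squares: estimate of experimental error.
ss_within = sum((x - stats.mean(g)) ** 2
                for g in groups.values() for x in g)

f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(round(f_ratio, 2))   # compare against the F critical value, df = (2, 12)
```

If the F ratio exceeds the critical value, the null hypothesis of equal treatment means is rejected.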
Multi Factor Experiments
Advantages of the factorial design
Interaction Effects
The power of within-subjects designs (reduction of variance)
The two-factor experiment
Higher-Order Factorial Designs
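Main effects and an interaction in a 2x2 factorial can be read straight off the cell means. The satisfaction ratings below are invented for illustration, loosely echoing the speed-versus-graphics theme of the case study later in these notes:

```python
# Hypothetical 2x2 factorial: cell means of satisfaction ratings for
# factor A (site speed: fast/slow) x factor B (graphics: rich/plain).
cell_means = {
    ("fast", "rich"): 6.0, ("fast", "plain"): 5.5,
    ("slow", "rich"): 2.0, ("slow", "plain"): 4.5,
}

grand = sum(cell_means.values()) / 4
mean_fast = (cell_means[("fast", "rich")] + cell_means[("fast", "plain")]) / 2
mean_rich = (cell_means[("fast", "rich")] + cell_means[("slow", "rich")]) / 2

# Main effects: marginal means minus the grand mean.
effect_speed = mean_fast - grand
effect_graphics = mean_rich - grand

# Interaction: does the effect of graphics differ across speed levels?
graphics_when_fast = cell_means[("fast", "rich")] - cell_means[("fast", "plain")]
graphics_when_slow = cell_means[("slow", "rich")] - cell_means[("slow", "plain")]
interaction = graphics_when_fast - graphics_when_slow

print(effect_speed, effect_graphics, interaction)
```

Here rich graphics help when the site is fast but hurt when it is slow: a nonzero interaction, which is exactly what a single-factor design could never detect.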
Analysis of Non-Experimental Studies
Statistical methods for analyzing correlational data:
Correlations, Scatter Plots, Partial Correlations
Multiple Regression
Introduction to Factor Analysis, Cluster Analysis, and Multidimensional Scaling
Surveys and Questionnaires
The design of surveys and questionnaires
How to frame questions
Kinds of scales: Likert, Semantic Differential, etc.
Analyzing survey data: which items are useful; Item Response Theory
Forming a scale to measure an attribute, e.g., satisfaction
Reliability and validity of scales
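One common index of a scale's reliability is Cronbach's alpha, computable directly from the item responses. The 5-point Likert data below are fabricated for illustration; rows are respondents, columns are items intended to measure one attribute such as satisfaction.

```python
import statistics as stats

# Hypothetical 5-point Likert responses: rows = respondents, cols = 4 items.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
]

k = len(responses[0])   # number of items in the scale
item_vars = [stats.variance([row[i] for row in responses]) for i in range(k)]
total_var = stats.variance([sum(row) for row in responses])

# Cronbach's alpha: internal-consistency reliability of the scale.
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))   # values near 1 indicate a reliable scale
```

High alpha says the items hang together (reliability); it says nothing about whether they measure the intended attribute (validity), which must be argued separately.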
Measuring Individual Differences
How to test for individual differences within users
Kinds of individual-difference variables:
- demographic: such as age, gender, etc.
- situational: motivation, interest, fatigue
- cognitive: memory, cognitive style, etc.
- personality: internal/external locus of control
How to analyze existing data to identify individual differences, and how to design studies to test for them
Frenzied Shopping: Obstacles to purchase, and the perception of download times
- A study on ecommerce conducted by Jared Spool
A critical analysis and illustration of alternative methods of examining this question
Frenzied shopping
Create a realistic scenario: put the person in a concrete situation that gets them motivated
Counted obstacles to purchase
– Advantages of measure:
» concrete: people agree about measure
» valid: good measure of actual ecommerce experience
– Disadvantages of measure:
» not reliable: since situation is not structured
» data analysis problems
Results
• Found more than 200 obstacles to purchase
• The more users tested, the greater the number of problems found
• What's wrong with each test discovering hundreds of problems?
– Client has limited resources; need to focus on solving the important problems (most common / most catastrophic)
More results: Perception of download times
How long will users wait for pages to download?
- Should web developers spend their time making pages faster?
Method: Users were asked to rate the perceived speed of pages after they had completed a task.
Site        Ave. download time   Rated speed
Amazon.com  30 sec               Fastest
About.com   8 sec                Slowest
So what do download times relate to?
• Only correlated with success or failure of shopping
– About.com was judged to be slower than Amazon.com even though About.com was actually much faster
• Result is a foregone conclusion given the task
• Problems with method:
– Memory issues: users were asked for ratings at the end of their experience with all the sites, so ratings suffer from retrospective memory problems
– Ask someone waiting for a page to download whether it is taking too long!
• Timeline Issues
[Timeline: browsing/searching for item -> buying item -> rating speed of site]
Rated speed no longer reflects the browsing, searching part of the experience.
Cannot infer that download speeds are unimportant; can only infer that the perception of download speed is influenced by other aspects of the site.
Perception of download speed and all the ways to study it...
Method              Breadth of question   Depth of question   Realism   Control over situation
Survey              Broad                 Low                 Low       High
Observation         Broad                 Low                 High      Low
User Log Analysis   Broad                 Low                 High      Low
Experiment          Narrow                High                Low       High
Survey: Are people bothered by long download times?
• Sample questions:
– How often do you leave a site without waiting for the first page to download?
» 0-5% of the time
» 5-10% of the time
» 10% and higher
– In your opinion, how important are the following web site characteristics? Rate their relative importance.
» Download speed
» Site content
» Site interactivity
Possibilities: Task based surveys
Observation: Do people seem to like fast (without graphics) sites as compared to slow (with graphics) sites?
• Method: Make two versions of a site, one with sophisticated graphics (slower site) and the other mostly text (faster site). Ask subjects to browse / complete a task on both sites.
• Measurement: Watch participants for signs of frustration or satisfaction with speed of site
Experiment: Relationship of perceived download times to actual download times?
• Method: Find similar sites with differential speeds. Ask people to complete the same tasks on both sites. Give them some interesting and some boring tasks, and less than enough time to complete the task.
• Measurement: Log the clicks of the users as they traverse the sites. How many of the interesting and how many boring tasks did they complete. Relate that to download speed of site.
• Do some users tend to be more frustrated with slower sites?
User Logs: Do people leave sites while waiting for slow pages to download?
• Method: Find similar sites with differential speeds. Analyze the server logs for the sites.
• Measurement: From the logs, count how often users abandon a site while a slow page is loading. Relate abandonment rates to the download speed of each site.
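A minimal sketch of what such a log analysis might look like, assuming a simplified log format invented here (real server logs would require parsing and session reconstruction first):

```python
# Simplified, invented log entries: (user_id, page, load_time_s, completed_load).
log = [
    ("u1", "/home", 1.2, True),
    ("u1", "/cart", 9.8, False),   # user left while the page was loading
    ("u2", "/home", 1.0, True),
    ("u2", "/cart", 2.1, True),
    ("u3", "/home", 8.5, False),
]

slow_threshold = 5.0   # assumed cutoff, in seconds, for a "slow" page load
slow = [e for e in log if e[2] >= slow_threshold]
abandoned_slow = [e for e in slow if not e[3]]

# Abandonment rate on slow page loads.
rate = len(abandoned_slow) / len(slow)
print(rate)
```

The same tally on fast loads gives a baseline, so abandonment can be related to download speed across sites.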
The state of the art
• What usability methods are currently prevalent and accepted in the field
• CUE 2
Comparative Usability Evaluation (CUE) 2, Molich et al., 1999
Purpose: Too much emphasis on one-way mirrors and scan converters
Little knowledge of REAL usability testing procedures
"Who checks the checker?"
Method: Nine teams tested the usability of a web site
Seven professional teams
Two student teams
Four European, five US teams
Test web-site: www.hotmail.com
Problems found in Comparative Usability Evaluation

Found by       # Problems
7 teams        1
6 teams        1
5 teams        4
4 teams        4
3 teams        15
2 teams        49
only 1 team    226 (75%)
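The skew in the table can be verified by tallying it directly; the counts below are copied from the table itself:

```python
# Tally from the CUE-2 table: number of problems found by exactly n teams.
found_by = {7: 1, 6: 1, 5: 4, 4: 4, 3: 15, 2: 49, 1: 226}

total = sum(found_by.values())
exclusive_share = found_by[1] / total

print(total)                      # 300 distinct problems in all
print(round(exclusive_share, 2))  # 0.75 -- three quarters found by only one team
```

Three quarters of all reported problems were found by exactly one team, which is the core of the "who checks the checker?" worry.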
Problem Found by Seven Teams
During the registration process Hotmail users are asked to provide a password hint question. The corresponding text box must be filled.
Most users did not understand the meaning of the password hint question. Some entered their Hotmail password in the Hint Question text box.
Characteristics of the tests

Team   Person hours   # Usability professionals   # Tests
A      136            2                           7
B      123            1                           6
C      84             1                           6
D      16             1                           50
E      130            3                           9
F      50             1                           5
G      107            1                           11
H      45             3                           4
J      218            6                           6
Problems by teams

Team   # Positive findings   # Problems   % Exclusive
A      0                     26           42
B      8                     150          71
C      4                     17           24
D      7                     10           10
E      24                    58           57
F      25                    75           51
G      14                    30           33
H      4                     18           56
J      6                     20           60
What factors predict the number of problems and the number of common (non-exclusive) problems found?
[Chart: correlations (ranging from -0.8 to 0.8) of person hours, # usability professionals, number of tests, problems per person hour, and # positive findings with the total # problems and the % common problems]