
Extracting Usability Information from User Interface Events

DAVID M. HILBERT AND DAVID F. REDMILES

Department of Information and Computer Science, University of California, Irvine, CA <{dhilbert, redmiles}@ics.uci.edu>

Modern window-based user interface systems generate user interface events as natural products of their normal operation. Because such events can be automatically captured and because they indicate user behavior with respect to an application’s user interface, they have long been regarded as a potentially fruitful source of information regarding application usage and usability. However, because user interface events are typically voluminous and rich in detail, automated support is generally required to extract information at a level of abstraction that is useful to investigators interested in analyzing application usage or evaluating usability.

This survey examines computer-aided techniques used by HCI practitioners and researchers to extract usability-related information from user interface events. A framework is presented to help HCI practitioners and researchers categorize and compare the approaches that have been, or might fruitfully be, applied to this problem. Because many of the techniques in the research literature have not been evaluated in practice, this survey provides a conceptual evaluation to help identify some of the relative merits and drawbacks of the various classes of approaches. Ideas for future research in this area are also presented.

This survey addresses the following questions: How might user interface events be used in evaluating usability? How are user interface events related to other forms of usability data? What are the key challenges faced by investigators wishing to exploit this data? What approaches have been brought to bear on this problem, and how do they compare to one another? What are some of the important open research questions in this area?

Categories and Subject Descriptors: H.5.2 [Information Interfaces and Presentation]: User Interfaces—Evaluation/methodology

General Terms: Human factors, Measurement, Experimentation

Additional Key Words and Phrases: Usability testing, User interface event monitoring, Sequential data analysis, Human-computer interaction

December 13, 2000

CONTENTS

1. INTRODUCTION
   1.1 Goals and Method
   1.2 Comparison Framework
   1.3 Organization of the Survey
2. BACKGROUND
   2.1 Definitions
   2.2 Types of Usability Evaluation
   2.3 Types of Usability Data
3. THE NATURE OF UI EVENTS
   3.1 Spectrum of HCI Events
   3.2 Grammatical Issues in Analysis
   3.3 Contextual Issues in Analysis
   3.4 Composition of Events
4. COMPARISON OF APPROACHES
   4.1 Synchronization and Searching
   4.2 Transformation
   4.3 Counts and Summary Statistics
   4.4 Sequence Detection
   4.5 Sequence Comparison
   4.6 Sequence Characterization
   4.7 Visualization
   4.8 Integrated Support
5. DISCUSSION
   5.1 Summary of the State of the Art
   5.2 Some Anticipated Challenges
   5.3 Related Work and Future Directions
6. CONCLUSIONS

1. INTRODUCTION

User interface events (UI events) are generated as natural products of the normal operation of window-based user interface systems such as those provided by the Macintosh Operating System [Lewis and Stone 1999], Microsoft Windows [Petzold 1998], the X Window System [Nye and O’Reilly 1992], and the Java Abstract Window Toolkit [Zukowski and Loukides 1997]. Such events indicate user behavior with respect to the components that make up an application’s user interface (e.g., mouse movements with respect to application windows, keyboard presses with respect to application input fields, mouse clicks with respect to application buttons, menus, and lists). Because such events can be automatically captured and because they indicate user behavior with respect to an application’s user interface, they have long been regarded as a potentially fruitful source of information regarding application usage and usability. However, because user interface events are typically extremely voluminous and rich in detail, automated support is generally required to extract information at a level of abstraction that is useful to investigators interested in analyzing application usage or evaluating usability.

While a number of potentially related techniques have been applied to the problem of analyzing sequential data in other domains, this paper primarily focuses on techniques that have been applied within the domain of HCI. Providing an in-depth treatment of all potentially related techniques would necessarily limit the amount of attention paid to characterizing the approaches that have in fact been brought to bear on the specific problems associated with analyzing HCI events. However, this survey attempts to characterize UI events and analysis techniques in such a way as to make comparison between techniques used in HCI and those used in other domains straightforward.

1.1 Goals and Method

The fundamental goal of this survey is to construct a framework to help HCI practitioners and researchers categorize, compare, and evaluate the relative strengths and limitations of approaches that have been, or might fruitfully be, applied to this problem. Because exhaustive coverage of all existing and potential approaches is impossible, we attempt to identify key characteristics of existing approaches that divide them into more or less natural categories. This allows classes of systems, not just instances, to be compared. The hope is that an illuminating comparison can be conducted at the class level and that classification of new instances into existing classes will prove to be unproblematic.

In preparing this survey, we searched the literature in both academic and professional computing forums for papers describing computer-aided techniques for extracting usability-related information from user interface events. We selected and analyzed an initial set of papers to identify key characteristics that distinguish the approaches applied by various investigators.

We then constructed a two-dimensional matrix with instances of existing approaches listed along one axis and characteristics listed along the other. This led to an initial classification of approaches based on clusters of related attributes. We then iteratively refined the comparison attributes and classification scheme based on further exploration of the literature. The resulting matrix indicates areas in which further research is needed and suggests synergistic combinations of currently isolated capabilities.

Ideally, an empirical evaluation of these approaches in practice would help elucidate more precisely the specific types of usability questions for which each approach is best suited. However, because many of the approaches have never been realized beyond the research prototype stage, little empirical work has been performed to evaluate their relative strengths and limitations. This survey attempts to provide a conceptual evaluation by distinguishing classes of approaches and illuminating their underlying nature. As a result, this survey should be regarded as a guide to understanding the research literature and not as a guide to selecting an already implemented approach for use in practice.

1.2 Comparison Framework

This subsection introduces the high level categories that have emerged as a result of the survey. We present the framework in more detail in Section 4.



• Techniques for synchronization and searching. These techniques allow user interface events to be synchronized and cross-indexed with other sources of data such as video recordings and coded observation logs. This allows searches in one medium to locate supplementary information in others. In some ways, this is the simplest (i.e., most mechanical) technique for exploiting user interface events in usability evaluation. However, it is quite powerful.

• Techniques for transforming event streams. Transformation involves selecting, abstracting, and recoding event streams to facilitate human and automated analysis (including counts, summary statistics, pattern detection, comparison, and characterization). Selection involves separating events and sequences of interest from the “noise”. Abstraction involves relating events to higher-level concepts of interest in analysis. Recoding involves generating new event streams based on the results of selection and abstraction so that selected and abstracted events can be subjected to the same types of manual and automated analysis techniques normally performed on raw event streams.

• Techniques for performing counts and summary statistics. Once user interface events have been captured, there are a number of counts and summary statistics that might be computed to summarize user behavior, for example, feature use counts, error frequencies, use of the help system, and so forth. Although most investigators rely on general-purpose spreadsheets and statistical packages to provide such functionality, some investigators have proposed specific “built-in” functions for calculating and reporting this sort of summary information.

• Techniques for detecting sequences. These techniques allow investigators to identify occurrences of concrete or abstractly defined “target” sequences within “source” sequences of events that may indicate potential usability issues. In some cases, target sequences are abstractly defined and are supplied by the developers of the technique. In other cases, target sequences are more specific to particular applications and are supplied by the users of the technique. Sometimes the purpose is to generate a list of matched source subsequences for further perusal by the investigator. Other times the purpose is to automatically recognize particular sequences that violate expectations about proper user interface usage. Finally, in some cases, the purpose is to perform transformation of the source sequence by abstracting and recoding instances of the target sequence into “abstract” events.

• Techniques for comparing sequences. These techniques help investigators compare “source” sequences against concrete or abstractly defined “target” sequences, indicating the extent to which the sequences match one another. Some techniques attempt to detect divergence between an abstract model representing the target sequence and a source sequence. Others attempt to detect divergence between a concrete target sequence produced, for example, by an expert user, and a source sequence produced by some other user. Some produce diagnostic measures of distance to characterize the correspondence between target and source sequences. Others attempt to perform the best possible alignment of events in the target and source sequences and present the results to investigators in visual form. Still others use points of deviation between the target and input sequences to automatically indicate potential usability issues. In all cases, the purpose is to compare actual sequences of events against some model or trace of “ideal” or expected sequences to identify potential usability issues.

• Techniques for characterizing sequences. These techniques take “source” sequences as input and attempt to construct an abstract model to summarize, or characterize, interesting sequential features of those sequences. Some techniques compute probability matrices in order to produce process models with probabilities associated with transitions. Others construct grammatical models or finite state machines to characterize the grammatical structure of events in the source sequences.

• Visualization techniques. These techniques present the results of transformations and analyses in forms allowing humans to exploit their innate visual analysis capabilities to interpret results. These techniques can be particularly useful in linking results of analysis back to features of the interface.

• Integrated evaluation support. Evaluation environments that facilitate flexible composition of various transformation, analysis, and visualization capabilities provide integrated support. Some environments also provide built-in support for managing domain-specific artifacts such as evaluations, subjects, tasks, data, and results of analysis.

Figure 1 illustrates how the framework might be arranged as a hierarchy. At the highest level, the surveyed techniques are concerned with: synchronization and searching, transformation, analysis, visualization, or integrated support. Transformation can be achieved through selection, abstraction, and recoding. Analysis can be performed using counts and summary statistics, sequence detection, sequence comparison, and sequence characterization.

Figure 1. Comparison framework:
  Synchronization and searching
  Transformation
    Selection
    Abstraction
    Recoding
  Analysis
    Counts and summary statistics
    Sequence detection
    Sequence comparison
    Sequence characterization
  Visualization
  Integrated support

1.3 Organization of the Survey

The next section establishes background and presents definitions of important terms. The subsequent section discusses the nature and characteristics of UI events and provides examples to illustrate some of the difficulties involved in extracting usability-related information from such events. Section 4 presents a comparison of the approaches based on the framework outlined above. For each class of techniques, the following is provided: a brief description, examples, related work where appropriate, and strengths and limitations. Section 5 summarizes the most important points of the survey and outlines some directions for future research. Section 6 presents conclusions.

Many of the authors are not explicit regarding the event representations assumed by their approaches. We assume that UI events are data structures that include attributes indicating an event type (e.g., MOUSE_PRESSED or KEY_PRESSED), an event target (e.g., an ID indicating a particular button or text field in the user interface), a timestamp, and other attributes including various aspects of the state of input devices when the event was generated. However, to raise the level of abstraction in our discussion, we typically represent event streams as sequences of letters where each letter corresponds to a more detailed event structure as described above. When beneficial, and possible, we provide examples using this notation to illustrate how particular approaches operate.
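
To make this representation concrete, the following sketch (ours, not drawn from any of the surveyed systems; the names and the coding scheme are hypothetical, anticipating the print-job example of Section 3.2) shows one way such event structures and the letter notation might be realized in Python:

    from dataclasses import dataclass

    @dataclass
    class UIEvent:
        type: str         # e.g., "MOUSE_PRESSED" or "KEY_PRESSED"
        target: str       # ID of a UI component, e.g., "PrintToolbarButton"
        timestamp: float  # e.g., seconds since the session began
        modifiers: frozenset = frozenset()  # input device state, e.g., {"CTRL"}

    # Hypothetical coding scheme: map detailed events to single letters so
    # that an event stream can be written as a sequence such as "ACDBD".
    CODES = {
        ("MOUSE_PRESSED", "PrintToolbarButton"): "A",
        ("MOUSE_PRESSED", "PrintMenuItem"):      "B",
        ("KEY_PRESSED",   "Ctrl-P"):             "C",
        ("MOUSE_PRESSED", "OkButton"):           "D",
    }

    def encode(events):
        """Recode an event stream as letters, dropping events with no code."""
        return "".join(CODES.get((e.type, e.target), "") for e in events)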

2. BACKGROUND

This section serves three purposes. First, it establishes working definitions of key terms such as “usability,” “usability evaluation,” and “usability data”. Second, it situates observational usability evaluation within the broader context of HCI evaluation approaches, indicating some of the relative strengths and limitations of each. Finally, it isolates user interface events as one of the many types of data commonly collected in observational usability evaluation, indicating some of its strengths and limitations relative to other types. The definitions and frameworks presented here are not new and can be found in standard HCI texts [Preece et al. 1994; Nielsen 1993]. Those well acquainted with usability, usability evaluation, and user interface event data may wish to skip directly to Section 3, where the specific nature of user interface events and the reasons why analysis is complicated are presented.

2.1 Definitions

“Usability” is often thought of as referring to a single attribute of a system or device. However, it is more accurately characterized as referring to a large number of related attributes. Nielsen provides the following definition [Nielsen 1993]:



Usability has multiple components and is traditionally associated with these five usability attributes:

Learnability: The system should be easy to learn so that the user can rapidly start getting some work done with the system.

Efficiency: The system should be efficient to use, so that once the user has learned the system, a high level of productivity is possible.

Memorability: The system should be easy to remember, so that the casual user is able to return to the system after some period of not having used it, without having to learn everything all over again.

Errors: The system should have a low error rate, so that users make few errors during the use of the system, and so that if they do make errors they can easily recover from them. Further, catastrophic errors must not occur.

Satisfaction: The system should be pleasant to use, so that users are subjectively satisfied when using it; they like it.

“Usability evaluation” can be defined as the act of measuring (or identifying potential issues affecting) usability attributes of a system or device with respect to particular users, performing particular tasks, in particular contexts. The reason that users, tasks, and contexts are part of the definition is that the values of usability attributes can vary depending on the background knowledge and experience of users, the tasks for which the system is used, and the context in which it is used.

“Usability data” is any information that is useful in measuring (or identifying potential issues affecting) the usability attributes of a system under evaluation.

It should be noted that the definition of usability cited above makes no mention of the particular purposes for which the system is designed or used. Thus, a system may be perfectly usable and yet not serve the purposes for which it was designed. Furthermore, a system may not serve any useful purpose at all (save for providing subjective satisfaction) and still be regarded as perfectly usable. Herein lies the distinction between usability and utility.

Usability and utility are regarded as subcategories of the more general term “usefulness” [Grudin 1992]. Utility is the question of whether the functionality of a system can, in principle, support the needs of users, while usability is the question of how satisfactorily users can make use of that functionality. Thus, system usefulness depends on both usability and utility.

While this distinction is theoretically clear, usability evaluations often identify both usability and utility issues, thus more properly addressing usefulness. However, to avoid introducing new terminology, this survey simply assumes that usability evaluations and usability data can address questions of utility as well as questions of usability.

2.2 Types of Usability Evaluation

This section contrasts the different types of approaches that have been brought to bear in evaluating usability in HCI.

First, a distinction is commonly drawn between formative and summative evaluation. Formative evaluation primarily seeks to provide feedback to designers to inform and evaluate design decisions. Summative evaluation primarily involves making judgements about “completed” products, to measure improvement over previous releases or to compare competing products. The techniques discussed in this survey can be applied in both sorts of cases.

Another important issue is the more specific motivation for evaluating. There are a number of practical motivations for evaluating. For instance, one may wish to gain insight into the behavior of a system and its users in actual usage situations in order to improve usability (formative) and to validate that usability has been improved (summative). One may also wish to gain further insight into users’ needs, desires, thought processes, and experiences (also formative and summative). One may wish to compare design alternatives, for example, to determine the most efficient interface layout or the best design representation for some set of domain concepts (formative). One may wish to compute usability metrics so that usability goals can be specified quantitatively and progress measured, or so that competing products can be compared (summative). Finally, one may wish to check for conformance to interface style guidelines and/or standards (summative). There are also academic motivations, such as the desire to discover features of human cognition that affect user performance and comprehension with regard to human-computer interfaces (potentially resulting in formative implications).

There are a number of HCI evaluation approaches for achieving these goals that fall into three basic categories: predictive, observational, and participative.

Predictive evaluation usually involves making predictions about usability attributes based on psychological modeling techniques (e.g., the GOMS model [John and Kieras 1996a & 1996b] or the Cognitive Walkthrough [Lewis et al. 1990]), or based on design reviews performed by experts equipped with a knowledge of HCI principles and guidelines and past experience in design and evaluation (e.g., Heuristic Evaluation [Nielsen and Mack 1994]). A key strength of predictive approaches is their ability to produce results based on non-functioning design artifacts without requiring the involvement of actual users.

Observational evaluation involves measuring usability attributes based on observations of users actually interacting with prototypes or fully functioning systems. Observational approaches can range from formal laboratory experiments to less formal field studies. A key strength of observational techniques is that they tend to uncover aspects of actual user behavior and performance that are difficult to capture using other techniques.

Finally, participative evaluation involves collecting information regarding usability attributes directly from users based on their subjective reports. Methods for collecting such data range from questionnaires and interviews to more ethnographically inspired approaches involving joint observer/participant interpretation of behavior in context. A key benefit of participative techniques is their ability to capture aspects of users’ needs, desires, thought processes, and experiences that are difficult to obtain otherwise.

In practice, actual evaluations often combine techniques from multiple approaches. However, the methods for posing questions and for collecting, analyzing, and interpreting data vary from one category to the next. Table 1 provides a high level summary of the relationship between types of evaluation and typical reasons for evaluating. An upper-case ‘X’ indicates a strong relationship. A lower-case ‘x’ indicates a weaker relationship. An empty cell indicates little or no relationship.

The approaches surveyed here are primarily geared toward supporting observational evaluation, although some provide limited support for capturing participative data as well.

2.3 Types of Usability Data

Having situated observational usability evaluation within the broader context of predictive, observational, and participative approaches, user interface events can be isolated as just one of many possible sources of observational data.

Sweeney and colleagues [Sweeney et al. 1993] identify a number of indicators that might be used to measure (or indicate potential issues affecting) usability attributes:

• On-line behavior/performance: e.g., task times, percentage of tasks completed, error rates, duration and frequency of on-line help usage, range of functions used.

• Off-line behavior (non-verbal): e.g., eye movements, facial gestures, duration and frequency of off-line documentation usage, off-line problem solving activity.

Reasons for evaluating                       Predictive    Observational    Participative
                                             evaluation    evaluation       evaluation

Understanding user behavior & performance                  X                x
Understanding user thoughts & experience                   x                X
Comparing design alternatives                x             X                X
Computing usability metrics                  x             X                X
Certifying conformance w/ standards          X

Table 1: Types of evaluation and reasons for evaluating.


• Cognition/understanding: e.g., verbal protocols, answers to comprehension questions, sorting task scores.

• Attitude/opinion: e.g., post-hoc comments, questionnaire and interview comments and ratings.

• Stress/anxiety: e.g., galvanic skin response (GSR), heart rate (ECG), event-related brain potentials (ERPs), electroencephalograms (EEG), ratings of anxiety.

Table 2 summarizes the relationship between these indicators and various techniques for collecting observational data. This table is by no means comprehensive and is used only to indicate the rather specialized yet complementary nature of user interface event data in observational evaluation. UI events provide excellent data for quantitatively characterizing on-line behavior; however, the usefulness of UI events in providing data regarding the remaining indicators has not been demonstrated. Nevertheless, some investigators have used UI events to infer features of user knowledge and understanding [Kay and Thomas 1995; Guzdial et al. 1993].

Many of the surveyed approaches focus on event data exclusively. However, some also combine other sources of data, including video recordings, coded observations, and subjective user reports.

3. THE NATURE OF UI EVENTS

This section discusses the nature and characteristics of HCI events in general and UI events specifically. We discuss the grammatical nature of UI events, including some implications for analysis. We also discuss the importance of contextual information in interpreting the significance of events. Finally, we present a compositional model of UI events to illustrate how these issues manifest themselves in UI event analysis. We use this discussion to ground later discussion and to highlight some of the strengths and limitations of the surveyed approaches.

3.1 Spectrum of HCI Events

Before discussing the specific nature of UI events, this section introduces the broader spectrum of events of interest to researchers and practitioners in HCI. Figure 2, adapted from [Sanderson and Fisher 1994], indicates the durations of different types of HCI events.

The horizontal axis is a log scale indicating event durations in seconds. It ranges from durations of less than one second to durations of years. The durations of UI events fall in the range of 10 milliseconds to approximately one second. The range of possible durations for each “type” of event is between one and two orders of magnitude, and the ranges of different types of events overlap one another.

If we assume that events occur serially, then the possible frequencies of events are constrained by the duration of those events. So, by analogy with the continuous domain (e.g., analog signals), each event type will have a characteristic frequency band associated with it [Sanderson and Fisher 1994]. Event types of shorter duration, for example, UI events, can exhibit much higher frequencies when in sequence, and thus might be referred to as high-frequency band event types. Likewise, event types of longer duration, such as project events, exhibit much lower frequencies when in sequence and thus might be referred to as low-frequency band event types. Evaluations that attempt to address the details of interface design have tended to focus on high-frequency band event types, whereas research on computer supported cooperative work (CSCW) has tended to focus on mid- to low-frequency band event types [Sanderson and Fisher 1994].

Usability indicators              UI event    Audio/video   Post-hoc    User        Survey/question-    Psychophysical
                                  recording   recording     comments    interview   naire/test scores   recording

On-line behavior/performance      X           X
Off-line behavior (non-verbal)                X
Cognition/understanding                       X             X           X           X                   X
Attitude/opinion                                            X           X           X
Stress                                                                                                  X

Table 2: Data collection techniques and usability indicators.



Some important properties of HCI events that emerge from this characterization include the following:

1. Synchronous vs. Asynchronous Events: Sequences composed of high-frequency event types typically occur synchronously. For example, sequences of UI events, gestures, or conversational turns can usually be captured synchronously using a single recording. However, sequences composed of lower frequency event types, such as meeting or project events, may occur asynchronously, aided, for example, by electronic mail, collaborative applications, memos, and letters. This has important implications for the methods used to sample, capture, and analyze data, particularly at lower frequency bands [Sanderson and Fisher 1994].

2. Composition of Events: Events within a given frequency band are often composed of events from higher frequency bands. These same events typically combine to form events at lower frequency bands. Sanderson and Fisher offer this example: a conversational turn is typically composed of vocalizations, gestures, and eye movements, and a sequence of conversational turns may combine to form a topic under discussion within a meeting [Sanderson and Fisher 1994]. This compositional structure is also exhibited within frequency bands. For instance, user interactions with software applications occur and may be analyzed at multiple levels of abstraction, where events at each level are composed of events occurring at lower levels. (See [Hilbert et al. 1997] for an early treatment.) This is discussed further below.

3. Inferences Across Frequency Band Boundaries: Low frequency band events do not directly reveal their composition from higher frequency events. As a result, recording only low frequency events will typically result in information loss. Likewise, high frequency events do not, in themselves, reveal how they combine to form events at lower frequency bands. As a result, either low frequency band events must be recorded in conjunction with high frequency band events or there must be some external model (e.g., a grammar) to describe how high frequency events combine to form lower frequency events. This too is discussed further below.

Figure 2. A spectrum of HCI events, adapted from [Sanderson and Fisher 1994]. (The figure arranges event types along a log scale of event durations in seconds, from .001 to 100M, i.e., from well under one second to years: UI events, eye movements, vocalizations, and gestures/motions at the high frequency end, through conversational turns and topics, to meeting, operation, and project events at the low frequency end. High frequency band events involve mostly synchronous interactions; low frequency band events involve many asynchronous interactions.)

3.2 Grammatical Issues in Analysis

UI events are often grammatical in structure. Grammars have been used in numerous disciplines to characterize the structure of sequential data. The main feature of grammars that makes them useful in this context is their ability to define equivalence classes of patterns in terms of rewrite rules. For example, the following grammar (expressed as a set of rewrite rules) may be used to capture the ways in which a user can trigger a print job in a given application:




PRINT_COMMAND ->
    “MOUSE_PRESSED PrintToolbarButton” or
    (PRINT_DIALOG_ACTIVATED then “MOUSE_PRESSED OkButton”)

PRINT_DIALOG_ACTIVATED ->
    “MOUSE_PRESSED PrintMenuItem” or
    “KEY_PRESSED Ctrl-P”

Rule 1 simply states that the user can trigger a print job by either pressing the print toolbar button (which triggers the job immediately) or by activating the print dialog and then pressing the “OK” button. Rule 2 specifies that the print dialog may be activated by either selecting the print menu item in the “File” menu or by entering a keyboard accelerator, “Ctrl-P”.

Let us assume that the lexical elements used to construct sentences in this language are:

A: indicating “print toolbar button pressed”
B: indicating “print menu item selected”
C: indicating “print accelerator key entered”
D: indicating “print dialog okayed”

Then the following “sentences” constructed from these elements each indicate a series of four consecutive print job activations:

AAAA
CDAAA
ABDBDA
BDCDACD
CDBDCDBD

All occurrences of ‘A’ indicate an immediate print job activation, while all occurrences of ‘BD’ or ‘CD’ indicate a print job activated by using the print dialog and then selecting “OK”.

Notice that each of these sequences contains a different number of lexical elements. Some of them have no lexical elements in common (e.g., AAAA and CDBDCDBD). The lexical elements occupying the first and last positions differ from one sequence to the next. In short, there are a number of salient differences between these sequences at the lexical level. Techniques for automatically detecting, comparing, and characterizing sequences are typically sensitive to such differences. Unless the data is transformed based on the above grammar, the fact that these sequences are semantically equivalent (in the sense that each indicates a series of four consecutive print job activations) will most likely go unrecognized, and even simple summary statistics such as “# of print jobs per session” may be difficult to compute.

Techniques for extracting usability-related information from UI events should take into consideration the grammatical relationships between lower level events and higher level events of interest.
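
As a minimal sketch of such a grammar-driven transformation (our illustration, built on the print-job grammar and letter codes above), instances of the print-job patterns can be recoded into a single abstract event so that a simple count recovers the “# of print jobs per session” statistic:

    import re

    # From the rewrite rules above: "A" denotes an immediate print job, while
    # "BD" and "CD" each denote a job activated via the print dialog.
    PRINT_JOB = re.compile(r"A|BD|CD")

    def recode(stream: str) -> str:
        """Recode a lexical event stream so each print job becomes one 'P'."""
        return PRINT_JOB.sub("P", stream)

    def print_jobs_per_session(stream: str) -> int:
        return recode(stream).count("P")

    # Each example sentence above recodes to "PPPP" (four print jobs), making
    # their semantic equivalence visible at the recoded level.
    assert all(print_jobs_per_session(s) == 4
               for s in ["AAAA", "CDAAA", "ABDBDA", "BDCDACD", "CDBDCDBD"])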

3.3 Contextual Issues in Analysis

Another set of problems arises in attempting to interpret the significance of UI events based only on the information carried within events themselves. To illustrate the problem more generally, consider the analogous problem of interpreting the significance of utterances in transcripts of natural language conversation. Important contextual cues are often spread across multiple utterances or may be missing from the transcript altogether.

Let us assume we have a transcript of a conversation that took place between individuals A and B at a museum. The task is to identify A’s favorite paintings based on utterances in the transcript.

Example 1: “The Persistence of Memory, by Dali, is one of my favorites”.

In this case, everything we need to know in order to determine one of A’s favorite paintings is contained in a single utterance.

Example 2: “The Persistence of Memory, by Dali”.

In this case we need access to prior context. ‘A’ is most likely responding to a question posed by ‘B’. Information carried in the question is critical in interpreting the response. For example, the question could have been: “Which is your least favorite painting?”.

Example 3: “That is one of my favorites”.

In this case, we need the ability to de-reference an indexical. The information carried by the indexical “that” may not be available in any of the utterances in the transcript, but was clearly available to the interlocutors at the time of the utterance. Such contextual information was “there for the asking”, so to speak, and could have been noted had the transcriber been present and chosen to do so at the time of the utterance.

Example 4: “That is another one.”

In this case we would need access to both prior context and the ability to de-reference an indexical.

The following examples illustrate analogous situations in the interpretation of user interface events:

Example 1’: “MOUSE_PRESSED PrintToolbarButton”

This event carries with it enough information to indicate the action the user has performed.

Example 2’: “MOUSE_PRESSED OkButton”

This event does not on its own indicate what action was performed. As in Example 2 above, this event indicates a response to some prior event, for example, a prior “MOUSE_PRESSED PrintMenuItem” event.

Example 3’: “WINDOW_OPENED ErrorDialog”

The information needed to interpret the significance of this event may be available in prior events, but a more direct way to interpret its significance would be to query the dialog for its error message. This is similar to de-referencing an indexical, if we think of the error dialog as figuratively “pointing at” an error message that does not actually appear in the event stream.

Example 4’: “WINDOW_OPENED ErrorDialog”

Assuming the error message is “Invalid Command”, then the information needed to interpret the significance of this event is not only found by de-referencing the indexical (the error message “pointed at” by the dialog) but must be supplemented by information available in prior events. It may also be desirable to query contextual information stored in user interface components to determine the combination of parameters (specified in a dialog, for example) that led to this particular error.

The basic insight here is that sometimes an utterance — or a UI event — does not carry enough information on its own to allow its significance to be properly interpreted. Sometimes critical contextual information is available elsewhere in the transcript, and sometimes that information is not available in the transcript, but was available, “for the asking”, at the time of the utterance, or event, but not afterwards. Therefore, techniques for extracting usability-related information from UI events should take into consideration the fact that context may be spread across multiple events, and that in some cases, important contextual information may need to be explicitly captured during data collection if meaningful interpretation is to be performed.
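
A small sketch of this kind of capture-time enrichment (ours; the toolkit query interface is assumed for illustration and is not part of any surveyed system):

    captured = []

    def on_event(event, ui):
        """Record an event, querying context that is only available 'for
        the asking' at the time the event occurs (hypothetical toolkit)."""
        record = {"type": event.type, "target": event.target,
                  "time": event.timestamp}
        if event.type == "WINDOW_OPENED" and event.target == "ErrorDialog":
            dialog = ui.find(event.target)          # assumed query method
            record["message"] = dialog.get_text()   # de-reference the "indexical"
        captured.append(record)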

3.4 Composition of Events

Finally, user interactions may be analyzed at multiple levels of abstraction. For instance, one may be interested in analyzing low-level mouse movement and timing information, or one may be more interested in higher-level information regarding the steps taken by users in completing tasks, such as placing an order or composing a business letter. Techniques for extracting usability information from UI events should be capable of addressing events at multiple levels of abstraction.

Figure 3 illustrates a multi-level model of events originally presented in [Hilbert et al. 1997]. At the lowest level are physical events, for example, fingers depressing keys or a hand moving a pointing device such as a mouse. Input device events, such as key and mouse interrupts, are generated by hardware in response to physical events. UI events associate input device events with windows and other interface objects on the screen. Events at this level include button presses, list and menu selections, focus events in input fields, and window movements and resizing.

Figure 3. Levels of abstraction in user interactions:
  Goal/Problem-Related (e.g., placing an order)
  Domain/Task-Related (e.g., providing address information)
  Abstract Interaction Level (e.g., providing values in input fields)
  UI Events (e.g., shifts in input focus, key events)
  Input Device Events (e.g., hardware-generated key or mouse interrupts)
  Physical Events (e.g., fingers pressing keys or hand moving mouse)

Abstract interaction events are not directly generated by the user interface system, but may be computed based on UI events and other contextual information such as UI state. Abstract interaction events are indicated by recurring, idiomatic patterns of UI events and indicate higher level concepts such as shifts in users’ editing attention or the act of providing values to an application by manipulating application components.


Consider the example of a user editing an input field at the top of a form-based interface, then pressing tab repeatedly to edit a field at the bottom of the form. In terms of UI events, input focus shifted several times between the first and last fields. In terms of abstract interaction events, the user’s editing attention shifted directly from the top field to the bottom field. Notice that detecting the occurrence of abstract interaction events such as “GOT_EDIT” and “LOST_EDIT” requires the ability to keep track of the last edited component and to notice subsequent editing events in other components.

Another type of abstract interaction event might be associated with the act of providing a new value to an application by manipulating user interface components. In the case of a text field, this would mean that the field had received a number of key events, was no longer receiving key events, and now contains a new value. The patterns of window system events that indicate an abstract interaction event such as “VALUE_PROVIDED” will differ from one type of interface component to another, and from one application to another, but will typically remain fairly stable within a given application. Notice that detecting the occurrence of an abstract interaction event such as “VALUE_PROVIDED” requires the ability to access user interface state such as the component value before and after editing events.
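
The following sketch (ours) suggests how a monitor might synthesize these abstract interaction events from key events plus queried UI state; get_value is an assumed accessor for a component’s current value, and the editing heuristic is deliberately simplified:

    def abstract_events(events, get_value):
        """Synthesize GOT_EDIT/LOST_EDIT and VALUE_PROVIDED events.

        Simplifying assumptions (ours): a component is 'being edited' once
        it receives a key event, and a new value counts as provided when
        editing attention shifts away and the component's value (queried
        UI state) differs from the value observed when editing began.
        A fuller version would also flush the last edited component at
        session end and handle non-textual components via mouse events.
        """
        edited, before = None, None
        for e in events:
            if e.type != "KEY_PRESSED" or e.target == edited:
                continue
            if edited is not None:
                yield ("LOST_EDIT", edited, e.timestamp)
                if get_value(edited) != before:
                    yield ("VALUE_PROVIDED", edited, e.timestamp)
            yield ("GOT_EDIT", e.target, e.timestamp)
            edited, before = e.target, get_value(e.target)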

Domain/task-related and goal/problem-related events are at the highest levels. Unlike other levels, these events indicate progress in the user’s tasks and goals. Inferring these events based on lower level events can be straightforward when the user interface provides explicit support for structuring tasks or indicating goals. For instance, Wizards in Microsoft Word™ [Rubin 1999] lead users through a sequence of steps in a predefined task. The user’s progress can be recognized in terms of simple UI events such as button presses on the “Next” button. In other cases, inferring task and goal related events might require more complicated composite event detection. For instance, the goal of placing an order includes the task of providing address information. The task-related event “ADDRESS_PROVIDED” may be recognized in terms of “VALUE_PROVIDED” abstract interaction events occurring within each of the required fields in the address section of the form. Finally, in some cases, it may be impossible to infer events at these levels based only on lower level events.

Techniques for extracting usability-related information from UI events should be sensitive to the fact that user interactions can occur and be analyzed at multiple levels of abstraction.

4. COMPARISON OF APPROACHES

This section introduces the approaches that have been applied to the problem of extracting usability-related information from UI events. The following subsections discuss the features that distinguish each class of approaches and provide examples of some of the approaches in each class. We mention related work where appropriate and discuss the relative strengths and limitations of each class. Figures 4 through 11 provide pictorial representations of the key features underlying each class. Table 3 (located at the end of this section) presents a categorization and summary of the features belonging to each of the surveyed techniques.

Figure 4. Synchronization and searching. UI events are synchronized with video and coded observations; searches in one medium are used to locate supplementary information in others. Examples: Playback; Microsoft, Apple, and SunSoft labs; DRUM; MacSHAPA; I-Observe.

Figure 5. Transformation. Selection separates events of interest from the rest of the event stream; abstraction generates new events based on existing, or patterns of existing, events; recoding generates a new event stream based on the selected or abstracted events. Examples (selection): Incident Monitoring; CHIME; Hawk; MacSHAPA; User-Identified CIs; EDEM. Examples (abstraction): CHIME; Hawk; MacSHAPA; User-Identified CIs; EDEM.

Figure 6. Counts and summary statistics. Counts and summary statistics are numeric values calculated based on UI events to characterize user behavior (e.g., 36 File->Opens, 12 errors/hour, 25 seconds/task, 96% idle time). Examples: MIKE UIMS; KRI/AG; MacSHAPA; Long Term Monitoring; AUS.

Figure 7. Sequence detection. Sequence detection is the process of identifying occurrences of target sequences, concretely or abstractly defined, in source sequences. Examples (abstractly defined targets): LSA; Fisher’s Cycles; TOP/G; MRP; MacSHAPA; Automatic Chunk Detection; Expectation Agents; EDEM. None of the surveyed techniques use concretely defined target sequences.

Figure 8. Sequence comparison. Sequence comparison is the process of comparing target sequences, concretely or abstractly defined, against source sequences and producing measures of correspondence. Examples (concretely defined targets): ADAM; UsAGE; MacSHAPA. Examples (abstractly defined targets): EMA.

Figure 9. Sequence characterization. Sequence characterization is the process of analyzing source sequences and generating abstract models to characterize the sequential structure of those sequences. Examples: Markov-based; grammar-based.

Figure 10. Visualization. Visualizations present the results of transformations and analyses in graphical form. Examples: MacSHAPA; UsAGE; I-Observe; AUS.

Figure 11. Integrated support. Integrated support includes support for multiple transformation (selection, abstraction, recoding), analysis (counts and summary statistics, detection, comparison, characterization), and visualization capabilities, as well as data management. Examples: Hawk; DRUM; MacSHAPA.

4.1 Synchronization and Searching

4.1.1 Purpose

User interface events provide detailed information regarding user behavior that can be captured, searched, counted, and analyzed using automated tools. However, it is often difficult to infer higher level events of interest from user interface events alone, and sometimes critical contextual information is simply missing from the event stream, making proper interpretation challenging at best.








Synch and search techniques seek to combine the advantages of UI event data with the advantages provided by more semantically rich observational data, such as video recordings and experimenters’ observations.

By synchronizing UI events with other sources of data such as video or coded observations, searches in one medium can be used to locate supplementary information in others. Therefore, if an investigator wishes to review all segments of a video in which a user uses the help system or invokes a particular command, it is not necessary to manually search the entire recording. The investigator can: (a) search through the log of UI events for particular events of interest and use the timestamps associated with those events to automatically cue up the video recording, or (b) search through a log of observations (that were entered by the investigator either during or after the time of the recording) and use the timestamps associated with those observations to cue up the video. Similarly, segments of interest in the video can be used to locate the detailed user interface events associated with those episodes.
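
Mechanically, this reduces to aligning timestamps across media. A sketch (ours) of cueing a video from an event log, assuming both clocks were synchronized at capture time:

    def video_offsets(events, of_interest, video_start):
        """Return video offsets (in seconds) for events matching a
        predicate; video_start is the event-log timestamp at which the
        video recording began."""
        return [e.timestamp - video_start for e in events if of_interest(e)]

    # e.g., cue up every invocation of the help system:
    #   offsets = video_offsets(log, lambda e: e.target == "HelpMenuItem",
    #                           video_start)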

4.1.2 Examples

Playback is an early example of a system employing synch and search capabilities [Neal and Simmons 1983]. Playback captures UI events automatically and synchronizes them with coded observations and comments that are entered by experimenters either during or after the evaluation session. Instead of using video, Playback allows recorded events to be played back through the application interface to re-trace the user’s actions. The evaluator can step through the playback based on events or coded observations as if using an interactive debugger. There are a handful of simple built-in analyses to automatically calculate counts and summary statistics. This technique captures less information than video-based techniques since video can also be used to record off-line behavior such as facial gestures, off-line documentation use, and verbalizations. Also, there can be problems associated with replaying user sessions accurately in applications where behavior is affected by events outside of user interactions. For instance, the behavior of some applications can vary depending on the state of networks and persistent data stores.

DRUM, the “Diagnostic Recorder for Usability Measurement”, is an integrated evaluation environment that supports video-based usability evaluation [Macleod et al. 1993]. DRUM was developed at the National Physical Laboratory as part of the ESPRIT Metrics for Usability Standards in Computing Project (MUSiC). DRUM features a module for recording and synchronizing events, observations, and video, a module for defining and managing observation coding schemes, a module for calculating pre-defined counts and summary statistics, and a module for managing and manipulating evaluation-related information regarding subjects, tasks, recording plans, logs, videos, and results of analysis.




Usability specialists at Microsoft, Apple, and SunSoft all report the use of tools that provide synch and search capabilities [Weiler et al. 1993; Hoiem and Sullivan 1994]. The tools used at Microsoft include a tool for logging observations, a tool for tracking UI events, and a tool for synchronizing and reviewing data from the multiple sources. The tools used at Apple and SunSoft are essentially similar. All tools support some level of event selection as part of the capture process. Apple’s selection appears to be user-definable, while Microsoft and SunSoft’s selection appears to be programmed into the capture tools. Scripts and general-purpose analysis programs, such as Microsoft Excel™ [Dodge and Stinson 1999], are used to perform counts and summary statistics after capture. All tools support video annotations to produce “highlights” videos. Microsoft’s tools provide an API to allow applications to report application-specific events or events not readily available in the UI event stream.

I-Observe, the “Interface OBServation, Evaluation, Recording, and Visualization Environment”, also provides synch and search capabilities [Badre et al. 1995]. I-Observe is a set of loosely integrated tools for collecting, selecting, analyzing, and visualizing event data that has been synchronized with video recordings. Investigators can perform searches by specifying predicates over the attributes contained within a single event record, locate patterns of events by stringing together multiple search specifications into regular expressions, and then use the intervals matched by such regular expressions (identified by begin and end events) to select data for visualization or to display the corresponding segments of the video recording.
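
The flavor of this two-step search might be sketched as follows (our Python approximation, not I-Observe’s actual notation): per-event predicates recode the log as symbols, and a regular expression over the symbols identifies intervals whose begin and end events can then cue the video:

    import re

    def to_symbols(events, predicates):
        """Recode each event as the symbol of the first matching
        predicate, or '-' if no predicate matches."""
        def symbol(e):
            return next((s for s, p in predicates.items() if p(e)), "-")
        return "".join(symbol(e) for e in events)

    def intervals(events, predicates, pattern):
        """Yield (begin, end) indices of the events bounding each match."""
        coded = to_symbols(events, predicates)
        for m in re.finditer(pattern, coded):
            yield m.start(), m.end() - 1

    # e.g., an error dialog (E) followed within five events by help use (H):
    #   intervals(log, {"E": is_error_dialog, "H": is_help_use},
    #             r"E.{0,5}?H")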

4.1.3 Strengths
The strengths of these techniques lie in their ability to integrate data sources with complementary strengths and weaknesses and to allow searches in one medium to locate related information in the others. UI events provide detailed performance information that can be searched, counted, and analyzed using automated techniques. However, UI events often leave out higher-level contextual information that can more easily be captured using video recordings and coded observations.

4.1.4 Limitations
Techniques relying on synchronizing UI events with video and coded observations typically require the use of video recording equipment and the presence of observers. These can make subjects self-conscious, affect performance, and may not be practical or permitted in certain circumstances. Furthermore, video-based evaluations tend to produce massive amounts of data that can be expensive to analyze: the ratio of time spent in analysis to the duration of the sessions being analyzed has been known to reach 10:1 [Sanderson and Fisher 1994; Nielsen 1993; Sweeny 1993]. These are all serious limiting factors on evaluation size, scope, location, and duration.

Some researchers have begun investigating techniques for performing collaborative remote usability evaluations using video-conferencing software and application-sharing technologies. Such techniques may help lift some of the limitations on evaluation location. However, more work must be done in the area of automating data collection and analysis if current restrictions on evaluation size, scope, and duration are to be addressed.

4.2 Transformation

4.2.1 Purpose
These techniques combine selection, abstraction, and recoding to transform event streams for various purposes, such as facilitating human pattern detection, comparison, and characterization, or to prepare data for input into automatic techniques for performing these functions.

Selection operates by subtracting information from event streams, allowing events and sequences of interest to emerge from the "noise". Selection involves specifying constraints on event attributes to indicate events of interest to be separated from other events, or to indicate events to be separated from events of interest. For instance, one may elect to disregard all events associated with mouse movements in order to focus analysis on higher-level actions such as button presses and menu selections. This can be accomplished by "positively" selecting button press and menu selection events or by "negatively" selecting, or filtering, mouse movement events.


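To make the distinction concrete, the following minimal sketch illustrates positive and negative selection over a stream of events; the event representation and the type names are invented for illustration and do not correspond to any particular toolkit.

```python
# Hypothetical event records; real UI event streams are far richer.
events = [
    {"type": "MOUSE_MOVED", "x": 10, "y": 20},
    {"type": "BUTTON_PRESSED", "target": "Print"},
    {"type": "MOUSE_MOVED", "x": 12, "y": 25},
    {"type": "MENU_SELECTED", "target": "File/Save"},
]

def select_positive(stream, wanted):
    """Keep only events whose type is explicitly of interest."""
    return [e for e in stream if e["type"] in wanted]

def select_negative(stream, unwanted):
    """Filter out events whose type is explicitly uninteresting."""
    return [e for e in stream if e["type"] not in unwanted]

# Both calls yield only the button press and the menu selection.
print(select_positive(events, {"BUTTON_PRESSED", "MENU_SELECTED"}))
print(select_negative(events, {"MOUSE_MOVED"}))
```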

Abstraction operates by synthesizing new events based on information in the event stream, supplemented (in some cases) by contextual information outside of the event stream. For instance, a pattern of events indicating that an input field had been edited, that a new value had been provided, and that the user's editing attention had since shifted to another component might indicate the abstract event "VALUE_PROVIDED", which is not signified by any single event in the event stream. Furthermore, the same abstract event might be indicated by different events in different UI components; for example, mouse events typically indicate editing in non-textual components, while keyboard events typically indicate editing in textual components. One may also wish to synthesize events to relate the use of particular UI components to higher-level concepts, such as the use of the menus, toolbars, or dialogs to which those components belong.

Recoding involves producing new event streams based on the results of selection and abstraction. This allows the same manual or automated analysis techniques normally applied to raw event streams to be applied to selected and abstracted events, potentially leading to different results. Consider the example presented in Section 3: if the sequences representing four sequential print job activations were embedded within the context of a larger sequence, they might not be identified as similar subsequences, particularly by automated techniques such as those presented below. However, after performing abstraction based on the grammar in that example, each of these sequences could be recoded as "AAAA", making them much more likely to be identified as common subsequences by automatic techniques.
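A rough sketch of how abstraction and recoding might be combined is given below; the event names, the single pattern rule, and the one-letter codes are all assumptions made for illustration, not features of any of the systems surveyed.

```python
def abstract(stream):
    """Synthesize a VALUE_PROVIDED abstract event wherever an edit of a
    component is followed by a shift of editing attention elsewhere."""
    out = []
    for prev, curr in zip(stream, stream[1:]):
        if prev["type"] == "EDIT" and curr["type"] == "FOCUS_SHIFT":
            out.append({"type": "VALUE_PROVIDED", "target": prev["target"]})
    return out

def recode(stream, codes):
    """Recode an event stream as a string so that string-oriented pattern
    detection techniques (e.g., regular expressions) can be applied."""
    return "".join(codes.get(e["type"], ".") for e in stream)

stream = [
    {"type": "EDIT", "target": "name"},
    {"type": "FOCUS_SHIFT", "target": "address"},
    {"type": "EDIT", "target": "address"},
    {"type": "FOCUS_SHIFT", "target": "ok"},
]
print(recode(abstract(stream), {"VALUE_PROVIDED": "A"}))  # -> "AA"
```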

4.2.2 Examples
Chen presents an approach to user interface event monitoring that selects events based on the notion of "incidents" [Chen 1990]. Incidents are defined as only those events that actually trigger some response from the application, and not just the user interface system. This technique was demonstrated by modifying the X Toolkit Intrinsics [Nye and O'Reilly 1992] to report events that trigger callback procedures registered by applications. This allows events not handled by the application to be selected "out" automatically. Investigators may further constrain event reporting by selecting specific incidents of interest to report. Widgets in the user interface toolkit were modified to provide a query procedure that returns limited contextual information when an event associated with a given widget triggers a callback.

Hartson and colleagues report an approach to remote collaborative usability evaluation that relies on users to select events [Hartson et al. 1996]. Users identify potential usability problems that arise during the course of interacting with an application and report information regarding these "critical incidents" by pressing a "report" button that is supplied in the interface. The approach uses e-mail to report digitized video of the events leading up to and following critical incidents, along with contextual information provided by users. In this case, selection is achieved by only reporting the n events leading up to, and m events following, user-identified critical incidents (where n and m are parameters that can be set in advance by investigators).1
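This selection scheme lends itself to a simple ring-buffer implementation. The sketch below illustrates the general idea only, assuming a flat list of events and a sentinel "REPORT" marker where the user pressed the report button; it is not drawn from Hartson and colleagues' actual tool.

```python
from collections import deque

def incident_windows(stream, n=20, m=10):
    """Return (before, after) windows of up to n preceding and m following
    events around each user-reported critical incident."""
    before = deque(maxlen=n)   # ring buffer of the n most recent events
    windows = []
    for event in stream:
        if event == "REPORT":
            windows.append((list(before), []))
        else:
            for _, after in windows:
                if len(after) < m:
                    after.append(event)
            before.append(event)
    return windows

# Events a..e precede the report; f and g follow it.
print(incident_windows(list("abcde") + ["REPORT"] + list("fg"), n=3, m=2))
# -> [(['c', 'd', 'e'], ['f', 'g'])]
```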

CHIME, the "Computer-Human Interaction Monitoring Engine", is similar, in some ways, to Chen's approach [Badre and Santos 1991a]. CHIME allows investigators to select ahead of time which events to report and which events to filter. An important difference is that CHIME also supports a limited notion of abstraction that allows a level of indirection to be built on top of the window system. The basic idea is that abstract "interaction units" (IUs) are defined that translate window system events into platform-independent events upon which further monitoring infrastructure is built. The events to be recorded are then specified in terms of these platform-independent IUs.2

1. This approach actually focuses on capturing video and not events. However, the ideas embodied by the approach can equally well be applied to the problem of selecting events of interest surrounding user-identified critical incidents.


Hawk is an environment for selecting, abstracting, and recoding events in log files [Guzdial 1993]. Hawk's main functionality is provided by a variant of the AWK programming language [Aho et al. 1988], and an environment for managing data files is provided by HyperCard™ [Goodman 1998]. Events appear in the event log one per line, and AWK pattern-action pairs are used to specify what is to be matched in each line of input (the pattern) and what is to be printed as output (the action). This allows fairly flexible selection, abstraction, and recoding to be performed.

MacSHAPA, which is discussed further below, supports selection and recoding via a database query and manipulation language that allows investigators to select event records based on attributes and to define new streams based on the results of queries [Sanderson et al. 1994]. Investigators can also perform abstraction by manually entering new records representing abstract events and visually aligning them with existing events in a spreadsheet representation (see Figure 13).

EDEM, an "Expectation-Driven Event Monitoring" system, captures UI events and supports automated selection, abstraction, and recoding [Hilbert and Redmiles 1998a]. Selection is achieved in two ways: investigators specify ahead of time which events to report, and users can also cause events to be selected via a critical incident reporting mechanism akin to that reported in [Hartson et al. 1996]. One important difference is that EDEM also allows investigators to define automated agents to help detect "critical incidents", thereby lifting some of the burden from users, who often do not know when their actions are violating expectations about proper usage [Smilowitz et al. 1994]. Furthermore, investigators can use EDEM to define abstract events in terms of patterns of existing events. When EDEM detects a pattern of events corresponding to a pre-defined abstract event, it generates an abstract event and inserts it into the event stream.

Investigators can configure EDEM to perform further hierarchical event abstraction by simply defining higher-level abstract events in terms of lower-level abstract events. All of this is done in context, so that contextual information can be used in selection and abstraction. EDEM also calculates a number of simple counts and summary statistics.

4.2.3 Strengths
The main strength of these approaches lies in their explicit support for selection, abstraction, and recoding, which are essential steps in preparing UI events for most types of analysis (as illustrated in Section 3). Chen and CHIME address issues of selection prior to reporting. Hartson and colleagues add to this a technique for accessing contextual information via the user. EDEM adds to these automatic detection of critical incidents and abstraction that is performed in context. All of these techniques might potentially be used to collect UI events remotely.

Hawk and MacSHAPA, on the other hand, do not address event collection but provide powerful and flexible environments for transforming and analyzing already-captured UI events.

4.2.4 Limitations
The techniques that select, abstract, and recode events while collecting them run the risk of throwing away data that might have been useful in analysis. Techniques that rely exclusively on users to select events are even more likely to drop useful information. With the exception of EDEM, the approaches that support flexible abstraction and recoding do so only after events have been captured, meaning that contextual information that can be critical in abstraction is not available.

4.3 Counts and Summary Statistics

4.3.1 Purpose
As noted above, one of the key benefits of UI events is how readily details regarding on-line behavior can be captured and manipulated using automated techniques. Most investigators rely on general-purpose analysis programs such as spreadsheets and statistical packages to compute counts and summary statistics based on collected data (e.g., feature use counts or error frequencies). However, some investigators have proposed systems with specific built-in facilities for performing and reporting such calculations. This section provides examples of some of the systems boasting specialized facilities for calculating usability-related metrics.

2. The paper also alludes to the possibility of allowing higher-level IUs to be hierarchically defined in terms of lower-level IUs (using a context-free grammar and pre-conditions) to provide a richer notion of abstraction. However, this appears never to have been implemented [Badre and Santos 1991a; 1991b].



4.3.2 Examples
The MIKE user interface management system (UIMS) is an early example of a system offering built-in facilities for calculating and reporting metrics [Olsen and Halversen 1988]. Because MIKE controls all aspects of input and output activity, and because it has an abstract description that links user interface components to the application commands they trigger, MIKE is in a uniquely good position to monitor UI events and associate them with responsible interface components and application commands. Example metrics include:

• Performance time: How much time is spent completing tasks such as specifying arguments for commands?

• Mouse travel: Is the sum of the distances between mouse clicks unnecessarily high?

• Command frequency: Which commands are used most frequently or not at all?

• Command pair frequency: Which commands are used together frequently? Can they be combined or placed closer to one another?

• Cancel and undo: Which dialogs are frequently canceled? Which commands are frequently undone?

• Physical device swapping: Is the user switching back and forth between keyboard and mouse unnecessarily? Which features are associated with high physical swapping counts?

MIKE logs all UI events and associates them with the interface components triggering them and the application commands triggered by them. Event logs are written to files that are later read by a metric collection and report generation program. This program uses the abstract description of the interface to interpret the log and to generate human-readable reports summarizing the metrics.
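Several of these metrics reduce to counting over a command log once events have been associated with commands. The sketch below illustrates that reduction on an invented (command, timestamp) log; it is not MIKE's actual report generator.

```python
from collections import Counter

log = [("Open", 1.0), ("Print", 3.5), ("Cancel", 4.0),
       ("Print", 9.0), ("Undo", 9.5)]

commands = [cmd for cmd, _ in log]
frequency = Counter(commands)                   # command frequency
pairs = Counter(zip(commands, commands[1:]))    # command pair frequency
cancel_undo = frequency["Cancel"] + frequency["Undo"]

print(frequency.most_common())  # which commands are used most, or not at all
print(pairs.most_common(3))     # candidates for combining or co-locating
print(cancel_undo)              # frequently canceled or undone work
```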

MacSHAPA also includes numerous built-in features to support computation and reporting of simple counts and summary statistics, including, for instance, the frequencies and durations of any selection of events specified by the user [Sanderson et al. 1994]. MacSHAPA's other, more powerful analysis features are described in the following sections.

Automatic Usability Software (AUS) is reported to provide a number of automatically computed metrics such as help system use, use of cancel and undo, mouse travel, and mouse clicks per window [Chang and Dillon 1997].

Finally, ErgoLight Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) [ErgoLight Usability Software 1998] provide a number of built-in counts and summary statistics to characterize user interactions captured by the tools, either locally in the usability lab or remotely over the Internet.

4.3.3 Related
A number of commercial tools, such as Aqueduct AppScope™ [Aqueduct Software 1998] and Full Circle Talkback™ [Full Circle Software 1998], have recently become available for capturing data about application crashes over the Internet. These tools capture metrics about the operating system and application at the time crashes occur, and Talkback allows users to provide feedback regarding the actions leading up to crashes. These tools also provide APIs that application developers can use to report events of interest, such as application feature usage. Captured data is sent via e-mail to developers' computers, where it is stored in a database and plotted using standard database plotting facilities.

4.3.4 Strengths
Given the number of possible metrics, counts, and summary statistics that might be computed and that might be useful in usability evaluation, it is useful that some systems provide built-in facilities to perform and report such calculations automatically.

4.3.5 Limitations
With the exception of MacSHAPA, the systems described above do not provide facilities that allow evaluators to modify built-in counts, statistics, and reports, or to add new ones of their own. Also, the computation of useful metrics is greatly simplified when the system computing the metrics has a model linking user interface components to application commands, as in the case of MIKE, or when the application code is manually instrumented to report the events to be analyzed, as in the case of AppScope and Talkback. AUS does not address application-specific features and is thus limited in its ability to relate metric results to application features.



4.4 Sequence Detection

4.4.1 Purpose
These techniques detect occurrences of concrete or abstractly defined "target" sequences within "source" sequences.1 In some cases, target sequences are abstractly defined and are supplied by the developers of the technique (e.g., Fisher's cycles, lag sequential analysis, maximal repeating pattern analysis, and automatic chunk detection). In other cases, target sequences are more specific to a particular application and are supplied by the investigators using the technique (e.g., TOP/G, Expectation Agents, and EDEM). Sometimes the purpose is to generate a list of matched source subsequences for perusal by the investigator (e.g., Fisher's cycles, maximal repeating pattern analysis, and automatic chunk detection). Other times the purpose is to recognize sequences of UI events that violate particular expectations about proper UI usage (e.g., TOP/G, Expectation Agents, and EDEM). Finally, in some cases the purpose may be to perform abstraction and recoding of the source sequence based on matches of the target sequence (e.g., EDEM).

4.4.2 Examples
TOP/G, the "Task-Oriented Parser/Generator", parses sequences of commands from a command-line simulation and attempts to infer the higher-level tasks that are being performed [Hoppe 1988]. Users of the technique model expected tasks in a notation based on Payne and Green's task-action grammars [Payne and Green 1986] and store this information as rewrite or production rules in a Prolog database. Investigators can define composite tasks hierarchically in terms of elementary tasks, which they must further decompose into "triggering rules" that map keystroke-level events into elementary tasks. Investigators may also define rules to recognize "suboptimal" user behaviors that might be abbreviated by simpler compositions of commands. A later version attempted to use information about the side effects of commands in the environment to recognize when a longer sequence of commands might be replaced by a shorter sequence. In both cases, TOP/G's generator functionality could be used to generate the shorter, or more "optimal", command sequence.

Researchers involved in exploratory sequential data analysis (ESDA) have applied a number of techniques for detecting abstractly defined patterns in sequential data. For an in-depth treatment see [Sanderson and Fisher 1994]. These techniques can be subdivided into two basic categories:

Techniques sensitive to sequentially separated patterns of events, for example:

• Fisher’s cycles

• Lag sequential analysis (LSA)

Techniques sensitive to strict transitions between events, for example:

• Maximal Repeating Pattern Analysis (MRP)

• Log linear analysis

• Time-series analysis

Fisher's cycles allow investigators to specify beginning and ending events of interest that are then used to automatically identify all occurrences of subsequences beginning and ending with those events (excluding those with further internal occurrences of those events) [Fisher 1991]. For example, assume an investigator is faced with a source sequence of events encoded using the letters of the alphabet, such as: ABACDACDBADBCACCCD. Suppose further that the investigator wishes to find out what happened between all occurrences of 'A' (as a starting point) and 'D' (as an ending point). Fisher's cycles produces the following analysis:

Source sequence: ABACDACDBADBCACCCD
Begin event: A
End event: D
Output:

Cycle #   Frequency   Cycle
1         2           ACD
2         1           AD
3         1           ACCCD
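A minimal sketch of the underlying computation, written against the string-encoded stream used in the example, is shown below; restarting on internal begin events is what excludes cycles with further internal occurrences of the begin or end events.

```python
from collections import Counter

def fishers_cycles(source, begin, end):
    """Report each begin..end subsequence containing no internal
    occurrences of the begin or end events."""
    cycles, start = [], None
    for i, event in enumerate(source):
        if event == begin:
            start = i                      # restart on an internal begin
        elif event == end and start is not None:
            cycles.append(source[start:i + 1])
            start = None
    return Counter(cycles)

print(fishers_cycles("ABACDACDBADBCACCCD", "A", "D"))
# -> Counter({'ACD': 2, 'AD': 1, 'ACCCD': 1})
```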

1. The following sentences include names of approaches in parentheses to indicate how the examples explained in the next subsection fall into finer subcategories of the broader "sequence detection" category.


The investigator could then note that there were clearly no occurrences of B in any A->D cycle. Furthermore, the investigator might use a grammatical technique to recode repetitions of the same event into a single event, thereby revealing that the last cycle (ACCCD) is essentially equivalent to the first two (ACD). This is one way of discovering similar subsequences in "noisy" data.

Lag sequential analysis (LSA) is another popular technique that identifies the frequency with which two events occur at various "removes" from one another [Sackett 1978; Allison and Liker 1982; Faraone and Dorfman 1987; Sanderson and Fisher 1994]. LSA takes one event as a 'key' and another event as a 'target' and reports how often the target event occurs at various intervals before and after the key event. If 'A' were the key and 'D' the target in the previous example, LSA would produce the following analysis:

Source sequence: ABACDACDBADBCACCCD
Key event: A
Target event: D
Lag(s): -4 through +4
Output:

Lag          -4   -3   -2   -1   +1   +2   +3   +4
Occurrences   0    1    1    1    1    2    0    1

The count of 2 at Lag = +2 corresponds to the ACD cycles identified by Fisher's cycles above. Assuming the same recoding operation performed above to collapse multiple occurrences of the same event into a single event, this count would increase to 3. The purpose of LSA is to identify correlations between events (that might be causally related to one another) that might otherwise have been missed by techniques more sensitive to the strict transitions between events.
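The core of LSA is a straightforward counting procedure, sketched below for the string-encoded stream used above; real LSA packages add significance testing on top of such counts.

```python
def lag_sequential(source, key, target, max_lag=4):
    """Count how often target occurs at each lag (in events) relative to
    occurrences of key; negative lags look before, positive lags after."""
    counts = {lag: 0 for lag in range(-max_lag, max_lag + 1) if lag != 0}
    for i, event in enumerate(source):
        if event != key:
            continue
        for lag in counts:
            j = i + lag
            if 0 <= j < len(source) and source[j] == target:
                counts[lag] += 1
    return counts

counts = lag_sequential("ABACDACDBADBCACCCD", "A", "D")
print(counts[2])  # -> 2, matching the ACD cycles noted above
```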

An example of a technique that is more sensitive to strict transitions is the Maximal Repeating Pattern (MRP) analysis technique [Siochi and Hix 1991]. MRP operates under the assumption that repetition of user actions can be an important indicator of potential usability problems. MRP identifies all patterns occurring repeatedly in the input sequence and produces a listing of those patterns sorted by length first, followed by frequency of occurrence in the source sequence. MRP applied to the sequence above would produce the following analysis:

Source sequence: ABACDACDBADBCACCCD
Output:

Pattern #   Frequency   Pattern
1           2           ACD
2           3           AC
3           3           CD
4           2           BA
5           2           DB

MRP is similar in spirit to Fisher's cycles and LSA; however, the investigator does not specify particular events of interest. Notice that the ACCCD subsequence identified in the previous examples is not identified by MRP, since it only occurs once in the source sequence.
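The sketch below captures the spirit of MRP on a string-encoded stream: enumerate the substrings that repeat, counting occurrences non-overlappingly, and sort by length and then frequency. The published algorithm differs in its details; this is an approximation for illustration only.

```python
def maximal_repeating_patterns(source, min_len=2):
    """List repeated patterns, longest first, then most frequent first."""
    def occurrences(pattern):
        count = start = 0
        while (i := source.find(pattern, start)) >= 0:
            count += 1
            start = i + len(pattern)   # count non-overlapping matches only
        return count

    candidates = {source[i:i + n]
                  for n in range(min_len, len(source))
                  for i in range(len(source) - n + 1)}
    repeated = [(p, occurrences(p)) for p in sorted(candidates)]
    repeated = [(p, c) for p, c in repeated if c > 1]
    return sorted(repeated, key=lambda pc: (-len(pc[0]), -pc[1]))

print(maximal_repeating_patterns("ABACDACDBADBCACCCD"))
# -> [('ACD', 2), ('AC', 3), ('CD', 3), ('BA', 2), ('DB', 2)]
```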

Markov-based techniques can be used to compute the transition probabilities from one or more events to the next event. Statistical tests can be applied to determine whether the probabilities of these transitions are greater than would be expected by chance [Sanderson and Fisher 1994]. Other related techniques include log linear analysis [Gottman and Roy 1990] and formal time-series analysis [Box and Jenkins 1976]. All of these techniques attempt to find strict sequential patterns in the data that occur more frequently than would be expected by chance.

Santos and colleagues have proposed an algorithm for detecting users' "mental chunks" based on pauses and flurries of activity in human-computer interaction logs [Santos et al. 1994]. The algorithm is based on an extension of Fitts' law [Fitts 1964] that predicts the expected time between events generated by a user who is actively executing plans, as opposed to engaging in problem solving and planning activity.



For each event transition in the log, if the pause in interaction cannot be justified by the predictive model, then the lag is assumed to signify a transition from "plan execution phase" to "plan acquisition phase" [Santos et al. 1994]. The approach uses the results of the algorithm to segment the source sequence into plan execution chunks and chunks most probably associated with problem solving and planning activity. The assumption is that expert users tend to have longer, more regular execution chunks than novice users, so user expertise might be inferred on the basis of the results of this chunking algorithm.
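A much-simplified sketch of pause-based segmentation appears below. Where Santos and colleagues derive the justified pause from a Fitts'-law-style predictive model, a fixed threshold stands in here purely for illustration.

```python
def chunk(events, threshold=2.0):
    """Split (name, timestamp) events into execution chunks wherever the
    pause between consecutive events exceeds the justified threshold."""
    chunks, current, last_time = [], [], None
    for name, time in events:
        if last_time is not None and time - last_time > threshold:
            chunks.append(current)   # long pause: plan acquisition boundary
            current = []
        current.append(name)
        last_time = time
    if current:
        chunks.append(current)
    return chunks

log = [("open", 0.0), ("edit", 0.8), ("save", 1.5),
       ("print", 9.0), ("close", 9.6)]
print(chunk(log))  # -> [['open', 'edit', 'save'], ['print', 'close']]
```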

Finally, work done by Redmiles and colleagues on "Expectation Agents" (EAs) [Girgensohn et al. 1994] and "Expectation-Driven Event Monitoring" (EDEM) [Hilbert and Redmiles 1998b] relies on sequence detection techniques to trigger various actions in response to pre-specified patterns of events. These approaches employ an event pattern language to allow investigators to specify composite events to be detected. When a pattern of interest is detected, contextual information may also be queried before action is taken. Possible actions include notifying the user and/or investigator that a particular pattern was detected, collecting user feedback, and reporting UI state and events leading up to detected patterns. Investigators may also configure EDEM to abstract and recode event streams to indicate the occurrence of abstract events associated with pre-specified event patterns.

4.4.3 Related
EBBA is a debugging system that attempts to match the behavior of a distributed program against partial models of expected behavior [Bates 1995]. EBBA is similar to EDEM, particularly in its ability to abstract and recode the event stream based on hierarchically defined abstract events.1

Amadeus [Selby et al. 1991] and YEAST [Krishnamurthy and Rosenblum 1995] are event-action systems used to detect and take actions based on patterns of events in software processes. These systems are also similar in spirit to Expectation Agents and EDEM. Techniques that have been used to specify the behavior of concurrent systems, such as the Task Sequencing Language (TSL) as described in [Rosenblum 1991], are also related. The event specification notations used in these approaches might be applicable to the problem of specifying and detecting patterns of UI events.

4.4.4 Strengths
The strength of these approaches lies in their ability to help investigators detect patterns of interest in events, and not just perform analysis on isolated events. The techniques associated with ESDA help investigators detect patterns that may not have been anticipated. Languages for detecting patterns of interest in UI events, based on extended regular expressions [Sanderson and Fisher 1994] or on more grammatically inspired techniques [Hilbert and Redmiles 1998b], can be used to locate patterns of interest and to transform event streams by recoding patterns of events into abstract events.

4.4.5 Limitations
The ESDA techniques described above tend to produce large amounts of output that can be difficult to interpret and that frequently do not lead to the identification of usability problems [Cuomo 1994]. The non-ESDA techniques require investigators to know how to specify the patterns for which they are searching and to define them (sometimes painstakingly) before analysis can be performed.

4.5 Sequence Comparison

4.5.1 Purpose
These techniques compare "source" sequences against concrete or abstractly defined "target" sequences, indicating partial matches between the two.2 Some techniques attempt to detect divergence between an abstract model of the target sequence and the source sequence (e.g., EMA and USINE).

1. EBBA is sometimes characterized as a sequence comparison system, since the information carried in a partially matched model can be used to help the investigator better understand where the program's behavior has gone wrong (or where a model is inaccurate). However, EBBA does not directly indicate that partial matches have occurred or provide any diagnostic measures of correspondence. Rather, the user must notice that a full match has failed and then manually inspect the state of the pattern-matching mechanism to see which events were matched and which were not.

2. The following sentences include names of approaches in parentheses to indicate how the examples explained in the next subsection fall into finer subcategories of the broader "sequence comparison" category.


Others attempt to detect divergence between a concrete target sequence produced, for example, by an expert user and a source sequence produced by some other user (e.g., ADAM and UsAGE). Some produce diagnostic measures of distance to characterize the correspondence between target and source sequences (e.g., ADAM). Others attempt to perform the best possible alignment of events in target and source sequences and present the results visually (e.g., UsAGE and MacSHAPA). Still others use points of deviation between the target and input sequences to automatically indicate potential "critical incidents" (e.g., EMA and USINE). In all cases, the purpose is to compare actual usage against some model or trace of "ideal" or expected usage in order to identify potential usability problems.1

4.5.2 Examples
ADAM, an "Advanced Distributed Associative Memory", compares fixed-length source sequences against a set of target sequences that were used to "train" the memory [Finlay and Harrison 1990]. Investigators train ADAM by helping it associate example target sequences with "classes" of event patterns. After training, when a source sequence is input, the associative memory identifies the class that most closely matches the source sequence and outputs two diagnostic measures: a "confidence" measure that is 100% only when the source sequence is identical to one of the trained target sequences, and a "distance" measure indicating how far the source pattern is from the next "closest" class. Investigators then use these measures to determine whether a source sequence is different enough from the trained sequences to be judged a possible "critical incident". Incidentally, ADAM might also be trained on examples of "expected" critical incidents so that these might be detected directly.

MacSHAPA [Sanderson and Fisher 1994] provides techniques for aligning two sequences of events as optimally as possible based on maximal common subsequences [Hirschberg 1975]. The results are presented visually as cells in adjacent spreadsheet columns, with aligned events appearing in the same row and missing cells indicating events in one sequence that could not be aligned with events in the other (see Figure 17).
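The following sketch shows one standard way to produce such an alignment from a longest (maximal) common subsequence computed by dynamic programming, with '-' marking cells that could not be aligned; MacSHAPA's own implementation details may differ.

```python
def align(a, b):
    """Align two event sequences on their longest common subsequence."""
    n, m = len(a), len(b)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):           # fill the LCS length table
        for j in range(m - 1, -1, -1):
            lcs[i][j] = (lcs[i + 1][j + 1] + 1 if a[i] == b[j]
                         else max(lcs[i + 1][j], lcs[i][j + 1]))
    rows, i, j = [], 0, 0                    # trace back to build rows
    while i < n and j < m:
        if a[i] == b[j]:
            rows.append((a[i], b[j])); i += 1; j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            rows.append((a[i], "-")); i += 1
        else:
            rows.append(("-", b[j])); j += 1
    rows += [(x, "-") for x in a[i:]] + [("-", y) for y in b[j:]]
    return rows

print(align("ABCD", "ACXD"))
# -> [('A', 'A'), ('B', '-'), ('C', 'C'), ('-', 'X'), ('D', 'D')]
```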

UsAGE applies a related technique in which a source sequence of UI events (related to performance of a specific task) is aligned as optimally as possible with a target sequence produced by an "expert" performing the same task [Ueling and Wolf 1995]. UsAGE presents its alignment results in visual form.

EMA, an "automatic analysis mechanism for the ergonomic evaluation of user interfaces", requires investigators to provide a grammar-based model describing all the expected paths through a particular user interface [Balbo 1996]. An evaluation program then compares a log of events generated by use of the interface against the model, indicating in the log and the model where the user has taken "illegal" paths. EMA also detects and reports the occurrence of other simple patterns, for example, the use of cancel or repeated actions. The evaluator can then use this information to identify problems in the interface (or problems in the model).

USINE is a similar technique [Lecerof and Paterno 1998] in which investigators use a hierarchical task notation to specify how lower-level actions combine to form higher-level tasks, and to specify sequencing constraints on actions and tasks. The tool then compares logs of user actions against the task model. All actions not specified in the task model, or actions and tasks performed "out of order" according to the sequencing constraints specified in the task model, are flagged as potential errors. The tool then computes a number of built-in counts and summary statistics, including the number of tasks completed, errors, and other basic metrics (e.g., window resizing and scrollbar usage), and generates simple graphs.

ErgoLight Usability Validation Suite (EUVS) also compares user interactions against hierarchical representations of user tasks [ErgoLight Usability Software 1998]. EUVS is similar in spirit to EMA and USINE, with the added benefit that it provides a number of built-in counts and summary statistics regarding general user interface use in addition to automatically detected divergences between user actions and the task model.

1. TOP/G, Expectation Agents, and EDEM (discussed above) are also intended to detect deviations between actual and expected usage in order to identify potential usability problems. However, these approaches are better characterized as detecting complete matches between source sequences and ("negatively defined") target patterns that indicate unexpected or suboptimal behavior, as opposed to partially matching, or comparing, source sequences against ("positively defined") target patterns indicating expected behavior.



4.5.3 Related
Process validation techniques are related in that they compare actual traces of events generated by a software process against an abstract model of the intended process [Cook and Wolf 1997]. These techniques compute a diagnostic measure of distance to indicate the correspondence between the trace and the closest acceptable trace produced by the model. Techniques for performing error-correcting parsing are also related. See [Cook and Wolf 1997] for further discussion and pointers to relevant literature.

4.5.4 Strengths
The strengths of these approaches lie in their ability to compare actual traces of events against expected traces, or models of expected traces, in order to identify potential usability problems. This is particularly appealing when expected traces can be specified "by demonstration", as in the case of ADAM and UsAGE.

4.5.5 Limitations
Unfortunately, all of these techniques have significant limitations.

A key limitation of any approach that compares source sequences against concrete target sequences is the underlying assumption that (a) source and target sequences can be easily segmented for piecemeal comparison, as in the case of ADAM, or (b) whole interaction sequences produced by different users will actually exhibit reasonable correspondence, as in the case of UsAGE.

Furthermore, the output of all these techniques (except in the case of perfect matches) requires expert human interpretation to determine whether the sequences are interestingly similar or different. In contrast to techniques that completely match patterns directly indicating violations of expected patterns (e.g., as in the case of EDEM), these techniques produce output to the effect of "the source sequence is similar to a target sequence with a correspondence measure of 61%", leaving it up to investigators to decide on a case-by-case basis what exactly the correspondence measure means.

A key limitation of any technique comparing sequences against abstract models (e.g., EMA, USINE, ErgoLight EUVS, and the process validation techniques described by Cook and Wolf) is that, in order to reliably categorize a source sequence as being a poor match, the model used to perform the comparison must be relatively complete in its ability to describe all possible, or rather expected, paths. This is all but impossible in most non-trivial interfaces. Furthermore, the model must somehow deal with "noise" so that low-level events, such as mouse movements, will not mask otherwise significant correspondence between source sequences and the abstract model. Because these techniques typically have no built-in facilities for performing transformations on input traces, this implies that either the event stream has already been transformed, perhaps by manually instrumenting the application (as with EMA), or complexity must be introduced into the model to avoid sensitivity to "noise". In contrast, techniques such as EDEM and EBBA use selection and abstraction to pick out patterns of interest from the noise. The models need not be complete in any sense and may ignore events that are not of interest.

4.6 Sequence Characterization

4.6.1 Purpose
These techniques take "source" sequences as input and attempt to construct an abstract model to summarize, or characterize, interesting sequential features of those sequences. Some techniques produce a process model with probabilities associated with transitions [Guzdial 1993]. Others construct models that characterize the grammatical structure of events in the input sequences [Olson et al. 1993].

4.6.2 Examples
Guzdial describes a technique, based on Markov chain analysis, that produces process models with probabilities assigned to transitions to characterize user behavior with interactive applications [Guzdial 1993]. First, the investigator identifies abstract stages, or states, of application use. In Guzdial's example, a simple design environment was the object of study. The design environment provided functions supporting the following stages in a simple design process: "initial review", "decomposition", "composition", "debugging", and "final review".


The investigator then creates a mapping between each of the operations in the interface and one of the abstract stages. For instance, Guzdial mapped all debugging-related commands (which incidentally all appeared in a single "debugging" menu) to the "debugging" stage. The investigator then uses the Hawk tool to abstract and recode the event stream, replacing low-level events with the abstract stages associated with them (presumably dropping events not associated with stages). The investigator then uses Hawk to compute the observed probability of entering any stage from the stage immediately before it, yielding a transition matrix. The investigator can then use the matrix to create a process diagram with probabilities associated with transitions. In Guzdial's example, one subject was observed to have transitioned from "debugging" to "composition" more often (52% of all transitions out of "debugging") than to "decomposition" (10%) (see Figure 18). Guzdial then computed a steady-state vector to reflect the probability of any event chosen at random belonging to each particular stage. He could then compare this to an expected probability vector (computed by simply calculating the percentage of commands associated with each stage) to indicate user "preference" for classes of commands.
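Once the stream has been recoded into stages, the transition matrix is a matter of counting, as the sketch below illustrates on an invented stage sequence; the steady-state vector can then be derived from the matrix (e.g., by power iteration), which is omitted here for brevity.

```python
from collections import Counter, defaultdict

stages = ["composition", "debugging", "composition", "debugging",
          "decomposition", "composition", "debugging", "composition"]

transitions = Counter(zip(stages, stages[1:]))   # observed stage pairs
totals = defaultdict(int)
for (src, _), count in transitions.items():
    totals[src] += count

matrix = {(src, dst): count / totals[src]        # observed probabilities
          for (src, dst), count in transitions.items()}

for (src, dst), p in sorted(matrix.items()):
    print(f"{src} -> {dst}: {p:.2f}")
```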

Olson and colleagues describe an approach, based on statistical and grammatical techniques, for characterizing the sequential structure of verbal interactions between participants in design meetings [Olson et al. 1993]. They begin by mapping meeting verbalizations into event categories that are then used to manually encode the transcript into an event sequence representation. The investigator then applies statistical techniques, including log linear modeling and lag sequential analysis, to identify potential dependencies between events in the sequence. The investigator then uses the results of these pattern detection techniques to suggest rules that might be included in a definite clause grammar to summarize, or characterize, some of the sequential structure of the meeting interactions. The investigator then uses the resulting grammar rules to rewrite some of the patterns embedded in the sequence (i.e., abstraction and recoding), and the new sequence is subjected to the same statistical techniques, leading to further iterative refinement of the grammar. The result is a set of grammar rules that provide insight into the sequential structure of the meeting interactions.

4.6.3 Related
Process discovery techniques are related in that they attempt to automatically generate a process model, in the form of a finite state machine, that accounts for a trace of events produced by a particular software process [Cook and Wolf 1996]. It is not clear how well these techniques would perform on data as noisy as UI events. A more promising approach might be to perform selection, abstraction, and recoding of the event stream prior to submitting it for analysis.

4.6.4 Strengths
The strength of these techniques lies in their ability to help investigators discover sequential structure within event sequences and to characterize that structure abstractly.

4.6.5 Limitations
The technique described by Olson and colleagues requires extensive human involvement and can be very time-consuming [Olson et al. 1994]. On the other hand, the automated techniques suggested by Cook and Wolf appear to be sensitive to noise and are less likely to produce models that make sense to investigators [Olson et al. 1994].

In our opinion, Markov-based models, while relying on overly simplifying assumptions, are more likely than grammar-based techniques to tell investigators something about user interactions that they don't already know. Investigators often have an idea of the grammatical structure of interactions that may arise from the use of (at least portions of) a particular interface. Grammars are thus useful in transforming low-level UI events into higher-level events of interest, or in detecting when actual usage patterns violate expected patterns. However, the value of generating a grammatical or FSM-based model to summarize use is more limited. More often than not, a grammar- or FSM-based model generated on the basis of multiple traces will be vacuous, in that it will describe all observed patterns of usage of an interface without indicating which are most common. While this may be useful in defining paths for UI regression testing, investigators interested in locating usability problems will more likely be interested in identifying and determining the frequency of specific observed patterns than in seeing a grammar that summarizes them all.



4.7 Visualization

4.7.1 Purpose
These techniques present the results of transformation and analysis in forms that allow humans to exploit their innate visual analysis capabilities to interpret results. Some of these techniques are helpful in linking results of analysis back to features of the interface.

4.7.2 Examples
Investigators have proposed a number of techniques to visualize data based on UI events. For a survey of such techniques see [Guzdial et al. 1994]. Below are a few examples of techniques that have been used in support of the analysis approaches described above.

Transformation:

The results of performing selection or abstraction on an event stream can sometimes be visualized using a timeline in which bands of color indicate different selections of events in the event stream. For example, one might use red to highlight the use of "Edit" menu operations and blue to highlight the use of "Font" menu operations in the evaluation of a word processor (Figure 12).

MacSHAPA [Sanderson and Fisher 1994] visualizes events as occupying cells in a spreadsheet. Event streams are listed vertically (in adjacent columns), and the correspondence of events in one stream with events in adjacent streams is indicated by horizontal alignment (across rows). A large cell in one column may correspond to a number of smaller cells in another column to indicate abstraction relationships (Figure 13).

Counts and summary statistics:

There are a number of visualizations that can be used to represent the results of counts and summary statistics, including static 2D and 3D graphs, static 2D effects superimposed on a coordinate space representing the interface, and static and dynamic 2D and 3D effects superimposed on top of an actual visual representation of the interface.

The following are examples of static 2D and 3D graphs:

• Graph of keystrokes per window [Chang and Dillon 1997].

• Graph of mouse clicks per window [Chang and Dillon 1997].

• Graph of relative command frequencies [Kay and Thomas 1995] (Figure 14).

• Graph of relative command frequencies as they vary over time [Kay and Thomas 1995] (Figure 15).

Figure 12. Use of "Edit" menu operations is indicated in black. Use of "Font" menu operations is indicated in grey. Used to display results of transformations or sequence detection.

Figure 13. Correspondence between events at different levels of abstraction is indicated by horizontal alignment. Single "Key" events in large cells correspond to "LostEdit", "GotEdit", and "ValueProvided" abstract interaction events in smaller, horizontally aligned cells.


The following are examples of static 2D effects superimposed on an abstract coordinate space representing the interface:

• Location of mouse clicks [Guzdial et al. 1994; Chang and Dillon 1997].

• Mouse travel patterns between clicks [Buxton et al. 1983; Chang and Dillon 1997].

The following are examples of static and dynamic 2D and 3D effects superimposed on top of a graphical representation of the interface:

• Static highlighting to indicate the location and density of mouse clicks [Guzdial et al. 1994].

• Dynamic highlighting of mouse click activity as it varies over time [Guzdial et al. 1994].

• 3D representation of mouse click location and density [Guzdial et al. 1994] (Figure 16).

Sequence detection:

The same technique illustrated in Figure 12 can be used to visualize the results of selecting subsequences of UI events based on sequence detection techniques.

EDEM provides a dynamic visualization of the occurrence of UI events by highlighting nodes in a hierarchical representation of the user interface being monitored. A similar visualization is provided to indicate the occurrence of abstract events, defined in terms of abstract patterns of events, by highlighting entries in a list of the agents responsible for detecting those patterns. These visualizations help investigators inspect the dynamic behavior of events, thereby supporting the process of event pattern specification [Hilbert and Redmiles 1998a].

Sequence comparison:

As described above, MacSHAPA provides facilities for aligning two sequences of events as optimally as possible and presenting the results visually as cells in adjacent spreadsheet columns [Sanderson and Fisher 1994]. Aligned events appear in the same row, and missing cells indicate events in one sequence that could not be aligned with events in the other (Figure 17).

UsAGE provides a similar visualization for comparing sequences based on drawing a connected graph of nodes [Ueling and Wolf 1995].

Figure 14. Relative command frequencies ordered by "rank".

Figure 15. Relative command frequencies over time.

Figure 16. A 3D representation of mouse click density superimposed over a graphical representation of the interface.


The "expert" series of actions is displayed linearly as a sequence of nodes across the top of the graph. The "novice" series of actions is indicated by drawing directed arcs connecting the nodes to represent the order in which the novice performed the actions. Out-of-sequence actions are indicated by arcs that skip expert nodes in the forward direction or that point backwards in the graph. Unmatched actions taken by the novice appear as nodes (with a different color) placed below the last matched expert node.

Sequence characterization:

Guzdial uses a connected graph visualization to illustrate the results of his Markov-based analysis [Guzdial 1993]. The result is a process model with nodes representing process steps and arcs indicating the observed probabilities of transitions between process steps (Figure 18).

4.7.3 Strengths
The strengths of these techniques lie in their ability to present the results of analysis in forms that allow humans to exploit their innate visual analysis capabilities to interpret results.

Particularly useful are the techniques that link results of analysis back to features of the interface, such as the techniques superimposing graphical representations of behavior over actual representations of the interface.

4.7.4 Limitations
With the exception of simple graphs (which can typically be generated using the standard graphing capabilities provided by spreadsheets and statistical analysis packages), most of the visualizations above must be produced by hand. Techniques for accurately superimposing graphical effects over visual representations of the interface can be particularly problematic.

4.8 Integrated Support

4.8.1 Purpose
Environments that facilitate flexible composition of various transformation, analysis, and visualization capabilities provide integrated support. Some environments also provide built-in support for managing domain-specific artifacts such as evaluations, subjects, tasks, data, and results of analysis.

4.8.2 Examples
MacSHAPA is perhaps the most comprehensive environment designed to support all manner of exploratory sequential data analysis (ESDA) [Sanderson et al. 1994]. Features include: data import and export; video and coded-observation synch and search capabilities; facilities for performing selection, abstraction, and recoding; a number of built-in counts and summary statistics; features supporting sequence detection, comparison, and characterization; a general-purpose database query and manipulation language; and a number of built-in visualizations and reports.

Figure 17. Results of an automatic alignment of two separate event streams. Horizontal alignment indicates correspondence. Black spaces indicate where alignment was not possible.

Figure 18. A process model characterizing user behavior, with nodes representing process steps ("initial review", "decomposition", "composition", "debugging", and "final review") and arcs indicating observed probabilities of transitions between process steps.



DRUM provides integrated features for synchronizing events, observations, and video; for defining and managing observation coding schemes; for calculating pre-defined counts and summary statistics; and for managing and manipulating evaluation-related artifacts regarding subjects, tasks, recording plans, logs, videos, and results of analysis [Macleod et al. 1993].

Hawk provides flexible support for creating, debugging, and executing scripts to automatically select, abstract, and recode event logs [Guzdial 1993]. Management facilities are also provided to organize and store event logs and analysis scripts.

Finally, ErgoLight Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) [ErgoLight Usability Software 1998] offer a number of facilities for managing usability evaluations, both local and remote, as well as facilities for merging data from multiple users.

4.8.3 Strengths
The task of extracting usability-related information from UI events typically requires the management of numerous files and media types as well as the creation and composition of various analysis techniques. Environments supporting the integration of such activities can significantly reduce the burden of data management and integration.

4.8.4 Limitations
Most of the environments above possess important limitations.

While MacSHAPA is perhaps the most comprehensive integrated environment for analyzing sequential data, it is not specifically designed for the analysis of UI events. As a result, it lacks support for event capture and focuses primarily on analysis techniques that, when applied to UI events, require extensive human involvement and interpretation. MacSHAPA provides many of the basic building blocks required for an "ideal" environment for capturing and analyzing UI events; however, selection, abstraction, and recoding cannot be easily automated. Furthermore, because the powerful features of MacSHAPA cannot be used during event collection, contextual information that might be useful in selection and abstraction is not available.

While providing features for managing and analyzing UI events, coded observations, video data, and evaluation artifacts, DRUM does not provide features for selecting, abstracting, and recoding data.

Finally, while Hawk addresses the problem of providing automated support for selection, abstraction, and recoding, like MacSHAPA it does not address UI event capture, and as a result, contextual information cannot be used in selection and abstraction.

Column Label: Key to Column Values

Event Capture: (Yes) = events captured automatically; (Instr) = application must be hand-instrumented; (Sim) = events captured by command-line simulation.

Synch/Search: (Obs) = events synchronized with coded observations; (Vid) = events synchronized with video.

Use of Context: (UI) = the UI can be queried for contextual information; (App) = the application can be queried; (User) = the user provides contextual info.

Select/Recode through Sequence Detect: (Built-in) = built-in selection/abstraction/counts & stats/detection; (User) = user selects/abstracts events; (Model) = abstract model used to select/abstract/detect; (Script) = scripts used; (Program) = programs used; (Database) = database query & manipulation language used.

Sequence Compare: (Concrete) = source sequence compared against concrete target sequence; (Model) = source sequence compared against abstract model.

Sequence Character: (Manual) = abstract model constructed manually using statistical/grammatical techniques; (Auto) = abstract model generated automatically.

Visualize: (Built-in) = built-in visualizations; (Graphs) = use of standard graphing facilities; (Database) = use of database graphing facilities.

Data Manage: (Built-in) = built-in management of domain-specific artifacts.

Figure 19. A key to interpreting the values listed in columns of Table 3.


Table 3: A classification of computer-aided techniques for extracting usability-related information from user interface events

Techniques (sorted by class):

Synch & Search (Section 4.1)
• Playback [Neal & Simmons 1983] - Event Capture: Yes; Synch/Search: Obs.; Counts & Stats: Built-in; UI Platform: Unknown.
• Apple Lab [Weiler 1993] - Event Capture: Yes; Synch/Search: Obs.+Vid.; Select/Recode: Built-in; UI Platform: MacOS.
• SunSoft Lab [Weiler 1993] - Event Capture: Yes; Synch/Search: Obs.+Vid.; Select/Recode: Built-in; UI Platform: X Windows.
• Microsoft Lab [Hoiem & Sullivan 1994] - Event Capture: Yes; Synch/Search: Obs.+Vid.; Select/Recode: Built-in; UI Platform: MS Windows.
• I-Observe [Badre et al. 1995] - Event Capture: Yes; Synch/Search: Vid.; Select/Recode: Model; Visualize: Built-in; UI Platform: X Windows.

Transform (Section 4.2)
• Incident Monitoring [Chen 1990] - Event Capture: Yes; Use of Context: UI; Select/Recode: Built-in; UI Platform: X Windows.
• User-Identified CIs [Hartson et al. 1996] - Event Capture: Yes; Use of Context: User; Select/Recode: User; Sequence Detect: User; UI Platform: Unknown.
• CHIME [Badre & Santos 1991] - Event Capture: Yes; Use of Context: UI; Select/Recode: Model; Abstract/Recode: Model; UI Platform: X Windows.
• EDEM [Hilbert & Redmiles 1997] - Event Capture: Yes; Use of Context: UI+User; Select/Recode: Model+User; Abstract/Recode: Model+User; Counts & Stats: Built-in; Sequence Detect: Model; Visualize: Built-in; UI Platform: Java AWT.

Counts & Stats (Section 4.3)
• UIMS [Buxton et al. 1983] - Event Capture: Yes; Counts & Stats: Built-in; Visualize: Built-in; UI Platform: UIMS.
• MIKE [Olsen & Halversen 1988] - Event Capture: Yes; Counts & Stats: Built-in; UI Platform: UIMS.
• KRI/AG [Lowgren & Nordqvist 1992] - Event Capture: Yes; Counts & Stats: Built-in; UI Platform: UIMS.
• Long-Term Monitoring [Kay & Thomas 1995] - Event Capture: Instr.; Counts & Stats: Programs; Visualize: Graphs; UI Platform: API.
• AUS [Chang & Dillon 1997] - Event Capture: Yes; Counts & Stats: Built-in; Visualize: Built-in; UI Platform: MS Windows.
• Aqueduct AppScope [Aqueduct Software 1998] - Event Capture: Instr.; Counts & Stats: Database; Visualize: Database; UI Platform: API.
• Full Circle Talkback [Full Circle Software 1998] - Event Capture: Instr.; Counts & Stats: Database; Visualize: Database; UI Platform: API.

Sequence Detect (Section 4.4)
• LSA [Sackett 1978] - Sequence Detect: Built-in.
• Fisher's Cycles [Fisher 1988] - Sequence Detect: Built-in.
• TOP/G [Hoppe 1988] - Event Capture: Sim.; Use of Context: Sim.; Sequence Detect: Model; UI Platform: Simulation.
• MRP [Siochi & Hix 1991] - Sequence Detect: Built-in.
• Expectation Agents [Girgensohn et al. 1994] - Event Capture: Yes; Use of Context: UI+User; Select/Recode: User; Abstract/Recode: User; Sequence Detect: Model; UI Platform: OS/2.
• EDEM [Hilbert & Redmiles 1997] - Event Capture: Yes; Use of Context: UI+User; Select/Recode: Model+User; Abstract/Recode: Model+User; Counts & Stats: Built-in; Sequence Detect: Model; UI Platform: Java AWT.
• USINE [Lecerof & Paterno 1998] - Event Capture: Yes; Counts & Stats: Built-in; Sequence Detect: Model; Sequence Compare: Model; Visualize: Built-in; UI Platform: X Windows.
• TSL [Rosenblum 1991] - Sequence Detect: Model; UI Platform: N/A.
• Amadeus [Selby et al. 1991] - Sequence Detect: Model; UI Platform: N/A.
• YEAST [Krishnamurthy & Rosenblum 1995] - Abstract/Recode: Model; Sequence Detect: Model; UI Platform: N/A.
• EBBA [Bates 1995] - Use of Context: App; Abstract/Recode: Model; Sequence Detect: Model; UI Platform: N/A.
• GEM [Mansouri-Samani & Sloman 1997] - Use of Context: App; Select/Recode: Model; Abstract/Recode: Model; Sequence Detect: Model; UI Platform: N/A.

Sequence Compare (Section 4.5)
• ADAM [Finlay & Harrison 1990] - Sequence Compare: Concrete; UI Platform: Unknown.
• UsAGE [Ueling & Wolf 1995] - Event Capture: Yes; Sequence Compare: Concrete; Visualize: Built-in; UI Platform: UIMS.
• EMA [Balbo 1996] - Event Capture: Instr.; Sequence Compare: Model; UI Platform: API.
• USINE [Lecerof & Paterno 1998] - Event Capture: Yes; Counts & Stats: Built-in; Sequence Detect: Model; Sequence Compare: Model; Visualize: Built-in; UI Platform: X Windows.
• Process Validation [Cook & Wolf 1997] - Sequence Compare: Model; UI Platform: N/A.

Sequence Character (Section 4.6)
• Markov-based [Guzdial 1993] - Abstract/Recode: Model; Sequence Character: Manual.
• Grammar-based [Olson et al. 1994] - Abstract/Recode: Model; Sequence Character: Manual.
• Process Discovery [Cook & Wolf 1995] - Sequence Character: Auto; UI Platform: N/A.

Integrated Support (Section 4.8)
• Hawk [Guzdial 1993] - Select/Recode: Script; Abstract/Recode: Script; Counts & Stats: Script; Sequence Detect: Script; Data Manage: Built-in.
• DRUM [Macleod & Rengger 1993] - Event Capture: Yes; Synch/Search: Obs.+Vid.; Counts & Stats: Built-in; Data Manage: Built-in; UI Platform: MacOS.
• MacSHAPA [Sanderson et al. 1994] - Synch/Search: Obs.+Vid.; Select/Recode: Database; Abstract/Recode: Database; Counts & Stats: Built-in; Sequence Detect: Built-in+DB; Sequence Compare: Concrete; Sequence Character: Manual; Visualize: Built-in; Data Manage: Built-in.
• ErgoLight EORS/EUVS [ErgoLight Usability Software 1998] - Event Capture: Yes; Counts & Stats: Built-in; Sequence Detect: Model; Sequence Compare: Model?; Visualize: Built-in; Data Manage: Built-in; UI Platform: MS Windows.


5. DISCUSSION

5.1 Summary of the State of the Art

Synch and search techniques are among the most mature technologies for exploiting UI event data in usability evaluations. Tools supporting these techniques are becoming increasingly common in usability labs. However, these techniques can be costly in terms of equipment, human observers, and data storage and analysis requirements. Furthermore, synch and search techniques generally exploit UI events as no more than convenient indices into video recordings. In some cases, events may be used as the basis for computing simple counts and summary statistics using spreadsheets or statistical packages. However, such analyses typically require investigators to perform selection and abstraction by hand.
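
To make the flavor of such analyses concrete, the following minimal sketch (our illustration, not drawn from any surveyed tool) computes event counts and a crude pacing statistic over a hand-selected log; the two-field event format is a hypothetical simplification:

    import java.util.*;

    // Simple counts and summary statistics over a pre-selected event log.
    // The Event record (type + timestamp) is a hypothetical simplification.
    public class EventCounts {
        record Event(String type, long timestampMs) {}

        public static void main(String[] args) {
            List<Event> log = List.of(
                new Event("MOUSE_CLICK", 0),
                new Event("KEY_PRESS", 120),
                new Event("MOUSE_CLICK", 450),
                new Event("MENU_SELECT", 900));

            // Count events by type.
            Map<String, Long> counts = new TreeMap<>();
            for (Event e : log) counts.merge(e.type(), 1L, Long::sum);
            counts.forEach((type, n) -> System.out.println(type + ": " + n));

            // Mean inter-event time, a crude proxy for interaction pace.
            long total = 0;
            for (int i = 1; i < log.size(); i++)
                total += log.get(i).timestampMs() - log.get(i - 1).timestampMs();
            System.out.println("mean gap (ms): " + total / (log.size() - 1));
        }
    }

Note that even this trivial analysis presupposes that someone has already decided which events belong in the log, which is precisely the hand labor at issue.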

The other, arguably more sophisticated, analysis techniques, such as sequence detection, comparison, and characterization, remain largely confined to the research lab. Those that are most compelling tend to require the most human intervention, interpretation, and effort (e.g., exploratory sequential data analysis techniques and the Markov- and grammar-based sequence characterization techniques). Those that are most automated tend to be the least compelling and the most unrealistic in their assumptions (e.g., ADAM, UsAGE, and EMA). One of the main problems limiting the success of automated approaches may be their lack of focus on transformation, which appears to be a necessary prerequisite for meaningful analysis (for reasons articulated in Section 3 and discussed further below).

Nevertheless, few investigators have attempted to address the problem of transformation realistically. Of the more than twenty-five approaches surveyed here, only a handful provide mechanisms that allow investigators to perform transformations at all (Microsoft, SunSoft, Apple, Chen, CHIME, EDEM, Hawk, and MacSHAPA). Of those, fewer still allow models to be constructed and reused in an automated fashion (CHIME, EDEM, and Hawk). Of those, fewer still allow transformation to be performed in context so that important contextual information can be used in selection and abstraction (EDEM).

5.2 Some Anticipated Challenges

There is very little published data regarding the relative utility of the surveyed approaches in supporting usability evaluations. As a result, we have focused on the technical capabilities of the surveyed approaches in order to classify, compare, and evaluate them. To take this analytical evaluation a step further: our understanding of the nature of UI events (based on extensive Java, Windows, and X Windows programming experience) leads us to conclude that more work will likely be needed on transforming the “raw” data generated by such event-based systems in preparation for other types of analysis, in order to increase the likelihood of useful results. This is because most other types of analysis (including simple counts and summary statistics as well as sequence analysis techniques) are sensitive to lexical-level differences in event streams that can be removed via transformation, as illustrated in Section 3.2.
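
To illustrate the kind of transformation at issue, the following sketch (ours; the event names and mapping rules are hypothetical) recodes two lexically different event streams into the same abstract sequence, so that subsequent counts or sequence analyses treat them as equivalent:

    import java.util.*;

    // Recoding: map lexically different raw events to abstract actions,
    // dropping events that no rule selects. Names and rules are hypothetical.
    public class Recoder {
        static final Map<String, String> RULES = Map.of(
            "CLICK:SaveButton", "SAVE",
            "KEY:Ctrl+S", "SAVE",
            "CLICK:OpenButton", "OPEN",
            "KEY:Ctrl+O", "OPEN");

        static List<String> recode(List<String> raw) {
            List<String> abstracted = new ArrayList<>();
            for (String e : raw) {
                String a = RULES.get(e);
                if (a != null) abstracted.add(a);  // selection: unmapped events are dropped
            }
            return abstracted;
        }

        public static void main(String[] args) {
            List<String> viaMouse = List.of("CLICK:SaveButton", "MOUSE_MOVE", "CLICK:OpenButton");
            List<String> viaKeys = List.of("KEY:Ctrl+S", "KEY:Ctrl+O");
            // Both sessions recode to [SAVE, OPEN].
            System.out.println(recode(viaMouse).equals(recode(viaKeys)));  // true
        }
    }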

There are a number of ways that investigators have successfully side-stepped the transformation problem. For instance, building data collection directly into a user interface management system, or requiring applications to report events themselves, can help ameliorate some of the issues. However, both of these approaches have important limitations.

User interface management systems (UIMSs) typically model the relationships between application features and UI events explicitly, so reasonable data collection and analysis mechanisms can be built directly in, as in the case of the MIKE UIMS [Olsen and Halversen 1988], KRI/AG [Lowgren and Nordqvist 1992], and UsAGE [Uehling and Wolf 1995]. Because UIMSs have dynamic access to most aspects of the user interface, contextual information useful in interpreting the significance of events is also available. However, many developers do not use UIMSs; thus, a more general technique that does not presuppose the use of a UIMS is needed.

Techniques that allow applications to report events directly via an event-reporting API provide a useful service, particularly in cases where events of interest cannot be inferred from UI events. This allows important application-specific events to be reported by applications themselves and provides a more general solution than a UIMS-based approach. However, this


places an increased burden on application developers to capture and transform events of interest, for example, as in [Kay and Thomas 1995; Balbo 1996]. This can be costly, particularly if there is no centralized command dispatch loop, or similar mechanism, that can be tapped as a source of application events. It also complicates software evolution, since data collection code is typically intermingled with application code. Furthermore, there is much usability-related information, not typically processed by applications, that can be easily captured by tapping into the UI event stream: for instance, shifts in input focus, mouse movements, and the specific user interface actions used to invoke application features. As a result, an event-reporting API is just part of a more comprehensive solution.
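
The following sketch suggests what such hand instrumentation looks like in practice; the UsageLog class and its report method are hypothetical, but the pattern of explicit reporting calls intermingled with application logic is the one described above:

    // A hypothetical event-reporting API (cf. [Kay and Thomas 1995; Balbo 1996]).
    public final class UsageLog {
        private UsageLog() {}

        public static void report(String event, String detail) {
            // A real tool would buffer records and write them to a file or socket.
            System.out.println(System.currentTimeMillis() + "\t" + event + "\t" + detail);
        }
    }

    // Example call site; the reporting burden falls on the developer:
    //   void save(Document doc) {
    //       UsageLog.report("COMMAND", "save");
    //       doc.writeToDisk();
    //   }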

Thus, we conclude that more work is needed in the area of transformation and data collection to ensure that useful information can be captured in the first place, before automated analysis techniques, such as those surveyed above, can be expected to yield meaningful results (where “meaningful” means the results can be related, without undue hardship, to aspects of the user interface and application being studied, as well as to users’ actions at higher levels of abstraction than simple key presses and mouse clicks). A reasonable approach would assume no more than a typical event-based user interface system, such as those provided by the Macintosh Operating System, Microsoft Windows, the X Window System, or the Java Abstract Window Toolkit; developers would not be required to adopt a particular UIMS or to call an API to report every potentially interesting event.
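
Under these assumptions, a single toolkit-level listener suffices to observe low-level events application-wide. The sketch below (ours) uses the standard Java AWT listener mechanism; it shows only the capture step, with selection and abstraction left to downstream processing:

    import java.awt.AWTEvent;
    import java.awt.Toolkit;
    import java.awt.event.AWTEventListener;

    // Toolkit-level event capture: one listener taps mouse, key, and focus
    // events for the entire application. No UIMS is assumed, and no
    // reporting calls are inserted into application code.
    public class ToolkitCapture {
        public static void install() {
            AWTEventListener tap = (AWTEvent e) ->
                // A real tool would timestamp, buffer, and abstract these records.
                System.out.println(e.getID() + "\t" + String.valueOf(e.getSource()));
            Toolkit.getDefaultToolkit().addAWTEventListener(
                tap,
                AWTEvent.MOUSE_EVENT_MASK | AWTEvent.KEY_EVENT_MASK | AWTEvent.FOCUS_EVENT_MASK);
        }
    }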

5.3 Related Work and Future Directions

There are a number of related techniques, explored in both academia and industry, that have the potential to provide useful insights into how to more effectively exploit UI events as a source of usability information.

A number of researchers and practitioners have addressed related issues in capturing and evaluating event data in the realm of software testing and debugging:

• Work in distributed event monitoring, e.g., GEM [Mansouri-Samani and Sloman 1997], and in model-based testing and debugging, e.g., EBBA [Bates 1995] and TSL [Rosenblum 1991], has addressed a number of problems in the specification and detection of composite events and the use of context in interpreting the significance of events. The event specification notations, infrastructure, and experience that have come out of this work might provide useful insights that can be applied to the problem of capturing and analyzing UI event data (a minimal sketch of composite event detection appears after this list).

• Automated user interface testing techniques, e.g., WinRunner™ [Mercury Interactive 1998] and JavaStar™ [Sun Microsystems 1998], are faced with the problem of robustly identifying user interface components in the face of user interface change, and of evaluating events against specifications of expected UI behavior in test scripts. The same problem is faced in maintaining the relationships between UI components and higher-level specifications of application features and abstract events of interest in usability evaluations based on UI events.

• Monitoring of application programmatic interfaces (APIs), e.g., Hewlett Packard’s Application Response Measurement API [Hewlett Packard 1998], addresses the problem of monitoring API usage to help software developers evaluate the fit between the design of an API and how it is actually used. Insights gained in this area may generalize to the problem of monitoring UI usage to evaluate the fit between the design of a UI and how it is actually used.

• Internet-based application monitoring systems, e.g., AppScope™ [Aqueduct Software 1998] and Talkback™ [Full Circle Software 1998], have begun to address issues of collecting application failure data on a potentially large and ongoing basis over the Internet. The techniques developed to make this practical for application failure monitoring could be applicable in the domain of large-scale, ongoing collection of user interaction data over the Internet.
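
As promised above, here is a minimal illustration of composite event detection (ours; it borrows no syntax from GEM, EBBA, or TSL): a detector fires when a target sequence of abstract events occurs contiguously in the incoming stream. Real systems add timing constraints, parameter bindings, and richer predicates:

    import java.util.*;

    // Detect a composite event: a fixed, contiguous sequence of abstract events.
    public class CompositeDetector {
        private final List<String> pattern;
        private int matched = 0;

        CompositeDetector(List<String> pattern) { this.pattern = pattern; }

        // Feed one event; returns true each time the full pattern completes.
        boolean onEvent(String event) {
            matched = event.equals(pattern.get(matched)) ? matched + 1
                    : (event.equals(pattern.get(0)) ? 1 : 0);
            if (matched == pattern.size()) { matched = 0; return true; }
            return false;
        }

        public static void main(String[] args) {
            CompositeDetector d = new CompositeDetector(List.of("OPEN", "EDIT", "CLOSE"));
            for (String e : List.of("OPEN", "EDIT", "SAVE", "OPEN", "EDIT", "CLOSE"))
                if (d.onEvent(e)) System.out.println("composite event detected");
        }
    }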

A number of researchers have addressed problems in the area of mapping between lower level events and higher level events of interest:

• Work in the area of event histories, e.g., [Kosbie and Myers 1994], and undo mechanisms


has addressed issues involved in grouping lower level UI events into more meaningful units from the point of view of users’ tasks. Insights gained from this work, and the actual event representations used to support undo mechanisms, might be exploited to capture events at higher levels of abstraction than are typically available at the window system level (a minimal sketch of such grouping appears after this list).

• Work in the area of user modeling [User Modeling 1998] is faced with the problem of inferring users’ tasks and goals, based on user background, interaction history, and current context, in order to enhance human-computer interaction. The techniques developed in this area, which range from rule-based to statistically oriented machine-learning techniques, might eventually be harnessed to infer higher level events from lower level events in support of usability evaluations based on UI events.

• Work in the areas of programming by demonstration [Cypher 1993] and plan recognition and assisted completion [Cypher 1991] also addresses problems involved in inferring user intent from lower level interactions. This work has shown that such inference is feasible in at least some structured and limited domains, and programming by demonstration appears to be a desirable method for specifying expected or unexpected patterns of events for sequence detection and comparison purposes.

• Layered protocol models of interaction, e.g., [Nielsen 1986; Taylor 1988a & 1988b], allow human-computer interactions to be modeled at multiple levels of abstraction. Such techniques might be useful in specifying how higher level events are to be inferred from lower level events. Command language grammars (CLGs) [Moran 1981] and task-action grammars (TAGs) [Payne and Green 1986] are other potentially useful modeling techniques for specifying relationships between human-computer interactions and users’ tasks and goals.
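
The grouping idea mentioned in connection with event histories can be made concrete with a small sketch (ours; the event format and pause threshold are hypothetical): consecutive key events separated by less than a pause threshold collapse into a single higher-level “TYPING” unit, while other events pass through unchanged:

    import java.util.*;

    // Chunking: collapse bursts of low-level key events into one abstract unit.
    public class Chunker {
        record Event(String type, long tMs) {}

        static List<String> chunk(List<Event> log, long pauseMs) {
            List<String> out = new ArrayList<>();
            long lastKey = Long.MIN_VALUE / 2;  // avoids overflow on the first gap test
            for (Event e : log) {
                if (e.type().equals("KEY_PRESS")) {
                    if (e.tMs() - lastKey > pauseMs) out.add("TYPING");  // new burst
                    lastKey = e.tMs();
                } else {
                    out.add(e.type());
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<Event> log = List.of(new Event("KEY_PRESS", 0), new Event("KEY_PRESS", 80),
                new Event("MOUSE_CLICK", 2000), new Event("KEY_PRESS", 5000));
            System.out.println(chunk(log, 1000));  // [TYPING, MOUSE_CLICK, TYPING]
        }
    }

Hierarchical event histories generalize this idea by grouping recursively, so that “TYPING” might itself roll up into a still higher-level task unit.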

Work in the area of automated discovery and validation of patterns in large corpora of event data might also provide valuable insights:

• Data mining techniques for discovering association rules, sequential patterns, and time-series similarities in large data sets [Agrawal et al. 1996] may be applicable in uncovering patterns relevant to investigators interested in evaluating usage and usability based on UI events.

• The process discovery techniques investigated by Cook and Wolf [1995] provide insights into problems involved in automatically generating models that characterize the sequential structure of event traces.

• The process validation techniques investigated by Cook and Wolf [1997] provide insights into problems involved in comparing traces of events against models of expected behavior (both ideas are sketched below).
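
Both ideas can be illustrated at their crudest with a single sketch (ours; the Cook and Wolf techniques are considerably richer): “discovery” records the event-to-event transitions observed in example traces, and “validation” flags any trace that uses a transition never observed during discovery:

    import java.util.*;

    // Crude process discovery and validation over event traces.
    public class TraceModel {
        private final Map<String, Set<String>> next = new HashMap<>();

        // Discovery: record observed transitions from example traces.
        void learn(List<String> trace) {
            for (int i = 1; i < trace.size(); i++)
                next.computeIfAbsent(trace.get(i - 1), k -> new HashSet<>()).add(trace.get(i));
        }

        // Validation: does the trace use only transitions seen during discovery?
        boolean conforms(List<String> trace) {
            for (int i = 1; i < trace.size(); i++) {
                Set<String> allowed = next.get(trace.get(i - 1));
                if (allowed == null || !allowed.contains(trace.get(i))) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            TraceModel m = new TraceModel();
            m.learn(List.of("OPEN", "EDIT", "SAVE", "CLOSE"));
            System.out.println(m.conforms(List.of("OPEN", "EDIT", "SAVE")));  // true
            System.out.println(m.conforms(List.of("OPEN", "SAVE", "EDIT")));  // false
        }
    }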

Finally, there are numerous domains in which event monitoring has been used as a means of identifying and, in some cases, diagnosing and repairing breakdowns in the operation of complex systems. For example:

• Network and enterprise management tools for automating network and application administration, e.g., TIBCO Hawk™ [TIBCO 1998].

• Product condition monitoring, e.g., high-end photocopiers or medical devices that report data back to equipment manufacturers to allow performance, failures, and maintenance issues to be tracked remotely [Lee 1996].

6. CONCLUSIONS

We have surveyed a number of computer-aided techniques for extracting usability-related information from UI events. Our classification scheme includes the following categories: synch and search techniques; transformation techniques; techniques for performing simple counts and summary statistics; techniques for performing sequence detection, comparison, and characterization; visualization techniques; and, finally, techniques that provide integrated evaluation support.

Very few of the surveyed approaches support transformation, which we argue is a critical subprocess in the overall process of extracting meaningful usability-related information from UI events.

Our current research involves exploring techniques and infrastructure for performing transformation and analysis automatically and in context, in order to greatly reduce the amount of


data that must ultimately be reported. It is an open question whether such an approach might be scaled up to large-scale and ongoing use over the Internet. If so, we believe that automated techniques, such as those surveyed here, will be useful in capturing indicators of the “big picture” regarding application use in the field. However, we believe that such techniques may be less suited to identifying subtle, nuanced usability issues. Fortunately, these strengths and weaknesses nicely complement those inherent in current usability testing practice, in which subtle usability issues are identified through careful human observation, but in which there is little sense of the “big picture” of how applications are used on a large scale.

A usability professional from a large software development organization recently reported to us that the usability team is often approached by design and development team members with questions such as “how often do users do X?” or “how often does Y happen?” This is obviously useful information for developers wishing to assess the impact of suspected problems or to focus development effort for the next version. However, it is not information that can be reliably collected in the usability lab. We believe that automated usage data collection techniques will eventually complement traditional usability evaluation practice, not only by supporting developers as described above, but also by helping to assess the impact of, and focus the efforts of, usability evaluations.


REFERENCES

The following index relates the references to the categories established by the comparison framework.

Synchronization and Searching: Playback [Neal & Simons 1983], Apple [Weiler 1993], SunSoft [Weiler 1993], Microsoft [Hoiem & Sullivan 1994], I-Observe [Badre et al. 1995]

Transformation: Incident Monitoring [Chen 1990], User-Identified CIs [Hartson et al. 1996], CHIME [Badre & Santos 1991], EDEM [Hilbert & Redmiles 1997]

Counts and Summary Statistics: UIMS [Buxton et al. 1983], MIKE [Olsen & Halversen 1988], KRI/AG [Lowgren & Nordqvist 1992], Long-Term Monitoring [Kay & Thomas 1995], AUS [Chang & Dillon 1997], EORS & EUVS [ErgoLight Usability Software 1998]. Related: AppScope [Aqueduct Software 1998], Talkback [Full Circle Software 1998]

Sequence Detection: LSA [Sackett 1978], Fisher’s Cycles [Fisher 1988], TOP/G [Hoppe 1988], MRP [Siochi & Hix 1991], Expectation Agents [Girgensohn et al. 1994], EDEM [Hilbert & Redmiles 1997], USINE [Lecerof & Paterno 1998]. Related: TSL [Rosenblum 1991], Amadeus [Selby et al. 1991], YEAST [Krishnamurthy & Rosenblum 1995], EBBA [Bates 1995], GEM [Mansouri-Samani & Sloman 1997]

Sequence Comparison: ADAM [Finlay & Harrison 1990], UsAGE [Uehling & Wolf 1995], EMA [Balbo 1996], USINE [Lecerof & Paterno 1998], EUVS [ErgoLight Usability Software 1998]. Related: Process Validation [Cook & Wolf 1997]

Sequence Characterization: Markov-based [Guzdial 1993], Grammar-based [Olson et al. 1994]. Related: Process Discovery [Cook & Wolf 1995]

Integrated Support: MacSHAPA [Sanderson et al. 1994], DRUM [Macleod & Rengger 1993], Hawk [Guzdial 1993], EORS & EUVS [ErgoLight Usability Software 1998]

ABBOTT, A. A primer on sequence methods. Organization Science, Vol. 4, 1990.

AGRAWAL, R., ARNING, A., BOLLINGER, T., MEHTA, M., SHAFER, J., AND SRIKANT, R. The Quest data mining system. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining. 1996.

AHO, A.V., KERNIGHAN, B.W., AND WEINBERGER, P.J. The AWK programming language. Addison-Wesley, Reading, MA. 1988.

ALLISON, P.D. AND LIKER, J.K. Analyzing sequential categorical data on dyadic interaction: A comment on Gottman. Psychological Bulletin, 2, 1987.

AQUEDUCT SOFTWARE. AppScope Web pages. URL: http://www.aqueduct.com/. 1998.

BADRE, A.N., GUZDIAL, M., HUDSON, S.E., AND SANTOS, P.J. A user interface evaluation environment using synchronized video, visualizations, and event trace data. Journal of Software Quality, Vol. 4, 1995.

BADRE, A.N. AND SANTOS, P.J. CHIME: A knowledge-based computer-human interaction monitoring engine. Tech Report GIT-GVU-91-06. 1991a.

BADRE, A.N. AND SANTOS, P.J. A knowledge-based system for capturing human-computer interaction events: CHIME. Tech Report GIT-GVU-91-21. 1991b.

BAECKER, R.M., GRUDIN, J., BUXTON, W.A.S., AND GREENBERG, S. (Eds.). Readings in Human-Computer Interaction: Toward the Year 2000. Morgan Kaufmann, San Mateo, CA, 1995.

BALBO, S. EMA: Automatic analysis mechanism for the ergonomic evaluation of user interfaces. CSIRO Technical report. 1996.

BATES, P.C. Debugging heterogeneous distributed systems using event-based models of behavior. ACM Transactions on Computer Systems, Vol. 13, No. 1, 1995.

BELLOTTI, V. A framework for assessing applicability of HCI techniques. In Proceedings of INTERACT’90. 1990.

BUXTON, W., LAMB, M., SHERMAN, D., AND SMITH, K. Towards a comprehensive user interface management system. In Proceedings of SIGGRAPH’83. 1983.

CHANG, E. AND DILLON, T.S. Automated usability testing. In Proceedings of INTERACT’97. 1997.

CHEN, J. Providing intrinsic support for user interface monitoring. In Proceedings of INTERACT’90. 1990.

COOK, J.E. AND WOLF, A.L. Toward metrics for process validation. In Proceedings of ICSP’94. 1994.

COOK, J.E. AND WOLF, A.L. Automating process discovery through event-data analysis. In Proceedings of ICSE’95. 1995.

COOK, J.E. AND WOLF, A.L. Software process validation: quantitatively measuring the correspondence of a process to a model. Tech Report CU-CS-840-97, Department of Computer Science, University of Colorado at Boulder. 1997.

COOK, R., KAY, J., RYAN, G., AND THOMAS, R.C. A toolkit for appraising the long-term usability of a text editor. Software Quality Journal, Vol. 4, No. 2, 1995.

CUOMO, D.L. Understanding the applicability of sequential data analysis techniques for analysing usability data. Nielsen, J. (Ed.). Usability Laboratories Special Issue of Behaviour and Information Technology, Vol. 13, No. 1 & 2, 1994.

CYPHER, A. (Ed.). Watch what I do: programming by demonstration. MIT Press, Cambridge, MA, 1993.

CYPHER, A. Eager: programming repetitive tasks by example. In Proceedings of CHI’91. 1991.

DODGE, M. AND STINSON, C. Running Microsoft Excel 2000. Microsoft Press. 1999.

DOUBLEDAY, A., RYAN, M., SPRINGETT, M., AND SUTCLIFFE, A. A comparison of usability techniques for evaluating design. In Proceedings of DIS’97. 1997.

ELGIN, B. Subjective usability feedback from the field over a network. In Proceedings of CHI’95. 1995.

ERGOLIGHT USABILITY SOFTWARE. Operation Recording Suite (EORS) and Usability Validation Suite (EUVS) Web pages. URL: http://www.ergolight.co.il/. 1998.

FARAONE, S.V. AND DORFMAN, D.D. Lag sequential analysis: Robust statistical methods. Psychological Bulletin, 101, 1987.

FEATHER, M.S., NARAYANASWAMY, K., COHEN, D., AND FICKAS, S. Automatic monitoring of software requirements. Research Demonstration in Proceedings of ICSE’97. 1997.

FICKAS, S. AND FEATHER, M.S. Requirements monitoring in dynamic environments. IEEE International Symposium on Requirements Engineering. 1995.

FINLAY, J. AND HARRISON, M. Pattern recognition and interaction models. In Proceedings of INTERACT’90. 1990.

FISHER, C. Advancing the study of programming with computer-aided protocol analysis. In Olson, G., Soloway, E., and Sheppard, S. (Eds.). Empirical Studies of Programmers, 1987 Workshop. Ablex, Norwood, NJ, 1987.

FISHER, C. AND SANDERSON, P. Exploratory sequential data analysis: exploring continuous observational data. Interactions, Vol. 3, No. 2, ACM Press, Mar. 1996.

FISHER, C. Protocol Analyst’s Workbench: Design and evaluation of computer-aided protocol analysis. Unpublished PhD thesis, Carnegie Mellon University, Department of Psychology, Pittsburgh, PA, 1991.

FITTS, P.M. Perceptual motor skill learning. In Melton, A.W. (Ed.). Categories of human learning. Academic Press, New York, NY, 1964.

FULL CIRCLE SOFTWARE. Talkback Web pages. URL: http://www.fullsoft.com/. 1998.

GIRGENSOHN, A., REDMILES, D.F., AND SHIPMAN, F.M. III. Agent-based support for communication between developers and users in software design. In Proceedings of the Knowledge-Based Software Engineering Conference ’94. Monterey, CA, USA, 1994.

GOODMAN, D. Complete HyperCard 2.2 Handbook. ToExcel. 1998.

GOTTMAN, J.M. AND ROY, A.K. Sequential analysis: A guide for behavioral researchers. Cambridge University Press, Cambridge, England, 1990.

GRUDIN, J. Utility and usability: Research issues and development contexts. Interacting with Computers, Vol. 4, No. 2, 1992.

GUZDIAL, M. Deriving software usage patterns from log files. Tech Report GIT-GVU-93-41. 1993.

GUZDIAL, M., SANTOS, P., BADRE, A., HUDSON, S., AND GRAY, M. Analyzing and visualizing log files: A computational science of usability. Presented at HCI Consortium Workshop, 1994.

GUZDIAL, M., WALTON, C., KONEMANN, M., AND SOLOWAY, E. Characterizing process change using log file data. Tech Report GIT-GVU-93-44. 1993.

HARTSON, H.R., CASTILLO, J.C., KELSO, J., AND NEALE, W.C. Remote evaluation: the network as an extension of the usability laboratory. In Proceedings of CHI’96. 1996.

HELANDER, M. (Ed.). Handbook of human-computer interaction. Elsevier Science Publishers B.V. (North Holland), 1998.

HEWLETT PACKARD. Application Response Measurement API. URL: http://www.hp.com/openview/rpm/arm/. 1998.

HILBERT, D.M. AND REDMILES, D.F. An approach to large-scale collection of application usage data over the Internet. In Proceedings of ICSE’98. 1998a.

HILBERT, D.M. AND REDMILES, D.F. Agents for collecting application usage data over the Internet. In Proceedings of Autonomous Agents’98. 1998b.

HILBERT, D.M., ROBBINS, J.E., AND REDMILES, D.F. Supporting ongoing user involvement in development via expectation-driven event monitoring. Tech Report UCI-ICS-97-19, Department of Information and Computer Science, University of California, Irvine. 1997.

HIRSCHBERG, D.S. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, Vol. 18, 1975.

HOIEM, D.E. AND SULLIVAN, K.D. Designing and using integrated data collection and analysis tools: challenges and considerations. Nielsen, J. (Ed.). Usability Laboratories Special Issue of Behaviour and Information Technology, Vol. 13, No. 1 & 2, 1994.

HOPPE, H.U. Task-oriented parsing: A diagnostic method to be used by adaptive systems. In Proceedings of CHI’88. 1988.

JOHN, B.E. AND KIERAS, D.E. The GOMS family of user interface analysis techniques: comparison and contrast. ACM Transactions on Computer-Human Interaction, Vol. 3, No. 4, 1996.

JOHN, B.E. AND KIERAS, D.E. Using GOMS for user interface design and evaluation: which technique? ACM Transactions on Computer-Human Interaction, Vol. 3, No. 4, 1996.

KAY, J. AND THOMAS, R.C. Studying long-term system use. Communications of the ACM, Vol. 38, No. 7, 1995.

KOSBIE, D.S. AND MYERS, B.A. Extending programming by demonstration with hierarchical event histories. In Proceedings of East-West Human Computer Interaction’94. 1994.

KRISHNAMURTHY, B. AND ROSENBLUM, D.S. Yeast: A general purpose event-action system. IEEE Transactions on Software Engineering, Vol. 21, No. 10, 1995.

LECEROF, A. AND PATERNO, F. Automatic support for usability evaluation. IEEE Transactions on Software Engineering, Vol. 24, No. 10, 1998.

LEE, B. Remote diagnostics and product lifecycle monitoring for high-end appliances: a new Internet-based approach utilizing intelligent software agents. In Proceedings of the Appliance Manufacturer Conference. 1996.

LEWIS, R. AND STONE, M. (Eds.). Mac OS in a Nutshell. O’Reilly and Associates. 1999.

LOWGREN, J. AND NORDQVIST, T. Knowledge-based evaluation as design support for graphical user interfaces. In Proceedings of CHI’92. 1992.

MACLEOD, M. AND RENGGER, R. The development of DRUM: A software tool for video-assisted usability evaluation. In Proceedings of HCI’93. 1993.

MANSOURI-SAMANI, M. AND SLOMAN, M. GEM: A generalised event monitoring language for distributed systems. IEE/BCS/IOP Distributed Systems Engineering Journal, Vol. 4, No. 2, 1997.

MERCURY INTERACTIVE. WinRunner and XRunner Web pages. URL: http://www.merc-int.com/. 1998.

MORAN, T.P. The command language grammar: a representation for the user interface of interactive computer systems. International Journal of Man-Machine Studies, 15, 1981.

NEAL, A.S. AND SIMONS, R.M. Playback: A method for evaluating the usability of software and its documentation. In Proceedings of CHI’83. 1983.

NIELSEN, J. A virtual protocol model for computer-human interaction. International Journal of Man-Machine Studies, 24, 1986.

NIELSEN, J. Usability engineering. Academic Press/AP Professional, Cambridge, MA, 1993.

NYE, A. AND O’REILLY, T. X Toolkit Intrinsics Programming Manual for X11, Release 5. O’Reilly and Associates. 1992.

OLSEN, D.R. AND HALVERSEN, B.W. Interface usage measurements in a user interface management system. In Proceedings of UIST’88. 1988.

OLSON, G.M., HERBSLEB, J.D., AND RUETER, H.H. Characterizing the sequential structure of interactive behaviors through statistical and grammatical techniques. Human-Computer Interaction Special Issue on ESDA, Vol. 9, 1994.

PAYNE, S.G. AND GREEN, T.R.G. Task-action grammars: A model of the mental representation of task languages. Human-Computer Interaction, Vol. 2, 1986.

PENTLAND, B.T. A grammatical model of organizational routines. Administrative Science Quarterly. 1994.

PENTLAND, B.T. Grammatical models of organizational processes. Organization Science. 1994.

PETZOLD, C. Programming Windows. Microsoft Press. 1998.

PREECE, J., ROGERS, Y., SHARP, H., BENYON, D., HOLLAND, S., AND CAREY, T. Human-computer interaction. Addison-Wesley, Wokingham, UK, 1994.

ROSENBLUM, D.S. Specifying concurrent systems with TSL. IEEE Software, Vol. 8, No. 3, 1991.

RUBIN, C. Running Microsoft Word 2000. Microsoft Press. 1999.

SACKETT, G.P. Observing behavior (Vol. 2). University Park Press, Baltimore, MD, 1978.

SANDERSON, P.M. AND FISHER, C. Exploratory sequential data analysis: foundations. Human-Computer Interaction Special Issue on ESDA, Vol. 9, 1994.

SANDERSON, P.M., SCOTT, J.J.P., JOHNSTON, T., MAINZER, J., WATANABE, L.M., AND JAMES, J.M. MacSHAPA and the enterprise of Exploratory Sequential Data Analysis (ESDA). International Journal of Human-Computer Studies, Vol. 41, 1994.

SANTOS, P.J. AND BADRE, A.N. Automatic chunk detection in human-computer interaction. In Proceedings of Workshop on Advanced Visual Interfaces AVI ’94. Also available as Tech Report GIT-GVU-94-4. 1994.

SCHIELE, F. AND HOPPE, H.U. Inferring task structures from interaction protocols. In Proceedings of INTERACT’90. 1990.

SELBY, R.W., PORTER, A.A., SCHMIDT, D.C., AND BERNEY, J. Metric-driven analysis and feedback systems for enabling empirically guided software development. In Proceedings of ICSE’91. 1991.

SIOCHI, A.C. AND EHRICH, R.W. Computer analysis of user interfaces based on repetition in transcripts of user sessions. ACM Transactions on Information Systems. 1991.

SIOCHI, A.C. AND HIX, D. A study of computer-supported user interface evaluation using maximal repeating pattern analysis. In Proceedings of CHI’91. 1991.

SMILOWITZ, E.D., DARNELL, M.J., AND BENSON, A.E. Are we overlooking some usability testing methods? A comparison of lab, beta, and forum tests. Nielsen, J. (Ed.). Usability Laboratories Special Issue of Behaviour and Information Technology, Vol. 13, No. 1 & 2, 1994.

SUN MICROSYSTEMS. SunTest JavaStar Web pages. URL: http://www.sun.com/suntest/. 1998.

SWEENY, M., MAGUIRE, M., AND SHACKEL, B. Evaluating human-computer interaction: A framework. International Journal of Man-Machine Studies, Vol. 38, 1993.

TAYLOR, M.M. Layered protocols for computer-human dialogue I: Principles. International Journal of Man-Machine Studies, 28, 1988a.

TAYLOR, M.M. Layered protocols for computer-human dialogue II: Some practical issues. International Journal of Man-Machine Studies, 28, 1988b.

TAYLOR, R.N. AND COUTAZ, J. Workshop on Software Engineering and Human-Computer Interaction: Joint research issues. In Proceedings of ICSE’94. 1994.

TIBCO. Hawk Enterprise Monitor Web pages. URL: http://www.tibco.com/. 1998.

UEHLING, D.L. AND WOLF, K. User Action Graphing Effort (UsAGE). In Proceedings of CHI’95. 1995.

USER MODELING INC. (UM INC.). Home page. URL: http://um.org/. 1998.

WEILER, P. Software for the usability lab: a sampling of current tools. In Proceedings of INTERCHI’93. 1993.

WHITEFIELD, A., WILSON, F., AND DOWELL, J. A framework for human factors evaluation. Behaviour and Information Technology, Vol. 10, No. 1, 1991.

WOLF, A.L. AND ROSENBLUM, D.S. A study in software process data capture and analysis. In Proceedings of the Second International Conference on Software Process, 1993.

ZUKOWSKI, J. AND LOUKIDES, M. (Ed.). Java AWT Reference. O’Reilly and Associates. 1997.