ABSTRACT

Title of Dissertation: INTERACTIVE GRAPHICAL QUERYING OF

TIME SERIES AND LINEAR SEQUENCE DATA SETS

Harry Hochheiser, Doctor of Philosophy, 2003

Dissertation directed by: Professor Ben Shneiderman

Department of Computer Science

Numerous analytic domains involve the study of measurable quantities that change

over time. This widespread interest in time series data sets has led to substantial work

in algorithmic strategies for querying and indexing data. Much less work has been

done in the development of interactive tools for identifying patterns in these data sets.

This dissertation uses a graphical mechanism for specifying queries on time series

data to provide the basis for an exploration of the algorithmic and semantic issues

surrounding interactive querying of time series data. Contributions of this dissertation

include:

• The definition of timeboxes - rectangular widgets that can be used in direct-

manipulation Graphical User Interfaces (GUIs) to specify query constraints on

time series data sets. Timeboxes are used to simultaneously specify two sets of

constraints: given a set of N time series profiles, a timebox covering time periods

x1 ... x2 (x1 ≤ x2) and values y1 ... y2 (y1 ≤ y2) will retrieve only those n ∈ N that

have values y1 ≤ y ≤ y2 during all times x1 ≤ x ≤ x2.

• The TimeSearcher information visualization tool, which is based on the

timebox query model. TimeSearcher’s object-oriented architecture can easily be

extended to support variants of the timebox model that provide additional

expressive power.

• The design and implementation of query models and widgets that extend the

timebox model, including variable-time timeboxes (VTTs), angular queries,

leaders & laggards queries, multiple search attributes, and query inversion.

• Analysis of algorithmic issues: A comparison of multiple alternative search

algorithms found that simple sequential scans outperformed geometric indices for

processing timebox queries.

• Empirical evaluation of timeboxes: Two empirical studies, each with 12

subjects, provided preliminary insight into the utility of timeboxes and led to design

improvements for input and display.

• Validation through case studies: TimeSearcher has been used by molecular

biologists to explore gene expression data and nucleotide frequencies. This work

has validated the utility of the tool and identified design suggestions and

opportunities.

• A framework for extending the timebox model, including the description of

numerous possible extensions.
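
The timebox semantics defined in the first contribution above can be sketched in a few lines of Python. This is only an illustration of the definition, not TimeSearcher's implementation: the function names `matches_timebox` and `timebox_query`, the dictionary-of-lists data layout, and the sample values are assumptions made for this example, and time points are assumed to be integer indices into each profile.

```python
from typing import Dict, List, Sequence


def matches_timebox(series: Sequence[float], x1: int, x2: int,
                    y1: float, y2: float) -> bool:
    """Return True if every value at times x1..x2 (inclusive) lies in [y1, y2]."""
    if x1 > x2 or y1 > y2:
        raise ValueError("a timebox requires x1 <= x2 and y1 <= y2")
    # all() stops at the first value that falls outside the value range.
    return all(y1 <= series[x] <= y2 for x in range(x1, x2 + 1))


def timebox_query(profiles: Dict[str, Sequence[float]],
                  x1: int, x2: int, y1: float, y2: float) -> List[str]:
    """Return the names of the profiles that satisfy the timebox."""
    return [name for name, series in profiles.items()
            if matches_timebox(series, x1, x2, y1, y2)]


# Hypothetical data: three profiles with five time points each.
profiles = {
    "A": [10, 12, 14, 13, 30],
    "B": [10, 25, 14, 13, 12],
    "C": [11, 11, 11, 11, 11],
}

# Timebox covering time points 1..3 and values 10..15.
result = timebox_query(profiles, 1, 3, 10, 15)
print(result)
```

In this example, profile B is excluded because its value at time point 1 falls outside the value range, while A is retained even though it leaves the range at time point 4, which lies outside the timebox.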

INTERACTIVE GRAPHICAL QUERYING OF

TIME SERIES AND LINEAR SEQUENCE DATA SETS

by

Harry Hochheiser

Dissertation submitted to the Faculty of the Graduate School of the

University of Maryland, College Park in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

2003

Advisory Committee:

Professor Ben Shneiderman, Chair and Advisor

Associate Professor Eric Baehrecke

Assistant Professor Ben Bederson

Professor Bruce Golden

Professor David Mount

Professor Stephen Mount

© Copyright by

Harry Hochheiser

2003

DEDICATION

To Judy

ACKNOWLEDGEMENTS

Working with Ben Shneiderman has been a truly wonderful experience. I’ve

learned a great deal with Ben, both about how to do research and about how to be

a researcher. I’m particularly grateful for Ben’s support of my “extra-curricular”

activities, and his awareness that Computer Science research does not take place in a

vacuum.

This research has benefited enormously from the input of several faculty members

who acted as collaborators and members of my committee. Ben Bederson provided

invaluable advice regarding implementation and evaluation, along with a different

perspective on Information Visualization. Eric Baehrecke was an enthusiastic supporter

and early user of TimeSearcher. Steve Mount has also provided valuable guidance.

Thanks to both Eric and Steve for patiently answering my repeated questions about

their research. David Mount’s exemplary teaching helped me build the foundation

necessary for thinking about the algorithmic analysis of this work, and his comments

in these areas have been most helpful. As an outsider, Bruce Golden has provided a

useful perspective.

Jesse Grosjean and Lance Good provided invaluable help with implementation

issues relating to Jazz and Piccolo.

It was a pleasure collaborating with Eamonn Keogh on variable-time timeboxes.

Along with Ben S. and Ben B., Allison Druin, Catherine Plaisant, and Francois

Guimbretiere have made the Human-Computer Interaction Lab a wonderful place to

work. Anne Rose deserves thanks for cheerfully putting up with my constant

complaining. Egemen Tanin, Jaime Montemayor, Juan Pablo Hourcade, Hilary Browne

Hutchison, Jinwook Seo, Hyunmo Kang, Gene Chipman, and other HCIL students

have provided a supportive and engaging working environment. As librarian for

the CS department, Jordan Landes was a constant and reliable source of assistance and

good cheer.

Other colleagues from outside the University of Maryland have provided useful

feedback and encouragement. Special thanks to Karen Duca, Chris North, Eric

Hoffman, Clare-Marie and John Karat, and Mary Czerwinski. Batya Friedman deserves

special thanks for suggesting Ben Shneiderman as a good research mentor.

The bulk of this work was supported by the AOL Fellowship in Human-Computer

Interaction. AOL was generous enough to provide this support with no strings

attached. AOL colleagues Amy Hale, Clayton Lewis, and Arkady Pogostkin have been

supportive and helpful throughout.

Finally, thanks to my extended family: Dave, Kellie, Herb, Eleanore, Toby, and

Michael. My daughter Elena isn’t old enough to know it yet, but her smiles have been

enormously helpful in overcoming thesis-related anxiety. I can’t say enough about my

wife Judy - this work literally would not have happened without her.

TABLE OF CONTENTS

List of Tables xii

List of Figures xiv

1 Introduction 1

2 Related Work 5

2.1 Visualizations and Interactive Systems . . . . . . . . . . . . . . . . . 5

2.1.1 Time Series Data: Visualizations . . . . . . . . . . . . . . . . 6

2.1.2 Temporal Data: Visualizations . . . . . . . . . . . . . . . . . 11

2.1.3 Time Series Data: Querying . . . . . . . . . . . . . . . . . . 13

2.1.4 Temporal Data: Querying . . . . . . . . . . . . . . . . . . . 17

2.1.5 Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.1 Similarity Searching . . . . . . . . . . . . . . . . . . . . . . 21

2.2.2 Inverse Queries . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2.3 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.4 Query Specification . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.5 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . 25

2.3 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Timeboxes: Interactive Temporal Query Widgets 29

3.1 Anyof Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Timeboxes in the Context of Information Visualization Research . . . 36

4 TimeSearcher 41

4.1 Overviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2 Leaders & Laggards . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.3 Multiple Time-Varying Attributes . . . . . . . . . . . . . . . . . . . 52

4.4 Query Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.5 Anyof Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.6 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . 59

4.7 Angular Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.8 Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.9 Other Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 TimeSearcher Implementation 71

5.1 A Tour of the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2 Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.1 Input File Format . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.2 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.3 Loading a Data File . . . . . . . . . . . . . . . . . . . . . . . 75

5.3 Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3.1 Piccolo Windows . . . . . . . . . . . . . . . . . . . . . . . . 77

5.3.2 Interaction Handlers: Creation and Modification of Queries . 80

5.3.3 Display Techniques . . . . . . . . . . . . . . . . . . . . . . . 84

5.3.4 The transition from Jazz to Piccolo . . . . . . . . . . . . . . 85

5.4 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.5 Extending Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6 Search Algorithms 95

6.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6.2 Sequential Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.3 Sequential Search for Timebox Extensions . . . . . . . . . . . . . . . 100

6.3.1 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . 100

6.3.2 Angular Queries . . . . . . . . . . . . . . . . . . . . . . . . 101

6.4 Geometric Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.4.1 Orthogonal Range Trees . . . . . . . . . . . . . . . . . . . . 104

6.4.2 Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.5.3 Sequential scans vs. Geometric Indices . . . . . . . . . . . . 118

6.5.4 Theoretical worst-case analyses . . . . . . . . . . . . . . . . 124

6.5.5 Further Examination of Sequential Algorithms . . . . . . . . 126

6.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.6 Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7 Empirical Evaluations 136

7.1 Evaluation of Input Mechanisms for Questions of Varying Complexity 137

7.1.1 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

7.1.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

7.1.3 Task Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.1.4 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

7.1.5 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

7.1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.1.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

7.2 Empirical Evaluation of Input and Output for Exploratory Tasks . . . 157

7.2.1 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

7.2.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

7.2.3 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

7.2.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

7.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

7.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.3 Conclusion & Future Steps . . . . . . . . . . . . . . . . . . . . . . . 170

8 Applications 173

8.1 DNA Microarray Data Set Analysis . . . . . . . . . . . . . . . . . . 173

8.1.1 Programmed Cell Death in Drosophila melanogaster . . . . . 176

8.1.2 Viral Life Cycle in Epithelial Cells . . . . . . . . . . . . . . . 188

8.2 Nucleotide Sequence Data . . . . . . . . . . . . . . . . . . . . . . . 191

8.2.1 Branch Site Consensus Splicing Signal in Arabidopsis thaliana 192

8.2.2 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 197

8.2.3 Contributions and Design Suggestions . . . . . . . . . . . . . 198

8.3 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

9 Query Expressiveness 203

9.1 Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

9.1.1 Fixed-Time, Fixed-Value, and logical combinations thereof . . 205

9.1.2 Variable Time and/or Value . . . . . . . . . . . . . . . . . . . 205

9.1.3 Open-Ended Time and/or Value . . . . . . . . . . . . . . . . 206

9.1.4 Relative Time/Value . . . . . . . . . . . . . . . . . . . . . . 207

9.1.5 Interval Trending . . . . . . . . . . . . . . . . . . . . . . . . 208

9.1.6 Maximal Periods . . . . . . . . . . . . . . . . . . . . . . . . 209

9.1.7 Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . 209

9.1.8 Similarity to a Known Item . . . . . . . . . . . . . . . . . . . 210

9.1.9 Global Constraint . . . . . . . . . . . . . . . . . . . . . . . . 210

9.1.10 Inter-item queries: Leaders & Laggards . . . . . . . . . . . . 211

9.1.11 Prevailing Trends . . . . . . . . . . . . . . . . . . . . . . . . 212

9.1.12 More general queries . . . . . . . . . . . . . . . . . . . . . . 213

9.2 Query Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

9.2.1 Range Events . . . . . . . . . . . . . . . . . . . . . . . . . . 214

9.2.2 Transition Events . . . . . . . . . . . . . . . . . . . . . . . . 216

9.2.3 Inter-item Queries . . . . . . . . . . . . . . . . . . . . . . . 217

9.2.4 Other Logical Operators: Disjunctions and Negations . . . . . 217

9.2.5 More General Queries . . . . . . . . . . . . . . . . . . . . . 218

9.3 Towards A Formal Query Model . . . . . . . . . . . . . . . . . . . . 218

9.3.1 Time Series Data Set . . . . . . . . . . . . . . . . . . . . . . 219

9.3.2 Range Events . . . . . . . . . . . . . . . . . . . . . . . . . . 219

9.3.3 Logical Combinations . . . . . . . . . . . . . . . . . . . . . 220

9.3.4 Variable Timeboxes . . . . . . . . . . . . . . . . . . . . . . . 221

9.3.5 Relative Timeboxes . . . . . . . . . . . . . . . . . . . . . . . 221

9.3.6 Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

9.3.7 Global Constraints . . . . . . . . . . . . . . . . . . . . . . . 222

9.3.8 Inter-item Queries . . . . . . . . . . . . . . . . . . . . . . . 223

9.3.9 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

9.4 Implementing the Extended Queries . . . . . . . . . . . . . . . . . . 224

9.5 User Needs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

9.6 Subsequence Queries: Beyond Full-Sequence Matches . . . . . . . . 229

10 Future Work 231

10.1 Further Development of TimeSearcher . . . . . . . . . . . . . . . . . 231

10.1.1 Re-Implementation . . . . . . . . . . . . . . . . . . . . . . . 231

10.1.2 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

10.1.3 Domain Customization . . . . . . . . . . . . . . . . . . . . . 234

10.1.4 Multiple Time-Varying Attributes . . . . . . . . . . . . . . . 236

10.1.5 Additional Functionality . . . . . . . . . . . . . . . . . . . . 236

10.2 Further Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

10.3 Other Types of Time-oriented Data . . . . . . . . . . . . . . . . . . . 239

10.3.1 Categorical or Nominal Data . . . . . . . . . . . . . . . . . . 239

10.3.2 Temporal Data . . . . . . . . . . . . . . . . . . . . . . . . . 243

11 Conclusions 245

A A Sample TimeSearcher Data File 248

B Study Materials for Evaluation of Input Mechanisms for Questions of

Varying Complexity 250

B.1 Exploratory Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

B.2 Training Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

B.3 Experimental Questions . . . . . . . . . . . . . . . . . . . . . . . . . 251

B.4 User Interface Satisfaction Questionnaire . . . . . . . . . . . . . . . 253

C Empirical Evaluation of Multiple-Constraint Query Formation 255

C.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

C.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

C.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

C.4 Study Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

C.4.1 Exploratory Task . . . . . . . . . . . . . . . . . . . . . . . . 264

C.4.2 Training Questions . . . . . . . . . . . . . . . . . . . . . . . 264

C.4.3 Experimental Questions . . . . . . . . . . . . . . . . . . . . 265

D Study Materials for Empirical Evaluation of Input and Output 268

D.1 Training Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

D.2 Experimental Questions . . . . . . . . . . . . . . . . . . . . . . . . . 269

Bibliography 269

LIST OF TABLES

5.1 Raw performance data. . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.2 Portion of query processing time spent on updating display, for sample

queries on some data sets. All times are in ms. . . . . . . . . . . . . . 94

6.1 Data sets used in algorithm evaluation. . . . . . . . . . . . . . . . . . 109

6.2 Query Operations in each block. . . . . . . . . . . . . . . . . . . . . 110

6.3 Average times (ms) across all operations for data sets with 100 time

points and 100, 1000, 10000, and 50000 items. . . . . . . . . . . . . 112

6.4 Average times (ms) across all operations for data sets with 100 items

and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . . . . 113

6.5 Average times (ms) for the data set with 1000 items and 1000 time

points, with results for both 100 items and 1000 time points and 100

time points and 1000 items given for context. . . . . . . . . . . . . . 116

6.6 Comparison of number of values checked versus possible number of

checks for sequential search of data sets with 100 time points and 100,

1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . . . 120


checks for sequential search of data sets with 100 items and 100, 1000,

and 10000 time points . . . . . . . . . . . . . . . . . . . . . . . . . . 121


checks for sequential search of data sets with 100 time points and 100,

1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . . . 122


checks for sequential search of data sets with 100 items and 100, 1000,

and 10000 time points . . . . . . . . . . . . . . . . . . . . . . . . . . 122

7.1 User preferences by interface for the different task types. . . . . . . . 148

C.1 User preferences by interface for the different task types. . . . . . . . 260

LIST OF FIGURES

1.1 Patterns of interest in stock trend analysis [87]. . . . . . . . . . . . . 3

2.1 A spiral visualization of the consumption of Baphia Capparidifolia by

Chimpanzees in Tanzania during 1980-1988. Each lap represents one

year, and each spoke one month. The area of each blot is proportional

to the observed consumption during that month of the given year. To

see how consumption varied during a given year, users can move along

a given lap of the spiral. To compare consumption in a given month

across years, users examine blots along the same spoke [27]. . . . . . 7

2.2 A Diamond Fast display showing a zoomed image of two overlaid 10-

year periods [135]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 A ThemeRiver visualization of news items regarding Fidel Castro,

from November 1959 through June 1961. Each band in the river

indicates a separate topic, with the thickness of the band indicating the

number of stories on that topic [59]. . . . . . . . . . . . . . . . . . . 9

2.4 A TimeTube, with four DiskTrees showing the evolution of the web

site over time [31]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.5 A LifeLines display of a patient’s medical records [99]. . . . . . . . . 12

2.6 Circular query controls for filtering cyclic data [25]. . . . . . . . . . 14

2.7 The Patterns visual query language, specifying a sequence involving

one of four alternative transitions followed by a single required

transition [90]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.8 The MMVIS query window [60]. . . . . . . . . . . . . . . . . . . . . 17

2.9 A sample parallel coordinates visualization involving four dimensions

from a database describing automobiles [58]. . . . . . . . . . . . . . 20

3.1 A graph overview, formed by superimposing the time series for all of

the items in the data set. . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2 A single timebox query, for items between $70 and $190 during weeks

1-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 A refinement of the query in Figure 3.2. . . . . . . . . . . . . . . . . 32

3.4 A complex query containing three timeboxes. . . . . . . . . . . . . . 33

3.5 A variable time timebox, specifying that for at least R consecutive time

periods between x1 and x2, items must have values in the range y1 ≤

y ≤ y2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.6 The Influence Explorer: Range Sliders on the “brightness” and

“working life” dimensions select the ranges of interest. Histograms with

each variable indicate the number of items having various values of

that variable, and lines between histograms indicate the values of a

selected item [134]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.7 XmdvTool [89]: The highlighted items have been selected by

“brushing”. Once the brush is created, the highlighted areas on any given axis

can be moved or resized [148]. . . . . . . . . . . . . . . . . . . . . . 38

3.8 Explicit range sliders in CityOScope’s parallel coordinates display.

Arrows at the top and bottom of each axis can be used to limit the range

of interest [53]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.9 Two dimensional query widgets: (a) A point query indicating an exact

number of bedrooms and cost of a home. (b) A range of number of

bedrooms and cost [117]. . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1 The TimeSearcher application window. Clockwise from upper-left:

query space (with data envelope, query envelope, and graph overview),

details-on-demand, item list, range sliders for query adjustment, and

data items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 Partial results from a timebox query, with time points that match the

query highlighted. Items in the result set differ in the points that match

the query, indicating an anyof or variable time timebox. . . . . . . . . 43

4.3 Drag-and-drop query-by-example, with results. . . . . . . . . . . . . 45

4.4 Query window with data envelope. . . . . . . . . . . . . . . . . . . . 47

4.5 Query display with data and query envelopes. . . . . . . . . . . . . . 47

4.6 The query window displaying a “leaders & laggards” query. The top

window shows leaders, with the original query in magenta providing

a reference that can be used for comparison. The leaders window also

includes a label indicating that the leaders are being shown, along with

the name of the attribute being used for the leader query. The record

count at the bottom of this window also indicates that the items shown

are leaders. The bottom window (the “laggards” display) shows the

original query in outline, and has new timeboxes representing the new

query, which is defined by shifting the old query one time period to

the right. The count label below this window indicates that the items

shown are laggards. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.7 Leaders & Laggards: The top-left window is the leader window, and

the laggard window is directly below it. . . . . . . . . . . . . . . . . 51

4.8 TimeSearcher with a data set involving multiple time-varying

attributes. Two panes have been created - for the “low” and the “high”

values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.9 The data items in the result set with two variables displayed. The

profiles are taken from yeast microarray data, with absolute log ratio

and log ratio values shown for seven time points [40]. . . . . . . . . . 54

4.10 Updated query envelopes for one of two attributes that are currently

active. Note that even though there are no queries in this window,

queries in the inactive window (for “Low” measurements) have

constrained the data set, as shown by the query envelope. . . . . . . . . . 54

4.11 A summary window for a query over two attributes. . . . . . . . . . . 56

4.12 Query Inversion: The original query (top) and the inverted query

(bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.13 Anyof timeboxes: The display on the top shows a query consisting

of two timeboxes. In the bottom display, the timebox on the left has

been converted to an anyof query. As these queries are more inclusive

(requiring only one value in the given range during the interval, as

opposed to all values), the result set for the anyof query is a superset

of the other result set. . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.14 A variable time timebox (VTT), with two sets of modification handles.

The outer handles can be used to modify the value range and the time

window, while the inner handles can be dragged to modify the duration

of the interval during which values must be within the given range. . . 61

4.15 Calculation of an angular query. If an item ti has a value v at the

starting time tmin, its value at the ending time tmax must be between

vmin and vmax, as determined by θ1 and θ2, along with the width of the

query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.16 The angular query widget. . . . . . . . . . . . . . . . . . . . . . . . 64

4.17 An annotated angular query widget. The dark lines demonstrate how

the vertical line in the query widget is used to determine the two angles

necessary for the query. . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.18 The TimeSearcher query space with an angular query under the “all

points” interpretation. Data and query envelopes have been disabled

for clarity. Selection handles on the query widget can be used to move

and rescale the query, and a tooltip provides a textual representation of

the query on mouse-over. Note that the graph envelopes show items

with a slope similar to that of the angular query widget, but at differing

ranges along the value axis. . . . . . . . . . . . . . . . . . . . . . . . 65

4.19 The angular query from Figure 4.18, under the alternate “end points”

interpretation. Note that some items in the result set have

intermediate transitions that exceed the range specified, even though the line

between values at the end points fits within the specified range. . . . . 66

4.20 An angular brush that searches for negative correlations between items

in the second and third axes [58]. . . . . . . . . . . . . . . . . . . . . 67

4.21 The TimeSearcher query window, with an average profile displayed in

red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.22 An average query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.1 A schematic overview of the container classes used in the

TimeSearcher GUI. The entire window is an instance of TQCore - a

subclass of JFrame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 A UML-style depiction of the relationships between the classes in the

display list and query window. . . . . . . . . . . . . . . . . . . . . . 78

5.3 The steps involved in TimeSearcher query processing. . . . . . . . . . 88

5.4 Average times for TimeSearcher to completely process queries -

including search and display update - on several query types. Results

are shown for data sets of 1000, 10000, 25000, and 50000 items with

100 and 200 time points, and 100,000 items with 100 time points only. 92

6.1 Example of entities that meet (upper) and fail to meet (lower) the

constraints of a timebox. . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.2 Clipping: as the timebox is moved to the lower right, the area marked

“D” is removed from the query, and the “A” region is added. These

two regions must be processed, but there is no need to reprocess the

overlap (“O”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.3 A grid index for a data set with time points 0-9, values 0-80, and 8

buckets in the value dimension. Given this scheme, values from 0-10

will go into bucket 1, 11-20 in bucket 2, etc. The timebox shown will

cover the grids for values 21-30, 31-40, 41-50 and 51-60 for times

3-5. Buckets 21-30 and 51-60 are only partially covered, thus their

contents must be checked at each time point. The other buckets are

completely covered by the timebox, so checking of individual points

is not necessary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.4 Average times (ms) across all operations for data sets with 100 time

points and 100, 1000, 10000, and 50000 items. . . . . . . . . . . . . 111

6.5 Average times (ms) across all operations for data sets with 100 items

and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . . . . 112

6.6 Comparative times for query creation and translation on data sets with

100 time points and 100, 1000, 10000 and 50000 items. . . . . . . . . 114

6.7 Comparative times for query resize and deletion on data sets with 100

time points and 100, 1000, 10000 and 50000 items. . . . . . . . . . . 115

6.8 Comparative times for query creation and translation on data sets with

100 items and 100, 1000, and 10000 time points. . . . . . . . . . . . 116

6.9 Comparative times for query resize and deletion on data sets with 100

items and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . 117

6.10 A timebox query demonstrating the advantage that sequential

processing has over geometric methods. For this timebox that spans eight

time points, sequential processing can stop after the second time value

is identified as falling outside of the timebox. However, the geometric

approaches must examine every point that falls within the timebox. . . 119

6.11 The timebox from Figure 6.10, with a time series for which S(ti,b) =

true. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

6.12 The number of values actually checked for sequential and Grid-20

algorithms for data sets involving 100, 1000, and 10000 items with 100

time points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6.13 The number of values actually checked for sequential and Grid-20

algorithms for data sets involving 100 items with 100, 1000, and 10000

time points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.14 Optimized sequential vs. Hashed sequential for data sets involving

100, 1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . 128

6.15 Optimized sequential vs. Hashed sequential for data sets involving

100, 1000, and 10000 time points . . . . . . . . . . . . . . . . . . . . 129

6.16 Why time series query performance is independent of the width of the

series. As this timebox covers 25% of the value space and five time

periods, a randomly generated time series would only have odds of

< 1% of satisfying the timebox (like t2 does). The odds that a timebox

will fail to meet this query by the fourth time point (like t1) are greater

than 99%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

7.1 A form fill-in interface for specifying query constraints. . . . . . . . . 138

7.2 A range slider interface for specifying query constraints. . . . . . . . 138

7.3 The tsexp interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7.4 Feedback provided in the tsexp interface. Note the highlighted border

around the feedback corresponding to the selected timebox. . . . . . . 142

7.5 Average completion time (with standard deviation error bars) for well-

defined tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

7.6 Number of items correctly identified in exploratory task . . . . . . . . 149

7.7 Average task completion time for exploratory tasks . . . . . . . . . . 150

7.8 Average subjective satisfaction ratings (1-9, 9 is best), n = 12. . . . . 151

7.9 A demonstration of the difficulty of resizing small handles. The large

timebox on the left has handles that are clearly separated and easily

graspable. The small timebox on the right has handles that are only a

few pixels apart, and are therefore harder to select. . . . . . . . . . . 153

7.10 The form-fill interface with tabular display of query results. Each row

contains the data for one item in the set, with its values displayed

in the columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

7.11 Average task completion time with standard deviation error bars. . . . 162

7.12 Average task completion time (with standard deviation error bars) for

each of the two timed tasks. . . . . . . . . . . . . . . . . . . . . . . . 163

7.13 Average task performance times (with standard deviation error bars)

for the six participants who were fastest with the timebox interface. . . 164

7.14 Average task performance times (with standard deviation error bars)

for the six participants who were fastest with either of the form fill-in

interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.15 Average subjective satisfaction ratings (1-9, 9 is best), n = 12. The

preference for the timebox interface was significant in all cases. . . . 166

8.1 Red-green “heat map” display of gene expression at seven time points.

Each row is a gene sample, and each column is a time point. Bright

green samples are repressed genes, bright red are induced genes, and

darker samples are close to the average. Genes that are repressed (low

expression levels) are shown at the top, and induced genes (high ex-

pression levels) at the bottom [34]. . . . . . . . . . . . . . . . . . . . 174

8.2 The Hierarchical Clustering Explorer. Dendrogram clusters and filters

for detail and similarity are shown in the top window, with a detailed

display of a subset shown below. A scatterplot on the right is used

for pairwise comparison between two of the experimental conditions

[111]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

8.3 TimeSearcher query display identifying genes that are roughly similar

to E93 at 10 and 12 hours. This query contains two timeboxes, based

on the values of E93 at 10 and 12 hours. The 12 hour timebox has

been shifted up, to eliminate smaller increases in expression levels.

This timebox has also been increased in height, in order to include

some very sharp increases in expression level that might not have been

included in the original timebox. . . . . . . . . . . . . . . . . . . . . 181

8.4 TimeSearcher query identifying genes that decrease significantly be-

tween 10 and 12 hours, when E93 is increasing. . . . . . . . . . . . . 181

8.5 A query illustrating the need for additional constraints requiring non-

increasing (or non-decreasing) values over a specified interval. Al-

though the general trend of the two timeboxes is upwards, the high-

lighted item actually has a decrease in value between 10 and 12 hours.

Additional constraints requiring non-decreasing items would remove

this item from the result set. . . . . . . . . . . . . . . . . . . . . . . 185

8.6 The three main stages in the creation of protein from DNA. During

transcription, the strand of DNA is copied. During splicing, the introns

are removed, leaving only the exons. The output of splicing is a strand

of mRNA. During translation, the mRNA is exported from the nucleus

and used to create a protein. . . . . . . . . . . . . . . . . . . . . . . 193

8.7 Splice sites and branch sites. . . . . . . . . . . . . . . . . . . . . . 194

8.8 Data envelope overview of pentamer frequency distributions in Ara-

bidopsis thaliana. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

8.9 Timebox query aimed at finding pentamers with higher frequencies at

a specific region within introns (the branch site) and lower frequencies

elsewhere within introns. . . . . . . . . . . . . . . . . . . . . . . . . 196

9.1 A schematic layout of the different types of example queries. Queries

are expressed in approximate order of increasing precision, from left

to right. Aggregate queries are modifiers that apply to queries within

the shaded box, and maximal period queries are modifiers that might

apply to those within the unshaded box. Queries below the dashed

line involve comparisons based on the characteristics of individual

items in the data set, while those above the line involve comparisons

between items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

9.2 A timebox query expressing A∧ (B∨C)∧D. B and C must be dis-

juncts, as both cannot be true simultaneously. . . . . . . . . . . . . . 226

9.3 A timebox that may lead to ambiguous interpretation under the model

given in Figure 9.2. The item drawn is in either timebox B or C for

the two time points during which they overlap, but it does not spend

both of those time points in any one box. Should this item be included

under the disjunctive semantics of Figure 9.2? What result would users

expect? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

10.1 The TimeSearcher query display, augmented with a preview display

displaying time periods that have larger numbers of items that follow

the pattern. The number of items that match the query at each time

point is given by the line color at that time: lighter colors indicate

a small number of matches, while darker colors show intervals with

more matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

10.2 Sketch of a potential design for categorical timeboxes. For a data set

involving web log records for multiple hosts, this interface might be

used to find queries that had large numbers of visitors from “.com”

hosts in September and October, followed by large numbers of “.org”

visitors in December and January. . . . . . . . . . . . . . . . . . . . 241

10.3 A categorical timebox query looking for sites that had large numbers

of “.org” or “.edu” visitors during December and January. . . . . . . . 242

C.1 Average completion time for well-defined tasks. . . . . . . . . . . . . 258

C.2 Number of items correctly identified in exploratory task . . . . . . . . 259

C.3 Average task completion time for exploratory tasks . . . . . . . . . . 260

C.4 Average subjective satisfaction ratings (1-9, 9 is best). n = 7 . . . . . 261

Chapter 1

Introduction

Numerous analytic domains involve the study of measurable quantities that change

over time. Financiers examining trends in economic indicators, meteorologists study-

ing climate data, demographers quantifying trends in census data, and numerous others

use time series graphs, statistical evaluations, and other tools to identify patterns and

find trends in these time series data sets.

Interest in time series data has prompted a substantial body of work in the develop-

ment of strategies for storing and indexing temporal data. Algorithmic and statistical

methods for identifying patterns have provided substantial functionality in a wide va-

riety of situations [4, 5, 6, 19, 24, 30, 49].

Algorithmic research only addresses one aspect of the data mining problem. The

question of query formulation - which questions are worth asking? - is often left unan-

swered. Data mining researchers often pose the challenge of finding patterns in time

series in terms of similarity to an input pattern. These queries involve specification of

both a query pattern and a range of allowable similarity. Identification of parameters

such as these using trial-and-error processing is often challenging and computationally

expensive. A central problem for users is that the effects of small changes on parame-

ters such as similarity tolerances may be hard to gauge without running multiple trials.

In these cases, users need tools to support interactive exploration of the contents

of time series data sets. By providing analysts with the power to quickly construct

queries, modify parameters, and examine result sets, these tools would encourage the

development of understanding of the data set as a whole. This understanding is use-

ful for guiding the construction of queries, thus speeding the process of knowledge

discovery.

Dynamic queries [7] and related information visualization techniques [26] have

proven useful in supporting users interested in understanding multi-dimensional ab-

stract datasets. The combination of graphic displays with easily manipulated user-

interface widgets for query formulation allows users to explore data sets in search of

items of interest. Although there has been little work to date on interactive systems for

querying time series data, lessons from information visualization research can guide

developers of systems for the exploration of time series data sets.

The existence of familiar graphic displays of time series presents an obvious start-

ing point for the application of information visualization techniques. Two-dimensional

graphs with time on the x-axis, and a continuous variable on the y-axis are ubiquitous:

stock charts, weather data, and physiologic data such as electroencephalograms (EEG) and

electrocardiograms (EKG) are just a few examples. In domains such as stock price

analysis, familiar patterns have been named and identified as shorthand approaches to

identifying trends of interest (Figure 1.1) [87].

Preliminary investigations into possible interactive systems for exploring time

series data have led to the development of the timebox metaphor, and its implementation

in TimeSearcher. Timeboxes are rectangular regions that are placed and directly ma-

nipulated on a timeline, with the boundaries of the region providing the relevant query

parameters. TimeSearcher is a research application that supports the use of timebox

Figure 1.1: Patterns of interest in stock trend analysis [87].

queries to interactively search and explore time series data sets. TimeSearcher also

provides other querying tools, including support for simultaneous querying of mul-

tiple time-varying attributes, extensions to the timebox query model, drag-and-drop

query-by-example, “leaders & laggards” querying, and query inversion.
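
The timebox semantics can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not TimeSearcher's implementation: the names `Timebox`, `satisfies`, and `query` are hypothetical, each series is assumed to be a list of values indexed by time point, the sample data is invented, and multiple timeboxes are assumed to combine conjunctively.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Timebox:
    """Hypothetical timebox: time indices t1..t2 and values lo..hi, inclusive."""
    t1: int
    t2: int
    lo: float
    hi: float


def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """True if every value of the series in t1..t2 lies within [lo, hi]."""
    for t in range(box.t1, box.t2 + 1):
        if not (box.lo <= series[t] <= box.hi):
            return False  # stop at the first value that falls outside the box
    return True


def query(data: Dict[str, Sequence[float]], boxes: List[Timebox]) -> List[str]:
    """Names of the series that satisfy all timeboxes (assumed conjunctive)."""
    return [name for name, s in data.items()
            if all(satisfies(s, b) for b in boxes)]


# Invented sample data: three series with five time points each.
stocks = {
    "A": [10, 12, 14, 13, 20],
    "B": [10, 30, 14, 13, 20],
    "C": [11, 13, 15, 14, 9],
}
print(query(stocks, [Timebox(1, 3, 11, 16)]))                         # ['A', 'C']
print(query(stocks, [Timebox(1, 3, 11, 16), Timebox(4, 4, 18, 25)]))  # ['A']
```

Moving or resizing a box corresponds to changing the four fields of `Timebox` and re-running `query`, which is the kind of incremental modification that a direct-manipulation interface makes cheap for the user.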

This dissertation describes related work, introduces the timebox model and Time-

Searcher, and continues with more in-depth discussion of the query model, implemen-

tation, evaluation, extensions, and future work:

• Chapter 2 provides a discussion of related work in visualization of time series

and temporal data, data mining, databases, and searching of time series data sets.

• Chapter 3 introduces the timebox concept and provides examples of its use.

• Chapter 4 describes TimeSearcher, its implementation of timeboxes, and other

features.

• Chapter 5 provides details of TimeSearcher’s implementation.

• Chapter 6 describes a comparison of various algorithms that were evaluated as

candidates for providing the efficient processing needed for dynamic queries.

• Chapter 7 contains results from two empirical studies of timeboxes as compared

to other query specification modalities. Additional empirical results are pre-

sented in Appendix C.

• TimeSearcher has been used in ongoing research in molecular biology. Inves-

tigations of microarray data have been used to find patterns in gene expression

data. Building on the observation that any linear sequence can be treated as

a time series, biologists have also used TimeSearcher for exploration of data

sets describing nucleotide frequencies at differing positions in aligned genetic

sequences. These applications are described in Chapter 8.

• A wide variety of extensions to the timebox model might be used to provide

greater query expressiveness. Some of the possible extensions are introduced in

Chapter 9.

• Possibilities for future work are outlined in Chapter 10.

• Chapter 11 is the conclusion.

Chapter 2

Related Work

The focus of this thesis is time series data: sequences of real-valued measurements

x1 . . .xn. Although time series data has been the subject of extensive examination in

a wide variety of research fields, it is only one component of what has been called

“time-oriented data” [123]. Other forms of time-oriented data include temporal data,

involving events of arbitrary duration (as opposed to values that are recorded at discrete

intervals) [9], and spatio-temporal data, which combines temporal (perhaps time series)

information with spatial data. The challenges associated with these varying

domains have led to work in a variety of areas.

2.1 Visualizations and Interactive Systems

A recent survey of linear temporal visualizations is found in [123]. Generally, these

tools focus on visualization and navigation, with relatively little emphasis on querying

and pattern identification.

2.1.1 Time Series Data: Visualizations

The importance of time series data sets has led to extensive work on the part of graphic

designers interested in increasing the readability of these time series graphs [133]. A

non-linear display, which emphasizes more recent information by compressing older

data, was suggested by Powsner & Tufte [102]. This compression provided a static

“focus+context” [26] of individual readings from a patient’s medical record. The

combination of multiple small graphs on a single page provided a succinct overview

of the entire record.

Recent research into interactive visualizations of these data sets has focused on

supporting multi-scale and periodic views. Recursive patterns, an early visualization

technique, provide dense displays of data divided hierarchically into finer-grained time

periods (year, month, week, etc.) [73].

Spiral Visualizations [27] uses a circular metaphor to display the periodicity of

some data sets. Time progresses along the path of the archimedean spiral, with corre-

sponding periods in each interval aligned to form “spokes” of each period. For exam-

ple, each revolution of the spiral might indicate one year, with data points for January

in each year aligned along a single spoke (Figure 2.1). The Spiral Visualizations tool

also provides facilities for zooming in on subsets of the spiral and manually adjusting

the duration of each revolution, both of which support interactive exploration.
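
The alignment of corresponding periods along spokes follows directly from the geometry of the Archimedean spiral, as the following Python sketch suggests. It is not taken from the Spiral Visualizations tool: the function `spiral_layout` and its parameters (`period`, `inner_radius`, `lap_spacing`) are hypothetical names chosen for illustration.

```python
import math
from typing import List, Tuple


def spiral_layout(n_points: int, period: int,
                  inner_radius: float = 1.0,
                  lap_spacing: float = 1.0) -> List[Tuple[float, float]]:
    """Place n_points along an Archimedean spiral with `period` points per lap.

    Returns (angle, radius) pairs. The angle is reduced to [0, 2*pi) so that
    points lying on the same spoke share exactly the same angle.
    """
    layout = []
    for i in range(n_points):
        laps = i / period  # number of revolutions completed (fractional)
        angle = 2 * math.pi * (i % period) / period
        radius = inner_radius + lap_spacing * laps  # grows linearly with the angle
        layout.append((angle, radius))
    return layout


# Monthly data for three years: 12 points per revolution.
layout = spiral_layout(36, period=12)
january_angles = {layout[i][0] for i in (0, 12, 24)}
print(len(january_angles))                  # 1: every January lies on one spoke
print([layout[i][1] for i in (0, 12, 24)])  # [1.0, 2.0, 3.0]
```

Changing `period` corresponds to manually adjusting the duration of each revolution: with a period that does not match the cycle in the data (say, 11 instead of 12), points for the same month no longer share an angle and the spokes disappear.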

Many interesting data sets contain multiple time-varying quantities. For example,

values for temperature, precipitation, and barometric pressure might be available for each

time point. These data sets provide additional challenges for visualization designers:

displaying multiple attributes in a single space may increase information density, but

the resulting display might be cluttered.

Spiral Visualizations uses two approaches to address this problem. Multiple at-

Figure 2.1: A spiral visualization of the consumption of Baphia Capparidifolia by

Chimpanzees in Tanzania during 1980-1988. Each lap represents one year, and each

spoke one month. The area of each blot is proportional to the observed consumption

during that month of the given year. To see how consumption varied during a given

year, users can move along a given lap of the spiral. To compare consumption in a

given month across years, users examine blots along the same spoke [27].

tributes are shown on a single spiral by providing a marking for each variable of interest

at each time point, producing a series of “flags” [27] that are displayed perpendicular

to the spiral, in a 3D projected view. Alternatively, two separate spirals can be tightly

coupled.

Spiral depictions have also been used to provide basic facilities for filtering time

series displays. Brewer, et al. describe a circular query control which can be used to

Figure 2.2: A Diamond Fast display showing a zoomed image of two overlaid 10-year

periods [135].

specify consecutive and non-consecutive time intervals at multiple scales (year, month,

day) [25].

Other work in time series visualization tools has focused on tools for statistical

analysis. Diamond Fast provided a variety of features suitable for exploratory analy-

sis of time series data, including the ability to zoom displays in both time and value

dimensions, point queries to find values at individual data points, and overlay of multi-

ple time series to compare periodic patterns (Figure 2.2) [135, 136]. Another proposed

system - FORTUNE - builds on Diamond Fast, adding decomposition, smoothing,

forecasting, and other statistical analysis tools [80].

ThemeRiver [59], a system for visualizing thematic patterns over time in a docu-

ment collection, uses a different approach to depicting changes over time. Given a set

of news stories, selected from a given time period, ThemeRiver displays the changes in

content as topics become more and less newsworthy. Working from the observation

that cumulative histograms of the number of articles in each topic may be hard to inter-

pret, ThemeRiver connects the corresponding areas in adjacent bars of the histogram

to create a visual flow from one time point to the next. In this view, the thickness of

a section of the river indicates the number of stories on the associated topic, and the

Figure 2.3: A ThemeRiver visualization of news items regarding Fidel Castro, from

November 1959 through June 1961. Each band in the river indicates a separate topic,

with the thickness of the band indicating the number of stories on that topic [59].

vertical position of an item in the river is not important. The total thickness of the

river at a given time point provides a measure of the overall news activity on that date

(Figure 2.3).

DiskTrees and TimeTubes [31] provide an interesting model for the use of circles

to display multiple attributes from a hierarchical data set. A disk tree is a circular

display of the contents of a web site, with the root in the center and subsequent layers

in the circle corresponding to layers in the hierarchy. The usage of various parts of the

web site is indicated through the dual codings of line size and brightness. Sequential

displays of multiple disk trees support examination of changes in usage trends over

time (Figure 2.4). Mouse-over, zooming, and other interaction techniques are used to

support interactive exploration of the data.

Animation has also been used to display changes in time series data over time, either

automatically or in response to user input [31, 45].

Figure 2.4: A TimeTube, with four DiskTrees showing the evolution of the web site

over time [31].

Another class of visualizations handles data sets that are not strictly time series,

even if they involve observations at discrete times. SeeLog uses graphic glyphs to

portray entries in Unix command logs [44], in order to provide overviews of large log

files that previously went unanalyzed. Controls for filtering items and limiting the time

scale can be used to limit the time span being displayed, and full-text entries from the

original log files can be retrieved via mouse-click. Ribler, et al. developed a suite of

metrics that could be used to model and visualize categorical time series data [106].

Wong, et al. [145] display sequential patterns in text corpora with time proceed-

ing along the horizontal x axis and category names on the vertical y axis. Categories

are sorted alphabetically from bottom to top, providing an ordering comparable to the

numeric ordering used in real-valued time series data. Alternative arrangements that

relate proximity to topic similarity might produce displays that are more easily

understood.

Primarily implemented as visualizations, these systems are limited in their ability

to specify queries involving patterns that change over time. Facilities for zooming

in on desired intervals are generally provided, but tools for creating specific queries are not:

patterns are identified through visual inspection.

Although the lack of a common application domain and task profiles makes comparison

between these visualizations difficult, more understanding of relative strengths

and weaknesses is clearly needed. As none of the papers mentioned above involve any

empirical evaluation, further evaluation and comparison would be particularly useful.

2.1.2 Temporal Data: Visualizations

Visualizations of temporal data have been used in a variety of domains, including:

Medical Data : Medical records for individuals include treatment information, test

results, and diagnoses that evolve over time. Cousins and Kahn [36] developed an early

visualization that combined time series measurements, intervals for events of non-

point duration, and additional external details such as patient calendar information. A

semantic model that supported differing temporal granularities was used to support

moving between different time scales. LifeLines [99] used a categorized, zoomable

display to provide an overview of an entire medical record: as the user zooms in to

successively smaller intervals, the display is updated to provide progressively finer-

grained information (Figure 2.5).

Calendar and activity timelines : Calendar and time schedules are perhaps the

most familiar and natural timelines. Calendar visualizers [88] use multiple-scale rep-

resentation, zooming, and focus+context displays to support scheduling activities and

coordinating meetings.

Figure 2.5: A LifeLines display of a patient’s medical record [99].

Interaction Histories : Time-Machine Computing [105] and LifeStreams [50] use

linear temporal displays to visualize current and past activities in a desktop computing

environment, in order to support recall, navigation, and location of information and

documents. Other systems have examined the use of timelines for finer-grain actions,

as the basis for extended undo/redo facilities [41, 100]. In presenting the possibility

of navigating to a previous state and subsequently modifying that state, these visual-

izations face the challenge of appropriately handling divergent timelines that represent

possible descendants of a given point in time [43].

Data Analysis : Human-Computer Interaction, cognitive engineering, and other en-

gineering and ethnographic research fields often require the collection and interpre-

tation of real-time data collected from participants in research studies. Often involv-

ing synchronization of video recording, computer activity, and other activities, these

data sets can be difficult to interpret. MacSHAPA [110] and Timelines [57] are two

tools that support navigation in time through data collected in the course of these ob-

servations. Data collected is displayed on one or more synchronized timelines, and

navigation tools support scrolling in time.

Other visualizations use less familiar representations of temporal attributes. Peo-

pleGarden [147] uses lengthening stalks of individual flowers to display the amount of

time that users have spent participating in online discussions. SELES [28] displays

time-varying changes in landscape data by using two dimensions of a cube for location

and the third for time. Exploration is supported via distortion techniques that can be

used to see “inside” the cube.

2.1.3 Time Series Data: Querying

For many tools involving time series data, temporal query facilities are limited to narrowing

the display to selected regions of interest. Spiral Visualizations provides facilities

for changing the time scale of the display and zooming in on periods of interest.

Modifications to the scale of the display lead to revisions of the spiral that may cause

patterns to appear along the spokes of the spiral [27]. Similarly, circular query widgets

have been used to limit displays to intervals of interest [25] (Figure 2.6).

MIMSY [107] provided an early example of an interactive tool for querying time

series data. Designed to support analysis of stock market data, MIMSY used traditional

GUI widgets, including text entry fields and pull-down menus,

to search for trends of interest in stock data. MIMSY supports a number of

domain-specific time series, including volume, shares outstanding, and others, along

with aggregates including average, min, max, and move. Other interesting operators

include support for relative changes (“close of IBM down more than 10%”) and cross-

ings in value (“select close of abc when close of abc crosses close of b”). Query

processing is handled in a traditional batch mode.

Figure 2.6: Circular query controls for filtering cyclic data [25].

QuerySketch is an innovative query-by-example tool that uses an easily-drawn

sketch of a time series profile to retrieve similar profiles, with similarity defined by

Euclidean distance [141]. Queries are executed implicitly on mouse release, and re-

sults are displayed in thumbnail form beneath the query space. Designed for simplicity

and ease-of-use, QuerySketch does not support editing of existing queries.
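
Similarity retrieval by Euclidean distance can be sketched as follows. This is a generic illustration rather than QuerySketch's actual code: the function names `euclidean` and `most_similar`, the parameter `k`, and the sample profiles are all assumptions made for the example, and the sketch is assumed to have been resampled to the same length as the stored profiles.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(sketch: Sequence[float],
                 data: Dict[str, Sequence[float]],
                 k: int = 2) -> List[Tuple[str, float]]:
    """Return the k profiles closest to the sketch, nearest first."""
    ranked = sorted((euclidean(sketch, s), name) for name, s in data.items())
    return [(name, dist) for dist, name in ranked[:k]]


# Invented profiles; the "sketch" stands in for a user-drawn query.
profiles = {
    "rising": [1, 2, 3, 5],
    "flat": [2, 2, 2, 2],
    "falling": [4, 3, 2, 1],
}
sketch = [1, 2, 3, 4]
print([name for name, _ in most_similar(sketch, profiles)])  # ['rising', 'flat']
```

A result list of this kind requires the user to choose either `k` or a distance threshold, which is one instance of the parameter-selection problem described in Chapter 1.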

Spotfire’s Array Explorer 3 [127] supports graphically editable queries of temporal

patterns in microarray data. Queries are dynamically modified by moving discrete

value markers at each time point. Query results are based on Euclidean distances

from the resulting profile. Queries are evaluated against clustered time series from a

larger set of microarray gene expression profiles. The limitation of each query point

to a single time instance complicates the expression of queries involving values that

remain relatively unchanged for a period of time.

Figure 2.7: The Patterns visual query language, specifying a sequence involving one

of four alternative transitions followed by a single required transition [90].

Patterns [90] uses a set of graphic primitives and operators to specify patterns of

interest in time series data. Query primitives can be used to search for intervals dur-

ing which values are rising, falling, flat, contained within a given threshold, straight

(constant slope), concave, or convex. These primitives are specified via operators that

include visual depictions of the associated trend. Operators are

parameterized, and can be combined via operators including conjunction, disjunction,

loop (for repetition), and gap (indicating a “don’t-care” interval between two events)

(Figure 2.7). Query results are provided in a parse tree, which details the composition

of the result in terms of the primitive operators. Although algorithmic details of query

processing are not provided, the Patterns query language is powerful and flexible.

The identification of patterns in time-series data at varying granularities involves

additional challenges. This problem was addressed by van Wijk and van Selow, in an

attempt to identify time-varying trends in energy usage and employee attendance in

terms of variations over a given day and identification of similar days over a several

month period. Patterns for individual days were clustered hierarchically, forming the

basis for a calendar display that colors each date based on the cluster to which it be-

longs. Alongside the calendar, a graph display could be used to display the graphs

corresponding to one or more clusters. Querying and browsing are both supported, as

clusters can be selected via point-and-click on dates, similarity to a chosen date, or via

top-down browsing of the hierarchy [138].

Data Mining research regarding time series data has generally been limited to al-

gorithmic strategies for finding patterns similar to a given query (Section 2.2). Some

work in this area has addressed issues relevant to interactive systems, particularly with

respect to query specification and refinement.

Agrawal et al.’s Shape Definition Language (SDL) [5] provides very similar query

mechanisms. SDL uses textual operators such as “up”, “down”, “stable” and “zero”

to construct queries similar to those that can be created with Patterns. Composition op-

erators supporting a regular expression-like syntax can be used to construct complex

queries (e.g., “(in 5 (and (noless 2 (any up stable)) (nomore 1 (any down stable))))”).

Although an interface is not described, an index structure and well-defined semantics

provide the groundwork for construction of an interactive system based on SDL.

Similarity Miner [146] uses a similar approach to query specification.
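
The general idea of describing shapes with a small alphabet of transitions and composing them with regular expression-like operators can be sketched in Python. The code below is not SDL and does not reproduce its syntax or semantics: the single-letter alphabet, the `tolerance` threshold, and the function names `to_symbols` and `find_shape` are assumptions introduced for illustration, and ordinary regular expressions stand in for SDL's composition operators.

```python
import re
from typing import List, Sequence, Tuple


def to_symbols(series: Sequence[float], tolerance: float = 0.5) -> str:
    """Encode each transition as 'u' (up), 'd' (down), or 's' (stable).

    A change whose magnitude is at most `tolerance` counts as stable.
    """
    symbols = []
    for prev, cur in zip(series, series[1:]):
        delta = cur - prev
        if delta > tolerance:
            symbols.append("u")
        elif delta < -tolerance:
            symbols.append("d")
        else:
            symbols.append("s")
    return "".join(symbols)


def find_shape(series: Sequence[float], pattern: str,
               tolerance: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) transition indices for each regex match."""
    return [m.span() for m in re.finditer(pattern, to_symbols(series, tolerance))]


# Invented data: a rise, a brief plateau, a fall, and a recovery.
prices = [10, 12, 15, 15.2, 14, 11, 11.1, 13]
print(to_symbols(prices))                # uusddsu
print(find_shape(prices, r"u{2,}s?d+"))  # [(0, 5)]
```

The pattern `u{2,}s?d+` reads as "at least two rises, an optional stable step, then one or more falls", which gives a rough sense of how textual shape operators can express the kinds of trends that Patterns specifies graphically.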

Noting that time series that look substantially different can have small Euclidean

distances, Keogh and Pazzani suggested a relevance feedback approach to query pro-

cessing. In this model, users would evaluate responses to an original query, rating

items in the result set on a 7-point scale. These ratings would be used to create a new

query based on the old query and on the rated result items. Both the query and result

items were segmented by a piecewise linear approximation, allowing users to express

different preferences for different features of interest. User models are also modified

Figure 2.8: The MMVIS query window [60].

to account for the potential impact of offset (vertical) translation, amplitude scaling,

discontinuities and other distortions [78].

2.1.4 Temporal Data: Querying

Interactive query techniques for temporal data sets might be adapted for use with time

series, and vice-versa.

TVQL [60] is a visual query language for identifying relationships between events

of interest in multimedia (video) data. TVQL uses double-thumbed sliders to support

expression of queries involving relative temporal relationships between two subsets of

events chosen from multimedia annotation. Four sliders are provided, for specification

of relative time of, and elapsed time between, the start and end points of two subsets

(Figure 2.8). Query constraints are displayed to the user in a notation based on Allen’s

interval relationships [9]. TVQL’s dynamic query model provides fast updates.
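
Allen's interval relationships are determined entirely by comparisons between the start and end points of two intervals, which is why constraints on relative endpoint times are sufficient to express them. The following Python sketch classifies a pair of intervals into one of the thirteen relationships. It is offered only as an illustration of the underlying logic: the function name `allen_relation` and the relationship labels are choices made here, and TVQL's own notation is not reproduced.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end), assumed to satisfy start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Name the Allen relationship that holds between interval a and interval b."""
    a0, a1 = a
    b0, b1 = b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only partial overlap remains: the intervals share time but neither
    # contains the other and no endpoints coincide.
    return "overlaps" if a0 < b0 else "overlapped-by"


print(allen_relation((0, 2), (3, 5)))  # before
print(allen_relation((0, 3), (3, 5)))  # meets
print(allen_relation((1, 4), (3, 6)))  # overlaps
print(allen_relation((2, 3), (1, 5)))  # during
```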

TVQL was evaluated in a pair of studies. In the first, TVQL was compared to

TForms, a forms-based interface. TForms used pull-down menus and text entry fields

to express relationships between subsets of events. In a between-subjects study, each

participant was asked to interpret and express queries with one of the two interfaces.

Subjects took more time to learn the TVQL interface, but query interpretation was

significantly faster with TVQL. For query specification, TVQL was significantly faster

only on queries involving incremental modification, a validation of TVQL’s use of

dynamic queries. User questionnaires revealed no significant difference in preference

ratings. User comments included a desire to manipulate temporal diagrams directly,

suggesting an alternative interface design [61].

A second study compared TVQL to a paper timeline. Examination of the num-

ber and kinds of queries created in response to free-form queries revealed strengths

and weaknesses of both approaches. Users of TVQL took advantage of features that

displayed the frequency of events to answer questions involving frequencies of event

occurrence, while timeline users generally did not do the manual counting necessary

to generate these results. Timeline users found more trends than TVQL users, but they

also made more errors. Timeline users also failed to identify negative trends (“B never

follows directly after A”). These results supported the suggestion of incorporating

timeline views alongside the TVQL facilities [62].

TVQE [124] builds upon TVQL’s use of Allen’s intervals [9] to support dynamic

temporal queries in a relational environment. Based on a formal model of temporal

queries [121], TVQE uses a series of sliders, checkboxes, and other widgets to specify

the time intervals, scales, and relationships of interest. These temporal constraints are

combined with relational selections made from a graph view of a database schema, to

form a full temporal relational query. TVQE has been used to model histories of user

interactions. Specifically, histories of use of TVQE have been modeled within TVQE

[122].

Other efforts have involved the development of interactive tools for identifying

patterns in other forms of temporal data, such as music. For example, one system used

contours to allow users to specify sequences of pitch transitions of interest [22], in a

manner that is somewhat reminiscent of Shape Definition Language (SDL) [5].

2.1.5 Parallel Coordinates

Parallel Coordinates is a visualization technique for high-dimensional data sets. A

parallel coordinates display is built by laying out the dimensions in a data set with

a set of parallel (usually vertical) axes. Each axis provides a linear ordering of the

values for the corresponding dimension that are found in a data set. An item in a data

set is displayed on these axes by drawing a polyline connecting the points on each of

the axes that correspond to the values for that item in each dimension (Figure 2.9).

Originally developed by Alfred Inselberg, parallel coordinates have been extensively

studied [52, 53, 58, 67, 68, 89].
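
As a rough illustration of the construction described above, the following Python sketch computes the polyline vertices for each item. The function name, the unit spacing between axes, and the min-max scaling of each axis to the range [0, 1] are assumptions made for illustration and are not taken from any of the cited tools.

```python
from typing import List, Sequence, Tuple


def parallel_coordinates_polylines(
    items: Sequence[Sequence[float]],
) -> List[List[Tuple[float, float]]]:
    """Return one polyline per item as a list of (axis_position, scaled_value)."""
    if not items:
        return []
    n_dims = len(items[0])
    # Each axis is scaled independently using the extent of its own dimension.
    lows = [min(item[d] for item in items) for d in range(n_dims)]
    highs = [max(item[d] for item in items) for d in range(n_dims)]
    polylines = []
    for item in items:
        vertices = []
        for d in range(n_dims):
            span = highs[d] - lows[d]
            # A dimension with a single value is drawn at the middle of its axis.
            y = 0.5 if span == 0 else (item[d] - lows[d]) / span
            vertices.append((float(d), y))
        polylines.append(vertices)
    return polylines


# Hypothetical records: (cylinders, miles per gallon, price).
cars = [(4, 40.0, 15000.0), (8, 18.0, 50000.0), (6, 29.0, 32500.0)]
polylines = parallel_coordinates_polylines(cars)
```

Because every axis is scaled separately, the vertical position of a vertex is only meaningful relative to its own axis.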

A variety of techniques for “brushing” - selecting and highlighting points of inter-

est - in parallel coordinates have been developed, including direct manipulation sliders

and “painting” areas of interest through selections. These brushes can also be

composed through Boolean operators [89]. Additional work has been aimed at

supporting queries that identify specific correlations between adjacent dimensions through

“angular” brushes [58] or dialogs [144], and the use of “structure-based” brushes to

support navigation through clustered hierarchies of data [52]. Many of these tech-

niques are similar to approaches that have been used in TimeSearcher, while others

may provide the basis for future work.

Although parallel coordinates are not designed for use specifically with time series

data, many of the display and interaction techniques developed for parallel coordinates

may be applicable to time series data sets. In fact, TimeSearcher’s “graph overview”

display (Chapter 3) is visually similar to the patterns of overdrawn and possibly

crossing lines found in parallel coordinates (Figure 2.9).

Figure 2.9: A sample parallel coordinates visualization involving four dimensions

from a database describing automobiles [58].

There are two important distinctions between parallel coordinates and time series

data. In time series data, each measurement is made along a common scale, resulting in

common minimum and maximum values across all time points. In parallel coordinates,

the extents of each dimension can be different - perhaps involving categorical values.

Thus, parallel coordinates tools may support the inversion or “flipping” of an axis

in order to see patterns more clearly [58]. This operation would not be particularly

meaningful in time series data sets.

The other major distinction involves the ordering of axes. In time series data,

adjacency in the graph implies adjacency in time - time t should always come right

before time t + 1. No such adjacency is implied in parallel coordinates. In fact, some

tools support manual reordering of axes [58], and algorithmic techniques for finding

preferred orderings have been developed [68].

2.2 Data Mining

The combination of visualization tools and data mining approaches is intriguing,

as it offers the possibility of drawing on the strengths of two powerful

analysis approaches [118]. One example of this approach combines

textual mining for sequential patterns with an interactive, timeline-based visualiza-

tion [145].

2.2.1 Similarity Searching

The challenge of data mining in time series databases is generally defined in terms of

sequence similarity: given a set X of sequences and a query sequence Q, find all xi ∈ X

such that Q and xi are sufficiently similar. Alternatively, find the k nearest neighbors

to Q. Similarity in this context is generally defined in terms of Euclidean distance. A

substantial body of work aimed at addressing this question has been conducted over

the past several years.
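
The two query forms described above can be stated directly as brute-force scans. The Python sketch below is a baseline for illustration only; the function names are hypothetical, all sequences are assumed to have the same length as the query, and no indexing is attempted.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def similarity_range_query(
    sequences: Sequence[Sequence[float]], query: Sequence[float], eps: float
) -> List[int]:
    """Indices of all sequences within distance eps of the query."""
    return [i for i, s in enumerate(sequences) if euclidean_distance(s, query) <= eps]


def k_nearest_neighbors(
    sequences: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[int]:
    """Indices of the k sequences closest to the query, nearest first."""
    order = sorted(range(len(sequences)), key=lambda i: euclidean_distance(sequences[i], query))
    return order[:k]
```

Both functions examine every sequence in full, which is the cost that the indexing techniques discussed in this section aim to reduce.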

Much of this work has been based on the paradigm of dimensionality reduction

and spatial embedding, first introduced by Agrawal, Faloutsos, and Swami [3]. Noting

the curse of dimensionality associated with long time series, Agrawal et al. devised a

lower-dimensionality representation based on the Discrete Fourier Transform (DFT).

Specifically, they proved that the use of a representation based on the first few coef-

ficients of the DFT provided a lower bound on the Euclidean distance between two

sequences. In other words, if $D(\vec{x}, \vec{y})$ is the distance between two sequences, and

$D(\vec{X}_f, \vec{Y}_f)$ is the distance between the truncated DFT representations of x and y, then

$D(\vec{X}_f, \vec{Y}_f) \le D(\vec{x}, \vec{y})$.

This observation was used to form the basis for a search algorithm. Given a set

of sequences $x_i$, generate the truncated DFT representations $X_{i,f}$, and store them in a

spatial index (R∗ trees [15] were found to provide the best results). To find the

sequences similar (within distance ε) to a given query sequence q, derive the appropriate

DFT representation $Q_f$, and use the spatial index to identify all $X_{i,f}$ such that

$D(\vec{Q}_f, \vec{X}_{i,f}) \le \varepsilon$. As the distance in the DFT-space is a lower bound on the actual

distance, this will provide a superset of the desired results. For each item thus retrieved,

the actual sequence is retrieved, and its distance to the query is calculated, to filter out

any false alarms.
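
The filter-and-refine procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation from [3]: a linear scan over the truncated coefficients stands in for the R∗-tree, the DFT is computed directly with an orthonormal scaling (so that distances are preserved by the full transform), and the function names and the small numerical slack are choices made here.

```python
import cmath
import math
from typing import List, Sequence


def dft_prefix(x: Sequence[float], k: int) -> List[complex]:
    """First k coefficients of the orthonormal DFT of x."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(k)
    ]


def coefficient_distance(a: Sequence[complex], b: Sequence[complex]) -> float:
    """Euclidean distance between two truncated coefficient vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def sequence_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two raw sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def filter_and_refine(
    sequences: Sequence[Sequence[float]], query: Sequence[float], eps: float, k: int = 2
) -> List[int]:
    """Indices of sequences within eps of the query, using a DFT-based filter."""
    slack = 1e-9  # guards the filter step against floating-point rounding
    features = [dft_prefix(s, k) for s in sequences]  # would be stored in a spatial index
    query_features = dft_prefix(query, k)
    # Filter: the truncated distance never exceeds the true distance,
    # so no qualifying sequence is discarded here.
    candidates = [
        i for i, f in enumerate(features)
        if coefficient_distance(f, query_features) <= eps + slack
    ]
    # Refine: compute actual distances to remove false alarms.
    return [i for i in candidates if sequence_distance(sequences[i], query) <= eps]
```

The filter step can admit false alarms but cannot cause false dismissals, which is why the final distance check is sufficient to produce exact results.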

This work was later extended to support subsequence querying through the use

of sliding windows to create trails in feature-space. These trails are collected in

adaptively-defined minimum bounding rectangles to be used in the spatial index [49].

Subsequent research has extended and revised this general technique of

dimensionality reduction and spatial embedding in a variety of ways. Algorithms based on

wavelets [29] and singular value decomposition (SVD) [81] have also been proposed.

Yi and Faloutsos extended this model to handle similarities based on any Lp distance

measure [149].

Other work has examined the use of piecewise approximations to reduce

dimensionality. Piecewise Aggregate Approximations (PAA) divide a time series of

length n into a set of N values (N ≪ n) representing the average values in each of N

equal-sized “frames” [76]. These values can then be indexed using techniques

described by Faloutsos et al. [49]. The PAA model was later extended to achieve further

reduction through the use of adaptive frame lengths [75].
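
A minimal Python sketch of the equal-sized-frame form of PAA follows. The function name is hypothetical, and the sketch assumes that the series length n is an exact multiple of the number of frames N; the adaptive-frame extension is not shown.

```python
from typing import List, Sequence


def piecewise_aggregate(series: Sequence[float], n_frames: int) -> List[float]:
    """Reduce a series of length n to n_frames frame averages."""
    n = len(series)
    if n_frames <= 0 or n % n_frames != 0:
        raise ValueError("series length must be a positive multiple of n_frames")
    frame = n // n_frames
    return [
        sum(series[i * frame:(i + 1) * frame]) / frame
        for i in range(n_frames)
    ]


# Six values reduced to three frame averages: [2.0, 6.0, 3.0]
reduced = piecewise_aggregate([1, 3, 5, 7, 2, 4], 3)
```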

The notion of similarity between time series is often more subtle than simple Eu-

clidean distance. Time series that look very different may have small distances, while

series with the same general “shape” may have higher distances [78]. For example,

analysts might be interested in identifying sequences that are similar in shape but have

differing time scales, such as sine waves of differing frequencies. Dynamic time

warping approaches that minimize the error between template (query) sequence and

result sequences have been used to handle these queries [19]. Similar techniques have

been proven particularly effective when used in combination with spatial-embedding

algorithms [150]. More recently, piecewise constant approximations have been used

to provide the basis for indexing of dynamic time warping [74].
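
For reference, the textbook dynamic programming formulation of dynamic time warping is sketched below in Python. It is a generic, unconstrained version given for illustration and is not necessarily the formulation used in the cited work; the function name and the use of squared differences as the local cost are assumptions.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two sequences of any lengths."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matches the previous element of b
                cost[i][j - 1],      # b[j-1] also matches the previous element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return math.sqrt(cost[n][m])
```

A sequence and a uniformly stretched copy of it have a warping distance of zero, which corresponds to the "same shape, different time scale" case described above.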

Other transformations that might be of interest include scaling in the value (as

opposed to time) dimension, translation, and noise. Agrawal, et al. approach this

problem by generating scaled windows that represent similar subsequences. These se-

quences can then be stitched together to find matches of maximal length [4]. Rafiei

and Mendelzon proposed a model for handling translations, scalings, and moving av-

erages within the reduced-dimensionality, spatial indexing approach described above.

Queries are expressed in terms of similarity to the result of subjecting a given time

series to one or more of a set of transforms. These queries are evaluated by applying

the given transform to the original index on the fly, and then post-processing based on

actual distances [104].

2.2.2 Inverse Queries

Although similarity queries have been the focus of extensive research, other queries

may be useful. Lin, et al. identified two basic classes of queries on time series:

1. Forward queries ask for values at specific points, or value ranges during given

intervals.

2. Inverse queries ask when the time sequence had a given value or fell within a

given range [86].
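
The difference between the two classes can be made concrete with a pair of naive functions over a single series. The Python sketch below uses linear scans and is not the IP-Index; the function names and the use of inclusive, zero-based time indices are assumptions made for illustration.

```python
from typing import List, Sequence


def forward_query(series: Sequence[float], start: int, end: int) -> List[float]:
    """Values observed during the time interval [start, end], inclusive."""
    return list(series[start:end + 1])


def inverse_query(series: Sequence[float], low: float, high: float) -> List[int]:
    """Time points at which the value fell within [low, high]."""
    return [t for t, value in enumerate(series) if low <= value <= high]
```

A forward query is addressed by time and returns values, while an inverse query is addressed by value and returns times; without an index on the value dimension, the latter requires examining every point.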

After observing that forward queries can be efficiently supported by a variety of

indices, Lin, et al. introduce the IP-Index for handling of inverse queries. The IP-

Index divides a time series into one-dimensional projections in the value dimension.

These projections are then stored in an ordered indexing structure for efficient retrieval.

The IP-Index can also be used with appropriate interpolation to handle continuous data

[86]. This work was later extended by the SIQ-Index, which was based on the observation

that the IP-Index did not scale well for non-periodic data. The SIQ-index stores the

one-dimensional projections in an R∗-tree, using the trails technique [49] to derive

efficient minimum-bounding rectangles [93].

2.2.3 Outlier Detection

Outlier, or “deviant,” detection has also been a topic of interest. From a practical

viewpoint, identification of outliers can be useful for compression: if outliers are stored

explicitly, the size of the resulting index structures might be reduced without increased

error. From a data mining viewpoint, outliers can be interesting in their own right,

as they indicate differences from common or expected cases. Approaches to outlier

detection are generally based on error minimization given parameterized constraints

on storage. Proposals include dynamic programming approaches that minimize er-

rors associated with bucketing histograms [69], and modifications of SVD indexing

algorithms to include calculations of appropriate deviants [81].

2.2.4 Query Specification

By focusing on similarity search, many of the proposed data mining algorithms

eliminate the need to address the issue of query specification: the query is simply a time

series sequence. One exception is Agrawal et al.’s Shape Definition Language

(Section 2.1.3), which specifies queries in terms of natural language descriptions of profiles

(e.g., “(zero appears up up down)”) [5]. User interaction issues with similarity-based

data mining were also addressed by Keogh and Pazzani’s proposal for the use of rele-

vance feedback for retrieving patterns from time series data [78] (Section 2.1.3).

2.2.5 Other Approaches

Further work in time series data mining has been aimed at exploring alternative query

techniques and identifying more specific structures and patterns. Probabilistic search

methods were suggested by Keogh and Smyth [79]. To find interesting trends in fi-

nancial data, Povinelli used genetic algorithms to find clusters of interesting time se-

quences in a multi-dimensional space [101]. Other examples include algorithms for

identifying partial periodic patterns - repeating patterns at some, but not all points in

time [56], mining time series for intervals of interest [139], rule discovery [37],

online analysis of multiple sequences [151, 153], and mining at different time granu-

larities [20].

Algorithms from string-searching research have also been adapted to handle

queries over time series data. Suffix trees have been used as indices on time series

data that has been converted into a discrete alphabet [66]. Other work involved the

adaptation of the Knuth-Morris-Pratt string searching algorithm to handle general

predicates involving relative changes in value [108]. The EMMA (Enumeration of

Motifs through Matrix Approximation) algorithm uses a discretized representation of a Piecewise

Aggregate Approximation (PAA) [76] to support searches for similar subsequences,

known as motifs [85].

2.3 Databases

Temporal databases research has been ongoing for many years [71]. Numerous tempo-

ral query languages have been proposed [33], TSQL2 most prominently [126]. These

systems generally handle both time point and interval-based temporal data, making

them suitable for time series data sets. Other systems, such as SEQ [112, 113, 114]

are specifically designed to store and index sequence data, and thus may be of

particular interest for time series. As is the case with most database research, these

projects have focused on data representation, query semantics, and query processing,

with little discussion of user interfaces.

Interest in temporal databases has also led to the development of graphical query

languages for temporal relational data. Graphical queries built on top of the entity-

relational model [82, 132] augment familiar entity-relationship models to handle tem-

poral queries. Alternative models such as GTL [96] take different approaches.

These languages might be used as the basis for interactive querying systems [132],

but their use in such environments does not support dynamic queries. Instead, queries are

translated to an underlying relational query, which is then evaluated in a batch mode.

Researchers in spatio-temporal and multimedia databases have developed a variety

of interactive querying mechanisms, including visual languages that specify abstract

depictions of the time changes of interest [23, 47], and interactive systems where the

user’s manipulation of an icon on the screen specifies a query trajectory [39].

2.4 Discussion

This survey of related work illustrates the breadth of issues related to time series (and

more generally, temporal) data.

Visualizations of time series data illustrate the various perspectives that may be ap-

propriate for interpreting these data sets. Factors such as periodicity [25, 27], multiple

scales of resolution [73, 102, 138], and the need to display multiple variables at each

time period [27, 102] lead to a variety of ways to display and interpret these data sets.

Although QuerySketch [141] and Spotfire [127] provide tools for querying these data

sets, the possibilities for interactive visualization have not been exhaustively explored.

Temporal and spatio-temporal database research suggests the possibility of adapt-

ing query tools for time series data to work directly with more general databases. Al-

though these tools often handle time intervals that are more general than time series,

a tool for querying time series data might be used as a front-end to appropriately-

structured temporal databases, perhaps via translation into a temporal query language

[71, 126]. Similarly, time series querying tools might be useful for specifying tem-

poral constraints on spatio-temporal databases, raising the possibility of comparison

between models that combine spatial and temporal constraints in one query mecha-

nism [23, 47], and tools that separate the two. More generally, tools developed for

time series data might be extended to handle queries over temporal intervals,

providing functionality similar to that of TVQL [60].

Interactive tools and visualizations have almost exclusively focused on searches for

patterns involving well-specified changes over well-defined time periods. Data mining

algorithms are generally much more ambitious, as they often address the challenge of

finding patterns that occur at arbitrary times and are “similar” in some general manner

that can often account for variations in scale and duration, discontinuities, and other

idiosyncratic features [4, 19, 104, 150]. Efforts to find trends or “events” that are of

interest [4, 5, 37] are similar in spirit to the goals of interactive query tools.

Combining the interactivity of dynamic query tools with the power of these data

mining approaches presents several challenges. A query interface that supports these

algorithms must include mechanisms for specifying tolerances of approximate fits,

lengths of allowable gaps, tolerances in time dilation or contraction, and other con-

straints. Query result display would be equally challenging, as any output would need

to display not only the results themselves, but sufficient contextual information to ex-

plain why the result was a match. Furthermore, the implementation of systems that

practically combine the rapid, incremental updates of information visualization with

the computational requirements of data mining may be difficult.

Chapter 3

Timeboxes: Interactive Temporal Query Widgets

Timeboxes are rectangular query regions drawn directly on a two-dimensional display

of time series data. The data set is assumed to consist of some number of items (n),

each of which has a measurement at each of m time points. The extent of the Timebox

on the time (x) axis specifies the time period of interest, while the extent on the value

(y) axis specifies a constraint on the range of values of interest in the given time period.

More specifically, we assume that $t_i \in T$ is an item in a time series data set, $t_i(j)$ is the

value of $t_i$ at time $j$, and a timebox is a 4-tuple: $b = (t_{min}, t_{max}, v_{min}, v_{max})$. We say that

$t_i$ satisfies the timebox $b$ if $\forall t,\ t_{min} \le t \le t_{max}: v_{min} \le t_i(t) < v_{max}$ (assuming $v_{max} \ge v_{min}$ and

$t_{max} \ge t_{min}$)¹.

We assume that the temporal data is divided into discrete time points of granularity

determined by each data set. The discrete nature of the data is enforced by constrain-

ing timeboxes to occupy an integral number of time points. Multiple timeboxes can

be drawn to specify conjunctive queries. Items in a data set must match all of the

constraints implied by the active timeboxes in order to be included in the result set.
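
The timebox definition and the conjunctive combination of multiple timeboxes can be sketched as follows in Python. This is an illustrative sketch rather than TimeSearcher's implementation: the class and function names are hypothetical, time points are assumed to be zero-based integer indices, and the strict upper bound follows the formal definition as written in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Timebox:
    t_min: int
    t_max: int
    v_min: float
    v_max: float

    def satisfied_by(self, series: Sequence[float]) -> bool:
        """True if the series stays within the value range at every covered time point."""
        # Lower bound inclusive, upper bound strict, as in the definition above.
        return all(
            self.v_min <= series[t] < self.v_max
            for t in range(self.t_min, self.t_max + 1)
        )


def conjunctive_query(
    data: Dict[str, Sequence[float]], boxes: Sequence[Timebox]
) -> List[str]:
    """Names of the items that satisfy every active timebox."""
    return [
        name for name, series in data.items()
        if all(box.satisfied_by(series) for box in boxes)
    ]
```

With no active timeboxes the query places no constraints on the data, so every item is returned; each additional timebox can only shrink the result set.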

Creation of timeboxes is straightforward: the user simply clicks on the desired

starting point of the timebox and drags the pointer to the desired location of the

opposite corner. As this is identical to the mechanism used for creating rectangles in widely

used drawing programs, this operation should be familiar to most users. As the box

is drawn, the interaction handler responsible for the drawing of the box will force the

box to occupy an integral number of time points.

¹ This model can be easily extended to account for time series containing several variables, each of

which is measured at each time point. See Section 4.3.

Timeboxes are drawn to extend beyond the time points covered by one-half interval

on either side. Thus, a timebox that covers time periods 2-5 (inclusive) will have

its left side halfway between 1 and 2 and its right side halfway between 5 and

6. This avoids difficulties in interpretation that might arise if the vertical sides of a

timebox were aligned with (or close to) the vertical line through a time point.
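
The relationship between the time points covered by a timebox and the edges that are drawn can be sketched as below. The half-interval extension follows the description above; the rule used to snap a drag to integral time points (rounding each horizontal coordinate to the nearest time point) is an assumption, as are the function names.

```python
import math
from typing import Tuple


def snap_to_time_points(drag_start_x: float, drag_end_x: float) -> Tuple[int, int]:
    """Convert the horizontal extent of a drag into covered time points (t_min, t_max)."""
    low, high = sorted((drag_start_x, drag_end_x))
    # Round half up so that the result does not depend on banker's rounding.
    return math.floor(low + 0.5), math.floor(high + 0.5)


def drawn_edges(t_min: int, t_max: int) -> Tuple[float, float]:
    """Left and right edges of the rectangle, one-half interval beyond the covered points."""
    return t_min - 0.5, t_max + 0.5


# A timebox covering time periods 2-5 is drawn from 1.5 to 5.5.
edges = drawn_edges(2, 5)
```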

Once the timebox is created, it may be dragged to a new location or resized via

appropriate resize handles on the corners, using similarly familiar interactions. In all

cases, the query is re-processed with each mouse event. When a user action leads to

a modification of a timebox, the new position of the timebox is stored, the query is

updated, and the new result set is displayed.

Construction of timeboxes is aided by drawing all of the items in the data set di-

rectly on the query area. This graph overview display provides additional insight into

the density, distributions, and patterns of change found among items in the data set

(Figure 3.1).

The example data set shown in Figure 3.1 contains weekly stock prices for 1430

stocks and will be used in a brief scenario to illustrate the use of timeboxes. An an-

alyst interested in finding stocks that rose and then fell within a four-month period

might start by drawing a timebox specifying stocks that traded between $70 and $190

during the first few weeks. When this query is executed, the graph overview is up-

dated to show only those records that match these constraints. We can quickly see that

this query substantially limits the number of items under consideration, but many still

remain (Figure 3.2).

Figure 3.1: A graph overview, formed by superimposing the time series for all of the

items in the data set.

Figure 3.2: A single timebox query, for items between $70 and $190 during weeks 1-5.

Figure 3.3: A refinement of the query in Figure 3.2.

To find stocks in this restricted set that dropped in subsequent weeks, the user

draws a second box, specifying items that traded between $12 and $80 during weeks

10-12 (Figure 3.3). A third box, specifying a higher price range ($60-$120) during

weeks 19-24 completes the query (Figure 3.4).

As timeboxes are added to the query, the graph overview provides an ongoing

display of the effects of each action and an overview of the result set. Once created, the

timeboxes can be scaled or moved singly or together to modify the query constraints.

The use of simple, familiar idioms for creation and modification of timeboxes sup-

ports interactive use with minimal cognitive overhead. Rapid (<100ms), automatic

query processing on mouse-up events provides the virtually instantaneous response

necessary for dynamic queries, thus supporting interactive data exploration. Users can

easily and quickly try a wide range of queries, modifying these queries to quickly see

the effects of changes in query parameters. This ability to easily explore the data is

helpful in identifying specific patterns of interest, as well as in gaining understanding

of the data set as a whole.

Figure 3.4: A complex query containing three timeboxes.

3.1 Anyof Semantics

Alternative interpretations of timeboxes are also possible. For example, a disjunctive

timebox might require that a time series have a value in the specified

range for at least one time point during the specified interval, as opposed to all of those time

points. For these anyof timeboxes, we say that $t_i$ satisfies timebox $b = (x_1, x_2, y_1, y_2)$ if

$\exists x,\ x_1 \le x \le x_2$, such that $y_1 \le t_i(x) \le y_2$.
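
In code, the anyof interpretation differs from the standard one only in the quantifier. The Python sketch below shows the two side by side; the function names are hypothetical, time points are assumed to be zero-based indices, and both functions use the inclusive bounds of the anyof definition for ease of comparison.

```python
from typing import Sequence


def satisfies_anyof(series: Sequence[float], x1: int, x2: int, y1: float, y2: float) -> bool:
    """True if the series is within [y1, y2] at some time point in [x1, x2]."""
    return any(y1 <= series[x] <= y2 for x in range(x1, x2 + 1))


def satisfies_allof(series: Sequence[float], x1: int, x2: int, y1: float, y2: float) -> bool:
    """True if the series is within [y1, y2] at every time point in [x1, x2]."""
    return all(y1 <= series[x] <= y2 for x in range(x1, x2 + 1))
```

Any series that satisfies the all-points form over a non-empty interval also satisfies the anyof form, so an anyof timebox never returns fewer items than the corresponding standard timebox with the same bounds.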

3.2 Variable Time Timeboxes

As defined above, the basic timebox is limited to expressing queries with fully-defined

time and value constraints. Additional expressive power might be gained by extending

the model in a manner that relaxes these constraints. One possibility would be to

support searches for items that fall within a given value range during some interval

of a given duration that falls within some longer window of time. For example, stock

analysts might want to identify stocks that traded between $30 and $60 for some 3

month period anytime between January and August (inclusive). These queries - known

as variable time timeboxes (VTTs) - are the simplest extension to the timebox model.

We have developed several additional extensions to the timebox model [63]: these

are described in more detail in Chapter 9. In collaboration with Eamonn Keogh, a

preliminary implementation of variable time queries (query 2) was developed

and evaluated through a preliminary study [77].

Formally, a variable time timebox (VTT) is defined as two points (x1,y1) and

(x2,y2) and a single integer R. The VTT provides a constraint on a time series such

that for the time range x1 ≤ x ≤ x2, the dynamic variable must have a value in the range

y1 ≤ y ≤ y2 for at least R consecutive time units (assuming y2 ≥ y1 and x2 ≥ x1) (Figure

3.5). Under this formalism, a VTT with a value of R = x2 − x1 is simply a standard

timebox.
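
A minimal Python sketch of the VTT test follows. It is an illustration under stated assumptions: time points are zero-based indices, and "R consecutive time units" is read as a run of R + 1 consecutive in-range time points, which is the reading under which R = x2 − x1 reduces to a standard timebox over the points x1 through x2. The function name is hypothetical.

```python
from typing import Sequence


def satisfies_vtt(
    series: Sequence[float], x1: int, x2: int, y1: float, y2: float, r: int
) -> bool:
    """True if the series stays within [y1, y2] for r consecutive time units in [x1, x2]."""
    needed = r + 1  # a span of r time units covers r + 1 consecutive time points
    run = 0
    for x in range(x1, x2 + 1):
        if y1 <= series[x] <= y2:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0  # the run must be unbroken
    return False
```

The scan keeps only the length of the current in-range run, so a single pass over the window is sufficient.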

Graphically, VTTs are represented as outline boxes that surround a traditional time

box. When initially created, the VTT has a value of R = x2 − x1. By clicking and

dragging the sides of the internal rectangle, the user can adjust the value of R.

Figure 3.5: A variable time timebox, specifying that for at least R consecutive time

periods between x1 and x2, items must have values in the range y1 ≤ y ≤ y2.

A user study was conducted to evaluate the claim that VTTs would be particularly

useful for the task of separating large data sets into disjoint classes. In particular,

the hypothesis was that VTTs would be more effective than standard timeboxes in this

separation task. Ten undergraduate students performed a series of tasks with two data

sets. These tasks were aimed at measuring their ability to create queries that separated

each data set into two disjoint partitions. The quality of separation was measured by

subtracting the number of false positives from the number of items correctly separated,

and normalizing by the size of the data set. Using this measure, VTTs appeared to have

a significant advantage in the quality of results [77]. It should be noted that these results

applied only to the creation and interpretation of single queries, and therefore might

not generalize to interactive dynamic query environments.
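
The separation measure used in the study can be written as a one-line computation. The Python sketch below follows the verbal description given above; the function and parameter names are hypothetical, and the example counts are invented for illustration and are not results from the study.

```python
def separation_quality(correct: int, false_positives: int, dataset_size: int) -> float:
    """(items correctly separated - false positives) / size of the data set."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return (correct - false_positives) / dataset_size


# Invented example: 40 items correctly separated and 5 false positives out of 100 items.
score = separation_quality(40, 5, 100)  # 0.35
```

Under this measure a query that selects as many false positives as correct items scores zero, regardless of how many items it retrieves.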

Variable Time Timeboxes are based on a model of placing a timebox within a larger

region that provides additional constraints. This model can easily be

generalized to support other extensions that increase the expressivity of the timebox model.

Queries with variability in value instead of time - Variable Value Timeboxes (VVTs)

- might be formed by providing vertical variability, instead of horizontal. Vertical and

horizontal variability might be combined to provide queries that support variability in

both time and value. These and other extensions are discussed in Chapter 9.

3.3 Timeboxes in the Context of Information Visualization Research

The timebox is an incremental extension to previous work on development of widgets

for dynamic queries in information visualization environments. The ancestry of time-

boxes can be traced back to one-dimensional range sliders, which extended traditional

GUI sliders. Range sliders allowed users to adjust values from both ends (instead

of only one end), and to move the entire range of interest by dragging the middle of

the slider [117]. Multiple range sliders can be combined to support searching in

multiple dimensions. This approach has been used in a variety of systems, often with

augmented displays aimed at linking the various dimensions. For example, the In-

fluence Explorer used multiple range sliders with histograms and lines linking items

across each dimension to display the relative influence of various dimensions (Fig-

ure 3.6) [134]. Similar techniques have been used for selection and filtering of items

in parallel coordinates displays, both with implicit (Figure 3.7) and explicit (Figure 3.8)

range sliders.

These approaches share the common limitation of controlling only one or two di-

mensions at any given time. To adjust constraints on multiple dimensions, users must

adjust multiple brushes or controls individually. For high-dimensional data sets, this can

become tedious. Two-dimensional widgets have been suggested as an approach to improve

this situation. These widgets might be used to specify single points for two variables

or to select a range in 2D space (Figure 3.9) [117]. Points could be selected with a

single click, and ranges would be modified by moving and resizing the area of interest.

Unfortunately, these widgets have not been widely used. Furthermore, 2D widgets

require identification of related pairs of variables that might effectively be combined

into a single widget.

Figure 3.6: The Influence Explorer: Range sliders on the “brightness” and “working

life” dimensions select the ranges of interest. Histograms with each variable

indicate the number of items having various values of that variable, and lines between

histograms indicate the values of a selected item [134].

“Brushing” is another related technique that uses two-dimensional graphical re-

gions to select items from a scatterplot or other display. Given a 2D-scatterplot of

items in a data set, a brush is a rectangular region that can be drawn to “lasso” and se-

lect items of interest. For higher-dimensional data sets, multiple, repeated scatterplots

with differing brushes might be used [89].

Like 2D range sliders and brushes, timeboxes are rectangular regions that can be

created, moved, and scaled to specify and modify query constraints. However, time-

boxes are significantly more expressive. 2D widgets and brushes express constraints

on two dimensions, but a timebox constrains an arbitrary number of values.

Specifically, a data set containing m time points can be seen as an m-dimensional

data set - each item in the data set is a single point in $\mathbb{R}^m$. In this space, a timebox of

width m′ < m simultaneously constrains m′ dimensions. Of course, these constraints

are not independent, as values in each of the m′ dimensions are required to fall within

the same range.

Figure 3.7: XmdvTool [89]: The highlighted items have been selected by “brushing”.

Once the brush is created, the highlighted areas on any given axis can be moved or

resized [148].

Figure 3.8: Explicit range sliders in CityOScope’s parallel coordinates display. Arrows

at the top and bottom of each axis can be used to limit the range of interest [53].

Figure 3.9: Two dimensional query widgets: (a) A point query indicating an exact

number of bedrooms and cost of a home. (b) A range of number of bedrooms and

cost [117].

This coverage of multiple dimensions provides an increase in expressive power.

Each movement of a timebox in the value dimension (vertically) results in changes to

m′ constraints. This is a significant increase over range sliders or brushes, which would

require either m′ (for 1D range sliders) or $\lceil m'/2 \rceil$ (for 2D range sliders or brushes)

separate modifications.
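
The counting argument above can be illustrated by expanding a timebox into the per-dimension range constraints that it implies. The Python sketch below is for illustration only; the function names and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, Tuple


def timebox_as_constraints(
    t_min: int, t_max: int, v_min: float, v_max: float
) -> Dict[int, Tuple[float, float]]:
    """One identical value-range constraint for each time point covered by the timebox."""
    return {t: (v_min, v_max) for t in range(t_min, t_max + 1)}


def modifications_for_vertical_move(m_prime: int) -> Dict[str, int]:
    """Separate widget adjustments needed to shift the value range on m_prime dimensions."""
    return {
        "timebox": 1,
        "1d_range_sliders": m_prime,
        "2d_widgets_or_brushes": math.ceil(m_prime / 2),
    }


# A timebox covering time points 2-5 stands for four identical range constraints.
constraints = timebox_as_constraints(2, 5, 70.0, 190.0)
```

All of the generated constraints share the same value range, which reflects the observation that the constraints expressed by a single timebox are not independent.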

The ability to move and rescale timeboxes provides further power. Modifications

that add or subtract time periods effectively add and remove constraints from the query.

To do this in an interface based on 1D (or even 2D) sliders or controls, users would

have to add or remove each control manually.

There are significant similarities between timeboxes and some of the parallel co-

ordinates displays that have been developed. The “graph overview” display is partic-

ularly reminiscent of the overlapping drawn lines found in parallel coordinates (Fig-

ures 3.7 and 3.8). In theory, a rectangular brushing facility in a parallel coordinates

display would likely be very similar to a timebox query. However, characteristics of

the data sets involved provide an important difference between timeboxes and parallel

coordinates displays.

Time series data has significant auto-correlation - for a given item, the value at time

t is closely related to values at times t − 1 and t + 1, and perhaps less so to the value at time

t + 10 [77]. As consecutive measurements are related, it makes sense to use timeboxes

to express a given constraint over multiple consecutive measurements.
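
One common way to quantify this relationship is the sample autocorrelation at a given lag. The Python sketch below uses a standard estimator and is offered as an illustration; it is not a computation taken from [77], and the function name is hypothetical.

```python
from typing import Sequence


def autocorrelation(series: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    variance_sum = sum((v - mean) ** 2 for v in series)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum
```

For a smoothly varying series, the value at lag 1 is typically close to 1 and the value falls off as the lag grows, which matches the intuition that nearby time points are more closely related than distant ones.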

For parallel coordinates, the situation is quite different. Although some tools sup-

port manual reordering of axes [58], and algorithmic methods for identifying preferred

orderings have been developed [68], adjacent axes do not necessarily have any rela-

tionship or correlation that can easily be expressed in a single set of constraints. In

fact, parallel coordinate graphs with wide swings from one axis to the next are common

(Figure 3.7).

Timeboxes are based on the assumption that each measurement is made on a com-

mon scale, and that all values will fall between some global minimum and maximum

value. This assumption may not hold for parallel coordinates displays. For example,

a data set involving cars may have ranges of 4-12 for the number of cylinders, 18-40

for miles/gallon, and $15,000-$50,000 for price. It is not clear how, if at all, a timebox

or similar query could be used to simultaneously express constraints on these three

variables.

As a result of these differences, brushing facilities in parallel coordinates displays

tend to provide brushes that resemble contours over subsets of the axes (Figures 3.7

and 3.8). These brushes are semantically similar to timeboxes, but the manipulations

required are substantially different.


Chapter 4

TimeSearcher

TimeSearcher uses timeboxes to pose queries over a set of entities with one or more

time-varying attributes. Entities have one or more static attributes, and one or more

time-varying attributes, with the number of time points and the interpretation of those

points being the same for every entity in a given data set.

When a data set is loaded, entities in the data set are displayed in a window in the

lower left-hand corner of the application. Each entity is labeled with its name, and

the values of the active dynamic attribute are plotted in a line graph. Complete details

about the entity (details-on-demand) can be retrieved by simply clicking on the graph

for the desired entity: this will cause the relevant information to be displayed in the

upper right-hand window (Figure 4.1).

The top-left corner of the TimeSearcher window is the query input space. This

space initially contains an empty grid. To specify a query, users simply draw a timebox

in the desired location. The query is re-processed with every mouse event. Thus, as

the box is drawn the results are continuously and implicitly updated, without the need

for explicit user action.

When query processing completes, the display in the bottom half of the application

window is updated to show those entities that match the query constraints. For each of


Figure 4.1: The TimeSearcher application window. Clockwise from upper-left: query

space (with data envelope, query envelope, and graph overview), details-on-demand,

item list, range sliders for query adjustment, and data items.

these entities, the time points that match the query are highlighted, in order to simplify

the interpretation of the display. This matching will depend upon the type of query: for

standard timeboxes, the points that are highlighted will be exactly those points that are

contained in the original timebox. For anyof timeboxes (Section 3.1) and variable time

timeboxes (Section 3.2), only those points from a given entity that match the query

will be highlighted in the display window (Figure 4.2).

Once the initial query is created, the timeboxes can be moved and resized. The


Figure 4.2: Partial results from a timebox query, with time points that match the query

highlighted. Items in the result set differ in the points that match the query, indicating

an anyof or variable time timebox.

hand and box icons on the upper toolbar are used to switch between creating timeboxes

and moving/resizing them. As is the case with initial timebox creation, the query is

reprocessed with each mouse event.

Although somewhat less than ideal, this switching between the drawing and modification

modes is necessary for proper operation. When in the timebox creation mode, a click

on the background of the query space is interpreted as the specification of the upper-

left corner of a new timebox. However, that same action is interpreted as the start of a

selection lasso when in timebox modification mode. Elimination of the modes would

require an additional input mechanism such as a shift key to disambiguate between


these modes.

When multiple timeboxes are present, they can be modified individually or simulta-

neously in groups of two or more. This functionality is particularly useful for searches

for complex patterns (Figure 3.4). In these cases, users can select some or all of the

timeboxes (using standard lasso and shift-click interactions) and simultaneously apply

the same translation and/or scale along either or both axes to all selected timeboxes.

This is useful for searching for instances of a pattern that vary slightly in scale or

magnitudes, or for modifying queries based on example items.

Timeboxes can also be adjusted via a pair of range sliders in the lower right-hand

corner of the screen. When a timebox is selected (or created), these range sliders are

initialized with the parameters of the timebox, with the top slider containing time ex-

tents and the bottom including values. As each dimension is adjusted separately by

its own slider, these controls support a degree of fine-tuning that might be difficult to

achieve by dragging the timeboxes. These sliders are disabled when multiple time-

boxes are selected.

A third mechanism is provided for modifying the value range of timeboxes. The

textual labels above the value range sliders are editable, allowing users to specify value

constraints by typing them in. Like the range sliders, these entry fields are disabled

when multiple timeboxes are selected.

Much of the research in mining of time series involves queries for items in a data

set that are similar to a specified query [4, 6, 19, 49]. TimeSearcher provides a simple

drag-and-drop mechanism for these “query-by-example” queries: the user can simply

click on an entry in the data display window, drag it into the query window, and release

the mouse to drop, thus instantiating a query.

The query resulting from a drag and drop has a separate timebox for each time


Figure 4.3: Drag-and-drop query-by-example, with results.

point in the data set. Each timebox has a width of one interval, with the query values

centered around the actual value of the attribute for that entity at the given time point.

The height of each timebox is set to be 10% of the total range of the attribute being

queried, so each timebox has a range of v±5% of the total range in the attribute value,

where v is the value of the template time series at the given time point (Figure 4.3).
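The construction of a query-by-example can be illustrated with a minimal Python sketch. It is not TimeSearcher's implementation: the Timebox tuple, the function name query_by_example, and the height_fraction parameter are hypothetical names introduced for illustration, and placing each one-interval box half an interval on either side of its time point is an assumption.

```python
from typing import List, NamedTuple, Sequence


class Timebox(NamedTuple):
    """Hypothetical container for one timebox (illustrative only)."""
    t_min: float  # left edge, in time-point units
    t_max: float  # right edge, in time-point units
    v_min: float  # lower value bound
    v_max: float  # upper value bound


def query_by_example(template: Sequence[float],
                     attr_min: float,
                     attr_max: float,
                     height_fraction: float = 0.10) -> List[Timebox]:
    """Build one timebox per time point, centered on the template's value."""
    # Box height is a fraction of the attribute's total range (10% by default),
    # so each box spans v +/- 5% of that range.
    half_height = height_fraction * (attr_max - attr_min) / 2.0
    boxes = []
    for t, v in enumerate(template):
        # One interval wide: half an interval on either side of time point t
        # (an assumption about how the one-interval width is placed).
        boxes.append(Timebox(t - 0.5, t + 0.5, v - half_height, v + half_height))
    return boxes


if __name__ == "__main__":
    for box in query_by_example([10.0, 20.0, 30.0], attr_min=0.0, attr_max=100.0):
        print(box)
```

Enlarging height_fraction in this sketch corresponds to loosening the definition of similarity, and dropping elements from the returned list corresponds to restricting the comparison to specific time points.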

The timeboxes in the resulting query can be modified to specify varying definitions

of similarity. For example, the boxes could be enlarged to allow for a looser

definition of similarity, or subsets of the query could be eliminated to focus on items


that are similar only at specific time points.

4.1 Overviews

TimeSearcher provides a limited overview display in the lower left-hand window, dis-

playing each of the entities in the data set in a linear list. As this display shows a small

number of items at any given time, it is not an effective overview. Another possible

overview would display each of the entities in a thumbnail graph. These thumbnails

would be displayed in a grid, instead of the linear arrangement shown in Figures 4.1

and 4.3. This approach suffers from two shortcomings. For any reasonably sized data

set (more than a few dozen items), the limited screen space available would restrict

each thumbnail to a tiny area of the screen, rendering it virtually unreadable. Fur-

thermore, displaying each entity in a separate graph may not help users in identifying

global trends, such as the extreme values of the time-varying attribute at any given

point in time.

TimeSearcher provides another form of overview by displaying the extreme values

that can be found in the data set at each time point. Known as a “data envelope”, this

overview is optionally shown in the background of the query window as a contour that

follows the extreme values of the query attribute at each point in time, thus displaying

the range of values that may be queried (Figure 4.4). When the user executes a query,

the data envelope is extended by a “query envelope” - an overlay that outlines extreme

values of the entities in the result set (Figure 4.5). This display provides users with a

graphic summary of the relationship between the result set and the data set as a whole.
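Both envelopes reduce to the same computation applied to different sets of items. The following Python sketch shows one way this might be done; the function name envelope and the list-of-lists data layout are assumptions made for illustration, not details taken from TimeSearcher.

```python
from typing import List, Sequence, Tuple

# (minimum, maximum) of the queried attribute at each time point
Envelope = List[Tuple[float, float]]


def envelope(profiles: Sequence[Sequence[float]]) -> Envelope:
    """Per-time-point extremes over a collection of equal-length profiles."""
    if not profiles:
        return []
    # zip(*profiles) groups the values of all profiles by time point.
    return [(min(column), max(column)) for column in zip(*profiles)]


if __name__ == "__main__":
    data_set = [[1, 5, 3], [2, 4, 9], [0, 6, 2]]
    result_set = data_set[:2]  # stand-in for the items matching a query
    print("data envelope: ", envelope(data_set))
    print("query envelope:", envelope(result_set))
```

Because the result set is a subset of the data set, the query envelope computed this way always lies within the data envelope.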

Without any timeboxes present, the data envelope highlights areas that would be

fruitful for query creation, while leaving empty areas unmarked. For example, the data


Figure 4.4: Query window with data envelope.

Figure 4.5: Query display with data and query envelopes.

envelope in Figure 4.4 does not extend to the upper right-hand corner, so queries in

that region would not return useful results. When a timebox is created, the updated

query envelope shows the differences between the current result set and the data set as

a whole, thus clarifying the range of values excluded by the timebox. The query enve-

lope also guides the creation of additional timeboxes, as queries outside this envelope

will not match any records.

The graph overview (Chapter 3) provides further support for browsing the data

set. When the user mouses over a graph envelope line, the line is highlighted, thus

displaying the individual item in the context of the larger data set. At the same time,

the name of the item is displayed as a tooltip, along with the value of the item at the


time point closest to the point where the mouse-over occurred. The item list, item

display window, and details-on-demand window are also updated to display the

selected item. This tight coupling in response to lightweight mouse movement will

encourage exploration based on visual examination of the graph overview.

Overdrawing and visual clutter might cause the graph overview display to become

less useful for large data sets. Furthermore, the computational overhead of drawing

the graph overviews and processing the mouse-over handling can lead to substantial

performance degradation when graph overviews are used with these data sets.

To avoid these difficulties, TimeSearcher supports the possibility of graceful degra-

dation between overviews. For large result sets, the data and query envelopes will be

shown. When user queries reduce the size of the result set below a user-specified threshold

(set to 100 items by default), the graph envelopes will be displayed. The use of

graph overviews for smaller result sets and data/query envelopes for larger result sets

thus provides an example of a dynamic decision regarding the tradeoff between high-

resolution overviews and performance.

4.2 Leaders & Laggards

The analysis of time series data sets frequently includes a search for items with behav-

ior trends that somehow anticipate changes that will eventually be seen in other items

in the data set. For example, stock market analysts might look for a given stock that

dropped sharply shortly before other stocks in the same sector experienced a similar

decline in price. Similarly, biologists looking at microarray experiments (Section 8.1)

might be interested in finding a gene or EST that has a sharp increase in expression lev-

els immediately before a group of genes has a similar increase. Such a finding might


form the basis for the hypothesis that the first gene is a regulatory gene that plays a

role in stimulating the expression of the other genes.

TimeSearcher provides a mechanism to support this search for “Leaders & Lag-

gards”. After creating a timebox query that identifies the set of items with a trend of

interest, the user presses the toolbar button with the parallel arrows (or selects “Set

Leaders” from the Edit menu) to invoke this “leaders” mode. The query window will

then be split into two sub-windows:

• The top, “leader” window will contain the specified query, along with the items

that match the leader query.

• The lower, “laggard” window contains the items in the original query in outline,

along with one new query box for each timebox in the original query. These new

query boxes will be offset by one time period (to the right if possible, if not, to

the left) from their original counterparts.

An example query, and its use as a “leader”, are shown in Figure 4.6.
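The derivation of the initial laggard query can be sketched in Python as follows. This is an illustration under stated assumptions rather than TimeSearcher's code: the Timebox tuple and the function initial_laggard_query are hypothetical, time points are assumed to be indexed 0 to num_time_points − 1, and the whole query is assumed to shift in a single direction so that the pattern is preserved.

```python
from typing import List, NamedTuple


class Timebox(NamedTuple):
    """Hypothetical timebox covering time points t_min..t_max inclusive."""
    t_min: int
    t_max: int
    v_min: float
    v_max: float


def initial_laggard_query(leader_boxes: List[Timebox],
                          num_time_points: int,
                          offset: int = 1) -> List[Timebox]:
    """Shift every box right by `offset` periods if all fit, otherwise left."""
    last = num_time_points - 1
    if all(b.t_max + offset <= last for b in leader_boxes):
        step = offset        # room on the right: laggards follow the leader
    elif all(b.t_min - offset >= 0 for b in leader_boxes):
        step = -offset       # no room on the right: fall back to the left
    else:
        raise ValueError("query cannot be shifted within the time range")
    return [b._replace(t_min=b.t_min + step, t_max=b.t_max + step)
            for b in leader_boxes]


if __name__ == "__main__":
    leader = [Timebox(2, 4, 10.0, 20.0)]
    print(initial_laggard_query(leader, num_time_points=10))
```

Larger values of offset correspond to the later manual adjustments that let users search for items that lead or lag by an arbitrary number of time points.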

Once the leader and laggard windows have been created, the user can use the stan-

dard mechanisms to modify the query in the laggard window as desired. Thus, the

user can find items that lead or lag by an arbitrary number of time points, or that have

a wider range of values than the original query, etc.

Items in both the original leader result set and the laggard results are displayed in

both the lower-left display window and the window containing the list of item names.

In the display window, items that match the leader query are indicated by the label

“leader”, and the time points that match the leader query are highlighted distinctively.

Similarly, in the item list, leader item names are highlighted in a color that matches the

display of the query in the leader window and the leader label in the display window

(Figure 4.7).


Figure 4.6: The query window displaying a “leaders & laggards” query. The top

window shows leaders, with the original query in magenta providing a reference that

can be used for comparison. The leaders window also includes a label indicating that

the leaders are being shown, along with the name of the attribute being used for the

leader query. The record count at the bottom of this window also indicates that the

items shown are leaders. The bottom window - the “laggards” display - shows the

original query in outline, and has new timeboxes representing the new query, which is

defined by shifting the old query one time period to the right. The count label below

this window indicates that the items shown are laggards.


Figure 4.7: Leaders & Laggards: The top-left window is the leader window, and the

laggard window is directly below it.

The “leaders & laggards” facilities provide basic support for identifying trend rela-

tionships between different items in the data set. In the future, this functionality might

be extended with a more generalized bookmark facility, which would provide similar

functionality for multiple stored queries. In this case, the stored queries would serve

as a library of templates that might be used to identify patterns of interest.


4.3 Multiple Time-Varying Attributes

Although the model for timeboxes presented in Chapter 3 assumes a data set in which

each item contains a single measurement for each of m time points, there is no

particular reason for restricting consideration to data sets involving only one time-

varying attribute. In fact, many meaningful data sets include multiple simultaneous

measurements. For example:

• Stock price data sets might include both low and high prices

• Meteorological data sets might include temperatures and precipitation levels

• Databases of genetic expression levels might include results from two or more

experimental conditions1.

In the notation given in Chapter 3, these data sets can be modeled by assuming

that there are k variables for each item in the data set. Thus, $t_i^k(j)$ is the value of

variable k for $t_i$ at time j. A timebox is then interpreted as a specific constraint on

any one of the k variables: $b = (t_{min}, t_{max}, v^k_{min}, v^k_{max})$, and $t_i$ satisfies the timebox b if

$\forall_{t_{min} \le t \le t_{max}}\; v^k_{min} \le t_i^k(t) < v^k_{max}$ (assuming $v^k_{max} \ge v^k_{min}$ and $t_{max} \ge t_{min}$).

TimeSearcher provides limited support for some data sets with multiple variables.

When a data set with multiple variables is loaded into TimeSearcher, the first variable

in the data set is initially shown as the default, in a single pane of a tabbed pane win-

dow. To examine and query the values of any other variable, the user selects the desired

variable name from the pull-down menu marked “Query Variable” in the toolbar. This

leads to creation of a new frame in the tabbed pane (Figure 4.8).

When multiple attributes are present, users can switch between them by clicking on

the tab at the top of the pane. An attribute can be removed by clicking the close icon

1This example was motivated by collaborators working with microarray data sets. See Chapter 8.


Figure 4.8: TimeSearcher with a data set involving multiple time-varying attributes.

Two panes have been created - for the “low” and the “high” values.

(the “x”) in the appropriate tab, and reinstated by making the appropriate selection

in the pull-down menu. Each variable that is active displays its own data envelope,

and the display window shows graphs of each individual item using the values of the

attribute in the currently-selected pane. If desired, the user can modify the individual

graph display in the lower panel to show the graphs for each variable simultaneously,

by choosing the “Display All Variables” option in the “View” menu (Figure 4.9).

When multiple attributes are displayed, the pane for each attribute acts as a query

space for that attribute. Queries can be created independently for each attribute, and

only items that match all queries - even those for variables in panes other than that

which is currently selected - will be included in the result set. When a query is created

or modified, the query envelopes and graph overviews for each active variable will be

updated to display the appropriate subset of the results (Figure 4.10). All items in the

lower display window will have time points for all active queries highlighted, not just

those time points corresponding to queries for the currently-displayed variable.
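The conjunctive evaluation of queries across attributes can be illustrated with a short Python sketch. The names used here (Timebox, satisfies, run_query) and the dictionary-based data layout are hypothetical. The upper value bound is treated as strict because the inequality given earlier in this section is written that way; this choice is an assumption of the sketch.

```python
from typing import Dict, List, NamedTuple, Sequence

# One item: variable name -> one value per time point
Item = Dict[str, Sequence[float]]


class Timebox(NamedTuple):
    """Hypothetical timebox bound to a single time-varying variable."""
    variable: str
    t_min: int
    t_max: int
    v_min: float
    v_max: float


def satisfies(item: Item, box: Timebox) -> bool:
    """True if the box's variable stays in range at every covered time point."""
    values = item[box.variable][box.t_min:box.t_max + 1]
    return all(box.v_min <= v < box.v_max for v in values)


def run_query(items: Dict[str, Item], boxes: List[Timebox]) -> List[str]:
    """Names of items matching every timebox, regardless of which pane holds it."""
    return [name for name, item in items.items()
            if all(satisfies(item, box) for box in boxes)]


if __name__ == "__main__":
    items = {
        "A": {"low": [1, 2, 3], "high": [5, 6, 7]},
        "B": {"low": [1, 2, 3], "high": [9, 9, 9]},
    }
    query = [Timebox("low", 0, 1, 0, 5), Timebox("high", 1, 2, 5, 8)]
    print(run_query(items, query))
```

In this example both items satisfy the constraint on “low”, but only item A also satisfies the constraint on “high”, so only A remains in the result set.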


Figure 4.9: The data items in the result set with two variables displayed. The profiles

are taken from yeast microarray data, with absolute log ratio and log ratio values shown

for seven time points [40].

Figure 4.10: Updated query envelopes for one of two attributes that are currently ac-

tive. Note that even though there are no queries in this window, queries in the inactive

window (for “Low” measurements) have constrained the data set, as shown by the

query envelope.


This implementation provides only basic support for multiple time-varying at-

tributes, with several limitations. When multiple time-varying attributes are displayed,

they are shown in the same scale, with extent defined by the minimum and maximum

values found for any attribute in the data set. While this works well for comparable

values, it does not work well for multiple attributes with vastly different ranges. Thus,

for example, this facility would generally not be useful for simultaneous examination

of temperature and precipitation levels.

The requirement that all attributes be displayed in the same scale was motivated

by the need to overcome some of the limitations associated with the use of a tabbed

pane window for the multiple query spaces. As the tabbed pane window displays

only one of the panes at any given time, all but one of the attributes are always obscured.

This increases the cognitive load associated with interpreting queries, as users must

remember the queries that have been created for variables that are not currently visible.

The change between panes might be particularly confusing if the panes involved

different scales and ranges of values. In this case, users might make interpretation

errors if they did not realize that changing panes had led to a switch between query

spaces that covered widely different ranges. Specifically, users might interpret queries

in one space in terms of the range used in a different space. By requiring that all

attributes use the same range of values, TimeSearcher sacrifices some flexibility in

an attempt at minimizing user confusion. Further work in this area will be aimed

at designing an alternative approach that does not require this restriction of common

scales.

TimeSearcher provides an optional “summary” overview that can help alleviate

the problem of occlusion of query spaces. When the user selects the “Summaries..”

option from the view menu, a new window containing miniature views of all of the


Figure 4.11: A summary window for a query over two attributes.

active query spaces is opened. Each summary view is labeled with the name of the ap-

propriate attribute, and the summary view corresponding to the currently active query

window is highlighted (Figure 4.11).

These windows contain active linked views that are updated as the query space

is updated. Although the miniaturized views do not provide enough detail to fully

interpret the queries, they provide a reminder of the occluded query spaces without

taking large amounts of screen space from the currently selected query.

Future work might address alternative solutions to this problem of occlusion. For

example, the tabbed pane might be replaced by a series of individual windows, one for

each attribute. These windows would be coordinated, providing multiple perspectives

similar to those found in Snap-Together Visualizations [94].


4.4 Query Inversion

Having found items in a data set that match a specific query, users might like to find

items that have opposite behavior patterns. For example, a stock analyst might like

to see stocks that fell at the same time as others were rising. TimeSearcher’s “query

inversion” facility supports this task.

Queries containing one or more timeboxes can be inverted by selecting the desired

timeboxes and pressing the toolbar button with the inverted arrows (or selecting “Flip

Selected Queries” from the “Transform” menu). This will cause the queries to be rotated

to form an inverse pattern (Figure 4.12). Pressing this button again restores the

original queries.

The inverse query is derived by calculating the midpoint of the range covered by

the query. Specifically, this midpoint is half-way between the extreme maximal and

minimal points in any of the constituent timeboxes. Each box is then rotated around

this axis, providing the desired inversion.
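The inversion described above can be read as a reflection of each timebox about the horizontal line through the midpoint. The following Python sketch takes that reading as an assumption; the Timebox tuple and the function invert_query are hypothetical names, not part of TimeSearcher.

```python
from typing import List, NamedTuple


class Timebox(NamedTuple):
    """Hypothetical timebox; only the value bounds change under inversion."""
    t_min: int
    t_max: int
    v_min: float
    v_max: float


def invert_query(boxes: List[Timebox]) -> List[Timebox]:
    """Reflect each box about the midpoint of the query's overall value range."""
    lowest = min(b.v_min for b in boxes)
    highest = max(b.v_max for b in boxes)
    # Reflecting v about the midpoint m = (lowest + highest) / 2 maps v to
    # 2m - v = lowest + highest - v, so each box's bounds swap roles.
    total = lowest + highest
    return [b._replace(v_min=total - b.v_max, v_max=total - b.v_min)
            for b in boxes]


if __name__ == "__main__":
    rising = [Timebox(0, 2, 10, 20), Timebox(3, 5, 30, 40)]
    falling = invert_query(rising)
    print(falling)
    print(invert_query(falling) == rising)
```

Under this reading the inverted boxes occupy the same overall value range as the original query, and applying the operation twice returns the original boxes, matching the behavior of pressing the button a second time.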

As the original queries all must fit within the parameters of the results set, this

approach to inversion has the desirable feature that the resulting inverse query is guar-

anteed to be a legal query. Other definitions of reciprocal queries - for example, taking

the first timebox as a given constant and rotating other boxes relative to this first box -

might lead to nonsensical queries.

The query inversion tools might be particularly useful when used in conjunction

with leaders and laggards queries (Section 4.2).


Figure 4.12: Query Inversion: The original query (top) and the inverted query (bot-

tom).


4.5 Anyof Timeboxes

TimeSearcher provides support for timeboxes with alternative, anyof semantics (Sec-

tion 3.1). After a timebox has been created, the semantics can be changed by toggling

the “any” checkbox on the pop-up menu that is opened by right-clicking on a time-

box. When the toggle is changed, the query will be re-evaluated, and the timebox

will be displayed in a different color, in order to indicate the alternative semantics

(Figure 4.13).
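The difference between the two semantics can be sketched in a few lines of Python. The Timebox tuple and function names below are hypothetical, and the inclusive value bounds are an assumption made for this illustration.

```python
from typing import List, NamedTuple, Sequence


class Timebox(NamedTuple):
    """Hypothetical timebox covering time points t_min..t_max inclusive."""
    t_min: int
    t_max: int
    v_min: float
    v_max: float


def matching_points(values: Sequence[float], box: Timebox) -> List[int]:
    """Time points inside the box's interval whose values fall in its range."""
    return [t for t in range(box.t_min, box.t_max + 1)
            if box.v_min <= values[t] <= box.v_max]


def satisfies_standard(values: Sequence[float], box: Timebox) -> bool:
    """Standard semantics: every covered time point must be in range."""
    return len(matching_points(values, box)) == box.t_max - box.t_min + 1


def satisfies_anyof(values: Sequence[float], box: Timebox) -> bool:
    """Anyof semantics: at least one covered time point must be in range."""
    return len(matching_points(values, box)) > 0


if __name__ == "__main__":
    profile = [1, 5, 9, 5]
    box = Timebox(0, 2, 4, 6)
    print(satisfies_standard(profile, box), satisfies_anyof(profile, box))
    print(matching_points(profile, box))  # points that would be highlighted
```

Any item that passes satisfies_standard also passes satisfies_anyof for the same box, which is why converting a timebox to anyof semantics can only enlarge the result set.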

4.6 Variable Time Timeboxes

Variable time timeboxes (VTTs, Section 3.2) are supported through a button on the

menu bar, which can be selected to switch to a query creation mode analogous to the

mode used for standard queries. The user creates a VTT by drawing a box, using the

same mechanism used for creating a standard time box. Once the VTT is created, it can

be selected for modification. Two types of modification are possible: outer handles can

be used to modify or scale the range in the value (y) dimension and the overall range

in the time (x) dimension, while inner handles can be used to modify the extent of

the inner box, which specifies the length of the required interval (Figure 4.14). Query

results are re-processed with each modification of any of the parameters.

As implemented in TimeSearcher, the user initially specifies the window of inter-

est, and then modifies the inner box to specify the duration within that window that

must satisfy the given value constraints.

VTTs specify variable constraints, raising the possibility that different items might

satisfy a VTT at different time points. For example, if a VTT specifies an interval that

is 3 time periods long within a window from time 5 to time 10, one item might satisfy


Figure 4.13: Anyof timeboxes: The display on the top shows a query consisting of

two timeboxes. In the bottom display, the timebox on the left has been converted to

an anyof query. As these queries are more inclusive (requiring only one value in the

given range during the interval, as opposed to all values), the result set for the anyof

query is a superset of the other result set.


Figure 4.14: A variable time timebox (VTT), with two sets of modification handles.

The outer handles can be used to modify the value range and the time window, while

the inner handles can be dragged to modify the duration of the interval during which

values must be within the given range.

the VTT during periods 6-8, while another might satisfy the same VTT during periods

7-9. To display these differences, each individual graph of an item in the result set

highlights only those time points when that item meets the criteria for each query item

(Figure 4.2).
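The example above (an interval of 3 time periods within a window from time 5 to time 10) can be worked through with a small Python sketch. The function vtt_matching_points and its parameters are hypothetical names; the sketch assumes inclusive value bounds and that the window lies within the profile, and it is not a description of TimeSearcher's own evaluation code.

```python
from typing import List, Sequence


def vtt_matching_points(values: Sequence[float],
                        window_start: int,
                        window_end: int,
                        duration: int,
                        v_min: float,
                        v_max: float) -> List[int]:
    """Time points belonging to any qualifying run inside the window.

    A run qualifies if it has `duration` consecutive time points, lies
    entirely within window_start..window_end, and every value is in range.
    An empty result means the item does not satisfy the VTT.
    """
    matched = set()
    # The last possible run starts at window_end - duration + 1.
    for start in range(window_start, window_end - duration + 2):
        run = range(start, start + duration)
        if all(v_min <= values[t] <= v_max for t in run):
            matched.update(run)
    return sorted(matched)


if __name__ == "__main__":
    item_a = [0, 0, 0, 0, 0, 0, 50, 50, 50, 0, 0, 0]   # in range at 6-8
    item_b = [0, 0, 0, 0, 0, 0, 0, 50, 50, 50, 0, 0]   # in range at 7-9
    for item in (item_a, item_b):
        print(vtt_matching_points(item, 5, 10, 3, 40, 60))
```

Both items satisfy the same VTT, but the returned time points differ, which is the information used to highlight each item's graph individually.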

4.7 Angular Queries

In many cases, analysis of time series data may require queries aimed at finding rela-

tive changes in value, as opposed to the absolute changes that can be expressed with

timeboxes. For example, timeboxes can be used to find items that rise from a value


Figure 4.15: Calculation of an angular query. If an item $t_i$ has a value $v$ at the starting

time $t_{min}$, its value at the ending time $t_{max}$ must be between $v_{min}$ and $v_{max}$, as determined

by $\theta_1$ and $\theta_2$, along with the width of the query.

of 80 to a value of 120 four time periods later, but they cannot be used to identify all

items that rose by 50% in value - regardless of the starting value - over that same time

period.

TimeSearcher’s angular queries can be used to create this sort of query. An

angular query specifies a range of slopes that place constraints on the slope of an

item’s values over the course of an interval. An angular query is a four-tuple:

$b = (t_{min}, t_{max}, \theta_{min}, \theta_{max})$. As with standard timeboxes, $t_{min}$ and $t_{max}$ specify the starting

and ending points for the query. The angles $\theta_{min}$ and $\theta_{max}$ present lower and upper

bounds on the slope that the item’s profile must form with the horizontal (Figure 4.15).

Of course, $-\pi/2 \le \theta_{min} \le \pi/2$ and $-\pi/2 \le \theta_{max} \le \pi/2$.

The simplest conception of an angular query involves the angle formed by the line


between the value at the starting point and the ending point. For any given item ti,

the angle is computed by finding the difference between the value at the ending point

($t_i(t_{max})$) and at the starting point ($t_i(t_{min})$), and dividing it by the width of the timebox.

This value is the tangent of the angle in question for item i. Specifically,

$\theta_i = \arctan\left((t_i(t_{max}) - t_i(t_{min})) / \mathit{width}\right)$. This definition - the “end points” version of angular

queries - is based purely on the relationships between values at the two ends of the interval.

As a result, an item can have values that fluctuate wildly between the start and end of

the interval in question and still meet the constraints of the query.

An alternative definition - the “all points” model - requires that every transition

within the interval conform to the stated requirements. This more stringent definition

essentially requires that the slope of an item’s profile fall within the desired

range at every step of the interval.
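The two interpretations can be compared with a minimal Python sketch. The function names and the step_width parameter (the horizontal distance covered by one time step, measured in the same units as the values) are assumptions introduced for illustration, as are the inclusive angle bounds and the convention that a rising profile yields a positive angle.

```python
import math
from typing import Sequence


def satisfies_end_points(values: Sequence[float], t_min: int, t_max: int,
                         theta_min: float, theta_max: float,
                         step_width: float = 1.0) -> bool:
    """End points model: only the line from start to end is tested."""
    if t_max <= t_min:
        raise ValueError("an angular query must span at least one interval")
    width = step_width * (t_max - t_min)
    theta = math.atan((values[t_max] - values[t_min]) / width)
    return theta_min <= theta <= theta_max


def satisfies_all_points(values: Sequence[float], t_min: int, t_max: int,
                         theta_min: float, theta_max: float,
                         step_width: float = 1.0) -> bool:
    """All points model: every transition in the interval is tested."""
    if t_max <= t_min:
        raise ValueError("an angular query must span at least one interval")
    for t in range(t_min, t_max):
        theta = math.atan((values[t + 1] - values[t]) / step_width)
        if not theta_min <= theta <= theta_max:
            return False
    return True


if __name__ == "__main__":
    lo, hi = math.radians(20), math.radians(35)
    steady = [0, 5, 10, 15, 20]    # rises evenly
    jagged = [0, 10, 0, 10, 20]    # same end points, wild in between
    for profile in (steady, jagged):
        print(satisfies_end_points(profile, 0, 4, lo, hi, step_width=10),
              satisfies_all_points(profile, 0, 4, lo, hi, step_width=10))
```

The jagged profile shares its end points with the steady one, so it passes the end points test, but its individual transitions fall outside the angle range and it fails the all points test.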

The mechanism for creating angular queries is identical to that which is used for

standard and variable time timeboxes: after selecting the appropriate button on the

TimeSearcher toolbar, the user draws a box that specifies the initial extremes of the

angular query. The lower-left corner of the box is used as the starting point, and the

upper-right corner is the ending point for the maximum value. The angle that the line

between these two points forms with the horizontal is θmax. A default value is used to

determine the range between θmax and θmin.

Like standard timeboxes and variable-time timeboxes, angular queries are con-

strained to occupy an integral number of discrete time points. A further similarity

with those other widgets is the extents of the query widgets, which extend horizontally

to occupy 1/2 extra interval beyond the graph points that indicate values covered by

the query. This may cause some confusion, but it is necessary for consistency with

standard and variable time timeboxes.


Figure 4.16: The angular query widget.

Figure 4.17: An annotated angular query widget. The dark lines demonstrate how the

vertical line in the query widget is used to determine the two angles necessary for the

query.

Although the angular query is specified by drawing a box, the widget used to display

the query is somewhat different. This widget consists of two lines. An angled

line from the starting time point to the ending time point indicates the angle of the

query (Figure 4.16). This line meets a vertical line at the ending time point. This

vertical line depicts the range between θ1 and θ2, as shown in Figure 4.17.

The “all points” query model is the default interpretation for angular queries (Fig-

ure 4.18). A query can be changed to the “end points” configuration by right clicking

on the widget and selecting “End Points Only”. This causes reprocessing of the query


Figure 4.18: The TimeSearcher query space with an angular query under the “all

points” interpretation. Data and query envelopes have been disabled for clarity. Se-

lection handles on the query widget can be used to move and rescale the query, and

a tooltip provides a textual representation of the query on mouse-over. Note that the

graph envelopes show items with a slope similar to that of the angular query widget,

but at differing ranges along the value axis.

under the alternative representation, and the coloring of the widget is changed to re-

flect the alternative semantics (Figure 4.19), in a manner similar to the presentation of

anyof queries (Section 4.5).

The width used to calculate the angles is not the difference between the time points

- tmax − tmin. As this difference is generally very small (often on the order of less than

ten time points), using the width in terms of time points would lead to large values for

the tangent, and correspondingly large values for the angle. To avoid this difficulty, the

width in screen coordinates is used to calculate the angles. This results in angles that

correspond to the angle that the angular query widget shows on the screen.

Like timebox queries, angular queries can be modified via handles. These handles

can be used to modify the width, angles, or range between the angles (θ2 − θ1). The


Figure 4.19: The angular query from Figure 4.18, under the alternate “end points”

interpretation. Note that some items in the result set have intermediate transitions that

exceed the range specified, even though the line between values at the end points fits

within the specified range.

handle can also be translated in either time or value. Since translations in value do not

change the angle or starting and ending times of an angular query, these translations

do not impact the result set.

Angular queries are conceptually similar to the use of angular brushes in parallel

coordinates. Angular brushes can be used to find trends of a certain direction and

magnitude in parallel coordinates displays, without regard for initial comparison point

(Figure 4.20) [58]. CASSATT uses a dialog box to provide similar functionality [144].

Angular queries have one advantage over angular brushes. Angular brushes are

limited to comparisons between two adjacent axes, while the comparisons specified by

angular queries may involve comparison across time points separated by an arbitrary

interval - adjacency is not required.

This implementation of angular queries provides an example of the expressive

Figure 4.20: An angular brush that searches for negative correlations between items in

the second and third axes [58].

power that additional widgets and interaction techniques might bring to TimeSearcher.

A variety of additional extensions that might be of interest are discussed in Chapter 9.

4.8 Averages

For some tasks, analysts may wish to identify and explore those items in a data set

that are close to the “average” of the data set. TimeSearcher provides support for one

particular notion of averaging through the “show averages” selection in the “View”

menu.

When this option is selected, a new profile is constructed by calculating the average

value of all of the items in the data set at each time point. In other words, the first time

Figure 4.21: The TimeSearcher query window, with an average profile displayed in

red.

point in the average profile contains the average of the first values for all of the items in

the data set, etc. This profile is displayed in the query area as a red line (Figure 4.21).

This line, which is similar in appearance to a graph overview (Chapter 3), provides

users with basic feedback regarding the distribution of items in the data set.
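The construction of the average profile can be sketched as follows. The class and method names are hypothetical, and the sketch assumes a single time-varying attribute stored as one array of values per item, with all items having the same number of time points.

```java
/** Hypothetical sketch of the per-time-point average; not TimeSearcher's actual code. */
final class AverageProfileSketch {
    /** items[i][t] is the value of item i at time point t. */
    static double[] averageProfile(double[][] items) {
        int timePoints = items[0].length;
        double[] average = new double[timePoints];
        // Sum the values of all items at each time point.
        for (double[] item : items) {
            for (int t = 0; t < timePoints; t++) {
                average[t] += item[t];
            }
        }
        // Divide each sum by the number of items.
        for (int t = 0; t < timePoints; t++) {
            average[t] /= items.length;
        }
        return average;
    }
}
```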

When the average profile is displayed, a new button is added to the toolbar. When

this button is pressed, the average profile is used as a template for a query, which is

constructed by creating a range around the average value at each time point (Figure 4.22).

In essence, this button treats the average profile as a drag and drop query (Figure 4.3).

Once the query has been completed, the individual timeboxes can be moved, scaled,

or deleted at will to form a variety of queries focused around some interpretation of

the average profile.
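One way to picture the average query is as a list of single-time-point ranges centered on the average profile. The sketch below assumes a fixed half-width for every range; the text does not state how wide the ranges are, so the halfWidth parameter and all names here are illustrative assumptions.

```java
/** Hypothetical sketch of building ranges around an average profile. */
final class AverageQuerySketch {
    /** Returns one {timePoint, low, high} triple per time point of the average profile. */
    static double[][] rangesAround(double[] average, double halfWidth) {
        double[][] ranges = new double[average.length][];
        for (int t = 0; t < average.length; t++) {
            // halfWidth is an assumed parameter controlling the size of each range.
            ranges[t] = new double[] {t, average[t] - halfWidth, average[t] + halfWidth};
        }
        return ranges;
    }
}
```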

Figure 4.22: An average query.

4.9 Other Features

TimeSearcher provides rudimentary support for saving and managing query results. A

set of queries can be saved and later reloaded via the “Save Query File...” and “Open

Query Files...” menu selections. Queries are saved without reference to the underlying

data set, thus allowing users to transfer queries between data sets. Query results can

be saved by selecting “Save results..”, which writes a text file that describes the

data file, the current query parameters, and the items in the data set that match those

parameters.

The “search” box in the toolbar provides basic support for known-item search by

name.

TimeSearcher also supports alternate treatment of time varying values. Menu items

can be used to switch between raw values, linear normalization, or z-score normaliza-

tion.
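The two normalizations can be illustrated with their common textbook definitions. The text does not give the exact formulas or say whether they are applied per item or across the data set, so the sketch below is an assumption on both counts: it rescales a single item's values to the range 0 to 1 for linear normalization, and uses the population standard deviation for z-scores.

```java
/** Hypothetical sketch of per-item normalization; formulas are assumed, not taken from TimeSearcher. */
final class NormalizationSketch {
    /** Linear (min-max) normalization: maps the smallest value to 0 and the largest to 1. */
    static double[] linear(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    /** Z-score normalization: subtracts the mean and divides by the standard deviation. */
    static double[] zScore(double[] values) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double variance = 0.0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(variance / values.length);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (sd == 0.0) ? 0.0 : (values[i] - mean) / sd;
        }
        return out;
    }
}
```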

Chapter 5

TimeSearcher Implementation

TimeSearcher was implemented in Java 2, using the Swing toolkit for user-interface

widgets. Initial versions of TimeSearcher used the Jazz zooming toolkit [17] to provide

drawing and scenegraph control in the data and query displays, along with function-

ality for moving and rescaling timeboxes. Timeboxes, graphs of each item, and query

and data envelopes were implemented as Jazz widgets. After the first public (1.0) re-

lease of TimeSearcher, the code was redesigned to replace Jazz with Piccolo, a newer

zooming toolkit intended to replace Jazz [16].

As a research prototype, TimeSearcher is a product of more than two years of

development work, including substantial redesign. The evolutionary nature of this

growth is reflected in the design and in the code. Prospects for long-term maintenance

and growth of TimeSearcher might be improved by a redesign and implementation that

accounted for lessons learned to date.

This chapter will provide an overview of the TimeSearcher implementation, along

with a description of some of the lessons learned from the original Jazz implementa-

tion. Specific search algorithms are discussed in Chapter 6.

5.1 A Tour of the Code

As a Java application, TimeSearcher is divided into several packages, all of which fall

under the main package edu.umd.cs.temporalquery. The main package contains classes

needed for the basic operation of TimeSearcher: TQMain starts the program, TQCore

provides core functionality, and CmdTable and TQMenuBar provide menu and tool bar

support. A variety of sub-packages provide the bulk of TimeSearcher’s functionality:

• edu.umd.cs.temporalquery.data: DataSet is the class that holds the currently ac-

tive data set, which consists of Entity objects. DataVal, FloatVal, StringVal, and

IntVal are utility classes used for reading data from a text file into a DataSet.

• edu.umd.cs.temporalquery.graph: GraphSet is the class responsible for display-

ing the items in the data set that match the current query.

• edu.umd.cs.temporalquery.query: This package contains a variety of classes for

maintaining the state of the active set of queries. QuerySet contains the core code

for managing the queries, QueryExtremes maintains the minima and maxima for

each of the active attributes at each time point, and QueryElementFactory is used

to rebuild queries when they are loaded from a file. QueryElement is the rep-

resentation of the query associated with a timebox. VariableTimeQueryElement

and AngularQueryElement subclass QueryElement to support alternate query

semantics.

• edu.umd.cs.temporalquery.windows: Classes used to build TimeSearcher’s GUI.

TQSplitDataPane, TQControl, TQDetails, TQItemList, and TQFilter are the

sub-windows in the interface, described in more detail in Section 5.3. Pref-

Dialog is a dialog box used for preferences, and LeaderQuery is the window that

holds the leaders in a leaders & laggards view.

• edu.umd.cs.temporalquery.pwindows: This package contains Swing components

that are used as containers for Piccolo components. TQPZoom is a JPanel

that can be used to hold a Piccolo canvas. TQPZoom is subclassed by Query for

the query space and Display for the display space. SummaryFrame provides the

summary display used for queries involving multiple attributes.

• edu.umd.cs.temporalquery.piccolo: Classes that extend Piccolo classes in order

to provide the graphic support for the query space. Specific classes will be de-

scribed in detail below.

• edu.umd.cs.temporalquery.event: QueryEvent is a class that is used to package

the information associated with a query modification.

• edu.umd.cs.temporalquery.rangeslider: IntRangeSlider and FloatRangeSlider

are widgets that support the double-box sliders that TimeSearcher uses to sup-

port independent modification of the individual dimensions of a timebox.

• edu.umd.cs.temporalquery.util: A variety of support classes, including code for

doing external tasks in separate threads, logging of information, file selection

filters, popup menus, and customized widgets for tabbed panes and text entry

fields.

5.2 Data Management

5.2.1 Input File Format

TimeSearcher uses a simple, ad-hoc file format for input files. Data files are plain-

text, with commas and semicolons used as delimiters. Lines beginning with a pound

symbol (’#’) are comments.

A legal TimeSearcher data file contains a series of data lines describing the data

set as a whole, followed by the individual time series:

1. Title: describing the data set.

2. Static attributes: for each item in the data set. Each static attribute is provided as

“Name,Type”, where “Name” is the name of the attribute, and “Type” is the data

type (String, float, int, etc.). Attributes are separated by semicolons.

3. Dynamic attributes: Similar to static attributes, this line contains one entry for

each time varying value that will be measured.

4. Number of time points: The width of the time series.

5. Number of items: The number of items in the data set.

6. Time point labels: Text labels that will be associated with the time points.

7. Individual items: Each of the items in the data set will be on a line of its own, in

the following format:

(a) The static attributes for that item, in the order given above

(b) The dynamic attributes for the first time point, in the order given above

(c) The dynamic attributes for the second time point, etc . . . .

A sample TimeSearcher data file is given in Appendix A.
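As an illustration of the delimiter conventions described above, the following sketch recognizes comment lines and splits an attribute line of the form “Name,Type;Name,Type” into name and type pairs. It is not TimeSearcher's parser: the class and method names are hypothetical, and details such as whitespace trimming and error handling are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of parsing attribute declarations; not TimeSearcher's actual reader. */
final class HeaderParsingSketch {
    /** Lines beginning with a pound symbol are comments. */
    static boolean isComment(String line) {
        return line.startsWith("#");
    }

    /** Splits "Name,Type;Name,Type" into {name, type} pairs. */
    static List<String[]> parseAttributes(String line) {
        List<String[]> attributes = new ArrayList<>();
        for (String entry : line.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing semicolon (an assumption)
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected Name,Type but found: " + entry);
            }
            attributes.add(new String[] {parts[0].trim(), parts[1].trim()});
        }
        return attributes;
    }
}
```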

5.2.2 Data Structures

Data from TimeSearcher files is read into an instance of the Java class DataSet. This

class also contains some information about the global characteristics of the data set,

such as the number and types of dynamic and static variables, and the minimum and

maximum values for each dynamic attribute at each time point. These numbers are

particularly important for creating the data envelope overviews.

The DataSet object also contains an array of Entity instances, one for each item

in the data set. Each of these instances contains the name of the object, the static

variables for that object, and values of each of the dynamic variables at each time point.

Additional fields include storage for normalized dynamic attributes, along with the

minimum and maximum values of each attribute. These values are used as a shortcut in

query evaluation: when evaluating a timebox for dynamic variable i, if the value range of

the timebox lies entirely above the maximum value of that variable (or entirely below its

minimum), the item cannot match, so examination of individual points can be avoided. Finally,

each entity contains a set of flags - one for each time point - that are set to be true when

the entity is contained in the result set for the current query. These flags are used to

implement the highlighting of relevant result points in the display list (Chapter 4).

Although simple, this arrangement for data storage is consistent with an algorith-

mic analysis that identified an optimized linear scan as the most effective approach for

query evaluation (Chapter 6).
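The role of the stored extremes and the per-time-point flags can be sketched for a single dynamic variable. The class below is a simplified stand-in for Entity, not its actual definition: the field names, the evaluate method, and the decision to set flags only for the time points covered by the timebox are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical, simplified stand-in for an entity with one dynamic variable. */
final class EntitySketch {
    final double[] values;      // one value per time point
    final double min;           // smallest value over all time points
    final double max;           // largest value over all time points
    final boolean[] matchFlags; // per-time-point flags used for highlighting

    EntitySketch(double[] values) {
        this.values = values.clone();
        double lo = values[0];
        double hi = values[0];
        for (double v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        this.min = lo;
        this.max = hi;
        this.matchFlags = new boolean[values.length];
    }

    /** True if every value at time points t1..t2 lies within [low, high]. */
    boolean evaluate(int t1, int t2, double low, double high) {
        Arrays.fill(matchFlags, false);
        // Shortcut: a box entirely above the maximum or below the minimum cannot match.
        if (low > max || high < min) {
            return false;
        }
        for (int t = t1; t <= t2; t++) {
            if (values[t] < low || values[t] > high) {
                return false;
            }
        }
        // Record which time points satisfied the query.
        for (int t = t1; t <= t2; t++) {
            matchFlags[t] = true;
        }
        return true;
    }
}
```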

5.2.3 Loading a Data File

The process of loading a data file begins when the user selects “Open Data File..”

from the file menu. A TQFileFilter (from the edu.umd.cs.temporalquery.util package)

is created and a file name is retrieved through a JFileChooser. The file name is used to

create a DataSet object. The DataSet reads through the metadata at the start of the file

and initializes data structures appropriately. The individual items in the data file are

retrieved by a LoadTask, which is created by DataSet. LoadTask creates a new thread,

which iterates through the file, reading each of the data lines into an Entity and up-

dating global DataSet parameters regarding extreme values at each of the time points.

When the LoadTask finishes reading the data from the file, the TQCore.doFile()

procedure creates GraphSet and QuerySet objects to hold the display lists and

query space, respectively, and initializes these objects and the menu bar appropriately.

5.3 Graphical User Interface

TimeSearcher’s GUI is implemented as a series of several Swing windows:

• TQCore is a JFrame that acts as the main application window. TQCore contains

a menu bar and a vertical JSplitPane. This split pane has a TQSplitDataPane as

its left component and a TQControl as its right component.

• TQControl is the JSplitPane on the right-hand side of the screen. It contains

another JSplitPane, which holds TQDetails in the top and TQItemList in its bot-

tom. The bottom component of TQControl is a TQFilter window.

• TQDetails is the details-on-demand display.

• TQItemList is the list of individual items by name.

• TQFilter contains the range sliders used to adjust timeboxes.

• TQSplitDataPane is the split pane on the left side window. The top compo-

nent of this pane contains a JTabbedPaneWithCloseIcons, which is used to hold

instances of Query - JPanels that hold the Piccolo canvases for the query spaces. The bottom

component contains the display list in a Display window, described below.

A schematic overview of the classes involved in the TimeSearcher window is given

in Figure 5.1.

[Figure 5.1 diagram labels: JToolBar; TQSplitDataPane (Query, Display); TQControl (TQDetails, TQItemList, TQFilter)]

Figure 5.1: A schematic overview of the container classes used in the TimeSearcher

GUI. The entire window is an instance of TQCore - a subclass of JFrame.
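The nesting of split panes described in this section can be sketched with plain Swing components. Placeholder JPanels stand in for TimeSearcher's own classes, and all names in the sketch are hypothetical. Note that in Swing's API a split pane with left and right components uses the JSplitPane.HORIZONTAL_SPLIT orientation, while one with top and bottom components uses JSplitPane.VERTICAL_SPLIT.

```java
import javax.swing.JPanel;
import javax.swing.JSplitPane;
import javax.swing.JTabbedPane;

/** Hypothetical sketch of the container nesting; placeholders stand in for TimeSearcher classes. */
final class LayoutSketch {
    static JSplitPane buildContent() {
        // Left side: tabbed query spaces above the display list.
        JTabbedPane queryTabs = new JTabbedPane();
        queryTabs.addTab("Query", new JPanel());
        JPanel display = new JPanel();
        JSplitPane dataPane = new JSplitPane(JSplitPane.VERTICAL_SPLIT, queryTabs, display);

        // Right side: details above the item list, with the filter at the bottom.
        JSplitPane detailsAndItems =
                new JSplitPane(JSplitPane.VERTICAL_SPLIT, new JPanel(), new JPanel());
        JSplitPane control =
                new JSplitPane(JSplitPane.VERTICAL_SPLIT, detailsAndItems, new JPanel());

        // Main split: data pane on the left, control pane on the right.
        return new JSplitPane(JSplitPane.HORIZONTAL_SPLIT, dataPane, control);
    }
}
```

In an application, the returned pane would be installed as the content of the main JFrame together with the menu bar.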

5.3.1 Piccolo Windows

TimeSearcher’s query spaces and display list are implemented using the Piccolo zoom-

ing toolkit. Although TimeSearcher does not currently provide any zooming, Piccolo’s

facilities for scenegraph management and event handling make it an ideal platform

for building applications like TimeSearcher. Furthermore, future extensions to Time-

Searcher might incorporate zooming functionality (Chapter 10).

The query and display list spaces are both implemented as subclasses of PSizable-

Canvas, a TimeSearcher class that is designed to notify its container class - generally

an instance of TQPZoom - about any resize events that would require modification

of the size of the components contained in the Canvas. As the display list and query

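[Figure 5.2 diagram labels: TQPZoom, with subclasses Query and Display; PSizableCanvas, with subclasses DropCanvas and DragCanvas; associations labeled with multiplicity 1]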

Figure 5.2: A UML-style depiction of the relationships between the classes in the

display list and query window.

window must support drag-and-drop, the canvas in the display window is implemented

as a DragCanvas, a subclass of PSizableCanvas that implements DragSourceListener.

Similarly, the canvas in the query space is implemented as a DropCanvas, which im-

plements DropTargetListener.

TQPZoom is a subclass of JPanel that is used as a container to hold the instances

of PSizableCanvas. There are two subclasses of TQPZoom - Display for the display list

and Query for the query space. These subclasses manage the details of the display and

query spaces. A simplified UML-style schematic of the class relationships is given in

Figure 5.2.

As TimeSearcher supports simultaneous querying of multiple time-varying

attributes, there can be multiple instances of Query that are active at any given time.

These instances are stored in the JTabbedPaneWithCloseIcons that is the top compo-

nent in the TQSplitDataPane. There is, however, only one instance of Display.

Display

The Display window contains a DragCanvas that displays the graphs of the individual

items in the result set of the current query. Each graph of an item in the data set is an

instance of DataAxis, which is a subclass of Axis. Axis is a PNode that draws a pair

of axes, along with labels. DataAxis extends Axis, adding a line that plots the values

for an item in the data set. The updating and management of the graphs in the Display

window is handled by an instance of GraphSet, which creates instances of DataAxis

and displays them when they are in the result set of the active query.

The DragCanvas in the Display window has one event handler - the DisplayEven-

tHandler, a subclass of Piccolo’s PDragSequenceEventHandler. DisplayEventHandler

is used to update TimeSearcher’s display when the user mouses-over one of the graphs

in the display window. When the mouse enters one of the graphs, DisplayEven-

tHandler updates the details window to display the details for the item in question,

scrolls the item list to highlight the name of this item, and highlights the graph

overview line for the given item (if the graph overview is active).

Scrolling of this list is provided by placing the Display in a PScrollPane -

Piccolo’s version of the JScrollPane class. Event handlers in TQSplitDataPane respond

to scrolling of this window, and update the position of the scroll bar in TQItemList to

keep the two windows synchronized as necessary. Similarly, when the TQItemList is

scrolled, code in TQSplitDataPane is executed to scroll the Display window as neces-

sary.

Query

The Query panel uses a DropCanvas to display the query space. The main Piccolo

component used in the query space is the QueryAxis, a subclass of Axis. QueryAxis

includes DataEnvelope and QueryEnvelope nodes. Both subclasses of Envelope,

DataEnvelope and QueryEnvelope provide the data and query envelope overviews.

QueryAxis is also responsible for management of the instances of GraphEnvelopeN-

ode that provide the graph overviews. A variety of Piccolo handlers are used for the

creation and modification of the various classes of queries (Section 5.3.2).

Other Piccolo Classes

Leaders & laggards queries (Section 4.2) use an additional window to display the lead-

ers. When the user starts a leaders & laggards query, the top window of the TQSplit-

DataPane is replaced by a new split data pane. The bottom component of this new

pane is set to the tabbed data pane that holds the Query panels. The top component of

this new pane is set to be a new instance of LeaderQuery, a subclass of TQPZoom that

is similar to Query. LeaderAxis, LeaderEnvelopeNode, and LeaderHandler provide

facilities for the leader window, similar to those provided by QueryAxis, GraphEn-

velopeNode, and DisplayEventHandler.

The Average class provides the display of the data set averages needed for the

average query facilities (Section 4.8).

5.3.2 Interaction Handlers: Creation and Modification of Queries

Piccolo provides basic interaction handlers that are suitable for unconstrained creation,

translation, and scaling of visual objects. This free-form movement is not acceptable

for manipulation of timeboxes, which involves two important constraints:

1. Creation, translation, and scaling of timeboxes must be confined to

a bounded rectangular area.

2. The width and position of a timebox must be aligned with a discrete grid, as

determined by the number of time points in the time series. Any changes to a

timebox must respect this alignment.

The first constraint is necessary to prevent users from creating queries that are

either nonsensical or simply out of bounds (for example, queries involving time periods

outside of the range of the data set). Discrete horizontal motion is needed to clarify

the extent of the queries and the motion: as it is assumed that time series are discrete,

queries that involve values between two time points in a series are nonsensical and

should not be allowed. TimeSearcher does not allow movement or scaling operations

that modify the horizontal extent of queries by non-integral amounts, thus prohibiting

these queries.

Standard Piccolo handlers for moving and resizing objects were modified and cus-

tomized to implement these constraints. When a selected timebox is created, moved,

or scaled, these handlers examine the movement or scaling operation and adjust the

magnitude of the vertical and horizontal change to ensure that the resulting timebox

will be both within bounds and appropriately aligned in the horizontal dimensions.
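The adjustment made by these handlers amounts to snapping horizontal positions to the time-point grid and clamping both dimensions to the query area. The sketch below works in data units (time-point indices and values) instead of pixels, which is an assumption made to keep the example short; the class and method names are hypothetical and do not correspond to TimeSearcher's handler code.

```java
/** Hypothetical sketch of snapping and clamping a timebox; not TimeSearcher's handler code. */
final class GridConstraintSketch {
    /**
     * Snaps a proposed left edge to the nearest time point and keeps a box that
     * spans 'width' time steps inside the indices 0..timePoints-1.
     */
    static int constrainStart(double proposedStart, int width, int timePoints) {
        int snapped = (int) Math.round(proposedStart);
        int maxStart = timePoints - 1 - width; // last start that keeps the right edge in range
        return Math.max(0, Math.min(snapped, maxStart));
    }

    /** Keeps a box of the given height within the value range [minValue, maxValue]. */
    static double constrainLow(double proposedLow, double height,
                               double minValue, double maxValue) {
        return Math.max(minValue, Math.min(proposedLow, maxValue - height));
    }
}
```

For example, with ten time points a box two steps wide that is dragged to position 8.7 would be placed at time point 7, the last position at which its right edge is still inside the data.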

Each class of timeboxes has its own handler that implements the behavior nec-

essary for that class of timebox (standard timeboxes, variable time timeboxes, etc.),

while modification (moving and resizing) of all timeboxes is handled by a common

handler - the ConstrainedSelectionHandler. This handler is also responsible for pro-

cessing query modification via arrow keys on the keyboard, and for handling mouse-

over events on graph overview lines.

Piccolo event handlers are also used for the creation of timeboxes: the TimeBox-

Handler is used to create new timeboxes. When the user uses the toolbar to switch

between timebox creation mode and timebox modification mode, the TimeBoxHan-

dler is deactivated and the ConstrainedSelectionHandler is activated, and vice-versa.

An instance of TimeBoxHandler is the active event handler on the Query canvas

when TimeSearcher is in query creation mode. When a TimeBoxHandler is active, a

mouse press on the query space leads to the creation of a new TimeBox, at the loca-

tion of the press. With each subsequent mouse event, the TimeBoxHandler constrains

the bounds of the box as described above, updates the display, and calls the code in

QuerySet needed to reprocess the query.

VTTHandler and AngularHandler are subclasses of TimeBoxHandler that are used

for creation of VariableTimeTimeBox and AngularTimeBox objects for variable-time

timeboxes and angular queries, respectively. Each of the three query creation buttons

on TimeSearcher’s toolbar is responsible for activating the event handler for the appro-

priate query class.

Timeboxes and other query widgets can be translated or scaled, but not created,

when TimeSearcher is not in query creation mode. In this case, the ConstrainedSelec-

tionHandler is active. This handler supports three main functions:

• Mouse-over highlighting of graph overview items, including scrolling update of

the display and item lists.

• Selection of query widgets, either via clicking or lasso for group selection.

• Translation of query widgets via dragging or key press.

When the user drags a selected query widget, ConstrainedSelectionHandler con-

strains the translation to remain within appropriate bounds, using a strategy very sim-

ilar to the approach used in TimeBoxHandler. Once the movement is constrained,

the query widget’s position is updated by a call to its setBounds() method, and the

queryChanged() method of the QuerySet objects is executed to process the modified

query.

When a query is selected - either by direct mouse click or by lassoing - it is dec-

orated with resizing handles. These handles, which are generally subclasses of the

Piccolo class PHandle, are nodes that are added to points on the perimeter of the query

widget while it is selected. These handles can be dragged to scale and otherwise mod-

ify the parameters of the query widget.

Each type of query has its own subclass of PHandle that supports the range of

changes that can be made. TimeBox is the simplest case, with BoundsHandles pro-

viding eight handles: one on each of the four corners, and one at the midpoints of

each of the four sides. When these handles are dragged, code in TimeBox is called

to update the size and shape of the box, taking the constraints mentioned above into

account. When a bounds handle on a TimeBox is dragged, the modification implied

by this drag is applied to all selected widgets, thus supporting concurrent scaling of

selected objects.

Handles for variable-time timeboxes are slightly more complicated than the han-

dles used on standard timeboxes. In addition to the eight handles implied by Bound-

sHandles, VTTs use an additional two handles from the VariableTimeTimeBoxHan-

dles class to support resizing of the inner box.

Unlike standard and variable-time timeboxes, angular queries do not have four

sides and corners that provide natural locations for interaction handles. Instead, han-

dles for angular queries are placed on the left-hand end of the query, and on either end

of the query’s range indicator (Figure 4.18). As the default positioning of handles is

not appropriate, the AngularBoundsLocator class is used to calculate the location of

the AngularBoundsHandles, which provide the scaling functionality.

Additional details about the subclassing relationships needed to implement the var-

ious types of queries can be found in Section 5.4.

5.3.3 Display Techniques

Efficient redisplay of graphical information in both the query window and the data

items (Figure 4.1) is necessary for efficient support of dynamic queries. TimeSearcher

uses several strategies to provide the necessary performance.

Improvements to the display performance can be achieved by limiting the extent

of the display that is dynamically updated during user interaction. When a query is

being modified, the user’s attention is focused on the query space, as opposed to the

display of the individual data items. As a result, continuous updating of this display

is unnecessary. Instead, TimeSearcher updates the graph overview, the data/query

envelopes, and the list of items that match the query with each mouse event, and

saves the update of the display of the individual items until the end of the interaction.

The decay from graph overview on smaller result sets to data/query envelopes on

larger result sets (Section 4.1) provides additional performance benefits, reducing the

update requirements from O(n) individual graph lines to the four lines needed for

drawing the two contours.
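An envelope can be computed as the per-time-point minimum and maximum over a set of items, giving two lines regardless of how many items are in the set. The sketch below illustrates this for a single attribute; the names are hypothetical and the representation of items as arrays is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of envelope computation; not TimeSearcher's Envelope classes. */
final class EnvelopeSketch {
    /** Returns {lower, upper}: the minimum and maximum value at each time point. */
    static double[][] envelope(List<double[]> items, int timePoints) {
        double[] lower = new double[timePoints];
        double[] upper = new double[timePoints];
        Arrays.fill(lower, Double.POSITIVE_INFINITY);
        Arrays.fill(upper, Double.NEGATIVE_INFINITY);
        for (double[] item : items) {
            for (int t = 0; t < timePoints; t++) {
                lower[t] = Math.min(lower[t], item[t]);
                upper[t] = Math.max(upper[t], item[t]);
            }
        }
        return new double[][] {lower, upper};
    }
}
```

Applying this once to the whole data set and once to the current result set yields the four lines of the data and query envelopes.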

The summary window used for an overview of multiple-attribute queries

(Section 4.3) is implemented as a series of Piccolo canvas (PCanvas) objects. Each of

these objects has a Piccolo camera (PCamera) that contains the graphic layer from

one of the current query spaces. The views of these cameras are scaled to provide the

miniaturized view. Each of these cameras displays the same scenegraph that is shown

in one of the active query spaces, so the summaries are directly linked to the query

space. Therefore, the summaries will be updated with each modification of the query

space, including dragging of queries and subsequent updating of the result set.

5.3.4 The transition from Jazz to Piccolo

Efficient redisplay of graphical information in both the query window and the data

items (Figure 4.1) is necessary for efficient support of dynamic queries. To provide

this support, TimeSearcher initially used a customized version of Jazz that improved

performance in certain critical areas.

In the display window, each individual graph is a separate node in the Jazz scene-

graph. To draw these items in the continuous vertical scrolling display, these nodes

must be translated and redisplayed with each query. Specifically, the nth item in the

result set must be displayed at vertical offset n ∗ k, where k is the height of each item.

For large result sets, this leads to numerous changes to the scenegraph, which must be

handled appropriately for good performance.

The default Jazz implementation treated the modification of any item in the scene-

graph as reason to update the portion of that scenegraph associated with the parent of

the item. When an item in the display list is translated, each of the other items in the

list must be updated. Thus, translation of each of the O(n) graphs leads to examination

of all of the graphs, leading to a total response time that is O(n²).

Piccolo does not have this overhead, and can modify O(n) objects in O(n) time.

The current implementation of TimeSearcher uses an unmodified version of the Pic-

colo libraries, which should ease maintenance and future development.

Piccolo also provides interaction handlers that are significantly smaller and simpler

than those provided in Jazz. As discussed in Section 5.3.2, TimeSearcher requires aug-

mentation of interaction handlers to provide functionality not usually found in zooming

toolkits. As a result of Piccolo’s improved design and greater parsimony, the Piccolo

version of TimeSearcher is significantly smaller than the Jazz version (14K lines vs

20K lines).

Piccolo is also used to draw the timeboxes, data and query envelopes, and the graph

envelope lines in the query space. Each graph envelope line is a separate node in the

Piccolo scenegraph. This imposes a significant overhead for large data sets, but is

necessary for support of mouse-over highlighting and linking with other application

windows.

5.4 Query Processing

TimeSearcher provides dynamic query updates by recomputing query results with ev-

ery modification to any timebox (including VTTs and angular queries) involved in the

current query. As described above (Section 5.3.3), this re-processing is somewhat in-

cremental. As the mouse is dragged during a resize (scaling) or translation (movement)

operation, the graph overview, query envelope, and item list will be updated, but the

display list will not be updated until the mouse is released. More specifically, the

processing of a timebox query proceeds as follows:

1. A mouse event or key press indicates a creation, translation, or scaling of an

instance of the TimeBox class.

2. An instance of QueryElement is created. This instance converts the screen repre-

sentation of the timebox to a set of coordinates that represent the query in terms

of starting and ending times and value extents in the range of the current data

set. This QueryElement is associated with the timebox.

3. Each item in the data set is checked against the current query, using one of the

linear search algorithms described in Chapter 6. If the item matches the query,

the appropriate flags are set, indicating the time points in the item that match the

query. An additional flag is set to indicate that the graph overview for this item

should be shown. Finally, statistics needed for updating the query envelope are

updated to include this item.

4. After all items in the data set have been checked, the number of items that match

the query is checked to see if the size of the result set is small enough to display

individual graph overview lines (Section 4.1). If the result set size is small enough,

the graph overview lines are displayed. If not, only the data and query envelopes

are displayed.

5. When the mouse is released, indicating the completion of the query, each of

the items that match the query is displayed in the display list window, and the

envelopes are updated.

A flowchart of this process is given in Figure 5.3.
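The core of steps 2 through 4 can be sketched as follows. The sketch assumes a simple linear mapping between screen coordinates and data coordinates, with time points spread evenly across the canvas width and screen y growing downward; these layout details, the threshold parameter, and all class and method names are assumptions for illustration and are not taken from the TimeSearcher source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of timebox query processing; not TimeSearcher's actual code. */
final class QueryProcessingSketch {
    /** A query in data coordinates: time indices and value extents. */
    static final class Query {
        final int t1;
        final int t2;
        final double low;
        final double high;

        Query(int t1, int t2, double low, double high) {
            this.t1 = t1;
            this.t2 = t2;
            this.low = low;
            this.high = high;
        }
    }

    /** Step 2: convert a screen rectangle into time indices and value extents. */
    static Query fromScreen(double x, double y, double width, double height,
                            double canvasWidth, double canvasHeight,
                            int timePoints, double minValue, double maxValue) {
        double pixelsPerStep = canvasWidth / (timePoints - 1);
        double valuePerPixel = (maxValue - minValue) / canvasHeight;
        int t1 = (int) Math.round(x / pixelsPerStep);
        int t2 = (int) Math.round((x + width) / pixelsPerStep);
        // Screen y grows downward, so the top edge of the box is the high value.
        double high = maxValue - y * valuePerPixel;
        double low = maxValue - (y + height) * valuePerPixel;
        return new Query(t1, t2, low, high);
    }

    /** Step 3: sequentially scan the items, returning the indices of those that match. */
    static List<Integer> findMatches(double[][] items, Query q) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < items.length; i++) {
            boolean ok = true;
            for (int t = q.t1; t <= q.t2 && ok; t++) {
                ok = items[i][t] >= q.low && items[i][t] <= q.high;
            }
            if (ok) {
                matches.add(i);
            }
        }
        return matches;
    }

    /** Step 4: individual graph overview lines are shown only for small result sets. */
    static boolean showGraphOverview(int resultSetSize, int threshold) {
        return resultSetSize < threshold;
    }
}
```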

A minor modification of this approach is necessary for queries that involve si-

multaneous modification to multiple timeboxes. For these queries, new instances of

QueryElement are first created for each timebox before iterating through each of the

items in the data set. Thus, the simultaneous modification of several timeboxes re-

quires recalculation of each of the QueryElement instances, but no additional overhead

(relative to modification of a single timebox) is involved.

The QueryElement class essentially acts as a model in the sense of the Model-

View-Controller architecture often used in GUI implementations. As the visible, on-

screen representation of the query, the TimeBox class, in conjunction with the appro-

priate event handlers, provides both the view (the graphic display) and the controller

(handles for modification) of a timebox query. The QueryElement class converts the

[Figure 5.3 flowchart boxes: Input Event Modifies Query; Create QueryElement; Find Matches, Update Statistics; Result set size below threshold? (Yes/No); Display Graph Overview; Update Data, Query Envelopes; Modifications complete? (Yes/No); Update Display List]

Figure 5.3: The steps involved in TimeSearcher query processing.

graphical coordinates of the TimeBox instance into a meaningful query, thus providing

the model.

The QueryElement class is also necessary for creating an appropriate TimeBox to

represent a query object. This happens, for example, when a saved query is read from

a file: a QueryElement is created, and then the dimensions of the active query screen

are used to translate this QueryElement into the appropriate TimeBox.

5.5 Extending Timeboxes

The query processing code has been designed to be object-oriented and extensible to

support new types of timebox queries. The TimeBox and QueryElement classes pro-

vide support for processing basic timeboxes, and contain all code for handling queries.

Specifically, all of the code used for determining whether or not an entity in a data set

matches a query can be found in the QueryElement class. New timebox queries can be

created by subclassing QueryElement, TimeBox, and TimeBoxHandler for creation of

the timeboxes.

The implementation of variable time timeboxes (Section 3.2) provides a road map

that can be used to create other types of extended timeboxes. The VTTHandler class

subclasses TimeBoxHandler in two important ways:

• The setLimits procedure calls VariableTimeTimeBox.setLimits, informing this

class of the constraints of the current query space.

• The createTimeBox procedure is over-ridden to return the appropriate subclass

of TimeBox.

The subclass of TimeBox is known as VariableTimeTimeBox. This class contains

a great deal of support code for managing the manipulation of the inner constraints,


but this will not be needed in all cases. In general, the minimal requirements in an

extended timebox class will be:

• A paint procedure to appropriately render the box.

• A createQueryElement procedure that returns an instance of the appropriate sub-

class of QueryElement.

Finally, a subclass of QueryElement will be needed. This subclass will need to

over-ride createTimeBox, creating an instance of the appropriate subclass of Time-

Box, along with getCopy. The extended search semantics can be specified by over-

riding matchEntityAll and matchEntityAny, which are the procedures called to deter-

mine whether an entity matches the timebox’s constraints for all of the points in the

given interval, or simply for any of those points. As any-of queries are not particularly

meaningful with variable time timeboxes, the current implementation does not over-

ride matchEntityAny. Instead, the any choice is disabled for variable time timeboxes.

Additional extensions may be needed for some timebox variants. As described

above (Section 5.3.2), queries may need to create interaction handles and handle loca-

tors that match their specific interaction needs and geometries.

Of course, appropriate tool bar and/or menu entries will also be needed to support

switching into the appropriate modes.
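The division of labor described in this section might look roughly like the following sketch. It mirrors the class and procedure names mentioned above in Python form (match_entity_all for matchEntityAll, and so on), but the bodies, the supports_any flag, and the run_length parameter are illustrative assumptions rather than the TimeSearcher implementation.

```python
class QueryElement:
    """Sketch of the basic timebox model: a value range over a time range."""

    supports_any = True  # hypothetical flag: is the "any" choice offered?

    def __init__(self, t_min, t_max, v_min, v_max):
        self.t_min, self.t_max = t_min, t_max
        self.v_min, self.v_max = v_min, v_max

    def in_range(self, value):
        return self.v_min <= value <= self.v_max

    def window(self, series):
        """Values of the series inside the query's time interval."""
        return series[self.t_min:self.t_max + 1]

    def match_entity_all(self, series):
        """True if every point in the interval is inside the value range."""
        return all(self.in_range(v) for v in self.window(series))

    def match_entity_any(self, series):
        """True if at least one point in the interval is inside the value range."""
        return any(self.in_range(v) for v in self.window(series))

    def get_copy(self):
        return type(self)(self.t_min, self.t_max, self.v_min, self.v_max)


class VariableTimeQueryElement(QueryElement):
    """Extension: values must stay in range for run_length consecutive points."""

    supports_any = False  # the "any" choice is disabled for this variant

    def __init__(self, t_min, t_max, v_min, v_max, run_length):
        super().__init__(t_min, t_max, v_min, v_max)
        self.run_length = run_length

    def match_entity_all(self, series):
        run = 0
        for v in self.window(series):
            run = run + 1 if self.in_range(v) else 0
            if run >= self.run_length:
                return True
        return False

    def get_copy(self):
        return type(self)(self.t_min, self.t_max, self.v_min, self.v_max,
                          self.run_length)
```

Only the matching procedure and the copy procedure change in the subclass; the code that iterates over the data set can call match_entity_all without knowing which kind of query it holds.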

5.6 Performance

Information visualization tools strive to provide highly interactive performance for increasingly large data sets. Although a 100ms response time is the goal, this is not always possible for very large data sets. Approximate quantification of the performance of a tool provides a rough understanding of its limits.


Synthetic data sets of various sizes were constructed for evaluation of Time-

Searcher’s performance. For each data set, several operations were conducted:

• Creation of three queries

• Several modifications to those queries

• Deletion of the queries

• A drag-and-drop query.

For each query, the total processing time - including identification of matching items

and all screen updates - was measured. This value was averaged across all queries, for

an average query processing time for each data set.

Data sets with 1000, 10000, 25000, and 50000 items, each with both 100 and 200 time points, were created. A data set with 100,000 items was also used with 100 time points. TimeSearcher was unable to handle a data set with 100,000 items and 200 time points, as this data set exhausted available RAM on the test computer. Graph overviews were turned off in all cases. All tests were run on a 1.33 GHz Pentium III-compatible computer with 512MB of RAM, running Mandrake Linux 8.0. Average response times across all query types are given in Figure 5.4 and Table 5.1.

These results show that TimeSearcher’s performance scales linearly with the num-

ber of items in the data set. In fact, for both 100 and 200 time points, the correlation

was almost perfect. These results can be used to generate a regression that would pre-

dict the performance of TimeSearcher on data sets of various sizes: if t is the query

processing time and n is the number of items in the data set, the regression equations

are as follows:

• 100 time points: $t = 13.77 + .0043n$ ($r^2 = .99$).

• 200 time points: $t = 16.34 + .0057n$ ($r^2 = .99$).

Figure 5.4: Average times for TimeSearcher to completely process queries - including search and display update - on several query types. Results are shown for data sets of 1000, 10000, 25000, and 50000 items with 100 and 200 time points, and 100,000 items with 100 time points only. (Plot of average query processing time in ms against number of items in the data set, with one series for 100 time points and one for 200 time points.)

Average Total Query Processing Time (ms)

Number of Items    100 time points    200 time points

1000               17                 10

10000              56                 90

25000              123                157

50000              238                301

100000             449

Table 5.1: Raw performance data.

It is interesting to note that the performance does not seem to scale linearly with the number of time points. This is consistent with the algorithmic analysis (Chapter 6), which showed that the performance of the TimeSearcher search algorithm was relatively insensitive to the number of time points in the data set.

Despite the high correlations, these results are fairly limited in their applicability

and generality. The tasks used to generate these results were not rigorously controlled,

and the specific timing values are not generalizable beyond the computer system that

was used to run the test.

However, these results do provide a rough measure of the scalability of TimeSearcher. With the computer used in this test, 100ms performance is only likely to be possible on data sets of fewer than 25000 items. However, performance with 50,000 or even 100,000 items is not unreasonably slow. As hardware performance continues to increase, 100ms performance with data sets containing 100,000 items may soon be possible.
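As a rough check on these figures, the regression for 100 time points can be reproduced from the values in Table 5.1 with an ordinary least-squares fit, and either regression line can be inverted to estimate the largest data set that fits a 100ms budget. The sketch below is illustrative only; the helper names fit_line and max_items_within are not part of TimeSearcher.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def max_items_within(budget_ms, intercept, slope):
    """Invert t = intercept + slope * n to find n for a given time budget."""
    return (budget_ms - intercept) / slope


# Table 5.1, 100 time points.
ITEMS = [1000, 10000, 25000, 50000, 100000]
TIMES_100 = [17, 56, 123, 238, 449]

intercept, slope = fit_line(ITEMS, TIMES_100)

# Using the regression coefficients quoted in the text.
limit_100 = max_items_within(100, 13.77, 0.0043)
limit_200 = max_items_within(100, 16.34, 0.0057)
```

With the quoted coefficients, the 100ms budget is reached at roughly 20,000 items for 100 time points and roughly 14,700 items for 200 time points, in line with the estimate that interactive rates are limited to data sets of fewer than 25000 items on this hardware.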

From an implementation viewpoint, understanding of the components of these

query processing times is most useful for identification of processing bottlenecks and

opportunities for optimization. Specifically, understanding of the costs of display up-

dates relative to the other components of query processing will be helpful both for

understanding the potential impact of improvements in rendering and for identifying

areas which would be the most fruitful targets for optimization.

A rough breakdown of the contributions of these components was created by running several operations - including creation, modification, and deletion of timeboxes - over a variety of smaller data sets. Unlike the previous tests, these trials were conducted with graph overviews turned on. This was necessary to get a “worst case” picture of the costs of updating the TimeSearcher display. As these results are not based on a carefully controlled set of query operations, they are not meant to be interpreted as definitive. Rather, they are designed to give a rough picture of where time is being spent in processing queries. Results of this analysis are given in Table 5.2.

Data Set                     Display Time    Total Processing    Display Portion

223 items, 13 time points    108             609                 18%

Table 5.2: Portion of query processing time spent on updating display, for sample queries on some data sets. All times are in ms.

These preliminary results seem to indicate that display is a relatively small part of the overall costs of query processing. Therefore, improvements in rendering should not be expected to substantially improve TimeSearcher’s query performance, and optimization efforts should focus on the search algorithms and related code.


Chapter 6

Search Algorithms


To provide users with rapid, incremental feedback, dynamic query tools must meet

stringent performance requirements. Specifically, queries must be processed within a

100ms update cycle if updates are to appear to be instantaneous [117]. To meet this

goal, developers of dynamic query tools must use efficient techniques to achieve a

high level of performance in two key areas:

• Display: Updating the graphic display to show only those items that match the

query, limiting display updates to areas that will occupy the user’s visual atten-

tion, and other techniques have been used to minimize the overhead of repeatedly

updating the complex displays found in information visualization environments.

• Search: Identifying the subset of items that match the current query requires ap-

propriate indices that support incremental queries. Although a variety of indices

and strategies have been evaluated [70, 129], the choice of search algorithm is

strongly influenced by the specific details of the problem being addressed by a

given system.

Of course, these optimizations become more important as the data sets grow larger.


Strategies used to improve performance of the display components of Time-

Searcher are described in Chapter 5. This chapter focuses on the performance of search

in TimeSearcher. After defining the problem, this chapter introduces several possible

alternative algorithms, and describes an analysis of their performance, using a common

testing platform. This analysis led to the initially surprising conclusion that a relatively

simple sequential search outperformed more sophisticated alternatives. Further explo-

rations aimed at resolving this seeming paradox led to a deeper understanding of the

problem, which might be used as the basis for further examination of potential search

strategies.

It should be noted that the analysis in this chapter applies only to standard time-

boxes. Although search algorithms used for variable time timeboxes (Section 3.2) and

angular queries (Section 4.7) are discussed in this chapter, analysis of their perfor-

mance remains a possible area for future investigation.

6.1 Problem Definition

The search problem presented by timebox queries is found in the definition of a

timebox (given in Chapter 3 and repeated here for convenience). Specifically:

• A set $T$ of time series profiles $t_1, \ldots, t_n$, each containing values for each of $m$ time points. The value of $t_i$ at time $t$ is denoted by $t_i(t)$.

• A timebox is a 4-tuple $b = (t_{min}, t_{max}, v_{min}, v_{max})$. Without loss of generality, we assume that $v_{max} \geq v_{min}$ and $t_{max} \geq t_{min}$.

• Time series profile $t_i$ satisfies timebox $b$ if $\forall t,\ t_{min} \leq t \leq t_{max}: v_{min} \leq t_i(t) \leq v_{max}$. In this case, we say that $S(t_i, b) = true$.


This definition naturally extends to queries formed as conjunctions of multiple timeboxes: $t_i$ satisfies a set of timeboxes $B = b_1, \ldots, b_n$ (written $S(t_i, B)$) if and only if $\forall b_j \in B: S(t_i, b_j) = true$. In the following discussion, we will consider the problem of identifying the items $t_i \in T$ that satisfy a set of timeboxes $B$.

The width of the data set - m - and the number of items in the data set n are the

primary influences on search performance.

User interface requirements for TimeSearcher present another constraint that must

be met by any search algorithm. As TimeSearcher presents items in a set linear order,

search results from any query should be presented in a manner that retains the original

relative order of items in the result set. This ordering will provide consistency that will

help users interpret search results.

6.2 Sequential Search

The naive approach to timebox searching follows directly from the definition of the

problem.

Search Algorithm 1 SEQ NAIVE

• For each $t_i \in T$, check to see if it satisfies each $b_j \in B$.

• To see if $t_i$ satisfies $b_j = (t_{min}, t_{max}, v_{min}, v_{max})$, check $t_i(t)$ for $t_{min} \leq t \leq t_{max}$ to see if $v_{min} \leq t_i(t) < v_{max}$. If this is true for each $t$ in the given range, $S(t_i, b_j) = true$.

• If $S(t_i, b_j) = true$ for all $b_j \in B$, $S(t_i, B) = true$.

In other words, we simply perform the expected iteration of the three loops: for

each of the items in the data set, we look at all of the time points in all of the boxes.


The conjunctive nature of timeboxes leads directly to the first optimization on this scheme. If some $t_i$ fails to meet the constraint for $b_j$ at some time $t$, we know that $S(t_i, b_j) = false$, even if we have not completely processed the range of times $t_{min} \leq t \leq t_{max}$. Thus, processing for any $t_i$, $b_j$ pair can stop as soon as one value outside of the given range is encountered. This is equivalent to the familiar programming language shortcut used in evaluation of conjunctive conditionals.
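The naive scan with this early exit can be sketched in a few lines. This is an illustration under stated assumptions rather than the TimeSearcher code: each series is a list indexed by time point, each timebox is a tuple (t_min, t_max, v_min, v_max), the closed value range of the Section 6.1 definition is used, and the function names are hypothetical.

```python
def satisfies_box(series, box):
    """Check one series against one timebox, stopping at the first failure."""
    t_min, t_max, v_min, v_max = box
    for t in range(t_min, t_max + 1):
        if not (v_min <= series[t] <= v_max):
            return False  # one value outside the range settles the question
    return True


def satisfies_all(series, boxes):
    """Conjunction over timeboxes; all() stops at the first failing box."""
    return all(satisfies_box(series, box) for box in boxes)


def seq_search(data, boxes):
    """Return the indices of matching items, in their original order."""
    return [i for i, series in enumerate(data) if satisfies_all(series, boxes)]
```

Because the items are visited in data set order, the result list keeps the relative order required by the user interface without any sorting step.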

Additional, less obvious, optimizations can be applied to conjunctive queries. In this scenario, we have a set $B = b_1, \ldots, b_n$ of queries and a set $R \subseteq T$ such that $S(t_i, B) = true$ for every $t_i \in R$. A change is made to the query, in one of three ways:

• A new timebox $b_{n+1}$ is added to $B$, leading to $B' = B \cup \{b_{n+1}\}$.

• Timebox $b_j \in B$ is deleted, forming $B'' = B \setminus \{b_j\}$.

• Timebox $b_j \in B$ is modified.

The creation and deletion of queries each present opportunities for optimization.

New Queries: Here, we note that if $S(t_i, B) = false$ then there must be some $b_j \in B$ where $S(t_i, b_j) = false$. Therefore, the addition of $b_{n+1}$ to form $B'$ cannot make $S(t_i, B')$ true. In practical terms, this means that when we add $b_{n+1}$, we must examine only those $t_i$ such that $S(t_i, B) = true$, to see if $S(t_i, b_{n+1}) = true$. If it is, then $S(t_i, B') = true$.

Deletions: This case is the flip side of query creation. If we have an item $t_i$ such that $S(t_i, B) = true$, $S(t_i, B'')$ is true by definition: removing a timebox from the current set makes the query less restrictive. Therefore, we must only examine those $t_i$ where $S(t_i, B) = false$ to see if $S(t_i, B'') = true$.

For the time being, we assume that in the remaining case of query modification, we

must re-examine all entities and timeboxes. An alternative that avoids this overhead is


discussed in Section 6.5.5.

Based on these optimizations, we define the improved sequential algorithm, which

assumes that we are changing an existing set of queries B through the creation, dele-

tion, or modification of a timebox b j:

Search Algorithm 2 SEQ OPTIMIZED

• Begin by assuming that all items meet the (initially null) query

• For each change to the query, respond as follows:

– Creation of a timebox: Check all items that satisfied the previously existing

query. If they satisfy the new timebox, they satisfy the new query. All others

do not.

– Deletion of a timebox: All items that did not satisfy the previously existing query should be checked against all remaining timeboxes in the query ($B''$). The items that satisfy these timeboxes are added to the set of items that satisfied the original query ($B$), to make the set of results of the new query.

– Modification of a timebox: All of the items in the data set are compared

against all of the timeboxes to find the items that match the query.
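The three cases of the optimized algorithm can be sketched as a small class that keeps the current result set between changes. The class and method names below are hypothetical, and the data layout (lists indexed by time point, timeboxes as (t_min, t_max, v_min, v_max) tuples with a closed value range) is an assumption made for the example.

```python
def satisfies_box(series, box):
    t_min, t_max, v_min, v_max = box
    return all(v_min <= series[t] <= v_max for t in range(t_min, t_max + 1))


class IncrementalSearch:
    """Sketch of the optimized sequential scan over a changing set of timeboxes."""

    def __init__(self, data):
        self.data = data
        self.boxes = []
        # All items meet the (initially null) query.
        self.matches = set(range(len(data)))

    def _satisfies_query(self, i):
        return all(satisfies_box(self.data[i], b) for b in self.boxes)

    def create(self, box):
        # Only items that matched the old query can match the new one.
        self.boxes.append(box)
        self.matches = {i for i in self.matches
                        if satisfies_box(self.data[i], box)}

    def delete(self, box):
        # Old matches still match; only former non-matches are re-checked.
        self.boxes.remove(box)
        others = set(range(len(self.data))) - self.matches
        self.matches |= {i for i in others if self._satisfies_query(i)}

    def modify(self, old_box, new_box):
        # No shortcut: every item is compared against every timebox.
        self.boxes[self.boxes.index(old_box)] = new_box
        self.matches = {i for i in range(len(self.data))
                        if self._satisfies_query(i)}

    def results(self):
        """Matching item indices in their original data set order."""
        return sorted(self.matches)
```

A short usage example: after `s = IncrementalSearch(data)`, calling `s.create(box)` narrows the result set, `s.delete(box)` widens it, and `s.results()` always returns item indices in data set order.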

Although the analysis in this chapter is based on this optimized sequential scan

algorithm, the implementation in TimeSearcher does not take advantage of the opti-

mization for query deletion. This is due to the need to update the result display to

indicate which time points in an item match the query. When a timebox b is deleted,

the time points in a given item that match the remaining query must be updated to

reflect the removal of b. This requires a pass through all items, thus rendering the

optimization irrelevant. In practice, this does not appear to significantly impact the

performance of TimeSearcher.


6.3 Sequential Search for Timebox Extensions

Variants of the sequential search algorithm are used to process variable time timebox

queries (Section 3.2) and angular queries (Section 4.7). Although these algorithms are

not analyzed in this chapter, they are outlined here for completeness.

6.3.1 Variable Time Timeboxes

Variable time timeboxes (VTTs) differ from ordinary timeboxes in that values must

be in a given range for at least R consecutive measurements in a wider interval (Sec-

tion 3.2).

Sequential scan processing of these queries is similar to processing of standard

timeboxes. For each entity, processing begins at the start of the larger window defined

by the VTT and steps along until the end of the interval. If the value at a time point

falls within the given range, a counter is incremented. When that counter reaches the

width of the VTT (R), the item matches the VTT. However, if the value falls outside of

the given range, the counter is reset to zero. Since an entity in the data set can match

a VTT during multiple, possibly disjoint intervals, processing proceeds until the entire

interval has been checked - there is no “falling out” of the loop as there is with the

basic sequential scan.
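The counter-based scan might be sketched as follows. The function name vtt_runs and its return value (a list of the maximal in-range runs that are at least R points long) are illustrative choices, not a description of the TimeSearcher implementation; an item matches the VTT when the returned list is non-empty.

```python
def vtt_runs(series, t_min, t_max, v_min, v_max, run_length):
    """Scan the whole window and collect every run of in-range values
    that is at least run_length points long, as (start, end) pairs."""
    runs = []
    count = 0
    for t in range(t_min, t_max + 1):
        if v_min <= series[t] <= v_max:
            count += 1
        else:
            if count >= run_length:
                runs.append((t - count, t - 1))
            count = 0  # value left the range: start counting again
    if count >= run_length:
        # A qualifying run extends to the end of the window.
        runs.append((t_max - count + 1, t_max))
    return runs
```

Note that the loop always runs to the end of the window, so disjoint matching intervals are all reported; this is the reason there is no early exit as in the basic scan.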

The additional checks required for this algorithm make it possible that VTT eval-

uation will be significantly slower than standard timebox evaluation. Although sys-

tematic evaluation has not been conducted, informal evaluation seems to indicate that

VTT evaluation is sufficiently fast for moderately-sized data sets.


6.3.2 Angular Queries

Angular queries (Section 4.7) involve comparison of the angles formed between the horizontal and the segments connecting values for a given item. Specifically, for a given time interval, the angle that such a segment forms with the horizontal must be within a given range. There are two interpretations: “all points” angular queries require that each transition within the interval fall within the specified range, while “end points only” angular queries require only that the segment connecting the values at the start and end points of the interval make an angle that falls within the desired range.

Processing of “all points” angular queries proceeds via a sequential algorithm that

is analogous to the approach used for standard timeboxes. For a given item, the angle

formed by each transition within the range is calculated, and checked to see if it falls

within the desired range. As soon as one angle falls outside of the range, the item has

failed to satisfy the query and processing for that item is complete. As there are n−1

transitions in an angular query covering n time periods, angular queries require one

fewer check than would be required for a standard timebox of the same width.

“End points only” angular queries are even simpler, requiring only the calculation

of the angle formed by the segment between the value at the start point and the value

of the end point. This can clearly be done in constant time, regardless of the width of

the interval.
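Both interpretations can be sketched briefly. The example assumes that angles are measured in degrees in data coordinates, with one time step corresponding to one horizontal unit; how TimeSearcher actually scales the two axes is not stated here, so this scaling and the function names are assumptions.

```python
import math


def angle_deg(v_start, v_end, dt=1):
    """Angle between the horizontal and the segment from v_start to v_end."""
    return math.degrees(math.atan2(v_end - v_start, dt))


def angular_all_points(series, t_min, t_max, a_min, a_max):
    """Every transition in the interval must fall in the angle range.
    all() stops at the first transition that falls outside the range."""
    return all(a_min <= angle_deg(series[t], series[t + 1]) <= a_max
               for t in range(t_min, t_max))


def angular_end_points(series, t_min, t_max, a_min, a_max):
    """Only the segment from the start value to the end value is checked."""
    angle = angle_deg(series[t_min], series[t_max], t_max - t_min)
    return a_min <= angle <= a_max
```

The range `range(t_min, t_max)` visits one fewer position than the number of time points in the interval, which corresponds to the n−1 transitions noted above.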

6.4 Geometric Methods

Geometric approaches to processing timebox searches are based on an alternative in-

terpretation of the data set. A set of n time series profiles, each containing measure-

ments for each of m time points can also be interpreted as a set of mn points in a


2-dimensional space. Each of these mn points is associated with one of the n profiles,

such that each profile has exactly one associated point for each time point. Under this

interpretation, a timebox can be seen as a two-dimensional orthogonal range query - a

query aimed at identifying the points that fit inside the rectangular region covered by

the query.

In other words, a time series ti satisfies a timebox b if all of the values for ti during

the time range covered by b are within the value range specified by b.

Alternatively, we can define $S(t_i, b)$ in terms of the number of points in $t_i$ in the time range that fall into the appropriate value range. Let $C(b)$ be the width of the timebox: $C(b) = t_{max} - t_{min} + 1$. Furthermore, let $C(t_i, b)$ be the number of points in $t_i$ that fall within the constraints of the query: $C(t_i, b) = |Q|$, where $Q = \{t \mid t_{min} \leq t \leq t_{max},\ v_{min} \leq t_i(t) < v_{max}\}$. In this case, we say that $S(t_i, b)$ holds if and only if $C(t_i, b) = C(b)$, that is, if the number of points of $t_i$ that fall inside the timebox is equal to the number of time points contained in the timebox.

Figure 6.1 demonstrates this model. The timebox is three time intervals wide. The

upper line is one entity that has values within the timebox for all three time points, so

the entity will be included in the result set. The lower entity, however, has only two

out of the three values in the needed range, and therefore is not included in the result

set.

Thus, to process a timebox query, we start by identifying the points that fall within this query. For each of these points, we increment a counter associated with the timebox and the time series to which that point belongs. When all of the points have been processed, the items that have a count for the timebox that is equal to the width of the timebox ($C(t_i, b) = C(b)$) are the matches:

Figure 6.1: Example of entities that meet (upper) and fail to meet (lower) the constraints of a timebox.

Search Algorithm 3 Geometric Basic

To process a timebox $b$:

• Initially, assume $\forall t_i \in T: C(t_i, b) = 0$.

• For each of the points $p$ that fall within the timebox $b$, increment the counter $C(t_i, b)$ for the time series $t_i$ to which $p$ belongs.

• The set of items that match the query is given by $R = \{t_i \mid t_i \in T,\ C(t_i, b) = C(b)\}$.
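The counting scheme can be illustrated with a short sketch. Here a plain loop over all points stands in for the range query that a real index would answer, so the example shows only the counting and the final ordered pass; the function name and the (time, value, item) point layout are assumptions made for the example.

```python
from collections import Counter


def geometric_basic(points, box, n_items):
    """points: iterable of (time, value, item_index) triples.
    Returns the indices of matching items in their original order."""
    t_min, t_max, v_min, v_max = box
    counts = Counter()
    # Stand-in for an index-backed range query: visit the points in the box.
    for t, v, item in points:
        if t_min <= t <= t_max and v_min <= v <= v_max:
            counts[item] += 1
    width = t_max - t_min + 1  # C(b)
    # Separate pass over the items, which also preserves their order.
    return [i for i in range(n_items) if counts[i] == width]
```

An item with two of three points inside a timebox of width three, like the lower entity in Figure 6.1, ends with a count of 2 and is left out of the result.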

This approach can be extended to conjunctive queries containing multiple timeboxes by simply maintaining a separate counter $C(t_i, b)$ for each (item, timebox) pair.

After each of the items in the range query have been processed, a separate pass

through all of the items in the data set is needed to identify those items that match the

query. This pass is necessary for two reasons. First, if we are to maintain the ordering

of the items in the data set, including items in the result set as their individual counts

reached the specified threshold would not be sufficient, as the resulting list would be

unordered and require sorting. A linear pass through the list of items in the data set

would be more efficient. During this pass, the optimizations used in Algorithm 2 can

be used for query creations and deletions.


Deletions of timeboxes present another reason for this separate pass through the

entire data set. When a timebox is deleted, an item ti may or may not match the other

timeboxes in the query. Therefore, a separate check will be needed to see if each item

is in the result set. Similar concerns exist for modifications (moving and scaling) of

queries.

The cost of the separate pass through the data set can be ameliorated by the use of the optimizations of the optimized sequential algorithm described above (Algorithm 2).

Further optimizations that minimize the area that must be queried are possible for

query modifications. Taking a cue from clipping algorithms from computer graphics,

we observe that small adjustments to a timebox - either in location or scale - can lead

to a new box that has significant overlap with the previous box. In these cases, query

processing can simply add the points in the areas that are added to the query, and

eliminate points from areas that are removed. The appropriate areas can be quickly

identified using a constant-time clipping approach (Figure 6.2), thus eliminating any

redundant processing.

Implementation of geometric methods requires an appropriate index for efficient

handling of the range queries. Two possibilities - orthogonal range trees and a bucke-

tized grid - are discussed below.

6.4.1 Orthogonal Range Trees

Orthogonal range trees use nested trees to process orthogonal range queries. These

trees are nested interval trees, with each internal node containing a secondary tree. This

tree is used to index all of the items in that internal node along the second dimension.

For a search, the first one-dimensional interval tree is searched to find those items that fall in the appropriate range for that dimension. As leaf and internal nodes that fall within the interval are identified, their associated indices for the second dimension are searched to find the items that fall within both dimensions of the query [38].

Figure 6.2: Clipping: as the timebox is moved to the lower right, the area marked “D” is removed from the query, and the “A” region is added. These two regions must be processed, but there is no need to reprocess the overlap (“O”).

For TimeSearcher, a modified orthogonal range tree can be used to simplify pro-

cessing. The time dimension is searched first, followed by the value dimension. Since

the time dimension covers a known range 0 ≤ k ≤ m− 1, and each entity has a value

at every time point, we use a linear array in place of the range tree for the time dimen-

sion. The start and endpoints in this array can be found in constant time, and the value

indices associated with each included time point are then searched. The second-level

value indices are stored as skip lists [103]. Each skip list contains the value for each

of the n items at the time associated with that skip list. To search the second-level


indices, the entry in the skip list with the lower bound is found, and then items with

successively greater values are read off of the list until the higher bound is reached.
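The two-level structure can be sketched as an array with one ordered value index per time point. In this sketch a sorted Python list searched with the bisect module stands in for the skip list; that substitution, along with the class and method names, is an assumption made to keep the example short, and it supports the same "find the lower bound, then read upward" search.

```python
import bisect
from collections import Counter


class TimeValueIndex:
    """Array over time points; each slot holds (value, item) pairs in value order."""

    def __init__(self, data):
        self.n_items = len(data)
        n_times = len(data[0])
        self.columns = [
            sorted((series[t], i) for i, series in enumerate(data))
            for t in range(n_times)
        ]

    def points_in(self, box):
        """Yield (time, value, item) for every point inside the timebox."""
        t_min, t_max, v_min, v_max = box
        for t in range(t_min, t_max + 1):  # array lookup per time point
            column = self.columns[t]
            # First entry whose value is at least v_min.
            k = bisect.bisect_left(column, (v_min, -1))
            # Read successively greater values until the upper bound.
            while k < len(column) and column[k][0] <= v_max:
                yield t, column[k][0], column[k][1]
                k += 1

    def search(self, box):
        counts = Counter(item for _, _, item in self.points_in(box))
        width = box[1] - box[0] + 1
        return [i for i in range(self.n_items) if counts[i] == width]
```

Each included time point costs one logarithmic lookup plus the number of points read, which matches the per-time-point term in the search time discussed next.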

The total number of points in the data structure is mn (the product of the number

of entities and the number of time points). For a query of width w, the expected search

time should be O(w(logn+k)), where k is the expected number of points that are found

within the query region at each time point. The use of an array instead of a range tree

for the first level of the search has a potential cost, as search trees that might have been

subsumed in internal canonical nodes of the first level tree must be searched explicitly.

However, if w = O(logn), the query cost should be comparable to that of a traditional

orthogonal range tree.

This approach may not be arbitrarily scalable. The need for more memory-efficient

approaches led to the consideration of alternative geometric searches based on bucke-

tized “grid” indices.

6.4.2 Grids

Grid structures divide multi-dimensional space into finite rectangular “buckets” that

contain many items. To process a range query, the buckets that contain the range are

identified, and items in these buckets are checked to determine whether or not they

meet the constraints of the timebox.

In the context of the current discussion, it is straightforward to convert the structure

described above (Section 6.4.1) to a grid. Specifically, the interval tree associated with

each time point is replaced by an array. Each element in this array represents some

range of values, with the range of values and the number of items held constant across

all time periods. To place a data point into the index, a simple linear conversion can be used to go from the point’s value to the appropriate slot in the secondary array for the range containing that value (Figure 6.3).

Figure 6.3: A grid index for a data set with time points 0-9, values 0-80, and 8 buckets in the value dimension. Given this scheme, values from 0-10 will go into bucket 1, 11-20 in bucket 2, etc. The timebox shown will cover the grids for values 21-30, 31-40, 41-50 and 51-60 for times 3-5. Buckets 21-30 and 51-60 are only partially covered, thus their contents must be checked at each time point. The other buckets are completely covered by the timebox, so checking of individual points is not necessary.

The efficiency of the grid approach can be improved slightly by noting that some

buckets are entirely covered by a timebox, while others are only partially covered.

If a bucket is entirely covered, its points need not be checked individually. This is

generally the case for the “interior” buckets - any bucket covered by the timebox other

than the highest valued and the lowest valued buckets.

The granularity of the grid is an adjustable parameter that can influence performance. The granularity is the number of records that would fall in each bucket if values were evenly distributed, so the number of value buckets is calculated by dividing the number of records in the data set by the granularity. Thus, a data set containing 1000 items and a granularity of 20 would lead to a grid containing 50 value buckets.
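A bucketized grid along these lines might be sketched as follows. The class name, the clamping of out-of-range values into the edge buckets, and the use of equal-width buckets over a known value range are assumptions made for the example.

```python
class GridIndex:
    """Array over time points; each slot holds an array of value buckets."""

    def __init__(self, data, v_low, v_high, granularity):
        self.n_items = len(data)
        self.v_low = v_low
        # Number of buckets = number of records / granularity.
        self.n_buckets = max(1, len(data) // granularity)
        self.bucket_width = (v_high - v_low) / self.n_buckets
        n_times = len(data[0])
        self.grid = [[[] for _ in range(self.n_buckets)]
                     for _ in range(n_times)]
        for i, series in enumerate(data):
            for t, v in enumerate(series):
                self.grid[t][self._slot(v)].append((v, i))

    def _slot(self, v):
        """Linear conversion from a value to its bucket, clamped to the grid."""
        k = int((v - self.v_low) / self.bucket_width)
        return min(max(k, 0), self.n_buckets - 1)

    def search(self, box):
        t_min, t_max, v_min, v_max = box
        lo, hi = self._slot(v_min), self._slot(v_max)
        counts = [0] * self.n_items
        for t in range(t_min, t_max + 1):
            for b in range(lo, hi + 1):
                if lo < b < hi:
                    # Interior bucket: fully covered, no per-point check.
                    for _, i in self.grid[t][b]:
                        counts[i] += 1
                else:
                    # Lowest or highest bucket: may be only partially covered.
                    for v, i in self.grid[t][b]:
                        if v_min <= v <= v_max:
                            counts[i] += 1
        width = t_max - t_min + 1
        return [i for i in range(self.n_items) if counts[i] == width]
```

A coarser grid (larger granularity) means fewer buckets to visit but more points to check individually in the two boundary buckets, which is one way the parameter can affect performance.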

6.5 Analysis

Of these three alternative search algorithms, which is most efficient? Sequential

searches would appear to scale poorly, being linear in the number of items in the data

set (n). Geometric approaches appear to have the benefit of multi-dimensional indexing. However, the algorithms under discussion have to handle a variety of queries, which may occur with different frequencies.

Comparison on simulated data can be used to build a better understanding of the

merits of the various approaches. This section describes a test-bed that was used to

compare these alternatives, along with the results of the analyses and some conclusions

that can be drawn.

6.5.1 Methodology

Thorough testing of the various algorithms requires examination of a range of queries

on plausible data sets.

Data Sets

A Perl script was written to generate random time series profiles. Each time series started with a random value between -1 and 1, which was then multiplied by 20 and added to 50, to provide a starting point between 30 and 70. Subsequent values were calculated by adding a second random variable - also scaled by 20 - to the previous value in the set. Values were constrained to run between 0 and 100. In this manner, a “pseudo-random” walk was created.
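A generator of this kind might look like the sketch below. The original script was written in Perl and is not reproduced here; this Python version is an illustration only, and it assumes uniformly distributed random steps and clamping at the bounds, neither of which is specified in the text.

```python
import random


def random_walk(n_points, rng, scale=20.0, start=50.0, low=0.0, high=100.0):
    """One pseudo-random walk: start in [30, 70], then take steps of up to
    +/- scale, keeping every value inside [low, high]."""
    value = start + scale * rng.uniform(-1, 1)
    series = [value]
    for _ in range(n_points - 1):
        step = scale * rng.uniform(-1, 1)
        # Assumption: out-of-range values are clamped to the nearest bound.
        value = min(high, max(low, value + step))
        series.append(value)
    return series


def make_data_set(n_items, n_points, seed=0):
    """Build n_items independent walks; the seed makes the set repeatable."""
    rng = random.Random(seed)
    return [random_walk(n_points, rng) for _ in range(n_items)]
```

For example, `make_data_set(100, 100)` would produce a data set shaped like the smallest one used in the tests described below.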

Items    Time points

100      100

1000     100

10000    100

50000    100

100      1000

100      10000

100      50000 (could not be held in memory)

1000     1000

Table 6.1: Data sets used in algorithm evaluation.

To test for the effects of the number of items in the data set (n), data sets containing

100, 1000, 10000, and 50000 items with 100 time points each were created. To test

the effects of the width of the data set, data sets with 100 items and 100, 1000, and

10000 time points were used. A data set with 100 items and 50000 time points was

attempted, but the test program was not able to hold this data set in memory. A final

data set with 1000 items and 1000 time points was used to test for possible interactions

between the width and depth of the data set. A summary of the data sets is given in

Table 6.1.

Test Queries

Each test query set contained 1000 query blocks, with each of these blocks consisting

of a set of eight operations on a single query (Table 6.2). The resulting test set con-

tained a total of 8000 operations. One such test set was developed for each of the input

data sets.

The parameters for the query values and the extent of the moves were generated using a random scheme similar to the scheme described above for the test data. For data sets involving varying number of time points, the widths of the queries were allowed to grow with the width of the data set.

1. Creation of an initial query

2. Moving the query in the time dimension

3. Moving the query in the value dimension

4. Moving in both dimensions

5. Resizing (scaling) in the time dimension

6. Resizing in the value dimension

7. Resizing in both dimensions

8. Deleting the query

Table 6.2: Query Operations in each block.

Algorithms tested

Four algorithms were tested: optimized sequential search (“Seq”) (Algorithm 2), geometric search with an orthogonal range tree (“Orth”), and geometric search with grid granularities of 20 and 100 (“Grid-20” and “Grid-100”). The choices of granularity for the grid index were somewhat arbitrary.

Metrics

For each test, the total time spent on each type of operation was recorded, along with

the average for each operation and the variance. The analyses that follow are based on

the average time for each operation.

Figure 6.4: Average times (ms) across all operations for data sets with 100 time points and 100, 1000, 10000, and 50000 items. (Plot of average time in ms against number of items, with series Seq, Orth, Grid-20, and Grid-100.)

Testing Platform

All tests were run on a 1.333 GHz Pentium III-compatible computer with 512MB of

RAM, running Mandrake Linux 8.0.

6.5.2 Results

Summary results presenting average times over all operations are presented in Fig-

ure 6.4 and Table 6.3 for the varying depths (the “deep” data sets) and Figure 6.5 and

Table 6.4 for varying widths (the “wide” data sets). For the deep data set, sequential

search was fastest, followed by Grid-20, Orth, and Grid-100. Sequential search was

also fastest for the wide data set, followed by Orth, Grid-20, and Grid-100.

These results show a clear advantage for sequential search over both the orthogonal

Number of items     Orth      Seq    Grid-20   Grid-100

100                  0.7      0.7        2.0        4.5

1000                 9.7      7.2        9.4       19.9

10000              252.9     69.9      243.2      438.5

50000             3614.6    381.6     3071.7     5493.3

Table 6.3: Average times (ms) across all operations for data sets with 100 time points

and 100, 1000, 10000, and 50000 items.

[Plot: average time (ms) versus number of time points; series Seq, Orth, Grid-20, and Grid-100.]

Figure 6.5: Average times (ms) across all operations for data sets with 100 items and

100, 1000, and 10000 time points.

Number of time points    Orth     Seq    Grid-20   Grid-100

100                       0.7     0.7        2.0        4.5

1000                      8.7     0.7       21.7       22.0

10000                   103.7    0.75      248.6    275.902

Table 6.4: Average times (ms) across all operations for data sets with 100 items and

100, 1000, and 10000 time points.

range tree and grid indices. For data sets with the largest number of items, sequential

search is an order of magnitude faster than the others. The results are even more

striking for data sets involving more time points: while the performance of the indexed

algorithms seems to scale linearly with the number of time points, the performance of

the sequential algorithm is not influenced by the width of the data set.

These tests also provide some perspective on the potential limits on the prospects

for dynamic queries with larger data sets. According to Table 6.3, sequential search

of a data set with 10000 items and 100 time points takes 69.9 ms. For the larger data

set with 50000 items, the time increases to 381.6ms. Given these numbers, it would

appear that it will be very difficult to meet the dynamic query goal of 100ms processing

time with this hardware configuration and data sets that have significantly more than

10000 items. In fact, the real limit might be somewhat smaller, as these numbers do

not include the time required for display updates.
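
As a rough illustration of where the 100 ms limit might fall, the sketch below linearly interpolates between the two sequential-search measurements quoted above from Table 6.3. The function name `items_at_budget` is hypothetical, and the assumption of linear growth between the two measured points is only an approximation, so the result should be read as an order-of-magnitude estimate rather than a measured limit.

```python
def items_at_budget(budget_ms, lo=(10_000, 69.9), hi=(50_000, 381.6)):
    """Estimate the item count at which sequential search reaches budget_ms.

    lo and hi are (items, average ms) pairs taken from Table 6.3; growth
    between them is assumed to be linear, which is an approximation.
    """
    (n0, t0), (n1, t1) = lo, hi
    ms_per_item = (t1 - t0) / (n1 - n0)
    return n0 + (budget_ms - t0) / ms_per_item


estimate = items_at_budget(100.0)
print(f"Estimated items at 100 ms: {estimate:.0f}")
```

Under these assumptions the estimate lands somewhat above 10000 items, which is consistent with the observation that the practical limit is not far beyond that size.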

Results broken down by individual operations are given in Figures 6.6 and 6.7 for

variations in the number of items in the data set. Figures 6.8 and 6.9 provide similar

results for variations in the number of time points. As expected, these results are

generally consistent with the averages given in Figures 6.4 and 6.5. For the “deep”

data, the sequential algorithm was always at least as fast as any of the others, with

Grid-20, Orth, and Grid-100 following in decreasing performance rank.

[Four plots of average time (ms) versus number of items; series Seq, Orth, Grid-20, and Grid-100: (a) Query creation, (b) Movement in time (x), (c) Movement in value (y), (d) Movement in both dimensions.]

Figure 6.6: Comparative times for query creation and translation on data sets with 100

time points and 100, 1000, 10000 and 50000 items.

For the test involving the “deep” data set and movement in both directions (Fig-

ure 6.6), the performance of the Grid-20 index was comparable to that of the sequential

scan. However, since performance of the sequential scan was otherwise superior, this

result does not present any reason to prefer any of the other approaches. The perfor-

mance of the sequential algorithm also demonstrates more favorable scaling behavior.

As the geometric approaches begin to show growth rates that appear to be greater than

linear, the sequential scan’s performance maintains consistently linear growth.

[Four plots of average time (ms) versus number of items; series Seq, Orth, Grid-20, and Grid-100: (a) Resize in time (x), (b) Resize in value (y), (c) Resize in both dimensions, (d) Deletions.]

Figure 6.7: Comparative times for query resize and deletion on data sets with 100 time

points and 100, 1000, 10000 and 50000 items.

The wide data set shows a different ordering - Sequential, Orth, Grid-20, and Grid-100,

in order of decreasing performance - but the preference for the sequential scan is

just as clear. Furthermore, the sequential scan shows very little sensitivity to the width

of the data set.

Results for the data set containing 1000 items and 1000 time points are given in

Table 6.5, along with results for 100 items and 1000 time points and 100 time points

and 1000 items for context. Relative to the data set with 1000 items and 100 time

points, the sequential algorithm was 80% slower for queries on this data set: 13.0 ms vs.

7.2 ms. However, the geometric queries were roughly one order of magnitude slower.

[Four plots of average time (ms) versus number of time points; series Seq, Orth, Grid-20, and Grid-100: (a) Query creation, (b) Movement in time (x), (c) Movement in value (y), (d) Movement in both dimensions.]

Figure 6.8: Comparative times for query creation and translation on data sets with 100

items and 100, 1000, and 10000 time points.


Data set                         Orth     Seq    Grid-20   Grid-100

1000 time points, 1000 items    207.2    13.0      213.8        220

100 time points, 1000 items       9.7     7.2        9.4       19.9

1000 time points, 100 items       8.7     0.7       21.7       22.0

Table 6.5: Average times (ms) for the data set with 1000 items and 1000 time points,

with results for both 100 items and 1000 time points and 100 time points and 1000

items given for context.

[Four plots of average time (ms) versus number of time points; series Seq, Orth, Grid-20, and Grid-100: (a) Resize in time (x), (b) Resize in value (y), (c) Resize in both dimensions, (d) Deletions.]

Figure 6.9: Comparative times for query resize and deletion on data sets with 100

items and 100, 1000, and 10000 time points.

This seems to indicate that there may be interactions that might influence query per-

formance for data sets with both large numbers of time points and items. However,

such interactions do not change the basic conclusion - sequential search outperforms

the geometric alternatives.

These results may appear to be somewhat counter-intuitive. Why would a naive

sequential scan out-perform indexed searches? And why is the performance of the

sequential algorithm relatively insensitive to the width of the data set?


6.5.3 Sequential scans vs. Geometric Indices

The superior performance of the sequential scan approach might be explained by an

advantage in the number of points that must be processed for a given timebox query.

Specifically, does the sequential scan algorithm examine fewer points than geometric

approaches to determine the value of S(ti,b) for any ti in the data set?

Figure 6.10 shows a timebox that spans eight time points (times 1-8), along with a

time series that falls within the timebox for seven of those eight points. To determine

S(ti,b) for this timebox and any given ti, the sequential algorithm only needs to examine the first

two values in the given time range: once it is determined that the second value falls

outside of the timebox, we know that S(ti,b) = false, and there is no need to examine

any of the remaining values. The geometric approach does not look at the value at time

two, as this point falls outside of the timebox. Instead, geometric approaches must ex-

amine the seven remaining points for ti that fall inside the box. All of these points (and

any others that fall inside the box) must be examined and the appropriate totals calcu-

lated before the value of S(ti,b) can be determined. Furthermore, this determination

requires a separate, final pass through the data set.

The limiting case for this advantage would appear to be for time series profiles that

are completely contained within the timebox (Figure 6.11). In these cases, both the

sequential and geometric approaches must visit all of the points contained in the width

of the timebox to determine that S(ti,b) = true. However, the geometric approaches

still suffer from the need for a final scan through all of the items in the data set.
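
The difference in the number of values examined can be sketched in a few lines. This is a minimal illustration, not the TimeSearcher implementation: the function names, the `(t_min, t_max, v_min, v_max)` tuple layout, and the sample values are assumptions made for the example, and the geometric case is only simulated by filtering for the points that an index would return from inside the box.

```python
from typing import Sequence, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def sequential_check(series: Sequence[float], box: Timebox) -> Tuple[bool, int]:
    """Return (matches, values examined), stopping at the first value outside the box."""
    t_min, t_max, v_min, v_max = box
    examined = 0
    for t in range(t_min, t_max + 1):
        examined += 1
        if not (v_min <= series[t] <= v_max):
            return False, examined  # boolean shortcut: no need to look further
    return True, examined


def geometric_check(series: Sequence[float], box: Timebox) -> Tuple[bool, int]:
    """Return (matches, points examined) when only in-box points are retrieved and counted."""
    t_min, t_max, v_min, v_max = box
    in_box = [t for t in range(t_min, t_max + 1) if v_min <= series[t] <= v_max]
    # The item matches only if every time point in the box's range was retrieved.
    return len(in_box) == t_max - t_min + 1, len(in_box)


# A series like the one in Figure 6.10: times 1-8, with only the value at time 2 outside.
series = [0, 5, 9, 5, 5, 5, 5, 5, 5]  # index 0 is unused padding
box = (1, 8, 4, 6)
seq_result = sequential_check(series, box)
geo_result = geometric_check(series, box)
```

For this example the sequential check stops after two values, while the simulated geometric check handles the seven points that fall inside the box before it can conclude that the item fails.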

To validate this model for the superior performance of the sequential algorithm,

the above tests were repeated with additional instrumentation for counting the number

of items that were checked in the process of completing each set of queries. For any

given query on a data set, the number of values that might possibly be checked in the


Figure 6.10: A timebox query demonstrating the advantage that sequential processing

has over geometric methods. For this timebox that spans eight time points, sequen-

tial processing can stop after the second time value is identified as falling outside of

the timebox. However, the geometric approaches must examine every point that falls

within the timebox.

course of processing that query is equal to the width of the query - w - multiplied by

the number of items in the data set - n. By comparing the number of values that are

actually tested to this theoretical maximum, we can evaluate the relative performance

of sequential and geometric approaches. Data for the sequential case are presented in

Table 6.6 for the tests involving increased numbers of items, and Table 6.7 for tests

involving increased numbers of time points.

As would be expected, the number of values that might possibly be checked scales

linearly with both the number of items in the data set and the number of time points.

Furthermore, the number of items that is actually checked is relatively small - 7% or


Figure 6.11: The timebox from Figure 6.10, with a time series for which S(ti,b)= true.

                        Number of values checked

Number of items          Possible         Actual    Ratio

100                    25,052,800      1,737,711    0.069

1000                  246,198,000     17,355,879    0.07

10000               2,501,220,000    175,558,021    0.07

Table 6.6: Comparison of number of values checked versus possible number of checks

for sequential search of data sets with 100 time points and 100, 1000, and 10000 items.

                        Number of values checked

Number of time points    Possible         Actual    Ratio

100                    25,052,800      1,737,711    0.069

1000                  248,961,400      1,906,656    0.0077

10000               2,495,734,000      1,782,613    0.00071

Table 6.7: Comparison of number of values checked versus possible number of checks

for sequential search of data sets with 100 items and 100, 1000, and 10000 time points.

less - for both the “wide” and the “deep” data sets. This establishes a benchmark for

the geometric algorithms - if those approaches require a substantially larger portion

of the possible checks, the proposed explanation for the superior performance of the

sequential algorithm would be validated.

The scaling of the number of values actually checked presents some interesting

results. For the “deep” data set involving 100, 1000, and 10000 items, the number

of values actually checked scaled linearly with the number of items in the data set,

maintaining a fairly constant ratio of roughly 7% of the possible number of checks. For

the “wide” data set involving 100, 1000, and 10000 time points, the number of items

actually checked stayed roughly constant - varying between 1,737,711 and 1,906,656

- despite the increase in the number of time points. As a result, the ratios decreased

progressively - from 6.9% for 100 time points to .071% for 10000 time points. This

would seem to be consistent with the insensitivity of search times to the number of

time points in a data set (Figures 6.5, 6.8, and 6.9 and Table 6.4).
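
The ratios discussed above can be recomputed directly from the counts reported in Tables 6.6 and 6.7. The short sketch below does only that arithmetic; the dictionary and function names are illustrative choices, not part of the original instrumentation.

```python
# (possible checks, actual checks) for sequential search, copied from Tables 6.6 and 6.7.
deep = {  # keyed by number of items, 100 time points each
    100: (25_052_800, 1_737_711),
    1000: (246_198_000, 17_355_879),
    10000: (2_501_220_000, 175_558_021),
}
wide = {  # keyed by number of time points, 100 items each
    100: (25_052_800, 1_737_711),
    1000: (248_961_400, 1_906_656),
    10000: (2_495_734_000, 1_782_613),
}


def ratios(table):
    """Return the fraction of possible checks that were actually made, per data set size."""
    return {size: actual / possible for size, (possible, actual) in table.items()}


deep_ratios = ratios(deep)
wide_ratios = ratios(wide)
for size in deep:
    print(f"{size:>6}  deep: {deep_ratios[size]:.4f}  wide: {wide_ratios[size]:.5f}")
```

The deep ratios stay near 0.07 because the actual counts grow with the number of items, while the wide ratios shrink by roughly a factor of ten per step because the actual counts stay nearly flat.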

Data for a similar analysis conducted with the Grid-20 index is given in Table 6.8

for the data set involving varying numbers of items and Table 6.9 for the data set

involving varying numbers of time points. As the various geometric approaches differ

                        Number of values checked

Number of items          Possible         Actual    Ratio

100                    25,052,800      7,227,326    0.29

1000                  246,198,000     73,553,822    0.30

10000               2,501,220,000    723,176,660    0.29

Table 6.8: Comparison of number of values checked versus possible number of checks

for Grid-20 search of data sets with 100 time points and 100, 1000, and 10000 items.


                        Number of values checked

Number of time points    Possible         Actual    Ratio

100                    25,052,800      7,227,326    0.29

1000                  248,961,400     71,606,632    0.29

10000               2,495,734,000    717,849,880    0.29

Table 6.9: Comparison of number of values checked versus possible number of checks

for Grid-20 search of data sets with 100 items and 100, 1000, and 10000 time points.

in the indices used to retrieve the values that must be checked, but not in the actual

values that are checked, this analysis can be considered as representative of geometric

algorithms in general.

As with the sequential algorithm, the number of values actually checked scaled

with the number of items in the data set. However, the number of values checked also

scaled with the number of time points in the data set. Furthermore, in both cases,

the percentage of possible checks made was much higher than in the sequential case:

roughly 29% for the geometric algorithms, as opposed to a maximum of 7% for the

[Plot: number of values checked versus number of items; series Grid-20 and Seq.]

Figure 6.12: The number of values actually checked for sequential and Grid-20 algo-

rithms for data sets involving 100, 1000, and 10000 items with 100 time points.

sequential scan.

Comparative graphs are shown in Figures 6.12 and 6.13 for the “deep” and “wide”

data sets, respectively. Note the similarity between these graphs and the corresponding

graphs for execution times in Figures 6.4, 6.5, 6.6, 6.7, 6.8, and 6.9.

Taken as a whole, these results appear to confirm the hypothesis that the perfor-

mance advantage of the sequential algorithm can be attributed to the gains associated

with boolean “shortcuts” that minimize the number of values that must be checked to

evaluate timebox queries.

[Plot: number of values checked versus number of time points; series Grid-20 and Seq.]

Figure 6.13: The number of values actually checked for sequential and Grid-20 algo-

rithms for data sets involving 100 items with 100, 1000, and 10000 time points.

6.5.4 Theoretical worst-case analyses

An analysis of worst-case search performance for the various algorithms can provide

further insight. For a data set of n items, each containing m time points, worst-case

search performance occurs during the modification of a query that covers the entire

range of the data set. Specifically, this query must cover all m time points, with values

ranging from the lowest to the highest (inclusive) values found in the data set for any

item at any time point. As the modification of any timebox in a query requires the

reprocessing of all timeboxes in that query, this worst-case performance will occur

with a set of k boxes with a union that covers all m time points, or with a single

timebox that covers m time points.


It should be noted that this query is worst-case in the sense that it occupies the entire

query space and there are no interesting queries that would require more processing.

It is certainly conceivable that a user could construct a query that contained several

copies of a timebox that covered the entire query space, but this would be redundant.

Analysis of this query is straightforward for the sequential algorithm. Each of the

n items in the data set would potentially require anywhere from 1 to m checks: one for

each of the time points. As the value range of this query contains all values found in the

data set, all of the m checks would be necessary to verify that any given item matched

the query - no shortcuts would be possible. The total time required for processing this

query would therefore be O(mn).

For the orthogonal range tree algorithm, this worst-case query would require

examination of each of the m range trees in the data set. For each of these trees, O(log n)

time would be required to find the starting point for the query interval in the skip

list, and O(n) would be required to find each of the n points in the skip list. Thus, each

of the m searches would take O(n + log n). The final pass through the data set would

require an additional O(n) time, for a total of O(n + m(n + log n)).

This might be reduced somewhat, if we assume special-case handling that could

avoid the O(log n) search in the skip list for the case of searching for the minimum

value in the range. In this case, the resulting search time would be O(n + mn). This

may be asymptotically equivalent to the running time for the sequential algorithm, but

the constants are probably higher.

Similar results can be found for the grid variant of the geometric search. For each

of the m time periods, each of the buckets would be in the value range of the query, and

each of the points in each bucket must be checked, for a total of n points. As a result,

the time required for the basic search is O(mn). The addition of O(n) for the final pass


through the data set leads to O(n + mn). This is equivalent to the result found for the

orthogonal range version of the geometric algorithm.

This analysis implies that the sequential algorithm is likely to outperform the geometric

approaches even in pathological worst-case scenarios.
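
The three worst-case bounds derived in this section can be placed side by side. The block below only restates those bounds; the subscripted names are labels introduced here for convenience, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
T_{\text{seq}}(n, m)  &= O(mn) \\
T_{\text{orth}}(n, m) &= O\bigl(n + m(n + \log n)\bigr) = O(mn + m \log n) = O(mn) \\
T_{\text{grid}}(n, m) &= O(n + mn) = O(mn)
\end{align*}
```

Since $m \log n \le mn$, all three bounds collapse to $O(mn)$, so any worst-case difference between the approaches would come from constant factors such as the extra $O(n)$ final pass, not from asymptotic growth.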

6.5.5 Further Examination of Sequential Algorithms

The analysis presented thus far argues that the sequential algorithm outperforms the

geometric alternatives. Although further investigation explains this result, the sequen-

tial algorithm may still seem somewhat unsatisfying.

Specifically, Algorithm 2 seems to include an intrinsic inefficiency. For operations

that involve modification of a timebox, all of the remaining timeboxes are compared

against each item in the data set. In other words, if there are k timeboxes - b0 . . .bk−1,

and box l is deleted, each item in the data set must be checked against all of the k−1

remaining boxes. This is potentially wasteful, as many of these checks may have been

completed previously. If, for example, item t_x had been previously found to match b_y

(y ≠ l), we should not have to repeat the check to see if S(t_x,b_y) = true.

An alternative formulation of the sequential algorithm might be used to avoid this

problem. In this model, we use a hash table for each entity ti in the data set. This

hash table contains pointers to the timeboxes that the entity satisfies. When a timebox

operation occurs, this approach still iterates over each of the time series profiles in the

data set. When checking profile ti against timebox b, the entry for that b in ti’s profile

is removed, and then (if the operation is not a delete), ti is compared against b to see if

there is a match. If there is, b is added to ti’s hash table. Finally, the number of entries

in ti’s hash table is compared to the number of active timeboxes in the query. If they are

equal, ti satisfies the query as a whole: S(ti,B) is true. This approach is summarized in Algorithm 4.


Search Algorithm 4 SEQ HASHED

• Begin by assuming that all items have an empty hash table.

• For each change to the query involving timebox b, and each item ti:

– Remove b from ti’s hash table.

– if the operation is not a deletion, check ti against b. If S(ti,b) = true, add

b to ti’s hash table.

– If the number of items in ti’s hash table is equal to the number of current

timeboxes, ti matches the query as a whole - S(ti,B) = true.
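
The steps of Algorithm 4 might be sketched as follows. This is an illustrative reading of the algorithm, not the TimeSearcher code: the class name, the `update` method, the string box identifiers, and the use of a Python `set` per item in place of a hash table are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """S(ti, b): true if the series stays inside the box over its whole time range."""
    t_min, t_max, v_min, v_max = box
    return all(v_min <= series[t] <= v_max for t in range(t_min, t_max + 1))


class HashedSequentialSearch:
    def __init__(self, data: Sequence[Sequence[float]]) -> None:
        self.data = data
        self.boxes: Dict[str, Timebox] = {}
        # One (initially empty) set of satisfied timebox ids per item.
        self.satisfied: List[Set[str]] = [set() for _ in data]

    def update(self, box_id: str, box: Optional[Timebox] = None) -> List[int]:
        """Create or modify box_id (box given) or delete it (box is None).

        Returns the indices of items that satisfy every current timebox.
        """
        if box is None:
            self.boxes.pop(box_id, None)
        else:
            self.boxes[box_id] = box
        matches = []
        for i, series in enumerate(self.data):
            self.satisfied[i].discard(box_id)      # remove b from ti's table
            if box is not None and satisfies(series, box):
                self.satisfied[i].add(box_id)      # re-add b only if ti matches it
            if len(self.satisfied[i]) == len(self.boxes):
                matches.append(i)                  # S(ti, B) = true
        return matches


data = [[1, 1, 1, 1], [1, 5, 1, 1], [5, 5, 5, 5]]
search = HashedSequentialSearch(data)
after_a = search.update("a", (0, 1, 0, 2))
after_b = search.update("b", (2, 3, 0, 2))
after_delete = search.update("a")
```

Note that only the changed timebox is re-evaluated against each item; results for the unchanged timeboxes are reused from the per-item sets. In this sketch an empty query matches every item, which is a design choice of the example.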

Evaluation of this algorithm requires a somewhat different approach than that

which was taken above. As this revised algorithm is aimed at eliminating costs for

cases involving modification of queries, it will be most effective for queries involving

multiple timeboxes (as opposed to the single-timebox queries given above).

Alternate query sets similar to those used for the comparison of sequential and ge-

ometric algorithms were developed. Like the original query sets, these contained a

series of 1000 repetitions of 8 query operations. However, these repetitions were con-

ducted after creating four extra timeboxes. These four timeboxes were held constant

while the 8000 operations were performed.

Results for these tests are given in Figure 6.14 (for “deep” data sets) and Fig-

ure 6.15 (for “wide” data sets). Although the hashed version of the algorithm seems to

perform better on smaller data sets, the original sequential algorithm seems stronger

for larger data sets.

[Plot: average time (ms) versus number of items; series Seq and Seq Hashed.]

Figure 6.14: Optimized sequential vs. Hashed sequential for data sets involving 100,

1000, and 10000 items.

6.5.6 Discussion

Although the analysis described above appears to support the use of a sequential scan

approach for timebox searches, several questions regarding the interpretation of these

results and their generalizability remain unresolved.

Generalizability

The sample query and data sets are not necessarily representative of real data sets and

user queries. Therefore, the results should not be taken as definitive or predictive.

Instead, they should be used for comparative discussion of the various algorithms.

[Plot: average time (ms) versus number of time points; series Seq and Seq Hashed.]

Figure 6.15: Optimized sequential vs. Hashed sequential for data sets involving 100,

1000, and 10000 time points.

In particular, the ratios of the number of values checked to the number of possible

checks (Tables 6.6, 6.7, 6.8, and 6.9) are probably artifacts of the strategy

used to generate test data and queries. Other data sets and queries are likely to have

significantly different ratios.

Anomalous Data

There was one class of query operation where the grid index had performance compa-

rable to that of the sequential scan. For movement in both directions with the “deep”

data sets, the Grid-20 index was almost as fast as sequential scan for data sets involving


10000 items (78.8 ms for sequential scan and 337.5 for Grid-20). For 50000 items, the

two approaches had virtually indistinguishable performance: 430.5ms and 438.6ms

for sequential and Grid-20, respectively (Figure 6.6d).

Given the otherwise consistent superiority of the sequential approach, this result is

somewhat puzzling. The most likely explanation is that this is an artifact of the specific

query data set used in the tests. Further investigation would be needed to clarify. In

any case, the Grid-20 index never outperforms the sequential scan.

Scaling with Width of the Data Set

The analysis of the data sets involving varying widths of the time series revealed an

unexpected result: query execution time for the sequential algorithm stayed roughly

constant as both time series and queries grew wider (Figures 6.5, 6.8, 6.9, and Table 6.4).

This result might be understood by considering what would be required for an

increase in time points to lead to an increase in the time required to process a query.

First, the queries must increase in width. Processing a query of width k will take at

most k comparisons for each item in the data set, regardless of the number of time

points for each item. This requirement is met by the test queries, which can increase

in width with the data set. Second, we must have many items in the data set that must

be scanned for the whole width of the query. If items fall out of the timebox quickly

(Figure 6.10), the increased width of the data set and query will not lead to increased

processing time.

More concretely, consider a timebox b = (tmin, tmax,vmin,vmax). The vertical range

vmax − vmin might be considered as p - the portion of the entire value range of the data

set that is included. If we assume that data values are randomly distributed throughout

the entire range, p is also the probability that the value of a time series ti will fall inside

of b for any point in time i. As a time series must stay within the value range of the

timebox b for each of the k = tmax − tmin + 1 time periods included in the timebox, the

probability that any time series will match the timebox is p^k. Since p can be expected

to be significantly less than one, this value will quickly become very small even for

small values of k. For example, if the user creates a relatively broad query covering half

of the value range (p = .5), the likelihood that a time series will match this query will

be less than .1 if the query is only four time periods long (Figure 6.16). For data sets

containing such randomly distributed values, the likelihood is that many time series

will quickly fall outside of wide timeboxes, thus making performance independent of

the width of the data set.
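
The arithmetic behind this argument is easy to reproduce. The sketch below assumes, as the text does, that each value falls inside the timebox independently with probability p; the function names are hypothetical.

```python
def match_probability(p: float, k: int) -> float:
    """Probability that a random series stays inside the box for all k time periods."""
    return p ** k


def fail_by(p: float, j: int) -> float:
    """Probability that a random series has left the box by the j-th time period."""
    return 1.0 - p ** j


half_range_four_periods = match_probability(0.5, 4)      # the p = .5 example above
quarter_range_five_periods = match_probability(0.25, 5)  # the timebox of Figure 6.16
quarter_range_fail_by_four = fail_by(0.25, 4)
print(half_range_four_periods, quarter_range_five_periods, quarter_range_fail_by_four)
```

With p = 0.5 and k = 4 the match probability is 0.5^4 = 0.0625, which is below the 0.1 figure quoted above. Under the same independence assumption, a box covering a quarter of the value range is left within four time periods by more than 99% of random series.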

Most meaningful data sets will not have values that are randomly distributed, so

likelihoods may be somewhat higher. However, most interesting data sets contain

profiles with non-trivial changes over time - exactly those profiles that are unlikely to

be contained in relatively constrained timeboxes that cover long intervals.

Performance with the data set with 1000 items and 1000 time points (Table 6.5)

provides some evidence that performance is not completely insensitive to the num-

ber of time points. For this data set, the sequential search was 80% slower than it

was on the data set with 1000 items and 100 time points. However, this increase was

relatively modest, given the ten-fold increase in the number of time points. This per-

formance seems particularly good when compared to the geometric algorithms, which

were roughly one order of magnitude slower with this wider data set.

Further analysis would be needed to fully characterize the performance of the se-

quential algorithm on data sets with large numbers of both items and time points.

However, it appears that query performance can be expected to scale well with the

width of the data set.


Figure 6.16: Why time series query performance is independent of the width of the

series. As this timebox covers 25% of the value space and five time periods, a randomly

generated time series would only have odds of < 1% of satisfying the timebox (like

t2 does). The odds that a time series will fail to meet this query by the fourth time point

(like t1) are greater than 99%.

Variants on Sequential Search

The comparison between the optimized sequential (Algorithm 2) and the hashed sequential

(Algorithm 4) algorithms (Section 6.5.5) was inconclusive. Further comparisons, perhaps including

data and query sets that might be more representative of actual user tasks, might be

necessary to gain a deeper understanding of the strengths and weaknesses of these

alternative approaches.


6.6 Next Steps

This chapter has formulated the timebox search problem, described alternative ap-

proaches, and presented results based on synthetic query sets. These tests have shown

that sequential scans with heuristic optimizations outperform searches based on more

sophisticated geometric indices. By counting the number of values checked in the

different circumstances, this analysis established an explanation for the superior per-

formance.

The sequential algorithms benefit from the ability to quickly and easily determine

- on the basis of the first value from a time series that falls outside of the timebox -

when a time series profile will fail to satisfy a timebox. Sequential approaches use this

information to eliminate the need to examine any values in a time series subsequent to

that value that falls outside of the timebox. Geometric algorithms, on the other hand,

lack this global knowledge of a single value for a time series that will be sufficient to

conclude that the time series falls outside of a timebox. As a result, these approaches

must examine all data points that fall within a timebox, even though many of them

may belong to items that will not fall within the timebox.

This understanding can be used to identify the requirements that must be met by

any proposed algorithm that would hope to improve upon the sequential scan algo-

rithms described above. The key to the success of the sequential scan is in its ability to

quickly identify profiles that cannot satisfy a timebox. An index that could

identify, in less than linear time, only those profiles that might satisfy a

timebox could potentially outperform the sequential search.

One possible approach might be to reduce each time series in a data set to a one-

dimensional projection on the value axis, covering the range between the maximum

and minimum values covered by that time series. These projections would be searched


in an interval tree. Each timebox could then be converted into a similar interval, and

search would consist of finding all of the profiles that overlapped with the timebox and

then doing a complete search on those candidates.
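
The projection idea can be sketched as a two-stage search. To keep the example short, a plain linear filter over the precomputed value extents stands in for the interval tree, so this sketch shows only the filtering logic and not the sub-linear lookup that a real interval tree would provide. All names and the `(t_min, t_max, v_min, v_max)` tuple layout are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def value_extent(series: Sequence[float]) -> Tuple[float, float]:
    """Project a series onto the value axis as its (minimum, maximum) interval."""
    return min(series), max(series)


def candidates(extents: Sequence[Tuple[float, float]], box: Timebox) -> List[int]:
    """Indices of series whose value interval overlaps the timebox's value interval.

    A linear scan is used here in place of an interval tree.
    """
    _, _, v_min, v_max = box
    return [i for i, (lo, hi) in enumerate(extents) if lo <= v_max and hi >= v_min]


def search(data: Sequence[Sequence[float]],
           extents: Sequence[Tuple[float, float]],
           box: Timebox) -> List[int]:
    """Run the complete check only on the candidates that survive the projection filter."""
    t_min, t_max, v_min, v_max = box
    return [
        i for i in candidates(extents, box)
        if all(v_min <= data[i][t] <= v_max for t in range(t_min, t_max + 1))
    ]


data = [[1, 2, 3], [10, 11, 12], [2, 2, 2]]
extents = [value_extent(series) for series in data]
narrow_box = (0, 1, 1.5, 2.5)
narrow_candidates = candidates(extents, narrow_box)
narrow_matches = search(data, extents, narrow_box)
```

In this small example the filter discards only the series whose values lie entirely outside the box's value range; a series that merely overlaps the range still requires the complete check.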

Unfortunately, there are at least two problems with this proposal. The first involves

ordering of items: maintaining the consistent ordering of items would require an ex-

pensive sorting of the result list after the search. Even if this requirement is relaxed,

there is a very real possibility that the items in the data set will have substantial overlap

in the projections of their profiles. This would minimize the discriminatory power of

the interval tree and (in the limit) reduce this approach to a sequential search.

It is important to note that the conclusions presented in this chapter relate only

to complete-matching of time series data sets with timeboxes. Numerous algorithms

have been suggested for similarity matching on subsequences and other approaches to

searching time series data (Section 2.2). These approaches might be worth reconsider-

ing in the context of possible extensions to the timebox query language (Chapter 9).

Processing of timebox and related queries might be limited by the “dimensionality

curse” - the inherent difficulty of searching in high-dimensional cases. As a time series

data set with n time points can be viewed as a set of n-dimensional vectors, a timebox

query over that data set can be considered to be a query in n-dimensional space. Re-

cent analyses of index structures for nearest neighbor searches in high-dimensional

space have shown that sequential scans outperform indexed searches for moderate di-

mensionalities (< 20) [21, 115, 142]. Thus, sequential scans might outperform these

indices for time series of even moderate width.

In fact, the performance degradation of these indices may be even greater for time-

boxes. Nearest neighbor searches are based upon calculations of distances between

data points and a fully-specified query point. Timebox queries can be substantially


more vague, as conjunctive queries may specify constraints on some, but not necessar-

ily all, of the time values. In essence, a timebox query can be seen as a similarity query

with at least one specified constraint and an arbitrary number of “don’t care” values.

Although further analysis would be necessary to confirm this conjecture, it seems rea-

sonable to expect that the performance degradations for these queries would occur at

lower dimensionalities than those seen for completely specified similarity queries.
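The view of a timebox query as a partially specified constraint can be made concrete with a minimal sketch of a sequential scan. The `Timebox` type, the function names, and the sample data below are illustrative assumptions and are not taken from the TimeSearcher implementation. Each timebox constrains only the time points it covers; all remaining time points are "don't care" dimensions.

```python
from typing import NamedTuple, Sequence


class Timebox(NamedTuple):
    """Hypothetical timebox: time indices x1..x2 and values y1..y2, inclusive."""
    x1: int
    x2: int
    y1: float
    y2: float


def matches(series: Sequence[float], box: Timebox) -> bool:
    """True if every value in the box's time range lies within its value range."""
    return all(box.y1 <= series[t] <= box.y2 for t in range(box.x1, box.x2 + 1))


def sequential_scan(data: dict, boxes: Sequence[Timebox]) -> list:
    """Return names of the series that satisfy every timebox (conjunctive query)."""
    return [name for name, s in data.items() if all(matches(s, b) for b in boxes)]


def constrained_dimensions(boxes: Sequence[Timebox]) -> set:
    """Time points that carry a constraint; all others are "don't care" values."""
    return {t for b in boxes for t in range(b.x1, b.x2 + 1)}


# Illustrative data: three series with six time points each.
data = {
    "A": [10, 12, 15, 30, 32, 31],
    "B": [10, 11, 40, 30, 29, 28],
    "C": [50, 12, 14, 5, 6, 7],
}
query = [Timebox(1, 2, 10, 20), Timebox(4, 5, 25, 35)]
result = sequential_scan(data, query)
constrained = constrained_dimensions(query)
```

In this example only four of the six dimensions are constrained, so the query says nothing about time points 0 and 3.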

The results described above seem to indicate that dynamic query processing for data

sets containing more than 100,000 items may be impractical for some time. Alternative

strategies might be developed to handle larger data sets. For example, searching might

be done on clusters of similar profiles, allowing users to “drill-down” to actual data

items once a cluster of interest is found.

Chapter 7

Empirical Evaluations

Evaluation is a key component of the process of developing interactive systems. Em-

pirical studies, user observations, and other analytic approaches to examination of the

use of the tool in practice can help validate ideas, support (or refute) underlying as-

sumptions, and otherwise clarify understanding of the issues surrounding the system

under investigation.

This chapter describes two controlled design studies that were conducted with

TimeSearcher. These studies investigated various aspects of the timebox query model

and the TimeSearcher application, with the goal of providing formative feedback that

would be useful for revising and improving the utility of these tools.

Both of the studies asked participants to use a direct-manipulation timebox in-

terface alongside two alternatives to complete tasks involving a search for items of

interest in a data set involving stock prices. An additional study was attempted and

terminated, due to difficulties with user comprehension of study tasks. This study is

described in Appendix C.

Chapter 8 presents several case studies with researchers who have been using

TimeSearcher for examining data sets in their ongoing work. Based on observations

made during sessions spent directly with these users as they worked on problems that

they found meaningful, these case studies document the utility of the tool as seen by

motivated users.

Both approaches to evaluation have their strengths and weaknesses. Empirical

studies are well-suited for understanding the impact of small design changes and com-

paring alternatives in well-controlled environments. These studies can also be too nar-

row, focusing on minute details that may be uninteresting, even if they are easy to test.

Case studies provide powerful testimonials to the utility of a tool, but as they are far

less rigorous than empirical studies, conclusions are often less clear and generalizable.

These difficulties were notable in the course of the evaluation of TimeSearcher.

The two approaches to evaluation provided markedly different results. Although users

of the system were enthusiastic and found the tool to be valuable, the results of the

empirical studies are less clear. The first study showed that form fill-in interfaces

outperform direct manipulation timeboxes under certain circumstances, and the second

study failed to show any significant difference between the alternatives. Understanding

these apparent paradoxes will be a goal of the discussion of these evaluations.

7.1 Evaluation of Input Mechanisms for Questions of

Varying Complexity

7.1.1 Interfaces

Two equivalent alternative means of specifying query constraints were considered [footnote 1]:

[Footnote 1: In the discussion below, “Timebox” will refer to standard timeboxes as implemented in TimeSearcher, “Form Fill-in” will refer to the form fill-in interface, and “Range Slider” will refer to the range slider interface.]

Figure 7.1: A form fill-in interface for specifying query constraints.

Figure 7.2: A range slider interface for specifying query constraints.

1. Form Fill-in: Using traditional text entry widgets, users could type values to

specify a query equivalent to a timebox (Figure 7.1).

2. Range Sliders: Paired range sliders - one for time constraints and one for value constraints - could be used to specify query parameters (Figure 7.2).

The alternative interfaces can be viewed as indirect means of creating timebox

queries: the parameters expressed in the slider or form fill-in form queries equiva-

lent to what might be expressed with a timebox. Furthermore, these parameters were

displayed on the screen with a box, just as if a box had been created with direct ma-

nipulation.

Form fill-in was chosen as a “traditional” interface design, based on commonly-

accepted conventions for graphical user interfaces. Range sliders were chosen as a

potentially more powerful alternative that has been shown to be useful in earlier infor-

mation visualization work [8].

A modified and instrumented version of TimeSearcher was built to serve as the

platform for running the study. Known as tsexp, this version involved several modi-

fications to the TimeSearcher interface, along with additional functionality needed to

run the study.

In constructing tsexp, interface components found in TimeSearcher were removed

if they were irrelevant to the study or if they somehow interfered. Thus, the details-on-

demand window and the item list were eliminated as being irrelevant. The range sliders

for adjusting queries were removed, as they provided a tool for modifying queries that

could interfere with the query manipulation methods that were being examined.

The resulting interface contains three windows: the display list, the overview win-

dow, and the query space. The display list is analogous to the display list in Time-

Searcher - a scrollable window containing individual graphs for the items in the data

set. The overview window was used to show a graph overview of the data set, and

for display of the boxes corresponding to the query parameters. For tasks involving

timebox queries, this window was also used for query input. The query space - the third window, at the bottom of the screen - is used for query input. Unlike timeboxes, which

can be drawn directly on the graph space, form fill-in and range slider queries require

additional display space. This window was used to display the input devices, along

with other necessary controls. For tasks involving timeboxes, this space was used to

provide feedback regarding the extent of each timebox.

tsexp includes functionality for reading a set of tasks from a text file. This file

indicates the number of items in the session, data files to be used for that session, and

the questions, along with indicators describing the type of question and its complexity.

A participant starts a task in tsexp by pressing the “start” button on the toolbar. This

leads to the loading and display of the appropriate data file, along with the initialization

of the query space. A popup window displaying the current task, and the type of

interface, is also displayed (Figure 7.3). The user then proceeds to answer the question.

Users were instructed to find the answer to the question, press the “stop” button on

the toolbar, and then write down the answer.

Figure 7.3: The tsexp interface.

For each task, data stored included the time required to complete the task, the

number of timeboxes created, the number of modifications, and the number of items

deleted. For the studies described below, task completion time was the only variable

analyzed. The time between the pressing of the “start” and “stop” buttons was used as

the task completion time.

Although every attempt was made to keep the differences between the three in-

terfaces as minimal as possible, slight differences in their handling were necessary.

Unlike timeboxes, which can be drawn anywhere to specify a query, the form fill-in

and range slider interfaces require some initialization. It was decided that these con-

trols should be initialized to contain the maximum extent of the data set. Thus, these

tasks began with a query that contained all of the items in the data set.

Query execution also differs slightly. The range slider and timebox interfaces exe-

cute queries implicitly with every mouse event. Form fill-in queries are executed either

by pressing “return” in any of the form fill-in boxes, or by pressing the “run query”

button on the toolbar.

Further special handling was needed for creation of additional query terms and for

deletion of query terms. For timebox queries, these mechanisms were straightforward

and based on TimeSearcher: new terms could be created by selecting the drawing

icon on the toolbar and drawing the new box. Deletion of queries is accomplished by

right-clicking on the timebox and selecting “delete” from the pop-up menu.

Form fill-in and range slider queries share common mechanisms for creation and

deletion. The “New Query Item” button on the toolbar causes a new query term to be

created. Each query component occupies a separate line in the bottom window of the

tsexp display. As with the original query term, this new term will initially occupy the

entire extent of the data space. Each term in the query has a “delete” button, which can

be used to remove that term. If only one term is present, it cannot be deleted.

All three interfaces provide users with feedback indicating the extent of the query

items. For range sliders and form fill-in, the extents are provided with each query line.

For timeboxes, the bottom window is used to display lines with feedback displaying

the extent of each box that has been created. This feedback is dynamically updated

as boxes are moved. In all cases, selecting a query item leads to highlighting of the

corresponding feedback (Figure 7.4).

Figure 7.4: Feedback provided in the tsexp interface. Note the highlighted border

around the feedback corresponding to the selected timebox.

7.1.2 Complexity

There are two sources of complexity in the class of time series tasks that might be

handled by timeboxes:

1. Number of modifications: Many tasks involve comparison of results from

slightly differing queries. These tasks require creation of a query that is sub-

sequently modified, along with comparison of the corresponding results. The

difficulty of these tasks increases with the number of modifications/comparisons

that must be made.

2. Number of query terms: Tasks that involve identification of complex patterns

require creation of several terms. The complexity of these tasks increases with

the number of terms required.

This study investigates the first source of complexity, with the other source held

constant. The incomplete study described in Appendix C attempted to address the

second source of complexity. In both studies, three levels of complexity - low, medium

and high - are used.

This design does not account for the possibility of any interaction between the

sources of complexity. Although a more comprehensive study that accounted for in-

teractions might have been interesting, the resulting 3x3x3 design (interface vs. # of

modifications vs. # of query terms) would have required a daunting number of tasks

from each participant. Furthermore, comparative investigation of the relative impact

of the types of complexity is of secondary interest.

7.1.3 Task Types

The study followed a within-subjects design consisting of two sets of tasks. The first

set, which occupied the bulk of the session, involved well-defined questions, while the

second involved exploratory tasks.

Well-Defined Tasks

Participants were asked to use each of the three interfaces to complete tasks at each of

the three levels of task complexity, resulting in a 3x3 design. This session contained

18 tasks - 2 repetitions for each of the 9 possible combinations of the two conditions.

The ordering of interface presentation was varied among the participants, with all

6 possible orderings represented with equal frequency. Three data sets were used for

these tasks, with the data sets similarly varied to avoid disproportionate presentation

of any combination of interface and data set.

In all cases, tasks were presented in increasing order of difficulty, and all of the

tasks for a given complexity level were completed before the next level of complexity started. Thus, if the order of interfaces was A, B, C, the task order would be A low-complexity, B low-complexity, C low-complexity, A low-complexity, B low-complexity, C low-complexity, A medium-complexity . . . .
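The presentation order described above can be sketched as follows. This is an illustrative reconstruction from the description in the text, not the code used to run the study; the names are assumptions, and the rotation of data sets across tasks is omitted because the text does not specify it.

```python
from itertools import permutations

INTERFACES = ("Form Fill-in", "Range Slider", "Timebox")
COMPLEXITIES = ("low", "medium", "high")   # presented in increasing difficulty
REPETITIONS = 2                            # repetitions per interface/complexity pair


def task_sequence(interface_order):
    """List (interface, complexity) pairs in presentation order for one participant."""
    return [
        (interface, complexity)
        for complexity in COMPLEXITIES
        for _ in range(REPETITIONS)
        for interface in interface_order
    ]


# All 6 possible orderings of the three interfaces.
orderings = list(permutations(INTERFACES))
sequence = task_sequence(orderings[0])
```

Each participant sees 3 x 3 x 2 = 18 tasks, and every complexity level is exhausted before the next one begins.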

The levels of complexity were defined in terms of the number of changes to an

initially-specified query that would be needed to answer a question. Low-complexity

queries simply required answering a question that could be specified with a single time-

box, medium complexity tasks required comparison between three conditions (two

modifications), and high complexity tasks involved comparison between five condi-

tions (four modifications). For example,

1. Low Complexity: How many stocks had prices between $10 and $30 during

weeks 1-5?

2. Medium Complexity: Which price range has the most stocks during days 29-30:

$50-$75, $75-$100, or $68-$93?

3. High Complexity: More difficult queries involving comparison between five

possibilities: Which days have the most stocks with prices between $50 and

$100: 2-10, 4-12, 6-14, 8-16, or 10-18?

Medium and high complexity questions involved modification in either time or

value, but not both. Tasks used for this study are given in Appendix B. As the tasks in

this study only involved the use of one set of constraints at any given time, the tsexp

interface that disallowed multiple simultaneous query constructs was used.

The data sets used in this study tracked 30 days of stock prices for a set of 200

stocks, extracted from a larger set of actual stock prices from 1998-1999 [footnote 2].

Twelve graduate and undergraduate students from the University of Maryland’s Department of Computer Science participated in this study. These twelve participants represent 2 participants for each of the six possible orderings of interface presentation.

[Footnote 2: Thanks to Martin Wattenberg for providing stock price data.]

The study materials and interface were tested with two pilot subjects and revised

based on the resulting feedback. Study participation took about one hour.

Exploratory Tasks

Participants were asked to use each of the three interfaces to find items in the data set

that were somehow “interesting” or “unusual”. Users were asked to find 3 such items

with each interface. The exact definition of what constituted an “interesting” item was

left to the discretion of the participant.

7.1.4 Hypotheses

This study examines the following hypothesis:

Hypothesis 1 Direct-manipulation of graphical query widgets is faster for specifying

and modifying complex time series queries than alternative interfaces that are seman-

tically equivalent.

A secondary hypothesis addresses the interaction between interface type and task

complexity:

Hypothesis 2 The advantages of timeboxes will be greater for more difficult tasks.

This comparison is, by design, quite narrow. None of the other aspects of the

TimeSearcher display were included in the study. This approach focuses the evaluation

specifically on the query specification mechanisms. Future studies might evaluate the

impact of interface components such as the envelopes and overviews (Chapter 4).

7.1.5 Procedure

After signing informed consent forms, participants read a short introduction to the

problem and tasks.

A training session containing 6 questions followed. Each of the three interfaces

was included twice in the training session, with 1 low-complexity question and 1

medium-complexity question. High-complexity questions were not included in the

training session. Training questions were repeated as needed in order to familiarize

users with the interfaces and then the tasks. When necessary, the administrator of the

study completed one or more of the training tasks for the participants. In these cases,

the participants repeated the tasks on their own as well.

The well-defined tasks were presented after the training session. Each task was

presented with a three-minute time window. If the participant did not arrive at a suitable answer within that window, they were given a second three-minute time window to repeat the task. No further attempts were allowed. The dependent measure for these tasks was the time required for completion.

Exploratory tasks followed the well-defined tasks. Participants were given up to

3 minutes with each interface. During that time, they were asked to find the items of

interest. Measures for these tasks included the number of items actually found, and the

time required to find them.

During all tasks, the administrator of the study was observing the participants’

interactions with the system.

After the training session, users completed a short subjective questionnaire. Ques-

tions were based on a subset of the Questionnaire for User Interface Satisfaction

(QUIS [32]) (Appendix B.4). Users were also asked to identify the interface that they

preferred to use for each of the two tasks.

7.1.6 Results

Results for the well-defined tasks are given in Figure 7.5. These results were analyzed

with a repeated measures analysis of variance (RMANOVA). As expected, task com-

pletion times increased significantly with complexity (F(2,103) = 53.25, p < .01).

The impact of the interface was also strongly significant (F(2,103) = 25.03, p < .01),

but not in the manner that was expected. The form fill-in interface was fastest overall

(41.9ms average for all tasks), followed by the range slider (54.1ms), and timeboxes

were slowest (73.4ms). The interaction between interface and task complexity was

not significant (F(4,99) = 1.18, p = .33). These results clearly fail to support the

hypotheses.

Further examination of the times for each user and task supports the generally poor

performance of timeboxes: for 10 out of 12 users, performance with timeboxes was in-

ferior to performance with the other interfaces for all three tasks. In the two remaining

cases, timeboxes outperformed range sliders for the high-complexity tasks.

Results for the exploratory tasks are given in Figures 7.6 and 7.7. There were

no significant differences between the three interfaces, either in the number of items

correctly identified (RMANOVA, F(2,33) = .60, p = .55) or in the task completion

time (F(2,33) = .89, p = .42).

Subjective satisfaction results are given in Figure 7.8. For three of the four ques-

tions - terrible/wonderful, frustrating/satisfying, and difficult/easy - the form fill-in

interface was rated most highly, followed by range sliders and timeboxes. This differ-

ence was significant in all three cases (ANOVA, F(2,33) = 10.8, p < .01, F(2,33) =

13.77, p < .01, and F(2,33) = 26.13, p < .01, respectively). For the rigid/flexible

question, range sliders were rated most highly, followed by form fill-in and then time-

boxes, but these results were not significant (F(2,33) = 1.19, p = .32).

[Plot: average task completion time (ms) by complexity (Low, Medium, High) for the Form Fill-in, Range Slider, and Timebox interfaces.]

Figure 7.5: Average completion time (with standard deviation error bars) for well-

defined tasks.

               Form Fill-in   Range Slider   Timebox
Well-Defined        9              2            1
Exploratory         2              5            5

Table 7.1: User preferences by interface for the different task types.

When asked which interface they preferred for each type of task, users expressed a

strong preference (9/12) for the form fill-in interface on the well-defined tasks. Prefer-

ences for the exploratory task were more mixed, with five users preferring range sliders

and five preferring timeboxes (Table 7.1).

Original plans for this study called for 18 participants - 3 for each of the 6 orderings

of interface presentation. Analysis after 12 subjects led to the above results. As these

results are generally unambiguous, the study was terminated at that point.

[Plot: number of items correctly identified, by interface.]

Figure 7.6: Number of items correctly identified in exploratory task

7.1.7 Discussion

This study failed to support the hypothesis that timebox queries would provide bet-

ter performance than the alternative. In fact, form fill-in interfaces provided the best

performance, followed by range sliders and finally by timeboxes. For three of the

four measures of subjective satisfaction, users rated the three interfaces in the same

order. Observations of participant interactions with the system and comments made

during the sessions were consistent with the statistical results. Participants frequently

commented that they liked the form fill-in interface and found timeboxes hard to use.

The tasks used in this study might have played an important role in these results.

All of the well-defined tasks involved precisely defined regions for comparison, with

time periods and dollar values expressed exactly. This sort of task is especially well-suited for the form fill-in interface: to specify - for example - a range between $50 and $75, participants simply had to type in the two numbers and press “return”.

[Plot: average task completion time (ms), by interface.]

Figure 7.7: Average task completion time for exploratory tasks.

Completing these well-defined tasks with the range sliders or timebox interface

requires fine-grain movement of user interface widgets over a small number of pixels.

This led to frustration for many participants, as the interface did not always provide

the level of control that they might have wanted. Participants would often get close

to the desired values and then overshoot, oscillating back and forth until reaching the

desired value. For timebox queries, this happened more frequently for changes in

values than in times, which is consistent with the significantly higher granularity of

the value dimension.

[Plot: average subjective rating for the Terrible/Wonderful, Frustrating/Satisfying, Difficult/Easy, and Rigid/Flexible questions, by interface.]

Figure 7.8: Average subjective satisfaction ratings (1-9, 9 is best), n = 12.

The choice of tasks may also limit the generality of these results. The fully-defined queries used in this study are not necessarily representative of how timeboxes and TimeSearcher might be used by actual users. Users engaged in data exploration tasks are likely to engage in a wide mix of timebox manipulations, including creating, scaling, moving, and deleting query components. This study seems to indicate that timeboxes are slower than the alternatives for the basic operation of creating a query component, but other operations are not addressed. If query modification is faster with

timeboxes than with form fill-in or range slider interfaces, overall performance for

real user tasks might be best with timeboxes. Furthermore, query creation might be

substantially different for real tasks. Users engaged in data exploration may not be

interested in exact criteria for initial query specification. In this case, the performance

penalties associated with timeboxes might be significantly reduced.

The choice of population for study participants may have been a contributing factor

in these results. TimeSearcher is designed to be a tool for motivated domain experts.

Training and familiarity with the tool are implicit in that assumption. For this study, participants were given minimal training (less than 30 minutes). Furthermore, the stock

price data set may have been unfamiliar to some of the users. Although it is likely that

most computer science students have some familiarity with the stock market, they may

not have much experience using charts of stock prices.

Observation of study participants revealed some behavior patterns that may have

been the result of the relative lack of training and experience with the timebox interface.

Some users seemed intimidated by the timebox interface - the blank screen presented

at the start of each task may have left them unclear how to proceed. Other participants

had difficulty interpreting the effects of modifications to timeboxes. Specifically, they

were surprised when changes in the size of a timebox led to unanticipated changes in

the size of the result set.

Some of the participant confusion in interpreting timebox queries might be at-

tributed to a fundamental asymmetry in timebox interpretation. When the range of a

timebox in the value dimension (vertically) is increased, the query is more inclusive:

for each of the n observations included in the timebox, the range of acceptable values

has increased. However, an increase in the extent of a timebox in the time dimension is less inclusive. If a timebox is increased from n observations to n′ (n′ > n) observations, an additional n′ − n constraints have been imposed. Changes in the two

dimensions are therefore not comparable: when the range of the box increases in one

dimension (vertically), the result set may grow larger, but when the range increases

in the other direction (horizontally), the result set may shrink in size. To some users,

this may be somewhat counterintuitive, particularly if they believe that enlarging the

timebox should enlarge the data set. Indeed, several participants seemed to experience

difficulties in interpreting query results after they modified the temporal extent of a

timebox.
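The asymmetry can be illustrated with a small sketch. The helper functions and the data are hypothetical and serve only to show the effect: widening the value range can only add items to the result set, while widening the time range can only remove them.

```python
def timebox_matches(series, x1, x2, y1, y2):
    """True if the series stays within y1..y2 at every time point x1..x2."""
    return all(y1 <= series[t] <= y2 for t in range(x1, x2 + 1))


def result_set(data, x1, x2, y1, y2):
    """Names of all series that satisfy the timebox."""
    return {name for name, s in data.items() if timebox_matches(s, x1, x2, y1, y2)}


# Illustrative data: three series with four time points each.
data = {
    "A": [20, 22, 24, 25],
    "B": [20, 22, 35, 40],
    "C": [20, 28, 29, 27],
}

base = result_set(data, 0, 1, 15, 25)     # times 0..1, values 15..25
taller = result_set(data, 0, 1, 15, 30)   # value range enlarged
wider = result_set(data, 0, 3, 15, 25)    # time range enlarged
```

Here the base query returns A and B, the taller box also admits C, and the wider box retains only A, even though both modifications enlarge the box on the screen.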

Figure 7.9: A demonstration of the difficulty of resizing small handles. The large

timebox on the left has handles that are clearly separated and easily graspable. The

small timebox on the right has handles that are only a few pixels apart, and are therefore

harder to select.

Other difficulties may have contributed to the disparity in task performance. With

the timebox interface, queries that involved small time or value ranges proved particu-

larly difficult to move or resize. This difficulty was caused by the resize and movement

handles on the outline of the timebox. In general, the user must click on a corner or on

the middle of one of the sides. Very small timeboxes may have handles that are sepa-

rated by a few pixels or less, making selection very difficult (Figure 7.9). As a result,

these queries become especially time-consuming and frustrating. Similar behavior

may happen with range sliders as the thumbs at opposite ends of the slider move

closer together. Form fill-in interfaces, which do not suffer from such difficulties, may

have increased advantages for queries covering a small range.

These observations present a design challenge for the timebox interface. Specif-

ically, how might the interface be modified to better support moving and resizing of

small timeboxes? One possibility would be to use some form of interface “gravity”

that would attract the mouse pointer towards the nearest handle. Alternatively, some

local magnification - perhaps through a lens - might be used to display the area in

question in greater detail, thus allowing more fine grain control.
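One way the “gravity” idea might work is sketched below. This is a speculative illustration rather than a feature of TimeSearcher: the function names, the 6-pixel capture radius, and the pixel coordinates are all assumptions. The pointer is snapped to the nearest handle (a corner or the midpoint of a side) when it falls within the capture radius.

```python
import math


def handle_positions(left, top, right, bottom):
    """Corners and side midpoints of a box, in pixel coordinates."""
    cx, cy = (left + right) / 2, (top + bottom) / 2
    return [
        (left, top), (cx, top), (right, top),
        (left, cy), (right, cy),
        (left, bottom), (cx, bottom), (right, bottom),
    ]


def snap_to_handle(pointer, handles, radius=6.0):
    """Return the nearest handle within `radius` pixels, else the pointer itself."""
    nearest = min(handles, key=lambda h: math.dist(pointer, h))
    return nearest if math.dist(pointer, nearest) <= radius else pointer


handles = handle_positions(100, 50, 108, 58)    # a small 8x8 pixel timebox
snapped = snap_to_handle((110, 59), handles)    # just outside the lower-right corner
unchanged = snap_to_handle((140, 90), handles)  # far from every handle
```

For very small boxes the capture regions of adjacent handles overlap, so a nearest-handle rule of this kind resolves the ambiguity but does not remove the need for precise pointing.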

Another approach would be to supply alternate, indirect tools for modifying time-

boxes. TimeSearcher takes this approach, providing range sliders and form fill-in fields

that can be used to modify the time and value extents of a timebox (Chapter 4). While

these tools have proven useful, improvements to the direct manipulation interface have

the potential to be more flexible and easier to use. Additional implementation and

evaluation will be necessary to compare alternative approaches.

Several participants also had difficulty dragging timeboxes over large horizontal

ranges. When answering a question that involved looking at a given value range at dif-

ferent points in time (see the “High Complexity” example, above), participants would

define a box that examined the first condition and then attempt to drag it horizontally

to the time ranges covered in subsequent conditions. In doing so, they often found that

the box drifted vertically during the course of the movement, requiring a readjustment

of the value range after the desired time range had been reached. This readjustment

appeared to contribute substantially to both task completion time and user frustration

with timeboxes.

This drift was largely a result of the lack of stability in mouse movement. The

difficulty of moving a mouse in a vertically constrained tunnel is inversely related to the

width of the tunnel [2], making strictly horizontal movement virtually impossible. As

a result, any horizontal mouse movement is likely to contain substantial vertical noise.

Since the timebox interface does not have any constraints on the vertical movements

of timeboxes, this noise will lead to vertical movements of the timebox. Since the

timebox interface constrains horizontal movements to be in discrete quanta based on

the number of time points in the active data set, horizontal drift does not cause a similar

problem when the mouse is moved vertically.
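The inverse relationship noted above is commonly expressed through the steering law for movement along a straight tunnel. The formulation below is offered as an assumption about the model behind [2]; the text itself states only that difficulty is inversely related to tunnel width. Here T is the movement time, A is the length of the tunnel, W is its width, and a and b are empirically fitted constants.

```latex
T = a + b \, \frac{A}{W}
```

Under this model, a strictly horizontal drag corresponds to W approaching zero, so the predicted movement time grows without bound as the permitted vertical deviation shrinks.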

A simple modification to TimeSearcher provides some assistance in overcoming

this difficulty. When the user clicks and drags a box with the middle mouse button or

mouse wheel (as opposed to the left mouse button), the timebox will move horizontally,

but not vertically. Of course, TimeSearcher users can also use the range slider to

indirectly adjust the time range of a timebox without modifying the value range.
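The difference between the unconstrained and horizontally constrained drags can be sketched as follows. The function, its parameters, and the pixel scales are hypothetical and are not drawn from the TimeSearcher source; the sketch assumes that screen y coordinates grow downward, so a positive vertical pixel offset lowers the value range.

```python
def drag_timebox(box, dx_pixels, dy_pixels, pixels_per_time_point,
                 pixels_per_value_unit, horizontal_only=False):
    """Return a moved copy of box = (x1, x2, y1, y2) after a mouse drag.

    Horizontal movement snaps to whole time points. Vertical movement is
    continuous unless horizontal_only is set (the constrained drag).
    """
    x1, x2, y1, y2 = box
    steps = round(dx_pixels / pixels_per_time_point)   # discrete time quanta
    dv = 0.0 if horizontal_only else -dy_pixels / pixels_per_value_unit
    return (x1 + steps, x2 + steps, y1 + dv, y2 + dv)


box = (2, 10, 50.0, 100.0)

# A mostly horizontal drag with 3 pixels of unintended vertical noise.
free = drag_timebox(box, dx_pixels=41, dy_pixels=3,
                    pixels_per_time_point=20, pixels_per_value_unit=2)
locked = drag_timebox(box, dx_pixels=41, dy_pixels=3,
                      pixels_per_time_point=20, pixels_per_value_unit=2,
                      horizontal_only=True)
```

In the unconstrained case the small vertical offset shifts the value range from 50-100 to 48.5-98.5, while the one extra horizontal pixel is absorbed by the quantization of the time axis.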

Observation of study participants revealed a range of problem-solving strategies

that were used and problems that participants encountered. Several users followed a

serial process in creating timeboxes, manipulating one side at a time. For example,

instead of dragging a box horizontally to the left, they would drag the left-hand side

to the left, and then drag the right-hand side. As the sides could be dragged inde-

pendently without changing the value range, this provided greater control and avoided

the “vertical drift” problems discussed above. Similarly, some users created boxes by

drawing them along the horizontal axis and then dragging them up to the desired value

range. This technique may have been useful for increasing accuracy in the time range.

Some aspects of the system implementation and study design may have influenced

user performance. To simplify the user tasks, the testing software rounded all dis-

played values to integer values. However, the values in the data set were not similarly

changed. This led to user confusion, as a vertical movement of a box (perhaps be-

tween lower bounds of 53.6 and 53.8) might not have changed the value displayed

(which would stay at 54), but the upper bounds might have changed (perhaps from 73

to 74). When faced with this problem, users often tried repeatedly to adjust the boxes

appropriately. The study administrator tried to identify these situations and define the

task as complete when the user was close enough, but this problem increased task

completion times for the timeboxes.
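The rounding problem can be reproduced with a few lines. The box height of 19.8 is an assumed value, chosen so that the displayed bounds agree with the example given in the text; the function name is likewise illustrative.

```python
def displayed_bounds(lower, height):
    """Return the integer-rounded (lower, upper) bounds shown to the user."""
    return round(lower), round(lower + height)


height = 19.8                              # fixed value extent of the box (assumed)
before = displayed_bounds(53.6, height)    # actual bounds 53.6 to 73.4
after = displayed_bounds(53.8, height)     # actual bounds 53.8 to 73.6
```

A movement of 0.2 leaves the displayed lower bound at 54 but changes the displayed upper bound from 73 to 74, so the box appears to change size even though only its position has changed.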

Several users were also confused by the differences in scales between the query

space and the displays of the individual items. As the query space was taller than each

individual item in the display list, features and transitions that were prominent and easy

to spot in the query space may have been difficult to find in the individually-displayed

items. This confused users and created difficulties in completing the exploratory tasks.

The form fill-in and range slider interfaces may have suffered from another artifact

of the design of the software. For these interfaces, the controls (text fields and range

sliders, respectively) are decoupled from the input box - users must go to the query

window and examine two sets of controls presented horizontally (Figures 7.1 and 7.2).

Some users found this arrangement confusing, as they had difficulty correctly mapping

the controls to the correct dimension. This confusion presented itself in the form of an

inappropriate data entry - attempting to enter a time constraint in the value control, for

example. Most users stopped doing this after the error was pointed out to them.

7.2 Empirical Evaluation of Input and Output for Exploratory Tasks

The study described above (and the study described in Appendix C) focused solely on

query input. This decision was made in order to focus the study on the merits of the

timebox query input. This resulted in a study that did not explore all of the strengths

of TimeSearcher - notably, the graph overviews and the scrolling display list.

This study was designed to augment the first study by adding consideration of

query result display to the evaluation. Drawing on lessons learned from the first study,

this study used different formats for presentation of tasks, and contained fewer tasks,

than the first study.

7.2.1 Interfaces

This study compared three different interfaces for completing queries on time series

data. Two of these interfaces are identical to those used in the first study (Section 7.1):

the form fill-in interface and the timebox interface.

The third interface used form fill-in for query specification, with a spreadsheet-like

table of numeric values to display query results. Each row in this spreadsheet was a

single item in the data set, with each column containing one of the time periods. An

additional column contains the names of the items (Figure 7.10). When the query is

executed (either by pressing “return” in one of the text entry fields or by pressing the

“Run Query” button), this table is updated to display the items that match the query.

Despite this different display, procedures for query specification and modification are

identical to those used in the form fill-in interface described above.

This design intentionally omits the range slider interface used in the previous study.

Figure 7.10: The form fill-in interface with tabular display of query results. Each row

contains the data for one item in the set, with the values for each time period displayed in the columns.

Inclusion of a fourth condition would have lengthened sessions, potentially making

them unacceptably long.

For this study, the total task completion time was defined as the interval between

pressing the “start” button and the last modification of any item in the query. This is in

contrast to the previous study, which used the time between pressing the “start” button

and the “stop” button as the task completion time. The approach used for this study

has the advantage of greater accuracy, as it is not dependent upon an action that users

often forgot to take.
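The timing rule described above can be illustrated with a minimal Python sketch. The event log format, the event names ("start", "modify", "stop"), and the function name are assumptions made for illustration; they are not taken from the study software.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float   # seconds on the session clock (hypothetical log format)
    kind: str  # "start", "modify", or "stop" (hypothetical event names)


def completion_time(events):
    """Interval between the "start" press and the last query modification."""
    start = next(e.t for e in events if e.kind == "start")
    modifications = [e.t for e in events if e.kind == "modify" and e.t >= start]
    if not modifications:
        return None  # the query was never modified after the task started
    return max(modifications) - start


# A participant who finished adjusting the query at 41.2 s but did not
# press "stop" until 77.0 s is credited with the earlier time.
log = [
    Event(3.0, "start"),
    Event(10.5, "modify"),
    Event(41.2, "modify"),
    Event(77.0, "stop"),
]
```

Under this rule the "stop" event plays no part in the measurement, so a forgotten or delayed "stop" press does not inflate the recorded time.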

7.2.2 Tasks

Task design in this study attempted to avoid troubling characteristics of the tasks used

in the initial study (Section 7.1) and in the aborted study (Appendix C). In the first

study, well-defined tasks involving complete specification of time and value ranges

proved to be well-suited for the form fill-in interface. In the aborted study, verbal

descriptions of a less-precise pattern were found to be confusing to study participants.

In an attempt to find a middle ground between these two extremes, tasks in this

study were designed to be precise enough to be easily understood while also being

open-ended enough to be more challenging than the tasks from the initial study. Tasks

generally asked users to identify items that fell within a given value range for some

number of days, or had other transitions of a well-defined magnitude.

Each participant completed four tasks: two training tasks and two timed tasks.

One of the training tasks was somewhat simpler than the other, in order to help par-

ticipants gain familiarity with the interfaces. The tasks were the same for all three

interfaces. Balanced ordering of the presentation of the interfaces was used to over-

come any learning effect that might have been caused by repeated exposure to the

questions. Each task was presented to the users with a graphical depiction of an item

matching the pattern. These graphics were included to help participants understand

the questions (footnote 3). The questions used in this study are given in Appendix D.

Footnote 3: Thanks to Francois Guimbretiere for this suggestion.

7.2.3 Hypothesis

This study was designed to test the following hypotheses:

Hypothesis 3

1. Graphical display of results will lead to faster task completion time than tabular display.

2. Direct manipulation specification of queries will lead to faster task completion time than form fill-in specification.

3. Task completion time will be fastest for the direct manipulation interface, followed by the form fill-in interface with graphical feedback and finally by the form fill-in interface with tabular feedback.

7.2.4 Procedure

The session began with the signing of informed consent forms, and a brief explanation

of the goals of the study and the tasks. The main body of the session consisted of three

blocks - one for each of the three interfaces. Each block consisted of the following

steps:

1. The administrator of the study described the interface and demonstrated its use.

2. The participant was given the opportunity to try the interface.

3. The participant completed the four tasks.

For each of the tasks, the user was instructed to read the task description and to

verify that they understood the question before starting the task. This often involved

having the user restate the question. When there was any confusion, the administrator

provided clarification. This emphasis on comprehension was included in an attempt to

avoid the comprehension problems faced in the third study (Appendix C).

Participants had two minutes (120 seconds) to complete each task. Only one

attempt was allowed for each question - if the question was not answered at the end of

the allowed time, the participant simply moved on to the next task.

The data sets used for this study were synthetic data sets that were hand-tuned

to include answers to the various tasks. Specifically, data sets containing randomly-

generated values for 13 time points for each of 100 stocks were generated, and then

modified to guarantee that they each contained at least five items that would be correct

answers for each of the tasks. Four data sets were used - one for training and one for

each of the three interfaces. The ordering of the data sets for the timed task questions

was varied so that every possible pairing of data set with interface occurred equally

frequently, and the ordering of the data sets within the sessions was also balanced.
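The construction of such a data set can be sketched as follows. This is only an illustration under stated assumptions: the value range of 0 to 100, the example task (values within $10 to $35 for the first four time periods), and all function names are hypothetical, and the actual data sets were tuned by hand rather than by a script of this kind.

```python
import random


def in_range_for(series, lo, hi, start, length):
    """True if every value in series[start:start+length] lies in [lo, hi]."""
    window = series[start:start + length]
    return len(window) == length and all(lo <= v <= hi for v in window)


def make_data_set(n_items=100, n_points=13, min_answers=5, seed=0):
    """Random series, adjusted until at least min_answers items match the task."""
    rng = random.Random(seed)
    data = {
        f"item{i:03d}": [round(rng.uniform(0, 100), 2) for _ in range(n_points)]
        for i in range(n_items)
    }
    # Hypothetical task: stay within [10, 35] for the first four time periods.
    answers = [name for name, s in data.items() if in_range_for(s, 10, 35, 0, 4)]
    for name, series in data.items():
        if len(answers) >= min_answers:
            break
        if name not in answers:
            # Overwrite the first four points so that this item becomes an answer.
            for t in range(4):
                series[t] = round(rng.uniform(10, 35), 2)
            answers.append(name)
    return data, answers
```

A generator of this kind would need one planting step per task, and, as discussed later in this chapter, the planted items for different tasks should be checked for unintended overlap.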

After the session, each participant completed a subjective satisfaction form similar

to the form used in the first study (Appendix B.4).

The initial design of the study was revised based on feedback from three pilot

subjects. The most significant change that was made on this basis was simplification

of one of the training tasks.

Thirteen Computer Science graduate students from the University of Maryland

participated in this study. Due to technical difficulties, data from one of the subjects

was not collected correctly, so the analysis only included the results from the remaining

twelve subjects. Thus, each of the six possible interface orderings was used by two

subjects.
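The balanced assignment of interface orderings can be sketched in a few lines of Python. The interface labels and the round-robin assignment are assumptions made for illustration; the text states only that each of the six orderings was used by two subjects.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels for the three interfaces compared in this study.
INTERFACES = ("form with table", "form with visual", "timebox")


def assign_orderings(n_subjects=12):
    """Cycle through all 3! = 6 interface orderings, one per subject."""
    orders = list(permutations(INTERFACES))
    return [orders[i % len(orders)] for i in range(n_subjects)]


assignment = assign_orderings()
counts = Counter(assignment)  # each ordering should appear exactly twice
```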

[Plot: Average Task Completion Time (ms), axis range 0-120, by Interface (Form with Table, Form with Visual, Timebox)]

Figure 7.11: Average task completion time with standard deviation error bars.

7.2.5 Results

Average task completion times were 49.94 seconds for the timebox interface, 58.52

seconds for the form fill-in interface with tabular feedback, and 59.07 seconds

for the form fill-in interface with visual feedback. These results appear to indicate

a slight advantage for the timebox interface, but the results are not significant

(ANOVA, F(2,69) = 0.76, p = .47) (Figure 7.11). Separate analyses of each of

the two timed tasks also failed to show any statistically significant differences be-

tween the three interfaces (ANOVA, F(2,33) = 0.02, p = .98 for question one and

F(2,33) = 1.06, p = .36 for question two) (Figure 7.12).
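For readers unfamiliar with the statistic, the following minimal sketch shows how a one-way ANOVA F value and its degrees of freedom are computed. It is not the analysis software used for this study, and the sample values in the usage note are invented. The reported F(2,69) is consistent with three interfaces and 72 observations (12 subjects, two timed tasks, three interfaces), since 3 - 1 = 2 and 72 - 3 = 69.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

As a usage example, three hypothetical groups of 24 completion times each would yield degrees of freedom (2, 69), matching the form of the overall test reported above.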

As expected, the results appear to have been influenced by a learning effect. Of

the twelve subjects, only one was fastest with the first interface that they saw, while

six were fastest with the second interface and five were fastest with the third. Similarly,

[Plot: Average Task Completion Time (ms), axis range 0-120, by Interface, with separate series for Question 1 and Question 2]

Figure 7.12: Average task completion time (with standard deviation error bars) for

each of the two timed tasks.

nine of the twelve subjects were slowest with the first interface that they used, while

the remaining three were slowest with the second interface. Grouping the questions

in terms of ordering of presentation (first interface, second, or third), reveals a sig-

nificant effect of interface ordering, with the third interface presented having the best

performance (ANOVA, F(2,69) = 4.97, p < .01). Paired t-tests of the orders showed

that the difference between the second and third interface presented was not significant

(t = 0.12, p = .45), but the first interface was significantly slower than both

the second and third interfaces (t = 2.61, p < .05 and t = 1.68, p < .05, respectively).

Examination of the results for the individual participants reveals that differences

in performance may be stronger in some individuals than in others. Six of the twelve

participants had the fastest completion times with the timebox interface. For these

six subjects, task completion times for the timebox interface were significantly faster

[Plot: Average Task Completion Time (ms), axis range 0-120, by Interface]

Figure 7.13: Average task performance times (with standard deviation error bars) for

the six participants who were fastest with the timebox interface.

than with either of the form fill-in interfaces (ANOVA, F(2,15) = 4.40, p < 0.05,

Figure 7.13). The remaining six subjects did not show any significant differences

between the three interfaces (ANOVA, F(2,15) = 0.38, p = .69, Figure 7.14).

Some of this effect may have been caused by an imbalance of ordering. Of the

six participants who were fastest with the boxes, three of them used the form fill-in

interface with tables first, and only one of them used the timebox interface first. Further

investigation would be needed to conclusively determine whether the differences in

performance observed for these six subjects reflect a meaningful effect, as opposed to

being a result of the order of presentation.

Participants clearly preferred the timebox interface. The timebox interface was

rated significantly higher for the four questions that rated the interfaces (ANOVA, p <

[Plot: Average Task Completion Time (ms), axis range 0-120, by Interface]

Figure 7.14: Average task performance times (with standard deviation error bars) for

the six participants who were fastest with either of the form fill-in interfaces.

0.05 in all cases, Figure 7.15). When asked to indicate which interface they preferred,

ten of the twelve subjects indicated the timebox interface, one indicated the form fill-in

interface with visual feedback, and one indicated an equal preference for the timebox

interface and the form fill-in interface with visual feedback.

7.2.6 Discussion

This study failed to show any significant differences in task performance times for

the three interfaces. This result stands in contrast with the outcome of the first study,

which showed a significant performance advantage for the form fill-in interface. Al-

though further study might be necessary for a complete understanding, these discrep-

ancies might be the result of the different designs of the two studies. Analysis of these

[Plot: Average Subjective Rating, axis range 0-10, for Form with Table, Form with Visual, and Timebox]

Figure 7.15: Average subjective satisfaction ratings (1-9, 9 is best), n = 12. The preference

for the timebox interface was significant in all cases.

differences can provide some insight to the study results and suggest further studies.

The first major difference between the two studies was in the interfaces used. The

first study used interfaces that differed only in the query input. Three modalities were

used: form fill-in, range sliders, and timeboxes. In an attempt to study presentation of

results as well as query input, the second study replaced the range slider interface with

a second form fill-in interface that used a tabular spreadsheet to present query results,

instead of a graphical display.

Although further study involving additional cases (for example, the timebox inter-

face with textual output) might be needed to draw stronger conclusions, the results of

these studies would seem to imply that the query result output is not a major factor

in the differences in performance. In the second study, performance on the two form

fill-in interfaces was virtually indistinguishable (mean = 58.52, σ = 35.53 for the tabular

output, mean = 59.07, σ = 35.48 for the visual output). Thus, the tabular feedback did not

lead to any measurable performance penalty, despite several complaints from partici-

pants who found it difficult to use.

It seems more likely that the differences between these two studies are a result of

differences in the tasks. The first study found that form fill-in interfaces were superior

for completely-specified tasks. In an attempt to approximate the use of timeboxes for

open-ended data analysis, the tasks in the second study were more open-ended. This

may have helped performance with the timebox interface.

Closer examination of the tasks used in this study might explain the lack of sig-

nificant differences between the interfaces. These tasks do not include fully-defined

constraints that require items to be in a specific value range during a specific interval,

but they do include hard-coded values (“$25 range”, “$40 more”, “rise in price of $35”,

etc.). In some sense, these tasks might be seen as intermediate tasks, falling somewhere

between the fully-defined tasks used in the first study and fully exploratory tasks. The

difference in results between the two studies seems to support the conjecture that the

performance of timeboxes relative to form fill-in interfaces would continue to improve

as tasks move further from fully-defined towards exploratory. Of course, further study

will be needed to verify this hypothesis.

As the participants who were fastest with timeboxes were significantly faster, the second

study also seems to provide some preliminary evidence for possible performance

differences between individuals. Observation of some of the subjects provided some

clues as to some of the factors behind these performance differences. Specifically, as

participants were asked to read the question before starting the task, many took the

time to formulate search strategies. This proved particularly useful for form fill-in

interfaces, as planning helped reduce the exploration that was needed. Furthermore,

since this time was not included in the task performance time, this planning appeared

to reduce the search time.

This result raises the possibility that one of the benefits of the timebox interface

might be in reduced cognitive load. Specifically, timeboxes might help users reduce the

strategizing and planning needed to complete tasks. Further study - perhaps including

consideration of planning time - would be needed to investigate this hypothesis.

This study also provided further evidence for the importance of training. Study par-

ticipants generally fared better with the second and third interface than they did with

the first. Further study with more detailed training might clarify the differences be-

tween the three interfaces. However, it should be noted that the assumption of trained

users is acceptable in this case, as timeboxes (and TimeSearcher) are not designed for

use by novices.

The design of this study may have influenced results in a manner that limits the

generality of the results. In this study, users were presented with the three interfaces

in succession, with all tasks from one interface being completed before moving on

to the next interface. This may have reduced cognitive load, but it also might have

contributed to the order effects, as participants’ growing familiarity with the tasks and

strategies might have helped them improve their performance.

The repetition of the tasks across interfaces might have had an impact on the re-

sults. This repetition was intended to ease comparison of performance across the three

interfaces, but it may also have added to the order effects. Even though the data sets

were different, users frequently generated strategies that helped them more effectively

complete the tasks on the second or third try. Further study involving interleaving of

the order of interface presentation and the use of different interfaces would be interesting.

A more subtle aspect of the study design appears to have had an additional impact.

As described above, the data sets used in this study involved randomly-generated data

that was hand-crafted to ensure that each data set contained answers to each of the tasks.

Because of the nature of the tasks, these modifications took on a certain predictable

character. Specifically, the first task looked for items that had low prices during the

first five time periods and high prices during the last five. Many of the items that

matched that task also matched the second task, which required items to stay in a $25

range for four time periods and then to have a rise of $35 at some later point.

This commonality interacted with a trend in user strategies. Many of the study participants

answered the second question by starting at the lower-left corner of the query

space - for example, looking at the range of $10-$35 for the first four time periods.

The presence of items that matched the first task made this an effective strategy

for quickly completing the second task. Repetition of this study with tasks and data

sets that were carefully constructed to avoid these overlaps between tasks and user

strategies might increase confidence in the results.

This study used a modest-sized data set with 100 items and 13 time points. It

seems likely that as the number of items or number of time points increases, the users

of a tabular data display will have a much harder time performing as well as users of

the graphic overview.

7.3 Conclusion & Future Steps

Although these studies may fall short of providing strong empirical support for the

utility of timeboxes, they have provided some valuable insights into the design of in-

terfaces for exploration of time series data.

The most obvious need identified was for improved mechanisms for specifying

precise values for timebox ranges. Small timeboxes (and range sliders) can be

hard to manipulate due to the narrow range of pixels that must be precisely selected

(Figure 7.9). Augmenting the TimeSearcher interface with tools that would overcome

these difficulties might help users avoid troubles with narrow ranges. The precise nature

of such facilities might require some design or further evaluation. As discussed

above, one possibility might be to implement some notion of “gravity” that would

attract the mouse pointer to appropriate handles, thereby easing selection and manipu-

lation. Another approach is the provision of alternative input mechanisms, such as the

text-entry and range sliders already included in TimeSearcher (Chapter 4). The results

from these studies clearly validate the decisions to include these facilities.

Other design suggestions that arose out of observations made during these stud-

ies have already been implemented in TimeSearcher. For example, horizontal-only

movement in order to avoid vertical drift (Section 7.1.7) is now supported. Other possibilities,

including the ability to temporarily disable query clauses and to vertically

align boxes (Appendix C.3), are interesting candidates for future work.

Further assessments aimed at exploring the performance of timeboxes on more

exploratory tasks might provide further insights while possibly providing a clearer

demonstration of the utility of timeboxes. For example, the somewhat exploratory

tasks used in the second study might be replaced with more open-ended questions:

users might be asked to identify items that show a large increase in value after

staying relatively steady for some amount of time. These tasks may present difficulties

in evaluating completion and correctness. For example, how would “relatively steady”,

or a “large increase in value” be defined? Clear definitions of tasks and criteria for

judging accuracy of task completion might not be easily specifiable.

These empirical studies provided feedback that has been useful in clarifying un-

derstanding of the strengths and weaknesses of both the timebox query model and

the TimeSearcher tool. The results of the first study have been particularly useful in

this regard: by identifying situations where timeboxes do not perform well, this study

provided motivation for additional query tools like angular queries and variable-time

timeboxes that support the exploratory tasks that timeboxes are likely to be best suited

for. The second study reinforced this intuition, as the relative strength of timeboxes

seemed to improve as questions became more open-ended.

The results of these studies should be interpreted in the context of

the case studies (Chapter 8), which demonstrate the success of TimeSearcher in help-

ing motivated users address meaningful research problems. This feedback supports

the claim that timeboxes and the associated information visualization tools found in

TimeSearcher provide real value for users.

Finally, these studies provided some insight into the challenge of developing em-

pirical methods for evaluating exploratory interfaces. The unexpected results from the

first study were largely a result of the mismatch between the tasks that were chosen

and the strengths of the tool that was evaluated. The incomplete study suffered from

overly complex tasks that participants found hard to interpret. All three studies had

relatively novice users working with a tool that was designed for motivated domain

experts. Empirical evaluations that involve a combination of appropriate tasks and

users are clearly necessary for maximizing the utility of these studies. These results

should contribute to and act as a warning for the increasing number of researchers and

practitioners who pursue evaluation strategies for information visualizations.

Chapter 8

Applications

The design of TimeSearcher has been informed and validated by work with users. In

particular, colleagues in molecular biology have made extensive use of TimeSearcher

for examining time series and linear order data sets. This chapter provides an overview

of the biological applications of TimeSearcher.

TimeSearcher has also been used by researchers to examine climatological (partic-

ulate concentration), hydrological, and demographic (census mortality) data sets.

8.1 DNA Microarray Data Set Analysis

Recent advances in DNA microarray technology have provided geneticists with the

ability to examine expression levels of thousands of genes under varying circumstances

[55]. Numerous published reports of microarray data have used the examination of

changes in gene expression levels over time to examine the effects of various stimuli

on genetic expression.

Analyses of the microarray data generally are conducted via some sort of mathe-

matical grouping of genes with similar expression profiles. Clustering techniques that

have been used include hierarchical clustering [40, 46, 111], self-organizing maps

Figure 8.1: Red-green “heat map” display of gene expression levels at seven time points. Each

row is a gene sample, and each column is a time point. Bright green samples are

repressed genes, bright red are induced genes, and darker samples are close to the

average. Genes that are repressed (low expression levels) are shown at the top, and

induced genes (high expression levels) at the bottom [34].

[128, 143], and singular value decomposition [65]. Clustered expression profiles are

often displayed with 2D layouts that use coloring to display the expression levels of

each sample, with bright-green indicating relatively under-expressed genes and bright-

red indicating genes with relatively high levels (Figure 8.1). The Cluster and TreeView

programs are widely used for generation and viewing of clusters [46].

Heat maps are very useful for condensing significant amounts of information in a

display that helps highlight gross trends and similarities between clusters. However,

they generally suffer from the drawbacks of other static displays: interactive querying

and exploration are not supported. Other microarray analysis techniques, including

the use of spreadsheets and manual creation of histograms, suffer from a similar lack of

interactivity.

Figure 8.2: The Hierarchical Clustering Explorer. Dendrogram clusters and filters for

detail and similarity are shown in the top window, with a detailed display of a subset

shown below. A scatterplot on the right is used for pairwise comparison between two

of the experimental conditions [111].

The Hierarchical Clustering Explorer [111] (Figure 8.2) addresses many of these

problems by combining filters for minimum similarity and detail display with alterna-

tive displays showing pairwise similarities between expression profiles and the ability

to compare clusters computed from different algorithms.

TimeSearcher’s dynamic query tools are well-suited for expressing queries aimed

at identifying genes with particular expression profiles. Two ongoing collaborations

have explored these possibilities.

8.1.1 Programmed Cell Death in Drosophila melanogaster

As organisms develop, new cells are created and old cells that are no longer needed are

destroyed and eliminated. For example, when a tadpole becomes a frog, the tail and in-

testine (among other structures) are no longer needed, and are therefore destroyed. The

process of controlled destruction and elimination of cells is known as programmed cell

death (PCD). Programmed cell death is of interest to biologists for a variety of reasons.

As a genetically-controlled process, PCD involves complex interactions between

many genes. Furthermore, the absence of cell death may be related to the uncontrolled

proliferation of cells associated with cancerous tumors.

Studies of cell death in flies, worms, humans, and other organisms have identified

a variety of genes that are involved in the control of PCD. Furthermore, many of

the genes involved in this process appear to be similar in these organisms - in other

words, the relevant genes have been conserved [12]. However, the processes involved

in PCD are not completely understood. In particular, the exact genes that are required,

and the sequences of expression of these genes, are uncharacterized for many types of

cell death.

Eric Baehrecke’s lab at the University of Maryland Biotechnology Institute,

Center for Biosystems Research, studies programmed cell death in Drosophila

melanogaster - the common fruit fly. In Drosophila, the transition between larva and

pupa involves destruction of larval cells that are no longer needed, along with differ-

entiation of cells that will be used in the future adult [12]. Studies of changes in gene

expression levels in cells that die during these processes are useful for understanding

the genes that play a role in PCD. In particular, larval salivary gland and midgut cells

have proven to be a fruitful area for investigation.

The steroid hormone 20-hydroxyecdysone (ecdysone) plays a critical role in cell

death in Drosophila larvae. The presence of ecdysone at 10 hours after the onset of

metamorphosis leads to the expression of genes known to be involved in Drosophila

cell death, including reaper (rpr) and head involution defective (hid) [72]. More specif-

ically, the gene E93 plays a crucial role in this process. E93 is induced by ecdysone,

and appears to play a critical role in the expression of other cell death genes: mutations

in E93 have reduced levels of expression of cell death genes including rpr, hid and

others. Furthermore, E93 is expressed only during metamorphosis [84].

Microarray experiments have led to further insight into the genetic mechanisms un-

derlying cell death in Drosophila. These experiments involved RNA samples from flies

at 6 and 12 hours after the beginning of metamorphosis - the times of greatest changes

in gene transcript levels. Furthermore, this work contrasted the steroid-controlled

death of groups of cells that occurs during metamorphosis with the radiation-induced

death of individual cells - processes known as autophagy and apoptosis, respectively.

The microarray experiments from the samples involving steroid-induced cell death

involved 2,876 genes that were consistently found in each of three replicated trials. Of

these genes, 484 showed an increase of 5-fold or greater between the 6 and 12 hour

samples, and 448 showed a decrease of 5-fold or more. Known cell death genes rpr,

hid, dronc, and crq were among those that showed significant increases in transcrip-

tion.
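The fold-change criterion used in these counts can be sketched as a simple filter. The gene names and expression values below are invented for illustration, and the handling of non-positive values is an assumption; the published analysis may have applied additional normalization that is not described here.

```python
def fold_change_groups(levels_6h, levels_12h, threshold=5.0):
    """Split genes into those rising or falling by at least `threshold`-fold."""
    increased, decreased = [], []
    for gene, early in levels_6h.items():
        late = levels_12h.get(gene)
        # Skip genes missing at 12 hours or with non-positive levels (assumption).
        if late is None or early <= 0 or late <= 0:
            continue
        ratio = late / early
        if ratio >= threshold:
            increased.append(gene)
        elif ratio <= 1 / threshold:
            decreased.append(gene)
    return increased, decreased


# Hypothetical expression levels at 6 and 12 hours.
six_hours = {"geneA": 10.0, "geneB": 10.0, "geneC": 50.0}
twelve_hours = {"geneA": 60.0, "geneB": 20.0, "geneC": 5.0}
up, down = fold_change_groups(six_hours, twelve_hours)
```

Here geneA (6-fold increase) and geneC (10-fold decrease) pass the 5-fold threshold, while geneB (2-fold increase) does not.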

The samples involving radiation-induced cell death had 5,495 genes that were con-

sistently detected, most of which were at levels nearly equal to those found in the

steroid-induced data set. Only 22 genes increased more than 5-fold in the irradiated

flies (as compared to unirradiated controls), and 12 decreased more than 5-fold.

Comparison of the genes that were induced following radiation with those induced by

steroid revealed that rpr was the only known cell death gene appearing in both data

sets, but several other genes had increased levels of transcription in both data sets [83].

The use of only two time points - 6 and 12 hours after the onset of metamorphosis

- limits the explanatory power of these data sets. Specifically, fluctuations of gene ex-

pression levels between 6 and 12 hours might provide additional insight into regulatory

interactions between genes. This possibility has been addressed by a second microar-

ray experiment, involving 5 time points - 6, 8, 10, 12, and 14 hours after the onset of

metamorphosis. For steroid-induced cell death, these experiments yielded 3,225 genes

that were consistently detected.

Analysis of this data with TimeSearcher has been the focus of ongoing collabora-

tion with the Baehrecke lab.

Sample Analysis Sessions

Direct user observation can provide valuable insights into the strengths and shortcom-

ings of a tool for performing a particular task. During the course of the ongoing col-

laboration with the Baehrecke lab, there have been several such sessions. The follow-

ing discussion is a composite of observations from multiple sessions in October and

November 2002.

Analysis of this data set with TimeSearcher might begin with a search for genes

that increase in expression level at each time point. Starting points - i.e., the values

from which genes might increase - are chosen heuristically, as is the magnitude of the

change.
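A search of this kind corresponds to a conjunction of timeboxes: one box constraining the starting value and a second box constraining the value at the next time point. The following minimal sketch applies the timebox definition (all values within y1..y2 for every time in x1..x2) to invented profiles. The data, the index-based time axis (indices 0 to 4 standing for 6, 8, 10, 12, and 14 hours), and the function names are assumptions, and the sketch does not reflect TimeSearcher's actual implementation.

```python
def matches_timebox(series, x1, x2, y1, y2):
    """True if y1 <= value <= y2 at every time index from x1 through x2."""
    return all(y1 <= series[t] <= y2 for t in range(x1, x2 + 1))


def query(data, timeboxes):
    """Return the items that satisfy every timebox (conjunctive query)."""
    return [
        name
        for name, series in data.items()
        if all(matches_timebox(series, *box) for box in timeboxes)
    ]


# Hypothetical expression profiles at 6, 8, 10, 12, and 14 hours.
profiles = {
    "g1": [1.0, 4.0, 5.0, 6.0, 7.0],
    "g2": [1.0, 1.0, 1.0, 1.0, 1.0],
    "g3": [5.0, 5.0, 5.0, 5.0, 5.0],
}
# Low at 6 hours (index 0), then noticeably higher at 8 hours (index 1).
rising = query(profiles, [(0, 0, 0.0, 2.0), (1, 1, 3.0, 6.0)])
```

Only g1 satisfies both boxes; moving either box up or down corresponds to the heuristic choice of starting point and magnitude described above.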

In general, the direction and magnitude of change is more interesting than the exact

values involved. These sessions were conducted before variable-time timeboxes and

angular queries were supported in TimeSearcher, but it seems likely that these exten-

sions to the query model and other tools that supported relative querying (Chapter 9)

might be helpful for this task.

These preliminary searches are useful for identifying genes that are expected

and for confirming understanding with respect to prior results. These “sanity checks” can

help the user build confidence in the data set. Several unexpected genes with high

expression levels were found. Examination of these results indicated that they were

generally ribosomal and anti-microbial genes that are consistently present in cells. Al-

though important to proper cell functioning, such genes are not particularly interesting

in the current context.

Continuing examination of transitions from 6 hours to 8 hours identified the gene timp,

which is correlated with mmp1, a known cancer inhibitor. This raises the possibil-

ity that timp might be an interesting gene in the context of cell death.

Further examination of transitions from 8 hours to 10 hours yielded additional

insights, including the identification of eif45 as a potentially interesting gene. Use of

the leaders and laggards functionality identified the Wrinkle gene at times 12 and

14 as potentially interesting, along with the more global observation that the number

of genes that followed a pattern of rises from 8-10 hours was greater at 10-12 hours

than at 12-14 hours.

An alternative approach to using TimeSearcher to analyze this data set starts with

the observation that the E93 gene plays a critical role in cell death in Drosophila [84].

We might hypothesize that genes that have profiles similar to E93’s profile might also be

involved in cell death. Specifically, genes that show increases in expression level after

E93’s expression increases might be regulated by E93 - in other words, E93 might be

a factor that contributes to the expression of these genes.

The use of TimeSearcher to explore this line of investigation begins with the use of

the text search box to find the E93 sample in the database¹. The drag-and-drop query-

by-example tool is then used to create a query identifying those items that are similar

to E93’s profile. As the 10 and 12-hour measurements - corresponding to the interval

before the second peak in ecdysone levels [11] - are most interesting, the boxes for the

6, 8, and 14 hour samples are eliminated.

The resulting set contains over 1200 of the 3225 genes in the original sample - far

too many to be of immediate interest. To filter further, the timebox for the 12-hour

time point is adjusted to include only those genes with a more pronounced increase in

expression level - the box is moved up to remove lower values, and expanded to include

a higher range of values. The resulting data contains fewer than 300 genes - a substantial

reduction in size (Figure 8.3). These results are saved as potentially interesting.

The leaders and laggards facility was then used to identify genes that have this

same increase in expression level at a later time point - specifically, between 12 and

14 hours. This leads to a set of under 100 laggards - genes that might be regulated by

E93. This set was also saved as being of interest.
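The leaders and laggards step described above can be sketched in code. The following is a minimal illustration, not TimeSearcher's implementation: it assumes that a laggard is an item that satisfies the same set of timeboxes after they have been shifted later by a fixed number of time points, and that the five samples (6, 8, 10, 12, and 14 hours) are stored at indices 0 through 4. The names `Timebox`, `matches`, and `laggards`, and the gene names used with them, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# A timebox is (t_start, t_end, v_low, v_high); time indices are inclusive.
Timebox = Tuple[int, int, float, float]


def matches(series: Sequence[float], boxes: List[Timebox], shift: int = 0) -> bool:
    """True if the series stays inside every timebox after the boxes
    have been moved `shift` time points later."""
    for t0, t1, lo, hi in boxes:
        a, b = t0 + shift, t1 + shift
        if a < 0 or b >= len(series):
            return False  # the shifted box falls outside the measured interval
        if not all(lo <= series[t] <= hi for t in range(a, b + 1)):
            return False
    return True


def laggards(data: Dict[str, Sequence[float]], boxes: List[Timebox], lag: int) -> List[str]:
    """Names of items that match the query when it is shifted later by `lag` points."""
    return sorted(name for name, s in data.items() if matches(s, boxes, shift=lag))


# Hypothetical expression profiles at 6, 8, 10, 12, and 14 hours.
data = {
    "geneA": [0.5, 0.5, 0.5, 3.0, 3.0],  # rises between 10 and 12 hours
    "geneB": [0.5, 0.5, 0.5, 0.8, 2.5],  # same rise, one time point later
}
# Paired timeboxes: low at 10 hours (index 2), high at 12 hours (index 3).
query = [(2, 2, 0.0, 1.0), (3, 3, 2.0, 4.0)]
print(laggards(data, query, lag=1))
```

In this sketch, geneA matches the original query and geneB matches only the shifted query, so geneB is reported as a laggard.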

Alternative approaches included shifting the paired timeboxes to look at genes with

increases in transcription between 8 and 10 hours, as genes that have earlier increases

in expression level might influence the expression of E93. Decreases in expression

level, which are also potentially interesting, can be examined by recreating the modified

query-by-example described above (Figure 8.3), and then using the query inversion

facility (Section 4.4) to find genes that have a decrease of similar value at the same

time point (Figure 8.4). This query can then be used as the basis for a leaders and lag-

gards query that would identify genes with similar decreases later in the time series.

¹ In this data set, E93 is known by the alternative name EIP93f.

Figure 8.3: TimeSearcher query display identifying genes that are roughly similar to

E93 at 10 and 12 hours. This query contains two timeboxes, based on the values

of E93 at 10 and 12 hours. The 12 hour timebox has been shifted up, to eliminate

smaller increases in expression levels. This timebox has also been increased in height,

in order to include some very sharp increases in expression level that might not have

been included in the original timebox.

Figure 8.4: TimeSearcher query identifying genes that decrease significantly between

10 and 12 hours, when E93 is increasing.

It is important to note that these queries are useful primarily as generators of

hypotheses. Although suggestive, the temporal relationships identified with Time-

Searcher are not sufficient to establish any direct linkages between the genes involved

in these queries. However, these results may be useful for identifying genes (or sets

of genes) that merit further experimental analysis, which might identify regulatory

relationships.

Observations

Researchers in the Baehrecke lab were extremely enthusiastic about the use of Time-

Searcher for analysis of their microarray data:

TimeSearcher gives us the ability to see a large amount of time series data

and rapidly query for patterns based on known mechanism. This makes it

a valuable tool for the generation of new hypotheses. We haven’t found

any other software that gives us similar capabilities to observe overviews

of temporal data, and query for characteristics based on the knowledge of

a biological system [13].

The use of TimeSearcher for the analysis of the cell-death data played an important

role in the identification of novel results [35].

These analysis sessions also provided some insight into various facets of one pos-

sible use of the TimeSearcher tool. One of the most striking observations involved the

interpretation of query results. The item list on the right-hand side of the TimeSearcher

window was seen as being much more important than the graphical displays of each

of the items in the lower-left-hand window. As this data set contains many genes that

the biologist using the tool knew by name, the item list provides a concise display that

can be easily scanned for familiar names.

To some extent, this behavior might be a result of the type of analysis being done

with this data. Specifically, this analysis was conducted by a biologist who brought a

significant amount of understanding and focus to the task. Out of the 3225 genes in the

original data set, the user had preconceived notions of which genes - roughly 100 in

number - were potentially interesting. Thus, the item list was a powerful tool for

identifying which of those genes were present in any given result set. Of course, the

user’s prior knowledge may bias him or her away from potentially interesting genes

that fall outside of those existing notions of what might be interesting, but this problem

is likely to exist with any visualization tool.

Most of the query modifications made by this user were made directly with the

mouse. There was relatively little use of the keyboard and the range sliders, although

arrow keys were used to change the time periods covered by the boxes.

Finally, the 100-item threshold for displaying individual graph overview lines (as

opposed to only the query envelope (Section 4.1)), was seen as being too low, suggest-

ing that the default might be raised to a higher value.

These sessions involved one individual user focusing on a single data set. As a

result, these comments may not generalize to others. Further observation of a wider

range of users would be necessary before any generally applicable conclusions could

be drawn.

Contributions and Design Suggestions

The Baehrecke lab’s participation in the development of TimeSearcher involved nu-

merous design discussions, many of which occurred before the data set described

above was collected. Together with the analysis sessions described above, these discus-

sions generated several ideas for TimeSearcher functionality. Some of these features

have been implemented, while others present interesting possibilities for future work.

Currently-implemented features that resulted at least partially from these discus-

sions include leaders & laggards (Section 4.2), support for multiple time-varying at-

tributes (Section 4.3), and query inversion (Section 4.4). Leaders & laggards was sug-

gested early in the discussions as potentially useful for identifying regulatory relation-

ships, as described above. Support for multiple time-varying attributes was proposed

as useful for simultaneous display and querying of data collected under two different

conditions: naturally occurring (“wild-type”) flies and mutated flies. Query inversion

was proposed as a tool for identifying transitions that were contrary to previously iden-

tified trends of interest. All of these features were deemed to be of sufficiently general

interest to merit inclusion in TimeSearcher.

One issue that arose regarding query expressiveness involves adding constraints

to existing queries that required that all items that matched must have values that are

non-decreasing (or non-increasing) during some specified interval. This suggestion

arose after construction of a query that had two adjacent boxes with some vertical

overlap. Although the boxes suggested a general rise in value, the overlap allowed

some items that actually decreased in value to be included in the result set (Figure 8.5).

Facilities for requiring values to be non-decreasing or non-increasing could be used

to eliminate such values, without requiring restatement of the relative constraints of

the two query boxes. These facilities would be similar to the interval trending query

facilities discussed in Section 9.1.5. This observation was part of the motivation for

the eventual implementation of angular queries (Section 4.7).
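The overlap problem described above can be made concrete with a short sketch. It assumes a timebox is an inclusive range of time indices paired with an inclusive range of values, and it treats the proposed monotonicity requirement as a separate predicate that is combined with the timebox tests. The function names and the sample profile are hypothetical and are not taken from TimeSearcher.

```python
from typing import Sequence


def in_timebox(series: Sequence[float], t0: int, t1: int,
               lo: float, hi: float) -> bool:
    """True if every value at indices t0..t1 (inclusive) lies in [lo, hi]."""
    return all(lo <= series[t] <= hi for t in range(t0, t1 + 1))


def non_decreasing(series: Sequence[float], t0: int, t1: int) -> bool:
    """True if the series never drops between consecutive points in t0..t1."""
    return all(series[t] <= series[t + 1] for t in range(t0, t1))


# Hypothetical profile sampled at 6, 8, 10, 12, and 14 hours (indices 0..4).
profile = [1.0, 1.5, 2.8, 2.2, 2.0]

# Two adjacent boxes with vertical overlap: 10 hours in [1, 3], 12 hours in [2, 4].
in_both_boxes = (in_timebox(profile, 2, 2, 1.0, 3.0)
                 and in_timebox(profile, 3, 3, 2.0, 4.0))

# The additional constraint rejects the profile because it falls from 2.8 to 2.2.
keeps_rising = non_decreasing(profile, 2, 3)
print(in_both_boxes, keeps_rising)
```

The profile satisfies both boxes even though it decreases between 10 and 12 hours; adding the monotonicity predicate removes it without changing either box.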

These analysis sessions also provided motivation for the inclusion of support for

multiple time-varying attributes, including extensions that go beyond the facilities pro-

vided in TimeSearcher. The current implementation supports multiple tabs in a tabbed

Figure 8.5: A query illustrating the need for additional constraints requiring non-

increasing (or non-decreasing) values over a specified interval. Although the general

trend of the two timeboxes is upwards, the highlighted item actually has a decrease in

value between 10 and 12 hours. Additional constraints requiring non-decreasing items

would remove this item from the result set.

window, each displaying a different attribute (Section 4.3). For comparisons between

pairs of attributes, an alternative presentation might involve displaying the differences

between values at each point in time. Although this can easily be achieved through

appropriate pre-processing of the data set, integrated features for automatically con-

structing this comparison would be easier to use.

These analysis sessions also identified a need for support for using a query to

remove items from further consideration. Like

many other microarray experiments, the programmed cell death data contained profiles

from numerous genes that are necessary for other phenomena that might not be of

interest to the current inquiry. In some cases, it might be helpful for the user to flag

these items as being uninteresting, thus removing them from further consideration.

One approach to supporting this functionality might be to provide an additional control

- perhaps through a “trash can” icon on a button - that would identify items that match

the currently active query and remove them from further consideration. In essence,

such a query would support a form of negation, finding all items that fail to match the

query.

Although TimeSearcher currently provides users with the ability to save result sets

and to save and reuse specific queries, these features are somewhat limited. The anal-

ysis sessions with the Drosophila programmed cell death data set provided a clear il-

lustration of the need for tools that provide greater support for working with the result

set of a given query. For example, annotation support might be provided to somehow

mark genes that are of interest, perhaps because they have been seen elsewhere.

Further extensions along these lines involve customized displays of the item list

and result set, based on additional metadata regarding the items in the data set. For

example, the Gene Ontology (GO) is a vocabulary that provides hierarchical group-

ings of genes according to various functional and structural criteria [131]. Similarly,

FlyBase (http://www.flybase.org) is a database containing detailed information about

the annotated Drosophila genome [130]. GO and FlyBase both contain rich metadata

that might be used to augment TimeSearcher displays. For example, categorical data

such as grouping in the GO, or the existence of known comparable genes (homologs)

in other organisms, might be indicated via color coding of items in the result list, or

via special glyphs on displays of the individual graphs.

Output from other forms of analysis might be used to customize the display of

individual items. As described above, microarray analyses often include mathematical

clustering of similar genes. For results involving a small number of clusters, each

item in a given cluster might be displayed in the same color, with saturation of each

item’s display indicating the distance between that item and the center of the cluster

that contains it.

Other possible improvements to display of query results might involve customiz-

able ordering of items as displayed in the result list, or in the display of individual

graphs. Currently, these displays are provided in a set order based on the order in

which they are found in the data set. An obvious extension would be to support sorting

of the item list based on some alphabetic or lexicographic criteria, but other, potentially

more interesting approaches are possible. For example, a “maximum differential” or-

dering might sort the items in a data set in decreasing order of the differential between

their highest value and their lowest value. This would place items with the greatest

change at the top of the list, and the items that changed the least (and are therefore

possibly less interesting) at the bottom of the list.
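As a sketch of the “maximum differential” ordering, the following assumes that each item is a named list of values and that the differential is simply the highest value minus the lowest value. The function name and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence


def max_differential_order(data: Dict[str, Sequence[float]]) -> List[str]:
    """Item names sorted by (highest value - lowest value), largest first."""
    return sorted(data,
                  key=lambda name: max(data[name]) - min(data[name]),
                  reverse=True)


# Hypothetical profiles: the item with the greatest change is listed first.
profiles = {
    "flat": [1.0, 1.0, 1.0],
    "steep": [0.0, 5.0, 2.0],
    "mild": [1.0, 2.0, 1.0],
}
print(max_differential_order(profiles))
```

Because only the sort key changes, other orderings (alphabetic, or by value at a chosen time point) could be supported by substituting a different key function.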

Revised displays for leaders & laggards queries were also suggested. For example,

the current color-coding of the item list might be replaced with a two column display,

with leaders in one column and laggards in the other.

Finally, analysis of the cell death data set highlighted the potential utility of in-

tegrating TimeSearcher with other visualization tools. As coordination of multiple

visualizations has been shown to decrease task performance time and increase user

satisfaction [95], the use of TimeSearcher in conjunction with other visualization tools

might improve comprehension and utility of the visualizations. For example, an on-

going effort with the Baehrecke lab has investigated the possibility of displaying Gene

Ontology information in a hierarchical treemap display [10, 116].

A coordinated visualization might use a treemap display to highlight the genes that

matched a particular query. This would provide an immediate graphical perspective on

the similarities between genes in the result set: the presence of multiple similar genes

would lead to a tight cluster of highlighted genes. However, if the result set of a given

query contained highly dissimilar genes, the highlights would be scattered throughout

the treemap.

On a similar note, TimeSearcher might be equipped with hooks to appropriate web

sites. For example, each gene in the data set might be tied to FlyBase, allowing the

user to retrieve a complete FlyBase entry by simply clicking on the name of a gene in

the result set.

Although the details of some of these suggestions may be specific to analysis of

microarray data, the general ideas of increased flexibility in query expression; display

and manipulation of results; and integration with other tools are easily generalized to

other application domains.

8.1.2 Viral Life Cycle in Epithelial Cells

Karen Duca’s lab at the Virginia Bioinformatics Institute is in the preliminary stages

of using TimeSearcher to analyze data from experiments involving genetic responses to

influenza virus in epithelial cells. Feedback from this work has been provided in the form

of responses to a questionnaire, rather than from direct observation of an analysis

session, as was done in the programmed cell death work with the Baehrecke lab.

In analyzing the influenza data sets, the Duca lab has found TimeSearcher particularly

useful for seeing an overview of the data set, especially for identification of

genes that have expression profiles that are atypical.

After normalization, most genes are unchanged across the whole time

course. TimeSearcher allows one to very quickly pick out what deviates

from “unchanged” behavior. We had already found our markers, but it

took weeks (even months to do it effectively) with histograms and Clus-

ter. With TimeSearcher, had it been available then, we would have had

the information in one day. The changes are quite subtle, really, but Time-

Searcher gets you there faster than k-means clustering, which was our best

technique up to then [42].

Ongoing collaboration with the Duca lab has focused on domain-specific exten-

sions to TimeSearcher that would increase its utility for microarray analysis. Specifi-

cally, microarray analyses often use statistical comparisons between and among items

in the data set. Similarly, microarray data sets often contain controls that are used for

data validation and are not interesting for analytic purposes. Analyses of these data

sets would be simplified if TimeSearcher could be configured to ignore appropriately-

labeled controls.

Alternative approaches to handling control items might be of general interest in

other domains. As discussed above, analysis sessions with members of the Baehrecke

lab led to the suggestion of support for the ability to remove items that matched a query.

This functionality could certainly be used to handle control data: a set of timeboxes

describing the control - perhaps created by a drag-and-drop query - would define those

items that should be removed from further consideration.

Another possible enhancement is motivated by the observation that control items

are likely to be those that have little change throughout the time series. Sliders that

filtered items based on relative change levels could be used to eliminate these items

from the data set. These sliders would allow users to specify a minimum (or maximum)

percentage change over the course of the time series that must be met for inclusion in the result

set. These differences might be specified in terms of z-score deviations from the mean.

Thus, this slider could be used to require that items must have changes of at least 1/2

and less than three standard deviations from the mean in order to be retained in the

data set.

Handling of microarray data points that involve experimental error is another area

of interest. Microarray data sets are often very noisy and multiple repetitions of a

given experiment are usually conducted (and then averaged) to generate a reliable data

set. Despite these repetitions, these data sets are plagued by experimental errors and

missing values.

In some cases, values that exceed certain thresholds might indicate experimental

error. A variety of approaches to handling such errors might be implemented in

TimeSearcher. Items that exceeded these thresholds (at any time point) might be

eliminated via a double-thumb range slider. Of course, this slider

would be equivalent to a timebox that spanned all of the time points in the data set,

but it would present less visual clutter and would perhaps be easier to use. Alterna-

tively, TimeSearcher might use color-coding to indicate values with lower reliability,

or provide an interpolated value in place of the value that indicates experimental er-

ror. Handling of missing values in microarray data sets is an active area of ongoing

research: techniques involving the use of cubic splines to interpolate values [14], or

dynamic programming to compare microarray time series despite missing values [1]

might be of use in this context. Alternatively, TimeSearcher might be augmented with

displays and query semantics that appropriately account for missing values.

Support for multiple time-varying attributes is also seen as important for analysis

of the microarray data sets in Duca’s lab. Proteomics experiments that parallel the

microarray experiments will provide data regarding the proteins present in the cell.

Support for simultaneous exploration of both microarray and proteomics experiments

is seen as potentially very useful, particularly in conjunction with leaders and laggards

functionality.

Discussions with Duca’s lab also led to the particularly intriguing challenge of

extending leaders and laggards functionality to help users explore the possibility of

multiple items acting as leaders that influence others. For example, a microarray data

set might contain several genes - A, B, C, D, and E - all of which regulate the

transcription of gene G. Existing leaders and laggards displays might be used to see these

influences individually - for example, to identify A as a factor that might influence

G. However, this raises the possibility of misinterpretation, as users might stop at that

point and not identify B, C, D, and E as being relevant.

Finally, several features commonly found in production software were suggested

as being potentially useful. For example, cut and paste facilities for exporting query

results directly to other analysis tools were seen as being desirable.

8.2 Nucleotide Sequence Data

The interpretation of timeboxes and TimeSearcher as tools for querying time series

data sets places an unnecessary limitation on the applicability of these ideas. In fact,

there is nothing in the timebox model, or in the TimeSearcher application, that is

restricted specifically to time series data. The only requirement is that data sets involve

measurements taken at discrete intervals along some linearly-ordered dimension. The

original motivation for this work involved the use of time as the dimension in question,

but others are possible.

For example, data sets containing real measurements at discrete physical positions

along a one-dimensional line are appropriate for TimeSearcher. In this case, each

physical position on the line corresponds to a “time” point in the TimeSearcher display.

Queries could be used to identify items that had values in a given range during certain

intervals on this line.

Nucleotide sequences provide a particularly interesting example of the application

of TimeSearcher to linear dimensions other than time. Specifically, short sequences

of nucleotides (A, G, C, and T) can be aligned and statistics regarding the frequencies

at which different patterns appear in different positions in these sequences can

be calculated. TimeSearcher can then be used to find patterns that have desired fre-

quency profiles - perhaps occurring frequently in some areas and infrequently in oth-

ers. These patterns might help identify DNA subsequences that influence the process

of converting DNA into RNA and then into protein. The use of TimeSearcher for this

purpose complements existing statistical approaches towards identification of these

subsequences [48, 92, 97, 137].

8.2.1 Branch Site Consensus Splicing Signal in Arabidopsis thaliana

The creation of protein from DNA is essentially a three-step process. During tran-

scription, the sequence of a strand of DNA is copied into a complementary strand of

pre-mRNA. During the second phase - splicing - regions that are not converted directly

into protein (the introns) are removed from the pre-mRNA, leaving only the exons -

the regions that will be converted into protein. The output of this process is a strand

of mRNA, which is exported from the nucleus and translated into protein, during the

third step - translation (Figure 8.6).

Splicing involves the removal of the introns from a strand of pre-mRNA. The

boundaries where this splicing occurs are known as splice sites. As an intermedi-

ate step in the splicing process, a portion of the intron is looped around itself, forming

a “lariat” structure. This looping occurs at the branch site - a location that is roughly

30 nucleotides from one end of the intron (Figure 8.7).

[Figure 8.6 diagram. Panels: Transcription (DNA to pre-mRNA, with alternating exons and introns); Splicing (pre-mRNA to mRNA, exons only); Translation (export from nucleus and creation of protein).]

Figure 8.6: The three main stages in the creation of protein from DNA. During tran-

scription, the strand of DNA is copied. During splicing, the introns are removed, leav-

ing only the exons. The output of splicing is a strand of mRNA. During translation,

the mRNA is exported from the nucleus and used to create a protein.

Characterization of these sites in the splicing process is an important step in in-

terpreting the contents of the genome. Reliable and consistent identification of splice

sites and branch sites can be useful for determining the function of a given sequence

of DNA. Specifically, if a sequence has a splice site on either end and a branch site in

the appropriate position in-between, it is likely an intron.

Identification of splice sites is straightforward, as the sequences found at splice

[Figure 8.7 diagram. Labels: exon, intron, exon; Splice Sites; Branch Site.]

Figure 8.7: Splice sites and branch sites.

sites are well-defined and generally invariant across organisms. The sequences surrounding

branch sites are more variable [140]. For organisms containing variations

in branch site sequences, consensus patterns describing the range of possibilities have

been developed.

Stephen Mount of the University of Maryland Department of Cell Biology and

Molecular Biology has been using TimeSearcher to identify consensus branch site

splicing signals in the plant Arabidopsis thaliana. The data set being used for this

purpose was generated from the genomic sequences surrounding 8550 internal exons

that were internally truncated and aligned with respect to their boundaries [109]. This

data set contains the normalized frequencies of each of the 1024 possible pentamers

- sequences of five nucleotides - at each of 192 possible positions.
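The structure of such a data set can be sketched as follows. There are 4^5 = 1024 pentamers over the alphabet A, C, G, and T, and each pentamer receives one profile holding its frequency at every position of the alignment. This is a minimal illustration under stated assumptions: all aligned sequences have equal length, the frequency is the plain fraction of sequences in which the pentamer starts at that position, and the additional normalization applied to the actual data set (which is not specified here) is omitted. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def all_kmers(k: int = 5, alphabet: str = "ACGT") -> List[str]:
    """Every possible sequence of length k over the alphabet (4**5 = 1024 pentamers)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def positional_frequencies(aligned: List[str], k: int = 5) -> Dict[str, List[float]]:
    """For each k-mer, the fraction of aligned sequences in which it
    starts at each position. Assumes equal-length sequences."""
    n_positions = len(aligned[0]) - k + 1
    profiles = {kmer: [0.0] * n_positions for kmer in all_kmers(k)}
    for pos in range(n_positions):
        counts = Counter(seq[pos:pos + k] for seq in aligned)
        for kmer, count in counts.items():
            if kmer in profiles:  # skip windows containing symbols such as N
                profiles[kmer][pos] = count / len(aligned)
    return profiles


# Two short hypothetical aligned sequences; the real alignment is far larger.
profiles = positional_frequencies(["ACGTAC", "ACGTAA"])
print(len(profiles), profiles["ACGTA"], profiles["CGTAC"])
```

Each resulting profile plays the role of one “time series” in TimeSearcher, with alignment position taking the place of time.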

The occurrence frequencies in this data set can be used to create queries that search

for items that match known characteristics of the sequences around branch points.

Figure 8.8: Data envelope overview of pentamer frequency distributions in Arabidop-

sis thaliana.

Specifically, branch points are found approximately 25-30 positions upstream of (before)

the end of an intron, and surrounded by sequences that are not commonly found in

exons. Thus, sequences that might include branch points can be found by searching

for pentamers that are frequently found 25-30 positions before the end of an intron and

infrequently found elsewhere in the intron. If we consider the position in the aligned

sequences as the linearly-ordered dimension of interest, these queries can easily be

created with TimeSearcher.

Figure 8.8 shows a data envelope overview of the whole data set. Two peaks, indi-

cating the boundaries between the exon in the middle and the introns on the ends, are

immediately apparent. These peaks represent well-known conservation of sequences

at splice sites.

To identify candidate splicing signals, a query using two timeboxes is used. One

Figure 8.9: Timebox query aimed at finding pentamers with higher frequencies at a

specific region within introns (the branch site) and lower frequencies elsewhere within

introns.

component of the query will identify those pentamers that are frequently found before

the exon-intron boundary. The second identifies pentamers that are infrequently found

elsewhere within the intron (Figure 8.9). Taken together, these criteria identify candidate

branch site consensus sequences [125].

These queries can be used to identify candidate branch points. Taken together with

domain knowledge of the expert user, and perhaps in combination with candidates

generated through statistical or algorithmic approaches [48], these results can be used

to extend known consensus sequences. In this case, the analysis was used to extend

the previously-identified consensus branch point sequence CTRAY (where “R” can be

either “A” or “G” and “Y” can be either “C” or “T”) [125], to WYTRAY (W= A or T,

Y=C or T, R=A or G).
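The ambiguity codes used in these consensus sequences can be expanded mechanically. The sketch below covers only the codes mentioned above (R, Y, and W) plus the four unambiguous nucleotides; the function names are hypothetical, and the regular-expression form is offered as one convenient representation rather than as part of the analysis described here.

```python
import re
from itertools import product
from typing import List

# Nucleotides represented by each code used in the text.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT"}


def consensus_to_regex(consensus: str) -> str:
    """Translate a consensus such as CTRAY into a regular expression."""
    parts = []
    for code in consensus:
        bases = CODES[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def expand(consensus: str) -> List[str]:
    """List every concrete sequence described by the consensus."""
    return ["".join(p) for p in product(*(CODES[c] for c in consensus))]


print(consensus_to_regex("CTRAY"), expand("CTRAY"))
print(consensus_to_regex("WYTRAY"), len(expand("WYTRAY")))
print(bool(re.fullmatch(consensus_to_regex("WYTRAY"), "ACTGAT")))
```

Under these definitions CTRAY describes four pentamers, while WYTRAY describes sixteen hexamers.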

8.2.2 Observations

The possibility of using TimeSearcher to “play around” with the data set was seen

as being extremely useful for this analysis. The interactivity supported exploration and

identification of patterns of interest that would not be possible with algorithmic approaches

or with prior practice - viewing data in a spreadsheet. Furthermore, overview

displays were useful for increasing confidence in the data set, as the data overviews

were consistent with the expected distribution of the profiles. In general, TimeSearcher

was found to be superior to existing data exploration tools:

I have been looking at sequence data of this sort for over 20 years and

find TimeSearcher to be the best data exploration tool I’ve encountered.

What I like about it is the ability to rapidly change your query and see

the results in order to converge quickly upon a query that is appropriately

selective [91].

Much of the investigation involved in identifying the candidate splice site se-

quences involved searching for items that were in differing normalized frequency

ranges at different times. Using the normalized view of the data, these searches tried to

identify items that were, for example, at least one standard deviation above the norm

during the exon.

Unlike the analysis of the Drosophila programmed cell death data sets, the inves-

tigation of this data set involved frequent use of the arrow keys, range sliders, and

editable text labels for value ranges to modify queries. As a result, there was relatively

little direct modification of timeboxes. This difference might be the result of some

inherent characteristics of the data sets, or it may just be an indication of the working

styles of the individuals involved. Further examination would be needed to understand

the factors influencing the users’ choice of interaction style.

8.2.3 Contributions and Design Suggestions

Features currently implemented in TimeSearcher that were suggested during the anal-

ysis of the Arabidopsis sequence data set include the item list window and the editable

labels for the range sliders. Additional suggestions for functionality that has not yet

been implemented focused on query specification and result set display and manipula-

tion.

Many of the queries created during analysis of the sequence data involved two (or

more) timeboxes covering value ranges that were contiguous: for example, one

timebox might contain items that ranged from one standard deviation below the mean

up to the mean, while another might contain items ranging from the mean to one

standard deviation above it. TimeSearcher does not currently support this sort of

query. The text entry fields associated

with the range sliders might be used to set the maximum value of one timebox to be

equal to the minimum value of another (or vice-versa), but this does not create an in-

variant that will be maintained as the boxes are modified. A tool that would somehow

link multiple boxes and constrain the relationship between their values might help sim-

plify the manipulation of this type of query. This suggestion was independently made

during one of the empirical studies of timebox queries (Section C.3).

The handling of query results arose several times during this analysis as an area in

need of improvement. The user noted that it was often difficult to tell how a result

set changed after the modification of a query. This led to the suggestion that the item

list and display of individual items be augmented with additional coding that would

help the user understand the impact of recent changes. For example, items that have

recently been added to the result set might be color coded with a bright color, while

items that have been removed might still be displayed, but with a “grayed-out” color

coding, to indicate their removal from the display. Varying shades of color and gray


might be used to provide a finer-grain display. Alternatively, separate lists of recently-

added and removed items might display this information.

Tools for manipulating result sets were also seen as being useful. For example,

the items in a result set might somehow be considered a cluster, for which aggregate

statistics might be calculated. For the nucleotide dataset, tools for identifying common

features of the items in a result set might be particularly useful. For example, such

a tool might tell the user that all of the items in the current result set have the same

nucleotide in the third position, while 80% have one of two nucleotides in the fourth

position. This might help simplify the identification of candidate sequences. The

aggregation of the items might also be added into the data set and considered a target

for querying along with existing individual items.

Other suggestions regarding manipulation of result sets addressed the comparison

of results from multiple queries and the synthesis of new understanding from these

comparisons. Tools for displaying the intersection (or difference) between multiple

result sets were suggested as potentially useful, as was improved support for saving

results and queries.

In some cases, the characteristics of the result sets that were most interesting (and

therefore most appropriate for saving) did not involve the actual items in the result set,

but simply the size of the result set. For example, one line of exploration involved

creation of a query that was 20 nucleotides wide. The user moved this query across

the data set, identifying the number of items that matched the query at each position,

and plotting a histogram by hand. Automated facilities for creating such plots within

TimeSearcher might be an interesting possibility for future development.
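
The hand-plotted histogram described above could be produced by sliding a fixed-width query across the data set and counting the matching items at each position. The following Python sketch is illustrative only: the function name `match_counts`, the list-of-lists data layout, and the sample values are assumptions, not part of TimeSearcher.

```python
def match_counts(items, width, v_min, v_max):
    """For each start position, count the items whose values stay within
    [v_min, v_max] at every one of `width` consecutive positions."""
    length = min(len(item) for item in items)
    counts = []
    for start in range(length - width + 1):
        n = sum(
            all(v_min <= item[p] <= v_max for p in range(start, start + width))
            for item in items
        )
        counts.append(n)
    return counts


# Hypothetical normalized frequencies for three short sequences.
items = [
    [0.2, 1.5, 1.2, 0.1, 1.1],
    [1.3, 1.4, 0.3, 1.6, 1.2],
    [1.1, 1.2, 1.3, 1.4, 0.0],
]
counts = match_counts(items, width=2, v_min=1.0, v_max=2.0)

# Crude text histogram: one row per query position.
for start, n in enumerate(counts):
    print(f"{start:2d} {'#' * n} ({n})")
```

For the exploration described above, `width` would be 20 and each item would be one sequence in the data set.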


8.3 Other Applications

Although these applications to biological data represent the most extensive uses of

TimeSearcher to date, interest from researchers working in other domains appears to

validate TimeSearcher and the timebox query model as being applicable to a range of

disciplines and data types. As of February 2003, users in fields including hydrology,

climatology, and finance have expressed interest in using TimeSearcher to analyze

their data sets.

TimeSearcher executables were made available for public download in October

2002. Between October 2002 and early February 2003, more than 150 unique users

downloaded TimeSearcher.

8.4 Conclusions

The ongoing, collaborative work with these users has proven to be valuable for the

design and evolution of both the theoretical and concrete aspects of this work. In using

TimeSearcher to address meaningful tasks, the users have demonstrated the efficacy of

the tool, and therefore of the underlying query model. These case studies have led to

several suggestions for useful functionality, including:

• Leaders & Laggards

• Support for multiple time-varying attributes

• Query inversion

• Queries for non-decreasing/non-increasing trends

• Removal of items that match a query


• Enhanced facilities for saving and manipulating result sets

• Customization of item list displays

• Linkages with external data sources

• Integration with other tools

• Vertical alignment of timeboxes

These case studies also proved useful in identifying areas of inquiry that might

have been less helpful or possibly distracting. Several observers not involved in these

case studies made various suggestions for features that they would have liked to have

seen included in TimeSearcher. Although many of these suggestions were intriguing,

development efforts were explicitly focused on user needs. As a result, suggestions

from the users involved in the case study were given high priority whenever possible.

This had the dual advantage of focusing efforts on features that were truly needed

while providing participants with the incentive to continue, in the form of evidence

that their concerns were being taken seriously.

These successful case studies would not have been possible without the participa-

tion of researchers who saw themselves as partners in the development and evolution

of TimeSearcher, rather than as mere users or customers. Regular meetings and open

feedback - in the case of the Baehrecke lab, periodic meetings over the course of more

than two years - were critical for the success of this effort. This work had many of the

elements of participatory design, even if formal methods associated with that approach

were not used.

The participation of the case study participants as research partners also meant that

they were willing to accept the shortcomings of a research prototype. For example,


they were generally willing to accept the explanation that a proposed feature was not

particularly interesting from a research viewpoint, even if it would have been useful to

them. Their understanding of the need to focus on the research issues was invaluable.

The observation sessions described above were particularly useful for building an

understanding of the research questions that TimeSearcher was being used to address.

The ongoing conversations conducted during these sessions were invaluable

for understanding the needs of the users, and for generating proposed designs aimed

at meeting those needs. Further sessions involving more in-depth analysis might have

provided additional insight. Complete immersion in the research efforts that moti-

vated the case studies - perhaps in the form of spending several months working in the

Baehrecke or Mount labs - might have proven useful as well.

This model of developing information visualization applications through close col-

laboration with motivated users is potentially generalizable to other efforts. A small,

committed set of users who understand the difference between their research needs

and the needs of the project will be necessary for this approach to succeed.


Chapter 9

Query Expressiveness

The basic timebox model supports a limited set of queries: all values of interest (start

time, end time, min value, and maximum value) must be specified exactly. Many

interesting queries require additional expressive power. The data mining literature

contains numerous examples of queries for patterns in time series that are independent

of exact time or values, scale, or other factors [4, 5, 19, 29, 49, 69, 76, 75, 78, 98, 104,

108, 150].

TimeSearcher contains some extensions to the basic query model. Disjunctive

queries (Section 3.1), Leaders & Laggards queries (Section 4.2), variable-time time-

boxes (Section 3.2), angular queries (Section 4.7) and query inversion (Section 4.4)

use a combination of interaction techniques and alternative query semantics to extend

the range of queries that can be expressed.

This chapter provides some examples of additional query possibilities, along with

a categorization and sketch of a formal model for extended query semantics.

Further gains in expressiveness might be achieved by extending the types of data

involved. Extensions aimed at adapting timeboxes to handle queries on categorical

data and more general temporal data could support a wide range of new tasks. These

possibilities are briefly sketched in Section 10.3.


[Figure 9.1 (image not reproduced). Horizontal axis: Query Expressiveness, from Low to High. Regions: Intra-Item and Inter-Item. Query types shown: Fixed Time/Interval (Q1), Variable Time/Value (Q2, Q3), Open-Ended Time/Value (Q4), Relative Time/Value (Q5, Q6), Interval Trends (Q7), Maximal Periods (Q8), Aggregate (Q9), Similarity (Q10), Global Constraint (Q11), Inter-Item (Q12), Prevailing Trends (Q13), More General (Q14).]

Figure 9.1: A schematic layout of the different types of example queries. Queries are

expressed in approximate order of increasing precision, from left to right. Aggregate

queries are modifiers that apply to queries within the shaded box, and maximal period

queries are modifiers that might apply to those within the unshaded box. Queries below

the dashed line are based on the characteristics of individual

items in the data set, while those above the line involve comparisons between items.

9.1 Example Queries

A series of example queries will illustrate the range of query formulations that might be

supported by an interactive system. This list is not intended to be exhaustive: queries

involving greater expressive power will be discussed below.

In the examples below, we assume that S is a set of m stock prices \(S_0, \ldots, S_{m-1}\),

over a set of time points \(1, \ldots, n\), and \(S_i(t)\) is the value of \(S_i\) at time \(t\). Queries are

expressed textually, with alternative presentations in a “pseudo-SQL” notation.

A preliminary schematic displaying a rough approximation of the relationships

between the types of queries is given in Figure 9.1.


9.1.1 Fixed-Time, Fixed-Value, and logical combinations thereof

Fixed-Time, fixed-value constraints involving a single set of times and values can be

expressed with a single timebox:

Query 1 Find stocks where the prices are between $10 and $20 during days 5-10

SELECT S_i from S WHERE
    $10 ≤ S_i(t) ≤ $20
    when 5 ≤ t ≤ 10;

As described above, TimeSearcher can also be used to create complex queries con-

sisting of conjunctions of multiple queries of this sort. Disjunctions between values

might also be expressed if appropriate support for grouping (parenthesization) is pro-

vided.
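
As a minimal sketch of the semantics of Query 1 (not TimeSearcher's actual implementation), the following Python function tests a single fixed-time, fixed-value timebox against one series. The function name `matches_timebox`, the list-based representation of a series, and the sample prices are all assumptions made for illustration; time points are treated as list indices.

```python
from typing import Sequence


def matches_timebox(series: Sequence[float], t_start: int, t_end: int,
                    v_min: float, v_max: float) -> bool:
    """True if every value at time points t_start..t_end (inclusive)
    lies in the closed range [v_min, v_max]."""
    return all(v_min <= series[t] <= v_max for t in range(t_start, t_end + 1))


# Hypothetical data: two short price series indexed by day.
stocks = {
    "AAA": [12, 14, 15, 13, 11, 12, 16, 18, 19, 17, 15, 14],
    "BBB": [12, 14, 15, 13, 11, 12, 16, 25, 19, 17, 15, 14],
}

# Query 1: prices between $10 and $20 during days 5-10.
result = [name for name, s in stocks.items() if matches_timebox(s, 5, 10, 10, 20)]
print(result)
```

A conjunction of several timeboxes then amounts to requiring that `matches_timebox` hold for every box in the query.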

9.1.2 Variable Time and/or Value

Basic timeboxes can be extended by allowing a window of variability in the allowable

times and/or values specified.

Query 2 Find an interval of 5 consecutive days during days 10-20 during which prices

ranged between $50 and $70.


SELECT S_i from S WHERE
    50 ≤ S_i(t) ≤ 70
    when t_i ≤ t ≤ t_i + 4
    AND
    t_i ≥ 10 and t_i + 4 ≤ 20


This class of query is currently implemented in TimeSearcher as a Variable Time

Timebox (Sections 3.2 and 4.6).
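
A query of this kind can be evaluated by trying each admissible start time in turn. The following is a rough sketch of that idea rather than a description of TimeSearcher's implementation; the function name `variable_time_starts`, its parameters, and the sample prices are hypothetical.

```python
def variable_time_starts(series, window, t_lo, t_hi, v_min, v_max):
    """Return every start time ts with t_lo <= ts and ts + window - 1 <= t_hi
    such that all values at ts..ts+window-1 lie in [v_min, v_max]."""
    starts = []
    for ts in range(t_lo, t_hi - window + 2):
        if all(v_min <= series[t] <= v_max for t in range(ts, ts + window)):
            starts.append(ts)
    return starts


# Hypothetical prices for days 0-20.
prices = [40] * 10 + [45, 55, 60, 65, 62, 58, 66, 75, 60, 60, 60]

# Query 2: 5 consecutive days within days 10-20 with prices in [50, 70].
starts = variable_time_starts(prices, window=5, t_lo=10, t_hi=20, v_min=50, v_max=70)
print(starts)
```

An item matches the query if the returned list is non-empty; the list itself records where the matching intervals begin.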

Query 3 Find stocks that stayed in some $10 range between $20 and $40 during days

5-10.


SELECT S_i from S WHERE
    v_i ≤ S_i(t) ≤ v_i + 10
    AND
    v_i ≥ 20 and v_i + 10 ≤ 40
    when 5 ≤ t ≤ 10

Further flexibility might be accomplished by combining these two types of queries,

creating queries that have variability in both time and value.

9.1.3 Open-Ended Time and/or Value

Less restrictive queries might require only an upper (or lower) bound on the time period

or value range desired:

Query 4 Find stocks where the price is greater than $50 for some period of time after

the 20th day

SELECT S_i from S WHERE
    S_i(t) ≥ 50
    when t ≥ 20;


9.1.4 Relative Time/Value

These queries involve times and values that are specified relative to each other, rather

than in terms of any absolute values:

Query 5 Find stocks that traded within a $10 range for days 1-5 and then increased

by $20 above that range during days 10-15.


SELECT S_i from S WHERE
    v_1 ≤ S_i(t) ≤ v_1 + 10
    when 1 ≤ t ≤ 5
    AND
    v_1 + 30 ≤ S_i(t) ≤ v_1 + 40
    when 10 ≤ t ≤ 15;

Query 6 Find stocks that traded between $10 and $20 for some 10 day period, and then

traded between $30 and $40 for some 5 day period that starts at least 10 days later.

SELECT S_i from S WHERE
    $10 ≤ S_i(t) ≤ $20
    when t_s ≤ t ≤ t_s + 10
    AND
    $30 ≤ S_i(t) ≤ $40
    when t_s + 20 ≤ t ≤ t_s + 25;


In query 5, we know the times, but the value ranges are specified relative to each

other. In query 6, the values are known, but the time periods are relative. Further

generalization of these queries would involve combination of relative times and relative

values in the same query.
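
For a relative-value query such as Query 5, one way to decide whether a suitable base value v_1 exists is to intersect the bounds that each time window places on v_1. The sketch below illustrates this under stated assumptions: the function name `matches_relative_value`, its parameters, and the sample data are invented for illustration, and time points are treated as list indices.

```python
def matches_relative_value(series, first, second, width, low_offset, high_offset):
    """True if some base value v1 exists such that the series stays in
    [v1, v1 + width] over `first` and in [v1 + low_offset, v1 + high_offset]
    over `second`. `first` and `second` are inclusive (start, end) pairs."""
    a = series[first[0]:first[1] + 1]
    b = series[second[0]:second[1] + 1]
    # First window requires:  max(a) - width <= v1 <= min(a)
    # Second window requires: max(b) - high_offset <= v1 <= min(b) - low_offset
    lower = max(max(a) - width, max(b) - high_offset)
    upper = min(min(a), min(b) - low_offset)
    return lower <= upper


# Hypothetical prices for days 0-15.
rising = [0, 22, 25, 27, 24, 23, 30, 40, 50, 55, 56, 58, 60, 57, 59, 61]
flat = rising[:10] + [40, 41, 42, 43, 44, 45]

# Query 5: a $10 range on days 1-5, then $30 to $40 above its base on days 10-15.
print(matches_relative_value(rising, (1, 5), (10, 15), 10, 30, 40))
print(matches_relative_value(flat, (1, 5), (10, 15), 10, 30, 40))
```

Because each constraint is an interval on v_1, no search over candidate values is needed; the query is satisfiable exactly when the intersection of those intervals is non-empty.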

Relative time/value queries can be composed to identify more complex patterns,

such as double bottom patterns [19, 98].

9.1.5 Interval Trending

Identification of intervals of monotonic increases, decreases, non-increases, or non-

decreases may be of interest [66, 108].

Query 7 Find stocks that increased in value every day over a 10 day period, with a

resulting increase of more than $100.


SELECT S_i from S WHERE
    S_i(t) > S_i(t − 1)
    and S_i(t_s + 10) − S_i(t_s) > $100
    forall t_s + 1 ≤ t ≤ t_s + 10;

As in queries 5 and 6, the important element is the relative change: 10 days of

increase, from any starting time t_s. The magnitude of the overall change may be

specified, or not.

“All points” angular queries as currently implemented in TimeSearcher (Section

4.7) provide some support for this type of query. As mentioned above in the discussion

of relative-value queries, this support is limited to a graphical notion of the

magnitude of the desired increase or decrease.
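
The conditions of Query 7 can be checked directly by scanning every candidate start time. The following Python sketch does so under assumptions: the name `increasing_runs`, the parameter names, and the sample data are hypothetical, and this is not how TimeSearcher's angular queries are implemented.

```python
def increasing_runs(series, days, min_gain):
    """Return start times ts such that the series rises strictly on each of
    the `days` steps after ts and the total gain exceeds min_gain."""
    starts = []
    for ts in range(len(series) - days):
        rising = all(series[t] > series[t - 1]
                     for t in range(ts + 1, ts + days + 1))
        if rising and series[ts + days] - series[ts] > min_gain:
            starts.append(ts)
    return starts


# Hypothetical prices; a 3-day window stands in for the 10-day window of Query 7.
prices = [5, 1, 3, 6, 10, 9, 12]
print(increasing_runs(prices, days=3, min_gain=5))
```

Passing a negative `min_gain` effectively leaves the magnitude of the change unspecified, since any strictly increasing run has a positive gain.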


9.1.6 Maximal Periods

Query 8 Find the maximal period during which values increased every day


SELECT S_i from S WHERE
    S_i(t) > S_i(t − 1)
    forall t_s ≤ t ≤ t_e
    AND
    t_e − t_s is maximal;

This query is similar to query 7 in that it asks for an interval of continuous increase

in value. However, this query asks for a maximal interval of increase, rather than a

time-limited period of increase.
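
A maximal period of increase can be found in a single pass over a series. The sketch below is one possible formulation, with a hypothetical function name and sample data. It uses the convention that the returned pair (t_s, t_e) satisfies series[t] > series[t - 1] for every t with t_s < t ≤ t_e, and that the earliest run wins ties; both conventions are assumptions of this example.

```python
def longest_increasing_period(series):
    """Return (t_s, t_e) for the longest run of strict day-to-day increases."""
    best = (0, 0)   # longest run found so far
    start = 0       # start of the run currently being extended
    for t in range(1, len(series)):
        if series[t] <= series[t - 1]:
            start = t                       # the run is broken; restart here
        if t - start > best[1] - best[0]:
            best = (start, t)
    return best


# Hypothetical prices: the longest rise runs from day 1 to day 4.
prices = [3, 1, 2, 4, 7, 5, 6]
print(longest_increasing_period(prices))
```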

9.1.7 Aggregate Functions

Query 9 Find stocks that had an average price between $10 and $20 during times

10-15


SELECT S_i from S WHERE
    $10 ≤ ave(S_i(t)) ≤ $20
    when 10 ≤ t ≤ 15;

Queries might include any of the standard SQL aggregates - avg, min, max, sum,

and count. Other possible functions include moving averages, and standard deviations.

Existing timebox queries fall into this class as well: standard conjunctive timeboxes

might be seen as the val(x) aggregate, which simply looks at the value at time t. Disjunctive

boxes (Section 3.1) might be seen as using the anyof(x) operator, to indicate

that any one of the items in the given time period falls within the desired range.

9.1.8 Similarity to a Known Item

Similarity queries involve the use of a known item as a query to find items that are

similar to it [29, 49, 78, 75, 76, 150]:

Query 10 Find stocks that are similar to ABC


SELECT S_i from S WHERE
    D(S_i, ABC) ≤ ε

A wide variety of distance measures may be used, according to the specific circumstances

[149]. In most data mining research, distances are defined in terms of \(L_p\)

norms. Possible alternatives include the similarity model used in TimeSearcher’s drag-and-drop

query facility, where items will be defined as similar if the distance between

all corresponding values stays within a threshold: \(\forall i:\ |q_i - r_i| < \varepsilon\).

In other cases, notions of similarity might be modified to include similarity with

different scalings of time (dynamic time warping) or value, moving averages, and other

transformations [1, 19, 74, 104].
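
The point-by-point threshold model of similarity described above is simple to state in code. The following sketch is an illustration only; the function name `similar`, the dictionary of candidate series, and the sample values are assumptions, and it does not cover scaled or warped notions of similarity.

```python
def similar(query, candidate, eps):
    """True if |q_i - r_i| < eps at every corresponding time point."""
    if len(query) != len(candidate):
        raise ValueError("series must have the same length")
    return all(abs(q - r) < eps for q, r in zip(query, candidate))


# Hypothetical data: ABC is the known item used as the query.
abc = [10, 12, 11, 15]
others = {"X": [11, 12, 12, 14], "Y": [10, 12, 11, 19]}

similar_items = [name for name, s in others.items() if similar(abc, s, eps=2)]
print(similar_items)
```

Unlike an \(L_p\) distance with finite p, which can let a small difference at one point offset a larger one elsewhere, this test rejects an item as soon as a single point falls outside the threshold.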

9.1.9 Global Constraint

Query 11 Find stocks that never trade above $50

SELECT S_i from S WHERE
    max(S_i(t)) < $50

These queries identify items with global behavior within some specified range.

Global queries might be based on minima and maxima, standard deviations, averages,

or other similar measures.

9.1.10 Inter-item queries: Leaders & Laggards

Query 12 Find stocks that decreased by at least $20 over some five-day period occur-

ring 10 days after some other stock rose by $100 in a 2-day period


SELECT S_i from S WHERE
    S_i(t′) < S_i(t′ − 1)
    and S_i(t′ + 10) − S_i(t′ + 15) > $20
    when t′ + 10 ≤ t ≤ t′ + 15
    and t′ in
        SELECT t from S WHERE
        S_j(t) > S_j(t − 1)
        and S_j(t_s + 2) − S_j(t_s) > $100
        when t_s ≤ t′ ≤ t_s + 2

Other possible queries may involve comparison between items: “find times when

the value of XYZ is greater than that of ABC” [113]. These and related queries attempt

to find items that exhibit a certain trend or pattern at some point in time after another

item (or set of items) exhibits a second trend. Such trends are useful for finding leaders

or laggards that predict or trail trends across items in a data set.


As relative times and values can be specified with this class of query, this notion of

leaders & laggards is more general than the functionality provided by TimeSearcher

(Section 4.2). This discrepancy provides an example of the tradeoffs that might be

involved in implementing many of the extensions to query expressiveness. Specifically,

if increased expressivity comes at the cost of increased interface complexity or

processing time, a combination of interface enhancements and user interaction might

be the most effective means of increasing the functionality available to users.

In the case of leaders & laggards, the implementation in TimeSearcher provides ba-

sic tools for specifying a set of leaders that might be of interest, while freeing the user

to search for laggards. This provides basic support, without incurring the processing

and complexity costs that would be associated with query 12.

9.1.11 Prevailing Trends

Query 13 Find items that are generally trending upwards, but may include downturns

within a certain tolerance, between days 10 and 15 [66].


SELECT S_i from S WHERE
    δ_i = S_i(t) − S_i(t − 1)
    and ave(δ_i) > 0
    when 10 ≤ t ≤ 15

This query expresses the general upwards trend in terms of the average of the

changes in value between two measurements. Other formulations may be possible

for expressing this notion of general, but not monotonic, increases (or decreases) over

time. “End points only” angular queries (Section 4.7) provide a limited version of this

form of query.
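
The average-change formulation of Query 13 can be sketched as follows. The function name `trending_up` and the sample data are hypothetical, and the sketch does not model a tolerance on individual downturns.

```python
from statistics import mean


def trending_up(series, t_start, t_end):
    """True if the mean day-to-day change over t_start..t_end is positive.
    Requires t_start >= 1 so that series[t_start - 1] exists."""
    deltas = [series[t] - series[t - 1] for t in range(t_start, t_end + 1)]
    return mean(deltas) > 0


# Hypothetical prices for days 0-15, with two downturns inside days 10-15.
prices = [0] * 9 + [20, 22, 21, 25, 24, 27, 30]
print(trending_up(prices, 10, 15))
```

Because the day-to-day changes telescope, their sum equals series[t_end] - series[t_start - 1], so this test depends only on the values at the two ends of the interval. This is consistent with the observation that "end points only" angular queries provide a limited version of the query.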


9.1.12 More general queries

Query 14 Find items that have periods 5 intervals long that contain at least 2

upwards changes and no more than one downward change.

A more general class of queries involves specification of a set of events that may

occur in any order within a given time interval. Query 14 requires at least two events

of one type, and no more than one event of another type, to occur during a certain time

period. Formulations of this sort might be useful for detecting trends in the presence

of outliers, as they allow for mismatch tolerance constraints [66] that allow elements

in the sequence to deviate from a general pattern (i.e., “Find all items that increased in

value during 4 of 5 time periods”). These queries might also be useful for identifying

intervals that contain vague trends similar to query 13. SDL uses a powerful set of

composable operators to specify these queries, providing expressive power similar to

that of regular expressions [5].
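
Query 14 can be evaluated by counting upward and downward changes within each window of consecutive intervals. The sketch below is a plain Python illustration of that counting; it is not SDL, and the function name `tolerant_windows`, its parameters, and the sample data are assumptions.

```python
def tolerant_windows(series, intervals=5, min_up=2, max_down=1):
    """Return start indices of windows spanning `intervals` consecutive changes
    that contain at least min_up increases and at most max_down decreases."""
    changes = [b - a for a, b in zip(series, series[1:])]
    starts = []
    for s in range(len(changes) - intervals + 1):
        window = changes[s:s + intervals]
        ups = sum(1 for c in window if c > 0)
        downs = sum(1 for c in window if c < 0)
        if ups >= min_up and downs <= max_down:
            starts.append(s)
    return starts


# Hypothetical values: only the first window of five changes qualifies.
values = [5, 6, 6, 7, 6, 6, 5, 4, 4]
print(tolerant_windows(values))
```

Setting `min_up=4` and `max_down=1` expresses the mismatch tolerance example of items that increased in value during 4 of 5 time periods.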

9.2 Query Dimensions

The above examples present a range of possible time series queries. Although this

list is not in any way complete, it does provide the basis for discussing the space of

queries that may be possible. In particular, we would like to work towards a model

that bridges the gap between these queries and the current limitations of the timebox

query model. In addition to providing a framework for more rigorous analysis of query

expressiveness, this model will help guide extensions to timebox queries.

Most of the example queries can be divided into one of two categories: those

involving values that stay within a particular range (queries 1, 2, 3, 4, 5, 6, and 11), and

those that involve some sort of transition, like an upward or downward trend (queries 7

and 13). Others involve modifications to an interval, such as intervals of a maximum

length that meet specified criteria (query 8), or averaging values over an interval

(query 9). Similarity queries (query 10) and leaders and laggards queries (query 12)

involve relationships between values of multiple items in a data set, as opposed to other

queries that can be evaluated on each item independently. These distinctions form the

basis for discussions that will lead to development of the query model.

9.2.1 Range Events

A range event involves restrictions on both a set of times of interest and the values

of interest during those time periods. A range event can be viewed as a set of four

constraints: start time, duration, minimum value, and extent (for maximum flexibility,

duration and extent can be negative if desired). More formally, q = (s,d,m,e) is a

specification of a range event. In TimeSearcher, each timebox defines a single range

event.[1]

Examination of the example queries indicates a variety of range events, ranging

from completely specified to minimally specified. A complete and absolutely specified

range event q - one in which all four parameters are provided in absolute (non-relative)

terms, is a simple fixed-time, fixed-value query (Query 1). If the event’s duration

and/or extent is left unspecified, an open-ended query results (Query 4). Restricting

the values or times to occupy a given range within some broader constraints leads to

a variable time or value query (queries 2 and 3). Similarly, start time and/or minimum

value can be omitted, leading to a relative time/value query (query 5 or 6).

[1] For the current discussion, this definition will be more convenient than the equivalent definition

given in Chapter 3.

A meaningful range event must minimally have either a minimum value or an

extent, and a start time or a duration. If both the minimum value and extent

remain unspecified, the range event will reduce to requiring that the time series have

some value during the specified time, which can be assumed to be vacuously true.

Similarly, if both start time and duration are omitted, the range event requests items

that have certain values at any point in the time series - i.e, a global constraint query

(query 11).

In the general case, these values can be specified either in absolute terms (in terms

of precise numeric values or time periods) or relative to the constraints of another range

event. Relative specification allows for relative time/value queries (queries 5 or 6)

and leader/laggard queries (query 12), where the key characteristic is the relationship

between events.

Relative specification of timeboxes may pose some challenges for interpretation

and evaluation. For example, if timeboxes B and C are both specified simply as being

after A and before D, the ordering between B and C is under-specified. In this case, B,C

and C,B may be acceptable orderings. This might be addressed by providing output

displays that indicate the nature of the ordering in query results, or by constructing the

query model and interface in a manner that prohibits such ambiguities.

Maximal period queries (query 8) are a special case of a range event with an

unspecified start. In this case, the width is defined relative to other intervals meeting

the same value constraint.

The defining characteristics of a range event are which of the four parameters (start,

duration, minimum, and extent) are fixed, and which are unspecified. Defining range

events in these terms is the first step towards extending the timebox model to handle

these more complicated queries: extensions to the model to handle these more general

events would allow expression of more complex queries like 2, 3, 4, 5, 6 and 11. These


extensions may come in the form of new or modified timeboxes, or in additional widgets

that act on the data set as a whole.

Aggregate operators (query 9) can be seen as additional qualifiers that are added to

a range event to alter the nature of the query. If a basic timebox is a constraint requiring

that all of the points in the specified interval fall within the general range, this can be

interpreted as applying the allof operator to the given time and value limits. Other

possibilities might be applied by changing this operator, perhaps to an ave, or anyof

for average or disjunctive queries respectively. Other aggregate operators such as sum

and count may pose additional challenges of query construction and interpretation.

9.2.2 Transition Events

A transition event describes the nature of the interval between values at two time peri-

ods. While a range event limits the values in a range of time points to fall within certain

parameters, a transition event imposes restrictions on the nature of the change between

two time periods: perhaps requiring that values be monotonically non-decreasing

(query 7), or more generally trending upwards (query 13). These events are speci-

fied in completely relative terms, with starting time and duration determined either

by surrounding range events, or in terms of a length constraint. Value restrictions are

defined in terms of differences between values within the range event.

In a certain sense, a range event might be loosely interpreted as the first derivative

of a transition event. This connection might be an interesting area for further investi-

gation.


9.2.3 Inter-item Queries

Most of the example queries involve consideration of each item in a data set separately,

without reference to any other items. Similarity (query 10) and leader/laggard queries

(query 12) are different in that they involve comparison between items.

Similarity queries involve a direct comparison between items. One variant of a

similarity search is currently supported via TimeSearcher’s drag-and-drop query-by-

example. Alternative models might use different definitions of similarity.

Leader and Laggard queries (query 12) raise the more general, and therefore more

interesting, challenge of working with queries that relate a set of patterns in one item

to related patterns in another item.

Other interesting inter-item queries might arise in data sets involving multiple

time-varying attributes. For example, climatological data sets might include wind

speed, temperature and precipitation data from a set of series over a given period of

time. Inter-item queries might (for example) be used to identify relationships between

the temperature in one city and precipitation in another. This idea might be extended

to involve some sort of temporal join between two data sets. Precise details of these

operations will be spelled out in the complete formal model.

9.2.4 Other Logical Operators: Disjunctions and Negations

In all of the above example queries, logical combinations between range event con-

straints were all described in terms of conjunctions. Alternative possibilities include

disjunctions and negations. As time and value ranges are assumed to be finite, nega-

tions may not be strictly necessary. However, they may prove useful for aiding in query

construction and interpretation.


9.2.5 More General Queries

Query 14 and other queries that can be expressed in SDL [5] or similar notation

provide expressive power similar to that of regular expressions. This power leads to

significant challenges in query creation and interpretation. These challenges stem from

two main sources: mismatch tolerance and arbitrary ordering of events. Modifications

to timeboxes that support mismatch tolerance constraints (“find all subsequences that

had increases in four out of five time periods”) [66] might involve additional widgets

attached to the timebox to specify the desired tolerance.

The arbitrary ordering of events provides the further challenge of specifying events

of an unknown ordering. Query 14 essentially asks for a time period of five intervals

with anywhere between two and five events matching certain criteria, in any order.

Since the events can be interleaved in time, a sequence of timeboxes would only be appropriate

if queries involving partial orders are allowed. Furthermore, the constraints

on the cardinality of the different event types will involve additional complexity, per-

haps involving sliders or other measures similar to those suggested above for mismatch

tolerance. Alternatively, other more complex query specifications similar to filter-flow

queries [152] or the queries used in Patterns [90] might be used.

9.3 Towards A Formal Query Model

This discussion of the possible facets of time series queries can be used to guide a

sketch of a formal description of more expressive queries based on timeboxes. By

providing a clear and rigorous description of possible queries, the model will further

understanding of the query space and support reasoning about queries. This under-

standing will be particularly useful for generalizing the query model to other data sets,


including general temporal data and categorical data.

After a description of the data set, simple range event queries and related operators

will be discussed, followed by conjunctive queries, transition events, and inter-item

queries.

9.3.1 Time Series Data Set

We assume that all queries are being evaluated against a set S of M time series records

\((S_0, \ldots, S_{M-1})\), with each record containing N real-valued measurements, corresponding

to time points \(0, \ldots, N-1\). The jth measurement from the ith time series is referred

to as \(S_i(j)\). The values of the \(S_i(j)\) occupy a finite range between \(v_{min} = \min_{i,j} S_i(j)\)

and \(v_{max} = \max_{i,j} S_i(j)\).

9.3.2 Range Events

A timebox range event query q consists of four parameters: start time, duration, min-

imum value, and extent. Specifically, q = (s,d,m,e) (Section 9.2.1). As discussed

above, at least one of the start time and duration must be defined, along with at least

one of the minimum and extent. For now, we delay the case of expression relative to

another query and insist that any specified values are expressed in absolute terms.

If all four values are specified, the extent of the box must fit within the dimensions

of the data set as a whole: \(0 \le s \le N-1\), \(0 \le s+d \le N-1\), \(v_{min} \le m \le v_{max}\), and

\(v_{min} \le m+e \le v_{max}\).

The default interpretation of a timebox requires that the value be in the appropriate

range during all of the specified time periods. If \(S_i \in S\) and \(q = (s,d,m,e)\), we say that

\(S_i\) satisfies the query q, written \(T(q, S_i)\), if \(\forall j,\ s \le j \le s+d-1:\ m \le S_i(j) \le m+e\).

Any unspecified values are indicated by the special symbol U. In the presence


of unspecified values, the above constraints are enforced where logically possible. For

example, if s = U, the duration of the timebox must still be less than the duration of the

entire data set. Once range event queries are executed, the result items will instantiate

the values of the unspecified parameters in a manner that restricts the range event to fit

within the limits of the given data set.

Maximal periods queries (query 8) can be created by specifying the duration d as

the special symbol M , indicating the interval of maximum duration that satisfies the

start time and value constraints.

Unspecified and relative values and maximal period operators introduce additional

complexity that must be addressed in a completed formal model. If all of the queries

are expressed in terms of exact starting points and durations, each item in a data set

can only match a given query once - at exactly those specified times. However, un-

specified and relative values and maximal periods raise the possibility that a query

might match a time series at multiple intervals with different values. Thus, a complete

notation indicating a match between a query and a time series \(S_i\) might take the form

\(T(q, S_i, s', d', m', e')\), where \(s'\), \(d'\), \(m'\), and \(e'\) are the start time, duration, minimum value, and

extent for which the series met the query constraints.

Alternative interpretations of timebox queries can be specified via operators such

as ave and anyof, requiring that the average value, or any one of the values, in the

specified interval fall within the desired range. In particular, $T(\mathit{ave}(q),S_i)$ if

$m \le (S_i(s) + \dots + S_i(s+d-1))/d \le m+e$, and $T(\mathit{anyof}(q),S_i)$ if $m \le S_i(j) \le m+e$ for some $j$ with $s \le j \le s+d-1$.
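
As a rough illustration, the two alternative interpretations might be written as follows. The function names mirror the operators in the text, but the signatures are assumptions made for this sketch; a series is taken to be a list indexed by time point, and d is assumed to be positive.

```python
def ave(series, s, d, m, e):
    """T(ave(q), S_i): the mean over times s .. s+d-1 lies in [m, m+e]."""
    window = series[s:s + d]
    return m <= sum(window) / d <= m + e


def anyof(series, s, d, m, e):
    """T(anyof(q), S_i): at least one value in the interval lies in [m, m+e]."""
    return any(m <= v <= m + e for v in series[s:s + d])
```

The three interpretations can disagree on the same data: the values 1, 9, 2 have a mean of 4, so they satisfy ave for the range 3 to 5, although no single value falls in that range.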

9.3.3 Logical Combinations

Timebox queries can be combined via logical operators AND, OR, and NOT to form

more complex queries. Although the final model may allow for arbitrary logical com-

binations, the set of allowable queries in TimeSearcher is likely to be much more

limited. Due to the difficulty of constructing and modifying queries with arbitrary

grouping, TimeSearcher queries might be limited to conjuncts of disjuncts. Similarly,

negations may not be allowed.

Conjunctive queries containing timeboxes for each time point in the data set can be

used to create similarity queries based on local (point-by-point) similarity restrictions.

For the sake of convenience, we view a complex query of multiple timeboxes as

an ordered sequence: $q_0\ op_1\ q_1 \dots op_n\ q_n$. The timebox queries $q_i$ are assumed to be

sorted in increasing order of start time (when specified), even if they are constructed

in an arbitrary order, as may be the case with the TimeSearcher application.
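
A conjunctive query, and the point-by-point similarity query built from it, might look like the sketch below. The helper names are hypothetical, timeboxes are plain (s, d, m, e) tuples, and the fixed tolerance tol around a reference series is an assumption of this example rather than part of the model.

```python
def satisfies(q, series):
    """Default timebox test: all values at times s .. s+d-1 lie in [m, m+e]."""
    s, d, m, e = q
    return all(m <= series[j] <= m + e for j in range(s, s + d))


def satisfies_all(queries, series):
    """Conjunction of timeboxes, evaluated in increasing order of start time."""
    return all(satisfies(q, series) for q in sorted(queries))


def similarity_query(reference, tol):
    """One single-point timebox per time point, centered on the reference value."""
    return [(j, 1, v - tol, 2 * tol) for j, v in enumerate(reference)]
```

With a reference of 1.0, 2.0, 3.0 and a tolerance of 0.5, the series 1.2, 1.8, 3.4 matches, while 1.2, 2.8, 3.0 fails at the second time point.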

9.3.4 Variable Timeboxes

Queries involving ranges of times or values (queries 2 and 3) can be specified by

the introduction of additional constraints. For example, a fixed time/value timebox

q = (s,d,m,e) might be augmented to become a variable time timebox by adding the

constraints $s \ge t_1$ and $s+d \le t_2$ to the query. Similar modifications can be used to create

variable value timeboxes.
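
One way to read these constraints is as a search over candidate start times. The sketch below enumerates every start s with t1 <= s and s + d <= t2 at which the fixed-size box fits; vtt_matches is a hypothetical name, and the extra clamp to the length of the series is an assumption added to keep the example safe.

```python
def vtt_matches(series, d, m, e, t1, t2):
    """Return each start time s (t1 <= s, s + d <= t2) where the box is satisfied."""
    last_start = min(t2, len(series)) - d
    return [s for s in range(t1, last_start + 1)
            if all(m <= v <= m + e for v in series[s:s + d])]
```

Returning the list of matching start times, rather than a single boolean, reflects the observation above that a query with unspecified values may match one series at several intervals.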

9.3.5 Relative Timeboxes

Queries involving relative values (queries 5 and 6) involve the specification of query

parameters in terms of other timeboxes in a conjunctive query. If we have the query

$q_0\ op_1\ q_1 \dots op_n\ q_n$, where $q_i = (s_i,d_i,m_i,e_i)$, we might specify the next query component

as $q_{i+1} = (s_i+\delta_s, d_i+\delta_d, m_i+\delta_m, e_i+\delta_e)$. Any (or all) parameters might be

specified relatively.
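
Under the total-ordering assumption, a chain of relative timeboxes can be resolved into absolute ones by accumulating offsets from the first box. The following is a minimal sketch; resolve is a hypothetical helper, and each offset tuple is assumed to hold the four deltas in the order (s, d, m, e).

```python
def resolve(first, deltas):
    """Turn an absolute first timebox plus per-step deltas into absolute timeboxes."""
    boxes = [first]
    for ds, dd, dm, de in deltas:
        s, d, m, e = boxes[-1]  # each box is defined relative to its predecessor
        boxes.append((s + ds, d + dd, m + dm, e + de))
    return boxes
```

A delta of zero leaves the corresponding parameter unchanged, so a query can mix parameters that shift with parameters that are simply inherited.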

This model makes the initial simplifying assumption that clauses in a query can be

completely ordered, and that each query should be defined in terms of its predecessor.

This avoids the difficulties associated with an ordering of query elements that is partial

and not complete (Section 9.2.1). These assumptions are not necessarily appropriate in

all cases: a complete model might use a less restrictive approach.

9.3.6 Transitions

Transitions involving monotonic increases, decreases, non-increases, or non-decreases

can be expressed as additional clauses in a conjunctive query. For example, to

specify an interval of monotonic increase between queries $q_i$ and $q_{i+1}$, the clause

$inc(q_i,q_{i+1})$ might be added to the query statement. Additional operators indicating

monotonic decrease, non-increase, or non-decrease might also be useful. Prevailing

trends (query 13) might be supported via additional operators using parameters to

specify allowable tolerances. For example, $inc(q_i,q_{i+1},\delta)$ might be used to specify

a deviation δ indicating the tolerance desired in a vaguely increasing trend. Alterna-

tively, an interval of a given length might be specified without reference to a timebox

constraint: inc(n,δ).
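
The text leaves the exact meaning of the tolerance open, so the sketch below fixes one possible reading purely for illustration: the transition runs from the last time point of the first timebox to the first time point of the second, and each step may fall by less than delta. With delta equal to zero this reduces to strict monotonic increase. Timeboxes are (s, d, m, e) tuples and inc is a hypothetical implementation of the operator.

```python
def inc(series, qi, qnext, delta=0.0):
    """Check a (vaguely) increasing transition between two timeboxes.

    Assumed reading: every step between the end of qi and the start of
    qnext must satisfy series[j+1] > series[j] - delta.
    """
    start = qi[0] + qi[1] - 1  # last time point covered by qi
    end = qnext[0]             # first time point covered by qnext
    return all(series[j + 1] > series[j] - delta for j in range(start, end))
```

Other readings, such as a tolerance on the overall slope rather than on each step, would be equally consistent with the description above.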

9.3.7 Global Constraints

A global constraint (query 12) can be viewed as a range event with no constraint on

time. Using the notation given above, the range event q = (U,U,m,e) specifies a

constraint that will restrict the result set to items that always (or never) fall between m and

m + e. These events will be added to the compound query as additional terms in the

conjunction. As with standard timeboxes, either the minimum or the extent, but not

both, might be unspecified.

9.3.8 Inter-item Queries

Inter-item queries such as leaders and laggards (query 12) essentially involve connections

between multiple sets of items, each set resulting from an individual query.

In particular, a query $Q_1 = q_{10}\ op_{11} \dots op_{1n}\ q_{1n}$ defines a subset of items $S' \subset S$

that meets the query constraints. Inter-item queries would be specified in a manner

similar to that which was used with relative timeboxes, with the important difference

that inter-item queries are specified relative to timeboxes from a completely separate

query. Exact notation details will be needed to describe this relationship. In particu-

lar, inter-item queries involving joins between disjoint data sets will introduce further

complexity to the model.

9.3.9 Open Issues

This sketch presents the beginnings of a formal model of time series queries. A com-

plete model would build on these notes, adding more precise definitions of the lan-

guage, a formal grammar, and domain descriptions where appropriate. The process of

completing this model may uncover further questions in need of clarification.

The possibility of negative duration or extent values is one area that will certainly

need clarification. For some range queries - particularly those involving global con-

straints - it may be advantageous to express value constraints in terms of an extent

relative to a maximum value, or time constraints relative to an endpoint. These might

be handled with negative extent and duration, respectively, at the expense of increased

complexity in the description of timebox constraints. Alternatively, the basic model

might be extended to include extra fields for the endpoint and maximum value. This

would define an over-constrained timebox in terms of three variables in each dimension,

with any two of the three (i.e., start, end, or duration) determining the third.

Precise definitions of the notion of a timebox satisfying a query, relative queries and

complex queries containing multiple clauses will also need more work, particularly in

the context of potentially overlapping queries that may prevent a simple total ordering

of the query components.

Finally, the operators and syntax must be clarified and collected into a well-defined

grammar.

9.4 Implementing the Extended Queries

From the above discussion, we can identify several additions to the query model that

will be needed to achieve the desired expressiveness. Specifically, an enhanced system

will need widgets and interaction techniques for:

1. Queries with relative and variable specification of value and time constraints

2. Unspecified width, start, min, extent, and regions of maximal length.

3. Global trends

4. Alternative interpretations of timeboxes

5. Transition events

6. Disjunction and possibly negation.

7. Inter-item queries

Some of these query extensions might be implemented via minor additions to the

timebox model. Indeterminate time/values and queries involving operators (averages,

etc.) on range events might be expressed using different color timeboxes, or perhaps

via boxes involving decorations that specify the query type. For example, variable time

timeboxes are implemented by placing a simple timebox inside of a second box used

to indicate variability. This might be extended to provide broader support for general

variable value queries (query 3). Regions that are not fully specified might be indicated

by incomplete time boxes, containing only two or three sides instead of four.

Queries involving relative specification of value and time constraints might be

achieved by creating queries on a blank screen, independent of any constraints associ-

ated with the values and times on the query grid currently used. External widgets of

specified width or height (“struts”) might be used to require a minimum specification

between relatively specified timeboxes (query 6).

Such an approach would allow for strictly relative specification of timeboxes, but

combinations of relative and absolute queries would require additional support mech-

anisms. Other extensions may require new query widgets, such as the widget used for

angular queries (Section 4.7).

Disjunctions and negations will require additional special handling. The current

TimeSearcher model of conjunctive combination as the default relationship between

query components is simple and easily interpreted. In some cases, such as timeboxes

that occupy disjoint value ranges during identical time intervals, a seemingly natural

disjunct might appear (Figure 9.2). However, models that may appear to be natural

and obvious may in fact support ambiguous interpretation (Figure 9.3).

In these and other cases, additional interaction techniques such as manual specifi-

cation of disjunction might be necessary. This might be done through a separate control

panel used to control logical combinations of query components, similar to the Brush

Toolbox in XmdvTool [89]. Negations might be slightly easier: as discussed above,

timebox coloring or decoration might be useful. As the demand for these features is

unclear, implementation of these extensions will be of lower priority.

Figure 9.2: A timebox query expressing A∧ (B∨C)∧D. B and C must be disjuncts,

as both cannot be true simultaneously.

Further extensions to the query model may be necessary for inter-item queries such

as Leaders and Laggards (query 12). These queries involve comparisons between dif-

ferent subsets of a data set, with each subset involved being the result set of a (possibly

arbitrary) time series query. Specifying these subsets is likely to be a complex, perhaps

iterative process. Appropriate query input (and display) tools will be needed to

distinguish between the multiple sets of constraints.

Additional display techniques may be needed to work with these new query types.

For many queries - particularly those involving patterns that may occur at varying

points in time with varying durations - it may not be immediately apparent why a particular

region in a given time series matches the specified query. Appropriately designed

displays that provide a clear and natural mapping between query input and result out-

put will be needed to help users understand and interpret query results. In some cases,

there may be multiple candidate strategies for query input and result display. These

tradeoffs may be the subject of empirical and/or heuristic evaluation.

Figure 9.3: A timebox that may lead to ambiguous interpretation under the model given

in Figure 9.2. The item drawn is in either timebox B or C for the two time points

during which they overlap, but it does not spend both of those time points in any one

box. Should this item be included under the disjunctive semantics of Figure 9.2? What

result would users expect?

Timeboxes are simple and easily understood. Any additions to the query system

should be similarly straightforward and unambiguous. This tradeoff between expres-

sive power and simplicity presents an interesting opportunity for further evaluation

(Section 10.2): if extensions to the range of possible queries lead to increased confu-

sion and difficulty, simpler models may in the end be more powerful.

Alternatively, existing query tools might be combined with interface enhancements

to provide much, if not all, of the extended semantics. This is the case with leaders &

laggards (Section 9.1.10), where enhancements to the TimeSearcher interface (Section

4.2) provide feedback that would help the user interactively

explore the desired collections of leaders and laggards.

Algorithmic enhancements will be needed to process these advanced queries. One

possible approach to this problem would be to use a two-step process, involving tradi-

tional querying followed by post-processing. In the first step, a standard search algo-

rithm would be used to identify items within the dataset that met the constraints of any

traditional timeboxes, and fell within bounding boxes surrounding widgets such as the

slanted regions described above. In the second step, candidate items in the result set

would be examined to determine whether they met the requirements of any of the more

expressive query widgets. Other algorithms may be considered if this simple approach

does not perform well.
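
The two-step process might be organized as in the sketch below. All names are hypothetical: data maps item names to value lists, timeboxes are (s, d, m, e) tuples, and each expressive widget is represented, as an assumption of this example, by a bounding box paired with a predicate that performs the finer-grained test.

```python
def in_box(series, box):
    """Standard timebox test over times s .. s+d-1."""
    s, d, m, e = box
    return all(m <= v <= m + e for v in series[s:s + d])


def two_step_search(data, timeboxes, widgets):
    """Step 1 filters on boxes; step 2 applies each widget's own predicate."""
    # Step 1: traditional timeboxes plus the bounding boxes around each widget.
    candidates = {name: series for name, series in data.items()
                  if all(in_box(series, box) for box in timeboxes)
                  and all(in_box(series, bounds) for bounds, _ in widgets)}
    # Step 2: post-process the surviving candidates with the expressive tests.
    return sorted(name for name, series in candidates.items()
                  if all(test(series) for _, test in widgets))
```

The appeal of this arrangement is that the potentially expensive predicates run only on items that survive the cheaper box tests; whether that is enough for interactive response would have to be measured.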

9.5 User Needs

The proposed query extensions described in this chapter are the result of a theoretical

exploration of the space of possible queries. This exploration outlines some of the

query concepts that might expand the capabilities of analysts who work with time

series data. Specifically, these queries might help these analysts overcome limitations

with current tools and conduct valuable and novel searches that reveal patterns or find

interesting items in their data sets.

Any further work aimed at implementing these (or other) enhancements to the

timebox model should be based in analysis of meaningful tasks. Query extensions that

meet known user needs are the most likely to prove worthwhile.

9.6 Subsequence Queries: Beyond Full-Sequence Matches

Like basic timebox queries, many of the extended queries described above are

designed to identify items from a data set that match specified criteria. Much of

the recent research on algorithmic methods for querying time series data has moved

beyond these “full-sequence” matches to examine the question of “subsequence”

matches: queries that can identify portions of a time series that meet limited criteria

[5, 49, 78, 75, 98, 104]. Unlike full-match queries, subsequence queries may match

a given item in a time series multiple times. Thus, a match for a subsequence query

takes the form of an identifier for the time series along with the interval defining the

subsequence that matches.

Relative time queries that do not refer to specific time points provide the conceptual

basis for handling subsequence queries. Without specific time points to “anchor” the

query in time, these queries act as motifs or patterns that can be identified at arbitrary

points.
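
A brute-force version of this idea slides an unanchored pattern along a single series and reports every interval at which it fits. In the sketch below, the pattern is a list of (offset, d, m, e) boxes whose start times are offsets from a floating anchor; both this representation and the name subsequence_matches are assumptions made for illustration.

```python
def subsequence_matches(series, pattern):
    """Return (start, length) for every anchor position where the pattern fits."""
    span = max(offset + d for offset, d, _, _ in pattern)
    hits = []
    for anchor in range(len(series) - span + 1):
        if all(m <= v <= m + e
               for offset, d, m, e in pattern
               for v in series[anchor + offset:anchor + offset + d]):
            hits.append((anchor, span))
    return hits
```

Pairing each hit with an identifier for the series yields the match form described above. The scan examines every anchor position, which is the kind of cost that may make dynamic query response difficult on larger data sets.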

Further flexibility might be gained by extending subsequence queries to ask for

trends such as those described in queries 13 and 14. Such queries would be very similar

to the motif queries handled by the Shape Definition Language [5] or the graphical

trend queries in Patterns [90]. The flexibility of these queries may impose significant

processing demands, possibly making dynamic query support difficult for medium and

large databases.

These extended queries raise the possibility of an interesting tradeoff between per-

formance and expressivity. The goal of a 100ms response time for dynamic queries is

based on the claim that rapid response is necessary to avoid the user frustration and

delays that accompany longer waits for query responses. However, there may be times

when users are willing to wait slightly longer for answers to queries, particularly if

those queries are substantially more powerful. Examination of these tradeoffs within

the context of an extended timebox query language would be an interesting area for

further work.

Chapter 10

Future Work

The algorithmic issues (Chapter 6) and the extensions to query expressiveness

(Chapter 9) present numerous challenges for future work. Case studies yielded further

suggestions for enhancements to TimeSearcher (Chapter 8). Additional possibilities

are described below.

10.1 Further Development of TimeSearcher

10.1.1 Re-Implementation

As a research prototype, TimeSearcher was implemented with a focus on demonstra-

tion of capabilities and exploration of ideas. A new implementation, based on lessons

learned from the existing prototype, would simplify future work and lead to increased

flexibility and extensibility.

Data management elements of TimeSearcher would benefit from a redesign. Cur-

rently, TimeSearcher assumes that the entire data set is in local RAM. This assumption

- which is found throughout the code - limits the flexibility of TimeSearcher to scale

to larger data sets. Ideally, the data management code would be separated from other

code with clear abstraction barriers that would support the possibility of “plugging in”

alternative data storage models. The query algorithm should be similarly abstracted,

to support exploration of alternative search strategies and semantic (Section 3.1) and

other extensions (Chapter 9).
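
The kind of abstraction barrier described here could take many forms. As one hedged sketch, written in Python rather than in TimeSearcher's own implementation language, storage and search are expressed as two small interfaces; every class and method name below is hypothetical.

```python
from typing import Iterable, List, Protocol, Sequence, Tuple

Timebox = Tuple[int, int, float, float]  # (s, d, m, e)


class SeriesStore(Protocol):
    """Storage interface: callers never assume the data is held in RAM."""
    def ids(self) -> Iterable[str]: ...
    def values(self, item_id: str, start: int, stop: int) -> Sequence[float]: ...


class InMemoryStore:
    """One interchangeable back end; a disk-based store could replace it."""
    def __init__(self, data):
        self._data = dict(data)

    def ids(self):
        return self._data.keys()

    def values(self, item_id, start, stop):
        return self._data[item_id][start:stop]


class SequentialScan:
    """One interchangeable search strategy: test every item against every box."""
    def search(self, store: SeriesStore, boxes: Sequence[Timebox]) -> List[str]:
        return [i for i in store.ids()
                if all(m <= v <= m + e
                       for s, d, m, e in boxes
                       for v in store.values(i, s, s + d))]
```

Because the search strategy reaches the data only through values(), a store that reads ranges from disk or from a remote service could be substituted without changing the query code.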

10.1.2 Scaling

Many interesting time series data sets are very large, both in terms of the number of

items, and the number of time points in each item. Scaling the timebox/TimeSearcher

model to accommodate larger numbers (perhaps $O(10^6)$ items or time points) would

require improvements to search algorithms (Section 6.6) and the rendering portion of

the system. Alternatively, larger data sets might be randomly sampled or mathemati-

cally clustered into smaller sets of manageable size. Clustered data sets might make

use of “structure-based brushes” developed for hierarchical data sets [52]. Other pos-

sibilities include extending TimeSearcher to use disk-based indices (Section 6.6) and

possibly query previews [54] to query data sets that are too large to fit into RAM.

Long time series present particular problems for query specification and display.

Screen space limitations of approximately 1000 horizontal pixels limit displays to time

series of a few hundred time points. A variety of approaches might be taken to over-

come this constraint:

1. Scrolling: Horizontal scrolling in the query and display window might be used

to pan through the length of the time series. Linking the scrolling of the two

windows would allow users to display specific areas of the query along with

corresponding points in the data display.

2. Semantic Zooming: Zooming facilities provided by Piccolo could be used to

display long time series at varying levels of scale and detail. A “zoomed-out”

display could show a very long time series at a low level of detail by showing

a display with fewer data points than the original. This reduced display

might be based on averages of adjacent data points, examination of trends to

eliminate “uninteresting data points” or other approaches. Hierarchical sim-

plification could be used to provide progressively more detailed displays that

would be shown as the user “zooms in” to view specific areas of the data set

at greater (or even full) resolution. These simplifications might be presented in

an “overview+detail” fashion, with a compressed overview presented alongside

raw data. Alternatively, distortion techniques might be used to present areas of

interest in full detail and peripheral areas in a compressed display.

3. Filtering: Many long time series contain short periods of interesting data sep-

arated by large intervals of relatively uninteresting data with little change. For

example, EKG data contains long stretches of “normal” heart activity between

incidents of excitement or heart difficulty. For such data sets, appropriate global

filters might be used to eliminate those periods that are uninteresting, essentially

shortening the length of a time series to include only those areas that are po-

tentially relevant. These filters might take the form of a range slider that would

specify thresholds for the minimum and maximum amount of change that would

be interesting: intervals with changes that fell outside of the threshold would be

filtered.
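
Two of these ideas are simple enough to sketch directly: reducing a series by averaging adjacent points (one possible basis for a zoomed-out view) and keeping only the windows whose amount of change falls between two thresholds (the filtering approach). The function names, the fixed non-overlapping windows, and the use of max minus min as the measure of change are all assumptions of this example.

```python
def downsample(series, factor):
    """Average each run of `factor` adjacent points; the last run may be shorter."""
    chunks = [series[i:i + factor] for i in range(0, len(series), factor)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def interesting_intervals(series, width, min_change, max_change):
    """Keep non-overlapping windows whose value range lies within the thresholds."""
    kept = []
    for start in range(0, len(series) - width + 1, width):
        window = series[start:start + width]
        change = max(window) - min(window)
        if min_change <= change <= max_change:
            kept.append((start, start + width))
    return kept
```

Applying downsample repeatedly gives a crude hierarchy of progressively coarser views, which is one way the hierarchical simplification mentioned above might be approximated.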

Each of these approaches presents challenges in terms of appropriate displays

and interaction techniques. All three strategies might benefit from the addition of

an overview window which would show a thumbnail version of the entire width of a

data set while indicating where the current display fits into the whole. Zoomed dis-

plays might be controlled with a slider that could be used to select the desired level of

magnification. These approaches (and others) might also be used in combination.

10.1.3 Domain Customization

As TimeSearcher was developed primarily as a platform for exploring the possibilities

of timebox queries, it is intentionally generic. More in-depth work in specific application

areas might require additional functionality that would meet the needs of users

in specific domains. Possibilities include:

• Statistical analyses: Microarray data analysis (Section 8.1) often involves sta-

tistical tests and clustering algorithms. Integrating the results of these analytical

tools into the TimeSearcher display and query interaction might increase the

utility of the tool.

• Statistical descriptions of result sets: Microarray analyses also often involve the

search for clusters of related genes. Augmented displays that provide statistical

descriptions of result sets might help in this regard. For example, some measure

of the similarities between items in a result set might be useful for determining

whether or not the items in that set might be interpreted as a cluster.

• Alternatively, TimeSearcher might be modified to support searching over

hierarchically-clustered data, so that users might see either a cluster or an in-

dividual item as the result from a timebox search. This might be combined with

facilities for drilling-down to see the individual items found in a cluster of inter-

est.

• Links to other data sets and applications: Domain experts often benefit from

examining related data sets in multiple, coordinated views [94]. Visualizations

that linked time series data sets in TimeSearcher to other views of related data

might prove particularly powerful. For example, microarray time series data

sets might be linked to views of gene ontologies using treemaps [18]. Alter-

natively, TimeSearcher might be extended to be compatible with coordination

architectures such as snap-together visualizations (Snap) [94], thus supporting

the possibility of ad-hoc coordinated visualizations.

• Process Support: The analyses conducted in TimeSearcher may be part of ongoing

investigations that involve multiple reviews of each data set, saving of intermediate

results, and re-interpretation of trends that have been identified. TimeSearcher’s

rudimentary features for saving intermediate results (Section 4.9)

might be extended to include more detailed history-keeping and browsing [100],

annotation of results, and other bookkeeping tools that could be used to support

the ongoing process of data interpretation and synthesis.

Additional work with expert users will provide the motivation for prioritization of

future efforts: work will focus on those areas that meet the needs of motivated users.

Customization for other application domains might provide additional interesting

challenges. In collaboration with ChevronTexaco, researchers at the University of

Maryland are investigating the use of TimeSearcher for analyzing oil-well monitor-

ing data. This collaboration has already identified the need for additional facilities for

handling monitoring data sets. Another domain of potential interest is signal process-

ing, as timeboxes, VTTs, and angular queries are similar to common operations on

signals [120].

10.1.4 Multiple Time-Varying Attributes

The support for multiple time-varying attributes described in Section 4.3 is preliminary

and limited. Exploration of alternative strategies for displaying multiple attributes,

probably via multiple windows, might increase the utility of this tool while

reducing the cognitive demands placed on users.

10.1.5 Additional Functionality

Additional tools for exploring and manipulating time series data sets would increase

TimeSearcher’s utility and flexibility. For example, support for zooming time series,

overlay of related time series, point queries, and other functionality implemented in

Diamond Fast [135, 136], along with decomposition, smoothing, and forecasting and

related techniques [80] might support statistically oriented tasks.

Further refinement to existing features might also increase TimeSearcher’s utility.

For example, the “leaders and laggards” functionality (Section 4.2) might be extended

with facilities that would help users explicitly relate modified laggard queries back to

the original leader query.

Additional displays or feedback might guide users in query creation or modification.

Given a query, users might be interested in finding other intervals in which large

numbers of items followed the same pattern. This information might

be provided via a line of varying intensity underneath the horizontal (time) axis. The

saturation of this line’s color at any given time would be determined by the number of

items that matched the specified query, starting at that time point (Figure 10.1). This

preview line would provide users with suggestions for other time intervals where the

given pattern would be found. Similar previews might be provided to help users find

other value ranges for a given pattern during given time periods.
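
The data behind such a preview line could be computed by shifting the query to every possible start time and counting the items that match at each position. The sketch below assumes the query is stored as (offset, d, m, e) boxes relative to a floating start time and that all series have n time points; preview_counts is a hypothetical name.

```python
def preview_counts(data, pattern, n):
    """For each start time, count the series that match the shifted pattern."""
    span = max(offset + d for offset, d, _, _ in pattern)
    counts = []
    for start in range(n - span + 1):
        counts.append(sum(
            all(m <= v <= m + e
                for offset, d, m, e in pattern
                for v in series[start + offset:start + offset + d])
            for series in data))
    return counts
```

Dividing each count by the number of items gives a value between 0 and 1 that could drive the color saturation. The nested loops visit every start time for every item, which illustrates why this information may be too expensive to recompute on every query adjustment.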

Figure 10.1: The TimeSearcher query display, augmented with a preview display showing

time periods that have larger numbers of items that follow the pattern. The

number of items that match the query at each time point is given by the line color at

that time: lighter colors indicate a small number of matches, while darker colors show

intervals with more matches.

The computational overhead required for generating this preview information

might be substantial, making dynamic query response times infeasible. This difficulty

might be handled by presenting preview information only upon explicit user request.

Other possibilities might involve augmenting TimeSearcher to move “backwards”

- from items in the result set to queries that describe those items. TimeSearcher’s

query-by-example tool (Chapter 4) might be extended to provide additional power.

For example, given a set of items of interest, is there some sort of minimal timebox

query that returns exactly that set? Such queries might be computationally intensive, perhaps

drawing on work in data mining and machine learning, but they could be useful for

some analyses.

The empirical studies conducted during the course of this work (Chapter 7) identi-

fied several potential areas for improvements to the current TimeSearcher system:

• Improved facilities for adjusting timeboxes over small intervals and fine-tuning

of query ranges

• Support for temporarily disabling timeboxes

• Alignment tools for easing the process of creating timeboxes that are aligned in

value.

Finally, work with domain experts using TimeSearcher for ongoing research iden-

tified several proposed extensions to query expressiveness, search functionality, result

displays, and other components of TimeSearcher (Chapter 8). Implementation of these

extensions would increase TimeSearcher’s utility and flexibility.

10.2 Further Evaluation

Current evaluations of timeboxes and TimeSearcher have provided mixed results. De-

spite promising case studies involving the use of TimeSearcher for hypothesis genera-

tion in ongoing scientific research (Chapter 8), refined empirical studies are needed to

identify and measure the benefits of TimeSearcher for realistic data sets and queries.

Further empirical studies aimed at overcoming the shortcomings of previous efforts

might help identify some of the strengths of the timebox query model. As discussed in

Chapter 7, previously conducted studies suffered from tasks that were not especially

well-suited for timeboxes, and also from difficulties in user comprehension of com-

plex queries. Studies involving a more careful selection of tasks, perhaps focused on

exploration of data sets, and perhaps involving more training, might overcome these

difficulties to provide more informative results.

Narrowly-focused studies with motivated domain experts might provide another

means of avoiding difficulties associated with user comprehension of tasks. As such

evaluations would involve users who had a vested interest in solving real problems

that they face with meaningful data sets, the difficulties associated with novice users

would be avoided, and additional types of evaluations might be possible. For example,

TimeSearcher might be compared to whatever existing tools they use. Alternatively,

TimeSearcher might be used with and without various features, in order to determine

which features are most helpful.

10.3 Other Types of Time-oriented Data

The extensions to the timebox query model described in Chapter 9 provide several

possible directions for future work. However, these extensions were all discussed in

terms of basic time series data sets. Generalizing the timebox concept to apply to other,

more challenging types of time-oriented data presents further opportunities for interesting work.

10.3.1 Categorical or Nominal Data

Timeboxes and TimeSearcher were originally designed to support time series data sets

involving continuous measurements. In this context, “continuous” is defined to mean

that the values involved can take any value in a finite interval. A potentially interesting

generalization of timeboxes and TimeSearcher might involve support for categorical

or nominal data sets: data sets involving values that each fall into one of a set

of (possibly ordered, in the case of nominal data) discrete and disjoint classes. This

definition is somewhat arbitrary: a standard time series data set might be converted to

a categorical data set by “bucketing” the values ($0-$10 becomes category 1, $11-$20

category 2, etc.).
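
The bucketing step is straightforward to sketch. The upper bounds below follow the dollar ranges in the example; the third bound, the function name, and the treatment of fractional values (anything above 10 and up to 20 lands in category 2) are assumptions added for illustration.

```python
from bisect import bisect_left

# Inclusive upper bound of each category: $0-$10, $11-$20, $21-$30 (assumed).
UPPER_BOUNDS = [10, 20, 30]


def bucket(value, upper_bounds=UPPER_BOUNDS):
    """Map a continuous value to a 1-based category number."""
    return bisect_left(upper_bounds, value) + 1


def to_categorical(series, upper_bounds=UPPER_BOUNDS):
    """Convert a standard time series into a categorical one."""
    return [bucket(v, upper_bounds) for v in series]
```

Values above the last bound fall into one extra category, so the conversion never fails on out-of-range data.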

Examples of categorical time series data include log files containing time-stamped

events [44, 64, 106]. For example, web log file entries contain a timestamp, along with

the page that was referenced, the browser that was used, and other related informa-

tion [64]. A query tool for these categorical data sets would allow users to identify

patterns of sequences of actions. For example, systems administrators might be inter-

ested in knowing which users tried to execute an “su” command to gain root privileges

immediately after logging in to the system.
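
Outside of any graphical interface, the systems administration example amounts to a short scan over the log. In the sketch below, the event format (timestamp, user, action), the action labels, and the reading of "immediately after" as "the user's very next logged event" are all assumptions.

```python
def su_right_after_login(events):
    """Return users whose first action after a login was an "su" command.

    `events` is a list of (timestamp, user, action) tuples sorted by timestamp.
    """
    last_action = {}
    flagged = set()
    for _, user, action in events:
        if action == "su" and last_action.get(user) == "login":
            flagged.add(user)
        last_action[user] = action
    return sorted(flagged)
```

A categorical timebox query for the same pattern would place one box on the "login" category followed by a box on the "su" category at the next time point.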

The first step in extending timeboxes to handle categorical data would be to estab-

lish some sort of linear order on the categories for any given data set. Although some

data sets may have a natural ordering of categories, other data sets might require the

imposition of a potentially arbitrary ordering. This linear order would be used to define

the y axis of the query space, just as the range of measured values defines the y axis in

the current model. Since boundaries between categories are discretely defined, shading

or other visual cues might be used to differentiate between the categories. Timeboxes

could be drawn to include only one category, or multiple adjacent categories (Figure

10.2). When appropriate, hierarchical data sets might be displayed along with tools

for selecting the level in the hierarchy that would be displayed [119].

As a straightforward extension of the timebox model, this approach is appealing.

However, two significant differences between continuous and categorical time series

data sets present potential problems:

• Continuous time series data is based on a meaningful ordering of values that

may not be present in categorical data sets. A given set of n categories will have

240

Figure 10.2: Sketch of a potential design for categorical timeboxes. For a data set in-

volving web log records for multiple hosts, this interface might be used to find queries

that had large numbers of visitors from “.com” hosts in September and October, fol-

lowed by large numbers of “.org” visitors in December and January.

n! possible orderings. The choice of ordering may be somewhat arbitrary, with

different orderings producing different visual patterns in the data sets [106].

• The type and magnitude of changes from one time period to the next have meaning

in continuous data sets that they may not have in categorical data sets. For standard

time series data, it is often the case that the value at time t_i is more closely

related to the value at time t_{i+1} than it is to the value at time t_{i+10}.

Therefore, a timebox of a limited height can be used

to define a range of variability, perhaps filtering out changes that are too small

to be of interest. When orderings on categories are not based on some internal

ordinal, ratio, or interval scale, the height of a timebox may just be a function

of the arbitrary ordering of the categories. In other cases, natural orderings of

Figure 10.3: A categorical timebox query looking for sites that had large numbers of

“.org” or “.edu” visitors during December and January.

the categories may not be useful for query construction. For example, web site

accesses might be ordered by alphabetizing URLs, but the utility of queries involving

lexicographically similar URLs might be limited [64].

Further revisions to the timebox query model may be needed to address these is-

sues. For example, timeboxes might be constrained vertically to occupy only one

category, with multiple, vertically-aligned timeboxes indicating a disjunction (Figure

10.3). This would eliminate problems that might be caused by timeboxes with heights

that spanned multiple categories, but the potential difficulties associated with deriving

orderings of the categories would remain. Additional analyses of specific data sets and

user tasks will be needed to guide appropriate designs.
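The single-category variant described above can be sketched in a few lines. This is only an illustration under stated assumptions, not TimeSearcher code: the function name `matches_categorical_timeboxes`, the representation of an item as a list of category labels (one per time point), and the `(start, end, category)` tuple layout are all hypothetical choices made for this example.

```python
from collections import defaultdict


def matches_categorical_timeboxes(series, timeboxes):
    """Return True if `series` satisfies a set of single-category timeboxes.

    series    -- list of category labels, one per time point (assumed layout)
    timeboxes -- list of (start, end, category) tuples with inclusive indices

    Boxes that share the same (start, end) span are vertically aligned and
    are treated as a disjunction; different spans are combined as a conjunction.
    """
    allowed = defaultdict(set)
    for start, end, category in timeboxes:
        allowed[(start, end)].add(category)
    return all(
        all(label in categories for label in series[start:end + 1])
        for (start, end), categories in allowed.items()
    )


# Hypothetical item: the dominant visitor domain at five time points.
dominant_domain = [".com", ".com", ".org", ".edu", ".org"]

# ".org" OR ".edu" throughout time points 2-4.
either = [(2, 4, ".org"), (2, 4, ".edu")]
# ".com" during points 0-1 AND ".org" during points 2-4.
strict = [(0, 1, ".com"), (2, 4, ".org")]

print(matches_categorical_timeboxes(dominant_domain, either))  # True
print(matches_categorical_timeboxes(dominant_domain, strict))  # False
```

Because each box names its category explicitly, evaluation in this sketch does not depend on any ordering of the categories; an ordering would still be needed to lay the categories out on the display.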

10.3.2 Temporal Data

Temporal data sets involve events with arbitrary, finite duration. The timing of these

events can have a variety of complex relationships. For example, event A can precede

event B (A ends before B starts), follow B (A starts after B ends), or occur during B (A

starts after B starts and ends before B ends). Temporal relationships between events

have been characterized [9, 51], and a large body of research in temporal databases

and temporal query languages has provided numerous proposals for efficient storage

and indexing of these data sets [33, 71, 126].
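The three relationships named above can be written directly as predicates on event endpoints. The sketch below is a minimal illustration, assuming events are closed intervals with numeric start and end times; the `Event` type and the function names are hypothetical, and only the relationships mentioned in the text are covered.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """An event with a finite duration; `start` should not exceed `end`."""
    start: float
    end: float


def precedes(a: Event, b: Event) -> bool:
    """A ends before B starts."""
    return a.end < b.start


def follows(a: Event, b: Event) -> bool:
    """A starts after B ends."""
    return a.start > b.end


def during(a: Event, b: Event) -> bool:
    """A starts after B starts and ends before B ends."""
    return a.start > b.start and a.end < b.end


# Hypothetical events on an arbitrary time scale.
hospital_stay = Event(1, 9)
surgery = Event(2, 4)
checkup = Event(12, 13)

print(during(surgery, hospital_stay))    # True
print(precedes(hospital_stay, checkup))  # True
print(follows(checkup, surgery))         # True
```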

As with categorical time series data, the line between temporal data and time se-

ries data may be blurred. For example, a time series data set tracking an individual’s

body temperature can easily be converted to a temporal database, with all consecutive

readings greater than 98.6 degrees F described as “Fever” events.
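The conversion described in this example amounts to grouping consecutive above-threshold readings into runs. The following sketch shows one way this might be done; the function name `to_events`, the use of list positions as time stamps, and the `(label, start, end)` event layout are assumptions made for illustration.

```python
def to_events(readings, threshold=98.6, label="Fever"):
    """Group consecutive readings above `threshold` into events.

    Returns a list of (label, start_index, end_index) tuples, where the
    indices are inclusive positions in `readings`.
    """
    events = []
    start = None
    for i, value in enumerate(readings):
        if value > threshold:
            if start is None:
                start = i            # a new run begins here
        elif start is not None:
            events.append((label, start, i - 1))
            start = None
    if start is not None:            # close a run that reaches the end
        events.append((label, start, len(readings) - 1))
    return events


# Hypothetical temperature readings in degrees F.
temps = [98.2, 99.1, 100.4, 98.6, 98.4, 101.0]
print(to_events(temps))  # [('Fever', 1, 2), ('Fever', 5, 5)]
```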

Notable visualizations of temporal data have addressed medical records [102, 99].

Rectangular regions describe events, with the start and end points determined by the

left and right ends of the rectangle, respectively. Different types of events might be

treated as categories, which can be ordered vertically (of course, orderings of categories

present challenges similar to those found with categorical time series data).

The temporal visual query language (TVQL), developed as part of MMVIS, presented a

dynamic query interface for temporal data. TVQL uses range sliders and other tra-

ditional widgets to support queries involving temporal relationships between two sets

of events (Section 2.1.4) [60, 61]. Extensions to the timebox model for temporal data

may have the potential to handle more complex queries involving relationships be-

tween multiple items.

Basic temporal constraints could be specified by constructing timeboxes with the

desired vertical alignments. The result would be a graphical notation similar to that

used in TVQL [60], with overlaps or adjacencies in the temporal extents of events

describing the desired relationships. Additions to expressive power might be both

complicated and desirable. For example, some tasks might require a general constraint

that A precedes B, while others might demand the more specific query that A precedes

B by a given duration ta,b. Flexible intermingling of query components that are more or

less constrained may prove challenging. As these difficulties may be similar to those

found in allowing arbitrary relationships in time series queries (Chapter 9), similar

strategies might be used to address both sets of problems.
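The difference between the general and the more specific form of the "precedes" constraint can be illustrated with a single predicate that takes an optional duration. This is a hedged sketch: representing events as `(start, end)` tuples, measuring the duration from the end of A to the start of B, and the `tolerance` parameter are all assumptions introduced here, not part of the timebox model.

```python
def precedes(a, b, gap=None, tolerance=0):
    """Check whether event `a` precedes event `b`.

    a, b      -- (start, end) tuples
    gap       -- if None, only require that a ends before b starts;
                 otherwise also require that the time between the end of a
                 and the start of b is within `tolerance` of `gap`
    tolerance -- allowed deviation from `gap` (hypothetical parameter)
    """
    lag = b[0] - a[1]
    if lag <= 0:
        return False
    return gap is None or abs(lag - gap) <= tolerance


a = (0, 2)
b = (5, 7)
print(precedes(a, b))                    # general constraint
print(precedes(a, b, gap=3))             # A precedes B by exactly 3 units
print(precedes(a, b, gap=4))             # fails: the actual lag is 3
print(precedes(a, b, gap=4, tolerance=1))
```

Leaving `gap` unset gives the less constrained query, while supplying it gives the more specific one, so clauses of both kinds could in principle be mixed within one query.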

Temporal data sets also require additional work in terms of defining appropriate

data storage models and indices for searching. Recent work on temporal databases [71]

may be useful in this regard, but it is not yet known if these strategies are capable of

providing the performance needed for dynamic query applications. Further investiga-

tion will be needed to understand the range of temporal queries that can realistically

be processed in the 100ms response window that is needed.

Chapter 11

Conclusions

Despite the wide range of data sets and domains that make extensive use of time series

data, there has been relatively little work to date involving dynamic queries for spec-

ifying constraints on time series data sets. This dissertation uses the timebox query

model as the basis for an exploration of issues associated with interactive queries on

time series data. Specific contributions include:

• The definition of the timebox query model: The timebox query model refines

existing dynamic query widgets by allowing concurrent specification of multiple

constraints.

• The TimeSearcher application: TimeSearcher uses timeboxes, drag-and-drop

query-by-example and bookmark capabilities (“leaders & laggards”) to support

exploration of time series data sets. Implemented in Java, TimeSearcher uses

object-oriented design techniques that support the use of subclassing to easily add

new classes of queries.

• New query widgets for additional expressive power: Variable-time timeboxes

(VTTs) and angular queries build upon the basic timebox model to provide ad-

ditional expressive power. New interface widgets that provide dynamic query

functionality needed to support these models have been designed and imple-

mented.

• Validation through case studies: The utility of timeboxes and TimeSearcher has

been demonstrated by ongoing use in active research projects (Chapter 8). In

addition to confirming early intuitions regarding the utility of the tool, this col-

laboration has led to numerous insights and design suggestions that otherwise

might not have been identified (Chapter 8).

• Empirical Evaluation of timeboxes: Although more work remains to be done

in empirically characterizing the strengths of timeboxes as a query mechanism,

studies conducted thus far have led to an increased understanding of the strengths

of timeboxes (Chapter 7). Further studies will attempt to refine this understand-

ing, with the ultimate goal of generalizing results to apply to other 2D rectangu-

lar query widgets.

• Analysis of algorithmic expectations: Providing dynamic query performance

(100ms updates) for queries on large time series data sets requires fast process-

ing. Comparison of several alternative approaches led to some initially counter-

intuitive results: index structures provided inferior performance as compared to

non-indexed data. Further examination of the problem led to the observation

that structures that index each time series as a whole will be needed for efficient

evaluation of full-match queries.

• Framework for extending the query model: The timebox query model is a start-

ing point. Chapter 9 describes a subset of the possible extensions to timeboxes

that might be used to provide various increases in the expressive power of the

query language. Further work in this area will be needed to identify the potential

extensions that are interesting and relevant to user tasks as well as realistically

achievable.
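For readers who want a concrete picture of the non-indexed approach mentioned in the algorithmic bullet above, the following sketch evaluates a conjunction of timeboxes with a plain sequential scan. It follows the timebox semantics defined earlier in the dissertation (all values in the time range must fall within the value range), but it is not the TimeSearcher implementation: the function names, the dictionary-of-lists data layout, and the sample values are assumptions.

```python
def timebox_matches(series, x1, x2, y1, y2):
    """True if every value at time indices x1..x2 (inclusive) lies in [y1, y2]."""
    return all(y1 <= y <= y2 for y in series[x1:x2 + 1])


def sequential_scan(dataset, timeboxes):
    """Return the names of items that satisfy every timebox (a conjunction).

    dataset   -- dict mapping item name to a list of values
    timeboxes -- list of (x1, x2, y1, y2) tuples
    """
    return [
        name
        for name, series in dataset.items()
        if all(timebox_matches(series, *box) for box in timeboxes)
    ]


# Hypothetical data: three items with five time points each.
prices = {
    "AAA": [10, 12, 15, 30, 32],
    "BBB": [11, 25, 14, 31, 29],
    "CCC": [9, 13, 16, 18, 20],
}
# Low (5-20) during points 0-2, then high (25-35) during points 3-4.
query = [(0, 2, 5, 20), (3, 4, 25, 35)]
print(sequential_scan(prices, query))  # ['AAA']
```

Each item is examined once, and `all` stops at the first value or timebox that fails.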

This work has led to a wide range of possibilities for future work (Chapter 10).

Extending the query language, implementing new classes of queries, and examination

of new algorithmic techniques are just a few of the challenging and interesting areas

that may be suitable for closer examination.

As a tool designed for use by motivated experts in examining real data,

TimeSearcher has benefited from the design suggestions and feedback provided by

those users. Future research involving the timebox model and the TimeSearcher tool

should continue in this vein.

Appendix A

A Sample TimeSearcher Data File

The sample data file given below is a modified version of a data file based on yeast

microarray data [40]. As this data set contains only five items, it is shown primarily to

illustrate the file format. Two time-varying attributes are given for each item at each

time point - the log ratio and the absolute value of the log ratio. For each item, the first

value is the log ratio for the first time point, followed by the absolute value of the log

ratio for the first time point. This then repeats for time points 2-7.

#title

Yeast MicroArray Data

# static attributes

Gene,String

#Dynamic Atts

LogRatio,Float;AbsLogRatio,Float

# of time points

7

# of records ... ???

5

#time point labels

9,11,13,15,17,19,21

#vals

YAL003W,0.072,0.072,-0.004,0.004,-0.018,0.018,-0.19,0.19,-0.28,

0.28,-0.46,0.46,-0.72,0.72

YAL010C,-0.37,0.37,-0.032,0.032,0.013,0.013,-0.27,0.27,-0.28,

0.28,0.11,0.11,-0.06,0.06

YAL016W,0.045,0.045,0.021,0.021,0.041,0.041,-0.022,0.022,-0.051,

0.051,0.13,0.13,-0.027,0.027

YAL021C,0.045,0.045,0.021,0.021,0.041,0.041,-0.10,0.10,-0.066,

0.066,0.13,0.13,-0.041,0.041

YAL026C,0.053,0.053,0.182,0.182,0.140,0.140,0.23,0.23,0.22,0.22,

0.46,0.46,0.22,0.22
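As an aid to reading the format, the sketch below parses a file laid out like the sample above. It is not the TimeSearcher parser. It rests on assumptions inferred from the sample alone: every line beginning with `#` opens a new section, the sections always appear in the order shown, attribute declarations are `Name,Type` pairs separated by `;`, and a record may wrap across lines as it does in the printed listing. The function name `parse_timesearcher` and the tiny input are hypothetical.

```python
def parse_timesearcher(text):
    """Parse text in the sample format into a dictionary (see assumptions above)."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            sections.append([])          # each '#' line opens a new section
        elif sections:
            sections[-1].append(line)
    title, static, dynamic, n_points, n_records, labels, values = sections

    static_names = [s.split(",")[0] for s in static[0].split(";")]
    dynamic_names = [d.split(",")[0] for d in dynamic[0].split(";")]
    n_points = int(n_points[0])
    n_records = int(n_records[0])

    # Flatten wrapped lines into one token stream, then cut fixed-width records.
    tokens = [t for line in values for t in line.split(",") if t]
    width = len(static_names) + n_points * len(dynamic_names)
    if len(tokens) != n_records * width:
        raise ValueError("value count does not match the declared sizes")

    records = []
    for i in range(0, len(tokens), width):
        row = tokens[i:i + width]
        item = dict(zip(static_names, row))
        numbers = [float(v) for v in row[len(static_names):]]
        # Values are interleaved per time point: attr1, attr2, attr1, attr2, ...
        for j, name in enumerate(dynamic_names):
            item[name] = numbers[j::len(dynamic_names)]
        records.append(item)

    return {
        "title": title[0],
        "time_points": labels[0].split(","),
        "records": records,
    }


# A made-up two-record file with three time points, in the same layout.
sample = """#title
Tiny Example
# static attributes
Gene,String
#Dynamic Atts
LogRatio,Float;AbsLogRatio,Float
# of time points
3
# of records
2
#time point labels
9,11,13
#vals
G1,0.5,0.5,-0.25,0.25,
1.0,1.0
G2,-1.5,1.5,0.0,0.0,2.0,2.0
"""

data = parse_timesearcher(sample)
print(data["records"][0]["LogRatio"])  # [0.5, -0.25, 1.0]
```

The size check makes malformed value lines (for example, a missing comma between two numbers) fail loudly rather than shift later values into the wrong time point.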

Appendix B

Study Materials for Evaluation of Input Mechanisms

for Questions of Varying Complexity

B.1 Exploratory Task

Find three stocks that are interesting or different.

This task was repeated for each of the three interfaces.

B.2 Training Questions

1. How many stocks had prices between $55 and $87 during days 1-3?



4. During days 13-17, are there more stocks between $0-$35, $20-$55, or $40-$75?

5. Which interval has more stocks priced between $78 and $94: days 4-9, 12-17,

or 20-25?

6. During days 1-4, which price range has the most stocks: $90-$116, $80-$106,

or $70-$96?

The first three training questions are low complexity, and the remaining are

medium complexity.

B.3 Experimental Questions







7. Which price range has the most stocks during days 29-30: $50-$75, $75-$100,

or $68-$93?

8. Which time period has the most stocks in the range $87-$124: 1-6, 11-16, or

17-22?

9. During days 22-23, are there more stocks between $69-$119, $59-$109, or $49-$99?

10. Which intervals have more stocks priced between $10 and $35: days 15-20,

21-25, or 26-30?

11. Which price range has the most stocks during days 1-4: $30-$50, $40-$60, or

$50-$70?

12. Which time period has the most stocks in the range $43-$113: 1-8, 14-21, or

23-30?

13. Which price range had the most stocks during days 13-15: $12-$35, $17-$42,

$22-$47, $27-$52, or $32-$55?

14. Which intervals have more stocks priced between $60 and $80: days 11-14, 15-18, 19-22, 23-26, or 27-30?

15. Which price range had the most stocks during days 1-7: $60-$120, $50-$110,

$40-$100, $30-$90, or $20-$80?

16. Which days have the most stocks with prices between $0 and $50: 1-3, 6-8,

11-13, 16-18, 21-23, or 26-28?

17. Which price range has the most stocks during days 14-20: $10-$40, $20-$50,

$30-$60, $40-$70, or $50-$80?

18. Which days have the most stocks with prices between $50 and $100: 2-10, 4-12,

6-14, 8-16, or 10-18?

Questions 1-6 are low complexity, 7-12 are medium complexity, and 13-18 are

high complexity. Each group contains 6 questions - 2 repetitions for each of

three interfaces.

B.4 User Interface Satisfaction Questionnaire

Please circle the numbers which most appropriately reflect your impressions about

using this computer system.

Not Applicable = NA.

1. Overall reactions to the form fill-in interface

(a) (1=terrible,9=wonderful) 1 2 3 4 5 6 7 8 9 NA

(b) (1=frustrating,9=satisfying) 1 2 3 4 5 6 7 8 9 NA

(c) (1=difficult,9=easy) 1 2 3 4 5 6 7 8 9 NA

(d) (1=rigid,9=flexible) 1 2 3 4 5 6 7 8 9 NA

2. Overall reactions to the range slider interface





3. Overall reactions to the direct manipulation interface





4. Which interface did you prefer for the defined tasks (first set)? Form fill-in

Range Slider Direct Manipulation

5. Which interface did you prefer for the exploratory tasks (second set)? Form

fill-in Range Slider Direct Manipulation

6. Do you have any further comments or suggestions:

Appendix C

Empirical Evaluation of Multiple-Constraint Query

Formation

This study was designed to evaluate the use of timeboxes for searching for complex

patterns involving multiple constraints. Due to difficulties with participant comprehen-

sion of the study tasks, this study was modified after four individuals had participated,

and terminated after eight subjects. The study procedures, tasks, and preliminary

results are presented here for the sake of completeness.

C.1 Design

The second study used variations in the number of query clauses required to complete a

task to address another source of complexity. Tasks in this case involved identification

of one item that matched a given set of criteria. When multiple items satisfied the

criteria, any one of those matches was considered to be a correct item.

The three complexity levels were defined in terms of the number of clauses re-

quired to answer the task. Low complexity tasks required two clauses while medium

complexity tasks required three and high complexity tasks needed four. For example:

1. Low complexity: Find a stock that had prices during days 5-7 that were lower

than its prices during days 1-3.

2. Medium complexity: Find a stock whose price decreased from days 11-13 to

days 15-17 and again from days 15-17 to days 19-21.

3. High complexity: Find stocks that increased gradually, over days 13-15, 17-19,

21-23, and 25-27, such that the prices in each interval are generally higher

than the previous interval.

The complete set of tasks is given in Section C.4.

These tasks are less well-formed than the tasks used in the first study (Section 7.1).

Specifically, these tasks ask participants to find items that had certain trends in values

over specific dates, but all values are specified relative to each other. These tasks ask

users to identify items that follow general trends (“prices during days 1-3 that were

lower than prices during days 11-13”).
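One possible strict reading of these relative tasks is sketched below: an interval counts as "lower" than another when every value in it is below every value in the other, and a multi-clause task is a chain of such comparisons. This reading is an assumption made for illustration; the study itself accepted approximate answers. The function names and the sample series are hypothetical.

```python
def values_on(series, first_day, last_day):
    """Return the values for days first_day..last_day (1-indexed, inclusive)."""
    return series[first_day - 1:last_day]


def is_lower(series, interval_a, interval_b):
    """True if every value in interval_a is below every value in interval_b."""
    return max(values_on(series, *interval_a)) < min(values_on(series, *interval_b))


def follows_trend(series, intervals, directions):
    """Check a chain of relative constraints between consecutive intervals.

    directions[i] is "up" or "down" and describes the change from
    intervals[i] to intervals[i + 1].
    """
    for (a, b), direction in zip(zip(intervals, intervals[1:]), directions):
        if direction == "up":
            ok = is_lower(series, a, b)
        else:
            ok = is_lower(series, b, a)
        if not ok:
            return False
    return True


# Hypothetical stock that falls steadily from 100 to 80 over 21 days.
stock = list(range(100, 79, -1))

# Low complexity example: days 5-7 lower than days 1-3.
print(is_lower(stock, (5, 7), (1, 3)))
# Medium complexity example: decreases across three intervals.
print(follows_trend(stock, [(11, 13), (15, 17), (19, 21)], ["down", "down"]))
```

Under this reading, each interval corresponds to one query clause, which matches the two, three, and four clauses used for the three complexity levels.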

This increased flexibility may in some cases cause some difficulties. The reduced

specificity of the task statements may lead to ambiguity that might confuse users and

increase task times. To avoid difficulties with ambiguity, the study administrator ac-

cepted approximate answers as being correct and - when necessary - told participants

when their answers were close enough to be acceptable.

This study used a modified version of the tsexp interface described in Section 7.1.1.

This version of tsexp had two major differences from the implementation used in the

first study (Section 7.1). First, as this study involved complex patterns that required

multiple constraints, the tsexp interface was revised to support multiple query condi-

tions.

Second, this study used definitions of start and stop times, and data sets, similar to those

used in the study that compared input and output (Section 7.2).

Synthetic data sets containing randomly-generated values were used for this study.

The data sets included 30 time points for each of 200 items. For the exploratory tasks,

the data sets from the first study (Section 7.1) were used.

The criteria for correct task completion were also relaxed somewhat relative to the

previous study. Users were told that an exactly correct answer was not necessary, and

that the administrator of the study would indicate when they reached an acceptable

answer. Every attempt was made to accept queries that addressed the spirit of the

question at hand. This change was made in an attempt to avoid spending significant

amounts of time making fine adjustments, and to more closely approximate the use of

TimeSearcher for exploratory tasks involving approximate queries.

This study was initially designed to have 18 graduate and undergraduate students

from the University of Maryland’s Computer Science department as participants. Pilot

tests with 3 subjects were used to fine-tune the study content. In particular, pilot par-

ticipants found the phrasing of the tasks to be challenging. Despite attempts at revising

the wording of the tasks, many participants had difficulty interpreting the tasks, and

sessions took much longer than anticipated (as long as 2 hours, as opposed to the goal

of 1 hour).

Due to these difficulties, the study was shortened from 2 repetitions of each in-

terface/complexity combination (18 tasks total) to 1 repetition (9 tasks total) after the

fourth participant. Although this shortened the session to be closer to the goal of one

hour, it did not resolve the comprehension difficulties. As a result, the study was ter-

minated after eight participants.

Results are presented below for these eight participants. For the first four partici-

pants, only the nine questions that were completed by all participants were included in

Figure C.1: Average completion time for well-defined tasks. [Plot omitted: average task completion time (ms) by complexity (Low, Medium, High); legend: Timebox.]

the analysis. The small sample size and the comprehension difficulties experienced by

participants limit the generalizability of the results.

C.2 Results

Results from the well-defined tasks are given in Figure C.1. The increase in task

completion time with complexity was significant (repeated measures analysis of variance

(RMANOVA), F(2,67) = 7.39, p < .01), but there were no significant differences between

the three interfaces (F(2,67) = 0.99, p > .05). There was no interaction effect

(F(4,63) = .73, p = .57).

Results for the exploratory tasks are given in Figures C.2 and C.3. There were no

significant differences, either in the number of items correctly identified (RMANOVA,

F(2,21) = .5, p = .61) or in the task completion time (F(2,21) = .63, p = .54).

Figure C.2: Number of items correctly identified in exploratory task. [Plot omitted: number of items correctly identified by interface.]

Seven of the eight participants completed the subjective questionnaire. Results

are given in Figure C.4. Users showed a general preference towards timeboxes, with

significant differences in preference levels for the Difficult/Easy question (ANOVA,

F(2,18) = 3.69, p < .05), and the Rigid/Flexible question (F(2,18) = 10.7, p < .01).

Significance levels were marginal for the Terrible/Wonderful (F(2,18) = 3.28, p =

.06) and Frustrating/Satisfying (F(2,18) = 3.43, p = 0.05) ratings.

The slight preference for timeboxes over the other input mechanisms was con-

firmed when users were asked to select the interface that they preferred for each type

of task. For the well-defined tasks, four users preferred timeboxes, three preferred

form fill-in, and one preferred range sliders. Preferences for the exploratory tasks

were much clearer, with 6 users preferring timeboxes and 1 each preferring form fill-in

and range sliders (Table C.1).

Figure C.3: Average task completion time for exploratory tasks. [Plot omitted: average task completion time (ms) by interface.]


Task type       Form fill-in   Range slider   Timebox
Well-Defined    3              1              4
Exploratory     1              1              6

Table C.1: User preferences by interface for the different task types.

C.3 Discussion

This study was plagued by design flaws that were not apparent until several additional

subjects completed the protocol. The primary difficulty was in the wording of the

questions. Many subjects had significant difficulties in interpreting the phrasing of

the tasks. When faced with tasks asking for stocks that “decreased from days 11-13

to days 15-17, and again from days 15-17 to days 19-21” (for example), participants

often had trouble determining the direction of changes required. Some drew a series

Figure C.4: Average subjective satisfaction ratings (1-9, 9 is best). n = 7. [Plot omitted: average subjective rating; legend: Timebox.]

of arrows or boxes to illustrate the required directions of changes between each time

interval.

Many of the tasks asked users to treat a stock’s price during an interval of several

days as a single chunk (“decreased from days 11-13 to days 15-17” being two chunks).

Once this interpretation was explained, participants did not appear to have significant

difficulties with it.

These difficulties in interpretation were apparent throughout the study. Participants often

had to repeat training tasks, and the administrator of the study frequently performed

the first one or two training tasks for the participants, showing them how the questions

should be interpreted and answered. Even after these repeats, users often had difficul-

ties that were clearly attributable to interpretation of the question (as opposed to use of

the interface). For example, participants frequently inverted the transitions requested,

finding a pattern of decrease-increase-decrease when the task required

the opposite pattern of increase-decrease-increase.

These difficulties led to a session that was significantly longer than intended. Re-

ducing the study to contain only one repetition for each of the nine task types did not

eliminate comprehension difficulties, so the study was terminated after four additional

participants.

Other aspects of the study design might have been similarly problematic. Many of

the tasks asked users to find stocks that increased and/or decreased in price from one

interval to the next. Task completion times for these questions may have been sensitive

to initial conditions. Specifically, if a participant was fortunate enough to create the

first two terms of a query in a manner that met the first constraint, subsequent terms

and constraints would be relatively easy to satisfy. On the other hand, if the user’s first

query was placed in a region with relatively little data, they may have had more trouble

satisfying the terms of the task.

The flexibility provided to study participants may have confused matters further.

Users were told to find items that were “close” to the parameters specified in each

task, without being told how close they needed to be. As a result, they had to ask the

administrator of the study for clarification, which required a judgment call that may

not have been made consistently.

Several aspects of the user interaction with the specific interfaces seemed notable.

When subjects appeared to understand the tasks, they had relatively little trouble with

the interfaces or other aspects of the study. Some users found that it took time to learn

to use timeboxes. Once these users were comfortable using timeboxes, they often

made positive comments, saying that timeboxes were “nice once I got the feel of it.”

As in the first study (Section 7.1), subjects clearly had difficulty with range sliders

and timeboxes when the ranges covered were relatively small. Similarly, some users

had trouble after creating queries that produced zero hits - some form of the data en-

velope or other overview might have helped with this difficulty. There were a few

instances of confusion between the interfaces. Specifically, some users clicked on a

box associated with a range slider as if it were a timebox.

In terms of user interaction, the primary difference between this study and the

first study is in the presence of multiple timeboxes which could be deleted or lassoed

and dragged for simultaneous modification. However, very few users deleted query

clauses, and users often failed to understand the idea of moving multiple timeboxes

at once. It is not clear if this was due to difficulties in understanding the interface,

insufficient training, or a combination of both factors.

One study participant made two concrete design suggestions that merit consider-

ation for inclusion in future versions of TimeSearcher. Noting the difficulty involved

in interpreting the impact of a single query clause, this subject suggested a “disable”

feature that would temporarily remove a timebox clause from a query. The timebox

would still be displayed in some altered manner, but the displayed result set would

not include the constraints associated with that box. This would provide the user with

a tool that could be used to determine if a given timebox was useful for meeting the

user’s search goal.
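The suggested "disable" feature could be modeled as a flag on each timebox that the query evaluator consults. The sketch below is only an illustration of that idea; the `Timebox` class, its `enabled` field, and `run_query` are hypothetical names and do not describe the TimeSearcher code.

```python
from dataclasses import dataclass


@dataclass
class Timebox:
    x1: int
    x2: int
    y1: float
    y2: float
    enabled: bool = True  # hypothetical flag for the suggested "disable" feature

    def matches(self, series):
        """True if all values at indices x1..x2 (inclusive) lie in [y1, y2]."""
        return all(self.y1 <= v <= self.y2 for v in series[self.x1:self.x2 + 1])


def run_query(dataset, boxes):
    """Return names of items matching every enabled timebox."""
    active = [box for box in boxes if box.enabled]
    return [
        name
        for name, series in dataset.items()
        if all(box.matches(series) for box in active)
    ]


# Hypothetical data and query.
stocks = {"AAA": [10, 12, 30, 32], "BBB": [11, 14, 18, 20]}
low_start = Timebox(0, 1, 5, 20)
high_finish = Timebox(2, 3, 25, 35)

print(run_query(stocks, [low_start, high_finish]))  # ['AAA']
high_finish.enabled = False                         # box stays in the query, but is ignored
print(run_query(stocks, [low_start, high_finish]))  # ['AAA', 'BBB']
```

Comparing the two result sets shows what the disabled box contributes to the query, which is the question the participant wanted to answer.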

The other suggestion was for a feature that would link boxes to have the top of

one timebox aligned with the bottom of another timebox. This would support searches

involving transitions between two value ranges, where the second was defined as being

greater than (or less than) the first.
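A linking feature of this kind might reposition one box relative to another whenever either is moved. The following sketch shows the alignment step only, assuming boxes are `(x1, x2, y1, y2)` tuples and that the linked box keeps its height; the function name `link_above` is hypothetical.

```python
def link_above(first, second):
    """Return `second` moved so that its bottom edge sits on the top edge of `first`.

    Boxes are (x1, x2, y1, y2) tuples; the time extent and height of
    `second` are preserved.
    """
    x1, x2, y1, y2 = second
    top_of_first = first[3]
    return (x1, x2, top_of_first, top_of_first + (y2 - y1))


earlier = (1, 3, 10, 20)   # days 1-3, values 10-20
later = (5, 7, 0, 15)      # days 5-7, values 0-15 before linking
print(link_above(earlier, later))  # (5, 7, 20, 35)
```

A mirrored function that places the second box below the first would cover the "less than" case.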

C.4 Study Materials

C.4.1 Exploratory Task

Find three stocks that are interesting or different.

This task was repeated for each of the three interfaces.

C.4.2 Training Questions

1. Find a stock that had higher prices on days 10-12 than on days 19-21.

2. Find a stock whose prices during days 12-15 were lower than its prices during

days 20-23.

3. Find a stock whose prices on days 16-18 were higher than its prices on days

24-26.

4. Find a stock that had higher prices during days 2-6 than it did during days 8-12

and days 16-20.

5. Find a stock that decreased from days 1-3 to days 7-9, and then increased

to higher values during days 13-15.

6. Find a stock whose price was low during days 5-8, increased to higher values

during days 15-17, and then decreased to a lower range during days 23-25.

The first three training questions are low complexity, and the remaining are

medium complexity.

C.4.3 Experimental Questions

1. Find a stock that had prices during days 5-7 that were lower than its prices during

days 1-3.

2. Find a stock whose prices during days 5-9 were higher than its prices during days

13-17.

3. Find a stock that had prices during days 4-7 that were close to its prices on days

12-15.

4. Find a stock that had prices on days 11-16 that were close to its prices on days

20-25.

5. Find a stock with prices during days 1-5 that are lower than its prices during

days 26-30.

6. Find a stock that had higher prices on days 15-19 than on days 23-27.

7. Find a stock whose price decreased from days 11-13 to days 15-17 and again

from days 15-17 to days 19-21.

8. Find a stock whose price increased from days 6-8 to days 10-12 and then de-

creased from days 10-12 to days 14-16.

9. Find a stock that had higher values during days 13-17 than it did during days 1-5 and

days 26-30.

10. Find a stock that increased from days 13-16 to days 19-22, and then decreased

to lower values during days 25-28.

11. Find a stock that had increases from days 10-12 to days 15-17, and from days

15-17 to days 20-22.

12. Find a stock that had lower values during days 14-17 than it did during days 8-11

and days 20-23.

13. Find a stock whose price during days 5-8 was close to its price during days

10-13, 16-19, and 22-25.

14. Find a stock whose price was high during days 1-3 and decreased to successively

lower values during days 5-7, 9-11, and 13-15.

15. Find a stock whose price increased from days 7-8 to days 10-11, decreased to

lower values during days 13-14, and then increased to higher values during days

16-17.

16. Find stocks that increased gradually, over days 13-15, 17-19, 21-23, and 25-27,

such that the prices in each interval are generally higher than the previous inter-

val.

17. Find a stock whose price decreased from days 3-5 to days 7-9, increased to a

higher range during days 11-13, and then decreased to lower values

during days 15-17.

18. Find a stock whose price decreased from days 15-17 to days 19-21, decreased

again to a new low during days 23-25, and then increased to higher values during

days 27-29.

Questions 1-6 are low complexity, 7-12 are medium complexity, and 13-18 are

high complexity. Each group contains six questions - two repetitions for each of three

interfaces.

After the fourth subject, one repetition was eliminated for each of the three complexity

levels. Questions 4-6, 10-12, and 16-18 were eliminated, leaving nine questions

with one repetition of each interface/complexity combination.

Appendix D

Study Materials for Empirical Evaluation of Input and

Output

D.1 Training Questions

1. Find an item that has a price between $30 and $60 for months 4-7

2. Find an item that trades in a $20 range for at least three consecutive time periods.

D.2 Experimental Questions

1. Find an item that starts low and ends high: its prices during all of the last five

time points should be at least $40 more than the highest prices that it reaches during

the first 5 time periods.

2. Find an item that trades in a $25 range for at least four consecutive measurements

and then has a rise in price of at least $35.

BIBLIOGRAPHY

[1] J. Aach and G. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495–508, 2001.

[2] J. Accot and S. Zhai. Beyond Fitts’ law: Models for trajectory-based HCI tasks. In Proceedings of the 1997 Conference on Human Factors in Computing Systems, pages 295–302, Atlanta GA, April 1997. ACM Press.

[3] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings, Foundations of Data Organization and Algorithms, 4th International Conference, FODO’93, Chicago, Illinois, USA, October 13-15, 1993. Lecture Notes in Computer Science, Vol. 730, pages 69–84, Berlin, 1993. Springer-Verlag.

[4] R. Agrawal, K. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In The VLDB Journal, pages 490–501, 1995.

[5] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proceedings of the 21st International Conference on Very Large Databases, pages 502–514, 1995.

[6] R. Agrawal and R. Srikant. Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings 11th International Conference on Data Engineering, ICDE, pages 3–14, Taipei, Taiwan, March 1995. IEEE Press.

[7] C. Ahlberg and B. Shneiderman. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proceedings of the 1994 Conference on Human Factors in Computing Systems, pages 313–317, Boston MA, April 1994. ACM Press.

[8] C. Ahlberg, C. Williamson, and B. Shneiderman. Dynamic queries for information exploration: An implementation and evaluation. In Proceedings of the 1992 Conference on Human Factors in Computer Systems, pages 619–626, Monterey, CA, May 3-7 1992. ACM Press.

[9] J. F. Allen. Maintaining Knowledge about Temporal Intervals. Communications of the ACM, 26(11):832–843, 1983.

[10] E. H. Baehrecke, N. Dang, K. Barbaria, and B. Shneiderman. Visualization and analysis of microarray and gene ontology data with treemap. In preparation, 2003.

[11] E.H. Baehrecke. Steroid regulation of programmed cell death during Drosophila development. Cell Death and Differentiation, 7:1057–1062, 2000.

[12] E.H. Baehrecke. How death shapes life during development. Nature Reviews Molecular Cell Biology, 3:779–787, October 2002.

[13] E.H. Baehrecke. Personal Communication, 2003.

[14] Z. Bar-Joseph, G. Gerber, D. Gifford, and T. Jaakola. A new approach to analyzing gene expression time series data. In Proc. Sixth Annual International Conference on Research in Computational Molecular Biology, pages 39–48, 2002.

[15] R. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger. The R∗-tree: an efficient and robust access method for points and rectangles. ACM Sigmod, pages 322–331, May 1990.

[16] B. Bederson, J. Grosjean, and J. Meyer. Toolkit design for interactive structured graphic. Technical Report HCIL-2003-01, CS-TR-4432, and UMIACS-TR-2003-03, University of Maryland, Human-Computer Interaction Lab, Department of Computer Science, and Institute for Advanced Computer Studies, 2003.

[17] B. Bederson, J. Meyer, and L. Good. Jazz: An extensible zoomable user interface graphics toolkit in java. In ACM Symposium on User Interface Software and Technology, pages 171–180, San Diego CA, November 2000. ACM Press.

[18] B. Bederson, B. Shneiderman, and M. Wattenberg. Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Computer Graphics, 21(4):833–854, October 2002.

[19] D. J. Berndt and J. Clifford. Finding patterns in time series: A dynamic programming approach. In Advances in Knowledge Discovery and Data Mining, pages 229–248. AAAI Press/MIT Press, 1996.

[20] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21(1):32–38, 1998.

[21] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In C. Beeri and P. Buneman, editors, 7th International Conference on Database Theory (ICDT ’99), number 1540 in Lecture Notes in Computer Science, pages 218–236, Jerusalem, Israel, January 1999. Springer-Verlag.

[22] S. Blackburn. Content Based Retrieval and Navigation Using Melodic Pitch Contours. PhD thesis, University of Southampton, 2000. http://www.ecs.soton.ac.uk/ sgb97r/phdthesis.pdf.

[23] C. Bonhomme, C. Trepied, M.A. Aufaure, and R. Laurini. A visual language for querying spatio-temporal databases. In Proceedings of the 7th International Symposium on Advances in Geographic Information Systems, pages 34–39, Kansas City MO, November 1999. ACM Press.

[24] E. Bradley. Time-series analysis. In M. Berhold and E. Hand, editors, Intelligent Data Analysis: An Introduction. Springer-Verlag, Berlin, 1999.

[25] I. Brewer, A.M. MacEachren, H. Abdo, J. Gundrum, and G. Otto. Collaborative geographic visualization: Enabling shared understanding of environmental processes. In Proceedings, IEEE Symposium on Information Visualization, pages 137–144, Salt Lake City UT, October 2000.

[26] S. K. Card, J. D. Mackinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco CA, 1999.

[27] J. V. Carlis and J. A. Konstan. Interactive visualization of serial periodic data. In ACM Symposium on User Interface Software and Technology, pages 29–38, San Francisco CA, November 1998. ACM Press.

[28] M.S.T. Carpendale, A. Fall, D. J. Cowperthwaite, J. Falland, and F. D. Fracchia. Case study: Visual access for landscape event based temporal data. In VIS ’96: Proceedings of the IEEE Conference on Visualization, pages 425–428, October 1996.

[29] K. Chan and W. Fu. Efficient time series matching by wavelets. In Proceedings 15th International Conference on Data Engineering ICDE, pages 126–133, Sydney Australia, March 1999.

[30] C. Chatfield. The Analysis of Time Series, an Introduction. Chapman and Hall, London, 1996.

[31] E. H. Chi, J. E. Pitkow, J. D. Mackinlay, P. Pirolli, R. Gossweiler, and S. K. Card. Visualizing the evolution of web ecologies. In Proceedings of the 1998 Conference Human Factors in Computing Systems, pages 400–407, Los Angeles CA, April 1998. ACM Press.

[32] J.P. Chin, V.A. Diehl, and K.L. Norman. Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of the 1988 Conference on Human Factors in Computer Systems, pages 213–218. ACM Press, 1988.

[33] J. Chomicki. Temporal query languages: A survey. http://www.cse.buffalo.edu/ chomicki/papers-survey95.ps, 1995.

[34] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, October 23 1998.

[35] E. Clough, C.-Y. Lee, H. Hochheiser, B. Shneiderman, and E.H. Baehrecke. Temporal analyses of genome-wide transcription during steroid-triggered programmed cell death in Drosophila. In preparation, 2003.

[36] S. B. Cousins and M. G. Kahn. The visual display of temporal information. Artificial Intelligence in Medicine, 3(6):341–357, 1991.

[37] G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In Proceedings of the fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 16–22, New York NY, August 1998. AAAI Press.

[38] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.

[39] A. Del Bimbo, E. Vicario, and D. Zingoni. Symbolic description and visual querying of image sequences using spatio-temporal logic. IEEE Transactions on Knowledge and Data Engineering, 7(4):609–622, August 1995.

[40] J. DeRisi, V. Iyer, and P. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 24 October 1997.

[41] M. Derthick and S.F. Roth. Data exploration across temporal contexts. In Proceedings of Intelligent User Interfaces 2000, pages 60–67, New Orleans LA, January 2000. ACM Press.

[42] K. Duca. Personal Communication, October 2002.

[43] W. K. Edwards, T. Igarishi, A. LaMarca, and E.D. Mynatt. A temporal model for multi-level undo and redo. In ACM Symposium on User Interface Software and Technology, pages 31–40, San Diego CA, November 2000. ACM Press.

[44] S. G. Eick and P. J. Lucas. Displaying trace files. Software Practice and Experience, 26(4):399–409, 1996.

[45] Stephen G. Eick and Graham J. Wills. Navigating large networks with hierarchies. In Proc. IEEE Conf. Visualization, pages 204–210, San Jose, CA, October 1993.

[46] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings, National Academies of Science, USA, 95:14863–14868, December 1998.

[47] M. Erwig and M. Schneider. Query-by-trace: Visual predicate specification in spatio-temporal databases. In Proceedings, 5th IFIP Conference on Visual Databases (VDB 5), pages 199–218, 2000.

[48] W.G. Fairbrother, R.F. Yeh, P.A. Sharp, and C.B. Burge. Predictive identification of exonic splicing enhancers in human genes. Science, 297:1007–1013, 9 August 2002.

[49] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 419–429, Minneapolis, Minnesota, May 1994. ACM Press.

[50] E. Freeman and D. Gelernter. Lifestreams: A storage model for personal data. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(1), March 1996.

[51] Christian Freksa. Temporal reasoning based on semi-intervals. Artificial Intelligence, 54(1):199–227, 1992.

[52] Y.-H. Fua, M. Ward, and E. Rundensteiner. Navigating hierarchies with structure-based brushes. In Proceedings, IEEE Symposium on Information Visualization, pages 58–64, San Diego, CA, October 24-29 1999. IEEE Press.

[53] L. Girardin and D. Brodbeck. Interactive visualization of prices and earnings around the globe. In Interactive Posters, IEEE Symposium on Information Visualization 2001, San Diego, CA, October 22-23 2001.

[54] S. Greene, E. Tanin, C. Plaisant, B. Shneiderman, L. Olsen, G. Major, and S. Johns. The end of zero-hit queries: Query previews for NASA’s global change master directory. International Journal on Digital Libraries, 2(2-3):79–90, 1999.

[55] H. Hamadeh and C. A. Afshari. Gene chips and functional genomics. American Scientist, pages 508–515, November/December 2000.

[56] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proceedings of the International Conference on Data Engineering, pages 106–115, Sydney Australia, March 1999.

[57] B. Harrison, R. Owen, and R. Baecker. Timelines: An interactive system for the collection of visualization of temporal data. In Proceedings of Graphics Interface ’94, pages 141–148, Toronto, 1994. Canadian Information Processing Society.

[58] H. Hauser, F. Ledermann, and H. Doleisch. Angular brushing of extended parallel coordinates. In Proceedings, IEEE Symposium on Information Visualization, Boston, MA, October 2002. IEEE Press.

[59] S. Havre, B. Hetzler, and L. Nowell. Themeriver: Visualizing theme changes over time. In Proceedings, IEEE Symposium on Information Visualization, pages 115–124, Salt Lake City UT, October 2000.

[60] S. Hibino and E. Rundensteiner. A visual multimedia query language for temporal analysis of video data. In K.C. Nwosu, B.M. Thuraisingham, and P.B. Berra, editors, Multimedia Database Systems: Design and Implementation Strategies, pages 123–159. Kluwer Academic Publishers, 1996.

[61] S. Hibino and E. Rundensteiner. User interface evaluation of a direct manipulation temporal visual query language. In Multimedia ’97, pages 99–107, Seattle WA, November 1997. Association for Computing Machinery.

[62] S. Hibino and E. Rundensteiner. Comparing MMVIS to a timeline for temporal trend analysis of video data. In Proceedings of Advanced Visual Interfaces 1998, pages 195–204. Association for Computing Machinery, May 1998.

[63] H. Hochheiser and B. Shneiderman. Range specifications for an interactive visual query tool for time series data. Unpublished Manuscript, University of Maryland, Department of Computer Science, March 2001.

[64] H. Hochheiser and B. Shneiderman. Using interactive visualizations of www log data to characterize access patterns and inform site design. Journal of the American Society for Information Systems, 52(4):331–343, February 2001.

[65] N.S. Holter, N. Mitra, A. Maritan, M. Cieplak, J. Banavar, and N. Federoff. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. National Academy of Sciences USA, 97(15):8409–8414, 18 July 2000.

[66] Y. Huang and P.S. Yu. Adaptive query processing for time-series data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 282–286, San Diego CA, August 1999. ACM Press.

[67] A. Inselberg. Multidimensional detective. In Proceedings, IEEE Symposium on Information Visualization, pages 100–107, Phoenix AZ, October 1997.

[68] A. Inselberg and T. Avidan. Classification and visualization for high-dimensional data. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery in Data 2000, pages 370–374, Boston, MA, 2000. ACM Press.

[69] H. V. Jagadish, N. Koudas, and S. Muthukrishnan. Mining deviants in a time series database. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, Proceedings of VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, pages 102–113, Edinburgh Scotland, September 1999. Morgan Kaufmann.

[70] V. Jain and B. Shneiderman. Data structures for dynamic queries: an analytical and experimental evaluation. In Proc. of the Workshop in Advanced Visual Interfaces, AVI 94, pages 1–11, Bari, Italy, June 1-4 1994. ACM Press.

[71] C.S. Jensen and R.T. Snodgrass. Temporal data management. IEEE Transactions on Knowledge and Data Management, 11(1):36–43, January/February 1999.

[72] C. Jiang, E.H. Baehrecke, and C. Thummel. Steroid regulated programmed cell death during Drosophila metamorphosis. Development, 124:4673–4683, 1997.

[73] D. A. Keim. Pixel-oriented visualizations techniques for exploring very large databases. Journal of Computational and Statistical Graphics, pages 58–77, March 1996.

[74] E. J. Keogh. Exact indexing of dynamic time warping. In Proc. VLDB 2002, pages 406–417, Hong Kong, China, 2002.

[75] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings SIGMOD 2001, pages 151–162, Santa Barbara CA, May 2001. ACM Press.

[76] E. J. Keogh, K. Chakrabarti, M.J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263–286, 2001.

[77] E. J. Keogh, H. Hochheiser, and B. Shneiderman. An augmented visual query mechanism for finding patterns in time series data. In Proc. Fifth International Conference on Flexible Query Answering Systems, Lecture Notes in Artificial Intelligence, pages 240–250, Copenhagen, Denmark, 27-29 October 2002. Springer-Verlag.

[78] E. J. Keogh and M. J. Pazzani. Relevance feedback retrieval of time series data. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR ’99, pages 183–190, Berkeley CA, August 1999. ACM.

[79] E. J. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Proceedings of the third conference on Knowledge Discovery in Databases and Data Mining (KDD-97), pages 24–30, Newport Beach CA, August 1997. AAAI Press.

[80] T. Koetter and M. Theus. Fortune - a system for interactive graphics for time series. http://www.vr-web.de/martin.theus/Fortune JCGS.pdf.

[81] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 289–300, Tucson AZ, May 1997. ACM Press.

[82] V. Kouramajian and M. Gertz. A graphical query language for temporal databases. In M.P. Papazoglou, editor, OOER ’95: Object-Oriented and Entity-Relationship Modeling, volume 1021 of Lecture Notes in Computer Science, pages 388–399. Springer-Verlag, Berlin, 1995.

[83] C.-Y. Lee, E. Clough, P. Yellon, T. Teslovich, D. Stephan, and E.H. Baehrecke. Genome-wide analyses of steroid- and radiation-triggered programmed cell death in Drosophila. Current Biology, 2003.

[84] C-Y Lee, D. Wendel, P. Reid, G. Lam, C. Thummel, and E.H. Baehrecke. E93 directs steroid-triggered programmed cell death in Drosophila. Molecular Cell, 6:433–443, August 2000.

[85] J. Lin, E. J. Keogh, S. Lonardi, and P. Patel. Finding motifs in time series. In Proc. SIGKDD ’02, pages 53–68, Edmonton, Alberta Canada, July 23-26 2002. ACM Press.

[86] L. Lin, T. Risch, M. Skold, and D. Badal. Indexing values of time sequences. In Proc. 5th International Conference on Information and Knowledge Management (CIKM ’96), pages 223–232, Rockville, Maryland, November 12-16 1996.

[87] J.B. Little and L. Rhodes. Understanding Wall Street. Liberty Publishing, Inc., Cockeysville MD, 1978.

[88] J. D. Mackinlay, G. G. Robertson, and R. DeLine. Developing calendar visualizers for the information visualizer. In ACM Symposium on User Interface Software and Technology, pages 109–118, New York, 1994. ACM Press.

[89] A. Martin and M. Ward. High dimensional brushing for interactive exploration of multivariate data. In Proceedings of the 6th IEEE Visualization Conference, pages 271–278, Atlanta, Georgia, October 29 - November 3 1995. IEEE Press.

[90] J. P. Morrill. Distributed recognition of patterns in time series data. Communications of the ACM, 45(5):45–51, May 1998.

[91] S. Mount. Personal Communication, 2003.

[92] S. M. Mount, C. Burks, G. Hertz, G.D. Stormo, O. White, and C. Fields. Splicing signals in Drosophila: intron size, information content, and consensus sequences. Nucleic Acids Research, 20(16):4255–4262, 1992.

[93] A. Nanopoulos and Y. Manolopoulos. Indexing time-series databases for inverse queries. In G. Quirchmayr and Trevor J.M. Bench-Capon, editors, Proceedings 9th International Conference, Database and Expert Systems Applications (DEXA), volume 1460 of Lecture Notes in Computer Science, pages 551–560. Springer, August 24-28 1998.

[94] C. North and B. Shneiderman. Snap-together visualization: A user interface for coordinating visualizations via relational schemata. In ACM Advanced Visual Interfaces 2000, pages 128–135. ACM Press, 2000.

[95] C. North and B. Shneiderman. Snap-together visualization: Evaluating coordination usage and construction. International Journal of Human-Computer Studies, 53(5):715–739, November 2000.

[96] A. Oberweis and V. Sanger. GTL - A Graphical Language for Temporal Data. In Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, pages 22–31, Charlottesville VA, 1994. IEEE Computer Society Press.

[97] U. Ohler and H. Niemann. Identification and analysis of eukaryotic promoters: Recent computational approaches. Trends in Genetics, 17(2):56–60, February 2001.

[98] C. Perng, H. Wang, S. R. Zhang, and D. Stott Parker. Landmarks: a new model for similarity-based pattern querying in time series databases. In Proc. International Conference on Data Engineering, pages 33–42, San Diego CA, February 28 - March 3 2000.

[99] C. Plaisant, R. Mushlin, A. Snyder, J. Li, D. Heller, and B. Shneiderman. Lifelines: Using visualization to enhance navigation and analysis of patient records. In 1998 American Medical Informatic Association Annual Fall Symposium, pages 76–80, Orlando FL, November 1998. AMIA.

[100] C. Plaisant, A. Rose, G. Rubloff, R. Salter, and B. Shneiderman. The design of history mechanisms and their use in collaborative educational simulations. In Proceedings of the Computer Support for Collaborative Learning, CSCL ’99, pages 348–359, Palo Alto CA, 1999. ACM Press.

[101] R. J. Povinelli. Identifying temporal patterns for characterization and prediction of financial time series events. In Temporal, Spatial and Spatio-Temporal Data Mining: First International Workshop (TSDM2000), pages 46–61, Lyon France, 2000.

[102] S. Powsner and E. Tufte. Graphical summary of patient status. The Lancet, 344:386–389, 1994.

[103] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668–676, 1990.

[104] D. Rafiei and A. Mendelzon. Querying time series data based on similarity. IEEE Transactions on Knowledge and Data Engineering, 12(5):675–693, September/October 2000.

[105] J. Rekimoto. Time-machine computing: A time-centric approach for the information environment. In ACM Symposium on User Interface Software and Technology, pages 45–54, Asheville NC, November 1999. ACM Press.

[106] R. Ribler, A. Mathur, and M. Abrams. Visualizing and modeling categorical time series data. In Symposium on Visualizing Time-Varying Data. ICASE and NASA/LaRC, September 1995.

[107] W. G. Roth. MIMSY: A system for analyzing time series data in the stock market domain. Master’s thesis, University of Wisconsin, Department of Computer Science, 1993.

[108] R. Sadri, C. Zaniolo, A. Zarkesh, and J. Adibi. Optimization of sequence queries in database systems. In Proceedings of Principles of Database Systems 2001, pages 71–81, Santa Barbara CA, May 2001.

[109] S.L. Salzberg. Personal Communication, 2002.

[110] P. M. Sanderson, M. D. McNeese, and B. S. Zaff. Handling complex real-world data with two cognitive engineering tools: Cogent and macshapa. Behavior Research Methods, Instruments, and Computers, 26(2):117–124, 1994.

[111] J. Seo and B. Shneiderman. Understanding hierarchical clustering results by interactive exploration of dendrograms: A case study with genetic microarray data. IEEE Computer, 35(7):80–86, July 2002.

[112] P. Seshadri, M. Livny, and R. Ramakrishnan. Sequence query processing. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 430–441, Minneapolis Minnesota, May 1994. ACM Press.

[113] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In Proceedings of the 11th International Conference on Data Engineering (ICDE), pages 232–239, Taipei, Taiwan, 1995.

[114] P. Seshadri, M. Livny, and R. Ramakrishnan. The design and implementation of a sequence database system. In VLDB’96, Proceedings of 22nd International Conference on Very Large Data Bases, pages 99–110, Mumbai India, September 1996.

[115] U. Shaft, J. Goldstein, and K. Beyer. Nearest neighbors query performance for unstable distributions. Technical Report TR1388, Computer Sciences Department, University of Wisconsin, October 1998.

[116] B. Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics, 11(1):92–99, January 1992.

[117] B. Shneiderman. Dynamic queries for visual information seeking. IEEE Software, 11(6):70–77, 1994.

[118] B. Shneiderman. Inventing discovery tools: Combining information visualization with data mining. In Proceedings, Discovery Science 2001, pages 17–28, Washington DC, 2001. Springer-Verlag.

[119] B. Shneiderman, D. Feldman, A. Rose, and X. Ferre Grau. Visualizing digital library search results with categorical and hierarchical axes. In Proc. 5th ACM International Conference on Digital Libraries, pages 57–66, San Antonio, TX, June 2-7 2000. ACM Press.

[120] W.M. Siebert. Circuits, Signals, and Systems. MIT Press, Cambridge, MA, 1986.

[121] S. F. Silva, U. Schiel, and T. Catarci. Visual query operators for temporal databases. In Proc. of the 4th Int. Workshop on Temporal Representation and Reasoning (TIME), pages 46–53, Daytona Beach FL, May 1997.

[122] S.F. Silva and T. Catarci. Homogeneous access to temporal data and interaction histories in visual interfaces for databases. In Proceedings of the Workshop on User Interfaces to Data Intensive Systems (UIDIS’99), pages 108–117, Edinburgh Scotland, September 1999. IEEE Computer Society.

[123] S.F. Silva and T. Catarci. Visualization of linear time-oriented data: a survey. In Proceedings of the first International Conference on Web Information Systems Engineering, Hong Kong, June 2000. IEEE Computer Society.

[124] S.F. Silva, T. Catarci, and U. Schiel. A “graphical notebook” as interaction metaphor for querying databases. In Anais do XIV Simposio Brasileiro de Banco de Dados (SBBD’99), Florianopolis SC Brazil, October 1999. Sociedade Brasileira de Computacao.

[125] C.G. Simpson, G. Thow, G.P. Clark, S.N. Jennings, J.A. Watters, and J.W.S. Brown. Mutational analysis of a plant branchpoint and polypyrimidine tract required for constitutive splicing of a mini-exon. RNA, 8:47–56, January 2002.

[126] R. T. Snodgrass, I. Ahn, G. Ariav, D. S. Batory, J. Clifford, C. E. Dyreson, R. Elmasri, F. Grandi, C. S. Jensen, W. Kafer, N. Kline, K. G. Kulkarni, T. Y. Cliff Leung, N. A. Lorentzos, J. F. Roddick, A. Segev, M. D. Soo, and S. M. Sripada. A TSQL2 tutorial. SIGMOD Record, 23(3):27–33, September 1994.

[127] Spotfire. http://www.spotfire.com.

[128] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic differentiation. Proc. National Academy of Sciences USA, 96:2907–2912, March 1999.

[129] E. Tanin, R. Beigel, and B. Shneiderman. Design and evaluation of incremental data structures and algorithms for dynamic query interfaces. In Proceedings of Visualization ’97, pages 81–86. IEEE Press, 1997.

[130] The FlyBase Consortium. The flybase database of the drosophila genome projects and community literature. Nucleic Acids Research, 31(1):172–175, 2003.

[131] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, May 2000.

[132] B. Theodoulidis, P. Papapanagiotou, and V. Pappas-Katsiafas. Interactive querying and visualisation in temporal databases (abstract). In K. Ong, S. Conrad, and T.W. Ling, editors, Knowledge Discovery and Temporal Reasoning in Deductive and Object-Oriented Databases, Proceedings of the DOOD’95, pages 91–93, Singapore, 1995.

[133] E. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire CT, 1983.

[134] L. Tweedie, B. Spence, H. Dawkes, and H. Su. The influence explorer (video) - a tool for design. In Proceedings of the 1996 conference companion on Human factors in computing systems, pages 390–391, Vancouver, British Columbia, April 13-18 1996. ACM Press.

[135] A. Unwin. Analysing real time series? CTI Math & Stats Newsletter, 4:8–10, 1998.

[136] A. Unwin and G. Willis. Exploring time series graphically. Statistical Computing and Graphics Newsletters, 2:13–15, 1999.

[137] J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827–842, 1998.

[138] J.J. van Wijk and E.R. van Selow. Cluster and calendar based visualization of time series data. In Proceedings, IEEE Symposium on Information Visualization, pages 4–9, San Francisco, CA, October 1999.

[139] R. Villafane, K. A. Hua, D. Tran, and B. Maulik. Mining interval time series. In Proceedings of the first International Conference on Data Warehousing and Knowledge Discovery, pages 318–330, 1999.

[140] J.D. Watson, N.H. Hopkins, J.W. Roberts, J.A. Steitz, and A.M. Weiner. Molecular Biology of the Gene. The Benjamin/Cummings Publishing Company, Inc, Menlo Park, CA, 4 edition, 1987.

[141] M. Wattenberg. Sketching a graph to query a time series database. In Proceedings of the 2001 Conference Human Factors in Computing Systems, Extended Abstracts, pages 381–382, Seattle WA, March 31-April 5 2001. ACM Press.

[142] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 194–205, 1998.

[143] K. P. White, S.A. Rifkin, P. Hurban, and D. Hogness. Microarray analysis of Drosophila development during metamorphosis. Science, 286:2179–2184, 10 December 1999.

[144] S. Winkler. http://stats.math.uni-augsburg.de/CASSATT/index.html.

[145] P. C. Wong, W. Cowley, H. Foote, E. Jurrus, and J. Thomas. Visualizing sequential patterns for text mining. In Proceedings, IEEE Symposium on Information Visualization, pages 105–114, Salt Lake City UT, October 2000.

[146] B. B. Xia. Similarity search in time series data sets. Master’s thesis, Simon Fraser University, Computing Science, 1997.

[147] R. Xiong and J. S. Donath. Peoplegarden: Creating data portraits for users. In ACM Symposium on User Interface Software and Technology, pages 37–44, Asheville NC, November 1999.

[148] XmdvTool. Xmdvtool home page: Case studies. http://davis.wpi.edu/ xmdv/cs fin.html.

[149] B. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, pages 385–394, Cairo Egypt, September 2000. Morgan Kaufmann.

[150] B. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proceedings of the Fourteenth International Conference on Data Engineering, pages 201–208, Orlando FL, February 1998. IEEE Computer Society.

[151] B. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In Proceedings 16th International Conference on Data Engineering ICDE, pages 13–22, 2000.

[152] D. Young and B. Shneiderman. A graphical filter/flow representation of boolean queries: a prototype implementation and evaluation. Journal of American Society for Information Science, 44(6):327–339, July 1993.

[153] Y. Zhu and D. Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings VLDB 2002, Hong Kong, August 20–23 2002.
