ABSTRACT
Title of Dissertation: INTERACTIVE GRAPHICAL QUERYING OF
TIME SERIES AND LINEAR SEQUENCE DATA SETS
Harry Hochheiser, Doctor of Philosophy, 2003
Dissertation directed by: Professor Ben Shneiderman
Department of Computer Science
Numerous analytic domains involve the study of measurable quantities that change
over time. This widespread interest in time series data sets has led to substantial work
in algorithmic strategies for querying and indexing data. Much less work has been
done in the development of interactive tools for identifying patterns in these data sets.
This dissertation uses a graphical mechanism for specifying queries on time series
data to provide the basis for an exploration of the algorithmic and semantic issues
surrounding interactive querying of time series data. Contributions of this dissertation
include:
• The definition of timeboxes - rectangular widgets that can be used in direct-
manipulation Graphical User Interfaces (GUIs) to specify query constraints on
time series data sets. Timeboxes are used to simultaneously specify two sets of
constraints: given a set of N time series profiles, a timebox covering time periods
x1 . . . x2 (x1 ≤ x2) and values y1 . . . y2 (y1 ≤ y2) will retrieve only those n ∈ N that
have values y1 ≤ y ≤ y2 during all times x1 ≤ x ≤ x2.
• The TimeSearcher information visualization tool, which is based on the time-
box query model. TimeSearcher’s object-oriented architecture can easily be
extended to support variants of the timebox model that provide additional ex-
pressive power.
• The design and implementation of query models and widgets that extend the
timebox model, including variable-time timeboxes (VTTs), angular queries,
leaders & laggards queries, multiple search attributes, and query inversion.
• Analysis of algorithmic issues: A comparison of multiple alternative search al-
gorithms found that simple sequential scans outperformed geometric indices for
processing timebox queries.
• Empirical evaluation of timeboxes: Two empirical studies, each with 12 sub-
jects, provided preliminary insight into the utility of timeboxes and led to design
improvements for input and display.
• Validation through case studies: TimeSearcher has been used by molecular bi-
ologists to explore gene expression data and nucleotide frequencies. This work
has validated the utility of the tool and identified design suggestions and oppor-
tunities.
• A framework for extending the timebox model, including the description of nu-
merous possible extensions.
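The timebox constraint defined above can be illustrated with a minimal sketch. It is not TimeSearcher's implementation: the names `Timebox`, `satisfies`, and `timebox_query` are hypothetical, and the sketch assumes that each time series is a list of values sampled at integer time indices and that the timebox lies within the sampled range. Each item is checked with a simple sequential scan that stops at the first value falling outside the box.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: (x1, x2, y1, y2), all bounds inclusive.
Timebox = Tuple[int, int, float, float]


def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """Return True if y1 <= y <= y2 at every time x1 <= x <= x2."""
    x1, x2, y1, y2 = box
    if x1 > x2 or y1 > y2:
        raise ValueError("timebox requires x1 <= x2 and y1 <= y2")
    # Sequential scan with early exit: stop at the first violating value.
    for x in range(x1, x2 + 1):
        if not (y1 <= series[x] <= y2):
            return False
    return True


def timebox_query(data: Dict[str, Sequence[float]], box: Timebox) -> List[str]:
    """Return the names of all items whose series satisfy the timebox."""
    return [name for name, series in data.items() if satisfies(series, box)]


if __name__ == "__main__":
    # Illustrative data only; not taken from any data set in the dissertation.
    profiles = {"a": [1, 5, 6, 7, 2], "b": [1, 5, 9, 7, 2]}
    print(timebox_query(profiles, (1, 3, 4, 8)))  # prints ['a']
```

Item "b" is rejected as soon as its value at time 2 is found to lie outside the range, so the remaining time point in the box is never examined.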
INTERACTIVE GRAPHICAL QUERYING OF
TIME SERIES AND LINEAR SEQUENCE DATA SETS
by
Harry Hochheiser
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2003
Advisory Committee:
Professor Ben Shneiderman, Chair and Advisor
Associate Professor Eric Baehrecke
Assistant Professor Ben Bederson
Professor Bruce Golden
Professor David Mount
Professor Stephen Mount
© Copyright by
Harry Hochheiser
2003
DEDICATION
To Judy
ACKNOWLEDGEMENTS
Working with Ben Shneiderman has been a truly wonderful experience. I’ve
learned a great deal with Ben, both about how to do research and about how to be
a researcher. I’m particularly grateful for Ben’s support of my “extra-curricular” ac-
tivities, and his awareness that Computer Science research does not take place in a
vacuum.
This research has benefited enormously from the input of several faculty members
who acted as collaborators and members of my committee. Ben Bederson provided
invaluable advice regarding implementation and evaluation, along with a different
perspective on Information Visualization. Eric Baehrecke was an enthusiastic supporter
and early user of TimeSearcher. Steve Mount has also provided valuable guidance.
Thanks to both Eric and Steve for patiently answering my repeated questions about
their research. David Mount’s exemplary teaching helped me build the foundation
necessary for thinking about the algorithmic analysis of this work, and his comments
in these areas have been most helpful. As an outsider, Bruce Golden has provided a
useful perspective.
Jesse Grosjean and Lance Good provided invaluable help with implementation is-
sues relating to Jazz and Piccolo.
It was a pleasure collaborating with Eamonn Keogh on variable-time timeboxes.
Along with Ben S. and Ben B., Allison Druin, Catherine Plaisant, and Francois
Guimbretiere have made the Human-Computer Interaction Lab a wonderful place to
work. Anne Rose deserves thanks for cheerfully putting up with my constant com-
plaining. Egemen Tanin, Jaime Montemayor, Juan Pablo Hourcade, Hilary Browne
Hutchison, Jinwook Seo, Hyunmo Kang, Gene Chipman, and other HCIL students
have provided a supportive and engaging working environment. As librarian for
the CS department, Jordan Landes was a constant and reliable source of assistance and
good cheer.
Other colleagues from outside the University of Maryland have provided useful
feedback and encouragement. Special thanks to Karen Duca, Chris North, Eric Hoff-
man, Clare-Marie and John Karat, and Mary Czerwinski. Batya Friedman deserves
special thanks for suggesting Ben Shneiderman as a good research mentor.
The bulk of this work was supported by the AOL Fellowship in Human-Computer
Interaction. AOL was generous enough to provide this support with no strings at-
tached. AOL colleagues Amy Hale, Clayton Lewis, and Arkady Pogostkin have been
supportive and helpful throughout.
Finally, thanks to my extended family: Dave, Kellie, Herb, Eleanore, Toby, and
Michael. My daughter Elena isn’t old enough to know it yet, but her smiles have been
enormously helpful in overcoming thesis-related anxiety. I can’t say enough about my
wife Judy - this work literally would not have happened without her.
TABLE OF CONTENTS
List of Tables xii
List of Figures xiv
1 Introduction 1
2 Related Work 5
2.1 Visualizations and Interactive Systems . . . . . . . . . . . . . . . . . 5
2.1.1 Time Series Data: Visualizations . . . . . . . . . . . . . . . . 6
2.1.2 Temporal Data: Visualizations . . . . . . . . . . . . . . . . . 11
2.1.3 Time Series Data: Querying . . . . . . . . . . . . . . . . . . 13
2.1.4 Temporal Data: Querying . . . . . . . . . . . . . . . . . . . 17
2.1.5 Parallel Coordinates . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 Similarity Searching . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Inverse Queries . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Query Specification . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.5 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Timeboxes: Interactive Temporal Query Widgets 29
3.1 Anyof Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Timeboxes in the Context of Information Visualization Research . . . 36
4 TimeSearcher 41
4.1 Overviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Leaders & Laggards . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Multiple Time-Varying Attributes . . . . . . . . . . . . . . . . . . . 52
4.4 Query Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Anyof Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Angular Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.9 Other Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 TimeSearcher Implementation 71
5.1 A Tour of the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Input File Format . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.2 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Loading a Data File . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Piccolo Windows . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Interaction Handlers: Creation and Modification of Queries . 80
5.3.3 Display Techniques . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.4 The transition from Jazz to Piccolo . . . . . . . . . . . . . . 85
5.4 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5 Extending Timeboxes . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Search Algorithms 95
6.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 Sequential Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Sequential Search for Timebox Extensions . . . . . . . . . . . . . . . 100
6.3.1 Variable Time Timeboxes . . . . . . . . . . . . . . . . . . . 100
6.3.2 Angular Queries . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Geometric Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.1 Orthogonal Range Trees . . . . . . . . . . . . . . . . . . . . 104
6.4.2 Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.5.3 Sequential scans vs. Geometric Indices . . . . . . . . . . . . 118
6.5.4 Theoretical worst-case analyses . . . . . . . . . . . . . . . . 124
6.5.5 Further Examination of Sequential Algorithms . . . . . . . . 126
6.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.6 Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7 Empirical Evaluations 136
7.1 Evaluation of Input Mechanisms for Questions of Varying Complexity 137
7.1.1 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.1.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.1.3 Task Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.4 Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.1.5 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.1.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.1.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.2 Empirical Evaluation of Input and Output for Exploratory Tasks . . . 157
7.2.1 Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.2.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2.3 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.2.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.3 Conclusion & Future Steps . . . . . . . . . . . . . . . . . . . . . . . 170
8 Applications 173
8.1 DNA Microarray Data Set Analysis . . . . . . . . . . . . . . . . . . 173
8.1.1 Programmed Cell Death in Drosophila melanogaster . . . . . 176
8.1.2 Viral Life Cycle in Epithelial Cells . . . . . . . . . . . . . . . 188
8.2 Nucleotide Sequence Data . . . . . . . . . . . . . . . . . . . . . . . 191
8.2.1 Branch Site Consensus Splicing Signal in Arabidopsis thaliana 192
8.2.2 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.2.3 Contributions and Design Suggestions . . . . . . . . . . . . . 198
8.3 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9 Query Expressiveness 203
9.1 Example Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
9.1.1 Fixed-Time, Fixed-Value, and logical combinations thereof . . 205
9.1.2 Variable Time and/or Value . . . . . . . . . . . . . . . . . . . 205
9.1.3 Open-Ended Time and/or Value . . . . . . . . . . . . . . . . 206
9.1.4 Relative Time/Value . . . . . . . . . . . . . . . . . . . . . . 207
9.1.5 Interval Trending . . . . . . . . . . . . . . . . . . . . . . . . 208
9.1.6 Maximal Periods . . . . . . . . . . . . . . . . . . . . . . . . 209
9.1.7 Aggregate Functions . . . . . . . . . . . . . . . . . . . . . . 209
9.1.8 Similarity to a Known Item . . . . . . . . . . . . . . . . . . . 210
9.1.9 Global Constraint . . . . . . . . . . . . . . . . . . . . . . . . 210
9.1.10 Inter-item queries: Leaders & Laggards . . . . . . . . . . . . 211
9.1.11 Prevailing Trends . . . . . . . . . . . . . . . . . . . . . . . . 212
9.1.12 More general queries . . . . . . . . . . . . . . . . . . . . . . 213
9.2 Query Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.1 Range Events . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.2.2 Transition Events . . . . . . . . . . . . . . . . . . . . . . . . 216
9.2.3 Inter-item Queries . . . . . . . . . . . . . . . . . . . . . . . 217
9.2.4 Other Logical Operators: Disjunctions and Negations . . . . . 217
9.2.5 More General Queries . . . . . . . . . . . . . . . . . . . . . 218
9.3 Towards A Formal Query Model . . . . . . . . . . . . . . . . . . . . 218
9.3.1 Time Series Data Set . . . . . . . . . . . . . . . . . . . . . . 219
9.3.2 Range Events . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.3.3 Logical Combinations . . . . . . . . . . . . . . . . . . . . . 220
9.3.4 Variable Timeboxes . . . . . . . . . . . . . . . . . . . . . . . 221
9.3.5 Relative Timeboxes . . . . . . . . . . . . . . . . . . . . . . . 221
9.3.6 Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.3.7 Global Constraints . . . . . . . . . . . . . . . . . . . . . . . 222
9.3.8 Inter-item Queries . . . . . . . . . . . . . . . . . . . . . . . 223
9.3.9 Open Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.4 Implementing the Extended Queries . . . . . . . . . . . . . . . . . . 224
9.5 User Needs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.6 Subsequence Queries: Beyond Full-Sequence Matches . . . . . . . . 229
10 Future Work 231
10.1 Further Development of TimeSearcher . . . . . . . . . . . . . . . . . 231
10.1.1 Re-Implementation . . . . . . . . . . . . . . . . . . . . . . . 231
10.1.2 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.1.3 Domain Customization . . . . . . . . . . . . . . . . . . . . . 234
10.1.4 Multiple Time-Varying Attributes . . . . . . . . . . . . . . . 236
10.1.5 Additional Functionality . . . . . . . . . . . . . . . . . . . . 236
10.2 Further Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.3 Other Types of Time-oriented Data . . . . . . . . . . . . . . . . . . . 239
10.3.1 Categorical or Nominal Data . . . . . . . . . . . . . . . . . . 239
10.3.2 Temporal Data . . . . . . . . . . . . . . . . . . . . . . . . . 243
11 Conclusions 245
A A Sample TimeSearcher Data File 248
B Study Materials for Evaluation of Input Mechanisms for Questions of
Varying Complexity 250
B.1 Exploratory Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
B.2 Training Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
B.3 Experimental Questions . . . . . . . . . . . . . . . . . . . . . . . . . 251
B.4 User Interface Satisfaction Questionnaire . . . . . . . . . . . . . . . 253
C Empirical Evaluation of Multiple-Constraint Query Formation 255
C.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
C.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
C.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
C.4 Study Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
C.4.1 Exploratory Task . . . . . . . . . . . . . . . . . . . . . . . . 264
C.4.2 Training Questions . . . . . . . . . . . . . . . . . . . . . . . 264
C.4.3 Experimental Questions . . . . . . . . . . . . . . . . . . . . 265
D Study Materials for Empirical Evaluation of Input and Output 268
D.1 Training Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
D.2 Experimental Questions . . . . . . . . . . . . . . . . . . . . . . . . . 269
Bibliography 269
LIST OF TABLES
5.1 Raw performance data. . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 Portion of query processing time spent on updating display, for sample
queries on some data sets. All times are in ms. . . . . . . . . . . . . . 94
6.1 Data sets used in algorithm evaluation. . . . . . . . . . . . . . . . . . 109
6.2 Query Operations in each block. . . . . . . . . . . . . . . . . . . . . 110
6.3 Average times (ms) across all operations for data sets with 100 time
points and 100, 1000, 10000, and 50000 items. . . . . . . . . . . . . 112
6.4 Average times (ms) across all operations for data sets with 100 items
and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . . . . 113
6.5 Average times (ms) for the data set with 1000 items and 1000 time
points, with results for both 100 items and 1000 time points and 100
time points and 1000 items given for context. . . . . . . . . . . . . . 116
6.6 Comparison of number of values checked versus possible number of
checks for sequential search of data sets with 100 time points and 100,
1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.7 Comparison of number of values checked versus possible number of
checks for sequential search of data sets with 100 items and 100, 1000,
and 10000 time points . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.8 Comparison of number of values checked versus possible number of
checks for sequential search of data sets with 100 time points and 100,
1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.9 Comparison of number of values checked versus possible number of
checks for sequential search of data sets with 100 items and 100, 1000,
and 10000 time points . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.1 User preferences by interface for the different task types. . . . . . . . 148
C.1 User preferences by interface for the different task types. . . . . . . . 260
LIST OF FIGURES
1.1 Patterns of interest in stock trend analysis [87]. . . . . . . . . . . . . 3
2.1 A spiral visualization of the consumption of Baphia capparidifolia by
chimpanzees in Tanzania during 1980-1988. Each lap represents one
year, and each spoke one month. The area of each blot is proportional
to the observed consumption during that month of the given year. To
see how consumption varied during a given year, users can move along
a given lap of the spiral. To compare consumption in a given month
across years, users examine blots along the same spoke [27]. . . . . . 7
2.2 A Diamond Fast display showing a zoomed image of two overlaid 10-
year periods [135]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 A ThemeRiver visualization of news items regarding Fidel Castro,
from November 1959 through June 1961. Each band in the river in-
dicates a separate topic, with the thickness of the band indicating the
number of stories on that topic [59]. . . . . . . . . . . . . . . . . . . 9
2.4 A TimeTube, with four DiskTrees showing the evolution of the web
site over time [31]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 A LifeLines display of a patient's medical records [99]. . . . . . . . . 12
2.6 Circular query controls for filtering cyclic data [25]. . . . . . . . . . 14
2.7 The Patterns visual query language, specifying a sequence involving
one of four alternative transitions followed by a single required transi-
tion [90]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 The MMVIS query window [60]. . . . . . . . . . . . . . . . . . . . . 17
2.9 A sample parallel coordinates visualization involving four dimensions
from a database describing automobiles [58]. . . . . . . . . . . . . . 20
3.1 A graph overview, formed by superimposing the time series for all of
the items in the data set. . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 A single timebox query, for items between $70 and $190 during weeks
1-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 A refinement of the query in Figure 3.2. . . . . . . . . . . . . . . . . 32
3.4 A complex query containing three timeboxes. . . . . . . . . . . . . . 33
3.5 A variable time timebox, specifying that for at least R consecutive time
periods between x1 and x2, items must have values in the range y1 ≤
y ≤ y2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 The Influence Explorer: Range Sliders on the “brightness” and “working
life” dimensions select the ranges of interest. Histograms with
each variable indicate the number of items having various values of
that variable, and lines between histograms indicate the values of a
selected item [134]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 XmdvTool [89]: The highlighted items have been selected by “brush-
ing”. Once the brush is created, the highlighted areas on any given axis
can be moved or resized [148]. . . . . . . . . . . . . . . . . . . . . . 38
3.8 Explicit range sliders in CityOScope’s parallel coordinates display.
Arrows at the top and bottom of each axis can be used to limit the range
of interest [53]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.9 Two dimensional query widgets: (a) A point query indicating an exact
number of bedrooms and cost of a home. (b) A range of number of
bedrooms and cost [117]. . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 The TimeSearcher application window. Clockwise from upper-left:
query space (with data envelope, query envelope, and graph overview),
details-on-demand, item list, range sliders for query adjustment, and
data items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2 Partial results from a timebox query, with time points that match the
query highlighted. Items in the result set differ in the points that match
the query, indicating an anyof or variable time timebox. . . . . . . . . 43
4.3 Drag-and-drop query-by-example, with results. . . . . . . . . . . . . 45
4.4 Query window with data envelope. . . . . . . . . . . . . . . . . . . . 47
4.5 Query display with data and query envelopes. . . . . . . . . . . . . . 47
4.6 The query window displaying a “leaders & laggards” query. The top
window shows leaders, with the original query in magenta providing
a reference that can be used for comparison. The leaders window also
includes a label indicating that the leaders are being shown, along with
the name of the attribute being used for the leader query. The record
count at the bottom of this window also indicates that the items shown
are leaders. The bottom window - the “laggards” display - shows the
original query in outline, and has new timeboxes representing the new
query, which is defined by shifting the old query one time period to
the right. The count label below this window indicates that the items
shown are laggards. . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Leaders & Laggards: The top-left window is the leader window, and
the laggard window is directly below it. . . . . . . . . . . . . . . . . 51
4.8 TimeSearcher with a data set involving multiple time-varying at-
tributes. Two panes have been created - for the “low” and the “high”
values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.9 The data items in the result set with two variables displayed. The
profiles are taken from yeast microarray data, with absolute log ratio
and log ratio values shown for seven time points [40]. . . . . . . . . . 54
4.10 Updated query envelopes for one of two attributes that are currently
active. Note that even though there are no queries in this window,
queries in the inactive window (for “Low” measurements) have con-
strained the data set, as shown by the query envelope. . . . . . . . . . 54
4.11 A summary window for a query over two attributes. . . . . . . . . . . 56
4.12 Query Inversion: The original query (top) and the inverted query (bot-
tom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.13 Anyof timeboxes: The display on the top shows a query consisting
of two timeboxes. In the bottom display, the timebox on the left has
been converted to an anyof query. As these queries are more inclusive
(requiring only one value in the given range during the interval, as
opposed to all values), the result set for the anyof query is a superset
of the other result set. . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.14 A variable time timebox (VTT), with two sets of modification handles.
The outer handles can be used to modify the value range and the time
window, while the inner handles can be dragged to modify the duration
of the interval during which values must be within the given range. . . 61
4.15 Calculation of an angular query. If an item ti has a value v at the
starting time tmin, its value at the ending time tmax must be between
vmin and vmax, as determined by θ1 and θ2, along with the width of the
query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.16 The angular query widget. . . . . . . . . . . . . . . . . . . . . . . . 64
4.17 An annotated angular query widget. The dark lines demonstrate how
the vertical line in the query widget is used to determine the two angles
necessary for the query. . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.18 The TimeSearcher query space with an angular query under the “all
points” interpretation. Data and query envelopes have been disabled
for clarity. Selection handles on the query widget can be used to move
and rescale the query, and a tooltip provides a textual representation of
the query on mouse-over. Note that the graph envelopes show items
with a slope similar to that of the angular query widget, but at differing
ranges along the value axis. . . . . . . . . . . . . . . . . . . . . . . . 65
4.19 The angular query from Figure 4.18, under the alternate “end points”
interpretation. Note that some items in the result set have interme-
diate transitions that exceed the range specified, even though the line
between values at the end points fits within the specified range. . . . . 66
4.20 An angular brush that searches for negative correlations between items
in the second and third axes [58]. . . . . . . . . . . . . . . . . . . . . 67
4.21 The TimeSearcher query window, with an average profile displayed in
red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.22 An average query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 A schematic overview of the container classes used in the Time-
Searcher GUI. The entire window is an instance of TQCore - a sub-
class of JFrame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 A UML-style depiction of the relationships between the classes in the
display list and query window. . . . . . . . . . . . . . . . . . . . . . 78
5.3 The steps involved in TimeSearcher query processing. . . . . . . . . . 88
5.4 Average times for TimeSearcher to completely process queries - in-
cluding search and display update - on several query types. Results
are shown for data sets of 1000, 10000, 25000, and 50000 items with
100 and 200 time points, and 100,000 items with 100 time points only. 92
6.1 Example of entities that meet (upper) and fail to meet (lower) the con-
straints of a timebox. . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Clipping: as the timebox is moved to the lower right, the area marked
“D” is removed from the query, and the “A” region is added. These
two regions must be processed, but there is no need to reprocess the
overlap (“O”). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 A grid index for a data set with time points 0-9, values 0-80, and 8
buckets in the value dimension. Given this scheme, values from 0-10
will go into bucket 1, 11-20 in bucket 2, etc. The timebox shown will
cover the grids for values 21-30, 31-40, 41-50 and 51-60 for times
3-5. Buckets 21-30 and 51-60 are only partially covered, thus their
contents must be checked at each time point. The other buckets are
completely covered by the timebox, so checking of individual points
is not necessary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 Average times (ms) across all operations for data sets with 100 time
points and 100, 1000, 10000, and 50000 items. . . . . . . . . . . . . 111
6.5 Average times (ms) across all operations for data sets with 100 items
and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . . . . 112
6.6 Comparative times for query creation and translation on data sets with
100 time points and 100, 1000, 10000 and 50000 items. . . . . . . . . 114
6.7 Comparative times for query resize and deletion on data sets with 100
time points and 100, 1000, 10000 and 50000 items. . . . . . . . . . . 115
6.8 Comparative times for query creation and translation on data sets with
100 items and 100, 1000, and 10000 time points. . . . . . . . . . . . 116
6.9 Comparative times for query resize and deletion on data sets with 100
items and 100, 1000, and 10000 time points. . . . . . . . . . . . . . . 117
6.10 A timebox query demonstrating the advantage that sequential process-
ing has over geometric methods. For this timebox that spans eight
time points, sequential processing can stop after the second time value
is identified as falling outside of the timebox. However, the geometric
approaches must examine every point that falls within the timebox. . . 119
6.11 The timebox from Figure 6.10, with a time series for which S(ti,b) =
true. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.12 The number of values actually checked for sequential and Grid-20 al-
gorithms for data sets involving 100, 1000, and 10000 items with 100
time points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.13 The number of values actually checked for sequential and Grid-20 al-
gorithms for data sets involving 100 items with 100, 1000, and 10000
time points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.14 Optimized sequential vs. Hashed sequential for data sets involving
100, 1000, and 10000 items . . . . . . . . . . . . . . . . . . . . . . . 128
6.15 Optimized sequential vs. Hashed sequential for data sets involving
100, 1000, and 10000 time points . . . . . . . . . . . . . . . . . . . . 129
6.16 Why time series query performance is independent of the width of the
series. As this timebox covers 25% of the value space and five time
periods, a randomly generated time series would only have odds of
< 1% of satisfying the timebox (like t2 does). The odds that a time series
will fail to meet this query by the fourth time point (like t1) are greater
than 99%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.1 A form fill-in interface for specifying query constraints. . . . . . . . . 138
7.2 A range slider interface for specifying query constraints. . . . . . . . 138
7.3 The tsexp interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4 Feedback provided in the tsexp interface. Note the highlighted border
around the feedback corresponding to the selected timebox. . . . . . . 142
7.5 Average completion time (with standard deviation error bars) for well-
defined tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6 Number of items correctly identified in exploratory task . . . . . . . . 149
7.7 Average task completion time for exploratory tasks . . . . . . . . . . 150
7.8 Average subjective satisfaction ratings (1-9, 9 is best), n = 12. . . . . 151
7.9 A demonstration of the difficulty of resizing small handles. The large
timebox on the left has handles that are clearly separated and easily
graspable. The small timebox on the right has handles that are only a
few pixels apart, and are therefore harder to select. . . . . . . . . . . 153
7.10 The form-fill interface with tabular display of query results. Each row
contains the data for one item in the set, with the values displayed
in the columns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
7.11 Average task completion time with standard deviation error bars. . . . 162
7.12 Average task completion time (with standard deviation error bars) for
each of the two timed tasks. . . . . . . . . . . . . . . . . . . . . . . . 163
7.13 Average task performance times (with standard deviation error bars)
for the six participants who were fastest with the timebox interface. . . 164
7.14 Average task performance times (with standard deviation error bars)
for the six participants who were fastest with either of the form fill-in
interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.15 Average subjective satisfaction ratings (1-9, 9 is best), n = 12. The
preference for the timebox interface was significant in all cases. . . . 166
8.1 Red-green “heat map” display of gene expression at seven time points.
Each row is a gene sample, and each column is a time point. Bright
green samples are repressed genes, bright red are induced genes, and
darker samples are close to the average. Genes that are repressed (low
expression levels) are shown at the top, and induced genes (high ex-
pression levels) at the bottom [34]. . . . . . . . . . . . . . . . . . . . 174
8.2 The Hierarchical Clustering Explorer. Dendrogram clusters and filters
for detail and similarity are shown in the top window, with a detailed
display of a subset shown below. A scatterplot on the right is used
for pairwise comparison between two of the experimental conditions
[111]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.3 TimeSearcher query display identifying genes that are roughly similar
to E93 at 10 and 12 hours. This query contains two timeboxes, based
on the values of E93 at 10 and 12 hours. The 12 hour timebox has
been shifted up, to eliminate smaller increases in expression levels.
This timebox has also been increased in height, in order to include
some very sharp increases in expression level that might not have been
included in the original timebox. . . . . . . . . . . . . . . . . . . . . 181
8.4 TimeSearcher query identifying genes that decrease significantly be-
tween 10 and 12 hours, when E93 is increasing. . . . . . . . . . . . . 181
8.5 A query illustrating the need for additional constraints requiring non-
increasing (or non-decreasing) values over a specified interval. Al-
though the general trend of the two timeboxes is upwards, the high-
lighted item actually has a decrease in value between 10 and 12 hours.
Additional constraints requiring non-decreasing items would remove
this item from the result set. . . . . . . . . . . . . . . . . . . . . . . 185
8.6 The three main stages in the creation of protein from DNA. During
transcription, the strand of DNA is copied. During splicing, the introns
are removed, leaving only the exons. The output of splicing is a strand
of mRNA. During translation, the mRNA is exported from the nucleus
and used to create a protein. . . . . . . . . . . . . . . . . . . . . . . 193
8.7 Splice sites and branch sites. . . . . . . . . . . . . . . . . . . . . . 194
8.8 Data envelope overview of pentamer frequency distributions in Ara-
bidopsis thaliana. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
8.9 Timebox query aimed at finding pentamers with higher frequencies at
a specific region within introns (the branch site) and lower frequencies
elsewhere within introns. . . . . . . . . . . . . . . . . . . . . . . . . 196
9.1 A schematic layout of the different types of example queries. Queries
are expressed in approximate order of increasing precision, from left
to right. Aggregate queries are modifiers that apply to queries within
the shaded box, and maximal period queries are modifiers that might
apply to those within the unshaded box. Queries below the dashed
line are based on the characteristics of individual
items in the data set, while those above the line involve comparisons
between items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
9.2 A timebox query expressing A∧ (B∨C)∧D. B and C must be dis-
juncts, as both cannot be true simultaneously. . . . . . . . . . . . . . 226
9.3 A timebox that may lead to ambiguous interpretation under the model
given in Figure 9.2. The item drawn is in either timebox B or C for
the two time points during which they overlap, but it does not spend
both of those time points in any one box. Should this item be included
under the disjunctive semantics of Figure 9.2? What would the result
that users would expect? . . . . . . . . . . . . . . . . . . . . . . . . 227
10.1 The TimeSearcher query display, augmented with a preview display
displaying time periods that have larger numbers of items that follow
the pattern. The number of items that match the query at each time
point is given by the line color at that time: lighter colors indicate
a small number of matches, while darker colors show intervals with
more matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10.2 Sketch of a potential design for categorical timeboxes. For a data set
involving web log records for multiple hosts, this interface might be
used to find queries that had large numbers of visitors from “.com”
hosts in September and October, followed by large numbers of “.org”
visitors in December and January. . . . . . . . . . . . . . . . . . . . 241
10.3 A categorical timebox query looking for sites that had large numbers
of “.org” or “.edu” visitors during December and January. . . . . . . . 242
C.1 Average completion time for well-defined tasks. . . . . . . . . . . . . 258
C.2 Number of items correctly identified in exploratory task . . . . . . . . 259
C.3 Average task completion time for exploratory tasks . . . . . . . . . . 260
C.4 Average subjective satisfaction ratings (1-9, 9 is best). n = 7 . . . . . 261
Chapter 1
Introduction
Numerous analytic domains involve the study of measurable quantities that change
over time. Financiers examining trends in economic indicators, meteorologists study-
ing climate data, demographers quantifying trends in census data, and numerous others
use time series graphs, statistical evaluations, and other tools to identify patterns and
find trends in these time series data sets.
Interest in time series data has prompted a substantial body of work in the develop-
ment of strategies for storing and indexing temporal data. Algorithmic and statistical
methods for identifying patterns have provided substantial functionality in a wide va-
riety of situations [4, 5, 6, 19, 24, 30, 49].
Algorithmic research only addresses one aspect of the data mining problem. The
question of query formulation - which questions are worth asking? - is often left unan-
swered. Data mining researchers often pose the challenge of finding patterns in time
series in terms of similarity to an input pattern. These queries involve specification of
both a query pattern and a range of allowable similarity. Identification of parameters
such as these using trial-and-error processing is often challenging and computationally
expensive. A central problem for users is that the effects of small changes on parame-
ters such as similarity tolerances may be hard to gauge without running multiple trials.
In these cases, users need tools to support interactive exploration of the contents
of time series data sets. By providing analysts with the power to quickly construct
queries, modify parameters, and examine result sets, these tools would encourage the
development of understanding of the data set as a whole. This understanding is use-
ful for guiding the construction of queries, thus speeding the process of knowledge
discovery.
Dynamic queries [7] and related information visualization techniques [26] have
proven useful in supporting users interested in understanding multi-dimensional ab-
stract datasets. The combination of graphic displays with easily manipulated user-
interface widgets for query formulation allows users to explore data sets in search of
items of interest. Although there has been little work to date on interactive systems for
querying time series data, lessons from information visualization research can guide
developers of systems for the exploration of time series data sets.
The existence of familiar graphic displays of time series presents an obvious start-
ing point for the application of information visualization techniques. Two-dimensional
graphs with time on the x-axis and a continuous variable on the y-axis are ubiquitous:
stock charts, weather data, and physiologic data such as electroencephalograms (EEG) and
electrocardiograms (EKG) are just a few examples. In domains such as stock price
analysis, familiar patterns have been named and identified as shorthand approaches to
identifying trends of interest (Figure 1.1) [87].
Preliminary investigations into possible interactive systems for exploring time series
data have led to the development of the timebox metaphor, and its implementation
in TimeSearcher. Timeboxes are rectangular regions that are placed and directly ma-
nipulated on a timeline, with the boundaries of the region providing the relevant query
parameters. TimeSearcher is a research application that supports the use of timebox
Figure 1.1: Patterns of interest in stock trend analysis [87].
queries to interactively search and explore time series data sets. TimeSearcher also
provides other querying tools, including support for simultaneous querying of mul-
tiple time-varying attributes, extensions to the timebox query model, drag-and-drop
query-by-example, “leaders & laggards” querying, and query inversion.
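As a rough illustration of the timebox idea, the following Python sketch filters a small data set with rectangular constraints. It assumes that time points are integer indices into each series, that a series satisfies a timebox when every value in the covered time range lies within the value range, and that multiple timeboxes are combined conjunctively. The function names and the sample data are hypothetical and are not taken from TimeSearcher.

```python
from typing import Dict, List, Sequence, Tuple

# A timebox is (x1, x2, y1, y2): time indices x1..x2 and values y1..y2.
# This tuple layout is an assumption made for the sketch.
Timebox = Tuple[int, int, float, float]


def satisfies_timebox(series: Sequence[float], box: Timebox) -> bool:
    """Return True if the series stays within [y1, y2] for all of x1..x2."""
    x1, x2, y1, y2 = box
    for t in range(x1, x2 + 1):
        if not (y1 <= series[t] <= y2):
            return False  # stop at the first value outside the box
    return True


def timebox_query(data: Dict[str, Sequence[float]],
                  boxes: Sequence[Timebox]) -> List[str]:
    """Names of the series that satisfy every timebox (assumed conjunction)."""
    return [name for name, series in data.items()
            if all(satisfies_timebox(series, box) for box in boxes)]


# Hypothetical sample data: two short series with four time points each.
sample = {"a": [1, 2, 3, 4], "b": [1, 5, 3, 4]}
matches = timebox_query(sample, [(1, 2, 2, 3)])
```

Here `matches` contains only `"a"`, because series `"b"` leaves the value range at time index 1. Dragging or resizing a timebox in an interface would correspond to calling `timebox_query` again with updated bounds.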
This dissertation describes related work, introduces the timebox model and Time-
Searcher, and continues with more in-depth discussion of the query model, implemen-
tation, evaluation, extensions, and future work:
• Chapter 2 provides a discussion of related work in visualization of time series
and temporal data, data mining, databases, and searching of time series data sets.
• Chapter 3 introduces the timebox concept and provides examples of its use.
• Chapter 4 describes TimeSearcher, its implementation of timeboxes, and other
features.
• Chapter 5 provides details of TimeSearcher’s implementation.
• Chapter 6 describes a comparison of various algorithms that were evaluated as
candidates for providing the efficient processing needed for dynamic queries.
• Chapter 7 contains results from two empirical studies of timeboxes as compared
to other query specification modalities. Additional empirical results are pre-
sented in Appendix C.
• TimeSearcher has been used in ongoing research in molecular biology. Inves-
tigations of microarray data have been used to find patterns in gene expression
data. Building on the observation that any linear sequence can be treated as
a time series, biologists have also used TimeSearcher for exploration of data
sets describing nucleotide frequencies at differing positions in aligned genetic
sequences. These applications are described in Chapter 8.
• A wide variety of extensions to the timebox model might be used to provide
greater query expressiveness. Some of the possible extensions are introduced in
Chapter 9.
• Possibilities for future work are outlined in Chapter 10.
• Chapter 11 is the conclusion.
Chapter 2
Related Work
The focus of this thesis is time series data: sequences of real-valued measurements
x1 . . .xn. Although time series data has been the subject of extensive examination in
a wide variety of research fields, it is only one component of what has been called
“time-oriented data” [123]. Other forms of time-oriented data include temporal data,
involving events of arbitrary duration (as opposed to values that are recorded at discrete
intervals) [9], and spatio-temporal data, which combines temporal (perhaps time series
information) with other spatial data. The challenges associated with these varying
domains have led to work in a variety of areas.
2.1 Visualizations and Interactive Systems
A recent survey of linear temporal visualizations is found in [123]. Generally, these
tools focus on visualization and navigation, with relatively little emphasis on querying
and pattern identification.
2.1.1 Time Series Data: Visualizations
The importance of time series data sets has led to extensive work on the part of graphic
designers interested in increasing the readability of these time series graphs [133]. A
non-linear display, which emphasizes more recent information by compressing older
data, was suggested by Powsner & Tufte [102]. This compression provided a static
“focus+context” [26] view of individual readings from a patient’s medical record. The
combination of multiple small graphs on a single page provided a succinct overview
of the entire record.
Recent research into interactive visualizations of these data sets has focused on
supporting multi-scale and periodic views. Recursive patterns, an early visualization
technique, provide dense displays of data divided hierarchically into finer-grained time
periods (year, month, week, etc.) [73].
Spiral Visualizations [27] uses a circular metaphor to display the periodicity of
some data sets. Time progresses along the path of an Archimedean spiral, with corresponding
periods in each interval aligned to form “spokes” of each period. For example,
each revolution of the spiral might indicate one year, with data points for January
in each year aligned along a single spoke (Figure 2.1). The Spiral Visualizations tool
also provides facilities for zooming in on subsets of the spiral and manually adjusting
the duration of each revolution to support interactive exploration.
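The layout described above can be sketched with an Archimedean spiral, in which the radius grows linearly with the angle. This is a minimal illustration and not the implementation used in [27]; the function name and the parameters `period`, `spacing`, and `r0` are assumptions introduced for the example.

```python
import math
from typing import Tuple


def spiral_position(t: float, period: float,
                    spacing: float = 1.0, r0: float = 1.0) -> Tuple[float, float]:
    """Map time t to (x, y) on an Archimedean spiral.

    One revolution covers `period` time units, and each revolution
    moves the curve outward by `spacing`. Points whose times differ
    by a whole number of periods share an angle, forming a spoke.
    """
    laps = t / period                 # number of revolutions completed
    theta = 2.0 * math.pi * laps      # angle in radians
    r = r0 + spacing * laps           # radius grows linearly with angle
    return r * math.cos(theta), r * math.sin(theta)


# Hypothetical monthly data: month 1 of year one and month 1 of year two.
x1, y1 = spiral_position(1, 12)
x13, y13 = spiral_position(13, 12)
```

The two points lie in the same direction from the center but one lap apart. Changing `period` corresponds to the manual adjustment of revolution duration mentioned in the text: periodic patterns line up along spokes only when `period` matches the cycle in the data.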
Many interesting data sets contain multiple time-varying quantities. For example,
values for temperature, precipitation, and barometric pressure might be available for each
time point. These data sets provide additional challenges for visualization designers:
displaying multiple attributes in a single space may increase information density, but
the resulting display might be cluttered.
Spiral Visualizations uses two approaches to address this problem. Multiple at-
Figure 2.1: A spiral visualization of the consumption of Baphia Capparidifolia by
Chimpanzees in Tanzania during 1980-1988. Each lap represents one year, and each
spoke one month. The area of each blot is proportional to the observed consumption
during that month of the given year. To see how consumption varied during a given
year, users can move along a given lap of the spiral. To compare consumption in a
given month across years, users examine blots along the same spoke [27].
tributes are shown on a single spiral by providing a marking for each variable of interest
at each time point, producing a series of “flags” [27] that are displayed perpendicular
to the spiral, in a 3D projected view. Alternatively, two separate spirals can be tightly
coupled.
Spiral depictions have also been used to provide basic facilities for filtering time
series displays. Brewer, et al. describe a circular query control which can be used to
Figure 2.2: A Diamond Fast display showing a zoomed image of two overlaid 10-year
periods [135].
specify consecutive and non-consecutive time intervals at multiple scales (year, month,
day) [25].
Other work in time series visualization tools has focused on tools for statistical
analysis. Diamond Fast provided a variety of features suitable for exploratory analy-
sis of time series data, including the ability to zoom displays in both time and value
dimensions, point queries to find values at individual data points, and overlay of multi-
ple time series to compare periodic patterns (Figure 2.2) [135, 136]. Another proposed
system - FORTUNE - builds on Diamond Fast, adding decomposition, smoothing,
forecasting, and other statistical analysis tools [80].
ThemeRiver [59], a system for visualizing thematic patterns over time in a docu-
ment collection, uses a different approach to depicting changes over time. Given a set
of news stories, selected from a given time period, ThemeRiver displays the changes in
content as topics become more and less newsworthy. Working from the observation
that cumulative histograms of the number of articles in each topic may be hard to inter-
pret, ThemeRiver connects the corresponding areas in adjacent bars of the histogram
to create a visual flow from one time point to the next. In this view, the thickness of
a section of the river indicates the number of stories on the associated topic, and the
Figure 2.3: A ThemeRiver visualization of news items regarding Fidel Castro, from
November 1959 through June 1961. Each band in the river indicates a separate topic,
with the thickness of the band indicating the number of stories on that topic [59].
vertical position of an item in the river is not important. The total thickness of the
river at a given time point provides a measure of the overall news activity on that date
(Figure 2.3).
DiskTrees and TimeTubes [31] provide an interesting model for the use of circles
to display multiple attributes from a hierarchical data set. A disk tree is a circular
display of the contents of a web site, with the root in the center and subsequent layers
in the circle corresponding to layers in the hierarchy. The usage of various parts of the
web site is indicated through the dual codings of line size and brightness. Sequential
displays of multiple disk trees support examination of changes in usage trends over
time (Figure 2.4). Mouse-over, zooming, and other interaction techniques are used to
support interactive exploration of the data.
Animation has also been used to display changes in time series data over time, either automatically
or in response to user input [31, 45].
Figure 2.4: A TimeTube, with four DiskTrees showing the evolution of the web site
over time [31].
Another class of visualizations handles data sets that are not strictly time series,
even if they involve observations at discrete times. SeeLog uses graphic glyphs to
portray entries in Unix command logs [44], in order to provide overviews of large log
files that previously went unanalyzed. Controls for filtering items and limiting the time
scale can be used to limit the time span being displayed, and full-text entries from the
original log files can be retrieved via mouse-click. Ribler, et al. developed a suite of
metrics that could be used to model and visualize categorical time series data [106].
Wong, et al. [145] display sequential patterns in text corpora with time proceeding
along the horizontal x axis and category names on the vertical y axis. Categories
are sorted alphabetically from bottom to top, providing an ordering comparable to the
numeric ordering used in real-valued time series data. Alternative arrangements that
relate proximity to topic similarity might provide more easily understood
arrangements.
Primarily implemented as visualizations, these systems are limited in their ability
to specify queries involving patterns that change over time. Facilities for zooming-
in on desired intervals are generally provided, but creation of specific queries is not:
patterns are identified through visual inspection.
Although the lack of common application domain and task profiles makes compar-
ison between these visualizations difficult, more understanding of relative strengths
and weaknesses is clearly needed. As none of the papers mentioned above involve any
empirical evaluation, further evaluation and comparison would be particularly useful.
2.1.2 Temporal Data: Visualizations
Visualizations of temporal data have been used in a variety of domains, including:
Medical Data : Medical records for individuals include treatment information, test
results, and diagnoses that evolve over time. Cousins and Kahn [36] developed an early
visualization that combined time series measurements, intervals for events of non-
point duration, and additional external details such as patient calendar information. A
semantic model that supported differing temporal granularities was used to support
moving between different time scales. LifeLines [99] used a categorized, zoomable
display to provide an overview of an entire medical record: as the user zooms in to
successively smaller intervals, the display is updated to provide progressively finer-
grained information (Figure 2.5).
Calendar and activity timelines : Calendar and time schedules are perhaps the
most familiar and natural timelines. Calendar visualizers [88] use multiple-scale rep-
resentation, zooming, and focus+context displays to support scheduling activities and
coordinating meetings.
Figure 2.5: A LifeLines display of a patient’s medical record [99].
Interaction Histories : Time-Machine Computing [105] and LifeStreams [50] use
linear temporal displays to visualize current and past activities in a desktop computing
environment, in order to support recall, navigation, and location of information and
documents. Other systems have examined the use of timelines for finer-grain actions,
as the basis for extended undo/redo facilities [41, 100]. In presenting the possibility
of navigating to a previous state and subsequently modifying that state, these visual-
izations face the challenge of appropriately handling divergent timelines that represent
possible descendants of a given point in time [43].
Data Analysis : Human-Computer Interaction, cognitive engineering, and other en-
gineering and ethnographic research fields often require the collection and interpre-
tation of real-time data collected from participants in research studies. Often involv-
ing synchronization of video recording, computer activity, and other activities, these
data sets can be difficult to interpret. MacSHAPA [110] and Timelines [57] are two
tools that support navigation in time through data collected in the course of these ob-
servations. Data collected is displayed on one or more synchronized timelines, and
navigation tools support scrolling in time.
Other visualizations use less familiar representations of temporal attributes. Peo-
pleGarden [147] uses lengthening stalks of individual flowers to display the amount of
time that users have spent participating in online discussions. SELES [28] displays
time-varying changes in landscape data by using two dimensions of a cube for location
and the third for time. Exploration is supported via distortion techniques that can be
used to see “inside” the cube.
2.1.3 Time Series Data: Querying
For many tools involving time series data, temporal query facilities are limited to narrowing
the display to selected regions of interest. Spiral Visualizations provides facilities
for changing the time scale of the display and zooming in on periods of interest.
Modifications to the scale of the display lead to revisions of the spiral that may cause
patterns to appear along the spokes of the spiral [27]. Similarly, circular query widgets
have been used to limit displays to intervals of interest [25] (Figure 2.6).
MIMSY [107] provided an early example of an interactive tool for querying time
series data. Designed to support analysis of stock market data, MIMSY used traditional
GUI widgets, including text entry fields and pull-down menus,
to search for trends of interest in stock data. MIMSY supports a number of
domain-specific time series, including volume, shares outstanding, and others, along
with aggregates including average, min, max, and move. Other interesting operators
include support for relative changes (“close of IBM down more than 10%”) and cross-
ings in value (“select close of abc when close of abc crosses close of b”). Query
processing is handled in a traditional batch mode.
Figure 2.6: Circular query controls for filtering cyclic data [25].
QuerySketch is an innovative query-by-example tool that uses an easily-drawn
sketch of a time series profile to retrieve similar profiles, with similarity defined by
Euclidean distance [141]. Queries are executed implicitly on mouse release, and re-
sults are displayed in thumbnail form beneath the query space. Designed for simplicity
and ease-of-use, QuerySketch does not support editing of existing queries.
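Similarity search by Euclidean distance, as described for QuerySketch, can be illustrated with a short Python sketch. It assumes that the sketched query and the stored profiles have already been sampled at the same time points; the function names and the sample profiles are hypothetical and do not describe QuerySketch's actual implementation [141].

```python
import math
from typing import Dict, List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of time points")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query: Sequence[float],
                       profiles: Dict[str, Sequence[float]],
                       k: int = 3) -> List[str]:
    """Names of the k profiles closest to the query, nearest first."""
    ordered = sorted(profiles.items(),
                     key=lambda item: euclidean_distance(query, item[1]))
    return [name for name, _ in ordered[:k]]


# Hypothetical stored profiles and a sketched query.
stored = {"flat": [1, 1, 1], "rise": [1, 2, 3], "fall": [3, 2, 1]}
nearest = rank_by_similarity([1, 2, 4], stored, k=2)
```

For this data, `nearest` lists `"rise"` first and `"flat"` second. A result display such as the thumbnails described above would present profiles in this order.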
Spotfire’s Array Explorer 3 [127] supports graphically editable queries of temporal
patterns in microarray data. Queries are dynamically modified by moving discrete
value markers at each time point. Query results are based on Euclidean distances
from the resulting profile. Queries are evaluated against clustered time series from a
larger set of microarray gene expression profiles. The limitation of each query point
to a single time instance complicates the expression of queries involving values that
remain relatively unchanged for a period of time.
Figure 2.7: The Patterns visual query language, specifying a sequence involving one
of four alternative transitions followed by a single required transition [90].
Patterns [90] uses a set of graphic primitives and operators to specify patterns of
interest in time series data. Query primitives can be used to search for intervals dur-
ing which values are rising, falling, flat, contained within a given threshold, straight
(constant slope), concave, or convex. These primitives are specified via operators that
include visual depictions of the trend associated with the operator. Operators are parameterized,
and can be combined via operators including conjunction, disjunction,
loop (for repetition), and gap (indicating a “don’t-care” interval between two events)
(Figure 2.7). Query results are provided in a parse tree, which details the composition
of the result in terms of the primitive operators. Although algorithmic details of query
processing are not provided, the Patterns query language is powerful and flexible.
The identification of patterns in time-series data at varying granularities involves
additional challenges. This problem was addressed by van Wijk and van Selow, in an
attempt to identify time-varying trends in energy usage and employee attendance in
terms of variations over a given day and identification of similar days over a several
month period. Patterns for individual days were clustered hierarchically, forming the
basis for a calendar display that colors each date based on the cluster to which it be-
longs. Alongside the calendar, a graph display could be used to display the graphs
corresponding to one or more clusters. Querying and browsing are both supported, as
clusters can be selected via point-and-click on dates, similarity to a chosen date, or via
top-down browsing of the hierarchy [138].
Data Mining research regarding time series data has generally been limited to al-
gorithmic strategies for finding patterns similar to a given query (Section 2.2). Some
work in this area has addressed issues relevant to interactive systems, particularly with
respect to query specification and refinement.
Agrawal et al.’s Shape Definition Language (SDL) [5] provides very similar query
mechanisms. SDL uses textual operators such as “up”, “down”, “stable” and “zero”
to construct queries similar to those that can be created with Patterns. Composition op-
erators supporting a regular expression-like syntax can be used to construct complex
queries (e.g., “(in 5 (and (noless 2 (any up stable)) (nomore 1 (any down stable))))”).
Although an interface is not described, an index structure and well-defined semantics
provide the groundwork for construction of an interactive system based on SDL.
Similarity Miner [146] uses a similar approach to query specification.
Noting that time series that look substantially different can have small Euclidean
distances, Keogh and Pazzani suggested a relevance feedback approach to query pro-
cessing. In this model, users would evaluate responses to an original query, rating
items in the result set on a 7-point scale. These ratings would be used to create a new
query based on the old query and on the rated result items. Both the query and result
items were segmented by a piecewise linear approximation, allowing users to express
different preferences for different features of interest. User models are also modified
Figure 2.8: The MMVIS query window [60].
to account for the potential impact of offset (vertical) translation, amplitude scaling,
discontinuities and other distortions [78].
2.1.4 Temporal Data: Querying
Interactive query techniques for temporal data sets might be adapted for use with time
series, and vice-versa.
TVQL [60] is a visual query language for identifying relationships between events
of interest in multimedia (video) data. TVQL uses double-thumbed sliders to support
expression of queries involving relative temporal relationships between two subsets of
events chosen from multimedia annotation. Four sliders are provided, for specification
of relative time of, and elapsed time between, the start and end points of two subsets
(Figure 2.8). Query constraints are displayed to the user in a notation based on Allen’s
interval relationships [9]. TVQL’s dynamic query model provides fast updates.
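For readers unfamiliar with Allen’s interval relationships [9], the following Python sketch names the relation that holds between two intervals. It is a generic illustration that assumes each interval is a (start, end) pair with start < end; the function name and the relation labels are choices made for this example, and TVQL’s own notation and implementation are not reproduced here.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end), assumed start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Name the Allen relation that holds from interval a to interval b."""
    (s1, e1), (s2, e2) = a, b
    if not (s1 < e1 and s2 < e2):
        raise ValueError("intervals must satisfy start < end")
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "met-by"
    if s1 == s2 and e1 == e2:
        return "equals"
    if s1 == s2:
        return "starts" if e1 < e2 else "started-by"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Only partial overlap with distinct endpoints remains.
    return "overlaps" if s1 < s2 else "overlapped-by"


# Hypothetical annotated events, with times in seconds.
relation = allen_relation((0, 4), (3, 5))
```

Here `relation` is `"overlaps"`. Constraints on the relative start and end times of two sets of events, like those set with the four sliders, select subsets of these thirteen relations.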
TVQL was evaluated in a pair of studies. In the first, TVQL was compared to
TForms, a forms-based interface. TForms used pull-down menus and text entry fields
to express relationships between subsets of events. In a between-subjects study, each
participant was asked to interpret and express queries with one of the two interfaces.
Subjects took more time to learn the TVQL interface, but query interpretation was
significantly faster with TVQL. For query specification, TVQL was significantly faster
only on queries involving incremental modification, a validation of TVQL’s use of
dynamic queries. User questionnaires revealed no significant difference in preference
ratings. User comments included a desire to manipulate temporal diagrams directly,
suggesting an alternative interface design [61].
A second study compared TVQL to a paper timeline. Examination of the num-
ber and kinds of queries created in response to free-form queries revealed strengths
and weaknesses of both approaches. Users of TVQL took advantage of features that
displayed the frequency of events to answer questions involving frequencies of event
occurrence, while timeline users generally did not do the manual counting necessary
to generate these results. Timeline users found more trends than TVQL users, but they
also made more errors. Timeline users also failed to identify negative trends (“B never follows
directly after A”). These results supported the suggestion of incorporating timeline
views alongside the TVQL facilities [62].
TVQE [124] builds upon TVQL’s use of Allen’s intervals [9] to support dynamic
temporal queries in a relational environment. Based on a formal model of temporal
queries [121], TVQE uses a series of sliders, checkboxes, and other widgets to specify
the time intervals, scales, and relationships of interest. These temporal constraints are
combined with relational selections made from a graph view of a database schema, to
form a full temporal relational query. TVQE has been used to model histories of user
interactions. Specifically, histories of use of TVQE have been modeled within TVQE
[122].
Other efforts have involved the development of interactive tools for identifying
patterns in other forms of temporal data, such as music. For example, one system used
contours to allow users to specify sequences of pitch transitions of interest [22], in a
manner that is somewhat reminiscent of Shape Definition Language (SDL) [5].
2.1.5 Parallel Coordinates
Parallel Coordinates is a visualization technique for high-dimensional data sets. A
parallel coordinates display is built by laying out the dimensions in a data set with
a set of parallel (usually vertical) axes. Each axis provides a linear ordering of the
values for the corresponding dimension that are found in a data set. An item in a data
set is displayed on these axes by drawing a polyline connecting the points on each of
the axes that correspond to the values for that item in each dimension (Figure 2.9).
Originally developed by Alfred Inselberg, parallel coordinates have been extensively
studied [52, 53, 58, 67, 68, 89].
A variety of techniques for “brushing” - selecting and highlighting points of inter-
est - in parallel coordinates have been developed, including direct manipulation sliders
and “painting” areas of interest through selections. These brushes can also be composed
through logical Boolean operators [89]. Additional work has been aimed at supporting
queries that identify specific correlations between adjacent dimensions through
“angular” brushes [58] or dialogs [144], and the use of “structure-based” brushes to
support navigation through clustered hierarchies of data [52]. Many of these tech-
niques are similar to approaches that have been used in TimeSearcher, while others
may provide the basis for future work.
Although parallel coordinates are not designed for use specifically with time series
data, many of the display and interaction techniques developed for parallel coordinates
may be applicable to time series data sets. In fact, TimeSearcher’s “graph overview”
display (Chapter 3) is visually similar to the patterns of overdrawn and possibly crossing
lines found in parallel coordinates (Figure 2.9).
Figure 2.9: A sample parallel coordinates visualization involving four dimensions
from a database describing automobiles [58].
There are two important distinctions between parallel coordinates and time series
data. In time series data, each measurement is made along a common scale, resulting in
common minimum and maximum values across all time points. In parallel coordinates,
the extents of each dimension can be different - perhaps involving categorical values.
Thus, parallel coordinates tools may support the inversion or “flipping” of an axis
in order to see patterns more clearly [58]. This operation would not be particularly
meaningful in time series data sets.
The other major distinction involves the ordering of axes. In time series data,
adjacency in the graph implies adjacency in time - time t should always come right
before time t + 1. No such adjacency is implied in parallel coordinates. In fact, some
tools support manual reordering of axes [58], and algorithmic techniques for finding
preferred orderings have been developed [68].
2.2 Data Mining
Combining visualization tools with data mining approaches is an intriguing
possibility that could unite the strengths of two powerful
analysis approaches [118]. One example of the power of this approach combines
textual mining for sequential patterns with an interactive, timeline-based visualiza-
tion [145].
2.2.1 Similarity Searching
The challenge of data mining in time series databases is generally defined in terms of
sequence similarity: given a set X of sequences and a query sequence Q, find all xi ∈ X
such that Q and xi are sufficiently similar. Alternatively, find the k nearest neighbors
to Q. Similarity in this context is generally defined in terms of Euclidean distance. A
substantial body of work aimed at addressing this question has been conducted over
the past several years.
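The two query forms defined above can be sketched as a brute-force baseline. This is only an illustrative sketch: the function names and the sample data are assumptions introduced here, and no indexing is attempted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def range_query(sequences, query, eps):
    """Indices of all sequences within distance eps of the query."""
    return [i for i, s in enumerate(sequences) if euclidean(s, query) <= eps]

def k_nearest(sequences, query, k):
    """Indices of the k sequences closest to the query, nearest first."""
    order = sorted(range(len(sequences)),
                   key=lambda i: euclidean(sequences[i], query))
    return order[:k]

# Hypothetical data set X and query sequence Q.
X = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9, 9, 9]]
Q = [1, 2, 3, 4]
print(range_query(X, Q, 1.0))
print(k_nearest(X, Q, 2))
```

Both functions scan every sequence, which is the cost that the indexing techniques discussed in this section aim to avoid.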
Much of this work has been based on the paradigm of dimensionality reduction
and spatial embedding, first introduced by Agrawal, Faloutsos, and Swami [3]. Noting
the curse of dimensionality associated with long time series, Agrawal et al. devised a
lower-dimensionality representation based on the Discrete Fourier Transform (DFT).
Specifically, they proved that the use of a representation based on the first few coef-
ficients of the DFT provided a lower bound on the Euclidean distance between two
sequences. In other words, if $D(\vec{x}, \vec{y})$ is the distance between two sequences, and
$D(\vec{X}_f, \vec{Y}_f)$ is the distance between the truncated DFT representations of x and y, then
$D(\vec{X}_f, \vec{Y}_f) \le D(\vec{x}, \vec{y})$.
This observation was used to form the basis for a search algorithm. Given a set
of sequences $x_i$, generate the truncated DFT representations $X_{i,f}$, and store them in a
spatial index (R∗ trees [15] were found to provide the best results). To find the sequences
similar (within distance $\varepsilon$) to a given query sequence q, derive the appropriate
DFT representation $Q_f$, and use the spatial query index to identify all $X_{i,f}$ such that
$D(\vec{Q}_f, \vec{X}_{i,f}) \le \varepsilon$. As the distance in the DFT-space is a lower bound on the actual distance,
this will provide a superset of the desired results. For each item thus retrieved,
the actual sequence is retrieved, and its distance to the query is calculated, to filter out
any false alarms.
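The filter-and-refine procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited implementation: a linear scan stands in for the R∗-tree lookup, the DFT is computed naively with orthonormal scaling (so that the truncated distance is a lower bound), and all names and the random test data are hypothetical.

```python
import cmath
import math
import random

def dft_features(x, k):
    """First k coefficients of the orthonormal DFT of x (naive version)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(k)
    ]

def distance(a, b):
    """Euclidean distance; works for real sequences and complex features."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def dft_range_query(sequences, query, eps, k=3):
    """Filter in feature space, then refine with the true distances."""
    qf = dft_features(query, k)
    # Filter step: a linear scan stands in for the spatial index lookup.
    # The small slack guards against floating-point rounding.
    candidates = [i for i, s in enumerate(sequences)
                  if distance(dft_features(s, k), qf) <= eps + 1e-9]
    # Refinement step: discard false alarms using the actual sequences.
    return [i for i in candidates if distance(sequences[i], query) <= eps]

# Hypothetical data: 30 random sequences of length 16.
rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(30)]
q = [v + 0.1 for v in data[0]]
hits = dft_range_query(data, q, eps=4.0)
```

Because the feature-space distance never exceeds the true distance, the filter step cannot discard a true match; it can only admit false alarms, which the refinement step removes.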
This work was later extended to support subsequence querying through the use
of sliding windows to create trails in feature-space. These trails are collected in
adaptively-defined minimum bounding rectangles to be used in the spatial index [49].
Subsequent research has extended and revised this general technique of dimensionality
reduction and spatial embedding in a variety of ways. Algorithms based on
wavelets [29] and singular value decomposition (SVD) [81] have also been proposed.
Yi and Faloutsos extended this model to handle similarities based on any Lp distance
measure [149].
Other work has examined the use of piecewise approximations to reduce dimensionality.
Piecewise Aggregate Approximations (PAA) divide a time series of
length n into a set of N values (N ≪ n) representing the average values in each of N
equal-sized “frames” [76]. These values can then be indexed using techniques described
by Faloutsos, et al. [49]. The PAA model was later extended to achieve further
reduction through the use of adaptive frame lengths [75].
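A minimal sketch of the basic (fixed frame length) PAA reduction follows. The function name is hypothetical, and the sketch assumes that the series length is an exact multiple of the number of frames; the adaptive variant is not shown.

```python
def paa(series, n_frames):
    """Reduce a series to n_frames values, each the mean of one frame.

    Assumes len(series) is divisible by n_frames (equal-sized frames).
    """
    n = len(series)
    if n_frames <= 0 or n % n_frames != 0:
        raise ValueError("series length must be a multiple of n_frames")
    size = n // n_frames
    return [sum(series[i * size:(i + 1) * size]) / size
            for i in range(n_frames)]

# A series of length n = 6 reduced to N = 3 frame averages.
print(paa([1, 3, 5, 7, 2, 4], 3))
```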
The notion of similarity between time series is often more subtle than simple Eu-
clidean distance. Time series that look very different may have small distances, while
series with the same general “shape” may have higher distances [78]. For example, analysts
might be interested in identifying sequences that are similar in shape but have
differing time scales, such as sine waves of differing frequencies. Dynamic time
warping approaches that minimize the error between template (query) sequence and
result sequences have been used to handle these queries [19]. Similar techniques have
been proven particularly effective when used in combination with spatial-embedding
algorithms [150]. More recently, piecewise constant approximations have been used
to provide the basis for indexing of dynamic time warping [74].
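To illustrate how time warping tolerates differences in time scale, the following sketch implements the textbook dynamic-programming formulation of dynamic time warping. It is not the indexed approach of the cited work; the function name and the squared-difference local cost are assumptions made for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance using squared point differences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] ** 0.5

# The second sequence is the first with its opening value held one step longer.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))
```

The two example sequences differ in length, so Euclidean distance is not even defined for them, while the warped alignment matches them exactly.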
Other transformations that might be of interest include scaling in the value (as
opposed to time) dimension, translation, and noise. Agrawal, et al. approach this
problem by generating scaled windows that represent similar subsequences. These se-
quences can then be stitched together to find matches of maximal length [4]. Rafiei
and Mendelzon proposed a model for handling translations, scalings, and moving av-
erages within the reduced-dimensionality, spatial indexing approach described above.
Queries are expressed in terms of similarity to the result of subjecting a given time
series to one or more of a set of transforms. These queries are evaluated by applying
the given transform to the original index on the fly, and then post-processing based on
actual distances [104].
2.2.2 Inverse Queries
Although similarity queries have been the focus of extensive research, other queries
may be useful. Lin, et al. identified two basic classes of queries on time series:
1. Forward queries ask for values at specific points, or value ranges during given
intervals.
2. Inverse queries ask when the time sequence had a given value or fell within a
given range [86].
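The distinction between the two query classes can be sketched with naive scans over a single series. The function names and sample data are hypothetical, and the sketch makes no attempt at the indexing that is the subject of the cited work.

```python
def forward_query(series, t_start, t_end):
    """Forward query: the values during the inclusive interval [t_start, t_end]."""
    return series[t_start:t_end + 1]

def inverse_query(series, low, high):
    """Inverse query: the time points at which the value lies in [low, high]."""
    return [t for t, v in enumerate(series) if low <= v <= high]

# Hypothetical series with one value per time point.
series = [5, 8, 12, 9, 4]
print(forward_query(series, 1, 3))
print(inverse_query(series, 8, 10))
```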
After observing that forward queries can be efficiently supported by a variety of
indices, Lin, et al. introduce the IP-Index for handling of inverse queries. The IP-
Index divides a time series into one-dimensional projections in the value dimension.
These projections are then stored in an ordered indexing structure for efficient retrieval.
The IP-Index can also be used with appropriate interpolation to handle continuous data
[86]. This work was later extended by the SIQ-Index, which was based on the observation
that the IP-Index did not scale well for non-periodic data. The SIQ-index stores the
one-dimensional projections in an R∗-tree, using the trails technique [49] to derive
efficient minimum-bounding rectangles [93].
2.2.3 Outlier Detection
Outlier, or “deviant” detection has also been a topic of interest. From a practical view-
point, identification of outliers can be useful for compression: if outliers are stored
explicitly, the size of the resulting index structures might be reduced without increased
error. From a data mining viewpoint, outliers can be interesting in their own right,
as they indicate differences from common or expected cases. Approaches to outlier
detection are generally based on error minimization given parameterized constraints
on storage. Proposals include dynamic programming approaches that minimize er-
rors associated with bucketing histograms [69], and modifications of SVD indexing
algorithms to include calculations of appropriate deviants [81].
2.2.4 Query Specification
By focusing on similarity search, many of the proposed data mining algorithms eliminate
the need to address the issue of query specification: the query is simply a time
series sequence. One exception is Agrawal et al.’s Shape Definition Language (Sec-
tion 2.1.3), which specifies queries in terms of natural language descriptions of profiles
(e.g., “(zero appears up up down)”) [5]. User interaction issues with similarity-based
data mining were also addressed by Keogh and Pazzani’s proposal for the use of rele-
vance feedback for retrieving patterns from time series data [78] (Section 2.1.3).
2.2.5 Other Approaches
Further work in time series data mining has been aimed at exploring alternative query
techniques and identifying more specific structures and patterns. Probabilistic search
methods were suggested by Keogh and Smyth [79]. To find interesting trends in fi-
nancial data, Povinelli used genetic algorithms to find clusters of interesting time se-
quences in a multi-dimensional space [101]. Other examples include algorithms for
identifying partial periodic patterns - repeating patterns at some, but not all points in
time [56], mining time series for intervals of interest [139], rule discovery [37],
online analysis of multiple sequences [151, 153], and mining at different time granu-
larities [20].
Algorithms from string-searching research have also been adapted to handle
queries over time series data. Suffix trees have been used as indices on time series
data that has been converted into a discrete alphabet [66]. Other work involved the
adaptation of the Knuth-Morris-Pratt string searching algorithm to handle general
predicates involving relative changes in value [108]. The EMMA (Enumeration of Motifs
through Matrix Approximation) algorithm uses a discretized representation of a Piecewise
Aggregate Approximation (PAA) [76] to support searches for similar subsequences,
known as motifs [85].
2.3 Databases
Temporal database research has been ongoing for many years [71]. Numerous temporal
query languages have been proposed [33], TSQL2 most prominently [126]. These
systems generally handle both time point and interval-based temporal data, making
them suitable for time series data sets. Other systems, such as SEQ [112, 113, 114]
are specifically designed to store and index sequence data, and thus may be of particular
interest for time series. As is the case with most database research, these
projects have focused on data representation, query semantics, and query processing,
with little discussion of user interfaces.
Interest in temporal databases has also led to the development of graphical query
languages for temporal relational data. Graphical queries built on top of the entity-
relational model [82, 132] augment familiar entity-relationship models to handle tem-
poral queries. Alternative models such as GTL [96] take different approaches.
These languages might be used as the basis for interactive querying systems [132],
but their use in such environments does not support dynamic queries. Instead, queries are
translated to an underlying relational query, which is then evaluated in a batch mode.
Researchers in spatio-temporal and multimedia databases have developed a variety
of interactive querying mechanisms, including visual languages that specify abstract
depictions of the time changes of interest [23, 47], and interactive systems where the
user’s manipulation of an icon on the screen specifies a query trajectory [39].
2.4 Discussion
This survey of related work illustrates the breadth of issues related to time series (and
more generally, temporal) data.
Visualizations of time series data illustrate the various perspectives that may be ap-
propriate for interpreting these data sets. Factors such as periodicity [25, 27], multiple
scales of resolution [73, 102, 138], and the need to display multiple variables at each
time period [27, 102] lead to a variety of ways to display and interpret these data sets.
Although QuerySketch [141] and Spotfire [127] provide tools for querying these data
sets, the possibilities for interactive visualization have not been exhaustively explored.
Temporal and spatio-temporal database research suggests the possibility of adapt-
ing query tools for time series data to work directly with more general databases. Al-
though these tools often handle time intervals that are more general than time series,
a tool for querying time series data might be used as a front-end to appropriately-
structured temporal databases, perhaps via translation into a temporal query language
[71, 126]. Similarly, time series querying tools might be useful for specifying tem-
poral constraints on spatio-temporal databases, raising the possibility of comparison
between models that combine spatial and temporal constraints in one query mecha-
nism [23, 47], and tools that separate the two. More generally, tools developed for
time series data might be extended to handle intervals over temporal intervals, provid-
ing functionality similar to that of TVQL [60].
Interactive tools and visualizations have almost exclusively focused on searches for
patterns involving well-specified changes over well-defined time periods. Data mining
algorithms are generally much more ambitious, as they often address the challenge of
finding patterns that occur at arbitrary times and are “similar” in some general manner
that can often account for variations in scale and duration, discontinuities, and other
idiosyncratic features [4, 19, 104, 150]. Efforts to find trends or “events” that are of
interest [4, 5, 37] are similar in spirit to the goals of interactive query tools.
Combining the interactivity of dynamic query tools with the power of these data
mining approaches presents several challenges. A query interface that supports these
algorithms must include mechanisms for specifying tolerances of approximate fits,
lengths of allowable gaps, tolerances in time dilation or contraction, and other con-
straints. Query result display would be equally challenging, as any output would need
to display not only the results themselves, but sufficient contextual information to ex-
plain why the result was a match. Furthermore, the implementation of systems that
practically combine the rapid, incremental updates of information visualization with
the computational requirements of data mining may be difficult.
Chapter 3
Timeboxes: Interactive Temporal Query Widgets
Timeboxes are rectangular query regions drawn directly on a two-dimensional display
of time series data. The data set is assumed to consist of some number of items (n),
each of which has a measurement at each of m time points. The extent of the Timebox
on the time (x) axis specifies the time period of interest, while the extent on the value
(y) axis specifies a constraint on the range of values of interest in the given time period.
More specifically, we assume that $t_i \in T$ is an item in a time series data set, $t_i(j)$ is the
value of $t_i$ at time $j$, and a timebox is a 4-tuple: $b = (t_{min}, t_{max}, v_{min}, v_{max})$. We say that
$t_i$ satisfies the timebox $b$ if $\forall t,\ t_{min} \le t \le t_{max}:\ v_{min} \le t_i(t) < v_{max}$ (assuming $v_{max} \ge v_{min}$ and
$t_{max} \ge t_{min}$)¹.
We assume that the temporal data is divided into discrete time points of granularity
determined by each data set. The discrete nature of the data is enforced by constrain-
ing timeboxes to occupy an integral number of time points. Multiple timeboxes can
be drawn to specify conjunctive queries. Items in a data set must match all of the
constraints implied by the active timeboxes in order to be included in the result set.
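The timebox definition and the conjunctive query model can be sketched directly. This is an illustrative sketch rather than TimeSearcher's implementation: the class, function names, and sample data are hypothetical, time points are assumed to be list indices, and the upper value bound is treated as exclusive, following the definition as written above.

```python
from typing import Dict, List, NamedTuple, Sequence

class Timebox(NamedTuple):
    t_min: int      # first time point covered (inclusive)
    t_max: int      # last time point covered (inclusive)
    v_min: float    # lower value bound (inclusive)
    v_max: float    # upper value bound (exclusive, per the definition above)

def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """True if the series stays within the value range at every covered time point."""
    return all(box.v_min <= series[t] < box.v_max
               for t in range(box.t_min, box.t_max + 1))

def run_query(items: Dict[str, Sequence[float]], boxes: List[Timebox]) -> List[str]:
    """Conjunctive query: keep the items that satisfy every active timebox."""
    return [name for name, series in items.items()
            if all(satisfies(series, b) for b in boxes)]

# Hypothetical data set: three items measured at five time points.
items = {
    "a": [10, 20, 30, 20, 10],
    "b": [10, 10, 10, 10, 10],
    "c": [30, 30, 30, 5, 5],
}
print(run_query(items, [Timebox(1, 2, 15, 35)]))
print(run_query(items, [Timebox(1, 2, 15, 35), Timebox(3, 4, 0, 15)]))
```

Adding the second timebox narrows the result set, mirroring the way each additional box adds constraints to the query.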
Creation of timeboxes is straightforward: the user simply clicks on the desired
starting point of the timebox and drags the pointer to the desired location of the opposite
corner. As this is identical to the mechanism used for creating rectangles in widely
used drawing programs, this operation should be familiar to most users. As the box
is drawn, the interaction handler responsible for the drawing of the box will force the
box to occupy an integral number of time points.
¹ This model can be easily extended to account for time series containing several variables, each of
which is measured at each time point. See Section 4.3.
Timeboxes are drawn to extend beyond the time points covered by one-half interval
on either side. Thus, a timebox that covers time periods 2-5 (inclusive) will have
its leftmost side half-way between 1 and 2 and its rightmost halfway between 5 and
6. This avoids difficulties in interpretation that might arise if the vertical sides of a
timebox were aligned with (or close to) the vertical line through a time point.
Once the timebox is created, it may be dragged to a new location or resized via
appropriate resize handles on the corners, using similarly familiar interactions. In all
cases, the query is re-processed with each mouse event. When a user action leads to
a modification of a timebox, the new position of the timebox is stored, the query is
updated, and the new result set is displayed.
Construction of timeboxes is aided by drawing all of the items in the data set di-
rectly on the query area. This graph overview display provides additional insight into
the density, distributions, and patterns of change found among items in the data set
(Figure 3.1).
The example data set shown in Figure 3.1 contains weekly stock prices for 1430
stocks and will be used in a brief scenario to illustrate the use of timeboxes. An an-
alyst interested in finding stocks that rose and then fell within a four-month period
might start by drawing a timebox specifying stocks that traded between $70 and $190
during the first few weeks. When this query is executed, the graph overview is up-
dated to show only those records that match these constraints. We can quickly see that
this query substantially limits the number of items under consideration, but many still
remain (Figure 3.2).
Figure 3.1: A graph overview, formed by superimposing the time series for all of the
items in the data set.
Figure 3.2: A single timebox query, for items between $70 and $190 during weeks 1-5.
Figure 3.3: A refinement of the query in Figure 3.2.
To find stocks in this restricted set that dropped in subsequent weeks, the user
draws a second box, specifying items that traded between $12 and $80 during weeks
10-12 (Figure 3.3). A third box, specifying a higher price range ($60-$120) during
weeks 19-24 completes the query (Figure 3.4).
As timeboxes are added to the query, the graph overview provides an ongoing
display of the effects of each action and an overview of the result set. Once created, the
timeboxes can be scaled or moved singly or together to modify the query constraints.
The use of simple, familiar idioms for creation and modification of timeboxes sup-
ports interactive use with minimal cognitive overhead. Rapid (<100ms), automatic
query processing on mouse-up events provides the virtually instantaneous response
necessary for dynamic queries, thus supporting interactive data exploration. Users can
easily and quickly try a wide range of queries, modifying these queries to quickly see
the effects of changes in query parameters. This ability to easily explore the data is
helpful in identifying specific patterns of interest, as well as in gaining understanding
of the data set as a whole.
Figure 3.4: A complex query containing three timeboxes.
3.1 Anyof Semantics
Alternative interpretations of timeboxes are also possible. For example, a disjunctive
timebox might require that the value of a time series fall within the specified
range for some time points during the specified interval, as opposed to all of those time
points. For these anyof timeboxes, we say that $t_i$ satisfies timebox $b = (x_1, x_2, y_1, y_2)$ if
$\exists x,\ x_1 \le x \le x_2$ s.t. $y_1 \le t_i(x) \le y_2$.
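A minimal sketch of the anyof test follows. The function name and sample series are hypothetical, and time points are assumed to be list indices.

```python
def satisfies_anyof(series, x1, x2, y1, y2):
    """True if at least one time point in [x1, x2] has a value in [y1, y2]."""
    return any(y1 <= series[x] <= y2 for x in range(x1, x2 + 1))

# Hypothetical series: the value 9 at time point 2 falls in the range 8..10.
series = [1, 5, 9, 5, 1]
print(satisfies_anyof(series, 0, 2, 8, 10))
print(satisfies_anyof(series, 0, 1, 8, 10))
```

The only change from the standard timebox test is the quantifier: a universal check over the covered time points becomes an existential one.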
3.2 Variable Time Timeboxes
As defined above, the basic timebox is limited to expressing queries with fully-defined
time and value constraints. Additional expressive power might be gained by extending
the model in a manner that relaxes these constraints. One possibility would be to
support searches for items that fall within a given value range during some interval
of a given duration that falls within some longer window of time. For example, stock
analysts might want to identify stocks that traded between $30 and $60 for some 3
month period anytime between January and August (inclusive). These queries - known
as variable time timeboxes (VTTs) - are the simplest extension to the timebox model.
We have developed several additional extensions to the timebox model [63]: these
are described in more detail in Chapter 9. In collaboration with Eamonn Keogh, a
preliminary implementation of variable time queries (query 2) was developed
and evaluated through a preliminary study [77].
Formally, a variable time timebox (VTT) is defined as two points (x1,y1) and
(x2,y2) and a single integer R. The VTT provides a constraint on a time series such
that for the time range x1 ≤ x ≤ x2, the dynamic variable must have a value in the range
y1 ≤ y ≤ y2 for at least R consecutive time units (assuming y2 ≥ y1 and x2 ≥ x1) (Figure
3.5). Under this formalism, a VTT with a value of R = x2 − x1 is simply a standard
timebox.
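The VTT constraint can be sketched as a search for a sufficiently long run of in-range values. This is an illustrative sketch with a hypothetical function name. It assumes that a run of consecutive time points from a to b spans b - a time units, an interpretation chosen so that R = x2 - x1 reduces to a standard timebox as stated above.

```python
def satisfies_vtt(series, x1, x2, y1, y2, r):
    """True if, within [x1, x2], values stay in [y1, y2] over a run spanning >= r units."""
    run_start = None  # first time point of the current in-range run
    for x in range(x1, x2 + 1):
        if y1 <= series[x] <= y2:
            if run_start is None:
                run_start = x
            # Assumption: a run from run_start to x spans x - run_start time units.
            if x - run_start >= r:
                return True
        else:
            run_start = None  # the run is broken; start over
    return False

# Hypothetical series: values at time points 1..3 lie in the range 4..8.
series = [1, 5, 6, 7, 1, 5]
print(satisfies_vtt(series, 0, 5, 4, 8, 2))
print(satisfies_vtt(series, 0, 5, 4, 8, 3))
```

Under this interpretation, setting r to 0 yields the anyof behavior of Section 3.1, while setting it to x2 - x1 requires every covered time point to match.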
Graphically, VTTs are represented as outline boxes that surround a traditional time
box. When initially created, the VTT has a value of R = x2 − x1. By clicking and
dragging the sides of the internal rectangle, the user can adjust the value of R.
A user study was conducted to evaluate the claim that VTTs would be particularly
useful for the task of separating large data sets into disjoint classes. In particular,
the hypothesis was that VTTs would be more effective than standard timeboxes in this
separation task. Ten undergraduate students performed a series of tasks with two data
sets. These tasks were aimed at measuring their ability to create queries that separated
each data set into two disjoint partitions. The quality of separation was measured by
subtracting the number of false positives from the number of items correctly separated,
and normalizing by the size of the data set. Using this measure, VTTs appeared to have
a significant advantage in the quality of results [77]. It should be noted that these results
applied only to the creation and interpretation of single queries, and therefore might
not generalize to interactive dynamic query environments.
Figure 3.5: A variable time timebox, specifying that for at least R consecutive time
periods between x1 and x2, items must have values in the range y1 ≤ y ≤ y2.
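The separation measure used in the study can be written as a one-line computation. The function name and the example counts below are hypothetical and are not taken from the study; the sketch simply takes the two counts as inputs rather than assuming how they were obtained.

```python
def separation_quality(correct, false_positives, dataset_size):
    """(correctly separated items - false positives) / size of the data set."""
    return (correct - false_positives) / dataset_size

# Hypothetical counts: 40 items correctly separated, 5 false positives, 100 items.
print(separation_quality(40, 5, 100))
```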
Variable Time Timeboxes are based on a model of placing a timebox within a larger
region that provides additional constraints. This model can easily be generalized
to support other extensions that increase the expressivity of the timebox model.
Queries with variability in value instead of time - Variable Value Timeboxes (VVTs)
- might be formed by providing vertical variability, instead of horizontal. Vertical and
horizontal variability might be combined to provide queries that support variability in
both time and value. These and other extensions are discussed in Chapter 9.
3.3 Timeboxes in the Context of Information Visualiza-
tion Research
The timebox is an incremental extension to previous work on development of widgets
for dynamic queries in information visualization environments. The ancestry of time-
boxes can be traced back to one-dimensional range sliders, which extended traditional
GUI sliders. Range sliders allowed users to adjust values from both ends (instead
of only one end), and to move the entire range of interest by dragging the middle of
the slider [117]. Multiple range sliders can be combined to support searching in
multiple dimensions. This approach has been used in a variety of systems, often with
augmented displays aimed at linking the various dimensions. For example, the In-
fluence Explorer used multiple range sliders with histograms and lines linking items
across each dimension to display the relative influence of various dimensions (Fig-
ure 3.6) [134]. Similar techniques have been used for selection and filtering of items
in parallel coordinates displays, both with implicit (Figure 3.7) and explicit (Figure 3.8)
range sliders.
These approaches share the common limitation of controlling only one or two di-
mensions at any given time. To adjust constraints on multiple dimensions, users must
adjust multiple brushes or controls individually. For high-dimensional data sets, this can
get tedious. Two-dimensional widgets have been suggested as an approach to improve
this situation. These widgets might be used to specify single points for two variables
or to select a range in 2D space (Figure 3.9) [117]. Points could be selected with a
single click, and ranges would be modified by moving and resizing the area of interest.
Unfortunately, these widgets have not been widely used. Furthermore, 2D widgets
require identification of related pairs of variables that might effectively be combined
into a single widget.
Figure 3.6: The Influence Explorer: Range Sliders on the “brightness” and “working
life” dimensions select the ranges of interest. Histograms with each variable indicate
the number of items having various values of that variable, and lines between
histograms indicate the values of a selected item [134].
“Brushing” is another related technique that uses two-dimensional graphical re-
gions to select items from a scatterplot or other display. Given a 2D-scatterplot of
items in a data set, a brush is a rectangular region that can be drawn to “lasso” and se-
lect items of interest. For higher-dimensional data sets, multiple, repeated scatterplots
with differing brushes might be used [89].
Like 2D range sliders and brushes, timeboxes are rectangular regions that can be
created, moved, and scaled to specify and modify query constraints. However, time-
boxes are significantly more expressive. 2D widgets and brushes express constraints
on two dimensions, but a timebox constrains an arbitrary number of values.
Specifically, a data set containing m time points can be seen as an m-dimensional
data set - each item in the data set is a single point in R m. In this space, a timebox of
width m′ < m simultaneously constraints m′ dimensions. Of course, these constraints
are not independent, as values in each of the m′ dimensions are required to fall within
the same range.
Figure 3.7: XmdvTool [89]: The highlighted items have been selected by “brushing”.
Once the brush is created, the highlighted areas on any given axis can be moved or
resized [148].
Figure 3.8: Explicit range sliders in CityOScope’s parallel coordinates display. Arrows
at the top and bottom of each axis can be used to limit the range of interest [53].
(a) Point query (b) Range Query
Figure 3.9: Two dimensional query widgets: (a) A point query indicating an exact
number of bedrooms and cost of a home. (b) A range of number of bedrooms and
cost [117].
This coverage of multiple dimensions provides an increase in expressive power.
Each movement of a timebox in the value dimension (vertically) results in changes to
m′ constraints. This is a significant increase over range sliders or brushes, which would
require either m′ (for 1D range sliders) or ⌈m′/2⌉ (for 2D range sliders or brushes)
separate modifications.
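As a worked illustration of this count, consider a timebox of width m′ = 6. The width is an assumed example value, and m′ is taken here to be the number of time points covered with both endpoints included; the remaining symbols follow the timebox definition at the start of this chapter.

```latex
% A timebox b = (t_min, t_max, v_min, v_max) covering
% m' = t_max - t_min + 1 time points imposes the same value range
% on each covered dimension:
\[
  v_{min} \le t_i(j) < v_{max}
  \quad \text{for each } j \in \{t_{min}, \ldots, t_{max}\}.
\]
% Shifting that range for m' = 6 requires the following numbers of
% separate modifications:
\[
  \underbrace{1}_{\text{timebox}}
  \quad \text{vs.} \quad
  \underbrace{m' = 6}_{\text{1D range sliders}}
  \quad \text{vs.} \quad
  \underbrace{\lceil m'/2 \rceil = 3}_{\text{2D range sliders or brushes}}
\]
```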
The ability to move and rescale timeboxes provides further power. Modifications
that add or subtract time periods effectively add and remove constraints from the query.
To do this in an interface based on 1D (or even 2D) sliders or controls, users would
have to add or remove each control manually.
There are significant similarities between timeboxes and some of the parallel co-
ordinates displays that have been developed. The “graph overview” display is partic-
ularly reminiscent of the overlapping drawn lines found in parallel coordinates (Fig-
ures 3.7 and 3.8). In theory, a rectangular brushing facility in a parallel coordinates
display would likely be very similar to a timebox query. However, characteristics of
the data sets involved provide an important difference between timeboxes and parallel
coordinates displays.
Time series data has significant auto-correlation - for a given item, the value at time
t is closely related to values at times t − 1 and t + 1, and perhaps less so to the value at time
t + 10 [77]. As consecutive measurements are related, it makes sense to use timeboxes
to express a given constraint over multiple consecutive measurements.
For parallel coordinates, the situation is quite different. Although some tools sup-
port manual reordering of axes [58], and algorithmic methods for identifying preferred
orderings have been developed [68], adjacent axes do not necessarily have any rela-
tionship or correlation that can easily be expressed in a single set of constraints. In
fact, parallel coordinate graphs with wide swings from one axis to the next are common
(Figure 3.7).
Timeboxes are based on the assumption that each measurement is made on a com-
mon scale, and that all values will fall between some global minimum and maximum
value. This assumption may not hold for parallel coordinates displays. For example,
a data set involving cars may have ranges of 4-12 for the number of cylinders, 18-40
for miles/gallon, and $15,000-$50,000 for price. It is not clear how, if at all, a timebox
or similar query could be used to simultaneously express constraints on these three
variables.
As a result of these differences, brushing facilities in parallel coordinates displays
tend to provide brushes that resemble contours over subsets of the axes (Figures 3.7
and 3.8). These brushes are semantically similar to timeboxes, but the manipulations
required are substantially different.
Chapter 4
TimeSearcher
TimeSearcher uses timeboxes to pose queries over a set of entities with one or more
time-varying attributes. Entities have one or more static attributes, and one or more
time-varying attributes, with the number of time points and the interpretation of those
points being the same for every entity in a given data set.
When a data set is loaded, entities in the data set are displayed in a window in the
lower left-hand corner of the application. Each entity is labeled with its name, and
the values of the active dynamic attribute are plotted in a line graph. Complete details
about the entity (details-on-demand) can be retrieved by simply clicking on the graph
for the desired entity: this will cause the relevant information to be displayed in the
upper right-hand window (Figure 4.1).
The top-left corner of the TimeSearcher window is the query input space. This
space initially contains an empty grid. To specify a query, users simply draw a timebox
in the desired location. The query is re-processed with every mouse event. Thus, as
the box is drawn the results are continuously and implicitly updated, without the need
for explicit user action.
When query processing completes, the display in the bottom half of the application
window is updated to show those entities that match the query constraints. For each of
Figure 4.1: The TimeSearcher application window. Clockwise from upper-left: query
space (with data envelope, query envelope, and graph overview), details-on-demand,
item list, range sliders for query adjustment, and data items.
these entities, the time points that match the query are highlighted, in order to simplify
the interpretation of the display. This matching will depend upon the type of query: for
standard timeboxes, the points that are highlighted will be exactly those points that are
contained in the original timebox. For anyof timeboxes (Section 3.1) and variable time
timeboxes (Section 3.2), only those points from a given entity that match the query
will be highlighted in the display window (Figure 4.2).
Once the initial query is created, the timeboxes can be moved and resized. The
Figure 4.2: Partial results from a timebox query, with time points that match the query
highlighted. Items in the result set differ in the points that match the query, indicating
an anyof or variable time timebox.
hand and box icons on the upper toolbar are used to switch between creating timeboxes
and moving/resizing them. As is the case with initial timebox creation, the query is
reprocessed with each mouse event.
Although somewhat less than ideal, this switching between the drawing and mod-
ification modes is necessary for proper operation. When in the timebox mode, a click
on the background of the query space is interpreted as the specification of the upper-
left corner of a new timebox. However, that same action is interpreted as the start of a
selection lasso when in timebox modification mode. Elimination of the modes would
require an additional input mechanism such as a shift key to disambiguate between
these modes.
When multiple timeboxes are present, they can be modified individually or simulta-
neously in groups of two or more. This functionality is particularly useful for searches
for complex patterns (Figure 3.4). In these cases, users can select some or all of the
timeboxes (using standard lasso and shift-click interactions) and simultaneously apply
the same translation and/or scale along either or both axes to all selected timeboxes.
This is useful for searching for instances of a pattern that vary slightly in scale or
magnitude, or for modifying queries based on example items.
Timeboxes can also be adjusted via a pair of range sliders in the lower right-hand
corner of the screen. When a timebox is selected (or created), these range sliders are
initialized with the parameters of the timebox, with the top slider containing time ex-
tents and the bottom including values. As each dimension is adjusted separately by
its own slider, these controls support a degree of fine-tuning that might be difficult to
achieve by dragging the timeboxes. These sliders are disabled when multiple time-
boxes are selected.
A third mechanism is provided for modifying the value range of timeboxes. The
textual labels above the value range sliders are editable, allowing users to specify value
constraints by typing them in. Like the range sliders, these entry fields are disabled
when multiple timeboxes are selected.
Much of the research in mining of time series involves queries for items in a data
set that are similar to a specified query [4, 6, 19, 49]. TimeSearcher provides a simple
drag-and-drop mechanism for these “query-by-example” queries: the user can simply
click on an entry in the data display window, drag it into the query window, and release
the mouse to drop, thus instantiating a query.
The query resulting from a drag and drop has a separate timebox for each time
Figure 4.3: Drag-and-drop query-by-example, with results.
point in the data set. Each timebox has a width of one interval, with the query values
centered around the actual value of the attribute for that entity at the given time point.
The height of each timebox is set to be 10% of the total range of the attribute being
queried, so each timebox has a range of v±5% of the total range in the attribute value,
where v is the value of the template time series at the given time point (Figure 4.3).
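The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TimeSearcher's actual code: the tuple layout `(t_min, t_max, v_min, v_max)`, the function name `query_by_example`, and its parameters are hypothetical, and each box is taken to cover exactly one time point.

```python
from typing import List, Sequence, Tuple

# Hypothetical timebox representation: (t_min, t_max, v_min, v_max).
Box = Tuple[int, int, float, float]


def query_by_example(template: Sequence[float],
                     attr_min: float,
                     attr_max: float,
                     fraction: float = 0.10) -> List[Box]:
    """Build one single-point timebox per time point of the template series.

    Each box is centered on the template value v and spans `fraction`
    (10% by default) of the attribute's total range, i.e. v +/- 5%.
    """
    half_height = fraction * (attr_max - attr_min) / 2.0
    return [(t, t, v - half_height, v + half_height)
            for t, v in enumerate(template)]
```

Enlarging `fraction` corresponds to the looser definition of similarity that users can obtain by resizing the generated boxes.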
The timeboxes in the resulting query can be modified to specify varying
definitions of similarity. For example, the boxes could be enlarged to allow for a looser
definition of similarity, or subsets of the query could be eliminated to focus on items
that are similar only at specific time points.
4.1 Overviews
TimeSearcher provides a limited overview display in the lower left-hand window, dis-
playing each of the entities in the data set in a linear list. As this display shows a small
number of items at any given time, it is not an effective overview. Another possible
overview would display each of the entities in a thumbnail graph. These thumbnails
would be displayed in a grid, instead of the linear arrangement shown in Figures 4.1
and 4.3. This approach suffers from two shortcomings. For any reasonably sized data
set (more than a few dozen items), the limited screen space available would restrict
each thumbnail to a tiny area of the screen, rendering it virtually unreadable. Fur-
thermore, displaying each entity in a separate graph may not help users in identifying
global trends, such as the extreme values of the time-varying attribute at any given
point in time.
TimeSearcher provides another form of overview by displaying the extreme values
that can be found in the data set at each time point. Known as a “data envelope”, this
overview is optionally shown in the background of the query window as a contour that
follows the extreme values of the query attribute at each point in time, thus displaying
the range of values that may be queried (Figure 4.4). When the user executes a query,
the data envelope is extended by a “query envelope” - an overlay that outlines extreme
values of the entities in the result set (Figure 4.5). This display provides users with a
graphic summary of the relationship between the result set and the data set as a whole.
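Both envelopes reduce to the same computation: the minimum and maximum value at each time point, taken over either the whole data set or the current result set. A minimal sketch follows, assuming every item is a list of values of equal length; the function name `envelope` is hypothetical.

```python
from typing import List, Sequence, Tuple


def envelope(series: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return (min, max) at each time point over a collection of series.

    Assumes all series have the same number of time points.
    """
    if not series:
        return []
    # zip(*series) groups the values of all items by time point.
    return [(min(column), max(column)) for column in zip(*series)]
```

Calling `envelope` on all items gives the data envelope, while calling it on only the items that match the current query gives the query envelope.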
Without any timeboxes present, the data envelope highlights areas that would be
fruitful for query creation, while leaving empty areas unmarked. For example, the data
Figure 4.4: Query window with data envelope.
Figure 4.5: Query display with data and query envelopes.
envelope in Figure 4.4 does not extend to the upper right-hand corner, so queries in
that region would not return useful results. When a timebox is created, the updated
query envelope shows the differences between the current result set and the data set as
a whole, thus clarifying the range of values excluded by the timebox. The query enve-
lope also guides the creation of additional timeboxes, as queries outside this envelope
will not match any records.
The graph overview (Chapter 3) provides further support for browsing the data
set. When the user mouses over a graph envelope line, the line is highlighted, thus
displaying the individual item in the context of the larger data set. At the same time,
the name of the item is displayed as a tooltip, along with the value of the item at the
time point closest to the point where the mouse-over occurred. The item list, item
display window, and details-on-demand window are also updated to display the
selected item. This tight coupling in response to lightweight mouse movement will
encourage exploration based on visual examination of the graph overview.
Overdrawing and visual clutter might cause the graph overview display to become
less useful for large data sets. Furthermore, the computational overhead of drawing
the graph overviews and processing the mouse-over handling can lead to substantial
performance degradation when graph overviews are used with these data sets.
To avoid these difficulties, TimeSearcher supports the possibility of graceful degra-
dation between overviews. For large result sets, the data and query envelopes will be
shown. When user queries reduce the size of the result set below a user-specified
threshold (set to 100 items by default), the graph envelopes will be displayed. The use of
graph overviews for smaller result sets and data/query envelopes for larger result sets
thus provides an example of a dynamic decision regarding the tradeoff between high-
resolution overviews and performance.
4.2 Leaders & Laggards
The analysis of time series data sets frequently includes a search for items with behav-
ior trends that somehow anticipate changes that will eventually be seen in other items
in the data set. For example, stock market analysts might look for a given stock that
dropped sharply shortly before other stocks in the same sector experienced a similar
decline in price. Similarly, biologists looking at microarray experiments (Section 8.1)
might be interested in finding a gene or EST that has a sharp increase in expression lev-
els immediately before a group of genes has a similar increase. Such a finding might
form the basis for the hypothesis that the first gene is a regulatory gene that plays a
role in stimulating the expression of the other genes.
TimeSearcher provides a mechanism to support this search for “Leaders & Lag-
gards”. After creating a timebox query that identifies the set of items with a trend of
interest, the user presses the toolbar button with the parallel arrows (or selects “Set
Leaders” from the “Edit” menu) to invoke this “leaders” mode. The query window will
then be split into two sub-windows:
• The top, “leader” window will contain the specified query, along with the items
that match the leader query.
• The lower, “laggard” window contains the items in the original query in outline,
along with one new query box for each timebox in the original query. These new
query boxes will be offset by one time period (to the right if possible, if not, to
the left) from their original counterparts.
An example query, and its use as a “leader”, are shown in Figure 4.6.
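The initial placement of the laggard boxes can be illustrated with a short sketch. The tuple layout, the function name `laggard_boxes`, and the decision to test "if possible" separately for each box are assumptions made for illustration; the text does not say whether the direction is chosen per box or for the query as a whole.

```python
from typing import List, Tuple

# Hypothetical timebox representation: (t_min, t_max, v_min, v_max).
Box = Tuple[int, int, float, float]


def laggard_boxes(leader_boxes: List[Box], num_time_points: int) -> List[Box]:
    """Offset each leader box by one time period to seed the laggard query.

    A box moves right when it still fits on the time axis, otherwise left.
    A box that spans the whole axis is left in place (an assumption).
    """
    last = num_time_points - 1
    shifted = []
    for t_min, t_max, v_min, v_max in leader_boxes:
        if t_max + 1 <= last:
            offset = 1
        elif t_min - 1 >= 0:
            offset = -1
        else:
            offset = 0
        shifted.append((t_min + offset, t_max + offset, v_min, v_max))
    return shifted
```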
Once the leader and laggard windows have been created, the user can use the stan-
dard mechanisms to modify the query in the laggard window as desired. Thus, the
user can find items that lead or lag by an arbitrary number of time points, or that have
a wider range of values than the original query, etc.
Items in both the original leader result set and the laggard results are displayed in
both the lower-left display window and the window containing the list of item names.
In the display window, items that match the leader query are indicated by the label
“leader”, and the time points that match the leader query are highlighted distinctively.
Similarly, in the item list, leader item names are highlighted in a color that matches the
display of the query in the leader window and the leader label in the display window
(Figure 4.7).
Figure 4.6: The query window displaying a “leaders & laggards” query. The top
window shows leaders, with the original query in magenta providing a reference that
can be used for comparison. The leaders window also includes a label indicating that
the leaders are being shown, along with the name of the attribute being used for the
leader query. The record count at the bottom of this window also indicates that the
items shown are leaders. The bottom window - the “laggards” display - shows the
original query in outline, and has new timeboxes representing the new query, which is
defined by shifting the old query one time period to the right. The count label below
this window indicates that the items shown are laggards.
Figure 4.7: Leaders & Laggards: The top-left window is the leader window, and the
laggard window is directly below it.
The “leaders & laggards” facilities provide basic support for identifying trend rela-
tionships between different items in the data set. In the future, this functionality might
be extended with a more generalized bookmark facility, which would provide similar
functionality for multiple stored queries. In this case, the stored queries would serve
as a library of templates that might be used to identify patterns of interest.
4.3 Multiple Time-Varying Attributes
Although the model for timeboxes presented in Chapter 3 assumes a data set in which
each item contains a single measurement for each of m time points, there is no
particular reason for restricting consideration to data sets involving only one time-
varying attribute. In fact, many meaningful data sets include multiple simultaneous
measurements. For example:
• Stock price data sets might include both low and high prices
• Meteorological data sets might include temperatures and precipitation levels
• Databases of genetic expression levels might include results from two or more
experimental conditions¹.
In the notation given in Chapter 3, these data sets can be modeled by assuming
that there are k variables for each item in the data set. Thus, $t_i^k(j)$ is the value of
variable k for $t_i$ at time j. A timebox is then interpreted as a specific constraint on
any one of the k variables: $b = (t_{min}, t_{max}, v^k_{min}, v^k_{max})$, and $t_i$ satisfies the timebox b if
$\forall t,\ t_{min} \le t \le t_{max}:\ v^k_{min} \le t_i^k(t) < v^k_{max}$ (assuming $v^k_{max} \ge v^k_{min}$ and $t_{max} \ge t_{min}$).
TimeSearcher provides limited support for some data sets with multiple variables.
When a data set with multiple variables is loaded into TimeSearcher, the first variable
in the data set is initially shown as the default, in a single pane of a tabbed pane win-
dow. To examine and query the values of any other variable, the user selects the desired
variable name from the pull-down menu marked “Query Variable” in the toolbar. This
leads to creation of a new frame in the tabbed pane (Figure 4.8).
When multiple attributes are present, users can switch between them by clicking on
the tab at the top of the pane. An attribute can be removed by clicking the close icon
¹ This example was motivated by collaborators working with microarray data sets. See Chapter 8.
Figure 4.8: TimeSearcher with a data set involving multiple time-varying attributes.
Two panes have been created - for the “low” and the “high” values.
(the “x”) in the appropriate tab, and reinstated by making the appropriate selection
in the pull-down menu. Each variable that is active displays its own data envelope,
and the display window shows graphs of each individual item using the values of the
attribute in the currently-selected pane. If desired, the user can modify the individual
graph display in the lower panel to show the graphs for each variable simultaneously,
by choosing the “Display All Variables” choice in the “View” menu (Figure 4.9).
When multiple attributes are displayed, the pane for each attribute acts as a query
space for that attribute. Queries can be created independently for each attribute, and
only items that match all queries - even those for variables in panes other than that
which is currently selected - will be included in the result set. When a query is created
or modified, the query envelopes and graph overviews for each active variable will be
updated to display the appropriate subset of the results (Figure 4.10). All items in the
lower display window will have time points for all active queries highlighted, not just
those time points corresponding to queries for the currently-displayed variable.
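The conjunctive semantics across panes can be sketched as follows. The data layout (each item as a mapping from variable name to a list of values), the names `satisfies` and `result_set`, and the box tuple are hypothetical; the exclusive upper bound follows the per-variable inequality as written earlier in this section.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical timebox representation: (t_min, t_max, v_min, v_max).
Box = Tuple[int, int, float, float]
Item = Dict[str, Sequence[float]]  # variable name -> values per time point


def satisfies(item: Item, variable: str, box: Box) -> bool:
    """True if every value of `variable` in [t_min, t_max] lies in [v_min, v_max)."""
    t_min, t_max, v_min, v_max = box
    return all(v_min <= v < v_max for v in item[variable][t_min:t_max + 1])


def result_set(items: Dict[str, Item], queries: Dict[str, List[Box]]) -> List[str]:
    """Names of items matching every box in every pane, visible or not."""
    return [name for name, item in items.items()
            if all(satisfies(item, variable, box)
                   for variable, boxes in queries.items()
                   for box in boxes)]
```

With no boxes in any pane, every item is returned, which matches the behavior of an empty query space.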
Figure 4.9: The data items in the result set with two variables displayed. The profiles
are taken from yeast microarray data, with absolute log ratio and log ratio values shown
for seven time points [40].
Figure 4.10: Updated query envelopes for one of two attributes that are currently ac-
tive. Note that even though there are no queries in this window, queries in the inactive
window (for “Low” measurements) have constrained the data set, as shown by the
query envelope.
This implementation provides only basic support for multiple time-varying at-
tributes, with several limitations. When multiple time-varying attributes are displayed,
they are shown in the same scale, with extent defined by the minimum and maximum
values found for any attribute in the data set. While this works well for comparable
values, it does not work well for multiple attributes with vastly different ranges. Thus,
for example, this facility would generally not be useful for simultaneous examination
of temperature and precipitation levels.
The requirement that all attributes be displayed in the same scale was motivated
by the need to overcome some of the limitations associated with the use of a tabbed
pane window for the multiple query spaces. As the tabbed pane window displays
only one of the panes at any given time, all but one of the attributes are always obscured.
This increases the cognitive load associated with interpreting queries, as users must
remember the queries that have been created for variables that are not currently visible.
The change between panes might be particularly confusing if the panes involved
different scales and ranges of values. In this case, users might make interpretation
errors if they did not realize that changing panes had led to a switch between query
spaces that covered widely different ranges. Specifically, users might interpret queries
in one space in terms of the range used in a different space. By requiring that all
attributes use the same range of values, TimeSearcher sacrifices some flexibility in
an attempt at minimizing user confusion. Further work in this area will be aimed
at designing an alternative approach that does not require this restriction of common
scales.
TimeSearcher provides an optional “summary” overview that can help alleviate
the problem of occlusion of query spaces. When the user selects the “Summaries..”
option from the view menu, a new window containing miniature views of all of the
Figure 4.11: A summary window for a query over two attributes.
active query spaces is opened. Each summary view is labeled with the name of the ap-
propriate attribute, and the summary view corresponding to the currently active query
window is highlighted (Figure 4.11).
These windows contain active linked views that are updated as the query space
is updated. Although the miniaturized views do not provide enough detail to fully
interpret the queries, they provide a reminder of the occluded query spaces without
taking large amounts of screen space from the currently selected query.
Future work might address alternative solutions to this problem of occlusion. For
example, the tabbed pane might be replaced by a series of individual windows, one for
each attribute. These windows would be coordinated, providing multiple perspectives
similar to those found in Snap-Together Visualizations [94].
4.4 Query Inversion
Having found items in a data set that match a specific query, users might like to find
items that have opposite behavior patterns. For example, a stock analyst might like
to see stocks that fell at the same time as others were rising. TimeSearcher’s “query
inversion” facility supports this task.
Queries containing one or more timeboxes can be inverted by selecting the desired
timeboxes and pressing the toolbar button with the inverted arrows (or selecting “Flip
Selected Queries” from the “Transform” menu). This will cause the queries to be
rotated to form an inverse pattern (Figure 4.12). Pressing this button again restores
the original queries.
The inverse query is derived by calculating the midpoint of the range covered by
the query. Specifically, this midpoint is half-way between the extreme maximal and
minimal points in any of the constituent timeboxes. Each box is then rotated around
this axis, providing the desired inversion.
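One way to read this procedure is as a reflection of each box about a horizontal line at the midpoint of the query's overall value range. The sketch below follows that reading, which is an interpretation of "rotated around this axis" rather than a description of the actual implementation; the tuple layout and the name `invert_query` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical timebox representation: (t_min, t_max, v_min, v_max).
Box = Tuple[int, int, float, float]


def invert_query(boxes: List[Box]) -> List[Box]:
    """Reflect every box about the midpoint of the query's value range.

    The midpoint lies half-way between the lowest v_min and the highest
    v_max over all boxes, so reflecting v gives (lowest + highest) - v.
    """
    if not boxes:
        return []
    lowest = min(box[2] for box in boxes)
    highest = max(box[3] for box in boxes)
    total = lowest + highest  # twice the midpoint
    return [(t_min, t_max, total - v_max, total - v_min)
            for t_min, t_max, v_min, v_max in boxes]
```

Because the overall value range is unchanged by the reflection, applying `invert_query` twice returns the original boxes, and no inverted box can fall outside the range covered by the original query.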
As the original queries all must fit within the parameters of the results set, this
approach to inversion has the desirable feature that the resulting inverse query is guar-
anteed to be a legal query. Other definitions of reciprocal queries - for example, taking
the first timebox as a given constant and rotating other boxes relative to this first box -
might lead to nonsensical queries.
The query inversion tools might be particularly useful when used in conjunction
with leaders and laggards queries (Section 4.2).
Figure 4.12: Query Inversion: The original query (top) and the inverted query (bot-
tom).
4.5 Anyof Timeboxes
TimeSearcher provides support for timeboxes with alternative, anyof semantics (Sec-
tion 3.1). After a timebox has been created, the semantics can be changed by toggling
the “any” checkbox on the pop-up menu that is opened by right-clicking on a time-
box. When the toggle is changed, the query will be re-evaluated, and the timebox
will be displayed in a different color, in order to indicate the alternative semantics
(Figure 4.13).
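The difference between the two semantics amounts to replacing a universal test with an existential one. The following sketch assumes inclusive value bounds, as in the basic timebox definition; the function names are hypothetical.

```python
from typing import List, Sequence


def matches_standard(values: Sequence[float], t_min: int, t_max: int,
                     v_min: float, v_max: float) -> bool:
    """Standard timebox: every value in the time interval must be in range."""
    return all(v_min <= v <= v_max for v in values[t_min:t_max + 1])


def matches_anyof(values: Sequence[float], t_min: int, t_max: int,
                  v_min: float, v_max: float) -> bool:
    """Anyof timebox: at least one value in the time interval must be in range."""
    return any(v_min <= v <= v_max for v in values[t_min:t_max + 1])


def anyof_highlights(values: Sequence[float], t_min: int, t_max: int,
                     v_min: float, v_max: float) -> List[int]:
    """Time points that would be highlighted for an item under anyof semantics."""
    return [t for t in range(t_min, t_max + 1) if v_min <= values[t] <= v_max]
```

For any timebox covering at least one time point, an item that passes `matches_standard` also passes `matches_anyof`, so toggling a box to anyof can only enlarge the result set.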
4.6 Variable Time Timeboxes
Variable time timeboxes (VTTs, Section 3.2) are supported through a button on the
menu bar, which can be selected to switch to a query creation mode analogous to the
mode used for standard queries. The user creates a VTT by drawing a box, using the
same mechanism used for creating a standard timebox. Once the VTT is created, it can
be selected for modification. Two types of modification are possible: outer handles can
be used to modify or scale the range in the value (y) dimension and the overall range
in the time (x) dimension, while inner handles can be used to modify the extent of
the inner box, which specifies the length of the required interval (Figure 4.14). Query
results are re-processed with each modification of any of the parameters.
As implemented in TimeSearcher, the user initially specifies the window of inter-
est, and then modifies the inner box to specify the duration within that window that
must satisfy the given value constraints.
VTTs specify variable constraints, raising the possibility that different items might
satisfy a VTT at different time points. For example, if a VTT specifies an interval that
is 3 time periods long within a window from time 5 to time 10, one item might satisfy
Figure 4.13: Anyof timeboxes: The display on the top shows a query consisting of
two timeboxes. In the bottom display, the timebox on the left has been converted to
an anyof query. As these queries are more inclusive (requiring only one value in the
given range during the interval, as opposed to all values), the result set for the anyof
query is a superset of the other result set.
Figure 4.14: A variable time timebox (VTT), with two sets of modification handles.
The outer handles can be used to modify the value range and the time window, while
the inner handles can be dragged to modify the duration of the interval during which
values must be within the given range.
the VTT during periods 6-8, while another might satisfy the same VTT during periods
7-9. To display these differences, each individual graph of an item in the result set
highlights only those time points when that item meets the criteria for each query item
(Figure 4.2).
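A simple sliding-window scan illustrates how different items can satisfy the same VTT at different times. This sketch assumes that the required interval consists of consecutive time points and that value bounds are inclusive; the function name `vtt_matches` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def vtt_matches(values: Sequence[float], window_start: int, window_end: int,
                duration: int, v_min: float, v_max: float) -> List[Tuple[int, int]]:
    """Return every (start, end) interval of `duration` consecutive time points
    inside [window_start, window_end] whose values all lie in [v_min, v_max].

    An item satisfies the VTT when the returned list is non-empty.
    """
    matches = []
    for start in range(window_start, window_end - duration + 2):
        end = start + duration - 1
        if all(v_min <= v <= v_max for v in values[start:end + 1]):
            matches.append((start, end))
    return matches
```

For a window from time 5 to time 10 and a duration of 3, the scan considers the intervals 5-7, 6-8, 7-9, and 8-10; the time points in the matching intervals are the ones that would be highlighted for that item.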
4.7 Angular Queries
In many cases, analysis of time series data may require queries aimed at finding rela-
tive changes in value, as opposed to the absolute changes that can be expressed with
timeboxes. For example, timeboxes can be used to find items that rise from a value
Figure 4.15: Calculation of an angular query. If an item ti has a value v at the starting
time tmin, its value at the ending time tmax must be between vmin and vmax, as determined
by θ1 and θ2, along with the width of the query.
of 80 to a value of 120 four time periods later, but they cannot be used to identify all
items that rose by 50% in value - regardless of the starting value - over that same time
period.
TimeSearcher’s angular queries can be used to create this sort of query. An
angular query specifies a range of slopes that place constraints on the slope of an
item’s values over the course of an interval. An angular query is a four-tuple:
b = (tmin, tmax, θmin, θmax). As with standard timeboxes, tmin and tmax specify the
starting and ending points for the query. The angles θmin and θmax represent lower and upper
bounds on the angle that the item’s profile must form with the horizontal (Figure 4.15).
Of course, −π/2 ≤ θmin ≤ π/2 and −π/2 ≤ θmax ≤ π/2.
The simplest conception of an angular query involves the angle formed by the line
between the value at the starting point and the ending point. For any given item ti,
the angle is found by taking the difference between the value at the ending point
(ti(tmax)) and the value at the starting point (ti(tmin)), and dividing it by the width of the
timebox. This value is the tangent of the angle in question for item i. Specifically,
$\theta_i = \arctan\left((t_i(t_{max}) - t_i(t_{min})) / \mathit{width}\right)$. This definition - the “end points” version of angular
queries - is based purely on the relationships between values at the two ends of the interval.
As a result, an item can have values that fluctuate wildly between the start and end of
the interval in question and still meet the constraints of the query.
An alternative definition - the “all points” model - requires that every transition
within the interval conform to the stated requirements. This more stringent definition
essentially requires that the overall slope of an item’s profile fall within the desired
range.
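The two interpretations can be contrasted in a short sketch. Several points here are assumptions made for illustration: rising values are taken to give positive angles, `width` is the horizontal extent of the whole query expressed in the same units as the values, the per-transition width under the "all points" model is an equal share of that extent, and the function names are hypothetical.

```python
import math
from typing import Sequence


def matches_end_points(values: Sequence[float], t_min: int, t_max: int,
                       theta_min: float, theta_max: float, width: float) -> bool:
    """End points model: only the line from the first to the last value is tested."""
    angle = math.atan((values[t_max] - values[t_min]) / width)
    return theta_min <= angle <= theta_max


def matches_all_points(values: Sequence[float], t_min: int, t_max: int,
                       theta_min: float, theta_max: float, width: float) -> bool:
    """All points model: every transition inside the interval is tested.

    Requires t_max > t_min so that there is at least one transition.
    """
    step_width = width / (t_max - t_min)
    for t in range(t_min, t_max):
        angle = math.atan((values[t + 1] - values[t]) / step_width)
        if not theta_min <= angle <= theta_max:
            return False
    return True
```

An item with values 0, 10, 0, 4 over a query of width 12 passes the end points test for angles between 0 and π/4, since its net rise is small, but fails the all points test because its first transition is far steeper than π/4.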
The mechanism for creating angular queries is identical to that which is used for
standard and variable time timeboxes: after selecting the appropriate button on the
TimeSearcher toolbar, the user draws a box that specifies the initial extremes of the
angular query. The lower-left corner of the box is used as the starting point, and the
upper-right corner is the ending point for the maximum value. The angle that the line
between these two points forms with the horizontal is θmax. A default value is used to
determine the range between θmax and θmin.
Like standard timeboxes and variable-time timeboxes, angular queries are con-
strained to occupy an integral number of discrete time points. A further similarity
with those other widgets is the extents of the query widgets, which extend horizontally
to occupy 1/2 extra interval beyond the graph points that indicate values covered by
the query. This may cause some confusion, but it is necessary for consistency with
standard and variable time timeboxes.
Figure 4.16: The angular query widget.
Figure 4.17: An annotated angular query widget. The dark lines demonstrate how the
vertical line in the query widget is used to determine the two angles necessary for the
query.
Although the angular query is specified by drawing a box, the widget used to display
the query is somewhat different. This widget consists of two lines. An angled
line from the starting time point to the ending time point indicates the angle of the
query (Figure 4.16). This line meets a vertical line at the ending time point. This
vertical line depicts the range between θ1 and θ2, as shown in Figure 4.17.
The “all points” query model is the default interpretation for angular queries (Fig-
ure 4.18). A query can be changed to the “end points” configuration by right clicking
on the widget and selecting “End Points Only”. This causes reprocessing of the query
Figure 4.18: The TimeSearcher query space with an angular query under the “all
points” interpretation. Data and query envelopes have been disabled for clarity. Se-
lection handles on the query widget can be used to move and rescale the query, and
a tooltip provides a textual representation of the query on mouse-over. Note that the
graph envelopes show items with a slope similar to that of the angular query widget,
but at differing ranges along the value axis.
under the alternative representation, and the coloring of the widget is changed to re-
flect the alternative semantics (Figure 4.19), in a manner similar to the presentation of
anyof queries (Section 4.5).
The width used to calculate the angles is not the difference between the time points
- tmax − tmin. As this difference is generally very small (often on the order of less than
ten time points), using the width in terms of time points would lead to large values for
the tangent, and correspondingly large values for the angle. To avoid this difficulty, the
width in screen coordinates is used to calculate the angles. This results in angles that
correspond to the angle that the angular query widget shows on the screen.
Like timebox queries, angular queries can be modified via handles. These handles
can be used to modify the width, angles, or range between the angles (θ2 − θ1). The
Figure 4.19: The angular query from Figure 4.18, under the alternate “end points”
interpretation. Note that some items in the result set have intermediate transitions that
exceed the range specified, even though the line between values at the end points fits
within the specified range.
handle can also be translated in either time or value. Since translations in value do not
change the angle or starting and ending times of an angular query, these translations
do not impact the result set.
Angular queries are conceptually similar to the use of angular brushes in parallel
coordinates. Angular brushes can be used to find trends of a certain direction and
magnitude in parallel coordinates displays, without regard for initial comparison point
(Figure 4.20) [58]. CASSATT uses a dialog box to provide similar functionality [144].
Angular queries have one advantage over angular brushes. Angular brushes are
limited to comparisons between two adjacent axes, while the comparisons specified by
angular queries may involve comparison across time points separated by an arbitrary
interval - adjacency is not required.
This implementation of angular queries provides an example of the expressive
Figure 4.20: An angular brush that searches for negative correlations between items in
the second and third axes [58].
power that additional widgets and interaction techniques might bring to TimeSearcher.
A variety of additional extensions that might be of interest are discussed in Chapter 9.
4.8 Averages
For some tasks, analysts may wish to identify and explore those items in a data set
that are close to the “average” of the data set. TimeSearcher provides support for one
particular notion of averaging through the “show averages” selection in the “View”
menu.
When this option is selected, a new profile is constructed by calculating the average
value of all of the items in the data set at each time point. In other words, the first time
Figure 4.21: The TimeSearcher query window, with an average profile displayed in
red.
point in the average profile contains the average of the first values for all of the items in
the data set, etc. This profile is displayed in the query area as a red line (Figure 4.21).
This line, which is similar in appearance to a graph overview (Chapter 3), provides
users with basic feedback regarding the distribution of items in the data set.
When the average profile is displayed, a new button is added to the toolbar. When
this button is pressed, the average profile is used as a template for a query, which is
constructed by creating a range around the average value at each time point (Figure 4.22).
In essence, this button treats the average profile as a drag and drop query (Figure 4.3).
Once the query has been completed, the individual timeboxes can be moved, scaled,
or deleted at will to form a variety of queries focused around some interpretation of
the average profile.
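The averaging and range-construction steps described above can be sketched in Java as follows. This is a minimal illustration, not TimeSearcher's actual code: the class and method names are hypothetical, all series are assumed to have the same length, and the fixed half-width (tolerance) used to build the range around each average value is an assumption, as the text does not state how the range is chosen.

```java
/** Sketch of an average profile and a range query built around it (hypothetical names). */
final class AverageProfileSketch {

    /** Returns the per-time-point mean across all items; series[i][t] is item i at time t. */
    static double[] averageProfile(double[][] series) {
        int timePoints = series[0].length;
        double[] average = new double[timePoints];
        for (double[] item : series) {
            for (int t = 0; t < timePoints; t++) {
                average[t] += item[t];
            }
        }
        for (int t = 0; t < timePoints; t++) {
            average[t] /= series.length;
        }
        return average;
    }

    /**
     * Builds one [low, high] value range per time point around the average.
     * The fixed half-width "tolerance" is an assumption made for illustration.
     */
    static double[][] rangesAround(double[] average, double tolerance) {
        double[][] ranges = new double[average.length][2];
        for (int t = 0; t < average.length; t++) {
            ranges[t][0] = average[t] - tolerance;
            ranges[t][1] = average[t] + tolerance;
        }
        return ranges;
    }
}
```

Each resulting range corresponds to one single-time-point timebox, which is consistent with the description of individual timeboxes that can later be moved, scaled, or deleted.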
Figure 4.22: An average query.
4.9 Other Features
TimeSearcher provides rudimentary support for saving and managing query results. A
set of queries can be saved and later reloaded via the “Save Query File...” and “Open
Query Files...” menu selections. Queries are saved without reference to the underlying
data set, thus allowing users to transfer queries between data sets. Query results can
be saved by selecting “Save results..”, which writes a text file that describes the
data file, the current query parameters, and the items in the data set that match those
parameters.
The “search” box in the toolbar provides basic support for known-item search by
name.
TimeSearcher also supports alternate treatment of time varying values. Menu items
can be used to switch between raw values, linear normalization, or z-score normaliza-
tion.
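The two normalization options can be illustrated with a short Java sketch. The text does not define either transformation precisely, so the following rests on assumptions: linear normalization is taken to mean min-max rescaling to the range [0, 1], z-score normalization uses the population standard deviation, and both are applied to one series at a time. The class and method names are hypothetical.

```java
/** Illustrative sketch only; names and the per-series scope are assumptions. */
final class NormalizationSketch {

    /** Linear (min-max) rescaling of one series to [0, 1]; a constant series maps to zeros. */
    static double[] linear(double[] values) {
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    /** Z-score of one series: (value - mean) / population standard deviation. */
    static double[] zScore(double[] values) {
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= values.length;
        double variance = 0.0;
        for (double v : values) {
            variance += (v - mean) * (v - mean);
        }
        variance /= values.length;
        double std = Math.sqrt(variance);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (std == 0.0) ? 0.0 : (values[i] - mean) / std;
        }
        return out;
    }
}
```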
Chapter 5
TimeSearcher Implementation
TimeSearcher was implemented in Java 2, using the Swing toolkit for user-interface
widgets. Initial versions of TimeSearcher used the Jazz zooming toolkit [17] to provide
drawing and scenegraph control in the data and query displays, along with function-
ality for moving and rescaling timeboxes. Timeboxes, graphs of each item, and query
and data envelopes were implemented as Jazz widgets. After the first public (1.0) re-
lease of TimeSearcher, the code was redesigned to replace Jazz with Piccolo, a newer
zooming toolkit intended to replace Jazz [16].
As a research prototype, TimeSearcher is a product of more than two years of
development work, including substantial redesign. The evolutionary nature of this
growth is reflected in the design and in the code. Prospects for long-term maintenance
and growth of TimeSearcher might be improved by a redesign and implementation that
accounted for lessons learned to date.
This chapter will provide an overview of the TimeSearcher implementation, along
with a description of some of the lessons learned from the original Jazz implementa-
tion. Specific search algorithms are discussed in Chapter 6.
5.1 A Tour of the Code
As a Java application, TimeSearcher is divided into several packages, all of which fall
under the top-level package edu.umd.cs.temporalquery. The main package contains classes
needed for the basic operation of TimeSearcher: TQMain starts the program, TQCore
provides core functionality, and CmdTable and TQMenuBar provide menu and tool bar
support. A variety of sub-packages provide the bulk of TimeSearcher’s functionality:
• edu.umd.cs.temporalquery.data: DataSet is the class that holds the currently ac-
tive data set, which consists of Entity objects. DataVal, FloatVal, StringVal, and
IntVal are utility classes used for reading data from a text file into a DataSet.
• edu.umd.cs.temporalquery.graph: GraphSet is the class responsible for display-
ing the items in the data set that match the current query.
• edu.umd.cs.temporalquery.query: This package contains a variety of classes for
maintaining the state of the active set of queries. QuerySet contains the core code
for managing the queries, QueryExtremes maintains the minima and maxima for
each of the active attributes at each time point, and QueryElementFactory is used
to rebuild queries when they are loaded from a file. QueryElement is the rep-
resentation of the query associated with a timebox. VariableTimeQueryElement
and AngularQueryElement subclass QueryElement to support alternate query
semantics.
• edu.umd.cs.temporalquery.windows: Classes used to build TimeSearcher’s GUI.
TQSplitDataPane, TQControl, TQDetails, TQItemList, and TQFilter are the
sub-windows in the interface, described in more detail in Section 5.3. Pref-
Dialog is a dialog box used for preferences, and LeaderQuery is the window that
holds the leaders in a leaders & laggards view.
• edu.umd.cs.temporalquery.pwindows: This package contains Swing components
that are used as containers for Piccolo components. TQPZoom is a JPanel
that can be used to hold a Piccolo canvas. TQPZoom is subclassed by Query for
the query space and Display for the display space. SummaryFrame provides the
summary display used for queries involving multiple attributes.
• edu.umd.cs.temporalquery.piccolo: Classes that extend Piccolo classes in order
to provide the graphic support for the query space. Specific classes will be de-
scribed in detail below.
• edu.umd.cs.temporalquery.event: QueryEvent is a class that is used to package
the information associated with a query modification.
• edu.umd.cs.temporalquery.rangeslider: IntRangeSlider and FloatRangeSlider
are widgets that support the double-box sliders that TimeSearcher uses to sup-
port independent modification of the individual dimensions of a timebox.
• edu.umd.cs.temporalquery.util: A variety of support classes, including code for
doing external tasks in separate threads, logging of information, file selection
filters, popup menus, and customized widgets for tabbed panes and text entry
fields.
5.2 Data Management
5.2.1 Input File Format
TimeSearcher uses a simple, ad-hoc file format for input files. Data files are plain-
text, with commas and semicolons used as delimiters. Lines beginning with a pound
symbol (’#’) are comments.
A legal TimeSearcher data file contains a series of data lines describing the data
set as a whole, followed by the individual time series:
1. Title: describing the data set.
2. Static attributes: for each item in the data set. Each static attribute is provided as
“Name,Type”, where “Name” is the name of the attribute, and “Type” is the data
type (String, float, int, etc.). Attributes are separated by semicolons.
3. Dynamic attributes: Similar to static attributes, this line contains one entry for
each time varying value that will be measured.
4. Number of time points: The width of the time series.
5. Number of items: The number of items in the data set.
6. Time point labels: Text labels that will be associated with the time points.
7. Individual items: Each of the items in the data set will be on a line of its own, in
the following format:
(a) The static attributes for that item, in the order given above
(b) The dynamic attributes for the first time point, in the order given above
(c) The dynamic attributes for the second time point, and so on for each remaining time point.
A sample TimeSearcher data file is given in Appendix A.
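As an illustration of the attribute lines described above, the following Java sketch splits a line of semicolon-separated “Name,Type” entries and recognizes comment lines. It is not TimeSearcher's reader: the class and method names are hypothetical, the attribute names used in any example input are invented, and the complete file layout should be taken from Appendix A rather than from this fragment.

```java
/** Hypothetical helper for "Name,Type;Name,Type" attribute lines. */
final class AttributeLineSketch {

    /** Lines beginning with a pound symbol are comments. */
    static boolean isComment(String line) {
        return line.startsWith("#");
    }

    /** Returns one {name, type} pair per semicolon-separated entry. */
    static String[][] parseAttributes(String line) {
        String[] entries = line.split(";");
        String[][] attributes = new String[entries.length][];
        for (int i = 0; i < entries.length; i++) {
            String[] parts = entries[i].split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected Name,Type but found: " + entries[i]);
            }
            attributes[i] = new String[] { parts[0].trim(), parts[1].trim() };
        }
        return attributes;
    }
}
```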
5.2.2 Data Structures
Data from TimeSearcher files is read into an instance of the Java class DataSet. This
class also contains some information about the global characteristics of the data set,
such as the number and types of dynamic and static variables, and the minimum and
maximum values for each dynamic attribute at each time point. These numbers are
particularly important for creating the data envelope overviews.
The DataSet object also contains an array of Entity instances, one for each item
in the data set. Each of these instances contains the name of the object, the static
variables for that object, and values of each of the dynamic variables at each time point.
Additional fields include storage for normalized dynamic attributes, along with the
minimum and maximum values of each attribute. These values are used as a shortcut in
query evaluation: when evaluating a timebox for dynamic variable i, if the value range of the
timebox lies entirely above the maximum value of that variable (or entirely below its minimum),
the item will fail by definition, so examination of individual points can be avoided. Finally,
each entity contains a set of flags - one for each time point - that are set to be true when
the entity is contained in the result set for the current query. These flags are used to
implement the highlighting of relevant result points in the display list (Chapter 4).
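The minimum/maximum shortcut described above can be sketched as follows. This is a simplified illustration with a single dynamic variable; the class name, fields, and method signature are assumptions and do not reproduce the actual Entity class.

```java
/** Simplified stand-in for an entity with one dynamic variable (hypothetical names). */
final class EntitySketch {
    final double[] values; // value of the dynamic variable at each time point
    final double min;      // smallest value over all time points
    final double max;      // largest value over all time points

    EntitySketch(double[] values) {
        this.values = values;
        double lo = values[0];
        double hi = values[0];
        for (double v : values) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        this.min = lo;
        this.max = hi;
    }

    /** True when every value in [startTime, endTime] lies within [low, high]. */
    boolean matchesTimebox(int startTime, int endTime, double low, double high) {
        // Shortcut: a box entirely above the maximum or below the minimum cannot match.
        if (low > max || high < min) {
            return false;
        }
        for (int t = startTime; t <= endTime; t++) {
            if (values[t] < low || values[t] > high) {
                return false;
            }
        }
        return true;
    }
}
```

The shortcut costs two comparisons per item and avoids the per-point loop for items whose values never come near the timebox.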
Although simple, this arrangement for data storage is consistent with an algorith-
mic analysis that identified an optimized linear scan as the most effective approach for
query evaluation (Chapter 6).
5.2.3 Loading a Data File
The process of loading a data file begins when the user selects “Open Data File..”
from the file menu. A TQFileFilter (from the edu.umd.cs.temporalquery.util package)
is created and a file name is retrieved through a JFileChooser. The file name is used to
create a DataSet object. The DataSet reads through the metadata at the start of the file
and initializes data structures appropriately. The individual items in the data file are
retrieved by a LoadTask, which is created by DataSet. LoadTask creates a new thread,
which iterates through the file, reading each of the data lines into an Entity and up-
dating global DataSet parameters regarding extreme values at each of the time points.
When the LoadTask finishes reading the data from the file, the TQCore.doFile() procedure
completes by creating GraphSet and QuerySet objects to hold the display lists and
query space, respectively, and initializing these objects and the menu bar appropriately.
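The pattern of reading items on a separate thread while maintaining per-time-point extremes can be sketched as follows. This is an assumption-laden illustration, not the LoadTask implementation: the names are hypothetical, the rows are supplied as an in-memory array rather than read from a file, and the caller simply waits for the thread to finish, whereas a GUI application would normally be notified on completion instead of blocking.

```java
/** Hypothetical sketch of background loading with running per-time-point extremes. */
final class LoadSketch {
    final double[] min; // smallest value seen so far at each time point
    final double[] max; // largest value seen so far at each time point

    LoadSketch(int timePoints) {
        min = new double[timePoints];
        max = new double[timePoints];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
    }

    /** Folds one item's values into the running extremes. */
    void include(double[] itemValues) {
        for (int t = 0; t < itemValues.length; t++) {
            min[t] = Math.min(min[t], itemValues[t]);
            max[t] = Math.max(max[t], itemValues[t]);
        }
    }

    /** Processes all rows on a worker thread and returns once that thread has finished. */
    static LoadSketch loadInBackground(double[][] rows, int timePoints) {
        LoadSketch extremes = new LoadSketch(timePoints);
        Thread loader = new Thread(() -> {
            for (double[] row : rows) {
                extremes.include(row);
            }
        });
        loader.start();
        try {
            loader.join(); // blocking here keeps the sketch simple
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while loading", e);
        }
        return extremes;
    }
}
```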
5.3 Graphical User Interface
TimeSearcher’s GUI is implemented as a series of several Swing windows:
• TQCore is a JFrame that acts as the main application window. TQCore contains
a menu bar and a vertical JSplitPane. This split pane has a TQSplitDataPane as
its left component and a TQControl as its right component.
• TQControl is the JSplitPane on the right-hand side of the screen. It contains
another JSplitPane, which holds TQDetails as its top component and TQItemList as its
bottom component. The bottom component of TQControl is a TQFilter window.
• TQDetails is the details-on-demand display.
• TQItemList is the list of individual items by name.
• TQFilter contains the range sliders used to adjust timeboxes.
• TQSplitDataPane is the split pane on the left side of the window. The top component
of this pane contains a JTabbedPaneWithCloseIcons, which is used to hold
instances of Query - JPanels that hold the Piccolo canvases for the query spaces. The bottom
component contains the display list in a Display window, described below.
A schematic overview of the classes involved in the TimeSearcher window is given
in Figure 5.1.
[Figure 5.1 schematic labels: JToolBar; TQSplitDataPane (Query, Display); TQControl (TQDetails, TQItemList, TQFilter).]
Figure 5.1: A schematic overview of the container classes used in the TimeSearcher
GUI. The entire window is an instance of TQCore - a subclass of JFrame.
5.3.1 Piccolo Windows
TimeSearcher’s query spaces and display list are implemented using the Piccolo zoom-
ing toolkit. Although TimeSearcher does not currently provide any zooming, Piccolo’s
facilities for scenegraph management and event handling make it an ideal platform
for building applications like TimeSearcher. Furthermore, future extensions to Time-
Searcher might incorporate zooming functionality (Chapter 10).
The query and display list spaces are both implemented as subclasses of PSizable-
Canvas, a TimeSearcher class that is designed to notify its container class - generally
an instance of TQPZoom - about any resize events that would require modification
of the size of the components contained in the Canvas. As the display list and query
window must support drag-and-drop, the canvas in the display window is implemented
as a DragCanvas, a subclass of PSizableCanvas that implements DragSourceListener.
Similarly, the canvas in the query space is implemented as a DropCanvas, which
implements DropTargetListener.
[Figure 5.2 diagram classes: Query, Display, TQPZoom, PSizableCanvas, DropCanvas, DragCanvas.]
Figure 5.2: A UML-style depiction of the relationships between the classes in the
display list and query window.
TQPZoom is a subclass of JPanel that is used as a container to hold the instances
of PSizableCanvas. There are two subclasses of TQPZoom - Display for the display list
and Query for the query list. These subclasses manage the details of the display and
query spaces. A simplified UML-style schematic of the class relationships is given in
Figure 5.2.
As TimeSearcher supports simultaneous querying of multiple time-varying
attributes, there can be multiple instances of Query that are active at any given time.
These instances are stored in the JTabbedPaneWithCloseIcons that is the top compo-
nent in the TQSplitDataPane. There is, however, only one instance of Display.
Display
The Display window contains a DragCanvas that displays the graphs of the individual
items in the result set of the current query. Each graph of an item in the data set is an
instance of DataAxis, which is a subclass of Axis. Axis is a PNode that draws a pair
of axes, along with labels. DataAxis extends Axis, adding a line that plots the values
for an item in the data set. The updating and management of the graphs in the Display
window is handled by an instance of GraphSet, which creates instances of DataAxis
and displays them when they are in the result set of the active query.
The DragCanvas in the Display window has one event handler - the DisplayEven-
tHandler, a subclass of Piccolo’s PDragSequenceEventHandler. DisplayEventHandler
is used to update TimeSearcher’s display when the user mouses-over one of the graphs
in the display window. When the mouse enters one of the graphs, DisplayEven-
tHandler updates the details window to display the details for the item in question,
scrolls the item list to highlight the name of this item, and highlights the graph
overview line for the given item (if the graph overview is active).
Scrolling of this list is provided by placing the Display in a PScrollPane - Piccolo’s
version of a JScrollPane class. Event handlers in TQSplitDataPane respond
to scrolling of this window, and update the position of the scroll bar in TQItemList to
keep the two windows synchronized as necessary. Similarly, when the TQItemList is
scrolled, code in TQSplitDataPane is executed to scroll the Display window as neces-
sary.
Query
The Query panel uses a DropCanvas to display the query space. The main Piccolo
component used in the query space is the QueryAxis, a subclass of Axis. QueryAxis
includes DataEnvelope and QueryEnvelope nodes. Both subclasses of Envelope,
DataEnvelope and QueryEnvelope provide the data and query envelope overviews.
QueryAxis is also responsible for management of the instances of GraphEnvelopeN-
ode that provide the graph overviews. A variety of Piccolo handlers are used for the
creation and modification of the various classes of queries (Section 5.3.2).
Other Piccolo Classes
Leaders & laggards queries (Section 4.2) use an additional window to display the lead-
ers. When the user starts a leaders & laggards query, the top window of the TQSplit-
DataPane is replaced by a new split data pane. The bottom component of this new
pane is set to the tabbed data pane that holds the Query panels. The top component of
this new pane is set to be a new instance of LeaderQuery, a subclass of TQPZoom that
is similar to Query. LeaderAxis, LeaderEnvelopeNode, and LeaderHandler provide
facilities for the leader window, similar to those provided by QueryAxis, GraphEn-
velopeNode, and DisplayEventHandler.
The Average class provides the display of the data set averages needed for the
average query facilities (Section 4.8).
5.3.2 Interaction Handlers: Creation and Modification of Queries
Piccolo provides basic interaction handlers that are suitable for unconstrained creation,
translation, and scaling of visual objects. This free-form movement is not acceptable
for manipulation of timeboxes, which involves two important constraints:
1. Creation, movement, and scaling of timeboxes must be confined to
a bounded rectangular area.
2. The width and position of a timebox must be aligned with a discrete grid, as
determined by the number of time points in the time series. Any changes to a
timebox must respect this alignment.
The first constraint is necessary to prevent users from creating queries that are
either nonsensical or simply out of bounds (for example, queries involving time periods
outside of the range of the data set). Discrete horizontal motion is needed to clarify
the extent of the queries and the motion: as it is assumed that time series are discrete,
queries that involve values between two time points in a series are nonsensical and
should not be allowed. TimeSearcher does not allow movement or scaling operations
that modify the horizontal extent of queries by non-integral amounts, thus prohibiting
these queries.
Standard Piccolo handlers for moving and resizing objects were modified and cus-
tomized to implement these constraints. When a selected timebox is created, moved,
or scaled, these handlers examine the movement or scaling operation and adjust the
magnitude of the vertical and horizontal change to ensure that the resulting timebox
will be both within bounds and appropriately aligned in the horizontal dimension.
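The two constraints can be illustrated with a small Java sketch that snaps a horizontal pixel position to the nearest time-point column and clamps a vertical position to the plot area. This does not reproduce the modified Piccolo handlers: the names are hypothetical, and the layout (time point 0 at the left edge, with evenly spaced columns of a fixed pixel width) is an assumption.

```java
/** Hypothetical helpers for keeping a timebox in bounds and aligned to time points. */
final class TimeboxConstraintSketch {

    /** Snaps pixel x to the nearest time-point index, clamped to [0, timePoints - 1]. */
    static int snapToTimeIndex(double x, double left, double columnWidth, int timePoints) {
        int index = (int) Math.round((x - left) / columnWidth);
        return Math.max(0, Math.min(timePoints - 1, index));
    }

    /** Clamps a vertical pixel coordinate to the plot area [top, bottom]. */
    static double clampY(double y, double top, double bottom) {
        return Math.max(top, Math.min(bottom, y));
    }
}
```

Applying both functions to each edge of a dragged or resized box yields a box whose horizontal extent always covers a whole number of time points and whose vertical extent never leaves the plot.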
Each class of timeboxes has its own handler that implements the behavior nec-
essary for that class of timebox (standard timeboxes, variable time timeboxes, etc.),
while modification (moving and resizing) of all timeboxes is handled by a common
handler - the ConstrainedSelectionHandler. This handler is also responsible for pro-
cessing query modification via arrow keys on the keyboard, and for handling mouse-
over events on graph overview lines.
Piccolo event handlers are also used for the creation of timeboxes: the TimeBox-
Handler is used to create new timeboxes. When the user uses the toolbar to switch
between timebox creation mode and timebox modification mode, the TimeBoxHan-
dler is deactivated and the ConstrainedSelectionHandler is activated, and vice-versa.
An instance of TimeBoxHandler is the active event handler on the Query canvas
when TimeSearcher is in query creation mode. When a TimeBoxHandler is active, a
mouse press on the query space leads to the creation of a new TimeBox, at the loca-
tion of the press. With each subsequent mouse event, the TimeBoxHandler constrains
the bounds of the box as described above, updates the display, and calls the code in
QuerySet needed to reprocess the query.
VTTHandler and AngularHandler are subclasses of TimeBoxHandler that are used
for creation of VariableTimeTimeBox and AngularTimeBox objects for variable-time
timeboxes and angular queries, respectively. Each of the three query creation buttons
on TimeSearcher’s toolbar is responsible for activating the event handler for the appro-
priate query class.
Timeboxes and other query widgets can be translated or scaled, but not created,
when TimeSearcher is not in query creation mode. In this case, the ConstrainedSelec-
tionHandler is active. This handler supports three main functions:
• Mouse-over highlighting of graph overview items, including scrolling update of
the display and item lists.
• Selection of query widgets, either via clicking or lasso for group selection.
• Translation of query widgets via dragging or key press.
When the user drags a selected query widget, ConstrainedSelectionHandler con-
strains the translation to remain within appropriate bounds, using a strategy very sim-
ilar to the approach used in TimeBoxHandler. Once the movement is constrained,
the query widget’s position is updated by a call to its setBounds() method, and the
queryChanged() method of the QuerySet objects is executed to process the modified
query.
When a query is selected - either by direct mouse click or by lassoing - it is dec-
orated with resizing handles. These handles, which are generally subclasses of the
Piccolo class PHandle, are nodes that are added to points on the perimeter of the query
widget while it is selected. These handles can be dragged to scale and otherwise mod-
ify the parameters of the query widget.
Each type of query has its own subclass of PHandle that supports the range of
changes that can be made. TimeBox is the simplest case, with BoundsHandles pro-
viding eight handles: one on each of the four corners, and one at the midpoints of
each of the four sides. When these handles are dragged, code in TimeBox is called
to update the size and shape of the box, taking the constraints mentioned above into
account. When a bounds handle on a TimeBox is dragged, the modification implied
by this drag is applied to all selected widgets, thus supporting concurrent scaling of
selected objects.
Handles for variable-time timeboxes are slightly more complicated than the han-
dles used on standard timeboxes. In addition to the eight handles implied by Bound-
sHandles, VTTs use an additional two handles from the VariableTimeTimeBoxHan-
dles class to support resizing of the inner box.
Unlike standard and variable-time timeboxes, angular queries do not have four
sides and corners that provide natural locations for interaction handles. Instead, han-
dles for angular queries are placed on the left-hand end of the query, and on either end
of the query’s range indicator (Figure 4.18). As the default positioning of handles is
not appropriate, the AngularBoundsLocator class is used to calculate the location of
the AngularBoundsHandles, which provide the scaling functionality.
Additional details about the subclassing relationships needed to implement the var-
ious types of queries can be found in Section 5.4.
5.3.3 Display Techniques
Efficient redisplay of graphical information in both the query window and the data
items (Figure 4.1) is necessary for efficient support of dynamic queries. TimeSearcher
uses several strategies to provide the necessary performance.
Improvements to the display performance can be achieved by limiting the extent
of the display that is dynamically updated during user interaction. When a query is
being modified, the user’s attention is focused on the query space, as opposed to the
display of the individual data items. As a result, continuous updating of this display
is unnecessary. Instead, TimeSearcher updates the graph overview, the data/query
envelopes, and the list of items that match the query with each mouse event, and
saves the update of the display of the individual items until the end of the interaction.
The decay from graph overview on smaller result sets to data/query envelopes on
larger result sets (Section 4.1) provides additional performance benefits, reducing the
update requirements from O(n) individual graph lines to the four lines needed for
drawing the two contours.
The summary window used for an overview of multiple-attribute queries
(Section 4.3) is implemented as a series of Piccolo canvas (PCanvas) objects. Each of
these objects has a Piccolo camera (PCamera) that contains the graphic layer from
one of the current query spaces. The views of these cameras are scaled to provide the
miniaturized view. Each of these cameras displays the same scenegraph that is shown
in one of the active query spaces, so the summaries are directly linked to the query
space. Therefore, the summaries will be updated with each modification of the query
space, including dragging of queries and subsequent updating of the result set.
5.3.4 The transition from Jazz to Piccolo
Efficient redisplay of graphical information in both the query window and the data
items (Figure 4.1) is necessary for efficient support of dynamic queries. To provide
this support, TimeSearcher initially used a customized version of Jazz that improved
performance in certain critical areas.
In the display window, each individual graph is a separate node in the Jazz scene-
graph. To draw these items in the continuous vertical scrolling display, these nodes
must be translated and redisplayed with each query. Specifically, the nth item in the
result set must be displayed at vertical offset n ∗ k, where k is the height of each item.
For large result sets, this leads to numerous changes to the scenegraph, which must be
handled appropriately for good performance.
The default Jazz implementation treated the modification of any item in the scene-
graph as reason to update the portion of that scenegraph associated with the parent of
the item. When an item in the display list is translated, each of the other items in the
list must be updated. Thus, translation of each of the O(n) graphs leads to examination
of all of the graphs, leading to a total response time that is O(n^2).
Piccolo does not have this overhead, and can modify O(n) objects in O(n) time.
The current implementation of TimeSearcher uses an unmodified version of the Pic-
colo libraries, which should ease maintenance and future development.
Piccolo also provides interaction handlers that are significantly smaller and simpler
than those provided in Jazz. As discussed in Section 5.3.2, TimeSearcher requires aug-
mentation of interaction handlers to provide functionality not usually found in zooming
toolkits. As a result of Piccolo’s improved design and greater parsimony, the Piccolo
version of TimeSearcher is significantly smaller than the Jazz version (14K lines vs
20K lines).
Piccolo is also used to draw the timeboxes, data and query envelopes, and the graph
envelope lines in the query space. Each graph envelope line is a separate node in the
Piccolo scenegraph. This imposes a significant overhead for large data sets, but is
necessary for support of mouse-over highlighting and linking with other application
windows.
5.4 Query Processing
TimeSearcher provides dynamic query updates by recomputing query results with ev-
ery modification to any timebox (including VTTs and angular queries) involved in the
current query. As described above (Section 5.3.3), this re-processing is somewhat in-
cremental. As the mouse is dragged during a resize (scaling) or translation (movement)
operation, the graph overview, query envelope, and item list will be updated, but the
display list will not be updated until the mouse operation is completed. More specifically, the
processing of a timebox query proceeds as follows:
1. A mouse event or key press indicates a creation, translation, or scaling of an
instance of the TimeBox class.
2. An instance of QueryElement is created. This instance converts the screen repre-
sentation of the timebox to a set of coordinates that represent the query in terms
of starting and ending times and value extents in the range of the current data
set. This QueryElement is associated with the timebox.
3. Each item in the data set is checked against the current query, using one of the
linear search algorithms described in Chapter 6. If the item matches the query,
the appropriate flags are set, indicating the time points in the item that match the
query. An additional flag is set to indicate that the graph overview for this item
should be shown. Finally, statistics needed for updating the query envelope are
updated to include this item.
4. After all items in the data set have been checked, the number of items that match
the query is checked to see if the size of the result set is small enough to display
individual graph overview lines (Section 4.1). If the result set size is small enough,
the graph overview lines are displayed. If not, only the data and query envelopes
will be displayed.
5. When the mouse is released, indicating the completion of the query, each of
the items that match the query is displayed in the display list window, and the
envelopes are updated.
A flowchart of this process is given in Figure 5.3.
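Steps 3 and 4 above can be sketched in Java as a single linear pass followed by a threshold test. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the query is a single timebox over one dynamic variable, and the threshold value is chosen by the caller rather than taken from TimeSearcher.

```java
/** Hypothetical sketch of one query pass: flag matches, then choose the overview style. */
final class QueryPassSketch {

    /** Flags each item whose values stay within [low, high] at every time point in [start, end]. */
    static boolean[] findMatches(double[][] items, int start, int end, double low, double high) {
        boolean[] matches = new boolean[items.length];
        for (int i = 0; i < items.length; i++) {
            boolean ok = true;
            for (int t = start; t <= end && ok; t++) {
                ok = items[i][t] >= low && items[i][t] <= high;
            }
            matches[i] = ok;
        }
        return matches;
    }

    /** Counts the items in the result set. */
    static int count(boolean[] matches) {
        int n = 0;
        for (boolean m : matches) {
            if (m) {
                n++;
            }
        }
        return n;
    }

    /** Individual graph overview lines are drawn only when the result set is below the threshold. */
    static boolean showGraphOverview(int resultSize, int threshold) {
        return resultSize < threshold;
    }
}
```

Because this pass runs on every mouse event, only the inexpensive outputs (match flags and the overview decision) are computed here; the display list itself is refreshed when the interaction ends, as in step 5.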
A minor modification of this approach is necessary for queries that involve si-
multaneous modification to multiple timeboxes. For these queries, new instances of
QueryElement are first created for each timebox before iterating through each of the
items in the data set. Thus, the simultaneous modification of several timeboxes re-
quires recalculation of each of the QueryElement instances, but no additional overhead
(relative to modification of a single timebox) is involved.
The QueryElement class essentially acts as a model in the sense of the Model-
View-Controller architecture often used in GUI implementations. As the visible, on-
screen representation of the query, the TimeBox class, in conjunction with the appro-
priate event handlers, provides both the view (the graphic display) and the controller
(handles for modification) of a timebox query. The QueryElement class converts the
graphical coordinates of the TimeBox instance into a meaningful query, thus providing
the model.
[Figure 5.3 flowchart nodes: Input Event Modifies Query; Create QueryElement; Find Matches, Update statistics; Result set size below Threshold? (Yes/No); Display Graph Overview; Update Data, Query Envelopes; Modifications Complete? (Yes/No); Update Display List.]
Figure 5.3: The steps involved in TimeSearcher query processing.
The QueryElement class is also necessary for creating an appropriate TimeBox to
represent a query object. This happens, for example, when a saved query is read from
a file: a QueryElement is created, and then the dimensions of the active query screen
are used to translate this QueryElement into the appropriate TimeBox.
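The two directions of this translation, from screen coordinates to query coordinates and back, can be illustrated with the following Java sketch. It is not the QueryElement code: all names are hypothetical, and it assumes that the first and last time points sit at the left and right edges of the plot area and that screen y coordinates grow downward.

```java
/** Illustrative mapping between screen and query coordinates (all names hypothetical). */
final class ScreenToQuerySketch {
    final double left, top, width, height; // plot area in pixels
    final int timePoints;                  // number of time points on the horizontal axis
    final double minValue, maxValue;       // value range on the vertical axis

    ScreenToQuerySketch(double left, double top, double width, double height,
                        int timePoints, double minValue, double maxValue) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
        this.timePoints = timePoints;
        this.minValue = minValue;
        this.maxValue = maxValue;
    }

    /** Nearest time-point index for a horizontal pixel position. */
    int timeIndexAt(double x) {
        return (int) Math.round((x - left) / width * (timePoints - 1));
    }

    /** Data value for a vertical pixel position; the top of the plot is the maximum value. */
    double valueAt(double y) {
        return maxValue - (y - top) / height * (maxValue - minValue);
    }

    /** Inverse mapping, as needed when a saved query is turned back into an on-screen box. */
    double xAt(int timeIndex) {
        return left + timeIndex * width / (timePoints - 1);
    }
}
```

Storing the query in data coordinates rather than pixels is what allows a saved query to be redrawn correctly on a query screen of a different size.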
5.5 Extending Timeboxes
The query processing code has been designed to be object-oriented and extensible to
support new types of timebox queries. The TimeBox and QueryElement classes pro-
vide support for processing basic timeboxes, and contain all code for handling queries.
Specifically, all of the code used for determining whether or not an entity in a data set
matches a query can be found in the QueryElement class. New timebox queries can be
created by subclassing QueryElement, TimeBox, and TimeBoxHandler for creation of
the timeboxes.
The implementation of variable time timeboxes (Section 3.2) provides a road map
that can be used to create other types of extended timeboxes. The VTTHandler class
extends TimeBoxHandler in two important ways:
• The setLimits procedure calls VariableTimeTimeBox.setLimits, informing this
class of the constraints of the current query space.
• The createTimeBox procedure is over-ridden to return the appropriate subclass
of TimeBox.
The subclass of TimeBox is known as VariableTimeTimeBox. This class contains
a great deal of support code for managing the manipulation of the inner constraints,
but this will not be needed in all cases. In general, the minimal requirements in an
extended timebox class will be:
• A paint procedure to appropriately render the box.
• A createQueryElement procedure that returns an instance of the appropriate sub-
class of QueryElement.
Finally, a subclass of QueryElement will be needed. This subclass will need to
over-ride createTimeBox, creating an instance of the appropriate subclass of Time-
Box, along with getCopy. The extended search semantics can be specified by over-
riding matchEntityAll and matchEntityAny, which are the procedures called to deter-
mine whether an entity matches the timebox’s constraints for all of the points in the
given interval, or simply for any of those points. As “any” queries are not particularly
meaningful with variable time timeboxes, the current implementation does not
override matchEntityAny. Instead, the “any” choice is disabled for variable time timeboxes.
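The subclassing pattern can be illustrated with a deliberately simplified Java sketch. The base class below is a stand-in rather than the real QueryElement, the method signatures are assumptions, and the tolerant variant is an invented example of extended search semantics (it is not one of TimeSearcher's query types). Only matchEntityAll is overridden, mirroring the approach described for variable time timeboxes.

```java
/** Simplified stand-in for a basic timebox query (signatures are assumptions). */
class QueryElementSketch {
    final int start;   // first time point covered by the box
    final int end;     // last time point covered by the box
    final double low;  // lower value bound
    final double high; // upper value bound

    QueryElementSketch(int start, int end, double low, double high) {
        this.start = start;
        this.end = end;
        this.low = low;
        this.high = high;
    }

    boolean inRange(double v) {
        return v >= low && v <= high;
    }

    /** True when every point in the interval satisfies the value constraint. */
    boolean matchEntityAll(double[] values) {
        for (int t = start; t <= end; t++) {
            if (!inRange(values[t])) {
                return false;
            }
        }
        return true;
    }

    /** True when at least one point in the interval satisfies the value constraint. */
    boolean matchEntityAny(double[] values) {
        for (int t = start; t <= end; t++) {
            if (inRange(values[t])) {
                return true;
            }
        }
        return false;
    }
}

/** Invented variant: tolerates up to maxViolations out-of-range points in the interval. */
class TolerantQueryElementSketch extends QueryElementSketch {
    final int maxViolations;

    TolerantQueryElementSketch(int start, int end, double low, double high, int maxViolations) {
        super(start, end, low, high);
        this.maxViolations = maxViolations;
    }

    @Override
    boolean matchEntityAll(double[] values) {
        int violations = 0;
        for (int t = start; t <= end; t++) {
            if (!inRange(values[t]) && ++violations > maxViolations) {
                return false;
            }
        }
        return true;
    }
}
```

Code that evaluates queries through the base type picks up the new semantics without modification, which is the benefit of placing all matching logic in one overridable class.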
Additional extensions may be needed for some timebox variants. As described
above (Section 5.3.2), queries may need to create interaction handles and handle loca-
tors that match their specific interaction needs and geometries.
Of course, appropriate tool bar and/or menu entries will also be needed to support
switching into the appropriate modes.
5.6 Performance
Information visualization tools strive to provide highly-interactive performance for
increasingly large data sets. Although 100ms response time is the goal, this is not
always possible for very large data sets. Approximate quantification of the performance
of a tool provides a rough understanding of its limits.
Synthetic data sets of various sizes were constructed for evaluation of Time-
Searcher’s performance. For each data set, several operations were conducted:
• Creation of three queries
• Several modifications to those queries
• Deletion of the queries
• A drag-and-drop query.
For each query, the total processing time - including identification of matching items
and all screen updates - was measured. This value was averaged across all queries, for
an average query processing time for each data set.
Data sets with 1000, 10000, 25000, and 50000 items, each with both 100 and 200 time
points, were created. A data set with 100,000 items was also used with 100 time points.
TimeSearcher was unable to handle a data set with 100,000 items and 200 time points,
as this data set exhausted available RAM on the test computer. Graph overviews were
turned off in all cases. All tests were run on a 1.33 GHz Pentium III-compatible with
512MB RAM, running Mandrake Linux 8.0. Average response times across all query
types are given in Figure 5.4 and Table 5.1.
These results show that TimeSearcher’s performance scales linearly with the num-
ber of items in the data set. In fact, for both 100 and 200 time points, the correlation
was almost perfect. These results can be used to generate a regression that would pre-
dict the performance of TimeSearcher on data sets of various sizes: if t is the query
processing time and n is the number of items in the data set, the regression equations
are as follows:
• 100 time points: $t = 13.77 + 0.0043n$ ($r^2 = 0.99$).
• 200 time points: $t = 16.34 + 0.0057n$ ($r^2 = 0.99$).
[Plot: average query processing time (ms) versus number of items in data set; series: 100 time points, 200 time points.]
Figure 5.4: Average times for TimeSearcher to completely process queries - including
search and display update - on several query types. Results are shown for data sets
of 1000, 10000, 25000, and 50000 items with 100 and 200 time points, and 100,000
items with 100 time points only.
Average Total Query Processing Time (ms)
Number of Items    100 time points    200 time points
1000               17                 10
10000              56                 90
25000              123                157
50000              238                301
100000             449
Table 5.1: Raw performance data.
It is interesting to note that the performance does not seem to scale linearly with
the number of time points. This is consistent with the algorithmic analysis (Chapter 6),
which showed that the performance of the TimeSearcher search algorithm was
relatively insensitive to the number of time points in the data set.
Despite the high correlations, these results are fairly limited in their applicability
and generality. The tasks used to generate these results were not rigorously controlled,
and the specific timing values are not generalizable beyond the computer system that
was used to run the test.
However, these results do provide a rough measure of the scalability of TimeSearcher.
With the computer used in this test, 100ms performance is only likely to be
possible on data sets of fewer than 25000 items. However, performance with 50,000 or
even 100,000 items is not unreasonably slow. As hardware performance continues to improve,
100ms performance with data sets containing 100,000 items may soon be possible.
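As a rough illustration of how the regression equations reported in this section bound the data set size for a given response-time target, the following sketch inverts $t = a + bn$ to solve for $n$. The function name max_items_for_budget is hypothetical, and the coefficients are simply those quoted above; the result inherits all of the limitations already noted for these measurements.

```python
def max_items_for_budget(intercept_ms, slope_ms_per_item, budget_ms=100.0):
    """Largest n for which intercept + slope * n stays within budget_ms."""
    return (budget_ms - intercept_ms) / slope_ms_per_item


# Coefficients taken from the regression equations reported in this section.
n_100 = max_items_for_budget(13.77, 0.0043)  # data sets with 100 time points
n_200 = max_items_for_budget(16.34, 0.0057)  # data sets with 200 time points
print(round(n_100), round(n_200))
```

Under these fits, the 100ms budget is reached at roughly 20,000 items for 100 time points and roughly 14,700 items for 200 time points, which is in line with the estimate that 100ms performance is only likely below 25000 items.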
From an implementation viewpoint, understanding of the components of these
query processing times is most useful for identification of processing bottlenecks and
opportunities for optimization. Specifically, understanding of the costs of display up-
dates relative to the other components of query processing will be helpful both for
understanding the potential impact of improvements in rendering and for identifying
areas which would be the most fruitful targets for optimization.
A rough breakdown of the contributions of these components was created by running
several operations - including creation, modification, and deletion of timeboxes
- over a variety of smaller data sets. Unlike the previous tests, these trials were
conducted with graph overviews turned on. This was necessary to get a “worst case”
picture of the costs of updating the TimeSearcher display. As these results are not based
on a carefully controlled set of query operations, they are not meant to be interpreted
as definitive. Rather, they are designed to give a rough picture of where time is being
spent in processing queries. Results of this analysis are given in Table 5.2.
Data Set                      Display Time    Total Processing    Display Portion
223 items, 13 time points     108             609                 18%
1489 items, 30 time points    215             971                 22%
1430 items, 52 time points    214             1392                15%
1430 items, 52 time points    279             2150                13%
Table 5.2: Portion of query processing time spent on updating display, for sample
queries on some data sets. All times are in ms.
These preliminary results seem to indicate that display is a relatively small part of
the overall costs of query processing. Therefore, improvements in rendering should not
be expected to substantially improve TimeSearcher’s query performance, and optimization
efforts should focus on the search algorithms and related code.
Chapter 6
Search Algorithms
To provide users with rapid, incremental feedback, dynamic query tools must meet
stringent performance requirements. Specifically, queries must be processed within a
100ms update cycle if updates are to appear to be instantaneous [117]. To meet this
goal, developers of dynamic query tools must use efficient techniques to achieve a
high level of performance in two key areas:
• Display: Updating the graphic display to show only those items that match the
query, limiting display updates to areas that will occupy the user’s visual atten-
tion, and other techniques have been used to minimize the overhead of repeatedly
updating the complex displays found in information visualization environments.
• Search: Identifying the subset of items that match the current query requires ap-
propriate indices that support incremental queries. Although a variety of indices
and strategies have been evaluated [70, 129], the choice of search algorithm is
strongly influenced by the specific details of the problem being addressed by a
given system.
Of course, these optimizations become more important as the data sets grow larger.
Strategies used to improve performance of the display components of Time-
Searcher are described in Chapter 5. This chapter focuses on the performance of search
in TimeSearcher. After defining the problem, this chapter introduces several possible
alternative algorithms, and describes an analysis of their performance, using a common
testing platform. This analysis led to the initially surprising conclusion that a relatively
simple sequential search outperformed more sophisticated alternatives. Further explo-
rations aimed at resolving this seeming paradox led to a deeper understanding of the
problem, which might be used as the basis for further examination of potential search
strategies.
It should be noted that the analysis in this chapter applies only to standard time-
boxes. Although search algorithms used for variable time timeboxes (Section 3.2) and
angular queries (Section 4.7) are discussed in this chapter, analysis of their perfor-
mance remains a possible area for future investigation.
6.1 Problem Definition
The search problem presented by timebox queries is found in the definition of a
timebox (given in Chapter 3 and repeated here for convenience). Specifically:
• A set $T$ of time series profiles $t_1 \ldots t_n$, each containing values for each of $m$ time
points. The value of $t_i$ at time $t$ is denoted by $t_i(t)$.
• A timebox is a 4-tuple $b = (t_{min}, t_{max}, v_{min}, v_{max})$. Without loss of generality,
we assume that $v_{max} \geq v_{min}$ and $t_{max} \geq t_{min}$.
• Time series profile $t_i$ satisfies timebox $b$ if $\forall t,\ t_{min} \leq t \leq t_{max}: v_{min} \leq t_i(t) \leq v_{max}$.
In this case, we say that $S(t_i, b) = true$.
This definition naturally extends to queries formed as conjunctions of multiple
timeboxes: $t_i$ satisfies a set of timeboxes $B = b_1 \ldots b_n$ (written $S(t_i, B)$) if and only if
$\forall b_j \in B,\ S(t_i, b_j) = true$. In the following discussion, we will consider the problem of
identifying the items $t_i \in T$ that satisfy a set of timeboxes $B$.
The width of the data set - m - and the number of items in the data set n are the
primary influences on search performance.
User interface requirements for TimeSearcher present another constraint that must
be met by any search algorithm. As TimeSearcher presents items in a set linear order,
search results from any query should be presented in a manner that retains the original
relative order of items in the result set. This ordering will provide consistency that will
help users interpret search results.
6.2 Sequential Search
The naive approach to timebox searching follows directly from the definition of the
problem.
Search Algorithm 1 SEQ NAIVE
• For each $t_i \in T$, check to see if it satisfies each $b_j \in B$.
• To see if $t_i$ satisfies $b_j = (t_{min}, t_{max}, v_{min}, v_{max})$, check $t_i(t)$ for $t_{min} \leq t \leq t_{max}$ to
see if $v_{min} \leq t_i(t) \leq v_{max}$. If this is true for each $t$ in the given range, $S(t_i, b_j) = true$.
• If $S(t_i, b_j) = true$ for all $b_j \in B$, $S(t_i, B) = true$.
In other words, we simply perform the expected iteration of the three loops: for
each of the items in the data set, we look at all of the time points in all of the boxes.
The conjunctive nature of timeboxes leads directly to the first optimization on this
scheme. If some $t_i$ fails to meet the constraint for $b_j$ at some time $t$, we know that
$S(t_i, b_j) = false$, even if we have not completely processed the range of times
$t_{min} \leq t \leq t_{max}$. Thus, processing for any $(t_i, b_j)$ pair can stop as soon as one value
outside of the given range is encountered. This is equivalent to the familiar programming
language shortcut used in evaluation of conjunctive conditionals.
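The sequential scan with early termination can be sketched as follows. This is a minimal illustration rather than TimeSearcher's actual code: the function names, and the representation of each time series as a list indexed by time point and of each timebox as a tuple, are assumptions made for the example.

```python
def satisfies(series, box):
    """Return True if every value of series in [t_min, t_max] lies in [v_min, v_max].

    series is a list indexed by time point; box is (t_min, t_max, v_min, v_max).
    """
    t_min, t_max, v_min, v_max = box
    for t in range(t_min, t_max + 1):
        if not (v_min <= series[t] <= v_max):
            return False  # early exit: one out-of-range value settles the result
    return True


def seq_search(items, boxes):
    """Indices of the items that satisfy all boxes, in original data set order."""
    # all() also short-circuits, so remaining boxes are skipped after a failure.
    return [i for i, series in enumerate(items)
            if all(satisfies(series, b) for b in boxes)]


items = [[1, 5, 6, 7, 2], [1, 5, 9, 7, 2], [4, 5, 5, 5, 4]]
print(seq_search(items, [(1, 3, 4, 8)]))
```

Because the outer loop walks the items in their stored order, the result list preserves the original relative ordering without a separate sort.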
Additional, less obvious, optimizations can be applied to conjunctive queries. In
this scenario, we have a set $B = b_1 \ldots b_n$ of queries and a set $R \subseteq T$ of items such that
$S(t_i, B) = true$ for each $t_i \in R$. A change is made to the query: either
• A new timebox $b_{n+1}$ is added to $B$, leading to $B' = B \cup \{b_{n+1}\}$.
• Timebox $b_j \in B$ is deleted, forming $B'' = B \setminus \{b_j\}$.
• Timebox $b_j \in B$ is modified.
The creation and deletion of queries each present opportunities for optimization.
New Queries: Here, we note that if $S(t_i, B) = false$ then there must be some $b_j \in B$
where $S(t_i, b_j) = false$. Therefore, the addition of $b_{n+1}$ to form $B'$ cannot make $S(t_i, B')$
true. In practical terms, this means that when we add $b_{n+1}$, we must examine only those
$t_i$ such that $S(t_i, B) = true$, to see if $S(t_i, b_{n+1}) = true$. If it is, then $S(t_i, B') = true$.
Deletions: This case is the flip side of query creation. If we have an item $t_i$ such that
$S(t_i, B) = true$, $S(t_i, B'')$ is true by definition - removing a timebox from the current
set makes the query less restrictive. Therefore, we need only examine those $t_i$ where
$S(t_i, B) = false$ to see if $S(t_i, B'') = true$.
For the time being, we assume that in the remaining case of query modification, we
must re-examine all entities and timeboxes. An alternative that avoids this overhead is
discussed in Section 6.5.5.
Based on these optimizations, we define the improved sequential algorithm, which
assumes that we are changing an existing set of queries $B$ through the creation, deletion,
or modification of a timebox $b_j$:
Search Algorithm 2 SEQ OPTIMIZED
• Begin by assuming that all items meet the (initially null) query
• For each change to the query, respond as follows:
– Creation of a timebox: Check all items that satisfied the previously existing
query. If they satisfy the new timebox, they satisfy the new query. All others
do not.
– Deletion of a timebox: All items that did not satisfy the previously existing
query should be checked against all remaining timeboxes in the query ($B''$).
The items that satisfy these timeboxes are added to the set of items that
satisfied the original query ($B$), to make the set of results of the new query.
– Modification of a timebox: All of the items in the data set are compared
against all of the timeboxes to find the items that match the query.
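The three cases of the optimized sequential algorithm can be sketched as a small stateful helper. The class name IncrementalQuery and its methods are hypothetical, and the list-and-tuple data representation is an assumption for illustration; the sketch only mirrors the bookkeeping described above.

```python
def satisfies(series, box):
    """True if all values of series in [t_min, t_max] lie in [v_min, v_max]."""
    t_min, t_max, v_min, v_max = box
    return all(v_min <= series[t] <= v_max for t in range(t_min, t_max + 1))


class IncrementalQuery:
    """Hypothetical helper tracking which items match the current timeboxes."""

    def __init__(self, items):
        self.items = items
        self.boxes = []
        # Every item matches the (initially empty) query.
        self.matches = [True] * len(items)

    def add_box(self, box):
        # Creation: only items that matched before can still match.
        self.boxes.append(box)
        for i, series in enumerate(self.items):
            if self.matches[i]:
                self.matches[i] = satisfies(series, box)

    def delete_box(self, box):
        # Deletion: previous matches stay; re-check only previous non-matches.
        self.boxes.remove(box)
        for i, series in enumerate(self.items):
            if not self.matches[i]:
                self.matches[i] = all(satisfies(series, b) for b in self.boxes)

    def modify_box(self, old, new):
        # Modification: fall back to a full re-scan of all items and boxes.
        self.boxes[self.boxes.index(old)] = new
        self.matches = [all(satisfies(s, b) for b in self.boxes)
                        for s in self.items]

    def results(self):
        # One flag per item keeps results in the original data set order.
        return [i for i, ok in enumerate(self.matches) if ok]
```

Storing one boolean per item, rather than a growing list of matches, is a design choice made here so that the ordering requirement from Section 6.1 is met without sorting.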
Although the analysis in this chapter is based on this optimized sequential scan
algorithm, the implementation in TimeSearcher does not take advantage of the opti-
mization for query deletion. This is due to the need to update the result display to
indicate which time points in an item match the query. When a timebox $b_j$ is deleted,
the time points in a given item that match the remaining query must be updated to
reflect the removal of $b_j$. This requires a pass through all items, thus rendering the
optimization irrelevant. In practice, this does not appear to significantly impact the
performance of TimeSearcher.
6.3 Sequential Search for Timebox Extensions
Variants of the sequential search algorithm are used to process variable time timebox
queries (Section 3.2) and angular queries (Section 4.7). Although these algorithms are
not analyzed in this chapter, they are outlined here for completeness.
6.3.1 Variable Time Timeboxes
Variable time timeboxes (VTTs) differ from ordinary timeboxes in that values must
be in a given range for at least R consecutive measurements in a wider interval (Sec-
tion 3.2).
Sequential scan processing of these queries is similar to processing of standard
timeboxes. For each entity, processing begins at the start of the larger window defined
by the VTT and steps along until the end of the interval. If the value at a time point
falls within the given range, a counter is incremented. When that counter reaches the
width of the VTT (R), the item matches the VTT. However, if the value falls outside of
the given range, the counter is reset to zero. Since an entity in the data set can match
a VTT during multiple, possibly disjoint intervals, processing proceeds until the entire
interval has been checked - there is no “falling out” of the loop as there is with the
basic sequential scan.
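A minimal sketch of this counter-based scan follows. The function name vtt_matches and the choice to return the matching runs as (start, end) pairs are assumptions for illustration; the sketch treats a run as matching once it contains at least R (here run_length) consecutive in-range values.

```python
def vtt_matches(series, t_start, t_end, v_min, v_max, run_length):
    """Return (start, end) runs of at least run_length consecutive in-range points.

    The whole window [t_start, t_end] is scanned without early exit, since an
    item may match during several disjoint intervals.
    """
    runs = []
    count = 0
    for t in range(t_start, t_end + 1):
        if v_min <= series[t] <= v_max:
            count += 1
        else:
            if count >= run_length:
                runs.append((t - count, t - 1))
            count = 0  # an out-of-range value resets the counter
    if count >= run_length:
        runs.append((t_end - count + 1, t_end))  # run that ends with the window
    return runs


series = [5, 5, 1, 5, 5, 5, 1, 5]
print(vtt_matches(series, 0, 7, 4, 6, 2))
```

An item satisfies the VTT whenever the returned list is non-empty.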
The additional checks required for this algorithm make it possible that VTT eval-
uation will be significantly slower than standard timebox evaluation. Although sys-
tematic evaluation has not been conducted, informal evaluation seems to indicate that
VTT evaluation is sufficiently fast for moderately-sized data sets.
6.3.2 Angular Queries
Angular queries (Section 4.7) involve comparison of the angles formed between the
horizontal and segments connecting values for a given item. Specifically, the angle
formed with the horizontal by a line segment connecting values within the query interval
must be within a given range. There are two interpretations: “all points” angular
queries require that each transition within a given interval fall within the specified
range, while “end points only” angular queries only require that the segment between the
values at the start and end points of the interval make an angle that falls within the
desired range.
Processing of “all points” angular queries proceeds via a sequential algorithm that
is analogous to the approach used for standard timeboxes. For a given item, the angle
formed by each transition within the range is calculated, and checked to see if it falls
within the desired range. As soon as one angle falls outside of the range, the item has
failed to satisfy the query and processing for that item is complete. As there are n−1
transitions in an angular query covering n time periods, angular queries require one
fewer check than would be required for a standard timebox of the same width.
“End points only” angular queries are even simpler, requiring only the calculation
of the angle formed by the segment between the value at the start point and the value
of the end point. This can clearly be done in constant time, regardless of the width of
the interval.
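Both interpretations can be sketched briefly. The function names are hypothetical, and the sketch assumes that angles are measured in degrees in data units with time points spaced one unit apart; the source does not state how angles are scaled, so this choice is only illustrative.

```python
import math


def transition_angle(v0, v1, dt=1):
    """Angle in degrees between the horizontal and the segment (0, v0)-(dt, v1)."""
    return math.degrees(math.atan2(v1 - v0, dt))


def angular_all_points(series, t_min, t_max, a_min, a_max):
    """Every transition in [t_min, t_max] must have an angle in [a_min, a_max]."""
    for t in range(t_min, t_max):  # n - 1 transitions for n time points
        if not (a_min <= transition_angle(series[t], series[t + 1]) <= a_max):
            return False  # early exit, as with standard timeboxes
    return True


def angular_end_points(series, t_min, t_max, a_min, a_max):
    """Only the segment from the start value to the end value is checked."""
    angle = transition_angle(series[t_min], series[t_max], t_max - t_min)
    return a_min <= angle <= a_max


series = [0, 1, 2, 1, 4]
print(angular_all_points(series, 0, 4, 40, 50),
      angular_end_points(series, 0, 4, 40, 50))
```

In this example the series rises overall at 45 degrees, so the “end points only” query matches, but the dip between time points 2 and 3 causes the “all points” query to fail.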
6.4 Geometric Methods
Geometric approaches to processing timebox searches are based on an alternative in-
terpretation of the data set. A set of n time series profiles, each containing measure-
ments for each of m time points can also be interpreted as a set of mn points in a
2-dimensional space. Each of these mn points is associated with one of the n profiles,
such that each profile has exactly one associated point for each time point. Under this
interpretation, a timebox can be seen as a two-dimensional orthogonal range query - a
query aimed at identifying the points that fit inside the rectangular region covered by
the query.
In other words, a time series $t_i$ satisfies a timebox $b$ if all of the values for $t_i$ during
the time range covered by $b$ are within the value range specified by $b$.
Alternatively, we can define $S(t_i, b)$ in terms of the number of points in $t_i$ in the time
range that fall into the appropriate value range. Let $C(b)$ be the width of the timebox:
$C(b) = t_{max} - t_{min} + 1$. Furthermore, let $C(t_i, b)$ be the number of points in $t_i$ that fall
within the constraints of the query: $C(t_i, b) = |Q|$, where
$Q = \{t \mid t_{min} \leq t \leq t_{max},\ v_{min} \leq t_i(t) \leq v_{max}\}$. In this case, we say that $S(t_i, b)$ holds
if and only if $C(t_i, b) = C(b)$, that is, if the number of points of $t_i$ that fall within the
timebox is equal to the number of time points contained in the timebox.
Figure 6.1 demonstrates this model. The timebox is three time intervals wide. The
upper line is one entity that has values within the timebox for all three time points, so
the entity will be included in the result set. The lower entity, however, has only two
out of the three values in the needed range, and therefore is not included in the result
set.
Figure 6.1: Example of entities that meet (upper) and fail to meet (lower) the
constraints of a timebox.
Thus, to process a timebox query, we start by identifying the points that fall within
this query. For each of these points, we increment a counter associated with the timebox
and the time series to which that point belongs. When all of the points have been
processed, the items that have a count for the timebox that is equal to the width of the
timebox ($C(t_i, b) = C(b)$) are the matches:
Search Algorithm 3 Geometric Basic
To process a timebox $b$:
• Initially, assume $\forall t_i \in T,\ C(t_i, b) = 0$.
• For each of the points $p$ that fall within the timebox $b$, increment the counter
$C(t_i, b)$ for the time series $t_i$ to which $p$ belongs.
• The set of items that match the query is given by $R = \{t_i \mid t_i \in T,\ C(t_i, b) = C(b)\}$.
This approach can be extended to conjunctive queries containing multiple timeboxes
by simply maintaining a separate counter $C(t_i, b)$ for each (item, timebox) pair.
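The counting scheme can be sketched as follows for a single timebox. The names are hypothetical, and range_query here is a brute-force filter standing in for a real geometric index; it is included only so that the counting and the final ordered pass can be run.

```python
from collections import Counter


def range_query(points, box):
    """Stand-in for a geometric index: yield the points inside the rectangle."""
    t_min, t_max, v_min, v_max = box
    for t, v, item in points:
        if t_min <= t <= t_max and v_min <= v <= v_max:
            yield t, v, item


def geometric_search(n_items, points, box):
    """Indices of items whose in-box point count equals the timebox width."""
    t_min, t_max, _, _ = box
    width = t_max - t_min + 1  # C(b)
    # C(t_i, b): number of points of each item that fall inside the box.
    counts = Counter(item for _, _, item in range_query(points, box))
    # A final pass in data set order keeps the results ordered.
    return [i for i in range(n_items) if counts[i] == width]


items = [[1, 5, 6, 7, 2], [1, 5, 9, 7, 2], [4, 5, 5, 5, 4]]
# Each point is (time, value, item index): mn points in total.
points = [(t, v, i) for i, series in enumerate(items) for t, v in enumerate(series)]
print(geometric_search(len(items), points, (1, 3, 4, 8)))
```

For conjunctive queries, one such counter would be kept per (item, timebox) pair, with an item reported only when every counter reaches the width of its timebox.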
After all of the points in the range query have been processed, a separate pass
through all of the items in the data set is needed to identify those items that match the
query. This pass is necessary for two reasons. First, if we are to maintain the ordering
of the items in the data set, including items in the result set as their individual counts
reach the specified threshold would not be sufficient, as the resulting list would be
unordered and require sorting. A linear pass through the list of items in the data set
would be more efficient. During this pass, the optimizations used in Algorithm 2 can
be used for query creations and deletions.
Deletions of timeboxes present another reason for this separate pass through the
entire data set. When a timebox is deleted, an item $t_i$ may or may not match the other
timeboxes in the query. Therefore, a separate check will be needed to see if each item
belongs in the result set. Similar concerns exist for modifications (moving and scaling) of
queries.
The cost of the separate pass through the data set can be ameliorated by the use
of the optimizations from the optimized sequential algorithm (Algorithm 2).
Further optimizations that minimize the area that must be queried are possible for
query modifications. Taking a cue from clipping algorithms from computer graphics,
we observe that small adjustments to a timebox - either in location or scale - can lead
to a new box that has significant overlap with the previous box. In these cases, query
processing can simply add the points in the areas that are added to the query, and
eliminate points from areas that are removed. The appropriate areas can be quickly
identified using a constant-time clipping approach (Figure 6.2), thus eliminating any
redundant processing.
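One way to identify the added and removed regions is a rectangle difference, sketched below. The function name rect_difference is hypothetical, and the use of half-open rectangles is an assumption made so that the pieces are disjoint; the source states only that a constant-time clipping step finds the regions.

```python
def rect_difference(a, b):
    """Split (a minus b) into at most four disjoint rectangles.

    Rectangles are half-open: (x0, x1, y0, y1) covers x0 <= x < x1, y0 <= y < y1.
    """
    ax0, ax1, ay0, ay1 = a
    bx0, bx1, by0, by1 = b
    # Overlap of a and b, if any.
    ox0, ox1 = max(ax0, bx0), min(ax1, bx1)
    oy0, oy1 = max(ay0, by0), min(ay1, by1)
    if ox0 >= ox1 or oy0 >= oy1:
        return [a]  # no overlap: all of a remains
    parts = []
    if ax0 < ox0:
        parts.append((ax0, ox0, ay0, ay1))  # strip left of the overlap
    if ox1 < ax1:
        parts.append((ox1, ax1, ay0, ay1))  # strip right of the overlap
    if ay0 < oy0:
        parts.append((ox0, ox1, ay0, oy0))  # strip below, within overlap columns
    if oy1 < ay1:
        parts.append((ox0, ox1, oy1, ay1))  # strip above, within overlap columns
    return parts


old_box = (0, 4, 0, 4)
new_box = (2, 6, 2, 6)
added = rect_difference(new_box, old_box)    # regions newly covered by the query
removed = rect_difference(old_box, new_box)  # regions no longer covered
print(added, removed)
```

Only the points in the added and removed regions need to be processed; the overlap is never enumerated. The number of comparisons does not depend on the size of the data set, which is what makes the clipping step constant-time.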
Implementation of geometric methods requires an appropriate index for efficient
handling of the range queries. Two possibilities - orthogonal range trees and a bucke-
tized grid - are discussed below.
Figure 6.2: Clipping: as the timebox is moved to the lower right, the area marked “D”
is removed from the query, and the “A” region is added. These two regions must be
processed, but there is no need to reprocess the overlap (“O”).
6.4.1 Orthogonal Range Trees
Orthogonal range trees use nested trees to process orthogonal range queries. These
trees are nested interval trees, with each internal node containing a secondary tree. This
tree is used to index all of the items in that internal node along the second dimension.
For a search, the first one-dimensional interval tree is searched to find those items that
fall in the appropriate range for that dimension. As leaf and internal nodes that fall
within the interval are identified, their associated indices for the second dimension are
searched to find the items that fall within both dimensions of the query [38].
For TimeSearcher, a modified orthogonal range tree can be used to simplify processing.
The time dimension is searched first, followed by the value dimension. Since
the time dimension covers a known range $0 \leq k \leq m-1$, and each entity has a value
at every time point, we use a linear array in place of the range tree for the time dimension.
The start and endpoints in this array can be found in constant time, and the value
indices associated with each included time point are then searched. The second-level
value indices are stored as skip lists [103]. Each skip list contains the value for each
of the $n$ items at the time associated with that skip list. To search the second-level
indices, the entry in the skip list with the lower bound is found, and then items with
successively greater values are read off of the list until the higher bound is reached.
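The array-of-ordered-indices structure can be sketched as follows. Note the substitution: this sketch uses a sorted Python list with binary search (the standard bisect module) in place of the skip lists used in the text, since both support finding the lower bound and then reading off successively greater values. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(items):
    """index[t] is a list of (value, item) pairs sorted by value, one per time point."""
    m = len(items[0])
    return [sorted((series[t], i) for i, series in enumerate(items))
            for t in range(m)]


def points_in_box(index, box):
    """Yield (t, value, item) for every indexed point inside the timebox."""
    t_min, t_max, v_min, v_max = box
    for t in range(t_min, t_max + 1):  # direct array access per time point
        column = index[t]
        # First entry with value >= v_min (item indices are never negative).
        lo = bisect_left(column, (v_min, -1))
        # One past the last entry with value <= v_max.
        hi = bisect_right(column, (v_max, len(column)))
        for value, item in column[lo:hi]:
            yield t, value, item


items = [[1, 5, 6, 7, 2], [1, 5, 9, 7, 2], [4, 5, 5, 5, 4]]
index = build_index(items)
found = list(points_in_box(index, (1, 3, 4, 8)))
print(len(found))
```

Each time point costs one binary search plus the number of points reported, which corresponds to the per-time-point cost of locating the lower bound and reading off the matching entries.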
The total number of points in the data structure is $mn$ (the product of the number
of entities and the number of time points). For a query of width $w$, the expected search
time should be $O(w(\log n + k))$, where $k$ is the expected number of points that are found
within the query region at each time point. The use of an array instead of a range tree
for the first level of the search has a potential cost, as search trees that might have been
subsumed in internal canonical nodes of the first level tree must be searched explicitly.
However, if $w = O(\log n)$, the query cost should be comparable to that of a traditional
orthogonal range tree.
This approach may not be arbitrarily scalable. The need for more memory-efficient
approaches led to the consideration of alternative geometric searches based on bucke-
tized “grid” indices.
6.4.2 Grids
Grid structures divide multi-dimensional space into finite rectangular “buckets” that
contain many items. To process a range query, the buckets that contain the range are
identified, and items in these buckets are checked to determine whether or not they
meet the constraints of the timebox.
In the context of the current discussion, it is straightforward to convert the structure
described above (Section 6.4.1) to a grid. Specifically, the second-level value index
associated with each time point is replaced by an array. Each element in this array
represents some range of values, with the range of values and the number of items held
constant across all time periods. To place a data point into the index, a simple linear
conversion can be used to go from the point’s value to the appropriate slot in the
secondary array for the range containing that value (Figure 6.3).
Figure 6.3: A grid index for a data set with time points 0-9, values 0-80, and 8 buckets
in the value dimension. Given this scheme, values from 0-10 will go into bucket 1,
11-20 in bucket 2, etc. The timebox shown will cover the grids for values 21-30,
31-40, 41-50 and 51-60 for times 3-5. Buckets 21-30 and 51-60 are only partially
covered, thus their contents must be checked at each time point. The other buckets are
completely covered by the timebox, so checking of individual points is not necessary.
The efficiency of the grid approach can be improved slightly by noting that some
buckets are entirely covered by a timebox, while others are only partially covered.
If a bucket is entirely covered, its points need not be checked individually. This is
generally the case for the “interior” buckets - any bucket covered by the timebox other
than the highest valued and the lowest valued buckets.
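The linear conversion and the interior-bucket shortcut can be sketched as follows. The function names, the 0-based bucket numbering, and the clamping of out-of-range values into the edge buckets are all assumptions for illustration; Figure 6.3 numbers buckets from 1.

```python
def bucket_of(value, v_low, v_high, n_buckets):
    """Linear conversion from a value to its 0-based bucket number."""
    slot = int((value - v_low) * n_buckets / (v_high - v_low))
    return min(max(slot, 0), n_buckets - 1)  # clamp so v_high lands in the last bucket


def build_grid(items, v_low, v_high, n_buckets):
    """grid[t][k] lists (value, item) pairs at time t whose value falls in bucket k."""
    m = len(items[0])
    grid = [[[] for _ in range(n_buckets)] for _ in range(m)]
    for i, series in enumerate(items):
        for t, v in enumerate(series):
            grid[t][bucket_of(v, v_low, v_high, n_buckets)].append((v, i))
    return grid


def grid_points_in_box(grid, box, v_low, v_high, n_buckets):
    """Yield (t, value, item) for every point inside the timebox."""
    t_min, t_max, v_min, v_max = box
    k_lo = bucket_of(v_min, v_low, v_high, n_buckets)
    k_hi = bucket_of(v_max, v_low, v_high, n_buckets)
    for t in range(t_min, t_max + 1):
        for k in range(k_lo, k_hi + 1):
            interior = k_lo < k < k_hi  # bucket fully covered by the timebox
            for v, i in grid[t][k]:
                # Only the lowest and highest buckets need per-point checks.
                if interior or v_min <= v <= v_max:
                    yield t, v, i


items = [[1, 5, 6, 7, 2], [1, 5, 9, 7, 2], [4, 5, 5, 5, 4]]
grid = build_grid(items, 0, 10, 5)
found = list(grid_points_in_box(grid, (1, 3, 4, 8), 0, 10, 5))
print(len(found))
```

Because the conversion is monotonic in the value, any point stored strictly between the lowest and highest covered buckets is guaranteed to lie inside the value range, which is why the per-point comparison can be skipped there.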
The granularity of the grid is an adjustable parameter that can influence performance.
The granularity is the number of records that would fall in each value bucket if values
were evenly distributed, so the number of value buckets is the number of records in the
data set divided by the granularity. Thus, a data set containing 1000 items and a
granularity of 20 would lead to a grid containing 50 value buckets.
6.5 Analysis
Of these three alternative search algorithms, which is most efficient? Sequential
searches would appear to scale poorly, being linear in the number of items in the data
set (n). Geometric approaches appear to have the benefit of multi-dimensional indexing.
However, the algorithms under discussion have to handle a variety of queries,
which may occur with different frequencies.
Comparison on simulated data can be used to build a better understanding of the
merits of the various approaches. This section describes a test-bed that was used to
compare these alternatives, along with the results of the analyses and some conclusions
that can be drawn.
6.5.1 Methodology
Thorough testing of the various algorithms requires examination of a range of queries
on plausible data sets.
Data Sets
A Perl script was written to generate random time series profiles. Each time series
started with a random value between -1 and 1, which was then multiplied by 20
and added to 50, to provide a starting point between 30 and 70. Subsequent
values were calculated by adding a second random variable - also scaled by 20 - to the
previous value in the set. Values were constrained to run between 0 and 100. In this
manner, a “pseudo-random” walk was created.
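The generation scheme can be sketched as follows. This is not the original Perl script: it is a Python illustration with hypothetical function names, and it assumes that values were kept between 0 and 100 by clamping, which the text does not specify.

```python
import random


def random_walk_series(n_points, rng, scale=20.0, start=50.0, low=0.0, high=100.0):
    """One pseudo-random walk: start in [30, 70], steps in [-20, 20]."""
    value = start + scale * rng.uniform(-1.0, 1.0)
    series = [value]
    for _ in range(n_points - 1):
        value += scale * rng.uniform(-1.0, 1.0)
        value = min(max(value, low), high)  # assumption: out-of-range values are clamped
        series.append(value)
    return series


def make_data_set(n_items, n_points, seed=0):
    """Generate n_items walks of n_points each from one seeded generator."""
    rng = random.Random(seed)
    return [random_walk_series(n_points, rng) for _ in range(n_items)]


data = make_data_set(100, 100)
print(len(data), len(data[0]))
```

Seeding the generator makes a data set reproducible across runs, which is convenient when the same data must be fed to several algorithms.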
100 items              100 time points
100 time points        100 items
1000 time points       1000 items
10000 time points      10000 items
50000 time points
1000 items, 1000 time points
Table 6.1: Data sets used in algorithm evaluation.
To test for the effects of the number of items in the data set (n), data sets containing
100, 1000, 10000, and 50000 items with 100 time points each were created. To test
the effects of the width of the data set, data sets with 100 items and 100, 1000, and
10000 time points were used. A data set with 100 items and 50000 time points was
attempted, but the test program was not able to hold this data set in memory. A final
data set with 1000 items and 1000 time points was used to test for possible interactions
between the width and depth of the data set. A summary of the data sets is given in
Table 6.1.
Test Queries
Each test query set contained 1000 query blocks, with each of these blocks consisting
of a set of eight operations on a single query (Table 6.2). The resulting test set con-
tained a total of 8000 operations. One such test set was developed for each of the input
data sets.
The parameters for the query values and the extent of the moves were generated
using a random scheme similar to the scheme described above for the test data. For data
sets involving varying numbers of time points, the widths of the queries were allowed
to grow with the width of the data set.
1. Creation of an initial query
2. Moving the query in the time dimension
3. Moving the query in the value dimension
4. Moving in both dimensions
5. Resizing (scaling) in the time dimension
6. Resizing in the value dimension
7. Resizing in both dimensions
8. Deleting the query
Table 6.2: Query operations in each block.
Algorithms tested
Four algorithms were tested: optimized sequential search (“Seq”) (Algorithm 2), geometric
search with an orthogonal range tree (“Orth”), and geometric search with two
grid granularities of 20 and 100 (“Grid-20” and “Grid-100”). The choices of granularity
for the grid index were somewhat arbitrary.
Metrics
For each test, the total time spent on each type of operation was recorded, along with
the average for each operation and the variance. The analyses that follow are based on
the average time for each operation.
[Plot: average time (ms) versus number of items; series: Seq, Orth, Grid-20, Grid-100.]
Figure 6.4: Average times (ms) across all operations for data sets with 100 time points
and 100, 1000, 10000, and 50000 items.
Testing Platform
All tests were run on a 1.333 GHz Pentium III-compatible computer with 512MB of
RAM, running Mandrake Linux 8.0.
6.5.2 Results
Summary results presenting average times over all operations are presented in Fig-
ure 6.4 and Table 6.3 for the varying depths (the “deep” data sets) and Figure 6.5 and
Table 6.4 for varying widths (the “wide” data sets). For the deep data set, sequential
search was fastest, followed by Grid-20, Orth, and Grid-100. Sequential search was
also fastest for the wide data set, followed by Orth, Grid-20, and Grid-100.
Number of items    orth      seq      grid20    grid100
100                0.7       0.7      2.0       4.5
1000               9.7       7.2      9.4       19.9
10000              252.9     69.9     243.2     438.5
50000              3614.6    381.6    3071.7    5493.3
Table 6.3: Average times (ms) across all operations for data sets with 100 time points
and 100, 1000, 10000, and 50000 items.
[Plot: average time (ms) versus number of time points; series: Seq, Orth, Grid-20, Grid-100.]
Figure 6.5: Average times (ms) across all operations for data sets with 100 items and
100, 1000, and 10000 time points.
Number of time points    orth      seq     grid20    grid100
100                      0.7       0.7     2.0       4.5
1000                     8.7       0.7     21.7      22.0
10000                    103.7     0.75    248.6     275.902
Table 6.4: Average times (ms) across all operations for data sets with 100 items and
100, 1000, and 10000 time points.
These results show a clear advantage for sequential search over both the orthogonal
range tree and grid indices. For data sets with the largest number of items, sequential
search is an order of magnitude faster than the others. The results are even more
striking for data sets involving more time points: while the performance of the indexed
algorithms seems to scale linearly with the number of time points, the performance of
the sequential algorithm is not influenced by the width of the data set.
These tests also provide some perspective on the potential limits on the prospects
for dynamic queries with larger data sets. According to Table 6.3, sequential search
of a data set with 10000 items and 100 time points takes 69.9 ms. For the larger data
set with 50000 items, the time increases to 381.6ms. Given these numbers, it would
appear that it will be very difficult to meet the dynamic query goal of 100ms processing
time with this hardware configuration and data sets that have significantly more than
10000 items. In fact, the real limit might be somewhat smaller, as these numbers do
not include the time required for display updates.
Results broken down by individual operations are given in Figures 6.6 and 6.7 for
variations in the number of items in the data set. Figures 6.8 and 6.9 provide similar
results for variations in the number of time points. As expected, these results are
generally consistent with the averages given in Figures 6.4 and 6.5. For the “depth”
data, the sequential algorithm was always at least as fast as any of the others, with
Grid-20, Orth, and Grid-100 following in decreasing performance rank.
[Plots: average time (ms) versus number of items for Seq, Orth, Grid-20, and Grid-100. Panels: (a) Query Creation, (b) Movement in time (x), (c) Movement in value (y), (d) Movement in both dimensions.]
Figure 6.6: Comparative times for query creation and translation on data sets with 100
time points and 100, 1000, 10000 and 50000 items.
For the test involving the “deep” data set and movement in both directions (Fig-
ure 6.6), the performance of the Grid-20 index was comparable to that of the sequential
scan. However, since performance of the sequential scan was otherwise superior, this
result does not present any reason to prefer any of the other approaches. The perfor-
mance of the sequential algorithm also demonstrates more favorable scaling behavior.
As the geometric approaches begin to show growth rates that appear to be greater than
linear, the sequential scan’s performance maintains consistently linear growth.
The wide data set shows a different ordering - Sequential, Orth, Grid-20, and Grid-100,
in order of decreasing performance - but the preference for the sequential scan is
just as clear. Furthermore, the sequential scan shows very little sensitivity to the width
of the data set.
[Figure 6.7 plots omitted. Panels: (a) Resize in time (x), (b) Resize in value (y), (c) Resize in both dimensions, (d) Deletions. Each panel plots average time (ms) against number of items (0 to 50000) for Seq, Orth, Grid-20, and Grid-100.]
Figure 6.7: Comparative times for query resize and deletion on data sets with 100 time
points and 100, 1000, 10000 and 50000 items.
Results for the data set containing 1000 items and 1000 time points are given in
Table 6.5, along with results for 100 items and 1000 time points and 100 time points
and 1000 items for context. Relative to the data set with 1000 items and 100 time
points, the sequential algorithm was 80% slower for queries on this data set: 13.0ms vs
7.2ms. However, the geometric queries were roughly one order of magnitude slower.
[Figure 6.8 plots omitted. Panels: (a) Query Creation, (b) Movement in time (x), (c) Movement in value (y), (d) Movement in both dimensions. Each panel plots average time (ms) against number of time points (0 to 10000) for Seq, Orth, Grid-20, and Grid-100.]
Figure 6.8: Comparative times for query creation and translation on data sets with 100
items and 100, 1000, and 10000 time points.
Data set                        Orth     Seq     Grid-20   Grid-100
1000 time points, 1000 items    207.2    13.0    213.8     220
100 time points, 1000 items     9.7      7.2     9.4       19.9
1000 time points, 100 items     8.7      0.7     21.7      22.0
Table 6.5: Average times (ms) for the data set with 1000 items and 1000 time points,
with results for both 100 items and 1000 time points and 100 time points and 1000
items given for context.
[Figure 6.9 plots omitted. Panels: (a) Resize in time (x), (b) Resize in value (y), (c) Resize in both dimensions, (d) Deletions. Each panel plots average time (ms) against number of time points (0 to 10000) for Seq, Orth, Grid-20, and Grid-100.]
Figure 6.9: Comparative times for query resize and deletion on data sets with 100
items and 100, 1000, and 10000 time points.
This seems to indicate that there may be interactions that might influence query per-
formance for data sets with both large numbers of time points and items. However,
such interactions do not change the basic conclusion - sequential search outperforms
the geometric alternatives.
These results may appear to be somewhat counter-intuitive. Why would a naive
sequential scan out-perform indexed searches? And why is the performance of the
sequential algorithm relatively insensitive to the width of the data set?
6.5.3 Sequential scans vs. Geometric Indices
The superior performance of the sequential scan approach might be explained by an
advantage in the number of points that must be processed for a given timebox query.
Specifically, does the sequential scan algorithm examine fewer points than geometric
approaches to determine the value of S(ti,b) for any ti in the data set?
Figure 6.10 shows a timebox that spans eight time points (times 1-8), along with a
time series that falls within the timebox for seven of those eight points. To determine
S(ti,b) for this timebox and any given ti, the sequential algorithm only needs to examine the first
two values in the given time range: once it is determined that the second value falls
outside of the timebox, we know that S(ti,b) = false, and there is no need to examine
any of the remaining values. The geometric approach does not look at the value at time
two, as this point falls outside of the timebox. Instead, geometric approaches must ex-
amine the seven remaining points for ti that fall inside the box. All of these points (and
any others that fall inside the box) must be examined and the appropriate totals calcu-
lated before the value of S(ti,b) can be determined. Furthermore, this determination
requires a separate, final pass through the data set.
The limiting case for this advantage would appear to be for time series profiles that
are completely contained within the timebox (Figure 6.11). In these cases, both the
sequential and geometric approaches must visit all of the points contained in the width
of the timebox to determine that S(ti,b) = true. However, the geometric approaches
still suffer from the need for a final scan through all of the items in the data set.
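The contrast described above can be illustrated with a minimal Python sketch. It is not the TimeSearcher implementation: the function names, the tuple layout of a timebox, and the sample values are assumptions made for illustration, and the "geometric" function only mimics the counting behavior of an index that returns the points lying inside the box.

```python
from typing import List, Sequence, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def seq_match(series: Sequence[float], box: Timebox) -> Tuple[bool, int]:
    """Early-exit scan: return (S(t, b), number of values examined)."""
    t_min, t_max, v_min, v_max = box
    checked = 0
    for t in range(t_min, t_max + 1):
        checked += 1
        if not (v_min <= series[t] <= v_max):
            return False, checked  # first value outside the box settles the answer
    return True, checked


def geometric_match(series: Sequence[float], box: Timebox) -> Tuple[bool, int]:
    """Index-style evaluation: every point inside the box is tallied, and the
    item matches only if the tally equals the width of the box."""
    t_min, t_max, v_min, v_max = box
    inside: List[int] = [
        t for t in range(t_min, t_max + 1) if v_min <= series[t] <= v_max
    ]
    return len(inside) == (t_max - t_min + 1), len(inside)


# Illustrative series shaped like the one in Figure 6.10: times 1-8, with only
# the value at time 2 outside the box (index 0 is an unused placeholder).
box = (1, 8, 4.0, 6.0)
leaves_box = [0.0, 5.0, 9.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
contained = [0.0] + [5.0] * 8

print(seq_match(leaves_box, box))        # (False, 2)
print(geometric_match(leaves_box, box))  # (False, 7)
print(seq_match(contained, box))         # (True, 8)
print(geometric_match(contained, box))   # (True, 8)
```

In this sketch the two approaches touch the same number of values only for the fully contained series, which corresponds to the limiting case of Figure 6.11.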
To validate this model for the superior performance of the sequential algorithm,
the above tests were repeated with additional instrumentation for counting the number
of items that were checked in the process of completing each set of queries. For any
given query on a data set, the number of values that might possibly be checked in the
Figure 6.10: A timebox query demonstrating the advantage that sequential processing
has over geometric methods. For this timebox that spans eight time points, sequen-
tial processing can stop after the second time value is identified as falling outside of
the timebox. However, the geometric approaches must examine every point that falls
within the timebox.
course of processing that query is equal to the width of the query - w - multiplied by
the number of items in the data set - n. By comparing the number of values that are
actually tested to this theoretical maximum, we can evaluate the relative performance
of sequential and geometric approaches. Data for the sequential case are presented in
Table 6.6 for the tests involving increased numbers of items, and Table 6.7 for tests
involving increased numbers of time points.
As would be expected, the number of values that might possibly be checked scales
linearly with both the number of items in the data set and the number of time points.
Furthermore, the number of items that is actually checked is relatively small - 7% or
Figure 6.11: The timebox from Figure 6.10, with a time series for which S(ti,b)= true.
                   Number of values checked
Number of items    Possible         Actual         Ratio
100                25,052,800       1,737,711      0.069
1000               246,198,000      17,355,879     0.07
10000              2,501,220,000    175,558,021    0.07
Table 6.6: Comparison of number of values checked versus possible number of checks
for sequential search of data sets with 100 time points and 100, 1000, and 10000 items.
                         Number of values checked
Number of time points    Possible         Actual       Ratio
100                      25,052,800       1,737,711    0.069
1000                     248,961,400      1,906,656    0.0077
10000                    2,495,734,000    1,782,613    0.00071
Table 6.7: Comparison of number of values checked versus possible number of checks
for sequential search of data sets with 100 items and 100, 1000, and 10000 time points.
less - for both the “wide” and the “deep” data sets. This establishes a benchmark for
the geometric algorithms - if those approaches require a substantially larger portion
of the possible checks, the proposed explanation for the superior performance of the
sequential algorithm would be validated.
The scaling of the number of values actually checked presents some interesting
results. For the “deep” data set involving 100, 1000, and 10000 items, the number
of values actually checked scaled linearly with the number of items in the data set,
maintaining a fairly constant ratio of roughly 7% of the possible number of checks. For
the “wide” data set involving 100, 1000, and 10000 time points, the number of values
actually checked stayed roughly constant - varying between 1,737,711 and 1,906,656
- despite the increase in the number of time points. As a result, the ratios decreased
progressively - from 6.9% for 100 time points to 0.071% for 10000 time points. This
would seem to be consistent with the insensitivity of search times to the number of
time points in a data set (Figures 6.5, 6.8, and 6.9 and Table 6.4).
Data for a similar analysis conducted with the Grid-20 index is given in Table 6.8
for the data set involving varying numbers of items and Table 6.9 for the data set
involving varying numbers of time points. As the various geometric approaches differ
                   Number of values checked
Number of items    Possible         Actual         Ratio
100                25,052,800       7,227,326      0.29
1000               246,198,000      73,553,822     0.30
10000              2,501,220,000    723,176,660    0.29
Table 6.8: Comparison of number of values checked versus possible number of checks
for Grid-20 search of data sets with 100 time points and 100, 1000, and 10000 items.
                         Number of values checked
Number of time points    Possible         Actual         Ratio
100                      25,052,800       7,227,326      0.29
1000                     248,961,400      71,606,632     0.29
10000                    2,495,734,000    717,849,880    0.29
Table 6.9: Comparison of number of values checked versus possible number of checks
for Grid-20 search of data sets with 100 items and 100, 1000, and 10000 time points.
in the indices used to retrieve the values that must be checked, but not in the actual
values that are checked, this analysis can be considered as representative of geometric
algorithms in general.
As with the sequential algorithm, the number of values actually checked scaled
with the number of items in the data set. However, the number of values checked also
scaled with the number of time points in the data set. Furthermore, in both cases,
the percentage of possible checks made was much higher than in the sequential case:
roughly 29% for the geometric algorithms, as opposed to a maximum of 7% for the
[Figure 6.12 plot omitted. Number of values checked (0 to 8e+08) against number of items (0 to 10000) for Grid-20 and Seq.]
Figure 6.12: The number of values actually checked for sequential and Grid-20 algo-
rithms for data sets involving 100, 1000, and 10000 items with 100 time points.
sequential scan.
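The ratios quoted above can be recomputed directly from the counts reported in Tables 6.6 through 6.9. The short Python sketch below does only that arithmetic; the variable and function names are illustrative, and the counts are copied from the tables rather than measured.

```python
# (possible checks, actual checks) copied from Tables 6.6-6.9.
SEQ_BY_ITEMS = {100: (25_052_800, 1_737_711),
                1000: (246_198_000, 17_355_879),
                10000: (2_501_220_000, 175_558_021)}
SEQ_BY_TIME_POINTS = {100: (25_052_800, 1_737_711),
                      1000: (248_961_400, 1_906_656),
                      10000: (2_495_734_000, 1_782_613)}
GRID20_BY_ITEMS = {100: (25_052_800, 7_227_326),
                   1000: (246_198_000, 73_553_822),
                   10000: (2_501_220_000, 723_176_660)}
GRID20_BY_TIME_POINTS = {100: (25_052_800, 7_227_326),
                         1000: (248_961_400, 71_606_632),
                         10000: (2_495_734_000, 717_849_880)}


def ratio(possible: int, actual: int) -> float:
    """Fraction of the possible checks that were actually performed."""
    return actual / possible


for label, table in [("Seq, by items", SEQ_BY_ITEMS),
                     ("Seq, by time points", SEQ_BY_TIME_POINTS),
                     ("Grid-20, by items", GRID20_BY_ITEMS),
                     ("Grid-20, by time points", GRID20_BY_TIME_POINTS)]:
    for size, (possible, actual) in table.items():
        print(f"{label:24s} {size:6d} {ratio(possible, actual):.5f}")

# How many more values Grid-20 checked than the sequential scan
# on the widest data set (100 items, 10000 time points).
widest_gap = GRID20_BY_TIME_POINTS[10000][1] / SEQ_BY_TIME_POINTS[10000][1]
print(f"Grid-20 / Seq at 10000 time points: {widest_gap:.1f}x")
```

For the widest data set, the counts in the tables imply that Grid-20 examined roughly four hundred times as many values as the sequential scan.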
Comparative graphs are shown in Figures 6.12 and 6.13 for the “deep” and “wide”
datasets respectively. Note the similarity between these graphs and corresponding
graphs for execution times in Figures 6.4, 6.5, 6.6, 6.7, 6.8, and 6.9.
Taken as a whole, these results appear to confirm the hypothesis that the perfor-
mance advantage of the sequential algorithm can be attributed to the gains associated
with boolean “shortcuts” that minimize the number of values that must be checked to
evaluate timebox queries.
[Figure 6.13 plot omitted. Number of values checked (0 to 8e+08) against number of time points (0 to 10000) for Grid-20 and Seq.]
Figure 6.13: The number of values actually checked for sequential and Grid-20 algo-
rithms for data sets involving 100 items with 100, 1000, and 10000 time points.
6.5.4 Theoretical worst-case analyses
An analysis of worst-case search performance for the various algorithms can provide
further insight. For a data set of n items, each containing m time points, worst-case
search performance occurs during the modification of a query that covers the entire
range of the data set. Specifically, this query must cover all m time points, with values
ranging from the lowest to the highest (inclusive) values found in the data set for any
item at any time point. As the modification of any timebox in a query requires the
reprocessing of all timeboxes in that query, this worst-case performance will occur
with a set of k boxes with a union that covers all m time points, or with a single
timebox that covers m time points.
It should be noted that this query is worst-case in the sense that it occupies the entire
query space and there are no interesting queries that would require more processing.
It is certainly conceivable that a user could construct a query that contained several
copies of a timebox that covered the entire query space, but this would be redundant.
Analysis of this query is straightforward for the sequential algorithm. Each of the
n items in the data set would potentially require anywhere from 1 to m checks: one for
each of the time points. As the value range of this query contains all values found in the
data set, all of the m checks would be necessary to verify that any given item matched
the query - no shortcuts would be possible. The total time required for processing this
query would therefore be O(mn).
For the orthogonal range tree algorithm, this worst-case query would require
examination of each of the m range trees in the data set. For each of these trees, O(log n)
time would be required to find the starting point for the query interval in the skip
list, and O(n) would be required to find each of the n points in the skip list. Thus, each
of the m searches would take O(n + log n). The final pass through the data set would
require an additional O(n) time, for a total of O(n + m(n + log n)).
This might be reduced somewhat, if we assume special-case handling that could
avoid the O(log n) search in the skip list for the case of searching for the minimum
value in the range. In this case, the resulting search time would be O(n + mn). This
may be asymptotically equivalent to the running time for the sequential algorithm, but
the constants are probably higher.
Similar results can be found for the grid variant of the geometric search. For each
of the m time periods, each of the buckets would be in the value range of the query, and
each of the points in each bucket must be checked, for a total of n points. As a result,
the time required for the basic search is O(mn). The addition of O(n) for the final pass
through the data set leads to O(n + mn). This is equivalent to the result found for the
orthogonal range version of the geometric algorithm.
This analysis implies that the sequential algorithm is likely to outperform the geo-
metric approaches even in pathological worst case scenarios.
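The worst-case bounds derived in this section can be compared side by side with a small Python sketch. The functions below simply evaluate the expressions inside the O(...) bounds with every constant factor assumed to be 1, which is an illustrative simplification and not a measurement; the function names are hypothetical.

```python
import math


def seq_cost(n: int, m: int) -> float:
    """Sequential scan: m checks for each of n items, O(mn)."""
    return n * m


def orth_cost(n: int, m: int) -> float:
    """Orthogonal range trees: m searches of O(n + log n), plus a final O(n) pass."""
    return n + m * (n + math.log2(n))


def grid_cost(n: int, m: int) -> float:
    """Grid index: n points checked in each of m time periods, plus a final O(n) pass."""
    return n + m * n


for n, m in [(100, 100), (10_000, 100), (100, 10_000)]:
    print(n, m, seq_cost(n, m), grid_cost(n, m), round(orth_cost(n, m)))
```

Under these unit-constant assumptions all three expressions are dominated by the mn term, and the geometric approaches differ from the sequential scan only by lower-order terms such as the final pass.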
6.5.5 Further Examination of Sequential Algorithms
The analysis presented thus far argues that the sequential algorithm outperforms the
geometric alternatives. Although further investigation explains this result, the sequen-
tial algorithm may still seem somewhat unsatisfying.
Specifically, Algorithm 2 seems to include an intrinsic inefficiency. For operations
that involve modification of a timebox, all of the remaining timeboxes are compared
against each item in the data set. In other words, if there are k timeboxes - b_0 ... b_(k-1) -
and box b_l is deleted, each item in the data set must be checked against all of the k-1
remaining boxes. This is potentially wasteful, as many of these checks may have been
completed previously. If, for example, item tx had been previously found to match b_y
(y ≠ l), we should not have to repeat the check to see if S(tx,b_y) = true.
An alternative formulation of the sequential algorithm might be used to avoid this
problem. In this model, we use a hash table for each entity ti in the data set. This
hash table contains pointers to the timeboxes that the entity satisfies. When a timebox
operation occurs, this approach still iterates over each of the time series profiles in the
data set. When checking profile ti against timebox b, the entry for b in ti’s hash table
is removed, and then (if the operation is not a delete), ti is compared against b to see if
there is a match. If there is, b is added to ti’s hash table. Finally, the number of items
in ti’s hash table is compared to the number of active timeboxes in the query. If they are
equal, S(ti,B) is true for the query B as a whole. This approach is summarized in Algorithm 4.
Search Algorithm 4 SEQ HASHED
• Begin by assuming that all items have an empty hash table.
• For each change to the query involving timebox b, and each item ti:
– Remove b from ti’s hash table.
– If the operation is not a deletion, check ti against b. If S(ti,b) = true, add
b to ti’s hash table.
– If the number of items in ti’s hash table is equal to the number of current
timeboxes, ti matches the query as a whole - S(ti,B) = true.
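The steps of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the TimeSearcher code: the class name, the use of string identifiers for timeboxes, the tuple layout of a timebox, and the use of a Python set in place of a per-item hash table are all choices made for the example.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """S(t, b): every value in the box's time range lies in its value range.
    all() stops at the first failing value, giving the early-exit behavior."""
    t_min, t_max, v_min, v_max = box
    return all(v_min <= series[t] <= v_max for t in range(t_min, t_max + 1))


class SeqHashed:
    def __init__(self, items: List[Sequence[float]]) -> None:
        self.items = items
        # One initially empty set per item, holding ids of timeboxes it satisfies.
        self.matched: List[Set[str]] = [set() for _ in items]
        self.boxes: Dict[str, Timebox] = {}

    def update(self, box_id: str, box: Optional[Timebox] = None) -> List[int]:
        """Create or modify a timebox (box given) or delete it (box is None).
        Returns the indices of items that match the query as a whole."""
        if box is None:
            self.boxes.pop(box_id, None)
        else:
            self.boxes[box_id] = box
        result = []
        for i, series in enumerate(self.items):
            self.matched[i].discard(box_id)           # remove b from the item's table
            if box is not None and satisfies(series, box):
                self.matched[i].add(box_id)           # re-add b only if it still matches
            if len(self.matched[i]) == len(self.boxes):
                result.append(i)                      # item satisfies every current timebox
        return result


items = [[1, 1, 1, 1], [1, 5, 1, 1], [5, 5, 5, 5]]
query = SeqHashed(items)
print(query.update("a", (0, 1, 0, 2)))  # [0]
print(query.update("b", (2, 3, 0, 2)))  # [0]
print(query.update("a", None))          # [0, 1]
```

Only the changed timebox is re-evaluated against each item; the other timeboxes are represented by their entries in the per-item sets. As in the algorithm as stated, an empty query matches every item, since each set then has the same size (zero) as the list of timeboxes.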
Evaluation of this algorithm requires a somewhat different approach than that
which was taken above. As this revised algorithm is aimed at eliminating costs for
cases involving modification of queries, it will be most effective for queries in-
volving multiple timeboxes (as opposed to the single-timebox cases given above).
Alternate query sets similar to those used for the comparison of sequential and ge-
ometric algorithms were developed. Like the original query sets, these contained a
series of 1000 repetitions of 8 query operations. However, these repetitions were con-
ducted after creating four extra timeboxes. These four timeboxes were held constant
while the 8000 operations were performed.
Results for these tests are given in Figure 6.14 (for “deep” data sets) and Fig-
ure 6.15 (for “wide” data sets). Although the hashed version of the algorithm seems to
perform better on smaller data sets, the original sequential algorithm seems stronger
for larger data sets.
[Figure 6.14 plot omitted. Average time (ms) against number of items (0 to 50000) for Seq and Seq Hashed.]
Figure 6.14: Optimized sequential vs. Hashed sequential for data sets involving 100,
1000, and 10000 items.
6.5.6 Discussion
Although the analysis described above appears to support the use of a sequential scan
approach for timebox searches, several questions regarding the interpretation of these
results and their generalizability remain unresolved.
Generalizability
The sample query and data sets are not necessarily representative of real data sets and
user queries. Therefore, the results should not be taken as definitive or predictive.
Instead, they should be used for comparative discussion of the various algorithms.
[Figure 6.15 plot omitted. Average time (ms) against number of time points (0 to 10000) for Seq and Seq Hashed.]
Figure 6.15: Optimized sequential vs. Hashed sequential for data sets involving 100,
1000, and 10000 time points.
In particular, the ratios of the number of values checked to the number of possible
checks (Tables 6.6, 6.7, 6.8, and 6.9) are probably artifacts of the strategy
used to generate test data and queries. Other data sets and queries are likely to have
significantly different ratios.
Anomalous Data
There was one class of query operation where the grid index had performance compa-
rable to that of the sequential scan. For movement in both directions with the “deep”
data sets, the Grid-20 index was almost as fast as sequential scan for data sets involving
10000 items (78.8 ms for sequential scan and 337.5 for Grid-20). For 50000 items, the
two approaches had virtually indistinguishable performance: 430.5ms and 438.6ms
for sequential and Grid-20, respectively (Figure 6.6d).
Given the otherwise consistent superiority of the sequential approach, this result is
somewhat puzzling. The most likely explanation is that this is an artifact of the specific
query data set used in the tests. Further investigation would be needed to clarify. In
any case, the Grid-20 index never outperforms the sequential scan.
Scaling with Width of the Data Set
The analysis of the data sets involving varying widths of the time series revealed an
unexpected result: query execution time stayed roughly constant as both time series
and queries grew wider (Figures 6.5, 6.8, 6.9, and Table 6.4).
This result might be understood by considering what would be required for an
increase in time points to lead to an increase in the time required to process a query.
First, the queries must increase in width. Processing a query of width k will take at
most k comparisons for each item in the data set, regardless of the number of time
points for each item. This requirement is met by the test queries, which can increase
in width with the data set. Second, we must have many items in the data set that must
be scanned for the whole width of the query. If items fall out of the timebox quickly
(Figure 6.10), the increased width of the data set and query will not lead to increased
processing time.
More concretely, consider a timebox b = (tmin, tmax, vmin, vmax). The vertical range
vmax − vmin might be considered as p - the portion of the entire value range of the data
set that is included. If we assume that data values are randomly distributed throughout
the entire range, p is also the probability that the value of a time series ti will fall inside
of b for any point in time i. As a time series must stay within the value range of the
timebox b for each of the k = tmax − tmin + 1 time periods included in the timebox, the
probability that any time series will match the timebox is p^k. Since p can be expected
to be significantly less than one, this value will quickly become very small even for
small values of k. For example, if the user creates a relatively broad query covering half
of the value range (p = .5), the likelihood that a time series will match this query will
be less than .1 if the query is only four time periods long (Figure 6.16). For data sets
containing such randomly distributed values, the likelihood is that many time series
will quickly fall outside of wide timeboxes, thus making performance independent of
the width of the data set.
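The argument above can be made concrete with a short Python sketch. It assumes, as the text does, that each value falls inside the timebox independently with probability p; the function names are illustrative. Under that assumption an early-exit scan examines the j-th value only if the first j-1 values were all inside the box, so the expected number of values examined for one item is 1 + p + p^2 + ... + p^(k-1).

```python
def match_probability(p: float, k: int) -> float:
    """Probability that a random series stays inside the box for all k periods."""
    return p ** k


def expected_checks(p: float, k: int) -> float:
    """Expected number of values an early-exit scan examines for one item:
    the geometric sum 1 + p + ... + p**(k - 1)."""
    if p == 1.0:
        return float(k)
    return (1.0 - p ** k) / (1.0 - p)


print(match_probability(0.5, 4))    # 0.0625, the p = .5, four-period example
print(match_probability(0.25, 5))   # below 1%, as in the Figure 6.16 caption
for k in (4, 10, 100, 1000):
    print(k, expected_checks(0.5, k))
```

With p = .5 the expected number of checks per item never exceeds 2 regardless of k, which is consistent with search times that do not grow with the width of the data set. Real data sets are not randomly distributed, so these figures are indicative only.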
Most meaningful data sets will not have values that are randomly distributed, so
likelihoods may be somewhat higher. However, most interesting data sets contain
profiles with non-trivial changes over time - exactly those profiles that are unlikely to
be contained in relatively constrained timeboxes that cover long intervals.
Performance with the data set with 1000 items and 1000 time points (Table 6.5)
provides some evidence that performance is not completely insensitive to the num-
ber of time points. For this data set, the sequential search was 80% slower than it
was on the data set with 1000 items and 100 time points. However, this increase was
relatively modest, given the ten-fold increase in the number of time points. This per-
formance seems particularly good when compared to the geometric algorithms, which
were roughly one order of magnitude slower with this wider data set.
Further analysis would be needed to fully characterize the performance of the se-
quential algorithm on data sets with large numbers of both items and time points.
However, it appears that query performance can be expected to scale well with the
width of the data set.
Figure 6.16: Why time series query performance is independent of the width of the
series. As this timebox covers 25% of the value space and five time periods, a randomly
generated time series would only have odds of < 1% of satisfying the timebox (like
t2 does). The odds that a time series will fail to meet this query by the fourth time point
(like t1) are greater than 99%.
Variants on Sequential Search
The comparison between the optimized sequential (Algorithm 2) and the hashed (Al-
gorithm 4) algorithms (Section 6.5.5) was inconclusive. Further comparisons, perhaps including
data and query sets that might be more representative of actual user tasks, might be
necessary to gain a deeper understanding of the strengths and weaknesses of these
alternative approaches.
6.6 Next Steps
This chapter has formulated the timebox search problem, described alternative ap-
proaches, and presented results based on synthetic query sets. These tests have shown
that sequential scans with heuristic optimizations outperform searches based on more
sophisticated geometric indices. By counting the number of values checked in the
different circumstances, this analysis established an explanation for the superior per-
formance.
The sequential algorithms benefit from the ability to quickly and easily determine
- on the basis of the first value from a time series that falls outside of the timebox -
when a time series profile will fail to satisfy a timebox. Sequential approaches use this
information to eliminate the need to examine any values in a time series subsequent to
that value that falls outside of the timebox. Geometric algorithms, on the other hand,
lack this global knowledge of a single value for a time series that will be sufficient to
conclude that the time series falls outside of a timebox. As a result, these approaches
must examine all data points that fall within a timebox, even though many of them
may belong to items that will not fall within the timebox.
This understanding can be used to identify the requirements that must be met by
any proposed algorithm that would hope to improve upon the sequential scan algo-
rithms described above. The key to the success of the sequential scan is in its ability to
quickly identify profiles that cannot satisfy a timebox. An index that could
identify, in less than linear time, only those profiles that might satisfy a
timebox could potentially outperform the sequential search.
One possible approach might be to reduce each time series in a data set to a one-
dimensional projection on the value axis, covering the range between the maximum
and minimum values covered by that time series. These projections would be searched
in an interval tree. Each timebox could then be converted into a similar interval, and
search would consist of finding all of the profiles that overlapped with the timebox and
then doing a complete search on those candidates.
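The projection idea can be sketched in Python as follows. This is only an illustration of the filtering step under stated assumptions: a linear pass over precomputed (min, max) ranges stands in for the interval tree, so the sketch shows which candidates would be retrieved but does not achieve sub-linear lookup. All names and the tuple layout of a timebox are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical timebox layout: (t_min, t_max, v_min, v_max), all inclusive.
Timebox = Tuple[int, int, float, float]


def satisfies(series: Sequence[float], box: Timebox) -> bool:
    """Complete early-exit check of one series against one timebox."""
    t_min, t_max, v_min, v_max = box
    return all(v_min <= series[t] <= v_max for t in range(t_min, t_max + 1))


def value_ranges(items: List[Sequence[float]]) -> List[Tuple[float, float]]:
    """Project each series onto the value axis as a (min, max) interval."""
    return [(min(s), max(s)) for s in items]


def candidates(ranges: List[Tuple[float, float]], box: Timebox) -> List[int]:
    """Items whose value interval overlaps the timebox's value interval.
    A linear pass is used here in place of an interval tree query."""
    _, _, v_min, v_max = box
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= v_max and hi >= v_min]


def search(items: List[Sequence[float]],
           ranges: List[Tuple[float, float]],
           box: Timebox) -> List[int]:
    """Filter by projection, then run the complete check on the survivors."""
    return [i for i in candidates(ranges, box) if satisfies(items[i], box)]


items = [[1, 2, 1], [8, 9, 8], [1, 9, 1]]
ranges = value_ranges(items)
box = (0, 1, 0, 3)
print(candidates(ranges, box))     # [0, 2]
print(search(items, ranges, box))  # [0]
```

The third series in this example has a wide value range, so it survives the projection filter even though it fails the complete check.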
Unfortunately, there are at least two problems with this proposal. The first involves
ordering of items: maintaining the consistent ordering of items would require an ex-
pensive sorting of the result list after the search. Even if this requirement is relaxed,
there is a very real possibility that the items in the data set will have substantial overlap
in the projections of their profiles. This would minimize the discriminatory power of
the interval tree and (in the limit) reduce this approach to a sequential search.
It is important to note that the conclusions presented in this chapter relate only
to complete-matching of time series data sets with timeboxes. Numerous algorithms
have been suggested for similarity matching on subsequences and other approaches to
searching time series data (Section 2.2). These approaches might be worth reconsider-
ing in the context of possible extensions to the timebox query language (Chapter 9).
Processing of timebox and related queries might be limited by the “dimensionality
curse” - the inherent difficulty of searching in high-dimensional cases. As a time series
data set with n time points can be viewed as a set of n-dimensional vectors, a timebox
query over that data set can be considered to be a query in n-dimensional space. Re-
cent analyses of index structures for nearest neighbor searches in high-dimensional
space have shown that sequential scans outperform indexed searches for moderate di-
mensionalities (< 20) [21, 115, 142]. Thus, sequential scans might outperform these
indices for time series of even moderate width.
In fact, the performance degradation of these indices may be even greater for time-
boxes. Nearest neighbor searches are based upon calculations of distances between
data points and a fully-specified query point. Timebox queries can be substantially
more vague, as conjunctive queries may specify constraints on some, but not necessar-
ily all, of the time values. In essence, a timebox query can be seen as a similarity query
with at least one specified constraint and an arbitrary number of “don’t care” values.
Although further analysis would be necessary to confirm this conjecture, it seems rea-
sonable to expect that the performance degradations for these queries would occur at
lower dimensionalities than those seen for completely specified similarity queries.
The results described above seem to indicate that dynamic query processing for data
sets containing more than 100,000 items may be impractical for some time. Alternative
strategies might be developed to handle larger data sets. For example, searching might
be done on clusters of similar profiles, allowing users to “drill-down” to actual data
items once a cluster of interest is found.
Chapter 7
Empirical Evaluations
Evaluation is a key component of the process of developing interactive systems. Em-
pirical studies, user observations, and other analytic approaches to examination of the
use of the tool in practice can help validate ideas, support (or refute) underlying as-
sumptions, and otherwise clarify understanding of the issues surrounding the system
under investigation.
This chapter describes two controlled design studies that were conducted with
TimeSearcher. These studies investigated various aspects of the timebox query model
and the TimeSearcher application, with the goal of providing formative feedback that
would be useful for revising and improving the utility of these tools.
Both of the studies asked participants to use a direct-manipulation timebox in-
terface alongside two alternatives to complete tasks involving a search for items of
interest in a data set involving stock prices. An additional study was attempted and
terminated, due to difficulties with user comprehension of study tasks. This study is
described in Appendix C.
Chapter 8 presents several case studies with researchers who have been using
TimeSearcher for examining data sets in their ongoing work. Based on observations
made during sessions spent directly with these users as they worked on problems that
they found meaningful, these case studies document the utility of the tool as seen by
motivated users.
Both approaches to evaluation have their strengths and weaknesses. Empirical
studies are well-suited for understanding the impact of small design changes and com-
paring alternatives in well-controlled environments. These studies can also be too nar-
row, focusing on minute details that may be uninteresting, even if they are easy to test.
Case studies provide powerful testimonials to the utility of a tool, but as they are far
less rigorous than empirical studies, conclusions are often less clear and generalizable.
These difficulties were notable in the course of the evaluation of TimeSearcher.
The two approaches to evaluation provided markedly different results. Although users
of the system were enthusiastic and found the tool to be valuable, the results of the
empirical studies are less clear. The first study showed that form fill-in interfaces
outperform direct manipulation timeboxes under certain circumstances, and the second
study failed to show any significant difference between the alternatives. Understanding
these apparent paradoxes will be a goal of the discussion of these evaluations.
7.1 Evaluation of Input Mechanisms for Questions of
Varying Complexity
7.1.1 Interfaces
Two equivalent alternative means of specifying query constraints were considered:
1. Form Fill-in: Using traditional text entry widgets, users could type values to
specify a query equivalent to a timebox (Figure 7.1).
2. Range Sliders: Paired range sliders - one for time constraints and one for value
constraints (Figure 7.2) - could be used to specify query parameters.
Figure 7.1: A form fill-in interface for specifying query constraints.
Figure 7.2: A range slider interface for specifying query constraints.
In the discussion below, “Timebox” will refer to standard timeboxes as implemented in
TimeSearcher, “Form Fill-in” will refer to the form fill-in interface, and “Range Slider”
will refer to the range slider interface.
The alternative interfaces can be viewed as indirect means of creating timebox
queries: the parameters expressed in the slider or form fill-in form queries equiva-
lent to what might be expressed with a timebox. Furthermore, these parameters were
displayed on the screen with a box, just as if a box had been created with direct ma-
nipulation.
Form fill-in was chosen as a “traditional” interface design, based on commonly-
accepted conventions for graphical user interfaces. Range sliders were chosen as a
potentially more powerful alternative that has been shown to be useful in earlier infor-
mation visualization work [8].
A modified and instrumented version of TimeSearcher was built to serve as the
platform for running the study. Known as tsexp, this version involved several modi-
fications to the TimeSearcher interface, along with additional functionality needed to
run the study.
In constructing tsexp, interface components found in TimeSearcher were removed
if they were irrelevant to the study or if they somehow interfered. Thus, the details-on-
demand window and the item list were eliminated as being irrelevant. The range sliders
for adjusting queries were removed, as they provided a tool for modifying queries that
could interfere with the query manipulation methods that were being examined.
The resulting interface contains three windows: the display list, the overview win-
dow, and the query space. The display list is analogous to the display list in Time-
Searcher - a scrollable window containing individual graphs for the items in the data
set. The overview window was used to show a graph overview of the data set, and
for display of the boxes corresponding to the query parameters. For tasks involving
timebox queries, this window was also used for query input. The query space - the
third window, at the bottom of the screen - is used for query input. Unlike timeboxes, which
can be drawn directly on the graph space, form fill-in and range slider queries require
additional display space. This window was used to display the input devices, along
with other necessary controls. For tasks involving timeboxes, this space was used to
provide feedback regarding the extent of each timebox.
tsexp includes functionality for reading a set of tasks from a text file. This file
indicates the number of items in the session, data files to be used for that session, and
the questions, along with indicators describing the type of question and its complexity.
A participant starts a task in tsexp by pressing the “start” button on the toolbar. This
leads to the loading and display of the appropriate data file, along with the initialization
of the query space. A popup window displaying the current task, and the type of
interface, is also displayed (Figure 7.3). The user then proceeds to answer the question.
Users were instructed to find the answer to the question, press the “stop” button on
the toolbar, and then write down the answer.
Figure 7.3: The tsexp interface.
For each task, data stored included the time required to complete the task, the
number of timeboxes created, the number of modifications, and the number of items
deleted. For the studies described below, task completion time was the only variable
analyzed. The time between the pressing of the “start” and “stop” buttons was used as
the task completion time.
Although every attempt was made to keep the differences between the three in-
terfaces as minimal as possible, slight differences in their handling were necessary.
Unlike timeboxes, which can be drawn anywhere to specify a query, the form fill-in
and range slider interfaces require some initialization. It was decided that these con-
trols should be initialized to contain the maximum extent of the data set. Thus, these
tasks began with a query that contained all of the items in the data set.
Query execution also differs slightly. The range slider and timebox interfaces exe-
cute queries implicitly with every mouse event. Form fill-in queries are executed either
by pressing “return” in any of the form fill-in boxes, or by pressing the “run query”
button on the toolbar.
Further special handling was needed for creation of additional query terms and for
deletion of query terms. For timebox queries, these mechanisms were straightforward
and based on TimeSearcher: new terms could be created by selecting the drawing
icon on the toolbar and drawing the new box. Deletion of queries is accomplished by
right-clicking on the timebox and selecting “delete” from the pop-up menu.
Form fill-in and range slider queries share common mechanisms for creation and
deletion. The “New Query Item” button on the toolbar causes a new query term to be
created. Each query component occupies a separate line in the bottom window of the
tsexp display. As with the original query term, this new term will initially occupy the
entire extent of the data space. Each term in the query has a “delete” button, which can
be used to remove that term. If only one term is present, it cannot be deleted.
All three interfaces provide users with feedback indicating the extent of the query
items. For range sliders and form fill-in, the extents are provided with each query line.
For timeboxes, the bottom window is used to display lines with feedback displaying
the extent of each box that has been created. This feedback is dynamically updated
as boxes are moved. In all cases, selecting a query item leads to highlighting of the
corresponding feedback (Figure 7.4).
Figure 7.4: Feedback provided in the tsexp interface. Note the highlighted border
around the feedback corresponding to the selected timebox.
7.1.2 Complexity
There are two sources of complexity in the class of time series tasks that might be
handled by timeboxes:
1. Number of modifications: Many tasks involve comparison of results from
slightly differing queries. These tasks require creation of a query that is sub-
sequently modified, along with comparison of the corresponding results. The
difficulty of these tasks increases with the number of modifications/comparisons
that must be made.
2. Number of query terms: Tasks that involve identification of complex patterns
require creation of several terms. The complexity of these tasks increases with
the number of terms required.
This study investigates the first source of complexity, with the other source held
constant. The incomplete study described in Appendix C attempted to address the
second source of complexity. In both studies, three levels of complexity - low, medium
and high - are used.
This design does not account for the possibility for any interaction between the
sources of complexity. Although a more comprehensive study that accounted for in-
teractions might have been interesting, the resulting 3x3x3 design (interface vs. # of
modifications vs. # of query terms) would have required a daunting number of tasks
from each participant. Furthermore, comparative investigation of the relative impact
of the types of complexity is of secondary interest.
7.1.3 Task Types
The study followed a within-subjects design consisting of two sets of tasks. The first
set, which occupied the bulk of the session, involved well-defined questions, while the
second involved exploratory tasks.
Well-Defined Tasks
Participants were asked to use each of the three interfaces to complete tasks at each of
the three levels of task complexity, resulting in a 3x3 design. This session contained
18 tasks - 2 repetitions for each of the 9 possible combinations of the two conditions.
The ordering of interface presentation was varied among the participants, with all
6 possible orderings represented with equal frequency. Three data sets were used for
these tasks, with the data sets similarly varied to avoid disproportionate presentation
of any combination of interface and data set.
In all cases, tasks were presented in increasing order of difficulty, and all of the
tasks for a given complexity level were completed before the next level of complexity
started. Thus, if the order of interfaces was A, B, C, the order of tasks would be
A low-complexity, B low-complexity, C low-complexity, A low-complexity, B low-complexity,
C low-complexity, A medium-complexity . . . .
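The ordering scheme described above can be sketched in a few lines of Python. This is only an illustration of the counterbalancing logic, not the code used in tsexp; the function names and the round-robin assignment of orderings to participants are assumptions made for the example.

```python
from itertools import permutations

INTERFACES = ("Form Fill-in", "Range Slider", "Timebox")
COMPLEXITIES = ("low", "medium", "high")
REPETITIONS = 2  # two repetitions of each interface/complexity combination


def interface_orderings():
    """Return all 3! = 6 possible orderings of the three interfaces."""
    return list(permutations(INTERFACES))


def task_sequence(ordering):
    """Tasks in increasing complexity; within each complexity level the
    interface ordering is cycled once per repetition."""
    return [
        (interface, complexity)
        for complexity in COMPLEXITIES
        for _ in range(REPETITIONS)
        for interface in ordering
    ]


def assign_orderings(num_participants):
    """Hypothetical round-robin assignment, so that each ordering is used
    equally often when num_participants is a multiple of 6."""
    orderings = interface_orderings()
    return [orderings[i % len(orderings)] for i in range(num_participants)]
```

With twelve participants, each of the six orderings is assigned twice, and each participant sees 18 tasks (3 interfaces x 3 complexity levels x 2 repetitions).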
The levels of complexity were defined in terms of the number of changes to an
initially-specified query that would be needed to answer a question. Low-complexity
queries simply required answering a question that could be specified with a single time-
box, medium complexity tasks required comparison between three conditions (two
modifications), and high complexity tasks involved comparison between five condi-
tions (four modifications). For example,
1. Low Complexity: How many stocks had prices between $10 and $30 during
weeks 1-5?
2. Medium Complexity: Which price range has the most stocks during days 29-30:
$50-$75, $75-$100, or $68-$93?
3. High Complexity: More difficult queries involving comparison between five
possibilities: Which days have the most stocks with prices between $50 and
$100: 2-10, 4-12, 6-14, 8-16, or 10-18?
Medium and high complexity questions involved modification in either time or
value, but not both. Tasks used for this study are given in Appendix B. As the tasks in
this study only involved the use of one set of constraints at any given time, the tsexp
interface that disallowed multiple simultaneous query constructs was used.
The data sets used in this study tracked 30 days of stock prices for a set of 200
stocks, extracted from a larger set of actual stock prices from 1998-1999 (thanks to
Martin Wattenberg for providing the stock price data).
Twelve graduate and undergraduate students from the University of Maryland’s
Department of Computer Science participated in this study. These twelve participants
represent 2 participants for each of the six possible orderings of interface presentation.
The study materials and interface were tested with two pilot subjects and revised
based on the resulting feedback. Study participation took about one hour.
Exploratory Tasks
Participants were asked to use each of the three interfaces to find items in the data set
that were somehow “interesting” or “unusual”. Users were asked to find 3 such items
with each interface. The exact definition of what constituted an “interesting” item was
left to the discretion of the participant.
7.1.4 Hypotheses
This study examines the following hypothesis:
Hypothesis 1 Direct-manipulation of graphical query widgets is faster for specifying
and modifying complex time series queries than alternative interfaces that are seman-
tically equivalent.
A secondary hypothesis addresses the interaction between interface type and task
complexity:
Hypothesis 2 The advantages of timeboxes will be greater for more difficult tasks.
This comparison is, by design, quite narrow. None of the other aspects of the
TimeSearcher display were included in the study. This approach focuses the evaluation
specifically on the query specification mechanisms. Future studies might evaluate the
impact of interface components such as the envelopes and overviews (Chapter 4).
7.1.5 Procedure
After signing informed consent forms, participants read a short introduction to the
problem and tasks.
A training session containing 6 questions followed. Each of the three interfaces
was included twice in the training session, with 1 low-complexity question and 1
medium-complexity question. High-complexity questions were not included in the
training session. Training questions were repeated as needed in order to familiarize
users with the interfaces and then the tasks. When necessary, the administrator of the
study completed one or more of the training tasks for the participants. In these cases,
the participants repeated the tasks on their own as well.
The well-defined tasks were presented after the training session. Each task was
presented with a three minute time window. If the participant did not arrive at a suitable
answer within that window, they were given a second three minute time
window to repeat the task. No further attempts were allowed. The dependent measure
for these tasks was the time required for completion.
Exploratory tasks followed the well-defined tasks. Participants were given up to
3 minutes with each interface. During that time, they were asked to find the items of
interest. Measures for these tasks included the number of items actually found, and the
time required to find them.
During all tasks, the administrator of the study was observing the participants’
interactions with the system.
After the training session, users completed a short subjective questionnaire. Ques-
tions were based on a subset of the Questionnaire for User Interface Satisfaction
(QUIS [32]) (Appendix B.4). Users were also asked to identify the interface that they
preferred to use for each of the two tasks.
7.1.6 Results
Results for the well-defined tasks are given in Figure 7.5. These results were analyzed
with a repeated measures analysis of variance (RMANOVA). As expected, task com-
pletion times increased significantly with complexity (F(2,103) = 53.25, p < .01).
The impact of the interface was also strongly significant (F(2,103) = 25.03, p < .01),
but not in the manner that was expected. The form fill-in interface was fastest overall
(41.9ms average for all tasks), followed by the range slider (54.1ms), and timeboxes
were slowest (73.4ms). The interaction between interface and task complexity was
not significant (F(4,99) = 1.18, p = .33). These results clearly fail to support the
hypotheses.
Further examination of the times for each user and task supports the generally poor
performance of timeboxes: for 10 out of 12 users, performance with timeboxes was in-
ferior to performance with the other interfaces for all three tasks. In the two remaining
cases, timeboxes outperformed range sliders for the high-complexity tasks.
Results for the exploratory tasks are given in Figures 7.6 and 7.7. There were
no significant differences between the three interfaces, either in the number of items
correctly identified (RMANOVA, F(2,33) = .60, p = .55) or in the task completion
time (F(2,33) = .89, p = .42).
Subjective satisfaction results are given in Figure 7.8. For three of the four ques-
tions - terrible/wonderful, frustrating/satisfying, and difficult/easy - the form fill-in
interface was rated most highly, followed by range sliders and timeboxes. This differ-
ence was significant in all three cases (ANOVA, F(2,33) = 10.8, p < .01, F(2,33) =
13.77, p < .01, and F(2,33) = 26.13, p < .01, respectively). For the rigid/flexible
question, range sliders were rated most highly, followed by form fill-in and then time-
boxes, but these results were not significant (F(2,33) = 1.19, p = .32).
[Figure 7.5 plot: average task completion time (ms) versus complexity (Low, Medium, High) for the Form Fill-in, Range Slider, and Timebox interfaces.]
Figure 7.5: Average completion time (with standard deviation error bars) for
well-defined tasks.
               Form Fill-in   Range Slider   Timebox
Well-Defined        9              2            1
Exploratory         2              5            5
Table 7.1: User preferences by interface for the different task types.
When asked which interface they preferred for each type of task, users expressed a
strong preference (9/12) for the form fill-in interface on the well-defined tasks. Prefer-
ences for the exploratory task were more mixed, with five users preferring range sliders
and five preferring timeboxes (Table 7.1).
Original plans for this study called for 18 participants - 3 for each of the 6 orderings
of interface presentation. Analysis after 12 subjects led to the above results. As these
results are generally unambiguous, the study was terminated at that point.
[Figure 7.6 plot: number of items correctly identified, by interface (Form Fill-in, Range Slider, Timebox).]
Figure 7.6: Number of items correctly identified in exploratory task.
7.1.7 Discussion
This study failed to support the hypothesis that timebox queries would provide better
performance than the alternatives. In fact, form fill-in interfaces provided the best
performance, followed by range sliders and finally by timeboxes. For three of the
four measures of subjective satisfaction, users rated the three interfaces in the same
order. Observations of participant interactions with the system and comments made
during the sessions were consistent with the statistical results. Participants frequently
commented that they liked the form fill-in interface and found timeboxes hard to use.
The tasks used in this study might have played an important role in these results.
All of the well-defined tasks involved precisely defined regions for comparison, with
time periods and dollar values expressed exactly. This sort of task is especially well-suited
for the form fill-in interface: to specify - for example - a range between $50 and
$75, participants simply had to type in the two numbers and press “return”.
[Figure 7.7 plot: average task completion time (ms) by interface (Form Fill-in, Range Slider, Timebox).]
Figure 7.7: Average task completion time for exploratory tasks.
Completing these well-defined tasks with the range sliders or timebox interface
requires fine-grain movement of user interface widgets over a small number of pixels.
This led to frustration for many participants, as the interface did not always provide
the level of control that they might have wanted. Participants would often get close
to the desired values and then overshoot, oscillating back and forth until reaching the
desired value. For timebox queries, this happened more frequently for changes in
values than in times, which is consistent with the significantly higher granularity of
the value dimension.
The choice of tasks may also limit the generality of these results. The fully-defined
queries used in this study are not necessarily representative of how timeboxes and
TimeSearcher might be used by actual users. Users engaged in data exploration tasks
are likely to engage in a wide mix of timebox manipulations, including creating, scaling,
moving, and deleting query components. This study seems to indicate that timeboxes
are slower than the alternatives for the basic operation of creating a query component,
but other operations are not addressed. If query modification is faster with
timeboxes than with form fill-in or range slider interfaces, overall performance for
real user tasks might be best with timeboxes. Furthermore, query creation might be
substantially different for real tasks. Users engaged in data exploration may not be
interested in exact criteria for initial query specification. In this case, the performance
penalties associated with timeboxes might be significantly reduced.
[Figure 7.8 plot: average subjective rating for the Terrible/Wonderful, Frustrating/Satisfying, Difficult/Easy, and Rigid/Flexible questions, for the Form Fill-in, Range Slider, and Timebox interfaces.]
Figure 7.8: Average subjective satisfaction ratings (1-9, 9 is best), n = 12.
The choice of population for study participants may have been a contributing factor
in these results. TimeSearcher is designed to be a tool for motivated domain experts.
Training and familiarity with the tool are implicit in that assumption. For this study,
participants were given minimal training (less than 30 minutes). Furthermore, the stock
price data set may have been unfamiliar to some of the users. Although it is likely that
most computer science students have some familiarity with the stock market, they may
not have much experience using charts of stock prices.
Observation of study participants revealed some behavior patterns that may have
been a result of the relative lack of training and experience with the timebox interface.
Some users seemed intimidated by the timebox interface - the blank screen presented
at the start of each task may have left them unclear how to proceed. Other participants
had difficulty interpreting the effects of modifications to timeboxes. Specifically, they
were surprised when changes in the size of a timebox led to unanticipated changes in
the size of the result set.
Some of the participant confusion in interpreting timebox queries might be at-
tributed to a fundamental asymmetry in timebox interpretation. When the range of a
timebox in the value dimension (vertically) is increased, the query is more inclusive:
for each of the n observations included in the timebox, the range of acceptable values
has increased. However, an increase in the extent of a timebox in the time dimension
makes the query less inclusive. If a timebox is increased from n observations to n′ (n′ > n)
observations, an additional n′ − n constraints have been imposed. Changes in the two
dimensions are therefore not comparable: when the range of the box increases in one
dimension (vertically), the result set may grow larger, but when the range increases
in the other direction (horizontally), the result set may shrink in size. To some users,
this may be somewhat counterintuitive, particularly if they believe that enlarging the
timebox should enlarge the result set. Indeed, several participants seemed to experience
difficulties in interpreting query results after they modified the temporal extent of a
timebox.
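The asymmetry described above can be made concrete with a small sketch. The code below is not TimeSearcher's implementation; it is a minimal sequential-scan version of the timebox semantics, and the function names and the three-item data set are invented for illustration.

```python
def matches(series, t1, t2, v_lo, v_hi):
    """True if every observation at time indices t1..t2 (inclusive)
    lies within the value range [v_lo, v_hi]."""
    return all(v_lo <= v <= v_hi for v in series[t1:t2 + 1])


def timebox_query(data, t1, t2, v_lo, v_hi):
    """Names of all items that satisfy the timebox, in sorted order."""
    return sorted(
        name for name, series in data.items()
        if matches(series, t1, t2, v_lo, v_hi)
    )


# Hypothetical data set: three items observed at four time points.
data = {
    "A": [10, 12, 14, 30],
    "B": [10, 25, 14, 12],
    "C": [11, 13, 15, 14],
}

base = timebox_query(data, 0, 1, 5, 20)    # times 0-1, values 5-20
taller = timebox_query(data, 0, 1, 5, 30)  # same times, larger value range
wider = timebox_query(data, 0, 3, 5, 20)   # same values, larger time range
```

In this example the base query returns A and C. Enlarging the box vertically admits B as well, while enlarging it horizontally adds two more constraints per item and leaves only C.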
Figure 7.9: A demonstration of the difficulty of resizing small handles. The large
timebox on the left has handles that are clearly separated and easily graspable. The
small timebox on the right has handles that are only a few pixels apart, and are therefore
harder to select.
Other difficulties may have contributed to the disparity in task performance. With
the timebox interface, queries that involved small time or value ranges proved particu-
larly difficult to move or resize. This difficulty was caused by the resize and movement
handles on the outline of the timebox. In general, the user must click on a corner or on
the middle of one of the sides. Very small timeboxes may have handles that are sepa-
rated by a few pixels or less, making selection very difficult (Figure 7.9). As a result,
these queries become especially time-consuming and frustrating. Similar behavior
may happen with range sliders as the thumbs on the opposite end of the slider become
closer together. Form fill-in interfaces, which do not suffer from such difficulties, may
have increased advantages for queries covering a small range.
These observations present a design challenge for the timebox interface. Specif-
ically, how might the interface be modified to better support moving and resizing of
small timeboxes? One possibility would be to use some form of interface “gravity”
that would attract the mouse pointer towards the nearest handle. Alternatively, some
local magnification - perhaps through a lens - might be used to display the area in
question in greater detail, thus allowing more fine grain control.
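One way the “gravity” idea might work is sketched below. This is a hypothetical design, not a feature of TimeSearcher: the handle layout (four corners plus the midpoint of each side), the handle names, and the 6-pixel capture radius are all assumptions chosen for illustration.

```python
import math


def handles(x1, y1, x2, y2):
    """Pixel positions of eight assumed handles: the four corners and the
    midpoints of the four sides (screen coordinates, y1 is the top edge)."""
    xm, ym = (x1 + x2) / 2, (y1 + y2) / 2
    return {
        "nw": (x1, y1), "n": (xm, y1), "ne": (x2, y1),
        "w": (x1, ym), "e": (x2, ym),
        "sw": (x1, y2), "s": (xm, y2), "se": (x2, y2),
    }


def snap_to_handle(box, pointer, radius=6.0):
    """Return (name, position) of the handle nearest to the pointer if it
    lies within `radius` pixels, otherwise None."""
    name, pos = min(
        handles(*box).items(),
        key=lambda item: math.dist(item[1], pointer),
    )
    return (name, pos) if math.dist(pos, pointer) <= radius else None
```

A sketch like this also shows why very small boxes are the hard case: when the box is only a few pixels across, several handles fall inside the same capture radius, so gravity alone cannot tell them apart and some additional rule (or magnification) would still be needed.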
Another approach would be to supply alternate, indirect tools for modifying time-
boxes. TimeSearcher takes this approach, providing range sliders and form fill-in fields
that can be used to modify the time and value extents of a timebox (Chapter 4). While
these tools have proven useful, improvements to the direct manipulation interface have
the potential to be more flexible and easier to use. Additional implementation and
evaluation will be necessary to compare alternative approaches.
Several participants also had difficulty dragging timeboxes over large horizontal
ranges. When answering a question that involved looking at a given value range at dif-
ferent points in time (see the “High Complexity” example, above), participants would
define a box that examined the first condition and then attempt to drag it horizontally
to the time ranges covered in subsequent conditions. In doing so, they often found that
the box drifted vertically during the course of the movement, requiring a readjustment
of the value range after the desired time range had been reached. This readjustment
appeared to contribute substantially to both task completion time and user frustration
with timeboxes.
This drift was largely a result of the lack of stability in mouse movement. The
difficulty of moving a mouse in a vertically constrained tunnel is inversely related to the
width of the tunnel [2], making strictly horizontal movement virtually impossible. As
a result, any horizontal mouse movement is likely to contain substantial vertical noise.
Since the timebox interface does not have any constraints on the vertical movements
of timeboxes, this noise will lead to vertical movements of the timebox. Since the
timebox interface constrains horizontal movements to be in discrete quanta based on
the number of time points in the active data set, horizontal drift does not cause a similar
problem when the mouse is moved vertically.
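The inverse relationship between tunnel width and difficulty is often written in the form of a steering law for a straight tunnel. The formula below is offered as a sketch of that relationship; whether it matches the exact model in [2] is an assumption, and the symbols are introduced here rather than taken from the text.

```latex
% Steering-law sketch for a straight tunnel (symbols are assumptions):
%   T    = time needed to move through the tunnel
%   A    = length of the tunnel (horizontal distance dragged)
%   W    = width of the tunnel (allowed vertical deviation)
%   a, b = empirically fitted constants
\[
  T = a + b \, \frac{A}{W}
\]
% As W approaches zero (a strictly horizontal drag), A/W grows without
% bound, which is why unconstrained horizontal movement without vertical
% noise is effectively impossible.
```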
A simple modification to TimeSearcher provides some assistance in overcoming
this difficulty. When the user clicks and drags a box with the middle mouse button or
mouse wheel (as opposed to the left mouse button), the timebox will move horizontally,
but not vertically. Of course, TimeSearcher users can also use the range slider to
indirectly adjust the time range of a timebox without modifying the value range.
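The two behaviors described above, snapping horizontal movement to discrete time points and optionally suppressing vertical movement, can be sketched as follows. This is an illustrative model rather than TimeSearcher's code; the function name, the tuple layout of a box, and the `horizontal_only` flag (standing in for the middle-button drag) are assumptions.

```python
def drag_timebox(box, dx, dy, time_step, horizontal_only=False):
    """Move a timebox by a mouse displacement expressed in data units.

    box is (t1, t2, v_lo, v_hi). Horizontal movement is quantized to whole
    multiples of time_step; vertical movement is continuous, and is ignored
    entirely when horizontal_only is True.
    """
    t1, t2, v_lo, v_hi = box
    shift_t = round(dx / time_step) * time_step
    shift_v = 0.0 if horizontal_only else dy
    return (t1 + shift_t, t2 + shift_t, v_lo + shift_v, v_hi + shift_v)
```

For example, dragging a box covering days 2-10 and prices 50-100 by 2.2 days with 1.5 units of vertical noise yields days 4-12 in both modes, but the value range drifts to 51.5-101.5 in the unconstrained case and stays at 50-100 in the constrained one.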
Observation of study participants revealed a range of problem-solving strategies
that were used and problems that participants encountered. Several users followed a
serial process in creating timeboxes, manipulating one side at a time. For example,
instead of dragging a box horizontally to the left, they would drag the left-hand side
to the left, and then drag the right-hand side. As the sides could be dragged inde-
pendently without changing the value range, this provided greater control and avoided
the “vertical drift” problems discussed above. Similarly, some users created boxes by
drawing them along the horizontal axis and then dragging them up to the desired value
range. This technique may have been useful for increasing accuracy in the time range.
Some aspects of the system implementation and study design may have influenced
user performance. To simplify the user tasks, the testing software rounded all dis-
played values to integer values. However, the values in the data set were not similarly
changed. This led to user confusion, as a vertical movement of a box (perhaps be-
tween lower bounds of 53.6 and 53.8) might not have changed the value displayed
(which would stay at 54), but the upper bounds might have changed (perhaps from 73
to 74). When faced with this problem, users often tried repeatedly to adjust the boxes
appropriately. The study administrator tried to identify these situations and define the
task as complete when the user was close enough, but this problem increased task
completion times for the timeboxes.
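The rounding problem can be reproduced with the numbers from the example above. The sketch assumes round-to-nearest display and a box of fixed height 19.8; both are assumptions made for illustration, since the text does not specify the rounding rule or the box size.

```python
def displayed_bounds(lower, height):
    """Return the (lower, upper) bounds as they would appear on screen
    if each bound is rounded to the nearest integer for display."""
    upper = lower + height
    return round(lower), round(upper)


before = displayed_bounds(53.6, 19.8)  # true bounds 53.6 and 73.4
after = displayed_bounds(53.8, 19.8)   # true bounds 53.8 and 73.6
```

Moving the box up by 0.2 leaves the displayed lower bound unchanged while the displayed upper bound changes by one, so a user trying to hit an exact pair of displayed values can end up adjusting the box repeatedly.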
Several users were also confused by the differences in scales between the query
space and the displays of the individual items. As the query space was taller than each
individual item in the display list, features and transitions that were prominent and easy
to spot in the query space may have been difficult to find in the individually-displayed
items. This confused users and created difficulties in completing the exploratory tasks.
The form fill-in and range slider interfaces may have suffered from another artifact
of the design of the software. For these interfaces, the controls (text fields and range
sliders, respectively) are decoupled from the input box - users must go to the query
window and examine two sets of controls presented horizontally (Figures 7.1 and 7.2).
Some users found this arrangement confusing, as they had difficulty correctly mapping
the controls to the correct dimension. This confusion presented itself in the form of an
inappropriate data entry - attempting to enter a time constraint in the value control, for
example. Most users stopped doing this after the error was pointed out to them.
7.2 Empirical Evaluation of Input and Output for
Exploratory Tasks
The study described above (and the study described in Appendix C) focused solely on
query input. This decision was made in order to focus the study on the merits of the
timebox query input. This resulted in a study that did not explore all of the strengths
of TimeSearcher - notably, the graph overviews and the scrolling display list.
This study was designed to augment the first study by adding consideration of
query result display to the evaluation. Drawing on lessons learned from the first study,
this study used different formats for presentation of tasks, and contained fewer tasks,
than the first study.
7.2.1 Interfaces
This study compared three different interfaces for completing queries on time series
data. Two of these interfaces are identical to those used in the first study (Section 7.1):
the form fill-in interface and the timebox interface.
The third interface used form fill-in for query specification, with a spreadsheet-like
table of numeric values to display query results. Each row in this spreadsheet was a
single item in the data set, with each column containing one of the time periods. An
additional column contains the names of the items (Figure 7.10). When the query is
executed (either by pressing “return” in one of the text entry fields or by pressing the
“Run Query” button), this table is updated to display the items that match the query.
Despite this different display, procedures for query specification and modification are
identical to those used in the form fill-in interface described above.
This design intentionally omits the range slider interface used in the previous study.
The inclusion of a fourth condition would have lengthened sessions, potentially making
them unacceptably long.
Figure 7.10: The form fill-in interface with tabular display of query results. Each row
contains the data for one item in the set, with the values for each time period displayed
in the columns.
For this study, the total task completion time was defined as the interval between
pressing the “start” button and the last modification of any item in the query. This is in
contrast to the previous study, which used the time between pressing the “start” button
and the “stop” button as the task completion time. The approach used for this study
has the advantage of greater accuracy, as it is not dependent upon an action that users
often forgot to take.
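The difference between the two timing definitions can be illustrated with a small sketch over a logged event stream. The event format, the event names, and the function name are hypothetical; the text does not describe how tsexp actually recorded its logs.

```python
def completion_times(events):
    """Compute both definitions of task completion time.

    events is a list of (timestamp_in_seconds, kind) pairs, where kind is
    "start", "modify", or "stop". Returns a pair:
      (start-to-stop time, start-to-last-modification time)
    with None for a measure whose closing event was never logged.
    """
    start = next(t for t, kind in events if kind == "start")
    stops = [t for t, kind in events if kind == "stop"]
    mods = [t for t, kind in events if kind == "modify"]
    start_to_stop = stops[-1] - start if stops else None
    start_to_last_mod = max(mods) - start if mods else None
    return start_to_stop, start_to_last_mod
```

If a participant forgets to press “stop”, the first measure is missing or inflated, while the second is still available because it depends only on query modifications that the interface records automatically.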
7.2.2 Tasks
Task design in this study attempted to avoid troubling characteristics of the tasks used
in the initial study (Section 7.1) and in the aborted study (Appendix C). In the first
study, well-defined tasks involving complete specification of time and value ranges
proved to be well-suited for the form fill-in interface. In the aborted study, verbal
descriptions of a less-precise pattern were found to be confusing to study participants.
In an attempt to find a middle ground between these two extremes, tasks in this
study were designed to be precise enough to be easily understood while also being
open-ended enough to be more challenging than the tasks from the initial study. Tasks
generally asked users to identify items that fell within a given value range for some
number of days, or had other transitions of a well-defined magnitude.
Each participant completed four tasks: two training tasks and two timed tasks.
One of the training tasks was somewhat simpler than the other, in order to help par-
ticipants gain familiarity with the interfaces. The tasks were the same for all three
interfaces. Balanced ordering of the presentation of the interfaces was used to over-
come any learning effect that might have been caused by repeated exposure to the
questions. Each task was presented to the users with a graphical depiction of an item
matching the pattern. These graphics were included to help participants understand
the questions.³ The questions used in this study are given in Appendix D.
³ Thanks to Francois Guimbretiere for this suggestion.
7.2.3 Hypothesis
This study was designed to test the following hypotheses:
Hypothesis 3
1. Graphical display of results will lead to faster task completion time than tabular
display.
2. Direct manipulation specification of queries will lead to faster task completion
time than form fill-in specification.
3. Task completion time will be fastest for the direct manipulation interface, followed
by the form fill-in interface with graphical feedback and finally by the
form fill-in interface with tabular feedback.
7.2.4 Procedure
The session began with the signing of informed consent forms, and a brief explanation
of the goals of the study and the tasks. The main body of the session consisted of three
blocks - one for each of the three interfaces. Each block consisted of the following
steps:
1. The administrator of the study described the interface and demonstrated its use.
2. The participant was given the opportunity to try the interface.
3. The participant completed the four tasks.
For each of the tasks, the user was instructed to read the task description and to
verify that they understood the question before starting the task. This often involved
having the user restate the question. When there was any confusion, the administrator
provided clarification. This emphasis on comprehension was included in an attempt to
avoid the comprehension problems faced in the aborted study (Appendix C).
Participants had two minutes (120 seconds) to complete each task. Only one
attempt was allowed for each question - if the question was not answered at the end of
the allowed time, the participant simply moved on to the next task.
The data sets used for this study were synthetic data sets that were hand-tuned
to include answers to the various tasks. Specifically, data sets containing randomly-
generated values for 13 time points for each of 100 stocks were generated, and then
modified to guarantee that they each contained at least five items that would be correct
answers for each of the tasks. Four data sets were used - one for training and one for
each of the three interfaces. The ordering of the data sets for the timed tasks questions
was varied so that every possible pairing of data set with interface occurred equally
frequently, and the ordering of the data sets within the sessions was also balanced.
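The construction of such a data set can be sketched as follows. The study's data sets were tuned by hand; the automated planting step, the price range of 10 to 100, and the thresholds used to define "low" and "high" prices are all assumptions made for illustration.

```python
import random

N_ITEMS, N_POINTS, MIN_MATCHES = 100, 13, 5


def matches_low_then_high(series, low_max=40.0, high_min=60.0):
    # Hypothetical task: low prices in the first five periods, high in the last five.
    return max(series[:5]) <= low_max and min(series[-5:]) >= high_min


def plant_low_then_high(series, rng, low_max=40.0, high_min=60.0):
    # Overwrite the first and last five values so the item satisfies the task.
    planted = list(series)
    planted[:5] = [round(rng.uniform(10.0, low_max), 2) for _ in range(5)]
    planted[-5:] = [round(rng.uniform(high_min, 100.0), 2) for _ in range(5)]
    return planted


def make_data_set(seed):
    rng = random.Random(seed)
    # Start from purely random values for every item and time point.
    data = [[round(rng.uniform(10.0, 100.0), 2) for _ in range(N_POINTS)]
            for _ in range(N_ITEMS)]
    # Modify items in order until the guaranteed number of answers exists.
    i = 0
    while sum(matches_low_then_high(s) for s in data) < MIN_MATCHES:
        if not matches_low_then_high(data[i]):
            data[i] = plant_low_then_high(data[i], rng)
        i += 1
    return data
```

A full version would repeat the planting step for each task, so that every data set contains at least five correct answers per task.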
After the session, each participant completed a subjective satisfaction form similar
to the form used in the first study (Appendix B.4).
The initial design of the study was revised based on feedback from three pilot
subjects. The most significant change that was made on this basis was simplification
of one of the training tasks.
Thirteen Computer Science graduate students from the University of Maryland
participated in this study. Due to technical difficulties, data from one of the subjects
was not collected correctly, so the analysis only included the results from the remaining
twelve subjects. Thus, each of the six possible interface orderings was used by two
subjects.
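The counterbalancing arithmetic is simple: three interfaces give 3! = 6 possible orderings, so twelve subjects allow each ordering to be used exactly twice. The sketch below illustrates this; the interface labels and the round-robin assignment rule are assumptions, not the assignment procedure actually followed.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels for the three study conditions.
INTERFACES = ("form+table", "form+visual", "timebox")


def assign_orderings(n_subjects):
    # Enumerate all 3! = 6 presentation orders and hand them out round-robin.
    orders = list(permutations(INTERFACES))
    return [orders[i % len(orders)] for i in range(n_subjects)]


assignments = assign_orderings(12)
usage = Counter(assignments)   # each of the six orderings is used twice
```

With twelve subjects, each interface also appears in each presentation position (first, second, third) exactly four times.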
Figure 7.11: Average task completion time with standard deviation error bars.
[Bar chart. x-axis: Interface (Form with Table, Form with Visual, Timebox); y-axis: Average Task Completion Time, scale 0 to 120.]
7.2.5 Results
Average task completion times were 49.94 seconds for the timebox interface, 58.52
seconds for the form fill-in interface with tabular feedback, and 59.07 seconds
for the form fill-in interface with visual feedback. These results appear to indicate
a slight advantage for the timebox interface, but the results are not significant
(ANOVA, F(2,69) = 0.76, p = .47) (Figure 7.11). Separate analyses of each of
the two timed tasks also failed to show any statistically significant differences be-
tween the three interfaces (ANOVA, F(2,33) = 0.02, p = .98 for question one and
F(2,33) = 1.06, p = .36 for question two) (Figure 7.12).
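The degrees of freedom reported above are consistent with a one-way analysis that treats each timed task as a separate observation: 12 subjects × 2 timed tasks × 3 interfaces gives 72 observations, so F has (3 - 1, 72 - 3) = (2, 69) degrees of freedom, and each per-question analysis has (2, 36 - 3) = (2, 33). The sketch below shows how such an F statistic is computed. It is a generic textbook calculation, not the analysis script used for this study, and the sample values are placeholders rather than study data.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder values (not study data): three groups of three observations.
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

Converting F to a p-value additionally requires the F distribution, which is available in standard statistics packages.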
As expected, the results appear to have been influenced by a learning effect. Of
the twelve subjects, only one was fastest with the first interface that they saw, while
six were fastest with the second interface and five were fastest with the third. Similarly,
nine of the twelve subjects were slowest with the first interface that they used, while
the remaining three were slowest with the second interface. Grouping the questions
in terms of ordering of presentation (first interface, second, or third) reveals a significant
effect of interface ordering, with the third interface presented having the best
performance (ANOVA, F(2,69) = 4.97, p < .01). Paired t-tests of the orders showed
that the difference between the second and third interface presented was not significant
(t = 0.12, p = .45), but the first interface was significantly slower than both
the second and third interfaces (t = 2.61, p < .05 and t = 1.68, p < .05, respectively).
Figure 7.12: Average task completion time (with standard deviation error bars) for
each of the two timed tasks (Question 1 and Question 2), shown for each interface.
Examination of the results for the individual participants reveals that differences
in performance may be stronger in some individuals than in others. Six of the twelve
participants had the fastest completion times with the timebox interface. For these
six subjects, task completion times for the timebox interface were significantly faster
than with either of the form fill-in interfaces (ANOVA, F(2,15) = 4.40, p < 0.05,
Figure 7.13). The remaining six subjects did not show any significant differences
between the three interfaces (ANOVA, F(2,15) = 0.38, p = .69, Figure 7.14).
Figure 7.13: Average task performance times (with standard deviation error bars) for
the six participants who were fastest with the timebox interface.
Some of this effect may have been caused by an imbalance of ordering. Of the
six participants who were fastest with the timeboxes, three of them used the form fill-in
interface with tables first, and only one of them used the timebox interface first. Further
investigation would be needed to conclusively determine whether the differences in
performance observed from these six subjects were a meaningful effect, as opposed to
being a result of the order of presentation.
Figure 7.14: Average task performance times (with standard deviation error bars) for
the six participants who were fastest with either of the form fill-in interfaces.
Participants clearly preferred the timebox interface. The timebox interface was
rated significantly higher for the four questions that rated the interfaces (ANOVA, p <
0.05 in all cases, Figure 7.15). When asked to indicate which interface they preferred,
ten of the twelve subjects indicated the timebox interface, one indicated the form fill-in
interface with visual feedback, and one indicated an equal preference for the timebox
interface and the form fill-in interface with visual feedback.
Figure 7.15: Average subjective satisfaction ratings (1-9, 9 is best), n = 12, on the
Terrible/Wonderful, Frustrating/Satisfying, Difficult/Easy, and Rigid/Flexible scales
for each of the three interfaces. The preference for the timebox interface was
significant in all cases.
7.2.6 Discussion
This study failed to show any significant differences in task performance times for
the three interfaces. This result stands in contrast with the outcome of the first study,
which showed a significant performance advantage for the form fill-in interface. Although
further study might be necessary for a complete understanding, these discrepancies
might be the result of the different designs of the two studies. Analysis of these
differences can provide some insight into the study results and suggest further studies.
The first major difference between the two studies was in the interfaces used. The
first study used interfaces that differed only in the query input. Three modalities were
used: form fill-in, range sliders, and timeboxes. In an attempt to study presentation of
results as well as query input, the second study replaced the range slider interface with
a second form fill-in interface that used a tabular spreadsheet to present query results,
instead of a graphical display.
Although further study involving additional cases (for example, the timebox inter-
face with textual output) might be needed to draw stronger conclusions, the results of
these studies would seem to imply that the query result output is not a major factor
in the differences in performance. In the second study, performance on the two form
fill-in interfaces was virtually indistinguishable (x̄ = 58.52, σ = 35.53 for the tabular
output, x̄ = 59.07, σ = 35.48 for the visual output). Thus, the tabular feedback did not
lead to any measurable performance penalty, despite several complaints from partici-
pants who found it difficult to use.
It seems more likely that the differences between these two studies are a result of
differences in the tasks. The first study found that form fill-in interfaces were superior
for completely-specified tasks. In an attempt to approximate the use of timeboxes for
open-ended data analysis, the tasks in the second study were more open-ended. This
may have helped performance with the timebox interface.
Closer examination of the tasks used in this study might explain the lack of sig-
nificant differences between the interfaces. These tasks do not include fully-defined
constraints that require items to be in a specific value range during a specific interval,
but they do include hard-coded values (“$25 range”, “$40 more”, “rise in price of $35”,
etc.). In some sense, these tasks might be seen as intermediate tasks, falling somewhere
between the fully-defined tasks used in the first study and fully exploratory tasks. The
difference in results between the two studies seems to support the conjecture that the
performance of timeboxes relative to form fill-in interfaces would continue to improve
as tasks move further from fully-defined towards exploratory. Of course, further study
will be needed to verify this hypothesis.
As the participants who were fastest with timeboxes were significantly faster with that
interface, the second study also seems to provide some preliminary evidence for performance
differences between individuals. Observation of the subjects provided
clues as to some of the factors behind these performance differences. Specifically, as
participants were asked to read the question before starting the task, many took the
time to formulate search strategies. This proved particularly useful for form fill-in
interfaces, as planning helped reduce the exploration that was needed. Furthermore,
since this time was not included in the task performance time, this planning appeared
to reduce the search time.
This result raises the possibility that one of the benefits of the timebox interface
might be in reduced cognitive load. Specifically, timeboxes might help users reduce the
strategizing and planning needed to complete tasks. Further study - perhaps including
consideration of planning time - would be needed to investigate this hypothesis.
This study also provided further evidence for the importance of training. Study par-
ticipants generally fared better with the second and third interface than they did with
the first. Further study with more detailed training might clarify the differences be-
tween the three interfaces. However, it should be noted that the assumption of trained
users is acceptable in this case, as timeboxes (and TimeSearcher) are not designed for
use by novices.
The design of this study may have influenced results in a manner that limits the
generality of the results. In this study, users were presented with the three interfaces
in succession, with all tasks from one interface being completed before moving on
to the next interface. This may have reduced cognitive load, but it also might have
contributed to the order effects, as participants’ growing familiarity with the tasks and
strategies might have helped them improve their performance.
The repetition of the tasks across interfaces might have had an impact on the re-
sults. This repetition was intended to ease comparison of performance across the three
interfaces, but it may also have added to the order effects. Even though the data sets
were different, users frequently generated strategies that helped them more effectively
complete the tasks on the second or third try. Further study involving interleaving of
the order of interface presentation and the use of different interfaces would be interest-
ing.
A more subtle aspect of the study design appears to have had an additional impact.
As described above, the data sets used in this study involved randomly-generated data
that was hand-crafted to ensure that each data set contained answers to each of the tasks.
Because of the nature of the tasks, these modifications took on a certain predictable
character. Specifically, the first task looked for items that had low prices during the
first five time periods and high prices during the last five. Many of the items that
matched that task also matched the second task, which required items to stay in a $25
range for four time periods and then to have a rise of $35 at some later point.
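The overlap between the two tasks can be illustrated by writing each one as a predicate over a 13-point price series. This is only a sketch under stated assumptions: the exact task wording is given in Appendix D, the thresholds for "low" and "high" prices are hypothetical, and the second task is interpreted here as any four consecutive periods within a $25 band followed by a later period-to-period rise of at least $35.

```python
def matches_task1(prices, low_max=40.0, high_min=60.0):
    # Low prices in the first five periods, high prices in the last five.
    # low_max and high_min are hypothetical thresholds.
    return max(prices[:5]) <= low_max and min(prices[-5:]) >= high_min


def matches_task2(prices, band=25.0, rise=35.0, run=4):
    # Some window of `run` consecutive periods stays within a `band`-dollar
    # range, and a rise of at least `rise` dollars occurs afterwards.
    for start in range(len(prices) - run):
        window = prices[start:start + run]
        if max(window) - min(window) <= band:
            later = prices[start + run - 1:]
            if any(b - a >= rise for a, b in zip(later, later[1:])):
                return True
    return False


# A hypothetical item planted as an answer to the first task: low early
# prices, a sharp rise, then high late prices.
item = [15, 18, 20, 17, 19, 22, 25, 70, 80, 85, 82, 88, 90]
```

Under these assumptions the example item satisfies both predicates, which is the kind of commonality described above: an item built to answer the first task tends to begin with a narrow, low-priced run and to contain a large rise.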
This commonality interacted with a trend in user strategies. Many study participants
answered the second question by starting at the lower-left corner of the query
space - for example, looking at the range of $10-$35 for the first four time periods.
The presence of items that matched the first task made this an effective strategy
for quickly completing the second task. Repetition of this study with tasks and data
sets that were carefully constructed to avoid these overlaps between tasks and user
strategies might increase confidence in the results.
This study used a modest sized data set with 100 items and 13 time points. It
seems likely that as the number of items or number of time points increases, the users
of a tabular data display will have a much harder time performing as well as users of
the graphic overview.
7.3 Conclusion & Future Steps
Although these studies may fall short of providing strong empirical support for the
utility of timeboxes, they have provided some valuable insights into the design of in-
terfaces for exploration of time series data.
The most obvious need identified was for improved mechanisms for specifying
precise values for timebox ranges. Small timeboxes (and range sliders) can be
hard to manipulate due to the narrow range of pixels that must be precisely selected
(Figure 7.9). Augmenting the TimeSearcher interface with tools that would overcome
these difficulties might help users avoid troubles with narrow ranges. The precise na-
ture of such facilities might require some design or further evaluation. As discussed
above, one possibility might be to implement some notion of “gravity” that would
attract the mouse pointer to appropriate handles, thereby easing selection and manipu-
lation. Another approach is the provision of alternative input mechanisms, such as the
text-entry and range sliders already included in TimeSearcher (Chapter 4). The results
from these studies clearly validate the decisions to include these facilities.
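One way the “gravity” idea might work is sketched below. This is an illustration only: the function name, the pixel radius, and the one-dimensional treatment of handle positions are assumptions, and nothing of this kind is claimed to exist in TimeSearcher.

```python
def snap_to_handle(pointer_x, handle_xs, radius=6):
    """Return the position of the nearest handle if it lies within `radius`
    pixels of the pointer; otherwise return the pointer position unchanged."""
    if not handle_xs:
        return pointer_x
    nearest = min(handle_xs, key=lambda h: abs(h - pointer_x))
    return nearest if abs(nearest - pointer_x) <= radius else pointer_x
```

A pointer within a few pixels of a timebox edge would be pulled onto that edge, while a pointer far from any handle would be unaffected. The radius would need tuning so that narrow timeboxes with closely spaced handles remain individually selectable.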
Other design suggestions that arose out of observations made during these stud-
ies have already been implemented in TimeSearcher. For example, horizontal-only
movement in order to avoid vertical drift (Section 7.1.7) is now supported. Other possibilities,
including the ability to temporarily disable query clauses and to vertically
align boxes (Appendix C.3), are interesting candidates for future work.
Further assessments aimed at exploring the performance of timeboxes on more
exploratory tasks might provide further insights while possibly providing a clearer
demonstration of the utility of timeboxes. For example, the somewhat exploratory
tasks used in the second study might be replaced with more open-ended questions:
users might be asked to identify items that show a large increase in value after
staying relatively steady for some amount of time. These tasks may present difficulties
in evaluating completion and correctness. For example, how would “relatively steady”,
or a “large increase in value” be defined? Clear definitions of tasks and criteria for
judging accuracy of task completion might not be easily specifiable.
These empirical studies provided feedback that has been useful in clarifying un-
derstanding of the strengths and weaknesses of both the timebox query model and
the TimeSearcher tool. The results of the first study have been particularly useful in
this regard: by identifying situations where timeboxes do not perform well, this study
provided motivation for additional query tools like angular queries and variable-time
timeboxes that support the exploratory tasks that timeboxes are likely to be best suited
for. The second study reinforced this intuition, as the relative strength of timeboxes
seemed to improve as questions became more open-ended.
The results of these studies should be interpreted in the context of
the case studies (Chapter 8), which demonstrate the success of TimeSearcher in help-
ing motivated users address meaningful research problems. This feedback supports
the claim that timeboxes and the associated information visualization tools found in
TimeSearcher provide real value for users.
Finally, these studies provided some insight into the challenge of developing em-
pirical methods for evaluating exploratory interfaces. The unexpected results from the
first study were largely a result of the mismatch between the tasks that were chosen
and the strengths of the tool that was evaluated. The incomplete study suffered from
overly complex tasks that participants found hard to interpret. All three studies had
relatively novice users working with a tool that was designed for motivated domain
experts. Empirical evaluations that involve a combination of appropriate tasks and
users are clearly necessary for maximizing the utility of these studies. These results
should contribute to and act as a warning for the increasing number of researchers and
practitioners who pursue evaluation strategies for information visualizations.
Chapter 8
Applications
The design of TimeSearcher has been informed and validated by work with users. In
particular, colleagues in molecular biology have made extensive use of TimeSearcher
for examining time series and linear order data sets. This chapter provides an overview
of the biological applications of TimeSearcher.
TimeSearcher has also been used by researchers to examine climatological (partic-
ulate concentration), hydrological, and demographic (census mortality) data sets.
8.1 DNA Microarray Data Set Analysis
Recent advances in DNA microarray technology have provided geneticists with the
ability to examine expression levels of thousands of genes under varying circumstances
[55]. Numerous published reports of microarray data have used the examination of
changes in gene expression levels over time to examine the effects of various stimuli
on genetic expression.
Analyses of the microarray data generally are conducted via some sort of mathematical
grouping of genes with similar expression profiles. Clustering techniques that
have been used include hierarchical clustering [40, 46, 111], self-organizing maps
[128, 143], and singular value decomposition [65]. Clustered expression profiles are
often displayed with 2D layouts that use coloring to display the expression levels of
each sample, with bright green indicating relatively under-expressed genes and bright
red indicating genes with relatively high levels (Figure 8.1). The Cluster and TreeView
programs are widely used for generation and viewing of clusters [46].
Figure 8.1: Red-green “heat map” display of gene expression levels at seven time points. Each
row is a gene sample, and each column is a time point. Bright green samples are
repressed genes, bright red are induced genes, and darker samples are close to the
average. Genes that are repressed (low expression levels) are shown at the top, and
induced genes (high expression levels) at the bottom [34].
Heat maps are very useful for condensing significant amounts of information in a
display that helps highlight gross trends and similarities between clusters. However,
they generally suffer from the drawbacks of other static displays: interactive querying
and exploration are not supported. Other microarray analysis techniques, including
the use of spreadsheets and manual creation of histograms, suffer from a similar lack of
interactivity.
Figure 8.2: The Hierarchical Clustering Explorer. Dendrogram clusters and filters for
detail and similarity are shown in the top window, and a detailed display of a subset is
shown below. A scatterplot on the right is used for pairwise comparison between two
of the experimental conditions [111].
The Hierarchical Clustering Explorer [111] (Figure 8.2) addresses many of these
problems by combining filters for minimum similarity and detail display with alterna-
tive displays showing pairwise similarities between expression profiles and the ability
to compare clusters computed from different algorithms.
TimeSearcher’s dynamic query tools are well-suited for expressing queries aimed
at identifying genes with particular expression profiles. Two ongoing collaborations
have explored these possibilities.
8.1.1 Programmed Cell Death in Drosophila melanogaster
As organisms develop, new cells are created and old cells that are no longer needed are
destroyed and eliminated. For example, when a tadpole becomes a frog, the tail and in-
testine (among other structures) are no longer needed, and are therefore destroyed. The
process of controlled destruction and elimination of cells is known as programmed cell
death (PCD). Programmed cell death is of interest to biologists for a variety of reasons.
As a genetically-controlled process, PCD involves complex interactions between
many genes. Furthermore, the absence of cell death may be related to the uncontrolled
proliferation of cells associated with cancerous tumors.
Studies of cell death in flies, worms, humans, and other organisms have identified
a variety of genes that are involved in the control of PCD. Furthermore, many of
the genes involved in this process appear to be similar in these organisms - in other
words, the relevant genes have been conserved [12]. However, the processes involved
in PCD are not completely understood. In particular, the exact genes that are required,
and the sequences of expression of these genes, are uncharacterized for many types of
cell death.
Eric Baehrecke’s lab at the University of Maryland Biotechnology Institute,
Center for Biosystems Research, studies programmed cell death in Drosophila
melanogaster - the common fruit fly. In Drosophila, the transition between larva and
pupa involves destruction of larval cells that are no longer needed, along with differ-
entiation of cells that will be used in the future adult [12]. Studies of changes in gene
expression levels in cells that die during these processes are useful for understanding
the genes that play a role in PCD. In particular, larval salivary gland and midgut cells
have proven to be a fruitful area for investigation.
The steroid hormone 20-hydroxyecdysone (ecdysone) plays a critical role in cell
death in Drosophila larvae. The presence of ecdysone at 10 hours after the onset of
metamorphosis leads to the expression of genes known to be involved in Drosophila
cell death, including reaper (rpr) and head involution defective (hid) [72]. More specif-
ically, the gene E93 plays a crucial role in this process. E93 is induced by ecdysone,
and appears to play a critical role in the expression of other cell death genes: E93
mutants have reduced levels of expression of cell death genes including rpr, hid and
others. Furthermore, E93 is expressed only during metamorphosis [84].
Microarray experiments have led to further insight into the genetic mechanisms un-
derlying cell death in Drosophila. These experiments involved RNA samples from flies
at 6 and 12 hours after the beginning of metamorphosis - the times of greatest changes
in gene transcript levels. Furthermore, this work contrasted the steroid-controlled
death of groups of cells that occurs during metamorphosis with the radiation-induced
death of individual cells - processes known as autophagy and apoptosis, respectively.
The microarray experiments from the samples involving steroid-induced cell death
involved 2,876 genes that were consistently found in each of three replicated trials. Of
these genes, 484 showed an increase of 5-fold or greater between the 6 and 12 hour
samples, and 448 showed a decrease of 5-fold or more. Known cell death genes rpr,
hid, dronc, and crq were among those that showed significant increases in transcrip-
tion.
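The 5-fold criterion used in these counts can be expressed as a simple ratio test between the two time points. The sketch below is illustrative only: the gene names and expression values are hypothetical placeholders, and the handling of missing or non-positive values is an assumption rather than a description of the original analysis.

```python
def classify_fold_changes(levels_6h, levels_12h, threshold=5.0):
    """Split genes into those that rise or fall by at least `threshold`-fold
    between the 6 hour and 12 hour samples."""
    increased, decreased = [], []
    for gene, early in levels_6h.items():
        late = levels_12h.get(gene)
        # Skip genes not detected in both samples (assumed handling).
        if late is None or early <= 0 or late <= 0:
            continue
        ratio = late / early
        if ratio >= threshold:
            increased.append(gene)
        elif ratio <= 1.0 / threshold:
            decreased.append(gene)
    return increased, decreased


# Hypothetical expression levels, not values from the data set.
six_hours = {"geneA": 10.0, "geneB": 50.0, "geneC": 20.0, "geneD": 8.0}
twelve_hours = {"geneA": 60.0, "geneB": 5.0, "geneC": 30.0}
up, down = classify_fold_changes(six_hours, twelve_hours)
```

Here geneA rises 6-fold and geneB falls 10-fold, while geneC changes by less than 5-fold and geneD is skipped because it has no 12 hour measurement.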
The samples involving radiation-induced cell death had 5,495 genes that were con-
sistently detected, most of which were at levels nearly equal to those found in the
steroid-induced data set. Only 22 genes increased more than 5-fold in the irradiated
flies (as compared to unirradiated controls), and 12 decreased more than 5-fold.
Comparison of the genes that were induced following radiation with those induced by
steroid revealed that rpr was the only known cell death gene appearing in both data
sets, but several other genes had increased levels of transcription in both data sets [83].
The use of only two time points - 6 and 12 hours after the onset of metamorphosis
- limits the explanatory power of these data sets. Specifically, fluctuations of gene ex-
pression levels between 6 and 12 hours might provide additional insight into regulatory
interactions between genes. This possibility has been addressed by a second microar-
ray experiment, involving 5 time points - 6, 8, 10, 12, and 14 hours after the onset of
metamorphosis. For steroid-induced cell death, these experiments yielded 3225 genes
that were consistently detected.
Analysis of this data with TimeSearcher has been the focus of ongoing collabora-
tion with the Baehrecke lab.
Sample Analysis Sessions
Direct user observation can provide valuable insights into the strengths and shortcom-
ings of a tool for performing a particular task. During the course of the ongoing col-
laboration with the Baehrecke lab, there have been several such sessions. The follow-
ing discussion is a composite of observations from multiple sessions in October and
November 2002.
Analysis of this data set with TimeSearcher might begin with a search for genes
that increase in expression level at each time point. Starting points - i.e., the values
from which genes might increase - are chosen heuristically, as is the magnitude of the
change.
In general, the direction and magnitude of change are more interesting than the exact
values involved. These sessions were conducted before variable-time timeboxes and
angular queries were supported in TimeSearcher, but it seems likely that these exten-
sions to the query model and other tools that supported relative querying (Chapter 9)
might be helpful for this task.
These preliminary searches are useful for identifying genes that are expected
and for confirming understanding with respect to prior results. These “sanity checks” can
help the user build confidence in the data set. Several unexpected genes with high
expression levels were found. Examination of these results indicated that they were
generally ribosomal and anti-microbial genes that are consistently present in cells. Al-
though important to proper cell functioning, such genes are not particularly interesting
in the current context.
Continuing examination of transitions from times 6-8 identified the gene timp,
which is correlated with mmp1, a known cancer inhibitor. This raises the possibil-
ity that timp might be an interesting gene in the context of cell death.
Further examination of transitions from 8 hours to 10 hours yielded additional
insights, including the identification of eif45 as a potentially interesting gene, and the
use of the leaders and laggards functionality identified the Wrinkle gene at times 12 and
14 as potentially interesting, along with the more global observation that the number
of genes that followed a pattern of rises from 8-10 hours was greater at 10-12 hours
than at 12-14 hours.
An alternative approach to using TimeSearcher to analyze this data set starts with
the observation that the E93 gene plays a critical role in cell death in Drosophila [84].
We might hypothesize that genes that have profiles similar to E93’s profile might also be
involved in cell death. Specifically, genes that show increases in expression level after
E93’s expression increases might be regulated by E93 - in other words, E93 might be
a factor that contributes to the expression of these genes.
The use of TimeSearcher to explore this line of investigation begins with the use of
the text search box to find the E93 sample in the database.¹ The drag-and-drop
query-by-example tool is then used to create a query identifying those items that are similar
to E93’s profile. As the 10 and 12-hour measurements - corresponding to the interval
before the second peak in ecdysone levels [11] - are most interesting, the boxes for the
6, 8, and 14 hour samples are eliminated.
The resulting set contains over 1200 of the 3225 genes in the original sample - far
too many to be of immediate interest. To filter further, the timebox for the 12-hour
time point is adjusted to include only those genes with a more pronounced increase in
expression level - the box is moved up to remove lower values, and expanded to include
a higher range of values. The resulting data contains fewer than 300 genes - a substantial
reduction in size (Figure 8.3). These results are saved as potentially interesting.
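The query described above can be sketched in terms of the timebox definition: an item satisfies a timebox if its values lie within the box's value range at every time point covered by the box, and a query is the conjunction of its timeboxes. The class and function names, the gene names, and all expression values below are hypothetical; this is not TimeSearcher's implementation.

```python
from typing import NamedTuple


class Timebox(NamedTuple):
    t_start: float   # first time covered by the box (hours)
    t_end: float     # last time covered by the box (hours)
    v_min: float     # lowest value allowed
    v_max: float     # highest value allowed


TIMES = (6, 8, 10, 12, 14)   # sample times from the five-point experiment


def satisfies(series, box, times=TIMES):
    # Every sample inside the box's time range must lie in its value range.
    # A box that covers no sample times places no constraint on the item.
    return all(box.v_min <= v <= box.v_max
               for t, v in zip(times, series)
               if box.t_start <= t <= box.t_end)


def run_query(data, boxes, times=TIMES):
    # An item is returned only if it satisfies all of the timeboxes.
    return {name for name, series in data.items()
            if all(satisfies(series, b, times) for b in boxes)}


# Hypothetical profiles; "ref" stands in for the example gene.
profiles = {
    "ref": [1.0, 1.2, 2.0, 6.0, 6.5],
    "g1": [0.5, 0.7, 2.2, 7.5, 3.0],
    "g2": [1.0, 1.0, 2.1, 3.0, 3.0],
    "g3": [4.0, 4.0, 4.0, 4.0, 4.0],
}
# Boxes at 10 and 12 hours only; the 12 hour box has been moved up and enlarged.
query = [Timebox(10, 10, 1.5, 2.5), Timebox(12, 12, 5.0, 9.0)]
result = run_query(profiles, query)
```

Removing the boxes for the 6, 8, and 14 hour samples corresponds to leaving those time points unconstrained, and moving the 12 hour box upward corresponds to raising v_min and v_max.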
The leaders and laggards facility was then used to identify genes that have this
same increase in expression level at a later time point - specifically, between 12 and
14 hours. This leads to a set of under 100 laggards - genes that might be regulated by
E93. This set was also saved as being of interest.
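One plausible reading of this step is that the same timeboxes are shifted later in time and the query is run again. The sketch below follows that reading; it is an assumption about the behavior of the leaders and laggards facility rather than a description of its implementation, and all names and values are hypothetical.

```python
from typing import NamedTuple


class Timebox(NamedTuple):
    t_start: float
    t_end: float
    v_min: float
    v_max: float


def shift(box, dt):
    # Move a timebox dt hours later (or earlier, if dt is negative).
    return box._replace(t_start=box.t_start + dt, t_end=box.t_end + dt)


def matches(series, boxes, times):
    # An item matches if every sample covered by a box lies in that box's range.
    return all(box.v_min <= v <= box.v_max
               for box in boxes
               for t, v in zip(times, series)
               if box.t_start <= t <= box.t_end)


def laggards(data, boxes, dt, times):
    # Items that show the queried pattern dt hours after the original query.
    shifted = [shift(b, dt) for b in boxes]
    return {name for name, series in data.items()
            if matches(series, shifted, times)}


times = (6, 8, 10, 12, 14)
query = [Timebox(10, 10, 1.5, 2.5), Timebox(12, 12, 5.0, 9.0)]
profiles = {
    "ref": [1.0, 1.2, 2.0, 6.0, 6.5],      # rises between 10 and 12 hours
    "g_late": [1.0, 1.0, 1.0, 2.0, 6.5],   # same rise between 12 and 14 hours
}
late = laggards(profiles, query, dt=2, times=times)
```

Using a negative dt would instead identify leaders, items showing the same pattern earlier in the series.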
Alternative approaches included shifting the paired timeboxes to look at genes with
increases in transcription between 8 and 10 hours, as genes that have earlier increases
in expression level might influence the expression of E93. Decreases in expression
level, which are also potentially interesting, can be examined by recreating the mod-
ified query-by-example described above (Figure 8.3), and then using the query inversion
facility (Section 4.4) to find genes that have a decrease of similar magnitude at the same
time point (Figure 8.4). This query can then be used as the basis for a leaders and lag-
gards query that would identify genes with similar decreases later in the time series.
[Footnote 1] In this data set, E93 is known by the alternative name EIP93f.
Figure 8.3: TimeSearcher query display identifying genes that are roughly similar to
E93 at 10 and 12 hours. This query contains two timeboxes, based on the values
of E93 at 10 and 12 hours. The 12 hour timebox has been shifted up, to eliminate
smaller increases in expression levels. This timebox has also been increased in height,
in order to include some very sharp increases in expression level that might not have
been included in the original timebox.
Figure 8.4: TimeSearcher query identifying genes that decrease significantly between
10 and 12 hours, when E93 is increasing.
It is important to note that these queries are useful primarily as generators of
hypotheses. Although suggestive, the temporal relationships identified with TimeSearcher
are not sufficient to establish any direct linkages between the genes involved
in these queries. However, these results may be useful for identifying genes (or sets
of genes) that merit further experimental analysis, which might identify regulatory
relationships.
Observations
Researchers in the Baehrecke lab were extremely enthusiastic about the use of Time-
Searcher for analysis of their microarray data:
TimeSearcher gives us the ability to see a large amount of time series data
and rapidly query for patterns based on known mechanism. This makes it
a valuable tool for the generation of new hypotheses. We haven’t found
any other software that gives us similar capabilities to observe overviews
of temporal data, and query for characteristics based on the knowledge of
a biological system [13].
The use of TimeSearcher for the analysis of the cell-death data played an important
role in the identification of novel results [35].
These analysis sessions also provided some insight into various facets of one pos-
sible use of the TimeSearcher tool. One of the most striking observations involved the
interpretation of query results. The item list on the right-hand side of the TimeSearcher
window was seen as being much more important than the graphical displays of each
of the items in the lower-left-hand window. As this data set contains many genes that
the biologist using the tool knew by name, the item list provides a concise display that
can be easily scanned for familiar names.
To some extent, this behavior might be a result of the type of analysis being done
with this data. Specifically, this analysis was conducted by a biologist who brought a
significant amount of understanding and focus to the task. Out of the 3225 genes in the
original data set, the user had preconceived notions of which genes - roughly 100 in
number - were potentially interesting. Thus, the item list was a powerful tool for
identifying which of those genes were present in any given result set. Of course, the
user’s prior knowledge may bias him or her away from potentially interesting genes
that fall outside of those existing notions of what might be interesting, but this problem
is likely to exist with any visualization tool.
Most of the query modifications made by this user were made directly with the
mouse. There was relatively little use of the keyboard and the range sliders, although
arrow keys were used to change the time periods covered by the boxes.
Finally, the 100-item threshold for displaying individual graph overview lines (as
opposed to only the query envelope (Section 4.1)), was seen as being too low, suggest-
ing that the default might be raised to a higher value.
These sessions involved one individual user focusing on a single data set. As a
result, these comments may not generalize to others. Further observation of a wider
range of users would be necessary before any generally applicable conclusions could
be drawn.
Contributions and Design Suggestions
The Baehrecke lab’s participation in the development of TimeSearcher involved nu-
merous design discussions, many of which occurred before the data set described
above was collected. Together with the analysis sessions described above, these discus-
sions generated several ideas for TimeSearcher functionality. Some of these features
have been implemented, while others present interesting possibilities for future work.
Currently-implemented features that resulted at least partially from these discus-
sions include leaders & laggards (Section 4.2), support for multiple time-varying at-
tributes (Section 4.3), and query inversion (Section 4.4). Leaders & laggards was sug-
gested early in the discussions as potentially useful for identifying regulatory relation-
ships, as described above. Support for multiple time-varying attributes was proposed
as useful for simultaneous display and querying of data collected under two different
conditions: naturally occurring (“wild-type”) flies and mutated flies. Query inversion
was proposed as a tool for identifying transitions that were contrary to previously iden-
tified trends of interest. All of these features were deemed to be of sufficiently general
interest to merit inclusion in TimeSearcher.
One issue that arose regarding query expressiveness involves adding constraints
to existing queries that required that all items that matched must have values that are
non-decreasing (or non-increasing) during some specified interval. This suggestion
arose after construction of a query that had two adjacent boxes with some vertical
overlap. Although the boxes suggested a general rise in value, the overlap allowed
some items that actually decreased in value to be included in the result set (Figure 8.5).
Facilities for requiring values to be non-decreasing or non-increasing could be used
to eliminate such values, without requiring restatement of the relative constraints of
the two query boxes. These facilities would be similar to the interval trending query
facilities discussed in Section 9.1.5. This observation was part of the motivation for
the eventual implementation of angular queries (Section 4.7).
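The overlap problem and the proposed constraint can be illustrated with a minimal
sketch. It assumes that each profile is a plain list of values indexed by time point;
the function names in_timebox and non_decreasing and the sample profiles are
hypothetical illustrations, not part of TimeSearcher.

```python
def in_timebox(series, t_start, t_end, lo, hi):
    """True if every value at time indices t_start..t_end lies in [lo, hi]."""
    return all(lo <= v <= hi for v in series[t_start:t_end + 1])


def non_decreasing(series, t_start, t_end):
    """True if values never drop between consecutive points in t_start..t_end."""
    window = series[t_start:t_end + 1]
    return all(a <= b for a, b in zip(window, window[1:]))


# Two adjacent timeboxes (t_start, t_end, lo, hi) with some vertical overlap.
boxes = [(3, 3, 1.0, 3.0), (4, 4, 2.0, 4.0)]

rising = [0.0, 0.1, 0.2, 1.5, 3.5]
falling = [0.0, 0.1, 0.2, 2.8, 2.2]  # inside both boxes, yet decreasing


def matches(series):
    """Conjunctive timebox query: the series must satisfy every box."""
    return all(in_timebox(series, *box) for box in boxes)


# Both profiles satisfy the boxes; only the extra constraint separates them.
both_match = matches(rising) and matches(falling)
kept = [s for s in (rising, falling) if matches(s) and non_decreasing(s, 3, 4)]
```

In this sketch the trend constraint is a separate predicate combined with the timebox
query, so the boxes themselves do not need to be restated.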
These analysis sessions also provided motivation for the inclusion of support for
multiple time-varying attributes, including extensions that go beyond the facilities pro-
vided in TimeSearcher. The current implementation supports multiple tabs in a tabbed
Figure 8.5: A query illustrating the need for additional constraints requiring non-
increasing (or non-decreasing) values over a specified interval. Although the general
trend of the two timeboxes is upwards, the highlighted item actually has a decrease in
value between 10 and 12 hours. Additional constraints requiring non-decreasing items
would remove this item from the result set.
window, each displaying a different attribute (Section 4.3). For comparisons between
pairs of attributes, an alternative presentation might involve displaying the differences
between values at each point in time. Although this can easily be achieved through
appropriate pre-processing of the data set, integrated features for automatically con-
structing this comparison would be easier to use.
These analysis sessions also identified a potentially useful feature: support for using
a query to remove items from further consideration. Like
many other microarray experiments, the programmed cell death data contained profiles
from numerous genes that are necessary for other phenomena that might not be of
interest to the current inquiry. In some cases, it might be helpful for the user to flag
these items as being uninteresting, thus removing them from further consideration.
One approach to supporting this functionality might be to provide an additional control
- perhaps through a “trash can” icon on a button - that would identify items that match
the currently active query and remove them from further consideration. In essence,
such a query would support a form of negation, finding all items that fail to match the
query.
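A minimal sketch of this kind of removal follows, assuming the data set is a mapping
from item names to lists of values and that a query is a list of
(t_start, t_end, lo, hi) timeboxes. The names matches_query and remove_matching and
the sample items are hypothetical.

```python
def matches_query(series, boxes):
    """True if the series stays inside every (t_start, t_end, lo, hi) timebox."""
    return all(
        lo <= v <= hi
        for (t_start, t_end, lo, hi) in boxes
        for v in series[t_start:t_end + 1]
    )


def remove_matching(data, boxes):
    """Negated query: keep only the items that fail to match the boxes."""
    return {name: s for name, s in data.items() if not matches_query(s, boxes)}


data = {
    "uninteresting": [1.0, 1.1, 0.9],
    "responder": [1.0, 3.0, 5.0],
}
# One box describing the profiles to discard: near 1.0 at all three time points.
discard_query = [(0, 2, 0.5, 1.5)]
remaining = remove_matching(data, discard_query)
```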
Although TimeSearcher currently provides users with the ability to save result sets
and to save and reuse specific queries, these features are somewhat limited. The anal-
ysis sessions with the Drosophila programmed cell death data set provided a clear il-
lustration of the need for tools that provide greater support for working with the result
set of a given query. For example, annotation support might be provided to somehow
mark genes that are of interest, perhaps because they have been seen elsewhere.
Further extensions along these lines involve customized displays of the item list
and result set, based on additional metadata regarding the items in the data set. For
example, the Gene Ontology (GO) is a vocabulary that provides hierarchical group-
ings of genes according to various functional and structural criteria [131]. Similarly,
FlyBase (http://www.flybase.org) is a database containing detailed information about
the annotated Drosophila genome [130]. GO and FlyBase both contain rich metadata
that might be used to augment TimeSearcher displays. For example, categorical data
such as grouping in the GO, or the existence of known comparable genes (homologs)
in other organisms, might be indicated via color coding of items in the result list, or
via special glyphs on displays of the individual graphs.
Output from other forms of analysis might be used to customize the display of
individual items. As described above, microarray analyses often include mathematical
clustering of similar genes. For results involving a small number of clusters, each
item in a given cluster might be displayed in the same color, with saturation of each
item’s display indicating the distance between that item and the center of the cluster
that contains it.
Other possible improvements to display of query results might involve customiz-
able ordering of items as displayed in the result list, or in the display of individual
graphs. Currently, these displays are provided in a set order based on the order in
which they are found in the data set. An obvious extension would be to support sorting
of the item list based on some alphabetic or lexicographic criteria, but other, potentially
more interesting approaches are possible. For example, a “maximum differential” or-
dering might sort the items in a data set in decreasing order of the differential between
their highest value and their lowest value. This would place items with the greatest
change at the top of the list, and the items that changed the least (and are therefore
possibly less interesting) at the bottom of the list.
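The “maximum differential” ordering can be sketched in a few lines, assuming the data
set is a mapping from item names to lists of values. The function name
max_differential_order and the sample profiles are hypothetical.

```python
def max_differential_order(data):
    """Return item names sorted by (highest - lowest) value, largest first."""
    return sorted(
        data,
        key=lambda name: max(data[name]) - min(data[name]),
        reverse=True,
    )


profiles = {
    "flat": [1.0, 1.1, 1.0, 0.9],   # differential 0.2
    "spike": [0.2, 0.3, 4.0, 0.5],  # differential 3.8
    "drift": [1.0, 1.5, 2.0, 2.5],  # differential 1.5
}
ordered = max_differential_order(profiles)
```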
Revised displays for leaders & laggards queries were also suggested. For example,
the current color-coding of the item list might be replaced with a two column display,
with leaders in one column and laggards in the other.
Finally, analysis of the cell death data set highlighted the potential utility of in-
tegrating TimeSearcher with other visualization tools. As coordination of multiple
visualizations has been shown to decrease task performance time and increase user
satisfaction [95], the use of TimeSearcher in conjunction with other visualization tools
might improve comprehension and utility of the visualizations. For example, an on-
going effort with the Baehrecke lab has investigated the possibility of displaying Gene
Ontology information in a hierarchical treemap display [10, 116].
A coordinated visualization might use a treemap display to highlight the genes that
matched a particular query. This would provide an immediate graphical perspective on
the similarities between genes in the result set: the presence of multiple similar genes
would lead to a tight cluster of highlighted genes. However, if the result set of a given
query contained highly dissimilar genes, the highlights would be scattered throughout
the treemap.
On a similar note, TimeSearcher might be equipped with hooks to appropriate web
sites. For example, each gene in the data set might be tied to FlyBase, allowing the
user to retrieve a complete FlyBase entry by simply clicking on the name of a gene in
the result set.
Although the details of some of these suggestions may be specific to analysis of
microarray data, the general ideas of increased flexibility in query expression; display
and manipulation of results; and integration with other tools are easily generalized to
other application domains.
8.1.2 Viral Life Cycle in Epithelial Cells
Karen Duca’s lab at the Virginia Bioinformatics Institute is in the preliminary stages
of using TimeSearcher to analyze data from experiments involving genetic responses to
influenza virus in epithelial cells. Feedback from this work has been provided in the form
of responses to a questionnaire, rather than from direct observation of an analysis
session, as with the Programmed Cell Death work with the Baehrecke lab.
In analyzing the influenza data sets, the Duca lab has found TimeSearcher partic-
ularly useful for seeing an overview of the data set, particularly for identification of
genes that have expression profiles that are atypical.
After normalization, most genes are unchanged across the whole time
course. TimeSearcher allows one to very quickly pick out what deviates
from “unchanged” behavior. We had already found our markers, but it
took weeks (even months to do it effectively) with histograms and Clus-
ter. With TimeSearcher, had it been available then, we would have had
the information in one day. The changes are quite subtle, really, but Time-
Searcher gets you there faster than k-means clustering, which was our best
technique up to then [42].
Ongoing collaboration with the Duca lab has focused on domain-specific exten-
sions to TimeSearcher that would increase its utility for microarray analysis. Specifi-
cally, microarray analyses often use statistical comparisons between and among items
in the data set. Similarly, microarray data sets often contain controls that are used for
data validation and are not interesting for analytic purposes. Analyses of these data
sets would be simplified if TimeSearcher could be configured to ignore appropriately-
labeled controls.
Alternative approaches to handling control items might be of general interest in
other domains. As discussed above, analysis sessions with members of the Baehrecke
lab led to the suggestion of support for the ability to remove items that matched a query.
This functionality could certainly be used to handle control data: a set of timeboxes
describing the control - perhaps created by a drag-and-drop query - would define those
items that should be removed from further consideration.
Another possible enhancement is motivated by the observation that control items
are likely to be those that have little change throughout the time series. Sliders that
filtered items based on relative change levels could be used to eliminate these items
from the data set. These sliders would allow users to specify the minimum (or maximum)
percentage change over the course of the time series that must be met for inclusion in
the result set. These differences might be specified in terms of z-score deviations from
the mean. Thus, this slider could be used to require that items must have changes of at
least one-half and less than three standard deviations from the mean in order to be
retained in the data set.
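One possible reading of this filter is sketched below. It assumes that deviations are
measured against the mean and standard deviation of all values in the data set, and
that an item's change is its largest absolute z-score. Both choices are assumptions
made for illustration, as are the function name filter_by_deviation and the sample
items.

```python
from statistics import mean, pstdev


def filter_by_deviation(data, min_z=0.5, max_z=3.0):
    """Keep items whose largest |z-score| falls in [min_z, max_z).

    z-scores use the mean and population standard deviation of every value in
    the data set (an assumption; per-item statistics are another option).
    """
    values = [v for series in data.values() for v in series]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {}  # nothing varies, so no item meets a minimum change
    kept = {}
    for name, series in data.items():
        peak = max(abs(v - mu) / sigma for v in series)
        if min_z <= peak < max_z:
            kept[name] = series
    return kept


data = {
    "control": [0.0, 0.0, 0.0, 0.0],  # little change: should be dropped
    "changer": [0.0, 0.0, 2.0, 0.0],  # clear deviation: should be kept
}
retained = filter_by_deviation(data)
```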
Handling of microarray data points that involve experimental error is another area
of interest. Microarray data sets are often very noisy and multiple repetitions of a
given experiment are usually conducted (and then averaged) to generate a reliable data
set. Despite these repetitions, these data sets are plagued by experimental errors and
missing values.
In some cases, values that exceed certain thresholds might indicate experimental
error. A variety of approaches to handling such errors might be implemented in
TimeSearcher. Items that exceeded these thresholds (at any time point) might be
eliminated via filters controlled by a double-thumb range slider. Of course, this slider
would be equivalent to a timebox that spanned all of the time points in the data set,
but it would present less visual clutter and would perhaps be easier to use. Alterna-
tively, TimeSearcher might use color-coding to indicate values with lower reliability,
or provide an interpolated value in place of the value that indicates experimental er-
ror. Handling of missing values in microarray data sets is an active area of ongoing
research: techniques involving the use of cubic splines to interpolate values [14], or
dynamic programming to compare microarray time series despite missing values [1]
might be of use in this context. Alternatively, TimeSearcher might be augmented with
displays and query semantics that appropriately account for missing values.
Support for multiple time-varying attributes is also seen as important for analysis
of the microarray data sets in Duca’s lab. Proteomics experiments that parallel the
microarray experiments will provide data regarding the proteins present in the cell.
Support for simultaneous exploration of both microarray and proteomics experiments
is seen as potentially very useful, particularly in conjunction with leaders and laggards
functionality.
Discussions with Duca’s lab also led to the particularly intriguing challenge of
extending leaders and laggards functionality to help users explore the possibility of
multiple items acting as leaders that influence others. For example, a microarray data
set might contain several genes - A, B, C, D, and E - all of which regulate the
transcription of gene G. Existing leaders and laggards displays might be used to see these
influences individually - for example, to identify A as a factor that might influence
G. However, this raises the possibility of misinterpretation, as users might stop at that
point and not identify B, C, D, and E as being relevant.
Finally, several features commonly found in production software were suggested
as being potentially useful. For example, cut and paste facilities for exporting query
results directly to other analysis tools were seen as being desirable.
8.2 Nucleotide Sequence Data
The interpretation of timeboxes and TimeSearcher as tools for querying time series
data sets places an unnecessary limitation on the applicability of these ideas. In fact,
there is nothing in the timebox model, or in the TimeSearcher application, that is
restricted specifically to time series data. The only requirement is that data sets involve
measurements taken at discrete intervals along some linearly-ordered dimension. The
original motivation for this work involved the use of time as the dimension in question,
but others are possible.
For example, data sets containing real measurements at discrete physical positions
along a one-dimensional line are appropriate for TimeSearcher. In this case, each
physical position on the line corresponds to a “time” point in the TimeSearcher display.
Queries could be used to identify items that had values in a given range during certain
intervals on this line.
Nucleotide sequences provide a particularly interesting example of the application
of TimeSearcher to linear dimensions other than time. Specifically, short sequences
of nucleotides (A, G, C, and T) can be aligned and statistics regarding the frequencies
at which different patterns appear in different positions in these sequences can
be calculated. TimeSearcher can then be used to find patterns that have desired fre-
quency profiles - perhaps occurring frequently in some areas and infrequently in oth-
ers. These patterns might help identify DNA subsequences that influence the process
of converting DNA into RNA and then into protein. The use of TimeSearcher for this
purpose complements existing statistical approaches towards identification of these
subsequences [48, 92, 97, 137].
8.2.1 Branch Site Consensus Splicing Signal in Arabidopsis thaliana
The creation of protein from DNA is essentially a three-step process. During tran-
scription, the sequence of a strand of DNA is copied into a complementary strand of
pre-mRNA. During the second phase - splicing - regions that are not converted directly
into protein (the introns) are removed from the pre-mRNA, leaving only the exons -
the regions that will be converted into protein. The output of this process is a strand
of mRNA, which is exported from the nucleus and translated into protein, during the
third step - translation (Figure 8.6).
Splicing involves the removal of the introns from a strand of pre-mRNA. The
boundaries where this splicing occurs are known as splice sites. As an intermedi-
ate step in the splicing process, a portion of the intron is looped around itself, forming
a “lariat” structure. This looping occurs at the branch site - a location that is roughly
30 nucleotides from one end of the intron (Figure 8.7).
[Figure 8.6 diagram. Transcription: DNA to pre-mRNA (alternating exons and introns). Splicing: pre-mRNA to mRNA (exons only). Translation: export from nucleus and creation of protein.]
Figure 8.6: The three main stages in the creation of protein from DNA. During tran-
scription, the strand of DNA is copied. During splicing, the introns are removed, leav-
ing only the exons. The output of splicing is a strand of mRNA. During translation,
the mRNA is exported from the nucleus and used to create a protein.
Characterization of these sites in the splicing process is an important step in in-
terpreting the contents of the genome. Reliable and consistent identification of splice
sites and branch sites can be useful for determining the function of a given sequence
of DNA. Specifically, if a sequence has a splice site on either end and a branch site in
the appropriate position in-between, it is likely an intron.
Identification of splice sites is straightforward, as the sequences found at splice
[Figure 8.7 diagram: an intron between two exons, with the splice sites at the exon-intron boundaries and the branch site within the intron.]
Figure 8.7: Splice sites and branch sites.
sites are well-defined and generally invariant across organisms. The sequences surrounding
branch sites are more variable [140]. For organisms containing variations
in branch site sequences, consensus patterns describing the range of possibilities have
been developed.
Stephen Mount of the University of Maryland Department of Cell Biology and
Molecular Biology has been using TimeSearcher to identify consensus branch site
splicing signals in the plant Arabidopsis thaliana. The data set being used for this
purpose was generated from the genomic sequences surrounding 8550 internal exons
that were internally truncated and aligned with respect to their boundaries [109]. This
data set contains the normalized frequencies of each of the 1024 possible pentamers
- sequences of five nucleotides - at each of 192 possible positions.
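The shape of such a data set can be illustrated with a small sketch that tabulates
pentamer frequencies by position in a set of aligned sequences. It assumes that all
sequences have equal length and uses the simple fraction of sequences as the
frequency; the normalization applied to the actual data set is not specified here.
The function name pentamer_profiles and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

K = 5
# All 4**5 = 1024 possible pentamers over the nucleotide alphabet.
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def pentamer_profiles(aligned):
    """Map each pentamer to its frequency at every start position.

    `aligned` is a list of equal-length sequences. The frequency at a position
    is the fraction of sequences containing that pentamer starting there.
    """
    n_positions = len(aligned[0]) - K + 1
    counts = [
        Counter(seq[i:i + K] for seq in aligned) for i in range(n_positions)
    ]
    return {
        p: [counts[i][p] / len(aligned) for i in range(n_positions)]
        for p in PENTAMERS
    }


toy = ["ACGTACG", "ACGTTCG", "TCGTACG"]
profiles = pentamer_profiles(toy)
```

Each entry in profiles is a linear sequence of values indexed by position, which is the
form of data that the timebox model requires.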
The occurrence frequencies in this data set can be used to create queries that search
for items that match known characteristics of the sequences around branch points.
Figure 8.8: Data envelope overview of pentamer frequency distributions in Arabidop-
sis thaliana.
Specifically, branch points are found approximately 25-30 positions upstream of (before)
the end of an intron, and surrounded by sequences that are not commonly found in
exons. Thus, sequences that might include branch points can be found by searching
for pentamers that are frequently found 25-30 positions before the end of an intron and
infrequently found elsewhere in the intron. If we consider the position in the aligned
sequences as the linearly-ordered dimension of interest, these queries can easily be
created with TimeSearcher.
Figure 8.8 shows a data envelope overview of the whole data set. Two peaks, indi-
cating the boundaries between the exon in the middle and the introns on the ends, are
immediately apparent. These peaks represent well-known conservation of sequences
at splice sites.
To identify candidate splicing signals, a query using two timeboxes is used. One
Figure 8.9: Timebox query aimed at finding pentamers with higher frequencies at a
specific region within introns (the branch site) and lower frequencies elsewhere within
introns.
component of the query will identify those pentamers that are frequently found before
the exon-intron boundary. The second identifies pentamers that are infrequently found
elsewhere within the intron (Figure 8.9). Taken together, these criteria identify candidate
branch site consensus sequences [125].
These queries can be used to identify candidate branch points. Taken together with
domain knowledge of the expert user, and perhaps in combination with candidates
generated through statistical or algorithmic approaches [48], these results can be used
to extend known consensus sequences. In this case, the analysis was used to extend
the previously-identified consensus branch point sequence CTRAY (where “R” can be
either “A” or “G” and “Y” can be either “C” or “T”) [125], to WYTRAY (W= A or T,
Y=C or T, R=A or G).
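The ambiguity codes in these consensus sequences can be made concrete with a short
sketch that expands a consensus into a regular expression. The use of regular
expressions is an illustrative choice rather than the method used in this analysis,
and the function name consensus_to_regex and the test sequence are hypothetical.

```python
import re

# Ambiguity codes as defined in the text: R = A or G, Y = C or T, W = A or T.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]",
}


def consensus_to_regex(consensus):
    """Compile a consensus such as 'WYTRAY' into a regular expression."""
    return re.compile("".join(CODES[symbol] for symbol in consensus))


previous = consensus_to_regex("CTRAY")
extended = consensus_to_regex("WYTRAY")

# Start positions of the extended consensus in a made-up sequence.
hits = [m.start() for m in extended.finditer("GGACTAACGG")]
```

Under these definitions WYTRAY describes 2 x 2 x 1 x 2 x 1 x 2 = 16 distinct hexamers.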
8.2.2 Observations
The possibility of using TimeSearcher to “play around” with the data set was seen
as being extremely useful for this analysis. The interactivity supported exploration and
identification of patterns of interest that would not be possible with algorithmic
approaches or prior practice - viewing data in a spreadsheet. Furthermore, overview
displays were useful for increasing confidence in the data set, as the data overviews
were consistent with the expected distribution of the profiles. In general, TimeSearcher
was found to be superior to existing data exploration tools:
I have been looking at sequence data of this sort for over 20 years and
find TimeSearcher to be the best data exploration tool I’ve encountered.
What I like about it is the ability to rapidly change your query and see
the results in order to converge quickly upon a query that is appropriately
selective [91].
Much of the investigation involved in identifying the candidate splice site se-
quences involved searching for items that were in differing normalized frequency
ranges at different times. Using the normalized view of the data, these searches tried to
identify items that were, for example, at least one standard deviation above the norm
during the exon.
Unlike the analysis of the Drosophila programmed cell death data sets, the inves-
tigation of this data set involved frequent use of the arrow keys, range sliders, and
editable text labels for value ranges to modify queries. As a result, there was relatively
little direct modification of timeboxes. This difference might be the result of some
inherent characteristics of the data sets, or it may just be an indication of the working
styles of the individuals involved. Further examination would be needed to understand
the factors influencing the users’ choice of interaction style.
8.2.3 Contributions and Design Suggestions
Features currently implemented in TimeSearcher that were suggested during the anal-
ysis of the Arabidopsis sequence data set include the item list window and the editable
labels for the range sliders. Additional suggestions for functionality that has not yet
been implemented focused on query specification and result set display and manipula-
tion.
Many of the queries created during analysis of the sequence data involved two (or
more) timeboxes covering value ranges that were contiguous: for example, one
timebox might contain items that ranged from one deviation below the mean to the mean,
while another might contain items ranging from the mean to one deviation above the mean.
TimeSearcher does not currently provide direct support for this sort of query. The text
entry fields associated
with the range sliders might be used to set the maximum value of one timebox to be
equal to the minimum value of another (or vice-versa), but this does not create an in-
variant that will be maintained as the boxes are modified. A tool that would somehow
link multiple boxes and constrain the relationship between their values might help sim-
plify the manipulation of this type of query. This suggestion was independently made
during one of the empirical studies of timebox queries (Section C.3).
The handling of query results arose several times during this analysis as an area in
need of improvement. The user noted that it was often difficult to tell how a result
set changed after the modification of a query. This led to the suggestion that the item
list and display of individual items be augmented with additional coding that would
help the user understand the impact of recent changes. For example, items that have
recently been added to the result set might be color coded with a bright color, while
items that have been removed might still be displayed, but with a “grayed-out” color
coding, to indicate their removal from the display. Varying shades of color and gray
might be used to provide a finer-grain display. Alternatively, separate lists of recently-
added and removed items might display this information.
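The bookkeeping behind such a display amounts to comparing the result sets before and
after a query modification. A minimal sketch, assuming result sets are collections of
item names; the function name result_changes and the item names are hypothetical.

```python
def result_changes(previous, current):
    """Classify items by how a query modification affected them."""
    previous, current = set(previous), set(current)
    return {
        "added": current - previous,    # candidates for a bright color
        "removed": previous - current,  # candidates for a grayed-out display
        "kept": previous & current,     # unchanged items
    }


before = ["AAGTA", "CTAAC", "TTGAC"]
after = ["CTAAC", "TTGAC", "GGTAA"]
changes = result_changes(before, after)
```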
Tools for manipulating result sets were also seen as being useful. For example,
the items in a result set might somehow be considered a cluster, for which aggregate
statistics might be calculated. For the nucleotide dataset, tools for identifying common
features of the items in a result set might be particularly useful. For example, such
a tool might tell the user that all of the items in the current result set have the same
nucleotide in the third position, while 80% have one of two nucleotides in the fourth
position. This might help simplify the identification of candidate sequences. The
aggregation of the items might also be added into the data set and considered a target
for querying along with existing individual items.
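A sketch of the kind of summary described above follows, assuming the items in the
result set are equal-length nucleotide strings. The function name position_summary
and the sample result set are hypothetical.

```python
from collections import Counter


def position_summary(items):
    """For each position, list (nucleotide, fraction) pairs, most common first."""
    n = len(items)
    summary = []
    for column in zip(*items):  # one column per sequence position
        counts = Counter(column).most_common()
        summary.append([(base, count / n) for base, count in counts])
    return summary


result_set = ["ACTAA", "TCTGA", "ATTAC", "TCTAA", "ACTGA"]
summary = position_summary(result_set)

# Positions (1-based) at which every item shares the same nucleotide.
invariant = [i + 1 for i, col in enumerate(summary) if len(col) == 1]
```

In this made-up result set, every item has a T in the third position, while the fourth
position is split between A and G.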
Other suggestions regarding manipulation of result sets addressed the comparison
of results from multiple queries and the synthesis of new understanding from these
comparisons. Tools for displaying the intersection (or difference) between multiple
result sets were suggested as potentially useful, as was improved support for saving
results and queries.
In some cases, the characteristics of the result sets that were most interesting (and
therefore most appropriate for saving) did not involve the actual items in the result set,
but simply the size of the result set. For example, one line of exploration involved
creation of a query that was 20 nucleotides wide. The user moved this query across
the data set, identifying the number of items that matched the query at each position,
and plotting a histogram by hand. Automated facilities for creating such plots within
TimeSearcher might be an interesting possibility for future development.
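The manual procedure described above could be automated along the following lines.
This sketch assumes the data set is a mapping from item names to lists of values and
that the query is a single timebox of fixed width and value range moved across every
position; the function names and sample data are hypothetical.

```python
def count_matches(data, start, width, lo, hi):
    """Number of items whose values stay in [lo, hi] over the window."""
    return sum(
        all(lo <= v <= hi for v in series[start:start + width])
        for series in data.values()
    )


def sliding_counts(data, width, lo, hi):
    """Result-set size for the timebox placed at each possible start position."""
    length = min(len(series) for series in data.values())
    return [
        count_matches(data, start, width, lo, hi)
        for start in range(length - width + 1)
    ]


data = {
    "a": [0, 1, 1, 0, 0],
    "b": [1, 1, 0, 0, 1],
    "c": [0, 1, 1, 1, 1],
}
counts = sliding_counts(data, width=2, lo=1, hi=2)

# A text histogram of result-set size by window position.
histogram = "\n".join(f"{pos:3d} {'#' * n}" for pos, n in enumerate(counts))
```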
8.3 Other Applications
Although these applications to biological data represent the most extensive uses of
TimeSearcher to date, interest from researchers working in other domains appears to
validate TimeSearcher and the timebox query model as being applicable to a range of
disciplines and data types. As of February 2003, users in fields including hydrology,
climatology, and finance have expressed interest in using TimeSearcher to analyze
their data sets.
TimeSearcher executables were made available for public download in October
2002. Between October 2002 and early February 2003, more than 150 unique users
downloaded TimeSearcher.
8.4 Conclusions
The ongoing, collaborative work with these users has proven to be valuable for the
design and evolution of both the theoretical and concrete aspects of this work. In using
TimeSearcher to address meaningful tasks, the users have demonstrated the efficacy of
the tool, and therefore of the underlying query model. These case studies have led to
several suggestions for useful functionality, including:
• Leaders & Laggards
• Support for multiple time-varying attributes
• Query inversion
• Queries for non-decreasing/non-increasing trends
• Removal of items that match a query
• Enhanced facilities for saving and manipulating result sets
• Customization of item list displays
• Linkages with external data sources
• Integration with other tools
• Vertical alignment of timeboxes
These case studies also proved useful in identifying areas of inquiry that might
have been less helpful or possibly distracting. Several observers not involved in these
case studies made various suggestions for features that they would have liked to have
seen included in TimeSearcher. Although many of these suggestions were intriguing,
development efforts were explicitly focused on user needs. As a result, suggestions
from the users involved in the case study were given high priority whenever possible.
This had the dual advantage of focusing efforts on features that were truly needed
while providing participants with the incentive to continue, in the form of evidence
that their concerns were being taken seriously.
These successful case studies would not have been possible without the participa-
tion of researchers who saw themselves as partners in the development and evolution
of TimeSearcher, rather than as mere users or customers. Regular meetings and open
feedback - in the case of the Baehrecke lab, periodic meetings over the course of more
than two years - were critical for the success of this effort. This work had many of the
elements of participatory design, even if formal methods associated with that approach
were not used.
The participation of the case study participants as research partners also meant that
they were willing to accept the shortcomings of a research prototype. For example,
they were generally willing to accept the explanation that a proposed feature was not
particularly interesting from a research viewpoint, even if it would have been useful to
them. Their understanding of the need to focus on the research issues was invaluable.
The observation sessions described above were particularly useful for building
understandings of the research questions that TimeSearcher was being used to address.
The ongoing conversations conducted during these sessions were invaluable
for understanding the needs of the users, and for generating proposed designs aimed
at meeting those needs. Further sessions involving more in-depth analysis might have
provided additional insight. Complete immersion in the research efforts that moti-
vated the case studies - perhaps in the form of spending several months working in the
Baehrecke or Mount labs - might have proven useful as well.
This model of developing information visualization applications through close col-
laboration with motivated users is potentially generalizable to other efforts. A small,
committed set of users who understand the difference between their research needs
and the needs of the project will be necessary for this approach to succeed.
Chapter 9
Query Expressiveness
The basic timebox model supports a limited set of queries: all values of interest (start
time, end time, min value, and maximum value) must be specified exactly. Many
interesting queries require additional expressive power. The data mining literature
contains numerous examples of queries for patterns in time series that are independent
of exact time or values, scale, or other factors [4, 5, 19, 29, 49, 69, 76, 75, 78, 98, 104,
108, 150].
TimeSearcher contains some extensions to the basic query model. Disjunctive
queries (Section 3.1), Leaders & Laggards queries (Section 4.2), variable-time time-
boxes (Section 3.2), angular queries (Section 4.7) and query inversion (Section 4.4)
use a combination of interaction techniques and alternative query semantics to extend
the range of queries that can be expressed.
This chapter provides some examples of additional query possibilities, along with
a categorization and sketch of a formal model for extended query semantics.
Further gains in expressiveness might be achieved by extending the types of data
involved. Extensions aimed at adapting timeboxes to handle queries on categorical
data and more general temporal data could support a wide range of new tasks. These
possibilities are briefly sketched in Section 10.3.
[Figure 9.1 diagram: example query types placed along a horizontal axis labeled
"Query Expressiveness" (Low to High), in regions labeled Intra-Item and Inter-Item.
Labeled types: Fixed Time/Interval (Q1), Variable Time/Value (Q2, Q3), Open-Ended
Time/Value (Q4), Relative Time/Value (Q5, Q6), Interval Trends (Q7), Maximal
Periods (Q8), Aggregate (Q9), Similarity (Q10), Global Constraint (Q11),
Inter-Item (Q12), Prevailing Trends (Q13), More General (Q14).]
Figure 9.1: A schematic layout of the different types of example queries. Queries are
expressed in approximate order of increasing precision, from left to right. Aggregate
queries are modifiers that apply to queries within the shaded box, and maximal period
queries are modifiers that might apply to those within the unshaded box. Queries below
the dashed line are based on the characteristics of individual
items in the data set, while those above the line involve comparisons between items.
9.1 Example Queries
A series of example queries will illustrate the range of query formulations that might be
supported by an interactive system. This list is not intended to be exhaustive: queries
involving greater expressive power will be discussed below.
In the examples below, we assume that S is a set of m stock prices S0, . . . ,Sm−1,
over a set of time points 1, . . . ,n, and Si(t) is the value of Si at time t. Queries are
expressed textually, with alternative presentations in a “pseudo-SQL” notation.
A preliminary schematic displaying a rough approximation of the relationships
between the types of queries is given in Figure 9.1.
9.1.1 Fixed-Time, Fixed-Value, and logical combinations thereof
Fixed-Time, fixed-value constraints involving a single set of times and values can be
expressed with a single timebox:
Query 1 Find stocks where the prices are between $10 and $20 during days 5-10
SELECT Si from S WHERE
$10 ≤ Si(t) ≤ $20
when 5 ≤ t ≤ 10;
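The semantics of Query 1 can be illustrated with a short Python sketch. This is not TimeSearcher's implementation: the function name `matches_timebox`, the list-based representation (where `series[t]` is the price on day t), and the price data are all assumptions made for illustration.

```python
from typing import Sequence


def matches_timebox(series: Sequence[float], t_start: int, t_end: int,
                    v_min: float, v_max: float) -> bool:
    """True if v_min <= series[t] <= v_max for every t in t_start..t_end (inclusive)."""
    return all(v_min <= series[t] <= v_max for t in range(t_start, t_end + 1))


# Made-up prices; index t holds the price on day t.
stocks = {
    "ABC": [12, 14, 15, 13, 12, 11, 15, 18, 19, 17, 16, 25],
    "XYZ": [12, 14, 15, 13, 12, 11, 15, 22, 19, 17, 16, 25],
}

# Query 1: prices between $10 and $20 during days 5-10.
result = [name for name, s in stocks.items() if matches_timebox(s, 5, 10, 10, 20)]
```

Here XYZ is rejected because its price on day 7 leaves the $10-$20 range, even though every other day in the interval satisfies the constraint.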
As described above, TimeSearcher can also be used to create complex queries con-
sisting of conjunctions of multiple queries of this sort. Disjunctions between values
might also be expressed if appropriate support for grouping (parenthesization) is pro-
vided.
9.1.2 Variable Time and/or Value
Basic timeboxes can be extended by allowing a window of variability in the allowable
times and/or values specified.
Query 2 Find an interval of 5 consecutive days during days 10-20 during which prices
ranged between $50 and $70.
SELECT Si from S WHERE
50 ≤ Si(t) ≤ 70
when ti ≤ t ≤ ti +4
AND
ti ≥ 10 and ti +4 ≤ 20
This class of query is currently implemented in TimeSearcher as a Variable Time
Timebox (Sections 3.2 and 4.6).
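One straightforward way to evaluate Query 2 is to try each admissible start time in turn. The sketch below assumes the same list representation as before (`series[t]` is the price on day t); the function name and data are hypothetical, and the brute-force scan is only one possible evaluation strategy, not necessarily the one used by TimeSearcher.

```python
def matches_variable_time(series, window, t_lo, t_hi, v_min, v_max):
    """Return the first start time ts with t_lo <= ts and ts + window - 1 <= t_hi
    such that v_min <= series[t] <= v_max for all t in ts..ts+window-1.
    Returns None if no such start time exists."""
    for ts in range(t_lo, t_hi - window + 2):
        if all(v_min <= series[t] <= v_max for t in range(ts, ts + window)):
            return ts
    return None


# Made-up data: 21 daily prices (days 0..20), in the $50-$70 band only on days 12-16.
series = [40] * 12 + [55, 60, 65, 58, 52] + [80] * 4
flat = [40] * 21

# Query 2: some 5 consecutive days within days 10-20 priced between $50 and $70.
start = matches_variable_time(series, 5, 10, 20, 50, 70)
no_match = matches_variable_time(flat, 5, 10, 20, 50, 70)
```

Returning the matching start time, rather than a simple yes/no answer, reflects the fact that a variable query is instantiated by each item that satisfies it.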
Query 3 Find stocks that stayed in some $10 range between $20 and $40 during days
5-10.
SELECT Si from S WHERE
vi ≤ Si(t)≤ vi +10
AND
vi ≥ 20 and vi +10 ≤ 40
when 5 ≤ t ≤ 10
Further flexibility might be accomplished by combining these two types of queries,
creating queries that have variability in both time and value.
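Query 3 does not require a search over candidate values of vi. If lo and hi are the minimum and maximum prices during days 5-10, a suitable vi exists exactly when max(20, hi − 10) ≤ min(30, lo). The following sketch applies that observation; the function name, parameter names, and data are illustrative assumptions.

```python
def matches_variable_value(series, t_start, t_end, band, v_lo, v_hi):
    """True if some v with v_lo <= v and v + band <= v_hi satisfies
    v <= series[t] <= v + band for all t in t_start..t_end (inclusive)."""
    window = series[t_start:t_end + 1]
    lo, hi = min(window), max(window)
    lowest_v = max(v_lo, hi - band)    # v must reach up to the window maximum
    highest_v = min(v_hi - band, lo)   # v must not exceed the window minimum
    return lowest_v <= highest_v


# Made-up prices; only days 5-10 matter for this query.
narrow = [0] * 5 + [25, 27, 30, 33, 31, 26]    # spread of $8, inside $20-$40
wide = [0] * 5 + [22, 27, 30, 35, 31, 26]      # spread of $13, too wide
too_high = [0] * 5 + [35, 36, 38, 41, 37, 36]  # narrow, but exceeds $40

answers = [matches_variable_value(s, 5, 10, 10, 20, 40)
           for s in (narrow, wide, too_high)]
```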
9.1.3 Open-Ended Time and/or Value
Less restrictive queries might require only an upper (or lower) bound on the time period
or value range desired:
Query 4 Find stocks where the price is greater than $50 for some period of time after
the 20th day
SELECT Si from S WHERE
Si(t)≥ 50
when t ≥ 20;
9.1.4 Relative Time/Value
These queries involve times and values that are specified relative to each other, rather
than in terms of any absolute values:
Query 5 Find stocks that traded within a $10 range for days 1-5 and then increased
by $20 above that range during days 10-15.
SELECT Si from S WHERE
v1 ≤ Si(t)≤ v1 +10
when 1 ≤ t ≤ 5
AND
v1 +30 ≤ Si(t) ≤ v1 +40
when 10 ≤ t ≤ 15;
Query 6 Find stocks that traded between $10 and $20 for some 10 day period, and then
traded between $30 and $40 for some 5 day period that starts at least 10 days later.
SELECT Si from S WHERE
$10 ≤ Si(t) ≤ $20
when ts ≤ t ≤ ts +10
AND
$30 ≤ Si(t) ≤ $40
when ts +20 ≤ t ≤ ts +25;
In query 5, we know the times, but the value ranges are specified relative to each
other. In query 6, the values are known, but the time periods are relative. Further
generalization of these queries would involve combination of relative times and relative
values in the same query.
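Because v1 is left free in query 5, evaluating the query amounts to asking whether any single v1 satisfies both clauses. Each clause bounds v1 from above and below, so the test reduces to checking that the two intervals of admissible v1 values overlap. The sketch below is a hypothetical illustration of this reasoning, with invented names and data.

```python
def matches_relative_value(series, first, second, band, low_offset, high_offset):
    """first and second are inclusive (start, end) day ranges.
    True if some v1 satisfies
        v1 <= series[t] <= v1 + band                       for t in first, and
        v1 + low_offset <= series[t] <= v1 + high_offset   for t in second."""
    a = series[first[0]:first[1] + 1]
    b = series[second[0]:second[1] + 1]
    # First clause:  max(a) - band <= v1 <= min(a)
    # Second clause: max(b) - high_offset <= v1 <= min(b) - low_offset
    v1_low = max(max(a) - band, max(b) - high_offset)
    v1_high = min(min(a), min(b) - low_offset)
    return v1_low <= v1_high


# Made-up prices; index t is day t (index 0 is unused padding).
rising = [0, 50, 52, 55, 53, 51, 60, 65, 70, 75, 82, 85, 88, 86, 84, 83]
stalled = rising[:10] + [60] * 6

# Query 5: a $10 range on days 1-5, then v1+30..v1+40 on days 10-15.
hit = matches_relative_value(rising, (1, 5), (10, 15), 10, 30, 40)
miss = matches_relative_value(stalled, (1, 5), (10, 15), 10, 30, 40)
```

For the `rising` series, the first clause admits v1 between 45 and 50 and the second admits v1 between 48 and 52, so any v1 from 48 to 50 satisfies the query.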
Relative time/value queries can be composed to identify more complex patterns,
such as double bottom patterns [19, 98].
9.1.5 Interval Trending
Identification of intervals of monotonic increases, decreases, non-increases, or non-
decreases may be of interest [66, 108].
Query 7 Find stocks that increased in value every day over a 10 day period, with a
resulting increase of more than $100.
SELECT Si from S WHERE
Si(t) > Si(t −1)
and Si(ts +10)−Si(ts) > $100
forall ts +1 ≤ t ≤ ts +10;
As in queries 5 and 6, the important element is the relative change: 10 days of
increase, from any starting value t1. The magnitude of the overall change may be
specified, or not.
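A direct reading of query 7 checks every candidate start time ts for ten consecutive daily increases and a sufficient overall gain. The sketch below follows the pseudo-SQL literally; the function name and the synthetic data are assumptions.

```python
def find_rising_intervals(series, length, min_gain):
    """Return every start index ts such that series[t] > series[t-1] for all
    t in ts+1..ts+length, and series[ts+length] - series[ts] > min_gain."""
    starts = []
    for ts in range(len(series) - length):
        rising = all(series[t] > series[t - 1]
                     for t in range(ts + 1, ts + length + 1))
        if rising and series[ts + length] - series[ts] > min_gain:
            starts.append(ts)
    return starts


# Made-up data: a steady climb of $11 per day, and a copy with a single dip.
steady = [100 + 11 * t for t in range(12)]
dipped = list(steady)
dipped[5] = dipped[4] - 1

steady_hits = find_rising_intervals(steady, 10, 100)
dipped_hits = find_rising_intervals(dipped, 10, 100)
```

A single downturn on day 5 is enough to eliminate every 10-day window in the second series, which illustrates how strict the monotonic requirement is.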
“All points” angular queries as currently implemented in TimeSearcher (Section
4.7) provide some support for this type of query. As mentioned above in the discussion
of relative-value queries, this support is limited to a graphical notion of the
magnitude of the desired increase or decrease.
9.1.6 Maximal Periods
Query 8 Find the maximal period during which values increased every day
SELECT Si from S WHERE
Si(t) > Si(t −1)
forall ts ≤ t ≤ te
AND
te − ts is maximal;
This query is similar to query 7 in that it asks for an interval of continuous increase
in value. However, this query asks for a maximal interval of increase, rather than a
time-limited period of increase.
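For a single series, the maximal period of query 8 can be found in one pass by tracking where the current run of increases began. The following is a minimal sketch under the assumption of a non-empty list of values; the function name is hypothetical, and ties are resolved in favor of the earliest run.

```python
def longest_increasing_run(series):
    """Return (i, j) such that series[i] < series[i+1] < ... < series[j]
    and j - i is as large as possible (earliest run wins ties)."""
    best = (0, 0)
    start = 0
    for t in range(1, len(series)):
        if series[t] <= series[t - 1]:
            start = t                      # the run is broken; restart here
        if t - start > best[1] - best[0]:
            best = (start, t)
    return best


# Made-up data with two rising runs: indices 0-2 and indices 3-7.
run = longest_increasing_run([5, 6, 7, 3, 4, 5, 6, 7, 2])
```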
9.1.7 Aggregate Functions
Query 9 Find stocks that had an average price between $10 and $20 during times
10-15
SELECT Si from S WHERE
$10 ≤ ave(Si(t))≤ $20
when 10 ≤ t ≤ 15;
Queries might include any of the standard SQL aggregates - avg, min, max, sum,
and count. Other possible functions include moving averages, and standard deviations.
Existing timebox queries fall into this class as well: standard conjunctive timeboxes
might be seen as the val(x) aggregate, which simply looks at the value at each time x.
Disjunctive boxes (Section 3.1) might be seen as using the anyof(x) operator, to indicate
that any one of the values in the given time period falls within the desired range.
9.1.8 Similarity to a Known Item
Similarity queries involve the use of a known item as a query to find items that are
similar to it [29, 49, 78, 75, 76, 150]:
Query 10 Find stocks that are similar to ABC
SELECT Si from S WHERE
D(Si,ABC) ≤ ε
A wide variety of distance measures may be used, according to the specific circum-
stances [149]. In most data mining research, distances are defined in terms of Lp
norms. Possible alternatives include the similarity model used in TimeSearcher’s drag-
and-drop query facility, where items will be defined as similar if the distance between
all corresponding values stays within a threshold: ∀i, |qi − ri| < ε.
In other cases, notions of similarity might be modified to include similarity with
different scalings of time (dynamic time warping) or value, moving averages, and other
transformations [1, 19, 74, 104].
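The difference between an Lp-norm distance and the pointwise threshold test can be seen in a small example. The sketch below assumes two series of equal length; the function names and sample values are illustrative and are not taken from TimeSearcher.

```python
def lp_distance(q, r, p=2.0):
    """Lp-norm distance between two equal-length series."""
    return sum(abs(a - b) ** p for a, b in zip(q, r)) ** (1.0 / p)


def pointwise_similar(q, r, eps):
    """True if |q[i] - r[i]| < eps at every position i."""
    return len(q) == len(r) and all(abs(a - b) < eps for a, b in zip(q, r))


# Made-up values.
query = [10, 12, 14, 16]
shifted = [11, 13, 15, 17]   # off by 1 everywhere
spiked = [11, 12, 13, 20]    # close, except for one large deviation

l1_shifted = lp_distance(query, shifted, p=1)   # 1 + 1 + 1 + 1
l1_spiked = lp_distance(query, spiked, p=1)     # 1 + 0 + 1 + 4
ok_shifted = pointwise_similar(query, shifted, 2)
ok_spiked = pointwise_similar(query, spiked, 2)
```

Under the pointwise test with ε = 2, the uniformly shifted series is accepted while the series with a single spike is rejected, regardless of how close its remaining values are.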
9.1.9 Global Constraint
Query 11 Find stocks that never trade above $50
SELECT Si from S WHERE
max(Si(t)) < $50
These queries identify items with global behavior within some specified range.
Global queries might be based on minima and maxima, standard deviations, averages,
or other similar measures.
9.1.10 Inter-item queries: Leaders & Laggards
Query 12 Find stocks that decreased by at least $20 over some five-day period occur-
ring 10 days after some other stock rose by $100 in a 2-day period
SELECT Si from S WHERE
Si(t′) < Si(t′−1)
and Si(t′+10)−Si(t′+15) > $20
when t′+10 ≤ t ≤ t′+15
and t′ in
SELECT t from S WHERE
Sj(t) > Sj(t −1)
and Sj(ts +2)−Sj(ts) > $100
when ts ≤ t′ ≤ ts +2
Other possible queries may involve comparison between items: “find times when
the value of XYZ is greater than that of ABC” [113]. These and related queries attempt
to find items that exhibit a certain trend or pattern at some point in time after another
item (or set of items) exhibits a second trend. Such trends are useful for finding leaders
or laggards that predict or trail trends across items in a data set.
As relative times and values can be specified with this class of query, this notion of
leaders & laggards is more general than the functionality provided by TimeSearcher
(Section 4.2). This discrepancy provides an example of the tradeoffs that might be
involved in implementing many of the extensions to query expressiveness. Specifically,
if increased expressivity comes at the cost of increased interface complexity or
processing time, a combination of interface enhancements and user interaction might
be the most effective means of increasing the functionality available to users.
In the case of leaders & laggards, the implementation in TimeSearcher provides ba-
sic tools for specifying a set of leaders that might be of interest, while freeing the user
to search for laggards. This provides basic support, without incurring the processing
and complexity costs that would be associated with query 12.
9.1.11 Prevailing Trends
Query 13 Find items that are generally trending upwards, but may include downturns
within a certain tolerance, between days 10 and 15 [66].
SELECT Si from S WHERE
δi = Si(t) − Si(t −1)
and ave(δi) > 0
when 10 ≤ t ≤ 15
This query expresses the general upwards trend in terms of the average of the
changes in value between two measurements. Other formulations may be possible
for expressing this notion of general, but not monotonic, increases (or decreases) over
time. “End points only” angular queries (Section 4.7) provide a limited version of this
form of query.
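The average-of-differences formulation in query 13 has a simple consequence: the differences telescope, so their sum over days 10-15 equals Si(15) − Si(9), and the average is positive exactly when the final value exceeds the value just before the interval. The sketch below, with a hypothetical function name and made-up data, shows both forms agreeing.

```python
def trending_up(series, t_start, t_end):
    """True if the average of series[t] - series[t-1] over t_start..t_end is positive."""
    deltas = [series[t] - series[t - 1] for t in range(t_start, t_end + 1)]
    return sum(deltas) / len(deltas) > 0


# Made-up prices: flat through day 9, then rising with occasional downturns.
series = [50] * 9 + [50, 53, 51, 55, 54, 58, 57]

by_average = trending_up(series, 10, 15)
by_endpoints = series[15] > series[9]   # equivalent, by telescoping
```

This formulation places no bound on the size of any individual downturn, so a tolerance on downturns would need to be added as a separate constraint.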
9.1.12 More general queries
Query 14 Find items that have periods lasting 5 intervals long that contain at least 2
upwards changes and no more than one downward change.
A more general class of queries involves specification of a set of events that may
occur in any order within a given time interval. Query 14 requires at least two events
of one type, and no more than one event of another type, to occur during a certain time
period. Formulations of this sort might be useful for detecting trends in the presence
of outliers, as they allow for mismatch tolerance constraints [66] that allow elements
in the sequence to deviate from a general pattern (i.e., “Find all items that increased in
value during 4 of 5 time periods”). These queries might also be useful for identifying
intervals that contain vague trends similar to query 13. SDL uses a powerful set of
composable operators to specify these queries, providing expressive power similar to
that of regular expressions [5].
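Query 14 can be evaluated by counting events within each sliding window. The sketch below assumes that an "interval" is the change between two consecutive measurements, so that a period of 5 intervals spans 6 data points; this interpretation, the function name, and the data are assumptions. It does not attempt to reproduce SDL.

```python
def windows_with_events(series, n_intervals, min_up, max_down):
    """Return start indices of windows of n_intervals consecutive changes that
    contain at least min_up upward and at most max_down downward changes."""
    starts = []
    for ts in range(len(series) - n_intervals):
        changes = [series[t + 1] - series[t] for t in range(ts, ts + n_intervals)]
        ups = sum(1 for c in changes if c > 0)
        downs = sum(1 for c in changes if c < 0)
        if ups >= min_up and downs <= max_down:
            starts.append(ts)
    return starts


# Made-up data: 8 points, hence 3 candidate windows of 5 intervals each.
series = [10, 11, 11, 12, 11, 11, 10, 9]
matches = windows_with_events(series, 5, min_up=2, max_down=1)
```

Only the first window qualifies: its changes are +1, 0, +1, −1, 0, giving two upward changes and one downward change in no particular order.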
9.2 Query Dimensions
The above examples present a range of possible time series queries. Although this
list is not in any way complete, it does provide the basis for discussing the space of
queries that may be possible. In particular, we would like to work towards a model
that bridges the gap between these queries and the current limitations of the timebox
query model. In addition to providing a framework for more rigorous analysis of query
expressiveness, this model will help guide extensions to timebox queries.
Most of the example queries can be divided into one of two categories: those
involving values that stay within a particular range (queries 1, 2, 3, 4, 5, 6, and 11), and
those that involve some sort of transition, like an upward or downward trend (queries 7
and 13). Others involved modifications to an interval, such as intervals of a maxi-
mum length that meet specified criteria (query 8), or averaging values over an interval
(query 9). Similarity queries (query 10) and leaders and laggards queries (query 12)
involve relationships between values of multiple items in a data set, as opposed to other
queries that can be evaluated on each item independently. These distinctions form the
basis for discussions that will lead to development of the query model.
9.2.1 Range Events
A range event involves restrictions on both a set of times of interest and the values
of interest during those time periods. A range event can be viewed as a set of four
constraints: start time, duration, minimum value, and extent (for maximum flexibility,
duration and extent can be negative if desired). More formally, q = (s,d,m,e) is a
specification of a range event. In TimeSearcher, each timebox defines a single range
event.¹
Examination of the example queries indicates a variety of range events, ranging
from completely specified to minimally specified. A complete and absolutely specified
range event q - one in which all four parameters are provided in absolute (non-relative)
terms, is a simple fixed-time, fixed-value query (Query 1). If the event’s duration
and/or extent is left unspecified, an open-ended query results (Query 4). Restricting
the values or times to occupy a given range within some broader constraints leads to
a variable time or value query (queries 2 and 3). Similarly, start time and/or minimum
value can be omitted, leading to a relative time/value query (query 5 or 6).
A meaningful range event must minimally have either a minimum value or an
extent, and a start time or a duration. If both the minimum value and extent
remain unspecified, the range event will reduce to requiring that the time series have
some value during the specified time, which can be assumed to be vacuously true.
Similarly, if both start time and duration are omitted, the range event requests items
that have certain values at any point in the time series - i.e., a global constraint query
(query 11).
¹ For the current discussion, this definition will be more convenient than the equivalent definition
given in Chapter 3.
In the general case, these values can be specified either in absolute terms - in terms
of precise numeric values or time periods - or relative to the constraints of another range
event. Relative specification allows for relative time/value queries (queries 5 or 6)
and leader/laggard queries (query 12), where the key characteristic is the relationship
between events.
Relative specification of timeboxes may pose some challenges for interpretation
and evaluation. For example, if timeboxes B and C are both specified simply as being
after A and before D, the ordering between B and C is under-specified. In this case, B,C
and C,B may be acceptable orderings. This might be addressed by providing output
displays that indicate the nature of the ordering in query results, or by constructing the
query model and interface in a manner that prohibits such ambiguities.
Maximal period queries (query 8) are a special case of a range event with an
unspecified start. In this case, the width is defined relative to other intervals meeting
the same value constraint.
The defining characteristics of a range event are which of the four parameters (start,
duration, minimum, and extent) are fixed, and which are unspecified. Defining range
events in these terms is the first step towards extending the timebox model to handle
these more complicated queries: extensions to the model to handle these more general
events would allow expression of more complex queries like 2, 3, 4, 5, 6 and 11. These
extensions may come in the form of new or modified timeboxes, or in additional widgets
that act on the data set as a whole.
Aggregate operators (query 9) can be seen as additional qualifiers that are added to
a range event to alter the nature of the query. If a basic timebox is a constraint requiring
that all of the points in the specified interval fall within the given range, this can be
interpreted as applying the allof operator to the given time and value limits. Other
possibilities might be applied by changing this operator, perhaps to an ave, or anyof
for average or disjunctive queries respectively. Other aggregate operators such as sum
and count may pose additional challenges of query construction and interpretation.
9.2.2 Transition Events
A transition event describes the nature of the interval between values at two time peri-
ods. While a range event limits the values in a range of time points to fall within certain
parameters, a transition event imposes restrictions on the nature of the change between
two time periods: perhaps requiring that values be monotonically non-decreasing
(query 7), or more generally trending upwards (query 13). These events are speci-
fied in completely relative terms, with starting time and duration determined either
by surrounding range events, or in terms of a length constraint. Value restrictions are
defined in terms of differences between values within the range event.
In a certain sense, a range event might be loosely interpreted as the first derivative
of a transition event. This connection might be an interesting area for further investi-
gation.
9.2.3 Inter-item Queries
Most of the example queries involve consideration of each item in a data set separately,
without reference to any other items. Similarity (query 10) and leader/laggard queries
(query 12) are different in that they involve comparison between items.
Similarity queries involve a direct comparison between items. One variant of a
similarity search is currently supported via TimeSearcher’s drag-and-drop query-by-
example. Alternative models might use different definitions of similarity.
Leader and Laggard queries (query 12) raise the more general, and therefore more
interesting, challenge of working with queries that relate a set of patterns in one item
to related patterns in another item.
Other interesting inter-item queries might arise in data sets involving multiple
time-varying attributes. For example, climatological data sets might include wind
speed, temperature and precipitation data from a set of series over a given period of
time. Inter-item queries might (for example) be used to identify relationships between
the temperature in one city and precipitation in another. This idea might be extended
to involve some sort of temporal join between two data sets. Precise details of these
operations will be spelled out in the complete formal model.
9.2.4 Other Logical Operators: Disjunctions and Negations
In all of the above example queries, logical combinations between range event con-
straints were all described in terms of conjunctions. Alternative possibilities include
disjunctions and negations. As time and value ranges are assumed to be finite, nega-
tions may not be strictly necessary. However, they may prove useful for aiding in query
construction and interpretation.
9.2.5 More General Queries
Query 14 and other queries that can be expressed in SDL [5] or similar notation
provide expressive power similar to that of regular expressions. This power leads to
significant challenges in query creation and interpretation. These challenges stem from
two main sources: mismatch tolerance and arbitrary ordering of events. Modifications
to timeboxes that support mismatch tolerance constraints (“find all subsequences that
had increases in four out of five time periods”) [66] might involve additional widgets
attached to the timebox to specify the desired tolerance.
The arbitrary ordering of events provides the further challenge of specifying events
of an unknown ordering. Query 14 essentially asks for a time period of five intervals
with anywhere between two and five events matching certain criteria, in any order.
Since the events can be interleaved in time, a sequence of timeboxes would only be appropriate
if queries involving partial orders are allowed. Furthermore, the constraints
on the cardinality of the different event types will involve additional complexity, per-
haps involving sliders or other measures similar to those suggested above for mismatch
tolerance. Alternatively, other more complex query specifications similar to filter-flow
queries [152] or the queries used in Patterns [90] might be used.
9.3 Towards A Formal Query Model
This discussion of the possible facets of time series queries can be used to guide a
sketch of a formal description of more expressive queries based on timeboxes. By
providing a clear and rigorous description of possible queries, the model will further
understanding of the query space and support reasoning about queries. This under-
standing will be particularly useful for generalizing the query model to other data sets,
including general temporal data and categorical data.
After a description of the data set, simple range event queries and related operators
will be discussed, followed by conjunctive queries, transition events, and inter-item
queries.
9.3.1 Time Series Data Set
We assume that all queries are being evaluated against a set S of M time series records
(S0, . . . ,SM−1), with each record containing N real-valued measurements, corresponding
to time points 0, . . . ,N−1. The jth measurement from the ith time series is referred
to as Si(j). The values of the Si(j) occupy a finite range between vmin = min_{i,j} Si(j)
and vmax = max_{i,j} Si(j).
9.3.2 Range Events
A timebox range event query q consists of four parameters: start time, duration, min-
imum value, and extent. Specifically, q = (s,d,m,e) (Section 9.2.1). As discussed
above, at least one of the start time and duration must be defined, along with at least
one of the minimum and extent. For now, we delay the case of expression relative to
another query and insist that any specified values are expressed in absolute terms.
If all four values are specified, the extent of the box must fit within the dimensions
of the data set as a whole: 0 ≤ s ≤ N−1, 0 ≤ s+d ≤ N−1, vmin ≤ m ≤ vmax, vmin ≤
m+e ≤ vmax.
The default interpretation of a timebox requires that the value be in the appropriate
range during all of the specified time periods. If Si ∈ S and q = (s,d,m,e), we say that
Si satisfies the query q - written T(q,Si) - if m ≤ Si(j) ≤ m+e for all j with s ≤ j ≤ s+d−1.
Any unspecified values are indicated by the special symbol U. In the presence
of unspecified values, the above constraints are enforced where logically possible. For
example, if s = U, the duration of the timebox must still be less than the duration of the
entire data set. Once range event queries are executed, the result items will instantiate
the values of the unspecified parameters in a manner that restricts the range event to fit
within the limits of the given data set.
Maximal periods queries (query 8) can be created by specifying the duration d as
the special symbol M, indicating the interval of maximum duration that satisfies the
start time and value constraints.
Unspecified and relative values and maximal period operators introduce additional
complexity that must be addressed in a completed formal model. If all of the queries
are expressed in terms of exact starting points and durations, each item in a data set
can only match a given query once - at exactly those specified times. However, un-
specified and relative values and maximal periods raise the possibility that a query
might match a time series at multiple intervals with different values. Thus, a complete
notation indicating a match between a query and a time series Si might take the form
T(q,Si,s′,d′,m′,e′), where s′, d′, m′ and e′ are the instantiated parameter values for which the series met
the query constraints.
Alternative interpretations of timebox queries can be specified via operators such
as ave and anyof, requiring that the average value, or any one of the values, in the
specified interval fall within the desired range. In particular, T(ave(q),Si) if m ≤
(Si(s) + . . . + Si(s+d−1))/d ≤ m+e, and T(anyof(q),Si) if m ≤ Si(j) ≤ m+e for some j with s ≤ j ≤ s+d−1.
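The three interpretations of a fully specified range event q = (s,d,m,e) can be written down almost directly from the definitions above. The following sketch is one possible rendering, assuming d ≥ 1 and time points indexed from 0; the class name `RangeEvent` and the function names are illustrative and do not correspond to TimeSearcher code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RangeEvent:
    s: int      # start time
    d: int      # duration, as a number of time points (assumed >= 1)
    m: float    # minimum value
    e: float    # extent; the maximum value is m + e

    def points(self, series: Sequence[float]) -> Sequence[float]:
        """Values of the series at time points s .. s+d-1."""
        return series[self.s:self.s + self.d]

    def in_range(self, value: float) -> bool:
        return self.m <= value <= self.m + self.e


def allof(q: RangeEvent, series: Sequence[float]) -> bool:
    """Default timebox semantics: every value in the interval is in range."""
    return all(q.in_range(v) for v in q.points(series))


def anyof(q: RangeEvent, series: Sequence[float]) -> bool:
    """Disjunctive semantics: at least one value in the interval is in range."""
    return any(q.in_range(v) for v in q.points(series))


def ave(q: RangeEvent, series: Sequence[float]) -> bool:
    """Aggregate semantics: the mean of the values in the interval is in range."""
    pts = q.points(series)
    return q.in_range(sum(pts) / len(pts))


# Made-up series; q covers time points 2..5 and values 10..20.
q = RangeEvent(s=2, d=4, m=10, e=10)
series = [5, 8, 12, 25, 14, 9, 30]
outcome = (allof(q, series), anyof(q, series), ave(q, series))
```

The same series fails the default interpretation (two of the four values are out of range) but satisfies both the anyof and ave interpretations, since the mean of 12, 25, 14 and 9 is 15.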
9.3.3 Logical Combinations
Timebox queries can be combined via logical operators AND, OR, and NOT to form
more complex queries. Although the final model may allow for arbitrary logical
combinations, the set of allowable queries in TimeSearcher is likely to be much more
limited. Due to the difficulty of constructing and modifying queries with arbitrary
grouping, TimeSearcher queries might be limited to conjuncts of disjuncts. Similarly,
negations may not be allowed.
Conjunctive queries containing timeboxes for each time point in the data set can be
used to create similarity queries based on local (point-by-point) similarity restrictions.
For the sake of convenience, we view a complex query of multiple timeboxes as
an ordered sequence: q0 op1 q1 . . .opn qn. The timebox queries qi are assumed to be
sorted in increasing order of start time (when specified), even if they are constructed
in an arbitrary order, as may be the case with the TimeSearcher application.
9.3.4 Variable Timeboxes
Queries involving ranges of times or values (queries 2 and 3) can be specified by
the introduction of additional constraints. For example, a fixed time/value timebox
q = (s,d,m,e) might be augmented to become a variable time timebox by adding the
constraints s≥ t1 and s+d ≤ t2 to the query. Similar modifications can be used to create
variable value timeboxes.
9.3.5 Relative Timeboxes
Queries involving relative values (queries 5 and 6) involve the specification of query
parameters in terms of other timeboxes in a conjunctive query. If we have the query
q0 op1 q1 . . . opn qn, where qi = (si,di,mi,ei), we might specify the next query com-
ponent as qi+1 = (si + δs,di + δd,mi + δm,ei + δe). Any (or all) parameters might be
specified relatively.
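Once the first timebox in such a chain has been instantiated, the remaining components follow by accumulating offsets. The sketch below illustrates only this bookkeeping step, under the assumption that each component is a plain (s, d, m, e) tuple; the function name is hypothetical, and the search for a satisfying instantiation of the first timebox is not shown.

```python
def resolve_relative(first, deltas):
    """first: an absolute (s, d, m, e) tuple.
    deltas: a list of (delta_s, delta_d, delta_m, delta_e) tuples, each
    relative to the preceding component. Returns the absolute tuples."""
    resolved = [first]
    for delta in deltas:
        prev = resolved[-1]
        resolved.append(tuple(p + dp for p, dp in zip(prev, delta)))
    return resolved


# Query 5 with v1 instantiated to 10: days 1-5 in the range 10..20, followed by
# days 10-15 in the range 40..50 (that is, v1+30 .. v1+40).
chain = resolve_relative((1, 5, 10, 10), [(9, 1, 30, 0)])
```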
This model makes the initial simplifying assumption that clauses in a query can be
completely ordered, and that each query should be defined in terms of its predecessor.
This avoids the difficulties associated with an ordering of query elements that is partial
and not complete (Section 9.2.1). These assumptions are not necessarily appropriate in
all cases: a complete model might use a less restrictive approach.
9.3.6 Transitions
Transitions involving monotonic increases, decreases, non-increases, or non-decreases
can be expressed as additional clauses in a conjunctive query. For example, to
specify an interval of monotonic increase between queries qi and qi+1, the clause
inc(qi,qi+1) might be added to the query statement. Additional operators indicating
monotonic decrease, non-increase, or non-decrease might also be useful. Prevailing
trends (query 13) might be supported via additional operators using parameters to
specify allowable tolerances. For example, inc(qi,qi+1,δ) might be used to specify
a deviation δ indicating the tolerance desired in a vaguely increasing trend. Alterna-
tively, an interval of a given length might be specified without reference to a timebox
constraint: inc(n,δ).
9.3.7 Global Constraints
A global constraint (query 11) can be viewed as a range event with no constraint on
time. Using the notation given above, the range event q = (U,U,m,e) specifies a
constraint that will restrict the result set to items that always (or never) fall between m and
m + e. These events will be added to the compound query as additional terms in the
conjunction. As with standard timeboxes, either the minimum or the extent, but not
both, might be unspecified.
9.3.8 Inter-item Queries
Inter-item queries such as leaders and laggards (query 12) essentially involve connections
between two or more sets of items, each set resulting from an individual query.
In particular, a query Q1 = q10 op11 . . .op1n q1n defines a subset of items S′ ⊂ S
that meets the query constraints. Inter-item queries would be specified in a manner
similar to that which was used with relative timeboxes, with the important difference
that inter-item queries are specified relative to timeboxes from a completely separate
query. Exact notation details will be needed to describe this relationship. In particu-
lar, inter-item queries involving joins between disjoint data sets will introduce further
complexity to the model.
9.3.9 Open Issues
This sketch presents the beginnings of a formal model of time series queries. A com-
plete model would build on these notes, adding more precise definitions of the lan-
guage, a formal grammar, and domain descriptions where appropriate. The process of
completing this model may uncover further questions in need of clarification.
The possibility of negative duration or extent values is one area that will certainly
need clarification. For some range queries - particularly those involving global con-
straints - it may be advantageous to express value constraints in terms of an extent
relative to a maximum value, or time constraints relative to an endpoint. These might
be handled with negative extent and duration, respectively, at the expense of increased
complexity in the description of timebox constraints. Alternatively, the basic model
might be extended to include extra fields for the endpoint and maximum value. This
would define an over-constrained timebox in terms of three variables in each dimension,
with any two of the three (i.e., start, end, or duration) determining the third.
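The over-constrained alternative can be sketched for the time dimension as follows. This is only an illustration under the assumption that end = start + duration; `complete_interval` is a hypothetical helper, and the same logic would apply to minimum, maximum, and extent in the value dimension.

```python
from typing import Optional, Tuple


def complete_interval(
    start: Optional[float] = None,
    end: Optional[float] = None,
    duration: Optional[float] = None,
) -> Tuple[float, float, float]:
    """Fill in the missing field of (start, end, duration).

    Any two of the three fields determine the third. If all three are
    given, they must agree with end == start + duration.
    """
    given = sum(x is not None for x in (start, end, duration))
    if given < 2:
        raise ValueError("at least two of start, end, duration are required")
    if given == 3 and start + duration != end:
        raise ValueError("start, end, and duration are inconsistent")
    if start is None:
        start = end - duration
    elif end is None:
        end = start + duration
    elif duration is None:
        duration = end - start
    return start, end, duration
```

Specifying an endpoint and a duration in this form avoids the need for negative durations, at the cost of an extra field and a consistency check.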
Precise definitions of the notion of a timebox satisfying a query, relative queries and
complex queries containing multiple clauses will also need more work, particularly in
the context of potentially overlapping queries that may prevent a simple total ordering
of the query components.
Finally, the operators and syntax must be clarified and collected into a well-defined
grammar.
9.4 Implementing the Extended Queries
From the above discussion, we can identify several additions to the query model that
will be needed to achieve the desired expressiveness. Specifically, an enhanced system
will need widgets and interaction techniques for:
1. Queries with relative and variable specification of value and time constraints
2. Unspecified width, start, min, extent, and regions of maximal length.
3. Global trends
4. Alternative interpretations of timeboxes
5. Transition events
6. Disjunction and possibly negation.
7. Inter-item queries
Some of these query extensions might be implemented via minor additions to the
timebox model. Indeterminate time/values and queries involving operators (averages,
etc.) on range events might be expressed using different color timeboxes, or perhaps
via boxes involving decorations that specify the query type. For example, variable time
timeboxes are implemented by placing a simple timebox inside of a second box used
to indicate variability. This might be extended to provide broader support for general
variable value queries (query 3). Regions that are not fully specified might be indicated
by incomplete time boxes, containing only two or three sides instead of four.
Queries involving relative specification of value and time constraints might be
achieved by creating queries on a blank screen, independent of any constraints associ-
ated with the values and times on the query grid currently used. External widgets of
specified width or height (“struts”) might be used to require a minimum specification
between relatively specified timeboxes (query 6).
Such an approach would allow for strictly relative specification of timeboxes, but
combinations of relative and absolute queries would require additional support mech-
anisms. Other extensions may require new query widgets, such as the widget used for
angular queries (Section 4.7).
Disjunctions and negations will require additional special handling. The current
TimeSearcher model of conjunctive combination as the default relationship between
query components is simple and easily interpreted. In some cases, such as timeboxes
that occupy disjoint value ranges during identical time intervals, a seemingly natural
disjunct might appear (Figure 9.2). However, models that may appear to be natural
and obvious may in fact support ambiguous interpretation (Figure 9.3).
In these and other cases, additional interaction techniques such as manual specifi-
cation of disjunction might be necessary. This might be done through a separate control
panel used to control logical combinations of query components, similar to the Brush
Toolbox in XmdvTool [89]. Negations might be slightly easier: as discussed above,
timebox coloring or decoration might be useful. As the demand for these features is
unclear, implementation of these extensions will be of lower priority.
Figure 9.2: A timebox query expressing A ∧ (B ∨ C) ∧ D. B and C must be disjuncts,
as both cannot be true simultaneously.
Further extensions to the query model may be necessary for inter-item queries such
as Leaders and Laggards (query 12). These queries involve comparisons between dif-
ferent subsets of a data set, with each subset involved being the result set of a (possibly
arbitrary) time series query. Specifying these subsets is likely to be a complex,
perhaps iterative process. Appropriate query input (and display) tools will be needed to
distinguish between the multiple sets of constraints.
Additional display techniques may be needed to work with these new query types.
For many queries - particularly those involving patterns that may occur at varying
points in time with varying durations - it may not be immediately apparent why a
particular region in a given time series matches the specified query. Appropriately designed
displays that provide a clear and natural mapping between query input and result out-
put will be needed to help users understand and interpret query results. In some cases,
there may be multiple candidate strategies for query input and result display. These
tradeoffs may be the subject of empirical and/or heuristic evaluation.
Figure 9.3: A timebox query that may lead to ambiguous interpretation under the model given
in Figure 9.2. The item drawn is in either timebox B or C for the two time points
during which they overlap, but it does not spend both of those time points in any one
box. Should this item be included under the disjunctive semantics of Figure 9.2? What
result would users expect?
Timeboxes are simple and easily understood. Any additions to the query system
should be similarly straightforward and unambiguous. This tradeoff between expres-
sive power and simplicity presents an interesting opportunity for further evaluation
(Section 10.2): if extensions to the range of possible queries lead to increased confu-
sion and difficulty, simpler models may in the end be more powerful.
Alternatively, existing query tools might be combined with interface enhancements
to provide much, if not all, of the extended semantics. This is the case with leaders &
laggards (Section 9.1.10), where enhancements to the TimeSearcher interface (Section
4.2) provide feedback that might help the user interactively explore the data
for the desired collections of leaders and laggards.
Algorithmic enhancements will be needed to process these advanced queries. One
possible approach to this problem would be to use a two-step process, involving tradi-
tional querying followed by post-processing. In the first step, a standard search algo-
rithm would be used to identify items within the dataset that met the constraints of any
traditional timeboxes, and fell within bounding boxes surrounding widgets such as the
slanted regions described above. In the second step, candidate items in the result set
would be examined to determine whether they met the requirements of any of the more
expressive query widgets. Other algorithms may be considered if this simple approach
does not perform well.
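The two-step process described above might be sketched as follows. This is a minimal illustration, not TimeSearcher's implementation: the `Box` tuple layout, the function names, and the use of plain predicates to stand in for the more expressive widgets are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical box layout: (x1, x2, y1, y2) with inclusive time indices.
Box = Tuple[int, int, float, float]


def in_box(series: Sequence[float], box: Box) -> bool:
    """Standard timebox test: all values in [y1, y2] for times x1..x2."""
    x1, x2, y1, y2 = box
    return all(y1 <= v <= y2 for v in series[x1:x2 + 1])


def two_step_search(
    data: Dict[str, Sequence[float]],
    boxes: List[Box],
    refinements: List[Callable[[Sequence[float]], bool]],
) -> List[str]:
    """Step 1: filter with timeboxes and bounding boxes.
    Step 2: apply the more expressive checks to the surviving candidates only.
    """
    candidates = [
        name for name, series in data.items()
        if all(in_box(series, b) for b in boxes)
    ]
    return [
        name for name in candidates
        if all(check(data[name]) for check in refinements)
    ]


data = {"a": [1, 3, 5, 7], "b": [7, 5, 3, 1], "c": [1, 3, 5, 20]}
bounding_box = (0, 3, 0, 10)  # box around a hypothetical slanted region
rising = lambda s: all(s[t + 1] - s[t] >= 1 for t in range(0, 3))
result = two_step_search(data, [bounding_box], [rising])
```

Here item "c" is rejected by the bounding box in the first step, and item "b" passes the bounding box but fails the slope check in the second step. The expensive checks run only on candidates, which is the intended benefit of the approach.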
9.5 User Needs
The proposed query extensions described in this chapter are the result of a theoretical
exploration of the space of possible queries. This exploration outlines some of the
query concepts that might expand the capabilities of analysts who work with time
series data. Specifically, these queries might help these analysts overcome limitations
with current tools and conduct valuable and novel searches that reveal patterns or find
interesting items in their data sets.
Any further work aimed at implementing these (or other) enhancements to the
timebox model should be based in analysis of meaningful tasks. Query extensions that
meet known user needs are the most likely to prove worthwhile.
9.6 Subsequence Queries: Beyond Full-Sequence Matches
Like basic timebox queries, many of the extended queries described above are
designed to identify items from a data set that match specified criteria. Much of
the recent research on algorithmic methods for querying time series data has moved
beyond these “full-sequence” matches to examine the question of “subsequence”
matches: queries that can identify portions of a time series that meet limited criteria
[5, 49, 78, 75, 98, 104]. Unlike full-match queries, subsequence queries may match
a given item in a time series multiple times. Thus, a match for a subsequence query
takes the form of an identifier for the time series along with the interval defining the
subsequence that matches.
Relative time queries that do not refer to specific time points provide the conceptual
basis for handling subsequence queries. Without specific time points to “anchor” the
query in time, these queries act as motifs or patterns that can be identified at arbitrary
points.
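To make the idea of an unanchored query concrete, the sketch below slides a relatively specified pattern across each series and reports every placement that satisfies it. The `RelBox` layout (offset, duration, low, high), the brute-force scan, and the function name are assumptions made for illustration; each match is reported as a series identifier plus the matching interval, as described above.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical relative box: (offset from pattern start, duration, low, high).
RelBox = Tuple[int, int, float, float]


def subsequence_matches(
    data: Dict[str, Sequence[float]],
    pattern: List[RelBox],
) -> List[Tuple[str, int, int]]:
    """Return (series id, first index, last index) for every placement of
    the pattern at which all relative boxes are satisfied."""
    span = max(off + dur for off, dur, _, _ in pattern)
    matches = []
    for name, series in data.items():
        for start in range(len(series) - span + 1):
            ok = all(
                lo <= v <= hi
                for off, dur, lo, hi in pattern
                for v in series[start + off:start + off + dur]
            )
            if ok:
                matches.append((name, start, start + span - 1))
    return matches


# Two low points followed by two high points, anywhere in the series.
pattern = [(0, 2, 0, 2), (2, 2, 4, 6)]
found = subsequence_matches({"x": [1, 1, 5, 5, 1, 1, 5, 5]}, pattern)
```

In this example the single series "x" matches twice, which illustrates why subsequence results must carry an interval and not just an item identifier.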
Further flexibility might be gained by extending subsequence queries to ask for
trends such as those described in queries 13 and 14. Such queries would be very similar
to the motif queries handled by the Shape Definition Language [5] or the graphical
trend queries in Patterns [90]. The flexibility of these queries may impose significant
processing demands, possibly making dynamic query support difficult for medium and
large databases.
These extended queries raise the possibility of an interesting tradeoff between per-
formance and expressivity. The goal of a 100ms response time for dynamic queries is
based on the claim that rapid response is necessary to avoid the user frustration and
delays that accompany longer waits for query responses. However, there may be times
when users are willing to wait slightly longer for answers to queries, particularly if
those queries are substantially more powerful. Examination of these tradeoffs within
the context of an extended timebox query language would be an interesting area for
further work.
Chapter 10
Future Work
The algorithmic issues (Chapter 6) and the extensions to query expressiveness
(Chapter 9) present numerous challenges for future work. Case studies yielded further
suggestions for enhancements to TimeSearcher (Chapter 8). Additional possibilities
are described below.
10.1 Further Development of TimeSearcher
10.1.1 Re-Implementation
As a research prototype, TimeSearcher was implemented with a focus on demonstra-
tion of capabilities and exploration of ideas. A new implementation, based on lessons
learned from the existing prototype, would simplify future work and lead to increased
flexibility and extensibility.
Data management elements of TimeSearcher would benefit from a redesign. Cur-
rently, TimeSearcher assumes that the entire data set is in local RAM. This assumption
- which is found throughout the code - limits the flexibility of TimeSearcher to scale
to larger data sets. Ideally, the data management code would be separated from other
code with clear abstraction barriers that would support the possibility of “plugging in”
alternative data storage models. The query algorithm should be similarly abstracted,
to support exploration of alternative search strategies and semantic (Section 3.1) and
other extensions (Chapter 9).
10.1.2 Scaling
Many interesting time series data sets are very large, both in terms of the number of
items, and the number of time points in each item. Scaling the timebox/TimeSearcher
model to accommodate larger numbers (perhaps $O(10^6)$ items or time points) would
require improvements to search algorithms (Section 6.6) and the rendering portion of
the system. Alternatively, larger data sets might be randomly sampled or mathemati-
cally clustered into smaller sets of manageable size. Clustered data sets might make
use of “structure-based brushes” developed for hierarchical data sets [52]. Other pos-
sibilities include extending TimeSearcher to use disk-based indices (Section 6.6) and
possibly query previews [54] to query data sets that are too large to fit into RAM.
Long time series present particular problems for query specification and display.
Screen space limitations of approximately 1000 horizontal pixels limit displays to time
series of a few hundred time points. A variety of approaches might be taken to over-
come this constraint:
1. Scrolling: Horizontal scrolling in the query and display window might be used
to pan through the length of the time series. Linking the scrolling of the two
windows would allow users to display specific areas of the query along with
corresponding points in the data display.
2. Semantic Zooming: Zooming facilities provided by Piccolo could be used to
display long time series at varying levels of scale and detail. A “zoomed-out”
display could show a very long time series at a low level of detail by showing
a display with fewer data points than the original. This reduced display
might be based on averages of adjacent data points, examination of trends to
eliminate “uninteresting data points” or other approaches. Hierarchical sim-
plification could be used to provide progressively more detailed displays that
would be shown as the user “zooms in” to view specific areas of the data set
at greater (or even full) resolution. These simplifications might be presented in
an “overview+detail” fashion, with a compressed overview presented alongside
raw data. Alternatively, distortion techniques might be used to present areas of
interest in full detail and peripheral areas in a compressed display.
3. Filtering: Many long time series contain short periods of interesting data sep-
arated by large intervals of relatively uninteresting data with little change. For
example, EKG data contains long stretches of “normal” heart activity between
incidents of excitement or heart difficulty. For such data sets, appropriate global
filters might be used to eliminate those periods that are uninteresting, essentially
shortening the length of a time series to include only those areas that are po-
tentially relevant. These filters might take the form of a range slider that would
specify thresholds for the minimum and maximum amount of change that would
be interesting: intervals with changes that fell outside of the threshold would be
filtered.
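The filtering strategy in item 3 can be sketched as a pass over fixed-width windows. The window-based segmentation, the use of max minus min as the measure of change, and the function name are assumptions; the two thresholds play the role of the range slider described above.

```python
from typing import List, Sequence, Tuple


def filter_windows(
    series: Sequence[float],
    width: int,
    min_change: float,
    max_change: float,
) -> List[Tuple[int, int]]:
    """Return (first index, last index) of each window whose change
    (max - min) falls within [min_change, max_change]."""
    kept = []
    for start in range(0, len(series), width):
        window = series[start:start + width]
        change = max(window) - min(window)
        if min_change <= change <= max_change:
            kept.append((start, start + len(window) - 1))
    return kept


# A flat signal with one burst of activity in the middle.
signal = [5, 5, 5, 5, 5, 9, 4, 5, 5, 5, 5, 5]
interesting = filter_windows(signal, width=4, min_change=1, max_change=10)
```

Only the middle window survives the filter in this example, so a display built from the kept intervals would omit the flat stretches on either side.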
Each of these approaches presents challenges in terms of appropriate displays
and interaction techniques. All three strategies might benefit from the addition of
an overview window which would show a thumbnail version of the entire width of a
data set while indicating where the current display fits into the whole. Zoomed dis-
plays might be controlled with a slider that could be used to select the desired level of
magnification. These approaches (and others) might also be used in combination.
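The averaging-based reduction mentioned under semantic zooming, together with a hierarchy of progressively coarser levels, might look like the sketch below. It is independent of Piccolo and makes no claim about how TimeSearcher would implement zooming; the function names and the default factor of 2 are assumptions.

```python
from typing import List, Sequence


def downsample_by_average(series: Sequence[float], factor: int) -> List[float]:
    """Replace each run of `factor` adjacent points with their mean.
    A shorter final run is averaged over the points that remain."""
    return [
        sum(series[i:i + factor]) / len(series[i:i + factor])
        for i in range(0, len(series), factor)
    ]


def build_levels(series: Sequence[float], factor: int = 2) -> List[List[float]]:
    """Build a list of levels from full resolution down to a single point.
    Level 0 is the raw data; each later level is coarser by `factor`."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    levels = [list(series)]
    while len(levels[-1]) > 1:
        levels.append(downsample_by_average(levels[-1], factor))
    return levels
```

A zoom slider could then select a level by index, with the coarsest levels serving as the overview and level 0 as the full-resolution detail view.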
10.1.3 Domain Customization
As TimeSearcher was developed primarily as a platform for exploring the possibilities
of timebox queries, it is intentionally generic. More in-depth work in specific application
areas might require additional functionality that would meet the needs of users
in specific domains. Possibilities include:
• Statistical analyses: Microarray data analysis (Section 8.1) often involves sta-
tistical tests and clustering algorithms. Integrating the results of these analytical
tools into the TimeSearcher display and query interaction might increase the
utility of the tool.
• Statistical descriptions of result sets: Microarray analyses also often involve the
search for clusters of related genes. Augmented displays that provide statistical
descriptions of result sets might help in this regard. For example, some measure
of the similarities between items in a result set might be useful for determining
whether or not the items in that set might be interpreted as a cluster.
• Alternatively, TimeSearcher might be modified to support searching over
hierarchically-clustered data, so that users might see either a cluster or an in-
dividual item as the result from a timebox search. This might be combined with
facilities for drilling-down to see the individual items found in a cluster of inter-
est.
• Links to other data sets and applications: Domain experts often benefit from
examining related data sets in multiple, coordinated views [94]. Visualizations
that linked time series data sets in TimeSearcher to other views of related data
might prove particularly powerful. For example, microarray time series data
sets might be linked to views of gene ontologies using treemaps [18]. Alter-
natively, TimeSearcher might be extended to be compatible with coordination
architectures such as snap-together visualizations (Snap) [94], thus supporting
the possibility of ad-hoc coordinated visualizations.
• Process Support: The analyses conducted in TimeSearcher may be part of ongoing
investigations that involve multiple reviews of each data set, saving of intermediate
results, and re-interpretation of trends that have been identified.
TimeSearcher’s rudimentary features for saving intermediate results (Section 4.9)
might be extended to include more detailed history-keeping and browsing [100],
annotation of results, and other bookkeeping tools that could be used to support
the ongoing process of data interpretation and synthesis.
Additional work with expert users will provide the motivation for prioritization of
future efforts: work will focus on those areas that meet the needs of motivated users.
Customization for other application domains might provide additional interesting
challenges. In collaboration with ChevronTexaco, researchers at the University of
Maryland are investigating the use of TimeSearcher for analyzing oil-well monitor-
ing data. This collaboration has already identified the need for additional facilities for
handling monitoring data sets. Another domain of potential interest is signal process-
ing, as timeboxes, VTTs, and angular queries are similar to common operations on
signals [120].
10.1.4 Multiple Time-Varying Attributes
The support for multiple time-varying attributes described in Section 4.3 is preliminary
and limited. Exploration of alternative strategies for displaying multiple attributes,
probably via multiple windows, might increase the utility of this tool while
reducing the cognitive demands placed on users.
10.1.5 Additional Functionality
Additional tools for exploring and manipulating time series data sets would increase
TimeSearcher’s utility and flexibility. For example, support for zooming time series,
overlay of related time series, point queries, and other functionality implemented in
Diamond Fast [135, 136], along with decomposition, smoothing, and forecasting and
related techniques [80] might support statistically oriented tasks.
Further refinement to existing features might also increase TimeSearcher’s utility.
For example, the “leaders and laggards” functionality (Section 4.2) might be extended
with facilities that would help users explicitly relate modified laggard queries back to
the original leader query.
Additional displays or feedback might guide users in query creation or modification.
Given a query, users might be interested in finding other intervals in which large
numbers of items followed the same pattern. This information might
be provided via a line of varying intensity underneath the horizontal (time) axis. The
saturation of this line’s color at any given time would be determined by the number of
items that matched the specified query, starting at that time point (Figure 10.1). This
preview line would provide users with suggestions for other time intervals where the
given pattern would be found. Similar previews might be provided to help users find
other value ranges for a given pattern during given time periods.
Figure 10.1: The TimeSearcher query display, augmented with a preview display
showing time periods that have larger numbers of items that follow the pattern. The
number of items that match the query at each time point is given by the line color at
that time: lighter colors indicate a small number of matches, while darker colors show
intervals with more matches.
The computational overhead required for generating this preview information
might be substantial, making dynamic query response times infeasible. This difficulty
might be handled by presenting preview information only upon explicit user request.
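The preview line could be driven by a count, for each candidate start time, of the items that match the query when it is shifted to start there. The sketch below computes those counts by brute force, which also makes the cost concern visible: the work grows with the number of items times the number of start times. The relative-box layout and function names are assumptions.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical relative box: (offset from query start, duration, low, high).
RelBox = Tuple[int, int, float, float]


def matches_at(series: Sequence[float], pattern: List[RelBox], start: int) -> bool:
    """True if the series satisfies every box when the query starts at `start`."""
    return all(
        lo <= v <= hi
        for off, dur, lo, hi in pattern
        for v in series[start + off:start + off + dur]
    )


def preview_counts(
    data: Dict[str, Sequence[float]],
    pattern: List[RelBox],
    n_points: int,
) -> List[int]:
    """Number of matching items for each possible start time."""
    span = max(off + dur for off, dur, _, _ in pattern)
    return [
        sum(1 for series in data.values() if matches_at(series, pattern, start))
        for start in range(n_points - span + 1)
    ]


def saturations(counts: List[int]) -> List[float]:
    """Scale counts to [0, 1] so that the busiest start time is fully saturated."""
    peak = max(counts, default=0)
    return [c / peak if peak else 0.0 for c in counts]


data = {"a": [1, 5, 1, 5], "b": [1, 5, 5, 1], "c": [5, 1, 5, 1]}
low_then_high = [(0, 1, 0, 2), (1, 1, 4, 6)]
counts = preview_counts(data, low_then_high, n_points=4)
```

Computing the counts only on explicit request, as suggested above, would keep this cost out of the dynamic query loop.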
Other possibilities might involve augmenting TimeSearcher to move “backwards”
- from items in the result set to queries that describe those items. TimeSearcher’s
query-by-example tool (Chapter 4) might be extended to provide additional power.
For example, given a set of items of interest, is there some sort of minimal timebox
query that returns exactly that set? Such queries might be computationally intensive, perhaps
drawing on work in data mining and machine learning, but they could be useful for
some analyses.
The empirical studies conducted during the course of this work (Chapter 7) identi-
fied several potential areas for improvements to the current TimeSearcher system:
• Improved facilities for adjusting timeboxes over small intervals and fine-tuning
of query ranges
• Support for temporarily disabling timeboxes.
• Alignment tools for easing the process of creating timeboxes that are aligned in
value.
Finally, work with domain experts using TimeSearcher for ongoing research iden-
tified several proposed extensions to query expressiveness, search functionality, result
displays, and other components of TimeSearcher (Chapter 8). Implementation of these
extensions would increase TimeSearcher’s utility and flexibility.
10.2 Further Evaluation
Current evaluations of timeboxes and TimeSearcher have provided mixed results. De-
spite promising case studies involving the use of TimeSearcher for hypothesis genera-
tion in ongoing scientific research (Chapter 8), refined empirical studies are needed to
identify and measure the benefits of TimeSearcher for realistic data sets and queries.
Further empirical studies aimed at overcoming the shortcomings of previous efforts
might help identify some of the strengths of the timebox query model. As discussed in
Chapter 7, previously conducted studies suffered from tasks that were not especially
well-suited for timeboxes, and also from difficulties in user comprehension of com-
plex queries. Studies involving a more careful selection of tasks, perhaps focused on
exploration of data sets, and perhaps involving more training, might overcome these
difficulties to provide more informative results.
Narrowly-focused studies with motivated domain experts might provide another
means of avoiding difficulties associated with user comprehension of tasks. As such
evaluations would involve users who had a vested interest in solving real problems
that they face with meaningful data sets, the difficulties associated with novice users
would be avoided, and additional types of evaluations might be possible. For example,
TimeSearcher might be compared to whatever existing tools they use. Alternatively,
TimeSearcher might be used with and without various features, in order to determine
which features are most helpful.
10.3 Other Types of Time-oriented Data
The extensions to the timebox query model described in Chapter 9 provide several
possible directions for future work. However, these extensions were all discussed in
terms of basic time series data sets. Generalizing the timebox concept to apply to other,
more challenging types of time-oriented data presents further opportunities for interesting work.
10.3.1 Categorical or Nominal Data
Timeboxes and TimeSearcher were originally designed to support time series data sets
involving continuous measurements. In this context, “continuous” is defined to mean
that the values involved can take any value in a finite interval. A potentially
interesting generalization of timeboxes and TimeSearcher might involve support for
categorical or nominal data sets: data sets involving values that each fall into one of a set
of (possibly ordered, in the case of nominal data) discrete and disjoint classes. This
definition is somewhat arbitrary: a standard time series data set might be converted to
a categorical data set by “bucketing” the values ($0-$10 becomes category 1, $11-$20
category 2, etc.).
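The bucketing example can be written as a lookup against a sorted list of upper bounds. The bounds below extend the two categories given in the text with a third, which is an assumption, as is the `bucket` function name.

```python
from bisect import bisect_left
from typing import Sequence


def bucket(value: float, upper_bounds: Sequence[float] = (10, 20, 30)) -> int:
    """Return a 1-based category number for `value`.

    With the default bounds, values up to 10 fall in category 1 and values
    above 10 and up to 20 fall in category 2. Values above the last bound
    fall into one extra top category.
    """
    return bisect_left(upper_bounds, value) + 1


prices = [0, 10, 11, 20, 25]
categories = [bucket(p) for p in prices]
```

Applying `bucket` to every value of a series yields a categorical series with an inherited, meaningful ordering, unlike the arbitrary orderings discussed later in this section.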
Examples of categorical time series data include log files containing time stamped
events [44, 64, 106]. For example, web log file entries contain a timestamp, along with
the page that was referenced, the browser that was used, and other related informa-
tion [64]. A query tool for these categorical data sets would allow users to identify
patterns of sequences of actions. For example, systems administrators might be inter-
ested in knowing which users tried to execute an “su” command to gain root privileges
immediately after logging in to the system.
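A query of this kind amounts to finding a specific pair of consecutive actions per user. The sketch below assumes a hypothetical log format of (timestamp, user, action) tuples and interprets "immediately after" as "the very next recorded action by the same user"; both are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, str, str]  # hypothetical (timestamp, user, action) record


def users_su_after_login(events: Iterable[Event]) -> Set[str]:
    """Return users whose next action after a login was an 'su' command."""
    last_action: Dict[str, str] = {}
    flagged: Set[str] = set()
    for _, user, action in sorted(events):  # tuples sort by timestamp first
        if action == "su" and last_action.get(user) == "login":
            flagged.add(user)
        last_action[user] = action
    return flagged


log = [
    (1, "alice", "login"),
    (2, "bob", "login"),
    (3, "alice", "su"),
    (4, "bob", "ls"),
    (5, "bob", "su"),
]
suspects = users_su_after_login(log)
```

In this example only "alice" is flagged, since "bob" ran another command between logging in and running "su".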
The first step in extending timeboxes to handle categorical data would be to estab-
lish some sort of linear order on the categories for any given data set. Although some
data sets may have a natural ordering of categories, other data sets might require the
imposition of a potentially arbitrary ordering. This linear order would be used to define
the y axis of the query space, just as the range of measured values defines the y axis in
the current model. Since boundaries between categories are discretely defined, shading
or other visual cues might be used to differentiate between the categories. Timeboxes
could be drawn to include only one category, or multiple adjacent categories (Figure
10.2). When appropriate, hierarchical data sets might be displayed along with tools
for selecting the level in the hierarchy that would be displayed [119].
Figure 10.2: Sketch of a potential design for categorical timeboxes. For a data set
involving web log records for multiple hosts, this interface might be used to find sites
that had large numbers of visitors from “.com” hosts in September and October,
followed by large numbers of “.org” visitors in December and January.
As a straightforward extension of the timebox model, this approach is appealing.
However, two significant differences between continuous and categorical time series
data sets present potential problems:
• Continuous time series data is based on a meaningful ordering of values that
may not be present in categorical data sets. A given set of n categories will have
n! possible orderings. The choice of ordering may be somewhat arbitrary, with
different orderings producing different visual patterns in the data sets [106].
• The type and magnitude of changes from one time period to the next have meaning
in continuous data sets that they may not have in categorical data sets. For standard
time series data, it is often the case that the value at time $t_i$ is more closely
related to the value at time $t_{i+1}$ than it is to the value at time $t_{i+10}$.
Therefore, a timebox of a limited height can be used
to define a range of variability, perhaps filtering out changes that are too small
to be of interest. When orderings on categories are not based on some internal
ordinal, ratio, or interval scale, the height of a timebox may just be a function
of the arbitrary ordering of the categories. In other cases, natural orderings of
the categories may not be useful for query construction. For example, web site
accesses might be ordered by alphabetizing URLs, but the utility of queries
involving lexicographically similar URLs might be limited [64].
Figure 10.3: A categorical timebox query looking for sites that had large numbers of
“.org” or “.edu” visitors during December and January.
Further revisions to the timebox query model may be needed to address these is-
sues. For example, timeboxes might be constrained vertically to occupy only one
category, with multiple, vertically-aligned timeboxes indicating a disjunction (Figure
10.3). This would eliminate problems that might be caused by timeboxes with heights
that spanned multiple categories, but the potential difficulties associated with deriving
orderings of the categories would remain. Additional analyses of specific data sets and
user tasks will be needed to guide appropriate designs.
10.3.2 Temporal Data
Temporal data sets involve events with arbitrary, finite duration. The timing of these
events can have a variety of complex relationships. For example, event A can precede
event B (A ends before B starts), follow B (A starts after B ends), or occur during B (A
starts after B starts and ends before B ends). Temporal relationships between events
have been characterized [9, 51], and a large body of research in temporal databases
and temporal query languages has provided numerous proposals for efficient storage
and indexing of these data sets [33, 71, 126].
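The three relationships named above can be expressed as comparisons on interval endpoints. The sketch below covers only those three cases and lumps every other configuration (overlaps, shared endpoints, and so on) into "other"; the function name and the (start, end) tuple layout are assumptions.

```python
from typing import Tuple

Interval = Tuple[float, float]  # hypothetical (start, end) of an event


def relation(a: Interval, b: Interval) -> str:
    """Classify how event A relates to event B in time."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "precedes"   # A ends before B starts
    if a_start > b_end:
        return "follows"    # A starts after B ends
    if a_start > b_start and a_end < b_end:
        return "during"     # A starts after B starts and ends before B ends
    return "other"
```

A fuller treatment would enumerate the remaining relationships characterized in the literature cited above.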
As with categorical time series data, the line between temporal data and time se-
ries data may be blurred. For example, a time series data set tracking an individual’s
body temperature can easily be converted to a temporal database, with all consecutive
readings of greater than 98.6 degrees F described as “Fever” events.
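The conversion described in this example can be sketched as a single pass that groups consecutive above-threshold readings into events. The function name and the use of reading indices as time stamps are assumptions.

```python
from typing import List, Sequence, Tuple


def fever_events(
    readings: Sequence[float],
    threshold: float = 98.6,
) -> List[Tuple[int, int]]:
    """Return (first index, last index) for each run of consecutive
    readings strictly greater than the threshold."""
    events = []
    start = None
    for t, temp in enumerate(readings):
        if temp > threshold:
            if start is None:
                start = t          # a new event begins here
        elif start is not None:
            events.append((start, t - 1))
            start = None
    if start is not None:          # an event still open at the end of the data
        events.append((start, len(readings) - 1))
    return events


temps = [98.2, 99.1, 100.4, 98.6, 98.0, 101.0]
events = fever_events(temps)
```

Each resulting interval corresponds to one "Fever" event, which could then be drawn as a rectangle in a temporal display.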
Notable visualizations of temporal data have addressed medical records [102, 99].
Rectangular regions describe events, with the start and end points determined by the
left and right ends of the rectangle, respectively. Different types of events might be
treated as categories, which can be ordered vertically (of course, orderings of
categories present challenges similar to those found with categorical time series data).
The temporal visual query language (TVQL), developed as part of MMVIS, presented a
dynamic query interface for temporal data. TVQL uses range sliders and other tra-
ditional widgets to support queries involving temporal relationships between two sets
of events (Section 2.1.4) [60, 61]. Extensions to the timebox model for temporal data
may have the potential to handle more complex queries involving relationships be-
tween multiple items.
Basic temporal constraints could be specified by constructing timeboxes with the
desired vertical alignments. The result would be a graphical notation similar to that
used in TVQL [60], with overlaps or adjacencies in the temporal extents of events
describing the desired relationships. Additions to expressive power might be both
complicated and desirable. For example, some tasks might require a general constraint
that A precedes B, while others might demand the more specific query that A precedes
B by a given duration $t_{a,b}$. Flexible intermingling of query components that are more or
less constrained may prove challenging. As these difficulties may be similar to those
found in allowing arbitrary relationships in time series queries (Chapter 9), similar
strategies might be used to address both sets of problems.
Temporal data sets also require additional work in terms of defining appropriate
data storage models and indices for searching. Recent work on temporal databases [71]
may be useful in this regard, but it is not yet known if these strategies are capable of
providing the performance needed for dynamic query applications. Further investiga-
tion will be needed to understand the range of temporal queries that can realistically
be processed in the 100ms response window that is needed.
Chapter 11
Conclusions
Despite the wide range of data sets and domains that make extensive use of time series
data, there has been relatively little work to date involving dynamic queries for spec-
ifying constraints on time series data sets. This dissertation uses the timebox query
model as the basis for an exploration of issues associated with interactive queries on
time series data. Specific contributions include:
• The definition of the timebox query model: The timebox query model refines
existing dynamic query widgets by allowing concurrent specification of multiple
constraints.
• The TimeSearcher application: TimeSearcher uses timeboxes, drag-and-drop
query-by-example and bookmark capabilities (“leaders & laggards”) to support
exploration of time series data sets. Implemented in Java, TimeSearcher uses
object-oriented design techniques that support subclassing, making it easy to
add new classes of queries.
• New query widgets for additional expressive power: Variable-time timeboxes
(VTTs) and angular queries build upon the basic timebox model to provide ad-
ditional expressive power. New interface widgets that provide dynamic query
functionality needed to support these models have been designed and imple-
mented.
• Validation through case studies: The utility of timeboxes and TimeSearcher has
been demonstrated by ongoing use in active research projects (Chapter 8). In
addition to confirming early intuitions regarding the utility of the tool, this
collaboration has led to numerous insights and design suggestions that otherwise
might not have been identified.
• Empirical Evaluation of timeboxes: Although more work remains to be done
in empirically characterizing the strengths of timeboxes as a query mechanism,
studies conducted thus far have led to an increased understanding of the strengths
of timeboxes (Chapter 7). Further studies will attempt to refine this understand-
ing, with the ultimate goal of generalizing results to apply to other 2D rectangu-
lar query widgets.
• Analysis of algorithmic expectations: Providing dynamic query performance
(100ms updates) for queries on large time series data sets requires fast process-
ing. Comparison of several alternative approaches led to some initially counter-
intuitive results: index structures provided inferior performance as compared to
non-indexed data. Further examination of the problem led to the observation
that structures that index each time series as a whole will be needed for efficient
evaluation of full-match queries.
• Framework for extending the query model: The timebox query model is a start-
ing point. Chapter 9 describes a subset of the possible extensions to timeboxes
that might be used to provide various increases in the expressive power of the
query language. Further work in this area will be needed to identify the potential
extensions that are interesting and relevant to user tasks as well as realistically
achievable.
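To make the query model and the non-indexed evaluation strategy concrete, the following Python sketch applies the timebox definition (values between y1 and y2 at all times from x1 to x2) with a simple sequential scan. It is a minimal sketch under stated assumptions, not TimeSearcher's Java implementation: each series is assumed to be a list indexed by time point, multiple timeboxes are assumed to combine as a conjunction, and the names Timebox, matches_timebox, and sequential_scan are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (x1, x2, y1, y2): inclusive time-index range and inclusive value range.
Timebox = Tuple[int, int, float, float]


def matches_timebox(series: Sequence[float], box: Timebox) -> bool:
    """True if every value at time indices x1..x2 lies within [y1, y2]."""
    x1, x2, y1, y2 = box
    if not (0 <= x1 <= x2 < len(series)):
        raise ValueError("timebox lies outside the series")
    return all(y1 <= value <= y2 for value in series[x1:x2 + 1])


def sequential_scan(data: Dict[str, Sequence[float]],
                    boxes: Iterable[Timebox]) -> List[str]:
    """Visit every series once, keeping items that satisfy all timeboxes."""
    boxes = list(boxes)
    return [name for name, series in data.items()
            if all(matches_timebox(series, box) for box in boxes)]
```

Each query costs time proportional to the number of items times the total width of the timeboxes, with no index to build or maintain.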
This work has led to a wide range of possibilities for future work (Chapter 10).
Extending the query language, implementing new classes of queries, and examination
of new algorithmic techniques are just a few of the challenging and interesting areas
that may be suitable for closer examination.
As a tool designed for motivated experts examining real data, TimeSearcher has
benefited from the design suggestions and feedback provided by those users.
Future research involving the timebox model and the TimeSearcher tool
should continue in this vein.
Appendix A
A Sample TimeSearcher Data File
The sample data file given below is a modified version of a data file based on yeast
microarray data [40]. As this data set contains only five items, it is shown primarily to
illustrate the file format. Two time-varying attributes are given for each item at each
time point - the log ratio and the absolute value of the log ratio. For each item, the first
value is the log ratio for the first time point, followed by the absolute value of the log
ratio for the first time point. This pattern then repeats for time points 2-7.
#title
Yeast MicroArray Data
# static attributes
Gene,String
#Dynamic Atts
LogRatio,Float;AbsLogRatio,Float
# of time points
7
# of records ... ???
5
#time point labels
9,11,13,15,17,19,21
#vals
YAL003W,0.072,0.072,-0.004,0.004,-0.018,0.018,-0.19,0.19,-0.28,
0.28,-0.46,0.46,-0.72,0.72
YAL010C,-0.37,0.37,-0.032,0.032,0.013,0.013,-0.27,0.27,-0.28,
0.28,0.11,0.11,-0.06,0.06
YAL016W,0.045,0.045,0.021,0.021,0.041,0.041,-0.022,0.022,-0.051,
0.051,0.13,0.13,-0.027,0.027
YAL021C,0.045,0.045,0.021,0.021,0.041,0.041,-0.10,0.10,-0.066,
0.066,0.13,0.13,-0.041,0.041
YAL026C,0.053,0.053,0.182,0.182,0.140,0.140,0.23,0.23,0.22,0.22,
0.46,0.46,0.22,0.22
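The layout above can be read with a short Python sketch. This is only an illustration inferred from the sample, not an official specification of the TimeSearcher format. It assumes that lines beginning with "#" are section labels, that attribute specifications are "name,type" pairs separated by ";", and that a record may wrap across lines as it does in the listing. The function name parse_timesearcher and the structure of its return value are hypothetical.

```python
from typing import Dict, List


def _names(spec: str) -> List[str]:
    """Extract attribute names from 'name,type;name,type' (assumed layout)."""
    return [field.split(",")[0].strip() for field in spec.split(";")]


def parse_timesearcher(text: str) -> Dict:
    """Parse a data file laid out like the sample (a sketch, not a spec)."""
    # Drop blank lines and '#' section labels; the remaining lines are
    # assumed to appear in the same order as in the sample.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    title, static_spec, dynamic_spec, n_points, n_records, labels = lines[:6]

    static_names = _names(static_spec)
    dynamic_names = _names(dynamic_spec)
    n_points, n_records = int(n_points), int(n_records)

    # Re-join wrapped record lines, then split into individual values.
    tokens = [t.strip() for t in ",".join(lines[6:]).split(",") if t.strip()]
    width = len(static_names) + n_points * len(dynamic_names)
    if len(tokens) != n_records * width:
        raise ValueError("value count does not match the declared sizes")

    records = []
    for i in range(n_records):
        row = tokens[i * width:(i + 1) * width]
        static = dict(zip(static_names, row[:len(static_names)]))
        values = [float(v) for v in row[len(static_names):]]
        # Dynamic attributes are interleaved per time point: a1, b1, a2, b2, ...
        series = {name: values[k::len(dynamic_names)]
                  for k, name in enumerate(dynamic_names)}
        records.append({"static": static, "series": series})

    return {"title": title,
            "time_labels": [s.strip() for s in labels.split(",")],
            "records": records}
```

Checking the value count against the declared number of records and time points catches malformed rows early, for example a value pair fused by a stray period instead of a comma.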
Appendix B
Study Materials for Evaluation of Input Mechanisms
for Questions of Varying Complexity
B.1 Exploratory Task
Find three stocks that are interesting or different.
This task was repeated for each of the three interfaces.
B.2 Training Questions
1. How many stocks had prices between $55 and $87 during days 1-3?
2. How many stocks had prices between $61 and $119 during days 24-30?
3. How many stocks had prices between $109 and $134 during days 1-8?
4. During days 13-17, are there more stocks between $0-$35, $20-$55, or $40-$75?
5. Which interval has more stocks priced between $78 and $94: days 4-9, 12-17,
or 20-25?
6. During days 1-4, which price range has the most stocks: $90-$116, $80-$106,
or $70-$96?
The first three training questions are low complexity, and the remaining are
medium complexity.
B.3 Experimental Questions
1. How many stocks had prices between $36 and $91 during days 18-19?
2. How many stocks had prices between $67 and $114 during days 27-30?
3. How many stocks had prices between $53 and $108 during days 28-30?
4. How many stocks had prices between $83 and $114 during days 19-21?
5. How many stocks had prices between $0 and $29 during days 13-18?
6. How many stocks had prices between $75 and $120 during days 7-8?
7. Which price range has the most stocks during days 29-30: $50-$75, $75-$100,
or $68-$93?
8. Which time period has the most stocks in the range $87-$124: 1-6, 11-16, or
17-22?
9. During days 22-23, are there more stocks between $69-$119, $59-$109, or
$49-$99?
10. Which interval has the most stocks priced between $10 and $35: days 15-20,
21-25, or 26-30?
11. Which price range has the most stocks during days 1-4: $30-$50, $40-$60, or
$50-$70?
12. Which time period has the most stocks in the range $43-$113: 1-8, 14-21, or
23-30?
13. Which price range had the most stocks during days 13-15: $12-$35, $17-$42,
$22-$47, $27-$52, or $32-$55?
14. Which intervals have more stocks priced between $60 and $80: days 11-14,
15-18, 19-22, 23-26, or 27-30?
15. Which price range had the most stocks during days 1-7: $60-$120, $50-$110,
$40-$100, $30-$90, or $20-$80?
16. Which days have the most stocks with prices between $0 and $50: 1-3, 6-8,
11-13, 16-18, 21-23, or 26-28?
17. Which price range has the most stocks during days 14-20: $10-$40, $20-$50,
$30-$60, $40-$70, or $50-$80?
18. Which days have the most stocks with prices between $50 and $100: 2-10, 4-12,
6-14, 8-16, or 10-18?
Questions 1-6 are low complexity, 7-12 are medium complexity, and 13-18 are
high complexity. Each group contains 6 questions - 2 repetitions for each of
three interfaces.
B.4 User Interface Satisfaction Questionnaire
Please circle the numbers which most appropriately reflect your impressions about
using this computer system.
Not Applicable = NA.
1. Overall reactions to the form fill-in interface
(a) (1=terrible,9=wonderful) 1 2 3 4 5 6 7 8 9 NA
(b) (1=frustrating,9=satisfying) 1 2 3 4 5 6 7 8 9 NA
(c) (1=difficult,9=easy) 1 2 3 4 5 6 7 8 9 NA
(d) (1=rigid,9=flexible) 1 2 3 4 5 6 7 8 9 NA
2. Overall reactions to the range slider interface
(a) (1=terrible,9=wonderful) 1 2 3 4 5 6 7 8 9 NA
(b) (1=frustrating,9=satisfying) 1 2 3 4 5 6 7 8 9 NA
(c) (1=difficult,9=easy) 1 2 3 4 5 6 7 8 9 NA
(d) (1=rigid,9=flexible) 1 2 3 4 5 6 7 8 9 NA
3. Overall reactions to the direct manipulation interface
(a) (1=terrible,9=wonderful) 1 2 3 4 5 6 7 8 9 NA
(b) (1=frustrating,9=satisfying) 1 2 3 4 5 6 7 8 9 NA
(c) (1=difficult,9=easy) 1 2 3 4 5 6 7 8 9 NA
(d) (1=rigid,9=flexible) 1 2 3 4 5 6 7 8 9 NA
4. Which interface did you prefer for the defined tasks (first set)?
Form fill-in    Range Slider    Direct Manipulation
5. Which interface did you prefer for the exploratory tasks (second set)?
Form fill-in    Range Slider    Direct Manipulation
6. Do you have any further comments or suggestions?
Appendix C
Empirical Evaluation of Multiple-Constraint Query
Formation
This study was designed to evaluate the use of timeboxes for searching for complex
patterns involving multiple constraints. Due to difficulties with participant comprehen-
sion of the study tasks, this study was modified after four individuals had participated,
and terminated after eight subjects. The study procedures, tasks, and preliminary
results are presented here for the sake of completeness.
C.1 Design
The second study used variations in the number of query clauses required to complete a
task to address another source of complexity. Tasks in this case involved identification
of one item that matched a given set of criteria. When multiple items satisfied the
criteria, any one of those matches was considered to be a correct item.
The three complexity levels were defined in terms of the number of clauses re-
quired to answer the task. Low complexity tasks required two clauses while medium
complexity tasks required three and high complexity tasks needed four. For example:
1. Low complexity: Find a stock that had prices during days 5-7 that were lower
than its prices during days 1-3.
2. Medium complexity: Find a stock whose price decreased from days 11-13 to
days 15-17 and again from days 15-17 to days 19-21.
3. High complexity: Find stocks that increased gradually, over days 13-15,
17-19, 21-23, and 25-27, such that the prices in each interval are generally
higher than the previous interval.
The complete set of tasks is given in Section C.4.
These tasks are less well-formed than the tasks used in the first study (Section 7.1).
Specifically, these tasks ask participants to find items that had certain trends in values
over specific dates, but all values are specified relative to each other. These tasks ask
users to identify items that follow general trends (“prices during days 1-3 that were
lower than prices during days 11-13”).
This increased flexibility may in some cases cause some difficulties. The reduced
specificity of the task statements may lead to ambiguity that might confuse users and
increase task times. To avoid difficulties with ambiguity, the study administrator ac-
cepted approximate answers as being correct and - when necessary - told participants
when their answers were close enough to be acceptable.
This study used a modified version of the tsexp interface described in Section 7.1.1.
The version of tsexp had two major differences from the implementation used in the
first study (Section 7.1). First, as this study involved complex patterns that required
multiple constraints, the tsexp interface was revised to support multiple query condi-
tions.
This study also used definitions of start and stop times, and data sets, similar to
those used in the study that compared input and output (Section 7.2).
Synthetic data sets containing randomly-generated values were used for this study.
The data sets included 30 time points for each of 200 items. For the exploratory tasks,
the data sets from the first study (Section 7.1) were used.
The criteria for correct task completion were also relaxed somewhat relative to the
previous study. Users were told that an exactly correct answer was not necessary, and
that the administrator of the study would indicate when they reached an acceptable
answer. Every attempt was made to accept queries that addressed the spirit of the
question at hand. This change was made in an attempt to avoid spending significant
amounts of time making fine adjustments, and to more closely approximate the use of
TimeSearcher for exploratory tasks involving approximate queries.
This study was initially designed to have 18 graduate and undergraduate students
from the University of Maryland’s Computer Science department as participants. Pilot
tests with 3 subjects were used to fine-tune the study content. In particular, pilot par-
ticipants found the phrasing of the tasks to be challenging. Despite attempts at revising
the wording of the tasks, many participants had difficulty interpreting the tasks, and
sessions took much longer than anticipated (as long as 2 hours, as opposed to the goal
of 1 hour).
Due to these difficulties, the study was shortened from 2 repetitions of each in-
terface/complexity combination (18 tasks total) to 1 repetition (9 tasks total) after the
fourth participant. Although this shortened the session to be closer to the goal of one
hour, it did not resolve the comprehension difficulties. As a result, the study was ter-
minated after eight participants.
Results are presented below for these eight participants. For the first four
participants, only the nine questions that were completed by all participants were
included in the analysis. The small sample size and the comprehension difficulties
experienced by participants limit the generalizability of the results.
Figure C.1: Average completion time for well-defined tasks. (Chart: Average Task
Completion Time (ms) by Complexity (Low, Medium, High) for the Form Fill-in,
Range Slider, and Timebox interfaces.)
C.2 Results
Results from the well-defined tasks are given in Figure C.1. The increase in task
completion time with complexity was significant (repeated measures analysis of
variance (RMANOVA), F(2,67) = 7.39, p < .01), but there were no significant
differences between the three interfaces (F(2,67) = 0.99, p > .05). There was no
interaction effect (F(4,63) = .73, p = .57).
Results for the exploratory tasks are given in Figures C.2 and C.3. There were no
significant differences, either in the number of items correctly identified (RMANOVA,
F(2,21) = .5, p = .61) or in the task completion time (F(2,21) = .63, p = .54).
Figure C.2: Number of items correctly identified in exploratory task. (Chart: Number
of Items Correctly Identified by Interface: Form Fill-in, Range Slider, Timebox.)
Seven of the eight participants completed the subjective questionnaire. Results
are given in Figure C.4. Users showed a general preference towards timeboxes, with
significant differences in preference levels for the Difficult/Easy question (ANOVA,
F(2,18) = 3.69, p < .05), and the Rigid/Flexible question (F(2,18) = 10.7, p < .01).
Significance levels were marginal for the Terrible/Wonderful (F(2,18) = 3.28, p =
.06) and Frustrating/Satisfying (F(2,18) = 3.43, p = 0.05) ratings.
The slight preference for timeboxes over the other input mechanisms was con-
firmed when users were asked to select the interface that they preferred for each type
of task. For the well-defined tasks, four users preferred timeboxes, three preferred
form fill-in, and one preferred range sliders. Preferences for the exploratory tasks
were much clearer, with 6 users preferring timeboxes and 1 each preferring form fill-in
and range sliders (Table C.1).
Figure C.3: Average task completion time for exploratory tasks. (Chart: Average Task
Completion Time (ms) by Interface: Form Fill-in, Range Slider, Timebox.)
               Form Fill-in   Range Slider   Timebox
Well-Defined        3              1            4
Exploratory         1              1            6
Table C.1: User preferences by interface for the different task types.
C.3 Discussion
This study was plagued by design flaws that were not apparent until several additional
subjects completed the protocol. The primary difficulty was in the wording of the
questions. Many subjects had significant difficulties in interpreting the phrasing of
the tasks. When faced with tasks asking for stocks that “decreased from days 11-13
to days 15-17, and again from days 15-17 to days 19-21” (for example), participants
often had trouble determining the direction of changes required. Some drew a series
of arrows or boxes to illustrate the required directions of changes between each time
interval.
Figure C.4: Average subjective satisfaction ratings (1-9, 9 is best). n = 7 (Chart:
Average Subjective Rating for the Terrible/Wonderful, Frustrating/Satisfying,
Difficult/Easy, and Rigid/Flexible questions, for the Form Fill-in, Range Slider, and
Timebox interfaces.)
Many of the tasks asked users to treat a stock’s price during an interval of several
days as a single chunk (“decreased from days 11-13 to days 15-17” being two chunks).
Once this interpretation was explained, participants did not appear to have significant
difficulties with it.
These difficulties in interpretation were apparent throughout the study. Participants often
had to repeat training tasks, and the administrator of the study frequently performed
the first one or two training tasks for the participants, showing them how the questions
should be interpreted and answered. Even after these repeats, users often had difficul-
ties that were clearly attributable to interpretation of the question (as opposed to use of
the interface). For example, participants frequently inverted the transitions requested,
finding (for example), a pattern of decrease-increase-decrease when the task required
the opposite pattern of increase-decrease-increase.
These difficulties led to a session that was significantly longer than intended. Re-
ducing the study to contain only one repetition for each of the nine task types did not
eliminate comprehension difficulties, so the study was terminated after four additional
participants.
Other aspects of the study design might have been similarly problematic. Many of
the tasks asked users to find stocks that increased and/or decreased in price from one
interval to the next. Task completion times for these questions may have been sensitive
to initial conditions. Specifically, if a participant was fortunate enough to create the
first two terms of a query in a manner that met the first constraint, subsequent terms
and constraints would be relatively easy to satisfy. On the other hand, if the user’s first
query was placed in a region with relatively little data, they may have had more trouble
satisfying the terms of the task.
The flexibility provided to study participants may have confused matters further.
Users were told to find items that were “close” to the parameters specified in each
task, without being told how close they needed to be. As a result, they had to ask the
administrator of the study for clarification, which required a judgment call that may
not have been made consistently.
Several aspects of the user interaction with the specific interfaces seemed notable.
When subjects appeared to understand the tasks, they had relatively little trouble with
the interfaces or other aspects of the study. Some users found that it took time to learn
to use timeboxes. Once these users were comfortable using timeboxes, they often
made positive comments, saying that timeboxes were “nice once I got the feel of it.”
As in the first study (Section 7.1), subjects clearly had difficulty with range sliders
and timeboxes when the ranges covered were relatively small. Similarly, some users
had trouble after creating queries that produced zero hits - some form of the data en-
velope or other overview might have helped with this difficulty. There were a few
instances of confusion between the interfaces. Specifically, some users clicked on a
box associated with a range slider as if it were a timebox.
In terms of user interaction, the primary difference between this study and the
first study is in the presence of multiple timeboxes which could be deleted or lassoed
and dragged for simultaneous modification. However, very few users deleted query
clauses, and users often failed to understand the idea of moving multiple timeboxes
at once. It is not clear if this was due to difficulties in understanding the interface,
insufficient training, or a combination of both factors.
One study participant made two concrete design suggestions that merit consider-
ation for inclusion in future versions of TimeSearcher. Noting the difficulty involved
in interpreting the impact of a single query clause, this subject suggested a “disable”
feature that would temporarily remove a timebox clause from a query. The timebox
would still be displayed in some altered manner, but the displayed result set would
not include the constraints associated with that box. This would provide the user with
a tool that could be used to determine if a given timebox was useful for meeting the
user’s search goal.
The other suggestion was for a feature that would link boxes so that the top of
one timebox is aligned with the bottom of another timebox. This would support searches
involving transitions between two value ranges, where the second is defined as being
greater than (or less than) the first.
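Both suggestions can be sketched in a few lines of Python. This is a hypothetical illustration rather than a description of any existing or planned TimeSearcher feature: the Timebox class, its enabled flag, and the satisfies and linked_above functions are assumed names, and the link is shown only at creation time rather than being maintained while boxes are dragged.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Timebox:
    x1: int        # first time index (inclusive)
    x2: int        # last time index (inclusive)
    y1: float      # bottom edge of the value range
    y2: float      # top edge of the value range
    enabled: bool = True  # hypothetical flag for the suggested "disable" feature

    def matches(self, series: Sequence[float]) -> bool:
        """True if every value at time indices x1..x2 lies within [y1, y2]."""
        return all(self.y1 <= v <= self.y2
                   for v in series[self.x1:self.x2 + 1])


def satisfies(series: Sequence[float], boxes: Iterable[Timebox]) -> bool:
    """Conjunction over enabled timeboxes only; disabled boxes are ignored."""
    return all(box.matches(series) for box in boxes if box.enabled)


def linked_above(lower: Timebox, x1: int, x2: int, height: float) -> Timebox:
    """Create a timebox whose bottom edge sits on the top edge of `lower`."""
    return Timebox(x1, x2, lower.y2, lower.y2 + height)
```

Because both ranges are inclusive, a value exactly on the shared edge satisfies either box. Toggling enabled on a single box and re-running satisfies shows how much that clause contributes to the result set.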
C.4 Study Materials
C.4.1 Exploratory Task
Find three stocks that are interesting or different.
This task was repeated for each of the three interfaces.
C.4.2 Training Questions
1. Find a stock that had higher prices on days 10-12 than on days 19-21.
2. Find a stock whose prices during days 12-15 were lower than its prices during
days 20-23.
3. Find a stock whose prices on days 16-18 were higher than its prices on days
24-26.
4. Find a stock that had higher prices during days 2-6 than it did during days 8-12
and days 16-20.
5. Find a stock that decreased from days 1-3 to days 7-9, and then increased
to higher values during days 13-15.
6. Find a stock whose price was low during days 5-8, increased to higher values
during days 15-17, and then decreased to a lower range during days 23-25.
The first three training questions are low complexity, and the remaining are
medium complexity.
C.4.3 Experimental Questions
1. Find a stock that had prices during days 5-7 that were lower than its prices during
days 1-3.
2. Find a stock whose prices during days 5-9 were higher than its prices during days
13-17.
3. Find a stock that had prices during days 4-7 that were close to its prices on days
12-15.
4. Find a stock that had prices on days 11-16 that were close to its prices on days
20-25.
5. Find a stock with prices during days 1-5 that are lower than its prices during
days 26-30.
6. Find a stock that had higher prices on days 15-19 than on days 23-27.
7. Find a stock whose price decreased from days 11-13 to days 15-17 and again
from days 15-17 to days 19-21.
8. Find a stock whose price increased from days 6-8 to days 10-12 and then de-
creased from days 10-12 to days 14-16.
9. Find a stock that had higher values during days 13-17 than it did during days 1-5
and days 26-30.
10. Find a stock that increased from days 13-16 to days 19-22, and then decreased
to lower values during days 25-28.
11. Find a stock that had increases from days 10-12 to days 15-17, and from days
15-17 to days 20-22.
12. Find a stock that had lower values during days 14-17 than it did during days 8-11
and days 20-23.
13. Find a stock whose price during days 5-8 was close to its price during days
10-13, 16-19, and 22-25.
14. Find a stock whose price was high during days 1-3 and decreased to successively
lower values during days 5-7, 9-11, and 13-15.
15. Find a stock whose price increased from days 7-8 to days 10-11, decreased to
lower values during days 13-14, and then increased to higher values during days
16-17.
16. Find stocks that increased gradually, over days 13-15, 17-19, 21-23, and 25-27,
such that the prices in each interval are generally higher than the previous
interval.
17. Find a stock whose price decreased from days 3-5 to days 7-9, increased to a
higher range during days 11-13, and then decreased to lower values
during days 15-17.
18. Find a stock whose price decreased from days 15-17 to days 19-21, decreased
again to a new low during days 23-25, and then increased to higher values during
days 27-29.
Questions 1-6 are low complexity, 7-12 are medium complexity, and 13-18 are
high complexity. Each group contains six questions - two repetitions for each of three
interfaces.
After the fourth subject, one repetition was eliminated for each of the three
complexity levels. Questions 4-6, 10-12, and 16-18 were eliminated, leaving nine
questions with one repetition of each interface/complexity combination.
Appendix D
Study Materials for Empirical Evaluation of Input and
Output
D.1 Training Questions
1. Find an item that has a price between $30 and $60 for months 4-7
2. Find an item that trades in a $20 range for at least three consecutive time periods.
D.2 Experimental Questions
1. Find an item that starts low and ends high: its prices during all of the last five
time points should be at least $40 more than the highest prices that it reaches during
the first 5 time periods.
2. Find an item that trades in a $25 range for at least four consecutive
measurements and then has a rise in price of at least $35.
BIBLIOGRAPHY
[1] J. Aach and G. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495–508, 2001.
[2] J. Accot and S. Zhai. Beyond Fitts’ law: Models for trajectory-based HCI tasks. In Proceedings of the 1997 Conference Human Factors in Computing Systems, pages 295–302, Atlanta GA, April 1997. ACM Press.
[3] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings, Foundations of Data Organization and Algorithms, 4th International Conference, FODO’93, Chicago, Illinois, USA, October 13-15, 1993. Lecture Notes in Computer Science, Vol. 730, pages 69–84, Berlin, 1993. Springer-Verlag.
[4] R. Agrawal, K. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In The VLDB Journal, pages 490–501, 1995.
[5] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proceedings of the 21st International Conference on Very Large Databases, pages 502–514, 1995.
[6] R. Agrawal and R. Srikant. Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings 11th International Conference on Data Engineering, ICDE, pages 3–14, Taipei, Taiwan, March 1995. IEEE Press.
[7] C. Ahlberg and B. Shneiderman. Visual information seeking: Tight coupling of dynamic query filters with starfield displays. In Proceedings of the 1994 Conference on Human Factors in Computing Systems, pages 313–317, Boston MA, April 1994. ACM Press.
[8] C. Ahlberg, C. Williamson, and B. Shneiderman. Dynamic queries for information exploration: An implementation and evaluation. In Proceedings of the 1992 Conference on Human Factors in Computer Systems, pages 619–626, Monterey, CA, May 3-7 1992. ACM Press.
[9] J. F. Allen. Maintaining Knowledge about Temporal Intervals. Communications of the ACM, 26(11):832–843, 1983.
[10] E. H. Baehrecke, N. Dang, K. Barbaria, and B. Shneiderman. Visualization and analysis of microarray and gene ontology data with treemap. In preparation, 2003.
[11] E.H. Baehrecke. Steroid regulation of programmed cell death during Drosophila development. Cell Death and Differentiation, 7:1057–1062, 2000.
[12] E.H. Baehrecke. How death shapes life during development. Nature Reviews Molecular Cell Biology, 3:779–787, October 2002.
[13] E.H. Baehrecke. Personal Communication, 2003.
[14] Z. Bar-Joseph, G. Gerber, D. Gifford, and T. Jaakola. A new approach to analyzing gene expression time series data. In Proc. Sixth Annual International Conference on Research in Computational Molecular Biology, pages 39–48, 2002.
[15] R. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger. The R∗-tree: an efficient and robust access method for points and rectangles. ACM Sigmod, pages 322–331, May 1990.
[16] B. Bederson, J. Grosjean, and J. Meyer. Toolkit design for interactive structured graphic. Technical Report HCIL-2003-01, CS-TR-4432, and UMIACS-TR-2003-03, University of Maryland, Human-Computer Interaction Lab, Department of Computer Science, and Institute for Advanced Computer Studies, 2003.
[17] B. Bederson, J. Meyer, and L. Good. Jazz: An extensible zoomable user interface graphics toolkit in java. In ACM Symposium on User Interface Software and Technology, pages 171–180, San Diego CA, November 2000. ACM Press.
[18] B. Bederson, B. Shneiderman, and M. Wattenberg. Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies. ACM Transactions on Computer Graphics, 21(4):833–854, October 2002.
[19] D. J. Berndt and J. Clifford. Finding patterns in time series: A dynamic programming approach. In Advances in Knowledge Discovery and Data Mining, pages 229–248. AAAI Press/MIT Press, 1996.
[20] C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21(1):32–38, 1998.
[21] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In C. Beeri and P. Buneman, editors, 7th International Conference on Database Theory (ICDT ’99), number 1540 in Lecture Notes in Computer Science, pages 218–236, Jerusalem, Israel, January 1999. Springer-Verlag.
[22] S. Blackburn. Content Based Retrieval and Navigation Using Melodic Pitch Contours. PhD thesis, University of Southampton, 2000. http://www.ecs.soton.ac.uk/ sgb97r/phdthesis.pdf.
[23] C. Bonhomme, C. Trepied, M.A. Aufaure, and R. Laurini. A visual language for querying spatio-temporal databases. In Proceedings of the 7th International Symposium on Advances in Geographic Information Systems, pages 34–39, Kansas City MO, November 1999. ACM Press.
[24] E. Bradley. Time-series analysis. In M. Berhold and E. Hand, editors, Intelligent Data Analysis: An Introduction. Springer-Verlag, Berlin, 1999.
[25] I. Brewer, A.M. MacEachren, H. Abdo, J. Gundrum, and G. Otto. Collaborative geographic visualization: Enabling shared understanding of environmental processes. In Proceedings, IEEE Symposium on Information Visualization, pages 137–144, Salt Lake City UT, October 2000.
[26] S. K. Card, J. D. Mackinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufman Publishers, San Francisco CA, 1999.
[27] J. V. Carlis and J. A. Konstan. Interactive visualization of serial periodic data. In ACM Symposium on User Interface Software and Technology, pages 29–38, San Francisco CA, November 1998. ACM Press.
[28] M.S.T. Carpendale, A. Fall, D. J. Cowperthwaite, J. Falland, and F. D. Fracchia. Case study: Visual access for landscape event based temporal data. In VIS ’96: Proceedings of the IEEE Conference on Visualization, pages 425–428, October 1996.
[29] K. Chan and W. Fu. Efficient time series matching by wavelets. In Proceedings 15th International Conference on Data Engineering ICDE, pages 126–133, Sydney Australia, March 1999.
[30] C. Chatfield. The Analysis of Time Series, an Introduction. Chapman and Hall, London, 1996.
[31] E. H. Chi, J. E. Pitkow, J. D. Mackinlay, P. Pirolli, R. Gossweiler, and S. K. Card. Visualizing the evolution of web ecologies. In Proceedings of the 1998 Conference Human Factors in Computing Systems, pages 400–407, Los Angeles CA, April 1998. ACM Press.
[32] J.P. Chin, V.A. Diehl, and K.L. Norman. Development of an instrument measuring user satisfaction of the human-computer interface. In Proceedings of the 1988 Conference on Human Factors in Computer Systems, pages 213–218. ACM Press, 1988.
[33] J. Chomicki. Temporal query languages: A survey. http://www.cse.buffalo.edu/ chomicki/papers-survey95.ps, 1995.
[34] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, October 23 1998.
[35] E. Clough, C.-Y. Lee, H. Hochheiser, B. Shneiderman, and E.H. Baehrecke. Temporal analyses of genome-wide transcription during steroid-triggered programmed cell death in Drosophila. In preparation, 2003.
[36] S. B. Cousins and M. G. Kahn. The visual display of temporal information. Artificial Intelligence in Medicine, 3(6):341–357, 1991.
[37] G. Das, K. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In Proceedings of the fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 16–22, New York NY, August 1998. AAAI Press.
[38] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, 2000.
[39] A. Del Bimbo, E. Vicario, and D. Zingoni. Symbolic description and visual querying of image sequences using spatio-temporal logic. IEEE Transactions on Knowledge and Data Engineering, 7(4):609–622, August 1995.
[40] J. DeRisi, V. Iyer, and P. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–686, 24 October 1997.
[41] M. Derthick and S.F. Roth. Data exploration across temporal contexts. In Proceedings of Intelligent User Interfaces 2000, pages 60–67, New Orleans LA, January 2000. ACM Press.
[42] K. Duca. Personal Communication, October 2002.
[43] W. K. Edwards, T. Igarashi, A. LaMarca, and E.D. Mynatt. A temporal model for multi-level undo and redo. In ACM Symposium on User Interface Software and Technology, pages 31–40, San Diego CA, November 2000. ACM Press.
[44] S. G. Eick and P. J. Lucas. Displaying trace files. Software Practice and Experience, 26(4):399–409, 1996.
[45] Stephen G. Eick and Graham J. Wills. Navigating large networks with hierarchies. In Proc. IEEE Conf. Visualization, pages 204–210, San Jose, CA, October 1993.
[46] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings, National Academies of Science, USA, 95:14863–14868, December 1998.
[47] M. Erwig and M. Schneider. Query-by-trace: Visual predicate specification in spatio-temporal databases. In Proceedings, 5th IFIP Conference on Visual Databases (VDB 5), pages 199–218, 2000.
[48] W.G. Fairbrother, R.F. Yeh, P.A. Sharp, and C.B. Burge. Predictive identification of exonic splicing enhancers in human genes. Science, 297:1007–1013, 9 August 2002.
[49] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 419–429, Minneapolis, Minnesota, May 1994. ACM Press.
[50] E. Freeman and D. Gelernter. Lifestreams: A storage model for personal data. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(1), March 1996.
[51] Christian Freksa. Temporal reasoning based on semi-intervals. Artificial Intelligence, 54(1):199–227, 1992.
[52] Y.-H. Fua, M. Ward, and E. Rundensteiner. Navigating hierarchies with structure-based brushes. In Proceedings, IEEE Symposium on Information Visualization, pages 58–64, San Diego, CA, October 24-29 1999. IEEE Press.
[53] L. Girardin and D. Brodbeck. Interactive visualization of prices and earnings around the globe. In Interactive Posters, IEEE Symposium on Information Visualization 2001, San Diego, CA, October 22-23 2001.
[54] S. Greene, E. Tanin, C. Plaisant, B. Shneiderman, L. Olsen, G. Major, and S. Johns. The end of zero-hit queries: Query previews for NASA’s global change master directory. International Journal on Digital Libraries, 2(2-3):79–90, 1999.
[55] H. Hamadeh and C. A. Afshari. Gene chips and functional genomics. American Scientist, pages 508–515, November/December 2000.
[56] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proceedings of the International Conference on Data Engineering, pages 106–115, Sydney Australia, March 1999.
[57] B. Harrison, R. Owen, and R. Baecker. Timelines: An interactive system for the collection and visualization of temporal data. In Proceedings of Graphics Interface ’94, pages 141–148, Toronto, 1994. Canadian Information Processing Society.
[58] H. Hauser, F. Ledermann, and H. Doleisch. Angular brushing of extended parallel coordinates. In Proceedings, IEEE Symposium on Information Visualization, Boston, MA, October 2002. IEEE Press.
[59] S. Havre, B. Hetzler, and L. Nowell. Themeriver: Visualizing theme changes over time. In Proceedings, IEEE Symposium on Information Visualization, pages 115–124, Salt Lake City UT, October 2000.
[60] S. Hibino and E. Rundensteiner. A visual multimedia query language for temporal analysis of video data. In K.C. Nwosu, B.M. Thuraisingham, and P.B. Berra, editors, Multimedia Database Systems: Design and Implementation Strategies, pages 123–159. Kluwer Academic Publishers, 1996.
[61] S. Hibino and E. Rundensteiner. User interface evaluation of a direct manipulation temporal visual query language. In Multimedia ’97, pages 99–107, Seattle WA, November 1997. Association for Computing Machinery.
[62] S. Hibino and E. Rundensteiner. Comparing MMVIS to a timeline for temporal trend analysis of video data. In Proceedings of Advanced Visual Interfaces 1998, pages 195–204. Association for Computing Machinery, May 1998.
[63] H. Hochheiser and B. Shneiderman. Range specifications for an interactive visual query tool for time series data. Unpublished Manuscript, University of Maryland, Department of Computer Science, March 2001.
[64] H. Hochheiser and B. Shneiderman. Using interactive visualizations of WWW log data to characterize access patterns and inform site design. Journal of the American Society for Information Science and Technology, 52(4):331–343, February 2001.
[65] N.S. Holter, N. Mitra, A. Maritan, M. Cieplak, J. Banavar, and N. Federoff. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. National Academy of Sciences USA, 97(15):8409–8414, 18 July 2000.
[66] Y. Huang and P.S. Yu. Adaptive query processing for time-series data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 282–286, San Diego CA, August 1999. ACM Press.
[67] A. Inselberg. Multidimensional detective. In Proceedings, IEEE Symposium on Information Visualization, pages 100–107, Phoenix AZ, October 1997.
[68] A. Inselberg and T. Avidan. Classification and visualization for high-dimensional data. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery in Data 2000, pages 370–374, Boston, MA, 2000. ACM Press.
[69] H. V. Jagadish, N. Koudas, and S. Muthukrishnan. Mining deviants in a time series database. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, Proceedings of VLDB’99, Proceedings of 25th International Conference on Very Large Data Bases, pages 102–113, Edinburgh Scotland, September 1999. Morgan Kaufmann.
[70] V. Jain and B. Shneiderman. Data structures for dynamic queries: an analytical and experimental evaluation. In Proc. of the Workshop in Advanced Visual Interfaces, AVI 94, pages 1–11, Bari, Italy, June 1-4 1994. ACM Press.
[71] C.S. Jensen and R.T. Snodgrass. Temporal data management. IEEE Transactions on Knowledge and Data Engineering, 11(1):36–43, January/February 1999.
[72] C. Jiang, E.H. Baehrecke, and C. Thummel. Steroid regulated programmed cell death during Drosophila metamorphosis. Development, 124:4673–4683, 1997.
[73] D. A. Keim. Pixel-oriented visualization techniques for exploring very large databases. Journal of Computational and Graphical Statistics, pages 58–77, March 1996.
[74] E. J. Keogh. Exact indexing of dynamic time warping. In Proc. VLDB 2002,pages 406–417, Hong Kong, China, 2002.
[75] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings SIGMOD 2001, pages 151–162, Santa Barbara CA, May 2001. ACM Press.
[76] E. J. Keogh, K. Chakrabarti, M.J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263–286, 2001.
[77] E. J. Keogh, H. Hochheiser, and B. Shneiderman. An augmented visual query mechanism for finding patterns in time series data. In Proc. Fifth International Conference on Flexible Query Answering Systems, Lecture Notes in Artificial Intelligence, pages 240–250, Copenhagen, Denmark, 27-29 October 2002. Springer-Verlag.
[78] E. J. Keogh and M. J. Pazzani. Relevance feedback retrieval of time series data. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR ’99, pages 183–190, Berkeley CA, August 1999. ACM.
[79] E. J. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Proceedings of the third conference on Knowledge Discovery in Databases and Data Mining (KDD-97), pages 24–30, Newport Beach CA, August 1997. AAAI Press.
[80] T. Koetter and M. Theus. Fortune - a system for interactive graphics for time series. http://www.vr-web.de/martin.theus/Fortune JCGS.pdf.
[81] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 289–300, Tucson AZ, May 1997. ACM Press.
[82] V. Kouramajian and M. Gertz. A graphical query language for temporal databases. In M.P. Papazoglou, editor, OOER ’95: Object-Oriented and Entity-Relationship Modeling, volume 1021 of Lecture Notes in Computer Science, pages 388–399. Springer-Verlag, Berlin, 1995.
[83] C.-Y. Lee, E. Clough, P. Yellon, T. Teslovich, D. Stephan, and E.H. Baehrecke. Genome-wide analyses of steroid- and radiation-triggered programmed cell death in Drosophila. Current Biology, 2003.
[84] C.-Y. Lee, D. Wendel, P. Reid, G. Lam, C. Thummel, and E.H. Baehrecke. E93 directs steroid-triggered programmed cell death in Drosophila. Molecular Cell, 6:433–443, August 2000.
[85] J. Lin, E. J. Keogh, S. Lonardi, and P. Patel. Finding motifs in time series. In Proc. SIGKDD ’02, pages 53–68, Edmonton, Alberta Canada, July 23-26 2002. ACM Press.
[86] L. Lin, T. Risch, M. Skold, and D. Badal. Indexing values of time sequences. In Proc. 5th International Conference on Information and Knowledge Management (CIKM ’96), pages 223–232, Rockville, Maryland, November 12-16 1996.
[87] J.B. Little and L. Rhodes. Understanding Wall Street. Liberty Publishing, Inc., Cockeysville MD, 1978.
[88] J. D. Mackinlay, G. G. Robertson, and R. DeLine. Developing calendar visualizers for the information visualizer. In ACM Symposium on User Interface Software and Technology, pages 109–118, New York, 1994. ACM Press.
[89] A. Martin and M. Ward. High dimensional brushing for interactive exploration of multivariate data. In Proceedings of the 6th IEEE Visualization Conference, pages 271–278, Atlanta, Georgia, October 29-November 3 1995. IEEE Press.
[90] J. P. Morrill. Distributed recognition of patterns in time series data. Communications of the ACM, 41(5):45–51, May 1998.
[91] S. Mount. Personal Communication, 2003.
[92] S. M. Mount, C. Burks, G. Hertz, G.D. Stormo, O. White, and C. Fields. Splicing signals in Drosophila: intron size, information content, and consensus sequences. Nucleic Acids Research, 20(16):4255–4262, 1992.
[93] A. Nanopoulos and Y. Manolopoulos. Indexing time-series databases for inverse queries. In G. Quirchmayr and Trevor J.M. Bench-Capon, editors, Proceedings 9th International Conference, Database and Expert Systems Applications (DEXA), volume 1460 of Lecture Notes in Computer Science, pages 551–560. Springer, August 24-28 1998.
[94] C. North and B. Shneiderman. Snap-together visualization: A user interface for coordinating visualizations via relational schemata. In ACM Advanced Visual Interfaces 2000, pages 128–135. ACM Press, 2000.
[95] C. North and B. Shneiderman. Snap-together visualization: Evaluating coordination usage and construction. International Journal of Human-Computer Studies, 53(5):715–739, November 2000.
[96] A. Oberweis and V. Sanger. GTL - A Graphical Language for Temporal Data. In Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, pages 22–31, Charlottesville VA, 1994. IEEE Computer Society Press.
[97] U. Ohler and H. Niemann. Identification and analysis of eukaryotic promoters: Recent computational approaches. Trends in Genetics, 17(2):56–60, February 2001.
[98] C. Perng, H. Wang, S. R. Zhang, and D. Stott Parker. Landmarks: a new model for similarity-based pattern querying in time series databases. In Proc. International Conference on Data Engineering, pages 33–42, San Diego CA, February 28-March 3 2000.
[99] C. Plaisant, R. Mushlin, A. Snyder, J. Li, D. Heller, and B. Shneiderman. Lifelines: Using visualization to enhance navigation and analysis of patient records. In 1998 American Medical Informatics Association Annual Fall Symposium, pages 76–80, Orlando FL, November 1998. AMIA.
[100] C. Plaisant, A. Rose, G. Rubloff, R. Salter, and B. Shneiderman. The design of history mechanisms and their use in collaborative educational simulations. In Proceedings of the Computer Support for Collaborative Learning, CSCL ’99, pages 348–359, Palo Alto CA, 1999. ACM Press.
[101] R. J. Povinelli. Identifying temporal patterns for characterization and prediction of financial time series events. In Temporal, Spatial and Spatio-Temporal Data Mining: First International Workshop (TSDM2000), pages 46–61, Lyon France, 2000.
[102] S. Powsner and E. Tufte. Graphical summary of patient status. The Lancet, 344:386–389, 1994.
[103] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668–676, 1990.
[104] D. Rafiei and A. Mendelzon. Querying time series data based on similarity. IEEE Transactions on Knowledge and Data Engineering, 12(5):675–693, September/October 2000.
[105] J. Rekimoto. Time-machine computing: A time-centric approach for the information environment. In ACM Symposium on User Interface Software and Technology, pages 45–54, Asheville NC, November 1999. ACM Press.
[106] R. Ribler, A. Mathur, and M. Abrams. Visualizing and modeling categorical time series data. In Symposium on Visualizing Time-Varying Data. ICASE and NASA/LaRC, September 1995.
[107] W. G. Roth. MIMSY: A system for analyzing time series data in the stock market domain. Master’s thesis, University of Wisconsin, Department of Computer Science, 1993.
[108] R. Sadri, C. Zaniolo, A. Zarkesh, and J. Adibi. Optimization of sequence queries in database systems. In Proceedings of Principles of Database Systems 2001, pages 71–81, Santa Barbara CA, May 2001.
[109] S.L. Salzberg. Personal Communication, 2002.
[110] P. M. Sanderson, M. D. McNeese, and B. S. Zaff. Handling complex real-world data with two cognitive engineering tools: Cogent and macshapa. Behavior Research Methods, Instruments, and Computers, 26(2):117–124, 1994.
[111] J. Seo and B. Shneiderman. Understanding hierarchical clustering results by interactive exploration of dendrograms: A case study with genetic microarray data. IEEE Computer, 35(7):80–86, July 2002.
[112] P. Seshadri, M. Livny, and R. Ramakrishnan. Sequence query processing. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 430–441, Minneapolis Minnesota, May 1994. ACM Press.
[113] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In Proceedings of the 11th International Conference on Data Engineering (ICDE), pages 232–239, Taipei, Taiwan, 1995.
[114] P. Seshadri, M. Livny, and R. Ramakrishnan. The design and implementation of a sequence database system. In VLDB’96, Proceedings of 22th International Conference on Very Large Data Bases, pages 99–110, Mumbai India, September 1996.
[115] U. Shaft, J. Goldstein, and K. Beyer. Nearest neighbors query performance for unstable distributions. Technical Report TR1388, Computer Sciences Department, University of Wisconsin, October 1998.
[116] B. Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics, 11(1):92–99, January 1992.
[117] B. Shneiderman. Dynamic queries for visual information seeking. IEEE Software, 11(6):70–77, 1994.
[118] B. Shneiderman. Inventing discovery tools: Combining information visualization with data mining. In Proceedings, Discovery Science 2001, pages 17–28, Washington DC, 2001. Springer-Verlag.
[119] B. Shneiderman, D. Feldman, A. Rose, and X. Ferre Grau. Visualizing digital library search results with categorical and hierarchical axes. In Proc. 5th ACM International Conference on Digital Libraries, pages 57–66, San Antonio, TX, June 2-7 2000. ACM Press.
[120] W.M. Siebert. Circuits, Signals, and Systems. MIT Press, Cambridge, MA, 1986.
[121] S. F. Silva, U. Schiel, and T. Catarci. Visual query operators for temporal databases. In Proc. of the 4th Int. Workshop on Temporal Representation and Reasoning (TIME), pages 46–53, Daytona Beach FL, May 1997.
[122] S.F. Silva and T. Catarci. Homogeneous access to temporal data and interaction histories in visual interfaces for databases. In Proceedings of the Workshop on User Interfaces to Data Intensive Systems (UIDIS’99), pages 108–117, Edinburgh Scotland, September 1999. IEEE Computer Society.
[123] S.F. Silva and T. Catarci. Visualization of linear time-oriented data: a survey. In Proceedings of the first International Conference on Web Information Systems Engineering, Hong Kong, June 2000. IEEE Computer Society.
[124] S.F. Silva, T. Catarci, and U. Schiel. A “graphical notebook” as interaction metaphor for querying databases. In Anais do XIV Simposio Brasileiro de Banco de Dados (SBBD’99), Florianopolis SC Brazil, October 1999. Sociedade Brasileira de Computacao.
[125] C.G. Simpson, G. Thow, G.P. Clark, S.N. Jennings, J.A. Watters, and J.W.S. Brown. Mutational analysis of a plant branchpoint and polypyrimidine tract required for constitutive splicing of a mini-exon. RNA, 8:47–56, January 2002.
[126] R. T. Snodgrass, I. Ahn, G. Ariav, D. S. Batory, J. Clifford, C. E. Dyreson, R. Elmasri, F. Grandi, C. S. Jensen, W. Kafer, N. Kline, K. G. Kulkarni, T. Y. Cliff Leung, N. A. Lorentzos, J. F. Roddick, A. Segev, M. D. Soo, and S. M. Sripada. A TSQL2 tutorial. SIGMOD Record, 23(3):27–33, September 1994.
[127] Spotfire. http://www.spotfire.com.
[128] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub. Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic differentiation. Proc. National Academy of Sciences USA, 96:2907–2912, March 1999.
[129] E. Tanin, R. Beigel, and B. Shneiderman. Design and evaluation of incremental data structures and algorithms for dynamic query interfaces. In Proceedings of Visualization ’97, pages 81–86. IEEE Press, 1997.
[130] The FlyBase Consortium. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Research, 31(1):172–175, 2003.
[131] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, May 2000.
[132] B. Theodoulidis, P. Papapanagiotou, and V. Pappas-Katsiafas. Interactive querying and visualisation in temporal databases (abstract). In K. Ong, S. Conrad, and T.W. Ling, editors, Knowledge Discovery and Temporal Reasoning in Deductive and Object-Oriented Databases, Proceedings of the DOOD’95, pages 91–93, Singapore, 1995.
[133] E. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire CT, 1983.
[134] L. Tweedie, B. Spence, H. Dawkes, and H. Su. The influence explorer (video) - a tool for design. In Proceedings of the 1996 conference companion on Human factors in computing systems, pages 390–391, Vancouver, British Columbia, April 13-18 1996. ACM Press.
[135] A. Unwin. Analysing real time series? CTI Math & Stats Newsletter, 4:8–10, 1998.
[136] A. Unwin and G. Wills. Exploring time series graphically. Statistical Computing and Graphics Newsletter, 2:13–15, 1999.
[137] J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827–842, 1998.
[138] J.J. van Wijk and E.R. van Selow. Cluster and calendar based visualization of time series data. In Proceedings, IEEE Symposium on Information Visualization, pages 4–9, San Francisco, CA, October 1999.
[139] R. Villafane, K. A. Hua, D. Tran, and B. Maulik. Mining interval time series. In Proceedings of the first International Conference on Data Warehousing and Knowledge Discovery, pages 318–330, 1999.
[140] J.D. Watson, N.H. Hopkins, J.W. Roberts, J.A. Steitz, and A.M. Weiner. Molecular Biology of the Gene. The Benjamin/Cummings Publishing Company, Inc, Menlo Park, CA, 4th edition, 1987.
[141] M. Wattenberg. Sketching a graph to query a time series database. In Proceedings of the 2001 Conference Human Factors in Computing Systems, Extended Abstracts, pages 381–382, Seattle WA, March 31-April 5, 2001. ACM Press.
[142] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 194–205, 1998.
[143] K. P. White, S.A. Rifkin, P. Hurban, and D. Hogness. Microarray analysis of Drosophila development during metamorphosis. Science, 286:2179–2184, 10 December 1999.
[144] S. Winkler. http://stats.math.uni-augsburg.de/CASSATT/index.html.
[145] P. C. Wong, W. Cowley, H. Foote, E. Jurrus, and J. Thomas. Visualizing sequential patterns for text mining. In Proceedings, IEEE Symposium on Information Visualization, pages 105–114, Salt Lake City UT, October 2000.
[146] B. B. Xia. Similarity search in time series data sets. Master’s thesis, Simon Fraser University, Computing Science, 1997.
[147] R. Xiong and J. S. Donath. Peoplegarden: Creating data portraits for users. In ACM Symposium on User Interface Software and Technology, pages 37–44, Asheville NC, November 1999.
[148] XmdvTool. Xmdvtool home page: Case studies. http://davis.wpi.edu/ xmdv/cs fin.html.
[149] B. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, pages 385–394, Cairo Egypt, September 2000. Morgan Kaufmann.
[150] B. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proceedings of the Fourteenth International Conference on Data Engineering, pages 201–208, Orlando FL, February 1998. IEEE Computer Society.
[151] B. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In Proceedings 16th International Conference on Data Engineering ICDE, pages 13–22, 2000.
[152] D. Young and B. Shneiderman. A graphical filter/flow representation of boolean queries: a prototype implementation and evaluation. Journal of American Society for Information Science, 44(6):327–339, July 1993.
[153] Y. Zhu and D. Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings VLDB 2002, Hong Kong, August 20–23 2002.