mr. plotter: unifying data reduction techniques in storage and visualization systems - people @ eecs...

14
Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems Sam Kumar UC Berkeley [email protected] Michael P Andersen UC Berkeley [email protected] David E. Culler UC Berkeley [email protected] ABSTRACT As the rate of data collection continues to grow rapidly, developing visualization tools that scale to immense data sets is a serious and ever-increasing challenge. Existing ap- proaches generally seek to decouple storage and visualization systems, performing just-in-time data reduction to transpar- ently avoid overloading the visualizer. We present a new ar- chitecture in which the visualizer and data store are tightly coupled. Unlike systems that read raw data from storage, the performance of our system scales linearly with the size of the final visualization, essentially independent of the size of the data. Thus, it scales to massive data sets while sup- porting interactive performance (sub-100 ms query latency). This enables a new class of visualization clients that auto- matically manage data, quickly and transparently request- ing data from the underlying database without requiring the user to explicitly initiate queries. It lays a groundwork for supporting truly interactive exploration of big data and opens new directions for research on scalable information visualization systems. 1. INTRODUCTION As researchers in the 1980s and 1990s were just discov- ering how to use high-resolution color displays to visualize information, it became clear that interactive visualization and data processing are useful to a wide variety of applica- tions [30] and would augment users’ ability to perform mean- ingful analyses [20]. In 1996, Shneiderman stated his famous information-seeking mantra, “Overview first, zoom and fil- ter, then details-on-demand” [31] as a guiding principle in creating data visualization tools. However, this gold stan- dard has proved difficult to realize when working with large datasets. There has been a sharp tradeoff between achiev- ing interactivity and scaling to large datasets, due to the data management problems inherent in creating overview visualizations [15]. Even as computing and storage technologies have improved vastly over the past two decades, this problem has not abated. In fact, it has worsened. As Hellerstein et al. explain in 1999, the appetite for data has grown faster than Moore’s Law, meaning that the time to analyze large data sets has steadily grown [20], and as Stoica explains in 2013, this trend is only expected to continue [33]. To address this prob- lem, the database and systems communities have developed column-oriented databases [36], parallel databases [27] and distributed computation frameworks [38] to process data faster. In addition, researchers have applied sampling tech- niques [21, 1] and precomputation of aggregates [12, 4] to respond to queries at interactive speeds, by providing ap- proximate results or statistical summaries rather than exact results or raw data. Meanwhile, the data visualization community has arrived at similar solutions. In his keynote paper at SIGMOD 2008, Shneiderman observed that data sets had grown so large that they could no longer be rendered with atomic markers on a single display. He predicted that visualization tools will need to use data reduction techniques, such as filtering, aggregation, or sampling, to solve the problem of “squeezing a billion records into a million pixels” [32]. Recent work in visualization systems makes use of data reduction to scale communication, rendering, and visualization quality to large datasets. ScalaR [9] uses query plan information to decide if data reduction should be used. M4 [23] renders aggre- gates instead of raw points. However, these systems do not take advantage of interactive databases that use similar tech- niques. They perform queries against traditional databases, performing aggregation just-in-time. In this paper, we present Mr. Plotter, the Multiresolution Plotter, which unifies the data reduction techniques used to create scalable visualizations with those used to create interactive databases. Specifically, we use a data store that uses hierarchical pre-aggregation to provide fast response times, and leverage the same aggregation technique to alle- viate bottlenecks in rendering, communication, and visual clutter in the visualization client. As a result, Mr. Plot- ter substantially outperforms other visualization tools (see Figure 1 and Table 1). Mr. Plotter consists of a visualization client and a data store. For the data store, we use the Berkeley Tree Database (BTrDB) [4], a database for scalar-valued timeseries data that can efficiently serve queries for statistical summaries of data. The visualization client is a custom-built desktop ap- plication responsible for providing a user interface and fetch- ing and rendering the appropriate data. By closely coupling these two components, we create a highly responsive and interactive visualization for the user. Some past work has also tried to closely couple the vi- sualization client and the data store. The prior work most similar to Mr. Plotter is Skydive [19], a system for visu- alizing spatial data. Skydive encodes its data into an ag- gregate pyramid, with raw data at the base and statistical summaries, at progressively larger bin sizes, as one moves up the pyramid. The data visualization client lets the user choose a stratum from the pyramid and a “cut” of data (i.e., a working dataset), and then map that data to a visual pre- sentation. 1

Upload: others

Post on 25-Feb-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

Mr. Plotter: Unifying Data Reduction Techniques inStorage and Visualization Systems

Sam KumarUC Berkeley

[email protected]

Michael P AndersenUC Berkeley

[email protected]

David E. CullerUC Berkeley

[email protected]

ABSTRACTAs the rate of data collection continues to grow rapidly,developing visualization tools that scale to immense datasets is a serious and ever-increasing challenge. Existing ap-proaches generally seek to decouple storage and visualizationsystems, performing just-in-time data reduction to transpar-ently avoid overloading the visualizer. We present a new ar-chitecture in which the visualizer and data store are tightlycoupled. Unlike systems that read raw data from storage,the performance of our system scales linearly with the sizeof the final visualization, essentially independent of the sizeof the data. Thus, it scales to massive data sets while sup-porting interactive performance (sub-100 ms query latency).This enables a new class of visualization clients that auto-matically manage data, quickly and transparently request-ing data from the underlying database without requiringthe user to explicitly initiate queries. It lays a groundworkfor supporting truly interactive exploration of big data andopens new directions for research on scalable informationvisualization systems.

1. INTRODUCTIONAs researchers in the 1980s and 1990s were just discov-

ering how to use high-resolution color displays to visualizeinformation, it became clear that interactive visualizationand data processing are useful to a wide variety of applica-tions [30] and would augment users’ ability to perform mean-ingful analyses [20]. In 1996, Shneiderman stated his famousinformation-seeking mantra, “Overview first, zoom and fil-ter, then details-on-demand” [31] as a guiding principle increating data visualization tools. However, this gold stan-dard has proved difficult to realize when working with largedatasets. There has been a sharp tradeoff between achiev-ing interactivity and scaling to large datasets, due to thedata management problems inherent in creating overviewvisualizations [15].

Even as computing and storage technologies have improvedvastly over the past two decades, this problem has not abated.In fact, it has worsened. As Hellerstein et al. explain in1999, the appetite for data has grown faster than Moore’sLaw, meaning that the time to analyze large data sets hassteadily grown [20], and as Stoica explains in 2013, this trendis only expected to continue [33]. To address this prob-lem, the database and systems communities have developedcolumn-oriented databases [36], parallel databases [27] anddistributed computation frameworks [38] to process datafaster. In addition, researchers have applied sampling tech-niques [21, 1] and precomputation of aggregates [12, 4] to

respond to queries at interactive speeds, by providing ap-proximate results or statistical summaries rather than exactresults or raw data.

Meanwhile, the data visualization community has arrivedat similar solutions. In his keynote paper at SIGMOD 2008,Shneiderman observed that data sets had grown so largethat they could no longer be rendered with atomic markerson a single display. He predicted that visualization toolswill need to use data reduction techniques, such as filtering,aggregation, or sampling, to solve the problem of “squeezinga billion records into a million pixels” [32]. Recent work invisualization systems makes use of data reduction to scalecommunication, rendering, and visualization quality to largedatasets. ScalaR [9] uses query plan information to decideif data reduction should be used. M4 [23] renders aggre-gates instead of raw points. However, these systems do nottake advantage of interactive databases that use similar tech-niques. They perform queries against traditional databases,performing aggregation just-in-time.

In this paper, we present Mr. Plotter, the MultiresolutionPlotter, which unifies the data reduction techniques usedto create scalable visualizations with those used to createinteractive databases. Specifically, we use a data store thatuses hierarchical pre-aggregation to provide fast responsetimes, and leverage the same aggregation technique to alle-viate bottlenecks in rendering, communication, and visualclutter in the visualization client. As a result, Mr. Plot-ter substantially outperforms other visualization tools (seeFigure 1 and Table 1).

Mr. Plotter consists of a visualization client and a datastore. For the data store, we use the Berkeley Tree Database(BTrDB) [4], a database for scalar-valued timeseries datathat can efficiently serve queries for statistical summaries ofdata. The visualization client is a custom-built desktop ap-plication responsible for providing a user interface and fetch-ing and rendering the appropriate data. By closely couplingthese two components, we create a highly responsive andinteractive visualization for the user.

Some past work has also tried to closely couple the vi-sualization client and the data store. The prior work mostsimilar to Mr. Plotter is Skydive [19], a system for visu-alizing spatial data. Skydive encodes its data into an ag-gregate pyramid, with raw data at the base and statisticalsummaries, at progressively larger bin sizes, as one movesup the pyramid. The data visualization client lets the userchoose a stratum from the pyramid and a “cut” of data (i.e.,a working dataset), and then map that data to a visual pre-sentation.

1

Page 2: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

However, Mr. Plotter improves upon systems like Sky-dive by incorporating data management into the visualiza-tion client. While the user focuses on zooming and scrollingthrough the data, the client automatically requests data atthe appropriate resolution and time ranges, and shows per-ceptually relevant information until that data is receivedfrom the data store. We call this behavior automatic datamanagement. Furthermore, Mr. Plotter reconciles client-side caching and automatic data management with the no-tion of fast data—data that are inserted out-of-order, beingstreamed into the data store, or otherwise changing. Thismakes Mr. Plotter suitable for situations where data are be-ing materialized in real-time [6, 5].

In summary, the main contributions of our work are:

• We unify data reduction techniques used by database sys-tems to support interactivity, with those used by visu-alization systems to eliminate rendering/network bottle-necks and visual clutter.

• We demonstrate how to apply hierarchical aggregationto provide interactive visualization of scalar-valued time-series data.

• We provide a visualization client that caches data, au-tomatically and transparently requests data as the userpans and zooms, and shows a perceptually relevant plotuntil the requests return—behavior that we term auto-matic data management.

• We reconcile this with “fast” data (i.e., data streaminginto the data store, being inserted out-of-order, or other-wise changing).

The remainder of this paper is organized as follows. Sec-tion 2 provides necessary details about BTrDB, the backenddata store used by Mr. Plotter. Sections 3 and 4 describe thedesign and implementation, respectively, of Mr. Plotter’s vi-sualization client. Section 5 evaluates Mr. Plotter. Section6 compares Mr. Plotter to related work. Finally, Section 7discusses directions for future work, and Section 8 concludesthe paper.

2. DATA STORAGE: BTRDBThis section motivates our choice of aggregation as a data

reduction technique, and then explains relevant details aboutthe API and implementation of BTrDB, the data store usedby Mr. Plotter.

2.1 The Case for AggregationWe describe the three main data reduction techniques that

database systems and visualization systems use: filtering,sampling, and aggregation. Then, we explain our choice touse aggregation.

2.1.1 Sampling and FilteringIn sampling and filtering techniques, a subset of the data

is selected, and traditional techniques for query processingor visualization are applied to that subset. In filtering tech-niques, this subset consists of data that match a certainpredicate (e.g., adding a WHERE clause to a SQL query).In sampling techniques, the subset is chosen without at-tention to a particular predicate on the data; usually, itis chosen randomly. BlinkDB [1] precomputes samples of

(a) A naıve design: visualization client requests and rendersall points matching a query

(b) A design used by visualization tools like M4 and ScalaR:visualization client requests and renders data aggregates,which are computed by the database system on the fly

(c) Mr. Plotter’s design: database system accelerates queriesusing precomputation; visualization client requests and ren-ders precomputed data aggregates

Figure 1: Comparison of approaches to data visual-ization

data, which it uses to produce approximate results to querieswithin a user-specified time bound or error bound. Visual-ization tools [2, 28, 26] also use these techniques, retrievingonly a subset of data from the server to reduce the costs ofcommunication and rendering.

2.1.2 AggregationIn aggregation techniques, the space of data is divided into

bins, and aggregate statistics (e.g., number of records, av-erage value, etc.) are computed for each bin. Respawn [12]and BTrDB [4] precompute aggregates, accelerating queriesfor those aggregates. Visualization tools such as ATLAS [14],ScalaR [9], M4 [23], and Skydive [19] display aggregatesinstead of raw data. Hierarchical aggregation schemes, inwhich bins are organized as a tree where each bin is theunion of its children, are particularly amenable to visualiza-tion [17, 5, 10]. One can create an overview of a large dataset using the aggregate at the root, and drill down to thedetails by following a branch of the tree, realizing Shneider-man’s mantra, “Overview first, zoom and filter, then details-on-demand.”

2.1.3 DiscussionWe use aggregation for data reduction in Mr. Plotter. The

primary reason is that random sampling always has the po-tential to miss outliers. In contrast, aggregrates such as“min” and “max” reliably capture the existence of outliers.Meanwhile, filtering cannot be used to produce an overviewvisualization, as the subset is, by construction, not repre-sentative of the overall dataset.

2

Page 3: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

System ATLAS [14] ScalaR [9] ScalaR [9] M4 [23] Mr. Plotter Mr. PlotterNumber of Records 1.28 Billion 2.70 Billion 2.70 Billion 3.60 Million 233 Million 23.3 BillionTime to Overview 28 s 89.55 s 1.95 s 5 s 223 ms / 56.2 ms 294 ms / 50.0 msData Reduction Aggregation Aggregation Sampling Aggregation Aggregation Aggregation

Table 1: Time to produce an overview visualization. We provide two measurements for Mr. Plotter: (1)latency to fetch aggregates when the server-side in-memory cache is empty, and (2) latency to fetch aggregateswith a hot server-side cache.

Existing research has also investigated non-random sam-ples, chosen so that a sampled plot looks similar to a plotwith all of the raw data [26]. However, such approaches donothing to remove visual clutter. In contrast, a plot of ag-gregates may actually convey more useful information thana plot of raw data. With aggregation, one could plot themean value in each bin; such moving average plots can bemore insightful than ones containing raw data [29], as theyare less easily cluttered by local fluctuations than individ-ual data points [16, 17]. See Section 3.1 for Mr. Plotter’ssolution.

Other researchers [25] have elected to use aggregation,rather than filtering or sampling, for similar reasons.

2.2 BTrDB’s Timeseries AbstractionWe use BTrDB [4] for Mr. Plotter’s data store because it

not only supports accelerated aggregate queries, but alsoprovides a unique timeseries abstraction enabling visual-ization of “fast data.” This section describes the parts ofBTrDB’s API relevant to Mr. Plotter.

In BTrDB, a stream is a representation of a timeseries. Astream is identified by a 128-bit Universally Unique Identi-fier (UUID), and stores a sequence of (time, value) points.The time is a 64-bit nanosecond timestamp, and the valueis a 64-bit floating-point number. BTrDB is a copy-on-writedatabase. When a stream is modified, a new logical versionof the stream is created. The old version of the stream stillexists in the database and can be queried separately.

The most basic query that BTrDB supports is a query forraw data: RawValues(UUID, StartTime, EndTime,Version) → (Version, [(Time, Value)]). This queryreturns the list of (time, value) pairs in the stream identifiedby UUID, where StartTime ≤ time < EndTime. A specialvalue “latest,” that represents the most recent version, canbe passed in as the Version argument; in this case, the returnvalue contains the corresponding version number.

The query used most often by Mr. Plotter is a query forstatistical aggregates: AlignedWindows(UUID, Start-Time, EndTime, Resolution, Version) → (Version,[(Time, Min, Mean, Max, Count)]). To process thisquery, BTrDB divides the time range [StartTime, EndTime)into intervals that are each 2Resolution nanoseconds wide.BTrDB then returns the number of points in each time inter-val, along with the minimum value, mean value, and maxi-mum value of those points. StartTime and EndTime mustbe multiples of 2Resolution, hence the name Aligned Windows.

Mr. Plotter makes two more types of queries to BTrDB.The first is Nearest(UUID, Time, Direction, Version)→ (Version, (Time, Value)), which returns the point im-mediately before or after the specified time. The Directionargument may be either “Forward” or “Backward,” specify-ing in which direction to look for the nearest point. Thesecond is Changes(UUID, FromVersion, ToVersion,

Resolution) → (Version, [(StartTime, EndTime)]),which returns the time ranges in which two versions of astream differ. Each returned time range is at least 2Resolution

nanoseconds wide. Thus, the caller can use the Resolutionargument to control the granularity at which the time rangesare computed.

BTrDB supports additional queries not described in thispaper. For a full description of BTrDB’s API, we direct thereader to [3].

2.3 Summary of BTrDB’s ImplementationBTrDB supports each type of query in Section 2.2 with

running time logarithmic in the density of the stream, andlinear in the size of the response. In this section we providea summary of BTrDB’s implementation to motivate how itachieves this running time.

BTrDB stores data in a time-partitioning tree. Each streamis represented as a separate tree. The tree is copy-on-write,so whenever a stream is modified, the nodes in the tree thatwould be modified, and all of their ancestors, are copied;thus each version of a stream corresponds to a separate rootnode.

The tree can be logically understood as a binary tree.Raw data is stored at the leaves, and each intermediate nodestores a statistical aggregate containing the min, mean, max,and count. The tree partitions time. The root node holdsthe aggregate for the stream over the entire space of timethat BTrDB supports. The left child of the root stores theaggregate for the first half of that time range, and the rightchild stores the aggregate for the second half of that timerange. As one travels from root to leaf, the time range is re-peatedly divided in half. The nodes at logical depth 63 holdaggregates for each nanosecond, and their children containthe raw data.

When a data point is inserted, the aggregates along thepath from leaf to root are recomputed and placed in newnodes, creating a new tree for the new version of the stream.These new nodes may contain pointers to nodes in the oldtree. The aggregates (min, mean, max, and count) canbe computed at each node using only the node’s immedi-ate children, making this process very efficient. Note thataggregates are updated at insertion time, not query time.

The tree is structured so that the internal nodes of depthd contain statistical aggregates of aligned intervals of size263−d. Therefore, to support the AlignedWindows query,BTrDB simply reads and returns internal nodes of the treeat the correct depth. Furthermore, the tree structure makessatisfying a Nearest query simple: it is easy to avoid emptysubtrees. Finally, each node is annotated with the earliestversion number it is a part of, making it possible to efficientlysatisfy Changes queries: BTrDB traverses the tree corre-sponding to ToVersion, skipping over subtrees whose rootis annotated with a version at least as old as FromVersion,

3

Page 4: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

adding a time range to the output whenever this happens.BTrDB’s physical data structure contains important op-

timizations. First, BTrDB’s physical tree has a branchingfactor of 64, rather than a branching factor of 2. Thus, mov-ing one level up or down the physical tree is equivalent tomoving 6 levels in the logical tree. The intermediate nodes“missing” from the physical tree are computed on the fly asthey are needed. Second, BTrDB does not always store rawdata at the maximum depth. It may store data higher up intree, if that subtree represents very few data points. Oncethe number of data points at a node crosses a threshold,the data are partitioned by time, and a new internal node,storing an aggregate, is created. Third, inserted points areprocessed in batches, amortizing the cost of recomputingaggregates and creating new nodes. Fourth, BTrDB storesinternal nodes in a small storage pool backed by fast Solid-State Drives, and leaf nodes in a large storage pool backedby slower Hard-Disk Drives, taking advantage of the factthat there are many more leaves than internal nodes.

This is just a summary of BTrDB’s implementation. Fora comprehensive description, see [4].

3. SYSTEM DESIGNThis section describes the design of Mr. Plotter, explain-

ing how it supports interactive exploration of scalar-valuedtimeseries. Section 3.1 describes the user-facing plot thatMr. Plotter renders, examining it as a static object. Section3.2 describes the plot as a dynamic object, explaining howthe user can interact with it to explore timeseries. Section3.3 discusses data management challenges in supporting thismode of interaction, and our solutions for coping with them.

3.1 User-Facing PlotCentral to Shneiderman’s problem of “squeezing a billion

records into a million pixels” [32] is drawing plots where mul-tiple points map to the same pixel. Aggregation techniquesare a natural solution: where multiple atomic markers would“collide” in the same pixel, a single aggregate marker can beused. In Mr. Plotter, we consider line plots with time on thehorizontal axis, and value on the vertical axis. Our primaryconsideration is densely-packed timeseries, where adjacentpoints may map to the same pixel column.

Solutions such as M4 [23] compute data aggregates to pro-duce a plot that is equivalent, pixel-for-pixel, to a plot drawnwith the raw data points. While such a solution eliminatescommunication and rendering bottlenecks, it does not solvethe problem of visual clutter. Plots of raw data points, es-pecially over long periods of time, are often too cluttered formeaningful analysis, because short-term fluctuations tend tohide long-term trends [16, 17, 29].

Mr. Plotter uses a different approach. It uses the “min”and “max” values of aggregates to draw a translucent silhou-ette of what the plot would look like if all of the raw pointswere plotted, and then draws a line plotting the “mean” val-ues over the same period. This shows a smoothed plot ofmean values that is less cluttered, while still indicating thevariance in the data pre-smoothing, via the min-max silhou-ette. See Figure 2(a) for an example. Because the min-maxsilhouette is translucent, it does not block large areas of theplot, even if the stream is very noisy. This makes plots withmultiple streams less cluttered. Our plot provides strictlymore information than a plot of raw data (Figure 2(b)),while reducing visual clutter.

(a) A plot of a single stream, rendered by Mr. Plotter

(b) Same data, raw points rendered and connected with lines

Figure 2: Comparison of a plot of aggregates, asdrawn by Mr. Plotter, to a plot of raw data points

To indicate missing data, Mr. Plotter draws a data densityplot when a user selects a stream. This displays the “count”aggregates for the selected stream (see Figure 5). We alsoemphasize missing data in the main plot by rendering gapsin the data as gaps in the visualization1. Neighboring ag-gregates are drawn connected to each other as long as theyare temporally adjacent2. This implies that, if the user haszoomed in so far that there is a gap between every pair ofneighboring points, Mr. Plotter will no longer connect thepoints, effectively displaying a scatterplot.

3.2 Interaction with the PlotWe support a very fluid mode of interaction with the plot.

The user clicks and drags the plot in order to scroll hori-zontally, and uses her mousewheel or touchpad to zoom inor out. The user does not manually initiate queries to fetchnew data; the visualization client automatically queries datadepending on the user’s current view.

The success of this interaction model depends on timelyprocessing of queries by the database; data must be loadedat interactive, or near-interactive, speeds. We make thispossible by unifying the aggregation technique BTrDB usesto accelerate queries, with those that the visualization clientuses for rendering.

1Optionally, the user may turn off this behavior, to alwaysconnect neighboring points.2BTrDB omits aggregates with count = 0 when respondingto AlignedWindows queries, in order to save bandwidth.The client infers the existence of gaps when the aggregatesreturned by BTrDB are not temporally adjacent.

4

Page 5: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

(a) User starts at this view, and then rapidly zooms in to thedata near December 18

(b) Mr. Plotter continues displaying aggregates from the pre-vious view, transformed to the new axes, while waiting forBTrDB to return data at the new resolution

(c) When data at the new resolution finally arrives (usuallywithin 100 milliseconds), the plot is updated

Figure 3: When the desired data is not available ontime, Mr. Plotter continues to display perceptuallyrelevant data

Even so, we cannot depend on requests to the database toalways return at interactive speeds. First, the load on theserver could vary, causing query processing times to increasewhen the server load is heavy. Second, network characteris-tics may affect performance. Depending on the physical lo-cation of the client and server, the network round-trip timemay exceed 100 ms.

The visualization client manages queries to the BTrDBserver and responses from the server, to mitigate the impactif some requests take longer to return. First, whenever ag-gregates are received from the server, they are placed in anin-memory cache, decreasing the average wait time for newaggregates to load. Second, when the user zooms or scrollsto a region whose data is not in the cache, Mr. Plotter con-tinues showing aggregates in the previous view, transformedaccording to the current view, until the new aggregates arereceived. In this way, Mr. Plotter shows perceptually rele-vant information, even if the “ideal” data is not yet available.See Figure 3 for an example. Finally, data is pre-fetchedaccording to the user’s current view, further reducing theuser’s expected wait time. Collectively, we call this behav-ior automatic data management.

3.3 Supporting Automatic Data ManagementThis section explains aspects of Mr. Plotter’s design that

support the mode of interaction described above.

3.3.1 Size and Alignment of AggregatesSolutions such as M4 [23] compute an aggregate of data

for each pixel column of the plot, and display each aggregate

in its pixel column. The rationale is that rendering multi-ple aggregates per pixel column results in implicit loss ofdetail. Therefore, requesting finer-grained aggregates fromthe database wastes network bandwidth and rendering time.

However, requiring each aggregate to correspond exactlyto one pixel column is very restrictive. If the user choosesto zoom in by a small amount, the size of the time intervalrepresented by each pixel column decreases slightly, requir-ing new aggregates to be computed. Similarly, if the user’sview shifts to the left or right, the alignment of time inter-vals corresponding of pixel columns may change, requiringnew aggregates to be computed.

This model introduces nontrivial problems when workingwith client-side caching and precomputed aggregates. First,caching aggregates returned by the server is only useful fordrawing other plots at the exact same zoom, and at the ex-act same alignment between time and pixels, as the plot forwhich the aggregates were requested. Every time the userchanges zoom, an additional request for the entire screen ofdata would have to be made to the server. Second, the clientmust request aggregates for time intervals whose size andalignment can vary arbitrarily. BTrDB can quickly returnaggregates over time intervals whose endpoints, in nanosec-onds, are aligned to multiples of a power of two; however,pixel columns on the user’s screen need not be aligned inany particular way3.

Mr. Plotter relaxes this constraint. Rather than requiringaggregates to correspond exactly to pixel columns, Mr. Plot-ter only requires that each aggregate represent a time inter-val that is at most one pixel column in width. To choose thesize of aggregates, Mr. Plotter takes the size of a pixel col-umn in nanoseconds—the size of aggregates that M4 uses—and rounds it down to the nearest power of two. This isalways a power of two, so the aggregates can be efficientlyobtained from BTrDB. Because we no longer require the ag-gregates to correspond directly to pixel columns, there areno alignment issues4. Thus, the client can effectively cacheaggregates obtained from the server: the aggregates used todraw one plot can be used to draw another overlapping plot,even if the zoom and translation are slightly different.

One may argue that this approach is somewhat wasteful:more aggregates are drawn than pixel columns. We cancompute an upper bound on how many extra aggregates arerendered. The size of aggregates, computed as per above,has the following property:

Pixel column size ≥ Aggregate size >1

2· Pixel column size

where the sizes of pixel columns and aggregates are mea-sured in time. We can then divide the width of the screen(in time) by each of these three quantities to show that:

Pixel columns ≤ Aggregates rendered < 2 · Pixel columns

which means that we are overplotting by less than a factorof two.

3BTrDB does support Windows queries that computeaggregates over unaligned, arbitrarily-sized intervals, butprocessing these queries is less efficient than processingAlignedWindows queries. See [3] for details.4Aggregates can be rendered even if they span multiple pixelcolumns, or if a pixel column contain multiple aggregates;this is handled in the graphics pipeline, taking advantage ofmultisample anti-aliasing if the client’s graphics card sup-ports it.

5

Page 6: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

3.3.2 Prefetching PolicyAs described in Section 3.2, Mr. Plotter prefetches data

to hide the latency of requesting and receiving aggregatesfrom the database. Mr. Plotter uses a stateless prefetchingscheme. At any given view, there are four actions that theuser can take: (1) scroll to the left, (2) scroll to the right,(3) zoom in, or (4) zoom out. Once Mr. Plotter has receivedthe data for the current view, it prefetches data, assumingthat the user may take any of these actions. Specifically,Mr. Plotter prefetches:

• One screen of data to the left of the current view

• One screen of data to the right of the current view

• The current screen’s data at the next smaller aggregatesize

• The current screen’s data, and the two neighboring screens’data, at the next larger aggregate size

The purpose of the first two items is to prefetch a scrollinggesture. We prefetch one screen’s worth of data to the leftand right, because of the click-and-drag interface that Mr.Plotter supports; the user is not likely to move the view bymore than one screen width in one gesture. The purposeof the last two items is to prefetch a zooming gesture. Theuser uses the mousewheel to zoom, and the cursor’s currentposition acts as the focus of the operation. The user canzoom in at any point of the screen, so we prefetch all of thecurrent screen’s data at the next smaller size of aggregates.When the user zooms out with the cursor (focus) on oneside of the screen, most of the extra time range comes fromthe other side of the screen; furthermore, if the user zoomsout to the next larger aggregate size, the total time range isdoubled, because consecutive aggregate sizes vary by factorsof two. Therefore, we prefetch the two neighboring screensat the next larger size of aggregates.

Prefetching is not a new idea. For example, ATLAS [14]is a visualization tool for scalar-valued timeseries that usesprefetching. Like Mr. Plotter, its “objective is not to pullthe maximum amount of data possible, but rather to get theminimum amount of data to sustain smooth interactions.”To perform prefetching effectively, ATLAS maintains a tim-ing model of the database, allowing it to estimate the latencyof each query. It limits how fast the user can scroll, to ensurethat prefetched data will arrive in time.

Although ATLAS’ prefetching policy is more sophisticatedthan Mr. Plotter’s, a prefetching methodology similar to AT-LAS’ would not be very useful in Mr. Plotter due to somekey differences between the two systems. First, Mr. Plot-ter, unlike ATLAS, does not restrict how fast the user canmove between views. Although Mr. Plotter uses mouse in-put, which has its limits (it is humanly possible only to movethe mouse so fast), it is difficult to characterize how fast theuser can move between views. As a result, it is less mean-ingful to schedule prefetch requests to meet deadlines basedon how fast the user can zoom or scroll. Second, and moreimportantly, ATLAS’ prefetching methodology rests on theassumption that “query processing times increase linearlywith the number of records contained in the query range.”This assumption does not hold for Mr. Plotter. Due to thehierarchical pre-aggregation scheme described in Section 2.3,the time taken by BTrDB to process an AlignedWindowsquery is linear in the number of aggregates returned, not in

the number of raw records those aggregates comprise. Formost modern screens, which are at most 3000 or 4000 pixelswide, results from [4] (confirmed in Section 5) suggest thatquery processing in BTrDB will take at most a few hundredmilliseconds—barely enough time for a user to click and dragthe plot across the screen. Thus, prefetching only adjacentscreens is sufficient.

There exist proposals for dynamic prefetching based onsimilar assumptions to Mr. Plotter (e.g., ForeCache [8]).However, our experience is that Mr. Plotter’s simple prefetch-ing strategy performs well (see Section 5). Therefore, weleave an investigation of more sophisticated prefetching tech-niques to future work.

3.3.3 Request ThrottlingAn advantage of plotting the “min” and “max” is that

anomalies are visible even in overview visualizations. If auser notices an anomaly, she may rapidly zoom in to examineit in more detail. As the user zooms in, the plot movesthrough many levels of resolution. The user does not stop ateach resolution before continuing to zoom in, so the “thinktime” between individual gestures is very low5.

A naıve implementation would perform a full data refreshon every frame. However, this is wasteful, because the userzooms to the next resolution before each request for datareturns. Furthermore, after the user stops zooming in, therequest for the final plot would be placed behind these extra-neous requests on the server-side request queue, degradingthe user’s experience.

To solve this problem, Mr. Plotter implements requestthrottling. If a cache miss has occurred and a request sent toBTrDB in the past 300 ms, cache lookups are done withoutmaking requests to BTrDB on misses; the on-screen data setis left unchanged. Mr. Plotter smoothly updates the currentview as the user zooms and scrolls, using only the on-screendata set, so the plot continues to feel responsive. This limitsthe number of extraneous requests.

4. VISUALIZATION CLIENTThis section describes the implementation of the visual-

ization client. The visualization client is a desktop applica-tion written in the Qt framework. It is written primarily inC++, but parts of the user interface are written in QML.As depicted in Figure 4, the visualization client has fourlayers: Data Retrieval, Data Management, Rendering, andUser Interface. We describe each layer in turn.

4.1 Data Retrieval LayerThe Data Retrieval Layer is requests data from BTrDB.

On cache misses, the Data Retrieval Layer requests aggre-gates from BTrDB by making AlignedWindows queries.It takes care to properly align queries to multiples of theappropriate powers of two, and leverages gzip compressionto make more efficient use of network bandwidth.

4.2 Data Management LayerThe Data Management Layer issues queries to the Data

Retrieval Layer, maintains a cache of the resulting aggre-gates, and prefetches data according to the user’s currentview.5Because Mr. Plotter continues to show perceptually rel-evant information until the desired data is available (seeSection 3.2), the user is able to continue zooming.

6

Page 7: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

Figure 4: System architecture of Mr. Plotter

4.2.1 Cache OrganizationThe in-memory cache stores aggregates obtained from the

server in past queries to prevent additional queries for thesame data, increasing Mr. Plotter’s responsiveness.

When the user navigates to a certain view, Mr. Plottercomputes the appropriate resolution (size of aggregates),and then requests aggregates at that resolution for each vis-ible stream, in the time range that is visible. We use atree-based map to store the aggregates for each stream andresolution, to allow for efficient lookup of aggregates in atime range. We organize these trees in array-based mapdata structures, to allow efficient lookup of the tree of ag-gregates given a specific stream and resolution. We keeptrack of the cache size, and use the LRU (“Least RecentlyUsed”) cache eviction policy to keep it within a strict upperbound.

However, storing the aggregates directly in a tree struc-ture is inefficient. Aggregates must be individually taggedfor LRU cache management, adding additional overhead.Information regarding gaps in the data is also lost in thisscheme; the absence of an aggregate in the cache could meanthat there is no data for that time range, or it could meanthat the aggregate has not been fetched yet.

Our solution is to manage aggregates in chunks, instead ofindividually. The cache holds data as a set of cache entries.A cache entry contains a set of aggregates for a specificstream at a specific resolution, received in response to asingle AlignedWindows query, stored as an array. It alsocontains the start and end timestamps for the query that wasperformed. Cache entries keep track of gaps in the data;the existence of a cache entry guarantees that all of thedata in the specified time range, for the specified streamand resolution, is in the cache entry. The cache stores andmanipulates cache entries as single units, for lookup andLRU eviction. This helps amortize the overhead of storingmany aggregates. Furthermore, it makes it easier to prepareaggregates for rendering, because data is passed to the GPUas arrays (see Section 4.3).

4.2.2 Fast DataTo handle changing data, we maintain a base version num-

ber for each stream. The base version number is the mostrecent version such that all of the cache entries for thatstream were obtained from a version of the stream at leastas recent as the base version. Stated differently, the baseversion number is a bound on how stale the cached data fora stream can be.

When we make an AlignedWindows query to BTrDB,we receive both the aggregates and the version number of thestream from which those aggregates were obtained. Whenthe cache receives the aggregates, it sets the base versionnumber to the minimum of the current base version num-ber, and the version number of the received data. We pe-riodically make Changes queries to BTrDB, querying thechanges between the base version and the latest version.When we receive the time ranges containing the differences,we drop all cache entries containing data in those time ranges,and update the base version to the latest version of thestream, as returned by the Changes query.

4.2.3 On-Screen DataOne of the functions of the Data Management Layer is to

continue displaying perceptually relevant information whenrequests miss in the cache and are sent to BTrDB. As de-scribed in Section 3, Mr. Plotter continues to display theaggregates last displayed on the screen, mapped to the cur-rent view, while data is loading. To implement this, theData Management Layer explicitly manages a collection ofcache entries that are rendered on-screen. When the viewchanges, the data for the new time range and resolution arelooked up in the cache, and missing aggregates are requestedfrom the server. Until the new data arrive, the on-screendata are not updated, and therefore continue to be mappedto the current view and rendered.

4.2.4 Cache MissesWhen the user updates her view, Mr. Plotter looks in the

cache to update the on-screen data set. When not all of therequired data is in the cache, a request is sent to the DataRetrieval Layer, which makes a request to BTrDB.

However, even before the data is loaded, the user mayslightly adjust their plot. In such a scenario, we should notrepeat the request to BTrDB. Instead, we should wait for thecurrent outstanding request to finish, and only request datathat will not be returned by the request we have alreadymade. To do this, we create a placeholder cache entry witha “pending” bit set to indicate that a request for that datais outstanding. When another cache miss occurs for thatdata, the caller sees that there is already a pending request,and waits for the response instead of issuing a new request.

4.3 Rendering LayerThe Rendering Layer is responsible for converting the on-

screen data set, maintained by the Data Management Layer,into a user-facing plot. To render the plot, we use OpenGL,an API for interacting with a GPU for hardware-acceleratedgraphics rendering.

The first time a cache entry is rendered, it is convertedto a Vertex Buffer Object, or VBO. A VBO is a buffer ofdata in graphics memory, ready for processing by a shaderprogram running on the GPU. The VBO handle is stored inthe cache entry, for easy retrieval. Then, the VBOs for cache

7

Page 8: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

(a) The first three aggregates have gaps on both sides andare rendered as lone points, whereas the remaining six aretemporally adjacent and are rendered connected

(b) Same plot as Figure 5(a), with the “Always connectpoints” option selected

Figure 5: Plots that display all cases for renderingaggregates

entries in the on-screen data set, along with the domains ofthe time and value axes and aesthetic settings chosen bythe user, are passed to a shader program, which renders theplot. This section describes the rendering pipeline in detail.

4.3.1 Rendering an AggregateBefore discussing how to generate VBOs from cache en-

tries, we discuss how aggregates are rendered. The “min,”“mean,” and “max” values are drawn at the midpoint of thetime interval represented by that aggregate. If the neighbor-ing aggregates are temporally adjacent (i.e., there is no gapin between), then the aggregates are connected: the meanvalues are connected by a line, and the min-max values areconnected by a translucent trapezoid. If this aggregate isnot temporally adjacent to either of its neighbors, then atranslucent vertical line is drawn connecting the min andmax, and the mean is depicted as a single point (as in ascatterplot). This often happens when the aggregate con-tains a single raw data point, in which case the min andmax are equal and the vertical line is not visible.

Figure 5 demonstrates how aggregates are rendered. Notethat temporally adjacent aggregates are normally at mostone pixel column apart; these plots are zoomed in beyondthe granularity of nanosecond timestamps, to emphasize in-dividual aggregates.

4.3.2 Generation of Vertex Buffer ObjectsTo draw the min-max translucent silhouette, we need to

pass the min and max for each aggregate as separate verticesto the GPU. Therefore, we generate two vertices in the VBO

for every aggregate in the cache entry. Then, we render thedata as a triangle strip.

To ensure that gaps are rendered properly, we must in-clude information in the VBOs that the shader program canuse to selectively omit certain triangles. Our approach is toinsert two new points in every gap, with a special flag valueindicating that their purpose is to mark a gap; the vertexshader sets a “do not render” flag for these points. The valueof the “do not render” flag is interpolated between all of thepixels; we discard fragments where the interpolated “do notrender flag” is strictly greater than 0, thereby drawing agap.

To draw the mean line, we make a second pass, renderinga line strip instead of a triangle strip. We also use onlyone point per aggregate instead of two. We then make twomore rendering passes, drawing vertical lines and points, torender lone points. When generating the VBOs, we identifylone points and set special flags, so that only those pointsare rendered during these passes.

Finally, the user has an option to “Always connect points.”With this option enabled, neighboring points are always con-nected, regardless of whether they are temporally adjacent.To implement this, we pass this option to the vertex shaderas another uniform variable; if the option is set, then the“do not render” flag is never set. Note that the verticesdo not change, so the vertices in each gap are still present.However, we can position them appropriately so that theydo not interfere with the final image. See Figure 5(b).

4.3.3 Data Density PlotAs described in Section 3.1, a data density plot is drawn

when the user selects a stream. This plot displays the “count”aggregates, pulling the plot down to 0 where aggregates aremissing. To pull the plot to 0, we insert an extra pair ofpoints in each gap, at the timestamp where the the first ag-gregate in the gap would be if it were present. Two pointsare needed so that the count maintains a step-like appear-ance. Because we need two extra points anyway to renderthe gap in the min-max silhouette (Section 4.3.2), this comesat no additional cost.

4.3.4 Shader ProgramsThe shader programs map the generated vertices, passed

in via a VBO, to pixels on the screen. The user’s currentview is passed to the GPU as a base vector ~vbase ∈ R2 con-taining the (time, value) coordinates of the lower left-handcoordinate of the screen, and an axis matrix A ∈ M3(R)based on the sizes of the time and value axes. To map a(time, value) coordinate ~x to its final position, we computeA · pad(~x−~vbase), where “pad” converts a vector in R2 to avector in R3 by padding it with a 1.

One subtlety is that this transformation applies to float-ing point numbers, but the timestamp is stored as a 64-bitinteger representing the number of nanoseconds since 1970.Converting this to a single-precision floating point numberwould result in an unacceptable loss of precision. To solvethis problem, we store timestamps relative to an epoch time,which is computed as the midpoint of the cache entry. Anydifferences between timestamps in the cache entry and theepoch time will be on the same order as the full time rangecurrently displayed. Therefore, the difference between thetimestamps in the cache entry and the epoch time can besafely converted to a single-precision floating point number

8

Page 9: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

without losing too much precision. The base vector is alsocomputed relative to the epoch time.

4.3.5 Additional ConsiderationsIn the rendering pipeline, each cache entry is drawn inde-

pendently. However, the aggregate at the end of one cacheentry may be temporally adjacent to the aggregate at thebeginning of the next cache entry, meaning that the twopoints should be connected. As the points will be in sepa-rate VBOs, this is not possible. To solve this problem, wemodify each query to include the aggregate immediately be-fore and the aggregate immediately after the queried timeinterval. When a cache entry is created, it checks if thecache entries immediately before it and immediately after ithave been created and connect to its “edge” points; if not,it takes ownership of connecting to those points.

Another issue is that the CPU and GPU must commu-nicate to initiate the rendering of each cache entry. Thisoverhead may become significant if many small cache entriesmust be drawn for a certain view. Our solution is to avoidmaking queries for tiny segments of data, to prevent cacheentries from becoming too fragmented. If the user scrolls bya small amount, we request the next screen width’s worth ofdata, rather than just the small interval that is newly vis-ible. This guarantees that at most five cache entries mustbe drawn to render any view of a single stream, which is agood enough upper bound to maintain qualitatively goodperformance.

4.4 User Interface LayerThe User Interface Layer provides the click-and-drag in-

terface for scrolling and the mousewheel interface for zoom-ing. In response to click-and-drag and mousewheel events,the User Interface Layer updates the user’s current view ofthe plot. Because the user interface handles user input andsees individual gestures, it is a natural place to handle re-quest throttling.

5. EVALUATIONThis section evaluates the performance of Mr. Plotter and

quantifies the extent to which it supports interactive explo-ration of big data.

5.1 Performance ModelOur primary goal is to support interactive exploration of

data, so our main metric is the time to access data. Draw-ing from models of the memory hierarchy of a computersystem, we model this as Data Access Time = Hit Time +Miss Rate ×Miss Penalty. Hit Time is the time to accessdata in the client-side in-memory cache. Miss Rate is thefraction of requests that miss in the cache. Miss Penalty isthe time to request data from BTrDB after a cache miss.

Prefetching issues requests for data preemptively. It re-duces the miss rate by eliminating compulsory misses, andreduces the miss penalty by issuing requests for data early.

Unlike memory accesses in a computer system, data ac-cesses in Mr. Plotter happen at perceptible time scales andare individually visible to the user. Therefore, the worst-caseaccess time, Hit Time + Miss Penalty, is also important.

5.2 Experimental MethodologyOur workload for characterizing Mr. Plotter is based on

Shneiderman’s information-seeking mantra: “Overview first,

zoom and filter, then details-on-demand” [31]. The user be-gins with an overview visualization that shows all of the datafor a stream. Then the user picks an interesting region of theplot, and zooms in to investigate. The user may again iden-tify an interesting region of the zoomed-in plot, and zoomin further, and repeat the process until she is looking atindividual points (the “details”).

We use a Python script to simulate this use case pro-grammatically. It chooses a random point on the time axisof the plot. Then the script programmatically moves themouse to that point (which takes 100 ms), and zooms in bya fixed amount. The script repeats this process until indi-vidual points are visible. In between iterations, it pauses,to simulate the user’s “think time” as she selects an inter-esting point to zoom in. We repeat the experiment multipletimes, varying the think time, to measure how well Mr. Plot-ter performs. For the data stream that we used, it took 16iterations to zoom in from an overview visualization com-prising 233 million points over three weeks, to individualpoints spaced 120 ms apart.

We ran BTrDB on a server with an Intel Xeon E5 2640v4 @ 2.40 GHz and network-attached storage. BTrDB wasconfigured to use no more than 64 GiB of memory. The Mr.Plotter client ran on a laptop with an Intel Core i7-7820HQ@ 2.90 GHz, and was configured to use no more than 1 GiBof memory to cache aggregates from BTrDB. The client andserver were separated by one network router, with a networkround-trip time of approximately 0.5 ms.

5.3 Hit TimeIn each simulation that we performed using the above

workload, we measure the cache lookup time, by obtaininga millisecond-precise timestamp before each cache lookupand after the requested data is available. On cache hits,these timestamps differed by at most 1 ms, and were iden-tical 98.8% of the time. This confirms that cache lookup isinstantaneous compared to fetching data from BTrDB.

5.4 Miss Rate and Miss PenaltyWe identify two metrics that determine the miss rate and

miss penalty. First, the latency of requests to BTrDB, in-cluding the query processing time in the database, deter-mines the miss penalty. Second, prefetching affects the missrate and the miss penalty—even if the prefetched results arenot available in time, the time spent waiting for them is less,because the request was issued earlier.

In our workload, the user continually moves to new data;therefore, we expect many compulsory misses. However, westill expect the cache hit rate to be nonzero, as the on-screendata may be refreshed from the cache multiple times at thesame resolution, in between times where the user crosses theresolution boundary. We quantify the effect of prefetchingby measuring the cache miss rate and miss penalty, with andwithout prefetching enabled.

5.4.1 ResultsWe ran the workload described in Section 5.2, while vary-

ing the think time from 0 ms to 500 ms. For each value ofthe think time, we repeated the experiment 10 times, andmeasured the latencies of cache lookups in all of the exper-iments. We also record the fraction of lookups that missed,and required requests to be made to BTrDB, in order tomeasure the miss rate. We carried out this process twice,

9

Page 10: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

Figure 6: Miss rate, with and without prefetching

(a) Without prefetching (b) With prefetching

Figure 7: Distribution of miss penalty

once with prefetching enabled, and once with prefetchingdisabled. The results are shown in Figures 6 and 7.

Figure 7(a) displays the miss penalty, without any prefetch-ing. Over 95% of cache misses are serviced within the in-teractive threshold of 100 milliseconds. In other words, evenif we use no caching or prefetching at all, the vast majorityof requests to BTrDB return at interactive speeds. This isa direct consequence of unifying data reduction techniquesin storage and visualization systems. Because Mr. Plotterdraws plots using aggregates that BTrDB has precomputed,it can retrieve data to draw plots at interactive speeds.

That said, the distribution has a long tail. The 99thpercentile miss penalty is approximately 400 ms. However,prefetching helps solve this problem. When the think time isat least 300 ms, the 99th percentile miss penalty is less than100 ms. Furthermore, Figure 6 also shows that prefetchingis effective at increasing the cache hit rate, especially whenthe think time is large.

It is expected that prefetching is more effective at reduc-ing the miss penalty when the think time is large, becauseprefetching only occurs once the data for the current screenhas already been fetched. When the think time is very small,the user zooms past the current screen before the data hasbeen fully loaded; therefore prefetching happens less often.Furthermore, even if the prefetching were to happen forsmall think times, it would be less useful, because the userwould move to new data very quickly. It is also expectedthat the cache miss rate is higher when the think time issmall. As the user zooms in, the on-screen data is refreshedevery 300 ms during the gesture (see Section 3.3.3), and eachrefresh counts as a separate miss, if the data is not present.The increased cache miss rate reflects the fact that the usersometimes zooms in before the data is loaded.

Figure 8: Miss rate on a flaky network

(a) Without prefetching (b) With prefetching

Figure 9: Distribution of miss penalty on a flakynetwork

5.4.2 Simulating Non-Ideal Network ConditionsAn important challenge addressed by Mr. Plotter is that,

although BTrDB may process queries at interactive speeds,factors such as server load or a flaky network may causedelays in retrieving data. Mr. Plotter’s client-side data man-agement should ideally provide interactive performance, evenunder such conditions.

We simulate a flaky network using NetEm. We add a nor-mally distributed delay, with a mean of 200 ms and a jitterof 20 ms, to each outgoing packet, and drop 2% of outgo-ing packets. We do not explicitly place a cap on bandwidth,but the variable latency and packet loss interact poorly withTCP, effectively limiting the bandwidth. Using Measure-ment Lab’s speed test hosted by Google, we measured thedownlink bandwidth to be approximately 6 Mb/s with thesimulated flaky network. Without simulating a flaky net-work, it is more than 100 Mb/s for the internet connectionthat we used. We repeated the experiments in Section 5.4.1while simulating a flaky network. The results are shown inFigures 8 and 9.

As expected, the miss penalty is much higher than in Fig-ure 7, because the network round trip to fetch data fromBTrDB is more expensive. Prefetching reduces the misspenalty when the think time is large. In particular, prefetch-ing allows the miss penalty to drop below 200 ms, in caseswhere a request for data was made earlier due to prefetch-ing. Figure 8 tells a similar story; when the think time islarge, the miss rate is significantly less, because prefetchingensures that necessary data is pre-loaded into the cache.

Prefetching performs slightly worse when the think timeis 0 ms. One explanation is that, when the user movesthrough data very rapidly, prefetching is ineffective, and

10

Page 11: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

only uses bandwidth that could be used to handle normalcache misses.

5.5 Overview VisualizationsThe latency of querying a full screen of aggregates, where

the left and right endpoints of the screen coincide with theearliest and latest data points in the stream, is shown in Ta-ble 1. We performed this experiment twice, once with theoriginal stream, and again with a derivative stream that re-peats the data of the original stream 100 times. The resultsconfirm that the performance of Mr. Plotter is essentiallyindependent of the size of the underlying data, allowing itto scale to massive data sets.

To produce this overview visualization, Mr. Plotter mustfirst make two Nearest queries to find the timestamps forthe earliest and latest points in the stream. We do not in-clude the time for these queries in Table 1, because overviewvisualizations do not always require such information (e.g.,the overview may show data over the past few hours or days,so the time range does not depend on the stream). If we hadincluded the time for these queries, the latencies would havebeen 549 ms / 371 ms and 609 ms / 365 ms.

6. RELATED WORKThis section compares Mr. Plotter to related work.

6.1 Polaris/Tableau (2002, 2008, 2011)Tableau is a commercial visualization system for RDBMS

systems that grew out of the Polaris [34, 35] project. Tableaufocuses on multidimensional databases, allowing the user tocreate a grid of plots, in an organization inspired by PivotTables.

Both Tableau and Mr. Plotter perform data managementon the client. Mr. Plotter uses an in-memory cache; Tableauclients use the Tableau Data Engine [37] to store and processdata. However, the systems’ motivations for client-side datamanagement are different. Mr. Plotter masks the latencyof database queries via its in-memory cache. In contrast,the Tableau Data Engine supports Tableau’s extract feature,which retrieves a subset of data for offline analysis.

A key difference between Tableau and Mr. Plotter is thatMr. Plotter places a greater emphasis on providing a trulyinteractive interface to explore data, whereas Tableau fo-cuses on providing a flexible, high-quality visualization. Theauthors of Tableau note that, while the user may change theview and presentation model, the resulting database querymay take several tens of seconds [34]. A visualization systemlike Tableau, that makes a query for raw data to producea visualization, becomes progressively slower as data setsgrow, because the time to process a query is linear in thenumber of records read. In contrast, BTrDB/Mr. Plotterscales linearly in only the number of aggregates drawn, andcan therefore scale to massive data sets.

6.2 ATLAS (2008)ATLAS [14] aims to provide an environment for interac-

tive exploration of massive time series. Like Mr. Plotter,ATLAS identifies the latency of querying large amounts ofdata as the key obstacle to realizing interactivity, and usesdata prefetching as a tool to mask this latency. Furthermore,Mr. Plotter and ATLAS leverage database technology toimprove performance. Mr. Plotter uses BTrDB, which usesprecomputed statistical aggregates to accelerate queries, and

ATLAS uses kdb+, which stores data in a column-orientedformat.

However, there are important differences between Mr. Plot-ter and ATLAS. First, Mr. Plotter draws the same aggre-gates that BTrDB can compute efficiently. In contrast, AT-LAS’ aggregation is computed just in time. The column-oriented techniques used by kdb+ do not carry through tothe plots produced by ATLAS. Second, as discussed in Sec-tion 3.3, the time to process queries in BTrDB/Mr. Plotteris essentially independent of the size of raw data bounded bythe query, allowing Mr. Plotter to use a simple prefetchingscheme. Third, Mr. Plotter, unlike ATLAS, does not placeany limits on how fast the user can pan or zoom, and mayshow perceptually relevant information from the previousview until data for the new view is available. Fourth, Mr.Plotter is suitable for viewing fast data that is changing,inserted out-of-order, or streaming in.

6.3 sampleAction (2012)The sampleAction [18] visualization system leverages the

idea of incremental databases pioneered by Hellerstein [21,20]. This is similar to our approach of unifying data reduc-tion strategies in database and visualization systems. How-ever, there are important differences. Unlike Mr. Plotter,which is a visualization tool for timeseries data, sampleAc-tion compares different aggregate queries in a column chart.Furthermore, the focus of [18] is not on the data manage-ment problems in building such a system, but rather on theuser experience that such a system, if built, would provide.

6.4 imMens (2013)The imMens system [25] is based on the “principle that

scalability should be limited by the chosen resolution of thevisualized data, not the number of records,” which is similarin spirit to Mr. Plotter. It provides interactive performancefor visualizing relational data. The imMens system producesbinned plots, and precomputes 3- or 4-dimensional subcubesof the overall data cube on the server. If these subcubes arestill large, they are further segmented into multivariate datatiles, which are sent to the client on demand. The clientperforms query processing on the multivariate data tiles onthe GPU, and supports brushing and linking at interactivespeeds.

The key difference between imMens and Mr. Plotter isthat imMens focuses on interactive support for brushing andlinking6, whereas Mr. Plotter focuses on allowing the userto move interactively through the data. This partially stemsfrom the fact that Mr. Plotter focuses on visualizing scalar-valued timeseries data, for which brushing and linking isnot as relevant; multiple timeseries may be related, but thecorresponding points are generally at the same timestamp,making brushing and linking less relevant.

6.5 ScalaR (2013)ScalaR [9] is an interactive visualization system for big

data. Data are stored in SciDB, and visualized in a web-based front-end written with D3 [11]. The key innovation ofScalaR is an intermediate layer that receives queries from thefront-end, requests query plans and metadata from SciDB,

6While [25] mentions that the user can pan and zoom, theinterface to doing so is not clearly mentioned, and the inter-activity of panning and zooming is not evaluated.

11

Page 12: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

and decides whether resolution reduction needs to be per-formed. While ScalaR uses data reduction to alleviate bot-tlenecks in transferring data from the server and managinglarge query results on the client, it incurs the cost of readingthe underlying raw data for each query.

6.6 M4/VDDA (2014, 2016)M4 [23] is a visualization system for timeseries data. Like

ScalaR, M4 introduces an intermediate layer between thevisualization client and the data store. The intermediatelayer transparently rewrites queries to return the first point,last point, lowest point, and highest point in each pixel col-umn, instead of all of the raw data requested by the client.The authors observe that the resulting plot is equivalent,pixel for pixel, to drawing the raw data, but with a smallerdata transfer latency and network bandwidth use. The au-thors also present Visualization-Driven Data Aggregation(VDDA) [24], a generalization of this technique to othertypes of plots. M4/VDDA has similar pitfalls to ScalaR—while the use of aggregation decreases the data transferredto and rendered by the client, the database query still oper-ates on raw data, and is therefore slow for large data sets.Furthermore, as explained in Section 3.1, a plot that is pixel-for-pixel equivalent to one containing raw data does nothingto eliminate visual clutter; plotting aggregates such as min,mean, and max, may produce a better visualization.

6.7 Skydive (2015)Perhaps the existing system most similar to Mr. Plotter

is Skydive [19]. As described in Section 1, Mr. Plotter im-proves on Skydive by providing a visualization client thatperforms automatic data management. In contrast, Skydiverequires the user to manually initiate queries by choosing a“cut” of data to visualize.

6.8 Visualization-Aware Sampling (2016)Visualization-Aware Sampling (VAS) [26] is a technique

to perform data reduction via sampling. Unlike traditionalsampling techniques, VAS does not choose a sample ran-domly; rather, it frames sample selection as an optimizationproblem, and uses an approximate solution to that problemto select the sample. The key contribution of VAS is that thesample optimizes a visualization-specific metric that ensuresthat the resulting visualization will be high-quality.

A major difference between VAS and Mr. Plotter is thatVAS uses sampling to perform data reduction. As discussedin Section 2.1.3, we prefer aggregation for several reasons.Furthermore, VAS’ optimization function prefers samplesthat, when plotted, look similar to a plot containing all datapoints. This has the same pitfall as M4: it does not reducevisual clutter. In contrast, displaying statistical aggregatesexplicitly may produce a better plot.

6.9 ASAP (2017)Automatic Smoothing for Attention Prioritization (ASAP)

[29], is a timeseries visualization operator that computes amoving average over a timeseries, to draw the user’s atten-tion to anomalies in fast data [7, 6]. The authors presentASAP as an optimization problem, where the window size ofthe moving average is chosen to minimize the standard devi-ation of deltas (roughness) while preserving kurtosis (long-term deviations) in the data.

ASAP focuses on producing a visualization that is moreuseful than simply plotting the raw data. Mr. Plotter, bydisplaying min-mean-max aggregates instead of raw data,uses a similar approach. We note that by drawing the minand max aggregates, Mr. Plotter displays a silhouette ofwhat the plot would look like if every raw point were plot-ted. The smoothed timeseries that ASAP produces does notconvey this information, though the authors note that ASAPcould be used in conjunction with a plot of raw data. Webelieve that ASAP’s goals are complementary to Mr. Plot-ter’s; one could use Mr. Plotter’s interactive exploration inconjunction with ASAP’s smoothed timeseries.

7. FUTURE WORKMr. Plotter does not address the problem of moving across

different time series—for example, starting with many time-series and finding the one that matches certain criteria. Al-gorithms for pattern search in timeseries have been studied.In TimeSearcher [22, 13], the user can draw a timebox onthe screen by clicking and dragging the mouse. This filtersout any streams that are not contained in the timebox forthe timebox’s entire horizontal duration. This generalizes tosearching for similar streams: many timeboxes can be used,perhaps one per pixel column, where the height of the time-boxes determines the tolerance of the similarity search. Mr.Plotter could process timeboxes efficiently: to determine if astream satisfies a timebox, Mr. Plotter would check whetherthe aggregates for that stream, horizontally bounded by thetimebox, have “min” and “max” values within the verticalrange of the timebox. The implementation of timeboxes inMr. Plotter is left to future work.

A peculiarity of Mr. Plotter’s cache organization (see Sec-tion 4.2.1) is that the aggregates for each resolution arestored in separate tree-based maps, and are treated indepen-dently during lookup. Due to the alignment of aggregates, itis possible to compute the aggregates at resolution x (size ofaggregates = 2x) from the values of aggregates at resolutiony (size of aggregates = 2y) for the same time range, as longas y < x. If a query for aggregates at resolution x missesin the cache, but the aggregates at resolution y are presentfor that time range, it may be more efficient, depending onthe value of x− y, to compute the aggregates at resolutionx on the client’s machine, rather than requesting them fromBTrDB. We leave the investigation and implementation ofthis idea to future work.

A related question is whether it is worth spilling the cacheto disk, rather than evicting cache entries when the allottedmemory is full. This would be beneficial on a flaky network,where querying BTrDB takes hundreds of milliseconds. Itgives the client an even bigger data management role, similarto the Tableau Data Engine [37]. One option is to simplyallow the operating system’s paging mechanisms to handlethe transfer of data from memory to disk; however, existingwork suggests that this is unlikely to be successful [15]. Aninvestigation of this idea is left to future work.

Finally, Mr. Plotter may benefit from more sophisticatedprefetching strategies (e.g., ForeCache [8]). We relegate aninvestigation of this idea to future work.

8. CONCLUSIONIn order to cope with the growing appetite for data col-

lection and storage, the database community has developed

12

Page 13: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

various techniques for working with massive data sets, in-cluding databases that quickly return statistical aggregatesor compute on samples of data. Meanwhile the visualizationcommunity has turned to plotting samples or aggregates tovisualize large data sets, in order to solve the problem of“squeezing a billion points into a million pixels” [32].

In this paper, we present Mr. Plotter, a visualization sys-tem for large timeseries data sets. By unifying the datareduction techniques used by databases with those used invisualization, we can process queries with latency propor-tional to the size of the final visualization, as opposed to thesize of the underlying raw data. This makes possible, for thefirst time, a truly interactive visualization system. The userstarts with an overview visualization, and then explores thedata by clicking and dragging the plot and zooming in orout with the mousewheel. The user does not explicitly ini-tiate queries. Instead, the visualization client requests dataautomatically, so that it is available for visualization by thetime it is needed.

Existing work has sought to support visualization of bigdata by moving computation close to the data. For example,M4 and ScalaR compute the aggregates at the database, toavoid having to transfer all of the raw data to the client.Our work extends this idea by pushing the computation ofaggregates even further back, into the data storage itself.That way, only the aggregates, and not the underlying rawdata, must be read to process queries.

But our work also represents a fundamental departurefrom existing designs, which generally decouple the visual-ization client from the underlying database. By making theclient aware of the storage technology used by the backenddatabase, we create a system that fetches data at interac-tive time scales, opening the opportunity for automatic datamanagement. This, in turn, introduces new challenges in theclient, to manage query results from the server, automati-cally initiate queries, and cope with “fast data.”

The core principle of our design—a storage-aware clientthat supports automatic data management—is applicablebeyond timeseries, to other types of data as well. Thisopens a new direction for research on scalable informationvisualization systems. We hope that the database and visu-alization communities will take these ideas even further, tosupport more effective visual data analytics and bring outthe full potential of big data.

9. ACKNOWLEDGMENTSThis material is based upon work supported by the Na-

tional Science Foundation Graduate Research Fellowship Pro-gram under Grant No. DGE-1752814. Any opinions, find-ings, and conclusions or recommendations expressed in thismaterial are those of the authors and do not necessarily re-flect the views of the National Science Foundation. Thisresearch is also supported by the U.S. Department of En-ergy ARPA-E Grant No. DE-AR0000340, National ScienceFoundation Grant No. CPS-1239552, Fulbright ScholarshipProgram, and UC Berkeley Graduate Division (Berkeley Fel-lowship for Graduate Study).

10. REFERENCES[1] S. Agarwal, B. Mozafari, A. Panda, H. Milner,

S. Madden, and I. Stoica. Blinkdb: queries withbounded errors and bounded response times on very

large data. In Proceedings of the 8th ACM EuropeanConference on Computer Systems, pages 29–42. ACM,2013.

[2] C. Ahlberg and B. Shneiderman. Visual informationseeking: Tight coupling of dynamic query filters withstarfield displays. In Proceedings of the SIGCHIconference on Human factors in computing systems,pages 313–317. ACM, 1994.

[3] M. P. Andersen. Package btrdb.https://godoc.org/gopkg.in/btrdb.v4. Online;accessed on 2017-09-02.

[4] M. P. Andersen and D. E. Culler. Btrdb: optimizingstorage system design for timeseries processing. InProceedings of the 14th USENIX Conference on Fileand Storage Technologies (FAST 16), 2016.

[5] M. P. Andersen, S. Kumar, C. Brooks, A. von Meier,and D. E. Culler. Distil: Design and implementationof a scalable synchrophasor data processing system. InSmart Grid Communications (SmartGridComm),2015 IEEE International Conference on, pages271–277. IEEE, 2015.

[6] P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong,and S. Suri. Macrobase: Prioritizing attention in fastdata. In Proceedings of the 2017 ACM InternationalConference on Management of Data, pages 541–556.ACM, 2017.

[7] P. Bailis, E. Gan, K. Rong, and S. Suri. Prioritizingattention in fast data: Principles and promise. CIDR,2017.

[8] L. Battle, R. Chang, and M. Stonebraker. Dynamicprefetching of data tiles for interactive visualization. InProceedings of the 2016 International Conference onManagement of Data, pages 1363–1375. ACM, 2016.

[9] L. Battle, M. Stonebraker, and R. Chang. Dynamicreduction of query result sets for interactivevisualizaton. In Big Data, 2013 IEEE InternationalConference on, pages 1–8. IEEE, 2013.

[10] N. Bikakis, G. Papastefanatos, M. Skourla, andT. Sellis. A hierarchical aggregation framework forefficient multilevel visual exploration and analysis.Semantic Web, 8(1):139–179, 2017.

[11] M. Bostock, V. Ogievetsky, and J. Heer. D3:data-driven documents. IEEE transactions onvisualization and computer graphics,17(12):2301–2309, 2011.

[12] M. Buevich, A. Wright, R. Sargent, and A. Rowe.Respawn: A distributed multi-resolution time-seriesdatastore. In Real-Time Systems Symposium (RTSS),2013 IEEE 34th, pages 288–297. IEEE, 2013.

[13] P. Buono, A. Aris, C. Plaisant, A. Khella,B. Shneiderman, H. Hochheiser, and B. Schneiderman.Interactive pattern search in time series. In Proc.SPIE Conference on Visual Data Analytics, volume5669, pages 175–186, 2005.

[14] S.-M. Chan, L. Xiao, J. Gerth, and P. Hanrahan.Maintaining interactivity while exploring massive timeseries. In Visual Analytics Science and Technology,2008. VAST’08. IEEE Symposium on, pages 59–66.IEEE, 2008.

[15] M. Cox and D. Ellsworth. Managing big data forscientific visualization. In ACM Siggraph, volume 97,pages 21–38, 1997.

13

Page 14: Mr. Plotter: Unifying Data Reduction Techniques in Storage and Visualization Systems - People @ EECS at UC …samkumar/projects/... · 2019. 6. 6. · Mr. Plotter: Unifying Data Reduction

[16] Q. Cui, M. Ward, E. Rundensteiner, and J. Yang.Measuring data abstraction quality in multiresolutionvisualizations. IEEE Transactions on Visualizationand Computer Graphics, 12(5):709–716, 2006.

[17] N. Elmqvist and J.-D. Fekete. Hierarchicalaggregation for information visualization: Overview,techniques, and design guidelines. IEEE Transactionson Visualization and Computer Graphics,16(3):439–454, 2010.

[18] D. Fisher, I. Popov, S. Drucker, et al. Trust me, i’mpartially right: incremental visualization lets analystsexplore large datasets faster. In Proceedings of theSIGCHI Conference on Human Factors in ComputingSystems, pages 1673–1682. ACM, 2012.

[19] P. Godfrey, J. Gryz, P. Lasek, and N. Razavi.Interactive visualization of big data. In BeyondDatabases, Architectures and Structures. AdvancedTechnologies for Data Mining and KnowledgeDiscovery, pages 3–22. Springer, 2015.

[20] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber,C. Olston, V. Raman, T. Roth, and P. J. Haas.Interactive data analysis: The control project.Computer, 32(8):51–59, 1999.

[21] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Onlineaggregation. In ACM SIGMOD Record, volume 26,pages 171–182. ACM, 1997.

[22] H. Hochheiser and B. Shneiderman. Dynamic querytools for time series data sets: timebox widgets forinteractive exploration. Information Visualization,3(1):1–18, 2004.

[23] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl.M4: a visualization-oriented time series dataaggregation. Proceedings of the VLDB Endowment,7(10):797–808, 2014.

[24] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl.Vdda: automatic visualization-driven dataaggregation in relational databases. The VLDBJournal, 25(1):53–77, 2016.

[25] Z. Liu, B. Jiang, and J. Heer. immens: Real-timevisual querying of big data. In Computer GraphicsForum, volume 32, pages 421–430. Wiley OnlineLibrary, 2013.

[26] Y. Park, M. Cafarella, and B. Mozafari.Visualization-aware sampling for very large databases.In Data Engineering (ICDE), 2016 IEEE 32ndInternational Conference on, pages 755–766. IEEE,2016.

[27] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J.

DeWitt, S. Madden, and M. Stonebraker. Acomparison of approaches to large-scale data analysis.In Proceedings of the 2009 ACM SIGMODInternational Conference on Management of data,pages 165–178. ACM, 2009.

[28] D. Rafiei. Effectively visualizing large networksthrough sampling. In Visualization, 2005. VIS 05.IEEE, pages 375–382. IEEE, 2005.

[29] K. Rong and P. Bailis. Asap: Prioritizing attention viatime series smoothing. Proceedings of the VLDBEndowment, 10(11), 2017.

[30] B. Shneiderman. Dynamic queries for visualinformation seeking. IEEE software, 11(6):70–77, 1994.

[31] B. Shneiderman. The eyes have it: A task by datatype taxonomy for information visualizations. InVisual Languages, 1996. Proceedings., IEEESymposium on, pages 336–343. IEEE, 1996.

[32] B. Shneiderman. Extreme visualization: squeezing abillion records into a million pixels. In Proceedings ofthe 2008 ACM SIGMOD international conference onManagement of data, pages 3–12. ACM, 2008.

[33] I. Stoica. For big data, moore’s law means betterdecisions. https://amplab.cs.berkeley.edu/for-big-data-moores-law-means-better-decisions/.Online; accessed on 2017-09-07.

[34] C. Stolte, D. Tang, and P. Hanrahan. Polaris: Asystem for query, analysis, and visualization ofmultidimensional relational databases. IEEETransactions on Visualization and ComputerGraphics, 8(1):52–65, 2002.

[35] C. Stolte, D. Tang, and P. Hanrahan. Polaris: asystem for query, analysis, and visualization ofmultidimensional databases. Communications of theACM, 51(11):75–84, 2008.

[36] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen,M. Cherniack, M. Ferreira, E. Lau, A. Lin,S. Madden, E. O’Neil, et al. C-store: acolumn-oriented dbms. In Proceedings of the 31stinternational conference on Very large data bases,pages 553–564. VLDB Endowment, 2005.

[37] R. Wesley, M. Eldridge, and P. T. Terlecki. Ananalytic data engine for visualization in tableau. InProceedings of the 2011 ACM SIGMOD InternationalConference on Management of data, pages 1185–1194.ACM, 2011.

[38] M. Zaharia, M. Chowdhury, M. J. Franklin,S. Shenker, and I. Stoica. Spark: Cluster computingwith working sets. HotCloud, 10(10-10):95, 2010.

14