[acm press the acm 14th international workshop - glasgow, scotland, uk (2011.10.28-2011.10.28)]...



Data Visualization on Web-based OLAP

Tim Hsiao, Wo-Shun Luk
School of Computing Science, Simon Fraser University
8888 University Drive, Burnaby BC, Canada V5A 1S6
{woshun, tih1}@cs.sfu.ca

Stephen Petchulat
SAP - BusinessObjects
910 Mainland Street, Vancouver, BC, Canada
[email protected]

ABSTRACT A web-based OLAP client typically runs inside a generic browser on a web-based, resource-constrained device. Currently, this client is responsible only for displaying what the OLAP server renders. We build a functional prototype of a client-centric OLAP system, which features customized middleware on the server side and a web client incorporating a lightweight, in-memory OLAP data/query engine. We develop three interactive data visualization tools that run against the data engine on the client side. Our experimental results show that, in comparison to the traditional server-centric model, our client-centric OLAP model is clearly superior and capable of delivering a higher level of user satisfaction. We also present a novel technique for in-client aggregation over a large set of data items represented in a scatter plot.

Categories and Subject Descriptors H.2.7 [Database Administration]: Data warehouse and repository; H.3.5 [Online Information Services]: Web-based services; H.4.3 [Communications Applications]: Information browser.

General Terms Measurement, Performance, Design, Experimentation.

Keywords OLAP, Data Visualization, Web-Based Client, HTML5, CANVAS, SVG.

1. BACKGROUND AND MOTIVATION 1.1 OLAP Clients Clients in a typical client-server OLAP system can be classified into two types: web-based clients and desktop-based clients. Applications running on a web-based client are downloaded from the server side into a generic browser at the beginning of a client/server session. They are mostly written in Dynamic HTML, i.e., HTML4+JavaScript, occasionally mixed with some Java code. In contrast, the applications on desktop-based clients are installed on the desktops. They may be customized applications, but are often off-the-shelf packaged applications, such as the Microsoft Office suite. These two types of client cater to different classes of audience. While sophisticated users, such as analysts, need desktop-based applications, web-based applications appeal to a wide variety of users because they are convenient to use. There are no applications to install or upgrade. They run on practically any web-enabled device, anywhere an internet connection is available. In addition, the user does not have to worry about management and security issues [1].

With the advent of 64-bit computing, rapidly declining memory prices and multi-core CPUs, many desktop computers are almost as powerful as an OLAP server. The analyst may explore the OLAP data cube in the memory of a powerful desktop, and pose “what-if” questions without re-doing the entire cube on the server. The PowerPivot [2] extension for Excel 2010 from Microsoft is such an OLAP system. QlikView [3] and SAP BusinessObjects Web Intelligence [4] best exemplify client-side in-memory OLAP applications that run in a desktop environment.

Meanwhile, web-based applications have undergone a revolution of their own with the introduction of AJAX technology. A few years ago, a new type of web-based application emerged, called the Rich Internet Application, or RIA. A Flex [5], Silverlight [6], or JavaFX Rich Internet Application [7], which is downloadable from a web server, runs as a browser plug-in but with features of a desktop-based application (such as access to the local file system with an elevated trust level, and a plethora of graphical functions for data visualization). However, these plug-ins are proprietary products. With Adobe, Sun and Microsoft being the major vendors of these plug-ins, they rely on collaboration between vendors to run on certain operating system and browser platforms. Most web developers prefer to avoid such dependencies by running their web applications natively in the browser. JavaScript has been the predominant programming language for browsers since the nineties and naturally became the undisputed client-side language for web browsers after the failure of the Java applet. Consequently, browser vendors have been continually improving the speed of their JavaScript engines as the W3C updates the HTML specification.

Although HTML5 is still in the draft stage, several vendors have already implemented features from the specification in the latest versions of their browsers: the Canvas element for dynamic rendering of 2D shapes directly with JavaScript, persistent database storage that embeds a lightweight relational database engine, and Web Workers that allow long-running, computationally intensive scripts to execute in parallel with the main thread. Overall, these proposals promote development of complex web applications on the browser platform.

With the proliferation of mobile web devices, the challenge is to make web-based applications work more like desktop-based ones, without (i) compromising their ease-of-use characteristics, and (ii) deviating from the standard web infrastructure. That is the motivation for this research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DOLAP’11, October 28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0963-9/11/10...$10.00.


1.2 Web-based OLAP In this paper, we describe the design, implementation and initial experiences with Web-based OLAP, which contains a lightweight OLAP engine running inside a generic browser on a web-based resource-constrained device. The purpose of this project is two-fold. On the server side, with Web-based OLAP sharing the workload of the OLAP server, the server will be able to handle more web-based devices. On the client side, it will enhance the user's experience on the web device by bringing a high degree of interactivity with the OLAP system without overloading the OLAP server.

Consider the following realistic scenario for an electronics store chain: a store manager uses an OLAP client application to connect to a remote corporate OLAP server, with a round-trip network latency of 120 ms. A “thin” client may be suitable when users do not need real-time feedback on their queries, or when queries are infrequent. However, if the manager wishes to perform further operations on the data (such as swapping rows and columns, viewing the total sales of a particular store with respect to each product category, or applying a date range filter to the data), the OLAP client must reconstruct the query and send it to the server again. A rapid succession of such operations would make the 120 ms lag conspicuous. The lack of smoothness is most apparent when an analyst binds a slider, a highly interactive and desirable control in data visualization, to a query as a sliding window filter. For example, the query can be the top-k products sold in the past m days or the m-day moving averages of store sales, in which the slider bar is used to adjust the value of m and each movement can potentially affect the outcome drastically. In this application, each tick in the slider movement hits the server with a new request, and the load is multiplied by the number of concurrent users. Even if the server is capable of serving all these simultaneous requests through software or hardware enhancements, the consistent network latency or, even worse, high packet delay variation over the Internet is beyond any organization’s control. This variable alone increases the query response time and limits how smooth the data visualization can be. This phenomenon motivates us to utilize client-side resources to bridge the gap. In this paper, we compare data visualization between the traditional client-server OLAP system and our architecture, and show that by delegating data processing to the client side, we eliminate network latency from the equation, facilitating smooth data visualization.

In advanced analytics, the user may want to zoom in on a scatter plot to drill down on specific regions of the plot. For example, the user may want to aggregate over a group of visible outliers, or quickly determine the contribution of a set of points relative to other selected regions [8]. This is not only impossible for thin browser clients, but also a challenging programming task in JavaScript. Utilizing the new HTML5 features, we develop a novel aggregation technique for the traditional scatter plot that is achievable only with a client-centric model and the new HTML5 Canvas element.

The contributions of this research include:

(i) A client-based OLAP data engine, which is built within the standard web infrastructure, and

(ii) Novel data visualization techniques that are optimized for running on this data engine.

1.3 Related Work To the best of our knowledge, there has not been any published research on client-side OLAP that runs in a restrictive web-based environment. The closest work that comes to mind is the Adobe Flex 3 AdvancedDataGrid control, which mimics the functionality of client-side aggregation on relational data. However, as stated by the Flex development team at the time of this writing, the current AdvancedDataGrid cannot handle more than a few thousand records even though it runs as a compiled binary, and users must explicitly define relationships between the data sources [9].

1.4 Organization of this Paper The remainder of this paper is organized as follows. In Section 2, we present a hybrid OLAP architecture that differs from the traditional client-server OLAP system, and describe its server components. The client of this architecture is introduced in Section 3, where we also show how the client and the middleware on the server work together for the client to download data and metadata from the OLAP server. In Section 4, we demonstrate the capability of the client-centric model by implementing several calculations commonly used by power users. We compare our results to a simulated server-centric model not only in terms of aggregation performance, but also in terms of the client’s ability to present graphical visualizations. Finally, we summarize our contributions in Section 5.

2. WEB-BASED OLAP ARCHITECTURE Figure 1 shows the component layout in the architecture we implemented.

Figure 1: Client-Centric Architecture

Message exchange between entities in the entire system is over the HTTP transport layer. Although not explored in this research, this architecture can benefit from HTTP features such as caching and security enforcement.

In the rest of this section, we focus our discussion on the three server components outside the OLAP server.

2.1 XML for Analysis Layer The XMLA Application Programming Interface is an open standard which is adopted by all major OLAP service providers [10]. It allows application software to communicate with a data provider through web standards: HTTP, SOAP, and XML. XMLA is essentially a SOAP envelope with XML content sent over the HTTP Transport Layer. In the XML content, we can specify two types of operations: Discovery and Execute [11]. Discovery messages allow us to obtain dimension, hierarchy, member, and measure metadata directly from the OLAP server without prior knowledge of the relations or explicit manual definition. Execute messages allow execution of any valid MDX (MultiDimensional eXpression) query against the OLAP server.
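As an illustrative sketch (not the code used in this work), the Execute message described above can be pictured as an MDX statement wrapped in a SOAP envelope. The helper names buildExecuteEnvelope and escapeXml, the catalog name, and the sample MDX are hypothetical; the element names and namespaces are taken from the public XMLA specification, not from this paper.

```javascript
// Sketch only: wrap an MDX statement in an XMLA Execute SOAP envelope.
// buildExecuteEnvelope and escapeXml are hypothetical helper names.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function buildExecuteEnvelope(mdx, catalog) {
  return [
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">',
    "  <soap:Body>",
    '    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">',
    "      <Command><Statement>" + escapeXml(mdx) + "</Statement></Command>",
    "      <Properties><PropertyList>",
    "        <Catalog>" + escapeXml(catalog) + "</Catalog>",
    "        <Format>Multidimensional</Format>",
    "      </PropertyList></Properties>",
    "    </Execute>",
    "  </soap:Body>",
    "</soap:Envelope>"
  ].join("\n");
}

// Made-up cube and catalog names, for illustration only.
const envelope = buildExecuteEnvelope(
  "SELECT {[Measures].[Amount]} ON AXIS(0) FROM [Sales]",
  "SalesCatalog"
);
```

The resulting string would be sent as the body of an HTTP POST to the XMLA endpoint; escaping the statement keeps characters such as & and < in member names from breaking the envelope.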

[Figure 1 labels: JavaScript OLAP Client; Server Components (RESTful Web Service, OLAP Data Access, XML for Analysis); OLAP Server]

In the ‘thin’ client environment, if a user wishes to perform other operations such as drill-down or changing filter values, a subsequent query must be sent for each user action. In our architecture, we use the Execute message to obtain the lowest level of aggregation whenever possible. The result of an XMLA Execute message is highly detailed. In addition to the partially aggregated data we need, it includes detailed axes information associated with the result set. This information is stripped from the message before the dataset is sent to the client, both to minimize transmission time and to avoid the overhead of parsing it in JavaScript.

When XMLA receives an MDX query request, it returns a multidimensional result set (sub-cube). As we need a data structure closely resembling the fact table used by the data engine on the client side, a transformation process is required. This process is explained in section 3.2.1.

2.2 OLAP Data Access Layer (ODA) The OLAP Data Access Layer is a server-vendor-agnostic layer for accessing OLAP data sources, developed by SAP - BusinessObjects. It is a Java library that implements the XMLA protocol and provides methods for subsequent layers to consume XMLA without dealing with its low-level details. The main function of the ODA library is to parse XMLA results and return convenient Java objects such as Cube, Hierarchy, Hierarchy Level, Tuple, MemberSet, and Member.

2.3 RESTful Web Service Layer (RWS) Representational State Transfer, or REST, is a coordinated set of architectural constraints that focuses on minimizing network latency and communication while maximizing the independence and scalability of component implementations [12]. A web application is said to be RESTful if it conforms to the following constraints: client-server, stateless, cacheable, uniform interface, and layered system [13].

The RESTful Web Service (RWS) Layer is a middle tier between the JavaScript OLAP Client and the ODA Layer. This layer is created using JRuby, a Java implementation of Ruby. It provides server-side support for the JavaScript OLAP Client in the following respects: (i) metadata retrieval; (ii) transformation of multidimensional data in a sub-cube into a fact table; (iii) data caching; and (iv) dataset compression. The last two functions will not be discussed further in this paper, although they are crucial to the overall efficiency of the system.

3. JAVASCRIPT OLAP CLIENT (JSOC) JSOC, as the name implies, resides in a generic browser. It is downloaded from RWS at the beginning of a client/server session. JSOC does local processing, and maintains connection with the OLAP server via the server component RWS. It is also responsible for interfacing with the human user.

The main processing module on the client side is the JavaScript Analytic Data Engine (JADE), which performs OLAP operations on the client side (see Section 3.1). JSOC also works with RWS to download a sub-cube and the associated metadata, which are consumed by the JADE (see Section 3.2).

The UI object in JSOC provides a basic HTML interface that allows the user to select which dataset and metadata to fetch from the server, displays dimensions and members in drop-down boxes, and renders the output from JADE, e.g., a pivot table. We integrated a slider bar from jQuery UI [14] into a data visualization application to act as a range filter (see Section 4.2). When the user moves the slider thumb, JADE re-aggregates the data and the UI object redraws the pivot table.

3.1 JavaScript Analytic Data Engine (JADE) JADE implements five common OLAP operations: Cross-tab, Drill-down, Roll-up, Slice-and-Dice, and Pivot. We focus on Cross-tab here, since the rest are straightforward in comparison. These operations are performed on FactTable, a partially pre-aggregated dataset from the OLAP database. This dataset is stored locally as a relational table with n columns, where n is the total number of dimensions plus the total number of measures. In order to perform these operations properly, JADE relies on knowledge of the dimension hierarchies, i.e., the metadata. For example, it would be impossible to drill down without knowing the ancestor/descendant relationships of the members in the same dimension hierarchy.

We adopt the open source TreeView component from Yahoo! UI 2 [15] to maintain the metadata. It provides an intuitive way to select nodes from the dimensions and hierarchies, giving the user a familiar interface for exploring the OLAP server before formulating and submitting the query that retrieves a partially pre-aggregated dataset (see the module in the lower left corner of Fig. 2). In addition, it provides the following: a set of API methods to instantiate a tree data structure and tree nodes as objects; methods to access a node's level, its children, and its ancestor at a particular level; the ability to add custom properties to a node; and dynamic loading of a sub-tree on node expansion. We extend the component with additional recursive functions for retrieving the leaf nodes of a given sub-tree, retrieving the descendant nodes at a specific depth below a specific node, and obtaining the size of a sub-tree, as well as a hash map from member keys to the actual node objects.
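The recursive helpers listed above might look roughly like the following sketch. Plain objects stand in for TreeView nodes here; makeNode, getLeaves, getDescendantsAtDepth, subtreeSize and buildKeyToNodeMap are hypothetical names chosen for illustration and are not part of the YUI API or of the actual JSOC code.

```javascript
// Sketch only: plain objects stand in for TreeView nodes (not the YUI API).
function makeNode(key, parent) {
  const node = {
    key: key,
    parent: parent || null,
    children: [],
    depth: parent ? parent.depth + 1 : 0
  };
  if (parent) parent.children.push(node);
  return node;
}

// Leaf nodes of the sub-tree rooted at node.
function getLeaves(node) {
  if (node.children.length === 0) return [node];
  return node.children.flatMap(getLeaves);
}

// Descendants exactly `levels` levels below node.
function getDescendantsAtDepth(node, levels) {
  if (levels === 0) return [node];
  return node.children.flatMap(c => getDescendantsAtDepth(c, levels - 1));
}

// Number of nodes in the sub-tree, including its root.
function subtreeSize(node) {
  return 1 + node.children.reduce((n, c) => n + subtreeSize(c), 0);
}

// Map from member key to node object, for constant-time lookup.
function buildKeyToNodeMap(root, map = new Map()) {
  map.set(root.key, root);
  root.children.forEach(c => buildKeyToNodeMap(c, map));
  return map;
}

// Made-up store hierarchy for illustration.
const allStores = makeNode("All Stores");
const bcNode = makeNode("BC", allStores);
const ontarioNode = makeNode("Ontario", allStores);
makeNode("Vancouver", bcNode);
makeNode("Ottawa", ontarioNode);
makeNode("Toronto", ontarioNode);
```

With this sample tree, getLeaves(allStores) yields the three city nodes and buildKeyToNodeMap(allStores) gives direct access to any node by its key.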

A cross-tab (or pivot table) is a summarization of a measure based on two dimensions, labeled as the row and column axes. The Cross-tab function takes the following input parameters from the user: (i) the dimension hierarchies rowAxis and colAxis, given as column numbers of the FactTable; (ii) one node in each of the two hierarchies, i.e., rowAxisNode and colAxisNode; and (iii) measureIndex, given as a column number of the FactTable.

With these input parameters, JADE returns as the answer a matrix with r rows and s columns, where r and s are the numbers of children of rowAxisNode and colAxisNode respectively. The actual computation of the answer object is shown in the algorithm below, i.e., Algorithm getMatrixAnswer. To improve the speed, we create a KeyToNodeMap index to obtain a quick reference to the associated node object given the dimension, hierarchy and member keys. Note that for the cross-tab axis nodes, no lookup is required, as we obtain the actual references to the axis nodes when the user selects them from the TreeView interface. The getAncestor method of a TreeView node object looks up the ancestor of the node at a certain depth. Here, we use it to determine whether the given node has an ancestor that is a child of rowAxisNode/colAxisNode (hence at AxisNode.depth + 1).


Algorithm getMatrixAnswer
For i = 1 to FactTable.length
    currentTuple = FactTable[i]
    rowMember = currentTuple[rowAxis]
    colMember = currentTuple[colAxis]
    // look up the actual nodes in TreeView using member names
    rowMemberNode = KeyToNodeMap(rowMember)
    colMemberNode = KeyToNodeMap(colMember)
    row = rowMemberNode.getAncestor(rowAxisNode.depth + 1)
    col = colMemberNode.getAncestor(colAxisNode.depth + 1)
    // only tuples that fall under the selected axis nodes contribute
    if (row is a property of answer) AND (col is a property of answer[row])
        answer[row][col] += currentTuple[measureIndex]
End For
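A runnable sketch of the same idea in JavaScript is given below. It is a reconstruction under stated assumptions rather than the JADE source: TreeNode is a plain stand-in for a TreeView node, keyToNode is assumed to hold one Map per fact-table column, and the hierarchies and fact rows are made up.

```javascript
// Sketch only: TreeNode is a plain stand-in for a TreeView node, not the YUI API.
class TreeNode {
  constructor(key, parent = null) {
    this.key = key;
    this.parent = parent;
    this.depth = parent ? parent.depth + 1 : 0;
    this.children = [];
    if (parent) parent.children.push(this);
  }
  // Returns the ancestor (or the node itself) at the given depth,
  // or null when the node is shallower than that depth.
  getAncestor(depth) {
    let node = this;
    while (node !== null && node.depth > depth) node = node.parent;
    return node !== null && node.depth === depth ? node : null;
  }
}

// keyToNode[column] is assumed to be a Map from member key to TreeNode.
function getMatrixAnswer(factTable, keyToNode, rowAxis, colAxis,
                         rowAxisNode, colAxisNode, measureIndex) {
  // One cell per (row child, column child) pair, initialised to 0.
  const answer = {};
  for (const r of rowAxisNode.children) {
    answer[r.key] = {};
    for (const c of colAxisNode.children) answer[r.key][c.key] = 0;
  }
  for (const tuple of factTable) {
    const rowMemberNode = keyToNode[rowAxis].get(tuple[rowAxis]);
    const colMemberNode = keyToNode[colAxis].get(tuple[colAxis]);
    if (!rowMemberNode || !colMemberNode) continue;
    const row = rowMemberNode.getAncestor(rowAxisNode.depth + 1);
    const col = colMemberNode.getAncestor(colAxisNode.depth + 1);
    // Skip tuples that do not fall under the selected axis nodes.
    if (row && col && row.key in answer && col.key in answer[row.key]) {
      answer[row.key][col.key] += tuple[measureIndex];
    }
  }
  return answer;
}

// Made-up hierarchies and fact rows.
const products = new TreeNode("All Products");
const mouse = new TreeNode("Generic Mouse", products);
const keyboard = new TreeNode("Generic Keyboard", products);
const stores = new TreeNode("All Stores");
const bc = new TreeNode("BC", stores);
const ontario = new TreeNode("Ontario", stores);
const vancouver = new TreeNode("Vancouver", bc);
const ottawa = new TreeNode("Ottawa", ontario);
const toronto = new TreeNode("Toronto", ontario);

const keyToNode = {
  0: new Map([mouse, keyboard].map(n => [n.key, n])),
  1: new Map([vancouver, ottawa, toronto].map(n => [n.key, n]))
};
const factTable = [
  ["Generic Mouse", "Vancouver", 100],
  ["Generic Keyboard", "Ottawa", 40],
  ["Generic Keyboard", "Toronto", 60]
];
const answer = getMatrixAnswer(factTable, keyToNode, 0, 1, products, stores, 2);
```

Passing ontario instead of stores as the column axis node drills down to the city level using the same fact rows, with no further server request; the Vancouver tuple is then skipped because its ancestor at that depth is not a child of Ontario.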

3.2 Client/Server Connection The RWS provides methods that are exposed to the JavaScript client in the form of URLs. JSOC sends a request for service to RWS via an XMLHttpRequest object, which is available in every browser. RWS in turn invokes ODA methods to send an XMLA Discovery message. When the ODA component returns a response to the RWS, the RWS may process the response further before the result is forwarded to JSOC as a JavaScript object.

Before we describe the processing of the OLAP data and metadata by RWS, we need to describe the format in which the data and the associated metadata are communicated between RWS and JSOC. JavaScript Object Notation (JSON) is an open standard object declaration convention designed for data interchange. JSON is a ‘fat free’ alternative to XML [16], and it is a subset of JavaScript.
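To make the format concrete, the sketch below serializes a small table to JSON text and parses it back, as a browser client would do with a response body. The payload layout (a header row followed by data rows) and the sample values are assumptions made for illustration only.

```javascript
// Hypothetical payload: a header row followed by data rows.
const responseText = JSON.stringify([
  ["[Product].[All Products]", "[Store].[All Stores]", "[Delivery Date].[Date]", "[Amount]"],
  ["Generic Mouse", "[BC]&[Vancouver]", "2007/12/15", 123.45],
  ["Generic Keyboard", "[Ontario]&[Ottawa]", "2007/12/16", 44.15]
]);

// On the client, one call turns the text into a two-dimensional array.
const table = JSON.parse(responseText);
const header = table[0];
const rows = table.slice(1);
```

Because the text is already valid JavaScript literal syntax, no custom parser is needed on the client.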

3.2.1 Retrieval of FactTable It is common practice for a server-centric system to obtain only the data required to answer an MDX query. Here, we use the MDX query to obtain the lowest level of aggregation whenever possible, so that if the user wishes to perform other operations such as drill-down or changing filter values, no subsequent queries need to be sent. Consequently, once the measures and the axes of interest have been determined, the JSOC formulates the following MDX query, embedded in an XMLA Execute message, to fetch a sub-cube:

Select {measure_1, …, measure_m} on axis(0),
    descendants([dim_1].[hier_1].[root], distance_1, Leaves) on axis(1),
    …
    descendants([dim_n].[hier_n].[root], distance_n, Leaves) on axis(n)
From [cube]

With this query, we fetch members from as far down the hierarchy tree as possible. By specifying the absolute distance and the keyword Leaves, the descendants function returns the set of member nodes that are at distance distance_i from [dim_i].[hier_i].[root] and, in addition, the leaf nodes that are closer than distance_i to the root.
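A client could assemble such a query string along the following lines. This is a minimal sketch: buildSubCubeQuery is a hypothetical helper name, and the cube, measure and hierarchy names are made up.

```javascript
// Sketch only: assemble the sub-cube MDX query from axis descriptions.
// buildSubCubeQuery is a hypothetical helper, not part of JSOC.
function buildSubCubeQuery(cube, measures, axes) {
  const clauses = ["SELECT {" + measures.join(", ") + "} ON AXIS(0)"];
  axes.forEach((axis, i) => {
    clauses.push("DESCENDANTS(" + axis.root + ", " + axis.distance +
                 ", LEAVES) ON AXIS(" + (i + 1) + ")");
  });
  // Axis clauses are comma-separated; FROM follows without a comma.
  return clauses.join(",\n") + "\nFROM " + cube;
}

// Made-up names for illustration.
const mdx = buildSubCubeQuery("[Sales]", ["[Measures].[Amount]"], [
  { root: "[Product].[All Products]", distance: 2 },
  { root: "[Store].[All Stores]", distance: 2 }
]);
```

Each dimension of interest occupies its own axis, so the number of axes in the result is one more than the number of dimensions requested.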

The result set is a sub-cube of [cube], which is essentially a list of cells. Each cell is identified by an integer CellOrdinal, and contains the value(s) of the measure(s), e.g.,

<Cell CellOrdinal="1793">
  <Value xsi:type="xsd:double">3.4567E2</Value>
  <FmtValue>$345.67</FmtValue>
</Cell>

To minimize the modifications required on the JavaScript OLAP Client, so that it can perform aggregation as if the multidimensional result set were a fact table, RWS transforms the sub-cube into a fact table on the RWS Layer. Each CellOrdinal is associated with a tuple/cell in the sub-cube space. Conversely, given the CellOrdinal number of a tuple, the number of axes, and the cardinality of the member set for every axis, we can determine the tuple ordinal (measure_1, …, measure_m, s_1, …, s_n), where s_i = [dimension_i].[hierarchy_i], using the equation in [17]. We insert this tuple as a row into a sub-cube fact table with the reduced UniqueMemberName as column values. We reduce the UniqueMemberName for two reasons: first, to minimize the amount of data transmitted, and second, because the fully qualified name is redundant, as we already know the dimension and hierarchy from the column header. The dimension and hierarchy names of the axes serve as the headings of the sub-cube fact table. Table 1 shows portions of a sub-cube table obtained by transforming the multidimensional result set.
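The equation from [17] is not reproduced in this text. The sketch below assumes the usual mixed-radix convention in which axis 0 varies fastest, i.e., ordinal = i_0 + i_1·n_0 + i_2·n_0·n_1 + …, where i_k is the tuple index on axis k and n_k is the number of tuples on that axis. The function names are hypothetical.

```javascript
// Sketch only, assuming axis 0 varies fastest:
//   ordinal = i0 + i1*n0 + i2*n0*n1 + ...
// Decode a CellOrdinal into one tuple index per axis.
function ordinalToIndices(cellOrdinal, axisSizes) {
  const indices = [];
  let rest = cellOrdinal;
  for (const size of axisSizes) {
    indices.push(rest % size);
    rest = Math.floor(rest / size);
  }
  return indices;
}

// Inverse mapping, useful as a consistency check.
function indicesToOrdinal(indices, axisSizes) {
  let ordinal = 0;
  let stride = 1;
  indices.forEach((idx, k) => {
    ordinal += idx * stride;
    stride *= axisSizes[k];
  });
  return ordinal;
}

// Three axes holding 2, 3 and 4 tuples respectively.
const decoded = ordinalToIndices(17, [2, 3, 4]);
```

Here 17 decodes to indices 1, 2 and 2, since 1 + 2·2 + 2·(2·3) = 17; each index then selects the member on its axis that fills the corresponding fact-table column.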

Table 1: the FactTable

[Product].[All Products]    [Store].[All Stores]     [Delivery Date].[Date]    [Amount]
Generic Mouse               [BC]&[Vancouver]         2007/12/15                123.45
Generic Keyboard            [Ontario]&[Ottawa]       2007/12/16                44.15
Generic Keyboard            [Ontario]&[Toronto]      2007/12/16                533.21
Generic Keyboard            [California]&[Irvine]    2007/12/18                345.67

After the RWS creates the sub-cube fact table, it serializes it to a JSON string before sending it to the client. The JSOC uses the JSON.parse method to instantiate the XMLHttpRequest response string directly as a two-dimensional Array object.

3.2.2 Retrieval of MetaData Upon selecting a sub-cube, the JSOC needs to request from RWS a list of related dimensions and their hierarchies. Hierarchy trees, such as the product hierarchy of a supermarket, may contain members on the order of tens of thousands. We must take precautions to ensure that the browser does not crash when instantiating a huge hierarchy tree. Our solution requires careful coordination between JSOC and RWS.

On the client side, the actual content of each hierarchy is not loaded from the server preemptively; otherwise, loading would take minutes, depending on the size of the entire metadata. Instead, the TreeView component allows us to supply our own loading function, which is invoked dynamically when the user expands a node. When node expansion triggers the loading function, the JavaScript OLAP Client requests a hierarchy of acceptable height from the web service layer.

Upon the request of a hierarchy h from dimension d, the RWS obtains the hierarchy level breakdown from the ODA Layer. With knowledge of the member count on every level of h and a predetermined tolerance τ, the RWS selects a sub-tree t of h from the root r to a level l where the member count in l is at most τ.

Although the choice of τ is not part of our research, it should be a function of several parameters: browser version, JavaScript engine, and device platform (such as desktop/laptop/tablet/smartphone). One can even incorporate user profiling, for example by not preemptively fetching low-granularity data that the user has never drilled down to before. There is a trade-off in the value of τ: the higher the tolerance, the larger the sub-cube provided by the RWS, which offers aggregation at low granularity. With a low tolerance, the RWS can only provide aggregation at higher levels of the hierarchy and shallow drill-downs, although it is capable of providing reasonable responses to devices with slower JavaScript engines and hardware. In real-life data warehouses, usually the product and customer dimensions are on such a scale and, even in such cases, obsolete products and inactive customers can be pruned from the hierarchy trees.
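One possible reading of the level selection described above is sketched below: walk down from the root and stop at the first level whose member count exceeds τ. This interpretation, the function name selectCutoffLevel, and the sample member counts are all assumptions; the text only states that the chosen level holds at most τ members.

```javascript
// Sketch only. levelCounts[d] = number of members at depth d (root is depth 0).
// Returns the deepest level such that it and every level above it hold at most
// tau members, or -1 if even the root level exceeds tau.
function selectCutoffLevel(levelCounts, tau) {
  let level = -1;
  for (let d = 0; d < levelCounts.length; d++) {
    if (levelCounts[d] > tau) break;
    level = d;
  }
  return level;
}

// Made-up member counts for a five-level product hierarchy.
const productLevels = [1, 12, 140, 9500, 42000];
const cutoffForSlowDevice = selectCutoffLevel(productLevels, 5000);
const cutoffForFastDevice = selectCutoffLevel(productLevels, 10000);
```

With these sample counts, a tolerance of 5000 stops at level 2 and a tolerance of 10000 reaches level 3, which illustrates the trade-off between drill-down depth and the amount of metadata the browser must hold.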

Once a sub-tree is determined, the RWS generates a list of hierarchy members by a pre-order traversal of the tree. For each hierarchy member, we include the HierarchyLevel, MemberCaption, and UniqueMemberName. This compact representation of the tree minimizes transmission time to the client side, as no tree node appears twice in the list. As with the FactTable download, this list is sent to JSOC as a JSON string. On the client side, JSOC parses this string and rebuilds the tree in one pass, maintaining a stack to keep track of the current list of ancestors.
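The one-pass rebuild can be sketched as follows. The entry layout [HierarchyLevel, MemberCaption, UniqueMemberName], levels numbered from 0 at the root, and the function name rebuildTree are assumptions made for illustration; the input is assumed to be a valid pre-order list, so a level never increases by more than one between consecutive entries.

```javascript
// Sketch only. Each entry is [hierarchyLevel, memberCaption, uniqueMemberName],
// listed in pre-order with the root at level 0 (an assumed layout).
function rebuildTree(members) {
  let root = null;
  const stack = []; // stack[k] is the current ancestor at level k
  for (const [level, caption, uniqueName] of members) {
    const node = { level, caption, uniqueName, children: [] };
    stack.length = level; // drop ancestors at this level or deeper
    if (level === 0) {
      root = node;
    } else {
      stack[level - 1].children.push(node); // parent is the nearest shallower node
    }
    stack.push(node);
  }
  return root;
}

// Made-up member list for a store hierarchy.
const storeTree = rebuildTree([
  [0, "All Stores", "[All Stores]"],
  [1, "BC", "[BC]"],
  [2, "Vancouver", "[BC]&[Vancouver]"],
  [1, "Ontario", "[Ontario]"],
  [2, "Ottawa", "[Ontario]&[Ottawa]"],
  [2, "Toronto", "[Ontario]&[Toronto]"]
]);
```

Because every node is attached to the top of the stack at the moment it is read, no parent identifiers need to be transmitted and the list is processed in a single pass.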

4. DATA VISUALIZATION BENCHMARKING To demonstrate the capability of JSOC, we resort to queries that OLAP power users often demand, e.g., queries that involve rapid calculations such as top-k percentage contribution over an m-time span, and m-day moving average. The time difference between executing one simple cross-tab query on the client side and on the server side would probably be too small for the user to notice. In this section, we focus on these two calculations, the algorithms used to carry them out, and the data visualizations applied. With extensive testing, our goals are (i) to compare JADE’s performance against that of a simulated server-centric web-based OLAP application performing these calculations; and (ii) to show that performing successive calculations against a server-centric model would introduce jitter in the presentation layer, if not completely overwhelm the OLAP server.

In addition to these two calculations, Section 4.4 introduces a novel client-side aggregation technique for a common data visualization, the scatter plot. We implement this technique on a conventional vector-graphics-based library and on the new HTML5 raster-based Canvas, and demonstrate that HTML5 makes a JavaScript-based Rich Internet Application a strong candidate as an OLAP client.

4.1 Data Visualization in Traditional Web-based OLAP

We do not run tests against SAP's server-based client because, even though it can send calculation queries, we cannot achieve the second goal with it. In the current generation of web-based OLAP clients, the server renders the graphical data visualization in real time and sends it as a static image to the client. To compare apples to apples, it is more interesting to test our client-centric model against a model where the server performs the calculations and the client renders the results using the same data visualization. To that end, we simulate a web application that has minimal script execution on the browser side and that uses the OLAP Data Access layer used by the server-centric model to fetch computed results from the same OLAP server. In all tests, clients and servers are on the same network, where the latency between any two machines is no more than one millisecond. In the top-k percentage contribution and m-day moving average calculation tests, we inject three levels of network latency to simulate clients connecting to a remote server and observe the impact on query response time.

4.2 Top-k Percentage Contributions

This calculation is essentially a ranking problem. We introduce a function by modifying the query engine to perform a group-by on one axis rather than two. While scanning the FactTable for tuples that fall within the m-time span, JADE performs a group-by on the single axis for the specified measure. Consider the following example: a user requests the top ten item subcategories and their percentage contribution to the total sales in the past two weeks. In this case, we retrieve a sub-cube that contains three dimensions: [Item] at the [Subcategory] granularity, [Time] at the [Calendar Date] granularity, and the measure [Sales]. JADE does the group-by operation on the [Item].[Subcategory] axis, restricted to tuples with [Time].[Calendar Date] in the past two weeks. When the aggregation ends, we obtain an object Answer that contains a list of <item subcategory: sale amount> as its <property: value> pairs, as well as a running total of the sale amount over every item subcategory. Then we select the top 10 item subcategories ranked by their sale amount using SelectTop(10, Answer).

Algorithm SelectTop(k, Answer)
    Array TopKAnswer;
    For (each property in Answer)
        TopKAnswer.insert({property, Answer[property]}, k)
    return TopKAnswer;

In this example, the algorithm maintains a static array of size ten. It loops through the properties of Answer and inserts each <item subcategory, sale amount> pair into this array using a custom insert method. This method inserts the object passed in as a parameter at the appropriate position according to its value, and shifts all elements thereafter by one. It then truncates the array to the first k elements. Finally, we compute the aggregated sales amount of the ten item subcategories as a percentage of the total sales amount of every item subcategory and display them in an HTML table as well as a browser-rendered pie chart using rGraph, a Raphaël SVG library [18]. We use a slider bound to [Time].[Calendar Date] to allow the user to change the value of m successively, providing real-time visualization of the calculation result in the pie chart, as shown in the screenshot (Fig. 2). Although this is possible with a server-centric model, in the presence of multiple simultaneous users and network latency, the result would be jerky. Therefore, offloading these types of intensive calculations to the client is beneficial from the perspective of both the user and the server.
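A minimal sketch of SelectTop and the bounded insert it relies on is given below. The names insertBounded and topKContribution, the entry shape, and the sample sales figures are assumptions for illustration; the paper's Answer object also carries a running total, which is recomputed here for simplicity.

```javascript
// Insert `entry` into `arr`, which is kept sorted by descending value,
// then keep only the first k elements.
function insertBounded(arr, entry, k) {
  let pos = 0;
  while (pos < arr.length && arr[pos].value >= entry.value) pos++;
  if (pos < k) {
    arr.splice(pos, 0, entry); // shifts all later elements by one
    if (arr.length > k) arr.length = k; // truncate to the first k
  }
}

// SelectTop(k, Answer): rank the properties of Answer by value.
function selectTop(k, answer) {
  const topK = [];
  for (const property of Object.keys(answer)) {
    insertBounded(topK, { key: property, value: answer[property] }, k);
  }
  return topK;
}

// Percentage contribution of each top-k entry to the grand total.
function topKContribution(k, answer) {
  const total = Object.values(answer).reduce((sum, v) => sum + v, 0);
  return selectTop(k, answer).map((e) => ({
    key: e.key,
    value: e.value,
    percent: (100 * e.value) / total,
  }));
}

// Made-up group-by result: item subcategory -> sale amount.
const answer = { Keyboards: 400, Mice: 100, Monitors: 300, Cables: 200 };
console.log(topKContribution(2, answer));
```

Keeping the array bounded at k elements means each insert costs at most k comparisons and shifts, so the selection avoids sorting the full group-by result on every slider movement.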

Figure 2: A Screen Shot for Top-k Percentage Contributions


To benchmark this calculation against a server-centric model, we run an experiment with three values of k and a slider control bound to the [Time] dimension. Each slider movement re-computes the top-k contribution and re-renders the output.

We run the calculation 500 times with random start and end dates for each model. The sub-cube downloaded to the JSOC is [Item].[Subcategory], [Time].[Calendar Day].[2004], and [Measure].[Sale Amt]. The sub-cube is 471KB with 77,629 tuples and takes 7/40/98/159 milliseconds to download with injected latencies of 0/30/90/150 ms, respectively. We observe that JADE performs the calculation and renders the result in 75 to 85 ms regardless of the value of k. While the server-centric model computes the result in comparable time (75 to 91 ms) when no latency is injected, simulated network latency adds overhead to the response time and prevents the browser from rendering the visualization seamlessly.

Table 2. Top-k percentage contribution performance

4.3 m-day Moving Average

Another common calculation is the m-day moving average of a measure for a set of members. For example, the user may be interested in the 7-day moving average sales amount of all the stores in Seattle in 2004, plotted as a line graph. First, we use our sub-cube retrieval method to obtain the dataset, which consists of the pre-aggregated sales amount of all n stores in Seattle by individual dates in 2004. We perform a group-by on individual dates, and then by store, and sort the dataset by date in ascending order. We compute the 7-day moving average of each day in January and plot the initial result on a line graph visualization using flot, a Canvas library [19]. We add a "playback" function so that the graph iterates through each day in 2004, calculates the 7-day moving average for the next day, and presents a "moving" average animation as a different way for the user to observe data trends.

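Given an n-day simple moving average for yesterday for store_i, the cost of calculating the n-day simple moving average for today is inexpensive: the sales of the day that drops out of the window are subtracted and today's sales are added, each divided by n.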

Because the JavaScript client has the aggregated dataset in memory, it is not necessary to perform a linear scan to determine Sales(date, store_i). Cases such as the m-day moving average best exemplify the benefits of delegating work to the client side, especially when requests are frequent and the server would otherwise need to recalculate every time.
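The incremental update can be sketched as below, using the recurrence SMA(today) = SMA(yesterday) - Sales(today - n)/n + Sales(today)/n for one store. The array layout (one pre-aggregated total per day, indexed by day) and the function names are assumptions for this example, not the prototype's actual data structures.

```javascript
// sales[d] is the pre-aggregated sale amount of one store on day d
// (hypothetical layout: days are consecutive array indices).

// Full computation of the n-day SMA ending on `day`; needed only once.
function initialSma(sales, day, n) {
  let sum = 0;
  for (let d = day - n + 1; d <= day; d++) sum += sales[d];
  return sum / n;
}

// Constant-time update: drop the day leaving the window, add the new day.
function nextSma(prevSma, sales, day, n) {
  return prevSma - sales[day - n] / n + sales[day] / n;
}

// Playback over a made-up series with a 3-day window.
const sales = [10, 20, 30, 40, 50];
const n = 3;
let sma = initialSma(sales, n - 1, n);
const series = [sma];
for (let day = n; day < sales.length; day++) {
  sma = nextSma(sma, sales, day, n);
  series.push(sma);
}
console.log(series);
```

Each animation frame then costs two array lookups per store instead of a rescan of the window, which is what makes the playback cheap on the client.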

For the m-day moving average calculation, we retrieve a sub-cube for the JavaScript OLAP Client: [Store].[By Geography].[Seattle], [Time].[Calendar Day].[2004], and [Measure].[Sale Amt]. The moving average sale of each store in Seattle is computed for January to form the series and to initialize a line graph. When the user initiates the animation, the JavaScript OLAP Client computes the moving average of each store for the next day and continuously updates the visualization until it reaches the end of year 2004. To make a fair comparison, the server-centric model also runs an MDX query to fetch the initial plot. When the user initiates the animation, the browser fetches the moving average of the next day only, instead of the whole period, from the OLAP server. When the browser receives the data for the next day, it updates the visualization locally. Table 3 shows the performance time of both models, including rendering time.

Table 3. m-day moving average performance

In general, human visual perception will not notice flicker at thirty frames per second or more. This is equivalent to a budget of about 33 milliseconds per frame for script time plus rendering time. Although the performance of our current implementation is slower than that threshold, it is fair to say that the JavaScript OLAP Client still does better than the server-centric model in three ways. First, we shorten the time it takes to display the chart by removing network latency from each update. Second, we reduce overall network traffic and server load, so that the server can service more clients. Finally, the smoother visualization provides a better user experience.

4.4 Scatter Plot Aggregation on Mouse Selection

Up to this point, we have been primarily concerned with the aggregation performance of a web-based OLAP client. Often, OLAP analysis produces massive datasets. If the results are always shown in a data grid, rendering time alone on a web browser may bog down the application. Even if rendering time is not an issue, analysts can miss potentially interesting trends when they are overwhelmed by the sheer amount of data in a large spreadsheet, as pointed out by Schrader [8]. Graphical visualization tools come into play to overcome this challenge [20]. A scatter plot is a type of mathematical diagram using Cartesian coordinates to display a set of data. The x and y axes represent two measures of a dataset; for each tuple in the dataset, a point is plotted with its two measures as the x and y coordinates. While a data grid can potentially span several screens, a scatter plot can scale to the width and height of the screen, providing the ability to display hundreds of thousands of data points in a confined space.

Once the data points are downloaded and stored in the FactTable, the user can work with them interactively. For example, the user may want to find out the production cost and sale amount for each state. JADE will then do a group-by operation on the column of the FactTable related to state (Table 4). The scatter plot is then re-drawn, as shown in Fig. 3. The user may notice the outliers in the chart, i.e., the states whose sales amount per dollar of cost is far higher or far lower than normal. The user can draw the scatter plot again based on aggregation on any member of any dimension hierarchy that is included in the scatter plot.
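The group-by that feeds the re-drawn plot can be sketched as follows. This is an illustrative example rather than JADE's implementation: the function name groupByColumn and the property names are assumptions, the first four tuples reuse the values of Table 4, and the fifth tuple is made up so that one state has more than one row.

```javascript
// Aggregate the given measures of a FactTable by one column.
function groupByColumn(factTable, column, measures) {
  const groups = new Map();
  for (const tuple of factTable) {
    const key = tuple[column];
    let acc = groups.get(key);
    if (acc === undefined) {
      acc = {};
      for (const m of measures) acc[m] = 0;
      groups.set(key, acc);
    }
    for (const m of measures) acc[m] += tuple[m];
  }
  return groups;
}

const factTable = [
  { state: "CA", saleAmount: 77229, productionCost: 50293 },
  { state: "WA", saleAmount: 28119, productionCost: 21253 },
  { state: "TX", saleAmount: 5112, productionCost: 4000 },
  { state: "FL", saleAmount: 76003, productionCost: 64200 },
  { state: "CA", saleAmount: 1000, productionCost: 500 }, // made-up extra tuple
];

// One scatter-plot point per state: x = sale amount, y = production cost.
const byState = groupByColumn(factTable, "state", ["saleAmount", "productionCost"]);
for (const [state, totals] of byState) {
  console.log(state, totals.saleAmount, totals.productionCost);
}
```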

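Response time (ms) by injected network latency (ms), for the JavaScript OLAP Client (JS) and the Server-centric OLAP Client (Server):

| Parameter k | JS, 0 | JS, 30 | JS, 90 | JS, 150 | Server, 0 | Server, 30 | Server, 90 | Server, 150 |
|---|---|---|---|---|---|---|---|---|
| 10 | 76 | 76 | 75 | 76 | 75 | 104 | 163 | 226 |
| 20 | 81 | 82 | 83 | 82 | 85 | 118 | 174 | 243 |
| 30 | 83 | 82 | 84 | 85 | 91 | 123 | 174 | 242 |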

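$$\mathrm{SMA}(\mathit{today}, \mathit{store}_i) = \mathrm{SMA}(\mathit{yesterday}, \mathit{store}_i) - \frac{\mathrm{Sales}(\mathit{today} - n, \mathit{store}_i)}{n} + \frac{\mathrm{Sales}(\mathit{today}, \mathit{store}_i)}{n}$$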

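Response time (ms) by injected network latency (ms), for the JavaScript OLAP Client (JS) and the Server-centric OLAP Client (Server):

| Parameter m | JS, 0 | JS, 30 | JS, 90 | JS, 150 | Server, 0 | Server, 30 | Server, 90 | Server, 150 |
|---|---|---|---|---|---|---|---|---|
| 7 | 88 | 87 | 87 | 87 | 159 | 194 | 249 | 309 |
| 14 | 89 | 90 | 89 | 88 | 166 | 193 | 259 | 317 |
| 21 | 90 | 91 | 91 | 92 | 172 | 204 | 258 | 320 |
| 30 | 90 | 93 | 92 | 93 | 173 | 204 | 261 | 329 |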


Figure 3: Scatter Plot Example

Table 4: FactTable for Scatter Plot

| Tuple# | State | ... | Sale Amount | Production Cost |
|---|---|---|---|---|
| 1 | CA | ... | 77,229 | 50,293 |
| 2 | WA | ... | 28,119 | 21,253 |
| 3 | TX | ... | 5,112 | 4,000 |
| 4 | FL | ... | 76,003 | 64,200 |
| ... | ... | ... | ... | ... |

Additionally, multiple scatter plots may be overlaid by assigning different colors to different plots [8]. The resulting scatter plot, for example, may show the sale amount and production cost by state, product and month. Alternatively, the user may drill-down on a region of the plot with a mouse, i.e., using the mouse to create a rectangle of the plot, as shown in the shaded area in Fig. 3. The plot will then be re-drawn by performing aggregation on the tuples whose sale amount and production costs fall inside the rectangle. At each of these computational steps, the aggregation update is expected to be done in real time.

In a traditional server-centric model, a scatter plot of the data is pre-rendered at the server and delivered to the browser as a static image file. The above computational scenarios are nearly impossible for server-centric OLAP systems. In comparison to the computational scenarios described in Sections 4.2 and 4.3, the current ones are far more complex. First, the server would need to track the entire dataset that each user currently has on the workspace. Second, even if the server were capable of that, it would not be able to keep up with returning aggregation results on every onMouseDrag event in the presence of network latency, let alone the server processing time. If the client takes a step back and responds only to the onMouseUp event, the user would need to make repeated mouse selections to see the effect of including neighboring points. As a result, the JavaScript OLAP Client is clearly superior to a server-centric OLAP client. However, generating scatter plots on the client side from the in-memory dataset is a challenging problem.

We experiment with generating scatter plots at the client-side using the datasets in-memory. Scalable Vector Graphics, or SVG, has been around for over a decade. In the past, web developers have used this mark-up to create dynamic and interactive graphic components for the web browser. HTML5 introduces the Canvas element tag that brings a new world of possibilities with client-side rendered graphics. Rather than a straightforward graph presentation, we explore a new technique that is tailored to client-side OLAP.

There are two methods of graphic rendering in a web application, both already used in the top-k contribution and the m-day moving average calculations: SVG and Canvas. When the number of rendered elements is small, the performance difference is negligible. With SVG, every vector graphic that appears on the canvas is an object in the DOM. The benefit of SVG is that event listeners can be bound to each object; the drawback is that retrieving an object triggers one DOM traversal. Suppose the plot is drawn, and the user drags the mouse to select a region for drill-down. The operation is implemented by drawing a panel in the scatter plot using Protovis, an SVG library [21]. Essentially, every object lying within the panel triggers one traversal of the DOM. The process can be very time consuming if a large number of objects are selected.

The Canvas approach is quite different, as it is a raster-based graphic library. The entire canvas is treated as an image, although there can be data associated with the image. We plot the tuples on a base canvas and store them in the order they are fed. We implement the region selection ourselves by stacking another Canvas element on top of the base. This way, the base layer keeps a static rendition of the scatter plot and remains intact while the user interacts with the top layer, which is a single DOM object. When the user clicks on the canvas, the onMouseDown event starts a rectangular box on the top layer; as the cursor moves across the layer, the top layer is constantly erased and redrawn. The onMouseDrag event handler is fired continuously as the selection changes, and loops through the set of points in the graph. If a point falls within the selection region, as determined by comparing its x and y coordinates with those of the rectangle, we use the index of the point to look up the tuple in the dataset without linear scanning, and aggregate the two measures. The whole process triggers only one DOM traversal.
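The aggregation step run on each drag event can be sketched as below, leaving out the Canvas drawing itself. The function name aggregateSelection, the point and tuple shapes, and the sample values are assumptions for illustration; the only property taken from the text is that points are stored in the same order as the tuples, so a point's index addresses its tuple directly.

```javascript
// points[i] holds the plotted pixel coordinates of dataset[i].
// rect is given by the drag start (x0, y0) and the current cursor (x1, y1).
function aggregateSelection(points, dataset, rect) {
  // Normalize so the rectangle works for drags in any direction.
  const xMin = Math.min(rect.x0, rect.x1);
  const xMax = Math.max(rect.x0, rect.x1);
  const yMin = Math.min(rect.y0, rect.y1);
  const yMax = Math.max(rect.y0, rect.y1);

  const result = { count: 0, saleAmount: 0, productionCost: 0 };
  for (let i = 0; i < points.length; i++) {
    const p = points[i];
    if (p.x >= xMin && p.x <= xMax && p.y >= yMin && p.y <= yMax) {
      const tuple = dataset[i]; // direct lookup by index, no scan
      result.count += 1;
      result.saleAmount += tuple.saleAmount;
      result.productionCost += tuple.productionCost;
    }
  }
  return result;
}

// Made-up example: four plotted points and their tuples.
const points = [
  { x: 10, y: 10 },
  { x: 50, y: 40 },
  { x: 90, y: 80 },
  { x: 55, y: 45 },
];
const dataset = [
  { saleAmount: 100, productionCost: 80 },
  { saleAmount: 200, productionCost: 150 },
  { saleAmount: 300, productionCost: 250 },
  { saleAmount: 50, productionCost: 20 },
];
// A drag from (60, 50) up and to the left to (40, 30).
console.log(aggregateSelection(points, dataset, { x0: 60, y0: 50, x1: 40, y1: 30 }));
```

The loop touches only plain arrays, so the cost per drag event is one pass over the points with no DOM access, which is consistent with the one-traversal behavior described above.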

We experiment with both methods of rendering a scatter plot from OLAP data, and observe their performance as the user interacts with the data points. The FactTable, transformed from the sub-cube returned by the following MDX query, has two measure columns, [Internet Sales Amount] and [Internet Total Product Cost]. It also has columns for three dimension hierarchies: [Geography].[Geography], [Product].[Product Categories] and [Delivery Date].[Calendar]. There are more than 400,000 tuples in the FactTable.

    Select {[Measures].[Internet Sales Amount], [Measures].[Internet Total Product Cost]} on axis(0),
        Descendants([Geography].[Geography].[All Geographies], 4, Leaves) on axis(1),
        Descendants([Product].[Product Categories].[All Products], 3, Leaves) on axis(2),
        Descendants([Delivery Date].[Calendar].[All Periods].[CY 2004], 4, Leaves) on axis(3)
    From [Adventure Works]

The result of this test in Table 5 shows a dramatic difference between SVG and Canvas.

Table 5. SVG versus Canvas scatter plot aggregation

[Figure 3 axes: Sale Amount ('000 $) on the x-axis, Production Cost ('000 $) on the y-axis]

| n | SVG Avg. Time (ms) | Canvas Avg. Time (ms) |
|---|---|---|
| 5,000 | 398 | 1 |
| 10,000 | - | 1 |
| 50,000 | - | 3 |
| 100,000 | - | 7 |
| 200,000 | - | 14 |
| 300,000 | - | 18 |
| 400,000 | - | 26 |


Our goal is not to claim that Canvas has far superior performance to SVG. Rather, through this experiment, we understand that Canvas has great potential to provide new data exploration techniques, in which data visualization and OLAP aggregation overlap. Therefore, it complements SVG in delivering a graphics-rich OLAP client tool.

5. CONCLUSION AND FUTURE WORK

In this research, we propose an architecture for a web-based OLAP system which is capable of performing OLAP operations on the web-based client inside a generic browser. The server of this architecture, compared to a server-centric OLAP system, incorporates a customized middleware, i.e., RESTful Web Services, which connects the OLAP server, through the XMLA API, with the web-based client. Special attention is paid to ensure that the client will not be overwhelmed by massive data/metadata transfer from the server.

Based on this architecture, a functional prototype has been built to investigate its performance. As part of our proof-of-concept work, we design and implement three typical OLAP data visualization tools with varying degrees of interactivity to run on this system. We compare the results from our client-centric system to a simulated server-centric system and show that a client-centric model is clearly superior to a server-centric model for applications involving interactive data visualization. We also contribute to the research community by demonstrating a novel technique for users to explore their data by interacting with the visualization and performing client-side aggregation on a dataset with close to half a million data items in real time, which would otherwise be impossible in a server-centric model. With future advances in hardware and software, e.g., more memory and faster CPUs in web devices, and faster JavaScript engines, we are confident that intelligent Web-based OLAP will become a popular way to improve the level of interactivity in data visualization.

Work is already under way to add more industrial strength to the research prototype. We are working on a cost function which allows the JSOC to determine whether a given task should be server-bound or client-bound. We are experimenting with different techniques in order to increase the percentage of workload done on the client side. We can increase the amount of data resident on the client side, by data compression for example. There is also a need to implement a sophisticated client-side data caching strategy. We hope to report our progress on these issues in a future publication.

6. ACKNOWLEDGEMENT

This research is supported in part by SAP Research and a CRD research grant from the Natural Sciences and Engineering Research Council of Canada.

7. REFERENCES

[1] C. Darie, AJAX and PHP: Building Responsive Web Applications. Packt Publishing, 2006.

[2] Microsoft Corp. (2011) Microsoft PowerPivot. [Online]. http://www.powerpivot.com/

[3] Qlik Technologies, Inc. (2011) QlikView Business Discovery Platform. [Online]. http://www.qlikview.com/us/explore/products/overview

[4] SAP AG. SAP BusinessObjects Web Intelligence. [Online]. http://www.sap.com/solutions/sapbusinessobjects/large/business-intelligence/qra/web_intelligence/index.epx

[5] Adobe Systems Incorporated. (2011) Adobe Flash Platform. [Online]. http://flex.org/

[6] Microsoft Corp. (2011) Microsoft Silverlight. [Online]. http://www.microsoft.com/silverlight/future/

[7] Oracle Corp. JavaFX Rich Internet Applications. [Online]. http://javafx.com/

[8] M. Schrader and D. Descroches, Oracle Essbase & Oracle OLAP: The Guide to Oracle Multidimension Solution. New York: Oracle Press/McGraw-Hill, 2010.

[9] Adobe Systems Incorporated. (2010) Flex Data Visualization Developer's Guide - AdvancedDataGrid Controls and Automation Tools. [Online]. http://livedocs.adobe.com/flex/3/html/Part2_adv_data_grid_API_1.html

[10] XML for Analysis Council. (2010) XML for Analysis. [Online]. http://www.xmla.org

[11] Microsoft Corporation and Hyperion Solutions Corporation. (2010) XML for Analysis Specification. [Online]. http://msdn.microsoft.com/en-us/library/ms977626.aspx

[12] R.T. Fielding and R.N. Taylor, "Principled Design of the Modern Web Architecture," ACM Transactions on Internet Technology, vol. 2, pp. 115-150, 2002.

[13] L. Richardson and S. Ruby, RESTful Web Services. Farnham: O'Reilly, 2007.

[14] jQuery UI Team. (2010) jQuery UI. [Online]. http://developer.yahoo.com/yui/treeview

[15] YUI Team. (2010) YUI 2: TreeView. [Online]. http://developer.yahoo.com/yui/treeview

[16] D. Crockford, "The Fat Free Alternative to XML," in WWW2006, Edinburgh, Scotland, 2006.

[17] Microsoft Corporation. (2010) Calculating CellOrdinal. [Online]. http://msdn.microsoft.com/en-us/library/ms713648%28VS.85%29.aspx

[18] D. Baranovskiy. (2010) Raphael - JavaScript Library. [Online]. http://raphaeljs.com

[19] O. Laursen. (2010) flot. [Online]. http://code.google.com/p/flot

[20] A. Maniatis, P. Vassiliadis, S. Skiadopoulos, and Y. Vassiliou, "Advanced Visualization for OLAP," in Proceedings of ACM 6th International Workshop on Data Warehousing and OLAP, 2003.

[21] Stanford Visualization Group. (2010) Protovis. [Online]. http://vis.stanford.edu/protovis
