
2017 IEEE International Conference on Big Data (BIGDATA)


Visual Analytics with Unparalleled Variety Scaling for Big Earth Data

Lina Yu (1), Michael L. Rilee (2,3), Yu Pan (1), Feiyu Zhu (1), Kwo-Sen Kuo (2,4,5), Hongfeng Yu (1)
(1) University of Nebraska-Lincoln, Lincoln, NE, USA
(2) NASA GSFC, Greenbelt, MD, USA
(3) Rilee Systems Technologies, Derwood, MD, USA
(4) Bayesics, LLC, Bowie, MD, USA
(5) University of Maryland, College Park, MD, USA

Abstract—We have devised and implemented a key technology, SpatioTemporal Adaptive-Resolution Encoding (STARE), in an array database management system, i.e. SciDB, to achieve unparalleled variety scaling for Big Earth Data, enabling rapid-response visual analytics. STARE not only serves as a unifying data representation homogenizing diverse varieties of Earth Science datasets, but also supports spatiotemporal data placement alignment of these datasets to optimize a major class of Earth Science data analyses, i.e. those requiring spatiotemporal coincidence. Using STARE, we tailor a data partitioning and distribution strategy to the data access patterns of our scientific analysis, leading to optimal use of distributed resources. With STARE, rapid-response visual analytics are made possible through a high-level query interface, allowing geoscientists to perform data exploration visually, intuitively, and interactively. We envision a system based on these innovations relieving geoscientists of most laborious data management chores so that they may focus better on scientific issues and investigations. A significant boost in scientific productivity may thus be expected. We demonstrate these advantages with a prototypical system, including comparisons to alternatives.

Index Terms—STARE; SciDB; array database; variety; data analysis; load balancing; indexing; GIS

I. INTRODUCTION

Earth Science data obtained from diverse sources have been routinely leveraged by scientists to study various phenomena. The principal data sources include observations and model simulation outputs. These data are characterized by spatiotemporal heterogeneity originating from different instrument design specifications and/or computational model requirements that are preserved in data generation processes. Such inherent heterogeneity poses several challenges in exploring and analyzing geoscience data. First, scientists often wish to identify features or patterns co-located among multiple data sources to derive and validate hypotheses; heterogeneous data make it a tedious task to look for such features across dissimilar datasets. Second, Earth Science data are multivariate, and it is challenging to tackle their high dimensionality and explore relationships among multiple variables in a scalable fashion. Third, traditional automated approaches, such as feature detection or clustering, lack transparency: scientists often cannot adeptly interact with data during analysis or intuitively interpret the results.

To address these issues, we present a new scalable approach that facilitates the scientific analysis of voluminous and diverse geoscience data in a distributed environment. The major contributions of our work are:

• SpatioTemporal Adaptive-Resolution Encoding (STARE). We have implemented STARE as the basis of a unified data model and an indexing scheme for geo-spatiotemporal data to address the variety challenge of Big Data in Earth Science. With the generality of unifying at least the three popular geospatial data models, i.e. Grid, Swath, and Point, used in current Earth Science data products, data preparation time for interactive analysis of diverse datasets can be drastically reduced, achieving unparalleled variety scaling.

• Spatiotemporal data placement alignment in a distributed environment. We have applied STARE to extending the capabilities of the array database SciDB, using a data placement strategy tailored to the STARE-based data access patterns of our scientific analysis in a distributed environment. With these STARE-enabled tools, we can scalably co-locate data spatiotemporally by placing data chunks directly on the correct nodes, avoiding costly data transfer and repartitioning and ensuring scalable performance.

• A visual analytics prototype. We have constructed a high-level graphical query interface that allows users to easily express customized queries to search for features of interest across multiple heterogeneous datasets. For identified features, we have developed a visualization interface that enables interactive exploration and analytics in a linked-view fashion. Specific visualization techniques are employed in each view to facilitate easy and interactive exploration into various aspects of identified features. A user can interactively and iteratively build insights into the data through a variety of intuitive visual analytic operations.

II. BACKGROUND AND MOTIVATION

Facing the deluge of ever-increasing data volume, there have been many efforts attempting to address the growing challenge of Earth Science data practice with Big Data technologies. Three data models generally represent the spatial variety of geoscience data: Grid, Swath, and Point. These data are typically volumetric, multivariate, and time-varying. Most of the Big-Data efforts emphasize the volume aspect of the challenge. We, however, have recognized variety as the key [1], beyond volume, to attaining optimal scientific value.


Fig. 1. Three Earth Science data models.

Moreover, after identifying certain regions of interest, scientists often conduct detailed data analysis and visualization using combinations of queries based on possibly complex functions of the primary variables [2]. Without special care, variety coupled with complex analysis naturally leads to poor data access patterns, impacting data placement and system performance. This grand challenge can be illustrated by the disparity between data placement and data variety.

A. Data Placement Challenge

Data placement is especially important for technologies in which compute and storage are tightly coupled, such as a database management system (DBMS). However, most existing frameworks for distributed big data analytics, such as MapReduce [3] and Spark [4], are only loosely coupled with the storage system (e.g., HDFS [5], Cassandra [6], etc.). Loosely coupled technologies provide better flexibility or elasticity but suboptimal performance and efficiency with equivalent resources. For example, Hadoop [7], the open-source version of MapReduce, is simple and worthy of high praise for one-pass computations, but it is inherently inefficient for multi-pass computations due to the lack of primitives for sharing intermediate states of the computation among passes; instead, it sends and retrieves intermediate states across a distributed file system. The resulting communication and I/O overheads affect the overall performance.

B. Data Variety Challenge

Researchers have proposed various solutions based on the loosely coupled frameworks (e.g., Spark and Hadoop) and exploited multiple compute nodes (e.g., a cluster) in a distributed environment [8]. These solutions are viable for tackling the volume challenge of Big Earth Data, but have mostly downplayed, or even neglected, the variety challenge.

Figure 1 illustrates the three data models: Grid, Swath, and Point. The Grid is shown as a black mesh with fixed latitude and longitude spacing; a simple linear relation exists between array indices and latitude-longitude geolocation coordinates. Swath retains a spaceborne instrument's observation geometry (e.g., cross-track × along-track) for its Instantaneous Fields of View (IFOV), and no simple relation exists between data array indices and geolocations. The Point model is used mostly for in-situ observations at irregularly distributed locations. In contrast to a Grid's regular relationship between index and geolocation, the geolocations of Point and Swath data elements are specified individually.

However, the different data models are specified using different reference frames and have different resolutions, resulting in arrays of different shapes and even different geolocation representations. Thus, without a unifying representation, it is difficult, if not impossible, to spatiotemporally align their partitions when they are placed across cluster nodes. Misaligned arrays must be aligned on the fly, a computationally expensive process, before they can be processed together for integrative analysis, such as conditional subsetting and comparisons between multiple, dissimilar datasets. Since most integrative analysis requires spatial and/or temporal coincidence, data should be laid out so that data for the same spatiotemporal partition reside on the same node. The difference in data representations and array shapes thus gives rise to a formidable data co-alignment challenge.

III. OUR APPROACH

We aim to enable scalable data storage and analytics by holistically addressing Big Earth Data. We develop a new indexing scheme, named SpatioTemporal Adaptive-Resolution Encoding (STARE), to spatiotemporally co-align arrays of different shapes and data models. Using STARE, we tailor a data placement strategy to the data access patterns of scientific analysis, leading to optimal data partitioning and distribution in a distributed environment.

A. STARE

In our effort to develop viable approaches to achieving optimal scientific value from Big Earth Data, we realize that an appropriate indexing scheme for geolocation is crucial. The primary requirements for such an indexing scheme are:
R1: Support spatiotemporal data placement alignment.
R2: Include resolution information of the underlying data.

The rationale for the first requirement is straightforward. Most integrative analyses in Earth Science require spatiotemporal coincidence. Data placements aligned spatiotemporally ensure the minimization of node-to-node communication in a distributed parallel database and, as a result, the optimization of performance. The second requirement ensures that set operations over multiple, diverse datasets for integrative analyses can be carried out robustly and consistently without sacrificing performance.

STARE is an innovative outcome satisfying these requirements. It consists of two parts, a spatial component and a temporal component.

1) Spatial component: The hierarchical triangular mesh (HTM) [9] is a way to address the surface of a sphere using a hierarchy of spherical triangles. We build on the work of Kunszt et al. [9], who used HTM to index the Sloan Digital Sky Survey (SDSS) [10]. The mesh is generated with the procedure below:


Fig. 2. (a): HTM evolution within 5 iterations. (b): An example illustrating the hierarchical triangular mesh: triangle N0 shown in red, N01 shown in green, N012 shown in blue, and N0123 shown in yellow.

1: Start with an octahedron (or another Platonic solid) inscribed in the sphere. The southern (northern) hemisphere is labeled with a 0 (1), and the 4 triangles of each half are labeled 0-3, starting from the triangle nearest the x-axis and proceeding counter-clockwise around the sphere as viewed from outside and above the poles.
2: Bisect each edge of the triangular facets.
3: Project the bisecting points onto the sphere to form 4 smaller spherical triangles.
4: Repeat from Step 2 until the desired resolution is reached.

We show the results of HTM evolution within 5 iterations in Figure 2(a). After the initial octahedron, each iteration is termed a quadfurcation, i.e., division/branching into 4 parts, requiring 2 more bits to index. An example is illustrated in Figure 2(b).
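To make the quadfurcation concrete, the following Python sketch (ours, not the authors' implementation; the vertex choices and trixel labeling are simplified relative to the full HTM scheme) recursively subdivides one of the 8 root triangles:

import numpy as np

def normalize(v):
    # Project a point back onto the unit sphere.
    return v / np.linalg.norm(v)

def quadfurcate(v0, v1, v2, level, max_level, trixels, label):
    # One quadfurcation: bisect the edges, project the midpoints onto
    # the sphere, and recurse into the 4 smaller spherical triangles.
    if level == max_level:
        trixels[label] = (v0, v1, v2)
        return
    w0 = normalize((v1 + v2) / 2.0)
    w1 = normalize((v0 + v2) / 2.0)
    w2 = normalize((v0 + v1) / 2.0)
    # Each child appends one digit (0-3), i.e., 2 more index bits.
    quadfurcate(v0, w2, w1, level + 1, max_level, trixels, label + "0")
    quadfurcate(v1, w0, w2, level + 1, max_level, trixels, label + "1")
    quadfurcate(v2, w1, w0, level + 1, max_level, trixels, label + "2")
    quadfurcate(w0, w1, w2, level + 1, max_level, trixels, label + "3")

# One root facet of the octahedron, refined to level 5.
trixels = {}
quadfurcate(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]),
            np.array([0.0, 0.0, 1.0]), 0, 5, trixels, "N0")
print(len(trixels))  # 4**5 = 1024 leaf triangles under this root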

The spatial component of STARE is a customized variant of the HTM with two distinctions. First, while right-justified encoding is used for the original HTM indexing, we choose a left-justified encoding to facilitate spatial data placement alignment. Second, data resolution is added to the encoding using a few least-significant bits to facilitate set operations among diverse datasets.

The conventional HTM implementation had a simple map from symbols to integers in a Right Justified Mapping (RJM). However, RJM maps points in geometric proximity (e.g., in the offspring triangles of the same parent) to multiple, separated locations on the number line. Thus, implementing set operations (e.g., intersection) under RJM is complicated by this one-to-many mapping of the geometric points (in triangles) along the number line. For example, as shown in Figure 3, geometrically S0123 (corresponding to the integer value 539 in RJM) contains S01230 (2156 in RJM), yet on the number line N0123 (795 in RJM) lies between them, even though S0123 and S01230 share the same prefix to the 3rd level. Thus, mapping HTM regions to contiguous RJM integer intervals holds only within the same HTM index level, whereas our diverse datasets have a range of spatial resolutions.

Instead of RJM, we use a Left Justified Mapping (LJM) bit format to encode HTM. Figure 4 shows the corresponding result for our example (using 12 bits for clarity).

S0123 ⇒ 0b1000011011 = 0x21b = 539N0123 ⇒ 0b1100011011 = 0x31b = 795S01230 ⇒ 0b100001101100 = 0x86c = 2156

Fig. 3. RJM encoding example.

S0123 ⇒ 0b100001101100 = 0x86c = 2156
N0123 ⇒ 0b110001101100 = 0xc6c = 3180
S01230 ⇒ 0b100001101100 = 0x86c = 2156

Fig. 4. LJM encoding example.

However, in this case we have an aliasing problem, e.g., between S0123 and S01230: LJM cannot distinguish between levels, as it does not track how many bits from the left are significant. To address this issue, we use a few least-significant bits of a signed 64-bit integer to encode the approximate data resolution. Since we limit the number of quadfurcations to 27 (corresponding to ∼7-cm resolution), together with the starting 8 facets of the octahedron, only 57 (= 27 × 2 + 3) bits are needed. Excluding the sign bit, we have 6 bits left to encode the quadfurcation level with the closest resolution to (but covering) that of the data.

Table I lists the bit-field arrangement of STARE's spatial index. Figure 5 shows the encodings of the triangles from our previous example. This encoding captures the difference between S0123 and S01230, and integer comparisons can tell us that the former contains the latter.

S0123 ⇒ 0x06c0000000000003S01230 ⇒ 0x06c0000000000004N0123 ⇒ 0x46c0000000000003

Fig. 5. STARE spatial encoding example.

Essentially, STARE's spatial index concisely and uniquely maps a floating-point latitude-longitude 2-tuple to an integer with a given precision. For example, at the 23rd quadfurcation level the mapping has a ∼1-m uncertainty.
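The bit manipulations involved are simple. The following Python sketch (ours; field widths per Table I, with the hemisphere/root labeling conventions simplified) packs a left-justified index with trailing resolution bits and tests containment by prefix comparison; it reproduces the values of Figure 5:

def stare_spatial(hemisphere_bit, root_triangle, digits, resolution_level):
    # Pack a STARE spatial index using the bit fields of Table I:
    # 1 north-south bit, 2 root-triangle bits, 27 left-justified digit
    # pairs, 1 reserved bit, and 5 resolution-level bits.
    index = (hemisphere_bit << 2) | root_triangle
    for d in digits:                      # coarse-to-fine, left-justified
        index = (index << 2) | d
    index <<= 2 * (27 - len(digits))      # pad the unused quadtree levels
    index <<= 1                           # reserved bit 5
    return (index << 5) | resolution_level

def covers(a, b):
    # Trixel a contains trixel b iff a is no finer than b and the two
    # indices agree on a's significant prefix.
    level_a = a & 0x1F
    if level_a > (b & 0x1F):
        return False
    shift = 63 - (3 + 2 * level_a)        # keep hemisphere+root+level_a pairs
    return (a >> shift) == (b >> shift)

s0123 = stare_spatial(0, 0, [1, 2, 3], 3)
s01230 = stare_spatial(0, 0, [1, 2, 3, 0], 4)
assert s0123 == 0x06C0000000000003 and s01230 == 0x06C0000000000004  # Fig. 5
assert covers(s0123, s01230) and not covers(s01230, s0123)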

2) Temporal component: STARE's temporal component is also hierarchical but uses conventional date/time units to avoid unnecessary translations between temporal frameworks. Table II provides just one example encoding, with a maximum time resolution of milliseconds, common for observations obtained from spacecraft. As an example, for an observation made on 2015 June 12, 8:10 AM with millisecond resolution, STARE yields

[+] 000-002015-06-3-3 08:0600.000 (07).

Here, the (07) corresponds to the highest-level (finest) resolution available, milliseconds, and the [+] signifies positive years. With the Unix-based convention of numbering months starting at 0 and days at 1, the native STARE format can be read (partly) as the 3rd day of week 3 of month 6 (the 7th regularized 28-day month). Conversion to an array index is simple, and alternative encodings may be devised to meet application requirements.

TABLE I
LEFT JUSTIFIED MAPPING OF STARE SPATIAL ENCODING.

Starting Bit | Ending Bit | No. Bits | Meaning
63 | 63 | 1 | Reserved (Most Significant)
62 | 62 | 1 | North-South Bit
60 | 61 | 2 | Octahedral triangle index (Resolution level 0: ∼10,000 km)
6 | 59 | 54 | Quadtree triangle index (Resolution levels 1-27: ∼5,000 km to ∼7 cm)
5 | 5 | 1 | Reserved
0 | 4 | 5 | Resolution level (Least Significant)


TABLE II
AN EXAMPLE OF STARE TEMPORAL ENCODING.

Starting Bit | Ending Bit | No. Bits | Meaning
0 | 2 | 3 | Resolution/Unit
3 | 12 | 10 | Millisecond
13 | 24 | 12 | Second
25 | 29 | 5 | Hour
30 | 32 | 3 | Day of week
33 | 34 | 2 | Week
35 | 38 | 4 | Month
39 | 48 | 10 | Year
49 | 58 | 10 | Kilo-annum
59 | 62 | 4 | Mega-annum
63 | 63 | 1 | Before/After

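A sketch of the corresponding packing (ours, using the field widths of Table II; the calendar conventions, e.g., the week and year numbering, follow our reading of the example above):

def stare_temporal(before_after, mega_annum, kilo_annum, year, month,
                   week, day_of_week, hour, second, millisecond, resolution):
    # Pack a STARE temporal index with the bit fields of Table II,
    # most significant field first; the widths sum to 64 bits.
    fields = [(before_after, 1), (mega_annum, 4), (kilo_annum, 10),
              (year, 10), (month, 4), (week, 2), (day_of_week, 3),
              (hour, 5), (second, 12), (millisecond, 10), (resolution, 3)]
    index = 0
    for value, width in fields:
        assert 0 <= value < (1 << width)
        index = (index << width) | value
    return index

# The example above: [+] year 2015 (kilo-annum 2, year 15), month 6,
# week 3, day 3, hour 8, second 600 (i.e., 8:10 AM), resolution 7.
t = stare_temporal(1, 0, 2, 15, 6, 3, 3, 8, 600, 0, 7)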

3) Deployment: We use STARE to support our research into the automated analysis of phenomena, such as blizzards, as moving objects. We are extending the capabilities of the array database SciDB and developing tools (database ingest, preprocessing, and visualization) using STARE through a software library and APIs.

SciDB [11] is an open-source distributed data management system used primarily for application domains involving very large-scale array data. SciDB automatically distributes data and computational load across an arbitrary number of hardware instances while presenting an end user with the experience of a single, unified system. Unlike relational databases, which store data in flat tables, SciDB stores data in multidimensional arrays. Not only do multidimensional arrays generally lend themselves well to the representation of complex scientific data, they also permit the construction of an ever-growing library of efficient mathematical operators. In addition to its built-in libraries, SciDB provides user-defined types (UDTs), user-defined functions (UDFs), and user-defined operators (UDOs) to extend its core capabilities.

We have incorporated STARE into SciDB using its UDT and UDF facilities, including a STARE UDT and functions for constructing SciDB indices. Tagging Earth Science data with STARE ranges allows the geometric comparison, co-registration, and selection of diverse kinds of data via efficient metadata operations.

B. Data Placement Strategy

Building on our STARE indexing scheme, we study data access patterns in user analytics operations, and employ a STARE-based data placement strategy to effectively partition and distribute Earth Science data in a distributed environment.

1) Data access patterns: With the advances in computing and visualization techniques, Earth Science data is nowadays mostly explored and studied via visual analytics. Visual analytics operations can be roughly categorized into two types: view-dependent and data-dependent. For view-dependent operations, users access data by specifying their camera or view positions (e.g., visualizing data within a viewport).


Fig. 6. The spatial decomposition of a spherical surface using STARE and its corresponding hierarchical triangular mesh. A preorder traversal of the triangle leaves is equivalent to the space-filling curve on the spherical surface. (a) shows that when we evenly assign the triangle leaves among three processors from left to right in the hierarchical triangular mesh tree, the distribution of their regions is contiguous along the space-filling curve; in this case, each processor's regions may not always be visible from different viewing directions. (b) shows that when we assign the triangle leaves among three processors in round robin, neighboring regions are largely assigned to different processors; in this case, a portion of each processor's regions is visible from any viewing direction.

For data-dependent operations, users access data by specifying their data of interest (e.g., querying data within a certain value range or within a certain time interval). These two types of operations are often combined in practice. For example, users can apply visualization techniques to control the visual properties of queried data variables and find the features or regions of interest.

2) Data partitioning and distribution: In a distributed environment, a common practice to achieve workload balancing among processors is to use regular gridding to divide a dataset into uniform blocks and to evenly or randomly assign the blocks among the processors. When the block size is sufficiently small, it is easy to achieve workload balancing, but the overhead (increased bookkeeping due to the large number of blocks) counteracts overall performance.

We can achieve a more sparseness-adaptive assignment by leveraging the spatial locality encoded in a tree structure, such as a linear quadtree or a hierarchical triangular mesh quadtree [9], when data exhibit unbalanced density. The STARE indexing scheme essentially encodes the geolocations of multiple datasets, resulting in a unified quadtree for each time slice. A preorder traversal of the corresponding quadtree leaves is a space-filling curve that groups spatially nearby triangles together on the spherical surface. This characteristic can be used to optimize data layout.

If we assign the processors contiguously along the space-filling curve for parallel processing, each processor will be responsible for contiguous regions on the surface, as shown in Figure 6(a). However, users may only see part of a region on the Earth, with other regions occluded from a given camera position. Therefore, a contiguous assignment may result in unbalanced workload and suboptimal data locality. To address this issue, we distribute the triangles among the processors along the space-filling curve in a round-robin fashion, as shown in Figure 6(b). Data operations and calculations associated with a given triangle are distributed in the same scheme. This guarantees that, given a sufficient number of processors, neighboring regions are assigned to different processors, achieving load balancing during users' interactive exploration.
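The two assignment policies can be sketched in a few lines of Python (ours; leaves are assumed to arrive in preorder, i.e., space-filling-curve, order):

def contiguous_assign(leaves, n_procs):
    # Split the space-filling-curve order into contiguous runs (Fig. 6(a)).
    chunk = -(-len(leaves) // n_procs)    # ceiling division
    return {p: leaves[p * chunk:(p + 1) * chunk] for p in range(n_procs)}

def round_robin_assign(leaves, n_procs):
    # Deal leaves out cyclically so that neighboring triangles land on
    # different processors (Fig. 6(b)).
    return {p: leaves[p::n_procs] for p in range(n_procs)}

leaves = list(range(12))
print(round_robin_assign(leaves, 3))
# {0: [0, 3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8, 11]}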


Fig. 7. The framework of our distributed storage and analytics system.


IV. SYSTEM FRAMEWORK

We develop a distributed storage and analytics system based on our STARE indexing scheme and data placement strategy to tackle Big Earth Data.

The overall framework of our system is a typical three-tier web application composed of a SciDB distributed database, a shim layer acting as the web server, and a front-end, with STARE as a UDF plugin to SciDB, as shown in Figure 7. Shim is a basic SciDB client that exposes SciDB functionality through a simple HTTP API. We implement the front-end to query data through the shim web server to SciDB and render data as an overlay on Google Maps within a browser. Every query from the front-end is encapsulated in an HTTP GET request as the suffix of the Uniform Resource Identifier (URI) and is encoded appropriately.
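For illustration, a front-end round trip through shim might look like the following Python sketch; the endpoint and parameter names reflect shim's HTTP API as we understand it, and the host, port, query, and array name are hypothetical:

import requests

SHIM = "http://scidb-coordinator:8080"   # hypothetical shim host and port

# Every query travels as an HTTP GET with the AFL text URL-encoded
# into the request parameters.
session = requests.get(SHIM + "/new_session").text.strip()
requests.get(SHIM + "/execute_query",
             params={"id": session,
                     "query": "aggregate(merra2_stare, count(*))",
                     "save": "csv"})
print(requests.get(SHIM + "/read_lines",
                   params={"id": session, "n": 0}).text)
requests.get(SHIM + "/release_session", params={"id": session})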

The back-end of our system, SciDB, is designed to run on a shared-nothing architecture and allows scientists to analyze voluminous datasets as arrays with little concern for the details of how massively parallel processing is achieved. It comprises a large set of worker nodes, a subset of which also act as coordinator nodes. A Postgres database runs on a coordinator node to manage all the metadata about the nodes, instances, arrays, and so on. Each node contains its local SciDB engine and data store. Doan et al. [12] evaluated the impact of data placement on SciDB and Spark: when data placement alignment is exploited, SciDB can outperform Spark+HDFS and Spark+Cassandra by a factor of approximately 3-10 for equivalent integrative analysis.

Similar to a relational database, SciDB has a query executor to parse, optimize, and execute queries written in the Array Query Language (AQL) and the Array Functional Language (AFL), which can be extended with UDTs, UDFs, and UDOs.

Using STARE to organize data before database ingest, we can place data chunks directly on the correct nodes, avoiding costly data transfer and repartitioning. With the enormous scale of Earth Science datasets, the savings from eliminating costly data repartitioning (or redimensioning) cannot be overemphasized.

TABLE III
DATASETS USED IN OUR EXPERIMENTAL STUDY.

Name | Type | Time Interval | Latitude Range | Longitude Range | Resolution | Size/Slice
MERRA-2 | grid | 1 hour | 90°S - 90°N | 180°W - 180°E | 576×361 | 0.83M
NMQ | grid | 5 minutes | 20°N - 55°N | 130°W - 60°W | 7001×3501 | 98M
TRMM | swath | 15 slices/day | 40°S - 40°N | 180°W - 180°E | 1601×7201 | 46M

Because STARE uniformly translates geolocations and times to integer spatial and temporal indices, with resolutions encoded therein, placements of data represented by different models and with different resolutions are aligned in storage. They naturally co-locate in space and time when distributed across computing and storage resources.

We developed two UDFs with intuitive APIs to ingest data using STARE, specifically

hstmFromLevelLatLon(int64 level, double latitude, double longitude)

converting a geospatial location into a STARE spatial index, and

temporalIndexFromTradYrMoDyHrMiSeMsRl(int64 year, int64 month, int64 day, int64 hour, int64 minute, int64 second, int64 minisecond, int64 resolution)

converting a time into a STARE temporal index. The process of ingesting data then consists of three steps:

1: Use the SciDB built-in apply operator to add the computed STARE spatial and temporal indices as new attributes of the array.

2: Use the SciDB built-in redimension operator to make the spatial and temporal indices the array's dimension indices.

3: Store the redimensioned array as a new array.

Note that data external to SciDB that is already indexed with STARE can be loaded and distributed directly, without this usually expensive redimensioning step.
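Putting the three steps together, an ingest might look like the following AFL sketch (wrapped in Python strings; the source array, its attributes, and the chunking details are hypothetical, while the UDF names are those given above):

# Hypothetical source array holding per-cell geolocation and time parts:
#   merra2_raw<val:double, lat:double, lon:double, yr:int64, mo:int64,
#              dy:int64, hr:int64, mi:int64, se:int64, ms:int64>[cell]
ingest = [
    # Step 1: apply() attaches STARE spatial/temporal indices as attributes.
    "store(apply(merra2_raw,"
    "  sidx, hstmFromLevelLatLon(27, lat, lon),"
    "  tidx, temporalIndexFromTradYrMoDyHrMiSeMsRl("
    "    yr, mo, dy, hr, mi, se, ms, 7)),"
    "  merra2_tagged)",
    # Steps 2-3: redimension() promotes (sidx, tidx) to dimensions and
    # store() writes the result (chunk sizes omitted for brevity).
    "store(redimension(merra2_tagged, <val:double>[sidx, tidx]), merra2_stare)",
]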

V. RESULTS

In order to study the performance characteristics of our framework, we use a concrete set of typical queries in the Earth Science domain. The queries operate on three multidimensional Grid and Swath datasets as described below. Since the handling of Point data is similar to that of Swath, it is not specifically demonstrated in this study.

A. Datasets

Two regularly gridded datasets and one swath dataset for the period of Winter 2010 (i.e., from December 1st, 2009 to February 28th, 2010) are used to conduct our experiments. The first gridded dataset is extracted from an hourly dataset of the NASA Modern Era Retrospective-analysis for Research and Applications (MERRA-2) [13] data collection, while the second is extracted from a reprocessed 5-minute National Mosaic and Multi-sensor QPE (NMQ, where QPE stands for quantitative precipitation estimate) [14]. MERRA-2 has global coverage, whereas NMQ is only available for the contiguous United States (CONUS), specifically 20°N - 55°N in latitude and 130°W - 60°W in longitude.


Fig. 8. Rendering of one time slice of MERRA-2 and a partial trajectory of TRMM.

The two datasets also differ in resolution: MERRA-2 is 5/8° × 1/2° (longitude × latitude) in space and hourly in time, whereas NMQ is 0.01° × 0.01° in space and every 5 minutes in time.

The swath dataset, from NASA's Tropical Rainfall Measuring Mission (TRMM), derives vertical hydrometeor profiles using data from the Precipitation Radar (PR) and the TRMM Microwave Imager (TMI), which orbited the tropics of Earth (between 40°S and 40°N) ∼15 times per day, generating a scan line every 600 ms [15].

The three datasets are displayed within different ranges of location, and they are not synchronized due to their different time intervals between contiguous time steps. Table III summarizes the characteristics of the three datasets.

B. Computing Environment

A cluster with 16 nodes is set up to carry out our experiments. All nodes have the same configuration: 32 GB of main memory, an 8-core CPU, and 9 TB of local disk storage. They all run the CentOS 7 Linux operating system and are interconnected with 10 Gigabit Ethernet. We use the enterprise edition of SciDB release 16.9, which supports replication and advanced linear algebra operations, such as singular value decomposition (SVD).

C. Evaluation

1) Data placement: We first verify the data placement result of our approach. Figure 8 shows a rendering that illustrates the distinct shapes and coverages of the MERRA-2 and TRMM datasets. Figure 9 visualizes the data placement of these two datasets on the 16 compute nodes using our STARE-based scheme and the conventional regular-grid partitioning and distribution scheme. From this qualitative comparison, we can clearly see that our solution leads to co-alignment and co-location of the two datasets despite their completely different shapes and coverages. Next, we present quantitative evaluations detailing the advantages of our STARE indexing scheme and data placement strategy.

2) Join operation performance: There are two operators available for the join operation in SciDB: cross-join and join. The former is a generic join operator that makes no assumption about the schema of its array operands, while the latter requires its array operands to be aligned (i.e., corresponding chunks co-located on the same nodes).

Fig. 9. Comparison between different data placements. (a): The top row shows the results using our STARE-based partitioning and distribution. Both the MERRA-2 and TRMM datasets are indexed using STARE (see Section III-A) and partitioned and distributed among 16 compute nodes using our round-robin strategy (see Section III-B). Different nodes are denoted using 16 different colors in the top color map. The left image shows the partitioning and distribution of both datasets; the right image shows the zoomed-in view of the highlighted region (Gulf of Mexico and Caribbean) of the left image. We can clearly see that the two datasets are well co-aligned and co-located among the nodes. (b): The bottom row shows the corresponding results using the regular-grid partitioning and distribution, which cannot co-align and co-locate the two datasets.

Fig. 10. After using STARE, we show the TRMM data in yellow meshes, the NMQ data in purple meshes, and the intersection in red triangle meshes.

Figure 10 illustrates the join operation between the TRMM swath and the NMQ grid. The yellow triangle mesh encodes the TRMM swath data, the purple triangle mesh encodes the NMQ grid, and the red mesh indicates the joined intersection.
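In AFL terms, the distinction looks roughly as follows (array, attribute, and dimension names are hypothetical): STARE-aligned arrays admit the cheap join, while misaligned regular grids force the generic cross_join:

# Aligned by STARE: corresponding chunks already reside on the same nodes.
aligned = "join(trmm_stare, nmq_stare)"

# Misaligned regular grids: a generic equi-join on dimensions, which must
# ship chunks between nodes.
generic = "cross_join(trmm_grid as a, nmq_grid as b, a.x, b.x, a.y, b.y)"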

We compare the timing results of the join operation between the STARE-based scheme and the regular-grid partitioning and distribution scheme using the MERRA-2, TRMM, and NMQ datasets. With STARE, because all three datasets are co-aligned, we can directly conduct join queries on any combination of these datasets. Without STARE, the three datasets are misaligned, and cross-join is thus required for regular-grid partitioning.


Fig. 11. Comparison of timing results of the join operation between STARE and regular gridding, using all the time slices within different numbers of days. The time axis is plotted in logarithmic scale. The empty entries in the table are data points that we did not obtain for regular gridding due to its long running time.

Fig. 12. Comparison of timing results of the join operation between the contiguous and round-robin strategies for STARE, using all the time slices of the TRMM and MERRA-2 datasets within different numbers of days.

Figure 11 shows the timing results of the join operation using STARE and regular gridding. We first query the MERRA-2 grid cells where the precipitation rates according to TRMM are greater than 0.7 mm/hr, for all the time steps within different numbers of days. The blue curve in Figure 11 shows that, with STARE, the time increases slowly: it costs only around 1.31 seconds for all the time slices within 90 days. The effect of the data placement alignment afforded by STARE is plainly visible. As indicated by the red curve in Figure 11, without co-alignment before the query, the time increases almost linearly. It costs around 843.43 seconds for a 40-day duration, beyond which the queries cannot complete in a reasonable amount of time and are thus terminated. This is because misaligned data chunks, as evidenced by the images in the bottom row of Figure 9, incur unnecessary communications.

The green and purple curves in Figure 11 compare the timing results of the join operation between the STARE-based and regular partitioning methods using all three datasets, i.e., MERRA-2, TRMM, and NMQ. Even with three datasets, the execution time of our STARE-based method shows only marginal increases relative to that with two datasets, further demonstrating the value of STARE in assuring data placement alignment. The timing results of the regular-grid method are significantly, but unsurprisingly, worse.

Fig. 13. Comparison of timing results of multiple users' random join operations between STARE and regular gridding, using the TRMM and MERRA-2 datasets for different numbers of users. The time axis is plotted in logarithmic scale. The empty entries in the table are data points that we did not obtain for regular gridding due to its long running time.

As discussed in Section III-B, we adopted the round-robin strategy, rather than the contiguous strategy, to traverse the STARE quadtree for partitioning. Figure 12 compares these two strategies. The round-robin strategy clearly outperforms the contiguous strategy; for example, it achieves more than a 51% improvement in a 50-day query. This performance gain is essential for supporting highly interactive analytics operations.

3) Multiple user queries: We also investigate the impact of multiple simultaneous users on the STARE and regular-grid partitioning methods. We test different numbers of simultaneous users, and use random join operations on the TRMM and MERRA-2 datasets over a 10-day duration to simulate different user queries. As shown in Figure 13, again, only a marginal increase is observed in the STARE results from 1 to 100 simultaneous users. This further shows the effectiveness of our method for distinct operations across multiple users, achieved by the careful consideration of user data access patterns in our design (see Section III-B).

In contrast, the regular-grid partitioning cannot effectively support multiple users' operations. As shown in Figure 13, system performance degrades quickly with an increasing number of users. At 30 users, it took around 42 minutes for the system to respond, making interactivity impractical and degrading the user experience.

D. User Interface

To evaluate our system design for interactive analysis, we have constructed a high-level graphical query interface featuring multiple visual analytics capabilities, as shown in Figure 14. Our rudimentary user interface supports three essential functions: visualization, query, and statistical analysis.

First, all the datasets can be visualized in the main window. We provide a time slider at the bottom of the user interface to allow users to choose any time slice and show the rendering results of the datasets. As shown in Figure 14, the TRMM precipitation dataset is shown in yellow to red, appearing as a ribbon across the map, while the NMQ precipitation dataset, only available for the contiguous United States (CONUS), is shown in green to purple. The four buttons to the left of the time slider serve, respectively, to choose different datasets, change opacity, play the time-series visualization, and stop the playback.



Second, users can also input customized queries, such as Precip ≥ 1.0, x ≥ 800 and y ≤ 3600, in the search box at the top of the user interface to easily search for features of interest across multiple heterogeneous datasets.
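Such a query maps naturally onto SciDB's built-in operators; a hypothetical AFL translation of the example above (array, attribute, and dimension names assumed) is:

# "Precip >= 1.0, x >= 800 and y <= 3600" as a single filter predicate;
# for the dimension bounds alone, SciDB's between() operator would
# typically be the faster choice.
query = "filter(nmq_stare, precip >= 1.0 and x >= 800 and y <= 3600)"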

Third, our user interface supports a few real-time statistical analytics (such as histograms and correlations). Figure 14 shows two examples. Users can interactively drag a rectangle to select a region of interest on the map; the corresponding histogram of each dataset is then displayed in the top-right analytics panel. Users can also place a marker at a specific location on the map to display the time series of multiple datasets at that location in the analytics panel. Moreover, users can download these time-series values as a CSV (comma-separated values) file by clicking the download button.

Fig. 14. The user interface of our system.

Our STARE indexing scheme and data placement strategy ensure the scalability of various operations over possible combinations of variables, time slices, geolocations, datasets, and user numbers, delivering interactive performance to each individual end user and rendering our user interface a practical and effective analytics tool for Big Earth Data.

VI. CONCLUSION

Big Earth Data imposes grand challenges on the development of a scalable end-to-end data analytics system. Although previous studies have proposed feasible solutions that address the large-volume challenge, mostly by leveraging loosely coupled techniques, scalable performance is still difficult to achieve when diverse large datasets are involved. We must holistically consider both the volume and variety challenges and address both data co-location and co-alignment in a distributed environment. To this end, we present the SpatioTemporal Adaptive-Resolution Encoding, STARE, for uniform indexing of diverse, heterogeneous Earth datasets and, thus, for spatiotemporal co-alignment and co-location of data chunks. In addition, we partition and distribute the chunks along the traversal of the resulting STARE quadtree leaves in a round-robin fashion, leading to balanced workload among the processors. We conduct an extensive experimental study comparing our STARE-based approach with the conventional regular-gridding-based approach, and demonstrate the effectiveness of our approach using queries over different combinations of variables, time slices, geolocations, datasets, and user numbers. Our end-to-end distributed storage and analytics system is able to achieve scalable performance.

For future work, we hope to enhance our system performance using Graphics Processing Units (GPUs). More sophisticated visual analytics tasks, such as feature extraction and tracking and volume rendering, may then be carried out interactively by leveraging GPUs. We plan to investigate the effects of the complex data access patterns that are expected to exist across the storage hierarchy, from persistent storage to main memory to GPU memory, and to study the possibility [16] of extending STARE to indexing data among multiple GPUs.

ACKNOWLEDGMENT

This research has been sponsored in part by the National Science Foundation through grants ICER-1541043, ICER-1540542, and IIS-1423487, and with supplemental funding from the Advanced Information Systems Technology (AIST) program of the NASA Earth Science Technology Office.

REFERENCES

[1] M. L. Rilee, K.-S. Kuo, T. Clune, A. Oloso, P. G. Brown, and H. Yu, "Addressing the big-earth-data variety challenge with the hierarchical triangular mesh," in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 1006–1011.

[2] L. Yu, H. Yu, H. Jiang, and J. Wang, "An application-aware data replacement policy for interactive large-scale scientific visualization," in Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 2017, pp. 1216–1225.

[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.

[4] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," HotCloud, vol. 10, no. 10-10, p. 95, 2010.

[5] D. Borthakur et al., "HDFS architecture guide," Hadoop Apache Project, vol. 53, 2008.

[6] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.

[7] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[8] S. Zhou, X. Li, T. Matsui, and W. Tao, "Visualization and diagnosis of earth science data through Hadoop and Spark," in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 2974–2980.

[9] P. Z. Kunszt, A. S. Szalay, and A. R. Thakar, "The hierarchical triangular mesh," in Mining the Sky. Springer, 2001, pp. 631–637.

[10] The Sloan Digital Sky Survey and HTM websites: http://skyserver.sdss.org/ and http://www.skyserver.org/htm/.

[11] M. Stonebraker, P. Brown, D. Zhang, and J. Becla, "SciDB: A database management system for applications with complex analytics," Computing in Science & Engineering, vol. 15, no. 3, pp. 54–62, 2013.

[12] K. Doan, A. O. Oloso, K.-S. Kuo, T. L. Clune, H. Yu, B. Nelson, and J. Zhang, "Evaluating the impact of data placement to Spark and SciDB with an earth science use case," in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 341–346.

[13] M. Bosilovich, R. Lucchesi, and M. Suarez. (2016) MERRA-2: file specification. GMAO Office Note No. 9 (Version 1.1), 73 pp., available from http://gmao.gsfc.nasa.gov/pubs/office_notes.

[14] J. Zhang, K. Howard, C. Langston, S. Vasiloff, B. Kaney, A. Arthur, S. Van Cooten, K. Kelleher, D. Kitzmiller, F. Ding et al., "National Mosaic and Multi-Sensor QPE (NMQ) system: Description, results, and future plans," Bulletin of the American Meteorological Society, vol. 92, no. 10, pp. 1321–1338, 2011.

[15] C. Kummerow, W. Barnes, T. Kozu, J. Shiue, and J. Simpson, "The tropical rainfall measuring mission (TRMM) sensor package," Journal of Atmospheric and Oceanic Technology, vol. 15, no. 3, pp. 809–817, 1998.

[16] J. Xie, H. Yu, and K.-L. Ma, "Visualizing large 3D geodesic grid data with massively distributed GPUs," in Large Data Analysis and Visualization (LDAV), 2014 IEEE 4th Symposium on. IEEE, 2014, pp. 3–10.