Quest for Value in Big Earth Data
Identifying the most cost-effective way of dealing with Big Data in Geoscience
Kwo-Sen Kuo1,2,3 and colleagues
Including but not limited to:
Michael L Rilee4, Lina Yu5, Yu Pan5, Feiyu Zhu5, Hongfeng Yu5
1. NASA Goddard Space Flight Center, Greenbelt, Maryland, USA
2. University of Maryland, College Park, Maryland, USA
3. Bayesics LLC, Bowie, Maryland, USA
4. Rilee Systems Technologies, Derwood, MD, USA
5. University of Nebraska-Lincoln, Lincoln, NE, USA
The V’s of Big Data challenge
[Figure: the V's of Big Data: Volume, Variety, Velocity, Veracity, Value]
Scaling Volume
❖ Parallel processing – There is no other way!
➢ Need to exploit multimodal parallelization
o Shared memory parallelization, SMP
▪ SMP has been incorporated into the fundamentals of modern scripting
languages and achieved considerable pervasiveness
• MATLAB, IDL, Python, R, Julia, etc.
▪ Geoscientists can transparently leverage SMP without knowing, say, OpenMP (a minimal sketch follows this list)
▪ (GPU, Quantum)
o Distributed memory parallelization, DMP
▪ Memory on a single node is not sufficient for Big Data – need a cluster of
many nodes
▪ Pervasive systematic utilization of DMP? Very challenging!
▪ We have found a way!
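To make the SMP point concrete, here is a minimal sketch, not from the original slides: a vectorized NumPy operation that, assuming NumPy is linked against a multithreaded BLAS such as OpenBLAS or MKL, uses all cores of a node without the scientist writing any OpenMP or threading code.

```python
# Minimal SMP sketch: assumes NumPy is linked against a multithreaded BLAS
# (e.g. OpenBLAS or MKL), which is typical but not guaranteed.
import numpy as np

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)

# The matrix product is dispatched to the BLAS library, which spreads the
# work across the node's cores: shared-memory parallelism the analyst
# never has to program explicitly.
c = a @ b
print(c.shape)
```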
Observations and Insights
Observations on Data Movement Links
[Charts: bandwidths of data-movement links grow roughly +25% to +50% per year depending on the link (effective Internet user bandwidth ~+50%/year), all below Moore's Law]
Sources: https://itblog.sandisk.com/cpu-bandwidth-the-worrisome-2020-trend/ and https://www.nngroup.com/articles/law-of-bandwidth/
Summary
Link      BW (MB/s)     Remark
DRAM      ~O(5) (max)   Dedicated
Network   ~O(4) (max)   Shared
SSD       ~O(3) (max)   Dedicated
HDD       ~O(2) (max)   Dedicated
Internet  ~O(2) (eff)
❖ Network
➢ High-performance interconnect, e.g. InfiniBand or fiber optics
➢ Usually shared among ~10–1000 nodes
❖ Progressively lower bandwidth further away from the CPUs
Insights
❖ Data must be stored close to CPU
➢ Bandwidth drops by roughly an order of magnitude per link away from the CPU (or GPU)
➢ Congestion at any link can decimate throughput
➢ Shared-nothing architecture! – This rules out the Cloud, which is based mostly on traditional, compute-bound HPC architecture
❖ If data must be moved, the volume to be moved must be
commensurate with the bandwidth
➢ i.e. move larger volumes using higher-bandwidth links (see the arithmetic sketch after this list)
❖ Data locality is paramount to performance/efficiency
➢ High bandwidth access to all data required for analysis
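The "commensurate volume" point is simple arithmetic; the short sketch below, not in the original slides, uses the approximate orders of magnitude from the summary table above to show how long moving 1 TB takes over each class of link.

```python
# Back-of-the-envelope arithmetic: time to move 1 TB across links whose
# bandwidths follow the orders of magnitude in the summary table above.
TB = 1e12  # bytes
links_mb_per_s = {"DRAM": 1e5, "Network": 1e4, "SSD": 1e3, "HDD/Internet": 1e2}

for link, bw in links_mb_per_s.items():
    seconds = TB / (bw * 1e6)  # bandwidth converted to bytes/s
    print(f"{link:13s}: {seconds:8.0f} s  (~{seconds / 3600:.1f} h)")
```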
A Conclusion
Tightly coupled compute-storage: The only class of
approaches to satisfy the requirements
❖ The analysis engine also manages the storage.
❖ Parallel distributed database management system
(DBMS) typifies the approach.
➢ Not the loosely coupled approaches like Spark or Hadoop
➢ Not the non-system tools like Dask
❖ Cloud becomes prohibitively expensive with an IOPS guarantee.
HPC versus SNA Architecture
HPC/Cloud Architecture
❖ Better suited for compute-bound tasks, e.g. model simulations
❖ Facilitating elasticity – spinning up a variable number of VMs
Shared Nothing Architecture
❖ Better suited for I/O-bound tasks, e.g. data analysis
❖ Improved compute-storage affinity (data locality) but reduced elasticity
Data Placement
How data are partitioned and distributed onto cluster nodes
[Figure: global data layout; annotation: "N. America is here!"]
Simplistic Data Placement/Layout
Better Load Balance – Smaller Chunks
Data Placement Alignment
(Data Placement is also known as Data Layout)
Data Variety Challenges
❖ Spatial aspect (2D or 3D)
➢ Nonuniform data models
o e.g. Grid, Swath, and Point
➢ Nonuniform data resolutions
➢ Decoupling of array indices from geolocations
❖ Temporal aspect (1D)
➢ Nonuniform data resolutions
➢ Decoupling of array index from calendrical time
❖ File-centric practice
➢ Data are packaged into files of nonuniform
extents
o i.e. different spatial coverages and temporal durations
[Diagrams: a SciDB cluster with one Coordinator and many Worker nodes, each running a SciDB engine with its own local store; numbered data chunks (1–16) are partitioned and distributed across the nodes under the placement/layout schemes above]
Minimize Data Movement: Keep Analysis Pleasingly Parallel!
We recognize:
❖ A great proportion of Earth Science analysis requires spatiotemporal
coincidence
➢ i.e. for the same location and same time
➢ e.g. comparisons (for verification and validation) almost always require it
❖ The objective: data placement alignment
❖ Pleasingly parallel execution is impossible for some analyses, e.g. FFT and CCL
To align the placements of data chunks spatiotemporally on the
cluster nodes, we must thus
❖ Uniformly index all geoscience data spatiotemporally!
➢ Because Array DBMSs use the index for partitioning (a sketch of the idea follows)
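As a hedged sketch of why a uniform spatiotemporal index yields placement alignment (the chunk ids and the node_for function below are hypothetical, not SciDB's actual partitioning code): if every dataset is chunked by the same index and the chunk-to-node mapping depends only on that index, chunks for the same place and time always land on the same node.

```python
# Hypothetical illustration of index-driven data placement alignment.
N_NODES = 4

def node_for(chunk_id: int) -> int:
    # Deterministic mapping from a spatiotemporal chunk index to a node;
    # it depends only on the index, never on the dataset.
    return chunk_id % N_NODES

# Two heterogeneous datasets chunked by the same spatiotemporal index...
dataset_a = {17: "reanalysis chunk", 42: "reanalysis chunk"}
dataset_b = {17: "satellite swath chunk", 42: "satellite swath chunk"}

# ...are automatically colocated: analyses needing spatiotemporal
# coincidence never have to move chunks between nodes.
for cid in sorted(set(dataset_a) & set(dataset_b)):
    print(f"chunk {cid}: both datasets on node {node_for(cid)}")
```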
Scaling Variety
❖ Scaling Variety is much harder than scaling Volume.
❖ Homogeneous dataset: parallelization, even DMP, can be developed once and applied many times (to the same homogeneous dataset): reusable and scalable.
➢ NASA Earth Exchange, NEX
❖ Heterogeneous datasets: necessary for integrative analysis, but the above case-by-case approach no longer scales!
➢ Being a system science, Earth science demands interdisciplinary, integrative
analysis of diverse datasets from diverse subdisciplines.
➢ If parallelization must be developed for each combination of heterogeneous datasets -> piecemeal scalability!
➢ While SMP can be transparently leveraged by typical Earth scientists using modern scripting languages, it is not the case with DMP.
➢ Typical Earth scientists do not possess the programming skill for DMP.
➢ Without a better system, Earth scientists are doomed to stay in the “disarray of variety”.
STARE
SpatioTemporal Adaptive-Resolution Encoding
Spatial Element of STARE
❖ Hierarchical Triangular Mesh, HTM
Right-justified HTM Encoding
❖ Quadtree hierarchy
➢ Indexes geolocation – a substitute for lat-lon
➢ Contains approximate data resolution
The bit format of the STARE spatial index including geo-position and resolution.
Efficient Set Operations on Regions
Spatial intersection by comparing integers is facilitated by the encoded resolution.
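The following is a hedged sketch of the idea only; STARE's actual bit layout differs. With a quadtree (HTM-style) index that carries its own resolution level in its low bits, testing whether a fine cell lies inside a coarse cell reduces to masking and comparing integers.

```python
# Illustrative right-justified quadtree index: path bits left-aligned in the
# high bits, resolution level stored in the low 5 bits. NOT the actual
# STARE bit layout; for illustration only.
LEVEL_BITS = 5
MAX_LEVEL = 27

def make_index(path_bits: int, level: int) -> int:
    # Left-align the quadtree path so a coarse cell is a prefix of its children.
    return (path_bits << (2 * (MAX_LEVEL - level) + LEVEL_BITS)) | level

def contains(coarse: int, fine: int) -> bool:
    level = coarse & ((1 << LEVEL_BITS) - 1)
    shift = 2 * (MAX_LEVEL - level) + LEVEL_BITS
    # Containment iff the two paths agree down to the coarse cell's resolution.
    return (coarse >> shift) == (fine >> shift)

parent = make_index(0b0110, 2)    # a level-2 cell
child = make_index(0b011011, 3)   # one of its level-3 children
print(contains(parent, child))    # True
```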
Temporal Element of STARE
❖ Hierarchical Calendrical Encoding, HCE
Summary on Scaling Volume and Variety
Scaling Volume and Variety
On a parallel distributed array DBMS like SciDB:
❖ STARE homogenizes Variety by guaranteeing spatiotemporal data placement alignment on cluster nodes
➢ Data chunks of the same place and time from different datasets are colocated
❖ Shared Memory Parallelization, SMP
➢ Utilized on each node when analysis is pleasingly parallel
o High bandwidth from local storage to DRAM fully exploited
o Unnecessary node-to-node communication minimized
❖ Distributed Memory Parallelization, DMP
➢ Utilized automatically and systematically when necessary
o The data partitioning/distribution operation performed by DBMSs is akin to (pre-)domain decomposition
o Data are “domain decomposed” consistently using STARE into chunks and stored on disk
o Data chunks are quickly loaded into memory when needed for analysis
o DBMSs coordinate communication and execution of DMP
o SciDB has its own communication protocol but can also use MPI, e.g. for ScaLAPACK
A Prototype
Interactive animation
Data and Process Flow
[Flow diagram connecting: Heterogeneous Geoscience Data, STARE, Spatiotemporally Colocated Data, SciDB Analytics, Browser-based GUI]
Prototype Demo Animation
Future Additions
❖ Use STARE to drastically improve (parallel) ingest performance while reducing resource requirement.
❖ Connect to metadata repository(ies)
➢ Seamlessly integrating into existing data discovery practice
❖ Enhance throughput
➢ Further reducing data movement for visualization
➢ Anticipating user analysis needs with behavior prediction
❖ Improve Python API to SciDB
➢ Integrating highly compatible Xarray
o E.g. Both support named dimensions
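An illustrative use of xarray's named dimensions, the compatibility feature cited above; the variable and dimension names are examples, not the prototype's actual schema.

```python
import numpy as np
import xarray as xr

# A small array with named dimensions, mirroring how an array DBMS such as
# SciDB addresses data by dimension name rather than by position.
temperature = xr.DataArray(
    np.random.rand(24, 91, 144),
    dims=("time", "lat", "lon"),
    name="temperature",
)

# Reductions and selections by dimension name.
time_mean = temperature.mean(dim="time")
print(time_mean.dims)  # ('lat', 'lon')
```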
Envisioned Architecture
[Architecture diagram: a Traditional HPC Simulation Cluster (compute cluster with a parallel file system) alongside a Data Intensive Analysis (GATE) Cluster, plus a Metadata Repository, Data Centers, and Users, connected by numbered data paths]
(Profound) Implications
❖ Better software quality, traceability, and reusability
➢ Programming SciDB UDXs, with DMP especially, is not for typical scientists
➢ Professional software engineers are required
➢ Once a UDX is constructed it is immediately reusable for all users
❖ Easier interdisciplinary collaboration
➢ A multiuser system with sophisticated control of roles and permissions
❖ Better-assured research reproducibility
➢ More localized provenance collection
❖ Higher cost effectiveness by leveraging existing HPC facilities!
➢ Collocated with simulation hardware -> fast ingest of simulation results for analysis
➢ The collective cost of using the Cloud is higher than it appears (since scientists do not pay for the Cloud out of their own pockets), especially when the above factors are taken into consideration.
CCL UDO
❖ CCL: Connected Component Labeling
➢ Non-pleasingly parallel, DMP required
❖ UDO: User Defined Operator
❖ Used to track blizzards defined by visibility
reduction due to in-air snow mass
➢ Processed 36 years (1980–2015) of MERRA reanalysis hourly data at 0.625°×0.5° resolution in ~30 minutes using a cluster of 28 containerized nodes (on ~6-year-old hardware).
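For orientation, here is a single-node sketch of connected component labeling on a synthetic low-visibility mask; it uses SciPy and is not the distributed SciDB UDO described above, and the visibility field and threshold are invented for the example.

```python
import numpy as np
from scipy import ndimage

# Synthetic "visibility" field (km) on a MERRA-like 0.625° x 0.5° grid;
# blizzard candidates are cells below an illustrative 0.4 km threshold.
visibility = np.random.uniform(0.0, 10.0, size=(361, 576))
mask = visibility < 0.4

# Label connected components: each label is a candidate blizzard region.
labels, n_regions = ndimage.label(mask)
sizes = np.bincount(labels.ravel())[1:]  # cells per region (skip background 0)
largest = int(sizes.max()) if n_regions else 0
print(f"{n_regions} candidate regions; largest covers {largest} cells")
```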
[Result panels: Blizzard Track Density, Blizzard Unique Visits, Snowfall Rate, 2-m Wind Speed, Momentum Roughness Length; North America 2010 Winter Animation]
Phenomenon Hierarchy
❖ Phenomenon hierarchies
➢ Super-phenomena contain sub-phenomena, analogous to supersets and subsets
❖ Schiermeier, Q., “The real holes in climate science.” Nature News, 2010.
➢ Regional climate projections
➢ Representation of precipitation and cloud
➢ Role of aerosols
➢ Palaeoclimatological data
❖ Process-based diagnostics
➢ The heavy-handed use of univariate averaging has reached its limit of usefulness
➢ Highly contextual/conditional approaches are needed
➢ Phenomenon hierarchies are such high-context conditional features for more targeted
model diagnostics
Acknowledgement
This research has been sponsored in part by
❖ National Science Foundation through grants ICER-1541043, ICER-1540542, IIS-1423487 and
❖ Advanced Information Systems Technology (AIST) program of NASA Earth Science Technology Office (ESTO)
with supplemental funding from
❖ Advancing Collaborative Connections for Earth System Science (ACCESS) program of NASA Earth Science Data System program.
Thank you!