

National Aeronautics and Space Administration

© 2017 California Institute of Technology. Government sponsorship acknowledged. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.

Big Data Analytics On Cloud Using NEXUS

Thomas Huang¹, Stewart P. Abercrombie¹, Carmen Boening¹, Kevin Gill¹, Frank Greguska¹, Joseph Jacob¹, Michael Little², Chris Lynnes², Nga Quach¹, Rob Toaz¹, Brian Wilson¹

¹ Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, U.S.A.
² NASA Goddard Space Flight Center, 8800 Greenbelt Rd, Greenbelt, MD 20771, U.S.A.

Almost all existing Earth science data analysis solutions are built around large archives of files. When an analysis involves a large collection of files, performance suffers due to the large amount of I/O required. Common data access solutions, such as OPeNDAP and THREDDS, provide web service interfaces to archives of observational data. They too yield poor performance when large numbers of observations are involved, because they are still built around the notion of files. In his well-known 2005 paper on Scientific Data Management in the Coming Decade, the late Jim Gray stated, “The scientific file-formats of HDF, NetCDF, and FITS can represent tabular data but they provide minimal tools for searching and analyzing tabular data.” He went on to point out, “Performing this filter-then-analyze, data analysis on large datasets with conventional procedural tools runs slower and slower as data volume increases.”

NEXUS (https://github.com/dataplumber/nexus) is an emerging, open source, data-intensive analysis framework developed with a new approach to handling science data that enables large-scale data analysis. MapReduce is a well-known paradigm for processing large amounts of data in parallel in cluster or Cloud environments. Unfortunately, this paradigm does not work well with temporal, geospatial array-based data. One major issue is that these data are packaged in files of various sizes; a single data file can range from tens of megabytes to several gigabytes. Depending on the user input, an analysis operation could involve hundreds to thousands of these files.

NEXUS takes a different approach to handling file-based temporal and geospatial observational artifacts by fully leveraging the elasticity of the Cloud Computing environment. Rather than performing on-the-fly file I/O, NEXUS stores tiled data in Cloud-scaled databases backed by a high-performance spatial lookup service. NEXUS thus provides the bridge between science data and horizontally scaling data analysis: the platform simplifies development of big data analysis solutions by bridging the gap between files and MapReduce solutions such as Spark. NEXUS has been integrated into the NASA Sea Level Change Portal (https://sealevel.nasa.gov) as the Big Data analytic backend for its Data Analysis Tool.
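To make the tiling idea above concrete, here is a minimal sketch (not NEXUS's actual ingestion code) that chops a gridded NetCDF granule into fixed-size tiles, records each tile's spatial and temporal bounds for a metadata index, and hands the array payload off to a data store. The variable name, tile size, and the index/store destinations in the comments are assumptions for illustration only.

```python
# Illustrative sketch only: chop a gridded granule into indexed tiles.
# Variable name, tile size, and the index/store hooks are hypothetical.
import numpy as np
from netCDF4 import Dataset

TILE = 128  # tile edge length in grid cells (assumed)

def tile_granule(path, var_name="analysed_sst"):
    ds = Dataset(path)
    lats = ds.variables["lat"][:]
    lons = ds.variables["lon"][:]
    time = float(ds.variables["time"][0])
    data = ds.variables[var_name][0, :, :]   # assumes (time, lat, lon) layout

    for i in range(0, data.shape[0], TILE):
        for j in range(0, data.shape[1], TILE):
            chunk = np.ma.filled(data[i:i + TILE, j:j + TILE], np.nan)
            meta = {                          # would go to the search index (e.g. Solr)
                "min_lat": float(lats[i]),
                "max_lat": float(lats[min(i + TILE, len(lats)) - 1]),
                "min_lon": float(lons[j]),
                "max_lon": float(lons[min(j + TILE, len(lons)) - 1]),
                "time": time,
                "shape": chunk.shape,
            }
            yield meta, chunk.tobytes()       # payload would go to the data store (e.g. Cassandra)
```

With tiles indexed this way, a spatial/temporal query touches only the tiles that intersect the request instead of re-reading whole files.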

NASA SEA LEVEL CHANGE PORTAL

These activities were carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. © 2016 California Institute of Technology. Government sponsorship acknowledged.

https://sealevel.nasa.gov
“One-Stop” source for current sea level change information and data
• interactive tools for accessing and viewing regional data
• virtual dashboard of sea level indicators
• latest news, quarterly reports, and publications
• ongoing updates
• multimedia
• mobile friendly

Carmen Boening, Pat Brennan, Kevin Gill, Thomas Huang, Randal Jackson, Eric Larour, Nga Quach, Holly Shaftel, Laura Tenenbaum, Victor Zlotnicki
Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, Pasadena, CA 91109-8099, U.S.A.

Andra Boeck, Bergen Moore, Justin Moore
Moore Boeck

[Portal screenshots: content is curated for relevance, multidisciplinary coverage, high quality, and latest updates, and is written for expert, novice, and non-scientist audiences.]

Virtual Earth System Laboratory simulations: Columbia Glacier | Haig Glacier | Eustatic Sea Level vs. Local Sea Level | Greenland Basal Friction vs. Greenland Steady-State Friction

Data Analysis Tool
Parameter Visualization | Polar Projection | Dataset Info | Data Sequencer
Suite of Analytic Functions | Download Plots and Results

[Portal navigation: Featured News | Indicators | Learn | Search & Tools | News & Updates]

On-the-fly recreated identification of “The Blob”
• The Blob is the name given to a large mass of relatively warm water in the Pacific Ocean off the coast of North America. It was first detected in late 2013 and continued to spread throughout 2014 and 2015.

• SST anomaly = SST – SST climatology at each location, compared against the standard deviation (Dr. Chelle Gentemann, Senior Scientist at Earth & Space Research)
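The anomaly test above can be written down directly. The sketch below, plain NumPy with made-up array names, flags grid cells whose SST anomaly exceeds some multiple of the local standard deviation, which is the kind of screening used to pick out a feature like The Blob; it is an illustration, not the portal's actual code.

```python
# Sketch of the anomaly test above; sst, climatology, and stddev are
# co-registered lat x lon arrays for one date (names are hypothetical).
import numpy as np

def flag_anomalies(sst, climatology, stddev, n_sigma=2.0):
    anomaly = sst - climatology              # SST anomaly = SST - SST climatology
    with np.errstate(invalid="ignore"):
        mask = anomaly > n_sigma * stddev    # warm anomalies beyond n_sigma
    return anomaly, mask

# Example with random data standing in for real fields:
sst = 290.0 + np.random.rand(180, 360)
clim = 290.0 + 0.5 * np.random.rand(180, 360)
std = 0.2 * np.ones((180, 360))
anom, blob_mask = flag_anomalies(sst, clim, std)
```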

NASA AIST - OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal

NEXUS: The Deep Data Platform
Apache Spark (In-Memory MapReduce) and Two Scalable Databases

[Architecture diagram, workflow automation and horizontal-scale data analysis environment: the HORIZON Data Management and Workflow Framework (ingest pools, product subscribers, staging, file & product services, job tracking services, inventory, security, SigEvent, and search services, with ZooKeeper-coordinated manager/handler and worker pools) feeds a Spring XD-based ETL system of deep data processors. The processors populate the index and data catalog (Solr index plus a NoSQL store) that backs the analytic platform, where algorithms run as Spark jobs (RDD objects, DAG scheduler, task scheduler, executors, block manager) behind a web portal and data access services, all hosted on a private cloud at the NASA Physical Oceanography Distributed Active Archive Center.]
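To make the Spark half of the diagram concrete, here is a minimal PySpark sketch of the tile-parallel pattern this kind of analytic follows: tiles matching a query are mapped to partial (sum, count) pairs per time step and reduced into an area-averaged time series. The tile dictionaries, the randomly generated stand-in data, and the omission of latitude weighting are simplifying assumptions; the real NEXUS job wiring differs.

```python
# Minimal PySpark sketch of tile-parallel area averaging (illustrative only).
import numpy as np
from pyspark import SparkContext

def partial_stats(tile):
    """Map one tile to (time, (sum, count)) over its valid cells."""
    data = tile["data"]                      # 2-D array for one time step
    valid = ~np.isnan(data)
    return tile["time"], (float(data[valid].sum()), int(valid.sum()))

def merge(a, b):
    """Combine partial (sum, count) pairs from different tiles."""
    return a[0] + b[0], a[1] + b[1]

def area_averaged_time_series(sc, tiles):
    rdd = sc.parallelize(tiles)
    sums = rdd.map(partial_stats).reduceByKey(merge)   # combine tiles per time step
    return sorted(sums.mapValues(lambda s: s[0] / s[1]).collect())

if __name__ == "__main__":
    sc = SparkContext(appName="area-avg-sketch")
    # Fake tiles standing in for what the metadata index / data store would return.
    tiles = [{"time": t, "data": 290.0 + np.random.rand(128, 128)} for t in (0.0, 1.0, 2.0)]
    print(area_averaged_time_series(sc, tiles))
```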

Comparison of Giovanni and NEXUS run times to compute 16-year area-averaged daily time series of MODIS-Terra Aerosol Optical Depth (AOD) 550 nm dark target at 1 degree resolution for the indicated spatial bounds (Global, Colorado, or Boulder). NEXUS was run on 6 Amazon Web Services (AWS) Cloud instances of type “i2.4xlarge” with Solr, Cassandra, Spark 2.0, and the Mesos scheduler. NEXUS has superior performance for subsetting operations, as indicated by the ~300x speedup for the Colorado subset.

Comparison of NEXUS run times using the YARN and Mesos schedulers for the MODIS-Terra time series calculation on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0. In our experiments, Mesos consistently yields a speedup of 2 to 4 times over YARN.

MODIS-Terra Area-Averaged Time Series, Run Time (s): Global: Giovanni 1778, NEXUS 16-way 35 | Colorado: Giovanni 3010, NEXUS 16-way 11 | Boulder: Giovanni 1228, NEXUS 16-way 10

MODIS-Terra Area-Averaged Time Series, Run Time (s): Global: YARN 124, Mesos 27 | Colorado: YARN 42, Mesos 12 | Boulder: YARN 37, Mesos 12

Comparison of Giovanni and NEXUS run times to compute a global correlation map of 0.25 degree TRMM daily precipitation rate (TRMM_3B42_daily_precipitation_V7) vs. TRMM real-time daily precipitation (TRMM_3B42RT_daily_precipitation_V7) for 1/1/2001 - 12/31/2014. These runs included 5,113 global data granule pairs (between 50 deg. south and 50 deg. north latitude) totaling 40 GB. NEXUS was run on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0 with the Mesos scheduler.

TRMM_3B42 vs TRMM_3B42RT Correlation Map (Global), Run Time (s): Giovanni 768 | NEXUS 16-way 60 | NEXUS 64-way 34
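The correlation map benchmarked above lends itself to the same tile-parallel pattern: each worker contributes running sums (n, Σx, Σy, Σxy, Σx², Σy²) per grid cell, and the Pearson correlation is only formed after the reduce. Below is a hedged NumPy sketch of the per-cell accumulation and final combine, with invented array and field names; it is not the Giovanni or NEXUS implementation.

```python
# Sketch of accumulating a Pearson correlation map across time steps.
# x_t, y_t are co-located 2-D fields (e.g. two precipitation products) for one day.
import numpy as np

def init_accumulator(shape):
    return {k: np.zeros(shape) for k in ("n", "sx", "sy", "sxy", "sxx", "syy")}

def accumulate(acc, x_t, y_t):
    valid = ~(np.isnan(x_t) | np.isnan(y_t))
    x = np.where(valid, x_t, 0.0)
    y = np.where(valid, y_t, 0.0)
    acc["n"] += valid
    acc["sx"] += x
    acc["sy"] += y
    acc["sxy"] += x * y
    acc["sxx"] += x * x
    acc["syy"] += y * y
    return acc                    # accumulators from different workers add element-wise

def correlation_map(acc):
    n = acc["n"]
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = acc["sxy"] - acc["sx"] * acc["sy"] / n
        var_x = acc["sxx"] - acc["sx"] ** 2 / n
        var_y = acc["syy"] - acc["sy"] ** 2 / n
        return cov / np.sqrt(var_x * var_y)
```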

Comparison of Giovanni and NEXUS run times to compute an area-averaged time series over the continental United States and a global time-averaged map (run times listed below) of 0.25 degree TRMM daily precipitation rate (TRMM_3B42_daily_precipitation_V7) for 1/1/1998 - 12/31/2015. These runs included 6,574 global data granules (between 50 deg. south and 50 deg. north latitude) totaling 26 GB. NEXUS was run on an 8-node cluster computer at JPL running Solr, Cassandra, and Spark 2.0 with the Mesos scheduler.

TRMM_3B42 Area Averaged Time Series (CONUS), Run Time (s): Giovanni 4120 | NEXUS 16-way 94 | NEXUS 64-way 40

TRMM_3B42 Time Averaged Map (Global), Run Time (s): Giovanni 169 | NEXUS 16-way 54 | NEXUS 64-way 41

Performance and Scaling

NEXUS is an emerging technology developed at JPL
• Open Source: https://github.com/dataplumber/nexus
• A Cloud-based/Cluster-based data platform that performs scalable handling of observational parameters
• Analysis designed to scale horizontally by
  - leveraging a high-performance indexed, temporal, and geospatial search solution
  - breaking data products into small chunks and storing them in a Cloud data store

Data Volumes Exploding
• SWOT mission is coming
• File I/O is slow

Scalable Store & Compute is Available
• NoSQL cluster databases
• Parallel compute, in-memory map-reduce
• Bring compute to highly accessible data (using Hybrid Cloud)

Pre-Chunk and Summarize Key Variables (see the sketch below)
• Easy statistics instantly (milliseconds)
• Harder statistics on-demand (in seconds)
• Visualize original data (layers) on a map quickly
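As a loose illustration of the "pre-chunk and summarize" point above, the sketch below computes a tiny per-tile summary record at ingest time; a simple query such as a regional mean can then be answered from these records in milliseconds, touching the full tile arrays only for harder, on-demand statistics. The record fields and function names are assumptions, not the NEXUS schema.

```python
# Hypothetical per-tile summary record computed once at ingest time.
import numpy as np

def summarize_tile(tile_id, data):
    """Tiny record that lets simple statistics be answered without re-reading the tile."""
    valid = data[~np.isnan(data)]
    return {
        "tile_id": tile_id,
        "count": int(valid.size),
        "sum": float(valid.sum()) if valid.size else 0.0,
        "min": float(valid.min()) if valid.size else None,
        "max": float(valid.max()) if valid.size else None,
    }

def fast_regional_mean(summaries):
    """'Easy statistic': combine pre-computed summaries of the tiles matching a query."""
    total = sum(s["sum"] for s in summaries)
    count = sum(s["count"] for s in summaries)
    return total / count if count else float("nan")
```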

Using open source technologies:
• Apache Cassandra
• Apache Kafka
• Apache Mesos/YARN
• Apache Solr
• Apache Spark/PySpark
• Apache Zookeeper
• EDGE
• Jupyter
• Spring XD
• Tornado