sherlock demo 2019 v13 screenshots - nasabig data system + spark + jupyter • big data cluster...

121
S H E R L C K D A T A W A R E H O U S E April 8, 2019 https://ntrs.nasa.gov/search.jsp?R=20190025090 2020-05-21T07:52:27+00:00Z

Upload: others

Post on 21-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CKDATA WAREHOUSE

April 8, 2019

https://ntrs.nasa.gov/search.jsp?R=20190025090 2020-05-21T07:52:27+00:00Z

Page 2: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Take-aways

SHERL CK

• Sherlock contains a valuable collection of flight, air traffic management, and weather data

• Sherlock is more than just a data archive

• Pique interest for follow-on workshops– Data visualization & analytics with MicroStrategy– Processing using the Big Data system from Jupyter Notebook

2

Page 3: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Outline

SHERL CK

• What is Sherlock?• How are data stored? (Why should I care?)• Sherlock information and data access• Summary of archived data• Overviews and demos of resources

3

Page 4: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• Platform for collection, processing, and archivingair traffic management data

• Sherlock resources include:– Sherlock home page New!– File download web UI Reorganized!– Data visualization & analytics New!– Jupyter Notebook on Big Data system New!– ATM Knowledge Graph (experimental)– Hue Browser– THREDDS Data Server– GeoServer

What is Sherlock?

SHERL CK

Demos

4

Backup slides

Page 5: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How are data stored?(Why should I care?)

SHERL CK 5

Page 6: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Where data stored?

SHERL CK

Operational Data Store (ODS)

Big Data System

File System• Linux file system, /home/data

• Traditional database

• Distributed data storage and processing

6

Page 7: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How are data stored?

SHERL CK 7

FormatFunctionality

Select rows/columns of interest

View data before download

Download only the data you wantGood for small data sets

Good for large data setsAvailable from

File system /home/data

File download in web UITables and charts in web UI

Data visualization & analytics

Jupyter Notebook

Hue browser

File SystemFlat files

ODSTables

Big DataApache Parquet

NoNo

NoYes

No

YesYes

NoNo

No

No

YesYes

YesYes

No

YesYes

YesNo

Yes

NoNo

YesYes

No

No

NoNo

NoNo

Yes

Yes

Page 8: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Sherlock information and data access

SHERL CK 8

Page 9: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Sherlock resources

SHERL CK 9

https://atmweb.arc.nasa.gov/

Page 10: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 10

Demo: Sherlock home page

Page 11: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 11

Page 12: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 12

Page 13: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 13

Page 14: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 14

Page 15: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 15

Page 16: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 16

Page 17: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 17

Page 18: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 18

Page 19: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Summary of archived data

SHERL CK 19

Page 20: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

FAA SWIM data in Sherlock

SHERL CK 20

System Wide Information Management (SWIM) Program

• National Airspace System (NAS)-wide information system that

supports Next Generation Air Transportation System (NextGen)

goals

• Increased common situational awareness various stakeholders

• Single point of access for aviation data

– Producers of data publish it once

– Users access the information they need through a single connection

Page 21: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

FAA SWIM data in Sherlock

SHERL CK 21

Flight dataSTDDS/ASDE-X Surface dataTAIS TRACON data, including VFR flightsSFDPS En route flight dataTFMData NAS-wide flight data, flow constraintsTBFM Operational metering data

Airport dataAPDS Airport Data Service, Runway Visual Range infoNOTAM Notices to Airmen

Weather dataITWS Terminal convective weather

Page 22: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Data in SherlockWeather dataCIWS Convective weather forecastsMETAR Airport current surface weather conditionsRR NOAA Rapid Refresh forecastsCCFP Simplified convective weather polygonsMETAR Airport weather reportsTAF Terminal Aerodrome ForecastWITI Weather Impacted Traffic IndexPIREP Pilot reports

Obsolete weather data (stored but no longer updated)RUC NOAA Rapid Update Cycle forecasts

SHERL CK 22

Page 23: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Flight dataCTAS Text-based format, in Center/TRACON pairs

Obsolete flight data (stored but no longer updated)ASDI Flight data

Traffic managementATCSCC Strategic advisories

Facility reportsOPSNET Statistics

Data in Sherlock

SHERL CK 23

Page 24: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Archived derived dataproduced by ATAC

SHERL CK 24

Page 25: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Processed data from ATAC in Sherlock

Flight dataIFF Flight plan and track

EV Flight event

RD Flight summary

SHERL CK 25

Page 26: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Legacy STARS (1/2014 to present)SWIM STARS (TAIS) (11/2017 to present)

Legacy HOST/ ERAM – from 1/2014SWIM ERAM (SFDPS) – from 5/2017

ARTS –available from 1/1/2014 ~ 11/2015 SWIM ASDE-X – from 1/2016

Analysis-ready track, flight plan, and metadata for 94 individual facilities

TRACON / Terminal

TRACON / Terminal

ARTCC / Enroute

ATCT / Surface

SHERL CK 26

Page 27: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

USA Merged data –available from 1/1/2014, *note – data prior to 1/16/2016 does not contain surface dataSHERL CK

End-to-end trajectories, flight plans, and meta data

27

Page 28: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Performance reports/data in SherlockFacility reports Scope (Facility Type) DescriptionGo Arounds TRACON/Terminal Counts, runways, altitude, return time

Turns to Final TRACON/Terminal Overshoots, glideslope speed/altitude deviations, turn on angle

Fix Passing TRACON/Terminal Fix/Waypoint throughput

Runway Usage ATCT/Surface Runway throughput, arrival/departure rates

Taxi Time ATCT/Surface Taxi out, taxi in time

Instantaneous Counts ARTCC/Enroute Number of aircraft in sectors in 15 minute bins, summary statistics

Sector Stats ARTCC/Enroute Sector count, flight Time/distance, and transitions between sectors

Sector Activity ARTCC/Enroute Sector entries/exits/counts in 15 minute bins

Field10 Reroute ARTCC/Enroute Diversions and reroutes

NAS-wide reportsJet Airway Flow CONUS Jet Airway counts, flight time, and distance

Best Flight Plan CONUS Synthesized flight plan route closest to what was actually flown

CCFP Sector/ARTCC Coverage CONUS Percentage of Sector/ARTCC covered by CCFP convective weather (per 2hr)

CWAM Sector/ARTCC Coverage CONUS Percentage of Sector/ARTCC covered by CWAM convective weather (per 15min)

CCFP Jet Airway Coverage CONUS Percentage of Jet Airway covered by CCFP convective weather (per 15min)

SHERL CK 28

Page 29: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Overviews and demos of resources

SHERL CK 29

Page 30: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Sherlock resources

SHERL CK 30

https://atmweb.arc.nasa.gov/

Page 31: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK

https://atmweb.arc.nasa.gov/

31

Page 32: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 32

Demo: File Download Web UI

Page 33: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 33

Page 34: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 34

Page 35: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 35

Page 36: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 36

Page 37: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 37

Page 38: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 38

Page 39: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 39

Page 40: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 40

Page 41: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 41

Page 42: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 42

Page 43: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 43

Page 44: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 44

Page 45: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 45

Page 46: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 46

Page 47: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 47

Page 48: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Data visualization & analytics

• What is MicroStrategy?:– Enterprise business intelligence (BI) application software – Allows users to create custom data tables and visualizations

• Sherlock data visualization and analytics with MicroStrategy:– Visualize Sherlock data without downloading– Visualize user-generated data on Sherlock’s Big Data system– Create custom visualizations for unique needs– Perform some basic data analytics

SHERL CK 48

Page 49: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Data visualization

SHERL CK 49

Page 50: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 50

Demo: Data Visualization & Analytics

Page 51: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 51

Page 52: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 52

Page 53: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 53

Page 54: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 54

Page 55: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 55

Page 56: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 56

Page 57: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 57

Page 58: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 58

Page 59: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 59

Page 60: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 60

Page 61: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 61

Page 62: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 62

Page 63: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 63

Page 64: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 64

Page 65: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 65

Page 66: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 66

Page 67: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 67

Page 68: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 68

Graph-structured queryable repository of disparate ATM data(Experimental, with limited data available)

Page 69: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

What is ATM Knowledge Graph?

SHERL CK

• Stored in a Graph Databaseo Not in traditional Operational Data Store (ODS)

• Accessible via:– Web Interface

• Query editor/executor• Visualization tool

– Programmatic API

– MicroStrategy

Subgraph describing a flight

• Highly-interconnected network-structured data store, where:o Nodes

• Represent ATM entities (flights, airports, facilities, aircraft, routes...)

• Store properties/data of entitieso Links represent interrelationships

69

Page 70: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

What is the value of ATM Knowledge Graph?

• Sherlock is not a unified database; it is a data repository– Cannot generally query across data tables or data sources

• Knowledge Graph merges/integrates/unifies data from multiple sources into one large graph structure to enable cross-source querying

• Result: You can: ü Query Sherlock as a unified databaseü Visualize and navigate through the data graphü Download integrated data

70

Page 71: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• ASDI: Flight track data

• TFM Advisories: GDPs, reroutes, Ground Stops,…

• METAR/TAF: Airport weather observation & forecast data

• ERAM adaptations: NAS infrastructure data(facilities, routes, SIDs/STARs, fixes, airways, sectors,…)

• ASPM: Airport performance (traffic counts, delay stats,…)

• FAA Aircraft Registry: Aircraft Characteristics(registration, certification, ownership, aircraft & engine models)

• CAST/ICAO Aircraft Taxonomy: Aircraft Models and Manufacturers

• Airlines, Airport Terminals/Gates

Sherlock sources

What data are stored in ATM Knowledge Graph?

Non-Sherlock sources

Experimental: Currently very limited data in Knowledge Graph! (only July 2014 for ZNY)

71

Page 72: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 72

Demo: ATM Knowledge Graph

Page 73: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Flight UAL535 on 2014-07-15

Aircraft N589UA

Flight Plan

Departure Airport

Trajectory

Track PointTrack Points

Page 74: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

74

Page 75: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Flight trajecotry

75

Page 76: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Big Data system

SHERL CK 76

Page 77: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Big Data system• What is a Big Data system?

– Built on commodity hardware– Massive storage for any kind of data– Enormous processing power– Ability to handle virtually limitless concurrent tasks

• Sherlock Big Data system– 32-node cluster in Building N233– SuperMicro Engineered System

• Total of 576 CPU Cores• Total of 800 TB Storage

– Cloudera distribution of HadoopSHERL CK 77

Page 78: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Data sources currently on Big Data system

Facility flight dataIFF Flight plan and track

EV Flight event

RD Flight summary

USA merged flight dataIFF Flight plan and track

EV Flight event

RD Flight summary

SHERL CK 78

Please contact the Sherlock team if you would like to access a particular data source from our Big Data system.

Page 79: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How to access Big Data

• Query Big Data • Hue browser

• Process data on Big Data system• Requires a cluster-computing framework

• Many to choose from!• Sherlock team recommends: Apache Spark

SHERL CK 79

Page 80: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• Unified analytics engine for large-scale data processing

• Performs fault-tolerant distributed computing and parallel processing services on a cluster

• Supports 4 languages:– Scala (native)– Java (fast)– Python (easy)– R

80

What is Apache SparkTM?

SHERL CK

Page 81: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Traditional1. Write code on local machine2. Zip code and scp to sherlock.arc.nasa.gov3. Run Spark submit job to execute code on Big Data system

Drawbacks:• Develop with a small local copy of data• Difficult to debug when deployed on the Big Data system

81

Big Data development workflow

SHERL CK

Page 82: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Jupyter Notebook1. Develop code in web browser on sherlock.arc.nasa.gov2. Break code into segments3. Run and see output in-line

Advantages:• Code is running on Big Data system while developing• Debugging is easier in Jupyter than traditional work flow

82

Big Data development workflow

SHERL CK

Page 83: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 83

Page 84: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 84

Code cell input

On-the-fly inline plots

Jupyter Notebook• Open source web

application for interactive computing

• Execute code interactively in the browser

• View results and plots inline in the Notebook

Page 85: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Machine learning use case: flight track clustering1. Join and filter track data from Big Data system2. Process data to generate features3. Detect and remove outliers4. Cluster the data using K-Means5. Evaluate resulting clustering and create plots

SHERL CK 85

Page 86: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Outlier detection stepNumber of

Flights PlatformProcessing time

Hours Minutes

25,000Personal machine 1 26Big Data 0 13

50,000Personal machine crashedBig Data 0 32

Benefits of using Big Data system

• No need to copy hundreds of GB of data to user machine• Fast processing (distributed processing with powerful CPUs)

SHERL CK 86

Page 87: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Big Data System + Spark + Jupyter

• Big Data Cluster provides enormous processing power

• Spark is a large scale distributed data processing engine and it processes large amount of data in memory

• Jupyter is a great tool to code, test, prototype, and share PySparkprograms

87

Page 88: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 88

Demo: Jupyter Notebook for Big Data

Page 89: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 89

Page 90: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 90

Page 91: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 91

Page 92: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Coming soon …

SHERL CK 92

Page 93: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Workshops

• MicroStrategyCreate custom visualizations and analytics

• Big Data systemImplement a machine learning use case usingSpark and SparkML in Jupyter Notebook

SHERL CK 93

Page 94: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Acknowledgements• Management

Heather ArnesonAntony EvansPaul CobbKaren Gundy-Burlet

• DevelopersPallavi HegdeMichael La Scola

• Database & Big Data adminDat DuongEric Wang

• Data collection, archiving, monitoringJoe CisekPat O’Neal

• Windows, Linux adminMatt Ma

• Graph databaseRich Keller

• ATAC dataJohn SchadeKennis ChanCindy Wong(the other) Eric Wang

SHERL CK 94

Page 95: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Home page:https://atmweb.arc.nasa.gov/

These slides:Home page à ABOUT à Overview

User support:[email protected]

SHERL CKDATA WAREHOUSE

Page 96: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

SHERL CK 96

Page 97: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

BACKUP SLIDES

SHERL CK 97

Page 98: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

What is Sherlock?

SHERL CK 98

Page 99: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Architecture

SHERL CK 99

Page 100: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How are data stored?

SHERL CK

File System• No insight into the data• Download full file

• Available from:– File system /home/data– File Download Pages in Web UI

Flat files

100

Page 101: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How are data stored?

SHERL CK

Operational Data Store (ODS)• Select rows and columns of interest• View data before download• Download only the data you want• Good for small data sets• Available from:

– Tables and Charts in Web UI– Data Visualization & Analytics

Tables

101

Page 102: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

How are data stored?

SHERL CK

Big Data System• Select rows and columns of interest• View data before download• Download only the data you want• Good for large data sets• Available from:

– Hue Browser– Jupyter Notebook– Data Visualization & Analytics

ApacheParquet format

102

Page 103: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Archived dataproduced by ATAC

SHERL CK 103

Page 104: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Flight data and report products available in Sherlock1. Analysis-ready track, flight plan, and metadata for 94 individual facilities

(available on a next day basis)• Data types (IFF, EV, RD)

2. End-to-end trajectories, flight plans, and metadata available(available within 10 days)

3. Performance Reports(available daily for individual facilities)

4. Aggregated Trend Databases (STREND)(updated daily)

5. Traffic and weather coverage metrics (available in csv format)• Sector transition metrics• CCFP/CWAM Sector/ARTCC coverage• Sector/ARTCC counts by weather coverage• CCFP jet route coverage

6. Data completeness tool

SHERL CK 104

Page 105: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

MetaData examples: events in the EV fileEvent Type DescriptionEV_GOA Go-Around Event Occurs when go-around is detected

EV_HOFF Handoff Event Occurs when handoff is initiated, accepted and cancelled

EV_INIT Initialization Event Occurs at the beginning of a flight

EV_LND Landing Event Occurs when a flight crosses the threshold of its landing runway

EV_LOOP Looping Event Occurs when a flight track crosses over back on itself

EV_MOF Mode of Flight Event Occurs when mode of flight changes (e.g., level to descend)

EV_PASS Passing Event Occurs when a flight passes by a defined navigational element

EV_STOH Stop Handoff Event Occurs when handoff stops

EV_STOL Stop Loop Event Occurs when loop stops

EV_STOP Stop Event Occurs at the end of a flight

EV_TOC Top of Climb Event Occurs when a flight reaches initial cruise altitude

EV_TOD Top of Descent Event Occurs when a flight begins descent from cruise

EV_TOF Take Off Event Occurs when a flight crosses the threshold of its takeoff runway

EV_USER User Defined Event Occurs based on the metrics defined in the metrics file

EV_XING Crossing Event Occurs when a flight crosses an airspace volume boundary

SHERL CK 105

Page 106: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Event type example

A vertical mode of flight event (EV_MOF) occurs when there is a change in the vertical profile of a flight. The processing detects any transition between any two of the three possible vertical states - Climb, Level, and Descent

SHERL CK 106

Page 107: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Flight Processor

Flight Processor

/home/data/atac/SVAutoReports

xlsm, csv

Collector and

Parser

NASA Data Feed

Legacy /SWIM

Flight Processor

Flight Processor

Flight Processor

Bond /home/data/atac

(Live Bay)

tb0

/home/ transfer/atac

iff, ev, rd,

if0

Generalized data flow (data feed -> reports)

Report Generation

SHERL CK

STREND

107

Page 108: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

ERC

Max Overshoot

Angle at Intercept

Green = ground speed <180

Ground Speed / Attitude at intercept

Distance at Intercept 12.3nm

Arrival Runway

Direction of flight

FAF

Sherlock Performance Report DataTurn-to-final (TTF)

- Overshoots- Final approach path intercept

Facility report example: turn-to-final

SHERL CK 108

Page 109: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Products Coming Soon

• Real-Time Merged Trajectories available in live format to downstream applications

• Automated Anomaly Detection of Airport Arrival Trajectories (SMARTNAS 2.4 NRA)

FAA SWIM Feed

(Solace)

NASA Collection and Forwarding

(ActiveMQ)

Transformation and

Normalization (Apache Nifi)

Real Time Flight Merging

(Apache Flink)

YOUR

APPLICATIONS

HERE

SHERL CK 109

Page 110: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• IFF, EV, RD data file and report documentation and business rules can be found here:

https://atmjira.arc.nasa.gov:9443/conf/display/ctas/ATAC+File+Format+and+Reporting+Documentation

• Other questions or assistance:John Schade [email protected] , 408-736-2822

• Thank You!

Documentation

SHERL CK 110

Page 111: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

Big Data system

SHERL CK 111

Page 112: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• Unified analytics engine for large-scale data processing

• Performs fault-tolerant distributed computing and parallel processing services on a cluster

• Supports 4 languages:– Scala (Native)– Java (Fast)– Python (Easy)– R (Working)

Distributed Workflow

Driver Node

ClusterManager

Task Schedule

Resource Allocation

Resource Request

Partitioned Data

Task Schedule

Resource Allocation

Worker Node(executor)

112

What is Apache SparkTM

Worker Node(executor)

Page 113: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

GeoServer

SHERL CK 113

Page 114: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

• Java-based software server that allows users to view, edit, and share geospatial data

• Enables users to visualize airspace features from latest adaptation data

• Built with the open source GeoServer tool, which includes the Postgisdatabase

• Connects to the vast amounts of airspace data stored in Sherlock as well as polygon representations of the CWIS data

• Users can form complex queries, view the data, and export it in many digital and image-based formats using GeoServer web interface

https://geowiz.arc.nasa.gov/geoserver/web/?wicket:bookmarkablePage=:org.geoserver.web.demo.MapPreviewPage

114

GeoServer

Page 115: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

GeoServer

Example: Fixes around LAX on Google Earth

115

Credit Google, Map data USGS

Page 116: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

THREDDS Data Server (TDS)

SHERL CK 116

Page 117: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

THREDDS Data Server(TDS)

• The THREDDS Data Server (TDS) is a web server that provides metadata and data access for scientific datasets, using OPeNDAP, OpenGIS Consortium(OGC), Web Map Service(WMS), Web Coverage Service(WCS), HTTP, and other remote data access protocols.

• Sherlock Data Warehouse stores parsed weather data -CIWS/RUC/CWAM in a variety of binary Gridded formats(NetCDF, Grib1, Grib2, HDF5) in THREDDS server which have their own mechanisms for parsing, viewing and downloading.

117

Page 118: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

THREDDS Data Server(TDS) • Sherlock has access to all binary datasets.• Users may query by hand or write scripts to query across large

amounts of data and export the data in many digital image and formats

• Some users are interested in the content of binary weather files such as the rapid refresh and CIWS data. For example, they might want to find a RR file with high winds over a certain fix. The UCAR THREDDS server provides the ability to query binary data written in standard scientific formats such as HDF5 or GRIB..https://geowiz.arc.nasa.gov/thredds/catalog.html

118

Page 119: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

THREDDS Data Server (TDS)

119

• Open-source THREDDS software reads weather datasets (CIWS, RR,CWAM)

• Supported binary formats -NetCDF, Grib1, Grib2, HDF5

• WMS query, visualization, export

Page 120: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

TDS Architecture

120

Page 121: Sherlock Demo 2019 v13 ScreenShots - NASABig Data System + Spark + Jupyter • Big Data Cluster provides enormous processing power • Spark is a large scale distributed data processing

TDS

121